Rizwan Qureshi

CV
h-index19
19papers
283citations
Novelty44%
AI Score55

19 Papers

IVDec 4, 2022Code
Brain Tumor Synthetic Data Generation with Adaptive StyleGANs

Usama Tariq, Rizwan Qureshi, Anas Zafar et al.

Generative models have been very successful over the years and have received significant attention for synthetic data generation. As deep learning models are getting more and more complex, they require large amounts of data to perform accurately. In medical image analysis, such generative models play a crucial role as the available data is limited due to challenges related to data privacy, lack of data diversity, or uneven data distributions. In this paper, we present a method to generate brain tumor MRI images using generative adversarial networks. We have utilized StyleGAN2 with ADA methodology to generate high-quality brain MRI with tumors while using a significantly smaller amount of training data when compared to the existing approaches. We use three pre-trained models for transfer learning. Results demonstrate that the proposed method can learn the distributions of brain tumors. Furthermore, the model can generate high-quality synthetic brain MRI with a tumor that can limit the small sample size issues. The approach can addresses the limited data availability by generating realistic-looking brain MRI with tumors. The code is available at: ~\url{https://github.com/rizwanqureshi123/Brain-Tumor-Synthetic-Data}.

36.7CLMay 31
Efficient RAG with Intent-Aware Retrieval and Semantics-Preserving Chunking

Fachrina Dewi Puspitasari, Chaoning Zhang, Jiaquan Zhang et al.

The demand for powerful instruction following and reasoning capability of large language models (LLMs) has promoted rapid development of retrieval-augmented generation (RAG). The RAG system assists LLM generation by retrieving chunks of query-fit supplementary knowledge from an external database. Conventional RAG systems, however, suffer from information insufficiency due to two factors, which are intent-agnostic retrieval and information fragmentation. Our work proposes a RAG framework, termed InSemRAG, that addresses these challenges via an iterative retrieve-and-check mechanism with two supporting modules, an intention-aware retriever (IAR) and semantics-preserving chunking (SPC). IAR implements a dynamic hybrid retrieval method that adaptively weights the retrieval channels based on the query intent, while SPC performs detection and reparation to the damaged evidence chunks to preserve the semantic integrity. To alleviate the computational latency brought by our iterative mechanism, we leverage small language models (SLMs). Extensive experiments across several benchmark datasets consistently demonstrate the competitiveness of our method against recent state-of-the-art RAG mechanisms. Particularly, our method achieves significant gains on multi-hop and evidence-sensitive tasks, with a 2.65-point improvement in F1 on HotPotQA and a 1.5-point increase in accuracy on FEVER. Our method also achieves competitive performance to Multi-Hop RAG with 4.32$\times$ lower latency with the utilization of SLM.

CVApr 15, 2022
Guiding Attention using Partial-Order Relationships for Image Captioning

Murad Popattia, Muhammad Rafi, Rizwan Qureshi et al.

The use of attention models for automated image captioning has enabled many systems to produce accurate and meaningful descriptions for images. Over the years, many novel approaches have been proposed to enhance the attention process using different feature representations. In this paper, we extend this approach by creating a guided attention network mechanism, that exploits the relationship between the visual scene and text-descriptions using spatial features from the image, high-level information from the topics, and temporal context from caption generation, which are embedded together in an ordered embedding space. A pairwise ranking objective is used for training this embedding space which allows similar images, topics and captions in the shared semantic space to maintain a partial order in the visual-semantic hierarchy and hence, helps the model to produce more visually accurate captions. The experimental results based on MSCOCO dataset shows the competitiveness of our approach, with many state-of-the-art models on various evaluation metrics.

CLFeb 17, 2025Code
VLDBench Evaluating Multimodal Disinformation with Regulatory Alignment

Shaina Raza, Ashmal Vayani, Aditya Jain et al.

Detecting disinformation that blends manipulated text and images has become increasingly challenging, as AI tools make synthetic content easy to generate and disseminate. While most existing AI safety benchmarks focus on single modality misinformation (i.e., false content shared without intent to deceive), intentional multimodal disinformation, such as propaganda or conspiracy theories that imitate credible news, remains largely unaddressed. We introduce the Vision-Language Disinformation Detection Benchmark (VLDBench), the first large-scale resource supporting both unimodal (text-only) and multimodal (text + image) disinformation detection. VLDBench comprises approximately 62,000 labeled text-image pairs across 13 categories, curated from 58 news outlets. Using a semi-automated pipeline followed by expert review, 22 domain experts invested over 500 hours to produce high-quality annotations with substantial inter-annotator agreement. Evaluations of state-of-the-art Large Language Models (LLMs) and Vision-Language Models (VLMs) on VLDBench show that incorporating visual cues improves detection accuracy by 5 to 35 percentage points over text-only models. VLDBench provides data and code for evaluation, fine-tuning, and robustness testing to support disinformation analysis. Developed in alignment with AI governance frameworks (e.g., the MIT AI Risk Repository), VLDBench offers a principled foundation for advancing trustworthy disinformation detection in multimodal media. Project: https://vectorinstitute.github.io/VLDBench/ Dataset: https://huggingface.co/datasets/vector-institute/VLDBench Code: https://github.com/VectorInstitute/VLDBench

53.5CVApr 9
Towards Responsible Multimodal Medical Reasoning via Context-Aligned Vision-Language Models

Sumra Khan, Sagar Chhabriya, Aizan Zafar et al.

Medical vision-language models (VLMs) show strong performance on radiology tasks but often produce fluent yet weakly grounded conclusions due to over-reliance on a dominant modality. We introduce a context-aligned reasoning framework that enforces agreement across heterogeneous clinical evidence before generating diagnostic conclusions. The proposed approach augments a frozen VLM with structured contextual signals derived from radiomic statistics, explainability activations, and vocabulary-grounded semantic cues. Instead of producing free-form responses, the model generates structured outputs containing supporting evidence, uncertainty estimates, limitations, and safety notes. We observe that auxiliary signals alone provide limited benefit; performance gains emerge only when these signals are integrated through contextual verification. Experiments on chest X-ray datasets demonstrate that context alignment improves discriminative performance (AUC 0.918 to 0.925) while maintaining calibrated uncertainty. The framework also substantially reduces hallucinated keywords (1.14 to 0.25) and produces more concise reasoning explanations (19.4 to 15.3 words) without increasing model confidence (0.70 to 0.68). Cross-dataset evaluation on CheXpert further reveals that modality informativeness significantly influences reasoning behavior. These results suggest that enforcing multi-evidence agreement improves both reliability and trustworthiness in medical multimodal reasoning, while preserving the underlying model architecture.

QMDec 17, 2025
Foundation Models in Biomedical Imaging: Turning Hype into Reality

Amgad Muneer, Kai Zhang, Ibraheem Hamdi et al.

Foundation models (FMs) are driving a prominent shift in artificial intelligence across different domains, including biomedical imaging. These models are designed to move beyond narrow pattern recognition towards emulating sophisticated clinical reasoning, understanding complex spatial relationships, and integrating multimodal data with unprecedented flexibility. However, a critical gap exists between this potential and the current reality, where the clinical evaluation and deployment of FMs are hampered by significant challenges. Herein, we critically assess the current state-of-the-art, analyzing hype by examining the core capabilities and limitations of FMs in the biomedical domain. We also provide a taxonomy of reasoning, ranging from emulated sequential logic and spatial understanding to the integration of explicit symbolic knowledge, to evaluate whether these models exhibit genuine cognition or merely mimic surface-level patterns. We argue that a critical frontier lies beyond statistical correlation, in the pursuit of causal inference, which is essential for building robust models that understand cause and effect. Furthermore, we discuss the paramount issues in deployment stemming from trustworthiness, bias, and safety, dissecting the challenges of algorithmic bias, data bias and privacy, and model hallucinations. We also draw attention to the need for more inclusive, rigorous, and clinically relevant validation frameworks to ensure their safe and ethical application. We conclude that while the vision of autonomous AI-doctors remains distant, the immediate reality is the emergence of powerful technology and assistive tools that would benefit clinical practice. The future of FMs in biomedical imaging hinges not on scale alone, but on developing hybrid, causally aware, and verifiably safe systems that augment, rather than replace, human expertise.

CVApr 18, 2025Code
DanceText: A Training-Free Layered Framework for Controllable Multilingual Text Transformation in Images

Zhenyu Yu, Mohd Yamani Idna Idris, Hua Wang et al.

We present DanceText, a training-free framework for multilingual text editing in images, designed to support complex geometric transformations and achieve seamless foreground-background integration. While diffusion-based generative models have shown promise in text-guided image synthesis, they often lack controllability and fail to preserve layout consistency under non-trivial manipulations such as rotation, translation, scaling, and warping. To address these limitations, DanceText introduces a layered editing strategy that separates text from the background, allowing geometric transformations to be performed in a modular and controllable manner. A depth-aware module is further proposed to align appearance and perspective between the transformed text and the reconstructed background, enhancing photorealism and spatial consistency. Importantly, DanceText adopts a fully training-free design by integrating pretrained modules, allowing flexible deployment without task-specific fine-tuning. Extensive experiments on the AnyWord-3M benchmark demonstrate that our method achieves superior performance in visual quality, especially under large-scale and complex transformation scenarios. Code is avaible at https://github.com/YuZhenyuLindy/DanceText.git.

CVMar 3
Beyond Anatomy: Explainable ASD Classification from rs-fMRI via Functional Parcellation and Graph Attention Networks

Syeda Hareem Madani, Noureen Bibi, Adam Rafiq Jeraj et al.

Anatomical brain parcellations dominate rs-fMRI-based Autism Spectrum Disorder (ASD) classification, yet their rigid boundaries may fail to capture the idiosyncratic connectivity patterns that characterise ASD. We present a graph-based deep learning framework comparing anatomical (AAL, 116 ROIs) and functionally-derived (MSDL, 39 ROIs) parcellation strategies on the ABIDE I dataset. Our FSL preprocessing pipeline handles multi-site heterogeneity across 400 balanced subjects, with site-stratified 70/15/15 splits to prevent data leakage. Gaussian noise augmentation within training folds expands samples from 280 to 1,680. A three phase pipeline progresses from a baseline GCN with AAL (73.3% accuracy, AUC=0.74), to an optimised GCN with MSDL (84.0%, AUC=0.84), to a Graph Attention Network ensemble achieving 95.0% accuracy (AUC=0.98), outperforming all recent GNN-based benchmarks on ABIDE I. The 10.7-point gain from atlas substitution alone demonstrates that functional parcellation is the most impactful modelling decision. Gradient-based saliency and GNNExplainer analyses converge on the Posterior Cingulate Cortex and Precuneus as core Default Mode Network hubs, validating that model decisions reflect ASD neuropathology rather than acquisition artefacts. All code and datasets will be publicly released upon acceptance.

CLAug 31, 2025Code
Prompting Away Stereotypes? Evaluating Bias in Text-to-Image Models for Occupations

Shaina Raza, Maximus Powers, Partha Pratim Saha et al.

Text-to-Image (TTI) models are powerful creative tools but risk amplifying harmful social biases. We frame representational societal bias assessment as an image curation and evaluation task and introduce a pilot benchmark of occupational portrayals spanning five socially salient roles (CEO, Nurse, Software Engineer, Teacher, Athlete). Using five state-of-the-art models: closed-source (DALLE 3, Gemini Imagen 4.0) and open-source (FLUX.1-dev, Stable Diffusion XL Turbo, Grok-2 Image), we compare neutral baseline prompts against fairness-aware controlled prompts designed to encourage demographic diversity. All outputs are annotated for gender (male, female) and race (Asian, Black, White), enabling structured distributional analysis. Results show that prompting can substantially shift demographic representations, but with highly model-specific effects: some systems diversify effectively, others overcorrect into unrealistic uniformity, and some show little responsiveness. These findings highlight both the promise and the limitations of prompting as a fairness intervention, underscoring the need for complementary model-level strategies. We release all code and data for transparency and reproducibility https://github.com/maximus-powers/img-gen-bias-analysis.

AIJul 1, 2025
Thinking Beyond Tokens: From Brain-Inspired Intelligence to Cognitive Foundations for Artificial General Intelligence and its Societal Impact

Rizwan Qureshi, Ranjan Sapkota, Abbas Shah et al.

Can machines truly think, reason and act in domains like humans? This enduring question continues to shape the pursuit of Artificial General Intelligence (AGI). Despite the growing capabilities of models such as GPT-4.5, DeepSeek, Claude 3.5 Sonnet, Phi-4, and Grok 3, which exhibit multimodal fluency and partial reasoning, these systems remain fundamentally limited by their reliance on token-level prediction and lack of grounded agency. This paper offers a cross-disciplinary synthesis of AGI development, spanning artificial intelligence, cognitive neuroscience, psychology, generative models, and agent-based systems. We analyze the architectural and cognitive foundations of general intelligence, highlighting the role of modular reasoning, persistent memory, and multi-agent coordination. In particular, we emphasize the rise of Agentic RAG frameworks that combine retrieval, planning, and dynamic tool use to enable more adaptive behavior. We discuss generalization strategies, including information compression, test-time adaptation, and training-free methods, as critical pathways toward flexible, domain-agnostic intelligence. Vision-Language Models (VLMs) are reexamined not just as perception modules but as evolving interfaces for embodied understanding and collaborative task completion. We also argue that true intelligence arises not from scale alone but from the integration of memory and reasoning: an orchestration of modular, interactive, and self-improving components where compression enables adaptive behavior. Drawing on advances in neurosymbolic systems, reinforcement learning, and cognitive scaffolding, we explore how recent architectures begin to bridge the gap between statistical learning and goal-directed cognition. Finally, we identify key scientific, technical, and ethical challenges on the path to AGI.

CVFeb 11, 2025
Confidence-calibrated covariate shift correction for few-shot classification in Vision-Language Models

Behraj Khan, Rizwan Qureshi, Nouman Muhammad Durrani et al.

Since the establishment of vision-language foundation models as the new mainstay in low-shot vision classification tasks, the question of domain generalization arising from insufficient target data is assuming more importance. This scarcity challenge induces sampling bias and amplifies model sensitivity to variations and shifts in data distributions. While fine-tuning on multiple domains could mitigate such domain generalization issues, it is resource-intensive and demands diverse data sources. In this work, we systematically analyze two critical challenges: (1) covariate shift between the pre-training distribution and the underspecified target distribution, and (2) confidence misalignment, where predictions on novel data are overconfident. To address both challenges simultaneously, we introduce \textbf{Confidence-Calibrated Covariate Shift Correction (CalShift)} -- a unified approach that combines a Fisher information penalty to mitigate covariate shift and a Confidence Misalignment Penalty (CMP) to reduce overconfidence in misclassified examples. Experimental evaluations across various vision and covariate shift benchmarks demonstrate that CalShift significantly improves model calibration, achieving up to a 5.82\% reduction in Expected Calibration Error (ECE). Furthermore, CalShift enhances robustness, improving accuracy by 3.5\% on challenging datasets impacted by covariate shifts. Our results highlight CalShift as a promising strategy for building robust and reliable low-shot vision-language systems for real-world applications.

CLMay 23, 2025
Just as Humans Need Vaccines, So Do Models: Model Immunization to Combat Falsehoods

Shaina Raza, Rizwan Qureshi, Marcelo Lotif et al.

Generative AI models often learn and reproduce false information present in their training corpora. This position paper argues that, analogous to biological immunization, where controlled exposure to a weakened pathogen builds immunity, AI models should be fine tuned on small, quarantined sets of explicitly labeled falsehoods as a "vaccine" against misinformation. These curated false examples are periodically injected during finetuning, strengthening the model ability to recognize and reject misleading claims while preserving accuracy on truthful inputs. An illustrative case study shows that immunized models generate substantially less misinformation than baselines. To our knowledge, this is the first training framework that treats fact checked falsehoods themselves as a supervised vaccine, rather than relying on input perturbations or generic human feedback signals, to harden models against future misinformation. We also outline ethical safeguards and governance controls to ensure the safe use of false data. Model immunization offers a proactive paradigm for aligning AI systems with factuality.

CVJun 15, 2025
Leveraging MIMIC Datasets for Better Digital Health: A Review on Open Problems, Progress Highlights, and Future Promises

Afifa Khaled, Mohammed Sabir, Rizwan Qureshi et al.

The Medical Information Mart for Intensive Care (MIMIC) datasets have become the Kernel of Digital Health Research by providing freely accessible, deidentified records from tens of thousands of critical care admissions, enabling a broad spectrum of applications in clinical decision support, outcome prediction, and healthcare analytics. Although numerous studies and surveys have explored the predictive power and clinical utility of MIMIC based models, critical challenges in data integration, representation, and interoperability remain underexplored. This paper presents a comprehensive survey that focuses uniquely on open problems. We identify persistent issues such as data granularity, cardinality limitations, heterogeneous coding schemes, and ethical constraints that hinder the generalizability and real-time implementation of machine learning models. We highlight key progress in dimensionality reduction, temporal modelling, causal inference, and privacy preserving analytics, while also outlining promising directions including hybrid modelling, federated learning, and standardized preprocessing pipelines. By critically examining these structural limitations and their implications, this survey offers actionable insights to guide the next generation of MIMIC powered digital health innovations.

LGMar 7
Scaling Laws in the Tiny Regime: How Small Models Change Their Mistakes

Mohammed Alnemari, Rizwan Qureshi, Nader Begrazadah

Neural scaling laws describe how model performance improves as a power law with size, but existing work focuses on models above 100M parameters. The sub-20M regime -- where TinyML and edge AI operate -- remains unexamined. We train 90 models (22K--19.8M parameters) across two architectures (plain ConvNet, MobileNetV2) on CIFAR-100, varying width while holding depth and training fixed. Both follow approximate power laws in error rate: $α= 0.156 \pm 0.002$ (ScaleCNN) and $α= 0.106 \pm 0.001$ (MobileNetV2) across five seeds. Since prior work fit cross-entropy loss rather than error rate, direct exponent comparison is approximate; with that caveat, these are 1.4--2x steeper than $α\approx 0.076$ for large language models. The power law does not hold uniformly: local exponents decay with scale, and MobileNetV2 saturates at 19.8M parameters ($α_{\mathrm{local}} = 0.006$). Error structure also changes. Jaccard overlap between error sets of the smallest and largest ScaleCNN is only 0.35 (25 seed pairs, $\pm 0.004$) -- compression changes which inputs are misclassified, not merely how many. Small models concentrate capacity on easy classes (Gini: 0.26 at 22K vs. 0.09 at 4.7M) while abandoning the hardest (bottom-5 accuracy: 10% vs. 53%). Counter to expectation, the smallest models are best calibrated (ECE = 0.013 vs. peak 0.110 at mid-size). Aggregate accuracy is therefore misleading for edge deployment; validation must happen at the target model size.

IVJul 25, 2025
Enhancing Diabetic Retinopathy Classification Accuracy through Dual Attention Mechanism in Deep Learning

Abdul Hannan, Zahid Mahmood, Rizwan Qureshi et al.

Automatic classification of Diabetic Retinopathy (DR) can assist ophthalmologists in devising personalized treatment plans, making it a critical component of clinical practice. However, imbalanced data distribution in the dataset becomes a bottleneck in the generalization of deep learning models trained for DR classification. In this work, we combine global attention block (GAB) and category attention block (CAB) into the deep learning model, thus effectively overcoming the imbalanced data distribution problem in DR classification. Our proposed approach is based on an attention mechanism-based deep learning model that employs three pre-trained networks, namely, MobileNetV3-small, Efficientnet-b0, and DenseNet-169 as the backbone architecture. We evaluate the proposed method on two publicly available datasets of retinal fundoscopy images for DR. Experimental results show that on the APTOS dataset, the DenseNet-169 yielded 83.20% mean accuracy, followed by the MobileNetV3-small and EfficientNet-b0, which yielded 82% and 80% accuracies, respectively. On the EYEPACS dataset, the EfficientNet-b0 yielded a mean accuracy of 80%, while the DenseNet-169 and MobileNetV3-small yielded 75.43% and 76.68% accuracies, respectively. In addition, we also compute the F1-score of 82.0%, precision of 82.1%, sensitivity of 83.0%, specificity of 95.5%, and a kappa score of 88.2% for the experiments. Moreover, in our work, the MobileNetV3-small has 1.6 million parameters on the APTOS dataset and 0.90 million parameters on the EYEPACS dataset, which is comparatively less than other methods. The proposed approach achieves competitive performance that is at par with recently reported works on DR classification.

CVJul 12, 2025
Calibrated and Robust Foundation Models for Vision-Language and Medical Image Tasks Under Distribution Shift

Behraj Khan, Tahir Qasim Syed, Nouman M. Durrani et al.

Foundation models like CLIP and SAM have advanced computer vision and medical imaging via low-shot transfer learning, aiding CADD with limited data. However, their deployment faces two key challenges. \textit{distribution shift} where pre-training and post-training data distributions differ (e.g., due to inter-center image acquisition) and \textit{confidence misalignment}, which leads to overconfident errors. These issues surface differently, vision-language models (e.g., CLIP) suffer from 2D embedding shift (image-text misalignment), while medical models (e.g., SAM) encounter 3D domain shifts (e.g., scanner variation) and voxel-wise calibration need. Existing solutions are domain-specific. We propose \textbf{StaRFM}, a fusion of Fisher information penalty (FIP) and confidence misalignment penalty (CMP) tackling both challenges. It applies FIP, extended to 3D via patch-wise regularization, to reduce embedding shift, and CMP, reformulated for voxel-level predictions, to calibrate segmentation uncertainty. We derive PAC-Bayes bounds. FIP controls generalization via the Fisher-Rao norm, and CMP reduces calibration error via Brier score minimization. StaRFM surpasses baselines by \texttt{+}3.5\% accuracy and 28\% lower ECE on 19 vision datasets (e.g., ImageNet, Office-Home), achieves +4.2\% DSC over SAM-FT and 4.8mm HD95 on medical benchmarks (e.g., BraTS, ATLAS), and reduces cross-domain gaps by up to 20\%. The framework is plug-and-play, requiring minimal architectural changes. Code and models are available at: \href{https://anonymous.4open.science/r/StaRFM-C0CD/}{\textcolor{blue}{\underline{StaRFM}}}

CVJun 8, 2025
A Layered Self-Supervised Knowledge Distillation Framework for Efficient Multimodal Learning on the Edge

Tarique Dahri, Zulfiqar Ali Memon, Zhenyu Yu et al.

We introduce Layered Self-Supervised Knowledge Distillation (LSSKD) framework for training compact deep learning models. Unlike traditional methods that rely on pre-trained teacher networks, our approach appends auxiliary classifiers to intermediate feature maps, generating diverse self-supervised knowledge and enabling one-to-one transfer across different network stages. Our method achieves an average improvement of 4.54\% over the state-of-the-art PS-KD method and a 1.14% gain over SSKD on CIFAR-100, with a 0.32% improvement on ImageNet compared to HASSKD. Experiments on Tiny ImageNet and CIFAR-100 under few-shot learning scenarios also achieve state-of-the-art results. These findings demonstrate the effectiveness of our approach in enhancing model generalization and performance without the need for large over-parameterized teacher networks. Importantly, at the inference stage, all auxiliary classifiers can be removed, yielding no extra computational cost. This makes our model suitable for deploying small language models on affordable low-computing devices. Owing to its lightweight design and adaptability, our framework is particularly suitable for multimodal sensing and cyber-physical environments that require efficient and responsive inference. LSSKD facilitates the development of intelligent agents capable of learning from limited sensory data under weak supervision.

CVNov 15, 2024
Voxel-Aggregated Feature Synthesis: Efficient Dense Mapping for Simulated 3D Reasoning

Owen Burns, Rizwan Qureshi

We address the issue of the exploding computational requirements of recent State-of-the-art (SOTA) open set multimodel 3D mapping (dense 3D mapping) algorithms and present Voxel-Aggregated Feature Synthesis (VAFS), a novel approach to dense 3D mapping in simulation. Dense 3D mapping involves segmenting and embedding sequential RGBD frames which are then fused into 3D. This leads to redundant computation as the differences between frames are small but all are individually segmented and embedded. This makes dense 3D mapping impractical for research involving embodied agents in which the environment, and thus the mapping, must be modified with regularity. VAFS drastically reduces this computation by using the segmented point cloud computed by a simulator's physics engine and synthesizing views of each region. This reduces the number of features to embed from the number of captured RGBD frames to the number of objects in the scene, effectively allowing a "ground truth" semantic map to be computed an order of magnitude faster than traditional methods. We test the resulting representation by assessing the IoU scores of semantic queries for different objects in the simulated scene, and find that VAFS exceeds the accuracy and speed of prior dense 3D mapping techniques.

CVJun 12, 2024
YOLO advances to its genesis: a decadal and comprehensive review of the You Only Look Once (YOLO) series

Ranjan Sapkota, Marco Flores Calero, Rizwan Qureshi et al.

This review systematically examines the progression of the You Only Look Once (YOLO) object detection algorithms from YOLOv1 to the recently unveiled YOLOv12. Employing a reverse chronological analysis, this study examines the advancements introduced by YOLO algorithms, beginning with YOLOv12 and progressing through YOLO11 (or YOLOv11), YOLOv10, YOLOv9, YOLOv8, and subsequent versions to explore each version's contributions to enhancing speed, detection accuracy, and computational efficiency in real-time object detection. Additionally, this study reviews the alternative versions derived from YOLO architectural advancements of YOLO-NAS, YOLO-X, YOLO-R, DAMO-YOLO, and Gold-YOLO. Moreover, the study highlights the transformative impact of YOLO models across five critical application areas: autonomous vehicles and traffic safety, healthcare and medical imaging, industrial manufacturing, surveillance and security, and agriculture. By detailing the incremental technological advancements in subsequent YOLO versions, this review chronicles the evolution of YOLO, and discusses the challenges and limitations in each of the earlier versions. The evolution signifies a path towards integrating YOLO with multimodal, context-aware, and Artificial General Intelligence (AGI) systems for the next YOLO decade, promising significant implications for future developments in AI-driven applications. YOLO Review, YOLO Advances, YOLOv13, YOLOv14, YOLOv15, YOLOv16, YOLOv17, YOLOv18, YOLOv19, YOLOv20, YOLO review, YOLO Object Detection