Nidhal Jegham

CV
h-index19
3papers
246citations
Novelty22%
AI Score25

3 Papers

CVOct 31, 2024
YOLO Evolution: A Comprehensive Benchmark and Architectural Review of YOLOv12, YOLO11, and Their Previous Versions

Nidhal Jegham, Chan Young Koh, Marwan Abdelatti et al.

This study presents a comprehensive benchmark analysis of various YOLO (You Only Look Once) algorithms. It represents the first comprehensive experimental evaluation of YOLOv3 to the latest version, YOLOv12, on various object detection challenges. The challenges considered include varying object sizes, diverse aspect ratios, and small-sized objects of a single class, ensuring a comprehensive assessment across datasets with distinct challenges. To ensure a robust evaluation, we employ a comprehensive set of metrics, including Precision, Recall, Mean Average Precision (mAP), Processing Time, GFLOPs count, and Model Size. Our analysis highlights the distinctive strengths and limitations of each YOLO version. For example: YOLOv9 demonstrates substantial accuracy but struggles with detecting small objects and efficiency whereas YOLOv10 exhibits relatively lower accuracy due to architectural choices that affect its performance in overlapping object detection but excels in speed and efficiency. Additionally, the YOLO11 family consistently shows superior performance maintaining a remarkable balance of accuracy and efficiency. However, YOLOv12 delivered underwhelming results, with its complex architecture introducing computational overhead without significant performance gains. These results provide critical insights for both industry and academia, facilitating the selection of the most suitable YOLO algorithm for diverse applications and guiding future enhancements.

CYMay 14, 2025
How Hungry is AI? Benchmarking Energy, Water, and Carbon Footprint of LLM Inference

Nidhal Jegham, Marwan Abdelatti, Chan Young Koh et al.

This paper introduces an infrastructure-aware benchmarking framework for quantifying the environmental footprint of LLM inference across 30 state-of-the-art models in commercial datacenters. The framework combines public API performance data with company-specific environmental multipliers and statistical inference of hardware configurations. We additionally utilize cross-efficiency Data Envelopment Analysis (DEA) to rank models by performance relative to environmental cost and provide a dynamically updated dashboard that visualizes model-level energy, water, and carbon metrics. Results show the most energy-intensive models exceed 29 Wh per long prompt, over 65 times the most efficient systems. Even a 0.42 Wh short query, when scaled to 700M queries/day, aggregates to annual electricity comparable to 35{,}000 U.S. homes, evaporative freshwater equal to the annual drinking needs of 1.2M people, and carbon emissions requiring a Chicago-sized forest to offset. These findings highlight a growing paradox: as AI becomes cheaper and faster, global adoption drives disproportionate resource consumption. Our methodology offers a standardized, empirically grounded basis for sustainability benchmarking and accountability in AI deployment.

CVFeb 23, 2025
Visual Reasoning Evaluation of Grok, Deepseek Janus, Gemini, Qwen, Mistral, and ChatGPT

Nidhal Jegham, Marwan Abdelatti, Abdeltawab Hendawi

Traditional evaluations of multimodal large language models (LLMs) have been limited by their focus on single-image reasoning, failing to assess crucial aspects like contextual understanding, reasoning stability, and uncertainty calibration. This study addresses these limitations by introducing a novel benchmark that integrates multi-image reasoning tasks with rejection-based evaluation and positional bias detection. To evaluate these dimensions, we further introduce entropy as a novel metric for quantifying reasoning consistency across reordered answer variants. We applied this benchmark to assess Grok 3, ChatGPT-4o, ChatGPT-o1, Gemini 2.0 Flash Experimental, DeepSeek Janus models, Qwen2.5-VL-72B-Instruct, QVQ-72B-Preview, and Pixtral 12B across eight visual reasoning tasks, including difference spotting and diagram interpretation. Our findings reveal ChatGPT-o1 leading in overall accuracy (82.5\%) and rejection accuracy (70.0\%), closely followed by Gemini 2.0 Flash Experimental (70.8\%). QVQ-72B-Preview demonstrated superior rejection accuracy (85.5\%). Notably, Pixtral 12B (51.7\%) showed promise in specific domains, while Janus models exhibited challenges in bias and uncertainty calibration, reflected in low rejection accuracies and high entropy scores. High entropy scores in Janus models (Janus 7B: 0.8392, Janus 1B: 0.787) underscore their susceptibility to positional bias and unstable reasoning, contrasting with the low entropy and robust reasoning of ChatGPT models. The study further demonstrates that model size is not the sole determinant of performance, as evidenced by Grok 3 underperformance despite its substantial parameter count. By employing multi-image contexts, rejection mechanisms, and entropy-based consistency metrics, this benchmark sets a new standard for evaluating multimodal LLMs, enabling a more robust and reliable assessment of next-generation AI systems.