CVJul 14, 2024
3DEgo: 3D Editing on the Go!Umar Khalid, Hasan Iqbal, Azib Farooq et al.
We introduce 3DEgo to address a novel problem of directly synthesizing photorealistic 3D scenes from monocular videos guided by textual prompts. Conventional methods construct a text-conditioned 3D scene through a three-stage process, involving pose estimation using Structure-from-Motion (SfM) libraries like COLMAP, initializing the 3D model with unedited images, and iteratively updating the dataset with edited images to achieve a 3D scene with text fidelity. Our framework streamlines the conventional multi-stage 3D editing process into a single-stage workflow by overcoming the reliance on COLMAP and eliminating the cost of model initialization. We apply a diffusion model to edit video frames prior to 3D scene creation by incorporating our designed noise blender module for enhancing multi-view editing consistency, a step that does not require additional training or fine-tuning of T2I diffusion models. 3DEgo utilizes 3D Gaussian Splatting to create 3D scenes from the multi-view consistent edited frames, capitalizing on the inherent temporal continuity and explicit point cloud data. 3DEgo demonstrates remarkable editing precision, speed, and adaptability across a variety of video sources, as validated by extensive evaluations on six datasets, including our own prepared GS25 dataset. Project Page: https://3dego.github.io/
CVMar 25, 2022
Interpretation of Chest x-rays affected by bullets using deep transfer learningShaheer Khan, Azib Farooq, Israr Khan et al.
The potential of deep learning, especially in medical imaging, initiated astonishing results and improved the methodologies after every passing day. Deep learning in radiology provides the opportunity to classify, detect and segment different diseases automatically. In the proposed study, we worked on a non-trivial aspect of medical imaging where we classified and localized the X-Rays affected by bullets. We tested Images on different classification and localization models to get considerable accuracy. The replicated data set used in the study was replicated on different images of chest X-Rays. The proposed model worked not only on chest radiographs but other body organs X-rays like leg, abdomen, head, even the training dataset based on chest radiographs. Custom models have been used for classification and localization purposes after tuning parameters. Finally, the results of our findings manifested using different frameworks. This might assist the research enlightening towards this field. To the best of our knowledge, this is the first study on the detection and classification of radiographs affected by bullets using deep learning.
12.0CLApr 15
Reducing Hallucinations in LLMs via Factuality-Aware Preference LearningSindhuja Chaduvula, Ahmed Y. Radwan, Azib Farooq et al.
Preference alignment methods such as RLHF and Direct Preference Optimization (DPO) improve instruction following, but they can also reinforce hallucinations when preference judgments reward fluency and confidence over factual correctness. We introduce F-DPO (Factuality-aware Direct Preference Optimization), a simple extension of DPO that uses only binary factuality labels. F-DPO (i) applies a label-flipping transformation that corrects misordered preference pairs so the chosen response is never less factual than the rejected one, and (ii) adds a factuality-aware margin that emphasizes pairs with clear correctness differences, while reducing to standard DPO when both responses share the same factuality. We construct factuality-aware preference data by augmenting DPO pairs with binary factuality indicators and synthetic hallucinated variants. Across seven open-weight LLMs (1B-14B), F-DPO consistently improves factuality and reduces hallucination rates relative to both base models and standard DPO. On Qwen3-8B, F-DPO reduces hallucination rates by 5x(from 0.424 to 0.084) while improving factuality scores by 50% (from 5.26 to 7.90). F-DPO also generalizes to out-of-distribution benchmarks: on TruthfulQA, Qwen2.5-14B achieves +17% MC1 accuracy (0.500 to 0.585) and +49% MC2 accuracy (0.357 to 0.531). F-DPO requires no auxiliary reward model, token-level annotations, or multi-stage training.
CYSep 29, 2025
AI in Pakistani Schools: Adoption, Usage, and Perceived Impact among EducatorsSyed Hassan Raza, Azib Farooq
Artificial Intelligence (AI) is increasingly permeating classrooms worldwide, yet its adoption in schools of developing countries remains under-explored. This paper investigates AI adoption, usage patterns, and perceived impact in Pakistani K-12 schools based on a survey of 125 educators. The questionnaire covered educator's familiarity with AI, frequency and modes of use, and attitudes toward AI's benefits and challenges. Results reveal a generally positive disposition towards AI: over two-thirds of teachers expressed willingness to adopt AI tools given proper support and many have begun integrating AI for lesson planning and content creation. However, AI usage is uneven - while about one-third of respondents actively use AI tools frequently, others remain occasional users. Content generation emerged as the most common AI application, whereas AI-driven grading and feedback are rarely used. Teachers reported moderate improvements in student engagement and efficiency due to AI, but also voiced concerns about equitable access. These findings highlight both the enthusiasm for AI's potential in Pakistan's schools and the need for training and infrastructure to ensure inclusive and effective implementation.
CVMar 14, 2025
PSF-4D: A Progressive Sampling Framework for View Consistent 4D EditingHasan Iqbal, Nazmul Karim, Umar Khalid et al.
Instruction-guided generative models, especially those using text-to-image (T2I) and text-to-video (T2V) diffusion frameworks, have advanced the field of content editing in recent years. To extend these capabilities to 4D scene, we introduce a progressive sampling framework for 4D editing (PSF-4D) that ensures temporal and multi-view consistency by intuitively controlling the noise initialization during forward diffusion. For temporal coherence, we design a correlated Gaussian noise structure that links frames over time, allowing each frame to depend meaningfully on prior frames. Additionally, to ensure spatial consistency across views, we implement a cross-view noise model, which uses shared and independent noise components to balance commonalities and distinct details among different views. To further enhance spatial coherence, PSF-4D incorporates view-consistent iterative refinement, embedding view-aware information into the denoising process to ensure aligned edits across frames and views. Our approach enables high-quality 4D editing without relying on external models, addressing key challenges in previous methods. Through extensive evaluation on multiple benchmarks and multiple editing aspects (e.g., style transfer, multi-attribute editing, object removal, local editing, etc.), we show the effectiveness of our proposed method. Experimental results demonstrate that our proposed method outperforms state-of-the-art 4D editing methods in diverse benchmarks.
AIDec 22, 2024
ViLBias: Detecting and Reasoning about Bias in Multimodal ContentShaina Raza, Caesar Saleh, Azib Farooq et al.
Detecting bias in multimodal news requires models that reason over text--image pairs, not just classify text. In response, we present ViLBias, a VQA-style benchmark and framework for detecting and reasoning about bias in multimodal news. The dataset comprises 40,945 text--image pairs from diverse outlets, each annotated with a bias label and concise rationale using a two-stage LLM-as-annotator pipeline with hierarchical majority voting and human-in-the-loop validation. We evaluate Small Language Models (SLMs), Large Language Models (LLMs), and Vision--Language Models (VLMs) across closed-ended classification and open-ended reasoning (oVQA), and compare parameter-efficient tuning strategies. Results show that incorporating images alongside text improves detection accuracy by 3--5\%, and that LLMs/VLMs better capture subtle framing and text--image inconsistencies than SLMs. Parameter-efficient methods (LoRA/QLoRA/Adapters) recover 97--99\% of full fine-tuning performance with $<5\%$ trainable parameters. For oVQA, reasoning accuracy spans 52--79\% and faithfulness 68--89\%, both improved by instruction tuning; closed accuracy correlates strongly with reasoning ($r = 0.91$). ViLBias offers a scalable benchmark and strong baselines for multimodal bias detection and rationale quality.
CVDec 13, 2024
EVLM: Self-Reflective Multimodal Reasoning for Cross-Dimensional Visual EditingUmar Khalid, Kashif Munir, Hasan Iqbal et al.
Editing complex visual content from ambiguous or partially specified instructions remains a core challenge in vision-language modeling. Existing models can contextualize content but often fail to infer the underlying intent within a reference image or scene, leading to inconsistent or misaligned edits. We introduce the Editing Vision-Language Model (EVLM), a system that interprets ambiguous instructions in conjunction with reference visuals to produce precise, context-aware editing prompts. EVLM's key innovation is a reflective reasoning framework that translates subjective user intent into structured, actionable outputs by aligning with human-rated rationales through Reflection-Aware KL-Divergence Target Optimization (RKTO). By combining Chain-of-Thought (CoT) reasoning with RKTO alignment, EVLM captures fine-grained editing preferences without relying on binary supervision. Trained on a dataset of 30,000 CoT examples with human-annotated rationale quality, EVLM achieves substantial gains in alignment with human intent. Experiments across image, video, 3D, and 4D editing tasks show that EVLM generates coherent and high-quality instructions, providing a scalable foundation for multimodal editing and reasoning.