Qi Dong

CV
h-index61
18papers
1,145citations
Novelty44%
AI Score45

18 Papers

AIMar 17, 2025
The Amazon Nova Family of Models: Technical Report and Model Card

Amazon AGI, Aaron Langford, Aayush Shah et al. · amazon-science

We present Amazon Nova, a new generation of state-of-the-art foundation models that deliver frontier intelligence and industry-leading price performance. Amazon Nova Pro is a highly-capable multimodal model with the best combination of accuracy, speed, and cost for a wide range of tasks. Amazon Nova Lite is a low-cost multimodal model that is lightning fast for processing images, video, documents and text. Amazon Nova Micro is a text-only model that delivers our lowest-latency responses at very low cost. Amazon Nova Canvas is an image generation model that creates professional grade images with rich customization controls. Amazon Nova Reel is a video generation model offering high-quality outputs, customization, and motion control. Our models were built responsibly and with a commitment to customer trust, security, and reliability. We report benchmarking results for core capabilities, agentic performance, long context, functional adaptation, runtime performance, and human evaluation.

CVJun 2, 2023
DocFormerv2: Local Features for Document Understanding

Srikar Appalaraju, Peng Tang, Qi Dong et al. · amazon-science

We propose DocFormerv2, a multi-modal transformer for Visual Document Understanding (VDU). The VDU domain entails understanding documents (beyond mere OCR predictions) e.g., extracting information from a form, VQA for documents and other tasks. VDU is challenging as it needs a model to make sense of multiple modalities (visual, language and spatial) to make a prediction. Our approach, termed DocFormerv2 is an encoder-decoder transformer which takes as input - vision, language and spatial features. DocFormerv2 is pre-trained with unsupervised tasks employed asymmetrically i.e., two novel document tasks on encoder and one on the auto-regressive decoder. The unsupervised tasks have been carefully designed to ensure that the pre-training encourages local-feature alignment between multiple modalities. DocFormerv2 when evaluated on nine datasets shows state-of-the-art performance over strong baselines e.g. TabFact (4.3%), InfoVQA (1.4%), FUNSD (1%). Furthermore, to show generalization capabilities, on three VQA tasks involving scene-text, Doc- Formerv2 outperforms previous comparably-sized models and even does better than much larger models (such as GIT2, PaLi and Flamingo) on some tasks. Extensive ablations show that due to its pre-training, DocFormerv2 understands multiple modalities better than prior-art in VDU.

LGAug 29, 2024
Short-Term Electricity-Load Forecasting by Deep Learning: A Comprehensive Survey

Qi Dong, Rubing Huang, Chenhui Cui et al.

Short-Term Electricity-Load Forecasting (STELF) refers to the prediction of the immediate demand (in the next few hours to several days) for the power system. Various external factors, such as weather changes and the emergence of new electricity consumption scenarios, can impact electricity demand, causing load data to fluctuate and become non-linear, which increases the complexity and difficulty of STELF. In the past decade, deep learning has been applied to STELF, modeling and predicting electricity demand with high accuracy, and contributing significantly to the development of STELF. This paper provides a comprehensive survey on deep-learning-based STELF over the past ten years. It examines the entire forecasting process, including data pre-processing, feature extraction, deep-learning modeling and optimization, and results evaluation. This paper also identifies some research challenges and potential research directions to be further investigated in future work.

62.3CLMar 25
Stateful Evidence-Driven Retrieval-Augmented Generation with Iterative Reasoning

Qi Dong, Ziheng Lin, Ning Ding

Retrieval-Augmented Generation (RAG) grounds Large Language Models (LLMs) in external knowledge but often suffers from flat context representations and stateless retrieval, leading to unstable performance. We propose Stateful Evidence-Driven RAG with Iterative Reasoning, a framework that models question answering as a progressive evidence accumulation process. Retrieved documents are converted into structured reasoning units with explicit relevance and confidence signals and maintained in a persistent evidence pool capturing both supportive and non-supportive information. The framework performs evidence-driven deficiency analysis to identify gaps and conflicts and iteratively refines queries to guide subsequent retrieval. This iterative reasoning process enables stable evidence aggregation and improves robustness to noisy retrieval. Experiments on multiple question answering benchmarks demonstrate consistent improvements over standard RAG and multi-step baselines, while effectively accumulating high-quality evidence and maintaining stable performance under substantial retrieval noise.

CVMar 4, 2024
Non-autoregressive Sequence-to-Sequence Vision-Language Models

Kunyu Shi, Qi Dong, Luis Goncalves et al.

Sequence-to-sequence vision-language models are showing promise, but their applicability is limited by their inference latency due to their autoregressive way of generating predictions. We propose a parallel decoding sequence-to-sequence vision-language model, trained with a Query-CTC loss, that marginalizes over multiple inference paths in the decoder. This allows us to model the joint distribution of tokens, rather than restricting to conditional distribution as in an autoregressive model. The resulting model, NARVL, achieves performance on-par with its state-of-the-art autoregressive counterpart, but is faster at inference time, reducing from the linear complexity associated with the sequential generation of tokens to a paradigm of constant time joint inference.

CLOct 14, 2025
Hierarchical Alignment: Surgical Fine-Tuning via Functional Layer Specialization in Large Language Models

Yukun Zhang, Qi Dong

Existing alignment techniques for Large Language Models (LLMs), such as Direct Preference Optimization (DPO), typically treat the model as a monolithic entity, applying uniform optimization pressure across all layers. This approach overlooks the functional specialization within the Transformer architecture, where different layers are known to handle distinct tasks from syntax to abstract reasoning. In this paper, we challenge this one-size-fits-all paradigm by introducing Hierarchical Alignment, a novel method that applies targeted DPO to distinct functional blocks of a model's layers: local (syntax), intermediate (logic), and global (factuality). Through a series of controlled experiments on state-of-the-art models like Llama-3.1-8B and Qwen1.5-7B using LoRA for surgical fine-tuning, our results, evaluated by a powerful LLM-as-Judge, demonstrate significant and predictable improvements. Specifically, aligning the local layers (Local-Align) enhances grammatical fluency. More importantly, aligning the global layers (Global-Align) not only improves factual consistency as hypothesized but also proves to be the most effective strategy for enhancing logical coherence, outperforming all baselines. Critically, all hierarchical strategies successfully avoid the "alignment tax" observed in standard DPO, where gains in fluency come at the cost of degraded logical reasoning. These findings establish a more resource-efficient, controllable, and interpretable path for model alignment, highlighting the immense potential of shifting from monolithic optimization to structure-aware surgical fine-tuning to build more advanced and reliable LLMs.

CLMay 24, 2025
Empirical Investigation of Latent Representational Dynamics in Large Language Models: A Manifold Evolution Perspective

Yukun Zhang, Qi Dong

This paper introduces the Dynamical Manifold Evolution Theory (DMET), a conceptual framework that models large language model (LLM) generation as a continuous trajectory evolving on a low-dimensional semantic manifold. The theory characterizes latent dynamics through three interpretable metrics-state continuity ($C$), attractor compactness ($Q$), and topological persistence ($P$)-which jointly capture the smoothness, stability, and structure of representation evolution. Empirical analyses across multiple Transformer architectures reveal consistent links between these latent dynamics and text quality: smoother trajectories correspond to greater fluency, and richer topological organization correlates with enhanced coherence. Different models exhibit distinct dynamical regimes, reflecting diverse strategies of semantic organization in latent space. Moreover, decoding parameters such as temperature and top-$p$ shape these trajectories in predictable ways, defining a balanced region that harmonizes fluency and creativity. As a phenomenological rather than first-principles framework, DMET provides a unified and testable perspective for interpreting, monitoring, and guiding LLM behavior, offering new insights into the interplay between internal representation dynamics and external text generation quality.

CLMay 24, 2025
Multi-Scale Manifold Alignment for Interpreting Large Language Models: A Unified Information-Geometric Framework

Yukun Zhang, Qi Dong

We present Multi-Scale Manifold Alignment(MSMA), an information-geometric framework that decomposes LLM representations into local, intermediate, and global manifolds and learns cross-scale mappings that preserve geometry and information. Across GPT-2, BERT, RoBERTa, and T5, we observe consistent hierarchical patterns and find that MSMA improves alignment metrics under multiple estimators (e.g., relative KL reduction and MI gains with statistical significance across seeds). Controlled interventions at different scales yield distinct and architecture-dependent effects on lexical diversity, sentence structure, and discourse coherence. While our theoretical analysis relies on idealized assumptions, the empirical results suggest that multi-objective alignment offers a practical lens for analyzing cross-scale information flow and guiding representation-level control.

CVMay 26, 2025
HF-VTON: High-Fidelity Virtual Try-On via Consistent Geometric and Semantic Alignment

Ming Meng, Qi Dong, Jiajie Li et al.

Virtual try-on technology has become increasingly important in the fashion and retail industries, enabling the generation of high-fidelity garment images that adapt seamlessly to target human models. While existing methods have achieved notable progress, they still face significant challenges in maintaining consistency across different poses. Specifically, geometric distortions lead to a lack of spatial consistency, mismatches in garment structure and texture across poses result in semantic inconsistency, and the loss or distortion of fine-grained details diminishes visual fidelity. To address these challenges, we propose HF-VTON, a novel framework that ensures high-fidelity virtual try-on performance across diverse poses. HF-VTON consists of three key modules: (1) the Appearance-Preserving Warp Alignment Module (APWAM), which aligns garments to human poses, addressing geometric deformations and ensuring spatial consistency; (2) the Semantic Representation and Comprehension Module (SRCM), which captures fine-grained garment attributes and multi-pose data to enhance semantic representation, maintaining structural, textural, and pattern consistency; and (3) the Multimodal Prior-Guided Appearance Generation Module (MPAGM), which integrates multimodal features and prior knowledge from pre-trained models to optimize appearance generation, ensuring both semantic and geometric consistency. Additionally, to overcome data limitations in existing benchmarks, we introduce the SAMP-VTONS dataset, featuring multi-pose pairs and rich textual annotations for a more comprehensive evaluation. Experimental results demonstrate that HF-VTON outperforms state-of-the-art methods on both VITON-HD and SAMP-VTONS, excelling in visual fidelity, semantic consistency, and detail preservation.

CLMay 23, 2025
Multi-Scale Probabilistic Generation Theory: A Unified Information-Theoretic Framework for Hierarchical Structure in Large Language Models

Yukin Zhang, Qi Dong

Large Language Models (LLMs) exhibit remarkable emergent abilities but remain poorly understood at a mechanistic level. This paper introduces the Multi-Scale Probabilistic Generation Theory (MSPGT), a theoretical framework that models LLMs as Hierarchical Variational Information Bottleneck (H-VIB) systems. MSPGT posits that standard language modeling objectives implicitly optimize multi-scale information compression, leading to the spontaneous formation of three internal processing scales-Global, Intermediate, and Local. We formalize this principle, derive falsifiable predictions about boundary positions and architectural dependencies, and validate them through cross-model experiments combining multi-signal fusion and causal interventions. Results across Llama and Qwen families reveal consistent multi-scale organization but strong architecture-specific variations, partially supporting and refining the theory. MSPGT thus advances interpretability from descriptive observation toward predictive, information-theoretic understanding of how hierarchical structure emerges within large neural language models.

LGJun 23, 2024
Unveiling LLM Mechanisms Through Neural ODEs and Control Theory

Yukun Zhang, Qi Dong

This paper proposes a framework combining Neural Ordinary Differential Equations (Neural ODEs) and robust control theory to enhance the interpretability and control of large language models (LLMs). By utilizing Neural ODEs to model the dynamic evolution of input-output relationships and introducing control mechanisms to optimize output quality, we demonstrate the effectiveness of this approach across multiple question-answer datasets. Experimental results show that the integration of Neural ODEs and control theory significantly improves output consistency and model interpretability, advancing the development of explainable AI technologies.

CVMay 5, 2021
Visual Relationship Detection Using Part-and-Sum Transformers with Composite Queries

Qi Dong, Zhuowen Tu, Haofu Liao et al.

Computer vision applications such as visual relationship detection and human object interaction can be formulated as a composite (structured) set detection problem in which both the parts (subject, object, and predicate) and the sum (triplet as a whole) are to be detected in a hierarchical fashion. In this paper, we present a new approach, denoted Part-and-Sum detection Transformer (PST), to perform end-to-end visual composite set detection. Different from existing Transformers in which queries are at a single level, we simultaneously model the joint part and sum hypotheses/interactions with composite queries and attention modules. We explicitly incorporate sum queries to enable better modeling of the part-and-sum relations that are absent in the standard Transformers. Our approach also uses novel tensor-based part queries and vector-based sum queries, and models their joint interaction. We report experiments on two vision tasks, visual relationship detection and human object interaction and demonstrate that PST achieves state of the art results among single-stage models, while nearly matching the results of custom designed two-stage models.

CVApr 25, 2019
Unsupervised Deep Learning by Neighbourhood Discovery

Jiabo Huang, Qi Dong, Shaogang Gong et al.

Deep convolutional neural networks (CNNs) have demonstrated remarkable success in computer vision by supervisedly learning strong visual feature representations. However, training CNNs relies heavily on the availability of exhaustive training data annotations, limiting significantly their deployment and scalability in many application scenarios. In this work, we introduce a generic unsupervised deep learning approach to training deep models without the need for any manual label supervision. Specifically, we progressively discover sample anchored/centred neighbourhoods to reason and learn the underlying class decision boundaries iteratively and accumulatively. Every single neighbourhood is specially formulated so that all the member samples can share the same unseen class labels at high probability for facilitating the extraction of class discriminative feature representations during training. Experiments on image classification show the performance advantages of the proposed method over the state-of-the-art unsupervised learning models on six benchmarks including both coarse-grained and fine-grained object image categorisation.

CVNov 20, 2018
Single-Label Multi-Class Image Classification by Deep Logistic Regression

Qi Dong, Xiatian Zhu, Shaogang Gong

The objective learning formulation is essential for the success of convolutional neural networks. In this work, we analyse thoroughly the standard learning objective functions for multi-class classification CNNs: softmax regression (SR) for single-label scenario and logistic regression (LR) for multi-label scenario. Our analyses lead to an inspiration of exploiting LR for single-label classification learning, and then the disclosing of the negative class distraction problem in LR. To address this problem, we develop two novel LR based objective functions that not only generalise the conventional LR but importantly turn out to be competitive alternatives to SR in single label classification. Extensive comparative evaluations demonstrate the model learning advantages of the proposed LR functions over the commonly adopted SR in single-label coarse-grained object categorisation and cross-class fine-grained person instance identification tasks. We also show the performance superiority of our method on clothing attribute classification in comparison to the vanilla LR function.

CYSep 4, 2018
Constructing Trustworthy and Safe Communities on a Blockchain-Enabled Social Credits System

Ronghua Xu, Xuheng Lin, Qi Dong et al.

The emergence of big data and Artificial Intelligence (AI) technology is reshaping the world. While the technological revolution improves the quality of our life, new concerns are triggered. The superhuman capability enables AI to outperform human workers in many data- and/or computing-intensive tasks. Also, digital superpowers are showing arrogance towards individuals, which erodes the trust foundation of the society. In this position paper, we suggest to construct trustworthy and safe communities based on a BLockchain-Enabled Social credits System (BLESS) that rewards the residents who commit in socially beneficial activities. Human being's true value lies in serving other people. The BLESS system is considered as an efficient approach to promote the value and dignity in efforts focused on enhancing our communities and regulating business and private behaviors. The BLESS system leverages the decentralized architecture of the blockchain network, which not only allows grassroots individuals to participate rating process of a social credit system (SCS), but also provides tamper proof of transaction data in the trustless network environment. The anonymity in blockchain records also protects individuals from being targeted in the fight against powerful enterprises. Smart contract enabled authentication and authorization strategy prevents any unauthorized entity from accessing the credit system. The BLESS scheme is promising to offer a secure, transparent and decentralized SCS.

CVApr 28, 2018
Imbalanced Deep Learning by Minority Class Incremental Rectification

Qi Dong, Shaogang Gong, Xiatian Zhu

Model learning from class imbalanced training data is a long-standing and significant challenge for machine learning. In particular, existing deep learning methods consider mostly either class balanced data or moderately imbalanced data in model training, and ignore the challenge of learning from significantly imbalanced training data. To address this problem, we formulate a class imbalanced deep learning model based on batch-wise incremental minority (sparsely sampled) class rectification by hard sample mining in majority (frequently sampled) classes during model training. This model is designed to minimise the dominant effect of majority classes by discovering sparsely sampled boundaries of minority classes in an iterative batch-wise learning process. To that end, we introduce a Class Rectification Loss (CRL) function that can be deployed readily in deep network architectures. Extensive experimental evaluations are conducted on three imbalanced person attribute benchmark datasets (CelebA, X-Domain, DeepFashion) and one balanced object category benchmark dataset (CIFAR-100). These experimental results demonstrate the performance advantages and model scalability of the proposed batch-wise incremental minority class rectification model over the existing state-of-the-art models for addressing the problem of imbalanced data learning.

CVDec 8, 2017
Class Rectification Hard Mining for Imbalanced Deep Learning

Qi Dong, Shaogang Gong, Xiatian Zhu

Recognising detailed facial or clothing attributes in images of people is a challenging task for computer vision, especially when the training data are both in very large scale and extremely imbalanced among different attribute classes. To address this problem, we formulate a novel scheme for batch incremental hard sample mining of minority attribute classes from imbalanced large scale training data. We develop an end-to-end deep learning framework capable of avoiding the dominant effect of majority classes by discovering sparsely sampled boundaries of minority classes. This is made possible by introducing a Class Rectification Loss (CRL) regularising algorithm. We demonstrate the advantages and scalability of CRL over existing state-of-the-art attribute recognition and imbalanced data learning models on two large scale imbalanced benchmark datasets, the CelebA facial attribute dataset and the X-Domain clothing attribute dataset.

CVOct 12, 2016
Multi-Task Curriculum Transfer Deep Learning of Clothing Attributes

Qi Dong, Shaogang Gong, Xiatian Zhu

Recognising detailed clothing characteristics (fine-grained attributes) in unconstrained images of people in-the-wild is a challenging task for computer vision, especially when there is only limited training data from the wild whilst most data available for model learning are captured in well-controlled environments using fashion models (well lit, no background clutter, frontal view, high-resolution). In this work, we develop a deep learning framework capable of model transfer learning from well-controlled shop clothing images collected from web retailers to in-the-wild images from the street. Specifically, we formulate a novel Multi-Task Curriculum Transfer (MTCT) deep learning method to explore multiple sources of different types of web annotations with multi-labelled fine-grained attributes. Our multi-task loss function is designed to extract more discriminative representations in training by jointly learning all attributes, and our curriculum strategy exploits the staged easy-to-complex transfer learning motivated by cognitive studies. We demonstrate the advantages of the MTCT model over the state-of-the-art methods on the X-Domain benchmark, a large scale clothing attribute dataset. Moreover, we show that the MTCT model has a notable advantage over contemporary models when the training data size is small.