HC | May 8 | Code
ECNUClaw: A Learner-Profiled Intelligent Study Companion Framework for K-12 Personalized Education
Yizhou Zhou, Jiayin Li, Zhi Zhang
We introduce ECNUClaw, an open-source framework for building learner-profiled intelligent study companions in K-12 education. The system constructs and maintains a five-dimension learner profile -- covering cognitive, behavioral, emotional, metacognitive, and contextual dimensions -- by extracting signals from student-companion dialogues at each turn. Profile updates feed directly into an adaptive strategy engine that adjusts the companion's guidance intensity, encouragement frequency, and Bloom's taxonomy scaffolding in real time. The framework design draws on three theoretical strands from the Chinese educational technology literature: Zhang's Digital Portrait Three-Layer Framework for learner assessment, the Education Brain model for educational system architecture, and the Human-AI Collaborative IQ concept for companion design philosophy. ECNUClaw is implemented in Python and supports seven Chinese LLM providers through a unified OpenAI-compatible adapter layer. We describe the system architecture, the profiling and adaptation mechanisms, and discuss limitations and next steps. The source code is available at https://github.com/bushushu2333/ECNUClaw.
CV | Aug 21, 2024
EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model
Feipeng Ma, Yizhou Zhou, Zheyu Zhang et al.
Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated satisfactory performance across various vision-language tasks. Current approaches for vision and language interaction fall into two categories: self-attention-based and cross-attention-based methods. However, both approaches present inherent limitations, forcing a trade-off between data and computational efficiency. To address this issue, we introduce the Data-Efficient and Compute-Efficient MLLM (EE-MLLM). Specifically, we modify the original self-attention mechanism in MLLMs to a composite attention mechanism. This mechanism has two key characteristics: 1) eliminating the computational overhead of self-attention among visual tokens to achieve compute efficiency, and 2) reusing the weights from each layer of the LLM to facilitate effective vision-language modality alignment for data efficiency. As a result, EE-MLLM significantly outperforms Flamingo with limited training data, and reduces the prefilling time to 79 ms on an H800 GPU, compared to LLaVA's 277 ms. To further investigate the efficiency of EE-MLLM, we present a training-free variant named EE-MLLM-F, which reduces the computation cost of self-attention-based methods without additional training. Experimental results demonstrate the effectiveness of EE-MLLM across a range of benchmarks, including general-purpose datasets like MMBench and SeedBench, as well as fine-grained tasks such as TextVQA and DocVQA.
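To make the compute-efficiency claim concrete, the following is a rough counting sketch, not the authors' implementation. It assumes a sequence laid out as a prefix of visual tokens followed by text tokens under a causal mask, and it counts query-key score pairs per attention head and layer. The function names and the token counts in the usage note are illustrative assumptions.

```python
def causal_score_count(num_visual: int, num_text: int) -> int:
    """Query-key pairs scored by ordinary causal self-attention.

    Assumes every token (visual or text) acts as a query and attends to
    itself and to all earlier tokens.
    """
    n = num_visual + num_text
    return n * (n + 1) // 2


def composite_score_count(num_visual: int, num_text: int) -> int:
    """Query-key pairs when visual tokens are never used as queries.

    Assumes only text tokens act as queries: each one attends to all
    visual tokens and, causally, to the text tokens up to itself.
    """
    return num_text * num_visual + num_text * (num_text + 1) // 2


def saved_pairs(num_visual: int, num_text: int) -> int:
    """Pairs removed by skipping attention among visual tokens."""
    return causal_score_count(num_visual, num_text) - composite_score_count(
        num_visual, num_text
    )
```

For example, with a hypothetical 576 visual tokens and 64 text tokens, `causal_score_count(576, 64)` is 205120 while `composite_score_count(576, 64)` is 38944, so under these assumptions most of the attention scores come from visual tokens attending to each other.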
CV | Nov 15, 2024 | Code
Number it: Temporal Grounding Videos like Flipping Manga
Yongliang Wu, Xinting Hu, Yuyang Sun et al.
Video Large Language Models (Vid-LLMs) have made remarkable advancements in comprehending video content for QA dialogue. However, they struggle to extend this visual understanding to tasks requiring precise temporal localization, known as Video Temporal Grounding (VTG). To address this gap, we introduce Number-Prompt (NumPro), a novel method that empowers Vid-LLMs to bridge visual comprehension with temporal grounding by adding unique numerical identifiers to each video frame. Treating a video as a sequence of numbered frame images, NumPro transforms VTG into an intuitive process: flipping through manga panels in sequence. This allows Vid-LLMs to "read" event timelines, accurately linking visual content with corresponding temporal information. Our experiments demonstrate that NumPro significantly boosts VTG performance of top-tier Vid-LLMs without additional computational cost. Furthermore, fine-tuning on a NumPro-enhanced dataset defines a new state-of-the-art for VTG, surpassing previous top-performing methods by up to 6.9% in mIoU for moment retrieval and 8.5% in mAP for highlight detection. The code will be available at https://github.com/yongliang-wu/NumPro.
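One way to picture how numbered frames connect to temporal grounding is the small sketch below. It is an illustration under stated assumptions rather than the NumPro code: frames are assumed to be numbered from 0 and sampled at a fixed rate, and the helper names `frames_to_seconds` and `temporal_iou` are hypothetical. The IoU helper shows the per-example quantity that an mIoU score averages.

```python
def frames_to_seconds(start_frame: int, end_frame: int, sample_fps: float):
    """Map an inclusive range of frame numbers to a (start, end) span in seconds.

    Assumes frame k (0-indexed) covers the interval [k / fps, (k + 1) / fps).
    """
    if sample_fps <= 0:
        raise ValueError("sample_fps must be positive")
    if end_frame < start_frame:
        raise ValueError("end_frame must not precede start_frame")
    return start_frame / sample_fps, (end_frame + 1) / sample_fps


def temporal_iou(pred, truth) -> float:
    """Intersection over union of two (start, end) spans in seconds."""
    intersection = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - intersection
    return intersection / union if union > 0 else 0.0
```

For instance, if a model answers with frames 3 to 5 of a video sampled at 0.5 frames per second, `frames_to_seconds(3, 5, 0.5)` gives the span from 6.0 s to 12.0 s, and comparing it with a ground-truth span of (8.0, 14.0) yields an IoU of 0.5.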
LG | Aug 7, 2025 | Code
On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification
Yongliang Wu, Yizhou Zhou, Zhou Ziheng et al.
We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for Large Language Models (LLMs), addressing its limited generalization compared to reinforcement learning (RL). Through mathematical analysis, we reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the generalization capabilities of the model. To rectify this, we propose Dynamic Fine-Tuning (DFT), stabilizing gradient updates for each token by dynamically rescaling the objective function with the probability of this token. Remarkably, this single-line code change significantly outperforms standard SFT across multiple challenging benchmarks and base models, demonstrating greatly improved generalization. Additionally, our approach shows competitive results in offline RL settings, offering an effective yet simpler alternative. This work bridges theoretical insight and practical solutions, substantially advancing SFT performance. The code will be available at https://github.com/yongliang-wu/DFT.
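The rescaling described above can be sketched for a single token in plain Python. This is a minimal illustration, not the code from the DFT repository: it assumes the token-probability factor is treated as a constant when differentiating (a stop-gradient), and all function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sft_token_loss(logits, target):
    """Standard SFT loss for one token: -log p(target)."""
    return -math.log(softmax(logits)[target])


def dft_token_loss(logits, target):
    """SFT loss rescaled by the target-token probability."""
    p_target = softmax(logits)[target]
    return -p_target * math.log(p_target)


def sft_logit_grad(logits, target):
    """Gradient of -log p(target) w.r.t. the logits: softmax minus one-hot."""
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]


def dft_logit_grad(logits, target):
    """Gradient of the rescaled loss, with the scale held constant (assumption)."""
    p_target = softmax(logits)[target]
    return [p_target * g for g in sft_logit_grad(logits, target)]
```

With logits `[2.0, 0.0, -1.0]` and target index 2, the target probability is small, so the rescaled gradient is much smaller in magnitude than the plain SFT gradient; for a confidently predicted target the two are nearly the same. In an autograd framework the same idea would typically be written by multiplying the per-token loss by a detached copy of the token probability.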
CV | May 6, 2021 | Code
Unsupervised Visual Representation Learning by Tracking Patches in Video
Guangting Wang, Yizhou Zhou, Chong Luo et al.
Inspired by the fact that human eyes continue to develop tracking ability in early and middle childhood, we propose to use tracking as a proxy task for a computer vision system to learn visual representations. Modelled on the Catch game played by children, we design a Catch-the-Patch (CtP) game for a 3D-CNN model to learn visual representations that would help with video-related tasks. In the proposed pretraining framework, we cut an image patch from a given video and let it scale and move according to a pre-set trajectory. The proxy task is to estimate the position and size of the image patch in a sequence of video frames, given only the target bounding box in the first frame. We discover that using multiple image patches simultaneously brings clear benefits. We further increase the difficulty of the game by randomly making patches invisible. Extensive experiments on mainstream benchmarks demonstrate the superior performance of CtP against other video pretraining methods. In addition, CtP-pretrained features are less sensitive to domain gaps than those trained by a supervised action recognition task. When both are trained on Kinetics-400, we are pleasantly surprised to find that the CtP-pretrained representation achieves much higher action classification accuracy than its fully supervised counterpart on the Something-Something dataset. Code is available online: github.com/microsoft/CtP.
CHEM-PH | Jan 17, 2021 | Code
Data-driven discovery of multiscale chemical reactions governed by the law of mass action
Juntao Huang, Yizhou Zhou, Wen-An Yong
In this paper, we propose a data-driven method to discover multiscale chemical reactions governed by the law of mass action. First, we use a single matrix to represent the stoichiometric coefficients for both the reactants and products in a system without catalysis reactions. The negative entries in the matrix denote the stoichiometric coefficients for the reactants and the positive ones for the products. Second, we find that conventional optimization methods usually get stuck in local minima and cannot find the true solution when learning multiscale chemical reactions. To overcome this difficulty, we propose a partial-parameters-freezing (PPF) technique to progressively determine the network parameters by using the fact that the stoichiometric coefficients are integers. With such a technique, the dimension of the search space is gradually reduced in the training process and the global minima can eventually be obtained. Several numerical experiments, including the classical Michaelis-Menten kinetics, the hydrogen oxidation reactions, and the simplified GRI-3.0 mechanism, verify the good performance of our algorithm in learning multiscale chemical reactions. The code is available at https://github.com/JuntaoHuang/multiscale-chemical-reaction.
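The single-matrix representation can be illustrated with a short sketch. This is a generic forward evaluation of mass-action kinetics under the sign convention stated in the abstract (negative entries for reactants, positive for products), not the learning algorithm or code from the linked repository. The function names and the two-reaction example in the usage note are made up for illustration.

```python
def reaction_rates(nu, k, c):
    """Mass-action rate of each reaction.

    nu[r][s] is the signed stoichiometric coefficient of species s in
    reaction r (negative = reactant, positive = product), k[r] is the rate
    constant, and c[s] is the concentration of species s.
    """
    rates = []
    for row, k_r in zip(nu, k):
        rate = k_r
        for coeff, conc in zip(row, c):
            if coeff < 0:  # only reactants enter the rate law
                rate *= conc ** (-coeff)
        rates.append(rate)
    return rates


def concentration_derivatives(nu, k, c):
    """Time derivative of each concentration: sum over reactions of nu[r][s] * rate[r]."""
    rates = reaction_rates(nu, k, c)
    return [
        sum(nu[r][s] * rates[r] for r in range(len(nu)))
        for s in range(len(c))
    ]
```

As a usage example, take species A, B, C with the hypothetical reactions 2A -> B (rate constant 0.5) and B -> C (rate constant 2.0), so `nu = [[-2, 1, 0], [0, -1, 1]]`. At concentrations `[2.0, 3.0, 0.0]` the rates are 2.0 and 6.0, and the derivatives are -4.0 for A, -4.0 for B, and 6.0 for C. Because a species that appears on both sides of a reaction cancels in a single signed matrix, this encoding matches the abstract's restriction to systems without catalysis reactions.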
CV | Dec 26, 2023
Inter-X: Towards Versatile Human-Human Interaction Analysis
Liang Xu, Xintao Lv, Yichao Yan et al.
The analysis of ubiquitous human-human interactions is pivotal for understanding humans as social beings. Existing human-human interaction datasets typically suffer from inaccurate body motions and a lack of hand gestures and fine-grained textual descriptions. To better perceive and generate human-human interactions, we propose Inter-X, currently the largest human-human interaction dataset with accurate body movements and diverse interaction patterns, together with detailed hand gestures. The dataset includes ~11K interaction sequences and more than 8.1M frames. We also equip Inter-X with versatile annotations of more than 34K fine-grained human part-level textual descriptions, semantic interaction categories, interaction order, and the relationship and personality of the subjects. Based on the elaborate annotations, we propose a unified benchmark composed of 4 categories of downstream tasks from both the perceptual and generative directions. Extensive experiments and comprehensive analysis show that Inter-X serves as a testbed for promoting the development of versatile human-human interaction analysis. Our dataset and benchmark will be publicly available for research purposes.
CV | Mar 18, 2024
ReGenNet: Towards Human Action-Reaction Synthesis
Liang Xu, Yizhou Zhou, Yichao Yan et al.
Humans constantly interact with their surrounding environments. Current human-centric generative models mainly focus on synthesizing humans plausibly interacting with static scenes and objects, while the dynamic human action-reaction synthesis for ubiquitous causal human-human interactions is less explored. Human-human interactions can be regarded as asymmetric with actors and reactors in atomic interaction periods. In this paper, we comprehensively analyze the asymmetric, dynamic, synchronous, and detailed nature of human-human interactions and propose the first multi-setting human action-reaction synthesis benchmark to generate human reactions conditioned on given human actions. To begin with, we propose to annotate the actor-reactor order of the interaction sequences for the NTU120, InterHuman, and Chi3D datasets. Based on them, a diffusion-based generative model with a Transformer decoder architecture called ReGenNet together with an explicit distance-based interaction loss is proposed to predict human reactions in an online manner, where the future states of actors are unavailable to reactors. Quantitative and qualitative results show that our method can generate instant and plausible human reactions compared to the baselines, and can generalize to unseen actor motions and viewpoint changes.
CV | Mar 9, 2025
PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual Training
Cong Chen, Mingyu Liu, Chenchen Jing et al.
This paper aims to address the challenge of hallucinations in Multimodal Large Language Models (MLLMs), particularly for dense image captioning tasks. To tackle the challenge, we identify the current lack of a metric that finely measures caption quality at the concept level. We hereby introduce HalFscore, a novel metric built upon the language graph and designed to evaluate both the accuracy and completeness of dense captions at a granular level. Additionally, we identify the root cause of hallucination as the model's over-reliance on its language prior. To address this, we propose PerturboLLaVA, which reduces the model's reliance on the language prior by incorporating adversarially perturbed text during training. This method enhances the model's focus on visual inputs, effectively reducing hallucinations and producing accurate, image-grounded descriptions without incurring additional computational overhead. PerturboLLaVA significantly improves the fidelity of generated captions, outperforming existing approaches in handling multimodal hallucinations and achieving improved performance across general multimodal benchmarks.
MA | Dec 30, 2024
AI Agent for Education: von Neumann Multi-Agent System Framework
Yuan-Hao Jiang, Ruijia Li, Yizhou Zhou et al.
The development of large language models has ushered in new paradigms for education. This paper centers on the multi-Agent system in education and proposes the von Neumann multi-Agent system framework. It breaks down each AI Agent into four modules: control unit, logic unit, storage unit, and input-output devices, defining four types of operations: task deconstruction, self-reflection, memory processing, and tool invocation. Furthermore, it introduces related technologies such as Chain-of-Thought, Reason+Act, and Multi-Agent Debate associated with these four types of operations. The paper also discusses the ability enhancement cycle of a multi-Agent system for education, including the outer circulation for human learners to promote knowledge construction and the inner circulation for LLM-based-Agents to enhance swarm intelligence. Through collaboration and reflection, the multi-Agent system can better facilitate human learners' learning and enhance their teaching abilities in this process.
CV | Oct 14, 2024
MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling
Jian Yang, Dacheng Yin, Yizhou Zhou et al.
Recent advancements in multi-modal large language models have propelled the development of joint probabilistic models capable of both image understanding and generation. However, we have identified that recent methods suffer from loss of image information during understanding tasks, due to either image discretization or diffusion denoising steps. To address this issue, we propose a novel Multi-Modal Auto-Regressive (MMAR) probabilistic modeling framework. Unlike discretization-based methods, MMAR takes in continuous-valued image tokens to avoid information loss in an efficient way. Differing from diffusion-based approaches, we disentangle the diffusion process from the auto-regressive backbone model by employing a light-weight diffusion head on top of each auto-regressed image patch embedding. In this way, when the model transitions from image generation to understanding through text generation, the backbone model's hidden representation of the image is not limited to the last denoising step. To successfully train our method, we also propose a theoretically proven technique that addresses the numerical stability issue and a training strategy that balances the generation and understanding task goals. Extensive evaluations on 18 image understanding benchmarks show that MMAR significantly outperforms most of the existing joint multi-modal models, surpassing the method that employs a pre-trained CLIP vision encoder. Meanwhile, MMAR is able to generate high-quality images. We also show that our method is scalable with larger data and model size.
AI | Jan 12, 2025
A Study on Educational Data Analysis and Personalized Feedback Report Generation Based on Tags and ChatGPT
Yizhou Zhou, Mengqiao Zhang, Yuan-Hao Jiang et al.
This study introduces a novel method that employs tag annotation coupled with the ChatGPT language model to analyze student learning behaviors and generate personalized feedback. Central to this approach is the conversion of complex student data into an extensive set of tags, which are then decoded through tailored prompts to deliver constructive feedback that encourages rather than discourages students. This methodology focuses on accurately feeding student data into large language models and crafting prompts that enhance the constructive nature of feedback. The effectiveness of this approach was validated through surveys conducted with over 20 mathematics teachers, who confirmed the reliability of the generated reports. This method can be seamlessly integrated into intelligent adaptive learning systems or provided as a tool to significantly reduce the workload of teachers, providing accurate and timely feedback to students. By transforming raw educational data into interpretable tags, this method supports the provision of efficient and timely personalized learning feedback that offers constructive suggestions tailored to individual learner needs.
AI | Mar 22, 2025
Slide2Text: Leveraging LLMs for Personalized Textbook Generation from PowerPoint Presentations
Yizhou Zhou
The rapid advancements in Large Language Models (LLMs) have revolutionized educational technology, enabling innovative approaches to automated and personalized content creation. This paper introduces Slide2Text, a system that leverages LLMs to transform PowerPoint presentations into customized textbooks. By extracting slide content using OCR, organizing it into a coherent structure, and generating tailored materials such as explanations, exercises, and references, Slide2Text streamlines the textbook creation process. Flexible customization options further enhance its adaptability to diverse educational needs. The system highlights the potential of LLMs in modernizing textbook creation and improving educational accessibility. Future developments will explore multimedia inputs and advanced user customization features.
AI | Jun 25, 2024
Enhancing Explainability of Knowledge Learning Paths: Causal Knowledge Networks
Yuang Wei, Yizhou Zhou, Yuan-Hao Jiang et al.
A reliable knowledge structure is a prerequisite for building effective adaptive learning systems and intelligent tutoring systems. Pursuing an explainable and trustworthy knowledge structure, we propose a method for constructing causal knowledge networks. This approach leverages Bayesian networks as a foundation and incorporates causal relationship analysis to derive a causal network. Additionally, we introduce a dependable knowledge-learning path recommendation technique built upon this framework, improving teaching and learning quality while maintaining transparency in the decision-making process.
CV | May 29, 2023
Image Captioning with Multi-Context Synthetic Data
Feipeng Ma, Yizhou Zhou, Fengyun Rao et al.
Image captioning requires numerous annotated image-text pairs, resulting in substantial annotation costs. Recently, large models (e.g. diffusion models and large language models) have excelled in producing high-quality images and text. This potential can be harnessed to create synthetic image-text pairs for training captioning models. Synthetic data can improve cost and time efficiency in data collection, allow for customization to specific domains, bootstrap generalization capability for zero-shot performance, and circumvent privacy concerns associated with real-world data. However, existing methods struggle to attain satisfactory performance solely through synthetic data. We identify the issue as generated images from simple descriptions mostly capture a solitary perspective with limited context, failing to align with the intricate scenes prevalent in real-world imagery. To tackle this, we present an innovative pipeline that introduces multi-context data generation. Beginning with an initial text corpus, our approach employs a large language model to extract multiple sentences portraying the same scene from diverse viewpoints. These sentences are then condensed into a single sentence with multiple contexts. Subsequently, we generate intricate images using the condensed captions through diffusion models. Our model is exclusively trained on synthetic image-text pairs crafted through this process. The effectiveness of our pipeline is validated through experimental results in both the in-domain and cross-domain settings, where it achieves state-of-the-art performance on well-known datasets such as MSCOCO, Flickr30k, and NoCaps.
CV | Jan 28, 2021
VAE^2: Preventing Posterior Collapse of Variational Video Predictions in the Wild
Yizhou Zhou, Chong Luo, Xiaoyan Sun et al.
Predicting future frames of video sequences is challenging due to the complex and stochastic nature of the problem. Video prediction methods based on variational auto-encoders (VAEs) have been a great success, but they require the training data to contain multiple possible futures for an observed video sequence. This is hard to fulfill when videos are captured in the wild, where any given observation only has a determinate future. As a result, training a vanilla VAE model with these videos inevitably causes posterior collapse. To alleviate this problem, we propose a novel VAE structure, dubbed VAE-in-VAE or VAE$^2$. The key idea is to explicitly introduce stochasticity into the VAE. We treat part of the observed video sequence as a random transition state that bridges its past and future, and maximize the likelihood of a Markov Chain over the video sequence under all possible transition states. A tractable lower bound is proposed for this intractable objective function and an end-to-end optimization algorithm is designed accordingly. VAE$^2$ can mitigate the posterior collapse problem to a large extent, as it breaks the direct dependence between future and observation and does not directly regress the determinate future provided by the training data. We carry out experiments on a large-scale dataset called Cityscapes, which contains videos collected from a number of urban cities. Results show that VAE$^2$ is capable of predicting diverse futures and is more resistant to posterior collapse than the other state-of-the-art VAE-based approaches. We believe that VAE$^2$ is also applicable to other stochastic sequence prediction problems where the training data lack stochasticity.
COMP-PH | Sep 28, 2020
Learning Thermodynamically Stable and Galilean Invariant Partial Differential Equations for Non-equilibrium Flows
Juntao Huang, Zhiting Ma, Yizhou Zhou et al.
In this work, we develop a method for learning interpretable, thermodynamically stable and Galilean invariant partial differential equations (PDEs) based on the Conservation-dissipation Formalism of irreversible thermodynamics. As governing equations for non-equilibrium flows in one dimension, the learned PDEs are parameterized by fully-connected neural networks and satisfy the conservation-dissipation principle automatically. In particular, they are hyperbolic balance laws and Galilean invariant. The training data are generated from a kinetic model with smooth initial data. Numerical results indicate that the learned PDEs can achieve good accuracy in a wide range of Knudsen numbers. Remarkably, the learned dynamics can give satisfactory results with randomly sampled discontinuous initial data and Sod's shock tube problem, although they are trained only with smooth initial data.
CV | Apr 10, 2020
Spatiotemporal Fusion in 3D CNNs: A Probabilistic View
Yizhou Zhou, Xiaoyan Sun, Chong Luo et al.
Despite the success in still image recognition, deep neural networks for spatiotemporal signal tasks (such as human action recognition in videos) have still suffered from low efficacy and inefficiency over the past years. Recently, human experts have put more effort into analyzing the importance of different components in 3D convolutional neural networks (3D CNNs) to design more powerful spatiotemporal learning backbones. Among many others, spatiotemporal fusion is one of the essentials. It controls how spatial and temporal signals are extracted at each layer during inference. Previous attempts usually start with ad-hoc designs that empirically combine certain convolutions and then draw conclusions based on the performance obtained by training the corresponding networks. These methods only support network-level analysis on a limited number of fusion strategies. In this paper, we propose to convert the spatiotemporal fusion strategies into a probability space, which allows us to perform network-level evaluations of various fusion strategies without having to train them separately. Besides, we can also obtain fine-grained numerical information such as layer-level preference on spatiotemporal fusion within the probability space. Our approach greatly boosts the efficiency of analyzing spatiotemporal fusion. Based on the probability space, we further generate new fusion strategies which achieve state-of-the-art performance on four well-known action recognition datasets.
LG | Jun 23, 2019
Posterior-Guided Neural Architecture Search
Yizhou Zhou, Xiaoyan Sun, Chong Luo et al.
The emergence of neural architecture search (NAS) has greatly advanced the research on network design. Recent proposals such as gradient-based methods or one-shot approaches significantly boost the efficiency of NAS. In this paper, we formulate the NAS problem from a Bayesian perspective. We propose explicitly estimating the joint posterior distribution over pairs of network architecture and weights. Accordingly, a hybrid network representation is presented which enables us to leverage Variational Dropout so that the approximation of the posterior distribution becomes fully gradient-based and highly efficient. A posterior-guided sampling method is then presented to sample architecture candidates and directly make evaluations. As a Bayesian approach, our posterior-guided NAS (PGNAS) avoids tuning a number of hyper-parameters and enables very effective architecture sampling in posterior probability space. Interestingly, it also leads to a deeper insight into the weight sharing used in one-shot NAS and naturally alleviates the mismatch between the sampled architecture and weights caused by the weight sharing. We validate our PGNAS method on the fundamental image classification task. Results on CIFAR-10, CIFAR-100 and ImageNet show that PGNAS achieves a good trade-off between precision and speed of search among NAS methods. For example, it takes 11 GPU days to search a very competitive architecture with 1.98% and 14.28% test errors on CIFAR-10 and CIFAR-100, respectively.