Qian Hu

CL
h-index61
21papers
718citations
Novelty41%
AI Score53

21 Papers

CLMay 27Code
ChildEval: When large language models meet children's personalities

Yanyan Luo, Xue Han, Chunxu Zhao et al.

While LLMs enable personalized chatbots, their effectiveness in child-centered personalization remains unclear, as systematic evaluation of child-specific preferences is still lacking. To address this gap, we introduce ChildEval, a benchmark for evaluating LLMs' ability to infer and follow child-centered preferences in long-context conversations. ChildEval contains 29K synthesized persona profiles of children aged 3-6, providing relatively static background information. Each persona is associated with a child preference-which may align with, conflict with, or be independent of the persona-expressed either explicitly in a single sentence or implicitly through 6-10 turn dialogues. Explicit and implicit preferences are designed to reflect the same underlying preference but differ in expression, capturing dynamic aspects of preference expression rather than changes in the static persona. The benchmark spans five top-level and fourteen sub-level categories covering children's daily lives and development. We further propose fine-grained, child-centric evaluation protocols to systematically assess open-source LLMs. Experimental results demonstrate how different personalized representations affect LLM responses and suggest that finetuning on ChildEval can enhance child-centered performance. Our code and dataset are available at https://github.com/ziyanluo/ChildEval.

AIAug 8, 2023
FLIRT: Feedback Loop In-context Red Teaming

Ninareh Mehrabi, Palash Goyal, Christophe Dupuy et al. · amazon-science

Warning: this paper contains content that may be inappropriate or offensive. As generative models become available for public use in various applications, testing and analyzing vulnerabilities of these models has become a priority. In this work, we propose an automatic red teaming framework that evaluates a given black-box model and exposes its vulnerabilities against unsafe and inappropriate content generation. Our framework uses in-context learning in a feedback loop to red team models and trigger them into unsafe content generation. In particular, taking text-to-image models as target models, we explore different feedback mechanisms to automatically learn effective and diverse adversarial prompts. Our experiments demonstrate that even with enhanced safety features, Stable Diffusion (SD) models are vulnerable to our adversarial prompts, raising concerns on their robustness in practical uses. Furthermore, we demonstrate that the proposed framework is effective for red teaming text-to-text models.

AIMar 17, 2025
The Amazon Nova Family of Models: Technical Report and Model Card

Amazon AGI, Aaron Langford, Aayush Shah et al. · amazon-science

We present Amazon Nova, a new generation of state-of-the-art foundation models that deliver frontier intelligence and industry-leading price performance. Amazon Nova Pro is a highly-capable multimodal model with the best combination of accuracy, speed, and cost for a wide range of tasks. Amazon Nova Lite is a low-cost multimodal model that is lightning fast for processing images, video, documents and text. Amazon Nova Micro is a text-only model that delivers our lowest-latency responses at very low cost. Amazon Nova Canvas is an image generation model that creates professional grade images with rich customization controls. Amazon Nova Reel is a video generation model offering high-quality outputs, customization, and motion control. Our models were built responsibly and with a commitment to customer trust, security, and reliability. We report benchmarking results for core capabilities, agentic performance, long context, functional adaptation, runtime performance, and human evaluation.

CLNov 17, 2022
Is the Elephant Flying? Resolving Ambiguities in Text-to-Image Generative Models

Ninareh Mehrabi, Palash Goyal, Apurv Verma et al. · amazon-science, gatech

Natural language often contains ambiguities that can lead to misinterpretation and miscommunication. While humans can handle ambiguities effectively by asking clarifying questions and/or relying on contextual cues and common-sense knowledge, resolving ambiguities can be notoriously hard for machines. In this work, we study ambiguities that arise in text-to-image generative models. We curate a benchmark dataset covering different types of ambiguities that occur in these systems. We then propose a framework to mitigate ambiguities in the prompts given to the systems by soliciting clarifications from the user. Through automatic and human evaluations, we show the effectiveness of our framework in generating more faithful images aligned with human intention in the presence of ambiguities.

CLOct 23, 2023
Evaluating Large Language Models on Controlled Generation Tasks

Jiao Sun, Yufei Tian, Wangchunshu Zhou et al.

While recent studies have looked into the abilities of large language models in various benchmark tasks, including question generation, reading comprehension, multilingual and etc, there have been few studies looking into the controllability of large language models on generation tasks. We present an extensive analysis of various benchmarks including a sentence planning benchmark with different granularities. After comparing large language models against state-of-the-start finetuned smaller models, we present a spectrum showing large language models falling behind, are comparable, or exceed the ability of smaller models. We conclude that **large language models struggle at meeting fine-grained hard constraints**.

CVMay 25
A Pedestrian-Vehicle Interaction Benchmark and Annotation Framework for Unstructured Scenes via Uncalibrated Cameras

Haoyang Peng, Qian Hu, Songan Zhang et al.

Predicting the interaction between pedestrian and vehicle is essential for autonomous driving safety in unstructured and semi-structured scenarios; however, this task is severely hindered by the scarcity of public datasets that feature dense pedestrian-vehicle interactions. Most current studies rely on structured road data, leaving the complex, heterogeneous interactions found in unstructured environments insufficiently represented and researched. In this paper, we propose a dataset annotation framework based on video data from uncalibrated surveillance cameras and present PINNS (Pedestrian-vehicle Interaction dataset from uNcalibrated cameras in uNstructured Scenes). The dataset covers multiple countries and regions, includes diverse typical traffic scenarios, and considers variations in seasons, lighting conditions, and weather. It focuses on complex scenes with dense pedestrian-vehicle interactions and is designed to be easily extensible. The dataset is constructed and annotated according to the standard issued by the Chinese Association of Automation, providing both trajectory data and corresponding scene-level information. Furthermore, this paper analyzes current challenges and research directions in heterogeneous agent trajectory prediction, shows the necessity and usefulness of the proposed dataset. We hope our framework and dataset will facilitate research on trajectory prediction and autonomous driving in complex mixed traffic scenarios. PINNS is publicly available at https://github.com/Songan-Lab.

LGMay 3, 2022
Predicting vacant parking space availability zone-wisely: a graph based spatio-temporal prediction approach

Yajing Feng, Qian Hu, Zhenzhou Tang

Vacant parking space (VPS) prediction is one of the key issues of intelligent parking guidance systems. Accurately predicting VPS information plays a crucial role in intelligent parking guidance systems, which can help drivers find parking space quickly, reducing unnecessary waste of time and excessive environmental pollution. Through the simple analysis of historical data, we found that there not only exists a obvious temporal correlation in each parking lot, but also a clear spatial correlation between different parking lots. In view of this, this paper proposed a graph data-based model ST-GBGRU (Spatial-Temporal Graph Based Gated Recurrent Unit), the number of VPSs can be predicted both in short-term (i.e., within 30 min) and in long-term (i.e., over 30min). On the one hand, the temporal correlation of historical VPS data is extracted by GRU, on the other hand, the spatial correlation of historical VPS data is extracted by GCN inside GRU. Two prediction methods, namely direct prediction and iterative prediction, are combined with the proposed model. Finally, the prediction model is applied to predict the number VPSs of 8 public parking lots in Santa Monica. The results show that in the short-term and long-term prediction tasks, ST-GBGRU model can achieve high accuracy and have good application prospects.

CLApr 4, 2024
Can Small Language Models Help Large Language Models Reason Better?: LM-Guided Chain-of-Thought

Jooyoung Lee, Fan Yang, Thanh Tran et al.

We introduce a novel framework, LM-Guided CoT, that leverages a lightweight (i.e., <1B) language model (LM) for guiding a black-box large (i.e., >10B) LM in reasoning tasks. Specifically, the lightweight LM first generates a rationale for each input instance. The Frozen large LM is then prompted to predict a task output based on the rationale generated by the lightweight LM. Our approach is resource-efficient in the sense that it only requires training the lightweight LM. We optimize the model through 1) knowledge distillation and 2) reinforcement learning from rationale-oriented and task-oriented reward signals. We assess our method with multi-hop extractive question answering (QA) benchmarks, HotpotQA, and 2WikiMultiHopQA. Experimental results show that our approach outperforms all baselines regarding answer prediction accuracy. We also find that reinforcement learning helps the model to produce higher-quality rationales with improved QA performance.

CLApr 2, 2024
Toward Informal Language Processing: Knowledge of Slang in Large Language Models

Zhewei Sun, Qian Hu, Rahul Gupta et al.

Recent advancement in large language models (LLMs) has offered a strong potential for natural language systems to process informal language. A representative form of informal language is slang, used commonly in daily conversations and online social media. To date, slang has not been comprehensively evaluated in LLMs due partly to the absence of a carefully designed and publicly accessible benchmark. Using movie subtitles, we construct a dataset that supports evaluation on a diverse set of tasks pertaining to automatic processing of slang. For both evaluation and finetuning, we show the effectiveness of our dataset on two core applications: 1) slang detection, and 2) identification of regional and historical sources of slang from natural sentences. We also show how our dataset can be used to probe the output distributions of LLMs for interpretive insights. We find that while LLMs such as GPT-4 achieve good performance in a zero-shot setting, smaller BERT-like models finetuned on our dataset achieve comparable performance. Furthermore, we show that our dataset enables finetuning of LLMs such as GPT-3.5 that achieve substantially better performance than strong zero-shot baselines. Our work offers a comprehensive evaluation and a high-quality benchmark on English slang based on the OpenSubtitles corpus, serving both as a publicly accessible resource and a platform for applying tools for informal language processing.

CLMar 6, 2025
Temporal Alignment of LLMs through Cycle Encoding for Long-Range Time Representations

Xue Han, Qian Hu, Yitong Wang et al.

Large language models (LLMs) suffer from temporal misalignment issues especially across long span of time. The issue arises from knowing that LLMs are trained on large amounts of data where temporal information is rather sparse over long times, such as thousands of years, resulting in insufficient learning or catastrophic forgetting by the LLMs. This paper proposes a methodology named "Ticktack" for addressing the LLM's long-time span misalignment in a yearly setting. Specifically, we first propose to utilize the sexagenary year expression instead of the Gregorian year expression employed by LLMs, achieving a more uniform distribution in yearly granularity. Then, we employ polar coordinates to model the sexagenary cycle of 60 terms and the year order within each term, with additional temporal encoding to ensure LLMs understand them. Finally, we present a temporal representational alignment approach for post-training LLMs that effectively distinguishes time points with relevant knowledge, hence improving performance on time-related tasks, particularly over a long period. We also create a long time span benchmark for evaluation. Experimental results prove the effectiveness of our proposal.

CLDec 19, 2023
Faithful Model Evaluation for Model-Based Metrics

Palash Goyal, Qian Hu, Rahul Gupta · amazon-science

Statistical significance testing is used in natural language processing (NLP) to determine whether the results of a study or experiment are likely to be due to chance or if they reflect a genuine relationship. A key step in significance testing is the estimation of confidence interval which is a function of sample variance. Sample variance calculation is straightforward when evaluating against ground truth. However, in many cases, a metric model is often used for evaluation. For example, to compare toxicity of two large language models, a toxicity classifier is used for evaluation. Existing works usually do not consider the variance change due to metric model errors, which can lead to wrong conclusions. In this work, we establish the mathematical foundation of significance testing for model-based metrics. With experiments on public benchmark datasets and a production system, we show that considering metric model errors to calculate sample variances for model-based metrics changes the conclusions in certain experiments.

AIOct 4, 2025
Quantifying Risks in Multi-turn Conversation with Large Language Models

Chengxiao Wang, Isha Chaudhary, Qian Hu et al.

Large Language Models (LLMs) can produce catastrophic responses in conversational settings that pose serious risks to public safety and security. Existing evaluations often fail to fully reveal these vulnerabilities because they rely on fixed attack prompt sequences, lack statistical guarantees, and do not scale to the vast space of multi-turn conversations. In this work, we propose QRLLM, a novel, principled Certification framework for Catastrophic risks in multi-turn Conversation for LLMs that bounds the probability of an LLM generating catastrophic responses under multi-turn conversation distributions with statistical guarantees. We model multi-turn conversations as probability distributions over query sequences, represented by a Markov process on a query graph whose edges encode semantic similarity to capture realistic conversational flow, and quantify catastrophic risks using confidence intervals. We define several inexpensive and practical distributions: random node, graph path, adaptive with rejection. Our results demonstrate that these distributions can reveal substantial catastrophic risks in frontier models, with certified lower bounds as high as 70\% for the worst model, highlighting the urgent need for improved safety training strategies in frontier LLMs.

CLAug 22, 2025
MultiPL-MoE: Multi-Programming-Lingual Extension of Large Language Models through Hybrid Mixture-of-Experts

Qing Wang, Xue Han, Jiahui Wang et al.

Despite LLMs' excellent code creation capabilities, multilingual code generation remains extremely challenging. To address this, we intent to improve the multi-programming-lingual (MultiPL) performance of the base LLMs while retaining the most popular ones using restricted computational resources. We consider MultiPL to be a special case of multiple natural languages and propose a MultiPL extension of LLMs utilizing a hybrid mixture of experts (MoE), called MultiPL-MoE. Specifically, MultiPL-MoE combines two paired MoEs to optimize expert selection at both the token and segment levels. The token-level MoE is a standard upcycling MoE structure with a shared expert and a novel gate weight normalization approach that aids in the final fusion with the segment-level MoE. The segment-level MoE incorporates two innovative designs to better capture the syntactic structure and contextual patterns of programming languages: First, using a sliding window to partition the input token sequence into multiple segments; Then, adopting an expert-choice routing strategy that allows experts to select the top-k segments. The results of the experiment proved the effectiveness of MultiPL-MoE.

CVFeb 13, 2025
Mitigating the Impact of Prominent Position Shift in Drone-based RGBT Object Detection

Yan Zhang, Wen Yang, Chang Xu et al.

Drone-based RGBT object detection plays a crucial role in many around-the-clock applications. However, real-world drone-viewed RGBT data suffers from the prominent position shift problem, i.e., the position of a tiny object differs greatly in different modalities. For instance, a slight deviation of a tiny object in the thermal modality will induce it to drift from the main body of itself in the RGB modality. Considering RGBT data are usually labeled on one modality (reference), this will cause the unlabeled modality (sensed) to lack accurate supervision signals and prevent the detector from learning a good representation. Moreover, the mismatch of the corresponding feature point between the modalities will make the fused features confusing for the detection head. In this paper, we propose to cast the cross-modality box shift issue as the label noise problem and address it on the fly via a novel Mean Teacher-based Cross-modality Box Correction head ensemble (CBC). In this way, the network can learn more informative representations for both modalities. Furthermore, to alleviate the feature map mismatch problem in RGBT fusion, we devise a Shifted Window-Based Cascaded Alignment (SWCA) module. SWCA mines long-range dependencies between the spatially unaligned features inside shifted windows and cascaded aligns the sensed features with the reference ones. Extensive experiments on two drone-based RGBT object detection datasets demonstrate that the correction results are both visually and quantitatively favorable, thereby improving the detection performance. In particular, our CBC module boosts the precision of the sensed modality ground truth by 25.52 aSim points. Overall, the proposed detector achieves an mAP_50 of 43.55 points on RGBTDronePerson and surpasses a state-of-the-art method by 8.6 mAP50 on a shift subset of DroneVehicle dataset. The code and data will be made publicly available.

DBJan 9, 2022
OPP-Miner: Order-preserving sequential pattern mining

Youxi Wu, Qian Hu, Yan Li et al.

A time series is a collection of measurements in chronological order. Discovering patterns from time series is useful in many domains, such as stock analysis, disease detection, and weather forecast. To discover patterns, existing methods often convert time series data into another form, such as nominal/symbolic format, to reduce dimensionality, which inevitably deviates the data values. Moreover, existing methods mainly neglect the order relationships between time series values. To tackle these issues, inspired by order-preserving matching, this paper proposes an Order-Preserving sequential Pattern (OPP) mining method, which represents patterns based on the order relationships of the time series data. An inherent advantage of such representation is that the trend of a time series can be represented by the relative order of the values underneath the time series data. To obtain frequent trends in time series, we propose the OPP-Miner algorithm to mine patterns with the same trend (sub-sequences with the same relative order). OPP-Miner employs the filtration and verification strategies to calculate the support and uses pattern fusion strategy to generate candidate patterns. To compress the result set, we also study finding the maximal OPPs. Experiments validate that OPP-Miner is not only efficient and scalable but can also discover similar sub-sequences in time series. In addition, case studies show that our algorithms have high utility in analyzing the COVID-19 epidemic by identifying critical trends and improve the clustering performance.

LGOct 19, 2021
Two-stage Voice Application Recommender System for Unhandled Utterances in Intelligent Personal Assistant

Wei Xiao, Qian Hu, Thahir Mohamed et al.

Intelligent personal assistants (IPA) enable voice applications that facilitate people's daily tasks. However, due to the complexity and ambiguity of voice requests, some requests may not be handled properly by the standard natural language understanding (NLU) component. In such cases, a simple reply like "Sorry, I don't know" hurts the user's experience and limits the functionality of IPA. In this paper, we propose a two-stage shortlister-reranker recommender system to match third-party voice applications (skills) to unhandled utterances. In this approach, a skill shortlister is proposed to retrieve candidate skills from the skill catalog by calculating both lexical and semantic similarity between skills and user requests. We also illustrate how to build a new system by using observed data collected from a baseline rule-based system, and how the exposure biases can generate discrepancy between offline and human metrics. Lastly, we present two relabeling methods that can handle the incomplete ground truth, and mitigate exposure bias. We demonstrate the effectiveness of our proposed system through extensive offline experiments. Furthermore, we present online A/B testing results that show a significant boost on user experience satisfaction.

AIJul 4, 2021
Attribute reduction and rule acquisition of formal decision context based on two new kinds of decision rules

Qian Hu, Keyun Qin

This paper mainly studies the rule acquisition and attribute reduction for formal decision context based on two new kinds of decision rules, namely I-decision rules and II-decision rules. The premises of these rules are object-oriented concepts, and the conclusions are formal concept and property-oriented concept respectively. The rule acquisition algorithms for I-decision rules and II-decision rules are presented. Some comparative analysis of these algorithms with the existing algorithms are examined which shows that the algorithms presented in this study behave well. The attribute reduction approaches to preserve I-decision rules and II-decision rules are presented by using discernibility matrix.

CVApr 2, 2021
A Detector-oblivious Multi-arm Network for Keypoint Matching

Xuelun Shen, Qian Hu, Xin Li et al.

This paper presents a matching network to establish point correspondence between images. We propose a Multi-Arm Network (MAN) to learn region overlap and depth, which can greatly improve the keypoint matching robustness while bringing little computational cost during the inference stage. Another design that makes this framework different from many existing learning based pipelines that require re-training when a different keypoint detector is adopted, our network can directly work with different keypoint detectors without such a time-consuming re-training process. Comprehensive experiments conducted on outdoor and indoor datasets demonstrated that our proposed MAN outperforms state-of-the-art methods.

LGNov 13, 2020
Metric-Free Individual Fairness with Cooperative Contextual Bandits

Qian Hu, Huzefa Rangwala

Data mining algorithms are increasingly used in automated decision making across all walks of daily life. Unfortunately, as reported in several studies these algorithms inject bias from data and environment leading to inequitable and unfair solutions. To mitigate bias in machine learning, different formalizations of fairness have been proposed that can be categorized into group fairness and individual fairness. Group fairness requires that different groups should be treated similarly which might be unfair to some individuals within a group. On the other hand, individual fairness requires that similar individuals be treated similarly. However, individual fairness remains understudied due to its reliance on problem-specific similarity metrics. We propose a metric-free individual fairness and a cooperative contextual bandits (CCB) algorithm. The CCB algorithm utilizes fairness as a reward and attempts to maximize it. The advantage of treating fairness as a reward is that the fairness criterion does not need to be differentiable. The proposed algorithm is tested on multiple real-world benchmark datasets. The results show the effectiveness of the proposed algorithm at mitigating bias and at achieving both individual and group fairness.

LGDec 26, 2019
Academic Performance Estimation with Attention-based Graph Convolutional Networks

Qian Hu, Huzefa Rangwala

Student's academic performance prediction empowers educational technologies including academic trajectory and degree planning, course recommender systems, early warning and advising systems. Given a student's past data (such as grades in prior courses), the task of student's performance prediction is to predict a student's grades in future courses. Academic programs are structured in a way that prior courses lay the foundation for future courses. The knowledge required by courses is obtained by taking multiple prior courses, which exhibits complex relationships modeled by graph structures. Traditional methods for student's performance prediction usually neglect the underlying relationships between multiple courses; and how students acquire knowledge across them. In addition, traditional methods do not provide interpretation for predictions needed for decision making. In this work, we propose a novel attention-based graph convolutional networks model for student's performance prediction. We conduct extensive experiments on a real-world dataset obtained from a large public university. The experimental results show that our proposed model outperforms state-of-the-art approaches in terms of grade prediction. The proposed model also shows strong accuracy in identifying students who are at-risk of failing or dropping out so that timely intervention and feedback can be provided to the student.

AIFeb 26, 2019
Reliable Deep Grade Prediction with Uncertainty Estimation

Qian Hu, Huzefa Rangwala

Currently, college-going students are taking longer to graduate than their parental generations. Further, in the United States, the six-year graduation rate has been 59% for decades. Improving the educational quality by training better-prepared students who can successfully graduate in a timely manner is critical. Accurately predicting students' grades in future courses has attracted much attention as it can help identify at-risk students early so that personalized feedback can be provided to them on time by advisors. Prior research on students' grade prediction include shallow linear models; however, students' learning is a highly complex process that involves the accumulation of knowledge across a sequence of courses that can not be sufficiently modeled by these linear models. In addition to that, prior approaches focus on prediction accuracy without considering prediction uncertainty, which is essential for advising and decision making. In this work, we present two types of Bayesian deep learning models for grade prediction. The MLP ignores the temporal dynamics of students' knowledge evolution. Hence, we propose RNN for students' performance prediction. To evaluate the performance of the proposed models, we performed extensive experiments on data collected from a large public university. The experimental results show that the proposed models achieve better performance than prior state-of-the-art approaches. Besides more accurate results, Bayesian deep learning models estimate uncertainty associated with the predictions. We explore how uncertainty estimation can be applied towards developing a reliable educational early warning system. In addition to uncertainty, we also develop an approach to explain the prediction results, which is useful for advisors to provide personalized feedback to students.