Fei Huang

h-index: 9

12 papers

403 citations

Novelty: 52%

AI Score: 48

Ranked #29,911 of 194,257 authors (top 15%) · #6,206 in CL (top 20%)

12 Papers

4.3 · CL · Jan 11, 2023 · Code

MGeo: Multi-Modal Geographic Pre-Training Method

Ruixue Ding, Boli Chen, Pengjun Xie et al.

As a core task in location-based services (LBS) (e.g., navigation maps), query and point of interest (POI) matching connects users' intent with real-world geographic information. Recently, pre-trained models (PTMs) have made advancements in many natural language processing (NLP) tasks. Generic text-based PTMs do not have enough geographic knowledge for query-POI matching. To overcome this limitation, related literature attempts to employ domain-adaptive pre-training based on geo-related corpus. However, a query generally contains mentions of multiple geographic objects, such as nearby roads and regions of interest (ROIs). The geographic context (GC), i.e., these diverse geographic objects and their relationships, is therefore pivotal to retrieving the most relevant POI. Single-modal PTMs can barely make use of the important GC and therefore have limited performance. In this work, we propose a novel query-POI matching method Multi-modal Geographic language model (MGeo), which comprises a geographic encoder and a multi-modal interaction module. MGeo represents GC as a new modality and is able to fully extract multi-modal correlations for accurate query-POI matching. Besides, there is no publicly available benchmark for this topic. In order to facilitate further research, we build a new open-source large-scale benchmark Geographic TExtual Similarity (GeoTES). The POIs come from an open-source geographic information system (GIS). The queries are manually generated by annotators to prevent privacy issues. Compared with several strong baselines, the extensive experiment results and detailed ablation analyses on GeoTES demonstrate that our proposed multi-modal pre-training method can significantly improve the query-POI matching capability of generic PTMs, even when the queries' GC is not provided. Our code and dataset are publicly available at https://github.com/PhantomGrapes/MGeo.

8.1 · CL · Oct 3, 2023 · Code

Editing Personality for Large Language Models

Shengyu Mao, Xiaohan Wang, Mengru Wang et al.

This paper introduces an innovative task focused on editing the personality traits of Large Language Models (LLMs). This task seeks to adjust the models' responses to opinion-related questions on specified topics since an individual's personality often manifests in the form of their expressed opinions, thereby showcasing different personality traits. Specifically, we construct PersonalityEdit, a new benchmark dataset to address this task. Drawing on the theory in Social Psychology, we isolate three representative traits, namely Neuroticism, Extraversion, and Agreeableness, as the foundation for our benchmark. We then gather data using GPT-4, generating responses that align with a specified topic and embody the targeted personality trait. We conduct comprehensive experiments involving various baselines and discuss the representation of personality behavior in LLMs. Our findings uncover potential challenges of the proposed task, illustrating several remaining issues. We anticipate that our work can stimulate further annotation in model editing and personality-related research. Code is available at https://github.com/zjunlp/EasyEdit.

41.6 · CL · May 7, 2025 · Code

ZeroSearch: Incentivize the Search Capability of LLMs without Searching

Hao Sun, Zile Qiao, Jiayan Guo et al. · pku

Effective information searching is essential for enhancing the reasoning and generation capabilities of large language models (LLMs). Recent research has explored using reinforcement learning (RL) to improve LLMs' search capabilities by interacting with live search engines in real-world environments. While these approaches show promising results, they face two major challenges: (1) Uncontrolled Document Quality: The quality of documents returned by search engines is often unpredictable, introducing noise and instability into the training process. (2) Prohibitively High API Costs: RL training requires frequent rollouts, potentially involving hundreds of thousands of search requests, which incur substantial API expenses and severely constrain scalability. To address these challenges, we introduce ZeroSearch, a novel RL framework that incentivizes the capabilities of LLMs to use a real search engine with simulated searches during training. Our approach begins with lightweight supervised fine-tuning to transform the LLM into a retrieval module capable of generating both useful and noisy documents in response to a query. During RL training, we employ a curriculum-based rollout strategy that incrementally degrades the quality of generated documents, progressively eliciting the model's reasoning ability by exposing it to increasingly challenging retrieval scenarios. Extensive experiments demonstrate that ZeroSearch effectively incentivizes the search capabilities of LLMs using a 3B LLM as the retrieval module. Remarkably, a 7B retrieval module achieves comparable performance to the real search engine, while a 14B retrieval module even surpasses it. Furthermore, it generalizes well across both base and instruction-tuned models of various parameter sizes and is compatible with a wide range of RL algorithms.
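The curriculum-based rollout idea can be illustrated with a minimal sketch. The abstract does not give the schedule, so the linear ramp, the function names, and the `p_start`/`p_end` defaults below are all assumptions made for illustration, not the authors' method.

```python
import random


def noise_probability(step: int, total_steps: int,
                      p_start: float = 0.0, p_end: float = 0.5) -> float:
    """Chance of serving a noisy simulated document at a given training step.

    Assumption: a linear ramp from p_start to p_end. The schedule shape and
    both endpoint values are hypothetical.
    """
    if total_steps <= 1:
        return p_end
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return p_start + frac * (p_end - p_start)


def pick_document_quality(step: int, total_steps: int, rng: random.Random) -> str:
    """Decide whether the simulated retriever should emit a useful or noisy document."""
    p_noisy = noise_probability(step, total_steps)
    return "noisy" if rng.random() < p_noisy else "useful"
```

Early rollouts mostly see useful documents under this schedule, and later rollouts see a growing share of noisy ones.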

31.5 · CL · Mar 3, 2025

Sampling-Efficient Test-Time Scaling: Self-Estimating the Best-of-N Sampling in Early Decoding

Yiming Wang, Pei Zhang, Siyuan Huang et al.

Test-time scaling enhances large language model performance by allocating additional compute resources during inference. Best-of-N (BoN) sampling serves as a common sampling-based scaling technique, broadening the search space in parallel to find better solutions from the model distribution. However, its cost-performance trade-off is still underexplored. Two main challenges limit the efficiency of BoN sampling: (1) Generating N full samples consumes substantial GPU memory, reducing inference capacity under limited resources. (2) Reward models add extra memory and latency overhead, and training strong reward models introduces potential training data costs. Although some studies have explored efficiency improvements, none have addressed both challenges at once. To address this gap, we propose Self-Truncation Best-of-N (ST-BoN), a decoding method that avoids fully generating all N samples and eliminates the need for reward models. It leverages early sampling consistency in the model's internal states to identify the most promising path and truncate suboptimal ones. In terms of cost, ST-BoN reduces dynamic GPU memory usage by over 80% and inference latency by 50%. In terms of cost-performance trade-off, ST-BoN achieves the same performance as Full-BoN while saving computational cost by 70%-80%, and under the same cost, it can improve accuracy by 3-4 points.
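A minimal sketch of the two ideas in this abstract follows: standard Best-of-N selection with an external scorer, and a reward-free alternative that keeps the sample whose early internal state agrees most with the others. The consistency rule (smallest mean Euclidean distance), the function names, and the use of plain float vectors as "early states" are assumptions for illustration. The paper's actual criterion is not specified in the abstract.

```python
import math
from typing import Callable, List, Sequence


def best_of_n(samples: Sequence[str], score: Callable[[str], float]) -> str:
    """Standard Best-of-N: generate N full samples, keep the highest-scoring one."""
    return max(samples, key=score)


def most_consistent_index(early_states: List[List[float]]) -> int:
    """Pick the sample whose early state is closest, on average, to all others.

    Assumption: consistency is measured by mean Euclidean distance between
    early hidden-state vectors (hypothetical stand-ins for model internals).
    The remaining samples would then be truncated instead of fully generated.
    """
    n = len(early_states)
    if n == 0:
        raise ValueError("need at least one sample")

    def dist(a: List[float], b: List[float]) -> float:
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    mean_dist = [
        sum(dist(early_states[i], early_states[j]) for j in range(n) if j != i)
        / max(n - 1, 1)
        for i in range(n)
    ]
    return min(range(n), key=mean_dist.__getitem__)
```

Only the selected path is decoded to completion in the second variant, which is where the memory and latency savings would come from.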

5.3 · LG · Dec 7, 2023

Jointly spatial-temporal representation learning for individual trajectories

Fei Huang, Jianrong Lv, Yang Yue

Individual trajectories, rich in human-environment interaction information across space and time, serve as vital inputs for geospatial foundation models (GeoFMs). However, existing attempts at learning trajectory representations have overlooked the implicit spatial-temporal dependency within trajectories, failing to encode such dependency in a deep learning-friendly format. This poses a challenge in obtaining general-purpose trajectory representations. Therefore, this paper proposes a spatial-temporal joint representation learning method (ST-GraphRL) to formalize learnable spatial-temporal dependencies into trajectory representations. The proposed ST-GraphRL consists of three components: (i) a weighted directed spatial-temporal graph to explicitly construct mobility interactions in both space and time dimensions; (ii) a two-stage joint encoder (i.e., decoupling and fusion) to learn entangled spatial-temporal dependencies by independently decomposing and jointly aggregating space and time information; (iii) a decoder that guides ST-GraphRL to learn explicit mobility regularities by simulating the spatial-temporal distributions of trajectories. Tested on three real-world human mobility datasets, the proposed ST-GraphRL outperformed all the baseline models in predicting movement spatial-temporal distributions and preserving trajectory similarity with high spatial-temporal correlations. Analyzing spatial-temporal features presented in latent space validates that ST-GraphRL understands spatial-temporal patterns. This study may also benefit representation learning of other geospatial data to achieve general-purpose data representations and advance GeoFM development.

13.9 · CL · Jan 21, 2025

Debate Helps Weak-to-Strong Generalization

Hao Lang, Fei Huang, Yongbin Li

Common methods for aligning already-capable models with desired behavior rely on the ability of humans to provide supervision. However, future superhuman models will surpass the capability of humans. Therefore, humans will only be able to weakly supervise superhuman models. This expected deficiency of human evaluation would weaken the safety of future AI systems. Scalable oversight and weak-to-strong generalization are two complementary approaches to tackle this issue. In this paper, we attempt to combine the strengths of these two approaches to further improve alignment. Specifically, we investigate ways of improving human supervision with a strong pretrained model and then supervise the strong model with enhanced weak human supervision. To make iterative empirical progress, we consider an analogy: can we use a strong model to improve weak model supervision and then use it to supervise the strong model? We empirically test it by finetuning a small weak model on ground truth labels with the additional help from a large strong model, and then finetuning the strong model on labels generated by the weak model. We find that debate can assist a weak model in extracting trustworthy information from an untrustworthy strong model, which provides leverage as context on samples when training a weak model. We also show that an ensemble of weak models helps exploit long arguments generated by strong model debaters and obtain a more robust supervision estimate. Extensive experiments on the OpenAI weak-to-strong NLP benchmarks show that the combination approach leads to better alignment, which indicates that debate has the potential to help weak-to-strong generalization.

5.1 · CY · Jan 22, 2025

FishBargain: An LLM-Empowered Bargaining Agent for Online Fleamarket Platform Sellers

Dexin Kong, Xu Yan, Ming Chen et al.

Unlike traditional Business-to-Consumer e-commerce platforms (e.g., Amazon), online fleamarket platforms (e.g., Craigslist) mainly serve individual sellers who lack both time to invest and business proficiency. Individual sellers often struggle with the bargaining process, and as a result deals fall through. Recent advancements in Large Language Models (LLMs) demonstrate huge potential in various dialogue tasks, but those tasks mainly take the form of passively following a user's instructions. Bargaining, as a form of proactive dialogue task, represents a distinct art of dialogue given the dynamism of the environment and the uncertainty of adversary strategies. In this paper, we propose an LLM-empowered bargaining agent designed for online fleamarket platform sellers, named FishBargain. Specifically, FishBargain understands the chat context and product information, chooses both an action and a language skill while considering possible adversary actions, and generates utterances. FishBargain has been tested by thousands of individual sellers on one of the largest online fleamarket platforms in China (Xianyu). Both qualitative and quantitative experiments demonstrate that FishBargain can effectively help sellers make more deals.

4.5 · ML · May 24, 2025

Marginal Fairness: Fair Decision-Making under Risk Measures

Fei Huang, Silvana M. Pesenti

This paper introduces marginal fairness, a new individual fairness notion for equitable decision-making in the presence of protected attributes such as gender, race, and religion. This criterion ensures that decisions based on generalized distortion risk measures are insensitive to distributional perturbations in protected attributes, regardless of whether these attributes are continuous, discrete, categorical, univariate, or multivariate. To operationalize this notion and reflect real-world regulatory environments (such as the EU gender-neutral pricing regulation), we model business decision-making in highly regulated industries (such as insurance and finance) as a two-step process: (i) a predictive modeling stage, in which a prediction function for the target variable (e.g., insurance losses) is estimated based on both protected and non-protected covariates; and (ii) a decision-making stage, in which a generalized distortion risk measure is applied to the target variable, conditional only on non-protected covariates, to determine the decision. In this second step, we modify the risk measure such that the decision becomes insensitive to the protected attribute, thus enforcing fairness to ensure equitable outcomes under risk-sensitive, regulatory constraints. Furthermore, by utilizing the concept of cascade sensitivity, we extend the marginal fairness framework to capture how dependencies between covariates propagate the influence of protected attributes through the modeling pipeline. A numerical study and an empirical implementation using an auto insurance dataset demonstrate how the framework can be applied in practice.
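For readers unfamiliar with distortion risk measures, here is a minimal sketch of the standard empirical version: sort the losses and weight each order statistic by increments of a distortion function g applied to the empirical survival probabilities. This shows only the textbook building block. It does not implement the paper's generalized measure or its marginal-fairness modification, and the function names are hypothetical.

```python
from typing import Callable, Sequence


def distortion_risk(losses: Sequence[float], g: Callable[[float], float]) -> float:
    """Empirical distortion risk measure for a sample of losses.

    Uses rho_g = sum_i x_(i) * [g((n-i+1)/n) - g((n-i)/n)], where x_(i) are
    the losses in ascending order and g is non-decreasing with g(0)=0, g(1)=1.
    """
    xs = sorted(losses)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one loss")
    total = 0.0
    for i, x in enumerate(xs, start=1):
        weight = g((n - i + 1) / n) - g((n - i) / n)
        total += x * weight
    return total


def expected_shortfall_distortion(alpha: float) -> Callable[[float], float]:
    """Distortion function that yields Expected Shortfall at level alpha."""
    def g(u: float) -> float:
        return min(u / (1.0 - alpha), 1.0)
    return g
```

With the identity distortion `g(u) = u` the measure reduces to the sample mean. With the Expected Shortfall distortion at `alpha = 0.5` it averages the worst half of the losses.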

2.7 · CL · Nov 18, 2025

Selective Weak-to-Strong Generalization

Hao Lang, Fei Huang, Yongbin Li

Future superhuman models will surpass the ability of humans, and humans will only be able to weakly supervise superhuman models. To alleviate the issue of lacking high-quality data for model alignment, some works on weak-to-strong generalization (W2SG) finetune a strong pretrained model with a weak supervisor so that it can generalize beyond weak supervision. However, the invariable use of weak supervision in existing methods exposes issues in robustness, with a proportion of weak labels proving harmful to models. In this paper, we propose a selective W2SG framework to avoid using weak supervision when unnecessary. We train a binary classifier P(IK) to identify questions that a strong model can answer and use its self-generated labels for alignment. We further refine weak labels with a graph smoothing method. Extensive experiments on three benchmarks show that our method consistently outperforms competitive baselines. Further analyses show that P(IK) can generalize across tasks and difficulties, which indicates selective W2SG can help superalignment.
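The selection step can be sketched as a per-question switch between two label sources. The 0.5 threshold, the function name, and the list-based interface are assumptions for illustration. The abstract does not state how the P(IK) output is turned into a decision, and the graph smoothing of weak labels is omitted here.

```python
from typing import List, Sequence, TypeVar

Label = TypeVar("Label")


def select_labels(p_ik: Sequence[float],
                  self_labels: Sequence[Label],
                  weak_labels: Sequence[Label],
                  threshold: float = 0.5) -> List[Label]:
    """Choose a training label per question.

    If the classifier's probability that the strong model knows the answer
    is at least `threshold` (a hypothetical cutoff), use the strong model's
    self-generated label. Otherwise fall back to the weak supervisor's label.
    """
    if not (len(p_ik) == len(self_labels) == len(weak_labels)):
        raise ValueError("all inputs must have the same length")
    return [s if p >= threshold else w
            for p, s, w in zip(p_ik, self_labels, weak_labels)]
```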

6.7 · CL · Jul 16, 2025

Translationese-index: Using Likelihood Ratios for Graded and Generalizable Measurement of Translationese

Yikang Liu, Wanyang Zhang, Yiming Wang et al.

Translationese refers to linguistic properties that usually occur in translated texts. Previous works study translationese by framing it as a binary classification between original texts and translated texts. In this paper, we argue that translationese should be graded instead of binary and propose the first measure for translationese -- the translationese-index (T-index), computed from the likelihood ratios of two contrastively fine-tuned language models (LMs). We use synthesized translations and translations in the wild to evaluate T-index's generalizability in cross-domain settings and its validity against human judgments. Our results show that T-index can generalize to unseen genres, authors, and language pairs. Moreover, T-index computed using two 0.5B LMs fine-tuned on only 1-5k pairs of synthetic data can effectively capture translationese, as demonstrated by alignment with human pointwise ratings and pairwise judgments. Additionally, the correlation between T-index and existing machine translation (MT) quality estimation (QE) metrics such as BLEU and COMET is low, suggesting that T-index is not covered by these metrics and can serve as a complementary metric in MT QE.
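A likelihood-ratio score of this kind can be sketched from per-token log-probabilities under the two contrastively fine-tuned LMs. The length normalization, the sign convention (higher means more translation-like), and the function name are assumptions. The abstract says only that the T-index is computed from likelihood ratios.

```python
from typing import Sequence


def t_index(logp_translated_lm: Sequence[float],
            logp_original_lm: Sequence[float]) -> float:
    """Mean per-token log-likelihood ratio between two language models.

    Inputs are the token log-probabilities of the same text under an LM
    fine-tuned on translated text and an LM fine-tuned on original text.
    Assumption: both LMs share a tokenizer, so the sequences align.
    """
    if len(logp_translated_lm) != len(logp_original_lm):
        raise ValueError("token sequences must have the same length")
    if not logp_translated_lm:
        raise ValueError("need at least one token")
    ratio = sum(logp_translated_lm) - sum(logp_original_lm)
    return ratio / len(logp_translated_lm)
```

Because the output is a continuous value and not a class label, texts can be ranked by degree of translationese, which matches the graded framing argued for in the abstract.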

18.9 · CL · Sep 4, 2023 · Code

Geo-Encoder: A Chunk-Argument Bi-Encoder Framework for Chinese Geographic Re-Ranking

Yong Cao, Ruixue Ding, Boli Chen et al.

The Chinese geographic re-ranking task aims to find the most relevant addresses among retrieved candidates, which is crucial for location-related services such as navigation maps. Unlike general sentences, geographic contexts are closely intertwined with geographical concepts, from general spans (e.g., province) to specific spans (e.g., road). Given this feature, we propose an innovative framework, namely Geo-Encoder, to more effectively integrate Chinese geographical semantics into re-ranking pipelines. Our methodology begins by employing off-the-shelf tools to associate text with geographical spans, treating them as chunking units. Then, we present a multi-task learning module to simultaneously acquire an effective attention matrix that determines chunk contributions to extra semantic representations. Furthermore, we put forth an asynchronous update mechanism for the proposed addition task, aiming to guide the model to focus effectively on specific chunks. Experiments on two distinct Chinese geographic re-ranking datasets show that the Geo-Encoder achieves significant improvements when compared to state-of-the-art baselines. Notably, it leads to a substantial improvement in the Hit@1 score of MGEO-BERT, increasing it by 6.22 points, from 62.76 to 68.98, on the GeoTES dataset.
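As a quick illustration of the metric, the sketch below computes Hit@k for ranked candidate lists and separates absolute from relative gain. The function names and the toy interface are hypothetical. Applied to the reported Hit@1 scores, the change from 62.76 to 68.98 is a 6.22-point absolute gain, or roughly 9.9% in relative terms.

```python
from typing import List, Sequence, Tuple


def hit_at_k(ranked_candidates: Sequence[List[str]],
             gold: Sequence[str], k: int = 1) -> float:
    """Percentage of queries whose gold address is among the top-k candidates."""
    if len(ranked_candidates) != len(gold) or not gold:
        raise ValueError("need one non-empty, aligned gold answer per query")
    hits = sum(1 for cands, g in zip(ranked_candidates, gold) if g in cands[:k])
    return 100.0 * hits / len(gold)


def gains(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return absolute, relative
```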

1.0 · LG · Dec 3, 2019

Event Ticket Price Prediction with Deep Neural Network on Spatial-Temporal Sparse Data

Fei Huang, Hao Huang

Event ticket price prediction is important to the marketing strategy of any sports team or musical ensemble. An accurate prediction model can help the marketing team plan promotions more effectively and efficiently. However, given all the historical transaction records, it is challenging to predict the sale price of the remaining seats at any future timestamp, not only because the sale price depends on many features (seat locations, date-to-event of the transaction, event date, team performance, etc.), but also because of the temporal and spatial sparsity in the dataset. For a game or concert, the selling price of a seat is observable only once, at the time of sale. Furthermore, some seats may never be purchased, so no record is available for them. In fact, data sparsity is commonly encountered in many prediction problems. Here, we propose a bi-level optimizing deep neural network to address the curse of spatio-temporal sparsity. Specifically, we introduce coarsening and refining layers, and design a bi-level loss function that integrates different levels of loss for better prediction accuracy. Our model can discover the interrelations among ticket sale price, seat locations, selling time, event information, etc. Experiments show that our proposed model outperforms other benchmark methods in real-world ticket selling price prediction.