Chuan-Ju Wang

IR
h-index11
14papers
255citations
Novelty39%
AI Score48

14 Papers

CLNov 13, 2025
FinNuE: Exposing the Risks of Using BERTScore for Numerical Semantic Evaluation in Finance

Yu-Shiang Huang, Yun-Yu Lee, Tzu-Hsin Chou et al.

BERTScore has become a widely adopted metric for evaluating semantic similarity between natural language sentences. However, we identify a critical limitation: BERTScore exhibits low sensitivity to numerical variation, a significant weakness in finance where numerical precision directly affects meaning (e.g., distinguishing a 2% gain from a 20% loss). We introduce FinNuE, a diagnostic dataset constructed with controlled numerical perturbations across earnings calls, regulatory filings, social media, and news articles. Using FinNuE, demonstrate that BERTScore fails to distinguish semantically critical numerical differences, often assigning high similarity scores to financially divergent text pairs. Our findings reveal fundamental limitations of embedding-based metrics for finance and motivate numerically-aware evaluation frameworks for financial NLP.

CLSep 23, 2025
Financial Risk Relation Identification through Dual-view Adaptation

Wei-Ning Chiu, Yu-Hsiang Wang, Andy Hsiao et al.

A multitude of interconnected risk events -- ranging from regulatory changes to geopolitical tensions -- can trigger ripple effects across firms. Identifying inter-firm risk relations is thus crucial for applications like portfolio management and investment strategy. Traditionally, such assessments rely on expert judgment and manual analysis, which are, however, subjective, labor-intensive, and difficult to scale. To address this, we propose a systematic method for extracting inter-firm risk relations using Form 10-K filings -- authoritative, standardized financial documents -- as our data source. Leveraging recent advances in natural language processing, our approach captures implicit and abstract risk connections through unsupervised fine-tuning based on chronological and lexical patterns in the filings. This enables the development of a domain-specific financial encoder with a deeper contextual understanding and introduces a quantitative risk relation score for transparency, interpretable analysis. Extensive experiments demonstrate that our method outperforms strong baselines across multiple evaluation settings.

IRSep 19, 2025
CFDA & CLIP at TREC iKAT 2025: Enhancing Personalized Conversational Search via Query Reformulation and Rank Fusion

Yu-Cheng Chang, Guan-Wei Yeo, Quah Eugene et al.

The 2025 TREC Interactive Knowledge Assistance Track (iKAT) featured both interactive and offline submission tasks. The former requires systems to operate under real-time constraints, making robustness and efficiency as important as accuracy, while the latter enables controlled evaluation of passage ranking and response generation with pre-defined datasets. To address this, we explored query rewriting and retrieval fusion as core strategies. We built our pipelines around Best-of-$N$ selection and Reciprocal Rank Fusion (RRF) strategies to handle different submission tasks. Results show that reranking and fusion improve robustness while revealing trade-offs between effectiveness and efficiency across both tasks.

IRAug 25, 2025
Test-Time Scaling Strategies for Generative Retrieval in Multimodal Conversational Recommendations

Hung-Chun Hsu, Yuan-Ching Kuo, Chao-Han Huck Yang et al.

The rapid evolution of e-commerce has exposed the limitations of traditional product retrieval systems in managing complex, multi-turn user interactions. Recent advances in multimodal generative retrieval -- particularly those leveraging multimodal large language models (MLLMs) as retrievers -- have shown promise. However, most existing methods are tailored to single-turn scenarios and struggle to model the evolving intent and iterative nature of multi-turn dialogues when applied naively. Concurrently, test-time scaling has emerged as a powerful paradigm for improving large language model (LLM) performance through iterative inference-time refinement. Yet, its effectiveness typically relies on two conditions: (1) a well-defined problem space (e.g., mathematical reasoning), and (2) the model's ability to self-correct -- conditions that are rarely met in conversational product search. In this setting, user queries are often ambiguous and evolving, and MLLMs alone have difficulty grounding responses in a fixed product corpus. Motivated by these challenges, we propose a novel framework that introduces test-time scaling into conversational multimodal product retrieval. Our approach builds on a generative retriever, further augmented with a test-time reranking (TTR) mechanism that improves retrieval accuracy and better aligns results with evolving user intent throughout the dialogue. Experiments across multiple benchmarks show consistent improvements, with average gains of 14.5 points in MRR and 10.6 points in nDCG@1.

CLJul 2, 2025
Decision-Oriented Text Evaluation

Yu-Shiang Huang, Chuan-Ju Wang, Chung-Chi Chen

Natural language generation (NLG) is increasingly deployed in high-stakes domains, yet common intrinsic evaluation methods, such as n-gram overlap or sentence plausibility, weakly correlate with actual decision-making efficacy. We propose a decision-oriented framework for evaluating generated text by directly measuring its influence on human and large language model (LLM) decision outcomes. Using market digest texts--including objective morning summaries and subjective closing-bell analyses--as test cases, we assess decision quality based on the financial performance of trades executed by human investors and autonomous LLM agents informed exclusively by these texts. Our findings reveal that neither humans nor LLM agents consistently surpass random performance when relying solely on summaries. However, richer analytical commentaries enable collaborative human-LLM teams to outperform individual human or agent baselines significantly. Our approach underscores the importance of evaluating generated text by its ability to facilitate synergistic decision-making between humans and LLMs, highlighting critical limitations of traditional intrinsic metrics.

IRApr 29, 2021
Text-to-Text Multi-view Learning for Passage Re-ranking

Jia-Huei Ju, Jheng-Hong Yang, Chuan-Ju Wang

Recently, much progress in natural language processing has been driven by deep contextualized representations pretrained on large corpora. Typically, the fine-tuning on these pretrained models for a specific downstream task is based on single-view learning, which is however inadequate as a sentence can be interpreted differently from different perspectives. Therefore, in this work, we propose a text-to-text multi-view learning framework by incorporating an additional view -- the text generation view -- into a typical single-view passage ranking model. Empirically, the proposed approach is of help to the ranking performance compared to its single-view counterpart. Ablation studies are also reported in the paper.

IRAug 30, 2020
Personalized TV Recommendation: Fusing User Behavior and Preferences

Sheng-Chieh Lin, Ting-Wei Lin, Jing-Kai Lou et al.

In this paper, we propose a two-stage ranking approach for recommending linear TV programs. The proposed approach first leverages user viewing patterns regarding time and TV channels to identify potential candidates for recommendation and then further leverages user preferences to rank these candidates given textual information about programs. To evaluate the method, we conduct empirical studies on a real-world TV dataset, the results of which demonstrate the superior performance of our model in terms of both recommendation accuracy and time efficiency.

IRMay 23, 2020
Skewness Ranking Optimization for Personalized Recommendation

Chuan-Ju Wang, Yu-Neng Chuang, Chih-Ming Chen et al.

In this paper, we propose a novel optimization criterion that leverages features of the skew normal distribution to better model the problem of personalized recommendation. Specifically, the developed criterion borrows the concept and the flexibility of the skew normal distribution, based on which three hyperparameters are attached to the optimization criterion. Furthermore, from a theoretical point of view, we not only establish the relation between the maximization of the proposed criterion and the shape parameter in the skew normal distribution, but also provide the analogies and asymptotic analysis of the proposed criterion to maximization of the area under the ROC curve. Experimental results conducted on a range of large-scale real-world datasets show that our model significantly outperforms the state of the art and yields consistently best performance on all tested datasets.

CLMay 5, 2020
Multi-Stage Conversational Passage Retrieval: An Approach to Fusing Term Importance Estimation and Neural Query Rewriting

Sheng-Chieh Lin, Jheng-Hong Yang, Rodrigo Nogueira et al.

Conversational search plays a vital role in conversational information seeking. As queries in information seeking dialogues are ambiguous for traditional ad-hoc information retrieval (IR) systems due to the coreference and omission resolution problems inherent in natural language dialogue, resolving these ambiguities is crucial. In this paper, we tackle conversational passage retrieval (ConvPR), an important component of conversational search, by addressing query ambiguities with query reformulation integrated into a multi-stage ad-hoc IR system. Specifically, we propose two conversational query reformulation (CQR) methods: (1) term importance estimation and (2) neural query rewriting. For the former, we expand conversational queries using important terms extracted from the conversational context with frequency-based signals. For the latter, we reformulate conversational queries into natural, standalone, human-understandable queries with a pretrained sequence-tosequence model. Detailed analyses of the two CQR methods are provided quantitatively and qualitatively, explaining their advantages, disadvantages, and distinct behaviors. Moreover, to leverage the strengths of both CQR methods, we propose combining their output with reciprocal rank fusion, yielding state-of-the-art retrieval effectiveness, 30% improvement in terms of NDCG@3 compared to the best submission of TREC CAsT 2019.

CLApr 4, 2020
Conversational Question Reformulation via Sequence-to-Sequence Architectures and Pretrained Language Models

Sheng-Chieh Lin, Jheng-Hong Yang, Rodrigo Nogueira et al.

This paper presents an empirical study of conversational question reformulation (CQR) with sequence-to-sequence architectures and pretrained language models (PLMs). We leverage PLMs to address the strong token-to-token independence assumption made in the common objective, maximum likelihood estimation, for the CQR task. In CQR benchmarks of task-oriented dialogue systems, we evaluate fine-tuned PLMs on the recently-introduced CANARD dataset as an in-domain task and validate the models using data from the TREC 2019 CAsT Track as an out-domain task. Examining a variety of architectures with different numbers of parameters, we demonstrate that the recent text-to-text transfer transformer (T5) achieves the best results both on CANARD and CAsT with fewer parameters, compared to similar transformer architectures.

CLMar 18, 2020
TTTTTackling WinoGrande Schemas

Sheng-Chieh Lin, Jheng-Hong Yang, Rodrigo Nogueira et al.

We applied the T5 sequence-to-sequence model to tackle the AI2 WinoGrande Challenge by decomposing each example into two input text strings, each containing a hypothesis, and using the probabilities assigned to the "entailment" token as a score of the hypothesis. Our first (and only) submission to the official leaderboard yielded 0.7673 AUC on March 13, 2020, which is the best known result at this time and beats the previous state of the art by over five points.

IRFeb 17, 2019
Collaborative Similarity Embedding for Recommender Systems

Chih-Ming Chen, Chuan-Ju Wang, Ming-Feng Tsai et al.

We present collaborative similarity embedding (CSE), a unified framework that exploits comprehensive collaborative relations available in a user-item bipartite graph for representation learning and recommendation. In the proposed framework, we differentiate two types of proximity relations: direct proximity and k-th order neighborhood proximity. While learning from the former exploits direct user-item associations observable from the graph, learning from the latter makes use of implicit associations such as user-user similarities and item-item similarities, which can provide valuable information especially when the graph is sparse. Moreover, for improving scalability and flexibility, we propose a sampling technique that is specifically designed to capture the two types of proximity relations. Extensive experiments on eight benchmark datasets show that CSE yields significantly better performance than state-of-the-art recommendation methods.

IRAug 28, 2018
Superhighway: Bypass Data Sparsity in Cross-Domain CF

Kwei-Herng Lai, Ting-Hsiang Wang, Heng-Yu Chi et al.

Cross-domain collaborative filtering (CF) aims to alleviate data sparsity in single-domain CF by leveraging knowledge transferred from related domains. Many traditional methods focus on enriching compared neighborhood relations in CF directly to address the sparsity problem. In this paper, we propose superhighway construction, an alternative explicit relation-enrichment procedure, to improve recommendations by enhancing cross-domain connectivity. Specifically, assuming partially overlapped items (users), superhighway bypasses multi-hop inter-domain paths between cross-domain users (items, respectively) with direct paths to enrich the cross-domain connectivity. The experiments conducted on a real-world cross-region music dataset and a cross-platform movie dataset show that the proposed superhighway construction significantly improves recommendation performance in both target and source domains.

MMAug 28, 2018
Representation Learning for Image-based Music Recommendation

Chih-Chun Hsia, Kwei-Herng Lai, Yian Chen et al.

Image perception is one of the most direct ways to provide contextual information about a user concerning his/her surrounding environment; hence images are a suitable proxy for contextual recommendation. We propose a novel representation learning framework for image-based music recommendation that bridges the heterogeneity gap between music and image data; the proposed method is a key component for various contextual recommendation tasks. Preliminary experiments show that for an image-to-song retrieval task, the proposed method retrieves relevant or conceptually similar songs for input images.