CVDec 16, 2022
Biomedical image analysis competitions: The state of current participation practiceMatthias Eisenmann, Annika Reinke, Vivienn Weru et al. · utoronto
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%). 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants and only 50% of the participants performed ensembling based on multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
CLNov 15, 2023
Multistage Collaborative Knowledge Distillation from a Large Language Model for Semi-Supervised Sequence GenerationJiachen Zhao, Wenlong Zhao, Andrew Drozdov et al.
We study semi-supervised sequence generation tasks, where the few labeled examples are too scarce to finetune a model, and meanwhile, few-shot prompted large language models (LLMs) exhibit room for improvement. In this paper, we present the discovery that a student model distilled from a few-shot prompted LLM can commonly generalize better than its teacher to unseen examples on such tasks. We find that the student is able to learn a general pattern from the high-quality pseudolabels produced by the teacher during knowledge distillation (KD), and favorably not a general pattern from the low-quality pseudolables. Leveraging this discovery, we propose a new method, Multistage Collaborative Knowledge Distillation from an LLM (MCKD), for these tasks. MCKD first few-shot prompts an LLM to produce pseudolabels for unlabeled data. Then at each stage of an iterative KD process, a new pair of students is trained on disjoint partitions of the pseudolabeled data, and produces new and improved pseudolabels for their unseen partitions. We conduct extensive experiments on four syntactic and semantic parsing datasets and show the effectiveness of MCKD for low-resource semi-supervised sequence generation. On CRAFT biomedical parsing, for example, 3-stage MCKD with 50 labeled examples outperforms an LLM teacher and vanilla KD by 7.5% and 3.7% parsing F1, respectively, and matches the performance of supervised finetuning with 500 labeled examples.
76.0AIMay 26
Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of RefusalKia-Jüng Yang, Dominik Meier, Jiachen Zhao et al.
Large reasoning models (LRMs) generate chain-of-thought (CoT) traces before producing final outputs, introducing a dynamic internal state that may complicate control mechanisms such as refusal. Unlike instruction-tuned LLMs, where refusal is mediated by a single directional subspace, refusal in large reasoning models (LRMs) additionally depends on the CoT. In DeepSeek-R1-Distill-LLaMA-8B, activation steering reverses refusal in only 39% of cases when the CoT is kept fixed, but removing the CoT entirely increases this to 70%, indicating that the CoT actively reinforces refusal. In a two-stage intervention where the model regenerates its CoT under activation steering, refusal is reversed in 94% of cases, while the resulting CoT alone retains 48% of this effect even after steering is removed. This suggests that the CoT can carry and reconstruct the compliance signal independently. These findings indicate that refusal in LRMs is jointly encoded in residual stream activations and CoT. This joint activation makes LRM more robust against activation-level interventions alone, but exposes CoT to a possible alternative surface attack.
CLNov 12, 2023
Large Language Models are In-context Teachers for Knowledge ReasoningJiachen Zhao, Zonghai Yao, Zhichao Yang et al.
In this work, we study in-context teaching (ICT), where a teacher provides in-context example rationales to teach a student to reason over unseen cases. Human teachers are usually required to craft in-context demonstrations, which are costly and have high variance. We ask whether a large language model (LLM) can serve as a more effective in-context teacher for itself or other LLMs, compared to humans. Inspired by the Encoding Specificity Hypothesis from human episodic memory, we hypothesize that in-context exemplars crafted by the teacher should match the training data of the student. This hypothesis motivates us to propose Self-Explain where an LLM's self-elicited explanations are used as in-context demonstrations for prompting it as they are generalized from the model's training examples. Self-Explain is shown to significantly outperform using human-crafted exemplars and other baselines. Furthermore, we reveal that for ICT, rationales from different teacher LLMs or human experts that more resemble the student LLM's self-explanations are better in-context demonstrations. This supports our encoding specificity hypothesis. We then propose Teach-Back that aligns a teacher LLM with the student to enhance the ICT performance. For example, Teach-Back enables a 7B model to teach the much larger GPT-3.5 in context, surpassing human teachers by around 5% in test accuracy on medical question answering.
CLNov 6, 2023
In-Context Exemplars as Clues to Retrieving from Large Associative MemoryJiachen Zhao
Recently, large language models (LLMs) have made remarkable progress in natural language processing. The most representative ability of LLMs is in-context learning (ICL), which enables LLMs to learn patterns from in-context exemplars without training. The performance of ICL greatly depends on the exemplars used. However, how to choose exemplars remains unclear due to the lack of understanding of how in-context learning works. In this paper, we present a novel perspective on ICL by conceptualizing it as contextual retrieval from a model of associative memory. We establish a theoretical framework of ICL based on Hopfield Networks. Based on our framework, we look into how in-context exemplars influence the performance of ICL and propose more efficient active exemplar selection. Our study sheds new light on the mechanism of ICL by connecting it to memory retrieval, with potential implications for advancing the understanding of LLMs.
CLAug 20, 2022
Trigger-free Event Detection via Derangement Reading ComprehensionJiachen Zhao, Haiqin Yang
Event detection (ED), aiming to detect events from texts and categorize them, is vital to understanding actual happenings in real life. However, mainstream event detection models require high-quality expert human annotations of triggers, which are often costly and thus deter the application of ED to new domains. Therefore, in this paper, we focus on low-resource ED without triggers and aim to tackle the following formidable challenges: multi-label classification, insufficient clues, and imbalanced events distribution. We propose a novel trigger-free ED method via Derangement mechanism on a machine Reading Comprehension (DRC) framework. More specifically, we treat the input text as Context and concatenate it with all event type tokens that are deemed as Answers with an omitted default question. So we can leverage the self-attention in pre-trained language models to absorb semantic relations between input text and the event types. Moreover, we design a simple yet effective event derangement module (EDM) to prevent major events from being excessively learned so as to yield a more balanced training process. The experiment results show that our proposed trigger-free ED model is remarkably competitive to mainstream trigger-based models, showing its strong performance on low-source event detection.
89.8CLMar 27
PR-CAD: Progressive Refinement for Unified Controllable and Faithful Text-to-CAD Generation with Large Language ModelsJiyuan An, Jiachen Zhao, Fan Chen et al.
The construction of CAD models has traditionally relied on labor-intensive manual operations and specialized expertise. Recent advances in large language models (LLMs) have inspired research into text-to-CAD generation. However, existing approaches typically treat generation and editing as disjoint tasks, limiting their practicality. We propose PR-CAD, a progressive refinement framework that unifies generation and editing for controllable and faithful text-to-CAD modeling. To support this, we curate a high-fidelity interaction dataset spanning the full CAD lifecycle, encompassing multiple CAD representations as well as both qualitative and quantitative descriptions. The dataset systematically defines the types of edit operations and generates highly human-like interaction data. Building on a CAD representation tailored for LLMs, we propose a reinforcement learning-enhanced reasoning framework that integrates intent understanding, parameter estimation, and precise edit localization into a single agent. This enables an "all-in-one" solution for both design creation and refinement. Extensive experiments demonstrate strong mutual reinforcement between generation and editing tasks, and across qualitative and quantitative modalities. On public benchmarks, PR-CAD achieves state-of-the-art controllability and faithfulness in both generation and refinement scenarios, while also proving user-friendly and significantly improving CAD modeling efficiency.
CLDec 20, 2023
Learning and Forgetting Unsafe Examples in Large Language ModelsJiachen Zhao, Zhun Deng, David Madras et al.
As the number of large language models (LLMs) released to the public grows, there is a pressing need to understand the safety implications associated with these models learning from third-party custom finetuning data. We explore the behavior of LLMs finetuned on noisy custom data containing unsafe content, represented by datasets that contain biases, toxicity, and harmfulness, finding that while aligned LLMs can readily learn this unsafe content, they also tend to forget it more significantly than other examples when subsequently finetuned on safer content. Drawing inspiration from the discrepancies in forgetting, we introduce the "ForgetFilter" algorithm, which filters unsafe data based on how strong the model's forgetting signal is for that data. We demonstrate that the ForgetFilter algorithm ensures safety in customized finetuning without compromising downstream task performance, unlike sequential safety finetuning. ForgetFilter outperforms alternative strategies like replay and moral self-correction in curbing LLMs' ability to assimilate unsafe content during custom finetuning, e.g. 75% lower than not applying any safety measures and 62% lower than using self-correction in toxicity score.
CLJul 16, 2025
LLMs Encode Harmfulness and Refusal SeparatelyJiachen Zhao, Jing Huang, Zhengxuan Wu et al. · stanford
LLMs are trained to refuse harmful instructions, but do they truly understand harmfulness beyond just refusing? Prior work has shown that LLMs' refusal behaviors can be mediated by a one-dimensional subspace, i.e., a refusal direction. In this work, we identify a new dimension to analyze safety mechanisms in LLMs, i.e., harmfulness, which is encoded internally as a separate concept from refusal. There exists a harmfulness direction that is distinct from the refusal direction. As causal evidence, steering along the harmfulness direction can lead LLMs to interpret harmless instructions as harmful, but steering along the refusal direction tends to elicit refusal responses directly without reversing the model's judgment on harmfulness. Furthermore, using our identified harmfulness concept, we find that certain jailbreak methods work by reducing the refusal signals without reversing the model's internal belief of harmfulness. We also find that adversarially finetuning models to accept harmful instructions has minimal impact on the model's internal belief of harmfulness. These insights lead to a practical safety application: The model's latent harmfulness representation can serve as an intrinsic safeguard (Latent Guard) for detecting unsafe inputs and reducing over-refusals that is robust to finetuning attacks. For instance, our Latent Guard achieves performance comparable to or better than Llama Guard 3 8B, a dedicated finetuned safeguard model, across different jailbreak methods. Our findings suggest that LLMs' internal understanding of harmfulness is more robust than their refusal decision to diverse input instructions, offering a new perspective to study AI safety.
LGOct 28, 2025
Can Aha Moments Be Fake? Identifying True and Decorative Thinking Steps in Chain-of-ThoughtJiachen Zhao, Yiyou Sun, Weiyan Shi et al.
Recent large language models (LLMs) can generate long Chain-of-Thought (CoT) at test time, enabling them to solve complex tasks. These reasoning steps in CoT are often assumed as a faithful reflection of the model's internal thinking process, and used to monitor unsafe intentions. However, we find many reasoning steps don't truly contribute to LLMs' prediction. We measure the step-wise causal influence of each reasoning step on the model's final prediction with a proposed True Thinking Score (TTS). We reveal that LLMs often interleave between true-thinking steps (which are genuinely used to produce the final output) and decorative-thinking steps (which only give the appearance of reasoning but have minimal causal impact). Notably, only a small subset of the total reasoning steps have a high TTS that causally drive the model's prediction: e.g., for the AIME dataset, only an average of 2.3% of reasoning steps in CoT have a TTS >= 0.7 (range: 0-1) under the Qwen-2.5 model. Furthermore, we identify a TrueThinking direction in the latent space of LLMs. By steering along or against this direction, we can force the model to perform or disregard certain CoT steps when computing the final result. Finally, we highlight that self-verification steps in CoT (i.e., aha moments) can also be decorative, where LLMs do not truly verify their solution. Steering along the TrueThinking direction can force internal reasoning over these steps, resulting in a change in the final results. Overall, our work reveals that LLMs often verbalize reasoning steps without actually performing them internally, which undermines both the efficiency of LLM reasoning and the trustworthiness of CoT.
LGAug 19, 2025
LatentFlow: Cross-Frequency Experimental Flow Reconstruction from Sparse Pressure via Latent MappingJunle Liu, Chang Liu, Yanyu Ke et al.
Acquiring temporally high-frequency and spatially high-resolution turbulent wake flow fields in particle image velocimetry (PIV) experiments remains a significant challenge due to hardware limitations and measurement noise. In contrast, temporal high-frequency measurements of spatially sparse wall pressure are more readily accessible in wind tunnel experiments. In this study, we propose a novel cross-modal temporal upscaling framework, LatentFlow, which reconstructs high-frequency (512 Hz) turbulent wake flow fields by fusing synchronized low-frequency (15 Hz) flow field and pressure data during training, and high-frequency wall pressure signals during inference. The first stage involves training a pressure-conditioned $β$-variation autoencoder ($p$C-$β$-VAE) to learn a compact latent representation that captures the intrinsic dynamics of the wake flow. A secondary network maps synchronized low-frequency wall pressure signals into the latent space, enabling reconstruction of the wake flow field solely from sparse wall pressure. Once trained, the model utilizes high-frequency, spatially sparse wall pressure inputs to generate corresponding high-frequency flow fields via the $p$C-$β$-VAE decoder. By decoupling the spatial encoding of flow dynamics from temporal pressure measurements, LatentFlow provides a scalable and robust solution for reconstructing high-frequency turbulent wake flows in data-constrained experimental settings.
IVMar 19, 2024
QUBIQ: Uncertainty Quantification for Biomedical Image Segmentation ChallengeHongwei Bran Li, Fernando Navarro, Ivan Ezhov et al.
Uncertainty in medical image segmentation tasks, especially inter-rater variability, arising from differences in interpretations and annotations by various experts, presents a significant challenge in achieving consistent and reliable image segmentation. This variability not only reflects the inherent complexity and subjective nature of medical image interpretation but also directly impacts the development and evaluation of automated segmentation algorithms. Accurately modeling and quantifying this variability is essential for enhancing the robustness and clinical applicability of these algorithms. We report the set-up and summarize the benchmark results of the Quantification of Uncertainties in Biomedical Image Quantification Challenge (QUBIQ), which was organized in conjunction with International Conferences on Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2020 and 2021. The challenge focuses on the uncertainty quantification of medical image segmentation which considers the omnipresence of inter-rater variability in imaging datasets. The large collection of images with multi-rater annotations features various modalities such as MRI and CT; various organs such as the brain, prostate, kidney, and pancreas; and different image dimensions 2D-vs-3D. A total of 24 teams submitted different solutions to the problem, combining various baseline models, Bayesian neural networks, and ensemble model techniques. The obtained results indicate the importance of the ensemble models, as well as the need for further research to develop efficient 3D methods for uncertainty quantification methods in 3D segmentation tasks.
LGDec 15, 2023
Student as an Inherent Denoiser of Noisy TeacherJiachen Zhao
Knowledge distillation (KD) has been widely employed to transfer knowledge from a large language model (LLM) to a specialized model in low-data regimes through pseudo label learning. However, pseudo labels generated by teacher models are usually noisy and may influence KD performance. This study delves into KD with noisy teachers and uncovers that the student model can already generate more accurate predictions than the teacher labels used to train it during KD, indicating its inherent ability to denoise noisy teacher labels. Motivated by this finding, we propose Peer-Advised KD to improve vanilla KD from noisy teachers. Experiments show that Peer-Advised KD can outperform LLM by approximately 5% with 50 human-labeled data, and even competitive to standard supervised finetuning with 750 human-labeled data.