87.0CLJun 3Code
ReasoningFlow: Discourse Structures for Understanding LLM Reasoning TracesJinu Lee, Shivam Agarwal, Amruta Parulekar et al.
Large reasoning models (LRMs) produce reasoning traces with non-linear structures, such as backtracking and self-correction, that complicate the evaluation and monitoring of the reasoning process. We introduce ReasoningFlow, a framework that captures the discourse structures of LRM reasoning traces into fine-grained directed acyclic graphs (DAGs). We develop and validate our annotation schema through careful manual annotation of 31 traces (2.1k steps), achieving high inter-annotator agreement, then scale to automatic annotation of 1,260 traces (247.7k steps) spanning three tasks (math, science, argumentation) and five models (Qwen2.5-32B-Inst, QwQ-32B, DeepSeek-V3, DeepSeek-R1, GPT-oss-120B). By analyzing ReasoningFlow graphs, we find: (1) LRMs exhibit structurally similar traces, despite being trained from different base models and potentially non-overlapping post-training data. (2) ReasoningFlow reveals diverse fine-grained reasoning behaviors (e.g., local verification, self-reflection, and assumptions) that can be used for better reasoning trace monitorability. (3) In LRMs, most of the erroneous steps are not used to derive final answers. (4) Mechanistic causal dependencies between steps do not reflect the language-level discourse structure. We release the dataset and code in: https://github.com/jinulee-v/reasoningflow.
CLMay 11, 2022
Cryptocurrency Bubble Detection: A New Stock Market Dataset, Financial Task & Hyperbolic ModelsRamit Sawhney, Shivam Agarwal, Vivek Mittal et al.
The rapid spread of information over social media influences quantitative trading and investments. The growing popularity of speculative trading of highly volatile assets such as cryptocurrencies and meme stocks presents a fresh challenge in the financial realm. Investigating such "bubbles" - periods of sudden anomalous behavior of markets are critical in better understanding investor behavior and market dynamics. However, high volatility coupled with massive volumes of chaotic social media texts, especially for underexplored assets like cryptocoins pose a challenge to existing methods. Taking the first step towards NLP for cryptocoins, we present and publicly release CryptoBubbles, a novel multi-span identification task for bubble detection, and a dataset of more than 400 cryptocoins from 9 exchanges over five years spanning over two million tweets. Further, we develop a set of sequence-to-sequence hyperbolic models suited to this multi-span identification task based on the power-law dynamics of cryptocurrencies and user behavior on social media. We further test the effectiveness of our models under zero-shot settings on a test set of Reddit posts pertaining to 29 "meme stocks", which see an increase in trade volume due to social media hype. Through quantitative, qualitative, and zero-shot analyses on Reddit and Twitter spanning cryptocoins and meme-stocks, we show the practical applicability of CryptoBubbles and hyperbolic models.
CLOct 23, 2024Code
Scaling Diffusion Language Models via Adaptation from Autoregressive ModelsShansan Gong, Shivam Agarwal, Yizhe Zhang et al.
Diffusion Language Models (DLMs) have emerged as a promising new paradigm for text generative modeling, potentially addressing limitations of autoregressive (AR) models. However, current DLMs have been studied at a smaller scale compared to their AR counterparts and lack fair comparison on language modeling benchmarks. Additionally, training diffusion models from scratch at scale remains challenging. Given the prevalence of open-source AR language models, we propose adapting these models to build text diffusion models. We demonstrate connections between AR and diffusion modeling objectives and introduce a simple continual pre-training approach for training diffusion models. Through systematic evaluation on language modeling, reasoning, and commonsense benchmarks, we show that we can convert AR models ranging from 127M to 7B parameters (GPT2 and LLaMA) into diffusion models DiffuGPT and DiffuLLaMA, using less than 200B tokens for training. Our experimental results reveal that these models outperform earlier DLMs and are competitive with their AR counterparts. We release a suite of DLMs (127M-355M-7B) capable of generating fluent text, performing in-context learning, filling in the middle without prompt re-ordering, and following instructions https://github.com/HKUNLP/DiffuLLaMA.
LGMay 21, 2025
The Unreasonable Effectiveness of Entropy Minimization in LLM ReasoningShivam Agarwal, Zimin Zhang, Lifan Yuan et al.
Entropy minimization (EM) trains the model to concentrate even more probability mass on its most confident outputs. We show that this simple objective alone, without any labeled data, can substantially improve large language models' (LLMs) performance on challenging math, physics, and coding tasks. We explore three approaches: (1) EM-FT minimizes token-level entropy similarly to instruction finetuning, but on unlabeled outputs drawn from the model; (2) EM-RL: reinforcement learning with negative entropy as the only reward to maximize; (3) EM-INF: inference-time logit adjustment to reduce entropy without any training data or parameter updates. On Qwen-7B, EM-RL, without any labeled data, achieves comparable or better performance than strong RL baselines such as GRPO and RLOO that are trained on 60K labeled examples. Furthermore, EM-INF enables Qwen-32B to match or exceed the performance of proprietary models like GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro on the challenging SciCode benchmark, while being 3x more efficient than self-consistency and sequential refinement. Our findings reveal that many pretrained LLMs possess previously underappreciated reasoning capabilities that can be effectively elicited through entropy minimization alone, without any labeled data or even any parameter updates.
IRMar 7, 2024
Improving Retrieval in Theme-specific Applications using a Corpus Topical TaxonomySeongKu Kang, Shivam Agarwal, Bowen Jin et al.
Document retrieval has greatly benefited from the advancements of large-scale pre-trained language models (PLMs). However, their effectiveness is often limited in theme-specific applications for specialized areas or industries, due to unique terminologies, incomplete contexts of user queries, and specialized search intents. To capture the theme-specific information and improve retrieval, we propose to use a corpus topical taxonomy, which outlines the latent topic structure of the corpus while reflecting user-interested aspects. We introduce ToTER (Topical Taxonomy Enhanced Retrieval) framework, which identifies the central topics of queries and documents with the guidance of the taxonomy, and exploits their topical relatedness to supplement missing contexts. As a plug-and-play framework, ToTER can be flexibly employed to enhance various PLM-based retrievers. Through extensive quantitative, ablative, and exploratory experiments on two real-world datasets, we ascertain the benefits of using topical taxonomy for retrieval in theme-specific applications and demonstrate the effectiveness of ToTER.
CLMay 24, 2023
Text-Augmented Open Knowledge Graph Completion via Pre-Trained Language ModelsPengcheng Jiang, Shivam Agarwal, Bowen Jin et al.
The mission of open knowledge graph (KG) completion is to draw new findings from known facts. Existing works that augment KG completion require either (1) factual triples to enlarge the graph reasoning space or (2) manually designed prompts to extract knowledge from a pre-trained language model (PLM), exhibiting limited performance and requiring expensive efforts from experts. To this end, we propose TAGREAL that automatically generates quality query prompts and retrieves support information from large text corpora to probe knowledge from PLM for KG completion. The results show that TAGREAL achieves state-of-the-art performance on two benchmark datasets. We find that TAGREAL has superb performance even with limited training data, outperforming existing embedding-based, graph-based, and PLM-based methods.
HCSep 20, 2021
Visually Connecting Historical Figures Through Event Knowledge GraphsShahid Latif, Shivam Agarwal, Simon Gottschalk et al.
Knowledge graphs store information about historical figures and their relationships indirectly through shared events. We developed a visualization system, VisKonnect, for analyzing the intertwined lives of historical figures based on the events they participated in. A user's query is parsed for identifying named entities, and related data is retrieved from an event knowledge graph. While a short textual answer to the query is generated using the GPT-3 language model, various linked visualizations provide context, display additional information related to the query, and allow exploration.
AIMar 30, 2021
Flatland Competition 2020: MAPF and MARL for Efficient Train Coordination on a Grid WorldFlorian Laurent, Manuel Schneider, Christian Scheller et al.
The Flatland competition aimed at finding novel approaches to solve the vehicle re-scheduling problem (VRSP). The VRSP is concerned with scheduling trips in traffic networks and the re-scheduling of vehicles when disruptions occur, for example the breakdown of a vehicle. While solving the VRSP in various settings has been an active area in operations research (OR) for decades, the ever-growing complexity of modern railway networks makes dynamic real-time scheduling of traffic virtually impossible. Recently, multi-agent reinforcement learning (MARL) has successfully tackled challenging tasks where many agents need to be coordinated, such as multiplayer video games. However, the coordination of hundreds of agents in a real-life setting like a railway network remains challenging and the Flatland environment used for the competition models these real-world properties in a simplified manner. Submissions had to bring as many trains (agents) to their target stations in as little time as possible. While the best submissions were in the OR category, participants found many promising MARL approaches. Using both centralized and decentralized learning based approaches, top submissions used graph representations of the environment to construct tree-based observations. Further, different coordination mechanisms were implemented, such as communication and prioritization between agents. This paper presents the competition setup, four outstanding solutions to the competition, and a cross-comparison between them.
CRJan 28, 2021
S++: A Fast and Deployable Secure-Computation Framework for Privacy-Preserving Neural Network TrainingPrashanthi Ramachandran, Shivam Agarwal, Arup Mondal et al.
We introduce S++, a simple, robust, and deployable framework for training a neural network (NN) using private data from multiple sources, using secret-shared secure function evaluation. In short, consider a virtual third party to whom every data-holder sends their inputs, and which computes the neural network: in our case, this virtual third party is actually a set of servers which individually learn nothing, even with a malicious (but non-colluding) adversary. Previous work in this area has been limited to just one specific activation function: ReLU, rendering the approach impractical for many use-cases. For the first time, we provide fast and verifiable protocols for all common activation functions and optimize them for running in a secret-shared manner. The ability to quickly, verifiably, and robustly compute exponentiation, softmax, sigmoid, etc., allows us to use previously written NNs without modification, vastly reducing developer effort and complexity of code. In recent times, ReLU has been found to converge much faster and be more computationally efficient as compared to non-linear functions like sigmoid or tanh. However, we argue that it would be remiss not to extend the mechanism to non-linear functions such as the logistic sigmoid, tanh, and softmax that are fundamental due to their ability to express outputs as probabilities and their universal approximation property. Their contribution in RNNs and a few recent advancements also makes them more relevant.
HCSep 1, 2020
How Visualization PhD Students Cope with Paper RejectionsShivam Agarwal, Shahid Latif, Fabian Beck
We conducted a questionnaire study aimed towards PhD students in the field of visualization research to understand how they cope with paper rejections. We collected responses from 24 participants and performed a qualitative analysis of the data in relation to the provided support by collaborators, resubmission strategies, handling multiple rejects, and personal impression of the reviews. The results indicate that the PhD students in the visualization community generally cope well with the negative reviews and, with experience, learn how to act accordingly to improve and resubmit their work. Our results reveal the main coping strategies that can be applied for constructively handling rejected visualization papers. The most prominent strategies include: discussing reviews with collaborators and making a resubmission plan, doing a major revision to improve the work, shortening the work, and seeing rejection as a positive learning experience.
MMMar 30, 2020
Deep Residual Neural Networks for Image in Speech SteganographyShivam Agarwal, Siddarth Venkatraman
Steganography is the art of hiding a secret message inside a publicly visible carrier message. Ideally, it is done without modifying the carrier, and with minimal loss of information in the secret message. Recently, various deep learning based approaches to steganography have been applied to different message types. We propose a deep learning based technique to hide a source RGB image message inside finite length speech segments without perceptual loss. To achieve this, we train three neural networks; an encoding network to hide the message in the carrier, a decoding network to reconstruct the message from the carrier and an additional image enhancer network to further improve the reconstructed message. We also discuss future improvements to the algorithm proposed.