CL Jun 17, 2022
CookDial: A dataset for task-oriented dialogs grounded in procedural documents
Yiwei Jiang, Klim Zaporojets, Johannes Deleu et al.
This work presents a new dialog dataset, CookDial, that facilitates research on task-oriented dialog systems with procedural knowledge understanding. The corpus contains 260 human-to-human task-oriented dialogs in which an agent, given a recipe document, guides the user to cook a dish. Dialogs in CookDial exhibit two unique features: (i) procedural alignment between the dialog flow and supporting document; (ii) complex agent decision-making that involves segmenting long sentences, paraphrasing hard instructions and resolving coreference in the dialog context. In addition, we identify three challenging (sub)tasks in the assumed task-oriented dialog system: (1) User Question Understanding, (2) Agent Action Frame Prediction, and (3) Agent Response Generation. For each of these tasks, we develop a neural baseline model, which we evaluate on the CookDial dataset. We publicly release the CookDial dataset, comprising rich annotations of both dialogs and recipe documents, to stimulate further research on domain-specific document-grounded dialog systems.
RO Sep 10, 2024
Multimodal Large Language Model Driven Scenario Testing for Autonomous Vehicles
Qiujing Lu, Xuanhan Wang, Yiwei Jiang et al.
The generation of corner cases has become increasingly crucial for efficiently testing autonomous vehicles prior to road deployment. However, existing methods struggle to accommodate diverse testing requirements and often lack the ability to generalize to unseen situations, thereby reducing the convenience and usability of the generated scenarios. A method that facilitates easily controllable scenario generation for efficient autonomous vehicle (AV) testing with realistic and challenging situations is greatly needed. To address this, we propose OmniTester: a multimodal Large Language Model (LLM) based framework that fully leverages the extensive world knowledge and reasoning capabilities of LLMs. OmniTester is designed to generate realistic and diverse scenarios within a simulation environment, offering a robust solution for testing and evaluating AVs. In addition to prompt engineering, we employ tools from Simulation of Urban Mobility (SUMO) to simplify the code generated by LLMs. Furthermore, we incorporate Retrieval-Augmented Generation and a self-improvement mechanism to enhance the LLM's understanding of scenarios, thereby increasing its ability to produce more realistic scenes. In the experiments, we demonstrate the controllability and realism of our approach in generating three types of challenging and complex scenarios. Additionally, we showcase its effectiveness in reconstructing new scenarios described in crash reports, driven by the generalization capability of LLMs.
LG Nov 6, 2025
An MLCommons Scientific Benchmarks Ontology
Ben Hawks, Gregor von Laszewski, Matthew D. Sinclair et al.
Scientific machine learning research spans diverse domains and data modalities, yet existing benchmark efforts remain siloed and lack standardization. This makes novel and transformative applications of machine learning to critical scientific use-cases more fragmented and less clear in pathways to impact. This paper introduces an ontology for scientific benchmarking developed through a unified, community-driven effort that extends the MLCommons ecosystem to cover physics, chemistry, materials science, biology, climate science, and more. Building on prior initiatives such as XAI-BENCH, FastML Science Benchmarks, PDEBench, and the SciMLBench framework, our effort consolidates a large set of disparate benchmarks and frameworks into a single taxonomy of scientific, application, and system-level benchmarks. New benchmarks can be added through an open submission workflow coordinated by the MLCommons Science Working Group and evaluated against a six-category rating rubric that promotes and identifies high-quality benchmarks, enabling stakeholders to select benchmarks that meet their specific needs. The architecture is extensible, supporting future scientific and AI/ML motifs, and we discuss methods for identifying emerging computing patterns for unique scientific workloads. The MLCommons Science Benchmarks Ontology provides a standardized, scalable foundation for reproducible, cross-domain benchmarking in scientific machine learning. A companion webpage for this work has also been developed as the effort evolves: https://mlcommons-science.github.io/benchmark/
DC Apr 4
Minos: Systematically Classifying Performance and Power Characteristics of GPU Workloads on HPC Clusters
Rutwik Jain, Yiwei Jiang, Matthew D. Sinclair et al.
As large-scale HPC compute clusters increasingly adopt accelerators such as GPUs to meet the voracious demands of modern workloads, these clusters are becoming increasingly power constrained. Unfortunately, modern applications can often temporarily exceed the power ratings of the accelerators ("power spikes"). Thus, current and future HPC systems must optimize for both power and performance together. However, this is made difficult by increasingly diverse applications, which often require bespoke optimizations to run efficiently on each cluster. Traditionally, researchers overcome this problem by profiling applications on specific clusters and optimizing, but the scale, algorithmic diversity, and lack of effective tools make this challenging. To overcome these inefficiencies, we propose Minos, a systematic classification mechanism that identifies similar application characteristics via low-cost profiling for power and performance. This allows us to group similarly behaving workloads into a finite number of distinct classes and reduce the overhead of extensively profiling new workloads. For example, when predicting frequency capping behavior for a previously unseen application, Minos reduces profiling time by 89%. Moreover, across 18 popular graph analytics, HPC, HPC+ML, and ML workloads, Minos achieves a mean error of 4% for power predictions and 3% for performance predictions, significantly improving predictions over state-of-the-art approaches by 10%.
CL Mar 27, 2025
Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks
Wenqi Zhang, Mengna Wang, Gangao Liu et al.
Recent advances in deep thinking models have demonstrated remarkable reasoning capabilities on mathematical and coding tasks. However, their effectiveness in embodied domains, which require continuous interaction with environments through image-action interleaved trajectories, remains largely unexplored. We present Embodied Reasoner, a model that extends o1-style reasoning to interactive embodied search tasks. Unlike mathematical reasoning that relies primarily on logical deduction, embodied scenarios demand spatial understanding, temporal reasoning, and ongoing self-reflection based on interaction history. To address these challenges, we synthesize 9.3k coherent Observation-Thought-Action trajectories containing 64k interactive images and 90k diverse thinking processes (analysis, spatial reasoning, reflection, planning, and verification). We develop a three-stage training pipeline that progressively enhances the model's capabilities through imitation learning, self-exploration via rejection sampling, and self-correction through reflection tuning. The evaluation shows that our model significantly outperforms advanced visual reasoning models, e.g., it exceeds OpenAI o1, o3-mini, and Claude-3.7 by 9%, 24%, and 13%, respectively. Analysis reveals our model exhibits fewer repeated searches and logical inconsistencies, with particular advantages in complex long-horizon tasks. Experiments in real-world environments also show our model's superiority, with fewer repeated searches and logical inconsistencies.
CL Aug 30, 2021
Towards Consistent Document-level Entity Linking: Joint Models for Entity Linking and Coreference Resolution
Klim Zaporojets, Johannes Deleu, Yiwei Jiang et al.
We consider the task of document-level entity linking (EL), where it is important to make consistent decisions for entity mentions over the full document jointly. We aim to leverage explicit "connections" among mentions within the document itself: we propose to join the EL task with that of coreference resolution (coref). This is complementary to related works that exploit either (i) implicit document information (e.g., latent relations among entity mentions, or general language models) or (ii) connections between the candidate links (e.g., as inferred from the external knowledge base). Specifically, we cluster mentions that are linked via coreference, and enforce a single EL for all of the clustered mentions together. The latter constraint has the added benefit of increased coverage by joining EL candidate lists for the thus clustered mentions. We formulate the coref+EL problem as a structured prediction task over directed trees and use a globally normalized model to solve it. Experimental results on two datasets show a boost of up to +5% F1-score on both coref and EL tasks, compared to their standalone counterparts. For a subset of hard cases, with individual mentions lacking the correct EL in their candidate entity list, we obtain a +50% increase in accuracy.
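The cluster-level constraint described in this abstract (one entity link per coreference cluster, with candidate lists joined across the cluster) can be illustrated with a minimal sketch. This is not the paper's globally normalized tree model: the function name `link_clusters`, the input layout, and the rule of summing per-mention scores are all assumptions made purely for illustration.

```python
from collections import defaultdict


def link_clusters(clusters, candidates):
    """Assign one entity to every mention in each coreference cluster.

    clusters:   list of lists of mention ids (hypothetical input layout).
    candidates: dict mapping mention id -> {entity id: score}.
    Scores are pooled by summation, which is an illustrative assumption,
    not the scoring used in the paper.
    """
    links = {}
    for cluster in clusters:
        # Join the candidate lists of all mentions in the cluster.
        pooled = defaultdict(float)
        for mention in cluster:
            for entity, score in candidates.get(mention, {}).items():
                pooled[entity] += score
        # Enforce a single link for the whole cluster (None if no candidates).
        best = max(pooled, key=pooled.get) if pooled else None
        for mention in cluster:
            links[mention] = best
    return links


clusters = [["m1", "m2"], ["m3"]]
candidates = {
    "m1": {"Q_A": 0.9, "Q_B": 0.1},
    "m2": {},                 # pronoun-like mention with an empty candidate list
    "m3": {"Q_C": 0.7},
}
links = link_clusters(clusters, candidates)
print(links)
```

In this toy input, `m2` has no candidates of its own but inherits the link chosen for its cluster, which mirrors the coverage benefit the abstract attributes to joining candidate lists.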
SD Oct 15, 2020
Music Classification in MIDI Format based on LSTM Model
Yiting Xia, Yiwei Jiang, Tao Ye
Distinguishing music made by AI from music made by human composers can be done with deep learning networks. We first transform music samples in MIDI format into natural language sequences, then classify these samples with mLSTM (multiplicative Long Short-Term Memory) + logistic regression. The accuracy, evaluated by 10-fold cross validation, can reach 90%. Our work indicates that music generated by AI and music by human composers do have different characteristics, which can be learned by deep learning networks.
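The 10-fold cross validation mentioned in this abstract can be sketched in plain Python. The helper names (`k_fold_indices`, `cross_validate`) and the `fit`/`predict` callables are hypothetical placeholders; the sketch shows only the evaluation protocol, not the mLSTM features or the logistic regression classifier, and it does not shuffle the data, which a real experiment usually would.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for k contiguous folds.

    Assumes n_samples >= k so that no test fold is empty.
    """
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def cross_validate(samples, labels, fit, predict, k=10):
    """Return mean accuracy over k folds for user-supplied fit/predict callables."""
    accuracies = []
    for train, test in k_fold_indices(len(samples), k):
        model = fit([samples[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(model, samples[i]) == labels[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)


# Toy usage with a majority-class "model" standing in for the real classifier.
def fit_majority(xs, ys):
    return max(set(ys), key=ys.count)


def predict_majority(model, x):
    return model


toy_samples = list(range(20))
toy_labels = ["ai"] * 20
mean_accuracy = cross_validate(toy_samples, toy_labels, fit_majority, predict_majority)
print(mean_accuracy)
```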