CR | Nov 22, 2023 | Code
A Somewhat Robust Image Watermark against Diffusion-based Editing Models
Mingtian Tan, Tianhao Wang, Somesh Jha
Recently, diffusion models (DMs) have become the state-of-the-art method for image synthesis. Editing models based on DMs, known for their high fidelity and precision, have inadvertently introduced new challenges related to image copyright infringement and malicious editing. Our work is the first to formalize and address this issue. After assessing and attempting to enhance traditional image watermarking techniques, we recognize their limitations in this emerging context. In response, we develop a novel technique, RIW (Robust Invisible Watermarking), to embed invisible watermarks leveraging adversarial example techniques. Our technique ensures a high extraction accuracy of $96\%$ for the invisible watermark after editing, compared to the $0\%$ offered by conventional methods. We provide access to our code at https://github.com/BennyTMT/RIW.
CL | Apr 17, 2024 | Code
Language Models Still Struggle to Zero-shot Reason about Time Series
Mike A. Merrill, Mingtian Tan, Vinayak Gupta et al.
Time series are critical for decision-making in fields like finance and healthcare. Their importance has driven a recent influx of works passing time series into language models, leading to non-trivial forecasting on some datasets. But it remains unknown whether non-trivial forecasting implies that language models can reason about time series. To address this gap, we generate a first-of-its-kind evaluation framework for time series reasoning, including formal tasks and a corresponding dataset of multi-scale time series paired with text captions across ten domains. Using these data, we probe whether language models achieve three forms of reasoning: (1) Etiological Reasoning - given an input time series, can the language model identify the scenario that most likely created it? (2) Question Answering - can a language model answer factual questions about time series? (3) Context-Aided Forecasting - does highly relevant textual context improve a language model's time series forecasts? We find that otherwise highly capable language models demonstrate surprisingly limited time series reasoning: they score marginally above random on etiological and question answering tasks (up to 30 percentage points worse than humans) and show modest success in using context to improve forecasting. These weaknesses show that time series reasoning is an impactful, yet deeply underdeveloped direction for language model research. We also make our datasets and code public to support further research in this direction at https://github.com/behavioral-data/TSandLanguage
CR | Sep 12, 2023
Exploiting Machine Unlearning for Backdoor Attacks in Deep Learning System
Peixin Zhang, Jun Sun, Mingtian Tan et al.
In recent years, the security issues of artificial intelligence have become increasingly prominent due to the rapid development of deep learning research and applications. A backdoor attack targets a vulnerability of deep learning models: hidden backdoors are activated by triggers embedded by the attacker, causing the model to output malicious predictions that may not align with the intended output for a given input. In this work, we propose a novel black-box backdoor attack based on machine unlearning. The attacker first augments the training set with carefully designed samples, including poison and mitigation data, to train a 'benign' model. Then, the attacker posts unlearning requests for the mitigation samples to remove the impact of relevant data on the model, gradually activating the hidden backdoor. Since backdoors are implanted during the iterative unlearning process, it significantly increases the computational overhead of existing defense methods for backdoor detection or mitigation. To address this new security threat, we propose two methods for detecting or mitigating such malicious unlearning requests. We conduct experiments in both exact unlearning and approximate unlearning (i.e., SISA) settings. Experimental results indicate that: 1) our attack approach can successfully implant a backdoor into the model, and sharding increases the difficulty of the attack; 2) our detection algorithms are effective in identifying the mitigation samples, while sharding reduces the effectiveness of our detection algorithms.
AI | Mar 18, 2025 | Code
Inferring Events from Time Series using Language Models
Mingtian Tan, Mike A. Merrill, Zack Gottesman et al.
Time series data measure how environments change over time and drive decision-making in critical domains like finance and healthcare. A common goal in analyzing time series data is to understand the underlying events that cause the observed variations. We conduct the first study of whether Large Language Models (LLMs) can infer events described in natural language from time series data. We evaluate 18 LLMs on a task that matches event sequences with real-valued time series data, using a new benchmark we build from sports data. Several current LLMs demonstrate promising abilities, with OpenAI's o1 performing the best but with DS-R1-distill-Qwen-32B outperforming proprietary models such as GPT-4o. From insights derived from analyzing reasoning failures, we also find clear avenues to improve performance. By applying post-training optimizations, i.e., distillation and self-improvement, we significantly enhance the performance of Qwen2.5 1.5B, achieving results second only to o1. All resources needed to reproduce our work are available: https://github.com/BennyTMT/GAMETime
LG | May 9
LEAF: A Living Benchmark for Event-Augmented Forecasting
Mingtian Tan, Mihir Parmar, Palash Goyal et al.
Large Language Models (LLMs) are increasingly applied to forecasting. To evaluate this capability while mitigating pre-training data contamination, several living benchmarks have been proposed. However, existing benchmarks either lack the multidimensional events essential for accurate forecasting due to data scarcity, or focus on relatively closed environments. To assess the predictive capabilities of LLMs in complex, real-world scenarios, we propose LEAF, the first living benchmark for event-augmented forecasting tasks, including future event probabilities, trend and time series forecasting. LEAF utilizes a recursive retrieval agent system paired with dual-agent cross-validation to provide comprehensive and relevant auxiliary text for forecasting. Evaluating state-of-the-art proprietary and open-weight LLMs, we find that these models can leverage signals extracted from complex events to enhance predictive performance. In the stock domain, we find that LLMs achieve better performance on equities they confidently identify as more predictable. Furthermore, the events demonstrate a strong correlation with the target equities. To this end, LEAF provides a necessary, dynamically updating testbed to continuously track and drive progress in event-driven forecasting tasks.
AI | Oct 14, 2025
Toward Reasoning-Centric Time-Series Analysis
Xinlei Wang, Mingtian Tan, Jing Qiu et al.
Traditional time series analysis has long relied on pattern recognition, trained on static and well-established benchmarks. However, in real-world settings -- where policies shift, human behavior adapts, and unexpected events unfold -- effective analysis must go beyond surface-level trends to uncover the actual forces driving them. The recent rise of Large Language Models (LLMs) presents new opportunities for rethinking time series analysis by integrating multimodal inputs. However, as the use of LLMs becomes popular, we must remain cautious, asking why we use LLMs and how to exploit them effectively. Most existing LLM-based methods still employ their numerical regression ability and ignore their deeper reasoning potential. This paper argues for rethinking time series with LLMs as a reasoning task that prioritizes causal structure and explainability. This shift brings time series analysis closer to human-aligned understanding, enabling transparent and context-aware insights in complex real-world environments.
LG | Jun 22, 2024
Are Language Models Actually Useful for Time Series Forecasting?
Mingtian Tan, Mike A. Merrill, Vinayak Gupta et al.
Large language models (LLMs) are being applied to time series forecasting. But are language models actually useful for time series? In a series of ablation studies on three recent and popular LLM-based time series forecasting methods, we find that removing the LLM component or replacing it with a basic attention layer does not degrade forecasting performance -- in most cases, the results even improve! We also find that despite their significant computational cost, pretrained LLMs do no better than models trained from scratch, do not represent the sequential dependencies in time series, and do not assist in few-shot settings. Additionally, we explore time series encoders and find that patching and attention structures perform similarly to LLM-based forecasters.
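As a rough illustration of the "patching" step mentioned in the abstract above, the sketch below splits a univariate series into fixed-length, possibly overlapping windows. The function name `make_patches` and the parameters `patch_len` and `stride` are hypothetical choices for this example and are not taken from the paper's code.

```python
def make_patches(series, patch_len, stride):
    """Split a 1-D sequence into windows of length patch_len, stepping by stride.

    Trailing values that do not fill a complete patch are dropped.
    """
    if patch_len <= 0 or stride <= 0:
        raise ValueError("patch_len and stride must be positive")
    last_start = len(series) - patch_len
    return [series[s:s + patch_len] for s in range(0, last_start + 1, stride)]


# Toy series 0..9, patches of length 4 with 50% overlap.
patches = make_patches(list(range(10)), patch_len=4, stride=2)
print(patches)
```

In a patch-based encoder, each window would typically be projected to a vector and treated as one token by the attention layers; that projection is omitted here.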
CR | Jan 28, 2019
Do Not Return Similarity: Face Recovery with Distance
Mingtian Tan, Zhe Zhou
Machine Learning (ML) has already been integrated into all kinds of systems, helping developers solve problems with even higher accuracy than human beings. However, when integrating ML models into a system, developers may inadvertently pay too little attention to the outputs of ML models, mainly because of their unfamiliarity with ML and AI, resulting in severe consequences such as harm to data owners' privacy. In this work, we focus on understanding the risks of abusing embeddings of ML models, an important and popular way of using ML. To show the consequences, we reveal several kinds of channels through which embeddings are accidentally leaked. As our study shows, a face verification system deployed by a government organization that leaks only the distance to authentic users allows an attacker to exactly recover the embedding of the verifier's pre-installed photo. Further, as we discovered, with the leaked embedding, attackers can easily recover the input photo with negligible quality loss, indicating devastating consequences for users' privacy. This is achieved with our devised GAN-like model, which showed a 93.65% success rate on a popular face embedding model under a black-box assumption.
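To see why leaking only a distance can be enough to recover an embedding exactly, here is a minimal toy sketch. It is not the paper's attack. It assumes the system returns the exact Euclidean distance and that the attacker can choose probe embeddings directly (the origin and the unit basis vectors), which a real face pipeline would not normally allow. The names `distance_oracle` and `recover_embedding` are hypothetical.

```python
import math
import random


def distance_oracle(secret, probe):
    """Hypothetical stand-in for a system that leaks the distance to a hidden embedding."""
    return math.dist(secret, probe)


def recover_embedding(oracle, dim):
    """Recover a hidden vector v from dim + 1 distance queries.

    Uses ||v - e_i||^2 = ||v||^2 - 2*v_i + 1, so v_i = (d0^2 - di^2 + 1) / 2,
    where d0 is the distance from the origin and di the distance from e_i.
    """
    d0_sq = oracle([0.0] * dim) ** 2
    recovered = []
    for i in range(dim):
        probe = [0.0] * dim
        probe[i] = 1.0  # unit basis vector e_i
        di_sq = oracle(probe) ** 2
        recovered.append((d0_sq - di_sq + 1.0) / 2.0)
    return recovered


random.seed(0)
secret = [random.uniform(-1.0, 1.0) for _ in range(8)]  # toy 8-dimensional "embedding"
recovered = recover_embedding(lambda p: distance_oracle(secret, p), dim=8)
max_error = max(abs(a - b) for a, b in zip(secret, recovered))
print(max_error)  # small; limited only by floating-point rounding
```

With probes that are embeddings of real photos instead of basis vectors, the same identity yields a linear system in the hidden embedding, which could be solved by least squares once enough independent probes are available.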