45.9CRMay 27
Domain-Informed Representation for Evolutionary Sieving in Integral and Module LatticesAhmad Tashfeen, Qi Cheng
Traditional cryptography, rooted in problems, e.g., integer factorisation or discrete log, is inevitably vulnerable to a fully operational quantum computer. Although it remains an engineering frontier, the looming threat extends to encrypted data stored today, which could be decrypted in the future with quantum capabilities. To safeguard against this eventuality, the backbone of the modern quantum-safe cryptography is the Shortest Vector Problem (SVP). We enhance Laarhoven's treatment of Ajtai et al.'s sieving as a genetic algorithm (GA) for the SVP by incorporating domain-informed SVP representation and crossover while naturally extending application to the module lattices.
LGFeb 2
AgroFlux: A Spatial-Temporal Benchmark for Carbon and Nitrogen Flux Prediction in Agricultural EcosystemsQi Cheng, Licheng Liu, Yao Zhang et al.
Agroecosystem, which heavily influenced by human actions and accounts for a quarter of global greenhouse gas emissions (GHGs), plays a crucial role in mitigating global climate change and securing environmental sustainability. However, we can't manage what we can't measure. Accurately quantifying the pools and fluxes in the carbon, nutrient, and water nexus of the agroecosystem is therefore essential for understanding the underlying drivers of GHG and developing effective mitigation strategies. Conventional approaches like soil sampling, process-based models, and black-box machine learning models are facing challenges such as data sparsity, high spatiotemporal heterogeneity, and complex subsurface biogeochemical and physical processes. Developing new trustworthy approaches such as AI-empowered models, will require the AI-ready benchmark dataset and outlined protocols, which unfortunately do not exist. In this work, we introduce a first-of-its-kind spatial-temporal agroecosystem GHG benchmark dataset that integrates physics-based model simulations from Ecosys and DayCent with real-world observations from eddy covariance flux towers and controlled-environment facilities. We evaluate the performance of various sequential deep learning models on carbon and nitrogen flux prediction, including LSTM-based models, temporal CNN-based model, and Transformer-based models. Furthermore, we explored transfer learning to leverage simulated data to improve the generalization of deep learning models on real-world observations. Our benchmark dataset and evaluation framework contribute to the development of more accurate and scalable AI-driven agroecosystem models, advancing our understanding of ecosystem-climate interactions.
76.3DBMay 6
Efficient Cost-Based Rewrite in a Bottom-Up OptimizerQi Cheng, Yang Sun, Weidong Yu et al.
The query optimizer in a Database Management Systems (DBMS), translates declarative queries into efficient execution plans. Conventional bottom-up optimization consists of two main stages: Query Rewrite (QRW) and Cost-Based Optimization (CBO). However, applying a rewrite rule during QRW may not always be beneficial; the best choice may depend on the (estimated) execution cost of the original and rewritten expressions. Fully exploiting such cost-dependent rules necessitates interleaving QRW with frequent CBO invocations, thereby incurring substantial overhead and often impractical optimization times. To mitigate this inefficiency, we introduce a novel cost-based rewrite framework for bottom-up optimizers. The core of our approach is a multi-level caching mechanism for intermediate CBO results aimed at eliminating redundant computation. Furthermore, we establish and exploit upper cost bounds to intelligently prune the search space during optimization. We also contribute methodological solutions for caching and reusing intermediate plan results within a bottom-up optimizer architecture. The framework has been implemented in the GaussDB optimizer. Experiments show that it significantly reduces overall optimization time, demonstrating the effectiveness of our approach.
LGMay 5, 2025
Knowledge Guided Encoder-Decoder Framework: Integrating Multiple Physical Models for Agricultural Ecosystem ModelingQi Cheng, Licheng Liu, Yao Zhang et al.
Agricultural monitoring is critical for ensuring food security, maintaining sustainable farming practices, informing policies on mitigating food shortage, and managing greenhouse gas emissions. Traditional process-based physical models are often designed and implemented for specific situations, and their parameters could also be highly uncertain. In contrast, data-driven models often use black-box structures and does not explicitly model the inter-dependence between different ecological variables. As a result, they require extensive training data and lack generalizability to different tasks with data distribution shifts and inconsistent observed variables. To address the need for more universal models, we propose a knowledge-guided encoder-decoder model, which can predict key crop variables by leveraging knowledge of underlying processes from multiple physical models. The proposed method also integrates a language model to process complex and inconsistent inputs and also utilizes it to implement a model selection mechanism for selectively combining the knowledge from different physical models. Our evaluations on predicting carbon and nitrogen fluxes for multiple sites demonstrate the effectiveness and robustness of the proposed model under various scenarios.
LGMar 7
Retrieval-Augmented Multi-scale Framework for County-Level Crop Yield Prediction Across Large RegionsYiming Sun, Qi Cheng, Licheng Liu et al.
This paper proposes a new method for crop yield prediction, which is essential for developing management strategies, informing insurance assessments, and ensuring long-term food security. Although existing data-driven approaches have shown promise in this domain, their performance often degrades when applied across large geographic regions and long time periods. This limitation arises from two key challenges: (1) difficulty in jointly capturing short-term and long-term temporal patterns, and (2) inability to effectively accommodate spatial data variability in agricultural systems. Ignoring these issues often leads to unreliable predictions for specific regions or years, which ultimately affects policy decisions and resource allocation. In this paper, we propose a new predictive framework to address these challenges. First, we introduce a new backbone model architecture that captures both short-term daily-scale crop growth dynamics and long-term dependencies across years. To further improve generalization across diverse spatial regions, we augment this model with a retrieval-based adaptation strategy. Recognizing the substantial yield variation across years, we design a novel retrieval-and-refinement pipeline that adjusts retrieved samples by removing cross-year bias not explained by input features. Our experiments on real-world county-level corn yield data over 630 counties in the US demonstrate that our method consistently outperforms different types of baselines. The results also verify the effectiveness of the retrieval-based augmentation method in improving model robustness under spatial heterogeneity.
LGOct 5, 2025
Slow-Fast Policy Optimization: Reposition-Before-Update for LLM ReasoningZiyan Wang, Zheng Wang, Jie Fu et al.
Reinforcement learning (RL) has become central to enhancing reasoning in large language models (LLMs). Yet on-policy algorithms such as Group Relative Policy Optimization (GRPO) often suffer in early training: noisy gradients from low-quality rollouts lead to unstable updates and inefficient exploration. We introduce Slow-Fast Policy Optimization (SFPO), a simple yet efficient framework to address these limitations via decomposing each step into three stages: a short fast trajectory of inner steps on the same batch, a reposition mechanism to control off-policy drift, and a final slow correction. This reposition-before-update design preserves the objective and rollout process unchanged, making SFPO plug-compatible with existing policy-gradient pipelines. Extensive experiments demonstrate that SFPO consistently improves stability, reduces rollouts, and accelerates convergence of reasoning RL training. Specifically, it outperforms GRPO by up to 2.80 points in average on math reasoning benchmarks. It also achieves up to 4.93\texttimes{} fewer rollouts and an up to 4.19\texttimes{} reduction in wall-clock time to match GRPO's best accuracy.
AIMay 20, 2025
LLM-based Evaluation Policy Extraction for Ecological ModelingQi Cheng, Licheng Liu, Qing Zhu et al.
Evaluating ecological time series is critical for benchmarking model performance in many important applications, including predicting greenhouse gas fluxes, capturing carbon-nitrogen dynamics, and monitoring hydrological cycles. Traditional numerical metrics (e.g., R-squared, root mean square error) have been widely used to quantify the similarity between modeled and observed ecosystem variables, but they often fail to capture domain-specific temporal patterns critical to ecological processes. As a result, these methods are often accompanied by expert visual inspection, which requires substantial human labor and limits the applicability to large-scale evaluation. To address these challenges, we propose a novel framework that integrates metric learning with large language model (LLM)-based natural language policy extraction to develop interpretable evaluation criteria. The proposed method processes pairwise annotations and implements a policy optimization mechanism to generate and combine different assessment metrics. The results obtained on multiple datasets for evaluating the predictions of crop gross primary production and carbon dioxide flux have confirmed the effectiveness of the proposed method in capturing target assessment preferences, including both synthetically generated and expert-annotated model comparisons. The proposed framework bridges the gap between numerical metrics and expert knowledge while providing interpretable evaluation policies that accommodate the diverse needs of different ecosystem modeling studies.
CLOct 17, 2024
Learning Multimodal Cues of Children's UncertaintyQi Cheng, Mert İnan, Rahma Mbarki et al.
Understanding uncertainty plays a critical role in achieving common ground (Clark et al.,1983). This is especially important for multimodal AI systems that collaborate with users to solve a problem or guide the user through a challenging concept. In this work, for the first time, we present a dataset annotated in collaboration with developmental and cognitive psychologists for the purpose of studying nonverbal cues of uncertainty. We then present an analysis of the data, studying different roles of uncertainty and its relationship with task difficulty and performance. Lastly, we present a multimodal machine learning model that can predict uncertainty given a real-time video clip of a participant, which we find improves upon a baseline multimodal transformer model. This work informs research on cognitive coordination between human-human and human-AI and has broad implications for gesture understanding and generation. The anonymized version of our data and code will be publicly available upon the completion of the required consent forms and data sheets.
CLJun 6, 2024
Every Answer Matters: Evaluating Commonsense with Probabilistic MeasuresQi Cheng, Michael Boratko, Pranay Kumar Yelugam et al.
Large language models have demonstrated impressive performance on commonsense tasks; however, these tasks are often posed as multiple-choice questions, allowing models to exploit systematic biases. Commonsense is also inherently probabilistic with multiple correct answers. The purpose of "boiling water" could be making tea and cooking, but it also could be killing germs. Existing tasks do not capture the probabilistic nature of common sense. To this end, we present commonsense frame completion (CFC), a new generative task that evaluates common sense via multiple open-ended generations. We also propose a method of probabilistic evaluation that strongly correlates with human judgments. Humans drastically outperform strong language model baselines on our dataset, indicating this approach is both a challenging and useful evaluation of machine common sense.
LGDec 22, 2020
Mixture Model Framework for Traumatic Brain Injury Prognosis Using Heterogeneous Clinical and Outcome DataAlan D. Kaplan, Qi Cheng, K. Aditya Mohan et al.
Prognoses of Traumatic Brain Injury (TBI) outcomes are neither easily nor accurately determined from clinical indicators. This is due in part to the heterogeneity of damage inflicted to the brain, ultimately resulting in diverse and complex outcomes. Using a data-driven approach on many distinct data elements may be necessary to describe this large set of outcomes and thereby robustly depict the nuanced differences among TBI patients' recovery. In this work, we develop a method for modeling large heterogeneous data types relevant to TBI. Our approach is geared toward the probabilistic representation of mixed continuous and discrete variables with missing values. The model is trained on a dataset encompassing a variety of data types, including demographics, blood-based biomarkers, and imaging findings. In addition, it includes a set of clinical outcome assessments at 3, 6, and 12 months post-injury. The model is used to stratify patients into distinct groups in an unsupervised learning setting. We use the model to infer outcomes using input data, and show that the collection of input data reduces uncertainty of outcomes over a baseline approach. In addition, we quantify the performance of a likelihood scoring technique that can be used to self-evaluate the extrapolation risk of prognosis on unseen patients.
CRApr 21, 2020
On the ideal shortest vector problem over random rational primesYanbin Pan, Jun Xu, Nick Wadleigh et al.
Any ideal in a number field can be factored into a product of prime ideals. In this paper we study the prime ideal shortest vector problem (SVP) in the ring $ \Z[x]/(x^{2^n} + 1) $, a popular choice in the design of ideal lattice based cryptosystems. We show that a majority of rational primes lie under prime ideals admitting a polynomial time algorithm for SVP. Although the shortest vector problem of ideal lattices underpins the security of Ring-LWE cryptosystem, this work does not break Ring-LWE, since the security reduction is from the worst case ideal SVP to the average case Ring-LWE, and it is one-way.
CRDec 20, 2016
LWE from Non-commutative Group RingsQi Cheng, Jun Zhang, Jincheng Zhuang
The Ring Learning-With-Errors (LWE) problem, whose security is based on hard ideal lattice problems, has proven to be a promising primitive with diverse applications in cryptography. There are however recent discoveries of faster algorithms for the principal ideal SVP problem, and attempts to generalize the attack to non-principal ideals. In this work, we study the LWE problem on group rings, and build cryptographic schemes based on this new primitive. One can regard the LWE on cyclotomic integers as a special case when the underlying group is cyclic, while our proposal utilizes non-commutative groups, which eliminates the weakness associated with the principal ideal lattices. In particular, we show how to build public key encryption schemes from dihedral group rings, which maintains the efficiency of the ring-LWE and improves its security.
NTOct 18, 2013
Traps to the BGJT-Algorithm for Discrete LogarithmsQi Cheng, Daqing Wan, Jincheng Zhuang
In the recent breakthrough paper by Barbulescu, Gaudry, Joux and Thom{é}, a quasi-polynomial time algorithm (QPA) is proposed for the discrete logarithm problem over finite fields of small characteristic. The time complexity analysis of the algorithm is based on several heuristics presented in their paper. We show that some of the heuristics are problematic in their original forms, in particular, when the field is not a Kummer extension. We believe that the basic idea behind the new approach should still work, and propose a fix to the algorithm in non-Kummer cases, without altering the quasi-polynomial time complexity. The modified algorithm is also heuristic. Further study is required in order to fully understand the effectiveness of the new approach.