Yixin Shen

LG
h-index18
7papers
72citations
Novelty43%
AI Score47

7 Papers

LGJan 22Code
Towards Automated Kernel Generation in the Era of LLMs

Yang Yu, Peiyu Zang, Chi Hsu Tsai et al.

The performance of modern AI systems is fundamentally constrained by the quality of their underlying kernels, which translate high-level algorithmic semantics into low-level hardware operations. Achieving near-optimal kernels requires expert-level understanding of hardware architectures and programming models, making kernel engineering a critical but notoriously time-consuming and non-scalable process. Recent advances in large language models (LLMs) and LLM-based agents have opened new possibilities for automating kernel generation and optimization. LLMs are well-suited to compress expert-level kernel knowledge that is difficult to formalize, while agentic systems further enable scalable optimization by casting kernel development as an iterative, feedback-driven loop. Rapid progress has been made in this area. However, the field remains fragmented, lacking a systematic perspective for LLM-driven kernel generation. This survey addresses this gap by providing a structured overview of existing approaches, spanning LLM-based approaches and agentic optimization workflows, and systematically compiling the datasets and benchmarks that underpin learning and evaluation in this domain. Moreover, key open challenges and future research directions are further outlined, aiming to establish a comprehensive reference for the next generation of automated kernel optimization. To keep track of this field, we maintain an open-source GitHub repository at https://github.com/flagos-ai/awesome-LLM-driven-kernel-generation.

86.5ROApr 17
Human Cognition in Machines: A Unified Perspective of World Models

Timothy Rupprecht, Pu Zhao, Amir Taherin et al.

This comprehensive report distinguishes prior works by the cognitive functions they innovate. Many works claim an almost "human-like" cognitive capability in their world models. To evaluate these claims requires a proper grounding in first principles in Cognitive Architecture Theory (CAT). We present a conceptual unified framework for world models that fully incorporates all the cognitive functions associated with CAT (i.e. memory, perception, language, reasoning, imagining, motivation, and meta-cognition) and identify gaps in the research as a guide for future states of the art. In particular, we find that motivation (especially intrinsic motivation) and meta-cognition remain drastically under-researched, and we propose concrete directions informed by active inference and global workspace theory to address them. We further introduce Epistemic World Models, a new category encompassing agent frameworks for scientific discovery that operate over structured knowledge. Our taxonomy, applied across video, embodied, and epistemic world models, suggests research directions where prior taxonomies have not.

CLDec 8, 2024Code
7B Fully Open Source Moxin-LLM/VLM -- From Pretraining to GRPO-based Reinforcement Learning Enhancement

Pu Zhao, Xuan Shen, Zhenglun Kong et al. · harvard

Recently, Large Language Models (LLMs) have undergone a significant transformation, marked by a rapid rise in both their popularity and capabilities. Leading this evolution are proprietary LLMs like GPT-4 and GPT-o1, which have captured widespread attention in the AI community due to their remarkable performance and versatility. Simultaneously, open-source LLMs, such as LLaMA, have made great contributions to the ever-increasing popularity of LLMs due to the ease to customize and deploy the models across diverse applications. Although open-source LLMs present unprecedented opportunities for innovation and research, the commercialization of LLMs has raised concerns about transparency, reproducibility, and safety. Many open-source LLMs fail to meet fundamental transparency requirements by withholding essential components like training code and data, which may hinder further innovations on LLMs. To mitigate this issue, we introduce Moxin 7B, a fully open-source LLM developed, adhering to principles of open science, open source, open data, and open access. We release the pre-training code and configurations, training and fine-tuning datasets, and intermediate and final checkpoints, aiming to make continuous commitments to fully open-source LLMs. After pre-training the base model, we finetune the Moxin Base model with SOTA post-training framework and instruction data to obtain Moxin Instruct model. To improve the reasoning capability, we further finetune our Instruct model with chain-of-thought data distilled from DeepSeek R1, and then use Group Relative Policy Optimization (GRPO) following DeepSeek R1 to finetune our model, leading to the Moxin Reasoning model. Moreover, we develop our vision language model based on our Moxin model. Experiments show that our models achieve superior performance in various evaluations such as zero-shot evaluation, few-shot evaluation, and CoT evaluation.

LGAug 4, 2025
Kron-LoRA: Hybrid Kronecker-LoRA Adapters for Scalable, Sustainable Fine-tuning

Yixin Shen

Fine-tuning massive pre-trained language models across many tasks demands adapters that are both parameter-efficient and expressive. We introduce \textbf{Kron-LoRA}, a hybrid adapter that combines Kronecker-structured factorization with low-rank LoRA compression-an integration that, to our knowledge, has not been explored in parameter-efficient fine-tuning or in matrix approximation literature. Kron-LoRA achieves up to 4$\times$ fewer parameters than standard LoRA while retaining similar expressivity. Experiments on DistilBERT, Mistral-7B, LLaMA-2-7B, and LLaMA-3-8B across eight benchmarks show that Kron-LoRA matches or exceeds LoRA baselines with modest memory savings and only a 5-8\% speed overhead. In sequential fine-tuning, it also delivers competitive cross-task transfer despite using only one-quarter of the adapter parameters. Kron-LoRA thus offers a scalable, sustainable solution for multi-task adaptation of large language models.

QUANT-PHFeb 14, 2022
Variational quantum solutions to the Shortest Vector Problem

Martin R. Albrecht, Miloš Prokop, Yixin Shen et al.

A fundamental computational problem is to find a shortest non-zero vector in Euclidean lattices, a problem known as the Shortest Vector Problem (SVP). This problem is believed to be hard even on quantum computers and thus plays a pivotal role in post-quantum cryptography. In this work we explore how (efficiently) Noisy Intermediate Scale Quantum (NISQ) devices may be used to solve SVP. Specifically, we map the problem to that of finding the ground state of a suitable Hamiltonian. In particular, (i) we establish new bounds for lattice enumeration, this allows us to obtain new bounds (resp.~estimates) for the number of qubits required per dimension for any lattices (resp.~random q-ary lattices) to solve SVP; (ii) we exclude the zero vector from the optimization space by proposing (a) a different classical optimisation loop or alternatively (b) a new mapping to the Hamiltonian. These improvements allow us to solve SVP in dimension up to 28 in a quantum emulation, significantly more than what was previously achieved, even for special cases. Finally, we extrapolate the size of NISQ devices that is required to be able to solve instances of lattices that are hard even for the best classical algorithms and find that with approximately $10^3$ noisy qubits such instances can be tackled.

DSFeb 19, 2020
Improved Classical and Quantum Algorithms for the Shortest Vector Problem via Bounded Distance Decoding

Divesh Aggarwal, Yanlin Chen, Rajendra Kumar et al.

The most important computational problem on lattices is the Shortest Vector Problem (SVP). In this paper, we present new algorithms that improve the state-of-the-art for provable classical/quantum algorithms for SVP. We present the following results. $\bullet$ A new algorithm for SVP that provides a smooth tradeoff between time complexity and memory requirement. For any positive integer $4\leq q\leq \sqrt{n}$, our algorithm takes $q^{13n+o(n)}$ time and requires $poly(n)\cdot q^{16n/q^2}$ memory. This tradeoff which ranges from enumeration ($q=\sqrt{n}$) to sieving ($q$ constant), is a consequence of a new time-memory tradeoff for Discrete Gaussian sampling above the smoothing parameter. $\bullet$ A quantum algorithm for SVP that runs in time $2^{0.950n+o(n)}$ and requires $2^{0.5n+o(n)}$ classical memory and poly(n) qubits. In Quantum Random Access Memory (QRAM) model this algorithm takes only $2^{0.835n+o(n)}$ time and requires a QRAM of size $2^{0.293n+o(n)}$, poly(n) qubits and $2^{0.5n}$ classical space. This improves over the previously fastest classical (which is also the fastest quantum) algorithm due to [ADRS15] that has a time and space complexity $2^{n+o(n)}$. $\bullet$ A classical algorithm for SVP that runs in time $2^{1.669n+o(n)}$ time and $2^{0.5n+o(n)}$ space. This improves over an algorithm of [CCL18] that has the same space complexity. The time complexity of our classical and quantum algorithms are obtained using a known upper bound on a quantity related to the lattice kissing number which is $2^{0.402n}$. We conjecture that for most lattices this quantity is a $2^{o(n)}$. Assuming that this is the case, our classical algorithm runs in time $2^{1.292n+o(n)}$, our quantum algorithm runs in time $2^{0.750n+o(n)}$ and our quantum algorithm in QRAM model runs in time $2^{0.667n+o(n)}$.

QUANT-PHFeb 12, 2020
Improved Classical and Quantum Algorithms for Subset-Sum

Xavier Bonnetain, Rémi Bricout, André Schrottenloher et al.

We present new classical and quantum algorithms for solving random subset-sum instances. First, we improve over the Becker-Coron-Joux algorithm (EUROCRYPT 2011) from $\tilde{\mathcal{O}}(2^{0.291 n})$ downto $\tilde{\mathcal{O}}(2^{0.283 n})$, using more general representations with values in $\{-1,0,1,2\}$. Next, we improve the state of the art of quantum algorithms for this problem in several directions. By combining the Howgrave-Graham-Joux algorithm (EUROCRYPT 2010) and quantum search, we devise an algorithm with asymptotic cost $\tilde{\mathcal{O}}(2^{0.236 n})$, lower than the cost of the quantum walk based on the same classical algorithm proposed by Bernstein, Jeffery, Lange and Meurer (PQCRYPTO 2013). This algorithm has the advantage of using \emph{classical} memory with quantum random access, while the previously known algorithms used the quantum walk framework, and required \emph{quantum} memory with quantum random access. We also propose new quantum walks for subset-sum, performing better than the previous best time complexity of $\tilde{\mathcal{O}}(2^{0.226 n})$ given by Helm and May (TQC 2018). We combine our new techniques to reach a time $\tilde{\mathcal{O}}(2^{0.216 n})$. This time is dependent on a heuristic on quantum walk updates, formalized by Helm and May, that is also required by the previous algorithms. We show how to partially overcome this heuristic, and we obtain an algorithm with quantum time $\tilde{\mathcal{O}}(2^{0.218 n})$ requiring only the standard classical subset-sum heuristics.