Mark Lee

CL
h-index27
20papers
2,432citations
Novelty41%
AI Score57

20 Papers

AIJul 29, 2024
Apple Intelligence Foundation Language Models

Tom Gunter, Zirui Wang, Chong Wang et al.

We present foundation language models developed to power Apple Intelligence features, including a ~3 billion parameter model designed to run efficiently on devices and a large server-based language model designed for Private Cloud Compute. These models are designed to perform a wide range of tasks efficiently, accurately, and responsibly. This report describes the model architecture, the data used to train the model, the training process, how the models are optimized for inference, and the evaluation results. We highlight our focus on Responsible AI and how the principles are applied throughout the model development.

CLMay 19
When Reasoning Supervision Hurts: TTCW-Based Long-Form Literary Review Generation

Jinlong Liu, Mohammed Bahja, Mark Lee

Automatic evaluation of long-form literary writing remains challenging, as generic LLM-as-Judge approaches may not fully capture creativity-related dimensions such as originality and flexibility. Although the Torrance Test of Creative Writing (TTCW) provides a structured creativity framework, and prior work has demonstrated reference-based TTCW evaluation at the pairwise level, no large-scale dataset exists for long-form TTCW-based literary review generation. We address this gap by constructing a dataset of 263,911 long-form stories, each annotated with scalar scores and meta-synthesised review comments across 14 TTCW-based dimensions. Using this dataset, we fine-tune Qwen3 models at two scales, 4B and 8B, under two conditions: with and without reasoning content. Results show that non-reasoning fine-tuning achieves stronger and more stable performance, with the best setting reaching an evaluation score of 0.6820. Further analysis shows that reasoning-supervised models are more prone to parse failures, often continuing with irrelevant or repetitive reasoning-style text rather than completing the required 14-metric review report. These results suggest that, for fixed-format rubric-based review generation, reasoning supervision is not straightforwardly beneficial, and precise metric-aligned scoring remains challenging even after task-specific fine-tuning.

LGFeb 24, 2025Code
Delta Decompression for MoE-based LLMs Compression

Hao Gu, Wei Li, Lujun Li et al.

Mixture-of-Experts (MoE) architectures in large language models (LLMs) achieve exceptional performance, but face prohibitive storage and memory requirements. To address these challenges, we present $D^2$-MoE, a new delta decompression compressor for reducing the parameters of MoE LLMs. Based on observations of expert diversity, we decompose their weights into a shared base weight and unique delta weights. Specifically, our method first merges each expert's weight into the base weight using the Fisher information matrix to capture shared components. Then, we compress delta weights through Singular Value Decomposition (SVD) by exploiting their low-rank properties. Finally, we introduce a semi-dynamical structured pruning strategy for the base weights, combining static and dynamic redundancy analysis to achieve further parameter reduction while maintaining input adaptivity. In this way, our $D^2$-MoE successfully compact MoE LLMs to high compression ratios without additional training. Extensive experiments highlight the superiority of our approach, with over 13% performance gains than other compressors on Mixtral|Phi-3.5|DeepSeek|Qwen2 MoE LLMs at 40$\sim$60% compression rates. Codes are available in https://github.com/lliai/D2MoE.

CLMar 7, 2024Code
Code-Mixed Probes Show How Pre-Trained Models Generalise On Code-Switched Text

Frances A. Laureano De Leon, Harish Tayyar Madabushi, Mark Lee

Code-switching is a prevalent linguistic phenomenon in which multilingual individuals seamlessly alternate between languages. Despite its widespread use online and recent research trends in this area, research in code-switching presents unique challenges, primarily stemming from the scarcity of labelled data and available resources. In this study we investigate how pre-trained Language Models handle code-switched text in three dimensions: a) the ability of PLMs to detect code-switched text, b) variations in the structural information that PLMs utilise to capture code-switched text, and c) the consistency of semantic information representation in code-switched text. To conduct a systematic and controlled evaluation of the language models in question, we create a novel dataset of well-formed naturalistic code-switched text along with parallel translations into the source languages. Our findings reveal that pre-trained language models are effective in generalising to code-switched text, shedding light on the abilities of these models to generalise representations to CS corpora. We release all our code and data including the novel corpus at https://github.com/francesita/code-mixed-probes.

LGMay 23, 2024Code
Revisiting MoE and Dense Speed-Accuracy Comparisons for LLM Training

Xianzhi Du, Tom Gunter, Xiang Kong et al.

Mixture-of-Experts (MoE) enjoys performance gain by increasing model capacity while keeping computation cost constant. When comparing MoE to dense models, prior work typically adopt the following setting: 1) use FLOPs or activated parameters as a measure of model complexity; 2) train all models to the same number of tokens. We argue that this setting favors MoE as FLOPs and activated parameters do not accurately measure the communication overhead in sparse layers, leading to a larger actual training budget for MoE. In this work, we revisit the settings by adopting step time as a more accurate measure of model complexity, and by determining the total compute budget under the Chinchilla compute-optimal settings. To efficiently run MoE on modern accelerators, we adopt a 3D sharding method that keeps the dense-to-MoE step time increase within a healthy range. We evaluate MoE and dense LLMs on a set of nine 0-shot and two 1-shot English tasks, as well as MMLU 5-shot and GSM8K 8-shot across three model scales at 6.4B, 12.6B, and 29.6B. Experimental results show that even under these settings, MoE consistently outperform dense LLMs on the speed-accuracy trade-off curve with meaningful gaps. Our full model implementation and sharding strategy has been released at~\url{https://github.com/apple/axlearn}

CLFeb 16
Robust Bias Evaluation with FilBBQ: A Filipino Bias Benchmark for Question-Answering Language Models

Lance Calvin Lim Gamboa, Yue Feng, Mark Lee

With natural language generation becoming a popular use case for language models, the Bias Benchmark for Question-Answering (BBQ) has grown to be an important benchmark format for evaluating stereotypical associations exhibited by generative models. We expand the linguistic scope of BBQ and construct FilBBQ through a four-phase development process consisting of template categorization, culturally aware translation, new template construction, and prompt generation. These processes resulted in a bias test composed of more than 10,000 prompts which assess whether models demonstrate sexist and homophobic prejudices relevant to the Philippine context. We then apply FilBBQ on models trained in Filipino but do so with a robust evaluation protocol that improves upon the reliability and accuracy of previous BBQ implementations. Specifically, we account for models' response instability by obtaining prompt responses across multiple seeds and averaging the bias scores calculated from these distinctly seeded runs. Our results confirm both the variability of bias scores across different seeds and the presence of sexist and homophobic biases relating to emotion, domesticity, stereotyped queer interests, and polygamy. FilBBQ is available via GitHub.

LGJul 17, 2025
Apple Intelligence Foundation Language Models: Tech Report 2025

Ethan Li, Anders Boesen Lindbo Larsen, Chen Zhang et al. · apple-ml, cmu

We introduce two multilingual, multimodal foundation language models that power Apple Intelligence features across Apple devices and services: i a 3B-parameter on-device model optimized for Apple silicon through architectural innovations such as KV-cache sharing and 2-bit quantization-aware training; and ii a scalable server model built on a novel Parallel-Track Mixture-of-Experts PT-MoE transformer that combines track parallelism, mixture-of-experts sparse computation, and interleaved global-local attention to deliver high quality with competitive cost on Apple's Private Cloud Compute platform. Both models are trained on large-scale multilingual and multimodal datasets sourced via responsible web crawling, licensed corpora, and high-quality synthetic data, then further refined with supervised fine-tuning and reinforcement learning on a new asynchronous platform. The resulting models support several additional languages while understanding images and executing tool calls. In public benchmarks and human evaluations, both the server model and the on-device model match or surpass comparably sized open baselines. A new Swift-centric Foundation Models framework exposes guided generation, constrained tool calling, and LoRA adapter fine-tuning, allowing developers to integrate these capabilities with a few lines of code. The latest advancements in Apple Intelligence models are grounded in our Responsible AI approach with safeguards like content filtering and locale-specific evaluation, as well as our commitment to protecting our users' privacy with innovations like Private Cloud Compute.

LGJul 7, 2025
AXLearn: Modular Large Model Training on Heterogeneous Infrastructure

Mark Lee, Tom Gunter, Chang Lan et al.

We design and implement AXLearn, a production deep learning system that facilitates scalable and high-performance training of large deep learning models. Compared to other state-of-the-art deep learning systems, AXLearn has a unique focus on modularity and support for heterogeneous hardware infrastructure. AXLearn's internal interfaces between software components follow strict encapsulation, allowing different components to be assembled to facilitate rapid model development and experimentation on heterogeneous compute infrastructure. We introduce a novel method of quantifying modularity via Lines-of-Code (LoC)-complexity, which demonstrates how our system maintains constant complexity as we scale the components in the system, compared to linear or quadratic complexity in other systems. This allows integrating features such as Rotary Position Embeddings (RoPE) into AXLearn across hundred of modules with just 10 lines of code, compared to hundreds as required in other systems. At the same time, AXLearn maintains equivalent performance compared to state-of-the-art training systems. Finally, we share our experience in the development and operation of AXLearn.

CLDec 10, 2024
Filipino Benchmarks for Measuring Sexist and Homophobic Bias in Multilingual Language Models from Southeast Asia

Lance Calvin Lim Gamboa, Mark Lee

Bias studies on multilingual models confirm the presence of gender-related stereotypes in masked models processing languages with high NLP resources. We expand on this line of research by introducing Filipino CrowS-Pairs and Filipino WinoQueer: benchmarks that assess both sexist and anti-queer biases in pretrained language models (PLMs) handling texts in Filipino, a low-resource language from the Philippines. The benchmarks consist of 7,074 new challenge pairs resulting from our cultural adaptation of English bias evaluation datasets, a process that we document in detail to guide similar forthcoming efforts. We apply the Filipino benchmarks on masked and causal multilingual models, including those pretrained on Southeast Asian data, and find that they contain considerable amounts of bias. We also find that for multilingual models, the extent of bias learned for a particular language is influenced by how much pretraining data in that language a model was exposed to. Our benchmarks and insights can serve as a foundation for future work analyzing and mitigating bias in multilingual models.

CLOct 20, 2024
A Novel Interpretability Metric for Explaining Bias in Language Models: Applications on Multilingual Models from Southeast Asia

Lance Calvin Lim Gamboa, Mark Lee

Work on bias in pretrained language models (PLMs) focuses on bias evaluation and mitigation and fails to tackle the question of bias attribution and explainability. We propose a novel metric, the $\textit{bias attribution score}$, which draws from information theory to measure token-level contributions to biased behavior in PLMs. We then demonstrate the utility of this metric by applying it on multilingual PLMs, including models from Southeast Asia which have not yet been thoroughly examined in bias evaluation literature. Our results confirm the presence of sexist and homophobic bias in Southeast Asian PLMs. Interpretability and semantic analyses also reveal that PLM bias is strongly induced by words relating to crime, intimate relationships, and helping among other discursive categories, suggesting that these are topics where PLMs strongly reproduce bias from pretraining data and where PLMs should be used with more caution.

CLJun 8, 2025
Bias Attribution in Filipino Language Models: Extending a Bias Interpretability Metric for Application on Agglutinative Languages

Lance Calvin Lim Gamboa, Yue Feng, Mark Lee

Emerging research on bias attribution and interpretability have revealed how tokens contribute to biased behavior in language models processing English texts. We build on this line of inquiry by adapting the information-theoretic bias attribution score metric for implementation on models handling agglutinative languages, particularly Filipino. We then demonstrate the effectiveness of our adapted method by using it on a purely Filipino model and on three multilingual models: one trained on languages worldwide and two on Southeast Asian data. Our results show that Filipino models are driven towards bias by words pertaining to people, objects, and relationships, entity-based themes that stand in contrast to the action-heavy nature of bias-contributing themes in English (i.e., criminal, sexual, and prosocial behaviors). These findings point to differences in how English and non-English models process inputs linked to sociodemographic groups and bias.

CLDec 5, 2025
Capturing Classic Authorial Style in Long-Form Story Generation with GRPO Fine-Tuning

Jinlong Liu, Mohammed Bahja, Venelin Kovatchev et al.

Evaluating and optimising authorial style in long-form story generation remains challenging because style is often assessed with ad hoc prompting and is frequently conflated with overall writing quality. We propose a two-stage pipeline. First, we train a dedicated style-similarity judge by fine-tuning a sentence-transformer with authorship-verification supervision, and calibrate its similarity outputs into a bounded $[0,1]$ reward. Second, we use this judge as the primary reward in Group Relative Policy Optimization (GRPO) to fine-tune an 8B story generator for style-conditioned writing, avoiding the accept/reject supervision required by Direct Preference Optimization (DPO). Across four target authors (Mark Twain, Jane Austen, Charles Dickens, Thomas Hardy), the GRPO-trained 8B model achieves higher style scores than open-weight baselines, with an average style score of 0.893 across authors. These results suggest that AV-calibrated reward modelling provides a practical mechanism for controllable style transfer in long-form generation under a moderate model size and training budget.

ETOct 8, 2025
Multi-objective Bayesian Optimization with Human-in-the-Loop for Flexible Neuromorphic Electronics Fabrication

Benius Dunn, Javier Meza-Arroyo, Armi Tiihonen et al.

Neuromorphic computing hardware enables edge computing and can be implemented in flexible electronics for novel applications. Metal oxide materials are promising candidates for fabricating flexible neuromorphic electronics, but suffer from processing constraints due to the incompatibilities between oxides and polymer substrates. In this work, we use photonic curing to fabricate flexible metal-insulator-metal capacitors with solution-processible aluminum oxide dielectric tailored for neuromorphic applications. Because photonic curing outcomes depend on many input parameters, identifying an optimal processing condition through a traditional grid-search approach is unfeasible. Here, we apply multi-objective Bayesian optimization (MOBO) to determine photonic curing conditions that optimize the trade-off between desired electrical properties of large capacitance-frequency dispersion and low leakage current. Furthermore, we develop a human-in-the-loop (HITL) framework for incorporating failed experiments into the MOBO machine learning workflow, demonstrating that this framework accelerates optimization by reducing the number of experimental rounds required. Once optimization is concluded, we analyze different Pareto-optimal conditions to tune the dielectrics properties and provide insight into the importance of different inputs through Shapley Additive exPlanations analysis. The demonstrated framework of combining MOBO with HITL feedback can be adapted to a wide range of multi-objective experimental problems that have interconnected inputs and high experimental failure rates to generate usable results for machine learning models.

CLAug 27, 2025
Social Bias in Multilingual Language Models: A Survey

Lance Calvin Lim Gamboa, Yue Feng, Mark Lee

Pretrained multilingual models exhibit the same social bias as models processing English texts. This systematic review analyzes emerging research that extends bias evaluation and mitigation approaches into multilingual and non-English contexts. We examine these studies with respect to linguistic diversity, cultural awareness, and their choice of evaluation metrics and mitigation techniques. Our survey illuminates gaps in the field's dominant methodological design choices (e.g., preference for certain languages, scarcity of multilingual mitigation experiments) while cataloging common issues encountered and solutions implemented in adapting bias benchmarks across languages and cultures. Drawing from the implications of our findings, we chart directions for future research that can reinforce the multilingual bias literature's inclusivity, cross-cultural appropriateness, and alignment with state-of-the-art NLP advancements.

CVMar 14, 2024
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier et al.

In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that for large-scale multimodal pre-training using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published pre-training results. Further, we show that the image encoder together with image resolution and the image token count has substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models up to 30B parameters, including both dense models and mixture-of-experts (MoE) variants, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning, and multi-image reasoning, enabling few-shot chain-of-thought prompting.

CVNov 12, 2021
The self-supervised spectral-spatial attention-based transformer network for automated, accurate prediction of crop nitrogen status from UAV imagery

Xin Zhang, Liangxiu Han, Tam Sobeih et al.

Nitrogen (N) fertilizer is routinely applied by farmers to increase crop yields. At present, farmers often over-apply N fertilizer in some locations or at certain times because they do not have high-resolution crop N status data. N-use efficiency can be low, with the remaining N lost to the environment, resulting in higher production costs and environmental pollution. Accurate and timely estimation of N status in crops is crucial to improving cropping systems' economic and environmental sustainability. Destructive approaches based on plant tissue analysis are time consuming and impractical over large fields. Recent advances in remote sensing and deep learning have shown promise in addressing the aforementioned challenges in a non-destructive way. In this work, we propose a novel deep learning framework: a self-supervised spectral-spatial attention-based vision transformer (SSVT). The proposed SSVT introduces a Spectral Attention Block (SAB) and a Spatial Interaction Block (SIB), which allows for simultaneous learning of both spatial and spectral features from UAV digital aerial imagery, for accurate N status prediction in wheat fields. Moreover, the proposed framework introduces local-to-global self-supervised learning to help train the model from unlabelled data. The proposed SSVT has been compared with five state-of-the-art models including: ResNet, RegNet, EfficientNet, EfficientNetV2 and the original vision transformer on both testing and independent datasets. The proposed approach achieved high accuracy (0.96) with good generalizability and reproducibility for wheat N status estimation.

CLJun 3, 2021
Can vectors read minds better than experts? Comparing data augmentation strategies for the automated scoring of children's mindreading ability

Venelin Kovatchev, Phillip Smith, Mark Lee et al.

In this paper we implement and compare 7 different data augmentation strategies for the task of automatic scoring of children's ability to understand others' thoughts, feelings, and desires (or "mindreading"). We recruit in-domain experts to re-annotate augmented samples and determine to what extent each strategy preserves the original rating. We also carry out multiple experiments to measure how much each augmentation strategy improves the performance of automatic scoring systems. To determine the capabilities of automatic systems to generalize to unseen data, we create UK-MIND-20 - a new corpus of children's performance on tests of mindreading, consisting of 10,320 question-answer pairs. We obtain a new state-of-the-art performance on the MIND-CA corpus, improving macro-F1-score by 6 points. Results indicate that both the number of training examples and the quality of the augmentation strategies affect the performance of the systems. The task-specific augmentations generally outperform task-agnostic augmentations. Automatic augmentations based on vectors (GloVe, FastText) perform the worst. We find that systems trained on MIND-CA generalize well to UK-MIND-20. We demonstrate that data augmentation strategies also improve the performance on unseen data.

CLNov 16, 2020
"What is on your mind?" Automated Scoring of Mindreading in Childhood and Early Adolescence

Venelin Kovatchev, Phillip Smith, Mark Lee et al.

In this paper we present the first work on the automated scoring of mindreading ability in middle childhood and early adolescence. We create MIND-CA, a new corpus of 11,311 question-answer pairs in English from 1,066 children aged 7 to 14. We perform machine learning experiments and carry out extensive quantitative and qualitative evaluation. We obtain promising results, demonstrating the applicability of state-of-the-art NLP solutions to a new domain and task.

CVJun 20, 2019
On Physical Adversarial Patches for Object Detection

Mark Lee, Zico Kolter

In this paper, we demonstrate a physical adversarial patch attack against object detectors, notably the YOLOv3 detector. Unlike previous work on physical object detection attacks, which required the patch to overlap with the objects being misclassified or avoiding detection, we show that a properly designed patch can suppress virtually all the detected objects in the image. That is, we can place the patch anywhere in the image, causing all existing objects in the image to be missed entirely by the detector, even those far away from the patch itself. This in turn opens up new lines of physical attacks against object detection systems, which require no modification of the objects in a scene. A demo of the system can be found at https://youtu.be/WXnQjbZ1e7Y.

SEApr 27, 2013
SBVR vs OCL: A Comparative Analysis of Standards

Imran Sarwar Bajwa, Behzad Bordbar, Mark Lee

In software modelling, the designers have to produce UML visual models with software constraints. Similarly, in business modelling, designers have to model business processes using business constraints (business rules). Constraints are the key components in the skeleton of business or software models. A designer has to write constraints to semantically compliment business models or UML models and finally implementing the constraints into business processes or source code. Business constraints/rules can be written using SBVR (Semantics of Business Vocabulary and Rules) while OCL (Object Constraint Language) is the well-known medium for writing software constraints. SBVR and OCL are two significant standards from OMG. Both standards are principally different as SBVR is typically used in business domains and OCL is employed to compliment software models. However, we have identified a few similarities in both standards that are interesting to study. In this paper, we have performed a comparative analysis of both standards as we are looking for a mechanism for automatic transformation of SBVR to OCL. The major emphasis of the study is to highlight principal features of SBVR and OCL such as similarities, differences and key parameters on which these both standards can work together.