AISep 15, 2024
Towards Data-Centric RLHF: Simple Metrics for Preference Dataset ComparisonJudy Hanwen Shen, Archit Sharma, Jun Qin · stanford
The goal of aligning language models to human preferences requires data that reveal these preferences. Ideally, time and money can be spent carefully collecting and tailoring bespoke preference data to each downstream application. However, in practice, a select few publicly available preference datasets are often used to train reward models for reinforcement learning from human feedback (RLHF). While new preference datasets are being introduced with increasing frequency, there are currently no existing efforts to measure and compare these datasets. In this paper, we systematically study preference datasets through three perspectives: scale, label noise, and information content. We propose specific metrics for each of these perspectives and uncover different axes of comparison for a better understanding of preference datasets. Our work is a first step towards a data-centric approach to alignment by providing perspectives that aid in training efficiency and iterative data collection for RLHF.
AIJul 29, 2024
Apple Intelligence Foundation Language ModelsTom Gunter, Zirui Wang, Chong Wang et al.
We present foundation language models developed to power Apple Intelligence features, including a ~3 billion parameter model designed to run efficiently on devices and a large server-based language model designed for Private Cloud Compute. These models are designed to perform a wide range of tasks efficiently, accurately, and responsibly. This report describes the model architecture, the data used to train the model, the training process, how the models are optimized for inference, and the evaluation results. We highlight our focus on Responsible AI and how the principles are applied throughout the model development.
LGJul 17, 2025
Apple Intelligence Foundation Language Models: Tech Report 2025Ethan Li, Anders Boesen Lindbo Larsen, Chen Zhang et al. · apple-ml, cmu
We introduce two multilingual, multimodal foundation language models that power Apple Intelligence features across Apple devices and services: i a 3B-parameter on-device model optimized for Apple silicon through architectural innovations such as KV-cache sharing and 2-bit quantization-aware training; and ii a scalable server model built on a novel Parallel-Track Mixture-of-Experts PT-MoE transformer that combines track parallelism, mixture-of-experts sparse computation, and interleaved global-local attention to deliver high quality with competitive cost on Apple's Private Cloud Compute platform. Both models are trained on large-scale multilingual and multimodal datasets sourced via responsible web crawling, licensed corpora, and high-quality synthetic data, then further refined with supervised fine-tuning and reinforcement learning on a new asynchronous platform. The resulting models support several additional languages while understanding images and executing tool calls. In public benchmarks and human evaluations, both the server model and the on-device model match or surpass comparably sized open baselines. A new Swift-centric Foundation Models framework exposes guided generation, constrained tool calling, and LoRA adapter fine-tuning, allowing developers to integrate these capabilities with a few lines of code. The latest advancements in Apple Intelligence models are grounded in our Responsible AI approach with safeguards like content filtering and locale-specific evaluation, as well as our commitment to protecting our users' privacy with innovations like Private Cloud Compute.
CVSep 25, 2025
Dual-supervised Asymmetric Co-training for Semi-supervised Medical Domain GeneralizationJincai Song, Haipeng Chen, Jun Qin et al.
Semi-supervised domain generalization (SSDG) in medical image segmentation offers a promising solution for generalizing to unseen domains during testing, addressing domain shift challenges and minimizing annotation costs. However, conventional SSDG methods assume labeled and unlabeled data are available for each source domain in the training set, a condition that is not always met in practice. The coexistence of limited annotation and domain shift in the training set is a prevalent issue. Thus, this paper explores a more practical and challenging scenario, cross-domain semi-supervised domain generalization (CD-SSDG), where domain shifts occur between labeled and unlabeled training data, in addition to shifts between training and testing sets. Existing SSDG methods exhibit sub-optimal performance under such domain shifts because of inaccurate pseudolabels. To address this issue, we propose a novel dual-supervised asymmetric co-training (DAC) framework tailored for CD-SSDG. Building upon the co-training paradigm with two sub-models offering cross pseudo supervision, our DAC framework integrates extra feature-level supervision and asymmetric auxiliary tasks for each sub-model. This feature-level supervision serves to address inaccurate pseudo supervision caused by domain shifts between labeled and unlabeled data, utilizing complementary supervision from the rich feature space. Additionally, two distinct auxiliary self-supervised tasks are integrated into each sub-model to enhance domain-invariant discriminative feature learning and prevent model collapse. Extensive experiments on real-world medical image segmentation datasets, i.e., Fundus, Polyp, and SCGM, demonstrate the robust generalizability of the proposed DAC framework.
CVSep 8, 2017
DeepFeat: A Bottom Up and Top Down Saliency Model Based on Deep Features of Convolutional Neural NetsAli Mahdi, Jun Qin
A deep feature based saliency model (DeepFeat) is developed to leverage the understanding of the prediction of human fixations. Traditional saliency models often predict the human visual attention relying on few level image cues. Although such models predict fixations on a variety of image complexities, their approaches are limited to the incorporated features. In this study, we aim to provide an intuitive interpretation of convolu- tional neural network deep features by combining low and high level visual factors. We exploit four evaluation metrics to evaluate the correspondence between the proposed framework and the ground-truth fixations. The key findings of the results demon- strate that the DeepFeat algorithm, incorporation of bottom up and top down saliency maps, outperforms the individual bottom up and top down approach. Moreover, in comparison to nine 9 state-of-the-art saliency models, our proposed DeepFeat model achieves satisfactory performance based on all four evaluation metrics.
CVJun 1, 2017
Line Profile Based Segmentation Algorithm for Touching Corn KernelsAli Mahdi, Jun Qin
Image segmentation of touching objects plays a key role in providing accurate classification for computer vision technologies. A new line profile based imaging segmentation algorithm has been developed to provide a robust and accurate segmentation of a group of touching corns. The performance of the line profile based algorithm has been compared to a watershed based imaging segmentation algorithm. Both algorithms are tested on three different patterns of images, which are isolated corns, single-lines, and random distributed formations. The experimental results show that the algorithm can segment a large number of touching corn kernels efficiently and accurately.
SDDec 16, 2016
Neural networks based EEG-Speech ModelsPengfei Sun, Jun Qin
In this paper, we propose an end-to-end neural network (NN) based EEG-speech (NES) modeling framework, in which three network structures are developed to map imagined EEG signals to phonemes. The proposed NES models incorporate a language model based EEG feature extraction layer, an acoustic feature mapping layer, and a restricted Boltzmann machine (RBM) based the feature learning layer. The NES models can jointly realize the representation of multichannel EEG signals and the projection of acoustic speech signals. Among three proposed NES models, two augmented networks utilize spoken EEG signals as either bias or gate information to strengthen the feature learning and translation of imagined EEG signals. Experimental results show that all three proposed NES models outperform the baseline support vector machine (SVM) method on EEG-speech classification. With respect to binary classification, our approach achieves comparable results relative to deep believe network approach.
SDNov 1, 2016
Enhanced Factored Three-Way Restricted Boltzmann Machines for Speech DetectionPengfei Sun, Jun Qin
In this letter, we propose enhanced factored three way restricted Boltzmann machines (EFTW-RBMs) for speech detection. The proposed model incorporates conditional feature learning by multiplying the dynamical state of the third unit, which allows a modulation over the visible-hidden node pairs. Instead of stacking previous frames of speech as the third unit in a recursive manner, the correlation related weighting coefficients are assigned to the contextual neighboring frames. Specifically, a threshold function is designed to capture the long-term features and blend the globally stored speech structure. A factored low rank approximation is introduced to reduce the parameters of the three-dimensional interaction tensor, on which non-negative constraint is imposed to address the sparsity characteristic. The validations through the area-under-ROC-curve (AUC) and signal distortion ratio (SDR) show that our approach outperforms several existing 1D and 2D (i.e., time and time-frequency domain) speech detection algorithms in various noisy environments.
SDOct 3, 2016
Speech Enhancement via Two-Stage Dual Tree Complex Wavelet Packet Transform with a Speech Presence Probability EstimatorPengfei Sun, Jun Qin
In this paper, a two-stage dual tree complex wavelet packet transform (DTCWPT) based speech enhancement algorithm has been proposed, in which a speech presence probability (SPP) estimator and a generalized minimum mean squared error (MMSE) estimator are developed. To overcome the drawback of signal distortions caused by down sampling of WPT, a two-stage analytic decomposition concatenating undecimated WPT (UWPT) and decimated WPT is employed. An SPP estimator in the DTCWPT domain is derived based on a generalized Gamma distribution of speech, and Gaussian noise assumption. The validation results show that the proposed algorithm can obtain enhanced perceptual evaluation of speech quality (PESQ), and segmental signal-to-noise ratio (SegSNR) at low SNR nonstationary noise, compared with other four state-of-the-art speech enhancement algorithms, including optimally modified LSA (OM-LSA), soft masking using a posteriori SNR uncertainty (SMPO), a posteriori SPP based MMSE estimation (MMSE-SPP), and adaptive Bayesian wavelet thresholding (BWT).
SDSep 29, 2016
Semi-supervised Speech Enhancement in Envelop and Details SubspacesPengfei Sun, Jun Qin
In this study, we propose a modulation decoupling based single channel speech enhancement subspace framework, in which the spectrogram of noisy speech is decoupled as the product of a spectral envelop subspace and a spectral details subspace. This decoupling approach provides a method to specifically work on elimination of those noises that greatly affect the intelligibility. Two supervised low-rank and sparse decomposition schemes are developed in the spectral envelop subspace to obtain a robust recovery of speech components. A Bayesian formulation of non-negative factorization is used to learn the speech dictionary from the spectral envelop subspace of clean speech samples. In the spectral details subspace, a standard robust principal component analysis is implemented to extract the speech components. The validation results show that compared with four speech enhancement algorithms, including MMSE-SPP, NMF-RPCA, RPCA, and LARC, the proposed MS based algorithms achieve satisfactory performance on improving perceptual quality, and especially speech intelligibility.
SDSep 29, 2016
Low Rank and Sparsity Analysis Applied to Speech Enhancement via Online Estimated DictionaryPengfei Sun, Jun Qin
We propose an online estimated dictionary based single channel speech enhancement algorithm, which focuses on low rank and sparse matrix decomposition. In this proposed algorithm, a noisy speech spectral matrix is considered as the summation of low rank background noise components and an activation of the online speech dictionary, on which both low rank and sparsity constraints are imposed. This decomposition takes the advantage of local estimated dictionary high expressiveness on speech components. The local dictionary can be obtained through estimating the speech presence probability by applying Expectation Maximal algorithm, in which a generalized Gamma prior for speech magnitude spectrum is used. The evaluation results show that the proposed algorithm achieves significant improvements when compared to four other speech enhancement algorithms.