LG · Jul 1, 2022 · Code
When Does Differentially Private Learning Not Suffer in High Dimensions?
Xuechen Li, Daogao Liu, Tatsunori Hashimoto et al. · stanford
Large pretrained models can be privately fine-tuned to achieve performance approaching that of non-private models. A common theme in these results is the surprising observation that high-dimensional models can achieve favorable privacy-utility trade-offs. This seemingly contradicts known results on the model-size dependence of differentially private convex learning and raises the following research question: When does the performance of differentially private learning not degrade with increasing model size? We identify the magnitude of gradients projected onto subspaces as a key factor that determines performance. To precisely characterize this for private convex learning, we introduce a condition on the objective that we term restricted Lipschitz continuity and derive improved bounds for the excess empirical and population risks that are dimension-independent under additional conditions. We empirically show that in private fine-tuning of large language models, gradients obtained during fine-tuning are mostly controlled by a few principal components. This behavior is similar to conditions under which we obtain dimension-independent bounds in convex settings. Our theoretical and empirical results together provide a possible explanation for recent successes in large-scale private fine-tuning. Code to reproduce our results can be found at https://github.com/lxuechen/private-transformers/tree/main/examples/classification/spectral_analysis.
CR · Feb 11, 2023
Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks
Daniel Kang, Xuechen Li, Ion Stoica et al. · stanford
Recent advances in instruction-following large language models (LLMs) have led to dramatic improvements in a range of NLP tasks. Unfortunately, we find that the same improved capabilities amplify the dual-use risks for malicious purposes of these models. Dual-use is difficult to prevent as instruction-following capabilities now enable standard attacks from computer security. The capabilities of these instruction-following LLMs provide strong economic incentives for dual-use by malicious actors. In particular, we show that instruction-following LLMs can produce targeted malicious content, including hate speech and scams, bypassing in-the-wild defenses implemented by LLM API vendors. Our analysis shows that this content can be generated economically and at cost likely lower than with human effort alone. Together, our findings suggest that LLMs will increasingly attract more sophisticated adversaries and attacks, and addressing these attacks may require new approaches to mitigations.
CY · Mar 28, 2023
Foundation Models and Fair Use
Peter Henderson, Xuechen Li, Dan Jurafsky et al. · stanford
Existing foundation models are trained on copyrighted material. Deploying these models can pose both legal and ethical risks when data creators fail to receive appropriate attribution or compensation. In the United States and several other countries, copyrighted content may be used to build foundation models without incurring liability due to the fair use doctrine. However, there is a caveat: If the model produces output that is similar to copyrighted data, particularly in scenarios that affect the market of that data, fair use may no longer apply to the output of the model. In this work, we emphasize that fair use is not guaranteed, and additional work may be necessary to keep model development and deployment squarely in the realm of fair use. First, we survey the potential risks of developing and deploying foundation models based on copyrighted content. We review relevant U.S. case law, drawing parallels to existing and potential applications for generating text, source code, and visual art. Experiments confirm that popular foundation models can generate content considerably similar to copyrighted material. Second, we discuss technical mitigations that can help foundation models stay in line with fair use. We argue that more research is needed to align mitigation strategies with the current state of the law. Lastly, we suggest that the law and technical mitigations should co-evolve. For example, coupled with other policy mechanisms, the law could more explicitly consider safe harbors when strong technical tools are used to mitigate infringement harms. This co-evolution may help strike a balance between intellectual property and innovation, which speaks to the original goal of fair use. But we emphasize that the strategies we describe here are not a panacea and more work is needed to develop policies that address the potential harms of foundation models.
LG · Dec 3, 2022
Exploring the Limits of Differentially Private Deep Learning with Group-wise Clipping
Jiyan He, Xuechen Li, Da Yu et al. · microsoft-research, stanford
Differentially private deep learning has recently witnessed advances in computational efficiency and privacy-utility trade-off. We explore whether further improvements along the two axes are possible and provide affirmative answers leveraging two instantiations of group-wise clipping. To reduce the compute time overhead of private learning, we show that per-layer clipping, where the gradient of each neural network layer is clipped separately, allows clipping to be performed in conjunction with backpropagation in differentially private optimization. This results in private learning that is as memory-efficient and almost as fast per training update as non-private learning for many workflows of interest. While per-layer clipping with constant thresholds tends to underperform standard flat clipping, per-layer clipping with adaptive thresholds matches or outperforms flat clipping under given training epoch constraints, hence attaining similar or better task performance within less wall time. To explore the limits of scaling (pretrained) models in differentially private deep learning, we privately fine-tune the 175-billion-parameter GPT-3. We bypass scaling challenges associated with clipping gradients that are distributed across multiple devices with per-device clipping, which clips the gradient of each model piece separately on its host device. Privately fine-tuning GPT-3 with per-device clipping achieves a task performance at ε = 1 better than what is attainable by non-privately fine-tuning the largest GPT-2 on a summarization task.
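The contrast between flat and per-layer clipping can be made concrete with a minimal pure-Python sketch. The function names, the toy gradients, and the equal split of the clipping budget across layers are assumptions made for illustration; this is not the authors' implementation, and noise addition and adaptive thresholds are left out.

```python
import math

def l2_norm(vec):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in vec))

def flat_clip(per_layer_grads, threshold):
    """Flat clipping: rescale one example's whole gradient (all layers
    concatenated) so its overall norm is at most `threshold`.
    This needs every layer's gradient before any rescaling can happen."""
    total = math.sqrt(sum(l2_norm(g) ** 2 for g in per_layer_grads))
    scale = min(1.0, threshold / (total + 1e-12))
    return [[x * scale for x in g] for g in per_layer_grads]

def per_layer_clip(per_layer_grads, thresholds):
    """Per-layer clipping: rescale each layer's gradient on its own.
    Each layer only needs its own norm, so it could be clipped as soon as
    backpropagation produces it."""
    clipped = []
    for g, c in zip(per_layer_grads, thresholds):
        scale = min(1.0, c / (l2_norm(g) + 1e-12))
        clipped.append([x * scale for x in g])
    return clipped

# Toy per-example gradient over two layers (norms 5.0 and 0.5).
grads = [[3.0, 4.0], [0.3, 0.4]]
C = 1.0
# Assumed budget split: thresholds whose squares sum to C**2, so the
# overall norm after per-layer clipping is still at most C.
thresholds = [C / math.sqrt(2)] * 2

flat = flat_clip(grads, C)
layered = per_layer_clip(grads, thresholds)
```

In this toy case flat clipping shrinks both layers by the same factor, while per-layer clipping leaves the second layer untouched because its norm is already below its threshold.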
CL · Oct 25, 2022
Synthetic Text Generation with Differential Privacy: A Simple and Practical Recipe
Xiang Yue, Huseyin A. Inan, Xuechen Li et al. · stanford
Privacy concerns have attracted increasing attention in data-driven products due to the tendency of machine learning models to memorize sensitive training data. Generating synthetic versions of such data with a formal privacy guarantee, such as differential privacy (DP), provides a promising path to mitigating these privacy concerns, but previous approaches in this direction have typically failed to produce synthetic data of high quality. In this work, we show that a simple and practical recipe in the text domain is effective: simply fine-tuning a pretrained generative language model with DP enables the model to generate useful synthetic text with strong privacy protection. Through extensive empirical analyses on both benchmark and private customer data, we demonstrate that our method produces synthetic text that is competitive in terms of utility with its non-private counterpart while also providing strong protection against potential privacy leakages.
CL · Nov 16, 2022
Holistic Evaluation of Language Models
Percy Liang, Rishi Bommasani, Tony Lee et al. · stanford
Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. First, we taxonomize the vast space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata) that are of interest for LMs. Then we select a broad subset based on coverage and feasibility, noting what's missing or underrepresented (e.g. question answering for neglected English dialects, metrics for trustworthiness). Second, we adopt a multi-metric approach: We measure 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) for each of 16 core scenarios when possible (87.5% of the time). This ensures metrics beyond accuracy don't fall by the wayside, and that trade-offs are clearly exposed. We also perform 7 targeted evaluations, based on 26 targeted scenarios, to analyze specific aspects (e.g. reasoning, disinformation). Third, we conduct a large-scale evaluation of 30 prominent language models (spanning open, limited-access, and closed models) on all 42 scenarios, 21 of which were not previously used in mainstream LM evaluation. Prior to HELM, models on average were evaluated on just 17.9% of the core HELM scenarios, with some prominent models not sharing a single scenario in common. We improve this to 96.0%: now all 30 models have been densely benchmarked on the same core scenarios and metrics under standardized conditions. Our evaluation surfaces 25 top-level findings. For full transparency, we release all raw model prompts and completions publicly for further analysis, as well as a general modular toolkit. We intend for HELM to be a living benchmark for the community, continuously updated with new scenarios, metrics, and models.
LG · Oct 15, 2022
A Closer Look at the Calibration of Differentially Private Learners
Hanlin Zhang, Xuechen Li, Prithviraj Sen et al. · stanford
We systematically study the calibration of classifiers trained with differentially private stochastic gradient descent (DP-SGD) and observe miscalibration across a wide range of vision and language tasks. Our analysis identifies per-example gradient clipping in DP-SGD as a major cause of miscalibration, and we show that existing approaches for improving calibration with differential privacy only provide marginal improvements in calibration error while occasionally causing large degradations in accuracy. As a solution, we show that differentially private variants of post-processing calibration methods such as temperature scaling and Platt scaling are surprisingly effective and have negligible utility cost to the overall model. Across 7 tasks, temperature scaling and Platt scaling with DP-SGD result in an average 3.1-fold reduction in the in-domain expected calibration error and incur at most a minor percentage drop in accuracy.
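As an illustration of how temperature scaling could be fitted with a DP-SGD-style update, here is a minimal pure-Python sketch. It assumes a single scalar temperature T, per-example gradient clipping, and Gaussian noise on the summed gradient; the function names, hyperparameters, and toy data are hypothetical, privacy accounting is omitted, and the paper's actual recipe may differ.

```python
import math
import random

def softmax(logits, T):
    """Softmax of logits / T, computed stably."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logits_list, labels, T):
    """Mean negative log-likelihood at temperature T."""
    losses = [-math.log(softmax(z, T)[y]) for z, y in zip(logits_list, labels)]
    return sum(losses) / len(losses)

def grad_T(logits, label, T):
    """d/dT of one example's NLL: (z_y - E_p[z]) / T^2."""
    p = softmax(logits, T)
    expected = sum(pj * zj for pj, zj in zip(p, logits))
    return (logits[label] - expected) / (T * T)

def fit_temperature(logits_list, labels, steps=200, lr=0.5,
                    clip=1.0, noise_multiplier=0.0, seed=0):
    """Noisy clipped gradient descent on the scalar T.
    noise_multiplier=0.0 gives the non-private fit; a positive value adds
    Gaussian noise with std noise_multiplier * clip to the summed gradient."""
    rng = random.Random(seed)
    T, n = 1.0, len(labels)
    for _ in range(steps):
        clipped = [max(-clip, min(clip, grad_T(z, y, T)))
                   for z, y in zip(logits_list, labels)]
        noisy_sum = sum(clipped) + rng.gauss(0.0, noise_multiplier * clip)
        T = max(0.05, T - lr * noisy_sum / n)  # keep T positive
    return T

# Toy held-out set: an overconfident classifier that is right 3 times out of 4.
val_logits = [[4.0, 0.0]] * 4
val_labels = [0, 0, 0, 1]
T_fit = fit_temperature(val_logits, val_labels)
```

On this toy set the fitted temperature rises above 1, softening the overconfident predictions and lowering the held-out negative log-likelihood. Because only one scalar is learned, the noise needed for privacy perturbs a single parameter rather than the whole model.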
CV · Oct 29, 2022
ImplantFormer: Vision Transformer based Implant Position Regression Using Dental CBCT Data
Xinquan Yang, Xuguang Li, Xuechen Li et al.
Implant prosthesis is the most appropriate treatment for dentition defects or dentition loss, and it usually involves a surgical guide design process to decide the implant position. However, this design relies heavily on the subjective experience of dentists. In this paper, a transformer-based Implant Position Regression Network, ImplantFormer, is proposed to automatically predict the implant position from oral CBCT data. We propose to predict the implant position using the 2D axial view of the tooth crown area and to fit a centerline of the implant to obtain the actual implant position at the tooth root. A convolutional stem and a decoder are designed, respectively, to coarsely extract image features before patch embedding and to integrate multi-level feature maps for robust prediction. As both long-range relationships and local features are involved, our approach can better represent global information and achieves better localization performance. Extensive experiments on a dental implant dataset with five-fold cross-validation demonstrate that the proposed ImplantFormer outperforms existing methods.
CV · Jun 26, 2023
TCEIP: Text Condition Embedded Regression Network for Dental Implant Position Prediction
Xinquan Yang, Jinheng Xie, Xuguang Li et al.
Although deep neural networks have been proposed to assist dentists in planning the location of dental implants, most of them target simple cases where only one tooth is missing. As a result, existing methods do not work well when multiple teeth are missing and easily generate false predictions when the teeth are sparsely distributed. In this paper, we integrate a weak supervision text, the target region, into the implant position regression network to address these issues. We propose a text condition embedded implant position regression network (TCEIP) that embeds the text condition into the encoder-decoder framework to improve regression performance. A cross-modal interaction consisting of cross-modal attention (CMA) and a knowledge alignment module (KAM) is proposed to facilitate the interaction between image and text features. The CMA module performs cross-attention between the image feature and the text condition, and the KAM mitigates the knowledge gap between the image feature and the image encoder of CLIP. Extensive experiments on a dental implant dataset with five-fold cross-validation demonstrate that the proposed TCEIP outperforms existing methods.
CV · Aug 7, 2022
Sample hardness based gradient loss for long-tailed cervical cell detection
Minmin Liu, Xuechen Li, Xiangbo Gao et al.
Due to the difficulty of collecting and annotating cancer samples, cervical cancer datasets usually exhibit a long-tailed data distribution. When training a detector to detect cancer cells in a whole slide image (WSI) captured from a TCT (ThinPrep Cytology Test) specimen, head categories (e.g. normal cells and inflammatory cells) typically have far more samples than tail categories (e.g. cancer cells). Most existing state-of-the-art long-tailed learning methods in object detection focus on category distribution statistics to solve the long-tailed problem without considering the "hardness" of each sample. To address this, we propose a Grad-Libra Loss that leverages gradients to dynamically calibrate the degree of hardness of each sample for different categories and re-balances the gradients of positive and negative samples. Our loss thus helps the detector put more emphasis on hard samples in both head and tail categories. Extensive experiments on a long-tailed TCT WSI dataset show that mainstream detectors (e.g. RepPoints, FCOS, ATSS, YOLOF) trained with the proposed Grad-Libra Loss achieve a much higher mAP (by 7.8%) than those trained with cross-entropy classification loss.
CV · Aug 10, 2023
TCSloT: Text Guided 3D Context and Slope Aware Triple Network for Dental Implant Position Prediction
Xinquan Yang, Jinheng Xie, Xuechen Li et al.
In implant prosthesis treatment, the surgical guide is used to ensure accurate implantation. However, its design relies heavily on manually locating the implant position. Although deep neural networks have been proposed to assist dentists in locating the implant position, most of them take a single slice as input, which neither fully exploits 3D contextual information nor accounts for the influence of implant slope. In this paper, we design a Text Guided 3D Context and Slope Aware Triple Network (TCSloT), which perceives contextual information from multiple adjacent slices and is aware of variation in implant slopes. A Texture Variation Perception (TVP) module is designed to process the multiple slices and capture the texture variation among them, and a Slope-Aware Loss (SAL) is proposed to dynamically assign varying weights for the regression head. Additionally, we design a conditional text guidance (CTG) module to integrate the text condition (i.e., left, middle and right) from CLIP to assist implant position prediction. Extensive experiments on a dental implant dataset with five-fold cross-validation demonstrate that the proposed TCSloT outperforms existing methods.
CV · Jun 20, 2024 · Code
SSAD: Self-supervised Auxiliary Detection Framework for Panoramic X-ray based Dental Disease Diagnosis
Zijian Cai, Xinquan Yang, Xuguang Li et al.
Panoramic X-ray is a simple and effective tool for diagnosing dental diseases in clinical practice. Deep learning models developed to assist dentists in interpreting panoramic X-rays mostly suffer from limited annotated data, since annotation requires a dentist's expertise and considerable time. Although self-supervised learning (SSL) has been proposed to address this challenge, its two-stage process of pretraining and fine-tuning requires even more training time and computational resources. In this paper, we present a self-supervised auxiliary detection (SSAD) framework, which is plug-and-play and compatible with any detector. It consists of a reconstruction branch and a detection branch. Both branches are trained simultaneously, sharing the same encoder, without the need for fine-tuning. The reconstruction branch learns to restore the tooth texture of healthy or diseased teeth, while the detection branch uses these learned features for diagnosis. To enhance the encoder's ability to capture fine-grained features, we incorporate the image encoder of SAM to construct a texture consistency (TC) loss, which extracts image embeddings from the input and output of the reconstruction branch and then enforces both embeddings into the same feature space. Extensive experiments on the public DENTEX dataset across three detection tasks demonstrate that the proposed SSAD framework achieves state-of-the-art performance compared to mainstream object detection methods and SSL methods. The code is available at https://github.com/Dylonsword/SSAD
LG · May 22, 2023 · Code
AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback
Yann Dubois, Xuechen Li, Rohan Taori et al.
Large language models (LLMs) such as ChatGPT have seen widespread adoption due to their strong instruction-following abilities. Developing these LLMs involves a complex yet poorly understood workflow requiring training with human feedback. Replicating and understanding this instruction-following requires tackling three major challenges: the high cost of data collection, the lack of trustworthy evaluation, and the absence of reference method implementations. We address these challenges with AlpacaFarm, a simulator that enables research and development for learning from feedback at a low cost. First, we design LLM prompts to simulate human feedback that are 50x cheaper than crowdworkers and display high agreement with humans. Second, we propose an automatic evaluation and validate it against human instructions obtained on real-world interactions. Third, we contribute reference implementations for several methods (PPO, DPO, best-of-n, expert iteration, and more) that learn from pairwise feedback. Finally, as an end-to-end validation of AlpacaFarm, we train and evaluate eleven models on 10k pairs of real human feedback and show that rankings of models trained in AlpacaFarm match rankings of models trained on human data. As a demonstration of the research possible in AlpacaFarm, we find that methods that use a reward model can substantially improve over supervised fine-tuning and that our reference PPO implementation leads to a +10% improvement in win-rate against Davinci003. We release all components of AlpacaFarm at https://github.com/tatsu-lab/alpaca_farm.
LG · Oct 12, 2021 · Code
Large Language Models Can Be Strong Differentially Private Learners
Xuechen Li, Florian Tramèr, Percy Liang et al.
Differentially Private (DP) learning has seen limited success for building large deep learning models of text, and straightforward attempts at applying Differentially Private Stochastic Gradient Descent (DP-SGD) to NLP tasks have resulted in large performance drops and high computational overhead. We show that this performance drop can be mitigated with (1) the use of large pretrained language models; (2) non-standard hyperparameters that suit DP optimization; and (3) fine-tuning objectives that are aligned with the pretraining procedure. With the above, we obtain NLP models that outperform state-of-the-art DP-trained models under the same privacy budget, as well as strong non-private baselines, by directly fine-tuning pretrained models with DP optimization on moderately-sized corpora. To address the computational challenge of running DP-SGD with large Transformers, we propose a memory-saving technique that allows clipping in DP-SGD to run without instantiating per-example gradients for any linear layer in the model. The technique enables privately training Transformers with almost the same memory cost as non-private training at a modest run-time overhead. Contrary to the conventional wisdom that DP optimization fails at learning high-dimensional models (due to noise that scales with dimension), empirical results reveal that private learning with pretrained language models doesn't tend to suffer from dimension-dependent performance degradation. Code to reproduce results can be found at https://github.com/lxuechen/private-transformers.
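One identity that lets a per-example gradient norm be computed for a linear layer without building the per-example gradient is sketched below in pure Python. For a sequence input, the per-example weight gradient is the sum over positions t of the outer products g_t a_t^T, and its squared Frobenius norm equals the sum over all position pairs (t, s) of (a_t · a_s)(g_t · g_s). Whether this matches the authors' implementation in every detail is an assumption; the function names and toy values are illustrative.

```python
import math

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(u, v))

def explicit_grad_sq_norm(acts, out_grads):
    """Reference: materialize the (d_out x d_in) per-example weight gradient
    sum_t g_t a_t^T, then take its squared Frobenius norm."""
    d_out, d_in = len(out_grads[0]), len(acts[0])
    grad = [[0.0] * d_in for _ in range(d_out)]
    for a, g in zip(acts, out_grads):
        for i in range(d_out):
            for j in range(d_in):
                grad[i][j] += g[i] * a[j]
    return sum(x * x for row in grad for x in row)

def gram_grad_sq_norm(acts, out_grads):
    """Same quantity from two (T x T) Gram matrices:
    sum over t, s of (a_t . a_s) * (g_t . g_s).
    No (d_out x d_in) matrix is ever formed."""
    steps = range(len(acts))
    return sum(dot(acts[t], acts[s]) * dot(out_grads[t], out_grads[s])
               for t in steps for s in steps)

# Toy example: sequence length 2, d_in = 3, d_out = 2.
acts = [[1.0, 2.0, 0.5], [0.0, -1.0, 3.0]]    # layer inputs a_t
out_grads = [[0.5, -1.0], [2.0, 1.0]]         # gradients w.r.t. outputs g_t

sq_norm = gram_grad_sq_norm(acts, out_grads)
clip_factor = min(1.0, 1.0 / math.sqrt(sq_norm))  # scale for threshold 1.0
```

Both functions return 56.5625 on the toy input. The Gram-matrix route costs memory quadratic in sequence length rather than proportional to the weight matrix size, which is where the saving comes from when layers are wide.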
LG · Mar 30, 2024
Linguistic Calibration of Long-Form Generations
Neil Band, Xuechen Li, Tengyu Ma et al.
Language models (LMs) may lead their users to make suboptimal downstream decisions when they confidently hallucinate. This issue can be mitigated by having the LM verbally convey the probability that its claims are correct, but existing models cannot produce long-form text with calibrated confidence statements. Through the lens of decision-making, we define linguistic calibration for long-form generations: an LM is linguistically calibrated if its generations enable its users to make calibrated probabilistic predictions. This definition enables a training framework where a supervised finetuning step bootstraps an LM to emit long-form generations with confidence statements such as "I estimate a 30% chance of..." or "I am certain that...", followed by a reinforcement learning step which rewards generations that enable a user to provide calibrated answers to related questions. We linguistically calibrate Llama 2 7B and find in automated and human evaluations of long-form generations that it is significantly more calibrated than strong finetuned factuality baselines with comparable accuracy. These findings generalize under significant domain shifts to scientific and biomedical questions and to an entirely held-out person biography generation task. Our results demonstrate that long-form generations may be calibrated end-to-end by constructing an objective in the space of the predictions that users make in downstream decision-making.
HC · Feb 10
Beyond Input-Output: Rethinking Creativity through Design-by-Analogy in Human-AI Collaboration
Xuechen Li, Shuai Zhang, Nan Cao et al.
While the proliferation of foundation models has significantly boosted individual productivity, it also introduces a potential challenge: the homogenization of creative content. In response, we revisit Design-by-Analogy (DbA), a cognitively grounded approach that fosters novel solutions by mapping inspiration across domains. However, prevailing perspectives often restrict DbA to early ideation or specific data modalities, while reducing AI-driven design to simplified input-output pipelines. Such conceptual limitations inadvertently foster widespread design fixation. To address this, we expand the understanding of DbA by embedding it into the entire creative process, thereby demonstrating its capacity to mitigate such fixation. Through a systematic review of 85 studies, we identify six forms of representation and classify techniques across seven stages of the creative process. We further discuss three major application domains: creative industries, intelligent manufacturing, and education and services, demonstrating DbA's practical relevance. Building on this synthesis, we frame DbA as a mediating technology for human-AI collaboration and outline the potential opportunities and inherent risks for advancing creativity support in HCI and design research.
LG · Feb 18, 2025
Multi-branch of Attention Yields Accurate Results for Tabular Data
Xuechen Li, Yupeng Li, Jian Liu et al.
Tabular data inherently exhibits significant feature heterogeneity, but existing transformer-based methods lack specialized mechanisms to handle this property. To bridge the gap, we propose MAYA, an encoder-decoder transformer-based framework. In the encoder, we design a Multi-Branch of Attention (MBA) that constructs multiple parallel attention branches and averages the features at each branch, effectively fusing heterogeneous features while limiting parameter growth. Additionally, we employ collaborative learning with a dynamic consistency weight constraint to produce more robust representations. In the decoder stage, cross-attention is utilized to seamlessly integrate tabular data with corresponding label features. This dual-attention mechanism effectively captures both intra-instance and inter-instance interactions. We evaluate the proposed method on a wide range of datasets and compare it with other state-of-the-art transformer-based methods. Extensive experiments demonstrate that our model achieves superior performance among transformer-based methods in both tabular classification and regression tasks.
LG · Jun 3, 2024
Scalable Ensembling For Mitigating Reward Overoptimisation
Ahmed M. Ahmed, Rafael Rafailov, Stepan Sharkov et al.
Reinforcement Learning from Human Feedback (RLHF) has enabled significant advancements within language modeling for powerful, instruction-following models. However, the alignment of these models remains a pressing challenge, as the policy tends to overfit the learned "proxy" reward model past an inflection point of utility as measured by a more performant "gold" reward model, a phenomenon known as overoptimisation. Prior work has mitigated this issue by computing a pessimistic statistic over an ensemble of reward models, which is common in Offline Reinforcement Learning but incredibly costly for language models with high memory requirements, making such approaches infeasible for sufficiently large models. To this end, we propose using a shared encoder with separate linear heads. We find that this leads to performance similar to that of the full ensemble while allowing tremendous savings in the memory and time required for training models of similar size.
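The shared-encoder idea can be sketched in a few lines of pure Python: features are computed once, several cheap linear heads score them, and a pessimistic statistic combines the scores. The abstract does not say which statistic is used, so the minimum and the mean minus a multiple of the standard deviation below are assumptions, as are the toy encoder, the head weights, and all names.

```python
import statistics

def encode(x):
    """Stand-in for the expensive shared encoder (hypothetical feature map)."""
    return [x, x * x, 1.0]

def head_rewards(features, heads):
    """Each head is a weight vector applied to the same shared features."""
    return [sum(w * f for w, f in zip(head, features)) for head in heads]

def pessimistic_reward(x, heads, mode="min", lam=1.0):
    """Combine the heads' rewards conservatively.
    mode="min": worst-case head; otherwise mean - lam * std across heads."""
    features = encode(x)            # one encoder pass, reused by every head
    rewards = head_rewards(features, heads)
    if mode == "min":
        return min(rewards)
    return statistics.mean(rewards) - lam * statistics.pstdev(rewards)

# Three hypothetical linear heads over the 3-dimensional features.
heads = [
    [1.0, 0.0, 0.0],
    [0.5, 0.1, 0.2],
    [1.2, -0.1, 0.0],
]

worst_case = pessimistic_reward(2.0, heads, mode="min")
mean_minus_std = pessimistic_reward(2.0, heads, mode="mean_std", lam=1.0)
```

Adding a head here costs one extra weight vector rather than another full model, which is the source of the memory and time savings described above. Disagreement among heads lowers the combined reward, discouraging the policy from exploiting inputs where the reward estimate is uncertain.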
CV · Dec 26, 2023
SCPMan: Shape Context and Prior Constrained Multi-scale Attention Network for Pancreatic Segmentation
Leilei Zeng, Xuechen Li, Xinquan Yang et al.
Due to the poor prognosis of pancreatic cancer, accurate early detection and segmentation are critical for improving treatment outcomes. However, pancreatic segmentation is challenged by blurred boundaries, high shape variability, and class imbalance. To tackle these problems, we propose a multi-scale attention network with shape context and prior constraint for robust pancreas segmentation. Specifically, we propose a Multi-scale Feature Extraction Module (MFE) and a Mixed-scale Attention Integration Module (MAI) to address unclear pancreas boundaries. Furthermore, a Shape Context Memory (SCM) module is introduced to jointly model semantics across scales and pancreatic shape. An Active Shape Model (ASM) is further used to model the shape priors. Experiments on the NIH and MSD datasets demonstrate the efficacy of our model, which improves the state-of-the-art Dice score by 1.01% and 1.03%, respectively. Our architecture provides robust segmentation performance against blurry boundaries and variations in the scale and shape of the pancreas.
CV · May 17, 2023
Two-Stream Regression Network for Dental Implant Position Prediction
Xinquan Yang, Xuguang Li, Xuechen Li et al.
In implant prosthesis treatment, the design of the surgical guide relies heavily on manually locating the implant position, which is subjective and dependent on the doctor's experience. When deep learning based methods are applied to this problem, two difficulties arise: the spacing between teeth varies, and some inter-tooth regions present texture characteristics similar to those of the actual implant region. Both make implant position prediction challenging. In this paper, we develop a two-stream implant position regression framework (TSIPR), which consists of an implant region detector (IRD) and a multi-scale patch embedding regression network (MSPENet), to address these issues. For the training of the IRD, we extend the original annotation to provide additional supervisory information, which contains richer characteristics and does not introduce extra labeling costs. A multi-scale patch embedding module is designed for the MSPENet to adaptively extract features from images with various tooth spacing. A global-local feature interaction block is designed to build the encoder of the MSPENet, combining transformer and convolution for enriched feature representation. During inference, the RoI mask extracted from the IRD is used to refine the prediction results of the MSPENet. Extensive experiments on a dental implant dataset with five-fold cross-validation demonstrate that the proposed TSIPR outperforms existing methods.
LG · Aug 16, 2021
On the Opportunities and Risks of Foundation Models
Rishi Bommasani, Drew A. Hudson, Ehsan Adeli et al.
AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical principles (e.g., model architectures, training procedures, data, systems, security, evaluation, theory) to their applications (e.g., law, healthcare, education) and societal impact (e.g., inequity, misuse, economic and environmental impact, legal and ethical considerations). Though foundation models are based on standard deep learning and transfer learning, their scale results in new emergent capabilities, and their effectiveness across so many tasks incentivizes homogenization. Homogenization provides powerful leverage but demands caution, as the defects of the foundation model are inherited by all the adapted models downstream. Despite the impending widespread deployment of foundation models, we currently lack a clear understanding of how they work, when they fail, and what they are even capable of due to their emergent properties. To tackle these questions, we believe much of the critical research on foundation models will require deep interdisciplinary collaboration commensurate with their fundamentally sociotechnical nature.
IV · Jun 23, 2021
Learning from Pseudo Lesion: A Self-supervised Framework for COVID-19 Diagnosis
Zhongliang Li, Zhihao Jin, Xuechen Li et al.
Coronavirus disease 2019 (COVID-19) has spread rapidly all over the world since it was first reported in December 2019, and thoracic computed tomography (CT) has become one of the main tools for its diagnosis. In recent years, deep learning-based approaches have shown impressive performance in myriad image recognition tasks. However, they usually require a large amount of annotated data for training. Inspired by Ground Glass Opacity (GGO), a common finding in COVID-19 patients' CT scans, we propose a novel self-supervised pretraining method based on pseudo lesion generation and restoration for COVID-19 diagnosis. We used Perlin noise, a gradient-noise-based mathematical model, to generate lesion-like patterns, which were then randomly pasted onto the lung regions of normal CT images to generate pseudo COVID-19 images. The pairs of normal and pseudo COVID-19 images were then used to train a U-Net-based encoder-decoder architecture for image restoration, which does not require any labelled data. The pretrained encoder was then fine-tuned using labelled data for the COVID-19 diagnosis task. Two public COVID-19 diagnosis datasets made up of CT images were employed for evaluation. Comprehensive experimental results demonstrated that the proposed self-supervised learning approach could extract better feature representations for COVID-19 diagnosis, and the accuracy of the proposed method exceeded that of the supervised model pretrained on large-scale images by 6.57% and 3.03% on the SARS-CoV-2 dataset and the Jinan COVID-19 dataset, respectively.
LGMay 28, 2021
Learning to Extend Program Graphs to Work-in-Progress CodeXuechen Li, Chris J. Maddison, Daniel Tarlow
Source code spends most of its time in a broken or incomplete state during software development. This presents a challenge to machine learning for code, since high-performing models typically rely on graph structured representations of programs derived from traditional program analyses. Such analyses may be undefined for broken or incomplete code. We extend the notion of program graphs to work-in-progress code by learning to predict edge relations between tokens, training on well-formed code before transferring to work-in-progress code. We consider the tasks of code completion and localizing and repairing variable misuse in a work-in-progress scenario. We demonstrate that training relation-aware models with fine-tuned edges consistently leads to improved performance on both tasks.
LGMay 27, 2021
Efficient and Accurate Gradients for Neural SDEsPatrick Kidger, James Foster, Xuechen Li et al.
Neural SDEs combine many of the best qualities of both RNNs and SDEs: memory efficient training, high-capacity function approximation, and strong priors on model space. This makes them a natural choice for modelling many types of temporal dynamics. Training a Neural SDE (either as a VAE or as a GAN) requires backpropagating through an SDE solve. This may be done by solving a backwards-in-time SDE whose solution is the desired parameter gradients. However, this has previously suffered from severe speed and accuracy issues, due to high computational cost and numerical truncation errors. Here, we overcome these issues through several technical innovations. First, we introduce the \textit{reversible Heun method}. This is a new SDE solver that is \textit{algebraically reversible}: eliminating numerical gradient errors, and the first such solver of which we are aware. Moreover it requires half as many function evaluations as comparable solvers, giving up to a $1.98\times$ speedup. Second, we introduce the \textit{Brownian Interval}: a new, fast, memory efficient, and exact way of sampling \textit{and reconstructing} Brownian motion. With this we obtain up to a $10.6\times$ speed improvement over previous techniques, which in contrast are both approximate and relatively slow. Third, when specifically training Neural SDEs as GANs (Kidger et al. 2021), we demonstrate how SDE-GANs may be trained through careful weight clipping and choice of activation function. This reduces computational cost (giving up to a $1.87\times$ speedup) and removes the numerical truncation errors associated with gradient penalty. Altogether, we outperform the state-of-the-art by substantial margins, with respect to training speed, and with respect to classification, prediction, and MMD test metrics. We have contributed implementations of all of our techniques to the torchsde library to help facilitate their adoption.
MLFeb 12, 2021
Infinitely Deep Bayesian Neural Networks with Stochastic Differential EquationsWinnie Xu, Ricky T. Q. Chen, Xuechen Li et al.
We perform scalable approximate inference in continuous-depth Bayesian neural networks. In this model class, uncertainty about separate weights in each layer gives hidden units that follow a stochastic differential equation. We demonstrate gradient-based stochastic variational inference in this infinite-parameter setting, producing arbitrarily-flexible approximate posteriors. We also derive a novel gradient estimator that approaches zero variance as the approximate posterior over weights approaches the true posterior. This approach brings continuous-depth Bayesian neural nets to a competitive comparison against discrete-depth alternatives, while inheriting the memory-efficient training and tunable precision of Neural ODEs.
LGFeb 6, 2021
Neural SDEs as Infinite-Dimensional GANsPatrick Kidger, James Foster, Xuechen Li et al.
Stochastic differential equations (SDEs) are a staple of mathematical modelling of temporal dynamics. However, a fundamental limitation has been that such models have typically been relatively inflexible, which recent work introducing Neural SDEs has sought to solve. Here, we show that the current classical approach to fitting SDEs may be approached as a special case of (Wasserstein) GANs, and in doing so the neural and classical regimes may be brought together. The input noise is Brownian motion, the output samples are time-evolving paths produced by a numerical solver, and by parameterising a discriminator as a Neural Controlled Differential Equation (CDE), we obtain Neural SDEs as (in modern machine learning parlance) continuous-time generative time series models. Unlike previous work on this problem, this is a direct extension of the classical approach without reference to either prespecified statistics or density functions. Arbitrary drift and diffusions are admissible, so as the Wasserstein loss has a unique global minimum, in the infinite data limit any SDE may be learnt. Example code has been made available as part of the \texttt{torchsde} repository.
MLJun 18, 2020
When Does Preconditioning Help or Hurt Generalization?Shun-ichi Amari, Jimmy Ba, Roger Grosse et al.
While second order optimizers such as natural gradient descent (NGD) often speed up optimization, their effect on generalization has been called into question. This work presents a more nuanced view on how the \textit{implicit bias} of first- and second-order methods affects the comparison of generalization properties. We provide an exact asymptotic bias-variance decomposition of the generalization error of overparameterized ridgeless regression under a general class of preconditioner $\boldsymbol{P}$, and consider the inverse population Fisher information matrix (used in NGD) as a particular example. We determine the optimal $\boldsymbol{P}$ for both the bias and variance, and find that the relative generalization performance of different optimizers depends on the label noise and the "shape" of the signal (true parameters): when the labels are noisy, the model is misspecified, or the signal is misaligned with the features, NGD can achieve lower risk; conversely, GD generalizes better than NGD under clean labels, a well-specified model, or aligned signal. Based on this analysis, we discuss several approaches to manage the bias-variance tradeoff, and the potential benefit of interpolating between GD and NGD. We then extend our analysis to regression in the reproducing kernel Hilbert space and demonstrate that preconditioned GD can decrease the population risk faster than GD. Lastly, we empirically compare the generalization error of first- and second-order optimizers in neural network experiments, and observe robust trends matching our theoretical analysis.
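For reference, the preconditioned update under comparison can be written as below. The step size $\eta$, the loss $L$, and the Fisher matrix symbol $\boldsymbol{F}$ are notation assumed here for illustration and do not appear in the abstract.

$$
\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta\, \boldsymbol{P}\, \nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta}_t),
\qquad
\boldsymbol{P} = \boldsymbol{I} \ \text{(GD)},
\qquad
\boldsymbol{P} = \boldsymbol{F}^{-1} \ \text{(NGD)}.
$$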
LGJan 5, 2020
Scalable Gradients for Stochastic Differential EquationsXuechen Li, Ting-Kam Leonard Wong, Ricky T. Q. Chen et al.
The adjoint sensitivity method scalably computes gradients of solutions to ordinary differential equations. We generalize this method to stochastic differential equations, allowing time-efficient and constant-memory computation of gradients with high-order adaptive solvers. Specifically, we derive a stochastic differential equation whose solution is the gradient, a memory-efficient algorithm for caching noise, and conditions under which numerical solutions converge. In addition, we combine our method with gradient-based stochastic variational inference for latent stochastic differential equations. We use our method to fit stochastic dynamics defined by neural networks, achieving competitive performance on a 50-dimensional motion capture dataset.
MLJun 19, 2019
Stochastic Runge-Kutta Accelerates Langevin Monte Carlo and BeyondXuechen Li, Denny Wu, Lester Mackey et al.
Sampling with Markov chain Monte Carlo methods often amounts to discretizing some continuous-time dynamics with numerical integration. In this paper, we establish the convergence rate of sampling algorithms obtained by discretizing smooth Itô diffusions exhibiting fast Wasserstein-$2$ contraction, based on local deviation properties of the integration scheme. In particular, we study a sampling algorithm constructed by discretizing the overdamped Langevin diffusion with the method of stochastic Runge-Kutta. For strongly convex potentials that are smooth up to a certain order, its iterates converge to the target distribution in $2$-Wasserstein distance in $\tilde{\mathcal{O}}(dε^{-2/3})$ iterations. This improves upon the best-known rate for strongly log-concave sampling based on the overdamped Langevin equation using only the gradient oracle without adjustment. In addition, we extend our analysis of stochastic Runge-Kutta methods to uniformly dissipative diffusions with possibly non-convex potentials and show they achieve better rates compared to the Euler-Maruyama scheme in terms of the dependence on tolerance $ε$. Numerical studies show that these algorithms lead to better stability and lower asymptotic errors.
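As a point of comparison, the Euler-Maruyama baseline mentioned above can be sketched in a few lines. This is a minimal illustration of the unadjusted Langevin algorithm in one dimension, not the stochastic Runge-Kutta scheme studied in the paper; the function name `ula_sample`, the Gaussian target, and the step size are all assumptions chosen for the example.

```python
import math
import random


def ula_sample(grad_potential, x0, step, n_steps, rng):
    """Euler-Maruyama discretization of the overdamped Langevin diffusion
    dX_t = -grad f(X_t) dt + sqrt(2) dB_t, in one dimension."""
    x = x0
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)  # standard normal increment
        x = x - step * grad_potential(x) + math.sqrt(2.0 * step) * noise
    return x


# Hypothetical target: standard Gaussian, potential f(x) = x^2 / 2, so grad f(x) = x.
rng = random.Random(0)
samples = [
    ula_sample(lambda x: x, x0=3.0, step=0.05, n_steps=200, rng=rng)
    for _ in range(2000)
]

# Empirical moments of the final iterates across independent chains.
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"mean={mean:.3f} var={var:.3f}")
```

For this target the Euler-Maruyama chain has a stationary variance of $1/(1 - h/2)$ for step size $h$, slightly above the true value of 1, which illustrates the discretization bias that higher-order schemes aim to reduce.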
CVMar 25, 2019
A Novel Method for the Absolute Pose Problem with Pairwise ConstraintsYinlong Liu, Xuechen Li, Manning Wang et al.
Absolute pose estimation is a fundamental problem in computer vision, and it is a typical parameter estimation problem, meaning that efforts to solve it will always suffer from outlier-contaminated data. Conventionally, for a fixed dimensionality d and the number of measurements N, a robust estimation problem cannot be solved faster than O(N^d). Furthermore, it is almost impossible to remove d from the exponent of the runtime of a globally optimal algorithm. However, absolute pose estimation is a geometric parameter estimation problem, and thus has special constraints. In this paper, we consider pairwise constraints and propose a globally optimal algorithm for solving the absolute pose estimation problem. The proposed algorithm has a linear complexity in the number of correspondences at a given outlier ratio. Concretely, we first decouple the rotation and the translation subproblems by utilizing the pairwise constraints, and then we solve the rotation subproblem using the branch-and-bound algorithm. Lastly, we estimate the translation based on the known rotation by using another branch-and-bound algorithm. The advantages of our method are demonstrated via thorough testing on both synthetic and real-world data.
CVDec 29, 2018
Fast and Globally Optimal Rigid Registration of 3D Point Sets by Transformation DecompositionXuechen Li, Yinlong Liu, Yiru Wang et al.
The rigid registration of two 3D point sets is a fundamental problem in computer vision. The current trend is to solve this problem globally using the BnB optimization framework. However, the existing global methods are slow for two main reasons: the computational complexity of BnB is exponential to the problem dimensionality (which is six for 3D rigid registration), and the bound evaluation used in BnB is inefficient. In this paper, we propose two techniques to address these problems. First, we introduce the idea of translation invariant vectors, which allows us to decompose the search of a 6D rigid transformation into a search of 3D rotation followed by a search of 3D translation, each of which is solved by a separate BnB algorithm. This transformation decomposition reduces the problem dimensionality of BnB algorithms and substantially improves its efficiency. Then, we propose a new data structure, named 3D Integral Volume, to accelerate the bound evaluation in both BnB algorithms. By combining these two techniques, we implement an efficient algorithm for rigid registration of 3D point sets. Extensive experiments on both synthetic and real data show that the proposed algorithm is three orders of magnitude faster than the existing state-of-the-art global methods.
LGFeb 14, 2018
Isolating Sources of Disentanglement in Variational AutoencodersRicky T. Q. Chen, Xuechen Li, Roger Grosse et al.
We decompose the evidence lower bound to show the existence of a term measuring the total correlation between latent variables. We use this to motivate our $β$-TCVAE (Total Correlation Variational Autoencoder), a refinement of the state-of-the-art $β$-VAE objective for learning disentangled representations, requiring no additional hyperparameters during training. We further propose a principled classifier-free measure of disentanglement called the mutual information gap (MIG). We perform extensive quantitative and qualitative experiments, in both restricted and non-restricted settings, and show a strong relation between total correlation and disentanglement, when the latent variable model is trained using our framework.
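The decomposition referred to above can be written out as follows. The notation is assumed here for illustration: $n$ indexes training examples with $p(n)$ uniform, $q(z \mid n)$ is the encoder, $q(z) = \sum_n q(z \mid n)\, p(n)$ is the aggregate posterior, and $z_j$ is the $j$-th latent dimension.

$$
\mathbb{E}_{p(n)}\big[\mathrm{KL}\big(q(z \mid n)\,\|\,p(z)\big)\big]
= \underbrace{\mathrm{KL}\big(q(z, n)\,\|\,q(z)\,p(n)\big)}_{\text{index-code mutual information}}
+ \underbrace{\mathrm{KL}\Big(q(z)\,\Big\|\,\prod_j q(z_j)\Big)}_{\text{total correlation}}
+ \underbrace{\sum_j \mathrm{KL}\big(q(z_j)\,\|\,p(z_j)\big)}_{\text{dimension-wise KL}}
$$

Under this reading, $β$-TCVAE places the weight $β$ on the total correlation term only, whereas $β$-VAE scales all three terms together.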
LGJan 10, 2018
Inference Suboptimality in Variational AutoencodersChris Cremer, Xuechen Li, David Duvenaud
Amortized inference allows latent-variable models trained via variational learning to scale to large datasets. The quality of approximate inference is determined by two factors: a) the capacity of the variational distribution to match the true posterior and b) the ability of the recognition network to produce good variational parameters for each datapoint. We examine approximate inference in variational autoencoders in terms of these factors. We find that divergence from the true posterior is often due to imperfect recognition networks, rather than the limited complexity of the approximating distribution. We show that this is due partly to the generator learning to accommodate the choice of approximation. Furthermore, we show that the parameters used to increase the expressiveness of the approximation play a role in generalizing inference rather than simply improving the complexity of the approximation.