Shi Pu

OC
h-index3
23papers
1,278citations
Novelty50%
AI Score48

23 Papers

LGMar 12, 2023
One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale

Fan Bao, Shen Nie, Kaiwen Xue et al.

This paper proposes a unified diffusion framework (dubbed UniDiffuser) to fit all distributions relevant to a set of multi-modal data in one model. Our key insight is -- learning diffusion models for marginal, conditional, and joint distributions can be unified as predicting the noise in the perturbed data, where the perturbation levels (i.e. timesteps) can be different for different modalities. Inspired by the unified view, UniDiffuser learns all distributions simultaneously with a minimal modification to the original diffusion model -- perturbs data in all modalities instead of a single modality, inputs individual timesteps in different modalities, and predicts the noise of all modalities instead of a single modality. UniDiffuser is parameterized by a transformer for diffusion models to handle input types of different modalities. Implemented on large-scale paired image-text data, UniDiffuser is able to perform image, text, text-to-image, image-to-text, and image-text pair generation by setting proper timesteps without additional overhead. In particular, UniDiffuser is able to produce perceptually realistic samples in all tasks and its quantitative results (e.g., the FID and CLIP score) are not only superior to existing general-purpose models but also comparable to the bespoken models (e.g., Stable Diffusion and DALL-E 2) in representative tasks (e.g., text-to-image generation).

CVMar 29, 2022Code
Alignment-Uniformity aware Representation Learning for Zero-shot Video Classification

Shi Pu, Kaili Zhao, Mao Zheng

Most methods tackle zero-shot video classification by aligning visual-semantic representations within seen classes, which limits generalization to unseen classes. To enhance model generalizability, this paper presents an end-to-end framework that preserves alignment and uniformity properties for representations on both seen and unseen classes. Specifically, we formulate a supervised contrastive loss to simultaneously align visual-semantic features (i.e., alignment) and encourage the learned features to distribute uniformly (i.e., uniformity). Unlike existing methods that only consider the alignment, we propose uniformity to preserve maximal-info of existing features, which improves the probability that unobserved features fall around observed data. Further, we synthesize features of unseen classes by proposing a class generator that interpolates and extrapolates the features of seen classes. Besides, we introduce two metrics, closeness and dispersion, to quantify the two properties and serve as new measurements of model generalizability. Experiments show that our method significantly outperforms SoTA by relative improvements of 28.1% on UCF101 and 27.0% on HMDB51. Code is available.

OCJun 21, 2023
Distributed Random Reshuffling Methods with Improved Convergence

Kun Huang, Linli Zhou, Shi Pu

This paper proposes two distributed random reshuffling methods, namely Gradient Tracking with Random Reshuffling (GT-RR) and Exact Diffusion with Random Reshuffling (ED-RR), to solve the distributed optimization problem over a connected network, where a set of agents aim to minimize the average of their local cost functions. Both algorithms invoke random reshuffling (RR) update for each agent, inherit favorable characteristics of RR for minimizing smooth nonconvex objective functions, and improve the performance of previous distributed random reshuffling methods both theoretically and empirically. Specifically, both GT-RR and ED-RR achieve the convergence rate of $O(1/[(1-λ)^{1/3}m^{1/3}T^{2/3}])$ in driving the (minimum) expected squared norm of the gradient to zero, where $T$ denotes the number of epochs, $m$ is the sample size for each agent, and $1-λ$ represents the spectral gap of the mixing matrix. When the objective functions further satisfy the Polyak-Łojasiewicz (PL) condition, we show GT-RR and ED-RR both achieve $O(1/[(1-λ)mT^2])$ convergence rate in terms of the averaged expected differences between the agents' function values and the global minimum value. Notably, both results are comparable to the convergence rates of centralized RR methods (up to constant factors depending on the network topology) and outperform those of previous distributed random reshuffling algorithms.

OCJan 14, 2023
CEDAS: A Compressed Decentralized Stochastic Gradient Method with Improved Convergence

Kun Huang, Shi Pu

In this paper, we consider solving the distributed optimization problem over a multi-agent network under the communication restricted setting. We study a compressed decentralized stochastic gradient method, termed ``compressed exact diffusion with adaptive stepsizes (CEDAS)", and show the method asymptotically achieves comparable convergence rate as centralized { stochastic gradient descent (SGD)} for both smooth strongly convex objective functions and smooth nonconvex objective functions under unbiased compression operators. In particular, to our knowledge, CEDAS enjoys so far the shortest transient time (with respect to the graph specifics) for achieving the convergence rate of centralized SGD, which behaves as $\mathcal{O}(n{C^3}/(1-λ_2)^{2})$ under smooth strongly convex objective functions, and $\mathcal{O}(n^3{C^6}/(1-λ_2)^4)$ under smooth nonconvex objective functions, where $(1-λ_2)$ denotes the spectral gap of the mixing matrix, and $C>0$ is the compression-related parameter. In particular, CEDAS exhibits the shortest transient times when $C < \mathcal{O}(1/(1 - λ_2)^2)$, which is common in practice. Numerical experiments further demonstrate the effectiveness of the proposed algorithm.

OCJan 30, 2023
Distributed Stochastic Optimization under a General Variance Condition

Kun Huang, Xiao Li, Shi Pu

Distributed stochastic optimization has drawn great attention recently due to its effectiveness in solving large-scale machine learning problems. Though numerous algorithms have been proposed and successfully applied to general practical problems, their theoretical guarantees mainly rely on certain boundedness conditions on the stochastic gradients, varying from uniform boundedness to the relaxed growth condition. In addition, how to characterize the data heterogeneity among the agents and its impacts on the algorithmic performance remains challenging. In light of such motivations, we revisit the classical Federated Averaging (FedAvg) algorithm (McMahan et al., 2017) as well as the more recent SCAFFOLD method (Karimireddy et al., 2020) for solving the distributed stochastic optimization problem and establish the convergence results under only a mild variance condition on the stochastic gradients for smooth nonconvex objective functions. Almost sure convergence to a stationary point is also established under the condition. Moreover, we discuss a more informative measurement for data heterogeneity as well as its implications.

LGFeb 3
Achieving Linear Speedup for Composite Federated Learning

Kun Huang, Shi Pu

This paper proposes FedNMap, a normal map-based method for composite federated learning, where the objective consists of a smooth loss and a possibly nonsmooth regularizer. FedNMap leverages a normal map-based update scheme to handle the nonsmooth term and incorporates a local correction strategy to mitigate the impact of data heterogeneity across clients. Under standard assumptions, including smooth local losses, weak convexity of the regularizer, and bounded stochastic gradient variance, FedNMap achieves linear speedup with respect to both the number of clients $n$ and the number of local updates $Q$ for nonconvex losses, both with and without the Polyak-Łojasiewicz (PL) condition. To our knowledge, this is the first result establishing linear speedup for nonconvex composite federated learning.

OCMar 20, 2025
Distributed Learning over Arbitrary Topology: Linear Speed-Up with Polynomial Transient Time

Runze You, Shi Pu

We study a distributed learning problem in which $n$ agents, each with potentially heterogeneous local data, collaboratively minimize the sum of their local cost functions via peer-to-peer communication. We propose a novel algorithm, \emph{Spanning Tree Push-Pull} (STPP), which employs two spanning trees extracted from a general communication graph to distribute both model parameters and stochastic gradients. Unlike prior approaches that rely heavily on spectral gap properties, STPP leverages a more flexible topological characterization, enabling robust information flow and efficient updates. Theoretically, we prove that STPP achieves linear speedup and polynomial transient iteration complexity -- up to $\mathcal{O}(n^7)$ for smooth nonconvex objectives and $\tilde{\mathcal{O}}(n^3)$ for smooth strongly convex objectives -- under arbitrary network topologies. Moreover, compared with existing methods, STPP achieves faster convergence rates on sparse and non-regular topologies (e.g., directed rings) and reduces communication overhead on dense networks (e.g., static exponential graphs). Numerical experiments further demonstrate the strong performance of STPP across various graph architectures.

LGJan 4
Accelerating Decentralized Optimization via Overlapping Local Steps

Yijie Zhou, Shi Pu

Decentralized optimization has emerged as a critical paradigm for distributed learning, enabling scalable training while preserving data privacy through peer-to-peer collaboration. However, existing methods often suffer from communication bottlenecks due to frequent synchronization between nodes. We present Overlapping Local Decentralized SGD (OLDSGD), a novel approach to accelerate decentralized training by computation-communication overlapping, significantly reducing network idle time. With a deliberately designed update, OLDSGD preserves the same average update as Local SGD while avoiding communication-induced stalls. Theoretically, we establish non-asymptotic convergence rates for smooth non-convex objectives, showing that OLDSGD retains the same iteration complexity as standard Local Decentralized SGD while improving per-iteration runtime. Empirical results demonstrate OLDSGD's consistent improvements in wall-clock time convergence under different levels of communication delays. With minimal modifications to existing frameworks, OLDSGD offers a practical solution for faster decentralized learning without sacrificing theoretical guarantees.

CRJul 29, 2025
NCCR: to Evaluate the Robustness of Neural Networks and Adversarial Examples

Shi Pu, Fu Song, Wenjie Wang

Neural networks have received a lot of attention recently, and related security issues have come with it. Many studies have shown that neural networks are vulnerable to adversarial examples that have been artificially perturbed with modification, which is too small to be distinguishable by human perception. Different attacks and defenses have been proposed to solve these problems, but there is little research on evaluating the robustness of neural networks and their inputs. In this work, we propose a metric called the neuron cover change rate (NCCR) to measure the ability of deep learning models to resist attacks and the stability of adversarial examples. NCCR monitors alterations in the output of specifically chosen neurons when the input is perturbed, and networks with a smaller degree of variation are considered to be more robust. The results of the experiment on image recognition and the speaker recognition model show that our metrics can provide a good assessment of the robustness of neural networks or their inputs. It can also be used to detect whether an input is adversarial or not, as adversarial examples are always less robust.

LGMay 15, 2025
Asynchronous Decentralized SGD under Non-Convexity: A Block-Coordinate Descent Framework

Yijie Zhou, Shi Pu

Decentralized optimization has become vital for leveraging distributed data without central control, enhancing scalability and privacy. However, practical deployments face fundamental challenges due to heterogeneous computation speeds and unpredictable communication delays. This paper introduces a refined model of Asynchronous Decentralized Stochastic Gradient Descent (ADSGD) under practical assumptions of bounded computation and communication times. To understand the convergence of ADSGD, we first analyze Asynchronous Stochastic Block Coordinate Descent (ASBCD) as a tool, and then show that ADSGD converges under computation-delay-independent step sizes. The convergence result is established without assuming bounded data heterogeneity. Empirical experiments reveal that ADSGD outperforms existing methods in wall-clock convergence time across various scenarios. With its simplicity, efficiency in memory and communication, and resilience to communication and computation delays, ADSGD is well-suited for real-world decentralized learning tasks.

CLJun 17, 2024
HARE: HumAn pRiors, a key to small language model Efficiency

Lingyun Zhang, Bin jin, Gaojian Ge et al.

Human priors play a crucial role in efficiently utilizing data in deep learning. However, with the development of large language models (LLMs), there is an increasing emphasis on scaling both model size and data volume, which often diminishes the importance of human priors in data construction. Influenced by these trends, existing Small Language Models (SLMs) mainly rely on web-scraped large-scale training data, neglecting the proper incorporation of human priors. This oversight limits the training efficiency of language models in resource-constrained settings. In this paper, we propose a principle to leverage human priors for data construction. This principle emphasizes achieving high-performance SLMs by training on a concise dataset that accommodates both semantic diversity and data quality consistency, while avoiding benchmark data leakage. Following this principle, we train an SLM named HARE-1.1B. Extensive experiments on large-scale benchmark datasets demonstrate that HARE-1.1B performs favorably against state-of-the-art SLMs, validating the effectiveness of the proposed principle. Additionally, this provides new insights into efficient language model training in resource-constrained environments from the view of human priors.

CRMay 7, 2023
Text-to-Image Diffusion Models can be Easily Backdoored through Multimodal Data Poisoning

Shengfang Zhai, Yinpeng Dong, Qingni Shen et al.

With the help of conditioning mechanisms, the state-of-the-art diffusion models have achieved tremendous success in guided image generation, particularly in text-to-image synthesis. To gain a better understanding of the training process and potential risks of text-to-image synthesis, we perform a systematic investigation of backdoor attack on text-to-image diffusion models and propose BadT2I, a general multimodal backdoor attack framework that tampers with image synthesis in diverse semantic levels. Specifically, we perform backdoor attacks on three levels of the vision semantics: Pixel-Backdoor, Object-Backdoor and Style-Backdoor. By utilizing a regularization loss, our methods efficiently inject backdoors into a large-scale text-to-image diffusion model while preserving its utility with benign inputs. We conduct empirical experiments on Stable Diffusion, the widely-used text-to-image diffusion model, demonstrating that the large-scale diffusion model can be easily backdoored within a few fine-tuning steps. We conduct additional experiments to explore the impact of different types of textual triggers, as well as the backdoor persistence during further training, providing insights for the development of backdoor defense methods. Besides, our investigation may contribute to the copyright protection of text-to-image models in the future.

OCDec 31, 2021
Distributed Random Reshuffling over Networks

Kun Huang, Xiao Li, Andre Milzarek et al.

In this paper, we consider distributed optimization problems where $n$ agents, each possessing a local cost function, collaboratively minimize the average of the local cost functions over a connected network. To solve the problem, we propose a distributed random reshuffling (D-RR) algorithm that invokes the random reshuffling (RR) update in each agent. We show that D-RR inherits favorable characteristics of RR for both smooth strongly convex and smooth nonconvex objective functions. In particular, for smooth strongly convex objective functions, D-RR achieves $\mathcal{O}(1/T^2)$ rate of convergence (where $T$ counts epoch number) in terms of the squared distance between the iterate and the global minimizer. When the objective function is assumed to be smooth nonconvex, we show that D-RR drives the squared norm of gradient to $0$ at a rate of $\mathcal{O}(1/T^{2/3})$. These convergence results match those of centralized RR (up to constant factors) and outperform the distributed stochastic gradient descent (DSGD) algorithm if we run a relatively large number of epochs. Finally, we conduct a set of numerical experiments to illustrate the efficiency of the proposed D-RR method on both strongly convex and nonconvex distributed optimization problems.

OCJul 26, 2021
Provably Accelerated Decentralized Gradient Method Over Unbalanced Directed Graphs

Zhuoqing Song, Lei Shi, Shi Pu et al.

We consider the decentralized optimization problem, where a network of $n$ agents aims to collaboratively minimize the average of their individual smooth and convex objective functions through peer-to-peer communication in a directed graph. To tackle this problem, we propose two accelerated gradient tracking methods, namely APD and APD-SC, for non-strongly convex and strongly convex objective functions, respectively. We show that APD and APD-SC converge at the rates $O\left(\frac{1}{k^2}\right)$ and $O\left(\left(1 - C\sqrt{\fracμ{L}}\right)^k\right)$, respectively, up to constant factors depending only on the mixing matrix. APD and APD-SC are the first decentralized methods over unbalanced directed graphs that achieve the same provable acceleration as centralized methods. Numerical experiments demonstrate the effectiveness of both methods.

OCJun 14, 2021
Compressed Gradient Tracking for Decentralized Optimization Over General Directed Networks

Zhuoqing Song, Lei Shi, Shi Pu et al.

In this paper, we propose two communication efficient decentralized optimization algorithms over a general directed multi-agent network. The first algorithm, termed Compressed Push-Pull (CPP), combines the gradient tracking Push-Pull method with communication compression. We show that CPP is applicable to a general class of unbiased compression operators and achieves linear convergence rate for strongly convex and smooth objective functions. The second algorithm is a broadcast-like version of CPP (B-CPP), and it also achieves linear convergence rate under the same conditions on the objective functions. B-CPP can be applied in an asynchronous broadcast setting and further reduce communication costs compared to CPP. Numerical experiments complement the theoretical analysis and confirm the effectiveness of the proposed methods.

OCMay 11, 2021
Improving the Transient Times for Distributed Stochastic Gradient Methods

Kun Huang, Shi Pu

We consider the distributed optimization problem where $n$ agents each possessing a local cost function, collaboratively minimize the average of the $n$ cost functions over a connected network. Assuming stochastic gradient information is available, we study a distributed stochastic gradient algorithm, called exact diffusion with adaptive stepsizes (EDAS) adapted from the Exact Diffusion method and NIDS and perform a non-asymptotic convergence analysis. We not only show that EDAS asymptotically achieves the same network independent convergence rate as centralized stochastic gradient descent (SGD) for minimizing strongly convex and smooth objective functions, but also characterize the transient time needed for the algorithm to approach the asymptotic convergence rate, which behaves as $K_T=\mathcal{O}\left(\frac{n}{1-λ_2}\right)$, where $1-λ_2$ stands for the spectral gap of the mixing matrix. To the best of our knowledge, EDAS achieves the shortest transient time when the average of the $n$ cost functions is strongly convex and each cost function is smooth. Numerical simulations further corroborate and strengthen the obtained theoretical results.

IROct 26, 2020
Multimodal Topic Learning for Video Recommendation

Shi Pu, Yijiang He, Zheng Li et al.

Facilitated by deep neural networks, video recommendation systems have made significant advances. Existing video recommendation systems directly exploit features from different modalities (e.g., user personal data, user behavior data, video titles, video tags, and visual contents) to input deep neural networks, while expecting the networks to online mine user-preferred topics implicitly from these features. However, the features lacking semantic topic information limits accurate recommendation generation. In addition, feature crosses using visual content features generate high dimensionality features that heavily downgrade the online computational efficiency of networks. In this paper, we explicitly separate topic generation from recommendation generation, propose a multimodal topic learning algorithm to exploit three modalities (i.e., tags, titles, and cover images) for generating video topics offline. The topics generated by the proposed algorithm serve as semantic topic features to facilitate preference scope determination and recommendation generation. Furthermore, we use the semantic topic features instead of visual content features to effectively reduce online computational cost. Our proposed algorithm has been deployed in the Kuaibao information streaming platform. Online and offline evaluation results show that our proposed algorithm performs favorably.

LGSep 12, 2020
A general framework for decentralized optimization with first-order methods

Ran Xin, Shi Pu, Angelia Nedić et al.

Decentralized optimization to minimize a finite sum of functions over a network of nodes has been a significant focus within control and signal processing research due to its natural relevance to optimal control and signal estimation problems. More recently, the emergence of sophisticated computing and large-scale data science needs have led to a resurgence of activity in this area. In this article, we discuss decentralized first-order gradient methods, which have found tremendous success in control, signal processing, and machine learning problems, where such methods, due to their simplicity, serve as the first method of choice for many complex inference and training tasks. In particular, we provide a general framework of decentralized first-order methods that is applicable to undirected and directed communication networks alike, and show that much of the existing work on optimization and consensus can be related explicitly to this framework. We further extend the discussion to decentralized stochastic first-order methods that rely on stochastic gradients at each node and describe how local variance reduction schemes, previously shown to have promise in the centralized settings, are able to improve the performance of decentralized methods when combined with what is known as gradient tracking. We motivate and demonstrate the effectiveness of the corresponding methods in the context of machine learning and signal processing problems that arise in decentralized environments.

OCJun 28, 2019
Asymptotic Network Independence in Distributed Stochastic Optimization for Machine Learning

Shi Pu, Alex Olshevsky, Ioannis Ch. Paschalidis

We provide a discussion of several recent results which, in certain scenarios, are able to overcome a barrier in distributed stochastic optimization for machine learning. Our focus is the so-called asymptotic network independence property, which is achieved whenever a distributed method executed over a network of n nodes asymptotically converges to the optimal solution at a comparable rate to a centralized method with the same computational power as the entire network. We explain this property through an example involving the training of ML models and sketch a short mathematical analysis for comparing the performance of distributed stochastic gradient descent (DSGD) with centralized stochastic gradient decent (SGD).

OCJun 6, 2019
A Sharp Estimate on the Transient Time of Distributed Stochastic Gradient Descent

Shi Pu, Alex Olshevsky, Ioannis Ch. Paschalidis

This paper is concerned with minimizing the average of $n$ cost functions over a network in which agents may communicate and exchange information with each other. We consider the setting where only noisy gradient information is available. To solve the problem, we study the distributed stochastic gradient descent (DSGD) method and perform a non-asymptotic convergence analysis. For strongly convex and smooth objective functions, DSGD asymptotically achieves the optimal network independent convergence rate compared to centralized stochastic gradient descent (SGD). Our main contribution is to characterize the transient time needed for DSGD to approach the asymptotic convergence rate, which we show behaves as $K_T=\mathcal{O}\left(\frac{n}{(1-ρ_w)^2}\right)$, where $1-ρ_w$ denotes the spectral gap of the mixing matrix. Moreover, we construct a "hard" optimization problem for which we show the transient time needed for DSGD to approach the asymptotic convergence rate is lower bounded by $Ω\left(\frac{n}{(1-ρ_w)^2} \right)$, implying the sharpness of the obtained result. Numerical experiments demonstrate the tightness of the theoretical results.

CVOct 9, 2018
Deep Attentive Tracking via Reciprocative Learning

Shi Pu, Yibing Song, Chao Ma et al.

Visual attention, derived from cognitive neuroscience, facilitates human perception on the most pertinent subset of the sensory data. Recently, significant efforts have been made to exploit attention schemes to advance computer vision systems. For visual tracking, it is often challenging to track target objects undergoing large appearance changes. Attention maps facilitate visual tracking by selectively paying attention to temporal robust features. Existing tracking-by-detection approaches mainly use additional attention modules to generate feature weights as the classifiers are not equipped with such mechanisms. In this paper, we propose a reciprocative learning algorithm to exploit visual attention for training deep classifiers. The proposed algorithm consists of feed-forward and backward operations to generate attention maps, which serve as regularization terms coupled with the original classification loss function for training. The deep classifier learns to attend to the regions of target objects robust to appearance changes. Extensive experiments on large-scale benchmark datasets show that the proposed attentive tracking method performs favorably against the state-of-the-art approaches.

OCJun 11, 2018
Swarming for Faster Convergence in Stochastic Optimization

Shi Pu, Alfredo Garcia

We study a distributed framework for stochastic optimization which is inspired by models of collective motion found in nature (e.g., swarming) with mild communication requirements. Specifically, we analyze a scheme in which each one of $N > 1$ independent threads, implements in a distributed and unsynchronized fashion, a stochastic gradient-descent algorithm which is perturbed by a swarming potential. Assuming the overhead caused by synchronization is not negligible, we show the swarming-based approach exhibits better performance than a centralized algorithm (based upon the average of $N$ observations) in terms of (real-time) convergence speed. We also derive an error bound that is monotone decreasing in network size and connectivity. We characterize the scheme's finite-time performances for both convex and non-convex objective functions.

OCMay 25, 2018
Distributed Stochastic Gradient Tracking Methods

Shi Pu, Angelia Nedić

In this paper, we study the problem of distributed multi-agent optimization over a network, where each agent possesses a local cost function that is smooth and strongly convex. The global objective is to find a common solution that minimizes the average of all cost functions. Assuming agents only have access to unbiased estimates of the gradients of their local cost functions, we consider a distributed stochastic gradient tracking method (DSGT) and a gossip-like stochastic gradient tracking method (GSGT). We show that, in expectation, the iterates generated by each agent are attracted to a neighborhood of the optimal solution, where they accumulate exponentially fast (under a constant stepsize choice). Under DSGT, the limiting (expected) error bounds on the distance of the iterates from the optimal solution decrease with the network size $n$, which is a comparable performance to a centralized stochastic gradient algorithm. Moreover, we show that when the network is well-connected, GSGT incurs lower communication cost than DSGT while maintaining a similar computational cost. Numerical example further demonstrates the effectiveness of the proposed methods.