CV Apr 12, 2023
Continual Diffusion: Continual Customization of Text-to-Image Diffusion with C-LoRA
James Seale Smith, Yen-Chang Hsu, Lingyu Zhang et al.
Recent works demonstrate a remarkable ability to customize text-to-image diffusion models while only providing a few example images. What happens if you try to customize such models using multiple, fine-grained concepts in a sequential (i.e., continual) manner? In our work, we show that recent state-of-the-art customization of text-to-image models suffers from catastrophic forgetting when new concepts arrive sequentially. Specifically, when adding a new concept, the ability to generate high-quality images of past, similar concepts degrades. To circumvent this forgetting, we propose a new method, C-LoRA, composed of a continually self-regularized low-rank adaptation in the cross-attention layers of the popular Stable Diffusion model. Furthermore, we use customization prompts which do not include the word of the customized object (e.g., "person" for a human face dataset) and are initialized as completely random embeddings. Importantly, our method induces only marginal additional parameter costs and requires no storage of user data for replay. We show that C-LoRA not only outperforms several baselines for our proposed setting of text-to-image continual customization, which we refer to as Continual Diffusion, but also achieves a new state-of-the-art in the well-established rehearsal-free continual learning setting for image classification. The strong performance of C-LoRA in two separate domains positions it as a compelling solution for a wide range of applications, and we believe it has significant potential for practical impact. Project page: https://jamessealesmith.github.io/continual-diffusion/
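As a rough illustration of the low-rank adaptation idea this abstract builds on, the sketch below applies a plain low-rank update to a frozen weight matrix. It is not the paper's C-LoRA: the self-regularization term and the cross-attention wiring are omitted, and every name and size here (`lora_adapt`, `lora_param_count`, the 3x3 example) is a hypothetical chosen for illustration.

```python
# Minimal sketch of a plain low-rank weight update (not the paper's C-LoRA).
# All function names and matrix sizes are illustrative assumptions.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_adapt(weight, a, b):
    """Return weight + a @ b, where a is (d_out x r) and b is (r x d_in)."""
    delta = matmul(a, b)
    return [[w + d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

def lora_param_count(d_out, d_in, rank):
    """Number of trainable parameters in the two low-rank factors."""
    return rank * (d_out + d_in)

# Frozen 3x3 weight plus a rank-1 update.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[1.0], [0.0], [2.0]]   # d_out x r
B = [[0.5, 0.0, 1.0]]       # r x d_in
W_adapted = lora_adapt(W, A, B)
```

For a 768x768 layer, a rank-4 update of this form trains `lora_param_count(768, 768, 4)` = 6144 values instead of 589,824, which is the sense in which the added parameter cost is marginal.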
AI Jan 13, 2023
Multitask Weakly Supervised Learning for Origin Destination Travel Time Estimation
Hongjun Wang, Zhiwen Zhang, Zipei Fan et al.
Travel time estimation from GPS trips is of great importance to order duration, ridesharing, taxi dispatching, etc. However, dense trajectories are not always available due to limitations of data privacy and acquisition, while origin-destination (OD) data, such as NYC taxi data, NYC bike data, and Capital Bikeshare data, is more accessible. To address this issue, this paper estimates OD trip travel time in combination with the road network. A Multitask Weakly Supervised Learning Framework for Travel Time Estimation (MWSL-TTE) is proposed to infer the transition probability between road segments and the travel time on road segments and intersections simultaneously. Technically, given an OD pair, the transition probability is used to recover the most probable route. The output travel time is then the sum of the travel times of all segments and intersections on this route. A novel route recovery function is proposed to iteratively maximize the current route's co-occurrence probability and minimize the discrepancy between the routes' probability distribution and the inverse distribution of the routes' estimation loss. Moreover, an expected log-likelihood function based on a weakly supervised framework is used to optimize the travel time from road segments and intersections concurrently. We conduct experiments on a wide range of real-world taxi datasets in Xi'an and Chengdu and demonstrate our method's effectiveness on route recovery and travel time estimation.
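The summation step described in this abstract can be sketched in a few lines. This is only an illustration under assumed inputs: the route is taken as already recovered, and the segment IDs, per-segment times, and the dictionary layout keyed by consecutive segment pairs are hypothetical, not the paper's data structures.

```python
# Sketch: OD travel time as the sum over a recovered route's segments
# and the intersections between consecutive segments.
# Segment IDs, times (seconds), and dict layout are illustrative assumptions.

def route_travel_time(route, segment_time, intersection_time):
    """Sum segment times plus the times of intersections crossed along the route."""
    seg_total = sum(segment_time[s] for s in route)
    # Each pair of consecutive segments is treated as one intersection crossing.
    inter_total = sum(intersection_time[(a, b)] for a, b in zip(route, route[1:]))
    return seg_total + inter_total

segment_time = {"s1": 30.0, "s2": 45.0, "s3": 20.0}
intersection_time = {("s1", "s2"): 10.0, ("s2", "s3"): 5.0}
total = route_travel_time(["s1", "s2", "s3"], segment_time, intersection_time)
```

Here the three segments contribute 95 seconds and the two intersections 15 seconds, so `total` is 110.0.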
CV Dec 12, 2022
Robust Perception through Equivariance
Chengzhi Mao, Lingyu Zhang, Abhishek Joshi et al.
Deep networks for computer vision are not reliable when they encounter adversarial examples. In this paper, we introduce a framework that uses the dense intrinsic constraints in natural images to robustify inference. By introducing constraints at inference time, we can shift the burden of robustness from training to the inference algorithm, thereby allowing the model to adjust dynamically to each individual image's unique and potentially novel characteristics at inference time. Among different constraints, we find that equivariance-based constraints are most effective, because they allow dense constraints in the feature space without overly constraining the representation at a fine-grained level. Our theoretical results validate the importance of having such dense constraints at inference time. Our empirical experiments show that restoring feature equivariance at inference time defends against worst-case adversarial perturbations. The method obtains improved adversarial robustness on four datasets (ImageNet, Cityscapes, PASCAL VOC, and MS-COCO) on image recognition, semantic segmentation, and instance segmentation tasks. Project page is available at equi4robust.cs.columbia.edu.
RO Sep 30, 2024
Enabling Multi-Robot Collaboration from Single-Human Guidance
Zhengran Ji, Lingyu Zhang, Paul Sajda et al.
Learning collaborative behaviors is essential for multi-agent systems. Traditionally, multi-agent reinforcement learning solves this implicitly through a joint reward and centralized observations, assuming collaborative behavior will emerge. Other studies propose to learn from demonstrations of a group of collaborative experts. Instead, we propose an efficient and explicit way of learning collaborative behaviors in multi-agent systems by leveraging expertise from only a single human. Our insight is that humans can naturally take on various roles in a team. We show that agents can effectively learn to collaborate by allowing a human operator to dynamically switch between controlling agents for a short period and incorporating a human-like theory-of-mind model of teammates. Our experiments showed that our method improves the success rate of a challenging collaborative hide-and-seek task by up to 58% with only 40 minutes of human guidance. We further demonstrate our findings transfer to the real world by conducting multi-robot experiments.
LG Nov 28, 2022
Easy Begun is Half Done: Spatial-Temporal Graph Modeling with ST-Curriculum Dropout
Hongjun Wang, Jiyuan Chen, Tong Pan et al.
Spatial-temporal (ST) graph modeling, such as traffic speed forecasting and taxi demand prediction, is an important task in the deep learning area. However, for the nodes in a graph, ST patterns can vary greatly in modeling difficulty, owing to the heterogeneous nature of ST data. We argue that unveiling the nodes to the model in a meaningful order, from easy to complex, can provide performance improvements over traditional training procedures. The idea has its roots in Curriculum Learning, which suggests that in the early stages of training, models can be sensitive to noise and difficult samples. In this paper, we propose ST-Curriculum Dropout, a novel and easy-to-implement strategy for spatial-temporal graph modeling. Specifically, we evaluate the learning difficulty of each node in a high-level feature space and drop the difficult ones out, so that the model only needs to handle fundamental ST relations at the beginning before gradually moving to hard ones. Our strategy can be applied to any canonical deep learning architecture without extra trainable parameters, and extensive experiments on a wide range of datasets illustrate that, by controlling the difficulty level of ST relations as the training progresses, the model is able to capture a better representation of the data and thus yields better generalization.
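The easy-to-hard schedule described in this abstract might look roughly like the sketch below. It rests on stated assumptions: per-node difficulty scores are supplied directly (the paper derives them in a high-level feature space, which is not reproduced here), and the linear growth of the kept fraction is an illustrative choice. The names `keep_ratio` and `select_nodes` are hypothetical.

```python
import math

# Sketch of an easy-to-hard node curriculum.
# Assumptions: difficulty scores are given as input, and the kept fraction
# grows linearly; neither is taken from the paper.

def keep_ratio(epoch, total_epochs, start=0.5):
    """Linearly grow the kept fraction from `start` to 1.0 over training."""
    if total_epochs <= 1:
        return 1.0
    progress = min(epoch / (total_epochs - 1), 1.0)
    return start + (1.0 - start) * progress

def select_nodes(difficulty, ratio):
    """Keep the easiest ceil(ratio * n) nodes; the rest are dropped this epoch."""
    n_keep = max(1, math.ceil(ratio * len(difficulty)))
    order = sorted(range(len(difficulty)), key=lambda i: difficulty[i])
    return sorted(order[:n_keep])

difficulty = [0.9, 0.1, 0.5, 0.3]  # hypothetical per-node scores
first_epoch = select_nodes(difficulty, keep_ratio(0, 10))
last_epoch = select_nodes(difficulty, keep_ratio(9, 10))
```

With these scores, the first epoch trains only on the two easiest nodes (indices 1 and 3) and the last epoch on all four. Because the selection only masks nodes, it adds no trainable parameters.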
CV Dec 13, 2022
Adversarially Robust Video Perception by Seeing Motion
Lingyu Zhang, Chengzhi Mao, Junfeng Yang et al.
Despite their excellent performance, state-of-the-art computer vision models often fail when they encounter adversarial examples. Video perception models tend to be more fragile under attacks, because the adversary has more places to manipulate in high-dimensional data. In this paper, we find one reason for video models' vulnerability is that they fail to perceive the correct motion under adversarial perturbations. Inspired by the extensive evidence that motion is a key factor for the human visual system, we propose to correct what the model sees by restoring the perceived motion information. Since motion information is an intrinsic structure of the video data, recovering motion signals can be done at inference time without any human annotation, which allows the model to adapt to unforeseen, worst-case inputs. Visualizations and empirical experiments on UCF-101 and HMDB-51 datasets show that restoring motion information in deep vision models improves adversarial robustness. Even under adaptive attacks where the adversary knows our defense, our algorithm is still effective. Our work provides new insight into robust video perception algorithms by using intrinsic structures from the data. Our webpage is available at https://motion4robust.cs.columbia.edu.
CL Dec 15, 2023
Do LVLMs Understand Charts? Analyzing and Correcting Factual Errors in Chart Captioning
Kung-Hsiang Huang, Mingyang Zhou, Hou Pong Chan et al.
Recent advancements in large vision-language models (LVLMs) have led to significant progress in generating natural language descriptions for visual content and thus enhancing various applications. One issue with these powerful models is that they sometimes produce texts that are factually inconsistent with the visual input. While there has been some effort to mitigate such inconsistencies in natural image captioning, the factuality of generated captions for structured document images, such as charts, has not received as much scrutiny, posing a potential threat to information reliability in critical applications. This work delves into the factuality aspect by introducing a comprehensive typology of factual errors in generated chart captions. A large-scale human annotation effort provides insight into the error patterns and frequencies in captions crafted by various chart captioning models, ultimately forming the foundation of a novel dataset, CHOCOLATE. Our analysis reveals that even state-of-the-art models, including GPT-4V, frequently produce captions laced with factual inaccuracies. In response to this challenge, we establish the new task of Chart Caption Factual Error Correction and introduce CHARTVE, a model for visual entailment that outperforms proprietary and open-source LVLMs in evaluating factual consistency. Furthermore, we propose C2TFEC, an interpretable two-stage framework that excels at correcting factual errors. This work inaugurates a new domain in factual error correction for chart captions, presenting a novel evaluation mechanism, and demonstrating an effective approach to ensuring the factuality of generated chart captions. The code and data as well as the continuously updated benchmark can be found at: https://khuangaf.github.io/CHOCOLATE/.
HC Jul 31, 2024
CREW: Facilitating Human-AI Teaming Research
Lingyu Zhang, Zhengran Ji, Boyuan Chen
With the increasing deployment of artificial intelligence (AI) technologies, the potential of humans working with AI agents has been growing at a great speed. Human-AI teaming is an important paradigm for studying various aspects when humans and AI agents work together. The unique aspect of Human-AI teaming research is the need to jointly study humans and AI agents, demanding multidisciplinary research efforts from machine learning to human-computer interaction, robotics, cognitive science, neuroscience, psychology, social science, and complex systems. However, existing platforms for Human-AI teaming research are limited, often supporting oversimplified scenarios and a single task, or specifically focusing on either human-teaming research or multi-agent AI algorithms. We introduce CREW, a platform to facilitate Human-AI teaming research in real-time decision-making scenarios and engage collaborations from multiple scientific disciplines, with a strong emphasis on human involvement. It includes pre-built tasks for cognitive studies and Human-AI teaming with expandable potentials from our modular design. Following conventional cognitive neuroscience research, CREW also supports multimodal human physiological signal recording for behavior analysis. Moreover, CREW benchmarks real-time human-guided reinforcement learning agents using state-of-the-art algorithms and well-tuned baselines. With CREW, we were able to conduct 50 human subject studies within a week to verify the effectiveness of our benchmark.
HC Mar 24
Scientific judgment drifts over time in AI ideation
Lingyu Zhang, Mitchell Wang, Boyuan Chen
Scientific discovery begins with ideas, yet evaluating early-stage research concepts is a subtle and subjective human judgment. As large language models (LLMs) are increasingly tasked with generating scientific hypotheses, most systems implicitly assume that scientists' evaluations form a fixed gold standard that does not change over time. Here we challenge this assumption. In a two-wave study with 7,938 ratings from 63 active researchers across six scientific departments, each participant repeatedly evaluated a constant "control" research idea alongside AI-generated ideas. We find that expert evaluations are not stable: test-retest reliability of overall quality is only moderate (ICC 0.59-0.74), indicating substantial within-participant variability even for identical ideas. Yet the internal structure of judgment remained stable, such as the relative importance placed on originality, feasibility, clarity, and other criteria. We then aligned an LLM-based ideation system to first-wave human ratings and used it to select new ideas. Although alignment improved agreement with Wave-1 evaluations, its apparent gains disappeared once drift in human standards was accounted for. Thus, tuning to a fixed human snapshot produced improvements that were transient rather than persistent. These findings reveal that human evaluation of scientific ideas is not static but a dynamic process that has stable priorities and requires shifting calibration. Treating one-time human ratings as immutable ground truth risks overstating progress in AI-assisted ideation and obscuring the challenge of co-evolving with changing expert standards. Drift-aware evaluation protocols and longitudinal benchmarks may therefore be essential for building AI systems that reliably augment, rather than overfit to, human scientific judgment.
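To make the test-retest reliability figure concrete, the sketch below computes an intraclass correlation coefficient from repeated ratings. The abstract does not say which ICC form was used, so the one-way random-effects ICC(1,1) here is an assumption, and the ratings are made-up toy values rather than study data.

```python
# Sketch: one-way random-effects ICC(1,1) for repeated ratings.
# The ICC variant is an assumption; the abstract does not specify one.

def icc_oneway(ratings):
    """ratings: list of per-target lists, each holding k repeated scores."""
    n = len(ratings)
    k = len(ratings[0])
    if n < 2 or k < 2 or any(len(r) != k for r in ratings):
        raise ValueError("need >= 2 targets, each with the same k >= 2 ratings")
    grand = sum(sum(r) for r in ratings) / (n * k)
    means = [sum(r) / k for r in ratings]
    # Mean squares between targets and within targets.
    ms_between = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    ms_within = sum((x - m) ** 2 for r, m in zip(ratings, means) for x in r) / (n * (k - 1))
    denom = ms_between + (k - 1) * ms_within
    if denom == 0:
        raise ValueError("ratings have no variance")
    return (ms_between - ms_within) / denom

stable = icc_oneway([[3, 3], [5, 5], [7, 7]])   # identical scores in both waves
unstable = icc_oneway([[1, 3], [3, 1]])         # scores swap between waves
```

Identical scores across waves give an ICC of 1.0, while scores that swap between waves give -1.0. Values in between indicate partial stability.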
LG Oct 19, 2024
GUIDE: Real-Time Human-Shaped Agents
Lingyu Zhang, Zhengran Ji, Nicholas R Waytowich et al.
The recent rapid advancement of machine learning has been driven by increasingly powerful models with the growing availability of training data and computational resources. However, real-time decision-making tasks with limited time and sparse learning signals remain challenging. One way of improving the learning speed and performance of these agents is to leverage human guidance. In this work, we introduce GUIDE, a framework for real-time human-guided reinforcement learning by enabling continuous human feedback and grounding such feedback into dense rewards to accelerate policy learning. Additionally, our method features a simulated feedback module that learns and replicates human feedback patterns in an online fashion, effectively reducing the need for human input while allowing continual training. We demonstrate the performance of our framework on challenging tasks with sparse rewards and visual observations. Our human study involving 50 subjects offers strong quantitative and qualitative evidence of the effectiveness of our approach. With only 10 minutes of human feedback, our algorithm achieves up to 30% increase in success rate compared to its RL baseline.
LG Nov 18, 2024
Unveiling the Inflexibility of Adaptive Embedding in Traffic Forecasting
Hongjun Wang, Jiyuan Chen, Lingyu Zhang et al.
Spatiotemporal Graph Neural Networks (ST-GNNs) and Transformers have shown significant promise in traffic forecasting by effectively modeling temporal and spatial correlations. However, rapid urbanization in recent years has led to dynamic shifts in traffic patterns and travel demand, posing major challenges for accurate long-term traffic prediction. The generalization capability of ST-GNNs in extended temporal scenarios and cross-city applications remains largely unexplored. In this study, we evaluate state-of-the-art models on an extended traffic benchmark and observe substantial performance degradation in existing ST-GNNs over time, which we attribute to their limited inductive capabilities. Our analysis reveals that this degradation stems from an inability to adapt to evolving spatial relationships within urban environments. To address this limitation, we reconsider the design of adaptive embeddings and propose a Principal Component Analysis (PCA) embedding approach that enables models to adapt to new scenarios without retraining. We incorporate PCA embeddings into existing ST-GNN and Transformer architectures, achieving marked improvements in performance. Notably, PCA embeddings allow for flexibility in graph structures between training and testing, enabling models trained on one city to perform zero-shot predictions on other cities. This adaptability demonstrates the potential of PCA embeddings in enhancing the robustness and generalization of spatiotemporal models.
LG Aug 29, 2025
A Knowledge-Guided Cross-Modal Feature Fusion Model for Local Traffic Demand Prediction
Lingyu Zhang, Pengfei Xu, Guobin Wu et al.
Traffic demand prediction plays a critical role in intelligent transportation systems. Existing traffic prediction models primarily rely on temporal traffic data, with limited efforts incorporating human knowledge and experience for urban traffic demand forecasting. However, in real-world scenarios, traffic knowledge and experience derived from human daily life significantly influence precise traffic prediction. Such knowledge and experiences can guide the model in uncovering latent patterns within traffic data, thereby enhancing the accuracy and robustness of predictions. To this end, this paper proposes integrating structured temporal traffic data with textual data representing human knowledge and experience, resulting in a novel knowledge-guided cross-modal feature representation learning (KGCM) model for traffic demand prediction. Based on regional transportation characteristics, we construct a prior knowledge dataset using a large language model combined with manual authoring and revision, covering both regional and global knowledge and experiences. The KGCM model then learns multimodal data features through designed local and global adaptive graph networks, as well as a cross-modal feature fusion mechanism. A proposed reasoning-based dynamic update strategy enables dynamic optimization of the graph model's parameters, achieving optimal performance. Experiments on multiple traffic datasets demonstrate that our model accurately predicts future traffic demand and outperforms existing state-of-the-art (SOTA) models.
IR Aug 29, 2025
Next Point-of-interest (POI) Recommendation Model Based on Multi-modal Spatio-temporal Context Feature Embedding
Lingyu Zhang, Guobin Wu, Yan Wang et al.
The next Point-of-interest (POI) recommendation is mainly based on sequential traffic information to predict the user's next boarding point location. This is a highly regarded and widely applied research task in the field of intelligent transportation, and there have been many research results to date. Traditional POI prediction models primarily rely on short-term traffic sequence information, often neglecting both long-term and short-term preference data, as well as crucial spatiotemporal context features in user behavior. To address this issue, this paper introduces user long-term preference information and key spatiotemporal context information, and proposes a POI recommendation model based on multimodal spatiotemporal context feature embedding. The model extracts long-term preference features and key spatiotemporal context features from traffic data through modules such as spatiotemporal feature processing, multimodal embedding, and self-attention aggregation. It then uses a weighted fusion method to dynamically adjust the weights of long-term and short-term features based on users' historical behavior patterns and the current context. Finally, the fused features are matched using attention, and the probability of each location candidate becoming the next location is calculated. This paper conducts experimental verification on multiple transportation datasets, and the results show that the POI prediction model combining multiple types of features has higher prediction accuracy than existing SOTA models and methods.
RO Oct 26, 2021
Research on the Inverse Kinematics Prediction of a Soft Biomimetic Actuator via BP Neural Network
Huichen Ma, Junjie Zhou, Jian Zhang et al.
In this work, we address the inverse kinematics problem in motion planning for soft biomimetic actuators driven by three chambers. Soft biomimetic actuators have been used in many applications owing to their intrinsic softness. Although a mathematical model can be derived to describe the inverse dynamics of this actuator, it is still not accurate enough to capture the nonlinearity and uncertainty of the material and the system. Besides, such a complex model is time-consuming, so it is not easy to apply in a real-time control unit. Therefore, developing a model-free approach in this area could be a new idea. To overcome these intrinsic problems, we propose a back-propagation (BP) neural network that learns the inverse kinematics of the soft biomimetic actuator moving in three-dimensional space. After training with sample data, the BP neural network model can represent the relation between the manipulator tip position and the pressure applied to the chambers. The proposed algorithm is more precise than the analytical model. The results show that a desired terminal position can be achieved with a degree of accuracy of 2.46% relative average error with respect to the total actuator length.
LG May 7, 2021
Apply Artificial Neural Network to Solving Manpower Scheduling Problem
Tianyu Liu, Lingyu Zhang
The manpower scheduling problem is a critical combinatorial optimization problem. Researching solutions to scheduling problems can improve the efficiency of companies, hospitals, and other work units. Building on existing research, this paper proposes a new model that incorporates deep learning to solve the multi-shift manpower scheduling problem. The model first optimizes the objective function under the current constraints to find an initial employee arrangement plan. It then uses the scheduling table generation algorithm to obtain the scheduling result in a short time. Moreover, the most prominent feature of our proposal is a time-series-based neural network training method for solving long-term and long-period scheduling tasks and obtaining manpower arrangements. The selection criteria of the neural network and the training process are also described in this paper. We demonstrate that our model can make a precise forecast based on the improvement of neural networks. This paper also discusses the challenges in the neural network training process and obtains enlightening results after getting the arrangement plan. Our research shows that neural networks and deep learning strategies have the potential to solve similar problems effectively.
AI May 7, 2021
An Intelligent Model for Solving Manpower Scheduling Problems
Lingyu Zhang, Tianyu Liu, Yunhai Wang
The manpower scheduling problem is a critical research field in the resource management area. Based on the existing studies on scheduling problem solutions, this paper transforms the manpower scheduling problem into a combinatorial optimization problem under multi-constraint conditions from a new perspective. It also uses logical paradigms to build a mathematical model for problem solution and an improved multi-dimensional evolution algorithm for solving the model. Moreover, the constraints discussed in this paper basically cover all the requirements of human resource coordination in modern society and are supported by our experimental results. In the discussion part, we compare our model with other heuristic algorithms and linear programming methods and show that the proposed model achieves increases of up to 25.7% in efficiency and 17% in accuracy. In addition to the numerical solution of the manpower scheduling problem, this paper also studies the algorithm for scheduling task list generation and the method of displaying scheduling results. As a result, we not only provide various modifications of the basic algorithm to solve problems under different conditions but also propose a new algorithm that improves time efficiency by at least 28.91% compared with different baseline models.
CY Aug 7, 2020
Predicting Individual Treatment Effects of Large-scale Team Competitions in a Ride-sharing Economy
Teng Ye, Wei Ai, Lingyu Zhang et al.
Millions of drivers worldwide have enjoyed financial benefits and work schedule flexibility through a ride-sharing economy, but meanwhile they have suffered from the lack of a sense of identity and career achievement. Equipped with social identity and contest theories, financially incentivized team competitions have been an effective instrument to increase drivers' productivity, job satisfaction, and retention, and to improve revenue over cost for ride-sharing platforms. While these competitions are overall effective, the decisive factors behind the treatment effects and how they affect the outcomes of individual drivers have been largely mysterious. In this study, we analyze data collected from more than 500 large-scale team competitions organized by a leading ride-sharing platform, building machine learning models to predict individual treatment effects. Through a careful investigation of features and predictors, we are able to reduce out-of-sample prediction error by more than 24%. Through interpreting the best-performing models, we discover many novel and actionable insights regarding how to optimize the design and the execution of team competitions on ride-sharing platforms. A simulated analysis demonstrates that by simply changing a few contest design options, the average treatment effect of a real competition is expected to increase by as much as 26%. Our procedure and findings shed light on how to analyze and optimize large-scale online field experiments in general.
LG May 27, 2019
Multi-Modal Graph Interaction for Multi-Graph Convolution Network in Urban Spatiotemporal Forecasting
Xu Geng, Xiyu Wu, Lingyu Zhang et al.
Graph convolution network-based approaches have recently been used to model region-wise relationships in region-level prediction problems in urban computing. Each relationship represents a kind of spatial dependency, like region-wise distance or functional similarity. To incorporate multiple relationships into spatial feature extraction, we define the problem as a multi-modal machine learning problem on multi-graph convolution networks. Leveraging the advantage of multi-modal machine learning, we propose to develop modality interaction mechanisms for this problem, in order to reduce generalization error by reinforcing the learning of multimodal coordinated representations. In this work, we propose two interaction techniques for handling features in lower layers and higher layers respectively. In lower layers, we propose grouped GCN to combine the graph connectivity from different modalities for more complete spatial feature extraction. In higher layers, we adapt multi-linear relationship networks to GCN by exploring the dimension transformation and freezing part of the covariance structure. The adapted approach, called multi-linear relationship GCN, learns more generalized features to overcome the train-test divergence induced by time shifting. We evaluated our model on the ride-hailing demand forecasting problem using two real-world datasets. The proposed technique outperforms state-of-the-art baselines in terms of prediction accuracy, training efficiency, interpretability, and model robustness.
IR Mar 18, 2019
POI Semantic Model with a Deep Convolutional Structure
Ji Zhao, Meiyu Yu, Huan Chen et al.
When using an electronic map, POI retrieval is the initial and important step, whose quality directly affects the user experience. Similarity between the user query and POI information is the most critical feature in POI retrieval. An accurate similarity calculation is challenging since a mismatch between a query and a retrieval text may exist in the case of a mistyped query or an alias inquiry. In this paper, we propose a POI latent semantic model based on deep networks, which can effectively extract query features and POI information features for the similarity calculation. Our model describes the semantic information of complex texts at multiple layers, and achieves multi-field matches by modeling a POI's name and detailed address respectively. Our model is evaluated on POI retrieval ranking datasets, including labeled relevance data and real-world user click data in POI retrieval. Results show that our model significantly outperforms our competitors in POI retrieval ranking tasks. The proposed algorithm has become a critical component of an online system serving millions of people every day.