CVAug 19, 2023Code
BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual QuestionsWenbo Hu, Yifan Xu, Yi Li et al. · harvard
Vision Language Models (VLMs), which extend Large Language Models (LLM) by incorporating visual understanding capability, have demonstrated significant advancements in addressing open-ended visual question-answering (VQA) tasks. However, these models cannot accurately interpret images infused with text, a common occurrence in real-world scenarios. Standard procedures for extracting information from images often involve learning a fixed set of query embeddings. These embeddings are designed to encapsulate image contexts and are later used as soft prompt inputs in LLMs. Yet, this process is limited to the token count, potentially curtailing the recognition of scenes with text-rich context. To improve upon them, the present study introduces BLIVA: an augmented version of InstructBLIP with Visual Assistant. BLIVA incorporates the query embeddings from InstructBLIP and also directly projects encoded patch embeddings into the LLM, a technique inspired by LLaVA. This approach assists the model to capture intricate details potentially missed during the query decoding process. Empirical evidence demonstrates that our model, BLIVA, significantly enhances performance in processing text-rich VQA benchmarks (up to 17.76% in OCR-VQA benchmark) and in undertaking general (not particularly text-rich) VQA benchmarks (up to 7.9% in Visual Spatial Reasoning benchmark), and achieved 17.72% overall improvement in a comprehensive multimodal LLM benchmark (MME), comparing to our baseline InstructBLIP. BLIVA demonstrates significant capability in decoding real-world images, irrespective of text presence. To demonstrate the broad industry applications enabled by BLIVA, we evaluate the model using a new dataset comprising YouTube thumbnails paired with question-answer sets across 11 diverse categories. Our code and models are freely accessible at https://github.com/mlpc-ucsd/BLIVA.
LGJun 3Code
CausalPOI: Spatio-Temporal Graph-Based Causal Modeling for Cold-Start POI Check-in ForecastingZhaoqi Zhang, Miao Xie, Yi Li et al.
As urban environments continue to evolve rapidly, accurately modeling the dynamic behaviour of Points of Interest is essential for supporting data-driven urban planning and commercial decision-making. While recent advancements in spatio-temporal graph learning have improved POI forecasting, most methods rely on proximity-based graphs and correlation-driven modeling, which overlook the functional dependencies between POIs and fail to capture the causal effects of urban interventions. In this paper, we introduce a novel research problem -- cold-start POI check-in forecasting, which aims to predict the future check-in pattern of a newly introduced POI, by modeling its temporal evolution and functional interactions with nearby POIs in a structured urban spatial context. To address these challenges, we propose CausalPOI, a spatio-temporal graph-based causal representation learning framework. CausalPOI leverages Spatio-Temporal Functional Interaction Graph to model semantic and spatial relationships between POIs, and constructs structurally aligned treatment and control graphs to simulate factual and counterfactual scenarios. Extensive experiments on real-world SafeGraph datasets demonstrate that CausalPOI significantly outperforms state-of-the-art baselines across the board, validating its effectiveness in spatio-temporal forecasting, semantic interaction modeling, and causal effect estimation, providing a more interpretable and actionable foundation for urban intervention analysis. Source code is available at Github.
CVAug 23, 2023Code
CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say NoHualiang Wang, Yi Li, Huifeng Yao et al.
Out-of-distribution (OOD) detection refers to training the model on an in-distribution (ID) dataset to classify whether the input images come from unknown classes. Considerable effort has been invested in designing various OOD detection methods based on either convolutional neural networks or transformers. However, zero-shot OOD detection methods driven by CLIP, which only require class names for ID, have received less attention. This paper presents a novel method, namely CLIP saying no (CLIPN), which empowers the logic of saying no within CLIP. Our key motivation is to equip CLIP with the capability of distinguishing OOD and ID samples using positive-semantic prompts and negation-semantic prompts. Specifically, we design a novel learnable no prompt and a no text encoder to capture negation semantics within images. Subsequently, we introduce two loss functions: the image-text binary-opposite loss and the text semantic-opposite loss, which we use to teach CLIPN to associate images with no prompts, thereby enabling it to identify unknown samples. Furthermore, we propose two threshold-free inference algorithms to perform OOD detection by utilizing negation semantics from no prompts and the text encoder. Experimental results on 9 benchmark datasets (3 ID datasets and 6 OOD datasets) for the OOD detection task demonstrate that CLIPN, based on ViT-B-16, outperforms 7 well-used algorithms by at least 2.34% and 11.64% in terms of AUROC and FPR95 for zero-shot OOD detection on ImageNet-1K. Our CLIPN can serve as a solid foundation for effectively leveraging CLIP in downstream OOD tasks. The code is available on https://github.com/xmed-lab/CLIPN.
CVApr 12, 2023Code
A Closer Look at the Explainability of Contrastive Language-Image Pre-trainingYi Li, Hualiang Wang, Yiqun Duan et al.
Contrastive language-image pre-training (CLIP) is a powerful vision-language model that has shown great benefits for various tasks. However, we have identified some issues with its explainability, which undermine its credibility and limit the capacity for related tasks. Specifically, we find that CLIP tends to focus on background regions rather than foregrounds, with noisy activations at irrelevant positions on the visualization results. These phenomena conflict with conventional explainability methods based on the class attention map (CAM), where the raw model can highlight the local foreground regions using global supervision without alignment. To address these problems, we take a closer look at its architecture and features. Based on thorough analyses, we find the raw self-attentions link to inconsistent semantic regions, resulting in the opposite visualization. Besides, the noisy activations are owing to redundant features among categories. Building on these insights, we propose the CLIP Surgery for reliable CAM, a method that allows surgery-like modifications to the inference architecture and features, without further fine-tuning as classical CAM methods. This approach significantly improves the explainability of CLIP, surpassing existing methods by large margins. Besides, it enables multimodal visualization and extends the capacity of raw CLIP on open-vocabulary tasks without extra alignment. The code is available at https://github.com/xmed-lab/CLIP_Surgery.
CVMay 30
An Attribute-Based Measure of Video ComplexityAditya Sarkar, Yi Li, Zihao Wang et al.
A new framework for the estimation of the complexity posed by video-question pairs to video-LLMs, Video Attribute-Based Complexity (VideoABC), is proposed. Video complexity is defined as the probability of failure of a video-LLM for a given video-question pair. VideoABC is a non-parametric complexity measure, using a reference video dataset and a pre-defined vocabulary of video attributes informative of complexity, \eg the scene complexity or the speed of the video event informative of the question. In a training phase, reference videos are projected into the space of these attributes, which is then quantized. The expected ABC of each quantization cell is then computed. Given a new video and its projection into the attribute space, complexity is estimated by the expected ABC of the associated quantization cell. To enable the use of VideoABC with small reference video datasets, two quantizers are combined: a k-means quantizer that enables accurate complexity estimates for samples in the distribution of the reference dataset and a universal lattice quantizer that guarantees generalization to out-of-distribution samples. A synthetic video generation procedure, inspired by target-distractor manipulations of psychophysics studies, is proposed to populate the cells of the lattice quantizer during training, enabling the computation of their expected ABCs. Experimental results show that VideoABCis effective even with very low-dimensional attribute representations, substantially outperforming approaches like `video-LLM as judge' with much less complexity. Finally, the explainable nature of the VideoABC score, in terms of well-defined attributes, is shown to provide insights on how the attribute composition of benchmarks affects their complexity.
CVJul 17, 2022
CATRE: Iterative Point Clouds Alignment for Category-level Object Pose RefinementXingyu Liu, Gu Wang, Yi Li et al. · tsinghua
While category-level 9DoF object pose estimation has emerged recently, previous correspondence-based or direct regression methods are both limited in accuracy due to the huge intra-category variances in object shape and color, etc. Orthogonal to them, this work presents a category-level object pose and size refiner CATRE, which is able to iteratively enhance pose estimate from point clouds to produce accurate results. Given an initial pose estimate, CATRE predicts a relative transformation between the initial pose and ground truth by means of aligning the partially observed point cloud and an abstract shape prior. In specific, we propose a novel disentangled architecture being aware of the inherent distinctions between rotation and translation/size estimation. Extensive experiments show that our approach remarkably outperforms state-of-the-art methods on REAL275, CAMERA25, and LM benchmarks up to a speed of ~85.32Hz, and achieves competitive results on category-level tracking. We further demonstrate that CATRE can perform pose refinement on unseen category. Code and trained models are available.
AIJun 3
Agents' Last ExamYiyou Sun, Xinyang Han, Weichen Zhang et al.
Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows. This paper introduces Agents' Last Exam (ALE), a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. Developed in collaboration with 250+ industry experts, ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy). It is organized around a task taxonomy with 55 subfields grouped into 13 industry clusters covering 1K+ tasks. Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is 2.6%. ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded. More broadly, ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP-relevant impact.
CVJun 14, 2022Code
Online Easy Example Mining for Weakly-supervised Gland Segmentation from Histology ImagesYi Li, Yiduo Yu, Yiwen Zou et al.
Developing an AI-assisted gland segmentation method from histology images is critical for automatic cancer diagnosis and prognosis; however, the high cost of pixel-level annotations hinders its applications to broader diseases. Existing weakly-supervised semantic segmentation methods in computer vision achieve degenerative results for gland segmentation, since the characteristics and problems of glandular datasets are different from general object datasets. We observe that, unlike natural images, the key problem with histology images is the confusion of classes owning to morphological homogeneity and low color contrast among different tissues. To this end, we propose a novel method Online Easy Example Mining (OEEM) that encourages the network to focus on credible supervision signals rather than noisy signals, therefore mitigating the influence of inevitable false predictions in pseudo-masks. According to the characteristics of glandular datasets, we design a strong framework for gland segmentation. Our results exceed many fully-supervised methods and weakly-supervised methods for gland segmentation over 4.4% and 6.04% at mIoU, respectively. Code is available at https://github.com/xmed-lab/OEEM.
CLSep 21, 2024Code
ChemEval: A Comprehensive Multi-Level Chemical Evaluation for Large Language ModelsYuqing Huang, Rongyang Zhang, Xuesong He et al.
There is a growing interest in the role that LLMs play in chemistry which lead to an increased focus on the development of LLMs benchmarks tailored to chemical domains to assess the performance of LLMs across a spectrum of chemical tasks varying in type and complexity. However, existing benchmarks in this domain fail to adequately meet the specific requirements of chemical research professionals. To this end, we propose \textbf{\textit{ChemEval}}, which provides a comprehensive assessment of the capabilities of LLMs across a wide range of chemical domain tasks. Specifically, ChemEval identified 4 crucial progressive levels in chemistry, assessing 12 dimensions of LLMs across 42 distinct chemical tasks which are informed by open-source data and the data meticulously crafted by chemical experts, ensuring that the tasks have practical value and can effectively evaluate the capabilities of LLMs. In the experiment, we evaluate 12 mainstream LLMs on ChemEval under zero-shot and few-shot learning contexts, which included carefully selected demonstration examples and carefully designed prompts. The results show that while general LLMs like GPT-4 and Claude-3.5 excel in literature understanding and instruction following, they fall short in tasks demanding advanced chemical knowledge. Conversely, specialized LLMs exhibit enhanced chemical competencies, albeit with reduced literary comprehension. This suggests that LLMs have significant potential for enhancement when tackling sophisticated tasks in the field of chemistry. We believe our work will facilitate the exploration of their potential to drive progress in chemistry. Our benchmark and analysis will be available at {\color{blue} \url{https://github.com/USTC-StarTeam/ChemEval}}.
CVMay 31, 2022
VALHALLA: Visual Hallucination for Machine TranslationYi Li, Rameswar Panda, Yoon Kim et al.
Designing better machine translation systems by considering auxiliary inputs such as images has attracted much attention in recent years. While existing methods show promising performance over the conventional text-only translation systems, they typically require paired text and image as input during inference, which limits their applicability to real-world scenarios. In this paper, we introduce a visual hallucination framework, called VALHALLA, which requires only source sentences at inference time and instead uses hallucinated visual representations for multimodal machine translation. In particular, given a source sentence an autoregressive hallucination transformer is used to predict a discrete visual representation from the input text, and the combined text and hallucinated representations are utilized to obtain the target translation. We train the hallucination transformer jointly with the translation transformer using standard backpropagation with cross-entropy losses while being guided by an additional loss that encourages consistency between predictions using either ground-truth or hallucinated visual representations. Extensive experiments on three standard translation datasets with a diverse set of language pairs demonstrate the effectiveness of our approach over both text-only baselines and state-of-the-art methods. Project page: http://www.svcl.ucsd.edu/projects/valhalla.
CVAug 15, 2023Code
Context-Aware Pseudo-Label Refinement for Source-Free Domain Adaptive Fundus Image SegmentationZheang Huai, Xinpeng Ding, Yi Li et al.
In the domain adaptation problem, source data may be unavailable to the target client side due to privacy or intellectual property issues. Source-free unsupervised domain adaptation (SF-UDA) aims at adapting a model trained on the source side to align the target distribution with only the source model and unlabeled target data. The source model usually produces noisy and context-inconsistent pseudo-labels on the target domain, i.e., neighbouring regions that have a similar visual appearance are annotated with different pseudo-labels. This observation motivates us to refine pseudo-labels with context relations. Another observation is that features of the same class tend to form a cluster despite the domain gap, which implies context relations can be readily calculated from feature distances. To this end, we propose a context-aware pseudo-label refinement method for SF-UDA. Specifically, a context-similarity learning module is developed to learn context relations. Next, pseudo-label revision is designed utilizing the learned context relations. Further, we propose calibrating the revised pseudo-labels to compensate for wrong revision caused by inaccurate context relations. Additionally, we adopt a pixel-level and class-level denoising scheme to select reliable pseudo-labels for domain adaptation. Experiments on cross-domain fundus images indicate that our approach yields the state-of-the-art results. Code is available at https://github.com/xmed-lab/CPR.
CVApr 18, 2023
SViTT: Temporal Learning of Sparse Video-Text TransformersYi Li, Kyle Min, Subarna Tripathi et al.
Do video-text transformers learn to model temporal relationships across frames? Despite their immense capacity and the abundance of multimodal training data, recent work has revealed the strong tendency of video-text models towards frame-based spatial representations, while temporal reasoning remains largely unsolved. In this work, we identify several key challenges in temporal learning of video-text transformers: the spatiotemporal trade-off from limited network size; the curse of dimensionality for multi-frame modeling; and the diminishing returns of semantic information by extending clip length. Guided by these findings, we propose SViTT, a sparse video-text architecture that performs multi-frame reasoning with significantly lower cost than naive transformers with dense attention. Analogous to graph-based networks, SViTT employs two forms of sparsity: edge sparsity that limits the query-key communications between tokens in self-attention, and node sparsity that discards uninformative visual tokens. Trained with a curriculum which increases model sparsity with the clip length, SViTT outperforms dense transformer baselines on multiple video-text retrieval and question answering benchmarks, with a fraction of computational cost. Project page: http://svcl.ucsd.edu/projects/svitt.
IVApr 13, 2022
WSSS4LUAD: Grand Challenge on Weakly-supervised Tissue Semantic Segmentation for Lung AdenocarcinomaChu Han, Xipeng Pan, Lixu Yan et al.
Lung cancer is the leading cause of cancer death worldwide, and adenocarcinoma (LUAD) is the most common subtype. Exploiting the potential value of the histopathology images can promote precision medicine in oncology. Tissue segmentation is the basic upstream task of histopathology image analysis. Existing deep learning models have achieved superior segmentation performance but require sufficient pixel-level annotations, which is time-consuming and expensive. To enrich the label resources of LUAD and to alleviate the annotation efforts, we organize this challenge WSSS4LUAD to call for the outstanding weakly-supervised semantic segmentation (WSSS) techniques for histopathology images of LUAD. Participants have to design the algorithm to segment tumor epithelial, tumor-associated stroma and normal tissue with only patch-level labels. This challenge includes 10,091 patch-level annotations (the training set) and over 130 million labeled pixels (the validation and test sets), from 87 WSIs (67 from GDPH, 20 from TCGA). All the labels were generated by a pathologist-in-the-loop pipeline with the help of AI models and checked by the label review board. Among 532 registrations, 28 teams submitted the results in the test phase with over 1,000 submissions. Finally, the first place team achieved mIoU of 0.8413 (tumor: 0.8389, stroma: 0.7931, normal: 0.8919). According to the technical reports of the top-tier teams, CAM is still the most popular approach in WSSS. Cutmix data augmentation has been widely adopted to generate more reliable samples. With the success of this challenge, we believe that WSSS approaches with patch-level annotations can be a complement to the traditional pixel annotations while reducing the annotation efforts. The entire dataset has been released to encourage more researches on computational pathology in LUAD and more novel WSSS techniques.
LGJun 11, 2023
Learning the Positions in CountSketchYi Li, Honghao Lin, Simin Liu et al.
We consider sketching algorithms which first compress data by multiplication with a random sketch matrix, and then apply the sketch to quickly solve an optimization problem, e.g., low-rank approximation and regression. In the learning-based sketching paradigm proposed by~\cite{indyk2019learning}, the sketch matrix is found by choosing a random sparse matrix, e.g., CountSketch, and then the values of its non-zero entries are updated by running gradient descent on a training data set. Despite the growing body of work on this paradigm, a noticeable omission is that the locations of the non-zero entries of previous algorithms were fixed, and only their values were learned. In this work, we propose the first learning-based algorithms that also optimize the locations of the non-zero entries. Our first proposed algorithm is based on a greedy algorithm. However, one drawback of the greedy algorithm is its slower training time. We fix this issue and propose approaches for learning a sketching matrix for both low-rank approximation and Hessian approximation for second order optimization. The latter is helpful for a range of constrained optimization problems, such as LASSO and matrix estimation with a nuclear norm constraint. Both approaches achieve good accuracy with a fast running time. Moreover, our experiments suggest that our algorithm can still reduce the error significantly even if we only have a very limited number of training matrices.
CVMar 25, 2022
Unsupervised Image Deraining: Optimization Model Driven Deep CNNChangfeng Yu, Yi Chang, Yi Li et al.
The deep convolutional neural network has achieved significant progress for single image rain streak removal. However, most of the data-driven learning methods are full-supervised or semi-supervised, unexpectedly suffering from significant performance drops when dealing with real rain. These data-driven learning methods are representative yet generalize poor for real rain. The opposite holds true for the model-driven unsupervised optimization methods. To overcome these problems, we propose a unified unsupervised learning framework which inherits the generalization and representation merits for real rain removal. Specifically, we first discover a simple yet important domain knowledge that directional rain streak is anisotropic while the natural clean image is isotropic, and formulate the structural discrepancy into the energy function of the optimization model. Consequently, we design an optimization model-driven deep CNN in which the unsupervised loss function of the optimization model is enforced on the proposed network for better generalization. In addition, the architecture of the network mimics the main role of the optimization models with better feature representation. On one hand, we take advantage of the deep network to improve the representation. On the other hand, we utilize the unsupervised loss of the optimization model for better generalization. Overall, the unsupervised learning framework achieves good generalization and representation: unsupervised training (loss) with only a few real rainy images (input) and physical meaning network (architecture). Extensive experiments on synthetic and real-world rain datasets show the superiority of the proposed method.
CLFeb 3
Accelerating Scientific Research with Gemini: Case Studies and Common TechniquesDavid P. Woodruff, Vincent Cohen-Addad, Lalit Jain et al.
Recent advances in large language models (LLMs) have opened new avenues for accelerating scientific research. While models are increasingly capable of assisting with routine tasks, their ability to contribute to novel, expert-level mathematical discovery is less understood. We present a collection of case studies demonstrating how researchers have successfully collaborated with advanced AI models, specifically Google's Gemini-based models (in particular Gemini Deep Think and its advanced variants), to solve open problems, refute conjectures, and generate new proofs across diverse areas in theoretical computer science, as well as other areas such as economics, optimization, and physics. Based on these experiences, we extract common techniques for effective human-AI collaboration in theoretical research, such as iterative refinement, problem decomposition, and cross-disciplinary knowledge transfer. While the majority of our results stem from this interactive, conversational methodology, we also highlight specific instances that push beyond standard chat interfaces. These include deploying the model as a rigorous adversarial reviewer to detect subtle flaws in existing proofs, and embedding it within a "neuro-symbolic" loop that autonomously writes and executes code to verify complex derivations. Together, these examples highlight the potential of AI not just as a tool for automation, but as a versatile, genuine partner in the creative process of scientific discovery.
IRMar 23, 2022
PEAR: Personalized Re-ranking with Contextualized Transformer for RecommendationYi Li, Jieming Zhu, Weiwen Liu et al.
The goal of recommender systems is to provide ordered item lists to users that best match their interests. As a critical task in the recommendation pipeline, re-ranking has received increasing attention in recent years. In contrast to conventional ranking models that score each item individually, re-ranking aims to explicitly model the mutual influences among items to further refine the ordering of items given an initial ranking list. In this paper, we present a personalized re-ranking model (dubbed PEAR) based on contextualized transformer. PEAR makes several major improvements over the existing methods. Specifically, PEAR not only captures feature-level and item-level interactions, but also models item contexts from both the initial ranking list and the historical clicked item list. In addition to item-level ranking score prediction, we also augment the training of PEAR with a list-level classification task to assess users' satisfaction on the whole ranking list. Experimental results on both public and production datasets have shown the superior effectiveness of PEAR compared to the previous re-ranking models.
DSJul 11, 2023
$\ell_p$-Regression in the Arbitrary Partition Model of CommunicationYi Li, Honghao Lin, David P. Woodruff
We consider the randomized communication complexity of the distributed $\ell_p$-regression problem in the coordinator model, for $p\in (0,2]$. In this problem, there is a coordinator and $s$ servers. The $i$-th server receives $A^i\in\{-M, -M+1, \ldots, M\}^{n\times d}$ and $b^i\in\{-M, -M+1, \ldots, M\}^n$ and the coordinator would like to find a $(1+ε)$-approximate solution to $\min_{x\in\mathbb{R}^n} \|(\sum_i A^i)x - (\sum_i b^i)\|_p$. Here $M \leq \mathrm{poly}(nd)$ for convenience. This model, where the data is additively shared across servers, is commonly referred to as the arbitrary partition model. We obtain significantly improved bounds for this problem. For $p = 2$, i.e., least squares regression, we give the first optimal bound of $\tildeΘ(sd^2 + sd/ε)$ bits. For $p \in (1,2)$,we obtain an $\tilde{O}(sd^2/ε+ sd/\mathrm{poly}(ε))$ upper bound. Notably, for $d$ sufficiently large, our leading order term only depends linearly on $1/ε$ rather than quadratically. We also show communication lower bounds of $Ω(sd^2 + sd/ε^2)$ for $p\in (0,1]$ and $Ω(sd^2 + sd/ε)$ for $p\in (1,2]$. Our bounds considerably improve previous bounds due to (Woodruff et al. COLT, 2013) and (Vempala et al., SODA, 2020).
CVSep 19, 2022
Tree-based Text-Vision BERT for Video Search in Baidu Video AdvertisingTan Yu, Jie Liu, Yi Yang et al. · baidu
The advancement of the communication technology and the popularity of the smart phones foster the booming of video ads. Baidu, as one of the leading search engine companies in the world, receives billions of search queries per day. How to pair the video ads with the user search is the core task of Baidu video advertising. Due to the modality gap, the query-to-video retrieval is much more challenging than traditional query-to-document retrieval and image-to-image search. Traditionally, the query-to-video retrieval is tackled by the query-to-title retrieval, which is not reliable when the quality of tiles are not high. With the rapid progress achieved in computer vision and natural language processing in recent years, content-based search methods becomes promising for the query-to-video retrieval. Benefited from pretraining on large-scale datasets, some visionBERT methods based on cross-modal attention have achieved excellent performance in many vision-language tasks not only in academia but also in industry. Nevertheless, the expensive computation cost of cross-modal attention makes it impractical for large-scale search in industrial applications. In this work, we present a tree-based combo-attention network (TCAN) which has been recently launched in Baidu's dynamic video advertising platform. It provides a practical solution to deploy the heavy cross-modal attention for the large-scale query-to-video search. After launching tree-based combo-attention network, click-through rate gets improved by 2.29\% and conversion rate get improved by 2.63\%.
CVAug 5, 2024
Fairness and Bias Mitigation in Computer Vision: A SurveySepehr Dehdashtian, Ruozhen He, Yi Li et al.
Computer vision systems have witnessed rapid progress over the past two decades due to multiple advances in the field. As these systems are increasingly being deployed in high-stakes real-world applications, there is a dire need to ensure that they do not propagate or amplify any discriminatory tendencies in historical or human-curated data or inadvertently learn biases from spurious correlations. This paper presents a comprehensive survey on fairness that summarizes and sheds light on ongoing trends and successes in the context of computer vision. The topics we discuss include 1) The origin and technical definitions of fairness drawn from the wider fair machine learning literature and adjacent disciplines. 2) Work that sought to discover and analyze biases in computer vision systems. 3) A summary of methods proposed to mitigate bias in computer vision systems in recent years. 4) A comprehensive summary of resources and datasets produced by researchers to measure, analyze, and mitigate bias and enhance fairness. 5) Discussion of the field's success, continuing trends in the context of multimodal foundation and generative models, and gaps that still need to be addressed. The presented characterization should help researchers understand the importance of identifying and mitigating bias in computer vision and the state of the field and identify potential directions for future research.
SYMay 16
Multi-Objective-Optimization Assisted Data Collection Framework for IoUT Based on Offline ReinforcementYimian Ding, Xinqi Wang, Jingzehua Xu et al.
The Information Updating Networks (IUNs) offers significant potential for ocean exploration but encounters challenges due to dynamic underwater environments and severe system attenuation. Current methods relying on Autonomous Underwater Vehicles (AUVs) based on online reinforcement learning (RL) lead to high computational costs and low data utilization. To address these issues and the constraints of turbulent ocean environments, we propose a multi-AUV assisted data collection framework for IUNs based on multi-agent offline RL. This framework maximizes data rate and the value of information (VoI), minimizes energy consumption, and ensures collision avoidance by utilizing environmental and equipment status data. We introduce a semi-communication decentralized training with decentralized execution (SC-DTDE) paradigm and a multi-agent independent conservative Q-learning algorithm (MAICQL) to effectively tackle the problem. Extensive simulations demonstrate the high applicability, robustness, and data collection efficiency of the proposed framework.
CVJan 30Code
Leveraging Data to Say No: Memory Augmented Plug-and-Play Selective PredictionAditya Sarkar, Yi Li, Jiacheng Cheng et al.
Selective prediction aims to endow predictors with a reject option, to avoid low confidence predictions. However, existing literature has primarily focused on closed-set tasks, such as visual question answering with predefined options or fixed-category classification. This paper considers selective prediction for visual language foundation models, addressing a taxonomy of tasks ranging from closed to open set and from finite to unbounded vocabularies, as in image captioning. We seek training-free approaches of low-complexity, applicable to any foundation model and consider methods based on external vision-language model embeddings, like CLIP. This is denoted as Plug-and-Play Selective Prediction (PaPSP). We identify two key challenges: (1) instability of the visual-language representations, leading to high variance in image-text embeddings, and (2) poor calibration of similarity scores. To address these issues, we propose a memory augmented PaPSP (MA-PaPSP) model, which augments PaPSP with a retrieval dataset of image-text pairs. This is leveraged to reduce embedding variance by averaging retrieved nearest-neighbor pairs and is complemented by the use of contrastive normalization to improve score calibration. Through extensive experiments on multiple datasets, we show that MA-PaPSP outperforms PaPSP and other selective prediction baselines for selective captioning, image-text matching, and fine-grained classification. Code is publicly available at https://github.com/kingston-aditya/MA-PaPSP.
MLMar 9, 2023
Penalized Deep Partially Linear Cox Models with Application to CT Scans of Lung Cancer PatientsYuming Sun, Jian Kang, Chinmay Haridas et al.
Lung cancer is a leading cause of cancer mortality globally, highlighting the importance of understanding its mortality risks to design effective patient-centered therapies. The National Lung Screening Trial (NLST) employed computed tomography texture analysis, which provides objective measurements of texture patterns on CT scans, to quantify the mortality risks of lung cancer patients. Partially linear Cox models have gained popularity for survival analysis by dissecting the hazard function into parametric and nonparametric components, allowing for the effective incorporation of both well-established risk factors (such as age and clinical variables) and emerging risk factors (e.g., image features) within a unified framework. However, when the dimension of parametric components exceeds the sample size, the task of model fitting becomes formidable, while nonparametric modeling grapples with the curse of dimensionality. We propose a novel Penalized Deep Partially Linear Cox Model (Penalized DPLC), which incorporates the SCAD penalty to select important texture features and employs a deep neural network to estimate the nonparametric component of the model. We prove the convergence and asymptotic properties of the estimator and compare it to other methods through extensive simulation studies, evaluating its performance in risk prediction and feature selection. The proposed method is applied to the NLST study dataset to uncover the effects of key clinical and imaging risk factors on patients' survival. Our findings provide valuable insights into the relationship between these factors and survival outcomes.
AIMay 27, 2022
GALOIS: Boosting Deep Reinforcement Learning via Generalizable Logic SynthesisYushi Cao, Zhiming Li, Tianpei Yang et al.
Despite achieving superior performance in human-level control problems, unlike humans, deep reinforcement learning (DRL) lacks high-order intelligence (e.g., logic deduction and reuse), thus it behaves ineffectively than humans regarding learning and generalization in complex problems. Previous works attempt to directly synthesize a white-box logic program as the DRL policy, manifesting logic-driven behaviors. However, most synthesis methods are built on imperative or declarative programming, and each has a distinct limitation, respectively. The former ignores the cause-effect logic during synthesis, resulting in low generalizability across tasks. The latter is strictly proof-based, thus failing to synthesize programs with complex hierarchical logic. In this paper, we combine the above two paradigms together and propose a novel Generalizable Logic Synthesis (GALOIS) framework to synthesize hierarchical and strict cause-effect logic programs. GALOIS leverages the program sketch and defines a new sketch-based hybrid program language for guiding the synthesis. Based on that, GALOIS proposes a sketch-based program synthesis method to automatically generate white-box programs with generalizable and interpretable cause-effect logic. Extensive evaluations on various decision-making tasks with complex logic demonstrate the superiority of GALOIS over mainstream baselines regarding the asymptotic performance, generalizability, and great knowledge reusability across different environments.
SEApr 23Code
SpecSyn: LLM-based Synthesis and Refinement of Formal Specifications for Real-world Program VerificationLezhi Ma, Shangqing Liu, Yi Li et al.
Program verification is a formal technique to rigorously ensure the correctness and fault-freeness of software systems. However, constructing comprehensive interprocedural specifications for full verification obligations is time-consuming and labor-intensive, giving rise to automated specification generation approaches. Despite the significant advancements in these approaches brought by Large Language Models (LLMs), existing LLM-empowered approaches still suffer from significant limitations: they lack effective strategies for handling sizable input programs, and are typically equipped with no mechanisms to evaluate and guarantee the strength of the generated specifications. The limitations impair their ability to extract precise specifications from real-world complicated programs to support the verification of target properties, thereby hindering the applicability of existing approaches in verification tasks on real-world programs. To remedy this gap, we propose SpecSyn, a novel LLM-based specification generation method. SpecSyn first decomposes the input program into individual segments, which are handled respectively by the subsequent iterative specification generation process. Innovatively, we incorporate into the process a specification refinement mechanism based on semantic-non-equivalent program mutations and variant discrimination, assessing and enhancing the semantic strength of the generated specifications. Extensive experiments show that SpecSyn maintains high precision over 90% and outstanding recall over 75%, significantly outperforming existing LLM-based approaches. In further evaluations, SpecSyn successfully handles 1071 out of 1365 target properties for open-source programs, proving its applicability on real-world program verification tasks.
CVAug 30, 2022
SSORN: Self-Supervised Outlier Removal Network for Robust Homography EstimationYi Li, Wenjie Pei, Zhenyu He
The traditional homography estimation pipeline consists of four main steps: feature detection, feature matching, outlier removal and transformation estimation. Recent deep learning models intend to address the homography estimation problem using a single convolutional network. While these models are trained in an end-to-end fashion to simplify the homography estimation problem, they lack the feature matching step and/or the outlier removal step, which are important steps in the traditional homography estimation pipeline. In this paper, we attempt to build a deep learning model that mimics all four steps in the traditional homography estimation pipeline. In particular, the feature matching step is implemented using the cost volume technique. To remove outliers in the cost volume, we treat this outlier removal problem as a denoising problem and propose a novel self-supervised loss to solve the problem. Extensive experiments on synthetic and real datasets demonstrate that the proposed model outperforms existing deep learning models.
CRFeb 1, 2023
CATFL: Certificateless Authentication-based Trustworthy Federated Learning for 6G Semantic CommunicationsGaolei Li, Yuanyuan Zhao, Yi Li
Federated learning (FL) provides an emerging approach for collaboratively training semantic encoder/decoder models of semantic communication systems, without private user data leaving the devices. Most existing studies on trustworthy FL aim to eliminate data poisoning threats that are produced by malicious clients, but in many cases, eliminating model poisoning attacks brought by fake servers is also an important objective. In this paper, a certificateless authentication-based trustworthy federated learning (CATFL) framework is proposed, which mutually authenticates the identity of clients and server. In CATFL, each client verifies the server's signature information before accepting the delivered global model to ensure that the global model is not delivered by false servers. On the contrary, the server also verifies the server's signature information before accepting the delivered model updates to ensure that they are submitted by authorized clients. Compared to PKI-based methods, the CATFL can avoid too high certificate management overheads. Meanwhile, the anonymity of clients shields data poisoning attacks, while real-name registration may suffer from user-specific privacy leakage risks. Therefore, a pseudonym generation strategy is also presented in CATFL to achieve a trade-off between identity traceability and user anonymity, which is essential to conditionally prevent from user-specific privacy leakage. Theoretical security analysis and evaluation results validate the superiority of CATFL.
SPOct 1, 2022
Cross Task Neural Architecture Search for EEG Signal ClassificationsYiqun Duan, Zhen Wang, Yi Li et al.
Electroencephalograms (EEGs) are brain dynamics measured outside the brain, which have been widely utilized in non-invasive brain-computer interface applications. Recently, various neural network approaches have been proposed to improve the accuracy of EEG signal recognition. However, these approaches severely rely on manually designed network structures for different tasks which generally are not sharing the same empirical design cross-task-wise. In this paper, we propose a cross-task neural architecture search (CTNAS-EEG) framework for EEG signal recognition, which can automatically design the network structure across tasks and improve the recognition accuracy of EEG signals. Specifically, a compatible search space for cross-task searching and an efficient constrained searching method is proposed to overcome challenges brought by EEG signals. By unifying structure search on different EEG tasks, this work is the first to explore and analyze the searched structure difference cross-task-wise. Moreover, by introducing architecture search, this work is the first to analyze model performance by customizing model structure for each human subject. Detailed experimental results suggest that the proposed CTNAS-EEG could reach state-of-the-art performance on different EEG tasks, such as Motor Imagery (MI) and Emotion recognition. Extensive experiments and detailed analysis are provided as a good reference for follow-up researchers.
MLMay 22, 2022
Lassoed Tree BoostingAlejandro Schuler, Yi Li, Mark van der Laan
Gradient boosting performs exceptionally in most prediction problems and scales well to large datasets. In this paper we prove that a ``lassoed'' gradient boosted tree algorithm with early stopping achieves faster than $n^{-1/4}$ L2 convergence in the large nonparametric space of cadlag functions of bounded sectional variation. This rate is remarkable because it does not depend on the dimension, sparsity, or smoothness. We use simulation and real data to confirm our theory and demonstrate empirical performance and scalability on par with standard boosting. Our convergence proofs are based on a novel, general theorem on early stopping with empirical loss minimizers of nested Donsker classes.
CVSep 10, 2023
Prefix-diffusion: A Lightweight Diffusion Model for Diverse Image CaptioningGuisheng Liu, Yi Li, Zhengcong Fei et al.
While impressive performance has been achieved in image captioning, the limited diversity of the generated captions and the large parameter scale remain major barriers to the real-word application of these systems. In this work, we propose a lightweight image captioning network in combination with continuous diffusion, called Prefix-diffusion. To achieve diversity, we design an efficient method that injects prefix image embeddings into the denoising process of the diffusion model. In order to reduce trainable parameters, we employ a pre-trained model to extract image features and further design an extra mapping network. Prefix-diffusion is able to generate diverse captions with relatively less parameters, while maintaining the fluency and relevance of the captions benefiting from the generative capabilities of the diffusion model. Our work paves the way for scaling up diffusion models for image captioning, and achieves promising performance compared with recent approaches.
AIMay 11Code
HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph EvolutionDongming Jiang, Yi Li, Guanpeng Li et al.
Memory retrieval in agentic large language model (LLM) systems is often treated as a static lookup problem, relying on flat vector search or fixed binary relational graphs. However, fixed graph structures cannot capture the varying strength, confidence, and query-dependent relevance of relationships between events. In this paper, we propose HAGE, a weighted multi-relational memory framework that reconceptualizes retrieval as sequential, query-conditioned traversal over a unified relational memory graph. Memory is organized as relation-specific graph views over shared memory nodes, where each edge is associated with a trainable relation feature vector encoding multiple relational signals. Given a query, an LLM-based classifier identifies the relational intent, and a routing network dynamically modulates the corresponding dimensions of the edge embedding. Traversal scores are computed via a learned combination of semantic similarity and these query-conditioned edge representations. This allows memory traversal to prioritize high-utility relational paths while softly suppressing noisy or weakly relevant connections. Beyond adaptive traversal, HAGE further introduces a reinforcement learning-based training framework that jointly optimizes routing behavior and edge representations using downstream tasks. Finally, empirical results demonstrate improved long-horizon reasoning accuracy and a favorable accuracy-efficiency trade-off compared to state-of-the-art agentic memory systems. Our code is available at https://github.com/FredJiang0324/HAGE_MVPReview.
LGApr 12
ReadMOF: Structure-Free Semantic Embeddings from Systematic MOF Nomenclature for Machine LearningKewei Zhu, Cameron Wilson, Bartosz Mazur et al.
Systematic chemical names, such as IUPAC-style nomenclature for metal-organic frameworks (MOFs), contain rich structural and compositional information in a standardized textual format. Here we introduce ReadMOF, which is, to our knowledge, the first nomenclature-free machine learning framework that leverages these names to model structure-property relationships without requiring atomic coordinates or connectivity graphs. By employing pretrained language models, ReadMOF converts systematic MOF names from the Cambridge Structural Database (CSD) into vector embeddings that closely represent traditional structure-based descriptors. These embeddings enable applications in materials informatics, including property prediction, similarity retrieval, and clustering, with performance comparable to geometry-dependent methods. When combined with large language models, ReadMOF also establishes chemically meaningful reasoning ability with textual input only. Our results show that structured chemical language, interpreted through modern natural language processing techniques, can provide a scalable, interpretable, and geometry-independent alternative to conventional molecular representations. This approach opens new opportunities for language-driven discovery in materials science.
CVSep 15, 2022
Exploring Visual Interpretability for Contrastive Language-Image Pre-trainingYi Li, Hualiang Wang, Yiqun Duan et al.
Contrastive Language-Image Pre-training (CLIP) learns rich representations via readily available supervision of natural language. It improves the performance of downstream vision tasks, including but not limited to the zero-shot, long tail, segmentation, retrieval, caption, and video. However, the visual explainability of CLIP is rarely studied, especially for the raw feature map. To provide visual explanations of its predictions, we propose the Image-Text Similarity Map (ITSM). Based on it, we surprisingly find that CLIP prefers the background regions than the foregrounds, and shows erroneous visualization results against human understanding. This phenomenon is universal for both vision transformers and convolutional networks, which suggests this problem is unique and not owing to certain network. Experimentally, we find the devil is in the pooling part, where inappropriate pooling methods lead to a phenomenon called semantic shift. For this problem, we propose the Explainable Contrastive Language-Image Pre-training (ECLIP), which corrects the explainability via the Masked Max Pooling. Specifically, to avoid the semantic shift, we replace the original attention pooling by max pooling to focus on the confident foreground, with guidance from free attention during training. Experiments on three datasets suggest that ECLIP greatly improves the explainability of CLIP, and beyond previous explainability methods at large margins. The code will be released later.
CVSep 27, 2022
FreeSeg: Free Mask from Interpretable Contrastive Language-Image Pretraining for Semantic SegmentationYi Li, Huifeng Yao, Hualiang Wang et al.
Fully supervised semantic segmentation learns from dense masks, which requires heavy annotation cost for closed set. In this paper, we use natural language as supervision without any pixel-level annotation for open world segmentation. We call the proposed framework as FreeSeg, where the mask is freely available from raw feature map of pretraining model. Compared with zero-shot or openset segmentation, FreeSeg doesn't require any annotated masks, and it widely predicts categories beyond class-agnostic unsupervised segmentation. Specifically, FreeSeg obtains free mask from Image-Text Similarity Map (ITSM) of Interpretable Contrastive Language-Image Pretraining (ICLIP). And our core improvements are the smoothed min pooling for dense ICLIP, with the partial label and pixel strategies for segmentation. Furthermore, FreeSeg is very straight forward without complex design like grouping, clustering or retrieval. Besides the simplicity, the performances of FreeSeg surpass previous state-of-the-art at large margins, e.g. 13.4% higher at mIoU on VOC dataset in the same settings.
AINov 1, 2023
On the Opportunities of Green Computing: A SurveyYou Zhou, Xiujing Lin, Xiang Zhang et al.
Artificial Intelligence (AI) has achieved significant advancements in technology and research with the development over several decades, and is widely used in many areas including computing vision, natural language processing, time-series analysis, speech synthesis, etc. During the age of deep learning, especially with the arise of Large Language Models, a large majority of researchers' attention is paid on pursuing new state-of-the-art (SOTA) results, resulting in ever increasing of model size and computational complexity. The needs for high computing power brings higher carbon emission and undermines research fairness by preventing small or medium-sized research institutions and companies with limited funding in participating in research. To tackle the challenges of computing resources and environmental impact of AI, Green Computing has become a hot research topic. In this survey, we give a systematic overview of the technologies used in Green Computing. We propose the framework of Green Computing and devide it into four key components: (1) Measures of Greenness, (2) Energy-Efficient AI, (3) Energy-Efficient Computing Systems and (4) AI Use Cases for Sustainability. For each components, we discuss the research progress made and the commonly used techniques to optimize the AI efficiency. We conclude that this new research direction has the potential to address the conflicts between resource constraints and AI development. We encourage more researchers to put attention on this direction and make AI more environmental friendly.
CVJul 22, 2023
Morphology-inspired Unsupervised Gland Segmentation via Selective Semantic GroupingQixiang Zhang, Yi Li, Cheng Xue et al.
Designing deep learning algorithms for gland segmentation is crucial for automatic cancer diagnosis and prognosis, yet the expensive annotation cost hinders the development and application of this technology. In this paper, we make a first attempt to explore a deep learning method for unsupervised gland segmentation, where no manual annotations are required. Existing unsupervised semantic segmentation methods encounter a huge challenge on gland images: They either over-segment a gland into many fractions or under-segment the gland regions by confusing many of them with the background. To overcome this challenge, our key insight is to introduce an empirical cue about gland morphology as extra knowledge to guide the segmentation process. To this end, we propose a novel Morphology-inspired method via Selective Semantic Grouping. We first leverage the empirical cue to selectively mine out proposals for gland sub-regions with variant appearances. Then, a Morphology-aware Semantic Grouping module is employed to summarize the overall information about the gland by explicitly grouping the semantics of its sub-region proposals. In this way, the final segmentation network could learn comprehensive knowledge about glands and produce well-delineated, complete predictions. We conduct experiments on GlaS dataset and CRAG dataset. Our method exceeds the second-best counterpart over 10.56% at mIOU.
MLOct 11, 2023
Kernel Cox partially linear regression: building predictive models for cancer patients' survivalYaohua Rong, Sihai Dave Zhao, Xia Zheng et al.
Wide heterogeneity exists in cancer patients' survival, ranging from a few months to several decades. To accurately predict clinical outcomes, it is vital to build an accurate predictive model that relates patients' molecular profiles with patients' survival. With complex relationships between survival and high-dimensional molecular predictors, it is challenging to conduct non-parametric modeling and irrelevant predictors removing simultaneously. In this paper, we build a kernel Cox proportional hazards semi-parametric model and propose a novel regularized garrotized kernel machine (RegGKM) method to fit the model. We use the kernel machine method to describe the complex relationship between survival and predictors, while automatically removing irrelevant parametric and non-parametric predictors through a LASSO penalty. An efficient high-dimensional algorithm is developed for the proposed method. Comparison with other competing methods in simulation shows that the proposed method always has better predictive accuracy. We apply this method to analyze a multiple myeloma dataset and predict patients' death burden based on their gene expressions. Our results can help classify patients into groups with different death risks, facilitating treatment for better clinical outcomes.
ETNov 13, 2023
Pruning random resistive memory for optimizing analogue AIYi Li, Songqi Wang, Yaping Zhao et al.
The rapid advancement of artificial intelligence (AI) has been marked by the large language models exhibiting human-like intelligence. However, these models also present unprecedented challenges to energy consumption and environmental sustainability. One promising solution is to revisit analogue computing, a technique that predates digital computing and exploits emerging analogue electronic devices, such as resistive memory, which features in-memory computing, high scalability, and nonvolatility. However, analogue computing still faces the same challenges as before: programming nonidealities and expensive programming due to the underlying devices physics. Here, we report a universal solution, software-hardware co-design using structural plasticity-inspired edge pruning to optimize the topology of a randomly weighted analogue resistive memory neural network. Software-wise, the topology of a randomly weighted neural network is optimized by pruning connections rather than precisely tuning resistive memory weights. Hardware-wise, we reveal the physical origin of the programming stochasticity using transmission electron microscopy, which is leveraged for large-scale and low-cost implementation of an overparameterized random neural network containing high-performance sub-networks. We implemented the co-design on a 40nm 256K resistive memory macro, observing 17.3% and 19.9% accuracy improvements in image and audio classification on FashionMNIST and Spoken digits datasets, as well as 9.8% (2%) improvement in PR (ROC) in image segmentation on DRIVE datasets, respectively. This is accompanied by 82.1%, 51.2%, and 99.8% improvement in energy efficiency thanks to analogue in-memory computing. By embracing the intrinsic stochasticity and in-memory computing, this work may solve the biggest obstacle of analogue computing systems and thus unleash their immense potential for next-generation AI hardware.
AIMay 24
DarkForest: Less Talk, Higher Accuracy for Multi-Agent LLMsYi Li, Songtao Wei, Dongming Jiang et al.
Multi-agent LLM systems improve reasoning by combining outputs from multiple agents, but interaction-heavy methods can introduce error propagation and high communication overhead. When agents exchange raw responses or reasoning traces, incorrect intermediate reasoning may be adopted and amplified, leading to confident but wrong consensus; multi-round communication also increases token consumption, latency, and inference cost. In this paper, we propose a controlled-communication coordination framework named DarkForest. DarkForest first keeps agents independent, so each agent produces an answer without seeing the others' outputs. It then parses the raw responses into structured candidate records, groups semantically equivalent candidates into clusters, and estimates a calibrated belief distribution over these clusters using agent reliability, confidence, parse quality, support-pattern reliability, and independence corrections. A coordinator receives only policy-permitted evidence from this belief state with controlled communication. Experiments on six reasoning benchmarks show that DarkForest achieves leading overall quality, improves the strongest baseline by up to 30.7\% on benchmark metrics, and reduces token consumption by up to $6.5\times$ compared with communication-heavy baselines.
AIApr 16
MAGMA: A Multi-Graph based Agentic Memory Architecture for AI AgentsDongming Jiang, Yi Li, Guanpeng Li et al.
Memory-Augmented Generation (MAG) extends Large Language Models with external memory to support long-context reasoning, but existing approaches largely rely on semantic similarity over monolithic memory stores, entangling temporal, causal, and entity information. This design limits interpretability and alignment between query intent and retrieved evidence, leading to suboptimal reasoning accuracy. In this paper, we propose MAGMA, a multi-graph agentic memory architecture that represents each memory item across orthogonal semantic, temporal, causal, and entity graphs. MAGMA formulates retrieval as policy-guided traversal over these relational views, enabling query-adaptive selection and structured context construction. By decoupling memory representation from retrieval logic, MAGMA provides transparent reasoning paths and fine-grained control over retrieval. Experiments on LoCoMo and LongMemEval demonstrate that MAGMA consistently outperforms state-of-the-art agentic memory systems in long-horizon reasoning tasks.
DBOct 1, 2023
City Foundation Models for Learning General Purpose Representations from OpenStreetMapPasquale Balsebre, Weiming Huang, Gao Cong et al.
Pre-trained Foundation Models (PFMs) have ushered in a paradigm-shift in Artificial Intelligence, due to their ability to learn general-purpose representations that can be readily employed in a wide range of downstream tasks. While PFMs have been successfully adopted in various fields such as Natural Language Processing and Computer Vision, their capacity in handling geospatial data and answering urban questions remains limited. This can be attributed to the intrinsic heterogeneity of geospatial data, which encompasses different data types, including points, segments and regions, as well as multiple information modalities, such as a spatial position, visual characteristics and textual annotations. The proliferation of Volunteered Geographic Information initiatives, and the ever-increasing availability of open geospatial data sources, like OpenStreetMap, which is freely accessible globally, unveil a promising opportunity to bridge this gap. In this paper, we present CityFM, a self-supervised framework to train a foundation model within a selected geographical area of interest, such as a city. CityFM relies solely on open data from OSM, and produces multimodal representations of entities of different types, incorporating spatial, visual, and textual information. We analyse the entity representations generated using our foundation models from a qualitative perspective, and conduct quantitative experiments on road, building, and region-level downstream tasks. We compare its results to algorithms tailored specifically for the respective applications. In all the experiments, CityFM achieves performance superior to, or on par with, the baselines.
LGAug 19, 2024
Community-Centric Graph UnlearningYi Li, Shichao Zhang, Guixian Zhang et al.
Graph unlearning technology has become increasingly important since the advent of the `right to be forgotten' and the growing concerns about the privacy and security of artificial intelligence. Graph unlearning aims to quickly eliminate the effects of specific data on graph neural networks (GNNs). However, most existing deterministic graph unlearning frameworks follow a balanced partition-submodel training-aggregation paradigm, resulting in a lack of structural information between subgraph neighborhoods and redundant unlearning parameter calculations. To address this issue, we propose a novel Graph Structure Mapping Unlearning paradigm (GSMU) and a novel method based on it named Community-centric Graph Eraser (CGE). CGE maps community subgraphs to nodes, thereby enabling the reconstruction of a node-level unlearning operation within a reduced mapped graph. CGE makes the exponential reduction of both the amount of training data and the number of unlearning parameters. Extensive experiments conducted on five real-world datasets and three widely used GNN backbones have verified the high performance and efficiency of our CGE method, highlighting its potential in the field of graph unlearning.
IVMay 24, 2022
UNet#: A UNet-like Redesigning Skip Connections for Medical Image SegmentationLedan Qian, Xiao Zhou, Yi Li et al.
As an essential prerequisite for developing a medical intelligent assistant system, medical image segmentation has received extensive research and concentration from the neural network community. A series of UNet-like networks with encoder-decoder architecture has achieved extraordinary success, in which UNet2+ and UNet3+ redesign skip connections, respectively proposing dense skip connection and full-scale skip connection and dramatically improving compared with UNet in medical image segmentation. However, UNet2+ lacks sufficient information explored from the full scale, which will affect the learning of organs' location and boundary. Although UNet3+ can obtain the full-scale aggregation feature map, owing to the small number of neurons in the structure, it does not satisfy the segmentation of tiny objects when the number of samples is small. This paper proposes a novel network structure combining dense skip connections and full-scale skip connections, named UNet-sharp (UNet\#) for its shape similar to symbol \#. The proposed UNet\# can aggregate feature maps of different scales in the decoder sub-network and capture fine-grained details and coarse-grained semantics from the full scale, which benefits learning the exact location and accurately segmenting the boundary of organs or lesions. We perform deep supervision for model pruning to speed up testing and make it possible for the model to run on mobile devices; furthermore, designing two classification-guided modules to reduce false positives achieves more accurate segmentation results. Various experiments of semantic segmentation and instance segmentation on different modalities (EM, CT, MRI) and dimensions (2D, 3D) datasets, including the nuclei, brain tumor, liver, and lung, demonstrate that the proposed method outperforms state-of-the-art models.
CVMar 4, 2024Code
AllSpark: Reborn Labeled Features from Unlabeled in Transformer for Semi-Supervised Semantic SegmentationHaonan Wang, Qixiang Zhang, Yi Li et al.
Semi-supervised semantic segmentation (SSSS) has been proposed to alleviate the burden of time-consuming pixel-level manual labeling, which leverages limited labeled data along with larger amounts of unlabeled data. Current state-of-the-art methods train the labeled data with ground truths and unlabeled data with pseudo labels. However, the two training flows are separate, which allows labeled data to dominate the training process, resulting in low-quality pseudo labels and, consequently, sub-optimal results. To alleviate this issue, we present AllSpark, which reborns the labeled features from unlabeled ones with the channel-wise cross-attention mechanism. We further introduce a Semantic Memory along with a Channel Semantic Grouping strategy to ensure that unlabeled features adequately represent labeled features. The AllSpark shed new light on the architecture level designs of SSSS rather than framework level, which avoids increasingly complicated training pipeline designs. It can also be regarded as a flexible bottleneck module that can be seamlessly integrated into a general transformer-based segmentation model. The proposed AllSpark outperforms existing methods across all evaluation protocols on Pascal, Cityscapes and COCO benchmarks without bells-and-whistles. Code and model weights are available at: https://github.com/xmed-lab/AllSpark.
RONov 4, 2023
STOW: Discrete-Frame Segmentation and Tracking of Unseen Objects for Warehouse Picking RobotsYi Li, Muru Zhang, Markus Grotz et al.
Segmentation and tracking of unseen object instances in discrete frames pose a significant challenge in dynamic industrial robotic contexts, such as distribution warehouses. Here, robots must handle object rearrangement, including shifting, removal, and partial occlusion by new items, and track these items after substantial temporal gaps. The task is further complicated when robots encounter objects not learned in their training sets, which requires the ability to segment and track previously unseen items. Considering that continuous observation is often inaccessible in such settings, our task involves working with a discrete set of frames separated by indefinite periods during which substantial changes to the scene may occur. This task also translates to domestic robotic applications, such as rearrangement of objects on a table. To address these demanding challenges, we introduce new synthetic and real-world datasets that replicate these industrial and household scenarios. We also propose a novel paradigm for joint segmentation and tracking in discrete frames along with a transformer module that facilitates efficient inter-frame communication. The experiments we conduct show that our approach significantly outperforms recent methods. For additional results and videos, please visit \href{https://sites.google.com/view/stow-corl23}{website}. Code and dataset will be released.
ETJul 26, 2024
Topology Optimization of Random Memristors for Input-Aware Dynamic SNNBo Wang, Shaocong Wang, Ning Lin et al.
There is unprecedented development in machine learning, exemplified by recent large language models and world simulators, which are artificial neural networks running on digital computers. However, they still cannot parallel human brains in terms of energy efficiency and the streamlined adaptability to inputs of different difficulties, due to differences in signal representation, optimization, run-time reconfigurability, and hardware architecture. To address these fundamental challenges, we introduce pruning optimization for input-aware dynamic memristive spiking neural network (PRIME). Signal representation-wise, PRIME employs leaky integrate-and-fire neurons to emulate the brain's inherent spiking mechanism. Drawing inspiration from the brain's structural plasticity, PRIME optimizes the topology of a random memristive spiking neural network without expensive memristor conductance fine-tuning. For runtime reconfigurability, inspired by the brain's dynamic adjustment of computational depth, PRIME employs an input-aware dynamic early stop policy to minimize latency during inference, thereby boosting energy efficiency without compromising performance. Architecture-wise, PRIME leverages memristive in-memory computing, mirroring the brain and mitigating the von Neumann bottleneck. We validated our system using a 40 nm 256 Kb memristor-based in-memory computing macro on neuromorphic image classification and image inpainting. Our results demonstrate the classification accuracy and Inception Score are comparable to the software baseline, while achieving maximal 62.50-fold improvements in energy efficiency, and maximal 77.0% computational load savings. The system also exhibits robustness against stochastic synaptic noise of analogue memristors. Our software-hardware co-designed model paves the way to future brain-inspired neuromorphic computing with brain-like energy efficiency and adaptivity.
CVSep 9, 2024
SX-Stitch: An Efficient VMS-UNet Based Framework for Intraoperative Scoliosis X-Ray Image StitchingYi Li, Heting Gao, Mingde He et al.
In scoliosis surgery, the limited field of view of the C-arm X-ray machine restricts the surgeons' holistic analysis of spinal structures .This paper presents an end-to-end efficient and robust intraoperative X-ray image stitching method for scoliosis surgery,named SX-Stitch. The method is divided into two stages:segmentation and stitching. In the segmentation stage, We propose a medical image segmentation model named Vision Mamba of Spine-UNet (VMS-UNet), which utilizes the state space Mamba to capture long-distance contextual information while maintaining linear computational complexity, and incorporates the SimAM attention mechanism, significantly improving the segmentation performance.In the stitching stage, we simplify the alignment process between images to the minimization of a registration energy function. The total energy function is then optimized to order unordered images, and a hybrid energy function is introduced to optimize the best seam, effectively eliminating parallax artifacts. On the clinical dataset, Sx-Stitch demonstrates superiority over SOTA schemes both qualitatively and quantitatively.
NAMar 22, 2013
Numerical method for real root isolation of semi-algebraic system and its applicationsZhenyi Ji, Wenyuan Wu, Yi Li et al.
In this paper, based on the homotopy continuation method and the interval Newton method, an efficient algorithm is introduced to isolate the real roots of semi-algebraic system. Tests on some random examples and a variety of problems including transcendental functions arising in many applications show that the new algorithm reduces the cost substantially compared with the traditional symbolic approaches.
CLApr 7Code
MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured AbstractsWeiyue Li, Ruizhi Qian, Yi Li et al.
Large language models (LLMs) are widely explored for reasoning-intensive research tasks, yet resources for testing whether they can infer scientific conclusions from structured biomedical evidence remain limited. We introduce $\textbf{MedConclusion}$, a large-scale dataset of $\textbf{5.7M}$ PubMed structured abstracts for biomedical conclusion generation. Each instance pairs the non-conclusion sections of an abstract with the original author-written conclusion, providing naturally occurring supervision for evidence-to-conclusion reasoning. MedConclusion also includes journal-level metadata such as biomedical category and SJR, enabling subgroup analysis across biomedical domains. As an initial study, we evaluate diverse LLMs under conclusion and summary prompting settings and score outputs with both reference-based metrics and LLM-as-a-judge. We find that conclusion writing is behaviorally distinct from summary writing, strong models remain closely clustered under current automatic metrics, and judge identity can substantially shift absolute scores. MedConclusion provides a reusable data resource for studying scientific evidence-to-conclusion reasoning. Our code and data are available at: https://github.com/Harvard-AI-and-Robotics-Lab/MedConclusion.
CVJul 5, 2024
Self-Supervised Representation Learning for Adversarial Attack DetectionYi Li, Plamen Angelov, Neeraj Suri
Supervised learning-based adversarial attack detection methods rely on a large number of labeled data and suffer significant performance degradation when applying the trained model to new domains. In this paper, we propose a self-supervised representation learning framework for the adversarial attack detection task to address this drawback. Firstly, we map the pixels of augmented input images into an embedding space. Then, we employ the prototype-wise contrastive estimation loss to cluster prototypes as latent variables. Additionally, drawing inspiration from the concept of memory banks, we introduce a discrimination bank to distinguish and learn representations for each individual instance that shares the same or a similar prototype, establishing a connection between instances and their associated prototypes. We propose a parallel axial-attention (PAA)-based encoder to facilitate the training process by parallel training over height- and width-axis of attention maps. Experimental results show that, compared to various benchmark self-supervised vision learning models and supervised adversarial attack detection methods, the proposed model achieves state-of-the-art performance on the adversarial attack detection task across a wide range of images.