Siqi Wu

CV
h-index9
17papers
104citations
Novelty49%
AI Score51

17 Papers

CLSep 12, 2023
How We Define Harm Impacts Data Annotations: Explaining How Annotators Distinguish Hateful, Offensive, and Toxic Comments

Angela Schöpke-Gonzalez, Siqi Wu, Sagar Kumar et al.

Computational social science research has made advances in machine learning and natural language processing that support content moderators in detecting harmful content. These advances often rely on training datasets annotated by crowdworkers for harmful content. In designing instructions for annotation tasks to generate training data for these algorithms, researchers often treat the harm concepts that we train algorithms to detect - 'hateful', 'offensive', 'toxic', 'racist', 'sexist', etc. - as interchangeable. In this work, we studied whether the way that researchers define 'harm' affects annotation outcomes. Using Venn diagrams, information gain comparisons, and content analyses, we reveal that annotators do not use the concepts 'hateful', 'offensive', and 'toxic' interchangeably. We identify that features of harm definitions and annotators' individual characteristics explain much of how annotators use these terms differently. Our results offer empirical evidence discouraging the common practice of using harm concepts interchangeably in content moderation research. Instead, researchers should make specific choices about which harm concepts to analyze based on their research goals. Recognizing that researchers are often resource constrained, we also encourage researchers to provide information to bound their findings when their concepts of interest differ from concepts that off-the-shelf harmful content detection algorithms identify. Finally, we encourage algorithm providers to ensure their instruments can adapt to contextually-specific content detection goals (e.g., soliciting instrument users' feedback).

CVFeb 14, 2025Code
Conditional Latent Coding with Learnable Synthesized Reference for Deep Image Compression

Siqi Wu, Yinda Chen, Dong Liu et al.

In this paper, we study how to synthesize a dynamic reference from an external dictionary to perform conditional coding of the input image in the latent domain and how to learn the conditional latent synthesis and coding modules in an end-to-end manner. Our approach begins by constructing a universal image feature dictionary using a multi-stage approach involving modified spatial pyramid pooling, dimension reduction, and multi-scale feature clustering. For each input image, we learn to synthesize a conditioning latent by selecting and synthesizing relevant features from the dictionary, which significantly enhances the model's capability in capturing and exploring image source correlation. This conditional latent synthesis involves a correlation-based feature matching and alignment strategy, comprising a Conditional Latent Matching (CLM) module and a Conditional Latent Synthesis (CLS) module. The synthesized latent is then used to guide the encoding process, allowing for more efficient compression by exploiting the correlation between the input image and the reference dictionary. According to our theoretical analysis, the proposed conditional latent coding (CLC) method is robust to perturbations in the external dictionary samples and the selected conditioning latent, with an error bound that scales logarithmically with the dictionary size, ensuring stability even with large and diverse dictionaries. Experimental results on benchmark datasets show that our new method improves the coding performance by a large margin (up to 1.2 dB) with a very small overhead of approximately 0.5\% bits per pixel. Our code is publicly available at https://github.com/ydchen0806/CLC.

CVJul 2, 2024
Conceptual Codebook Learning for Vision-Language Models

Yi Zhang, Ke Yu, Siqi Wu et al.

In this paper, we propose Conceptual Codebook Learning (CoCoLe), a novel fine-tuning method for vision-language models (VLMs) to address the challenge of improving the generalization capability of VLMs while fine-tuning them on downstream tasks in a few-shot setting. We recognize that visual concepts, such as textures, shapes, and colors are naturally transferable across domains and play a crucial role in generalization tasks. Motivated by this interesting finding, we learn a conceptual codebook consisting of visual concepts as keys and conceptual prompts as values, which serves as a link between the image encoder's outputs and the text encoder's inputs. Specifically, for a given image, we leverage the codebook to identify the most relevant conceptual prompts associated with the class embeddings to perform the classification. Additionally, we incorporate a handcrafted concept cache as a regularization to alleviate the overfitting issues in low-shot scenarios. We observe that this conceptual codebook learning method is able to achieve enhanced alignment between visual and linguistic modalities. Extensive experimental results demonstrate that our CoCoLe method remarkably outperforms the existing state-of-the-art methods across various evaluation settings, including base-to-new generalization, cross-dataset evaluation, and domain generalization tasks. Detailed ablation studies further confirm the efficacy of each component in CoCoLe.

74.7CYMay 15
Characterizing AI Fact-Checkers and Their Contributions on Community Notes

Yilin Gong, Siqi Wu

Recent advances in artificial intelligence (AI) have made timely, scalable, and effective fact-checking increasingly feasible. One such deployment is X's Community Notes, which provides the AI Note Writer API to enable end-to-end automated generation of contextual information. We present the first empirical analysis of AI fact-checkers and their contributions on Community Notes, examining four key dimensions: volume, velocity, variety, and veracity. We find that, between September 2, 2025 and May 9, 2026, 20 AI writers account for 14.2% of all submitted notes, with their daily share rising rapidly to 44.8% lately. AI writers are highly responsive, typically submitting notes within minutes of posts becoming available via the API. They also expand coverage, contributing notes to 16.8% of fact-checked posts, of which 74.4% are not checked by humans. Over time, AI writers become more prolific and responsive, with increasing coverage and discovery rates. Despite these advantages, their veracity remains mixed. Collectively, AI writers contribute a higher share of helpful notes while receiving a smaller share of human ratings, relative to their share of submitted notes. Controlling for the fact-checked post and note submission order, both AI and human writers exhibit a first-mover advantage, with earlier notes attracting more ratings. More importantly, AI-generated notes are less likely to be classified as helpful than those written by human experts, though they outperform those written by laypeople. Our findings provide new insights into the practical capabilities and limitations of AI-driven fact-checking, with implications for the design and governance of human--AI collaborative crowdsourced context systems.

44.7CYApr 18
The Effects of Request Alerts on the Diversity and Visibility of Community Notes

Yilin Gong, Siqi Wu

Several major social media platforms have shifted toward crowdsourced fact-checking systems like Community Notes to combat misinformation at scale. However, these systems face criticism regarding which content is scrutinized and how visible that scrutiny is. To address these concerns, X allows users to request community notes for specific posts. When sufficient requests accumulate, X displays an alert, formalizing an interface cue intended to guide contributor behavior. In this study, we examine the effectiveness of request alerts. We infer the presence of request alerts at the time each note was written and identify 318 top writers who were repeatedly exposed to these alerts. Through analyzing their contributed 54,874 English notes written with and without request alerts, we find that at the individual level, writers fact-check more diverse and more political content under alerts. Nonetheless, at the collective level, these shifts direct contributions toward the already dominant Politics and Conflict category, thereby increasing content inequality within the Community Notes ecosystem. Finally, using a mixed-effects model that controls for both writer- and topic-level random effects, we estimate that notes written under alerts are between 8.4 and 20.2 percentage points more likely to be classified as helpful and thus visible to the public, compared to non-alerted notes. This visibility gain diminishes as topics diverge further from writers' prior interests, demonstrating a pivot penalty effect. Taken together, our findings show that request alerts function as an effective interface cue that increases both topical diversity and note visibility in Community Notes.

CVMar 27, 2025Code
Multi-Scale Invertible Neural Network for Wide-Range Variable-Rate Learned Image Compression

Hanyue Tu, Siqi Wu, Li Li et al.

Autoencoder-based structures have dominated recent learned image compression methods. However, the inherent information loss associated with autoencoders limits their rate-distortion performance at high bit rates and restricts their flexibility of rate adaptation. In this paper, we present a variable-rate image compression model based on invertible transform to overcome these limitations. Specifically, we design a lightweight multi-scale invertible neural network, which bijectively maps the input image into multi-scale latent representations. To improve the compression efficiency, a multi-scale spatial-channel context model with extended gain units is devised to estimate the entropy of the latent representation from high to low levels. Experimental results demonstrate that the proposed method achieves state-of-the-art performance compared to existing variable-rate methods, and remains competitive with recent multi-model approaches. Notably, our method is the first learned image compression solution that outperforms VVC across a very wide range of bit rates using a single model, especially at high bit rates. The source code is available at https://github.com/hytu99/MSINN-VRLIC.

29.3SIMar 26
Auditing Algorithmic Personalization in TikTok Comment Sections

Yueru Yan, Siqi Wu

Personalization algorithms are ubiquitous in modern social computing systems, yet their effects on comment sections remain underexplored. In this work, we conducted an algorithmic auditing experiment to examine comment personalization on TikTok. We trained sock-puppet accounts to exhibit left-leaning or right-leaning preferences and successfully validated 17 of them by analyzing the videos recommended on their For You Pages. We then scraped the comment sections shown to these trained partisan accounts, along with five cold-start accounts, across 65 politically neutral videos related to the 2024 U.S. presidential election that contain abundant discussions from both left-leaning and right-leaning perspectives. We find that while the composition of top comments remains largely consistent for all videos, ranking divergence between accounts from different political groups is significantly greater than that observed within the same group for some videos. This effect is strongly correlated with video-level metrics such as comment volume, engagement inequality, and partisan skew in the comment sections. Furthermore, through an exploratory case study, we find preliminary evidence that personalization can result in comment exposure aligned with an account's political leaning. However, this pattern is not universal, suggesting that the extent of politically oriented comment personalization is context-dependent.

CVMay 6, 2025
Preliminary Explorations with GPT-4o(mni) Native Image Generation

Pu Cao, Feng Zhou, Junyi Ji et al.

Recently, the visual generation ability by GPT-4o(mni) has been unlocked by OpenAI. It demonstrates a very remarkable generation capability with excellent multimodal condition understanding and varied task instructions. In this paper, we aim to explore the capabilities of GPT-4o across various tasks. Inspired by previous study, we constructed a task taxonomy along with a carefully curated set of test samples to conduct a comprehensive qualitative test. Benefiting from GPT-4o's powerful multimodal comprehension, its image-generation process demonstrates abilities surpassing those of traditional image-generation tasks. Thus, regarding the dimensions of model capabilities, we evaluate its performance across six task categories: traditional image generation tasks, discriminative tasks, knowledge-based generation, commonsense-based generation, spatially-aware image generation, and temporally-aware image generation. These tasks not only assess the quality and conditional alignment of the model's outputs but also probe deeper into GPT-4o's understanding of real-world concepts. Our results reveal that GPT-4o performs impressively well in general-purpose synthesis tasks, showing strong capabilities in text-to-image generation, visual stylization, and low-level image processing. However, significant limitations remain in its ability to perform precise spatial reasoning, instruction-grounded generation, and consistent temporal prediction. Furthermore, when faced with knowledge-intensive or domain-specific scenarios, such as scientific illustrations or mathematical plots, the model often exhibits hallucinations, factual errors, or structural inconsistencies. These findings suggest that while GPT-4o marks a substantial advancement in unified multimodal generation, there is still a long way to go before it can be reliably applied to professional or safety-critical domains.

CVMay 29, 2025
CryoCCD: Conditional Cycle-consistent Diffusion with Biophysical Modeling for Cryo-EM Synthesis

Runmin Jiang, Genpei Zhang, Yuntian Yang et al. · cmu, harvard

Single-particle cryo-electron microscopy (cryo-EM) has become a cornerstone of structural biology, enabling near-atomic resolution analysis of macromolecules through advanced computational methods. However, the development of cryo-EM processing tools is constrained by the scarcity of high-quality annotated datasets. Synthetic data generation offers a promising alternative, but existing approaches lack thorough biophysical modeling of heterogeneity and fail to reproduce the complex noise observed in real imaging. To address these limitations, we present CryoCCD, a synthesis framework that unifies versatile biophysical modeling with the first conditional cycle-consistent diffusion model tailored for cryo-EM. The biophysical engine provides multi-functional generation capabilities to capture authentic biological organization, and the diffusion model is enhanced with cycle consistency and mask-guided contrastive learning to ensure realistic noise while preserving structural fidelity. Extensive experiments demonstrate that CryoCCD generates structurally faithful micrographs, enhances particle picking and pose estimation, as well as achieves superior performance over state-of-the-art baselines, while also generalizing effectively to held-out protein families.

LGApr 16, 2024
HiGraphDTI: Hierarchical Graph Representation Learning for Drug-Target Interaction Prediction

Bin Liu, Siqi Wu, Jin Wang et al.

The discovery of drug-target interactions (DTIs) plays a crucial role in pharmaceutical development. The deep learning model achieves more accurate results in DTI prediction due to its ability to extract robust and expressive features from drug and target chemical structures. However, existing deep learning methods typically generate drug features via aggregating molecular atom representations, ignoring the chemical properties carried by motifs, i.e., substructures of the molecular graph. The atom-drug double-level molecular representation learning can not fully exploit structure information and fails to interpret the DTI mechanism from the motif perspective. In addition, sequential model-based target feature extraction either fuses limited contextual information or requires expensive computational resources. To tackle the above issues, we propose a hierarchical graph representation learning-based DTI prediction method (HiGraphDTI). Specifically, HiGraphDTI learns hierarchical drug representations from triple-level molecular graphs to thoroughly exploit chemical information embedded in atoms, motifs, and molecules. Then, an attentional feature fusion module incorporates information from different receptive fields to extract expressive target features.Last, the hierarchical attention mechanism identifies crucial molecular segments, which offers complementary views for interpreting interaction mechanisms. The experiment results not only demonstrate the superiority of HiGraphDTI to the state-of-the-art methods, but also confirm the practical ability of our model in interaction interpretation and new DTI discovery.

CVJun 4, 2025
MambaNeXt-YOLO: A Hybrid State Space Model for Real-time Object Detection

Xiaochun Lei, Siqi Wu, Weilin Wu et al.

Real-time object detection is a fundamental but challenging task in computer vision, particularly when computational resources are limited. Although YOLO-series models have set strong benchmarks by balancing speed and accuracy, the increasing need for richer global context modeling has led to the use of Transformer-based architectures. Nevertheless, Transformers have high computational complexity because of their self-attention mechanism, which limits their practicality for real-time and edge deployments. To overcome these challenges, recent developments in linear state space models, such as Mamba, provide a promising alternative by enabling efficient sequence modeling with linear complexity. Building on this insight, we propose MambaNeXt-YOLO, a novel object detection framework that balances accuracy and efficiency through three key contributions: (1) MambaNeXt Block: a hybrid design that integrates CNNs with Mamba to effectively capture both local features and long-range dependencies; (2) Multi-branch Asymmetric Fusion Pyramid Network (MAFPN): an enhanced feature pyramid architecture that improves multi-scale object detection across various object sizes; and (3) Edge-focused Efficiency: our method achieved 66.6% mAP at 31.9 FPS on the PASCAL VOC dataset without any pre-training and supports deployment on edge devices such as the NVIDIA Jetson Xavier NX and Orin NX.

SIFeb 3, 2021
AttentionFlow: Visualising Influence in Networks of Time Series

Minjeong Shin, Alasdair Tran, Siqi Wu et al.

The collective attention on online items such as web pages, search terms, and videos reflects trends that are of social, cultural, and economic interest. Moreover, attention trends of different items exhibit mutual influence via mechanisms such as hyperlinks or recommendations. Many visualisation tools exist for time series, network evolution, or network influence; however, few systems connect all three. In this work, we present AttentionFlow, a new system to visualise networks of time series and the dynamic influence they have on one another. Centred around an ego node, our system simultaneously presents the time series on each node using two visual encodings: a tree ring for an overview and a line chart for details. AttentionFlow supports interactions such as overlaying time series of influence and filtering neighbours by time or flux. We demonstrate AttentionFlow using two real-world datasets, VevoMusic and WikiTraffic. We show that attention spikes in songs can be explained by external events such as major awards, or changes in the network such as the release of a new song. Separate case studies also demonstrate how an artist's influence changes over their career, and that correlated Wikipedia traffic is driven by cultural interests. More broadly, AttentionFlow can be generalised to visualise networks of time series on physical infrastructures such as road networks, or natural phenomena such as weather and geological measurements.

SIMar 21, 2020
Variation across Scales: Measurement Fidelity under Twitter Data Sampling

Siqi Wu, Marian-Andrei Rizoiu, Lexing Xie

A comprehensive understanding of data quality is the cornerstone of measurement studies in social media research. This paper presents in-depth measurements on the effects of Twitter data sampling across different timescales and different subjects (entities, networks, and cascades). By constructing complete tweet streams, we show that Twitter rate limit message is an accurate indicator for the volume of missing tweets. Sampling also differs significantly across timescales. While the hourly sampling rate is influenced by the diurnal rhythm in different time zones, the millisecond level sampling is heavily affected by the implementation choices. For Twitter entities such as users, we find the Bernoulli process with a uniform rate approximates the empirical distributions well. It also allows us to estimate the true ranking with the observed sample data. For networks on Twitter, their structures are altered significantly and some components are more likely to be preserved. For retweet cascades, we observe changes in distributions of tweet inter-arrival time and user influence, which will affect models that rely on these features. This work calls attention to noises and potential biases in social data, and provides a few tools to measure Twitter sampling effects.

SIAug 20, 2019
Estimating Attention Flow in Online Video Networks

Siqi Wu, Marian-Andrei Rizoiu, Lexing Xie

Online videos have shown tremendous increase in Internet traffic. Most video hosting sites implement recommender systems, which connect the videos into a directed network and conceptually act as a source of pathways for users to navigate. At present, little is known about how human attention is allocated over such large-scale networks, and about the impacts of the recommender systems. In this paper, we first construct the Vevo network -- a YouTube video network with 60,740 music videos interconnected by the recommendation links, and we collect their associated viewing dynamics. This results in a total of 310 million views every day over a period of 9 weeks. Next, we present large-scale measurements that connect the structure of the recommendation network and the video attention dynamics. We use the bow-tie structure to characterize the Vevo network and we find that its core component (23.1% of the videos), which occupies most of the attention (82.6% of the views), is made out of videos that are mainly recommended among themselves. This is indicative of the links between video recommendation and the inequality of attention allocation. Finally, we address the task of estimating the attention flow in the video recommendation network. We propose a model that accounts for the network effects for predicting video popularity, and we show it consistently outperforms the baselines. This model also identifies a group of artists gaining attention because of the recommendation network. Altogether, our observations and our models provide a new set of tools to better understand the impacts of recommender systems on collective social attention.

MLFeb 22, 2019
Unique Sharp Local Minimum in $\ell_1$-minimization Complete Dictionary Learning

Yu Wang, Siqi Wu, Bin Yu

We study the problem of globally recovering a dictionary from a set of signals via $\ell_1$-minimization. We assume that the signals are generated as i.i.d. random linear combinations of the $K$ atoms from a complete reference dictionary $D^*\in \mathbb R^{K\times K}$, where the linear combination coefficients are from either a Bernoulli type model or exact sparse model. First, we obtain a necessary and sufficient norm condition for the reference dictionary $D^*$ to be a sharp local minimum of the expected $\ell_1$ objective function. Our result substantially extends that of Wu and Yu (2015) and allows the combination coefficient to be non-negative. Secondly, we obtain an explicit bound on the region within which the objective value of the reference dictionary is minimal. Thirdly, we show that the reference dictionary is the unique sharp local minimum, thus establishing the first known global property of $\ell_1$-minimization dictionary learning. Motivated by the theoretical results, we introduce a perturbation-based test to determine whether a dictionary is a sharp local minimum of the objective function. In addition, we also propose a new dictionary learning algorithm based on Block Coordinate Descent, called DL-BCD, which is guaranteed to have monotonic convergence. Simulation studies show that DL-BCD has competitive performance in terms of recovery rate compared to many state-of-the-art dictionary learning algorithms.

SISep 8, 2017
Beyond Views: Measuring and Predicting Engagement in Online Videos

Siqi Wu, Marian-Andrei Rizoiu, Lexing Xie

The share of videos in the internet traffic has been growing, therefore understanding how videos capture attention on a global scale is also of growing importance. Most current research focus on modeling the number of views, but we argue that video engagement, or time spent watching is a more appropriate measure for resource allocation problems in attention, networking, and promotion activities. In this paper, we present a first large-scale measurement of video-level aggregate engagement from publicly available data streams, on a collection of 5.3 million YouTube videos published over two months in 2016. We study a set of metrics including time and the average percentage of a video watched. We define a new metric, relative engagement, that is calibrated against video properties and strongly correlate with recognized notions of quality. Moreover, we find that engagement measures of a video are stable over time, thus separating the concerns for modeling engagement and those for popularity -- the latter is known to be unstable over time and driven by external promotions. We also find engagement metrics predictable from a cold-start setup, having most of its variance explained by video context, topics and channel information -- R2=0.77. Our observations imply several prospective uses of engagement metrics -- choosing engaging topics for video production, or promoting engaging videos in recommender systems.

MLMay 17, 2015
Local identifiability of $l_1$-minimization dictionary learning: a sufficient and almost necessary condition

Siqi Wu, Bin Yu

We study the theoretical properties of learning a dictionary from $N$ signals $\mathbf x_i\in \mathbb R^K$ for $i=1,...,N$ via $l_1$-minimization. We assume that $\mathbf x_i$'s are $i.i.d.$ random linear combinations of the $K$ columns from a complete (i.e., square and invertible) reference dictionary $\mathbf D_0 \in \mathbb R^{K\times K}$. Here, the random linear coefficients are generated from either the $s$-sparse Gaussian model or the Bernoulli-Gaussian model. First, for the population case, we establish a sufficient and almost necessary condition for the reference dictionary $\mathbf D_0$ to be locally identifiable, i.e., a local minimum of the expected $l_1$-norm objective function. Our condition covers both sparse and dense cases of the random linear coefficients and significantly improves the sufficient condition by Gribonval and Schnass (2010). In addition, we show that for a complete $μ$-coherent reference dictionary, i.e., a dictionary with absolute pairwise column inner-product at most $μ\in[0,1)$, local identifiability holds even when the random linear coefficient vector has up to $O(μ^{-2})$ nonzeros on average. Moreover, our local identifiability results also translate to the finite sample case with high probability provided that the number of signals $N$ scales as $O(K\log K)$.