Jing Su

CL
h-index29
29papers
1,221citations
Novelty43%
AI Score57

29 Papers

CLJun 4, 2025Code
Seed-Coder: Let the Code Model Curate Data for Itself

ByteDance Seed, Yuyu Zhang, Jing Su et al. · bytedance

Code data in large language model (LLM) pretraining is recognized crucial not only for code-related tasks but also for enhancing general intelligence of LLMs. Current open-source LLMs often heavily rely on human effort to produce their code pretraining data, such as employing hand-crafted filtering rules tailored to individual programming languages, or using human-annotated data to train quality filters. However, these approaches are inherently limited in scalability, prone to subjective biases, and costly to extend and maintain across diverse programming languages. To address these challenges, we introduce Seed-Coder, a series of open-source LLMs comprising base, instruct and reasoning models of 8B size, minimizing human involvement in data construction. Our code pretraining data is produced by a model-centric data pipeline, which predominantly leverages LLMs for scoring and filtering code data. The instruct model is further trained via supervised fine-tuning and preference optimization, and the reasoning model leverages Long-Chain-of-Thought (LongCoT) reinforcement learning to improve multi-step code reasoning. Seed-Coder achieves state-of-the-art results among open-source models of similar size and even surpasses some much larger models, demonstrating superior performance in code generation, code completion, code editing, code reasoning, and software engineering tasks.

CLApr 10, 2025
Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning

ByteDance Seed, Jiaze Chen, Tiantian Fan et al. · bytedance

We introduce Seed1.5-Thinking, capable of reasoning through thinking before responding, resulting in improved performance on a wide range of benchmarks. Seed1.5-Thinking achieves 86.7 on AIME 2024, 55.0 on Codeforces and 77.3 on GPQA, demonstrating excellent reasoning abilities in STEM and coding. Beyond reasoning tasks, the method demonstrates notable generalization across diverse domains. For instance, it surpasses DeepSeek R1 by 8% in win rate on non-reasoning tasks, indicating its broader applicability. Compared to other state-of-the-art reasoning models, Seed1.5-Thinking is a Mixture-of-Experts (MoE) model with a relatively small size, featuring 20B activated and 200B total parameters. As part of our effort to assess generalized reasoning, we develop two internal benchmarks, BeyondAIME and Codeforces, both of which will be publicly released to support future research. Model trial link: https://www.volcengine.com/experience/ark.

AINov 30, 2024
FullStack Bench: Evaluating LLMs as Full Stack Coders

Bytedance-Seed-Foundation-Code-Team, Yao Cheng, Jianfeng Chen et al. · bytedance

As the capabilities of code large language models (LLMs) continue to expand, their applications across diverse code intelligence domains are rapidly increasing. However, most existing datasets only evaluate limited application domains. To address this gap, we have developed a comprehensive code evaluation dataset FullStack Bench focusing on full-stack programming, which encompasses a wide range of application domains (e.g., basic programming, data analysis, software engineering, mathematics, and machine learning). Besides, to assess multilingual programming capabilities, in FullStack Bench, we design real-world instructions and corresponding unit test cases from 16 widely-used programming languages to reflect real-world usage scenarios rather than simple translations. Moreover, we also release an effective code sandbox execution tool (i.e., SandboxFusion) supporting various programming languages and packages to evaluate the performance of our FullStack Bench efficiently. Comprehensive experimental results on our FullStack Bench demonstrate the necessity and effectiveness of our FullStack Bench and SandboxFusion.

CLMay 31
UniD$^3$: A Knowledge Graph-Enhanced RAG Framework for Drug-Disease Discovery and Reasoning

Qing Wang, Tianshi Liu, Minghao Zhou et al.

Systematic characterization of drug-disease relationships is essential for drug discovery and repurposing, yet is hindered by the heterogeneity and rapid growth of biomedical literature. Existing datasets rely on labor-intensive curation and are often incomplete, while LLM-only approaches suffer from hallucination and weak evidence grounding. We introduce UniD$^3$, a unified framework that integrates Large Language Models with Knowledge Graph-enhanced Retrieval-Augmented Generation (KG-RAG) to extract, organize, and validate drug-disease knowledge across Drug-Disease Matching (DDM), Drug Effectiveness Assessment (DEA), and Drug-Target Analysis (DTA). UniD$^3$ processes 157,849 PubMed articles with Llama 3.3-70B and constructs knowledge graphs via a dual-stage strategy combining paper-level extraction with KG-level consolidation centered on drug and disease entities. These graphs support KG-RAG-based generation of structured datasets, evaluated through external benchmarks, fuzzy matching with curated resources, and clinician review. UniD$^3$ produces six knowledge graphs and large-scale datasets, including 28,915 DDM, 15,042 DEA, and over 4,000 DTA QA pairs. External validation shows strong performance (F1: 0.85-0.87 for DDM/DEA; 0.82 for DTA), with clinician review confirming high reliability (AUROC = 0.90). KG-RAG-augmented models outperform standalone LLMs, and the UniD$^3$ chatbot enables interpretable, citation-supported exploration of drug-disease relationships. UniD$^3$ provides a scalable, extensible framework for transforming unstructured biomedical literature into high-quality, structured drug-disease knowledge, supporting AI-driven discovery, repurposing, and precision medicine.

CVMar 15, 2022
ActFormer: A GAN-based Transformer towards General Action-Conditioned 3D Human Motion Generation

Liang Xu, Ziyang Song, Dongliang Wang et al.

We present a GAN-based Transformer for general action-conditioned 3D human motion generation, including not only single-person actions but also multi-person interactive actions. Our approach consists of a powerful Action-conditioned motion TransFormer (ActFormer) under a GAN training scheme, equipped with a Gaussian Process latent prior. Such a design combines the strong spatio-temporal representation capacity of Transformer, superiority in generative modeling of GAN, and inherent temporal correlations from the latent prior. Furthermore, ActFormer can be naturally extended to multi-person motions by alternately modeling temporal correlations and human interactions with Transformer encoders. To further facilitate research on multi-person motion generation, we introduce a new synthetic dataset of complex multi-person combat behaviors. Extensive experiments on NTU-13, NTU RGB+D 120, BABEL and the proposed combat dataset show that our method can adapt to various human motion representations and achieve superior performance over the state-of-the-art methods on both single-person and multi-person motion generation tasks, demonstrating a promising step towards a general human motion generator.

LGSep 24, 2024
Merging LoRAs like Playing LEGO: Pushing the Modularity of LoRA to Extremes Through Rank-Wise Clustering

Ziyu Zhao, Tao Shen, Didi Zhu et al.

Low-Rank Adaptation (LoRA) has emerged as a popular technique for fine-tuning large language models (LLMs) to various domains due to its modular design and widespread availability on platforms like Huggingface. This modularity has sparked interest in combining multiple LoRAs to enhance LLM capabilities. However, existing methods for LoRA composition primarily focus on task-specific adaptations that require additional training, and current model merging techniques often fail to fully leverage LoRA's modular nature, leading to parameter interference and performance degradation. In this paper, we investigate the feasibility of disassembling and reassembling multiple LoRAs at a finer granularity, analogous to assembling LEGO blocks. We introduce the concept of Minimal Semantic Units (MSUs), where the parameters corresponding to each rank in LoRA function as independent units. These MSUs demonstrate permutation invariance and concatenation-summation equivalence properties, enabling flexible combinations to create new LoRAs. Building on these insights, we propose the LoRA-LEGO framework. This framework conducts rank-wise parameter clustering by grouping MSUs from different LoRAs into $k$ clusters. The centroid of each cluster serves as a representative MSU, enabling the assembly of a merged LoRA with an adjusted rank of $k$. Additionally, we apply a dual reweighting strategy to optimize the scale of the merged LoRA. Experiments across various benchmarks demonstrate that our method outperforms existing approaches in LoRA merging.

CLJan 10, 2024Code
InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks

Xueyu Hu, Ziyu Zhao, Shuang Wei et al.

In this paper, we introduce InfiAgent-DABench, the first benchmark specifically designed to evaluate LLM-based agents on data analysis tasks. These tasks require agents to end-to-end solving complex tasks by interacting with an execution environment. This benchmark contains DAEval, a dataset consisting of 257 data analysis questions derived from 52 CSV files, and an agent framework which incorporates LLMs to serve as data analysis agents for both serving and evaluation. Since data analysis questions are often open-ended and hard to evaluate without human supervision, we adopt a format-prompting technique to convert each question into a closed-form format so that they can be automatically evaluated. Our extensive benchmarking of 34 LLMs uncovers the current challenges encountered in data analysis tasks. In addition, building on top of our agent framework, we develop a specialized agent, DAAgent, which surpasses GPT-3.5 by 3.9% on DABench. Evaluation datasets and toolkits for InfiAgent-DABench are released at https://github.com/InfiAgent/InfiAgent .

SEApr 3, 2025Code
Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving

Daoguang Zan, Zhirong Huang, Wei Liu et al.

The task of issue resolving is to modify a codebase to generate a patch that addresses a given issue. However, existing benchmarks, such as SWE-bench, focus almost exclusively on Python, making them insufficient for evaluating Large Language Models (LLMs) across diverse software ecosystems. To address this, we introduce a multilingual issue-resolving benchmark, called Multi-SWE-bench, covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It includes a total of 1,632 high-quality instances, which were carefully annotated from 2,456 candidates by 68 expert annotators, ensuring that the benchmark can provide an accurate and reliable evaluation. Based on Multi-SWE-bench, we evaluate a series of state-of-the-art models using three representative methods (Agentless, SWE-agent, and OpenHands) and present a comprehensive analysis with key empirical insights. In addition, we launch a Multi-SWE-RL open-source community, aimed at building large-scale reinforcement learning (RL) training datasets for issue-resolving tasks. As an initial contribution, we release a set of 4,723 well-structured instances spanning seven programming languages, laying a solid foundation for RL research in this domain. More importantly, we open-source our entire data production pipeline, along with detailed tutorials, encouraging the open-source community to continuously contribute and expand the dataset. We envision our Multi-SWE-bench and the ever-growing Multi-SWE-RL community as catalysts for advancing RL toward its full potential, bringing us one step closer to the dawn of AGI.

LGNov 14, 2025
Virtual Width Networks

Seed, Baisheng Li, Banggu Wu et al.

We introduce Virtual Width Networks (VWN), a framework that delivers the benefits of wider representations without incurring the quadratic cost of increasing the hidden size. VWN decouples representational width from backbone width, expanding the embedding space while keeping backbone compute nearly constant. In our large-scale experiment, an 8-times expansion accelerates optimization by over 2 times for next-token and 3 times for next-2-token prediction. The advantage amplifies over training as both the loss gap grows and the convergence-speedup ratio increases, showing that VWN is not only token-efficient but also increasingly effective with scale. Moreover, we identify an approximately log-linear scaling relation between virtual width and loss reduction, offering an initial empirical basis and motivation for exploring virtual-width scaling as a new dimension of large-model efficiency.

CLJan 22
Stable-DiffCoder: Pushing the Frontier of Code Diffusion Large Language Model

Chenghao Fan, Wen Heng, Bo Li et al.

Diffusion-based language models (DLLMs) offer non-sequential, block-wise generation and richer data reuse compared to autoregressive (AR) models, but existing code DLLMs still lag behind strong AR baselines under comparable budgets. We revisit this setting in a controlled study and introduce Stable-DiffCoder, a block diffusion code model that reuses the Seed-Coder architecture, data, and training pipeline. To enable efficient knowledge learning and stable training, we incorporate a block diffusion continual pretraining (CPT) stage enhanced by a tailored warmup and block-wise clipped noise schedule. Under the same data and architecture, Stable-DiffCoder overall outperforms its AR counterpart on a broad suite of code benchmarks. Moreover, relying only on the CPT and supervised fine-tuning stages, Stable-DiffCoder achieves stronger performance than a wide range of \~8B ARs and DLLMs, demonstrating that diffusion-based training can improve code modeling quality beyond AR training alone. Moreover, diffusion-based any-order modeling improves structured code modeling for editing and reasoning, and through data augmentation, benefits low-resource coding languages.

LGMay 18, 2022
Accurate Fairness: Improving Individual Fairness without Trading Accuracy

Xuran Li, Peng Wu, Jing Su

Accuracy and individual fairness are both crucial for trustworthy machine learning, but these two aspects are often incompatible with each other so that enhancing one aspect may sacrifice the other inevitably with side effects of true bias or false fairness. We propose in this paper a new fairness criterion, accurate fairness, to align individual fairness with accuracy. Informally, it requires the treatments of an individual and the individual's similar counterparts to conform to a uniform target, i.e., the ground truth of the individual. We prove that accurate fairness also implies typical group fairness criteria over a union of similar sub-populations. We then present a Siamese fairness in-processing approach to minimize the accuracy and fairness losses of a machine learning model under the accurate fairness constraints. To the best of our knowledge, this is the first time that a Siamese approach is adapted for bias mitigation. We also propose fairness confusion matrix-based metrics, fair-precision, fair-recall, and fair-F1 score, to quantify a trade-off between accuracy and individual fairness. Comparative case studies with popular fairness datasets show that our Siamese fairness approach can achieve on average 1.02%-8.78% higher individual fairness (in terms of fairness through awareness) and 8.38%-13.69% higher accuracy, as well as 10.09%-20.57% higher true fair rate, and 5.43%-10.01% higher fair-F1 score, than the state-of-the-art bias mitigation techniques. This demonstrates that our Siamese fairness approach can indeed improve individual fairness without trading accuracy. Finally, the accurate fairness criterion and Siamese fairness approach are applied to mitigate the possible service discrimination with a real Ctrip dataset, by on average fairly serving 112.33% more customers (specifically, 81.29% more customers in an accurately fair way) than baseline models.

LGFeb 15, 2024
Large Language Models for Forecasting and Anomaly Detection: A Systematic Literature Review

Jing Su, Chufeng Jiang, Xin Jin et al. · cmu

This systematic literature review comprehensively examines the application of Large Language Models (LLMs) in forecasting and anomaly detection, highlighting the current state of research, inherent challenges, and prospective future directions. LLMs have demonstrated significant potential in parsing and analyzing extensive datasets to identify patterns, predict future events, and detect anomalous behavior across various domains. However, this review identifies several critical challenges that impede their broader adoption and effectiveness, including the reliance on vast historical datasets, issues with generalizability across different contexts, the phenomenon of model hallucinations, limitations within the models' knowledge boundaries, and the substantial computational resources required. Through detailed analysis, this review discusses potential solutions and strategies to overcome these obstacles, such as integrating multimodal data, advancements in learning methodologies, and emphasizing model explainability and computational efficiency. Moreover, this review outlines critical trends that are likely to shape the evolution of LLMs in these fields, including the push toward real-time processing, the importance of sustainable modeling practices, and the value of interdisciplinary collaboration. Conclusively, this review underscores the transformative impact LLMs could have on forecasting and anomaly detection while emphasizing the need for continuous innovation, ethical considerations, and practical solutions to realize their full potential.

CLAug 4, 2025
Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference

Yuxuan Song, Zheng Zhang, Cheng Luo et al.

We present Seed Diffusion Preview, a large-scale language model based on discrete-state diffusion, offering remarkably fast inference speed. Thanks to non-sequential, parallel generation, discrete diffusion models provide a notable speedup to mitigate the inherent latency of token-by-token decoding, as demonstrated recently (e.g., Mercury Coder, Gemini Diffusion). Seed Diffusion Preview achieves an inference speed of 2,146 token/s over H20 GPUs while maintaining competitive performance across a sweep of standard code evaluation benchmarks, significantly faster than contemporary Mercury and Gemini Diffusion, establishing new state of the art on the speed-quality Pareto frontier for code models.

AISep 2, 2025
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

Haoming Wang, Haoyang Zou, Huatong Song et al. · pku

The development of autonomous agents for graphical user interfaces (GUIs) presents major challenges in artificial intelligence. While recent advances in native agent models have shown promise by unifying perception, reasoning, action, and memory through end-to-end learning, open problems remain in data scalability, multi-turn reinforcement learning (RL), the limitations of GUI-only operation, and environment stability. In this technical report, we present UI-TARS-2, a native GUI-centered agent model that addresses these challenges through a systematic training methodology: a data flywheel for scalable data generation, a stabilized multi-turn RL framework, a hybrid GUI environment that integrates file systems and terminals, and a unified sandbox platform for large-scale rollouts. Empirical evaluation demonstrates that UI-TARS-2 achieves significant improvements over its predecessor UI-TARS-1.5. On GUI benchmarks, it reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld, outperforming strong baselines such as Claude and OpenAI agents. In game environments, it attains a mean normalized score of 59.8 across a 15-game suite-roughly 60% of human-level performance-and remains competitive with frontier proprietary models (e.g., OpenAI o3) on LMGame-Bench. Additionally, the model can generalize to long-horizon information-seeking tasks and software engineering benchmarks, highlighting its robustness across diverse agent tasks. Detailed analyses of training dynamics further provide insights into achieving stability and efficiency in large-scale agent RL. These results underscore UI-TARS-2's potential to advance the state of GUI agents and exhibit strong generalization to real-world interactive scenarios.

AIFeb 24, 2024
Empowering Large Language Model Agents through Action Learning

Haiteng Zhao, Chang Ma, Guoyin Wang et al.

Large Language Model (LLM) Agents have recently garnered increasing interest yet they are limited in their ability to learn from trial and error, a key element of intelligent behavior. In this work, we argue that the capacity to learn new actions from experience is fundamental to the advancement of learning in LLM agents. While humans naturally expand their action spaces and develop skills through experiential learning, LLM agents typically operate within fixed action spaces, limiting their potential for growth. To address these challenges, our study explores open-action learning for language agents. We introduce a framework LearnAct with an iterative learning strategy to create and improve actions in the form of Python functions. In each iteration, LLM revises and updates the currently available actions based on the errors identified in unsuccessful training tasks, thereby enhancing action effectiveness. Our experimental evaluations across Robotic Planning and Alfworld environments reveal that after learning on a few training task instances, our approach to open-action learning markedly improves agent performance for the type of task (by 32 percent in AlfWorld compared to ReAct+Reflexion, for instance) highlighting the importance of experiential action learning in the development of more intelligent LLM agents.

LGJan 25, 2025
Each Rank Could be an Expert: Single-Ranked Mixture of Experts LoRA for Multi-Task Learning

Ziyu Zhao, Yixiao Zhou, Zhi Zhang et al.

Low-Rank Adaptation (LoRA) is widely used for adapting large language models (LLMs) to specific domains due to its efficiency and modularity. Meanwhile, vanilla LoRA struggles with task conflicts in multi-task scenarios. Recent works adopt Mixture of Experts (MoE) by treating each LoRA module as an expert, thereby mitigating task interference through multiple specialized LoRA modules. While effective, these methods often isolate knowledge within individual tasks, failing to fully exploit the shared knowledge across related tasks. In this paper, we establish a connection between single LoRA and multi-LoRA MoE, integrating them into a unified framework. We demonstrate that the dynamic routing of multiple LoRAs is functionally equivalent to rank partitioning and block-level activation within a single LoRA. We further empirically demonstrate that finer-grained LoRA partitioning, within the same total and activated parameter constraints, leads to better performance gains across heterogeneous tasks. Building on these findings, we propose Single-ranked Mixture of Experts LoRA (\textbf{SMoRA}), which embeds MoE into LoRA by \textit{treating each rank as an independent expert}. With a \textit{dynamic rank-wise activation} mechanism, SMoRA promotes finer-grained knowledge sharing while mitigating task conflicts. Experiments demonstrate that SMoRA activates fewer parameters yet achieves better performance in multi-task scenarios.

CLMar 25, 2024
An Expert is Worth One Token: Synergizing Multiple Expert LLMs as Generalist via Expert Token Routing

Ziwei Chai, Guoyin Wang, Jing Su et al. · cmu

We present Expert-Token-Routing, a unified generalist framework that facilitates seamless integration of multiple expert LLMs. Our framework represents expert LLMs as special expert tokens within the vocabulary of a meta LLM. The meta LLM can route to an expert LLM like generating new tokens. Expert-Token-Routing not only supports learning the implicit expertise of expert LLMs from existing instruction dataset but also allows for dynamic extension of new expert LLMs in a plug-and-play manner. It also conceals the detailed collaboration process from the user's perspective, facilitating interaction as though it were a singular LLM. Our framework outperforms various existing multi-LLM collaboration paradigms across benchmarks that incorporate six diverse expert domains, demonstrating effectiveness and robustness in building generalist LLM system via synergizing multiple expert LLMs.

IVJan 25, 2025
MAP-based Problem-Agnostic diffusion model for Inverse Problems

Pingping Tao, Haixia Liu, Jing Su et al.

Diffusion models have indeed shown great promise in solving inverse problems in image processing. In this paper, we propose a novel, problem-agnostic diffusion model called the maximum a posteriori (MAP)-based guided term estimation method for inverse problems. To leverage unconditionally pretrained diffusion models to address conditional generation tasks, we divide the conditional score function into two terms according to Bayes' rule: an unconditional score function (approximated by a pretrained score network) and a guided term, which is estimated using a novel MAP-based method that incorporates a Gaussian-type prior of natural images. This innovation allows us to better capture the intrinsic properties of the data, leading to improved performance. Numerical results demonstrate that our method preserves contents more effectively compared to state-of-the-art methods--for example, maintaining the structure of glasses in super-resolution tasks and producing more coherent results in the neighborhood of masked regions during inpainting.

CLDec 10, 2024
The Rise and Down of Babel Tower: Investigating the Evolution Process of Multilingual Code Large Language Model

Jiawei Chen, Wentao Chen, Jing Su et al.

Large language models (LLMs) have shown significant multilingual capabilities. However, the mechanisms underlying the development of these capabilities during pre-training are not well understood. In this paper, we use code LLMs as an experimental platform to explore the evolution of multilingual capabilities in LLMs during the pre-training process. Based on our observations, we propose the Babel Tower Hypothesis, which describes the entire process of LLMs acquiring new language capabilities. During the learning process, multiple languages initially share a single knowledge system dominated by the primary language and gradually develop language-specific knowledge systems. We then validate the above hypothesis by tracking the internal states of the LLMs through identifying working languages and language transferring neurons. Experimental results show that the internal state changes of the LLM are consistent with our Babel Tower Hypothesis. Building on these insights, we propose a novel method to construct an optimized pre-training corpus for multilingual code LLMs, which significantly outperforms LLMs trained on the original corpus. The proposed Babel Tower Hypothesis provides new insights into designing pre-training data distributions to achieve optimal multilingual capabilities in LLMs.

CRFeb 8, 2022
Topological Authentication Technique In Topologically Asymmetric Cryptosystem

Bing Yao, Jing Su, Fei Ma et al.

Making topological authentication from theory to practical application is an important and challenging task. More and more researchers pay attention on coming quantum computation, privacy data protection, lattices and cryptography. Research show the advantages of topological authentications through graph operations, various matrices, graph colorings and graph labelings are: related with two or more different mathematical areas, be not pictures, there are huge number of colorings and labelings, rooted on modern mathematics, diversity of asymmetric ciphers, simplicity and convenience, easily created, irreversibility, computational security, provable security, and so on. Topological authentications based on various graph homomorphisms, degree-sequence homomorphisms, graph-set homomorphisms. Randomly topological coding and topological authentications are based on Hanzi authentication, randomly adding-edge-removing operation, randomly leaf-adding algorithms, graph random increasing techniques, operation graphic lattice and dynamic networked models and their spanning trees and maximum leaf spanning trees. Realization of topological authentication is an important topic, we study: number-based strings generated from colored graphs, particular graphs (complete graphs, trees, planar graphs), some methods of generating public-keys. some techniques of topologically asymmetric cryptosystem are: W-type matching labelings, dual-type labelings, reciprocal-type labelings, topological homomorphisms, indexed colorings, graphic lattices, degree-sequence lattices, every-zero Cds-matrix groups of degree-sequences, every-zero graphic groups, graphic lattices having coloring closure property, self-similar networked lattices.

CLDec 3, 2020
BERT-hLSTMs: BERT and Hierarchical LSTMs for Visual Storytelling

Jing Su, Qingyun Dai, Frank Guerin et al.

Visual storytelling is a creative and challenging task, aiming to automatically generate a story-like description for a sequence of images. The descriptions generated by previous visual storytelling approaches lack coherence because they use word-level sequence generation methods and do not adequately consider sentence-level dependencies. To tackle this problem, we propose a novel hierarchical visual storytelling framework which separately models sentence-level and word-level semantics. We use the transformer-based BERT to obtain embeddings for sentences and words. We then employ a hierarchical LSTM network: the bottom LSTM receives as input the sentence vector representation from BERT, to learn the dependencies between the sentences corresponding to images, and the top LSTM is responsible for generating the corresponding word vector representations, taking input from the bottom LSTM. Experimental results demonstrate that our model outperforms most closely related baselines under automatic evaluation metrics BLEU and CIDEr, and also show the effectiveness of our method with human evaluation.

CLDec 2, 2020
Generating Descriptions for Sequential Images with Local-Object Attention and Global Semantic Context Modelling

Jing Su, Chenghua Lin, Mian Zhou et al.

In this paper, we propose an end-to-end CNN-LSTM model for generating descriptions for sequential images with a local-object attention mechanism. To generate coherent descriptions, we capture global semantic context using a multi-layer perceptron, which learns the dependencies between sequential images. A paralleled LSTM network is exploited for decoding the sequence descriptions. Experimental results show that our model outperforms the baseline across three different evaluation metrics on the datasets published by Microsoft.

CVSep 15, 2020
Collaborative Distillation in the Parameter and Spectrum Domains for Video Action Recognition

Haisheng Su, Jing Su, Dongliang Wang et al.

Recent years have witnessed the significant progress of action recognition task with deep networks. However, most of current video networks require large memory and computational resources, which hinders their applications in practice. Existing knowledge distillation methods are limited to the image-level spatial domain, ignoring the temporal and frequency information which provide structural knowledge and are important for video analysis. This paper explores how to train small and efficient networks for action recognition. Specifically, we propose two distillation strategies in the frequency domain, namely the feature spectrum and parameter distribution distillations respectively. Our insight is that appealing performance of action recognition requires \textit{explicitly} modeling the temporal frequency spectrum of video features. Therefore, we introduce a spectrum loss that enforces the student network to mimic the temporal frequency spectrum from the teacher network, instead of \textit{implicitly} distilling features as many previous works. Second, the parameter frequency distribution is further adopted to guide the student network to learn the appearance modeling process from the teacher. Besides, a collaborative learning strategy is presented to optimize the training process from a probabilistic view. Extensive experiments are conducted on several action recognition benchmarks, such as Kinetics, Something-Something, and Jester, which consistently verify effectiveness of our approach, and demonstrate that our method can achieve higher performance than state-of-the-art methods with the same backbone.

LGJan 9, 2020
Performance-Oriented Neural Architecture Search

Andrew Anderson, Jing Su, Rozenn Dahyot et al.

Hardware-Software Co-Design is a highly successful strategy for improving performance of domain-specific computing systems. We argue for the application of the same methodology to deep learning; specifically, we propose to extend neural architecture search with information about the hardware to ensure that the model designs produced are highly efficient in addition to the typical criteria around accuracy. Using the task of keyword spotting in audio on edge computing devices, we demonstrate that our approach results in neural architecture that is not only highly accurate, but also efficiently mapped to the computing platform which will perform the inference. Using our modified neural architecture search, we demonstrate $0.88\%$ increase in TOP-1 accuracy with $1.85\times$ reduction in latency for keyword spotting in audio on an embedded SoC, and $1.59\times$ on a high-end GPU.

AINov 3, 2019
Potential Applications of Machine Learning at Multidisciplinary Medical Team Meetings

Bridget Kane, Jing Su, Saturnino Luz

While machine learning (ML) systems have produced great advances in several domains, their use in support of complex cooperative work remains a research challenge. A particularly challenging setting, and one that may benefit from ML support is the work of multidisciplinary medical teams (MDTs). This paper focuses on the activities performed during the multidisciplinary medical team meeting (MDTM), reviewing their main characteristics in light of a longitudinal analysis of several MDTs in a large teaching hospital over a period of ten years and of our development of ML methods to support MDTMs, and identifying opportunities and possible pitfalls for the use of ML to support MDTMs.

LGJan 15, 2019
Bonseyes AI Pipeline -- bringing AI to you. End-to-end integration of data, algorithms and deployment tools

Miguel de Prado, Jing Su, Rabia Saeed et al.

Next generation of embedded Information and Communication Technology (ICT) systems are collaborative systems able to perform autonomous tasks. The remarkable expansion of the embedded ICT market, together with the rise and breakthroughs of Artificial Intelligence (AI), have put the focus on the Edge as it stands as one of the keys for the next technological revolution: the seamless integration of AI in our daily life. However, training and deployment of custom AI solutions on embedded devices require a fine-grained integration of data, algorithms, and tools to achieve high accuracy. Such integration requires a high level of expertise that becomes a real bottleneck for small and medium enterprises wanting to deploy AI solutions on the Edge which, ultimately, slows down the adoption of AI on daily-life applications. In this work, we present a modular AI pipeline as an integrating framework to bring data, algorithms, and deployment tools together. By removing the integration barriers and lowering the required expertise, we can interconnect the different stages of tools and provide a modular end-to-end development of AI products for embedded devices. Our AI pipeline consists of four modular main steps: i) data ingestion, ii) model training, iii) deployment optimization and, iv) the IoT hub integration. To show the effectiveness of our pipeline, we provide examples of different AI applications during each of the steps. Besides, we integrate our deployment framework, LPDNN, into the AI pipeline and present its lightweight architecture and deployment capabilities for embedded devices. Finally, we demonstrate the results of the AI pipeline by showing the deployment of several AI applications such as keyword spotting, image classification and object detection on a set of well-known embedded platforms, where LPDNN consistently outperforms all other popular deployment frameworks.

CRJul 26, 2018
Topological Graphic Passwords And Their Matchings Towards Cryptography

Bing Yao, Hui Sun, Xiaohui Zhang et al.

Graphical passwords (GPWs) are convenient for mobile equipments with touch screen. Topological graphic passwords (Topsnut-gpws) can be saved in computer by classical matrices and run quickly than the existing GPWs. We research Topsnut-gpws by the matching of view, since they have many advantages. We discuss: configuration matching partition, coloring/labelling matching partition, set matching partition, matching chain, etc. And, we introduce new graph labellings for enriching Topsnut-matchings and show that these labellings can be realized for trees or spanning trees of networks. In theoretical works we explore Graph Labelling Analysis, and show that every graph admits our extremal labellings and set-type labellings in graph theory. Many of the graph labellings mentioned are related with problems of set matching partitions to number theory, and yield new objects and new problems to graph theory.

CRJun 8, 2018
Graph Theory Towards New Graphical Passwords In Information Networks

Bing Yao, Hui Sun, Hongyu Wang et al.

Graphical passwords (GPWs) have been studied over 20 years. We are motivated from QR codes that can be considered GPWs, are successfully applied in today's world. However, one need such GPWs that can be use conveniently and have higher level security in information networks. We aim to GPWs for mobile devices with touch screen. No more researching papers of GPWs by means of techniques of graph theory that are already used in a wide range of scientific areas, especially, dynamic networks. Our investigation is devoted to designing new type of graphical passwords by methods of graph theory, such as various topological structures and colorings (labellings). We develop the idea of "topological structure plus number theory" in maximal planar graphs, and provide new techniques for designing new type of GPWs, and furthermore we do theoretical analysis on these new type of GPWs.

CLAug 5, 2015
Topic Stability over Noisy Sources

Jing Su, Oisín Boydell, Derek Greene et al.

Topic modelling techniques such as LDA have recently been applied to speech transcripts and OCR output. These corpora may contain noisy or erroneous texts which may undermine topic stability. Therefore, it is important to know how well a topic modelling algorithm will perform when applied to noisy data. In this paper we show that different types of textual noise will have diverse effects on the stability of different topic models. From these observations, we propose guidelines for text corpus generation, with a focus on automatic speech transcription. We also suggest topic model selection methods for noisy corpora.