Hong Lu

CV
h-index 40
23 papers
233 citations
Novelty 53%
AI Score 55

23 Papers

CV Sep 26, 2024 Code
General Compression Framework for Efficient Transformer Object Tracking

Lingyi Hong, Jinglun Li, Xinyu Zhou et al.

Previous works have attempted to improve tracking efficiency through lightweight architecture design or knowledge distillation from teacher models to compact student trackers. However, these solutions often sacrifice considerable accuracy for speed, and they suffer from complex training processes and structural limitations. We therefore propose a general model compression framework for efficient transformer object tracking, named CompressTracker, to reduce model size while preserving tracking accuracy. Our approach features a novel stage division strategy that segments the transformer layers of the teacher model into distinct stages, removing the constraints on model structure. We also design a replacement training technique that randomly substitutes specific stages in the student model with those from the teacher model, as opposed to training the student model in isolation. Replacement training enhances the student model's ability to replicate the teacher model's behavior and simplifies the training process. To further force the student model to emulate the teacher model, we incorporate prediction guidance and stage-wise feature mimicking to provide additional supervision during compression. CompressTracker is structurally agnostic, making it compatible with any transformer architecture. We conduct a series of experiments to verify the effectiveness and generalizability of CompressTracker. CompressTracker-SUTrack, compressed from SUTrack, retains about 99% of the original performance on LaSOT (72.2 AUC) while achieving a 2.42x speedup. Code is available at https://github.com/LingyiHongfd/CompressTracker.
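The stage division and replacement training ideas described in this abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names `split_into_stages`, `build_hybrid`, and `forward`, the `replace_prob` parameter, and the arithmetic "layers" are all hypothetical stand-ins chosen so the example runs without a deep learning framework.

```python
import random

def split_into_stages(layers, num_stages):
    """Split a list of layers into contiguous stages of near-equal size."""
    base, extra = divmod(len(layers), num_stages)
    stages, start = [], 0
    for i in range(num_stages):
        size = base + (1 if i < extra else 0)
        stages.append(layers[start:start + size])
        start += size
    return stages

def build_hybrid(teacher_stages, student_stages, replace_prob, rng):
    """Per stage, pick the teacher stage with probability replace_prob,
    otherwise keep the student stage (assumed sampling rule)."""
    return [
        t_stage if rng.random() < replace_prob else s_stage
        for t_stage, s_stage in zip(teacher_stages, student_stages)
    ]

def forward(stages, x):
    """Run the input through every layer of every stage in order."""
    for stage in stages:
        for layer in stage:
            x = layer(x)
    return x

# Toy layers: each teacher layer adds 1; each student layer adds 3,
# so one student layer stands in for a three-layer teacher stage.
teacher_layers = [lambda x: x + 1 for _ in range(12)]
student_layers = [lambda x: x + 3 for _ in range(4)]

teacher_stages = split_into_stages(teacher_layers, 4)
student_stages = split_into_stages(student_layers, 4)

rng = random.Random(0)
hybrid = build_hybrid(teacher_stages, student_stages, replace_prob=0.5, rng=rng)
output = forward(hybrid, 0)
```

Because every stage has the same input/output interface, teacher and student stages can be mixed freely in one forward pass, which is what lets the student be trained alongside frozen teacher stages rather than in isolation.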

IV Aug 1, 2023
Improved Prognostic Prediction of Pancreatic Cancer Using Multi-Phase CT by Integrating Neural Distance and Texture-Aware Transformer

Hexin Dong, Jiawen Yao, Yuxing Tang et al.

Pancreatic ductal adenocarcinoma (PDAC) is a highly lethal cancer in which tumor-vascular involvement greatly affects resectability and, thus, overall patient survival. However, current prognostic prediction methods fail to explicitly and accurately investigate relationships between the tumor and nearby important vessels. This paper proposes a novel learnable neural distance that describes the precise relationship between the tumor and vessels in CT images of different patients, adopting it as a major feature for prognosis prediction. In addition, unlike existing models that use CNNs or LSTMs to exploit tumor enhancement patterns on dynamic contrast-enhanced CT imaging, we improve the extraction of dynamic tumor-related texture features in multi-phase contrast-enhanced CT by fusing local and global features using CNN and transformer modules, further enhancing the features extracted across multi-phase CT images. We extensively evaluated and compared the proposed method with existing methods on a multi-center (n=4) dataset of 1,070 patients with PDAC, and statistical analysis confirmed its clinical effectiveness in the external test set consisting of three centers. The developed risk marker was the strongest predictor of overall survival among preoperative factors, and it has the potential to be combined with established clinical factors to select patients at higher risk who might benefit from neoadjuvant therapy.

IV Jan 4, 2023
A deep local attention network for pre-operative lymph node metastasis prediction in pancreatic cancer via multiphase CT imaging

Zhilin Zheng, Xu Fang, Jiawen Yao et al.

Lymph node (LN) metastasis status is one of the most critical prognostic and cancer staging factors for patients with resectable pancreatic ductal adenocarcinoma (PDAC) or, more generally, for any type of solid malignant tumor. Preoperative prediction of LN metastasis from non-invasive CT imaging is highly desired, as it could be used directly to guide subsequent neoadjuvant treatment decisions and surgical planning. Most studies only capture the tumor characteristics in CT imaging to implicitly infer LN metastasis, and very few works directly exploit the CT imaging information of the LNs themselves. To the best of our knowledge, this is the first work to propose a fully automated LN segmentation and identification network to directly facilitate the LN metastasis status prediction task. Nevertheless, LN segmentation/detection is very challenging since LNs can be easily confused with other hard negative anatomic structures (e.g., vessels) in radiological images. We explore the anatomical spatial context priors of pancreatic LN locations by generating a guiding attention map from related organs and vessels to assist segmentation and infer LN status. As such, LN segmentation is impelled to focus on regions that are anatomically adjacent or plausible with respect to the specific organs and vessels. The metastasized LN identification network is trained to classify the segmented LN instances into positives or negatives by reusing the segmentation network as a pre-trained backbone and padding a new classification head. More importantly, we develop an LN metastasis status prediction network that combines the patient-wise aggregation results of LN segmentation/identification and deep imaging features extracted from the tumor region. Extensive quantitative nested five-fold cross-validation is conducted on a discovery dataset of 749 patients with PDAC.

RO Mar 11
Novelty Adaptation Through Hybrid Large Language Model (LLM)-Symbolic Planning and LLM-guided Reinforcement Learning

Hong Lu, Pierrick Lorang, Timothy R. Duggan et al.

In dynamic open-world environments, autonomous agents often encounter novelties that hinder their ability to find plans to achieve their goals. Specifically, traditional symbolic planners fail to generate plans when the robot's planning domain lacks the operators that enable it to interact appropriately with novel objects in the environment. We propose a neuro-symbolic architecture that integrates symbolic planning, reinforcement learning, and a large language model (LLM) to learn how to handle novel objects. In particular, we leverage the common sense reasoning capability of the LLM to identify missing operators, generate plans with the symbolic AI planner, and write reward functions to guide the reinforcement learning agent in learning control policies for newly identified operators. Our method outperforms the state-of-the-art methods in operator discovery as well as operator learning in continuous robotic domains.

CV Nov 30, 2023
SimulFlow: Simultaneously Extracting Feature and Identifying Target for Unsupervised Video Object Segmentation

Lingyi Hong, Wei Zhang, Shuyong Gao et al.

Unsupervised video object segmentation (UVOS) aims at detecting the primary objects in a given video sequence without any human intervention. Most existing methods rely on two-stream architectures that separately encode the appearance and motion information before fusing them to identify the target and generate object masks. However, this pipeline is computationally expensive and can lead to suboptimal performance due to the difficulty of fusing the two modalities properly. In this paper, we propose a novel UVOS model called SimulFlow that simultaneously performs feature extraction and target identification, enabling efficient and effective unsupervised video object segmentation. Concretely, we design a novel SimulFlow Attention mechanism to bridge image and motion by utilizing the flexibility of the attention operation, where coarse masks predicted from the fused features at each stage are used to constrain the attention operation within the mask area and exclude the impact of noise. Because of the bidirectional information flow between visual and optical flow features in SimulFlow Attention, no extra hand-designed fusing module is required and we only adopt a light decoder to obtain the final prediction. We evaluate our method on several benchmark datasets and achieve state-of-the-art results. Our proposed approach not only outperforms existing methods but also addresses the computational complexity and fusion difficulties caused by two-stream architectures. Our models achieve 87.4% J & F on DAVIS-16 with the highest speed (63.7 FPS on a 3090) and the fewest parameters (13.7 M). SimulFlow also obtains competitive results on video salient object detection datasets.

CV Sep 16, 2024
GlobalMapNet: An Online Framework for Vectorized Global HD Map Construction

Anqi Shi, Yuze Cai, Xiangyu Chen et al.

High-definition (HD) maps are essential for autonomous driving systems. Traditionally, HD maps are constructed with an expensive and labor-intensive pipeline, which limits scalability. In recent years, crowdsourcing and online mapping have emerged as two alternative methods, but each has its own limitations. In this paper, we provide a novel methodology, namely global map construction, to perform direct generation of vectorized global maps, combining the benefits of crowdsourcing and online mapping. We introduce GlobalMapNet, the first online framework for vectorized global HD map construction, which updates and utilizes a global map on the ego vehicle. To generate the global map from scratch, we propose GlobalMapBuilder to match and merge local maps continuously. We design a new algorithm, Map NMS, to remove duplicate map elements and produce a clean map. We also propose GlobalMapFusion to aggregate historical map information, improving consistency of prediction. We examine GlobalMapNet on two widely recognized datasets, Argoverse2 and nuScenes, showing that our framework is capable of generating globally consistent results.

CL Jul 17, 2025 Code
QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation

Jiazheng Li, Hongzhou Lin, Hong Lu et al.

Reinforcement learning (RL) has emerged as a central paradigm for training large language models (LLMs) in reasoning tasks. Yet recent studies question RL's ability to incentivize reasoning capacity beyond the base model. This raises a key challenge: how can RL be adapted to solve harder reasoning problems more effectively? To address this challenge, we propose a simple yet effective strategy via Question Augmentation: introduce partial solutions during training to reduce problem difficulty and provide more informative learning signals. Our method, QuestA, when applied during RL training on math reasoning tasks, improves not only pass@1 but also pass@k, particularly on problems where standard RL struggles to make progress. This enables continual improvement over strong open-source models such as DeepScaleR and OpenMath Nemotron, further enhancing their reasoning capabilities. We achieve new state-of-the-art results on math benchmarks using 1.5B-parameter models: 72.50% (+10.73%) on AIME24, 62.29% (+12.79%) on AIME25, and 41.67% (+10.11%) on HMMT25. Code, data and model are available at https://github.com/foreverlasting1202/QuestA.
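A minimal sketch of question augmentation as the abstract describes it: a prefix of a known solution is attached to the question as a hint. The function name `augment_question`, the `hint_fraction` parameter, and the prompt wording are assumptions made for illustration; the released QuestA code may construct prompts differently.

```python
def augment_question(question, solution_steps, hint_fraction):
    """Attach the first hint_fraction of the solution steps to the question.

    hint_fraction = 0.0 returns the original question unchanged;
    larger values reveal more of the solution and make the problem easier.
    """
    if not 0.0 <= hint_fraction <= 1.0:
        raise ValueError("hint_fraction must be in [0, 1]")
    n_hint = int(len(solution_steps) * hint_fraction)
    if n_hint == 0:
        return question
    hint = "\n".join(solution_steps[:n_hint])
    # Hypothetical prompt layout; the real template is not given in the abstract.
    return f"{question}\n\nPartial solution:\n{hint}"

question = "What is 12 * 13?"
steps = [
    "12 * 13 = 12 * 10 + 12 * 3",
    "12 * 10 = 120",
    "12 * 3 = 36",
    "120 + 36 = 156",
]
prompt = augment_question(question, steps, hint_fraction=0.5)
```

With `hint_fraction=0.5` the prompt carries the first two of the four steps, so the model only has to complete the remaining half of the derivation.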

AI Jul 25, 2024
Combining Cognitive and Generative AI for Self-explanation in Interactive AI Agents

Shalini Sushri, Rahul Dass, Rhea Basappa et al.

The Virtual Experimental Research Assistant (VERA) is an inquiry-based learning environment that empowers a learner to build conceptual models of complex ecological systems and experiment with agent-based simulations of the models. This study investigates the convergence of cognitive AI and generative AI for self-explanation in interactive AI agents such as VERA. From a cognitive AI viewpoint, we endow VERA with a functional model of its own design, knowledge, and reasoning represented in the Task-Method-Knowledge (TMK) language. From the perspective of generative AI, we use ChatGPT, LangChain, and Chain-of-Thought to answer user questions based on the VERA TMK model. Thus, we combine cognitive and generative AI to generate explanations about how VERA works and produces its answers. The preliminary evaluation of the generation of explanations in VERA on a bank of 66 questions derived from earlier work appears promising.

CV Dec 3, 2021 Code
TRNR: Task-Driven Image Rain and Noise Removal with a Few Images Based on Patch Analysis

Wu Ran, Bohong Yang, Peirong Ma et al.

The recent success of learning-based image rain and noise removal can be attributed primarily to well-designed neural network architectures and large labeled datasets. However, we discover that current image rain and noise removal methods result in low utilization of images. To alleviate the reliance of deep models on large labeled datasets, we propose task-driven image rain and noise removal (TRNR) based on a patch analysis strategy. The patch analysis strategy samples image patches with various spatial and statistical properties for training and can increase image utilization. Furthermore, the patch analysis strategy encourages us to introduce the N-frequency-K-shot learning task for the task-driven approach TRNR. TRNR allows neural networks to learn from numerous N-frequency-K-shot learning tasks, rather than from a large amount of data. To verify the effectiveness of TRNR, we build a Multi-Scale Residual Network (MSResNet) for both image rain removal and Gaussian noise removal. Specifically, we train MSResNet for image rain removal and noise removal with a few images (for example, 20.0% of the Rain100H train set). Experimental results demonstrate that TRNR enables MSResNet to learn more effectively when data is scarce. TRNR has also been shown in experiments to improve the performance of existing methods. Furthermore, MSResNet trained with a few images using TRNR outperforms most recent deep learning methods trained in a data-driven manner on large labeled datasets. These experimental results confirm the effectiveness and superiority of the proposed TRNR. The source code is available at https://github.com/Schizophreni/MSResNet-TRNR.

CL Jul 26, 2023
Utilizing Large Language Models for Natural Interface to Pharmacology Databases

Hong Lu, Chuan Li, Yinheng Li et al.

The drug development process necessitates that pharmacologists undertake various tasks, such as reviewing literature, formulating hypotheses, designing experiments, and interpreting results. Each stage requires accessing and querying vast amounts of information. In this abstract, we introduce a Large Language Model (LLM)-based Natural Language Interface designed to interact with structured information stored in databases. Our experiments demonstrate the feasibility and effectiveness of the proposed framework. This framework can generalize to query a wide range of pharmaceutical data and knowledge bases.

AI Oct 22, 2024
ICPL: Few-shot In-context Preference Learning via LLMs

Chao Yu, Qixin Tan, Hong Lu et al.

Preference-based reinforcement learning is an effective way to handle tasks where rewards are hard to specify but can be exceedingly inefficient as preference learning is often tabula rasa. We demonstrate that Large Language Models (LLMs) have native preference-learning capabilities that allow them to achieve sample-efficient preference learning, addressing this challenge. We propose In-Context Preference Learning (ICPL), which uses in-context learning capabilities of LLMs to reduce human query inefficiency. ICPL uses the task description and basic environment code to create sets of reward functions which are iteratively refined by placing human feedback over videos of the resultant policies into the context of an LLM and then requesting better rewards. We first demonstrate ICPL's effectiveness through a synthetic preference study, providing quantitative evidence that it significantly outperforms baseline preference-based methods with much higher performance and orders of magnitude greater efficiency. We observe that these improvements are not solely coming from LLM grounding in the task but that the quality of the rewards improves over time, indicating preference learning capabilities. Additionally, we perform a series of real human preference-learning trials and observe that ICPL extends beyond synthetic settings and can work effectively with humans-in-the-loop.

CV Apr 18, 2024
Harnessing Joint Rain-/Detail-aware Representations to Eliminate Intricate Rains

Wu Ran, Peirong Ma, Zhiquan He et al.

Recent advances in image deraining have focused on training powerful models on mixed multiple datasets comprising diverse rain types and backgrounds. However, this approach tends to overlook the inherent differences among rainy images, leading to suboptimal results. To overcome this limitation, we focus on addressing various rainy images by delving into meaningful representations that encapsulate both the rain and background components. Leveraging these representations as instructive guidance, we put forth a Context-based Instance-level Modulation (CoI-M) mechanism adept at efficiently modulating CNN- or Transformer-based models. Furthermore, we devise a rain-/detail-aware contrastive learning strategy to help extract joint rain-/detail-aware representations. By integrating CoI-M with the rain-/detail-aware contrastive learning, we develop CoIC, an innovative and potent algorithm tailored for training models on mixed datasets. Moreover, CoIC offers insight into modeling relationships of datasets, quantitatively assessing the impact of rain and details on restoration, and unveiling distinct behaviors of models given diverse inputs. Extensive experiments validate the efficacy of CoIC in boosting the deraining ability of CNN and Transformer models. CoIC also markedly enhances deraining performance when a real-world dataset is included.

RO Feb 6, 2025
Probing a Vision-Language-Action Model for Symbolic States and Integration into a Cognitive Architecture

Hong Lu, Hengxu Li, Prithviraj Singh Shahani et al.

Vision-language-action (VLA) models hold promise as generalist robotics solutions by translating visual and linguistic inputs into robot actions, yet they lack reliability due to their black-box nature and sensitivity to environmental changes. In contrast, cognitive architectures (CA) excel in symbolic reasoning and state monitoring but are constrained by rigid predefined execution. This work bridges these approaches by probing OpenVLA's hidden layers to uncover symbolic representations of object properties, relations, and action states, enabling integration with a CA for enhanced interpretability and robustness. Through experiments on LIBERO-spatial pick-and-place tasks, we analyze the encoding of symbolic states across different layers of OpenVLA's Llama backbone. Our probing results show consistently high accuracies (> 0.90) for both object and action states across most layers, though contrary to our hypotheses, we did not observe the expected pattern of object states being encoded earlier than action states. We demonstrate an integrated DIARC-OpenVLA system that leverages these symbolic representations for real-time state monitoring, laying the foundation for more interpretable and reliable robotic manipulation.

CL Jan 19, 2025
Self-Explanation in Social AI Agents

Rhea Basappa, Mustafa Tekman, Hong Lu et al.

Social AI agents interact with members of a community, thereby changing the behavior of the community. For example, in online learning, an AI social assistant may connect learners and thereby enhance social interaction. These social AI assistants too need to explain themselves in order to enhance transparency and trust with the learners. We present a method of self-explanation that uses introspection over a self-model of an AI social assistant. The self-model is captured as a functional model that specifies how the methods of the agent use knowledge to achieve its tasks. The process of generating self-explanations uses Chain of Thought to reflect on the self-model and ChatGPT to provide explanations about its functioning. We evaluate the self-explanation of the AI social assistant for completeness and correctness. We also report on its deployment in a live class.

RO Mar 6, 2025
Curiosity-Driven Imagination: Discovering Plan Operators and Learning Associated Policies for Open-World Adaptation

Pierrick Lorang, Hong Lu, Matthias Scheutz

Adapting quickly to dynamic, uncertain environments, often called "open worlds", remains a major challenge in robotics. Traditional Task and Motion Planning (TAMP) approaches struggle to cope with unforeseen changes, are data-inefficient when adapting, and do not leverage world models during learning. We address this issue with a hybrid planning and learning system that integrates two models: a low-level neural-network-based model that learns stochastic transitions and drives exploration via an Intrinsic Curiosity Module (ICM), and a high-level symbolic planning model that captures abstract transitions using operators, enabling the agent to plan in an "imaginary" space and generate reward machines. Our evaluation in a robotic manipulation domain with sequential novelty injections demonstrates that our approach converges faster and outperforms state-of-the-art hybrid methods.

CV Mar 3, 2025
ClipGrader: Leveraging Vision-Language Models for Robust Label Quality Assessment in Object Detection

Hong Lu, Yali Bian, Rahul C. Shah

High-quality annotations are essential for object detection models, but ensuring label accuracy, especially for bounding boxes, remains both challenging and costly. This paper introduces ClipGrader, a novel approach that leverages vision-language models to automatically assess the accuracy of bounding box annotations. By adapting CLIP (Contrastive Language-Image Pre-training) to evaluate both the class label correctness and the spatial precision of bounding boxes, ClipGrader offers an effective solution for grading object detection labels. Tested on modified object detection datasets with artificially disturbed bounding boxes, ClipGrader achieves 91% accuracy on COCO with a 1.8% false positive rate. Moreover, it maintains 87% accuracy with a 2.1% false positive rate when trained on just 10% of the COCO data. ClipGrader also scales effectively to larger datasets such as LVIS, achieving 79% accuracy across 1,203 classes. Our experiments demonstrate ClipGrader's ability to identify errors in existing COCO annotations, highlighting its potential for dataset refinement. When integrated into a semi-supervised object detection (SSOD) model, ClipGrader readily improves the pseudo label quality, helping achieve higher mAP (mean Average Precision) throughout the training process. ClipGrader thus provides a scalable AI-assisted tool for enhancing annotation quality control and verifying annotations in large-scale object detection datasets.

LG Nov 18, 2025
Extending Test-Time Scaling: A 3D Perspective with Context, Batch, and Turn

Chao Yu, Qixin Tan, Jiaxuan Gao et al.

Reasoning reinforcement learning (RL) has recently revealed a new scaling effect: test-time scaling. Thinking models such as R1 and o1 improve their reasoning accuracy at test time as the length of the reasoning context increases. However, compared with training-time scaling, test-time scaling is fundamentally constrained by the limited context length of base models, which remains orders of magnitude smaller than the number of tokens consumed during training. We revisit test-time enhancement techniques through the lens of scaling effects and introduce a unified framework of multi-dimensional test-time scaling to extend the capacity of test-time reasoning. Beyond conventional context-length scaling, we consider two additional dimensions: batch scaling, where accuracy improves with parallel sampling, and turn scaling, where iterative self-refinement enhances reasoning quality. Building on this perspective, we propose 3D test-time scaling, which integrates context, batch, and turn scaling. We show that: (1) each dimension demonstrates a test-time scaling effect, but with a bounded capacity; (2) combining all three dimensions substantially improves reasoning performance on challenging testbeds, including IOI, IMO, and CPHO, and further benefits from human preference feedback; and (3) the human-in-the-loop framework naturally extends to a more open-ended domain, i.e., embodied learning, which enables the design of humanoid control behaviors.
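The three dimensions named in this abstract can be pictured with a toy loop: a context budget passed to each generation call, a batch of parallel samples aggregated per turn, and several refinement turns. This is only an illustrative sketch. The names `scale_3d`, `majority_vote`, and `toy_generate` are hypothetical, majority voting is an assumed aggregation rule, and the stub generator stands in for a real model call.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer in a batch of samples."""
    return Counter(answers).most_common(1)[0][0]

def scale_3d(generate, prompt, context_budget, batch_size, num_turns):
    """Combine context, batch, and turn scaling in one loop.

    generate(prompt, context_budget, seed, previous_answer) -> answer string.
    Each turn samples batch_size answers, aggregates them by majority vote,
    and feeds the aggregate back as input to the next turn.
    """
    previous_answer = None
    for _ in range(num_turns):
        candidates = [
            generate(prompt, context_budget, seed, previous_answer)
            for seed in range(batch_size)
        ]
        previous_answer = majority_vote(candidates)
    return previous_answer

def toy_generate(prompt, context_budget, seed, previous_answer):
    # Hypothetical stand-in for an LLM call: every third sample is wrong.
    return "41" if seed % 3 == 0 else "42"

final_answer = scale_3d(
    toy_generate, "What is 6 * 7?", context_budget=1024, batch_size=5, num_turns=2
)
```

In this toy setting the batch dimension alone already recovers the right answer; the turn and context arguments are threaded through only to show where each dimension would enter a real system.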

CL Oct 29, 2025
Parrot: A Training Pipeline Enhances Both Program CoT and Natural Language CoT for Reasoning

Senjie Jin, Lu Chen, Zhiheng Xi et al.

Natural language chain-of-thought (N-CoT) and Program chain-of-thought (P-CoT) have emerged as two primary paradigms for large language models (LLMs) to solve mathematical reasoning problems. Current research typically endeavors to achieve unidirectional enhancement: P-CoT enhanced N-CoT or N-CoT enhanced P-CoT. In this paper, we seek to fully unleash the two paradigms' strengths for mutual enhancement and ultimately achieve simultaneous improvements. We conduct a detailed analysis of the error types across two paradigms, based on which we propose Parrot, a novel training pipeline for mathematical problems: 1) Three target-designed subtasks integrate sequential P-CoT and N-CoT generation. 2) A subtask hybrid training strategy facilitates natural language semantic transferability. 3) A converted N-CoT auxiliary reward is designed to alleviate the sparse rewards in P-CoT optimization. Extensive experiments demonstrate that Parrot significantly enhances the performance of both N-CoT and P-CoT, especially N-CoT. Using Parrot SFT, the N-CoT performance of LLaMA2 and CodeLLaMA improves by +21.87 and +21.48 on MathQA over the RL baseline, which is resource-intensive.

LG Sep 1, 2025
Multitask Battery Management with Flexible Pretraining

Hong Lu, Jiali Chen, Jingzhao Zhang et al.

Industrial-scale battery management involves various types of tasks, such as estimation, prediction, and system-level diagnostics. Each task employs distinct data across temporal scales, sensor resolutions, and data channels. Building task-specific methods requires a great deal of data and engineering effort, which limits the scalability of intelligent battery management. Here we present the Flexible Masked Autoencoder (FMAE), a flexible pretraining framework that can learn with missing battery data channels and capture inter-correlations across data snippets. FMAE learns unified battery representations from heterogeneous data and can be adopted by different tasks with minimal data and engineering efforts. Experimentally, FMAE consistently outperforms all task-specific methods across five battery management tasks with eleven battery datasets. On remaining life prediction tasks, FMAE uses 50 times less inference data while maintaining state-of-the-art results. Moreover, when real-world data lack certain information, such as system voltage, FMAE can still be applied with marginal performance impact, achieving comparable results with the best hand-crafted features. FMAE demonstrates a practical route to a flexible, data-efficient model that simplifies real-world multi-task management of dynamical systems.

HC Jan 13, 2022
Interactive Data Analysis with Next-step Natural Language Query Recommendation

Xingbo Wang, Furui Cheng, Yong Wang et al.

Natural language interfaces (NLIs) provide users with a convenient way to interactively analyze data through natural language queries. Nevertheless, interactive data analysis is a demanding process, especially for novice data analysts. When exploring large and complex SQL databases from different domains, data analysts do not necessarily have sufficient knowledge about different data tables and application domains. This makes them unable to systematically elicit a series of topically related and meaningful queries for insight discovery in target domains. We develop an NLI with a step-wise query recommendation module to assist users in choosing appropriate next-step exploration actions. The system adopts a data-driven approach to suggest semantically relevant and context-aware queries for application domains of users' interest based on their query logs. The system also helps users organize query histories and results into a dashboard to communicate the discovered data insights. With a comparative user study, we show that our system can facilitate a more effective and systematic data analysis process than a baseline without the recommendation module.

CV Nov 24, 2021
ACNet: Approaching-and-Centralizing Network for Zero-Shot Sketch-Based Image Retrieval

Hao Ren, Ziqiang Zheng, Yang Wu et al.

The huge domain gap between sketches and photos and the highly abstract sketch representations pose challenges for sketch-based image retrieval (SBIR). Zero-shot sketch-based image retrieval (ZS-SBIR) is more generic and practical but poses an even greater challenge because of the additional knowledge gap between the seen and unseen categories. To simultaneously mitigate both gaps, we propose an Approaching-and-Centralizing Network (termed "ACNet") to jointly optimize sketch-to-photo synthesis and image retrieval. The retrieval module guides the synthesis module to generate large amounts of diverse photo-like images which gradually approach the photo domain, and thus better serve the retrieval module in learning domain-agnostic representations and category-agnostic common knowledge for generalizing to unseen categories. These diverse images generated with retrieval guidance can effectively alleviate the overfitting problem troubling concrete category-specific training samples with high gradients. We also discover that the proxy-based NormSoftmax loss is effective in the zero-shot setting because its centralizing effect can stabilize our joint training and promote generalization to unseen categories. Our approach is simple yet effective, achieving state-of-the-art performance on two widely used ZS-SBIR datasets and surpassing previous methods by a large margin.

CV Jan 29, 2019
Evaluating Generalization Ability of Convolutional Neural Networks and Capsule Networks for Image Classification via Top-2 Classification

Hao Ren, Jianlin Su, Hong Lu

Image classification is a challenging problem that aims to identify the category of the object in an image. In recent years, deep Convolutional Neural Networks (CNNs) have been applied to this task with impressive improvements. However, some research has shown that the output of CNNs can be easily altered by adding relatively small perturbations to the input image, such as modifying a few pixels. Capsule Networks (CapsNets) were recently proposed and can help eliminate this limitation. Experiments on the MNIST dataset revealed that capsules characterize the features of objects better than CNNs. However, it is hard to find a suitable quantitative method to compare the generalization ability of CNNs and CapsNets. In this paper, we propose a new image classification task called Top-2 classification to evaluate the generalization ability of CNNs and CapsNets. The models are trained on single-label image samples, as in the traditional image classification task. In the test stage, however, we randomly concatenate two test image samples with different labels, and then use the trained models to predict the top-2 labels on the newly created, unseen two-label image samples. This task provides precise quantitative results for comparing the generalization ability of CNNs and CapsNets. Returning to the CapsNet, because it uses a Full Connectivity (FC) mechanism among all capsules, it requires many parameters. To reduce the number of parameters, we introduce a Parameter-Sharing (PS) mechanism between capsules. Experiments on five widely used benchmark image datasets demonstrate that the method significantly reduces the number of parameters without losing effectiveness in feature extraction. Further, on the Top-2 classification task, the proposed PS CapsNets obtain substantially higher accuracy than traditional CNNs and FC CapsNets.
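The Top-2 test protocol described in this abstract can be sketched in a few lines. The abstract does not say how the two images are concatenated or how a prediction is scored, so the side-by-side layout and the exact-set match below are assumptions, and the helper names are hypothetical.

```python
def concat_side_by_side(img_a, img_b):
    """Concatenate two images (lists of pixel rows) horizontally.
    Horizontal layout is an assumption; the abstract only says 'concatenate'."""
    if len(img_a) != len(img_b):
        raise ValueError("images must have the same height")
    return [row_a + row_b for row_a, row_b in zip(img_a, img_b)]

def top2_labels(scores):
    """Return the set of the two class indices with the highest scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:2])

def top2_correct(scores, label_a, label_b):
    """A prediction counts as correct when its top-2 set equals the two true labels."""
    return top2_labels(scores) == {label_a, label_b}

# Two tiny 2-row "images" with different labels (1 and 2).
image_a = [[1, 2], [3, 4]]
image_b = [[5], [6]]
combined = concat_side_by_side(image_a, image_b)

# Hypothetical class scores a trained model might output for the combined image.
scores = [0.1, 0.5, 0.3, 0.1]
is_correct = top2_correct(scores, label_a=1, label_b=2)
```

Accuracy over a test set would then be the fraction of combined samples for which `top2_correct` returns `True`.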

LG Oct 22, 2018
Compositional Coding Capsule Network with K-Means Routing for Text Classification

Hao Ren, Hong Lu

Text classification is a challenging problem that aims to identify the category of texts. During training, word embeddings account for a large share of the parameters. With limited computing resources, this indirectly constrains subsequent network design. To reduce the number of parameters, the compositional coding mechanism was recently proposed. Building on it, this paper further explores compositional coding and proposes a compositional weighted coding method. We also apply a capsule network to model the relationship between word embeddings and propose a new routing algorithm, based on k-means clustering theory, to fully mine that relationship. Combining our compositional weighted coding method with the routing algorithm, we design a neural network for text classification. Experiments conducted on eight challenging text classification datasets show that the proposed method achieves competitive accuracy compared to the state-of-the-art approach with significantly fewer parameters.
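To make the parameter saving concrete, here is a minimal sketch of compositional coding: each word stores a few small integer codes, and its embedding is assembled from shared codebook vectors instead of a dedicated row in a large table. The optional `weights` argument is only one guess at what a "weighted" variant could look like; the abstract does not define it, and the function name `compose_embedding` is hypothetical.

```python
def compose_embedding(codes, codebooks, weights=None):
    """Build a word embedding as a (weighted) sum of codebook vectors.

    codes[m] selects one vector from codebooks[m]. With weights=None this is
    plain additive compositional coding; per-codebook weights are an assumed
    form of the weighted variant.
    """
    if weights is None:
        weights = [1.0] * len(codes)
    dim = len(codebooks[0][0])
    embedding = [0.0] * dim
    for m, code in enumerate(codes):
        vector = codebooks[m][code]
        for d in range(dim):
            embedding[d] += weights[m] * vector[d]
    return embedding

# Two codebooks, each holding two 2-dimensional basis vectors.
codebooks = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 2.0], [3.0, 3.0]],
]
word_codes = [1, 0]  # pick vector 1 from codebook 0 and vector 0 from codebook 1

plain = compose_embedding(word_codes, codebooks)
weighted = compose_embedding(word_codes, codebooks, weights=[0.5, 2.0])
```

The saving comes from storage: a vocabulary of V words with D-dimensional embeddings needs V * D floats in a full table, whereas M codebooks of K vectors need only M * K * D floats plus M small integers per word.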