Xin Meng

CV
h-index18
20papers
452citations
Novelty51%
AI Score56

20 Papers

CVNov 19, 2022
Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training

Zhenglun Kong, Haoyu Ma, Geng Yuan et al. · harvard, meta-ai

Vision transformers (ViTs) have recently obtained success in many applications, but their intensive computation and heavy memory usage at both training and inference time limit their generalization. Previous compression algorithms usually start from the pre-trained dense models and only focus on efficient inference, while time-consuming training is still unavoidable. In contrast, this paper points out that the million-scale training data is redundant, which is the fundamental reason for the tedious training. To address the issue, this paper aims to introduce sparsity into data and proposes an end-to-end efficient training framework from three sparse perspectives, dubbed Tri-Level E-ViT. Specifically, we leverage a hierarchical data redundancy reduction scheme, by exploring the sparsity under three levels: number of training examples in the dataset, number of patches (tokens) in each example, and number of connections between tokens that lie in attention weights. With extensive experiments, we demonstrate that our proposed technique can noticeably accelerate training for various ViT architectures while maintaining accuracy. Remarkably, under certain ratios, we are able to improve the ViT accuracy rather than compromising it. For example, we can achieve 15.2% speedup with 72.6% (+0.4) Top-1 accuracy on Deit-T, and 15.7% speedup with 79.9% (+0.1) Top-1 accuracy on Deit-S. This proves the existence of data redundancy in ViT.

CVNov 2, 2022Code
Data Level Lottery Ticket Hypothesis for Vision Transformers

Xuan Shen, Zhenglun Kong, Minghai Qin et al. · harvard

The conventional lottery ticket hypothesis (LTH) claims that there exists a sparse subnetwork within a dense neural network and a proper random initialization method called the winning ticket, such that it can be trained from scratch to almost as good as the dense counterpart. Meanwhile, the research of LTH in vision transformers (ViTs) is scarcely evaluated. In this paper, we first show that the conventional winning ticket is hard to find at the weight level of ViTs by existing methods. Then, we generalize the LTH for ViTs to input data consisting of image patches inspired by the input dependence of ViTs. That is, there exists a subset of input image patches such that a ViT can be trained from scratch by using only this subset of patches and achieve similar accuracy to the ViTs trained by using all image patches. We call this subset of input patches the em winning tickets, which represent a significant amount of information in the input data. We use a ticket selector to generate the winning tickets based on the informativeness of patches for various types of ViT, including DeiT, LV-ViT, and Swin Transformers. The experiments show that there is a clear difference between the performance of models trained with winning tickets and randomly selected subsets, which verifies our proposed theory. We elaborate on the analogical similarity between our proposed Data-LTH-ViTs and the conventional LTH to further verify the integrity of our theory. The Source codes are available at https://github.com/shawnricecake/vit-lottery-ticket-input.

ARNov 15, 2022
HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision Transformers

Peiyan Dong, Mengshu Sun, Alec Lu et al. · harvard

While vision transformers (ViTs) have continuously achieved new milestones in the field of computer vision, their sophisticated network architectures with high computation and memory costs have impeded their deployment on resource-limited edge devices. In this paper, we propose a hardware-efficient image-adaptive token pruning framework called HeatViT for efficient yet accurate ViT acceleration on embedded FPGAs. By analyzing the inherent computational patterns in ViTs, we first design an effective attention-based multi-head token selector, which can be progressively inserted before transformer blocks to dynamically identify and consolidate the non-informative tokens from input images. Moreover, we implement the token selector on hardware by adding miniature control logic to heavily reuse existing hardware components built for the backbone ViT. To improve the hardware efficiency, we further employ 8-bit fixed-point quantization, and propose polynomial approximations with regularization effect on quantization error for the frequently used nonlinear functions in ViTs. Finally, we propose a latency-aware multi-stage training strategy to determine the transformer blocks for inserting token selectors and optimize the desired (average) pruning rates for inserted token selectors, in order to improve both the model accuracy and inference latency on hardware. Compared to existing ViT pruning studies, under the similar computation cost, HeatViT can achieve 0.7%$\sim$8.9% higher accuracy; while under the similar model accuracy, HeatViT can achieve more than 28.4%$\sim$65.3% computation reduction, for various widely used ViTs, including DeiT-T, DeiT-S, DeiT-B, LV-ViT-S, and LV-ViT-M, on the ImageNet dataset. Compared to the baseline hardware accelerator, our implementations of HeatViT on the Xilinx ZCU102 FPGA achieve 3.46$\times$$\sim$4.89$\times$ speedup.

LGOct 31, 2025
ECVL-ROUTER: Scenario-Aware Routing for Vision-Language Models

Xin Tang, Youfang Han, Fangfei Gou et al.

Vision-Language Models (VLMs) excel in diverse multimodal tasks. However, user requirements vary across scenarios, which can be categorized into fast response, high-quality output, and low energy consumption. Relying solely on large models deployed in the cloud for all queries often leads to high latency and energy cost, while small models deployed on edge devices are capable of handling simpler tasks with low latency and energy cost. To fully leverage the strengths of both large and small models, we propose ECVL-ROUTER, the first scenario-aware routing framework for VLMs. Our approach introduces a new routing strategy and evaluation metrics that dynamically select the appropriate model for each query based on user requirements, maximizing overall utility. We also construct a multimodal response-quality dataset tailored for router training and validate the approach through extensive experiments. Results show that our approach successfully routes over 80\% of queries to the small model while incurring less than 10\% drop in problem solving probability.

ITApr 23
Downlink Channel Matrix Estimation from PMI-Only Feedback in FDD Systems: Maximum Likelihood and Sharp Excess Risk Bound

Jinchi Chen, Mingxi Hu, Peigang Jiang et al.

We study downlink channel estimation in a frequency-division duplex (FDD) massive MIMO system from PMI-only feedback under a 5G NR-type limited-feedback architecture. In this architecture, the user selects a preferred codeword from a shared codebook based on the reduced-dimensional channel and only reports its index (known as the precoding matrix indicator, PMI) back to the base station. Therefore, the channel must be estimated from these highly quantized, nonlinear PMI observations. Based on a probabilistic perturbation model, a constrained maximum likelihood estimator (MLE) is proposed for this estimation problem, whose objective can also be interpreted as a relaxation of the hard empirical decision error. The Cramér--Rao bound is derived for the complex-valued model, with the global phase ambiguity handled via gauge-fixing. For the real-valued setting, a global excess-risk bound of order $O(1/\sqrt{T})$ is established, which is then refined to a sharp local rate of order $O(1/T)$ under suitable identifiability conditions. Numerical results show that the MLE asymptotically attains the Cramér--Rao bound and outperforms several baseline methods on both synthetic data and realistic FDD channels.

CLMay 28, 2025Code
Enabling Flexible Multi-LLM Integration for Scalable Knowledge Aggregation

Zhenglun Kong, Zheng Zhan, Shiyue Hou et al. · harvard

Large language models (LLMs) have shown remarkable promise but remain challenging to continually improve through traditional finetuning, particularly when integrating capabilities from other specialized LLMs. Popular methods like ensemble and weight merging require substantial memory and struggle to adapt to changing data environments. Recent efforts have transferred knowledge from multiple LLMs into a single target model; however, they suffer from interference and degraded performance among tasks, largely due to limited flexibility in candidate selection and training pipelines. To address these issues, we propose a framework that adaptively selects and aggregates knowledge from diverse LLMs to build a single, stronger model, avoiding the high memory overhead of ensemble and inflexible weight merging. Specifically, we design an adaptive selection network that identifies the most relevant source LLMs based on their scores, thereby reducing knowledge interference. We further propose a dynamic weighted fusion strategy that accounts for the inherent strengths of candidate LLMs, along with a feedback-driven loss function that prevents the selector from converging on a single subset of sources. Experimental results demonstrate that our method can enable a more stable and scalable knowledge aggregation process while reducing knowledge interference by up to 50% compared to existing approaches. Code is avaliable at https://github.com/ZLKong/LLM_Integration

CVDec 1, 2024
EventGPT: Event Stream Understanding with Multimodal Large Language Models

Shaoyu Liu, Jianing Li, Guanghui Zhao et al.

Event cameras record visual information as asynchronous pixel change streams, excelling at scene perception under unsatisfactory lighting or high-dynamic conditions. Existing multimodal large language models (MLLMs) concentrate on natural RGB images, failing in scenarios where event data fits better. In this paper, we introduce EventGPT, the first MLLM for event stream understanding, to the best of our knowledge, marking a pioneering attempt to integrate large language models (LLMs) with event stream comprehension. To mitigate the huge domain gaps, we develop a three-stage optimization paradigm to gradually equip a pre-trained LLM with the capability of understanding event-based scenes. Our EventGPT comprises an event encoder, followed by a spatio-temporal aggregator, a linear projector, an event-language adapter, and an LLM. Firstly, RGB image-text pairs generated by GPT are leveraged to warm up the linear projector, referring to LLaVA, as the gap between natural image and language modalities is relatively smaller. Secondly, we construct a synthetic yet large dataset, N-ImageNet-Chat, consisting of event frames and corresponding texts to enable the use of the spatio-temporal aggregator and to train the event-language adapter, thereby aligning event features more closely with the language space. Finally, we gather an instruction dataset, Event-Chat, which contains extensive real-world data to fine-tune the entire model, further enhancing its generalization ability. We construct a comprehensive benchmark, and experiments show that EventGPT surpasses previous state-of-the-art MLLMs in generation quality, descriptive accuracy, and reasoning capability.

CVJan 29, 2025
Segmentation-Aware Generative Reinforcement Network (GRN) for Tissue Layer Segmentation in 3-D Ultrasound Images for Chronic Low-back Pain (cLBP) Assessment

Zixue Zeng, Xiaoyan Zhao, Matthew Cartier et al.

We introduce a novel segmentation-aware joint training framework called generative reinforcement network (GRN) that integrates segmentation loss feedback to optimize both image generation and segmentation performance in a single stage. An image enhancement technique called segmentation-guided enhancement (SGE) is also developed, where the generator produces images tailored specifically for the segmentation model. Two variants of GRN were also developed, including GRN for sample-efficient learning (GRN-SEL) and GRN for semi-supervised learning (GRN-SSL). GRN's performance was evaluated using a dataset of 69 fully annotated 3D ultrasound scans from 29 subjects. The annotations included six anatomical structures: dermis, superficial fat, superficial fascial membrane (SFM), deep fat, deep fascial membrane (DFM), and muscle. Our results show that GRN-SEL with SGE reduces labeling efforts by up to 70% while achieving a 1.98% improvement in the Dice Similarity Coefficient (DSC) compared to models trained on fully labeled datasets. GRN-SEL alone reduces labeling efforts by 60%, GRN-SSL with SGE decreases labeling requirements by 70%, and GRN-SSL alone by 60%, all while maintaining performance comparable to fully supervised models. These findings suggest the effectiveness of the GRN framework in optimizing segmentation performance with significantly less labeled data, offering a scalable and efficient solution for ultrasound image analysis and reducing the burdens associated with data annotation.

IVNov 25, 2025
LAYER: A Quantitative Explainable AI Framework for Decoding Tissue-Layer Drivers of Myofascial Low Back Pain

Zixue Zeng, Anthony M. Perti, Tong Yu et al.

Myofascial pain (MP) is a leading cause of chronic low back pain, yet its tissue-level drivers remain poorly defined and lack reliable image biomarkers. Existing studies focus predominantly on muscle while neglecting fascia, fat, and other soft tissues that play integral biomechanical roles. We developed an anatomically grounded explainable artificial intelligence (AI) framework, LAYER (Layer-wise Analysis for Yielding Explainable Relevance Tissue), that analyses six tissue layers in three-dimensional (3D) ultrasound and quantifies their contribution to MP prediction. By utilizing the largest multi-model 3D ultrasound cohort consisting of over 4,000 scans, LAYER reveals that non-muscle tissues contribute substantially to pain prediction. In B-mode imaging, the deep fascial membrane (DFM) showed the highest saliency (0.420), while in combined B-mode and shear-wave images, the collective saliency of non-muscle layers (0.316) nearly matches that of muscle (0.317), challenging the conventional muscle-centric paradigm in MP research and potentially affecting the therapy methods. LAYER establishes a quantitative, interpretable framework for linking layer-specific anatomy to pain physiology, uncovering new tissue targets and providing a generalizable approach for explainable analysis of soft-tissue imaging.

CVNov 22, 2025
SciEducator: Scientific Video Understanding and Educating via Deming-Cycle Multi-Agent System

Zhiyu Xu, Weilong Yan, Yufei Shi et al.

Recent advancements in multimodal large language models (MLLMs) and video agent systems have significantly improved general video understanding. However, when applied to scientific video understanding and educating, a domain that demands external professional knowledge integration and rigorous step-wise reasoning, existing approaches often struggle. To bridge this gap, we propose SciEducator, the first iterative self-evolving multi-agent system for scientific video comprehension and education. Rooted in the classical Deming Cycle from management science, our design reformulates its Plan-Do-Study-Act philosophy into a self-evolving reasoning and feedback mechanism, which facilitates the interpretation of intricate scientific activities in videos. Moreover, SciEducator can produce multimodal educational content tailored to specific scientific processes, including textual instructions, visual guides, audio narrations, and interactive references. To support evaluation, we construct SciVBench, a benchmark consisting of 500 expert-verified and literature-grounded science QA pairs across five categories, covering physical, chemical, and everyday phenomena. Extensive experiments demonstrate that SciEducator substantially outperforms leading closed-source MLLMs (e.g., Gemini, GPT-4o) and state-of-the-art video agents on the benchmark, establishing a new paradigm for the community.

CVSep 15, 2025
Enriched text-guided variational multimodal knowledge distillation network (VMD) for automated diagnosis of plaque vulnerability in 3D carotid artery MRI

Bo Cao, Fan Yu, Mengmeng Feng et al.

Multimodal learning has attracted much attention in recent years due to its ability to effectively utilize data features from a variety of different modalities. Diagnosing the vulnerability of atherosclerotic plaques directly from carotid 3D MRI images is relatively challenging for both radiologists and conventional 3D vision networks. In clinical practice, radiologists assess patient conditions using a multimodal approach that incorporates various imaging modalities and domain-specific expertise, paving the way for the creation of multimodal diagnostic networks. In this paper, we have developed an effective strategy to leverage radiologists' domain knowledge to automate the diagnosis of carotid plaque vulnerability through Variation inference and Multimodal knowledge Distillation (VMD). This method excels in harnessing cross-modality prior knowledge from limited image annotations and radiology reports within training data, thereby enhancing the diagnostic network's accuracy for unannotated 3D MRI images. We conducted in-depth experiments on the dataset collected in-house and verified the effectiveness of the VMD strategy we proposed.

IVMar 25, 2025
GRN+: A Simplified Generative Reinforcement Network for Tissue Layer Analysis in 3D Ultrasound Images for Chronic Low-back Pain

Zixue Zeng, Xiaoyan Zhao, Matthew Cartier et al.

3D ultrasound delivers high-resolution, real-time images of soft tissues, which is essential for pain research. However, manually distinguishing various tissues for quantitative analysis is labor-intensive. To streamline this process, we developed and validated GRN+, a novel multi-model framework that automates layer segmentation with minimal annotated data. GRN+ combines a ResNet-based generator and a U-Net segmentation model. Through a method called Segmentation-guided Enhancement (SGE), the generator produces new images and matching masks under the guidance of the segmentation model, with its weights adjusted according to the segmentation loss gradient. To prevent gradient explosion and secure stable training, a two-stage backpropagation strategy was implemented: the first stage propagates the segmentation loss through both the generator and segmentation model, while the second stage concentrates on optimizing the segmentation model alone, thereby refining mask prediction using the generated images. Tested on 69 fully annotated 3D ultrasound scans from 29 subjects with six manually labeled tissue layers, GRN+ outperformed all other semi-supervised methods in terms of the Dice coefficient using only 5% labeled data, despite not using unlabeled data for unsupervised training. Additionally, when applied to fully annotated datasets, GRN+ with SGE achieved a 2.16% higher Dice coefficient while incurring lower computational costs compared to other models. Overall, GRN+ provides accurate tissue segmentation while reducing both computational expenses and the dependency on extensive annotations, making it an effective tool for 3D ultrasound analysis in cLBP patients.

IVMar 25, 2025
InterSliceBoost: Identifying Tissue Layers in Three-dimensional Ultrasound Images for Chronic Lower Back Pain (cLBP) Assessment

Zixue Zeng, Matthew Cartier, Xiaoyan Zhao et al.

Available studies on chronic lower back pain (cLBP) typically focus on one or a few specific tissues rather than conducting a comprehensive layer-by-layer analysis. Since three-dimensional (3-D) images often contain hundreds of slices, manual annotation of these anatomical structures is both time-consuming and error-prone. We aim to develop and validate a novel approach called InterSliceBoost to enable the training of a segmentation model on a partially annotated dataset without compromising segmentation performance. The architecture of InterSliceBoost includes two components: an inter-slice generator and a segmentation model. The generator utilizes residual block-based encoders to extract features from adjacent image-mask pairs (IMPs). Differential features are calculated and input into a decoder to generate inter-slice IMPs. The segmentation model is trained on partially annotated datasets (e.g., skipping 1, 2, 3, or 7 images) and the generated inter-slice IMPs. To validate the performance of InterSliceBoost, we utilized a dataset of 76 B-mode ultrasound scans acquired on 29 subjects enrolled in an ongoing cLBP study. InterSliceBoost, trained on only 33% of the image slices, achieved a mean Dice coefficient of 80.84% across all six layers on the independent test set, with Dice coefficients of 73.48%, 61.11%, 81.87%, 95.74%, 83.52% and 88.74% for segmenting dermis, superficial fat, superficial fascial membrane, deep fat, deep fascial membrane, and muscle. This performance is significantly higher than the conventional model trained on fully annotated images (p<0.05). InterSliceBoost can effectively segment the six tissue layers depicted on 3-D B-model ultrasound images in settings with partial annotations.

MMMar 14, 2025
Safe-VAR: Safe Visual Autoregressive Model for Text-to-Image Generative Watermarking

Ziyi Wang, Songbai Tan, Gang Xu et al.

With the success of autoregressive learning in large language models, it has become a dominant approach for text-to-image generation, offering high efficiency and visual quality. However, invisible watermarking for visual autoregressive (VAR) models remains underexplored, despite its importance in misuse prevention. Existing watermarking methods, designed for diffusion models, often struggle to adapt to the sequential nature of VAR models. To bridge this gap, we propose Safe-VAR, the first watermarking framework specifically designed for autoregressive text-to-image generation. Our study reveals that the timing of watermark injection significantly impacts generation quality, and watermarks of different complexities exhibit varying optimal injection times. Motivated by this observation, we propose an Adaptive Scale Interaction Module, which dynamically determines the optimal watermark embedding strategy based on the watermark information and the visual characteristics of the generated image. This ensures watermark robustness while minimizing its impact on image quality. Furthermore, we introduce a Cross-Scale Fusion mechanism, which integrates mixture of both heads and experts to effectively fuse multi-resolution features and handle complex interactions between image content and watermark patterns. Experimental results demonstrate that Safe-VAR achieves state-of-the-art performance, significantly surpassing existing counterparts regarding image quality, watermarking fidelity, and robustness against perturbations. Moreover, our method exhibits strong generalization to an out-of-domain watermark dataset QR Codes.

CRFeb 13, 2025
X-SG$^2$S: Safe and Generalizable Gaussian Splatting with X-dimensional Watermarks

Zihang Cheng, Huiping Zhuang, Chun Li et al.

3D Gaussian Splatting (3DGS) has been widely used in 3D reconstruction and 3D generation. Training to get a 3DGS scene often takes a lot of time and resources and even valuable inspiration. The increasing amount of 3DGS digital asset have brought great challenges to the copyright protection. However, it still lacks profound exploration targeted at 3DGS. In this paper, we propose a new framework X-SG$^2$S which can simultaneously watermark 1 to 3D messages while keeping the original 3DGS scene almost unchanged. Generally, we have a X-SG$^2$S injector for adding multi-modal messages simultaneously and an extractor for extract them. Specifically, we first split the watermarks into message patches in a fixed manner and sort the 3DGS points. A self-adaption gate is used to pick out suitable location for watermarking. Then use a XD(multi-dimension)-injection heads to add multi-modal messages into sorted 3DGS points. A learnable gate can recognize the location with extra messages and XD-extraction heads can restore hidden messages from the location recommended by the learnable gate. Extensive experiments demonstrated that the proposed X-SG$^2$S can effectively conceal multi modal messages without changing pretrained 3DGS pipeline or the original form of 3DGS parameters. Meanwhile, with simple and efficient model structure and high practicality, X-SG$^2$S still shows good performance in hiding and extracting multi-modal inner structured or unstructured messages. X-SG$^2$S is the first to unify 1 to 3D watermarking model for 3DGS and the first framework to add multi-modal watermarks simultaneous in one 3DGS which pave the wave for later researches.

CVMay 16, 2023
Pink-Eggs Dataset V1: A Step Toward Invasive Species Management Using Deep Learning Embedded Solutions

Di Xu, Yang Zhao, Xiang Hao et al.

We introduce a novel dataset consisting of images depicting pink eggs that have been identified as Pomacea canaliculata eggs, accompanied by corresponding bounding box annotations. The purpose of this dataset is to aid researchers in the analysis of the spread of Pomacea canaliculata species by utilizing deep learning techniques, as well as supporting other investigative pursuits that require visual data pertaining to the eggs of Pomacea canaliculata. It is worth noting, however, that the identity of the eggs in question is not definitively established, as other species within the same taxonomic family have been observed to lay similar-looking eggs in regions of the Americas. Therefore, a crucial prerequisite to any decision regarding the elimination of these eggs would be to establish with certainty whether they are exclusively attributable to invasive Pomacea canaliculata or if other species are also involved. The dataset is available at https://www.kaggle.com/datasets/deeshenzhen/pinkeggs

ROFeb 22, 2022
Transporters with Visual Foresight for Solving Unseen Rearrangement Tasks

Hongtao Wu, Jikai Ye, Xin Meng et al.

Rearrangement tasks have been identified as a crucial challenge for intelligent robotic manipulation, but few methods allow for precise construction of unseen structures. We propose a visual foresight model for pick-and-place rearrangement manipulation which is able to learn efficiently. In addition, we develop a multi-modal action proposal module which builds on the Goal-Conditioned Transporter Network, a state-of-the-art imitation learning method. Our image-based task planning method, Transporters with Visual Foresight, is able to learn from only a handful of data and generalize to multiple unseen tasks in a zero-shot manner. TVF is able to improve the performance of a state-of-the-art imitation learning method on unseen tasks in simulation and real robot experiments. In particular, the average success rate on unseen tasks improves from 55.4% to 78.5% in simulation experiments and from 30% to 63.3% in real robot experiments when given only tens of expert demonstrations. Video and code are available on our project website: https://chirikjianlab.github.io/tvf/

CVDec 27, 2021
SPViT: Enabling Faster Vision Transformers via Soft Token Pruning

Zhenglun Kong, Peiyan Dong, Xiaolong Ma et al.

Recently, Vision Transformer (ViT) has continuously established new milestones in the computer vision field, while the high computation and memory cost makes its propagation in industrial production difficult. Pruning, a traditional model compression paradigm for hardware efficiency, has been widely applied in various DNN structures. Nevertheless, it stays ambiguous on how to perform exclusive pruning on the ViT structure. Considering three key points: the structural characteristics, the internal data pattern of ViTs, and the related edge device deployment, we leverage the input token sparsity and propose a computation-aware soft pruning framework, which can be set up on vanilla Transformers of both flatten and CNN-type structures, such as Pooling-based ViT (PiT). More concretely, we design a dynamic attention-based multi-head token selector, which is a lightweight module for adaptive instance-wise token selection. We further introduce a soft pruning technique, which integrates the less informative tokens generated by the selector module into a package token that will participate in subsequent calculations rather than being completely discarded. Our framework is bound to the trade-off between accuracy and computation constraints of specific edge devices through our proposed computation-aware training strategy. Experimental results show that our framework significantly reduces the computation cost of ViTs while maintaining comparable performance on image classification. Moreover, our framework can guarantee the identified model to meet resource specifications of mobile devices and FPGA, and even achieve the real-time execution of DeiT-T on mobile platforms. For example, our method reduces the latency of DeiT-T to 26 ms (26%$\sim $41% superior to existing works) on the mobile device with 0.25%$\sim $4% higher top-1 accuracy on ImageNet.

ROAug 12, 2021
Put the Bear on the Chair! Intelligent Robot Interaction with Previously Unseen Chairs via Robot Imagination

Hongtao Wu, Xin Meng, Sipu Ruan et al.

In this paper, we study the problem of autonomously seating a teddy bear on a previously unseen chair. To achieve this goal, we present a novel method for robots to imagine the sitting pose of the bear by physically simulating a virtual humanoid agent sitting on the chair. We also develop a robotic system which leverages motion planning to plan SE(2) motions for a humanoid robot to walk to the chair and whole-body motions to put the bear on it. Furthermore, to cope with cases where the chair is not in an accessible pose for placing the bear, a human assistance module is introduced for a human to follow language instructions given by the robot to rotate the chair and help make the chair accessible. We implement our method with a robot arm and a humanoid robot. We calibrate the proposed system with 3 chairs and test on 12 previously unseen chairs in both accessible and inaccessible poses extensively. Results show that our method enables the robot to autonomously seat the teddy bear on the 12 previously unseen chairs with a very high success rate. The human assistance module is also shown to be very effective in changing the accessibility of the chair. Video demos and more details are available at https://chirikjianlab.github.io/putbearonchair/.

CVJan 14, 2019
Assessment of central serous chorioretinopathy (CSC) depicted on color fundus photographs using deep Learning

Yi Zhen, Hang Chen, Xu Zhang et al.

To investigate whether and to what extent central serous chorioretinopathy (CSC) depicted on color fundus photographs can be assessed using deep learning technology. We collected a total of 2,504 fundus images acquired on different subjects. We verified the CSC status of these images using their corresponding optical coherence tomography (OCT) images. A total of 1,329 images depicted CSC. These images were preprocessed and normalized. This resulting dataset was randomly split into three parts in the ratio of 8:1:1 respectively for training, validation, and testing purposes. We used the deep learning architecture termed InceptionV3 to train the classifier. We performed nonparametric receiver operating characteristic (ROC) analyses to assess the capability of the developed algorithm to identify CSC. The Kappa coefficient between the two raters was 0.48 (p < 0.001), while the Kappa coefficients between the computer and the two raters were 0.59 (p < 0.001) and 0.33 (p < 0.05).Our experiments showed that the computer algorithm based on deep learning can assess CSC depicted on color fundus photographs in a relatively reliable and consistent way.