Demetri Terzopoulos

CV
h-index: 18
43 papers
5,053 citations
Novelty: 48%
AI Score: 56

43 Papers

AI · Apr 9, 2023
ARNOLD: A Benchmark for Language-Grounded Task Learning With Continuous States in Realistic 3D Scenes

Ran Gong, Jiangyong Huang, Yizhou Zhao et al. · berkeley

Understanding the continuous states of objects is essential for task learning and planning in the real world. However, most existing task learning benchmarks assume discrete (e.g., binary) object goal states, which poses challenges for the learning of complex tasks and transferring learned policy from simulated environments to the real world. Furthermore, state discretization limits a robot's ability to follow human instructions based on the grounding of actions and states. To tackle these challenges, we present ARNOLD, a benchmark that evaluates language-grounded task learning with continuous states in realistic 3D scenes. ARNOLD is comprised of 8 language-conditioned tasks that involve understanding object states and learning policies for continuous goals. To promote language-instructed learning, we provide expert demonstrations with template-generated language descriptions. We assess task performance by utilizing the latest language-conditioned policy learning models. Our results indicate that current models for language-conditioned manipulations continue to experience significant challenges in novel goal-state generalizations, scene generalizations, and object generalizations. These findings highlight the need to develop new algorithms that address this gap and underscore the potential for further research in this area. Project website: https://arnold-benchmark.github.io.

AI · Sep 18, 2023
MindAgent: Emergent Gaming Interaction

Ran Gong, Qiuyuan Huang, Xiaojian Ma et al.

Large Language Models (LLMs) are capable of performing complex scheduling in a multi-agent system and can coordinate these agents to complete sophisticated tasks that require extensive collaboration. However, despite the introduction of numerous gaming frameworks, the community lacks sufficient benchmarks for building general multi-agent collaboration infrastructure that encompasses both LLM and human-NPC collaboration. In this work, we propose a novel infrastructure, MindAgent, to evaluate emergent planning and coordination capabilities for gaming interaction. In particular, our infrastructure leverages existing gaming frameworks to i) require understanding of the coordinator for a multi-agent system, ii) collaborate with human players via un-finetuned proper instructions, and iii) establish in-context learning on few-shot prompts with feedback. Furthermore, we introduce CUISINEWORLD, a new gaming scenario and related benchmark that assesses multi-agent collaboration efficiency and supervises multiple agents playing the game simultaneously. We conduct comprehensive evaluations with a new automatic metric, CoS, for calculating collaboration efficiency. Finally, our infrastructure can be deployed into real-world gaming scenarios in a customized VR version of CUISINEWORLD and adapted to the existing, broader Minecraft gaming domain. We hope our findings on LLMs and the new infrastructure for general-purpose scheduling and coordination can help shed light on how such skills can be obtained by learning from large language corpora.

IV · Nov 8, 2023 · Code
CSAM: A 2.5D Cross-Slice Attention Module for Anisotropic Volumetric Medical Image Segmentation

Alex Ling Yu Hung, Haoxin Zheng, Kai Zhao et al.

A large portion of volumetric medical data, especially magnetic resonance imaging (MRI) data, is anisotropic, as the through-plane resolution is typically much lower than the in-plane resolution. Both 3D and purely 2D deep learning-based segmentation methods are deficient in dealing with such volumetric data since the performance of 3D methods suffers when confronting anisotropic data, and 2D methods disregard crucial volumetric information. Insufficient work has been done on 2.5D methods, in which 2D convolution is mainly used in concert with volumetric information. These models focus on learning the relationship across slices, but typically have many parameters to train. We offer a Cross-Slice Attention Module (CSAM) with minimal trainable parameters, which captures information across all the slices in the volume by applying semantic, positional, and slice attention on deep feature maps at different scales. Our extensive experiments using different network architectures and tasks demonstrate the usefulness and generalizability of CSAM. Associated code is available at https://github.com/aL3x-O-o-Hung/CSAM.
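As a rough illustration of what attending across slices means, the sketch below applies plain scaled dot-product self-attention over one feature vector per slice. It is not the CSAM implementation (that is in the repository linked above); the function name `slice_attention` and the idea of one pooled vector per slice are assumptions made for illustration only.

```python
import math

def slice_attention(features):
    """Self-attention across slices (illustrative, not the CSAM module).

    features: list of per-slice feature vectors (one list of floats per slice).
    Returns one mixed vector per slice, each a softmax-weighted average of all slices.
    """
    dim = len(features[0])
    scale = math.sqrt(dim)
    mixed = []
    for query in features:
        # Similarity of this slice to every slice in the volume.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in features]
        # Numerically stable softmax over the slice axis.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted average of all slice vectors.
        mixed.append([sum(w * value[d] for w, value in zip(weights, features))
                      for d in range(dim)])
    return mixed
```

Each output vector depends on every slice in the volume, which is the property a purely 2D, slice-by-slice model lacks.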

IV · Mar 29, 2022
CAT-Net: A Cross-Slice Attention Transformer Model for Prostate Zonal Segmentation in MRI

Alex Ling Yu Hung, Haoxin Zheng, Qi Miao et al.

Prostate cancer is the second leading cause of cancer death among men in the United States. The diagnosis of prostate MRI often relies on the accurate prostate zonal segmentation. However, state-of-the-art automatic segmentation methods often fail to produce well-contained volumetric segmentation of the prostate zones since certain slices of prostate MRI, such as base and apex slices, are harder to segment than other slices. This difficulty can be overcome by accounting for the cross-slice relationship of adjacent slices, but current methods do not fully learn and exploit such relationships. In this paper, we propose a novel cross-slice attention mechanism, which we use in a Transformer module to systematically learn the cross-slice relationship at different scales. The module can be utilized in any existing learning-based segmentation framework with skip connections. Experiments show that our cross-slice attention is able to capture the cross-slice information in prostate zonal segmentation and improve the performance of current state-of-the-art methods. Our method improves segmentation accuracy in the peripheral zone, such that the segmentation results are consistent across all the prostate slices (apex, mid-gland, and base).

IV · Mar 28, 2022
RAVIR: A Dataset and Methodology for the Semantic Segmentation and Quantitative Analysis of Retinal Arteries and Veins in Infrared Reflectance Imaging

Ali Hatamizadeh, Hamid Hosseini, Niraj Patel et al.

The retinal vasculature provides important clues in the diagnosis and monitoring of systemic diseases including hypertension and diabetes. The microvascular system is of primary involvement in such conditions, and the retina is the only anatomical site where the microvasculature can be directly observed. The objective assessment of retinal vessels has long been considered a surrogate biomarker for systemic vascular diseases, and with recent advancements in retinal imaging and computer vision technologies, this topic has become the subject of renewed attention. In this paper, we present a novel dataset, dubbed RAVIR, for the semantic segmentation of Retinal Arteries and Veins in Infrared Reflectance (IR) imaging. It enables the creation of deep learning-based models that distinguish extracted vessel type without extensive post-processing. We propose a novel deep learning-based methodology, denoted as SegRAVIR, for the semantic segmentation of retinal arteries and veins and the quantitative measurement of the widths of segmented vessels. Our extensive experiments validate the effectiveness of SegRAVIR and demonstrate its superior performance in comparison to state-of-the-art models. Additionally, we propose a knowledge distillation framework for the domain adaptation of RAVIR pretrained networks on color images. We demonstrate that our pretraining procedure yields new state-of-the-art benchmarks on the DRIVE, STARE, and CHASE_DB1 datasets. Dataset link: https://ravirdataset.github.io/data/

IV · Jul 1, 2024 · Code
Cross-Slice Attention and Evidential Critical Loss for Uncertainty-Aware Prostate Cancer Detection

Alex Ling Yu Hung, Haoxin Zheng, Kai Zhao et al.

Current deep learning-based models typically analyze medical images in either 2D or 3D albeit disregarding volumetric information or suffering sub-optimal performance due to the anisotropic resolution of MR data. Furthermore, providing an accurate uncertainty estimation is beneficial to clinicians, as it indicates how confident a model is about its prediction. We propose a novel 2.5D cross-slice attention model that utilizes both global and local information, along with an evidential critical loss, to perform evidential deep learning for the detection in MR images of prostate cancer, one of the most common cancers and a leading cause of cancer-related death in men. We perform extensive experiments with our model on two different datasets and achieve state-of-the-art performance in prostate cancer detection along with improved epistemic uncertainty estimation. The implementation of the model is available at https://github.com/aL3x-O-o-Hung/GLCSA_ECLoss.

LG · Dec 5, 2022
A Mobility-Aware Deep Learning Model for Long-Term COVID-19 Pandemic Prediction and Policy Impact Analysis

Danfeng Guo, Zijie Huang, Junheng Hao et al. · microsoft-research

Pandemic (epidemic) modeling, which aims at analyzing the spread of disease, has always been a popular research topic, especially following the outbreak of COVID-19 in 2019. Some representative models, including SIR-based deep learning prediction models, have shown satisfactory performance. However, one major drawback is that they fall short in their long-term predictive ability. Although graph convolutional networks (GCN) also perform well, their edge representations do not contain complete information, which can lead to biases. Another drawback is that they usually use input features that they are unable to predict. Hence, those models are unable to predict further into the future. We propose a model that can propagate predictions further into the future and has better edge representations. In particular, we model the pandemic as a spatial-temporal graph whose edges represent the transition of infections and are learned by our model. We use a two-stream framework that contains GCN and recursive structures (GRU) with an attention mechanism. Our model enables mobility analysis that provides an effective toolbox for public health researchers and policy makers to predict how different lock-down strategies that actively control mobility can influence the spread of pandemics. Experiments show that our model outperforms others in its long-term predictive power. Moreover, we simulate the effects of certain policies and predict their impacts on infection control.
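For background on the SIR-based models mentioned above, here is a minimal sketch of the classic SIR compartmental model advanced with Euler steps. It is the textbook model, not the paper's graph-based method; the function names and the parameter values in the usage note are hypothetical.

```python
def sir_step(s, i, r, beta, gamma, dt=1.0):
    """Advance susceptible/infected/recovered fractions by one Euler step."""
    new_infections = beta * s * i * dt   # contacts between S and I
    new_recoveries = gamma * i * dt      # infected individuals who recover
    return (s - new_infections,
            i + new_infections - new_recoveries,
            r + new_recoveries)

def simulate(s, i, r, beta, gamma, steps):
    """Return the trajectory [(s, i, r), ...] including the initial state."""
    trajectory = [(s, i, r)]
    for _ in range(steps):
        s, i, r = sir_step(s, i, r, beta, gamma)
        trajectory.append((s, i, r))
    return trajectory
```

For example, `simulate(0.99, 0.01, 0.0, beta=0.3, gamma=0.1, steps=50)` rolls the model forward 50 steps. The three fractions always sum to one because every step only moves mass between compartments.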

78.5 · LG · May 26
Less is More: Early Stopping Rollout for On-Policy Distillation

Zhou Ziheng, Jiaqi Li, Huacong Tang et al.

On-policy distillation has recently emerged as a promising alternative to standard sequence-level imitation, training a student by scoring its own rollouts with a teacher model. However, we observe an "Off-policy Teacher Decay" problem in this paradigm: for the later tokens, with the student's earlier trajectory as context that is off-policy to the teacher, the teacher's ability to produce a corrective score decays, and the teacher may fall back to token-completion behavior learned in the pre-training stage. We empirically verify this problem, and we propose Early Stopping Rollout (ESR) to fix it: a simple yet effective distillation strategy that restricts the rollout generation to the first response tokens. We show that ESR both surpasses full-rollout on-policy distillation (OPD) performance across model sizes, families, tasks, and training regimes, and exhibits much higher GPU efficiency and training stability, especially in cross-model-family scenarios. We further investigate the mechanism behind this surprising performance and discover the "Cascading Alignment" and "Sub-mode Commitment" effects of ESR, which may explain why it works effectively and sometimes even exceeds the teacher model's performance. In addition, we show that this position-based token selection strategy cannot be fully explained by KL divergence and entropy signals.
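The core idea of restricting rollout generation to the first response tokens can be sketched as a generation loop with a hard token cap. This is only an illustration under assumed names: `next_token` stands in for a student model's sampling step, and `max_new_tokens` for the early-stopping budget; neither is taken from the paper.

```python
def early_stopped_rollout(next_token, prompt, max_new_tokens, eos_token=None):
    """Generate at most max_new_tokens response tokens from the student.

    next_token: callable mapping the current token list to the next token
                (hypothetical stand-in for the student model).
    """
    context = list(prompt)
    response = []
    for _ in range(max_new_tokens):
        token = next_token(context)
        response.append(token)
        context.append(token)
        if token == eos_token:
            break  # the student finished before the cap was reached
    return response
```

The teacher would then score only this truncated response, so its scores are never conditioned on a long off-policy prefix.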

91.1 · MA · Apr 27
Why Are We Moral? An LLM-based Agent Simulation Approach to Study Moral Evolution

Zhou Ziheng, Huacong Tang, Mingjie Bi et al.

The evolution of morality presents a puzzle: natural selection should favor self-interest, yet humans developed moral systems promoting altruism. Traditional approaches must abstract away cognitive processes, leaving open how cognitive factors shape moral evolution. We introduce an LLM-based agent simulation framework that brings cognitive realism to this question: agents with varying moral dispositions perceive, remember, reason, and decide in a simulated prehistoric hunter-gatherer society. This enables us to manipulate factors that traditional models cannot represent -- such as moral type observability and communication bandwidth -- and to discover emergent cognitive mechanisms from agent interactions. Across 20 runs spanning four settings, we find that cooperation and mutual help are the central driver of evolutionary survival, with universal and reciprocal morality exhibiting the most stable outcomes across conditions while selfishness is strongly disfavoured. Beyond cooperation itself, we further identify cognition as a central mediator -- most clearly through a cost of moral judgment that shifts the winning moral type across settings, with a self-purging effect among selfish agents as an additional cognitive pattern. We validate robustness across multiple LLM backbones, architecture ablations, and prompt sensitivity analyses. This work establishes LLM-based simulation as a powerful new paradigm to complement traditional research in evolutionary biology and anthropology, opening new avenues for investigating the complexities of moral and social evolution.

CV · Jul 31, 2024
Prompting Medical Large Vision-Language Models to Diagnose Pathologies by Visual Question Answering

Danfeng Guo, Demetri Terzopoulos

Large Vision-Language Models (LVLMs) have achieved significant success in recent years, and they have been extended to the medical domain. Although demonstrating satisfactory performance on medical Visual Question Answering (VQA) tasks, Medical LVLMs (MLVLMs) suffer from the hallucination problem, which makes them fail to diagnose complex pathologies. Moreover, they readily fail to learn minority pathologies due to imbalanced training data. We propose two prompting strategies for MLVLMs that reduce hallucination and improve VQA performance. In the first strategy, we provide a detailed explanation of the queried pathology. In the second strategy, we fine-tune a cheap, weak learner to achieve high performance on a specific metric, and textually provide its judgment to the MLVLM. Tested on the MIMIC-CXR-JPG and Chexpert datasets, our methods significantly improve the diagnostic F1 score, with the highest increase being 0.27. We also demonstrate that our prompting strategies can be extended to general LVLM domains. Based on POPE metrics, it effectively suppresses the false negative predictions of existing LVLMs and improves Recall by approximately 0.07.
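The F1 and Recall figures quoted above follow the standard definitions, which the sketch below computes from confusion counts. The function name is an assumption and the counts used in the usage note are made up for illustration, not results from the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1
```

With hypothetical counts, `precision_recall_f1(8, 2, 1)` has a higher recall than `precision_recall_f1(8, 2, 2)`, which is why suppressing false negatives raises Recall.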

CV · Apr 11, 2023
Semi-Supervised Relational Contrastive Learning

Attiano Purpura-Pontoniere, Demetri Terzopoulos, Adam Wang et al.

Disease diagnosis from medical images via supervised learning is usually dependent on tedious, error-prone, and costly image labeling by medical experts. Alternatively, semi-supervised learning and self-supervised learning offer effectiveness through the acquisition of valuable insights from readily available unlabeled images. We present Semi-Supervised Relational Contrastive Learning (SRCL), a novel semi-supervised learning model that leverages self-supervised contrastive loss and sample relation consistency for the more meaningful and effective exploitation of unlabeled data. Our experimentation with the SRCL model explores both pre-train/fine-tune and joint learning of the pretext (contrastive learning) and downstream (diagnostic classification) tasks. We validate against the ISIC 2018 Challenge benchmark skin lesion classification dataset and demonstrate the effectiveness of our semi-supervised method on varying amounts of labeled data.
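To make the notion of a self-supervised contrastive loss concrete, the sketch below implements a generic InfoNCE-style loss for a single anchor. The abstract does not specify which contrastive loss SRCL uses, so this is a stand-in, and the function names and the temperature value are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: negative log-softmax of the positive's similarity."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Stable log-sum-exp over the positive and all negatives.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum - logits[0]
```

The loss is small when the anchor is closer to its positive (for example, another augmentation of the same image) than to the negatives, and grows as the positive drifts away.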

AI · Dec 6, 2024 · Code
TeamCraft: A Benchmark for Multi-Modal Multi-Agent Systems in Minecraft

Qian Long, Zhi Li, Ran Gong et al.

Collaboration is a cornerstone of society. In the real world, human teammates make use of multi-sensory data to tackle challenging tasks in ever-changing environments. It is essential for embodied agents collaborating in visually-rich environments replete with dynamic interactions to understand multi-modal observations and task specifications. To evaluate the performance of generalizable multi-modal collaborative agents, we present TeamCraft, a multi-modal multi-agent benchmark built on top of the open-world video game Minecraft. The benchmark features 55,000 task variants specified by multi-modal prompts, procedurally-generated expert demonstrations for imitation learning, and carefully designed protocols to evaluate model generalization capabilities. We also perform extensive analyses to better understand the limitations and strengths of existing approaches. Our results indicate that existing models continue to face significant challenges in generalizing to novel goals, scenes, and unseen numbers of agents. These findings underscore the need for further research in this area. The TeamCraft platform and dataset are publicly available at https://github.com/teamcraft-bench/teamcraft.

AI · Jan 7, 2024
Agent AI: Surveying the Horizons of Multimodal Interaction

Zane Durante, Qiuyuan Huang, Naoki Wake et al. · stanford

Multi-modal AI systems will likely become a ubiquitous presence in our everyday lives. A promising approach to making these systems more interactive is to embody them as agents within physical and virtual environments. At present, systems leverage existing foundation models as the basic building blocks for the creation of embodied agents. Embedding agents within such environments facilitates the ability of models to process and interpret visual and contextual data, which is critical for the creation of more sophisticated and context-aware AI systems. For example, a system that can perceive user actions, human behavior, environmental objects, audio expressions, and the collective sentiment of a scene can be used to inform and direct agent responses within the given environment. To accelerate research on agent-based multimodal intelligence, we define "Agent AI" as a class of interactive systems that can perceive visual stimuli, language inputs, and other environmentally-grounded data, and can produce meaningful embodied actions. In particular, we explore systems that aim to improve agents based on next-embodied action prediction by incorporating external knowledge, multi-sensory inputs, and human feedback. We argue that by developing agentic AI systems in grounded environments, one can also mitigate the hallucinations of large foundation models and their tendency to generate environmentally incorrect outputs. The emerging field of Agent AI subsumes the broader embodied and agentic aspects of multimodal interactions. Beyond agents acting and interacting in the physical world, we envision a future where people can easily create any virtual reality or simulated scene and interact with agents embodied within the virtual environment.

IV · Aug 19, 2019 · Code
Deep Active Lesion Segmentation

Ali Hatamizadeh, Assaf Hoogi, Debleena Sengupta et al.

Lesion segmentation is an important problem in computer-assisted diagnosis that remains challenging due to the prevalence of low contrast, irregular boundaries that are unamenable to shape priors. We introduce Deep Active Lesion Segmentation (DALS), a fully automated segmentation framework that leverages the powerful nonlinear feature extraction abilities of fully Convolutional Neural Networks (CNNs) and the precise boundary delineation abilities of Active Contour Models (ACMs). Our DALS framework benefits from an improved level-set ACM formulation with a per-pixel-parameterized energy functional and a novel multiscale encoder-decoder CNN that learns an initialization probability map along with parameter maps for the ACM. We evaluate our lesion segmentation model on a new Multiorgan Lesion Segmentation (MLS) dataset that contains images of various organs, including brain, liver, and lung, across different imaging modalities (MR and CT). Our results demonstrate favorable performance compared to competing methods, especially for small training datasets. Source code: https://github.com/ahatamiz/dals

CV · Dec 16, 2024
Wonderland: Navigating 3D Scenes from a Single Image

Hanwen Liang, Junli Cao, Vidit Goel et al.

How can one efficiently generate high-quality, wide-scope 3D scenes from arbitrary single images? Existing methods suffer several drawbacks, such as requiring multi-view data, time-consuming per-scene optimization, distorted geometry in occluded areas, and low visual quality in backgrounds. Our novel 3D scene reconstruction pipeline overcomes these limitations to tackle the aforesaid challenge. Specifically, we introduce a large-scale reconstruction model that leverages latents from a video diffusion model to predict 3D Gaussian Splattings of scenes in a feed-forward manner. The video diffusion model is designed to create videos precisely following specified camera trajectories, allowing it to generate compressed video latents that encode multi-view information while maintaining 3D consistency. We train the 3D reconstruction model to operate on the video latent space with a progressive learning strategy, enabling the efficient generation of high-quality, wide-scope, and generic 3D scenes. Extensive evaluations across various datasets affirm that our model significantly outperforms existing single-view 3D scene generation methods, especially with out-of-domain images. Thus, we demonstrate for the first time that a 3D reconstruction model can effectively be built upon the latent space of a diffusion model in order to realize efficient 3D scene generation.

AI · Feb 8, 2024
An Interactive Agent Foundation Model

Zane Durante, Bidipta Sarkar, Ran Gong et al. · stanford

The development of artificial intelligence systems is transitioning from creating static, task-specific models to dynamic, agent-based systems capable of performing well in a wide range of applications. We propose an Interactive Agent Foundation Model that uses a novel multi-task agent training paradigm for training AI agents across a wide range of domains, datasets, and tasks. Our training paradigm unifies diverse pre-training strategies, including visual masked auto-encoders, language modeling, and next-action prediction, enabling a versatile and adaptable AI framework. We demonstrate the performance of our framework across three separate domains -- Robotics, Gaming AI, and Healthcare. Our model demonstrates its ability to generate meaningful and contextually relevant outputs in each area. The strength of our approach lies in its generality, leveraging a variety of data sources such as robotics sequences, gameplay data, large-scale video datasets, and textual information for effective multimodal and multi-task learning. Our approach provides a promising avenue for developing generalist, action-taking, multimodal systems.

86.4 · AI · Apr 27
Can Current Agents Close the Discovery-to-Application Gap? A Case Study in Minecraft

Zhou Ziheng, Huacong Tang, Jinyuan Zhang et al.

Discovering causal regularities and applying them to build functional systems--the discovery-to-application loop--is a hallmark of general intelligence, yet evaluating this capacity has been hindered by the vast complexity gap between scientific discovery and real-world engineering. We introduce SciCrafter, a Minecraft-based benchmark that operationalizes this loop through parameterized redstone circuit tasks. Agents must ignite lamps in specified patterns (e.g., simultaneously or in timed sequences); scaling target parameters substantially increases construction complexity and required knowledge, forcing genuine discovery rather than reliance on memorized solutions. Evaluating frontier models including GPT-5.2, Gemini-3-Pro, and Claude-Opus-4.5 under a general-purpose code agent scaffold, we find that all plateau at approximately 26% success rate. To diagnose these failures, we decompose the loop into four capacities--knowledge gap identification, experimental discovery, knowledge consolidation, and knowledge application--and design targeted interventions whose marginal contributions serve as proxies for corresponding gaps. Our analysis reveals that although the general knowledge application capability still remains as the biggest gap across all models, for frontier models the knowledge gap identification starts to become a major hurdle--indicating the bottleneck is shifting from solving problems right to raising the right problems for current AI. We release SciCrafter as a diagnostic probe for future research on AI systems that navigate the full discovery-to-application loop.

CV · Mar 1, 2025
CFSum: A Transformer-Based Multi-Modal Video Summarization Framework With Coarse-Fine Fusion

Yaowei Guo, Jiazheng Xing, Xiaojun Hou et al.

Video summarization, by selecting the most informative and/or user-relevant parts of original videos to create concise summary videos, has high research value and consumer demand in today's video proliferation era. Multi-modal video summarization that accommodates user input has become a research hotspot. However, current multi-modal video summarization methods suffer from two limitations. First, existing methods inadequately fuse information from different modalities and cannot effectively utilize modality-unique features. Second, most multi-modal methods focus on video and text modalities, neglecting the audio modality, despite the fact that audio information can be very useful in certain types of videos. In this paper, we propose CFSum, a transformer-based multi-modal video summarization framework with coarse-fine fusion. CFSum exploits video, text, and audio modal features as input, and incorporates a two-stage transformer-based feature fusion framework to fully utilize modality-unique information. In the first stage, multi-modal features are fused simultaneously to perform initial coarse-grained feature fusion; then, in the second stage, video and audio features are explicitly attended with the text representation, yielding more fine-grained information interaction. The CFSum architecture gives equal importance to each modality, ensuring that each modal feature interacts deeply with the other modalities. Our extensive comparative experiments against prior methods and ablation studies on various datasets confirm the effectiveness and superiority of CFSum.

AI · Oct 29, 2024
Inverse Attention Agents for Multi-Agent Systems

Qian Long, Ruoyan Li, Minglu Zhao et al.

A major challenge for Multi-Agent Systems is enabling agents to adapt dynamically to diverse environments in which opponents and teammates may continually change. Agents trained using conventional methods tend to excel only within the confines of their training cohorts; their performance drops significantly when confronting unfamiliar agents. To address this shortcoming, we introduce Inverse Attention Agents that adopt concepts from the Theory of Mind (ToM) implemented algorithmically using an attention mechanism trained in an end-to-end manner. Crucial to determining the final actions of these agents, the weights in their attention model explicitly represent attention to different goals. We furthermore propose an inverse attention network that deduces the ToM of agents based on observations and prior actions. The network infers the attentional states of other agents, thereby refining the attention weights to adjust the agent's final action. We conduct experiments in a continuous environment, tackling demanding tasks encompassing cooperation, competition, and a blend of both. They demonstrate that the inverse attention network successfully infers the attention of other agents, and that this information improves agent performance. Additional human experiments show that, compared to baseline agent models, our inverse attention agents exhibit superior cooperation with humans and better emulate human behaviors.

CL · Dec 9, 2023
Aligner: One Global Token is Worth Millions of Parameters When Aligning Large Language Models

Zhou Ziheng, Yingnian Wu, Song-Chun Zhu et al.

We introduce Aligner, a novel Parameter-Efficient Fine-Tuning (PEFT) method for aligning multi-billion-parameter-sized Large Language Models (LLMs). Aligner employs a unique design that constructs a globally shared set of tunable tokens that modify the attention of every layer. Remarkably, with this method, even when using one token accounting for a mere 5,000 parameters, Aligner can still perform comparably well to state-of-the-art LLM adaptation methods like LoRA that require millions of parameters. This capacity is substantiated in both instruction following and value alignment tasks. Besides the multiple order-of-magnitude improvement in parameter efficiency, the insight Aligner provides into the internal mechanisms of LLMs is also valuable. The architectural features and efficacy of our method, in addition to our experiments, demonstrate that an LLM separates its internal handling of "form" and "knowledge" in a somewhat orthogonal manner. This finding promises to motivate new research into LLM mechanism understanding and value alignment.
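To put "millions of parameters" next to the 5,000-parameter figure, the sketch below counts the trainable parameters LoRA adds under one configuration. The rank-r count of r × (d_in + d_out) per adapted matrix is the standard LoRA formula; the hidden size, rank, layer count, and number of adapted matrices are hypothetical values picked for illustration, not settings reported in the abstract.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return rank * (d_in + d_out)

# Hypothetical configuration, chosen only for illustration.
hidden, rank, layers, matrices_per_layer = 4096, 8, 32, 2
lora_total = lora_param_count(hidden, hidden, rank) * layers * matrices_per_layer

aligner_single_token = 5_000  # figure quoted in the abstract
ratio = lora_total / aligner_single_token
```

Under these assumed settings, `lora_total` is 4,194,304, which is more than 800 times the single-token figure.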

QM · Jul 25, 2025
Review of Deep Learning Applications to Structural Proteomics Enabled by Cryogenic Electron Microscopy and Tomography

Brady K. Zhou, Jason J. Hu, Jane K. J. Lee et al.

The past decade's "cryoEM revolution" has produced exponential growth in high-resolution structural data through advances in cryogenic electron microscopy (cryoEM) and tomography (cryoET). Deep learning integration into structural proteomics workflows addresses longstanding challenges including low signal-to-noise ratios, preferred orientation artifacts, and missing-wedge problems that historically limited efficiency and scalability. This review examines AI applications across the entire cryoEM pipeline, from automated particle picking using convolutional neural networks (Topaz, crYOLO, CryoSegNet) to computational solutions for preferred orientation bias (spIsoNet, cryoPROS) and advanced denoising algorithms (Topaz-Denoise). In cryoET, tools like IsoNet employ U-Net architectures for simultaneous missing-wedge correction and noise reduction, while TomoNet streamlines subtomogram averaging through AI-driven particle detection. The workflow culminates with automated atomic model building using sophisticated tools like ModelAngelo, DeepTracer, and CryoREAD that translate density maps into interpretable biological structures. These AI-enhanced approaches have achieved near-atomic resolution reconstructions with minimal manual intervention, resolved previously intractable datasets suffering from severe orientation bias, and enabled successful application to diverse biological systems from HIV virus-like particles to in situ ribosomal complexes. As deep learning evolves, particularly with large language models and vision transformers, the future promises sophisticated automation and accessibility in structural biology, potentially revolutionizing our understanding of macromolecular architecture and function.

AI · Feb 28, 2024
Position Paper: Agent AI Towards a Holistic Intelligence

Qiuyuan Huang, Naoki Wake, Bidipta Sarkar et al.

Recent advancements in large foundation models have remarkably enhanced our understanding of sensory information in open-world environments. In leveraging the power of foundation models, it is crucial for AI research to pivot away from excessive reductionism and toward an emphasis on systems that function as cohesive wholes. Specifically, we emphasize developing Agent AI -- an embodied system that integrates large foundation models into agent actions. The emerging field of Agent AI spans a wide range of existing embodied and agent-based multimodal interactions, including robotics, gaming, and healthcare systems, etc. In this paper, we propose a novel large action model to achieve embodied intelligent behavior, the Agent Foundation Model. On top of this idea, we discuss how agent AI exhibits remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. Furthermore, we discuss the potential of Agent AI from an interdisciplinary perspective, underscoring AI cognition and consciousness within scientific discourse. We believe that those discussions serve as a basis for future research directions and encourage broader societal engagement.

GRNov 12, 2021
Neuromuscular Control of the Face-Head-Neck Biomechanical Complex With Learning-Based Expression Transfer From Images and Videos

Xiao S. Zeng, Surya Dwarakanath, Wuyue Lu et al.

The transfer of facial expressions from people to 3D face models is a classic computer graphics problem. In this paper, we present a novel, learning-based approach to transferring facial expressions and head movements from images and videos to a biomechanical model of the face-head-neck complex. Leveraging the Facial Action Coding System (FACS) as an intermediate representation of the expression space, we train a deep neural network to take in FACS Action Units (AUs) and output suitable facial muscle and jaw activation signals for the musculoskeletal model. Through biomechanical simulation, the activations deform the facial soft tissues, thereby transferring the expression to the model. Our approach has advantages over previous approaches. First, the facial expressions are anatomically consistent as our biomechanical model emulates the relevant anatomy of the face, head, and neck. Second, by training the neural network using data generated from the biomechanical model itself, we eliminate the manual effort of data collection for expression transfer. The success of our approach is demonstrated through experiments involving the transfer onto our face-head-neck model of facial expressions and head poses from a range of facial images and videos.

CVOct 25, 2021
Generalized Multi-Task Learning from Substantially Unlabeled Multi-Source Medical Image Data

Ayaan Haque, Abdullah-Al-Zubaer Imran, Adam Wang et al.

Deep learning-based models, when trained in a fully-supervised manner, can be effective in performing complex image analysis tasks, although contingent upon the availability of large labeled datasets. Especially in the medical imaging domain, however, expert image annotation is expensive, time-consuming, and prone to variability. Semi-supervised learning from limited quantities of labeled data has shown promise as an alternative. Maximizing knowledge gains from copious unlabeled data benefits semi-supervised learning models. Moreover, learning multiple tasks within the same model further improves its generalizability. We propose MultiMix, a new multi-task learning model that jointly learns disease classification and anatomical segmentation in a semi-supervised manner, while preserving explainability through a novel saliency bridge between the two tasks. Our experiments with varying quantities of multi-source labeled data in the training sets confirm the effectiveness of MultiMix in the simultaneous classification of pneumonia and segmentation of the lungs in chest X-ray images. Moreover, both in-domain and cross-domain evaluations across these tasks further showcase the potential of our model to adapt to challenging generalization scenarios.

CVMar 18, 2021
A Location-Sensitive Local Prototype Network for Few-Shot Medical Image Segmentation

Qinji Yu, Kang Dang, Nima Tajbakhsh et al.

Despite the tremendous success of deep neural networks in medical image segmentation, they typically require a large amount of costly, expert-level annotated data. Few-shot segmentation approaches address this issue by learning to transfer knowledge from limited quantities of labeled examples. Incorporating appropriate prior knowledge is critical in designing high-performance few-shot segmentation algorithms. Since strong spatial priors exist in many medical imaging modalities, we propose a prototype-based method -- namely, the location-sensitive local prototype network -- that leverages spatial priors to perform few-shot medical image segmentation. Our approach divides the difficult problem of segmenting the entire image with global prototypes into easily solvable subproblems of local region segmentation with local prototypes. For organ segmentation experiments on the VISCERAL CT image dataset, our method outperforms the state-of-the-art approaches by 10% in the mean Dice coefficient. Extensive ablation studies demonstrate the substantial benefits of incorporating spatial information and confirm the effectiveness of our approach.

CVOct 28, 2020
MultiMix: Sparingly Supervised, Extreme Multitask Learning From Medical Images

Ayaan Haque, Abdullah-Al-Zubaer Imran, Adam Wang et al.

Semi-supervised learning from limited quantities of labeled data has been investigated as an alternative to supervised counterparts. Maximizing knowledge gains from copious unlabeled data benefits semi-supervised learning settings. Moreover, learning multiple tasks within the same model further improves model generalizability. We propose a novel multitask learning model, namely MultiMix, which jointly learns disease classification and anatomical segmentation in a sparingly supervised manner, while preserving explainability through bridge saliency between the two tasks. Our extensive experimentation with varied quantities of labeled data in the training sets justifies the effectiveness of our multitasking model for the classification of pneumonia and segmentation of lungs from chest X-ray images. Moreover, both in-domain and cross-domain evaluations across the tasks further showcase the potential of our model to adapt to challenging generalization scenarios.

CVJul 22, 2020
End-to-End Trainable Deep Active Contour Models for Automated Image Segmentation: Delineating Buildings in Aerial Imagery

Ali Hatamizadeh, Debleena Sengupta, Demetri Terzopoulos

The automated segmentation of buildings in remote sensing imagery is a challenging task that requires the accurate delineation of multiple building instances over typically large image areas. Manual methods are often laborious, and current deep-learning-based approaches often fail to delineate all building instances with adequate accuracy. As a solution, we present Trainable Deep Active Contours (TDACs), an automatic image segmentation framework that intimately unites Convolutional Neural Networks (CNNs) and Active Contour Models (ACMs). The Eulerian energy functional of the ACM component includes per-pixel parameter maps that are predicted by the backbone CNN, which also initializes the ACM. Importantly, both the ACM and CNN components are fully implemented in TensorFlow and the entire TDAC architecture is end-to-end automatically differentiable and backpropagation trainable without user intervention. TDAC yields fast, accurate, and fully automatic simultaneous delineation of arbitrarily many buildings in the image. We validate the model on two publicly available aerial image datasets for building segmentation, and our results demonstrate that TDAC establishes a new state-of-the-art performance.

IVMay 28, 2020
Bipartite Distance for Shape-Aware Landmark Detection in Spinal X-Ray Images

Abdullah-Al-Zubaer Imran, Chao Huang, Hui Tang et al.

Scoliosis is a congenital disease that causes lateral curvature in the spine. Its assessment relies on the identification and localization of vertebrae in spinal X-ray images, conventionally via tedious and time-consuming manual radiographic procedures that are prone to subjectivity and observational variability. Reliability can be improved through the automatic detection and localization of spinal landmarks. To guide a CNN in the learning of spinal shape while detecting landmarks in X-ray images, we propose a novel loss based on a bipartite distance (BPD) measure, and show that it consistently improves landmark detection performance.

IVMay 8, 2020
Progressive Adversarial Semantic Segmentation

Abdullah-Al-Zubaer Imran, Demetri Terzopoulos

Medical image computing has advanced rapidly with the advent of deep learning techniques such as convolutional neural networks. Deep convolutional neural networks can perform exceedingly well given full supervision. However, the success of such fully-supervised models for various image analysis tasks (e.g., anatomy or lesion segmentation from medical images) is contingent on the availability of massive amounts of labeled data. Given small sample sizes, such models are prohibitively biased toward their training data and suffer from large domain shift. To tackle this problem, we propose a novel end-to-end medical image segmentation model, namely Progressive Adversarial Semantic Segmentation (PASS), which can make improved segmentation predictions without requiring any domain-specific data during training time. Our extensive experimentation with 8 public diabetic retinopathy and chest X-ray datasets confirms the effectiveness of PASS for accurate vascular and pulmonary segmentation, both for in-domain and cross-domain evaluations.

CVMay 5, 2020
Partly Supervised Multitask Learning

Abdullah-Al-Zubaer Imran, Chao Huang, Hui Tang et al.

Semi-supervised learning has recently been attracting attention as an alternative to fully supervised models that require large pools of labeled data. Moreover, optimizing a model for multiple tasks can provide better generalizability than single-task learning. Leveraging self-supervision and adversarial training, we propose a novel general-purpose semi-supervised, multiple-task model, namely self-supervised, semi-supervised, multitask learning (S$^4$MTL), for accomplishing two important tasks in medical imaging, segmentation and diagnostic classification. Experimental results on chest and spine X-ray datasets suggest that our S$^4$MTL model significantly outperforms semi-supervised single task, semi/fully-supervised multitask, and fully-supervised single task models, even with a 50% reduction of class and segmentation labels. We hypothesize that our proposed model can be effective in tackling limited annotation problems for joint training, not only in medical imaging domains, but also for general-purpose vision tasks.

IVApr 15, 2020
Analysis of Scoliosis From Spinal X-Ray Images

Abdullah-Al-Zubaer Imran, Chao Huang, Hui Tang et al.

Scoliosis is a congenital disease in which the spine is deformed from its normal shape. Measurement of scoliosis requires labeling and identification of vertebrae in the spine. Spine radiographs are the most cost-effective and accessible modality for imaging the spine. Reliable and accurate vertebrae segmentation in spine radiographs is crucial in image-guided spinal assessment, disease diagnosis, and treatment planning. Conventional assessments rely on tedious and time-consuming manual measurement, which is subject to inter-observer variability. A fully automatic method that can accurately identify and segment the associated vertebrae is unavailable in the literature. Leveraging a carefully-adjusted U-Net model with progressive side outputs, we propose an end-to-end segmentation model that provides a fully automatic and reliable segmentation of the vertebrae associated with scoliosis measurement. Our experimental results from a set of anterior-posterior spine X-Ray images indicate that our model, which achieves an average Dice score of 0.993, promises to be an effective tool in the identification and labeling of spinal vertebrae, eventually helping doctors in the reliable estimation of scoliosis. Moreover, estimation of Cobb angles from the segmented vertebrae further demonstrates the effectiveness of our model.

IVFeb 11, 2020
Edge-Gated CNNs for Volumetric Semantic Segmentation of Medical Images

Ali Hatamizadeh, Demetri Terzopoulos, Andriy Myronenko

Textures and edges contribute different information to image recognition. Edges and boundaries encode shape information, while textures manifest the appearance of regions. Despite the success of Convolutional Neural Networks (CNNs) in computer vision and medical image analysis applications, predominantly only texture abstractions are learned, which often leads to imprecise boundary delineations. In medical imaging, expert manual segmentation often relies on organ boundaries; for example, to manually segment a liver, a medical practitioner usually identifies edges first and subsequently fills in the segmentation mask. Motivated by these observations, we propose a plug-and-play module, dubbed Edge-Gated CNNs (EG-CNNs), that can be used with existing encoder-decoder architectures to process both edge and texture information. The EG-CNN learns to emphasize the edges in the encoder, to predict crisp boundaries by an auxiliary edge supervision, and to fuse its output with the original CNN output. We evaluate the effectiveness of the EG-CNN with various mainstream CNNs on two publicly available datasets, BraTS 19 and KiTS 19 for brain tumor and kidney semantic segmentation. We demonstrate how the addition of EG-CNN consistently improves segmentation accuracy and generalization performance.

CVJan 15, 2020
Image Segmentation Using Deep Learning: A Survey

Shervin Minaee, Yuri Boykov, Fatih Porikli et al.

Image segmentation is a key topic in image processing and computer vision with applications such as scene understanding, medical image analysis, robotic perception, video surveillance, augmented reality, and image compression, among many others. Various algorithms for image segmentation have been developed in the literature. Recently, due to the success of deep learning models in a wide range of vision applications, there has been a substantial number of works aimed at developing image segmentation approaches using deep learning models. In this survey, we provide a comprehensive review of the literature at the time of this writing, covering a broad spectrum of pioneering works for semantic and instance-level segmentation, including fully convolutional pixel-labeling networks, encoder-decoder architectures, multi-scale and pyramid based approaches, recurrent networks, visual attention models, and generative models in adversarial settings. We investigate the similarities, strengths, and challenges of these deep learning models, examine the most widely used datasets, report performances, and discuss promising future research directions in this area.

CVSep 29, 2019
End-to-End Deep Convolutional Active Contours for Image Segmentation

Ali Hatamizadeh, Debleena Sengupta, Demetri Terzopoulos

The Active Contour Model (ACM) is a standard image analysis technique whose numerous variants have attracted an enormous amount of research attention across multiple fields. Incorrectly, however, the ACM's differential-equation-based formulation and prototypical dependence on user initialization have been regarded as being largely incompatible with the recently popular deep learning approaches to image segmentation. This paper introduces the first tight unification of these two paradigms. In particular, we devise Deep Convolutional Active Contours (DCAC), a truly end-to-end trainable image segmentation framework comprising a Convolutional Neural Network (CNN) and an ACM with learnable parameters. The ACM's Eulerian energy functional includes per-pixel parameter maps predicted by the backbone CNN, which also initializes the ACM. Importantly, both the CNN and ACM components are fully implemented in TensorFlow, and the entire DCAC architecture is end-to-end automatically differentiable and backpropagation trainable without user intervention. As a challenging test case, we tackle the problem of building instance segmentation in aerial images and evaluate DCAC on two publicly available datasets, Vaihingen and Bing Huts. Our results demonstrate that, for building segmentation, the DCAC establishes a new state-of-the-art performance by a wide margin.

CVAug 21, 2019
End-to-End Boundary Aware Networks for Medical Image Segmentation

Ali Hatamizadeh, Demetri Terzopoulos, Andriy Myronenko

Fully convolutional neural networks (CNNs) have proven to be effective at representing and classifying textural information, thus transforming image intensity into output class masks that achieve semantic image segmentation. In medical image analysis, however, expert manual segmentation often relies on the boundaries of anatomical structures of interest. We propose boundary aware CNNs for medical image segmentation. Our networks are designed to account for organ boundary information, both by providing a special network edge branch and edge-aware loss terms, and they are trainable end-to-end. We validate their effectiveness on the task of brain tumor segmentation using the BraTS 2018 dataset. Our experiments reveal that our approach yields more accurate segmentation results, which makes it promising for more extensive application to medical image segmentation.

IVAug 10, 2019
Semi-Supervised Multi-Task Learning With Chest X-Ray Images

Abdullah-Al-Zubaer Imran, Demetri Terzopoulos

Discriminative models that require full supervision are inefficacious in the medical imaging domain when large labeled datasets are unavailable. By contrast, generative modeling---i.e., learning data generation and classification---facilitates semi-supervised training with limited labeled data. Moreover, generative modeling can be advantageous in accomplishing multiple objectives for better generalization. We propose a novel multi-task learning model for jointly learning a classifier and a segmentor, from chest X-ray images, through semi-supervised learning. In addition, we propose a new loss function that combines absolute KL divergence with Tversky loss (KLTV) to yield faster convergence and better segmentation performance. Based on our experimental results using a novel segmentation model, an Adversarial Pyramid Progressive Attention U-Net (APPAU-Net), we hypothesize that KLTV can be more effective for generalizing multi-tasking models while being competitive in segmentation-only tasks.
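
The Tversky loss mentioned above can be illustrated with a minimal NumPy sketch. It covers only the standard Tversky index, TP / (TP + alpha * FP + beta * FN), and does not reproduce the KL-divergence term or the way the authors combine the two in KLTV, which the abstract does not specify. The function name `tversky_loss` and the default weights are illustrative assumptions.

```python
import numpy as np

def tversky_loss(prob, target, alpha=0.5, beta=0.5, eps=1e-7):
    """Return 1 - Tversky index for soft predictions against a binary mask.

    alpha weights false positives and beta weights false negatives.
    With alpha = beta = 0.5 the index reduces to the Dice coefficient.
    All names and defaults here are assumptions made for illustration.
    """
    prob = np.asarray(prob, dtype=float).ravel()
    target = np.asarray(target, dtype=float).ravel()
    tp = np.sum(prob * target)            # soft true positives
    fp = np.sum(prob * (1.0 - target))    # soft false positives
    fn = np.sum((1.0 - prob) * target)    # soft false negatives
    index = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return 1.0 - index
```

Raising beta above alpha penalizes missed foreground pixels more heavily than spurious ones, which is a common reason for choosing a Tversky loss when the target structure occupies a small fraction of the image.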

LGJun 14, 2019
Multi-Adversarial Variational Autoencoder Networks

Abdullah-Al-Zubaer Imran, Demetri Terzopoulos

The unsupervised training of GANs and VAEs has enabled them to generate realistic images mimicking real-world distributions and perform image-based unsupervised clustering or semi-supervised classification. Combining the power of these two generative models, we introduce Multi-Adversarial Variational autoEncoder Networks (MAVENs), a novel network architecture that incorporates an ensemble of discriminators in a VAE-GAN network, with simultaneous adversarial learning and variational inference. We apply MAVENs to the generation of synthetic images and propose a new distribution measure to quantify the quality of the generated images. Our experimental results using datasets from the computer vision and medical imaging domains---Street View House Numbers, CIFAR-10, and Chest X-Ray datasets---demonstrate competitive performance against state-of-the-art semi-supervised models both in image generation and classification tasks.

IVMay 28, 2019
Deep Dilated Convolutional Nets for the Automatic Segmentation of Retinal Vessels

Ali Hatamizadeh, Hamid Hosseini, Zhengyuan Liu et al.

The reliable segmentation of retinal vasculature can provide the means to diagnose and monitor the progression of a variety of diseases affecting the blood vessel network, including diabetes and hypertension. We leverage the power of convolutional neural networks to devise a reliable and fully automated method that can accurately detect, segment, and analyze retinal vessels. In particular, we propose a novel, fully convolutional deep neural network with an encoder-decoder architecture that employs dilated spatial pyramid pooling with multiple dilation rates to recover the lost content in the encoder and add multiscale contextual information to the decoder. We also propose a simple yet effective way of quantifying and tracking the widths of retinal vessels through direct use of the segmentation predictions. Unlike previous deep-learning-based approaches to retinal vessel segmentation that mainly rely on patch-wise analysis, our proposed method leverages a whole-image approach during training and inference, resulting in more efficient training and faster inference through the access of global content in the image. We have tested our method on two publicly available datasets, and our state-of-the-art results on both the DRIVE and CHASE-DB1 datasets attest to the effectiveness of our approach.
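
To make the idea of multiple dilation rates concrete, the following sketch implements a one-dimensional dilated convolution (cross-correlation, as is conventional in deep learning) in NumPy. It is a toy illustration under simplifying assumptions (1D signal, no padding, a hypothetical helper named `dilated_conv1d`), not the network described in the abstract.

```python
import numpy as np

def dilated_conv1d(x, w, dilation=1):
    """Valid 1D dilated cross-correlation: out[i] = sum_k w[k] * x[i + k * dilation]."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    # A kernel of K taps with dilation d covers (K - 1) * d + 1 input samples.
    span = (len(w) - 1) * dilation + 1
    out_len = len(x) - span + 1
    if out_len <= 0:
        raise ValueError("input is shorter than the dilated kernel span")
    out = np.zeros(out_len)
    for k, wk in enumerate(w):
        start = k * dilation
        out += wk * x[start:start + out_len]
    return out

# The same 3-tap kernel applied at several rates sees progressively wider context.
signal = np.arange(10)
kernel = [1.0, 1.0, 1.0]
multi_scale = {d: dilated_conv1d(signal, kernel, dilation=d) for d in (1, 2, 4)}
```

The number of weights stays fixed while the covered span grows with the dilation rate, which is why pooling the responses at several rates adds multiscale context at little parameter cost.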

CVFeb 18, 2019
Automatic Segmentation of Pulmonary Lobes Using a Progressive Dense V-Network

Abdullah-Al-Zubaer Imran, Ali Hatamizadeh, Shilpa P. Ananth et al.

Reliable and automatic segmentation of lung lobes is important for diagnosis, assessment, and quantification of pulmonary diseases. The existing techniques are prohibitively slow, undesirably rely on prior (airway/vessel) segmentation, and/or require user interactions for optimal results. This work presents a reliable, fast, and fully automated lung lobe segmentation based on a progressive dense V-network (PDV-Net). The proposed method can segment lung lobes in one forward pass of the network, with an average runtime of 2 seconds on a single Nvidia Titan XP GPU, eliminating the need for any prior atlases, lung segmentation, or subsequent user intervention. We evaluated our model using 84 chest CT scans from the LIDC and 154 pathological cases from the LTRC datasets. Our model achieved a Dice score of $0.939 \pm 0.02$ for the LIDC test set and $0.950 \pm 0.01$ for the LTRC test set, significantly outperforming a 2D U-net model and a 3D dense V-net. We further evaluated our model against 55 cases from the LOLA11 challenge, obtaining an average Dice score of 0.935, a performance level competitive with that of the best-performing team, which had an average score of 0.938. Our extensive robustness analyses also demonstrate that our model can reliably segment both healthy and pathological lung lobes in CT scans from different vendors, and that our model is robust against configurations of CT scan reconstruction.
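
For readers unfamiliar with the Dice score reported above, here is a minimal NumPy sketch of the standard definition, 2|A ∩ B| / (|A| + |B|), plus a per-label average for multi-class label maps such as lung lobes. The function names and the convention of scoring two empty masks as 1.0 are assumptions for illustration; the authors' exact evaluation protocol is not given in the abstract.

```python
import numpy as np

def dice_score(pred, target):
    """Dice coefficient between two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    total = pred.sum() + target.sum()
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement (a convention)
    intersection = np.logical_and(pred, target).sum()
    return 2.0 * intersection / total

def mean_dice(pred_labels, target_labels, labels):
    """Average the binary Dice score over the given integer labels."""
    pred_labels = np.asarray(pred_labels)
    target_labels = np.asarray(target_labels)
    scores = [dice_score(pred_labels == c, target_labels == c) for c in labels]
    return float(np.mean(scores))
```

A score of 1.0 means the predicted and reference masks coincide exactly, and 0.0 means they do not overlap at all.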

CVJan 25, 2019
Surrogate Supervision for Medical Image Analysis: Effective Deep Learning From Limited Quantities of Labeled Data

Nima Tajbakhsh, Yufei Hu, Junli Cao et al.

We investigate the effectiveness of a simple solution to the common problem of deep learning in medical image analysis with limited quantities of labeled training data. The underlying idea is to assign artificial labels to abundantly available unlabeled medical images and, through a process known as surrogate supervision, pre-train a deep neural network model for the target medical image analysis task lacking sufficient labeled training data. In particular, we employ 3 surrogate supervision schemes, namely rotation, reconstruction, and colorization, in 4 different medical imaging applications representing classification and segmentation for both 2D and 3D medical images. 3 key findings emerge from our research: 1) pre-training with surrogate supervision is effective for small training sets; 2) deep models trained from initial weights pre-trained through surrogate supervision outperform the same models when trained from scratch, suggesting that pre-training with surrogate supervision should be considered prior to training any deep 3D models; 3) pre-training models in the medical domain with surrogate supervision is more effective than transfer learning from an unrelated domain (e.g., natural images), indicating the practical value of abundant unlabeled medical image data.
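
The rotation scheme named above can be sketched as follows: each unlabeled image is rotated by a random multiple of 90 degrees, and the rotation index serves as an artificial label for pre-training. This is a minimal illustration that assumes square 2D images stacked in one array; the helper name `make_rotation_task` is hypothetical and the authors' actual pipeline may differ.

```python
import numpy as np

def make_rotation_task(images, rng=None):
    """Build a surrogate classification task from unlabeled square images.

    Returns the rotated images and integer labels in {0, 1, 2, 3}, where
    label k means the image was rotated by k * 90 degrees counterclockwise.
    """
    rng = np.random.default_rng() if rng is None else rng
    images = np.asarray(images)
    labels = rng.integers(0, 4, size=len(images))
    # np.rot90 rotates in the plane of the first two axes of each image.
    rotated = np.stack([np.rot90(img, k=int(k)) for img, k in zip(images, labels)])
    return rotated, labels
```

A network pre-trained to predict these labels can then be fine-tuned on the small labeled set for the target task, which is the general pattern the abstract describes.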

CVOct 14, 2018
Learning to Sketch with Deep Q Networks and Demonstrated Strokes

Tao Zhou, Chen Fang, Zhaowen Wang et al.

Doodling is a useful and common intelligent skill that people can learn and master. In this work, we propose a two-stage learning framework to teach a machine to doodle in a simulated painting environment via Stroke Demonstration and deep Q-learning (SDQ). The developed system, Doodle-SDQ, generates a sequence of pen actions to reproduce a reference drawing and mimics the behavior of human painters. In the first stage, it learns to draw simple strokes by imitating in supervised fashion from a set of stroke-action pairs collected from artist paintings. In the second stage, it is challenged to draw real and more complex doodles without ground truth actions; thus, it is trained with Q-learning. Our experiments confirm that (1) doodling can be learned without direct step-by-step action supervision and (2) pretraining with stroke demonstration via supervised learning is important to improve performance. We further show that Doodle-SDQ is effective at producing plausible drawings in different media types, including sketch and watercolor.

CVMay 24, 2017
Attention-based Natural Language Person Retrieval

Tao Zhou, Muhao Chen, Jie Yu et al.

Following the recent progress in image classification and captioning using deep learning, we develop a novel natural language person retrieval system based on an attention mechanism. More specifically, given the description of a person, the goal is to localize the person in an image. To this end, we first construct a benchmark dataset for natural language person retrieval. To do so, we generate bounding boxes for persons in a public image dataset from the segmentation masks, which are then annotated with descriptions and attributes using Amazon Mechanical Turk. We then adopt a region proposal network in Faster R-CNN as a candidate region generator. The cropped images based on the region proposals as well as the whole images with attention weights are fed into Convolutional Neural Networks for visual feature extraction, while the natural language expression and attributes are input to Bidirectional Long Short-Term Memory (BLSTM) models for text feature extraction. The visual and text features are integrated to score region proposals, and the one with the highest score is retrieved as the output of our system. The experimental results show significant improvement over the state-of-the-art method for generic object retrieval and this line of research promises to benefit search in surveillance video footage.

CVApr 1, 2017
Configurable 3D Scene Synthesis and 2D Image Rendering with Per-Pixel Ground Truth using Stochastic Grammars

Chenfanfu Jiang, Siyuan Qi, Yixin Zhu et al.

We propose a systematic learning-based approach to the generation of massive quantities of synthetic 3D scenes and arbitrary numbers of photorealistic 2D images thereof, with associated ground truth information, for the purposes of training, benchmarking, and diagnosing learning-based computer vision and robotics algorithms. In particular, we devise a learning-based pipeline of algorithms capable of automatically generating and rendering a potentially infinite variety of indoor scenes by using a stochastic grammar, represented as an attributed Spatial And-Or Graph, in conjunction with state-of-the-art physics-based rendering. Our pipeline is capable of synthesizing scene layouts with high diversity, and it is configurable inasmuch as it enables the precise customization and control of important attributes of the generated scenes. It renders photorealistic RGB images of the generated scenes while automatically synthesizing detailed, per-pixel ground truth data, including visible surface depth and normal, object identity, and material information (detailed to object parts), as well as environments (e.g., illuminations and camera viewpoints). We demonstrate the value of our synthesized dataset, by improving performance in certain machine-learning-based scene understanding tasks--depth and surface normal prediction, semantic segmentation, reconstruction, etc.--and by providing benchmarks for and diagnostics of trained models by modifying object attributes and scene properties in a controllable manner.