Naji Khosravan

h-index28

21papers

551citations

Novelty53%

AI Score45

Ranked #69,192 of 205,806 authors (top 34%)#24,390 in CV (top 41%)

21 Papers

CVApr 26, 2023

Graph-CoVis: GNN-based Multi-view Panorama Global Pose Estimation

Negar Nejatishahidin, Will Hutchcroft, Manjunath Narayana et al. · uw

In this paper, we address the problem of wide-baseline camera pose estimation from a group of 360$^\circ$ panoramas under upright-camera assumption. Recent work has demonstrated the merit of deep-learning for end-to-end direct relative pose regression in 360$^\circ$ panorama pairs [11]. To exploit the benefits of multi-view logic in a learning-based framework, we introduce Graph-CoVis, which non-trivially extends CoVisPose [11] from relative two-view to global multi-view spherical camera pose estimation. Graph-CoVis is a novel Graph Neural Network based architecture that jointly learns the co-visible structure and global motion in an end-to-end and fully-supervised approach. Using the ZInD [4] dataset, which features real homes presenting wide-baselines, occlusion, and limited visual overlap, we show that our model performs competitively to state-of-the-art approaches.

CVApr 1, 2022

LASER: LAtent SpacE Rendering for 2D Visual Localization

Zhixiang Min, Naji Khosravan, Zachary Bessinger et al.

We present LASER, an image-based Monte Carlo Localization (MCL) framework for 2D floor maps. LASER introduces the concept of latent space rendering, where 2D pose hypotheses on the floor map are directly rendered into a geometrically-structured latent space by aggregating viewing ray features. Through a tightly coupled rendering codebook scheme, the viewing ray features are dynamically determined at rendering-time based on their geometries (i.e. length, incident-angle), endowing our representation with view-dependent fine-grain variability. Our codebook scheme effectively disentangles feature encoding from rendering, allowing the latent space rendering to run at speeds above 10KHz. Moreover, through metric learning, our geometrically-structured latent space is common to both pose hypotheses and query images with arbitrary field of views. As a result, LASER achieves state-of-the-art performance on large-scale indoor localization datasets (i.e. ZInD and Structured3D) for both panorama and perspective image queries, while significantly outperforming existing learning-based methods in speed.

LGAug 4, 2022

Embedding Alignment for Unsupervised Federated Learning via Smart Data Exchange

Satyavrat Wagle, Seyyedali Hosseinalipour, Naji Khosravan et al.

Federated learning (FL) has been recognized as one of the most promising solutions for distributed machine learning (ML). In most of the current literature, FL has been studied for supervised ML tasks, in which edge devices collect labeled data. Nevertheless, in many applications, it is impractical to assume existence of labeled data across devices. To this end, we develop a novel methodology, Cooperative Federated unsupervised Contrastive Learning (CF-CL), for FL across edge devices with unlabeled datasets. CF-CL employs local device cooperation where data are exchanged among devices through device-to-device (D2D) communications to avoid local model bias resulting from non-independent and identically distributed (non-i.i.d.) local datasets. CF-CL introduces a push-pull smart data sharing mechanism tailored to unsupervised FL settings, in which, each device pushes a subset of its local datapoints to its neighbors as reserved data points, and pulls a set of datapoints from its neighbors, sampled through a probabilistic importance sampling technique. We demonstrate that CF-CL leads to (i) alignment of unsupervised learned latent spaces across devices, (ii) faster global convergence, allowing for less frequent global model aggregations; and (iii) is effective in extreme non-i.i.d. data settings across the devices.

CVAug 29, 2023

iBARLE: imBalance-Aware Room Layout Estimation

Taotao Jing, Lichen Wang, Naji Khosravan et al.

Room layout estimation predicts layouts from a single panorama. It requires datasets with large-scale and diverse room shapes to train the models. However, there are significant imbalances in real-world datasets including the dimensions of layout complexity, camera locations, and variation in scene appearance. These issues considerably influence the model training performance. In this work, we propose the imBalance-Aware Room Layout Estimation (iBARLE) framework to address these issues. iBARLE consists of (1) Appearance Variation Generation (AVG) module, which promotes visual appearance domain generalization, (2) Complex Structure Mix-up (CSMix) module, which enhances generalizability w.r.t. room structure, and (3) a gradient-based layout objective function, which allows more effective accounting for occlusions in complex layouts. All modules are jointly trained and help each other to achieve the best performance. Experiments and ablation studies based on ZInD~\cite{cruz2021zillow} dataset illustrate that iBARLE has state-of-the-art performance compared with other layout estimation baselines.

AISep 5, 2025

Multi-Modal Multi-Task (M3T) Federated Foundation Models for Embodied AI: Potentials and Challenges for Edge Integration

Kasra Borazjani, Payam Abdisarabshali, Fardis Nadimi et al.

As embodied AI systems become increasingly multi-modal, personalized, and interactive, they must learn effectively from diverse sensory inputs, adapt continually to user preferences, and operate safely under resource and privacy constraints. These challenges expose a pressing need for machine learning models capable of swift, context-aware adaptation while balancing model generalization and personalization. Here, two methods emerge as suitable candidates, each offering parts of these capabilities: multi-modal multi-task foundation models (M3T-FMs) provide a pathway toward generalization across tasks and modalities, whereas federated learning (FL) offers the infrastructure for distributed, privacy-preserving model updates and user-level model personalization. However, when used in isolation, each of these approaches falls short of meeting the complex and diverse capability requirements of real-world embodied AI environments. In this vision paper, we introduce multi-modal multi-task federated foundation models (M3T-FFMs) for embodied AI, a new paradigm that unifies the strengths of M3T-FMs with the privacy-preserving distributed training nature of FL, enabling intelligent systems at the wireless edge. We collect critical deployment dimensions of M3T-FFMs in embodied AI ecosystems under a unified framework, which we name "EMBODY": Embodiment heterogeneity, Modality richness and imbalance, Bandwidth and compute constraints, On-device continual learning, Distributed control and autonomy, and Yielding safety, privacy, and personalization. For each, we identify concrete challenges and envision actionable research directions. We also present an evaluation framework for deploying M3T-FFMs in embodied AI systems, along with the associated trade-offs. Finally, we present a prototype implementation of M3T-FFMs and evaluate their energy and latency performance.

LGSep 3, 2025Code

Hierarchical Federated Foundation Models over Wireless Networks for Multi-Modal Multi-Task Intelligence: Integration of Edge Learning with D2D/P2P-Enabled Fog Learning Architectures

Payam Abdisarabshali, Fardis Nadimi, Kasra Borazjani et al.

The rise of foundation models (FMs) has reshaped the landscape of machine learning. As these models continued to grow, leveraging geo-distributed data from wireless devices has become increasingly critical, giving rise to federated foundation models (FFMs). More recently, FMs have evolved into multi-modal multi-task (M3T) FMs (e.g., GPT-4) capable of processing diverse modalities across multiple tasks, which motivates a new underexplored paradigm: M3T FFMs. In this paper, we unveil an unexplored variation of M3T FFMs by proposing hierarchical federated foundation models (HF-FMs), which in turn expose two overlooked heterogeneity dimensions to fog/edge networks that have a direct impact on these emerging models: (i) heterogeneity in collected modalities and (ii) heterogeneity in executed tasks across fog/edge nodes. HF-FMs strategically align the modular structure of M3T FMs, comprising modality encoders, prompts, mixture-of-experts (MoEs), adapters, and task heads, with the hierarchical nature of fog/edge infrastructures. Moreover, HF-FMs enable the optional usage of device-to-device (D2D) communications, enabling horizontal module relaying and localized cooperative training among nodes when feasible. Through delving into the architectural design of HF-FMs, we highlight their unique capabilities along with a series of tailored future research directions. Finally, to demonstrate their potential, we prototype HF-FMs in a wireless network setting and release the open-source code for the development of HF-FMs with the goal of fostering exploration in this untapped field (GitHub: https://github.com/payamsiabd/M3T-FFM).

CVFeb 14, 2025Code

ReStyle3D: Scene-Level Appearance Transfer with Semantic Correspondences

Liyuan Zhu, Shengqu Cai, Shengyu Huang et al. · stanford

We introduce ReStyle3D, a novel framework for scene-level appearance transfer from a single style image to a real-world scene represented by multiple views. The method combines explicit semantic correspondences with multi-view consistency to achieve precise and coherent stylization. Unlike conventional stylization methods that apply a reference style globally, ReStyle3D uses open-vocabulary segmentation to establish dense, instance-level correspondences between the style and real-world images. This ensures that each object is stylized with semantically matched textures. It first transfers the style to a single view using a training-free semantic-attention mechanism in a diffusion model. It then lifts the stylization to additional views via a learned warp-and-refine network guided by monocular depth and pixel-wise correspondences. Experiments show that ReStyle3D consistently outperforms prior methods in structure preservation, perceptual style similarity, and multi-view coherence. User studies further validate its ability to produce photo-realistic, semantically faithful results. Our code, pretrained models, and dataset will be publicly released, to support new applications in interior design, virtual staging, and 3D-consistent stylization.

LGJan 7, 2024

Multi-Modal Federated Learning for Cancer Staging over Non-IID Datasets with Unbalanced Modalities

Kasra Borazjani, Naji Khosravan, Leslie Ying et al.

The use of machine learning (ML) for cancer staging through medical image analysis has gained substantial interest across medical disciplines. When accompanied by the innovative federated learning (FL) framework, ML techniques can further overcome privacy concerns related to patient data exposure. Given the frequent presence of diverse data modalities within patient records, leveraging FL in a multi-modal learning framework holds considerable promise for cancer staging. However, existing works on multi-modal FL often presume that all data-collecting institutions have access to all data modalities. This oversimplified approach neglects institutions that have access to only a portion of data modalities within the system. In this work, we introduce a novel FL architecture designed to accommodate not only the heterogeneity of data samples, but also the inherent heterogeneity/non-uniformity of data modalities across institutions. We shed light on the challenges associated with varying convergence speeds observed across different data modalities within our FL system. Subsequently, we propose a solution to tackle these challenges by devising a distributed gradient blending and proximity-aware client weighting strategy tailored for multi-modal FL. To show the superiority of our method, we conduct experiments using The Cancer Genome Atlas program (TCGA) datalake considering different cancer types and three modalities of data: mRNA sequences, histopathological image data, and clinical information. Our results further unveil the impact and severity of class-based vs type-based heterogeneity across institutions on the model performance, which widens the perspective to the notion of data heterogeneity in multi-modal FL literature.

CVMar 17, 2025

Redefining non-IID Data in Federated Learning for Computer Vision Tasks: Migrating from Labels to Embeddings for Task-Specific Data Distributions

Kasra Borazjani, Payam Abdisarabshali, Naji Khosravan et al.

Federated Learning (FL) represents a paradigm shift in distributed machine learning (ML), enabling clients to train models collaboratively while keeping their raw data private. This paradigm shift from traditional centralized ML introduces challenges due to the non-iid (non-independent and identically distributed) nature of data across clients, significantly impacting FL's performance. Existing literature, predominantly model data heterogeneity by imposing label distribution skew across clients. In this paper, we show that label distribution skew fails to fully capture the real-world data heterogeneity among clients in computer vision tasks beyond classification. Subsequently, we demonstrate that current approaches overestimate FL's performance by relying on label/class distribution skew, exposing an overlooked gap in the literature. By utilizing pre-trained deep neural networks to extract task-specific data embeddings, we define task-specific data heterogeneity through the lens of each vision task and introduce a new level of data heterogeneity called embedding-based data heterogeneity. Our methodology involves clustering data points based on embeddings and distributing them among clients using the Dirichlet distribution. Through extensive experiments, we evaluate the performance of different FL methods under our revamped notion of data heterogeneity, introducing new benchmark performance measures to the literature. We further unveil a series of open research directions that can be pursued.

LGApr 15, 2024

Unsupervised Federated Optimization at the Edge: D2D-Enabled Learning without Labels

Satyavrat Wagle, Seyyedali Hosseinalipour, Naji Khosravan et al.

Federated learning (FL) is a popular solution for distributed machine learning (ML). While FL has traditionally been studied for supervised ML tasks, in many applications, it is impractical to assume availability of labeled data across devices. To this end, we develop Cooperative Federated unsupervised Contrastive Learning ({\tt CF-CL)} to facilitate FL across edge devices with unlabeled datasets. {\tt CF-CL} employs local device cooperation where either explicit (i.e., raw data) or implicit (i.e., embeddings) information is exchanged through device-to-device (D2D) communications to improve local diversity. Specifically, we introduce a \textit{smart information push-pull} methodology for data/embedding exchange tailored to FL settings with either soft or strict data privacy restrictions. Information sharing is conducted through a probabilistic importance sampling technique at receivers leveraging a carefully crafted reserve dataset provided by transmitters. In the implicit case, embedding exchange is further integrated into the local ML training at the devices via a regularization term incorporated into the contrastive loss, augmented with a dynamic contrastive margin to adjust the volume of latent space explored. Numerical evaluations demonstrate that {\tt CF-CL} leads to alignment of latent spaces learned across devices, results in faster and more efficient global model training, and is effective in extreme non-i.i.d. data distribution settings across devices.

LGSep 9, 2025

Bringing Multi-Modal Multi-Task Federated Foundation Models to Education Domain: Prospects and Challenges

Kasra Borazjani, Naji Khosravan, Rajeev Sahay et al.

Multi-modal multi-task (M3T) foundation models (FMs) have recently shown transformative potential in artificial intelligence, with emerging applications in education. However, their deployment in real-world educational settings is hindered by privacy regulations, data silos, and limited domain-specific data availability. We introduce M3T Federated Foundation Models (FedFMs) for education: a paradigm that integrates federated learning (FL) with M3T FMs to enable collaborative, privacy-preserving training across decentralized institutions while accommodating diverse modalities and tasks. Subsequently, this position paper aims to unveil M3T FedFMs as a promising yet underexplored approach to the education community, explore its potentials, and reveal its related future research directions. We outline how M3T FedFMs can advance three critical pillars of next-generation intelligent education systems: (i) privacy preservation, by keeping sensitive multi-modal student and institutional data local; (ii) personalization, through modular architectures enabling tailored models for students, instructors, and institutions; and (iii) equity and inclusivity, by facilitating participation from underrepresented and resource-constrained entities. We finally identify various open research challenges, including studying of (i) inter-institution heterogeneous privacy regulations, (ii) the non-uniformity of data modalities' characteristics, (iii) the unlearning approaches for M3T FedFMs, (iv) the continual learning frameworks for M3T FedFMs, and (v) M3T FedFM model interpretability, which must be collectively addressed for practical deployment.

CVApr 11, 2021

Deformable Capsules for Object Detection

Rodney Lalonde, Naji Khosravan, Ulas Bagci

Capsule networks promise significant benefits over convolutional networks by storing stronger internal representations, and routing information based on the agreement between intermediate representations' projections. Despite this, their success has been limited to small-scale classification datasets due to their computationally expensive nature. Though memory efficient, convolutional capsules impose geometric constraints that fundamentally limit the ability of capsules to model the pose/deformation of objects. Further, they do not address the bigger memory concern of class-capsules scaling up to bigger tasks such as detection or large-scale classification. In this study, we introduce a new family of capsule networks, deformable capsules (\textit{DeformCaps}), to address a very important problem in computer vision: object detection. We propose two new algorithms associated with our \textit{DeformCaps}: a novel capsule structure (\textit{SplitCaps}), and a novel dynamic routing algorithm (\textit{SE-Routing}), which balance computational efficiency with the need for modeling a large number of objects and classes, which have never been achieved with capsule networks before. We demonstrate that the proposed methods efficiently scale up to create the first-ever capsule network for object detection in the literature. Our proposed architecture is a one-stage detection framework and it obtains results on MS COCO which are on par with state-of-the-art one-stage CNN-based methods, while producing fewer false positive detection, generalizing to unusual poses/viewpoints of objects.

IVApr 29, 2020

The International Workshop on Osteoarthritis Imaging Knee MRI Segmentation Challenge: A Multi-Institute Evaluation and Analysis Framework on a Standardized Dataset

Arjun D. Desai, Francesco Caliva, Claudia Iriondo et al.

Purpose: To organize a knee MRI segmentation challenge for characterizing the semantic and clinical efficacy of automatic segmentation methods relevant for monitoring osteoarthritis progression. Methods: A dataset partition consisting of 3D knee MRI from 88 subjects at two timepoints with ground-truth articular (femoral, tibial, patellar) cartilage and meniscus segmentations was standardized. Challenge submissions and a majority-vote ensemble were evaluated using Dice score, average symmetric surface distance, volumetric overlap error, and coefficient of variation on a hold-out test set. Similarities in network segmentations were evaluated using pairwise Dice correlations. Articular cartilage thickness was computed per-scan and longitudinally. Correlation between thickness error and segmentation metrics was measured using Pearson's coefficient. Two empirical upper bounds for ensemble performance were computed using combinations of model outputs that consolidated true positives and true negatives. Results: Six teams (T1-T6) submitted entries for the challenge. No significant differences were observed across all segmentation metrics for all tissues (p=1.0) among the four top-performing networks (T2, T3, T4, T6). Dice correlations between network pairs were high (>0.85). Per-scan thickness errors were negligible among T1-T4 (p=0.99) and longitudinal changes showed minimal bias (<0.03mm). Low correlations (<0.41) were observed between segmentation metrics and thickness error. The majority-vote ensemble was comparable to top performing networks (p=1.0). Empirical upper bound performances were similar for both combinations (p=1.0). Conclusion: Diverse networks learned to segment the knee similarly where high segmentation accuracy did not correlate to cartilage thickness accuracy. Voting ensembles did not outperform individual networks but may help regularize individual models.

IVAug 26, 2019

Cross-modality Knowledge Transfer for Prostate Segmentation from CT Scans

Yucheng Liu, Naji Khosravan, Yulin Liu et al.

Creating large scale high-quality annotations is a known challenge in medical imaging. In this work, based on the CycleGAN algorithm, we propose leveraging annotations from one modality to be useful in other modalities. More specifically, the proposed algorithm creates highly realistic synthetic CT images (SynCT) from prostate MR images using unpaired data sets. By using SynCT images (without segmentation labels) and MR images (with segmentation labels available), we have trained a deep segmentation network for precise delineation of prostate from real CT scans. For the generator in our CycleGAN, the cycle consistency term is used to guarantee that SynCT shares the identical manually-drawn, high-quality masks originally delineated on MR images. Further, we introduce a cost function based on structural similarity index (SSIM) to improve the anatomical similarity between real and synthetic images. For segmentation followed by the SynCT generation from CycleGAN, automatic delineation is achieved through a 2.5D Residual U-Net. Quantitative evaluation demonstrates comparable segmentation results between our SynCT and radiologist drawn masks for real CT images, solving an important problem in medical image segmentation field when ground truth annotations are not available for the modality of interest.

MLAug 18, 2019

Weakly Supervised Segmentation by A Deep Geodesic Prior

Aliasghar Mortazi, Naji Khosravan, Drew A. Torigian et al.

The performance of the state-of-the-art image segmentation methods heavily relies on the high-quality annotations, which are not easily affordable, particularly for medical data. To alleviate this limitation, in this study, we propose a weakly supervised image segmentation method based on a deep geodesic prior. We hypothesize that integration of this prior information can reduce the adverse effects of weak labels in segmentation accuracy. Our proposed algorithm is based on a prior information, extracted from an auto-encoder, trained to map objects geodesic maps to their corresponding binary maps. The obtained information is then used as an extra term in the loss function of the segmentor. In order to show efficacy of the proposed strategy, we have experimented segmentation of cardiac substructures with clean and two levels of noisy labels (L1, L2). Our experiments showed that the proposed algorithm boosted the performance of baseline deep learning-based segmentation for both clean and noisy labels by 4.4%, 4.6%(L1), and 6.3%(L2) in dice score, respectively. We also showed that the proposed method was more robust in the presence of high-level noise due to the existence of shape priors.

CVJun 11, 2019

PAN: Projective Adversarial Network for Medical Image Segmentation

Naji Khosravan, Aliasghar Mortazi, Michael Wallace et al.

Adversarial learning has been proven to be effective for capturing long-range and high-level label consistencies in semantic segmentation. Unique to medical imaging, capturing 3D semantics in an effective yet computationally efficient way remains an open problem. In this study, we address this computational burden by proposing a novel projective adversarial network, called PAN, which incorporates high-level 3D information through 2D projections. Furthermore, we introduce an attention module into our framework that helps for a selective integration of global information directly from our segmentor to our adversarial network. For the clinical application we chose pancreas segmentation from CT scans. Our proposed framework achieved state-of-the-art performance without adding to the complexity of the segmentor.

CVDec 14, 2018

On Attention Modules for Audio-Visual Synchronization

Naji Khosravan, Shervin Ardeshir, Rohit Puri

With the development of media and networking technologies, multimedia applications ranging from feature presentation in a cinema setting to video on demand to interactive video conferencing are in great demand. Good synchronization between audio and video modalities is a key factor towards defining the quality of a multimedia presentation. The audio and visual signals of a multimedia presentation are commonly managed by independent workflows - they are often separately authored, processed, stored and even delivered to the playback system. This opens up the possibility of temporal misalignment between the two modalities - such a tendency is often more pronounced in the case of produced content (such as movies). To judge whether audio and video signals of a multimedia presentation are synchronized, we as humans often pay close attention to discriminative spatio-temporal blocks of the video (e.g. synchronizing the lip movement with the utterance of words, or the sound of a bouncing ball at the moment it hits the ground). At the same time, we ignore large portions of the video in which no discriminative sounds exist (e.g. background music playing in a movie). Inspired by this observation, we study leveraging attention modules for automatically detecting audio-visual synchronization. We propose neural network based attention modules, capable of weighting different portions (spatio-temporal blocks) of the video based on their respective discriminative power. Our experiments indicate that incorporating attention modules yields state-of-the-art results for the audio-visual synchronization classification problem.

CVMay 6, 2018

S4ND: Single-Shot Single-Scale Lung Nodule Detection

Naji Khosravan, Ulas Bagci

The state of the art lung nodule detection studies rely on computationally expensive multi-stage frameworks to detect nodules from CT scans. To address this computational challenge and provide better performance, in this paper we propose S4ND, a new deep learning based method for lung nodule detection. Our approach uses a single feed forward pass of a single network for detection and provides better performance when compared to the current literature. The whole detection pipeline is designed as a single $3D$ Convolutional Neural Network (CNN) with dense connections, trained in an end-to-end manner. S4ND does not require any further post-processing or user guidance to refine detection results. Experimentally, we compared our network with the current state-of-the-art object detection network (SSD) in computer vision as well as the state-of-the-art published method for lung nodule detection (3D DCNN). We used publically available $888$ CT scans from LUNA challenge dataset and showed that the proposed method outperforms the current literature both in terms of efficiency and accuracy by achieving an average FROC-score of $0.897$. We also provide an in-depth analysis of our proposed network to shed light on the unclear paradigms of tiny object detection.

CVFeb 17, 2018

A Collaborative Computer Aided Diagnosis (C-CAD) System with Eye-Tracking, Sparse Attentional Model, and Deep Learning

Naji Khosravan, Haydar Celik, Baris Turkbey et al.

There are at least two categories of errors in radiology screening that can lead to suboptimal diagnostic decisions and interventions:(i)human fallibility and (ii)complexity of visual search. Computer aided diagnostic (CAD) tools are developed to help radiologists to compensate for some of these errors. However, despite their significant improvements over conventional screening strategies, most CAD systems do not go beyond their use as second opinion tools due to producing a high number of false positives, which human interpreters need to correct. In parallel with efforts in computerized analysis of radiology scans, several researchers have examined behaviors of radiologists while screening medical images to better understand how and why they miss tumors, how they interact with the information in an image, and how they search for unknown pathology in the images. Eye-tracking tools have been instrumental in exploring answers to these fundamental questions. In this paper, we aim to develop a paradigm shift CAD system, called collaborative CAD (C-CAD), that unifies both of the above mentioned research lines: CAD and eye-tracking. We design an eye-tracking interface providing radiologists with a real radiology reading room experience. Then, we propose a novel algorithm that unifies eye-tracking data and a CAD system. Specifically, we present a new graph based clustering and sparsification algorithm to transform eye-tracking data (gaze) into a signal model to interpret gaze patterns quantitatively and qualitatively. The proposed C-CAD collaborates with radiologists via eye-tracking technology and helps them to improve diagnostic decisions. The C-CAD learns radiologists' search efficiency by processing their gaze patterns. To do this, the C-CAD uses a deep learning algorithm in a newly designed multi-task learning platform to segment and diagnose cancers simultaneously.

CVFeb 17, 2018

Semi-supervised multi-task learning for lung cancer diagnosis

Naji Khosravan, Ulas Bagci

Early detection of lung nodules is of great importance in lung cancer screening. Existing research recognizes the critical role played by CAD systems in early detection and diagnosis of lung nodules. However, many CAD systems, which are used as cancer detection tools, produce a lot of false positives (FP) and require a further FP reduction step. Furthermore, guidelines for early diagnosis and treatment of lung cancer are consist of different shape and volume measurements of abnormalities. Segmentation is at the heart of our understanding of nodules morphology making it a major area of interest within the field of computer aided diagnosis systems. This study set out to test the hypothesis that joint learning of false positive (FP) nodule reduction and nodule segmentation can improve the computer aided diagnosis (CAD) systems' performance on both tasks. To support this hypothesis we propose a 3D deep multi-task CNN to tackle these two problems jointly. We tested our system on LUNA16 dataset and achieved an average dice similarity coefficient (DSC) of 91% as segmentation accuracy and a score of nearly 92% for FP reduction. As a proof of our hypothesis, we showed improvements of segmentation and FP reduction tasks over two baselines. Our results support that joint training of these two tasks through a multi-task learning approach improves system performance on both. We also showed that a semi-supervised approach can be used to overcome the limitation of lack of labeled data for the 3D segmentation task.

CVAug 10, 2016

Gaze2Segment: A Pilot Study for Integrating Eye-Tracking Technology into Medical Image Segmentation

Naji Khosravan, Haydar Celik, Baris Turkbey et al.

This study introduced a novel system, called Gaze2Segment, integrating biological and computer vision techniques to support radiologists' reading experience with an automatic image segmentation task. During diagnostic assessment of lung CT scans, the radiologists' gaze information were used to create a visual attention map. This map was then combined with a computer-derived saliency map, extracted from the gray-scale CT images. The visual attention map was used as an input for indicating roughly the location of a object of interest. With computer-derived saliency information, on the other hand, we aimed at finding foreground and background cues for the object of interest. At the final step, these cues were used to initiate a seed-based delineation process. Segmentation accuracy of the proposed Gaze2Segment was found to be 86% with dice similarity coefficient and 1.45 mm with Hausdorff distance. To the best of our knowledge, Gaze2Segment is the first true integration of eye-tracking technology into a medical image segmentation task without the need for any further user-interaction.