29.6CVJun 3Code
Enhancing MedSAM with a Lightweight Box Predictor for Medical Image SegmentationAmirhossein Movahedisefat, Amirreza Fateh, Mohammad Reza Mohammadi
Semantic segmentation in medical imaging is a critical yet challenging task due to data scarcity and high variability across modalities. While foundation models like the Segment Anything Model (SAM) show promise, they often struggle with medical images without specific adaptation. Moreover, point prompts, despite being the most natural form of user interaction, provide insufficient spatial context for reliable segmentation, particularly when target structures are irregular or poorly contrasted. In this paper, we propose an enhanced segmentation framework that integrates a lightweight Box Predictor module into the MedSAM architecture. The Box Predictor estimates an approximate bounding box from a single user click using localized image embedding features, providing spatial guidance that reduces the ambiguity of point prompts, while introducing only 1.6M additional parameters and negligible inference overhead. We introduce a two-stage training pipeline where the Box Predictor is trained independently before being integrated into MedSAM. To validate the generalization capability of our method, we conduct extensive evaluations on four diverse datasets (FLARE22, BRISC, BUSI, LungSegDB) spanning distinct imaging modalities, including CT, MRI, and Ultrasound. Our method improves segmentation accuracy and robustness across varied anatomical structures and imaging domains, achieving Dice scores of 0.89 (BUSI), 0.93 (FLARE22), 0.88 (BRISC), and 0.98 (LungSegDB). Code is available at https://github.com/Amirhosseinmovahedi/MedSAM-BoxPredictor
CVSep 17, 2024Code
MSDNet: Multi-Scale Decoder for Few-Shot Semantic Segmentation via Transformer-Guided PrototypingAmirreza Fateh, Mohammad Reza Mohammadi, Mohammad Reza Jahed Motlagh
Few-shot Semantic Segmentation addresses the challenge of segmenting objects in query images with only a handful of annotated examples. However, many previous state-of-the-art methods either have to discard intricate local semantic features or suffer from high computational complexity. To address these challenges, we propose a new Few-shot Semantic Segmentation framework based on the Transformer architecture. Our approach introduces the spatial transformer decoder and the contextual mask generation module to improve the relational understanding between support and query images. Moreover, we introduce a multi scale decoder to refine the segmentation mask by incorporating features from different resolutions in a hierarchical manner. Additionally, our approach integrates global features from intermediate encoder stages to improve contextual understanding, while maintaining a lightweight structure to reduce complexity. This balance between performance and efficiency enables our method to achieve competitive results on benchmark datasets such as PASCAL-5^i and COCO-20^i in both 1-shot and 5-shot settings. Notably, our model with only 1.5 million parameters demonstrates competitive performance while overcoming limitations of existing methodologies. https://github.com/amirrezafateh/MSDNet
CVSep 12, 2024Code
Enhancing Few-Shot Image Classification through Learnable Multi-Scale Embedding and Attention MechanismsFatemeh Askari, Amirreza Fateh, Mohammad Reza Mohammadi
In the context of few-shot classification, the goal is to train a classifier using a limited number of samples while maintaining satisfactory performance. However, traditional metric-based methods exhibit certain limitations in achieving this objective. These methods typically rely on a single distance value between the query feature and support feature, thereby overlooking the contribution of shallow features. To overcome this challenge, we propose a novel approach in this paper. Our approach involves utilizing a multi-output embedding network that maps samples into distinct feature spaces. The proposed method extracts feature vectors at different stages, enabling the model to capture both global and abstract features. By utilizing these diverse feature spaces, our model enhances its performance. Moreover, employing a self-attention mechanism improves the refinement of features at each stage, leading to even more robust representations and improved overall performance. Furthermore, assigning learnable weights to each stage significantly improved performance and results. We conducted comprehensive evaluations on the MiniImageNet and FC100 datasets, specifically in the 5-way 1-shot and 5-way 5-shot scenarios. Additionally, we performed cross-domain tasks across eight benchmark datasets, achieving high accuracy in the testing domains. These evaluations demonstrate the efficacy of our proposed method in comparison to state-of-the-art approaches. https://github.com/FatemehAskari/MSENet
CLJan 21, 2023
A Multi-Purpose Audio-Visual Corpus for Multi-Modal Persian Speech Recognition: the Arman-AV DatasetJavad Peymanfard, Samin Heydarian, Ali Lashini et al.
In recent years, significant progress has been made in automatic lip reading. But these methods require large-scale datasets that do not exist for many low-resource languages. In this paper, we have presented a new multipurpose audio-visual dataset for Persian. This dataset consists of almost 220 hours of videos with 1760 corresponding speakers. In addition to lip reading, the dataset is suitable for automatic speech recognition, audio-visual speech recognition, and speaker recognition. Also, it is the first large-scale lip reading dataset in Persian. A baseline method was provided for each mentioned task. In addition, we have proposed a technique to detect visemes (a visual equivalent of a phoneme) in Persian. The visemes obtained by this method increase the accuracy of the lip reading task by 7% relatively compared to the previously proposed visemes, which can be applied to other languages as well.
CVJul 19, 2023
Leveraging Visemes for Better Visual Speech Representation and Lip ReadingJavad Peymanfard, Vahid Saeedi, Mohammad Reza Mohammadi et al.
Lip reading is a challenging task that has many potential applications in speech recognition, human-computer interaction, and security systems. However, existing lip reading systems often suffer from low accuracy due to the limitations of video features. In this paper, we propose a novel approach that leverages visemes, which are groups of phonetically similar lip shapes, to extract more discriminative and robust video features for lip reading. We evaluate our approach on various tasks, including word-level and sentence-level lip reading, and audiovisual speech recognition using the Arman-AV dataset, a largescale Persian corpus. Our experimental results show that our viseme based approach consistently outperforms the state-of-theart methods in all these tasks. The proposed method reduces the lip-reading word error rate (WER) by 9.1% relative to the best previous method.
CVApr 26, 2021Code
Wise-SrNet: A Novel Architecture for Enhancing Image Classification by Learning Spatial Resolution of Feature MapsMohammad Rahimzadeh, AmirAli Askari, Soroush Parvin et al.
One of the main challenges since the advancement of convolutional neural networks is how to connect the extracted feature map to the final classification layer. VGG models used two sets of fully connected layers for the classification part of their architectures, which significantly increased the number of models' weights. ResNet and the next deep convolutional models used the Global Average Pooling (GAP) layer to compress the feature map and feed it to the classification layer. Although using the GAP layer reduces the computational cost, but also causes losing spatial resolution of the feature map, which results in decreasing learning efficiency. In this paper, we aim to tackle this problem by replacing the GAP layer with a new architecture called Wise-SrNet. It is inspired by the depthwise convolutional idea and is designed for processing spatial resolution while not increasing computational cost. We have evaluated our method using three different datasets: Intel Image Classification Challenge, MIT Indoors Scenes, and a part of the ImageNet dataset. We investigated the implementation of our architecture on several models of the Inception, ResNet, and DenseNet families. Applying our architecture has revealed a significant effect on increasing convergence speed and accuracy. Our Experiments on images with 224*224 resolution increased the Top-1 accuracy between 2% to 8% on different datasets and models. Running our models on 512*512 resolution images of the MIT Indoors Scenes dataset showed a notable result of improving the Top-1 accuracy within 3% to 26%. We will also demonstrate the GAP layer's disadvantage when the input images are large and the number of classes is not few. In this circumstance, our proposed architecture can do a great help in enhancing classification results. The code is shared at https://github.com/mr7495/image-classification-spatial.
HCNov 3, 2023
End-to-End assessment of AR-assisted neurosurgery systemsMahdi Bagheri, Farhad Piri, Hadi Digale et al.
Augmented Reality (AR) has emerged as a significant advancement in surgical procedures, offering a solution to the challenges posed by traditional neuronavigation methods. These conventional techniques often necessitate surgeons to split their focus between the surgical site and a separate monitor that displays guiding images. Over the years, many systems have been developed to register and track the hologram at the targeted locations, each employed its own evaluation technique. On the other hand, hologram displacement measurement is not a straightforward task because of various factors such as occlusion, Vengence-Accomodation Conflict, and unstable holograms in space. In this study, we explore and classify different techniques for assessing an AR-assisted neurosurgery system and propose a new technique to systematize the assessment procedure. Moreover, we conduct a deeper investigation to assess surgeon error in the pre- and intra-operative phases of the surgery based on the respective feedback given. We found that although the system can undergo registration and tracking errors, physical feedback can significantly reduce the error caused by hologram displacement. However, the lack of visual feedback on the hologram does not have a significant effect on the user 3D perception.
CVDec 25, 2024
A Culturally-Aware Benchmark for Person Re-Identification in Modest AttireAlireza Sedighi Moghaddam, Fatemeh Anvari, Mohammadjavad Mirshekari Haghighi et al.
Person Re-Identification (ReID) is a fundamental task in computer vision with critical applications in surveillance and security. Despite progress in recent years, most existing ReID models often struggle to generalize across diverse cultural contexts, particularly in Islamic regions like Iran, where modest clothing styles are prevalent. Existing datasets predominantly feature Western and East Asian fashion, limiting their applicability in these settings. To address this gap, we introduce Iran University of Science and Technology Person Re-Identification (IUST_PersonReId), a dataset designed to reflect the unique challenges of ReID in new cultural environments, emphasizing modest attire and diverse scenarios from Iran, including markets, campuses, and mosques. Experiments on IUST_PersonReId with state-of-the-art models, such as Semantic Controllable Self-supervised Learning (SOLIDER) and Contrastive Language-Image Pretraining Re-Identification (CLIP-ReID), reveal significant performance drops compared to benchmarks like Market1501 and Multi-Scene MultiTime (MSMT17), specifically, SOLIDER shows a drop of 50.75% and 23.01% Mean Average Precision (mAP) compared to Market1501 and MSMT17 respectively, while CLIP-ReID exhibits a drop of 38.09% and 21.74% mAP, highlighting the challenges posed by occlusion and limited distinctive features. Sequence-based evaluations show improvements by leveraging temporal context, emphasizing the dataset's potential for advancing culturally sensitive and robust ReID systems. IUST_PersonReId offers a critical resource for addressing fairness and bias in ReID research globally.
CVDec 23, 2024
Feature Based Methods in Domain Adaptation for Object Detection: A Review PaperHelia Mohamadi, Mohammad Ali Keyvanrad, Mohammad Reza Mohammadi
Domain adaptation, a pivotal branch of transfer learning, aims to enhance the performance of machine learning models when deployed in target domains with distinct data distributions. This is particularly critical for object detection tasks, where domain shifts (caused by factors such as lighting conditions, viewing angles, and environmental variations) can lead to significant performance degradation. This review delves into advanced methodologies for domain adaptation, including adversarial learning, discrepancy-based, multi-domain, teacher-student, ensemble, and Vision Language Models techniques, emphasizing their efficacy in reducing domain gaps and enhancing model robustness. Feature-based methods have emerged as powerful tools for addressing these challenges by harmonizing feature representations across domains. These techniques, such as Feature Alignment, Feature Augmentation/Reconstruction, and Feature Transformation, are employed alongside or as integral parts of other domain adaptation strategies to minimize domain gaps and improve model performance. Special attention is given to strategies that minimize the reliance on extensive labeled data and using unlabeled data, particularly in scenarios involving synthetic-to-real domain shifts. Applications in fields such as autonomous driving and medical imaging are explored, showcasing the potential of these methods to ensure reliable object detection in diverse and complex settings. By providing a thorough analysis of state-of-the-art techniques, challenges, and future directions, this work offers a valuable reference for researchers striving to develop resilient and adaptable object detection frameworks, advancing the seamless deployment of artificial intelligence in dynamic environments.
CVSep 2, 2025
Ordinal Adaptive Correction: A Data-Centric Approach to Ordinal Image Classification with Noisy LabelsAlireza Sedighi Moghaddam, Mohammad Reza Mohammadi
Labeled data is a fundamental component in training supervised deep learning models for computer vision tasks. However, the labeling process, especially for ordinal image classification where class boundaries are often ambiguous, is prone to error and noise. Such label noise can significantly degrade the performance and reliability of machine learning models. This paper addresses the problem of detecting and correcting label noise in ordinal image classification tasks. To this end, a novel data-centric method called ORDinal Adaptive Correction (ORDAC) is proposed for adaptive correction of noisy labels. The proposed approach leverages the capabilities of Label Distribution Learning (LDL) to model the inherent ambiguity and uncertainty present in ordinal labels. During training, ORDAC dynamically adjusts the mean and standard deviation of the label distribution for each sample. Rather than discarding potentially noisy samples, this approach aims to correct them and make optimal use of the entire training dataset. The effectiveness of the proposed method is evaluated on benchmark datasets for age estimation (Adience) and disease severity detection (Diabetic Retinopathy) under various asymmetric Gaussian noise scenarios. Results show that ORDAC and its extended versions (ORDAC_C and ORDAC_R) lead to significant improvements in model performance. For instance, on the Adience dataset with 40% noise, ORDAC_R reduced the mean absolute error from 0.86 to 0.62 and increased the recall metric from 0.37 to 0.49. The method also demonstrated its effectiveness in correcting intrinsic noise present in the original datasets. This research indicates that adaptive label correction using label distributions is an effective strategy to enhance the robustness and accuracy of ordinal classification models in the presence of noisy data.
CVApr 10, 2021
Lip reading using external viseme decodingJavad Peymanfard, Mohammad Reza Mohammadi, Hossein Zeinali et al.
Lip-reading is the operation of recognizing speech from lip movements. This is a difficult task because the movements of the lips when pronouncing the words are similar for some of them. Viseme is used to describe lip movements during a conversation. This paper aims to show how to use external text data (for viseme-to-character mapping) by dividing video-to-character into two stages, namely converting video to viseme, and then converting viseme to character by using separate models. Our proposed method improves word error rate by 4\% compared to the normal sequence to sequence lip-reading model on the BBC-Oxford Lip Reading Sentences 2 (LRS2) dataset.
IVDec 22, 2020
Cloud removal in remote sensing images using generative adversarial networks and SAR-to-optical image translationFaramarz Naderi Darbaghshahi, Mohammad Reza Mohammadi, Mohsen Soryani
Satellite images are often contaminated by clouds. Cloud removal has received much attention due to the wide range of satellite image applications. As the clouds thicken, the process of removing the clouds becomes more challenging. In such cases, using auxiliary images such as near-infrared or synthetic aperture radar (SAR) for reconstructing is common. In this study, we attempt to solve the problem using two generative adversarial networks (GANs). The first translates SAR images into optical images, and the second removes clouds using the translated images of prior GAN. Also, we propose dilated residual inception blocks (DRIBs) instead of vanilla U-net in the generator networks and use structural similarity index measure (SSIM) in addition to the L1 Loss function. Reducing the number of downsamplings and expanding receptive fields by dilated convolutions increase the quality of output images. We used the SEN1-2 dataset to train and test both GANs, and we made cloudy images by adding synthetic clouds to optical images. The restored images are evaluated with PSNR and SSIM. We compare the proposed method with state-of-the-art deep learning models and achieve more accurate results in both SAR-to-optical translation and cloud removal parts.
CVFeb 11, 2020
Sperm Detection and Tracking in Phase-Contrast Microscopy Image Sequences using Deep Learning and Modified CSR-DCFMohammad reza Mohammadi, Mohammad Rahimzadeh, Abolfazl Attar
Nowadays, computer-aided sperm analysis (CASA) systems have made a big leap in extracting the characteristics of spermatozoa for studies or measuring human fertility. The first step in sperm characteristics analysis is sperm detection in the frames of the video sample. In this article, we used RetinaNet, a deep fully convolutional neural network as the object detector. Sperms are small objects with few attributes, that makes the detection more difficult in high-density samples and especially when there are other particles in semen, which could be like sperm heads. One of the main attributes of sperms is their movement, but this attribute cannot be extracted when only one frame would be fed to the network. To improve the performance of the sperm detection network, we concatenated some consecutive frames to use as the input of the network. With this method, the motility attribute has also been extracted, and then with the help of the deep convolutional network, we have achieved high accuracy in sperm detection. The second step is tracking the sperms, for extracting the motility parameters that are essential for indicating fertility and other studies on sperms. In the tracking phase, we modify the CSR-DCF algorithm. This method also has shown excellent results in sperm tracking even in high-density sperm samples, occlusions, sperm colliding, and when sperms exit from a frame and re-enter in the next frames. The average precision of the detection phase is 99.1%, and the F1 score of the tracking method evaluation is 96.61%. These results can be a great help in studies investigating sperm behavior and analyzing fertility possibility.
CVAug 19, 2018
Deep Multiple Instance Learning for Airplane Detection in High Resolution ImageryMohammad Reza Mohammadi
Automatic airplane detection in aerial imagery has a variety of applications. Two of the significant challenges in this task are variations in the scale and direction of the airplanes. To solve these challenges, we present a rotation-and-scale invariant airplane proposal generator. We call this generator symmetric line segments (SLS) that is developed based on the symmetric and regular boundaries of airplanes from the top view. Then, the generated proposals are used to train a deep convolutional neural network for removing non-airplane proposals. Since each airplane can have multiple SLS proposals, where some of them are not in the direction of the fuselage, we collect all proposals corresponding to one ground-truth as a positive bag and the others as the negative instances. To have multiple instance deep learning, we modify the loss function of the network to learn from each positive bag at least one instance as well as all negative instances. Finally, we employ non-maximum suppression to remove duplicate detections. Our experiments on NWPU VHR-10 and DOTA datasets show that our method is a promising approach for automatic airplane detection in very high-resolution images. Moreover, we estimate the direction of the airplanes using box-level annotations as an extra achievement.