CV Feb 23, 2023
ArtiFact: A Large-Scale Dataset with Artificial and Factual Images for Generalizable and Robust Synthetic Image Detection
Md Awsafur Rahman, Bishmoy Paul, Najibul Haque Sarker et al.
Synthetic image generation has opened up new opportunities but has also created threats in regard to privacy, authenticity, and security. Detecting fake images is of paramount importance to prevent illegal activities, and previous research has shown that generative models leave unique patterns in their synthetic images that can be exploited to detect them. However, the fundamental problem of generalization remains, as even state-of-the-art detectors encounter difficulty when facing generators never seen during training. To assess the generalizability and robustness of synthetic image detectors in the face of real-world impairments, this paper presents a large-scale dataset named ArtiFact, comprising diverse generators, object categories, and real-world challenges. Moreover, the proposed multi-class classification scheme, combined with a filter stride reduction strategy, addresses social platform impairments and effectively detects synthetic images from both seen and unseen generators. The proposed solution significantly outperforms other top teams by 8.34% on Test 1, 1.26% on Test 2, and 15.08% on Test 3 in the IEEE VIP Cup challenge at ICIP 2022, as measured by the accuracy metric.
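The abstract does not say which layers have their stride reduced, so the following is only a sketch of the general idea: lowering a convolution's stride keeps more spatial resolution in the feature map. It uses the standard convolution output-size formula. The function name and the example layer settings are illustrative assumptions, not taken from the paper.

```python
def conv_output_size(input_size, kernel_size, stride, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


# Hypothetical 3x3 stem convolution on a 224-pixel-wide input.
with_stride_2 = conv_output_size(224, kernel_size=3, stride=2, padding=1)
with_stride_1 = conv_output_size(224, kernel_size=3, stride=1, padding=1)

print(with_stride_2)  # 112: the feature map is halved
print(with_stride_1)  # 224: full resolution is preserved
```

With stride 1 the early feature maps keep every pixel position, which is one plausible reason a stride reduction could help retain fine generator traces, at the cost of more computation.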
SD Aug 26, 2024
SONICS: Synthetic Or Not -- Identifying Counterfeit Songs
Md Awsafur Rahman, Zaber Ibn Abdul Hakim, Najibul Haque Sarker et al.
The recent surge in AI-generated songs presents exciting possibilities and challenges. These innovations necessitate the ability to distinguish between human-composed and synthetic songs to safeguard artistic integrity and protect human musical artistry. Existing research and datasets in fake song detection only focus on singing voice deepfake detection (SVDD), where the vocals are AI-generated but the instrumental music is sourced from real songs. However, these approaches are inadequate for detecting contemporary end-to-end artificial songs where all components (vocals, music, lyrics, and style) could be AI-generated. Additionally, existing datasets lack music-lyrics diversity, long-duration songs, and open-access fake songs. To address these gaps, we introduce SONICS, a novel dataset for end-to-end Synthetic Song Detection (SSD), comprising over 97k songs (4,751 hours) with over 49k synthetic songs from popular platforms like Suno and Udio. Furthermore, we highlight the importance of modeling long-range temporal dependencies in songs for effective authenticity detection, an aspect entirely overlooked in existing methods. To utilize long-range patterns, we introduce SpecTTTra, a novel architecture that significantly improves time and memory efficiency over conventional CNN and Transformer-based models. For long songs, our top-performing variant outperforms ViT by 8% in F1 score, is 38% faster, and uses 26% less memory, while also surpassing ConvNeXt with a 1% F1 score gain, 20% speed boost, and 67% memory reduction.
QUANT-PH Jul 20, 2023
Quantum Convolutional Neural Networks with Interaction Layers for Classification of Classical Data
Jishnu Mahmud, Raisa Mashtura, Shaikh Anowarul Fattah et al.
Quantum Machine Learning (QML) has come into the limelight due to the exceptional computational abilities of quantum computers. With the promise of near error-free quantum computers in the not-so-distant future, it is important that the effect of multi-qubit interactions on quantum neural networks is studied extensively. This paper introduces a Quantum Convolutional Network with novel Interaction layers exploiting three-qubit interactions, while studying the network's expressibility and entangling capability, for classifying both image and one-dimensional data. The proposed approach is tested on three publicly available datasets, namely MNIST, Fashion MNIST, and Iris, supports both binary and multiclass classification, and is found to outperform existing state-of-the-art methods.
CV Oct 3, 2023
Decoding Human Activities: Analyzing Wearable Accelerometer and Gyroscope Data for Activity Recognition
Utsab Saha, Sawradip Saha, Tahmid Kabir et al.
A person's movement or relative positioning can be effectively captured by different types of sensors, and the corresponding sensor output can be utilized in various manipulative techniques for the classification of different human activities. This letter proposes an effective scheme for human activity recognition, which introduces two unique approaches within a multi-structural architecture, named FusionActNet. The first approach aims to capture the static and dynamic behavior of a particular action by using two dedicated residual networks, and the second approach facilitates the final decision-making process by introducing a guidance module. A two-stage training process is designed where, at the first stage, residual networks are pre-trained separately by using static (where the human body is immobile) and dynamic (involving movement of the human body) data. In the next stage, the guidance module along with the pre-trained static or dynamic models are trained on the given sensor data. Here the guidance module learns to emphasize the most relevant prediction vector obtained from the static or dynamic models, which helps to effectively classify different human activities. The proposed scheme is evaluated using two benchmark datasets and compared with state-of-the-art methods. The results clearly demonstrate that our method outperforms existing approaches in terms of accuracy, precision, recall, and F1 score, achieving 97.35% and 95.35% accuracy on the UCI HAR and Motion-Sense datasets, respectively, which highlights both the effectiveness and stability of the proposed scheme.
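One way to picture a guidance module that emphasizes one of two prediction vectors is a softmax gate that weights the static and dynamic outputs before summing them. The abstract does not describe the module's internals, so everything below is an assumed illustration: the gate scores, the function names, and the shared class space are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_predictions(p_static, p_dynamic, gate_scores):
    """Weight two class-probability vectors by softmax-normalized gate scores.

    gate_scores holds two numbers (static, dynamic); in a trained model they
    would come from a learned layer, here they are supplied by hand.
    """
    w_static, w_dynamic = softmax(gate_scores)
    return [w_static * s + w_dynamic * d for s, d in zip(p_static, p_dynamic)]


# Equal gate scores reduce to a plain average of the two predictions.
fused = fuse_predictions([0.9, 0.1], [0.2, 0.8], gate_scores=[0.0, 0.0])
print(fused)
```

Raising the first gate score pushes the fused output toward the static model's prediction, and raising the second pushes it toward the dynamic model's.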
CV Mar 7, 2023
CIFF-Net: Contextual Image Feature Fusion for Melanoma Diagnosis
Md Awsafur Rahman, Bishmoy Paul, Tanvir Mahmud et al.
Melanoma is considered to be the deadliest variant of skin cancer, causing around 75% of total skin cancer deaths. To diagnose melanoma, clinicians assess and compare multiple skin lesions of the same patient concurrently to gather contextual information regarding the patterns and abnormality of the skin. So far, this concurrent multi-image comparative method has not been explored by existing deep learning-based schemes. In this paper, based on contextual image feature fusion (CIFF), a deep neural network (CIFF-Net) is proposed, which integrates patient-level contextual information into the traditional approaches for improved melanoma diagnosis by the concurrent multi-image comparative method. The proposed multi-kernel self attention (MKSA) module offers better generalization of the extracted features by introducing multi-kernel operations in the self attention mechanisms. To utilize both self attention and contextual feature-wise attention, an attention guided module named contextual feature fusion (CFF) is proposed that integrates extracted features from different contextual images into a single feature vector. Finally, in the comparative contextual feature fusion (CCFF) module, primary and contextual features are compared concurrently to generate comparative features. Significant improvement in performance has been achieved on the ISIC-2020 dataset over the traditional approaches, which validates the effectiveness of the proposed contextual learning scheme.
CV Mar 6, 2023
DwinFormer: Dual Window Transformers for End-to-End Monocular Depth Estimation
Md Awsafur Rahman, Shaikh Anowarul Fattah
Depth estimation from a single image is of paramount importance in the realm of computer vision, with a multitude of applications. Conventional methods suffer from a trade-off between consistency and fine-grained details due to their local receptive fields, which limits their practicality. This lack of long-range dependency inherently comes from the convolutional neural network part of the architecture. In this paper, a dual window transformer-based network, namely DwinFormer, is proposed, which utilizes both local and global features for end-to-end monocular depth estimation. The DwinFormer consists of dual window self-attention and cross-attention transformers, Dwin-SAT and Dwin-CAT, respectively. The Dwin-SAT seamlessly extracts intricate, locally aware features while concurrently capturing global context. It harnesses the power of local and global window attention to adeptly capture both short-range and long-range dependencies, obviating the need for complex and computationally expensive operations, such as attention masking or window shifting. Moreover, Dwin-SAT introduces inductive biases which provide desirable properties, such as translational equivariance and less dependence on large-scale data. Furthermore, conventional decoding methods often rely on skip connections, which may result in semantic discrepancies and a lack of global context when fusing encoder and decoder features. In contrast, the Dwin-CAT employs both local and global window cross-attention to seamlessly fuse encoder and decoder features with both fine-grained local and contextually aware global information, effectively amending the semantic gap. Empirical evidence obtained through extensive experimentation on the NYU-Depth-V2 and KITTI datasets demonstrates the superiority of the proposed method, consistently outperforming existing approaches across both indoor and outdoor environments.
CV Aug 28, 2023
Semi-Supervised Semantic Depth Estimation using Symbiotic Transformer and NearFarMix Augmentation
Md Awsafur Rahman, Shaikh Anowarul Fattah
In computer vision, depth estimation is crucial for domains like robotics, autonomous vehicles, augmented reality, and virtual reality. Integrating semantics with depth enhances scene understanding through reciprocal information sharing. However, the scarcity of semantic information in datasets poses challenges. Existing convolutional approaches with limited local receptive fields hinder the full utilization of the symbiotic potential between depth and semantics. This paper introduces a dataset-invariant semi-supervised strategy to address the scarcity of semantic information. It proposes the Depth Semantics Symbiosis module, leveraging the Symbiotic Transformer for achieving comprehensive mutual awareness by information exchange within both local and global contexts. Additionally, a novel augmentation, NearFarMix is introduced to combat overfitting and compensate both depth-semantic tasks by strategically merging regions from two images, generating diverse and structurally consistent samples with enhanced control. Extensive experiments on NYU-Depth-V2 and KITTI datasets demonstrate the superiority of our proposed techniques in indoor and outdoor environments.
SD Sep 15, 2023
Syn-Att: Synthetic Speech Attribution via Semi-Supervised Unknown Multi-Class Ensemble of CNNs
Md Awsafur Rahman, Bishmoy Paul, Najibul Haque Sarker et al.
With the huge technological advances introduced by deep learning in audio and speech processing, many novel synthetic speech techniques have achieved incredibly realistic results. As these methods generate realistic fake human voices, they can be used in malicious acts such as impersonation, spreading fake news, spoofing, and media manipulation. Hence, the ability to distinguish synthetic from natural speech has become an urgent necessity. Moreover, being able to tell which algorithm has been used to generate a synthetic speech track can be of preeminent importance to track down the culprit. In this paper, a novel strategy is proposed to attribute a synthetic speech track to the generator that is used to synthesize it. The proposed detector transforms the audio into a log-mel spectrogram, extracts features using a CNN, and classifies it between five known and unknown algorithms, utilizing semi-supervision and ensembling to improve its robustness and generalizability significantly. The proposed detector is validated on two evaluation datasets consisting of a total of 18,000 weakly perturbed (Eval 1) and 10,000 strongly perturbed (Eval 2) synthetic speeches. The proposed method outperforms other top teams in accuracy by 12-13% on Eval 2 and 1-2% on Eval 1, in the IEEE SP Cup challenge at ICASSP 2022.
CV Dec 8, 2022
A Novel Hierarchical-Classification-Block Based Convolutional Neural Network for Source Camera Model Identification
Mohammad Zunaed, Shaikh Anowarul Fattah
Digital security has been an active area of research interest due to the rapid adaptation of internet infrastructure, the increasing popularity of social media, and digital cameras. Due to inherent differences in working principles to generate an image, different camera brands left behind different intrinsic processing noises which can be used to identify the camera brand. In the last decade, many signal processing and deep learning-based methods have been proposed to identify and isolate this noise from the scene details in an image to detect the source camera brand. One prominent solution is to utilize a hierarchical classification system rather than the traditional single-classifier approach. Different individual networks are used for brand-level and model-level source camera identification. This approach allows for better scaling and requires minimal modifications for adding a new camera brand/model to the solution. However, using different full-fledged networks for both brand and model-level classification substantially increases memory consumption and training complexity. Moreover, extracted low-level features from the different network's initial layers often coincide, resulting in redundant weights. To mitigate the training and memory complexity, we propose a classifier-block-level hierarchical system instead of a network-level one for source camera model classification. Our proposed approach not only results in significantly fewer parameters but also retains the capability to add a new camera model with minimal modification. Thorough experimentation on the publicly available Dresden dataset shows that our proposed approach can achieve the same level of state-of-the-art performance but requires fewer parameters compared to a state-of-the-art network-level hierarchical-based system.
CV Feb 6, 2025
An Optimized YOLOv5 Based Approach For Real-time Vehicle Detection At Road Intersections Using Fisheye Cameras
Md. Jahin Alam, Muhammad Zubair Hasan, Md Maisoon Rahman et al.
Real-time vehicle detection is a challenging task for urban traffic surveillance. Increasing urbanization leads to more accidents and traffic congestion in junction areas, resulting in delayed travel time. To solve these problems, an intelligent system for automatic detection and tracking is important. This becomes a challenging task at road intersections, which require a wide field of view. For this reason, fisheye cameras are widely used in real-time vehicle detection to provide large area coverage and a 360-degree view at junctions. However, they introduce challenges such as light glare from vehicles and street lights, shadows, non-linear distortion, scaling issues of vehicles, and proper localization of small vehicles. To overcome each of these challenges, a modified YOLOv5 object detection scheme is proposed. YOLOv5 is a deep learning-oriented convolutional neural network (CNN) based object detection method. The proposed scheme for detecting vehicles in fisheye images includes a lightweight day-night CNN classifier so that two different solutions can be implemented to address the day-night detection issues. Furthermore, challenging instances are upsampled in the dataset for proper localization of vehicles, and later the detection model is ensembled and trained on different combinations of vehicle datasets for better generalization, detection, and accuracy. For testing, a real-world fisheye dataset provided by the Video and Image Processing (VIP) Cup organizer ISSD has been used, which includes images from video clips of different fisheye cameras at junctions of different cities during day and night time. Experimental results show that our proposed model has outperformed the YOLOv5 model on the dataset by 13.7% mAP@0.5.
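The mAP@0.5 metric counts a predicted box as a match only when its intersection over union (IoU) with a ground-truth box is at least 0.5. A minimal sketch of that matching test follows, assuming boxes are given as `(x1, y1, x2, y2)` corner coordinates. The helper names are illustrative, and the full mAP computation (ranking by confidence, precision-recall integration) is omitted.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap extent along each axis, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def matches_at_threshold(pred_box, gt_box, threshold=0.5):
    """True when the prediction overlaps the ground truth enough to count."""
    return iou(pred_box, gt_box) >= threshold


ground_truth = (0, 0, 10, 10)
print(matches_at_threshold((2, 0, 12, 10), ground_truth))  # True, IoU = 80/120
print(matches_at_threshold((5, 0, 15, 10), ground_truth))  # False, IoU = 50/150
```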
CV Oct 20, 2025
Towards Explainable Skin Cancer Classification: A Dual-Network Attention Model with Lesion Segmentation and Clinical Metadata Fusion
Md. Enamul Atiq, Shaikh Anowarul Fattah
Skin cancer is a life-threatening disease where early detection significantly improves patient outcomes. Automated diagnosis from dermoscopic images is challenging due to high intra-class variability and subtle inter-class differences. Many deep learning models operate as "black boxes," limiting clinical trust. In this work, we propose a dual-encoder attention-based framework that leverages both segmented lesions and clinical metadata to enhance skin lesion classification in terms of both accuracy and interpretability. A novel Deep-UNet architecture with Dual Attention Gates (DAG) and Atrous Spatial Pyramid Pooling (ASPP) is first employed to segment lesions. The classification stage uses two DenseNet201 encoders, one on the original image and another on the segmented lesion, whose features are fused via multi-head cross-attention. This dual-input design guides the model to focus on salient pathological regions. In addition, a transformer-based module incorporates patient metadata (age, sex, lesion site) into the prediction. We evaluate our approach on the HAM10000 dataset and the ISIC 2018 and 2019 challenges. The proposed method achieves state-of-the-art segmentation performance and significantly improves classification accuracy and average AUC compared to baseline models. To validate our model's reliability, we use Gradient-weighted Class Activation Mapping (Grad-CAM) to generate heatmaps. These visualizations confirm that our model's predictions are based on the lesion area, unlike models that rely on spurious background features. These results demonstrate that integrating precise lesion segmentation and clinical data with attention-based fusion leads to a more accurate and interpretable skin cancer classification model.
LG Aug 12, 2025
Load Forecasting on A Highly Sparse Electrical Load Dataset Using Gaussian Interpolation
Chinmoy Biswas, Nafis Faisal, Vivek Chowdhury et al.
Sparsity, defined as the presence of missing or zero values in a dataset, often poses a major challenge while operating on real-life datasets. Sparsity in the features or target data of the training dataset can be handled using various interpolation methods, such as linear or polynomial interpolation, spline, or moving average, or the missing values can simply be imputed. Interpolation methods usually perform well with Strict Sense Stationary (SSS) data. In this study, we show that an approximately 62% sparse dataset with hourly load data of a power plant can be utilized for load forecasting, assuming the data is Wide Sense Stationary (WSS), if augmented with Gaussian interpolation. More specifically, we perform statistical analysis on the data and train multiple machine learning and deep learning models on the dataset. By comparing the performance of these models, we empirically demonstrate that Gaussian interpolation is a suitable option for dealing with load forecasting problems. Additionally, we demonstrate that a Long Short-Term Memory (LSTM)-based neural network model offers the best performance among a diverse set of classical and neural network-based models.
QUANT-PH Jul 2, 2025
Selective Feature Re-Encoded Quantum Convolutional Neural Network with Joint Optimization for Image Classification
Shaswata Mahernob Sarkar, Sheikh Iftekhar Ahmed, Jishnu Mahmud et al.
Quantum Machine Learning (QML) has seen significant advancements, driven by recent improvements in Noisy Intermediate-Scale Quantum (NISQ) devices. Leveraging quantum principles such as entanglement and superposition, quantum convolutional neural networks (QCNNs) have demonstrated promising results in classifying both quantum and classical data. This study examines QCNNs in the context of image classification and proposes a novel strategy to enhance feature processing and a QCNN architecture for improved classification accuracy. First, a selective feature re-encoding strategy is proposed, which directs the quantum circuits to prioritize the most informative features, thereby effectively navigating the crucial regions of the Hilbert space to find the optimal solution space. Secondly, a novel parallel-mode QCNN architecture is designed to simultaneously incorporate features extracted by two classical methods, Principal Component Analysis (PCA) and Autoencoders, within a unified training scheme. The joint optimization involved in the training process allows the QCNN to benefit from complementary feature representations, enabling better mutual readjustment of model parameters. To assess these methodologies, comprehensive experiments have been performed using the widely used MNIST and Fashion MNIST datasets for binary classification tasks. Experimental findings reveal that the selective feature re-encoding method significantly improves the quantum circuit's feature processing capability and performance. Furthermore, the jointly optimized parallel QCNN architecture consistently outperforms the individual QCNN models and the traditional ensemble approach involving independent learning followed by decision fusion, confirming its superior accuracy and generalization capabilities.
CV Jun 5, 2024
Npix2Cpix: A GAN-Based Image-to-Image Translation Network With Retrieval-Classification Integration for Watermark Retrieval From Historical Document Images
Utsab Saha, Sawradip Saha, Shaikh Anowarul Fattah et al.
The identification and restoration of ancient watermarks have long been a major topic in codicology and history. Classifying historical documents based on watermarks is challenging due to their diversity, noisy samples, multiple representation modes, and minor distinctions between classes and intra-class variations. This paper proposes a modified U-net-based conditional generative adversarial network (GAN) named Npix2Cpix to translate noisy raw historical watermarked images into clean, handwriting-free watermarked images by performing image translation from degraded (noisy) pixels to clean pixels. Using image-to-image translation and adversarial learning, the network creates clutter-free images for watermark restoration and categorization. The generator and discriminator of the proposed GAN are trained using two separate loss functions, each based on the distance between images, to learn the mapping from the input noisy image to the output clean image. After using the proposed GAN to pre-process noisy watermarked images, Siamese-based one-shot learning is employed for watermark classification. Experimental results on a large-scale historical watermark dataset demonstrate that cleaning the noisy watermarked images can help to achieve high one-shot classification accuracy. The qualitative and quantitative evaluation of the retrieved watermarked image highlights the effectiveness of the proposed approach.
LG Feb 25, 2024
A Machine Learning Approach to Detect Customer Satisfaction From Multiple Tweet Parameters
Md Mahmudul Hasan, Shaikh Anowarul Fattah
Since internet technologies have advanced, one of the primary factors in company development is customer happiness. Online platforms have become prominent places for sharing reviews. Twitter is one of these platforms where customers frequently post their thoughts. Reviews of flights on these platforms have become a concern for the airline business. A positive review can help the company grow, while a negative one can quickly ruin its revenue and reputation. So it's vital for airline businesses to examine the feedback and experiences of their customers and enhance their services to remain competitive. But studying thousands of tweets and analyzing them to find the satisfaction of the customer is quite a difficult task. This tedious process can be made easier by using a machine learning approach to analyze tweets to determine client satisfaction levels. Some work has already been done on this strategy to automate the procedure using machine learning and deep learning techniques. However, they are all purely concerned with assessing the text's sentiment. In addition to the text, the tweet also includes the time, location, username, airline name, and so on. This additional information can be crucial for improving the model's outcome. To provide a machine learning based solution, this work has broadened its perspective to include these qualities. And it has come as no surprise that the additional features beyond text sentiment analysis produce better outcomes in machine learning based models.
SP Jan 3, 2021
A Novel Multi-Stage Training Approach for Human Activity Recognition from Multimodal Wearable Sensor Data Using Deep Neural Network
Tanvir Mahmud, A. Q. M. Sazzad Sayyed, Shaikh Anowarul Fattah et al.
Deep neural network is an effective choice to automatically recognize human actions utilizing data from various wearable sensors. These networks automate the process of feature extraction relying completely on data. However, various noises in time series data with complex inter-modal relationships among sensors make this process more complicated. In this paper, we have proposed a novel multi-stage training approach that increases diversity in this feature extraction process to make accurate recognition of actions by combining varieties of features extracted from diverse perspectives. Initially, instead of using single type of transformation, numerous transformations are employed on time series data to obtain variegated representations of the features encoded in raw data. An efficient deep CNN architecture is proposed that can be individually trained to extract features from different transformed spaces. Later, these CNN feature extractors are merged into an optimal architecture finely tuned for optimizing diversified extracted features through a combined training stage or multiple sequential training stages. This approach offers the opportunity to explore the encoded features in raw sensor data utilizing multifarious observation windows with immense scope for efficient selection of features for final convergence. Extensive experimentations have been carried out in three publicly available datasets that provide outstanding performance consistently with average five-fold cross-validation accuracy of 99.29% on UCI HAR database, 99.02% on USC HAR database, and 97.21% on SKODA database outperforming other state-of-the-art approaches.
IV Jan 3, 2021
CovTANet: A Hybrid Tri-level Attention Based Network for Lesion Segmentation, Diagnosis, and Severity Prediction of COVID-19 Chest CT Scans
Tanvir Mahmud, Md. Jahin Alam, Sakib Chowdhury et al.
Rapid and precise diagnosis of COVID-19 is one of the major challenges faced by the global community to control the spread of this overgrowing pandemic. In this paper, a hybrid neural network is proposed, named CovTANet, to provide an end-to-end clinical diagnostic tool for early diagnosis, lesion segmentation, and severity prediction of COVID-19 utilizing chest computed tomography (CT) scans. A multi-phase optimization strategy is introduced for solving the challenges of complicated diagnosis at a very early stage of infection, where an efficient lesion segmentation network is optimized initially and later integrated into a joint optimization framework for the diagnosis and severity prediction tasks, providing feature enhancement of the infected regions. Moreover, for overcoming the challenges with diffused, blurred, and varying shaped edges of COVID lesions with novel and diverse characteristics, a novel segmentation network is introduced, namely Tri-level Attention-based Segmentation Network (TA-SegNet). This network has significantly reduced semantic gaps in subsequent encoding-decoding stages, with immense parallelization of multi-scale features for faster convergence, providing considerable performance improvement over traditional networks. Furthermore, a novel tri-level attention mechanism has been introduced, which is repeatedly utilized over the network, combining channel, spatial, and pixel attention schemes for faster and more efficient generalization of contextual information embedded in the feature map through feature re-calibration and enhancement operations. Outstanding performance has been achieved in all three tasks through extensive experimentation on a large publicly available dataset containing 1110 chest CT volumes, which signifies the effectiveness of the proposed scheme at the current stage of the pandemic.
CV Dec 9, 2020
Automatic Diagnosis of Malaria from Thin Blood Smear Images using Deep Convolutional Neural Network with Multi-Resolution Feature Fusion
Tanvir Mahmud, Shaikh Anowarul Fattah
Malaria, a life-threatening disease, infects millions of people every year throughout the world, demanding faster diagnosis for proper treatment before any damage occurs. In this paper, an end-to-end deep learning-based approach is proposed for faster diagnosis of malaria from thin blood smear images by making efficient optimizations of features extracted from diversified receptive fields. Firstly, an efficient, highly scalable deep neural network, named DilationNet, is proposed that incorporates features from a large spectrum by varying dilation rates of convolutions to extract features from different receptive areas. Next, the raw images are resampled to various resolutions to introduce variations in the receptive fields that are used for independently optimizing different forms of DilationNet scaled for different resolutions of images. Afterward, a feature fusion scheme is introduced with the proposed DeepFusionNet architecture for jointly optimizing the feature space of these individually trained networks operating on different levels of observations. All the convolutional layers of various forms of DilationNets that are optimized to extract spatial features from different resolutions of images are directly transferred to provide a variegated feature space from any image. Later, joint optimization of these spatial features is carried out in the DeepFusionNet to extract the most relevant representation of the sample image. This scheme offers the opportunity to explore the feature space extensively by varying the observation level to accurately diagnose the abnormality. Extensive experiments on a publicly available dataset show outstanding performance with accuracy over 99.5%, outperforming other state-of-the-art approaches.
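The link between dilation rate and receptive area can be made concrete with the standard formula for a dilated kernel's effective size, d * (k - 1) + 1. The sketch below applies it to a hypothetical stack of stride-1 layers. The layer settings are illustrative assumptions and are not DilationNet's actual configuration.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span covered by a dilated kernel along one dimension."""
    return dilation * (kernel_size - 1) + 1


def stacked_receptive_field(layers):
    """Receptive field of stride-1 convolutions given (kernel_size, dilation) pairs."""
    receptive_field = 1
    for kernel_size, dilation in layers:
        receptive_field += effective_kernel_size(kernel_size, dilation) - 1
    return receptive_field


# A 3-wide kernel covers 3, 5, and 9 positions at dilation rates 1, 2, and 4.
for rate in (1, 2, 4):
    print(rate, effective_kernel_size(3, rate))

# Stacking the three layers gives a receptive field of 1 + 2 + 4 + 8 = 15.
print(stacked_receptive_field([(3, 1), (3, 2), (3, 4)]))
```

Varying the dilation rate therefore widens the area each feature sees without adding parameters, since the kernel still has the same number of weights.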
IV Dec 2, 2020
CovSegNet: A Multi Encoder-Decoder Architecture for Improved Lesion Segmentation of COVID-19 Chest CT Scans
Tanvir Mahmud, Md Awsafur Rahman, Shaikh Anowarul Fattah et al.
Automatic lung lesions segmentation of chest CT scans is considered a pivotal stage towards accurate diagnosis and severity measurement of COVID-19. Traditional U-shaped encoder-decoder architecture and its variants suffer from diminutions of contextual information in pooling/upsampling operations with increased semantic gaps among encoded and decoded feature maps as well as instigate vanishing gradient problems for its sequential gradient propagation that result in sub-optimal performance. Moreover, operating with 3D CT-volume poses further limitations due to the exponential increase of computational complexity making the optimization difficult. In this paper, an automated COVID-19 lesion segmentation scheme is proposed utilizing a highly efficient neural network architecture, namely CovSegNet, to overcome these limitations. Additionally, a two-phase training scheme is introduced where a deeper 2D-network is employed for generating ROI-enhanced CT-volume followed by a shallower 3D-network for further enhancement with more contextual information without increasing computational burden. Along with the traditional vertical expansion of Unet, we have introduced horizontal expansion with multi-stage encoder-decoder modules for achieving optimum performance. Additionally, multi-scale feature maps are integrated into the scale transition process to overcome the loss of contextual information. Moreover, a multi-scale fusion module is introduced with a pyramid fusion scheme to reduce the semantic gaps between subsequent encoder/decoder modules while facilitating the parallel optimization for efficient gradient propagation. Outstanding performances have been achieved in three publicly available datasets that largely outperform other state-of-the-art approaches. The proposed scheme can be easily extended for achieving optimum segmentation performances in a wide variety of applications.