Amir Hussain

CV
h-index24
66papers
4,556citations
Novelty42%
AI Score55

66 Papers

CLSep 27, 2022Code
WikiDes: A Wikipedia-Based Dataset for Generating Short Descriptions from Paragraphs

Hoang Thang Ta, Abu Bakar Siddiqur Rahman, Navonil Majumder et al.

As free online encyclopedias with massive volumes of content, Wikipedia and Wikidata are key to many Natural Language Processing (NLP) tasks, such as information retrieval, knowledge base building, machine translation, text classification, and text summarization. In this paper, we introduce WikiDes, a novel dataset to generate short descriptions of Wikipedia articles for the problem of text summarization. The dataset consists of over 80k English samples on 6987 topics. We set up a two-phase summarization method - description generation (Phase I) and candidate ranking (Phase II) - as a strong approach that relies on transfer and contrastive learning. For description generation, T5 and BART show their superiority compared to other small-scale pre-trained models. By applying contrastive learning with the diverse input from beam search, the metric fusion-based ranking models outperform the direct description generation models significantly up to 22 ROUGE in topic-exclusive split and topic-independent split. Furthermore, the outcome descriptions in Phase II are supported by human evaluation in over 45.33% chosen compared to 23.66% in Phase I against the gold descriptions. In the aspect of sentiment analysis, the generated descriptions cannot effectively capture all sentiment polarities from paragraphs while doing this task better from the gold descriptions. The automatic generation of new descriptions reduces the human efforts in creating them and enriches Wikidata-based knowledge graphs. Our paper shows a practical impact on Wikipedia and Wikidata since there are thousands of missing descriptions. Finally, we expect WikiDes to be a useful dataset for related works in capturing salient information from short paragraphs. The curated dataset is publicly available at: https://github.com/declare-lab/WikiDes.

CVApr 8, 2022Code
From 2D Images to 3D Model:Weakly Supervised Multi-View Face Reconstruction with Deep Fusion

Weiguang Zhao, Chaolong Yang, Jianan Ye et al. · nvidia

While weakly supervised multi-view face reconstruction (MVR) is garnering increased attention, one critical issue still remains open: how to effectively interact and fuse multiple image information to reconstruct high-precision 3D models. In this regard, we propose a novel pipeline called Deep Fusion MVR (DF-MVR) to explore the feature correspondences between multi-view images and reconstruct high-precision 3D faces. Specifically, we present a novel multi-view feature fusion backbone that utilizes face masks to align features from multiple encoders and integrates one multi-layer attention mechanism to enhance feature interaction and fusion, resulting in one unified facial representation. Additionally, we develop one concise face mask mechanism that facilitates multi-view feature fusion and facial reconstruction by identifying common areas and guiding the network's focus on critical facial features (e.g., eyes, brows, nose, and mouth). Experiments on Pixel-Face and Bosphorus datasets indicate the superiority of our model. Without 3D annotation, DF-MVR achieves 5.2% and 3.0% RMSE improvement over the existing weakly supervised MVRs respectively on Pixel-Face and Bosphorus dataset. Code will be available publicly at https://github.com/weiguangzhao/DF_MVR.

HCJul 28, 2023
Beyond Reality: The Pivotal Role of Generative AI in the Metaverse

Vinay Chamola, Gaurang Bansal, Tridib Kumar Das et al.

Imagine stepping into a virtual world that's as rich, dynamic, and interactive as our physical one. This is the promise of the Metaverse, and it's being brought to life by the transformative power of Generative Artificial Intelligence (AI). This paper offers a comprehensive exploration of how generative AI technologies are shaping the Metaverse, transforming it into a dynamic, immersive, and interactive virtual world. We delve into the applications of text generation models like ChatGPT and GPT-3, which are enhancing conversational interfaces with AI-generated characters. We explore the role of image generation models such as DALL-E and MidJourney in creating visually stunning and diverse content. We also examine the potential of 3D model generation technologies like Point-E and Lumirithmic in creating realistic virtual objects that enrich the Metaverse experience. But the journey doesn't stop there. We also address the challenges and ethical considerations of implementing these technologies in the Metaverse, offering insights into the balance between user control and AI automation. This paper is not just a study, but a guide to the future of the Metaverse, offering readers a roadmap to harnessing the power of generative AI in creating immersive virtual worlds.

IRJul 2, 2023
Filter Bubbles in Recommender Systems: Fact or Fallacy -- A Systematic Review

Qazi Mohammad Areeb, Mohammad Nadeem, Shahab Saquib Sohail et al.

A filter bubble refers to the phenomenon where Internet customization effectively isolates individuals from diverse opinions or materials, resulting in their exposure to only a select set of content. This can lead to the reinforcement of existing attitudes, beliefs, or conditions. In this study, our primary focus is to investigate the impact of filter bubbles in recommender systems. This pioneering research aims to uncover the reasons behind this problem, explore potential solutions, and propose an integrated tool to help users avoid filter bubbles in recommender systems. To achieve this objective, we conduct a systematic literature review on the topic of filter bubbles in recommender systems. The reviewed articles are carefully analyzed and classified, providing valuable insights that inform the development of an integrated approach. Notably, our review reveals evidence of filter bubbles in recommendation systems, highlighting several biases that contribute to their existence. Moreover, we propose mechanisms to mitigate the impact of filter bubbles and demonstrate that incorporating diversity into recommendations can potentially help alleviate this issue. The findings of this timely review will serve as a benchmark for researchers working in interdisciplinary fields such as privacy, artificial intelligence ethics, and recommendation systems. Furthermore, it will open new avenues for future research in related domains, prompting further exploration and advancement in this critical area.

CLAug 27, 2024
Negation Blindness in Large Language Models: Unveiling the NO Syndrome in Image Generation

Mohammad Nadeem, Shahab Saquib Sohail, Erik Cambria et al.

Foundational Large Language Models (LLMs) have changed the way we perceive technology. They have been shown to excel in tasks ranging from poem writing and coding to essay generation and puzzle solving. With the incorporation of image generation capability, they have become more comprehensive and versatile AI tools. At the same time, researchers are striving to identify the limitations of these tools to improve them further. Currently identified flaws include hallucination, biases, and bypassing restricted commands to generate harmful content. In the present work, we have identified a fundamental limitation related to the image generation ability of LLMs, and termed it The NO Syndrome. This negation blindness refers to LLMs inability to correctly comprehend NO related natural language prompts to generate the desired images. Interestingly, all tested LLMs including GPT-4, Gemini, and Copilot were found to be suffering from this syndrome. To demonstrate the generalization of this limitation, we carried out simulation experiments and conducted entropy-based and benchmark statistical analysis tests on various LLMs in multiple languages, including English, Hindi, and French. We conclude that the NO syndrome is a significant flaw in current LLMs that needs to be addressed. A related finding of this study showed a consistent discrepancy between image and textual responses as a result of this NO syndrome. We posit that the introduction of a negation context-aware reinforcement learning based feedback loop between the LLMs textual response and generated image could help ensure the generated text is based on both the LLMs correct contextual understanding of the negation query and the generated visual output.

SDJun 6, 2022
Canonical Cortical Graph Neural Networks and its Application for Speech Enhancement in Audio-Visual Hearing Aids

Leandro A. Passos, João Paulo Papa, Amir Hussain et al.

Despite the recent success of machine learning algorithms, most models face drawbacks when considering more complex tasks requiring interaction between different sources, such as multimodal input data and logical time sequences. On the other hand, the biological brain is highly sharpened in this sense, empowered to automatically manage and integrate such streams of information. In this context, this work draws inspiration from recent discoveries in brain cortical circuits to propose a more biologically plausible self-supervised machine learning approach. This combines multimodal information using intra-layer modulations together with Canonical Correlation Analysis, and a memory mechanism to keep track of temporal data, the overall approach termed Canonical Cortical Graph Neural networks. This is shown to outperform recent state-of-the-art models in terms of clean audio reconstruction and energy efficiency for a benchmark audio-visual speech dataset. The enhanced performance is demonstrated through a reduced and smother neuron firing rate distribution. suggesting that the proposed model is amenable for speech enhancement in future audio-visual hearing aid devices.

SDSep 3, 2024Code
LSTMSE-Net: Long Short Term Speech Enhancement Network for Audio-visual Speech Enhancement

Arnav Jain, Jasmer Singh Sanjotra, Harshvardhan Choudhary et al.

In this paper, we propose long short term memory speech enhancement network (LSTMSE-Net), an audio-visual speech enhancement (AVSE) method. This innovative method leverages the complementary nature of visual and audio information to boost the quality of speech signals. Visual features are extracted with VisualFeatNet (VFN), and audio features are processed through an encoder and decoder. The system scales and concatenates visual and audio features, then processes them through a separator network for optimized speech enhancement. The architecture highlights advancements in leveraging multi-modal data and interpolation techniques for robust AVSE challenge systems. The performance of LSTMSE-Net surpasses that of the baseline model from the COG-MHEAR AVSE Challenge 2024 by a margin of 0.06 in scale-invariant signal-to-distortion ratio (SISDR), $0.03$ in short-time objective intelligibility (STOI), and $1.32$ in perceptual evaluation of speech quality (PESQ). The source code of the proposed LSTMSE-Net is available at \url{https://github.com/mtanveer1/AVSEC-3-Challenge}.

NEOct 24, 2022
Unlocking the potential of two-point cells for energy-efficient and resilient training of deep nets

Ahsan Adeel, Adewale Adetomi, Khubaib Ahmed et al.

Context-sensitive two-point layer 5 pyramidal cells (L5PCs) were discovered as long ago as 1999. However, the potential of this discovery to provide useful neural computation has yet to be demonstrated. Here we show for the first time how a transformative L5PCs-driven deep neural network (DNN), termed the multisensory cooperative computing (MCC) architecture, can effectively process large amounts of heterogeneous real-world audio-visual (AV) data, using far less energy compared to best available 'point' neuron-driven DNNs. A novel highly-distributed parallel implementation on a Xilinx UltraScale+ MPSoC device estimates energy savings up to 245759 $ \times $ 50000 $μ$J (i.e., 62% less than the baseline model in a semi-supervised learning setup) where a single synapse consumes $8e^{-5}μ$J. In a supervised learning setup, the energy-saving can potentially reach up to 1250x less (per feedforward transmission) than the baseline model. The significantly reduced neural activity in MCC leads to inherently fast learning and resilience against sudden neural damage. This remarkable performance in pilot experiments demonstrates the embodied neuromorphic intelligence of our proposed cooperative L5PC that receives input from diverse neighbouring neurons as context to amplify the transmission of most salient and relevant information for onward transmission, from overwhelmingly large multimodal information utilised at the early stages of on-chip training. Our proposed approach opens new cross-disciplinary avenues for future on-chip DNN training implementations and posits a radical shift in current neuromorphic computing paradigms.

SDApr 19
Audio-Visual Speech Enhancement: Architectural Design and Deployment Strategies

Anis Hamadouche, Haifeng Luo, Mathini Sellathurai et al.

Real-time audio-visual speech enhancement (AVSE) is a key enabler for immersive and interactive multimedia services, yet its performance is tightly constrained by network latency, uplink capacity, and computational delay. This paper presents the design, deployment, and evaluation of a complete cloud-edge-assisted AVSE system operating over a public 5G edge network. The system integrates CNN-based acoustic enhancement and OpenCV-based facial feature extraction with an LSTM fusion network to preserve temporal coherence, and is deployed on a Vodafone-compatible AWS Wavelength edge cloud. Through extensive stress testing, we analyze end-to-end performance under varying network load and adaptive multimedia profiles. Results show that compute placement at the network edge is critical for meeting real-time coherence constraints, and that uplink capacity is often the dominant bottleneck for interactive AVSE services. Only 5G and wired Ethernet consistently satisfied the required communication delay bound for uncompressed audio-video chunks, while aggressive compression reduced payload sizes by up to 80% with negligible perceptual degradation, enabling robust operation under constrained conditions. We further demonstrate a fundamental trade-off between processing latency and enhancement quality, where reduced model complexity lowers delay but degrades reconstruction performance in low-SNR scenarios. Our findings indicate that public 5G edge environments can sustain real-time, interactive AVSE workloads when network and compute resources are carefully orchestrated, although performance margins remain tighter than in dedicated infrastructures. The architectural insights derived from this study provide practical guidelines for the design of delay-sensitive multimedia and perceptual enhancement services on emerging 5G edge-cloud platforms.

CVDec 13, 2022
Towards Deeper and Better Multi-view Feature Fusion for 3D Semantic Segmentation

Chaolong Yang, Yuyao Yan, Weiguang Zhao et al.

3D point clouds are rich in geometric structure information, while 2D images contain important and continuous texture information. Combining 2D information to achieve better 3D semantic segmentation has become mainstream in 3D scene understanding. Albeit the success, it still remains elusive how to fuse and process the cross-dimensional features from these two distinct spaces. Existing state-of-the-art usually exploit bidirectional projection methods to align the cross-dimensional features and realize both 2D & 3D semantic segmentation tasks. However, to enable bidirectional mapping, this framework often requires a symmetrical 2D-3D network structure, thus limiting the network's flexibility. Meanwhile, such dual-task settings may distract the network easily and lead to over-fitting in the 3D segmentation task. As limited by the network's inflexibility, fused features can only pass through a decoder network, which affects model performance due to insufficient depth. To alleviate these drawbacks, in this paper, we argue that despite its simplicity, projecting unidirectionally multi-view 2D deep semantic features into the 3D space aligned with 3D deep semantic features could lead to better feature fusion. On the one hand, the unidirectional projection enforces our model focused more on the core task, i.e., 3D segmentation; on the other hand, unlocking the bidirectional to unidirectional projection enables a deeper cross-domain semantic alignment and enjoys the flexibility to fuse better and complicated features from very different spaces. In joint 2D-3D approaches, our proposed method achieves superior performance on the ScanNetv2 benchmark for 3D semantic segmentation.

CVDec 12, 2023Code
Open-Pose 3D Zero-Shot Learning: Benchmark and Challenges

Weiguang Zhao, Guanyu Yang, Rui Zhang et al.

With the explosive 3D data growth, the urgency of utilizing zero-shot learning to facilitate data labeling becomes evident. Recently, methods transferring language or language-image pre-training models like Contrastive Language-Image Pre-training (CLIP) to 3D vision have made significant progress in the 3D zero-shot classification task. These methods primarily focus on 3D object classification with an aligned pose; such a setting is, however, rather restrictive, which overlooks the recognition of 3D objects with open poses typically encountered in real-world scenarios, such as an overturned chair or a lying teddy bear. To this end, we propose a more realistic and challenging scenario named open-pose 3D zero-shot classification, focusing on the recognition of 3D objects regardless of their orientation. First, we revisit the current research on 3D zero-shot classification, and propose two benchmark datasets specifically designed for the open-pose setting. We empirically validate many of the most popular methods in the proposed open-pose benchmark. Our investigations reveal that most current 3D zero-shot classification models suffer from poor performance, indicating a substantial exploration room towards the new direction. Furthermore, we study a concise pipeline with an iterative angle refinement mechanism that automatically optimizes one ideal angle to classify these open-pose 3D objects. In particular, to make validation more compelling and not just limited to existing CLIP-based methods, we also pioneer the exploration of knowledge transfer based on Diffusion models. While the proposed solutions can serve as a new benchmark for open-pose 3D zero-shot classification, we discuss the complexities and challenges of this scenario that remain for further research development. The code is available publicly at https://github.com/weiguangzhao/Diff-OP3D.

SYApr 19
Stochastic Delayed Dynamics of Rumor Propagation with Awareness and Fact-Checking

Lamia Alyami, Anis Hamadouche, Amir Hussain

This paper presents a stochastic delayed differential model for rumor propagation during infodemic that incorporates human behavioral response, public skepticism and fact-checking mechanisms. A discrete time delay is introduced to model natural lags in information processing and institutional response. Additionally, we adopt additive stochastic perturbations to model random fluctuations in social interaction and exposure. We present a rigorous stability analysis of the proposed rumor transmission model and derive convergence guarantees under reproduction number conditions. We also validate the model by numerical simulations and analyze the outbreak severity and quantify uncertainty under variable information processing delays. The results highlight the importance of timely awareness and fact-checking interventions for mitigating misinformation spread during pandemics

SDOct 6, 2025Code
AUREXA-SE: Audio-Visual Unified Representation Exchange Architecture with Cross-Attention and Squeezeformer for Speech Enhancement

M. Sajid, Deepanshu Gupta, Yash Modi et al.

In this paper, we propose AUREXA-SE (Audio-Visual Unified Representation Exchange Architecture with Cross-Attention and Squeezeformer for Speech Enhancement), a progressive bimodal framework tailored for audio-visual speech enhancement (AVSE). AUREXA-SE jointly leverages raw audio waveforms and visual cues by employing a U-Net-based 1D convolutional encoder for audio and a Swin Transformer V2 for efficient and expressive visual feature extraction. Central to the architecture is a novel bidirectional cross-attention mechanism, which facilitates deep contextual fusion between modalities, enabling rich and complementary representation learning. To capture temporal dependencies within the fused embeddings, a stack of lightweight Squeezeformer blocks combining convolutional and attention modules is introduced. The enhanced embeddings are then decoded via a U-Net-style decoder for direct waveform reconstruction, ensuring perceptually consistent and intelligible speech output. Experimental evaluations demonstrate the effectiveness of AUREXA-SE, achieving significant performance improvements over noisy baselines, with STOI of 0.516, PESQ of 1.323, and SI-SDR of -4.322 dB. The source code of AUREXA-SE is available at https://github.com/mtanveer1/AVSEC-4-Challenge-2025.

QMFeb 28, 2020Code
Deep Learning in Mining Biological Data

Mufti Mahmud, M Shamim Kaiser, Amir Hussain

Recent technological advancements in data acquisition tools allowed life scientists to acquire multimodal data from different biological application domains. Broadly categorized in three types (i.e., sequences, images, and signals), these data are huge in amount and complex in nature. Mining such an enormous amount of data for pattern recognition is a big challenge and requires sophisticated data-intensive machine learning techniques. Artificial neural network-based learning systems are well known for their pattern recognition capabilities and lately their deep architectures - known as deep learning (DL) - have been successfully applied to solve many complex pattern recognition problems. Highlighting the role of DL in recognizing patterns in biological data, this article provides - applications of DL to biological sequences, images, and signals data; overview of open access sources of these data; description of open source DL tools applicable on these data; and comparison of these tools from qualitative and quantitative perspectives. At the end, it outlines some open research challenges in mining biological data and puts forward a number of possible future perspectives.

CVDec 30, 2024
Gender Bias in Text-to-Video Generation Models: A case study of Sora

Mohammad Nadeem, Shahab Saquib Sohail, Erik Cambria et al.

The advent of text-to-video generation models has revolutionized content creation as it produces high-quality videos from textual prompts. However, concerns regarding inherent biases in such models have prompted scrutiny, particularly regarding gender representation. Our study investigates the presence of gender bias in OpenAI's Sora, a state-of-the-art text-to-video generation model. We uncover significant evidence of bias by analyzing the generated videos from a diverse set of gender-neutral and stereotypical prompts. The results indicate that Sora disproportionately associates specific genders with stereotypical behaviors and professions, which reflects societal prejudices embedded in its training data.

CLJun 18, 2025
From RAG to Agentic: Validating Islamic-Medicine Responses with LLM Agents

Mohammad Amaan Sayeed, Mohammed Talha Alam, Raza Imam et al.

Centuries-old Islamic medical texts like Avicenna's Canon of Medicine and the Prophetic Tibb-e-Nabawi encode a wealth of preventive care, nutrition, and holistic therapies, yet remain inaccessible to many and underutilized in modern AI systems. Existing language-model benchmarks focus narrowly on factual recall or user preference, leaving a gap in validating culturally grounded medical guidance at scale. We propose a unified evaluation pipeline, Tibbe-AG, that aligns 30 carefully curated Prophetic-medicine questions with human-verified remedies and compares three LLMs (LLaMA-3, Mistral-7B, Qwen2-7B) under three configurations: direct generation, retrieval-augmented generation, and a scientific self-critique filter. Each answer is then assessed by a secondary LLM serving as an agentic judge, yielding a single 3C3H quality score. Retrieval improves factual accuracy by 13%, while the agentic prompt adds another 10% improvement through deeper mechanistic insight and safety considerations. Our results demonstrate that blending classical Islamic texts with retrieval and self-evaluation enables reliable, culturally sensitive medical question-answering.

ASFeb 15, 2025
NeuroAMP: A Novel End-to-end General Purpose Deep Neural Amplifier for Personalized Hearing Aids

Shafique Ahmed, Ryandhimas E. Zezario, Hui-Guan Yuan et al.

The prevalence of hearing aids is increasing. However, optimizing the amplification processes of hearing aids remains challenging due to the complexity of integrating multiple modular components in traditional methods. To address this challenge, we present NeuroAMP, a novel deep neural network designed for end-to-end, personalized amplification in hearing aids. NeuroAMP leverages both spectral features and the listener's audiogram as inputs, and we investigate four architectures: Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), Convolutional Recurrent Neural Network (CRNN), and Transformer. We also introduce Denoising NeuroAMP, an extension that integrates noise reduction along with amplification capabilities for improved performance in real-world scenarios. To enhance generalization, a comprehensive data augmentation strategy was employed during training on diverse speech (TIMIT and TMHINT) and music (Cadenza Challenge MUSIC) datasets. Evaluation using the Hearing Aid Speech Perception Index (HASPI), Hearing Aid Speech Quality Index (HASQI), and Hearing Aid Audio Quality Index (HAAQI) demonstrates that the Transformer architecture within NeuroAMP achieves the best performance, with SRCC scores of 0.9927 (HASQI) and 0.9905 (HASPI) on TIMIT, and 0.9738 (HAAQI) on the Cadenza Challenge MUSIC dataset. Notably, our data augmentation strategy maintains high performance on unseen datasets (e.g., VCTK, MUSDB18-HQ). Furthermore, Denoising NeuroAMP outperforms both the conventional NAL-R+WDRC approach and a two-stage baseline on the VoiceBank+DEMAND dataset, achieving a 10% improvement in both HASPI (0.90) and HASQI (0.59) scores. These results highlight the potential of NeuroAMP and Denoising NeuroAMP to deliver notable improvements in personalized hearing aid amplification.

AIAug 28, 2025
Bridging Minds and Machines: Toward an Integration of AI and Cognitive Science

Rui Mao, Qian Liu, Xiao Li et al.

Cognitive Science has profoundly shaped disciplines such as Artificial Intelligence (AI), Philosophy, Psychology, Neuroscience, Linguistics, and Culture. Many breakthroughs in AI trace their roots to cognitive theories, while AI itself has become an indispensable tool for advancing cognitive research. This reciprocal relationship motivates a comprehensive review of the intersections between AI and Cognitive Science. By synthesizing key contributions from both perspectives, we observe that AI progress has largely emphasized practical task performance, whereas its cognitive foundations remain conceptually fragmented. We argue that the future of AI within Cognitive Science lies not only in improving performance but also in constructing systems that deepen our understanding of the human mind. Promising directions include aligning AI behaviors with cognitive frameworks, situating AI in embodiment and culture, developing personalized cognitive models, and rethinking AI ethics through cognitive co-evaluation.

CVFeb 18, 2022
Towards Simple and Accurate Human Pose Estimation with Stair Network

Chenru Jiang, Kaizhu Huang, Shufei Zhang et al.

In this paper, we focus on tackling the precise keypoint coordinates regression task. Most existing approaches adopt complicated networks with a large number of parameters, leading to a heavy model with poor cost-effectiveness in practice. To overcome this limitation, we develop a small yet discrimicative model called STair Network, which can be simply stacked towards an accurate multi-stage pose estimation system. Specifically, to reduce computational cost, STair Network is composed of novel basic feature extraction blocks which focus on promoting feature diversity and obtaining rich local representations with fewer parameters, enabling a satisfactory balance on efficiency and performance. To further improve the performance, we introduce two mechanisms with negligible computational cost, focusing on feature fusion and replenish. We demonstrate the effectiveness of the STair Network on two standard datasets, e.g., 1-stage STair Network achieves a higher accuracy than HRNet by 5.5% on COCO test dataset with 80\% fewer parameters and 68% fewer GFLOPs.

SDFeb 11, 2022
A Novel Speech Intelligibility Enhancement Model based on CanonicalCorrelation and Deep Learning

Tassadaq Hussain, Muhammad Diyan, Mandar Gogate et al.

Current deep learning (DL) based approaches to speech intelligibility enhancement in noisy environments are often trained to minimise the feature distance between noise-free speech and enhanced speech signals. Despite improving the speech quality, such approaches do not deliver required levels of speech intelligibility in everyday noisy environments . Intelligibility-oriented (I-O) loss functions have recently been developed to train DL approaches for robust speech enhancement. Here, we formulate, for the first time, a novel canonical correlation based I-O loss function to more effectively train DL algorithms. Specifically, we present a canonical-correlation based short-time objective intelligibility (CC-STOI) cost function to train a fully convolutional neural network (FCN) model. We carry out comparative simulation experiments to show that our CC-STOI based speech enhancement framework outperforms state-of-the-art DL models trained with conventional distance-based and STOI-based loss functions, using objective and subjective evaluation measures for case of both unseen speakers and noises. Ongoing future work is evaluating the proposed approach for design of robust hearing-assistive technology.

CRFeb 11, 2022
A Novel Chaos-based Light-weight Image Encryption Scheme for Multi-modal Hearing Aids

Awais Aziz Shah, Ahsan Adeel, Jawad Ahmad et al.

Multimodal hearing aids (HAs) aim to deliver more intelligible audio in noisy environments by contextually sensing and processing data in the form of not only audio but also visual information (e.g. lip reading). Machine learning techniques can play a pivotal role for the contextually processing of multimodal data. However, since the computational power of HA devices is low, therefore this data must be processed either on the edge or cloud which, in turn, poses privacy concerns for sensitive user data. Existing literature proposes several techniques for data encryption but their computational complexity is a major bottleneck to meet strict latency requirements for development of future multi-modal hearing aids. To overcome this problem, this paper proposes a novel real-time audio/visual data encryption scheme based on chaos-based encryption using the Tangent-Delay Ellipse Reflecting Cavity-Map System (TD-ERCS) map and Non-linear Chaotic (NCA) Algorithm. The results achieved against different security parameters, including Correlation Coefficient, Unified Averaged Changed Intensity (UACI), Key Sensitivity Analysis, Number of Changing Pixel Rate (NPCR), Mean-Square Error (MSE), Peak Signal to Noise Ratio (PSNR), Entropy test, and Chi-test, indicate that the newly proposed scheme is more lightweight due to its lower execution time as compared to existing schemes and more secure due to increased key-space against modern brute-force attacks.

SDFeb 9, 2022
Multimodal Audio-Visual Information Fusion using Canonical-Correlated Graph Neural Network for Energy-Efficient Speech Enhancement

Leandro Aparecido Passos, João Paulo Papa, Javier Del Ser et al.

This paper proposes a novel multimodal self-supervised architecture for energy-efficient audio-visual (AV) speech enhancement that integrates Graph Neural Networks with canonical correlation analysis (CCA-GNN). The proposed approach lays its foundations on a state-of-the-art CCA-GNN that learns representative embeddings by maximizing the correlation between pairs of augmented views of the same input while decorrelating disconnected features. The key idea of the conventional CCA-GNN involves discarding augmentation-variant information and preserving augmentation-invariant information while preventing capturing of redundant information. Our proposed AV CCA-GNN model deals with multimodal representation learning context. Specifically, our model improves contextual AV speech processing by maximizing canonical correlation from augmented views of the same channel and canonical correlation from audio and visual embeddings. In addition, it proposes a positional node encoding that considers a prior-frame sequence distance instead of a feature-space representation when computing the node's nearest neighbors, introducing temporal information in the embeddings through the neighborhood's connectivity. Experiments conducted on the benchmark ChiME3 dataset show that our proposed prior frame-based AV CCA-GNN ensures better feature learning in the temporal context, leading to more energy-efficient speech reconstruction than state-of-the-art CCA-GNN and multilayer perceptron.

ASFeb 8, 2022
A Speech Intelligibility Enhancement Model based on Canonical Correlation and Deep Learning for Hearing-Assistive Technologies

Tassadaq Hussain, Muhammad Diyan, Mandar Gogate et al.

Current deep learning (DL) based approaches to speech intelligibility enhancement in noisy environments are generally trained to minimise the distance between clean and enhanced speech features. These often result in improved speech quality however they suffer from a lack of generalisation and may not deliver the required speech intelligibility in everyday noisy situations. In an attempt to address these challenges, researchers have explored intelligibility-oriented (I-O) loss functions to train DL approaches for robust speech enhancement (SE). In this paper, we formulate a novel canonical correlation-based I-O loss function to more effectively train DL algorithms. Specifically, we present a fully convolutional SE model that uses a modified canonical-correlation based short-time objective intelligibility (CC-STOI) metric as a training cost function. To the best of our knowledge, this is the first work that exploits the integration of canonical correlation in an I-O based loss function for SE. Comparative experimental results demonstrate that our proposed CC-STOI based SE framework outperforms DL models trained with conventional STOI and distance-based loss functions, in terms of both standard objective and subjective evaluation measures when dealing with unseen speakers and noises.

SPFeb 8, 2022
Design of Flexible Meander Line Antenna for Healthcare for Wireless Medical Body Area Networks

Shahid M Ali, Cheab Sovuthy, Sima Noghanian et al.

A flexible meander line monopole antenna (MMA) is presented in this paper. The antenna can be worn for on-and off-body applications. The overall dimension of the MMA is 37 mm x 50 mm x2.37 mm3. The MMA was manufactured and measured, and the results matched with simulation results. The MMA design shows a bandwidth of up to 1282.4 (450.5) MHz and provides gains of 3.03 (4.85) dBi in the lower and upper operating bands, respectively, showing omnidirectional radiation patterns in free space. While worn on the chest or arm, bandwidths as high as 688.9 (500.9) MHz and 1261.7 (524.2) MHz, and the gains of 3.80 (4.67) dBi and 3.00 (4.55) dBi were observed. The experimental measurements of the read range confirmed the results of the coverage range of up to 11 meters.

HCFeb 8, 2022
Hearing Loss, Cognitive Load and Dementia: An Overview of Interrelation, Detection and Monitoring Challenges with Wearable Non-invasive Microwave Sensors

Usman Anwar, Tughrul Arslan, Amir Hussain

This paper provides an overview of hearing loss effects on neurological function and progressive diseases; and explores the role of cognitive load monitoring to detect dementia. It also investigates the prospects of utilizing hearing aid technology to reverse cognitive decline and delay the onset of dementia, for the old age population. The interrelation between hearing loss, cognitive load and dementia is discussed. Future considerations for improvement with respect to robust diagnosis, user centricity, device accuracy and privacy for wider clinical practice is also explored. The review concludes by discussing the future scope and potential of designing practical wearable microwave technologies and evaluating their use in smart care homes setting.

ASJan 24, 2022
A Novel Temporal Attentive-Pooling based Convolutional Recurrent Architecture for Acoustic Signal Enhancement

Tassadaq Hussain, Wei-Chien Wang, Mandar Gogate et al.

In acoustic signal processing, the target signals usually carry semantic information, which is encoded in a hierarchal structure of short and long-term contexts. However, the background noise distorts these structures in a nonuniform way. The existing deep acoustic signal enhancement (ASE) architectures ignore this kind of local and global effect. To address this problem, we propose to integrate a novel temporal attentive-pooling (TAP) mechanism into a conventional convolutional recurrent neural network, termed as TAP-CRNN. The proposed approach considers both global and local attention for ASE tasks. Specifically, we first utilize a convolutional layer to extract local information of the acoustic signals and then a recurrent neural network (RNN) architecture is used to characterize temporal contextual information. Second, we exploit a novelattention mechanism to contextually process salient regions of the noisy signals. The proposed ASE system is evaluated using a benchmark infant cry dataset and compared with several well-known methods. It is shown that the TAPCRNN can more effectively reduce noise components from infant cry signals in unseen background noises at challenging signal-to-noise levels.

SDDec 16, 2021
Towards Robust Real-time Audio-Visual Speech Enhancement

Mandar Gogate, Kia Dashtipour, Amir Hussain

The human brain contextually exploits heterogeneous sensory information to efficiently perform cognitive tasks including vision and hearing. For example, during the cocktail party situation, the human auditory cortex contextually integrates audio-visual (AV) cues in order to better perceive speech. Recent studies have shown that AV speech enhancement (SE) models can significantly improve speech quality and intelligibility in very low signal to noise ratio (SNR) environments as compared to audio-only SE models. However, despite significant research in the area of AV SE, development of real-time processing models with low latency remains a formidable technical challenge. In this paper, we present a novel framework for low latency speaker-independent AV SE that can generalise on a range of visual and acoustic noises. In particular, a generative adversarial networks (GAN) is proposed to address the practical issue of visual imperfections in AV SE. In addition, we propose a deep neural network based real-time AV SE model that takes into account the cleaned visual speech output from GAN to deliver more robust SE. The proposed framework is evaluated on synthetic and real noisy AV corpora using objective speech quality and intelligibility metrics and subjective listing tests. Comparative simulation results show that our real time AV SE framework outperforms state-of-the-art SE approaches, including recent DNN based SE models.

SDNov 18, 2021
Towards Intelligibility-Oriented Audio-Visual Speech Enhancement

Tassadaq Hussain, Mandar Gogate, Kia Dashtipour et al.

Existing deep learning (DL) based speech enhancement approaches are generally optimised to minimise the distance between clean and enhanced speech features. These often result in improved speech quality however they suffer from a lack of generalisation and may not deliver the required speech intelligibility in real noisy situations. In an attempt to address these challenges, researchers have explored intelligibility-oriented (I-O) loss functions and integration of audio-visual (AV) information for more robust speech enhancement (SE). In this paper, we introduce DL based I-O SE algorithms exploiting AV information, which is a novel and previously unexplored research direction. Specifically, we present a fully convolutional AV SE model that uses a modified short-time objective intelligibility (STOI) metric as a training cost function. To the best of our knowledge, this is the first work that exploits the integration of AV modalities with an I-O based loss function for SE. Comparative experimental results demonstrate that our proposed I-O AV SE framework outperforms audio-only (AO) and AV models trained with conventional distance-based loss functions, in terms of standard objective evaluation measures when dealing with unseen speakers and noises.

IVNov 1, 2021
PointNu-Net: Keypoint-assisted Convolutional Neural Network for Simultaneous Multi-tissue Histology Nuclei Segmentation and Classification

Kai Yao, Kaizhu Huang, Jie Sun et al.

Automatic nuclei segmentation and classification play a vital role in digital pathology. However, previous works are mostly built on data with limited diversity and small sizes, making the results questionable or misleading in actual downstream tasks. In this paper, we aim to build a reliable and robust method capable of dealing with data from the 'the clinical wild'. Specifically, we study and design a new method to simultaneously detect, segment, and classify nuclei from Haematoxylin and Eosin (H&E) stained histopathology data, and evaluate our approach using the recent largest dataset: PanNuke. We address the detection and classification of each nuclei as a novel semantic keypoint estimation problem to determine the center point of each nuclei. Next, the corresponding class-agnostic masks for nuclei center points are obtained using dynamic instance segmentation. Meanwhile, we proposed a novel Joint Pyramid Fusion Module (JPFM) to model the cross-scale dependencies, thus enhancing the local feature for better nuclei detection and classification. By decoupling two simultaneous challenging tasks and taking advantage of JPFM, our method can benefit from class-aware detection and class-agnostic segmentation, thus leading to a significant performance boost. We demonstrate the superior performance of our proposed approach for nuclei segmentation and classification across 19 different tissue types, delivering new benchmark results.

CVJun 19, 2021
Cloud based Scalable Object Recognition from Video Streams using Orientation Fusion and Convolutional Neural Networks

Muhammad Usman Yaseen, Ashiq Anjum, Giancarlo Fortino et al.

Object recognition from live video streams comes with numerous challenges such as the variation in illumination conditions and poses. Convolutional neural networks (CNNs) have been widely used to perform intelligent visual object recognition. Yet, CNNs still suffer from severe accuracy degradation, particularly on illumination-variant datasets. To address this problem, we propose a new CNN method based on orientation fusion for visual object recognition. The proposed cloud-based video analytics system pioneers the use of bi-dimensional empirical mode decomposition to split a video frame into intrinsic mode functions (IMFs). We further propose these IMFs to endure Reisz transform to produce monogenic object components, which are in turn used for the training of CNNs. Past works have demonstrated how the object orientation component may be used to pursue accuracy levels as high as 93\%. Herein we demonstrate how a feature-fusion strategy of the orientation components leads to further improving visual recognition accuracy to 97\%. We also assess the scalability of our method, looking at both the number and the size of the video streams under scrutiny. We carry out extensive experimentation on the publicly available Yale dataset, including also a self generated video datasets, finding significant improvements (both in accuracy and scale), in comparison to AlexNet, LeNet and SE-ResNeXt, which are the three most commonly used deep learning models for visual object recognition and classification.

LGApr 29, 2021
FairDrop: Biased Edge Dropout for Enhancing Fairness in Graph Representation Learning

Indro Spinelli, Simone Scardapane, Amir Hussain et al.

Graph representation learning has become a ubiquitous component in many scenarios, ranging from social network analysis to energy forecasting in smart grids. In several applications, ensuring the fairness of the node (or graph) representations with respect to some protected attributes is crucial for their correct deployment. Yet, fairness in graph deep learning remains under-explored, with few solutions available. In particular, the tendency of similar nodes to cluster on several real-world graphs (i.e., homophily) can dramatically worsen the fairness of these procedures. In this paper, we propose a novel biased edge dropout algorithm (FairDrop) to counter-act homophily and improve fairness in graph representation learning. FairDrop can be plugged in easily on many existing algorithms, is efficient, adaptable, and can be combined with other fairness-inducing solutions. After describing the general algorithm, we demonstrate its application on two benchmark tasks, specifically, as a random walk model for producing node embeddings, and to a graph convolutional network for link prediction. We prove that the proposed algorithm can successfully improve the fairness of all models up to a small or negligible drop in accuracy, and compares favourably with existing state-of-the-art solutions. In an ablation study, we demonstrate that our algorithm can flexibly interpolate between biasing towards fairness and an unbiased edge dropout. Furthermore, to better evaluate the gains, we propose a new dyadic group definition to measure the bias of a link prediction task when paired with group-based fairness metrics. In particular, we extend the metric used to measure the bias in the node embeddings to take into account the graph structure.

LGApr 28, 2021
FastAdaBelief: Improving Convergence Rate for Belief-based Adaptive Optimizers by Exploiting Strong Convexity

Yangfan Zhou, Kaizhu Huang, Cheng Cheng et al.

AdaBelief, one of the current best optimizers, demonstrates superior generalization ability compared to the popular Adam algorithm by viewing the exponential moving average of observed gradients. AdaBelief is theoretically appealing in that it has a data-dependent $O(\sqrt{T})$ regret bound when objective functions are convex, where $T$ is a time horizon. It remains however an open problem whether the convergence rate can be further improved without sacrificing its generalization ability. %on how to exploit strong convexity to further improve the convergence rate of AdaBelief. To this end, we make a first attempt in this work and design a novel optimization algorithm called FastAdaBelief that aims to exploit its strong convexity in order to achieve an even faster convergence rate. In particular, by adjusting the step size that better considers strong convexity and prevents fluctuation, our proposed FastAdaBelief demonstrates excellent generalization ability as well as superior convergence. As an important theoretical contribution, we prove that FastAdaBelief attains a data-dependant $O(\log T)$ regret bound, which is substantially lower than AdaBelief. On the empirical side, we validate our theoretical analysis with extensive experiments in both scenarios of strong and non-strong convexity on three popular baseline models. Experimental results are very encouraging: FastAdaBelief converges the quickest in comparison to all mainstream algorithms while maintaining an excellent generalization ability, in cases of both strong or non-strong convexity. FastAdaBelief is thus posited as a new benchmark model for the research community.

LGApr 19, 2021
A New Class of Efficient Adaptive Filters for Online Nonlinear Modeling

Danilo Comminiello, Alireza Nezamdoust, Simone Scardapane et al.

Nonlinear models are known to provide excellent performance in real-world applications that often operate in non-ideal conditions. However, such applications often require online processing to be performed with limited computational resources. To address this problem, we propose a new class of efficient nonlinear models for online applications. The proposed algorithms are based on linear-in-the-parameters (LIP) nonlinear filters using functional link expansions. In order to make this class of functional link adaptive filters (FLAFs) efficient, we propose low-complexity expansions and frequency-domain adaptation of the parameters. Among this family of algorithms, we also define the partitioned-block frequency-domain FLAF, whose implementation is particularly suitable for online nonlinear modeling problems. We assess and compare frequency-domain FLAFs with different expansions providing the best possible tradeoff between performance and computational complexity. Experimental results prove that the proposed algorithms can be considered as an efficient and effective solution for online applications, such as the acoustic echo cancellation, even in the presence of adverse nonlinear conditions and with limited availability of computational resources.

CVMar 24, 2021
Learning Polar Encodings for Arbitrary-Oriented Ship Detection in SAR Images

Yishan He, Fei Gao, Jun Wang et al.

Common horizontal bounding box (HBB)-based methods are not capable of accurately locating slender ship targets with arbitrary orientations in synthetic aperture radar (SAR) images. Therefore, in recent years, methods based on oriented bounding box (OBB) have gradually received attention from researchers. However, most of the recently proposed deep learning-based methods for OBB detection encounter the boundary discontinuity problem in angle or key point regression. In order to alleviate this problem, researchers propose to introduce some manually set parameters or extra network branches for distinguishing the boundary cases, which make training more diffcult and lead to performance degradation. In this paper, in order to solve the boundary discontinuity problem in OBB regression, we propose to detect SAR ships by learning polar encodings. The encoding scheme uses a group of vectors pointing from the center of the ship target to the boundary points to represent an OBB. The boundary discontinuity problem is avoided by training and inference directly according to the polar encodings. In addition, we propose an Intersect over Union (IOU) -weighted regression loss, which further guides the training of polar encodings through the IOU metric and improves the detection performance. Experiments on the Rotating SAR Ship Detection Dataset (RSSDD) show that the proposed method can achieve better detection performance over other comparison algorithms and other OBB encoding schemes, demonstrating the effectiveness of our method.

CVMar 20, 2021
A novel multimodal fusion network based on a joint coding model for lane line segmentation

Zhenhong Zou, Xinyu Zhang, Huaping Liu et al.

There has recently been growing interest in utilizing multimodal sensors to achieve robust lane line segmentation. In this paper, we introduce a novel multimodal fusion architecture from an information theory perspective, and demonstrate its practical utility using Light Detection and Ranging (LiDAR) camera fusion networks. In particular, we develop, for the first time, a multimodal fusion network as a joint coding model, where each single node, layer, and pipeline is represented as a channel. The forward propagation is thus equal to the information transmission in the channels. Then, we can qualitatively and quantitatively analyze the effect of different fusion approaches. We argue the optimal fusion architecture is related to the essential capacity and its allocation based on the source and channel. To test this multimodal fusion hypothesis, we progressively determine a series of multimodal models based on the proposed fusion methods and evaluate them on the KITTI and the A2D2 datasets. Our optimal fusion network achieves 85%+ lane line accuracy and 98.7%+ overall. The performance gap among the models will inform continuing future research into development of optimal fusion algorithms for the deep multimodal learning community.

CVMar 16, 2021
Conceptual Text Region Network: Cognition-Inspired Accurate Scene Text Detection

Chenwei Cui, Liangfu Lu, Zhiyuan Tan et al.

Segmentation-based methods are widely used for scene text detection due to their superiority in describing arbitrary-shaped text instances. However, two major problems still exist: 1) current label generation techniques are mostly empirical and lack theoretical support, discouraging elaborate label design; 2) as a result, most methods rely heavily on text kernel segmentation which is unstable and requires deliberate tuning. To address these challenges, we propose a human cognition-inspired framework, termed, Conceptual Text Region Network (CTRNet). The framework utilizes Conceptual Text Regions (CTRs), which is a class of cognition-based tools inheriting good mathematical properties, allowing for sophisticated label design. Another component of CTRNet is an inference pipeline that, with the help of CTRs, completely omits the need for text kernel segmentation. Compared with previous segmentation-based methods, our approach is not only more interpretable but also more accurate. Experimental results show that CTRNet achieves state-of-the-art performance on benchmark CTW1500, Total-Text, MSRA-TD500, and ICDAR 2015 datasets, yielding performance gains of up to 2.0%. Notably, to the best of our knowledge, CTRNet is among the first detection models to achieve F-measures higher than 85.0% on all four of the benchmarks, with remarkable consistency and stability.

CLMar 3, 2021
A Novel Context-Aware Multimodal Framework for Persian Sentiment Analysis

Kia Dashtipour, Mandar Gogate, Erik Cambria et al.

Most recent works on sentiment analysis have exploited the text modality. However, millions of hours of video recordings posted on social media platforms everyday hold vital unstructured information that can be exploited to more effectively gauge public perception. Multimodal sentiment analysis offers an innovative solution to computationally understand and harvest sentiments from videos by contextually exploiting audio, visual and textual cues. In this paper, we, firstly, present a first of its kind Persian multimodal dataset comprising more than 800 utterances, as a benchmark resource for researchers to evaluate multimodal sentiment analysis approaches in Persian language. Secondly, we present a novel context-aware multimodal sentiment analysis framework, that simultaneously exploits acoustic, visual and textual cues to more accurately determine the expressed sentiment. We employ both decision-level (late) and feature-level (early) fusion methods to integrate affective cross-modal information. Experimental results demonstrate that the contextual integration of multimodal features such as textual, acoustic and visual features deliver better performance (91.39%) compared to unimodal features (89.24%).

CLNov 19, 2020
Persuasive Dialogue Understanding: the Baselines and Negative Results

Hui Chen, Deepanway Ghosal, Navonil Majumder et al.

Persuasion aims at forming one's opinion and action via a series of persuasive messages containing persuader's strategies. Due to its potential application in persuasive dialogue systems, the task of persuasive strategy recognition has gained much attention lately. Previous methods on user intent recognition in dialogue systems adopt recurrent neural network (RNN) or convolutional neural network (CNN) to model context in conversational history, neglecting the tactic history and intra-speaker relation. In this paper, we demonstrate the limitations of a Transformer-based approach coupled with Conditional Random Field (CRF) for the task of persuasive strategy recognition. In this model, we leverage inter- and intra-speaker contextual semantic features, as well as label dependencies to improve the recognition. Despite extensive hyper-parameter optimizations, this architecture fails to outperform the baseline methods. We observe two negative results. Firstly, CRF cannot capture persuasive label dependencies, possibly as strategies in persuasive dialogues do not follow any strict grammar or rules as the cases in Named Entity Recognition (NER) or part-of-speech (POS) tagging. Secondly, the Transformer encoder trained from scratch is less capable of capturing sequential information in persuasive dialogues than Long Short-Term Memory (LSTM). We attribute this to the reason that the vanilla Transformer encoder does not efficiently consider relative position information of sequence elements.

CVMay 20, 2020
Discriminative Dictionary Design for Action Classification in Still Images and Videos

Abhinaba Roy, Biplab Banerjee, Amir Hussain et al.

In this paper, we address the problem of action recognition from still images and videos. Traditional local features such as SIFT, STIP etc. invariably pose two potential problems: 1) they are not evenly distributed in different entities of a given category and 2) many of such features are not exclusive of the visual concept the entities represent. In order to generate a dictionary taking the aforementioned issues into account, we propose a novel discriminative method for identifying robust and category specific local features which maximize the class separability to a greater extent. Specifically, we pose the selection of potent local descriptors as filtering based feature selection problem which ranks the local features per category based on a novel measure of distinctiveness. The underlying visual entities are subsequently represented based on the learned dictionary and this stage is followed by action classification using the random forest model followed by label propagation refinement. The framework is validated on the action recognition datasets based on still images (Stanford-40) as well as videos (UCF-50) and exhibits superior performances than the representative methods from the literature.

CLMay 3, 2020
Improving Aspect-Level Sentiment Analysis with Aspect Extraction

Navonil Majumder, Rishabh Bhardwaj, Soujanya Poria et al.

Aspect-based sentiment analysis (ABSA), a popular research area in NLP has two distinct parts -- aspect extraction (AE) and labeling the aspects with sentiment polarity (ALSA). Although distinct, these two tasks are highly correlated. The work primarily hypothesize that transferring knowledge from a pre-trained AE model can benefit the performance of ALSA models. Based on this hypothesis, word embeddings are obtained during AE and subsequently, feed that to the ALSA model. Empirically, this work show that the added information significantly improves the performance of three different baseline ALSA models on two distinct domains. This improvement also translates well across domains between AE and ALSA tasks.

AIFeb 19, 2020
Comprehensive Taxonomies of Nature- and Bio-inspired Optimization: Inspiration versus Algorithmic Behavior, Critical Analysis and Recommendations (from 2020 to 2024)

Daniel Molina, Javier Poyatos, Javier Del Ser et al.

In recent years, bio-inspired optimization methods, which mimic biological processes to solve complex problems, have gained popularity in recent literature. The proliferation of proposals prove the growing interest in this field. The increase in nature- and bio-inspired algorithms, applications, and guidelines highlights growing interest in this field. However, the exponential rise in the number of bio-inspired algorithms poses a challenge to the future trajectory of this research domain. Along the five versions of this document, the number of approaches grows incessantly, and where having a new biological description takes precedence over real problem-solving. This document presents two comprehensive taxonomies. One based on principles of biological similarity, and the other one based on operational aspects associated with the iteration of population models that initially have a biological inspiration. Therefore, these taxonomies enable researchers to categorize existing algorithmic developments into well-defined classes, considering two criteria: the source of inspiration, and the behavior exhibited by each algorithm. Using these taxonomies, we classify 518 algorithms based on nature-inspired and bio-inspired principles. Each algorithm within these categories is thoroughly examined, allowing for a critical synthesis of design trends and similarities, and identifying the most analogous classical algorithm for each proposal. From our analysis, we conclude that a poor relationship is often found between the natural inspiration of an algorithm and its behavior. Furthermore, similarities in terms of behavior between different algorithms are greater than what is claimed in their public disclosure: specifically, we show that more than one-fourth of the reviewed solvers are versions of classical algorithms. The conclusions from the analysis of the algorithms lead to several learned lessons.

SDSep 30, 2019
AV Speech Enhancement Challenge using a Real Noisy Corpus

Mandar Gogate, Ahsan Adeel, Kia Dashtipour et al.

This paper presents, a first of its kind, audio-visual (AV) speech enhacement challenge in real-noisy settings. A detailed description of the AV challenge, a novel real noisy AV corpus (ASPIRE), benchmark speech enhancement task, and baseline performance results are outlined. The latter are based on training a deep neural architecture on a synthetic mixture of Grid corpus and ChiME3 noises (consisting of bus, pedestrian, cafe, and street noises) and testing on the ASPIRE corpus. Subjective evaluations of five different speech enhancement algorithms (including SEAGN, spectrum subtraction (SS) , log-minimum mean-square error (LMMSE), audio-only CochleaNet, and AV CochleaNet) are presented as baseline results. The aim of the multi-modal challenge is to provide a timely opportunity for comprehensive evaluation of novel AV speech enhancement algorithms, using our new benchmark, real-noisy AV corpus and specified performance metrics. This will promote AV speech processing research globally, stimulate new ground-breaking multi-modal approaches, and attract interest from companies, academics and researchers working in AV speech technologies and applications. We encourage participants (through a challenge website sign-up) from both the speech and hearing research communities, to benefit from their complementary approaches to AV speech in noise processing.

CLSep 30, 2019
A Hybrid Persian Sentiment Analysis Framework: Integrating Dependency Grammar Based Rules and Deep Neural Networks

Kia Dashtipour, Mandar Gogate, Jingpeng Li et al.

Social media hold valuable, vast and unstructured information on public opinion that can be utilized to improve products and services. The automatic analysis of such data, however, requires a deep understanding of natural language. Current sentiment analysis approaches are mainly based on word co-occurrence frequencies, which are inadequate in most practical cases. In this work, we propose a novel hybrid framework for concept-level sentiment analysis in Persian language, that integrates linguistic rules and deep learning to optimize polarity detection. When a pattern is triggered, the framework allows sentiments to flow from words to concepts based on symbolic dependency relations. When no pattern is triggered, the framework switches to its subsymbolic counterpart and leverages deep neural networks (DNN) to perform the classification. The proposed framework outperforms state-of-the-art approaches (including support vector machine, and logistic regression) and DNN classifiers (long short-term memory, and Convolutional Neural Networks) with a margin of 10-15% and 3-4% respectively, using benchmark Persian product and hotel reviews corpora.

SDSep 23, 2019
CochleaNet: A Robust Language-independent Audio-Visual Model for Speech Enhancement

Mandar Gogate, Kia Dashtipour, Ahsan Adeel et al.

Noisy situations cause huge problems for suffers of hearing loss as hearing aids often make the signal more audible but do not always restore the intelligibility. In noisy settings, humans routinely exploit the audio-visual (AV) nature of the speech to selectively suppress the background noise and to focus on the target speaker. In this paper, we present a causal, language, noise and speaker independent AV deep neural network (DNN) architecture for speech enhancement (SE). The model exploits the noisy acoustic cues and noise robust visual cues to focus on the desired speaker and improve the speech intelligibility. To evaluate the proposed SE framework a first of its kind AV binaural speech corpus, called ASPIRE, is recorded in real noisy environments including cafeteria and restaurant. We demonstrate superior performance of our approach in terms of objective measures and subjective listening tests over the state-of-the-art SE approaches as well as recent DNN based SE models. In addition, our work challenges a popular belief that a scarcity of multi-language large vocabulary AV corpus and wide variety of noises is a major bottleneck to build a robust language, speaker and noise independent SE systems. We show that a model trained on synthetic mixture of Grid corpus (with 33 speakers and a small English vocabulary) and ChiME 3 Noises (consisting of only bus, pedestrian, cafeteria, and street noises) generalise well not only on large vocabulary corpora but also on completely unrelated languages (such as Mandarin), wide variety of speakers and noises.

CVMay 22, 2019
Attributes Guided Feature Learning for Vehicle Re-identification

Hongchao Li, Xianmin Lin, Aihua Zheng et al.

Vehicle Re-ID has recently attracted enthusiastic attention due to its potential applications in smart city and urban surveillance. However, it suffers from large intra-class variation caused by view variations and illumination changes, and inter-class similarity especially for different identities with the similar appearance. To handle these issues, in this paper, we propose a novel deep network architecture, which guided by meaningful attributes including camera views, vehicle types and colors for vehicle Re-ID. In particular, our network is end-to-end trained and contains three subnetworks of deep features embedded by the corresponding attributes (i.e., camera view, vehicle type and vehicle color). Moreover, to overcome the shortcomings of limited vehicle images of different views, we design a view-specified generative adversarial network to generate the multi-view vehicle images. For network training, we annotate the view labels on the VeRi-776 dataset. Note that one can directly adopt the pre-trained view (as well as type and color) subnetwork on the other datasets with only ID information, which demonstrates the generalization of our model. Extensive experiments on the benchmark datasets VeRi-776 and VehicleID suggest that the proposed approach achieves the promising performance and yields to a new state-of-the-art for vehicle Re-ID.

CRSep 13, 2018
Real-Time Lightweight Chaotic Encryption for 5G IoT Enabled Lip-Reading Driven Secure Hearing-Aid

Ahsan Adeel, Jawad Ahmad, Amir Hussain

Existing audio-only hearing-aids are known to perform poorly in noisy situations where overwhelming noise is present. Next-generation audio-visual (lip-reading driven) hearing-aids stand as a major enabler to realise more intelligible audio. However, high data rate, low latency, low computational complexity, and privacy are some of the major bottlenecks to the successful deployment of such advanced hearing aids. To address these challenges, we envision an integration of 5G Cloud-Radio Access Network, Internet of Things (IoT), and strong privacy algorithms to fully benefit from the possibilities these technologies have to offer. The envisioned 5G IoT enabled secure audio-visual (AV) hearing-aid transmits the encrypted compressed AV information and receives encrypted enhanced reconstructed speech in real-time which fully addresses cybersecurity attacks such as location privacy and eavesdropping. For security implementation, a real-time lightweight AV encryption is utilized. For speech enhancement, the received AV information in the cloud is used to filter noisy audio using both deep learning and analytical acoustic modelling (filtering based approach). To offload the computational complexity and real-time optimization issues, the framework runs deep learning and big data optimization processes in the background on the cloud. Specifically, in this work, three key contributions are reported: (1) 5G IoT enabled secure audio-visual hearing-aid framework that aims to achieve a round-trip latency up to 5ms with 100 Mbps datarate (2) Real-time lightweight audio-visual encryption (3) Lip-reading driven deep learning approach for speech enhancement in the cloud. The critical analysis in terms of both speech enhancement and AV encryption demonstrate the potential of the envisioned technology in acquiring high-quality speech reconstruction and secure mobile AV hearing aid communication.

CVAug 28, 2018
Contextual Audio-Visual Switching For Speech Enhancement in Real-World Environments

Ahsan Adeel, Mandar Gogate, Amir Hussain

Human speech processing is inherently multimodal, where visual cues (lip movements) help to better understand the speech in noise. Lip-reading driven speech enhancement significantly outperforms benchmark audio-only approaches at low signal-to-noise ratios (SNRs). However, at high SNRs or low levels of background noise, visual cues become fairly less effective for speech enhancement. Therefore, a more optimal, context-aware audio-visual (AV) system is required, that contextually utilises both visual and noisy audio features and effectively accounts for different noisy conditions. In this paper, we introduce a novel contextual AV switching component that contextually exploits AV cues with respect to different operating conditions to estimate clean audio, without requiring any SNR estimation. The switching module switches between visual-only (V-only), audio-only (A-only), and both AV cues at low, high and moderate SNR levels, respectively. The contextual AV switching component is developed by integrating a convolutional neural network and long-short-term memory network. For testing, the estimated clean audio features are utilised by the developed novel enhanced visually derived Wiener filter for clean audio power spectrum estimation. The contextual AV speech enhancement method is evaluated under real-world scenarios using benchmark Grid and ChiME3 corpora. For objective testing, perceptual evaluation of speech quality is used to evaluate the quality of the restored speech. For subjective testing, the standard mean-opinion-score method is used. The critical analysis and comparative study demonstrate the outperformance of proposed contextual AV approach, over A-only, V-only, spectral subtraction, and log-minimum mean square error based speech enhancement methods at both low and high SNRs, revealing its capability to tackle spectro-temporal variation in any real-world noisy condition.

CVAug 25, 2018
Saliency Detection via Bidirectional Absorbing Markov Chain

Fengling Jiang, Bin Kong, Ahsan Adeel et al.

Traditional saliency detection via Markov chain only considers boundaries nodes. However, in addition to boundaries cues, background prior and foreground prior cues play a complementary role to enhance saliency detection. In this paper, we propose an absorbing Markov chain based saliency detection method considering both boundary information and foreground prior cues. The proposed approach combines both boundaries and foreground prior cues through bidirectional Markov chain. Specifically, the image is first segmented into superpixels and four boundaries nodes (duplicated as virtual nodes) are selected. Subsequently, the absorption time upon transition node's random walk to the absorbing state is calculated to obtain foreground possibility. Simultaneously, foreground prior as the virtual absorbing nodes is used to calculate the absorption time and obtain the background possibility. Finally, two obtained results are fused to obtain the combined saliency map using cost function for further optimization at multi-scale. Experimental results demonstrate the outperformance of our proposed model on 4 benchmark datasets as compared to 17 state-of-the-art methods.

CRAug 16, 2018
Statistical Analysis Driven Optimized Deep Learning System for Intrusion Detection

Cosimo Ieracitano, Ahsan Adeel, Mandar Gogate et al.

Attackers have developed ever more sophisticated and intelligent ways to hack information and communication technology systems. The extent of damage an individual hacker can carry out upon infiltrating a system is well understood. A potentially catastrophic scenario can be envisaged where a nation-state intercepting encrypted financial data gets hacked. Thus, intelligent cybersecurity systems have become inevitably important for improved protection against malicious threats. However, as malware attacks continue to dramatically increase in volume and complexity, it has become ever more challenging for traditional analytic tools to detect and mitigate threat. Furthermore, a huge amount of data produced by large networks has made the recognition task even more complicated and challenging. In this work, we propose an innovative statistical analysis driven optimized deep learning system for intrusion detection. The proposed intrusion detection system (IDS) extracts optimized and more correlated features using big data visualization and statistical analysis methods (human-in-the-loop), followed by a deep autoencoder for potential threat detection. Specifically, a pre-processing module eliminates the outliers and converts categorical variables into one-hot-encoded vectors. The feature extraction module discard features with null values and selects the most significant features as input to the deep autoencoder model (trained in a greedy-wise manner). The NSL-KDD dataset from the Canadian Institute for Cybersecurity is used as a benchmark to evaluate the feasibility and effectiveness of the proposed architecture. Simulation results demonstrate the potential of our proposed system and its outperformance as compared to existing state-of-the-art methods and recently published novel approaches. Ongoing work includes further optimization and real-time evaluation of our proposed IDS.

CLAug 15, 2018
SentiALG: Automated Corpus Annotation for Algerian Sentiment Analysis

Imane Guellil, Ahsan Adeel, Faical Azouaou et al.

Data annotation is an important but time-consuming and costly procedure. To sort a text into two classes, the very first thing we need is a good annotation guideline, establishing what is required to qualify for each class. In the literature, the difficulties associated with an appropriate data annotation has been underestimated. In this paper, we present a novel approach to automatically construct an annotated sentiment corpus for Algerian dialect (a Maghrebi Arabic dialect). The construction of this corpus is based on an Algerian sentiment lexicon that is also constructed automatically. The presented work deals with the two widely used scripts on Arabic social media: Arabic and Arabizi. The proposed approach automatically constructs a sentiment corpus containing 8000 messages (where 4000 are dedicated to Arabic and 4000 to Arabizi). The achieved F1-score is up to 72% and 78% for an Arabic and Arabizi test sets, respectively. Ongoing work is aimed at integrating transliteration process for Arabizi messages to further improve the obtained results.