CL Jan 30, 2023
KG-BERTScore: Incorporating Knowledge Graph into BERTScore for Reference-Free Machine Translation Evaluation
Zhanglin Wu, Min Zhang, Ming Zhu et al.
BERTScore is an effective and robust automatic metric for reference-based machine translation evaluation. In this paper, we incorporate a multilingual knowledge graph into BERTScore and propose a metric named KG-BERTScore, which linearly combines the results of BERTScore and bilingual named entity matching for reference-free machine translation evaluation. In experiments on the WMT19 "QE as a metric" (evaluation without references) shared tasks, KG-BERTScore achieves a higher overall correlation with human judgements than the current state-of-the-art metrics for reference-free machine translation evaluation. Moreover, the pre-trained multilingual model used by KG-BERTScore and the parameter for linear combination are also studied in this paper.
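The linear combination described in this abstract can be illustrated with a minimal sketch. The function names, the toy dictionary standing in for a bilingual knowledge graph, and the weight `alpha` are all assumptions made for illustration; the abstract does not give the paper's exact entity-matching rule or the value of the combination parameter.

```python
def entity_match_score(source_entities, hypothesis_entities, bilingual_kg):
    """Fraction of source entities whose KG translation appears in the hypothesis.

    bilingual_kg is a hypothetical dict: source entity -> list of target-language names.
    """
    if not source_entities:
        return 1.0  # assumption: with no entities to match, treat the match as perfect
    hyp = {e.lower() for e in hypothesis_entities}
    matched = sum(
        1
        for entity in source_entities
        if any(name.lower() in hyp for name in bilingual_kg.get(entity, ()))
    )
    return matched / len(source_entities)


def kg_bertscore(bert_score, entity_score, alpha=0.5):
    """Linear combination of the two scores; alpha is an illustrative weight."""
    return alpha * entity_score + (1.0 - alpha) * bert_score


kg = {"Deutschland": ["Germany"], "München": ["Munich"]}
entity_score = entity_match_score(["Deutschland", "München"], ["Germany"], kg)
combined = kg_bertscore(bert_score=0.8, entity_score=entity_score, alpha=0.5)
```

Here the BERTScore value (0.8) is simply passed in as a number; in practice it would come from a multilingual model comparing the source and the translation.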
CV Dec 28, 2022
Efficient Semantic Segmentation on Edge Devices
Farshad Safavi, Irfan Ali, Venkatesh Dasari et al.
Semantic segmentation is the computer vision task of assigning each pixel of an image to a class. The task should be performed with both accuracy and efficiency. Most existing deep FCNs require heavy computation and are very power hungry, making them unsuitable for real-time applications on portable devices. This project analyzes current semantic segmentation models to explore the feasibility of applying them to emergency response during catastrophic events. We compare the performance of real-time semantic segmentation models with non-real-time counterparts constrained by aerial images under oppositional settings. Furthermore, we train several models on the Flood-Net dataset, which contains UAV images captured after Hurricane Harvey, and benchmark their execution on special classes such as flooded buildings vs. non-flooded buildings or flooded roads vs. non-flooded roads. In this project, we developed a real-time UNet-based model and deployed that network on a Jetson AGX Xavier module.
CV May 11
Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition
Yu He, Ting Zhu, Yichun Liu et al.
Recent research on fashion outfit generation focuses on promoting the visual consistency of garments by leveraging key information from the reference image and text prompt. However, the potential of outfit generation remains underexplored, requiring a comprehensive e-commerce dataset and careful use of multi-modal conditions. In this paper, we propose a brand-new e-commerce dataset, named Fashion130k, with various occasions, models, and garment types. For consistent garment generation, we design a framework with Unified Multi-modal Condition (UMC) to align and integrate the text and visual prompts into the generation model. Specifically, we explore an embedding refiner to extract the unified embeddings of multi-modal prompts, within which a Fusion Transformer is proposed to align the multi-modal embeddings by adjusting the modality gap between text and image. Based on the unified embeddings, the attention in the generation model is redesigned to emphasize the correlations between prompts and the noise image, so that the noise image can select the pivotal tokens of the prompts for consistent outfit generation. Our dataset and proposed framework offer a general and nuanced exploration of multi-modal prompts for generation models. Extensive experiments on real-world applications and benchmarks demonstrate the effectiveness of UMC in visual consistency, achieving better results than SoTA methods.
CV Jan 13
UM-Text: A Unified Multimodal Model for Image Understanding
Lichen Ma, Xiaolong Fu, Gaojing Zhou et al.
With the rapid advancement of image generation, visual text editing using natural language instructions has received increasing attention. The main challenge of this task is to fully understand the instruction and reference image, and thus generate visual text that is style-consistent with the image. Previous methods often involve complex steps of specifying the text content and attributes, such as font size, color, and layout, without considering the stylistic consistency with the reference image. To address this, we propose UM-Text, a unified multimodal model for context understanding and visual text editing by natural language instructions. Specifically, we introduce a Visual Language Model (VLM) to process the instruction and reference image, so that the text content and layout can be elaborately designed according to the context information. To generate an accurate and harmonious visual text image, we further propose the UM-Encoder to combine the embeddings of various condition information, where the combination is automatically configured by VLM according to the input instruction. During training, we propose a regional consistency loss to offer more effective supervision for glyph generation on both latent and RGB space, and design a tailored three-stage training strategy to further enhance model performance. In addition, we contribute the UM-DATA-200K, a large-scale visual text image dataset on diverse scenes for model training. Extensive qualitative and quantitative results on multiple public benchmarks demonstrate that our method achieves state-of-the-art performance.
AS Dec 14, 2023
Design, construction and evaluation of emotional multimodal pathological speech database
Ting Zhu, Shufei Duan, Huizhi Liang et al.
The lack of an available emotion pathology database is one of the key obstacles in studying the emotion expression status of patients with dysarthria. The first Chinese multimodal emotional pathological speech database containing multi-perspective information is constructed in this paper. It includes 29 controls and 39 patients with different degrees of motor dysarthria, expressing happy, sad, angry and neutral emotions. All emotional speech was labeled for intelligibility, types and discrete dimensional emotions using a purpose-built WeChat mini-program. The subjective analysis validates the database in terms of emotion discrimination accuracy, speech intelligibility, valence-arousal spatial distribution, and the correlation between SCL-90 and disease severity. Automatic recognition was tested on speech and glottal data, with an average accuracy of 78% for controls and 60% for patients on audio, and 51% for controls and 38% for patients on glottal data, indicating an influence of the disease on emotional expression.
LG Dec 17, 2025
On-device Large Multi-modal Agent for Human Activity Recognition
Md Shakhrul Iman Siam, Ishtiaque Ahmed Showmik, Guanqun Song et al.
Human Activity Recognition (HAR) has been an active area of research, with applications ranging from healthcare to smart environments. The recent advancements in Large Language Models (LLMs) have opened new possibilities to leverage their capabilities in HAR, enabling not just activity classification but also interpretability and human-like interaction. In this paper, we present a Large Multi-Modal Agent designed for HAR, which integrates the power of LLMs to enhance both performance and user engagement. The proposed framework not only delivers activity classification but also bridges the gap between technical outputs and user-friendly insights through its reasoning and question-answering capabilities. We conduct extensive evaluations on widely adopted HAR datasets, including HHAR, Shoaib, and Motionsense, to assess the performance of our framework. The results demonstrate that our model achieves high classification accuracy comparable to state-of-the-art methods while significantly improving interpretability through its reasoning and Q&A capabilities.
LG Dec 17, 2025
EdgeFlex-Transformer: Transformer Inference for Edge Devices
Shoaib Mohammad, Guanqun Song, Ting Zhu
Deploying large-scale transformer models on edge devices presents significant challenges due to strict constraints on memory, compute, and latency. In this work, we propose a lightweight yet effective multi-stage optimization pipeline designed to compress and accelerate Vision Transformers (ViTs) for deployment in resource-constrained environments. Our methodology combines activation profiling, memory-aware pruning, selective mixed-precision execution, and activation-aware quantization (AWQ) to reduce the model's memory footprint without requiring costly retraining or task-specific fine-tuning. Starting from a ViT-Huge backbone with 632 million parameters, we first identify low-importance channels using activation statistics collected via forward hooks, followed by structured pruning to shrink the MLP layers under a target memory budget. We further apply FP16 conversion to selected components and leverage AWQ to quantize the remaining model weights and activations to INT8 with minimal accuracy degradation. Our experiments on CIFAR-10 demonstrate that the fully optimized model achieves a 76% reduction in peak memory usage and over 6x lower latency, while retaining or even improving accuracy compared to the original FP32 baseline. This framework offers a practical path toward efficient transformer inference on edge platforms, and opens future avenues for integrating dynamic sparsity and Mixture-of-Experts (MoE) architectures to further scale performance across diverse tasks.
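The profiling-then-pruning step in this abstract can be sketched in plain Python. This is an illustrative simplification, not the authors' pipeline: mean absolute activation is an assumed importance statistic (the abstract only says "activation statistics"), and `bytes_per_channel` and `memory_budget` are hypothetical parameters.

```python
def channel_importance(activations):
    """Mean absolute activation per channel.

    activations: list of samples, each a list with one value per channel
    (standing in for statistics a forward hook would collect).
    """
    n_samples = len(activations)
    n_channels = len(activations[0])
    return [
        sum(abs(sample[c]) for sample in activations) / n_samples
        for c in range(n_channels)
    ]


def select_channels(importance, bytes_per_channel, memory_budget):
    """Keep the most important channels that fit in the memory budget."""
    keep_count = min(len(importance), memory_budget // bytes_per_channel)
    ranked = sorted(range(len(importance)), key=lambda c: importance[c], reverse=True)
    return sorted(ranked[:keep_count])  # kept channel indices, in original order


profile = [[0.1, -2.0, 0.5, 0.0], [-0.1, 1.0, 0.7, 0.0]]
importance = channel_importance(profile)
kept = select_channels(importance, bytes_per_channel=100, memory_budget=250)
```

In a real model, the kept indices would be used to slice the corresponding rows and columns of the MLP weight matrices; that step is omitted here.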
SD Nov 27, 2025
Art2Music: Generating Music for Art Images with Multi-modal Feeling Alignment
Jiaying Hong, Ting Zhu, Thanet Markchom et al.
With the rise of AI-generated content (AIGC), generating perceptually natural and feeling-aligned music from multimodal inputs has become a central challenge. Existing approaches often rely on explicit emotion labels that require costly annotation, underscoring the need for more flexible feeling-aligned methods. To support multimodal music generation, we construct ArtiCaps, a pseudo feeling-aligned image-music-text dataset created by semantically matching descriptions from ArtEmis and MusicCaps. We further propose Art2Music, a lightweight cross-modal framework that synthesizes music from artistic images and user comments. In the first stage, images and text are encoded with OpenCLIP and fused using a gated residual module; the fused representation is decoded by a bidirectional LSTM into Mel-spectrograms with a frequency-weighted L1 loss to enhance high-frequency fidelity. In the second stage, a fine-tuned HiFi-GAN vocoder reconstructs high-quality audio waveforms. Experiments on ArtiCaps show clear improvements in Mel-Cepstral Distortion, Frechet Audio Distance, Log-Spectral Distance, and cosine similarity. A small LLM-based rating study further verifies consistent cross-modal feeling alignment and offers interpretable explanations of matches and mismatches across modalities. These results demonstrate improved perceptual naturalness, spectral fidelity, and semantic consistency. Art2Music also maintains robust performance with only 50k training samples, providing a scalable solution for feeling-aligned creative audio generation in interactive art, personalized soundscapes, and digital art exhibitions.
LG Nov 17, 2024
Mitigating Relative Over-Generalization in Multi-Agent Reinforcement Learning
Ting Zhu, Yue Jin, Jeremie Houssineau et al.
In decentralized multi-agent reinforcement learning, agents learning in isolation can lead to relative over-generalization (RO), where optimal joint actions are undervalued in favor of suboptimal ones. This hinders effective coordination in cooperative tasks, as agents tend to choose actions that are individually rational but collectively suboptimal. To address this issue, we introduce MaxMax Q-Learning (MMQ), which employs an iterative process of sampling and evaluating potential next states, selecting those with maximal Q-values for learning. This approach refines approximations of ideal state transitions, aligning more closely with the optimal joint policy of collaborating agents. We provide theoretical analysis supporting MMQ's potential and present empirical evaluations across various environments susceptible to RO. Our results demonstrate that MMQ frequently outperforms existing baselines, exhibiting enhanced convergence and sample efficiency.
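The idea of selecting, among sampled next states, the one with the maximal Q-value can be sketched in tabular form. This is a simplified illustration under stated assumptions: the candidate next states are passed in directly (the abstract does not say how they are sampled), and the function names, `gamma`, and `lr` are hypothetical.

```python
def maxmax_target(q_table, reward, candidate_next_states, actions, gamma=0.99):
    """Bootstrap target: max over candidate next states, then max over actions."""
    best_value = max(
        max(q_table.get((state, action), 0.0) for action in actions)
        for state in candidate_next_states
    )
    return reward + gamma * best_value


def q_update(q_table, state, action, target, lr=0.1):
    """Standard tabular update toward the given target."""
    old = q_table.get((state, action), 0.0)
    q_table[(state, action)] = old + lr * (target - old)
    return q_table[(state, action)]


q = {("s1", "a"): 1.0, ("s2", "b"): 3.0}
target = maxmax_target(q, reward=1.0, candidate_next_states=["s1", "s2"],
                       actions=["a", "b"], gamma=0.5)
new_value = q_update(q, "s0", "a", target, lr=0.1)
```

The outer maximum is what distinguishes this target from ordinary Q-learning, which bootstraps from the single observed next state.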
AS Dec 30, 2023
Enhancing dysarthria speech feature representation with empirical mode decomposition and Walsh-Hadamard transform
Ting Zhu, Shufei Duan, Camille Dingam et al.
Dysarthria speech contains the pathological characteristics of vocal tract and vocal fold, but so far, they have not yet been included in traditional acoustic feature sets. Moreover, the nonlinearity and non-stationarity of speech have been ignored. In this paper, we propose a feature enhancement algorithm for dysarthria speech called WHFEMD. It combines empirical mode decomposition (EMD) and fast Walsh-Hadamard transform (FWHT) to enhance features. With the proposed algorithm, the fast Fourier transform of the dysarthria speech is first performed and then followed by EMD to get intrinsic mode functions (IMFs). After that, FWHT is used to output new coefficients and to extract statistical features based on IMFs, power spectral density, and enhanced gammatone frequency cepstral coefficients. To evaluate the proposed approach, we conducted experiments on two public pathological speech databases including UA Speech and TORGO. The results show that our algorithm performed better than traditional features in classification. We achieved improvements of 13.8% (UA Speech) and 3.84% (TORGO), respectively. Furthermore, the incorporation of an imbalanced classification algorithm to address data imbalance has resulted in a 12.18% increase in recognition accuracy. This algorithm effectively addresses the challenges of the imbalanced dataset and non-linearity in dysarthric speech and simultaneously provides a robust representation of the local pathological features of the vocal folds and tracts.
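For readers unfamiliar with the fast Walsh-Hadamard transform mentioned in this abstract, a minimal unnormalized implementation (natural, or Hadamard, ordering) is sketched below. It illustrates the transform only; how WHFEMD applies it to the intrinsic mode functions is not specified in the abstract and is not reproduced here.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform of a power-of-two-length sequence."""
    a = list(values)
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        # Butterfly step: combine elements that are h apart within blocks of 2h.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                x, y = a[i], a[i + h]
                a[i], a[i + h] = x + y, x - y
        h *= 2
    return a


signal = [1, 0, 1, 0, 0, 1, 1, 0]
coefficients = fwht(signal)
```

Applying the transform twice returns the original sequence scaled by its length, which gives a quick sanity check.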
CV Dec 23, 2023
Beyond the Frame: Single and multiple video summarization method with user-defined length
Vahid Ahmadi Kalkhorani, Qingquan Zhang, Guanqun Song et al.
Video summarization is a crucial method for shortening videos, reducing the time spent watching or reviewing a long video. This approach has become more important as the amount of published video increases every day. A single video or multiple videos can be summarized into a relatively short video using a variety of techniques, from multimodal audio-visual techniques to natural language processing (NLP) approaches. Audio-visual techniques may be used to recognize significant visual events and pick the most important parts, while NLP techniques can be used to evaluate the audio transcript and extract the main sentences (timestamps) and corresponding video frames from the original video. Another approach is to use the best of both domains, meaning that we can use audio-visual cues as well as the video transcript to extract and summarize the video. In this paper, we combine a variety of NLP techniques (extractive and context-based summarizers) with video processing techniques to convert a long video into a single relatively short video. We design this tool so that the user can specify the relative length of the summarized video. We have also explored ways of summarizing and concatenating multiple videos into a single short video, which helps gather the most important concepts from the same subject in one place. Our approach shows that video summarization is a difficult but significant task, with substantial potential for further research and development, and that it is possible thanks to the development of NLP models.
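The user-defined relative length can be illustrated with a small sketch that picks transcript segments to fit a duration budget. This is an assumed greedy selection, not the authors' tool: the per-segment scores are taken as given (for example, from an extractive summarizer), and `select_segments` and `ratio` are hypothetical names.

```python
def select_segments(segments, ratio):
    """Pick the highest-scoring segments whose total duration fits ratio * total.

    segments: list of (start_seconds, end_seconds, score) tuples.
    Returns the chosen segments in chronological order.
    """
    total = sum(end - start for start, end, _ in segments)
    budget = ratio * total
    chosen, used = [], 0.0
    for segment in sorted(segments, key=lambda s: s[2], reverse=True):
        length = segment[1] - segment[0]
        if used + length <= budget:
            chosen.append(segment)
            used += length
    return sorted(chosen)  # tuples sort by start time first


transcript = [(0, 10, 0.2), (10, 20, 0.9), (20, 30, 0.5), (30, 40, 0.7)]
summary = select_segments(transcript, ratio=0.5)
```

The returned timestamps would then be used to cut and concatenate the corresponding video frames.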
LG Dec 23, 2023
Data Classification With Multiprocessing
Anuja Dixit, Shreya Byreddy, Guanqun Song et al.
Classification is one of the most important tasks in Machine Learning (ML), and with recent advancements in artificial intelligence (AI) it is important to find efficient ways to implement it. Generally, the choice of classification algorithm depends on the data it is dealing with, and the accuracy of the algorithm depends on the hyperparameters it is tuned with. One way is to check the accuracy of an algorithm by executing it serially with different hyperparameters and then selecting the parameters that give the highest accuracy to predict the final output. This paper proposes another way, in which the algorithm is trained in parallel with different hyperparameters to reduce the execution time. In the end, results from all the trained variations of the algorithm are ensembled to exploit the parallelism and improve the accuracy of prediction. Python multiprocessing is used to test this hypothesis with different classification algorithms such as K-Nearest Neighbors (KNN), Support Vector Machines (SVM), random forest and decision tree, and to review factors affecting parallelism. The ensembled output considers the predictions from all processes, and the final class is the one predicted by the largest number of processes. Doing this increases the reliability of predictions. We conclude that ensembling improves accuracy and multiprocessing reduces execution time for the selected algorithms.
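The parallel-training-plus-voting scheme described here can be sketched with Python's standard `multiprocessing` module. The toy one-dimensional training set, the hand-written nearest-neighbor classifier, and the function names are assumptions for illustration; the paper's actual datasets and classifiers are not reproduced.

```python
from collections import Counter
from multiprocessing import Pool

# Toy 1-D training data (hypothetical): (feature, label) pairs.
TRAIN = [(1.0, "a"), (1.2, "a"), (1.4, "a"), (3.0, "b"), (3.2, "b"), (3.4, "b")]


def knn_predict(args):
    """Classify x with a k-nearest-neighbor vote; args is a (k, x) tuple."""
    k, x = args
    nearest = sorted(TRAIN, key=lambda point: abs(point[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def majority_vote(predictions):
    """Final class is the one predicted by the largest number of processes."""
    return Counter(predictions).most_common(1)[0][0]


def parallel_ensemble(x, ks=(1, 3, 5), processes=3):
    """Run one classifier per hyperparameter value in parallel, then vote."""
    with Pool(processes) as pool:
        predictions = pool.map(knn_predict, [(k, x) for k in ks])
    return majority_vote(predictions)
```

When running this as a script, call `parallel_ensemble(1.3)` from inside an `if __name__ == "__main__":` guard, which `multiprocessing` requires on platforms that spawn worker processes.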
CV Jan 1, 2022
Computer Vision Based Parking Optimization System
Siddharth Chandrasekaran, Jeffrey Matthew Reginald, Wei Wang et al.
An improvement in technology is linearly related to time and time-relevant problems. It has been seen that as time progresses, the number of problems humans face also increases. However, technology to resolve these problems tends to improve as well. One of the earliest problems, which started with the invention of vehicles, was parking. The ease of resolving this problem using technology has evolved over the years, but the problem of parking still remains unsolved. The main reason is that parking is not a single problem but a set of problems. One of these is the occupancy detection of parking slots in a distributed parking ecosystem. In a distributed system, users would find preferable parking spaces as opposed to random parking spaces. In this paper, we propose a web-based application as a solution for parking space detection in different parking spaces. The solution is based on Computer Vision (CV) and is built using the Django framework written in Python 3.0. The solution resolves the occupancy detection problem and gives users the option to choose a block based on availability and their preference. The evaluation results show that the proposed system is promising and efficient. The proposed system can also be integrated with different systems and be used for solving other relevant parking problems.
NI Dec 30, 2021
Machine Learning and Artificial Intelligence in Next-Generation Wireless Network
Wafeeq Iqbal, Wei Wang, Ting Zhu
Due to advances in technology, the next-generation wireless network will be very diverse, complicated, and shaped by the changing demands of consumers. The current methodologies and approaches of network operators are traditional and cannot help next-generation networks utilize their resources most appropriately. The limited capability of the traditional tools will not allow network providers to fulfill the demands of their subscribers in the future. Therefore, this paper focuses on machine learning, automation, artificial intelligence, and big data analytics for improving the capacity and effectiveness of next-generation wireless networks. The paper discusses the role of these new technologies in improving the service and performance of network providers in the future. The paper finds that machine learning, big data analytics, and artificial intelligence will help make the next-generation wireless network self-adaptive, self-aware, prescriptive, and proactive. The paper concludes that future wireless network operators cannot work without shifting their operational framework to AI and machine learning technologies.
HC Dec 30, 2021
Investigations of Smart Health Reliability
Sharlet Claros, Wei Wang, Ting Zhu
This paper presents a balanced investigation into the reliability of wireless smart health devices in collecting biometric data under varying network and environmental conditions. It is followed by a program implementation that begins an introductory analysis of measurement accuracy and data collection to gauge the reliability of smart health devices.
SE Dec 30, 2021
Chatbot for fitness management using IBM Watson
Sai Rugved Lola, Rahul Dhadvai, Wei Wang et al.
Chatbots have revolutionized the way humans interact with computer systems, and they have replaced service agents, call-center representatives, etc. The fitness industry has always been a growing industry, although it has not adapted to the latest technologies like AI, ML and cloud computing. In this paper, we propose an idea to develop a chatbot for fitness management using IBM Watson and integrate it with a web application. We propose using Natural Language Processing (NLP) and Natural Language Understanding (NLU) along with the frameworks that IBM Cloud Watson provides for the Chatbot Assistant. This software uses a serverless architecture to combine the services of a professional by offering diet plans, home exercises, interactive counseling sessions, and fitness recommendations.
LG Jan 8, 2021
Benchmarking Machine Learning: How Fast Can Your Algorithms Go?
Zeyu Ning, Hugues Nelson Iradukunda, Qingquan Zhang et al.
This paper evaluates the effect of several techniques for speeding up machine learning, including vector caches and parallel execution. It reviews previous approaches and presents our own experimental results.
LG Dec 22, 2020
MailLeak: Obfuscation-Robust Character Extraction Using Transfer Learning
Wei Wang, Emily Sallenback, Zeyu Ning et al.
The following work presents a new algorithm for character recognition from obfuscated images. The presented method is an example of a potential threat to current postal services. This paper both analyzes the efficiency of the given algorithm and suggests countermeasures to prevent such threats from occurring.
CR Dec 21, 2020
A Secured Protocol for IoT Networks
Ananth Vishnu Bhaskar, Ankit Baingane, Ryan Jahnige et al.
Researchers have shown in the past that symmetric-key cryptography is generally considered infeasible and that public-key cryptography at times fails to provide sufficient security and integrity for data. In contrast to this prejudice, our paper presents a novel approach that secures data through encryption techniques such as RSA. More importantly, it identifies a randomized path to route messages from the source to the destination and, by identifying alternate paths between them, ensures that packets are delivered safely even when intermediate nodes are attacked.
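The randomized routing with alternate paths described in this abstract can be sketched as a randomized depth-first search that skips attacked nodes. This is an assumed illustration, not the paper's protocol: the graph representation, the `blocked` set of compromised nodes, and the function name are hypothetical, and the RSA encryption step is omitted.

```python
import random


def random_path(graph, source, dest, blocked=frozenset(), rng=None):
    """Return a randomly chosen simple path from source to dest avoiding blocked nodes.

    graph: dict mapping each node to a list of neighbor nodes.
    Returns a list of nodes, or None if no safe path exists.
    """
    rng = rng or random.Random()
    if source in blocked or dest in blocked:
        return None

    def dfs(node, visited):
        if node == dest:
            return [node]
        neighbors = [n for n in graph.get(node, ())
                     if n not in visited and n not in blocked]
        rng.shuffle(neighbors)  # randomize which route is tried first
        for neighbor in neighbors:
            rest = dfs(neighbor, visited | {neighbor})
            if rest is not None:
                return [node] + rest
        return None

    return dfs(source, {source})


network = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
detour = random_path(network, "A", "D", blocked={"B"})
```

With node B marked as attacked, the only remaining route goes through C; with no blocked nodes, either route may be returned.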