Kexin Zhu

AI
h-index117
7papers
3,103citations
Novelty46%
AI Score48

7 Papers

SPMar 14, 2023
Improving EEG-based Emotion Recognition by Fusing Time-frequency And Spatial Representations

Kexin Zhu, Xulong Zhang, Jianzong Wang et al.

Using deep learning methods to classify EEG signals can accurately identify people's emotions. However, existing studies have rarely considered the application of the information in another domain's representations to feature selection in the time-frequency domain. We propose a classification network of EEG signals based on the cross-domain feature fusion method, which makes the network more focused on the features most related to brain activities and thinking changes by using the multi-domain attention mechanism. In addition, we propose a two-step fusion method and apply these methods to the EEG emotion recognition network. Experimental results show that our proposed network, which combines multiple representations in the time-frequency domain and spatial domain, outperforms previous methods on public datasets and achieves state-of-the-art at present.

SDOct 25, 2022
Improving Speech Representation Learning via Speech-level and Phoneme-level Masking Approach

Xulong Zhang, Jianzong Wang, Ning Cheng et al.

Recovering the masked speech frames is widely applied in speech representation learning. However, most of these models use random masking in the pre-training. In this work, we proposed two kinds of masking approaches: (1) speech-level masking, making the model to mask more speech segments than silence segments, (2) phoneme-level masking, forcing the model to mask the whole frames of the phoneme, instead of phoneme pieces. We pre-trained the model via these two approaches, and evaluated on two downstream tasks, phoneme classification and speaker recognition. The experiments demonstrated that the proposed masking approaches are beneficial to improve the performance of speech representation.

CLJul 7, 2025
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann et al. · amazon-science, baidu

In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.

CVOct 20, 2023
Boosting Generalization with Adaptive Style Techniques for Fingerprint Liveness Detection

Kexin Zhu, Bo Lin, Yang Qiu et al.

We introduce a high-performance fingerprint liveness feature extraction technique that secured first place in LivDet 2023 Fingerprint Representation Challenge. Additionally, we developed a practical fingerprint recognition system with 94.68% accuracy, earning second place in LivDet 2023 Liveness Detection in Action. By investigating various methods, particularly style transfer, we demonstrate improvements in accuracy and generalization when faced with limited training data. As a result, our approach achieved state-of-the-art performance in LivDet 2023 Challenges.

AIMar 2
Nano-EmoX: Unifying Multimodal Emotional Intelligence from Perception to Empathy

Jiahao Huang, Fengyan Lin, Xuechao Yang et al.

The development of affective multimodal language models (MLMs) has long been constrained by a gap between low-level perception and high-level interaction, leading to fragmented affective capabilities and limited generalization. To bridge this gap, we propose a cognitively inspired three-level hierarchy that organizes affective tasks according to their cognitive depth-perception, understanding, and interaction-and provides a unified conceptual foundation for advancing affective modeling. Guided by this hierarchy, we introduce Nano-EmoX, a small-scale multitask MLM, and P2E (Perception-to-Empathy), a curriculum-based training framework. Nano-EmoX integrates a suite of omni-modal encoders, including an enhanced facial encoder and a fusion encoder, to capture key multimodal affective cues and improve cross-task transferability. The outputs are projected into a unified language space via heterogeneous adapters, empowering a lightweight language model to tackle diverse affective tasks. Concurrently, P2E progressively cultivates emotional intelligence by aligning rapid perception with chain-of-thought-driven empathy. To the best of our knowledge, Nano-EmoX is the first compact MLM (2.2B) to unify six core affective tasks across all three hierarchy levels, achieving state-of-the-art or highly competitive performance across multiple benchmarks, demonstrating excellent efficiency and generalization.

AIJul 4, 2025
REAL: Benchmarking Abilities of Large Language Models for Housing Transactions and Services

Kexin Zhu, Yang Han

The development of large language models (LLMs) has greatly promoted the progress of chatbot in multiple fields. There is an urgent need to evaluate whether LLMs can play the role of agent in housing transactions and services as well as humans. We present Real Estate Agent Large Language Model Evaluation (REAL), the first evaluation suite designed to assess the abilities of LLMs in the field of housing transactions and services. REAL comprises 5,316 high-quality evaluation entries across 4 topics: memory, comprehension, reasoning and hallucination. All these entries are organized as 14 categories to assess whether LLMs have the knowledge and ability in housing transactions and services scenario. Additionally, the REAL is used to evaluate the performance of most advanced LLMs. The experiment results indicate that LLMs still have significant room for improvement to be applied in the real estate field.

CVOct 18, 2024
Unlabeled Action Quality Assessment Based on Multi-dimensional Adaptive Constrained Dynamic Time Warping

Renguang Chen, Guolong Zheng, Xu Yang et al.

The growing popularity of online sports and exercise necessitates effective methods for evaluating the quality of online exercise executions. Previous action quality assessment methods, which relied on labeled scores from motion videos, exhibited slightly lower accuracy and discriminability. This limitation hindered their rapid application to newly added exercises. To address this problem, this paper presents an unlabeled Multi-Dimensional Exercise Distance Adaptive Constrained Dynamic Time Warping (MED-ACDTW) method for action quality assessment. Our approach uses an athletic version of DTW to compare features from template and test videos, eliminating the need for score labels during training. The result shows that utilizing both 2D and 3D spatial dimensions, along with multiple human body features, improves the accuracy by 2-3% compared to using either 2D or 3D pose estimation alone. Additionally, employing MED for score calculation enhances the precision of frame distance matching, which significantly boosts overall discriminability. The adaptive constraint scheme enhances the discriminability of action quality assessment by approximately 30%. Furthermore, to address the absence of a standardized perspective in sports class evaluations, we introduce a new dataset called BGym.