4.8 CL Jul 13, 2024
Bilingual Adaptation of Monolingual Foundation Models
Gurpreet Gosal, Yishi Xu, Gokul Ramakrishnan et al.
We present an efficient method for adapting a monolingual Large Language Model (LLM) to another language, addressing challenges of catastrophic forgetting and tokenizer limitations. We focus this study on adapting Llama 2 to Arabic. Our two-stage approach begins with expanding the vocabulary and training only the embeddings matrix, followed by full model continual pre-training on a bilingual corpus. By continually pre-training on a mix of Arabic and English corpora, the model retains its proficiency in English while acquiring capabilities in Arabic. Our approach results in significant improvements in Arabic and slight enhancements in English, demonstrating cost-effective cross-lingual transfer. We perform ablations on embedding initialization techniques, data mix ratios, and learning rates and release a detailed training recipe. To demonstrate generalizability of this approach we also adapted Llama 3 8B to Arabic and Llama 2 13B to Hindi.
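The vocabulary-expansion step can be illustrated with a minimal sketch. It assumes one common initialization heuristic (each new token starts as the mean of the embeddings of the sub-tokens the original tokenizer would have produced for it); the abstract says several initialization techniques were ablated but does not say which one was chosen, so this is an illustration rather than the authors' recipe. The function name `expand_embeddings` and the toy ids are hypothetical.

```python
import numpy as np

def expand_embeddings(old_emb, new_token_pieces):
    """Append one embedding row per new token.

    old_emb: (V_old, d) array of pretrained embeddings.
    new_token_pieces: list of lists; entry i holds the old-vocabulary ids
        that the original tokenizer would emit for new token i
        (hypothetical input, assumed to be precomputed).
    Each new row is the mean of its sub-token rows (an assumed heuristic).
    """
    new_rows = [old_emb[ids].mean(axis=0) for ids in new_token_pieces]
    return np.vstack([old_emb, np.stack(new_rows)])

rng = np.random.default_rng(0)
old_emb = rng.normal(size=(6, 4))   # toy pretrained table: 6 tokens, dim 4
pieces = [[1, 2], [0, 3, 5]]        # two new tokens and their old sub-token ids
expanded = expand_embeddings(old_emb, pieces)
print(expanded.shape)               # original rows are kept, new rows appended
```

In a first stage like the one described, only this embedding table (and typically the output projection) would receive gradient updates while the remaining weights stay frozen.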
2.7 SD Sep 29, 2024
Quantitative Analysis of Audio-Visual Tasks: An Information-Theoretic Perspective
Chen Chen, Xiaolou Li, Zehua Liu et al.
In the field of spoken language processing, audio-visual speech processing is receiving increasing research attention. Key components of this research include tasks such as lip reading, audio-visual speech recognition, and visual-to-speech synthesis. Although significant success has been achieved, theoretical analysis is still insufficient for audio-visual tasks. This paper presents a quantitative analysis based on information theory, focusing on information intersection between different modalities. Our results show that this analysis is valuable for understanding the difficulties of audio-visual processing tasks as well as the benefits that could be obtained by modality integration.
4.6 LG Jul 25, 2024
RIDA: A Robust Attack Framework on Incomplete Graphs
Jianke Yu, Hanchen Wang, Chen Chen et al.
Graph Neural Networks (GNNs) are vital in data science but are increasingly susceptible to adversarial attacks. To help researchers develop more robust GNN models, it's essential to focus on designing strong attack models as foundational benchmarks and guiding references. Among adversarial attacks, gray-box poisoning attacks are noteworthy due to their effectiveness and fewer constraints. These attacks exploit GNNs' need for retraining on updated data, thereby impacting their performance by perturbing these datasets. However, current research overlooks the real-world scenario of incomplete graphs. To address this gap, we introduce the Robust Incomplete Deep Attack Framework (RIDA). It is the first algorithm for robust gray-box poisoning attacks on incomplete graphs. The approach innovatively aggregates distant vertex information and ensures powerful data utilization. Extensive tests against 9 SOTA baselines on 3 real-world datasets demonstrate RIDA's superiority in handling incompleteness and its high attack performance on incomplete graphs.
10.2 CV Feb 3, 2025
CLIP-UP: A Simple and Efficient Mixture-of-Experts CLIP Training Recipe with Sparse Upcycling
Xinze Wang, Chen Chen, Yinfei Yang et al.
Mixture-of-Experts (MoE) models are crucial for scaling model capacity while controlling inference costs. While integrating MoE into multimodal models like CLIP improves performance, training these models is notoriously challenging and expensive. We propose CLIP-Upcycling (CLIP-UP), an efficient alternative training strategy that converts a pre-trained dense CLIP model into a sparse MoE architecture. Through extensive experimentation with various settings and auxiliary losses, we demonstrate that CLIP-UP significantly reduces training complexity and cost. Remarkably, our sparse CLIP B/16 model, trained with CLIP-UP, outperforms its dense counterpart by 7.2% and 6.6% on COCO and Flickr30k text-to-image Recall@1 benchmarks respectively. It even surpasses the larger CLIP L/14 model on this task while using only 30% of the inference FLOPs. We further demonstrate the generalizability of our training recipe across different scales, establishing sparse upcycling as a practical and scalable approach for building efficient, high-performance CLIP models.
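The idea of converting a dense model into a sparse MoE can be sketched as follows. This is a generic illustration of sparse upcycling under stated assumptions, not the CLIP-UP implementation: every expert is initialized as a copy of the dense MLP, the router is freshly initialized, and gates are softmax-normalized over the selected top-k experts. All names (`upcycle`, `moe_forward`) and sizes are hypothetical, and the auxiliary losses mentioned in the abstract are omitted.

```python
import numpy as np

def dense_mlp(x, w1, w2):
    # Two-layer ReLU MLP standing in for a transformer feed-forward block.
    return np.maximum(x @ w1, 0.0) @ w2

def upcycle(w1, w2, num_experts, rng):
    # Each expert starts as an exact copy of the dense MLP weights;
    # only the router is new (small random init, an assumption).
    experts = [(w1.copy(), w2.copy()) for _ in range(num_experts)]
    router = rng.normal(scale=0.02, size=(w1.shape[0], num_experts))
    return experts, router

def moe_forward(x, experts, router, top_k=2):
    logits = x @ router                                  # (tokens, experts)
    top = np.argsort(-logits, axis=1)[:, :top_k]         # ids of chosen experts
    sel = np.take_along_axis(logits, top, axis=1)
    gates = np.exp(sel - sel.max(axis=1, keepdims=True))
    gates /= gates.sum(axis=1, keepdims=True)            # gates sum to 1 per token
    out = np.zeros((x.shape[0], experts[0][1].shape[1]))
    for t in range(x.shape[0]):
        for j in range(top_k):
            w1, w2 = experts[top[t, j]]
            out[t] += gates[t, j] * dense_mlp(x[t:t + 1], w1, w2)[0]
    return out

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16
w1 = rng.normal(size=(d_model, d_ff))
w2 = rng.normal(size=(d_ff, d_model))
x = rng.normal(size=(5, d_model))

experts, router = upcycle(w1, w2, num_experts=4, rng=rng)
dense_out = dense_mlp(x, w1, w2)
moe_out = moe_forward(x, experts, router)
```

Because all experts are identical at initialization and the gates sum to one, the upcycled layer reproduces the dense output before any further training, which is why upcycling can start from the dense model's quality rather than from scratch.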
Universal Online Temporal Calibration for Optimization-based Visual-Inertial Navigation Systems
Yunfei Fan, Tianyu Zhao, Linan Guo et al.
6-Degree of Freedom (6DoF) motion estimation with a combination of visual and inertial sensors is a growing area with numerous real-world applications. However, precise calibration of the time offset between these two sensor types is a prerequisite for accurate and robust tracking. To address this, we propose a universal online temporal calibration strategy for optimization-based visual-inertial navigation systems. Technically, we incorporate the time offset td as a state parameter in the optimization residual model to align the IMU state to the corresponding image timestamp using td, angular velocity and translational velocity. This allows the temporal misalignment td to be optimized alongside other tracking states during the process. As our method only modifies the structure of the residual model, it can be applied to various optimization-based frameworks with different tracking frontends. We evaluate our calibration method with both EuRoC and simulation data and extensive experiments demonstrate that our approach provides more accurate time offset estimation and faster convergence, particularly in the presence of noisy sensor data.
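The alignment of an IMU state to an image timestamp can be sketched under a constant-velocity assumption over the short interval td. This is an illustrative model only; the abstract does not give the exact residual. The frame conventions (angular velocity in the body frame, translational velocity in the world frame) and the function names are assumptions.

```python
import numpy as np

def so3_exp(phi):
    """Rodrigues formula: rotation matrix for the rotation vector phi."""
    theta = np.linalg.norm(phi)
    if theta < 1e-12:
        return np.eye(3)
    k = phi / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def shift_state(R, p, v, omega, td):
    """Move pose (R, p) forward by the time offset td.

    Assumes omega (body frame) and v (world frame) are constant over td.
    """
    return R @ so3_exp(omega * td), p + v * td

R = np.eye(3)
p = np.zeros(3)
v = np.array([1.0, 0.0, 0.0])             # 1 m/s along x
omega = np.array([0.0, 0.0, np.pi / 2])   # quarter turn per second about z
R2, p2 = shift_state(R, p, v, omega, td=1.0)
```

Since the shifted pose is a differentiable function of td, an optimizer that already estimates the tracking states can treat td as one more variable in the same residual.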
8.6 HC Nov 5, 2021
Understanding Barriers and Design Opportunities to Improve Healthcare and QOL for Older Adults through Voice Assistants
Chen Chen, Janet G. Johnson, Kemeberly Charles et al.
Voice based Intelligent Virtual Assistants (IVAs) promise to improve healthcare management and Quality of Life (QOL) by introducing the paradigm of hands free and eye free interactions. However, there has been little understanding regarding the challenges for designing such systems for older adults, especially when it comes to healthcare related tasks. To tackle this, we consider the processes of care delivery and QOL enhancements for older adults as a collaborative task between patients and providers. By interviewing 16 older adults living independently or semi independently and 5 providers, we identified 12 barriers that older adults might encounter during daily routine and while managing health. We ultimately highlighted key design challenges and opportunities that might be introduced when integrating voice based IVAs into the life of older adults. Our work will benefit practitioners who study and attempt to create full fledged IVA powered smart devices to deliver better care and support an increased QOL for aging populations.
1.9 RO Nov 12, 2019
HMTNet: 3D Hand Pose Estimation from Single Depth Image Based on Hand Morphological Topology
Weiguo Zhou, Xin Jiang, Chen Chen et al.
Thanks to the rapid development of CNNs and depth sensors, great progress has been made in 3D hand pose estimation. Nevertheless, the problem is still far from solved because of cluttered environments and severe self-occlusion of the hand. In this paper, we propose a method that takes advantage of the human hand morphological topology (HMT) structure to improve pose estimation performance. The main contributions of our work are as follows. First, to extract more powerful features, we concatenate the original and last layers of the initial feature extraction module to better preserve hand information. Next, we propose a regression module inspired by hand morphological topology. In this submodule, we design a tree-like network structure according to the distribution of hand joints to exploit their high-order dependencies. Lastly, we conduct ablation experiments on each dataset to verify the proposed method. Experimental results on three popular hand pose datasets show superior performance of our method compared with state-of-the-art methods. On the ICVL and NYU datasets, our method achieves a large improvement over 2D state-of-the-art methods. On the MSRA dataset, our method achieves accuracy comparable to state-of-the-art methods. In summary, among current methods of approximately equal accuracy, ours is the most efficient, running at 220.7 fps on a single GPU. The code will be available at.
3.8 CV Jul 12, 2017
Deep Fisher Discriminant Learning for Mobile Hand Gesture Recognition
Chunyu Xie, Ce Li, Baochang Zhang et al.
Gesture recognition is a challenging problem in the field of biometrics. In this paper, we integrate the Fisher criterion into the Bidirectional Long-Short Term Memory (BLSTM) network and the Bidirectional Gated Recurrent Unit (BGRU), thus leading to two new deep models termed F-BLSTM and F-BGRU. Both Fisher discriminative deep models can effectively classify gestures by analyzing the acceleration and angular velocity data of human gestures. Moreover, we collect a large Mobile Gesture Database (MGD) based on accelerations and angular velocities, containing 5547 sequences of 12 gestures. Extensive experiments are conducted to validate the superior performance of the proposed networks compared to the state-of-the-art BLSTM and BGRU on the MGD database and two benchmark databases (i.e. BUAA mobile gesture and SmartWatch gesture).
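For readers unfamiliar with the Fisher criterion, the quantity it rewards can be sketched as the ratio of between-class scatter to within-class scatter of feature vectors. The snippet below is a generic illustration on hand-made 2D features, not the F-BLSTM or F-BGRU loss; how the paper attaches the criterion to the recurrent networks is not specified in the abstract, and the function name `fisher_ratio` is hypothetical.

```python
import numpy as np

def fisher_ratio(features, labels):
    """Trace-based Fisher ratio: between-class scatter / within-class scatter."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    global_mean = features.mean(axis=0)
    within, between = 0.0, 0.0
    for c in np.unique(labels):
        cls = features[labels == c]
        mu = cls.mean(axis=0)
        within += ((cls - mu) ** 2).sum()                     # spread inside class c
        between += len(cls) * ((mu - global_mean) ** 2).sum() # class mean vs global mean
    return between / within

labels = [0, 0, 1, 1]
separated = fisher_ratio([[0, 0], [0, 2], [10, 0], [10, 2]], labels)
overlapping = fisher_ratio([[0, 0], [0, 2], [1, 0], [1, 2]], labels)
```

A larger ratio means classes are compact and far apart, so a training objective that encourages a high ratio pushes the learned gesture features toward easier separation.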