Yu Lu

CV
h-index: 66
Papers: 12
Citations: 6,137
Novelty: 40%
AI Score: 37

12 Papers

3.9 | AI | Aug 8, 2023 | Code
NEOLAF, an LLM-powered neural-symbolic cognitive architecture

Richard Jiarui Tong, Cassie Chen Cao, Timothy Xueqian Lee et al.

This paper presents the Never Ending Open Learning Adaptive Framework (NEOLAF), an integrated neural-symbolic cognitive architecture that models and constructs intelligent agents. The NEOLAF framework is a better approach to constructing intelligent agents than either the pure connectionist or the pure symbolic approach because of its explainability, incremental learning, efficiency, collaborative and distributed learning, human-in-the-loop enablement, and self-improvement. The paper further presents an experiment in which a NEOLAF agent, built as a problem-solving agent, is fed complex math problems from the open-source MATH dataset. The results demonstrate NEOLAF's superior learning capability and its potential to revolutionize the field of cognitive architectures and self-improving adaptive instructional systems.

46.2 | CL | Apr 20, 2021 | Code
RoFormer: Enhanced Transformer with Rotary Position Embedding

Jianlin Su, Yu Lu, Shengfeng Pan et al.

Position encoding has recently been shown to be effective in the transformer architecture. It enables valuable supervision for dependency modeling between elements at different positions of the sequence. In this paper, we first investigate various methods to integrate positional information into the learning process of transformer-based language models. Then, we propose a novel method named Rotary Position Embedding (RoPE) to effectively leverage the positional information. Specifically, the proposed RoPE encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative position dependency in the self-attention formulation. Notably, RoPE enables valuable properties, including the flexibility of sequence length, decaying inter-token dependency with increasing relative distances, and the capability of equipping linear self-attention with relative position encoding. Finally, we evaluate the enhanced transformer with rotary position embedding, also called RoFormer, on various long text classification benchmark datasets. Our experiments show that it consistently outperforms its alternatives. Furthermore, we provide a theoretical analysis to explain some experimental results. RoFormer is already integrated into Huggingface: https://huggingface.co/docs/transformers/model_doc/roformer.
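
The rotation idea in this abstract can be illustrated with a minimal pure-Python sketch. It is not the RoFormer implementation: the function names `rope` and `dot`, the toy vectors, and the frequency schedule `base ** (-2 * i / d)` with `base = 10000` are assumptions made for illustration. Each consecutive pair of coordinates is rotated by an angle proportional to the absolute position, and the dot product between a rotated query and a rotated key then depends only on the difference between their positions.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each pair (vec[2i], vec[2i+1]) by the angle pos * theta_i.

    theta_i = base ** (-2 * i / d) is an assumed frequency schedule.
    """
    d = len(vec)
    assert d % 2 == 0, "dimension must be even"
    out = []
    for i in range(d // 2):
        angle = pos * base ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        # Standard 2D rotation of the pair (x, y).
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query and key vectors.
q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.9]

# Both position pairs are 3 apart, so the attention scores should agree.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 103), rope(k, 100))
print(abs(score_near - score_far) < 1e-9)
```

Because each rotation is orthogonal, the norm of a vector is unchanged by `rope`; only the relative angle between query and key pairs carries positional information.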

6.2 | LG | Apr 25, 2018 | Code
Structured Pruning for Efficient ConvNets via Incremental Regularization

Huan Wang, Qiming Zhang, Yuehai Wang et al.

Parameter pruning is a promising approach for CNN compression and acceleration that eliminates redundant model parameters with tolerable performance degradation. Despite their effectiveness, existing regularization-based parameter pruning methods usually drive weights towards zero with large and constant regularization factors, which neglects the fragility of the expressiveness of CNNs and thus calls for a gentler regularization scheme so that the networks can adapt during pruning. To achieve this, we propose a new regularization-based pruning method, named IncReg, to incrementally assign different regularization factors to different weights based on their relative importance. Empirical analysis on the CIFAR-10 dataset verifies the merits of IncReg. Further extensive experiments with popular CNNs on the CIFAR-10 and ImageNet datasets show that IncReg achieves results comparable to or better than state-of-the-art methods. Our source code and trained models are available here: https://github.com/mingsun-tse/caffe_increg.

2.3 | AS | Dec 26, 2024
Robust Speech and Natural Language Processing Models for Depression Screening

Y. Lu, A. Harati, T. Rutowski et al.

Depression is a global health concern with a critical need for increased patient screening. Speech technology offers advantages for remote screening but must perform robustly across patients. We describe two deep learning models developed for this purpose. One model is based on acoustics; the other is based on natural language processing. Both models employ transfer learning. The data come from a depression-labeled corpus in which 11,000 unique users interacted with a human-machine application using conversational speech. Results on binary depression classification show that both models perform at or above AUC=0.80 on unseen data with no speaker overlap. Performance is further analyzed as a function of test subset characteristics, finding that the models are generally robust over speaker and session variables. We conclude that models based on these approaches offer promise for generalized automated depression screening.

10.9 | CL | Mar 19, 2025
MASS: Mathematical Data Selection via Skill Graphs for Pretraining Large Language Models

Jiazheng Li, Lu Yu, Qing Cui et al.

High-quality data plays a critical role in the pretraining and fine-tuning of large language models (LLMs), even determining their performance ceiling to some degree. Consequently, numerous data selection methods have been proposed to identify subsets of data that can effectively and efficiently enhance model performance. However, most of these methods focus on general data selection and tend to overlook the specific nuances of domain-related data. In this paper, we introduce MASS, a MAthematical data Selection framework using the Skill graph for pretraining LLMs in the mathematical reasoning domain. By taking into account the unique characteristics of mathematics and reasoning, we construct a skill graph that captures the mathematical skills and their interrelations from a reference dataset. This skill graph guides us in assigning quality scores to the target dataset, enabling us to select the top-ranked subset which is further used to pretrain LLMs. Experimental results demonstrate the efficiency and effectiveness of MASS across different model sizes (1B and 7B) and pretraining datasets (web data and synthetic data). Specifically, in terms of efficiency, models trained on subsets selected by MASS can achieve similar performance to models trained on the original datasets, with a significant reduction in the number of trained tokens, ranging from 50% to 70% fewer tokens. In terms of effectiveness, when trained on the same amount of tokens, models trained on the data selected by MASS outperform those trained on the original datasets by 3.3% to 5.9%. These results underscore the potential of MASS to improve both the efficiency and effectiveness of pretraining LLMs.

38.7 | CV | Nov 30, 2021 | Code
CRIS: CLIP-Driven Referring Image Segmentation

Zhaoqing Wang, Yu Lu, Qiang Li et al.

Referring image segmentation aims to segment a referent via a natural linguistic expression. Due to the distinct data properties of text and image, it is challenging for a network to align text and pixel-level features well. Existing approaches use pretrained models to facilitate learning, yet separately transfer the language/vision knowledge from pretrained models, ignoring the multi-modal corresponding information. Inspired by the recent advance in Contrastive Language-Image Pretraining (CLIP), in this paper, we propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS). To transfer the multi-modal knowledge effectively, CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment. More specifically, we design a vision-language decoder to propagate fine-grained semantic information from textual representations to each pixel-level activation, which promotes consistency between the two modalities. In addition, we present text-to-pixel contrastive learning to explicitly enforce the text feature to be similar to the related pixel-level features and dissimilar to the irrelevant ones. The experimental results on three benchmark datasets demonstrate that our proposed framework significantly outperforms the state-of-the-art performance without any post-processing. The code will be released.

2.4 | AI | Oct 31, 2021
Interpreting Deep Knowledge Tracing Model on EdNet Dataset

Deliang Wang, Yu Lu, Qinggang Meng et al.

With more deep learning techniques being introduced into the knowledge tracing domain, the interpretability issue of knowledge tracing models has attracted researchers' attention. Our previous study (Lu et al. 2020) on building and interpreting the KT model mainly adopts the ASSISTment dataset (Feng, Heffernan, and Koedinger 2009), whose size is relatively small. In this work, we perform similar tasks on a large and newly available dataset called EdNet (Choi et al. 2020). The preliminary experimental results show the effectiveness of the interpreting techniques, while more questions and tasks remain to be explored.

10.6 | CV | Sep 14, 2020 | Code
GINet: Graph Interaction Network for Scene Parsing

Tianyi Wu, Yu Lu, Yu Zhu et al.

Recently, context reasoning using image regions beyond local convolution has shown great potential for scene parsing. In this work, we explore how to incorporate linguistic knowledge to promote context reasoning over image regions by proposing a Graph Interaction unit (GI unit) and a Semantic Context Loss (SC-loss). The GI unit is capable of enhancing feature representations of convolution networks over high-level semantics and learning the semantic coherency adaptively to each sample. Specifically, the dataset-based linguistic knowledge is first incorporated in the GI unit to promote context reasoning over the visual graph, then the evolved representations of the visual graph are mapped to each local representation to enhance the discriminative capability for scene parsing. The GI unit is further improved by the SC-loss to enhance the semantic representations over the exemplar-based semantic graph. We perform full ablation studies to demonstrate the effectiveness of each component in our approach. In particular, the proposed GINet outperforms the state-of-the-art approaches on popular benchmarks, including Pascal-Context and COCO Stuff.

9.6 | LG | May 13, 2020
Towards Interpretable Deep Learning Models for Knowledge Tracing

Yu Lu, Deliang Wang, Qinggang Meng et al.

As an important technique for modeling the knowledge states of learners, traditional knowledge tracing (KT) models have been widely used to support intelligent tutoring systems and MOOC platforms. Driven by the fast advancement of deep learning techniques, deep neural networks have recently been adopted to design new KT models that achieve better prediction performance. However, the lack of interpretability of these models has seriously impeded their practical application, as their outputs and working mechanisms suffer from an opaque decision process and complex inner structures. We thus propose to adopt the post-hoc method to tackle the interpretability issue for deep learning based knowledge tracing (DLKT) models. Specifically, we focus on applying the layer-wise relevance propagation (LRP) method to interpret an RNN-based DLKT model by backpropagating the relevance from the model's output layer to its input layer. The experimental results show the feasibility of using the LRP method for interpreting the DLKT model's predictions, and partially validate the computed relevance scores at both the question level and the concept level. We believe it can be a solid step towards fully interpreting DLKT models and promoting their practical application in the education domain.

1.2 | CV | Jan 31, 2020
C-DLinkNet: considering multi-level semantic features for human parsing

Yu Lu, Muyan Feng, Ming Wu et al.

Human parsing is an essential branch of semantic segmentation: a fine-grained semantic segmentation task that identifies the constituent parts of the human body. The challenge of human parsing is to extract effective semantic features to resolve deformation and multi-scale variations. In this work, we propose an end-to-end model called C-DLinkNet, based on LinkNet, which contains a new module named Smooth Module to combine the multi-level features in the decoder. C-DLinkNet is capable of producing competitive parsing performance compared with state-of-the-art methods with smaller input sizes and no additional information, i.e., achieving mIoU=53.05 on the validation set of the LIP dataset.

4.1 | CV | Nov 26, 2019
Learning Efficient Video Representation with Video Shuffle Networks

Pingchuan Ma, Yao Zhou, Yu Lu et al.

3D CNNs have shown a strong ability to learn spatiotemporal representations in recent video recognition tasks. However, inflating 2D convolution to 3D inevitably introduces additional computational costs, making it cumbersome in practical deployment. We consider whether there is a way to equip the conventional 2D convolution with temporal vision without expanding its kernel. To this end, we propose the video shuffle, a parameter-free plug-in component that efficiently reallocates the inputs of 2D convolution so that its receptive field can be extended to the temporal dimension. In practice, video shuffle first divides each frame feature into multiple groups and then aggregates the grouped features via a temporal shuffle operation. This allows the following 2D convolution to aggregate the global spatiotemporal features. The proposed video shuffle can be flexibly inserted into popular 2D CNNs, forming the Video Shuffle Networks (VSN). With a simple yet efficient implementation, VSN performs surprisingly well on temporal modeling benchmarks. In experiments, VSN not only gains non-trivial improvements on Kinetics and Moments in Time, but also achieves state-of-the-art performance on the Something-Something-V1 and Something-Something-V2 datasets.
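
The abstract does not spell out the shuffle itself, so the following pure-Python sketch shows only one possible reading of "divide each frame feature into groups, then shuffle across time". The function name `temporal_shuffle`, the list-based feature layout, and the specific interleaving order (a transpose over the frame and group axes, analogous to a channel shuffle) are all assumptions, not the authors' method. The point is that the operation has no parameters and leaves every output frame holding channel groups drawn from several input frames.

```python
def temporal_shuffle(frames, groups):
    """Parameter-free shuffle of channel groups across frames (assumed scheme).

    frames: list of T per-frame feature vectors, each a list of length C.
    groups: number of channel groups G; C must be divisible by G.
    """
    num_frames = len(frames)
    channels = len(frames[0])
    assert channels % groups == 0, "channels must split evenly into groups"
    size = channels // groups

    # chunks[t][g] is the g-th channel group of frame t.
    chunks = [[f[g * size:(g + 1) * size] for g in range(groups)] for f in frames]

    # Reorder from (frame, group) order to (group, frame) order.
    flat = [chunks[t][g] for g in range(groups) for t in range(num_frames)]

    # Regroup into T frames of G chunks each and concatenate the chunks.
    return [sum(flat[t * groups:(t + 1) * groups], []) for t in range(num_frames)]

# Hypothetical toy input: 2 frames with 4 channels each.
frames = [[1, 2, 3, 4], [5, 6, 7, 8]]
shuffled = temporal_shuffle(frames, groups=2)
print(shuffled)
```

After the shuffle, a 2D convolution applied to a single frame would see features from more than one time step, which is the effect the abstract describes.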

15.5 | ML | May 25, 2016
Exact Exponent in Optimal Rates for Crowdsourcing

Chao Gao, Yu Lu, Dengyong Zhou

In many machine learning applications, crowdsourcing has become the primary means for label collection. In this paper, we study the optimal error rate for aggregating labels provided by a set of non-expert workers. Under the classic Dawid-Skene model, we establish matching upper and lower bounds with an exact exponent $mI(π)$ in which $m$ is the number of workers and $I(π)$ the average Chernoff information that characterizes the workers' collective ability. Such an exact characterization of the error exponent allows us to state a precise sample size requirement $m>\frac{1}{I(π)}\log\frac{1}{ε}$ in order to achieve an $ε$ misclassification error. In addition, our results imply the optimality of various EM algorithms for crowdsourcing initialized by consistent estimators.
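
As a small worked example of the sample size requirement stated in this abstract, the sketch below evaluates the bound for given values of the average Chernoff information and the target error. The function name `min_workers` and the numeric inputs are hypothetical, the logarithm is assumed to be natural, and any lower-order terms in the paper's result are ignored.

```python
import math

def min_workers(chernoff_info, eps):
    """Smallest integer m satisfying m > (1 / I) * log(1 / eps).

    chernoff_info: assumed value of the average Chernoff information I.
    eps: target misclassification error, 0 < eps < 1.
    """
    bound = math.log(1.0 / eps) / chernoff_info
    # The inequality is strict, so step to the next integer above the bound.
    return math.floor(bound) + 1

# Hypothetical values: I = 0.1 and a 1% target error.
# log(100) / 0.1 is about 46.05, so 47 workers satisfy the inequality.
m = min_workers(0.1, 0.01)
print(m)
```

Halving the Chernoff information doubles the bound, which matches the role of $I(π)$ as a measure of the workers' collective ability.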