CVSep 18, 2024Code
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any ResolutionPeng Wang, Shuai Bai, Sinan Tan et al.
We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model's visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size-with versions at 2B, 8B, and 72B parameters-and the amount of training data, the Qwen2-VL Series achieves highly competitive performance. Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at https://github.com/QwenLM/Qwen2-VL .
CLJul 15, 2024
Qwen2 Technical ReportAn Yang, Baosong Yang, Binyuan Hui et al.
This report introduces the Qwen2 series, the latest addition to our large language models and large multimodal models. We release a comprehensive suite of foundational and instruction-tuned language models, encompassing a parameter range from 0.5 to 72 billion, featuring dense models and a Mixture-of-Experts model. Qwen2 surpasses most prior open-weight models, including its predecessor Qwen1.5, and exhibits competitive performance relative to proprietary models across diverse benchmarks on language understanding, generation, multilingual proficiency, coding, mathematics, and reasoning. The flagship model, Qwen2-72B, showcases remarkable performance: 84.2 on MMLU, 37.9 on GPQA, 64.6 on HumanEval, 89.5 on GSM8K, and 82.4 on BBH as a base language model. The instruction-tuned variant, Qwen2-72B-Instruct, attains 9.1 on MT-Bench, 48.1 on Arena-Hard, and 35.7 on LiveCodeBench. Moreover, Qwen2 demonstrates robust multilingual capabilities, proficient in approximately 30 languages, spanning English, Chinese, Spanish, French, German, Arabic, Russian, Korean, Japanese, Thai, Vietnamese, and more, underscoring its versatility and global reach. To foster community innovation and accessibility, we have made the Qwen2 model weights openly available on Hugging Face and ModelScope, and the supplementary materials including example code on GitHub. These platforms also include resources for quantization, fine-tuning, and deployment, facilitating a wide range of applications and research endeavors.
CLDec 5, 2024Code
Agent AI with LangGraph: A Modular Framework for Enhancing Machine Translation Using Large Language ModelsJialin Wang, Zhihua Duan
This paper explores the transformative role of Agent AI and LangGraph in advancing the automation and effectiveness of machine translation (MT). Agents are modular components designed to perform specific tasks, such as translating between particular languages, with specializations like TranslateEnAgent, TranslateFrenchAgent, and TranslateJpAgent for English, French, and Japanese translations, respectively. These agents leverage the powerful semantic capabilities of large language models (LLMs), such as GPT-4o, to ensure accurate, contextually relevant translations while maintaining modularity, scalability, and context retention. LangGraph, a graph-based framework built on LangChain, simplifies the creation and management of these agents and their workflows. It supports dynamic state management, enabling agents to maintain dialogue context and automates complex workflows by linking agents and facilitating their collaboration. With flexibility, open-source community support, and seamless integration with LLMs, LangGraph empowers agents to deliver high-quality translations. Together, Agent AI and LangGraph create a cohesive system where LangGraph orchestrates agent interactions, ensuring that user inputs are analyzed, routed, and processed efficiently. Experimental results demonstrate the potential of this system to enhance multilingual translation accuracy and scalability. By highlighting modular design and automated workflows, this paper sets the stage for further innovations in intelligent machine translation services.
CVFeb 19, 2025
Qwen2.5-VL Technical ReportShuai Bai, Keqin Chen, Xuejing Liu et al. · pku
We introduce Qwen2.5-VL, the latest flagship model of Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. A standout feature of Qwen2.5-VL is its ability to localize objects using bounding boxes or points accurately. It provides robust structured data extraction from invoices, forms, and tables, as well as detailed analysis of charts, diagrams, and layouts. To handle complex inputs, Qwen2.5-VL introduces dynamic resolution processing and absolute time encoding, enabling it to process images of varying sizes and videos of extended durations (up to hours) with second-level event localization. This allows the model to natively perceive spatial scales and temporal dynamics without relying on traditional normalization techniques. By training a native dynamic-resolution Vision Transformer (ViT) from scratch and incorporating Window Attention, we reduce computational overhead while maintaining native resolution. As a result, Qwen2.5-VL excels not only in static image and document understanding but also as an interactive visual agent capable of reasoning, tool usage, and task execution in real-world scenarios such as operating computers and mobile devices. Qwen2.5-VL is available in three sizes, addressing diverse use cases from edge AI to high-performance computing. The flagship Qwen2.5-VL-72B model matches state-of-the-art models like GPT-4o and Claude 3.5 Sonnet, particularly excelling in document and diagram understanding. Additionally, Qwen2.5-VL maintains robust linguistic performance, preserving the core language competencies of the Qwen2.5 LLM.
HCMar 27
FlexiCamAR: Enhancing Everyday Camera Interactions on AR Glasses with a Flexible Additional ViewpointZiming Li, Hongji Li, Jialin Wang et al.
The recent emergence and popularity of consumer-grade augmented reality (AR) glasses from major technology companies highlight their potential to become the next daily computing platform. A dominant design trend in this context is the integration of a front-facing camera to deliver a first-person perspective. While this approach is intuitive, there is limited evidence that it is optimal (or sufficient) for supporting users in daily tasks. This paper explores a more effective camera interaction technique for AR glasses, which we term ``FlexiCamAR." This novel method aims to enhance both efficiency and the range of applications for AR glasses by offering flexible and comfortable secondary camera viewpoints. To investigate the applicability and usability of this approach, we developed a ring camera prototype that can be attached to users' fingers. We then conducted a user study with 12 participants, comparing FlexiCamAR against the baseline, a traditional front-facing AR camera setup, across two common tasks: taking photos and scanning QR codes. Our findings show that FlexiCamAR significantly reduces physical load. We also explore potential scenarios where the additional viewpoint afforded by FlexiCamAR proves valuable, such as capturing low-angle perspectives or navigating confined spaces. Participant feedback further suggests strong potential for additional applications, including selfie taking, video conferencing, and object scanning. Overall, FlexiCamAR presents a novel interaction approach that can serve as a powerful supplement or alternative to the first-person perspective, significantly improving the adaptability of AR glasses for everyday use.
CLAug 22, 2024
Implicit Sentiment Analysis Based on Chain of Thought PromptingZhihua Duan, Jialin Wang
Implicit Sentiment Analysis (ISA) is a crucial research area in natural language processing. Inspired by the idea of large language model Chain of Thought (CoT), this paper introduces a Sentiment Analysis of Thinking (SAoT) framework. The framework first analyzes the implicit aspects and opinions in the text using common sense and thinking chain capabilities. Then, it reflects on the process of implicit sentiment analysis and finally deduces the polarity of sentiment. The model is evaluated on the SemEval 2014 dataset, consisting of 1120 restaurant reviews and 638 laptop reviews. The experimental results demonstrate that the utilization of the ERNIE-Bot-4+SAoT model yields a notable performance improvement. Specifically, on the restaurant dataset, the F1 score reaches 75.27, accompanied by an ISA score of 66.29. Similarly, on the computer dataset, the F1 score achieves 76.50, while the ISA score amounts to 73.46. Comparatively, the ERNIE-Bot-4+SAoT model surpasses the BERTAsp + SCAPt baseline by an average margin of 47.99%.
CLMar 26, 2025
Qwen2.5-Omni Technical ReportJin Xu, Zhifang Guo, Jinzheng He et al.
In this report, we present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable the streaming of multimodal information inputs, both audio and visual encoders utilize a block-wise processing approach. To synchronize the timestamps of video inputs with audio, we organize the audio and video sequentially in an interleaved manner and propose a novel position embedding approach, named TMRoPE(Time-aligned Multimodal RoPE). To concurrently generate text and speech while avoiding interference between the two modalities, we propose \textbf{Thinker-Talker} architecture. In this framework, Thinker functions as a large language model tasked with text generation, while Talker is a dual-track autoregressive model that directly utilizes the hidden representations from the Thinker to produce audio tokens as output. Both the Thinker and Talker models are designed to be trained and inferred in an end-to-end manner. For decoding audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial package delay. Qwen2.5-Omni is comparable with the similarly sized Qwen2.5-VL and outperforms Qwen2-Audio. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks like Omni-Bench. Notably, Qwen2.5-Omni's performance in end-to-end speech instruction following is comparable to its capabilities with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omni's streaming Talker outperforms most existing streaming and non-streaming alternatives in robustness and naturalness.
AIAug 22, 2024
Multi-tool Integration Application for Math Reasoning Using Large Language ModelZhihua Duan, Jialin Wang
Mathematical reasoning is an important research direction in the field of artificial intelligence. This article proposes a novel multi tool application framework for mathematical reasoning, aiming to achieve more comprehensive and accurate mathematical reasoning by utilizing the collaborative effect of large language models (LLMs) and multiple external tools. Firstly, use a Math Tool to perform basic mathematical calculations during the inference process through interaction with LLM. Secondly, Code Tool can generate code fragments that comply with syntax rules and execute them, providing support for complex mathematical problems. Then, through the iterative reasoning of the CoT Tool, the logical coherence and accuracy of mathematical reasoning are enhanced. Ultimately, by using self consistency tools to select the final answer based on different parameters, the consistency and reliability of reasoning are improved. Through the synergistic effect of these tools, the framework has achieved significant performance improvement in mathematical reasoning tasks. We conducted experiments on the NumGLUE Task 4 test set, which includes 220 mathematical reasoning fill in the blank questions. The experimental results showed that, based on Math Tool, Code Tool, and CoT Tool, in Task 4 task,our method achieved an accuracy of 89.09,compared with the GPT3+FewShot baseline, Few Shot+ERNIE-4.0+self consistency improved by 49.09%, and compared with fine-tuning the Fine tuning baseline, Few Shot+ERNIE-4.0+self consistency improved by 52.29%
SEAug 22, 2024
AutoTest: Evolutionary Code Solution Selection with Test CasesZhihua Duan, Jialin Wang
With the development of code generation techniques, selecting the correct code solution from multiple candidate solutions has become a crucial task. This study proposes AutoTest, a novel technique that combines automated test case generation with code solution execution to optimize the selection process using an evolutionary genetic algorithm. Firstly, AutoTest utilizes large pre-trained language models such as codegen-16B, code-davinci-002, and incoder-6B to provide code solutions and their corresponding test cases. Then, by executing the code solutions and evaluating their performance on the test cases, a consensus set is formed. Fine-grained ranking is achieved through the selection, mutation, and crossover mechanisms based on the evolutionary genetic algorithm, with the adjustment of alpha and beta parameters. Finally, the best code solution is chosen. AutoTest demonstrates significant performance improvements on the HumanEval benchmark test. The HumanEval dataset consists of 164 programming problems, and AutoTest achieves approximately a 10% improvement over the baseline method in terms of pass@1 score.
MANov 27, 2024
Exploration of LLM Multi-Agent Application Implementation Based on LangGraph+CrewAIZhihua Duan, Jialin Wang
With the rapid development of large model technology, the application of agent technology in various fields is becoming increasingly widespread, profoundly changing people's work and lifestyles. In complex and dynamic systems, multi-agents achieve complex tasks that are difficult for a single agent to complete through division of labor and collaboration among agents. This paper discusses the integrated application of LangGraph and CrewAI. LangGraph improves the efficiency of information transmission through graph architecture, while CrewAI enhances team collaboration capabilities and system performance through intelligent task allocation and resource management. The main research contents of this paper are: (1) designing the architecture of agents based on LangGraph for precise control; (2) enhancing the capabilities of agents based on CrewAI to complete a variety of tasks. This study aims to delve into the application of LangGraph and CrewAI in multi-agent systems, providing new perspectives for the future development of agent technology, and promoting technological progress and application innovation in the field of large model intelligent agents.
HCMar 23
Would You Like to Visit My World? Cultivating Perceived Equality in Human-Agent Interaction via Observable Social Life SpacesZihong He, Shuqin Wang, Songchen Zhou et al.
Most AI agents remain confined to an instrumental "command-execution" model, resulting in unequal, one-sided interactions. While recent works attempt to build relationships through hidden memory backends, these invisible processes often fail to break the instrumental bias. In this paper, we argue that true relational equality requires agents to have an independent, observable existence. We introduce the \textit{Observable Life Spaces} paradigm, where agents inhabit a continuous virtual environment, engage in daily activities, and form social relationships that users can directly observe. Through a mixed-methods study ($N=24$), we demonstrate that only when agents are endowed with a socialized life space that is visually observable to humans can the perceived equality during interaction be significantly enhanced ($p = 0.015$). Our findings suggest that visually representing an agent's social life space can effectively shift the human-agent dynamic from a purely instrumental relationship to one characterized by perceived equality.
AIJan 17, 2025
Prompt-Based Monte Carlo Tree Search for Mitigating Hallucinations in Large ModelsZhihua Duan, Jialin Wang
With the rapid development of large models in the field of artificial intelligence, how to enhance their application capabilities in handling complex problems in the field of scientific research remains a challenging problem to be solved. This study proposes an improved Monte Carlo Tree Search (MCTS) method based on prompt words. In the simulation search stage, it introduces dynamic adjustment of exploration parameters and adaptive selection strategies, which can better balance exploration and exploitation, thereby reducing the hallucination phenomenon. This paper takes the four subsets of the SciEval dataset as the test objects, and compares the Glm-4-flash+Improved MCTS method with the methods of several existing models. The results show that the Improved MCTS method performs better, providing new ideas and methods for the application of large models in the field of scientific research.
AIDec 2, 2024
Intelligent Spark Agents: A Modular LangGraph Framework for Scalable, Visualized, and Enhanced Big Data Machine Learning WorkflowsJialin Wang, Zhihua Duan
This paper presents a Spark-based modular LangGraph framework, designed to enhance machine learning workflows through scalability, visualization, and intelligent process optimization. At its core, the framework introduces Agent AI, a pivotal innovation that leverages Spark's distributed computing capabilities and integrates with LangGraph for workflow orchestration. Agent AI facilitates the automation of data preprocessing, feature engineering, and model evaluation while dynamically interacting with data through Spark SQL and DataFrame agents. Through LangGraph's graph-structured workflows, the agents execute complex tasks, adapt to new inputs, and provide real-time feedback, ensuring seamless decision-making and execution in distributed environments. This system simplifies machine learning processes by allowing users to visually design workflows, which are then converted into Spark-compatible code for high-performance execution. The framework also incorporates large language models through the LangChain ecosystem, enhancing interaction with unstructured data and enabling advanced data analysis. Experimental evaluations demonstrate significant improvements in process efficiency and scalability, as well as accurate data-driven decision-making in diverse application scenarios. This paper emphasizes the integration of Spark with intelligent agents and graph-based workflows to redefine the development and execution of machine learning tasks in big data environments, paving the way for scalable and user-friendly AI solutions.
AINov 25, 2024
Enhancing Multi-Agent Consensus through Third-Party LLM Integration: Analyzing Uncertainty and Mitigating Hallucinations in Large Language ModelsZhihua Duan, Jialin Wang
Large Language Models (LLMs) still face challenges when dealing with complex reasoning tasks, often resulting in hallucinations, which limit the practical application of LLMs. To alleviate this issue, this paper proposes a new method that integrates different LLMs to expand the knowledge boundary, reduce dependence on a single model, and promote in-depth debate among agents. The main contributions include: 1) Introducing third-party LLMs to adjust the attention weights of agents through uncertainty estimation and confidence analysis, optimizing consensus formation in multi-agent systems; 2) Experiments on arithmetic datasets have validated the effectiveness of the method, surpassing traditional multi-agent baselines. This research provides a new perspective for large models to alleviate hallucination phenomena when dealing with complex tasks.
DCDec 10, 2024
Research on the Application of Spark Streaming Real-Time Data Analysis System and large language model Intelligent AgentsJialin Wang, Zhihua Duan
This study explores the integration of Agent AI with LangGraph to enhance real-time data analysis systems in big data environments. The proposed framework overcomes limitations of static workflows, inefficient stateful computations, and lack of human intervention by leveraging LangGraph's graph-based workflow construction and dynamic decision-making capabilities. LangGraph allows large language models (LLMs) to dynamically determine control flows, invoke tools, and assess the necessity of further actions, improving flexibility and efficiency. The system architecture incorporates Apache Spark Streaming, Kafka, and LangGraph to create a high-performance sentiment analysis system. LangGraph's capabilities include precise state management, dynamic workflow construction, and robust memory checkpointing, enabling seamless multi-turn interactions and context retention. Human-in-the-loop mechanisms are integrated to refine sentiment analysis, particularly in ambiguous or high-stakes scenarios, ensuring greater reliability and contextual relevance. Key features such as real-time state streaming, debugging via LangGraph Studio, and efficient handling of large-scale data streams make this framework ideal for adaptive decision-making. Experimental results confirm the system's ability to classify inquiries, detect sentiment trends, and escalate complex issues for manual review, demonstrating a synergistic blend of LLM capabilities and human oversight. This work presents a scalable, adaptable, and reliable solution for real-time sentiment analysis and decision-making, advancing the use of Agent AI and LangGraph in big data applications.
LGOct 4, 2025
Personalized federated prototype learning in mixed heterogeneous data scenariosJiahao Zeng, Wolong Xing, Liangtao Shi et al.
Federated learning has received significant attention for its ability to simultaneously protect customer privacy and leverage distributed data from multiple devices for model training. However, conventional approaches often focus on isolated heterogeneous scenarios, resulting in skewed feature distributions or label distributions. Meanwhile, data heterogeneity is actually a key factor in improving model performance. To address this issue, we propose a new approach called PFPL in mixed heterogeneous scenarios. The method provides richer domain knowledge and unbiased convergence targets by constructing personalized, unbiased prototypes for each client. Moreover, in the local update phase, we introduce consistent regularization to align local instances with their personalized prototypes, which significantly improves the convergence of the loss function. Experimental results on Digits and Office Caltech datasets validate the effectiveness of our approach and successfully reduce the communication cost.
LGFeb 27, 2025
Enhancing Transformer with GNN Structural Knowledge via Distillation: A Novel ApproachZhihua Duan, Jialin Wang
Integrating the structural inductive biases of Graph Neural Networks (GNNs) with the global contextual modeling capabilities of Transformers represents a pivotal challenge in graph representation learning. While GNNs excel at capturing localized topological patterns through message-passing mechanisms, their inherent limitations in modeling long-range dependencies and parallelizability hinder their deployment in large-scale scenarios. Conversely, Transformers leverage self-attention mechanisms to achieve global receptive fields but struggle to inherit the intrinsic graph structural priors of GNNs. This paper proposes a novel knowledge distillation framework that systematically transfers multiscale structural knowledge from GNN teacher models to Transformer student models, offering a new perspective on addressing the critical challenges in cross-architectural distillation. The framework effectively bridges the architectural gap between GNNs and Transformers through micro-macro distillation losses and multiscale feature alignment. This work establishes a new paradigm for inheriting graph structural biases in Transformer architectures, with broad application prospects.
HCJan 9, 2022
In-Device Feedback in Immersive Head-Mounted Displays for Distance Perception During Teleoperation of Unmanned Ground VehiclesYiming Luo, Jialin Wang, Rongkai Shi et al.
In recent years, Virtual Reality (VR) Head-Mounted Displays (HMD) have been used to provide an immersive, first-person view in real-time for the remote-control of Unmanned Ground Vehicles (UGV). One critical issue is that it is challenging to perceive the distance of obstacles surrounding the vehicle from 2D views in the HMD, which deteriorates the control of UGV. Conventional distance indicators used in HMD take up screen space which leads clutter on the display and can further reduce situation awareness of the physical environment. To address the issue, in this paper we propose off-screen in-device feedback using vibro-tactile and/or light-visual cues to provide real-time distance information for the remote control of UGV. Results from a study show a significantly better performance with either feedback type, reduced workload and improved usability in a driving task that requires continuous perception of the distance between the UGV and its environmental objects or obstacles. Our findings show a solid case for in-device vibro-tactile and/or light-visual feedback to support remote operation of UGVs that highly relies on distance perception of objects.
HCJul 12, 2021
Monoscopic vs. Stereoscopic Views and Display Types in the Teleoperation of Unmanned Ground Vehicles for Object AvoidanceYiming Luo, Jialin Wang, Hai-Ning Liang et al.
Virtual reality (VR) head-mounted displays (HMD) have recently been used to provide an immersive, first-person vision/view in real-time for manipulating remotely-controlled unmanned ground vehicles (UGV). The teleoperation of UGV can be challenging for operators when it is done in real time. One big challenge is for operators to perceive quickly and rapidly the distance of objects that are around the UGV while it is moving. In this research, we explore the use of monoscopic and stereoscopic views and display types (immersive and non-immersive VR) for operating vehicles remotely. We conducted two user studies to explore their feasibility and advantages. Results show a significantly better performance when using an immersive display with stereoscopic view for dynamic, real-time navigation tasks that require avoiding both moving and static obstacles. The use of stereoscopic view in an immersive display in particular improved user performance and led to better usability.
HCOct 13, 2020
Real-Time Detection of Simulator Sickness in Virtual Reality Games Based on Players' Psychophysiological Data during GameplayJialin Wang, Hai-Ning Liang, Diego Monteiro et al.
Virtual Reality (VR) technology has been proliferating in the last decade, especially in the last few years. However, Simulator Sickness (SS) still represents a significant problem for its wider adoption. Currently, the most common way to detect SS is using the Simulator Sickness Questionnaire (SSQ). SSQ is a subjective measurement and is inadequate for real-time applications such as VR games. This research aims to investigate how to use machine learning techniques to detect SS based on in-game characters' and users' physiological data during gameplay in VR games. To achieve this, we designed an experiment to collect such data with three types of games. We trained a Long Short-Term Memory neural network with the dataset eye-tracking and character movement data to detect SS in real-time. Our results indicate that, in VR games, our model is an accurate and efficient way to detect SS in real-time.
HCOct 7, 2020
An In-Depth Exploration of the Effect of 2D/3D Views and Controller Types on First Person Shooter Games in Virtual RealityDiego Monteiro, Hai-Ning Liang, Jialin Wang et al.
The amount of interest in Virtual Reality (VR) research has significantly increased over the past few years, both in academia and industry. The release of commercial VR Head-Mounted Displays (HMDs) has been a major contributing factor. However, there is still much to be learned, especially how views and input techniques, as well as their interaction, affect the VR experience. There is little work done on First-Person Shooter (FPS) games in VR, and those few studies have focused on a single aspect of VR FPS. They either focused on the view, e.g., comparing VR to a typical 2D display or on the controller types. To the best of our knowledge, there are no studies investigating variations of 2D/3D views in HMDs, controller types, and their interactions. As such, it is challenging to distinguish findings related to the controller type from those related to the view. If a study does not control for the input method and finds that 2D displays lead to higher performance than VR, we cannot generalize the results because of the confounding variables. To understand their interaction, we propose to analyze in more depth, whether it is the view (2D vs. 3D) or the way it is controlled that gives the platforms their respective advantages. To study the effects of the 2D/3D views, we created a 2D visual technique, PlaneFrame, that was applied inside the VR headset. Our results show that the controller type can have a significant positive impact on performance, immersion, and simulator sickness when associated with a 2D view. They further our understanding of the interactions that controllers and views have and demonstrate that comparisons are highly dependent on how both factors go together. Further, through a series of three experiments, we developed a technique that can lead to a substantial performance, a good level of immersion, and can minimize the level of simulator sickness.
CRMay 9, 2019
Bidirectional RNN-based Few-shot Training for Detecting Multi-stage AttackDi Zhao, Jiqiang Liu, Jialin Wang et al.
"Feint Attack", as a new type of APT attack, has become the focus of attention. It adopts a multi-stage attacks mode which can be concluded as a combination of virtual attacks and real attacks. Under the cover of virtual attacks, real attacks can achieve the real purpose of the attacker, as a result, it often caused huge losses inadvertently. However, to our knowledge, all previous works use common methods such as Causal-Correlation or Cased-based to detect outdated multi-stage attacks. Few attentions have been paid to detect the "Feint Attack", because the difficulty of detection lies in the diversification of the concept of "Feint Attack" and the lack of professional datasets, many detection methods ignore the semantic relationship in the attack. Aiming at the existing challenge, this paper explores a new method to solve the problem. In the attack scenario, the fuzzy clustering method based on attribute similarity is used to mine multi-stage attack chains. Then we use a few-shot deep learning algorithm (SMOTE&CNN-SVM) and bidirectional Recurrent Neural Network model (Bi-RNN) to obtain the "Feint Attack" chains. "Feint Attack" is simulated by the real attack inserted in the normal causal attack chain, and the addition of the real attack destroys the causal relationship of the original attack chain. So we used Bi-RNN coding to obtain the hidden feature of "Feint Attack" chain. In the end, our method achieved the goal to detect the "Feint Attack" accurately by using the LLDoS1.0 and LLDoS2.0 of DARPA2000 and CICIDS2017 of Canadian Institute for Cybersecurity.
CVNov 19, 2018
Visual-Texual Emotion Analysis with Deep Coupled Video and Danmu Neural NetworksChenchen Li, Jialin Wang, Hongwei Wang et al.
User emotion analysis toward videos is to automatically recognize the general emotional status of viewers from the multimedia content embedded in the online video stream. Existing works fall in two categories: 1) visual-based methods, which focus on visual content and extract a specific set of features of videos. However, it is generally hard to learn a mapping function from low-level video pixels to high-level emotion space due to great intra-class variance. 2) textual-based methods, which focus on the investigation of user-generated comments associated with videos. The learned word representations by traditional linguistic approaches typically lack emotion information and the global comments usually reflect viewers' high-level understandings rather than instantaneous emotions. To address these limitations, in this paper, we propose to jointly utilize video content and user-generated texts simultaneously for emotion analysis. In particular, we introduce exploiting a new type of user-generated texts, i.e., "danmu", which are real-time comments floating on the video and contain rich information to convey viewers' emotional opinions. To enhance the emotion discriminativeness of words in textual feature extraction, we propose Emotional Word Embedding (EWE) to learn text representations by jointly considering their semantics and emotions. Afterwards, we propose a novel visual-textual emotion analysis model with Deep Coupled Video and Danmu Neural networks (DCVDN), in which visual and textual features are synchronously extracted and fused to form a comprehensive representation by deep-canonically-correlated-autoencoder-based multi-view learning. Through extensive experiments on a self-crawled real-world video-danmu dataset, we prove that DCVDN significantly outperforms the state-of-the-art baselines.
IRMar 9, 2018
RippleNet: Propagating User Preferences on the Knowledge Graph for Recommender SystemsHongwei Wang, Fuzheng Zhang, Jialin Wang et al.
To address the sparsity and cold start problem of collaborative filtering, researchers usually make use of side information, such as social networks or item attributes, to improve recommendation performance. This paper considers the knowledge graph as the source of side information. To address the limitations of existing embedding-based and path-based methods for knowledge-graph-aware recommendation, we propose Ripple Network, an end-to-end framework that naturally incorporates the knowledge graph into recommender systems. Similar to actual ripples propagating on the surface of water, Ripple Network stimulates the propagation of user preferences over the set of knowledge entities by automatically and iteratively extending a user's potential interests along links in the knowledge graph. The multiple "ripples" activated by a user's historically clicked items are thus superposed to form the preference distribution of the user with respect to a candidate item, which could be used for predicting the final clicking probability. Through extensive experiments on real-world datasets, we demonstrate that Ripple Network achieves substantial gains in a variety of scenarios, including movie, book and news recommendation, over several state-of-the-art baselines.
LGNov 22, 2017
GraphGAN: Graph Representation Learning with Generative Adversarial NetsHongwei Wang, Jia Wang, Jialin Wang et al.
The goal of graph representation learning is to embed each vertex in a graph into a low-dimensional vector space. Existing graph representation learning methods can be classified into two categories: generative models that learn the underlying connectivity distribution in the graph, and discriminative models that predict the probability of edge existence between a pair of vertices. In this paper, we propose GraphGAN, an innovative graph representation learning framework unifying above two classes of methods, in which the generative model and discriminative model play a game-theoretical minimax game. Specifically, for a given vertex, the generative model tries to fit its underlying true connectivity distribution over all other vertices and produces "fake" samples to fool the discriminative model, while the discriminative model tries to detect whether the sampled vertex is from ground truth or generated by the generative model. With the competition between these two models, both of them can alternately and iteratively boost their performance. Moreover, when considering the implementation of generative model, we propose a novel graph softmax to overcome the limitations of traditional softmax function, which can be proven satisfying desirable properties of normalization, graph structure awareness, and computational efficiency. Through extensive experiments on real-world datasets, we demonstrate that GraphGAN achieves substantial gains in a variety of applications, including link prediction, node classification, and recommendation, over state-of-the-art baselines.