92.4SYJun 2
APX-Hardness of Computing Lipschitz Constants for Multi-Parametric Quadratic ProgramsXingchen Li, Kunpeng Liu, Keyou You
Computing the Lipschitz constant of the solution map of a multi-parametric quadratic program is important for the analysis of optimization-based control. This problem is governed by three factors: the parameter dimension, the number of decision variables, and the number of constraints. While empirical evidence has long suggested exponential complexity, a rigorous complexity-theoretic proof has been lacking. In this paper, we fill this gap by proving that this problem is not only NP-hard but also APX-hard. Furthermore, we reveal that: (a) the problem becomes polynomial-time solvable when the number of constraints or decision variables is fixed; and (b) both NP-hardness and APX-hardness persist even in the scalar parameter case. These results confirm that the complexity stems from the number of constraints and variables, rather than the parameter dimension. Numerical experiments further validate these theoretical findings.
CVAug 3, 2022
Integrating Object-aware and Interaction-aware Knowledge for Weakly Supervised Scene Graph GenerationXingchen Li, Long Chen, Wenbo Ma et al.
Recently, increasing efforts have been focused on Weakly Supervised Scene Graph Generation (WSSGG). The mainstream solution for WSSGG typically follows the same pipeline: they first align text entities in the weak image-level supervisions (e.g., unlocalized relation triplets or captions) with image regions, and then train SGG models in a fully-supervised manner with aligned instance-level "pseudo" labels. However, we argue that most existing WSSGG works only focus on object-consistency, which means the grounded regions should have the same object category label as text entities. While they neglect another basic requirement for an ideal alignment: interaction-consistency, which means the grounded region pairs should have the same interactions (i.e., visual relations) as text entity pairs. Hence, in this paper, we propose to enhance a simple grounding module with both object-aware and interaction-aware knowledge to acquire more reliable pseudo labels. To better leverage these two types of knowledge, we regard them as two teachers and fuse their generated targets to guide the training process of our grounding module. Specifically, we design two different strategies to adaptively assign weights to different teachers by assessing their reliability on each training sample. Extensive experiments have demonstrated that our method consistently improves WSSGG performance on various kinds of weak supervision.
CVAug 3, 2022
Rethinking the Evaluation of Unbiased Scene Graph GenerationXingchen Li, Long Chen, Jian Shao et al.
Current Scene Graph Generation (SGG) methods tend to predict frequent predicate categories and fail to recognize rare ones due to the severe imbalanced distribution of predicates. To improve the robustness of SGG models on different predicate categories, recent research has focused on unbiased SGG and adopted mean Recall@K (mR@K) as the main evaluation metric. However, we discovered two overlooked issues about this de facto standard metric, which makes current unbiased SGG evaluation vulnerable and unfair: 1) mR@K neglects the correlations among predicates and unintentionally breaks category independence when ranking all the triplet predictions together regardless of the predicate categories. 2) mR@K neglects the compositional diversity of different predicates and assigns excessively high weights to some oversimple category samples with limited composable relation triplet types. In addition, we investigate the under-explored correlation between objects and predicates, which can serve as a simple but strong baseline for unbiased SGG. In this paper, we refine mR@K and propose two complementary evaluation metrics for unbiased SGG: Independent Mean Recall (MR) and weighted IMR (wIMR). These two metrics are designed by considering the category independence and diversity of composable relation triplets, respectively. We compare the proposed metrics with the de facto standard metrics through extensive experiments and discuss the solutions to evaluate unbiased SGG in a more trustworthy way.
CVMar 20, 2023
Decomposed Prototype Learning for Few-Shot Scene Graph GenerationXingchen Li, Jun Xiao, Guikun Chen et al.
Today's scene graph generation (SGG) models typically require abundant manual annotations to learn new predicate types. Therefore, it is difficult to apply them to real-world applications with massive uncommon predicate categories whose annotations are hard to collect. In this paper, we focus on Few-Shot SGG (FSSGG), which encourages SGG models to be able to quickly transfer previous knowledge and recognize unseen predicates well with only a few examples. However, current methods for FSSGG are hindered by the high intra-class variance of predicate categories in SGG: On one hand, each predicate category commonly has multiple semantic meanings under different contexts. On the other hand, the visual appearance of relation triplets with the same predicate differs greatly under different subject-object compositions. Such great variance of inputs makes it hard to learn generalizable representation for each predicate category with current few-shot learning (FSL) methods. However, we found that this intra-class variance of predicates is highly related to the composed subjects and objects. To model the intra-class variance of predicates with subject-object context, we propose a novel Decomposed Prototype Learning (DPL) model for FSSGG. Specifically, we first construct a decomposable prototype space to capture diverse semantics and visual patterns of subjects and objects for predicates by decomposing them into multiple prototypes. Afterwards, we integrate these prototypes with different weights to generate query-adaptive predicate representation with more reliable semantics for each query sample. We conduct extensive experiments and compare with various baseline methods to show the effectiveness of our method.
ROApr 4, 2023
USTC FLICAR: A Sensors Fusion Dataset of LiDAR-Inertial-Camera for Heavy-duty Autonomous Aerial Work RobotsZiming Wang, Yujiang Liu, Yifan Duan et al.
In this paper, we present the USTC FLICAR Dataset, which is dedicated to the development of simultaneous localization and mapping and precise 3D reconstruction of the workspace for heavy-duty autonomous aerial work robots. In recent years, numerous public datasets have played significant roles in the advancement of autonomous cars and unmanned aerial vehicles (UAVs). However, these two platforms differ from aerial work robots: UAVs are limited in their payload capacity, while cars are restricted to two-dimensional movements. To fill this gap, we create the "Giraffe" mapping robot based on a bucket truck, which is equipped with a variety of well-calibrated and synchronized sensors: four 3D LiDARs, two stereo cameras, two monocular cameras, Inertial Measurement Units (IMUs), and a GNSS/INS system. A laser tracker is used to record the millimeter-level ground truth positions. We also make its ground twin, the "Okapi" mapping robot, to gather data for comparison. The proposed dataset extends the typical autonomous driving sensing suite to aerial scenes, demonstrating the potential of combining autonomous driving perception systems with bucket trucks to create a versatile autonomous aerial working platform. Moreover, based on the Segment Anything Model (SAM), we produce the Semantic FLICAR dataset, which provides fine-grained semantic segmentation annotations for multimodal continuous data in both temporal and spatial dimensions. The dataset is available for download at: https://ustc-flicar.github.io/.
CVOct 25, 2023
EdgeCalib: Multi-Frame Weighted Edge Features for Automatic Targetless LiDAR-Camera CalibrationXingchen Li, Yifan Duan, Beibei Wang et al.
In multimodal perception systems, achieving precise extrinsic calibration between LiDAR and camera is of critical importance. Previous calibration methods often required specific targets or manual adjustments, making them both labor-intensive and costly. Online calibration methods based on features have been proposed, but these methods encounter challenges such as imprecise feature extraction, unreliable cross-modality associations, and high scene-specific requirements. To address this, we introduce an edge-based approach for automatic online calibration of LiDAR and cameras in real-world scenarios. The edge features, which are prevalent in various environments, are aligned in both images and point clouds to determine the extrinsic parameters. Specifically, stable and robust image edge features are extracted using a SAM-based method and the edge features extracted from the point cloud are weighted through a multi-frame weighting strategy for feature filtering. Finally, accurate extrinsic parameters are optimized based on edge correspondence constraints. We conducted evaluations on both the KITTI dataset and our dataset. The results show a state-of-the-art rotation accuracy of 0.086° and a translation accuracy of 0.977 cm, outperforming existing edge-based calibration methods in both precision and robustness.
63.9AIMay 11Code
LLM4Branch: Large Language Model for Discovering Efficient Branching Policies of Integer ProgramsZhinan Hou, Xingchen Li, Yankai Zhang et al.
Efficient branching policies are essential for accelerating Mixed Integer Linear Programming (MILP) solvers. Their design has long relied on hand-crafted heuristics, and now machine learning has emerged as a promising paradigm to automate this process. However, existing learning-based methods are often hindered by their dependence on expensive expert demonstrations and the gap between training objectives and the solver's end-to-end performance. In this work, we propose LLM4Branch, a novel framework that leverages Large Language Models (LLMs) to automate the discovery of efficient branching policies. Specifically, the discovered policy is an executable program with a program skeleton generated by the LLM and a parameter vector, which is optimized via a zeroth-order method over a few instances with their end-to-end performance feedback. Extensive experiments on standard MILP benchmarks demonstrate that LLM4Branch establishes a new state-of-the-art among CPU-based methods and achieves performance competitive with advanced GPU-based models. Codes are available at https://github.com/hzn18/LLM4Branch.
CVNov 26, 2023
CalibFormer: A Transformer-based Automatic LiDAR-Camera Calibration NetworkYuxuan Xiao, Yao Li, Chengzhen Meng et al.
The fusion of LiDARs and cameras has been increasingly adopted in autonomous driving for perception tasks. The performance of such fusion-based algorithms largely depends on the accuracy of sensor calibration, which is challenging due to the difficulty of identifying common features across different data modalities. Previously, many calibration methods involved specific targets and/or manual intervention, which has proven to be cumbersome and costly. Learning-based online calibration methods have been proposed, but their performance is barely satisfactory in most cases. These methods usually suffer from issues such as sparse feature maps, unreliable cross-modality association, inaccurate calibration parameter regression, etc. In this paper, to address these issues, we propose CalibFormer, an end-to-end network for automatic LiDAR-camera calibration. We aggregate multiple layers of camera and LiDAR image features to achieve high-resolution representations. A multi-head correlation module is utilized to identify correlations between features more accurately. Lastly, we employ transformer architectures to estimate accurate calibration parameters from the correlation information. Our method achieved a mean translation error of $0.8751 \mathrm{cm}$ and a mean rotation error of $0.0562 ^{\circ}$ on the KITTI dataset, surpassing existing state-of-the-art methods and demonstrating strong robustness, accuracy, and generalization capabilities.
CVJul 16, 2024
Perception Helps Planning: Facilitating Multi-Stage Lane-Level Integration via Double-Edge StructuresGuoliang You, Xiaomeng Chu, Yifan Duan et al.
When planning for autonomous driving, it is crucial to consider essential traffic elements such as lanes, intersections, traffic regulations, and dynamic agents. However, they are often overlooked by the traditional end-to-end planning methods, likely leading to inefficiencies and non-compliance with traffic regulations. In this work, we endeavor to integrate the perception of these elements into the planning task. To this end, we propose Perception Helps Planning (PHP), a novel framework that reconciles lane-level planning with perception. This integration ensures that planning is inherently aligned with traffic constraints, thus facilitating safe and efficient driving. Specifically, PHP focuses on both edges of a lane for planning and perception purposes, taking into consideration the 3D positions of both lane edges and attributes for lane intersections, lane directions, lane occupancy, and planning. In the algorithmic design, the process begins with the transformer encoding multi-camera images to extract the above features and predicting lane-level perception results. Next, the hierarchical feature early fusion module refines the features for predicting planning attributes. Finally, the double-edge interpreter utilizes a late-fusion process specifically designed to integrate lane-level perception and planning information, culminating in the generation of vehicle control signals. Experiments on three Carla benchmarks show significant improvements in driving score of 27.20%, 33.47%, and 15.54% over existing algorithms, respectively, achieving the state-of-the-art performance, with the system operating up to 22.57 FPS.
LGJul 3, 2024
Multi-Scenario Combination Based on Multi-Agent Reinforcement Learning to Optimize the Advertising Recommendation SystemYang Zhao, Chang Zhou, Jin Cao et al.
This paper explores multi-scenario optimization on large platforms using multi-agent reinforcement learning (MARL). We address this by treating scenarios like search, recommendation, and advertising as a cooperative, partially observable multi-agent decision problem. We introduce the Multi-Agent Recurrent Deterministic Policy Gradient (MARDPG) algorithm, which aligns different scenarios under a shared objective and allows for strategy communication to boost overall performance. Our results show marked improvements in metrics such as click-through rate (CTR), conversion rate, and total sales, confirming our method's efficacy in practical settings.
CVSep 21, 2024
LFP: Efficient and Accurate End-to-End Lane-Level Planning via Camera-LiDAR FusionGuoliang You, Xiaomeng Chu, Yifan Duan et al.
Multi-modal systems enhance performance in autonomous driving but face inefficiencies due to indiscriminate processing within each modality. Additionally, the independent feature learning of each modality lacks interaction, which results in extracted features that do not possess the complementary characteristics. These issue increases the cost of fusing redundant information across modalities. To address these challenges, we propose targeting driving-relevant elements, which reduces the volume of LiDAR features while preserving critical information. This approach enhances lane level interaction between the image and LiDAR branches, allowing for the extraction and fusion of their respective advantageous features. Building upon the camera-only framework PHP, we introduce the Lane-level camera-LiDAR Fusion Planning (LFP) method, which balances efficiency with performance by using lanes as the unit for sensor fusion. Specifically, we design three modules to enhance efficiency and performance. For efficiency, we propose an image-guided coarse lane prior generation module that forecasts the region of interest (ROI) for lanes and assigns a confidence score, guiding LiDAR processing. The LiDAR feature extraction modules leverages lane-aware priors from the image branch to guide sampling for pillar, retaining essential pillars. For performance, the lane-level cross-modal query integration and feature enhancement module uses confidence score from ROI to combine low-confidence image queries with LiDAR queries, extracting complementary depth features. These features enhance the low-confidence image features, compensating for the lack of depth. Experiments on the Carla benchmarks show that our method achieves state-of-the-art performance in both driving score and infraction score, with maximum improvement of 15% and 14% over existing algorithms, respectively, maintaining high frame rate of 19.27 FPS.
ROMar 20, 2025Code
GraspCoT: Integrating Physical Property Reasoning for 6-DoF Grasping under Flexible Language InstructionsXiaomeng Chu, Jiajun Deng, Guoliang You et al.
Flexible instruction-guided 6-DoF grasping is a significant yet challenging task for real-world robotic systems. Existing methods utilize the contextual understanding capabilities of the large language models (LLMs) to establish mappings between expressions and targets, allowing robots to comprehend users' intentions in the instructions. However, the LLM's knowledge about objects' physical properties remains underexplored despite its tight relevance to grasping. In this work, we propose GraspCoT, a 6-DoF grasp detection framework that integrates a Chain-of-Thought (CoT) reasoning mechanism oriented to physical properties, guided by auxiliary question-answering (QA) tasks. Particularly, we design a set of QA templates to enable hierarchical reasoning that includes three stages: target parsing, physical property analysis, and grasp action selection. Moreover, GraspCoT presents a unified multimodal LLM architecture, which encodes multi-view observations of 3D scenes into 3D-aware visual tokens, and then jointly embeds these visual tokens with CoT-derived textual tokens within LLMs to generate grasp pose predictions. Furthermore, we present IntentGrasp, a large-scale benchmark that fills the gap in public datasets for multi-object grasp detection under diverse and indirect verbal commands. Extensive experiments on IntentGrasp demonstrate the superiority of our method, with additional validation in real-world robotic applications confirming its practicality. The code is available at https://github.com/cxmomo/GraspCoT.
98.4IVMar 13
DiT-IC: Aligned Diffusion Transformer for Efficient Image CompressionJunqi Shi, Ming Lu, Xingchen Li et al.
Diffusion-based image compression has recently shown outstanding perceptual fidelity, yet its practicality is hindered by prohibitive sampling overhead and high memory usage. Most existing diffusion codecs employ U-Net architectures, where hierarchical downsampling forces diffusion to operate in shallow latent spaces (typically with only 8x spatial downscaling), resulting in excessive computation. In contrast, conventional VAE-based codecs work in much deeper latent domains (16x - 64x downscaled), motivating a key question: Can diffusion operate effectively in such compact latent spaces without compromising reconstruction quality? To address this, we introduce DiT-IC, an Aligned Diffusion Transformer for Image Compression, which replaces the U-Net with a Diffusion Transformer capable of performing diffusion in latent space entirely at 32x downscaled resolution. DiT-IC adapts a pretrained text-to-image multi-step DiT into a single-step reconstruction model through three key alignment mechanisms: (1) a variance-guided reconstruction flow that adapts denoising strength to latent uncertainty for efficient reconstruction; (2) a self-distillation alignment that enforces consistency with encoder-defined latent geometry to enable one-step diffusion; and (3) a latent-conditioned guidance that replaces text prompts with semantically aligned latent conditions, enabling text-free inference. With these designs, DiT-IC achieves state-of-the-art perceptual quality while offering up to 30x faster decoding and drastically lower memory usage than existing diffusion-based codecs. Remarkably, it can reconstruct 2048x2048 images on a 16 GB laptop GPU.
IVMay 30, 2025Code
DLiPath: A Benchmark for the Comprehensive Assessment of Donor Liver Based on Histopathological Image DatasetLiangrui Pan, Xingchen Li, Zhongyi Chen et al.
Pathologists comprehensive evaluation of donor liver biopsies provides crucial information for accepting or discarding potential grafts. However, rapidly and accurately obtaining these assessments intraoperatively poses a significant challenge for pathologists. Features in donor liver biopsies, such as portal tract fibrosis, total steatosis, macrovesicular steatosis, and hepatocellular ballooning are correlated with transplant outcomes, yet quantifying these indicators suffers from substantial inter- and intra-observer variability. To address this, we introduce DLiPath, the first benchmark for comprehensive donor liver assessment based on a histopathology image dataset. We collected and publicly released 636 whole slide images from 304 donor liver patients at the Department of Pathology, the Third Xiangya Hospital, with expert annotations for key pathological features (including cholestasis, portal tract fibrosis, portal inflammation, total steatosis, macrovesicular steatosis, and hepatocellular ballooning). We selected nine state-of-the-art multiple-instance learning (MIL) models based on the DLiPath dataset as baselines for extensive comparative analysis. The experimental results demonstrate that several MIL models achieve high accuracy across donor liver assessment indicators on DLiPath, charting a clear course for future automated and intelligent donor liver assessment research. Data and code are available at https://github.com/panliangrui/ACM_MM_2025.
SYJan 23
ReLU Networks for Model Predictive Control: Network Complexity and Performance GuaranteesXingchen Li, Keyou You
Recent years have witnessed a resurgence in using ReLU neural networks (NNs) to represent model predictive control (MPC) policies. However, determining the required network complexity to ensure closed-loop performance remains a fundamental open problem. This involves a critical precision-complexity trade-off: undersized networks may fail to capture the MPC policy, while oversized ones may outweigh the benefits of ReLU network approximation. In this work, we propose a projection-based method to enforce hard constraints and establish a state-dependent Lipschitz continuity property for the optimal MPC cost function, which enables sharp convergence analysis of the closed-loop system. For the first time, we derive explicit bounds on ReLU network width and depth for approximating MPC policies with guaranteed closed-loop performance. To further reduce network complexity and enhance closed-loop performance, we propose a non-uniform error framework with a state-aware scaling function to adaptively adjust both the input and output of the ReLU network. Our contributions provide a foundational step toward certifiable ReLU NN-based MPC.
SDOct 12, 2025
LSZone: A Lightweight Spatial Information Modeling Architecture for Real-time In-car Multi-zone Speech SeparationJun Chen, Shichao Hu, Jiuxin Lin et al.
In-car multi-zone speech separation, which captures voices from different speech zones, plays a crucial role in human-vehicle interaction. Although previous SpatialNet has achieved notable results, its high computational cost still hinders real-time applications in vehicles. To this end, this paper proposes LSZone, a lightweight spatial information modeling architecture for real-time in-car multi-zone speech separation. We design a spatial information extraction-compression (SpaIEC) module that combines Mel spectrogram and Interaural Phase Difference (IPD) to reduce computational burden while maintaining performance. Additionally, to efficiently model spatial information, we introduce an extremely lightweight Conv-GRU crossband-narrowband processing (CNP) module. Experimental results demonstrate that LSZone, with a complexity of 0.56G MACs and a real-time factor (RTF) of 0.37, delivers impressive performance in complex noise and multi-speaker scenarios.
CVMay 10, 2025
ElectricSight: 3D Hazard Monitoring for Power Lines Using Low-Cost SensorsXingchen Li, LiDian Wang, Yu Sheng et al.
Protecting power transmission lines from potential hazards involves critical tasks, one of which is the accurate measurement of distances between power lines and potential threats, such as large cranes. The challenge with this task is that the current sensor-based methods face challenges in balancing accuracy and cost in distance measurement. A common practice is to install cameras on transmission towers, which, however, struggle to measure true 3D distances due to the lack of depth information. Although 3D lasers can provide accurate depth data, their high cost makes large-scale deployment impractical. To address this challenge, we present ElectricSight, a system designed for 3D distance measurement and monitoring of potential hazards to power transmission lines. This work's key innovations lie in both the overall system framework and a monocular depth estimation method. Specifically, the system framework combines real-time images with environmental point cloud priors, enabling cost-effective and precise 3D distance measurements. As a core component of the system, the monocular depth estimation method enhances the performance by integrating 3D point cloud data into image-based estimates, improving both the accuracy and reliability of the system. To assess ElectricSight's performance, we conducted tests with data from a real-world power transmission scenario. The experimental results demonstrate that ElectricSight achieves an average accuracy of 1.08 m for distance measurements and an early warning accuracy of 92%.
CVJun 4, 2024
Research on Driver Facial Fatigue Detection Based on Yolov8 ModelChang Zhou, Yang Zhao, Shaobo Liu et al.
In a society where traffic accidents frequently occur, fatigue driving has emerged as a grave issue. Fatigue driving detection technology, especially those based on the YOLOv8 deep learning model, has seen extensive research and application as an effective preventive measure. This paper discusses in depth the methods and technologies utilized in the YOLOv8 model to detect driver fatigue, elaborates on the current research status both domestically and internationally, and systematically introduces the processing methods and algorithm principles for various datasets. This study aims to provide a robust technical solution for preventing and detecting fatigue driving, thereby contributing significantly to reducing traffic accidents and safeguarding lives.
CVMay 6, 2024
Research on Image Recognition Technology Based on Multimodal Deep LearningJinyin Wang, Xingchen Li, Yixuan Jin et al.
This project investigates the human multi-modal behavior identification algorithm utilizing deep neural networks. According to the characteristics of different modal information, different deep neural networks are used to adapt to different modal video information. Through the integration of various deep neural networks, the algorithm successfully identifies behaviors across multiple modalities. In this project, multiple cameras developed by Microsoft Kinect were used to collect corresponding bone point data based on acquiring conventional images. In this way, the motion features in the image can be extracted. Ultimately, the behavioral characteristics discerned through both approaches are synthesized to facilitate the precise identification and categorization of behaviors. The performance of the suggested algorithm was evaluated using the MSR3D data set. The findings from these experiments indicate that the accuracy in recognizing behaviors remains consistently high, suggesting that the algorithm is reliable in various scenarios. Additionally, the tests demonstrate that the algorithm substantially enhances the accuracy of detecting pedestrian behaviors in video footage.
LGJan 26, 2022
A deep learning method based on patchwise training for reconstructing temperature fieldXingwen Peng, Xingchen Li, Zhiqiang Gong et al.
Physical field reconstruction is highly desirable for the measurement and control of engineering systems. The reconstruction of the temperature field from limited observation plays a crucial role in thermal management for electronic equipment. Deep learning has been employed in physical field reconstruction, whereas the accurate estimation for the regions with large gradients is still diffcult. To solve the problem, this work proposes a novel deep learning method based on patchwise training to reconstruct the temperature field of electronic equipment accurately from limited observation. Firstly, the temperature field reconstruction (TFR) problem of the electronic equipment is modeled mathematically and transformed as an image-to-image regression task. Then a patchwise training and inference framework consisting of an adaptive UNet and a shallow multilayer perceptron (MLP) is developed to establish the mapping from the observation to the temperature field. The adaptive UNet is utilized to reconstruct the whole temperature field while the MLP is designed to predict the patches with large temperature gradients. Experiments employing finite element simulation data are conducted to demonstrate the accuracy of the proposed method. Furthermore, the generalization is evaluated by investigating cases under different heat source layouts, different power intensities, and different observation point locations. The maximum absolute errors of the reconstructed temperature field are less than 1K under the patchwise training approach.
CVOct 20, 2021
Unified Style TransferGuanjie Huang, Hongjian He, Xiang Li et al.
Currently, it is hard to compare and evaluate different style transfer algorithms due to chaotic definitions of style and the absence of agreed objective validation methods in the study of style transfer. In this paper, a novel approach, the Unified Style Transfer (UST) model, is proposed. With the introduction of a generative model for internal style representation, UST can transfer images in two approaches, i.e., Domain-based and Image-based, simultaneously. At the same time, a new philosophy based on the human sense of art and style distributions for evaluating the transfer model is presented and demonstrated, called Statistical Style Analysis. It provides a new path to validate style transfer models' feasibility by validating the general consistency between internal style representation and art facts. Besides, the translation-invariance of AdaIN features is also discussed.
IRMay 26, 2020
Hierarchical Fashion Graph Network for Personalized Outfit RecommendationXingchen Li, Xiang Wang, Xiangnan He et al.
Fashion outfit recommendation has attracted increasing attentions from online shopping services and fashion communities.Distinct from other scenarios (e.g., social networking or content sharing) which recommend a single item (e.g., a friend or picture) to a user, outfit recommendation predicts user preference on a set of well-matched fashion items.Hence, performing high-quality personalized outfit recommendation should satisfy two requirements -- 1) the nice compatibility of fashion items and 2) the consistence with user preference. However, present works focus mainly on one of the requirements and only consider either user-outfit or outfit-item relationships, thereby easily leading to suboptimal representations and limiting the performance. In this work, we unify two tasks, fashion compatibility modeling and personalized outfit recommendation. Towards this end, we develop a new framework, Hierarchical Fashion Graph Network(HFGN), to model relationships among users, items, and outfits simultaneously. In particular, we construct a hierarchical structure upon user-outfit interactions and outfit-item mappings. We then get inspirations from recent graph neural networks, and employ the embedding propagation on such hierarchical graph, so as to aggregate item information into an outfit representation, and then refine a user's representation via his/her historical outfits. Furthermore, we jointly train these two tasks to optimize these representations. To demonstrate the effectiveness of HFGN, we conduct extensive experiments on a benchmark dataset, and HFGN achieves significant improvements over the state-of-the-art compatibility matching models like NGNN and outfit recommenders like FHN.