CYOct 5, 2022
APGKT: Exploiting Associative Path on Skills Graph for Knowledge TracingHaotian Zhang, Chenyang Bu, Fei Liu et al.
Knowledge tracing (KT) is a fundamental task in educational data mining that mainly focuses on students' dynamic cognitive states of skills. The question-answering process of students can be regarded as a thinking process that considers the following two problems. One problem is which skills are needed to answer the question, and the other is how to use these skills in order. If a student wants to answer a question correctly, the student should not only master the set of skills involved in the question but also think and obtain the associative path on the skills graph. The nodes in the associative path refer to the skills needed and the path shows the order of using them. The associative path is referred to as the skill mode. Thus, obtaining the skill modes is the key to answering questions successfully. However, most existing KT models only focus on a set of skills, without considering the skill modes. We propose a KT model, called APGKT, that exploits skill modes. Specifically, we extract the subgraph topology of the skills involved in the question and combine the difficulty level of the skills to obtain the skill modes via encoding; then, through multi-layer recurrent neural networks, we obtain a student's higher-order cognitive states of skills, which is used to predict the student's future answering performance. Experiments on five benchmark datasets validate the effectiveness of the proposed model.
CLMar 22, 2022
Learning Relation-Specific Representations for Few-shot Knowledge Graph CompletionYuling Li, Kui Yu, Yuhong Zhang et al.
Recent years have witnessed increasing interest in few-shot knowledge graph completion (FKGC), which aims to infer unseen query triples for a few-shot relation using a few reference triples about the relation. The primary focus of existing FKGC methods lies in learning relation representations that can reflect the common information shared by the query and reference triples. To this end, these methods learn entity-pair representations from the direct neighbors of head and tail entities, and then aggregate the representations of reference entity pairs. However, the entity-pair representations learned only from direct neighbors may have low expressiveness when the involved entities have sparse direct neighbors or share a common local neighborhood with other entities. Moreover, merely modeling the semantic information of head and tail entities is insufficient to accurately infer their relational information especially when they have multiple relations. To address these issues, we propose a Relation-Specific Context Learning (RSCL) framework, which exploits graph contexts of triples to learn global and local relation-specific representations for few-shot relations. Specifically, we first extract graph contexts for each triple, which can provide long-term entity-relation dependencies. To encode the extracted graph contexts, we then present a hierarchical attention network to capture contextualized information of triples and highlight valuable local neighborhood information of entities. Finally, we design a hybrid attention aggregator to evaluate the likelihood of the query triples at the global and local levels. Experimental results on two public datasets demonstrate that RSCL outperforms state-of-the-art FKGC methods.
CVDec 4, 2025Code
OmniScaleSR: Unleashing Scale-Controlled Diffusion Prior for Faithful and Realistic Arbitrary-Scale Image Super-ResolutionXinning Chai, Zhengxue Cheng, Yuhong Zhang et al.
Arbitrary-scale super-resolution (ASSR) overcomes the limitation of traditional super-resolution (SR) methods that operate only at fixed scales (e.g., 4x), enabling a single model to handle arbitrary magnification. Most existing ASSR approaches rely on implicit neural representation (INR), but its regression-driven feature extraction and aggregation intrinsically limit the ability to synthesize fine details, leading to low realism. Recent diffusion-based realistic image super-resolution (Real-ISR) models leverage powerful pre-trained diffusion priors and show impressive results at the 4x setting. We observe that they can also achieve ASSR because the diffusion prior implicitly adapts to scale by encouraging high-realism generation. However, without explicit scale control, the diffusion process cannot be properly adjusted for different magnification levels, resulting in excessive hallucination or blurry outputs, especially under ultra-high scales. To address these issues, we propose OmniScaleSR, a diffusion-based realistic arbitrary-scale SR framework designed to achieve both high fidelity and high realism. We introduce explicit, diffusion-native scale control mechanisms that work synergistically with implicit scale adaptation, enabling scale-aware and content-aware modulation of the diffusion process. In addition, we incorporate multi-domain fidelity enhancement designs to further improve reconstruction accuracy. Extensive experiments on bicubic degradation benchmarks and real-world datasets show that OmniScaleSR surpasses state-of-the-art methods in both fidelity and perceptual realism, with particularly strong performance at large magnification factors. Code will be released at https://github.com/chaixinning/OmniScaleSR.
CYOct 18, 2022
DAGKT: Difficulty and Attempts Boosted Graph-based Knowledge TracingRui Luo, Fei Liu, Wenhao Liang et al.
In the field of intelligent education, knowledge tracing (KT) has attracted increasing attention, which estimates and traces students' mastery of knowledge concepts to provide high-quality education. In KT, there are natural graph structures among questions and knowledge concepts so some studies explored the application of graph neural networks (GNNs) to improve the performance of the KT models which have not used graph structure. However, most of them ignored both the questions' difficulties and students' attempts at questions. Actually, questions with the same knowledge concepts have different difficulties, and students' different attempts also represent different knowledge mastery. In this paper, we propose a difficulty and attempts boosted graph-based KT (DAGKT), using rich information from students' records. Moreover, a novel method is designed to establish the question similarity relationship inspired by the F1 score. Extensive experiments on three real-world datasets demonstrate the effectiveness of the proposed DAGKT.
CVJul 4, 2024
Diff-Restorer: Unleashing Visual Prompts for Diffusion-based Universal Image RestorationYuhong Zhang, Hengsheng Zhang, Xinning Chai et al.
Image restoration is a classic low-level problem aimed at recovering high-quality images from low-quality images with various degradations such as blur, noise, rain, haze, etc. However, due to the inherent complexity and non-uniqueness of degradation in real-world images, it is challenging for a model trained for single tasks to handle real-world restoration problems effectively. Moreover, existing methods often suffer from over-smoothing and lack of realism in the restored results. To address these issues, we propose Diff-Restorer, a universal image restoration method based on the diffusion model, aiming to leverage the prior knowledge of Stable Diffusion to remove degradation while generating high perceptual quality restoration results. Specifically, we utilize the pre-trained visual language model to extract visual prompts from degraded images, including semantic and degradation embeddings. The semantic embeddings serve as content prompts to guide the diffusion model for generation. In contrast, the degradation embeddings modulate the Image-guided Control Module to generate spatial priors for controlling the spatial structure of the diffusion process, ensuring faithfulness to the original image. Additionally, we design a Degradation-aware Decoder to perform structural correction and convert the latent code to the pixel domain. We conducted comprehensive qualitative and quantitative analysis on restoration tasks with different degradations, demonstrating the effectiveness and superiority of our approach.
CVJul 4, 2024
SSP-IR: Semantic and Structure Priors for Diffusion-based Realistic Image RestorationYuhong Zhang, Hengsheng Zhang, Zhengxue Cheng et al.
Realistic image restoration is a crucial task in computer vision, and diffusion-based models for image restoration have garnered significant attention due to their ability to produce realistic results. Restoration can be seen as a controllable generation conditioning on priors. However, due to the severity of image degradation, existing diffusion-based restoration methods cannot fully exploit priors from low-quality images and still have many challenges in perceptual quality, semantic fidelity, and structure accuracy. Based on the challenges, we introduce a novel image restoration method, SSP-IR. Our approach aims to fully exploit semantic and structure priors from low-quality images to guide the diffusion model in generating semantically faithful and structurally accurate natural restoration results. Specifically, we integrate the visual comprehension capabilities of Multimodal Large Language Models (explicit) and the visual representations of the original image (implicit) to acquire accurate semantic prior. To extract degradation-independent structure prior, we introduce a Processor with RGB and FFT constraints to extract structure prior from the low-quality images, guiding the diffusion model and preventing the generation of unreasonable artifacts. Lastly, we employ a multi-level attention mechanism to integrate the acquired semantic and structure priors. The qualitative and quantitative results demonstrate that our method outperforms other state-of-the-art methods overall on both synthetic and real-world datasets. Our project page is https://zyhrainbow.github.io/projects/SSP-IR.
CLSep 27, 2023
Integrating LLM, EEG, and Eye-Tracking Biomarker Analysis for Word-Level Neural State Classification in Semantic Inference Reading ComprehensionYuhong Zhang, Qin Li, Sujal Nahata et al.
With the recent proliferation of large language models (LLMs), such as Generative Pre-trained Transformers (GPT), there has been a significant shift in exploring human and machine comprehension of semantic language meaning. This shift calls for interdisciplinary research that bridges cognitive science and natural language processing (NLP). This pilot study aims to provide insights into individuals' neural states during a semantic relation reading-comprehension task. We propose jointly analyzing LLMs, eye-gaze, and electroencephalographic (EEG) data to study how the brain processes words with varying degrees of relevance to a keyword during reading. We also use a feature engineering approach to improve the fixation-related EEG data classification while participants read words with high versus low relevance to the keyword. The best validation accuracy in this word-level classification is over 60\% across 12 subjects. Words of high relevance to the inference keyword had significantly more eye fixations per word: 1.0584 compared to 0.6576 when excluding no-fixation words, and 1.5126 compared to 1.4026 when including them. This study represents the first attempt to classify brain states at a word level using LLM knowledge. It provides valuable insights into human cognitive abilities and the realm of Artificial General Intelligence (AGI), and offers guidance for developing potential reading-assisted technologies.
CVJul 27, 2025Code
AnimeColor: Reference-based Animation Colorization with Diffusion TransformersYuhong Zhang, Liyao Wang, Han Wang et al.
Animation colorization plays a vital role in animation production, yet existing methods struggle to achieve color accuracy and temporal consistency. To address these challenges, we propose \textbf{AnimeColor}, a novel reference-based animation colorization framework leveraging Diffusion Transformers (DiT). Our approach integrates sketch sequences into a DiT-based video diffusion model, enabling sketch-controlled animation generation. We introduce two key components: a High-level Color Extractor (HCE) to capture semantic color information and a Low-level Color Guider (LCG) to extract fine-grained color details from reference images. These components work synergistically to guide the video diffusion process. Additionally, we employ a multi-stage training strategy to maximize the utilization of reference image color information. Extensive experiments demonstrate that AnimeColor outperforms existing methods in color accuracy, sketch alignment, temporal consistency, and visual quality. Our framework not only advances the state of the art in animation colorization but also provides a practical solution for industrial applications. The code will be made publicly available at \href{https://github.com/IamCreateAI/AnimeColor}{https://github.com/IamCreateAI/AnimeColor}.
CVNov 21, 2024
DINO-X: A Unified Vision Model for Open-World Object Detection and UnderstandingTianhe Ren, Yihao Chen, Qing Jiang et al.
In this paper, we introduce DINO-X, which is a unified object-centric vision model developed by IDEA Research with the best open-world object detection performance to date. DINO-X employs the same Transformer-based encoder-decoder architecture as Grounding DINO 1.5 to pursue an object-level representation for open-world object understanding. To make long-tailed object detection easy, DINO-X extends its input options to support text prompt, visual prompt, and customized prompt. With such flexible prompt options, we develop a universal object prompt to support prompt-free open-world detection, making it possible to detect anything in an image without requiring users to provide any prompt. To enhance the model's core grounding capability, we have constructed a large-scale dataset with over 100 million high-quality grounding samples, referred to as Grounding-100M, for advancing the model's open-vocabulary detection performance. Pre-training on such a large-scale grounding dataset leads to a foundational object-level representation, which enables DINO-X to integrate multiple perception heads to simultaneously support multiple object perception and understanding tasks, including detection, segmentation, pose estimation, object captioning, object-based QA, etc. Experimental results demonstrate the superior performance of DINO-X. Specifically, the DINO-X Pro model achieves 56.0 AP, 59.8 AP, and 52.4 AP on the COCO, LVIS-minival, and LVIS-val zero-shot object detection benchmarks, respectively. Notably, it scores 63.3 AP and 56.5 AP on the rare classes of LVIS-minival and LVIS-val benchmarks, improving the previous SOTA performance by 5.8 AP and 5.0 AP. Such a result underscores its significantly improved capacity for recognizing long-tailed objects.
HCFeb 3
"Help Me, But Don't Watch Me": Intervention Timing and Privacy Boundaries for Process-Aware AI TutorsJane Hanqi Li, Yuhong Zhang, Jiaqi Liu et al.
The use of generative AI (genAI) tools as informal tutors is becoming increasingly prevalent among secondary school students in mathematics learning. In many schools, individualized instructional support is limited, and one-on-one human tutoring remains costly in most learning contexts. GenAI has the potential to provide timely, on-demand help to students when teachers or tutors are not available. However, there are still few studies that examine students' preferences for AI tutor support that enhances autonomous learning. We investigated learner expectations for AI tutoring through a survey with secondary school students in China (Grades 7-11; N=330). Students generally preferred support that preserves learner autonomy (e.g., time to think, hints over direct answers), expressed mixed or cautious preferences between human and AI tutors, and held nuanced views of proactive intervention, valuing adaptivity but also worrying about annoyance and autonomy. Privacy boundaries were uneven: many accepted sharing problem steps and error patterns, while willingness dropped for more sensitive signals such as attention or behavior. Our findings offer learner-centered insights for designing AI tutors that balance timely intervention with student agency, and personalization with perceived boundaries in a K-12 context.
CVJan 9, 2025
Motion-X++: A Large-Scale Multimodal 3D Whole-body Human Motion DatasetYuhong Zhang, Jing Lin, Ailing Zeng et al. · tsinghua
In this paper, we introduce Motion-X++, a large-scale multimodal 3D expressive whole-body human motion dataset. Existing motion datasets predominantly capture body-only poses, lacking facial expressions, hand gestures, and fine-grained pose descriptions, and are typically limited to lab settings with manually labeled text descriptions, thereby restricting their scalability. To address this issue, we develop a scalable annotation pipeline that can automatically capture 3D whole-body human motion and comprehensive textural labels from RGB videos and build the Motion-X dataset comprising 81.1K text-motion pairs. Furthermore, we extend Motion-X into Motion-X++ by improving the annotation pipeline, introducing more data modalities, and scaling up the data quantities. Motion-X++ provides 19.5M 3D whole-body pose annotations covering 120.5K motion sequences from massive scenes, 80.8K RGB videos, 45.3K audios, 19.5M frame-level whole-body pose descriptions, and 120.5K sequence-level semantic labels. Comprehensive experiments validate the accuracy of our annotation pipeline and highlight Motion-X++'s significant benefits for generating expressive, precise, and natural motion with paired multimodal labels supporting several downstream tasks, including text-driven whole-body motion generation,audio-driven motion generation, 3D whole-body human mesh recovery, and 2D whole-body keypoints estimation, etc.
ROApr 27
$M^2$-VLA: Boosting Vision-Language Models for Generalizable Manipulation via Layer Mixture and Meta-SkillsSiyao Xiao, Yuhong Zhang, Zhifang Liu et al.
Current Vision-Language-Action (VLA) models predominantly rely on end-to-end fine-tuning. While effective, this paradigm compromises the inherent generalization capabilities of Vision-Language Models (VLMs) and incurs catastrophic forgetting. To address these limitations, we propose $M^2$-VLA, which demonstrates that a generalized VLM is able to serve as a powerful backbone for robotic manipulation directly. However, it remains a key challenge to bridge the gap between the high-level semantic understanding of VLMs and the precise requirements of robotic control. To overcome this, we introduce the Mixture of Layers (MoL) strategy that selectively extracts task-critical information from dense semantic features. Furthermore, to facilitate efficient trajectory learning under constrained model capacity, we propose a Meta Skill Module (MSM) that integrates strong inductive biases. Extensive experiments in both simulated and real-world environments demonstrate the effectiveness of our approach. Furthermore, generalization and ablation studies validate the architecture's zero-shot capabilities and confirm the contribution of each key component. Our code and pre-trained models will be made publicly available.
CVMar 10, 2025
HumanMM: Global Human Motion Recovery from Multi-shot VideosYuhong Zhang, Guanlin Wu, Ling-Hao Chen et al.
In this paper, we present a novel framework designed to reconstruct long-sequence 3D human motion in the world coordinates from in-the-wild videos with multiple shot transitions. Such long-sequence in-the-wild motions are highly valuable to applications such as motion generation and motion understanding, but are of great challenge to be recovered due to abrupt shot transitions, partial occlusions, and dynamic backgrounds presented in such videos. Existing methods primarily focus on single-shot videos, where continuity is maintained within a single camera view, or simplify multi-shot alignment in camera space only. In this work, we tackle the challenges by integrating an enhanced camera pose estimation with Human Motion Recovery (HMR) by incorporating a shot transition detector and a robust alignment module for accurate pose and orientation continuity across shots. By leveraging a custom motion integrator, we effectively mitigate the problem of foot sliding and ensure temporal consistency in human pose. Extensive evaluations on our created multi-shot dataset from public 3D human datasets demonstrate the robustness of our method in reconstructing realistic human motion in world coordinates.
CVApr 25, 2024
Multimodal Semantic-Aware Automatic Colorization with Diffusion PriorHan Wang, Xinning Chai, Yiwen Wang et al.
Colorizing grayscale images offers an engaging visual experience. Existing automatic colorization methods often fail to generate satisfactory results due to incorrect semantic colors and unsaturated colors. In this work, we propose an automatic colorization pipeline to overcome these challenges. We leverage the extraordinary generative ability of the diffusion prior to synthesize color with plausible semantics. To overcome the artifacts introduced by the diffusion prior, we apply the luminance conditional guidance. Moreover, we adopt multimodal high-level semantic priors to help the model understand the image content and deliver saturated colors. Besides, a luminance-aware decoder is designed to restore details and enhance overall visual quality. The proposed pipeline synthesizes saturated colors while maintaining plausible semantics. Experiments indicate that our proposed method considers both diversity and fidelity, surpassing previous methods in terms of perceptual realism and gain most human preference.
CVJan 31, 2025
Consistent Video Colorization via Palette GuidanceHan Wang, Yuang Zhang, Yuhong Zhang et al.
Colorization is a traditional computer vision task and it plays an important role in many time-consuming tasks, such as old film restoration. Existing methods suffer from unsaturated color and temporally inconsistency. In this paper, we propose a novel pipeline to overcome the challenges. We regard the colorization task as a generative task and introduce Stable Video Diffusion (SVD) as our base model. We design a palette-based color guider to assist the model in generating vivid and consistent colors. The color context introduced by the palette not only provides guidance for color generation, but also enhances the stability of the generated colors through a unified color context across multiple sequences. Experiments demonstrate that the proposed method can provide vivid and stable colors for videos, surpassing previous methods.
CVSep 25, 2025
FreeInsert: Personalized Object Insertion with Geometric and Style ControlYuhong Zhang, Han Wang, Yiwen Wang et al.
Text-to-image diffusion models have made significant progress in image generation, allowing for effortless customized generation. However, existing image editing methods still face certain limitations when dealing with personalized image composition tasks. First, there is the issue of lack of geometric control over the inserted objects. Current methods are confined to 2D space and typically rely on textual instructions, making it challenging to maintain precise geometric control over the objects. Second, there is the challenge of style consistency. Existing methods often overlook the style consistency between the inserted object and the background, resulting in a lack of realism. In addition, the challenge of inserting objects into images without extensive training remains significant. To address these issues, we propose \textit{FreeInsert}, a novel training-free framework that customizes object insertion into arbitrary scenes by leveraging 3D geometric information. Benefiting from the advances in existing 3D generation models, we first convert the 2D object into 3D, perform interactive editing at the 3D level, and then re-render it into a 2D image from a specified view. This process introduces geometric controls such as shape or view. The rendered image, serving as geometric control, is combined with style and content control achieved through diffusion adapters, ultimately producing geometrically controlled, style-consistent edited images via the diffusion model.
LGSep 1, 2025
IMU-Enhanced EEG Motion Artifact Removal with Fine-Tuned Large Brain ModelsYuhong Zhang, Xusheng Zhu, Yuchen Xu et al.
Electroencephalography (EEG) is a non-invasive method for measuring brain activity with high temporal resolution; however, EEG signals often exhibit low signal-to-noise ratios because of contamination from physiological and environmental artifacts. One of the major challenges hindering the real-world deployment of brain-computer interfaces (BCIs) involves the frequent occurrence of motion-related EEG artifacts. Most prior studies on EEG motion artifact removal rely on single-modality approaches, such as Artifact Subspace Reconstruction (ASR) and Independent Component Analysis (ICA), without incorporating simultaneously recorded modalities like inertial measurement units (IMUs), which directly capture the extent and dynamics of motion. This work proposes a fine-tuned large brain model (LaBraM)-based correlation attention mapping method that leverages spatial channel relationships in IMU data to identify motion-related artifacts in EEG signals. The fine-tuned model contains approximately 9.2 million parameters and uses 5.9 hours of EEG and IMU recordings for training, just 0.2346\% of the 2500 hours used to train the base model. We compare our results against the established ASR-ICA benchmark across varying time scales and motion activities, showing that incorporating IMU reference signals significantly improves robustness under diverse motion scenarios.
CVAug 18, 2025
Motion2Motion: Cross-topology Motion Transfer with Sparse CorrespondenceLing-Hao Chen, Yuhong Zhang, Zixin Yin et al.
This work studies the challenge of transfer animations between characters whose skeletal topologies differ substantially. While many techniques have advanced retargeting techniques in decades, transfer motions across diverse topologies remains less-explored. The primary obstacle lies in the inherent topological inconsistency between source and target skeletons, which restricts the establishment of straightforward one-to-one bone correspondences. Besides, the current lack of large-scale paired motion datasets spanning different topological structures severely constrains the development of data-driven approaches. To address these limitations, we introduce Motion2Motion, a novel, training-free framework. Simply yet effectively, Motion2Motion works with only one or a few example motions on the target skeleton, by accessing a sparse set of bone correspondences between the source and target skeletons. Through comprehensive qualitative and quantitative evaluations, we demonstrate that Motion2Motion achieves efficient and reliable performance in both similar-skeleton and cross-species skeleton transfer scenarios. The practical utility of our approach is further evidenced by its successful integration in downstream applications and user interfaces, highlighting its potential for industrial applications. Code and data are available at https://lhchen.top/Motion2Motion.
CVAug 1, 2025
Semantic and Temporal Integration in Latent Diffusion Space for High-Fidelity Video Super-ResolutionYiwen Wang, Xinning Chai, Yuhong Zhang et al.
Recent advancements in video super-resolution (VSR) models have demonstrated impressive results in enhancing low-resolution videos. However, due to limitations in adequately controlling the generation process, achieving high fidelity alignment with the low-resolution input while maintaining temporal consistency across frames remains a significant challenge. In this work, we propose Semantic and Temporal Guided Video Super-Resolution (SeTe-VSR), a novel approach that incorporates both semantic and temporal-spatio guidance in the latent diffusion space to address these challenges. By incorporating high-level semantic information and integrating spatial and temporal information, our approach achieves a seamless balance between recovering intricate details and ensuring temporal coherence. Our method not only preserves high-reality visual content but also significantly enhances fidelity. Extensive experiments demonstrate that SeTe-VSR outperforms existing methods in terms of detail recovery and perceptual quality, highlighting its effectiveness for complex video super-resolution tasks.
CLJul 16, 2025
Graph Representations for Reading Comprehension Analysis using Large Language Model and Eye-Tracking BiomarkerYuhong Zhang, Jialu Li, Shilai Yang et al.
Reading comprehension is a fundamental skill in human cognitive development. With the advancement of Large Language Models (LLMs), there is a growing need to compare how humans and LLMs understand language across different contexts and apply this understanding to functional tasks such as inference, emotion interpretation, and information retrieval. Our previous work used LLMs and human biomarkers to study the reading comprehension process. The results showed that the biomarkers corresponding to words with high and low relevance to the inference target, as labeled by the LLMs, exhibited distinct patterns, particularly when validated using eye-tracking data. However, focusing solely on individual words limited the depth of understanding, which made the conclusions somewhat simplistic despite their potential significance. This study used an LLM-based AI agent to group words from a reading passage into nodes and edges, forming a graph-based text representation based on semantic meaning and question-oriented prompts. We then compare the distribution of eye fixations on important nodes and edges. Our findings indicate that LLMs exhibit high consistency in language understanding at the level of graph topological structure. These results build on our previous findings and offer insights into effective human-AI co-learning strategies.
CVJun 9, 2025
NOVA3D: Normal Aligned Video Diffusion Model for Single Image to 3D GenerationYuxiao Yang, Peihao Li, Yuhong Zhang et al.
3D AI-generated content (AIGC) has made it increasingly accessible for anyone to become a 3D content creator. While recent methods leverage Score Distillation Sampling to distill 3D objects from pretrained image diffusion models, they often suffer from inadequate 3D priors, leading to insufficient multi-view consistency. In this work, we introduce NOVA3D, an innovative single-image-to-3D generation framework. Our key insight lies in leveraging strong 3D priors from a pretrained video diffusion model and integrating geometric information during multi-view video fine-tuning. To facilitate information exchange between color and geometric domains, we propose the Geometry-Temporal Alignment (GTA) attention mechanism, thereby improving generalization and multi-view consistency. Moreover, we introduce the de-conflict geometry fusion algorithm, which improves texture fidelity by addressing multi-view inaccuracies and resolving discrepancies in pose alignment. Extensive experiments validate the superiority of NOVA3D over existing baselines.
LGDec 17, 2024
Towards Effective Graph Rationalization via Boosting Environment DiversityYujie Wang, Kui Yu, Yuhong Zhang et al.
Graph Neural Networks (GNNs) perform effectively when training and testing graphs are drawn from the same distribution, but struggle to generalize well in the face of distribution shifts. To address this issue, existing mainstreaming graph rationalization methods first identify rationale and environment subgraphs from input graphs, and then diversify training distributions by augmenting the environment subgraphs. However, these methods merely combine the learned rationale subgraphs with environment subgraphs in the representation space to produce augmentation samples, failing to produce sufficiently diverse distributions. Thus, in this paper, we propose to achieve an effective Graph Rationalization by Boosting Environmental diversity, a GRBE approach that generates the augmented samples in the original graph space to improve the diversity of the environment subgraph. Firstly, to ensure the effectiveness of augmentation samples, we propose a precise rationale subgraph extraction strategy in GRBE to refine the rationale subgraph learning process in the original graph space. Secondly, to ensure the diversity of augmented samples, we propose an environment diversity augmentation strategy in GRBE that mixes the environment subgraphs of different graphs in the original graph space and then combines the new environment subgraphs with rationale subgraphs to generate augmented graphs. The average improvements of 7.65% and 6.11% in rationalization and classification performance on benchmark datasets demonstrate the superiority of GRBE over state-of-the-art approaches.
AIAug 30, 2021
A Temporal Knowledge Graph Completion Method Based on Balanced Timestamp DistributionKangzheng Liu, Yuhong Zhang
Completion through the embedding representation of the knowledge graph (KGE) has been a research hotspot in recent years. Realistic knowledge graphs are mostly related to time, while most of the existing KGE algorithms ignore the time information. A few existing methods directly or indirectly encode the time information, ignoring the balance of timestamp distribution, which greatly limits the performance of temporal knowledge graph completion (KGC). In this paper, a temporal KGC method is proposed based on the direct encoding time information framework, and a given time slice is treated as the finest granularity for balanced timestamp distribution. A large number of experiments on temporal knowledge graph datasets extracted from the real world demonstrate the effectiveness of our method.
LGAug 4, 2019
Semi-supervised representation learning via dual autoencoders for domain adaptationShuai Yang, Hao Wang, Yuhong Zhang et al.
Domain adaptation aims to exploit the knowledge in source domain to promote the learning tasks in target domain, which plays a critical role in real-world applications. Recently, lots of deep learning approaches based on autoencoders have achieved a significance performance in domain adaptation. However, most existing methods focus on minimizing the distribution divergence by putting the source and target data together to learn global feature representations, while they do not consider the local relationship between instances in the same category from different domains. To address this problem, we propose a novel Semi-Supervised Representation Learning framework via Dual Autoencoders for domain adaptation, named SSRLDA. More specifically, we extract richer feature representations by learning the global and local feature representations simultaneously using two novel autoencoders, which are referred to as marginalized denoising autoencoder with adaptation distribution (MDAad) and multi-class marginalized denoising autoencoder (MMDA) respectively. Meanwhile, we make full use of label information to optimize feature representations. Experimental results show that our proposed approach outperforms several state-of-the-art baseline methods.