Jinghan Yu

CV
h-index: 6
4 papers
4 citations
Novelty: 68%
AI Score: 50

4 Papers

82.5 | IT | Mar 19
Pólya Threshold Graphs

Jinghan Yu, Fady Alajaji, Bahman Gharesifard

We introduce the Pólya threshold graph model and derive its stochastic and algebraic properties. This random threshold graph is generated sequentially via a two-color Pólya urn process. Starting from an empty graph, each time step involves a draw from the urn that produces an indicator variable, determining whether a newly added node is universal (connected to all existing nodes and itself) or isolated (connected to no existing nodes). This construction yields a random threshold graph with an adjacency matrix that admits an explicit representation in terms of the draw sequence. Using the structure of the Pólya draw process, we derive the exact degree distribution of an arbitrary node, including its mean and variance. Furthermore, we evaluate a distance-based decay centrality score and provide an explicit expression for its expectation. On the algebraic side, we explicitly characterize the Laplacian matrix of the random threshold graph, obtaining a closed-form description of its spectrum and corresponding eigenbasis. Finally, as an application of these structural results, we analyze discrete-time consensus dynamics on Pólya threshold graphs.
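The sequential construction in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the initial urn composition (`red`, `black`), the reinforcement amount `delta`, and the convention that a red draw marks a universal node are all assumptions made here for concreteness.

```python
import random


def polya_draws(n, red=1, black=1, delta=1, seed=None):
    """Return n indicators (1 = red, 0 = black) from a two-color Pólya urn.

    Assumed urn rule: after each draw the ball is returned together with
    `delta` extra balls of the same color.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        z = 1 if rng.random() < red / (red + black) else 0
        draws.append(z)
        if z:
            red += delta
        else:
            black += delta
    return draws


def threshold_adjacency(draws, self_loops=True):
    """Build the adjacency matrix of the threshold graph from a draw sequence.

    Node t is added at step t. If draws[t] == 1 it is universal (linked to
    every earlier node, and to itself when self_loops is True); otherwise it
    is isolated at the time it is added.
    """
    n = len(draws)
    adj = [[0] * n for _ in range(n)]
    for t, z in enumerate(draws):
        if z:
            for s in range(t):
                adj[s][t] = adj[t][s] = 1
            if self_loops:
                adj[t][t] = 1
    return adj


if __name__ == "__main__":
    draws = polya_draws(8, seed=0)
    adj = threshold_adjacency(draws, self_loops=False)
    print("draws:  ", draws)
    print("degrees:", [sum(row) for row in adj])
```

Because entry (s, t) with s < t equals the indicator drawn at step t, the whole adjacency matrix is determined by the draw sequence, which is the explicit representation the abstract refers to.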

49.6 | MM | Apr 16
Dual-Stream Decoupled Learning for Temporal Consistency and Speaker Interaction in AVSD

Junhao Xiao, Shun Feng, Zhiyu Wu et al.

Audio-Visual Speaker Detection (AVSD) hinges on modeling both individual temporal continuity and inter-personal social context. Existing coupled architectures struggle to reconcile these tasks in shared representation spaces due to conflicting inductive biases: temporal modeling favors low-frequency smoothness, while inter-personal interaction requires high-frequency discriminability. We propose D$^2$Stream, a decoupled dual-stream framework that explicitly isolates these functionalities into parallel, task-specific branches. Specifically, the Intra-speaker Temporal Continuity (ITC) stream captures longitudinal stability, whereas the Inter-personal Social Relation (ISR) stream models transversal social cues. Quantitative gradient analysis reveals an evolutionary divergence in update directions, stabilizing at 86.1°, which confirms the inherent task conflict and the effectiveness of our structural decoupling. D$^2$Stream breaks the long-standing performance plateau, achieving a state-of-the-art 95.6% mAP on AVA-ActiveSpeaker and superior generalization on Columbia ASD, all within a lightweight and efficient design.
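The abstract does not say how the 86.1° figure is computed. One common way to measure the divergence between two update directions is the angle between their flattened gradient vectors, sketched below. The function name `angle_degrees` and the toy vectors are hypothetical and are not taken from the paper.

```python
import math


def angle_degrees(g1, g2):
    """Angle in degrees between two gradient vectors of equal length."""
    if len(g1) != len(g2):
        raise ValueError("gradient vectors must have the same length")
    dot = sum(a * b for a, b in zip(g1, g2))
    norm1 = math.sqrt(sum(a * a for a in g1))
    norm2 = math.sqrt(sum(b * b for b in g2))
    if norm1 == 0.0 or norm2 == 0.0:
        raise ValueError("angle is undefined for a zero vector")
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cosine = max(-1.0, min(1.0, dot / (norm1 * norm2)))
    return math.degrees(math.acos(cosine))


if __name__ == "__main__":
    # Toy gradients standing in for the two streams' update directions.
    temporal_grad = [1.0, 0.0, 0.5]
    social_grad = [0.0, 1.0, 0.1]
    print(angle_degrees(temporal_grad, social_grad))
```

Under this measure, an angle near 90° means the two objectives pull the shared parameters in nearly orthogonal directions, which is the kind of conflict the abstract cites as motivation for separate branches.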

CV | Jan 7
I2E: From Image Pixels to Actionable Interactive Environments for Text-Guided Image Editing

Jinghan Yu, Junhao Xiao, Chenyu Zhu et al.

Existing text-guided image editing methods primarily rely on an end-to-end pixel-level inpainting paradigm. Despite its success in simple scenarios, this paradigm still struggles with compositional editing tasks that require precise local control and complex multi-object spatial reasoning. It is severely limited by 1) the implicit coupling of planning and execution, 2) the lack of object-level control granularity, and 3) the reliance on unstructured, pixel-centric modeling. To address these limitations, we propose I2E, a novel "Decompose-then-Action" paradigm that revisits image editing as an actionable interaction process within a structured environment. I2E utilizes a Decomposer to transform unstructured images into discrete, manipulable object layers and then introduces a physics-aware Vision-Language-Action Agent to parse complex instructions into a series of atomic actions via Chain-of-Thought reasoning. We also construct I2E-Bench, a benchmark designed for multi-instance spatial reasoning and high-precision editing. Experimental results on I2E-Bench and multiple public benchmarks demonstrate that I2E significantly outperforms state-of-the-art methods in handling complex compositional instructions, maintaining physical plausibility, and ensuring multi-turn editing stability.

CV | Aug 5, 2025
MILD: Multi-Layer Diffusion Strategy for Complex and Precise Multi-IP Aware Human Erasing

Jinghan Yu, Junhao Xiao, Zhiyuan Ma et al.

Recent years have witnessed the success of diffusion models in image customization tasks. However, existing mask-guided human erasing methods still struggle in complex scenarios such as human-human occlusion, human-object entanglement, and human-background interference, mainly due to the lack of large-scale multi-instance datasets and effective spatial decoupling to separate foreground from background. To bridge these gaps, we curate the MILD dataset capturing diverse poses, occlusions, and complex multi-instance interactions. We then define the Cross-Domain Attention Gap (CAG), an attention-gap metric to quantify semantic leakage. On top of these, we propose Multi-Layer Diffusion (MILD), which decomposes the generation process into independent denoising pathways, enabling separate reconstruction of each foreground instance and the background. To enhance human-centric understanding, we introduce Human Morphology Guidance, a plug-and-play module that incorporates pose, parsing, and spatial relationships into the diffusion process to improve structural awareness and restoration quality. Additionally, we present Spatially-Modulated Attention, an adaptive mechanism that leverages spatial mask priors to modulate attention across semantic regions, further widening the CAG to effectively minimize boundary artifacts and mitigate semantic leakage. Experiments show that MILD significantly outperforms existing methods. Datasets and code are publicly available at: https://mild-multi-layer-diffusion.github.io/.