CVJun 2Code
MemoGen: Can Past Experience Improve Future Text-to-Image Generation?Wenshuo Chen, Kuimou Yu, Bowen Tian et al.
Modern text-to-image models have achieved strong visual synthesis, yet remain unreliable when prompts require implicit visual constraints, relational reasoning, or external knowledge. Existing retrieval-augmented and agentic generation methods mitigate this issue by acquiring external knowledge, references, or refined prompts for the current request, yet they typically treat each generation as an isolated episode and do not systematically preserve past successes or failures for future use. In this work, we ask whether a text-to-image system can continually improve from its own generation experience without updating the underlying generator. We propose MemoGen, a training-free framework that augments existing image generators with an agentic evolution layer. For each task, MemoGen explicitly infers visual requirements, retrieves external evidence and references when necessary, translates them into executable generation constraints, evaluates the generated result, and stores task understanding, reference choices, visual feedback, successful strategies, and failure lessons as reusable experience memory. Across evolution rounds, the agent retrieves relevant experience to improve similar future generations, selectively repairing previously failed cases while preserving successful ones, thereby enabling test-time self-evolution without parameter updates. Extensive experiments on knowledge-intensive and reasoning-oriented benchmarks demonstrate the effectiveness of this paradigm: after only two evolution rounds, MemoGen built upon the open-source Qwen-Image backbone surpasses strong proprietary systems such as Nano Banana Pro and GPT-Image-1 on WISE and Mind-Bench, showing that explicit experience memory can serve as a powerful continual learning signal for reliable text-to-image generation.
LGMay 20Code
Learning to Think in Physics: Breaking Shortcut Learning in Scientific Diffusion via Representation AlignmentHaozhe Jia, Pengyu Yin, Wenshuo Chen et al.
Physics-informed diffusion models typically enforce PDE constraints only on final outputs, leaving intermediate representations unconstrained and prone to shortcut learning under shifted boundary conditions. We introduce **REPA-P**, a teacher-free, architecture-agnostic framework that aligns intermediate features with physical states using first-principles residuals. REPA-P attaches lightweight $1{\times}1$ projection heads to selected layers, decodes hidden activations into physical quantities, and applies PDE residual losses during training. These heads are discarded at inference, introducing **zero overhead**. Across four PDE tasks, including Darcy flow, topology optimization, electrostatic potential, and turbulent channel flow, REPA-P accelerates convergence by up to $2{\times}$, reduces physics residuals by up to $66.4\%$, and improves out-of-distribution robustness by up to $49.3\%$, with consistent gains on both U-Net and Diffusion Transformer backbones. Ablations show that supervising a small set of intermediate layers captures most benefits and complements output-level physics losses. Code is available at [https://github.com/Hxxxz0/REPA-P](https://github.com/Hxxxz0/REPA-P).
CVDec 28, 2025
Guided Path Sampling: Steering Diffusion Models Back on Track with Principled Path GuidanceHaosen Li, Wenshuo Chen, Shaofeng Liang et al.
Iterative refinement methods based on a denoising-inversion cycle are powerful tools for enhancing the quality and control of diffusion models. However, their effectiveness is critically limited when combined with standard Classifier-Free Guidance (CFG). We identify a fundamental limitation: CFG's extrapolative nature systematically pushes the sampling path off the data manifold, causing the approximation error to diverge and undermining the refinement process. To address this, we propose Guided Path Sampling (GPS), a new paradigm for iterative refinement. GPS replaces unstable extrapolation with a principled, manifold-constrained interpolation, ensuring the sampling path remains on the data manifold. We theoretically prove that this correction transforms the error series from unbounded amplification to strictly bounded, guaranteeing stability. Furthermore, we devise an optimal scheduling strategy that dynamically adjusts guidance strength, aligning semantic injection with the model's natural coarse-to-fine generation process. Extensive experiments on modern backbones like SDXL and Hunyuan-DiT show that GPS outperforms existing methods in both perceptual quality and complex prompt adherence. For instance, GPS achieves a superior ImageReward of 0.79 and HPS v2 of 0.2995 on SDXL, while improving overall semantic alignment accuracy on GenEval to 57.45%. Our work establishes that path stability is a prerequisite for effective iterative refinement, and GPS provides a robust framework to achieve it.
CVDec 19, 2024Code
DCTdiff: Intriguing Properties of Image Generative Modeling in the DCT SpaceMang Ning, Mingxiao Li, Jianlin Su et al.
This paper explores image modeling from the frequency space and introduces DCTdiff, an end-to-end diffusion generative paradigm that efficiently models images in the discrete cosine transform (DCT) space. We investigate the design space of DCTdiff and reveal the key design factors. Experiments on different frameworks (UViT, DiT), generation tasks, and various diffusion samplers demonstrate that DCTdiff outperforms pixel-based diffusion models regarding generative quality and training efficiency. Remarkably, DCTdiff can seamlessly scale up to 512$\times$512 resolution without using the latent diffusion paradigm and beats latent diffusion (using SD-VAE) with only 1/4 training cost. Finally, we illustrate several intriguing properties of DCT image modeling. For example, we provide a theoretical proof of why 'image diffusion can be seen as spectral autoregression', bridging the gap between diffusion and autoregressive models. The effectiveness of DCTdiff and the introduced properties suggest a promising direction for image modeling in the frequency space. The code is https://github.com/forever208/DCTdiff.
CVMar 17
ECHO: Edge-Cloud Humanoid Orchestration for Language-to-Motion ControlHaozhe Jia, Jianfei Song, Yuan Zhang et al.
We present ECHO, an edge--cloud framework for language-driven whole-body control of humanoid robots. A cloud-hosted diffusion-based text-to-motion generator synthesizes motion references from natural language instructions, while an edge-deployed reinforcement-learning tracker executes them in closed loop on the robot. The two modules are bridged by a compact, robot-native 38-dimensional motion representation that encodes joint angles, root planar velocity, root height, and a continuous 6D root orientation per frame, eliminating inference-time retargeting from human body models and remaining directly compatible with low-level PD control. The generator adopts a 1D convolutional UNet with cross-attention conditioned on CLIP-encoded text features; at inference, DDIM sampling with 10 denoising steps and classifier-free guidance produces motion sequences in approximately one second on a cloud GPU. The tracker follows a Teacher--Student paradigm: a privileged teacher policy is distilled into a lightweight student equipped with an evidential adaptation module for sim-to-real transfer, further strengthened by morphological symmetry constraints and domain randomization. An autonomous fall recovery mechanism detects falls via onboard IMU readings and retrieves recovery trajectories from a pre-built motion library. We evaluate ECHO on a retargeted HumanML3D benchmark, where it achieves strong generation quality (FID 0.029, R-Precision Top-1 0.686) under a unified robot-domain evaluator, while maintaining high motion safety and trajectory consistency. Real-world experiments on a Unitree G1 humanoid demonstrate stable execution of diverse text commands with zero hardware fine-tuning.
ROMay 14
Before the Body Moves: Learning Anticipatory Joint Intent for Language-Conditioned Humanoid ControlHaozhe Jia, Honglei Jin, Yuan Zhang et al.
Natural language is an intuitive interface for humanoid robots, yet streaming whole-body control requires control representations that are executable now and anticipatory of future physical transitions. Existing language-conditioned humanoid systems typically generate kinematic references that a low-level tracker must repair reactively, or use latent/action policies whose outputs do not explicitly encode upcoming contact changes, support transfers, and balance preparation. We propose \textbf{DAJI} (\emph{Dynamics-Aligned Joint Intent}), a hierarchical framework that learns an anticipatory joint-intent interface between language generation and closed-loop control. DAJI-Act distills a future-aware teacher into a deployable diffusion action policy through student-driven rollouts, while DAJI-Flow autoregressively generates future intent chunks from language and intent history. Experiments show that DAJI achieves strong results in anticipatory latent learning, single-instruction generation, and streaming instruction following, reaching 94.42\% rollout success on HumanML3D-style generation and 0.152 subsequence FID on BABEL.
CVOct 10, 2025Code
RadioFlow: Efficient Radio Map Construction Framework with Flow MatchingHaozhe Jia, Wenshuo Chen, Xiucheng Wang et al.
Accurate and real-time radio map (RM) generation is crucial for next-generation wireless systems, yet diffusion-based approaches often suffer from large model sizes, slow iterative denoising, and high inference latency, which hinder practical deployment. To overcome these limitations, we propose \textbf{RadioFlow}, a novel flow-matching-based generative framework that achieves high-fidelity RM generation through single-step efficient sampling. Unlike conventional diffusion models, RadioFlow learns continuous transport trajectories between noise and data, enabling both training and inference to be significantly accelerated while preserving reconstruction accuracy. Comprehensive experiments demonstrate that RadioFlow achieves state-of-the-art performance with \textbf{up to 8$\times$ fewer parameters} and \textbf{over 4$\times$ faster inference} compared to the leading diffusion-based baseline (RadioDiff). This advancement provides a promising pathway toward scalable, energy-efficient, and real-time electromagnetic digital twins for future 6G networks. We release the code at \href{https://github.com/Hxxxz0/RadioFlow}{GitHub}.
CVApr 26
Oracle Noise: Faster Semantic Spherical Alignment for Interpretable Latent OptimizationHaosen Li, Wenshuo Chen, Lei Wang et al.
Text-to-image diffusion models have achieved remarkable generative capabilities, yet accurately aligning complex textual prompts with synthesized layouts remains an ongoing challenge. In these models, the initial Gaussian noise acts as a critical structural seed dictating the macroscopic layout. Recent online optimization and search methods attempt to refine this noise to enhance text-image alignment. However, relying on unconstrained Euclidean gradient ascent mathematically inflates the latent norm and destroys the standard Gaussian prior, causing severe visual artifacts like color over-saturation. Furthermore, these methods suffer from inefficient semantic routing and easily fall into the ``reward hacking'' trap of external proxy models. To address these intertwined bottlenecks, we propose Oracle Noise, a zero-shot framework reframing noise initialization as semantic-driven optimization strictly confined to a Riemannian hypersphere. Instead of relying on complex external parsers, we directly identify the most impactful structural words in the prompt to efficiently route optimization energy. By updating the noise strictly along a spherical path, we mathematically preserve the original Gaussian distribution. This geometric constraint eliminates norm inflation and unlocks aggressive step sizes for rapid convergence. Extensive experiments demonstrate that Oracle Noise significantly accelerates semantic alignment and achieves superior aesthetics without black-box models. It completely mitigates Euclidean-induced degradation, establishing state-of-the-art performance across human preference metrics (e.g., HPSv2, ImageReward), semantic alignment (CLIP Score), and sample diversity, all within a strict 2-second optimization budget.
CVMar 12, 2024
Block-wise LoRA: Revisiting Fine-grained LoRA for Effective Personalization and Stylization in Text-to-Image GenerationLikun Li, Haoqi Zeng, Changpeng Yang et al.
The objective of personalization and stylization in text-to-image is to instruct a pre-trained diffusion model to analyze new concepts introduced by users and incorporate them into expected styles. Recently, parameter-efficient fine-tuning (PEFT) approaches have been widely adopted to address this task and have greatly propelled the development of this field. Despite their popularity, existing efficient fine-tuning methods still struggle to achieve effective personalization and stylization in T2I generation. To address this issue, we propose block-wise Low-Rank Adaptation (LoRA) to perform fine-grained fine-tuning for different blocks of SD, which can generate images faithful to input prompts and target identity and also with desired style. Extensive experiments demonstrate the effectiveness of the proposed method.
CVJun 3, 2025
ANT: Adaptive Neural Temporal-Aware Text-to-Motion ModelWenshuo Chen, Kuimou Yu, Haozhe Jia et al.
While diffusion models advance text-to-motion generation, their static semantic conditioning ignores temporal-frequency demands: early denoising requires structural semantics for motion foundations while later stages need localized details for text alignment. This mismatch mirrors biological morphogenesis where developmental phases demand distinct genetic programs. Inspired by epigenetic regulation governing morphological specialization, we propose **(ANT)**, an **A**daptive **N**eural **T**emporal-Aware architecture. ANT orchestrates semantic granularity through: **(i) Semantic Temporally Adaptive (STA) Module:** Automatically partitions denoising into low-frequency structural planning and high-frequency refinement via spectral analysis. **(ii) Dynamic Classifier-Free Guidance scheduling (DCFG):** Adaptively adjusts conditional to unconditional ratio enhancing efficiency while maintaining fidelity. Extensive experiments show that ANT can be applied to various baselines, significantly improving model performance, and achieving state-of-the-art semantic alignment on StableMoFusion.
CVDec 11, 2023
DisControlFace: Adding Disentangled Control to Diffusion Autoencoder for One-shot Explicit Facial Image EditingHaozhe Jia, Yan Li, Hengfei Cui et al.
In this work, we focus on exploring explicit fine-grained control of generative facial image editing, all while generating faithful facial appearances and consistent semantic details, which however, is quite challenging and has not been extensively explored, especially under an one-shot scenario. We identify the key challenge as the exploration of disentangled conditional control between high-level semantics and explicit parameters (e.g., 3DMM) in the generation process, and accordingly propose a novel diffusion-based editing framework, named DisControlFace. Specifically, we leverage a Diffusion Autoencoder (Diff-AE) as the semantic reconstruction backbone. To enable explicit face editing, we construct an Exp-FaceNet that is compatible with Diff-AE to generate spatial-wise explicit control conditions based on estimated 3DMM parameters. Different from current diffusion-based editing methods that train the whole conditional generative model from scratch, we freeze the pre-trained weights of the Diff-AE to maintain its semantically deterministic conditioning capability and accordingly propose a random semantic masking (RSM) strategy to effectively achieve an independent training of Exp-FaceNet. This setting endows the model with disentangled face control meanwhile reducing semantic information shift in editing. Our model can be trained using 2D in-the-wild portrait images without requiring 3D or video data and perform robust editing on any new facial image through a simple one-shot fine-tuning. Comprehensive experiments demonstrate that DisControlFace can generate realistic facial images with better editing accuracy and identity preservation over state-of-the-art methods. Project page: https://discontrolface.github.io/
CVJan 30, 2025
Free-T2M: Robust Text-to-Motion Generation for Humanoid Robots via Frequency-DomainWenshuo Chen, Haozhe Jia, Songning Lai et al.
Enabling humanoid robots to synthesize complex, physically coherent motions from natural language commands is a cornerstone of autonomous robotics and human-robot interaction. While diffusion models have shown promise in this text-to-motion (T2M) task, they often generate semantically flawed or unstable motions, limiting their applicability to real-world robots. This paper reframes the T2M problem from a frequency-domain perspective, revealing that the generative process mirrors a hierarchical control paradigm. We identify two critical phases: a semantic planning stage, where low-frequency components establish the global motion trajectory, and a fine-grained execution stage, where high-frequency details refine the movement. To address the distinct challenges of each phase, we introduce Frequency enhanced text-to-motion (Free-T2M), a framework incorporating stage-specific frequency-domain consistency alignment. We design a frequency-domain temporal-adaptive module to modulate the alignment effects of different frequency bands. These designs enforce robustness in the foundational semantic plan and enhance the accuracy of detailed execution. Extensive experiments show our method dramatically improves motion quality and semantic correctness. Notably, when applied to the StableMoFusion baseline, Free-T2M reduces the FID from 0.152 to 0.060, establishing a new state-of-the-art within diffusion architectures. These findings underscore the critical role of frequency-domain insights for generating robust and reliable motions, paving the way for more intuitive natural language control of robots.
CVJan 31, 2025
Physics-Informed Representation Alignment for Sparse Radio-Map ReconstructionHaozhe Jia, Wenshuo Chen, Zhihui Huang et al.
Radio map reconstruction is essential for enabling advanced applications, yet challenges such as complex signal propagation and sparse observational data hinder accurate reconstruction in practical scenarios. Existing methods often fail to align physical constraints with data-driven features, particularly under sparse measurement conditions. To address these issues, we propose **Phy**sics-Aligned **R**adio **M**ap **D**iffusion **M**odel (**PhyRMDM**), a novel framework that establishes cross-domain representation alignment between physical principles and neural network features through dual learning pathways. The proposed model integrates **Physics-Informed Neural Networks (PINNs)** with a **representation alignment mechanism** that explicitly enforces consistency between Helmholtz equation constraints and environmental propagation patterns. Experimental results demonstrate significant improvements over state-of-the-art methods, achieving **NMSE of 0.0031** under *Static Radio Map (SRM)* conditions, and **NMSE of 0.0047** with **Dynamic Radio Map (DRM)** scenarios. The proposed representation alignment paradigm provides **37.2%** accuracy enhancement in ultra-sparse cases (**1%** sampling rate), confirming its effectiveness in bridging physics-based modeling and deep learning for radio map reconstruction.
CVSep 29, 2025
LUMA: Low-Dimension Unified Motion Alignment with Dual-Path Anchoring for Text-to-Motion Diffusion ModelHaozhe Jia, Wenshuo Chen, Yuqi Lin et al.
While current diffusion-based models, typically built on U-Net architectures, have shown promising results on the text-to-motion generation task, they still suffer from semantic misalignment and kinematic artifacts. Through analysis, we identify severe gradient attenuation in the deep layers of the network as a key bottleneck, leading to insufficient learning of high-level features. To address this issue, we propose \textbf{LUMA} (\textit{\textbf{L}ow-dimension \textbf{U}nified \textbf{M}otion \textbf{A}lignment}), a text-to-motion diffusion model that incorporates dual-path anchoring to enhance semantic alignment. The first path incorporates a lightweight MoCLIP model trained via contrastive learning without relying on external data, offering semantic supervision in the temporal domain. The second path introduces complementary alignment signals in the frequency domain, extracted from low-frequency DCT components known for their rich semantic content. These two anchors are adaptively fused through a temporal modulation mechanism, allowing the model to progressively transition from coarse alignment to fine-grained semantic refinement throughout the denoising process. Experimental results on HumanML3D and KIT-ML demonstrate that LUMA achieves state-of-the-art performance, with FID scores of 0.035 and 0.123, respectively. Furthermore, LUMA accelerates convergence by 1.4$\times$ compared to the baseline, making it an efficient and scalable solution for high-fidelity text-to-motion generation.
CVJun 1, 2024
Topo4D: Topology-Preserving Gaussian Splatting for High-Fidelity 4D Head CaptureXuanchen Li, Yuhao Cheng, Xingyu Ren et al.
4D head capture aims to generate dynamic topological meshes and corresponding texture maps from videos, which is widely utilized in movies and games for its ability to simulate facial muscle movements and recover dynamic textures in pore-squeezing. The industry often adopts the method involving multi-view stereo and non-rigid alignment. However, this approach is prone to errors and heavily reliant on time-consuming manual processing by artists. To simplify this process, we propose Topo4D, a novel framework for automatic geometry and texture generation, which optimizes densely aligned 4D heads and 8K texture maps directly from calibrated multi-view time-series images. Specifically, we first represent the time-series faces as a set of dynamic 3D Gaussians with fixed topology in which the Gaussian centers are bound to the mesh vertices. Afterward, we perform alternative geometry and texture optimization frame-by-frame for high-quality geometry and texture learning while maintaining temporal topology stability. Finally, we can extract dynamic facial meshes in regular wiring arrangement and high-fidelity textures with pore-level details from the learned Gaussians. Extensive experiments show that our method achieves superior results than the current SOTA face reconstruction methods both in the quality of meshes and textures. Project page: https://xuanchenli.github.io/Topo4D/.
IVFeb 10, 2022
HNF-Netv2 for Brain Tumor Segmentation using multi-modal MR ImagingHaozhe Jia, Chao Bai, Weidong Cai et al.
In our previous work, $i.e.$, HNF-Net, high-resolution feature representation and light-weight non-local self-attention mechanism are exploited for brain tumor segmentation using multi-modal MR imaging. In this paper, we extend our HNF-Net to HNF-Netv2 by adding inter-scale and intra-scale semantic discrimination enhancing blocks to further exploit global semantic discrimination for the obtained high-resolution features. We trained and evaluated our HNF-Netv2 on the multi-modal Brain Tumor Segmentation Challenge (BraTS) 2021 dataset. The result on the test set shows that our HNF-Netv2 achieved the average Dice scores of 0.878514, 0.872985, and 0.924919, as well as the Hausdorff distances ($95\%$) of 8.9184, 16.2530, and 4.4895 for the enhancing tumor, tumor core, and whole tumor, respectively. Our method won the RSNA 2021 Brain Tumor AI Challenge Prize (Segmentation Task), which ranks 8th out of all 1250 submitted results.
CVAug 9, 2021
PSGR: Pixel-wise Sparse Graph Reasoning for COVID-19 Pneumonia Segmentation in CT ImagesHaozhe Jia, Haoteng Tang, Guixiang Ma et al.
Automated and accurate segmentation of the infected regions in computed tomography (CT) images is critical for the prediction of the pathological stage and treatment response of COVID-19. Several deep convolutional neural networks (DCNNs) have been designed for this task, whose performance, however, tends to be suppressed by their limited local receptive fields and insufficient global reasoning ability. In this paper, we propose a pixel-wise sparse graph reasoning (PSGR) module and insert it into a segmentation network to enhance the modeling of long-range dependencies for COVID-19 infected region segmentation in CT images. In the PSGR module, a graph is first constructed by projecting each pixel on a node based on the features produced by the segmentation backbone, and then converted into a sparsely-connected graph by keeping only K strongest connections to each uncertain pixel. The long-range information reasoning is performed on the sparsely-connected graph to generate enhanced features. The advantages of this module are two-fold: (1) the pixel-wise mapping strategy not only avoids imprecise pixel-to-node projections but also preserves the inherent information of each pixel for global reasoning; and (2) the sparsely-connected graph construction results in effective information retrieval and reduction of the noise propagation. The proposed solution has been evaluated against four widely-used segmentation models on three public datasets. The results show that the segmentation model equipped with our PSGR module can effectively segment COVID-19 infected regions in CT images, outperforming all other competing models.
CVAug 9, 2021
Boundary-aware Graph Reasoning for Semantic SegmentationHaoteng Tang, Haozhe Jia, Weidong Cai et al.
In this paper, we propose a Boundary-aware Graph Reasoning (BGR) module to learn long-range contextual features for semantic segmentation. Rather than directly construct the graph based on the backbone features, our BGR module explores a reasonable way to combine segmentation erroneous regions with the graph construction scenario. Motivated by the fact that most hard-to-segment pixels broadly distribute on boundary regions, our BGR module uses the boundary score map as prior knowledge to intensify the graph node connections and thereby guide the graph reasoning focus on boundary regions. In addition, we employ an efficient graph convolution implementation to reduce the computational cost, which benefits the integration of our BGR module into current segmentation backbones. Extensive experiments on three challenging segmentation benchmarks demonstrate the effectiveness of our proposed BGR module for semantic segmentation.
IVDec 30, 2020
H2NF-Net for Brain Tumor Segmentation using Multimodal MR Imaging: 2nd Place Solution to BraTS Challenge 2020 Segmentation TaskHaozhe Jia, Weidong Cai, Heng Huang et al.
In this paper, we propose a Hybrid High-resolution and Non-local Feature Network (H2NF-Net) to segment brain tumor in multimodal MR images. Our H2NF-Net uses the single and cascaded HNF-Nets to segment different brain tumor sub-regions and combines the predictions together as the final segmentation. We trained and evaluated our model on the Multimodal Brain Tumor Segmentation Challenge (BraTS) 2020 dataset. The results on the test set show that the combination of the single and cascaded models achieved average Dice scores of 0.78751, 0.91290, and 0.85461, as well as Hausdorff distances ($95\%$) of 26.57525, 4.18426, and 4.97162 for the enhancing tumor, whole tumor, and tumor core, respectively. Our method won the second place in the BraTS 2020 challenge segmentation task out of nearly 80 participants.
CVJul 18, 2018
3D Global Convolutional Adversarial Network\\ for Prostate MR Volume SegmentationHaozhe Jia, Yang Song, Donghao Zhang et al.
Advanced deep learning methods have been developed to conduct prostate MR volume segmentation in either a 2D or 3D fully convolutional manner. However, 2D methods tend to have limited segmentation performance, since large amounts of spatial information of prostate volumes are discarded during the slice-by-slice segmentation process; and 3D methods also have room for improvement, since they use isotropic kernels to perform 3D convolutions whereas most prostate MR volumes have anisotropic spatial resolution. Besides, the fully convolutional structural methods achieve good performance for localization issues but neglect the per-voxel classification for segmentation tasks. In this paper, we propose a 3D Global Convolutional Adversarial Network (3D GCA-Net) to address efficient prostate MR volume segmentation. We first design a 3D ResNet encoder to extract 3D features from prostate scans, and then develop the decoder, which is composed of a multi-scale 3D global convolutional block and a 3D boundary refinement block, to address the classification and localization issues simultaneously for volumetric segmentation. Additionally, we combine the encoder-decoder segmentation network with an adversarial network in the training phrase to enforce the contiguity of long-range spatial predictions. Throughout the proposed model, we use anisotropic convolutional processing for better feature learning on prostate MR scans. We evaluated our 3D GCA-Net model on two public prostate MR datasets and achieved state-of-the-art performances.