CVJul 22, 2023Code
GEM: Boost Simple Network for Glass Surface Segmentation via Vision Foundation ModelsJing Hao, Moyun Liu, Jinrong Yang et al.
Detecting glass regions is a challenging task due to the inherent ambiguity in their transparency and reflective characteristics. Current solutions in this field remain rooted in conventional deep learning paradigms, requiring the construction of annotated datasets and the design of network architectures. However, the evident drawback with these mainstream solutions lies in the time-consuming and labor-intensive process of curating datasets, alongside the increasing complexity of model structures. In this paper, we propose to address these issues by fully harnessing the capabilities of two existing vision foundation models (VFMs): Stable Diffusion and Segment Anything Model (SAM). Firstly, we construct a Synthetic but photorealistic large-scale Glass Surface Detection dataset, dubbed S-GSD, without any labour cost via Stable Diffusion. This dataset consists of four different scales, consisting of 168k images totally with precise masks. Besides, based on the powerful segmentation ability of SAM, we devise a simple Glass surface sEgMentor named GEM, which follows the simple query-based encoder-decoder architecture. Comprehensive experiments are conducted on the large-scale glass segmentation dataset GSD-S. Our GEM establishes a new state-of-the-art performance with the help of these two VFMs, surpassing the best-reported method GlassSemNet with an IoU improvement of 2.1%. Additionally, extensive experiments demonstrate that our synthetic dataset S-GSD exhibits remarkable performance in zero-shot and transfer learning settings. Codes, datasets and models are publicly available at: https://github.com/isbrycee/GEM
CLApr 9Code
OralAgent: Integrating Reasoning, Tools, and Knowledge for Interactive Dental Image AnalysisJing Hao, Siyuan Dai, Yongxin Zhang et al.
Dental image analysis plays a pivotal role in supporting accurate diagnosis and treatment planning in oral healthcare. Although recent advances have produced dental AI models for specific tasks and individual imaging modalities, their isolated designs limit practical use in real-world clinical workflows. In this paper, we present OralAgent, the first dental-specialized AI agent that unifies multimodal reasoning, tool-based decision-making, and knowledge-grounded retrieval within an end-to-end automated framework. It integrates 22 visual analysis tools and 368 widely-used classical dental textbooks, enabling autonomous reasoning, planning, tool use, knowledge retrieval, and multi-step workflow execution. Furthermore, we introduce OralCorpus, a large-scale, high-quality bilingual textual resource containing 134.8M tokens curated for dental retrieval-augmented generation (RAG). To evaluate models' multidisciplinary dental knowledge, we construct OralQA-ZH, a Chinese multiple-choice question benchmark consisting of 798 items across eleven oral subspecialties. Extensive experiments demonstrate that OralAgent achieves state-of-the-art performance on the MMOral-Uni, MMOral-OPG, and OralQA-ZH benchmarks, highlighting its effectiveness, interpretability, and adaptability in real-world clinical settings. The code and models are publicly available at https://github.com/isjinghao/OralAgent.
CVSep 20, 2024
FullAnno: A Data Engine for Enhancing Image Comprehension of MLLMsJing Hao, Yuxiang Zhao, Song Chen et al.
Multimodal Large Language Models (MLLMs) have shown promise in a broad range of vision-language tasks with their strong reasoning and generalization capabilities. However, they heavily depend on high-quality data in the Supervised Fine-Tuning (SFT) phase. The existing approaches aim to curate high-quality data via GPT-4V, but they are not scalable due to the commercial nature of GPT-4V and the simplicity of the prompts used to instruct the model. To this end, we devised the FullAnno system, which is a data engine that can generate large-scale, high-quality, and fine-grained image annotations consisting of the category and position of objects, region descriptions, text information, as well as image dense captions. This engine is characterized by its cascade annotation process, which involves multiple expert models and employs rich prompts to instruct LLMs in generating dense image captions. We re-annotated the COCO and Visual Genome datasets using our FullAnno system, tripling the number of object annotations and increasing the length of the original image captions by a factor of 15. Experiments show that the regenerated annotation can significantly enhance the capabilities of LLaVA-v1.5 on several benchmarks. The re-annotated data are available at: https://arcana-project-page.github.io
CVApr 7, 2023
Language-aware Multiple Datasets Detection Pretraining for DETRsJing Hao, Song Chen, Xiaodi Wang et al.
Pretraining on large-scale datasets can boost the performance of object detectors while the annotated datasets for object detection are hard to scale up due to the high labor cost. What we possess are numerous isolated filed-specific datasets, thus, it is appealing to jointly pretrain models across aggregation of datasets to enhance data volume and diversity. In this paper, we propose a strong framework for utilizing Multiple datasets to pretrain DETR-like detectors, termed METR, without the need for manual label spaces integration. It converts the typical multi-classification in object detection into binary classification by introducing a pre-trained language model. Specifically, we design a category extraction module for extracting potential categories involved in an image and assign these categories into different queries by language embeddings. Each query is only responsible for predicting a class-specific object. Besides, to adapt our novel detection paradigm, we propose a group bipartite matching strategy that limits the ground truths to match queries assigned to the same category. Extensive experiments demonstrate that METR achieves extraordinary results on either multi-task joint training or the pretrain & finetune paradigm. Notably, our pre-trained models have high flexible transferability and increase the performance upon various DETR-like detectors on COCO val2017 benchmark. Codes will be available after this paper is published.
CVJul 22, 2023
A Stronger Stitching Algorithm for Fisheye Images based on Deblurring and RegistrationJing Hao, Jingming Xie, Jinyuan Zhang et al.
Fisheye lens, which is suitable for panoramic imaging, has the prominent advantage of a large field of view and low cost. However, the fisheye image has a severe geometric distortion which may interfere with the stage of image registration and stitching. Aiming to resolve this drawback, we devise a stronger stitching algorithm for fisheye images by combining the traditional image processing method with deep learning. In the stage of fisheye image correction, we propose the Attention-based Nonlinear Activation Free Network (ANAFNet) to deblur fisheye images corrected by Zhang calibration method. Specifically, ANAFNet adopts the classical single-stage U-shaped architecture based on convolutional neural networks with soft-attention technique and it can restore a sharp image from a blurred image effectively. In the part of image registration, we propose the ORB-FREAK-GMS (OFG), a comprehensive image matching algorithm, to improve the accuracy of image registration. Experimental results demonstrate that panoramic images of superior quality stitching by fisheye images can be obtained through our method.
CVSep 18, 2023
Heterogeneous Generative Knowledge Distillation with Masked Image ModelingZiming Wang, Shumin Han, Xiaodi Wang et al.
Small CNN-based models usually require transferring knowledge from a large model before they are deployed in computationally resource-limited edge devices. Masked image modeling (MIM) methods achieve great success in various visual tasks but remain largely unexplored in knowledge distillation for heterogeneous deep models. The reason is mainly due to the significant discrepancy between the Transformer-based large model and the CNN-based small network. In this paper, we develop the first Heterogeneous Generative Knowledge Distillation (H-GKD) based on MIM, which can efficiently transfer knowledge from large Transformer models to small CNN-based models in a generative self-supervised fashion. Our method builds a bridge between Transformer-based models and CNNs by training a UNet-style student with sparse convolution, which can effectively mimic the visual representation inferred by a teacher over masked modeling. Our method is a simple yet effective learning paradigm to learn the visual representation and distribution of data from heterogeneous teacher models, which can be pre-trained using advanced generative methods. Extensive experiments show that it adapts well to various models and sizes, consistently achieving state-of-the-art performance in image classification, object detection, and semantic segmentation tasks. For example, in the Imagenet 1K dataset, H-GKD improves the accuracy of Resnet50 (sparse) from 76.98% to 80.01%.
CVApr 1, 2024Code
T-Mamba: A unified framework with Long-Range Dependency in dual-domain for 2D & 3D Tooth SegmentationJing Hao, Yonghui Zhu, Lei He et al.
Tooth segmentation is a pivotal step in modern digital dentistry, essential for applications across orthodontic diagnosis and treatment planning. Despite its importance, this process is fraught with challenges due to the high noise and low contrast inherent in 2D and 3D tooth data. Both Convolutional Neural Networks (CNNs) and Transformers has shown promise in medical image segmentation, yet each method has limitations in handling long-range dependencies and computational complexity. To address this issue, this paper introduces T-Mamba, integrating frequency-based features and shared bi-positional encoding into vision mamba to address limitations in efficient global feature modeling. Besides, we design a gate selection unit to integrate two features in spatial domain and one feature in frequency domain adaptively. T-Mamba is the first work to introduce frequency-based features into vision mamba, and its flexibility allows it to process both 2D and 3D tooth data without the need for separate modules. Also, the TED3, a large-scale public tooth 2D dental X-ray dataset, has been presented in this paper. Extensive experiments demonstrate that T-Mamba achieves new SOTA results on a public tooth CBCT dataset and outperforms previous SOTA methods on TED3 dataset. The code and models are publicly available at: https://github.com/isbrycee/T-Mamba.
CVDec 18, 2024Code
Descriptive Caption Enhancement with Visual Specialists for Multimodal PerceptionYanpeng Sun, Jing Hao, Ke Zhu et al.
Training Large Multimodality Models (LMMs) relies on descriptive image caption that connects image and language. Existing methods either distill the caption from the LMM models or construct the captions from the internet images or by human. We propose to leverage off-the-shelf visual specialists, which were trained from annotated images initially not for image captioning, for enhancing the image caption. Our approach, named DCE, explores object low-level and fine-grained attributes (e.g., depth, emotion and fine-grained categories) and object relations (e.g., relative location and human-object-interaction (HOI)), and combine the attributes into the descriptive caption. Experiments demonstrate that such visual specialists are able to improve the performance for visual understanding tasks as well as reasoning that benefits from more accurate visual understanding. We will release the source code and the pipeline so that other visual specialists are easily combined into the pipeline. The complete source code of DCE pipeline and datasets will be available at \url{https://github.com/syp2ysy/DCE}.
QMMar 17
Topology-Guided Biomechanical Profiling: A White-Box Framework for Opportunistic Screening of Spinal Instability on Routine CTZanting Ye, Xuanbin Wu, Guoqing Zhong et al.
Routine oncologic computed tomography (CT) presents an ideal opportunity for screening spinal instability, yet prophylactic stabilization windows are frequently missed due to the complex geometric reasoning required by the Spinal Instability Neoplastic Score (SINS). Automating SINS is fundamentally hindered by metastatic osteolysis, which induces topological ambiguity that confounds standard segmentation and black-box AI. We propose Topology-Guided Biomechanical Profiling (TGBP), an auditable white-box framework decoupling anatomical perception from structural reasoning. TGBP anchors SINS assessment on two deterministic geometric innovations: (i) canal-referenced partitioning to resolve posterolateral boundary ambiguity, and (ii) context-aware morphometric normalization via covariance-based oriented bounding boxes (OBB) to quantify vertebral collapse. Integrated with auxiliary radiomic and large language model (LLM) modules, TGBP provides an end-to-end, interpretable SINS evaluation. Validated on a multi-center, multi-cancer cohort ($N=482$), TGBP achieved 90.2\% accuracy in 3-tier stability triage. In a blinded reader study ($N=30$), TGBP significantly outperformed medical oncologists on complex structural features ($κ=0.857$ vs.\ $0.570$) and prevented compounding errors in Total Score estimation ($κ=0.625$ vs.\ $0.207$), democratizing expert-level opportunistic screening.
CVSep 11, 2025Code
Towards Better Dental AI: A Multimodal Benchmark and Instruction Dataset for Panoramic X-ray AnalysisJing Hao, Yuxuan Fan, Yanpeng Sun et al.
Recent advances in large vision-language models (LVLMs) have demonstrated strong performance on general-purpose medical tasks. However, their effectiveness in specialized domains such as dentistry remains underexplored. In particular, panoramic X-rays, a widely used imaging modality in oral radiology, pose interpretative challenges due to dense anatomical structures and subtle pathological cues, which are not captured by existing medical benchmarks or instruction datasets. To this end, we introduce MMOral, the first large-scale multimodal instruction dataset and benchmark tailored for panoramic X-ray interpretation. MMOral consists of 20,563 annotated images paired with 1.3 million instruction-following instances across diverse task types, including attribute extraction, report generation, visual question answering, and image-grounded dialogue. In addition, we present MMOral-Bench, a comprehensive evaluation suite covering five key diagnostic dimensions in dentistry. We evaluate 64 LVLMs on MMOral-Bench and find that even the best-performing model, i.e., GPT-4o, only achieves 41.45% accuracy, revealing significant limitations of current models in this domain. To promote the progress of this specific domain, we also propose OralGPT, which conducts supervised fine-tuning (SFT) upon Qwen2.5-VL-7B with our meticulously curated MMOral instruction dataset. Remarkably, a single epoch of SFT yields substantial performance enhancements for LVLMs, e.g., OralGPT demonstrates a 24.73% improvement. Both MMOral and OralGPT hold significant potential as a critical foundation for intelligent dentistry and enable more clinically impactful multimodal AI systems in the dental field. The dataset, model, benchmark, and evaluation suite are available at https://github.com/isbrycee/OralGPT.
CVJan 27, 2024Code
GEM: Boost Simple Network for Glass Surface Segmentation via Segment Anything Model and Data SynthesisJing Hao, Moyun Liu, Kuo Feng Hung
Detecting glass regions is a challenging task due to the ambiguity of their transparency and reflection properties. These transparent glasses share the visual appearance of both transmitted arbitrary background scenes and reflected objects, thus having no fixed patterns.Recent visual foundation models, which are trained on vast amounts of data, have manifested stunning performance in terms of image perception and image generation. To segment glass surfaces with higher accuracy, we make full use of two visual foundation models: Segment Anything (SAM) and Stable Diffusion.Specifically, we devise a simple glass surface segmentor named GEM, which only consists of a SAM backbone, a simple feature pyramid, a discerning query selection module, and a mask decoder. The discerning query selection can adaptively identify glass surface features, assigning them as initialized queries in the mask decoder. We also propose a Synthetic but photorealistic large-scale Glass Surface Detection dataset dubbed S-GSD via diffusion model with four different scales, which contain 1x, 5x, 10x, and 20x of the original real data size. This dataset is a feasible source for transfer learning. The scale of synthetic data has positive impacts on transfer learning, while the improvement will gradually saturate as the amount of data increases. Extensive experiments demonstrate that GEM achieves a new state-of-the-art on the GSD-S validation set (IoU +2.1%). Codes and datasets are available at: https://github.com/isbrycee/GEM-Glass-Segmentor.
LGMay 13
CoRe-Gen: Robust Spectrum-to-Structure Generation under Imperfect Fingerprint ConditionsTianbo Liu, Chixiang Lu, Jing Hao et al.
Molecular structure elucidation from tandem mass spectra (MS/MS) remains challenging, particularly for de novo generation beyond database coverage. A common approach decomposes the task into spectrum-to-fingerprint prediction followed by fingerprint-to-structure decoding, enabling the use of large-scale molecular corpora. However, at deployment, the decoder relies on predicted rather than oracle fingerprints, introducing structured errors that propagate into generation. This results in a fundamental condition mismatch, where models trained on clean inputs must operate under noisy, biased predictions, especially for long-tail substructures. We present CoRe-Gen that explicitly addresses this gap. CoRe-Gen improves the intermediate condition via synthetic-spectrum pretraining of the encoder, matches deployment-time noise through frequency-aware fingerprint corruption during decoder training, and mitigates residual errors using structure-aware autoregressive decoding with compositional SELFIES representations, auxiliary structural supervision, and lightweight chemical constraints. Experiments on standard benchmarks show that CoRe-Gen establishes a new state of the art on NPLIB1, achieving 19.54\% Top-1 and 29.92\% Top-10 exact-match accuracy, while remaining competitive on the more challenging MassSpecGym benchmark. Importantly, CoRe-Gen preserves the efficiency advantages of autoregressive decoding, providing a practical and scalable solution for robust spectrum-to-structure generation under realistic conditions.
IVNov 28, 2025Code
MICCAI STS 2024 Challenge: Semi-Supervised Instance-Level Tooth Segmentation in Panoramic X-ray and CBCT ImagesYaqi Wang, Zhi Li, Chengyu Wu et al.
Orthopantomogram (OPGs) and Cone-Beam Computed Tomography (CBCT) are vital for dentistry, but creating large datasets for automated tooth segmentation is hindered by the labor-intensive process of manual instance-level annotation. This research aimed to benchmark and advance semi-supervised learning (SSL) as a solution for this data scarcity problem. We organized the 2nd Semi-supervised Teeth Segmentation (STS 2024) Challenge at MICCAI 2024. We provided a large-scale dataset comprising over 90,000 2D images and 3D axial slices, which includes 2,380 OPG images and 330 CBCT scans, all featuring detailed instance-level FDI annotations on part of the data. The challenge attracted 114 (OPG) and 106 (CBCT) registered teams. To ensure algorithmic excellence and full transparency, we rigorously evaluated the valid, open-source submissions from the top 10 (OPG) and top 5 (CBCT) teams, respectively. All successful submissions were deep learning-based SSL methods. The winning semi-supervised models demonstrated impressive performance gains over a fully-supervised nnU-Net baseline trained only on the labeled data. For the 2D OPG track, the top method improved the Instance Affinity (IA) score by over 44 percentage points. For the 3D CBCT track, the winning approach boosted the Instance Dice score by 61 percentage points. This challenge confirms the substantial benefit of SSL for complex, instance-level medical image segmentation tasks where labeled data is scarce. The most effective approaches consistently leveraged hybrid semi-supervised frameworks that combined knowledge from foundational models like SAM with multi-stage, coarse-to-fine refinement pipelines. Both the challenge dataset and the participants' submitted code have been made publicly available on GitHub (https://github.com/ricoleehduu/STS-Challenge-2024), ensuring transparency and reproducibility.
CVMar 6
OralGPT-Plus: Learning to Use Visual Tools via Reinforcement Learning for Panoramic X-ray AnalysisYuxuan Fan, Jing Hao, Hong Chen et al.
Panoramic dental radiographs require fine-grained spatial reasoning, bilateral symmetry understanding, and multi-step diagnostic verification, yet existing vision-language models operate under a static single-pass paradigm that limits their clinical reliability. In this paper, we introduce OralGPT-Plus, an agentic vision-language model designed to perform iterative and symmetry-aware diagnostic reasoning for panoramic dental radiograph analysis. To support this paradigm, we construct DentalProbe, a five-thousand-image dataset with expert-curated diagnostic trajectories that provide structured supervision for localized inspection and contralateral comparison. We further develop a Reinspection-driven reinforcement learning framework that encourages clinically meaningful re-examination and stabilizes long-horizon reasoning with rubric-based reward and conditioned diagnostic-driven reward. In parallel, we present MMOral-X, the first benchmark for holistic panoramic diagnosis, containing 300 open-ended questions and region-level annotations across multiple difficulty levels. OralGPT-Plus demonstrates consistent and reliable improvements over strong baselines on MMOral-X and established panoramic benchmarks, indicating the effectiveness of interactive and symmetry-informed reasoning. Our work highlights the value of agentic modeling for dental imaging and provides a foundation for future research in clinically aligned panoramic radiograph analysis.
CVNov 27, 2025
OralGPT-Omni: A Versatile Dental Multimodal Large Language ModelJing Hao, Yuci Liang, Lizhuo Lin et al.
Multimodal Large Language Models (MLLMs) have exhibited immense potential across numerous medical specialties; yet, dentistry remains underexplored, in part due to limited domain-specific data, scarce dental expert annotations, insufficient modality-specific modeling, and challenges in reliability. In this paper, we present OralGPT-Omni, the first dental-specialized MLLM designed for comprehensive and trustworthy analysis across diverse dental imaging modalities and clinical tasks. To explicitly capture dentists' diagnostic reasoning, we construct TRACE-CoT, a clinically grounded chain-of-thought dataset that mirrors dental radiologists' decision-making processes. This reasoning supervision, combined with our proposed four-stage training paradigm, substantially strengthens the model's capacity for dental image understanding and analysis. In parallel, we introduce MMOral-Uni, the first unified multimodal benchmark for dental image analysis. It comprises 2,809 open-ended question-answer pairs spanning five modalities and five tasks, offering a comprehensive evaluation suite to date for MLLMs in digital dentistry. OralGPT-Omni achieves an overall score of 51.84 on the MMOral-Uni benchmark and 45.31 on the MMOral-OPG benchmark, dramatically outperforming the scores of GPT-5. Our work promotes intelligent dentistry and paves the way for future advances in dental image analysis. All code, benchmark, and models will be made publicly available.
IVAug 7, 2025
HiFi-Mamba: Dual-Stream W-Laplacian Enhanced Mamba for High-Fidelity MRI ReconstructionHongli Chen, Pengcheng Fang, Yuxia Chen et al.
Reconstructing high-fidelity MR images from undersampled k-space data remains a challenging problem in MRI. While Mamba variants for vision tasks offer promising long-range modeling capabilities with linear-time complexity, their direct application to MRI reconstruction inherits two key limitations: (1) insensitivity to high-frequency anatomical details; and (2) reliance on redundant multi-directional scanning. To address these limitations, we introduce High-Fidelity Mamba (HiFi-Mamba), a novel dual-stream Mamba-based architecture comprising stacked W-Laplacian (WL) and HiFi-Mamba blocks. Specifically, the WL block performs fidelity-preserving spectral decoupling, producing complementary low- and high-frequency streams. This separation enables the HiFi-Mamba block to focus on low-frequency structures, enhancing global feature modeling. Concurrently, the HiFi-Mamba block selectively integrates high-frequency features through adaptive state-space modulation, preserving comprehensive spectral details. To eliminate the scanning redundancy, the HiFi-Mamba block adopts a streamlined unidirectional traversal strategy that preserves long-range modeling capability with improved computational efficiency. Extensive experiments on standard MRI reconstruction benchmarks demonstrate that HiFi-Mamba consistently outperforms state-of-the-art CNN-based, Transformer-based, and other Mamba-based models in reconstruction accuracy while maintaining a compact and efficient model design.
CVOct 18, 2021
A Lightweight and Accurate Recognition Framework for Signs of X-ray Weld ImagesMoyun Liu, Jingming Xie, Jing Hao et al.
X-ray images are commonly used to ensure the security of devices in quality inspection industry. The recognition of signs printed on X-ray weld images plays an essential role in digital traceability system of manufacturing industry. However, the scales of objects vary different greatly in weld images, and it hinders us to achieve satisfactory recognition. In this paper, we propose a signs recognition framework based on convolutional neural networks (CNNs) for weld images. The proposed framework firstly contains a shallow classification network for correcting the pose of images. Moreover, we present a novel spatial and channel enhancement (SCE) module to address the above scale problem. This module can integrate multi-scale features and adaptively assign weights for each feature source. Based on SCE module, a narrow network is designed for final weld information recognition. To enhance the practicability of our framework, we carefully design the architecture of framework with a few parameters and computations. Experimental results show that our framework achieves 99.7% accuracy with 1.1 giga floating-point of operations (GFLOPs) on classification stage, and 90.0 mean average precision (mAP) with 176.1 frames per second (FPS) on recognition stage.
CVAug 9, 2021
TransForensics: Image Forgery Localization with Dense Self-AttentionJing Hao, Zhixin Zhang, Shicai Yang et al.
Nowadays advanced image editing tools and technical skills produce tampered images more realistically, which can easily evade image forensic systems and make authenticity verification of images more difficult. To tackle this challenging problem, we introduce TransForensics, a novel image forgery localization method inspired by Transformers. The two major components in our framework are dense self-attention encoders and dense correction modules. The former is to model global context and all pairwise interactions between local patches at different scales, while the latter is used for improving the transparency of the hidden layers and correcting the outputs from different branches. Compared to previous traditional and deep learning methods, TransForensics not only can capture discriminative representations and obtain high-quality mask predictions but is also not limited by tampering types and patch sequence orders. By conducting experiments on main benchmarks, we show that TransForensics outperforms the stateof-the-art methods by a large margin.
ASNov 21, 2019
An End-to-End Audio Classification System based on Raw Waveforms and Mix-Training StrategyJiaxu Chen, Jing Hao, Kai Chen et al.
Audio classification can distinguish different kinds of sounds, which is helpful for intelligent applications in daily life. However, it remains a challenging task since the sound events in an audio clip is probably multiple, even overlapping. This paper introduces an end-to-end audio classification system based on raw waveforms and mix-training strategy. Compared to human-designed features which have been widely used in existing research, raw waveforms contain more complete information and are more appropriate for multi-label classification. Taking raw waveforms as input, our network consists of two variants of ResNet structure which can learn a discriminative representation. To explore the information in intermediate layers, a multi-level prediction with attention structure is applied in our model. Furthermore, we design a mix-training strategy to break the performance limitation caused by the amount of training data. Experiments show that the mean average precision of the proposed audio classification system on Audio Set dataset is 37.2%. Without using extra training data, our system exceeds the state-of-the-art multi-level attention model.