NA · Feb 1, 2012
Weak Galerkin Methods for Second Order Elliptic Interface Problems
Lin Mu, Junping Wang, Guowei Wei et al.
Weak Galerkin methods refer to general finite element methods for PDEs in which differential operators are approximated by their weak forms as distributions. Such weak forms give rise to desirable flexibility in enforcing boundary and interface conditions. A weak Galerkin finite element method (WG-FEM) is developed in this paper for solving elliptic partial differential equations (PDEs) with discontinuous coefficients and interfaces. The paper also presents many numerical tests validating the WG-FEM for second order elliptic interface problems. For such interface problems, the solution possesses a certain singularity due to the nonsmoothness of the interface. A research challenge is to design high-order numerical methods that work well for problems with low regularity in the solution. The best known numerical scheme in the literature is of order one for the solution itself in the $L_\infty$ norm. It is demonstrated that the WG-FEM of lowest order is capable of delivering numerical approximations that are of order 1.75 in the usual $L_\infty$ norm for $C^1$ or Lipschitz continuous interfaces associated with $C^1$ or $H^2$ continuous solutions. Theoretically, it is proved that high-order numerical schemes can be designed by using the WG-FEM with polynomials of high order on each element.
CV · Mar 16, 2023
Rt-Track: Robust Tricks for Multi-Pedestrian Tracking
Yukuan Zhang, Yunhua Jia, Housheng Xie et al.
Object tracking is divided into single-object tracking (SOT) and multi-object tracking (MOT). MOT aims to maintain the identities of multiple objects across a series of continuous video sequences. In recent years, MOT has made rapid progress. However, modeling the motion and appearance of objects in complex scenes remains challenging. In this paper, we design a novel direction consistency method for smooth trajectory prediction (STP-DC) to strengthen the modeling of motion information and overcome the lack of robustness of previous methods in complex scenes. Existing methods use pedestrian re-identification (Re-ID) to model appearance; however, they extract more background information, which lacks discriminability in occluded and crowded scenes. We propose a hyper-grain feature embedding network (HG-FEN) to enhance appearance modeling, thus generating robust appearance descriptors. We also propose other robustness techniques, including CF-ECM for storing robust appearance information and SK-AS for improving association accuracy. To achieve state-of-the-art performance in MOT, we propose a robust tracker named Rt-track, incorporating various tricks and techniques. It achieves 79.5 MOTA, 76.0 IDF1 and 62.1 HOTA on the test set of MOT17. Rt-track also achieves 77.9 MOTA, 78.4 IDF1 and 63.3 HOTA on MOT20, surpassing all published methods.
LG · Sep 11, 2023
Exploring Geometric Deep Learning For Precipitation Nowcasting
Shan Zhao, Sudipan Saha, Zhitong Xiong et al.
Precipitation nowcasting (up to a few hours) remains a challenge due to the highly complex local interactions that need to be captured accurately. Convolutional Neural Networks rely on convolutional kernels convolving with grid data, and the extracted features are constrained by a limited receptive field, which typically shows up as excessively smooth output compared to the ground truth. Thus they lack the capacity to model complex spatial relationships among the grids. Geometric deep learning aims to generalize neural network models to non-Euclidean domains. Such models are more flexible in defining nodes and edges and can effectively capture dynamic spatial relationships among geographical grids. Motivated by this, we explore a geometric deep learning-based temporal Graph Convolutional Network (GCN) for precipitation nowcasting. The adjacency matrix that simulates the interactions among grid cells is learned automatically by minimizing the L1 loss between the prediction and the ground truth pixel values during training. Then, the spatial relationship is refined by GCN layers while the temporal information is extracted by 1D convolution with various kernel lengths. The neighboring information is fed as auxiliary input layers to improve the final result. We test the model on sequences of radar reflectivity maps over the Trento/Italy area. The results show that GCNs improve the modeling of local details of the cloud profile as well as the prediction accuracy, achieving decreased error measures.
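As a rough illustration of the ingredients named in this abstract, the sketch below applies one graph convolution step over grid cells and scores the result with an L1 loss. It is a minimal sketch under stated assumptions, not the authors' model: the row normalization with self-loops, the ReLU, the function names, and the toy values are all choices made here for illustration, and the adjacency matrix is fixed rather than learned.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def normalize_rows(adj):
    """Add self-loops, then divide each row by its sum (assumes non-negative entries)."""
    n = len(adj)
    with_loops = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                  for i in range(n)]
    return [[v / sum(row) for v in row] for row in with_loops]

def gcn_layer(adj, features, weight):
    """One graph convolution: ReLU(normalized_adj @ features @ weight)."""
    mixed = matmul(normalize_rows(adj), features)
    out = matmul(mixed, weight)
    return [[max(0.0, v) for v in row] for row in out]

def l1_loss(pred, target):
    """Mean absolute error between two matrices of the same shape."""
    diffs = [abs(p - t) for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return sum(diffs) / len(diffs)

# Hypothetical toy graph: two grid cells connected to each other, one feature each.
adjacency = [[0.0, 1.0], [1.0, 0.0]]
features = [[1.0], [3.0]]
weight = [[1.0]]

prediction = gcn_layer(adjacency, features, weight)
loss = l1_loss(prediction, [[1.0], [3.0]])
```

In a trainable version, the entries of `adjacency` and `weight` would be parameters updated by gradient descent on `loss`; that part is omitted here.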
NA · Nov 2, 2011
A Numerical Study on the Weak Galerkin Method for the Helmholtz Equation with Large Wave Numbers
Lin Mu, Junping Wang, Xiu Ye et al.
Weak Galerkin (WG) refers to general finite element methods for partial differential equations in which differential operators are approximated by weak forms through the usual integration by parts. In particular, WG methods allow the use of discontinuous finite element functions in the algorithm design. One such example was recently introduced by Wang and Ye for solving second order elliptic problems. The goal of this paper is to apply the WG method of Wang and Ye to the Helmholtz equation with high wave numbers. Several test scenarios are designed for a numerical investigation of the accuracy, convergence, and robustness of the WG method in both inhomogeneous and homogeneous media over convex and non-convex domains. Our numerical experiments indicate that weak Galerkin is a finite element technique that is easy to implement and provides very accurate and robust numerical solutions for the Helmholtz problem with high wave numbers.
CV · Oct 5, 2022 · Code
InterFace: Adjustable Angular Margin Inter-class Loss for Deep Face Recognition
Meng Sang, Jiaxuan Chen, Mengzhen Li et al.
In face recognition, improving the loss function so that the face features extracted by the network have greater discriminative power is a perennial research topic. Research in recent years has improved the discriminative power of face models by progressively normalizing softmax to the cosine space and then adding a fixed penalty margin to reduce the intra-class distance and increase the inter-class distance. Although a great deal of previous work has optimized the boundary penalty to improve the discriminative power of the model, adding a fixed margin penalty between the depth feature and the corresponding weight is not consistent with the pattern of data in real scenarios. To address this issue, we propose a novel loss function, InterFace, which releases the constraint of adding a margin penalty only between the depth feature and the corresponding weight, and pushes the separability of classes by adding corresponding margin penalties between the depth features and all weights. To illustrate the advantages of InterFace over a fixed penalty margin, we give a geometric explanation and comparisons on a set of mainstream benchmarks. From a wider perspective, InterFace has advanced the state-of-the-art face recognition performance on five out of thirteen mainstream benchmarks. All training code, pre-trained models, and training logs are publicly released at https://github.com/iamsangmeng/InterFace
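For readers unfamiliar with the fixed-margin baseline that this abstract contrasts itself with, the following sketch computes cosine-space logits and adds a fixed additive angular margin to the target class only. It illustrates the prior approach as described above, not the InterFace loss itself; the function names, the margin of 0.5, and the scale factor are assumptions made here, and the sketch does not handle the case where the angle plus the margin exceeds π.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def fixed_margin_logits(feature, class_weights, target, margin=0.5, scale=64.0):
    """Cosine logits with a fixed angular margin added to the target class angle."""
    f = normalize(feature)
    logits = []
    for j, w in enumerate(class_weights):
        cos_theta = sum(a * b for a, b in zip(f, normalize(w)))
        cos_theta = max(-1.0, min(1.0, cos_theta))  # guard acos against rounding
        if j == target:
            # Penalize the target class: cos(theta + m) <= cos(theta) while theta + m <= pi.
            cos_theta = math.cos(math.acos(cos_theta) + margin)
        logits.append(scale * cos_theta)
    return logits

# Hypothetical 2-D feature aligned with the weight vector of class 0.
logits = fixed_margin_logits([2.0, 0.0], [[1.0, 0.0], [0.0, 1.0]],
                             target=0, margin=0.5, scale=1.0)
```

These logits would then be passed to an ordinary softmax cross-entropy; the margin makes the target class harder to satisfy, which pushes features of the same class closer together.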
AI · Dec 19, 2023 · Code
A Dual-way Enhanced Framework from Text Matching Point of View for Multimodal Entity Linking
Shezheng Song, Shan Zhao, Chengyu Wang et al.
Multimodal Entity Linking (MEL) aims at linking ambiguous mentions with multimodal information to entities in a Knowledge Graph (KG) such as Wikipedia, and plays a key role in many applications. However, existing methods suffer from shortcomings, including modality impurity, such as noise in raw images, and ambiguous textual entity representations, which hinder MEL. We formulate multimodal entity linking as a neural text matching problem in which each piece of multimodal information (text and image) is treated as a query, and the model learns the mapping from each query to the relevant entity among the candidate entities. This paper introduces a dual-way enhanced (DWE) framework for MEL: (1) our model refines queries with multimodal data and addresses semantic gaps using cross-modal enhancers between text and image information. Besides, DWE leverages fine-grained image attributes, including facial characteristics and scene features, to enhance and refine visual features. (2) By using Wikipedia descriptions, DWE enriches entity semantics and obtains a more comprehensive textual representation, which reduces the gap between the textual representation and the entities in the KG. Extensive experiments on three public benchmarks demonstrate that our method achieves state-of-the-art (SOTA) performance, indicating the superiority of our model. The code is released at https://github.com/season1blue/DWE
78.7 · CV · May 18
Xiaomi EV World Model: A Joint World Model Integrating Reconstruction and Generation for Autonomous Driving
Lijun Zhou, Hongcheng Luo, Zhenxin Zhu et al.
This report presents a unified technical system addressing the two core capabilities of world models for autonomous driving: world representation and world generation. For world representation, we propose WorldRec, a feed-forward reconstruction architecture driven by sparse scene queries. WorldRec initializes structured queries in 3D space, leveraging them to aggregate cross-view, cross-temporal features, thereby naturally enforcing spatial consistency across frames and yielding compact yet high-fidelity 3D Gaussian scene representations. For world generation, we propose WorldGen, a two-stage training framework of bidirectional pretraining followed by causal fine-tuning through three progressive stages (Teacher Forcing, ODE distillation, and DMD), enabling high-quality online causal video generation in as few as 4 denoising steps. Building on both modules, we further introduce the JWM, which deeply integrates WorldRec and WorldGen to achieve synergistic gains in generation stability, cross-frame consistency, and visual fidelity, providing a solid foundation for closed-loop simulation, data synthesis, and end-to-end training in autonomous driving.
CL · Nov 10, 2023
How to Bridge the Gap between Modalities: Survey on Multimodal Large Language Model
Shezheng Song, Xiaopeng Li, Shasha Li et al.
We explore Multimodal Large Language Models (MLLMs), which integrate LLMs like GPT-4 to handle multimodal data, including text, images, audio, and more. MLLMs demonstrate capabilities such as generating image captions and answering image-based questions, bridging the gap towards real-world human-computer interactions and hinting at a potential pathway to artificial general intelligence. However, MLLMs still face challenges in addressing the semantic gap in multimodal data, which may lead to erroneous outputs, posing potential risks to society. Selecting the appropriate modality alignment method is crucial, as improper methods might require more parameters without significant performance improvements. This paper aims to explore modality alignment methods for LLMs and their current capabilities. Implementing effective modality alignment can help LLMs address environmental issues and enhance accessibility. The study surveys existing modality alignment methods for MLLMs, categorizing them into four groups: (1) Multimodal Converter, which transforms data into a format that LLMs can understand; (2) Multimodal Perceiver, which improves how LLMs perceive different types of data; (3) Tool Learning, which leverages external tools to convert data into a common format, usually text; and (4) Data-Driven Method, which teaches LLMs to understand specific data types within datasets.
AI · Apr 7, 2024 · Code
DWE+: Dual-Way Matching Enhanced Framework for Multimodal Entity Linking
Shezheng Song, Shasha Li, Shan Zhao et al.
Multimodal entity linking (MEL) aims to utilize multimodal information (usually textual and visual information) to link ambiguous mentions to unambiguous entities in a knowledge base. Current methods face three main issues: (1) treating the entire image as input may introduce redundant information; (2) entity-related information, such as attributes in images, is insufficiently utilized; (3) there is semantic inconsistency between the entity in the knowledge base and its representation. To this end, we propose DWE+ for multimodal entity linking. DWE+ can capture finer semantics and dynamically maintain semantic consistency with entities. This is achieved in three ways: (a) we introduce a method for extracting fine-grained image features by partitioning the image into multiple local objects. Then, hierarchical contrastive learning is used to further align semantics between coarse-grained information (text and image) and fine-grained information (mention and visual objects). (b) We explore ways to extract visual attributes from images, such as facial features and identity, to enhance the fusion feature. (c) We leverage Wikipedia and ChatGPT to capture the entity representation, achieving semantic enrichment from both static and dynamic perspectives, which better reflects real-world entity semantics. Experiments on the Wikimel, Richpedia, and Wikidiverse datasets demonstrate the effectiveness of DWE+ in improving MEL performance. Specifically, we optimize these datasets and achieve state-of-the-art performance on the enhanced datasets. The code and enhanced datasets are released at https://github.com/season1blue/DWET
LG · May 13, 2025 · Code
ExEBench: Benchmarking Foundation Models on Extreme Earth Events
Shan Zhao, Zhitong Xiong, Jie Zhao et al.
Our planet is facing increasingly frequent extreme events, which pose major risks to human lives and ecosystems. Recent advances in machine learning (ML), especially with foundation models (FMs) trained on extensive datasets, excel in extracting features and show promise in disaster management. Nevertheless, these models often inherit biases from training data, challenging their performance over extreme values. To explore the reliability of FMs in the context of extreme events, we introduce ExEBench (Extreme Earth Benchmark), a collection of seven extreme event categories across floods, wildfires, storms, tropical cyclones, extreme precipitation, heatwaves, and cold waves. The dataset features global coverage, varying data volumes, and diverse data sources with different spatial, temporal, and spectral characteristics. To broaden the real-world impact of FMs, we include multiple challenging ML tasks that are closely aligned with operational needs in extreme event detection, monitoring, and forecasting. ExEBench aims to (1) assess FM generalizability across diverse, high-impact tasks and domains, (2) promote the development of novel ML methods that benefit disaster management, and (3) offer a platform for analyzing the interactions and cascading effects of extreme events to advance our understanding of the Earth system, especially under the climate change expected in the decades to come. The dataset and code are public at https://github.com/zhaoshan2/EarthExtreme-Bench.
CV · Jan 4 · Code
ParkGaussian: Surround-view 3D Gaussian Splatting for Autonomous Parking
Xiaobao Wei, Zhangjie Ye, Yuxiang Gu et al.
Parking is a critical task for autonomous driving systems (ADS), with unique challenges in crowded parking slots and GPS-denied environments. However, existing works focus on 2D parking slot perception, mapping, and localization, while 3D reconstruction, which is crucial for capturing complex spatial geometry in parking scenarios, remains underexplored. Naively improving the visual quality of reconstructed parking scenes does not directly benefit autonomous parking, as the key entry point for parking is the slot perception module. To address these limitations, we curate the first benchmark, named ParkRecon3D, specifically designed for parking scene reconstruction. It includes sensor data from four surround-view fisheye cameras with calibrated extrinsics and dense parking slot annotations. We then propose ParkGaussian, the first framework that integrates 3D Gaussian Splatting (3DGS) for parking scene reconstruction. To further improve the alignment between reconstruction and downstream parking slot detection, we introduce a slot-aware reconstruction strategy that leverages existing parking perception methods to enhance the synthesis quality of slot regions. Experiments on ParkRecon3D demonstrate that ParkGaussian achieves state-of-the-art reconstruction quality and better preserves perception consistency for downstream tasks. The code and dataset will be released at: https://github.com/wm-research/ParkGaussian
CL · Jun 27, 2024 · Code
DIM: Dynamic Integration of Multimodal Entity Linking with Large Language Model
Shezheng Song, Shasha Li, Jie Yu et al.
Our study delves into Multimodal Entity Linking, which aligns mentions in multimodal information with entities in a knowledge base. Existing methods still face challenges such as ambiguous entity representations and limited utilization of image information. Thus, we propose dynamic entity extraction using ChatGPT, which dynamically extracts entities and enhances datasets. We also propose a method, Dynamically Integrate Multimodal information with knowledge base (DIM), which employs the capability of a Large Language Model (LLM) for visual understanding. The LLM, such as BLIP-2, extracts information relevant to entities in the image, which facilitates improved extraction of entity features and links them with the dynamic entity representations provided by ChatGPT. The experiments demonstrate that our proposed DIM method outperforms the majority of existing methods on the three original datasets, and achieves state-of-the-art (SOTA) performance on the dynamically enhanced datasets (Wiki+, Rich+, Diverse+). For reproducibility, our code and collected datasets are released at https://github.com/season1blue/DIM
CV · Sep 14, 2021 · Code
Semi-Supervised Wide-Angle Portraits Correction by Multi-Scale Transformer
Fushun Zhu, Shan Zhao, Peng Wang et al.
We propose a semi-supervised network for wide-angle portraits correction. Wide-angle images often suffer from skew and distortion caused by perspective effects, especially noticeable in the face regions. Previous deep learning based approaches need ground-truth correction flow maps for training guidance. However, such labels are expensive because they can only be obtained manually. In this work, we design a semi-supervised scheme and build a high-quality unlabeled dataset with rich scenarios, allowing us to simultaneously use labeled and unlabeled data to improve performance. Specifically, our semi-supervised scheme takes advantage of the consistency mechanism, with several novel components such as direction and range consistency (DRC) and regression consistency (RC). Furthermore, different from the existing methods, we propose the Multi-Scale Swin-Unet (MS-Unet) based on the multi-scale swin transformer block (MSTB), which can simultaneously learn short-distance and long-distance information to avoid artifacts. Extensive experiments demonstrate that the proposed method is superior to the state-of-the-art methods and other representative baselines. The source code and dataset are available at: https://github.com/megvii-research/Portraits_Correction.
LG · Nov 5, 2024
Beyond Grid Data: Exploring Graph Neural Networks for Earth Observation
Shan Zhao, Zhaiyu Chen, Zhitong Xiong et al.
Earth Observation (EO) data analysis has been significantly revolutionized by deep learning (DL), with applications typically limited to grid-like data structures. Graph Neural Networks (GNNs) emerge as an important innovation, propelling DL into the non-Euclidean domain. Naturally, GNNs can effectively tackle the challenges posed by diverse modalities, multiple sensors, and the heterogeneous nature of EO data. To introduce GNNs in the related domains, our review begins by offering fundamental knowledge on GNNs. Then, we summarize the generic problems in EO, to which GNNs can offer potential solutions. Following this, we explore a broad spectrum of GNNs' applications to scientific problems in Earth systems, covering areas such as weather and climate analysis, disaster management, air quality monitoring, agriculture, land cover classification, hydrological process modeling, and urban modeling. The rationale behind adopting GNNs in these fields is explained, alongside methodologies for organizing graphs and designing favorable architectures for various tasks. Furthermore, we highlight methodological challenges of implementing GNNs in these domains and possible solutions that could guide future research. While acknowledging that GNNs are not a universal solution, we conclude the paper by comparing them with other popular architectures like transformers and analyzing their potential synergies.
LG · Mar 13, 2024
Causal Graph Neural Networks for Wildfire Danger Prediction
Shan Zhao, Ioannis Prapas, Ilektra Karasante et al.
Wildfire forecasting is notoriously hard due to the complex interplay of different factors such as weather conditions, vegetation types and human activities. Deep learning models show promise in dealing with this complexity by learning directly from data. However, to inform critical decision making, we argue that we need models that are right for the right reasons; that is, the implicit rules learned should be grounded in the underlying processes driving wildfires. In that direction, we propose integrating causality with Graph Neural Networks (GNNs) that explicitly model the causal mechanism among complex variables via graph learning. The causal adjacency matrix considers the synergistic effect among variables and removes the spurious links from highly correlated impacts. Our methodology's effectiveness is demonstrated through superior performance in forecasting wildfire patterns in the European boreal and Mediterranean biomes. The gain is especially prominent in a highly imbalanced dataset, showcasing an enhanced robustness of the model to adapt to regime shifts in functional relationships. Furthermore, SHAP values from our trained model enhance our understanding of the model's inner workings.
CV · Nov 25, 2024
MOSABench: Multi-Object Sentiment Analysis Benchmark for Evaluating Multimodal Large Language Models Understanding of Complex Image
Shezheng Song, Chengxiang He, Shan Zhao et al.
Multimodal large language models (MLLMs) have shown remarkable progress in high-level semantic tasks such as visual question answering, image captioning, and emotion recognition. However, despite these advancements, there remains a lack of standardized benchmarks for evaluating MLLM performance in multi-object sentiment analysis, a key task in semantic understanding. To address this gap, we introduce MOSABench, a novel evaluation dataset designed specifically for multi-object sentiment analysis. MOSABench includes approximately 1,000 images with multiple objects, requiring MLLMs to independently assess the sentiment of each object, thereby reflecting real-world complexities. Key innovations in MOSABench include distance-based target annotation, post-processing for evaluation to standardize outputs, and an improved scoring mechanism. Our experiments reveal notable limitations in current MLLMs: while some models, like mPLUG-owl and Qwen-VL2, demonstrate effective attention to sentiment-relevant features, others exhibit scattered focus and performance declines, especially as the spatial distance between objects increases. This research underscores the need for MLLMs to enhance accuracy in complex, multi-object sentiment analysis tasks and establishes MOSABench as a foundational tool for advancing sentiment analysis capabilities in MLLMs.
CL · May 23, 2024
PTA: Enhancing Multimodal Sentiment Analysis through Pipelined Prediction and Translation-based Alignment
Shezheng Song, Shasha Li, Shan Zhao et al.
Multimodal aspect-based sentiment analysis (MABSA) aims to understand opinions in a granular manner, advancing human-computer interaction and other fields. Traditionally, MABSA methods use a joint prediction approach to identify aspects and sentiments simultaneously. However, we argue that joint models are not always superior. Our analysis shows that joint models struggle to align relevant text tokens with image patches, leading to misalignment and ineffective image utilization. In contrast, a pipeline framework first identifies aspects through MATE (Multimodal Aspect Term Extraction) and then aligns these aspects with image patches for sentiment classification (MASC: Multimodal Aspect-Oriented Sentiment Classification). This method is better suited for multimodal scenarios where effective image use is crucial. We present three key observations: (a) MATE and MASC have different feature requirements, with MATE focusing on token-level features and MASC on sequence-level features; (b) the aspect identified by MATE is crucial for effective image utilization; and (c) images play a trivial role in previous MABSA methods due to high noise. Based on these observations, we propose a pipeline framework that first predicts the aspect and then uses translation-based alignment (TBA) to enhance multimodal semantic consistency for better image utilization. Our method achieves state-of-the-art (SOTA) performance on widely used MABSA datasets Twitter-15 and Twitter-17. This demonstrates the effectiveness of the pipeline approach and its potential to provide valuable insights for future MABSA research. For reproducibility, the code and checkpoint will be released.
LG · Jan 31, 2024
Efficient Subseasonal Weather Forecast using Teleconnection-informed Transformers
Shan Zhao, Zhitong Xiong, Xiao Xiang Zhu
Subseasonal forecasting, which is pivotal for agriculture, water resource management, and early warning of disasters, faces challenges due to the chaotic nature of the atmosphere. Recent advances in machine learning (ML) have revolutionized weather forecasting by achieving predictive skill competitive with numerical models. However, training such foundation models requires thousands of GPU days, which causes substantial carbon emissions and limits their broader applicability. Moreover, ML models tend to game the pixel-wise error scores by producing smoothed results that lack physical consistency and meteorological meaning. To deal with these problems, we propose a teleconnection-informed transformer. Our architecture leverages the pretrained Pangu model to obtain good initial weights and integrates a teleconnection-informed temporal module to improve predictability over an extended temporal range. Remarkably, by adjusting 1.1% of the Pangu model's parameters, our method enhances predictability on four surface and five upper-level atmospheric variables at a two-week lead time. Furthermore, the teleconnection-filtered features improve the spatial granularity of outputs significantly, indicating their potential physical consistency. Our research underscores the importance of atmospheric and oceanic teleconnections in driving future weather conditions. Besides, it presents a resource-efficient pathway for researchers to leverage existing foundation models on versatile downstream tasks.
CV · Aug 21, 2025
ExtraGS: Geometric-Aware Trajectory Extrapolation with Uncertainty-Guided Generative Priors
Kaiyuan Tan, Yingying Shen, Haohui Zhu et al.
Synthesizing extrapolated views from recorded driving logs is critical for simulating driving scenes for autonomous driving vehicles, yet it remains a challenging task. Recent methods leverage generative priors as pseudo ground truth, but often lead to poor geometric consistency and over-smoothed renderings. To address these limitations, we propose ExtraGS, a holistic framework for trajectory extrapolation that integrates both geometric and generative priors. At the core of ExtraGS is a novel Road Surface Gaussian (RSG) representation based on a hybrid Gaussian-Signed Distance Function (SDF) design, and Far Field Gaussians (FFG) that use learnable scaling factors to efficiently handle distant objects. Furthermore, we develop a self-supervised uncertainty estimation framework based on spherical harmonics that enables selective integration of generative priors only where extrapolation artifacts occur. Extensive experiments on multiple datasets, diverse multi-camera setups, and various generative priors demonstrate that ExtraGS significantly enhances the realism and geometric consistency of extrapolated views, while preserving high fidelity along the original trajectory.
CV · Aug 26, 2021
Reiterative Domain Aware Multi-Target Adaptation
Sudipan Saha, Shan Zhao, Nasrullah Sheikh et al.
Most domain adaptation methods focus on single-source-single-target adaptation settings. Multi-target domain adaptation is a powerful extension in which a single classifier is learned for multiple unlabeled target domains. To build a multi-target classifier, it is important to have a feature extractor that generalizes well across domains, and effective aggregation of features from the labeled source and different unlabeled target domains. Towards the first, we use the recently popular Transformer as a feature extraction backbone. Towards the second, we use a co-teaching-based approach with a dual-classifier head, one of which is based on a graph neural network. The proposed approach uses a sequential adaptation strategy that adapts one domain at a time, starting from the target domains that are more similar to the source, assuming that the network finds it easier to adapt to such target domains. After adapting on each target, samples with a softmax-based confidence score greater than a threshold are added to the pseudo-source, thus aggregating knowledge from different domains. However, softmax is not entirely trustworthy as a confidence score and may generate a high score for unreliable samples if trained for many iterations. To mitigate this effect, we adopt a reiterative approach, in which we reduce the number of target adaptation iterations but reiterate multiple times over the target domains. The experimental evaluation on the Office-Home, Office-31 and DomainNet datasets shows significant improvement over the existing methods. We achieve a 10.7% average improvement on the Office-Home dataset over the state-of-the-art methods.
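The pseudo-source step described in this abstract, adding target samples whose softmax confidence exceeds a threshold, can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the function names, the threshold of 0.9, and the toy logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def select_pseudo_source(target_logits, threshold=0.9):
    """Return (sample_index, predicted_class, confidence) for confident target samples."""
    selected = []
    for index, logits in enumerate(target_logits):
        probs = softmax(logits)
        confidence = max(probs)
        if confidence > threshold:
            selected.append((index, probs.index(confidence), confidence))
    return selected

# Hypothetical classifier outputs for three unlabeled target samples, three classes.
target_logits = [[4.0, 0.0, 0.0], [1.0, 1.1, 0.9], [0.0, 0.0, 5.0]]
picked = select_pseudo_source(target_logits, threshold=0.9)
```

Here the first and third samples pass the threshold and would join the pseudo-source with their predicted labels, while the ambiguous second sample is left out. The abstract's caveat applies directly: a long-trained network can assign high softmax scores to unreliable samples, which is the motivation given for shortening each adaptation pass and reiterating.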
CV · Apr 26, 2021
Practical Wide-Angle Portraits Correction with Deep Structured Models
Jing Tan, Shan Zhao, Pengfei Xiong et al.
Wide-angle portraits often enjoy expanded views. However, they contain perspective distortions, especially noticeable when capturing group portrait photos, where the background is skewed and faces are stretched. This paper introduces the first deep learning based approach to remove such artifacts from freely-shot photos. Specifically, given a wide-angle portrait as input, we build a cascaded network consisting of a LineNet, a ShapeNet, and a transition module (TM), which respectively correct perspective distortions on the background, adapt to the stereographic projection on facial regions, and achieve smooth transitions between these two projections. To train our network, we build the first perspective portrait dataset with a large diversity in identities, scenes and camera modules. For the quantitative evaluation, we introduce two novel metrics, line consistency and face congruence. Compared to the previous state-of-the-art approach, our method does not require camera distortion parameters. We demonstrate that our approach significantly outperforms the previous state-of-the-art approach both qualitatively and quantitatively.
CL · Feb 16, 2020
The Utility of General Domain Transfer Learning for Medical Language Tasks
Daniel Ranti, Katie Hanss, Shan Zhao et al.
The purpose of this study is to analyze the efficacy of transfer learning techniques and transformer-based models as applied to medical natural language processing (NLP) tasks, specifically radiological text classification. We used 1,977 labeled head CT reports, from a corpus of 96,303 total reports, to evaluate the efficacy of pretraining using general domain corpora and a combined general and medical domain corpus with a bidirectional encoder representations from transformers (BERT) model for the purpose of radiological text classification. Model performance was benchmarked against a logistic regression using bag-of-words vectorization and a long short-term memory (LSTM) multi-label multi-class classification model, and compared to the published literature in medical text classification. The BERT models using either set of pretrained checkpoints outperformed the logistic regression model, achieving sample-weighted average F1-scores of 0.87 and 0.87 for the general domain model and the combined general and biomedical-domain model, respectively. General text transfer learning may be a viable technique to generate state-of-the-art results within medical NLP tasks on radiological corpora, outperforming other deep models such as LSTMs. The efficacy of pretraining and transformer-based models could serve to facilitate the creation of groundbreaking NLP models in the uniquely challenging data environment of medical text.
NA · Oct 10, 2014
Unconditionally stable time splitting methods for the electrostatic analysis of solvated biomolecules
Leighton Wilson, Shan Zhao
This work introduces novel unconditionally stable operator splitting methods for solving the time dependent nonlinear Poisson-Boltzmann (NPB) equation for the electrostatic analysis of solvated biomolecules. In a pseudo-transient continuation solution of the NPB equation, a long time integration is needed to reach the steady state. This calls for time stepping schemes that are stable and accurate for large time increments. The existing alternating direction implicit (ADI) methods for the NPB equation are known to be conditionally stable, despite being fully implicit. To overcome this difficulty, we propose several new operator splitting schemes, in both multiplicative and additive styles, including locally one-dimensional (LOD) schemes and additive operator splitting (AOS) schemes. The proposed schemes are much more stable than the ADI methods, and some of them are indeed unconditionally stable in dealing with solvated proteins with source singularities and non-smooth solutions. Numerically, the orders of convergence in both space and time are found to be one. Nevertheless, the precision in calculating the electrostatic free energy is low unless a small time increment is used. Further accuracy improvements are thus considered. After acceleration, the optimized LOD method can produce a reliable energy estimate by integrating for a small and fixed number of time steps. Since one only needs to solve a tridiagonal linear system in each independent one-dimensional process, the overall computation is very efficient. The unconditionally stable LOD method scales linearly with respect to the number of atoms in the proteins studied, and is over 20 times faster than the conditionally stable ADI methods.
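The efficiency claim rests on the fact that each one-dimensional implicit sweep reduces to a tridiagonal linear system, which can be solved in linear time. The sketch below shows this for a plain linear diffusion equation with zero Dirichlet boundaries; it is a simplified stand-in chosen for illustration, not the nonlinear Poisson-Boltzmann scheme of the paper, and the function names and parameter values are assumptions.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the Thomas algorithm in O(n).

    lower[i] multiplies x[i-1] in row i (lower[0] is unused);
    upper[i] multiplies x[i+1] in row i (upper[-1] is unused).
    No pivoting is done, so a diagonally dominant matrix is assumed.
    """
    n = len(diag)
    c = [0.0] * n  # modified upper diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):  # back substitution
        x[i] = d[i] - c[i] * x[i + 1]
    return x

def implicit_diffusion_sweep(u, r):
    """One backward-Euler step of u_t = u_xx along a single grid line.

    Solves (1 + 2r) x[i] - r x[i-1] - r x[i+1] = u[i], with r = dt / dx**2
    and zero values assumed outside the line.
    """
    n = len(u)
    return solve_tridiagonal([-r] * n, [1.0 + 2.0 * r] * n, [-r] * n, u)

# Hypothetical data: a unit spike advanced with a deliberately large step (r = 10).
u_old = [0.0, 1.0, 0.0]
u_new = implicit_diffusion_sweep(u_old, r=10.0)
```

Even with the large step, the updated values stay between 0 and 1, which is the kind of behavior an unconditionally stable implicit sweep is meant to provide. In a locally one-dimensional method, sweeps like this are applied along each coordinate direction in turn, and the grid lines within one direction are independent of each other.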
NA · Oct 5, 2006
On the validity of "A proof that the discrete singular convolution (DSC)/Lagrange-distributed approximation function (LDAF) method is inferior to high order finite differences"
G. W. Wei, Shan Zhao
A few families of counterexamples are provided to "A proof that the discrete singular convolution (DSC)/Lagrange-distributed approximation function (LDAF) method is inferior to high order finite differences", Journal of Computational Physics, 214, 538-549 (2006).