Ni Lao

CV
h-index: 117
35 papers
5,783 citations
Novelty: 51%
AI Score: 60

35 Papers

AI · Apr 13, 2023
On the Opportunities and Challenges of Foundation Models for Geospatial Artificial Intelligence

Gengchen Mai, Weiming Huang, Jin Sun et al. · stanford

Large pre-trained models, also known as foundation models (FMs), are trained in a task-agnostic manner on large-scale data and can be adapted to a wide range of downstream tasks by fine-tuning, few-shot, or even zero-shot learning. Despite their successes in language and vision tasks, we have yet to see an attempt to develop foundation models for geospatial artificial intelligence (GeoAI). In this work, we explore the promises and challenges of developing multimodal foundation models for GeoAI. We first investigate the potential of many existing FMs by testing their performance on seven tasks across multiple geospatial subdomains including Geospatial Semantics, Health Geography, Urban Geography, and Remote Sensing. Our results indicate that on several geospatial tasks that only involve the text modality, such as toponym recognition, location description recognition, and US state-level/county-level dementia time series forecasting, these task-agnostic LLMs can outperform task-specific fully-supervised models in a zero-shot or few-shot learning setting. However, on other geospatial tasks, especially tasks that involve multiple data modalities (e.g., POI-based urban function classification, street view image-based urban noise intensity classification, and remote sensing image scene classification), existing foundation models still underperform task-specific models. Based on these observations, we propose that one of the major challenges of developing an FM for GeoAI is to address the multimodal nature of geospatial tasks. After discussing the distinct challenges of each geospatial data modality, we suggest the possibility of a multimodal foundation model which can reason over various types of geospatial data through geospatial alignments. We conclude this paper by discussing the unique risks and challenges of developing such a model for GeoAI.

CL · Oct 17, 2022
RARR: Researching and Revising What Language Models Say, Using Language Models

Luyu Gao, Zhuyun Dai, Panupong Pasupat et al. · cmu

Language models (LMs) now excel at many tasks such as few-shot learning, question answering, reasoning, and dialog. However, they sometimes generate unsupported or misleading content. A user cannot easily determine whether their outputs are trustworthy or not, because most LMs do not have any built-in mechanism for attribution to external evidence. To enable attribution while still preserving all the powerful advantages of recent generation models, we propose RARR (Retrofit Attribution using Research and Revision), a system that 1) automatically finds attribution for the output of any text generation model and 2) post-edits the output to fix unsupported content while preserving the original output as much as possible. When applied to the output of several state-of-the-art LMs on a diverse set of generation tasks, we find that RARR significantly improves attribution while otherwise preserving the original input to a much greater degree than previously explored edit models. Furthermore, the implementation of RARR requires only a handful of training examples, a large language model, and standard web search.

CV · Jun 30, 2023
Sphere2Vec: A General-Purpose Location Representation Learning over a Spherical Surface for Large-Scale Geospatial Predictions

Gengchen Mai, Yao Xuan, Wenyun Zuo et al.

Generating learning-friendly representations for points in space is a fundamental and long-standing problem in ML. Recently, multi-scale encoding schemes (such as Space2Vec and NeRF) were proposed to directly encode any point in 2D/3D Euclidean space as a high-dimensional vector, and have been successfully applied to various geospatial prediction and generative tasks. However, all current 2D and 3D location encoders are designed to model point distances in Euclidean space. So when applied to large-scale real-world GPS coordinate datasets, which require distance metric learning on the spherical surface, both types of models can fail due to the map projection distortion problem (2D) and the spherical-to-Euclidean distance approximation error (3D). To solve these problems, we propose a multi-scale location encoder called Sphere2Vec which can preserve spherical distances when encoding point coordinates on a spherical surface. We developed a unified view of distance-preserving encoding on spheres based on the Double Fourier Sphere (DFS). We also provide a theoretical proof that the Sphere2Vec encoding preserves the spherical surface distance between any two points, while existing encoding schemes do not. Experiments on 20 synthetic datasets show that Sphere2Vec can outperform all baseline models on all these datasets with up to 30.8% error rate reduction. We then apply Sphere2Vec to three geo-aware image classification tasks - fine-grained species recognition, Flickr image recognition, and remote sensing image classification. Results on 7 real-world datasets show the superiority of Sphere2Vec over multiple location encoders on all three tasks. Further analysis shows that Sphere2Vec outperforms other location encoder models, especially in the polar regions and data-sparse areas, because it preserves spherical surface distances. Code and data are available at https://gengchenmai.github.io/sphere2vec-website/.
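The map projection distortion mentioned in this abstract can be illustrated with a small sketch. It is not Sphere2Vec itself: it only compares the standard haversine great-circle distance with a naive Euclidean distance on raw latitude/longitude degrees. The function names and the Earth radius constant are assumptions made for the example.

```python
import math

def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Haversine great-circle distance between two (lat, lon) points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def naive_degree_distance(lat1, lon1, lat2, lon2):
    """Euclidean distance computed directly on raw (lat, lon) degrees."""
    return math.hypot(lat2 - lat1, lon2 - lon1)

# Two pairs of points, each separated by 90 degrees of longitude:
# one pair on the equator, one pair at 85 degrees north.
equator_km = great_circle_km(0.0, 0.0, 0.0, 90.0)
polar_km = great_circle_km(85.0, 0.0, 85.0, 90.0)
equator_naive = naive_degree_distance(0.0, 0.0, 0.0, 90.0)
polar_naive = naive_degree_distance(85.0, 0.0, 85.0, 90.0)

print(f"equator: {equator_km:.0f} km, naive {equator_naive:.1f} deg")
print(f"polar:   {polar_km:.0f} km, naive {polar_naive:.1f} deg")
```

Both pairs look equally far apart in raw degrees, yet the pair near the pole is far closer on the sphere, which is the kind of error a Euclidean 2D encoder inherits at high latitudes.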

CV · Sep 29, 2022
Towards General-Purpose Representation Learning of Polygonal Geometries

Gengchen Mai, Chiyu Jiang, Weiwei Sun et al.

Neural network representation learning for spatial data is a common need for geographic artificial intelligence (GeoAI) problems. In recent years, many advancements have been made in representation learning for points, polylines, and networks, whereas little progress has been made for polygons, especially complex polygonal geometries. In this work, we focus on developing a general-purpose polygon encoding model, which can encode a polygonal geometry (with or without holes, single or multipolygons) into an embedding space. The resulting embeddings can be leveraged directly (or finetuned) for downstream tasks such as shape classification, spatial relation prediction, and so on. To achieve model generalizability guarantees, we identify a few desirable properties: loop origin invariance, trivial vertex invariance, part permutation invariance, and topology awareness. We explore two different designs for the encoder: one derives all representations in the spatial domain; the other leverages spectral domain representations. For the spatial domain approach, we propose ResNet1D, a 1D CNN-based polygon encoder, which uses circular padding to achieve loop origin invariance on simple polygons. For the spectral domain approach, we develop NUFTspec based on Non-Uniform Fourier Transformation (NUFT), which naturally satisfies all the desired properties. We conduct experiments on two tasks: 1) shape classification based on MNIST; 2) spatial relation prediction based on two new datasets - DBSR-46K and DBSR-cplx46K. Our results show that NUFTspec and ResNet1D outperform multiple existing baselines by significant margins. While ResNet1D suffers from model performance degradation after shape-invariant geometry modifications, NUFTspec is very robust to these modifications due to the nature of the NUFT.

CV · Sep 30, 2023
SSIF: Learning Continuous Image Representation for Spatial-Spectral Super-Resolution

Gengchen Mai, Ni Lao, Weiwei Sun et al.

Existing digital sensors capture images at fixed spatial and spectral resolutions (e.g., RGB, multispectral, and hyperspectral images), and each combination requires bespoke machine learning models. Neural Implicit Functions partially overcome the spatial resolution challenge by representing an image in a resolution-independent way. However, they still operate at fixed, pre-defined spectral resolutions. To address this challenge, we propose Spatial-Spectral Implicit Function (SSIF), a neural implicit model that represents an image as a function of both continuous pixel coordinates in the spatial domain and continuous wavelengths in the spectral domain. We empirically demonstrate the effectiveness of SSIF on two challenging spatio-spectral super-resolution benchmarks. We observe that SSIF consistently outperforms state-of-the-art baselines even when the baselines are allowed to train separate models at each spectral resolution. We show that SSIF generalizes well to both unseen spatial resolutions and spectral resolutions. Moreover, SSIF can generate high-resolution images that improve the performance of downstream tasks (e.g., land use classification) by 1.7%-7%.

LG · May 22
Training-Free Looped Transformers

Lizhang Chen, Jonathan Li, Chen Liang et al.

We introduce training-free looped transformers, in which a lightweight inference-time wrapper loops a contiguous mid-stack block of layers of a frozen checkpoint without additional fine-tuning, continued training, or architectural changes. Unlike prior looped transformer methods that train with the looped structure end-to-end, we retrofit recurrence onto pretrained models at test time. We show that naive block reapplication usually degrades performance, highlighting the importance of the loop application strategy. Motivated by viewing a pre-norm transformer block as a forward Euler step on an ODE, we instead treat looping as a refinement of the same approximation, replacing one large update with smaller damped sub-steps. Across seven dense, sparse MoE, and MLA+MoE model families, our method improves Qwen3-4B-Instruct by +2.64 pp on MMLU-Pro, Qwen3-30B-A3B-Instruct by +1.14 pp on CommonsenseQA, and Moonlight-16B-A3B-Instruct by +1.20 pp on OpenBookQA.
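The forward Euler view in this abstract can be sketched on a toy problem. This is not the authors' wrapper: `residual_branch` is a hypothetical linear stand-in for a block's residual branch, and the uniform 1/k damping is an assumption chosen to keep the total step size equal to one block application.

```python
import numpy as np

def residual_branch(x):
    """Hypothetical stand-in for the residual branch f(x) of a pre-norm block."""
    return -0.5 * x

def single_pass(x):
    # Standard block application: one forward Euler step of size 1 on dx/dt = f(x).
    return x + residual_branch(x)

def naive_loop(x, k):
    # Reapply the full block k times, i.e. k Euler steps of size 1 each.
    for _ in range(k):
        x = x + residual_branch(x)
    return x

def damped_loop(x, k):
    # k damped sub-steps of size 1/k that together cover the same unit interval.
    for _ in range(k):
        x = x + residual_branch(x) / k
    return x

x0 = np.ones(4)
exact = x0 * np.exp(-0.5)  # closed-form solution of dx/dt = -0.5 x at t = 1

err_single = np.abs(single_pass(x0) - exact).max()
err_naive = np.abs(naive_loop(x0, 8) - exact).max()
err_damped = np.abs(damped_loop(x0, 8) - exact).max()
print(err_single, err_naive, err_damped)
```

On this toy ODE the damped sub-steps land closer to the exact flow than a single pass, while naive reapplication overshoots by integrating eight time units instead of one.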

CL · Jul 7, 2025
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann et al. · amazon-science, baidu

In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.

LG · May 14
$ϕ$-Balancing for Mixture-of-Experts Training

Lizhang Chen, Jonathan Li, Qi Wang et al.

Mixture-of-Experts (MoE) models rely on balanced expert utilization to fully realize their scalability. However, existing load-balancing methods are largely heuristic and operate on noisy mini-batch assignment statistics, introducing bias relative to population-level objectives. We propose $ϕ$-balancing, a principled framework that directly targets population-level expert balance by minimizing a strictly convex, symmetric, and differentiable potential of the expected routing distribution. Using convex duality, we derive an equivalent min-max formulation and obtain a simple online algorithm via mirror descent, yielding an efficient EMA-based routing adjustment with negligible overhead. Across large-scale pretraining and downstream fine-tuning, $ϕ$-balancing consistently outperforms prior Switch-style and loss-free baselines, demonstrating more stable and effective expert utilization.

CV · Jun 21, 2024 · Code
TorchSpatial: A Location Encoding Framework and Benchmark for Spatial Representation Learning

Nemin Wu, Qian Cao, Zhangyu Wang et al.

Spatial representation learning (SRL) aims at learning general-purpose neural network representations from various types of spatial data (e.g., points, polylines, polygons, networks, images, etc.) in their native formats. Learning good spatial representations is a fundamental problem for various downstream applications such as species distribution modeling, weather forecasting, trajectory generation, geographic question answering, etc. Even though SRL has become the foundation of almost all geospatial artificial intelligence (GeoAI) research, we have not yet seen significant efforts to develop an extensive deep learning framework and benchmark to support SRL model development and evaluation. To fill this gap, we propose TorchSpatial, a learning framework and benchmark for location (point) encoding, which is one of the most fundamental data types of spatial representation learning. TorchSpatial contains three key components: 1) a unified location encoding framework that consolidates 15 commonly recognized location encoders, ensuring scalability and reproducibility of the implementations; 2) the LocBench benchmark tasks encompassing 7 geo-aware image classification and 10 geo-aware image regression datasets; 3) a comprehensive suite of evaluation metrics to quantify geo-aware models' overall performance as well as their geographic bias, with a novel Geo-Bias Score metric. Finally, we provide a detailed analysis and insights into the model performance and geographic bias of different location encoders. We believe TorchSpatial will foster future advancement of spatial representation learning and spatial fairness in GeoAI research. The TorchSpatial model framework and LocBench benchmark are available at https://github.com/seai-lab/TorchSpatial, and the Geo-Bias Score evaluation framework is available at https://github.com/seai-lab/PyGBS.

CL · Feb 28, 2019 · Code
FastFusionNet: New State-of-the-Art for DAWNBench SQuAD

Felix Wu, Boyi Li, Lequn Wang et al.

In this technical report, we introduce FastFusionNet, an efficient variant of FusionNet [12]. FusionNet is a high-performing reading comprehension architecture, which was designed primarily for maximum retrieval accuracy with less regard for computational requirements. For FastFusionNet, we remove the expensive CoVe layers [21] and substitute the BiLSTMs with far more efficient SRU layers [19]. The resulting architecture obtains state-of-the-art results on DAWNBench [5] while achieving the lowest training and inference time on SQuAD [25] to date. The code is available at https://github.com/felixgwu/FastFusionNet.

LG · Jul 6, 2018 · Code
Memory Augmented Policy Optimization for Program Synthesis and Semantic Parsing

Chen Liang, Mohammad Norouzi, Jonathan Berant et al.

We present Memory Augmented Policy Optimization (MAPO), a simple and novel way to leverage a memory buffer of promising trajectories to reduce the variance of the policy gradient estimate. MAPO is applicable to deterministic environments with discrete actions, such as structured prediction and combinatorial optimization tasks. We express the expected return objective as a weighted sum of two terms: an expectation over the high-reward trajectories inside the memory buffer, and a separate expectation over trajectories outside the buffer. To make MAPO efficient, we propose: (1) memory weight clipping to accelerate and stabilize training; (2) systematic exploration to discover high-reward trajectories; (3) distributed sampling from inside and outside of the memory buffer to scale up training. MAPO improves the sample efficiency and robustness of policy gradient, especially on tasks with sparse rewards. We evaluate MAPO on weakly supervised program synthesis from natural language (semantic parsing). On the WikiTableQuestions benchmark, we improve the state-of-the-art by 2.6%, achieving an accuracy of 46.3%. On the WikiSQL benchmark, MAPO achieves an accuracy of 74.9% with only weak supervision, outperforming several strong baselines with full supervision. Our source code is available at https://github.com/crazydonkey200/neural-symbolic-machines
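The weighted-sum decomposition of the expected return can be checked on a toy example. The sketch below enumerates a tiny, made-up trajectory space (the names, probabilities, and rewards are all hypothetical) and only verifies the identity; it does not implement memory weight clipping, systematic exploration, or sampling.

```python
# Toy policy over four trajectories: probability under the policy and reward.
probs = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.4}
rewards = {"a": 1.0, "b": 1.0, "c": 0.0, "d": 0.5}
memory_buffer = {"a", "b"}  # hypothetical buffer of high-reward trajectories

def expected_return(probs, rewards):
    """Plain expectation of the reward over all trajectories."""
    return sum(probs[t] * rewards[t] for t in probs)

def split_expected_return(probs, rewards, buffer):
    """Same expectation, written as a weighted sum of an inside-buffer
    expectation and an outside-buffer expectation."""
    weight_in = sum(probs[t] for t in probs if t in buffer)
    weight_out = 1.0 - weight_in
    inside = outside = 0.0
    if weight_in > 0:
        # Expectation under the policy renormalized over the buffer.
        inside = sum(probs[t] / weight_in * rewards[t] for t in probs if t in buffer)
    if weight_out > 0:
        # Expectation under the policy renormalized over everything else.
        outside = sum(probs[t] / weight_out * rewards[t] for t in probs if t not in buffer)
    return weight_in * inside + weight_out * outside

full = expected_return(probs, rewards)
split = split_expected_return(probs, rewards, memory_buffer)
print(full, split)
```

Because the inside term can be computed by enumerating the buffer, only the outside term needs to be estimated by sampling, which is where the variance reduction described in the abstract would come from.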

CV · May 7
TRAJGANR: Trajectory-Centric Urban Multimodal Learning via Geospatially Aligned Neural Representations

Maria Despoina Siampou, Gengchen Mai, Ni Lao et al.

Multimodal self-supervised learning (MSSL) has emerged as a key paradigm for pretraining geospatial foundation models. However, existing geospatial MSSL methods are mainly designed for static pairs of modalities, such as satellite imagery, street-view imagery, and text, where learning is driven by aligning observations from the same or nearby locations. This assumption breaks down for human mobility trajectories, which represent continuous movement along paths rather than discrete observations at individual locations. Although trajectories are important for urban understanding through their ability to capture human activity across roads, neighborhoods, and places over time, they remain largely underexplored in current geospatial MSSL frameworks. We present TrajGANR, a novel trajectory-centric geospatial MSSL framework that aligns continuous movement patterns with static, location-based observations. TrajGANR learns a continuous neural representation of trajectories at arbitrary points along each path, which enables fine-grained alignment with nearby street-view images, even when they are not co-located with any trajectory waypoints. We leverage this capability to introduce an MSSL objective that jointly aligns three modalities: trajectories, street-view images, and their geographic locations. We evaluate TrajGANR on four urban mobility and road understanding tasks. Across these tasks, TrajGANR consistently outperforms existing geospatial MSSL frameworks and a trajectory-specific foundation model. Ablation studies further demonstrate that our proposed MSSL objective and the multimodal learning framework are the primary drivers of these improvements, highlighting the importance of fine-grained geospatial alignment over coarser aggregation, as well as geospatial multimodal learning.

CV · Mar 28, 2024
Img2Loc: Revisiting Image Geolocalization using Multi-modality Foundation Models and Image-based Retrieval-Augmented Generation

Zhongliang Zhou, Jielu Zhang, Zihan Guan et al.

Geolocating precise locations from images presents a challenging problem in computer vision and information retrieval. Traditional methods typically employ either classification, which divides the Earth's surface into grid cells and classifies images accordingly, or retrieval, which identifies locations by matching images with a database of image-location pairs. However, classification-based approaches are limited by the cell size and cannot yield precise predictions, while retrieval-based systems usually suffer from poor search quality and inadequate coverage of the global landscape at varied scales and aggregation levels. To overcome these drawbacks, we present Img2Loc, a novel system that redefines image geolocalization as a text generation task. This is achieved using cutting-edge large multi-modality models (LMMs) like GPT4V or LLaVA with retrieval-augmented generation. Img2Loc first employs CLIP-based representations to generate an image-based coordinate query database. It then combines the query results with the images themselves, forming elaborate prompts customized for LMMs. When tested on benchmark datasets such as Im2GPS3k and YFCC4k, Img2Loc not only surpasses the performance of previous state-of-the-art models but does so without any model training.

CV · Mar 20, 2025
GAIR: Improving Multimodal Geo-Foundation Model with Geo-Aligned Implicit Representations

Zeping Liu, Fan Zhang, Junfeng Jiao et al.

Advancements in vision and language foundation models have inspired the development of geo-foundation models (GeoFMs), enhancing performance across diverse geospatial tasks. However, many existing GeoFMs primarily focus on overhead remote sensing (RS) data while neglecting other data modalities such as ground-level imagery. A key challenge in multimodal GeoFM development is to explicitly model geospatial relationships across modalities, which enables generalizability across tasks, spatial scales, and temporal contexts. To address these limitations, we propose GAIR, a novel multimodal GeoFM architecture integrating overhead RS data, street view (SV) imagery, and their geolocation metadata. We utilize three factorized neural encoders to project an SV image, its geolocation, and an RS image into the embedding space. The SV image needs to be located within the RS image's spatial footprint but does not need to be at its geographic center. In order to geographically align the SV image and RS image, we propose a novel implicit neural representation (INR) module that learns a continuous RS image representation and looks up the RS embedding at the SV image's geolocation. Next, these geographically aligned SV, RS, and location embeddings are trained with contrastive learning objectives from unlabeled data. We evaluate GAIR across 10 geospatial tasks spanning RS image-based, SV image-based, and location embedding-based benchmarks. Experimental results demonstrate that GAIR outperforms state-of-the-art GeoFMs and other strong baselines, highlighting its effectiveness in learning generalizable and transferable geospatial representations.

AI · Sep 27, 2025
GeoBS: Information-Theoretic Quantification of Geographic Bias in AI Models

Zhangyu Wang, Nemin Wu, Qian Cao et al.

The widespread adoption of AI models, especially foundation models (FMs), has made a profound impact on numerous domains. However, it also raises significant ethical concerns, including bias issues. Although numerous efforts have been made to quantify and mitigate social bias in AI models, geographic bias (geo-bias for short), which presents unique challenges, has received much less attention. While previous work has explored ways to quantify geo-bias, these measures are model-specific (e.g., mean absolute deviation of LLM ratings) or spatially implicit (e.g., average fairness scores of all spatial partitions). We lack a model-agnostic, universally applicable, and spatially explicit geo-bias evaluation framework that allows researchers to fairly compare the geo-bias of different AI models and to understand what spatial factors contribute to the geo-bias. In this paper, we establish an information-theoretic framework for geo-bias evaluation, called GeoBS (Geo-Bias Scores). We demonstrate the generalizability of the proposed framework by showing how to interpret and analyze existing geo-bias measures under this framework. Then, we propose three novel geo-bias scores that explicitly take intricate spatial factors (multi-scalability, distance decay, and anisotropy) into consideration. Finally, we conduct extensive experiments on 3 tasks, 8 datasets, and 8 models to demonstrate that both task-specific GeoAI models and general-purpose foundation models may suffer from various types of geo-bias. This framework will not only advance the technical understanding of geographic bias but will also establish a foundation for integrating spatial fairness into the design, deployment, and evaluation of AI systems.

CV · Mar 23, 2025
LocDiff: Identifying Locations on Earth by Diffusing in the Hilbert Space

Zhangyu Wang, Zeping Liu, Jielu Zhang et al.

Image geolocalization is a fundamental yet challenging task, aiming at inferring the geolocation on Earth where an image is taken. State-of-the-art methods employ either grid-based classification or gallery-based image-location retrieval, whose spatial generalizability significantly suffers if the spatial distribution of test images does not align with the choices of grids and galleries. Recently emerging generative approaches, while getting rid of grids and galleries, use raw geographical coordinates and suffer quality losses due to their lack of multi-scale information. To address these limitations, we propose a multi-scale latent diffusion model called LocDiff for image geolocalization. We developed a novel positional encoding-decoding framework called Spherical Harmonics Dirac Delta (SHDD) Representations, which encodes points on a spherical surface (e.g., geolocations on Earth) into a Hilbert space of Spherical Harmonics coefficients and decodes points (geolocations) by mode-seeking on spherical probability distributions. We also propose a novel SirenNet-based architecture (CS-UNet) to learn an image-based conditional backward process in the latent SHDD space by minimizing a latent KL-divergence loss. To the best of our knowledge, LocDiff is the first image geolocalization model that performs latent diffusion in a multi-scale location encoding space and generates geolocations under the guidance of images. Experimental results show that LocDiff can outperform all state-of-the-art grid-based, retrieval-based, and diffusion-based baselines across 5 challenging global-scale image geolocalization datasets, and demonstrates significantly stronger generalizability to unseen geolocations.

LG · Oct 14, 2025
Cautious Weight Decay

Lizhang Chen, Jonathan Li, Kaizhao Liang et al.

We introduce Cautious Weight Decay (CWD), a one-line, optimizer-agnostic modification that applies weight decay only to parameter coordinates whose signs align with the optimizer update. Unlike standard decoupled decay, which implicitly optimizes a regularized or constrained objective, CWD preserves the original loss and admits a bilevel interpretation: it induces sliding-mode behavior upon reaching the stationary manifold, allowing it to search for locally Pareto-optimal stationary points of the unmodified objective. In practice, CWD is a drop-in change for optimizers such as AdamW, Lion, and Muon, requiring no new hyperparameters or additional tuning. For language model pre-training and ImageNet classification, CWD consistently improves final loss and accuracy at million- to billion-parameter scales.
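A minimal sketch of the masking idea follows, assuming plain SGD as the base optimizer. The function name, the hyperparameter values, the sign convention (the step is `w - lr * update`), and the treatment of exact zeros are assumptions made for the example, not details confirmed by the abstract.

```python
import numpy as np

def sgd_step_with_cautious_decay(w, grad, lr=0.1, weight_decay=0.01):
    """One SGD step where decay is applied only on sign-aligned coordinates."""
    update = grad  # optimizer update direction; plain SGD is an assumption here
    # Coordinates where the update and the parameter share a sign, so the
    # optimizer step and the decay both move the parameter toward zero.
    mask = (update * w > 0).astype(w.dtype)
    return w - lr * (update + weight_decay * mask * w)

w = np.array([1.0, -1.0, 1.0, -1.0])
grad = np.array([0.5, -0.5, -0.5, 0.5])
w_next = sgd_step_with_cautious_decay(w, grad)
print(w_next)
```

In this example the first two coordinates are decayed because the update already shrinks them, while the last two receive the plain SGD step with no decay.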

CV · May 1, 2023
CSP: Self-Supervised Contrastive Spatial Pre-Training for Geospatial-Visual Representations

Gengchen Mai, Ni Lao, Yutong He et al.

Geo-tagged images are publicly available in large quantities, whereas labels such as object classes are rather scarce and expensive to collect. Meanwhile, contrastive learning has achieved tremendous success in various natural image and language tasks with limited labeled data. However, existing methods fail to fully leverage geospatial information, which can be paramount to distinguishing objects that are visually similar. To directly leverage the abundant geospatial information associated with images in pre-training, fine-tuning, and inference stages, we present Contrastive Spatial Pre-Training (CSP), a self-supervised learning framework for geo-tagged images. We use a dual-encoder to separately encode the images and their corresponding geo-locations, and use contrastive objectives to learn effective location representations from images, which can be transferred to downstream supervised tasks such as image classification. Experiments show that CSP can improve model performance on both iNat2018 and fMoW datasets. In particular, on iNat2018, CSP boosts model performance by 10-34% (relative) across various labeled training data sampling ratios.
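The dual-encoder contrastive setup can be sketched with a generic in-batch InfoNCE loss. This is a stand-in, not the specific objectives used in CSP: the embeddings below are hand-made rather than produced by real image and location encoders, and the temperature value is an assumption.

```python
import numpy as np

def info_nce(image_emb, loc_emb, temperature=0.1):
    """In-batch InfoNCE: row i of image_emb is paired with row i of loc_emb;
    every other row in the batch serves as a negative."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    loc = loc_emb / np.linalg.norm(loc_emb, axis=1, keepdims=True)
    logits = img @ loc.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Hand-made embeddings: aligned pairs versus deliberately mismatched pairs.
image_emb = np.eye(3)
aligned_loss = info_nce(image_emb, np.eye(3))
mismatched_loss = info_nce(image_emb, np.roll(np.eye(3), 1, axis=0))
print(aligned_loss, mismatched_loss)
```

The loss is low when each image embedding is closest to its own location embedding and high when the pairing is shuffled, which is the signal a contrastive pre-training stage would optimize.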

CV · Jan 25, 2022
Sphere2Vec: Multi-Scale Representation Learning over a Spherical Surface for Geospatial Predictions

Gengchen Mai, Yao Xuan, Wenyun Zuo et al.

Generating learning-friendly representations for points in a 2D space is a fundamental and long-standing problem in machine learning. Recently, multi-scale encoding schemes (such as Space2Vec) were proposed to directly encode any point in 2D space as a high-dimensional vector, and have been successfully applied to various (geo)spatial prediction tasks. However, a map projection distortion problem arises when applying location encoding models to large-scale real-world GPS coordinate datasets (e.g., species images taken all over the world) - all current location encoding models are designed for encoding points in a 2D (Euclidean) space but not on a spherical surface, e.g., the Earth's surface. To solve this problem, we propose a multi-scale location encoding model called Sphere2Vec which directly encodes point coordinates on a spherical surface while avoiding the map projection distortion problem. We provide a theoretical proof that the Sphere2Vec encoding preserves the spherical surface distance between any two points. We also developed a unified view of distance-preserving encoding on spheres based on the Double Fourier Sphere (DFS). We apply Sphere2Vec to the geo-aware image classification task. Our analysis shows that Sphere2Vec outperforms other 2D space location encoder models, especially in the polar regions and data-sparse areas, for image classification tasks because it preserves spherical surface distances.

AI · Dec 2, 2021
Narrative Cartography with Knowledge Graphs

Gengchen Mai, Weiming Huang, Ling Cai et al.

Narrative cartography is a discipline which studies the interwoven nature of stories and maps. However, conventional geovisualization techniques of narratives often encounter several prominent challenges, including the data acquisition & integration challenge and the semantic challenge. To tackle these challenges, in this paper, we propose the idea of narrative cartography with knowledge graphs (KGs). Firstly, to tackle the data acquisition & integration challenge, we develop a set of KG-based GeoEnrichment toolboxes to allow users to search and retrieve relevant data from integrated cross-domain knowledge graphs for narrative mapping from within a GISystem. With the help of this tool, the retrieved data from KGs are directly materialized in a GIS format which is ready for spatial analysis and mapping. Two use cases - Magellan's expedition and World War II - are presented to show the effectiveness of this approach. In the meantime, several limitations of this approach are identified, such as data incompleteness, semantic incompatibility, and the semantic challenge in geovisualization. For the latter two limitations, we propose a modular ontology for narrative cartography, which formalizes both the map content (Map Content Module) and the geovisualization process (Cartography Module). We demonstrate that, by representing both the map content and the geovisualization process in KGs (an ontology), we can realize both data reusability and map reproducibility for narrative cartography.

LG · Nov 7, 2021
A Review of Location Encoding for GeoAI: Methods and Applications

Gengchen Mai, Krzysztof Janowicz, Yingjie Hu et al.

A common need for artificial intelligence models in the broader geosciences is to represent and encode various types of spatial data, such as points (e.g., points of interest), polylines (e.g., trajectories), polygons (e.g., administrative regions), graphs (e.g., transportation networks), or rasters (e.g., remote sensing images), in a hidden embedding space so that they can be readily incorporated into deep learning models. One fundamental step is to encode a single point location into an embedding space, such that this embedding is learning-friendly for downstream machine learning models such as support vector machines and neural networks. We call this process location encoding. However, a systematic review of the concept of location encoding, its potential applications, and the key challenges that need to be addressed is lacking. This paper aims to fill this gap. We first provide a formal definition of location encoding, and discuss the necessity of location encoding for GeoAI research from a machine learning perspective. Next, we provide a comprehensive survey and discussion of the current landscape of location encoding research. We classify location encoding models into different categories based on their inputs and encoding methods, and compare them based on whether they are parametric, multi-scale, distance preserving, and direction aware. We demonstrate that existing location encoding models can be unified under a shared formulation framework. We also discuss the application of location encoding for different types of spatial data. Finally, we point out several challenges in location encoding research that need to be solved in the future.

CLMay 19, 2021
Geographic Question Answering: Challenges, Uniqueness, Classification, and Future Directions

Gengchen Mai, Krzysztof Janowicz, Rui Zhu et al.

As an important part of Artificial Intelligence (AI), Question Answering (QA) aims at generating answers to questions phrased in natural language. While there has been substantial progress in open-domain question answering, QA systems are still struggling to answer questions which involve geographic entities or concepts and that require spatial operations. In this paper, we discuss the problem of geographic question answering (GeoQA). We first investigate the reasons why geographic questions are difficult to answer by analyzing challenges of geographic questions. We discuss the uniqueness of geographic questions compared to general QA. Then we review existing work on GeoQA and classify them by the types of questions they can address. Based on this survey, we provide a generic classification framework for geographic questions. Finally, we conclude our work by pointing out unique future research directions for GeoQA.

CLDec 28, 2020
Pivot Through English: Reliably Answering Multilingual Questions without Document Retrieval

Ivan Montero, Shayne Longpre, Ni Lao et al.

Existing methods for open-retrieval question answering in lower resource languages (LRLs) lag significantly behind English. They not only suffer from the shortcomings of non-English document retrieval, but are reliant on language-specific supervision for either the task or translation. We formulate a task setup more realistic to available resources, that circumvents document retrieval to reliably transfer knowledge from English to lower resource languages. Assuming a strong English question answering model or database, we compare and analyze methods that pivot through English: to map foreign queries to English and then English answers back to target language answers. Within this task setup we propose Reranked Multilingual Maximal Inner Product Search (RM-MIPS), akin to semantic similarity retrieval over the English training set with reranking, which outperforms the strongest baselines by 2.7% on XQuAD and 6.2% on MKQA. Analysis demonstrates the particular efficacy of this strategy over state-of-the-art alternatives in challenging settings: low-resource languages, with extensive distractor data and query distribution misalignment. Circumventing retrieval, our analysis shows this approach offers rapid answer generation to almost any language off-the-shelf, without the need for any additional training data in the target language.

DBApr 25, 2020
SE-KGE: A Location-Aware Knowledge Graph Embedding Model for Geographic Question Answering and Spatial Semantic Lifting

Gengchen Mai, Krzysztof Janowicz, Ling Cai et al.

Learning knowledge graph (KG) embeddings is an emerging technique for a variety of downstream tasks such as summarization, link prediction, information retrieval, and question answering. However, most existing KG embedding models neglect space and, therefore, do not perform well when applied to (geo)spatial data and tasks. For those models that consider space, most of them primarily rely on some notions of distance. These models suffer from higher computational complexity during training while still losing information beyond the relative distance between entities. In this work, we propose a location-aware KG embedding model called SE-KGE. It directly encodes spatial information such as point coordinates or bounding boxes of geographic entities into the KG embedding space. The resulting model is capable of handling different types of spatial reasoning. We also construct a geographic knowledge graph as well as a set of geographic query-answer pairs called DBGeo to evaluate the performance of SE-KGE in comparison to multiple baselines. Evaluation results show that SE-KGE outperforms these baselines on the DBGeo dataset for geographic logic query answering task. This demonstrates the effectiveness of our spatially-explicit model and the importance of considering the scale of different geographic entities. Finally, we introduce a novel downstream task called spatial semantic lifting which links an arbitrary location in the study area to entities in the KG via some relations. Evaluation on DBGeo shows that our model outperforms the baseline by a substantial margin.

IRMar 14, 2020
Semantically-Enriched Search Engine for Geoportals: A Case Study with ArcGIS Online

Gengchen Mai, Krzysztof Janowicz, Sathya Prasad et al.

Many geoportals such as ArcGIS Online are established with the goal of improving geospatial data reusability and achieving intelligent knowledge discovery. However, according to previous research, most of the existing geoportals adopt Lucene-based techniques to achieve their core search functionality, which has a limited ability to capture the user's search intentions. To better understand a user's search intention, query expansion can be used to enrich the user's query by adding semantically similar terms. In the context of geoportals and geographic information retrieval, we advocate the idea of semantically enriching a user's query from both geospatial and thematic perspectives. In the geospatial aspect, we propose to enrich a query by using both place partonomy and distance decay. In terms of the thematic aspect, concept expansion and embedding-based document similarity are used to infer the implicit information hidden in a user's query. This semantic query expansion framework is implemented as a semantically-enriched search engine using ArcGIS Online as a case study. A benchmark dataset is constructed to evaluate the proposed framework. Our evaluation results show that the proposed semantic query expansion framework is very effective in capturing a user's search intention and significantly outperforms a well-established baseline (Lucene's practical scoring function) by more than 3.0 in DCG@K (K=3,5,10).
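
For readers unfamiliar with the DCG@K metric reported above, the following is a minimal sketch of how it is commonly computed. The abstract does not say which DCG variant the authors use, so the `rel / log2(rank + 1)` form, the function name `dcg_at_k`, and the relevance lists are all assumptions made for illustration.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results.

    relevances: graded relevance scores in ranked order (index 0 is the top hit).
    Assumes the common rel / log2(rank + 1) form with 1-based ranks.
    """
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


# Hypothetical relevance judgments for two rankings of the same query.
baseline_ranking = [1, 0, 2, 0, 3]
expanded_ranking = [3, 2, 1, 0, 0]

for k in (3, 5):
    print(k, dcg_at_k(baseline_ranking, k), dcg_at_k(expanded_ranking, k))
```

Because the discount grows with rank, a ranking that places highly relevant items first scores higher than one that holds the same items lower down.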

CVFeb 16, 2020
Multi-Scale Representation Learning for Spatial Feature Distributions using Grid Cells

Gengchen Mai, Krzysztof Janowicz, Bo Yan et al.

Unsupervised text encoding models have recently fueled substantial progress in NLP. The key idea is to use neural networks to convert words in texts to vector space representations based on word positions in a sentence and their contexts, which are suitable for end-to-end training of downstream tasks. We see a strikingly similar situation in spatial analysis, which focuses on incorporating both absolute positions and spatial contexts of geographic objects such as POIs into models. A general-purpose representation model for space is valuable for a multitude of tasks. However, no such general model exists to date beyond simply applying discretization or feed-forward nets to coordinates, and little effort has been put into jointly modeling distributions with vastly different characteristics, which commonly emerge from GIS data. Meanwhile, Nobel Prize-winning Neuroscience research shows that grid cells in mammals provide a multi-scale periodic representation that functions as a metric for location encoding and is critical for recognizing places and for path-integration. Therefore, we propose a representation learning model called Space2Vec to encode the absolute positions and spatial relationships of places. We conduct experiments on two real-world geographic datasets for two different tasks: 1) predicting types of POIs given their positions and context, 2) image classification leveraging their geo-locations. Results show that because of its multi-scale representations, Space2Vec outperforms well-established ML approaches such as RBF kernels, multi-layer feed-forward nets, and tile embedding approaches for location modeling and image classification tasks. Detailed analysis shows that each baseline can handle distributions well at only one scale and performs poorly at other scales. In contrast, Space2Vec's multi-scale representation can handle distributions at different scales.
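
To make the idea of a multi-scale periodic representation concrete, here is a minimal sketch that encodes a 2D point with sine and cosine features at geometrically spaced wavelengths. It is a simplified illustration rather than the Space2Vec model itself: the function name `multiscale_encode`, its parameters, and the default wavelengths are assumptions, and no learned layers are included.

```python
import math


def multiscale_encode(x, y, num_scales=4, min_wavelength=1.0, max_wavelength=1000.0):
    """Encode a 2D point as sin/cos features at several spatial scales.

    Wavelengths are spaced geometrically between min_wavelength and
    max_wavelength (hypothetical defaults), so short wavelengths capture
    fine detail and long wavelengths capture coarse position.
    """
    if num_scales > 1:
        ratio = (max_wavelength / min_wavelength) ** (1.0 / (num_scales - 1))
    else:
        ratio = 1.0

    features = []
    for scale in range(num_scales):
        wavelength = min_wavelength * ratio ** scale
        for coord in (x, y):
            angle = 2.0 * math.pi * coord / wavelength
            features.append(math.sin(angle))
            features.append(math.cos(angle))
    return features


# Two nearby points differ mostly in the short-wavelength features.
print(multiscale_encode(10.0, 20.0))
print(multiscale_encode(10.2, 20.0))
```

The output has `4 * num_scales` values per point. In a full model, a vector like this would typically be passed to a trainable network.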

LGSep 30, 2019
Contextual Graph Attention for Answering Logical Queries over Incomplete Knowledge Graphs

Gengchen Mai, Krzysztof Janowicz, Bo Yan et al.

Recently, several studies have explored methods for using KG embedding to answer logical queries. These approaches either treat embedding learning and query answering as two separate learning tasks, or fail to deal with the variability of contributions from different query paths. We propose to leverage a graph attention mechanism to handle the unequal contribution of different query paths. However, commonly used graph attention assumes that the center node embedding is provided, which is unavailable in this task since the center node is to be predicted. To solve this problem we propose a multi-head attention-based end-to-end logical query answering model, called the Contextual Graph Attention model (CGA), which uses an initial neighborhood aggregation layer to generate the center embedding, and the whole model is trained jointly on the original KG structure as well as the sampled query-answer pairs. We also introduce two new datasets, DB18 and WikiGeo19, which are rather large in size compared to the existing datasets and contain many more relation types, and use them to evaluate the performance of the proposed model. Our results show that the proposed CGA with fewer learnable parameters consistently outperforms the baseline models on both datasets as well as the Bio dataset.

CLSep 28, 2019
Integrated Triaging for Fast Reading Comprehension

Felix Wu, Boyi Li, Lequn Wang et al.

Although according to several benchmarks automatic machine reading comprehension (MRC) systems have recently reached super-human performance, less attention has been paid to their computational efficiency. However, efficiency is of crucial importance for training and deployment in real world applications. This paper introduces Integrated Triaging, a framework that prunes almost all context in early layers of a network, leaving the remaining (deep) layers to scan only a tiny fraction of the full corpus. This pruning drastically increases the efficiency of MRC models and further prevents the later layers from overfitting to prevalent short paragraphs in the training set. Our framework is extremely flexible and naturally applicable to a wide variety of models. Our experiment on doc-SQuAD and TriviaQA tasks demonstrates its effectiveness in consistently improving both speed and quality of several diverse MRC models.

AIOct 5, 2018
POIReviewQA: A Semantically Enriched POI Retrieval and Question Answering Dataset

Gengchen Mai, Krzysztof Janowicz, Cheng He et al.

Many services that perform information retrieval for Points of Interest (POI) utilize a Lucene-based setup with spatial filtering. While this type of system is easy to implement, it does not make use of semantics but relies on direct word matches between a query and reviews, leading to a loss in both precision and recall. To study the challenging task of semantically enriching POIs from unstructured data in order to support open-domain search and question answering (QA), we introduce a new dataset POIReviewQA. It consists of 20k questions (e.g., "is this restaurant dog friendly?") for 1022 Yelp business types. For each question we sampled 10 reviews, and annotated each sentence in the reviews with whether it answers the question and what the corresponding answer is. To test a system's ability to understand the text we adopt an information retrieval evaluation by ranking all the review sentences for a question based on the likelihood that they answer this question. We build a Lucene-based baseline model, which achieves 77.0% AUC and 48.8% MAP. A sentence embedding-based model achieves 79.2% AUC and 41.8% MAP, indicating that the dataset presents a challenging problem for future research by the GIR community. The resulting technology can help exploit the thematic content of web documents and social media for characterisation of locations.

CLNov 17, 2017
Learning to Organize Knowledge and Answer Questions with N-Gram Machines

Fan Yang, Jiazhong Nie, William W. Cohen et al.

Though deep neural networks have had great success in natural language processing, they are limited at more knowledge intensive AI tasks, such as open-domain Question Answering (QA). Existing end-to-end deep QA models need to process the entire text after observing the question, and therefore their complexity in responding to a question is linear in the text size. This is prohibitive for practical tasks such as QA from Wikipedia, a novel, or the Web. We propose to solve this scalability issue by using symbolic meaning representations, which can be indexed and retrieved efficiently with complexity that is independent of the text size. We apply our approach, called the N-Gram Machine (NGM), to three representative tasks. First as proof-of-concept, we demonstrate that NGM successfully solves the bAbI tasks of synthetic text. Second, we show that NGM scales to a large corpus by experimenting on "life-long bAbI", a special version of bAbI that contains millions of sentences. Lastly on the WikiMovies dataset, we use NGM to induce latent structure (i.e. schema) and answer questions from natural language Wikipedia text, with only QA pairs as weak supervision.

CLNov 12, 2017
Fast Reading Comprehension with ConvNets

Felix Wu, Ni Lao, John Blitzer et al.

State-of-the-art deep reading comprehension models are dominated by recurrent neural nets. Their sequential nature is a natural fit for language, but it also precludes parallelization within an instance and often becomes the bottleneck for deploying such models to latency critical scenarios. This is particularly problematic for longer texts. Here we present a convolutional architecture as an alternative to these recurrent architectures. Using simple dilated convolutional units in place of recurrent ones, we achieve results comparable to the state of the art on two question answering tasks, while at the same time achieving up to two orders of magnitude speedups for question answering.

CLDec 4, 2016
Neural Symbolic Machines: Learning Semantic Parsers on Freebase with Weak Supervision (Short Version)

Chen Liang, Jonathan Berant, Quoc Le et al.

Extending the success of deep neural networks to natural language understanding and symbolic reasoning requires complex operations and external memory. Recent neural program induction approaches have attempted to address this problem, but are typically limited to differentiable memory, and consequently cannot scale beyond small synthetic tasks. In this work, we propose the Manager-Programmer-Computer framework, which integrates neural networks with non-differentiable memory to support abstract, scalable and precise operations through a friendly neural computer interface. Specifically, we introduce a Neural Symbolic Machine, which contains a sequence-to-sequence neural "programmer", and a non-differentiable "computer" that is a Lisp interpreter with code assist. To successfully apply REINFORCE for training, we augment it with approximate gold programs found by an iterative maximum likelihood training process. NSM is able to learn a semantic parser from weak supervision over a large knowledge base. It achieves new state-of-the-art performance on WebQuestionsSP, a challenging semantic parsing dataset, with weak supervision. Compared to previous approaches, NSM is end-to-end, therefore does not rely on feature engineering or domain specific knowledge.

CLOct 31, 2016
Neural Symbolic Machines: Learning Semantic Parsers on Freebase with Weak Supervision

Chen Liang, Jonathan Berant, Quoc Le et al.

Harnessing the statistical power of neural networks to perform language understanding and symbolic reasoning is difficult when it requires executing efficient discrete operations against a large knowledge-base. In this work, we introduce a Neural Symbolic Machine, which contains (a) a neural "programmer", i.e., a sequence-to-sequence model that maps language utterances to programs and utilizes a key-variable memory to handle compositionality, and (b) a symbolic "computer", i.e., a Lisp interpreter that performs program execution, and helps find good programs by pruning the search space. We apply REINFORCE to directly optimize the task reward of this structured prediction problem. To train with weak supervision and improve the stability of REINFORCE, we augment it with an iterative maximum-likelihood training process. NSM outperforms the state-of-the-art on the WebQuestionsSP dataset when trained from question-answer pairs only, without requiring any feature engineering or domain-specific knowledge.

LGJun 28, 2014
Contrastive Feature Induction for Efficient Structure Learning of Conditional Random Fields

Ni Lao, Jun Zhu

Structure learning of Conditional Random Fields (CRFs) can be cast into an L1-regularized optimization problem. To avoid optimizing over a fully linked model, gain-based or gradient-based feature selection methods start from an empty model and incrementally add top ranked features to it. However, for high-dimensional problems like statistical relational learning, training time of these incremental methods can be dominated by the cost of evaluating the gain or gradient of a large collection of candidate features. In this study we propose a fast feature evaluation algorithm called Contrastive Feature Induction (CFI), which only evaluates a subset of features that involve both variables with high signals (deviation from mean) and variables with high errors (residue). We prove that the gradient of candidate features can be represented solely as a function of signals and errors, and that CFI is an efficient approximation of gradient-based evaluation methods. Experiments on synthetic and real data sets show competitive learning speed and accuracy of CFI on pairwise CRFs, compared to state-of-the-art structure learning methods such as full optimization over all features, and Grafting.

AIApr 12, 2014
Efficient Inference and Learning in a Large Knowledge Base: Reasoning with Extracted Information using a Locally Groundable First-Order Probabilistic Logic

William Yang Wang, Kathryn Mazaitis, Ni Lao et al.

One important challenge for probabilistic logics is reasoning with very large knowledge bases (KBs) of imperfect information, such as those produced by modern web-scale information extraction systems. One scalability problem shared by many probabilistic logics is that answering queries involves "grounding" the query (i.e., mapping it to a propositional representation), and the size of a "grounding" grows with database size. To address this bottleneck, we present a first-order probabilistic language called ProPPR in which approximate "local groundings" can be constructed in time independent of database size. Technically, ProPPR is an extension to stochastic logic programs (SLPs) that is biased towards short derivations; it is also closely related to an earlier relational learning algorithm called the path ranking algorithm (PRA). We show that the problem of constructing proofs for this logic is related to computation of personalized PageRank (PPR) on a linearized version of the proof space, and using this connection, we develop a provably correct approximate grounding scheme, based on the PageRank-Nibble algorithm. Building on this, we develop a fast and easily-parallelized weight-learning algorithm for ProPPR. In experiments, we show that learning for ProPPR is orders of magnitude faster than learning for Markov logic networks; that allowing mutual recursion (joint learning) in KB inference leads to improvements in performance; and that ProPPR can learn weights for a mutually recursive program with hundreds of clauses, which define scores of interrelated predicates, over a KB containing one million entities.
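
As background for the personalized PageRank (PPR) computation mentioned above, the sketch below approximates PPR on a small directed graph by plain power iteration. It is neither ProPPR's grounding procedure nor the PageRank-Nibble algorithm; the function name `personalized_pagerank`, the restart probability `alpha=0.15`, and the toy graph are assumptions chosen for illustration.

```python
def personalized_pagerank(graph, seed, alpha=0.15, iterations=50):
    """Approximate personalized PageRank by power iteration.

    graph: dict mapping each node to a list of out-neighbors
           (every node must appear as a key).
    seed:  node the random walk restarts to.
    alpha: restart probability (hypothetical default).
    """
    scores = {node: 0.0 for node in graph}
    scores[seed] = 1.0

    for _ in range(iterations):
        updated = {node: 0.0 for node in graph}
        # Total mass is 1, so a fraction alpha of it returns to the seed.
        updated[seed] += alpha
        for node, mass in scores.items():
            neighbors = graph[node]
            if neighbors:
                # Spread the remaining mass evenly over out-edges.
                share = (1.0 - alpha) * mass / len(neighbors)
                for neighbor in neighbors:
                    updated[neighbor] += share
            else:
                # Dangling node: send its remaining mass back to the seed.
                updated[seed] += (1.0 - alpha) * mass
        scores = updated
    return scores


# Toy graph; node "d" cannot be reached from the seed "a".
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a"]}
print(personalized_pagerank(toy_graph, "a"))
```

Scores concentrate on nodes close to the seed, which is one intuition for why a walk-based weighting favors short derivations. Nodes unreachable from the seed keep a score of zero.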