Jia Zou

DB
h-index9
13papers
152citations
Novelty45%
AI Score45

13 Papers

AIOct 27, 2023Code
FormalGeo: An Extensible Formalized Framework for Olympiad Geometric Problem Solving

Xiaokai Zhang, Na Zhu, Yiming He et al.

This is the first paper in a series of work we have accomplished over the past three years. In this paper, we have constructed a consistent formal plane geometry system. This will serve as a crucial bridge between IMO-level plane geometry challenges and readable AI automated reasoning. Within this formal framework, we have been able to seamlessly integrate modern AI models with our formal system. AI is now capable of providing deductive reasoning solutions to IMO-level plane geometry problems, just like handling other natural languages, and these proofs are readable, traceable, and verifiable. We propose the geometry formalization theory (GFT) to guide the development of the geometry formal system. Based on the GFT, we have established the FormalGeo, which consists of 88 geometric predicates and 196 theorems. It can represent, validate, and solve IMO-level geometry problems. we also have crafted the FGPS (formal geometry problem solver) in Python. It serves as both an interactive assistant for verifying problem-solving processes and an automated problem solver. We've annotated the formalgeo7k and formalgeo-imo datasets. The former contains 6,981 (expand to 133,818 through data augmentation) geometry problems, while the latter includes 18 (expand to 2,627 and continuously increasing) IMO-level challenging geometry problems. All annotated problems include detailed formal language descriptions and solutions. Implementation of the formal system and experiments validate the correctness and utility of the GFT. The backward depth-first search method only yields a 2.42% problem-solving failure rate, and we can incorporate deep learning techniques to achieve lower one. The source code of FGPS and datasets are available at https://github.com/BitSecret/FGPS.

CVAug 3, 2024
IDNet: A Novel Dataset for Identity Document Analysis and Fraud Detection

Hong Guan, Yancheng Wang, Lulu Xie et al.

Effective fraud detection and analysis of government-issued identity documents, such as passports, driver's licenses, and identity cards, are essential in thwarting identity theft and bolstering security on online platforms. The training of accurate fraud detection and analysis tools depends on the availability of extensive identity document datasets. However, current publicly available benchmark datasets for identity document analysis, including MIDV-500, MIDV-2020, and FMIDV, fall short in several respects: they offer a limited number of samples, cover insufficient varieties of fraud patterns, and seldom include alterations in critical personal identifying fields like portrait images, limiting their utility in training models capable of detecting realistic frauds while preserving privacy. In response to these shortcomings, our research introduces a new benchmark dataset, IDNet, designed to advance privacy-preserving fraud detection efforts. The IDNet dataset comprises 837,060 images of synthetically generated identity documents, totaling approximately 490 gigabytes, categorized into 20 types from $10$ U.S. states and 10 European countries. We evaluate the utility and present use cases of the dataset, illustrating how it can aid in training privacy-preserving fraud detection methods, facilitating the generation of camera and video capturing of identity documents, and testing schema unification and other identity document management functionalities.

DBFeb 9, 2023
A Comparison of Decision Forest Inference Platforms from A Database Perspective

Hong Guan, Mahidhar Reddy Dwarampudi, Venkatesh Gunda et al.

Decision forest, including RandomForest, XGBoost, and LightGBM, is one of the most popular machine learning techniques used in many industrial scenarios, such as credit card fraud detection, ranking, and business intelligence. Because the inference process is usually performance-critical, a number of frameworks were developed and dedicated for decision forest inference, such as ONNX, TreeLite from Amazon, TensorFlow Decision Forest from Google, HummingBird from Microsoft, Nvidia FIL, and lleaves. However, these frameworks are all decoupled with data management frameworks. It is unclear whether in-database inference will improve the overall performance. In addition, these frameworks used different algorithms, optimization techniques, and parallelism models. It is unclear how these implementations will affect the overall performance and how to make design decisions for an in-database inference framework. In this work, we investigated the above questions by comprehensively comparing the end-to-end performance of the aforementioned inference frameworks and netsDB, an in-database inference framework we implemented. Through this study, we identified that netsDB is best suited for handling small-scale models on large-scale datasets and all-scale models on small-scale datasets, for which it achieved up to hundreds of times of speedup. In addition, the relation-centric representation we proposed significantly improved netsDB's performance in handling large-scale models, while the model reuse optimization we proposed further improved netsDB's performance in handling small-scale datasets.

AIFeb 14, 2024Code
FGeo-DRL: Deductive Reasoning for Geometric Problems through Deep Reinforcement Learning

Jia Zou, Xiaokai Zhang, Yiming He et al.

The human-like automatic deductive reasoning has always been one of the most challenging open problems in the interdiscipline of mathematics and artificial intelligence. This paper is the third in a series of our works. We built a neural-symbolic system, called FGeoDRL, to automatically perform human-like geometric deductive reasoning. The neural part is an AI agent based on reinforcement learning, capable of autonomously learning problem-solving methods from the feedback of a formalized environment, without the need for human supervision. It leverages a pre-trained natural language model to establish a policy network for theorem selection and employ Monte Carlo Tree Search for heuristic exploration. The symbolic part is a reinforcement learning environment based on geometry formalization theory and FormalGeo, which models GPS as a Markov Decision Process. In this formal symbolic system, the known conditions and objectives of the problem form the state space, while the set of theorems forms the action space. Leveraging FGeoDRL, we have achieved readable and verifiable automated solutions to geometric problems. Experiments conducted on the formalgeo7k dataset have achieved a problem-solving success rate of 86.40%. The project is available at https://github.com/PersonNoName/FGeoDRL.

DBOct 7, 2023
Serving Deep Learning Model in Relational Databases

Lixi Zhou, Qi Lin, Kanchan Chowdhury et al.

Serving deep learning (DL) models on relational data has become a critical requirement across diverse commercial and scientific domains, sparking growing interest recently. In this visionary paper, we embark on a comprehensive exploration of representative architectures to address the requirement. We highlight three pivotal paradigms: The state-of-the-art DL-centric architecture offloads DL computations to dedicated DL frameworks. The potential UDF-centric architecture encapsulates one or more tensor computations into User Defined Functions (UDFs) within the relational database management system (RDBMS). The potential relation-centric architecture aims to represent a large-scale tensor computation through relational operators. While each of these architectures demonstrates promise in specific use scenarios, we identify urgent requirements for seamless integration of these architectures and the middle ground in-between these architectures. We delve into the gaps that impede the integration and explore innovative strategies to close them. We present a pathway to establish a novel RDBMS for enabling a broad class of data-intensive DL inference applications.

LGApr 16
PAWN: Piece Value Analysis with Neural Networks

Ethan Tang, Hasan Davulcu, Jia Zou et al.

Predicting the relative value of any given chess piece in a position remains an open challenge, as a piece's contribution depends on its spatial relationships with every other piece on the board. We demonstrate that incorporating the state of the full chess board via latent position representations derived using a CNN-based autoencoder significantly improves accuracy for MLP-based piece value prediction architectures. Using a dataset of over 12 million piece-value pairs gathered from Grandmaster-level games, with ground-truth labels generated by Stockfish 17, our enhanced piece value predictor significantly outperforms context-independent MLP-based systems, reducing validation mean absolute error by 16% and predicting relative piece value within approximately 0.65 pawns. More generally, our findings suggest that encoding the full problem state as context provides useful inductive bias for predicting the contribution of any individual component.

DBNov 25, 2025Code
InferF: Declarative Factorization of AI/ML Inferences over Joins

Kanchan Chowdhury, Lixi Zhou, Lulu Xie et al.

Real-world AI/ML workflows often apply inference computations to feature vectors joined from multiple datasets. To avoid the redundant AI/ML computations caused by repeated data records in the join's output, factorized ML has been proposed to decompose ML computations into sub-computations to be executed on each normalized dataset. However, there is insufficient discussion on how factorized ML could impact AI/ML inference over multi-way joins. To address the limitations, we propose a novel declarative InferF system, focusing on the factorization of arbitrary inference workflows represented as analyzable expressions over the multi-way joins. We formalize our problem to flexibly push down partial factorized computations to qualified nodes in the join tree to minimize the overall inference computation and join costs and propose two algorithms to resolve the problem: (1) a greedy algorithm based on a per-node cost function that estimates the influence on overall latency if a subset of factorized computations is pushed to a node, and (2) a genetic algorithm for iteratively enumerating and evaluating promising factorization plans. We implement InferF on Velox, an open-sourced database engine from Meta, evaluate it on real-world datasets, observed up to 11.3x speedups, and systematically summarized the factors that determine when factorized ML can benefit AI/ML inference workflows.

AIFeb 14, 2024
FGeo-TP: A Language Model-Enhanced Solver for Geometry Problems

Yiming He, Jia Zou, Xiaokai Zhang et al.

The application of contemporary artificial intelligence techniques to address geometric problems and automated deductive proof has always been a grand challenge to the interdiscipline field of mathematics and artificial Intelligence. This is the fourth article in a series of our works, in our previous work, we established of a geometric formalized system known as FormalGeo. Moreover we annotated approximately 7000 geometric problems, forming the FormalGeo7k dataset. Despite the FGPS (Formal Geometry Problem Solver) can achieve interpretable algebraic equation solving and human-like deductive reasoning, it often experiences timeouts due to the complexity of the search strategy. In this paper, we introduced FGeo-TP (Theorem Predictor), which utilizes the language model to predict theorem sequences for solving geometry problems. We compared the effectiveness of various Transformer architectures, such as BART or T5, in theorem prediction, implementing pruning in the search process of FGPS, thereby improving its performance in solving geometry problems. Our results demonstrate a significant increase in the problem-solving rate of the language model-enhanced FGeo-TP on the FormalGeo7k dataset, rising from 39.7% to 80.86%. Furthermore, FGeo-TP exhibits notable reductions in solving time and search steps across problems of varying difficulty levels.

DBJan 22, 2024
Declarative Privacy-Preserving Inference Queries

Hong Guan, Ansh Tiwari, Summer Gautier et al.

Detecting inference queries running over personal attributes and protecting such queries from leaking individual information requires tremendous effort from practitioners. To tackle this problem, we propose an end-to-end workflow for automating privacy-preserving inference queries including the detection of subqueries that involve AI/ML model inferences on sensitive attributes. Our proposed novel declarative privacy-preserving workflow allows users to specify "what private information to protect" rather than "how to protect". Under the hood, the system automatically chooses privacy-preserving plans and hyper-parameters.

DBOct 15, 2020
Survive the Schema Changes: Integration of Unmanaged Data Using Deep Learning

Zijie Wang, Lixi Zhou, Amitabh Das et al.

Data is the king in the age of AI. However data integration is often a laborious task that is hard to automate. Schema change is one significant obstacle to the automation of the end-to-end data integration process. Although there exist mechanisms such as query discovery and schema modification language to handle the problem, these approaches can only work with the assumption that the schema is maintained by a database. However, we observe diversified schema changes in heterogeneous data and open data, most of which has no schema defined. In this work, we propose to use deep learning to automatically deal with schema changes through a super cell representation and automatic injection of perturbations to the training data to make the model robust to schema changes. Our experimental results demonstrate that our proposed approach is effective for two real-world data integration scenarios: coronavirus data integration, and machine log integration.

IROct 13, 2020
It's the Best Only When It Fits You Most: Finding Related Models for Serving Based on Dynamic Locality Sensitive Hashing

Lixi Zhou, Zijie Wang, Amitabh Das et al.

In recent, deep learning has become the most popular direction in machine learning and artificial intelligence. However, preparation of training data is often a bottleneck in the lifecycle of deploying a deep learning model for production or research. Reusing models for inferencing a dataset can greatly save the human costs required for training data creation. Although there exist a number of model sharing platform such as TensorFlow Hub, PyTorch Hub, DLHub, most of these systems require model uploaders to manually specify the details of each model and model downloaders to screen keyword search results for selecting a model. They are in lack of an automatic model searching tool. This paper proposes an end-to-end process of searching related models for serving based on the similarity of the target dataset and the training datasets of the available models. While there exist many similarity measurements, we study how to efficiently apply these metrics without pair-wise comparison and compare the effectiveness of these metrics. We find that our proposed adaptivity measurement which is based on Jensen-Shannon (JS) divergence, is an effective measurement, and its computation can be significantly accelerated by using the technique of locality sensitive hashing.

DBSep 1, 2020
Tensor Relational Algebra for Machine Learning System Design

Binhang Yuan, Dimitrije Jankov, Jia Zou et al.

We consider the question: what is the abstraction that should be implemented by the computational engine of a machine learning system? Current machine learning systems typically push whole tensors through a series of compute kernels such as matrix multiplications or activation functions, where each kernel runs on an AI accelerator (ASIC) such as a GPU. This implementation abstraction provides little built-in support for ML systems to scale past a single machine, or for handling large models with matrices or tensors that do not easily fit into the RAM of an ASIC. In this paper, we present an alternative implementation abstraction called the tensor relational algebra (TRA). The TRA is a set-based algebra based on the relational algebra. Expressions in the TRA operate over binary tensor relations, where keys are multi-dimensional arrays and values are tensors. The TRA is easily executed with high efficiency in a parallel or distributed environment, and amenable to automatic optimization. Our empirical study shows that the optimized TRA-based back-end can significantly outperform alternatives for running ML workflows in distributed clusters.

DBApr 25, 2019
Declarative Recursive Computation on an RDBMS, or, Why You Should Use a Database For Distributed Machine Learning

Dimitrije Jankov, Shangyu Luo, Binhang Yuan et al.

A number of popular systems, most notably Google's TensorFlow, have been implemented from the ground up to support machine learning tasks. We consider how to make a very small set of changes to a modern relational database management system (RDBMS) to make it suitable for distributed learning computations. Changes include adding better support for recursion, and optimization and execution of very large compute plans. We also show that there are key advantages to using an RDBMS as a machine learning platform. In particular, learning based on a database management system allows for trivial scaling to large data sets and especially large models, where different computational units operate on different parts of a model that may be too large to fit into RAM.