DB, AI | Sep 18, 2025

A Case for Computing on Unstructured Data

Mushtari Sadia, Amrita Roy Chowdhury, Ang Chen

arXiv:2509.14601v1 | 3.33 citations | h-index: 10

Originality: Incremental advance

AI Analysis

The paper addresses a foundational issue for data systems and AI, since unstructured data makes up most of the world's information. Its contribution is incremental, however: it proposes a specific framework rather than a breakthrough.

The paper tackles computing on unstructured data such as text and images, which traditional data systems support poorly. It proposes a new paradigm built on a bi-directional pipeline of extraction, transformation, and projection, which brings the analytical benefits of structured computation to unstructured data while preserving its richness.

Unstructured data, such as text, images, audio, and video, comprises the vast majority of the world's information, yet it remains poorly supported by traditional data systems that rely on structured formats for computation. We argue for a new paradigm, which we call computing on unstructured data, built around three stages: extraction of latent structure, transformation of this structure through data processing techniques, and projection back into unstructured formats. This bi-directional pipeline allows unstructured data to benefit from the analytical power of structured computation, while preserving the richness and accessibility of unstructured representations for human and AI consumption. We illustrate this paradigm through two use cases and present the research components that need to be developed in a new data system called MXFlow.
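The three stages named in the abstract can be illustrated with a toy sketch. This is not MXFlow's API or the authors' method: the function names `extract`, `transform`, `project`, and `pipeline`, the regular expression, and the sample text are all hypothetical, chosen only to show the shape of an extract, transform, project round trip on plain text.

```python
import re
from collections import defaultdict

# Hypothetical pattern for the toy sentences below; a real system would
# need a far more robust extractor (the paper does not prescribe one).
PAYMENT = re.compile(r"(?P<name>[A-Z][a-z]+) paid (?P<amount>\d+) dollars")


def extract(text):
    """Extraction: recover latent structure (rows) from raw text."""
    return [(m["name"], int(m["amount"])) for m in PAYMENT.finditer(text)]


def transform(rows):
    """Transformation: ordinary structured computation, here a group-by sum."""
    totals = defaultdict(int)
    for name, amount in rows:
        totals[name] += amount
    # Sort by name so the projected text is deterministic.
    return dict(sorted(totals.items()))


def project(totals):
    """Projection: render the structured result back into prose."""
    return " ".join(
        f"{name} paid {amount} dollars in total."
        for name, amount in totals.items()
    )


def pipeline(text):
    """Run the three stages end to end: text in, text out."""
    return project(transform(extract(text)))


notes = (
    "Alice paid 30 dollars for lunch. Bob paid 12 dollars. "
    "Later Alice paid 5 dollars."
)
summary = pipeline(notes)
print(summary)
```

The input and output are both unstructured text, while the aggregation in the middle runs on structured rows. That is the sense in which the pipeline is bi-directional.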
