DB AI CY Jul 23, 2020

Graph integration of structured, semistructured and unstructured data for data journalism

Oana Balalau, Catarina Conceição, Helena Galhardas, Ioana Manolescu, Tayeb Merabti, Jingmao You, Youssr Youssef

arXiv:2007.12488v25.150 citations

Originality: Incremental advance

AI Analysis

The work addresses the challenge that journalists and other non-expert users face in making sense of dynamic, heterogeneous data without building custom IT workflows.

The paper tackles the problem of integrating heterogeneous data sources (structured, semi-structured, and unstructured) for data journalism by proposing a complete approach implemented in the ConnectionLens system, validated through experiments.

Nowadays, journalism is facilitated by the existence of large amounts of digital data sources, including many Open Data ones. Such data sources are extremely heterogeneous, ranging from highly structured (relational databases), semi-structured (JSON, XML, HTML), graphs (e.g., RDF), and text. Journalists (and other classes of users lacking advanced IT expertise, such as most non-governmental organizations, or small public administrations) need to be able to make sense of such heterogeneous corpora, even if they lack the ability to define and deploy custom extract-transform-load workflows. These are difficult to set up not only for arbitrary heterogeneous inputs, but also given that users may want to add (or remove) datasets to (from) the corpus. We describe a complete approach for integrating dynamic sets of heterogeneous data sources along the lines described above: the challenges we faced to make such graphs useful, allow their integration to scale, and the solutions we proposed for these problems. Our approach is implemented within the ConnectionLens system; we validate it through a set of experiments.
