Peter Egger

CL · Jul 5, 2022

Keyword Extraction in Scientific Documents

Susie Xi Rao, Piriyakorn Piriyatamwong, Parijat Ghoshal et al. · eth-zurich

Scientific publication output is growing exponentially, which makes it increasingly challenging to keep track of trends and changes. Understanding scientific documents is an important step in downstream tasks such as knowledge graph building, text mining, and discipline classification. In this workshop, we provide a better understanding of keyword and keyphrase extraction from the abstracts of scientific publications.
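
As a rough illustration of what keyphrase extraction from an abstract involves, the sketch below implements a simple frequency baseline: it splits text at punctuation and stopwords, treats the remaining word runs as candidate phrases, and ranks them by count. This is not the method presented in the workshop; the stopword list, the `max_len` cutoff, and the function names are assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use larger lists).
STOPWORDS = {
    "the", "a", "an", "of", "and", "in", "to", "is", "it", "such", "as",
    "we", "from", "for", "on", "with", "this", "that", "are", "be",
}


def candidate_phrases(text, max_len=3):
    """Return runs of consecutive non-stopword tokens of at most max_len words."""
    phrases = []
    # Split on punctuation so phrases never span clause boundaries.
    for chunk in re.split(r"[.,;:!?()]", text.lower()):
        run = []
        # The trailing "" acts as a sentinel that closes the last run.
        for tok in re.findall(r"[a-z][a-z-]*", chunk) + [""]:
            if tok and tok not in STOPWORDS:
                run.append(tok)
                continue
            # A stopword (or the end of the chunk) closes the current run.
            if 0 < len(run) <= max_len:
                phrases.append(" ".join(run))
            run = []
    return phrases


def extract_keyphrases(text, top_k=5):
    """Rank candidate phrases by raw frequency and keep the top_k."""
    counts = Counter(candidate_phrases(text))
    return [phrase for phrase, _ in counts.most_common(top_k)]


sample = (
    "Keyword extraction is useful for text mining. "
    "Keyword extraction is a step in text mining and in knowledge graph building."
)
top_two = extract_keyphrases(sample, top_k=2)
```

Raw frequency is only a starting point; it ignores how common a phrase is across a whole corpus, which is why stronger approaches weight or learn phrase importance instead.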

CV · Mar 3, 2023

Building Floorspace in China: A Dataset and Learning Pipeline

Peter Egger, Susie Xi Rao, Sebastiano Papini

This paper provides a first milestone in measuring the floorspace of buildings (that is, building footprint and height) for 40 major Chinese cities. The intent is to maximize city coverage and, eventually, to provide longitudinal data. Doing so requires building on imagery of medium-fine granularity, as larger cross sections of cities and longer time series for them are only available in such a format. We use a multi-task object segmenter approach to learn the building footprint and height in the same framework in parallel: (1) we determine the surface area covered by buildings (the square footage of occupied land); (2) we determine floorspace from multi-image representations of buildings from various angles to determine the height of buildings. We use Sentinel-1 and -2 satellite images as our main data source. The benefits of these data are their large cross-sectional and longitudinal scope plus their unrestricted accessibility. We provide a detailed description of our data, algorithms, and evaluations. In addition, we analyze the quality of reference data and their role in measuring the building floorspace with minimal error. We conduct extensive quantitative and qualitative analyses with Shenzhen as a case study using our multi-task learner. Finally, we conduct correlation studies between our results (on both pixel and aggregated urban area levels) and nightlight data to gauge the merits of our approach in studying urban development. Our data and codebase are publicly accessible at https://gitlab.ethz.ch/raox/urban-satellite-public-v2.
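
To make the two quantities in the abstract concrete, the sketch below aggregates per-pixel predictions into a footprint area and a height-weighted volume. It is not the paper's pipeline: the function name, the input layout (a 0/1 building mask plus a height map), and the 10 m pixel size are assumptions for illustration. Converting volume into floorspace would additionally need a storey height, which the abstract does not specify, so that step is left out.

```python
def footprint_and_volume(mask, height_m, pixel_size_m=10.0):
    """Aggregate per-pixel predictions into footprint area (m^2) and volume (m^3).

    mask         : 2D list of 0/1 building labels (hypothetical segmenter output)
    height_m     : 2D list of predicted heights in metres, same shape as mask
    pixel_size_m : ground length of one pixel edge; 10 m is an assumption
    """
    pixel_area = pixel_size_m ** 2
    n_building = 0
    height_sum = 0.0
    for mask_row, height_row in zip(mask, height_m):
        for is_building, h in zip(mask_row, height_row):
            if is_building:
                n_building += 1
                height_sum += h
    footprint_m2 = n_building * pixel_area   # occupied land area
    volume_m3 = height_sum * pixel_area      # footprint weighted by height
    return footprint_m2, volume_m3


# Toy 2x3 tile: three building pixels with heights 12 m, 9 m, and 30 m.
mask = [[1, 1, 0], [0, 1, 0]]
height = [[12.0, 9.0, 0.0], [0.0, 30.0, 0.0]]
area, volume = footprint_and_volume(mask, height)
```

Heights on non-building pixels are ignored, so errors in the footprint mask propagate directly into both totals.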

CV · Jan 5, 2022

TableParser: Automatic Table Parsing with Weak Supervision from Spreadsheets

Susie Xi Rao, Johannes Rausch, Peter Egger et al.

Tables have long been used as a structure to store data. There now exist different approaches to storing tabular data physically; PDFs, images, spreadsheets, and CSVs are leading examples. Being able to parse table structures and extract the content bounded by these structures is of high importance in many applications. In this paper, we devise TableParser, a system capable of parsing tables in both native PDFs and scanned images with high precision. We have conducted extensive experiments to show the efficacy of domain adaptation in developing such a tool. Moreover, we create TableAnnotator and ExcelAnnotator, which constitute a spreadsheet-based weak supervision mechanism and a pipeline to enable table parsing. We share these resources with the research community to facilitate further research in this direction.
