CL · Jul 5, 2022
Keyword Extraction in Scientific Documents
Susie Xi Rao, Piriyakorn Piriyatamwong, Parijat Ghoshal et al. · eth-zurich
Scientific publication output is growing exponentially, which makes it increasingly challenging to keep track of trends and changes. Understanding scientific documents is an important step in downstream tasks such as knowledge graph building, text mining, and discipline classification. In this workshop, we provide a better understanding of keyword and keyphrase extraction from the abstracts of scientific publications.
CV · Mar 3, 2023
Building Floorspace in China: A Dataset and Learning Pipeline
Peter Egger, Susie Xi Rao, Sebastiano Papini
This paper provides a first milestone in measuring the floorspace of buildings (that is, building footprint and height) for 40 major Chinese cities. The intent is to maximize city coverage and, eventually, to provide longitudinal data. Doing so requires building on imagery of medium-fine granularity, as larger cross sections of cities and longer time series for them are only available in that format. We use a multi-task object segmenter approach to learn the building footprint and height in parallel within the same framework: (1) we determine the surface area covered by buildings (the square footage of occupied land); (2) we determine the height of buildings, and hence floorspace, from multi-image representations of buildings taken from various angles. We use Sentinel-1 and Sentinel-2 satellite images as our main data source. The benefits of these data are their large cross-sectional and longitudinal scope and their unrestricted accessibility. We provide a detailed description of our data, algorithms, and evaluations. In addition, we analyze the quality of reference data and its role in measuring building floorspace with minimal error. We conduct extensive quantitative and qualitative analyses with Shenzhen as a case study using our multi-task learner. Finally, we conduct correlation studies between our results (at both the pixel level and the aggregated urban-area level) and nightlight data to gauge the merits of our approach in studying urban development. Our data and codebase are publicly accessible at https://gitlab.ethz.ch/raox/urban-satellite-public-v2.
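To make the relation between footprint, height, and floorspace in the abstract above concrete, here is a minimal sketch of how the two predicted outputs could be combined per pixel. It is not the authors' pipeline: the function names, the 10 m × 10 m pixel size (100 m² per pixel), and the 3 m storey height are all assumptions chosen for illustration.

```python
from typing import Sequence


def footprint_area_m2(
    footprint: Sequence[Sequence[int]],
    pixel_area_m2: float = 100.0,  # assumption: 10 m x 10 m pixels
) -> float:
    """Ground area covered by buildings: building pixels times pixel area."""
    building_pixels = sum(1 for row in footprint for cell in row if cell)
    return building_pixels * pixel_area_m2


def floorspace_m2(
    footprint: Sequence[Sequence[int]],
    height_m: Sequence[Sequence[float]],
    pixel_area_m2: float = 100.0,   # assumption: 10 m x 10 m pixels
    storey_height_m: float = 3.0,   # assumption: average storey height
) -> float:
    """Approximate floorspace as footprint area times number of storeys.

    `footprint` is a binary mask (1 = building) and `height_m` holds the
    predicted building height in metres for the same pixel grid.
    """
    total = 0.0
    for mask_row, height_row in zip(footprint, height_m):
        for is_building, height in zip(mask_row, height_row):
            if is_building:
                # Every building pixel counts for at least one storey.
                storeys = max(1, round(height / storey_height_m))
                total += pixel_area_m2 * storeys
    return total


# Hypothetical 2 x 2 tile: three building pixels of 9 m, 3 m, and 30 m height.
mask = [[1, 0], [1, 1]]
heights = [[9.0, 0.0], [3.0, 30.0]]
area = footprint_area_m2(mask)
floors = floorspace_m2(mask, heights)
```

Summing these per-pixel values over a city tile gives the kind of aggregated urban-area figure that could then be compared against nightlight intensity.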
CV · Jan 5, 2022
TableParser: Automatic Table Parsing with Weak Supervision from Spreadsheets
Susie Xi Rao, Johannes Rausch, Peter Egger et al.
Tables have long been a structure for storing data. There now exist different approaches to storing tabular data physically; PDFs, images, spreadsheets, and CSVs are leading examples. Being able to parse table structures and extract the content bounded by these structures is of high importance in many applications. In this paper, we devise TableParser, a system capable of parsing tables in both native PDFs and scanned images with high precision. We have conducted extensive experiments to show the efficacy of domain adaptation in developing such a tool. Moreover, we create TableAnnotator and ExcelAnnotator, which constitute a spreadsheet-based weak supervision mechanism and a pipeline to enable table parsing. We share these resources with the research community to facilitate further research in this direction.