Pontificia Universidad Católica de Chile
Arenas M., Maturana F., Riveros C. and Vrgoc D. (2016)

A framework for annotating CSV-like data

Journal : Proceedings of the VLDB Endowment
Volume : 9
Issue : 11
Pages : 876 - 887
Publication type : ISI

Abstract

In this paper, we propose a simple and expressive framework for adding metadata to CSV documents and their noisy variants. The framework is based on annotating parts of the document that can later be used to read, query, or exchange the data. The core of our framework is a language based on extended regular expressions that are used for selecting data. These expressions are then combined using a set of rules in order to annotate the data. We study the computational complexity of implementing our framework and present an efficient evaluation algorithm that runs in time proportional to its output and linear in its input. As a proof of concept, we test an implementation of our framework against a large number of real-world datasets and show that it can be efficiently used in practice.
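
To make the idea of selecting spans with expressions and attaching labels through rules more concrete, here is a minimal sketch in Python. It is not the language or the evaluation algorithm defined in the paper: it uses Python's standard `re` module as a stand-in for the extended regular expressions, and the rule format, the `annotate` function, and the sample document are all assumptions made for illustration only.

```python
import re

# Hypothetical rule format (not the paper's syntax): each rule pairs an
# annotation label with a compiled pattern whose group 1 selects the span.
RULES = [
    # The whole first line of the document.
    ("header", re.compile(r"\A([^\n]*)")),
    # Any field made of exactly four digits.
    ("year", re.compile(r"(?:^|,)(\d{4})(?=,|$)", re.MULTILINE)),
]


def annotate(document, rules):
    """Return (label, start, end) for every span selected by a rule."""
    spans = []
    for label, pattern in rules:
        for match in pattern.finditer(document):
            spans.append((label, match.start(1), match.end(1)))
    return spans


# Illustrative sample data, invented for this sketch.
doc = "name,year\nAda,1815\nAlan,1912\n"

for label, start, end in annotate(doc, RULES):
    print(label, (start, end), repr(doc[start:end]))
```

Each annotation is stored as a label plus character offsets rather than as copied text, so the original document stays untouched and the annotated parts can be read or queried later. This naive version rescans the document once per rule, so it does not reflect the output-proportional, input-linear running time stated in the abstract.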