Pytheas: Pattern-based Table Discovery in CSV Files
CSV is a popular Open Data format widely used in a variety of domains for its simplicity and effectiveness in storing and disseminating data. Unfortunately, data published in this format often does not conform to strict specifications, making automated data extraction from CSV files a painful task. While table discovery from HTML pages or spreadsheets has been studied extensively, extracting tables from CSV files still poses a considerable challenge due to their loosely defined format and limited embedded metadata. In this work we lay out the challenges of discovering tables in CSV files, and propose Pytheas: a principled method for automatically classifying lines in a CSV file and discovering tables within it based on the intuition that tables maintain a coherency of values in each column. We evaluate our methods over two manually annotated data sets: 2000 CSV files sampled from four Canadian Open Data portals, and 2500 additional files sampled from Canadian, US, UK and Australian portals. Our comparison to state-of-the-art approaches shows that Pytheas is able to successfully discover tables with precision and recall of over 95.9% and 95.7% respectively, while current approaches achieve around 89.6% precision and 81.3% recall. Furthermore, Pytheas’s accuracy for correctly classifying all lines per CSV file is 95.6%, versus a maximum of 86.9% for compared approaches. Pytheas generalizes well to new data, with a table discovery F-measure above 95% even when trained on Canadian data and applied to data from different countries. Finally, we introduce a confidence measure for table discovery and demonstrate its value for accurately identifying potential errors.
In this talk we will also cover recent follow-up work on metadata discovery and enrichment to support data integration.
About the Speaker
Christina Christodoulakis is a PhD candidate in the Department of Computer Science at the University of Toronto, advised by Professor Angela Demke Brown and Professor Moshe Gabel. She earned an MSc in Computer Science from the University of Toronto in 2015, working with Professor Renee J. Miller, and a Diploma in Electrical and Computer Engineering from the Technical University of Crete, Greece, in 2012, where she worked with Professor Antonis Deligiannakis.
Christina’s research interests lie at the intersection of data discovery across heterogeneous data sources, data curation, and collaborative data analytics. Her recent research has focused on developing a framework that supports data integration in Open Data portals. Her work has been published at VLDB, ICDE, BigData, IUI, DAS, CASCON, and WWW, and she has completed return internships at the IBM Almaden Research Center.