TASHEEH: repairing row-structure in raw CSV files

Comma-separated value (CSV) files follow a useful and widespread format for data exchange due to their flexible standard. However, due to this flexibility and plain text format, such files often have structural issues, such as unescaped quote characters within quoted fields, columns containing different value formats, rows with different numbers of cells, etc. We refer to rows that contain such structural inconsistencies as ill-formed. Consequently, ingesting them into a host system, such as a database or an analytics platform, often requires prior data preparation steps. Traditionally, data scientists write custom code to clean ill-formed rows, even before they can use data cleaning tools and libraries, which assume all data to be properly loaded. These tasks are tedious and time-consuming, requiring expertise and frequent human intervention.
To automate this process, we propose Tasheeh, a system that automatically detects ill-formed rows containing data and then standardizes their structure into a uniform format based on the structure of well-formed rows. Of 200,351 manually annotated rows from four different sources, Tasheeh was able to correctly detect 95.53% of data rows and accurately generate transformations for 87.83% of them.
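To make one of the structural issues named above concrete (rows with different numbers of cells), the following is a minimal sketch that flags rows whose cell count deviates from the most common count in the file. It is not Tasheeh's detection method, which the abstract does not describe in detail; the function name `flag_ill_formed_rows` and the sample data are hypothetical and chosen only for illustration.

```python
import csv
import io
from collections import Counter


def flag_ill_formed_rows(text, delimiter=","):
    """Return (expected_cell_count, indices of rows that deviate from it).

    Assumption for this sketch: the most frequent cell count in the file
    is treated as the well-formed row width. Real files may need more care.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    # Blank lines parse to empty lists; ignore them when estimating the width.
    widths = Counter(len(row) for row in rows if row)
    if not widths:
        return 0, []
    expected = widths.most_common(1)[0][0]
    suspects = [i for i, row in enumerate(rows) if row and len(row) != expected]
    return expected, suspects


# Hypothetical sample: row 2 has an extra cell, row 3 is missing one.
sample = (
    "id,name,city\n"
    "1,Ada,London\n"
    "2,Grace,New York,USA\n"
    "3,Linus\n"
    "4,Edsger,Austin\n"
)
expected, suspects = flag_ill_formed_rows(sample)
print(expected, suspects)
```

A check like this only catches cell-count mismatches. The other issues listed in the abstract, such as unescaped quotes inside quoted fields or mixed value formats within a column, would need separate handling.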

Metadata
Author: Mazhar Hameed, Gerardo Vitagliano, Fabian Panse, Felix Naumann
URN: urn:nbn:de:bvb:384-opus4-1172476
Frontdoor URL: https://opus.bibliothek.uni-augsburg.de/opus4/117247
ISBN: 978-3-89318-095-0
ISSN: 2367-2005
Parent Title (English): Proceedings 27th International Conference on Extending Database Technology (EDBT 2024), March 25-28, 2024, Paestum, Italy
Publisher: OpenProceedings
Place of publication: Konstanz
Type: Conference Proceeding
Language: English
Year of first Publication: 2024
Publishing Institution: Universität Augsburg
Release Date: 2024/12/03
First Page: 426
Last Page: 439
Series: Advances in Database Technology ; 27-3
DOI: https://doi.org/10.48786/edbt.2024.37
Institutes: Fakultät für Angewandte Informatik
Fakultät für Angewandte Informatik / Institut für Informatik
Fakultät für Angewandte Informatik / Institut für Informatik / Lehrstuhl für Data Engineering
Dewey Decimal Classification: 0 Computer science, information science, general works / 00 Computer science, knowledge, systems / 004 Data processing; computer science
Licence (German): CC-BY-NC-ND 4.0: Creative Commons: Attribution - NonCommercial - NoDerivatives (with Print on Demand)