TASHEEH: repairing row-structure in raw CSV files
Comma-separated value (CSV) files are a widespread format for data exchange, largely because their standard is flexible.
However, this flexibility and the plain-text format mean that such files
often have structural issues, such as unescaped quote characters within quoted fields, columns containing different value formats, and rows with different numbers of cells. We refer to rows
that contain such structural inconsistencies as ill-formed. Consequently, ingesting them into a host system, such as a database
or an analytics platform, often requires prior data preparation
steps.
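To make one of these structural issues concrete, the following minimal Python sketch flags rows whose cell count differs from the most common cell count in the file. It is an illustrative heuristic only, not the detection method used by Tasheeh; the function name `flag_ill_formed` and the sample data are hypothetical.

```python
import csv
import io
from collections import Counter


def flag_ill_formed(text, delimiter=","):
    """Return (expected_width, bad_rows) for a CSV string.

    expected_width is the most common number of cells per row, and
    bad_rows lists (row_no, row) for every non-empty row whose cell
    count differs from it. This is an assumed, simplistic notion of
    "ill-formed" that covers only the cell-count issue.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    # Blank lines are parsed as empty lists; ignore them.
    widths = Counter(len(row) for row in rows if row)
    expected = widths.most_common(1)[0][0]
    bad = [
        (row_no, row)
        for row_no, row in enumerate(rows, start=1)
        if row and len(row) != expected
    ]
    return expected, bad


# Hypothetical sample: rows 3 and 4 have too few and too many cells.
sample = "id,name,city\n1,Ann,Berlin\n2,Bob\n3,Cy,Doha,extra\n4,Dee,Oslo\n"
expected, bad = flag_ill_formed(sample)
```

Such a check catches only one class of problem. Unescaped quotes or mixed value formats within a column would need additional, content-aware checks.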
Traditionally, data scientists write custom code to clean ill-formed rows, even before they can use data cleaning tools and
libraries, which assume all data to be properly loaded. These
tasks are tedious and time-consuming, requiring expertise and
frequent human intervention. To automate this process, we propose Tasheeh, a system that automatically detects ill-formed
rows containing data and then standardizes their structure into
a uniform format based on the structure of well-formed rows.
Of 200,351 manually annotated rows from four different sources,
Tasheeh correctly detected 95.53% of data rows and
accurately generated transformations for 87.83% of them.…
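As a rough illustration of what standardizing rows toward the structure of well-formed rows can mean, the sketch below pads rows that have too few cells and sets aside rows that have too many. It is a deliberately naive stand-in under stated assumptions, not the transformation procedure of Tasheeh; the function name `pad_short_rows` and the padding strategy are hypothetical.

```python
def pad_short_rows(rows, expected, fill=""):
    """Pad rows shorter than `expected` cells with `fill`.

    Returns (fixed, unresolved): fixed holds rows that now have exactly
    `expected` cells, unresolved holds rows with too many cells, which
    this naive sketch does not attempt to repair.
    """
    fixed, unresolved = [], []
    for row in rows:
        if len(row) == expected:
            fixed.append(row)
        elif len(row) < expected:
            # Assumption: missing cells belong at the end of the row.
            fixed.append(row + [fill] * (expected - len(row)))
        else:
            unresolved.append(row)
    return fixed, unresolved


# Hypothetical parsed rows; three cells per row are expected.
rows = [["1", "Ann", "Berlin"], ["2", "Bob"], ["3", "Cy", "Doha", "extra"]]
fixed, unresolved = pad_short_rows(rows, expected=3)
```

Padding at the end is only one possible choice. Deciding where a missing cell belongs, or how to merge surplus cells, is exactly the kind of decision that requires looking at the well-formed rows.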