Read dirty tab-separated CSV files

readr
read_fwf
tab
CSV
Author
Affiliations

Layal Christine Lettry

cynkra GmbH

University of Fribourg, Dept. of Informatics, ASAM Group

Published

January 15, 2024

What function could be used to read tab-separated CSV files with values containing a tab?

Template CSV

Let’s create a new tab-separated CSV. The second observation contains a tab instead of a space in the name variable. This will lead to a problem when reading the CSV file.

library(tidyverse)

data <- tibble(
  id = c(1, 2, 3),
  name = c("John Doe", "Jane\tSmith", "Bob Johnson"),
  age =  c(25, 30, 35)
)

file <- tempfile(fileext = ".csv")

write.table(data, file, sep = "\t", quote = FALSE, row.names = FALSE)

file_encoding <-
  guess_encoding(file) |>
  pull(encoding)

Reading problem

When you read your data, you will receive a warning message telling you to run vroom::problems() on your output. This shows how many columns were expected and actually read. You can see that only three columns were expected but four columns were actually detected.

Moreover, the family name of the second observation was merged to the age value, which leads to a NA when converting it to an integer.

This is due to the tab contained in the value of the name variable for the second observation.

You could also get the following error message (sorry, I could not reproduce this error that led me to write this post):

`Stopped early on line 2. Expected 3 fields but found 4 Consider fill=TRUE and comment.char=.`

read_data <- read_delim(file,
  delim = "\t",
  locale = readr::locale(encoding = file_encoding),
  col_types = cols(age = col_integer(), .default = col_character())
)

read_data
# A tibble: 3 × 3
  id    name          age
  <chr> <chr>       <int>
1 1     John Doe       25
2 2     Jane           NA
3 3     Bob Johnson    35
vroom::problems(read_data)
# A tibble: 2 × 5
    row   col expected   actual      file                                       
  <int> <int> <chr>      <chr>       <chr>                                      
1     3     3 an integer "Smith\t30" /private/var/folders/nt/m0krkrqx6jgcl_9km5…
2     3     4 3 columns  "4 columns" /private/var/folders/nt/m0krkrqx6jgcl_9km5…

Reading solution

One solution I found to override the error quoted above was to read the entire CSV file except for the first row with variable names, thanks to the function read_fwf(). All the information is then saved in a single column called X1.

raw_data <-
  read_fwf(
    file,
    col_types = cols(.default = col_character()),
    locale = locale(encoding = file_encoding),
    skip = 1L
  )

raw_data
# A tibble: 3 × 1
  X1                  
  <chr>               
1 "1\tJohn Doe\t25"   
2 "2\tJane\tSmith\t30"
3 "3\tBob Johnson\t35"

The next thing to do is to read the headers separately and to separate the columns thanks to separate_wider_delim(). The problematic value containing a tab will be merged with the following column value, leading to an NA when setting the integer type to this column.

headers <- strsplit(readLines(file, n = 1L), "\t")[[1]]
col_names <- gsub("\"", "", headers)

out <- raw_data |>
  separate_wider_delim(
    cols = X1,
    names = col_names,
    delim = "\t",
    too_many = "merge" 
  ) |>
  mutate(
    across(everything(), \(x) gsub('"', "", x))
  )

raw_dirty <- out |>
  mutate(
    id = as.integer(id),
    age = as.integer(age)
  )

raw_clean <- janitor::clean_names(raw_dirty)

raw_clean
# A tibble: 3 × 3
     id name          age
  <int> <chr>       <int>
1     1 John Doe       25
2     2 Jane           NA
3     3 Bob Johnson    35

Conclusion

I agree with you that the solution I found with read_fwf() and separate_wider_delim() is more complicated than the one with read_delim().

However, this solved my problem of partially reading the CSV file, although I could not reproduce it in this example, namely:

`Stopped early on line 2. Expected 3 fields but found 4 Consider fill=TRUE and comment.char=.`

Thanks to read_fwf(), I can read all the rows. Then, I separate the columns with separate_wider_delim() and merge the wrong additional columns values containing an extra tab together thanks to the option too_many = "merge".

Eventually, by converting my age variable to an integer type, the merged values are transformed into an NA. This does not bother me, since this row will certainly get dropped in my following calculations, as the original typed data were wrong for this observation.

Citation

BibTeX citation:
@online{lettry2024,
  author = {Lettry, Layal Christine},
  title = {Read Dirty Tab-Separated {CSV} Files},
  date = {2024-01-15},
  url = {https://rdiscovery.netlify.app/posts/2024-01-15_read-fwf/},
  langid = {en}
}
For attribution, please cite this work as:
Lettry, Layal Christine. 2024. “Read Dirty Tab-Separated CSV Files.” January 15, 2024. https://rdiscovery.netlify.app/posts/2024-01-15_read-fwf/.