What function could be used to read tab-separated CSV files with values containing a tab?

Template CSV

Let’s create a new tab-separated CSV. The second observation contains a tab instead of a space in the name variable. This will lead to a problem when reading the CSV file.

library(tidyverse)

data <- tibble(
  id = c(1, 2, 3),
  name = c("John Doe", "Jane\tSmith", "Bob Johnson"),
  age = c(25, 30, 35)
)

file <- tempfile(fileext = ".csv")

write.table(data, file, sep = "\t", quote = FALSE, row.names = FALSE)

file_encoding <- guess_encoding(file) |>
  pull(encoding)
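To make the problem visible before reading anything, we can peek at the raw lines of the file with base R (a small check of my own; the printed \t is the embedded tab):

# Inspect the raw file written above.
readLines(file)
# Expected to print four strings along these lines:
# "id\tname\tage"  "1\tJohn Doe\t25"  "2\tJane\tSmith\t30"  "3\tBob Johnson\t35"
# The third line carries four tab-separated fields, while the header announces only three.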
Reading problem
When you read your data, you will receive a warning message telling you to run vroom::problems() on your output. This function shows how many columns were expected and how many were actually found: only three columns were expected, but four were detected.
Moreover, the family name of the second observation was merged with the age value, which leads to an NA when converting it to an integer. This is due to the tab contained in the value of the name variable for the second observation.
You could also get the following error message (sorry, I could not reproduce the error that originally led me to write this post):
`Stopped early on line 2. Expected 3 fields but found 4 Consider fill=TRUE and comment.char=.`
read_data <- read_delim(file,
  delim = "\t",
  locale = readr::locale(encoding = file_encoding),
  col_types = cols(age = col_integer(), .default = col_character())
)

read_data
# A tibble: 3 × 3
  id    name          age
  <chr> <chr>       <int>
1 1     John Doe       25
2 2     Jane           NA
3 3     Bob Johnson    35

vroom::problems(read_data)
# A tibble: 2 × 5
    row   col expected   actual      file
  <int> <int> <chr>      <chr>       <chr>
1     3     3 an integer "Smith\t30" /private/var/folders/nt/m0krkrqx6jgcl_9km5…
2     3     4 3 columns  "4 columns" /private/var/folders/nt/m0krkrqx6jgcl_9km5…
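If you only need to locate the damaged observations after such a read, filtering on the column that failed to convert is enough (a minimal sketch using the read_data object from above):

# Rows whose age could not be parsed are exactly the ones hit by the extra tab.
read_data |>
  filter(is.na(age))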
Reading solution
One solution I found to work around the error quoted above was to read the entire CSV file, except for the first row containing the variable names, with the function read_fwf(). All the information is then stored in a single column called X1.
# Read every line after the header into one big character column.
raw_data <- read_fwf(
  file,
  col_types = cols(.default = col_character()),
  locale = locale(encoding = file_encoding),
  skip = 1L
)
raw_data
# A tibble: 3 × 1
  X1
  <chr>
1 "1\tJohn Doe\t25"
2 "2\tJane\tSmith\t30"
3 "3\tBob Johnson\t35"
The next thing to do is to read the headers separately and to split the single column with separate_wider_delim(). The problematic name value containing a tab spills over, so its surplus piece gets merged into the following column (age), which then becomes an NA when that column is converted to the integer type.
# Grab the header row separately and strip any stray quotes.
headers <- strsplit(readLines(file, n = 1L), "\t")[[1]]
col_names <- gsub("\"", "", headers)

# Split X1 on tabs; surplus pieces are merged into the last column.
out <- raw_data |>
  separate_wider_delim(
    cols = X1,
    names = col_names,
    delim = "\t",
    too_many = "merge"
  ) |>
  mutate(
    across(everything(), \(x) gsub('"', "", x))
  )
raw_dirty <- out |>
  mutate(
    id = as.integer(id),
    age = as.integer(age)
  )

raw_clean <- janitor::clean_names(raw_dirty)
raw_clean
# A tibble: 3 × 3
     id name          age
  <int> <chr>       <int>
1     1 John Doe       25
2     2 Jane           NA
3     3 Bob Johnson    35
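As a quick sanity check, one can verify that every non-header line of the file survived this route (using the file and raw_clean objects from above):

# The file has one header line plus one line per observation.
stopifnot(nrow(raw_clean) == length(readLines(file)) - 1L)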
Conclusion
I agree with you that the solution I found with read_fwf() and separate_wider_delim() is more complicated than the one with read_delim().
However, it solved my problem of the CSV file being read only partially, although I could not reproduce that behaviour in this example, namely:
`Stopped early on line 2. Expected 3 fields but found 4 Consider fill=TRUE and comment.char=.`
Thanks to read_fwf(), I can read all the rows. Then, I separate the columns with separate_wider_delim() and, thanks to the option too_many = "merge", the surplus values created by the extra tab are merged back together instead of breaking the read.
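To isolate what too_many = "merge" does, here is the problematic row on its own (a toy illustration, outside the pipeline above):

tibble(x = "2\tJane\tSmith\t30") |>
  separate_wider_delim(x, delim = "\t", names = c("id", "name", "age"),
                       too_many = "merge")
# id = "2", name = "Jane", age = "Smith\t30": the surplus piece is folded into the last column.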
Finally, converting my age variable to the integer type turns those merged values into an NA. This does not bother me, since such a row will most likely be dropped in my subsequent calculations anyway, as the originally typed data for this observation were wrong.
Citation
@online{lettry2024,
author = {Lettry, Layal Christine},
title = {Read Dirty Tab-Separated {CSV} Files},
date = {2024-01-15},
url = {https://rdiscovery.netlify.app/posts/2024-01-15_read-fwf/},
langid = {en}
}