What are the use cases for using nested data over packed data?

Tags: nest, unnest, pack, unpack, tidyr, jsonlite

Author: Layal Christine Lettry

Affiliations: cynkra GmbH; University of Fribourg, Dept. of Informatics, ASAM Group

Published: June 3, 2024

When should you use packed data instead of nested data, and vice versa?

Introduction

In my last blog post, we saw how nested data differs from packed data.

In this article, I would like to explain when to use nested data and when to use packed data.

Nested data

In the following example, we will build a nested tibble that holds, for each island, the matching rows of the tibble palmerpenguins::penguins and of the raw data palmerpenguins::penguins_raw.

my_nested_tib <-
  palmerpenguins::penguins |>
  dplyr::distinct(island) |>
  dplyr::rename(my_island = island) |>
  dplyr::mutate(
    penguins_data = purrr::map(
      my_island, \(x) dplyr::filter(palmerpenguins::penguins, island == x)
    ),
    penguins_raw_data = purrr::map(
      my_island, \(x) dplyr::filter(palmerpenguins::penguins_raw, Island == x)
    )
  ) |>
  dplyr::select(-my_island)

dplyr::glimpse(my_nested_tib)
Rows: 3
Columns: 2
$ penguins_data     <list> [<tbl_df[52 x 8]>], [<tbl_df[168 x 8]>], [<tbl_df[1…
$ penguins_raw_data <list> [<tbl_df[52 x 17]>], [<tbl_df[168 x 17]>], [<tbl_df…

The tibble my_nested_tib is a nested tibble containing the variables penguins_data and penguins_raw_data, which are both list-columns holding 3 tibbles of different dimensions (one per island). We can handle these variables like any normal list.
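As a minimal sketch of what "like any normal list" means in practice, the example here uses a small hypothetical tibble (toy_flat, not the penguins data) and builds the list-column with tidyr::nest() instead of the purrr::map() approach used above. The names toy_flat, toy_nested, first_element and rows_per_element are only illustrative.

```r
# Hypothetical toy data: five rows in two groups
toy_flat <- tibble::tibble(
  group = c("a", "a", "a", "b", "b"),
  x = 1:5
)

# Nest column x into a list-column called data, one tibble per group
toy_nested <- tidyr::nest(toy_flat, data = x)

# A list-column is an ordinary list: extract one element with [[ ]]
first_element <- toy_nested$data[[1]]

# ... or iterate over all elements, here counting the rows of each tibble
rows_per_element <- purrr::map_int(toy_nested$data, nrow)
```

Any function that accepts a list, such as the purrr::map() family or base lapply(), can be applied to toy_nested$data in the same way.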

To flatten our data again, we will unnest the columns penguins_data and penguins_raw_data in a single call, which places the rows of both tibbles side by side.

my_unnested_tib <-
  my_nested_tib |>
  tidyr::unnest(
    cols = c(penguins_data, penguins_raw_data)
  )
dplyr::glimpse(my_unnested_tib)
Rows: 344
Columns: 25
$ species               <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, …
$ island                <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torg…
$ bill_length_mm        <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34…
$ bill_depth_mm         <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18…
$ flipper_length_mm     <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190,…
$ body_mass_g           <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 34…
$ sex                   <fct> male, female, female, NA, female, male, female, …
$ year                  <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, …
$ studyName             <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL…
$ `Sample Number`       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
$ Species               <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie P…
$ Region                <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers"…
$ Island                <chr> "Torgersen", "Torgersen", "Torgersen", "Torgerse…
$ Stage                 <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adu…
$ `Individual ID`       <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", …
$ `Clutch Completion`   <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", …
$ `Date Egg`            <date> 2007-11-11, 2007-11-11, 2007-11-16, 2007-11-16,…
$ `Culmen Length (mm)`  <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34…
$ `Culmen Depth (mm)`   <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18…
$ `Flipper Length (mm)` <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190,…
$ `Body Mass (g)`       <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 34…
$ Sex                   <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE"…
$ `Delta 15 N (o/oo)`   <dbl> NA, 8.94956, 8.36821, NA, 8.76651, 8.66496, 9.18…
$ `Delta 13 C (o/oo)`   <dbl> NA, -24.69454, -25.33302, NA, -25.32426, -25.298…
$ Comments              <chr> "Not enough blood for isotopes.", NA, NA, "Adult…
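Unnesting two list-columns in one call works here because, for each island, both tibbles have the same number of rows, so they can be bound side by side. The sketch that follows illustrates this on a hypothetical toy tibble (toy, left, right and flat are made-up names). It assumes the documented behaviour of tidyr::unnest(), where values from the same row are recycled to their common size.

```r
# Hypothetical toy tibble with two list-columns whose elements
# have matching row counts within each row (2 and 2, then 1 and 1)
toy <- tibble::tibble(
  id = c("a", "b"),
  left = list(tibble::tibble(x = 1:2), tibble::tibble(x = 3L)),
  right = list(tibble::tibble(y = c("p", "q")), tibble::tibble(y = "r"))
)

# Unnest both list-columns at once: their columns end up side by side
flat <- tidyr::unnest(toy, cols = c(left, right))
```

If the row counts within a row were incompatible, one would probably unnest the columns in separate steps instead, which yields a different (crossed) result.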

Packed data

Packing groups related columns together so that you can easily compare them. In contrast to my_unnested_tib, our tibble my_packed_data will be narrow, because similar columns are consolidated into a single data-frame column. You will find additional information about the packing process in this article.

my_packed_data <-
  my_unnested_tib |>
  tidyr::pack(
    all_species = c(species, Species),
    all_island = c(island, Island),
    all_bill_length_mm = c(bill_length_mm, `Culmen Length (mm)`),
    all_bill_depth_mm = c(bill_depth_mm, `Culmen Depth (mm)`),
    all_flipper_length_mm = c(flipper_length_mm, `Flipper Length (mm)`),
    all_body_mass_g = c(body_mass_g, `Body Mass (g)`),
    all_sex = c(sex, Sex),
    all_date = c(year, `Date Egg`),
    all_remaining_raw_data = c(
      studyName, `Sample Number`, Region, Stage, `Individual ID`,
      `Clutch Completion`, `Delta 15 N (o/oo)`, `Delta 13 C (o/oo)`,
      Comments
    )
  )
dplyr::glimpse(my_packed_data)
Rows: 344
Columns: 9
$ all_species            <tibble[,2]> <tbl_df[26 x 2]>
$ all_island             <tibble[,2]> <tbl_df[26 x 2]>
$ all_bill_length_mm     <tibble[,2]> <tbl_df[26 x 2]>
$ all_bill_depth_mm      <tibble[,2]> <tbl_df[26 x 2]>
$ all_flipper_length_mm  <tibble[,2]> <tbl_df[26 x 2]>
$ all_body_mass_g        <tibble[,2]> <tbl_df[26 x 2]>
$ all_sex                <tibble[,2]> <tbl_df[26 x 2]>
$ all_date               <tibble[,2]> <tbl_df[26 x 2]>
$ all_remaining_raw_data <tibble[,9]> <tbl_df[26 x 9]>

Now, my_packed_data is a tibble with 9 columns, compared with 25 for my_unnested_tib. The number of rows (344) is unchanged.
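The same effect can be seen on a much smaller scale. This is a minimal sketch with a hypothetical three-column tibble (toy_wide and toy_packed are illustrative names): packing two related columns reduces the column count while keeping the rows as they are.

```r
# Hypothetical toy tibble with two columns that measure the same thing
toy_wide <- tibble::tibble(
  mass_g = c(3750L, 3800L),
  `Body Mass (g)` = c(3750, 3800),
  sex = c("male", "female")
)

# Pack the two mass columns into one data-frame column called all_mass
toy_packed <- tidyr::pack(toy_wide, all_mass = c(mass_g, `Body Mass (g)`))

# Fewer top-level columns, same number of rows
ncol(toy_wide)
ncol(toy_packed)
nrow(toy_packed)
```

The packed column all_mass is itself a tibble with two columns, which is why it can be subset a second time.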

You can subset it using either the dollar sign $ or double square brackets [[ ]]. For example, you could write my_packed_data$all_body_mass_g$body_mass_g, my_packed_data$all_body_mass_g[["body_mass_g"]] or my_packed_data[["all_body_mass_g"]][["body_mass_g"]]. All three give the same result.

waldo::compare(
  my_packed_data$all_body_mass_g$body_mass_g,
  my_packed_data$all_body_mass_g[["body_mass_g"]]
)
✔ No differences
waldo::compare(
  my_packed_data$all_body_mass_g$body_mass_g,
  my_packed_data[["all_body_mass_g"]][["body_mass_g"]]
)
✔ No differences

Furthermore, it is organised so that you can easily analyse the differences between palmerpenguins::penguins and palmerpenguins::penguins_raw.

# Species from penguins_raw and species from penguins (old vs new)
waldo::compare(
  my_packed_data$all_species$Species,
  my_packed_data$all_species$species
)
`old` is a character vector ('Adelie Penguin (Pygoscelis adeliae)', 'Adelie Penguin (Pygoscelis adeliae)', 'Adelie Penguin (Pygoscelis adeliae)', 'Adelie Penguin (Pygoscelis adeliae)', 'Adelie Penguin (Pygoscelis adeliae)', ...)
`new` is an S3 object of class <factor>, an integer vector
# Island from penguins_raw and island from penguins (old vs new)
waldo::compare(
  my_packed_data$all_island$Island,
  my_packed_data$all_island$island
)
`old` is a character vector ('Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', 'Torgersen', ...)
`new` is an S3 object of class <factor>, an integer vector
# `Culmen Length (mm)` from penguins_raw and bill_length_mm from penguins (old vs new)
waldo::compare(
  my_packed_data$all_bill_length_mm$`Culmen Length (mm)`,
  my_packed_data$all_bill_length_mm$bill_length_mm
)
✔ No differences
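The first two comparisons report a difference in type (character versus factor), which says nothing yet about the underlying values. A minimal sketch, using a hypothetical two-row packed tibble rather than the penguins data, of how one might coerce the factor to character before comparing the values themselves:

```r
# Hypothetical packed tibble: the same islands stored once as a factor
# and once as a character vector
toy_packed <- tibble::tibble(
  all_island = tibble::tibble(
    island = factor(c("Torgersen", "Biscoe")),
    Island = c("Torgersen", "Biscoe")
  )
)

# Coerce the factor to character so that only the values are compared
same_values <- identical(
  as.character(toy_packed$all_island$island),
  toy_packed$all_island$Island
)
```

The same coercion could be passed to waldo::compare() to get a value-level report instead of a type-level one.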

You can also easily obtain the columns of palmerpenguins::penguins_raw that were not included in palmerpenguins::penguins and unpack them.

my_raw_additional_columns <-
  my_packed_data |>
  dplyr::select(all_remaining_raw_data) |>
  tidyr::unpack(all_remaining_raw_data)

dplyr::glimpse(my_raw_additional_columns)
Rows: 344
Columns: 9
$ studyName           <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL07…
$ `Sample Number`     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
$ Region              <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers", …
$ Stage               <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adult…
$ `Individual ID`     <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", "N…
$ `Clutch Completion` <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "N…
$ `Delta 15 N (o/oo)` <dbl> NA, 8.94956, 8.36821, NA, 8.76651, 8.66496, 9.1871…
$ `Delta 13 C (o/oo)` <dbl> NA, -24.69454, -25.33302, NA, -25.32426, -25.29805…
$ Comments            <chr> "Not enough blood for isotopes.", NA, NA, "Adult n…
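As a small illustration that tidyr::unpack() reverses tidyr::pack(), here is a sketch on a hypothetical toy tibble (toy, packed and restored are made-up names). The column order may change after the round trip, so the comparison is done column by column.

```r
# Hypothetical toy tibble
toy <- tibble::tibble(
  a = 1:2,
  b = c("x", "y"),
  c = c(TRUE, FALSE)
)

# Pack columns a and b into one data-frame column, then unpack it again
packed <- tidyr::pack(toy, grp = c(a, b))
restored <- tidyr::unpack(packed, grp)

# The same columns come back with the same values
same_a <- identical(restored$a, toy$a)
same_b <- identical(restored$b, toy$b)
```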

When should you use nested data or packed data when using a JSON format?

I have used both nested and packed data in an API whose output is ultimately read in JSON format.

In my opinion, it is preferable to use nested data when you have a tibble of several rows. In contrast, when only a single row is involved, I recommend packed data: it is serialised as a JSON object in curly braces rather than as an array in square brackets.

multiple_rows_tib <-
  tibble::tribble(
    ~category, ~value,
    "category1", seq(from = 1.092, to = 1.098, by = 0.001),
    "category2", seq(from = 352.15, to = 352.2, by = 0.01),
    "category3", seq(from = 25.63, to = 25.7, by = 0.01)
  )

my_date <- "2022-09-12"
single_row_tib <-
  tibble::tribble(
    ~year, ~month, ~day,
    lubridate::year(my_date), lubridate::month(my_date, label = TRUE), lubridate::day(my_date)
  )

my_tib <-
  tibble::tibble(
    all_values = vctrs::list_of(multiple_rows_tib),
    date = tidyr::pack(single_row_tib)
  )

jsonlite::toJSON(my_tib, pretty = TRUE)
[
  {
    "all_values": [
      {
        "category": "category1",
        "value": [1.092, 1.093, 1.094, 1.095, 1.096, 1.097, 1.098]
      },
      {
        "category": "category2",
        "value": [352.15, 352.16, 352.17, 352.18, 352.19, 352.2]
      },
      {
        "category": "category3",
        "value": [25.63, 25.64, 25.65, 25.66, 25.67, 25.68, 25.69, 25.7]
      }
    ],
    "date": {
      "year": 2022,
      "month": "Sep",
      "day": 12
    }
  }
] 
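To isolate the effect of the two storage choices, the sketch here serialises the same hypothetical one-row tibble twice: once stored as a list-column (nested) and once as a data-frame column (packed). The names one_row, as_nested and as_packed are illustrative only.

```r
# Hypothetical one-row tibble
one_row <- tibble::tibble(year = 2022L, day = 12L)

# Nested: the tibble is an element of a list-column
as_nested <- tibble::tibble(date = list(one_row))

# Packed: the tibble itself is a data-frame column
as_packed <- tibble::tibble(date = one_row)

nested_json <- jsonlite::toJSON(as_nested)
packed_json <- jsonlite::toJSON(as_packed)

print(nested_json)
print(packed_json)
```

In the nested version, the value of "date" should come out wrapped in square brackets, that is, as an array with a single object. In the packed version it should be a plain object, which is usually what a JSON consumer expects for a single record.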

Citation

BibTeX citation:
@online{lettry2024,
  author = {Lettry, Layal Christine},
  title = {What Are the Use Cases for Using Nested Data over Packed
    Data?},
  date = {2024-06-03},
  url = {https://rdiscovery.netlify.app/posts/2024-06-03_use-case-pack-nest/},
  langid = {en}
}
For attribution, please cite this work as:
Lettry, Layal Christine. 2024. “What Are the Use Cases for Using Nested Data over Packed Data?” June 3, 2024. https://rdiscovery.netlify.app/posts/2024-06-03_use-case-pack-nest/.