Perform transformations on several variables with .env and .data pronouns

rlang
pronouns
magrittr
data-masking
Author: Layal Christine Lettry

Affiliations: cynkra GmbH; University of Fribourg, Dept. of Informatics, ASAM Group

Published: October 15, 2023

Looping through all the variables in a data frame with the .env and .data pronouns

Problem

Suppose you want to count how often each distinct value occurs in every variable of the following data.

mydata <- tibble::tribble(
  ~year,      ~country, ~age, ~is_married, ~has_child, ~is_woman,
   1990,      "France",   68,        TRUE,      FALSE,      TRUE,
   1990,      "France",   22,       FALSE,      FALSE,      TRUE,
   1990,       "Italy",   28,       FALSE,       TRUE,     FALSE,
   1990,       "Italy",   56,        TRUE,       TRUE,     FALSE,
   1990,       "Italy",   36,        TRUE,       TRUE,     FALSE,
   1990, "Switzerland",   23,       FALSE,       TRUE,      TRUE,
   1990, "Switzerland",   23,       FALSE,      FALSE,     FALSE,
   2000,      "France",   13,       FALSE,      FALSE,      TRUE,
   2000,      "France",   63,        TRUE,       TRUE,     FALSE,
   2000,       "Italy",   43,        TRUE,      FALSE,     FALSE,
   2000, "Switzerland",   42,        TRUE,       TRUE,      TRUE,
   2000, "Switzerland",   32,        TRUE,      FALSE,      TRUE
  )

mydata
# A tibble: 12 × 6
    year country       age is_married has_child is_woman
   <dbl> <chr>       <dbl> <lgl>      <lgl>     <lgl>   
 1  1990 France         68 TRUE       FALSE     TRUE    
 2  1990 France         22 FALSE      FALSE     TRUE    
 3  1990 Italy          28 FALSE      TRUE      FALSE   
 4  1990 Italy          56 TRUE       TRUE      FALSE   
 5  1990 Italy          36 TRUE       TRUE      FALSE   
 6  1990 Switzerland    23 FALSE      TRUE      TRUE    
 7  1990 Switzerland    23 FALSE      FALSE     FALSE   
 8  2000 France         13 FALSE      FALSE     TRUE    
 9  2000 France         63 TRUE       TRUE      FALSE   
10  2000 Italy          43 TRUE       FALSE     FALSE   
11  2000 Switzerland    42 TRUE       TRUE      TRUE    
12  2000 Switzerland    32 TRUE       FALSE     TRUE    

The .env pronoun

Suppose we also want a variable called is_parent, which can be derived from the variable has_child. We define the default value TRUE for is_parent in the environment and make the new column depend on has_child. Inside mutate(), the .env pronoun retrieves this default explicitly from the environment, so it cannot be confused with a column of the same name.

is_parent <- TRUE

parent_data <- 
  mydata |>
    dplyr::mutate(is_parent = as.logical(has_child * .env$is_parent)) 

parent_data
# A tibble: 12 × 7
    year country       age is_married has_child is_woman is_parent
   <dbl> <chr>       <dbl> <lgl>      <lgl>     <lgl>    <lgl>    
 1  1990 France         68 TRUE       FALSE     TRUE     FALSE    
 2  1990 France         22 FALSE      FALSE     TRUE     FALSE    
 3  1990 Italy          28 FALSE      TRUE      FALSE    TRUE     
 4  1990 Italy          56 TRUE       TRUE      FALSE    TRUE     
 5  1990 Italy          36 TRUE       TRUE      FALSE    TRUE     
 6  1990 Switzerland    23 FALSE      TRUE      TRUE     TRUE     
 7  1990 Switzerland    23 FALSE      FALSE     FALSE    FALSE    
 8  2000 France         13 FALSE      FALSE     TRUE     FALSE    
 9  2000 France         63 TRUE       TRUE      FALSE    TRUE     
10  2000 Italy          43 TRUE       FALSE     FALSE    FALSE    
11  2000 Switzerland    42 TRUE       TRUE      TRUE     TRUE     
12  2000 Switzerland    32 TRUE       FALSE     TRUE     FALSE    
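
The .env pronoun matters most when a column and an environment variable share a name. The following is a minimal sketch on made-up data: the tibble df and the variable age are illustrative assumptions, not part of the data above.

```r
# Hypothetical example: a column and an environment variable are both called `age`.
df <- tibble::tibble(age = c(10, 20, 30))
age <- 18  # threshold defined in the calling environment

res <- df |>
  dplyr::mutate(
    # A bare `age` resolves to the column first, so the column is compared with itself.
    ambiguous = age >= age,
    # The pronouns state explicitly where each `age` comes from.
    is_adult = .data$age >= .env$age
  )

res
```

Here ambiguous is TRUE in every row because both sides refer to the column, whereas is_adult compares the column with the environment value 18.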

Solution

With a for loop

for (var in names(parent_data)) {
  parent_data |>
    dplyr::count(.data[[var]]) |>
    print()
}
# A tibble: 2 × 2
   year     n
  <dbl> <int>
1  1990     7
2  2000     5
# A tibble: 3 × 2
  country         n
  <chr>       <int>
1 France          4
2 Italy           4
3 Switzerland     4
# A tibble: 11 × 2
     age     n
   <dbl> <int>
 1    13     1
 2    22     1
 3    23     2
 4    28     1
 5    32     1
 6    36     1
 7    42     1
 8    43     1
 9    56     1
10    63     1
11    68     1
# A tibble: 2 × 2
  is_married     n
  <lgl>      <int>
1 FALSE          5
2 TRUE           7
# A tibble: 2 × 2
  has_child     n
  <lgl>     <int>
1 FALSE         6
2 TRUE          6
# A tibble: 2 × 2
  is_woman     n
  <lgl>    <int>
1 FALSE        6
2 TRUE         6
# A tibble: 2 × 2
  is_parent     n
  <lgl>     <int>
1 FALSE         6
2 TRUE          6

With the function purrr::map()

parent_data |>
  names() |>
  purrr::map(\(.x) dplyr::count(parent_data, .data[[.x]]))
[[1]]
# A tibble: 2 × 2
   year     n
  <dbl> <int>
1  1990     7
2  2000     5

[[2]]
# A tibble: 3 × 2
  country         n
  <chr>       <int>
1 France          4
2 Italy           4
3 Switzerland     4

[[3]]
# A tibble: 11 × 2
     age     n
   <dbl> <int>
 1    13     1
 2    22     1
 3    23     2
 4    28     1
 5    32     1
 6    36     1
 7    42     1
 8    43     1
 9    56     1
10    63     1
11    68     1

[[4]]
# A tibble: 2 × 2
  is_married     n
  <lgl>      <int>
1 FALSE          5
2 TRUE           7

[[5]]
# A tibble: 2 × 2
  has_child     n
  <lgl>     <int>
1 FALSE         6
2 TRUE          6

[[6]]
# A tibble: 2 × 2
  is_woman     n
  <lgl>    <int>
1 FALSE        6
2 TRUE         6

[[7]]
# A tibble: 2 × 2
  is_parent     n
  <lgl>     <int>
1 FALSE         6
2 TRUE          6
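
The list returned by purrr::map() above is unnamed, so each table has to be found by position. One possible variant, sketched here on a small made-up tibble (df is illustrative, not the post's data), names the input vector with purrr::set_names() so that each result can be retrieved by column name.

```r
# Hypothetical mini data set, used only for illustration.
df <- tibble::tibble(
  country = c("France", "Italy", "Italy"),
  is_woman = c(TRUE, FALSE, FALSE)
)

counts <- df |>
  names() |>
  purrr::set_names() |>  # names each element after itself
  purrr::map(\(.x) dplyr::count(df, .data[[.x]]))

# The result is a named list, so a table can be looked up by column name.
counts$country
```

The same idea should carry over to parent_data by replacing df with it.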

Theory

What is the difference between the .data and the .env pronouns?

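The .env pronoun refers to variables defined in the calling environment, whereas the .data pronoun refers to the columns of the data frame. In the example below, both are called body_mass_g, but only the copy in the environment has its missing values replaced by 4209.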

body_mass_g <- palmerpenguins::penguins$body_mass_g
body_mass_g[is.na(body_mass_g)] <- 4209

palmerpenguins::penguins |>
  dplyr::select(body_mass_g) |>
  dplyr::mutate(
    body_mass_kg_env = .env$body_mass_g / 1e3,
    body_mass_kg_data = .data$body_mass_g / 1e3
  )
# A tibble: 344 × 3
   body_mass_g body_mass_kg_env body_mass_kg_data
         <int>            <dbl>             <dbl>
 1        3750             3.75              3.75
 2        3800             3.8               3.8 
 3        3250             3.25              3.25
 4          NA             4.21             NA   
 5        3450             3.45              3.45
 6        3650             3.65              3.65
 7        3625             3.62              3.62
 8        4675             4.68              4.68
 9        3475             3.48              3.48
10        4250             4.25              4.25
# ℹ 334 more rows
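
The two pronouns are often combined inside functions, where the column is passed as a string and a threshold is passed as an argument. The following is a minimal sketch under stated assumptions: the helper filter_above() and the tibble df are hypothetical and not part of the original post.

```r
# Hypothetical helper: keep the rows where column `col` exceeds `threshold`.
filter_above <- function(data, col, threshold) {
  dplyr::filter(
    data,
    # .data[[col]] picks the column named by the string `col`;
    # .env$threshold picks the function argument, even if `data`
    # happens to contain a column called `threshold`.
    .data[[col]] > .env$threshold
  )
}

df <- tibble::tibble(
  mass = c(1, 5, 10),
  threshold = c(100, 100, 100)  # deliberately clashes with the argument name
)

heavy <- filter_above(df, "mass", threshold = 4)
heavy
```

Without .env, the bare name threshold would resolve to the column first, and no row would pass the filter.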

What is the difference between the .data and the magrittr . pronouns?

I learnt in this article’s section that, in a data-masked context, it is safer to use the rlang .data pronoun than the magrittr . pronoun. With grouped data, . refers to the whole data frame, whereas .data refers to the slice for the current group.

The .data pronoun is automatically generated when you use data-masking functions.
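
The difference on grouped data can be made visible with a small sketch. The tibble df below is made up for illustration, and the magrittr pipe %>% is used on purpose, because the base pipe |> has no . pronoun.

```r
# Make the magrittr pipe available without attaching the package.
`%>%` <- magrittr::`%>%`

# Hypothetical data: group "a" has two rows, group "b" has one.
df <- tibble::tibble(g = c("a", "a", "b"), x = c(1, 2, 3))

res <- df %>%
  dplyr::group_by(g) %>%
  dplyr::summarise(
    rows_dot = nrow(.),           # `.` is the whole grouped data frame
    rows_data = length(.data$x)   # `.data` is the slice of the current group
  )

res
```

In this sketch rows_dot is 3 for both groups, while rows_data is 2 for group a and 1 for group b.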


Citation

BibTeX citation:
@online{lettry2023,
  author = {Lettry, Layal Christine},
  title = {Perform Transformations on Several Variables with `.env` and
    `.data` Pronouns},
  date = {2023-10-15},
  url = {https://rdiscovery.netlify.app/posts/2023-10-14_data-pronoun/},
  langid = {en}
}
For attribution, please cite this work as:
Lettry, Layal Christine. 2023. “Perform Transformations on Several Variables with `.env` and `.data` Pronouns.” October 15, 2023. https://rdiscovery.netlify.app/posts/2023-10-14_data-pronoun/.