Teaching modeling in introductory statistics

A comparison of formula and tidyverse syntaxes

Amelia McNamara @AmeliaMN

Pre-print

This talk is based on a paper I wrote, which is available as a pre-print on arXiv: https://arxiv.org/abs/2201.12960

Screenshot of first page of paper entitled Teaching modeling in introductory statistics: A comparison of formula and tidyverse syntaxes

Palmer Penguins

library(palmerpenguins)
Illustration of Chinstrap, Gentoo, and Adelie penguins with colorful backgrounds.

Penguin data and penguin art by Allison Horst.

Summary statistics three ways

base

mean(penguins$body_mass_g[penguins$species == "Adelie"], na.rm = TRUE)
mean(penguins$body_mass_g[penguins$species == "Chinstrap"], na.rm = TRUE)
mean(penguins$body_mass_g[penguins$species == "Gentoo"], na.rm = TRUE)

tapply(penguins$body_mass_g, penguins$species, mean, na.rm = TRUE)

formula

library(mosaic)
mean(body_mass_g ~ species, data = penguins, na.rm = TRUE)

tidyverse

library(tidyverse)
penguins %>%
  drop_na(body_mass_g) %>%
  group_by(species) %>%
  summarize(mean(body_mass_g))

Scatterplot three ways

base

plot(penguins$flipper_length_mm, penguins$bill_length_mm)

formula

gf_point(bill_length_mm ~ flipper_length_mm, data = penguins)

tidyverse

ggplot(penguins) +
  geom_point(aes(x = flipper_length_mm, y = bill_length_mm))

Sets of scatterplots three ways

base

par(mfrow = c(1, 3))
plot(penguins$flipper_length_mm[penguins$species == "Adelie"],
     penguins$bill_length_mm[penguins$species == "Adelie"])
plot(penguins$flipper_length_mm[penguins$species == "Chinstrap"],
     penguins$bill_length_mm[penguins$species == "Chinstrap"])
plot(penguins$flipper_length_mm[penguins$species == "Gentoo"],
     penguins$bill_length_mm[penguins$species == "Gentoo"])
# apply solution?
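One way to fill in that "apply solution?" note, sketched here with base functions only (the main = labels are just for orientation):

# one panel per species, looping with lapply() instead of copy-pasting plot() calls
par(mfrow = c(1, 3))
invisible(lapply(levels(penguins$species), function(sp) {
  d <- subset(penguins, species == sp)
  plot(d$flipper_length_mm, d$bill_length_mm, main = sp)
}))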

formula

gf_point(bill_length_mm ~ flipper_length_mm | species, data = penguins)

tidyverse

ggplot(penguins, aes(x = flipper_length_mm, y = bill_length_mm)) +
  geom_point() +
  facet_grid(~species)

Sets of scatterplots three ways

Side-by-side comparison of the resulting plots: the base version and the formula/tidyverse version.

Choosing syntax

There is debate in the statistics education community about which syntax is best to teach, particularly for the “first course.” All three syntaxes have their proponents, but the folks most interested in pedagogy tend to argue about whether formula or tidyverse is best.

Head-to-head comparison

  • Students enrolled in the same lecture class (60-90 students)
  • Lecture was broken into three smaller sections for lab material
  • I taught two of the sections (n=21 in each), and both were designated as using R
  • Using random assignment (coin flip) I chose one to use tidyverse syntax and one to use formula syntax
  • Similar students in both groups: mostly business majors, with similar prior programming experience
Prior programming experience?

                          formula   tidyverse
  No                           10           9
  Yes, but not with R           2           4

Why do this?

To get some data

Constraints breed creativity

Data

I was able to gather lots of data:

  • RMarkdown documents and associated code
  • Pre- and post-survey
  • YouTube analytics
  • RStudio Cloud analytics

getParseData()

The getParseData() function from utils lets you parse R code and inspect everything about the resulting parse tree. I just filtered for function calls.

library(here)   # build paths relative to the project root
library(purrr)  # map_dfr()
library(dplyr)  # filter() and %>%
# parse each purled pre-lab script and stack the parse data, one row per token
purled_labs <- list.files(path = here("data/prelabs"), full.names = TRUE)
names(purled_labs) <- basename(purled_labs)
parsed_all <- map_dfr(purled_labs,
                      ~ getParseData(parse(file = ., keep.source = TRUE)),
                      .id = "filename")
allfunctions <- parsed_all %>%
  filter(token == "SYMBOL_FUNCTION_CALL")
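To turn allfunctions into per-section counts like the ones on the next slide, something along these lines would work (a sketch; the assumption that each prelab filename contains "formula" or "tidyverse" is hypothetical and would need to match your own naming scheme):

library(stringr)
allfunctions %>%
  # assumption: the section is encoded in the prelab filename
  mutate(section = if_else(str_detect(filename, "formula"),
                           "formula", "tidyverse")) %>%
  group_by(section) %>%
  summarize(n_functions = n_distinct(text))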

Minimal reproducible… stats course

Like “Enough R for Intro Stats” by Randy Pruim.

The formula section saw a total of 37 functions and the tidyverse section saw 50, with an overlap of 18 functions between the two sections.

Neither of these numbers is very large!

The functions both sections of students saw included helper functions like library(), set.seed(), and set() (a function in the knitr options included at the top of each RMarkdown document), statistics like mean(), sd(), and cor(), and modeling-related functions like aov(), lm(), summary(), and predict().

What was hard? (formula edition)

Dealing with missing data.

options(na.rm = TRUE)
# each function has its own missing-data argument: na.rm = for mean(), use = for cor()
mean(body_mass_g ~ species, data = penguins, na.rm = TRUE)
cor(bill_length_mm ~ body_mass_g, data = penguins, use = "complete.obs")

Solutions:

  • More consistency in prep.
  • Teach some wrangling (see the sketch below)
  • ??
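A sketch of the "teach some wrangling" idea (assuming mosaic and tidyr are loaded): handle the missing values once, up front, so the later formula calls no longer need per-function missing-data arguments.

library(mosaic)
library(tidyr)
# drop incomplete rows once, rather than remembering na.rm =, use =, etc. later
penguins_complete <- drop_na(penguins, body_mass_g, bill_length_mm)
mean(body_mass_g ~ species, data = penguins_complete)
cor(bill_length_mm ~ body_mass_g, data = penguins_complete)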

What was hard? (tidyverse edition)

Explaining things with two categorical variables (two-way tables, inference)

penguins %>%
  group_by(sex, island) %>%
  summarize(n = n()) %>%
  mutate(prop = n / sum(n))
# A tibble: 4 × 4
# Groups:   sex [2]
  sex    island     n  prop
  <fct>  <fct>  <int> <dbl>
1 female Biscoe    80 0.567
2 female Dream     61 0.433
3 male   Biscoe    83 0.572
4 male   Dream     62 0.428
penguins %>%
  infer::prop_test(
    response = island,
    explanatory = sex,
    alternative = "two-sided",
    order = c("female", "male")
  )
# A tibble: 1 × 6
  statistic chisq_df p_value alternative lower_ci upper_ci
      <dbl>    <dbl>   <dbl> <chr>          <dbl>    <dbl>
1  1.78e-30        1    1.00 two.sided     -0.127    0.117

For comparison

mosaic::tally(island ~ sex, 
              data = penguins, 
              format = "proportion")
        sex
island      female      male
  Biscoe 0.5673759 0.5724138
  Dream  0.4326241 0.4275862
mosaic::prop.test(island ~ sex, 
                  data = penguins, 
                  success = "Biscoe")

    2-sample test for equality of proportions with continuity correction

data:  tally(island ~ sex)
X-squared = 2.8876e-30, df = 1, p-value = 1
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.1248439  0.1147680
sample estimates:
   prop 1    prop 2 
0.5673759 0.5724138 

Pre- and post-survey

Not much difference between sections

Boxplots showing change in responses to survey questions between pre- and post-surveys. Image is faceted by section. There are few visible differences between sections, and most boxes are centered at 0 (no change).

Pre- and post-survey

Likert scale plot showing a comparison of pre- and post-survey responses to survey questions. Most boxes are centered such that the responses are positive, and few show any difference between pre- and post-surveys. The notable exception is the Programming Confident question, which shows a marked improvement between pre- and post-survey.

Pre-lab length

RMarkdown documents, in lines

Side-by-side barcharts showing length of RMarkdown documents in lines, over the course of the semester. Sections are compared side-by-side. Most weeks, the tidyverse bar is slightly taller.

Pre-lab videos, in minutes

Side-by-side barcharts showing length of pre-lab YouTube videos, in minutes, over the course of the semester. Sections are compared side-by-side. Most weeks the videos are about the same length, with two notable exceptions, week 3 and week 10. In both weeks, the tidyverse bar is taller.

Compute time

Faceted density plots, one for each month of the semester. Each month has two lines, one for each section, with the exception of November, where tidyverse data is missing. In October and December, the tidyverse density plot is noticeably pulled to the right.

Model

Fit a linear mixed-effects model, using month as a categorical variable. The lme4 package doesn't provide p-values, but we can look at confidence intervals.
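A sketch of the kind of model being described (the data frame usage and its columns minutes, section, month, and student are placeholders, not the actual analysis code):

library(lme4)
# compute time modeled by section, month, and their interaction,
# with a random intercept per student
fit <- lmer(minutes ~ section * month + (1 | student), data = usage)
confint(fit)  # profile confidence intervals; lme4 does not report p-values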

Confidence intervals for coefficient estimates:

                                   2.5%   97.5%
  .sig01                            3.3     6.1
  .sigma                            4.5     5.9
  (Intercept)                       8.4    14.4
  sectiontidyverse                 -6.2     2.2
  monthOctober                      1.2     7.5
  monthNovember                    -4.9     1.5
  monthDecember                    -5.5     0.9
  sectiontidyverse:monthOctober     0.4     9.4
  sectiontidyverse:monthDecember    0.7     9.7

So, which should I use?

It depends!

I used formula syntax this past semester when I taught the R labs again. Most of these students won't go on to another statistics class, and they are mostly “users” of R rather than “doers.”

If most of my students were statistics majors, I would definitely teach tidyverse. If it were an intro data science course, I would teach tidyverse.

My big suggestion is to be consistent.

Constraints breed creativity!

Use getParseData() to learn how many functions you are using, and streamline.

Learn more

Thank you