Day 2: Inference and Hypothesis testing

…. more details

M. Chiara Mimmi, Ph.D. | Università degli Studi di Pavia

July 25, 2024

introduction

collaborators

Johanna Hardin, Pomona College
Benjamin S. Baumer, Smith College
Amelia McNamara, University of St Thomas
Nicholas J. Horton, Amherst College
Colin W. Rundel, Duke University

setting the scene

Assumption 1:

Teach authentic tools

Assumption 2:

Teach R as the authentic tool

takeaway

The tidyverse provides an effective and efficient pathway for undergraduate students at all levels and majors to gain computational skills and thinking needed throughout the data science cycle.

principles of the tidyverse

tidyverse

meta R package that loads a set of core packages when attached (nine as of tidyverse 2.0.0, which added lubridate) and also installs numerous other packages
tidyverse packages share a design philosophy, common grammar, and data structures

tidyverse flow

setup

Data: Thousands of loans made through the Lending Club, a peer-to-peer lending platform. The data are available in the openintro package as loans_full_schema and are used here with a few modifications.

library(tidyverse)
library(openintro)

loans <- loans_full_schema %>%
  mutate(
    homeownership = str_to_title(homeownership), 
    bankruptcy = if_else(public_record_bankrupt >= 1, "Yes", "No")
  ) %>%
  filter(annual_income >= 10) %>%
  select(
    loan_amount, homeownership, bankruptcy,
    application_type, annual_income, interest_rate
  )

start with a data frame

loans

# A tibble: 9,976 × 6
  loan_amount homeownership bankruptcy application_type annual_income
        <int> <chr>         <chr>      <fct>                    <dbl>
1       28000 Mortgage      No         individual               90000
2        5000 Rent          Yes        individual               40000
3        2000 Rent          No         individual               40000
4       21600 Rent          No         individual               30000
5       23000 Rent          No         joint                    35000
6        5000 Own           No         individual               34000
# ℹ 9,970 more rows
# ℹ 1 more variable: interest_rate <dbl>

tidy data

Each variable forms a column
Each observation forms a row
Each type of observational unit forms a table
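
To make these rules concrete, here is a minimal sketch on a small made-up table (the `untidy` tibble, its column names, and its values are hypothetical and not part of the loans data). The variable "year" is hidden in the column names, so `pivot_longer()` from tidyr is used to give every variable its own column and every observation its own row.

```r
library(tibble)
library(tidyr)

# Hypothetical untidy table: the years are column names, not values
untidy <- tribble(
  ~branch, ~`2022`, ~`2023`,
  "North",      12,      15,
  "South",       8,      11
)

# Reshape so each row is one branch-year observation
tidy <- pivot_longer(
  untidy,
  cols      = c(`2022`, `2023`),
  names_to  = "year",
  values_to = "n_loans"
)

tidy
```

After reshaping, `tidy` has one row per branch and year, with `branch`, `year`, and `n_loans` as its columns.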

task: calculate a summary statistic

Calculate the mean loan amount.

mean(loan_amount)

Error: object 'loan_amount' not found
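
The error arises because loan_amount is a column inside the data frame, not an object R can find on its own. A minimal sketch of that distinction, using a hypothetical stand-in data frame called `loans_demo` and assuming a fresh session in which nothing has been attached:

```r
# Hypothetical stand-in for the loans data
loans_demo <- data.frame(loan_amount = c(28000, 5000, 2000))

# The column exists inside the data frame...
in_data_frame <- "loan_amount" %in% names(loans_demo)

# ...but there is no standalone object with that name for mean() to find
in_workspace <- exists("loan_amount")

c(in_data_frame = in_data_frame, in_workspace = in_workspace)
```

The approaches that follow differ only in how they tell R where to look for the column.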

accessing a variable

Approach 1: With attach():

attach(loans)
mean(loan_amount)

[1] 16357.53

Not recommended. If you were also working with another data frame, say car_loans, that had its own loan_amount variable, it would no longer be clear which loan_amount the call refers to.
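
The masking problem can be sketched with two small made-up data frames (`loans_demo` and the values in both data frames are hypothetical). When both are attached, the one attached last takes precedence, and the result changes without any error.

```r
# Two hypothetical data frames that both contain a loan_amount column
loans_demo <- data.frame(loan_amount = c(10000, 20000))
car_loans  <- data.frame(loan_amount = c(3000, 5000))

attach(loans_demo)
attach(car_loans)  # R notes that loan_amount is masked from loans_demo

# The most recently attached data frame wins
masked_mean <- mean(loan_amount)

# Clean up the search path so later code is not affected
detach(car_loans)
detach(loans_demo)

masked_mean
```

Here `masked_mean` is 4000, the mean of the car_loans column, even though the code never mentions car_loans in the call to `mean()`.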

Approach 2: Using $:

mean(loans$loan_amount)

[1] 16357.53

Approach 3: Using with():

with(loans, mean(loan_amount))

[1] 16357.53

Approach 4: The tidyverse approach:

loans %>%
  summarise(mean_loan_amount = mean(loan_amount))

# A tibble: 1 × 1
  mean_loan_amount
             <dbl>
1           16358.

More verbose
But also more expressive and extensible
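
As a sketch of what "extensible" can mean here, the same pipeline grows into a grouped summary with several statistics by adding one line and a few arguments. The `loans_demo` tibble below is a stand-in built from the first six rows printed earlier so that the block runs on its own; with the full data, loans would take its place.

```r
library(dplyr)
library(tibble)

# Stand-in data: the first six rows of loans shown earlier
loans_demo <- tibble(
  loan_amount   = c(28000, 5000, 2000, 21600, 23000, 5000),
  homeownership = c("Mortgage", "Rent", "Rent", "Rent", "Rent", "Own")
)

# Same summarise() call, extended with a grouping step and more statistics
by_home <- loans_demo %>%
  group_by(homeownership) %>%
  summarise(
    n_loans            = n(),
    mean_loan_amount   = mean(loan_amount),
    median_loan_amount = median(loan_amount)
  )

by_home
```

The result is again a data frame, one row per homeownership level, so it can be piped into further steps such as sorting or plotting.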