Lab 2: Statistical inference & hypothesis testing

Practice session covering topics discussed in Lecture 2

M. Chiara Mimmi, Ph.D. | Università degli Studi di Pavia

July 25, 2024

GOAL OF TODAY’S PRACTICE SESSION

Consolidate understanding of inferential statistic, through R coding examples conducted on real biostatistics research data.



Lecture 2: topics

  • Purpose and foundations of inferential statistics
  • Getting to know the “language” of hypothesis testing
  • Hypothesis testing
    • review examples
  • A closer look at testing assumptions
    • more examples dealing with assumptions’ violation

R ENVIRONMENT SET UP & DATA

Needed R Packages

  • We will use functions from packages base, utils, and stats (pre-installed and pre-loaded)
  • We will also use the packages below (specifying package::function for clarity).
# Load pckgs for this R session

# General 
library(fs)           # file/directory interactions
library(here)         # tools find your project's files, based on working directory
library(janitor)      # tools for examining and cleaning data
library(dplyr)        # {tidyverse} tools for manipulating and summarising tidy data 
library(forcats)      # {tidyverse} tool for handling factors
library(tidyr)        # Tidy Messy Data       

# Statistics
library(BSDA)         # Basic Statistics and Data Analysis   
library(rstatix)      # Pipe-Friendly Framework for Basic Statistical Tests
library(car)          # Companion to Applied Regression
library(multcomp)     # Simultaneous Inference in General Parametric Models 

# Plotting
library(ggplot2)      # {tidyverse} tools for plotting
library(ggstatsplot) # 'ggplot2' Based Plots with Statistical Details  
library(ggpubr)       # 'ggplot2' Based Publication Ready Plots 
library(patchwork)    # Functions for ""Grid" Graphics"composing" plots 
library(viridis)      # Colorblind-Friendly Color Maps for R 
library(ggthemes)     # Extra Themes, Scales and Geoms for 'ggplot2'

DATASETS FOR TODAY

For the most part, we will refer to a real clinical dataset (for which a Creative Commons license was granted) discussed in two articles (also open access) :

  • Ahmad, T., Munir, A., Bhatti, S. H., Aftab, M., & Raza, M. A. (2017). Survival analysis of heart failure patients: A case study. PLOS ONE, 12(7), e0181001. https://doi.org/10.1371/journal.pone.0181001
  • Chicco, D., & Jurman, G. (2020). Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Medical Informatics and Decision Making, 20(1), 16. https://doi.org/10.1186/s12911-020-1023-5

Importing from project folder (previously downloaded file)

You can access the dataset either:

  • From the UC Irvine Machine Learning Repository Heart Failure Clinical Records
  • From the workshop website: use function here to specify the complete path of the input data folder
# Check my working directory location
# here::here()

# Use `here` in specifying all the subfolders AFTER the working directory 
heart_failure <- read.csv(file = here::here("practice", "data_input", "02_datasets",
                                      "heart_failure_clinical_records_dataset.csv"), 
                          header = TRUE, # 1st line is the name of the variables
                          sep = ",", # which is the field separator character.
                          na.strings = c("?","NA" ), # specific MISSING values  
                          row.names = NULL) 

Tip

Make sure to match your own folder structure!

INSPECTING THE “HEART FAILURE” DATASET

What are the variables and their levels of measurement?

The data, with medical records of 299 heart failure patient, were collected at the Faisalabad Institute of Cardiology and at the Allied Hospital in Faisalabad (Punjab, Pakistan), during April–December 2015. See Table 1.

Table 1

Look into the dataset just loaded in the R environment

Recall some base R functions from Lab 1

# What variables are included in this dataset?
colnames(heart_failure)
 [1] "age"                      "anaemia"                 
 [3] "creatinine_phosphokinase" "diabetes"                
 [5] "ejection_fraction"        "high_blood_pressure"     
 [7] "platelets"                "serum_creatinine"        
 [9] "serum_sodium"             "sex"                     
[11] "smoking"                  "time"                    
[13] "DEATH_EVENT"             
# How many observations & variables?
nrow(heart_failure)
[1] 299
# How many rows & columns?
dim(heart_failure)
[1] 299  13

Inspect the dataframe structure (base R)

# What does the dataframe look like?
str(heart_failure)
'data.frame':   299 obs. of  13 variables:
 $ age                     : num  75 55 65 50 65 90 75 60 65 80 ...
 $ anaemia                 : int  0 0 0 1 1 1 1 1 0 1 ...
 $ creatinine_phosphokinase: int  582 7861 146 111 160 47 246 315 157 123 ...
 $ diabetes                : int  0 0 0 0 1 0 0 1 0 0 ...
 $ ejection_fraction       : int  20 38 20 20 20 40 15 60 65 35 ...
 $ high_blood_pressure     : int  1 0 0 0 0 1 0 0 0 1 ...
 $ platelets               : num  265000 263358 162000 210000 327000 ...
 $ serum_creatinine        : num  1.9 1.1 1.3 1.9 2.7 2.1 1.2 1.1 1.5 9.4 ...
 $ serum_sodium            : int  130 136 129 137 116 132 137 131 138 133 ...
 $ sex                     : int  1 1 1 1 0 1 1 1 0 1 ...
 $ smoking                 : int  0 0 1 0 0 1 0 1 0 1 ...
 $ time                    : int  4 6 7 7 8 8 10 10 10 10 ...
 $ DEATH_EVENT             : int  1 1 1 1 1 1 1 1 1 1 ...

Inspect the dataframe structure (skimr)

Remember the skimr function skim?

# some variables 
heart_failure %>% skimr::skim( age, DEATH_EVENT ) 

# the whole dataframe
heart_failure %>% skimr::skim() 



You try…

Run skimr::skim() on your own either on the whole dataset or on any specific variable

  • notice there are no (missing values) NAs in any of the variables

Recode some variables for later ease of analysis

I may need some variables coded as factor (e.g. categorical variables for plotting), and, while I am at it, I can add clearer labels for the variables’ levels. Here, we are:

  • using tidyverse packages dplyr and forcats
  • adding new (recoded) variables called oldname_f
heart_failure <-heart_failure %>% 
  dplyr::mutate(DEATH_EVENT_f = as.factor(DEATH_EVENT) %>%
                  forcats::fct_recode("died" = "1", "survived" = "0")) %>% 
  dplyr::mutate(sex_f = as.factor(sex) %>%
                  forcats::fct_recode("male" = "1", "female" = "0"))

# check 
table(heart_failure$DEATH_EVENT_f)

survived     died 
     203       96 
table(heart_failure$sex_f)

female   male 
   105    194 

Some more dummy variables recoded as factor

[Mostly for illustration: it’s totally fine (if not preferable) to keep these as binary [0,1] variables]

  • It’s worth learning the useful function dplyr::across1, which allows to iteratively transform several columns at once!
# Recode as factor with levels "yes" (= 1), "no" (= 0)
fct_cols = c("anaemia", "diabetes", "high_blood_pressure", "smoking" )

heart_failure <- heart_failure  %>% 
  ## ---- 1st create new cols as "factor versions" of old cols
  dplyr::mutate(
    # let's introduce `across` function 
    dplyr::across(
      # Columns to transform
      .cols = all_of(fct_cols), 
      # Functions to apply to each col  
      .fns =  ~as.factor (.x),
      # new name to apply where "{.col}" stands for the selected column
      .names = "{.col}_f")) %>% 
  ## ---- 2nd create new cols as "factor versions" of old cols
  dplyr::mutate(
    dplyr::across(
      # Columns to transform 2 conditions 
      .cols = ends_with("_f") & !matches(c( "DEATH_EVENT_f", "sex_f" )) , 
      # Functions to apply to each col(different syntax)
      .fns = ~forcats::fct_recode(.x,  yes = "1", no = "0" )))

(Small digression on dplyr::across)

Notice how dplyr::across(.cols = ..., .fns = ..., .names = ...) has these arguments:

  1. .cols = to select the columns which we want to transform (i.e. fct_cols)
    • with help from tidyselect functions: all_of, ends_with, and matches
  2. .fns = ~function(.x) to specify the function
    • where ~function(.x) uses the “anonymous function” syntax of the tidyverse
    • and .x inside the function is a “stand in” for each of the columns selected
  3. [optional] .names = to name the new cols created using {.col} in place of each of the transformed columns
## ---- 1st create new cols as "factor versions" of old cols
heart_failure <- heart_failure  %>% 
  dplyr::mutate(
    dplyr::across(
      .cols = all_of(fct_cols), 
      .fns = ~as.factor (.x), 
      # (optional)
      .names = "{.col}_f")) %>% 
  ## ---- 2nd create new cols as "factor versions" of old cols
  dplyr::mutate(
    dplyr::across(
      .cols = ends_with("_f") & !matches(c( "DEATH_EVENT_f", "sex_f" )) , 
      .fns =  ~forcats::fct_recode(.x,  yes = "1", no = "0" )))

VISUAL DATA EXPLORATION FOR THE “HEART FAILURE”

CONTINUOUS VARIABLES

Why is visual exploration important?

  • Gaining insight on the variables (range, outliers, missing data)
  • Preliminary check of assumptions for parametric hypothesis testing:
    • normally distributed outcome variables?
    • homogeneity of variance across groups?

Let’s explore the Heart failure dataset with some data visualization…

  • Following the referenced articles (which were mostly interested in predict mortality based on patients’ characteristics), we will take the categorical, binary variable DEATH_EVENT_f as our main criterion to split the sample (into survived and dead patients) to explore any significant difference between groups in terms of means of known quantitative features.
  • We will look at both:
    • continuous variables in the dataset (with the Probability Density Function (PDF))
    • discrete variables in the dataset (with the Probability Mass Function (PMF))

Age

Introducing the handy R package patchwork which lets us compose different plots in a very simple and intuitive way

  • (check it out with ??patchwork)
age <-ggplot(heart_failure,aes(x = age ))+
  geom_histogram(binwidth = 5, color = "white", fill = "grey",alpha = 0.5)+
  geom_vline(aes(xintercept = mean(age)), color = "#4c4c4c")+
  theme_fivethirtyeight()+
  labs(title = "Age Distribution" )+
  scale_x_continuous(breaks = seq(40,100,5))  

age2 <-ggplot(heart_failure, aes(x = age, fill = DEATH_EVENT_f))+
  geom_histogram(binwidth = 5, position = "identity",alpha = 0.5,color = "white")+
  geom_vline(aes(xintercept = mean(age[DEATH_EVENT == 0])), color = "#4c4c4c")+
  geom_vline(aes(xintercept = mean(age[DEATH_EVENT==1])), color = "#d8717b")+
  theme_fivethirtyeight()+
  scale_fill_manual(values = c("#999999", "#d8717b"))+
  labs(title =  "Age Distribution by group (Death Event)")+
  scale_x_continuous(breaks = seq(40,100,5))

# patchwork
library(patchwork) # The Composer of Plots
age + age2 + plot_layout(ncol = 1)

Age

As the age increases, the incidence of death event seems to increase

Creatinine Phosphokinase (CPK)

cpk <- ggplot(heart_failure,aes(x = creatinine_phosphokinase))+
  geom_density(fill = "gray", alpha = 0.5)+
  scale_x_continuous(breaks = seq(0,8000, 500))+
  geom_vline(aes(xintercept = mean(creatinine_phosphokinase)), color = "#4c4c4c")+
  theme_fivethirtyeight()+
  theme(axis.text.x = element_text(angle=50, vjust=0.75))+
  labs(title = "Creatinine phosphokinase (density distribution)" )+
  theme(plot.caption = element_text(hjust = 0.5, face = "italic"))

cpk2 <- ggplot(heart_failure,aes(x = creatinine_phosphokinase,fill = DEATH_EVENT_f))+
  geom_density(alpha = 0.5)+theme_fivethirtyeight()+
  scale_fill_manual(values = c("#999999", "#d8717b"))+
  scale_x_continuous(breaks = seq(0,8000, 500))+
  geom_vline(aes(xintercept = mean(creatinine_phosphokinase[DEATH_EVENT == 0])),
             color = "#4c4c4c")+
  geom_vline(aes(xintercept = mean(creatinine_phosphokinase[DEATH_EVENT==1])), 
             color = "#d8717b")+
  theme_fivethirtyeight()+
  theme(axis.text.x = element_text(angle=50, vjust=0.75))+
  labs(title =  "Creatinine phosphokinase (density distribution) by group (Death Event)")

cpk + cpk2 + plot_layout(ncol = 1)

Creatinine Phosphokinase (CPK)

This definitely doesn’t look like a normal distribution!

Ejection Fraction

ejf <- ggplot(heart_failure,aes(x = ejection_fraction))+
  geom_density(fill = "gray", alpha = 0.5)+
  scale_x_continuous(breaks = seq(0,100, 5))+
  geom_vline(aes(xintercept = mean(ejection_fraction)), color = "#4c4c4c")+
  theme_fivethirtyeight()+
  labs(title = "Ejection Fraction (density distribution)" )+
  theme(plot.caption = element_text(hjust = 0.5, face = "italic"))

ejf2 <- ggplot(heart_failure,aes(x = ejection_fraction,fill = DEATH_EVENT_f))+
  geom_density(alpha = 0.5)+theme_fivethirtyeight()+
  scale_x_continuous(breaks = seq(0,100, 5))+
  scale_fill_manual(values = c("#999999", "#d8717b"))+
  geom_vline(aes(xintercept = mean(ejection_fraction[DEATH_EVENT == 0])),
             color = "#4c4c4c")+
  geom_vline(aes(xintercept = mean(ejection_fraction[DEATH_EVENT==1])), 
             color = "#d8717b")+
  labs(title =  "Ejection Fraction (density distribution) by group (Death Event)")+
  theme_fivethirtyeight()

ejf + ejf2 + plot_layout(ncol = 1)

Ejection Fraction

This also doesn’t look like a normal distribution… and there is a remarkable change in the probability density function (PDF) shape when we introduce the grouping variable

Platelets

# normalize the var for readability 
heart_failure  <-  heart_failure %>%  dplyr::mutate(plat_norm = platelets/1000) 

plat <- ggplot(heart_failure,aes(x = plat_norm))+
  geom_density(fill = "gray", alpha = 0.5)+
  scale_x_continuous(breaks = seq(0,800, 100))+
  geom_vline(aes(xintercept = mean(plat_norm)), color = "#4c4c4c")+
  theme_fivethirtyeight()   + 
  labs(title =  "Platelets (density distribution)",
       y = "Density", x = "Sample platelet count (in 10^3 µL)") 

plat2 <- ggplot(heart_failure,aes(x = plat_norm,fill = DEATH_EVENT_f))+
  geom_density(alpha = 0.5)+theme_fivethirtyeight()+
  scale_x_continuous(breaks = seq(0,800, 100))+
  scale_fill_manual(values = c("#999999", "#d8717b"))+
  geom_vline(aes(xintercept = mean(plat_norm[DEATH_EVENT == 0])),
             color = "#4c4c4c")+
  geom_vline(aes(xintercept = mean(plat_norm[DEATH_EVENT==1])), 
             color = "#d8717b")+
  theme_fivethirtyeight()   + 
  labs(title =  "Platelets (density distribution) by group (Death Event)",
       caption = "(Sample platelet count in 10^3 µL)") 
 
plat + plat2 + plot_layout(ncol = 1)

Platelets

Here the probability distributions resemble a Normal one and we observe more uniformity in the mean/variance across the 2 groups

Serum Creatinine

ser_cr <- ggplot(heart_failure,aes(x = serum_creatinine))+
  geom_density(fill = "gray", alpha = 0.5)+
  scale_x_continuous(breaks = seq(0,10, 1))+
  geom_vline(aes(xintercept = mean(serum_creatinine)), color = "#4c4c4c")+
  theme_fivethirtyeight()+
  labs(title = "Serum Creatinine (density distribution)" )+
  theme(plot.caption = element_text(hjust = 0.5, face = "italic"))

ser_cr2 <- ggplot(heart_failure,aes(x = serum_creatinine,fill = DEATH_EVENT_f))+
  geom_density(alpha = 0.5)+theme_fivethirtyeight()+
  scale_x_continuous(breaks = seq(0,10, 1))+
  scale_fill_manual(values = c("#999999", "#d8717b"))+
  geom_vline(aes(xintercept = mean(serum_creatinine[DEATH_EVENT == 0])),
             color = "#4c4c4c")+
  geom_vline(aes(xintercept = mean(serum_creatinine[DEATH_EVENT==1])), 
             color = "#d8717b")+
  labs(title =  "Serum Creatinine (density distribution) by group (Death Event)")+
  theme_fivethirtyeight()

ser_cr + ser_cr2 + plot_layout(ncol = 1)

Serum Creatinine

Another continuous random variable with a non-normal distribution (long right tails) and a seemingly important difference in variance between the groups.

Serum Sodium

ser_sod <- ggplot(heart_failure,aes(x = serum_sodium))+
  geom_density(fill = "gray", alpha = 0.5)+
  scale_x_continuous(breaks = seq(0,150, 5))+
  geom_vline(aes(xintercept = mean(serum_sodium)), color = "#4c4c4c")+
  theme_fivethirtyeight()+
  labs(title = "Serum Sodium (density distribution)" )

ser_sod2 <- ggplot(heart_failure,aes(x = serum_sodium,fill = DEATH_EVENT_f))+
  geom_density(alpha = 0.5)+
  scale_x_continuous(breaks = seq(0,150, 5))+
  scale_fill_manual(values = c("#999999", "#d8717b"))+
  geom_vline(aes(xintercept = mean(serum_sodium[DEATH_EVENT == 0])),
             color = "#4c4c4c")+
  geom_vline(aes(xintercept = mean(serum_sodium[DEATH_EVENT==1])), 
             color = "#d8717b")+
  theme_fivethirtyeight()+
  labs(title =  "Serum Sodium (density distribution) by group (Death Event)")+
  theme_fivethirtyeight()

ser_sod + ser_sod2 + plot_layout(ncol = 1)

Serum Sodium

Same as above, except for the long left tails…