Testing and summarizing relationship between 2 variables (correlation)
Pearson’s \(r\) analysis (parametric)
Spearman test (nonparametric)
Measures of association
Chi-Square test of independence
Fisher’s Exact Test
alternative to the Chi-Square Test of Independence
From correlation/association to prediction/causation
The purpose of observational and experimental studies
Widely used analytical tools
Simple linear regression models
Multiple Linear Regression models
Shifting the emphasis to empirical prediction
Introduction to Machine Learning (ML)
Distinction between Supervised & Unsupervised algorithms
R ENVIRONMENT SET UP & DATA
Needed R Packages
We will use functions from packages base, utils, and stats (pre-installed and pre-loaded)
We will also use the packages below (specifying package::function for clarity).
```r
# Load packages for this R session

# -- General
library(fs)          # file/directory interactions
library(here)        # tools to find your project's files, based on working directory
library(paint)       # paint data.frame summaries in colour
library(janitor)     # tools for examining and cleaning data
library(dplyr)       # {tidyverse} tools for manipulating and summarizing tidy data
library(forcats)     # {tidyverse} tool for handling factors
library(openxlsx)    # Read, Write and Edit xlsx Files
library(flextable)   # Functions for Tabular Reporting

# -- Statistics
library(rstatix)     # Pipe-Friendly Framework for Basic Statistical Tests
library(lmtest)      # Testing Linear Regression Models
library(broom)       # Convert Statistical Objects into Tidy Tibbles
#library(tidymodels) # not installed on this machine
library(performance) # Assessment of Regression Models Performance

# -- Plotting
library(ggplot2)     # Create Elegant Data Visualisations Using the Grammar of Graphics
```
DATASETS FOR TODAY
We will use examples (with adapted datasets) from real clinical studies, provided among the learning materials of open-access books.
Importing Dataset 1 (NHANES)
Name: NHANES (National Health and Nutrition Examination Survey) combines interviews and physical examinations to assess the health and nutritional status of adults and children in the United States. Started in the 1960s, it became a continuous program in 1999.
Documentation: dataset1
Sampling details: here we use a sample of 500 adults from NHANES 2009-2010 & 2011-2012 (nhanes.samp.adult.500 in the R oibiostat package, which has been adjusted so that it can be viewed as a random sample of the US population)
```r
# Check my working directory location
# here::here()

# Use `here` to specify all the subfolders AFTER the working directory
nhanes_samp <- read.csv(file = here::here("practice", "data_input",
                                          "03_datasets", "nhanes.samp.csv"),
                        header = TRUE,             # 1st line holds the variable names
                        sep = ",",                 # the field separator character
                        na.strings = c("?", "NA"), # specific MISSING values
                        row.names = NULL)
```
Adapt the here() call to match your own folder structure.
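Note: the code chunks that follow refer to a nhanes object with snake_case column names (e.g. nhanes$health_gen). That preparation step is not shown here; a minimal sketch, assuming janitor (loaded above), would be:

```r
# ASSUMPTION: create the cleaned `nhanes` object used in later chunks,
# converting column names like `HealthGen` to snake_case (`health_gen`)
nhanes <- janitor::clean_names(nhanes_samp)
```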
NHANES Variables and their description
[EXCERPT: see complete file in Input Data Folder]
| Variable | Type | Description |
|---|---|---|
| X | int | xxxx |
| ID | int | xxxxx |
| SurveyYr | chr | yyyy_mm. Ex. 2011_12 |
| Gender | chr | Gender (sex) of study participant, coded as male or female |
| Age | int | ## |
| AgeDecade | chr | yy-yy. Ex. 20-29 |
| Education | chr | [participants >= 20 years old]. Ex. 8thGrade, 9-11thGrade, HighSchool, SomeCollege, or CollegeGrad |
| Weight | dbl | Weight in kg |
| Height | dbl | Standing height in cm. Reported for participants aged 2 years or older |
| BMI | dbl | Body mass index (weight/height² in kg/m²). Reported for participants aged 2 years or older |
| Pulse | int | 60-second pulse rate |
| DirectChol | dbl | Direct HDL cholesterol in mmol/L. Reported for participants aged 6 years or older |
| TotChol | dbl | Total cholesterol in mmol/L. Reported for participants aged 6 years or older |
| Diabetes | chr | Study participant told by a doctor or health professional that they have diabetes |
| DiabetesAge | int | Age of study participant when first told they had diabetes |
| HealthGen | chr | Self-reported rating of health: Excellent, Vgood, Good, Fair, or Poor |
| Alcohol12PlusYr | chr | Participant has consumed at least 12 drinks of any type of alcoholic beverage in any one year |
| ... | ... | ... |
Importing Dataset 2 (PREVEND)
Name: PREVEND (Prevention of REnal and Vascular END-stage Disease) is a study that took place in the Netherlands starting in the 1990s, with subsequent follow-ups throughout the 2000s. This dataset is from the third survey, which participants completed in 2003-2006; data are provided for 4,095 individuals who completed cognitive testing.
Documentation: dataset2 and the sample dataset variables’ codebook
Sampling details: here we use a sample of 500 adults taken from the 4,095 individuals who completed cognitive testing (i.e. the prevend.samp dataset in the R oibiostat package)
```r
# Check my working directory location
# here::here()

# Use `here` to specify all the subfolders AFTER the working directory
prevend_samp <- read.csv(file = here::here("practice", "data_input",
                                           "03_datasets", "prevend.samp.csv"),
                         header = TRUE,             # 1st line holds the variable names
                         sep = ",",                 # the field separator character
                         na.strings = c("?", "NA"), # specific MISSING values
                         row.names = NULL)
```
PREVEND Variables and their description
[EXCERPT: see complete file in Input Data Folder]
| Variable | Type | Description |
|---|---|---|
| X | int | Patient ID |
| Age | int | Age in years |
| Gender | int | Expressed as: 0 = males; 1 = females |
| RFFT | int | Performance on the Ruff Figural Fluency Test. Scores range from 0 (worst) to 175 (best) |
| VAT | int | Visual Association Test score. Scores may range from 0 (worst) to 12 (best) |
| Chol | dbl | Total cholesterol, in mmol/L |
| HDL | dbl | HDL cholesterol, in mmol/L |
| Statin | int | Statin use at enrollment. Numeric vector: 0 = No; 1 = Yes |
| CVD | int | History of cardiovascular event. Numeric vector: 0 = No; 1 = Yes |
| DM | int | Diabetes mellitus status at enrollment. Numeric vector: 0 = No; 1 = Yes |
| Education | int | Highest level of education. Numeric: 0 = primary school; 1 = lower secondary education; 3 = university |
|  |  | Status of hypertension at enrollment. Numeric vector: 0 = No; 1 = Yes |
| Ethnicity | int | Expressed as: 0 = Western European; 1 = African; 2 = Asian; 3 = Other |
| ... | ... | ... |
Importing Dataset 3 (FAMuSS)
Name: FAMuSS (Functional SNPs Associated with Muscle Size and Strength) examined the association of demographic, physiological, and genetic characteristics with muscle strength, including data on race and genotype at a specific locus on the ACTN3 gene (the “sports gene”).
Documentation: dataset3
Sampling details: the dataset includes 595 observations on 9 variables (famuss in the R oibiostat package)
```r
# Check my working directory location
# here::here()

# Use `here` to specify all the subfolders AFTER the working directory
famuss <- read.csv(file = here::here("practice", "data_input",
                                     "03_datasets", "famuss.csv"),
                   header = TRUE,             # 1st line holds the variable names
                   sep = ",",                 # the field separator character
                   na.strings = c("?", "NA"), # specific MISSING values
                   row.names = NULL)
```
FAMuSS Variables and their description
[See complete file in Input Data Folder]
| Variable | Description |
|---|---|
| X | id |
| ndrm.ch | Percent change in strength in the non-dominant arm |
| drm.ch | Percent change in strength in the dominant arm |
| sex | Sex of the participant |
| age | Age in years |
| race | Recorded as African Am (African American), Caucasian, Asian, Hispanic, Other |
| height | Height in inches |
| weight | Weight in pounds |
| actn3.r577x | Genotype at the location r577x in the ACTN3 gene |
| bmi | Body Mass Index |
CORRELATION
[Using NHANES and FAMuSS datasets]
Explore relationships between two variables
Approaches for summarizing relationships between two variables vary depending on variable types…
Two numerical variables
Two categorical variables
One numerical variable and one categorical variable
Two variables \(x\) and \(y\) are
positively associated if \(y\) increases as \(x\) increases.
negatively associated if \(y\) decreases as \(x\) increases.
TWO NUMERICAL VARIABLES (NHANES)
Two numerical variables (plot)
Height and weight (taken from the nhanes_samp dataset) are positively associated.
Notice we can also use the generic base R function plot for a quick scatter plot, as sketched below
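A minimal sketch of that quick base R plot (assuming the cleaned nhanes object and the codebook units):

```r
# quick scatter plot with base R graphics
plot(x = nhanes$height, y = nhanes$weight,
     xlab = "Height (cm)", ylab = "Weight (kg)",
     main = "NHANES sample: height vs. weight")
```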
Two numerical variables: correlation (with stats::cor)
Correlation is a numerical summary that measures the strength of a linear relationship between two variables.
The correlation coefficient \(r\) takes on values between \(-1\) and \(1\).
The closer \(r\) is to \(\pm 1\), the stronger the linear association.
Here we compute Pearson’s \(r\) (parametric) with the base R function stats::cor
the use argument lets us choose how to deal with missing values (in this case, using only all complete pairs of observations)
is.numeric(nhanes$height)
[1] TRUE
is.numeric(nhanes$weight)
[1] TRUE
```r
# using the `stats` package
stats::cor(x = nhanes$height,
           y = nhanes$weight,
           use = "pairwise.complete.obs", # argument for dealing with missing values
           method = "pearson")
```
[1] 0.4102269
Two numerical variables: correlation (with stats::cor.test)
Here we compute Pearson’s \(r\) (parametric) with the function cor.test (the same one we used for testing paired samples)
it implicitly takes care of NAs
```r
# using the `stats` package
cor_test_result <- cor.test(x = nhanes$height,
                            y = nhanes$weight,
                            method = "pearson")

# looking at the cor estimate
cor_test_result[["estimate"]][["cor"]]
```
[1] 0.4102269
The function ggpubr::ggscatter gives us all in one (scatter plot + \(r\) (“R”))! 🤯
```r
library("ggpubr") # 'ggplot2' Based Publication Ready Plots

ggpubr::ggscatter(nhanes,
                  x = "height", y = "weight",
                  cor.coef = TRUE,
                  cor.method = "pearson",
                  #cor.coef.coord = 2,
                  xlab = "Height (cm)", ylab = "Weight (kg)")
```
Spearman rank-order correlation
The Spearman’s rank-order correlation is the nonparametric version of the Pearson correlation.
Spearman’s correlation coefficient \(\rho\) (also written \(r_s\)) measures the strength and direction of association between two ranked variables.
used when two variables have a monotonic, but not necessarily linear, relationship
excellent for ordinal data (when Pearson’s is not appropriate), e.g. Likert scale items
To compute it, we simply calculate Pearson’s correlation on the rankings of the raw data (instead of the raw values), as in the toy example below.
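A toy illustration of this equivalence (hypothetical vectors, not from the datasets):

```r
# Spearman's rho is just Pearson's r computed on the ranks
x <- c(1, 3, 2, 5, 4)
y <- c(2, 1, 4, 5, 3)
stats::cor(rank(x), rank(y), method = "pearson")
stats::cor(x, y, method = "spearman") # identical result
```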
Spearman rank-order correlation (example)
Let’s say we want to get Spearman’s correlation with ordinal factors Education and HealthGen in the NHANES sample.
We have to convert them to their underlying numeric code, to compare rankings.
tabyl(nhanes$education)
nhanes$education n percent valid_percent
8th Grade 32 0.064 0.06412826
9 - 11th Grade 68 0.136 0.13627255
College Grad 157 0.314 0.31462926
High School 94 0.188 0.18837675
Some College 148 0.296 0.29659319
<NA> 1 0.002 NA
tabyl(nhanes$health_gen)
nhanes$health_gen n percent valid_percent
Excellent 47 0.094 0.10444444
Fair 53 0.106 0.11777778
Good 177 0.354 0.39333333
Poor 11 0.022 0.02444444
Vgood 162 0.324 0.36000000
<NA> 50 0.100 NA
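The conversion step itself is not shown in the original; a minimal sketch, assuming the levels listed above are ordered from lowest to highest, would be:

```r
# ASSUMPTION: encode the ordinal levels explicitly (lowest to highest),
# then store their integer codes as the rank variables used below
nhanes$edu_rank <- as.integer(factor(nhanes$education,
  levels = c("8th Grade", "9 - 11th Grade", "High School",
             "Some College", "College Grad")))
nhanes$health_rank <- as.integer(factor(nhanes$health_gen,
  levels = c("Poor", "Fair", "Good", "Vgood", "Excellent")))
```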
After setting up the variables in the correct (numerical rank) format, we can actually compute it:
same function call stats::cor.test
but specifying the argument method = "spearman"
```r
# -- using the `stats` package
cor_test_result_sp <- cor.test(x = nhanes$edu_rank,
                               y = nhanes$health_rank,
                               method = "spearman",
                               exact = FALSE) # removes the ties warning message

# looking at the cor estimate
cor_test_result_sp
```
Spearman's rank correlation rho
data: nhanes$edu_rank and nhanes$health_rank
S = 10641203, p-value = 1.915e-10
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.2946493
```r
# -- only print Spearman's rho
#cor_test_result_sp[["estimate"]][["rho"]]
```
TWO CATEGORICAL VARIABLES (FAMuSS)
Two categorical variables (plot)
In the famuss dataset, the variables race, and actn3.r577x are categorical variables.
Specifically, the variable actn3.r577x takes on three possible levels (CC, CT, or TT) which indicate the distribution of genotype at location r577x on the ACTN3 gene for the FAMuSS study participants.
A contingency table summarizes data for two categorical variables.
the function stats::addmargins puts arbitrary Margins on multidimensional tables
The extra column & row "Sum" provide the marginal totals across each row and each column, respectively
```r
# levels of actn3.r577x
table(famuss$actn3.r577x)
```
CC CT TT
173 261 161
```r
# contingency table to summarize race and actn3.r577x
addmargins(table(famuss$race, famuss$actn3.r577x))
```
CC CT TT Sum
African Am 16 6 5 27
Asian 21 18 16 55
Caucasian 125 216 126 467
Hispanic 4 10 9 23
Other 7 11 5 23
Sum 173 261 161 595
Two categorical variables (contingency table prop)
Contingency tables can also be converted to show proportions. Since there are 2 variables, it is necessary to specify whether the proportions are calculated according to the row variable or the column variable.
using the margin = argument in the base::prop.table function (1 indicates rows, 2 indicates columns), as in the sketch below
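The chunk that produced the output below is not shown; a sketch reconstructing it (an assumption, based on each column summing to 1) is:

```r
# column-wise proportions (margin = 2): each genotype column sums to 1
addmargins(prop.table(table(famuss$race, famuss$actn3.r577x), margin = 2))
```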
CC CT TT Sum
African Am 0.09248555 0.02298851 0.03105590 0.14652996
Asian 0.12138728 0.06896552 0.09937888 0.28973168
Caucasian 0.72254335 0.82758621 0.78260870 2.33273826
Hispanic 0.02312139 0.03831418 0.05590062 0.11733618
Other 0.04046243 0.04214559 0.03105590 0.11366392
Sum 1.00000000 1.00000000 1.00000000 3.00000000
Chi-squared test of independence
The Chi-squared test is a hypothesis test used to determine whether there is a relationship between two categorical variables.
categorical vars. can have nominal or ordinal measurement scale
the observed frequencies are compared with the expected frequencies and their deviations are examined.
```r
# Chi-squared test
# (Test of association to see if
# H0: the 2 cat. var. (race & actn3.r577x) are independent
# H1: the 2 cat. var. are correlated in __some way__
tab <- table(famuss$race, famuss$actn3.r577x)
test_chi <- chisq.test(tab)
```
the obtained result (test_chi) is a list of objects…
You try…
…run View(test_chi) to check
Chi-squared test of independence (cont.)
Within test_chi results there are:
Observed frequencies =
how often a combination occurs in our sample
```r
# Observed frequencies
test_chi$observed
```
CC CT TT
African Am 16 6 5
Asian 21 18 16
Caucasian 125 216 126
Hispanic 4 10 9
Other 7 11 5
Expected frequencies = what the frequencies would be if the 2 variables were PERFECTLY INDEPENDENT (see below)
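They can be extracted from the same test object:

```r
# Expected frequencies (under H0 of independence)
test_chi$expected
```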
Recall that Cramér’s V allows us to measure the effect size of the test of independence (i.e. the strength of association between two nominal variables)
\(V\) ranges in \([0, 1]\) (the smaller \(V\), the weaker the association)
\[V=\sqrt{\frac{\chi^2}{n(k-1)}} \]
where:
\(V\) denotes Cramér’s V
\(\chi^2\) is the Pearson chi-square statistic from the prior test
\(n\) is the sample size involved in the test
\(k\) is the lesser number of categories of either variable
Computing Cramer’s V after test of independence (2 ways)
✍🏻 “By hand” first to see the steps
```r
# Compute Cramér's V by hand
# inputs
chi_calc <- test_chi$statistic
n   <- nrow(famuss)            # N of observations
n_r <- nrow(test_chi$observed) # number of rows in the contingency table
n_c <- ncol(test_chi$observed) # number of columns in the contingency table

# Cramér's V
sqrt(chi_calc / (n * min(n_r - 1, n_c - 1)))
```
```r
# Cramér's V with rstatix
rstatix::cramer_v(test_chi$observed)
```
[1] 0.1276816
Cramér’s V ≈ 0.13, which indicates a relatively weak association between the two categorical variables: while there may be some relationship between them, it is not particularly strong.
Chi-squared test of goodness of fit
In some cases, the Chi-squared test examines whether an observed frequency distribution matches an expected theoretical distribution.
Here, we are conducting a type of Chi-square Goodness of Fit Test which:
serves to test whether the observed distribution of a categorical variable differs from your expectations
interprets the statistic based on the discrepancies between observed and expected counts
Chi-squared test of goodness of fit (example)
Since the participants of the FAMuSS study were volunteers at a university, they did not come from a “representative” sample of the US population. We can therefore use the \(\chi^{2}\) goodness-of-fit test to test against:
\(H_{0}\): the study participants (1st row below) are racially representative of the general population (2nd row below)
| Race | African American | Asian | Caucasian | Other | Total |
|---|---|---|---|---|---|
| FAMuSS (Observed) | 27 | 55 | 467 | 46 | 595 |
| US Census (Expected) | 76.16 | 5.95 | 478.38 | 34.51 | 595 |
We use the formula \[\chi^{2} = \sum_{k}\frac{(Observed - Expected)^{2}}{Expected}\]
Under \(H_{0}\), the sample proportions should equal the population proportions.
Chi-squared test of goodness of fit (example)
```r
# Subset the vectors of frequencies from the 2 rows
observed <- c(27, 55, 467, 46)
expected <- c(76.2, 5.95, 478.38, 34.51)

# Calculate the Chi-squared statistic manually
chi_sq_statistic <- sum((observed - expected)^2 / expected)
df <- length(observed) - 1
p_value <- 1 - pchisq(chi_sq_statistic, df)

# Print results
chi_sq_statistic
```
[1] 440.2166
df
[1] 3
p_value
[1] 0
The calculated \(\chi^{2}\) statistic is very large, and the p_value is close to 0. Hence, there is more than sufficient evidence to reject the null hypothesis that the sample is representative of the general population.
Comparing the observed and expected values (or the residuals), we find the largest discrepancy in the over-representation of Asian study participants.
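For reference, the same goodness-of-fit test can be run directly with stats::chisq.test by passing the expected proportions through the p argument; a sketch using the census counts above:

```r
# built-in equivalent of the manual computation
chisq.test(x = c(27, 55, 467, 46),
           p = c(76.16, 5.95, 478.38, 34.51) / 595) # proportions sum to 1
```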
SIMPLE LINEAR REGRESSION
[Using NHANES dataset]
Visualize the data: BMI and age
We are mainly looking for a “vaguely” linear shape here
ggplot2 gives us a visual confirmation with geom_point()
Essentially, geom_smooth() adds a trend line over an existing plot
inside the function, we have different options with the method argument (default is LOESS (locally estimated scatterplot smoothing))
with method = lm we get the linear best fit (the least squares regression line) & its 95% CI
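A minimal sketch of the plot just described (assuming the cleaned nhanes object):

```r
# scatter plot of BMI vs age with the least-squares line and its 95% CI
ggplot2::ggplot(nhanes, ggplot2::aes(x = age, y = bmi)) +
  ggplot2::geom_point(alpha = 0.5) +
  ggplot2::geom_smooth(method = "lm") +
  ggplot2::labs(x = "Age (years)", y = "BMI (kg/m^2)")
```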
The lm() function, used to fit linear models, has the following generic structure:
lm(y ~ x, data)
where:
the 1st argument y ~ x specifies the variables used in the model (here the model regresses a response variable \(y\) against an explanatory variable \(x\)).
The 2nd argument data is used only when the dataframe name is not already specified in the first argument.
Linear regression models syntax
The following example shows how to fit a linear model that predicts BMI from age (in years) using the nhanes adult sample (individuals 21 years of age or older from the NHANES data).
```r
# fitting a linear model
lm(nhanes$bmi ~ nhanes$age)

# or equivalently...
lm(bmi ~ age, data = nhanes)
```
Call:
lm(formula = bmi ~ age, data = nhanes)
Coefficients:
(Intercept) age
28.40113 0.01982
Running the function creates an object (of class lm) that contains several components (model coefficients, etc.), either displayed directly or accessible via summary() or specific extractor functions.
Linear regression models syntax
We can save the model and then extract individual output elements from it using the $ syntax
```r
# name the model object
lr_model <- lm(bmi ~ age, data = nhanes)

# extract model output elements
lr_model$coefficients
lr_model$residuals
lr_model$fitted.values
```
The command summary returns these elements
Call: reminds the equation used for this regression model
Residuals: a 5 number summary of the distribution of residuals from the regression model
Coefficients: displays the estimated coefficients of the regression model and the associated hypothesis tests, given for:
intercept
explanatory variable(s) slope
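For example:

```r
# full model summary: Call, Residuals, Coefficients, fit statistics
summary(lr_model)
```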
Linear regression models interpretation: coefficients
The model tests the null hypothesis \(H_{0}\) that a coefficient is 0
the coefficient outputs are: estimate, std. error, t-statistic, and the p-value corresponding to the t-statistic, for:
intercept
explanatory variable(s) slope
In regression, the population parameter of interest is typically the slope parameter
in this model, the coefficient of age does not appear significantly different from 0
summary(lr_model)$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 28.40112932 0.96172389 29.531480 2.851707e-111
age 0.01981675 0.01824641 1.086063 2.779797e-01
Linear regression models interpretation: Coefficients 2
For each estimated coefficient of the regression model, we get:
Estimate = the average increase in the response variable associated with a one unit increase in the predictor variable, (assuming all other predictor variables are held constant).
Std. Error = a measure of the uncertainty in our estimate of the coefficient.
t value = the t-statistic for the predictor variable, calculated as (Estimate) / (Std. Error).
Pr(>|t|) = the p-value corresponding to the t-statistic. If it is less than some alpha level (e.g. 0.05), the predictor variable is said to be statistically significant.
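As a side note, broom (loaded above) returns the same coefficient table as a tidy tibble, which is convenient for further manipulation:

```r
# coefficient table as a tibble (term, estimate, std.error, statistic, p.value)
broom::tidy(lr_model)
```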
Linear regression models outputs: fitted values
Here we see \(\hat{y}_i\), i.e. the fitted \(y\) value for the \(i\)-th individual
```r
fit_val <- lr_model$fitted.values

# print the first 6 elements
head(fit_val)
```
Linear regression model’s fit: Residual standard error
The Residual standard error (an estimate of the parameter \(\sigma\)) tells the average distance that the observed values fall from the regression line (we are assuming constant variance).
The smaller it is, the better the model fits the dataset!
```r
# Residual standard error (like a standard deviation)
# --- inputs
# sample size
n <- length(lr_model$residuals)
# n of parameters in the model (subtract one to ignore the intercept)
k <- length(lr_model$coefficients) - 1
# degrees of freedom of the residuals
df_resid <- n - k - 1
# Sum of Squared Errors
SSE <- sum(lr_model$residuals^2) # 22991.19

# --- Residual Standard Error
ResStdErr <- sqrt(SSE / df_resid) # 6.815192
ResStdErr
```
[1] 6.815192
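The same value is also stored in the model summary:

```r
# built-in equivalent of the "by hand" computation above
summary(lr_model)$sigma
```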
Linear regression model’s fit: \(R^2\) and \(Adj. R^2\)
The \(R^2\) tells us the proportion of the variance in the response variable that can be explained by the predictor variable(s).
if \(R^2\) close to 0 -> data more spread
if \(R^2\) close to 1 -> data more tight around the regression line
```r
# --- R^2
summary(lr_model)$r.squared
```
[1] 0.00237723
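Equivalently, \(R^2\) can be computed from its definition, \(1 - SSE/SST\); a minimal sketch using only the rows the model actually used:

```r
# R^2 from its definition
y   <- model.frame(lr_model)$bmi # response values used in the fit
SSE <- sum(lr_model$residuals^2) # unexplained variation
SST <- sum((y - mean(y))^2)      # total variation
1 - SSE / SST
```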
The \(Adj. R^2\) is a modified version of \(R^2\) that has been adjusted for the number of predictors in the model.
It is always lower than the R-squared
It can be useful for comparing the fit of different regression models that use different numbers of predictor variables.
```r
# --- Adj. R^2
summary(lr_model)$adj.r.squared
```
[1] 0.0003618303
Linear regression model’s fit: F statistic
The F-statistic indicates whether the regression model provides a better fit to the data than a model that contains no independent variables. In essence, it tests if the regression model as a whole is useful.
```r
# extract only the F statistic
summary(lr_model)$fstatistic
```
value numdf dendf
1.179533 1.000000 495.000000
```r
# define a function to extract the overall p-value of the model
overall_p <- function(my_model) {
  f <- summary(my_model)$fstatistic
  p <- pf(f[1], f[2], f[3], lower.tail = FALSE)
  attributes(p) <- NULL
  return(p)
}

# extract the overall p-value of the model
overall_p(lr_model)
```
[1] 0.2779797
Given that the p-value is > 0.05, this indicates that the predictor variable is not useful for predicting the value of the response variable.
DIAGNOSTIC PLOTS
The following plots help us check whether (most of) the assumptions of linear regression are met!
(the independence assumption is more linked to the study design than to the data used in modeling)
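As a complement to the base R plots below, the performance package (loaded above) bundles several of these checks into a single call:

```r
# all-in-one visual diagnostics for the fitted model
performance::check_model(lr_model)
```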
Linear regression diagnostic plots: residuals 1/4
ASSUMPTION 1: there exists a linear relationship between the independent variable, x, and the dependent variable, y
For an observation \((x_i, y_i)\), where \(\hat{y}_i\) is the predicted value according to the line \(\hat{y} = b_0 + b_1x\), the residual is the value \(e_i = y_i - \hat{y}_i\)
A linear model (e.g. lr_model) is a particularly good fit for the data when the residual plot shows random scatter above and below the horizontal line.
(In this R plot, we look for a red line that is fairly straight)
```r
# residual plot
plot(lr_model, which = 1)
```
We use the argument which in the function plot so we see the plots one at a time.
Linear regression diagnostic plots: normality of residuals 2/4
ASSUMPTION 2: The residuals of the model are normally distributed
With the quantile-quantile (Q-Q) plot we can check the normality of the residuals.
```r
# quantile-quantile plot
plot(lr_model, which = 2)
```
The data appear roughly normal, but there are deviations from normality in the tails, particularly the upper tail.