Course Description & Objectives

Course Description

This intensive course was originally designed for PhD students and researchers working with biological datasets. It helps them consolidate their practical understanding of the data analysis and statistical modeling techniques most frequently used in academic research. The course focuses on empowering participants to conduct end-to-end data analyses, guiding them through a reproducible cycle of data storage, cleaning, exploration, analysis, and interpretation.

Beyond the conceptual foundations, participants engage in hands-on coding sessions in the R software with real-world datasets and solved questions, validating in practice the skills needed to confidently acquire and communicate insights throughout the analytical process.

Particular emphasis is placed on developing an understanding of the statistical methods used, when to apply them, and how to interpret them, in close connection to “real life” situations for research scientists.

Target Audience

The target audience for this course is graduate students, post-doctoral fellows, and researchers in industry or academia already familiar with basic statistics (ideally in R) and looking to learn more about conducting an analysis from start to finish with more intermediate or advanced statistical techniques.

Prerequisites

While there are no specific prerequisites, because of the intensive nature of this course, participants will make the most of it if they have prior exposure to quantitative work in the biology/life science field. Similarly, some prior exposure to R programming will make the lab sessions more engaging.

Detailed instructions are provided for installing R and RStudio ahead of the workshop, so that participants can follow along with the lab sessions.

Course Outline

Module 1: Introduction to R and data analysis

  • Introduction to reproducible end-to-end analysis using R
    • Why use R?
    • Principles of reproducible analysis with R + RStudio
    • R objects, functions, packages
  • Discussion of different variable types (qualitative, quantitative) and levels of measurement (nominal, ordinal, interval, ratio)
    • Principles of “tidy data”
    • Introduction to data cleaning and manipulation methods
  • Descriptive statistics
    • Univariate analysis
    • Measures of central tendency, measures of variability (or spread), and frequency distributions
  • Visual data exploration
    • Introduction to ggplot2 package for making graphs in R
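
As an illustration of the kind of code this module works toward, here is a minimal sketch of descriptive statistics and a first plot. It assumes R's built-in `iris` dataset rather than the course's own data, and the object names (`centre`, `spread`, `species_counts`, `group_means`) are chosen only for this example. The plot is drawn only if the ggplot2 package happens to be installed.

```r
# Built-in example data; the course datasets are not reproduced here.
data(iris)

# Variable types: Species is nominal (a factor); the four measurements are ratio-scale.
str(iris)

# Univariate analysis of one quantitative variable.
sepal <- iris$Sepal.Length
centre <- c(mean = mean(sepal), median = median(sepal))   # central tendency
spread <- c(sd = sd(sepal), iqr = IQR(sepal),
            min = min(sepal), max = max(sepal))            # variability
print(round(centre, 2))
print(round(spread, 2))

# Frequency distribution of a qualitative variable.
species_counts <- table(iris$Species)
print(species_counts)

# Group-wise summary: mean sepal length per species.
group_means <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
print(group_means)

# Visual exploration with ggplot2, skipped if the package is not installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(iris, ggplot2::aes(x = Species, y = Sepal.Length)) +
    ggplot2::geom_boxplot() +
    ggplot2::labs(x = "Species", y = "Sepal length (cm)")
  print(p)
}
```

The data frame is already “tidy” in the sense used above: one row per observation and one column per variable.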

Module 2: Statistical inference and classical hypothesis testing

  • Purpose and foundations of inferential statistics
    • Probability and random variables
    • Meaningful probability distributions
    • Sampling distributions and Central Limit Theorem
  • Getting to know the “language” of hypothesis testing
    • The null and alternative hypothesis
    • The probability of error (α, or “significance level”)
    • The p-value and the interpretation of test results
    • Confidence Intervals
    • Types of errors (Type 1 and Type 2)
    • Practical vs. statistical significance
  • Examples of hypothesis tests
    • Comparing sample mean to a hypothesized population mean (Z test & t test)
    • Comparing two independent sample means (t test)
    • Comparing sample means from 3 or more groups (ANOVA)
  • A closer look at testing assumptions (with examples)
    • Testing two groups that are not independent
    • Testing when the data are not normally distributed: non-parametric tests
    • Testing samples with unequal variances
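
The tests listed in this module can be sketched with base R functions, as below. This is an illustrative sketch, not the course's lab material: it assumes the built-in `ToothGrowth`, `PlantGrowth`, and `sleep` datasets, and the hypothesized mean of 18 and the 0.05 significance level are arbitrary choices for the example.

```r
data(ToothGrowth)   # tooth length by supplement type (two groups)
data(PlantGrowth)   # plant weight under three conditions
data(sleep)         # extra sleep for the same 10 patients under two drugs

alpha <- 0.05       # significance level (assumed for this example)

# One sample mean vs. a hypothesized population mean (illustrative value: 18).
one_sample <- t.test(ToothGrowth$len, mu = 18)

# Two independent sample means. t.test() defaults to the Welch test,
# which does not assume equal variances (var.equal = FALSE).
two_sample <- t.test(len ~ supp, data = ToothGrowth)

# Three or more groups: one-way ANOVA.
anova_fit <- aov(weight ~ group, data = PlantGrowth)
anova_p <- summary(anova_fit)[[1]][["Pr(>F)"]][1]

# Two groups that are not independent: paired t test on the same patients.
drug1 <- sleep$extra[sleep$group == 1]
drug2 <- sleep$extra[sleep$group == 2]
paired <- t.test(drug1, drug2, paired = TRUE)

# Non-parametric alternative to the two-sample t test (Wilcoxon rank-sum).
# exact = FALSE requests the normal approximation because the data contain ties.
nonparam <- wilcox.test(len ~ supp, data = ToothGrowth, exact = FALSE)

# Collect the p-values and compare each one with alpha.
p_values <- c(one_sample = one_sample$p.value,
              two_sample = two_sample$p.value,
              anova      = anova_p,
              paired     = paired$p.value,
              wilcoxon   = nonparam$p.value)
reject_null <- p_values < alpha
print(round(p_values, 4))
print(reject_null)

# Confidence interval for the difference between the two supplement groups.
print(two_sample$conf.int)
```

A p-value below `alpha` only indicates statistical significance; the confidence interval gives a sense of how large the difference may be.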

Module 3: Modeling correlation and regression

  • Testing and summarizing the relationship between two numerical variables (correlation)
    • Pearson’s \(r\) (parametric)
    • Spearman’s rank correlation (non-parametric)
  • Measures of association between categorical variables
    • Chi-Square Test of Independence
    • Fisher’s Exact Test
  • From correlation/association to prediction/causation
    • The purpose of observational and experimental studies
  • Introduction to regression-based statistical methods
    • Simple linear regression models
    • Multiple Linear Regression models
  • Shifting the emphasis to empirical prediction
    • Introduction to Machine Learning (ML)
    • Distinction between Supervised and Unsupervised algorithms
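
The correlation, association, and regression methods named in this module map onto a handful of base R calls. The sketch below assumes the built-in `mtcars` dataset; the variable pairings (fuel economy vs. weight, transmission vs. engine shape) are illustrative choices, not taken from the course labs.

```r
data(mtcars)

# Correlation between two numerical variables.
pearson  <- cor.test(mtcars$mpg, mtcars$wt, method = "pearson")
# exact = FALSE because ties prevent an exact p-value for Spearman's test.
spearman <- cor.test(mtcars$mpg, mtcars$wt, method = "spearman", exact = FALSE)

# Association between two categorical variables (2 x 2 contingency table).
tab <- table(transmission = mtcars$am, engine_shape = mtcars$vs)
chi    <- chisq.test(tab)    # Chi-square test of independence
fisher <- fisher.test(tab)   # Fisher's exact test, suited to small counts

# Simple linear regression: one predictor.
simple_fit <- lm(mpg ~ wt, data = mtcars)

# Multiple linear regression: two predictors.
multiple_fit <- lm(mpg ~ wt + hp, data = mtcars)

print(pearson$estimate)
print(spearman$estimate)
print(tab)
print(c(chi_square = chi$p.value, fisher = fisher$p.value))
print(coef(simple_fit))
print(c(simple   = summary(simple_fit)$r.squared,
        multiple = summary(multiple_fit)$r.squared))

# Prediction for a new observation (a hypothetical car with wt = 3, hp = 110).
new_car <- data.frame(wt = 3, hp = 110)
print(predict(multiple_fit, newdata = new_car, interval = "prediction"))
```

In simple linear regression the squared Pearson correlation equals the model's R-squared, which is one way to see how correlation and regression are connected.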

Module 4: Introduction to machine learning

  • Examples of Machine Learning algorithms
    • PCA – “unsupervised” ML algorithm for dimensionality reduction
    • PLS-Discriminant Analysis – “supervised” alternative to PCA performing simultaneous dimensionality reduction and classification
  • Introduction to MetaboAnalyst software
    • Overview of a useful R-based resource for metabolomics
    • Illustrative workflow with MetaboAnalyst
  • Elements of statistical Power Analysis
    • Brief review of hypothesis testing framework (from Module 2)
    • Review of type I and type II decision errors, contextualizing them in experimental settings
    • Understanding the test’s statistical power in connection to the effect size of an experiment
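
Two of the topics in this module, PCA and power analysis, can be sketched in base R as shown below. The example assumes the built-in `iris` dataset and an arbitrary standardized effect size of 1. PLS-DA and MetaboAnalyst are not shown, since they rely on additional packages or software outside base R.

```r
data(iris)

# PCA: unsupervised dimensionality reduction on the four numeric measurements.
# scale. = TRUE standardizes each variable so that units do not drive the result.
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Proportion of total variance captured by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
names(var_explained) <- colnames(pca$x)
print(round(var_explained, 3))

# Scores on the first two components; Species is attached only for inspection
# and was not used to compute the components.
scores <- data.frame(pca$x[, 1:2], Species = iris$Species)
print(head(scores))

# Power analysis for a two-sample t test: sample size per group needed to
# detect a difference of 1 standard deviation (assumed effect size) with
# 80% power at a 5% significance level.
needed <- power.t.test(delta = 1, sd = 1, sig.level = 0.05, power = 0.80)
n_per_group <- ceiling(needed$n)
print(n_per_group)

# The reverse question: the power achieved with 10 observations per group.
achieved <- power.t.test(n = 10, delta = 1, sd = 1, sig.level = 0.05)
print(round(achieved$power, 3))
```

Smaller effect sizes require larger samples to reach the same power, which is why power analysis is best done before an experiment is run.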

Programming Labs

Each of the above modules is accompanied by a matching practical session intended to consolidate the theoretical concepts via hands-on R coding.

  • The illustrative examples used are based on real biology/clinical research data.
  • In each example, the student is guided through the entire process: acquiring and reading data into R, identifying the appropriate analytical method, running the analysis and, finally, interpreting the obtained outcomes.
  • For every exercise, the participant is provided with input datasets (complete with documentation) and R code source files (*.R) with solved examples for future reference.
  • The instructor will also use the lab sessions as an opportunity to discuss common questions and challenges that are normally encountered in the day-to-day life of research scientists.