0.0 Introduction

This is a hands-on tutorial in which we’re going to explore a dummy dataset and conduct some statistical analyses.

We are going to explore the different variable types in our data, run tests of normality and association, and plot our results.


1.0 Loading Libraries and Data

In R, a package comprises a specific set of tools and functions that you use to view, manipulate (or wrangle), and analyze your data.

A function carries out a specific task on your data.

There are several R packages available, each performing a range of different functions. The same output can often be generated by similar functions from different packages; which one you use is up to you.

Before we start, we need to ‘load’ each computational package that we will be using in our R script. We do this by first installing the packages we want using the install.packages function, and then loading them into our environment using the library function.

You write all of your code in special code chunks (Ctrl + Alt + I on Windows, Cmd + Option + I on Mac), and you can name each chunk after the r at the top of the chunk header. You can also include several options for each chunk - here I will set the chunks to not display any warnings, messages, or errors.

You can leave comments for yourself/others, and keep unexecuted code, throughout your code chunks using the # symbol.

#install.packages("tidyverse") #you can use single quotation marks ' ' or double quotation marks " "

#data handling packages
library(tidyverse) #contains the dplyr, stringr, readr, tidyr, tibble, forcats, purrr, and ggplot2 packages
library(here) 
library(openxlsx)
library(data.table)
library(rlist)
library(readxl)
library(DT)
library(rmarkdown)
library(knitr)
library(factoextra)
library(reshape2)

#analysis packages
library(pheatmap)
library(limma)

#extra packages
library(kableExtra)
library(formatR)
library(janitor)
library(scales)
library(ggpubr)
library(RColorBrewer)

Let’s now load in our data. We will be using the tidyverse package to handle our data. This ‘super package’ contains eight different packages, and allows you to write and execute your code in a cleaner format compared to the functions of the default R language (i.e., base R functions).

Let’s first check our working directory using the here package. This will usually be where you have saved your R .rmd and .md files.
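
A handy side effect: the here() function can also build full file paths relative to the project root, so file-reading code works no matter where the script is run from. A minimal sketch (not run here, assuming BCCHR_Dataset.csv sits in the project root):

#here::here("BCCHR_Dataset.csv") #builds the full path to the file within the project folder
#data <- read.csv(here::here("BCCHR_Dataset.csv")) #equivalent to the plain read.csv call below, but independent of the current working directory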

Our data consists of a few recorded clinical variables obtained from a patient group.

here::here() 
## [1] "/Users/nikitatelkar/Google Drive/UBC Docs/Volunteer/BCCHR Stats/Summer_Stats_Tutorial"
data <- read.csv(file = "BCCHR_Dataset.csv")

info <- read.xlsx("BCCHR_Dataset_info.xlsx")

dim(data)
## [1] 100  11
str(data)
## 'data.frame':    100 obs. of  11 variables:
##  $ group1_id              : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ group1_gender          : int  1 2 2 1 1 2 1 2 2 1 ...
##  $ group1_age             : int  20 19 21 24 19 21 22 24 23 22 ...
##  $ group1_weekly_sleep_hrs: num  6 6.5 5 7 7.5 7 9 5 6 7 ...
##  $ group1_sleep_cat       : chr  "high" "high" "low" "high" ...
##  $ group1_steps           : int  1 1 0 0 1 1 0 0 1 0 ...
##  $ group1_study_cat       : int  1 2 3 1 2 3 1 2 3 1 ...
##  $ group1_pre_gpa         : num  3 2.8 3.3 1.8 2 4 3.7 2.2 3.6 2.4 ...
##  $ group1_post_gpa        : num  3.2 2.7 3.2 2 2.3 3.8 3.5 2.4 3.8 2.3 ...
##  $ group2_post_gpa        : num  3.3 3.8 3.2 2.3 2.5 3 2.8 3.2 3.4 2.5 ...
##  $ group1_health          : int  1 1 1 1 0 1 1 1 0 1 ...

The dim function gives us the dimensions of our dataset. It looks like our data has 100 rows and 11 columns.

The str function shows us the data type of each of our variables: chr = character, int = integer, num = numeric.

There’s also an information file that goes along with our data file, which explains a few of the variables present in our data. Let’s now look at the first few rows of our data, and the column names, i.e., the different metrics that we have.

head(data) #displays the first 6 rows of a dataframe by default
##   group1_id group1_gender group1_age group1_weekly_sleep_hrs group1_sleep_cat
## 1         1             1         20                     6.0             high
## 2         2             2         19                     6.5             high
## 3         3             2         21                     5.0              low
## 4         4             1         24                     7.0             high
## 5         5             1         19                     7.5             high
## 6         6             2         21                     7.0             high
##   group1_steps group1_study_cat group1_pre_gpa group1_post_gpa group2_post_gpa
## 1            1                1            3.0             3.2             3.3
## 2            1                2            2.8             2.7             3.8
## 3            0                3            3.3             3.2             3.2
## 4            0                1            1.8             2.0             2.3
## 5            1                2            2.0             2.3             2.5
## 6            1                3            4.0             3.8             3.0
##   group1_health
## 1             1
## 2             1
## 3             1
## 4             1
## 5             0
## 6             1
names(data) #displays column names of a dataframe
##  [1] "group1_id"               "group1_gender"          
##  [3] "group1_age"              "group1_weekly_sleep_hrs"
##  [5] "group1_sleep_cat"        "group1_steps"           
##  [7] "group1_study_cat"        "group1_pre_gpa"         
##  [9] "group1_post_gpa"         "group2_post_gpa"        
## [11] "group1_health"
info
##                  Variables
## 1                group1_id
## 2            group1_gender
## 3               group1_age
## 4  group1_weekly_sleep_hrs
## 5         group1_sleep_cat
## 6             group1_steps
## 7         group1_study_cat
## 8           group1_pre_gpa
## 9          group1_post_gpa
## 10         group2_post_gpa
## 11           group1_health
##                                                          Option(s)
## 1                                                             <NA>
## 2                                  1 = male, 2 = female, 3 = other
## 3                                                             <NA>
## 4                                                             <NA>
## 5                                    high > 6 hours, low < 6 hours
## 6                                 1 > 10000 steps, 0 < 10000 steps
## 7  1  = no change, 2 = study group, 3 = study group + office hours
## 8                                                             <NA>
## 9                                                             <NA>
## 10                                                            <NA>
## 11                                              1 = good, 2 = poor
##                                       Description
## 1                                            <NA>
## 2                                            <NA>
## 3          Average sleep hours in the last 7 days
## 4                                            <NA>
## 5                        Average sleep (high/low)
## 6          Average steps in the last 7 days (1/0)
## 7                                            <NA>
## 8          GPA mid-way through semester (group 1)
## 9                   GPA end of semester (group 1)
## 10 GPA end of semester (group 2) - no gpa_t1 data
## 11              Overall health in the last 7 days

Some functions need quotation marks around their arguments and some don’t. Quotation marks are usually used when you are supplying a character string, such as a file name or a name you are assigning.
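
A quick illustration (my own example, not run here): character strings such as file names need quotes, while column names inside tidyverse verbs usually don’t.

#read.csv("BCCHR_Dataset.csv") #the file name is a character string, so it is quoted
#data %>% dplyr::select(group1_age) #the column name inside select is left unquoted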

Aah, great. So, the clinical variables that we have are: ID, gender, age, weekly sleep hours, sleep category, step count, study category, mid-semester GPA, end-of-semester GPA (for groups 1 and 2), and overall health.

We have a mix of binary, nominal, discrete, and continuous variables. A variable (x) is something that can take on multiple values. For example, the variable group1_id takes on a different value for each of the 100 observations.

Try to guess which variable is which type! The statistical tests we use depend on that.
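
If you want a hint, you can check how R stores each column; the storage class (chr/int/num) is not the same thing as the statistical variable type, but it narrows the options down. A small sketch (not run here):

#sapply(data, class) #returns the storage class of every column in the dataframe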


2.0 Gender, Age and Steps

Let’s check the breakdown of the genders of our patients.

data$group1_gender #you can select the column you want to perform a function on by using the $ symbol.
##   [1] 1 2 2 1 1 2 1 2 2 1 2 1 2 2 1 3 1 2 1 2 2 1 3 1 3 2 3 1 2 3 1 2 2 1 2 1 1
##  [38] 2 3 2 1 2 3 3 1 1 2 2 1 2 1 2 2 2 1 3 1 2 2 3 2 1 3 1 2 3 2 2 1 1 1 2 1 2
##  [75] 1 2 3 2 3 1 2 1 2 3 1 2 2 1 2 1 1 1 2 1 2 1 2 3 1 2
data %>% 
  #this %>% symbol is called a `pipe`. It passes the dataframe on its left into the function on its right
  dplyr::count(group1_gender) #here I am specifying to use the `count` function specifically from the `dplyr` package using the :: operator. The count function gives us the frequencies.
##   group1_gender  n
## 1             1 40
## 2             2 44
## 3             3 16
#Let's look at the percentage breakdown.  
data %>% 
  dplyr::count(group1_gender) %>% 
  mutate(perc = (n*100)/100) #mutate allows us to make a new variable using existing data. The name on the left side of the = sign is the name of your new column, and on the right is the calculation using already existing columns
##   group1_gender  n perc
## 1             1 40   40
## 2             2 44   44
## 3             3 16   16
#the counts and the percentages are identical here because our dataset happens to contain exactly 100 individuals
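
If the sample size weren’t exactly 100, we could let R work out the denominator itself. A sketch of the same percentage calculation without the hard-coded 100 (not run here):

#data %>% 
#  dplyr::count(group1_gender) %>% 
#  mutate(perc = n / sum(n) * 100) #sum(n) is the total number of patients, so this works for any sample size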

Some questions we’re going to ask now:

What’s the average age of our patients?

mean(data$group1_age) #average age is 21.75
## [1] 21.75
min(data$group1_age) #minimum age is 17
## [1] 17
max(data$group1_age) #maximum age is 25
## [1] 25
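
As a shortcut, the base R summary function reports the minimum, quartiles, median, mean, and maximum in a single call (not run here):

#summary(data$group1_age) #min, 1st quartile, median, mean, 3rd quartile, and max at once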

Does age, as a variable, display a normal distribution of data?
We’ll use the hist function and ggplot2 package to visualize our data.

hist(data$group1_age) #histogram function

Visually, we can’t be sure whether age is normally distributed. Let’s perform a hypothesis test, the Shapiro-Wilk Test of Normality, to check whether it is:

shapiro.test(data$group1_age)
## 
##  Shapiro-Wilk normality test
## 
## data:  data$group1_age
## W = 0.95069, p-value = 0.0009214

The p-value tells us whether we have enough evidence against our null hypothesis, that age is normally distributed: if p > 0.05 we fail to reject it, and if p < 0.05 we reject it in favour of the alternative hypothesis, that age is not normally distributed.

Null Hypothesis: Age is normally distributed
Alternative Hypothesis: Age is not normally distributed

Here, the p-value is < 0.05, so we reject the null hypothesis, meaning that age is NOT normally distributed in our dataset.
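
As an extra check (not part of the original analysis), a Q-Q plot is a common visual companion to the Shapiro-Wilk test: if age were normally distributed, the points would fall roughly along the reference line.

#qqnorm(data$group1_age) #plots the sample quantiles against theoretical normal quantiles
#qqline(data$group1_age) #adds the reference line expected under a normal distribution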

What’s the spread of age by gender?

data$group1_gender #nominal variable
##   [1] 1 2 2 1 1 2 1 2 2 1 2 1 2 2 1 3 1 2 1 2 2 1 3 1 3 2 3 1 2 3 1 2 2 1 2 1 1
##  [38] 2 3 2 1 2 3 3 1 1 2 2 1 2 1 2 2 2 1 3 1 2 2 3 2 1 3 1 2 3 2 2 1 1 1 2 1 2
##  [75] 1 2 3 2 3 1 2 1 2 3 1 2 2 1 2 1 1 1 2 1 2 1 2 3 1 2
#editing the original dataframe (data <- data %>% ...) to add a new column `gender`, which stores gender as a character variable rather than a numeric code, using the codes from the info dataframe and the `recode` function.
data <- data %>% 
  mutate(gender = recode(group1_gender, `1` = "Male", `2` = "Female", `3` = "Other"))

#here we will make a histogram using the ggplot2 package, instead of hist function
data %>% 
  ggplot(aes(x = as.factor(group1_age))) + #using the ggplot2 package for graphs, specifying the x axis as age (converted to a factor)
  geom_histogram(stat = "count") + #counting the number of patients at each age
  labs(x = "age", y = "count", title = "Spread of Age", subtitle = "No. of total patients = 100") #axis and graph labels
## Warning: Ignoring unknown parameters: binwidth, bins, pad

#a density plot is a smoothed version of a histogram
data %>% 
  ggplot(aes(x = group1_age, fill = gender)) + #using the ggplot2 package for graphs, specifying the x axis as age, and the fill (i.e., grouping) variable as gender
  geom_density(alpha = 0.3) + #specifying the transparency of the density plot
  labs (x = "age", y = "density", title = "Spread of Age by Gender", subtitle = "No.of total patients = 100") #axis and graph labels

Looks like most of our patients are in the 20-24 range.

Does the number of daily steps show a difference with age?

data %>% 
  mutate(steps = recode(group1_steps, `1` = "High", `0` = "Low")) %>% 
  ggplot(aes(x = as.factor(group1_age), fill = steps)) + #advanced function: using as.factor to convert age into a categorical (factor) variable
  geom_bar(stat = "count", position = position_dodge()) + #using a barplot to visualise the frequency, placing the bars for the 'steps' categories side by side
  theme_minimal() + #this option allows you to change the background and formatting of the graph
  labs(x = "Age", y = "Frequency", title = "Step Count Category by Age")

The frequency of patients with high and low step counts looks fairly similar at each age, suggesting that the number of daily steps does not differ by age.
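
If you want the raw numbers behind that bar plot, a simple cross-tabulation gives the same information (a sketch, not run here):

#table(data$group1_age, data$group1_steps) #rows = age, columns = step category (1 = high, 0 = low)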


3.0 GPA

Is there a significant change in GPA mid-semester and the end of semester?

We’ll use a paired-sample t-test to test this, given that we measured the same patients (i.e., the same sample) twice. However, let’s first check visually whether there is indeed any noticeable difference in GPA.

data$group1_pre_gpa #continuous variable
##   [1] 3.0 2.8 3.3 1.8 2.0 4.0 3.7 2.2 3.6 2.4 2.0 3.9 3.5 2.5 2.9 2.4 2.2 2.7
##  [19] 1.5 3.0 3.1 3.5 2.8 3.2 3.3 3.7 3.9 3.2 2.9 2.8 3.4 3.2 3.3 2.8 2.9 3.0
##  [37] 1.7 2.7 3.5 2.3 3.2 3.4 3.0 3.9 3.8 3.7 3.3 3.2 2.6 2.4 3.0 4.0 2.5 2.0
##  [55] 2.5 3.8 3.7 2.4 3.5 2.8 3.8 3.4 3.2 2.6 2.7 2.5 2.9 2.6 2.0 3.1 2.9 3.6
##  [73] 2.7 3.0 3.2 3.0 3.1 2.9 3.1 3.0 3.5 3.2 3.4 3.0 3.1 2.7 2.0 2.8 3.2 3.0
##  [91] 3.2 2.8 2.9 3.7 3.8 3.6 3.1 3.2 2.7 3.0
data$group1_post_gpa #continuous variable
##   [1] 3.2 2.7 3.2 2.0 2.3 3.8 3.5 2.4 3.8 2.3 2.3 4.0 3.2 2.5 3.2 2.4 2.3 3.2
##  [19] 2.0 3.2 3.4 3.4 2.9 3.8 3.4 3.4 3.6 3.3 3.1 3.1 3.2 3.0 3.5 2.3 2.8 3.4
##  [37] 2.0 2.9 3.4 2.5 3.3 3.5 2.8 3.7 3.6 3.5 3.0 3.4 2.8 2.7 3.2 3.8 3.0 2.3
##  [55] 2.4 3.7 2.3 2.6 3.3 3.0 4.0 3.5 3.3 2.8 2.9 2.7 3.2 2.8 2.4 3.0 2.8 3.7
##  [73] 3.0 3.3 3.5 2.6 3.0 3.2 3.1 3.1 3.6 3.0 3.2 3.5 3.2 2.4 2.5 3.0 3.3 3.4
##  [91] 3.4 3.0 3.1 3.6 3.7 3.4 3.2 3.3 3.0 3.4
data %>% 
  dplyr::select(group1_pre_gpa, group1_post_gpa) %>% #the select function allows me to select specific columns
  pivot_longer(cols = 1:2, names_to = "Time", values_to = "GPA") %>% #advanced function: converts a wider dataframe into a longer one
  ggplot(aes(x = Time, y = GPA, fill = Time)) +
  geom_boxplot() + #using the boxplot to show the distribution of the variables
  theme_minimal() +
  labs(x = "GPA Timepoint", y = "GPA", title = "GPA distribution by mid-semester and end of semester", subtitle = "GROUP 1")

Hmm, it doesn’t look like there is a drastic difference between the mid-semester and end-of-semester GPAs for group1. You don’t necessarily have to run a test if the plot already makes it clear that any difference is going to be small.
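
For completeness, the paired-sample t-test mentioned above would be run as below (a sketch only; its output isn’t shown here, and strictly the normality of the paired differences should be checked first).

#t.test(data$group1_pre_gpa, data$group1_post_gpa, paired = TRUE) #paired t-test on the same patients measured twice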

Is there a significant change in GPA between group 1 and group 2?

Let’s check between the end semester GPAs of group1 and group2.

data %>% 
  dplyr::select(group1_post_gpa, group2_post_gpa) %>% 
  pivot_longer(cols =1:2, names_to = "Time", values_to = "GPA") %>% 
  ggplot(aes(x = Time, y = GPA, fill = Time)) +
  geom_boxplot() +
  geom_jitter(shape = 16, colour = "grey", alpha = 0.5, width = 0.2) + #the jitter function displays all datapoints for the variables in the graph
  theme_minimal() +
  labs(x = "GPA Timepoint", y = "GPA", title = "GPA distribution for end of semester", subtitle = "Group 1 and Group 2")

It looks like there might be a slight difference. Let’s use the unpaired two-sample t-test to find out, since we are comparing two different groups. However, this is a parametric test that assumes our data is normally distributed.

shapiro.test(data$group1_post_gpa)
## 
##  Shapiro-Wilk normality test
## 
## data:  data$group1_post_gpa
## W = 0.96499, p-value = 0.009264
shapiro.test(data$group2_post_gpa)
## 
##  Shapiro-Wilk normality test
## 
## data:  data$group2_post_gpa
## W = 0.97523, p-value = 0.05619

It looks like group1 is not normally distributed (p < 0.05), while group2 is borderline (p just above 0.05). To be safe, we’ll go with the non-parametric version of the unpaired two-sample t-test, the Wilcoxon Rank Sum test / Mann-Whitney U test.

wilcox.test(data$group1_post_gpa, data$group2_post_gpa)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  data$group1_post_gpa and data$group2_post_gpa
## W = 5534.5, p-value = 0.1906
## alternative hypothesis: true location shift is not equal to 0

The p-value is > 0.05, signifying that there is no significant difference between the GPAs of group1 and group2.


4.0 Health and Sleep

Is health related to average hours of sleep, age, or gender?

We’ll use a chi-squared test to find this out.

data$group1_weekly_sleep_hrs #discrete variable
##   [1] 6.0 6.5 5.0 7.0 7.5 7.0 9.0 5.0 6.0 7.0 6.5 8.0 7.0 6.5 6.0 8.5 5.0 6.0
##  [19] 4.0 7.0 7.5 6.5 6.0 7.0 7.5 8.5 9.0 6.0 5.5 7.5 7.0 6.0 7.0 6.5 6.0 6.0
##  [37] 4.5 6.0 7.0 7.5 6.0 6.5 6.0 7.5 7.0 7.0 8.0 7.5 6.0 6.5 6.0 7.0 6.5 8.0
##  [55] 7.5 8.0 6.5 5.5 4.0 6.0 7.5 6.5 8.0 7.0 6.5 7.5 8.0 8.5 6.0 7.0 5.5 6.0
##  [73] 6.5 7.5 7.0 7.0 7.5 8.0 6.5 5.0 4.5 6.0 6.5 7.0 6.5 7.5 7.0 7.0 8.0 6.5
##  [91] 7.5 6.0 7.0 6.5 7.0 6.5 8.0 4.5 5.5 6.0
hist(data$group1_weekly_sleep_hrs)

data %>% 
  dplyr::count(group1_health)
##   group1_health  n
## 1             0 22
## 2             1 78
#making a dataframe containing only the two columns required
sleep_health <- data %>% 
  dplyr::select(group1_weekly_sleep_hrs, group1_health)

chisq.test(sleep_health)
## Warning in stats::chisq.test(x, y, ...): Chi-squared approximation may be
## incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  sleep_health
## X-squared = 21.335, df = 99, p-value = 1

Looks like there is no significant association between sleep and health. Let’s check again, adding age and gender to the mix.

health <- data %>% 
  dplyr::select(group1_weekly_sleep_hrs, group1_health, group1_age, group1_gender)

chisq.test(health)
## Warning in stats::chisq.test(x, y, ...): Chi-squared approximation may be
## incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  health
## X-squared = 65.226, df = 297, p-value = 1

No, there is no significant association among any of these four variables.
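
One caveat (my own note, not part of the original analysis): the ‘approximation may be incorrect’ warning appears because chisq.test here treats the raw columns as a table of counts. A chi-squared test is more naturally run on a cross-tabulation of two categorical variables; a sketch using the sleep category instead of raw sleep hours (not run here):

#chisq.test(table(data$group1_sleep_cat, data$group1_health)) #2x2 contingency table of sleep category vs health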


5.0 Bonus: Age and GPA

Is there a correlation between age and GPA?

Since these are both continuous variables as we previously saw, we will use a correlation test. Let’s take an average of the mid-semester and end-semester GPAs.

data <- data %>% 
  rowwise() %>% #to perform functions across rows, rather than across columns   
  mutate(avg_gpa = mean(c(group1_pre_gpa, group1_post_gpa)))

Checking their distribution:

shapiro.test(data$group1_age)
## 
##  Shapiro-Wilk normality test
## 
## data:  data$group1_age
## W = 0.95069, p-value = 0.0009214
shapiro.test(data$avg_gpa)
## 
##  Shapiro-Wilk normality test
## 
## data:  data$avg_gpa
## W = 0.97754, p-value = 0.08538

GPA is normally distributed, but age is not. Hence, we cannot use a parametric test. We will go with Spearman’s rank-order correlation test to check for any correlation between the two.

cor.test(data$group1_age, data$avg_gpa, method ="spearman")
## Warning in cor.test.default(data$group1_age, data$avg_gpa, method = "spearman"):
## Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  data$group1_age and data$avg_gpa
## S = 169390, p-value = 0.871
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##         rho 
## -0.01644122

The p-value shows that there is no significant correlation between these two variables, and the rho parameter indicates a very slight inverse/negative correlation.
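
A quick scatter plot (not included in the original analysis) can help sanity-check that near-zero rho; a sketch, not run here:

#data %>% 
#  ggplot(aes(x = group1_age, y = avg_gpa)) +
#  geom_point(alpha = 0.5) + #each point is one patient
#  geom_smooth(method = "lm", se = FALSE) + #a straight trend line, which should be nearly flat
#  theme_minimal() +
#  labs(x = "Age", y = "Average GPA", title = "Age vs Average GPA")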


6.0 Knit your file!

While we save our code in .Rmd (R Markdown) files, we can use the Knit option (Ctrl + Shift + K) to render the output as HTML, PDF, or Word files. Try it!
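
The Knit button is equivalent to calling rmarkdown::render yourself; a sketch, with a hypothetical file name:

#rmarkdown::render("my_analysis.Rmd") #renders the file; the output format comes from the YAML header at the top of the .Rmd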