In this tutorial, we’re going to be running through a hands-on tutorial on exploring a dummy dataset and conducting some statistical analyses.
We are going to explore the different types of data variable types, run tests of normality and association, and plot our results.
In R, a package comprises a specific set of tools and functions that you use to view, manipulate (or wrangle), and analyze your data.
A function carries out a specific task on your data.
There are several R packages available performing a range of different functions. The same output can be generated from similar functions for different packages; which one you use is up to you.
Before we start, we need to ‘load’ each computational package that we will be using in our R script. We do this by first installing the packages we want by using install.packages
function and then loading them into our environment by using the library
function.
You write all of your code in special code chunks (Ctrl + Option + I), and you can name each chunk after the r
present at the top. You can also include several options for each chunk - here I will set the chunk to not display any warnings, messages or errors.
You can leave comments for yourself/others and for unexcuted code throughout your code chunks using the # option.
#install.packages("tidyverse") #you can use single quotation ' ' , or double quotation marks " " .
#data handling packages
library(tidyverse) #contains the dplyr, stringr, readr, tidyr, tibble, forcats, purrr, ggplot2
library(here)
library(openxlsx)
library(data.table)
library(rlist)
library(readxl)
library(DT)
library(rmarkdown)
library(knitr)
library(factoextra)
library(reshape2)
#analysis packages
library(pheatmap)
library(limma)
#extra packages
library(kableExtra)
library(formatR)
library(janitor)
library(scales)
library(ggpubr)
library(RColorBrewer)
Let’s now load in our data. We will be using the tidyverse
package to handle our data. This specific super package contains eight different packages, and allows you write and execute your code in a cleaner format - in comparison to the functions you might use from the default R language (i.e., base R functions).
Let’s first check our working directory using the here
package. This will usually be where you have saved your R .rmd and .md files.
Our data consists of a few recorded clinical variables obtained from a patient group.
::here() here
## [1] "/Users/nikitatelkar/Google Drive/UBC Docs/Volunteer/BCCHR Stats/Summer_Stats_Tutorial"
<- read.csv(file = "BCCHR_Dataset.csv")
data
<- read.xlsx("BCCHR_Dataset_info.xlsx")
info
dim (data)
## [1] 100 11
str(data)
## 'data.frame': 100 obs. of 11 variables:
## $ group1_id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ group1_gender : int 1 2 2 1 1 2 1 2 2 1 ...
## $ group1_age : int 20 19 21 24 19 21 22 24 23 22 ...
## $ group1_weekly_sleep_hrs: num 6 6.5 5 7 7.5 7 9 5 6 7 ...
## $ group1_sleep_cat : chr "high" "high" "low" "high" ...
## $ group1_steps : int 1 1 0 0 1 1 0 0 1 0 ...
## $ group1_study_cat : int 1 2 3 1 2 3 1 2 3 1 ...
## $ group1_pre_gpa : num 3 2.8 3.3 1.8 2 4 3.7 2.2 3.6 2.4 ...
## $ group1_post_gpa : num 3.2 2.7 3.2 2 2.3 3.8 3.5 2.4 3.8 2.3 ...
## $ group2_post_gpa : num 3.3 3.8 3.2 2.3 2.5 3 2.8 3.2 3.4 2.5 ...
## $ group1_health : int 1 1 1 1 0 1 1 1 0 1 ...
The dim
function gives us the dimensions of our dataset. It looks like our data has a 100 rows, and 11 columns.
The str
function shows us the different data types of our data: chr = character type, int = integer, num = numeric
There’s also an information
file to go along with our data
file which explains a few of the variables present in our data. Let’s now look at the first 10 columns of our data, and the column names, i.e., the different metrics that we have.
head(data) #displays the first 10 rows of a dataframe
## group1_id group1_gender group1_age group1_weekly_sleep_hrs group1_sleep_cat
## 1 1 1 20 6.0 high
## 2 2 2 19 6.5 high
## 3 3 2 21 5.0 low
## 4 4 1 24 7.0 high
## 5 5 1 19 7.5 high
## 6 6 2 21 7.0 high
## group1_steps group1_study_cat group1_pre_gpa group1_post_gpa group2_post_gpa
## 1 1 1 3.0 3.2 3.3
## 2 1 2 2.8 2.7 3.8
## 3 0 3 3.3 3.2 3.2
## 4 0 1 1.8 2.0 2.3
## 5 1 2 2.0 2.3 2.5
## 6 1 3 4.0 3.8 3.0
## group1_health
## 1 1
## 2 1
## 3 1
## 4 1
## 5 0
## 6 1
names(data) #displays column names of a dataframe
## [1] "group1_id" "group1_gender"
## [3] "group1_age" "group1_weekly_sleep_hrs"
## [5] "group1_sleep_cat" "group1_steps"
## [7] "group1_study_cat" "group1_pre_gpa"
## [9] "group1_post_gpa" "group2_post_gpa"
## [11] "group1_health"
info
## Variables
## 1 group1_id
## 2 group1_gender
## 3 group1_age
## 4 group1_weekly_sleep_hrs
## 5 group1_sleep_cat
## 6 group1_steps
## 7 group1_study_cat
## 8 group1_pre_gpa
## 9 group1_post_gpa
## 10 group2_post_gpa
## 11 group1_health
## Option(s)
## 1 <NA>
## 2 1 = male, 2 = female, 3 = other
## 3 <NA>
## 4 <NA>
## 5 high > 6 hours, low < 6 hours
## 6 1 > 10000 steps, 0 < 10000 steps
## 7 1 = no change, 2 = study group, 3 = study group + office hours
## 8 <NA>
## 9 <NA>
## 10 <NA>
## 11 1 = good, 2 = poor
## Description
## 1 <NA>
## 2 <NA>
## 3 Average sleep hours in the last 7 days
## 4 <NA>
## 5 Average sleep (high/low)
## 6 Average steps in the last 7 days (1/0)
## 7 <NA>
## 8 GPA mid-way through semester (group 1)
## 9 GPA end of semester (group 1)
## 10 GPA end of semester (group 2) - no gpa_t1 data
## 11 Overall health in the last 7 days
Some functions use the quotations, some don’t. Quotations are usually used when assigning a name to a variable.
Aah, great. So, the clinical variables that we have are:
info
file)We have a mix of binary, nominal, discrete, and continuous variables. A variable (x) is something that can take on multiple values. For example, the variable group1_id
takes all of the 100 observations, one at a time.
Try to guess which variable is which type! The statistical tests we use depend on that.
Let’s check the breakdown of the genders of our patients.
$group1_gender #you can select the column you want to select to perform a function on by using the $ symbol. data
## [1] 1 2 2 1 1 2 1 2 2 1 2 1 2 2 1 3 1 2 1 2 2 1 3 1 3 2 3 1 2 3 1 2 2 1 2 1 1
## [38] 2 3 2 1 2 3 3 1 1 2 2 1 2 1 2 2 2 1 3 1 2 2 3 2 1 3 1 2 3 2 2 1 1 1 2 1 2
## [75] 1 2 3 2 3 1 2 1 2 3 1 2 2 1 2 1 1 1 2 1 2 1 2 3 1 2
%>%
data #this %>% symbol is called a `pipe`. It allows us to perform functions on our entire dataframe
::count(group1_gender) #here I am specifying to use the `count` function specifically from the `dplyr` package using the :: command. The count function allows us to get the frequencies. dplyr
## group1_gender n
## 1 1 40
## 2 2 44
## 3 3 16
#Let's look at the percentage breakdown.
%>%
data ::count(group1_gender) %>%
dplyrmutate(perc = (n*100)/100) #mutate allows us to make a new variable from using existing data. The name you specify on the right side of the = sign is the name of your new column, and on the left are the already existing columns
## group1_gender n perc
## 1 1 40 40
## 2 2 44 44
## 3 3 16 16
#the results are the same since we already have a dataset of 100 individuals
Some questions we’re going to ask now:
What’s the average age of our patients?
mean(data$group1_age) #average age is 21.75
## [1] 21.75
min(data$group1_age) #minimum age is 17
## [1] 17
max(data$group1_age) #maximum age is 25
## [1] 25
Does age, as a variable, display a normal distribution of data?
We’ll use the hist
function and ggplot2
package to visualize our data.
hist(data$group1_age) #histogram function
Visually, we can’t make sure if the age is normally distributed. Let’s perform a hypothesis test, the Shapiro-Wilk Test of Normality to check if is:
shapiro.test(data$group1_age)
##
## Shapiro-Wilk normality test
##
## data: data$group1_age
## W = 0.95069, p-value = 0.0009214
The p-value tells us the probability of whether our null hypothesis, that age is normally distributed, is true/accepted (p > 0.05), or the alternative hypothesis, that age is not normally distributed (p < 0.05).
Null Hypothesis: Age is normally distributed
Alternate Hypotheseis: Age is not normally distributed
Here, the p-value is < 0.05, so we do not accept the null hypothesis, meaning that age is NOT normally distributed in our dataset.
What’s the spread of age by gender?
$group1_gender #nominal variable data
## [1] 1 2 2 1 1 2 1 2 2 1 2 1 2 2 1 3 1 2 1 2 2 1 3 1 3 2 3 1 2 3 1 2 2 1 2 1 1
## [38] 2 3 2 1 2 3 3 1 1 2 2 1 2 1 2 2 2 1 3 1 2 2 3 2 1 3 1 2 3 2 2 1 1 1 2 1 2
## [75] 1 2 3 2 3 1 2 1 2 3 1 2 2 1 2 1 1 1 2 1 2 1 2 3 1 2
#editing the original dataframe data (data <- data) to add a new column `gender` which states our gender in a character variable rather than numeric, taken from the info dataframe, using the `recode`.
<- data %>%
data mutate(gender = recode(group1_gender, `1` = "Male", `2` = "Female", `3` = "Other"))
#here we will make a histogram using the ggplot2 package, instead of hist function
%>%
data ggplot(aes(x = as.factor(group1_age))) + #using the ggplot2 package for graphs, specifying the x axis as age, and fill i.e. distinction variable which is gender.
geom_histogram(stat = "count") + #specifying the transparency of the density plot
labs (x = "age", y = "density", title = "Spread of Age by Gender", subtitle = "No.of total patients = 100") #axis and graph labels
## Warning: Ignoring unknown parameters: binwidth, bins, pad
#a density plot is a smoothed version of a histogram
%>%
data ggplot(aes(x = group1_age, fill = gender)) + #using the ggplot2 package for graphs, specifying the x axis as age, and fill i.e. distinction variable which is gender.
geom_density(alpha = 0.3) + #specifying the transparency of the density plot
labs (x = "age", y = "density", title = "Spread of Age by Gender", subtitle = "No.of total patients = 100") #axis and graph labels
Looks like our patients are in the 20-24 range:
Do number of steps everyday show a difference with age?
%>%
data mutate(steps = recode(group1_steps, `1` = "High", `0` = "Low")) %>%
ggplot(aes(x = as.factor(group1_age), fill = steps)) + #advanced function: using as.factor to convert age into a ordered variable
geom_bar(stat = "count", position = position_dodge()) + #using a barplot to visualise the frequency, and specifying the position of plots side by side by the variable 'steps'
theme_minimal() + #this option allows you to change the background and formatting of the graph
labs(x = "Age", y = "Frequency", title = "Sleep Category by Age")
The frequency of patients with high and low step count looks pretty much similar at each age group, meaning that age is not a factor by which number of steps differ.
Is there a significant change in GPA mid-semester and the end of semester?
We’ll use a paired-sample t-test, to test this, given that we sampled the same population (i.e., patients) twice. However, let’s first check visually if there is indeed any noticeable difference in GPA.
$group1_pre_gpa #continuous variable data
## [1] 3.0 2.8 3.3 1.8 2.0 4.0 3.7 2.2 3.6 2.4 2.0 3.9 3.5 2.5 2.9 2.4 2.2 2.7
## [19] 1.5 3.0 3.1 3.5 2.8 3.2 3.3 3.7 3.9 3.2 2.9 2.8 3.4 3.2 3.3 2.8 2.9 3.0
## [37] 1.7 2.7 3.5 2.3 3.2 3.4 3.0 3.9 3.8 3.7 3.3 3.2 2.6 2.4 3.0 4.0 2.5 2.0
## [55] 2.5 3.8 3.7 2.4 3.5 2.8 3.8 3.4 3.2 2.6 2.7 2.5 2.9 2.6 2.0 3.1 2.9 3.6
## [73] 2.7 3.0 3.2 3.0 3.1 2.9 3.1 3.0 3.5 3.2 3.4 3.0 3.1 2.7 2.0 2.8 3.2 3.0
## [91] 3.2 2.8 2.9 3.7 3.8 3.6 3.1 3.2 2.7 3.0
$group1_post_gpa #continuous variable data
## [1] 3.2 2.7 3.2 2.0 2.3 3.8 3.5 2.4 3.8 2.3 2.3 4.0 3.2 2.5 3.2 2.4 2.3 3.2
## [19] 2.0 3.2 3.4 3.4 2.9 3.8 3.4 3.4 3.6 3.3 3.1 3.1 3.2 3.0 3.5 2.3 2.8 3.4
## [37] 2.0 2.9 3.4 2.5 3.3 3.5 2.8 3.7 3.6 3.5 3.0 3.4 2.8 2.7 3.2 3.8 3.0 2.3
## [55] 2.4 3.7 2.3 2.6 3.3 3.0 4.0 3.5 3.3 2.8 2.9 2.7 3.2 2.8 2.4 3.0 2.8 3.7
## [73] 3.0 3.3 3.5 2.6 3.0 3.2 3.1 3.1 3.6 3.0 3.2 3.5 3.2 2.4 2.5 3.0 3.3 3.4
## [91] 3.4 3.0 3.1 3.6 3.7 3.4 3.2 3.3 3.0 3.4
%>%
data ::select(group1_pre_gpa, group1_post_gpa) %>% #the select function allows me to select specific columns
dplyrpivot_longer(cols = 1:2, names_to = "Time", values_to = "GPA") %>% #advanced function: converts a wider dataframe into a longer one
ggplot(aes(x = Time, y = GPA, fill = Time)) +
geom_boxplot() + #using the boxplot to show the distribution of the variables
theme_minimal() +
labs(x = "GPA Timepoint", y = "GPA", title = "GPA distribution by mid-semester and end of semester", subtitle = "GROUP 1")
Hmm, it doesn’t look like there is a drastic difference in mid-term and end of semester GPA for group1. You don’t have to run a test if you can assume visually that there won’t be a significant difference.
Is there a significant change in GPA between group 1 and group 2?
Let’s check between the end semester GPAs of group1 and group2.
%>%
data ::select(group1_post_gpa, group2_post_gpa) %>%
dplyrpivot_longer(cols =1:2, names_to = "Time", values_to = "GPA") %>%
ggplot(aes(x = Time, y = GPA, fill = Time)) +
geom_boxplot() +
geom_jitter(shape = 16, colour = "grey", alpha = 0.5, width = 0.2) + #the jitter function displays all datapoints for the variables in the graph
theme_minimal() +
labs(x = "GPA Timepoint", y = "GPA", title = "GPA distribution for end of semester", subtitle = "Group 1 and Group 2")
It looks like a slight difference. Let’s use the unpaired two-sample t-test to find out, since we are comparing two different groups. However, this test is a parametric test that assumes that our data is normally distributed.
shapiro.test(data$group1_post_gpa)
##
## Shapiro-Wilk normality test
##
## data: data$group1_post_gpa
## W = 0.96499, p-value = 0.009264
shapiro.test(data$group2_post_gpa)
##
## Shapiro-Wilk normality test
##
## data: data$group2_post_gpa
## W = 0.97523, p-value = 0.05619
It looks like neither of them show a normal distribution (group2 showing borderline significance). We’ll go with the non-parametric version of the unpaired two-sample t-test, the Wilcoxon Rank Sum test / Mann-Whitney U test.
wilcox.test(data$group1_post_gpa, data$group2_post_gpa)
##
## Wilcoxon rank sum test with continuity correction
##
## data: data$group1_post_gpa and data$group2_post_gpa
## W = 5534.5, p-value = 0.1906
## alternative hypothesis: true location shift is not equal to 0
The p-value is > 0.05, signifying that there is no significant difference between the GPAs of group1 and group2.
Is health related to average hours of sleep, age, or gender?
We’ll use a chi-squared test to find this out.
$group1_weekly_sleep_hrs #discrete variable data
## [1] 6.0 6.5 5.0 7.0 7.5 7.0 9.0 5.0 6.0 7.0 6.5 8.0 7.0 6.5 6.0 8.5 5.0 6.0
## [19] 4.0 7.0 7.5 6.5 6.0 7.0 7.5 8.5 9.0 6.0 5.5 7.5 7.0 6.0 7.0 6.5 6.0 6.0
## [37] 4.5 6.0 7.0 7.5 6.0 6.5 6.0 7.5 7.0 7.0 8.0 7.5 6.0 6.5 6.0 7.0 6.5 8.0
## [55] 7.5 8.0 6.5 5.5 4.0 6.0 7.5 6.5 8.0 7.0 6.5 7.5 8.0 8.5 6.0 7.0 5.5 6.0
## [73] 6.5 7.5 7.0 7.0 7.5 8.0 6.5 5.0 4.5 6.0 6.5 7.0 6.5 7.5 7.0 7.0 8.0 6.5
## [91] 7.5 6.0 7.0 6.5 7.0 6.5 8.0 4.5 5.5 6.0
hist(data$group1_weekly_sleep_hrs)
%>%
data ::count(group1_health) dplyr
## group1_health n
## 1 0 22
## 2 1 78
#making a dataframe containing only the two columns required
<- data %>%
sleep_health ::select(group1_weekly_sleep_hrs, group1_health)
dplyr
chisq.test(sleep_health)
## Warning in stats::chisq.test(x, y, ...): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: sleep_health
## X-squared = 21.335, df = 99, p-value = 1
Looks like there is no significant association between sleep and health. Let’s check with adding age and gender to the mix.
<- data %>%
health ::select(group1_weekly_sleep_hrs, group1_health, group1_age, group1_gender)
dplyr
chisq.test(health)
## Warning in stats::chisq.test(x, y, ...): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: health
## X-squared = 65.226, df = 297, p-value = 1
No, there is no association between any 4 of these variables.
Is there a correlation with age and GPA?
Since these are both continuous variables as we previously saw, we will use a correlation test. Let’s take an average of the mid-semester and end-semester GPAs.
<- data %>%
data rowwise() %>% #to perform functions across rows, rather than across columns
mutate(avg_gpa = mean(c(group1_pre_gpa, group1_post_gpa)))
Checking their distribution:
shapiro.test(data$group1_age)
##
## Shapiro-Wilk normality test
##
## data: data$group1_age
## W = 0.95069, p-value = 0.0009214
shapiro.test(data$avg_gpa)
##
## Shapiro-Wilk normality test
##
## data: data$avg_gpa
## W = 0.97754, p-value = 0.08538
GPA is normally distributed, age is not. Hence, we cannot use a parametric test. We will go with the Spearman’s rank order correlation test to check for any correlation between these two.
cor.test(data$group1_age, data$avg_gpa, method ="spearman")
## Warning in cor.test.default(data$group1_age, data$avg_gpa, method = "spearman"):
## Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: data$group1_age and data$avg_gpa
## S = 169390, p-value = 0.871
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.01644122
The p-value shows that there is no significant correlation between these two variables, and the rho parameter indicates a very slight inverse/negative correlation.
While we save our code in .rmd or .md files, we can use the knit option (Ctrl + Shift + K) to render our .rmd (rmarkdown) output in HTML, PDF or Word files. Try it!