For this session we will use Gene counts data from Lokesh given in the previous session. You can download the data here genes and metadata.
Please load the data into R.# Please type the correct path to where you save your data files.
gen_counts<-read.table("~/Documents/GitHub/Bioinformatics_Workshop_ImmTech/Data/2019-11-18/data/gene_counts.tsv")
gen_counts <-as.matrix(gen_counts)
pheno <- read.table("~/Documents/GitHub/Bioinformatics_Workshop_ImmTech/Data/2019-11-18/data/metadata.tsv", header = TRUE)
Have a look at the data by using head()
function.
First have look at the the gene expression data.
## Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 Sample_6 Sample_7
## TMEM240 84 85 0 0 0 0 0
## MT-CO1 0 67 0 0 80 165 0
## ADGRG6 0 0 0 0 0 0 24
## Sample_8 Sample_9 Sample_10 Sample_11 Sample_12 Sample_13
## TMEM240 0 46 0 38 0 0
## MT-CO1 70 209 130 0 113 0
## ADGRG6 0 0 0 0 0 0
## Sample_14 Sample_15 Sample_16
## TMEM240 0 72 0
## MT-CO1 276 0 0
## ADGRG6 9 0 0
And now the Phenotype data.
## Sample Healthy Age Sex
## 1 Sample_1 yes 23 F
## 2 Sample_2 yes 47 M
## 3 Sample_3 yes 35 F
## 4 Sample_4 yes 19 M
## 5 Sample_5 yes 36 F
## 6 Sample_6 yes 29 M
It is good practice to keep the annotation is separate data structures. It is often the the one needs to make use of different sample annotations and row annotations for statistics and plotting purpose. Usually the numeric data such as gene expression values is kept in a matrix and all the annotations such as gene annotations and sample annotations are kept in data frames.We might want to recall that Matrix stores only one type of data such as numeric, strings etc. while data frames can hold different types of values.
On your own
Try the str()
, summary()
, dim()
and class
functions and see if you can find out how many rows and columns each data has. Also find what other attributes you can understand using these functions
R is a flexible language and it gives many different options to do the same thing. Something simple like accession certain rows of the data or columns of the can be done in many different ways. Lets look do it by using positional indices mapping or by using row or column names.
Using positional indices
Accessing parts of the data in R can be done in different ways. Lets start by using data[ , ]
to extract parts of the data. Use left side of the comma for rows and right side of the comma for columns.
Note that if you leave either side of the comma as blank, r assumes that you meant all data in that dimension should be selected.
Using names
Lets say you are only interested in certain samples and you want to extract those by names.
Lets create a vector of samples you are interested in. To extract the samples from your data simply write the name of the vector on the right side of the comma in the square brackets [ , ]
.
## Create a vector of samples you are interested in
samples <- c("Sample_13" , "Sample_9" , "Sample_4" , "Sample_14" , "Sample_12")
## To extract the samples, here I am using indices for rows while names for columns.
gen_counts[ 1:3, samples]
## Sample_13 Sample_9 Sample_4 Sample_14 Sample_12
## TMEM240 0 46 0 0 0
## MT-CO1 0 209 0 276 113
## ADGRG6 0 0 0 9 0
And similarly, to extract you favorite genes by names use following examples.
genes <- c("BMS1" , "SYNC" , "NFIB" , "RAB11FIP3" , "COG1" , "C5orf38" , "FAM126B" , "SUCLA2" , "SCN4A" , "CISD1" )
## To extract the rows, here I am using indices for rows while names for columns.
head(gen_counts[genes,], 3)
## Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 Sample_6 Sample_7
## BMS1 146 0 253 0 0 152 119
## SYNC 63 77 120 47 43 25 11
## NFIB 34 0 0 56 0 0 0
## Sample_8 Sample_9 Sample_10 Sample_11 Sample_12 Sample_13 Sample_14
## BMS1 0 49 198 0 73 0 134
## SYNC 35 38 24 130 26 0 0
## NFIB 0 0 0 0 0 0 0
## Sample_15 Sample_16
## BMS1 342 0
## SYNC 37 0
## NFIB 0 0
**Extract names samples coming from healthy and ill individuals.
So far we have only printer the extracted data in the console, this is not very useful if we want to use the extracted data for further analyses. Let us store this data in the “environment” by using <-
or =
signs and write it to a file with write.table()
function.
So far we have consolidated our knowledge on how to load, access, subset and save data. This is going to be useful in later session.
R is a Functional Language and there is a lot of help available. For almost all necessary data analyses tasks, there are functions available. Most of these functions are well written and have good documentation on how to use them. Often R functions are bundled in packages. An example of a package is ggplot2
that is developed for plotting purpose. There are plenty more such as limma
for analyses of micro array data. There are handy functions to find out what these packages and functions do.
We would take a look at how to get help on packages, functions and data types.
And to get help on ggplot()
function lets use ?
. Or t.test()
or lm()
.
To know more about packages, lets look at the Vignettes of ggplot2 package in R by using function call browseVignettes("ggplot2")
.
If you want to know what methods are available for your type of data, you can use methods
function as following.
Essentially each function as a set of parameters you can give. Often the errors R produces is due to the fact that the user is not providing the correct data (arguments) for those parameters. Help pages will give an overview of the parameters available. Typically there are default values (arguments) for each parameter should be but the function wont work with some minimal arguments.
R is a statistical and programming language.There are thousands of packages and functions available for basic to very advanced statistics. In not, once can always write there own function to help themselves in their cause.
There are several convenience functions and packages available for descriptive statistics and analyses in R. We will use some of these and try to apply them on our genes data.
To get values such as mean, standard deviation, variance, min, max, median etc. R has implemented functions in the base package. This package is automatically loaded in R when you start the program R or Studio.
Lets say we want obtain the mean of the fist sample we can use rowMeans()
, colMeans()
, aggregate()
as following.
# means on selection of samples
colMeans(gen_counts[ , samples])
# means of selection of rows and columns.
rowMeans(gen_counts[genes , samples])
## Sample_13 Sample_9 Sample_4 Sample_14 Sample_12
## 43.28804 56.36232 9.75000 45.61957 23.19203
## BMS1 SYNC NFIB RAB11FIP3 COG1 C5orf38 FAM126B
## 51.2 22.2 11.2 14.0 3.0 41.6 75.0
## SUCLA2 SCN4A CISD1
## 53.0 53.4 0.0
Similarly, we can apply the same other descriptive statistical functions such as rowVars()
or colVars()
for variances of rows and columns respectively.
Try to use rowsd, rowMax, rowMin
etc. to get an idea of the functions available for descriptive statistics.
If there is a particular function you are interested in, ask one of the TAs or brown R related webpages.
Mean, Median, Variance, Standard Deviation, Quantiles, counts, log2, log10, Exponential
## Sex Age
## 1 F 52.625
## 2 M 43.500
## Sex Age
## 1 F 24.43614
## 2 M 22.27106
apply()
*** family functionsWriting function is relatively easy. Lets write a function to calculate ratio of means between healthy vs. ill individuals.
We define ratioOfMeans
by assigning it to the output of function. The list of argument names are contained within parentheses. Next, the body of the function–the statements that are executed when it runs–is contained within curly braces ({}).
When we call the function, the values we pass to it are assigned to those variables so that we can use them inside the function. Inside the function, we use a return statement to send a result back to whoever asked for it.
Now we have written a function, we can use it to calculate ratios within our own data sets whenever we want.
*** It is OK if you don’t fully understand this, for typical data analyses one can usually rely on already available functions in R***. There is a lot of testing and annotation required before your function can be published for others to use. This was just to illustrate the value of investing time in learning R.
apply()
family functionsR has a special class of helper functions that can be used to apply functions to all elements of any data type. These include lapply()
, sapply()
, maapply()
, tapply()
, rapply()
. These are alternatives to loops and have faster implementation. It is recommended to use these instead of loops as they are much faster and rather easy to implement.
Lets take a look at an example.
The above function should give you the exact same output as rowMeans()
and colMeans()
respectively.
The apply()
functions are particularly handy when you want to write small through away functions.
We will use these in the upcoming session when we want to learn about t.test and linear models.
We will cover more advanced descriptive statistics in the future sessions.
We will look at the two different statistical tests namely t-test and linear regression to test our hypotheses.
To apply t.test()
in R we can first look at the help by typing ?t.test
.
We want to find out whether certain genes are deferentially expressed in healthy vs. ill patients using students t-test.
## Extract the names of the ill samples
healthy <- pheno[pheno$Healthy == "yes" , ]$Sample
ill <- pheno[pheno$Healthy == "no", ]$Sample
# Aplly t.test on gene at row 1 in the data.
t_res<- t.test(gen_counts[1 , healthy], gen_counts[1 , ill])
# Take a look at the resuls of the t test.
t_res
##
## Welch Two Sample t-test
##
## data: gen_counts[1, healthy] and gen_counts[1, ill]
## t = 1.1869, df = 13.681, p-value = 0.2555
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -15.71249 54.46249
## sample estimates:
## mean of x mean of y
## 30.000 10.625
Lets have a look at the output of the t_res object. First have a look at the object by using str()
function.
## List of 10
## $ statistic : Named num 1.19
## ..- attr(*, "names")= chr "t"
## $ parameter : Named num 13.7
## ..- attr(*, "names")= chr "df"
## $ p.value : num 0.255
## $ conf.int : num [1:2] -15.7 54.5
## ..- attr(*, "conf.level")= num 0.95
## $ estimate : Named num [1:2] 30 10.6
## ..- attr(*, "names")= chr [1:2] "mean of x" "mean of y"
## $ null.value : Named num 0
## ..- attr(*, "names")= chr "difference in means"
## $ stderr : num 16.3
## $ alternative: chr "two.sided"
## $ method : chr "Welch Two Sample t-test"
## $ data.name : chr "gen_counts[1, healthy] and gen_counts[1, ill]"
## - attr(*, "class")= chr "htest"
We can extract desired output such as p value, means of groups etc. from this object by using $
operator.
Try to extract confidence intervals, means of groups and from the same object.
## mean of x
## 30
## mean of y
## 10.625
## [1] -15.71249 54.46249
## attr(,"conf.level")
## [1] 0.95
We can make use of R’s apply function to perform t test on all genes in the data.
For nicer output lets try to access p-values.
And means by using following code.
Can you extract confidence intervals?
Similar to t.test, one can also apply linear regression on a single gene. Take a look at some examples on how to extract output from the models. use ?lm
and methods(class="lm")
to know more about what you can extract from linear model fit.
One should be aware the formula
argument of the function and it works such that the outcome variable is on the left hand side of the “~
” sign and the predictor variables are on the right hand side.
To look at the results from the linear model summary()
is a very useful function and it gives most of the output needed for interpretation of results.
To find out what other functions you can apply to linear model output use two of our favorite functions class()
and methods()
.
## [1] add1 alias anova case.names
## [5] coerce confint cooks.distance deviance
## [9] dfbeta dfbetas drop1 dummy.coef
## [13] effects extractAIC family formula
## [17] hatvalues influence initialize kappa
## [21] labels logLik model.frame model.matrix
## [25] nobs plot predict print
## [29] proj qr residuals rstandard
## [33] rstudent show simulate slotsFromS3
## [37] summary variable.names vcov
## see '?methods' for accessing help and source code
And lets use some of functions available for class “lm” to extract useful output.
## (Intercept) gen_counts[1, ]
## 46.0159833 0.1007516
##
## Call:
## lm(formula = y ~ gen_counts[1, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.479 -17.016 -7.212 22.575 34.984
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.0160 6.9952 6.578 1.23e-05 ***
## gen_counts[1, ] 0.1008 0.1844 0.546 0.593
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.63 on 14 degrees of freedom
## Multiple R-squared: 0.02088, Adjusted R-squared: -0.04906
## F-statistic: 0.2985 on 1 and 14 DF, p-value: 0.5934
##
## (Intercept) gen_counts[1, ]
## 1.232512e-05 5.934069e-01
Just like we did for t-test, we would like to run linear regression for all our genes and extract output in a nice format.
Store this output in a list and have a look at the summary of results fro the first gene.
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.479 -17.016 -7.212 22.575 34.984
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.0160 6.9952 6.578 1.23e-05 ***
## x 0.1008 0.1844 0.546 0.593
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.63 on 14 degrees of freedom
## Multiple R-squared: 0.02088, Adjusted R-squared: -0.04906
## F-statistic: 0.2985 on 1 and 14 DF, p-value: 0.5934
Once you know the output well enough you can try to run lm
on all genes and extract output.
If the outcome variable is categorical, one can use glm()
function. We can learn more about this in the future sessions. Just as a teaser, try following.
Be aware of multiple testing. We will cover that in the future sessions.