1 A quick catch up and practice for previous sessions

1.1 Data: Import, Export, Generation, vectors, matrices and data frames

For this session we will use Gene counts data from Lokesh given in the previous session. You can download the data here genes and metadata.

Please load the data into R.

Have a look at the data by using head() function.

First have look at the the gene expression data.

R
##         Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 Sample_6 Sample_7
## TMEM240       84       85        0        0        0        0        0
## MT-CO1         0       67        0        0       80      165        0
## ADGRG6         0        0        0        0        0        0       24
##         Sample_8 Sample_9 Sample_10 Sample_11 Sample_12 Sample_13
## TMEM240        0       46         0        38         0         0
## MT-CO1        70      209       130         0       113         0
## ADGRG6         0        0         0         0         0         0
##         Sample_14 Sample_15 Sample_16
## TMEM240         0        72         0
## MT-CO1        276         0         0
## ADGRG6          9         0         0

And now the Phenotype data.

R
##     Sample Healthy Age Sex
## 1 Sample_1     yes  23   F
## 2 Sample_2     yes  47   M
## 3 Sample_3     yes  35   F
## 4 Sample_4     yes  19   M
## 5 Sample_5     yes  36   F
## 6 Sample_6     yes  29   M

It is good practice to keep the annotation is separate data structures. It is often the the one needs to make use of different sample annotations and row annotations for statistics and plotting purpose. Usually the numeric data such as gene expression values is kept in a matrix and all the annotations such as gene annotations and sample annotations are kept in data frames.We might want to recall that Matrix stores only one type of data such as numeric, strings etc. while data frames can hold different types of values.

On your own

Try the str(), summary(), dim() and class functions and see if you can find out how many rows and columns each data has. Also find what other attributes you can understand using these functions

1.1.1 Accessing parts of the data

R is a flexible language and it gives many different options to do the same thing. Something simple like accession certain rows of the data or columns of the can be done in many different ways. Lets look do it by using positional indices mapping or by using row or column names.

Using positional indices

Accessing parts of the data in R can be done in different ways. Lets start by using data[ , ] to extract parts of the data. Use left side of the comma for rows and right side of the comma for columns.

Note that if you leave either side of the comma as blank, r assumes that you meant all data in that dimension should be selected.

Using names

Lets say you are only interested in certain samples and you want to extract those by names. Lets create a vector of samples you are interested in. To extract the samples from your data simply write the name of the vector on the right side of the comma in the square brackets [ , ].

And similarly, to extract you favorite genes by names use following examples.

R
##      Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 Sample_6 Sample_7
## BMS1      146        0      253        0        0      152      119
## SYNC       63       77      120       47       43       25       11
## NFIB       34        0        0       56        0        0        0
##      Sample_8 Sample_9 Sample_10 Sample_11 Sample_12 Sample_13 Sample_14
## BMS1        0       49       198         0        73         0       134
## SYNC       35       38        24       130        26         0         0
## NFIB        0        0         0         0         0         0         0
##      Sample_15 Sample_16
## BMS1       342         0
## SYNC        37         0
## NFIB         0         0

**Extract names samples coming from healthy and ill individuals.

1.1.2 Save the extracted data

So far we have only printer the extracted data in the console, this is not very useful if we want to use the extracted data for further analyses. Let us store this data in the “environment” by using <- or = signs and write it to a file with write.table() function.

So far we have consolidated our knowledge on how to load, access, subset and save data. This is going to be useful in later session.

2 Functional programming

R is a Functional Language and there is a lot of help available. For almost all necessary data analyses tasks, there are functions available. Most of these functions are well written and have good documentation on how to use them. Often R functions are bundled in packages. An example of a package is ggplot2 that is developed for plotting purpose. There are plenty more such as limma for analyses of micro array data. There are handy functions to find out what these packages and functions do. We would take a look at how to get help on packages, functions and data types.

2.0.1 Help on functions

And to get help on ggplot() function lets use ?. Or t.test() or lm().

2.0.2 Help on R packages

To know more about packages, lets look at the Vignettes of ggplot2 package in R by using function call browseVignettes("ggplot2").

2.0.3 Help on data closeness.

If you want to know what methods are available for your type of data, you can use methods function as following.

Essentially each function as a set of parameters you can give. Often the errors R produces is due to the fact that the user is not providing the correct data (arguments) for those parameters. Help pages will give an overview of the parameters available. Typically there are default values (arguments) for each parameter should be but the function wont work with some minimal arguments.

3 Descriptive statistics

R is a statistical and programming language.There are thousands of packages and functions available for basic to very advanced statistics. In not, once can always write there own function to help themselves in their cause.

3.1 Some useful descriptive statistics

There are several convenience functions and packages available for descriptive statistics and analyses in R. We will use some of these and try to apply them on our genes data.

To get values such as mean, standard deviation, variance, min, max, median etc. R has implemented functions in the base package. This package is automatically loaded in R when you start the program R or Studio.

Lets say we want obtain the mean of the fist sample we can use rowMeans() , colMeans(), aggregate() as following.

R
## Sample_13  Sample_9  Sample_4 Sample_14 Sample_12 
##  43.28804  56.36232   9.75000  45.61957  23.19203 
##      BMS1      SYNC      NFIB RAB11FIP3      COG1   C5orf38   FAM126B 
##      51.2      22.2      11.2      14.0       3.0      41.6      75.0 
##    SUCLA2     SCN4A     CISD1 
##      53.0      53.4       0.0

Similarly, we can apply the same other descriptive statistical functions such as rowVars() or colVars() for variances of rows and columns respectively.

Try to use rowsd, rowMax, rowMin etc. to get an idea of the functions available for descriptive statistics.

If there is a particular function you are interested in, ask one of the TAs or brown R related webpages.

Mean, Median, Variance, Standard Deviation, Quantiles, counts, log2, log10, Exponential

3.2 Applying functions on subset of the data

R
##   Sex    Age
## 1   F 52.625
## 2   M 43.500
##   Sex      Age
## 1   F 24.43614
## 2   M 22.27106

4 Writing your own functions apply()*** family functions

Writing function is relatively easy. Lets write a function to calculate ratio of means between healthy vs. ill individuals.

4.1 Writing your own functions

We define ratioOfMeans by assigning it to the output of function. The list of argument names are contained within parentheses. Next, the body of the function–the statements that are executed when it runs–is contained within curly braces ({}).

When we call the function, the values we pass to it are assigned to those variables so that we can use them inside the function. Inside the function, we use a return statement to send a result back to whoever asked for it.

Now we have written a function, we can use it to calculate ratios within our own data sets whenever we want.

*** It is OK if you don’t fully understand this, for typical data analyses one can usually rely on already available functions in R***. There is a lot of testing and annotation required before your function can be published for others to use. This was just to illustrate the value of investing time in learning R.

4.2 The apply() family functions

R has a special class of helper functions that can be used to apply functions to all elements of any data type. These include lapply(), sapply(), maapply(), tapply(), rapply(). These are alternatives to loops and have faster implementation. It is recommended to use these instead of loops as they are much faster and rather easy to implement.

Lets take a look at an example.

The above function should give you the exact same output as rowMeans() and colMeans() respectively.

The apply() functions are particularly handy when you want to write small through away functions.

We will use these in the upcoming session when we want to learn about t.test and linear models.

We will cover more advanced descriptive statistics in the future sessions.

5 Basic Statistics

We will look at the two different statistical tests namely t-test and linear regression to test our hypotheses.

5.1 Student t-test

To apply t.test() in R we can first look at the help by typing ?t.test. We want to find out whether certain genes are deferentially expressed in healthy vs. ill patients using students t-test.

R
## 
##  Welch Two Sample t-test
## 
## data:  gen_counts[1, healthy] and gen_counts[1, ill]
## t = 1.1869, df = 13.681, p-value = 0.2555
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -15.71249  54.46249
## sample estimates:
## mean of x mean of y 
##    30.000    10.625

Lets have a look at the output of the t_res object. First have a look at the object by using str() function.

R
## List of 10
##  $ statistic  : Named num 1.19
##   ..- attr(*, "names")= chr "t"
##  $ parameter  : Named num 13.7
##   ..- attr(*, "names")= chr "df"
##  $ p.value    : num 0.255
##  $ conf.int   : num [1:2] -15.7 54.5
##   ..- attr(*, "conf.level")= num 0.95
##  $ estimate   : Named num [1:2] 30 10.6
##   ..- attr(*, "names")= chr [1:2] "mean of x" "mean of y"
##  $ null.value : Named num 0
##   ..- attr(*, "names")= chr "difference in means"
##  $ stderr     : num 16.3
##  $ alternative: chr "two.sided"
##  $ method     : chr "Welch Two Sample t-test"
##  $ data.name  : chr "gen_counts[1, healthy] and gen_counts[1, ill]"
##  - attr(*, "class")= chr "htest"

We can extract desired output such as p value, means of groups etc. from this object by using $ operator.

R
## [1] 0.255458

Try to extract confidence intervals, means of groups and from the same object.

R
## mean of x 
##        30 
## mean of y 
##    10.625 
## [1] -15.71249  54.46249
## attr(,"conf.level")
## [1] 0.95

5.2 Many t tests in one go.

We can make use of R’s apply function to perform t test on all genes in the data.

For nicer output lets try to access p-values.

And means by using following code.

Can you extract confidence intervals?

5.3 Linear regression

Similar to t.test, one can also apply linear regression on a single gene. Take a look at some examples on how to extract output from the models. use ?lm and methods(class="lm") to know more about what you can extract from linear model fit.

One should be aware the formula argument of the function and it works such that the outcome variable is on the left hand side of the “~” sign and the predictor variables are on the right hand side.

To look at the results from the linear model summary() is a very useful function and it gives most of the output needed for interpretation of results.

To find out what other functions you can apply to linear model output use two of our favorite functions class() and methods().

R
##  [1] add1           alias          anova          case.names    
##  [5] coerce         confint        cooks.distance deviance      
##  [9] dfbeta         dfbetas        drop1          dummy.coef    
## [13] effects        extractAIC     family         formula       
## [17] hatvalues      influence      initialize     kappa         
## [21] labels         logLik         model.frame    model.matrix  
## [25] nobs           plot           predict        print         
## [29] proj           qr             residuals      rstandard     
## [33] rstudent       show           simulate       slotsFromS3   
## [37] summary        variable.names vcov          
## see '?methods' for accessing help and source code

And lets use some of functions available for class “lm” to extract useful output.

R
##     (Intercept) gen_counts[1, ] 
##      46.0159833       0.1007516 
## 
## Call:
## lm(formula = y ~ gen_counts[1, ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.479 -17.016  -7.212  22.575  34.984 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      46.0160     6.9952   6.578 1.23e-05 ***
## gen_counts[1, ]   0.1008     0.1844   0.546    0.593    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.63 on 14 degrees of freedom
## Multiple R-squared:  0.02088,    Adjusted R-squared:  -0.04906 
## F-statistic: 0.2985 on 1 and 14 DF,  p-value: 0.5934
## 
##     (Intercept) gen_counts[1, ] 
##    1.232512e-05    5.934069e-01

5.4 Many Linear regression

Just like we did for t-test, we would like to run linear regression for all our genes and extract output in a nice format.

Store this output in a list and have a look at the summary of results fro the first gene.

R
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.479 -17.016  -7.212  22.575  34.984 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.0160     6.9952   6.578 1.23e-05 ***
## x             0.1008     0.1844   0.546    0.593    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.63 on 14 degrees of freedom
## Multiple R-squared:  0.02088,    Adjusted R-squared:  -0.04906 
## F-statistic: 0.2985 on 1 and 14 DF,  p-value: 0.5934

Once you know the output well enough you can try to run lm on all genes and extract output.

5.5 Logistic regression

If the outcome variable is categorical, one can use glm() function. We can learn more about this in the future sessions. Just as a teaser, try following.

5.6 Multiple testing

Be aware of multiple testing. We will cover that in the future sessions.