1 Introduction

This exercise assumes that you are working with two matrices:

A data matrix containing expression data from multiple samples, with optional annotation columns (can be microarray, RNA-seq, quantitative proteomics)
A design matrix with the sample column names in one column and sample annotations in the remaining

Example data matrix:

pep  RT   s1_1 s1_2 s1_3 s2_1 s2_2 s2_3
ATCA 40.3 20.1 20.2 19.5 22.3 22.5 21.9
GCC  43.0 20.3 19.8 19.5 21.7 21.9 21.2

Example design matrix:

sample  group  other
s1_1    1      batch1
s1_2    1      batch1
s1_3    1      batch2
s2_1    2      batch2
s2_2    2      batch2
s2_3    2      batch1

Note how the entries in the sample column all are present in the data matrix.

You can either use your own data, or download a template dataset:

Download data matrix
Download design matrix

1.1 Required packages

In order to fully run this exercise, please make sure that you have the tidyverse package installed. This package includes ggplot2 and dplyr which both are used in this exercise.

# Load the tidyverse package
library(tidyverse)

## ── Attaching packages ────────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.2.1     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0

## ── Conflicts ───────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

1.2 Preprocessing the data

If using your own data you need to calculate basic statistics before proceeding with the exercises. For this you could use NormalyzerDE. For small datasets you can run it from: quantitativeproteomics.org/NormalyzerDE. If your dataset is larger than 5 MB you could either subset it using the head command, run NormalyzerDE locally or ask contact Jakob for help processing it locally.

2 Loading data

The file paths here assume that you have a folder data within your working directory with a data matrix named data_matrix.tsv and a design matrix named design_matrix.tsv. Please adjust the path and filename to the data you are using.

data_fp <- "data/data_matrix.tsv"
design_fp <- "data/design_matrix.tsv"

# stringAsFactors - If not used, many columns are read as factors, which can be misinterpreted as integers
data_df <- read.table(data_fp, sep="\t", header = TRUE, stringsAsFactors = FALSE)
design_df <- read.table(design_fp, sep="\t", header = TRUE, stringsAsFactors = FALSE)

# For reference: The above code can be replaced with the following from the readr package
# full_data_df <- read_tsv(data_fp)
# design_df <- read_tsv(design_fp)

2.1 Investigate the data

str: Produces a brief summary of the object that is given to it
head: Retrieve the first n rows from a data.frame or matrix, where n here is specified to 10
colnames: Show the column names of the data frame (or matrix)

Exercise 1: Inspect your data and make sure you understand what is present in the design- and the data matrix. Verify that the columns are in the format you expect them to be.

str(design_df)

## 'data.frame':    18 obs. of  4 variables:
##  $ sample  : chr  "dil9_5" "dil9_4" "dil8_1" "dil8_2" ...
##  $ group   : chr  "g9" "g9" "g8" "g8" ...
##  $ time    : chr  "late" "late" "late" "late" ...
##  $ dilution: chr  "high" "high" "high" "high" ...

head(design_df, 10)

##    sample group time dilution
## 1  dil9_5    g9 late     high
## 2  dil9_4    g9 late     high
## 3  dil8_1    g8 late     high
## 4  dil8_2    g8 late     high
## 5  dil8_3    g8 late     high
## 6  dil9_3    g9 late     high
## 7  dil8_5    g8 late      low
## 8  dil8_4    g8 late      low
## 9  dil9_2    g9 late      low
## 10 dil9_1    g9 late      low

colnames(design_df)

## [1] "sample"   "group"    "time"     "dilution"

dim(design_df)

## [1] 18  4

str(data_df)

## 'data.frame':    4008 obs. of  28 variables:
##  $ Average.RT      : num  61.3 99 99 42.7 52.4 ...
##  $ Cluster.ID      : chr  "1418128207109" "1418128190676" "1418128188736" "1418128214098" ...
##  $ Peptide.Sequence: chr  "AAAAAAGDVFKDVQEAK" "AAADYLEVPLYTYLGGFNTK" "AAADYLEVPLYTYLGGFNTK" "AAAEALAQEQAAK" ...
##  $ External.IDs    : chr  "Q99Z57" "P69949" "P69949" "Q9A1Z8" ...
##  $ Charge          : int  3 2 3 2 2 2 3 2 2 2 ...
##  $ Average.m.z     : num  555 1104 736 636 603 ...
##  $ P.Value         : num  0.677 0.377 0.687 0.884 0.983 ...
##  $ adj.P.Val       : num  1 1 1 1 1 ...
##  $ log2Fold        : num  0.52036 -0.46058 0.28254 -0.08969 -0.00894 ...
##  $ AvgExpr         : num  17.6 17.4 14.3 17.9 18.9 ...
##  $ dil9_5          : num  NA NA NA 18.7 19.1 ...
##  $ dil9_4          : num  18 NA NA 18.2 18.7 ...
##  $ dil8_1          : num  NA NA NA 17.9 19 ...
##  $ dil8_2          : num  NA NA NA 18.5 19 ...
##  $ dil8_3          : num  18.7 NA NA 18.8 19.4 ...
##  $ dil9_3          : num  NA NA NA 18.9 19.4 ...
##  $ dil8_5          : num  16.6 NA NA 16 17.3 ...
##  $ dil8_4          : num  17 NA NA 15.7 17.6 ...
##  $ dil9_2          : num  NA NA NA 15 17.6 ...
##  $ dil9_1          : num  NA NA NA 15.6 17.5 ...
##  $ dil8_6          : num  NA 18.1 NA 19 20.1 ...
##  $ dil9_6          : num  NA 17.7 14.8 18.7 19.4 ...
##  $ dil8_7          : num  NA 17.7 14.4 18.7 19.6 ...
##  $ dil9_7          : num  NA 16.6 13.7 18.1 19.1 ...
##  $ dil8_8          : num  NA 17.1 13.3 18.2 19.2 ...
##  $ dil9_8          : num  NA 17.7 14.8 18.8 19.8 ...
##  $ dil8_9          : num  NA 17.6 14.8 18.4 19.4 ...
##  $ dil9_9          : num  NA 16.6 NA 18.5 19.7 ...

head(data_df, 10)

##    Average.RT                  Cluster.ID     Peptide.Sequence
## 1    61.32530               1418128207109    AAAAAAGDVFKDVQEAK
## 2    99.03974               1418128190676 AAADYLEVPLYTYLGGFNTK
## 3    98.97744               1418128188736 AAADYLEVPLYTYLGGFNTK
## 4    42.66328               1418128214098        AAAEALAQEQAAK
## 5    52.43872               1418128213643         AAAEGNFAGIER
## 6    69.56549 1418128215927,1418128232269    AAAELALISGQKPLITK
## 7    69.72379               1418128213201    AAAELALISGQKPLITK
## 8    79.61463               1418128179641         AAAGIISPWFSK
## 9    55.89205               1418128212810          AAASIVIVDSR
## 10   89.38617               1418128190315  AAATPEAVIAVFDEASTAK
##     External.IDs Charge Average.m.z   P.Value adj.P.Val     log2Fold
## 1         Q99Z57      3    554.6214 0.6773926 0.9998144  0.520363420
## 2         P69949      2   1103.5547 0.3770416 0.9998144 -0.460577131
## 3         P69949      3    736.0416 0.6867821 0.9998144  0.282537091
## 4         Q9A1Z8      2    636.3341 0.8840833 0.9998144 -0.089692270
## 5         Q99YE5      2    603.2997 0.9833695 0.9998144 -0.008937177
## 6         Q9A1W2      2    862.5214 0.8341496 0.9998144  0.122805456
## 7         Q9A1W2      3    575.3502 0.9145702 0.9998144  0.077139498
## 8  Q9A048,Q48Z67      2    624.3418 0.3760239 0.9998144 -0.645683321
## 9         P68900      2    551.3176 0.9438940 0.9998144 -0.040146075
## 10        Q9A0B0      2    931.4782 0.8271795 0.9998144 -0.137790319
##     AvgExpr   dil9_5   dil9_4   dil8_1   dil8_2   dil8_3   dil9_3   dil8_5
## 1  17.58087       NA 17.97114       NA       NA 18.73059       NA 16.63856
## 2  17.40734       NA       NA       NA       NA       NA       NA       NA
## 3  14.30422       NA       NA       NA       NA       NA       NA       NA
## 4  17.87283 18.70998 18.17519 17.94717 18.48465 18.77886 18.86186 16.04935
## 5  18.92062 19.12293 18.67264 18.99337 18.97384 19.36756 19.41670 17.25360
## 6  21.16187 20.41252 19.89131 20.08161 19.82381 21.27203 21.00412 19.43886
## 7  22.46114 22.26701 21.49473 21.64091 21.80175 21.95921 21.97757 20.19211
## 8  16.00691 16.30645       NA 15.07881 15.58317       NA       NA       NA
## 9  21.54734 21.61346 21.20912 21.60655 21.49148 21.20536 21.45241 19.61898
## 10 15.85908       NA       NA       NA       NA       NA       NA       NA
##      dil8_4   dil9_2   dil9_1   dil8_6   dil9_6   dil8_7   dil9_7   dil8_8
## 1  16.98317       NA       NA       NA       NA       NA       NA       NA
## 2        NA       NA       NA 18.09947 17.74470 17.73883 16.64132 17.10829
## 3        NA       NA       NA       NA 14.84994 14.36933 13.65260 13.32145
## 4  15.71238 15.00452 15.58053 19.03427 18.69501 18.67079 18.12471 18.18855
## 5  17.59831 17.55021 17.46875 20.06490 19.43200 19.56391 19.11942 19.16025
## 6  19.46593 20.69190 19.90472 22.96216 21.87029 21.91014 21.93753 22.37153
## 7  20.35781 20.38597 20.42075 24.16540 23.69706 23.78264 23.41764 23.56234
## 8        NA 14.73884       NA 16.87624       NA 16.42151       NA 16.16485
## 9  19.63612 19.59374 19.68795 23.09042 22.62403 22.62896 22.04219 22.24242
## 10       NA       NA       NA 15.48076 14.87317 15.69007 14.88943 15.88645
##      dil9_8   dil8_9   dil9_9
## 1        NA       NA       NA
## 2  17.74684 17.60395 16.57536
## 3  14.83394 14.79809       NA
## 4  18.76573 18.39302 18.53430
## 5  19.76935 19.35004 19.69335
## 6  23.05443 22.57813 22.24263
## 7  24.67505 24.34093 24.16159
## 8        NA 16.88539       NA
## 9  22.83190 22.58638 22.69056
## 10 16.80297 16.65464 16.59519

colnames(data_df)

##  [1] "Average.RT"       "Cluster.ID"       "Peptide.Sequence"
##  [4] "External.IDs"     "Charge"           "Average.m.z"     
##  [7] "P.Value"          "adj.P.Val"        "log2Fold"        
## [10] "AvgExpr"          "dil9_5"           "dil9_4"          
## [13] "dil8_1"           "dil8_2"           "dil8_3"          
## [16] "dil9_3"           "dil8_5"           "dil8_4"          
## [19] "dil9_2"           "dil9_1"           "dil8_6"          
## [22] "dil9_6"           "dil8_7"           "dil9_7"          
## [25] "dil8_8"           "dil9_8"           "dil8_9"          
## [28] "dil9_9"

dim(data_df)

## [1] 4008   28

2.2 Adding a ‘significance’ column

This code adds new columns to the data matrix. These will contain TRUE or FALSE for each row depending on the p- or FDR-value was lower than a given threshold.

Exercise: Add another column where you specified FDR values (adj.P.Val) lower than 0.05 and one with lower than 0.1. How many significant entries do you get in each case?

data_df$IsSig <- data_df$P.Value < 0.05
data_df$IsFDRSig <- data_df$adj.P.Val < 0.2

# Inspect the results
head(data_df)

##   Average.RT                  Cluster.ID     Peptide.Sequence External.IDs
## 1   61.32530               1418128207109    AAAAAAGDVFKDVQEAK       Q99Z57
## 2   99.03974               1418128190676 AAADYLEVPLYTYLGGFNTK       P69949
## 3   98.97744               1418128188736 AAADYLEVPLYTYLGGFNTK       P69949
## 4   42.66328               1418128214098        AAAEALAQEQAAK       Q9A1Z8
## 5   52.43872               1418128213643         AAAEGNFAGIER       Q99YE5
## 6   69.56549 1418128215927,1418128232269    AAAELALISGQKPLITK       Q9A1W2
##   Charge Average.m.z   P.Value adj.P.Val     log2Fold  AvgExpr   dil9_5
## 1      3    554.6214 0.6773926 0.9998144  0.520363420 17.58087       NA
## 2      2   1103.5547 0.3770416 0.9998144 -0.460577131 17.40734       NA
## 3      3    736.0416 0.6867821 0.9998144  0.282537091 14.30422       NA
## 4      2    636.3341 0.8840833 0.9998144 -0.089692270 17.87283 18.70998
## 5      2    603.2997 0.9833695 0.9998144 -0.008937177 18.92062 19.12293
## 6      2    862.5214 0.8341496 0.9998144  0.122805456 21.16187 20.41252
##     dil9_4   dil8_1   dil8_2   dil8_3   dil9_3   dil8_5   dil8_4   dil9_2
## 1 17.97114       NA       NA 18.73059       NA 16.63856 16.98317       NA
## 2       NA       NA       NA       NA       NA       NA       NA       NA
## 3       NA       NA       NA       NA       NA       NA       NA       NA
## 4 18.17519 17.94717 18.48465 18.77886 18.86186 16.04935 15.71238 15.00452
## 5 18.67264 18.99337 18.97384 19.36756 19.41670 17.25360 17.59831 17.55021
## 6 19.89131 20.08161 19.82381 21.27203 21.00412 19.43886 19.46593 20.69190
##     dil9_1   dil8_6   dil9_6   dil8_7   dil9_7   dil8_8   dil9_8   dil8_9
## 1       NA       NA       NA       NA       NA       NA       NA       NA
## 2       NA 18.09947 17.74470 17.73883 16.64132 17.10829 17.74684 17.60395
## 3       NA       NA 14.84994 14.36933 13.65260 13.32145 14.83394 14.79809
## 4 15.58053 19.03427 18.69501 18.67079 18.12471 18.18855 18.76573 18.39302
## 5 17.46875 20.06490 19.43200 19.56391 19.11942 19.16025 19.76935 19.35004
## 6 19.90472 22.96216 21.87029 21.91014 21.93753 22.37153 23.05443 22.57813
##     dil9_9 IsSig IsFDRSig
## 1       NA FALSE    FALSE
## 2 16.57536 FALSE    FALSE
## 3       NA FALSE    FALSE
## 4 18.53430 FALSE    FALSE
## 5 19.69335 FALSE    FALSE
## 6 22.24263 FALSE    FALSE

# The table command summarizes what type of entries is present in a vector, and in which number
table(data_df$IsSig)

## 
## FALSE  TRUE 
##  3879    86

table(data_df$IsFDRSig)

## 
## FALSE  TRUE 
##  3948    17

3 Visualize

I personally dislike the gray background you get in the default ggplot-plots. There are a number of themes that can be assigned for different styles. Here, I assign the style classic as the global style.

Check here for a selection of themes: https://ggplot2.tidyverse.org/reference/ggtheme.html

theme_set(theme_classic())

3.1 P-value histogram

For further discussion on its interpretation, see the following link: http://varianceexplained.org/statistics/interpreting-pvalue-histogram/

Special settings:

geom_histogram - Geometry specifying to do histograms, where we are cutting up a range of data in bins
bins - Specify how many segments the data should be separated in

When running this you get a warning “Removed 43 rows containing non-finite values (stat_bin)”. This is due to missing values in the dataset, and is not a problem.

ggplot(data_df, aes(x=P.Value)) + geom_histogram(bins=100)

## Warning: Removed 43 rows containing non-finite values (stat_bin).

This can easily be adjusted to other characteristics, such as expression or fold change:

ggplot(data_df, aes(x=AvgExpr)) + geom_histogram(bins=100)

## Warning: Removed 43 rows containing non-finite values (stat_bin).

Exercise: Further explore the data_df matrix using the histogram. Use colnames to inspect what columns there are in the matrix. Try making histograms for different columns, and try to adjust the number of bins in the histogram using the bins option.

3.2 Using multiple aesthetics - coloring the histogram

If you want to color those entries that pass your adjusted p-value threshold in a histogram, you can use the fill aesthetics.

Here, we want to highlight entries passing our assigned FDR threshold, which is specified in the column IsFDRSig that we created above.

The #AAAAAA is hexcode and a way to specify colors. The pairs of letters specify the amount of red, green and blue respectively.

gray_blue_colors <- c("#AAAAAA", "#0C81A2")

ggplot(data_df, aes(x=P.Value, fill=IsFDRSig)) + geom_histogram(bins=100) + scale_fill_manual(values=gray_blue_colors)

## Warning: Removed 43 rows containing non-finite values (stat_bin).

Exercise: Try changing the selected colors (you can find choices at the following link: https://www.w3schools.com/colors/colors_picker.asp).

Exercise: Try coloring based on other criteria than IsFDRSig (for instance IsSig)

Challenge exercise: Color based on if a feature is upregulated or not (i.e. if its fold-change is higher or lower than 0).

3.3 Volcano plot

The volcano plot gives a quick overview of the fold-changes and p-values within your dataset. The x-axis shows the log2-fold and the y-axis is rescaled p-values where the top ones are those with the lowest p-values.

ggplot(data_df, aes(x=log2Fold, y=-log10(P.Value))) + geom_point()

## Warning: Removed 43 rows containing missing values (geom_point).

You can color the significant hits here too, similarly to in the p-value histogram.

ggplot(data_df, aes(x=log2Fold, y=-log10(P.Value), color=IsFDRSig)) + geom_point()

## Warning: Removed 43 rows containing missing values (geom_point).

We can improve the visuals further by using the options:

alpha - Make the dots partially transparent
na.rm=TRUE - Omit the annoying warning message about NA values
ggtitle - Add a title to the plot
xlab / ylab - Custom axis-labels

ggplot(data_df, aes(x=log2Fold, y=-log10(P.Value), color=IsFDRSig)) + 
    geom_point(alpha=0.6, na.rm=TRUE) +
    ggtitle("Fancy volcano plot") +
    xlab("Fold regulation (log2)") +
    ylab("Significance (-log10)") + 
    scale_color_manual(values=gray_blue_colors)

Exercise: Try changing the x/y labels and the title

Exercise: Try adjusting the coloring aesthetics. Use a different column for coloring. You could also switch the coloring scheme.

Thinking point: When can we have a large fold change but still have a large p-value?

Challenge exercise: Add labels to the significant features. Instructions on how to achieve this can be found here: https://www.gettinggeneticsdone.com/2016/01/repel-overlapping-text-labels-in-ggplot2.html

3.4 MA plot

The MA plot is very similar to the vulcano plot, with the only difference what we are specifying as x-axis and y-axis.

The command arrange(data_df, desc(P.Value)) is a Tidyverse command for sorting a data frame. Here, they are sorting with the features with the lowest p-values at the bottom to make sure they are drawn on top.

ggplot(arrange(data_df, desc(P.Value)), aes(x=AvgExpr, y=log2Fold, color=IsFDRSig)) + 
    geom_point(alpha=0.6, na.rm=TRUE) +
    ggtitle("MA") +
    xlab("Expression level") +
    ylab("Fold change (log2)") + 
    scale_color_manual(values=gray_blue_colors)

Exercise: Take a look at the output from arrange(data_df, desc(P.Value)). Do you see what is going on if you compare it to the output from data_df?

4 Single feature illustrations

We often want to study intensity levels of selected features. We first need to prepare the data in the right format and then we can perform various visualizations.

ggplot expects the data to be in the so called ‘tidy’ format where one column contains the values, and the other columns contain information linked to that value. This is further discussed below in the “Preparing data in long format” section.

4.1 Preparing illustration for single features

First, we sort the data frame on the P.Value column and take a look at those with the highest and lowest p-value.

head(arrange(data_df, P.Value))

##   Average.RT                  Cluster.ID Peptide.Sequence External.IDs
## 1   60.26076 1418128185272,1418128223206         VCFQYLNR       hum272
## 2   60.60463               1418128210770          IALDFQR       hum210
## 3   39.67873               1418128213014      ENAGEDPGLAR       hum240
## 4   48.64133               1418128214834     FLGEVHCGTNVR       hum269
## 5   62.81780               1418128211443        AGLQFPVGR       hum284
## 6   49.44278               1418128215034     HSDAVFTDNYTR       hum163
##   Charge Average.m.z      P.Value adj.P.Val log2Fold  AvgExpr   dil9_5
## 1      2    550.2725 0.0001043932 0.1214715 2.977166 19.64220       NA
## 2      2    431.7433 0.0001483525 0.1214715 1.961896 18.80728 20.18851
## 3      2    564.7685 0.0001660904 0.1214715 2.026260 18.51726 18.94022
## 4      2    694.8409 0.0002228151 0.1214715 1.632291 16.72149 17.47698
## 5      2    472.7698 0.0002243532 0.1214715 1.951682 19.69827 20.88876
## 6      2    713.3251 0.0002383996 0.1214715 2.395875 16.68474 17.36265
##     dil9_4   dil8_1   dil8_2   dil8_3   dil9_3   dil8_5   dil8_4   dil9_2
## 1       NA       NA       NA 18.41610 21.88865 16.26718       NA       NA
## 2 19.66857 17.99485 18.37609 18.50994 20.37002 16.22937 16.69950 18.29133
## 3 18.73956 17.02382 17.06114       NA       NA       NA       NA       NA
## 4 17.06073 15.00175 15.69515 16.46954 18.32329       NA       NA 16.50913
## 5 20.39044 18.65152 18.90877 19.39108 21.16600 17.27939 17.45454 19.17637
## 6 16.68158 15.11930 15.14777 16.25041 17.65572 14.30295 14.90223 16.31986
##     dil9_1   dil8_6   dil9_6   dil8_7   dil9_7   dil8_8   dil9_8   dil8_9
## 1 20.46198       NA 20.66011 18.61546 20.27065 18.10826 21.73932 18.68445
## 2 18.43800 18.65345 20.14669 18.16291 19.68011 17.84140 21.10707 17.96952
## 3       NA       NA 19.83402 17.97490 19.57531 17.16462 19.91811 17.83567
## 4       NA       NA 17.55349 16.10982 17.09761 15.68256 17.77533 15.77371
## 5 19.23063 19.69910 21.03337 19.15981 20.56503 18.79018 22.10187 19.16744
## 6 16.27362       NA 18.25063       NA 18.03754       NA 19.03499       NA
##     dil9_9 IsSig IsFDRSig
## 1 20.95204  TRUE     TRUE
## 2 20.20381  TRUE     TRUE
## 3 19.62251  TRUE     TRUE
## 4 17.57183  TRUE     TRUE
## 5 21.51451  TRUE     TRUE
## 6 18.24705  TRUE     TRUE

tail(arrange(data_df, P.Value))

##      Average.RT    Cluster.ID Peptide.Sequence  External.IDs Charge
## 4003   39.03130 1418128228099         VENSWGEK        Q99YL0      2
## 4004   42.78815 1418128228616          VIDTSIR        Q9A0J0      2
## 4005   47.62865 1418128210565          VINDFTK        P68900      2
## 4006   71.77410 1418128227443    VVDAFSTNFHLPK        Q99Z49      3
## 4007   39.75320 1418128216450         YADATSLK Q9A1H0,Q490W3      2
## 4008   63.88005 1418128223337       YLYTISPNIK        Q99ZR2      2
##      Average.m.z P.Value adj.P.Val log2Fold AvgExpr   dil9_5   dil9_4
## 4003    474.7267      NA        NA       NA      NA       NA       NA
## 4004    402.2355      NA        NA       NA      NA       NA       NA
## 4005    418.7291      NA        NA       NA      NA 24.06851 23.57863
## 4006    492.2607      NA        NA       NA      NA       NA       NA
## 4007    434.7244      NA        NA       NA      NA       NA       NA
## 4008    606.3358      NA        NA       NA      NA       NA       NA
##      dil8_1 dil8_2   dil8_3 dil9_3   dil8_5 dil8_4 dil9_2 dil9_1 dil8_6
## 4003     NA     NA       NA     NA       NA     NA     NA     NA     NA
## 4004     NA     NA       NA     NA       NA     NA     NA     NA     NA
## 4005     NA     NA       NA     NA       NA     NA     NA     NA     NA
## 4006     NA     NA       NA     NA       NA     NA     NA     NA     NA
## 4007     NA     NA       NA     NA       NA     NA     NA     NA     NA
## 4008     NA     NA 18.62476     NA 15.37982     NA     NA     NA     NA
##        dil9_6   dil8_7   dil9_7   dil8_8   dil9_8   dil8_9   dil9_9 IsSig
## 4003 18.70282       NA       NA       NA 17.95279       NA 18.63583    NA
## 4004       NA       NA       NA 15.42199       NA 16.23517       NA    NA
## 4005       NA       NA       NA       NA       NA       NA       NA    NA
## 4006 18.43230       NA 18.34095       NA       NA       NA       NA    NA
## 4007       NA 17.47744       NA 16.71426       NA       NA       NA    NA
## 4008       NA       NA       NA       NA       NA       NA       NA    NA
##      IsFDRSig
## 4003       NA
## 4004       NA
## 4005       NA
## 4006       NA
## 4007       NA
## 4008       NA

Next, we extract the top feature and put it in its own data frame. There are a few things going on here:

data_df$External.IDs gives us the column only with the external IDs
data_df$External.IDs == "hum272" will give us a long vector with TRUE/FALSE values (here only true for the “hum272” entry)
data_df[data_df$External.IDs == "hum272", ] here we select all rows that are true in the previous vector (in this case - only one)
best_hit[, design_df$sample] here best_hit is a single-row data frame. We subset it with the sample names from the design matrix so that we only have the values left.

best_hit <- data_df[data_df$External.IDs == "hum272", ]
best_hit_vals <- best_hit[, design_df$sample]
print(best_hit)

##      Average.RT                  Cluster.ID Peptide.Sequence External.IDs
## 3495   60.26076 1418128185272,1418128223206         VCFQYLNR       hum272
##      Charge Average.m.z      P.Value adj.P.Val log2Fold AvgExpr dil9_5
## 3495      2    550.2725 0.0001043932 0.1214715 2.977166 19.6422     NA
##      dil9_4 dil8_1 dil8_2  dil8_3   dil9_3   dil8_5 dil8_4 dil9_2   dil9_1
## 3495     NA     NA     NA 18.4161 21.88865 16.26718     NA     NA 20.46198
##      dil8_6   dil9_6   dil8_7   dil9_7   dil8_8   dil9_8   dil8_9   dil9_9
## 3495     NA 20.66011 18.61546 20.27065 18.10826 21.73932 18.68445 20.95204
##      IsSig IsFDRSig
## 3495  TRUE     TRUE

print(best_hit_vals)

##      dil9_5 dil9_4 dil8_1 dil8_2  dil8_3   dil9_3   dil8_5 dil8_4 dil9_2
## 3495     NA     NA     NA     NA 18.4161 21.88865 16.26718     NA     NA
##        dil9_1 dil8_6   dil9_6   dil8_7   dil9_7   dil8_8   dil9_8   dil8_9
## 3495 20.46198     NA 20.66011 18.61546 20.27065 18.10826 21.73932 18.68445
##        dil9_9
## 3495 20.95204

Exercise: Take a closer look at the code. Make sure you understand how the data is extracted from the dataframe.

Exercise: Do the same again but for the feature with the lowest p-value.

The final step is to include these values in a new data frame. We use the cbind command to bind this value together with the design matrix. Now we have a data frame containing both the sample annotation information from the data frame and expression values for our feature of interest.

unlist - best_hit_vals is a single-row data frame. To bind it we need it as a vector, which unlist does for us.
cbind - “column bind”, let’s us bind one or many columns together (must have the same number of rows). The corresponding command for row-wise binding is rbind.

best_hit_df <- cbind(value=unlist(best_hit_vals), design_df)
head(best_hit_df)

##           value sample group time dilution
## dil9_5       NA dil9_5    g9 late     high
## dil9_4       NA dil9_4    g9 late     high
## dil8_1       NA dil8_1    g8 late     high
## dil8_2       NA dil8_2    g8 late     high
## dil8_3 18.41610 dil8_3    g8 late     high
## dil9_3 21.88865 dil9_3    g9 late     high

Exercise: Bind the high-pvalue feature that you prepared before to the same data frame.

4.2 Illustrating single features

Now we are ready to study this feature. At the same time, it is a good time to try out some different geometics, and try using multiple geometics at once.

ggplot(best_hit_df, aes(x=as.factor(group), y=value)) + geom_point()

## Warning: Removed 7 rows containing missing values (geom_point).

ggplot(best_hit_df, aes(x=as.factor(group), y=value)) + geom_boxplot() + geom_point()

## Warning: Removed 7 rows containing non-finite values (stat_boxplot).

## Warning: Removed 7 rows containing missing values (geom_point).

ggplot(best_hit_df, aes(x=as.factor(group), y=value)) + geom_boxplot() + geom_point(aes(color=time), position=position_jitter(0.1), size=3)

## Warning: Removed 7 rows containing non-finite values (stat_boxplot).

## Warning: Removed 7 rows containing missing values (geom_point).

Exercise: Do the same illustrations using the high p-value feature. Can you relate the figure to the p-values you have obtained?

Exercise: Try changing the coloring from “time” to “group” in the last plot. Does the results make sense?

Exercise: Update the xlabel, ylabel and title to be more illustrative

Exercise: Experiment with position_jitter setting. What happens if you change the value from 0.1 to 0.3? To 0? What happens if you change position_jitter to position_jitter_dodge? If you remove the entire argument?

Exercise: Try assigning aes(color=time) inside the geom_boxplot call (in the last plot).

5 Preparing data in long format

5.1 Wide vs. long formats

ggplot generally expects to get the data in the so called “long format”. This is different from the “wide format” we commonly used for expression matrices. Example of wide format, with one gene per row, and one sample per column:

gene   s1  s2  s3
gene_A  4   5   3
gene_B  3   2   3

The same matrix in long format:

gene   sample value
gene_A     s1     4
gene_A     s2     5
gene_A     s3     3
gene_B     s1     3
gene_B     s2     2
gene_B     s3     3

Here, all values are in one column, and we have a new column specifying to which sample they belong. Also notice that we have multiple rows for each gene (seen in the gene column). The long format is less pleasing for the human eye to look at, but makes it easier for ggplot as you only need to keep track of two columns to keep track of each sample and value.

This conversion can be made using the tidyverse-command pivot_longer. Here, you specify:

The data frame to convert to long format
What columns to transform (in our case - the sample columns)
The names of the sample- and value-columns after conversion

Non-sample columns will follow along similarly to the gene column in the small example above.

long_data <- pivot_longer(data_df, as.character(design_df$sample), names_to="sample", values_to="value")
head(long_data)

## # A tibble: 6 x 14
##   Average.RT Cluster.ID Peptide.Sequence External.IDs Charge Average.m.z
##        <dbl> <chr>      <chr>            <chr>         <int>       <dbl>
## 1       61.3 141812820… AAAAAAGDVFKDVQE… Q99Z57            3        555.
## 2       61.3 141812820… AAAAAAGDVFKDVQE… Q99Z57            3        555.
## 3       61.3 141812820… AAAAAAGDVFKDVQE… Q99Z57            3        555.
## 4       61.3 141812820… AAAAAAGDVFKDVQE… Q99Z57            3        555.
## 5       61.3 141812820… AAAAAAGDVFKDVQE… Q99Z57            3        555.
## 6       61.3 141812820… AAAAAAGDVFKDVQE… Q99Z57            3        555.
## # … with 8 more variables: P.Value <dbl>, adj.P.Val <dbl>, log2Fold <dbl>,
## #   AvgExpr <dbl>, IsSig <lgl>, IsFDRSig <lgl>, sample <chr>, value <dbl>

Exercise: Inspect the long_data table using the commands dim, str and head. Compare it to the wide format data_df data frame.

5.2 Visualizing sample density distributions

Having the data in long format we can now easily do sample-wide visualizations.

For instance, we can do density plots to see how the values in each sample is distributed.

ggplot(long_data, aes(x=value, color=sample)) + geom_density()

## Warning: Removed 20006 rows containing non-finite values (stat_density).

It looks like we have a four outliers here! (seen shifted to the left). These will likely cause problems in our downstream analysis if not handled with care.

5.3 Adding sample-annotation to the long dataframe

Often we want to color on different conditions to look for patterns in the data. Then we need to merge in out design matrix into this long data.

This can be achieved using a left_join from the dplyr package. Here, we specify the two matrices to merge, and which column we use for the merge.

annot_long_data <- left_join(long_data, design_df, by="sample")
head(annot_long_data)

## # A tibble: 6 x 17
##   Average.RT Cluster.ID Peptide.Sequence External.IDs Charge Average.m.z
##        <dbl> <chr>      <chr>            <chr>         <int>       <dbl>
## 1       61.3 141812820… AAAAAAGDVFKDVQE… Q99Z57            3        555.
## 2       61.3 141812820… AAAAAAGDVFKDVQE… Q99Z57            3        555.
## 3       61.3 141812820… AAAAAAGDVFKDVQE… Q99Z57            3        555.
## 4       61.3 141812820… AAAAAAGDVFKDVQE… Q99Z57            3        555.
## 5       61.3 141812820… AAAAAAGDVFKDVQE… Q99Z57            3        555.
## 6       61.3 141812820… AAAAAAGDVFKDVQE… Q99Z57            3        555.
## # … with 11 more variables: P.Value <dbl>, adj.P.Val <dbl>,
## #   log2Fold <dbl>, AvgExpr <dbl>, IsSig <lgl>, IsFDRSig <lgl>,
## #   sample <chr>, value <dbl>, group <chr>, time <chr>, dilution <chr>

Exercise:: Inspect the annot_long_data data frame and compare it to the long_data matrix. Do you see how the design matrix got added? How many columns were added? How many columns are there in the design dataframe?

5.4 Density plots

Now we can use different characteristics found in the design matrix to study the sample densities. Note that we need to use the group aesthetics to separate the samples when using other colorings.

ggplot(annot_long_data, aes(x=value, color=time)) + geom_density()

## Warning: Removed 20006 rows containing non-finite values (stat_density).

ggplot(annot_long_data, aes(x=value, color=time, group=sample)) + geom_density()

## Warning: Removed 20006 rows containing non-finite values (stat_density).

Exercise: There are a lot of technical influence going on in this dataset. Can you spot the two batch effects by visually inspecting the figure?

Exercise: Try coloring the figure on other sample characteristics found in the data frame. Can you spot which matches the pattern? Do you see why these samples are shifted this way?

5.5 Boxplot or violins

Now we can easily do boxplots and other illustrations of our samples.

Here, we use the theme option to adjust the x-axis labels (axis.text.x) and to rotate them 90 degrees. We also adjust them to be nicely aligned with the ticks. The theme option is a powerful aspect of ggplot which lets you target in principle any aspects for the figures and change fonts, colorings, text orientations and much more.

ggplot(annot_long_data, aes(x=sample, y=value, color=time)) + 
    geom_boxplot() + 
    theme(axis.text.x = element_text(angle=90, vjust=0.5)) + 
    ggtitle("Sample intensity levels")

## Warning: Removed 20006 rows containing non-finite values (stat_boxplot).

Exercise: See if you can make a violin plot for the densities (tips: the geom for the violin plot is geom_violin).

Exercise: Try adjusting the vjust value. Do you see what is happening?

Challenge exercise: See if you can rotate the y-axis labels.

6 Special topics

6.1 Making a multi-pane plot

You often want to combine multiple plots

library(ggpubr)

## Loading required package: magrittr

## 
## Attaching package: 'magrittr'

## The following object is masked from 'package:purrr':
## 
##     set_names

## The following object is masked from 'package:tidyr':
## 
##     extract

dens1 <- ggplot(annot_long_data, aes(x=value, color=time)) + geom_density()
dens2 <- ggplot(annot_long_data, aes(x=value, color=time, group=sample)) + geom_density()
ggarrange(dens1, dens2)

## Warning: Removed 20006 rows containing non-finite values (stat_density).

## Warning: Removed 20006 rows containing non-finite values (stat_density).

grid_plt <- ggarrange(dens1, dens2, common.legend = TRUE, legend = "right")

## Warning: Removed 20006 rows containing non-finite values (stat_density).

## Warning: Removed 20006 rows containing non-finite values (stat_density).

## Warning: Removed 20006 rows containing non-finite values (stat_density).

annotated_plt <- annotate_figure(grid_plt, top="Grid title")
annotated_plt

6.2 Saving plots to file

ggsave(annotated_plt, filename = "output_path.png")

## Saving 7 x 5 in image

You can specify the measures, and the resolution.

ggsave(annotated_plt, filename = "output_path_highres.png", width=10, height=10, dpi = 300)

7 Special cases

7.1 Heatmaps

Source: https://learnr.wordpress.com/2010/01/26/ggplot2-quick-heatmap-plotting/ Source for base R: https://www.r-graph-gallery.com/heatmap

Heatmaps can be done in ggplot using geom_tile(), but in this particular case I would recommend using the base R version due to its clustering of the entries.

library(tidyverse)

heatmap(data_df %>% head(100) %>% dplyr::select(design_df$sample) %>% as.matrix() %>% na.omit())

colnames(data_df) <- make.names(colnames(data_df))
out <- data_df %>%
    arrange(P.Value) %>% 
    head(10) %>% 
    pivot_longer(as.character(design_df$sample), names_to="sample", values_to="value") %>% 
    filter(!is.na(value))
head(out)

## # A tibble: 6 x 14
##   Average.RT Cluster.ID Peptide.Sequence External.IDs Charge Average.m.z
##        <dbl> <chr>      <chr>            <chr>         <int>       <dbl>
## 1       60.3 141812818… VCFQYLNR         hum272            2        550.
## 2       60.3 141812818… VCFQYLNR         hum272            2        550.
## 3       60.3 141812818… VCFQYLNR         hum272            2        550.
## 4       60.3 141812818… VCFQYLNR         hum272            2        550.
## 5       60.3 141812818… VCFQYLNR         hum272            2        550.
## 6       60.3 141812818… VCFQYLNR         hum272            2        550.
## # … with 8 more variables: P.Value <dbl>, adj.P.Val <dbl>, log2Fold <dbl>,
## #   AvgExpr <dbl>, IsSig <lgl>, IsFDRSig <lgl>, sample <chr>, value <dbl>

ggplot(out, aes(sample, External.IDs)) + 
    geom_tile(aes(fill=value)) + 
    theme(axis.text.x = element_text(angle=90, vjust=0.5))

7.2 Principal component analysis

There are many ways to do principal component analysis visualizations in R. The one I find easiest to use is the PCAtools package.

Here is a nice vignette for more extensive examples: https://bioconductor.org/packages/release/bioc/vignettes/PCAtools/inst/doc/PCAtools.html

removeVar - Removing highly variant features to get a better view of trends
scale - Whether all dimensions should be scaled to same size range before calculating importance

library(PCAtools)

## Loading required package: ggrepel

## Loading required package: reshape2

## 
## Attaching package: 'reshape2'

## The following object is masked from 'package:tidyr':
## 
##     smiths

## Loading required package: lattice

## Loading required package: cowplot

## 
## ********************************************************

## Note: As of version 1.0.0, cowplot does not change the

##   default ggplot2 theme anymore. To recover the previous

##   behavior, execute:
##   theme_set(theme_cowplot())

## ********************************************************

## 
## Attaching package: 'cowplot'

## The following object is masked from 'package:ggpubr':
## 
##     get_legend

## 
## Attaching package: 'PCAtools'

## The following objects are masked from 'package:stats':
## 
##     biplot, screeplot

# Biplots for different components and colorings
# Pairplot

data_only <- data_df[, design_df$sample]

data_only_no_missing <- data_only[complete.cases(data_only), ]

p <- pca(data_only_no_missing, removeVar=0.1, scale=TRUE, metadata=as.matrix(design_df))

## -- removing the lower 10% of variables based on variance

biplot(p, colby = "group")

biplot(p, colby = "time")

biplot(p, colby = "time", shape="group", legendPosition = "right")

screeplot(p)

pairsplot(p, colby = "group")

## Coordinate system already present. Adding new coordinate system, which will replace the existing one.

## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.

pairsplot(p, colby = "time")

## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.

Visualize expression data in R

Jakob Willforss

October 21, 2019