This exercise assumes that you are working with two matrices:
Example data matrix:
pep RT s1_1 s1_2 s1_3 s2_1 s2_2 s2_3
ATCA 40.3 20.1 20.2 19.5 22.3 22.5 21.9
GCC 43.0 20.3 19.8 19.5 21.7 21.9 21.2
Example design matrix:
sample group other
s1_1 1 batch1
s1_2 1 batch1
s1_3 1 batch2
s2_1 2 batch2
s2_2 2 batch2
s2_3 2 batch1
Note how the entries in the sample column all are present in the data matrix.
You can either use your own data, or download a template dataset:
In order to fully run this exercise, please make sure that you have the tidyverse package installed. This package includes ggplot2
and dplyr
which both are used in this exercise.
# Load the tidyverse package
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.2
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 1.0.0 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
If using your own data you need to calculate basic statistics before proceeding with the exercises. For this you could use NormalyzerDE. For small datasets you can run it from: quantitativeproteomics.org/NormalyzerDE
. If your dataset is larger than 5 MB you could either subset it using the head
command, run NormalyzerDE locally or ask contact Jakob for help processing it locally.
The file paths here assume that you have a folder data
within your working directory with a data matrix named data_matrix.tsv
and a design matrix named design_matrix.tsv
. Please adjust the path and filename to the data you are using.
data_fp <- "data/data_matrix.tsv"
design_fp <- "data/design_matrix.tsv"
# stringAsFactors - If not used, many columns are read as factors, which can be misinterpreted as integers
data_df <- read.table(data_fp, sep="\t", header = TRUE, stringsAsFactors = FALSE)
design_df <- read.table(design_fp, sep="\t", header = TRUE, stringsAsFactors = FALSE)
# For reference: The above code can be replaced with the following from the readr package
# full_data_df <- read_tsv(data_fp)
# design_df <- read_tsv(design_fp)
str
: Produces a brief summary of the object that is given to ithead
: Retrieve the first n
rows from a data.frame or matrix, where n
here is specified to 10colnames
: Show the column names of the data frame (or matrix)Exercise 1: Inspect your data and make sure you understand what is present in the design- and the data matrix. Verify that the columns are in the format you expect them to be.
str(design_df)
## 'data.frame': 18 obs. of 4 variables:
## $ sample : chr "dil9_5" "dil9_4" "dil8_1" "dil8_2" ...
## $ group : chr "g9" "g9" "g8" "g8" ...
## $ time : chr "late" "late" "late" "late" ...
## $ dilution: chr "high" "high" "high" "high" ...
head(design_df, 10)
## sample group time dilution
## 1 dil9_5 g9 late high
## 2 dil9_4 g9 late high
## 3 dil8_1 g8 late high
## 4 dil8_2 g8 late high
## 5 dil8_3 g8 late high
## 6 dil9_3 g9 late high
## 7 dil8_5 g8 late low
## 8 dil8_4 g8 late low
## 9 dil9_2 g9 late low
## 10 dil9_1 g9 late low
colnames(design_df)
## [1] "sample" "group" "time" "dilution"
dim(design_df)
## [1] 18 4
str(data_df)
## 'data.frame': 4008 obs. of 28 variables:
## $ Average.RT : num 61.3 99 99 42.7 52.4 ...
## $ Cluster.ID : chr "1418128207109" "1418128190676" "1418128188736" "1418128214098" ...
## $ Peptide.Sequence: chr "AAAAAAGDVFKDVQEAK" "AAADYLEVPLYTYLGGFNTK" "AAADYLEVPLYTYLGGFNTK" "AAAEALAQEQAAK" ...
## $ External.IDs : chr "Q99Z57" "P69949" "P69949" "Q9A1Z8" ...
## $ Charge : int 3 2 3 2 2 2 3 2 2 2 ...
## $ Average.m.z : num 555 1104 736 636 603 ...
## $ P.Value : num 0.677 0.377 0.687 0.884 0.983 ...
## $ adj.P.Val : num 1 1 1 1 1 ...
## $ log2Fold : num 0.52036 -0.46058 0.28254 -0.08969 -0.00894 ...
## $ AvgExpr : num 17.6 17.4 14.3 17.9 18.9 ...
## $ dil9_5 : num NA NA NA 18.7 19.1 ...
## $ dil9_4 : num 18 NA NA 18.2 18.7 ...
## $ dil8_1 : num NA NA NA 17.9 19 ...
## $ dil8_2 : num NA NA NA 18.5 19 ...
## $ dil8_3 : num 18.7 NA NA 18.8 19.4 ...
## $ dil9_3 : num NA NA NA 18.9 19.4 ...
## $ dil8_5 : num 16.6 NA NA 16 17.3 ...
## $ dil8_4 : num 17 NA NA 15.7 17.6 ...
## $ dil9_2 : num NA NA NA 15 17.6 ...
## $ dil9_1 : num NA NA NA 15.6 17.5 ...
## $ dil8_6 : num NA 18.1 NA 19 20.1 ...
## $ dil9_6 : num NA 17.7 14.8 18.7 19.4 ...
## $ dil8_7 : num NA 17.7 14.4 18.7 19.6 ...
## $ dil9_7 : num NA 16.6 13.7 18.1 19.1 ...
## $ dil8_8 : num NA 17.1 13.3 18.2 19.2 ...
## $ dil9_8 : num NA 17.7 14.8 18.8 19.8 ...
## $ dil8_9 : num NA 17.6 14.8 18.4 19.4 ...
## $ dil9_9 : num NA 16.6 NA 18.5 19.7 ...
head(data_df, 10)
## Average.RT Cluster.ID Peptide.Sequence
## 1 61.32530 1418128207109 AAAAAAGDVFKDVQEAK
## 2 99.03974 1418128190676 AAADYLEVPLYTYLGGFNTK
## 3 98.97744 1418128188736 AAADYLEVPLYTYLGGFNTK
## 4 42.66328 1418128214098 AAAEALAQEQAAK
## 5 52.43872 1418128213643 AAAEGNFAGIER
## 6 69.56549 1418128215927,1418128232269 AAAELALISGQKPLITK
## 7 69.72379 1418128213201 AAAELALISGQKPLITK
## 8 79.61463 1418128179641 AAAGIISPWFSK
## 9 55.89205 1418128212810 AAASIVIVDSR
## 10 89.38617 1418128190315 AAATPEAVIAVFDEASTAK
## External.IDs Charge Average.m.z P.Value adj.P.Val log2Fold
## 1 Q99Z57 3 554.6214 0.6773926 0.9998144 0.520363420
## 2 P69949 2 1103.5547 0.3770416 0.9998144 -0.460577131
## 3 P69949 3 736.0416 0.6867821 0.9998144 0.282537091
## 4 Q9A1Z8 2 636.3341 0.8840833 0.9998144 -0.089692270
## 5 Q99YE5 2 603.2997 0.9833695 0.9998144 -0.008937177
## 6 Q9A1W2 2 862.5214 0.8341496 0.9998144 0.122805456
## 7 Q9A1W2 3 575.3502 0.9145702 0.9998144 0.077139498
## 8 Q9A048,Q48Z67 2 624.3418 0.3760239 0.9998144 -0.645683321
## 9 P68900 2 551.3176 0.9438940 0.9998144 -0.040146075
## 10 Q9A0B0 2 931.4782 0.8271795 0.9998144 -0.137790319
## AvgExpr dil9_5 dil9_4 dil8_1 dil8_2 dil8_3 dil9_3 dil8_5
## 1 17.58087 NA 17.97114 NA NA 18.73059 NA 16.63856
## 2 17.40734 NA NA NA NA NA NA NA
## 3 14.30422 NA NA NA NA NA NA NA
## 4 17.87283 18.70998 18.17519 17.94717 18.48465 18.77886 18.86186 16.04935
## 5 18.92062 19.12293 18.67264 18.99337 18.97384 19.36756 19.41670 17.25360
## 6 21.16187 20.41252 19.89131 20.08161 19.82381 21.27203 21.00412 19.43886
## 7 22.46114 22.26701 21.49473 21.64091 21.80175 21.95921 21.97757 20.19211
## 8 16.00691 16.30645 NA 15.07881 15.58317 NA NA NA
## 9 21.54734 21.61346 21.20912 21.60655 21.49148 21.20536 21.45241 19.61898
## 10 15.85908 NA NA NA NA NA NA NA
## dil8_4 dil9_2 dil9_1 dil8_6 dil9_6 dil8_7 dil9_7 dil8_8
## 1 16.98317 NA NA NA NA NA NA NA
## 2 NA NA NA 18.09947 17.74470 17.73883 16.64132 17.10829
## 3 NA NA NA NA 14.84994 14.36933 13.65260 13.32145
## 4 15.71238 15.00452 15.58053 19.03427 18.69501 18.67079 18.12471 18.18855
## 5 17.59831 17.55021 17.46875 20.06490 19.43200 19.56391 19.11942 19.16025
## 6 19.46593 20.69190 19.90472 22.96216 21.87029 21.91014 21.93753 22.37153
## 7 20.35781 20.38597 20.42075 24.16540 23.69706 23.78264 23.41764 23.56234
## 8 NA 14.73884 NA 16.87624 NA 16.42151 NA 16.16485
## 9 19.63612 19.59374 19.68795 23.09042 22.62403 22.62896 22.04219 22.24242
## 10 NA NA NA 15.48076 14.87317 15.69007 14.88943 15.88645
## dil9_8 dil8_9 dil9_9
## 1 NA NA NA
## 2 17.74684 17.60395 16.57536
## 3 14.83394 14.79809 NA
## 4 18.76573 18.39302 18.53430
## 5 19.76935 19.35004 19.69335
## 6 23.05443 22.57813 22.24263
## 7 24.67505 24.34093 24.16159
## 8 NA 16.88539 NA
## 9 22.83190 22.58638 22.69056
## 10 16.80297 16.65464 16.59519
colnames(data_df)
## [1] "Average.RT" "Cluster.ID" "Peptide.Sequence"
## [4] "External.IDs" "Charge" "Average.m.z"
## [7] "P.Value" "adj.P.Val" "log2Fold"
## [10] "AvgExpr" "dil9_5" "dil9_4"
## [13] "dil8_1" "dil8_2" "dil8_3"
## [16] "dil9_3" "dil8_5" "dil8_4"
## [19] "dil9_2" "dil9_1" "dil8_6"
## [22] "dil9_6" "dil8_7" "dil9_7"
## [25] "dil8_8" "dil9_8" "dil8_9"
## [28] "dil9_9"
dim(data_df)
## [1] 4008 28
This code adds new columns to the data matrix. These will contain TRUE or FALSE for each row depending on the p- or FDR-value was lower than a given threshold.
Exercise: Add another column where you specified FDR values (adj.P.Val) lower than 0.05 and one with lower than 0.1. How many significant entries do you get in each case?
data_df$IsSig <- data_df$P.Value < 0.05
data_df$IsFDRSig <- data_df$adj.P.Val < 0.2
# Inspect the results
head(data_df)
## Average.RT Cluster.ID Peptide.Sequence External.IDs
## 1 61.32530 1418128207109 AAAAAAGDVFKDVQEAK Q99Z57
## 2 99.03974 1418128190676 AAADYLEVPLYTYLGGFNTK P69949
## 3 98.97744 1418128188736 AAADYLEVPLYTYLGGFNTK P69949
## 4 42.66328 1418128214098 AAAEALAQEQAAK Q9A1Z8
## 5 52.43872 1418128213643 AAAEGNFAGIER Q99YE5
## 6 69.56549 1418128215927,1418128232269 AAAELALISGQKPLITK Q9A1W2
## Charge Average.m.z P.Value adj.P.Val log2Fold AvgExpr dil9_5
## 1 3 554.6214 0.6773926 0.9998144 0.520363420 17.58087 NA
## 2 2 1103.5547 0.3770416 0.9998144 -0.460577131 17.40734 NA
## 3 3 736.0416 0.6867821 0.9998144 0.282537091 14.30422 NA
## 4 2 636.3341 0.8840833 0.9998144 -0.089692270 17.87283 18.70998
## 5 2 603.2997 0.9833695 0.9998144 -0.008937177 18.92062 19.12293
## 6 2 862.5214 0.8341496 0.9998144 0.122805456 21.16187 20.41252
## dil9_4 dil8_1 dil8_2 dil8_3 dil9_3 dil8_5 dil8_4 dil9_2
## 1 17.97114 NA NA 18.73059 NA 16.63856 16.98317 NA
## 2 NA NA NA NA NA NA NA NA
## 3 NA NA NA NA NA NA NA NA
## 4 18.17519 17.94717 18.48465 18.77886 18.86186 16.04935 15.71238 15.00452
## 5 18.67264 18.99337 18.97384 19.36756 19.41670 17.25360 17.59831 17.55021
## 6 19.89131 20.08161 19.82381 21.27203 21.00412 19.43886 19.46593 20.69190
## dil9_1 dil8_6 dil9_6 dil8_7 dil9_7 dil8_8 dil9_8 dil8_9
## 1 NA NA NA NA NA NA NA NA
## 2 NA 18.09947 17.74470 17.73883 16.64132 17.10829 17.74684 17.60395
## 3 NA NA 14.84994 14.36933 13.65260 13.32145 14.83394 14.79809
## 4 15.58053 19.03427 18.69501 18.67079 18.12471 18.18855 18.76573 18.39302
## 5 17.46875 20.06490 19.43200 19.56391 19.11942 19.16025 19.76935 19.35004
## 6 19.90472 22.96216 21.87029 21.91014 21.93753 22.37153 23.05443 22.57813
## dil9_9 IsSig IsFDRSig
## 1 NA FALSE FALSE
## 2 16.57536 FALSE FALSE
## 3 NA FALSE FALSE
## 4 18.53430 FALSE FALSE
## 5 19.69335 FALSE FALSE
## 6 22.24263 FALSE FALSE
# The table command summarizes what type of entries is present in a vector, and in which number
table(data_df$IsSig)
##
## FALSE TRUE
## 3879 86
table(data_df$IsFDRSig)
##
## FALSE TRUE
## 3948 17
I personally dislike the gray background you get in the default ggplot-plots. There are a number of themes that can be assigned for different styles. Here, I assign the style classic
as the global style.
Check here for a selection of themes: https://ggplot2.tidyverse.org/reference/ggtheme.html
theme_set(theme_classic())
For further discussion on its interpretation, see the following link: http://varianceexplained.org/statistics/interpreting-pvalue-histogram/
Special settings:
When running this you get a warning “Removed 43 rows containing non-finite values (stat_bin)”. This is due to missing values in the dataset, and is not a problem.
ggplot(data_df, aes(x=P.Value)) + geom_histogram(bins=100)
## Warning: Removed 43 rows containing non-finite values (stat_bin).
This can easily be adjusted to other characteristics, such as expression or fold change:
ggplot(data_df, aes(x=AvgExpr)) + geom_histogram(bins=100)
## Warning: Removed 43 rows containing non-finite values (stat_bin).
Exercise: Further explore the data_df
matrix using the histogram. Use colnames
to inspect what columns there are in the matrix. Try making histograms for different columns, and try to adjust the number of bins in the histogram using the bins
option.
If you want to color those entries that pass your adjusted p-value threshold in a histogram, you can use the fill
aesthetics.
Here, we want to highlight entries passing our assigned FDR threshold, which is specified in the column IsFDRSig
that we created above.
The #AAAAAA
is hexcode and a way to specify colors. The pairs of letters specify the amount of red, green and blue respectively.
gray_blue_colors <- c("#AAAAAA", "#0C81A2")
ggplot(data_df, aes(x=P.Value, fill=IsFDRSig)) + geom_histogram(bins=100) + scale_fill_manual(values=gray_blue_colors)
## Warning: Removed 43 rows containing non-finite values (stat_bin).
Exercise: Try changing the selected colors (you can find choices at the following link: https://www.w3schools.com/colors/colors_picker.asp).
Exercise: Try coloring based on other criteria than IsFDRSig (for instance IsSig)
Challenge exercise: Color based on if a feature is upregulated or not (i.e. if its fold-change is higher or lower than 0).
The volcano plot gives a quick overview of the fold-changes and p-values within your dataset. The x-axis shows the log2-fold and the y-axis is rescaled p-values where the top ones are those with the lowest p-values.
ggplot(data_df, aes(x=log2Fold, y=-log10(P.Value))) + geom_point()
## Warning: Removed 43 rows containing missing values (geom_point).
You can color the significant hits here too, similarly to in the p-value histogram.
ggplot(data_df, aes(x=log2Fold, y=-log10(P.Value), color=IsFDRSig)) + geom_point()
## Warning: Removed 43 rows containing missing values (geom_point).
We can improve the visuals further by using the options:
ggplot(data_df, aes(x=log2Fold, y=-log10(P.Value), color=IsFDRSig)) +
geom_point(alpha=0.6, na.rm=TRUE) +
ggtitle("Fancy volcano plot") +
xlab("Fold regulation (log2)") +
ylab("Significance (-log10)") +
scale_color_manual(values=gray_blue_colors)
Exercise: Try changing the x/y labels and the title
Exercise: Try adjusting the coloring aesthetics. Use a different column for coloring. You could also switch the coloring scheme.
Thinking point: When can we have a large fold change but still have a large p-value?
Challenge exercise: Add labels to the significant features. Instructions on how to achieve this can be found here: https://www.gettinggeneticsdone.com/2016/01/repel-overlapping-text-labels-in-ggplot2.html
The MA plot is very similar to the vulcano plot, with the only difference what we are specifying as x-axis and y-axis.
The command arrange(data_df, desc(P.Value))
is a Tidyverse command for sorting a data frame. Here, they are sorting with the features with the lowest p-values at the bottom to make sure they are drawn on top.
ggplot(arrange(data_df, desc(P.Value)), aes(x=AvgExpr, y=log2Fold, color=IsFDRSig)) +
geom_point(alpha=0.6, na.rm=TRUE) +
ggtitle("MA") +
xlab("Expression level") +
ylab("Fold change (log2)") +
scale_color_manual(values=gray_blue_colors)
Exercise: Take a look at the output from arrange(data_df, desc(P.Value))
. Do you see what is going on if you compare it to the output from data_df
?
We often want to study intensity levels of selected features. We first need to prepare the data in the right format and then we can perform various visualizations.
ggplot
expects the data to be in the so called ‘tidy’ format where one column contains the values, and the other columns contain information linked to that value. This is further discussed below in the “Preparing data in long format” section.
First, we sort the data frame on the P.Value
column and take a look at those with the highest and lowest p-value.
head(arrange(data_df, P.Value))
## Average.RT Cluster.ID Peptide.Sequence External.IDs
## 1 60.26076 1418128185272,1418128223206 VCFQYLNR hum272
## 2 60.60463 1418128210770 IALDFQR hum210
## 3 39.67873 1418128213014 ENAGEDPGLAR hum240
## 4 48.64133 1418128214834 FLGEVHCGTNVR hum269
## 5 62.81780 1418128211443 AGLQFPVGR hum284
## 6 49.44278 1418128215034 HSDAVFTDNYTR hum163
## Charge Average.m.z P.Value adj.P.Val log2Fold AvgExpr dil9_5
## 1 2 550.2725 0.0001043932 0.1214715 2.977166 19.64220 NA
## 2 2 431.7433 0.0001483525 0.1214715 1.961896 18.80728 20.18851
## 3 2 564.7685 0.0001660904 0.1214715 2.026260 18.51726 18.94022
## 4 2 694.8409 0.0002228151 0.1214715 1.632291 16.72149 17.47698
## 5 2 472.7698 0.0002243532 0.1214715 1.951682 19.69827 20.88876
## 6 2 713.3251 0.0002383996 0.1214715 2.395875 16.68474 17.36265
## dil9_4 dil8_1 dil8_2 dil8_3 dil9_3 dil8_5 dil8_4 dil9_2
## 1 NA NA NA 18.41610 21.88865 16.26718 NA NA
## 2 19.66857 17.99485 18.37609 18.50994 20.37002 16.22937 16.69950 18.29133
## 3 18.73956 17.02382 17.06114 NA NA NA NA NA
## 4 17.06073 15.00175 15.69515 16.46954 18.32329 NA NA 16.50913
## 5 20.39044 18.65152 18.90877 19.39108 21.16600 17.27939 17.45454 19.17637
## 6 16.68158 15.11930 15.14777 16.25041 17.65572 14.30295 14.90223 16.31986
## dil9_1 dil8_6 dil9_6 dil8_7 dil9_7 dil8_8 dil9_8 dil8_9
## 1 20.46198 NA 20.66011 18.61546 20.27065 18.10826 21.73932 18.68445
## 2 18.43800 18.65345 20.14669 18.16291 19.68011 17.84140 21.10707 17.96952
## 3 NA NA 19.83402 17.97490 19.57531 17.16462 19.91811 17.83567
## 4 NA NA 17.55349 16.10982 17.09761 15.68256 17.77533 15.77371
## 5 19.23063 19.69910 21.03337 19.15981 20.56503 18.79018 22.10187 19.16744
## 6 16.27362 NA 18.25063 NA 18.03754 NA 19.03499 NA
## dil9_9 IsSig IsFDRSig
## 1 20.95204 TRUE TRUE
## 2 20.20381 TRUE TRUE
## 3 19.62251 TRUE TRUE
## 4 17.57183 TRUE TRUE
## 5 21.51451 TRUE TRUE
## 6 18.24705 TRUE TRUE
tail(arrange(data_df, P.Value))
## Average.RT Cluster.ID Peptide.Sequence External.IDs Charge
## 4003 39.03130 1418128228099 VENSWGEK Q99YL0 2
## 4004 42.78815 1418128228616 VIDTSIR Q9A0J0 2
## 4005 47.62865 1418128210565 VINDFTK P68900 2
## 4006 71.77410 1418128227443 VVDAFSTNFHLPK Q99Z49 3
## 4007 39.75320 1418128216450 YADATSLK Q9A1H0,Q490W3 2
## 4008 63.88005 1418128223337 YLYTISPNIK Q99ZR2 2
## Average.m.z P.Value adj.P.Val log2Fold AvgExpr dil9_5 dil9_4
## 4003 474.7267 NA NA NA NA NA NA
## 4004 402.2355 NA NA NA NA NA NA
## 4005 418.7291 NA NA NA NA 24.06851 23.57863
## 4006 492.2607 NA NA NA NA NA NA
## 4007 434.7244 NA NA NA NA NA NA
## 4008 606.3358 NA NA NA NA NA NA
## dil8_1 dil8_2 dil8_3 dil9_3 dil8_5 dil8_4 dil9_2 dil9_1 dil8_6
## 4003 NA NA NA NA NA NA NA NA NA
## 4004 NA NA NA NA NA NA NA NA NA
## 4005 NA NA NA NA NA NA NA NA NA
## 4006 NA NA NA NA NA NA NA NA NA
## 4007 NA NA NA NA NA NA NA NA NA
## 4008 NA NA 18.62476 NA 15.37982 NA NA NA NA
## dil9_6 dil8_7 dil9_7 dil8_8 dil9_8 dil8_9 dil9_9 IsSig
## 4003 18.70282 NA NA NA 17.95279 NA 18.63583 NA
## 4004 NA NA NA 15.42199 NA 16.23517 NA NA
## 4005 NA NA NA NA NA NA NA NA
## 4006 18.43230 NA 18.34095 NA NA NA NA NA
## 4007 NA 17.47744 NA 16.71426 NA NA NA NA
## 4008 NA NA NA NA NA NA NA NA
## IsFDRSig
## 4003 NA
## 4004 NA
## 4005 NA
## 4006 NA
## 4007 NA
## 4008 NA
Next, we extract the top feature and put it in its own data frame. There are a few things going on here:
data_df$External.IDs
gives us the column only with the external IDsdata_df$External.IDs == "hum272"
will give us a long vector with TRUE/FALSE values (here only true for the “hum272” entry)data_df[data_df$External.IDs == "hum272", ]
here we select all rows that are true in the previous vector (in this case - only one)best_hit[, design_df$sample]
here best_hit
is a single-row data frame. We subset it with the sample names from the design matrix so that we only have the values left.best_hit <- data_df[data_df$External.IDs == "hum272", ]
best_hit_vals <- best_hit[, design_df$sample]
print(best_hit)
## Average.RT Cluster.ID Peptide.Sequence External.IDs
## 3495 60.26076 1418128185272,1418128223206 VCFQYLNR hum272
## Charge Average.m.z P.Value adj.P.Val log2Fold AvgExpr dil9_5
## 3495 2 550.2725 0.0001043932 0.1214715 2.977166 19.6422 NA
## dil9_4 dil8_1 dil8_2 dil8_3 dil9_3 dil8_5 dil8_4 dil9_2 dil9_1
## 3495 NA NA NA 18.4161 21.88865 16.26718 NA NA 20.46198
## dil8_6 dil9_6 dil8_7 dil9_7 dil8_8 dil9_8 dil8_9 dil9_9
## 3495 NA 20.66011 18.61546 20.27065 18.10826 21.73932 18.68445 20.95204
## IsSig IsFDRSig
## 3495 TRUE TRUE
print(best_hit_vals)
## dil9_5 dil9_4 dil8_1 dil8_2 dil8_3 dil9_3 dil8_5 dil8_4 dil9_2
## 3495 NA NA NA NA 18.4161 21.88865 16.26718 NA NA
## dil9_1 dil8_6 dil9_6 dil8_7 dil9_7 dil8_8 dil9_8 dil8_9
## 3495 20.46198 NA 20.66011 18.61546 20.27065 18.10826 21.73932 18.68445
## dil9_9
## 3495 20.95204
Exercise: Take a closer look at the code. Make sure you understand how the data is extracted from the dataframe.
Exercise: Do the same again but for the feature with the lowest p-value.
The final step is to include these values in a new data frame. We use the cbind
command to bind this value together with the design matrix. Now we have a data frame containing both the sample annotation information from the data frame and expression values for our feature of interest.
unlist
- best_hit_vals
is a single-row data frame. To bind it we need it as a vector, which unlist
does for us.cbind
- “column bind”, let’s us bind one or many columns together (must have the same number of rows). The corresponding command for row-wise binding is rbind
.best_hit_df <- cbind(value=unlist(best_hit_vals), design_df)
head(best_hit_df)
## value sample group time dilution
## dil9_5 NA dil9_5 g9 late high
## dil9_4 NA dil9_4 g9 late high
## dil8_1 NA dil8_1 g8 late high
## dil8_2 NA dil8_2 g8 late high
## dil8_3 18.41610 dil8_3 g8 late high
## dil9_3 21.88865 dil9_3 g9 late high
Exercise: Bind the high-pvalue feature that you prepared before to the same data frame.
Now we are ready to study this feature. At the same time, it is a good time to try out some different geometics, and try using multiple geometics at once.
ggplot(best_hit_df, aes(x=as.factor(group), y=value)) + geom_point()
## Warning: Removed 7 rows containing missing values (geom_point).
ggplot(best_hit_df, aes(x=as.factor(group), y=value)) + geom_boxplot() + geom_point()
## Warning: Removed 7 rows containing non-finite values (stat_boxplot).
## Warning: Removed 7 rows containing missing values (geom_point).
ggplot(best_hit_df, aes(x=as.factor(group), y=value)) + geom_boxplot() + geom_point(aes(color=time), position=position_jitter(0.1), size=3)
## Warning: Removed 7 rows containing non-finite values (stat_boxplot).
## Warning: Removed 7 rows containing missing values (geom_point).
Exercise: Do the same illustrations using the high p-value feature. Can you relate the figure to the p-values you have obtained?
Exercise: Try changing the coloring from “time” to “group” in the last plot. Does the results make sense?
Exercise: Update the xlabel, ylabel and title to be more illustrative
Exercise: Experiment with position_jitter
setting. What happens if you change the value from 0.1
to 0.3
? To 0
? What happens if you change position_jitter
to position_jitter_dodge
? If you remove the entire argument?
Exercise: Try assigning aes(color=time)
inside the geom_boxplot
call (in the last plot).
ggplot
generally expects to get the data in the so called “long format”. This is different from the “wide format” we commonly used for expression matrices. Example of wide format, with one gene per row, and one sample per column:
gene s1 s2 s3
gene_A 4 5 3
gene_B 3 2 3
The same matrix in long format:
gene sample value
gene_A s1 4
gene_A s2 5
gene_A s3 3
gene_B s1 3
gene_B s2 2
gene_B s3 3
Here, all values are in one column, and we have a new column specifying to which sample they belong. Also notice that we have multiple rows for each gene (seen in the gene
column). The long format is less pleasing for the human eye to look at, but makes it easier for ggplot
as you only need to keep track of two columns to keep track of each sample and value.
This conversion can be made using the tidyverse-command pivot_longer
. Here, you specify:
Non-sample columns will follow along similarly to the gene
column in the small example above.
long_data <- pivot_longer(data_df, as.character(design_df$sample), names_to="sample", values_to="value")
head(long_data)
## # A tibble: 6 x 14
## Average.RT Cluster.ID Peptide.Sequence External.IDs Charge Average.m.z
## <dbl> <chr> <chr> <chr> <int> <dbl>
## 1 61.3 141812820… AAAAAAGDVFKDVQE… Q99Z57 3 555.
## 2 61.3 141812820… AAAAAAGDVFKDVQE… Q99Z57 3 555.
## 3 61.3 141812820… AAAAAAGDVFKDVQE… Q99Z57 3 555.
## 4 61.3 141812820… AAAAAAGDVFKDVQE… Q99Z57 3 555.
## 5 61.3 141812820… AAAAAAGDVFKDVQE… Q99Z57 3 555.
## 6 61.3 141812820… AAAAAAGDVFKDVQE… Q99Z57 3 555.
## # … with 8 more variables: P.Value <dbl>, adj.P.Val <dbl>, log2Fold <dbl>,
## # AvgExpr <dbl>, IsSig <lgl>, IsFDRSig <lgl>, sample <chr>, value <dbl>
Exercise: Inspect the long_data
table using the commands dim
, str
and head
. Compare it to the wide format data_df
data frame.
Having the data in long format we can now easily do sample-wide visualizations.
For instance, we can do density plots to see how the values in each sample is distributed.
ggplot(long_data, aes(x=value, color=sample)) + geom_density()
## Warning: Removed 20006 rows containing non-finite values (stat_density).
It looks like we have a four outliers here! (seen shifted to the left). These will likely cause problems in our downstream analysis if not handled with care.
Often we want to color on different conditions to look for patterns in the data. Then we need to merge in out design matrix into this long data.
This can be achieved using a left_join
from the dplyr
package. Here, we specify the two matrices to merge, and which column we use for the merge.
annot_long_data <- left_join(long_data, design_df, by="sample")
head(annot_long_data)
## # A tibble: 6 x 17
## Average.RT Cluster.ID Peptide.Sequence External.IDs Charge Average.m.z
## <dbl> <chr> <chr> <chr> <int> <dbl>
## 1 61.3 141812820… AAAAAAGDVFKDVQE… Q99Z57 3 555.
## 2 61.3 141812820… AAAAAAGDVFKDVQE… Q99Z57 3 555.
## 3 61.3 141812820… AAAAAAGDVFKDVQE… Q99Z57 3 555.
## 4 61.3 141812820… AAAAAAGDVFKDVQE… Q99Z57 3 555.
## 5 61.3 141812820… AAAAAAGDVFKDVQE… Q99Z57 3 555.
## 6 61.3 141812820… AAAAAAGDVFKDVQE… Q99Z57 3 555.
## # … with 11 more variables: P.Value <dbl>, adj.P.Val <dbl>,
## # log2Fold <dbl>, AvgExpr <dbl>, IsSig <lgl>, IsFDRSig <lgl>,
## # sample <chr>, value <dbl>, group <chr>, time <chr>, dilution <chr>
Exercise:: Inspect the annot_long_data
data frame and compare it to the long_data
matrix. Do you see how the design matrix got added? How many columns were added? How many columns are there in the design dataframe?
Now we can use different characteristics found in the design matrix to study the sample densities. Note that we need to use the group
aesthetics to separate the samples when using other colorings.
ggplot(annot_long_data, aes(x=value, color=time)) + geom_density()
## Warning: Removed 20006 rows containing non-finite values (stat_density).
ggplot(annot_long_data, aes(x=value, color=time, group=sample)) + geom_density()
## Warning: Removed 20006 rows containing non-finite values (stat_density).
Exercise: There are a lot of technical influence going on in this dataset. Can you spot the two batch effects by visually inspecting the figure?
Exercise: Try coloring the figure on other sample characteristics found in the data frame. Can you spot which matches the pattern? Do you see why these samples are shifted this way?
Now we can easily do boxplots and other illustrations of our samples.
Here, we use the theme
option to adjust the x-axis labels (axis.text.x
) and to rotate them 90 degrees. We also adjust them to be nicely aligned with the ticks. The theme
option is a powerful aspect of ggplot which lets you target in principle any aspects for the figures and change fonts, colorings, text orientations and much more.
ggplot(annot_long_data, aes(x=sample, y=value, color=time)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle=90, vjust=0.5)) +
ggtitle("Sample intensity levels")
## Warning: Removed 20006 rows containing non-finite values (stat_boxplot).
Exercise: See if you can make a violin plot for the densities (tips: the geom for the violin plot is geom_violin
).
Exercise: Try adjusting the vjust
value. Do you see what is happening?
Challenge exercise: See if you can rotate the y-axis labels.
You often want to combine multiple plots
library(ggpubr)
## Loading required package: magrittr
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
dens1 <- ggplot(annot_long_data, aes(x=value, color=time)) + geom_density()
dens2 <- ggplot(annot_long_data, aes(x=value, color=time, group=sample)) + geom_density()
ggarrange(dens1, dens2)
## Warning: Removed 20006 rows containing non-finite values (stat_density).
## Warning: Removed 20006 rows containing non-finite values (stat_density).
grid_plt <- ggarrange(dens1, dens2, common.legend = TRUE, legend = "right")
## Warning: Removed 20006 rows containing non-finite values (stat_density).
## Warning: Removed 20006 rows containing non-finite values (stat_density).
## Warning: Removed 20006 rows containing non-finite values (stat_density).
annotated_plt <- annotate_figure(grid_plt, top="Grid title")
annotated_plt
ggsave(annotated_plt, filename = "output_path.png")
## Saving 7 x 5 in image
You can specify the measures, and the resolution.
ggsave(annotated_plt, filename = "output_path_highres.png", width=10, height=10, dpi = 300)
Source: https://learnr.wordpress.com/2010/01/26/ggplot2-quick-heatmap-plotting/ Source for base R: https://www.r-graph-gallery.com/heatmap
Heatmaps can be done in ggplot using geom_tile()
, but in this particular case I would recommend using the base R version due to its clustering of the entries.
library(tidyverse)
heatmap(data_df %>% head(100) %>% dplyr::select(design_df$sample) %>% as.matrix() %>% na.omit())
colnames(data_df) <- make.names(colnames(data_df))
out <- data_df %>%
arrange(P.Value) %>%
head(10) %>%
pivot_longer(as.character(design_df$sample), names_to="sample", values_to="value") %>%
filter(!is.na(value))
head(out)
## # A tibble: 6 x 14
## Average.RT Cluster.ID Peptide.Sequence External.IDs Charge Average.m.z
## <dbl> <chr> <chr> <chr> <int> <dbl>
## 1 60.3 141812818… VCFQYLNR hum272 2 550.
## 2 60.3 141812818… VCFQYLNR hum272 2 550.
## 3 60.3 141812818… VCFQYLNR hum272 2 550.
## 4 60.3 141812818… VCFQYLNR hum272 2 550.
## 5 60.3 141812818… VCFQYLNR hum272 2 550.
## 6 60.3 141812818… VCFQYLNR hum272 2 550.
## # … with 8 more variables: P.Value <dbl>, adj.P.Val <dbl>, log2Fold <dbl>,
## # AvgExpr <dbl>, IsSig <lgl>, IsFDRSig <lgl>, sample <chr>, value <dbl>
ggplot(out, aes(sample, External.IDs)) +
geom_tile(aes(fill=value)) +
theme(axis.text.x = element_text(angle=90, vjust=0.5))
There are many ways to do principal component analysis visualizations in R. The one I find easiest to use is the PCAtools
package.
Here is a nice vignette for more extensive examples: https://bioconductor.org/packages/release/bioc/vignettes/PCAtools/inst/doc/PCAtools.html
library(PCAtools)
## Loading required package: ggrepel
## Loading required package: reshape2
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
## Loading required package: lattice
## Loading required package: cowplot
##
## ********************************************************
## Note: As of version 1.0.0, cowplot does not change the
## default ggplot2 theme anymore. To recover the previous
## behavior, execute:
## theme_set(theme_cowplot())
## ********************************************************
##
## Attaching package: 'cowplot'
## The following object is masked from 'package:ggpubr':
##
## get_legend
##
## Attaching package: 'PCAtools'
## The following objects are masked from 'package:stats':
##
## biplot, screeplot
# Biplots for different components and colorings
# Pairplot
data_only <- data_df[, design_df$sample]
data_only_no_missing <- data_only[complete.cases(data_only), ]
p <- pca(data_only_no_missing, removeVar=0.1, scale=TRUE, metadata=as.matrix(design_df))
## -- removing the lower 10% of variables based on variance
biplot(p, colby = "group")
biplot(p, colby = "time")
biplot(p, colby = "time", shape="group", legendPosition = "right")
screeplot(p)
pairsplot(p, colby = "group")
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
pairsplot(p, colby = "time")
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.