As explained in the slides, you can create a variable and assign a value to it using the <- or = symbols. You can see the examples below:
# Numeric variables
x <- 8
y <- 2
# String variables
str_var1 <- "I"
str_var2 <- "love"
str_var3 <- "Batman"
As we covered in the last session, you can perform arithmetic operations on the numeric variables.
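As a quick sketch, here are the usual arithmetic operators applied to the x and y defined above (the values are repeated so the snippet runs on its own):

```r
x <- 8
y <- 2

x + y    # addition
x - y    # subtraction
x * y    # multiplication
x / y    # division
x ^ y    # exponentiation
x %% y   # remainder after division (modulo)
```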
You can create a vector directly by using the c() function. Here c stands for combine (it is often read as concatenate).
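A minimal sketch of c() in action. The numbers in n are made up for illustration, and I am assuming that str_vec (used further down) simply combines the three string variables from above:

```r
# Hypothetical numeric values, chosen only for illustration
n <- c(2, 4, 8, 16, 32)

# Assumption: str_vec combines the three string variables defined earlier
str_var1 <- "I"
str_var2 <- "love"
str_var3 <- "Batman"
str_vec <- c(str_var1, str_var2, str_var3)

length(n)   # number of elements in the vector
str_vec     # print the character vector
```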
With the vector n, you can try some of the basic statistics functions like log(), mean(), max() and so on, as explained in the slides, to get the hang of it!
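For example, assuming n holds a handful of made-up numbers (the values below are not from the slides), these functions could be tried like this:

```r
n <- c(2, 4, 8, 16, 32)   # hypothetical values for illustration

log(n)    # natural logarithm of every element
log2(n)   # base-2 logarithm of every element
mean(n)   # arithmetic mean
max(n)    # largest value
min(n)    # smallest value
sum(n)    # sum of all elements
```

Notice that log() returns one value per element, while mean(), max(), min() and sum() collapse the whole vector into a single number.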
Here I will introduce some built-in functions to manipulate strings! Some of these functions can be very handy when you work with tables containing different strings and you want to search or edit them!
Remember to use ? on the different functions (for example ?substr) to understand them better.
# To make sub-strings out of a larger string!
substr(str_vec, start=1, stop=2)
substr(str_vec, start=1, stop=4)
# To search in a string
grep("a", str_vec)
grep("i", str_vec)
grep("i", str_vec, ignore.case = TRUE)
#just for the fun of it
str_var4 <- "but not Robin"
str_vec2 <- c(str_vec, str_var4) #note that we add an extra element to the already existing vector.
grep("o", str_vec2)
#substitution
gsub("Robin", "Joker", str_vec2, ignore.case=FALSE)
#splitting strings
strsplit(str_var3, "t")
#concatenating
paste(str_vec2, collapse=" ")
#change case
toupper(str_var3)
tolower(str_var3)
Note: Almost all of the functions above are much more powerful than what you see in the results. When you look at the help page of several of these functions (for example grep() and gsub()), you will see pattern as an argument. This stands for REGEX (regular expression) patterns, which you can learn to use to capture string patterns in large texts or tables! I will not go further into this, as REGEX alone can be a course on its own!
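Just to give a small taste of what a pattern can do, here is a sketch with three very simple regular expressions. The contents of str_vec2 are my assumption of what the vector looks like at this point:

```r
# Assumed contents of str_vec2 at this point in the session
str_vec2 <- c("I", "love", "Batman", "but not Robin")

# "^B" matches elements that start with a capital B
grep("^B", str_vec2)

# "n$" matches elements that end with n; value = TRUE returns the strings
grep("n$", str_vec2, value = TRUE)

# "[aeiou]" matches any single lower-case vowel; gsub() replaces every match
gsub("[aeiou]", "_", str_vec2)
```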
You can generally import a csv or tsv file into the R environment using the read.table() command. This automatically creates an object of the datatype data.frame, which stores tabular data (like an Excel sheet) in a structured way!
You can download the files here: genes and metadata.
gene_counts <- read.table("gene_counts.tsv", sep = "\t", row.names = 1, header = TRUE)
metadata <- read.table("metadata.tsv", sep = "\t", header = TRUE)
Note: If you run ?read.table, you will see that this function has many other arguments that you can use to read in a file the way you want. For example, skip can be used to skip a certain number of lines at the top of your file! This is helpful if, for example, your file has some comments before the counts table begins.
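Here is a small sketch of skip. To keep it self-contained, the "file" is a made-up piece of text passed through the text argument of read.table(); the gene and sample names are hypothetical:

```r
# Hypothetical file contents: two free-text lines, then a tab-separated table
raw_text <- paste(
  "Counts exported by a pipeline",
  "Second line of free text",
  "Gene\tSample_1\tSample_2",
  "GeneA\t10\t0",
  "GeneB\t3\t25",
  sep = "\n"
)

# skip = 2 ignores the first two lines, so the third line becomes the header
toy_counts <- read.table(text = raw_text, sep = "\t", header = TRUE,
                         row.names = 1, skip = 2)
toy_counts
```

Lines that start with # are already ignored by default (see the comment.char argument), so skip is mostly needed for leading lines that are not marked that way.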
You can use functions like names(), which gives you all the column names of a dataframe, and summary(), which summarizes each column according to the type of data it contains.
You can access the different columns in the data.frame using the $ sign. The result is a vector of the values in that column. You can think of this as selecting a particular column in an Excel sheet!
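A sketch of names(), summary() and $ on a small toy dataframe. The column names and values below are assumptions made up for illustration, not the real metadata file:

```r
# Toy dataframe; column names and values are invented for illustration
toy_md <- data.frame(
  Sample  = c("Sample_1", "Sample_2", "Sample_3", "Sample_4"),
  Healthy = c("yes", "no", "yes", "no"),
  Age     = c(34, 61, 45, 58),
  Sex     = c("F", "M", "M", "F")
)

names(toy_md)      # column names
summary(toy_md)    # one summary per column
toy_md$Age         # a single column, returned as a vector
mean(toy_md$Age)   # that vector can go straight into other functions
```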
The functions row.names() and colnames() can be used to get or set the row headers and column headers, as their names suggest! This is useful, for example, if the file you imported did not have headers or row names. When we read in the metadata dataframe, we did not specify row names! So, we add row names based on one of the columns!
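As a sketch, setting row names from a column could look like this. The toy dataframe and its Sample column are assumptions for illustration:

```r
# Toy dataframe; the Sample column name is an assumption
toy_md <- data.frame(
  Sample = c("Sample_1", "Sample_2", "Sample_3"),
  Age    = c(34, 61, 45)
)

row.names(toy_md)                    # default row names: "1" "2" "3"
row.names(toy_md) <- toy_md$Sample   # use the Sample column as row names
colnames(toy_md)                     # column headers

# Rows can now be addressed by name
toy_md["Sample_2", "Age"]
```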
You can find out what kind of datatype your object is by using either class() or str(), which stands for the structure of your R object!
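A short sketch on a made-up dataframe, showing that class() works on the whole object as well as on single columns:

```r
# Toy dataframe with invented values
toy_md <- data.frame(
  Sample = c("Sample_1", "Sample_2", "Sample_3"),
  Age    = c(34, 61, 45)
)

class(toy_md)          # class of the whole object
class(toy_md$Age)      # class of one column
class(toy_md$Sample)   # "character" in R >= 4.0, "factor" by default in older versions
str(toy_md)            # compact overview: dimensions, column types, first values
```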
It is good to know the difference between a matrix and a dataframe! A matrix can only hold values of a single type (for example all numbers or all text), whereas a dataframe can hold a different type in each column. You can convert between these formats by using the as.matrix() or as.data.frame() functions!
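To see what the single-type rule means in practice, here is a sketch with a made-up dataframe that mixes text and numbers:

```r
# Toy dataframe with one text column and one numeric column
toy_md <- data.frame(Healthy = c("yes", "no"), Age = c(34, 61))

toy_mt <- as.matrix(toy_md)
class(toy_md$Age)    # numeric in the dataframe
typeof(toy_mt)       # the whole matrix has become character
toy_mt[, "Age"]      # the ages are now text, not numbers

# Converting back does not restore the numbers automatically
toy_back <- as.data.frame(toy_mt)
is.numeric(toy_back$Age)
```

If you need the numbers again after a round trip like this, as.numeric() on the column is one way to get them back.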
You can also access and modify the contents of a dataframe or a matrix by using the syntax [row,column]. Here you can specify either the numbers or the names of the rows and columns.
Similarly, you can access entire rows with the [row,] syntax or entire columns with the [,column] syntax.
You can also subset your dataframes by selecting several rows or columns at once with the help of the c() function.
With this syntax, you can also change the contents by using <- or =, as shown below:
md.mt <- as.matrix(metadata)
# Notice the difference
length(md.mt)
length(metadata)
#Accessing the contents and see the difference between the two datatypes
metadata[2,1]
md.mt[2,1]
metadata["Sample_5","Age"]
# Accessing entire rows and columns
metadata["Sample_8",]
metadata[,"Age"]
# Subsetting dataframes
metadata[c(3,5),c(2,4)]
# Changing values in a dataframe
metadata["Sample_9","Age"] <- 77
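About the difference in length(): a dataframe is stored as a list of columns, while a matrix is one long vector with dimensions. A sketch on a made-up dataframe:

```r
# Toy dataframe: 3 rows and 2 columns, values invented for illustration
toy_md <- data.frame(Healthy = c("yes", "no", "yes"), Age = c(34, 61, 45))
toy_mt <- as.matrix(toy_md)

length(toy_md)   # number of columns
length(toy_mt)   # number of cells (rows x columns)

# dim(), nrow() and ncol() give the same answer for both datatypes
dim(toy_md)
dim(toy_mt)
```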
You can add row(s) or column(s) to a dataframe manually with <-, or with rbind() or cbind(), standing for row-bind and column-bind. In this case you make a new dataframe/vector and bind it to the data.
You can remove a row or column from a dataframe by putting a - sign in front of its number!
#for example we add a bloodpressure column
metadata$BP <- c(92,128,111,88,125,127,118,104,87,130,107,137,139,109,136,108)
metadata
# using cbind
Glu <-c(103,180,157,147,179,80,82,116,123,150,160,117,135,141,149,124)
metadata <- cbind(metadata,Glu)
metadata
#using rbind
Sample_17 <- c("Sample_17", "no", 67, "M", 103, 141)
metadata <- rbind(metadata, Sample_17)
metadata
# Note: You will get an error message here and we come to that later.
#removing from dataframe
metadata <- metadata[-17,]
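A plain c() vector that mixes text and numbers turns every value into text, which can clash with the column types already in the dataframe. One possible alternative, sketched here on a made-up dataframe, is to build the new row as a one-row dataframe so that each column keeps its own type:

```r
# Toy dataframe with invented values
toy_md <- data.frame(Healthy = c("yes", "no"), Age = c(34, 61),
                     stringsAsFactors = FALSE)
row.names(toy_md) <- c("Sample_1", "Sample_2")

# New row as a one-row dataframe with matching column names
new_row <- data.frame(Healthy = "no", Age = 67,
                      row.names = "Sample_3", stringsAsFactors = FALSE)

toy_md_plus <- rbind(toy_md, new_row)
class(toy_md_plus$Age)   # Age is still numeric

# Remove the third row again with a negative index
toy_md_minus <- toy_md_plus[-3, ]
```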
There are ways to check for NA values in your dataset! You can use the is.na() function, for example, in combination with which(), which tells you exactly where the NA values are.
You can also remove rows that contain NA by using na.omit().
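A sketch of these NA helpers on a made-up dataframe with two missing values:

```r
# Toy dataframe with two missing values, invented for illustration
toy_md <- data.frame(Age = c(34, NA, 45, 58), BP = c(92, 128, NA, 88))

is.na(toy_md)                          # TRUE wherever a value is missing
which(is.na(toy_md$Age))               # position of the NA within one column
which(is.na(toy_md), arr.ind = TRUE)   # row and column of every NA
sum(is.na(toy_md))                     # total number of missing values

# Keep only the rows without any NA
toy_complete <- na.omit(toy_md)
toy_complete
```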
You can also use the %in% operator to look specifically for things in your dataframe.
Factors are very important for statistical calculations. They basically represent the different levels (categories) that are present in your data. For example, Healthy in our metadata is a factor with two levels, yes and no, so it is a binary factor. But a factor can have many levels as well!
If there is a column in your dataframe that you want to turn into a factor, you can do this with the as.factor() function.
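A sketch of %in% and as.factor() on a made-up dataframe (the values are invented; only the column names follow our metadata):

```r
# Toy dataframe with invented values
toy_md <- data.frame(
  Healthy = c("yes", "no", "yes", "no", "yes"),
  Sex     = c("F", "M", "M", "F", "F"),
  stringsAsFactors = FALSE
)

# %in% returns TRUE or FALSE for every element on its left-hand side
"yes" %in% toy_md$Healthy
toy_md$Sex %in% "F"
toy_md[toy_md$Sex %in% "F", ]   # keep only the rows where Sex is F

# Turn a text column into a factor and inspect its levels
toy_md$Healthy <- as.factor(toy_md$Healthy)
levels(toy_md$Healthy)    # levels are sorted alphabetically by default
nlevels(toy_md$Healthy)
```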
The function table() tabulates observations and can be used to create bar plots quickly.
summary(metadata)
#Using factors to get observations
table(metadata$Age)
table(metadata$Sex)
#Using factors to plot different interesting counts
barplot(table(metadata$Healthy))
barplot(table(metadata$Sex))
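#even more interesting plots
# plot() draws box plots here only when x is a factor, so we convert explicitly
plot(x = as.factor(metadata$Sex), y = metadata$BP)
plot(x = as.factor(metadata$Sex), y = metadata$Glu)
plot(x = as.factor(metadata$Healthy), y = metadata$BP)
plot(x = as.factor(metadata$Healthy), y = metadata$Glu)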
Some basic plotting functions are plot(), boxplot() and hist():
#histogram
hist(gene_counts$Sample_5)
#Scatter plot
plot(gene_counts$Sample_5,gene_counts$Sample_16)
#Boxplot
boxplot(gene_counts)
boxplot(log2(gene_counts)) # why do you get a warning?
boxplot(log2(gene_counts + 1))
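A hint for the question above, sketched on a few made-up counts: the logarithm of zero is minus infinity, which is most likely what boxplot() complains about, since it cannot draw such a value. Adding 1 before taking the log (a so-called pseudo-count) avoids this:

```r
# Hypothetical counts, including a zero
toy_counts <- c(0, 1, 3, 7, 1023)

log2(toy_counts)       # the zero becomes -Inf
log2(toy_counts + 1)   # with a pseudo-count every value is finite
```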
Hope you guys learnt some new ways of handling data in R and, most importantly, had fun. You need to remember: