User defined functions (UDF)

R language allows users to create their own function objects. The basic syntax for defining new functions is:
name <- function(arg_1, arg_2, …)

Function Name - This is the actual name of the function. It is stored in R environment as an object with this name.

Arguments - An argument is a placeholder. When a function is invoked, you pass a value to the argument. Arguments are optional; that is, a function may contain no arguments. Also arguments can have default values.

Function Body - The function body contains a collection of statements that defines what the function does.

Return Value - The return value of a function is the last expression in the function body to be evaluated.

myfunction <- function(arg1, arg2, ... ){
statements
return(object)
}

Set Working Directory:

setwd("C:/Users/manso/OneDrive - University of West London/MSc Bioinformatics - UWL/3.DSB - Data Science for Bioinformatics/Practice/DSB W8 - R Functions and Control Structures")

1. Simple functions

sqSum<-function(x,y){
result=x^2+y^2
return(result)
}
sqSum(2,3)

[1] 13

Create a function to print squares of numbers in sequence

new.function <- function(a) {
   for(i in 1:a) {
      b <- i^2
      print(b)
   }
}

new.function(6)

[1] 1
[1] 4
[1] 9
[1] 16
[1] 25
[1] 36

Create a function that prints a message to the terminal

sqSumPrint<-function(x,y){
result=x^2+y^2
#The cat() is a built-in R function that concatenates and prints the objects.
cat("Here is the result:",result,"\n")
}
sqSumPrint(2,3)

Here is the result: 13

2. If/Else Conditions

Execute a certain part of the code only if certain condition is satisfied

Lets look at this example with the cpgi.hg19.chr21.bed file :

cpgi.df <- read.table("cpgi.hg19.chr21.bed", header = FALSE)

The function takes input one row of CpGi data frame:

largeCpGi<-function(bedRow){
 cpglen=bedRow[3]-bedRow[2]+1
 if(cpglen>1500){
  cat("this is large\n")
  }
 else if(cpglen<=1500 & cpglen>700){
  cat("this is normal\n")
  }
 else{
  cat("this is short\n")
 }
 }

largeCpGi(cpgi.df[10,])

this is normal

3. Loops and looping structures in R

To repeat a certain task or execute a function multiple times, you can a loop wich will execute the task until a certain condition is reached.

Create a “for-loop” to execute a task sequentially 10 times

Example 1:

for (i in 1:10){
cat("This loop will print this strign 10 times")
}

This loop will print this strign 10 timesThis loop will print this strign 10 timesThis loop will print this strign 10 timesThis loop will print this strign 10 timesThis loop will print this strign 10 timesThis loop will print this strign 10 timesThis loop will print this strign 10 timesThis loop will print this strign 10 timesThis loop will print this strign 10 timesThis loop will print this strign 10 times

Example 2:

for(i in 1:10){ # number of repetitions
  cat("This is iteration") # the task to be repeated
  print(i)}

This is iteration[1] 1
This is iteration[1] 2
This is iteration[1] 3
This is iteration[1] 4
This is iteration[1] 5
This is iteration[1] 6
This is iteration[1] 7
This is iteration[1] 8
This is iteration[1] 9
This is iteration[1] 10

Calculate the length of the CpG islands for the first 100 islands in the data frame

Workflow: 1. Create an empty vector to keep the lengths 2. Start the loop 3. Calculate the length 4. Append the length to the result 5. Check the results

result=c()
for(i in 1:100){
len=cpgi.df[i,3]-cpgi.df[i,2]+1
result=c(result,len)
}
head(result)

[1]  855  208  357 1501 1090  751

The code executed hundred times, and it calculate the length of the CpG islands for the first 100 islands in the data frame (by subtracting the end coordinate from the start coordinate).

4.Use “apply” functions instead of loops

“apply” family of functions – include apply, lapply, mapply and tapply. These tend to be more efficient than loops and work by applying a given function to a set of instances and return the result of those functions for each instance. The subtle difference between these functions is that they take different type of inputs.

. apply works on data frames or matrices and applies the function on each row or column of the data structure.

. lapply works on lists or vectors and applies a function which takes the list element as an argument.

. mapply is a multivariate version of sapply. mapply applies FUN to the first elements of each … argument, the second elements, the third elements, and so on. Arguments are recycled if necessary.

. tapply applies a function to each cell of a ragged array, that is to each (non-empty) group of values given by a unique combination of the levels of certain factors.

Write a matrix,then use “apply() to use the sum function to calculate of numbers on each row of the matrix.

Use cbind in R to write the matrix:

#cbind - combines by collumns or rows
mat=cbind(c(3,0,3,3),c(3,0,0,0),c(3,0,0,3),c(1,1,0,0),c(1,1,1,0),c(1,1,1,0))
mat

     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    3    3    3    1    1    1
[2,]    0    0    0    1    1    1
[3,]    3    0    0    0    1    1
[4,]    3    0    3    0    0    0

Use apply to sum the row value

#Use apply to sum the rows(specified by the number 1 in the midle argument) values
result<-apply(mat,1, sum)
result

[1] 12  3  5  6

Define the function as an argument to apply()

#Use apply to sum the rows(specified by the number 1 in the midle argument) values
result<-apply(mat,1,function(x) sum(x))
result

[1] 12  3  5  6

Integers 1 and 2 are used to differentiate between row and column respectively in the apply() function. To find the sum in the columns use 2 in the function:

result<-apply(mat,2,sum)
result

[1] 9 3 6 2 3 3

Use Lapply() to apply a function on a list or a vector.

In this example, let’s add another column that contains the squares of given numbers.

input=c(1,2,3)
lapply(input,function(x) x^2)

[[1]]
[1] 1

[[2]]
[1] 4

[[3]]
[1] 9

Use mapply() to apply a function on an unlimited set of vectors/lists.

It is like a version of lapply that can handle multiple vectors as arguments. The argument to the mapply() is the function to be applied and the sets of parameters to be supplied as arguments of the function.

The arguments to be summed up are in the format of vectors, Xs and Ys. mapply() applies the summation function to each pair in Xs and Ys vector.

Xs=0:5
Ys=c(2,2,2,3,3,3)
result<-mapply(function(x,y) sum(x,y),Xs,Ys)
result

[1] 2 3 4 6 7 8

5. Application to genomic data

Generate a data frame with ID numbers and DNA sequences

fx <- function(test) {
x <- as.integer(runif(20, min=1, max=5))
x[x==1] <- "A"; x[x==2] <- "T"; x[x==3] <- "G"; x[x==4] <- "C"
paste(x, sep = "", collapse ="")
}
z1 <- c()
for(i in 1:50) {
z1 <- c(fx(i), z1)
}
z1 <- data.frame(ID=seq(along=z1), Seq=z1)
z1

   ID                  Seq
1   1 GGGAATTATGTCGGCTTATA
2   2 ACGACCTACGGGAGCGCGTC
3   3 CTTGGCGCGGTTTATGTGAA
4   4 ACATGCGCTTCGGATATATT
5   5 AGTCAACCTGGAGCCTAAGT
6   6 AAGGTGTACACCACGGAGTT
7   7 CGGACTATCGGGTCTATACG
8   8 TTCCCGATAGCACTCAATTC
9   9 CAACTCCGGCTCAGTCAGGT
10 10 AAATCAATCTCCCATACGTT
11 11 ACCGCGCGACCTTACAGTCT
12 12 AGGTCATTTTAGCGGGCCTT
13 13 TGGAAAACTTCGGTCGGAGA
14 14 GGAGTCATCGGTCTGAGACT
15 15 ATATGGTTAGTACCTGGGAG
16 16 CCAGCCAGCTGTACGGTAGG
17 17 GTCCGACAACATCCCTGTAG
18 18 TTAGATCAATGTTACGATAA
19 19 GCGGGTTTGAGACCGGCATG
20 20 ATCATCGGTGCGTACCTCGA
21 21 TGGATGCAATACCTACACGA
22 22 CTAATACTGCGATGTTTATA
23 23 ACAACGGGGGGCCGCGTAGT
24 24 CGTGGACAATCCGAGAGCAC
25 25 GCACAGTATGGCGGCTGCTA
26 26 GAAATGGGCTTAAAGCTCCG
27 27 CTGAATGCCAAAGCCGGTAG
28 28 GAGTCGCTACGATAGCGTAC
29 29 CCGGTGCCGAGAATGGGCAC
30 30 GTCGCGAACTACAGGTCGAG
31 31 TTTGAAGCCTCCTTAAATGT
32 32 GGCATGGAAGACAGATTGAG
33 33 GGTTCTAGAGGCCGCGCCCG
34 34 GACTGAAAAGTGGCGACGCC
35 35 GGAGTAAAATCGTTGCGGGG
36 36 AAGGGTGCTAAGACGTCAAA
37 37 GCTTTGCCGGATGCTCCAGA
38 38 TCTTTCCCAATAACCTCTTC
39 39 GGGAAATCTGAAACGCGGTA
40 40 TGGCTCCTGGTGTCTAGGCT
41 41 TATGGAACGCCGTTTTCATG
42 42 ACAGCACACAAACGGTATTG
43 43 TCCCCCTACAGTCGCTGGGC
44 44 TATTTATGTCTCTATGGCTG
45 45 AAAAGTTTTCGAGGAGCGAC
46 46 ATACGGCGCACGTAAGTCGA
47 47 AAACTAACTTTCGCTCAAAA
48 48 CTCAGACAGGGGCATCATGG
49 49 GCTGCCCTAAACCCAAGAGA
50 50 CGAGCACATAGGCTAATTGG

Write each character of sequence into separate vector field and reverse its order

my_split <- strsplit(as.character(z1[1,2]),"")
my_rev <- rev(my_split[[1]])
paste(my_rev, collapse="")

[1] "ATATTCGGCTGTATTAAGGG"

Generate the sequence complement by replacing G|C|A|T by C|G|T|A

Use ‘apply’ or ‘for loop’ to apply the above operations to all sequences in sample data frame ‘z1’

Calculate in the same loop the GC content for each sequence using the following command:

table(my_split[[1]])/length(my_split[[1]])


   A    C    G    T 
0.25 0.10 0.30 0.35

6. Translate DNA into Protein Sequences

Translates one or many DNA sequences in all six reading frames into proteins and export of results in FASTA format

6.1. Sequence Import Function

seq_imp_fct <- function(fileloc) {
    my_fasta <- readLines(fileloc) # reads file line-wise into vector 
y <- regexpr("^[^>]", my_fasta, perl=T) # identifies all fields that do not  start with a '>' sign 
    y <- as.vector(y);  y[y==-1] <- 0
    index <- which(y==0)
distance <- data.frame(start=index[1:(length(index)-1)], end=index[2:length(index)])
distance <- rbind(distance, c(distance[length(distance[,1]),2],  length(y)+1)) # gets data for last entry 
    distance <- data.frame(distance, dist=distance[,2]-distance[,1])
    seq_no <- 1:length(y[y==0])
    index <- rep(seq_no, as.vector(distance[,3]))
    my_fasta <- data.frame(index, y, my_fasta)
    my_fasta[my_fasta[,2]==0,1] <- 0
seq <- tapply(as.vector(my_fasta[,3]), factor(my_fasta[,1]), paste, collapse="", simplify=F)
    seq <- as.vector(seq[2:length(seq)])
Desc <- as.vector(my_fasta[c(grep("^>", as.character(my_fasta[,3]), perl = TRUE)),3])
    ID <- gsub("^>| .*", "", as.character(Desc), perl=T)
    Desc <- gsub("^.*? ", "", as.character(Desc), perl=T)
    my_fasta <- data.frame(ID, Desc, Length=nchar(seq), seq)
    my_fasta
}   
cat("\n# (1.1) Import 2058 ORFs from Halobacterium into R like this: \n\t myseq < seq_imp_fct(\"ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.ffn\")\n")


# (1.1) Import 2058 ORFs from Halobacterium into R like this: 
     myseq < seq_imp_fct("ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.ffn")

cat("\n# (1.2) Import codon table:\n\t AAdf <- read.table(file=\"http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/AA.txt\", header=T, sep=\"\\t\")\n")


# (1.2) Import codon table:
     AAdf <- read.table(file="http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/AA.txt", header=T, sep="\t")

6.2. Define translateDNA()

# (a) Translate a single DNA sequence
translate <- function(mystr="TGGCTATCCCATGATTTTCTCCAC", 
                      frame, 
                      pepCode=pepCode) 
  { 
# argument "pepCode" can have the values 3 or 4; the argument "frame" can have one integer between 1 and 6.
mystr <- gsub("(...)", "\\1_", mystr)
# inserts '_' between all characters in sequence string (required for step).
    mystr <- unlist(strsplit(mystr , "_")) # generates vectorized triplet sequence
    mystr <- mystr[grep("^...$", mystr)]
    mystr <- data.frame(Order=1:length(mystr), Seq=mystr)
    if(frame==1|frame==2|frame==3) {
    mystr <- merge(mystr, AAdf, by.x="Seq", by.y="Codon", all.x=T) # Merges codon table with input sequence. The order column is used in the next step to sort back to the initial order in the sequence string.
    mystr[,pepCode] <- as.character(mystr[,pepCode])
    mystr[,pepCode][is.na(mystr[,pepCode])] <- "X" # assigns X's to triplets with unusual characters (e.g. N's)
    mystr <- paste(as.vector(mystr[order(mystr[,2]), pepCode]), collapse="")
    }
    if(frame==4|frame==5|frame==6) {
    mystr <- merge(mystr, AAdf, by.x="Seq", by.y="AntiCodon", all.x=T) # Merges codon table with input sequence. The order column is used in the next step to sort back to the initial order in the sequence string.
    mystr <- mystr[,c(1,2,4,5,6,3)] # obtains same column order as for frames 1-3
    mystr[,pepCode] <- as.character(mystr[,pepCode])
    mystr[,pepCode][is.na(mystr[,pepCode])] <- "X" # assigns X's to triplets with unusual characters (e.g. N's)
    mystr <- paste(as.vector(mystr[rev(order(mystr[,2])), pepCode]), collapse="") #     generates reverse protein sequence!!!
    }
    mystr
}

# (b) Transform input DNA into 6 reading frames (list object) and translate them with the translate() function.

translateDNA <- function(myseq=myseq, frame=c(1,2,3,4,5,6), pepCode="single") { 
        # argument frame allows to select any combination of the 6 reading frames 
    if(pepCode=="single") { pepCode <- 3 } 
    if(pepCode=="triple") { pepCode <- 4 }
    myseq <- gsub("([A-Z])", "\\U\\1", as.character(myseq), perl=T, ignore.case=T) # takes care of case inconsistencies
    frame1 <- myseq
    frame2 <- gsub("^.", "", myseq)
    frame3 <- gsub("^..", "", myseq)
myl <- list(frame1=frame1, frame2=frame2, frame3=frame3, frame4=frame1, frame5=frame2, frame6=frame3)
# The next line runs translate() only on the reading frames that are specified by the provided 'frame' argument
myl <- lapply(names(myl)[frame], function(x) translate(mystr=myl[x], frame=which(names(myl) %in% x), pepCode=pepCode))
    names(myl) <- paste(rep("frame", length(frame)), frame, sep="")
    myl
}
cat("\n# (2.1) Translate sample DNAs in all 6 reading frames:\n\t pep_list <- apply(data.frame(myseq[1:10,4]), 1, function(x) translateDNA(myseq=x, frame=c(1,2,3,4,5,6), pepCode=\"single\"))\n")


# (2.1) Translate sample DNAs in all 6 reading frames:
     pep_list <- apply(data.frame(myseq[1:10,4]), 1, function(x) translateDNA(myseq=x, frame=c(1,2,3,4,5,6), pepCode="single"))

6.3. Export protein list to text file in fasta format

seql2fasta <- function(seql=pep_list, myfile="myfile") {
            names(seql) <- paste(">", names(seql), sep="")
            export <- data.frame(IDs=names(seql), seq=unlist(seql))
            export <- as.vector(unlist(t(export)))
    write.table(export, file=paste(myfile, ".txt", sep=""), quote=F, row.names=F, col.names=F)
}
cat("\n# (3.1) Export protein sequences in fasta format:\n\t seql2fasta(seql=pep_list, myfile=\"myfile\")\n")


# (3.1) Export protein sequences in fasta format:
     seql2fasta(seql=pep_list, myfile="myfile")