Chapter 7 Programming basics
In this chapter we look at three building blocks: conditional flows, loops, and functions, which are the main pillars of any type of programming.
7.1 Conditional flows
7.1.1 if/else
The main syntax is as follows:
if (condition) {
some R code
} else {
more R code
}
Here is a simple example:
x <- c("what","is","truth")

if("Truth" %in% x) {
  print("Truth is found")
} else {
  print("Truth is not found")
}
## [1] "Truth is not found"
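The match fails because %in% compares strings exactly, so "Truth" and "truth" count as different values. A minimal sketch of a case-insensitive check, assuming we are free to lower-case both sides with tolower() (the variable found is introduced only for illustration):

```r
x <- c("what", "is", "truth")

# Lower-case both sides so the comparison ignores capitalization
found <- tolower("Truth") %in% tolower(x)

if (found) {
  print("Truth is found")
} else {
  print("Truth is not found")
}
```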
How about this:
x <- c(1, 4, 4)
a <- 3

#Here is a nice if/else
if(length(x[x == a]) > 0) {
  print(paste("x has", length(x[x==a]), a))
} else {
  print(paste("x doesn't have any", a))
}
## [1] "x doesn't have any 3"
#Another one with %in%
a <- 4
if(a %in% x) {
  print(paste("x has", length(x[x==a]), a))
} else {
  print(paste("x doesn't have any", a))
}
## [1] "x has 2 4"
7.1.2 Nested conditions
#Change the numbers to see all conditions
x <- 0
y <- 4
if (x == 0 & y != 0) {
  print("a number cannot be divided by zero")
} else if (x == 0 & y == 0) {
  print("a zero cannot be divided by zero")
} else {
  a <- y/x
  print(paste("y/x = ", a))
}
## [1] "a number cannot be divided by zero"
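When the condition of an if is built from single values, R also offers &&, which stops evaluating as soon as the result is known. The sketch below uses a hypothetical helper, is_positive(), to show why that matters: with a NULL input the second test is never run, so the function returns FALSE instead of passing a zero-length condition to the caller.

```r
# Hypothetical helper: TRUE only for a non-NULL, positive number
is_positive <- function(v) {
  # && stops at the first FALSE, so v > 0 is never evaluated for NULL
  !is.null(v) && v > 0
}

is_positive(NULL)
is_positive(4)
```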
Building multiple conditions without else (it’s a silly example!):
z <- 0
w <- 4
x <- 5
y <- 3
if(z > w) print("z is bigger than w")
if(w > z) print("w is bigger than z")
## [1] "w is bigger than z"
if(x > y) print("x is bigger than y")
## [1] "x is bigger than y"
if(y > x) print("y is bigger than x")
if(z > x) print("z is bigger than x")
if(x > z) print("x is bigger than z")
## [1] "x is bigger than z"
if(w > y) print("w is bigger than y")
## [1] "w is bigger than y"
if(y > w) print("y is bigger than w")
7.1.3 Simpler ifelse
A simpler, one-line ifelse():
#Change the numbers
x <- 0
y <- 4
ifelse (x > y, "x is bigger than y", "y is bigger than x")
## [1] "y is bigger than x"
#Better (the two-way ifelse above wrongly reports "y is bigger than x" when x == y. Try it!)
ifelse (x == y, "x is the same as y",
        ifelse(x > y, "x is bigger than y", "y is bigger than x"))
## [1] "y is bigger than x"
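Unlike if, ifelse() is vectorized: it tests every element of a vector and returns a vector of the same length. A small sketch with made-up scores (the values and the 50-point threshold are illustrative assumptions):

```r
scores <- c(35, 72, 50, 90)   # illustrative values

# One test per element, one label per element
result <- ifelse(scores >= 50, "pass", "fail")
result
```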
An even simpler form, without else:
z <- 0
w <- 4
if(z > w) print("z is bigger than w")
#Change the numbers
x <- 5
y <- 3
if(x > y) print("x is bigger than y")
## [1] "x is bigger than y"
#When the condition is FALSE nothing is printed; either way, R moves on to the next line.
The ifelse() function only allows for one “if” statement with two cases. You could nest ifelse() calls, but that’s just a pain, especially if the three or more conditions you want to use are all on the same level, conceptually. Is there a way to specify multiple conditions at the same time?
#Let's create a data frame:
df <- data.frame("name"=c("Kaija", "Ella", "Andis"), "test1" = c(FALSE, TRUE, TRUE),
                 "test2" = c(FALSE, FALSE, TRUE))
df
## name test1 test2
## 1 Kaija FALSE FALSE
## 2 Ella TRUE FALSE
## 3 Andis TRUE TRUE
Suppose we want to separate the people into three groups:
- People who passed both tests: Group A
- People who passed one test: Group B
- People who passed neither test: Group C
dplyr has a function for exactly this purpose: case_when().
library(dplyr)
df <- df %>%
  mutate(group = case_when(test1 & test2 ~ "A", # both tests: group A
                           xor(test1, test2) ~ "B", # one test: group B
                           !test1 & !test2 ~ "C" # neither test: group C
  ))
df
## name test1 test2 group
## 1 Kaija FALSE FALSE C
## 2 Ella TRUE FALSE B
## 3 Andis TRUE TRUE A
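If dplyr is not available, the same grouping can be sketched in base R. This is an alternative to case_when(), not a replacement for it. It relies on the fact that logical values add up as 0 and 1, so the number of passed tests can index a vector of labels (n_passed is a name introduced here for illustration):

```r
df <- data.frame(name  = c("Kaija", "Ella", "Andis"),
                 test1 = c(FALSE, TRUE, TRUE),
                 test2 = c(FALSE, FALSE, TRUE))

# Number of tests passed per person: 0, 1 or 2
n_passed <- df$test1 + df$test2

# 0 passed -> "C", 1 passed -> "B", 2 passed -> "A"
df$group <- c("C", "B", "A")[n_passed + 1]
df
```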
7.2 Loops
What would you do if you needed to execute a block of code multiple times? In general, statements are executed sequentially. A loop statement allows us to execute a statement or group of statements multiple times. There are three main loop types in R: while(), for(), and repeat.
Here are some examples of the for() loop:
x <- c(3, -1, 4, 2, 10, 5)

for (i in 1:length(x)) {
  x[i] <- x[i] * 2
}
x
## [1] 6 -2 8 4 20 10
Note that this is just an example. If we want to multiply each element of a vector by 2, a loop isn’t the best way. Although explicit loops are the norm in many programming languages, in R we would simply use a vectorized operation.
x <- c(3, -1, 4, 2, 10, 5)
x <- x * 2
x
## [1] 6 -2 8 4 20 10
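One caveat about the 1:length(x) idiom used in the loop above: when x is empty, 1:length(x) is 1:0, which counts down through 1 and 0, so the loop body runs twice. A minimal sketch of the difference, using seq_along() as the safer index (the counters are illustrative names):

```r
x <- numeric(0)  # an empty vector

count_colon <- 0
for (i in 1:length(x)) count_colon <- count_colon + 1  # 1:0 is c(1, 0)

count_seq <- 0
for (i in seq_along(x)) count_seq <- count_seq + 1     # seq_along(x) is empty

c(count_colon, count_seq)
```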
7.2.1 Conditional loops
But sometimes a loop is very handy. If an element \(x_i > 3\), multiply it by the subsequent element \(x_{i+1}\); otherwise store 0:
x <- c(3, -1, 0, 2, 10, 5)

x_new <- c() #empty container
for (i in 1:(length(x)-1)) {
  #a scalar test, so use if/else rather than the vectorized ifelse()
  x_new[i] <- if (x[i] > 3) x[i] * x[i + 1] else 0
}
x
## [1] 3 -1 0 2 10 5
x_new
## [1] 0 0 0 0 50
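For comparison, the same result can be sketched without a loop by lining up each element with its successor. Here head(x, -1) drops the last element and tail(x, -1) drops the first. The names current, following and x_vec are introduced only for this illustration.

```r
x <- c(3, -1, 0, 2, 10, 5)

current   <- head(x, -1)  # every element except the last
following <- tail(x, -1)  # every element except the first

# Vectorized version of the rule: product if current > 3, else 0
x_vec <- ifelse(current > 3, current * following, 0)
x_vec
```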
Inside the if and else clauses, you can use next and break to further control the flow. The next statement skips directly to the next loop cycle, while break jumps out of the current loop.
x <- c(9, -1, 0, 5, -7, 16, 22)
zn <- c()

for(i in 1:length(x)){
  if(x[i] < 0){
    next
  }
  zn <- c(zn, sqrt(x[i]))
}
zn
## [1] 3.000000 0.000000 2.236068 4.000000 4.690416
In the following example, break stops the loop as soon as it meets an element greater than 10:
x <- c(9, 1, 0, 5, 7, 16, 22)
bn <- c()

for(i in 1:length(x)){
  if(x[i] > 10){
    break
  }
  bn <- c(bn, sqrt(x[i]))
}
bn
## [1] 3.000000 1.000000 0.000000 2.236068 2.645751
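Both loops can also be written with subsetting, which may help in seeing what next and break do. Skipping with next amounts to filtering, and stopping with break amounts to keeping everything before the first match. The sketch assumes at least one element exceeds 10; the vector y and the name stop_at are introduced for illustration.

```r
x <- c(9, -1, 0, 5, -7, 16, 22)
zn <- sqrt(x[x >= 0])                # same effect as skipping negatives with next

y <- c(9, 1, 0, 5, 7, 16, 22)
stop_at <- which(y > 10)[1]          # position of the first element above 10
bn <- sqrt(y[seq_len(stop_at - 1)])  # same effect as stopping there with break

zn
bn
```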
7.2.2 while() and repeat
Here is an example of a while() loop:
# Let's use our first example
x <- 3
cnt <- 1

while (cnt < 11) {
  x <- x * 2
  cnt <- cnt + 1
}
x
## [1] 3072
Here is an example of a repeat loop:
# Let's use our first example
x <- 3
cnt <- 1

repeat {
  x <- x * 2
  cnt <- cnt + 1
  if(cnt > 10) break
}
x
## [1] 3072
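In both examples the number of iterations is fixed at 10, so a for() loop would work equally well. while() is most useful when the number of iterations is not known in advance. A small sketch: keep doubling until the value exceeds 1000 (an arbitrary threshold chosen for illustration) and count the steps it took.

```r
x <- 3
steps <- 0

# Stop as soon as x exceeds the threshold
while (x <= 1000) {
  x <- x * 2
  steps <- steps + 1
}
c(x, steps)
```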
7.2.3 Nested loops
It is also common to put one loop inside another. Let’s say we want to create a 5x5 matrix A where each element is \(A_{ij} = i + j\):
A <- matrix(0, 5, 5) #initialize the matrix A

for (i in 1:5){ #loop over index i
  for (j in 1:5){ #loop over index j
    A[i, j] <- i + j #set the (i, j)-th element of A
  }
}
A
## [,1] [,2] [,3] [,4] [,5]
## [1,] 2 3 4 5 6
## [2,] 3 4 5 6 7
## [3,] 4 5 6 7 8
## [4,] 5 6 7 8 9
## [5,] 6 7 8 9 10
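As a side note, base R's row() and col() return matrices holding each cell's row and column index, so the same matrix can be sketched without any loop:

```r
A <- matrix(0, 5, 5)

# row(A)[i, j] is i and col(A)[i, j] is j, so their sum is i + j
A <- row(A) + col(A)
A
```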
7.2.4 outer()
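outer() takes two vectors and a function (that itself takes two arguments) and builds a matrix whose (i, j)-th element is the function applied to the i-th element of the first vector and the j-th element of the second. The function must be vectorized, because outer() calls it once on the expanded vectors rather than once per pair.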
x <- c(0, 1, 2)
y <- c(0, 1, 2, 3, 4)

m <- outer (
  y, # First argument: one row per element of y
  x, # Second argument: one column per element of x
  function (a, b) a+2*b # a takes the values of y, b the values of x
)
m
## [,1] [,2] [,3]
## [1,] 0 2 4
## [2,] 1 3 5
## [3,] 2 4 6
## [4,] 3 5 7
## [5,] 4 6 8
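Because outer() hands whole vectors to the function in a single call, a function written around a scalar if will not work directly. One common workaround, sketched here with a hypothetical scalar-only function larger(), is to wrap it in Vectorize():

```r
# Hypothetical scalar-only function: returns the larger of two numbers
larger <- function(a, b) if (a > b) a else b

# Vectorize() turns it into a function that works element by element
m <- outer(1:3, 1:3, Vectorize(larger))
m
```

For this particular task the built-in pmax() is already vectorized, so outer(1:3, 1:3, pmax) gives the same matrix.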
In place of the function, an operator can be given, which makes it easy to create a matrix with simple calculations (such as multiplying):
m <- outer(c(10, 20, 30, 40), c(2, 4, 6), "*")
m
## [,1] [,2] [,3]
## [1,] 20 40 60
## [2,] 40 80 120
## [3,] 60 120 180
## [4,] 80 160 240
It becomes very handy when we build a polynomial model:
x <- sample(0:20, 10, replace = TRUE)
x
## [1] 9 15 14 0 0 4 9 1 15 10
m <- outer(x, 1:4, "^")
m
## [,1] [,2] [,3] [,4]
## [1,] 9 81 729 6561
## [2,] 15 225 3375 50625
## [3,] 14 196 2744 38416
## [4,] 0 0 0 0
## [5,] 0 0 0 0
## [6,] 4 16 64 256
## [7,] 9 81 729 6561
## [8,] 1 1 1 1
## [9,] 15 225 3375 50625
## [10,] 10 100 1000 10000
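To illustrate how such a matrix of powers feeds into a polynomial model, here is a minimal sketch with made-up, noise-free data. The response y is constructed as an exact quadratic in x, so lm() should recover the coefficients 1, 2 and 0.5. The names P and fit are introduced for illustration.

```r
x <- 1:6
y <- 1 + 2 * x + 0.5 * x^2   # made-up quadratic, no noise

# Columns of P are x and x^2
P <- outer(x, 1:2, "^")

fit <- lm(y ~ P)
coef(fit)
```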
We can also use outer() to build the matrix from the nested-loop example, \(A_{ij} = i + j\):
outer(1:5, 1:5, "+")
## [,1] [,2] [,3] [,4] [,5]
## [1,] 2 3 4 5 6
## [2,] 3 4 5 6 7
## [3,] 4 5 6 7 8
## [4,] 5 6 7 8 9
## [5,] 6 7 8 9 10
# Or a matrix whose (i, j)-th element is 0.5^|i - j|
outer(1:4, 1:4, function(i, j) 0.5^abs(i - j))
## [,1] [,2] [,3] [,4]
## [1,] 1.000 0.50 0.25 0.125
## [2,] 0.500 1.00 0.50 0.250
## [3,] 0.250 0.50 1.00 0.500
## [4,] 0.125 0.25 0.50 1.000
7.3 The apply() family
The apply() family is part of base R and is populated with functions to manipulate slices of data from matrices, arrays, lists and data frames in a repetitive way. These functions allow crossing the data in a number of ways and avoid explicit use of loop constructs. They act on an input list, matrix or array and apply a named function with one or several optional arguments. The family is made up of the apply(), lapply(), sapply(), vapply(), mapply(), rapply(), and tapply() functions.
7.3.1 apply()
The R base manual tells you that it’s called as follows: apply(X, MARGIN, FUN, ...), where X is an array (a matrix when the array has two dimensions); MARGIN defines how the function is applied: with MARGIN=1 it is applied over rows, with MARGIN=2 over columns, and with MARGIN=c(1,2) to each individual cell; and FUN is the function that you want to apply to the data. It can be any R function, including a User Defined Function (UDF).
# Construct a 5x6 matrix
X <- matrix(rnorm(30), nrow=5, ncol=6)

# Sum the values of each column with `apply()`
apply(X, 2, sum)
## [1] 2.4420174 -2.3425854 -0.7139472 1.1880886 1.6326744 3.7587897
apply(X, 2, length)
## [1] 5 5 5 5 5 5
apply(X, 1, length)
## [1] 6 6 6 6 6
apply(X, 2, function (x) length(x)-1)
## [1] 4 4 4 4 4 4
#If you don’t want to write a function inside of the arguments
len <- function(x){
  length(x)-1
}
apply(X,2, len)
## [1] 4 4 4 4 4 4
#It can also apply a function to every cell of selected rows (note that the result comes back transposed)
X_new <- apply(X[1:2,], 1, function(x) x+1)
X_new
## [,1] [,2]
## [1,] 1.3853829 2.79835285
## [2,] 1.3818356 -0.16297086
## [3,] 1.0637923 0.05658597
## [4,] 1.8645813 0.20281732
## [5,] 0.4845239 -0.45212888
## [6,] 0.6685199 1.86793058
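Notice that the result has 6 rows and 2 columns even though we passed in 2 rows of 6 values: when the function returns a vector, apply() stores each result as a column. A minimal sketch, using a deterministic matrix instead of random numbers so the result is easy to check, that restores the original orientation with t():

```r
X <- matrix(1:30, nrow = 5, ncol = 6)

X_new <- apply(X[1:2, ], 1, function(x) x + 1)
dim(X_new)            # each processed row has become a column

# Transpose to get back one row per original row
X_fixed <- t(X_new)
X_fixed
```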
Since apply() is designed for matrices and arrays, if you apply apply() to a data frame, it first coerces your data.frame to a matrix, which means all the columns are converted to a common type. Depending on your context, this could have unintended consequences. For a safer practice in data frames, we can use lapply() and sapply():
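The coercion is easy to see in a small sketch. The data frame below is made up for illustration and mixes a character column with a numeric one. apply() reports every column as character, while sapply() works column by column and keeps the numeric type.

```r
df <- data.frame(name = c("Kaija", "Ella"), score = c(90, 85))

# apply() converts the data frame to a character matrix first
via_apply <- apply(df, 2, class)
via_apply

# sapply() visits each column as it is
via_sapply <- sapply(df, class)
via_sapply
```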
7.3.2 lapply()
Use lapply() when you want to apply a given function to every element of a list and obtain a list as a result. When you execute ?lapply, you see that the syntax looks like that of the apply() function. The difference is that it can be used for other objects like data frames, lists or vectors. The output returned is a list (which explains the “l” in the function name), which has the same number of elements as the object passed to it. The lapply() function does not need MARGIN.
A <- c(1:9)
B <- c(1:12)
C <- c(1:15)
my.lst <- list(A,B,C)
lapply(my.lst, sum)
## [[1]]
## [1] 45
##
## [[2]]
## [1] 78
##
## [[3]]
## [1] 120
7.3.3 sapply()
sapply() works just like lapply(), but will simplify the output if possible. This means that instead of returning a list like lapply(), it will return a vector if the data is simplifiable.
A <- c(1:9)
B <- c(1:12)
C <- c(1:15)
my.lst <- list(A,B,C)
sapply(my.lst, sum)
## [1] 45 78 120
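Because sapply() decides on its own whether to simplify, the type of its result can vary with the input. vapply(), also listed in the family above, takes a template for each result and stops with an error if a result does not match it. A minimal sketch on the same list, using mean() so that every result is a single number:

```r
my.lst <- list(1:9, 1:12, 1:15)

# numeric(1) declares that each result must be one number
means <- vapply(my.lst, mean, numeric(1))
means
```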
7.3.4 tapply()
Sometimes you may want to perform the apply function on some data, but have it separated by factor. In that case, you should use tapply(). Let’s take a look at an example of tapply().
X <- matrix(c(1:10, 11:20, 21:30), nrow = 10, ncol = 3)
tdata <- as.data.frame(cbind(c(1,1,1,1,1,2,2,2,2,2), X))
tdata
## V1 V2 V3 V4
## 1 1 1 11 21
## 2 1 2 12 22
## 3 1 3 13 23
## 4 1 4 14 24
## 5 1 5 15 25
## 6 2 6 16 26
## 7 2 7 17 27
## 8 2 8 18 28
## 9 2 9 19 29
## 10 2 10 20 30
tapply(tdata$V2, tdata$V1, mean)
## 1 2
## 3 8
What we have here is an important tool: We have a conditional mean of column 2 (V2) with respect to column 1 (V1). You can use tapply() to do some quick summary statistics on a variable split by condition.
#named stats_by_group to avoid masking base R's summary() function
stats_by_group <- tapply(tdata$V2, tdata$V1, function(x) c(mean(x), sd(x)))
stats_by_group
## $`1`
## [1] 3.000000 1.581139
##
## $`2`
## [1] 8.000000 1.581139
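When the function returns more than one value, tapply() gives back a list, as above. If a compact table is preferred, one alternative sketch is to split the variable by group with split() and let sapply() simplify the pieces into a matrix. The data frame is rebuilt here with only the two columns needed so that the block runs on its own, and the name stats is introduced for illustration.

```r
tdata <- data.frame(V1 = rep(1:2, each = 5), V2 = 1:10)

# One column per group, one named row per statistic
stats <- sapply(split(tdata$V2, tdata$V1),
                function(x) c(mean = mean(x), sd = sd(x)))
stats
```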
7.3.5 mapply()
mapply() applies a function to the corresponding elements of several vectors, which makes it useful for creating a new variable. For example, using dataset tdata, we could divide one column by another column to create a new value. This would be useful for creating a ratio of two variables as shown in the example below.
tdata$V5 <- mapply(function(x, y) x/y, tdata$V2, tdata$V4)
tdata$V5
## [1] 0.04761905 0.09090909 0.13043478 0.16666667 0.20000000 0.23076923
## [7] 0.25925926 0.28571429 0.31034483 0.33333333
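For simple arithmetic like this ratio, tdata$V2 / tdata$V4 gives the same numbers directly, because division is already vectorized. mapply() earns its keep when the function handles only one pair of values at a time. A sketch with made-up inputs: repeat each word a different number of times (words, times and repeated are illustrative names).

```r
words <- c("ab", "cd", "ef")
times <- c(1, 2, 3)

# Called once per (word, count) pair
repeated <- mapply(function(w, n) paste(rep(w, n), collapse = ""),
                   words, times)
unname(repeated)
```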
7.4 Functions
An R function is created by using the keyword function. Let’s write our first function:
first <- function(a){
  b <- a ^ 2
  return(b)
}
first(1675)
## [1] 2805625
Let’s write a function that finds the z-score (standardization). That means subtracting the sample mean and dividing by the sample standard deviation.
\[ \frac{x-\overline{x}}{s} \]
z_score <- function(x){
  return((x - mean(x))/sd(x))
}
set.seed(1)
x <- rnorm(10, 3, 30)
z <- z_score(x)
z
## [1] -0.97190653 0.06589991 -1.23987805 1.87433300 0.25276523 -1.22045645
## [7] 0.45507643 0.77649606 0.56826358 -0.56059319
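A quick way to check a function like this is to confirm that the standardized values have mean 0 and standard deviation 1, and to compare against base R's scale(), which performs the same centering and scaling by default. The sketch redefines z_score() so that it runs on its own.

```r
z_score <- function(x) (x - mean(x)) / sd(x)

set.seed(1)
x <- rnorm(10, 3, 30)
z <- z_score(x)

# Mean should be (numerically) 0 and standard deviation 1
c(mean(z), sd(z))

# scale() returns a one-column matrix, so flatten it before comparing
all.equal(z, as.vector(scale(x)))
```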
Let’s create a function that computes the factorial:
fact <- function(a){
  b <- 1
  #seq_len(a) is empty when a is 0, so fact(0) correctly returns 1
  for (i in seq_len(a)) {
    b <- b*i
  }
  b
}
fact(5)
## [1] 120
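The loop is useful for practice, but the same calculation can be sketched in one line with prod() and checked against base R's factorial(). The name fact_vec is hypothetical. Since seq_len(0) is empty and prod() of an empty vector is 1, the case of 0 is handled without any special code.

```r
# Product of 1, 2, ..., a; an empty product is 1
fact_vec <- function(a) prod(seq_len(a))

fact_vec(5)
fact_vec(0)
fact_vec(5) == factorial(5)
```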
Writing loops is an art and requires very careful thinking. The same task can be accomplished with many different loop structures. And it always takes more time to understand somebody else’s loop than your own!
7.5 source()
You can use the source() function in R to reuse functions that you create in another R script. The function uses the following basic syntax: source("path/to/some/file.R")
Suppose we have the following R script called some_functions.R that contains two simple user-defined functions:
divide_by_two <- function(x) {
return(x/2)
}
multiply_by_three <- function(x) {
return(x*3)
}
Now suppose we’re currently working with some R script called main_script.R. Assuming some_functions.R and main_script.R are located within the same folder, we can use source() at the top of our main_script.R to allow us to use the functions we defined in the some_functions.R script:
source("some_functions.R")
df <- data.frame(team=c('A', 'B', 'C', 'D', 'E', 'F'),
                 points=c(14, 19, 22, 15, 30, 40))
df$half_points <- divide_by_two(df$points)

df$triple_points <- multiply_by_three(df$points)
df
## team points half_points triple_points
## 1 A 14 7.0 42
## 2 B 19 9.5 57
## 3 C 22 11.0 66
## 4 D 15 7.5 45
## 5 E 30 15.0 90
## 6 F 40 20.0 120
We can call source() as many times as we’d like if we want to reuse functions defined in several different scripts.
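By default, source() defines everything in the global environment, which can overwrite objects that share a name. One option, sketched below, is the local argument of source(), which evaluates the script inside an environment of our choosing. To keep the sketch self-contained it writes a small script to a temporary file instead of assuming some_functions.R exists; the names script and helpers are illustrative.

```r
# Write a tiny script to a temporary file (stands in for some_functions.R)
script <- tempfile(fileext = ".R")
writeLines(c(
  "divide_by_two <- function(x) {",
  "  return(x/2)",
  "}"
), script)

# Evaluate the script inside its own environment
helpers <- new.env()
source(script, local = helpers)

helpers$divide_by_two(10)
```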