Chapter 2 Vectors
Many operations in R make heavy use of vectors. Possibly the most common way to create a vector in R is using the c() function, which is short for “combine.” As the name suggests, it combines a list of elements separated by commas.
c(1, 5, 0, -1)
## [1] 1 5 0 -1
If we would like to store this vector in a variable, we can do so with the assignment operator <- or =. By convention, <- is preferred.
x <- c(1, 5, 0, -1)
z = c(1, 5, 0, -1)
x
## [1] 1 5 0 -1
z
## [1] 1 5 0 -1
Note that scalars do not exist in R. They are simply vectors of length 1.
y <- 24 #this is a vector with 1 element, 24
2.1 One type, same type
Because vectors must contain elements that are all the same type, R will automatically coerce to a single type when attempting to create a vector that combines multiple types.
c(10, "Machine Learning", FALSE)
## [1] "10" "Machine Learning" "FALSE"
c(10, FALSE)
## [1] 10 0
c(10, TRUE)
## [1] 10 1
x <- c(10, "Machine Learning", FALSE)
str(x) #this tells us the structure of the object
## chr [1:3] "10" "Machine Learning" "FALSE"
class(x)
## [1] "character"
y <- c(10, FALSE)
str(y)
## num [1:2] 10 0
class(y)
## [1] "numeric"
We know that vectors are objects that hold values of a single type. If you combine values of different types into a vector, R converts all of them to the most flexible type present (logical, then numeric, then character). This is usually called the coercion rule.
m <- c(TRUE, 5, -2, FALSE)
m
## [1] 1 5 -2 0
class(m)
## [1] "numeric"
And,
m_2 <- c(8, "Joe", 21, "Mustang")
m_2
## [1] "8" "Joe" "21" "Mustang"
class(m_2)
## [1] "character"
You can also convert vectors manually:
n <- c(8, 3, 21, 2)
n
## [1] 8 3 21 2
nc <- as.character(n)
nc
## [1] "8" "3" "21" "2"
n <- as.numeric(nc)
n
## [1] 8 3 21 2
And be careful:
m_2 <- c(8, "Joe", 21, "Mustang")
as.numeric(m_2)
## Warning: NAs introduced by coercion
## [1] 8 NA 21 NA
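A minimal sketch of how one might locate the entries that fail to convert. It reuses m_2 from above; the names converted and bad are illustrative, and suppressWarnings() is used only to silence the coercion warning that we already expect.

```r
m_2 <- c(8, "Joe", 21, "Mustang")
# Convert, silencing the expected "NAs introduced by coercion" warning
converted <- suppressWarnings(as.numeric(m_2))
# Positions that could not be converted to a number
bad <- which(is.na(converted))
bad        # indices of the non-numeric entries
m_2[bad]   # the offending strings
```

Checking which entries became NA before using the converted vector helps avoid silently carrying missing values into later calculations.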
2.2 Patterns
If you want to create a vector from a sequence of numbers, you can do it easily with the : operator, which creates a sequence of integers between two specified integers.
y <- c(1:15)
y
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
#or
y <- 1:15
y
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
If you want to create an increasing or decreasing sequence with a specific step, you can use seq().
y <- seq(from = 1.5, to = 13, by = 0.9) #increasing
y
## [1] 1.5 2.4 3.3 4.2 5.1 6.0 6.9 7.8 8.7 9.6 10.5 11.4 12.3
y <- seq(1.5, -13, -0.9) #decreasing. Note that you can omit the argument names
y
## [1] 1.5 0.6 -0.3 -1.2 -2.1 -3.0 -3.9 -4.8 -5.7 -6.6 -7.5 -8.4
## [13] -9.3 -10.2 -11.1 -12.0 -12.9
Another useful tool is rep().
rep("ML", times = 10)
## [1] "ML" "ML" "ML" "ML" "ML" "ML" "ML" "ML" "ML" "ML"
#or
x <- c(1, 5, 0, -1)
rep(x, times = 2)
## [1] 1 5 0 -1 1 5 0 -1
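Besides times, rep() also accepts the each and length.out arguments. The following short sketch contrasts them; the variable names by_times, by_each, and by_length are illustrative only.

```r
x <- c(1, 5, 0, -1)
# times repeats the whole vector; each repeats every element in place
by_times <- rep(x, times = 2)
by_each <- rep(x, each = 2)
# length.out stops as soon as the requested length is reached
by_length <- rep(x, length.out = 6)
by_times
by_each
by_length
```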
And we can combine them as follows.
wow <- c(x, rep(seq(1, 9, 2), 3), c(1, 2, 3), 42, 2:4)
wow
## [1] 1 5 0 -1 1 3 5 7 9 1 3 5 7 9 1 3 5 7 9 1 2 3 42 2 3
## [26] 4
Another option is the length argument of seq() (short for length.out), which creates a given number of equally spaced values.
g <- seq(6, 60, length = 4)
g
## [1] 6 24 42 60
Finally, unique() drops the duplicated values:
unique(wow)
## [1] 1 5 0 -1 3 7 9 2 42 4
2.3 Attributes
We can calculate the number of elements in a vector:
length(wow)
## [1] 26
There is a set of functions whose names start with is., for example is.numeric(), which checks whether a vector is of numeric type:
is.numeric(g)
## [1] TRUE
is.character(g)
## [1] FALSE
In addition to storing values, a vector can also store a name for each element. Such vectors are called named vectors.
x <- c(165, 60, 22)
x
## [1] 165 60 22
x_n <- c(height = 125, weight = 56, BMI = 21)
x_n
## height weight    BMI
##    125     56     21
And,
attributes(x_n)
## $names
## [1] "height" "weight" "BMI"
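Names are useful mainly because elements can then be retrieved by name instead of by position. A brief sketch using x_n from above; renaming the third element to "bmi" is purely illustrative.

```r
x_n <- c(height = 125, weight = 56, BMI = 21)
names(x_n)                          # the names as a character vector
x_n["weight"]                       # subset by name; the name is kept
weight_only <- unname(x_n["weight"])  # drop the name, keep the value
names(x_n)[3] <- "bmi"              # rename one element
x_n
```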
2.4 Character operators
animals <- c("dog", "cat", "donkey")
nchar(animals)
## [1] 3 3 6
We can concatenate several strings into a single string.
wrong <- c("we have", "dogs", "cats", "and, donkey")
wrong
## [1] "we have" "dogs" "cats" "and, donkey"
right <- paste("we have", "dogs,", "cats,", "and, donkey")
right
## [1] "we have dogs, cats, and, donkey"
Note that paste() separates its arguments with a space by default. You can also check paste0(), which concatenates without a separator.
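A short sketch of paste0() and of the sep and collapse arguments of paste(). The strings and the names ids, dashed, and one_string are illustrative.

```r
animals <- c("dog", "cat", "donkey")
# paste0() concatenates with no separator and is vectorized
ids <- paste0("animal_", 1:3)
# sep sets the separator placed between the arguments
dashed <- paste("we", "have", "dogs", sep = "-")
# collapse joins the elements of one vector into a single string
one_string <- paste(animals, collapse = ", ")
ids
dashed
one_string
```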
hah <- toupper(right)
hah
## [1] "WE HAVE DOGS, CATS, AND, DONKEY"
haha <- tolower(hah)
haha
## [1] "we have dogs, cats, and, donkey"
2.5 Sort, rank, and order
x <- c(2, 3, 2, 0, 4, 7)
x
## [1] 2 3 2 0 4 7
By default, the sort() function sorts the elements of a vector in increasing order.
sort(x)
## [1] 0 2 2 3 4 7
sort(x, decreasing = TRUE)
## [1] 7 4 3 2 2 0
The rank() function gives the position each element would take in ascending order; tied values share the average of their ranks.
rank(x)
## [1] 2.5 4.0 2.5 1.0 5.0 6.0
You can see that the smallest value of x is 0, which is the fourth element. Thus, the fourth element has rank 1. The two 2's are tied for ranks 2 and 3, so each gets 2.5.
The order() function is easy to confuse with sort(), but it does something quite different.
order(x)
## [1] 4 1 3 2 5 6
We can see that order() returns the indices of the elements in ascending order: the smallest value is at position 4, the next smallest at position 1, and so on.
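The relation between order() and sort() can be sketched as follows. The labels vector is hypothetical and only serves to show how one vector can be sorted by the values of another.

```r
x <- c(2, 3, 2, 0, 4, 7)
# Indexing x by order(x) reproduces sort(x)
sorted_by_order <- x[order(x)]
# order() can also arrange one vector by the values of another
labels <- c("a", "b", "c", "d", "e", "f")  # hypothetical labels
labels_by_x <- labels[order(x)]
sorted_by_order
labels_by_x
```

This is the main reason order() is used so often in practice: it lets us reorder related vectors in the same way.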
2.6 Simple descriptive measures
Let’s define a numeric vector:
h <- c(x, rep(seq(1, 9, 2), 3), c(1, 2, 3), 42, 2:4)
h
## [1] 2 3 2 0 4 7 1 3 5 7 9 1 3 5 7 9 1 3 5 7 9 1 2 3 42
## [26] 2 3 4
summary(h)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.000   2.000   3.000   5.357   7.000  42.000
A set of common statistical measures:
min(h)
## [1] 0
max(h)
## [1] 42
mean(h)
## [1] 5.357143
median(h)
## [1] 3
sd(h)
## [1] 7.650777
range(h)
## [1] 0 42
sum(h)
## [1] 150
cumsum(h)
## [1] 2 5 7 7 11 18 19 22 27 34 43 44 47 52 59 68 69 72 77
## [20] 84 93 94 96 99 141 143 146 150
prod(h)
## [1] 0
quantile(h)
##   0%  25%  50%  75% 100%
##    0    2    3    7   42
IQR(h) #interquartile range
## [1] 5
We can also find the indices of the maximum and minimum values:
which.max(h)
## [1] 25
which.min(h)
## [1] 4
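One detail worth keeping in mind is that which.max() and which.min() report only the first position when the extreme value occurs more than once. A small sketch with a hypothetical vector v:

```r
# Hypothetical vector in which the maximum appears twice
v <- c(3, 9, 1, 9, 4)
first_max <- which.max(v)      # first position only
all_max <- which(v == max(v))  # every position holding the maximum
first_max
all_max
```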
2.7 Subsetting Vectors
One of the most confusing subjects in R is subsetting data containers. It is an important part of data management, and it becomes quite easy if it is done in 2 steps:
- Identifying the index of the element that satisfies the required condition,
- Calling the index to subset the vector.
But before we start, let’s see a simple subsetting. (Note the square brackets)
#Suppose we have the following vector
myvector <- c(1, 2, 3, 4, 5, 8, 4, 10, 12)
#I can call each element with its index number:
myvector[c(1,6)]
## [1] 1 8
myvector[4:7]
## [1] 4 5 8 4
myvector[-6] #a negative index drops that element
## [1] 1 2 3 4 5 4 10 12
Okay, let’s see the commonly used comparison operators:
x <- 3
x < 2 #less than
## [1] FALSE
x <= 2 #less than or equal to
## [1] FALSE
x > 1 #greater than
## [1] TRUE
x >= 1 #greater than or equal to
## [1] TRUE
x == 3 #equal to
## [1] TRUE
#x = 3 #Note that this is an assignment operator
x != 3 #not equal to
## [1] FALSE
#Let's look at this vector
myvector <- c(1, 2, 3, 4, 5, 8, 4, 10, 12)
#We want to subset only those less than 5
#Step 1: use a logical operator to identify the elements
#meeting the condition.
logi <- myvector < 5
logi
## [1] TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE FALSE
#logi is a logical vector
class(logi)
## [1] "logical"
#Step 2: use it for subsetting
newvector <- myvector[logi==TRUE]
newvector
## [1] 1 2 3 4 4
Or better:
newvector <- myvector[logi]
newvector
## [1] 1 2 3 4 4
This is good, as it shows the 2 steps. We can also combine these 2 steps as follows:
newvector <- myvector[myvector < 5]
newvector
## [1] 1 2 3 4 4
Another way to do this is to use which(), which gives us the index of each element that satisfies the condition.
ind <- which(myvector < 5) # Step 1
ind
## [1] 1 2 3 4 7
newvector <- myvector[ind] # Step 2
newvector
## [1] 1 2 3 4 4
Or we can combine these 2 steps:
newvector <- myvector[which(myvector < 5)]
newvector
## [1] 1 2 3 4 4
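The same two-step logic extends to several conditions at once. The sketch below combines comparisons with the element-wise operators & (and), | (or), and ! (not); the names middle, tails, and not_four are illustrative.

```r
myvector <- c(1, 2, 3, 4, 5, 8, 4, 10, 12)
# & means "and": greater than 2 and less than 10
middle <- myvector[myvector > 2 & myvector < 10]
# | means "or": less than 2 or greater than 10
tails <- myvector[myvector < 2 | myvector > 10]
# ! negates a condition: everything that is not equal to 4
not_four <- myvector[!(myvector == 4)]
middle
tails
not_four
```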
Last one: find the 4’s in myvector and make them 8 (I know it looks hard, but after a couple of tries it will seem easier):
myvector <- c(1, 2, 3, 4, 5, 8, 4, 10, 12)
#I'll show you 3 ways to do that.
#1st way, to show the steps
ind <- which(myvector==4) #identifying the indices holding 4
newvector <- myvector[ind] + 4 #adding 4 to them
myvector[ind] <- newvector #replacing those with the new values
myvector
## [1] 1 2 3 8 5 8 8 10 12
#2nd and easier way (reset myvector first, since no 4's are left)
myvector <- c(1, 2, 3, 4, 5, 8, 4, 10, 12)
myvector[which(myvector==4)] <- myvector[which(myvector==4)] + 4
myvector
## [1] 1 2 3 8 5 8 8 10 12
#3rd and easiest way (again after a reset)
myvector <- c(1, 2, 3, 4, 5, 8, 4, 10, 12)
myvector[myvector==4] <- myvector[myvector==4] + 4
myvector
## [1] 1 2 3 8 5 8 8 10 12
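When the new value is known in advance, it can also be assigned directly instead of adding 4. The sketch below shows this, together with base R's replace(), which returns a modified copy; the names direct and copy are illustrative.

```r
myvector <- c(1, 2, 3, 4, 5, 8, 4, 10, 12)
# Assign the new value directly; it is recycled over all matches
direct <- myvector
direct[direct == 4] <- 8
# replace() returns a modified copy and leaves myvector untouched
copy <- replace(myvector, myvector == 4, 8)
direct
copy
myvector
```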
What happens if the vector is a character vector? How can we subset it? We can use grep() as shown below:
m <- c("about", "aboard", "board", "bus", "cat", "abandon")
#Now suppose that we need to pick the elements that contain "ab"
#Same steps again
a <- grep("ab", m) #similar to which(): it gives us index numbers
a
## [1] 1 2 6
newvector <- m[a]
newvector
## [1] "about" "aboard" "abandon"
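Two close relatives of the call above may be handy: grepl() returns a logical vector, which mirrors the logical subsetting shown earlier, and grep() with value = TRUE returns the matching elements directly. A brief sketch; the pattern "^b" (strings starting with b) is an added illustration.

```r
m <- c("about", "aboard", "board", "bus", "cat", "abandon")
has_ab <- grepl("ab", m)                  # logical vector, one value per element
by_value <- grep("ab", m, value = TRUE)   # the matching elements themselves
starts_b <- grep("^b", m, value = TRUE)   # ^ anchors the match at the start
has_ab
by_value
starts_b
```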
2.8 Vectorization or vector operations
One of the biggest strengths of R is its use of vectorized operations. Let’s see it in action!
x <- 1:10
x
## [1] 1 2 3 4 5 6 7 8 9 10
x+1
## [1] 2 3 4 5 6 7 8 9 10 11
2 * x
## [1] 2 4 6 8 10 12 14 16 18 20
2 ^ x
## [1] 2 4 8 16 32 64 128 256 512 1024
x ^ 2
## [1] 1 4 9 16 25 36 49 64 81 100
sqrt(x)
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
## [9] 3.000000 3.162278
log(x)
## [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
## [8] 2.0794415 2.1972246 2.3025851
It’s like a calculator!
y <- 1:10
y
## [1] 1 2 3 4 5 6 7 8 9 10
x + y
## [1] 2 4 6 8 10 12 14 16 18 20
How about this:
y <- 1:11
x + y
## Warning in x + y: longer object length is not a multiple of shorter object
## length
## [1] 2 4 6 8 10 12 14 16 18 20 12
OK, the warning is self-explanatory. But what’s “12” at the end?
R recycles the shorter vector: once it runs out of elements in x, it starts again from the first one. So the last value is the sum of the first element of x, which is 1, and the last element of y, which is 11.
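Recycling also happens silently, without any warning, when the longer length is an exact multiple of the shorter one. A minimal sketch with two hypothetical vectors a and b:

```r
a <- 1:6
b <- c(10, 20)
# b is recycled three times: 10 20 10 20 10 20
recycled <- a + b
# The same result with the recycling written out explicitly
explicit <- a + rep(b, length.out = length(a))
recycled
explicit
```

This is also what happens in x + 1 above: the single value 1 is recycled to the length of x.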
2.9 Set operations
x <- c(1, 2, 1, 3, 1)
y <- c(1, 1, 3, 6, 6, 5)
You can use the intersect() function to get values in both x and y:
intersect(x, y)
## [1] 1 3
To get values in either x or y:
union(x, y)
## [1] 1 2 3 6 5
To get values in x but not in y:
setdiff(x, y)
## [1] 2
setdiff(y, x)
## [1] 6 5
To check whether each element of x is inside y:
is.element(x, y)
## [1] TRUE FALSE TRUE TRUE TRUE
# or
x %in% y
## [1] TRUE FALSE TRUE TRUE TRUE
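Note that intersect(), union(), and setdiff() treat their inputs as sets and drop duplicates. If the duplicates should be kept, %in% can be used for subsetting instead, as in this sketch (the result names are illustrative):

```r
x <- c(1, 2, 1, 3, 1)
y <- c(1, 1, 3, 6, 6, 5)
common_unique <- intersect(x, y)  # duplicates are dropped
common_all <- x[x %in% y]         # duplicates in x are kept
only_x <- x[!(x %in% y)]          # elements of x not found in y
common_unique
common_all
only_x
```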
2.10 Missing values
R uses NA to represent missing values, indicating that they are not available. In a data file, NA’s are very common and have to be dealt with properly. Why?
x <- c(1, NA, 2, NA, 3)
mean(x)
## [1] NA
sum(x)
## [1] NA
And it’s contagious:
y <- 1:5
x+y
## [1] 2 NA 5 NA 8
To deal with NA’s, we need to know how to find the indices of missing values:
# Do we have any NA?
anyNA(x)
## [1] TRUE
# Which ones?
is.na(x)
## [1] FALSE TRUE FALSE TRUE FALSE
# Or
which(is.na(x))
## [1] 2 4
How do we remove them? Before removing them, note the na.rm argument:
mean(x, na.rm = TRUE)
## [1] 2
So we may skip removing them from the data, as many functions have built-in arguments to deal with NA’s. To remove them explicitly:
x2 <- x[!is.na(x)]
x2
## [1] 1 2 3
# Or
x[complete.cases(x)]
## [1] 1 2 3
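Two related operations are counting the missing values and replacing them rather than dropping them. The sketch below fills the NA’s with the mean of the observed values; this is shown only to illustrate the indexing, not as a recommended way to handle missing data.

```r
x <- c(1, NA, 2, NA, 3)
# TRUE counts as 1, so the sum gives the number of missing values
n_missing <- sum(is.na(x))
# Replace missing values with the mean of the observed ones (illustration only)
x_filled <- x
x_filled[is.na(x_filled)] <- mean(x, na.rm = TRUE)
n_missing
x_filled
```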
2.11 Factors
Factors are R’s type for categorical variables (also called “indicator” variables). Factors are a very efficient way to store character values, because each unique character value is stored only once, and the data itself is stored as a vector of integers.
set.seed(123)
anim <- sample(animals, 100, replace = TRUE)
anim
## [1] "donkey" "donkey" "donkey" "cat" "donkey" "cat" "cat" "cat"
## [9] "donkey" "dog" "cat" "cat" "dog" "cat" "donkey" "dog"
## [17] "donkey" "donkey" "dog" "dog" "dog" "dog" "donkey" "cat"
## [25] "donkey" "cat" "dog" "cat" "donkey" "cat" "dog" "donkey"
## [33] "donkey" "dog" "donkey" "cat" "dog" "donkey" "dog" "dog"
## [41] "cat" "donkey" "donkey" "dog" "donkey" "dog" "donkey" "cat"
## [49] "dog" "cat" "dog" "dog" "donkey" "dog" "cat" "dog"
## [57] "dog" "donkey" "dog" "cat" "dog" "donkey" "dog" "donkey"
## [65] "cat" "donkey" "cat" "cat" "donkey" "cat" "cat" "donkey"
## [73] "donkey" "dog" "cat" "cat" "dog" "cat" "dog" "dog"
## [81] "cat" "donkey" "donkey" "dog" "cat" "dog" "cat" "dog"
## [89] "donkey" "donkey" "cat" "donkey" "dog" "cat" "cat" "donkey"
## [97] "cat" "dog" "donkey" "donkey"
table(anim)
## anim
##    cat    dog donkey
##     32     33     35
Let’s define the anim vector as a factor variable:
animf <- as.factor(anim)
levels(animf)
## [1] "cat" "dog" "donkey"
We can change the level labels:
data <- c(1,2,2,3,1,2,3,3,1,2,3,3,1)
fdata <- factor(data)
fdata
## [1] 1 2 2 3 1 2 3 3 1 2 3 3 1
## Levels: 1 2 3
rdata <- factor(data, labels=c("I","II","III"))
rdata
## [1] I II II III I II III III I II III III I
## Levels: I II III
#Or
levels(fdata) <- c('I','II','III')
fdata
## [1] I II II III I II III III I II III III I
## Levels: I II III
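The integer codes behind a factor, and the order of its levels, can be inspected and controlled as sketched below. The sizes vector is hypothetical; by default the levels are sorted alphabetically, and the levels argument sets a custom order.

```r
# Hypothetical character vector
sizes <- c("low", "high", "medium", "low", "high")
f_default <- factor(sizes)  # levels sorted alphabetically
f_custom <- factor(sizes, levels = c("low", "medium", "high"))
levels(f_default)
levels(f_custom)
codes <- as.integer(f_custom)  # the integer codes behind the labels
codes
table(f_custom)                # counts follow the level order
```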
The cut() function is used to convert a numeric variable into a factor. The breaks argument defines how ranges of numbers are converted to factor levels. Consider Lot_Area in the ames data.
library(RBootcamp)
Area <- cut(ames$Lot_Area, 3)
table(Area)
## Area
## (1.09e+03,7.26e+04] (7.26e+04,1.44e+05] (1.44e+05,2.15e+05]
##                2926                   1                   3
# More
Area <- cut(ames$Lot_Area, 3, labels=c('Small','Medium','Large'))
table(Area)
## Area
##  Small Medium  Large
##   2926      1      3
# Quantiles
# Note: the counts below add up to 2929, not 2930, because the minimum value
# falls outside the first interval and becomes NA unless include.lowest = TRUE
Area <- cut(ames$Lot_Area, quantile(ames$Lot_Area, prob = seq(0, 1, 0.25)))
table(Area)
## Area
##  (1.3e+03,7.44e+03] (7.44e+03,9.44e+03] (9.44e+03,1.16e+04] (1.16e+04,2.15e+05]
##                 732                 732                 732                 733
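The effect of the include.lowest argument of cut() can be seen on a small example. The vector v below is hypothetical and stands in for the ames data, which is not needed to run the sketch.

```r
# Hypothetical data, not the ames data
v <- c(1, 2, 3, 4, 5, 6, 7, 8)
q <- quantile(v, probs = seq(0, 1, 0.25))
# Intervals are open on the left, so the minimum gets no interval (NA)
without <- cut(v, q)
# include.lowest = TRUE closes the first interval on the left
with_lowest <- cut(v, q, include.lowest = TRUE)
n_na_without <- sum(is.na(without))
n_na_with <- sum(is.na(with_lowest))
n_na_without
n_na_with
table(with_lowest)
```

With include.lowest = TRUE every observation is assigned to one of the four quantile groups.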