Chapter 2 Vectors
Many operations in R make heavy use of vectors. Possibly the most common way to create a vector in R is using the c()
function, which is short for “combine.” As the name suggests, it combines a list of elements separated by commas.
c(1, 5, 0, -1)
## [1] 1 5 0 -1
If we would like to store this vector in a variable, we can do so with the assignment operator <- or =, but the convention is <-.
x <- c(1, 5, 0, -1)
z = c(1, 5, 0, -1)
x
## [1] 1 5 0 -1
z
## [1] 1 5 0 -1
Note that scalars do not exist in R. They are simply vectors of length 1.
y <- 24 # this is a vector with 1 element, 24
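A quick way to convince yourself of this is to ask R for the length of a single number. The sketch below only uses base R functions.

```r
y <- 24
length(y)      # number of elements
## [1] 1
is.vector(y)   # a single number is still a vector
## [1] TRUE
y[1]           # so it can be indexed like any other vector
## [1] 24
```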
2.1 One type, same type
Because vectors must contain elements that are all the same type, R will automatically coerce to a single type when attempting to create a vector that combines multiple types.
c(10, "Machine Learning", FALSE)
## [1] "10" "Machine Learning" "FALSE"
c(10, FALSE)
## [1] 10 0
c(10, TRUE)
## [1] 10 1
x <- c(10, "Machine Learning", FALSE)
str(x) #this tells us the structure of the object
## chr [1:3] "10" "Machine Learning" "FALSE"
class(x)
## [1] "character"
y <- c(10, FALSE)
str(y)
## num [1:2] 10 0
class(y)
## [1] "numeric"
We know that vectors hold values of a single type. If you combine values of different types into one vector, R converts all of them to the most flexible type present (logical, then numeric, then character). This is usually called the coercion rule.
m <- c(TRUE, 5, -2, FALSE)
m
## [1] 1 5 -2 0
class(m)
## [1] "numeric"
And,
m_2 <- c(8, "Joe", 21, "Mustang")
m_2
## [1] "8" "Joe" "21" "Mustang"
class(m_2)
## [1] "character"
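To see the coercion order one step at a time, we can mix two types at once and ask for the storage type with typeof(). This is a small sketch. Note that typeof() distinguishes "integer" from "double", while class() reports "numeric" for doubles.

```r
t1 <- typeof(c(TRUE, 2L))     # logical + integer
t1
## [1] "integer"
t2 <- typeof(c(2L, 3.5))      # integer + double
t2
## [1] "double"
t3 <- typeof(c(3.5, "a"))     # double + character
t3
## [1] "character"
```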
You can also convert vectors manually:
n <- c(8, 3, 21, 2)
n
## [1] 8 3 21 2
nc <- as.character(n)
nc
## [1] "8" "3" "21" "2"
n <- as.numeric(nc)
n
## [1] 8 3 21 2
And be careful:
m_2 <- c(8, "Joe", 21, "Mustang")
as.numeric(m_2)
## Warning: NAs introduced by coercion
## [1] 8 NA 21 NA
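If you want to know which elements could not be converted, one possible approach is to compare the NA's before and after the conversion. The names num and bad below are made up for this sketch.

```r
m_2 <- c(8, "Joe", 21, "Mustang")
# suppressWarnings() silences the coercion warning we already expect
num <- suppressWarnings(as.numeric(m_2))
bad <- is.na(num) & !is.na(m_2)   # TRUE where the conversion failed
m_2[bad]
## [1] "Joe"     "Mustang"
```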
2.2 Patterns
If you want to create a vector based on a sequence of numbers, you can do it easily with the : operator, which creates a sequence of integers between two specified integers.
y <- c(1:15)
y
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
#or
y <- 1:15
y
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
If you want to create a vector based on a specific sequence of numbers, increasing or decreasing, you can use seq():
y <- seq(from = 1.5, to = 13, by = 0.9) #increasing
y
## [1] 1.5 2.4 3.3 4.2 5.1 6.0 6.9 7.8 8.7 9.6 10.5 11.4 12.3
y <- seq(1.5, -13, -0.9) #decreasing. Note that you can omit the argument names
y
## [1] 1.5 0.6 -0.3 -1.2 -2.1 -3.0 -3.9 -4.8 -5.7 -6.6 -7.5 -8.4
## [13] -9.3 -10.2 -11.1 -12.0 -12.9
Another useful tool is rep(), which repeats its input:
rep("ML", times = 10)
## [1] "ML" "ML" "ML" "ML" "ML" "ML" "ML" "ML" "ML" "ML"
#or
x <- c(1, 5, 0, -1)
rep(x, times = 2)
## [1] 1 5 0 -1 1 5 0 -1
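Besides times, rep() also has an each argument, and times itself can be a vector. A short sketch of both:

```r
x <- c(1, 5, 0, -1)
r1 <- rep(x, each = 2)                    # repeat each element before moving on
r1
## [1]  1  1  5  5  0  0 -1 -1
r2 <- rep(c("a", "b"), times = c(3, 1))   # a separate count per element
r2
## [1] "a" "a" "a" "b"
```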
And we can use them together as follows:
wow <- c(x, rep(seq(1, 9, 2), 3), c(1, 2, 3), 42, 2:4)
wow
## [1] 1 5 0 -1 1 3 5 7 9 1 3 5 7 9 1 3 5 7 9 1 2 3 42 2 3
## [26] 4
seq() can also create a given number of equally spaced values:
g <- seq(6, 60, length = 4)
g
## [1] 6 24 42 60
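The full name of this argument is length.out. R accepts length because it matches argument names partially. The sketch below checks that both spellings give the same result and shows seq_len(), a related base R helper.

```r
g  <- seq(6, 60, length = 4)       # partial name, as in the text
g2 <- seq(6, 60, length.out = 4)   # full argument name
identical(g, g2)
## [1] TRUE
seq_len(4)                         # 1, 2, ..., n
## [1] 1 2 3 4
```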
The unique() function returns the distinct values of a vector, in order of first appearance:
unique(wow)
## [1] 1 5 0 -1 3 7 9 2 42 4
2.3 Attributes
We can calculate the number of elements in a vector:
length(wow)
## [1] 26
There is a set of functions starting with is., for example is.numeric(), which checks whether a vector is of numeric type:
is.numeric(g)
## [1] TRUE
is.character(g)
## [1] FALSE
In addition to storing the values of a vector, you can also create named vectors.
x <- c(165, 60, 22)
x
## [1] 165 60 22
x_n <- c(height = 125, weight = 56, BMI = 21)
x_n
## height weight BMI
## 125 56 21
And,
attributes(x_n)
## $names
## [1] "height" "weight" "BMI"
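Names are useful mainly because we can subset with them. Here is a minimal sketch using the same x_n. The renaming in the last line is only for illustration.

```r
x_n <- c(height = 125, weight = 56, BMI = 21)
names(x_n)
## [1] "height" "weight" "BMI"
w1 <- x_n["weight"]      # subset by name; the name is kept
w2 <- x_n[["weight"]]    # [[ ]] drops the name and returns the bare value
w2
## [1] 56
names(x_n)[3] <- "bmi"   # names can be modified in place
```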
2.4 Character operators
animals <- c("dog", "cat", "donkey")
nchar(animals)
## [1] 3 3 6
We can concatenate several strings into a single string.
wrong <- c("we have", "dogs", "cats", "and, donkey")
wrong
## [1] "we have" "dogs" "cats" "and, donkey"
right <- paste("we have ", "dogs, ", "cats, ", "and, donkey")
right
## [1] "we have dogs, cats, and, donkey"
You can also check paste0(), which concatenates without inserting a separator.
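The difference between paste() and paste0() is the separator. paste() puts sep (a single space by default) between its arguments, while paste0() uses no separator. The collapse argument joins the elements of one vector into a single string. A small sketch with made-up strings:

```r
p1 <- paste("we have", "dogs", sep = " ")   # sep is put between arguments
p1
## [1] "we have dogs"
p2 <- paste0("dog", "s")                    # same as sep = ""
p2
## [1] "dogs"
p3 <- paste(c("dogs", "cats", "donkeys"), collapse = ", ")  # joins the elements
p3
## [1] "dogs, cats, donkeys"
```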
hah <- toupper(right)
hah
## [1] "WE HAVE DOGS, CATS, AND, DONKEY"
haha <- tolower(hah)
haha
## [1] "we have dogs, cats, and, donkey"
2.5 Sort, rank, and order
x <- c(2, 3, 2, 0, 4, 7)
x
## [1] 2 3 2 0 4 7
By default, the sort() function sorts the elements of a vector in increasing order.
sort(x)
## [1] 0 2 2 3 4 7
sort(x, decreasing = TRUE)
## [1] 7 4 3 2 2 0
The rank() function gives the rank of each element, that is, its position in ascending order.
rank(x)
## [1] 2.5 4.0 2.5 1.0 5.0 6.0
You can see that the smallest value of x is 0, which corresponds to the fourth element. Thus, the fourth element has rank 1. The two 2’s are tied for ranks 2 and 3, so each of them gets the average, 2.5.
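How ties are handled is controlled by the ties.method argument of rank(). The sketch below compares three of its options on the same x.

```r
x <- c(2, 3, 2, 0, 4, 7)
rank(x)                                  # default ties.method = "average"
## [1] 2.5 4.0 2.5 1.0 5.0 6.0
r_min <- rank(x, ties.method = "min")    # tied values share the lowest rank
r_min
## [1] 2 4 2 1 5 6
r_first <- rank(x, ties.method = "first")  # ties broken by position
r_first
## [1] 2 4 3 1 5 6
```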
As for the order() function, it is confusing and a very different function from sort().
order(x)
## [1] 4 1 3 2 5 6
We can see that the order() function returns the indices of the elements in ascending order: the smallest value is at position 4, the next smallest at position 1, and so on.
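One way to make sense of order() is to use its result as an index: subsetting x by order(x) reproduces sort(x). The same index can arrange a second vector by the values of x. The vector id below is made up for this sketch.

```r
x <- c(2, 3, 2, 0, 4, 7)
identical(x[order(x)], sort(x))
## [1] TRUE
# arrange another vector by the values of x
id <- c("a", "b", "c", "d", "e", "f")
by_x <- id[order(x)]
by_x
## [1] "d" "a" "c" "b" "e" "f"
```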
2.6 Simple descriptive measures
Let’s have a numeric vector:
h <- c(x, rep(seq(1, 9, 2), 3), c(1, 2, 3), 42, 2:4)
h
## [1] 2 3 2 0 4 7 1 3 5 7 9 1 3 5 7 9 1 3 5 7 9 1 2 3 42
## [26] 2 3 4
summary(h)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 3.000 5.357 7.000 42.000
The set of statistical measures:
min(h)
## [1] 0
max(h)
## [1] 42
mean(h)
## [1] 5.357143
median(h)
## [1] 3
sd(h)
## [1] 7.650777
range(h)
## [1] 0 42
sum(h)
## [1] 150
cumsum(h)
## [1] 2 5 7 7 11 18 19 22 27 34 43 44 47 52 59 68 69 72 77
## [20] 84 93 94 96 99 141 143 146 150
prod(h)
## [1] 0
quantile(h)
## 0% 25% 50% 75% 100%
## 0 2 3 7 42
IQR(h) #interquartile range
## [1] 5
We can also find the index of the maximum and minimum values:
which.max(h)
## [1] 25
which.min(h)
## [1] 4
2.7 Subsetting Vectors
One of the most confusing subjects in R is subsetting data containers. It’s an important part of data management, and if it is done in 2 steps, the whole operation becomes quite easy:
- Identifying the index of the element that satisfies the required condition,
- Calling the index to subset the vector.
But before we start, let’s see a simple subsetting. (Note the square brackets.)
#Suppose we have the following vector
myvector <- c(1, 2, 3, 4, 5, 8, 4, 10, 12)
#I can call each element with its index number:
myvector[c(1,6)]
## [1] 1 8
myvector[4:7]
## [1] 4 5 8 4
myvector[-6]
## [1] 1 2 3 4 5 4 10 12
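A few more index patterns follow the same logic. This is only a sketch on the same myvector.

```r
myvector <- c(1, 2, 3, 4, 5, 8, 4, 10, 12)
myvector[length(myvector)]   # last element
## [1] 12
myvector[-c(1, 2)]           # drop the first two elements
## [1]  3  4  5  8  4 10 12
myvector[c(2, 2, 1)]         # indices can repeat and come in any order
## [1] 2 2 1
```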
Okay, let’s see commonly used operators for doing comparisons:
x <- 3
x < 2 #less
## [1] FALSE
x <= 2 #less or equal to
## [1] FALSE
x > 1 #bigger
## [1] TRUE
x >= 1 #bigger or equal to
## [1] TRUE
x == 3 #equal to
## [1] TRUE
#x = 3 #Note that this is an assignment operator
x != 3 #not equal to
## [1] FALSE
#Let's look at this vector
myvector <- c(1, 2, 3, 4, 5, 8, 4, 10, 12)
#We want to subset only those less than 5
#Step 1: use a logical operator to identify the elements
#meeting the condition.
logi <- myvector < 5
logi
## [1] TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE FALSE
#logi is a logical vector
class(logi)
## [1] "logical"
#Step 2: use it for subsetting
newvector <- myvector[logi==TRUE]
newvector
## [1] 1 2 3 4 4
Or better:
newvector <- myvector[logi]
newvector
## [1] 1 2 3 4 4
This is useful because it shows the 2 steps explicitly. We can also combine them as follows:
newvector <- myvector[myvector < 5]
newvector
## [1] 1 2 3 4 4
Another way to do this is to use which(), which gives us the index of each element that satisfies the condition.
ind <- which(myvector < 5) # Step 1
ind
## [1] 1 2 3 4 7
newvector <- myvector[ind] # Step 2
newvector
## [1] 1 2 3 4 4
Or we can combine these 2 steps:
newvector <- myvector[which(myvector < 5)]
newvector
## [1] 1 2 3 4 4
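The two approaches give the same result here, but they differ when the vector contains missing values. A comparison with NA yields NA, which logical subsetting keeps, while which() returns only the positions that are TRUE. A sketch with a made-up vector v:

```r
v <- c(1, NA, 7, 3)          # a made-up vector with a missing value
by_logical <- v[v < 5]       # the NA comparison yields NA, which is kept
by_logical
## [1]  1 NA  3
by_which <- v[which(v < 5)]  # which() returns only the TRUE positions
by_which
## [1] 1 3
```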
Last one: find the 4’s in myvector and make them 8 (I know it is hard, but after a couple of tries it will seem easier):
myvector <- c(1, 2, 3, 4, 5, 8, 4, 10, 12)
#I'll show you 3 ways to do that.
#1st way to show the steps
ind <- which(myvector==4) #identifying the index with 4
newvector <- myvector[ind] + 4 # adding 4 to them
myvector[ind] <- newvector #replacing those with the new values
myvector
## [1] 1 2 3 8 5 8 8 10 12
#2nd and easier way
myvector[which(myvector==4)] <- myvector[which(myvector==4)] + 4
myvector
## [1] 1 2 3 8 5 8 8 10 12
#3rd and easiest way
myvector[myvector==4] <- myvector[myvector==4] + 4
myvector
## [1] 1 2 3 8 5 8 8 10 12
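Since we know the new value, we could also assign it directly, or use ifelse() to build a new vector and keep the original intact. This is just an alternative sketch. The names original and changed are made up here.

```r
myvector <- c(1, 2, 3, 4, 5, 8, 4, 10, 12)
myvector[myvector == 4] <- 8                    # assign the new value directly
# or build a new vector and leave the original untouched
original <- c(1, 2, 3, 4, 5, 8, 4, 10, 12)
changed <- ifelse(original == 4, 8, original)
changed
## [1]  1  2  3  8  5  8  8 10 12
```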
What happens if the vector is a character vector? How can we subset it? We can use grep() as shown below:
m <- c("about", "aboard", "board", "bus", "cat", "abandon")
#Now suppose that we need to pick the elements that contain "ab"
#Same steps again
a <- grep("ab", m) #similar to which() that gives us index numbers
a
## [1] 1 2 6
newvector <- m[a]
newvector
## [1] "about" "aboard" "abandon"
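grep() has a logical counterpart, grepl(), and a value argument that returns the matching elements themselves. The sketch below uses the same m. The pattern "^b" is an extra example in which ^ anchors the match at the start of the string.

```r
m <- c("about", "aboard", "board", "bus", "cat", "abandon")
grepl("ab", m)                        # logical vector, one value per element
## [1]  TRUE  TRUE FALSE FALSE FALSE  TRUE
hits <- grep("ab", m, value = TRUE)   # return the matching elements directly
hits
## [1] "about"   "aboard"  "abandon"
starts_b <- m[grepl("^b", m)]         # "^" anchors the pattern at the start
starts_b
## [1] "board" "bus"
```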
2.8 Vectorization or vector operations
One of the biggest strengths of R is its use of vectorized operations. Let’s see it in action!
x <- 1:10
x
## [1] 1 2 3 4 5 6 7 8 9 10
x + 1
## [1] 2 3 4 5 6 7 8 9 10 11
2 * x
## [1] 2 4 6 8 10 12 14 16 18 20
2 ^ x
## [1] 2 4 8 16 32 64 128 256 512 1024
x ^ 2
## [1] 1 4 9 16 25 36 49 64 81 100
sqrt(x)
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
## [9] 3.000000 3.162278
log(x)
## [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
## [8] 2.0794415 2.1972246 2.3025851
It’s like a calculator!
y <- 1:10
y
## [1] 1 2 3 4 5 6 7 8 9 10
x + y
## [1] 2 4 6 8 10 12 14 16 18 20
How about this:
y <- 1:11
x + y
## Warning in x + y: longer object length is not a multiple of shorter object
## length
## [1] 2 4 6 8 10 12 14 16 18 20 12
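OK, the warning is self-explanatory. But what’s “12” at the end?
R recycles the shorter vector, so after the 10 elements of x are used up it starts again from the beginning. The “12” is the sum of the first element of x, which is 1, and the last element of y, which is 11.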
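Recycling happens silently when the longer length is a multiple of the shorter one. This is also why x + 1 and 2 * x work: the single number is recycled for every element. A short sketch:

```r
s1 <- 1:6 + c(10, 20)   # c(10, 20) is reused three times, no warning
s1
## [1] 11 22 13 24 15 26
s2 <- 1:6 * 2           # a length-1 vector is recycled for every element
s2
## [1]  2  4  6  8 10 12
```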
2.9 Set operations
x <- c(1, 2, 1, 3, 1)
y <- c(1, 1, 3, 6, 6, 5)
You can use the intersect() function to get values in both x and y:
intersect(x, y)
## [1] 1 3
To get values in either x or y:
union(x, y)
## [1] 1 2 3 6 5
To get values in x but not in y:
setdiff(x, y)
## [1] 2
setdiff(y, x)
## [1] 6 5
To check whether each element of x is inside y:
is.element(x, y)
## [1] TRUE FALSE TRUE TRUE TRUE
# or
x %in% y
## [1] TRUE FALSE TRUE TRUE TRUE
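Note that the set functions treat their inputs as sets, so duplicates are dropped. If you need every matching element of x, including repeats, you can subset with %in% instead. A sketch on the same x and y:

```r
x <- c(1, 2, 1, 3, 1)
y <- c(1, 1, 3, 6, 6, 5)
intersect(x, y)      # duplicates are dropped
## [1] 1 3
kept <- x[x %in% y]  # logical subsetting keeps every matching element
kept
## [1] 1 1 3 1
```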
2.10 Missing values
R uses NA to represent missing values, indicating they are not available. In a data file, NA’s are very common and have to be dealt with properly. Why?
x <- c(1, NA, 2, NA, 3)
mean(x)
## [1] NA
sum(x)
## [1] NA
And it’s contagious:
y <- 1:5
x + y
## [1] 2 NA 5 NA 8
To deal with NA’s, we need to know how to find indices with missing values:
# Do we have any NA?
anyNA(x)
## [1] TRUE
# Which ones?
is.na(x)
## [1] FALSE TRUE FALSE TRUE FALSE
# Or
which(is.na(x))
## [1] 2 4
How do we remove them? Before removing them, note that many functions can skip NA’s on their own:
mean(x, na.rm = TRUE)
## [1] 2
So we may not need to remove them from the data, as many functions have built-in arguments to deal with NA’s. If we still want to remove them:
x2 <- x[!is.na(x)]
x2
## [1] 1 2 3
# Or
x[complete.cases(x)]
## [1] 1 2 3
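The same is.na() index can also be used on the left side of an assignment, for example to count the NA’s and then fill them in. Filling with the mean is only one possible choice, used here as an illustration, and x_filled is a made-up name.

```r
x <- c(1, NA, 2, NA, 3)
sum(is.na(x))                                        # how many NA's?
## [1] 2
x_filled <- x                                        # keep the original intact
x_filled[is.na(x_filled)] <- mean(x, na.rm = TRUE)   # fill gaps with the mean
x_filled
## [1] 1 2 2 2 3
```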
2.11 Factors
A factor represents a categorical variable, sometimes called an “indicator” variable. Factors are a very efficient way to store character values, because each unique character value is stored only once, and the data itself is stored as a vector of integers.
set.seed(123)
anim <- sample(animals, 100, replace = TRUE)
anim
## [1] "donkey" "donkey" "donkey" "cat" "donkey" "cat" "cat" "cat"
## [9] "donkey" "dog" "cat" "cat" "dog" "cat" "donkey" "dog"
## [17] "donkey" "donkey" "dog" "dog" "dog" "dog" "donkey" "cat"
## [25] "donkey" "cat" "dog" "cat" "donkey" "cat" "dog" "donkey"
## [33] "donkey" "dog" "donkey" "cat" "dog" "donkey" "dog" "dog"
## [41] "cat" "donkey" "donkey" "dog" "donkey" "dog" "donkey" "cat"
## [49] "dog" "cat" "dog" "dog" "donkey" "dog" "cat" "dog"
## [57] "dog" "donkey" "dog" "cat" "dog" "donkey" "dog" "donkey"
## [65] "cat" "donkey" "cat" "cat" "donkey" "cat" "cat" "donkey"
## [73] "donkey" "dog" "cat" "cat" "dog" "cat" "dog" "dog"
## [81] "cat" "donkey" "donkey" "dog" "cat" "dog" "cat" "dog"
## [89] "donkey" "donkey" "cat" "donkey" "dog" "cat" "cat" "donkey"
## [97] "cat" "dog" "donkey" "donkey"
table(anim)
## anim
## cat dog donkey
## 32 33 35
Let’s define the anim vector as a factor variable:
animf <- as.factor(anim)
levels(animf)
## [1] "cat" "dog" "donkey"
We can change the levels:
data = c(1,2,2,3,1,2,3,3,1,2,3,3,1)
fdata = factor(data)
fdata
## [1] 1 2 2 3 1 2 3 3 1 2 3 3 1
## Levels: 1 2 3
rdata = factor(data,labels=c("I","II","III"))
rdata
## [1] I II II III I II III III I II III III I
## Levels: I II III
#Or
levels(fdata) = c('I','II','III')
fdata
## [1] I II II III I II III III I II III III I
## Levels: I II III
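Because a factor is stored as integer codes, converting it straight to numbers returns the codes, not the original values. A common workaround is to go through character first. The sketch below uses a made-up factor f.

```r
f <- factor(c("10", "5", "20"))   # made-up values stored as a factor
levels(f)                         # levels are sorted as text
## [1] "10" "20" "5"
as.integer(f)                     # the underlying integer codes
## [1] 1 3 2
as.numeric(as.character(f))       # go through character to recover the numbers
## [1] 10  5 20
```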
The cut function is used to convert a numeric variable into a factor. The breaks= argument is used to define how ranges of numbers will be converted to factor values. Consider Lot_Area in the ames data.
library(RBootcamp)
Area = cut(ames$Lot_Area,3)
table(Area)
## Area
## (1.09e+03,7.26e+04] (7.26e+04,1.44e+05] (1.44e+05,2.15e+05]
## 2926 1 3
# More
Area = cut(ames$Lot_Area,3, labels=c('Small','Medium','Large'))
table(Area)
## Area
## Small Medium Large
## 2926 1 3
# Quantiles
Area = cut(ames$Lot_Area, quantile(ames$Lot_Area,prob = seq(0, 1, 0.25)))
table(Area)
## Area
## (1.3e+03,7.44e+03] (7.44e+03,9.44e+03] (9.44e+03,1.16e+04] (1.16e+04,2.15e+05]
## 732 732 732 733
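One detail worth noting: the four quantile bins above add up to 2929, one fewer than the 2930 observations counted by the equal-width bins. A likely reason is that cut() builds intervals that are open on the left by default, so the minimum value falls outside the first bin and becomes NA. The sketch below shows this behavior on a small made-up vector v (not the ames data) and how include.lowest = TRUE changes it.

```r
v <- 1:8                                    # made-up data for illustration
breaks <- quantile(v, probs = seq(0, 1, 0.25))
n_missing <- sum(is.na(cut(v, breaks)))     # the minimum, 1, is not in (1, 2.75]
n_missing
## [1] 1
binned <- cut(v, breaks, include.lowest = TRUE)  # first bin becomes [1, 2.75]
sum(is.na(binned))
## [1] 0
table(binned)                               # two values in each of the four bins
```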