Chapter 2 Vectors

Many operations in R make heavy use of vectors. Possibly the most common way to create a vector in R is the c() function, which is short for “combine.” As the name suggests, it combines its comma-separated arguments into a single vector.

c(1, 5, 0, -1)

## [1]  1  5  0 -1

If we would like to store this vector in a variable, we can do so with the assignment operator <- or =. Both work here, but the convention is <-.

x <- c(1, 5, 0, -1)
z = c(1, 5, 0, -1)
x
z

## [1]  1  5  0 -1

## [1]  1  5  0 -1

Note that scalars do not exist in R. They are simply vectors of length 1.

y <- 24  #this is a vector with 1 element, 24

2.1 One type, same type

Because vectors must contain elements that are all the same type, R will automatically coerce to a single type when attempting to create a vector that combines multiple types.

c(10, "Machine Learning", FALSE)

## [1] "10"               "Machine Learning" "FALSE"

c(10, FALSE)

## [1] 10  0

c(10, TRUE)

## [1] 10  1

x <- c(10, "Machine Learning", FALSE) 
str(x) #this tells us the structure of the object

##  chr [1:3] "10" "Machine Learning" "FALSE"

class(x)

## [1] "character"

y <- c(10, FALSE)
str(y)

##  num [1:2] 10 0

class(y)

## [1] "numeric"

Vectors hold values of a single type. When you combine values of different types into one vector, R converts all of them to the most flexible type present (logical, then numeric, then character). This is called coercion.

m <- c(TRUE, 5, -2, FALSE)
m

## [1]  1  5 -2  0

class(m)

## [1] "numeric"

And,

m_2 <- c(8, "Joe", 21, "Mustang")
m_2

## [1] "8"       "Joe"     "21"      "Mustang"

class(m_2)

## [1] "character"
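The ordering behind the coercion rule can be checked with typeof(), which reports the storage type of a vector. The following is only a small illustrative sketch; the variable names are made up for this example.

```r
# typeof() shows the storage type that c() settled on after coercion
t_lgl_int <- typeof(c(TRUE, 2L))    # logical + integer   -> "integer"
t_int_dbl <- typeof(c(2L, 3.5))     # integer + double    -> "double"
t_dbl_chr <- typeof(c(3.5, "a"))    # double  + character -> "character"

c(t_lgl_int, t_int_dbl, t_dbl_chr)
```

Each step moves to a type that can still represent everything in the less flexible one, which is why character always wins.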

You can also convert a vector manually with the as.*() functions:

n <- c(8, 3, 21, 2)
n

## [1]  8  3 21  2

nc <- as.character(n)
nc

## [1] "8"  "3"  "21" "2"

n <- as.numeric(nc)
n

## [1]  8  3 21  2

And be careful:

m_2 <- c(8, "Joe", 21, "Mustang")
as.numeric(m_2)

## Warning: NAs introduced by coercion

## [1]  8 NA 21 NA
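One way to handle this situation, sketched below with a made-up vector called raw, is to convert first and then use the resulting NAs to see which elements could not be converted. suppressWarnings() is used only because we already expect the warning.

```r
# Hypothetical input that mixes numbers and text
raw <- c("8", "Joe", "21", "Mustang")

# Convert; we expect the coercion warning, so we silence it here
num <- suppressWarnings(as.numeric(raw))

# The elements that failed to convert are exactly the NAs in the result
bad <- raw[is.na(num)]
bad   # "Joe" "Mustang"
```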

2.2 Patterns

If you want to create a vector based on a sequence of numbers, you can do it easily with the : operator, which creates a sequence of integers between two specified integers.

y <- c(1:15)
y

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

#or
y <- 1:15
y

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

If you want a sequence with a specific step size, increasing or decreasing, you can use seq():

y <- seq(from = 1.5, to = 13, by = 0.9) #increasing
y

##  [1]  1.5  2.4  3.3  4.2  5.1  6.0  6.9  7.8  8.7  9.6 10.5 11.4 12.3

y <- seq(1.5, -13, -0.9) #decreasing. Note that you can omit the argument names when the arguments are given in order
y

##  [1]   1.5   0.6  -0.3  -1.2  -2.1  -3.0  -3.9  -4.8  -5.7  -6.6  -7.5  -8.4
## [13]  -9.3 -10.2 -11.1 -12.0 -12.9

Another useful tool is rep(), which repeats a value or a whole vector:

rep("ML", times = 10)

##  [1] "ML" "ML" "ML" "ML" "ML" "ML" "ML" "ML" "ML" "ML"

#or

x <- c(1, 5, 0, -1)
rep(x, times = 2)

## [1]  1  5  0 -1  1  5  0 -1
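rep() also has an each argument, which repeats every element in place instead of repeating the whole vector. A minimal sketch with a made-up vector v:

```r
v <- c(1, 5)

r_times <- rep(v, times = 2)   # whole vector repeated: 1 5 1 5
r_each  <- rep(v, each = 2)    # each element repeated: 1 1 5 5

r_times
r_each
```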

These tools can be combined inside c():

wow <- c(x, rep(seq(1, 9, 2), 3), c(1, 2, 3), 42, 2:4)
wow

##  [1]  1  5  0 -1  1  3  5  7  9  1  3  5  7  9  1  3  5  7  9  1  2  3 42  2  3
## [26]  4

seq() can also create a given number of equally spaced points with the length.out argument:

g <- seq(6, 60, length.out = 4)
g

## [1]  6 24 42 60
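The step that seq() picks here can be worked out by hand: with n points there are n - 1 gaps, so the step is (to - from) / (n - 1). The sketch below uses illustrative variable names to check this.

```r
from <- 6
to   <- 60
n    <- 4

# n points leave n - 1 gaps between them
step <- (to - from) / (n - 1)   # (60 - 6) / 3 = 18

g_by <- seq(from, to, by = step)
all.equal(g_by, seq(from, to, length.out = n))   # TRUE
```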

The unique() function returns the distinct values of a vector, in order of first appearance:

unique(wow)

##  [1]  1  5  0 -1  3  7  9  2 42  4

2.3 Attributes

We can calculate the number of elements in a vector:

length(wow)

## [1] 26

There is a set of functions whose names start with is., such as is.numeric(), which checks whether a vector is of numeric type:

is.numeric(g)

## [1] TRUE

is.character(g)

## [1] FALSE

In addition to storing the values of a vector, you can also create named vectors.

x <- c(165, 60, 22)
x

## [1] 165  60  22

x_n <- c(height = 125, weight = 56, BMI = 21)
x_n

## height weight    BMI 
##    125     56     21

The names are stored as an attribute of the vector:

attributes(x_n)

## $names
## [1] "height" "weight" "BMI"
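Names can also be used to pick elements and can be changed with names(). Here is a short sketch using a made-up vector called person:

```r
person <- c(height = 125, weight = 56, BMI = 21)

# Subset by name instead of by position
w <- person["weight"]
w_value <- unname(w)        # drop the name, keep the value: 56

# Rename the third element
names(person)[3] <- "bmi"
names(person)               # "height" "weight" "bmi"
```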

2.4 Character operators

animals <- c("dog", "cat", "donkey")
nchar(animals)

## [1] 3 3 6

We can concatenate several strings into a single string with paste(). Note that c() only collects the strings into a vector; it does not join them.

wrong <- c("we have", "dogs", "cats", "and, donkey")
wrong

## [1] "we have"     "dogs"        "cats"        "and, donkey"

right <- paste("we have ", "dogs, ", "cats, ", "and, donkey")
right

## [1] "we have  dogs,  cats,  and, donkey"

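The double spaces appear because paste() inserts a space between its arguments by default. See also paste0(), which joins strings without a separator.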
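As a quick sketch of the options (the strings and variable names here are only for illustration), sep controls what goes between the arguments, and collapse joins the elements of a single vector:

```r
p_default <- paste("a", "b")              # "a b"  (default sep = " ")
p_none    <- paste0("a", "b")             # "ab"   (no separator)
p_sep     <- paste("a", "b", sep = "-")   # "a-b"

# collapse joins the elements of one vector into a single string
pets <- c("dogs", "cats", "donkeys")
p_collapse <- paste(pets, collapse = ", ")   # "dogs, cats, donkeys"
p_collapse
```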

hah <- toupper(right)
hah

## [1] "WE HAVE  DOGS,  CATS,  AND, DONKEY"

haha <- tolower(hah)
haha

## [1] "we have  dogs,  cats,  and, donkey"

2.5 Sort, rank, and order

x <- c(2, 3, 2, 0, 4, 7) 
x

## [1] 2 3 2 0 4 7

By default, the sort() function sorts the elements of a vector in increasing order.

sort(x)

## [1] 0 2 2 3 4 7

sort(x, decreasing = TRUE)

## [1] 7 4 3 2 2 0

The rank() function gives the rank of each element in ascending order. Tied values share the average of their ranks, which is why the two 2's below both get 2.5.

rank(x)

## [1] 2.5 4.0 2.5 1.0 5.0 6.0

You can see that the smallest value of x is 0, which corresponds to the fourth element. Thus, the fourth element has rank 1.

The order() function is easy to confuse with sort(), but it does something quite different.

order(x)

## [1] 4 1 3 2 5 6

We can see that the order() function returns the indices of the elements in ascending order of their values: the smallest value is at position 4, the next at position 1, and so on.
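The link between the two functions is that indexing a vector by its own order() gives the sorted vector. The same indices can reorder a second, parallel vector, which sort() cannot do. A small sketch with made-up names v and lab:

```r
v   <- c(2, 3, 2, 0, 4, 7)
lab <- c("a", "b", "c", "d", "e", "f")   # hypothetical labels, one per value

idx <- order(v)              # 4 1 3 2 5 6

identical(v[idx], sort(v))   # TRUE: indexing by order() sorts the vector
lab[idx]                     # "d" "a" "c" "b" "e" "f"
```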

2.6 Simple descriptive measures

Let’s have a numeric vector:

h <- c(x, rep(seq(1, 9, 2), 3), c(1, 2, 3), 42, 2:4)
h

##  [1]  2  3  2  0  4  7  1  3  5  7  9  1  3  5  7  9  1  3  5  7  9  1  2  3 42
## [26]  2  3  4

summary(h)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   3.000   5.357   7.000  42.000

Each of these statistics, and several others, is also available as a separate function:

min(h)

## [1] 0

max(h)

## [1] 42

mean(h)

## [1] 5.357143

median(h)

## [1] 3

sd(h)

## [1] 7.650777

range(h)

## [1]  0 42

sum(h)

## [1] 150

cumsum(h)

##  [1]   2   5   7   7  11  18  19  22  27  34  43  44  47  52  59  68  69  72  77
## [20]  84  93  94  96  99 141 143 146 150

prod(h)

## [1] 0

quantile(h)

##   0%  25%  50%  75% 100% 
##    0    2    3    7   42

IQR(h) #interquartile range

## [1] 5

We can also find the index of the maximum and minimum values:

which.max(h)

## [1] 25

which.min(h)

## [1] 4
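To see what two of these measures compute, here is a sketch that reproduces sd() and IQR() by hand on a small made-up vector v. sd() uses the sample formula with n - 1 in the denominator, and IQR() is the distance between the 75% and 25% quantiles.

```r
v <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(v)

# Sample standard deviation: squared deviations divided by n - 1
sd_manual <- sqrt(sum((v - mean(v))^2) / (n - 1))
all.equal(sd_manual, sd(v))      # TRUE

# Interquartile range: third quartile minus first quartile
q <- quantile(v, c(0.25, 0.75))
iqr_manual <- unname(diff(q))
all.equal(iqr_manual, IQR(v))    # TRUE
```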

2.7 Subsetting Vectors

Subsetting data containers is one of the most confusing subjects in R. It is an important part of data management, and it becomes quite easy when done in 2 steps:

1. Identify the index of the elements that satisfy the required condition.
2. Use that index to subset the vector.

But before we start, let's see some simple subsetting. (Note the square brackets.)

#Suppose we have the following vector
myvector <- c(1, 2, 3, 4, 5, 8, 4, 10, 12)

#I can call each element with its index number:
myvector[c(1,6)]

## [1] 1 8

myvector[4:7]

## [1] 4 5 8 4

myvector[-6]

## [1]  1  2  3  4  5  4 10 12
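Negative indices also work with several positions at once. The sketch below uses a copy of the vector named v; note the parentheses in -(1:3), since -1:3 would mean the sequence from -1 to 3.

```r
v <- c(1, 2, 3, 4, 5, 8, 4, 10, 12)

drop_two   <- v[-c(1, 6)]   # drop the 1st and 6th: 2 3 4 5 4 10 12
drop_first <- v[-(1:3)]     # drop the first three: 4 5 8 4 10 12

drop_two
drop_first
```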

Okay, let’s see commonly used operators for doing comparisons:

x <- 3

x < 2      #less

## [1] FALSE

x <= 2     #less or equal to

## [1] FALSE

x > 1      #greater than

## [1] TRUE

x >= 1     #greater than or equal to

## [1] TRUE

x == 3     #equal to

## [1] TRUE

#x = 3     #Note that this is an assignment, not a comparison
x != 3     #not equal to

## [1] FALSE

#Let's look at this vector
myvector <- c(1, 2, 3, 4, 5, 8, 4, 10, 12)

#We want to subset only those less than 5

#Step 1: use a logical operator to identify the elements
#meeting the condition.
logi <- myvector < 5 
logi

## [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE

#logi is a logical vector
class(logi)

## [1] "logical"

#Step 2: use it for subsetting
newvector <- myvector[logi==TRUE]
newvector

## [1] 1 2 3 4 4

or better:

newvector <- myvector[logi]
newvector

## [1] 1 2 3 4 4

This version is useful because it shows the 2 steps. We can also combine them into one line:

newvector <- myvector[myvector < 5]
newvector

## [1] 1 2 3 4 4

Another way to do this is to use which(), which gives us the index of each element that satisfies the condition.

ind <- which(myvector < 5)  # Step 1
ind

## [1] 1 2 3 4 7

newvector <- myvector[ind]  # Step 2
newvector

## [1] 1 2 3 4 4

Or we can combine these 2 steps:

newvector <- myvector[which(myvector < 5)]
newvector

## [1] 1 2 3 4 4
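The same 2 steps work with more than one condition. As a sketch (using a copy of the vector named v), conditions can be combined element by element with & (and) and | (or):

```r
v <- c(1, 2, 3, 4, 5, 8, 4, 10, 12)

# Both conditions must hold
mid_vals <- v[v > 2 & v < 10]    # 3 4 5 8 4

# Either condition may hold
end_vals <- v[v < 2 | v > 10]    # 1 12

mid_vals
end_vals
```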

Last one: find the 4’s in myvector and make them 8 (it looks hard at first, but after a couple of tries it will seem easier):

myvector <- c(1, 2, 3, 4, 5, 8, 4, 10, 12)
#I'll show you 3 ways to do that.

#1st way, showing each step
ind <- which(myvector==4) #identify the indices of the 4's
newvector <- myvector[ind] + 4 #add 4 to each of them
myvector[ind] <- newvector #replace them with the new values
myvector

## [1]  1  2  3  8  5  8  8 10 12

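#2nd and easier way (the vector is reset first, because the 4's were already replaced above)
myvector <- c(1, 2, 3, 4, 5, 8, 4, 10, 12)
myvector[which(myvector==4)] <- myvector[which(myvector==4)] + 4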
myvector

## [1]  1  2  3  8  5  8  8 10 12

#3rd and easiest way (again starting from the original vector)
myvector <- c(1, 2, 3, 4, 5, 8, 4, 10, 12)
myvector[myvector==4] <- myvector[myvector==4] + 4
myvector

## [1]  1  2  3  8  5  8  8 10 12
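Since the new value is known in advance, it can also be assigned directly, without adding 4. A minimal sketch on a copy of the vector named v:

```r
v <- c(1, 2, 3, 4, 5, 8, 4, 10, 12)

# The single value 8 is assigned to every selected position
v[v == 4] <- 8
v   # 1 2 3 8 5 8 8 10 12
```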

What happens if the vector is a character vector? How can we subset it? We can use grep() as shown below:

m <- c("about", "aboard", "board", "bus", "cat", "abandon")

#Now suppose that we need to pick the elements that contain "ab"

#Same steps again
a <- grep("ab", m) #similar to which() that gives us index numbers
a

## [1] 1 2 6

newvector <- m[a]
newvector

## [1] "about"   "aboard"  "abandon"
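grep() has two close relatives worth knowing. As a sketch (the vector is copied here as words), grepl() returns a logical vector, which plays the role of the comparison in the 2 steps, and value = TRUE makes grep() return the matching elements directly:

```r
words <- c("about", "aboard", "board", "bus", "cat", "abandon")

# Logical version: one TRUE/FALSE per element
hit <- grepl("ab", words)
by_logical <- words[hit]

# One-step version: return the matches instead of their indices
by_value <- grep("ab", words, value = TRUE)

identical(by_logical, by_value)   # TRUE
```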

2.8 Vectorization or vector operations

One of the biggest strengths of R is its use of vectorized operations: a function or operator is applied to every element of a vector at once, without writing a loop. Let's see it in action!

x <- 1:10
x

##  [1]  1  2  3  4  5  6  7  8  9 10

x+1

##  [1]  2  3  4  5  6  7  8  9 10 11

2 * x

##  [1]  2  4  6  8 10 12 14 16 18 20

2 ^ x

##  [1]    2    4    8   16   32   64  128  256  512 1024

x ^ 2

##  [1]   1   4   9  16  25  36  49  64  81 100

sqrt(x)

##  [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
##  [9] 3.000000 3.162278

log(x)

##  [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
##  [8] 2.0794415 2.1972246 2.3025851

It's like a calculator!

y <- 1:10
y

##  [1]  1  2  3  4  5  6  7  8  9 10

x + y

##  [1]  2  4  6  8 10 12 14 16 18 20

How about this:

y <- 1:11
x + y

## Warning in x + y: longer object length is not a multiple of shorter object
## length

##  [1]  2  4  6  8 10 12 14 16 18 20 12

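OK, the warning is self-explanatory. But what’s “12” at the end? When the lengths differ, R recycles the shorter vector, starting over from its first element. So the last value is the sum of the first element of x, which is 1, and the last element of y, which is 11.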
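Recycling can be made explicit with rep_len(), which repeats a vector until it reaches a given length. The sketch below uses new names a and b so it does not overwrite x and y; suppressWarnings() is there only because we already know about the warning.

```r
a <- 1:10
b <- 1:11

# Stretch a to the length of b by starting over from its first element
a_recycled <- rep_len(a, length(b))   # 1 2 ... 10 1

manual <- a_recycled + b
auto   <- suppressWarnings(a + b)

identical(manual, auto)   # TRUE
manual[11]                # 12, i.e. 1 + 11
```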

2.9 Set operations

x <- c(1, 2, 1, 3, 1)
y <- c(1, 1, 3, 6, 6, 5)

You can use the intersect() function to get values in both x and y:

intersect(x, y)

## [1] 1 3

To get values in either x or y

union(x, y)

## [1] 1 2 3 6 5

To get values in x but not in y

setdiff(x, y)

## [1] 2

setdiff(y, x)

## [1] 6 5

To check whether each element of x is inside y

is.element(x, y)

## [1]  TRUE FALSE  TRUE  TRUE  TRUE

# or

x %in% y

## [1]  TRUE FALSE  TRUE  TRUE  TRUE
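The set functions drop duplicates, while %in% keeps one answer per element. The sketch below (with the vectors copied as s1 and s2) shows how %in% plus unique() gives the same values as intersect() and setdiff():

```r
s1 <- c(1, 2, 1, 3, 1)
s2 <- c(1, 1, 3, 6, 6, 5)

# Elements of s1 that are in s2, with duplicates removed
via_in <- unique(s1[s1 %in% s2])          # 1 3
all.equal(via_in, intersect(s1, s2))      # TRUE

# Elements of s1 that are NOT in s2
via_not_in <- unique(s1[!(s1 %in% s2)])   # 2
all.equal(via_not_in, setdiff(s1, s2))    # TRUE
```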

2.10 Missing values

R uses NA to represent missing values (NA stands for “not available”). In a data file, NAs are very common and have to be dealt with properly. Why?

x <- c(1, NA, 2, NA, 3)
mean(x)

## [1] NA

sum(x)

## [1] NA

And it’s contagious: any arithmetic that involves an NA returns NA.

y <- 1:5
x+y

## [1]  2 NA  5 NA  8

To deal with NAs, we need to know how to find the indices of the missing values:

# Do we have any NA?
anyNA(x)

## [1] TRUE

# Which ones?
is.na(x)

## [1] FALSE  TRUE FALSE  TRUE FALSE

# Or
which(is.na(x))

## [1] 2 4

How do we remove them? Before doing so, note that many functions can ignore NAs on their own:

mean(x, na.rm = TRUE)

## [1] 2

So we may skip removing them from the data, since many functions have built-in arguments such as na.rm to deal with NAs. To remove them explicitly:

x2 <- x[!is.na(x)]
x2

## [1] 1 2 3

# Or
x[complete.cases(x)]

## [1] 1 2 3
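A common mistake is to test for missing values with ==. Because a comparison with NA is itself NA, it never returns TRUE, which is why is.na() is needed. A short sketch on a copy of the vector named v, which also counts the missing values:

```r
v <- c(1, NA, 2, NA, 3)

cmp <- v == NA               # NA NA NA NA NA, never TRUE
all(is.na(cmp))              # TRUE

# TRUE counts as 1, so sum() and mean() give the count and the share
n_missing <- sum(is.na(v))   # 2
share     <- mean(is.na(v))  # 0.4
```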

2.11 Factors

A factor represents a categorical variable (sometimes called an “indicator” variable). Factors are an efficient way to store character values, because each unique value is stored only once as a level, and the data itself is stored as a vector of integer codes that point to those levels.

set.seed(123)
anim <- sample(animals, 100, replace = TRUE)
anim

##   [1] "donkey" "donkey" "donkey" "cat"    "donkey" "cat"    "cat"    "cat"   
##   [9] "donkey" "dog"    "cat"    "cat"    "dog"    "cat"    "donkey" "dog"   
##  [17] "donkey" "donkey" "dog"    "dog"    "dog"    "dog"    "donkey" "cat"   
##  [25] "donkey" "cat"    "dog"    "cat"    "donkey" "cat"    "dog"    "donkey"
##  [33] "donkey" "dog"    "donkey" "cat"    "dog"    "donkey" "dog"    "dog"   
##  [41] "cat"    "donkey" "donkey" "dog"    "donkey" "dog"    "donkey" "cat"   
##  [49] "dog"    "cat"    "dog"    "dog"    "donkey" "dog"    "cat"    "dog"   
##  [57] "dog"    "donkey" "dog"    "cat"    "dog"    "donkey" "dog"    "donkey"
##  [65] "cat"    "donkey" "cat"    "cat"    "donkey" "cat"    "cat"    "donkey"
##  [73] "donkey" "dog"    "cat"    "cat"    "dog"    "cat"    "dog"    "dog"   
##  [81] "cat"    "donkey" "donkey" "dog"    "cat"    "dog"    "cat"    "dog"   
##  [89] "donkey" "donkey" "cat"    "donkey" "dog"    "cat"    "cat"    "donkey"
##  [97] "cat"    "dog"    "donkey" "donkey"

table(anim)

## anim
##    cat    dog donkey 
##     32     33     35

Let’s convert the anim vector to a factor:

animf <- as.factor(anim)
levels(animf)

## [1] "cat"    "dog"    "donkey"
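The integer codes behind a factor can be seen with as.integer(). The sketch below uses small made-up vectors; it also shows a common pitfall, where converting a factor of number-like labels with as.numeric() returns the codes rather than the numbers.

```r
pets <- c("dog", "cat", "donkey", "cat")
f <- factor(pets)

lev   <- levels(f)       # "cat" "dog" "donkey" (sorted alphabetically)
codes <- as.integer(f)   # 2 1 3 1: positions in levels(f)

# Pitfall: a factor whose labels look like numbers
fnum   <- factor(c("10", "20", "10"))
wrong  <- as.numeric(fnum)                 # 1 2 1: the codes
right  <- as.numeric(as.character(fnum))   # 10 20 10: the values
```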

We can change the labels of the levels:

data <- c(1, 2, 2, 3, 1, 2, 3, 3, 1, 2, 3, 3, 1)
fdata <- factor(data)
fdata

##  [1] 1 2 2 3 1 2 3 3 1 2 3 3 1
## Levels: 1 2 3

rdata <- factor(data, labels = c("I", "II", "III"))
rdata

##  [1] I   II  II  III I   II  III III I   II  III III I  
## Levels: I II III

#Or
levels(fdata) <- c("I", "II", "III")
fdata

##  [1] I   II  II  III I   II  III III I   II  III III I  
## Levels: I II III

The cut() function converts a numeric variable into a factor. The breaks argument defines how ranges of numbers are converted to factor levels: it can be a single number of equal-width intervals or a vector of cut points. Consider Lot_Area in the ames data.

library(RBootcamp)
Area <- cut(ames$Lot_Area, 3)
table(Area)

## Area
## (1.09e+03,7.26e+04] (7.26e+04,1.44e+05] (1.44e+05,2.15e+05] 
##                2926                   1                   3

# With labels for the three intervals
Area <- cut(ames$Lot_Area, 3, labels = c("Small", "Medium", "Large"))
table(Area)

## Area
##  Small Medium  Large 
##   2926      1      3

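# Quantiles as cut points. By default the lowest cut point is excluded,
# so the minimum value becomes NA and is not counted in the table
Area <- cut(ames$Lot_Area, quantile(ames$Lot_Area, probs = seq(0, 1, 0.25)))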
table(Area)

## Area
##  (1.3e+03,7.44e+03] (7.44e+03,9.44e+03] (9.44e+03,1.16e+04] (1.16e+04,2.15e+05] 
##                 732                 732                 732                 733
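The intervals produced by cut() are open on the left and closed on the right by default, so a value equal to the lowest cut point does not fall into any interval. The sketch below uses a small made-up vector (not the ames data) to show this and how include.lowest = TRUE changes it.

```r
sizes <- c(5, 10, 15, 20, 25)   # hypothetical values
brk   <- c(5, 10, 20, 25)       # cut points: (5,10], (10,20], (20,25]

open_low   <- cut(sizes, breaks = brk)
closed_low <- cut(sizes, breaks = brk, include.lowest = TRUE)

sum(is.na(open_low))     # 1: the value 5 falls outside (5,10]
sum(is.na(closed_low))   # 0: the first interval is now [5,10]
as.integer(closed_low)   # 1 1 2 2 3
```

When the cut points come from quantile(), the minimum of the data is always the lowest cut point, so include.lowest = TRUE is usually what we want.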