-
Notifications
You must be signed in to change notification settings - Fork 32
/
02-data_types.Rmd
850 lines (590 loc) · 20.3 KB
/
02-data_types.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
# Objects, their classes and types, and useful R functions to get you started
All objects in R have a given *type*. You already know most of them, as these types are also used
in mathematics. Integers, floating point numbers (floats), matrices, etc, are all objects you
are already familiar with. But R has other, maybe lesser known data types (that you can find in a
lot of other programming languages) that you need to become familiar with. But first, we need to
learn how to assign a value to a variable. This can be done in two ways:
```{r}
a <- 3
```
or
```{r}
a = 3
```
in very practical terms, there is no difference between the two. I prefer using `<-` for assigning
values to variables and reserve `=` for passing arguments to functions, for example:
```{r}
spam <- mean(x = c(1,2,3))
```
I think this is less confusing than:
```{r}
spam = mean(x = c(1,2,3))
```
but as I explained above you can use whatever you feel most comfortable with.
## The `numeric` class
To define single numbers, you can do the following:
```{r}
a <- 3
```
The `class()` function allows you to check the class of an object:
```{r}
class(a)
```
Decimals are defined with the character `.`:
```{r}
a <- 3.14
```
R also supports integers. If you find yourself in a situation where you explicitly need an integer
and not a floating point number, you can use the following:
```{r}
a <- as.integer(3)
class(a)
```
The `as.integer()` function is very useful, because it converts its argument into an integer. There
is a whole family of `as.*()` functions. To convert `a` into a floating point number again:
```{r}
class(as.numeric(a))
```
There is also `is.numeric()` which tests whether a number is of the `numeric` class:
```{r}
is.numeric(a)
```
It is also possible to create an integer using `L`:
```{r}
a <- 5L
class(a)
```
Another way to convert this integer back to a floating point number is to use `as.double()` instead of
as numeric:
```{r}
class(as.double(a))
```
The functions prefixed with `is.*` and `as.*` are quite useful, there is one for any of the supported types in R, such
as `as/is.character()`, `as/is.factor()`, etc...
## The `character` class
Use `" "` to define characters (called strings in other programming languages):
```{r}
a <- "this is a string"
```
```{r}
class(a)
```
To convert something to a character you can use the `as.character()` function:
```{r}
a <- 4.392
class(a)
```
Now let's convert it:
```{r}
class(as.character(a))
```
It is also possible to convert a character to a numeric:
```{r}
a <- "4.392"
class(a)
class(as.numeric(a))
```
But this only works if it makes sense:
```{r}
a <- "this won't work, chief"
class(a)
as.numeric(a)
```
A very nice package to work with characters is `{stringr}`, which is also part of the `{tidyverse}`.
## The `factor` class
Factors look like characters, but are very different. They are the representation of categorical
variables. A `{tidyverse}` package to work with factors is `{forcats}`. You would rarely use
factor variables outside of datasets, so for now, it is enough to know that this class exists.
We are going to learn more about factor variables in Chapter 4, by using the `{forcats}` package.
## The `Date` class
Dates also look like characters, but are very different too:
```{r}
as.Date("2019/03/19")
class(as.Date("2019/03/19"))
```
Manipulating dates and time can be tricky, but thankfully there's a `{tidyverse}` package for that,
called `{lubridate}`. We are going to go over this package in Chapter 4.
## The `logical` class
This is the class of predicates, expressions that evaluate to *true* or *false*. For example, if you type:
```{r}
4 > 3
```
R returns `TRUE`, which is an object of class `logical`:
```{r}
k <- 4 > 3
class(k)
```
In other programming languages, `logical`s are often called `bool`s. A `logical` variable can only have
two values, either `TRUE` or `FALSE`. You can test the truthiness of a variable with `isTRUE()`:
```{r}
k <- 4 > 3
isTRUE(k)
```
How can you test if a variable is false? There is not a `isFALSE()` function (at least not without having
to load a package containing this function), but there is way to do it:
```{r}
k <- 4 > 3
!isTRUE(k)
```
The `!` operator indicates negation, so the above expression could be translated as *is k not TRUE?*.
There are other operators for boolean algebra, namely `&, &&, |, ||`. `&` means *and* and `|` stands for *or*.
You might be wondering what the difference between `&` and `&&` is? Or between `|` and `||`? `&` and
`|` work on vectors, doing pairwise comparisons:
```{r}
one <- c(TRUE, FALSE, TRUE, FALSE)
two <- c(FALSE, TRUE, TRUE, TRUE)
one & two
```
Compare this to the `&&` operator:
```{r}
one <- c(TRUE, FALSE, TRUE, FALSE)
two <- c(FALSE, TRUE, TRUE, TRUE)
one && two
```
The `&&` and `||` operators only compare the first element of the vectors and stop as soon as a the return
value can be safely determined. This is called short-circuiting. Consider the following:
```{r}
one <- c(TRUE, FALSE, TRUE, FALSE)
two <- c(FALSE, TRUE, TRUE, TRUE)
three <- c(TRUE, TRUE, FALSE, FALSE)
one && two && three
one || two || three
```
The `||` operator stops as soon it evaluates to `TRUE` whereas the `&&` stops as soon as it evaluates to `FALSE`.
Personally, I rarely use `||` or `&&` because I get confused. I find using `|` or `&` in combination with the
`all()` or `any()` functions much more useful:
```{r}
one <- c(TRUE, FALSE, TRUE, FALSE)
two <- c(FALSE, TRUE, TRUE, TRUE)
any(one & two)
all(one & two)
```
`any()` checks whether any of the vector's elements are `TRUE` and `all()` checks if all elements of the vector are
`TRUE`.
As a final note, you should know that is possible to use `T` for `TRUE` and `F` for `FALSE` but I
would advise against doing this, because it is not very explicit.
## Vectors and matrices
You can create a vector in different ways. But first of all, it is important to understand that a
vector in most programming languages is nothing more than a list of things. These things can be
numbers (either integers or floats), strings, or even other vectors. A vector in R can only contain elements of one
single type. This is not the case for a list, which is much more flexible. We will talk about lists shortly, but
let's first focus on vectors and matrices.
### The `c()` function
A very important function that allows you to build a vector is `c()`:
```{r}
a <- c(1,2,3,4,5)
```
This creates a vector with elements 1, 2, 3, 4, 5. If you check its class:
```{r}
class(a)
```
This can be confusing: you where probably expecting a to be of class *vector* or
something similar. This is not the case if you use `c()` to create the vector, because `c()`
doesn't build a vector in the mathematical sense, but a so-called atomic vector.
Checking its dimension:
```{r}
dim(a)
```
returns `NULL` because an atomic vector doesn't have a dimension.
If you want to create a true vector, you need to use `cbind()` or `rbind()`.
But before continuing, be aware that atomic vectors can only contain elements of the same type:
```{r}
c(1, 2, "3")
```
because "3" is a character, all the other values get implicitly converted to characters. You have
to be very careful about this, and if you use atomic vectors in your programming, you have to make
absolutely sure that no characters or logicals or whatever else are going to convert your atomic
vector to something you were not expecting.
### `cbind()` and `rbind()`
You can create a *true* vector with `cbind()`:
```{r}
a <- cbind(1, 2, 3, 4, 5)
```
Check its class now:
```{r}
class(a)
```
This is exactly what we expected. Let's check its dimension:
```{r}
dim(a)
```
This returns the dimension of `a` using the LICO notation (number of LInes first, the number of COlumns).
It is also possible to bind vectors together to create a matrix.
```{r}
b <- cbind(6,7,8,9,10)
```
Now let's put vector `a` and `b` into a matrix called `matrix_c` using `rbind()`.
`rbind()` functions the same way as `cbind()` but glues the vectors together by rows and not by columns.
```{r}
matrix_c <- rbind(a,b)
print(matrix_c)
```
### The `matrix` class
R also has support for matrices. For example, you can create a matrix of dimension (5,5) filled
with 0's with the `matrix()` function:
```{r}
matrix_a <- matrix(0, nrow = 5, ncol = 5)
```
If you want to create the following matrix:
\[
B = \left(
\begin{array}{ccc}
2 & 4 & 3 \\
1 & 5 & 7
\end{array} \right)
\]
you would do it like this:
```{r}
B <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 2, byrow = TRUE)
```
The option `byrow = TRUE` means that the rows of the matrix will be filled first.
You can access individual elements of `matrix_a` like so:
```{r}
matrix_a[2, 3]
```
and R returns its value, 0. We can assign a new value to this element if we want. Try:
```{r}
matrix_a[2, 3] <- 7
```
and now take a look at `matrix_a` again.
```{r}
print(matrix_a)
```
Recall our vector `b`:
```{r}
b <- cbind(6,7,8,9,10)
```
To access its third element, you can simply write:
```{r}
b[3]
```
I have heard many people praising R for being a matrix based language. Matrices are indeed useful,
and statisticians are used to working with them. However, I very rarely use matrices in my
day to day work, and prefer an approach based on data frames (which will be discussed below). This
is because working with data frames makes it easier to use R's advanced functional programming
language capabilities, and this is where R really shines in my opinion. Working with matrices
almost automatically implies using loops and all the iterative programming techniques, *à la Fortran*,
which I personally believe are ill-suited for interactive statistical programming (as discussed in
the introduction).
## The `list` class
The `list` class is a very flexible class, and thus, very useful. You can put anything inside a list,
such as numbers:
```{r}
list1 <- list(3, 2)
```
or other lists constructed with `c()`:
```{r}
list2 <- list(c(1, 2), c(3, 4))
```
you can also put objects of different classes in the same list:
```{r}
list3 <- list(3, c(1, 2), "lists are amazing!")
```
and of course create list of lists:
```{r}
my_lists <- list(list1, list2, list3)
```
To check the contents of a list, you can use the structure function `str()`:
```{r}
str(my_lists)
```
or you can use RStudio's *Environment* pane:
```{r, echo=FALSE}
knitr::include_graphics("pics/rstudio_environment_list.gif")
```
You can also create named lists:
```{r}
list4 <- list("name_1" = 2, "name_2" = 8, "name_3" = "this is a named list")
```
and you can access the elements in two ways:
```{r}
list4[[1]]
```
or, for named lists:
```{r}
list4$name_3
```
Take note of the `$` operator, because it is going to be quite useful for `data.frame`s as well,
which we are going to get to know in the next section.
Lists are used extensively because they are so flexible. You can build lists of datasets and apply
functions to all the datasets at once, build lists of models, lists of plots, etc... In the later
chapters we are going to learn all about them. Lists are central objects in a functional programming
workflow for interactive statistical analysis.
## The `data.frame` and `tibble` classes
In the next chapter we are going to learn how to import datasets into R. Once you import data, the
resulting object is either a `data.frame` or a `tibble` depending on which package you used to
import the data. `tibble`s extend `data.frame`s so if you know about `data.frame` objects already,
working with `tibble`s will be very easy. `tibble`s have a better `print()` method, and some other
niceties.
However, I want to stress that these objects are central to R and are thus very important; they are
actually special cases of lists, discussed above. There are different ways to print a `data.frame` or
a `tibble` if you wish to inspect it. You can use `View(my_data)` to show the `my_data` `data.frame`
in the *View* pane of RStudio:
```{r, echo=FALSE}
knitr::include_graphics("pics/rstudio_view_data.gif")
```
You can also use the `str()` function:
```{r, eval=FALSE}
str(my_data)
```
And if you need to access an individual column, you can use the `$` sign, same as for a list:
```{r, eval=FALSE}
my_data$col1
```
## Formulas
We will learn more about formulas later, but because it is an important object, it is useful if you
already know about them early on. A formula is defined in the following way:
```{r}
my_formula <- ~x
class(my_formula)
```
Formula objects are defined using the `~` symbol. Formulas are useful to define statistical models,
for example for a linear regression:
```{r, eval=FALSE}
lm(y ~ x)
```
or also to define anonymous functions, but more on this later.
## Models
A statistical model is an object like any other in R:
```{r, include = FALSE}
data(mtcars)
my_model <- lm(mpg ~ hp, mtcars)
```
Here, I have already a model that I ran on some test data:
```{r}
class(my_model)
```
`my_model` is an object of class `lm`, for *linear model*. You can apply different functions to a model object:
```{r}
summary(my_model)
```
This class will be explored in later chapters.
## NULL, NA and NaN
The `NULL`, `NA` and `NaN` classes are pretty special. `NULL` is returned when the result of function is undetermined.
For example, consider `list4`:
```{r}
list4
```
if you try to access an element that does not exist, such as `d`, you will get `NULL` back:
```{r}
list4$d
```
`NaN` means "Not a Number" and is returned when a function return something that is not a number:
```{r}
sqrt(-1)
```
or:
```{r}
0/0
```
Basically, numbers that cannot be represented as floating point numbers are `NaN`.
Finally, there's `NA` which is closely related to `NaN` but is used for missing values. `NA` stands for `Not Available`. There are
several types of `NA`s:
* `NA_integer_`
* `NA_real_`
* `NA_complex_`
* `NA_character_`
but these are in principle only used when you need to program your own functions and need
to explicitly test for the missingness of, say, a character value.
To test whether a value is `NA`, use the `is.na()` function.
## Useful functions to get you started
This section will list several basic R functions that are very useful and should be part of your toolbox.
### Sequences
There are several functions that create sequences, `seq()`, `seq_along()` and `rep()`. `rep()` is easy enough:
```{r}
rep(1, 10)
```
This simply repeats `1` 10 times. You can repeat other objects too:
```{r}
rep("HAHA", 10)
```
To create a sequence, things are not as straightforward. There is `seq()`:
```{r}
seq(1, 10)
seq(70, 80)
```
It is also possible to provide a `by` argument:
```{r}
seq(1, 10, by = 2)
```
`seq_along()` behaves similarly, but returns the length of the object passed to it. So if you pass `list4` to
`seq_along()`, it will return a sequence from 1 to 3:
```{r}
seq_along(list4)
```
which is also true for `seq()` actually:
```{r}
seq(list4)
```
but these two functions behave differently for arguments of length equal to 1:
```{r}
seq(10)
seq_along(10)
```
So be quite careful about that. I would advise you do not use `seq()`, but only `seq_along()` and `seq_len()`. `seq_len()`
only takes arguments of length 1:
```{r}
seq_len(10)
seq_along(10)
```
The problem with `seq()` is that it is unpredictable; depending on its input, the output will either be an integer or a sequence.
When programming, it is better to have function that are stricter and fail when confronted to special cases, instead of returning
some result. This is a bit of a recurrent issue with R, and the functions from the `{tidyverse}` mitigate this issue by being
stricter than their base R counterparts. For example, consider the `ifelse()` function from base R:
```{r}
ifelse(3 > 5, 1, "this is false")
```
and compare it to `{dplyr}`'s implementation, `if_else()`:
```{r, eval=FALSE}
if_else(3 > 5, 1, "this is false")
Error: `false` must be type double, not character
Call `rlang::last_error()` to see a backtrace
```
`if_else()` fails because the return value when `FALSE` is not a double (a real number) but a character. This might seem unnecessarily
strict, but at least it is predictable. This makes debugging easier when used inside functions. In Chapter 8 we are going to learn how
to write our own functions, and being strict makes programming easier.
### Basic string manipulation
For now, we have not closely studied `character` objects, we only learned how to define them. Later, in Chapter 5 we will learn about the
`{stringr}` package which provides useful function to work with strings. However, there are several base R functions that are very
useful that you might want to know nonetheless, such as `paste()` and `paste0()`:
```{r}
paste("Hello", "amigo")
```
but you can also change the separator if needed:
```{r}
paste("Hello", "amigo", sep = "--")
```
`paste0()` is the same as `paste()` but does not have any `sep` argument:
```{r}
paste0("Hello", "amigo")
```
If you provide a vector of characters, you can also use the `collapse` argument,
which places whatever you provide for `collapse` between the
characters of the vector:
```{r}
paste0(c("Joseph", "Mary", "Jesus"), collapse = ", and ")
```
To change the case of characters, you can use `toupper()` and `tolower()`:
```{r}
tolower("HAHAHAHAH")
```
```{r}
toupper("hueuehuehuheuhe")
```
Finally, there are the classical mathematical functions that you know and love:
* `sqrt()`
* `exp()`
* `log()`
* `abs()`
* `sin()`, `cos()`, `tan()`, and others
* `sum()`, `cumsum()`, `prod()`, `cumprod()`
* `max()`, `min()`
and many others...
## Exercises
### Exercise 1 {-}
Try to create the following vector:
\[a = (6,3,8,9)\]
and add it this other vector:
\[b = (9,1,3,5)\]
and save the result to a new variable called `result`.
### Exercise 2 {-}
Using `a` and `b` from before, try to get their dot product.
Try with `a * b` in the R console. What happened?
Try to find the right function to get the dot product. Don't hesitate to google the answer!
### Exercise 3 {-}
How can you create a matrix of dimension (30,30) filled with 2's by only using the function `matrix()`?
### Exercise 4 {-}
Save your first name in a variable `a` and your surname in a variable `b`. What does the function:
```{r, eval=FALSE}
paste(a, b)
```
do? Look at the help for `paste()` with `?paste` or using the *Help* pane in RStudio. What does the
optional argument `sep` do?
### Exercise 5 {-}
Define the following variables: `a <- 8`, `b <- 3`, `c <- 19`. What do the following lines check?
What do they return?
```{r, eval=FALSE}
a > b
a == b
a != b
a < b
(a > b) && (a < c)
(a > b) && (a > c)
(a > b) || (a < b)
```
### Exercise 6 {-}
Define the following matrix:
\[
\text{matrix_a} = \left(
\begin{array}{ccc}
9 & 4 & 12 \\
5 & 0 & 7 \\
2 & 6 & 8 \\
9 & 2 & 9
\end{array} \right)
\]
```{r, include=FALSE}
matrix_a <- matrix(c(9, 4, 12, 5, 0, 7, 2, 6, 8, 9, 2, 9), nrow = 4, byrow = TRUE)
```
* What does `matrix_a >= 5` do?
* What does `matrix_a[ , 2]` do?
* Can you find which function gives you the transpose of this matrix?
### Exercise 7 {-}
Solve the following system of equations using the `solve()` function:
\[
\left(
\begin{array}{cccc}
9 & 4 & 12 & 2 \\
5 & 0 & 7 & 9\\
2 & 6 & 8 & 0\\
9 & 2 & 9 & 11
\end{array} \right) \times \left(
\begin{array}{ccc}
x \\
y \\
z \\
t \\
\end{array}\right) =
\left(
\begin{array}{ccc}
7\\
18\\
1\\
0
\end{array}
\right)
\]
```{r, include=FALSE}
matrix_a <- matrix(c(9, 4, 12, 2, 5, 0, 7, 9, 2, 6, 8, 0, 9, 2, 9, 11), nrow = 4, byrow = TRUE)
result <- c(7, 18, 1, 0)
solution <- solve(matrix_a, result)
```
### Exercise 8 {-}
Load the `mtcars` data (`mtcars` is include in R, so you only need to use the `data()` function to
load the data):
```{r}
data(mtcars)
```
if you run `class(mtcars)`, you get "data.frame". Try now with `typeof(mtcars)`. The answer is now
"list"! This is because the class of an object is an attribute of that object, which can even
be assigned by the user:
```{r}
class(mtcars) <- "don't do this"
class(mtcars)
```
The type of an object is R's internal type of that object, which cannot be manipulated by the user.
It is always useful to know the type of an object (not just its class). For example, in the particular
case of data frames, because the type of a data frame is a list, you can use all that you learned
about lists to manipulate data frames! Recall that `$` allowed you to select the element of a list
for instance:
```{r}
my_list <- list("one" = 1, "two" = 2, "three" = 3)
my_list$one
```
Because data frames are nothing but fancy lists, this is why you can access columns the same way:
```{r}
mtcars$mpg
```
```{r, include=FALSE}
rm(mtcars)
```