\section{The R, RHIPE, Hadoop Setting}
\subsection{Overview}
The setting has three components: a remote computer, one or more Unix
R-session servers, and a Unix Hadoop cluster. The latter two components
run R and RHIPE. You work on the remote
computer, say your laptop, and log in to an R-session server.
This is home base, where you do all of your programming
of R and RHIPE R commands. The R commands you write for division, application
of analytic methods, and recombination that are destined for Hadoop on the
cluster are passed along by RHIPE R commands.

The remote computer is typically yours to maintain. The R-session
servers require IT staff to help install, configure, and maintain software.
However, you install packages on the R-session servers too, just as you do when you
want to use a CRAN package in R. There is an extra task, though: you want
the packages you install to be pushed up to the Hadoop cluster so they can be used
there too. Except for this push by you, the Hadoop cluster is the
domain of the systems administrators, who must, among other tasks, install
Hadoop.
\subsection{The R-Session Server and RStudio}
The R-session server can be separate from the Hadoop cluster, handling
only R sessions, or it can be one of the servers on the Hadoop cluster. If it
is on the Hadoop cluster, precautions must be taken in the Hadoop
configuration to protect the programming in the R sessions. This is needed
because the RHIPE Hadoop jobs compete with the R sessions. There are never full
guarantees, though, so the ``safe mode'' is separate R-session servers. The last thing
you want is for R sessions to get bogged down. If the cluster option is chosen,
then you want to mount a file server on the cluster that contains the files
associated with the R session, such as .RData and files read into R or
written by R.

A large segment of the R community uses RStudio, for good reason.
RStudio can join the setting. You have RStudio Server installed on the
R-session servers by the system administrators. A web browser on the R-session server runs
the RStudio interface, which you access from your remote computer via the
remote login.
\subsection{The Remote Computer}
The remote computer is just a communication device, and does not carry out data
analysis, so it can run any operating system, such as Windows. This is
especially important for teaching, since Windows labs are typically very
plentiful at academic institutions, but Unix labs much less so.

Whatever the operating system, a commonly used communication protocol
is SSH. SSH is typically used to log in to a remote machine and
execute commands, or to transfer files. A critical capability for our
purposes here is that it supports both your R-session command-line window,
showing input and output, and a separate window for graphics.
\subsection{Where Are the Data Analyzed}
Obviously, much data analysis is carried out by Hadoop on the Hadoop cluster.
Your R commands are given to RHIPE, passed along to Hadoop, and the outputs
are written by Hadoop to the HDFS.

But in many analyses of larger and more complex data, it is common that
(1) the outputs of a recombination method constitute a relatively small
dataset, and (2) the outputs are further analyzed as part of the overall
analysis. If they are small enough to be readily analyzed in your R session,
then that is certainly where you want to analyze them.

RHIPE commands allow you to read the recombination outputs from the HDFS into
the R global environment of your R session. They become a dataset in .RData.
While programming with R and RHIPE is easy, it is not as easy as plain old serial R.
The point is that a lot of data analysis can be carried out in R alone, even when
the data are large and complex.
\subsection{A Few Basic Hadoop Features}
The two principal computational operations of Hadoop are Map and Reduce. The
first runs parallel computations on subsets without communication among them.
The second can compute across subset outputs. So Map carries out the
analytic method computation. Reduce takes the outputs from Map
and runs the recombination computation.

A division is typically carried out by both Map and Reduce, sometimes with each used
several times, and can occur as
part of reading the data into R at the start of the analysis.

Usage of Map and Reduce involves the critical Hadoop element of key-value
pairs. We give one instance here. The Map operation, instructed by the
analyst's R code, attaches a key to each subset
output. This forms a key-value pair with the output as the value.
Each output can have a unique key, or each key can be given to many
outputs, or all outputs can have the same key. When Reduce is given the Map
outputs, it assembles the key-value pairs by key, which forms groups,
and then the R recombination code is applied to the values of each group
independently; so the running of the code on the different groups is
embarrassingly parallel. This framework provides substantial flexibility for
the recombination method.
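
The instance just described can be written compactly. The notation here is
introduced only for illustration and is not RHIPE or Hadoop syntax: $s_1,
\ldots, s_n$ denote the subsets, $a$ the analytic method, $k_i$ the key that
the analyst's code attaches to the $i$th output, and $r$ the recombination
code. As a sketch under these assumptions,
\[
  \mbox{Map:} \quad s_i \longmapsto (k_i, v_i), \qquad v_i = a(s_i),
  \qquad i = 1, \ldots, n,
\]
\[
  \mbox{Reduce:} \quad G_k = \{\, v_i : k_i = k \,\}, \qquad o_k = r(G_k)
  \quad \mbox{for each distinct key } k.
\]
The three keying choices correspond to three shapes of the groups $G_k$. If
all keys are distinct, each $G_k$ holds a single output and $r$ is applied $n$
times. If a key is shared by many outputs, $r$ is applied once per distinct
key. If all outputs share one key, there is a single group holding all $n$
outputs and $r$ is applied once. Because each $o_k$ depends only on its own
group $G_k$, the computations of the $o_k$ can run in parallel.
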

Hadoop attempts to optimize computation in a number of ways. One example
concerns Map. Typically, there are vastly more subsets than cores on the cluster.
When Map finishes applying the analytic method to a subset on a core,
Hadoop seeks to assign that core another subset stored on the same node, to avoid
transmitting the subset across the network connecting the nodes, which is
more time consuming.