
Hadoop 2.8.0 Ecosystem

Big Data Engineering and Analytics


Linux OS options: Debian Jessie, CentOS 7, CentOS 6.8 and Alpine Linux (483 MB)

· Pseudo-distributed mode

· Fully distributed mode

PySpark Jupyter Notebook - Kernels (Python, R, Julia)

RStudio Server

ETL - (Data Lake)

· Bring up MariaDB and Oracle 11g databases and import them with Sqoop

· Hive, Pig, HBase

· JDBC set up and ready for Sqoop and Spark

Machine Learning

· Mahout (Naive Bayes, K-Means)


Fully distributed mode

Containers on a single host

Script to create your cluster with 1 to 9 nodes.

curl -L https://raw.githubusercontent.com/luvres/hadoop/master/zoneCluster.sh -o ~/zoneCluster.sh
alias zoneCluster="bash ~/zoneCluster.sh"

Create a directory for notebooks; this is the directory mounted into the containers via the "-v" flag

mkdir $HOME/notebooks

Create a one-node cluster

Two nodes in total, since the namenode takes one more node

zoneCluster
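
A quick way to confirm the cluster came up, as a sketch; it assumes the namenode container is named Hadoop (as in the docker logs command further down) and that the notebooks directory is mounted at /root/notebooks, as in the docker run examples later in this README:

# List the containers created by the script
docker ps --format 'table {{.Names}}\t{{.Status}}\t{{.Ports}}'

# Check that the notebooks directory is visible inside the namenode container
docker exec Hadoop ls /root/notebooks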

Hadoop Browser

http://localhost:8088

http://localhost:50070

HBase Browser

http://localhost:60010

Terminal access through the Jupyter Notebook

http://localhost:8888/terminals/1

 sh-4.2# bash <enter>

To create a cluster of up to 9 nodes (10 including the namenode), pass the number of nodes

zoneCluster 3

docker logs -f Hadoop
Note: The script is limited to a maximum of 9 nodes because multiple containers are created on a single host and I see no point in overloading your machine. The settings are ready for a real cluster, and in the future I want to create provisioning scripts for Docker Swarm.
Options: { stop | start | remove | Stop | pseudo | cos6 | cos7 | alpine }
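
For reference, the forms used throughout this README (the cos6, cos7 and alpine options select the base image; check zoneCluster.sh itself for their exact argument order):

zoneCluster            # 1 node plus the namenode
zoneCluster 3          # 3 nodes plus the namenode
zoneCluster 2 -db      # 2 nodes plus the MariaDB and Oracle XE database containers
zoneCluster pseudo     # single pseudo-distributed container
zoneCluster Stop       # stop and remove the cluster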

Stop and Remove the cluster

zoneCluster Stop

ETL - (Data Lake)

Bring up the MariaDB and Oracle 11g databases and import them with Sqoop

zoneCluster 2 -db

Import data from MariaDB with Sqoop

http://localhost:8888/terminals/1
# bash <Enter>

sqoop import \
	--connect jdbc:mysql://mariadb:3306/mysql \
	--username root \
	--password maria \
	--table user -m 1

Check the imported data in HDFS

hdfs dfs -ls -R user

Import data from Oracle with Sqoop

Access Oracle

docker exec -ti OracleXE bash
cd $HOME/data/

Download file

curl -O http://files.grouplens.org/datasets/movielens/ml-20m.zip
unzip ml-20m.zip
cd ml-20m

Create a file 100 times smaller

tail -n $(($(wc -l < ratings.csv) / 100)) ratings.csv > ml_ratings.csv
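
A quick sanity check; the reduced file should have roughly 1% of the original line count:

wc -l ratings.csv ml_ratings.csv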

Load table in Oracle

Access database and create user

sqlplus sys/oracle as sysdba

Create the schema in the database and grant privileges

SQL> create user aluno identified by dsacademy;
SQL> grant connect, resource, unlimited tablespace to aluno;
SQL> conn aluno/dsacademy@xe
SQL> select user from dual;

Create a table in the Oracle database

SQL> CREATE TABLE cinema ( 
  ID   NUMBER PRIMARY KEY, 
  USER_ID       VARCHAR2(30), 
  MOVIE_ID      VARCHAR2(30),
  RATING        DECIMAL(30),
  TIMESTAMP     VARCHAR2(256) 
);

SQL> desc cinema;

SQL> quit

Create file loader.dat

tee $HOME/data/loader.dat <<EOF
load data
INFILE '$HOME/data/ml-20m/ml_ratings.csv'
INTO TABLE cinema
APPEND
FIELDS TERMINATED BY ','
trailing nullcols
(id SEQUENCE (MAX,1),
 user_id CHAR(30),
 movie_id CHAR(30),
 rating   decimal external,
 timestamp  char(256))
EOF

Run SQL*Loader

sqlldr userid=aluno/dsacademy control=$HOME/data/loader.dat log=$HOME/data/loader.log
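
The load summary (rows loaded, rows rejected) is written to the log file given above and can be inspected before querying the table:

tail -n 20 $HOME/data/loader.log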

Check load

sqlplus aluno/dsacademy

SQL> select count(*) from cinema;

Import with Sqoop

http://localhost:8888/terminals/1
# bash <Enter>

sqoop import \
--connect jdbc:oracle:thin:@oraclexe:1521:XE \
--username aluno \
--password dsacademy \
--query "select user_id, movie_id from cinema where rating = 1 and \$CONDITIONS" \
--target-dir /user/oracle/output -m 1
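
To verify the import, list the target directory and peek at the output; with -m 1, Sqoop normally writes a single part-m-00000 file (the exact file name is an assumption worth confirming with the ls):

hdfs dfs -ls /user/oracle/output
hdfs dfs -cat /user/oracle/output/part-m-00000 | head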

Hive (structured data in HDFS)

Download the dataset and copy it to HDFS

curl -O  https://raw.githubusercontent.com/luvres/hadoop/master/datasets/empregados.csv

hdfs dfs -mkdir /hive
hdfs dfs -copyFromLocal empregados.csv /hive

Initialize the Hive metastore schema (before starting Hive)

schematool -initSchema -dbType derby

If the previous command fails, remove the existing metastore directory and run it again

rm metastore_db -fR

Start Hive

hive

Create a table to receive the file

CREATE TABLE temp_colab (texto String);

Load the file data

LOAD DATA INPATH '/hive/empregados.csv' OVERWRITE INTO TABLE temp_colab;

Check that the data was loaded

SELECT * FROM temp_colab;

Extract the data from temp_colab and split it into columns

CREATE TABLE IF NOT EXISTS colaboradores(
id int,
nome String,
cargo String,
salario double,
cidade String
);

insert overwrite table colaboradores
SELECT
regexp_extract(texto, '^(?:([^,]*),?){1}', 1) ID,
regexp_extract(texto, '^(?:([^,]*),?){2}', 1) nome,
regexp_extract(texto, '^(?:([^,]*),?){3}', 1) cargo,
regexp_extract(texto, '^(?:([^,]*),?){4}', 1) salario,
regexp_extract(texto, '^(?:([^,]*),?){5}', 1) cidade
from temp_colab;

HiveQL Commands

SELECT * FROM colaboradores;

SELECT * FROM colaboradores WHERE Id = 3002;

SELECT sum(salario), cidade from colaboradores group by cidade;

Machine Learning

Creating a Predictive Model with Naive Bayes

Create Folders in HDFS

hdfs dfs -mkdir -p /mahout/input/{ham,spam}

Download the datasets and copy them to HDFS

curl https://raw.githubusercontent.com/luvres/hadoop/master/datasets/ham.tar.gz | tar -xzf -
curl https://raw.githubusercontent.com/luvres/hadoop/master/datasets/spam.tar.gz | tar -xzf -

hdfs dfs -copyFromLocal ham/* /mahout/input/ham

hdfs dfs -copyFromLocal spam/* /mahout/input/spam
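
Optionally confirm the copies; hdfs dfs -count prints directory, file, and byte counts for each path:

hdfs dfs -count /mahout/input/ham /mahout/input/spam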

Convert the data to sequence files (required when working with Mahout)

mahout seqdirectory -i /mahout/input -o /mahout/output/seqoutput

Convert the sequence files to TF-IDF vectors

mahout seq2sparse -i /mahout/output/seqoutput -o /mahout/output/sparseoutput

Display the output

hdfs dfs -ls /mahout/output/sparseoutput

Split into training and test datasets

mahout split -i /mahout/output/sparseoutput/tfidf-vectors --trainingOutput /mahout/nbTrain --testOutput /mahout/nbTest --randomSelectionPct 30 --overwrite --sequenceFiles -xm sequential

Build the predictive model

mahout trainnb -i /mahout/nbTrain -li /mahout/nbLabels -o /mahout/nbmodel -ow -c

Test model

mahout testnb -i /mahout/nbTest -m /mahout/nbmodel -l /mahout/nbLabels -ow -o /mahout/nbpredictions -c
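
The testnb run prints the confusion matrix and accuracy to the console; the raw predictions are stored as sequence files and, assuming Mahout's standard seqdumper utility is available in this build, can be inspected with:

mahout seqdumper -i /mahout/nbpredictions | head -n 40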

Creating a Predictive Model of Unsupervised Learning with K-Means

Create Folders in HDFS

hdfs dfs -mkdir -p /mahout/clustering/data

Download the dataset and copy it to HDFS

curl https://raw.githubusercontent.com/luvres/hadoop/master/datasets/news.tar.gz | tar -xzf -

hdfs dfs -copyFromLocal news/* /mahout/clustering/data

Convert the dataset to sequence files

mahout seqdirectory -i /mahout/clustering/data -o /mahout/clustering/kmeansseq

Convert the sequence files to TF-IDF vectors

mahout seq2sparse -i /mahout/clustering/kmeansseq -o /mahout/clustering/kmeanssparse

hdfs dfs -ls /mahout/clustering/kmeanssparse

Building the K-means model

mahout kmeans -i /mahout/clustering/kmeanssparse/tfidf-vectors/ -c /mahout/clustering/kmeanscentroids  -cl -o /mahout/clustering/kmeansclusters -k 3 -ow -x 10 -dm org.apache.mahout.common.distance.CosineDistanceMeasure

hdfs dfs -ls /mahout/clustering/kmeansclusters

Dump clusters to a text file

mahout clusterdump -d /mahout/clustering/kmeanssparse/dictionary.file-0 -dt sequencefile -i /mahout/clustering/kmeansclusters/clusters-1-final -n 20 -b 100 -o clusterdump.txt -p /mahout/clustering/kmeansclusters/clusteredPoints/

View clusters

cat clusterdump.txt

PySpark with Jupyter Notebook

Browser access

http://localhost:8888

Spark job management

http://localhost:4040

RStudio Server

Browser access

http://localhost:8787

username: root
password: root

Create a pseudo-distributed instance

zoneCluster pseudo

Equivalent to the command

docker run --rm --name Hadoop -h hadoop \
-p 8088:8088 -p 8042:8042 -p 50070:50070 -p 8888:8888 -p 4040:4040 \
-v $HOME/notebooks:/root/notebooks \
-ti izone/hadoop:ecosystem bash

Julia (Linear regression)

http://localhost:8888/terminals/1
bash

curl -O https://raw.githubusercontent.com/luvres/hadoop/master/julia/dataset/multilinreg.jl
curl -O https://raw.githubusercontent.com/luvres/hadoop/master/julia/dataset/data.txt

julia multilinreg.jl

Pull the latest image (Debian 8)

docker pull izone/hadoop

Run the pulled image (optional "-test" flag to start with a PI test)

docker run --rm --name Hadoop -h hadoop \
	-p 8088:8088 \
	-p 8042:8042 \
	-p 50070:50070 \
	-ti izone/hadoop -test bash

Pull image with CentOS 7

docker pull izone/hadoop:cos7

Pull image with CentOS 6

docker pull izone/hadoop:cos6

Pull the reduced Alpine image (483 MB)

docker pull izone/hadoop:alpine

Run the pulled image (optional "-test" flag to start with a PI test)

docker run --rm --name Hadoop -h hadoop \
	-p 8088:8088 \
	-p 8042:8042 \
	-p 50070:50070 \
	-ti izone/hadoop:alpine -test bash

Examples:

Hadoop MapReduce

Create a directory
hdfs dfs -mkdir /bigdata
List the directory
hadoop fs -ls /
Download a CSV file
wget -c http://compras.dados.gov.br/contratos/v1/contratos.csv
Copy the file to the HDFS directory created above
hadoop fs -copyFromLocal contratos.csv /bigdata
Read the file
hadoop fs -cat /bigdata/contratos.csv
Run the word count MapReduce example
hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar wordcount /bigdata/contratos.csv /output
Read the result
hdfs dfs -cat /output/*
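MapReduce refuses to write to an existing output directory, so remove it before rerunning the job
hdfs dfs -rm -r /output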

Spark MapReduce

PySpark Jupyter notebook

http://localhost:8888/
New -> Python
Terminal commands can be executed with "!" straight from the notebook;
it is the same as running them directly on the terminal
!mkdir datasets
!curl -L http://www.gutenberg.org/files/11/11-0.txt -o datasets/book.txt
!hdfs dfs -mkdir -p /spark/input
!hdfs dfs -put datasets/book.txt /spark/input
!hdfs dfs -ls /spark/input
text_file = sc.textFile("hdfs://localhost:9000/spark/input/book.txt")

counts = text_file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)

counts.saveAsTextFile("hdfs://localhost:9000/spark/output")
View result
!hdfs dfs -ls /spark/output
!hdfs dfs -cat /spark/output/part-00000

Spark on YARN

Client environment
export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf

Submit

spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster $SPARK_HOME/examples/jars/spark-examples_2.11-2.0.2.jar 10
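
In yarn-cluster mode the PI result is printed by the driver inside the cluster rather than in the local console; if log aggregation is enabled, it can be retrieved with yarn logs, using the application ID reported by spark-submit or shown in the ResourceManager UI on port 8088 (the ID below is a placeholder):

yarn logs -applicationId <application_id> | grep "Pi is roughly"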

Pull image with Anaconda

docker run --rm --name Hadoop -h hadoop \
	-p 8088:8088 \
	-p 8042:8042 \
	-p 50070:50070 \
	-p 8888:8888 \
	-p 4040:4040 \
	-v $HOME/notebooks:/root/notebooks \
	-ti izone/hadoop:anaconda bash

Pull image with RStudio

docker run --rm --name Hadoop -h hadoop \
	-p 8088:8088 \
	-p 8042:8042 \
	-p 50070:50070 \
	-p 8888:8888 \
	-p 4040:4040 \
	-p 8787:8787 \
	-v $HOME/notebooks:/root/notebooks \
	-ti izone/hadoop:rstudio bash

Automated build: creation sequence of the images published on Docker Hub

Debian 8

git clone https://github.com/luvres/hadoop.git
cd hadoop

docker build -t izone/hadoop . && \
docker build -t izone/hadoop:anaconda ./anaconda/ && \
docker build -t izone/hadoop:rstudio ./rstudio/ && \
docker build -t izone/hadoop:julia ./julia/ && \
docker build -t izone/hadoop:ecosystem ./ecosystem/ && \
docker build -t izone/hadoop:cluster ./cluster/ && \
docker build -t izone/hadoop:datanode ./cluster/datanode

CentOS 7

git clone https://github.com/luvres/hadoop.git
cd hadoop

docker build -t izone/hadoop:cos7 ./centos7/ && \
docker build -t izone/hadoop:cos7-miniconda ./centos7/miniconda/ && \
docker build -t izone/hadoop:cos7-ecosystem ./centos7/ecosystem/ && \
docker build -t izone/hadoop:cos7-anaconda ./centos7/anaconda/ && \
docker build -t izone/hadoop:cos7-mahout ./centos7/mahout/ && \
docker build -t izone/hadoop:cos7-cluster ./centos7/cluster/ && \
docker build -t izone/hadoop:cos7-datanode ./centos7/cluster/datanode/

CentOS 6

git clone https://github.com/luvres/hadoop.git
cd hadoop

docker build -t izone/hadoop:cos6 ./centos6/ && \
docker build -t izone/hadoop:cos6-miniconda ./centos6/miniconda/ && \
docker build -t izone/hadoop:cos6-ecosystem ./centos6/ecosystem/ && \
docker build -t izone/hadoop:cos6-anaconda ./centos6/anaconda/ && \
docker build -t izone/hadoop:cos6-rstudio ./centos6/rstudio/ && \
docker build -t izone/hadoop:cos6-mahout ./centos6/mahout/ && \
docker build -t izone/hadoop:cos6-cluster ./centos6/cluster/ && \
docker build -t izone/hadoop:cos6-datanode ./centos6/cluster/datanode

Alpine

git clone https://github.com/luvres/hadoop.git
cd hadoop

docker build -t izone/hadoop:alpine ./alpine/ && \
docker build -t izone/hadoop:alpine-datanode ./alpine/datanode/
