This tutorial walks you through installing Apache Hadoop on a basic cluster. In the example below there is 1 master node (which also acts as a worker) - cluster1 - and 3 worker nodes - cluster2, cluster3, cluster4. You can add more worker nodes as needed. All nodes in this guide run Ubuntu Server 18.04 with the login user ubuntu, so the home directory is /home/ubuntu/. Remember to replace /home/ubuntu/ with your own home directory where appropriate.
Below are the 4 nodes and their IP addresses I will be referring to here:
172.16.203.128 cluster1
172.16.203.129 cluster2
172.16.203.130 cluster3
172.16.203.131 cluster4
On all nodes, update the package index, install SSH, and open the hosts file:
sudo apt-get update
sudo apt-get install ssh
sudo vim /etc/hosts
And paste this to the end of the file:
/etc/hosts
172.16.203.128 cluster1
172.16.203.129 cluster2
172.16.203.130 cluster3
172.16.203.131 cluster4
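An optional quick check (not part of the original steps) that the hostnames now resolve on every node:
# should print the IP from /etc/hosts for each name
for h in cluster1 cluster2 cluster3 cluster4; do getent hosts $h; done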
Copy the .pem key file from your local machine to the AWS instance:
scp -i {pem_file} {pem_file_to_send} {username}@{dns_address_name}:~/.
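For example, with a hypothetical key file named mykey.pem and a hypothetical public DNS name for cluster1, the command might look like:
# mykey.pem and the DNS name below are placeholders - substitute your own values
scp -i mykey.pem mykey.pem ubuntu@ec2-3-15-20-100.us-east-2.compute.amazonaws.com:~/.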
Add an inbound rule to the security group we're using to allow access from cluster1. For example, if the IP of cluster1 is 172.31.1.136, add an inbound rule that allows access from that address.
Generate an SSH key pair on cluster1:
ssh-keygen -t rsa -P '' -f /home/ubuntu/.ssh/id_rsa
cat /home/ubuntu/.ssh/id_rsa.pub >> /home/ubuntu/.ssh/authorized_keys
And copy it to cluster2, cluster3, cluster4 by using scp:
scp -i path/to/keyfile.pem /home/ubuntu/.ssh/authorized_keys cluster2:/home/ubuntu/.ssh/authorized_keys
scp -i path/to/keyfile.pem /home/ubuntu/.ssh/authorized_keys cluster3:/home/ubuntu/.ssh/authorized_keys
scp -i path/to/keyfile.pem /home/ubuntu/.ssh/authorized_keys cluster4:/home/ubuntu/.ssh/authorized_keys
(or just manually copy the authorized_keys to the worker nodes)
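Once the authorized_keys file is in place, a quick sanity check (not in the original write-up) is to confirm passwordless SSH from cluster1 to each worker:
# each command should print the worker's hostname without asking for a password;
# the first connection may ask you to accept the host key fingerprint
for h in cluster2 cluster3 cluster4; do ssh $h hostname; done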
Install Java 8 on all nodes:
sudo apt-get -y install openjdk-8-jdk-headless
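To verify the JDK is installed (optional check):
# should report an openjdk 1.8 version
java -version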
Download Hadoop 2.7.3 on all nodes:
mkdir /tmp/hadoop/ && cd /tmp/hadoop/
wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz
Unzip the archive:
tar -xzf hadoop-2.7.3.tar.gz
Move the extracted directory to the home directory and rename it for brevity:
mv hadoop-2.7.3 /home/ubuntu/hadoop
Update hadoop-env.sh in all nodes:
vim /home/ubuntu/hadoop/etc/hadoop/hadoop-env.sh
In the hadoop-env.sh file, find the line that starts with export JAVA_HOME= and replace it with the line below. If it is not found, add the line at the end of the file.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
Update core-site.xml in all nodes:
vim /home/ubuntu/hadoop/etc/hadoop/core-site.xml
The full content of core-site.xml in all nodes:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://cluster1:9000</value>
  </property>
</configuration>
Update hdfs-site.xml in all nodes (the content on the master node will be different):
vim /home/ubuntu/hadoop/etc/hadoop/hdfs-site.xml
The full content of hdfs-site.xml in cluster1:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/home/ubuntu/hadoop/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/home/ubuntu/hadoop/hdfs/datanode</value>
  </property>
</configuration>
The full content of hdfs-site.xml in cluster2, cluster3, cluster4:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/home/ubuntu/hadoop/hdfs/datanode</value>
  </property>
</configuration>
Update yarn-site.xml in all nodes:
vim /home/ubuntu/hadoop/etc/hadoop/yarn-site.xml
The full content of yarn-site.xml in all nodes:
<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>cluster1:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>cluster1:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>cluster1:8050</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.acl.enable</name>
    <value>0</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>cluster1</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>16384</value>
  </property>
</configuration>
Update mapred-site.xml in all nodes (if the file does not exist, create it):
vim /home/ubuntu/hadoop/etc/hadoop/mapred-site.xml
The full content of mapred-site.xml in all nodes:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>cluster1:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>hostName:19888</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
  </property>
</configuration>
Edit the slaves (workers) file in all nodes:
vim /home/ubuntu/hadoop/etc/hadoop/slaves
And add the following lines (delete localhost if such a line exists):
[slaves file]
cluster1
cluster2
cluster3
cluster4
Edit the masters file in all nodes:
vim /home/ubuntu/hadoop/etc/hadoop/masters
And add the following line (delete localhost if such a line exists):
[masters file]
cluster1
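Since the same configuration has to exist on every node (except hdfs-site.xml, which differs on the master), one way to avoid editing each file by hand is to edit the files on cluster1 and push them to the workers with scp; a sketch, assuming passwordless SSH is already set up:
# hdfs-site.xml is excluded on purpose because the worker version differs from the master's
for h in cluster2 cluster3 cluster4; do
  scp /home/ubuntu/hadoop/etc/hadoop/hadoop-env.sh \
      /home/ubuntu/hadoop/etc/hadoop/core-site.xml \
      /home/ubuntu/hadoop/etc/hadoop/yarn-site.xml \
      /home/ubuntu/hadoop/etc/hadoop/mapred-site.xml \
      /home/ubuntu/hadoop/etc/hadoop/slaves \
      /home/ubuntu/hadoop/etc/hadoop/masters \
      $h:/home/ubuntu/hadoop/etc/hadoop/
done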
Edit .bashrc in all nodes to add the Hadoop environment variables:
vim /home/ubuntu/.bashrc
And add these lines to the end of the file:
export HADOOP_HOME="/home/ubuntu/hadoop"
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
Now load the environment variables into the current session:
source /home/ubuntu/.bashrc
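To confirm the variables took effect, you can check that the hadoop binary is now on the PATH (optional check):
# should print Hadoop 2.7.3
hadoop version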
In cluster1:
sudo rm -rf /home/ubuntu/hadoop/hdfs/namenode/
sudo rm -rf /home/ubuntu/hadoop/hdfs/datanode/
sudo mkdir -p /home/ubuntu/hadoop/hdfs/namenode/
sudo mkdir -p /home/ubuntu/hadoop/hdfs/datanode/
sudo chown ubuntu:ubuntu /home/ubuntu/hadoop/hdfs/namenode/
sudo chown ubuntu:ubuntu /home/ubuntu/hadoop/hdfs/datanode/
sudo chmod 777 /home/ubuntu/hadoop/hdfs/namenode/
sudo chmod 777 /home/ubuntu/hadoop/hdfs/datanode/
In cluster2, cluster3, cluster4:
sudo rm -rf /home/ubuntu/hadoop/hdfs/datanode/
sudo mkdir -p /home/ubuntu/hadoop/hdfs/datanode/
sudo chown ubuntu:ubuntu /home/ubuntu/hadoop/hdfs/datanode/
sudo chmod 777 /home/ubuntu/hadoop/hdfs/datanode/
I set chmod to 777 for easy access. You can change it if you want.
On cluster1, format the NameNode and start HDFS and YARN:
hdfs namenode -format
start-dfs.sh && start-yarn.sh
You should see the daemons starting up on each node. Running jps on cluster1 should list the master daemons (NameNode, SecondaryNameNode, ResourceManager) along with a DataNode and NodeManager, while running jps on cluster2, cluster3, and cluster4 should list a DataNode and NodeManager. Update the firewall rules so that ports 50070 and 8088 are accessible:
sudo ufw allow 50070
sudo ufw allow 8088
Then add inbound rules to the security group for ports 50070 and 8088 so they can be accessed from the internet (in this case, from my IP address).
By accessing http://${cluster1}:50070 you should see the HDFS web UI (where ${cluster1} is the IP you can retrieve from the AWS console). By accessing http://${cluster1}:8088 you should see the YARN web UI.
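You can also verify from the command line that all four DataNodes registered with the NameNode (an optional check, not in the original steps):
# the "Live datanodes" section should report 4 nodes
hdfs dfsadmin -report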
Note that if you run into the issue where the DataNode is not up and running on cluster1, you will need to remove the following directories on cluster1:
sudo rm -rf /home/ubuntu/hadoop/hdfs/namenode/
sudo rm -rf /home/ubuntu/hadoop/hdfs/datanode/
Then create the directories again and set write permissions on them:
sudo mkdir -p /home/ubuntu/hadoop/hdfs/namenode/
sudo mkdir -p /home/ubuntu/hadoop/hdfs/datanode/
sudo chown ubuntu:ubuntu /home/ubuntu/hadoop/hdfs/namenode/
sudo chown ubuntu:ubuntu /home/ubuntu/hadoop/hdfs/datanode/
sudo chmod 777 /home/ubuntu/hadoop/hdfs/namenode/
sudo chmod 777 /home/ubuntu/hadoop/hdfs/datanode/
Writing and reading to HDFS is done with the hdfs dfs command. First, manually create your home directory; all other commands will use paths relative to this default home directory. (Note that ubuntu is my logged-in user. If you log in with a different user, use your username instead of ubuntu.)
hdfs dfs -mkdir -p /user/ubuntu/
Download a sample book file:
wget -O alice.txt https://www.gutenberg.org/files/11/11-0.txt
Upload the downloaded file to HDFS using -put:
hdfs dfs -mkdir books
hdfs dfs -put alice.txt books/alice.txt
List the files on HDFS:
hdfs dfs -ls books/
You can also check it in the HDFS web UI.
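To read the file back from HDFS (optional check):
# prints the first 20 lines of the uploaded file
hdfs dfs -cat books/alice.txt | head -n 20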
To stop the cluster, run:
stop-yarn.sh && stop-dfs.sh
Next, download Spark 2.4.0 and extract it into the home directory:
cd /home/ubuntu/
wget https://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
tar -xzf spark-2.4.0-bin-hadoop2.7.tgz
mv spark-2.4.0-bin-hadoop2.7 spark
Open the .bashrc file in vim:
vim /home/ubuntu/.bashrc
Add the below variables to the end of the file:
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_HOME=/home/ubuntu/spark
export PATH=$PATH:$SPARK_HOME/bin
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH
Now load the environment variables into the current session by running the command below:
source /home/ubuntu/.bashrc
If you added them to the .profile file instead, restart your session by logging out and logging in again.
Run the SparkPi example locally to verify the Spark installation:
spark-submit --master local[*] --deploy-mode client --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.0.jar 10
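Once HDFS and YARN have been started again, you can submit the same example to the cluster instead of local mode to confirm the YARN integration; a sketch using the same jar:
# requires start-dfs.sh && start-yarn.sh to have been run on cluster1
spark-submit --master yarn --deploy-mode client --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.0.jar 10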
Note: when running with VEP, you need to run hdfs dfsadmin -safemode leave to disable safe mode.
Install the dependencies required by VEP (Ensembl Variant Effect Predictor):
sudo apt-get update
sudo apt-get install libdbi-perl gcc libdbd-mysql-perl perl=5.26.1-6ubuntu0.5 gcc=4:7.4.0-1ubuntu2.3 g++=4:7.4.0-1ubuntu2.3 make=4.1-9.1ubuntu1 libbz2-dev=1.0.6-8.1ubuntu0.2 liblzma-dev=5.2.2-1.3 libpng-dev=1.6.34-1ubuntu0.18.04.2 uuid-dev=2.31.1-0.4ubuntu3.7 cpanminus libmysqlclient-dev mysql-server unzip=6.0-21ubuntu1.1 git make unzip libpng-dev uuid-dev bcftools
sudo cpanm Archive::Zip
sudo cpanm Archive::Extract
sudo cpanm DBD::mysql
sudo cpanm Set::IntervalTree
sudo cpanm JSON
sudo cpanm PerlIO::gzip
Build the UCSC Kent source tree, which the Bio::DB::BigFile Perl module needs:
wget http://hgdownload.cse.ucsc.edu/admin/jksrc.zip
unzip jksrc.zip
cd kent/src/lib
export MACHTYPE=i686
make
cd ..
export KENT_SRC=`pwd`
sudo cpanm Bio::DB::BigFile
Clone and install VEP (release 100):
git clone https://github.com/Ensembl/ensembl-vep.git
cd ensembl-vep
git checkout release/100
perl INSTALL.pl
Notes (options chosen during INSTALL.pl):
- Cache version: 100
- GRCh38
- Fasta: Homo sapiens
- Plugins: All
- All other configs: default
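Once the installer finishes, a quick sanity check (not from the original steps) is to make sure the vep script runs:
# should print VEP's usage/help text
/home/ubuntu/ensembl-vep/vep --help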
Copy the data file from your local machine to the AWS EC2 instance:
scp -i ${pem_file} ${path_to_data_file_local} ${user_name_ec2_machine}@${ec2_ip}:${path_to_folder_ec2}
/home/ubuntu/ensembl-vep/vep --format vcf --no_stats --force_overwrite --dir_cache /home/ubuntu/.vep --offline --vcf --vcf_info_field ANN --buffer_size 60000 --phased --hgvsg --hgvs --symbol --variant_class --biotype --gene_phenotype --regulatory --ccds --transcript_version --tsl --appris --canonical --protein --uniprot --domains --sift b --polyphen b --check_existing --af --max_af --af_1kg --af_gnomad --minimal --allele_number --pubmed --fasta /home/ubuntu/.vep/homo_sapiens/100_GRCh38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz --input_file ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf --output_file f1_b60000.vcf
Install Scala and sbt, which are needed to build va-spark:
sudo apt-get update
sudo apt-get install scala=2.11.12-4~18.04 zip=3.0-11build1
curl -s "https://get.sdkman.io" | bash
source "$HOME/.sdkman/bin/sdkman-init.sh"
sdk install sbt 1.3.8
Clone and build va-spark:
git clone https://github.com/variant-annotation/va-spark.git
cd va-spark/
git checkout fix/snpeff
sbt assembly
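The spark-submit command below reads its input from HDFS (/user/ubuntu/data/...) and writes its output to HDFS, so the VCF has to be uploaded first; assuming the file sits in the current directory on cluster1, something like:
# create the HDFS data directory and upload the input VCF
hdfs dfs -mkdir -p /user/ubuntu/data
hdfs dfs -put ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf /user/ubuntu/data/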
(time spark-submit --master yarn --deploy-mode cluster --executor-memory 4g --num-executors 4 --executor-cores 2 /home/ubuntu/vep-spark/target/scala-2.11/vepspark_2.11-0.1.jar --vep_dir /home/ubuntu/ensembl-vep/vep --format vcf --no_stats --force_overwrite --cache_dir /home/ubuntu/.vep --offline --vcf --vcf_info_field ANN --phased --hgvsg --hgvs --symbol --variant_class --biotype --gene_phenotype --regulatory --ccds --transcript_version --tsl --appris --canonical --protein --uniprot --domains --sift b --polyphen b --check_existing --af --max_af --af_1kg --af_gnomad --minimal --allele_number --pubmed --fasta /home/ubuntu/.vep/homo_sapiens/100_GRCh38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz --input_file /user/ubuntu/data/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf --output_file /user/output/vs_10gb_nop34_r4_non4_442.vcf) &> time_vs_10gb_nop34_r8_non4_442.txt