Update code to support newer java versions #586

Open
wants to merge 6 commits into
base: master

Conversation

@luisfponce luisfponce commented Jun 5, 2019

Compile HiBench using JDK 11 for Hadoop 3.2.0 and Spark 2.4.0

supporting the following benchmarks:

  • sparkbench
  • hadoopbench

Environment variables: JAVA_HOME=/usr/lib/jvm/java-1.11.0-openjdk/
Compile command: mvn clean package -Psparkbench -Phadoopbench -Dhadoop=3.2 -Dspark=2.4 -Dscala=2.12 -Dexclude-streaming

Log:

[INFO] Reactor Summary:
[INFO] 
[INFO] hibench 7.1-SNAPSHOT ............................... SUCCESS [  0.188 s]
[INFO] hibench-common 7.1-SNAPSHOT ........................ SUCCESS [  5.859 s]
[INFO] HiBench data generation tools 7.1-SNAPSHOT ......... SUCCESS [ 11.859 s]
[INFO] sparkbench 7.1-SNAPSHOT ............................ SUCCESS [  0.014 s]
[INFO] sparkbench-common 7.1-SNAPSHOT ..................... SUCCESS [  7.313 s]
[INFO] sparkbench micro benchmark 7.1-SNAPSHOT ............ SUCCESS [  4.936 s]
[INFO] sparkbench machine learning benchmark 7.1-SNAPSHOT . SUCCESS [  8.397 s]
[INFO] sparkbench-websearch 7.1-SNAPSHOT .................. SUCCESS [  4.023 s]
[INFO] sparkbench-graph 7.1-SNAPSHOT ...................... SUCCESS [  6.131 s]
[INFO] sparkbench-sql 7.1-SNAPSHOT ........................ SUCCESS [  3.402 s]
[INFO] sparkbench project assembly 7.1-SNAPSHOT ........... SUCCESS [  9.242 s]
[INFO] hadoopbench 7.1-SNAPSHOT ........................... SUCCESS [  0.003 s]
[INFO] hadoopbench-sql 7.1-SNAPSHOT ....................... SUCCESS [  2.297 s]
[INFO] mahout 7.1-SNAPSHOT ................................ SUCCESS [  5.024 s]
[INFO] PEGASUS: A Peta-Scale Graph Mining System 2.0-SNAPSHOT SUCCESS [  0.942 s]
[INFO] nutchindexing 7.1-SNAPSHOT ......................... SUCCESS [  4.124 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------                                                   

Compile HiBench using JDK 1.8 for hadoop 3.2.0 and spark 2.4.0

supporting the following benchmarks:

  • sparkbench
  • flinkbench
  • hadoopbench
  • stormbench
  • gearpumpbench

Environment variables: JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk/
Compile command: mvn clean package -Dhadoop=3.2 -Dspark=2.4 -Dscala=2.11

Log:

[INFO] Reactor Summary:
[INFO] 
[INFO] hibench 7.1-SNAPSHOT ............................... SUCCESS [  0.149 s]
[INFO] hibench-common 7.1-SNAPSHOT ........................ SUCCESS [  7.683 s]
[INFO] HiBench data generation tools 7.1-SNAPSHOT ......... SUCCESS [ 11.872 s]
[INFO] sparkbench 7.1-SNAPSHOT ............................ SUCCESS [  0.013 s]
[INFO] sparkbench-common 7.1-SNAPSHOT ..................... SUCCESS [  7.477 s]
[INFO] sparkbench micro benchmark 7.1-SNAPSHOT ............ SUCCESS [  5.131 s]
[INFO] sparkbench machine learning benchmark 7.1-SNAPSHOT . SUCCESS [ 10.215 s]
[INFO] sparkbench-websearch 7.1-SNAPSHOT .................. SUCCESS [  3.450 s]
[INFO] sparkbench-graph 7.1-SNAPSHOT ...................... SUCCESS [  7.457 s]
[INFO] sparkbench-sql 7.1-SNAPSHOT ........................ SUCCESS [  3.747 s]
[INFO] sparkbench-streaming 7.1-SNAPSHOT .................. SUCCESS [  5.236 s]
[INFO] sparkbench project assembly 7.1-SNAPSHOT ........... SUCCESS [ 10.339 s]
[INFO] flinkbench 7.1-SNAPSHOT ............................ SUCCESS [  0.003 s]
[INFO] flinkbench-streaming 7.1-SNAPSHOT .................. SUCCESS [  7.554 s]
[INFO] gearpumpbench 7.1-SNAPSHOT ......................... SUCCESS [  0.003 s]
[INFO] gearpumpbench-streaming 7.1-SNAPSHOT ............... SUCCESS [  7.013 s]
[INFO] hadoopbench 7.1-SNAPSHOT ........................... SUCCESS [  0.002 s]
[INFO] hadoopbench-sql 7.1-SNAPSHOT ....................... SUCCESS [  1.663 s]
[INFO] mahout 7.1-SNAPSHOT ................................ SUCCESS [  4.487 s]
[INFO] PEGASUS: A Peta-Scale Graph Mining System 2.0-SNAPSHOT SUCCESS [  0.817 s]
[INFO] nutchindexing 7.1-SNAPSHOT ......................... SUCCESS [  3.344 s]
[INFO] stormbench 7.1-SNAPSHOT ............................ SUCCESS [  0.001 s]
[INFO] stormbench-streaming 7.1-SNAPSHOT .................. SUCCESS [  2.136 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------

@luisfponce luisfponce force-pushed the update_code_to_support_newer_java_versions branch 9 times, most recently from b20b1f1 to cd25224 Compare June 6, 2019 00:29
@luisfponce
Author

luisfponce commented Jun 6, 2019

Hi @carsonwang,

I work at Intel (SSP) on the Data Analytics Reference Stack.
I'm wondering whether it is possible to merge this, since we want to build HiBench with the latest Java version to test Spark and Hadoop.

Best regards,
Luis

@carsonwang
Collaborator

@luisfponce , thank you for working on this. We are reviewing and validating this.

.travis.yml (resolved)
.travis.yml (outdated, resolved)
@luisfponce luisfponce force-pushed the update_code_to_support_newer_java_versions branch 15 times, most recently from 731dcac to 6455c21 Compare June 24, 2019 21:30
@carsonwang carsonwang mentioned this pull request Jul 2, 2019
.travis.yml (outdated, resolved)
.travis.yml (resolved)
sparkbench/common/pom.xml (outdated, resolved)
@luisfponce luisfponce force-pushed the update_code_to_support_newer_java_versions branch 3 times, most recently from 90502d6 to 51e5c71 Compare July 24, 2019 22:55
@carsonwang
Collaborator

@gczsjdy , can you help take a look at the latest update?

@gcz2022
Contributor

gcz2022 commented Jul 31, 2019

@carsonwang No problem.

.travis.yml (resolved)
README.md Outdated
- Hadoop: Apache Hadoop 2.x, CDH5, HDP
- Spark: Spark 1.6.x, Spark 2.0.x, Spark 2.1.x, Spark 2.2.x
### Supported Hadoop/Spark releases: ###
- Hadoop: Apache Hadoop 2.x, 3.x, CDH5, HDP
Contributor

Did you test Hadoop 3.0, 3.1, 3.3?
Otherwise 2.x, 3.2?

Why do you separate streaming/non-streaming frameworks? I don't see a very good reason.

Author

Q: Did you test Hadoop 3.0, 3.1, 3.3?
A: No, you're right. It should say 2.x, 3.2 instead.

Q: Why do you separate streaming/non-streaming frameworks?
A: Because Scala < 2.12 does not compile on JDK 11, and Scala 2.12
requires bumping the org.apache.kafka package from 0.8.2.1 to at least 0.10.2.2, after which all the code related to Kafka and streaming testing must be ported.
That Kafka version (0.10.2.2) requires modifying the following classes:

  • KafkaCollector.scala
  • KafkaConsumer.scala
  • MetricsUtil.scala

So, bottom line, as mentioned in my previous comment to @carsonwang, the streaming and non-streaming frameworks were split to avoid breaking the streaming benchmarks on Scala 2.11 and 2.10.

Author

Done: Otherwise 2.x, 3.2.

mvn clean package -Dhadoop=3.2 -Dspark=2.4 -Dscala=2.11

Supported frameworks only: hadoopbench, sparkbench, (Not yet tested flinkbench, stormbench, gearpumpbench)
Supported modules includes: micro, ml(machine learning), websearch and graph (not tested streaming and structuredStreaming) (Does not support sql)
Contributor

I think all modules can be built under JDK8? We normally use 8 in our environment.

Author

Spark 2.4 won't support the SQL benchmarks, since Hive is not used anymore.
I can be more specific about this and document the Spark versions for which the SQL benchmarks are not supported.

Contributor

I'm not sure about the Not yet tested part, leaving it on master seems... @carsonwang

Author

I got rid of it, to avoid causing noise in master.

Supported modules includes: micro, ml(machine learning), websearch and graph (not tested streaming and structuredStreaming) (Does not support sql)

### Build using JDK 1.11
If you are interested in building using Java 11 indicate that streaming benchmarks won't be compiled also, specify scala, spark and hadoop version as below
Contributor

Should we also specify:

  • Which scala version(besides 2.12) is compatible with JDK11?
  • Which Hadoop/Spark version is compatible with JDK11?
    About the streaming benchmarks support, I think it's okay to lack some streaming(Flink, Gearpump, Spark Streaming, but not Structured Streaming) support on new versions, as long as we pointed it out clearly.

Author

Q: Which Hadoop/Spark version is compatible with JDK11?
A: This is not my area, but the documentation could be more specific if required.
At the least, it can be written down that Scala 2.12 + JDK 11 + Spark 2.4 (compiled with Scala 2.12) works, excluding the streaming and SQL benchmarks.

Contributor

Yes, please indicate that.

Author

Done

mvn clean package -Psparkbench -Phadoopbench -Dhadoop=3.2 -Dspark=2.4 -Dscala=2.12 -Dexclude-streaming

Supported frameworks only: hadoopbench, sparkbench (Does not support flinkbench, stormbench, gearpumpbench)
Supported modules includes: micro, ml(machine learning), websearch and graph (does not support streaming, structuredStreaming and sql)
Contributor

  • What problem did SQL module meet? It's an essential part of Spark, leaving it alone makes not much sense : )
  • Structured Streaming is a part of SQL, so making SQL work can also benefit SS

Author

Q: What problem did SQL module meet?
A: In newer versions of Spark, HiveContext is deprecated. I can point out in the documentation that SQLBench won't work if -Dspark=2.4 or a later version is required.
(Again, this needs an update/port of the ScalaSparkSQLBench.scala code.)

HiveContext is deprecated

In Spark 2, HiveContext is deprecated. Replace all usage with an instantiation of the singleton SparkSession:

    val spark: SparkSession = SparkSession.builder
      .config(conf)
      .enableHiveSupport()
      .getOrCreate()

Most functionality of HiveContext is now available directly on the SparkSession instance. Note that, if you need them, SparkContext and SQLContext are now properties of SparkSession:

    val sc = spark.sparkContext
    val sqlContext = spark.sqlContext

Info Source
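
Before porting, it may help to see which sources still reference the deprecated class. A minimal sketch, assuming a hypothetical helper named `find_hivecontext_usages` (not part of this PR) and a `grep` that supports `-r` and `--include`:

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the Scala sources under a directory that still
# mention HiveContext. Prints one path per line; exits non-zero if none match.
find_hivecontext_usages() {
    local root="${1:-.}"
    grep -rl --include='*.scala' 'HiveContext' "$root"
}
```

Running `find_hivecontext_usages sparkbench` from the repository root would then print the files that need the SparkSession port.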

Contributor

Thanks, I got it.
I think supporting Spark 2.4 without the SQL module is quite weird. I can think of 2 ways:

  • Drop Spark 1.6 support and modify ScalaSparkSQLBench.scala to use SparkSession, which was introduced in Spark 2.0
  • Create another separate ScalaSparkSQLBench, deciding which class to use by Spark version

I like the first one better; a newer HiBench version should drop some of the old codebase. cc @carsonwang

Contributor

@gcz2022 gcz2022 left a comment

Thanks @luisfponce , I left some comments.

.travis.yml Outdated
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
Contributor

Does it have to be root? What if we don't set these environment variables?

Author

@luisfponce luisfponce Aug 22, 2019

Q: Does it have to be root?
A: Not really, it depends on the user.

Q: What if we don't set these environment variables?
A: If those variables are not set (when starting only the Hadoop 3.2 services), I get:

  • start-dfs.sh:
Starting namenodes on [localhost]
ERROR: Attempting to operate on hdfs namenode as root
ERROR: but there is no HDFS_NAMENODE_USER defined. Aborting operation.
Starting datanodes
ERROR: Attempting to operate on hdfs datanode as root
ERROR: but there is no HDFS_DATANODE_USER defined. Aborting operation.
Starting secondary namenodes [ubuntu-hib]
ERROR: Attempting to operate on hdfs secondarynamenode as root
ERROR: but there is no HDFS_SECONDARYNAMENODE_USER defined. Aborting operation.
  • start-yarn.sh
Starting resourcemanager
ERROR: Attempting to operate on yarn resourcemanager as root
ERROR: but there is no YARN_RESOURCEMANAGER_USER defined. Aborting operation.
Starting nodemanagers
ERROR: Attempting to operate on yarn nodemanager as root
ERROR: but there is no YARN_NODEMANAGER_USER defined. Aborting operation.

However, I moved:

HDFS_NAMENODE_USER=$USER
HDFS_DATANODE_USER=$USER
HDFS_SECONDARYNAMENODE_USER=$USER
YARN_RESOURCEMANAGER_USER=$USER
YARN_NODEMANAGER_USER=$USER

to hadoop-env.sh, and now it is user agnostic, and travis.yml looks cleaner.
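
A minimal sketch of that move, assuming a hypothetical helper named `append_hadoop_users` (not part of this PR) that receives the path to `hadoop-env.sh` as its argument. The `export` keyword is also an assumption, carried over from the earlier `.travis.yml` lines:

```shell
#!/usr/bin/env bash
# Hypothetical helper: append the per-daemon user variables that the
# Hadoop 3.x start scripts ask for to the given hadoop-env.sh.
append_hadoop_users() {
    local env_file="$1"
    local name
    for name in HDFS_NAMENODE_USER HDFS_DATANODE_USER \
                HDFS_SECONDARYNAMENODE_USER \
                YARN_RESOURCEMANAGER_USER YARN_NODEMANAGER_USER; do
        # Single quotes keep $USER literal in the file, so it is expanded
        # only when Hadoop sources hadoop-env.sh.
        printf 'export %s=$USER\n' "$name" >> "$env_file"
    done
}
```

Because `$USER` is written literally, the same file works for whichever account starts the services.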

Contributor

Great!

.travis.yml Outdated
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_INSTALL=$HADOOP_HOME
export SPARK_DIST_CLASSPATH=$(/opt/$HADOOP_BINARIES_FOLDER/bin/hadoop classpath)
Contributor

I think it's better to remove the unnecessary environment variables (lines 46-54). I suppose that even if they are not set, Spark/Hadoop will find the right HOME, and that is verified in the original Travis setup (for Spark 1.6, though).

Author

Correct, I will get rid of them.

Author

Done

.travis.yml Outdated

sudo -E ./travis/configssh.sh
sudo -E ./travis/restart_hadoop_spark.sh
sudo -E ./bin/run_all.sh
Contributor

:nit new line
And other files. : )

Author

Done

- cp ./travis/spark.conf ./conf/
- /opt/hadoop-2.6.5/bin/yarn node -list 2
- sudo -E ./bin/run_all.sh
- |
Contributor

:nit remove this

Author

@luisfponce luisfponce Aug 22, 2019

I used this pipe (literal block style) because, from my perspective, it looks cleaner when putting a script in YAML files, and it avoids writing \ on every line.

Otherwise it would look like this example:

script:
  - if [[ "$java_ver" == 11 ]]; then \
        mvn clean package -q -Psparkbench -Phadoopbench -Dmaven.javadoc.skip=true -Dhadoop=3.2 -Dspark=2.4 -Dscala=2.12 -Dmaven-compiler-plugin.version=3.8.0 -Dexclude-streaming \
    elif [[ "$java_ver" == 8 ]]; then \
        mvn clean package -q -Dmaven.javadoc.skip=true -Dhadoop=3.2 -Dspark=2.4 -Dscala=2.11 \
    elif [[ "$java_ver" == 7 ]]; then \
        mvn clean package -q -Dmaven.javadoc.skip=true -Dspark=2.2 -Dscala=2.11 \
        mvn clean package -q -Dmaven.javadoc.skip=true -Dspark=2.0 -Dscala=2.11 \
        mvn clean package -q -Dmaven.javadoc.skip=true -Dspark=1.6 -Dscala=2.10 \
    else \
        exit 1 \
    fi

  - sudo -E ./travis/configssh.sh
  - sudo -E ./travis/restart_hadoop_spark.sh
  - sudo -E ./bin/run_all.sh

instead of how it currently is:

script:
  - |
    if [[ "$java_ver" == 11 ]]; then
        mvn clean package -q -Psparkbench -Phadoopbench -Dmaven.javadoc.skip=true -Dhadoop=3.2 -Dspark=2.4 -Dscala=2.12 -Dmaven-compiler-plugin.version=3.8.0 -Dexclude-streaming
    elif [[ "$java_ver" == 8 ]]; then
        mvn clean package -q -Dmaven.javadoc.skip=true -Dhadoop=3.2 -Dspark=2.4 -Dscala=2.11
    elif [[ "$java_ver" == 7 ]]; then
        mvn clean package -q -Dmaven.javadoc.skip=true -Dspark=2.2 -Dscala=2.11
        mvn clean package -q -Dmaven.javadoc.skip=true -Dspark=2.0 -Dscala=2.11
        mvn clean package -q -Dmaven.javadoc.skip=true -Dspark=1.6 -Dscala=2.10
    else
        exit 1
    fi

    sudo -E ./travis/configssh.sh
    sudo -E ./travis/restart_hadoop_spark.sh
    sudo -E ./bin/run_all.sh

Up to you. Both ways work for me (and Mr. YAML lint indicates that both ways work too).
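
The per-JDK branching could also live in a small function, which keeps the YAML short in either style. A sketch only, assuming a hypothetical helper named `mvn_args_for_jdk`; it covers just the JDK 8 and JDK 11 flag sets from the PR description and rejects everything else:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the Maven arguments for one JDK major version,
# using the flag sets quoted in the PR description.
mvn_args_for_jdk() {
    case "$1" in
        11) echo "-Psparkbench -Phadoopbench -Dhadoop=3.2 -Dspark=2.4 -Dscala=2.12 -Dexclude-streaming" ;;
        8)  echo "-Dhadoop=3.2 -Dspark=2.4 -Dscala=2.11" ;;
        *)  echo "unsupported JDK version: $1" >&2
            return 1 ;;
    esac
}
```

A script step could then call `mvn clean package $(mvn_args_for_jdk "$java_ver")`, relying on word splitting of the unquoted substitution to separate the flags.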

Contributor

Thanks, but it seems that

    if [[ "$java_ver" == 11 ]]; then
        mvn clean package -q -Psparkbench -Phadoopbench -Dmaven.javadoc.skip=true -Dhadoop=3.2 -Dspark=2.4 -Dscala=2.12 -Dmaven-compiler-plugin.version=3.8.0 -Dexclude-streaming
    elif [[ "$java_ver" == 8 ]]; then
        mvn clean package -q -Dmaven.javadoc.skip=true -Dhadoop=3.2 -Dspark=2.4 -Dscala=2.11
    elif [[ "$java_ver" == 7 ]]; then
        mvn clean package -q -Dmaven.javadoc.skip=true -Dspark=2.2 -Dscala=2.11
        mvn clean package -q -Dmaven.javadoc.skip=true -Dspark=2.0 -Dscala=2.11
        mvn clean package -q -Dmaven.javadoc.skip=true -Dspark=1.6 -Dscala=2.10
    else
        exit 1
    fi

    sudo -E ./travis/configssh.sh
    sudo -E ./travis/restart_hadoop_spark.sh
    sudo -E ./bin/run_all.sh

without any pipes is also valid. Will the \ns be automatically escaped in Travis?

cp ./travis/artifacts/hadoop32/mapred-site.xml $HADOOP_CONF_DIR
cp ./travis/artifacts/hadoop32/yarn-site.xml $HADOOP_CONF_DIR
sed -i "s|<maven.compiler.source>1.6</maven.compiler.source>|<maven.compiler.source>1.8</maven.compiler.source>|g" pom.xml
sed -i "s|<maven.compiler.target>1.6</maven.compiler.target>|<maven.compiler.target>1.8</maven.compiler.target>|g" pom.xml
Contributor

Why not 1.11?

Author

@luisfponce luisfponce Aug 22, 2019

My bad, I will change it to Java 11 + maven compiler plugin version 3.8
Source: Choose Java Version

Author

@luisfponce luisfponce Aug 22, 2019

Done.
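
The two `sed` lines under review hardcode the old value `1.6`. A sketch of a more general rewrite, assuming a hypothetical helper named `set_compiler_level` that replaces whatever value the two properties currently hold. Note that the level for Java 11 is written as `11`:

```shell
#!/usr/bin/env bash
# Hypothetical helper: set <maven.compiler.source> and <maven.compiler.target>
# in a POM to the given level, regardless of the previous value.
set_compiler_level() {
    local level="$1" pom="$2"
    local tag
    for tag in maven.compiler.source maven.compiler.target; do
        # -i.bak is accepted by both GNU and BSD sed; the backup is removed below.
        sed -i.bak "s|<${tag}>[^<]*</${tag}>|<${tag}>${level}</${tag}>|g" "$pom"
    done
    rm -f "${pom}.bak"
}
```

For example, `set_compiler_level 11 pom.xml` would update both properties in one call.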

.travis.yml (resolved)

### Build using JDK 1.8
If you are interested in building using Java 11 specify scala, spark and hadoop version as below

mvn clean package -Dhadoop=3.2 -Dspark=2.4 -Dscala=2.11
Contributor

This might be misleading that Java 8 can only be used with the specified Scala/Hadoop/Spark version. I think we can drop this section and only leave JDK 11 section.

Author

No problem.

Author

Done

@luisfponce luisfponce force-pushed the update_code_to_support_newer_java_versions branch from 51e5c71 to bc996f9 Compare August 22, 2019 18:10
* sparkbench/assembly/pom.xml:
	* Changed the activation property name on the `allModules` profile.
	* Added a new profile that excludes the `sparkbench-streaming` artifact.

* sparkbench/pom.xml:
	* Changed the activation property name on the `allModules` profile.
	* Added a new profile that excludes the `streaming` module.
	* Added profile spark2.4, since spark-core_2.12 is only available for version 2.4.0 and later.
	* Added profile scala 2.12. Scala < 2.12 does not compile on JDK 11.
	* Added profile hadoop3.2 to propagate this variable to all Spark benchmarks.

* sparkbench/streaming/pom.xml:
	* Added profile spark2.4 to the sparkbench-streaming POM with spark-streaming-kafka-0-8_2.11 version 2.4.0.

Signed-off-by: Luis Ponce <luis.f.ponce.navarro@linux.intel.com>
@luisfponce luisfponce force-pushed the update_code_to_support_newer_java_versions branch from bc996f9 to a300f3b Compare August 22, 2019 19:01
@luisfponce
Author

luisfponce commented Aug 26, 2019

Hi @carsonwang, @gczsjdy

Important questions here:

Apache Hadoop 3.x supports only Java 8 according to the official Hadoop Java Versions page, and Java 11 support is a work in progress.

So, if HiBench is built with source and target set to JDK 11 and then run in Travis CI, we get:

Exception in thread "main" java.lang.UnsupportedClassVersionError: 
HiBench/DataGen has been compiled by a more recent version of the Java Runtime (class file version 55.0), 
this version of the Java Runtime only recognizes class file versions up to 52.0
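
The two class file versions in that message map to Java releases by a fixed offset: 52 is Java 8 and 55 is Java 11, so a Java 8 runtime rejects classes compiled for Java 11. A small sketch, assuming a hypothetical helper named `java_release_for_class_version`:

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a class file version to the Java release that
# produces it. The release is the major version minus 44
# (52 -> Java 8, 55 -> Java 11).
java_release_for_class_version() {
    local major="${1%%.*}"   # drop the minor part, e.g. "55.0" -> "55"
    echo $(( major - 44 ))
}
```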

Nevertheless, Clearlinux (and possibly other clients that have patched Hadoop too) has compiled Hadoop 3.2 and Spark 2.4 using Java 11 patches, as described in the Data Analytics Reference Stack documentation.

Is there a way to compile HiBench using Java 11 but skip the testing part?

We'd like to get HiBench with JDK 11 support from upstream, and that is the reason for this PR.
By the way, the following benchmarks pass on both the Spark and Hadoop frameworks built using Java 11, on the Clearlinux DARS MKL image that I personally tested:

  • micro.sort
  • ml.bayes
  • ml.pca
  • ml.gbt
  • ml.rf
  • ml.svd
  • ml.lda
  • ml.svm
  • websearch.nutchindexing
  • graph.nweight

HiBench log at the following link:
https://gist.github.com/luisfponce/4c25c353c0e13e34556d356970766ae5

@luisfponce luisfponce force-pushed the update_code_to_support_newer_java_versions branch 6 times, most recently from a97527b to e631f22 Compare August 27, 2019 20:35
* Moved the mapred-site and yarn-site XML files to a new folder that contains the artifacts for
either Hadoop 2.6 or 3.2; those will be picked up depending on the testing
needs in travis.yml

* Moved the spark-env file to a new folder that contains the artifacts for
either Spark 1.6 or 2.4; those will be picked up depending on the testing
needs in travis.yml

* Created a hadoop-env.sh file for Hadoop 3.2 to store the required
environment variables to start the hdfs and yarn services.

* Removed hardcoded values from hadoop.conf and spark.conf; these will be
filled in depending on the testing needs.

* Added an `install_hadoop_spark` script that will download hadoop and spark
binaries depending on the testing needs.

* Added a `config_hadoop_spark` script that will setup hadoop, spark and
hibench depending on the testing needs.

* Added a `jdk_ver` script to pick up the current java version installed for
travis CI.

* `restart_hadoop_spark` script modified to be agnostic to the required
binaries for testing.

* travis/config_hadoop_spark.sh:
	* For Java 8 and 11, skipping the `sql` test since Hive is no longer
used to perform queries. Newer Spark versions perform queries using
`SparkSession` and no longer use `import org.apache.spark.sql`

* .travis.yml:
	* Added `dist: trusty` to keep using this distro; Travis picks
up xenial if not specified. Any later Ubuntu version
in Travis won't support openjdk 7.
	* Refactored the CI flow to download, set up, run and
test Hadoop and Spark depending on the required JDK version (7, 8 or
11).
	* HiBench will be configured depending on the required JDK
version (7, 8 or 11).
	* HiBench will be built depending on the required JDK
version (7, 8 or 11).
	* Benchmarks will be run for all the JDK versions set.

Signed-off-by: Luis Ponce <luis.f.ponce.navarro@linux.intel.com>
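
The `jdk_ver` script itself is not shown in this thread. A sketch of what such a helper might do, assuming a hypothetical function named `java_major_version` that reduces a Java version string to its major version and handles both the legacy `1.x` scheme and the newer one:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a jdk_ver-style helper: reduce a Java version string
# to its major version ("1.8.0_222" -> 8, "11.0.4" -> 11).
java_major_version() {
    local version="$1"
    # Legacy scheme (Java 8 and earlier): "1.<major>.<minor>_<update>".
    if [[ "$version" == 1.* ]]; then
        version="${version#1.}"
    fi
    # Keep everything before the first dot.
    echo "${version%%.*}"
}
```

The input could come from `java -version 2>&1 | awk -F '"' '/version/ {print $2; exit}'`, since `java -version` writes the quoted version string to stderr.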
* autogen/pom.xml
	* Add hadoop mr2 profile to be used for hadoop hdfs and client.

Signed-off-by: Luis Ponce <luis.f.ponce.navarro@linux.intel.com>
* docs/build-hibench.md:
	* Update 2.4 version to specify Spark Version.
	* Add Specify Hadoop version documentation.
	* Add Build using JDK 11 documentation.

* README.md:
	* Update Supported Hadoop/Spark releases to hadoop 3.2 and spark
2.4

Signed-off-by: Luis Ponce <luis.f.ponce.navarro@linux.intel.com>
@luisfponce luisfponce force-pushed the update_code_to_support_newer_java_versions branch from e631f22 to 0e48596 Compare August 27, 2019 21:07
Collaborator

@carsonwang carsonwang left a comment

@luisfponce, I noticed one issue in the pom. Others look good to me.

@@ -37,6 +37,11 @@ default . For example , if we want use spark2.0 and scala2.11 to build hibench.
package` , but for spark2.0 and scala2.10 , we need use the command `mvn -Dspark=2.0 -Dscala=2.10 clean package` .
Similarly , the spark1.6 is associated with the scala2.10 by default.

### Specify Hadoop Version ###
To specify the spark version, use -Dhadoop=xxx(3.2). By default, it builds for hadoop 2.4
Collaborator

nit: spark version -> hadoop version

@@ -159,7 +159,43 @@
</dependencies>
<activation>
<property>
<name>!modules</name>
<name>!exclude-streaming</name>
Collaborator

If a user specifies modules=xxx and doesn't specify exclude-streaming, this allModules will be activated, which is not expected.

<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>

</configuration>
Contributor

:nit empty line

<property>
<name>dfs.client.use.datanode.hostname</name>
<value>true</value>
</property>
Contributor

Thanks

@@ -28,7 +28,7 @@ Because some Maven plugins cannot support Scala version perfectly, there are som


### Specify Spark Version ###
To specify the spark version, use -Dspark=xxx(1.6, 2.0, 2.1 or 2.2). By default, it builds for spark 2.0
To specify the spark version, use -Dspark=xxx(1.6, 2.0, 2.1, 2.2 or 2.4). By default, it builds for spark 2.0
Contributor

Actually this Spark 2.4 support doesn't include SQL module?
This is the only main remaining concern for this patch, see
#586 (comment)
I think we can drop Spark 1.6 support and modify the SQL module code to support 2.4 in HiBench 8.0, whoever needs 1.6 can go to HiBench 7.0. @carsonwang

if [[ "$java_ver" == 11 ]]; then
mvn clean package -Psparkbench -Phadoopbench -Dhadoop=3.2 -Dspark=2.4 -Dscala=2.12 -Dmaven-compiler-plugin.version=3.8.0 -Dexclude-streaming
elif [[ "$java_ver" == 8 ]]; then
mvn clean package -q -Dmaven.javadoc.skip=true -Dhadoop=3.2 -Dspark=2.4 -Dscala=2.11
Contributor

Curious about this, even if we don't run SQL module tests for Spark2.4, how did the compiling work...

@william-wang

Is there any progress on this ticket? When will it be available?

@luisfponce
Author

I will pick this up again, resolve the conflicts and get back to you @william-wang @gczsjdy

@sajanraj

[ERROR] [ERROR] Some problems were encountered while processing the POMs:
[ERROR] 'dependencies.dependency.version' for org.apache.spark:spark-core_2.10:jar must be a valid version but is '${spark.version}'. @ line 22, column 16
[ERROR] 'dependencies.dependency.version' for org.apache.spark:spark-core_2.10:jar must be a valid version but is '${spark.version}'. @ line 27, column 16
[ERROR] 'dependencies.dependency.version' for org.apache.spark:spark-core_2.10:jar must be a valid version but is '${spark.version}'. @ line 26, column 16
[ERROR] 'dependencies.dependency.version' for org.apache.spark:spark-mllib_2.10:jar must be a valid version but is '${spark.version}'. @ line 32, column 16
[ERROR] 'dependencies.dependency.version' for org.apache.spark:spark-core_2.10:jar must be a valid version but is '${spark.version}'. @ line 27, column 16
[ERROR] 'dependencies.dependency.version' for org.apache.spark:spark-core_2.10:jar must be a valid version but is '${spark.version}'. @ line 27, column 16
[ERROR] 'dependencies.dependency.version' for org.apache.spark:spark-mllib_2.10:jar must be a valid version but is '${spark.version}'. @ line 33, column 16
[ERROR] 'dependencies.dependency.version' for org.apache.spark:spark-graphx_2.10:jar must be a valid version but is '${spark.version}'. @ line 39, column 16
[ERROR] 'dependencies.dependency.version' for org.apache.spark:spark-core_2.10:jar must be a valid version but is '${spark.version}'. @ line 28, column 16
[ERROR] 'dependencies.dependency.version' for org.apache.spark:spark-hive_2.10:jar must be a valid version but is '${spark.version}'. @ line 34, column 16
[ERROR] 'dependencies.dependency.version' for org.apache.spark:spark-core_2.10:jar must be a valid version but is '${spark.version}'. @ line 26, column 16
[ERROR] 'dependencies.dependency.version' for org.apache.spark:spark-streaming_2.10:jar must be a valid version but is '${spark.version}'. @ line 32, column 16
 @
[ERROR] The build could not read 7 projects -> [Help 1]
[ERROR]
[ERROR]   The project com.intel.hibench.sparkbench:sparkbench-common:7.1-SNAPSHOT (/home/sajanraj_t_d/metro/HiBench/sparkbench/common/pom.xml) has 1 error
[ERROR]     'dependencies.dependency.version' for org.apache.spark:spark-core_2.10:jar must be a valid version but is '${spark.version}'. @ line 22, column 16
[ERROR]
[ERROR]   The project com.intel.hibench.sparkbench:sparkbench-micro:7.1-SNAPSHOT (/home/sajanraj_t_d/metro/HiBench/sparkbench/micro/pom.xml) has 1 error
[ERROR]     'dependencies.dependency.version' for org.apache.spark:spark-core_2.10:jar must be a valid version but is '${spark.version}'. @ line 27, column 16
[ERROR]
[ERROR]   The project com.intel.hibench.sparkbench:sparkbench-ml:7.1-SNAPSHOT (/home/sajanraj_t_d/metro/HiBench/sparkbench/ml/pom.xml) has 2 errors
[ERROR]     'dependencies.dependency.version' for org.apache.spark:spark-core_2.10:jar must be a valid version but is '${spark.version}'. @ line 26, column 16
[ERROR]     'dependencies.dependency.version' for org.apache.spark:spark-mllib_2.10:jar must be a valid version but is '${spark.version}'. @ line 32, column 16
[ERROR]
[ERROR]   The project com.intel.hibench.sparkbench:sparkbench-websearch:7.1-SNAPSHOT (/home/sajanraj_t_d/metro/HiBench/sparkbench/websearch/pom.xml) has 1 error
[ERROR]     'dependencies.dependency.version' for org.apache.spark:spark-core_2.10:jar must be a valid version but is '${spark.version}'. @ line 27, column 16
[ERROR]
[ERROR]   The project com.intel.hibench.sparkbench:sparkbench-graph:7.1-SNAPSHOT (/home/sajanraj_t_d/metro/HiBench/sparkbench/graph/pom.xml) has 3 errors
[ERROR]     'dependencies.dependency.version' for org.apache.spark:spark-core_2.10:jar must be a valid version but is '${spark.version}'. @ line 27, column 16
[ERROR]     'dependencies.dependency.version' for org.apache.spark:spark-mllib_2.10:jar must be a valid version but is '${spark.version}'. @ line 33, column 16
[ERROR]     'dependencies.dependency.version' for org.apache.spark:spark-graphx_2.10:jar must be a valid version but is '${spark.version}'. @ line 39, column 16
[ERROR]
[ERROR]   The project com.intel.hibench.sparkbench:sparkbench-sql:7.1-SNAPSHOT (/home/sajanraj_t_d/metro/HiBench/sparkbench/sql/pom.xml) has 2 errors
[ERROR]     'dependencies.dependency.version' for org.apache.spark:spark-core_2.10:jar must be a valid version but is '${spark.version}'. @ line 28, column 16
[ERROR]     'dependencies.dependency.version' for org.apache.spark:spark-hive_2.10:jar must be a valid version but is '${spark.version}'. @ line 34, column 16

Running mvn clean package -Psparkbench -Phadoopbench -Dhadoop=3.2 -Dspark=2.4 -Dscala=2.12 -Dexclude-streaming produces the error above. Is there a fix for this?
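
Every error in the log above complains about the same literal placeholder, and the artifact names still carry the _2.10 suffix even though -Dscala=2.12 was passed. One possible reading, which is an assumption and not confirmed anywhere in this thread, is that no Spark or Scala profile matched those flags in that checkout. A small sketch for summarizing such a log before digging into the POMs; the function names unresolved_props and affected_poms are hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helpers for triaging a Maven log read from stdin.

# Print each distinct ${...} placeholder that Maven reported as an invalid version.
unresolved_props() {
  sed -n "s/.*must be a valid version but is '[$]{\([^}]*\)}'.*/\1/p" | sort -u
}

# Print each distinct pom.xml path that Maven listed as having errors.
affected_poms() {
  sed -n 's/.*(\(.*pom\.xml\)) has [0-9]* error.*/\1/p' | sort -u
}

# Example usage, assuming the build output was saved to build.log:
#   unresolved_props < build.log
#   affected_poms < build.log
```

If the first function prints only spark.version, the place to look is whichever profile is supposed to set that property for the requested Spark version.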
