Apache Hudi (pronounced Hoodie) stands for Hadoop Upserts Deletes and Incrementals
.
Hudi manages the storage of large analytical datasets on DFS (Cloud stores, HDFS or any Hadoop FileSystem compatible storage).
- Upsert support with fast, pluggable indexing
- Atomically publish data with rollback support
- Snapshot isolation between writer & queries
- Savepoints for data recovery
- Manages file sizes, layout using statistics
- Async compaction of row & columnar data
- Timeline metadata to track lineage
- Optimize data lake layout with clustering
Hudi supports three types of queries:
- Snapshot Query - Provides snapshot queries on real-time data, using a combination of columnar & row-based storage (e.g Parquet + Avro).
- Incremental Query - Provides a change stream with records inserted or updated after a point in time.
- Read Optimized Query - Provides excellent snapshot query performance via purely columnar storage (e.g. Parquet).
Learn more about Hudi at https://hudi.apache.org
Prerequisites for building Apache Hudi:
- Unix-like system (like Linux, Mac OS X)
- Java 8 (Java 9 or 10 may work)
- Git
- Maven (>=3.3.1)
# Checkout code and build
git clone https://github.com/apache/hudi.git && cd hudi
mvn clean package -DskipTests
# Start command
spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
--jars `ls packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-*.*.*-SNAPSHOT.jar` \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
To build the Javadoc for all Java and Scala classes:
# Javadoc generated under target/site/apidocs
mvn clean javadoc:aggregate -Pjavadocs
The default Scala version supported is 2.11. To build for Scala 2.12 version, build using scala-2.12
profile
mvn clean package -DskipTests -Dscala-2.12
The default Spark version supported is 2.4.4. To build for different Spark 3 versions, use the corresponding profile
# Build against Spark 3.2.0 (the default build shipped with the public Spark 3 bundle)
mvn clean package -DskipTests -Dspark3
# Build against Spark 3.1.2
mvn clean package -DskipTests -Dspark3.1.x
The default hudi-jar bundles spark-avro module. To build without spark-avro module, build using spark-shade-unbundle-avro
profile
# Checkout code and build
git clone https://github.com/apache/hudi.git && cd hudi
mvn clean package -DskipTests -Pspark-shade-unbundle-avro
# Start command
spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
--packages org.apache.spark:spark-avro_2.11:2.4.4 \
--jars `ls packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-*.*.*-SNAPSHOT.jar` \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
Unit tests can be run with maven profile unit-tests
.
mvn -Punit-tests test
Functional tests, which are tagged with @Tag("functional")
, can be run with maven profile functional-tests
.
mvn -Pfunctional-tests test
To run tests with spark event logging enabled, define the Spark event log directory. This allows visualizing test DAG and stages using Spark History Server UI.
mvn -Punit-tests test -DSPARK_EVLOG_DIR=/path/for/spark/event/log
Please visit https://hudi.apache.org/docs/quick-start-guide.html to quickly explore Hudi's capabilities using spark-shell.