DataHub v0.8.25
Known Issues
- Adding Glossary Terms to schema fields does not work with this version due to a bug. Upgrade to v0.8.26 for the fix.
Release Highlights
Buckle up, folks! v0.8.25 brings some very exciting (and highly-requested!) updates.
Notable UI-Based Features
- UI-based Ingestion - as demoed in December Town Hall, we now support creating, configuring, scheduling, & executing batch metadata ingestion using the DataHub user interface. This makes getting metadata into DataHub easier by minimizing the overhead required to operate custom integration pipelines.
- Data Domains - DataHub now supports grouping data assets into logical collections called Domains. Domains are curated, top-level folders or categories where related assets can be explicitly grouped. Read the guide here!
- Data Containers are now supported! A Container is a physical grouping of entities, e.g. a Schema is a container of one or more Datasets; a Dashboard is a container of one or more Charts.
Notable Metadata Model & Ingestion-Based Features
- Data Quality test results are now supported in the DataHub metadata model. This is the first milestone toward surfacing Dataset & Column-level Data Quality results in the UI (read full scope of work here). Future releases will include a Great Expectations integration & UI support - we’re on track to complete this in Q1 as planned.
- Avro files are now supported in the Data Lake File ingestion source
- Ingest metadata from multiple instances of the same platform type. This has been a very common use case within the Community - you can now differentiate multiple instances of the same platform type! If you already have pre-existing entries, use the datahub migrate command to migrate them over to platform instances.
- Ignore users from Top Users calculation
- BigQuery - Data Profiling on only the latest partition/shard
- (feat)(Business Glossary) add tabular schema and new UI for business glossary by @saxo-lalrishav in #3813
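To make the platform-instance item above concrete, here is a minimal sketch of why an instance id lets two deployments of the same platform coexist. It assumes the instance id is folded into the dataset name inside the URN, separated by a dot. The helper sketch_dataset_urn and the names us_cluster, eu_cluster and shop.orders are hypothetical and used only for illustration; this is not DataHub's own code.

```python
from typing import Optional


def sketch_dataset_urn(
    platform: str,
    name: str,
    env: str = "PROD",
    platform_instance: Optional[str] = None,
) -> str:
    """Hypothetical helper: build a dataset URN string for illustration."""
    # Assumption: the instance id is prepended to the dataset name with a dot.
    qualified = f"{platform_instance}.{name}" if platform_instance else name
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{qualified},{env})"


# Same platform and table name, with and without an instance id.
legacy = sketch_dataset_urn("mysql", "shop.orders")
us = sketch_dataset_urn("mysql", "shop.orders", platform_instance="us_cluster")
eu = sketch_dataset_urn("mysql", "shop.orders", platform_instance="eu_cluster")

for urn in (legacy, us, eu):
    print(urn)
```

Under that assumption, the two instance-qualified URNs differ from each other and from the pre-existing one, which is why entries ingested before this release need to be migrated, not just re-ingested.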
Notable Fixes
- Fix to support View in Looker
  * feat(looker): Adding optional Looker external url base url config by @jjoyce0510 in #3985
- fix(graphql): support group display name in ownership by @thomasplarsson in #3979
- fix(profiling): Enabling profiling for low cardinality number columns by @treff7es in #3990
- fix(ingestion): match default username for Azure OIDC and Azure ingestion source by @iasoon in #3926
DataHub Usage Guides
- docs(domains): Adding a User Guide for Domains by @jjoyce0510 in #4038
- docs(ingest): Adding UI ingestion guide by @jjoyce0510 in #4048
What's Changed
- fix(vulnerability): Upgrade gms base image by @dexter-mh-lee in #3962
- logging(frontend): Improve OIDC debug logs by @jjoyce0510 in #3967
- docs(delete): add curl request example to delete entity by @anshbansal in #3928
- fix(ingestion): match default username for Azure OIDC and Azure ingestion source by @iasoon in #3926
- Feature/dynamic platform icons by @RyanHolstien in #3968
- refactor(ingestion): remove duplicate aspect type by @hsheth2 in #3972
- fix(example): fix typo by @anshbansal in #3907
- fix(ingestion): Restrict python to <=3.9.9 by @treff7es in #3961
- feat(build): remove requirement for git directory for builds by @swaroopjagadish in #3977
- fix(ingestion): tighten conditions for restli json transformation by @hsheth2 in #3973
- fix(ingestion): don't dump variables for config errors by @hsheth2 in #3974
- Bugfix/increase socket timeout by @RyanHolstien in #3982
- feat(ingest): support for Avro data lake files by @kevinhu in #3913
- fix(build): exclude old log4j core by @RickardCardell in #3966
- fix(quickstart): Pin Quickstart version to v0.8.23. by @jjoyce0510 in #3983
- feat(looker): Adding optional Looker external url base url config by @jjoyce0510 in #3985
- fix(graphql): support group display name in ownership by @thomasplarsson in #3979
- fix(quickstart): Assign correct mysql-setup container for M1 and remove "head" default version. by @jjoyce0510 in #3987
- feat(embedded search results): support custom endpoints in embedded search result by @gabe-lyons in #3986
- fix(docker): datahub-gms - build in native, copy to target by @swaroopjagadish in #3992
- fix(ci): moving defaults back to head now that docker builds are green by @swaroopjagadish in #3993
- feat(ui): UI-based ingestion (as featured in Dec Townhall) by @jjoyce0510 in #3975
- quickstart: Adding UI ingestion to quickstart YAML by @jjoyce0510 in #3994
- feat(domains): Adding backend for Asset Domains (p1) by @jjoyce0510 in #3952
- Bug: a bug fix to bigquery_to_datahub.yml file by @dipeshmaurya in #3988
- fix(ingest): check if feature data type is present by @maaaikoool in #3932
- feat(platform-instance): a simple client-only change to support platf… by @swaroopjagadish in #3996
- docs(metadata-model): Adding to Metadata model docs by @jjoyce0510 in #3998
- Add Stash Logo & new Source Icons by @maggiehays in #4002
- feat(domains): UI for Asset Domains (p2) by @jjoyce0510 in #3995
- docs: add missing back tick for metadata-ingestion/README.md by @nickwu241 in #4003
- Bugfix/add missing classes by @RyanHolstien in #4000
- fix(superset): fix connection for redshift by @anshbansal in #3944
- fix(setup): fix setup for M1 by @anshbansal in #3958
- docs:add Optum logo by @maggiehays in #4005
- Refining Metadata Model docs further by @jjoyce0510 in #4001
- fix(docker): Alpine based multiplatform docker build for kafka-setup by @treff7es in #3991
- Bugfix/graph concurrency issue by @RyanHolstien in #4007
- feat(ingest): Add additional snowflake auth by @MikeSchlosser16 in #4009
- fix(ci): Reverting unnecessary domain test changes by @jjoyce0510 in #4013
- fix(metrics): Add metrics for mcl hooks by @dexter-mh-lee in #4008
- feat(platform) - Update FabricType enum to represent more fabrics by @aditya-radhakrishnan in #3997
- feat(ingest): emit flags and stats for profiling telemetry by @kevinhu in #3969
- fix(formatting): fix linting lib version requirement by @anshbansal in #3939
- fix(docs): fix business glossary docs by @anshbansal in #3916
- fix(profiling): Enabling profiling for low cardinality number columns by @treff7es in #3990
- fix(docs): update gms link by @lhvubtqn in #3927
- fix(ingest): lint fix a few files by @swaroopjagadish in #4016
- fix(ingest): adding platform instance urn to data platform instance aspects by @swaroopjagadish in #4015
- feat(ingest): use trino python client for sqlalchemy, supports python… by @mayurinehate in #3888
- fix(spark-lineage): select mock server port dynamically for unit test by @MugdhaHardikar-GSLab in #4018
- (feat)(Business Glossary) add tabular schema and new UI for business glossary by @saxo-lalrishav in #3813
- Test/add concurrency issue smoke test by @RyanHolstien in #4014
- feat(glossary-terms): Index glossary term custom properties by @jjoyce0510 in #3960
- feat(ingestion): Adding ability to ignore users from top users calculation by @treff7es in #3735
- Docs/remote deploy and auto render by @RyanHolstien in #4020
- fix(ingest): snowflake - Run authentication validation if default value used by @treff7es in #4024
- feat(nifi): handle provenance api variation for older versions by @mayurinehate in #4022
- feat(ingestion) bigquery: Profiling only the latest partition/shard on bigquery by @treff7es in #3930
- fix(groups): Fix UI encoding of groups with spaces in urns by @jjoyce0510 in #4021
- fix(text): fix confusing text by @anshbansal in #4025
- fix(clean): add missing cleanup by @anshbansal in #4023
- feat(containers): Backend for Asset Containers (as demo'd in townhall) by @jjoyce0510 in #4019
- fix(docs): Adding Initiate login uri to okta docs (Okta OIDC) by @jjoyce0510 in #4030
- fix: docker-compose now persists kafka broker data by @icy in #4031
- feat(ingestion): Support Kafka confluent external schema resolution by name or subject by @rslanka in #4035
- docs(domains): Adding a User Guide for Domains by @jjoyce0510 in #4038
- feat(Stateful Ingestion-3/3): Client side changes for Monitoring/Reporting by @rslanka in #3807
- feat(containers): Adding Containers UI (as demo'd in Jan Townhall) by @jjoyce0510 in #4037
- feat(users): adding user graphql mutation by @gabe-lyons in #4033
- feat(ingest): add tests for platform instance by @swaroopjagadish in #4047
- feat(model): Data quality model by @ksrinath in #3787
- Bugfix/prevent invalid urn by @RyanHolstien in #4045
- refactor(spark-lineage): remove dependency of spark from McpEmitter by @MugdhaHardikar-GSLab in #4042
- feat(analytics): add more analytics for entities by @anshbansal in #4040
- docs(ingest): Adding UI ingestion guide by @jjoyce0510 in #4048
- fix(mae-consumer-docker): Fix condition for skipping elasticsearch check by @dexter-mh-lee in #4052
- feat(ci): pin tox requirements to speed up ci runs, remove airflow-1 … by @swaroopjagadish in #4055
- feat(container): Add domains aspect to container. by @jjoyce0510 in #4059
- feat(profile) - bigquery: Fix for hitting limit with too many partitioned tables by @treff7es in #4056
- [Docs] Mark data lake metadata source as Beta by @pedro93 in #4061
- feat(ingest): log CLI invocations and completions by @kevinhu in #4062
- fix(ingest): Add aws dependencies for data lake by @kevinhu in #4060
- fix(ingest) - add aws_common as a snowflake_common dependency by @aditya-radhakrishnan in #4054
- feat(ui): Add svg datahub loading logo by @eburairu in #4065
- refactor(models): Refactoring new Assertion models by @jjoyce0510 in #4064
- feat(cli): add --force option to ingest rollback subcommand by @danilopeixoto in #4032
- fix(analytics): fix missing events from UI by @anshbansal in #4026
- Data domain containers ingestion by @treff7es in #4051
- docs(ingestion) glue: document required IAM permissions by @iasoon in #3929
- fix(profile):bigquery - Check for every table if it is partitioned to not hit table quota by @treff7es in #4074
New Contributors
- @dipeshmaurya made their first contribution in #3988
- @maaaikoool made their first contribution in #3932
- @icy made their first contribution in #4031
- @ksrinath made their first contribution in #3787
- @eburairu made their first contribution in #4065
- @danilopeixoto made their first contribution in #4032
Full Changelog: v0.8.24...v0.8.25