chore: update readme
v3g42 committed Apr 16, 2024
1 parent bdca795 commit 03bbcb7
Showing 1 changed file with 48 additions and 83 deletions.
<div align="center">
<a target="_blank" href="https://getdozer.io/">
<br><img src="https://dozer-assets.s3.ap-southeast-1.amazonaws.com/logo-blue.svg" width=40%><br>
</a>
</div>

<p align="center">
<a href="https://github.com/getdozer/dozer/actions/workflows/dozer.yaml" target="_blank"><img src="https://github.com/getdozer/dozer/actions/workflows/dozer.yaml/badge.svg" alt="CI"></a>
<a href="https://coveralls.io/github/getdozer/dozer?branch=main" target="_blank"><img src="https://coveralls.io/repos/github/getdozer/dozer/badge.svg?branch=main&t=kZMYaV&style=flat" alt="Coverage Status"></a>
<a href="https://getdozer.io/docs/dozer" target="_blank"><img src="https://img.shields.io/badge/doc-reference-green" alt="Docs"></a>
<a href="https://discord.com/invite/3eWXBgJaEQ" target="_blank"><img src="https://img.shields.io/badge/join-on%20discord-primary" alt="Join on Discord"></a>
<a href="https://github.com/getdozer/dozer/blob/main/LICENSE.txt" target="_blank"><img src="https://img.shields.io/badge/license-Apache-blue" alt="License"></a>
</p>

## Overview

Dozer is a **data platform for building, deploying and maintaining real-time data products.**

It is ideal for companies with multiple databases, data warehouses and data lakes that need to combine, aggregate and transform data in real time and build customer-facing or internal data applications.

*Put simply, Dozer empowers a single developer to go from data sources to ready-made APIs in just a few minutes, all with a simple configuration file.*

At its core, Dozer is a **real-time data movement tool that leverages CDC to move data from various sources to multiple sinks.**

## How it works
Dozer pulls data from various sources like databases, data lakes, and data warehouses using Change Data Capture (CDC) and periodic polling mechanisms. This ensures up-to-date data ingestion in real-time or near-real-time.

After capturing data, Dozer can combine, transform and aggregate it
using its own internal real-time transformation engine. It supports Streaming SQL, WebAssembly (coming soon) and TypeScript (coming soon), as well as ONNX for performing AI predictions in real time.
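
As a minimal sketch, a streaming aggregation can be declared in the same configuration file shown later in this README. The `sql` section, the `SELECT ... INTO` form, and the table and column names below are illustrative assumptions; consult the configuration reference linked later for the exact syntax.

```yaml
# Sketch only: assumes the config accepts a `sql` section with streaming SQL;
# `orders` and `customer_totals` are hypothetical tables.
sql: |
  SELECT customer_id, COUNT(*) AS order_count, SUM(amount) AS total_spent
  INTO customer_totals
  FROM orders
  GROUP BY customer_id;
```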

After processing, data is stored and indexed in a low-latency datastore (based on [LMDB](https://github.com/LMDB/lmdb)) and can be queried using REST and gRPC.
Dozer is orders of magnitude faster than Debezium+Kafka and natively supports stateless transformations.
It is primarily used for moving data into warehouses. In our own application, we move data to **Clickhouse** and build data APIs and integrations with LLMs.
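
As a rough sketch of what a warehouse sink looks like in configuration (modelled on the sink example later in this README; the `!Clickhouse` variant and its option names are assumptions, so check the configuration reference for the exact fields):

```yaml
# Sketch only: the option names under !Clickhouse are assumptions, not a
# documented reference.
sinks:
  - name: orders_to_clickhouse
    config: !Clickhouse
      host: localhost
      port: 9000            # assumed native ClickHouse port setting
      database: analytics
      table_name: orders
```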

## How to use it

### ① Build
A Dozer application consists of a YAML file that can be run locally using the Dozer Live UI or the Dozer CLI. As the YAML is edited,
changes are immediately reflected in the Dozer Live UI.

![Screenshot](./images/dozer_live_screen1.png)

### ② Test
Dozer can run the entire infrastructure locally. You can inspect incoming data in real time or use the built-in API explorer to query data through REST and gRPC. The Dozer Live explorer also provides ready-made samples for integrating results into your front-end applications.

![Screenshot](./images/dozer_live_screen2.png)

### ③ Deploy
Dozer applications can be self-hosted or deployed in the cloud with a single command. Dozer Cloud (coming soon) provides self-healing and monitoring capabilities, making sure your APIs are always available.


## Supported Sources and Transformation Engines
Dozer currently supports a variety of source databases, data warehouses and object stores. Whenever possible, Dozer leverages Change Data Capture (CDC) to keep data always fresh. For sources that do not support CDC, periodic polling is used.

Dozer transformations can be executed using Dozer's highly customizable streaming SQL engine, which provides UDF support in WASM (coming soon), TypeScript (coming soon) and ONNX.
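
As an illustrative sketch only (the `udfs` section and the `!Onnx` option names below are assumptions modelled on the configuration format shown later in this README, not a documented reference), an ONNX model could be registered as a UDF roughly like this:

```yaml
# Sketch only: section and option names are assumptions; consult the
# configuration reference for the exact UDF fields.
udfs:
  - name: fraud_score               # hypothetical UDF callable from streaming SQL
    config: !Onnx
      path: ./models/fraud_score.onnx
```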

Here is an overview of all supported source types and transformation engines:

![Screenshot](./images/supported_sources.png)


## Why Dozer?
As teams embark on the journey of implementing real-time data products, they invariably come across a host of challenges that can make the task seem daunting:

1. **Integration with Various Systems**: Integrating with various data sources can present numerous technical hurdles and interoperability issues.

2. **Managing Latency**: Ensuring low-latency data access, especially for customer-facing applications, can be a significant challenge.

3. **Real-Time Data Transformation**: Managing real-time data transformations, especially when dealing with complex queries or large volumes of data, can be difficult and resource-intensive.

4. **Maintaining Data Freshness**: Keeping the data up-to-date in real-time, particularly when it's sourced from multiple locations like databases, data lakes, or warehouses, can be a daunting task.

5. **Scalability and High Availability**: Building a data application that can efficiently handle high-volume operations and remain reliable under heavy loads requires advanced architecture design and robust infrastructure.

To address all the above issues, teams often find themselves stitching together multiple technologies and a significant amount of custom code. This could involve integrating diverse systems like Kafka for real-time data streaming, Redis for low-latency data access and caching, and Spark or Flink for processing and analyzing streaming data.

![Complex Tools Setup](./images/tools.png)

The complexity of such a setup can become overwhelming. Ensuring that these different technologies communicate effectively, maintaining them, and handling potential failure points requires extensive effort and expertise.

This is where Dozer steps in, aiming to dramatically simplify this process. Dozer is designed as an all-in-one backend solution that integrates the capabilities of these disparate technologies into a single, streamlined tool. By doing so, Dozer offers the capacity to build an end-to-end real-time data product without the need to manage multiple technologies and extensive custom code.

Dozer's goal is to empower a single engineer or a small team of engineers to fully manage the entire lifecycle of a Data Product!

## Getting Started

Follow the links below to get started with Dozer:

- [Installation](https://getdozer.io/docs/installation)
- [Build a sample application using NY Taxi dataset](https://getdozer.io/docs/getting_started)

For a more comprehensive list of samples, check out our [GitHub Samples repo](https://github.com/getdozer/dozer-samples).

Dozer runs with a single configuration file like the following:
```yaml
app_name: dozer-bench
version: 1
connections:
  - name: pg_1
    config: !Postgres
      user: user
      password: postgres
      host: localhost
      port: 5432
      database: customers
sinks:
  - name: customers
    config: !Dummy
      table_name: customers
```
Full documentation can be found [here](https://github.com/getdozer/dozer/blob/main/dozer-types/src/models/config.rs#L15).

## Supported Sources

| Connector | Extraction | Resuming | Enterprise |
| -------------------- | ---------- | -------- | ------------------- |
| Postgres | ✅ | ✅ | ✅ |
| MySQL | ✅ | ✅ | ✅ |
| Snowflake | ✅ | ✅ | ✅ |
| Kafka | ✅ | 🚧 | ✅ |
| MongoDB | ✅ | 🎯 | ✅ |
| Amazon S3 | ✅ | 🎯 | ✅ |
| Google Cloud Storage | ✅ | 🎯 | ✅ |
| Oracle               | ✅          | ✅        | **Enterprise Only** |
| Aerospike            | ✅          | ✅        | **Enterprise Only** |

## Supported Sinks

| Database | Connectivity | Enterprise |
| ---------- | ------------ | ------------------- |
| Clickhouse | ✅ | |
| Postgres | ✅ | |
| MySQL | ✅ | |
| Big Query | ✅ | |
| Oracle | ✅ | **Enterprise Only** |
| Aerospike | ✅ | **Enterprise Only** |
