feature: External Table ( for ClickHouse ) (#546)
Co-authored-by: Jove Zhong <jove@timeplus.io>
zliang-min and jovezhong authored Feb 3, 2024
1 parent 5dc02c0 commit ec89390
Showing 68 changed files with 1,744 additions and 61 deletions.
6 changes: 4 additions & 2 deletions examples/README.md
@@ -8,12 +8,14 @@ This folder lists some examples to run Proton in various use cases. For more rea

- cdc: demonstrates how to use Debezium to sync database changes from MySQL to Proton, via Redpanda and show live updates(UPSERT and DELETE) in Proton via changelog stream.

- clickhouse: demonstrates how to read from ClickHouse or write to ClickHouse with the new External Table feature.

- ecommerce: a combination of Proton, Redpanda, owl-shop and Redpanda Console. Owl Shop is an imaginary ecommerce shop that simulates microservices exchanging data via Apache Kafka. Sample data streams are: clickstreams(frontend events), customer info, customer orders. [Learn more](https://docs.timeplus.com/proton-kafka#tutorial)

- fraud_detection: demonstrates how to leverage proton to build a real-time fraud detection where proton is used as a real-time feature store.

- grafana: an example of how to use Grafana to connect to Proton and visualize the query results.

- hackernews: just two containers: Proton and [a bytewax-based data loader](https://github.com/timeplus-io/proton-python-driver/tree/develop/example/bytewax). Inspired by https://bytewax.io/blog/polling-hacker-news, you can call Hacker News HTTP API with Bytewax and send latest news to Proton for SQL-based analysis.

- jdbc: demonstrates how to connect to Proton via JDBC using DBeaver or Metabase.
76 changes: 76 additions & 0 deletions examples/clickhouse/README.md
@@ -0,0 +1,76 @@
# Demo for ClickHouse External Table

This docker compose file demonstrates how to read from ClickHouse or write to ClickHouse with the new [External Table](https://docs.timeplus.com/proton-clickhouse-external-table) feature.

A YouTube video tutorial is available for visual learners: TBD

## Start the example

Simply run `docker compose up` in this folder. Five Docker containers form the stack:

1. ghcr.io/timeplus-io/proton:latest, as the streaming SQL engine.
2. clickhouse/clickhouse-server:latest, as the OLAP database that stores the transformed data.
3. quay.io/cloudhut/owl-shop:latest, as the data generator. [Owl Shop](https://github.com/cloudhut/owl-shop) is an imaginary ecommerce shop that simulates microservices exchanging data via Apache Kafka.
4. docker.redpanda.com/redpandadata/redpanda, as the Kafka-compatible streaming message bus
5. docker.redpanda.com/redpandadata/console, as the web UI to explore data in Kafka/Redpanda

When all the containers are up and running, a few topics will be created in Redpanda with live demo data.

## Read data from Redpanda, apply ETL and write to ClickHouse
Open the `proton client` in the proton container. Run the following SQL to create an external stream to read live data from Redpanda.

```sql
CREATE EXTERNAL STREAM frontend_events(raw string)
SETTINGS type='kafka',
brokers='redpanda:9092',
topic='owlshop-frontend-events';
```
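The external stream exposes each Kafka message as a single `raw` string column. For illustration, a frontend event payload might look like the following JSON. The field names are taken from the materialized view later in this README; the actual owl-shop schema may include additional fields.

```python
import json

# A hypothetical owl-shop frontend event; only the fields used by the
# materialized view (requestedUrl, method, ipAddress) are shown here.
event = {
    "requestedUrl": "/products/42",
    "method": "GET",
    "ipAddress": "203.0.113.7",
}

# Each Kafka message body is what Proton exposes as the `raw` column.
raw = json.dumps(event)
print(raw)
```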

Open the `clickhouse client` in the clickhouse container. Run the following SQL to create a regular MergeTree table.

```sql
CREATE TABLE events
(
_tp_time DateTime64(3),
url String,
method String,
ip String
)
ENGINE=MergeTree()
PRIMARY KEY (_tp_time, url);
```

Go back to `proton client`, run the following SQL to create an external table to connect to ClickHouse:
```sql
CREATE EXTERNAL TABLE ch_local
SETTINGS type='clickhouse',
address='clickhouse:9000',
table='events';
```

Then create a materialized view that reads data from Redpanda, extracts the values, masks the IP address as an MD5 hash, and sends the results to the external table. By doing so, the transformed data is written to ClickHouse continuously.

```sql
CREATE MATERIALIZED VIEW mv INTO ch_local AS
SELECT now64() AS _tp_time,
raw:requestedUrl AS url,
raw:method AS method,
lower(hex(md5(raw:ipAddress))) AS ip
FROM frontend_events;
```
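The per-row transformation in the materialized view can be sketched in Python to show what ends up in ClickHouse. Note that `hashlib`'s `hexdigest()` already returns lowercase hex, matching `lower(hex(md5(...)))` in the SQL above. The helper name `transform` and the sample payload are illustrative only.

```python
import hashlib
import json
from datetime import datetime, timezone

def transform(raw: str) -> dict:
    """Sketch of the materialized view: extract url/method, mask the IP."""
    event = json.loads(raw)
    return {
        "_tp_time": datetime.now(timezone.utc),
        "url": event["requestedUrl"],
        "method": event["method"],
        # lower(hex(md5(...))) in SQL corresponds to md5(...).hexdigest()
        "ip": hashlib.md5(event["ipAddress"].encode()).hexdigest(),
    }

row = transform('{"requestedUrl": "/cart", "method": "POST", "ipAddress": "198.51.100.9"}')
print(row["ip"])  # a 32-char lowercase hex digest, not the raw IP
```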

## Read data from ClickHouse

You can run the following SQL to query ClickHouse:

```sql
SELECT * FROM ch_local;
```

Or apply SQL functions or aggregations with `GROUP BY`, such as:

```sql
SELECT method, count() AS cnt FROM ch_local GROUP BY method
```

Please note, Proton reads all rows of the selected columns from the ClickHouse table and applies the aggregation locally. Check the [External Table](https://docs.timeplus.com/proton-clickhouse-external-table) documentation for details.
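Conceptually, the aggregation behaves as if Proton fetched the `method` column for every row and counted locally, as this small sketch (with made-up sample rows) illustrates:

```python
from collections import Counter

# Rows as fetched from ClickHouse; only the selected column travels over the wire.
methods = ["GET", "GET", "POST", "GET", "DELETE"]

# Proton-side aggregation, equivalent to GROUP BY method / count().
counts = Counter(methods)
print(counts)  # Counter({'GET': 3, 'POST': 1, 'DELETE': 1})
```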
54 changes: 54 additions & 0 deletions examples/clickhouse/docker-compose.yml
@@ -0,0 +1,54 @@
version: '3.7'
name: proton-ch-demo
volumes:
  redpanda: null
services:
  proton:
    image: ghcr.io/timeplus-io/proton:latest
    pull_policy: always

  clickhouse:
    image: clickhouse/clickhouse-server:latest
    ports:
      - 9000:9000
    ulimits:
      nofile:
        soft: 262144
        hard: 262144

  redpanda:
    image: docker.redpanda.com/redpandadata/redpanda:v23.2.15
    command:
      - redpanda start
      - --kafka-addr internal://0.0.0.0:9092,external://0.0.0.0:19092
      - --advertise-kafka-addr internal://redpanda:9092,external://localhost:19092
      - --smp 1
      - --memory 1G
      - --mode dev-container
    volumes:
      - redpanda:/var/lib/redpanda/data

  redpanda-console:
    image: docker.redpanda.com/redpandadata/console:v2.3.5
    entrypoint: /bin/sh
    command: -c "echo \"$$CONSOLE_CONFIG_FILE\" > /tmp/config.yml; /app/console"
    environment:
      CONFIG_FILEPATH: /tmp/config.yml
      CONSOLE_CONFIG_FILE: |
        kafka:
          brokers: ["redpanda:9092"]
    ports:
      - 8080:8080
    depends_on:
      - redpanda

  owl-shop:
    image: quay.io/cloudhut/owl-shop:latest
    #platform: 'linux/amd64'
    environment:
      - SHOP_KAFKA_BROKERS=redpanda:9092
      - SHOP_KAFKA_TOPICREPLICATIONFACTOR=1
      - SHOP_TRAFFIC_INTERVAL_RATE=1
      - SHOP_TRAFFIC_INTERVAL_DURATION=0.1s
    depends_on:
      - redpanda
5 changes: 5 additions & 0 deletions src/CMakeLists.txt
@@ -247,7 +247,11 @@ else()
message(FATAL "rdkafka is not enabled which is required")
endif()

add_subdirectory(ClickHouse)
add_object_library(clickhouse_clickhouse ClickHouse)

add_subdirectory(Storages/ExternalStream)
add_subdirectory(Storages/ExternalTable)
# proton: end

set (DBMS_COMMON_LIBRARIES)
@@ -264,6 +268,7 @@ endif()

target_link_libraries (dbms PRIVATE klog)
target_link_libraries (dbms PRIVATE external_stream)
target_link_libraries (dbms PRIVATE external_table)
target_link_libraries (dbms PRIVATE checkpoint)

set (all_modules dbms)
3 changes: 3 additions & 0 deletions src/ClickHouse/CMakeLists.txt
@@ -0,0 +1,3 @@
# if (ENABLE_TESTS)
# add_subdirectory(tests)
# endif ()