Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: add myst parser #3

Open
wants to merge 5 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
@@ -1,2 +1,3 @@
/docs/poetry.lock
/docs/_build
/source/.doctrees
/source/.doctrees
2 changes: 1 addition & 1 deletion docs/pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -8,13 +8,13 @@ authors = ["ScyllaDB Documentation Contributors"]
python = "^3.7"
pyyaml = "6.0"
pygments = "2.11.2"
recommonmark = "0.7.1"
sphinx-scylladb-theme = "~1.3.1"
sphinx-autobuild = "2021.3.14"
Sphinx = "4.3.2"
sphinx-multiversion-scylla = "~0.2.11"
sphinx-sitemap = "2.2.0"
redirects_cli ="~0.1.3"
myst-parser = "0.18.1"

[build-system]
requires = ["poetry>=0.12"]
Expand Down
6 changes: 3 additions & 3 deletions docs/source/conf.py
Original file line number Diff line number Diff line change
Expand Up @@ -34,7 +34,7 @@
"sphinx_sitemap",
"sphinx_scylladb_theme",
"sphinx_multiversion", # optional
"recommonmark", # optional
"myst_parser", # optional
]

# The suffix(es) of source filenames.
Expand All @@ -44,7 +44,7 @@
master_doc = "index"

# General information about the project.
project = "ScyllaDB feature store example docs"
project = "ScyllaDB feature store example"
copyright = str(date.today().year) + ", ScyllaDB. All rights reserved."
author = u"ScyllaDB Project Contributors"

Expand Down Expand Up @@ -135,4 +135,4 @@ def setup(sphinx):
action="ignore",
category=UserWarning,
message=r".*Container node skipped.*",
)
)
264 changes: 1 addition & 263 deletions docs/source/getting-started.md
Original file line number Diff line number Diff line change
@@ -1,264 +1,2 @@
# Get started

## ScyllaDB feature store example (online storage)
This example project demonstrates a machine learning feature store use case for ScyllaDB.

You'll set up a database with flight feature data and use ScyllaDB to analyze flight delays.

## Clone the repository
```bash
git clone https://github.com/scylladb/scylladb-feature-store.git
```{include} ../../readme.md
```

## Sign up for a ScyllaDB Cloud account
To complete this project, sign up for a free trial account on [ScyllaDB Cloud](https://cloud.scylladb.com/user/signup). This is the easiest way to start using ScyllaDB.

If you prefer to use a self-hosted version of ScyllaDB, [see installation options here](https://www.scylladb.com/download/#open-source) (e.g. Docker).


## Prerequisites:
* [cqlsh](https://pypi.org/project/cqlsh/)
* [Python 3.7+](https://www.python.org/downloads/)


## Data model
Run the `schema.cql` file to create keyspace `feature_store` and table `flight_features`:
```bash
cqlsh "node-0.aws_us_east_1.xxxxxxxxx.clusters.scylla.cloud" 9042 -u scylla -p "password" -f schema.cql
```

This creates the following table in your database:
```sql
create table feature_store.flight_features(
FL_DATE TIMESTAMP,
OP_CARRIER TEXT,
OP_CARRIER_FL_NUM INT,
ORIGIN TEXT,
DEST TEXT,
CRS_DEP_TIME INT,
DEP_TIME FLOAT,
DEP_DELAY FLOAT,
TAXI_OUT FLOAT,
WHEELS_OFF FLOAT,
WHEELS_ON FLOAT,
TAXI_IN FLOAT,
CRS_ARR_TIME INT,
ARR_TIME FLOAT,
ARR_DELAY FLOAT,
CANCELLED FLOAT,
CANCELLATION_CODE TEXT,
DIVERTED FLOAT,
CRS_ELAPSED_TIME FLOAT,
ACTUAL_ELAPSED_TIME FLOAT,
AIR_TIME FLOAT,
DISTANCE FLOAT,
CARRIER_DELAY FLOAT,
WEATHER_DELAY FLOAT,
NAS_DELAY FLOAT,
SECURITY_DELAY FLOAT,
LATE_AIRCRAFT_DELAY FLOAT,
PRIMARY KEY (OP_CARRIER_FL_NUM)
);
```


## Import the dataset into ScyllaDB
```bash
cqlsh "node-0.aws_us_east_1.xxxxxxxxx.clusters.scylla.cloud" 9042 -u scylla -p "password"

scylla@cqlsh> COPY feature_store.flight_features FROM 'flight_features.csv';
```

This will start ingesting data into your ScyllaDB instance:
```
op_carrier_fl_num|actual_elapsed_time|air_time|arr_delay|arr_time|cancellation_code|cancelled|carrier_delay|crs_arr_time|crs_dep_time|crs_elapsed_time|dep_delay|dep_time|dest|distance|diverted|fl_date |late_aircraft_delay|nas_delay|op_carrier|origin|security_delay|taxi_in|taxi_out|weather_delay|wheels_off|wheels_on|
-----------------+-------------------+--------+---------+--------+-----------------+---------+-------------+------------+------------+----------------+---------+--------+----+--------+--------+-------------------+-------------------+---------+----------+------+--------------+-------+--------+-------------+----------+---------+
4317| 96.0| 73.0| -19.0| 2113.0| | 0.0| | 2132| 2040| 112.0| -3.0| 2037.0|MLI | 373.0| 0.0|2018-12-31 02:00:00| | |OO |DTW | | 5.0| 18.0| | 2055.0| 2108.0|
3372| 94.0| 74.0| 81.0| 1500.0| | 0.0| 0.0| 1339| 1150| 109.0| 96.0| 1326.0|RNO | 564.0| 0.0|2018-12-31 02:00:00| 81.0| 0.0|OO |SEA | 0.0| 3.0| 17.0| 0.0| 1343.0| 1457.0|
1584| 385.0| 348.0| -21.0| 2023.0| | 0.0| | 2034| 1700| 394.0| -9.0| 1658.0|SFO | 2565.0| 0.0|2018-12-31 02:00:00| | |UA |EWR | | 13.0| 33.0| | 1731.0| 2019.0|
4830| 119.0| 85.0| -35.0| 1431.0| | 0.0| | 1437| 1245| 136.0| -13.0| 1232.0|MSP | 546.0| 0.0|2018-12-31 02:00:00| | |OO |MSP | | 16.0| 18.0| | 1250.0| 1415.0|
2731| 158.0| 146.0| -19.0| 1911.0| | 0.0| | 1930| 1800| 160.0| -8.0| 1800.0|MSP | 842.0| 0.0|2018-12-31 02:00:00| | |WN |PVD | | 7.0| 7.0| | 1807.0| 1904.0|
```

Now that you have some sample data to play with, let's see a decision tree example.

## Decision tree classification
Create a new virtual environment and activate it:
```bash
virtualenv env
source env/bin/activate
```

Install requirements:
```bash
pip install scikit-learn pandas scylla-driver
```

> To follow the code examples below you can either create a new python file or just use the Jupyter notebook that's in the repo. If you use the notebook make sure the edit the `config.py` file with your credentials.

Connect to ScyllaDB:
```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation
from sqlalchemy import create_engine
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy
from cassandra.auth import PlainTextAuthProvider

def pandas_factory(colnames, rows):
return pd.DataFrame(rows, columns=colnames)

def getCluster():
profile = ExecutionProfile(load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy(local_dc='AWS_US_EAST_1')),
row_factory=pandas_factory)
return Cluster(
execution_profiles={EXEC_PROFILE_DEFAULT: profile},
contact_points=[
""
],
port=9042,
auth_provider = PlainTextAuthProvider(username="scylla", password="******"))


cluster = getCluster()
session = cluster.connect()
```

Query data into a dataframe:
```python
query = "SELECT * FROM demo.flight_features;"
rows = session.execute(query)
df = rows._current_rows
```

Split the dataset into features and target variable:
Feature attributes:
* `actual_elapsed_time`
* `air_time`
* `arr_time`
* `crs_arr_time`
* `crs_dep_time`
* `crs_elapsed_time`
* `dep_time`
* `distance`
* `taxi_in`
* `taxi_out`
* `wheels_off`
* `wheels_on`
* `arr_delay`

Target attribute:
* `dep_delay`


```python
#Features
feature_cols = ["actual_elapsed_time", "air_time",
"arr_time", "crs_arr_time", "crs_dep_time",
"crs_elapsed_time", "dep_time", "distance", "taxi_in", "taxi_out",
"wheels_off", "wheels_on", "arr_delay"]

X = df[feature_cols]
y = df.dep_delay # Target variable

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # 70% training and 30% test
```

Create the classifier:
```python
# Create Decision Tree classifer object
clf = DecisionTreeClassifier()

# Train Decision Tree Classifer
clf = clf.fit(X_train,y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)

print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.052884615384615384
```

Decision tree visualization in text form
```python
from sklearn.tree import export_text

tree_rules = export_text(clf, feature_names=feature_cols)
print(tree_rules)
```

```text
| | | | |--- taxi_in > 5.50
| | | | | |--- crs_arr_time <= 1307.50
| | | | | | |--- crs_dep_time <= 514.00
| | | | | | | |--- dep_time <= 501.00
| | | | | | | | |--- dep_time <= 272.50
| | | | | | | | | |--- class: 14.0
| | | | | | | | |--- dep_time > 272.50
| | | | | | | | | |--- class: -5.0
| | | | | | | |--- dep_time > 501.00
| | | | | | | | |--- class: -3.0
| | | | | | |--- crs_dep_time > 514.00
| | | | | | | |--- wheels_off <= 553.50
| | | | | | | | |--- crs_dep_time <= 536.50
| | | | | | | | | |--- wheels_on <= 745.50
| | | | | | | | | | |--- air_time <= 55.00
| | | | | | | | | | | |--- class: 6.0
| | | | | | | | | | |--- air_time > 55.00
| | | | | | | | | | | |--- truncated branch of depth 3
| | | | | | | | | |--- wheels_on > 745.50
| | | | | | | | | | |--- class: -2.0
| | | | | | | | |--- crs_dep_time > 536.50
| | | | | | | | | |--- class: -6.0
| | | | | | | |--- wheels_off > 553.50
| | | | | | | | |--- wheels_on <= 1257.00
| | | | | | | | | |--- taxi_out <= 12.50
| | | | | | | | | | |--- arr_delay <= -2.50
| | | | | | | | | | | |--- truncated branch of depth 5
| | | | | | | | | | |--- arr_delay > -2.50
| | | | | | | | | | | |--- truncated branch of depth 6
| | | | | | | | | |--- taxi_out > 12.50
| | | | | | | | | | |--- crs_elapsed_time <= 271.50
| | | | | | | | | | | |--- truncated branch of depth 11
| | | | | | | | | | |--- crs_elapsed_time > 271.50
| | | | | | | | | | | |--- truncated branch of depth 2
| | | | | | | | |--- wheels_on > 1257.00
| | | | | | | | | |--- class: 3.0
| | | | | |--- crs_arr_time > 1307.50
| | | | | | |--- crs_dep_time <= 1746.00
| | | | | | | |--- taxi_out <= 16.50
| | | | | | | | |--- crs_arr_time <= 1349.50
| | | | | | | | | |--- arr_delay <= -5.00
| | | | | | | | | | |--- class: 9.0
| | | | | | | | | |--- arr_delay > -5.00
| | | | | | | | | | |--- actual_elapsed_time <= 100.50
| | | | | | | | | | | |--- class: 10.0
| | | | | | | | | | |--- actual_elapsed_time > 100.50
| | | | | | | | | | | |--- truncated branch of depth 4
| | | | | | | | |--- crs_arr_time > 1349.50
| | | | | | | | | |--- wheels_on <= 1337.50
| | | | | | | | | | |--- class: -2.0
| | | | | | | | | |--- wheels_on > 1337.50
| | | | | | | | | | |--- taxi_in <= 15.50
| | | | | | | | | | | |--- truncated branch of depth 13
| | | | | | | | | | |--- taxi_in > 15.50
| | | | | | | | | | | |--- truncated branch of depth 6
| | | | | | | |--- taxi_out > 16.50
| | | | | | | | |--- crs_dep_time <= 1521.50
| | | | | | | | | |--- crs_dep_time <= 1105.00
| | | | | | | | | | |--- actual_elapsed_time <= 175.50
| | | | | | | | | | | |--- class: 5.0
| | | | | | | | | | |--- actual_elapsed_time > 175.50
| | | | | | | | | | | |--- truncated branch of depth 3
```

## Jupyter notebook
You can run and modify the classifier script using the Jupyter notebook file in the repository. Before running it locally, make sure to edit the `config.py` file with your proper (local or ScyllaDB Cloud) ScyllaDB configuration.


## Narrow table design
In the example above, we use a wide table design
5 changes: 0 additions & 5 deletions docs/source/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -59,8 +59,3 @@
About feature stores <about-feature-stores>
Getting started <getting-started>
Feature store GitHub repository <https://github.com/scylladb/scylladb-feature-store>