Merge pull request #20 from FAST-HEP/kreczko-issue-15
Fixing issue #15: Removing references to Apache Airflow
kreczko authored Mar 18, 2024
2 parents 3a7c37b + 17102cd commit 6346088
Showing 12 changed files with 90 additions and 79 deletions.
2 changes: 1 addition & 1 deletion .github/CONTRIBUTING.md
@@ -19,7 +19,7 @@ specific jobs:
```console
$ nox -s lint # Lint only
$ nox -s tests # Python tests
$ nox -s docs -- serve # Build and serve the docs
$ nox -s docs -- --serve # Build and serve the docs
$ nox -s build # Make an SDist and wheel
```

3 changes: 0 additions & 3 deletions README.md
@@ -4,7 +4,6 @@
[![Documentation Status][rtd-badge]][rtd-link]

[![PyPI version][pypi-version]][pypi-link]
[![Conda-Forge][conda-badge]][conda-link]
[![PyPI platforms][pypi-platforms]][pypi-link]

[![GitHub Discussion][github-discussions-badge]][github-discussions-link]
@@ -14,8 +13,6 @@
<!-- prettier-ignore-start -->
[actions-badge]: https://github.com/FAST-HEP/fasthep-flow/workflows/CI/badge.svg
[actions-link]: https://github.com/FAST-HEP/fasthep-flow/actions
[conda-badge]: https://img.shields.io/conda/vn/conda-forge/fasthep-flow
[conda-link]: https://github.com/conda-forge/fasthep-flow-feedstock
[github-discussions-badge]: https://img.shields.io/static/v1?label=Discussions&message=Ask&color=blue&logo=github
[github-discussions-link]: https://github.com/FAST-HEP/fasthep-flow/discussions
[pypi-link]: https://pypi.org/project/fasthep-flow/
30 changes: 27 additions & 3 deletions docs/concepts.md
@@ -3,9 +3,11 @@
`fasthep-flow` does not implement any processing itself, but rather delegates
between user workflow description (the
[YAML configuration file](./configuration.md)), the workflow stages (e.g.
**Python Callables**), the **Apache Airflow DAG** and the **Executor** engine.
Unless excplicitely stated, every workflow has to start with a **Data Stage**,
has one or more **Processing stages**, and end with an **Output stage**.
**Python Callables**), the **workflow DAG** and the **Executor** engine. Unless
explicitly stated otherwise, every workflow has to start with a **Data Stage**,
have one or more **Processing stages**, and end with an **Output stage**.

## Stages

**Data Stage**: The data stage is any callable that returns data. This can be a
function that reads data from a file, or a function that generates data. The
@@ -24,6 +26,28 @@ returns data as output. The output of the last processing stage is passed to the
output stage. The output stage is executed only once, and is the last stage in a
workflow. The output of the output stage is saved to disk.
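
The three stage types might be expressed in a workflow configuration roughly as
in the sketch below, reusing the operator syntax shown in the configuration
examples; the stage names, the output command and the user function are purely
illustrative:

```yaml
stages:
  # Data stage: a callable that produces the input data
  - name: fetchData
    type: "fasthep_flow.operators.BashOperator"
    kwargs:
      bash_command: fasthep download --json /path/to/json --destination /path/to/data
  # Processing stage: takes data as input and returns data as output
  - name: selectEvents
    type: "fasthep_flow.operators.PythonOperator"
    kwargs:
      python_callable: my_analysis.select_events # hypothetical user function
  # Output stage: runs once at the end and saves its result to disk
  - name: writeResults
    type: "fasthep_flow.operators.BashOperator"
    kwargs:
      bash_command: ./make_plots.sh # illustrative output script
```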

### Exceptional stages

Of course, not all workflows are as simple as the above. In some cases, you may
want to checkpoint the process, write out intermediate results, or do some other
special processing. For this, `fasthep-flow` has the following special stages:

**Provenance Stage**: The provenance stage is a special stage that typically
runs outside the workflow. It is used to collect information about the workflow,
such as the software versions used, the input data, and the output data. The
provenance stage is executed only once, and is the last stage in a workflow. The
output of the provenance stage is saved to disk.

**Caching stage**: The caching stage is a special stage that can be used to
cache the output of a processing stage. The caching stage can be added to any
processing stage, and will save the output of the processing stage to disk or
remote storage.

**Monitoring stage**: The monitoring stage is a special stage that can be used
to monitor the progress of the workflow. The monitoring stage can be added to
any processing stage, and can either store the information locally or send it
at intervals to a specified endpoint.
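
The configuration syntax for these special stages is not spelled out here, so
the sketch below is an assumption rather than a fixed API: the `cache` and
`monitoring` keys are hypothetical, and only illustrate how caching and
monitoring could be attached to a processing stage.

```yaml
stages:
  - name: selectEvents
    type: "fasthep_flow.operators.PythonOperator"
    kwargs:
      python_callable: my_analysis.select_events # hypothetical user function
    # assumed keys below: the exact caching/monitoring syntax is not defined here
    cache:
      storage: ./checkpoints/selectEvents # local or remote storage location
    monitoring:
      endpoint: https://monitor.example.org/fasthep # illustrative endpoint
      interval: 60 # send progress updates every 60 seconds
```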

## Anatomy of an analysis workflow

In the most general terms, an analysis workflow consists of the following parts:
4 changes: 2 additions & 2 deletions docs/configuration/environments.md
@@ -17,15 +17,15 @@ Let's start with a simple example:
```yaml
stages:
- name: runROOTMacro
type: "airflow::BashOperator"
type: "fasthep_flow.operators.BashOperator"
kwargs:
bash_command: root -bxq <path to ROOT macro>
environment:
image: docker.io/rootproject/root:6.28.04-ubuntu22.04
variables: <path to .env>
executor: LocalExecutor
- name: runStatsCode
type: "airflow::BashOperator"
type: "fasthep_flow.operators.BashOperator"
kwargs:
bash_command: ./localStatsCode.sh
environment:
36 changes: 10 additions & 26 deletions docs/configuration/index.md
@@ -3,8 +3,8 @@
`fasthep-flow` provides a way to describe a workflow in a YAML file. It does not
handle the input data specification, which is instead handled by
`fasthep-curator`. `fasthep-flow` checks the input parameters, imports specified
modules, and maps the YAML onto an Apache Airflow DAG. The workflow can then be
executed on a local machine or on a cluster using
modules, and maps the YAML onto a workflow. The workflow can then be executed on
a local machine or on a cluster using
[CERN's HTCondor](https://batchdocs.web.cern.ch/local/submit.html) (via Dask) or
[Google Cloud Composer](https://cloud.google.com/composer).

@@ -22,39 +22,23 @@ Here's a simplified example of a YAML file:
```yaml
stages:
- name: printEcho
type: "airflow::BashOperator"
type: "fasthep_flow.operators.BashOperator"
kwargs:
bash_command: echo "Hello World!"
- name: printPython
type: "airflow::PythonOperator"
type: "fasthep_flow.operators.PythonOperator"
kwargs:
python_callable: print
op_args: ["Hello World!"]
```
This YAML file defines two stages, `printEcho` and `printPython`. The
`printEcho` stage uses the `BashOperator` from Apache Airflow, and the
`printPython` stage uses the `PythonOperator` from Apache Airflow. The
`printEcho` stage passes the argument `echo "Hello World!"` to the
`bash_command` argument of the `BashOperator`. The `printPython` stage uses the

To make it easier to use Python callables, `fasthep-flow` provides a `pycall`
operator. This operator takes a Python callable and its arguments, and then
calls the callable with the arguments. The `printPython` stage can be rewritten
using the `pycall` operator as follows:

```yaml
stages:
- name: printEcho
type: "airflow::BashOperator"
kwargs:
bash_command: echo "Hello World!"
- name: printPython
type: "fasthep_flow::PyCallOperator"
args: ["Hello World!"]
kwargs:
callable: print
```
`printEcho` stage uses the `BashOperator`, and the `printPython` stage uses the
`PythonOperator`. The `printEcho` stage passes the argument
`echo "Hello World!"` to the `bash_command` argument of the `BashOperator`. To
make it easier to use Python callables, `fasthep-flow` provides the
`PythonOperator`. This operator takes a Python callable and its arguments, and
then calls the callable with the arguments.

```{note}
- you can test the validity of a config via `fasthep-flow lint <config.yaml>`
2 changes: 1 addition & 1 deletion docs/configuration/register.md
@@ -31,7 +31,7 @@ stages:
```

```{note}
There are protected namespaces, which you cannot use like this. These are `airflow` and anything starting with `fasthep`.
There are protected namespaces, which you cannot use like this. These are anything starting with `fasthep`.
```

## Registering a stage that can run on GPU
6 changes: 3 additions & 3 deletions docs/examples/cms_pub_example.md
@@ -58,7 +58,7 @@ fasthep download --json /path/to/json --destination /path/to/data

```{note}
While you can automate the data download and curator steps, we will do them manually in this example.
Both could be added as stages of the type `airflow.operators.bash.BashOperator`.
Both could be added as stages of the type `fasthep_flow.operators.bash.BashOperator`.
```

@@ -233,11 +233,11 @@ by [fasthep-carpenter](https://fast-hep.github.io/fasthep-carpenter/).
### Making paper-ready plots

The final step is to make a paper-ready plot. We will use the
`airflow.operators.bash.BashOperator` for this:
`fasthep_flow.operators.bash.BashOperator` for this:

```yaml
- name: Make paper-ready plot
type: airflow.operators.bash.BashOperator
type: fasthep_flow.operators.bash.BashOperator
kwargs:
bash_command: |
fasthep plotter \
4 changes: 2 additions & 2 deletions docs/examples/lz_example.md
@@ -113,11 +113,11 @@ This is a feature of `fasthep-flow` that allows you to use expressions here. Alt
### Making paper-ready plots

The final step is to make a paper-ready plot. We will use the
`airflow.operators.bash.BashOperator` for this:
`fasthep_flow.operators.bash.BashOperator` for this:

```yaml
- name: Make paper-ready plot
type: airflow.operators.bash.BashOperator
type: fasthep_flow.operators.bash.BashOperator
kwargs:
bash_command: |
fasthep plotter \
38 changes: 20 additions & 18 deletions docs/executors.md
@@ -1,30 +1,32 @@
# Executors - run your workflows

Since `fasthep-flow` is built on top of Apache Airflow, it supports all of the
executors that Airflow supports. The default executor is the
[LocalExecutor](#localexecutor), which runs all tasks in parallel on the local
machine. A full list of executors can be found in the
[Airflow documentation](https://airflow.apache.org/docs/apache-airflow/stable/executor/index.html).
`fasthep-flow` is currently built on top of
[prefect](https://docs.prefect.io/latest/), which means it supports all of the
executors, or in this case "Task Runners", that prefect supports. The default
executor is the [SequentialTaskRunner](#sequentialtaskrunner), which runs all
tasks in sequence on the local machine. A full list of executors can be found in
the
[prefect documentation](https://docs.prefect.io/latest/concepts/task-runners/).

Since Apache Airflow is not widely used in High Energy Particle Physics, let's
go over the executors that are most relevant to us.
Since prefect is not widely used in High Energy Particle Physics, let's go over
the executors that are most relevant to us.

## LocalExecutor
## SequentialTaskRunner

The `LocalExecutor` (see
[Airflow docs](https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/executor/local.html))
The `SequentialTaskRunner` (see
[prefect docs](https://docs.prefect.io/latest/api-ref/prefect/task-runners/#prefect.task_runners.SequentialTaskRunner))
runs each task in sequence on the local machine. This is the default
executor for `fasthep-flow`.

## DaskExecutor
## DaskTaskRunner

The `DaskExecutor` (see
[Airflow docs](https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/executor/dask.html))
runs each task in a separate process on a Dask cluster. A Dask cluster can be
run on a local machine or as a distributed cluster using a batch system (e.g.
HTCondor, LSF, PBS, SGE, SLURM) or other distributed systems such as LHCb's
DIRAC. This is the recommended executor for running `fasthep-flow` workflows on
distributed resources.
The `DaskTaskRunner` (see
[prefect docs](https://prefecthq.github.io/prefect-dask/)) runs each task in a
separate process on a Dask cluster. A Dask cluster can be run on a local machine
or as a distributed cluster using a batch system (e.g. HTCondor, LSF, PBS, SGE,
SLURM) or other distributed systems such as LHCb's DIRAC. This is the
recommended executor for running `fasthep-flow` workflows on distributed
resources.
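
As a rough illustration of how a workflow might opt into the Dask-based runner:
the `executor` key mirrors the `executor: LocalExecutor` line in the
environments example, while the remaining keys and values are assumptions, not
a fixed API.

```yaml
stages:
  - name: processData
    type: "fasthep_flow.operators.PythonOperator"
    kwargs:
      python_callable: my_analysis.process # hypothetical user function
    executor: DaskTaskRunner
    # assumed keys: cluster settings below are illustrative only
    executor_config:
      cluster: htcondor # e.g. submit Dask workers through HTCondor
      workers: 100
```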

## Custom executors

21 changes: 10 additions & 11 deletions docs/index.md
@@ -2,22 +2,21 @@

## Introduction

`fasthep-flow` is a package for converting YAML files into an
[prefect]$(prefect-link) flow. It is designed to be used with the
[fast-hep](https://fast-hep.github.io/) package ecosystem, but can be used
independently.
`fasthep-flow` is a package for describing data analysis workflows in YAML and
converting them into a workflow DAG that can be run by software like Dask. It is
designed to be used with the [fast-hep](https://fast-hep.github.io/) package
ecosystem, but can be used independently.

The goal of this package is to define a workflow, e.g. a HEP analysis, in a YAML
file, and then convert that YAML file into an
[prefect flow](https://docs.prefect.io/latest/concepts/flows/). This flow can
then be run on a local machine, or on a cluster using
file, and then convert that YAML file into a workflow DAG. This DAG can then be
run on a local machine, or on a cluster using
[CERN's HTCondor](https://batchdocs.web.cern.ch/local/submit.html) (via Dask) or
[Google Cloud Composer](https://cloud.google.com/composer).

In `fasthep-flow`'s YAML files draws inspiration from Continuous Integration
(CI) pipelines and Ansible Playbooks to define the workflow, where each step is
a task that can be run in parallel. `fasthep-flow` will check the parameters of
each task, and then generate the flow. The flow will have a task for each step,
`fasthep-flow`'s YAML files draw inspiration from Continuous Integration (CI)
pipelines and Ansible Playbooks to define the workflow, where each step is an
independent task that can be run in parallel. `fasthep-flow` will check the
parameters of each task, and then generate the DAG. The DAG will have a task
for each step,
and the dependencies between the tasks will be defined by the `needs` key in the
YAML file. More on this under [Configuration](./configuration/index.md).
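
For instance, a minimal sketch of how the `needs` key might express such a
dependency; the stages are taken from the configuration example, while the
exact `needs` syntax is an assumption based on the description above:

```yaml
stages:
  - name: printEcho
    type: "fasthep_flow.operators.BashOperator"
    kwargs:
      bash_command: echo "Hello World!"
  - name: printPython
    type: "fasthep_flow.operators.PythonOperator"
    kwargs:
      python_callable: print
      op_args: ["Hello World!"]
    needs: [printEcho] # assumed syntax: run only after printEcho has finished
```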

19 changes: 13 additions & 6 deletions docs/operators.md
@@ -1,15 +1,19 @@
# Operators

Operators are used here as a general term of callable code blocks that operate
on data. In Airflow, operators are used to define tasks in a
[DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph). In `fasthep-flow`,
operators are used to define stages in a workflow. The operators are defined in
the YAML file, and then mapped to Airflow operators when the DAG is generated.
One `fasthep-flow` operator can map to multiple Airflow operators.
on data. These are similar to Ansible's modules, or Airflow's operators. In
`fasthep-flow`, operators are used to define stages in a workflow. The operators
are defined in the YAML file, and then integrated into the workflow when the DAG
is generated. The defined operators can be used to transform data, filter data,
or generate data. Operators defined in the YAML file are expected to be
callables supporting a specific protocol. When constructing the workflow,
`fasthep-flow` will try to import the module first, e.g.
`fasthep_flow.operators.bash.BashOperator`, which gives the user the flexibility
to define their own operators.
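
For example, a user-defined operator could be referenced in the YAML by its
import path, just like the built-in operators; the module path and keyword
argument below are hypothetical:

```yaml
stages:
  - name: customSelection
    # hypothetical user module; fasthep-flow would try to import this path
    type: "my_analysis.operators.SelectionOperator"
    kwargs:
      threshold: 42 # illustrative keyword argument passed to the operator
```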

## Operator Types

There are four types of operators:
There are five types of operators:

1. **Data Input**: These are a special set that does not require any input data,
and instead generates data. These are used to start a workflow.
@@ -20,6 +24,9 @@ There are four types of operators:
4. **Filter**: These are used to filter data. They are similar to data transform
operators, but instead of adding data, they restrict part of the data to
continue in the workflow.
5. **Passive**: These are used to monitor the workflow, and do not change
the data. Examples of such modules are the `ProvenanceOperator` and the
`MonitoringOperator`.

## Custom operators

4 changes: 1 addition & 3 deletions docs/provenance.md
@@ -43,7 +43,5 @@ provenance:
- performance # Metrics to measure the efficiency of the analysis
- environment # Software environment, including library versions
- hardware # Hardware specifications where the analysis was run
airflow:
include:
- db # Database configurations and states within Airflow
metadata: metadata.json # Specifies the file to store the provenance metadata
```
