diff --git a/.github/CONTRIBUTING.md b/.github/CONTRIBUTING.md
index 8cccbe5..187fe1c 100644
--- a/.github/CONTRIBUTING.md
+++ b/.github/CONTRIBUTING.md
@@ -19,7 +19,7 @@ specific jobs:
 
 ```console
 $ nox -s lint   # Lint only
 $ nox -s tests  # Python tests
-$ nox -s docs -- serve  # Build and serve the docs
+$ nox -s docs -- --serve  # Build and serve the docs
 $ nox -s build  # Make an SDist and wheel
 ```
 
diff --git a/README.md b/README.md
index 5932596..04f2f0f 100644
--- a/README.md
+++ b/README.md
@@ -4,7 +4,6 @@
 [![Documentation Status][rtd-badge]][rtd-link]
 [![PyPI version][pypi-version]][pypi-link]
-[![Conda-Forge][conda-badge]][conda-link]
 [![PyPI platforms][pypi-platforms]][pypi-link]
 [![GitHub Discussion][github-discussions-badge]][github-discussions-link]
 
@@ -14,8 +13,6 @@
 [actions-badge]: https://github.com/FAST-HEP/fasthep-flow/workflows/CI/badge.svg
 [actions-link]: https://github.com/FAST-HEP/fasthep-flow/actions
-[conda-badge]: https://img.shields.io/conda/vn/conda-forge/fasthep-flow
-[conda-link]: https://github.com/conda-forge/fasthep-flow-feedstock
 [github-discussions-badge]: https://img.shields.io/static/v1?label=Discussions&message=Ask&color=blue&logo=github
 [github-discussions-link]: https://github.com/FAST-HEP/fasthep-flow/discussions
 [pypi-link]: https://pypi.org/project/fasthep-flow/
diff --git a/docs/concepts.md b/docs/concepts.md
index 36bf57e..3d546ee 100644
--- a/docs/concepts.md
+++ b/docs/concepts.md
@@ -3,9 +3,11 @@
 `fasthep-flow` does not implement any processing itself, but rather delegates
 between user workflow description (the
 [YAML configuration file](./configuration.md)), the workflow stages (e.g.
-**Python Callables**), the **Apache Airflow DAG** and the **Executor** engine.
-Unless excplicitely stated, every workflow has to start with a **Data Stage**,
-has one or more **Processing stages**, and end with an **Output stage**.
+**Python Callables**), the **workflow DAG** and the **Executor** engine. Unless
+explicitly stated, every workflow has to start with a **Data Stage**, have one
+or more **Processing stages**, and end with an **Output stage**.
+
+## Stages
 
 **Data Stage**: The data stage is any callable that returns data. This can be a
 function that reads data from a file, or a function that generates data. The
@@ -24,6 +26,28 @@ returns data as output. The output of the last processing stage is passed to
 the output stage. The output stage is executed only once, and is the last stage
 in a workflow. The output of the output stage is saved to disk.
 
+### Exceptional stages
+
+Of course, not all workflows are as simple as the above. In some cases, you may
+want to checkpoint the process, write out intermediate results, or do some other
+special processing. For this, `fasthep-flow` has the following special stages:
+
+**Provenance Stage**: The provenance stage is a special stage that typically
+runs outside the workflow. It is used to collect information about the workflow,
+such as the software versions used, the input data, and the output data. The
+provenance stage is executed only once, and is the last stage in a workflow. The
+output of the provenance stage is saved to disk.
+
+**Caching stage**: The caching stage is a special stage that can be used to
+cache the output of a processing stage. The caching stage can be added to any
+processing stage, and will save the output of the processing stage to disk or
+remote storage.
+
+**Monitoring stage**: The monitoring stage is a special stage that can be used
+to monitor the progress of the workflow. The monitoring stage can be added to
+any processing stage, and can either store the information locally or send it in
+intervals to a specified endpoint.
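+
+As a rough sketch, such a special stage might be attached to a processing
+stage in the YAML configuration like this (the stage itself is invented, and
+the `cache` and `monitor` keys are hypothetical, shown only to illustrate the
+idea):
+
+```yaml
+stages:
+  - name: selectEvents
+    type: "fasthep_flow.operators.PythonOperator"
+    kwargs:
+      python_callable: my_analysis.select_events
+    cache: disk # hypothetical: checkpoint this stage's output to disk
+    monitor:
+      endpoint: http://localhost:9090 # hypothetical: where to send progress
+```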
+
 ## Anatomy of an analysis workflow
 
 In the most general terms, an analysis workflow consists of the following parts:
diff --git a/docs/configuration/environments.md b/docs/configuration/environments.md
index 84cbc33..03ccc54 100644
--- a/docs/configuration/environments.md
+++ b/docs/configuration/environments.md
@@ -17,7 +17,7 @@ Let's start with a simple example:
 ```yaml
 stages:
   - name: runROOTMacro
-    type: "airflow::BashOperator"
+    type: "fasthep_flow.operators.BashOperator"
     kwargs:
       bash_command: root -bxq
     environment:
@@ -25,7 +25,7 @@ stages:
     variables:
       executor: LocalExecutor
   - name: runStatsCode
-    type: "airflow::BashOperator"
+    type: "fasthep_flow.operators.BashOperator"
     kwargs:
       bash_command: ./localStatsCode.sh
     environment:
diff --git a/docs/configuration/index.md b/docs/configuration/index.md
index 37fca55..dec0416 100644
--- a/docs/configuration/index.md
+++ b/docs/configuration/index.md
@@ -3,8 +3,8 @@
 `fasthep-flow` provides a way to describe a workflow in a YAML file. It does
 not handle the input data specification, which is instead handled by
 `fasthep-curator`. `fasthep-flow` checks the input parameters, imports specified
-modules, and maps the YAML onto an Apache Airflow DAG. The workflow can then be
-executed on a local machine or on a cluster using
+modules, and maps the YAML onto a workflow. The workflow can then be executed on
+a local machine or on a cluster using
 [CERN's HTCondor](https://batchdocs.web.cern.ch/local/submit.html) (via Dask)
 or [Google Cloud Composer](https://cloud.google.com/composer).
 
@@ -22,39 +22,23 @@ Here's a simplified example of a YAML file:
 
 ```yaml
 stages:
   - name: printEcho
-    type: "airflow::BashOperator"
+    type: "fasthep_flow.operators.BashOperator"
     kwargs:
       bash_command: echo "Hello World!"
   - name: printPython
-    type: "airflow::PythonOperator"
+    type: "fasthep_flow.operators.PythonOperator"
     kwargs:
       python_callable: print
       op_args: ["Hello World!"]
 ```
 
 This YAML file defines two stages, `printEcho` and `printPython`. The
-`printEcho` stage uses the `BashOperator` from Apache Airflow, and the
-`printPython` stage uses the `PythonOperator` from Apache Airflow. The
-`printEcho` stage passes the argument `echo "Hello World!"` to the
-`bash_command` argument of the `BashOperator`. The `printPython` stage uses the
-
-To make it easier to use Python callables, `fasthep-flow` provides a `pycall`
-operator. This operator takes a Python callable and its arguments, and then
-calls the callable with the arguments. The `printPython` stage can be rewritten
-using the `pycall` operator as follows:
-
-```yaml
-stages:
-  - name: printEcho
-    type: "airflow::BashOperator"
-    kwargs:
-      bash_command: echo "Hello World!"
-  - name: printPython
-    type: "fasthep_flow::PyCallOperator"
-    args: ["Hello World!"]
-    kwargs:
-      callable: print
-```
+`printEcho` stage uses the `BashOperator`, and the `printPython` stage uses the
+`PythonOperator`. The `printEcho` stage passes the argument
+`echo "Hello World!"` to the `bash_command` argument of the `BashOperator`. To
+make it easier to use Python callables, `fasthep-flow` provides the
+`PythonOperator`. This operator takes a Python callable and its arguments, and
+then calls the callable with the arguments.
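+
+Conceptually, the `printPython` stage above boils down to a plain Python call
+(a sketch of the idea, not of `fasthep-flow` internals):
+
+```python
+python_callable = print
+op_args = ["Hello World!"]
+python_callable(*op_args)  # prints: Hello World!
+```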
 
 ```{note}
 - you can test the validity of a config via `fasthep-flow lint <config>`
 ```
 
diff --git a/docs/configuration/register.md b/docs/configuration/register.md
index c75cdc4..13095a1 100644
--- a/docs/configuration/register.md
+++ b/docs/configuration/register.md
@@ -31,7 +31,7 @@ stages:
 ```
 
 ```{note}
-There are protected namespaces, which you cannot use like this. These are `airflow` and anything starting with `fasthep`.
+There are protected namespaces, which you cannot use like this. These are any names starting with `fasthep`.
 ```
 
 ## Registering a stage that can run on GPU
 
diff --git a/docs/examples/cms_pub_example.md b/docs/examples/cms_pub_example.md
index 48a9226..700d17b 100644
--- a/docs/examples/cms_pub_example.md
+++ b/docs/examples/cms_pub_example.md
@@ -58,7 +58,7 @@ fasthep download --json /path/to/json --destination /path/to/data
 
 ```{note}
 While you can automate the data download and curator steps, we will do them manually in this example.
-Both could be added as stages of the type `airflow.operators.bash.BashOperator`.
+Both could be added as stages of the type `fasthep_flow.operators.bash.BashOperator`.
 ```
 
@@ -233,11 +233,11 @@ by [fasthep-carpenter](https://fast-hep.github.io/fasthep-carpenter/).
 
 ### Making paper-ready plots
 
 The final step is to make a paper-ready plot. We will use the
-`airflow.operators.bash.BashOperator` for this:
+`fasthep_flow.operators.bash.BashOperator` for this:
 
 ```yaml
 - name: Make paper-ready plot
-  type: airflow.operators.bash.BashOperator
+  type: fasthep_flow.operators.bash.BashOperator
   kwargs:
     bash_command: |
       fasthep plotter \
diff --git a/docs/examples/lz_example.md b/docs/examples/lz_example.md
index 09e7b9f..4cfdee4 100644
--- a/docs/examples/lz_example.md
+++ b/docs/examples/lz_example.md
@@ -113,11 +113,11 @@ This is a feature of `fasthep-flow` that allows you to use expressions here. Alt
 
 ### Making paper-ready plots
 
 The final step is to make a paper-ready plot. We will use the
-`airflow.operators.bash.BashOperator` for this:
+`fasthep_flow.operators.bash.BashOperator` for this:
 
 ```yaml
 - name: Make paper-ready plot
-  type: airflow.operators.bash.BashOperator
+  type: fasthep_flow.operators.bash.BashOperator
   kwargs:
     bash_command: |
       fasthep plotter \
diff --git a/docs/executors.md b/docs/executors.md
index a674142..3fe5e54 100644
--- a/docs/executors.md
+++ b/docs/executors.md
@@ -1,30 +1,32 @@
 # Executors - run your workflows
 
-Since `fasthep-flow` is built on top of Apache Airflow, it supports all of the
-executors that Airflow supports. The default executor is the
-[LocalExecutor](#localexecutor), which runs all tasks in parallel on the local
-machine. A full list of executors can be found in the
-[Airflow documentation](https://airflow.apache.org/docs/apache-airflow/stable/executor/index.html).
+`fasthep-flow` is currently built on top of
+[prefect](https://docs.prefect.io/latest/), which means it supports all of the
+executors, or in this case "Task Runners", that prefect supports. The default
+executor is the [SequentialTaskRunner](#sequentialtaskrunner), which runs all
+tasks in sequence on the local machine. A full list of executors can be found
+in the [prefect documentation](https://docs.prefect.io/latest/concepts/task-runners/).
 
-Since Apache Airflow is not widely used in High Energy Particle Physics, let's
-go over the executors that are most relevant to us.
+Since prefect is not widely used in High Energy Particle Physics, let's go over
+the executors that are most relevant to us.
 
-## LocalExecutor
+## SequentialTaskRunner
 
-The `LocalExecutor` (see
-[Airflow docs](https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/executor/local.html))
+The `SequentialTaskRunner` (see
+[prefect docs](https://docs.prefect.io/latest/api-ref/prefect/task-runners/#prefect.task_runners.SequentialTaskRunner))
 runs each task in a separate process on the local machine. This is the default
 executor for `fasthep-flow`.
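+
+For illustration, plain prefect 2.x usage of this task runner looks roughly
+like the sketch below; the flow and task here are invented, and `fasthep-flow`
+builds the equivalent flow from the YAML configuration:
+
+```python
+from prefect import flow, task
+from prefect.task_runners import SequentialTaskRunner
+
+
+@task
+def say(message: str) -> None:
+    print(message)
+
+
+@flow(task_runner=SequentialTaskRunner())
+def hello_flow() -> None:
+    # With the SequentialTaskRunner, submitted tasks run one after another.
+    say.submit("Hello World!")
+    say.submit("Goodbye World!")
+
+
+if __name__ == "__main__":
+    hello_flow()
+```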
 
-## DaskExecutor
+## DaskTaskRunner
 
-The `DaskExecutor` (see
-[Airflow docs](https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/executor/dask.html))
-runs each task in a separate process on a Dask cluster. A Dask cluster can be
-run on a local machine or as a distributed cluster using a batch system (e.g.
-HTCondor, LSF, PBS, SGE, SLURM) or other distributed systems such as LHCb's
-DIRAC. This is the recommended executor for running `fasthep-flow` workflows on
-distributed resources.
+The `DaskTaskRunner` (see
+[prefect docs](https://prefecthq.github.io/prefect-dask/)) runs each task in a
+separate process on a Dask cluster. A Dask cluster can be run on a local machine
+or as a distributed cluster using a batch system (e.g. HTCondor, LSF, PBS, SGE,
+SLURM) or other distributed systems such as LHCb's DIRAC. This is the
+recommended executor for running `fasthep-flow` workflows on distributed
+resources.
 
 ## Custom executors
 
diff --git a/docs/index.md b/docs/index.md
index c781d99..8cc3c03 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -2,22 +2,21 @@
 ## Introduction
 
-`fasthep-flow` is a package for converting YAML files into an
-[prefect]$(prefect-link) flow. It is designed to be used with the
-[fast-hep](https://fast-hep.github.io/) package ecosystem, but can be used
-independently.
+`fasthep-flow` is a package for describing data analysis workflows in YAML and
+converting them into a workflow DAG that can be run by software like Dask. It is
+designed to be used with the [fast-hep](https://fast-hep.github.io/) package
+ecosystem, but can be used independently.
 
 The goal of this package is to define a workflow, e.g. a HEP analysis, in a YAML
-file, and then convert that YAML file into an
-[prefect flow](https://docs.prefect.io/latest/concepts/flows/). This flow can
-then be run on a local machine, or on a cluster using
+file, and then convert that YAML file into a workflow DAG. This DAG can then be
+run on a local machine, or on a cluster using
 [CERN's HTCondor](https://batchdocs.web.cern.ch/local/submit.html) (via Dask)
 or [Google Cloud Composer](https://cloud.google.com/composer).
 
-In `fasthep-flow`'s YAML files draws inspiration from Continuous Integration
-(CI) pipelines and Ansible Playbooks to define the workflow, where each step is
-a task that can be run in parallel. `fasthep-flow` will check the parameters of
-each task, and then generate the flow. The flow will have a task for each step,
+`fasthep-flow`'s YAML files draw inspiration from Continuous Integration (CI)
+pipelines and Ansible Playbooks to define the workflow, where each step is an
+independent task that can be run in parallel. `fasthep-flow` will check the
+parameters of each task, and then generate the DAG. The DAG will have a task
+for each instruction,
 and the dependencies between the tasks will be defined by the
 `needs` key in the YAML file. More on this under [Configuration](./configuration/index.md).
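+
+The `needs` key mentioned above might look like this in practice (a sketch:
+the stage names and callables are invented, and `needs` is assumed to take a
+list of stage names):
+
+```yaml
+stages:
+  - name: selectEvents
+    type: "fasthep_flow.operators.PythonOperator"
+    kwargs:
+      python_callable: my_analysis.select_events
+  - name: makeHistograms
+    type: "fasthep_flow.operators.PythonOperator"
+    kwargs:
+      python_callable: my_analysis.make_histograms
+    needs: [selectEvents]
+```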
diff --git a/docs/operators.md b/docs/operators.md
index 055397e..1aab139 100644
--- a/docs/operators.md
+++ b/docs/operators.md
@@ -1,15 +1,19 @@
 # Operators
 
 Operators are used here as a general term of callable code blocks that operate
-on data. In Airflow, operators are used to define tasks in a
-[DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph). In `fasthep-flow`,
-operators are used to define stages in a workflow. The operators are defined in
-the YAML file, and then mapped to Airflow operators when the DAG is generated.
-One `fasthep-flow` operator can map to multiple Airflow operators.
+on data. These are similar to Ansible's modules or Airflow's operators. In
+`fasthep-flow`, operators are used to define stages in a workflow. The operators
+are defined in the YAML file, and then integrated into the workflow when the DAG
+is generated. The defined operators can be used to transform data, filter data,
+or generate data. Operators defined in the YAML file are expected to be
+callables supporting a specific protocol. When constructing the workflow,
+`fasthep-flow` will try to import the module first, e.g.
+`fasthep_flow.operators.bash.BashOperator`, which gives the user the flexibility
+to define their own operators.
 
 ## Operator Types
 
-There are four types of operators:
+There are five types of operators:
 
 1. **Data Input**: These are a special set that does not require any input
    data, and instead generates data. These are used to start a workflow.
@@ -20,6 +24,9 @@ There are four types of operators:
 4. **Filter**: These are used to filter data. They are similar to data
    transform operators, but instead of adding data, they restrict part of the
    data to continue in the workflow.
+5. **Passive**: These are used to monitor the workflow, and do not change
+   the data. Examples of such modules are the `ProvenanceOperator` and the
+   `MonitoringOperator`.
 
 ## Custom operators
 
diff --git a/docs/provenance.md b/docs/provenance.md
index a01c5c9..7729e23 100644
--- a/docs/provenance.md
+++ b/docs/provenance.md
@@ -43,7 +43,5 @@ provenance:
   - performance # Metrics to measure the efficiency of the analysis
   - environment # Software environment, including library versions
   - hardware # Hardware specifications where the analysis was run
-  airflow:
-    include:
-      - db # Database configurations and states within Airflow
+  metadata: metadata.json # Specifies the file to store the provenance metadata
 ```
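+
+Once written, the metadata file can be read back like any other JSON document.
+A minimal sketch (the layout of the stored JSON is an assumption here, not
+something this page defines):
+
+```python
+import json
+
+# Read whatever fasthep-flow wrote to the configured metadata file.
+with open("metadata.json") as handle:
+    provenance = json.load(handle)
+
+# Field names are hypothetical; inspect the file for the real structure.
+print(provenance.get("environment"))
+```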