Remove slices again (#735)
* Debugging

* Remove the _Slice implementation

Instead of creating custom partition functions, use the already
implemented __iter__ functions and a common (optimized) partitioning
function. The problems with the old implementation were
the larger memory usage and a larger number of iterations through the data.

* Sort after grouping, which might improve grouping speed

* Remove the new unneeded class

* style formatting

* Changelog

* Dask Integration (#736)

* Dask integration, unfinished

* Fix test

* Added dask tests

* Improve the feature extraction test

* Reworked the documentation for the new features

* Changelog

* Stylefix

* Forgot to add a class

* Increase test coverage

* Update feature_extraction_settings.rst (#740)

minimum/maximum are valid feature_calculators instead of min/max

https://tsfresh.readthedocs.io/en/latest/api/tsfresh.feature_extraction.html?highlight=extract_features#tsfresh.feature_extraction.feature_calculators.maximum

* Use a better download library (#741)

* Closes #743 (#744)

* Closes #743

* Adds issue (#743) info to changelog

* Fix the failure with the latest statsmodels installed (#749)

* limits lag length to 50% of sample size in `partial_autocorrelation`.

* try to fix ut

* fix ut

Co-authored-by: hekaisheng <kaisheng.hks@alibaba-inc.com>

* Fix #742, while taking into account the differences between Python's indexing of vectors and Matlab's indexing (cf. Bastia et al (2004), Eq. 1)

* Update docs/text/data_formats.rst

Co-authored-by: HaveF <iamaplayer@gmail.com>
Co-authored-by: patrjon <46594327+patrjon@users.noreply.github.com>
Co-authored-by: He Kaisheng <heks93@163.com>
Co-authored-by: hekaisheng <kaisheng.hks@alibaba-inc.com>
Co-authored-by: akem134@elan <a.kempa-liehr@auckland.ac.nz>

Co-authored-by: HaveF <iamaplayer@gmail.com>
Co-authored-by: patrjon <46594327+patrjon@users.noreply.github.com>
Co-authored-by: He Kaisheng <heks93@163.com>
Co-authored-by: hekaisheng <kaisheng.hks@alibaba-inc.com>
Co-authored-by: akem134@elan <a.kempa-liehr@auckland.ac.nz>
6 people authored Sep 9, 2020
1 parent babed38 commit 55a1e57
Showing 18 changed files with 760 additions and 316 deletions.
2 changes: 2 additions & 0 deletions CHANGES.rst
@@ -23,13 +23,15 @@ We changed the default branch from "master" to "main".
- Add a test for the dask bindings (#719)
- Refactor input data iteration to need less memory (#707)
- Added benchmark tests (#710)
- Make dask a possible input format (#736)
- Bugfixes:
- Fixed a bug in the selection, that caused all regression tasks with un-ordered index to be wrong (#715)
- Fixed readthedocs (#695, #696)
- Fix spark and dask after #705 and for non-id named id columns (#712)
- Fix in the forecasting notebook (#729)
- Let tsfresh choose the value column if possible (#722)
- Move from coveralls github action to codecov (#734)
- Improve speed of data processing (#735)
- Fix for newer, more strict pandas versions (#737)
- Fix documentation for feature calculators (#743)

36 changes: 18 additions & 18 deletions docs/index.rst
@@ -24,24 +24,24 @@ The following chapters will explain the tsfresh package in detail:
.. toctree::
:maxdepth: 1

Introduction <text/introduction>
Quick Start <text/quick_start>
Module Reference <api/modules>
Data Formats <text/data_formats>
scikit-learn Transformers <text/sklearn_transformers>
List of Calculated Features <text/list_of_features>
Feature Calculation <text/feature_calculation>
Feature Calculator Settings <text/feature_extraction_settings>
Feature Filtering <text/feature_filtering>
How to write custom Feature Calculators <text/how_to_add_custom_feature>
Parallelization <text/parallelization>
tsfresh on a cluster <text/tsfresh_on_a_cluster>
Rolling/Time Series Forecasting <text/forecasting>
FAQ <text/faq>
Authors <authors>
License <license>
Changelog <changes>
How to contribute <text/how_to_contribute>
text/introduction
text/quick_start
text/data_formats
text/sklearn_transformers
text/list_of_features
text/feature_extraction_settings
text/feature_filtering
text/how_to_add_custom_feature
text/large_data
text/tsfresh_on_a_cluster
text/forecasting
text/faq
api/modules
authors
license
changes
text/how_to_contribute
text/feature_calculation


Indices and tables
18 changes: 10 additions & 8 deletions docs/text/data_formats.rst
@@ -10,14 +10,15 @@ tsfresh offers three different options to specify the time series data to be used
Irrespective of the input format, tsfresh will always return the calculated features in the same output format
described below.

All three input format options consist of :class:`pandas.DataFrame` objects. There are four important column types that
Typically, the input format options consist of :class:`pandas.DataFrame` objects
(see :ref:`large-data-label` for other input types).
There are four important column types that
make up those DataFrames. Each will be described with an example from the robot failures dataset
(see :ref:`quick-start-label`).

:`column_id`: This column indicates which entities the time series belong to. Features will be extracted individually
for each entity (id). The resulting feature matrix will contain one row per id.
Each robot is a different entity, so each robot has a different id.
After rolling, each window will be a distinct entity, so it has a distinct id.

:`column_sort`: This column contains values which allow sorting the time series (e.g. time stamps).
In general, it is not required to have equidistant time steps or the same time scale for the different ids and/or kinds.
@@ -43,12 +44,12 @@ In the following, we describe the different input formats, which are built on those
* A dictionary of flat DataFrames

The difference between a flat and a stacked DataFrame is indicated by specifying or not specifying the parameters
`column_value` and `column_kind` in the :func:`tsfresh.extract_features` function.
``column_value`` and ``column_kind`` in the :func:`tsfresh.extract_features` function.

If you do not know which one to choose, you probably want to try out the flat or stacked DataFrame.

Input Option 1. Flat DataFrame
------------------------------
Input Option 1. Flat DataFrame or Wide DataFrame
------------------------------------------------

If both ``column_value`` and ``column_kind`` are set to ``None``, the time series data is assumed to be in a flat
DataFrame. This means that each different time series must be saved as its own column.
@@ -82,8 +83,8 @@ and you would pass
to the extraction functions, to extract features separately for all ids and separately for the x and y values.
You can also omit the ``column_kind=None, column_value=None`` as this is the default.
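A minimal flat DataFrame of this shape could look like the following (the ids, timestamps, and values are made up purely for illustration):

```python
import pandas as pd

# Flat (wide) format: one column per time series kind ("x" and "y"),
# plus the id and sort columns.
df = pd.DataFrame({
    "id":   [1, 1, 2, 2],
    "time": [0, 1, 0, 1],
    "x":    [1.0, 2.0, 3.0, 4.0],
    "y":    [5.0, 6.0, 7.0, 8.0],
})

# Features would then be extracted with
#   from tsfresh import extract_features
#   X = extract_features(df, column_id="id", column_sort="time")
# yielding one row per id (here: 2 rows).
```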

Input Option 2. Stacked DataFrame
---------------------------------
Input Option 2. Stacked DataFrame or Long DataFrame
---------------------------------------------------

If both ``column_value`` and ``column_kind`` are set, the time series data is assumed to be a stacked DataFrame.
This means that there are no different columns for the different types of time series.
@@ -128,6 +129,7 @@ Then you would set
column_id="id", column_sort="time", column_kind="kind", column_value="value"
to end up with the same extracted features as above.
You can also omit the value column and let ``tsfresh`` deduce it automatically.
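As an illustrative sketch (with made-up values), a stacked DataFrame can be produced from a flat one with ``pandas.melt``:

```python
import pandas as pd

flat = pd.DataFrame({
    "id":   [1, 1, 2, 2],
    "time": [0, 1, 0, 1],
    "x":    [1.0, 2.0, 3.0, 4.0],
    "y":    [5.0, 6.0, 7.0, 8.0],
})

# Stack the kind columns "x" and "y" into (kind, value) pairs;
# every row now holds exactly one measurement.
stacked = flat.melt(id_vars=["id", "time"], var_name="kind", value_name="value")

# Features would then be extracted with (requires tsfresh):
#   extract_features(stacked, column_id="id", column_sort="time",
#                    column_kind="kind", column_value="value")
```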


Input Option 3. Dictionary of flat DataFrames
@@ -205,4 +207,4 @@ where the x features are calculated using all x values (independently for A and
and so on.

This form of DataFrame is also the expected input format to the feature selection algorithms (e.g. the
:func:`tsfresh.select_features` function).
:func:`tsfresh.select_features` function).
7 changes: 2 additions & 5 deletions docs/text/feature_calculation.rst
@@ -1,10 +1,7 @@
.. _feature-naming-label:

Feature Calculation
===================

Feature naming
''''''''''''''
Feature Calculator Naming
=========================

tsfresh enforces a strict naming of the created features, which you have to follow whenever you create new feature
calculators.
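The scheme concatenates the time series kind, the calculator name, and the parameter key-value pairs with double underscores. The small helper below is only an illustration of the format, not part of tsfresh:

```python
def feature_name(kind, calculator, **params):
    # <kind>__<calculator>__<param>_<value>__... (the tsfresh naming scheme)
    parts = [kind, calculator] + [f"{key}_{value}" for key, value in params.items()]
    return "__".join(parts)

print(feature_name("temperature", "quantile", q=0.6))
# temperature__quantile__q_0.6
```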
6 changes: 3 additions & 3 deletions docs/text/feature_extraction_settings.rst
@@ -12,7 +12,7 @@ For the lazy: Just let me calculate some features
-------------------------------------------------

So, to just calculate a comprehensive set of features, call the :func:`tsfresh.extract_features` method without
passing a `default_fc_parameters` or `kind_to_fc_parameters` object, which means you are using the default options
passing a ``default_fc_parameters`` or ``kind_to_fc_parameters`` object, which means you are using the default options
(which will use all feature calculators in this package for what we think are sane default parameters).

For the advanced: How do I set the parameters for all kind of time series?
@@ -29,7 +29,7 @@ custom settings object:
>>> from tsfresh.feature_extraction import extract_features
>>> extract_features(df, default_fc_parameters=settings)

The `default_fc_parameters` is expected to be a dictionary, which maps feature calculator names
The ``default_fc_parameters`` is expected to be a dictionary, which maps feature calculator names
(the function names you can find in the :mod:`tsfresh.feature_extraction.feature_calculators` file) to a list
of dictionaries, which are the parameters with which the function will be called (as key-value pairs). Each
function/parameter combination that is in this dict will be called during the extraction and will produce a feature.
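For instance, a hand-written settings object could look like this (``length`` and ``large_standard_deviation`` are real calculator names from :mod:`tsfresh.feature_extraction.feature_calculators`; the parameter values are chosen for illustration):

```python
settings = {
    # a calculator without parameters maps to None
    "length": None,
    # a calculator with parameters maps to a list with one dict per feature
    "large_standard_deviation": [{"r": 0.05}, {"r": 0.1}],
}

# Passed as default_fc_parameters, this would produce three features per
# time series: length, and large_standard_deviation for r=0.05 and r=0.1.
```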
@@ -79,7 +79,7 @@ For the ambitious: How do I set the parameters for different types of time series?
It is also possible to control the features to be extracted for the different kinds of time series individually.
You can do so by passing another dictionary to the extract function as a

`kind_to_fc_parameters` = {"kind" : `fc_parameters`}
kind_to_fc_parameters = {"kind" : fc_parameters}

parameter. This dict must be a mapping from kind names (as string) to `fc_parameters` objects,
which you would normally pass as an argument to the `default_fc_parameters` parameter.
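A sketch of such a mapping (the kind names ``"x"`` and ``"y"`` are hypothetical; ``mean``, ``maximum`` and ``minimum`` are real calculator names):

```python
kind_to_fc_parameters = {
    # for the "x" series, compute mean and maximum ...
    "x": {"mean": None, "maximum": None},
    # ... but for the "y" series, only the minimum
    "y": {"minimum": None},
}
```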
3 changes: 1 addition & 2 deletions docs/text/introduction.rst
@@ -46,8 +46,6 @@ What not to do with tsfresh?
Currently, tsfresh is not suitable

* for usage with streaming data
* for batch processing over a distributed architecture when different time series are fragmented over different computational units
(but see how to use ``tsfresh`` on a cluster in :ref:`tsfresh-on-a-cluster-label`)
* to train models on the features (we do not want to reinvent the wheel, check out the python package
`scikit-learn <http://scikit-learn.org/stable/>`_ for example)

@@ -61,6 +59,7 @@ There is a matlab package called `hctsa <https://github.com/benfulcher/hctsa>`_
extract features from time series.
It is also possible to use hctsa from within python by means of the `pyopy <https://github.com/strawlab/pyopy>`_
package.
There also exist `featuretools <https://www.featuretools.com/>`_, `FATS <http://isadoranun.github.io/tsfeat/>`_ and `cesium <http://cesium-ml.org/>`_.

References
----------
84 changes: 84 additions & 0 deletions docs/text/large_data.rst
@@ -0,0 +1,84 @@
.. _large-data-label:

Large Input Data
================

If you are dealing with large time series data, you are facing multiple problems.
The two most important ones are:

* long execution times for feature extraction
* large memory consumption, even beyond what a single machine can handle

To solve only the first problem, you can parallelize the computation as described in :ref:`tsfresh-on-a-cluster-label`.
Please note that parallelization on your local computer is already turned on by default.

However, for even larger data you need to handle both problems at once.
You have multiple possibilities here:

Dask - the simple way
---------------------

*tsfresh* accepts a `dask dataframe <https://docs.dask.org/en/latest/dataframe.html>`_ instead of a
pandas dataframe as input for the :func:`tsfresh.extract_features` function.
Dask dataframes allow you to scale your computation beyond your local memory (via partitioning the data internally)
and even to large clusters of machines.
Its dataframe API is very similar to pandas dataframes and might even be a drop-in replacement.

All arguments discussed in :ref:`data-formats-label` are also valid for the dask case.
The input data will be transformed into the correct format for *tsfresh* using dask methods
and the feature extraction will be added as additional computations to the computation graph.
You can then add additional computations to the result or trigger the computation as usual with ``.compute()``.

.. NOTE::

The last step of the feature extraction is to bring all features into a tabular format.
Especially for very large data samples, this computation can be a large
performance bottleneck.
We therefore recommend turning the pivoting off if you do not really need it,
and working with the unpivoted data as much as possible.

For example, to read in data from parquet and do the feature extraction:

.. code::

    import dask.dataframe as dd
    from tsfresh import extract_features

    df = dd.read_parquet(...)

    X = extract_features(df,
                         column_id="id", column_sort="time",
                         pivot=False)

    result = X.compute()

Dask - more control
-------------------

The feature extraction method needs to perform some data transformations, before it
can call the actual feature calculators.
If you want to optimize your data flow, you might want to have more control over how
exactly the feature calculation is added to your dask computation graph.

Therefore, it is also possible to add the feature extraction directly:


.. code::

    from tsfresh.convenience.bindings import dask_feature_extraction_on_chunk

    features = dask_feature_extraction_on_chunk(df_grouped,
                                                column_id="id",
                                                column_kind="kind",
                                                column_sort="time",
                                                column_value="value")

In this case however, ``df_grouped`` must already be in the correct format.
Check out the documentation of :func:`tsfresh.convenience.bindings.dask_feature_extraction_on_chunk`
for more information.
No pivoting will be performed in this case.

PySpark
-------

Similar to dask, it is also possible to add the feature extraction into a Spark
computation graph.
You can find more information in the documentation of :func:`tsfresh.convenience.bindings.spark_feature_extraction_on_chunk`.
42 changes: 0 additions & 42 deletions docs/text/parallelization.rst

This file was deleted.

59 changes: 54 additions & 5 deletions docs/text/tsfresh_on_a_cluster.rst
@@ -3,8 +3,54 @@
.. role:: python(code)
:language: python

How to deploy tsfresh at scale
==============================
Parallelization
===============

The feature extraction, the feature selection as well as the rolling offer the possibility of parallelization.
By default, all of those tasks are parallelized by tsfresh.
Here we discuss the different settings to control the parallelization.
To achieve best results for your use-case you should experiment with the parameters.

.. NOTE::
This document describes parallelization for processing time speed up.
If you are dealing with large amounts of data (which might not fit into memory anymore),
you can also have a look into :ref:`large-data-label`.

Please let us know about your results when tuning the parameters mentioned below! It will help improve this document as
well as the default settings.

Parallelization of Feature Selection
------------------------------------

We use a :class:`multiprocessing.Pool` to parallelize the calculation of the p-values for each feature. On
instantiation we set the Pool's number of worker processes to
`n_jobs`. This field defaults to
the number of processors on the current system. We recommend setting it to the maximum number of available (and
otherwise idle) processors.

The chunksize of the Pool's map function is another important parameter to consider. It can be set via the
`chunksize` field. By default it is up to the
:class:`multiprocessing.Pool` to decide on the chunksize. One data chunk is
defined as a singular time series for one id and one kind. The chunksize is the
number of chunks that are submitted as one task to one worker process. If you
set the chunksize to 10, then one worker task corresponds to calculating all
features for 10 id/kind time series combinations. If it is set to None,
heuristics (depending on the distributor) are used to find the optimal
chunksize. The chunksize can have a crucial influence on the optimal cluster
performance and should be optimised in benchmarks for the problem at hand.
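The chunking semantics can be sketched as follows (this is an illustration of the bookkeeping, not tsfresh's internal code):

```python
def make_tasks(chunks, chunksize):
    # Group the (id, kind) time series chunks into worker tasks of
    # `chunksize` chunks each; the last task may be smaller.
    return [chunks[i:i + chunksize] for i in range(0, len(chunks), chunksize)]

# 25 id/kind combinations with chunksize 10 give 3 worker tasks.
chunks = [(sample_id, "x") for sample_id in range(25)]
tasks = make_tasks(chunks, chunksize=10)
print([len(task) for task in tasks])  # [10, 10, 5]
```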

Parallelization of Feature Extraction
-------------------------------------

For the feature extraction, tsfresh exposes the parameters
`n_jobs` and `chunksize`. Both behave analogously to the parameters
for the feature selection.

To do performance studies and profiling, it is sometimes quite useful to turn off parallelization altogether. This can be
done by setting the parameter `n_jobs` to 0.

Parallelization beyond a single machine
---------------------------------------

The high volume of time series data can demand an analysis at scale.
So, time series need to be processed on a group of computational units instead of a singular machine.
@@ -13,9 +59,6 @@ Accordingly, it may be necessary to distribute the extraction of time series features
Indeed, it is possible to extract features with *tsfresh* in a distributed fashion.
This page will explain how to setup a distributed *tsfresh*.

The distributor class
'''''''''''''''''''''

To distribute the calculation of features, we use a certain object, the Distributor class (contained in the
:mod:`tsfresh.utilities.distribution` module).

@@ -95,6 +138,12 @@ Using dask to distribute the calculations
We provide distributor for the `dask framework <https://dask.pydata.org/en/latest/>`_, where
*"Dask is a flexible parallel computing library for analytic computing."*

.. NOTE::
This part of the documentation only handles parallelizing the computation using
a dask cluster. The input and output are still pandas objects.
If you want to use dask's capabilities to scale to data beyond your local
memory, have a look into :ref:`large-data-label`.

Dask is a great framework to distribute analytic calculations to a cluster.
It scales up and down, meaning that you can even use it on a singular machine.
The only thing that you will need to run *tsfresh* on a Dask cluster is the ip address and port number of the
4 changes: 3 additions & 1 deletion tests/integrations/examples/test_driftbif_simulation.py
@@ -5,9 +5,11 @@
import numpy as np
import unittest
import pandas as pd
import dask.dataframe as dd

from tsfresh.examples.driftbif_simulation import velocity, load_driftbif, sample_tau
from tsfresh import extract_relevant_features
from tsfresh import extract_relevant_features, extract_features
from tsfresh.feature_extraction import MinimalFCParameters


class DriftBifSimlationTestCase(unittest.TestCase):
