Remove slices again (#735)
* Debugging

* Remove the _Slice implementation

Instead of creating custom partition functions, use the already
implemented __iter__ functions and a common (optimized) partitioning
function. The problems with the old implementation were
the larger memory usage and a larger number of iterations through the data.

* Sort after grouping, which might improve grouping speed

* Remove the new unneeded class

* style formatting

* Changelog

* Dask Integration (#736)

* Dask integration, unfinished

* Fix test

* Added dask tests

* Improve the feature extraction test

* Reworked the documentation for the new features

* Changelog

* Stylefix

* Forgot to add a class

* Increase test coverage

* Update feature_extraction_settings.rst (#740)

minimum/maximum are valid feature_calculators instead of min/max

https://tsfresh.readthedocs.io/en/latest/api/tsfresh.feature_extraction.html?highlight=extract_features#tsfresh.feature_extraction.feature_calculators.maximum

* Use a better download library (#741)

* Closes #743 (#744)

* Closes #743

* Adds issue (#743) info to changelog

* Fix the failure with the latest statsmodels installed (#749)

* limits lag length to 50% of sample size in `partial_autocorrelation`.

* try to fix ut

* fix ut

Co-authored-by: hekaisheng <kaisheng.hks@alibaba-inc.com>

* Fix #742, while taking into account the differences between Python's indexing of vectors and Matlab's indexing (cf. Bastia et al (2004), Eq. 1)

* Update docs/text/data_formats.rst

Co-authored-by: HaveF <iamaplayer@gmail.com>
Co-authored-by: patrjon <46594327+patrjon@users.noreply.github.com>
Co-authored-by: He Kaisheng <heks93@163.com>
Co-authored-by: hekaisheng <kaisheng.hks@alibaba-inc.com>
Co-authored-by: akem134@elan <a.kempa-liehr@auckland.ac.nz>

Co-authored-by: HaveF <iamaplayer@gmail.com>
Co-authored-by: patrjon <46594327+patrjon@users.noreply.github.com>
Co-authored-by: He Kaisheng <heks93@163.com>
Co-authored-by: hekaisheng <kaisheng.hks@alibaba-inc.com>
Co-authored-by: akem134@elan <a.kempa-liehr@auckland.ac.nz>
6 people authored Sep 9, 2020
1 parent babed38 commit 55a1e57
Showing 18 changed files with 760 additions and 316 deletions.
2 changes: 2 additions & 0 deletions CHANGES.rst
@@ -23,13 +23,15 @@ We changed the default branch from "master" to "main".
- Add a test for the dask bindings (#719)
- Refactor input data iteration to need less memory (#707)
- Added benchmark tests (#710)
- Make dask a possible input format (#736)
- Bugfixes:
- Fixed a bug in the selection, that caused all regression tasks with un-ordered index to be wrong (#715)
- Fixed readthedocs (#695, #696)
- Fix spark and dask after #705 and for non-id named id columns (#712)
- Fix in the forecasting notebook (#729)
- Let tsfresh choose the value column if possible (#722)
- Move from coveralls github action to codecov (#734)
- Improve speed of data processing (#735)
- Fix for newer, more strict pandas versions (#737)
- Fix documentation for feature calculators (#743)

36 changes: 18 additions & 18 deletions docs/index.rst
@@ -24,24 +24,24 @@ The following chapters will explain the tsfresh package in detail:
.. toctree::
:maxdepth: 1

Introduction <text/introduction>
Quick Start <text/quick_start>
Module Reference <api/modules>
Data Formats <text/data_formats>
scikit-learn Transformers <text/sklearn_transformers>
List of Calculated Features <text/list_of_features>
Feature Calculation <text/feature_calculation>
Feature Calculator Settings <text/feature_extraction_settings>
Feature Filtering <text/feature_filtering>
How to write custom Feature Calculators <text/how_to_add_custom_feature>
Parallelization <text/parallelization>
tsfresh on a cluster <text/tsfresh_on_a_cluster>
Rolling/Time Series Forecasting <text/forecasting>
FAQ <text/faq>
Authors <authors>
License <license>
Changelog <changes>
How to contribute <text/how_to_contribute>
text/introduction
text/quick_start
text/data_formats
text/sklearn_transformers
text/list_of_features
text/feature_extraction_settings
text/feature_filtering
text/how_to_add_custom_feature
text/large_data
text/tsfresh_on_a_cluster
text/forecasting
text/faq
api/modules
authors
license
changes
text/how_to_contribute
text/feature_calculation


Indices and tables
18 changes: 10 additions & 8 deletions docs/text/data_formats.rst
@@ -10,14 +10,15 @@ tsfresh offers three different options to specify the time series data to be used
Irrespective of the input format, tsfresh will always return the calculated features in the same output format
described below.

All three input format options consist of :class:`pandas.DataFrame` objects. There are four important column types that
Typically, the input format options consist of :class:`pandas.DataFrame` objects
(see :ref:`large-data-label` for other input types).
There are four important column types that
make up those DataFrames. Each will be described with an example from the robot failures dataset
(see :ref:`quick-start-label`).

:`column_id`: This column indicates which entities the time series belong to. Features will be extracted individually
for each entity (id). The resulting feature matrix will contain one row per id.
Each robot is a different entity, so each robot has a different id.
After rolling, each window will be a distinct entity, so it has a distinct id.

:`column_sort`: This column contains values which allow sorting the time series (e.g. time stamps).
In general, it is not required to have equidistant time steps or the same time scale for the different ids and/or kinds.
@@ -43,12 +44,12 @@ In the following, we describe the different input formats, which are built on those
* A dictionary of flat DataFrames

The difference between a flat and a stacked DataFrame is indicated by specifying or not specifying the parameters
`column_value` and `column_kind` in the :func:`tsfresh.extract_features` function.
``column_value`` and ``column_kind`` in the :func:`tsfresh.extract_features` function.

If you do not know which one to choose, you probably want to try out the flat or stacked DataFrame.

Input Option 1. Flat DataFrame
------------------------------
Input Option 1. Flat DataFrame or Wide DataFrame
------------------------------------------------

If both ``column_value`` and ``column_kind`` are set to ``None``, the time series data is assumed to be in a flat
DataFrame. This means that each different time series must be saved as its own column.
@@ -82,8 +83,8 @@ and you would pass
to the extraction functions, to extract features separately for all ids and separately for the x and y values.
You can also omit the ``column_kind=None, column_value=None`` as this is the default.
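A minimal flat DataFrame of this shape could look like the following (the ids, timestamps, and values are made up purely for illustration):

```python
import pandas as pd

# Flat (wide) format: one column per time series kind ("x" and "y"),
# plus the id and sort columns.
df = pd.DataFrame({
    "id":   [1, 1, 2, 2],
    "time": [0, 1, 0, 1],
    "x":    [1.0, 2.0, 3.0, 4.0],
    "y":    [5.0, 6.0, 7.0, 8.0],
})

# Features would then be extracted with
#   from tsfresh import extract_features
#   X = extract_features(df, column_id="id", column_sort="time")
# yielding one row per id (here: 2 rows).
```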

Input Option 2. Stacked DataFrame
---------------------------------
Input Option 2. Stacked DataFrame or Long DataFrame
---------------------------------------------------

If both ``column_value`` and ``column_kind`` are set, the time series data is assumed to be a stacked DataFrame.
This means that there are no different columns for the different types of time series.
@@ -128,6 +129,7 @@ Then you would set
column_id="id", column_sort="time", column_kind="kind", column_value="value"
to end up with the same extracted features as above.
You can also omit the value column and let ``tsfresh`` deduce it automatically.
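As an illustrative sketch (with made-up values), a stacked DataFrame can be produced from a flat one with ``pandas.melt``:

```python
import pandas as pd

flat = pd.DataFrame({
    "id":   [1, 1, 2, 2],
    "time": [0, 1, 0, 1],
    "x":    [1.0, 2.0, 3.0, 4.0],
    "y":    [5.0, 6.0, 7.0, 8.0],
})

# Stack the kind columns "x" and "y" into (kind, value) pairs;
# every row now holds exactly one measurement.
stacked = flat.melt(id_vars=["id", "time"], var_name="kind", value_name="value")

# Features would then be extracted with (requires tsfresh):
#   extract_features(stacked, column_id="id", column_sort="time",
#                    column_kind="kind", column_value="value")
```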


Input Option 3. Dictionary of flat DataFrames
@@ -205,4 +207,4 @@ where the x features are calculated using all x values (independently for A and
and so on.

This form of DataFrame is also the expected input format to the feature selection algorithms (e.g. the
:func:`tsfresh.select_features` function).
:func:`tsfresh.select_features` function).
7 changes: 2 additions & 5 deletions docs/text/feature_calculation.rst
@@ -1,10 +1,7 @@
.. _feature-naming-label:

Feature Calculation
===================

Feature naming
''''''''''''''
Feature Calculator Naming
=========================

tsfresh enforces a strict naming of the created features, which you have to follow whenever you create new feature
calculators.
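The scheme concatenates the time series kind, the calculator name, and the parameter key-value pairs with double underscores. The small helper below is only an illustration of the format, not part of tsfresh:

```python
def feature_name(kind, calculator, **params):
    # <kind>__<calculator>__<param>_<value>__... (the tsfresh naming scheme)
    parts = [kind, calculator] + [f"{key}_{value}" for key, value in params.items()]
    return "__".join(parts)

print(feature_name("temperature", "quantile", q=0.6))
# temperature__quantile__q_0.6
```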
6 changes: 3 additions & 3 deletions docs/text/feature_extraction_settings.rst
@@ -12,7 +12,7 @@ For the lazy: Just let me calculate some features
-------------------------------------------------

So, to just calculate a comprehensive set of features, call the :func:`tsfresh.extract_features` method without
passing a `default_fc_parameters` or `kind_to_fc_parameters` object, which means you are using the default options
passing a ``default_fc_parameters`` or ``kind_to_fc_parameters`` object, which means you are using the default options
(which will use all feature calculators in this package for what we think are sane default parameters).

For the advanced: How do I set the parameters for all kind of time series?
@@ -29,7 +29,7 @@ custom settings object:
>>> from tsfresh.feature_extraction import extract_features
>>> extract_features(df, default_fc_parameters=settings)

The `default_fc_parameters` is expected to be a dictionary, which maps feature calculator names
The ``default_fc_parameters`` is expected to be a dictionary, which maps feature calculator names
(the function names you can find in the :mod:`tsfresh.feature_extraction.feature_calculators` file) to a list
of dictionaries, which are the parameters with which the function will be called (as key-value pairs). Each
function/parameter combination that is in this dict will be called during the extraction and will produce a feature.
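For instance, a hand-written settings object could look like this (``length`` and ``large_standard_deviation`` are real calculator names from :mod:`tsfresh.feature_extraction.feature_calculators`; the parameter values are chosen for illustration):

```python
settings = {
    # a calculator without parameters maps to None
    "length": None,
    # a calculator with parameters maps to a list with one dict per feature
    "large_standard_deviation": [{"r": 0.05}, {"r": 0.1}],
}

# Passed as default_fc_parameters, this would produce three features per
# time series: length, and large_standard_deviation for r=0.05 and r=0.1.
```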
@@ -79,7 +79,7 @@ For the ambitious: How do I set the parameters for different types of time series?
It is also possible to control the features to be extracted for the different kinds of time series individually.
You can do so by passing another dictionary to the extract function as a

`kind_to_fc_parameters` = {"kind" : `fc_parameters`}
kind_to_fc_parameters = {"kind" : fc_parameters}

parameter. This dict must be a mapping from kind names (as string) to `fc_parameters` objects,
which you would normally pass as an argument to the `default_fc_parameters` parameter.
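A sketch of such a mapping (the kind names ``"x"`` and ``"y"`` are hypothetical; ``mean``, ``maximum`` and ``minimum`` are real calculator names):

```python
kind_to_fc_parameters = {
    # for the "x" series, compute mean and maximum ...
    "x": {"mean": None, "maximum": None},
    # ... but for the "y" series, only the minimum
    "y": {"minimum": None},
}
```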
3 changes: 1 addition & 2 deletions docs/text/introduction.rst
@@ -46,8 +46,6 @@ What not to do with tsfresh?
Currently, tsfresh is not suitable

* for usage with streaming data
* for batch processing over a distributed architecture when different time series are fragmented over different computational units
(but see how to use ``tsfresh`` on a cluster in :ref:`tsfresh-on-a-cluster-label`)
* to train models on the features (we do not want to reinvent the wheel, check out the python package
`scikit-learn <http://scikit-learn.org/stable/>`_ for example)

@@ -61,6 +59,7 @@ There is a matlab package called `hctsa <https://github.com/benfulcher/hctsa>`_
extract features from time series.
It is also possible to use hctsa from within python by means of the `pyopy <https://github.com/strawlab/pyopy>`_
package.
There also exist `featuretools <https://www.featuretools.com/>`_, `FATS <http://isadoranun.github.io/tsfeat/>`_ and `cesium <http://cesium-ml.org/>`_.

References
----------
84 changes: 84 additions & 0 deletions docs/text/large_data.rst
@@ -0,0 +1,84 @@
.. _large-data-label:

Large Input Data
================

If you are dealing with large time series data, you are facing multiple problems.
The two most important ones are:

* long execution times for feature extraction
* large memory consumption, even beyond what a single machine can handle

To solve only the first problem, you can parallelize the computation as described in :ref:`tsfresh-on-a-cluster-label`.
Please note that parallelization on your local computer is already turned on by default.

However, for even larger data you need to handle both problems at once.
You have multiple possibilities here:

Dask - the simple way
---------------------

*tsfresh* accepts a `dask dataframe <https://docs.dask.org/en/latest/dataframe.html>`_ instead of a
pandas dataframe as input for the :func:`tsfresh.extract_features` function.
Dask dataframes allow you to scale your computation beyond your local memory (via partitioning the data internally)
and even to large clusters of machines.
Its dataframe API is very similar to pandas dataframes and might even be a drop-in replacement.

All arguments discussed in :ref:`data-formats-label` are also valid for the dask case.
The input data will be transformed into the correct format for *tsfresh* using dask methods
and the feature extraction will be added as additional computations to the computation graph.
You can then add additional computations to the result or trigger the computation as usual with ``.compute()``.

.. NOTE::

The last step of the feature extraction is to bring all features into a tabular format.
Especially for very large data samples, this computation can be a large
performance bottleneck.
We therefore recommend turning the pivoting off if you do not really need it,
and working with the unpivoted data as much as possible.

For example, to read in data from parquet and do the feature extraction:

.. code::

    import dask.dataframe as dd
    from tsfresh import extract_features

    df = dd.read_parquet(...)

    X = extract_features(df,
                         column_id="id", column_sort="time",
                         pivot=False)

    result = X.compute()

Dask - more control
-------------------

The feature extraction method needs to perform some data transformations, before it
can call the actual feature calculators.
If you want to optimize your data flow, you might want to have more control over how
exactly the feature calculation is added to your dask computation graph.

Therefore, it is also possible to add the feature extraction directly:


.. code::

    from tsfresh.convenience.bindings import dask_feature_extraction_on_chunk

    features = dask_feature_extraction_on_chunk(df_grouped,
                                                column_id="id",
                                                column_kind="kind",
                                                column_sort="time",
                                                column_value="value")

In this case however, ``df_grouped`` must already be in the correct format.
Check out the documentation of :func:`tsfresh.convenience.bindings.dask_feature_extraction_on_chunk`
for more information.
No pivoting will be performed in this case.

PySpark
-------

Similar to dask, it is also possible to add the feature extraction into a Spark
computation graph.
You can find more information in the documentation of :func:`tsfresh.convenience.bindings.spark_feature_extraction_on_chunk`.
42 changes: 0 additions & 42 deletions docs/text/parallelization.rst

This file was deleted.

59 changes: 54 additions & 5 deletions docs/text/tsfresh_on_a_cluster.rst
@@ -3,8 +3,54 @@
.. role:: python(code)
:language: python

How to deploy tsfresh at scale
==============================
Parallelization
===============

The feature extraction, the feature selection as well as the rolling offer the possibility of parallelization.
By default, all of those tasks are parallelized by tsfresh.
Here we discuss the different settings to control the parallelization.
To achieve best results for your use-case you should experiment with the parameters.

.. NOTE::
This document describes parallelization for processing time speed up.
If you are dealing with large amounts of data (which might not fit into memory anymore),
you can also have a look into :ref:`large-data-label`.

Please let us know about your results when tuning the parameters mentioned below! It will help improve this document as
well as the default settings.

Parallelization of Feature Selection
------------------------------------

We use a :class:`multiprocessing.Pool` to parallelize the calculation of the p-values for each feature. On
instantiation we set the Pool's number of worker processes to
`n_jobs`. This field defaults to
the number of processors on the current system. We recommend setting it to the maximum number of available (and
otherwise idle) processors.

The chunksize of the Pool's map function is another important parameter to consider. It can be set via the
`chunksize` field. By default it is up to the
:class:`multiprocessing.Pool` to decide on the chunksize. One data chunk is
defined as a singular time series for one id and one kind. The chunksize is the
number of chunks that are submitted as one task to one worker process. If you
set the chunksize to 10, then one worker task corresponds to calculating all
features for 10 id/kind time series combinations. If it is set to None,
heuristics (depending on the distributor) are used to find the optimal
chunksize. The chunksize can have a crucial influence on the optimal cluster
performance and should be optimised in benchmarks for the problem at hand.
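The chunking semantics can be sketched as follows (this is an illustration of the bookkeeping, not tsfresh's internal code):

```python
def make_tasks(chunks, chunksize):
    # Group the (id, kind) time series chunks into worker tasks of
    # `chunksize` chunks each; the last task may be smaller.
    return [chunks[i:i + chunksize] for i in range(0, len(chunks), chunksize)]

# 25 id/kind combinations with chunksize 10 give 3 worker tasks.
chunks = [(sample_id, "x") for sample_id in range(25)]
tasks = make_tasks(chunks, chunksize=10)
print([len(task) for task in tasks])  # [10, 10, 5]
```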

Parallelization of Feature Extraction
-------------------------------------

For the feature extraction, tsfresh exposes the parameters
`n_jobs` and `chunksize`. Both behave analogously to the parameters
for the feature selection.

To do performance studies and profiling, it is sometimes quite useful to turn off parallelization altogether. This can be
done by setting the parameter `n_jobs` to 0.

Parallelization beyond a single machine
---------------------------------------

The high volume of time series data can demand an analysis at scale.
So, time series need to be processed on a group of computational units instead of a singular machine.
@@ -13,9 +59,6 @@ Accordingly, it may be necessary to distribute the extraction of time series features
Indeed, it is possible to extract features with *tsfresh* in a distributed fashion.
This page will explain how to setup a distributed *tsfresh*.

The distributor class
'''''''''''''''''''''

To distribute the calculation of features, we use a certain object, the Distributor class (contained in the
:mod:`tsfresh.utilities.distribution` module).

@@ -95,6 +138,12 @@ Using dask to distribute the calculations
We provide distributor for the `dask framework <https://dask.pydata.org/en/latest/>`_, where
*"Dask is a flexible parallel computing library for analytic computing."*

.. NOTE::
This part of the documentation only handles parallelizing the computation using
a dask cluster. The input and output are still pandas objects.
If you want to use dask's capabilities to scale to data beyond your local
memory, have a look into :ref:`large-data-label`.

Dask is a great framework to distribute analytic calculations to a cluster.
It scales up and down, meaning that you can even use it on a singular machine.
The only thing that you will need to run *tsfresh* on a Dask cluster is the ip address and port number of the
4 changes: 3 additions & 1 deletion tests/integrations/examples/test_driftbif_simulation.py
@@ -5,9 +5,11 @@
import numpy as np
import unittest
import pandas as pd
import dask.dataframe as dd

from tsfresh.examples.driftbif_simulation import velocity, load_driftbif, sample_tau
from tsfresh import extract_relevant_features
from tsfresh import extract_relevant_features, extract_features
from tsfresh.feature_extraction import MinimalFCParameters


class DriftBifSimlationTestCase(unittest.TestCase):
