Add support for non-pure computations #37

Open
CMCDragonkai opened this issue Apr 30, 2019 · 5 comments
Labels
enhancement New feature or request

Comments

@CMCDragonkai

In Dask, whether or not you set pure to True affects the way the computational graph is constructed. In fact, enabling pure allows dask to optimise prior to constructing the graph.
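
As a rough illustration of why pure changes the graph: as far as I understand, dask.delayed derives a deterministic key from the function and its arguments when pure=True, and a random key otherwise. The sketch below mimics that idea with the standard library only. `node_key` is a hypothetical helper, not part of dask, and dask's real tokenisation is more involved.

```python
import hashlib
import uuid


def node_key(func, args, pure):
    """Return a graph key for calling func(*args) (hypothetical helper)."""
    if pure:
        # Deterministic: the same function and arguments give the same key.
        digest = hashlib.sha1(repr((func.__qualname__, args)).encode()).hexdigest()
    else:
        # Random: every call gets a fresh key, so calls are never merged.
        digest = uuid.uuid4().hex
    return f"{func.__name__}-{digest[:12]}"


def add(x, y):
    return x + y


# Build a dask-style graph (a plain dict of key -> task tuple).
graph = {}
for pure in (True, True, False, False):
    graph[node_key(add, (1, 2), pure)] = (add, 1, 2)

# The two pure calls collapse into one node; each impure call keeps its own.
print(len(graph))  # 3
```

Because identical pure calls share a key, they occupy a single node before any optimiser sees the graph.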

I've noticed that graphchain appears to assume all computational nodes are pure, and caches them. But if some nodes are not pure, they shouldn't be cached.

@lsorber
Member

lsorber commented Apr 30, 2019

That's a good observation. One problem with that is that we cannot, and should not, count on developers to annotate all pure functions in a graph. In the case of a hand-rolled dask graph, I'm not even aware of a way to annotate the fact that a computation is pure.

For those reasons, we optimistically assume that all computations are pure, and ask the user to pass those keys that they do not wish to cache. (EDIT: The choice of whether to cache a computation or not does not change the fact that graphchain assumes all computations are pure.)

What we can do is add an option that, if set, automatically caches only pure computations by default. Would that be a satisfactory solution?

@CMCDragonkai
Author

I think that if you are using graphchain on a manually constructed graph, then there's no way to know. But if I'm using dask's smart constructors such as dask.delayed, and I have set a node to be pure or not pure (by default dask sets pure to False, i.e. impure), then I expect the graphchain optimiser to respect this option and not cache the nodes that are impure.

@CMCDragonkai
Author

The problem is, I don't know whether the dask graph that you get in the optimiser actually preserves the purity attribute. Does the HighLevelGraph preserve it?

@lsorber
Member

lsorber commented May 4, 2019

I think the pure attribute is part of the Delayed computation, so yes, we would have access to it.

One question we need to address is what should happen if a pure function depends on the output of a non-pure function. In that case, we are in principle forced to compute the non-pure function and use its output as input for the pure function. With a joblib.Memory-style cache, we could hash the inputs of the pure function and see if we have the pure function's result cached.
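
To make the joblib.Memory-style idea concrete, here is a minimal sketch that keys the cache on the actual argument values rather than on the graph structure. It is not how graphchain or joblib are implemented; `memoize_by_input` and `read_sensor` are hypothetical names, and pickling the arguments is an assumption that only holds for simple, picklable inputs.

```python
import hashlib
import pickle

_cache = {}


def memoize_by_input(func):
    """Cache results keyed on the function name and the pickled argument values."""
    def wrapper(*args):
        key = hashlib.sha256(pickle.dumps((func.__qualname__, args))).hexdigest()
        if key not in _cache:
            _cache[key] = func(*args)
        return _cache[key]
    return wrapper


calls = []  # records every real execution of square


@memoize_by_input
def square(x):
    calls.append(x)
    return x * x


def read_sensor(step):
    # Stand-in for a non-pure computation: it always has to be executed.
    return step % 2


# The non-pure function runs four times; the pure one only runs per distinct input.
results = [square(read_sensor(step)) for step in range(4)]
print(results)  # [0, 1, 0, 1]
print(calls)    # [0, 1]
```

The non-pure function still has to run every time, but the pure function downstream is only executed once per distinct input value.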

However, that is not compatible with graphchain's current caching mechanism: a computation's result is cached based on a chain of hashes of the preceding computations' source and the root inputs. If there is a single non-pure computation in between, then that breaks the chain of hashes and we can no longer determine the cache key to retrieve.
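
The chain of hashes can be sketched roughly as follows. This is an illustration of the idea under stated assumptions, not graphchain's actual code: `func_fingerprint` and `chain_hash` are hypothetical helpers, the function's bytecode stands in for its source, and graph keys are assumed to be strings.

```python
import hashlib


def func_fingerprint(func):
    # Assumption: bytecode plus constants stands in for the function's source.
    code = func.__code__
    return hashlib.sha256(code.co_code + repr(code.co_consts).encode()).hexdigest()


def chain_hash(graph, key, memo=None):
    """Hash a node from its function and the hashes of everything upstream."""
    memo = {} if memo is None else memo
    if key in memo:
        return memo[key]
    task = graph[key]
    h = hashlib.sha256()
    if isinstance(task, tuple) and task and callable(task[0]):
        h.update(func_fingerprint(task[0]).encode())
        for arg in task[1:]:
            if isinstance(arg, str) and arg in graph:
                # A dependency: fold in its chain hash, not its value.
                h.update(chain_hash(graph, arg, memo).encode())
            else:
                h.update(repr(arg).encode())
    else:
        # A root input: hash the literal value.
        h.update(repr(task).encode())
    memo[key] = h.hexdigest()
    return memo[key]


def inc(x):
    return x + 1


def double(x):
    return 2 * x


g1 = {"a": 1, "b": (inc, "a"), "c": (double, "b")}
g2 = {"a": 2, "b": (inc, "a"), "c": (double, "b")}

# No task is executed; the key for "c" is derived from the root input and code alone.
print(chain_hash(g1, "c") == chain_hash(dict(g1), "c"))  # True
print(chain_hash(g1, "c") == chain_hash(g2, "c"))        # False
```

Note that the hash for "c" is computed without running "b". That only identifies the right cached result if "b" is pure; a non-pure "b" can return something different for the same chain hash.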

In effect, graphchain assumes the full dask graph is pure at this time. If we want to add support for non-pure computations, that would require some refactoring.

@CMCDragonkai
Author

CMCDragonkai commented May 4, 2019

Impurity is infectious. If a pure function depends on an impure input, then it is also impure. I fully expect this. So if I put an impure computation at the beginning of a pipeline, I expect graphchain to assume pretty much the whole graph is impure.

However, if I put the impure computation near the end, then only that should be recomputed, as its inputs are pure and cached.
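
A small sketch of what I mean by infectious, on a hand-rolled dask-style graph. `tainted_keys` is a hypothetical helper, the set of impure keys is assumed to be supplied by the caller, and keys are assumed to be strings; none of this is existing graphchain API.

```python
def dependencies(graph, key):
    """Return the keys that the task at `key` takes as arguments."""
    task = graph[key]
    if isinstance(task, tuple) and task and callable(task[0]):
        return [a for a in task[1:] if isinstance(a, str) and a in graph]
    return []


def tainted_keys(graph, impure):
    """Keys that are impure themselves or depend, directly or not, on an impure key."""
    tainted = {}

    def visit(key):
        if key not in tainted:
            tainted[key] = key in impure or any(
                visit(dep) for dep in dependencies(graph, key)
            )
        return tainted[key]

    for key in graph:
        visit(key)
    return {key for key, bad in tainted.items() if bad}


def load():
    return [3, 1, 2]


def clean(rows):
    return sorted(rows)


def render(rows):
    return ", ".join(map(str, rows))


graph = {"raw": (load,), "clean": (clean, "raw"), "report": (render, "clean")}

# Impure at the start: the whole pipeline must be recomputed.
print(sorted(tainted_keys(graph, {"raw"})))     # ['clean', 'raw', 'report']
# Impure at the end: only the last node is recomputed, the rest can be cached.
print(sorted(tainted_keys(graph, {"report"})))  # ['report']
```

Everything outside the tainted set is safe to cache; everything inside it has to be recomputed on every run.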

@lsorber changed the title from "Does graphchain optimiser automatically assume all computational nodes are pure?" to "Add support for non-pure computations" May 4, 2019
@lsorber added the enhancement label May 4, 2019