utilize spatialdata format to save sparcspy results #40
spatialdata writers for SPARCStools are implemented in MannLabs/SPARCStools@13301fa.
The segmentation workflow is functional for sharded and regular segmentations as of commit 523f142. Small benchmarking comparison on the same dataset (relatively small, 2000x2000 px):
Sharding generates some overhead in the new SPARCSspatial implementation vs non-sharded processing but is required for larger-than-memory computations. This overhead seems to be larger than in the original SPARCSpy implementation, probably as a result of the suboptimal implementation of the sharding resolution currently found in the working version: we first generate a memory-mapped temp array into which all results are aggregated, and then transfer this array to the sdata object. This workaround was implemented because I did not find a way to update a mask in the sdata object in iterative steps while always keeping it backed. Maybe a solution exists that I have not found yet that would allow us to skip this additional step and make the process more efficient.
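A rough sketch of this workaround, with purely illustrative shard sizes, element names, and paths (the resolution of label IDs across shard borders is omitted), could look like this:

```python
import numpy as np
from spatialdata import SpatialData
from spatialdata.models import Labels2DModel

# illustrative sizes: a 2000x2000 px mask assembled from 1000x1000 px shards
shape, tile = (2000, 2000), 1000

# 1) aggregate all shard results into a memory-mapped temp array on disk
seg_tmp = np.memmap("segmentation_tmp.dat", dtype=np.uint32, mode="w+", shape=shape)
for y0 in range(0, shape[0], tile):
    for x0 in range(0, shape[1], tile):
        # stand-in for the per-shard segmentation result; real code would also
        # resolve conflicting label IDs along the shard borders here
        shard_mask = np.random.randint(0, 50, size=(tile, tile), dtype=np.uint32)
        seg_tmp[y0:y0 + tile, x0:x0 + tile] = shard_mask
seg_tmp.flush()

# 2) transfer the finished mask into the sdata object in a single step
labels = Labels2DModel.parse(np.asarray(seg_tmp))
sdata = SpatialData(labels={"segmentation_mask": labels})
sdata.write("project.zarr")
```

Assembling the full mask in a temp array before the hand-off is exactly the extra copy that an in-place, backed update of the sdata element would avoid.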
The extraction workflow is functional for single-threaded and multi-threaded processing as of commit dd3ee0f. Small benchmarking comparison on the same dataset (relatively small, 2000x2000 px, 683 unique cells to extract in total):
Performance in spatialdata seems to be much lower than in SPARCSpy. This is most likely a result of lower read speeds from the sdata object compared to HDF5. The chosen example is also not ideal for benchmarking multi-threaded performance, as not enough cells are extracted to counterbalance the overhead of instantiating multiple workers. By changing the chunking of the segmentation masks to match the default chunking used in SPARCSpy, the extraction time was reduced from 29.4 s ± 2.48 s per loop to 18.5 s ± 811 ms per loop. While this is still slower than SPARCSpy, it gives us a starting point for addressing the issue.
Benchmarking of reading/writing speed to sdata vs h5py
I developed a script that mimics the extraction workflow setup but is initialised with random data and executes exactly identical code, to better compare the performance of h5py and sdata as backends for saving prior results. Used code:
Executing this code and tracking the resulting computation times gives the following results:
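As an illustration only (not the original script), a comparison along these lines, using random data, per-cell crop reads, and made-up parameters, could look like the following. sdata stores its elements in zarr, so the zarr timings serve as a proxy for reads from the sdata object:

```python
import time
import h5py
import numpy as np
import zarr

# illustrative parameters
shape = (2, 2000, 2000)   # channels, y, x
n_cells, win = 683, 128   # number of cells to "extract" and crop size per cell
rng = np.random.default_rng(0)
data = rng.random(shape, dtype=np.float32)
coords = rng.integers(0, shape[1] - win, size=(n_cells, 2))

# write identical random data to both backends with identical chunking
with h5py.File("bench.h5", "w") as f:
    f.create_dataset("channels", data=data, chunks=(1, win, win))
zarr.save_array("bench.zarr", data, chunks=(1, win, win))

def bench(read_crop):
    """Time reading one crop per cell, mimicking the extraction access pattern."""
    start = time.perf_counter()
    for y, x in coords:
        read_crop(int(y), int(x))
    return time.perf_counter() - start

with h5py.File("bench.h5", "r") as f:
    dset = f["channels"]
    t_h5 = bench(lambda y, x: dset[:, y:y + win, x:x + win])

z = zarr.open_array("bench.zarr", mode="r")
t_zarr = bench(lambda y, x: z[:, y:y + win, x:x + win])

print(f"h5py: {t_h5:.2f} s, zarr: {t_zarr:.2f} s")
```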
Potential solution idea for the slow read times from sdata: map the relevant sdata elements to memory-mapped temp arrays and perform the extraction from those. Initial implementation in commit bb16cdd. Updated benchmarking with the new method:
The use of memory-mapped temp arrays seems to speed up the actual extraction process but generates some overhead for mapping the images/masks to disk beforehand. To better understand this trade-off, a proper benchmark needs to be performed in which the following things are quantified:
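A minimal sketch of the memory-mapping approach itself, assuming single-scale elements and illustrative element names and paths:

```python
import numpy as np
from spatialdata import read_zarr

def to_memmap(element, path):
    """Materialise a zarr/dask-backed sdata element into a memory-mapped temp array."""
    data = np.asarray(element)  # pulls the element into memory once
    # (a chunk-wise copy would avoid loading the full element at once)
    mm = np.memmap(path, dtype=data.dtype, mode="w+", shape=data.shape)
    mm[:] = data
    mm.flush()
    return mm

sdata = read_zarr("project.zarr")
image_mm = to_memmap(sdata.images["input_image"], "image_tmp.dat")      # assumed names
mask_mm = to_memmap(sdata.labels["segmentation_mask"], "mask_tmp.dat")

# the extraction then reads its single-cell crops from image_mm / mask_mm
# instead of going through the sdata object on every access
crop = image_mm[..., 100:228, 100:228]
```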
Update
The significant difference between the SPARCSpy and SPARCSspatial implementations resulted from no longer keeping the memory-mapped temp arrays used for saving intermediate results as global variables, but instead reconnecting to these arrays in each save call, which generates a lot of time overhead. Reimplementing the workflow to connect to the memory-mapped temp arrays only once per process (for single-threaded execution once during the extraction, for multi-threaded execution once per thread call) significantly boosted the extraction times. This fix was then also transferred to the SPARCSspatial implementation, and the two were benchmarked against each other.
Replicates of 6 independent runs on the same input dataset with identical segmentation masks. While the SPARCSspatial implementation is still somewhat less efficient than the base SPARCSpy implementation, this is an acceptable result with which we can continue working. This analysis should be replicated with larger datasets to better estimate the effect of multithreading on processing time.
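A sketch of the connect-once pattern using a multiprocessing pool initializer (array layout, dtypes, and names are illustrative, not the actual SPARCSpy code):

```python
import numpy as np
from multiprocessing import Pool

TMP_PATH = "extraction_tmp.dat"   # hypothetical output temp array
SHAPE = (683, 2, 128, 128)        # cells, channels, y, x

_out = None  # per-process handle, connected exactly once

def _init_worker(path, shape):
    """Connect to the memory-mapped temp array once when the worker starts."""
    global _out
    _out = np.memmap(path, dtype=np.float32, mode="r+", shape=shape)

def _extract_cell(args):
    index, cell_id = args
    # stand-in for the actual single-cell extraction; the point is that it
    # writes into the already-open memmap instead of reconnecting per call
    _out[index] = np.float32(cell_id)
    return cell_id

if __name__ == "__main__":
    # create the shared temp array up front
    np.memmap(TMP_PATH, dtype=np.float32, mode="w+", shape=SHAPE).flush()
    jobs = list(enumerate(range(1, SHAPE[0] + 1)))
    with Pool(processes=4, initializer=_init_worker, initargs=(TMP_PATH, SHAPE)) as pool:
        pool.map(_extract_cell, jobs)
```

Single-threaded execution follows the same idea: connect to the temp array once before the per-cell loop and reuse the handle for every save.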
The classification workflow is implemented as of commit 52e3ccf. This also entailed some major remodelling of the classification workflow, addressing issue #22. The classification results are written to tables in the sdata object as well as to CSV files; the latter functionality can be deprecated once we have worked with the sdata tables more.
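A sketch of how such a classification table could be attached to the sdata object, with column and element names that are assumptions rather than the actual SPARCSpy schema:

```python
import numpy as np
import pandas as pd
from anndata import AnnData
from spatialdata.models import TableModel

# hypothetical classification output: one score per segmented cell
cell_ids = np.arange(1, 684)
scores = np.random.default_rng(0).random((cell_ids.size, 1)).astype(np.float32)

obs = pd.DataFrame(
    {
        "region": pd.Categorical(["segmentation_mask"] * cell_ids.size),
        "cell_id": cell_ids,
    },
    index=cell_ids.astype(str),
)
table = TableModel.parse(
    AnnData(X=scores, obs=obs),
    region="segmentation_mask",  # name of the labels element the table annotates
    region_key="region",
    instance_key="cell_id",
)
# sdata.tables["classification"] = table          # attach to the project's sdata object
obs.assign(score=scores[:, 0]).to_csv("classification.csv", index=False)  # CSV copy
```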
The aim is to use SpatialData as the data storage backend underlying the SPARCS workflow. This allows us to interface with many existing software solutions and build on existing software frameworks.
Needs
Data Saving: we need an sdata object for each SPARCSpy run which contains all the data associated with that project (a minimal sketch of such an object is shown after this list). This data includes:
Data visualisation:
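For the data-saving need, a minimal sketch of what such a per-project sdata object could bundle; the exact contents and element names are assumptions:

```python
import numpy as np
from spatialdata import SpatialData
from spatialdata.models import Image2DModel, Labels2DModel

rng = np.random.default_rng(0)

# assumed per-project contents: input image plus segmentation masks
image = Image2DModel.parse(
    rng.random((2, 2000, 2000), dtype=np.float32), dims=("c", "y", "x")
)
nucleus_mask = Labels2DModel.parse(np.zeros((2000, 2000), dtype=np.uint32))
cytosol_mask = Labels2DModel.parse(np.zeros((2000, 2000), dtype=np.uint32))

sdata = SpatialData(
    images={"input_image": image},
    labels={"nucleus_mask": nucleus_mask, "cytosol_mask": cytosol_mask},
    # classification/extraction results would be added as tables annotating the masks
)
sdata.write("sparcspy_project.zarr")  # one zarr store per SPARCSpy run
```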
To Dos
Classic Project structure
Batched Project structure
Issues that need to be addressed
Questions that remain to be addressed
How can we save images so that they are displayed in napari as individual channels?
How can we link one adata table to several annotations (cytosol and nuclei)? → Each segmentation layer should receive its own table; technically, the classification results are not necessarily the same for both layers.
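One possible (unverified) approach to the first question, and a note on the second, sketched with illustrative names:

```python
import numpy as np
from spatialdata.models import Image2DModel

# possible approach to the channel question: give the image channels explicit
# names via c_coords when parsing; whether napari shows them as separate layers
# depends on the napari-spatialdata viewer being used
rng = np.random.default_rng(0)
image = Image2DModel.parse(
    rng.random((2, 2000, 2000), dtype=np.float32),
    dims=("c", "y", "x"),
    c_coords=["nucleus", "cytosol"],
)

# for the table question: build one table per segmentation layer with the
# TableModel pattern sketched earlier, e.g. region="nucleus_mask" for the
# nuclei table and region="cytosol_mask" for the cytosol table, since the
# per-cell classification results can differ between the two layers
```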