
utilize spatialdata format to save sparcspy results #40

Open · 4 of 11 tasks
sophiamaedler opened this issue Jul 14, 2024 · 5 comments
sophiamaedler commented Jul 14, 2024

The aim is to use SpatialData as the data storage backend underlying the SPARCS workflow. This allows us to interface with many existing software solutions and build on existing software frameworks.

Needs

  1. Data saving: we need an sdata object for each SPARCSpy run which contains all the data associated with that project. This data includes:
     1. input images:
        1. multi-channel (ideally multi-scale) images that are the input for all processing (CXY)
        2. multi-channel multi-region images that are the input for all processing (NCXY)
     2. segmentation masks:
        1. nuclear mask
        2. cytosolic mask
     3. classification results:
        1. per-cell classification results that match the cell-ids listed in the segmentation masks → tabular views
     4. calibration crosses: points per slide that can be used for LMD excision
     5. potential additional information:
        1. centers of each cell linked to its cell-id
        2. regions of the slide for subdividing subsequent analysis (e.g. for annotating specific slide regions)
  2. Data visualisation:

     1. overlay different channels with one another in different colors
     2. overlay segmentation masks with input channels
     3. color segmentation masks according to properties in the classification results

To Dos

Classic Project structure

  • write a spatialdata writer for SPARCStools to directly translate stitched images into sdata format
  • write a function to save segmentation masks to sdata object
  • adapt extraction workflow to read results from sdata object
  • adapt classification workflow to write results to sdata object
  • adapt selection workflow to be able to read coordinates from sdata object and segmentation masks from sdata object

Batched Project structure

  • write a function to save multi-stacked images to sdata
  • write a function to save multi-stacked segmentation masks to sdata object
  • adapt extraction workflow to read from multi-stacked images
  • adapt classification workflow to write results to multi-stacked sdata

Issues that need to be addressed

Questions that remain to be addressed

  • how can we save images so that they are displayed in napari as individual channels
  • how can we link 1 adata table to several annotations (cytosol and nuclei) → each segmentation layer should receive its own table. Technically the classification results are not necessarily the same for both layers
  • if a segmentation is updated after it has been mapped to file, are the results automatically saved or do they need to be written explicitly? → not automatically saved, and only the modified scale is updated → this needs to be circumvented so that masks are updated in memory-mapped arrays instead of the sdata object itself → implement an sdata file handler to solve this issue
  • how does visualisation of multi-stacks work? is this a realistic way to approach batched projects?
@sophiamaedler

spatialdata writers for SPARCStools implemented in MannLabs/SPARCStools@13301fa

sophiamaedler commented Jul 22, 2024

segmentation workflow is functional for sharded and regular segmentations as of this commit: 523f142

Small benchmarking comparison on the same dataset (relatively small 2000x2000 px)

| | Sharded | Regular |
| --- | --- | --- |
| sparcsspatial | 38.6 s ± 76.6 ms | 30.4 s ± 313 ms |
| SPARCSpy | 27.3 s ± 1.29 s | 29.7 s ± 570 ms |

Sharding generates some overhead in the new sparcsspatial implementation compared to non-sharded runs, but is required for larger-than-memory computations.

This overhead seems to be larger than in the original SPARCSpy implementation. This probably results from a suboptimal implementation of the sharding resolution currently found in the working version.

We first generate a memory-mapped temp array into which all results are aggregated, and then transfer this array to the sdata object. This workaround was implemented because I did not find a way to update a mask in the sdata object in iterative steps while keeping it backed on disk. Maybe a solution exists that I have not found yet that would let us skip this additional step and make the process more efficient.
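The aggregation step described above can be sketched as follows (shard layout, sizes, and file names are illustrative, using a plain NumPy memmap):

```python
import os
import tempfile

import numpy as np

# full mask shape and a 2x2 grid of shards (illustrative sizes)
full_shape = (100, 100)
shard_size = 50

# memory-mapped temp array that all shard results are aggregated into
tmp_path = os.path.join(tempfile.mkdtemp(), "seg_mask.mmap")
agg = np.memmap(tmp_path, dtype=np.uint32, mode="w+", shape=full_shape)

for i in range(2):
    for j in range(2):
        # stand-in for the resolved segmentation result of one shard
        shard_result = np.full((shard_size, shard_size), i * 2 + j + 1, dtype=np.uint32)
        agg[i * shard_size:(i + 1) * shard_size, j * shard_size:(j + 1) * shard_size] = shard_result

agg.flush()  # make sure everything is on disk before the final transfer

# final step: hand the completed array over to the sdata object in one write,
# e.g. Labels2DModel.parse(np.asarray(agg)) followed by sdata.write_element(...)
```

The cost of this detour is the extra full pass over the data at the end, which is exactly the inefficiency mentioned above.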

sophiamaedler commented Jul 24, 2024

extraction workflow is functional for single-threaded and multi-threaded processes as of this commit: dd3ee0f

Small Benchmarking comparison on the same dataset (relatively small 2000x2000 px, total of 683 unique cells to extract)

| Method | Threads = 1 | Threads = 2 | Threads = 3 |
| --- | --- | --- | --- |
| sparcsspatial | 29.4 s ± 2.48 s per loop | 19.8 s ± 1.89 s per loop | 15.3 s ± 1.63 s per loop |
| sparcspy | 3.31 s ± 2.02 s per loop | 1.63 s ± 636 ms per loop | 2.28 s ± 2.03 s per loop |

Performance with spatialdata seems to be much lower than with sparcspy. This is most likely because lower read speeds are achieved from the sdata object than from HDF5.

The chosen example is also not ideal for benchmarking the performance of multiple threads, as not enough cells are being extracted to offset the overhead of instantiating multiple workers.

By changing the chunking of the segmentation masks to match the default chunking set up in SPARCSpy, the extraction time dropped from 29.4 s ± 2.48 s per loop to 18.5 s ± 811 ms per loop. While this is still slower than SPARCSpy, it gives us a starting point for addressing this issue.
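The rechunking itself is a one-liner on the dask array backing a labels element; a sketch with an illustrative target chunk size (not necessarily the actual SPARCSpy default):

```python
import dask.array as da
import numpy as np

# a segmentation mask chunked in small (50, 50) blocks, as initially written
mask = da.from_array(np.zeros((2000, 2000), dtype=np.uint32), chunks=(50, 50))

# rechunk to larger blocks mirroring SPARCSpy's default chunking
# (the (1000, 1000) value here is illustrative)
mask = mask.rechunk((1000, 1000))
```

Larger chunks mean fewer zarr objects to open per extracted window, which is where the read-time improvement comes from.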

Benchmarking of reading/writing speed to sdata vs H5PY

I developed a script which mimics the extraction workflow setup but is initialised with random data and uses exactly identical code paths, to better compare the performance of h5py and sdata as backends for saving prior results.

Used code:

```python
import time

import h5py
import numpy as np
import spatialdata as spdata
from tqdm.auto import tqdm

chunk_sizes = [50, 100, 1000, 5000, 10000]

# initialize data structures for saving timing results
times_reading_hdf5 = {chunk_size: [] for chunk_size in chunk_sizes}
times_reading_sdata = {chunk_size: [] for chunk_size in chunk_sizes}
times_write_hdf5 = {chunk_size: [] for chunk_size in chunk_sizes}
times_write_sdata = {chunk_size: [] for chunk_size in chunk_sizes}

for chunk_size in tqdm(chunk_sizes):
    for iteration in tqdm(range(3)):
        image = np.random.randint(0, np.iinfo(np.uint16).max, (3, 10000, 10000), dtype=np.uint16)
        mask = np.random.randint(0, np.iinfo(np.uint32).max, (2, 10000, 10000), dtype=np.uint32)

        # HDF5 write
        start = time.time()
        with h5py.File("test.h5", "w") as hf:
            hf.create_dataset("input_image", data=image, chunks=(1, chunk_size, chunk_size))
            hf.create_dataset("seg_mask", data=mask, chunks=(1, chunk_size, chunk_size))
        times_write_hdf5[chunk_size].append(time.time() - start)

        # sdata write
        start = time.time()
        sdata = spdata.SpatialData()
        sdata.write("test.sdata", overwrite=True)

        input_image = spdata.models.Image2DModel.parse(
            image,
            dims=["c", "y", "x"],
            transformations={"global": spdata.transformations.transformations.Identity()},
            chunks=(1, chunk_size, chunk_size),  # fixed: was hardcoded to (1, 50, 50)
        )
        input_image.data = input_image.data.rechunk((1, chunk_size, chunk_size))
        sdata.images["input_image"] = input_image
        sdata.write_element("input_image")

        # fixed: the masks were mistakenly parsed from image[0]/image[1]
        mask_0 = spdata.models.Labels2DModel.parse(
            mask[0],
            dims=["y", "x"],
            transformations={"global": spdata.transformations.transformations.Identity()},
        )
        mask_1 = spdata.models.Labels2DModel.parse(
            mask[1],
            dims=["y", "x"],
            transformations={"global": spdata.transformations.transformations.Identity()},
        )
        mask_0.data = mask_0.data.rechunk((chunk_size, chunk_size))
        mask_1.data = mask_1.data.rechunk((chunk_size, chunk_size))

        sdata.labels["nuc_seg"] = mask_0
        sdata.labels["cyto_seg"] = mask_1
        sdata.write_element("nuc_seg")
        sdata.write_element("cyto_seg")
        times_write_sdata[chunk_size].append(time.time() - start)

        # randomly generate 5000 xy coordinates that are within the image size
        coords = np.random.randint(0, 10000, (5000, 2))
        image_size = 128
        width = image_size // 2

        # simulate extraction with h5py
        start = time.time()
        with h5py.File(
            "test.h5",
            "r",
            rdcc_nbytes=5242880000,  # large chunk cache
            rdcc_w0=1,
            rdcc_nslots=50000,
        ) as hf:
            masks = hf["seg_mask"]
            images = hf["input_image"]

            image_width = images.shape[1]
            image_height = images.shape[2]

            for x, y in coords:
                window_x = slice(x - width, x + width)
                window_y = slice(y - width, y + width)

                # only extract windows that lie fully inside the image
                condition = [
                    width < x,
                    x < image_width - width,
                    width < y,
                    y < image_height - width,
                ]
                if np.all(condition):
                    mask_window = masks[:2, window_x, window_y]
                    image_window = images[:, window_x, window_y]

                    # do something with the mask and image
                    mask_window = mask_window == 10
                    image_window = image_window * mask_window[0]

        times_reading_hdf5[chunk_size].append(time.time() - start)

        # simulate extraction with sdata (note: no bounds check here, unlike the h5py loop)
        start = time.time()
        for x, y in coords:
            window_x = slice(x - width, x + width)
            window_y = slice(y - width, y + width)

            mask1 = sdata.labels["nuc_seg"].data[window_x, window_y]
            mask2 = sdata.labels["cyto_seg"].data[window_x, window_y]
            image_window = sdata.images["input_image"].data[:, window_x, window_y]

            mask1 = mask1 == 10
            mask2 = mask2 == 10
            image_window = image_window * mask1

        times_reading_sdata[chunk_size].append(time.time() - start)

# collect the timing results into a DataFrame
import pandas as pd

data = {
    "chunk": [],
    "times_read_h5py": [],
    "times_read_sdata": [],
    "times_write_h5py": [],
    "times_write_sdata": [],
}
for chunk in times_reading_hdf5:
    for i in range(len(times_reading_hdf5[chunk])):
        data["chunk"].append(chunk)
        data["times_read_h5py"].append(times_reading_hdf5[chunk][i])
        data["times_read_sdata"].append(times_reading_sdata[chunk][i])
        data["times_write_h5py"].append(times_write_hdf5[chunk][i])
        data["times_write_sdata"].append(times_write_sdata[chunk][i])

full_results = pd.DataFrame(data)

results_reading = full_results[["chunk", "times_read_sdata", "times_read_h5py"]].groupby("chunk").agg(["mean", "std"])
results_writing = full_results[["chunk", "times_write_sdata", "times_write_h5py"]].groupby("chunk").agg(["mean", "std"])

# plot the read times
import matplotlib.pyplot as plt

df = results_reading
plt.figure(figsize=(10, 6))
plt.errorbar(df.index, df[("times_read_sdata", "mean")], yerr=df[("times_read_sdata", "std")],
             label="SDATA Read Time", fmt="-o", capsize=5)
plt.errorbar(df.index, df[("times_read_h5py", "mean")], yerr=df[("times_read_h5py", "std")],
             label="H5PY Read Time", fmt="-o", capsize=5)
plt.xlabel("Chunk Size")
plt.ylabel("Time (s)")
plt.title("Read Times with Error Bars")
plt.legend()
plt.xscale("log")  # chunk sizes span several orders of magnitude
plt.show()

# plot the write times
df = results_writing
plt.figure(figsize=(10, 6))
plt.errorbar(df.index, df[("times_write_sdata", "mean")], yerr=df[("times_write_sdata", "std")],
             label="SDATA Write Time", fmt="-o", capsize=5)
plt.errorbar(df.index, df[("times_write_h5py", "mean")], yerr=df[("times_write_h5py", "std")],
             label="H5PY Write Time", fmt="-o", capsize=5)
plt.xlabel("Chunk Size")
plt.ylabel("Time (s)")
plt.title("Write Times with Error Bars")
plt.legend()
plt.xscale("log")
plt.show()
```

Executing this code and tracking the resulting computation times gives the following results:

| chunk | times_read_sdata_mean | times_read_sdata_std | times_read_h5py_mean | times_read_h5py_std |
| --- | --- | --- | --- | --- |
| 50 | 6.031215 | 0.072047 | 1.042576 | 0.068853 |
| 100 | 6.751713 | 0.176368 | 0.605539 | 0.042001 |
| 1000 | 6.562206 | 0.083729 | 0.526818 | 0.017236 |
| 5000 | 6.800794 | 0.292802 | 0.533734 | 0.030269 |
| 10000 | 6.686381 | 0.112755 | 0.540749 | 0.034289 |

| chunk | times_write_sdata_mean | times_write_sdata_std | times_write_h5py_mean | times_write_h5py_std |
| --- | --- | --- | --- | --- |
| 50 | 66.940983 | 5.739447 | 1.243855 | 0.116332 |
| 100 | 50.373895 | 1.717572 | 0.620364 | 0.021012 |
| 1000 | 45.023347 | 0.529781 | 0.381684 | 0.010140 |
| 5000 | 44.070088 | 0.114096 | 0.394537 | 0.030758 |
| 10000 | 45.046422 | 0.561808 | 0.380628 | 0.014421 |

*(figures: reading_times, writing_times)*

sophiamaedler commented Jul 24, 2024

Potential solution idea for the slow read times from sdata: map the relevant sdata elements to memory-mapped temp arrays and perform the extraction from those.

Initial implementation in commit: bb16cdd

Updated benchmarking with the new method:

| Method | Threads = 1 | Threads = 2 | Threads = 3 |
| --- | --- | --- | --- |
| sparcsspatial | 29.4 s ± 2.48 s per loop | 19.8 s ± 1.89 s per loop | 15.3 s ± 1.63 s per loop |
| sparcsspatial (mmap) | 12 s ± 1.78 s per loop | 12.4 s ± 1.86 s per loop | 10.2 s ± 307 ms per loop |
| sparcspy | 3.31 s ± 2.02 s per loop | 1.63 s ± 636 ms per loop | 2.28 s ± 2.03 s per loop |

The use of memory-mapped temp arrays seems to speed up the actual extraction process but generates some overhead from mapping the images/masks to disk beforehand. To better understand this tradeoff, a proper benchmark needs to quantify the following:

  1. time to load to memory mapped array for various image sizes
  2. extraction time per cell (compared across the three methods)

Update

The significant difference between sparcspy and the sparcsspatial implementation resulted from no longer keeping the memory-mapped temp arrays for intermediate results assigned as global variables, and instead reconnecting to these arrays in each save call, which generates a lot of time overhead. Reimplementing the workflow to connect to the memory-mapped temp arrays only once per process (for single-threaded runs once during extraction, for multi-threaded runs once per thread call) significantly boosted the extraction times. This fix was then also transferred to the SPARCSspatial implementation and the two were benchmarked against each other.

| Software | Version |
| --- | --- |
| SPARCSpy | b289ec7 |
| SPARCSpy_new_extraction_workflow | cd70c8f |
| SPARCSpy_new_extraction_workflow_improved | b37f2fd |
| SPARCSspatial | a59c1a4 |

*(figures: total_extraction_time, rate_of_extraction_time)*

Replicates of 6 independent runs on the same input dataset with identical segmentation masks.

While the SPARCSspatial implementation is still somewhat less efficient than the base SPARCSpy implementation, this is an acceptable result with which we can continue working.

This analysis should be replicated with larger datasets to better estimate the result of multithreading on processing time.

sophiamaedler commented Jul 28, 2024

classification workflow implemented as of commit 52e3ccf

This also entailed some major remodelling of the classification workflow, addressing issue #22.

The classification results are written to tables in the sdata object as well as to CSV files. The latter functionality can be deprecated once we have worked with the sdata tables more.
