
Save nexus v1 #108

Merged: 7 commits merged into main, Nov 24, 2023
Conversation

bmaranville (Contributor)

This version of the PR is separate from #107, and implements a NeXus save/load with the current data order.

I added some tests as suggested by @andyfaff and it roundtrips the data structures successfully.

We can probably delete the previous PR #97, which also implemented the change of data order; as @aglavic pointed out, that is a breaking change and we've already released v1.0, so we need to come to an agreement on that issue separately.

andyfaff closed this Jun 21, 2023
andyfaff reopened this Jun 21, 2023
aglavic (Collaborator) commented Jun 29, 2023

Is there a functional requirement for h5py>=3.7 in this code? Otherwise I'd change it to 3.1 so that Python 3.6 can install it.

bmaranville (Contributor, Author)

I think the version requirement is to support track_order on the root group, so that the datasets are loaded in the order they were saved (see h5py/h5py#1097).

Do we care if the order of the datasets is preserved? I can't remember if keeping order is part of the spec.
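For context, a hedged sketch of the workaround referenced in that h5py issue (not necessarily the exact code in this PR; the filename is illustrative): in affected h5py versions the root group ignores track_order=True passed to h5py.File, so the global config flag is toggled around file creation instead.

import h5py

# Workaround from h5py/h5py#1097: set the global flag so the root group
# records insertion order, then restore it after the file is created.
config = h5py.get_config()
config.track_order = True
try:
    f = h5py.File("output.orb", "w")
finally:
    config.track_order = False

# subgroups accept the keyword directly
grp = f.create_group("entry1", track_order=True)
f.close()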

bmaranville (Contributor, Author)

Actually it looks like the workaround described in the h5py issue above (which we are already using) works fine back to h5py=3.1 and python=3.6 (I just tried out that combination).

So we can remove that version requirement for h5py.

aglavic (Collaborator) commented Jul 3, 2023

> Do we care if the order of the datasets is preserved? I can't remember if keeping order is part of the spec.

I don't think that was ever specified. In the text file the order is always preserved, and I don't see the need in the NeXus file, as all headers are present in each dataset. It might be nice if you want to preserve plotting order, though.

andyfaff (Contributor)

What does everyone think the merge process for this looks like? There may be a desire for a bit of churn in this kind of file as we experiment and learn.

Can we make backwards incompatible changes after we've already made a release, if we think we've improved the state of the code? I'm happy for there to be churn and for those backcompat breaks to exist.

I'm interested in starting to use an HDF representation because I've got some data that needs a detailed resolution kernel, and I'd like to investigate using .orb to do this.

bmaranville (Contributor, Author)

A major breaking change from this that I can foresee is the proposed re-ordering of the dataset arrays on the orsopy side (the representation in the .ort and .orb files would be unchanged), as described in #107, which specifically addresses your use case of multidimensional "columns".

By the end of the comments on that PR I think there was at least tentative agreement on a way forward (use lists of "columns" to hold the data instead of a 2-d array with column as the 2nd index in the orsopy structure).

I agree that we might need to go through a bit of churn at the outset in order to get stable support for the HDF5 structure in the long term.

andyfaff (Contributor)

I read through #107, and I'd like to understand more clearly: what do we mean by "lists of columns"? From an HDF point of view it makes little sense to store a 2D array; that's a limitation of ASCII.

For .orb does "lists of columns" mean creating a new dataset for each of the 'columns'?

e.g. something along the lines of:

import h5py
import numpy as np

fname = "example.orb"  # any output path

Qz = np.linspace(0.01, 0.5, 101)
R = np.random.random(101)
dR = np.random.random(101)
# dQz is multidimensional, and is the shape I'd like to reconstitute
dQz = np.random.random((101, 2, 51))

with h5py.File(fname, "w") as f:
    f.create_group("entry1/data")
    f.create_dataset("entry1/data/Qz", data=Qz)
    f.create_dataset("entry1/data/R", data=R)
    f.create_dataset("entry1/data/dR", data=dR)
    f.create_dataset("entry1/data/dQz", data=dQz)

With the 'column' names then listed appropriately (a la NeXus/NXcanSAS) in attributes (which could also specify the order in which they're supposed to be loaded)?

I guess there is a further issue: how those arrays are made available through OrsoDataset.data. At the moment OrsoDataset.data is supposed to be (npnts, ncols), and a 2-D array is not sensible if one wants to store multidimensional data. Could we change the data attribute to be Union[dict, np.ndarray]? If it's a dict, the keys are the column names and the values are the (multidimensional) arrays; if it's an np.ndarray, it's what we have now. The dict would be writeable to .orb, and a subset of it to .ort; the np.ndarray would be writeable to either.
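For illustration, a minimal sketch of the dict variant proposed here (purely hypothetical; the column names follow the earlier example):

import numpy as np

# hypothetical dict form of OrsoDataset.data: keys are column names,
# values are (possibly multidimensional) arrays sharing a first dimension
data = {
    "Qz": np.linspace(0.01, 0.5, 101),
    "R": np.random.random(101),
    "dR": np.random.random(101),
    "dQz": np.random.random((101, 2, 51)),
}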

bmaranville (Contributor, Author)

There are two considerations: how the data is stored in the file formats (.ort and .orb), and how the data is represented in the orsopy Python objects. The proposal in #107 changes only the Python objects. In the .orb files, as you suggest above, I believe it makes sense to create a separate HDF5 dataset (not to be confused with an orsopy OrsoDataset) for each column in the data, and I don't believe that to be controversial.

For the Python objects: instead of storing a numpy ndarray of shape (npnts, ncols) in OrsoDataset.data, what is suggested in #107 is to store a Python list of numpy ndarray objects (one per "column"). A minimal validation is done to ensure that the first dimension has the same length for each object. In the simple case where all arrays are one-dimensional, a single combined 2-d numpy.ndarray could also be accepted, though it would make sense to use a shape of (ncols, npnts) instead, so that the order of dimensions is the same for List[ndarray] and a simple 2-d ndarray. Support for this is trivial when the list of column data is unpacked with (for data in columndata: ...), and is included in #107.

So what is proposed is that the type of OrsoDataset.data be Union[List[numpy.ndarray], numpy.ndarray] (where the second form should have 2 dimensions). We might want to have just one type for the stored data, but automatically convert an array input to a list of arrays in __init__.
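A rough sketch of that proposal (illustrative only, not the actual #107 code; the class name is a hypothetical stand-in for OrsoDataset):

from dataclasses import dataclass
from typing import List, Union

import numpy as np

@dataclass
class OrsoDatasetSketch:
    data: Union[List[np.ndarray], np.ndarray]

    def __post_init__(self):
        if isinstance(self.data, np.ndarray):
            # a plain 2-d array is read as (ncols, npnts), so iterating over
            # it yields one 1-d array per column, same as the list form
            if self.data.ndim != 2:
                raise ValueError("plain-array input must be 2-dimensional")
            self.data = list(self.data)
        # minimal validation: every column shares the same first dimension
        if len({column.shape[0] for column in self.data}) > 1:
            raise ValueError("all columns must have the same number of points")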

For .ort files, we could support multidimensional columns by adding new metadata to the column description (e.g. name="angular_resolution" index=[1]), but this is not implemented yet.

bmaranville (Contributor, Author)

I think I'd also be OK with a dict as you suggested, but currently the column metadata is stored as a list, and I was trying to make as small a change as possible. During the write process to HDF5 it is converted to exactly the dict you describe, but the "sequence_index" is also added as an attribute so it can be reconstructed into a list on read.
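A hedged sketch of that round trip (the sequence_index attribute name comes from the comment above; file and column names, and the placeholder data, are illustrative):

import h5py
import numpy as np

columns = ["Qz", "R", "dR"]  # column metadata kept as an ordered list

with h5py.File("roundtrip.orb", "w") as f:
    grp = f.create_group("entry1/data")
    for i, name in enumerate(columns):
        ds = grp.create_dataset(name, data=np.zeros(5))
        ds.attrs["sequence_index"] = i  # lets the list order be rebuilt on read

# on read, sort by the stored index to reconstruct the original list
with h5py.File("roundtrip.orb", "r") as f:
    grp = f["entry1/data"]
    restored = sorted(grp.keys(), key=lambda k: grp[k].attrs["sequence_index"])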

andyfaff (Contributor)

Thanks for that clear explanation. I get what's being proposed now. List of columns, yeah let's go.

What's the roadmap to going forwards then? Is it merge #107 then this?

andyfaff (Contributor)

@bmaranville @aglavic do you think we can merge this PR? Or are there issues that need to be addressed before it gets merged?
Does this PR get merged simultaneously with #107? I can commit to review that.

bmaranville (Contributor, Author)

We can do this one first, if you like. The changes for #107 are pretty small and can be done after.

aglavic (Collaborator) commented Aug 11, 2023

> Can we make backwards incompatible changes after we've already made a release, if we think we've improved the state of the code? I'm happy for there to be churn and for those backcompat breaks to exist.

My take on this is that we should merge this as soon as it's ready (maybe now) and make a minor revision. I don't see an issue with breaking changes if they only concern the binary representation, as we should not release that officially yet. My assumption is that 90% of people implementing the binary standard within the next couple of years are within this project, so a breaking change could be fixed quite easily.

What I would try to avoid is a change that breaks the text format or the reading of important header information through orsopy (e.g. changing keys or orsopy main class interface attributes). But I don't remember any of the conversations pointing towards such an issue.

aglavic (Collaborator) left a review comment

Minor changes you might want to take a look at, but they can also be done in the next iteration.

return v


def load_nexus(fname: Union[str, BinaryIO]) -> List[OrsoDataset]:
aglavic (Collaborator) commented:

I think it would be advantageous to have a single load_orso function that chooses which loader to use automatically. That would allow external software to be agnostic about the file type that's used.
We can do this in the next release, though.

Suggested way of implementation:

  1. load_orso scans through a list of known data format classes
  2. Each format class has a static method to check if it supports the file in question (e.g. first header line for .ort, HDF header for .orb)
  3. If it's supported, the function uses the format class to retrieve the Orso objects from the file
  4. Potentially we could store the format used to unpack in the Orso class
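A minimal sketch of that dispatch, under stated assumptions: the class and function names (OrtFormat, OrbFormat, KNOWN_FORMATS, load_any, supports) are illustrative, the import path for the loaders is a guess, the .ort header prefix is assumed, and the HDF5 magic bytes are the standard file signature.

from orsopy.fileio import orso  # assumed module path for load_orso/load_nexus

HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"  # standard HDF5 file signature

class OrtFormat:
    @staticmethod
    def supports(fname):
        with open(fname, "rb") as f:
            # assumption: .ort files start with an ASCII "# # ORSO" header line
            return f.readline().startswith(b"# # ORSO")

    @staticmethod
    def load(fname):
        return orso.load_orso(fname)

class OrbFormat:
    @staticmethod
    def supports(fname):
        with open(fname, "rb") as f:
            return f.read(8) == HDF5_MAGIC

    @staticmethod
    def load(fname):
        return orso.load_nexus(fname)  # the loader added in this PR

KNOWN_FORMATS = [OrtFormat, OrbFormat]

def load_any(fname):
    # delegate to the first format class that recognizes the file
    for fmt in KNOWN_FORMATS:
        if fmt.supports(fname):
            return fmt.load(fname)
    raise ValueError(f"no known ORSO format matches {fname!r}")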

def orsodataclass(cls: type):
    ORSO_DATACLASSES[cls.__name__] = cls
aglavic (Collaborator) commented:

I think it would be cleaner to use built-in Python functionality instead of keeping track of our subclasses ourselves. All header classes are derived from Header, so you could replace:

if value.__class__ in ORSO_DATACLASSES.values():

with

if isinstance(value, Header):

and

cls = ORSO_DATACLASSES[ORSO_class]

with

orso_classes = dict((c.__name__, c) for c in Header.__subclasses__())
...
  cls = orso_classes[ORSO_class]

(I think the first call misses sub-subclasses, so it's probably easier to implement it as a recursive method on the Header class.)

In this case we are also safe with people subclassing the orso Header without using a decorator (e.g. adding functionality, not attributes).
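A minimal sketch of such a recursive lookup (the method name all_subclasses is illustrative, not existing orsopy API):

class Header:
    @classmethod
    def all_subclasses(cls):
        # recursively collect every subclass, including sub-subclasses
        found = {}
        for sub in cls.__subclasses__():
            found[sub.__name__] = sub
            found.update(sub.all_subclasses())
        return found

# during deserialization, with ORSO_class holding the stored class name:
# cls = Header.all_subclasses()[ORSO_class]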

bmaranville (Contributor, Author) replied:

The subclass recursive search would miss OrsoDataset, which needs to be recognized as an ORSO class during deserialization. I think trying to make an inheritance tree that includes both Header and OrsoDataset is going to be more complicated.

andyfaff (Contributor)

@aglavic, @bmaranville I propose that we merge this as-is, then make Artur's suggested changes afterwards. It'd be good to get some movement going again.

bmaranville merged commit 50cf7b9 into main on Nov 24, 2023