
Save nexus v1 #108

Merged: 7 commits merged into main, Nov 24, 2023
Conversation

bmaranville (Contributor)

This version of the PR is separate from #107, and implements a NeXus save/load with the current data order.

I added some tests as suggested by @andyfaff and it roundtrips the data structures successfully.

We can probably delete the previous PR #97, which also implemented the change of data order; as @aglavic pointed out, that is a breaking change and we've already released v1.0, so we need to come to an agreement on that issue separately.

andyfaff closed this Jun 21, 2023
andyfaff reopened this Jun 21, 2023
aglavic (Collaborator) commented Jun 29, 2023

Is there a functional requirement for h5py>=3.7 in this code? Otherwise I'd change it to 3.1 so that Python 3.6 can install it.

bmaranville (Contributor, Author)

I think the version requirement is to support track_order on the root group, so that the datasets are loaded in the order they were saved (see h5py/h5py#1097).

Do we care if the order of the datasets is preserved? I can't remember if keeping order is part of the spec.
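For context, a hedged sketch of the workaround referenced in that h5py issue (not necessarily the exact code in this PR; the filename is illustrative): in affected h5py versions the root group ignores track_order=True passed to h5py.File, so the global config flag is toggled around file creation instead.

import h5py

# Workaround from h5py/h5py#1097: set the global flag so the root group
# records insertion order, then restore it after the file is created.
config = h5py.get_config()
config.track_order = True
try:
    f = h5py.File("output.orb", "w")
finally:
    config.track_order = False

# subgroups accept the keyword directly
grp = f.create_group("entry1", track_order=True)
f.close()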

bmaranville (Contributor, Author)

Actually it looks like the workaround described in the h5py issue above (which we are already using) works fine back to h5py=3.1 and python=3.6 (I just tried out that combination).

So we can remove that version requirement for h5py.

aglavic (Collaborator) commented Jul 3, 2023

> Do we care if the order of the datasets is preserved? I can't remember if keeping order is part of the spec.

I don't think that was ever specified. In the text file the order is always preserved, and I don't see the need in the NeXus file, as all headers are present in each dataset. It might be nice if you want to preserve plotting order, though.

andyfaff (Contributor)

What does everyone think the merge process for this looks like? There may be a desire for a bit of churn in this kind of file as we experiment and learn.

Can we make backwards incompatible changes after we've already made a release, if we think we've improved the state of the code? I'm happy for there to be churn and for those backcompat breaks to exist.

I'm interested in starting to use an HDF representation because I've got some data that needs a detailed resolution kernel, and I'd like to investigate using .orb to do this.

bmaranville (Contributor, Author)

A major breaking change from this that I can foresee is the proposed re-ordering of the dataset arrays on the orsopy side (the representation in the .ort and .orb files would be unchanged), as described in #107, which specifically addresses your use case of multidimensional "columns".

By the end of the comments on that PR I think there was at least tentative agreement on a way forward (use lists of "columns" to hold the data instead of a 2-d array with column as the 2nd index in the orsopy structure).

I agree that we might need to go through a bit of churn at the outset in order to get stable support for the HDF5 structure in the long term.

andyfaff (Contributor)

I read through #107, and I'd like to understand more clearly: what do we mean by "lists of columns"? From an HDF point of view it makes little sense to store a 2D array; that's a limitation of ASCII.

For .orb does "lists of columns" mean creating a new dataset for each of the 'columns'?

e.g. something along the lines of:

import h5py
import numpy as np

fname = "example.orb"  # any output path

Qz = np.linspace(0.01, 0.5, 101)
R = np.random.random(101)
dR = np.random.random(101)
# dQz is multidimensional, and is the shape I'd like to reconstitute
dQz = np.random.random((101, 2, 51))

with h5py.File(fname, "w") as f:
    f.create_group("entry1/data")
    f.create_dataset("entry1/data/Qz", data=Qz)
    f.create_dataset("entry1/data/R", data=R)
    f.create_dataset("entry1/data/dR", data=dR)
    f.create_dataset("entry1/data/dQz", data=dQz)

With the 'column' names then listed appropriately (a la NeXus/NXcanSAS) in attributes (which could also specify the order in which they're supposed to be loaded)?

I guess there is a further issue: how those arrays are made available through OrsoDataset.data. At the moment OrsoDataset.data is supposed to be (npnts, ncols), and a 2-D array is not sensible if one wants to store multidimensional data. Could we change the data attribute to be Union[dict, np.ndarray]? If it's a dict, the keys are the column names and the values are the (multidimensional) arrays; if it's an np.ndarray, it's what we have now. The dict would be writeable to .orb, and a subset of it to .ort; the np.ndarray would be writeable to either.
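For illustration, a minimal sketch of the dict variant proposed here (purely hypothetical; the column names follow the earlier example):

import numpy as np

# hypothetical dict form of OrsoDataset.data: keys are column names,
# values are (possibly multidimensional) arrays sharing a first dimension
data = {
    "Qz": np.linspace(0.01, 0.5, 101),
    "R": np.random.random(101),
    "dR": np.random.random(101),
    "dQz": np.random.random((101, 2, 51)),
}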

bmaranville (Contributor, Author)

There are two considerations: how the data is stored in the file formats (.ort and .orb), and how the data is represented in the orsopy Python objects. The proposal in #107 changes only the Python objects. In the .orb files, as you suggest above, I believe it makes sense to create a separate HDF5 dataset (not to be confused with an orsopy OrsoDataset) for each column in the data, and I don't believe that to be controversial.

For the Python objects: instead of storing a numpy ndarray of shape (npnts, ncols) in OrsoDataset.data, what is suggested in #107 is to store a Python list of numpy ndarray objects (one per "column"). A minimal validation is done to ensure that the first dimension has the same length for each object. In the simple case where all arrays are one-dimensional, a single combined 2-d numpy.ndarray could also be accepted, though it would make sense to use a shape of (ncols, npnts) instead, so that the order of dimensions is the same for List[ndarray] and a simple 2-d ndarray. Support for this is trivial when the list of column data is unpacked with (for data in columndata: ...), and is included in #107.

So what is proposed is that the type of OrsoDataset.data be Union[List[numpy.ndarray], numpy.ndarray] (where the second form should have 2 dimensions). We might want to have just one type for the stored data, but automatically convert an array input to a list of arrays in __init__.
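A rough sketch of that proposal (illustrative only, not the actual #107 code; the class name is a hypothetical stand-in for OrsoDataset):

from dataclasses import dataclass
from typing import List, Union

import numpy as np

@dataclass
class OrsoDatasetSketch:
    data: Union[List[np.ndarray], np.ndarray]

    def __post_init__(self):
        if isinstance(self.data, np.ndarray):
            # a plain 2-d array is read as (ncols, npnts), so iterating over
            # it yields one 1-d array per column, same as the list form
            if self.data.ndim != 2:
                raise ValueError("plain-array input must be 2-dimensional")
            self.data = list(self.data)
        # minimal validation: every column shares the same first dimension
        if len({column.shape[0] for column in self.data}) > 1:
            raise ValueError("all columns must have the same number of points")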

For .ort files, we could support multidimensional columns by adding new metadata to the column description (e.g. name="angular_resolution" index=[1]), but this is not implemented yet.

bmaranville (Contributor, Author)

I think I'd also be OK with a dict as you suggested, but currently the column metadata is stored as a list, and I was trying to make as small a change as possible. During the write process to HDF5 it is converted to exactly the dict you describe, but the "sequence_index" is also added as an attribute so it can be reconstructed into a list on read.
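A hedged sketch of that round trip (the sequence_index attribute name comes from the comment above; file and column names, and the placeholder data, are illustrative):

import h5py
import numpy as np

columns = ["Qz", "R", "dR"]  # column metadata kept as an ordered list

with h5py.File("roundtrip.orb", "w") as f:
    grp = f.create_group("entry1/data")
    for i, name in enumerate(columns):
        ds = grp.create_dataset(name, data=np.zeros(5))
        ds.attrs["sequence_index"] = i  # lets the list order be rebuilt on read

# on read, sort by the stored index to reconstruct the original list
with h5py.File("roundtrip.orb", "r") as f:
    grp = f["entry1/data"]
    restored = sorted(grp.keys(), key=lambda k: grp[k].attrs["sequence_index"])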

andyfaff (Contributor)

Thanks for that clear explanation. I get what's being proposed now. List of columns, yeah let's go.

What's the roadmap to going forwards then? Is it merge #107 then this?

andyfaff (Contributor)

@bmaranville @aglavic do you think we can merge this PR? Or are there issues that need to be addressed before it gets merged?
Does this PR get merged simultaneously with #107? I can commit to review that.

bmaranville (Contributor, Author)

We can do this one first, if you like. The changes for #107 are pretty small and can be done after.

aglavic (Collaborator) commented Aug 11, 2023

> Can we make backwards incompatible changes after we've already made a release, if we think we've improved the state of the code? I'm happy for there to be churn and for those backcompat breaks to exist.

My take on this is that we should merge this as soon as it's ready (maybe now) and make a minor revision. I don't see an issue with breaking changes if they only concern the binary representation, as we should not release that officially yet. My assumption is that 90% of people implementing the binary standard within the next couple of years are within this project, so a breaking change could be fixed quite easily.

What I would try to avoid is a change that breaks the text format or the reading of important header information through orsopy (e.g. changing keys or orsopy main class interface attributes). But I don't remember any of the conversations pointing towards such an issue.

aglavic (Collaborator) left a review comment

Minor changes you might want to take a look at, but they can also be done in the next iteration.

return v


def load_nexus(fname: Union[str, BinaryIO]) -> List[OrsoDataset]:
aglavic (Collaborator) commented:

I think it would be advantageous to have a single load_orso function that chooses which loader to use automatically. That would allow external software to be agnostic about the file type that's used.
We can do this in the next release, though.

Suggested way of implementation:

  1. load_orso scans through a list of known data format classes
  2. Each format class has a static method to check if it supports the file in question (e.g. first header line for .ort, HDF header for .orb)
  3. If it's supported, the function uses the format class to retrieve the Orso objects from the file
  4. Potentially we could store the format used to unpack in the Orso class
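A minimal sketch of that dispatch, under stated assumptions: the class and function names (OrtFormat, OrbFormat, KNOWN_FORMATS, load_any, supports) are illustrative, the import path for the loaders is a guess, the .ort header prefix is assumed, and the HDF5 magic bytes are the standard file signature.

from orsopy.fileio import orso  # assumed module path for load_orso/load_nexus

HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"  # standard HDF5 file signature

class OrtFormat:
    @staticmethod
    def supports(fname):
        with open(fname, "rb") as f:
            # assumption: .ort files start with an ASCII "# # ORSO" header line
            return f.readline().startswith(b"# # ORSO")

    @staticmethod
    def load(fname):
        return orso.load_orso(fname)

class OrbFormat:
    @staticmethod
    def supports(fname):
        with open(fname, "rb") as f:
            return f.read(8) == HDF5_MAGIC

    @staticmethod
    def load(fname):
        return orso.load_nexus(fname)  # the loader added in this PR

KNOWN_FORMATS = [OrtFormat, OrbFormat]

def load_any(fname):
    # delegate to the first format class that recognizes the file
    for fmt in KNOWN_FORMATS:
        if fmt.supports(fname):
            return fmt.load(fname)
    raise ValueError(f"no known ORSO format matches {fname!r}")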

def orsodataclass(cls: type):
    ORSO_DATACLASSES[cls.__name__] = cls
aglavic (Collaborator) commented:

I think it would be cleaner to use built-in Python functionality instead of keeping track of our subclasses ourselves. All header classes are derived from Header, so you could replace:

if value.__class__ in ORSO_DATACLASSES.values():

with

if isinstance(value, Header):

and

cls = ORSO_DATACLASSES[ORSO_class]

with

orso_classes = dict((c.__name__, c) for c in Header.__subclasses__())
...
  cls = orso_classes[ORSO_class]

(I think the first call misses sub-subclasses, so it's probably easier to implement it as a recursive method on the Header class.)

In this case we are also safe with people subclassing the orso Header without using a decorator (e.g. adding functionality, not attributes).
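A minimal sketch of such a recursive lookup (the method name all_subclasses is illustrative, not existing orsopy API):

class Header:
    @classmethod
    def all_subclasses(cls):
        # recursively collect every subclass, including sub-subclasses
        found = {}
        for sub in cls.__subclasses__():
            found[sub.__name__] = sub
            found.update(sub.all_subclasses())
        return found

# during deserialization, with ORSO_class holding the stored class name:
# cls = Header.all_subclasses()[ORSO_class]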

bmaranville (Contributor, Author) replied:

The subclass recursive search would miss OrsoDataset, which needs to be recognized as an ORSO class during deserialization. I think trying to make an inheritance tree that includes both Header and OrsoDataset is going to be more complicated.

andyfaff (Contributor)

@aglavic, @bmaranville I propose that we merge this as-is, then make Artur's suggested changes afterwards. It'd be good to get some movement going again.

bmaranville merged commit 50cf7b9 into main on Nov 24, 2023