Save nexus v1 #108
Conversation
…eady there in the header)
Is there a functional requirement to require h5py>=3.7 for the code to work? Otherwise I'd change that to 3.1 to allow python 3.6 to install it.
I think the version requirement is to support track_order on the root group, so that the datasets are loaded in the order they were saved (see h5py/h5py#1097). Do we care if the order of the datasets is preserved? I can't remember if keeping order is part of the spec.
Actually it looks like the workaround described in the h5py issue above (which we are already using) works fine back to h5py=3.1 and python=3.6 (I just tried out that combination). So we can remove that version requirement for h5py.
I don't think that was ever specified. In the text file this is always the case. I don't see the need in the NeXus file, as all headers are present in each dataset. Maybe it is nice if you want to preserve plotting order.
What does everyone think the merge process for this looks like? There may be a desire for a bit of churn in this kind of file as we experiment and learn. Can we make backwards-incompatible changes after we've already made a release, if we think we've improved the state of the code? I'm happy for there to be churn and for those backcompat breaks to exist. I'm interested in starting to use an HDF representation because I've got some data that needs a detailed resolution kernel, and I'd like to investigate using .orb to do this.
A major breaking change from this that I can foresee is a proposed re-ordering of the dataset arrays on the orsopy side (the representation in the .ort and .orb files would be unchanged), as described in #107, which specifically addresses your use case of multidimensional "columns". By the end of the comments on that PR I think there was at least tentative agreement on a way forward (use lists of "columns" to hold the data instead of a 2-d array with column as the 2nd index in the orsopy structure). I agree that we might need to go through a bit of churn at the outset in order to get stable support for the HDF5 structure in the long term.
I read through #107, and I'd like to understand more clearly. What do we mean by "lists of columns"? From an HDF point of view it makes little sense to store a 2D array; that's a limitation of ASCII. For .orb, does "lists of columns" mean creating a new dataset for each of the 'columns'? E.g. something along the lines of:

```python
import h5py
import numpy as np

fname = "example.orb"  # any output path

Qz = np.linspace(0.01, 0.5, 101)
R = np.random.random(101)
dR = np.random.random(101)
# dQz is multidimensional, and is the shape I'd like to reconstitute
dQz = np.random.random((101, 2, 51))

with h5py.File(fname, "w") as f:
    f.create_group("entry1/data")
    f.create_dataset("entry1/data/Qz", data=Qz)
    f.create_dataset("entry1/data/R", data=R)
    f.create_dataset("entry1/data/dR", data=dR)
    f.create_dataset("entry1/data/dQz", data=dQz)
```

With the 'column' names then listed appropriately (a la NeXus/NXcanSAS) in attributes (that could also specify ordering of how they're supposed to be loaded)? I guess there is a further issue: how those arrays are made available through …
There are two considerations: how the data is stored in the file formats (.ort and .orb) and how the data is represented in the orsopy python objects. The proposal in #107 changes only the python objects. In the .orb files, as you suggest above, I believe it makes sense to create a separate HDF5 dataset (not to be confused with orsopy OrsoDataset) for each column in the data, and I don't believe that to be controversial. For the python objects, instead of storing a numpy ndarray with the column as the second index, what is proposed is that OrsoDataset.data be a list of per-column arrays. For .ort files, we could support multidimensional columns by adding new metadata to the column description (e.g. name="angular_resolution" index=[1]), but this is not implemented yet.
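To make the shape change concrete, here is an illustrative sketch (variable names are my own, not orsopy's API) of why a list of per-column arrays can hold a multidimensional resolution column that a single 2-D (npoints, ncolumns) array cannot:

```python
import numpy as np

npts = 101
# current: one 2-D array with column as the 2nd index; every column must be 1-D
data_2d = np.zeros((npts, 3))
# proposed: a list of arrays, one per column; shapes may differ per column
data_cols = [
    np.linspace(0.01, 0.5, npts),       # Qz
    np.random.random(npts),             # R
    np.random.random((npts, 2, 51)),    # dQz: per-point resolution kernel
]
```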
I think I'd also be ok with a dict as you suggested, but currently column metadata is stored as a list, and I was trying to make as small a change as possible. During the write process to an HDF it is converted to exactly the dict you describe, but also the "sequence_index" is added as an attribute so it can be reconstructed into a list on read.
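That round trip can be sketched as follows (file and group names are illustrative, and this is not orsopy's actual writer): each column becomes its own HDF5 dataset tagged with its list position, and on read the datasets are sorted by that attribute.

```python
import h5py
import numpy as np

columns = {"Qz": np.linspace(0.01, 0.5, 5),
           "R": np.random.random(5),
           "dR": np.random.random(5)}

# write: one HDF5 dataset per column, storing its position in the list
with h5py.File("columns_example.h5", "w") as f:
    grp = f.create_group("entry1/data")
    for i, (name, arr) in enumerate(columns.items()):
        ds = grp.create_dataset(name, data=arr)
        ds.attrs["sequence_index"] = i

# read: sort datasets by the stored index to reconstruct the column list
with h5py.File("columns_example.h5", "r") as f:
    grp = f["entry1/data"]
    ordered = sorted(grp.values(), key=lambda d: d.attrs["sequence_index"])
    column_names = [d.name.rsplit("/", 1)[-1] for d in ordered]
```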
Thanks for that clear explanation. I get what's being proposed now. List of columns, yeah let's go. What's the roadmap to going forwards then? Is it merge #107 then this?
@bmaranville @aglavic do you think we can merge this PR? Or are there issues that need to be addressed before it gets merged?
We can do this one first, if you like. The changes for #107 are pretty small and can be done after.
My take on this is that we should merge this as soon as it's ready (maybe now) and make a minor revision. I don't see an issue with breaking changes if they only concern the binary representation, as we have not released that officially yet. My assumption is that 90% of people implementing the binary standard within the next couple of years are within this project, so a breaking change could be fixed quite easily. What I would try to avoid is a change that breaks the text format or the reading of important header information through orsopy (e.g. changing keys or attributes of the orsopy main class interface). But I don't remember any of the conversations pointing towards such an issue.
Minor changes you might want to take a look at, but can also be done in the next iteration.
def load_nexus(fname: Union[str, BinaryIO]) -> List[OrsoDataset]:
I think it would be advantageous to have a single load_orso function that chooses which loader to use automatically. That would allow external software to be agnostic about the file type that's used.
We can do this in the next release, though.
Suggested way of implementation:
- load_orso scans through a list of known data format classes
- Each format class has a static method to check if it supports the file in question (e.g. first header line for .ort, HDF header for .orb)
- If it's supported, the function uses the format class to retrieve the Orso objects from the file
- Potentially we could store the format used to unpack in the Orso class
def orsodataclass(cls: type):
    ORSO_DATACLASSES[cls.__name__] = cls
I think it would be cleaner to use built-in python functionality instead of keeping track of our sub-classes ourselves. All header classes are derived from Header,
so you could replace:
if value.__class__ in ORSO_DATACLASSES.values():
with
if isinstance(value, Header):
and
cls = ORSO_DATACLASSES[ORSO_class]
with
orso_classes = dict((c.__name__, c) for c in Header.__subclasses__())
...
cls = orso_classes[ORSO_class]
(I think the first call misses sub-subclasses, so it's probably easier to implement it as a recursive method on the Header class.)
In this case we are also safe if people sub-class orso Header without using the decorator (e.g. to add functionality, not attributes).
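The recursive lookup could look like this (sketched on a stand-in Header hierarchy, not orsopy's real classes):

```python
class Header:
    @classmethod
    def _all_subclasses(cls):
        # walk the whole inheritance tree so sub-subclasses are found too,
        # unlike a single __subclasses__() call
        found = {}
        for sub in cls.__subclasses__():
            found[sub.__name__] = sub
            found.update(sub._all_subclasses())
        return found

# stand-in hierarchy for illustration
class DataSource(Header):
    pass

class Measurement(DataSource):  # sub-subclass: missed by one
    pass                        # __subclasses__() call, found by recursion

orso_classes = Header._all_subclasses()
```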
The subclass recursive search would miss OrsoDataset, which needs to be recognized as an ORSO class during deserialization. I think trying to make an inheritance tree that includes both Header and OrsoDataset is going to be more complicated.
@aglavic, @bmaranville I propose that we merge this as-is, then make Artur's suggested changes afterwards. It'd be good to get some movement going again.
This version of the PR is separate from #107, and implements a NeXus save/load with the current data order.
I added some tests as suggested by @andyfaff and it roundtrips the data structures successfully.
We can probably delete the previous PR #97 which also implemented the change of data order; as @aglavic pointed out that is a breaking change and we've already released v1.0, so we need to come to an agreement on that issue separately.