Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bulk import JSON FHIR resources #187

Merged
merged 11 commits into from
May 21, 2024
Merged

Bulk import JSON FHIR resources #187

merged 11 commits into from
May 21, 2024

Conversation

Wambere
Copy link
Contributor

@Wambere Wambere commented May 6, 2024

IMPORTANT: Where possible all PRs must be linked to a Github issue

Fixes #184

Engineer Checklist

  • I have run ./gradlew spotlessApply to check my code follows the project's style guide
  • I have built and run the efsity jar to verify my change fixes the issue and/or does not break the application

@Wambere
Copy link
Contributor Author

Wambere commented May 6, 2024

Since we have referential integrity enabled, there is a high chance of getting a lot of failures with the data the way it currently is. Some possible solutions include :

Option 1: Require the data to be separated based on resourceType
This way we can load the resourceTypes with dependencies first
The code can be modified to take in a folder with multiple JSON files, each holding a different resourceType, and a string/list of resourceTypes to be used as the order of import

Option 2: Allow creation of placeholder resources
Allow the tool to create a placeholder resource every time it comes across a non existing one, with the assumption that these resources will eventually be updated with their full information once we get to that point in the dataset
We'll probably need to tag these placeholder resources somehow so that we can check once the importing is done that they all have been updated and figure out what to do with any that have not

Option 3: Sort the resources in the dataset
Attempt to sort the resources in the batches and import them breaking down the batches even more if need be

@pld
Copy link
Member

pld commented May 6, 2024

Is option 3 the same as building a dependency graph (I think this could be many graphs) and importing based on the order (from leaves to roots) of that graph?

@ndegwamartin do you have a link to the Android FHIR SDK in progress code that orders resource by dependency graph? If that's straightforward maybe we could lift and reimplement that

@dubdabasoduba
Copy link
Member

@ndegwamartin please confirm if this is what @pld is talking about

@Wambere
Copy link
Contributor Author

Wambere commented May 7, 2024

Looked a bit at that PR and it looks like it is building a mapping that is then used to pull the resources from the database?

I think the worst part about this is probably the fact that we might need to load the whole file in order to get the references/info we need to build the mapping. Or maybe iterate over the file twice? The first time getting the info and the second time actually building the chunks. Considering the size estimate (8k Patient resources each having 5 or more related resources) we definitely have to think more on how this can be optimized best. Looking more into the implementation to see how much we can borrow

@ndegwamartin
Copy link
Contributor

Concurred on both the above i.e. the logic behind the SDK's implementation could be (partially) lifted or borrowed from but also that the performance for our use case may be more negatively impacted than in the SDK since we are dealing with raw files while the SDK working with an (indexed) DB.

We could also look at this from a different perspective. Migration is meant to be a one off (or sporadic) process and done especially at the beginning of any server deployment's timeline. For the issue on referential integrity we could disable the server validation for the migration process then re-enable it after. If the tool guarantees processing of all resources, any references that were present before the migration will still be present after the migration. Server validation should mainly be in place for real time data creation e.g. by users. This approach guarantees the mirroring of the data's state across systems which IMO is the objective of a migration.

@Wambere
Copy link
Contributor Author

Wambere commented May 9, 2024

After a chat with @pld yesterday we agreed on a first pass to simply sort the resources in the file based on resourceTypes and post those payloads one after the other

To test the current implementation run

 python3 main.py --bulk_import True --json_file tests/fhir_sample.json --log_level info --chunk_size 500000 --sync sort

The "chunk_size" is used when reading from the file

If you pass "--sync sort" this will sort your resources into the different resourceTypes
The order of resourceTypes for sync is based on the order in which they appear in the file, so you can put a patient resource first in the list if you want patients to be synced first

The default sync is "direct" which does not sort, just tries to push the data as is. Faster sync, useful if you don't have differential integrity turned on

Will pull most of this comments into the docs

@Wambere
Copy link
Contributor Author

Wambere commented May 9, 2024

@Rkareko if you are able to test this with your dataset and provide feedback, that would be awesome

@Rkareko
Copy link
Contributor

Rkareko commented May 14, 2024

@Rkareko if you are able to test this with your dataset and provide feedback, that would be awesome

This has been tested with a list of around 150 records. The importer successfully parsed the data and synced to the server cc @allan-on

@Rkareko
Copy link
Contributor

Rkareko commented May 14, 2024

@Wambere is it possible to display the progress of processed resources ?

@Wambere Wambere changed the title WIP: bulk import JSON FHIR resources Bulk import JSON FHIR resources May 17, 2024
@Wambere Wambere requested review from pld and Rkareko May 17, 2024 06:23
@pld pld merged commit bbaa8ae into main May 21, 2024
4 checks passed
@pld pld deleted the 184-bulk-import branch May 21, 2024 15:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bulk import of FHIR Resources
5 participants