Bulk import JSON FHIR resources #187

Wambere · 2024-05-06T11:09:24Z

IMPORTANT: Where possible all PRs must be linked to a Github issue

Fixes #184

Engineer Checklist

I have run ./gradlew spotlessApply to check my code follows the project's style guide
I have built and run the efsity jar to verify my change fixes the issue and/or does not break the application

Wambere · 2024-05-06T14:53:06Z

Since we have referential integrity enabled, there is a high chance of getting a lot of failures with the data the way it currently is. Some possible solutions include :

Option 1: Require the data to be separated based on resourceType
This way we can load the resourceTypes with dependencies first
The code can be modified to take in a folder with multiple JSON files, each holding a different resourceType, and a string/list of resourceTypes to be used as the order of import

Option 2: Allow creation of placeholder resources
Allow the tool to create a placeholder resource every time it comes across a non existing one, with the assumption that these resources will eventually be updated with their full information once we get to that point in the dataset
We'll probably need to tag these placeholder resources somehow so that we can check once the importing is done that they all have been updated and figure out what to do with any that have not

Option 3: Sort the resources in the dataset
Attempt to sort the resources in the batches and import them breaking down the batches even more if need be

pld · 2024-05-06T15:10:48Z

Is option 3 the same as building a dependency graph (I think this could be many graphs) and importing based on the order (from leaves to roots) of that graph?

@ndegwamartin do you have a link to the Android FHIR SDK in progress code that orders resource by dependency graph? If that's straightforward maybe we could lift and reimplement that

dubdabasoduba · 2024-05-07T06:43:22Z

@ndegwamartin please confirm if this is what @pld is talking about

Wambere · 2024-05-07T14:32:29Z

Looked a bit at that PR and it looks like it is building a mapping that is then used to pull the resources from the database?

I think the worst part about this is probably the fact that we might need to load the whole file in order to get the references/info we need to build the mapping. Or maybe iterate over the file twice? The first time getting the info and the second time actually building the chunks. Considering the size estimate (8k Patient resources each having 5 or more related resources) we definitely have to think more on how this can be optimized best. Looking more into the implementation to see how much we can borrow

ndegwamartin · 2024-05-08T14:17:28Z

Concurred on both the above i.e. the logic behind the SDK's implementation could be (partially) lifted or borrowed from but also that the performance for our use case may be more negatively impacted than in the SDK since we are dealing with raw files while the SDK working with an (indexed) DB.

We could also look at this from a different perspective. Migration is meant to be a one off (or sporadic) process and done especially at the beginning of any server deployment's timeline. For the issue on referential integrity we could disable the server validation for the migration process then re-enable it after. If the tool guarantees processing of all resources, any references that were present before the migration will still be present after the migration. Server validation should mainly be in place for real time data creation e.g. by users. This approach guarantees the mirroring of the data's state across systems which IMO is the objective of a migration.

Wambere · 2024-05-09T10:45:58Z

After a chat with @pld yesterday we agreed on a first pass to simply sort the resources in the file based on resourceTypes and post those payloads one after the other

To test the current implementation run

 python3 main.py --bulk_import True --json_file tests/fhir_sample.json --log_level info --chunk_size 500000 --sync sort

The "chunk_size" is used when reading from the file

If you pass "--sync sort" this will sort your resources into the different resourceTypes
The order of resourceTypes for sync is based on the order in which they appear in the file, so you can put a patient resource first in the list if you want patients to be synced first

The default sync is "direct" which does not sort, just tries to push the data as is. Faster sync, useful if you don't have differential integrity turned on

Will pull most of this comments into the docs

Wambere · 2024-05-09T12:42:53Z

@Rkareko if you are able to test this with your dataset and provide feedback, that would be awesome

Rkareko · 2024-05-14T07:31:21Z

@Rkareko if you are able to test this with your dataset and provide feedback, that would be awesome

This has been tested with a list of around 150 records. The importer successfully parsed the data and synced to the server cc @allan-on

Rkareko · 2024-05-14T07:32:20Z

@Wambere is it possible to display the progress of processed resources ?

Initial commit for json fhir resources bulk import

8545482

Wambere mentioned this pull request May 7, 2024

Bulk import of FHIR Resources #184

Closed

Update sync strategy with option to sort based on resource type

f7ab9d7

Handle long resource lists

3748122

Wambere added 7 commits May 16, 2024 10:12

Update split_index to work with resources not separated by a new line

b6fc114

Add progressbar to show progress as chunks are processed

46686bd

Make number of resources configurable

e2447c4

Add documentation for json array resource import

7f7d8d3

Clean up

15aafa7

Add tests

6c39dc8

sample json

fbf8a61

Wambere changed the title ~~WIP: bulk import JSON FHIR resources~~ Bulk import JSON FHIR resources May 17, 2024

Formatting

22b11c6

Wambere requested review from pld and Rkareko May 17, 2024 06:23

pld approved these changes May 21, 2024

View reviewed changes

pld merged commit bbaa8ae into main May 21, 2024
4 checks passed

pld deleted the 184-bulk-import branch May 21, 2024 15:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bulk import JSON FHIR resources #187

Bulk import JSON FHIR resources #187

Wambere commented May 6, 2024 •

edited

Loading

Wambere commented May 6, 2024

pld commented May 6, 2024

dubdabasoduba commented May 7, 2024

Wambere commented May 7, 2024

ndegwamartin commented May 8, 2024

Wambere commented May 9, 2024

Wambere commented May 9, 2024

Rkareko commented May 14, 2024

Rkareko commented May 14, 2024

Bulk import JSON FHIR resources #187

Bulk import JSON FHIR resources #187

Conversation

Wambere commented May 6, 2024 • edited Loading

Wambere commented May 6, 2024

pld commented May 6, 2024

dubdabasoduba commented May 7, 2024

Wambere commented May 7, 2024

ndegwamartin commented May 8, 2024

Wambere commented May 9, 2024

Wambere commented May 9, 2024

Rkareko commented May 14, 2024

Rkareko commented May 14, 2024

Wambere commented May 6, 2024 •

edited

Loading