-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CHORE: Clean Up Parquet File Writing #189
Conversation
target-version = ['py39'] | ||
target-version = ['py310'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missed this on the change from python 3.9 to 3.10
for retry_count in range(max_retries): | ||
for retry_count in range(max_retries + 1): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
allows for actual number of max_retries
db_batch_size = 1024 * 1024 / 2 | ||
|
||
db_manager.write_to_parquet( | ||
select_query=sa.text(self.create_query), | ||
write_path=self.local_parquet_path, | ||
schema=self.parquet_schema, | ||
batch_size=db_batch_size, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on ECS Health dashboard, memory utilization appeared to peak somewhere between 70-90% on our ECS with 16GB of memory. A reduced batch_size
should cut that in half.
" , ve.pm_trip_id" | ||
" , ve.stop_sequence" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think pm_trip_id
or updated_on
fields have any use to OPMI, so drop from parquet file.
" , vt.first_last_station_match" | ||
" , vt.first_last_station_match as exact_static_trip_match" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exact_static_trip_match
is a much more accurate name for this field.
# this is a fairly wide dataset, so dial back the batch size | ||
# to limit memory usage | ||
db_batch_size = 1024 * 1024 / 2 | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should limit memory utilization during initial file creation events.
) | ||
|
||
def update_parquet(self, db_manager: DatabaseManager) -> bool: | ||
dataset_batch_size = 1024 * 1024 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is the update double the create size?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In practice, we shouldn't see very large update query sizes, unless the parquet process itself is turned off for awhile, but I'll update this one as well, just to cover our basis.
Sorry, actually this is the "dataset" batch size. My intuition is that these batch sizes are much less memory intensive than the large DB query batches. And I believe there is a file size reduction by selecting the largest possible batch_size
for these "dataset" batches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one question but generally looks good.
🍰 |
PR #189 Introduced a spacing error with the LAMP_ALL_RT_fields SQL query. Update query to have consistent spacing before terms.
This change has some minor updates to our parquet file writing process that should be helpful prior to our push to the prod environment.
batch_size
for initial parquet file creation to reduce memory usageLAMP_ALL_RT_fields
schemagc.collect()
call to end of parquet file creation process