
CHORE: Clean Up Parquet File Writing #189

Merged: rymarczy merged 1 commit into main on Nov 15, 2023

Conversation

rymarczy (Collaborator)

This change makes some minor updates to our parquet file writing process that should be helpful before our push to the prod environment.

  • decrease batch_size for initial parquet file creation to reduce memory usage
  • clean up LAMP_ALL_RT_fields schema
  • add gc.collect() call to end of parquet file creation process

- target-version = ['py39']
+ target-version = ['py310']
rymarczy (Collaborator, Author):

Missed this on the change from Python 3.9 to 3.10.

- for retry_count in range(max_retries):
+ for retry_count in range(max_retries + 1):
rymarczy (Collaborator, Author):

Allows the loop to actually perform max_retries retries, rather than max_retries total attempts.
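The off-by-one is easier to see in isolation: `range(max_retries)` performs max_retries total attempts, which is the initial try plus only `max_retries - 1` retries. A minimal sketch (hypothetical names, not the project's actual retry helper):

```python
def run_with_retries(operation, max_retries):
    """Attempt operation once, then retry up to max_retries more times.

    range(max_retries + 1) gives the initial attempt plus max_retries
    retries; range(max_retries) would shortchange us by one attempt.
    """
    last_error = None
    for retry_count in range(max_retries + 1):
        try:
            return operation()
        except Exception as exc:  # real code would catch narrower errors
            last_error = exc
    raise last_error
```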

Comment on lines +43 to +49
db_batch_size = 1024 * 1024 / 2

db_manager.write_to_parquet(
select_query=sa.text(self.create_query),
write_path=self.local_parquet_path,
schema=self.parquet_schema,
batch_size=db_batch_size,
rymarczy (Collaborator, Author):

Based on the ECS Health dashboard, memory utilization appeared to peak somewhere between 70-90% on our ECS instance with 16 GB of memory. Reducing the batch_size should roughly cut that in half.
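A quick back-of-the-envelope check, assuming the in-flight batch dominates peak memory (the per-row byte figure below is illustrative, not measured). As an aside, `1024 * 1024 / 2` evaluates to a float in Python 3; `// 2` would keep batch_size an integer:

```python
old_batch_rows = 1024 * 1024       # previous batch size
new_batch_rows = 1024 * 1024 // 2  # halved; // keeps it an int

# hypothetical average row width for a wide real-time dataset
bytes_per_row = 512

old_peak_mb = old_batch_rows * bytes_per_row / 1024**2
new_peak_mb = new_batch_rows * bytes_per_row / 1024**2

# halving the batch halves the batch-resident memory
assert new_peak_mb == old_peak_mb / 2
```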

" , ve.pm_trip_id"
" , ve.stop_sequence"
rymarczy (Collaborator, Author):

I don't think the pm_trip_id or updated_on fields have any use to OPMI, so drop them from the parquet file.

Comment on lines -62 to +58
" , vt.first_last_station_match"
" , vt.first_last_station_match as exact_static_trip_match"
rymarczy (Collaborator, Author):

exact_static_trip_match is a much more accurate name for this field.

Comment on lines +134 to +137
# this is a fairly wide dataset, so dial back the batch size
# to limit memory usage
db_batch_size = 1024 * 1024 / 2

rymarczy (Collaborator, Author):

Should limit memory utilization during initial file creation events.

)

def update_parquet(self, db_manager: DatabaseManager) -> bool:
dataset_batch_size = 1024 * 1024
Contributor:

Why is the update double the create size?

rymarczy (Collaborator, Author) on Nov 15, 2023:

In practice, we shouldn't see very large update query sizes unless the parquet process itself is turned off for a while, but I'll update this one as well, just to cover our bases.

Sorry, actually this is the "dataset" batch size. My intuition is that these batch sizes are much less memory intensive than the large DB query batches. And I believe there is a file size reduction by selecting the largest possible batch_size for these "dataset" batches.

mzappitello (Contributor) left a review comment:

One question, but generally looks good.

mzappitello (Contributor):

🍰

@rymarczy rymarczy merged commit cc96d1d into main Nov 15, 2023
6 checks passed
rymarczy added a commit that referenced this pull request Nov 15, 2023
PR #189 introduced a spacing error in the LAMP_ALL_RT_fields SQL query.

Update query to have consistent spacing before terms.