
CHORE: Clean Up Parquet File Writing #189

Merged: rymarczy merged 1 commit into main on Nov 15, 2023

Conversation

rymarczy (Collaborator)

This change makes some minor updates to our parquet file writing process that should be helpful before our push to the prod environment.

  • decrease batch_size for initial parquet file creation to reduce memory usage
  • clean up LAMP_ALL_RT_fields schema
  • add gc.collect() call to end of parquet file creation process

- target-version = ['py39']
+ target-version = ['py310']
rymarczy (Collaborator, Author):

Missed this on the change from Python 3.9 to 3.10.

- for retry_count in range(max_retries):
+ for retry_count in range(max_retries + 1):
rymarczy (Collaborator, Author):

Allows the loop to actually perform max_retries retries, rather than max_retries total attempts.
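The off-by-one is easier to see in isolation: `range(max_retries)` performs max_retries total attempts, which is the initial try plus only `max_retries - 1` retries. A minimal sketch (hypothetical names, not the project's actual retry helper):

```python
def run_with_retries(operation, max_retries):
    """Attempt operation once, then retry up to max_retries more times.

    range(max_retries + 1) gives the initial attempt plus max_retries
    retries; range(max_retries) would shortchange us by one attempt.
    """
    last_error = None
    for retry_count in range(max_retries + 1):
        try:
            return operation()
        except Exception as exc:  # real code would catch narrower errors
            last_error = exc
    raise last_error
```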

Comment on lines +43 to +49
db_batch_size = 1024 * 1024 / 2

db_manager.write_to_parquet(
select_query=sa.text(self.create_query),
write_path=self.local_parquet_path,
schema=self.parquet_schema,
batch_size=db_batch_size,
rymarczy (Collaborator, Author):

Based on the ECS Health dashboard, memory utilization appeared to peak somewhere between 70-90% on our ECS instance with 16 GB of memory. Reducing the batch_size should roughly cut that in half.
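A quick back-of-the-envelope check, assuming the in-flight batch dominates peak memory (the per-row byte figure below is illustrative, not measured). As an aside, `1024 * 1024 / 2` evaluates to a float in Python 3; `// 2` would keep batch_size an integer:

```python
old_batch_rows = 1024 * 1024       # previous batch size
new_batch_rows = 1024 * 1024 // 2  # halved; // keeps it an int

# hypothetical average row width for a wide real-time dataset
bytes_per_row = 512

old_peak_mb = old_batch_rows * bytes_per_row / 1024**2
new_peak_mb = new_batch_rows * bytes_per_row / 1024**2

# halving the batch halves the batch-resident memory
assert new_peak_mb == old_peak_mb / 2
```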

" , ve.pm_trip_id"
" , ve.stop_sequence"
rymarczy (Collaborator, Author):

I don't think the pm_trip_id or updated_on fields have any use to OPMI, so drop them from the parquet file.

Comment on lines -62 to +58
" , vt.first_last_station_match"
" , vt.first_last_station_match as exact_static_trip_match"
rymarczy (Collaborator, Author):

exact_static_trip_match is a much more accurate name for this field.

Comment on lines +134 to +137
# this is a fairly wide dataset, so dial back the batch size
# to limit memory usage
db_batch_size = 1024 * 1024 / 2

rymarczy (Collaborator, Author):

Should limit memory utilization during initial file creation events.

)

def update_parquet(self, db_manager: DatabaseManager) -> bool:
dataset_batch_size = 1024 * 1024
Contributor:

Why is the update double the create size?

rymarczy (Collaborator, Author) on Nov 15, 2023:

In practice, we shouldn't see very large update query sizes unless the parquet process itself is turned off for a while, but I'll update this one as well, just to cover our bases.

Sorry, actually this is the "dataset" batch size. My intuition is that these batch sizes are much less memory intensive than the large DB query batches. And I believe there is a file size reduction by selecting the largest possible batch_size for these "dataset" batches.

mzappitello (Contributor) left a review comment:

One question, but generally looks good.

mzappitello (Contributor):

🍰

@rymarczy rymarczy merged commit cc96d1d into main Nov 15, 2023
6 checks passed
rymarczy added a commit that referenced this pull request Nov 15, 2023
PR #189 introduced a spacing error in the LAMP_ALL_RT_fields SQL query.

Update query to have consistent spacing before terms.