FIX: Stream DB Results to Parquet Files #183
Conversation
```python
if not running_in_docker() and not running_in_aws():
    db_host = "127.0.0.1"
```
pulled this over from dmap_import
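For reference, a minimal sketch of how such environment checks are often implemented; the actual `dmap_import` helpers may differ, and the `/.dockerenv` and `AWS_EXECUTION_ENV` checks below are assumptions, not the project's code:

```python
import os


def running_in_docker() -> bool:
    # Assumption: Docker containers conventionally expose /.dockerenv.
    return os.path.exists("/.dockerenv")


def running_in_aws() -> bool:
    # Assumption: AWS runtimes such as ECS set AWS_EXECUTION_ENV.
    return bool(os.getenv("AWS_EXECUTION_ENV"))
```

With helpers along these lines, the snippet above falls back to a local database host only when running outside both Docker and AWS.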
```diff
@@ -16,6 +17,9 @@ def validate_environment(
     process_logger = ProcessLogger("validate_env")
     process_logger.log_start()
+
+    if private_variables is None:
+        private_variables = []
```
Had to re-structure this function to allow for a `private_variables` parameter and avoid a pylint `too-many-branches` flag.
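A rough sketch of the restructured shape described here; only the `private_variables` default handling is taken from the diff above, and the validation logic is illustrative:

```python
import os
from typing import List, Optional


def validate_environment(
    required_variables: List[str],
    private_variables: Optional[List[str]] = None,
) -> None:
    # Default to an empty list inside the body to avoid a mutable
    # default argument; this is the addition shown in the diff above.
    if private_variables is None:
        private_variables = []

    # Illustrative checks: the real function's validation may differ.
    missing = [name for name in required_variables if os.getenv(name) is None]
    if missing:
        raise EnvironmentError(f"missing environment variables: {missing}")

    # Private variables are validated but their values are never echoed.
    for name in required_variables:
        if name not in private_variables:
            print(f"{name}={os.environ[name]}")
```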
```python
self.db_manager = DatabaseManager()
```
Dropped `self.db_manager` from being created for all hyper jobs. This would have led to issues with the Hyper-file-writing ECS, which won't be able to connect to our RDS. `db_manager` is now passed directly into the `create_parquet` and `update_parquet` methods, as they are the only portions of the class that require DB access.
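A minimal sketch of the dependency-injection shape described; the `HyperJob` class name and method bodies are placeholders, not the project's actual class:

```python
class HyperJob:
    # Sketch: no DatabaseManager is constructed in __init__, so ECS tasks
    # that only write Hyper files never attempt an RDS connection.
    def __init__(self, write_path: str) -> None:
        self.write_path = write_path

    def create_parquet(self, db_manager: "DatabaseManager") -> None:
        # Only the parquet-producing methods receive DB access.
        ...

    def update_parquet(self, db_manager: "DatabaseManager") -> None:
        ...
```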
```python
with self.session.begin() as cursor:
    result = cursor.execute(select_query).yield_per(batch_size)
    with pq.ParquetWriter(write_path, schema=schema) as pq_writer:
        for part in result.partitions():
            pq_writer.write_batch(
                pyarrow.RecordBatch.from_pylist(
                    [row._asdict() for row in part], schema=schema
                )
            )

return write_path
```
I don't think we need to return the write path since it was provided as part of the input.
Incorporated
nice. excited to see this in staging.
lgtm 🍰
The current Parquet -> Tableau pipeline is flawed in the amount of memory required to create parquet files from DB SELECT queries. This change is meant to result in a fixed amount of memory usage when creating a parquet file, no matter how many results a DB query returns.

This fixed memory usage is achieved by utilizing the `yield_per` method of the SQLAlchemy `Result` object, as well as the `RecordBatch` object of the pyarrow library. In testing, memory usage for the creation of a parquet file from the `static_stop_times` table maxes out at approximately 5-6 GB.

If memory usage needs to be further limited, the `write_to_parquet` function of `DatabaseManager` offers a `batch_size` parameter to limit the number of records flowing into a parquet file per partition.

Asana Task: https://app.asana.com/0/1205827492903547/1205940053804614
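A self-contained sketch of the streaming pattern described above, assuming a SQLAlchemy engine and a known pyarrow schema; the DSN, table, and column names are illustrative, not the project's:

```python
import pyarrow
import pyarrow.parquet as pq
import sqlalchemy as sa

engine = sa.create_engine("postgresql://user:pass@localhost/db")  # placeholder DSN
schema = pyarrow.schema([("id", pyarrow.int64()), ("name", pyarrow.string())])
select_query = sa.text("SELECT id, name FROM static_stop_times")  # illustrative

batch_size = 1024  # smaller values lower the memory ceiling at some throughput cost

with engine.connect() as conn:
    # yield_per streams rows from the server in fixed-size chunks instead of
    # materializing the full result set, so memory stays bounded by batch_size.
    result = conn.execute(select_query).yield_per(batch_size)
    with pq.ParquetWriter("static_stop_times.parquet", schema=schema) as writer:
        # partitions() yields lists of at most batch_size rows; each list is
        # converted to a pyarrow RecordBatch and appended to the parquet file.
        for part in result.partitions():
            writer.write_batch(
                pyarrow.RecordBatch.from_pylist(
                    [row._asdict() for row in part], schema=schema
                )
            )
```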