PXP-10845: Add utilities for dsv file operations and manipulation #179

MaribelleHGomez · 2023-03-20T17:42:45Z

Jira Ticket: PXP-10845

New Features

Adds a utility file for DSV file manipulation. Functions include: collecting file contents into a list, finding the intersection and difference of two files, chunking files, splitting a file by headers, and merging files.

Breaking Changes

Bug Fixes

Improvements

Dependency updates

Deployment changes

github-actions · 2023-03-20T17:43:18Z

The style in this PR agrees with black. ✔️

This formatting comment was generated automatically by a script in uc-cdis/wool.

BinamB · 2023-03-24T18:40:43Z

gen3/tools/file_utils.py

+            output_writer.writerow(record)
+
+
+def convert_type(file_name: str, new_type: str, has_headers=False):


So is this only changing the extension? To convert a TSV to a CSV, wouldn't we also need to replace the tabs with commas?
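A minimal sketch of what a delimiter-aware conversion could look like, using only the standard csv module. The helper name convert_delimiter and its parameters are assumptions for illustration and are not part of this PR. Re-writing rows through csv.writer (rather than doing a plain string replace) means a field that already contains a comma gets quoted.

```python
import csv
import io


def convert_delimiter(src, dst, old_delimiter="\t", new_delimiter=","):
    """Copy every row from src to dst, switching delimiters (hypothetical helper)."""
    reader = csv.reader(src, delimiter=old_delimiter)
    writer = csv.writer(dst, delimiter=new_delimiter, lineterminator="\n")
    for row in reader:
        writer.writerow(row)


# Usage with in-memory files; the second field of the data row contains a comma.
tsv = io.StringIO("guid\tfile_name\nabc\ta,b.txt\n")
out = io.StringIO()
convert_delimiter(tsv, out)
csv_text = out.getvalue()
```

With real files, the same function can be called on two handles opened with newline="" as the csv documentation recommends.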

BinamB · 2023-03-24T18:42:07Z

gen3/tools/file_utils.py

+    write(records, new_file_name)
+
+
+def merge_files(files: Union[str, list], output_file_name: str, has_headers=False):


What if the files have different headers or are different file types? I think we should check for that.
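One possible shape for such a check, sketched under assumptions: check_mergeable is a hypothetical helper (not in this PR), file type is inferred from the extension only, and headers are passed in as a dict so the sketch does not depend on how the PR reads them.

```python
import os


def check_mergeable(file_names, headers_by_file):
    """Raise ValueError if files differ in extension or header set (hypothetical helper).

    headers_by_file maps each file name to its list of column names.
    """
    extensions = {os.path.splitext(name)[1].lower() for name in file_names}
    if len(extensions) > 1:
        raise ValueError(f"Mixed file types: {sorted(extensions)}")

    # Compare header sets, so a different column order alone is not an error.
    reference = set(headers_by_file[file_names[0]])
    for name in file_names[1:]:
        mismatch = reference.symmetric_difference(headers_by_file[name])
        if mismatch:
            raise ValueError(f"{name} has mismatched headers: {sorted(mismatch)}")
```

Calling this before merging would fail early instead of writing a merged file with misaligned columns.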

BinamB · 2023-03-24T18:44:15Z

gen3/tools/file_utils.py

+    Args:
+        file_name(str):
+            Filename or file path
+    Returns:


Could you add an example in the description here?

BinamB · 2023-03-24T18:46:33Z

gen3/tools/file_utils.py

+    has_headers=False,
+):
+    """
+    Chunk records into files of input size


I think this could use a better description. Something like "chunk manifests into x different chunks" and maybe in the logs add how many records there will be per manifest.

BinamB · 2023-03-24T18:48:19Z

gen3/tools/file_utils.py

+
+def chunk(
+    file_name: str,
+    chunk_size: int,


Oh actually, a fun idea would be adding another input: the number of records in a manifest. So you're required to give either a number of chunks or a number of records per chunk, do the math, and create chunks according to that. If both are given, maybe do the math to check whether or not the chunk count and the records-per-chunk number are compatible.
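The math described above could be sketched as follows. resolve_chunking and its parameter names are hypothetical, and "compatible" is assumed here to mean that splitting the records at records_per_chunk yields exactly num_chunks files.

```python
import math


def resolve_chunking(total_records, num_chunks=None, records_per_chunk=None):
    """Return (num_chunks, records_per_chunk) for total_records (hypothetical helper).

    records_per_chunk is the size of the largest chunk; the last one may be smaller.
    """
    if num_chunks is None and records_per_chunk is None:
        raise ValueError("Provide num_chunks, records_per_chunk, or both")
    for value in (num_chunks, records_per_chunk):
        if value is not None and value < 1:
            raise ValueError("Chunk parameters must be positive integers")

    if num_chunks is None:
        num_chunks = math.ceil(total_records / records_per_chunk)
    elif records_per_chunk is None:
        records_per_chunk = math.ceil(total_records / num_chunks)
    elif math.ceil(total_records / records_per_chunk) != num_chunks:
        # Both were given but they do not describe the same split.
        raise ValueError(
            f"{num_chunks} chunks of {records_per_chunk} records "
            f"cannot hold exactly {total_records} records"
        )
    return num_chunks, records_per_chunk
```

For example, 10 records with records_per_chunk=3 resolves to 4 chunks, while asking for 5 chunks of 3 records each raises an error.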

BinamB · 2023-03-24T18:58:14Z

gen3/tools/file_utils.py

+    if has_headers:
+        file_one_headers = get_headers(file_one)
+        file_two_headers = get_headers(file_two)
+        if file_one_headers != file_two_headers and sorted(file_one_headers) == sorted(


I guess I don't understand what is going on here.
If we have headers:

A = ["a", "b", "c"]
B = ["b", "c", "a"]
print(A == B)  # False
print(sorted(A) == sorted(B))  # True

So I see you're moving the header; I think you should add logs notifying that the columns aren't arranged in the same order and that we're moving the columns.
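A sketch of how the reordering plus the requested log message might fit together. align_columns is a hypothetical stand-in, not the PR's move_columns, and it works on rows already loaded as lists.

```python
import logging


def align_columns(reference_headers, headers, rows):
    """Reorder each row in rows to match reference_headers (hypothetical helper).

    Each row is a list whose values line up with headers.
    """
    if headers == reference_headers:
        return rows
    if sorted(headers) != sorted(reference_headers):
        raise ValueError("Files do not share the same set of headers")

    logging.warning(
        "Column order %s differs from %s; reordering columns to match.",
        headers,
        reference_headers,
    )
    # Position of each reference column inside the incoming rows.
    positions = [headers.index(name) for name in reference_headers]
    return [[row[i] for i in positions] for row in rows]


aligned = align_columns(["a", "b", "c"], ["b", "c", "a"], [["2", "3", "1"]])
```

Here the single row comes back in a, b, c order, and the warning tells the user that the second file was rearranged.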

BinamB · 2023-03-24T18:58:41Z

gen3/tools/file_utils.py

+        if file_one_headers != file_two_headers and sorted(file_one_headers) == sorted(
+            file_two_headers
+        ):
+            list_two = move_columns(file_one_headers, file_two)


This should be a sorted(A) != sorted(B) check.

BinamB · 2023-03-24T19:03:45Z

gen3/tools/file_utils.py

+    else:
+        list_two = file_to_list(file_two, has_headers)
+
+    if not strict:


I think you need to handle cases where headers like filename and file_name are the same thing, and this code should be smart enough to figure that out. Luckily there are functions in the SDK that already handle that. Check here: https://github.com/uc-cdis/gen3sdk-python/blob/master/gen3/tools/utils.py
Maybe right after we open up the files, we standardize the header formats.
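To illustrate the idea of standardizing headers right after reading them, here is a small standalone sketch. It does not use the SDK helpers linked above; normalize_header and the HEADER_ALIASES table are assumptions made up for this example.

```python
import re

# Illustrative alias table (assumption); real manifests may need a different mapping.
HEADER_ALIASES = {"filename": "file_name"}


def normalize_header(header):
    """Lower-case a header, unify separators, and apply known aliases (hypothetical)."""
    key = re.sub(r"[\s\-]+", "_", header.strip().lower())
    return HEADER_ALIASES.get(key, key)


normalized = [normalize_header(h) for h in ["filename", "File Name", "file-name", "md5"]]
```

Comparing normalized headers, instead of the raw strings, would let the intersection and difference functions treat filename and file_name as the same column.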

BinamB · 2023-03-24T19:04:55Z

gen3/tools/file_utils.py

+
+    logging.info(f"Taking intersection between {file_one} & {file_two}...")
+
+    list_one = file_to_list(file_one, has_headers)


Maybe we should add validation for all manifests that the utils process.

BinamB · 2023-03-24T19:05:11Z

gen3/tools/file_utils.py

+    return [row for row in list_one if row in list_two]
+
+
+# TODO


Can you create a ticket for this TODO and link it here as well?

MaribelleHGomez added 2 commits March 20, 2023 12:35

add utilities file for dsv file operations and manipulation

5bbd8b7

Merge branch 'master' into feat/add_dsvfile_utils

cd9cd31

Apply automatic documentation changes

f1fa109

BinamB requested changes Mar 24, 2023
