Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PXP-10845: Add utilities for dsv file operations and manipulation #179

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

MaribelleHGomez
Copy link
Contributor

@MaribelleHGomez MaribelleHGomez commented Mar 20, 2023

Jira Ticket: PXP-10845

New Features

  • Utility file for DSV file manipulation; functions include: collecting file contents into a list, finding file intersection and difference, chunking files, splitting file by headers, merging files

Breaking Changes

Bug Fixes

Improvements

Dependency updates

Deployment changes

@github-actions
Copy link

The style in this PR agrees with black. ✔️

This formatting comment was generated automatically by a script in uc-cdis/wool.

output_writer.writerow(record)


def convert_type(file_name: str, new_type: str, has_headers=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so is this only changing the extensions? For a tsv to a csv would we not need to replace the tabs to commas?

write(records, new_file_name)


def merge_files(files: Union[str, list], output_file_name: str, has_headers=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we have different headers or are different file types? I think we should check for that

Args:
file_name(str):
Filename or file path
Returns:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add an example in the description here?

has_headers=False,
):
"""
Chunk records into files of input size
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could use a better description. Something like "chunk manifests into x different chunks" and maybe in the logs add how many records there will be per manifest.


def chunk(
file_name: str,
chunk_size: int,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh actually a fun idea would be adding another input, number of records in a manifest. So you're required to either give a number of chunks or number of records per chunks, do the math and create chunks according to that. If both are given, maybe do the math of wheteher or not those chunk and record numbers per chunk are compatible

if has_headers:
file_one_headers = get_headers(file_one)
file_two_headers = get_headers(file_two)
if file_one_headers != file_two_headers and sorted(file_one_headers) == sorted(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i guess i dont understand what is going on here.
if we have headers:

A = ["a", "b", "c"]
B = ["b", "c", "a"]

print(A == B) : False
print(sorted(A) == sorted(B)) :True

So i see you're moving the header, i think you should add logs notifying that the columns aren't arranged correctly and we're moving the columns.

if file_one_headers != file_two_headers and sorted(file_one_headers) == sorted(
file_two_headers
):
list_two = move_columns(file_one_headers, file_two)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a sorted(A) != sorted(B)

else:
list_two = file_to_list(file_two, has_headers)

if not strict:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think you need to handle cases where filename and file_name are the same thing and this code should be smart enough to figure that out. Luckily there are functions in the sdk that already handles that. Check here. https://github.com/uc-cdis/gen3sdk-python/blob/master/gen3/tools/utils.py
Maybe right after we open up the files we standardize the header formats


logging.info(f"Taking intersection between {file_one} & {file_two}...")

list_one = file_to_list(file_one, has_headers)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should validation for all manifests that the utils processes.

return [row for row in list_one if row in list_two]


# TODO
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you create a ticket for this TODO and link it here as well?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants