Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Higher order syntactic and semantic profiler transforms #653

Conversation

aishwariyachakraborty
Copy link
Member

Why are these changes needed?

This PR implements the higher order syntactic and semantic profiler transforms as a downstream application of the UBSR transform.

Related issue number (if any).

Signed-off-by: aishwariyachakraborty <aishwariya.chakraborty@gmail.com>
Signed-off-by: aishwariyachakraborty <aishwariya.chakraborty@gmail.com>
Signed-off-by: aishwariyachakraborty <aishwariya.chakraborty@gmail.com>
Signed-off-by: aishwariyachakraborty <aishwariya.chakraborty@gmail.com>
@aishwariyachakraborty aishwariyachakraborty marked this pull request as ready for review October 4, 2024 04:54
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please merge your branch with dev and regenerate this file.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have made all the changes you have requested. I do not have an option to merge yet. It shows : Merging is blocked
Merging can be performed automatically once the requested changes are addressed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The merge I'm referring to is one you do locally into your fork.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please merge your branch with dev and regenerate this file.

transforms/code/semantic_profiler/python/pyproject.toml Outdated Show resolved Hide resolved
transforms/code/semantic_profiler/ray/pyproject.toml Outdated Show resolved Hide resolved
transforms/code/semantic_profiler/Makefile Show resolved Hide resolved
1. Updated makefiles to new the format used in noop
2. Updated author names in toml file

Signed-off-by: aishwariyachakraborty <aishwariya.chakraborty@gmail.com>
@aishwariyachakraborty
Copy link
Member Author

I have made all the changes you have requested. I do not have an option to merge yet. It shows : Merging is blocked
Merging can be performed automatically once the requested changes are addressed.


| Parameter | Default | Description |
|------------|----------|--------------|
| `HOSP_METRICS_LIST` | `CCR` | Metrics to be calculated for profiling. Multiple metrics can be entered separated by space. Only valid metric is `CCR` as of now. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the key needs to be in lower case, per the transform implementation. And, the transform seems to look for metrics not metrics_list and definitely not hosp_metrics_list.

the options provided by
the [python launcher](../../../../data-processing-lib/doc/python-launcher-options.md).
```
--hosp_metrics_list HOSP_METRICS_LIST
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/pdf2parquet/python/README.md as a template for documenting the configurations and annotations. In particular, you haven't discussed the annotations that are added, but following the format for the configuration would be good too.

input parquet to the output folder, without modification.
"""
self.logger.debug(f"Transforming one table with {len(table)} rows")
if self.metrics_list is not None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should check for this error/condition in the initializer and throw an exception there if a bad value, such as None.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears your are adding a column whose name is the name of the metric. This is not clear from README.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you really want to distribute. these, maybe they belong in a data directory that is a sibling of src and then include them in the wheel via the toml file. My problem with this is that I'm wondering how someone would specify this file (or other data files) when using the wheel and not the repo project. If they can be referenced from within the wheel, it would be better to specify data/ikb/ikb_model.csv than src/ikb/ikb_model.csv I think.

"pipeline": "pipeline_id",
"job details": {
"job category": "preprocessing",
"job name": "NOOP",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this file needs to be regenerated since it seems to have been copied from the NOOP project.

@touma-I touma-I self-requested a review October 12, 2024 15:25
Copy link
Collaborator

@touma-I touma-I left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aishwariyachakraborty: Please slack me when you have change to discuss - Internal ID touma@us.ibm.com

@aishwariyachakraborty
Copy link
Member Author

aishwariyachakraborty commented Oct 14, 2024

It was suggested that we combine this with the Syntactic Construct Extractor transform and create a single transform. So I am closing this PR. I will create a new PR after the PR for Syntactic Construct Extractor transform is merged with the dev branch. Hope this is ok?

@aishwariyachakraborty aishwariyachakraborty closed this by deleting the head repository Oct 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants