Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ability to switch output languages for multilingual models #69 #72

Closed
wants to merge 27 commits into from

Conversation

SamDewriter
Copy link
Contributor

Here contains the few changes I have made, which I am doubting makes sense.

@SamDewriter SamDewriter self-assigned this Aug 15, 2023
@SamDewriter
Copy link
Contributor Author

My questions are:

  • I have added the target language as a parameter to the Segment class, but I am not sure if it is necessary to implement it in the DataLoader.
  • Somehow, if implemented in the Segmented and the DataLoader classes, the target language will still need to be entered manually before running it

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 15, 2023
@ibanesh
Copy link
Contributor

ibanesh commented Aug 15, 2023

You are on the right track.

To keep it simple, for now consider this as the use case for which you are making the changes:

When running the example pipeline(CounterInTargetLanguageAgentPipeline) we specify the target file (reference/en.txt) which has a list of references to compare the output of the pipeline. Let's say in addition to this target file, we also specify another file (eg: target_language.txt) which has the list of languages for the output corresponding to the respect input from the source file.

context of target_language.txt:

en

Now we want dataloader to load this file along with the source and target references, and then use the target language value from that file for setting the target language in the policy method of CounterInTargetLanguage agent class.

@ibanesh
Copy link
Contributor

ibanesh commented Aug 15, 2023

I have added the target language as a parameter to the Segment class, but I am not sure if it is necessary to implement it in the DataLoader.

Given the above use case, it would be required to make some necessary changes in dataloader.

Somehow, if implemented in the Segmented and the DataLoader classes, the target language will still need to be entered manually before running it

Even in the use case I stated, we are not technically changing the target language dynamically. But the changes you make in SimulEval to make things work for this use case will pave the way to dynamically pass in the target language from the demo front end.

In the given use case, instead of passing the target language as a parameter when loading the pipeline (i.e., --tgt-lang en), the target language will be inferred from a file.
When integrating with the demo (seamless-experience repo), this loading from a file part using a dataloader part will be replaced by some logic to stream data from front end, but the changes you made to enable passing the target language from the dataloader to the agent policy method will still hold and will enable us to dynamically change the target language.

@@ -15,6 +15,7 @@ class Segment:
finished: bool = False
is_empty: bool = False
data_type: str = None
tgt_lang: str = ""
Copy link
Contributor

@ibanesh ibanesh Aug 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see that you have made some changes to add target language as a property to dataloader and segment classes. But I don't see any change for passing the target language parameter from the dataloader to the segment.

In case you are wondering, the instance class (eg: https://github.com/facebookresearch/SimulEval/blob/main/simuleval/evaluator/instance.py) is the one that loads a specific instance/sample from the dataloader and then creates segments for that instance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for the comments, they are super helpful!

@SamDewriter
Copy link
Contributor Author

I made new changes and updated the instance.py to pass the tgt_lang to the segment when creating it. It seems to work well only that the target language is not detected and instances.log shows "unknown" where the target language is meant to be when I used the dummy model (counter_in_tgt_lang_agent.py).

I suspect that the target language parameter (args) from the counter_in_tgt_lang_agent is not getting passed to the instance.

@ibanesh
Copy link
Contributor

ibanesh commented Aug 18, 2023

Are you sure all changes have been pushed to this PR?
I'm not seeing any changes to instance.py in this PR even after your latest commit.

@ibanesh
Copy link
Contributor

ibanesh commented Aug 18, 2023

This is the summary of changes I think that needs to be made:

  • Changes in dataloader to infer/load the target language from a file.
  • Changes for passing the tgt lang attribute loaded in the dataloader to the instance object for each instance.
  • Changes for passing the tgt lang attribute from instance object to segment.
  • Changes for passing the tgt lang attribute from the segment to agent policy through the agent states.

@SamDewriter
Copy link
Contributor Author

  • Changes in dataloader to infer/load the target language from a file.
  • Changes for passing the tgt lang attribute loaded in the dataloader to the instance object for each instance.
  • Changes for passing the tgt lang attribute from instance object to segment.
  • Changes for passing the tgt lang attribute from the segment to agent policy through the agent states.

I am sure I have touched on all these aspects.

@ibanesh
Copy link
Contributor

ibanesh commented Aug 18, 2023

@SamDewriter
I still don't see any logic added in this PR for passing the tgt_lang from the dataloader to instance object to segment object. That connection is needed for propagating the parameter. Right now you are setting the tgt_lang property in dataloader from the file, but this property is not being used to set the tgt_lang property you added in instance or segment.

If you are looking for more code pointers, this should help:
https://github.com/facebookresearch/SimulEval/blob/seamless_main/simuleval/evaluator/instance.py#L46-L48
https://github.com/facebookresearch/SimulEval/blob/seamless_main/simuleval/evaluator/instance.py#L278-L283

The final missing part will be the passing of the tgt_lang from segment to the agent's policy method. This is the main part that is going to be useful when making changes to the demo.
pointers:

@SamDewriter
Copy link
Contributor Author

SamDewriter commented Aug 20, 2023

In this new commit, I pass the target language to only the SpeechInputInstance class. The property tgt_lang is not added to the parent class instance. It was commented out in the last commit.

Comment on lines 17 to 20
if args is not None:
with open(args.tgt_lang, "r") as file:
tgt_lang = file.read()
self.tgt_lang = tgt_lang
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this part here as a fallback? It is fine to leave it in as a fallback, but if this is the sole logic for getting tgt_lang in this agent then it won't meet our purpose.

We primarily want to get the tgt lang param dynamically passed to the policy method instead of solely getting set at initialization. You understand that solely getting set at initialization means it will remain static, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I added this part is just to preprocess the tgt_lang because we are reading it from a file. I came to this conclusion after several trials and errors. I noticed that before adding the logic, the evaluation works well but the tgt_lang is not being passed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I'd also need clarification on is that in the counter_in_tgt_lang_agent, the tgt_lang is being passed to the class after inheriting from the parent class Agent, which means that the tgt_lang is not implemented in the parent class. I suppose the right thing will be to implement it in the parent class so that all the children classes will automatically have it

Copy link
Contributor

@ibanesh ibanesh Aug 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want to set tgt_lang in the init method by the end of this effort, so don't worry about moving this attribute to the parent class for now.

If the objective is still not clear to you, imagine that

parser.add_argument(
"--tgt-lang", default="en", type=str, choices=["en", "es", "de"]
)
and will be removed. We want to be able to get the tgt_lang passed down to the policy method as part of the states, something like:

def policy(self, states: Optional[AgentStates] = None):
    ....
    ....
    tgt_lang = states.tgt_lang
    if tgt_lang == "en": 
        prediction += "seconds"
    elif tgt_lang == "es":
        prediction += "segundos"
    elif tgt_lang == "de":
        prediction += "sekunden"
    else:
        prediction += "<unknown>"
    ....
    ....

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! Now I understand the goal. Thanks for the clarification!

@ibanesh
Copy link
Contributor

ibanesh commented Aug 25, 2023

The changes in latest commit looks good.
Can you please clean the PR up by removing the additional unnecessary changes and make it ready for review?

@ibanesh ibanesh closed this Sep 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants