Added disambiguated text to texts folder along with source text #2

Open
wants to merge 12 commits into
base: master

Conversation

pranavad
Contributor

@pranavad pranavad commented Nov 4, 2018

Formatted the disambiguated text a bit.

@shardulc
Member

shardulc commented Nov 5, 2018

Hi @pranavad, git shows these files as binary files and not text files. (See the 'Files changed' tab for example.) How did you create the files? Is there an issue with encoding or something?

@pranavad
Contributor Author

pranavad commented Nov 5, 2018

I saved them as Unicode encoding in Notepad. I'll try and export it again in UTF-8 perhaps?

@shardulc
Member

shardulc commented Nov 5, 2018

Yup, do that.

@pranavad
Contributor Author

pranavad commented Nov 5, 2018

I think it's fine now. I re-encoded them in UTF-8, and git no longer shows them as binary files.
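For anyone hitting the same problem, the re-encoding step can also be done from a shell. This is a minimal sketch, not what was actually run in this PR: it assumes the files were written as UTF-16 with a byte-order mark (which is what Notepad's "Unicode" option produces) and that `iconv` is installed. The function names `to_utf8` and `is_utf8` are made up for illustration.

```shell
# Sketch only: assumes the input is UTF-16 with a byte-order mark,
# which is what Notepad's "Unicode" save option writes.
# to_utf8 and is_utf8 are hypothetical helper names.

# Re-encode a file in place from UTF-16 to UTF-8.
to_utf8() {
    src="$1"
    tmp="$src.utf8.tmp"
    # Only replace the original if the conversion succeeded.
    iconv -f UTF-16 -t UTF-8 "$src" > "$tmp" && mv "$tmp" "$src"
}

# Succeed only when the file decodes cleanly as UTF-8.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}
```

Usage would look like `to_utf8 some-text-file.txt` followed by `is_utf8 some-text-file.txt && echo ok`, where the file name is a placeholder. UTF-16 text contains NUL bytes, which is likely why git classified the original files as binary.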

@pranavad
Contributor Author

pranavad commented Nov 5, 2018

Hey, for something like Bay of Bengal/Indian Ocean, should "Indian"/"Bengal" be disambiguated as an adjective, a hydronym or a toponym?

@pranavad
Contributor Author

pranavad commented Nov 5, 2018

Similarly, is Ocean/Bay here a common noun or a proper noun?

@pranavad
Contributor Author

pranavad commented Nov 5, 2018

For phrases such as "Soviet Union" or "World Cup", where each word can be interpreted as a common noun, how do I disambiguate them?

Are languages and sports "altres" proper nouns?

And lastly, in something like Hindustani Music, would "Hindustani" be a proper noun or an adjective?

@pranavad
Contributor Author

pranavad commented Nov 5, 2018

@shardulc
Here are the words that the analyzer didn't recognize and that I couldn't manually disambiguate:

प्रभावित

इत्यादि

हालाँकि

उभरी

सोवियत

पुरजोर

अलावा

वाली

हिन्दुस्तानी

@shardulc
Member

shardulc commented Nov 6, 2018

@pranavad Sorry for the late response.

For all the examples like "Bay of Bengal", "Soviet Union", etc., the short answer is: do it the way the English module analyzes them, since that module is mature now. The long answer, if you want to know how/why things are done, is that I'm not sure about the technicalities and you'll have to ask on the channel.

From your list, प्रभावित and हिन्दुस्तानी seem to be normal adjectives to me. The comment above applies to the rest.

@pranavad
Contributor Author

pranavad commented Nov 6, 2018

Hey, @shardulc, I completed the disambiguation for all the lemmatized words and fixed a few others.

@pranavad
Contributor Author

pranavad commented Nov 6, 2018

Is there anything else required for merging/task completion?

@shardulc
Member

shardulc commented Nov 6, 2018

Approved your task! I'll look into merging with perhaps some minor changes shortly.

@Kainatic

Kainatic commented Nov 9, 2018

@pranavad @shardulc
How do I run the morphological analyser? (Steps, please.)
I checked the docs and found nothing related to my question.
I'm doing the same task.

@pranavad
Contributor Author

pranavad commented Nov 9, 2018

Hey @Kainatic. I had trouble understanding this at first as well. I'll help you out on IRC; join the Apertium channel.

@Kainatic

Kainatic commented Nov 9, 2018

On IRC
help

@pranavad
Contributor Author

pranavad commented Nov 9, 2018

@Kainatic From memory: first install lttoolbox, then download apertium-hin.hin.dix to your home folder in the VirtualBox machine (you'll find the steps for this in the docs). Then run
$ lt-comp lr apertium-hin.hin.dix hin.analyser.bin

(hin.analyser.bin is just the output file name; you can change it.)

Then you can analyse any text by piping it through lt-proc:
$ echo "text" | lt-proc hin.analyser.bin

Hope this helps.

@Kainatic

Kainatic commented Nov 9, 2018

Thanks

@Kainatic

Kainatic commented Nov 11, 2018

@pranavad How did you do it?
Like, did you separate and copy each sentence, or pass the text file, or something else?
Please tell me the next steps.
And how did you measure exactly 500 tokens?
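One possible way to cut a text down to a fixed number of tokens is sketched below. It is not necessarily how the text in this PR was prepared: it assumes a "token" is simply a whitespace-separated word, which may differ from how the analyser splits punctuation, and `first_tokens` is a hypothetical helper name.

```shell
# Sketch only: treats a token as a whitespace-separated word, which is an
# assumption and may not match the task's or the analyser's definition.
# first_tokens is a hypothetical helper name.

# Print the first N tokens of FILE, one per line.
first_tokens() {
    n="$1"
    file="$2"
    # LC_ALL=C keeps tr byte-based, so only ASCII whitespace splits tokens
    # and multi-byte UTF-8 characters pass through untouched.
    LC_ALL=C tr -s '[:space:]' '\n' < "$file" | sed '/^$/d' | head -n "$n"
}
```

For example, `first_tokens 500 Wiki.txt | wc -l` would confirm the count, and the whole file can be passed to the analyser on standard input with `lt-proc hin.analyser.bin < Wiki.txt` instead of echoing sentences one at a time. Both file names are taken from the comments in this thread.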

@jonorthwash
Member

The best way to compile apertium-hin is to install apertium, clone apertium-hin, go into the cloned directory, run ./configure and when that's done make.

@Kainatic

@jonorthwash I went into the cloned directory and ran './configure', but it said 'bash: ./configure: No such file or directory'.
I also read about make. There is already a Makefile.am present in the cloned directory.
I need help running the hin-tagger mode on Wiki.txt to disambiguate.
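A likely cause of the "No such file or directory" error is that a fresh git clone of an autotools project contains only Makefile.am and configure.ac; the configure script has to be generated first, usually by an autogen.sh script in the repository. The sketch below assumes the clone ships such a script and that apertium and the autotools are installed. `build_module` is a hypothetical helper name, not part of Apertium.

```shell
# Sketch only: assumes the cloned module ships an autogen.sh script and
# that apertium plus the autotools are installed.
# build_module is a hypothetical helper name.

# Build a cloned language module located in directory DIR.
build_module() {
    dir="$1"
    if [ -f "$dir/configure" ]; then
        # configure already exists (release tarball or earlier autogen run)
        (cd "$dir" && sh ./configure && make)
    elif [ -f "$dir/autogen.sh" ]; then
        # autogen.sh generates configure and typically runs it as well
        (cd "$dir" && sh ./autogen.sh && make)
    else
        echo "no configure or autogen.sh found in $dir" >&2
        return 1
    fi
}
```

After a successful `build_module apertium-hin`, the tagger mode mentioned above can usually be run from inside the module directory with `apertium -d . hin-tagger < Wiki.txt`, assuming the module defines a mode with that name.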

I'm adding the tests to a PDF and will upload it on the GCI task page.

@pranavad
Contributor Author

Hey, @shardulc. Should I make another PR for the constraint grammar task, or is this fine for now?

4 participants