German Stemmer #123

Open
michelole opened this issue Dec 3, 2019 · 0 comments
Labels
P2 High priority issues, a COULD

Comments

Member

michelole commented Dec 3, 2019

Since plural forms and grammatical case variants are all considered perfect matches in our annotation guidelines, we could apply a stemmer to the data so that inflected forms collapse into a single entry, which would make our models denser.
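To illustrate what "denser" could mean here, a minimal sketch that groups inflected forms under a shared stem. `toy_stem` is a hypothetical stand-in that only strips a few suffixes; it is not the stemmer proposed for this project, and `group_by_stem` is likewise an illustrative name.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Set


def toy_stem(word: str) -> str:
    """Hypothetical stand-in for a real German stemmer (assumption).

    Lowercases the word and strips one common inflectional suffix,
    as long as at least four characters remain.
    """
    word = word.lower()
    for suffix in ("en", "es", "e", "n", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def group_by_stem(
    words: Iterable[str], stem: Callable[[str], str] = toy_stem
) -> Dict[str, Set[str]]:
    """Map each stem to the set of surface forms that reduce to it."""
    groups: Dict[str, Set[str]] = defaultdict(set)
    for word in words:
        groups[stem(word)].add(word)
    return dict(groups)


# Four case/number variants of the same noun end up in one group.
forms = ["Befund", "Befunde", "Befundes", "Befunden"]
groups = group_by_stem(forms)
```

With a real stemmer passed in as `stem`, counts for all variants would be pooled under one key instead of being spread over four.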

However, we might need to annotate the new expansions, because stemming may lower the ranking of some pairs when the stemmed form is mistaken for an abbreviation (e.g. "Vorbefund" -> "Vorbefu", "Vesikuläratmen" -> "Vesikuläratm", "Operation" -> "Operatio").

The CISTEM stemmer seems to improve results over the Porter stemmer and has a Python implementation in NLTK.
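One way to find the pairs that might need re-annotation is to list every word whose stem is a truncated prefix of the original, since those are the ones that could be mistaken for abbreviations. The sketch below is only a heuristic under that assumption. `truncating_stem` is a hypothetical stand-in that reproduces the three examples above; `abbreviation_like` and `_normalize` are illustrative names as well.

```python
from typing import Callable, Iterable, List, Tuple


def _normalize(text: str) -> str:
    """Lowercase and transliterate umlauts/eszett so that stems from
    stemmers that rewrite these characters still compare as prefixes."""
    text = text.lower()
    for src, dst in (("ä", "a"), ("ö", "o"), ("ü", "u"), ("ß", "ss")):
        text = text.replace(src, dst)
    return text


def truncating_stem(word: str) -> str:
    """Hypothetical stand-in stemmer (assumption, not CISTEM itself):
    strips one trailing 'nd', 'en', or 'n' from words longer than 5 chars."""
    for suffix in ("nd", "en", "n"):
        if len(word) > 5 and word.endswith(suffix):
            return word[: -len(suffix)]
    return word


def abbreviation_like(
    words: Iterable[str], stem: Callable[[str], str]
) -> List[Tuple[str, str]]:
    """Return (word, stem) pairs where the stem is a proper prefix of the
    word, i.e. the stem could be read as an abbreviation of the full form."""
    flagged = []
    for word in words:
        stemmed = stem(word)
        norm_word, norm_stem = _normalize(word), _normalize(stemmed)
        if norm_stem != norm_word and norm_word.startswith(norm_stem):
            flagged.append((word, stemmed))
    return flagged


words = ["Vorbefund", "Vesikuläratmen", "Operation", "Lunge"]
flagged = abbreviation_like(words, truncating_stem)
```

With NLTK installed, `Cistem().stem` from `nltk.stem.cistem` could be passed as `stem` instead of the stand-in. As far as I know it lowercases its input and rewrites umlauts, which is why the comparison normalizes both sides first.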

Relates to #87.

michelole mentioned this issue Dec 3, 2019
michelole added the "P2 High priority issues, a COULD" label Dec 3, 2019