-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cb changes 20241030 #10974
Cb changes 20241030 #10974
Conversation
WalkthroughThe pull request introduces a significant expansion of LanguageTool's vocabulary across multiple language modules. Key changes include the addition of numerous proper nouns to Changes
Possibly related PRs
Suggested reviewers
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (9)
languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt (1)
65220-65223
: LGTM: Financial terms with proper formatting
The financial instrument terms follow proper German compound noun formation rules with correct hyphenation and capitalization. The /S and /SN suffixes are appropriately used for singular and plural forms.
Consider adding other common variants like "Hebel-Bear-Zertifikat" for completeness, as these typically come in pairs in financial contexts.
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml (1)
104447-104458
: LGTM! Consider adding examples in comments
The antipatterns effectively handle common cases that should not trigger the rule. The skip="6" flexibility in the "give talks" pattern is particularly useful for various phrasings.
Consider adding XML comments with example matches for each antipattern to improve maintainability:
<antipattern>
+ <!-- Matches possessive forms like "John's" -->
<token postag="NNP" />
<token>'s</token>
</antipattern>
<antipattern>
+ <!-- Matches compound proper nouns like "John Smith" -->
<token postag="NNP" />
<token case_sensitive="yes" regexp="yes">[A-Z].*</token>
</antipattern>
<antipattern>
+ <!-- Matches phrases like "give an interesting talk" -->
<token inflected="yes" skip="6">give</token>
<token regexp="yes">talks?</token>
</antipattern>
languagetool-language-modules/fr/src/main/resources/org/languagetool/rules/fr/remote-rule-filters.xml (2)
741-754
: LGTM! The new rule group for handling unnecessary punctuation is well-structured.
The implementation follows the existing pattern structure and correctly handles unnecessary punctuation marks at sentence endings and standalone quotation marks.
Consider enhancing the rules by:
- Adding more examples to cover different scenarios
- Including message attributes to provide specific error messages
<rulegroup id="AI_FR_GGEC_UNNECESSARY.*" name="">
<rule>
<pattern>
<token postag="SENT_END" regexp="yes">[","]</token>
</pattern>
+ <message>Remove unnecessary punctuation mark at the end of the sentence.</message>
<example correction="">Je suis Chris.<marker>"</marker></example>
+ <example correction="">L'histoire continue,<marker>,</marker></example>
</rule>
<rule>
<pattern>
<token>"</token>
</pattern>
+ <message>Remove standalone quotation mark.</message>
<example correction="">Je suis Chris.<marker>"</marker></example>
+ <example correction="">Il a dit<marker>"</marker>bonjour.</example>
</rule>
</rulegroup>
744-744
: Consider expanding the pattern to catch more cases of unnecessary punctuation.
The current pattern only catches comma and quotation marks. Consider including other punctuation marks that might be unnecessarily repeated.
- <token postag="SENT_END" regexp="yes">[","]</token>
+ <token postag="SENT_END" regexp="yes">[",";:!?]</token>
languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt (1)
3113-3114
: Consider consolidating the Sambal variants.
"Sambal Ulek" and "Sambal Oelek" refer to the same Indonesian chili paste. Consider keeping only the more common spelling variant to maintain consistency and reduce redundancy in the dictionary.
languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/remote-rule-filters.xml (1)
2453-2462
: Consider enhancing the greeting pattern coverage.
The rule implementation for handling commas after greetings is logically sound. However, consider these improvements:
- Add support for additional greeting variations like "Schönen" or "Herzlichen".
- Consider reducing the
skip="2"
to be more precise and avoid potential false positives.
Apply this diff to enhance the pattern coverage:
<rule>
<pattern>
- <token>Guten</token>
- <token skip="2">Morgen|Abend|Tag</token>
+ <token regexp="yes">Guten|Schönen|Herzlichen</token>
+ <token skip="1">Morgen|Abend|Tag</token>
<marker>
<token postag="SENT_END">,</token>
</marker>
</pattern>
<example correction="">Guten Morgen Frank<marker>,</marker></example>
+ <example correction="">Schönen Tag Herr Schmidt<marker>,</marker></example>
</rule>
languagetool-core/src/main/resources/org/languagetool/resource/spelling_global.txt (1)
26704-26710
: Consider standardizing the Brazilian league entries.
The Brazilian league is listed with three variations. While all are valid, consider standardizing to the official name to maintain consistency.
-Série A
-Campeonato Brasileiro Série A
-Brasileirão Série A
+Campeonato Brasileiro Série A
languagetool-language-modules/fr/src/main/resources/org/languagetool/resource/fr/disambiguation.xml (1)
11217-11225
: Add examples and consider refining the acronym pattern.
The rule correctly handles acronyms at sentence start, but could benefit from some improvements:
- Add example cases in XML comments for better maintainability:
<rule>
+ <!-- Examples:
+ Matches: SNCF, RATP, CNRS
+ Non-matches: ABC123, XYZabc -->
<pattern case_sensitive="yes">
- Consider making the pattern more specific to known French acronym patterns:
- <token regexp="yes" postag="SENT_END">[BCDFGHJKLMNPQRSTVWXZ]{3,6}s?</token>
+ <token regexp="yes" postag="SENT_END">(?:SNCF|RATP|CNRS|[BCDFGHJKLMNPQRSTVWXZ]{3,6})s?</token>
This would help prevent false positives while still maintaining flexibility for common French acronyms.
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml (1)
16392-16400
: Consider refining the acronym pattern to reduce false positives.
The rule correctly handles common acronym patterns at sentence start, but the current regexp [BCDFGHJKLMNPQRSTVWXZ]{3,6}s?
might be too permissive and could match invalid consonant sequences.
Consider these improvements:
- <token regexp="yes" postag="SENT_END">[BCDFGHJKLMNPQRSTVWXZ]{3,6}s?</token>
+ <token regexp="yes" postag="SENT_END">(?:UNESCO|NATO|NASA|[A-Z][BCDFGHJKLMNPQRSTVWXZ]{2,5})s?</token>
This would:
- Explicitly allow common acronyms
- Require at least one valid letter start
- Still maintain flexibility for other valid acronyms
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (14)
- languagetool-core/src/main/resources/org/languagetool/resource/spelling_global.txt (1 hunks)
- languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/disambiguation.xml (1 hunks)
- languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/ignore.txt (1 hunks)
- languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt (2 hunks)
- languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt (2 hunks)
- languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/remote-rule-filters.xml (1 hunks)
- languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml (1 hunks)
- languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/ignore.txt (1 hunks)
- languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt (1 hunks)
- languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt (1 hunks)
- languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml (1 hunks)
- languagetool-language-modules/es/src/main/resources/org/languagetool/resource/es/disambiguation.xml (1 hunks)
- languagetool-language-modules/fr/src/main/resources/org/languagetool/resource/fr/disambiguation.xml (1 hunks)
- languagetool-language-modules/fr/src/main/resources/org/languagetool/rules/fr/remote-rule-filters.xml (1 hunks)
🔥 Files not summarized due to errors (2)
- languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/disambiguation.xml: Error: Server error: no LLM provider could handle the message
- languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml: Error: Server error: no LLM provider could handle the message
✅ Files skipped from review due to trivial changes (1)
- languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/ignore.txt
🔇 Additional comments (11)
languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt (2)
62885-62889
: LGTM: Correct comparative adjective forms
The added comparative forms of "strukturiert" follow proper German adjective declension rules and cover all necessary grammatical cases.
65217-65219
: LGTM: Valid compound adjective formations
The new compound adjectives follow proper German word formation rules and are marked correctly with the /A suffix for adjective variations.
languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml (1)
104460-104462
: LGTM! Pattern changes improve precision
The addition of chunk_re attributes makes the rule more accurate by considering phrase boundaries. The pattern will now better handle complex sentence structures.
Let's verify there are no conflicting chunk patterns in other rules:
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt (2)
Line range hint 1-4
: LGTM! Clear documentation of file purpose.
The header comment effectively explains:
- The file's purpose (extending spell checker)
- Its relationship with ignore.txt
- Guidelines for handling multi-token phrases
493-495
: LGTM! Valid additions of non-hyphenated variants.
The additions of "outgroup" and "ingroup" complement the existing hyphenated forms ("out-group" and "in-group"). Both variants are widely accepted in academic and professional writing, particularly in social psychology and sociology.
languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt (1)
3091-3091
: LGTM! The additions follow the established pattern.
The new entries are correctly formatted and appropriately use the /S suffix for plural forms in German text. They represent commonly used terms in financial and culinary contexts.
Also applies to: 3112-3114
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/ignore.txt (1)
10932-10939
: LGTM! The additions are appropriate for the spellchecker ignore list.
The new entries include common technical abbreviations (MBR, QBR, EOW), UI elements (navbar), and proper nouns (Martijn). These are valid terms that should be ignored during spell checking.
languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt (2)
Line range hint 1-5
: LGTM! Clear and comprehensive header comments.
The file header provides excellent guidance on:
- The file's purpose for English disambiguation
- Clear criteria for adding terms
- Case-sensitivity rules
- Relationship with spelling.txt
8027-8027
: LGTM! Correct POS tagging for culinary term.
The term "sambal ulek" is correctly tagged as NN:U (uncountable noun) as it refers to an Indonesian chili paste that is typically used as a mass noun.
languagetool-core/src/main/resources/org/languagetool/resource/spelling_global.txt (1)
26704-26710
: LGTM for the remaining entries!
The other additions (Brandeis University, Carl Boese, Le Méridien, Albert Heijn, Tiff Macklem, Carel Fabritius) are valid proper nouns that will enhance the spell checker's vocabulary.
languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/disambiguation.xml (1)
6037-6045
: Rule looks good, consider pattern refinements.
The rule appropriately handles capitalized consonant sequences (acronyms/initialisms) at sentence start. The ignore_spelling action is suitable for this case.
A few suggestions to consider:
- The pattern could be more specific to avoid matching invalid consonant combinations
- Consider adding a test case to verify the rule behaves correctly with various inputs
Let's verify the test coverage:
@@ -62879,6 +62879,14 @@ hinaus_gleiten | |||
heraus_flutschen | |||
raus_flutschen | |||
hinaus_flutschen | |||
hinein_flutschen | |||
herin_flutschen |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix typo in compound verb formation
The word "herin_flutschen" appears to contain a typo. The correct prefix should be "herein" instead of "herin".
-herin_flutschen
+herein_flutschen
Committable suggestion was skipped due to low confidence.
@@ -104442,12 +104442,24 @@ USA | |||
<url>https://dictionary.cambridge.org/grammar/british-grammar/prefer</url> | |||
<example correction='prefers overalls to skirts'>She much <marker>prefers overalls instead of skirts</marker>, because she enjoys playing in the mud</example> | |||
</rule> | |||
<rule id='TALK_AT_TO' name='Collocation: Talk at/to'> | |||
<rule id='TALK_AT_TO' name='Collocation: Talk at/to' default="off"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codebase verification
Setting default="off" is a new change that requires attention
The git history shows this is a recent change where default="off"
was explicitly added to the rule. This modification makes the "TALK_AT_TO" rule opt-in rather than enabled by default, which is a significant behavioral change that could affect users' grammar checking experience.
- The rule was previously enabled by default since its creation in 2017
- The change coincides with the addition of new antipatterns for proper nouns
- This suggests the rule might have had false positives, leading to it being disabled by default
🔗 Analysis chain
Verify the impact of setting default="off"
Setting the rule to disabled by default is a significant change that affects all users. While this can help prevent false positives, we should ensure this aligns with user expectations.
Let me search for any previous versions or history of this rule to verify if this is a change in default state.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if this rule was previously enabled by default
rg -B 5 -A 5 'id=.TALK_AT_TO.' --glob '*.xml' | grep -B 5 -A 5 'default='
Length of output: 1661
Script:
#!/bin/bash
# Search git history for changes to this rule
git log -p | grep -A 10 -B 10 'TALK_AT_TO.*default='
Length of output: 1242
<rule> | ||
<pattern case_sensitive="yes"> | ||
<token postag="SENT_START" /> | ||
<marker> | ||
<token regexp="yes" postag="SENT_END">[BCDFGHJKLMNPQRSTVWXZ]{3,6}s?</token> | ||
</marker> | ||
</pattern> | ||
<disambig action="ignore_spelling"/> | ||
</rule> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Consider enhancing the abbreviation rule pattern.
The rule for handling uppercase abbreviations at sentence end is good, but could be improved:
- The pattern could be more specific to avoid false positives
- Documentation would help explain the rule's purpose
Consider this enhanced version:
- <rule>
- <pattern case_sensitive="yes">
- <token postag="SENT_START" />
- <marker>
- <token regexp="yes" postag="SENT_END">[BCDFGHJKLMNPQRSTVWXZ]{3,6}s?</token>
- </marker>
- </pattern>
- <disambig action="ignore_spelling"/>
- </rule>
+ <!-- Rule to handle uppercase abbreviations at sentence end (e.g., "NASA.", "UNICEF.") -->
+ <rule id="uppercase_abbrev_sent_end" name="uppercase abbreviations at sentence end">
+ <pattern case_sensitive="yes">
+ <token postag="SENT_START" />
+ <marker>
+ <!-- Match 3-6 consonant abbreviations with optional plural 's' -->
+ <token regexp="yes" postag="SENT_END">(?:[BCDFGHJKLMNPQRSTVWXZ]{3,6})s?</token>
+ </marker>
+ </pattern>
+ <disambig action="ignore_spelling"/>
+ </rule>
The improvements include:
- Added descriptive comment explaining the rule's purpose
- Added rule ID and name for better maintainability
- Enhanced regexp pattern with non-capturing group
- Added inline comment explaining the pattern
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
<rule> | |
<pattern case_sensitive="yes"> | |
<token postag="SENT_START" /> | |
<marker> | |
<token regexp="yes" postag="SENT_END">[BCDFGHJKLMNPQRSTVWXZ]{3,6}s?</token> | |
</marker> | |
</pattern> | |
<disambig action="ignore_spelling"/> | |
</rule> | |
<!-- Rule to handle uppercase abbreviations at sentence end (e.g., "NASA.", "UNICEF.") --> | |
<rule id="uppercase_abbrev_sent_end" name="uppercase abbreviations at sentence end"> | |
<pattern case_sensitive="yes"> | |
<token postag="SENT_START" /> | |
<marker> | |
<!-- Match 3-6 consonant abbreviations with optional plural 's' --> | |
<token regexp="yes" postag="SENT_END">(?:[BCDFGHJKLMNPQRSTVWXZ]{3,6})s?</token> | |
</marker> | |
</pattern> | |
<disambig action="ignore_spelling"/> | |
</rule> |
Carl Boese | ||
Série A | ||
Campeonato Brasileiro Série A | ||
Brasileirão Série A | ||
Le Méridien | ||
Albert Heijn | ||
Tiff Macklem |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codebase verification
Redundant name entries confirmed - consolidation needed
Found duplicate entries in spelling_global.txt
:
- "Thomas Chatterton Williams" and "Thomas Chatterton"
- "Frank Ramsey" and "Frank P. Ramsey"
Please keep only the full name versions:
- "Thomas Chatterton Williams"
- "Frank P. Ramsey"
This will maintain consistency while preserving the most complete form of each name.
🔗 Analysis chain
Review redundant name entries.
Some names have multiple variations that might need consolidation:
- "Thomas Chatterton Williams" and "Thomas Chatterton"
- "Frank Ramsey" and "Frank P. Ramsey"
Consider keeping only the most commonly used or full versions to avoid potential confusion in spell checking.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for other instances of these names in the codebase
# to ensure consistent usage
echo "Checking for existing name variations..."
rg "Thomas Chatterton|Frank.*Ramsey" --type txt
Length of output: 546
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/remote-rule-filters.xml
(1 hunks)
<rule> | ||
<pattern> | ||
<token>Guten</token> | ||
<token skip="2" regexp="yes">Morgen|Abend|Tag</token> | ||
<marker> | ||
<token postag="SENT_END">,</token> | ||
</marker> | ||
</pattern> | ||
<example correction="">Guten Morgen Frank<marker>,</marker></example> | ||
</rule> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codebase verification
Overlapping rules need to be consolidated
The new rule in remote-rule-filters.xml
conflicts with existing rules that handle greeting punctuation:
GUTEN_TAG_KOMMA
ingrammar.xml
is active and handles comma placement after greetings when followed by questions/phrasesANREDE_KOMMA_2
inde-DE-AT/grammar.xml
appears to be commented out (ends with-->
)
The new rule should be removed or consolidated with GUTEN_TAG_KOMMA
because:
- Both rules target the same greeting patterns ("Guten Tag/Morgen/Abend")
GUTEN_TAG_KOMMA
is more comprehensive, handling various follow-up phrases- Having multiple rules for the same pattern can lead to inconsistent corrections
🔗 Analysis chain
LGTM: New rule for greeting punctuation looks good.
The rule correctly handles comma placement after formal German greetings ("Guten Morgen", "Guten Abend", "Guten Tag"). The pattern matching is well-structured and the example demonstrates proper usage.
Let's verify there are no conflicting rules:
Potential conflict with existing greeting punctuation rules detected
The search results reveal two potentially conflicting rules in the German grammar files:
ANREDE_KOMMA_2
rule inde-DE-AT/grammar.xml
specifically handles missing commas after greeting formulasGUTEN_TAG_KOMMA
rule ingrammar.xml
handles comma placement after "Guten Tag" followed by "wie gehts"
Let's verify if these rules are actually active and how they interact with our new rule:
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for any existing rules that might conflict with the new greeting punctuation rule
# Look for patterns containing "Guten" and greeting words
rg -A 5 'Guten\s+(Morgen|Abend|Tag)' --type xml
Length of output: 9586
Script:
#!/bin/bash
# Check if the potentially conflicting rules are commented out and get their full context
rg -B 2 -A 10 'id="(ANREDE_KOMMA_2|GUTEN_TAG_KOMMA)"' --type xml
Length of output: 3893
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
These enhancements aim to improve the overall user experience by providing more accurate language processing and spell-checking capabilities.