Cb changes 20241030 #10974

Merged
merged 8 commits into from
Oct 30, 2024

@tiff tiff commented Oct 30, 2024

Summary by CodeRabbit

Release Notes

  • New Features

    • Expanded vocabulary for German and English spell-checking, including proper nouns, compound words, and modern vernacular.
    • New rules added for improved punctuation handling in greetings for German and unnecessary punctuation in French.
    • New entries added to the ignore lists for both German and English, covering a variety of terms and acronyms.
    • Enhanced multi-token suggestions for German and English to support more phrases.
  • Bug Fixes

    • Enhanced disambiguation rules for Spanish and French to improve accuracy in grammar checking.

These enhancements aim to improve the overall user experience by providing more accurate language processing and spell-checking capabilities.


coderabbitai bot commented Oct 30, 2024

Walkthrough

The pull request introduces a significant expansion of LanguageTool's vocabulary across multiple language modules. Key changes include the addition of numerous proper nouns to spelling_global.txt, updates to ignore lists and spelling dictionaries in both German and English, and the introduction of new rules in the disambiguation XML files for Spanish and French. These modifications aim to enhance the tool's ability to recognize and process a broader range of terms and linguistic constructs without altering existing structures.

Changes

File path: change summary

  • languagetool-core/src/main/resources/org/languagetool/resource/spelling_global.txt: Added a large number of proper nouns related to various topics, including many Olympic entries.
  • languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/ignore.txt: Added two new entries: IZF and Martijn/S #name.
  • languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt: Added multiple new compound words and variations related to finance and other terms.
  • languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt: Added new multi-token suggestions: Fair Market Value/S, Bami Goreng/S, Sambal Ulek/S, Sambal Oelek/S.
  • languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/remote-rule-filters.xml: Introduced a new rule for handling greetings followed by a comma.
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/ignore.txt: Added numerous entries, including MBR, QBR, navbar, and Martijn.
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt: Added new words and phrases, including outgroup and ingroup, and various scientific terms.
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt: Added various multi-word terms with corresponding part-of-speech tags, including sambal ulek.
  • languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml: Modified the TALK_AT_TO rule, adding antipatterns and changing its default state to "off".
  • languagetool-language-modules/es/src/main/resources/org/languagetool/resource/es/disambiguation.xml: Added multiple new rules and modified existing ones to improve disambiguation in Spanish.
  • languagetool-language-modules/fr/src/main/resources/org/languagetool/resource/fr/disambiguation.xml: Added new rules for handling nominal groups and ambiguous cases, and removed some unnecessary rules.
  • languagetool-language-modules/fr/src/main/resources/org/languagetool/rules/fr/remote-rule-filters.xml: Introduced a new rule group for unnecessary punctuation handling.

Possibly related PRs

  • Cb changes 20240830 #10856: This PR adds numerous proper nouns to spelling_global.txt, similar to the main PR's focus on expanding proper nouns in the same file.
  • Cb changes 20240905 #10865: Similar to the main PR, this PR also adds new entries to spelling_global.txt, enhancing the vocabulary for spell-checking.
  • [de] add words + gGEC AP + removed currencies #10961: This PR adds new entries to ignore.txt, which may relate to the overall expansion of recognized terms in the language tool, aligning with the main PR's goal of enhancing the knowledge base.

Suggested reviewers

  • jaumeortola
  • p-goulart
  • susanaboatto



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (9)
languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt (1)

65220-65223: LGTM: Financial terms with proper formatting

The financial instrument terms follow proper German compound noun formation rules with correct hyphenation and capitalization. The /S and /SN suffixes are appropriately used for singular and plural forms.

Consider adding other common variants like "Hebel-Bear-Zertifikat" for completeness, as these typically come in pairs in financial contexts.

languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml (1)

104447-104458: LGTM! Consider adding examples in comments

The antipatterns effectively handle common cases that should not trigger the rule. The skip="6" flexibility in the "give talks" pattern is particularly useful for various phrasings.

Consider adding XML comments with example matches for each antipattern to improve maintainability:

      <antipattern>
+       <!-- Matches possessive forms like "John's" -->
        <token postag="NNP" />
        <token>'s</token>
      </antipattern>
      <antipattern>
+       <!-- Matches compound proper nouns like "John Smith" -->
        <token postag="NNP" />
        <token case_sensitive="yes" regexp="yes">[A-Z].*</token>
      </antipattern>
      <antipattern>
+       <!-- Matches phrases like "give an interesting talk" -->
        <token inflected="yes" skip="6">give</token>
        <token regexp="yes">talks?</token>
      </antipattern>
languagetool-language-modules/fr/src/main/resources/org/languagetool/rules/fr/remote-rule-filters.xml (2)

741-754: LGTM! The new rule group for handling unnecessary punctuation is well-structured.

The implementation follows the existing pattern structure and correctly handles unnecessary punctuation marks at sentence endings and standalone quotation marks.

Consider enhancing the rules by:

  1. Adding more examples to cover different scenarios
  2. Including message attributes to provide specific error messages
 <rulegroup id="AI_FR_GGEC_UNNECESSARY.*" name="">
     <rule>
         <pattern>
             <token postag="SENT_END" regexp="yes">[","]</token>
         </pattern>
+        <message>Remove unnecessary punctuation mark at the end of the sentence.</message>
         <example correction="">Je suis Chris.<marker>"</marker></example>
+        <example correction="">L'histoire continue,<marker>,</marker></example>
     </rule>
     <rule>
         <pattern>
             <token>"</token>
         </pattern>
+        <message>Remove standalone quotation mark.</message>
         <example correction="">Je suis Chris.<marker>"</marker></example>
+        <example correction="">Il a dit<marker>"</marker>bonjour.</example>
     </rule>
 </rulegroup>

744-744: Consider expanding the pattern to catch more cases of unnecessary punctuation.

The current pattern only catches comma and quotation marks. Consider including other punctuation marks that might be unnecessarily repeated.

-    <token postag="SENT_END" regexp="yes">[","]</token>
+    <token postag="SENT_END" regexp="yes">[",";:!?]</token>
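A repeated character inside a regex character class is redundant, so `[","]` accepts the same two single-character tokens as `[",]`. The sketch below illustrates this in Python. It assumes that LanguageTool's Java regexes behave like Python's `re` for classes this simple and that a token regexp must match the whole token. `EXPANDED` is a cleaned-up form of the wider class proposed above, and `accepted` is a hypothetical helper.

```python
import re

CURRENT = re.compile(r'[","]')      # class as written in the rule
MINIMAL = re.compile(r'[",]')       # same class without the repeated quote
EXPANDED = re.compile(r'[",;:!?]')  # wider class, as proposed in the comment

def accepted(pattern, tokens):
    """Return the tokens that the pattern matches in full."""
    return [t for t in tokens if pattern.fullmatch(t)]

tokens = ['"', ',', ';', ':', '!', '?', '.', '",']
print(accepted(CURRENT, tokens))
print(accepted(MINIMAL, tokens))
print(accepted(EXPANDED, tokens))
```

The first two calls return the same list, which shows that the duplicated quote has no effect on matching.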
languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt (1)

3113-3114: Consider consolidating the Sambal variants.

"Sambal Ulek" and "Sambal Oelek" refer to the same Indonesian chili paste. Consider keeping only the more common spelling variant to maintain consistency and reduce redundancy in the dictionary.

languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/remote-rule-filters.xml (1)

2453-2462: Consider enhancing the greeting pattern coverage.

The rule implementation for handling commas after greetings is logically sound. However, consider these improvements:

  1. Add support for additional greeting variations like "Schönen" or "Herzlichen".
  2. Consider reducing the skip="2" to be more precise and avoid potential false positives.

Apply this diff to enhance the pattern coverage:

             <rule>
                 <pattern>
-                    <token>Guten</token>
-                    <token skip="2">Morgen|Abend|Tag</token>
+                    <token regexp="yes">Guten|Schönen|Herzlichen</token>
+                    <token skip="1" regexp="yes">Morgen|Abend|Tag</token>
                     <marker>
                         <token postag="SENT_END">,</token>
                     </marker>
                 </pattern>
                 <example correction="">Guten Morgen Frank<marker>,</marker></example>
+                <example correction="">Schönen Tag Frank<marker>,</marker></example>
             </rule>
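To see what the `skip` value changes, here is a simplified Python model of token matching with a bounded gap. It is only a sketch: it assumes `skip="n"` means "up to n tokens may sit between this token and the next matched one", it ignores POS tags such as SENT_END, and the names `matches` and `greeting` are hypothetical, not part of LanguageTool.

```python
import re

def matches(tokens, pattern):
    """Simplified model of a token pattern with per-token skip limits.

    `pattern` is a list of (regex, skip) pairs. After an element matches,
    up to `skip` tokens may be passed over before the next element matches.
    POS tags such as SENT_END are ignored in this model.
    """
    def step(tok_idx, pat_idx, allowed_gap):
        if pat_idx == len(pattern):
            return True  # every pattern element found a token
        regex, skip = pattern[pat_idx]
        for gap in range(allowed_gap + 1):
            i = tok_idx + gap
            if i >= len(tokens):
                return False
            if re.fullmatch(regex, tokens[i]) and step(i + 1, pat_idx + 1, skip):
                return True
        return False

    # The first element may start at any token; no gap is allowed before it.
    return any(step(start, 0, 0) for start in range(len(tokens)))

def greeting(skip):
    """Hypothetical helper: the greeting pattern with a configurable skip."""
    return [("Guten", 0), ("Morgen|Abend|Tag", skip), (",", 0)]

one_word_name = ["Guten", "Morgen", "Frank", ","]
two_word_name = ["Guten", "Tag", "Herr", "Schmidt", ","]

print(matches(one_word_name, greeting(1)))
print(matches(two_word_name, greeting(1)))
print(matches(two_word_name, greeting(2)))
```

Under this model a one-word name fits with either value, while a two-word form of address such as "Herr Schmidt" only fits between the greeting and the comma when the skip is at least 2.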
languagetool-core/src/main/resources/org/languagetool/resource/spelling_global.txt (1)

26704-26710: Consider standardizing the Brazilian league entries.

The Brazilian league is listed with three variations. While all are valid, consider standardizing to the official name to maintain consistency.

-Série A
-Campeonato Brasileiro Série A
-Brasileirão Série A
+Campeonato Brasileiro Série A
languagetool-language-modules/fr/src/main/resources/org/languagetool/resource/fr/disambiguation.xml (1)

11217-11225: Add examples and consider refining the acronym pattern.

The rule correctly handles acronyms at sentence start, but could benefit from some improvements:

  1. Add example cases in XML comments for better maintainability:
 <rule>
+    <!-- Examples:
+         Matches: SNCF, CNRS, TGV
+         Non-matches: RATP (contains a vowel), ABC123, XYZabc -->
     <pattern case_sensitive="yes">
  2. Consider making the pattern more specific to known French acronym patterns:
-    <token regexp="yes" postag="SENT_END">[BCDFGHJKLMNPQRSTVWXZ]{3,6}s?</token>
+    <token regexp="yes" postag="SENT_END">(?:SNCF|RATP|CNRS|[BCDFGHJKLMNPQRSTVWXZ]{3,6})s?</token>

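This would explicitly cover common French acronyms that contain vowels, such as RATP, while keeping the generic consonant-only match. Note that the change widens the rule rather than narrowing it, so it does not reduce false positives on its own.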
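The two patterns can be compared on a few sample tokens with a short Python check. This is a sketch that assumes the token regexp must match the whole token and that Python's `re` agrees with Java's regex engine for these constructs. `check` is a hypothetical helper.

```python
import re

CONSONANTS = "BCDFGHJKLMNPQRSTVWXZ"  # class used by the rule; note it omits Y
CURRENT = re.compile(rf"[{CONSONANTS}]{{3,6}}s?")
SUGGESTED = re.compile(rf"(?:SNCF|RATP|CNRS|[{CONSONANTS}]{{3,6}})s?")

def check(word):
    """Return (matched by current pattern, matched by suggested pattern)."""
    return bool(CURRENT.fullmatch(word)), bool(SUGGESTED.fullmatch(word))

for word in ["SNCF", "CNRS", "RATP", "ABC123", "XYZabc"]:
    print(word, check(word))
```

The consonant-only class already accepts SNCF and CNRS; RATP is only accepted once it is listed explicitly, because it contains a vowel.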

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml (1)

16392-16400: Consider refining the acronym pattern to reduce false positives.

The rule correctly handles common acronym patterns at sentence start, but the current regexp [BCDFGHJKLMNPQRSTVWXZ]{3,6}s? might be too permissive and could match invalid consonant sequences.

Consider these improvements:

-                    <token regexp="yes" postag="SENT_END">[BCDFGHJKLMNPQRSTVWXZ]{3,6}s?</token>
+                    <token regexp="yes" postag="SENT_END">(?:UNESCO|NATO|NASA|[A-Z][BCDFGHJKLMNPQRSTVWXZ]{2,5})s?</token>

This would:

  1. Explicitly allow common acronyms
  2. Allow any uppercase first letter, including a vowel
  3. Still maintain flexibility for other valid acronyms
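Before adopting the suggested pattern, it may be worth checking how its match set relates to the current one. The Python sketch below does this for three-letter strings and a small word list. It assumes whole-token matching and `re` semantics equivalent to Java's for these constructs. `lost_matches` and `newly_matched` are hypothetical helpers.

```python
import itertools
import re

CONSONANTS = "BCDFGHJKLMNPQRSTVWXZ"
CURRENT = re.compile(rf"[{CONSONANTS}]{{3,6}}s?")
SUGGESTED = re.compile(rf"(?:UNESCO|NATO|NASA|[A-Z][{CONSONANTS}]{{2,5}})s?")

def lost_matches(length=3):
    """Consonant strings that CURRENT accepts but SUGGESTED rejects."""
    candidates = ("".join(p) for p in itertools.product(CONSONANTS, repeat=length))
    return [w for w in candidates
            if CURRENT.fullmatch(w) and not SUGGESTED.fullmatch(w)]

def newly_matched(words):
    """Words that only the suggested pattern accepts."""
    return [w for w in words
            if SUGGESTED.fullmatch(w) and not CURRENT.fullmatch(w)]

print(lost_matches(3))
print(newly_matched(["NATO", "NASA", "UNESCO", "ABC", "EHR", "HTML", "Hello"]))
```

In this check nothing is lost and several vowel-initial tokens are gained, so the suggested pattern is more permissive than the current one; whether that is desirable depends on how many of the new matches are real acronyms.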
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 42f2c0e and 2830bf9.

📒 Files selected for processing (14)
  • languagetool-core/src/main/resources/org/languagetool/resource/spelling_global.txt (1 hunks)
  • languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/disambiguation.xml (1 hunks)
  • languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/ignore.txt (1 hunks)
  • languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt (2 hunks)
  • languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt (2 hunks)
  • languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/remote-rule-filters.xml (1 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml (1 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/ignore.txt (1 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt (1 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt (1 hunks)
  • languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml (1 hunks)
  • languagetool-language-modules/es/src/main/resources/org/languagetool/resource/es/disambiguation.xml (1 hunks)
  • languagetool-language-modules/fr/src/main/resources/org/languagetool/resource/fr/disambiguation.xml (1 hunks)
  • languagetool-language-modules/fr/src/main/resources/org/languagetool/rules/fr/remote-rule-filters.xml (1 hunks)
🔥 Files not summarized due to errors (2)
  • languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/disambiguation.xml: Error: Server error: no LLM provider could handle the message
  • languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/disambiguation.xml: Error: Server error: no LLM provider could handle the message
✅ Files skipped from review due to trivial changes (1)
  • languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/ignore.txt
🔇 Additional comments (11)
languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/spelling.txt (2)

62885-62889: LGTM: Correct comparative adjective forms

The added comparative forms of "strukturiert" follow proper German adjective declension rules and cover all necessary grammatical cases.


65217-65219: LGTM: Valid compound adjective formations

The new compound adjectives follow proper German word formation rules and are marked correctly with the /A suffix for adjective variations.

languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml (1)

104460-104462: LGTM! Pattern changes improve precision

The addition of chunk_re attributes makes the rule more accurate by considering phrase boundaries. The pattern will now better handle complex sentence structures.

Let's verify there are no conflicting chunk patterns in other rules:

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt (2)

Line range hint 1-4: LGTM! Clear documentation of file purpose.

The header comment effectively explains:

  • The file's purpose (extending spell checker)
  • Its relationship with ignore.txt
  • Guidelines for handling multi-token phrases

493-495: LGTM! Valid additions of non-hyphenated variants.

The additions of "outgroup" and "ingroup" complement the existing hyphenated forms ("out-group" and "in-group"). Both variants are widely accepted in academic and professional writing, particularly in social psychology and sociology.

languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/multitoken-suggest.txt (1)

3091-3091: LGTM! The additions follow the established pattern.

The new entries are correctly formatted and appropriately use the /S suffix for plural forms in German text. They represent commonly used terms in financial and culinary contexts.

Also applies to: 3112-3114

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/ignore.txt (1)

10932-10939: LGTM! The additions are appropriate for the spellchecker ignore list.

The new entries include common technical abbreviations (MBR, QBR, EOW), UI elements (navbar), and proper nouns (Martijn). These are valid terms that should be ignored during spell checking.

languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/multiwords.txt (2)

Line range hint 1-5: LGTM! Clear and comprehensive header comments.

The file header provides excellent guidance on:

  1. The file's purpose for English disambiguation
  2. Clear criteria for adding terms
  3. Case-sensitivity rules
  4. Relationship with spelling.txt

8027-8027: LGTM! Correct POS tagging for culinary term.

The term "sambal ulek" is correctly tagged as NN:U (uncountable noun) as it refers to an Indonesian chili paste that is typically used as a mass noun.

languagetool-core/src/main/resources/org/languagetool/resource/spelling_global.txt (1)

26704-26710: LGTM for the remaining entries!

The other additions (Brandeis University, Carl Boese, Le Méridien, Albert Heijn, Tiff Macklem, Carel Fabritius) are valid proper nouns that will enhance the spell checker's vocabulary.

languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/disambiguation.xml (1)

6037-6045: Rule looks good, consider pattern refinements.

The rule appropriately handles capitalized consonant sequences (acronyms/initialisms) at sentence start. The ignore_spelling action is suitable for this case.

A few suggestions to consider:

  1. The pattern could be more specific to avoid matching invalid consonant combinations
  2. Consider adding a test case to verify the rule behaves correctly with various inputs

Let's verify the test coverage:

@@ -62879,6 +62879,14 @@ hinaus_gleiten
heraus_flutschen
raus_flutschen
hinaus_flutschen
hinein_flutschen
herin_flutschen

⚠️ Potential issue

Fix typo in compound verb formation

The word "herin_flutschen" appears to contain a typo. The correct prefix should be "herein" instead of "herin".

-herin_flutschen
+herein_flutschen

Committable suggestion was skipped due to low confidence.

@@ -104442,12 +104442,24 @@ USA
<url>https://dictionary.cambridge.org/grammar/british-grammar/prefer</url>
<example correction='prefers overalls to skirts'>She much <marker>prefers overalls instead of skirts</marker>, because she enjoys playing in the mud</example>
</rule>
-<rule id='TALK_AT_TO' name='Collocation: Talk at/to'>
+<rule id='TALK_AT_TO' name='Collocation: Talk at/to' default="off">

💡 Codebase verification

Setting default="off" is a new change that requires attention

The git history shows this is a recent change where default="off" was explicitly added to the rule. This modification makes the "TALK_AT_TO" rule opt-in rather than enabled by default, which is a significant behavioral change that could affect users' grammar checking experience.

  • The rule had been enabled by default since its creation in 2017
  • The change coincides with the addition of new antipatterns for proper nouns
  • This suggests the rule might have had false positives, leading to it being disabled by default
🔗 Analysis chain

Verify the impact of setting default="off"

Setting the rule to disabled by default is a significant change that affects all users. While this can help prevent false positives, we should ensure this aligns with user expectations.


Let me search for any previous versions or history of this rule to verify if this is a change in default state.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if this rule was previously enabled by default
rg -B 5 -A 5 'id=.TALK_AT_TO.' --glob '*.xml' | grep -B 5 -A 5 'default='

Length of output: 1661


Script:

#!/bin/bash
# Search git history for changes to this rule
git log -p | grep -A 10 -B 10 'TALK_AT_TO.*default='

Length of output: 1242

Comment on lines +8201 to +8209
<rule>
<pattern case_sensitive="yes">
<token postag="SENT_START" />
<marker>
<token regexp="yes" postag="SENT_END">[BCDFGHJKLMNPQRSTVWXZ]{3,6}s?</token>
</marker>
</pattern>
<disambig action="ignore_spelling"/>
</rule>

🛠️ Refactor suggestion

Consider enhancing the abbreviation rule pattern.

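The rule, which skips spell-checking for all-consonant uppercase abbreviations that make up a whole sentence (the token follows SENT_START and is itself tagged SENT_END), is good but could be improved: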

  1. The pattern could be more specific to avoid false positives
  2. Documentation would help explain the rule's purpose

Consider this enhanced version:

-        <rule>
-            <pattern case_sensitive="yes">
-                <token postag="SENT_START" />
-                <marker>
-                    <token regexp="yes" postag="SENT_END">[BCDFGHJKLMNPQRSTVWXZ]{3,6}s?</token>
-                </marker>
-            </pattern>
-            <disambig action="ignore_spelling"/>
-        </rule>
+        <!-- Rule to handle uppercase, consonant-only abbreviations at sentence end (e.g., "HTML", "PDFs") -->
+        <rule id="uppercase_abbrev_sent_end" name="uppercase abbreviations at sentence end">
+            <pattern case_sensitive="yes">
+                <token postag="SENT_START" />
+                <marker>
+                    <!-- Match 3-6 consonant abbreviations with optional plural 's' -->
+                    <token regexp="yes" postag="SENT_END">(?:[BCDFGHJKLMNPQRSTVWXZ]{3,6})s?</token>
+                </marker>
+            </pattern>
+            <disambig action="ignore_spelling"/>
+        </rule>

The improvements include:

  • Added descriptive comment explaining the rule's purpose
  • Added rule ID and name for better maintainability
  • Wrapped the regexp pattern in a non-capturing group (matching behavior is unchanged)
  • Added inline comment explaining the pattern
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-<rule>
-<pattern case_sensitive="yes">
-<token postag="SENT_START" />
-<marker>
-<token regexp="yes" postag="SENT_END">[BCDFGHJKLMNPQRSTVWXZ]{3,6}s?</token>
-</marker>
-</pattern>
-<disambig action="ignore_spelling"/>
-</rule>
+<!-- Rule to handle uppercase, consonant-only abbreviations at sentence end (e.g., "HTML", "PDFs") -->
+<rule id="uppercase_abbrev_sent_end" name="uppercase abbreviations at sentence end">
+<pattern case_sensitive="yes">
+<token postag="SENT_START" />
+<marker>
+<!-- Match 3-6 consonant abbreviations with optional plural 's' -->
+<token regexp="yes" postag="SENT_END">(?:[BCDFGHJKLMNPQRSTVWXZ]{3,6})s?</token>
+</marker>
+</pattern>
+<disambig action="ignore_spelling"/>
+</rule>

Comment on lines +26704 to +26710
Carl Boese
Série A
Campeonato Brasileiro Série A
Brasileirão Série A
Le Méridien
Albert Heijn
Tiff Macklem

💡 Codebase verification

Redundant name entries confirmed - consolidation needed

Found duplicate entries in spelling_global.txt:

  • "Thomas Chatterton Williams" and "Thomas Chatterton"
  • "Frank Ramsey" and "Frank P. Ramsey"

Please keep only the full name versions:

  • "Thomas Chatterton Williams"
  • "Frank P. Ramsey"

This will maintain consistency while preserving the most complete form of each name.

🔗 Analysis chain

Review redundant name entries.

Some names have multiple variations that might need consolidation:

  • "Thomas Chatterton Williams" and "Thomas Chatterton"
  • "Frank Ramsey" and "Frank P. Ramsey"

Consider keeping only the most commonly used or full versions to avoid potential confusion in spell checking.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for other instances of these names in the codebase
# to ensure consistent usage

echo "Checking for existing name variations..."
rg "Thomas Chatterton|Frank.*Ramsey" --type txt

Length of output: 546


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between a08d58b and 93ba5ee.

📒 Files selected for processing (1)
  • languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/remote-rule-filters.xml (1 hunks)

Comment on lines +2453 to +2462
<rule>
<pattern>
<token>Guten</token>
<token skip="2" regexp="yes">Morgen|Abend|Tag</token>
<marker>
<token postag="SENT_END">,</token>
</marker>
</pattern>
<example correction="">Guten Morgen Frank<marker>,</marker></example>
</rule>

💡 Codebase verification

Overlapping rules need to be consolidated

The new rule in remote-rule-filters.xml conflicts with existing rules that handle greeting punctuation:

  • GUTEN_TAG_KOMMA in grammar.xml is active and handles comma placement after greetings when followed by questions/phrases
  • ANREDE_KOMMA_2 in de-DE-AT/grammar.xml appears to be commented out (ends with -->)

The new rule should be removed or consolidated with GUTEN_TAG_KOMMA because:

  • Both rules target the same greeting patterns ("Guten Tag/Morgen/Abend")
  • GUTEN_TAG_KOMMA is more comprehensive, handling various follow-up phrases
  • Having multiple rules for the same pattern can lead to inconsistent corrections
🔗 Analysis chain

LGTM: New rule for greeting punctuation looks good.

The rule correctly handles comma placement after formal German greetings ("Guten Morgen", "Guten Abend", "Guten Tag"). The pattern matching is well-structured and the example demonstrates proper usage.

Let's verify there are no conflicting rules:


Potential conflict with existing greeting punctuation rules detected

The search results reveal two potentially conflicting rules in the German grammar files:

  • ANREDE_KOMMA_2 rule in de-DE-AT/grammar.xml specifically handles missing commas after greeting formulas
  • GUTEN_TAG_KOMMA rule in grammar.xml handles comma placement after "Guten Tag" followed by "wie gehts"

Let's verify if these rules are actually active and how they interact with our new rule:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any existing rules that might conflict with the new greeting punctuation rule
# Look for patterns containing "Guten" and greeting words

rg -A 5 'Guten\s+(Morgen|Abend|Tag)' --type xml

Length of output: 9586


Script:

#!/bin/bash
# Check if the potentially conflicting rules are commented out and get their full context
rg -B 2 -A 10 'id="(ANREDE_KOMMA_2|GUTEN_TAG_KOMMA)"' --type xml

Length of output: 3893

@tiff tiff merged commit 0ac604f into master Oct 30, 2024
5 checks passed
@coderabbitai coderabbitai bot mentioned this pull request Nov 5, 2024