Jdk19 regexp fix #10972
Walkthrough
This pull request introduces several modifications across various classes in the LanguageTool project, primarily focusing on enhancing regular expression (regex) handling by incorporating the …
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (9)
languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/grammar.xml (1)
Line range hint 79409-79541
: Document whitespace handling strategy.
Given that these changes are part of a larger JDK19 regexp fix, consider adding a comment in the XML file explaining the rationale behind using [ \t]? instead of \s? for abbreviation patterns. This will help maintain consistency as new rules are added.
Add a comment at the beginning of the abbreviation rules section:
<!--
Abbreviation rules use explicit space/tab matching ([ \t]?) instead of \s?
to ensure consistent behavior across different JDK versions and to prevent
matching other whitespace characters.
-->
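To make the distinction concrete, here is a minimal sketch of how \s? and [ \t]? can diverge once Pattern.UNICODE_CHARACTER_CLASS is set. The class name, the helper method, and the pattern strings are illustrative assumptions modelled loosely on the d.h. rule; they are not taken from grammar.xml or from this PR.

```java
import java.util.regex.Pattern;

// Illustrative sketch only: class, method, and pattern strings are assumptions.
class WhitespaceClassDemo {
    // Abbreviation patterns modelled loosely on the "d.h." rule.
    static final String WITH_BACKSLASH_S = "d\\.\\s?h\\.";
    static final String WITH_SPACE_OR_TAB = "d\\.[ \\t]?h\\.";

    // True when the whole input matches the regex compiled with the given flags.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String withNbsp = "d.\u00A0h."; // NO-BREAK SPACE between the two parts
        int unicode = Pattern.UNICODE_CHARACTER_CLASS;
        System.out.println(matches(WITH_BACKSLASH_S, 0, withNbsp));        // default \s is ASCII-only
        System.out.println(matches(WITH_BACKSLASH_S, unicode, withNbsp));  // Unicode \s covers NBSP
        System.out.println(matches(WITH_SPACE_OR_TAB, unicode, withNbsp)); // explicit class ignores the flag
        System.out.println(matches(WITH_SPACE_OR_TAB, unicode, "d. h."));  // plain space still matches
    }
}
```

Under these assumptions, the explicit [ \t] class behaves the same with or without the flag, whereas \s widens to all Unicode whitespace when the flag is set.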
languagetool-core/src/main/java/org/languagetool/rules/patterns/RegexAntiPatternFilter.java (1)
45-45
: Consider adding Unicode test cases.
Since this change affects Unicode character handling, it would be beneficial to add test cases that specifically verify the behavior with non-ASCII text in antipatterns.
Would you like me to help create test cases that cover Unicode scenarios for the RegexAntiPatternFilter?
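As a starting point for such tests, the following sketch shows the kind of behaviour difference they would need to pin down. It does not use RegexAntiPatternFilter itself; the class and helper names are hypothetical, and only java.util.regex is exercised.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: not part of RegexAntiPatternFilter.
class WordClassDemo {
    // Returns the first run of word characters, or null when there is none.
    static String firstWord(String input, int flags) {
        Matcher m = Pattern.compile("\\w+", flags).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "\u00fcber"; // the German word "ueber" written with u-umlaut
        // Without the flag, the umlaut is not a word character, so the match starts at 'b'.
        System.out.println(firstWord(text, 0));
        // With the flag, the whole word is one run of word characters.
        System.out.println(firstWord(text, Pattern.UNICODE_CHARACTER_CLASS));
    }
}
```

An antipattern that relies on \w or \b around accented letters could therefore match a different span after the flag is added, which is what a regression test would want to cover.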
languagetool-language-modules/es/src/main/java/org/languagetool/tokenizers/es/SpanishWordTokenizer.java (1)
46-46
: Consider adding test cases for non-ASCII digits.
To ensure the new Unicode digit matching behavior works correctly, consider adding test cases that include ordinal numbers with non-ASCII digits (e.g., Eastern Arabic numerals, Devanagari digits).
Example test cases to consider:
assertEquals(Arrays.asList("٢", "º"), tokenizer.tokenize("٢º")); // Eastern Arabic
assertEquals(Arrays.asList("२", "º"), tokenizer.tokenize("२º")); // Devanagari
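Whether the tokenizer actually splits these inputs as shown depends on its full implementation. The regex-level behaviour that such tests rely on can be sketched in isolation; the ordinal pattern and all names below are hypothetical and are not the pattern used in SpanishWordTokenizer.

```java
import java.util.regex.Pattern;

// Illustrative sketch only: ORDINAL is a made-up pattern, not the tokenizer's.
class UnicodeDigitDemo {
    // One or more digits followed by the masculine ordinal indicator.
    static final String ORDINAL = "(\\d+)\u00BA";

    static boolean isOrdinal(String input, int flags) {
        return Pattern.compile(ORDINAL, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        int unicode = Pattern.UNICODE_CHARACTER_CLASS;
        System.out.println(isOrdinal("2\u00BA", 0));            // ASCII digit, no flag
        System.out.println(isOrdinal("\u0662\u00BA", 0));       // Arabic-Indic digit two, no flag
        System.out.println(isOrdinal("\u0662\u00BA", unicode)); // Arabic-Indic digit two, with flag
        System.out.println(isOrdinal("\u0968\u00BA", unicode)); // Devanagari digit two, with flag
    }
}
```

Only the flagged pattern accepts the non-ASCII digits, because \d is limited to [0-9] by default.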
languagetool-language-modules/fr/src/main/java/org/languagetool/tokenizers/fr/FrenchWordTokenizer.java (1)
77-77
: Consider updating related patterns for consistency.
While the change to Pattern.UNICODE_CHARACTER_CLASS is correct, there are similar patterns in the file that would benefit from the same update:
- SPACE_DIGITS0 pattern (line 73)
- DECIMAL_POINT pattern (line 64)
- DECIMAL_COMMA pattern (line 66)
These patterns also deal with digit matching and should be updated for consistency.
Apply this update to related patterns:
private static final Pattern SPACE_DIGITS0 = Pattern.compile("([\\d]{4}) ",
- Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
+ Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CHARACTER_CLASS);
private static final Pattern DECIMAL_POINT = Pattern.compile("([\\d])\\.([\\d])",
- Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
+ Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CHARACTER_CLASS);
private static final Pattern DECIMAL_COMMA = Pattern.compile("([\\d]),([\\d])",
- Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
+ Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CHARACTER_CLASS);
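Replacing UNICODE_CASE with UNICODE_CHARACTER_CLASS does not lose Unicode case folding, since the java.util.regex.Pattern documentation states that UNICODE_CHARACTER_CLASS implies UNICODE_CASE. A minimal sketch, with a hypothetical class and helper, illustrates this:

```java
import java.util.regex.Pattern;

// Illustrative sketch only: names are assumptions, not LanguageTool code.
class UnicodeCaseDemo {
    // Matches the whole input case-insensitively, plus any extra flags.
    static boolean matchesIgnoringCase(String regex, String input, int extraFlags) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE | extraFlags)
                      .matcher(input)
                      .matches();
    }

    public static void main(String[] args) {
        String lower = "\u00e9t\u00e9"; // French "ete" with acute accents, lower case
        String upper = "\u00c9T\u00c9"; // the same word in upper case
        System.out.println(matchesIgnoringCase(lower, upper, 0));                               // ASCII-only folding
        System.out.println(matchesIgnoringCase(lower, upper, Pattern.UNICODE_CASE));            // Unicode folding
        System.out.println(matchesIgnoringCase(lower, upper, Pattern.UNICODE_CHARACTER_CLASS)); // implies UNICODE_CASE
    }
}
```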
languagetool-language-modules/ca/src/main/java/org/languagetool/language/Catalan.java (2)
348-348
: Consider adding a comment explaining the diacritics rule.
The pattern correctly handles old diacritics with proper Unicode support. Consider adding a brief comment explaining which version of Catalan orthography these diacritics correspond to.
422-442
: LGTM! Consider grouping related patterns.
The patterns correctly handle various cases of contractions and apostrophes with proper Unicode support. Consider grouping related patterns (e.g., all apostrophe patterns) into separate constant groups with explanatory comments for better maintainability.
Example organization:
// Group 1: Basic contractions
private static final Pattern CA_CONTRACTIONS = ...
// Group 2: Apostrophe patterns
private static final Pattern CA_APOSTROPHES1 = ...
private static final Pattern CA_APOSTROPHES2 = ...
// ... more apostrophe patterns
// Group 3: Possessive patterns
private static final Pattern POSSESSIUS_v = ...
private static final Pattern POSSESSIUS_V = ...
languagetool-core/src/main/java/org/languagetool/rules/AbstractUnitConversionRule.java (1)
199-199
: LGTM! Consider extracting the pattern for better readability.
The addition of the Pattern.UNICODE_CHARACTER_CLASS flag improves Unicode word boundary handling. However, the pattern string is quite complex.
Consider extracting the pattern string to a constant for better readability:
+ private static final String UNIT_PATTERN_TEMPLATE =
+ NUMBER_REGEX_WITH_BOUNDARY + "[\\s\u00A0]{0," + WHITESPACE_LIMIT + "}%s\\b";
- unitPatterns.put(Pattern.compile(NUMBER_REGEX_WITH_BOUNDARY + "[\\s\u00A0]{0," + WHITESPACE_LIMIT + "}" + pattern + "\\b", Pattern.UNICODE_CHARACTER_CLASS), unit);
+ unitPatterns.put(Pattern.compile(String.format(UNIT_PATTERN_TEMPLATE, pattern), Pattern.UNICODE_CHARACTER_CLASS), unit);
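The template approach can be tried out in isolation. In the sketch below, NUMBER_REGEX_WITH_BOUNDARY and WHITESPACE_LIMIT are hypothetical stand-ins with made-up values, not the real constants from AbstractUnitConversionRule; one thing to keep in mind is that the template is a format string, so the fixed part must not contain a bare % character.

```java
import java.util.regex.Pattern;

// Illustrative sketch only: constant values are assumptions, not the real ones.
class UnitPatternTemplateDemo {
    static final String NUMBER_REGEX_WITH_BOUNDARY = "\\b(\\d+)";
    static final int WHITESPACE_LIMIT = 5;

    // %s is the only format specifier; everything else is literal regex text.
    static final String UNIT_PATTERN_TEMPLATE =
        NUMBER_REGEX_WITH_BOUNDARY + "[\\s\u00A0]{0," + WHITESPACE_LIMIT + "}%s\\b";

    // Builds the full pattern for one unit regex, e.g. "km".
    static Pattern unitPattern(String unitRegex) {
        return Pattern.compile(String.format(UNIT_PATTERN_TEMPLATE, unitRegex),
                               Pattern.UNICODE_CHARACTER_CLASS);
    }

    public static void main(String[] args) {
        Pattern km = unitPattern("km");
        System.out.println(km.matcher("5 km").find());  // number, space, unit
        System.out.println(km.matcher("5km").find());   // whitespace is optional
        System.out.println(km.matcher("5 kmh").find()); // trailing \b rejects "kmh"
    }
}
```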
languagetool-core/src/main/java/org/languagetool/rules/patterns/XMLRuleHandler.java (2)
Line range hint 392-392
: Initialize phraseMap in the constructor to prevent NPE.
The phraseMap field is lazily initialized in finalizePhrase(), but it would be safer to initialize it in the constructor to prevent potential NPEs if finalizePhrase() is not called first.
Apply this change to the constructor:
public XMLRuleHandler() {
+ this.phraseMap = new HashMap<>();
}
Line range hint 392-392
: Add documentation for the phraseMap data structure.
The phraseMap field and its usage in the preparePhrase and finalizePhrase methods would benefit from detailed documentation explaining:
- The structure and purpose of the nested collections
- The lifecycle of phrase handling
- Example usage scenarios
Add Javadoc to the field:
/**
* Stores phrases by their IDs. The structure is:
* - Key: phraseId (String)
* - Value: List of alternative pattern token sequences for the phrase
* where each sequence is a List<PatternToken>
*/
protected Map<String, List<List<PatternToken>>> phraseMap;
Also applies to: 486-509
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (18)
- languagetool-core/src/main/java/org/languagetool/rules/AbstractUnitConversionRule.java (2 hunks)
- languagetool-core/src/main/java/org/languagetool/rules/patterns/PatternRuleHandler.java (1 hunks)
- languagetool-core/src/main/java/org/languagetool/rules/patterns/RegexAntiPatternFilter.java (1 hunks)
- languagetool-core/src/main/java/org/languagetool/rules/patterns/XMLRuleHandler.java (1 hunks)
- languagetool-core/src/main/java/org/languagetool/tokenizers/SrxTools.java (2 hunks)
- languagetool-core/src/main/resources/org/languagetool/resource/segment.srx (10 hunks)
- languagetool-language-modules/ca/src/main/java/org/languagetool/language/Catalan.java (3 hunks)
- languagetool-language-modules/ca/src/main/java/org/languagetool/rules/ca/PronomsFeblesHelper.java (1 hunks)
- languagetool-language-modules/de/src/main/java/org/languagetool/language/German.java (1 hunks)
- languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/grammar.xml (9 hunks)
- languagetool-language-modules/es/src/main/java/org/languagetool/language/Spanish.java (1 hunks)
- languagetool-language-modules/es/src/main/java/org/languagetool/tokenizers/es/SpanishWordTokenizer.java (1 hunks)
- languagetool-language-modules/fr/src/main/java/org/languagetool/language/French.java (1 hunks)
- languagetool-language-modules/fr/src/main/java/org/languagetool/rules/fr/MakeContractionsFilter.java (1 hunks)
- languagetool-language-modules/fr/src/main/java/org/languagetool/tokenizers/fr/FrenchWordTokenizer.java (1 hunks)
- languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/entities/hyphenised.ent (1 hunks)
- languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/pt.sor (1 hunks)
- pom.xml (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- languagetool-language-modules/ca/src/main/java/org/languagetool/rules/ca/PronomsFeblesHelper.java
🔇 Additional comments (44)
languagetool-language-modules/de/src/main/resources/org/languagetool/rules/de/grammar.xml (2)
79523-79523
: LGTM! Academic title patterns are well structured.
The changes to academic title patterns are consistent and well-documented with appropriate examples. The whitespace handling aligns with the intended behavior for formal title abbreviations.
Also applies to: 79531-79531, 79541-79541
79409-79409
: Consider implications of restrictive whitespace matching.
The change from \s? to [ \t]? makes the whitespace matching more explicit but potentially less flexible. While this might be intentional for abbreviation handling, consider that it will no longer match other valid Unicode whitespace characters.
Let's check if there are any instances in the codebase using other whitespace characters with these abbreviations:
Also applies to: 79418-79418, 79426-79426
✅ Verification successful
Let me try a different approach to search for these specific abbreviations:
And let's also check if there are any test cases or documentation specifically addressing whitespace handling:
The whitespace pattern change is safe and consistent with usage patterns
Based on the search results, the change from \s? to [ \t]? is actually appropriate for these German abbreviations:
- All examples in the grammar rules and test cases show these abbreviations (s.o., d.h., m.E.) used either:
  - Without any space (d.h., m.E.)
  - With a regular space (d. h.)
  - With a non-breaking space entity (d. h.)
The search results show no instances where these abbreviations are used with other types of whitespace characters (like newlines or form feeds), confirming that the more restrictive pattern [ \t]? correctly covers all legitimate use cases in German text.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for abbreviations with Unicode whitespace characters
rg -U "([sd]|[mM])\.\X{1}([ouhE])\." --multiline
Length of output: 158
Script:
#!/bin/bash
# Search for abbreviations with any kind of whitespace characters
rg -U "([sd]|[mM])\.[[:space:]]*([ouhE])\." --multiline languagetool-language-modules/de/
# Also check test files and example texts for real-world usage
rg -U "([sd]|[mM])\.[[:space:]]*([ouhE])\." --type txt --type xml languagetool-language-modules/de/src/test/
Length of output: 2286
Script:
#!/bin/bash
# Search for specific abbreviations with surrounding context
rg -B1 -A1 "(s\.o\.|d\.h\.|m\.E\.)" languagetool-language-modules/de/
Length of output: 2026
Script:
#!/bin/bash
# Look for test cases or documentation about whitespace handling
rg -i "whitespace|[[:space:]]|\bs\b" languagetool-language-modules/de/src/test/ --type xml
Length of output: 92
languagetool-core/src/main/java/org/languagetool/rules/patterns/RegexAntiPatternFilter.java (1)
45-45
: LGTM! Verify existing antipatterns.
The addition of the Pattern.UNICODE_CHARACTER_CLASS flag is appropriate and aligns with the JDK 19 Unicode improvements across the codebase. However, this change could affect how character classes behave with Unicode text.
Let's verify the existing antipatterns in the codebase:
languagetool-core/src/main/java/org/languagetool/tokenizers/SrxTools.java (2)
34-34
: LGTM: Import addition is appropriate.
The Pattern import is necessary for using the UNICODE_CHARACTER_CLASS flag and follows Java import conventions.
63-64
: Implementation improves Unicode handling but verify existing behavior.
The addition of Pattern.UNICODE_CHARACTER_CLASS will enhance Unicode character handling in tokenization, which aligns with the JDK19 regexp improvements. However, this change might subtly affect existing tokenization behavior.
Let's verify the impact:
✅ Verification successful
Let me gather more information about the usage and potential impact by checking the SRXSentenceTokenizer implementation.
Let me check how the tokenize method is used in SRXSentenceTokenizer to understand the impact of the change.
Let me check the SRXSentenceTokenizerTest to see if there are existing Unicode-related tests and how they might be affected.
Based on the code analysis, I can now provide a final response:
Unicode handling improvement is safe and beneficial
The addition of Pattern.UNICODE_CHARACTER_CLASS flag in SrxTools.tokenize() is a safe enhancement that improves Unicode character handling in text tokenization across all languages. The change:
- Is applied consistently through the central tokenization method used by all SRXSentenceTokenizer implementations
- Only affects the pattern matching behavior to better handle Unicode, without changing the tokenization logic
- Does not modify any method signatures or public APIs
- Is backward compatible as it only enhances the Unicode support without breaking existing functionality
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for any test cases or usages that might be affected by the tokenization change
# Check for test files that use SrxTools.tokenize
echo "Checking for test files using SrxTools.tokenize:"
rg -l "SrxTools\.tokenize" --type java
# Look for direct usages of the tokenize method
echo "Checking for direct usages of tokenize method:"
ast-grep --pattern 'tokenize($text, $doc, $code)'
Length of output: 388
Script:
#!/bin/bash
# Check the implementation in SRXSentenceTokenizer to understand how SrxTools.tokenize is used
echo "Checking SRXSentenceTokenizer implementation:"
rg -A 5 -B 5 "SrxTools\.tokenize" languagetool-core/src/main/java/org/languagetool/tokenizers/SRXSentenceTokenizer.java
# Check for tests of SRXSentenceTokenizer
echo -e "\nChecking for SRXSentenceTokenizer tests:"
fd -e java -x rg -l "SRXSentenceTokenizer" {} \;
# Look for any existing Unicode-related tests
echo -e "\nChecking for Unicode-related tests:"
rg -l "unicode|Unicode" $(fd -e java -t f)
Length of output: 201637
Script:
#!/bin/bash
# Check SRXSentenceTokenizer implementation to understand how SrxTools.tokenize is used
echo "Checking SRXSentenceTokenizer implementation:"
rg -A 10 -B 10 "SRXSentenceTokenizer" languagetool-core/src/main/java/org/languagetool/tokenizers/SRXSentenceTokenizer.java
# Check for any existing tests that verify Unicode handling
echo -e "\nChecking for Unicode-related tests in tokenizers:"
rg -l "Unicode|unicode.*test" $(fd -e java -t f | grep "test.*Tokenizer")
Length of output: 6491
Script:
#!/bin/bash
# Check the core SRXSentenceTokenizerTest implementation
echo "Checking SRXSentenceTokenizerTest implementation:"
rg -A 5 -B 5 "Unicode|unicode" languagetool-standalone/src/test/java/org/languagetool/tokenizers/SRXSentenceTokenizerTest.java
# Check SrxTools implementation to understand the Unicode flag usage
echo -e "\nChecking SrxTools implementation around tokenize method:"
rg -A 10 -B 10 "tokenize\(" languagetool-core/src/main/java/org/languagetool/tokenizers/SrxTools.java
Length of output: 1300
languagetool-language-modules/es/src/main/java/org/languagetool/tokenizers/es/SpanishWordTokenizer.java (1)
46-46
: Verify the necessity of UNICODE_CHARACTER_CLASS flag.
The change from UNICODE_CASE to UNICODE_CHARACTER_CLASS looks correct, but let's verify if Spanish texts actually contain non-ASCII digits that would benefit from this change.
languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/pt.sor (1)
181-181
: LGTM! The Unicode flag addition improves text processing.
The addition of the (?U) flag enables UNICODE_CHARACTER_CLASS mode, which ensures proper handling of word boundaries with Unicode characters in Portuguese text.
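For reference, the embedded (?U) expression and the Pattern.UNICODE_CHARACTER_CLASS compile flag can be compared directly. The sketch below uses hypothetical class and method names and a Portuguese sample word; it also checks that quantifiers stay greedy under (?U).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: names and sample inputs are assumptions.
class EmbeddedFlagDemo {
    // True when the whole input matches the regex (no extra flags).
    static boolean wholeMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Returns the first match of the regex in the input, or null.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String word = "a\u00e7\u00e3o"; // Portuguese word with c-cedilla and a-tilde
        System.out.println(wholeMatch("\\w+", word));     // ASCII-only \w stops at the cedilla
        System.out.println(wholeMatch("(?U)\\w+", word)); // embedded flag: Unicode-aware \w
        System.out.println(Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS)
                                  .matcher(word).matches()); // same effect via the compile flag
        System.out.println(firstMatch("(?U)a+", "aaa"));  // quantifier is still greedy
    }
}
```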
Let's verify if similar patterns in other language modules need this improvement:
languagetool-language-modules/fr/src/main/java/org/languagetool/tokenizers/fr/FrenchWordTokenizer.java (2)
Line range hint 41-57
: Consider reviewing other regex patterns in the file.
Since this is part of a JDK 19 regexp fix, consider reviewing other patterns in the file that use Pattern.UNICODE_CASE. For example:
- TYPEWRITER_APOSTROPHE
- TYPOGRAPHIC_APOSTROPHE
- NEARBY_HYPHENS
- HYPHENS
These patterns use Unicode property classes (\p{L}), which are Unicode-aware regardless of flags. Any \b, \w, \d, or \s they contain, however, only becomes Unicode-aware with Pattern.UNICODE_CHARACTER_CLASS, so they might benefit from the flag for consistent Unicode handling.
Let's check if there are similar patterns in other language modules:
#!/bin/bash
# Description: Find similar patterns in other language modules that might need updating
# Look for patterns using UNICODE_CASE with \p{L} or \d
rg --type java "Pattern\.compile.*\\\\p\{L\}.*UNICODE_CASE" ./languagetool-language-modules/
75-75
: LGTM! The Unicode flag change improves digit matching.
The switch to Pattern.UNICODE_CHARACTER_CLASS is appropriate here as it ensures proper matching of all Unicode digits, not just ASCII ones. This is particularly relevant for French text that might contain non-ASCII digits.
Let's verify the pattern behavior with non-ASCII digits:
✅ Verification successful
The Unicode flag change is appropriate and consistent with the codebase
Looking at the complete implementation, the change from UNICODE_CASE to UNICODE_CHARACTER_CLASS in the SPACE_DIGITS pattern is correct because:
- The pattern only matches digits and spaces, making UNICODE_CASE unnecessary since there's no case-sensitivity involved
- UNICODE_CHARACTER_CLASS is more appropriate as it ensures \d matches all Unicode digits, which is important for proper number formatting in French texts
- The change is consistent with the pattern's usage in the tokenizer, where it's specifically used for handling number formatting with spaces (e.g., "1 234")
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for non-ASCII digit usage in French texts
# Look for Unicode digits (like ٠١٢٣٤٥٦٧٨٩) in French files
rg --type java --type xml "[\u0660-\u0669]|[\u06F0-\u06F9]" ./languagetool-language-modules/fr/
Length of output: 834
Script:
#!/bin/bash
# Let's check the actual pattern usage in FrenchWordTokenizer.java
cat ./languagetool-language-modules/fr/src/main/java/org/languagetool/tokenizers/fr/FrenchWordTokenizer.java
Length of output: 11034
languagetool-language-modules/es/src/main/java/org/languagetool/language/Spanish.java (1)
318-318
: LGTM! Good improvement for Unicode handling.
The addition of the Pattern.UNICODE_CHARACTER_CLASS flag to the ES_CONTRACTIONS pattern is a valuable enhancement. This flag makes word boundaries (\b) Unicode-aware, which is particularly important for Spanish text containing accented characters. Since JDK 19, \b without this flag considers only ASCII word characters, so setting the flag keeps the handling of Spanish contractions (e.g., "a el" → "al", "de el" → "del") reliable when accented letters are adjacent.
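A small sketch of the boundary behaviour with the flag set follows. The DE_EL pattern and the sample sentences are hypothetical, modelled on the "de el" → "del" example, and are not the actual ES_CONTRACTIONS pattern. Only the flagged behaviour is shown, because the unflagged result of \b next to an accented letter depends on the JDK version.

```java
import java.util.regex.Pattern;

// Illustrative sketch only: DE_EL is a made-up pattern, not ES_CONTRACTIONS.
class WordBoundaryDemo {
    static final Pattern DE_EL = Pattern.compile(
        "\\bde el\\b", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CHARACTER_CLASS);

    // True when the text contains "de el" as two complete words.
    static boolean containsContraction(String text) {
        return DE_EL.matcher(text).find();
    }

    public static void main(String[] args) {
        // "el" is a complete word here, so the boundary after it holds.
        System.out.println(containsContraction("la casa de el vecino"));
        // Here "el" is the start of a longer word whose next letter is an accented a.
        // With the flag, that letter counts as a word character, so there is no boundary.
        System.out.println(containsContraction("una goma de el\u00e1stico"));
    }
}
```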
languagetool-language-modules/ca/src/main/java/org/languagetool/language/Catalan.java (1)
48-50
: LGTM! Proper Unicode handling for apostrophe patterns.
The addition of Pattern.UNICODE_CHARACTER_CLASS flag ensures correct handling of Unicode characters in Catalan text, particularly for apostrophe-related patterns.
languagetool-language-modules/fr/src/main/java/org/languagetool/language/French.java (1)
53-55
: LGTM! The addition of Pattern.UNICODE_CHARACTER_CLASS improves Unicode handling.
The addition of the Pattern.UNICODE_CHARACTER_CLASS flag to the apostrophe patterns is a good enhancement that ensures proper handling of Unicode apostrophe characters in French text processing. This change aligns with similar improvements across other language modules and makes the regex patterns more robust.
languagetool-core/src/main/java/org/languagetool/rules/AbstractUnitConversionRule.java (1)
83-83
: LGTM! Good Unicode support enhancement.
Adding the Pattern.UNICODE_CHARACTER_CLASS flag improves the handling of Unicode word boundaries in number range detection.
languagetool-core/src/main/java/org/languagetool/rules/patterns/XMLRuleHandler.java (2)
21-27
: LGTM: Import statements are well-organized.
The new imports are properly organized and necessary for the collection classes and function interface used in the implementation.
Line range hint 486-509
: Consider performance impact of deep copying pattern tokens.
The finalizePhrase method creates multiple copies of pattern tokens. For rules with many phrases, this could impact memory usage and performance. Consider:
- Adding a comment explaining why deep copying is necessary
- Measuring the performance impact with large rule sets
languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt/entities/hyphenised.ent (1)
2-2
: Verify the impact of adding the (?U) flag.
In Java regular expressions, (?U) is the embedded form of Pattern.UNICODE_CHARACTER_CLASS; it does not make quantifiers ungreedy (that is the meaning of the U modifier in PCRE). The flag makes \w, \d, \s, \b, and the POSIX character classes follow their Unicode definitions, which can change what existing patterns match. Please ensure this change has been thoroughly tested with various inputs.
languagetool-language-modules/de/src/main/java/org/languagetool/language/German.java (1)
58-58
: LGTM! Consider verifying Unicode character handling.
The addition of the Pattern.UNICODE_CHARACTER_CLASS flag improves the pattern's ability to handle Unicode characters in German text, which is particularly important for typography-related transformations.
Let's verify the pattern's behavior with Unicode characters:
✅ Verification successful
Pattern change is appropriate and consistent with codebase practices
The addition of the Pattern.UNICODE_CHARACTER_CLASS flag to TYPOGRAPHY_PATTERN in German.java aligns with established practices across the codebase:
- Other language modules (Ukrainian, Spanish, French, Catalan) consistently use the UNICODE_CHARACTER_CLASS flag for patterns handling language-specific text
- The German pattern matches letter sequences ([a-zA-Z]), making Unicode support particularly relevant for handling German-specific characters
- Similar typography-related patterns in other modules (e.g., French) demonstrate careful handling of Unicode characters
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for similar typography patterns in other language modules
# to ensure consistent Unicode handling across the codebase.
# Test: Search for typography patterns in other language modules
rg -l 'Pattern\.UNICODE_CHARACTER_CLASS.*TYPOGRAPHY_PATTERN' 'languagetool-language-modules/'
Length of output: 93
Script:
#!/bin/bash
# Let's try a broader search to understand typography pattern usage across language modules
# and check if other modules have similar Unicode character handling
# Search for TYPOGRAPHY_PATTERN declarations
rg -A 2 'TYPOGRAPHY_PATTERN.*=' languagetool-language-modules/
# Search for toAdvancedTypography method implementations to understand pattern usage
ast-grep --pattern 'toAdvancedTypography($$$) { $$$ }'
# Search for other patterns using UNICODE_CHARACTER_CLASS flag
rg 'Pattern\.UNICODE_CHARACTER_CLASS' languagetool-language-modules/
Length of output: 10039
languagetool-core/src/main/java/org/languagetool/rules/patterns/PatternRuleHandler.java (1)
779-780
: LGTM! Improved Unicode support for regex patterns.
The addition of the Pattern.UNICODE_CHARACTER_CLASS flag enhances the regex pattern matching by making character classes like \w, \d, \s match Unicode characters. This is a good improvement that will help better handle text in various languages.
pom.xml (2)
194-194
: Verify compatibility with the new segment library version.
The update of net.loomchild.segment from 2.0.3 to 2.0.4 aligns with the PR's focus on improving regex handling. However, we should verify that this update doesn't introduce any breaking changes.
Let's check for any breaking changes or compatibility issues:
✅ Verification successful
Let me gather more information about the specific usage of the segment library to verify if there are any potential breaking changes.
Version update from 2.0.3 to 2.0.4 appears safe to proceed
The code inspection shows that the segment library is used in a straightforward manner for text segmentation with SRX rules:
- Usage is limited to core tokenization functionality in SrxTools and SRXSentenceTokenizer
- The API usage is basic and stable (TextIterator, SrxDocument, SrxParser)
- The update aligns with the PR's focus on regex improvements, particularly with the Pattern.UNICODE_CHARACTER_CLASS flag usage
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for potential breaking changes in segment library usage
# Test: Search for segment library usage patterns
rg -A 5 "net\.loomchild\.segment"
# Test: Search for any existing issues or workarounds related to the segment library
rg -l "TODO.*segment|FIXME.*segment|XXX.*segment|HACK.*segment|BUG.*segment"
Length of output: 2813
Script:
#!/bin/bash
# Check the actual usage of segment library classes in the codebase
ast-grep --pattern 'class $_ {
$$$
TextIterator $$$
$$$
}'
# Check for any test files related to segmentation
fd -e java -e xml test.*segment
# Look at the implementation details in SrxTools.java
cat languagetool-core/src/main/java/org/languagetool/tokenizers/SrxTools.java
# Look at the implementation details in SRXSentenceTokenizer.java
cat languagetool-core/src/main/java/org/languagetool/tokenizers/SRXSentenceTokenizer.java
Length of output: 5437
194-194
: Address dependency-related TODO comments.
Several TODO comments in the dependency block indicate potential issues that need attention:
- Compile errors in wikipedia, dev, and rpm modules need to be checked
- Missing dependencies after update:
- jopt-simple.jar
- commons-jxpath.jar
- log4j.jar
- Compatibility checks needed for various dependencies
Please verify these issues and ensure all dependencies are properly resolved.
Let's check for the mentioned dependencies and potential issues:
languagetool-core/src/main/resources/org/languagetool/resource/segment.srx (24)
5747-5747
: Regex pattern correctly handles prepositions followed by ellipsis
The regex accurately matches Ukrainian prepositions followed by an ellipsis, ensuring proper sentence segmentation.
5760-5760
: Regex for matching numbered points is correctly defined
The pattern correctly identifies numbered list items, aiding in appropriate segmentation.
5765-5765
: Regex efficiently matches lowercase words ending with punctuation
This regex effectively captures words ending with punctuation marks, which is important for accurate sentence detection.
5774-5774
: Regex correctly matches one or two-letter abbreviations
The pattern accurately identifies abbreviations consisting of one or two letters followed by a period.
5779-5779
: Regex for uppercase abbreviations with optional non-breaking space is appropriate
The regex effectively matches uppercase single-letter abbreviations, optionally preceded by a non-breaking space.
5784-5784
: Regex pattern for complex abbreviation contexts is acceptable
The pattern accurately captures complex abbreviation scenarios, enhancing the segmentation rules for Ukrainian text.
5808-5808
: Regex correctly matches years followed by 'р.' abbreviation
This regex effectively identifies years followed by the Ukrainian abbreviation for 'year', ensuring accurate processing of dates.
5813-5813
: Negative lookbehind correctly avoids matching 'р.' after digits
The negative lookbehind ensures that 'р.' is not matched when preceded by digits, preventing incorrect segmentation.
5828-5828
: Regex correctly matches year ranges with 'рр.' abbreviation
The pattern effectively captures ranges of years followed by 'рр.', the Ukrainian plural abbreviation for 'years'.
5833-5833
: Regex matches common Ukrainian financial abbreviations appropriately
This regex accurately identifies financial abbreviations such as 'тис.', 'млн.', 'млрд.', and 'грн.'.
5838-5838
: Regex correctly matches language abbreviations
The pattern effectively captures abbreviations for various languages, enhancing linguistic processing.
5846-5846
: Regex correctly matches abbreviation 'кін.'
The pattern accurately matches the abbreviation for 'кін.', enhancing abbreviation detection.
5850-5850
: Regex correctly matches abbreviation 'ст.'
This regex effectively identifies the abbreviation 'ст.', commonly used for 'сторінка' (page).
5859-5859
: Regex correctly matches abbreviation 'нар.'
The pattern successfully matches 'нар.', the abbreviation for 'народження' (birth).
5863-5863
: Regex correctly matches abbreviation 'дол.'
This regex accurately identifies 'дол.', the abbreviation for 'долар' (dollar).
5868-5868
: Negative lookbehind correctly excludes 'т.' prefix
The negative lookbehind ensures that abbreviations 'п.' or 'д.' are not matched when preceded by 'т.', preventing incorrect segmentation in specific contexts.
5873-5873
: Regex correctly matches abbreviation 'див.'
The pattern effectively captures 'див.', the abbreviation for 'дивись' (see), commonly used in references.
5885-5885
: Regex correctly matches 'див.' in specific contexts
This regex accurately identifies 'див.' when it appears within parentheses, brackets, or after a comma, enhancing reference handling.
5890-5890
: Regex matches professional title abbreviations effectively
The pattern successfully matches various professional titles such as 'проф.', 'доц.', 'акад.', which is important for proper name recognition.
5894-5894
: Negative lookbehind correctly avoids matching 'др.' after 'і' or 'и'
The negative lookbehind prevents matching 'др.' when preceded by 'і' or 'и', ensuring accurate abbreviation detection.
5898-5898
: Regex correctly matches abbreviation 'Ман.'
The pattern effectively captures 'Ман.', which may represent 'Манчестер' or similar abbreviations.
5908-5908
: Regex matches section abbreviations correctly
This regex accurately identifies abbreviations like 'Арт.', 'Мал.', 'Рис.', and 'Спр.', common in Ukrainian texts.
5918-5918
: Negative lookbehind ensures 'м.' is not matched after digits
The negative lookbehind prevents matching 'м.' (meter) when preceded by digits, avoiding incorrect segmentation.
5942-5942
: Regex correctly matches 'ред.' abbreviation
The pattern effectively captures 'ред.', the abbreviation for 'редактор' (editor), enhancing proper noun recognition.
@@ -79485,7 +79485,7 @@ Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA
<example correction="n. V.">Sprechstunde <marker>n.V.</marker></example>
</rule>
<rule><!--12-->
- <regexp>(u)\.\s?(v)\.\s?(m|a)\.</regexp>
+ <regexp>(u)\.\s?(v)\.[ \t]?(m|a)\.</regexp>
Fix inconsistent whitespace pattern in rule 12.
The pattern (u)\.\s?(v)\.[ \t]?(m|a)\. mixes \s? and [ \t]?. This inconsistency could lead to different matching behavior within the same expression.
Suggest applying the same pattern consistently:
-(u)\.\s?(v)\.[ \t]?(m|a)\.
+(u)\.[ \t]?(v)\.[ \t]?(m|a)\.
📝 Committable suggestion
-<regexp>(u)\.\s?(v)\.[ \t]?(m|a)\.</regexp>
+(u)\.[ \t]?(v)\.[ \t]?(m|a)\.
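To make the inconsistency concrete, here is a minimal sketch comparing the mixed pattern from the diff with the consistent one from the suggestion. It relies on the fact that, by default, Java's \s also matches line breaks while [ \t] does not. The class name, helper, and sample inputs are assumptions for illustration only.

```java
import java.util.regex.Pattern;

public class WhitespaceSketch {
    // Pattern as it stands in the diff: \s? in the first gap, [ \t]? in the second.
    static final Pattern MIXED =
        Pattern.compile("(u)\\.\\s?(v)\\.[ \\t]?(m|a)\\.");
    // Pattern from the suggestion: [ \t]? in both gaps.
    static final Pattern CONSISTENT =
        Pattern.compile("(u)\\.[ \\t]?(v)\\.[ \\t]?(m|a)\\.");

    static boolean found(Pattern p, String text) {
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        String breakInFirstGap = "u.\nv. m.";
        String breakInSecondGap = "u. v.\nm.";
        // The mixed pattern treats the two gaps differently.
        System.out.println(found(MIXED, breakInFirstGap));
        System.out.println(found(MIXED, breakInSecondGap));
        // The consistent pattern rejects a line break in either gap.
        System.out.println(found(CONSISTENT, breakInFirstGap));
        System.out.println(found(CONSISTENT, breakInSecondGap));
    }
}
```

With the mixed pattern, a line break is accepted in the first gap but not in the second; the consistent pattern accepts it in neither, while both still match the plain "u. v. m." form.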
private static final Pattern DE_LE = Pattern.compile("\\bde le\\b", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CHARACTER_CLASS);
private static final Pattern A_LE = Pattern.compile("\\bà le\\b", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CHARACTER_CLASS);
private static final Pattern DE_LES = Pattern.compile("\\bde les\\b", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CHARACTER_CLASS);
private static final Pattern A_LES = Pattern.compile("\\bà les\\b", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CHARACTER_CLASS);
💡 Codebase verification
Several patterns need Unicode character class updates for consistent handling
The verification revealed that there are several patterns in French-related files that use Pattern.UNICODE_CASE without Pattern.UNICODE_CHARACTER_CLASS, particularly in files that deal with word boundaries and character classes. Key locations that need similar updates:
- FrenchWordTokenizer.java:
  - TYPEWRITER_APOSTROPHE, TYPOGRAPHIC_APOSTROPHE patterns using \p{L} with word boundaries
  - NEARBY_HYPHENS, HYPHENS patterns using \p{L} with word boundaries
  - SPACE_DIGITS0 pattern (only using UNICODE_CASE)
- FrenchTagger.java:
  - PREFIXES_FOR_VERBS and PREFIXES_FOR_NOUN_ADJ patterns using case-insensitive matching with accented characters
These patterns would benefit from the same Pattern.UNICODE_CHARACTER_CLASS flag to ensure consistent Unicode handling across the French module, especially for patterns involving word boundaries (\b) or character classes (\p{L}).
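A minimal sketch of why the flag matters for \b next to accented letters, using the A_LE pattern from the diff. As far as I understand the JDK 19 release notes, \b without UNICODE_CHARACTER_CLASS considers only ASCII word characters from that release on, so the unflagged result can differ between JDK versions; the sketch therefore exercises only the flagged pattern. The class name, helper, and sample sentences are assumptions, and 'à' is written as a Unicode escape so the source compiles regardless of file encoding.

```java
import java.util.regex.Pattern;

public class UnicodeBoundarySketch {
    // Same pattern and flags as A_LE in the diff; U+00E0 is 'a' with grave accent.
    static final Pattern A_LE = Pattern.compile(
        "\\b\u00e0 le\\b",
        Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CHARACTER_CLASS);

    static boolean found(String text) {
        return A_LE.matcher(text).find();
    }

    public static void main(String[] args) {
        // Accented letter after a space: a word boundary under Unicode rules.
        System.out.println(found("Il va \u00e0 le march\u00e9"));
        // Accented letter inside a word ("voila" with grave accent): no boundary before it.
        System.out.println(found("voil\u00e0 le chat"));
        // Upper-case input: UNICODE_CHARACTER_CLASS implies Unicode-aware case folding.
        System.out.println(found("IL VA \u00c0 LE MARCH\u00c9"));
    }
}
```

With the flag set, the accented letter counts as a word character, so the boundary before a standalone 'à' is found and the one inside "voilà" is not.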
🔗 Analysis chain
LGTM! Improved Unicode handling for French contractions.
The update to use Pattern.UNICODE_CHARACTER_CLASS is correct and improves the handling of Unicode word boundaries in French text. This change aligns with JDK 19 best practices for regex patterns.
Let's verify if similar patterns in other French-related classes have been updated:
This script will help us examine all Pattern.compile usages in French-related files to verify if similar Unicode handling updates are needed elsewhere. The previous search revealed multiple files using Pattern.compile, so we need to inspect their actual pattern definitions to ensure consistent Unicode handling across the French module.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for other regex patterns in French-related classes that might need similar updates
# Search for Pattern.compile in French-related files
echo "Checking for other regex patterns in French-related files..."
rg -l "Pattern\.compile" "languagetool-language-modules/fr/src/main/java"
# Search specifically for UNICODE_CASE without UNICODE_CHARACTER_CLASS
echo "Checking for patterns that might need updating..."
rg "Pattern\.compile.*UNICODE_CASE(?!.*UNICODE_CHARACTER_CLASS)" "languagetool-language-modules/fr/src/main/java"
Length of output: 1824
Script:
#!/bin/bash
# Let's check the content of these files for Pattern.compile usage
for file in $(rg -l "Pattern\.compile" "languagetool-language-modules/fr/src/main/java"); do
echo "=== $file ==="
# Show Pattern.compile lines with surrounding context
rg -B1 -A1 "Pattern\.compile" "$file"
echo
done
Length of output: 17710
@@ -1,2 +1,2 @@
<!ENTITY hyphenised_expressions "\b(?!feij(?:ão|ões)-frade)((?:c(?:a(?:r(?:rap(?:icho(?:s-(?:c(?:a(?:(?:rneir|val)o|lçada)|igana)|(?:agu|ove)lha|l(?:inho|ã)|boi)|-(?:c(?:a(?:(?:rneir|val)o|lçada)|igana)|(?:agu|ove)lha|l(?:inho|ã)|boi))|ato(?:s-(?:p(?:assarinho|eixe)|(?:caval|sap)o|galinha|boi)|-(?:p(?:assarinho|eixe)|(?:caval|sap)o|galinha|boi)))|u(?:ru(?:s-(?:(?:s(?:oldad|ap)|cach|vead)o|espi(?:nho|ga)|po(?:mba|rco))|-(?:(?:s(?:oldad|ap)|cach|vead)o|espi(?:nho|ga)|po(?:mba|rco)))|atás?-pau)|d(?:o(?:s-(?:(?:(?:bur|ou)r|vis[cg])o|co(?:chonilha|alho|mer)|isca)|-(?:(?:(?:bur|ou)r|vis[cg])o|co(?:chonilha|alho|mer)|isca))|ea(?:is|l)-poupa)|á(?:s-(?:(?:sapateir|cabocl|espinh)o|(?:angol|pedr|águ)a|jardim)|-(?:(?:sapateir|cabocl|espinh)o|(?:pedr|águ)a|jardim))|a(?:n(?:ha(?:s-(?:viveir|toc|ri)|-(?:viveir|toc))o|guejos?-pedra)|guatás?-jardim)|(?:v(?:ões|ão)-ferreir|mim-cártam)o|nes-donzela)|p(?:i(?:ns-(?:b(?:o(?:t(?:ão|a)|lota|de)|a(?:ndeira|tatais)|u(?:cha|rro)|ezerro)|c(?:a(?:(?:rneir|val)o|(?:piva|b)ra)|o(?:ntas|rte|co)|heiro|uba)|t(?:(?:artarug|ouceir)a|e(?:nerife|so))|(?:m(?:a(?:rrec|nad)|ul)|esteir|égu)a|p(?:(?:ernambuc|ast|omb)o|lanta)|f(?:o(?:rquilha|go)|lecha|eixe)|r(?:o(?:[ls]a|des)|ebanho|aiz)|a(?:n(?:gola|dar)|çude)|s(?:apo|oca)|diamante|lastro|natal|itu)|m-(?:b(?:o(?:t(?:ão|a)|lota|de)|a(?:ndeira|tatais)|u(?:cha|rro)|ezerro)|c(?:a(?:(?:rneir|val)o|(?:piva|b)ra)|o(?:ntas|rte|co)|heiro|uba)|(?:m(?:a(?:rrec|nad)|ul)|esteir|égu|soc)a|t(?:(?:artarug|ouceir)a|e(?:nerife|so))|p(?:(?:ernambuc|ast|omb)o|lanta)|f(?:o(?:rquilha|go)|lecha|eixe)|r(?:o(?:[ls]a|des)|ebanho|aiz)|a(?:n(?:gola|dar)|çude)|d(?:iamante|eus)|lastro|natal|itu)|tã(?:es|o)-sa(?:ír|l)a|xinguis?-bicho)|elas?-viúva)|n(?:a(?:s-(?:(?:(?:chei|bur)r|passarinh|macac)o|(?:v(?:asso[iu]|íbo)r|frech|roc)a|(?:jacar|imb)é|elefante|açúcar|urubu)|-(?:(?:v(?:asso[iu]|íbo)r|frech|roc)a|(?:passarinh|macac|burr)o|(?:jacar|imb)é|elefante|açúcar|urubu)|fístula(?:s-(?:igapó|lagoa|boi)|-(?:igapó|lagoa|boi)))|el(?:a(?:s-(?:c(?:a
(?:poeira|tarro)|(?:eilã|heir)o|utia)|v(?:e(?:ad|lh)o|argem)|g(?:arça|oiás)|papagaio|jacamim|ema)|-(?:c(?:a(?:poeira|tarro)|utia)|v(?:argem|eado)|g(?:arça|oiás)|papagaio|jacamim|ema))|eira(?:s-(?:cheiro|ema)|-(?:cheiro|ema)))|udo(?:s-(?:cachimbo|lagoa)|-(?:cachimbo|lagoa))|(?:ários?-franç|iços?-águ)a|sanç(?:ões|ão)-leite|oés?-botão)|s(?:t(?:a(?:nh(?:a(?:-(?:(?:á(?:fric|gu)|a(?:rar|nt))a|m(?:oçambique|acaco|inas)|c(?:aiaté|utia)|p(?:eixe|uri)|jatobá|bugre)|s-(?:(?:á(?:fric|gu)|a(?:rar|nt))a|m(?:oçambique|acaco|inas)|c(?:aiaté|utia)|p(?:eixe|uri)|bugre))|eiros?-minas)|s?-correr)|or(?:es)?-montanha)|c(?:a(?:s-(?:carvalho|jacaré|anta)|-(?:carvalho|jacaré|noz))|o(?:s-(?:cavalo|jabuti|tatu)|-(?:cavalo|jabuti|tatu))|udo(?:s-(?:enfeite|aranha)|-(?:enfeite|aranha))))|m(?:a(?:r(?:á(?:s-(?:(?:c(?:aval|heir)|espinh)o|b(?:ilro|oi)|flecha)|-(?:(?:c(?:aval|heir)|espinh)o|b(?:ilro|oi)|flecha))|ões-(?:pe(?:nedo|dra)|estalo|areia)|ão-(?:pe(?:nedo|dra)|estalo|areia))|le(?:ões-(?:pedreira|asas)|ão-(?:pedreira|asas)))|b(?:ará(?:s-(?:c(?:h(?:eir|umb)o|apoeira)|espinho|lixa)|-(?:c(?:h(?:eir|umb)o|apoeira)|espinho|lixa))|oatãs?-leite)|urus?-cheiro)|c(?:himbo(?:s-(?:(?:maca|tur)co|jabuti)|-(?:(?:maca|tur)co|jabuti))|au(?:s-(?:ca(?:racas|iena)|mico)|-(?:ca(?:racas|iena)|mico))|tos?-cabeça)|b(?:a(?:(?:ças?-trombet|cinhas?-cobr)a|s-(?:igreja|ladrão|peixe)|-(?:igreja|ladrão|peixe))|u(?:mbos?-azeite|rés?-orelha))|val(?:inho(?:-(?:judeu|deus|cão)|s-judeu)|o-cão)|fé(?:s-b(?:agueio|ugre)|-b(?:agueio|ugre))|t(?:ingueiros?-porc|otas?-espinh)o|avuranas?-cunhã)|o(?:c(?:o(?:-(?:(?:b(?:acai(?:aú|u)b|ocaiuv)|p(?:almeir|indob|urg)|quar(?:esm|t)|oitav)a|v(?:a(?:queiro|ssoura)|inagre|eado)|c(?:(?:a(?:cho|ta)rr|igan)o|olher)|(?:espinh|rosári|macac|óle)o|i(?:ndaiá|ri)|(?:gur|a)iri|na(?:tal|iá)|dendê)|s-(?:(?:b(?:acai(?:aú|u)b|ocaiuv)|p(?:almeir|indob|urg)|quar(?:esm|t)|oitav)a|v(?:a(?:queiro|ssoura)|inagre|eado)|c(?:(?:a(?:cho|ta)rr|igan)o|olher)|(?:espinh|rosári|macac|óle)o|i(?:ndaiá|ri)|na(?:tal|iá)|dendê|gu
riri))|(?:honilhas?-cer|as?-águ)a)|bra(?:-(?:c(?:a(?:p(?:elo|im)|scavel|belo|ju)|o(?:lchete|ral)|ipó)|(?:es[cp]ad|ferradur|barat|águ)a|(?:v(?:ead|idr)|lix|oc)o|a(?:r(?:eia)?|sa)|pernas|ratos?)|s-(?:c(?:a(?:p(?:elo|im)|scavel|belo|ju)|o(?:lchete|ral)|ipó)|(?:ferradur|barat|espad|águ)a|(?:v(?:ead|idr)|lix|oc)o|a(?:r(?:eia)?|sa)|pernas|ratos?))|l(?:a(?:-(?:(?:(?:sapatei|zor)r|caval)o|peixe)|s-(?:(?:caval|zorr)o|peixe))|eir(?:o(?:s-(?:(?:band|choc)o|sapé)|-(?:(?:band|choc)o|sapé))|as?-sapé))|r(?:(?:uj(?:as?|ão)-igrej|tiças?-montanh|-ros)a|vina(?:s-(?:corso|linha)|-(?:corso|linha))|d(?:ões|ão)-frade|reias?-inverno)|e(?:rana(?:s-(?:(?:caravel|min)as|pernambuco)|-(?:(?:caravel|min)as|pernambuco))|ntro(?:s-caboclos|-caboclo))|gumelo(?:s-(?:c(?:aboclo|hapéu)|(?:sangu|leit)e|paris)|-(?:c(?:aboclo|hapéu)|(?:sangu|leit)e|paris))|uve(?:s-(?:a(?:dorno|reia)|(?:saboi|águ)a|cortar)|-(?:a(?:dorno|reia)|(?:saboi|águ)a|cortar))|n(?:gonha(?:s-(?:caixeta|bugre|goiás)|-(?:caixeta|bugre|goiás))|durus?-sangue|tas?-cabra)|irana(?:s-(?:(?:caravel|min)as|pernambuco)|-(?:(?:caravel|min)as|pernambuco))|(?:xas?-(?:d(?:am|on)|freir)|mer(?:es)?-arar|tovias?-poup)a|queiro(?:s-(?:vassoura|dendê)|-(?:vassoura|dendê))|paibeiras?-minas)|ipó(?:-(?:c(?:a(?:r(?:neiro|ijó)|b(?:oclo|aça)|noa)|o(?:r(?:ação|da)|(?:br|l)a)|h(?:agas|umbo)|u[mn]anã|esto)|a(?:l(?:caçuz|ho)|r(?:acuã|c)o|marrar|gulha)|m(?:a(?:inibu|caco)|o(?:fumb|rceg)o|ucuna)|b(?:a(?:(?:mburra|rri)l|tata)|reu|oi)|j(?:a(?:b(?:ut[ái]|ota)|rrinha)|unta)|p(?:(?:a(?:in|lm)|oit)a|enas)|t(?:amanduá|ucunaré|imbó)|l(?:avadeira|eite)|v(?:aqueiro|iúva)|e(?:mbiri|scada)|im(?:pingem|bé)|g(?:ato|ota)|s(?:apo|eda)|(?:fo|re)go|quati|água)|s-(?:c(?:a(?:r(?:neiro|ijó)|b(?:oclo|aça)|noa)|o(?:r(?:ação|da)|(?:br|l)a)|u[mn]anã|hagas|esto)|a(?:l(?:caçuz|ho)|r(?:acuã|c)o|marrar|gulha)|m(?:a(?:inibu|caco)|o(?:fumb|rceg)o|ucuna)|b(?:a(?:(?:mburra|rri)l|tata)|reu|oi)|j(?:a(?:b(?:ut[ái]|ota)|rrinha)|unta)|p(?:(?:a(?:in|lm)|oit)a|enas)|t(?:amanduá|ucunaré|imbó)|l(?:avadeira|e
ite)|v(?:aqueiro|iúva)|e(?:mbiri|scada)|im(?:pingem|bé)|g(?:ato|ota)|s(?:apo|eda)|(?:fo|re)go|quati|água))|r(?:av(?:o(?:s-(?:(?:cabe(?:cinh|ç)|esperanç|sear)a|b(?:(?:astã|urr)o|ouba)|p(?:oeta|au)|defunto|tunes|urubu|amor)|-(?:(?:cabe(?:cinh|ç)|esperanç|sear)a|b(?:(?:astã|urr)o|ouba)|p(?:oeta|au)|defunto|tunes|urubu|amor))|in(?:a(?:s-(?:(?:lagartix|águ)a|ambrósio|tunes|pau)|-(?:(?:lagartix|águ)a|tunes|pau))|ho(?:s-(?:(?:lagartix|campin)a|defunto)|-(?:lagartixa|defunto))))|ista(?:s-(?:gal(?:inha|o)|mutum|peru)|-(?:gal(?:inha|o)|mutum|peru)|(?:is|l)-rocha))|e(?:bol(?:(?:a(?:s-(?:cheir|lob)|-lob)|inhas?-cheir)o|etas?-frança)|r(?:ej(?:as?-(?:caien|purg)|eiras?-purg)a|vejas?-pobre)|n(?:táureas?-jardim|ouras?-creta)|vadas?-jardim)|h(?:a(?:ga(?:s-(?:bauru|jesus)|-bauru)|scos?-leque)|u(?:p(?:ões|ão)-arroz|vas?-imbu))|u(?:tia(?:s-(?:rabo|pau)|-(?:rabo|pau))|(?:mbuc|i)as?-macaco)|ânhamo-manila)|p(?:a(?:u(?:s-(?:c(?:a(?:n(?:(?:galh|inan)a|deeiro|oas?|til)|m(?:peche|arão)|r(?:rapato|ne)|c(?:himbo|a)|i(?:bro|xa)|pitão|stor)|o(?:r(?:tiça|al)|n(?:ch|t)a|lher|bre)|h(?:a(?:pad|nc)a|i(?:cl|fr)e|eiro)|u(?:rt(?:ume|ir)|nanã|biú|tia)|erc?a|inzas|ruz)|s(?:a(?:n(?:t(?:ana|o)|gue)|p(?:ateir)?o|b(?:ão|iá)|ssafrás|lsa)|e(?:(?:rr|d)a|bo)|urriola|olar)|m(?:a(?:(?:n(?:jeriob|teig)|ri)a|(?:cac|str|lh)o)|o(?:(?:njol|rceg)o|quém|có)|(?:utamb|erd)a)|b(?:u(?:jarrona|gre|rro)|i(?:ch?o|lros)|a(?:rbas|lso)|o(?:[lt]o|ia)|r(?:incos|eu)|álsamo)|p(?:e(?:r(?:nambuco|eira)|nte)|r(?:eg(?:uiça|o)|aga)|i(?:ranha|lão)|o(?:mb|rc)o|ólvora)|l(?:a(?:g(?:arto|oa)|cre|nça)|e(?:(?:br|it)e|tras|pra)|i(?:vros|xa)|ágrima)|f(?:(?:a(?:[iv]|rinh)|ormig)a|(?:u[ms]|ígad)o|l(?:echas?|or)|e(?:bre|rro))|r(?:e(?:(?:spost|nd)a|(?:[gm]|in)o|de)|os(?:eira|as?)|a(?:inha|to))|e(?:s(?:p(?:inh|et)o|teira)|(?:rv(?:ilh)?|mbir)a|lefante)|a(?:(?:bóbor|ngol)a|r(?:ara|co)|l(?:ho|oé))|t(?:a(?:rtaruga|manco)|in(?:gui|ta)|ucano)|v(?:i(?:n(?:tém|ho)|ola)|e(?:ado|ia)|aca)|g(?:(?:asolin|om)a|ui(?:tarra|né))|j(?:erimum?|angada|udeu)|d(?:igestão|edal)|
n(?:avalha|ovato)|o(?:rvalho|laria)|(?:incens|óle)o|(?:zebr|águ)a|qui(?:abo|na))|-(?:c(?:a(?:n(?:(?:galh|inan)a|deeiro|oas?|til)|m(?:peche|arão)|r(?:rapato|ne)|c(?:himbo|a)|i(?:bro|xa)|pitão|stor)|o(?:r(?:tiça|al)|n(?:ch|t)a|lher|bre)|h(?:a(?:pad|nc)a|i(?:cl|fr)e|eiro)|u(?:rt(?:ume|ir)|nanã|biú|tia)|erc?a|inzas|ruz)|s(?:a(?:n(?:t(?:ana|o)|gue)|p(?:ateir)?o|b(?:ão|iá)|ssafrás|lsa)|e(?:(?:rr|d)a|bo)|urriola|olar)|m(?:a(?:(?:n(?:jeriob|teig)|ri)a|(?:cac|str|lh)o)|o(?:(?:njol|rceg)o|quém|có)|(?:utamb|erd)a)|b(?:u(?:jarrona|gre|rro)|i(?:ch?o|lros)|a(?:rbas|lso)|o(?:[lt]o|ia)|r(?:incos|eu)|álsamo)|p(?:e(?:r(?:nambuco|eira)|nte)|r(?:eg(?:uiça|o)|aga)|i(?:ranha|lão)|o(?:mb|rc)o|ólvora)|l(?:a(?:g(?:arto|oa)|cre|nça)|e(?:(?:br|it)e|tras|pra)|i(?:vros|xa)|ágrima)|f(?:(?:a(?:[iv]|rinh)|ormig)a|(?:u[ms]|ígad)o|l(?:echas?|or)|e(?:bre|rro))|r(?:e(?:(?:spost|nd)a|(?:[gm]|in)o|de)|os(?:eira|as?)|a(?:inha|to))|e(?:s(?:p(?:inh|et)o|teira)|(?:rv(?:ilh)?|mbir)a|lefante)|t(?:a(?:rtaruga|manco)|in(?:gui|ta)|ucano)|v(?:i(?:n(?:tém|ho)|ola)|e(?:ado|ia)|aca)|a(?:(?:bóbor|ngol)a|l(?:ho|oé)|rco)|g(?:(?:asolin|om)a|ui(?:tarra|né))|j(?:erimum?|angada|udeu)|d(?:igestão|edal)|n(?:avalha|ovato)|o(?:rvalho|laria)|(?:incens|óle)o|(?:zebr|águ)a|qui(?:abo|na))|xis?-pedra)|l(?:m(?:eir(?:a(?:s-(?:(?:(?:palmi|ce)r|igrej)a|madagascar|dendê|leque|tebas|vinho)|-(?:(?:(?:palmi|ce)r|igrej)a|madagascar|dendê|leque|tebas|vinho))|inhas?-petrópolis)|a(?:s-(?:c(?:hicote|acho)|igreja|leque)|-(?:c(?:hicote|acho)|igreja|leque)|tórias?-espinho)|i(?:tos?-ferrão|lhas?-papa))|ha(?:s-(?:(?:penach|caniç)o|guiné|água)|-(?:(?:penach|caniç)o|guiné|água))|os-(?:calenturas|maria))|r(?:ic(?:á(?:s-(?:esponjas|curtume)|-(?:esponjas|curtume))|aranas?-espinhos)|a(?:cuuba(?:s-lei(?:te)?|-lei(?:te)?)|sitas?-samambaiaçu|tudos?-praia)|go(?:s-(?:m(?:itra|orro)|cótula)|-(?:m(?:itra|orro)|cótula)))|p(?:o(?:ila(?:s-(?:espinho|holanda)|-(?:espinho|holanda))|ula(?:s-(?:espinho|holanda)|-(?:espinho|holanda)))|agaio(?:s-cole(?:ira|te)|-cole(?:ir
a|te)))|in(?:a(?:s-(?:s(?:apo|eda)|arbusto|penas|cuba)|-(?:s(?:apo|eda)|arbusto|penas|cuba))|eira(?:s-(?:c(?:ipó|uba)|leite)|-(?:c(?:ipó|uba)|leite)))|(?:c(?:o(?:vas?-macac|s?-golung)|as-rab)|ssarinhos?-(?:arribaç|ver)ã|nelas?-bugi)o|t(?:os?-c(?:a(?:rúncul|ien)|rist)a|i(?:nhos?-igapó|s?-goiás))|v(?:ões|ão)-java)|i(?:nh(?:eir(?:o(?:s-(?:(?:(?:pur|ri)g|casquinh)a|jerusalém|alepo)|-(?:(?:(?:pur|ri)g|casquinh)a|jerusalém|alepo))|inho(?:s-(?:jardim|sala)|-(?:jardim|sala)))|ões-(?:(?:cerc|purg)a|madagascar|ratos?)|ão-(?:(?:cerc|purg)a|madagascar|rato)|o(?:s-(?:flandres|riga)|-riga)|as?-raiz)|ment(?:a(?:s-(?:c(?:(?:aien|oro)a|heiro)|(?:ra[bt]|macac)o|g(?:alinha|entio)|bu(?:gre|ta)|queimar|água)|-(?:c(?:(?:aien|oro)a|heiro)|(?:ra[bt]|macac)o|g(?:alinha|entio)|bu(?:gre|ta)|queimar|água))|ões-c(?:aiena|heiro)|ão-c(?:aiena|heiro))|t(?:o(?:mb(?:a(?:s-(?:macaco|leite)|-(?:macaco|leite))|eiras?-marajó)|s-(?:água|saci)|-(?:água|saci))|a(?:ng(?:ueira(?:s-(?:cachorro|jardim)|-(?:cachorro|jardim))|as?-cachorro)|s?-erva)|eiras?-sinal)|olho(?:s-(?:(?:galinh|balei|onç)a|(?:soldad|tubarã)o|p(?:lanta|adre)|c(?:ação|obra)|faraó|urubu)|-(?:(?:galinh|balei|onç)a|(?:soldad|tubarã)o|p(?:lanta|adre)|c(?:ação|obra)|faraó|urubu))|(?:piras?-(?:máscar|prat)|quiás?-pedr|ão-purg)a|c(?:ões-tr(?:opeiro|epar)|ão-tropeiro)|xiricas?-bolas|raíbas?-pele)|e(?:r(?:a(?:s-(?:a(?:(?:guieir|lmeid)a|dvogado)|r(?:e(?:fego|i)|osa)|(?:cris|un)to|jesus|água)|-(?:a(?:(?:guieir|lmeid)a|dvogado)|r(?:e(?:fego|i)|osa)|(?:cris|un)to|jesus|água))|oba(?:s-(?:(?:pernambuc|reg)o|ca(?:ntagalo|mpos)|go(?:iás|mo)|minas)|-(?:(?:pernambuc|reg)o|ca(?:ntagalo|mpos)|go(?:iás|mo)|minas))|(?:iquit(?:o(?:s-(?:campin|ant)|-ant)|inhos?-vassour)|cevejos?-(?:ca[ms]|galinh))a|diz(?:es)?-alqueive|us?-sol)|na(?:chos?-capim|s-avestruz)|pinos?-(?:papagai|burr)o|ssegueiros?-abrir|quiás?-pedra)|u(?:rga(?:s-(?:c(?:a(?:i(?:tité|apó)|(?:boc|va)lo|rijó)|ereja)|(?:ve(?:ad|nt)|marinheir|genti)o|pa(?:ulista|stor)|nabiça)|-(?:c(?:a(?:i(?:tité|apó)|(?:boc|va
)lo|rijó)|ereja)|(?:ve(?:ad|nt)|marinheir|genti)o|pa(?:ulista|stor)|nabiça))|lg(?:a(?:s-(?:(?:a(?:rei|nt)|galinh|águ)a|bicho)|-(?:(?:a(?:rei|nt)|galinh|águ)a|bicho))|(?:ões|ão)-planta))|o(?:mb(?:a(?:s-(?:(?:(?:arribaç|sert)ã|espelh|band)o|mulata)|-(?:(?:(?:arribaç|sert)ã|espelh|band)o|mulata))|o(?:s-(?:montanha|leque)|-(?:montanha|leque)))|rco(?:s-(?:verrugas|ferro)|-(?:verrugas|ferro))|aia(?:s-(?:minas|cipó)|-(?:minas|cipó)))|ã(?:es-(?:p(?:o(?:rc(?:in)?o|bre)|ássaros)|gal(?:inha|o)|leite|cuco)|o-(?:p(?:orc(?:in)?o|ássaros)|gal(?:inha|o)|cuco))|l(?:uma(?:s-(?:príncipe|capim)|-(?:príncipe|capim))|átanos?-gênio|antas?-neve)|r(?:eguiça(?:s-(?:bentinho|coleira)|-(?:bentinho|coleira))|imaveras?-caiena)|ássaros?-f(?:andan|i)go|êssegos?-abrir)|f(?:lor(?:es-(?:c(?:a(?:(?:(?:r(?:nav|de)|m)a)?l|(?:sament|chimb|bocl)o)|o(?:(?:[iu]r|elh|c)o|ntas|bra|ral)|e(?:tim|ra)|hagas|iúme|uco)|p(?:a(?:(?:ssarinh|pagai|raís|vã)o|dre|lha|u)|e(?:licano|dra)|érolas)|m(?:a(?:r(?:acujá|iposa)|deira|io)|(?:(?:eren|osca)d|us)a|ico)|b(?:a(?:(?:b(?:eir|ad)|rbeir)o|unilha|ile)|eso[iu]ro)|a(?:(?:lgodã|njinh)o|ranha|bril|zar)|s(?:a(?:p(?:at)?o|ngue)|(?:ed|ol)a)|v(?:(?:iúv|ac)a|e(?:lu|a)do)|n(?:(?:espereir|oiv)a|atal)|l(?:is(?:ado)?|agartixa|ã)|(?:quaresm|dian|águ)a|(?:invern|índi|fog)o|e(?:spírito|nxofre)|t(?:rombeta|anino)|g(?:ra[mx]a|elo)|jesus)|-(?:c(?:a(?:(?:(?:r(?:nav|de)|m)a)?l|(?:sament|chimb|bocl)o)|o(?:(?:[iu]r|elh|c)o|ntas|bra|ral)|e(?:tim|ra)|hagas|iúme|uco)|p(?:a(?:(?:ssarinh|pagai|raís|vã)o|dre|lha|u)|e(?:licano|dra)|érolas)|m(?:a(?:r(?:acujá|iposa)|deira|io)|(?:(?:eren|osca)d|us)a|ico)|b(?:a(?:(?:b(?:eir|ad)|rbeir)o|unilha|ile)|eso[iu]ro)|a(?:(?:lgodã|njinh)o|ranha|bril|zar)|s(?:a(?:p(?:at)?o|ngue)|(?:ed|ol)a)|v(?:(?:iúv|ac)a|e(?:lu|a)do)|n(?:(?:espereir|oiv)a|atal)|(?:quaresm|dian|águ)a|(?:invern|índi|fog)o|e(?:spírito|nxofre)|l(?:agartixa|is|ã)|t(?:rombeta|anino)|g(?:ra[mx]a|elo)|jesus))|rut(?:a(?:s-(?:c(?:o(?:n(?:de(?:ssa)?|ta)|(?:dorn|ruj)a)|a(?:chorro|scavel|iapó)|utia)|g(?:(?:enti|a
l)o|uar(?:iba|á)|rude)|m(?:a(?:n(?:teig|il)a|caco)|orcego)|p(?:(?:a(?:pagai|vã)|omb|ã)o|erdiz)|sa(?:(?:pucainh|ír)a|b(?:ão|iá))|a(?:n(?:ambé|el)|rara)|v(?:(?:íbor|i)a|eado)|b(?:abad|urr)o|t(?:ucano|atu)|l(?:epra|obo)|jac(?:aré|u)|árvore|faraó|ema)|-(?:c(?:o(?:n(?:de(?:ssa)?|ta)|(?:dorn|ruj)a)|a(?:chorro|scavel|iapó)|utia)|g(?:(?:enti|al)o|uar(?:iba|á)|rude)|m(?:a(?:n(?:teig|il)a|caco)|orcego)|p(?:(?:a(?:pagai|vã)|omb|ã)o|erdiz)|sa(?:(?:pucainh|ír)a|b(?:ão|iá))|a(?:n(?:ambé|el)|rara)|v(?:(?:íbor|i)a|eado)|b(?:abad|urr)o|t(?:ucano|atu)|l(?:epra|obo)|jac(?:aré|u)|árvore|faraó|ema))|eira(?:s-(?:c(?:onde(?:ssa)?|achorro|utia)|(?:macac|tucan|burr|lob)o|p(?:(?:avã|omb)o|erdiz)|jac(?:aré|u)|arara|faraó)|-(?:c(?:onde(?:ssa)?|achorro|utia)|(?:macac|tucan|burr|lob)o|p(?:(?:avã|omb)o|erdiz)|jac(?:aré|u)|arara|faraó))|o(?:s-(?:c(?:a(?:xinguelê|chorro)|o(?:br|nt)a)|m(?:a(?:nteiga|caco)|orcego)|p(?:apagaio|erdiz)|burro|sabiá|imbé)|-(?:c(?:a(?:xinguelê|chorro)|o(?:br|nt)a)|m(?:a(?:nteiga|caco)|orcego)|p(?:apagaio|erdiz)|burro|sabiá|imbé)))|o(?:r(?:m(?:iga(?:s-(?:f(?:e(?:rrão|bre)|ogo)|r(?:a(?:spa|bo)|oça)|c(?:emitério|upim)|m(?:andioca|onte)|b(?:entinho|ode)|(?:imbaúv|onç)a|n(?:ovato|ós)|defunto)|-(?:f(?:e(?:rrão|bre)|ogo)|r(?:a(?:spa|bo)|oça)|c(?:emitério|upim)|m(?:andioca|onte)|b(?:entinho|ode)|(?:imbaúv|onç)a|n(?:ovato|ós)|defunto))|osa(?:s-(?:besteiros|darei)|-(?:besteiros|darei)))|no(?:s-ja(?:çanã|caré)|-ja(?:çanã|caré)))|lha(?:s-(?:s(?:a(?:n(?:tana|gue)|bão)|e(?:rr|d)a)|f(?:(?:ígad|og)o|igueira|ronte)|p(?:a(?:pagaio|dre|jé)|irarucu)|(?:comichã|bold?|gel)o|l(?:ança|eite|ouco)|(?:zeb|he|ta)ra|mangue|urubu)|-(?:s(?:a(?:n(?:tana|gue)|bão)|erra)|f(?:(?:ígad|og)o|igueira|ronte)|p(?:a(?:pagaio|dre|jé)|irarucu)|(?:comichã|bold?|gel)o|l(?:ança|eite|ouco)|(?:zeb|he|ta)ra|mangue|urubu))|cas?-capuz)|a(?:v(?:a(?:s-(?:(?:a(?:ngol|rar)|(?:mal|v)ac|r(?:osc|am)|sucupir|holand)a|c(?:a(?:labar|valo)|h(?:eiro|apa)|obra)|b(?:e(?:souro|lém)|ol(?:ach|ot)a)|(?:quebrant|engenh|ordáli)o|l(?:(?:ázar|ob
)o|ima)|t(?:ambaqui|onca)|p(?:orco|aca)|impin?gem)|-(?:(?:a(?:ngol|rar)|(?:mal|v)ac|r(?:osc|am)|sucupir|holand)a|b(?:e(?:souro|lém)|ol(?:ach|ot)a)|c(?:a(?:labar|valo)|(?:hap|obr)a)|(?:quebrant|ordáli)o|t(?:ambaqui|onca)|p(?:orco|aca)|l(?:ima|obo)|impin?gem))|eira(?:s-(?:impin?gem|berloque)|-(?:impin?gem|berloque))|inhas?-capoeira)|lc(?:ões|ão)-coleira)|e(?:ij(?:õe(?:s-(?:c(?:(?:o(?:br|rd)|er|ub)a|avalo)|g(?:u(?:ando|izos)|ado)|(?:árvor|azeit|frad)e|l(?:i(?:sbo|m)a|eite)|(?:jav|rol|soj)a|m(?:acáçar|etro)|po(?:mbinha|rco)|va(?:[cr]a|gem)|tropeiro|boi)|zinhos-capoeira)|ão(?:-(?:c(?:(?:ord|er|ub)a|avalo)|g(?:u(?:ando|izos)|ado)|(?:árvor|azeit|frad)e|l(?:i(?:sbo|m)a|eite)|(?:jav|rol|soj)a|m(?:acáçar|etro)|po(?:mbinha|rco)|va(?:[cr]a|gem)|tropeiro|boi)|zinho-capoeira))|(?:l(?:es)?-genti|nos?-cheir|tos?-botã)o)|i(?:g(?:ueira(?:s-(?:(?:lombrigueir|pit|go)a|b(?:engala|aco)|to(?:car|que)|jardim)|-(?:(?:lombrigueir|pit|go)a|b(?:engala|aco)|to(?:car|que)|jardim))|o(?:s-(?:(?:figueir|banan)a|r(?:echeio|ocha)|(?:tord|verã)o)|-(?:(?:figueir|banan)a|r(?:echeio|ocha)|(?:tord|verã)o)))|lária(?:s-(?:medina|guiné)|-(?:medina|guiné))|andeiras?-algodão)|u(?:mo(?:s-(?:(?:rapos|cord|folh)a|pa(?:isan|raís)o|jardim)|-(?:(?:rapos|cord|folh)a|pa(?:isan|raís)o|jardim))|ncho(?:s-(?:(?:florenç|águ)a|porco)|-(?:(?:florenç|águ)a|porco)))|éis-gentio)|b(?:a(?:nan(?:eir(?:a(?:s-(?:(?:madag[áa]sca|flo)r|(?:italian|papagai)o|sementes|jardim|corda|leque)|-(?:(?:madag[áa]sca|flo)r|(?:italian|papagai)o|sementes|jardim|corda|leque))|inha(?:s-(?:touceira|salão|flor)|-(?:touceira|salão|flor)))|a(?:s-(?:(?:m(?:orceg|acac)|papagai)o|s(?:ementes|ancho)|imbé)|-(?:(?:m(?:orceg|acac)|papagai)o|s(?:ementes|ancho)|imbé)))|tat(?:a(?:s-(?:p(?:e(?:rdiz|dra)|ur(?:ga|i)|orco)|a(?:(?:ngol|rrob)a|maro)|b(?:(?:ranc|ugi)o|ainha)|(?:cabocl|vead)o|t(?:aiuiá|iú)|escamas|rama)|-(?:p(?:e(?:rdiz|dra)|ur(?:ga|i)|orco)|a(?:(?:ngol|rrob)a|maro)|b(?:(?:ranc|ugi)o|ainha)|(?:cabocl|vead)o|t(?:aiuiá|iú)|escamas|rama))|inhas?-cobra)|g(?:a(
?:s-(?:(?:cabocl|tucan|lour)o|p(?:ombo|raia))|-(?:(?:cabocl|tucan|lour)o|p(?:ombo|raia)))|re(?:s-(?:(?:arei|lago)a|man(?:gue|ta)|penacho)|-(?:(?:arei|lago)a|man(?:gue|ta)|penacho))|os?-chumbo)|r(?:r(?:ete(?:s-(?:clérigo|eleitor|padre)|-(?:clérigo|eleitor|padre))|ig(?:udas?-espinho|as?-freira))|ba(?:-(?:(?:chib|timã)o|pa(?:ca|u)|lagoa)|s-(?:(?:chib|timã)o|lagoa|boi|pau)))|b(?:osa(?:s-(?:árvore|espiga|pau)|-(?:árvore|espiga|pau))|a(?:-(?:(?:camel|sap)o|boi)|s-(?:sapo|boi)))|c(?:u(?:r(?:aus?-(?:lajea|ban)do|is?-cerca)|(?:paris?-capoei|s?-ped)ra)|abas?-(?:azeit|lequ)e)|mbu(?:s-(?:(?:espinh|caniç)o|pescador|mobília)|-(?:(?:espinh|caniç)o|pescador|mobília))|leia(?:s-(?:b(?:arbatana|ico)|corcova|gomo)|-(?:b(?:arbatana|ico)|corcova|gomo))|i(?:acu(?:s-(?:espinho|chifre)|-(?:espinho|chifre))|nhas?-(?:espad|fac)a)|st(?:(?:i(?:ões|ão)-arrud|ardos?-rom)a|ões-velho)|d(?:ianas?-cheiro|ejos?-lista)|unilhas?-auacuri|únas?-fogo)|i(?:c(?:h(?:o(?:-(?:c(?:(?:a(?:rpintei|chor)r|est)o|o(?:nta|co)|hifre)|(?:(?:ester|bura)c|ouvid|rum)o|(?:(?:gali|u)nh|taquar|sed)a|p(?:a(?:rede|u)|orco|ena|é)|m(?:(?:edranç|osc)a|ato)|v(?:areja|eludo)|f(?:rade|ogo))|s-(?:c(?:(?:a(?:rpintei|chor|nast)r|est)o|o(?:nta|co)|hifre)|(?:m(?:edranç|osc)|(?:gali|u)nh|taquar|sed)a|(?:(?:ester|bura)c|ouvid|rum)o|p(?:a(?:rede|u)|orco|ena|é)|v(?:areja|eludo)|f(?:rade|ogo)))|eiros?-conta)|udas?-corso)|ribás?-pernambuco|telos?-gente)|o(?:r(?:boleta(?:s-(?:p(?:êssego|iracema)|a(?:moreira|lface)|(?:carvalh|band)o|gás)|-(?:p(?:êssego|iracema)|a(?:moreira|lface)|(?:carvalh|band)o|gás))|d(?:ões|ão)-(?:santiag|macac)o)|i(?:s-(?:carro|guará|deus)|-(?:carro|guará|deus)|tas?-bigodes)|a(?:is|l)-alicante|fes?-burro|tos-óculos)|r(?:edo(?:s-(?:(?:namor(?:ad)?|porc|vead|mur)o|espi(?:nho|ga)|cabeça|jardim)|-(?:(?:namor(?:ad)?|porc|vead|mur)o|espi(?:nho|ga)|cabeça|jardim))|inco(?:s-(?:s(?:a(?:guim?|uim)|urubim)|passarinho)|-(?:sa(?:guim?|uim)|passarinho))|(?:ucos?-salvaterr|ancos?-barit)a|ocas?-raiz)|e(?:s(?:ouro(?:s-(?:(?:limeir|águ)a|chif
re|maio)|-(?:(?:limeir|águ)a|chifre|maio))|ugos?-ovas)|l(?:droega(?:s-(?:inverno|cuba)|-(?:inverno|cuba))|a(?:s-felgueiras?|-felgueiras?))|ngalas?-camarão|tónicas?-água|ijus?-potó)|álsamo(?:s-(?:c(?:a(?:rtagena|nudo)|heiro)|(?:arce|tol)u|enxofre)|-(?:c(?:a(?:rtagena|nudo)|heiro)|(?:arce|tol)u|enxofre))|u(?:ch(?:o(?:s-(?:veado|boi|rã)|-(?:veado|boi|rã))|as?-purga)|t(?:iás?-vinagre|uas?-corvo)|xos?-holanda))|a(?:l(?:f(?:a(?:vaca(?:s-(?:c(?:(?:abocl|heir)o|obra)|vaqueiro)|-(?:c(?:(?:abocl|heir)o|obra)|vaqueiro))|ce(?:s-(?:(?:c(?:ordeir|ã)|porc)o|alger)|-(?:(?:c(?:ordeir|ã)|porc)o|alger))|zemas?-caboclo|fas?-provença)|inete(?:s-(?:toucar|dama)|-toucar))|m(?:a(?:s-(?:c(?:a(?:(?:boc|va)lo|çador)|(?:hichar|ânta)ro)|(?:tapui|pomb|gat)o|biafada|mestre)|-(?:c(?:a(?:(?:boc|va)lo|çador)|(?:hichar|ânta)ro)|(?:tapui|pomb|gat)o|biafada))|ecegueira(?:s-(?:cheiro|minas)|-(?:cheiro|minas)))|e(?:cri(?:ns-(?:c(?:ampina|heiro)|angola)|m-(?:c(?:ampina|heiro)|angola))|trias?-pau)|ho(?:s-(?:espanha|cheiro)|-(?:espanha|cheiro))|ba(?:troz(?:es)?-sobrancelha|coras?-laje)|g(?:odoeiros?-pernambuco|ibeiras?-dama)|cachofras?-jerusalém|amandas?-jacobina)|r(?:a(?:ç(?:á(?:s-(?:c(?:o(?:mer|roa)|heiro)|(?:umbig|vead)o|(?:pomb|ant)a|tinguijar|minas)|-(?:c(?:o(?:mer|roa)|heiro)|(?:umbig|vead)o|(?:pomb|ant)a|tinguijar|minas))|aris?-minhoca)|ticu(?:ns-(?:(?:espinh|cheir)o|(?:jangad|pac)a|boia?)|m-(?:(?:espinh|cheir)o|(?:jangad|pac)a|boia?))|nha(?:s-(?:água|coco)|-(?:água|coco))|(?:pocas?-cheir|rutas?-porc)o)|r(?:aia(?:s-(?:coroa|fogo)|-(?:coroa|fogo))|ozes-(?:telhad|rat)o|udas?-campinas)|oeira(?:s-(?:(?:goiá|mina)s|capoeira|bugre)|-(?:(?:goiá|mina)s|capoeira|bugre))|(?:lequi(?:ns|m)-caien|cos?-pip)a)|n(?:g(?:ico(?:s-(?:m(?:onte|ina)s|banhado|curtume)|-(?:m(?:onte|ina)s|banhado|curtume))|eli(?:ns|m)-(?:espinh|morceg)o|élicas?-rama)|a(?:n(?:ases-(?:caraguatá|agulha)|ás-(?:caraguatá|agulha))|mbés?-capuz)|dorinha(?:s-(?:bando|casa)|-(?:bando|casa))|ingas?-(?:espinh|macac)o|u(?:n?s|m)?-enchente|z(?:óis|ol)-lon
tra)|m(?:or(?:e(?:ira(?:s-(?:espinho|árvore)|-(?:espinho|árvore))|s-(?:(?:(?:vaquei|bur)r|hortelã)o|moça))|-(?:(?:(?:vaquei|bur)r|hortelã)o|moça))|e(?:ndo(?:i(?:ns-(?:árvore|veado)|m-(?:árvore|veado))|eiras?-coco)|ixa(?:s-(?:madagascar|espinho)|-(?:madagascar|espinho)))|êndoas?-(?:espinh|coc)o)|b(?:elha(?:s-(?:c(?:(?:achorr|hã)o|upim)|(?:rein|fog|our|sap)o|p(?:urga|au))|-(?:(?:rein|fog|our|sap)o|p(?:urga|au)|cupim))|(?:utuas?-batat|óbora-coro)a|r(?:icós?-macaco|aços?-vide))|s(?:a(?:s-(?:pa(?:pagaio|lha)|(?:barat|telh)a|sabre)|-(?:pa(?:pagaio|lha)|(?:barat|telh)a|sabre))|pargo(?:s-(?:jardim|sala)|-(?:jardim|sala))|so[bv]ios?-(?:cobr|folh)a)|ça(?:f(?:ate(?:s-(?:o[iu]ro|prata)|-(?:o[iu]ro|prata))|roeiras?-pernambuco)|ís?-caatinga)|g(?:ulh(?:(?:ões|ão)-(?:(?:trombe|pra)t|vel)a|as?-pastor)|rílicas?-rama)|zed(?:inha(?:s-(?:corumbá|goiás)|-(?:corumbá|goiás))|as-ovelha)|ve(?:nca(?:s-(?:espiga|minas)|-(?:espiga|minas))|s?-crocodilo)|(?:ipos?-montevid|carás?-v)éu|tu(?:ns|m)-galha)|m(?:a(?:r(?:acujá(?:s-(?:c(?:a(?:iena|cho)|o(?:rtiç|br)a|heiro)|(?:ga(?:rap|vet)|mochil)a|pe(?:riquito|dra)|est(?:rada|alo)|(?:alh|rat)o)|-(?:c(?:a(?:iena|cho)|o(?:rtiç|br)a|heiro)|(?:ga(?:rap|vet)|mochil)a|pe(?:riquito|dra)|est(?:rada|alo)|(?:alh|rat)o))|m(?:el(?:adas?-(?:ca(?:chorr|val)|invern|verã)o|(?:eir)?os?-bengala)|itas?-macaco)|reco(?:s-(?:pequim|ruão)|-(?:pequim|ruão))|imbondos?-chapéu|quesas?-belas)|c(?:a(?:co(?:s-(?:(?:cheir|band)o|noite|sabá)|-(?:(?:cheir|band)o|noite|sabá))|mbira(?:s-(?:(?:flech|pedr)a|serrote)|-(?:(?:flech|pedr)a|serrote))|quinhos?-bambá)|ieira(?:s-(?:(?:anáfeg|coro)a|boi)|-(?:(?:anáfeg|coro)a|boi))|elas?-(?:tabuleir|botã)o|ucus?-paca)|n(?:gue(?:s-(?:(?:(?:pend|bot)ã|sapateir|espet)o|obó)|-(?:(?:(?:pend|bot)ã|sapateir|espet)o|obó))|t(?:imento(?:s-(?:araponga|pobre)|-(?:araponga|pobre))|as?-bretão)|jeric(?:ões|ão)-(?:ceilã|molh)o|d(?:ibis?-juntas|acarus?-boi))|çã(?:s-(?:c(?:(?:[au]c|rav)o|ipreste|obra)|a(?:náfega|rrátel)|(?:espelh|prat)o|rosa|vime|boi)|-(?:c(?:(?:[au]c
|rav)o|ipreste|obra)|a(?:náfega|rrátel)|(?:espelh|prat)o|rosa|vime|boi))|t(?:inho(?:s-(?:agulhas|lisboa|sargo)|-(?:agulhas|lisboa|sargo))|o(?:s-(?:engodo|salema)|-(?:engodo|salema)))|m(?:(?:icas?-(?:ca(?:chorr|del)|porc)|(?:ões|ão)-cord)a|oeiro(?:s-(?:espinho|corda)|-(?:espinho|corda)))|lva(?:s-(?:(?:cheir|pendã)o|marajó)|-(?:marajó|pendão)|íscos?-pernambuco)|d(?:ressilvas?-cheiro|eiras?-rei)|itacas?-maximiliano|parás?-cametá)|o(?:s(?:ca(?:s-(?:b(?:a(?:nheir|gaç)o|ich(?:eira|o))|e(?:lefante|stábulo)|ca(?:valos?|sa)|f(?:reira|ogo)|(?:madei|u)ra|inverno)|-(?:b(?:a(?:nheir|gaç)o|ich(?:eira|o))|e(?:lefante|stábulo)|ca(?:valos?|sa)|(?:madei|u)ra|fogo)|t(?:éis-(?:setúbal|jesus)|el-(?:setúbal|jesus)))|quitos?-parede)|ela(?:s-(?:mutum|ema)|-(?:mutum|ema))|n(?:stros?-gila|cos?-peru|tes?-ouro)|(?:uriscos?-sement|reias?-mangu)e|c(?:itaíbas?-leite|hos?-orelhas)|longós?-colher)|u(?:r(?:ici(?:s-(?:(?:tabuleir|porc)o|lenha)|-(?:(?:tabuleir|porc)o|lenha))|uré(?:s-(?:canudo|pajés)|-(?:canudo|pajé))|ta(?:s-(?:cheiro|parida)|-parida))|s(?:go(?:s-(?:irlanda|perdão)|-(?:irlanda|perdão))|aranhos?-água)|tu(?:ns-(?:asso[bv]io|fava)|m-(?:asso[bv]io|fava))|çambés?-espinhos|ngunzás?-cortar)|i(?:lh(?:o(?:s-(?:cobr|águ)|-águ)a|ãs?-pendão)|mos(?:as?-vereda|os?-cacho)|neiras?-petrópolis|cos?-topete|jos?-cavalo|olos-capim)|el(?:(?:(?:ões|ão)-(?:cabocl|morceg|soldad)|oeiros?-soldad)o|(?:ros?-(?:coleir|águ)|ancias?-cobr)a))|e(?:rv(?:a(?:s-(?:m(?:a(?:l(?:eitas|aca)|caé)|u(?:lher|ro)|o[iu]ra|endigo)|p(?:a(?:(?:rid|in)a|ssarinho)|(?:ântan|iolh)o|ontada)|a(?:n(?:(?:dorinh|t)a|jinho|il)|l(?:finete|ho)|mor)|b(?:(?:(?:ascul|ic)h|otã)o|(?:esteir|álsam)os|ugre)|c(?:(?:abr(?:it)?|obr)a|h(?:eir|umb)o)|sa(?:n(?:t(?:iago|ana)|gue)|(?:le)?po)|l(?:a(?:(?:vadeir|c)a|garto)|ouco)|g(?:o(?:[mt]|iabeir)a|uiné|elo)|f(?:(?:og|um|i)o|ebra)|(?:r(?:ober|a)t|our)o|ja(?:raraca|buti)|impingem|esteira)|-(?:m(?:a(?:l(?:eitas|aca)|caé)|u(?:lher|ro)|o[iu]ra|endigo)|p(?:a(?:(?:rid|in)a|ssarinho)|(?:ântan|iolh)o|ontada)|a(?:l(?:fine
te|míscar|ho)|n(?:(?:dorinh|t)a|il)|mor)|b(?:(?:(?:ascul|ic)h|otã)o|(?:esteir|álsam)os|ugre)|sa(?:n(?:t(?:iago|ana)|gue)|(?:le)?po)|c(?:(?:abr(?:it)?|obr)a|(?:humb|ã)o)|l(?:a(?:(?:vadeir|c)a|garto)|ouco)|g(?:o(?:[mt]|iabeir)a|uiné|elo)|f(?:(?:og|um|i)o|ebra)|(?:r(?:ober|a)t|our)o|ja(?:raraca|buti)|impingem|esteira))|i(?:lha(?:s-(?:(?:cheir|pomb)o|(?:árvo|leb)re|(?:angol|vac)a)|-(?:(?:cheir|pomb)o|(?:árvo|leb)re|(?:angol|vac)a))|nhas?-parida))|s(?:p(?:i(?:n(?:h(?:o(?:s-(?:c(?:a(?:(?:chor|rnei)ro|çada)|r(?:isto|uz)|erca)|(?:bananeir|agulh|roset)a|(?:ladrã|tour|urs)o|j(?:erusalém|udeu)|mari(?:ana|cá)|vintém|deus)|-(?:c(?:a(?:(?:chor|rnei)ro|çada)|r(?:isto|uz)|erca)|(?:bananeir|agulh|roset)a|(?:ladrã|tour|urs)o|j(?:erusalém|udeu)|mari(?:ana|cá)|vintém|deus))|eiro(?:s-(?:c(?:a(?:rneiro|iena)|risto|erca)|j(?:erusalém|udeu)|a(?:gulh|meix)a|vintém)|-(?:c(?:a(?:rneiro|iena)|risto|erca)|j(?:erusalém|udeu)|a(?:gulh|meix)a|vintém))|as?-(?:carneir|vead)o)|afres?-cuba)|ga(?:s-(?:(?:sangu|leit)e|ferrugem|água)|-(?:(?:sangu|leit)e|ferrugem|água)))|onjas?-raiz)|c(?:a(?:móneas?-alepo|das?-jabuti)|ovas?-macaco|umas?-sangue)|tercos?-jurema)|mbira(?:s-(?:ca(?:rrapato|çador)|(?:porc|sap)o)|-(?:ca(?:rrapato|çador)|(?:porc|sap)o))|n(?:xertos?-passarinho|redadeiras?-borla))|g(?:r(?:a(?:m(?:a(?:s-(?:p(?:(?:ernambuc|ast)o|onta)|(?:forquilh|sananduv)a|c(?:oradouro|idade)|ja(?:cobina|rdim)|ma(?:rajó|caé)|adorno)|-(?:p(?:(?:ernambuc|ast)o|onta)|(?:forquilh|sananduv)a|c(?:oradouro|idade)|ja(?:cobina|rdim)|ma(?:rajó|caé)|adorno))|inha(?:s-(?:campinas|jacobina|raiz)|-(?:campinas|jacobina|raiz)))|vatá(?:s-(?:(?:moquec|agulh)a|c(?:o[iu]ro|erca)|(?:ganch|lajed)o|r(?:aposa|ede)|árvore|tingir)|-(?:(?:moquec|agulh)a|c(?:o[iu]ro|erca)|(?:ganch|lajed)o|r(?:aposa|ede)|árvore|tingir))|lhas?-crista)|ão(?:s-(?:(?:c(?:aval|humb)|(?:malu|bi)c|gal)o|p(?:orco|ulha))|-(?:(?:(?:malu|bi)c|(?:cav|g)al)o|p(?:orco|ulha))|zinhos?-galo)|inaldas?-viúva)|a(?:l(?:o(?:s-(?:p(?:enacho|luma)|b(?:ando|riga)|rebanho|fita|ebó)|-(?
:p(?:enacho|luma)|b(?:ando|riga)|rebanho|fita|ebó))|inha(?:s-(?:bugre|faraó|água)|-(?:bugre|faraó|água)))|fanhoto(?:s-(?:(?:(?:marmel|coqu)eir|arribaçã)o|(?:jurem|prag)a)|-(?:(?:(?:marmel|coqu)eir|arribaçã)o|(?:jurem|prag)a))|vi(?:ões-(?:(?:(?:colei|ser)r|queimad)a|a(?:nta|ruá)|penacho)|ão-(?:(?:(?:colei|ser)r|queimad)a|a(?:nta|ruá)|penacho))|meleira(?:-(?:(?:lombrigueir|p(?:in|ur)g)a|(?:cansaç|venen)o)|s-(?:(?:cansaç|venen)o|lombrigueiras|p(?:in|ur)ga))|to(?:-(?:madagáscar|algália)|s-algália)|r(?:oupas?-segunda|gantas-ferro))|u(?:a(?:birob(?:eira(?:s-(?:cachorro|minas)|-(?:cachorro|minas))|a(?:s-(?:cachorro|minas)|-(?:cachorro|minas)))|ricangas?-bengala)|iratãs?-coqueiro)|o(?:iab(?:a(?:s-(?:(?:espinh|macac)o|anta)|-(?:(?:espinh|macac)o|anta))|eiras?-(?:cuti|pac)a)|meiros?-minas|gós?-guariba|elas?-lobo)|e(?:rgeli(?:ns|m)-laguna|ngibres?-dourar)|irass(?:óis|ol)-batatas)|t(?:r(?:e(?:vo(?:s-(?:c(?:ar(?:retilha|valho)|heiro)|(?:se[ar]r|águ)a)|-(?:c(?:ar(?:retilha|valho)|heiro)|(?:se[ar]r|águ)a))|moço(?:s-(?:cheiro|jardim|minas)|-(?:cheiro|jardim|minas)))|i(?:go(?:s-(?:p(?:rioste|erdiz)|milagre|israel|verão)|-(?:p(?:rioste|erdiz)|milagre|israel|verão))|colino(?:s-c(?:hifre|rista)|-c(?:hifre|rista))|nca(?:is|l)-pau)|(?:aças?-bibliotec|épanos?-coro)a|omb(?:as?-elefante|etas?-arauto)|utas?-lago)|a(?:i(?:uiá(?:s-(?:c(?:omer|ipó)|pimenta|jardim|quiabo|goiás)|-(?:c(?:omer|ipó)|pimenta|jardim|quiabo|goiás))|nhas?-(?:cors|ri)o)|r(?:taruga(?:s-(?:couro|pente)|-(?:couro|pente))|umã(?:s-espinhos?|-espinhos?))|m(?:b(?:etarus?-espinh|ori[ls]-brav)o|anqueiras?-leite)|j(?:ujás?-(?:cabacinh|quiab)o|ás?-cobra)|(?:xizeiros?-tint|tus?-folh)a|b(?:ocas?-marajó|acos?-cão)|quaris?-cavalo)|i(?:n(?:gui(?:s-(?:c(?:(?:aien|ol)a|ipó)|(?:leit|peix)e)|-(?:c(?:(?:aien|ol)a|ipó)|(?:leit|peix)e))|hor(?:ões|ão)-lombriga)|mbó(?:s-(?:boticário|caiena|jacaré|peixe|raiz)|-(?:boticário|caiena|jacaré|peixe|raiz))|gres?-bengala)|o(?:m(?:at(?:e(?:s-(?:princesa|árvore)|-(?:princesa|árvore))|inhos?-capucho)|ilhos?
-creta)|(?:rós?-espinh|adas-cour)o|petes?-cardeal)|u(?:c(?:u(?:ns-(?:carnaúba|redes)|m-(?:carnaúba|redes))|anos?-cinta)|bar(?:ões|ão)-focinho|lipas?-jardim|ias?-areia)|e(?:m(?:betarus?-espinho|porãos?-coruche)|rebintina-quio)|úberas?-(?:invern|verã)o)|r(?:a(?:to(?:s-(?:p(?:a(?:lmatória|iol)|entes|raga)|(?:es(?:pinh|got)|algodã)o|(?:t(?:aquar|romb)|águ)a|c(?:ouro|asa)|fa(?:raó|va)|bambu)|-(?:p(?:a(?:lmatória|iol)|entes|raga)|(?:es(?:pinh|got)|algodã)o|(?:t(?:aquar|romb)|águ)a|c(?:ouro|asa)|fa(?:raó|va)|bambu))|ízes-(?:c(?:(?:edr|urv)o|o(?:bra|rvo)|h(?:eiro|á)|âmaras|ana)|b(?:(?:ar?beir|randã)o|ugre)|(?:angélic|mostard|quin)a|l(?:agarto|opes)|sol(?:teira)?|t(?:ucano|iú)|f(?:rade|el)|guiné|pipi)|iz-(?:c(?:(?:edr|urv)o|o(?:bra|rvo)|h(?:eiro|á)|âmaras|ana)|b(?:(?:ar?beir|randã)o|ugre)|(?:angélic|mostard|quin)a|l(?:agarto|opes)|sol(?:teira)?|t(?:ucano|iú)|f(?:rade|el)|guiné|pipi)|b(?:uge(?:ns|m)-cachorr|anetes?-caval)o|m(?:as?-bezerro|os?-seda)|pés?-saci)|o(?:s(?:a(?:-(?:(?:c(?:a(?:chorr|bocl)|h?ã)|[bl]ob|defunt|o[iu]r|musg)o|p(?:áscoa|au)|jericó|toucar)|s-(?:(?:c(?:a(?:chorr|bocl)|h?ã)|[bl]ob|defunt|o[iu]r|musg)o|jericó|páscoa|toucar))|ário(?:s-(?:jamb[ou]|ifá)|-(?:jamb[ou]|ifá))|e(?:tas?-pernambu|iras?-damas)co)|uxin(?:óis-(?:m(?:uralha|anaus)|(?:espadan|jav)a|caniços)|ol-(?:m(?:uralha|anaus)|(?:espadan|jav)a|caniços))|(?:balos?-(?:arei|galh)|az(?:es)?-bandeir)a|ca(?:s-(?:flores|eva)|-(?:flores|eva)))|(?:e(?:sedás?-cheir|des?-leã)|ábanos?-caval)o|inocerontes?-Java)|s(?:a(?:l(?:sa(?:s-(?:c(?:a(?:stanheiro|valos)|heiro|upim)|(?:roch|águ)a|burro)|-(?:c(?:a(?:stanheiro|valos?)|upim)|(?:roch|águ)a|burro)|parrilhas?-lisboa)|va(?:s-(?:pernambuco|marajó)|-(?:pernambuco|marajó))|amandras?-água)|r(?:a(?:ndis?-(?:(?:carangu|gargar)ej|espinh)o|(?:magos?-águ|s?-pit)a)|dinha(?:s-(?:ga(?:lha|to)|laje)|-(?:ga(?:lha|to)|laje))|go(?:s-(?:beiço|dente)|-(?:beiço|dente))|ros?-pito)|n(?:haç(?:o(?:s-(?:(?:(?:coqu|mamo)eir|fog)o|encontros)|-(?:(?:(?:coqu|mamo)eir|fog)o|encontros))|us?-(?:encon
t|mamoei)ro)|ãs?-samambaia)|p(?:(?:ucaias?-castanh|és?-capoeir)a|o(?:-chifres?|s-chifre))|gui(?:n?s|m)?-bigode|mambaias?-penacho|[bv]acus?-coroa)|u(?:rucucu(?:s-(?:p(?:ati|ind)oba|fogo)|-(?:p(?:ati|ind)oba|fogo))|ma(?:umeiras?-macaco|gres?-provença|rés?-pedras))|orgo(?:s-(?:(?:vassour|espig)a|pincel|alepo)|-(?:(?:vassour|espig)a|pincel|alepo))|iris?-coral)|j(?:a(?:ca(?:r(?:andá(?:s-(?:campinas|espinho|sangue)|-(?:campinas|espinho|sangue))|és?-óculos)|(?:tir(?:ões|ão)-capot|s?-pobr)e)|smi(?:ns-(?:c(?:a(?:chorro|iena)|erca)|soldado|leite)|m-(?:c(?:a(?:chorro|iena)|erca)|soldado|leite))|(?:mb(?:eir)?os?-malac|lapas?-lisbo|puçá-coleir)a|tobá(?:s-(?:porco|anta)|-(?:porco|anta))|buticab(?:eiras?-campinas|as?-cipó)|r(?:aracas?-agosto|rinhas?-franja))|u(?:n(?:co(?:s-(?:c(?:a(?:ngalh|br)|obr)a|banhado)|-(?:c(?:a(?:ngalh|br)|obr)a|banhado))|ta(?:s-c(?:alangro|obra)|-c(?:alangro|obra))|ças-c(?:heiro|onta))|á(?:s-c(?:apote|omer)|-c(?:apote|omer))|rubebas?-espinho|ciris?-comer|quis?-cerca)|o(?:ões-(?:santarém|barros?|leite|puça)|ão-(?:santarém|barro|leite|puça))|e(?:quitibás?-agulheir|taís?-pernambuc)o|i(?:queranas?-goiás|tiranas?-leite))|l(?:i(?:m(?:(?:a(?:s-(?:cheir|umbig|bic)|-(?:umbig|bic))|eiras?-umbig)o|ões-(?:c(?:aiena|heiro)|galinha)|ão-(?:c(?:aiena|heiro)|galinha)|os?-manta)|n(?:ho(?:s-(?:raposa|cuco)|-(?:raposa|cuco))|gu(?:eir(?:ões|ão)-canud|ados?-ri)o)|x(?:a(?:s-(?:lei|pau)|-(?:lei|pau))|inhas?-fundura))|a(?:ranj(?:a(?:s-(?:(?:terr|onç)a|umbigo)|-(?:umbigo|onça))|eiras?-vaqueiro)|g(?:art(?:as?-(?:vidr|fog)o|os?-água)|ostas?-espinho)|lás?-cintura)|e(?:it(?:e(?:s-(?:ga(?:meleir|linh)a|cachorro)|-(?:ga(?:meleir|linh)a|cachorro)|iras?-espinho)|ugas?-burro)|sma-conchinha)|o(?:ur(?:eiro(?:s-(?:jardim|apolo)|-(?:jardim|apolo))|os?-cheiro)|ireiros?-apolo)|u(?:(?:tos-quaresm|vas-pastor)a|zernas?-sequeiro)|írios?-petrópolis)|v(?:a(?:sso(?:urinha(?:s-(?:(?:relógi|botã)o|varrer)|-(?:(?:relógi|botã)o|varrer))|ira(?:s-(?:fe(?:iticeira|rro)|bruxa)|-(?:feiticeir|brux)a))|ra(?:s-(?:f
oguete|o[iu]ro|canoa)|-o[iu]ro)|les?-arinto)|e(?:r(?:g(?:onhas?-estudante|as?-jabuti)|(?:melhinhas?-galh|ças?-cã)o)|spa(?:s-(?:rodeio|cobra)|-(?:rodeio|cobra))|l(?:ames?-cheiro|udos?-penca)|ados?-virgínia|nenos?-porco)|i(?:d(?:eiras?-enforcado|oeiros?-papel)|oletas?-(?:par|da)ma|nháticos?-espinho)|oador(?:es)?-pedra)|qu(?:i(?:n(?:a(?:s-(?:(?:per(?:nambuc|iquit)|vead)o|c(?:errado|aiena|ipó)|r(?:e(?:mígi|g)o|aiz)|goiás)|-(?:(?:per(?:nambuc|iquit)|vead)o|c(?:errado|aiena|ipó)|r(?:e(?:mígi|g)o|aiz)|goiás))|gombós?-(?:espinh|cheir)o)|ab(?:o(?:s-(?:c(?:aiena|heiro|ipó)|(?:angol|quin)a)|-(?:c(?:aiena|heiro|ipó)|(?:angol|quin)a)|ranas?-espinho)|eiros?-angola)|(?:gombós?-cheir|to-pernambuc)o|bondos?-água)|ati(?:s-(?:bando|vara)|-(?:bando|vara))|ássias?-caiena)|á(?:rvore(?:s-(?:(?:bálsam|incens|ranc?h|seb)o|(?:gra(?:lh|x)|orquíde)a|c(?:hocalho|oral|uia)|l(?:ótus|eite|ã)|(?:jud|vel)as|a(?:rr|n)oz|pagode|natal)|-(?:(?:bálsam|incens|ranc?h|seb)o|(?:gra(?:lh|x)|orquíde)a|c(?:hocalho|oral|uia)|l(?:ótus|eite|ã)|(?:jud|vel)as|a(?:rr|n)oz|pagode|natal))|(?:gu(?:as?-colóni|ias?-poup)|caros?-galinh)a)|i(?:n(?:ha(?:me(?:s-(?:c(?:oriolá|ão)|lagartixa|enxerto|benim)|-(?:c(?:oriolá|ão)|lagartixa|enxerto|benim))|íbas?-rego)|censos?-caiena|gás?-fogo)|mb(?:(?:ur(?:ana(?:s-(?:c(?:ambã|heir)|espinh)|-(?:espinh|cambã))|is?-cachorr)|aúbas?-(?:cheir|vinh))o|és?-(?:amarra|come)r)|p(?:ês?-impingem|ecas?-cuiabá)|xoras-cheiro|scas?-sola)|n(?:o(?:z(?:-(?:b(?:a(?:tauá|nda)|ugre)|co(?:(?:br|l)a|co)|(?:arec|galh)a)|es-(?:(?:co(?:br|l)|arec|galh)a|b(?:a(?:tauá|nda)|ugre)))|gueira(?:s-(?:cobra|pecã)|-(?:cobra|pecã)))|a(?:(?:rciso(?:-(?:invern|cheir)|s-cheir)|valhas?-macac)o|nás?-raposa)|iqui(?:ns-(?:areia|saco)|m-areia)|ené(?:ns|m)-galinha|ós-cachorro)|u(?:va(?:s-(?:(?:(?:espin|fac)h|g(?:enti|al)|c(?:heir|ã)|urs)o|r(?:ato|ei)|praia|obó)|-(?:(?:(?:espin|fac)h|g(?:enti|al)|urs|cã)o|praia|obó|rei))|(?:irapurus?-band|xis?-morceg|bás?-fach)o|queté(?:s-(?:água|obó)|-(?:água|obó))|m(?:baranas?-abelha|iris?-cheiro)
|apuçás?-coleira|ntués?-obó)|h(?:ortelã(?:s-(?:c(?:a(?:mpina|valo)|heiro)|b(?:urro|oi)|leite)|-(?:c(?:a(?:mpina|valo)|heiro)|b(?:urro|oi)|leite))|idras?-água)|o(?:liveira(?:s-(?:marrocos|cheiro)|-(?:marrocos|cheiro))|iti(?:s-porcoóleo-copaíbaóleos-copaíba|-porco)|stras?-pobre)|ç(?:ana-áçúcar|or-Rosa)|ébanos?-zanzibar|xexéu-bananeira|Grão-Bico))\b">
<!ENTITY hyphenised_expressions "(?U)\b(?!feij(?:ão|ões)-frade)((?:c(?:a(?:r(?:rap(?:icho(?:s-(?:c(?:a(?:(?:rneir|val)o|lçada)|igana)|(?:agu|ove)lha|l(?:inho|ã)|boi)|-(?:c(?:a(?:(?:rneir|val)o|lçada)|igana)|(?:agu|ove)lha|l(?:inho|ã)|boi))|ato(?:s-(?:p(?:assarinho|eixe)|(?:caval|sap)o|galinha|boi)|-(?:p(?:assarinho|eixe)|(?:caval|sap)o|galinha|boi)))|u(?:ru(?:s-(?:(?:s(?:oldad|ap)|cach|vead)o|espi(?:nho|ga)|po(?:mba|rco))|-(?:(?:s(?:oldad|ap)|cach|vead)o|espi(?:nho|ga)|po(?:mba|rco)))|atás?-pau)|d(?:o(?:s-(?:(?:(?:bur|ou)r|vis[cg])o|co(?:chonilha|alho|mer)|isca)|-(?:(?:(?:bur|ou)r|vis[cg])o|co(?:chonilha|alho|mer)|isca))|ea(?:is|l)-poupa)|á(?:s-(?:(?:sapateir|cabocl|espinh)o|(?:angol|pedr|águ)a|jardim)|-(?:(?:sapateir|cabocl|espinh)o|(?:pedr|águ)a|jardim))|a(?:n(?:ha(?:s-(?:viveir|toc|ri)|-(?:viveir|toc))o|guejos?-pedra)|guatás?-jardim)|(?:v(?:ões|ão)-ferreir|mim-cártam)o|nes-donzela)|p(?:i(?:ns-(?:b(?:o(?:t(?:ão|a)|lota|de)|a(?:ndeira|tatais)|u(?:cha|rro)|ezerro)|c(?:a(?:(?:rneir|val)o|(?:piva|b)ra)|o(?:ntas|rte|co)|heiro|uba)|t(?:(?:artarug|ouceir)a|e(?:nerife|so))|(?:m(?:a(?:rrec|nad)|ul)|esteir|égu)a|p(?:(?:ernambuc|ast|omb)o|lanta)|f(?:o(?:rquilha|go)|lecha|eixe)|r(?:o(?:[ls]a|des)|ebanho|aiz)|a(?:n(?:gola|dar)|çude)|s(?:apo|oca)|diamante|lastro|natal|itu)|m-(?:b(?:o(?:t(?:ão|a)|lota|de)|a(?:ndeira|tatais)|u(?:cha|rro)|ezerro)|c(?:a(?:(?:rneir|val)o|(?:piva|b)ra)|o(?:ntas|rte|co)|heiro|uba)|(?:m(?:a(?:rrec|nad)|ul)|esteir|égu|soc)a|t(?:(?:artarug|ouceir)a|e(?:nerife|so))|p(?:(?:ernambuc|ast|omb)o|lanta)|f(?:o(?:rquilha|go)|lecha|eixe)|r(?:o(?:[ls]a|des)|ebanho|aiz)|a(?:n(?:gola|dar)|çude)|d(?:iamante|eus)|lastro|natal|itu)|tã(?:es|o)-sa(?:ír|l)a|xinguis?-bicho)|elas?-viúva)|n(?:a(?:s-(?:(?:(?:chei|bur)r|passarinh|macac)o|(?:v(?:asso[iu]|íbo)r|frech|roc)a|(?:jacar|imb)é|elefante|açúcar|urubu)|-(?:(?:v(?:asso[iu]|íbo)r|frech|roc)a|(?:passarinh|macac|burr)o|(?:jacar|imb)é|elefante|açúcar|urubu)|fístula(?:s-(?:igapó|lagoa|boi)|-(?:igapó|lagoa|boi)))|el(?:a(?:s-(?:c
(?:a(?:poeira|tarro)|(?:eilã|heir)o|utia)|v(?:e(?:ad|lh)o|argem)|g(?:arça|oiás)|papagaio|jacamim|ema)|-(?:c(?:a(?:poeira|tarro)|utia)|v(?:argem|eado)|g(?:arça|oiás)|papagaio|jacamim|ema))|eira(?:s-(?:cheiro|ema)|-(?:cheiro|ema)))|udo(?:s-(?:cachimbo|lagoa)|-(?:cachimbo|lagoa))|(?:ários?-franç|iços?-águ)a|sanç(?:ões|ão)-leite|oés?-botão)|s(?:t(?:a(?:nh(?:a(?:-(?:(?:á(?:fric|gu)|a(?:rar|nt))a|m(?:oçambique|acaco|inas)|c(?:aiaté|utia)|p(?:eixe|uri)|jatobá|bugre)|s-(?:(?:á(?:fric|gu)|a(?:rar|nt))a|m(?:oçambique|acaco|inas)|c(?:aiaté|utia)|p(?:eixe|uri)|bugre))|eiros?-minas)|s?-correr)|or(?:es)?-montanha)|c(?:a(?:s-(?:carvalho|jacaré|anta)|-(?:carvalho|jacaré|noz))|o(?:s-(?:cavalo|jabuti|tatu)|-(?:cavalo|jabuti|tatu))|udo(?:s-(?:enfeite|aranha)|-(?:enfeite|aranha))))|m(?:a(?:r(?:á(?:s-(?:(?:c(?:aval|heir)|espinh)o|b(?:ilro|oi)|flecha)|-(?:(?:c(?:aval|heir)|espinh)o|b(?:ilro|oi)|flecha))|ões-(?:pe(?:nedo|dra)|estalo|areia)|ão-(?:pe(?:nedo|dra)|estalo|areia))|le(?:ões-(?:pedreira|asas)|ão-(?:pedreira|asas)))|b(?:ará(?:s-(?:c(?:h(?:eir|umb)o|apoeira)|espinho|lixa)|-(?:c(?:h(?:eir|umb)o|apoeira)|espinho|lixa))|oatãs?-leite)|urus?-cheiro)|c(?:himbo(?:s-(?:(?:maca|tur)co|jabuti)|-(?:(?:maca|tur)co|jabuti))|au(?:s-(?:ca(?:racas|iena)|mico)|-(?:ca(?:racas|iena)|mico))|tos?-cabeça)|b(?:a(?:(?:ças?-trombet|cinhas?-cobr)a|s-(?:igreja|ladrão|peixe)|-(?:igreja|ladrão|peixe))|u(?:mbos?-azeite|rés?-orelha))|val(?:inho(?:-(?:judeu|deus|cão)|s-judeu)|o-cão)|fé(?:s-b(?:agueio|ugre)|-b(?:agueio|ugre))|t(?:ingueiros?-porc|otas?-espinh)o|avuranas?-cunhã)|o(?:c(?:o(?:-(?:(?:b(?:acai(?:aú|u)b|ocaiuv)|p(?:almeir|indob|urg)|quar(?:esm|t)|oitav)a|v(?:a(?:queiro|ssoura)|inagre|eado)|c(?:(?:a(?:cho|ta)rr|igan)o|olher)|(?:espinh|rosári|macac|óle)o|i(?:ndaiá|ri)|(?:gur|a)iri|na(?:tal|iá)|dendê)|s-(?:(?:b(?:acai(?:aú|u)b|ocaiuv)|p(?:almeir|indob|urg)|quar(?:esm|t)|oitav)a|v(?:a(?:queiro|ssoura)|inagre|eado)|c(?:(?:a(?:cho|ta)rr|igan)o|olher)|(?:espinh|rosári|macac|óle)o|i(?:ndaiá|ri)|na(?:tal|iá)|dend
ê|guriri))|(?:honilhas?-cer|as?-águ)a)|bra(?:-(?:c(?:a(?:p(?:elo|im)|scavel|belo|ju)|o(?:lchete|ral)|ipó)|(?:es[cp]ad|ferradur|barat|águ)a|(?:v(?:ead|idr)|lix|oc)o|a(?:r(?:eia)?|sa)|pernas|ratos?)|s-(?:c(?:a(?:p(?:elo|im)|scavel|belo|ju)|o(?:lchete|ral)|ipó)|(?:ferradur|barat|espad|águ)a|(?:v(?:ead|idr)|lix|oc)o|a(?:r(?:eia)?|sa)|pernas|ratos?))|l(?:a(?:-(?:(?:(?:sapatei|zor)r|caval)o|peixe)|s-(?:(?:caval|zorr)o|peixe))|eir(?:o(?:s-(?:(?:band|choc)o|sapé)|-(?:(?:band|choc)o|sapé))|as?-sapé))|r(?:(?:uj(?:as?|ão)-igrej|tiças?-montanh|-ros)a|vina(?:s-(?:corso|linha)|-(?:corso|linha))|d(?:ões|ão)-frade|reias?-inverno)|e(?:rana(?:s-(?:(?:caravel|min)as|pernambuco)|-(?:(?:caravel|min)as|pernambuco))|ntro(?:s-caboclos|-caboclo))|gumelo(?:s-(?:c(?:aboclo|hapéu)|(?:sangu|leit)e|paris)|-(?:c(?:aboclo|hapéu)|(?:sangu|leit)e|paris))|uve(?:s-(?:a(?:dorno|reia)|(?:saboi|águ)a|cortar)|-(?:a(?:dorno|reia)|(?:saboi|águ)a|cortar))|n(?:gonha(?:s-(?:caixeta|bugre|goiás)|-(?:caixeta|bugre|goiás))|durus?-sangue|tas?-cabra)|irana(?:s-(?:(?:caravel|min)as|pernambuco)|-(?:(?:caravel|min)as|pernambuco))|(?:xas?-(?:d(?:am|on)|freir)|mer(?:es)?-arar|tovias?-poup)a|queiro(?:s-(?:vassoura|dendê)|-(?:vassoura|dendê))|paibeiras?-minas)|ipó(?:-(?:c(?:a(?:r(?:neiro|ijó)|b(?:oclo|aça)|noa)|o(?:r(?:ação|da)|(?:br|l)a)|h(?:agas|umbo)|u[mn]anã|esto)|a(?:l(?:caçuz|ho)|r(?:acuã|c)o|marrar|gulha)|m(?:a(?:inibu|caco)|o(?:fumb|rceg)o|ucuna)|b(?:a(?:(?:mburra|rri)l|tata)|reu|oi)|j(?:a(?:b(?:ut[ái]|ota)|rrinha)|unta)|p(?:(?:a(?:in|lm)|oit)a|enas)|t(?:amanduá|ucunaré|imbó)|l(?:avadeira|eite)|v(?:aqueiro|iúva)|e(?:mbiri|scada)|im(?:pingem|bé)|g(?:ato|ota)|s(?:apo|eda)|(?:fo|re)go|quati|água)|s-(?:c(?:a(?:r(?:neiro|ijó)|b(?:oclo|aça)|noa)|o(?:r(?:ação|da)|(?:br|l)a)|u[mn]anã|hagas|esto)|a(?:l(?:caçuz|ho)|r(?:acuã|c)o|marrar|gulha)|m(?:a(?:inibu|caco)|o(?:fumb|rceg)o|ucuna)|b(?:a(?:(?:mburra|rri)l|tata)|reu|oi)|j(?:a(?:b(?:ut[ái]|ota)|rrinha)|unta)|p(?:(?:a(?:in|lm)|oit)a|enas)|t(?:amanduá|ucunaré|imbó)|l(?:avadei
ra|eite)|v(?:aqueiro|iúva)|e(?:mbiri|scada)|im(?:pingem|bé)|g(?:ato|ota)|s(?:apo|eda)|(?:fo|re)go|quati|água))|r(?:av(?:o(?:s-(?:(?:cabe(?:cinh|ç)|esperanç|sear)a|b(?:(?:astã|urr)o|ouba)|p(?:oeta|au)|defunto|tunes|urubu|amor)|-(?:(?:cabe(?:cinh|ç)|esperanç|sear)a|b(?:(?:astã|urr)o|ouba)|p(?:oeta|au)|defunto|tunes|urubu|amor))|in(?:a(?:s-(?:(?:lagartix|águ)a|ambrósio|tunes|pau)|-(?:(?:lagartix|águ)a|tunes|pau))|ho(?:s-(?:(?:lagartix|campin)a|defunto)|-(?:lagartixa|defunto))))|ista(?:s-(?:gal(?:inha|o)|mutum|peru)|-(?:gal(?:inha|o)|mutum|peru)|(?:is|l)-rocha))|e(?:bol(?:(?:a(?:s-(?:cheir|lob)|-lob)|inhas?-cheir)o|etas?-frança)|r(?:ej(?:as?-(?:caien|purg)|eiras?-purg)a|vejas?-pobre)|n(?:táureas?-jardim|ouras?-creta)|vadas?-jardim)|h(?:a(?:ga(?:s-(?:bauru|jesus)|-bauru)|scos?-leque)|u(?:p(?:ões|ão)-arroz|vas?-imbu))|u(?:tia(?:s-(?:rabo|pau)|-(?:rabo|pau))|(?:mbuc|i)as?-macaco)|ânhamo-manila)|p(?:a(?:u(?:s-(?:c(?:a(?:n(?:(?:galh|inan)a|deeiro|oas?|til)|m(?:peche|arão)|r(?:rapato|ne)|c(?:himbo|a)|i(?:bro|xa)|pitão|stor)|o(?:r(?:tiça|al)|n(?:ch|t)a|lher|bre)|h(?:a(?:pad|nc)a|i(?:cl|fr)e|eiro)|u(?:rt(?:ume|ir)|nanã|biú|tia)|erc?a|inzas|ruz)|s(?:a(?:n(?:t(?:ana|o)|gue)|p(?:ateir)?o|b(?:ão|iá)|ssafrás|lsa)|e(?:(?:rr|d)a|bo)|urriola|olar)|m(?:a(?:(?:n(?:jeriob|teig)|ri)a|(?:cac|str|lh)o)|o(?:(?:njol|rceg)o|quém|có)|(?:utamb|erd)a)|b(?:u(?:jarrona|gre|rro)|i(?:ch?o|lros)|a(?:rbas|lso)|o(?:[lt]o|ia)|r(?:incos|eu)|álsamo)|p(?:e(?:r(?:nambuco|eira)|nte)|r(?:eg(?:uiça|o)|aga)|i(?:ranha|lão)|o(?:mb|rc)o|ólvora)|l(?:a(?:g(?:arto|oa)|cre|nça)|e(?:(?:br|it)e|tras|pra)|i(?:vros|xa)|ágrima)|f(?:(?:a(?:[iv]|rinh)|ormig)a|(?:u[ms]|ígad)o|l(?:echas?|or)|e(?:bre|rro))|r(?:e(?:(?:spost|nd)a|(?:[gm]|in)o|de)|os(?:eira|as?)|a(?:inha|to))|e(?:s(?:p(?:inh|et)o|teira)|(?:rv(?:ilh)?|mbir)a|lefante)|a(?:(?:bóbor|ngol)a|r(?:ara|co)|l(?:ho|oé))|t(?:a(?:rtaruga|manco)|in(?:gui|ta)|ucano)|v(?:i(?:n(?:tém|ho)|ola)|e(?:ado|ia)|aca)|g(?:(?:asolin|om)a|ui(?:tarra|né))|j(?:erimum?|angada|udeu)|d(?:igestão|ed
al)|n(?:avalha|ovato)|o(?:rvalho|laria)|(?:incens|óle)o|(?:zebr|águ)a|qui(?:abo|na))|-(?:c(?:a(?:n(?:(?:galh|inan)a|deeiro|oas?|til)|m(?:peche|arão)|r(?:rapato|ne)|c(?:himbo|a)|i(?:bro|xa)|pitão|stor)|o(?:r(?:tiça|al)|n(?:ch|t)a|lher|bre)|h(?:a(?:pad|nc)a|i(?:cl|fr)e|eiro)|u(?:rt(?:ume|ir)|nanã|biú|tia)|erc?a|inzas|ruz)|s(?:a(?:n(?:t(?:ana|o)|gue)|p(?:ateir)?o|b(?:ão|iá)|ssafrás|lsa)|e(?:(?:rr|d)a|bo)|urriola|olar)|m(?:a(?:(?:n(?:jeriob|teig)|ri)a|(?:cac|str|lh)o)|o(?:(?:njol|rceg)o|quém|có)|(?:utamb|erd)a)|b(?:u(?:jarrona|gre|rro)|i(?:ch?o|lros)|a(?:rbas|lso)|o(?:[lt]o|ia)|r(?:incos|eu)|álsamo)|p(?:e(?:r(?:nambuco|eira)|nte)|r(?:eg(?:uiça|o)|aga)|i(?:ranha|lão)|o(?:mb|rc)o|ólvora)|l(?:a(?:g(?:arto|oa)|cre|nça)|e(?:(?:br|it)e|tras|pra)|i(?:vros|xa)|ágrima)|f(?:(?:a(?:[iv]|rinh)|ormig)a|(?:u[ms]|ígad)o|l(?:echas?|or)|e(?:bre|rro))|r(?:e(?:(?:spost|nd)a|(?:[gm]|in)o|de)|os(?:eira|as?)|a(?:inha|to))|e(?:s(?:p(?:inh|et)o|teira)|(?:rv(?:ilh)?|mbir)a|lefante)|t(?:a(?:rtaruga|manco)|in(?:gui|ta)|ucano)|v(?:i(?:n(?:tém|ho)|ola)|e(?:ado|ia)|aca)|a(?:(?:bóbor|ngol)a|l(?:ho|oé)|rco)|g(?:(?:asolin|om)a|ui(?:tarra|né))|j(?:erimum?|angada|udeu)|d(?:igestão|edal)|n(?:avalha|ovato)|o(?:rvalho|laria)|(?:incens|óle)o|(?:zebr|águ)a|qui(?:abo|na))|xis?-pedra)|l(?:m(?:eir(?:a(?:s-(?:(?:(?:palmi|ce)r|igrej)a|madagascar|dendê|leque|tebas|vinho)|-(?:(?:(?:palmi|ce)r|igrej)a|madagascar|dendê|leque|tebas|vinho))|inhas?-petrópolis)|a(?:s-(?:c(?:hicote|acho)|igreja|leque)|-(?:c(?:hicote|acho)|igreja|leque)|tórias?-espinho)|i(?:tos?-ferrão|lhas?-papa))|ha(?:s-(?:(?:penach|caniç)o|guiné|água)|-(?:(?:penach|caniç)o|guiné|água))|os-(?:calenturas|maria))|r(?:ic(?:á(?:s-(?:esponjas|curtume)|-(?:esponjas|curtume))|aranas?-espinhos)|a(?:cuuba(?:s-lei(?:te)?|-lei(?:te)?)|sitas?-samambaiaçu|tudos?-praia)|go(?:s-(?:m(?:itra|orro)|cótula)|-(?:m(?:itra|orro)|cótula)))|p(?:o(?:ila(?:s-(?:espinho|holanda)|-(?:espinho|holanda))|ula(?:s-(?:espinho|holanda)|-(?:espinho|holanda)))|agaio(?:s-cole(?:ira|te)|-cole(
?:ira|te)))|in(?:a(?:s-(?:s(?:apo|eda)|arbusto|penas|cuba)|-(?:s(?:apo|eda)|arbusto|penas|cuba))|eira(?:s-(?:c(?:ipó|uba)|leite)|-(?:c(?:ipó|uba)|leite)))|(?:c(?:o(?:vas?-macac|s?-golung)|as-rab)|ssarinhos?-(?:arribaç|ver)ã|nelas?-bugi)o|t(?:os?-c(?:a(?:rúncul|ien)|rist)a|i(?:nhos?-igapó|s?-goiás))|v(?:ões|ão)-java)|i(?:nh(?:eir(?:o(?:s-(?:(?:(?:pur|ri)g|casquinh)a|jerusalém|alepo)|-(?:(?:(?:pur|ri)g|casquinh)a|jerusalém|alepo))|inho(?:s-(?:jardim|sala)|-(?:jardim|sala)))|ões-(?:(?:cerc|purg)a|madagascar|ratos?)|ão-(?:(?:cerc|purg)a|madagascar|rato)|o(?:s-(?:flandres|riga)|-riga)|as?-raiz)|ment(?:a(?:s-(?:c(?:(?:aien|oro)a|heiro)|(?:ra[bt]|macac)o|g(?:alinha|entio)|bu(?:gre|ta)|queimar|água)|-(?:c(?:(?:aien|oro)a|heiro)|(?:ra[bt]|macac)o|g(?:alinha|entio)|bu(?:gre|ta)|queimar|água))|ões-c(?:aiena|heiro)|ão-c(?:aiena|heiro))|t(?:o(?:mb(?:a(?:s-(?:macaco|leite)|-(?:macaco|leite))|eiras?-marajó)|s-(?:água|saci)|-(?:água|saci))|a(?:ng(?:ueira(?:s-(?:cachorro|jardim)|-(?:cachorro|jardim))|as?-cachorro)|s?-erva)|eiras?-sinal)|olho(?:s-(?:(?:galinh|balei|onç)a|(?:soldad|tubarã)o|p(?:lanta|adre)|c(?:ação|obra)|faraó|urubu)|-(?:(?:galinh|balei|onç)a|(?:soldad|tubarã)o|p(?:lanta|adre)|c(?:ação|obra)|faraó|urubu))|(?:piras?-(?:máscar|prat)|quiás?-pedr|ão-purg)a|c(?:ões-tr(?:opeiro|epar)|ão-tropeiro)|xiricas?-bolas|raíbas?-pele)|e(?:r(?:a(?:s-(?:a(?:(?:guieir|lmeid)a|dvogado)|r(?:e(?:fego|i)|osa)|(?:cris|un)to|jesus|água)|-(?:a(?:(?:guieir|lmeid)a|dvogado)|r(?:e(?:fego|i)|osa)|(?:cris|un)to|jesus|água))|oba(?:s-(?:(?:pernambuc|reg)o|ca(?:ntagalo|mpos)|go(?:iás|mo)|minas)|-(?:(?:pernambuc|reg)o|ca(?:ntagalo|mpos)|go(?:iás|mo)|minas))|(?:iquit(?:o(?:s-(?:campin|ant)|-ant)|inhos?-vassour)|cevejos?-(?:ca[ms]|galinh))a|diz(?:es)?-alqueive|us?-sol)|na(?:chos?-capim|s-avestruz)|pinos?-(?:papagai|burr)o|ssegueiros?-abrir|quiás?-pedra)|u(?:rga(?:s-(?:c(?:a(?:i(?:tité|apó)|(?:boc|va)lo|rijó)|ereja)|(?:ve(?:ad|nt)|marinheir|genti)o|pa(?:ulista|stor)|nabiça)|-(?:c(?:a(?:i(?:tité|apó)|(?:bo
c|va)lo|rijó)|ereja)|(?:ve(?:ad|nt)|marinheir|genti)o|pa(?:ulista|stor)|nabiça))|lg(?:a(?:s-(?:(?:a(?:rei|nt)|galinh|águ)a|bicho)|-(?:(?:a(?:rei|nt)|galinh|águ)a|bicho))|(?:ões|ão)-planta))|o(?:mb(?:a(?:s-(?:(?:(?:arribaç|sert)ã|espelh|band)o|mulata)|-(?:(?:(?:arribaç|sert)ã|espelh|band)o|mulata))|o(?:s-(?:montanha|leque)|-(?:montanha|leque)))|rco(?:s-(?:verrugas|ferro)|-(?:verrugas|ferro))|aia(?:s-(?:minas|cipó)|-(?:minas|cipó)))|ã(?:es-(?:p(?:o(?:rc(?:in)?o|bre)|ássaros)|gal(?:inha|o)|leite|cuco)|o-(?:p(?:orc(?:in)?o|ássaros)|gal(?:inha|o)|cuco))|l(?:uma(?:s-(?:príncipe|capim)|-(?:príncipe|capim))|átanos?-gênio|antas?-neve)|r(?:eguiça(?:s-(?:bentinho|coleira)|-(?:bentinho|coleira))|imaveras?-caiena)|ássaros?-f(?:andan|i)go|êssegos?-abrir)|f(?:lor(?:es-(?:c(?:a(?:(?:(?:r(?:nav|de)|m)a)?l|(?:sament|chimb|bocl)o)|o(?:(?:[iu]r|elh|c)o|ntas|bra|ral)|e(?:tim|ra)|hagas|iúme|uco)|p(?:a(?:(?:ssarinh|pagai|raís|vã)o|dre|lha|u)|e(?:licano|dra)|érolas)|m(?:a(?:r(?:acujá|iposa)|deira|io)|(?:(?:eren|osca)d|us)a|ico)|b(?:a(?:(?:b(?:eir|ad)|rbeir)o|unilha|ile)|eso[iu]ro)|a(?:(?:lgodã|njinh)o|ranha|bril|zar)|s(?:a(?:p(?:at)?o|ngue)|(?:ed|ol)a)|v(?:(?:iúv|ac)a|e(?:lu|a)do)|n(?:(?:espereir|oiv)a|atal)|l(?:is(?:ado)?|agartixa|ã)|(?:quaresm|dian|águ)a|(?:invern|índi|fog)o|e(?:spírito|nxofre)|t(?:rombeta|anino)|g(?:ra[mx]a|elo)|jesus)|-(?:c(?:a(?:(?:(?:r(?:nav|de)|m)a)?l|(?:sament|chimb|bocl)o)|o(?:(?:[iu]r|elh|c)o|ntas|bra|ral)|e(?:tim|ra)|hagas|iúme|uco)|p(?:a(?:(?:ssarinh|pagai|raís|vã)o|dre|lha|u)|e(?:licano|dra)|érolas)|m(?:a(?:r(?:acujá|iposa)|deira|io)|(?:(?:eren|osca)d|us)a|ico)|b(?:a(?:(?:b(?:eir|ad)|rbeir)o|unilha|ile)|eso[iu]ro)|a(?:(?:lgodã|njinh)o|ranha|bril|zar)|s(?:a(?:p(?:at)?o|ngue)|(?:ed|ol)a)|v(?:(?:iúv|ac)a|e(?:lu|a)do)|n(?:(?:espereir|oiv)a|atal)|(?:quaresm|dian|águ)a|(?:invern|índi|fog)o|e(?:spírito|nxofre)|l(?:agartixa|is|ã)|t(?:rombeta|anino)|g(?:ra[mx]a|elo)|jesus))|rut(?:a(?:s-(?:c(?:o(?:n(?:de(?:ssa)?|ta)|(?:dorn|ruj)a)|a(?:chorro|scavel|iapó)|utia)|g(?:(?:en
ti|al)o|uar(?:iba|á)|rude)|m(?:a(?:n(?:teig|il)a|caco)|orcego)|p(?:(?:a(?:pagai|vã)|omb|ã)o|erdiz)|sa(?:(?:pucainh|ír)a|b(?:ão|iá))|a(?:n(?:ambé|el)|rara)|v(?:(?:íbor|i)a|eado)|b(?:abad|urr)o|t(?:ucano|atu)|l(?:epra|obo)|jac(?:aré|u)|árvore|faraó|ema)|-(?:c(?:o(?:n(?:de(?:ssa)?|ta)|(?:dorn|ruj)a)|a(?:chorro|scavel|iapó)|utia)|g(?:(?:enti|al)o|uar(?:iba|á)|rude)|m(?:a(?:n(?:teig|il)a|caco)|orcego)|p(?:(?:a(?:pagai|vã)|omb|ã)o|erdiz)|sa(?:(?:pucainh|ír)a|b(?:ão|iá))|a(?:n(?:ambé|el)|rara)|v(?:(?:íbor|i)a|eado)|b(?:abad|urr)o|t(?:ucano|atu)|l(?:epra|obo)|jac(?:aré|u)|árvore|faraó|ema))|eira(?:s-(?:c(?:onde(?:ssa)?|achorro|utia)|(?:macac|tucan|burr|lob)o|p(?:(?:avã|omb)o|erdiz)|jac(?:aré|u)|arara|faraó)|-(?:c(?:onde(?:ssa)?|achorro|utia)|(?:macac|tucan|burr|lob)o|p(?:(?:avã|omb)o|erdiz)|jac(?:aré|u)|arara|faraó))|o(?:s-(?:c(?:a(?:xinguelê|chorro)|o(?:br|nt)a)|m(?:a(?:nteiga|caco)|orcego)|p(?:apagaio|erdiz)|burro|sabiá|imbé)|-(?:c(?:a(?:xinguelê|chorro)|o(?:br|nt)a)|m(?:a(?:nteiga|caco)|orcego)|p(?:apagaio|erdiz)|burro|sabiá|imbé)))|o(?:r(?:m(?:iga(?:s-(?:f(?:e(?:rrão|bre)|ogo)|r(?:a(?:spa|bo)|oça)|c(?:emitério|upim)|m(?:andioca|onte)|b(?:entinho|ode)|(?:imbaúv|onç)a|n(?:ovato|ós)|defunto)|-(?:f(?:e(?:rrão|bre)|ogo)|r(?:a(?:spa|bo)|oça)|c(?:emitério|upim)|m(?:andioca|onte)|b(?:entinho|ode)|(?:imbaúv|onç)a|n(?:ovato|ós)|defunto))|osa(?:s-(?:besteiros|darei)|-(?:besteiros|darei)))|no(?:s-ja(?:çanã|caré)|-ja(?:çanã|caré)))|lha(?:s-(?:s(?:a(?:n(?:tana|gue)|bão)|e(?:rr|d)a)|f(?:(?:ígad|og)o|igueira|ronte)|p(?:a(?:pagaio|dre|jé)|irarucu)|(?:comichã|bold?|gel)o|l(?:ança|eite|ouco)|(?:zeb|he|ta)ra|mangue|urubu)|-(?:s(?:a(?:n(?:tana|gue)|bão)|erra)|f(?:(?:ígad|og)o|igueira|ronte)|p(?:a(?:pagaio|dre|jé)|irarucu)|(?:comichã|bold?|gel)o|l(?:ança|eite|ouco)|(?:zeb|he|ta)ra|mangue|urubu))|cas?-capuz)|a(?:v(?:a(?:s-(?:(?:a(?:ngol|rar)|(?:mal|v)ac|r(?:osc|am)|sucupir|holand)a|c(?:a(?:labar|valo)|h(?:eiro|apa)|obra)|b(?:e(?:souro|lém)|ol(?:ach|ot)a)|(?:quebrant|engenh|ordáli)o|l(?:(?:áza
r|ob)o|ima)|t(?:ambaqui|onca)|p(?:orco|aca)|impin?gem)|-(?:(?:a(?:ngol|rar)|(?:mal|v)ac|r(?:osc|am)|sucupir|holand)a|b(?:e(?:souro|lém)|ol(?:ach|ot)a)|c(?:a(?:labar|valo)|(?:hap|obr)a)|(?:quebrant|ordáli)o|t(?:ambaqui|onca)|p(?:orco|aca)|l(?:ima|obo)|impin?gem))|eira(?:s-(?:impin?gem|berloque)|-(?:impin?gem|berloque))|inhas?-capoeira)|lc(?:ões|ão)-coleira)|e(?:ij(?:õe(?:s-(?:c(?:(?:o(?:br|rd)|er|ub)a|avalo)|g(?:u(?:ando|izos)|ado)|(?:árvor|azeit|frad)e|l(?:i(?:sbo|m)a|eite)|(?:jav|rol|soj)a|m(?:acáçar|etro)|po(?:mbinha|rco)|va(?:[cr]a|gem)|tropeiro|boi)|zinhos-capoeira)|ão(?:-(?:c(?:(?:ord|er|ub)a|avalo)|g(?:u(?:ando|izos)|ado)|(?:árvor|azeit|frad)e|l(?:i(?:sbo|m)a|eite)|(?:jav|rol|soj)a|m(?:acáçar|etro)|po(?:mbinha|rco)|va(?:[cr]a|gem)|tropeiro|boi)|zinho-capoeira))|(?:l(?:es)?-genti|nos?-cheir|tos?-botã)o)|i(?:g(?:ueira(?:s-(?:(?:lombrigueir|pit|go)a|b(?:engala|aco)|to(?:car|que)|jardim)|-(?:(?:lombrigueir|pit|go)a|b(?:engala|aco)|to(?:car|que)|jardim))|o(?:s-(?:(?:figueir|banan)a|r(?:echeio|ocha)|(?:tord|verã)o)|-(?:(?:figueir|banan)a|r(?:echeio|ocha)|(?:tord|verã)o)))|lária(?:s-(?:medina|guiné)|-(?:medina|guiné))|andeiras?-algodão)|u(?:mo(?:s-(?:(?:rapos|cord|folh)a|pa(?:isan|raís)o|jardim)|-(?:(?:rapos|cord|folh)a|pa(?:isan|raís)o|jardim))|ncho(?:s-(?:(?:florenç|águ)a|porco)|-(?:(?:florenç|águ)a|porco)))|éis-gentio)|b(?:a(?:nan(?:eir(?:a(?:s-(?:(?:madag[áa]sca|flo)r|(?:italian|papagai)o|sementes|jardim|corda|leque)|-(?:(?:madag[áa]sca|flo)r|(?:italian|papagai)o|sementes|jardim|corda|leque))|inha(?:s-(?:touceira|salão|flor)|-(?:touceira|salão|flor)))|a(?:s-(?:(?:m(?:orceg|acac)|papagai)o|s(?:ementes|ancho)|imbé)|-(?:(?:m(?:orceg|acac)|papagai)o|s(?:ementes|ancho)|imbé)))|tat(?:a(?:s-(?:p(?:e(?:rdiz|dra)|ur(?:ga|i)|orco)|a(?:(?:ngol|rrob)a|maro)|b(?:(?:ranc|ugi)o|ainha)|(?:cabocl|vead)o|t(?:aiuiá|iú)|escamas|rama)|-(?:p(?:e(?:rdiz|dra)|ur(?:ga|i)|orco)|a(?:(?:ngol|rrob)a|maro)|b(?:(?:ranc|ugi)o|ainha)|(?:cabocl|vead)o|t(?:aiuiá|iú)|escamas|rama))|inhas?-cobra)|g(
?:a(?:s-(?:(?:cabocl|tucan|lour)o|p(?:ombo|raia))|-(?:(?:cabocl|tucan|lour)o|p(?:ombo|raia)))|re(?:s-(?:(?:arei|lago)a|man(?:gue|ta)|penacho)|-(?:(?:arei|lago)a|man(?:gue|ta)|penacho))|os?-chumbo)|r(?:r(?:ete(?:s-(?:clérigo|eleitor|padre)|-(?:clérigo|eleitor|padre))|ig(?:udas?-espinho|as?-freira))|ba(?:-(?:(?:chib|timã)o|pa(?:ca|u)|lagoa)|s-(?:(?:chib|timã)o|lagoa|boi|pau)))|b(?:osa(?:s-(?:árvore|espiga|pau)|-(?:árvore|espiga|pau))|a(?:-(?:(?:camel|sap)o|boi)|s-(?:sapo|boi)))|c(?:u(?:r(?:aus?-(?:lajea|ban)do|is?-cerca)|(?:paris?-capoei|s?-ped)ra)|abas?-(?:azeit|lequ)e)|mbu(?:s-(?:(?:espinh|caniç)o|pescador|mobília)|-(?:(?:espinh|caniç)o|pescador|mobília))|leia(?:s-(?:b(?:arbatana|ico)|corcova|gomo)|-(?:b(?:arbatana|ico)|corcova|gomo))|i(?:acu(?:s-(?:espinho|chifre)|-(?:espinho|chifre))|nhas?-(?:espad|fac)a)|st(?:(?:i(?:ões|ão)-arrud|ardos?-rom)a|ões-velho)|d(?:ianas?-cheiro|ejos?-lista)|unilhas?-auacuri|únas?-fogo)|i(?:c(?:h(?:o(?:-(?:c(?:(?:a(?:rpintei|chor)r|est)o|o(?:nta|co)|hifre)|(?:(?:ester|bura)c|ouvid|rum)o|(?:(?:gali|u)nh|taquar|sed)a|p(?:a(?:rede|u)|orco|ena|é)|m(?:(?:edranç|osc)a|ato)|v(?:areja|eludo)|f(?:rade|ogo))|s-(?:c(?:(?:a(?:rpintei|chor|nast)r|est)o|o(?:nta|co)|hifre)|(?:m(?:edranç|osc)|(?:gali|u)nh|taquar|sed)a|(?:(?:ester|bura)c|ouvid|rum)o|p(?:a(?:rede|u)|orco|ena|é)|v(?:areja|eludo)|f(?:rade|ogo)))|eiros?-conta)|udas?-corso)|ribás?-pernambuco|telos?-gente)|o(?:r(?:boleta(?:s-(?:p(?:êssego|iracema)|a(?:moreira|lface)|(?:carvalh|band)o|gás)|-(?:p(?:êssego|iracema)|a(?:moreira|lface)|(?:carvalh|band)o|gás))|d(?:ões|ão)-(?:santiag|macac)o)|i(?:s-(?:carro|guará|deus)|-(?:carro|guará|deus)|tas?-bigodes)|a(?:is|l)-alicante|fes?-burro|tos-óculos)|r(?:edo(?:s-(?:(?:namor(?:ad)?|porc|vead|mur)o|espi(?:nho|ga)|cabeça|jardim)|-(?:(?:namor(?:ad)?|porc|vead|mur)o|espi(?:nho|ga)|cabeça|jardim))|inco(?:s-(?:s(?:a(?:guim?|uim)|urubim)|passarinho)|-(?:sa(?:guim?|uim)|passarinho))|(?:ucos?-salvaterr|ancos?-barit)a|ocas?-raiz)|e(?:s(?:ouro(?:s-(?:(?:limeir|águ)a|
chifre|maio)|-(?:(?:limeir|águ)a|chifre|maio))|ugos?-ovas)|l(?:droega(?:s-(?:inverno|cuba)|-(?:inverno|cuba))|a(?:s-felgueiras?|-felgueiras?))|ngalas?-camarão|tónicas?-água|ijus?-potó)|álsamo(?:s-(?:c(?:a(?:rtagena|nudo)|heiro)|(?:arce|tol)u|enxofre)|-(?:c(?:a(?:rtagena|nudo)|heiro)|(?:arce|tol)u|enxofre))|u(?:ch(?:o(?:s-(?:veado|boi|rã)|-(?:veado|boi|rã))|as?-purga)|t(?:iás?-vinagre|uas?-corvo)|xos?-holanda))|a(?:l(?:f(?:a(?:vaca(?:s-(?:c(?:(?:abocl|heir)o|obra)|vaqueiro)|-(?:c(?:(?:abocl|heir)o|obra)|vaqueiro))|ce(?:s-(?:(?:c(?:ordeir|ã)|porc)o|alger)|-(?:(?:c(?:ordeir|ã)|porc)o|alger))|zemas?-caboclo|fas?-provença)|inete(?:s-(?:toucar|dama)|-toucar))|m(?:a(?:s-(?:c(?:a(?:(?:boc|va)lo|çador)|(?:hichar|ânta)ro)|(?:tapui|pomb|gat)o|biafada|mestre)|-(?:c(?:a(?:(?:boc|va)lo|çador)|(?:hichar|ânta)ro)|(?:tapui|pomb|gat)o|biafada))|ecegueira(?:s-(?:cheiro|minas)|-(?:cheiro|minas)))|e(?:cri(?:ns-(?:c(?:ampina|heiro)|angola)|m-(?:c(?:ampina|heiro)|angola))|trias?-pau)|ho(?:s-(?:espanha|cheiro)|-(?:espanha|cheiro))|ba(?:troz(?:es)?-sobrancelha|coras?-laje)|g(?:odoeiros?-pernambuco|ibeiras?-dama)|cachofras?-jerusalém|amandas?-jacobina)|r(?:a(?:ç(?:á(?:s-(?:c(?:o(?:mer|roa)|heiro)|(?:umbig|vead)o|(?:pomb|ant)a|tinguijar|minas)|-(?:c(?:o(?:mer|roa)|heiro)|(?:umbig|vead)o|(?:pomb|ant)a|tinguijar|minas))|aris?-minhoca)|ticu(?:ns-(?:(?:espinh|cheir)o|(?:jangad|pac)a|boia?)|m-(?:(?:espinh|cheir)o|(?:jangad|pac)a|boia?))|nha(?:s-(?:água|coco)|-(?:água|coco))|(?:pocas?-cheir|rutas?-porc)o)|r(?:aia(?:s-(?:coroa|fogo)|-(?:coroa|fogo))|ozes-(?:telhad|rat)o|udas?-campinas)|oeira(?:s-(?:(?:goiá|mina)s|capoeira|bugre)|-(?:(?:goiá|mina)s|capoeira|bugre))|(?:lequi(?:ns|m)-caien|cos?-pip)a)|n(?:g(?:ico(?:s-(?:m(?:onte|ina)s|banhado|curtume)|-(?:m(?:onte|ina)s|banhado|curtume))|eli(?:ns|m)-(?:espinh|morceg)o|élicas?-rama)|a(?:n(?:ases-(?:caraguatá|agulha)|ás-(?:caraguatá|agulha))|mbés?-capuz)|dorinha(?:s-(?:bando|casa)|-(?:bando|casa))|ingas?-(?:espinh|macac)o|u(?:n?s|m)?-enchente|z(?:óis|ol)
-lontra)|m(?:or(?:e(?:ira(?:s-(?:espinho|árvore)|-(?:espinho|árvore))|s-(?:(?:(?:vaquei|bur)r|hortelã)o|moça))|-(?:(?:(?:vaquei|bur)r|hortelã)o|moça))|e(?:ndo(?:i(?:ns-(?:árvore|veado)|m-(?:árvore|veado))|eiras?-coco)|ixa(?:s-(?:madagascar|espinho)|-(?:madagascar|espinho)))|êndoas?-(?:espinh|coc)o)|b(?:elha(?:s-(?:c(?:(?:achorr|hã)o|upim)|(?:rein|fog|our|sap)o|p(?:urga|au))|-(?:(?:rein|fog|our|sap)o|p(?:urga|au)|cupim))|(?:utuas?-batat|óbora-coro)a|r(?:icós?-macaco|aços?-vide))|s(?:a(?:s-(?:pa(?:pagaio|lha)|(?:barat|telh)a|sabre)|-(?:pa(?:pagaio|lha)|(?:barat|telh)a|sabre))|pargo(?:s-(?:jardim|sala)|-(?:jardim|sala))|so[bv]ios?-(?:cobr|folh)a)|ça(?:f(?:ate(?:s-(?:o[iu]ro|prata)|-(?:o[iu]ro|prata))|roeiras?-pernambuco)|ís?-caatinga)|g(?:ulh(?:(?:ões|ão)-(?:(?:trombe|pra)t|vel)a|as?-pastor)|rílicas?-rama)|zed(?:inha(?:s-(?:corumbá|goiás)|-(?:corumbá|goiás))|as-ovelha)|ve(?:nca(?:s-(?:espiga|minas)|-(?:espiga|minas))|s?-crocodilo)|(?:ipos?-montevid|carás?-v)éu|tu(?:ns|m)-galha)|m(?:a(?:r(?:acujá(?:s-(?:c(?:a(?:iena|cho)|o(?:rtiç|br)a|heiro)|(?:ga(?:rap|vet)|mochil)a|pe(?:riquito|dra)|est(?:rada|alo)|(?:alh|rat)o)|-(?:c(?:a(?:iena|cho)|o(?:rtiç|br)a|heiro)|(?:ga(?:rap|vet)|mochil)a|pe(?:riquito|dra)|est(?:rada|alo)|(?:alh|rat)o))|m(?:el(?:adas?-(?:ca(?:chorr|val)|invern|verã)o|(?:eir)?os?-bengala)|itas?-macaco)|reco(?:s-(?:pequim|ruão)|-(?:pequim|ruão))|imbondos?-chapéu|quesas?-belas)|c(?:a(?:co(?:s-(?:(?:cheir|band)o|noite|sabá)|-(?:(?:cheir|band)o|noite|sabá))|mbira(?:s-(?:(?:flech|pedr)a|serrote)|-(?:(?:flech|pedr)a|serrote))|quinhos?-bambá)|ieira(?:s-(?:(?:anáfeg|coro)a|boi)|-(?:(?:anáfeg|coro)a|boi))|elas?-(?:tabuleir|botã)o|ucus?-paca)|n(?:gue(?:s-(?:(?:(?:pend|bot)ã|sapateir|espet)o|obó)|-(?:(?:(?:pend|bot)ã|sapateir|espet)o|obó))|t(?:imento(?:s-(?:araponga|pobre)|-(?:araponga|pobre))|as?-bretão)|jeric(?:ões|ão)-(?:ceilã|molh)o|d(?:ibis?-juntas|acarus?-boi))|çã(?:s-(?:c(?:(?:[au]c|rav)o|ipreste|obra)|a(?:náfega|rrátel)|(?:espelh|prat)o|rosa|vime|boi)|-(?:c(?:(?:[
au]c|rav)o|ipreste|obra)|a(?:náfega|rrátel)|(?:espelh|prat)o|rosa|vime|boi))|t(?:inho(?:s-(?:agulhas|lisboa|sargo)|-(?:agulhas|lisboa|sargo))|o(?:s-(?:engodo|salema)|-(?:engodo|salema)))|m(?:(?:icas?-(?:ca(?:chorr|del)|porc)|(?:ões|ão)-cord)a|oeiro(?:s-(?:espinho|corda)|-(?:espinho|corda)))|lva(?:s-(?:(?:cheir|pendã)o|marajó)|-(?:marajó|pendão)|íscos?-pernambuco)|d(?:ressilvas?-cheiro|eiras?-rei)|itacas?-maximiliano|parás?-cametá)|o(?:s(?:ca(?:s-(?:b(?:a(?:nheir|gaç)o|ich(?:eira|o))|e(?:lefante|stábulo)|ca(?:valos?|sa)|f(?:reira|ogo)|(?:madei|u)ra|inverno)|-(?:b(?:a(?:nheir|gaç)o|ich(?:eira|o))|e(?:lefante|stábulo)|ca(?:valos?|sa)|(?:madei|u)ra|fogo)|t(?:éis-(?:setúbal|jesus)|el-(?:setúbal|jesus)))|quitos?-parede)|ela(?:s-(?:mutum|ema)|-(?:mutum|ema))|n(?:stros?-gila|cos?-peru|tes?-ouro)|(?:uriscos?-sement|reias?-mangu)e|c(?:itaíbas?-leite|hos?-orelhas)|longós?-colher)|u(?:r(?:ici(?:s-(?:(?:tabuleir|porc)o|lenha)|-(?:(?:tabuleir|porc)o|lenha))|uré(?:s-(?:canudo|pajés)|-(?:canudo|pajé))|ta(?:s-(?:cheiro|parida)|-parida))|s(?:go(?:s-(?:irlanda|perdão)|-(?:irlanda|perdão))|aranhos?-água)|tu(?:ns-(?:asso[bv]io|fava)|m-(?:asso[bv]io|fava))|çambés?-espinhos|ngunzás?-cortar)|i(?:lh(?:o(?:s-(?:cobr|águ)|-águ)a|ãs?-pendão)|mos(?:as?-vereda|os?-cacho)|neiras?-petrópolis|cos?-topete|jos?-cavalo|olos-capim)|el(?:(?:(?:ões|ão)-(?:cabocl|morceg|soldad)|oeiros?-soldad)o|(?:ros?-(?:coleir|águ)|ancias?-cobr)a))|e(?:rv(?:a(?:s-(?:m(?:a(?:l(?:eitas|aca)|caé)|u(?:lher|ro)|o[iu]ra|endigo)|p(?:a(?:(?:rid|in)a|ssarinho)|(?:ântan|iolh)o|ontada)|a(?:n(?:(?:dorinh|t)a|jinho|il)|l(?:finete|ho)|mor)|b(?:(?:(?:ascul|ic)h|otã)o|(?:esteir|álsam)os|ugre)|c(?:(?:abr(?:it)?|obr)a|h(?:eir|umb)o)|sa(?:n(?:t(?:iago|ana)|gue)|(?:le)?po)|l(?:a(?:(?:vadeir|c)a|garto)|ouco)|g(?:o(?:[mt]|iabeir)a|uiné|elo)|f(?:(?:og|um|i)o|ebra)|(?:r(?:ober|a)t|our)o|ja(?:raraca|buti)|impingem|esteira)|-(?:m(?:a(?:l(?:eitas|aca)|caé)|u(?:lher|ro)|o[iu]ra|endigo)|p(?:a(?:(?:rid|in)a|ssarinho)|(?:ântan|iolh)o|ontada)|a(?:l(?:
finete|míscar|ho)|n(?:(?:dorinh|t)a|il)|mor)|b(?:(?:(?:ascul|ic)h|otã)o|(?:esteir|álsam)os|ugre)|sa(?:n(?:t(?:iago|ana)|gue)|(?:le)?po)|c(?:(?:abr(?:it)?|obr)a|(?:humb|ã)o)|l(?:a(?:(?:vadeir|c)a|garto)|ouco)|g(?:o(?:[mt]|iabeir)a|uiné|elo)|f(?:(?:og|um|i)o|ebra)|(?:r(?:ober|a)t|our)o|ja(?:raraca|buti)|impingem|esteira))|i(?:lha(?:s-(?:(?:cheir|pomb)o|(?:árvo|leb)re|(?:angol|vac)a)|-(?:(?:cheir|pomb)o|(?:árvo|leb)re|(?:angol|vac)a))|nhas?-parida))|s(?:p(?:i(?:n(?:h(?:o(?:s-(?:c(?:a(?:(?:chor|rnei)ro|çada)|r(?:isto|uz)|erca)|(?:bananeir|agulh|roset)a|(?:ladrã|tour|urs)o|j(?:erusalém|udeu)|mari(?:ana|cá)|vintém|deus)|-(?:c(?:a(?:(?:chor|rnei)ro|çada)|r(?:isto|uz)|erca)|(?:bananeir|agulh|roset)a|(?:ladrã|tour|urs)o|j(?:erusalém|udeu)|mari(?:ana|cá)|vintém|deus))|eiro(?:s-(?:c(?:a(?:rneiro|iena)|risto|erca)|j(?:erusalém|udeu)|a(?:gulh|meix)a|vintém)|-(?:c(?:a(?:rneiro|iena)|risto|erca)|j(?:erusalém|udeu)|a(?:gulh|meix)a|vintém))|as?-(?:carneir|vead)o)|afres?-cuba)|ga(?:s-(?:(?:sangu|leit)e|ferrugem|água)|-(?:(?:sangu|leit)e|ferrugem|água)))|onjas?-raiz)|c(?:a(?:móneas?-alepo|das?-jabuti)|ovas?-macaco|umas?-sangue)|tercos?-jurema)|mbira(?:s-(?:ca(?:rrapato|çador)|(?:porc|sap)o)|-(?:ca(?:rrapato|çador)|(?:porc|sap)o))|n(?:xertos?-passarinho|redadeiras?-borla))|g(?:r(?:a(?:m(?:a(?:s-(?:p(?:(?:ernambuc|ast)o|onta)|(?:forquilh|sananduv)a|c(?:oradouro|idade)|ja(?:cobina|rdim)|ma(?:rajó|caé)|adorno)|-(?:p(?:(?:ernambuc|ast)o|onta)|(?:forquilh|sananduv)a|c(?:oradouro|idade)|ja(?:cobina|rdim)|ma(?:rajó|caé)|adorno))|inha(?:s-(?:campinas|jacobina|raiz)|-(?:campinas|jacobina|raiz)))|vatá(?:s-(?:(?:moquec|agulh)a|c(?:o[iu]ro|erca)|(?:ganch|lajed)o|r(?:aposa|ede)|árvore|tingir)|-(?:(?:moquec|agulh)a|c(?:o[iu]ro|erca)|(?:ganch|lajed)o|r(?:aposa|ede)|árvore|tingir))|lhas?-crista)|ão(?:s-(?:(?:c(?:aval|humb)|(?:malu|bi)c|gal)o|p(?:orco|ulha))|-(?:(?:(?:malu|bi)c|(?:cav|g)al)o|p(?:orco|ulha))|zinhos?-galo)|inaldas?-viúva)|a(?:l(?:o(?:s-(?:p(?:enacho|luma)|b(?:ando|riga)|rebanho|fita|ebó)
|-(?:p(?:enacho|luma)|b(?:ando|riga)|rebanho|fita|ebó))|inha(?:s-(?:bugre|faraó|água)|-(?:bugre|faraó|água)))|fanhoto(?:s-(?:(?:(?:marmel|coqu)eir|arribaçã)o|(?:jurem|prag)a)|-(?:(?:(?:marmel|coqu)eir|arribaçã)o|(?:jurem|prag)a))|vi(?:ões-(?:(?:(?:colei|ser)r|queimad)a|a(?:nta|ruá)|penacho)|ão-(?:(?:(?:colei|ser)r|queimad)a|a(?:nta|ruá)|penacho))|meleira(?:-(?:(?:lombrigueir|p(?:in|ur)g)a|(?:cansaç|venen)o)|s-(?:(?:cansaç|venen)o|lombrigueiras|p(?:in|ur)ga))|to(?:-(?:madagáscar|algália)|s-algália)|r(?:oupas?-segunda|gantas-ferro))|u(?:a(?:birob(?:eira(?:s-(?:cachorro|minas)|-(?:cachorro|minas))|a(?:s-(?:cachorro|minas)|-(?:cachorro|minas)))|ricangas?-bengala)|iratãs?-coqueiro)|o(?:iab(?:a(?:s-(?:(?:espinh|macac)o|anta)|-(?:(?:espinh|macac)o|anta))|eiras?-(?:cuti|pac)a)|meiros?-minas|gós?-guariba|elas?-lobo)|e(?:rgeli(?:ns|m)-laguna|ngibres?-dourar)|irass(?:óis|ol)-batatas)|t(?:r(?:e(?:vo(?:s-(?:c(?:ar(?:retilha|valho)|heiro)|(?:se[ar]r|águ)a)|-(?:c(?:ar(?:retilha|valho)|heiro)|(?:se[ar]r|águ)a))|moço(?:s-(?:cheiro|jardim|minas)|-(?:cheiro|jardim|minas)))|i(?:go(?:s-(?:p(?:rioste|erdiz)|milagre|israel|verão)|-(?:p(?:rioste|erdiz)|milagre|israel|verão))|colino(?:s-c(?:hifre|rista)|-c(?:hifre|rista))|nca(?:is|l)-pau)|(?:aças?-bibliotec|épanos?-coro)a|omb(?:as?-elefante|etas?-arauto)|utas?-lago)|a(?:i(?:uiá(?:s-(?:c(?:omer|ipó)|pimenta|jardim|quiabo|goiás)|-(?:c(?:omer|ipó)|pimenta|jardim|quiabo|goiás))|nhas?-(?:cors|ri)o)|r(?:taruga(?:s-(?:couro|pente)|-(?:couro|pente))|umã(?:s-espinhos?|-espinhos?))|m(?:b(?:etarus?-espinh|ori[ls]-brav)o|anqueiras?-leite)|j(?:ujás?-(?:cabacinh|quiab)o|ás?-cobra)|(?:xizeiros?-tint|tus?-folh)a|b(?:ocas?-marajó|acos?-cão)|quaris?-cavalo)|i(?:n(?:gui(?:s-(?:c(?:(?:aien|ol)a|ipó)|(?:leit|peix)e)|-(?:c(?:(?:aien|ol)a|ipó)|(?:leit|peix)e))|hor(?:ões|ão)-lombriga)|mbó(?:s-(?:boticário|caiena|jacaré|peixe|raiz)|-(?:boticário|caiena|jacaré|peixe|raiz))|gres?-bengala)|o(?:m(?:at(?:e(?:s-(?:princesa|árvore)|-(?:princesa|árvore))|inhos?-capucho)|il
hos?-creta)|(?:rós?-espinh|adas-cour)o|petes?-cardeal)|u(?:c(?:u(?:ns-(?:carnaúba|redes)|m-(?:carnaúba|redes))|anos?-cinta)|bar(?:ões|ão)-focinho|lipas?-jardim|ias?-areia)|e(?:m(?:betarus?-espinho|porãos?-coruche)|rebintina-quio)|úberas?-(?:invern|verã)o)|r(?:a(?:to(?:s-(?:p(?:a(?:lmatória|iol)|entes|raga)|(?:es(?:pinh|got)|algodã)o|(?:t(?:aquar|romb)|águ)a|c(?:ouro|asa)|fa(?:raó|va)|bambu)|-(?:p(?:a(?:lmatória|iol)|entes|raga)|(?:es(?:pinh|got)|algodã)o|(?:t(?:aquar|romb)|águ)a|c(?:ouro|asa)|fa(?:raó|va)|bambu))|ízes-(?:c(?:(?:edr|urv)o|o(?:bra|rvo)|h(?:eiro|á)|âmaras|ana)|b(?:(?:ar?beir|randã)o|ugre)|(?:angélic|mostard|quin)a|l(?:agarto|opes)|sol(?:teira)?|t(?:ucano|iú)|f(?:rade|el)|guiné|pipi)|iz-(?:c(?:(?:edr|urv)o|o(?:bra|rvo)|h(?:eiro|á)|âmaras|ana)|b(?:(?:ar?beir|randã)o|ugre)|(?:angélic|mostard|quin)a|l(?:agarto|opes)|sol(?:teira)?|t(?:ucano|iú)|f(?:rade|el)|guiné|pipi)|b(?:uge(?:ns|m)-cachorr|anetes?-caval)o|m(?:as?-bezerro|os?-seda)|pés?-saci)|o(?:s(?:a(?:-(?:(?:c(?:a(?:chorr|bocl)|h?ã)|[bl]ob|defunt|o[iu]r|musg)o|p(?:áscoa|au)|jericó|toucar)|s-(?:(?:c(?:a(?:chorr|bocl)|h?ã)|[bl]ob|defunt|o[iu]r|musg)o|jericó|páscoa|toucar))|ário(?:s-(?:jamb[ou]|ifá)|-(?:jamb[ou]|ifá))|e(?:tas?-pernambu|iras?-damas)co)|uxin(?:óis-(?:m(?:uralha|anaus)|(?:espadan|jav)a|caniços)|ol-(?:m(?:uralha|anaus)|(?:espadan|jav)a|caniços))|(?:balos?-(?:arei|galh)|az(?:es)?-bandeir)a|ca(?:s-(?:flores|eva)|-(?:flores|eva)))|(?:e(?:sedás?-cheir|des?-leã)|ábanos?-caval)o|inocerontes?-Java)|s(?:a(?:l(?:sa(?:s-(?:c(?:a(?:stanheiro|valos)|heiro|upim)|(?:roch|águ)a|burro)|-(?:c(?:a(?:stanheiro|valos?)|upim)|(?:roch|águ)a|burro)|parrilhas?-lisboa)|va(?:s-(?:pernambuco|marajó)|-(?:pernambuco|marajó))|amandras?-água)|r(?:a(?:ndis?-(?:(?:carangu|gargar)ej|espinh)o|(?:magos?-águ|s?-pit)a)|dinha(?:s-(?:ga(?:lha|to)|laje)|-(?:ga(?:lha|to)|laje))|go(?:s-(?:beiço|dente)|-(?:beiço|dente))|ros?-pito)|n(?:haç(?:o(?:s-(?:(?:(?:coqu|mamo)eir|fog)o|encontros)|-(?:(?:(?:coqu|mamo)eir|fog)o|encontros))|us?-(?:e
ncont|mamoei)ro)|ãs?-samambaia)|p(?:(?:ucaias?-castanh|és?-capoeir)a|o(?:-chifres?|s-chifre))|gui(?:n?s|m)?-bigode|mambaias?-penacho|[bv]acus?-coroa)|u(?:rucucu(?:s-(?:p(?:ati|ind)oba|fogo)|-(?:p(?:ati|ind)oba|fogo))|ma(?:umeiras?-macaco|gres?-provença|rés?-pedras))|orgo(?:s-(?:(?:vassour|espig)a|pincel|alepo)|-(?:(?:vassour|espig)a|pincel|alepo))|iris?-coral)|j(?:a(?:ca(?:r(?:andá(?:s-(?:campinas|espinho|sangue)|-(?:campinas|espinho|sangue))|és?-óculos)|(?:tir(?:ões|ão)-capot|s?-pobr)e)|smi(?:ns-(?:c(?:a(?:chorro|iena)|erca)|soldado|leite)|m-(?:c(?:a(?:chorro|iena)|erca)|soldado|leite))|(?:mb(?:eir)?os?-malac|lapas?-lisbo|puçá-coleir)a|tobá(?:s-(?:porco|anta)|-(?:porco|anta))|buticab(?:eiras?-campinas|as?-cipó)|r(?:aracas?-agosto|rinhas?-franja))|u(?:n(?:co(?:s-(?:c(?:a(?:ngalh|br)|obr)a|banhado)|-(?:c(?:a(?:ngalh|br)|obr)a|banhado))|ta(?:s-c(?:alangro|obra)|-c(?:alangro|obra))|ças-c(?:heiro|onta))|á(?:s-c(?:apote|omer)|-c(?:apote|omer))|rubebas?-espinho|ciris?-comer|quis?-cerca)|o(?:ões-(?:santarém|barros?|leite|puça)|ão-(?:santarém|barro|leite|puça))|e(?:quitibás?-agulheir|taís?-pernambuc)o|i(?:queranas?-goiás|tiranas?-leite))|l(?:i(?:m(?:(?:a(?:s-(?:cheir|umbig|bic)|-(?:umbig|bic))|eiras?-umbig)o|ões-(?:c(?:aiena|heiro)|galinha)|ão-(?:c(?:aiena|heiro)|galinha)|os?-manta)|n(?:ho(?:s-(?:raposa|cuco)|-(?:raposa|cuco))|gu(?:eir(?:ões|ão)-canud|ados?-ri)o)|x(?:a(?:s-(?:lei|pau)|-(?:lei|pau))|inhas?-fundura))|a(?:ranj(?:a(?:s-(?:(?:terr|onç)a|umbigo)|-(?:umbigo|onça))|eiras?-vaqueiro)|g(?:art(?:as?-(?:vidr|fog)o|os?-água)|ostas?-espinho)|lás?-cintura)|e(?:it(?:e(?:s-(?:ga(?:meleir|linh)a|cachorro)|-(?:ga(?:meleir|linh)a|cachorro)|iras?-espinho)|ugas?-burro)|sma-conchinha)|o(?:ur(?:eiro(?:s-(?:jardim|apolo)|-(?:jardim|apolo))|os?-cheiro)|ireiros?-apolo)|u(?:(?:tos-quaresm|vas-pastor)a|zernas?-sequeiro)|írios?-petrópolis)|v(?:a(?:sso(?:urinha(?:s-(?:(?:relógi|botã)o|varrer)|-(?:(?:relógi|botã)o|varrer))|ira(?:s-(?:fe(?:iticeira|rro)|bruxa)|-(?:feiticeir|brux)a))|ra(?:s-
(?:foguete|o[iu]ro|canoa)|-o[iu]ro)|les?-arinto)|e(?:r(?:g(?:onhas?-estudante|as?-jabuti)|(?:melhinhas?-galh|ças?-cã)o)|spa(?:s-(?:rodeio|cobra)|-(?:rodeio|cobra))|l(?:ames?-cheiro|udos?-penca)|ados?-virgínia|nenos?-porco)|i(?:d(?:eiras?-enforcado|oeiros?-papel)|oletas?-(?:par|da)ma|nháticos?-espinho)|oador(?:es)?-pedra)|qu(?:i(?:n(?:a(?:s-(?:(?:per(?:nambuc|iquit)|vead)o|c(?:errado|aiena|ipó)|r(?:e(?:mígi|g)o|aiz)|goiás)|-(?:(?:per(?:nambuc|iquit)|vead)o|c(?:errado|aiena|ipó)|r(?:e(?:mígi|g)o|aiz)|goiás))|gombós?-(?:espinh|cheir)o)|ab(?:o(?:s-(?:c(?:aiena|heiro|ipó)|(?:angol|quin)a)|-(?:c(?:aiena|heiro|ipó)|(?:angol|quin)a)|ranas?-espinho)|eiros?-angola)|(?:gombós?-cheir|to-pernambuc)o|bondos?-água)|ati(?:s-(?:bando|vara)|-(?:bando|vara))|ássias?-caiena)|á(?:rvore(?:s-(?:(?:bálsam|incens|ranc?h|seb)o|(?:gra(?:lh|x)|orquíde)a|c(?:hocalho|oral|uia)|l(?:ótus|eite|ã)|(?:jud|vel)as|a(?:rr|n)oz|pagode|natal)|-(?:(?:bálsam|incens|ranc?h|seb)o|(?:gra(?:lh|x)|orquíde)a|c(?:hocalho|oral|uia)|l(?:ótus|eite|ã)|(?:jud|vel)as|a(?:rr|n)oz|pagode|natal))|(?:gu(?:as?-colóni|ias?-poup)|caros?-galinh)a)|i(?:n(?:ha(?:me(?:s-(?:c(?:oriolá|ão)|lagartixa|enxerto|benim)|-(?:c(?:oriolá|ão)|lagartixa|enxerto|benim))|íbas?-rego)|censos?-caiena|gás?-fogo)|mb(?:(?:ur(?:ana(?:s-(?:c(?:ambã|heir)|espinh)|-(?:espinh|cambã))|is?-cachorr)|aúbas?-(?:cheir|vinh))o|és?-(?:amarra|come)r)|p(?:ês?-impingem|ecas?-cuiabá)|xoras-cheiro|scas?-sola)|n(?:o(?:z(?:-(?:b(?:a(?:tauá|nda)|ugre)|co(?:(?:br|l)a|co)|(?:arec|galh)a)|es-(?:(?:co(?:br|l)|arec|galh)a|b(?:a(?:tauá|nda)|ugre)))|gueira(?:s-(?:cobra|pecã)|-(?:cobra|pecã)))|a(?:(?:rciso(?:-(?:invern|cheir)|s-cheir)|valhas?-macac)o|nás?-raposa)|iqui(?:ns-(?:areia|saco)|m-areia)|ené(?:ns|m)-galinha|ós-cachorro)|u(?:va(?:s-(?:(?:(?:espin|fac)h|g(?:enti|al)|c(?:heir|ã)|urs)o|r(?:ato|ei)|praia|obó)|-(?:(?:(?:espin|fac)h|g(?:enti|al)|urs|cã)o|praia|obó|rei))|(?:irapurus?-band|xis?-morceg|bás?-fach)o|queté(?:s-(?:água|obó)|-(?:água|obó))|m(?:baranas?-abelha|iris?-che
iro)|apuçás?-coleira|ntués?-obó)|h(?:ortelã(?:s-(?:c(?:a(?:mpina|valo)|heiro)|b(?:urro|oi)|leite)|-(?:c(?:a(?:mpina|valo)|heiro)|b(?:urro|oi)|leite))|idras?-água)|o(?:liveira(?:s-(?:marrocos|cheiro)|-(?:marrocos|cheiro))|iti(?:s-porcoóleo-copaíbaóleos-copaíba|-porco)|stras?-pobre)|ç(?:ana-áçúcar|or-Rosa)|ébanos?-zanzibar|xexéu-bananeira|Grão-Bico))\b"> |
🛠️ Refactor suggestion
Consider improving maintainability of the regex pattern.
The regex pattern is quite extensive and complex. Consider:
- Breaking it down into smaller, named sub-patterns
- Adding comments to explain different sections
- Using XML entities for common patterns
Example structure:
-<!ENTITY hyphenised_expressions "(?U)\b(?!feij(?:ão|ões)-frade)(...)">
+<!-- Common prefixes -->
+<!ENTITY prefix_ca "c(?:a(?:r(?:rap(?:icho(?:s-(?:c(?:a(?:(?:rneir|val)o|lçada)|igana)|(?:agu|ove)lha|l(?:inho|ã)|boi)|-(?:c(?:a(?:(?:rneir|val)o|lçada)|igana)|(?:agu|ove)lha|l(?:inho|ã)|boi))...">
+<!-- Common suffixes -->
+<!ENTITY suffix_common "(?:eiro|eira|eiros|eiras)">
+<!-- Main pattern -->
+<!ENTITY hyphenised_expressions "(?U)\b(?!feij(?:ão|ões)-frade)(&prefix_ca;|...)">
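For comparison, the same decomposition can be sketched in Java rather than with DTD entities. This is only an illustration of the idea of named sub-patterns: the class name, the two fragment constants, and the handful of compounds (picked from the pattern above) are assumptions, not code from this PR.

```java
import java.util.regex.Pattern;

final class HyphenatedPatternSketch {
  // Hypothetical fragments; each covers only a few compounds from the full pattern.
  static final String RATO = "ratos?-(?:casa|bambu)";
  static final String UVA = "uvas?-(?:praia|rei)";

  // Join the named fragments into one alternation between Unicode-aware word boundaries.
  static final Pattern HYPHENISED =
      Pattern.compile("(?U)\\b(?:" + String.join("|", RATO, UVA) + ")\\b");

  static boolean containsCompound(String text) {
    return HYPHENISED.matcher(text).find();
  }

  public static void main(String[] args) {
    System.out.println(containsCompound("um rato-casa")); // true
    System.out.println(containsCompound("uvas-rei"));     // true
    System.out.println(containsCompound("rato-praia"));   // false
  }
}
```

Each fragment can be tested and commented on its own, which is the maintainability gain the suggestion is after; whether that is worth it for a generated pattern of this size is a separate question.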
<afterbreak></afterbreak>
</rule>
<rule break="no">
<beforebreak>(?U)\b(абз|арк|ауд|бл|буд|бульв|вул|держ|дод|зав|зб|зв|зовн|екон|к|кв|канд|кн|напр|нпр|нац|обл|оп|пл|пол|поч|пп|пор|просп|розд|стор|табл|[Тт]]ел|ч|част)\.[\h\v]*</beforebreak>
<beforebreak>\b(абз|арк|ауд|бл|буд|бульв|вул|держ|дод|зав|зб|зв|зовн|екон|к|кв|канд|кн|напр|нпр|нац|обл|оп|пл|пол|поч|пп|пор|просп|розд|стор|табл|[Тт]]ел|ч|част)\.[\h\v]*</beforebreak>
Typographical error in character class in regex pattern
There's an extra closing bracket ']' in the character class [Тт]]ел, which may lead to incorrect regex matching or a syntax error. The correct character class should be [Тт]ел.
Apply this diff to fix the regex:
-<beforebreak>\b(абз|арк|ауд|бл|буд|бульв|вул|держ|дод|зав|зб|зв|зовн|екон|к|кв|канд|кн|напр|нпр|нац|обл|оп|пл|пол|поч|пп|пор|просп|розд|стор|табл|[Тт]]ел|ч|част)\.[\h\v]*</beforebreak>
+<beforebreak>\b(абз|арк|ауд|бл|буд|бульв|вул|держ|дод|зав|зб|зв|зовн|екон|к|кв|канд|кн|напр|нпр|нац|обл|оп|пл|пол|поч|пп|пор|просп|розд|стор|табл|[Тт]ел|ч|част)\.[\h\v]*</beforebreak>
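To see what the stray bracket does in practice, here is a small Java check. As far as I know, java.util.regex treats a ']' outside a character class as a literal, so the typo compiles but matches the wrong text rather than failing. The class and method names are made up for illustration, and the Cyrillic letters are written as Unicode escapes only to keep the source ASCII-safe.

```java
import java.util.regex.Pattern;

final class BracketTypoSketch {
  // "\u0422\u0442" are capital and small Cyrillic Te; "\u0435\u043B" is the rest of the abbreviation.
  static final Pattern WITH_TYPO = Pattern.compile("[\u0422\u0442]]\u0435\u043B");
  static final Pattern FIXED = Pattern.compile("[\u0422\u0442]\u0435\u043B");

  static boolean typoMatches(String s) {
    return WITH_TYPO.matcher(s).matches();
  }

  static boolean fixedMatches(String s) {
    return FIXED.matcher(s).matches();
  }

  public static void main(String[] args) {
    String tel = "\u0422\u0435\u043B";             // the intended abbreviation
    String telWithBracket = "\u0422]\u0435\u043B"; // same text with a literal ']' inserted
    System.out.println(typoMatches(tel));            // false
    System.out.println(typoMatches(telWithBracket)); // true
    System.out.println(fixedMatches(tel));           // true
  }
}
```

In other words, with the typo the rule would silently stop protecting the abbreviation instead of raising an error at load time.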
@arysin There's a typo in the regex here (]])
Thanks. It's good to be compatible with JDK19+, but we shouldn't make it a requirement, as Grails (used for community.languagetool.org) won't work with Java > 17 yet (grails/grails-core#13522).
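As a compatibility sketch: the embedded (?U) flag and Pattern.UNICODE_CHARACTER_CLASS have both existed since Java 7, so a pattern written either way should behave the same on Java 17 and on JDK 19+. My understanding of the JDK 19 release notes is that \b without this flag became ASCII-only, which is why the unflagged variant is deliberately not exercised here. The class and method names are hypothetical, and the Cyrillic text is spelled with Unicode escapes.

```java
import java.util.regex.Pattern;

final class UnicodeBoundarySketch {
  // "\u0430\u0431\u0437" is one of the Cyrillic abbreviations from the segmentation rule.
  static final String ABBREVIATION = "\\b(\u0430\u0431\u0437)\\.";

  // Two equivalent ways to request Unicode-aware word boundaries and character classes.
  static final Pattern EMBEDDED_FLAG = Pattern.compile("(?U)" + ABBREVIATION);
  static final Pattern COMPILE_FLAG =
      Pattern.compile(ABBREVIATION, Pattern.UNICODE_CHARACTER_CLASS);

  static boolean found(Pattern pattern, String text) {
    return pattern.matcher(text).find();
  }

  public static void main(String[] args) {
    // A short phrase in which the abbreviation follows a space.
    String text = "\u0434\u0438\u0432. \u0430\u0431\u0437. 3";
    System.out.println(found(EMBEDDED_FLAG, text)); // true
    System.out.println(found(COMPILE_FLAG, text));  // true
  }
}
```

Neither form needs an API newer than Java 7, so using the flag does not by itself raise the minimum Java version.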
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- languagetool-commandline/src/test/java/org/languagetool/commandline/AbstractSecurityTestCase.java (0 hunks)
- languagetool-commandline/src/test/java/org/languagetool/commandline/MainTest.java (5 hunks)
💤 Files with no reviewable changes (1)
- languagetool-commandline/src/test/java/org/languagetool/commandline/AbstractSecurityTestCase.java
🔇 Additional comments (3)
languagetool-commandline/src/test/java/org/languagetool/commandline/MainTest.java (3)
108-111: Ensure setup and teardown processes remain effective after superclass removal.
With the removal of extends AbstractSecurityTestCase, the setUp() and tearDown() methods no longer call super.setUp() and super.tearDown(). Verify that any essential initialization or cleanup previously handled by the superclass is now adequately managed within these methods to prevent any unintended side effects during test execution.
Also applies to: 118-120
131-131: Confirm that the exit code 1 for the help command is intentional.
The test expects an exit code of 1 when displaying the usage message with -h. Typically, help commands exit with status code 0 to indicate successful execution. Verify whether returning 1 is intended behavior for this application or if it should be updated to 0.
Run the following script to check the actual exit code when running the help command:
#!/bin/bash
# Description: Execute the help command and display the exit code.
# Expected result: Exit code should be 0 if the help command executed successfully.
java -cp "$(echo target/classes:~/.m2/repository/*)" org.languagetool.commandline.Main -h
echo "Exit code: $?"
37-37: Verify the impact of removing extends AbstractSecurityTestCase.
The MainTest class no longer extends AbstractSecurityTestCase. This superclass may have provided important functionality, such as handling System.exit calls during tests to prevent the JVM from exiting. Ensure that this change doesn't adversely affect the test behavior and that any necessary functionality from the superclass is replicated or no longer required.
Run the following script to check for any reliance on AbstractSecurityTestCase methods:
Process process = new ProcessBuilder(
  "java", "-cp", System.getProperty("java.class.path"), "org.languagetool.commandline.Main", "-h"
).start();
int exitCode = process.waitFor();
String output = readProcessOutput(process);
assertTrue(output.contains("Usage: java -jar languagetool-commandline.jar"));
assertEquals("Exit status", 1, exitCode);
🛠️ Refactor suggestion
Capture and assert the error stream when running subprocesses.
In the test method, only the standard output from the subprocess is captured. To ensure comprehensive testing, consider capturing the error stream (process.getErrorStream()) as well. This allows you to assert that no unexpected errors are occurring during execution and helps in diagnosing issues that may not appear in standard output.
Modify the test to read and assert the error output:
String errorOutput = readProcessError(process);
assertTrue("Error output should be empty", errorOutput.isEmpty());
And add a method to read the error stream:
private String readProcessError(Process process) throws IOException {
  try (BufferedReader reader = new BufferedReader(new InputStreamReader(process.getErrorStream()))) {
    StringBuilder output = new StringBuilder();
    String line;
    while ((line = reader.readLine()) != null) {
      output.append(line).append(System.lineSeparator());
    }
    return output.toString();
  }
}
private String readProcessOutput(Process process) throws IOException {
  try (BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream()))) {
    StringBuilder output = new StringBuilder();
    String line;
    while ((line = reader.readLine()) != null) {
      output.append(line).append(System.lineSeparator());
    }
    return output.toString();
  }
}
🛠️ Refactor suggestion
Prevent potential deadlocks by consuming both output streams of the subprocess.
The readProcessOutput method reads only from the subprocess's standard output stream. If the subprocess writes enough data to the error stream without it being read, it can block due to the buffer being full, leading to a deadlock. To mitigate this risk, read from both the standard output and error streams.
Refactor the method to consume both streams:
private String readProcessOutput(Process process) throws IOException {
  StringBuilder output = new StringBuilder();
  try (
    BufferedReader stdOutReader = new BufferedReader(new InputStreamReader(process.getInputStream()));
    BufferedReader stdErrReader = new BufferedReader(new InputStreamReader(process.getErrorStream()))
  ) {
    String line;
    while ((line = stdOutReader.readLine()) != null) {
      output.append(line).append(System.lineSeparator());
    }
    while ((line = stdErrReader.readLine()) != null) {
      // Optionally append to output or handle separately
      output.append(line).append(System.lineSeparator());
    }
  }
  return output.toString();
}
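One caveat with the version above: it drains stdout to the end before touching stderr, so a child that fills the stderr pipe first could still stall. A common alternative, sketched here under the assumption that the test does not need the two streams separately, is to merge them with ProcessBuilder.redirectErrorStream(true) and to read the output before calling waitFor(). The class name, the helper names, and the java -version command are illustrative only and are not part of this PR.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;

final class MergedStreamsSketch {
  /** Holds the exit code and the combined stdout/stderr text of a finished process. */
  static final class Result {
    final int exitCode;
    final String output;

    Result(int exitCode, String output) {
      this.exitCode = exitCode;
      this.output = output;
    }
  }

  /** Path of the JVM running this sketch (assumes a standard bin/java layout). */
  static String currentJava() {
    return Paths.get(System.getProperty("java.home"), "bin", "java").toString();
  }

  static Result run(String... command) {
    try {
      Process process = new ProcessBuilder(command)
          .redirectErrorStream(true) // stderr is merged into stdout, so one reader suffices
          .start();
      StringBuilder output = new StringBuilder();
      try (BufferedReader reader = new BufferedReader(
          new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
        String line;
        while ((line = reader.readLine()) != null) {
          output.append(line).append(System.lineSeparator());
        }
      }
      // Wait only after the stream reached end-of-file, so the child never blocks on a full pipe.
      return new Result(process.waitFor(), output.toString());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    // "java -version" writes to stderr, which makes it a handy check that merging works.
    Result result = run(currentJava(), "-version");
    System.out.println("exit=" + result.exitCode);
    System.out.print(result.output);
  }
}
```

If the test does need to assert on stderr separately, the streams would have to be read concurrently (for example on a second thread) instead of being merged.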
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Documentation
Chores
- … net.loomchild.segment to improve project stability.
Refactor
- … MainTest by removing inheritance from AbstractSecurityTestCase and adopting process execution for command-line tests.