- ...
- requires Java 1.8 or later now
- extended the dictionary
- extended the dictionary
- load compound parts from a plain text file in the JAR, not a binary file (it's even faster, about the same size and makes development easier)
- extended the dictionary
- added several exceptions
- New method AbstractWordSplitter.setMaximumWordLength() that sets a maximum length for the input: longer words will throw InputTooLongException to avoid excessively long processing times for artificial compounds. NOTE: The default is 70 and this is a change in behavior - there used to be no limit.
- The recursive method that does most work now properly checks if the current thread is interrupted using Thread.interrupted() and throws a RuntimeException if so.
- Fix: Whenever we split a word, we should first check whether it is a defined split exception (#18)
- New method getSubWords() that also gets the shortest matches. For example, Sauerstoffflasche will get Sauer, stoff, Sauerstoff, flasche. Thanks to GitHub user Tobulus.
- Added new constructor GermanWordSplitter(boolean hideInterfixCharacters, Set<String> words) (issue #11)
- Fixed an UnsupportedOperationException that could occur in non-strict mode
- extended the dictionary
- added several exceptions
- constructor GermanWordSplitter(boolean hideInterfixCharacters) now makes sure the dictionary is read only once
- fixed getAllSplits() to properly obey minimum word length
- requires Java 1.7 or later
- moved classes to package de.danielnaber.jwordsplitter
- rewrote algorithm to make it simpler, slightly faster and more correct: it now always returns the longest match
- default minimum length of the compound parts is now 3, leading to many more decompositions
- dictionary update
- internal: binary dictionary not part of repository anymore. The source files are in the repository now and everything can be built with build.sh.
- Important note for users who extend AbstractWordSplitter: getConnectingCharacters() must return lowercase characters now
- added new constructor: GermanWordSplitter(boolean hideConnectingCharacters, InputStream plainTextDict)
- added several words to the list of exceptions
- about 40% speedup if you split a lot of words (for many use cases, time of initialization will probably still outweigh runtime)
- Fixed an UnsupportedOperationException that could occur when strict mode is false (which is the default)
- Fixed the JAR so starting it with java -jar jWordSplitter-x.y.jar <filename> works again
- Removed EnglishWordSplitter as its dictionary was not included anyway. Extend AbstractWordSplitter if you want to add support for languages other than German.
- renamed jWordSplitter in package path to jwordsplitter to be in accordance with Java conventions
- New file exceptionsGerman.txt added to JAR that contains special cases to override the algorithm. With this we now get a correct split for e.g. Klimasünderecke -> Klima + Sünder + Ecke (used to be: Klima + Sünde + Recke)
- new method addException() to set the desired splitting for words, without touching the dictionary
- small dictionary cleanup (e.g. removing some three-letter words)
- now built with Maven, no other changes
- fixed a bug: compound parts that ended in "s" caused the splitting not to work
- using generics (i.e. at least Java 1.5 is required now for jWordSplitter)
- AbstractWordSplitter.splitWords(String) is now called AbstractWordSplitter.splitWord(String)
- slightly better handling of German compounds with hyphens
- small extensions to the dictionary
- in strict mode the minimum length of words is no longer ignored
- major dictionary update: using a smaller dictionary with fewer words but hopefully better quality (it's the one that was used for LanguageTool already)
- now distributed as a ZIP
- simplified output format of TestjWordSplitterGerman
- comes with a larger German dictionary
- ImportTxtFile is now called SerializeDict
- The JAR can now be used directly to split a list of German words: java -jar jWordSplitter.jar <filename>
- checked in original version from Sven Abels
- removed log4j dependency
- improved exception handling, i.e. exceptions aren't caught and logged but thrown
- code cleanup
- hardcoded three examples of German compound parts that don't exist as stand-alone words ("miet-", "grenz-", "ess-")
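
Several entries above describe behavior that is easier to picture in code: the split-exception lookup that runs before the algorithm, the maximum input length (default 70), the interruption check in the recursive method, and the longest-match rule. The following is a minimal, self-contained sketch of those ideas and not jWordSplitter's actual implementation. The class `SplitSketch`, its tiny dictionary, and the `staubecken` exception entry are hypothetical, and `IllegalArgumentException` merely stands in for the library's InputTooLongException.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; names and data are assumptions, not library code.
class SplitSketch {

    // Assumed toy dictionary of known compound parts.
    static final Set<String> WORDS = Set.of(
            "sauer", "stoff", "sauerstoff", "flasche",
            "stau", "staub", "becken", "ecken");

    // Assumed exception list: fixed splits that override the algorithm.
    static final Map<String, List<String>> EXCEPTIONS =
            Map.of("staubecken", List.of("stau", "becken"));

    static final int MIN_PART_LENGTH = 3;
    static int maximumWordLength = 70; // mirrors the documented default

    static List<String> split(String word) {
        // Reject overly long input up front (stand-in for InputTooLongException).
        if (word.length() > maximumWordLength) {
            throw new IllegalArgumentException("Input too long: " + word.length());
        }
        String lower = word.toLowerCase(Locale.ROOT);
        // Check the defined split exceptions before running the algorithm.
        List<String> exception = EXCEPTIONS.get(lower);
        if (exception != null) {
            return exception;
        }
        List<String> parts = splitRecursive(lower);
        // No complete decomposition found: return the word unsplit.
        return parts != null ? parts : List.of(lower);
    }

    // Returns the parts of 'rest', or null if it cannot be fully decomposed.
    static List<String> splitRecursive(String rest) {
        // Let callers abort long-running splits by interrupting the thread.
        if (Thread.interrupted()) {
            throw new RuntimeException("Interrupted while splitting");
        }
        if (rest.isEmpty()) {
            return new ArrayList<>();
        }
        // Try the longest prefix first so the longest match wins.
        for (int end = rest.length(); end >= MIN_PART_LENGTH; end--) {
            String head = rest.substring(0, end);
            if (!WORDS.contains(head)) {
                continue;
            }
            List<String> tail = splitRecursive(rest.substring(end));
            if (tail != null) {
                tail.add(0, head);
                return tail;
            }
        }
        return null;
    }
}
```

With this toy dictionary, `SplitSketch.split("Staubecken")` returns the exception entry rather than the longest-match result `staub` + `ecken`, which is the kind of case an exception list is meant to cover. Checking `Thread.interrupted()` at the top of each recursive call keeps the check cheap while still bounding how long an interrupted split keeps running.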