jWordSplitter Change Log

2023-xx-yy (4.8-SNAPSHOT)

  • ...

2023-09-21 (4.7)

  • requires Java 1.8 or later now
  • extended the dictionary

2023-09-16 (4.6)

  • extended the dictionary
  • load compound parts from a plain text file in the JAR, not a binary file (it's even faster, about the same size and makes development easier)

2021-06-15 (4.5)

  • extended the dictionary
  • added several exceptions

2018-05-05 (4.4)

  • New method AbstractWordSplitter.setMaximumWordLength() that sets a maximum length for the input: longer words will throw InputTooLongException to avoid excessively long processing times for artificial compounds. NOTE: The default is 70 and this is a change in behavior - there used to be no limit.
  • The recursive method that does most of the work now properly checks whether the current thread is interrupted using Thread.interrupted() and throws a RuntimeException if so.
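
The two 4.4 changes can be illustrated with a minimal, self-contained sketch. It is not jWordSplitter's code: `BoundedSplitterSketch`, `TooLongException` and the longest-prefix-first backtracking are assumptions made for illustration. Only the default limit of 70, the `setMaximumWordLength()` name, and the use of `Thread.interrupted()` with a `RuntimeException` come from the entry above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for a splitter; not a class from jWordSplitter.
class BoundedSplitterSketch {

    // Hypothetical exception standing in for the library's InputTooLongException.
    static class TooLongException extends RuntimeException {
        TooLongException(String message) {
            super(message);
        }
    }

    private final Set<String> dictionary;   // lowercase compound parts
    private int maximumWordLength = 70;     // default named in the 4.4 entry

    BoundedSplitterSketch(Set<String> dictionary) {
        this.dictionary = dictionary;
    }

    void setMaximumWordLength(int length) {
        this.maximumWordLength = length;
    }

    // Returns the lowercase parts, or the unchanged input if no full split exists.
    List<String> split(String word) {
        // Reject over-long input before any recursive work starts.
        if (word.length() > maximumWordLength) {
            throw new TooLongException("Input longer than " + maximumWordLength + " characters");
        }
        List<String> parts = new ArrayList<>();
        if (splitRecursively(word.toLowerCase(Locale.ROOT), parts)) {
            return parts;
        }
        List<String> unsplit = new ArrayList<>();
        unsplit.add(word);
        return unsplit;
    }

    private boolean splitRecursively(String rest, List<String> parts) {
        // Checked on every call so a caller can cancel a long-running split.
        // Note that Thread.interrupted() also clears the interrupt flag.
        if (Thread.interrupted()) {
            throw new RuntimeException("Interrupted while splitting");
        }
        if (rest.isEmpty()) {
            return true;
        }
        // Try the longest dictionary prefix first, backtracking on failure.
        for (int end = rest.length(); end > 0; end--) {
            String prefix = rest.substring(0, end);
            if (dictionary.contains(prefix)) {
                parts.add(prefix);
                if (splitRecursively(rest.substring(end), parts)) {
                    return true;
                }
                parts.remove(parts.size() - 1);
            }
        }
        return false;
    }
}
```

Callers that relied on the old unlimited behavior would raise the limit explicitly via the setter, at the cost of potentially long processing times for artificial compounds.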

2018-01-27 (4.3)

  • Fix: Whenever we split a word, we should first check whether it is a defined split exception (#18)

2017-09-10 (4.2)

  • New method getSubWords() that also returns the shortest matches. For example, Sauerstoffflasche yields Sauer, stoff, Sauerstoff, flasche. Thanks to GitHub user Tobulus.
  • Added new constructor GermanWordSplitter(boolean hideInterfixCharacters, Set<String> words) (issue #11)
  • Fixed an UnsupportedOperationException that could occur in non-strict mode
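
To make "also gets the shortest matches" concrete, here is a rough sketch that collects every dictionary word taking part in any complete decomposition of the input. The class `SubWordsSketch` and its `subWords()` method are hypothetical, the dictionary is a toy one, and the real getSubWords() may work differently and return its results in a different order.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration; not the library's getSubWords() implementation.
class SubWordsSketch {

    // Returns all dictionary words that occur in at least one full decomposition.
    static Set<String> subWords(String word, Set<String> dictionary) {
        Set<String> result = new LinkedHashSet<>();
        collect(word.toLowerCase(Locale.ROOT), dictionary, new ArrayList<>(), result);
        return result;
    }

    private static void collect(String rest, Set<String> dictionary,
                                List<String> current, Set<String> result) {
        if (rest.isEmpty()) {
            // The whole word was consumed: every part on the current path counts.
            result.addAll(current);
            return;
        }
        // Try every dictionary prefix, short and long, and explore all of them.
        for (int end = 1; end <= rest.length(); end++) {
            String prefix = rest.substring(0, end);
            if (dictionary.contains(prefix)) {
                current.add(prefix);
                collect(rest.substring(end), dictionary, current, result);
                current.remove(current.size() - 1);
            }
        }
    }
}
```

With a toy dictionary containing sauer, stoff, sauerstoff and flasche, the input Sauerstoffflasche yields all four words, because both Sauerstoff + flasche and Sauer + stoff + flasche are complete decompositions. Exploring every path is exponential in the worst case, which is acceptable for a sketch but not for production use.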

2015-05-09 (4.1)

  • extended the dictionary
  • added several exceptions
  • constructor GermanWordSplitter(boolean hideInterfixCharacters) now makes sure the dictionary is read only once
  • fixed getAllSplits() to properly obey minimum word length

2015-04-20 (4.0)

  • requires Java 1.7 or later
  • moved classes to package de.danielnaber.jwordsplitter
  • rewrote algorithm to make it simpler, slightly faster and more correct: it now always returns the longest match
  • default minimum length of the compound parts is now 3, leading to many more decompositions
  • dictionary update
  • internal: binary dictionary not part of repository anymore. The source files are in the repository now and everything can be built with build.sh.

2012-09-24 (3.4)

  • Important note for users who extend AbstractWordSplitter: getConnectingCharacters() must return lowercase characters now
  • added new constructor: GermanWordSplitter(boolean hideConnectingCharacters, InputStream plainTextDict)
  • added several words to the list of exceptions
  • about 40% speedup if you split a lot of words (for many use cases, time of initialization will probably still outweigh runtime)
  • Fixed an UnsupportedOperationException that could occur when strict mode is false (which is the default)
  • Fixed the JAR so starting it with java -jar jWordSplitter-x.y.jar <filename> works again
  • Removed EnglishWordSplitter as its dictionary was not included anyway. Extend AbstractWordSplitter if you want to add support for languages other than German.

2011-12-16 (3.3)

  • renamed jWordSplitter in package path to jwordsplitter to be in accordance with Java conventions
  • New file exceptionsGerman.txt added to JAR that contains special cases to override the algorithm. With this we now get a correct split for e.g. Klimasünderecke -> Klima + Sünder + Ecke (used to be: Klima + Sünde + Recke)
  • new method addException() to set the desired splitting for words, without touching the dictionary
  • small dictionary cleanup (e.g. removing some three-letter words)
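
The idea behind split exceptions can be sketched as a lookup that runs before the dictionary-based algorithm. Everything below is an assumption for illustration: the class `ExceptionAwareSplitterSketch`, the `List<String>` parameter of its `addException()` and the greedy fallback are hypothetical, and the signature of the library's own addException() is not given in this change log.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of exception-first splitting; not jWordSplitter code.
class ExceptionAwareSplitterSketch {

    private final Set<String> dictionary;                       // lowercase compound parts
    private final Map<String, List<String>> exceptions = new HashMap<>();

    ExceptionAwareSplitterSketch(Set<String> dictionary) {
        this.dictionary = dictionary;
    }

    // Registers a fixed split for one word without touching the dictionary.
    void addException(String word, List<String> parts) {
        exceptions.put(word.toLowerCase(Locale.ROOT), new ArrayList<>(parts));
    }

    List<String> split(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        // Exceptions take precedence over the algorithm.
        List<String> override = exceptions.get(lower);
        if (override != null) {
            return new ArrayList<>(override);
        }
        List<String> parts = new ArrayList<>();
        if (splitByDictionary(lower, parts)) {
            return parts;
        }
        List<String> unsplit = new ArrayList<>();
        unsplit.add(word);
        return unsplit;
    }

    // Longest-prefix-first split with backtracking.
    private boolean splitByDictionary(String rest, List<String> parts) {
        if (rest.isEmpty()) {
            return true;
        }
        for (int end = rest.length(); end > 0; end--) {
            String prefix = rest.substring(0, end);
            if (dictionary.contains(prefix)) {
                parts.add(prefix);
                if (splitByDictionary(rest.substring(end), parts)) {
                    return true;
                }
                parts.remove(parts.size() - 1);
            }
        }
        return false;
    }
}
```

With a toy dictionary of klima, sünde, recke and ecke (deliberately lacking sünder), this sketch splits Klimasünderecke into klima + sünde + recke. After registering the exception Klima, Sünder, Ecke for that word, the registered split is returned instead and the dictionary stays untouched.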

2011-02-06 (3.2)

  • now built with Maven, no other changes

2010-09-18 (3.1)

  • fixed a bug: compound parts that ended in "s" caused the splitting not to work

2009-10-25 (3.0)

  • using generics (i.e. at least Java 1.5 is required now for jWordSplitter)
  • AbstractWordSplitter.splitWords(String) is now called AbstractWordSplitter.splitWord(String)
  • slightly better handling of German compounds with hyphens
  • small extensions to the dictionary

2008-11-09 (2.2)

  • in strict mode the minimum length of words is no longer ignored
  • major dictionary update: using a smaller dictionary with fewer words but hopefully better quality (it's the one that was used for LanguageTool already)

2008-07-01 (2.1)

  • now distributed as a ZIP
  • simplified output format of TestjWordSplitterGerman
  • comes with a larger German dictionary
  • ImportTxtFile is now called SerializeDict

2007-05-12 (2.0)

  • The JAR can now be used directly to split a list of German words: java -jar jWordSplitter.jar <filename>

2007-05-08

  • checked in original version from Sven Abels
  • removed log4j dependency
  • improved exception handling, i.e. exceptions aren't caught and logged but thrown
  • code cleanup
  • hardcoded three examples of German compound parts that don't exist as stand-alone words ("miet-", "grenz-", "ess-")