String Similarity

  • String similarity measures how alike two strings are based on their structure (characters) rather than their meaning
  • String similarity metrics are commonly used to detect spelling mistakes
  • The most widely known metric is the Levenshtein distance, also called edit distance

Minimum Edit Distance or Levenshtein Distance

  • It is a standard way of measuring string similarity.

  • The minimum edit distance is the minimum number of editing operations needed to transform one string into another

    • The editing operations are:
      • Insertion (i)
      • Deletion (d)
      • Substitution (s)
  • To compare two strings, an alignment between them is computed first.

  • The figure below shows an alignment of the two strings "intention" and "execution"

  • To turn "intention" into "execution", five editing operations are needed: one deletion, three substitutions, and one insertion.

    [Figure: alignment of "intention" and "execution" with the corresponding edit operations]

  • Image taken from "Speech and Language Processing" by Daniel Jurafsky & James H. Martin

  • If the cost of each operation is 1, the distance between the two strings is 5

  • In the Levenshtein distance, the cost of a substitution is 2, while insertion and deletion each cost 1

  • Therefore, the total cost of turning "intention" into "execution" is 8 (one deletion + three substitutions at cost 2 each + one insertion = 1 + 6 + 1)

  • However, an older version of the Levenshtein distance uses a substitution cost of 1

How to find the minimum edit distance

  • This can be viewed as a search task, in which we look for the shortest sequence of edits from one string to another:

    • Initial State: the word that is to be transformed
    • Operators: insert, delete and substitute
    • Goal State: the word we are trying to reach
    • Path Cost: the number of edits, which we want to minimize

    [Figure: search space of edit operations]

    • However, the space of all possible edit sequences is huge, so exploring the whole space is difficult and time consuming.
    • Many different edit paths end in the same state (string), so rather than recomputing all those paths we can store the shortest path to each state.
    • This is done with dynamic programming.

Dynamic Programming

  • Introduced by Bellman (1957).
  • It is a table-driven method that solves complex problems by combining solutions to sub-problems.
  • The working of the algorithm is explained in [7] and [2]; a small sketch is given below.
  • Since the distance value alone is sometimes not enough, we may also need the character-level alignment of the two strings
  • This alignment is produced by a "backtrace", which is an extension of the edit distance algorithm.
    • In the backtrace, we start from the final row and column and follow the pointers back through the dynamic programming matrix
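
  • The sketch below (in Python; the function and variable names are illustrative, not from the original text) shows one way the dynamic programming table can be filled, using insertion/deletion cost 1 and substitution cost 2 as described above:

```python
def min_edit_distance(source, target, sub_cost=2):
    """Minimum edit distance via a dynamic programming table.

    Insertions and deletions cost 1; substitutions cost `sub_cost`
    (2 in the Levenshtein formulation used above).
    """
    n, m = len(source), len(target)
    # D[i][j] = distance between the first i characters of source
    # and the first j characters of target
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                      # delete all i characters
    for j in range(1, m + 1):
        D[0][j] = j                      # insert all j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + sub)  # substitution (or match)
    return D[n][m]

print(min_edit_distance("intention", "execution"))              # 8
print(min_edit_distance("intention", "execution", sub_cost=1))  # 5
```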

How Minimum Edit Distance is useful

  • It is useful in finding potential spelling error corrections
  • Alignment is useful throughout speech and language processing

Summary of Minimum Edit Distance

How Minimum Edit Distance can help with Semantic Similarity

  • The most commonly known metric for string similarity is the Levenshtein distance
  • The Levenshtein distance is used for spelling error correction
  • We can use the minimum edit distance as a rough proxy for semantic similarity, but it is not always optimal: it does not always give the correct result, since structurally similar strings can have very different meanings
  • NLTK provides a function for the Levenshtein distance, and for spell checking we can use the PyEnchant library
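
  • As a rough illustration (assuming the nltk and pyenchant packages are installed; the variable names are ours), the two libraries could be combined like this:

```python
import enchant                      # PyEnchant
from nltk.metrics import edit_distance

d = enchant.Dict("en_US")
word = "rememer"

if not d.check(word):               # word is not in the dictionary
    suggestions = d.suggest(word)   # candidate corrections from PyEnchant
    if suggestions:
        # rank candidates by Levenshtein distance (NLTK uses substitution cost 1 by default)
        best = min(suggestions, key=lambda s: edit_distance(word, s))
        print(best)                 # likely "remember", depending on the dictionary
```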

N-Gram

  • N-grams can also be used to compute the similarity between strings.
  • An n-gram is a contiguous sequence of n characters in a word, or of n words in a sentence.
  • N-grams are computed by sliding a window of n tokens over the text, moving it forward one token at a time.
  • For example, given the sentence "The quick brown fox jumps over a lazy dog" and N = 2 (bigrams), the n-grams are:
    • The quick
    • quick brown
    • brown fox
    • fox jumps
    • jumps over
    • over a
    • a lazy
    • lazy dog
  • Similarly, we can compute character n-grams of a single word (see the sketch below).
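
  • A small sketch (the ngrams helper below is illustrative, not a library function) of computing the bigrams of the example sentence, and the character bigrams of a word:

```python
def ngrams(tokens, n):
    """Slide a window of size n over the token list, one step at a time."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "The quick brown fox jumps over a lazy dog"
bigrams = ngrams(sentence.split(), 2)
print(bigrams)       # [('The', 'quick'), ('quick', 'brown'), ..., ('lazy', 'dog')]
print(len(bigrams))  # 8, i.e. X - (N - 1) = 9 - 1

print(ngrams(list("fox"), 2))  # character bigrams of a word: [('f', 'o'), ('o', 'x')]
```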

How many n-grams in a sentence

  • Ngrams = X - (N - 1)
  • Where X is the number of words in the sentence (or the number of characters in the word) and N is the size of the gram: N = 1 (unigram), N = 2 (bigram), N = 3 (trigram), and so on. For the example sentence above, X = 9 and N = 2, so there are 9 - 1 = 8 bigrams.

What are N-grams used for?

  • N-grams are used for spell correction, word breaking and text summarization.
  • Google and Microsoft have developed n-gram models for spell correction, word breaking and text summarization.

Jaccard Correlation Coefficient

  • It is used to compute the similarity between two sets.
  • Its value varies between 0 and 1, where 0 means the sets are completely dissimilar and 1 means they are identical.
  • The formula for the coefficient is:
    • jaccard_coefficient = |A ∩ B| / |A ∪ B|
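
  • A minimal sketch of the coefficient in Python (the set names and example bigrams are illustrative):

```python
def jaccard(a, b):
    """Jaccard coefficient of two sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

A = {"no", "ov", "ve", "em", "mb", "be", "er"}  # character bigrams of "november"
B = {"oo", "ov", "ve", "em", "mb", "be", "er"}  # character bigrams of "oovember"
print(jaccard(A, B))  # 6 shared bigrams out of 8 distinct -> 0.75
```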

How N-Gram and Jaccard Correlation Coefficient can be used in Spell Correction

  • Find misspelled words using the PyEnchant library.
  • Get the suggested words for each misspelled word using PyEnchant.
  • Filter the suggested words, keeping only those within some minimum edit distance of the misspelled word.
  • Compute the n-grams of the misspelled word and of each suggested word.
  • Compute the Jaccard coefficient between the misspelled word and each suggested word.
  • Replace the misspelled word with the suggestion that has the maximum Jaccard coefficient (a sketch of the full pipeline follows this list).
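
  • A hedged sketch of the whole pipeline (assuming pyenchant and nltk; the maximum-distance threshold and the helper names are illustrative choices, not fixed by this page):

```python
import enchant
from nltk.metrics import edit_distance

d = enchant.Dict("en_US")

def char_ngrams(word, n=2):
    """Set of character n-grams of a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 1.0

def correct(word, max_dist=3, n=2):
    if d.check(word):                                  # step 1: not misspelled
        return word
    suggestions = d.suggest(word)                      # step 2: suggested words
    # step 3: keep only suggestions within max_dist edits of the misspelled word
    close = [s for s in suggestions if edit_distance(word, s) <= max_dist]
    if not close:
        return word
    # steps 4-6: choose the suggestion with the maximum Jaccard coefficient
    grams = char_ngrams(word, n)
    return max(close, key=lambda s: jaccard(grams, char_ngrams(s, n)))

# exact output depends on the installed dictionary
print([correct(w) for w in "Alway rememer oovember and decemer".split()])
```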

Summary of N-Gram and Jaccard Correlation Coefficient

  • Finding misspelled words this way is not optimal, because the PyEnchant dictionary contains many words that seem incorrect or out of dictionary, and it also contains proper nouns.
    • For example: if we write "Always" as "Alway", it is flagged as incorrect and the suggested words are: ['Alway', 'Amway', 'Al way', 'Al-way', 'Away', 'Always', 'Allay', 'Alwin']
    • Since "Alway" itself has the maximum Jaccard coefficient, the word is simply replaced by "Alway" again
  • However, in some cases it does spell correction well.
    • For example, for the sentence "Alway rememer oovember and decemer"
      • it returns: ['Always', 'remember', 'November', 'and', 'December']

Evaluation of both spell corrector techniques (Minimum Edit Distance only vs. Minimum Edit Distance, N-Gram and Jaccard Correlation Coefficient) on a dataset

  • The data set was downloaded from Wikipedia and contains misspelled words along with their correct forms.
  • Around 670 words were selected randomly from the data set to evaluate the performance of the spell correctors.

Conclusion

  • After evaluating, on the dataset of misspelled words from Wikipedia, the spell corrector based on n-grams, the Jaccard coefficient and edit distance, and the spell corrector based on the Levenshtein (minimum edit) distance alone, the two methods achieved accuracies of 69.19% and 59.75% respectively.
  • It can be concluded that this accuracy is not sufficient for automatically replacing misspelled words with corrected ones; however, since automatic short answer grading only requires detecting misspelled words in a student's answer, this can be done with the help of the PyEnchant library.

References

  1. Natural Language Processing with Java by Richard M. Reese
  2. Speech and Language Processing by Daniel Jurafsky and James H. Martin
  3. https://www.tm-town.com/blog/is-segmentation-solved-problem
  4. http://esatjournals.net/ijret/2014v03/i01/IJRET20140301026.pdf
  5. https://www.youtube.com/watch?v=hwDhO1GLb_4&index=2&list=PL6397E4B26D00A269
  6. https://queryunderstanding.com/stemming-and-lemmatization-6c086742fe45
  7. https://www.youtube.com/watch?v=We3YDTzNXEk
  8. http://www.dictionary.com/browse/n-gram