String Similarity

  • String similarity measures how alike two strings are based on their structure (characters) rather than their meaning
  • String similarity metrics are commonly used to detect spelling mistakes
  • The most widely known metric is the Levenshtein distance, also called edit distance

Minimum Edit Distance or Levenshtein Distance

  • It is a standard way of measuring string similarity.

  • The minimum edit distance is the minimum number of editing operations needed to transform one string into another

    • The editing operations are:
      • Insertion (i)
      • Deletion (d)
      • Substitution (s)
  • To compare two strings, an alignment between them is computed first.

  • The figure below shows an alignment of the two strings "intention" and "execution"

  • To turn "intention" into "execution", five editing operations are needed: one deletion, three substitutions, and one insertion.

    [Figure: alignment of "intention" and "execution" with the corresponding edit operations]

  • Image taken from "Speech and Language Processing" by Daniel Jurafsky & James H. Martin

  • If the cost of each operation is 1, the distance between the two strings is 5

  • In the Levenshtein distance, the cost of a substitution is 2, while insertion and deletion each cost 1

  • Therefore, the total cost of turning "intention" into "execution" is 8 (one deletion + three substitutions at cost 2 each + one insertion = 1 + 6 + 1)

  • However, an older version of the Levenshtein distance uses a substitution cost of 1

How to find the minimum edit distance

  • This can be viewed as a search task, in which we look for the shortest sequence of edits from one string to another:

    • Initial State: the word that is to be transformed
    • Operators: insert, delete and substitute
    • Goal State: the word we are trying to reach
    • Path Cost: the number of edits, which we want to minimize

    [Figure: search space of edit operations]

    • However, the space of all possible edit sequences is huge, so exploring the whole space is difficult and time consuming.
    • Many different edit paths end in the same state (string), so rather than recomputing all those paths we can store the shortest path to each state.
    • This is done with dynamic programming.

Dynamic Programming

  • Introduced by Bellman (1957).
  • It is a table-driven method that solves complex problems by combining solutions to sub-problems.
  • The working of the algorithm is explained in [7] and [2]; a small sketch is given below.
  • Since the distance value alone is sometimes not enough, we may also need the character-level alignment of the two strings
  • This alignment is produced by a "backtrace", which is an extension of the edit distance algorithm.
    • In the backtrace, we start from the final row and column and follow the pointers back through the dynamic programming matrix
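
  • The sketch below (in Python; the function and variable names are illustrative, not from the original text) shows one way the dynamic programming table can be filled, using insertion/deletion cost 1 and substitution cost 2 as described above:

```python
def min_edit_distance(source, target, sub_cost=2):
    """Minimum edit distance via a dynamic programming table.

    Insertions and deletions cost 1; substitutions cost `sub_cost`
    (2 in the Levenshtein formulation used above).
    """
    n, m = len(source), len(target)
    # D[i][j] = distance between the first i characters of source
    # and the first j characters of target
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                      # delete all i characters
    for j in range(1, m + 1):
        D[0][j] = j                      # insert all j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + sub)  # substitution (or match)
    return D[n][m]

print(min_edit_distance("intention", "execution"))              # 8
print(min_edit_distance("intention", "execution", sub_cost=1))  # 5
```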

How Minimum Edit Distance is useful

  • It is useful in finding potential spelling error corrections
  • Alignment is useful throughout speech and language processing

Summary of Minimum Edit Distance

How Minimum Edit Distance can help with Semantic Similarity

  • The most commonly known metric for string similarity is the Levenshtein distance
  • The Levenshtein distance is used for spelling error correction
  • We can use the minimum edit distance as a rough proxy for semantic similarity, but it is not always optimal: it does not always give the correct result, since structurally similar strings can have very different meanings
  • NLTK provides a function for the Levenshtein distance, and for spell checking we can use the PyEnchant library
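
  • As a rough illustration (assuming the nltk and pyenchant packages are installed; the variable names are ours), the two libraries could be combined like this:

```python
import enchant                      # PyEnchant
from nltk.metrics import edit_distance

d = enchant.Dict("en_US")
word = "rememer"

if not d.check(word):               # word is not in the dictionary
    suggestions = d.suggest(word)   # candidate corrections from PyEnchant
    if suggestions:
        # rank candidates by Levenshtein distance (NLTK uses substitution cost 1 by default)
        best = min(suggestions, key=lambda s: edit_distance(word, s))
        print(best)                 # likely "remember", depending on the dictionary
```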

N-Gram

  • N-grams can also be used to compute the similarity between strings.
  • An n-gram is a contiguous sequence of n characters in a word, or of n words in a sentence.
  • N-grams are computed by sliding a window of n tokens over the text, moving it forward one token at a time.
  • For example, given the sentence "The quick brown fox jumps over a lazy dog" and N = 2 (bigrams), the n-grams are:
    • The quick
    • quick brown
    • brown fox
    • fox jumps
    • jumps over
    • over a
    • a lazy
    • lazy dog
  • Similarly, we can compute character n-grams of a single word (see the sketch below).
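
  • A small sketch (the ngrams helper below is illustrative, not a library function) of computing the bigrams of the example sentence, and the character bigrams of a word:

```python
def ngrams(tokens, n):
    """Slide a window of size n over the token list, one step at a time."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "The quick brown fox jumps over a lazy dog"
bigrams = ngrams(sentence.split(), 2)
print(bigrams)       # [('The', 'quick'), ('quick', 'brown'), ..., ('lazy', 'dog')]
print(len(bigrams))  # 8, i.e. X - (N - 1) = 9 - 1

print(ngrams(list("fox"), 2))  # character bigrams of a word: [('f', 'o'), ('o', 'x')]
```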

How many n-grams in a sentence

  • Ngrams = X - (N - 1)
  • Where X is the number of words in the sentence (or the number of characters in the word) and N is the size of the gram: N = 1 (unigram), N = 2 (bigram), N = 3 (trigram), and so on. For the example sentence above, X = 9 and N = 2, so there are 9 - 1 = 8 bigrams.

What are N-grams used for?

  • N-grams are used for spell correction, word breaking and text summarization.
  • Google and Microsoft have developed n-gram models for spell correction, word breaking and text summarization.

Jaccard Correlation Coefficient

  • It is used to compute the similarity between two sets.
  • Its value varies between 0 and 1, where 0 means the sets are completely dissimilar and 1 means they are identical.
  • The formula for the coefficient is:
    • jaccard_coefficient = |A ∩ B| / |A ∪ B|
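
  • A minimal sketch of the coefficient in Python (the set names and example bigrams are illustrative):

```python
def jaccard(a, b):
    """Jaccard coefficient of two sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

A = {"no", "ov", "ve", "em", "mb", "be", "er"}  # character bigrams of "november"
B = {"oo", "ov", "ve", "em", "mb", "be", "er"}  # character bigrams of "oovember"
print(jaccard(A, B))  # 6 shared bigrams out of 8 distinct -> 0.75
```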

How N-Gram and Jaccard Correlation Coefficient can be used in Spell Correction

  • Find misspelled words using the PyEnchant library.
  • Get the suggested words for each misspelled word using PyEnchant.
  • Filter the suggested words, keeping only those within some minimum edit distance of the misspelled word.
  • Compute the n-grams of the misspelled word and of each suggested word.
  • Compute the Jaccard coefficient between the misspelled word and each suggested word.
  • Replace the misspelled word with the suggestion that has the maximum Jaccard coefficient (a sketch of the full pipeline follows this list).
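
  • A hedged sketch of the whole pipeline (assuming pyenchant and nltk; the maximum-distance threshold and the helper names are illustrative choices, not fixed by this page):

```python
import enchant
from nltk.metrics import edit_distance

d = enchant.Dict("en_US")

def char_ngrams(word, n=2):
    """Set of character n-grams of a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 1.0

def correct(word, max_dist=3, n=2):
    if d.check(word):                                  # step 1: not misspelled
        return word
    suggestions = d.suggest(word)                      # step 2: suggested words
    # step 3: keep only suggestions within max_dist edits of the misspelled word
    close = [s for s in suggestions if edit_distance(word, s) <= max_dist]
    if not close:
        return word
    # steps 4-6: choose the suggestion with the maximum Jaccard coefficient
    grams = char_ngrams(word, n)
    return max(close, key=lambda s: jaccard(grams, char_ngrams(s, n)))

# exact output depends on the installed dictionary
print([correct(w) for w in "Alway rememer oovember and decemer".split()])
```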

Summary of N-Gram and Jaccard Correlation Coefficient

  • Finding misspelled words this way is not optimal, because the PyEnchant dictionary contains many words that seem incorrect or out of dictionary, and it also contains proper nouns.
    • For example: if we write "Always" as "Alway", it is flagged as incorrect and the suggested words are: ['Alway', 'Amway', 'Al way', 'Al-way', 'Away', 'Always', 'Allay', 'Alwin']
    • Since "Alway" itself has the maximum Jaccard coefficient, the word is simply replaced by "Alway" again
  • However, in some cases it does spell correction well.
    • For example, for the sentence "Alway rememer oovember and decemer"
      • it returns: ['Always', 'remember', 'November', 'and', 'December']

Evaluation of both spell corrector techniques (Minimum Edit Distance only vs. Minimum Edit Distance, N-Gram and Jaccard Correlation Coefficient) on a dataset

  • The data set was downloaded from Wikipedia and contains misspelled words along with their correct forms.
  • Around 670 words were selected randomly from the data set to evaluate the performance of the spell correctors.

Conclusion

  • After evaluating, on the dataset of misspelled words from Wikipedia, the spell corrector based on n-grams, the Jaccard coefficient and edit distance, and the spell corrector based on the Levenshtein (minimum edit) distance alone, the two methods achieved accuracies of 69.19% and 59.75% respectively.
  • It can be concluded that this accuracy is not sufficient for automatically replacing misspelled words with corrected ones; however, since automatic short answer grading only requires detecting misspelled words in a student's answer, this can be done with the help of the PyEnchant library.

References

  1. Natural Language Processing with Java by Richard M. Reese
  2. Speech and Language Processing by Daniel Jurafsky and James H. Martin
  3. https://www.tm-town.com/blog/is-segmentation-solved-problem
  4. http://esatjournals.net/ijret/2014v03/i01/IJRET20140301026.pdf
  5. https://www.youtube.com/watch?v=hwDhO1GLb_4&index=2&list=PL6397E4B26D00A269
  6. https://queryunderstanding.com/stemming-and-lemmatization-6c086742fe45
  7. https://www.youtube.com/watch?v=We3YDTzNXEk
  8. http://www.dictionary.com/browse/n-gram