This code is a C++ implementation of Aho-Corasick patern matching algorithm.
In their paper published in 1975, Alfred V. Aho and Margaret J. Corasick describe this algorithm like this:
[Aho-Corasick is an] efficient algorithm to locate all occurences of any of finite number of keywords in a string of text.
Nowaday, this algorithm remains the best one in termes of memory usage and complexity to search lot of patterns simultaneously in one pass. The complexity is linear in the length of the patterns plus the length of the searched text plus the number of output matches.
The famous example is GNU Grep that uses Aho-Corasick algorithm to match multiple fixed patterns (cf. GNU Grep Performance page).
Reading this page means you had made some few searches about Aho & Corasick algorithm, and you would certainly agree with that there are a lot of Aho & Corasick algorithm implementations in many languages.
The main aim of this code is to propose a light, simple and easy to read Aho & Corasick algorithm implementation using a lexicographic tree.
Using a lexicographic tree is a such of naïve approch, but that allows to:
- remain close to what we figure out when we are thinking about this algorithm,
- take avantage of the tree structure to implement some algorithm's parts in recursive way (e.g. failure nodes computation or goto/output fonction)
In addition, with this code you will find:
- An easy to use, adapt and embed object-oriented implementation.
- An uncommon recursive implementation of calculating failure nodes (definitely needs improvement).
- A function to convert tree to a DataViz graph file. This file can be converted in picture, that is helpful to understand how the aho-coracick algorithm works.
This implementation is header only. To compile the code, your compiler must support C++17 standard features.
The following will create a lexicographic tree with a couple of keywords.
aho_corasick::LexicographicTree lt;
lt.addKeyword( "he" );
lt.addKeyword( "she" );
lt.addKeyword( "his" );
lt.addKeyword( "hers" );
lt.finalize();
Once the tree structure is finalized, it is ready to be used to find the keywords: just give the characters one by one to the processAndGetOutput
methode like this:
std::string text("ushers");
for(const auto& current_char : text) {
for( const auto* str : lt.processAndGetOutput(current_char)) {
std::cout << *str << std::endl;
}
Each call to processAndGetOutput
methode returns a "set" of keywords. This set may or may not be empty, depending on whether the last current_char
reached none, one or several keywords.
The methode
const std::string DisplayTools::getGraphVizDescription( const LexicoNode * lnode, bool graphSuffix = true, bool graphWord = true )
can be used to transform the tree in to DataViz file that would be used later to draw the tree.
Then use GraphViz dot
command like this:
> dot -Tpng graph1.dot > graph1.png
The result is:
To view GraphViz diagrams online, visit this project and try to use GraphvizOnline online here.