Merge pull request #4 from pedroegsilva/changeDocs
Change docs
pedroegsilva authored Aug 30, 2021
2 parents afe2dae + 3a34c01 commit b141f0b
Showing 8 changed files with 334 additions and 70 deletions.
191 changes: 190 additions & 1 deletion README.md
@@ -1 +1,190 @@
# gofindthem
Gofindthem is a Go library that combines a domain-specific language (DSL), which reads like a logical expression, with a substring matching engine (currently implementations of Aho-Corasick).
This enables more complex searches with easy-to-read expressions.
It supports searching for multiple expressions at once, making it an efficient way to "classify" documents according to the expressions that matched.

This project was conceived in 2019, while I was looking for a more efficient way to process millions of documents.
During my research I found [this post about FlashText](https://medium.freecodecamp.org/regex-was-taking-5-days-flashtext-does-it-in-15-minutes-55f04411025f) by Vikash Singh
and discovered Aho-Corasick, which is an awesome algorithm, but it didn't solve my problem completely.
I needed something that could process text as fast as the Aho-Corasick implementations, but that could also do more complex searches.
The other problem was that the expressions were managed by a team of analysts who were used to classifying those documents with regex,
so I needed a syntax that was easy enough to convince them to change.
The idea was to create a DSL with the operators "AND", "OR" and "NOT", which behave like the corresponding logical operations, and a new operator "INORD"
that checks whether the terms were found in the same order in which they were specified.
This would allow searches that used the regex `foo.*bar` to be replaced with `INORD("foo" and "bar")`, and the combination of the regexes
`foo.*bar` and `bar.*foo` to be replaced with `"foo" and "bar"`. This is not supposed to be a replacement for regex, but it was enough for most of the use cases I had back then.
For the cases that only regex could solve, I added a way to represent a regex with the syntax `R"foo.*bar"`.
Each kind of term then uses its respective engine to find its matches, which reduces the need to use regex for everything.
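
To make the regex replacement described above concrete, here is a minimal sketch that uses only the Go standard library, not gofindthem itself. The function names `matchWithRegex` and `matchWithAnd` are made up for this illustration. It assumes single-line text, because `.` in Go's `regexp` does not match a newline by default.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Two order-specific regexes, as an analyst might have written them.
var (
	fooThenBar = regexp.MustCompile(`foo.*bar`)
	barThenFoo = regexp.MustCompile(`bar.*foo`)
)

// matchWithRegex reports whether either ordering of the terms matches.
func matchWithRegex(text string) bool {
	return fooThenBar.MatchString(text) || barThenFoo.MatchString(text)
}

// matchWithAnd mirrors the DSL expression `"foo" and "bar"`:
// both terms must occur, in any order.
func matchWithAnd(text string) bool {
	return strings.Contains(text, "foo") && strings.Contains(text, "bar")
}

func main() {
	for _, text := range []string{"foo then bar", "bar then foo", "only foo here"} {
		fmt.Printf("%q regex=%v and=%v\n", text, matchWithRegex(text), matchWithAnd(text))
	}
}
```

For these inputs both functions agree, which is the point of the `"foo" and "bar"` form: one readable expression instead of two order-specific regexes.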

This repository is the golang implementation of this idea.

The scanner and parser for the DSL are heavily influenced by [this post on parsers and lexers](https://blog.gopheracademy.com/advent-2014/parsers-lexers/) by Ben Johnson,
which is in turn heavily influenced by the [InfluxQL parser](https://github.com/influxdb/influxdb/tree/master/influxql).

PS: The INORD operator and regex terms are not yet supported in this version.

## Usage/Examples

There are two libraries in this repository: the DSL and the Finder.

### Finder
First you need to create the Finder. The Finder needs a `SubstringEngine` (the interface can be found at `/finder/substringEngine.go`)
and a flag that says whether the search is case sensitive.
There are three implementations of `SubstringEngine`, which use the libraries from
https://github.com/cloudflare/ahocorasick,
https://github.com/anknown/ahocorasick and
https://github.com/petar-dambovaliev/aho-corasick.
Any other library can be used, as long as it is wrapped in a type that implements the `SubstringEngine` interface.
```go
subEng := &finder.PetarDambovalievEngine{}
caseSensitive := true
findthem := finder.NewFinder(subEng, caseSensitive)
```
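
The idea of a pluggable engine can be sketched as follows. This is only an illustration: `termSearcher` and its methods are hypothetical and are not the library's `SubstringEngine` interface, whose real method set is defined in `/finder/substringEngine.go`. The naive implementation stands in for an Aho-Corasick library and scans the text once per term.

```go
package main

import (
	"fmt"
	"strings"
)

// termSearcher is a hypothetical stand-in for a pluggable substring engine.
// It is NOT the library's SubstringEngine interface.
type termSearcher interface {
	// build stores the terms to look for.
	build(terms []string)
	// find returns, for each stored term present in text, its start offsets
	// in ascending order.
	find(text string) map[string][]int
}

// naiveSearcher scans the text once per term with strings.Index.
type naiveSearcher struct {
	terms []string
}

func (n *naiveSearcher) build(terms []string) { n.terms = terms }

func (n *naiveSearcher) find(text string) map[string][]int {
	found := make(map[string][]int)
	for _, term := range n.terms {
		if term == "" {
			continue // an empty term would match everywhere
		}
		for from := 0; from < len(text); {
			i := strings.Index(text[from:], term)
			if i < 0 {
				break
			}
			found[term] = append(found[term], from+i)
			from += i + 1 // step one byte so overlapping matches are kept
		}
	}
	return found
}

func main() {
	var s termSearcher = &naiveSearcher{}
	s.build([]string{"Lorem", "ipsum", "volutpat"})
	fmt.Println(s.find("Lorem ipsum dolor sit amet, Lorem"))
}
```

Because the caller depends only on the interface, a faster engine could replace `naiveSearcher` without touching the code that evaluates the expressions.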

Then you need to add the expressions that need to be solved.
```go
if err := findthem.AddExpression(`"Lorem" and "ipsum"`); err != nil {
	log.Fatal(err)
}

if err := findthem.AddExpression(`("Nullam" and not "volutpat")`); err != nil {
	log.Fatal(err)
}

if err := findthem.AddExpression(`"lorem ipsum" AND ("dolor" or "accumsan")`); err != nil {
	log.Fatal(err)
}

if err := findthem.AddExpression(`"purus.\nSuspendisse"`); err != nil {
	log.Fatal(err)
}
```

And finally you can check which expressions match each text.
```go
// texts is a []string with the documents to classify (see the full example)
for i, text := range texts {
	resp, err := findthem.ProcessText(text)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("----------Text %d -----------\n", i)
	for exp, val := range resp {
		fmt.Printf("exp: %s | %v\n", exp, val)
	}
}
```

The full example can be found at `/examples/finder/main.go`
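
The flow above (search every term once, then evaluate each expression against the shared results) can be sketched without the library. Everything in this block is an assumption made for illustration: `rule`, `classify` and `exampleRules` are hypothetical names, `strings.Contains` stands in for the Aho-Corasick engine, and the boolean logic is written by hand instead of being parsed from the DSL.

```go
package main

import (
	"fmt"
	"strings"
)

// rule is a hypothetical, hand-written stand-in for a parsed expression.
type rule struct {
	expr  string                           // the expression text, used as the result key
	terms []string                         // terms the expression depends on
	eval  func(found map[string]bool) bool // boolean logic over the found terms
}

// classify searches each distinct term once and then evaluates every rule
// against the shared results.
func classify(text string, rules []rule) map[string]bool {
	found := make(map[string]bool)
	for _, r := range rules {
		for _, t := range r.terms {
			if _, seen := found[t]; !seen {
				found[t] = strings.Contains(text, t)
			}
		}
	}
	result := make(map[string]bool, len(rules))
	for _, r := range rules {
		result[r.expr] = r.eval(found)
	}
	return result
}

// exampleRules mirrors the first two expressions added to the Finder above.
func exampleRules() []rule {
	return []rule{
		{
			expr:  `"Lorem" and "ipsum"`,
			terms: []string{"Lorem", "ipsum"},
			eval:  func(f map[string]bool) bool { return f["Lorem"] && f["ipsum"] },
		},
		{
			expr:  `("Nullam" and not "volutpat")`,
			terms: []string{"Nullam", "volutpat"},
			eval:  func(f map[string]bool) bool { return f["Nullam"] && !f["volutpat"] },
		},
	}
}

func main() {
	texts := []string{
		"Lorem ipsum dolor sit amet. Curabitur porta lobortis nulla volutpat sagittis.",
		"Lorem ipsum Nullam non purus eu leo accumsan cursus a quis erat.",
	}
	for i, text := range texts {
		fmt.Printf("text %d: %v\n", i, classify(text, exampleRules()))
	}
}
```

Sharing one term search across all expressions is what makes adding more expressions cheap: the text is scanned for each distinct term only once, no matter how many expressions use it.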

### DSL
First you need to create the parser object.
The parser needs a reader with the expression that will be parsed and a flag that says whether it is case sensitive.
```go
caseSensitive := false
p := dsl.NewParser(strings.NewReader(`"lorem ipsum" AND ("dolor" or "accumsan")`), caseSensitive)
```
Then you can parse the expression.
```go
expression, err := p.Parse()
if err != nil {
	log.Fatal(err)
}
```

Once the expression is parsed, you can extract the terms that it contains.
```go
keywords := p.GetKeywords()
fmt.Printf("keywords:\n%v\n", keywords)
```

You can also pretty-print the expression to see its Abstract Syntax Tree (AST).
```go
fmt.Printf("pretty format:\n%s\n", expression.PrettyFormat())
```
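
As a rough picture of what such a tree looks like, the sketch below builds the AST for `"lorem ipsum" AND ("dolor" or "accumsan")` by hand and prints it with one indentation level per depth. The `node` type and its `prettyFormat` method are hypothetical and only illustrate the idea. The library's `Expression` type and the exact layout of its `PrettyFormat` output may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a hypothetical AST node, not the library's Expression type.
type node struct {
	op       string // "AND", "OR", "NOT", or "" for a leaf
	term     string // set only on leaves
	children []*node
}

// prettyFormat renders the node on its own line, indented by depth,
// followed by its children one level deeper.
func (n *node) prettyFormat(depth int) string {
	label := n.op
	if label == "" {
		label = n.term
	}
	var b strings.Builder
	b.WriteString(strings.Repeat("    ", depth) + label + "\n")
	for _, c := range n.children {
		b.WriteString(c.prettyFormat(depth + 1))
	}
	return b.String()
}

// exampleTree is the AST of `"lorem ipsum" AND ("dolor" or "accumsan")`.
func exampleTree() *node {
	return &node{op: "AND", children: []*node{
		{term: "lorem ipsum"},
		{op: "OR", children: []*node{{term: "dolor"}, {term: "accumsan"}}},
	}}
}

func main() {
	fmt.Print(exampleTree().prettyFormat(0))
}
```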

There are two ways to solve the expression.

Recursively:
```go
matches := map[string]dsl.PatternResult{
	"lorem ipsum": dsl.PatternResult{
		Val:            true,
		SortedMatchPos: []int{1, 3, 5},
	},
}

responseRecursive, err := expression.Solve(matches, false)
if err != nil {
	log.Fatal(err)
}
fmt.Println("recursive eval ", responseRecursive)
```
Iteratively:
```go
solverArr := expression.CreateSolverOrder()
responseIter, err := solverArr.Solve(matches, false)
if err != nil {
	log.Fatal(err)
}
fmt.Println("iterative eval ", responseIter)
```
The iterative solution needs to create an array with the order in which the expressions need to be solved.
It is faster than the recursive one if you need to solve the expression more than 8 times (the performance gain is around 13% according to the benchmark results).
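
One way such a solving order can work is a post-order flattening of the tree, evaluated with an explicit stack. The sketch below is an assumption about the general technique, not the library's `CreateSolverOrder` implementation. The `node` type and the functions `solveRecursive`, `solverOrder` and `solveIterative` are hypothetical names.

```go
package main

import "fmt"

// node is a hypothetical AST node, not the library's Expression type.
type node struct {
	op          string // "AND", "OR", "NOT", or "" for a leaf
	term        string // set only on leaves
	left, right *node  // NOT uses only left
}

// solveRecursive walks the tree on every call.
func solveRecursive(n *node, found map[string]bool) bool {
	switch n.op {
	case "AND":
		return solveRecursive(n.left, found) && solveRecursive(n.right, found)
	case "OR":
		return solveRecursive(n.left, found) || solveRecursive(n.right, found)
	case "NOT":
		return !solveRecursive(n.left, found)
	default:
		return found[n.term]
	}
}

// solverOrder flattens the tree once, children before parents (post-order).
func solverOrder(n *node) []*node {
	if n == nil {
		return nil
	}
	order := append(solverOrder(n.left), solverOrder(n.right)...)
	return append(order, n)
}

// solveIterative evaluates the flattened order with an explicit stack.
func solveIterative(order []*node, found map[string]bool) bool {
	stack := make([]bool, 0, len(order))
	for _, n := range order {
		switch n.op {
		case "AND", "OR":
			r, l := stack[len(stack)-1], stack[len(stack)-2]
			stack = stack[:len(stack)-2]
			if n.op == "AND" {
				stack = append(stack, l && r)
			} else {
				stack = append(stack, l || r)
			}
		case "NOT":
			stack[len(stack)-1] = !stack[len(stack)-1]
		default:
			stack = append(stack, found[n.term])
		}
	}
	return stack[0]
}

// exampleTree is the AST of `"lorem ipsum" AND ("dolor" OR "accumsan")`.
func exampleTree() *node {
	return &node{
		op:   "AND",
		left: &node{term: "lorem ipsum"},
		right: &node{
			op:    "OR",
			left:  &node{term: "dolor"},
			right: &node{term: "accumsan"},
		},
	}
}

func main() {
	tree := exampleTree()
	order := solverOrder(tree) // built once, reused for every text
	found := map[string]bool{"lorem ipsum": true, "accumsan": true}
	fmt.Println(solveRecursive(tree, found), solveIterative(order, found))
}
```

Building the order has a one-time cost, which is consistent with the iterative solver only paying off after the expression has been solved several times.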

The solvers also need to know whether the map of matches is complete (the second argument of `Solve`). A complete map has every term as a key, even the terms that did not match.
With the incomplete option, a key that is not present is treated as a term that was not found.
If the map is declared complete and a key is missing, an error is returned.

```go
// should return an error
_, err = expression.Solve(matches, true)
if err != nil {
	log.Fatal(err)
}
```
The complete example can be found at `/examples/dsl/main.go`
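
The difference between a complete and an incomplete map of matches can be reduced to a single lookup rule. The helper below is hypothetical (`lookup` is not part of the library) and uses a plain `map[string]bool` instead of `dsl.PatternResult` to keep the sketch short.

```go
package main

import "fmt"

// lookup is a hypothetical helper, not part of the library. It returns the
// value stored for term. If the key is absent, an incomplete map treats the
// term as "not found", while a map declared complete yields an error.
func lookup(matches map[string]bool, term string, completeMap bool) (bool, error) {
	val, ok := matches[term]
	if !ok && completeMap {
		return false, fmt.Errorf("term %q is missing from a map declared complete", term)
	}
	return val, nil
}

func main() {
	matches := map[string]bool{"lorem ipsum": true}

	val, err := lookup(matches, "dolor", false)
	fmt.Println("incomplete map:", val, err)

	val, err = lookup(matches, "dolor", true)
	fmt.Println("complete map:", val, err)
}
```

Declaring the map complete is therefore a consistency check: it catches a caller that forgot to search for one of the terms, at the cost of having to store an entry for every term.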
## Run Locally
This project uses Bazel to build and test the code.
You can run this project using Go as well.

### What is bazel?
"Bazel is an open-source build and test tool similar to Make, Maven, and Gradle. It uses a human-readable, high-level build language. Bazel supports projects in multiple languages and builds outputs for multiple platforms. Bazel supports large codebases across multiple repositories, and large numbers of users."
\- from https://docs.bazel.build/versions/4.2.0/bazel-overview.html.

To install bazel go to https://docs.bazel.build/versions/main/install.html.

With Bazel installed, clone the project.
```bash
git clone https://github.com/pedroegsilva/gofindthem.git
```

Go to the project directory.

```bash
cd gofindthem
```

You can run the examples with the following commands.

```bash
bazel run //examples/finder:finder
bazel run //examples/dsl:dsl
```
To run all the tests, use the following.

```bash
bazel test //...
```

To run all the benchmarks, use the following.

```bash
bazel run //benchmarks:benchmarks_test -- -test.bench=. -test.benchmem
```

3 changes: 0 additions & 3 deletions benchmarks/benchmark_test.go
@@ -18,9 +18,6 @@ import (
 	pdahocorasick "github.com/petar-dambovaliev/aho-corasick"
 )

-// bazel run //dsl:dsl_test -- -test.bench=. -test.benchmem
-// bazel run //dsl:dsl_test -- -test.bench=Exps -test.benchmem
-
 func init() {
 	rand.Seed(1629074756677820700)
 	wordsPath, err := filepath.Abs(EN_WORDS_FILE)
11 changes: 6 additions & 5 deletions dsl/expression.go
@@ -51,7 +51,7 @@ type PatternResult struct {
 	SortedMatchPos []int
 }

-// getTypeName returns the type of the expression with a readable name
+// GetTypeName returns the type of the expression with a readable name
 func (exp *Expression) GetTypeName() string {
 	return exp.Type.GetName()
 }
@@ -123,10 +123,11 @@ func (exp *Expression) Solve(

 // PrettyPrint returns the expression formated on a tabbed structure
 // Eg: for the expression ("a" and "b") or "c"
-// OR
-// AND
-// a
-// b
+// OR
+// AND
+// a
+// b
+// c
 func (exp *Expression) PrettyFormat() string {
 	return exp.prettyFormat(0)
 }
15 changes: 15 additions & 0 deletions examples/dsl/BUILD.bazel
@@ -0,0 +1,15 @@
load("@io_bazel_rules_go//go:def.bzl", "go_binary", "go_library")

go_library(
    name = "dsl_lib",
    srcs = ["main.go"],
    importpath = "github.com/pedroegsilva/gofindthem/examples/dsl",
    visibility = ["//visibility:private"],
    deps = ["//dsl"],
)

go_binary(
    name = "dsl",
    embed = [":dsl_lib"],
    visibility = ["//visibility:public"],
)
50 changes: 50 additions & 0 deletions examples/dsl/main.go
@@ -0,0 +1,50 @@
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/pedroegsilva/gofindthem/dsl"
)

func main() {
	caseSensitive := false
	p := dsl.NewParser(strings.NewReader(`"lorem ipsum" AND ("dolor" or "accumsan")`), caseSensitive)
	expression, err := p.Parse()
	if err != nil {
		log.Fatal(err)
	}

	keywords := p.GetKeywords()
	fmt.Printf("keywords:\n%v\n", keywords)

	fmt.Printf("pretty format:\n%s\n", expression.PrettyFormat())

	matches := map[string]dsl.PatternResult{
		"lorem ipsum": dsl.PatternResult{
			Val:            true,
			SortedMatchPos: []int{1, 3, 5},
		},
	}

	responseRecursive, err := expression.Solve(matches, false)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("recursive eval ", responseRecursive)

	solverArr := expression.CreateSolverOrder()
	responseIter, err := solverArr.Solve(matches, false)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("iterative eval ", responseIter)

	// should return an error
	_, err = expression.Solve(matches, true)
	if err != nil {

		log.Fatal(err)
	}
}
8 changes: 4 additions & 4 deletions examples/BUILD.bazel → examples/finder/BUILD.bazel
@@ -1,15 +1,15 @@
 load("@io_bazel_rules_go//go:def.bzl", "go_binary", "go_library")

 go_library(
-    name = "examples_lib",
+    name = "finder_lib",
     srcs = ["main.go"],
-    importpath = "github.com/pedroegsilva/gofindthem/examples",
+    importpath = "github.com/pedroegsilva/gofindthem/examples/finder",
     visibility = ["//visibility:private"],
     deps = ["//finder"],
 )

 go_binary(
-    name = "examples",
-    embed = [":examples_lib"],
+    name = "finder",
+    embed = [":finder_lib"],
     visibility = ["//visibility:public"],
 )
69 changes: 69 additions & 0 deletions examples/finder/main.go
@@ -0,0 +1,69 @@
package main

import (
	"fmt"
	"log"

	"github.com/pedroegsilva/gofindthem/finder"
)

func main() {
	texts := []string{
		`Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Curabitur porta lobortis nulla volutpat sagittis.
Nulla ac sapien sodales, pulvinar elit ut, lobortis purus.
Suspendisse id luctus quam.`,
		`Lorem ipsum Nullam non purus eu leo accumsan cursus a quis erat.
Etiam dictum enim eu commodo semper.
Mauris feugiat vitae eros et facilisis.
Donec facilisis mattis dignissim.`,
	}

	subEng := &finder.PetarDambovalievEngine{}
	caseSensitive := true
	findthem := finder.NewFinder(subEng, caseSensitive)

	if err := findthem.AddExpression(`"Lorem" and "ipsum"`); err != nil {
		log.Fatal(err)
	}

	if err := findthem.AddExpression(`("Nullam" and not "volutpat")`); err != nil {
		log.Fatal(err)
	}

	if err := findthem.AddExpression(`"lorem ipsum" AND ("dolor" or "accumsan")`); err != nil {
		log.Fatal(err)
	}

	if err := findthem.AddExpression(`"purus.\nSuspendisse"`); err != nil {
		log.Fatal(err)
	}

	for i, text := range texts {
		resp, err := findthem.ProcessText(text)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("----------Text %d case sensitive-----------\n", i)
		for exp, val := range resp {
			fmt.Printf("exp: %s | %v\n", exp, val)
		}
	}

	findthem2 := finder.NewFinder(subEng, !caseSensitive)

	if err := findthem2.AddExpression(`"lorem ipsum" AND ("dolor" or "accumsan")`); err != nil {
		log.Fatal(err)
	}

	for i, text := range texts {
		resp, err := findthem2.ProcessText(text)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("----------Text %d case insensitive-----------\n", i)
		for exp, val := range resp {
			fmt.Printf("exp: %s | %v\n", exp, val)
		}
	}
}