fix bugs if logs has massive kinds of clusters
ethe committed Oct 5, 2023
1 parent 22f6e0f commit 5d40ac1
Showing 12 changed files with 851 additions and 679 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/python-package.yml
@@ -37,4 +37,4 @@ jobs:
     - name: Test Installation
       run: |
         pip install -e '.'
-        python -m log2array --help
+        python -m log2row --help
20 changes: 10 additions & 10 deletions README.md
@@ -1,5 +1,5 @@
-# Log2Array
-A command-line tool that automates the detection and extraction of log templates, leveraging regular expressions generated by GPT-4. Say goodbye to the stress of manually writing regex.
+# Log2Row
+A command-line tool that automates the detection and extraction of log templates, leveraging regular expressions generated by GPT-4. Escape from the stress of manually writing regex.

## What it does?
If you have several kinds of logs are mixed together:
@@ -33,10 +33,10 @@ I, [2023-09-13T00:00:05.009497 #6] INFO -- : [28aa673a-e6e0-4016-83af-b39619006
I, [2023-09-13T00:00:05.010711 #6] INFO -- : [28aa673a-e6e0-4016-83af-b39619006bc1] Rendered home/index.html.erb within layouts/application (Duration: 0.1ms | Allocations: 16)
```

-Log2Array could detect and extract several log templates, it would take a while, all extracted log would be stored into an embedded in-memory DB [DuckDB](http://duckdb.org/docs/archive/0.9.0/), and opens an IPython REPL:
+Log2Row could detect and extract several log templates, it would take a while, all extracted log would be stored into an embedded in-memory DB [DuckDB](http://duckdb.org/docs/archive/0.9.0/), and opens an IPython REPL:

```ipython
-↳ OPENAI_API_KEY="***" python3 log2array run ~/Downloads/hn-*.log --gpt-base https://api.openai-hk.com/v1 --max-lines 1024 --buf-size 1MB --threshold 0.8
+↳ OPENAI_API_KEY="***" python3 log2row run ~/Downloads/hn-*.log --gpt-base https://api.openai-hk.com/v1 --max-lines 1024 --buf-size 1MB --threshold 0.8
Python 3.11.5 (main, Aug 24 2023, 15:09:45) [Clang 14.0.3 (clang-1403.0.22.14.1)]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.16.0 -- An enhanced Interactive Python. Type '?' for help.
@@ -121,16 +121,16 @@ DuckDB also supports saving results to various output types such as CSV, JSON, a
+ + > pattern flow
```

-Log2Array processes all logs through a list of regex patterns. If a log matches a pattern successfully, it's grouped and variables are inserted into DuckDB. If a log doesn't match any patterns, it's buffered. These buffered logs are used to detect log communities via a text embedding model. Samples from each community are then sent to GPT-4 to extract their regex patterns.
+Log2Row processes all logs through a list of regex patterns. If a log matches a pattern successfully, it's grouped and variables are inserted into DuckDB. If a log doesn't match any patterns, it's buffered. These buffered logs are used to detect log communities via a text embedding model. Samples from each community are then sent to GPT-4 to extract their regex patterns.
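
Review note: the main flow described in this README paragraph (try each known regex, store the captured variables on a match, buffer the line otherwise) can be sketched roughly as below. This is an illustration only, not the project's code: the `PATTERNS` list, the `route` function, and the in-memory `tables` dict standing in for DuckDB are all hypothetical names chosen for the example.

```python
import re
from collections import defaultdict

# Hypothetical pattern for the Ruby-logger style lines shown in the README.
# In Log2Row these patterns would be produced by GPT-4, not hard-coded.
PATTERNS = [
    re.compile(
        r"^(?P<level>[A-Z]), \[(?P<ts>\S+) #(?P<pid>\d+)\]\s+"
        r"(?P<severity>[A-Z]+) -- : (?P<message>.*)$"
    ),
]


def route(lines, patterns):
    """Split lines into per-pattern rows and a buffer of unmatched lines."""
    tables = defaultdict(list)  # pattern index -> extracted rows (stand-in for DuckDB)
    buffer = []                 # unmatched lines, later used for community detection
    for line in lines:
        for idx, pattern in enumerate(patterns):
            match = pattern.match(line)
            if match:
                tables[idx].append(match.groupdict())
                break
        else:
            # No pattern matched: keep the line for the pattern flow.
            buffer.append(line)
    return tables, buffer
```

Because matching only consults the existing pattern list, lines that already have a pattern never wait on the embedding model or GPT-4.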

The pattern flow isn't part of the main processing, which means that after an initial bootstrap, the processing speed increases significantly. Thus, the longer Log2Array runs, the higher the logs/sec rate it has.
The pattern flow isn't part of the main processing, which means that after an initial bootstrap, the processing speed increases significantly. Thus, the longer Log2Row runs, the higher the logs/sec rate it has.
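
Review note: the pattern flow starts by grouping buffered logs into communities from their text embeddings. The clustering code itself is not visible in this diff, so the following is only one plausible reading: a greedy grouping by cosine similarity, with the assumption that `--threshold 0.8` acts as a similarity cutoff. The `cosine` and `communities` functions and the toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def communities(embeddings, threshold=0.8):
    """Greedy grouping: each vector joins the first community whose seed is similar enough."""
    groups = []  # each group is a list of indices; group[0] is the seed
    for i, vec in enumerate(embeddings):
        for group in groups:
            if cosine(embeddings[group[0]], vec) >= threshold:
                group.append(i)
                break
        else:
            # Not close to any existing seed: start a new community.
            groups.append([i])
    return groups
```

A few sample lines from each resulting group would then be sent to GPT-4, so the number of API calls grows with the number of communities rather than with the number of log lines.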

## How to install it?
```
-↳pip install git@https://github.com/ethe/log2array.git
+↳pip install git@https://github.com/ethe/log2row.git
-↳ python3 -m log2array
-Usage: python -m log2array [OPTIONS] COMMAND [ARGS]...
+↳ python3 -m log2row
+Usage: python -m log2row [OPTIONS] COMMAND [ARGS]...
Options:
--help Show this message and exit.
@@ -143,7 +143,7 @@ Commands:
*Logs2Array requires Python3.9+*

## How much should I pay?
-Log2Array uses GPT-4 to extract the regex of log community, each extraction would costs hundreds to thousands tokens of GPT-4. This means each log community detection would costs 0.01$ to 0.1$.
+Log2Row uses GPT-4 to extract the regex of log community, each extraction would costs hundreds to thousands tokens of GPT-4. This means each log community detection would costs 0.01$ to 0.1$.
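
Review note: the quoted cost range is consistent with simple token arithmetic. The sketch below assumes the GPT-4 (8K context) list prices from around the time of this commit, $0.03 per 1,000 prompt tokens and $0.06 per 1,000 completion tokens. The token counts are made-up examples, not measurements from the tool.

```python
# Assumed GPT-4 (8K context) list prices, USD per 1,000 tokens, late 2023.
PROMPT_USD_PER_1K = 0.03
COMPLETION_USD_PER_1K = 0.06


def extraction_cost(prompt_tokens, completion_tokens):
    """Estimated USD cost of one regex-extraction request."""
    return (
        prompt_tokens * PROMPT_USD_PER_1K
        + completion_tokens * COMPLETION_USD_PER_1K
    ) / 1000


# Hypothetical small and large requests.
small = extraction_cost(300, 50)     # 0.009 + 0.003 = 0.012 USD
large = extraction_cost(2500, 300)   # 0.075 + 0.018 = 0.093 USD
```

Under these assumptions a request of a few hundred tokens costs about one cent and one of a few thousand tokens about ten cents, which matches the range stated in the README.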

## What is next?
- [ ] auto-detect multiple parts of log templates
102 changes: 0 additions & 102 deletions log2array/__init__.py

This file was deleted.

142 changes: 0 additions & 142 deletions log2array/cluster.py

This file was deleted.
