A command-line tool that automates the detection and extraction of log templates.

## What it does?
If you have several kinds of logs mixed together:

```text
↳ head -n 25 logs-2023-09-13.log
2023-09-13T00:00:00.128Z pid=6 tid=1xlq INFO: queueing SendDailyWorker (send_wechat_daily)
2023-09-13T00:00:00.129Z pid=6 tid=5i1f2 INFO: queueing ViewsCountWorker (update_views_count)
2023-09-13T00:00:00.130Z pid=6 tid=1xkm class=SendDailyWorker jid=61aac7d3fd0653ce9d0ff7f9 INFO: start
I, [2023-09-13T00:00:05.010711 #6] INFO -- : [28aa673a-e6e0-4016-83af-b39619006bc1] Rendered home/index.html.erb within layouts/application (Duration: 0.1ms | Allocations: 16)
```

Then you can use Log2Array to detect and extract log templates. This may take a while; every extracted log is stored in [DuckDB](http://duckdb.org/docs/archive/0.9.0/), and an embedded IPython REPL opens:

```ipython
↳ OPENAI_API_KEY="***" python3 log2array run ~/Downloads/hn-*.log --gpt-base https://api.openai-hk.com/v1 --max-lines 1024 --buf-size 1MB --threshold 0.8
Python 3.11.5 (main, Aug 24 2023, 15:09:45) [Clang 14.0.3 (clang-1403.0.22.14.1)]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.16.0 -- An enhanced Interactive Python. Type '?' for help.
```

DuckDB also lets you save results to several kinds of output: CSV / JSON / Parquet / ...

```text
└─┬─────────┘
┌─▼───────────┐ ┌──────────────────────┐
│ Regex Sieve │─────► Text Embedding Model │
└─┬─────────▲─┘ └──────────────────────┘
│ + +
│ + +
┌─▼──────────────────────────┐
│ CSV / JSON / Parquet / ... │
└────────────────────────────┘
────> log flow
+ + > pattern flow
```
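The CSV / JSON / Parquet output at the bottom of the diagram uses DuckDB's standard `COPY` statement. A sketch, assuming the extracted rows live in a table named `logs` (a hypothetical name for illustration):

```sql
-- Export the extracted variables; `logs` is an assumed table name.
COPY logs TO 'logs.parquet' (FORMAT PARQUET);
COPY logs TO 'logs.csv' (HEADER, DELIMITER ',');
```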

Log2Array runs every log line through a list of regex patterns. If a line matches a pattern, it is grouped under that pattern and its variables are inserted into DuckDB. If it matches no pattern, it is buffered. Buffered lines are used to detect log communities via a text embedding model, and samples from each community are then sent to GPT-4 to extract their regex patterns.
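As a rough sketch of that match-or-buffer loop (the helper names and buffer size here are hypothetical; Log2Array's actual internals differ):

```python
import re

# Regexes extracted so far, seeded with one Sidekiq-style pattern for illustration.
patterns = [
    re.compile(r'(\S+) pid=(\d+) tid=(\w+) INFO: queueing (\w+) \((\w+)\)'),
]
buffer = []  # lines that matched no known pattern yet


def process(line, on_match, on_buffer_full, max_buffer=3):
    """Route one log line: extract variables on a hit, buffer it on a miss."""
    for pattern in patterns:
        match = pattern.match(line)
        if match:
            # Matched lines are grouped by pattern; in Log2Array the
            # captured variables are inserted into DuckDB.
            on_match(pattern.pattern, match.groups())
            return
    buffer.append(line)
    if len(buffer) >= max_buffer:
        # Buffered lines are embedded, clustered into communities, and
        # samples from each community go to GPT-4 for new regexes.
        on_buffer_full(list(buffer))
        buffer.clear()
```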

The pattern flow sits off the main processing path, which means that after an initial bootstrap the processing speed increases significantly: the longer Log2Array runs, the higher its logs/sec throughput.

## How to install it?
```text
↳ pip install git+https://github.com/ethe/log2array.git
↳ python3 -m log2array
Usage: python -m log2array [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  clean
  run
```

*Log2Array requires Python 3.9+*

## How much do I pay?
Log2Array uses GPT-4 to extract the regex for each log community. Each extraction consumes hundreds to thousands of GPT-4 tokens, so each log community detection costs roughly $0.01 to $0.10.
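A back-of-the-envelope check of that range, assuming GPT-4's 2023 list pricing of $0.03 per 1K prompt tokens and $0.06 per 1K completion tokens (verify current rates before relying on this):

```python
# Assumed 2023 GPT-4 list prices, in dollars per token.
PROMPT_RATE = 0.03 / 1000
COMPLETION_RATE = 0.06 / 1000


def extraction_cost(prompt_tokens, completion_tokens):
    """Dollar cost of one regex-extraction call at the assumed rates."""
    return prompt_tokens * PROMPT_RATE + completion_tokens * COMPLETION_RATE


# A few hundred prompt tokens plus a short regex reply lands near the
# low end; a long prompt with many samples lands near the high end.
low = extraction_cost(300, 50)     # about $0.012
high = extraction_cost(2000, 500)  # about $0.09
```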

## What's next?
- [ ] auto-detect multiple parts of log templates
```text
I, [2023-09-13T00:00:04.832375 #6] INFO -- : [a8e9e534-bf8b-42df-8ade-dd3f84af5bf0] Started GET "/" for 172.31.6.222 at 2023-09-13 00:00:04 +0000
▲ ▲ ▲ ▲
└──────────────────────────────────────────────────────────────────────────────────┘ └───────────────────────────────────────────────────────────┘
│^(I), \[([\d\-T:.]+) #(\d+)\]\s+(INFO) -- : \[([a-f0-9\-]+)\]$ │^(Started) (GET) ("\/") for (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) at ([\d\- :+]+)$
```
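The two sub-templates in the roadmap example above can be applied together with Python's `re` module. Joining them with a space is an illustration of the goal, not Log2Array's API:

```python
import re

# The two sub-template regexes from the example above.
HEADER = r'(I), \[([\d\-T:.]+) #(\d+)\]\s+(INFO) -- : \[([a-f0-9\-]+)\]'
MESSAGE = r'(Started) (GET) ("\/") for (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) at ([\d\- :+]+)'

line = ('I, [2023-09-13T00:00:04.832375 #6] INFO -- : '
        '[a8e9e534-bf8b-42df-8ade-dd3f84af5bf0] Started GET "/" for '
        '172.31.6.222 at 2023-09-13 00:00:04 +0000')

# Matching both parts at once extracts every variable in one pass.
match = re.fullmatch(HEADER + ' ' + MESSAGE, line)
variables = match.groups()
```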
- [ ] more steps by GPT
- [ ] type columns of each pattern
- [ ] parse datetime
- [ ] not just a command-line tool, but a Filebeat-like sidecar component that consumes logs as a stream
- [ ] GPT-3.5 compatibility
