# Log2Array

A command-line tool that automates the detection and extraction of log templates

## What does it do?
If you have several kinds of logs mixed together:

```text
↳ head -n 25 logs-2023-09-13.log
2023-09-13T00:00:00.128Z pid=6 tid=1xlq INFO: queueing SendDailyWorker (send_wechat_daily)
I, [2023-09-13T00:00:05.009497 #6] INFO -- : [28aa673a-e6e0-4016-83af-b39619006bc1]
I, [2023-09-13T00:00:05.010711 #6] INFO -- : [28aa673a-e6e0-4016-83af-b39619006bc1] Rendered home/index.html.erb within layouts/application (Duration: 0.1ms | Allocations: 16)
```

Then Log2Array can detect and extract the log templates. It takes a while; all extracted logs are stored in [DuckDB](http://duckdb.org/docs/archive/0.9.0/), and an embedded IPython REPL opens:

```ipython
↳ OPENAI_API_KEY="***" python3 log2array run ~/Downloads/hn-*.log --gpt-base https://api.openai-hk.com/v1 --max-lines 1024 --buf-size 1MB --threshold 0.8
```

DuckDB also supports saving your results to several kinds of output: CSV / JSON / Parquet / ...

## How does it work?

```text
┌───────────┐
│ Raw Logs  │
└─┬─────────┘
┌─▼───────────┐      ┌──────────────────────┐
│ Regex Sieve │─────►│ Text Embedding Model │
└─┬─────────▲─┘      └──────────┬───────────┘
  │         +                   │
  │         +        ┌───────┐  │
  │         + + + + +│ GPT-4 │◄─┘
  │                  └───────┘
┌─▼──────┐
│ DuckDB │
└─┬──────┘
┌─▼──────────────────────────┐
│ CSV / JSON / Parquet / ... │
└────────────────────────────┘

────> log flow
+ + > pattern flow
```

Log2Array runs every log line through a list of regex patterns. If a line matches a pattern, it is grouped under that template and its variables are inserted into DuckDB. If it matches no pattern, it is buffered. The buffered lines are clustered into log communities via a text embedding model, and samples from each community are sent to GPT-4 to extract that community's regex pattern.
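
A minimal sketch of that sieve step in Python (the `patterns` list and `insert_row` callback are hypothetical stand-ins, not Log2Array's real API):

```python
import re

def sieve(lines, patterns, insert_row):
    """Match each line against known templates; return the lines that matched none."""
    buffered = []
    for line in lines:
        for pattern in patterns:
            match = re.match(pattern, line)
            if match:
                # Matched a known template: its captured variables go to DuckDB.
                insert_row(pattern, match.groups())
                break
        else:
            # No template matched: buffer the line for community detection.
            buffered.append(line)
    return buffered
```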

The pattern flow sits outside the main processing path, so after an initial bootstrap the processing speed increases significantly: the longer Log2Array runs, the higher its logs/sec rate.
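
The community-detection part of that pattern flow can be pictured as greedy grouping by embedding cosine similarity, gated by the CLI's `--threshold` value. This is only a sketch under assumed names (`embed` stands in for whichever text embedding model is configured), not Log2Array's actual implementation:

```python
import numpy as np

def detect_communities(buffered_lines, embed, threshold=0.8):
    """Greedily group lines whose embeddings are cosine-similar above `threshold`."""
    communities = []  # list of (centroid_vector, member_lines)
    for line in buffered_lines:
        vec = embed(line)                # -> 1-D numpy array
        vec = vec / np.linalg.norm(vec)  # normalize so a dot product is cosine similarity
        for centroid, members in communities:
            if float(np.dot(centroid, vec)) >= threshold:
                members.append(line)
                break
        else:
            communities.append((vec, [line]))
    # Samples from each community would then be sent to GPT-4
    # to propose a regex template for that community.
    return [members for _, members in communities]
```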

## How to install it?
`pip install git+https://github.com/ethe/log2array.git`

*Log2Array requires Python 3.9+*

## What's next?
- [ ] auto-detect multiple parts of log templates
```text
I, [2023-09-13T00:00:04.832375 #6] INFO -- : [a8e9e534-bf8b-42df-8ade-dd3f84af5bf0] Started GET "/" for 172.31.6.222 at 2023-09-13 00:00:04 +0000
▲ ▲ ▲ ▲
└──────────────────────────────────────────────────────────────────────────────────┘ └───────────────────────────────────────────────────────────┘
│^(I), \[([\d\-T:.]+) #(\d+)\]\s+(INFO) -- : \[([a-f0-9\-]+)\]$ │^(Started) (GET) ("\/") for (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) at ([\d\- :+]+)$
```
- [ ] more GPT-powered steps
  - [ ] infer column types for each pattern
  - [ ] parse datetimes
- [ ] not just a command-line tool, but a Filebeat-like sidecar component that consumes logs as a stream
