diff --git a/README.md b/README.md
index cbc64a0..5ec4821 100644
--- a/README.md
+++ b/README.md
@@ -3,6 +3,7 @@ A command-line tool that automates the detection and extraction of log templates
 ## What does it do?
 If you have several kinds of logs mixed together:
+
 ```text
 ↳ head -n 25 logs-2023-09-13.log
 2023-09-13T00:00:00.128Z pid=6 tid=1xlq INFO: queueing SendDailyWorker (send_wechat_daily)
@@ -32,7 +33,7 @@ I, [2023-09-13T00:00:05.009497 #6] INFO -- : [28aa673a-e6e0-4016-83af-b39619006
 I, [2023-09-13T00:00:05.010711 #6] INFO -- : [28aa673a-e6e0-4016-83af-b39619006bc1] Rendered home/index.html.erb within layouts/application (Duration: 0.1ms | Allocations: 16)
 ```

-Then you could use log2array to detect and extract several log templates, it would take a while, all extracted log would be stored into [DuckDB](http://duckdb.org/docs/archive/0.9.0/), and opens a embedded IPython REPL:
+Log2Array can detect and extract the log templates for you. This may take a while; all extracted logs are stored in [DuckDB](http://duckdb.org/docs/archive/0.9.0/), and an embedded IPython REPL opens:

 ```ipython
 ↳ OPENAI_API_KEY="***" python3 log2array run ~/Downloads/hn-*.log --gpt-base https://api.openai-hk.com/v1 --max-lines 1024 --buf-size 1MB --threshold 0.8

@@ -103,7 +104,7 @@ DuckDB also supports you to save results to kinds of output: CSV / JSON / JSON /
 └─┬─────────┘ │
 ┌─▼───────────┐ ┌──────────────────────┐
-│ Regex Sieve │ + + ► Text Embedding Model │
+│ Regex Sieve │─────► Text Embedding Model │
 └─┬─────────▲─┘ └──────────────────────┘
 │ + +
 │ + +
@@ -114,7 +115,30 @@ DuckDB also supports you to save results to kinds of output: CSV / JSON / JSON /
 ┌─▼──────────────────────────┐
 │ CSV / JSON / Parquet / ... │
 └────────────────────────────┘
+
+ ────> log flow
+ + + > pattern flow
 ```
+Log2Array processes all logs through a list of regex patterns. If a log matches a pattern, it is grouped under that pattern and its variables are inserted into DuckDB.
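The regex-sieve step just described can be sketched roughly as follows. This is a hypothetical illustration, not Log2Array's actual code: a plain dict stands in for DuckDB, and the single pattern is made up for the sample log line above.

```python
import re

# Hypothetical sketch of the regex sieve: each log line is tried against the
# known patterns; matches are grouped by pattern and their captured variables
# collected (Log2Array would insert them into DuckDB), while unmatched lines
# are buffered for the pattern flow.
PATTERNS = [
    re.compile(r"^(?P<level>[A-Z]+): queueing (?P<worker>\w+) \((?P<job>\w+)\)$"),
]

def sieve(lines, patterns=PATTERNS):
    matched = {}   # pattern -> list of extracted variable dicts
    buffered = []  # logs that no known pattern explains yet
    for line in lines:
        for pattern in patterns:
            m = pattern.match(line)
            if m:
                matched.setdefault(pattern.pattern, []).append(m.groupdict())
                break
        else:
            buffered.append(line)
    return matched, buffered

matched, buffered = sieve([
    "INFO: queueing SendDailyWorker (send_wechat_daily)",
    "a log that fits no known pattern",
])
```

Here the first line lands in `matched` with `level`, `worker`, and `job` captured, while the second line goes to `buffered`.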
If a log doesn't match any patterns, it's buffered. These buffered logs are used to detect log communities via a text embedding model. Samples from each community are then sent to GPT-4 to extract their regex patterns.
+
+The pattern flow sits outside the main processing path, so after an initial bootstrap the processing speed increases significantly: the longer Log2Array runs, the higher its logs/sec rate.
+
 ## How to install it?
 `pip install git+https://github.com/ethe/log2array.git`
+
+*Log2Array requires Python 3.9+.*
+
+## What's next?
+- [ ] auto-detect multiple parts of log templates
+  ```
+  I, [2023-09-13T00:00:04.832375 #6] INFO -- : [a8e9e534-bf8b-42df-8ade-dd3f84af5bf0] Started GET "/" for 172.31.6.222 at 2023-09-13 00:00:04 +0000
+  ▲ ▲ ▲ ▲
+  └──────────────────────────────────────────────────────────────────────────────────┘ └───────────────────────────────────────────────────────────┘
+  │^(I), \[([\d\-T:.]+) #(\d+)\]\s+(INFO) -- : \[([a-f0-9\-]+)\]$ │^(Started) (GET) ("\/") for (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) at ([\d\- :+]+)$
+  ```
+- [ ] more processing steps by GPT
+  - [ ] infer the column types of each pattern
+  - [ ] parse datetimes
+- [ ] not just a command-line tool, but a Filebeat-like sidecar component that consumes logs as a stream
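The community-detection step of the pattern flow described above can also be sketched. This is a toy illustration, not Log2Array's implementation: a bag-of-characters count vector stands in for a real text embedding model, and buffered logs are greedily clustered by cosine similarity (the `--threshold` CLI option plays a comparable role). One sample per community would then go to GPT-4 for regex extraction.

```python
import math

def embed(text):
    # Stand-in "embedding": character frequency vector. A real deployment
    # would call a text embedding model here.
    vec = {}
    for ch in text:
        vec[ch] = vec.get(ch, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(count * b.get(ch, 0) for ch, count in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def communities(lines, threshold=0.8):
    # Greedy clustering: join the first community whose representative is
    # similar enough, otherwise start a new one.
    groups = []  # (representative embedding, member lines)
    for line in lines:
        vec = embed(line)
        for rep, members in groups:
            if cosine(rep, vec) >= threshold:
                members.append(line)
                break
        else:
            groups.append((vec, [line]))
    return [members for _, members in groups]
```

For example, `communities(["GET /a 200", "GET /b 200", "user signed in"])` puts the two `GET` lines in one community and the third line in its own.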