# Log2Array

A command-line tool that automates the detection and extraction of log templates

## What does it do?
If you have several kinds of logs mixed together:

```text
↳ head -n 25 logs-2023-09-13.log
2023-09-13T00:00:00.128Z pid=6 tid=1xlq INFO: queueing SendDailyWorker (send_wechat_daily)
I, [2023-09-13T00:00:05.009497 #6] INFO -- : [28aa673a-e6e0-4016-83af-b39619006bc1]
I, [2023-09-13T00:00:05.010711 #6] INFO -- : [28aa673a-e6e0-4016-83af-b39619006bc1] Rendered home/index.html.erb within layouts/application (Duration: 0.1ms | Allocations: 16)
```

Then Log2Array can detect and extract the log templates. It takes a while; all extracted logs are stored in [DuckDB](http://duckdb.org/docs/archive/0.9.0/), and an embedded IPython REPL opens:

```ipython
↳ OPENAI_API_KEY="***" python3 log2array run ~/Downloads/hn-*.log --gpt-base https://api.openai-hk.com/v1 --max-lines 1024 --buf-size 1MB --threshold 0.8
```

DuckDB also supports saving your results to several kinds of output: CSV / JSON / Parquet / ...

## How does it work?

```text
┌───────────┐
│ Raw Logs  │
└─┬─────────┘
┌─▼───────────┐      ┌──────────────────────┐
│ Regex Sieve │─────►│ Text Embedding Model │
└─┬─────────▲─┘      └──────────┬───────────┘
  │         +                   │
  │         +        ┌───────┐  │
  │         + + + + +│ GPT-4 │◄─┘
  │                  └───────┘
┌─▼──────┐
│ DuckDB │
└─┬──────┘
┌─▼──────────────────────────┐
│ CSV / JSON / Parquet / ... │
└────────────────────────────┘

────> log flow
+ + > pattern flow
```

Log2Array runs every log line through a list of regex patterns. If a line matches a pattern, it is grouped under that template and its variables are inserted into DuckDB. If it matches no pattern, it is buffered. The buffered lines are clustered into log communities via a text embedding model, and samples from each community are sent to GPT-4 to extract that community's regex pattern.
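
A minimal sketch of that sieve step in Python (the `patterns` list and `insert_row` callback are hypothetical stand-ins, not Log2Array's real API):

```python
import re

def sieve(lines, patterns, insert_row):
    """Match each line against known templates; return the lines that matched none."""
    buffered = []
    for line in lines:
        for pattern in patterns:
            match = re.match(pattern, line)
            if match:
                # Matched a known template: its captured variables go to DuckDB.
                insert_row(pattern, match.groups())
                break
        else:
            # No template matched: buffer the line for community detection.
            buffered.append(line)
    return buffered
```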

The pattern flow sits outside the main processing path, so after an initial bootstrap the processing speed increases significantly: the longer Log2Array runs, the higher its logs/sec rate.
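
The community-detection part of that pattern flow can be pictured as greedy grouping by embedding cosine similarity, gated by the CLI's `--threshold` value. This is only a sketch under assumed names (`embed` stands in for whichever text embedding model is configured), not Log2Array's actual implementation:

```python
import numpy as np

def detect_communities(buffered_lines, embed, threshold=0.8):
    """Greedily group lines whose embeddings are cosine-similar above `threshold`."""
    communities = []  # list of (centroid_vector, member_lines)
    for line in buffered_lines:
        vec = embed(line)                # -> 1-D numpy array
        vec = vec / np.linalg.norm(vec)  # normalize so a dot product is cosine similarity
        for centroid, members in communities:
            if float(np.dot(centroid, vec)) >= threshold:
                members.append(line)
                break
        else:
            communities.append((vec, [line]))
    # Samples from each community would then be sent to GPT-4
    # to propose a regex template for that community.
    return [members for _, members in communities]
```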

## How to install it?
`pip install git+https://github.com/ethe/log2array.git`

*Log2Array requires Python 3.9+*

## What's next?
- [ ] auto-detect multiple parts of log templates
```text
I, [2023-09-13T00:00:04.832375 #6] INFO -- : [a8e9e534-bf8b-42df-8ade-dd3f84af5bf0] Started GET "/" for 172.31.6.222 at 2023-09-13 00:00:04 +0000
▲ ▲ ▲ ▲
└──────────────────────────────────────────────────────────────────────────────────┘ └───────────────────────────────────────────────────────────┘
│^(I), \[([\d\-T:.]+) #(\d+)\]\s+(INFO) -- : \[([a-f0-9\-]+)\]$ │^(Started) (GET) ("\/") for (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) at ([\d\- :+]+)$
```
- [ ] more GPT-powered steps
  - [ ] infer column types for each pattern
  - [ ] parse datetimes
- [ ] not just a command-line tool, but a Filebeat-like sidecar component that consumes logs as a stream
