A command-line tool that automates the detection and extraction of log templates.

## What it does?
If you have several kinds of logs mixed together:

```text
↳ head -n 25 logs-2023-09-13.log
2023-09-13T00:00:00.128Z pid=6 tid=1xlq INFO: queueing SendDailyWorker (send_wechat_daily)
2023-09-13T00:00:00.129Z pid=6 tid=5i1f2 INFO: queueing ViewsCountWorker (update_views_count)
2023-09-13T00:00:00.130Z pid=6 tid=1xkm class=SendDailyWorker jid=61aac7d3fd0653ce9d0ff7f9 INFO: start
I, [2023-09-13T00:00:05.010711 #6] INFO -- : [28aa673a-e6e0-4016-83af-b39619006bc1] Rendered home/index.html.erb within layouts/application (Duration: 0.1ms | Allocations: 16)
```

Then you can use Log2Array to detect and extract log templates. This may take a while; every extracted log is stored in [DuckDB](http://duckdb.org/docs/archive/0.9.0/), and an embedded IPython REPL opens:

```ipython
↳ OPENAI_API_KEY="***" python3 log2array run ~/Downloads/hn-*.log --gpt-base https://api.openai-hk.com/v1 --max-lines 1024 --buf-size 1MB --threshold 0.8
Python 3.11.5 (main, Aug 24 2023, 15:09:45) [Clang 14.0.3 (clang-1403.0.22.14.1)]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.16.0 -- An enhanced Interactive Python. Type '?' for help.
```

DuckDB also lets you save results to several kinds of output: CSV / JSON / Parquet / ...

```text
└─┬─────────┘
┌─▼───────────┐ ┌──────────────────────┐
│ Regex Sieve │─────► Text Embedding Model │
└─┬─────────▲─┘ └──────────────────────┘
│ + +
│ + +
┌─▼──────────────────────────┐
│ CSV / JSON / Parquet / ... │
└────────────────────────────┘
────> log flow
+ + > pattern flow
```
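The CSV / JSON / Parquet output at the bottom of the diagram uses DuckDB's standard `COPY` statement. A sketch, assuming the extracted rows live in a table named `logs` (a hypothetical name for illustration):

```sql
-- Export the extracted variables; `logs` is an assumed table name.
COPY logs TO 'logs.parquet' (FORMAT PARQUET);
COPY logs TO 'logs.csv' (HEADER, DELIMITER ',');
```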

Log2Array runs every log line through a list of regex patterns. If a line matches a pattern, it is grouped under that pattern and its variables are inserted into DuckDB. If it matches no pattern, it is buffered. Buffered lines are used to detect log communities via a text embedding model, and samples from each community are then sent to GPT-4 to extract their regex patterns.
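As a rough sketch of that match-or-buffer loop (the helper names and buffer size here are hypothetical; Log2Array's actual internals differ):

```python
import re

# Regexes extracted so far, seeded with one Sidekiq-style pattern for illustration.
patterns = [
    re.compile(r'(\S+) pid=(\d+) tid=(\w+) INFO: queueing (\w+) \((\w+)\)'),
]
buffer = []  # lines that matched no known pattern yet


def process(line, on_match, on_buffer_full, max_buffer=3):
    """Route one log line: extract variables on a hit, buffer it on a miss."""
    for pattern in patterns:
        match = pattern.match(line)
        if match:
            # Matched lines are grouped by pattern; in Log2Array the
            # captured variables are inserted into DuckDB.
            on_match(pattern.pattern, match.groups())
            return
    buffer.append(line)
    if len(buffer) >= max_buffer:
        # Buffered lines are embedded, clustered into communities, and
        # samples from each community go to GPT-4 for new regexes.
        on_buffer_full(list(buffer))
        buffer.clear()
```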

The pattern flow sits off the main processing path, which means that after an initial bootstrap the processing speed increases significantly: the longer Log2Array runs, the higher its logs/sec throughput.

## How to install it?
```text
↳ pip install git+https://github.com/ethe/log2array.git
↳ python3 -m log2array
Usage: python -m log2array [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  clean
  run
```

*Log2Array requires Python 3.9+*

## How much do I pay?
Log2Array uses GPT-4 to extract the regex for each log community. Each extraction consumes hundreds to thousands of GPT-4 tokens, so each log community detection costs roughly $0.01 to $0.10.
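A back-of-the-envelope check of that range, assuming GPT-4's 2023 list pricing of $0.03 per 1K prompt tokens and $0.06 per 1K completion tokens (verify current rates before relying on this):

```python
# Assumed 2023 GPT-4 list prices, in dollars per token.
PROMPT_RATE = 0.03 / 1000
COMPLETION_RATE = 0.06 / 1000


def extraction_cost(prompt_tokens, completion_tokens):
    """Dollar cost of one regex-extraction call at the assumed rates."""
    return prompt_tokens * PROMPT_RATE + completion_tokens * COMPLETION_RATE


# A few hundred prompt tokens plus a short regex reply lands near the
# low end; a long prompt with many samples lands near the high end.
low = extraction_cost(300, 50)     # about $0.012
high = extraction_cost(2000, 500)  # about $0.09
```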

## What's next?
- [ ] auto-detect multiple parts of log templates
```text
I, [2023-09-13T00:00:04.832375 #6] INFO -- : [a8e9e534-bf8b-42df-8ade-dd3f84af5bf0] Started GET "/" for 172.31.6.222 at 2023-09-13 00:00:04 +0000
▲ ▲ ▲ ▲
└──────────────────────────────────────────────────────────────────────────────────┘ └───────────────────────────────────────────────────────────┘
│^(I), \[([\d\-T:.]+) #(\d+)\]\s+(INFO) -- : \[([a-f0-9\-]+)\]$ │^(Started) (GET) ("\/") for (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) at ([\d\- :+]+)$
```
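The two sub-templates in the roadmap example above can be applied together with Python's `re` module. Joining them with a space is an illustration of the goal, not Log2Array's API:

```python
import re

# The two sub-template regexes from the example above.
HEADER = r'(I), \[([\d\-T:.]+) #(\d+)\]\s+(INFO) -- : \[([a-f0-9\-]+)\]'
MESSAGE = r'(Started) (GET) ("\/") for (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) at ([\d\- :+]+)'

line = ('I, [2023-09-13T00:00:04.832375 #6] INFO -- : '
        '[a8e9e534-bf8b-42df-8ade-dd3f84af5bf0] Started GET "/" for '
        '172.31.6.222 at 2023-09-13 00:00:04 +0000')

# Matching both parts at once extracts every variable in one pass.
match = re.fullmatch(HEADER + ' ' + MESSAGE, line)
variables = match.groups()
```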
- [ ] more steps by GPT
- [ ] type columns of each pattern
- [ ] parse datetime
- [ ] not just a command-line tool, but a Filebeat-like sidecar component that consumes logs as a stream
- [ ] GPT-3.5 compatibility
