From a8613070bcb873029878a9f9a51ab344a6b6cc93 Mon Sep 17 00:00:00 2001
From: Gwo Tzu-Hsing
Date: Sun, 1 Oct 2023 17:39:09 +0800
Subject: [PATCH] update readme

---
 README.md | 50 +++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 45 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index cbc64a0..791d826 100644
--- a/README.md
+++ b/README.md
@@ -3,8 +3,9 @@ A command-line tool that automates the detection and extraction of log templates
 
 ## What it does?
 If you have several kinds of logs are mixed together:
+
 ```text
- ↳ head -n 25 logs-2023-09-13.log
+ ↳ head -n 25 logs-2023-09-13.log
 2023-09-13T00:00:00.128Z pid=6 tid=1xlq INFO: queueing SendDailyWorker (send_wechat_daily)
 2023-09-13T00:00:00.129Z pid=6 tid=5i1f2 INFO: queueing ViewsCountWorker (update_views_count)
 2023-09-13T00:00:00.130Z pid=6 tid=1xkm class=SendDailyWorker jid=61aac7d3fd0653ce9d0ff7f9 INFO: start
@@ -32,10 +33,10 @@ I, [2023-09-13T00:00:05.009497 #6] INFO -- : [28aa673a-e6e0-4016-83af-b39619006
 I, [2023-09-13T00:00:05.010711 #6] INFO -- : [28aa673a-e6e0-4016-83af-b39619006bc1] Rendered home/index.html.erb within layouts/application (Duration: 0.1ms | Allocations: 16)
 ```
 
-Then you could use log2array to detect and extract several log templates, it would take a while, all extracted log would be stored into [DuckDB](http://duckdb.org/docs/archive/0.9.0/), and opens a embedded IPython REPL:
+Log2Array can detect and extract the log templates. This takes a while; all extracted logs are stored in [DuckDB](http://duckdb.org/docs/archive/0.9.0/), and an embedded IPython REPL opens:
 
 ```ipython
- ↳ OPENAI_API_KEY="***" python3 log2array run ~/Downloads/hn-*.log --gpt-base https://api.openai-hk.com/v1 --max-lines 1024 --buf-size 1MB --threshold 0.8
+ ↳ OPENAI_API_KEY="***" python3 log2array run ~/Downloads/hn-*.log --gpt-base https://api.openai-hk.com/v1 --max-lines 1024 --buf-size 1MB --threshold 0.8
 Python 3.11.5 (main, Aug 24 2023, 15:09:45) [Clang 14.0.3 (clang-1403.0.22.14.1)]
 Type 'copyright', 'credits' or 'license' for more information
 IPython 8.16.0 -- An enhanced Interactive Python. Type '?' for help.
@@ -103,7 +104,7 @@ DuckDB also supports you to save results to kinds of output: CSV / JSON / JSON /
 └─┬─────────┘
   │
 ┌─▼───────────┐     ┌──────────────────────┐
-│ Regex Sieve │ + + ► Text Embedding Model │
+│ Regex Sieve │─────► Text Embedding Model │
 └─┬─────────▲─┘     └──────────────────────┘
   │         + +
   │         + +
@@ -114,7 +115,46 @@ DuckDB also supports you to save results to kinds of output: CSV / JSON / JSON /
 ┌─▼──────────────────────────┐
 │ CSV / JSON / Parquet / ... │
 └────────────────────────────┘
+
+
+────> log flow
++ + > pattern flow
 ```
 
+Log2Array processes all logs through a list of regex patterns. If a log matches a pattern, it is grouped under that pattern and its variables are inserted into DuckDB. If a log doesn't match any pattern, it is buffered. The buffered logs are used to detect log communities via a text embedding model. Samples from each community are then sent to GPT-4 to extract their regex patterns.
+
+The pattern flow isn't part of the main processing path: once enough patterns have been extracted, most logs match in the regex sieve directly. After the initial bootstrap the processing speed therefore increases significantly, and the longer Log2Array runs, the higher its logs/sec rate.
+
 ## How to install it?
-`pip install git@https://github.com/ethe/log2array.git`
+```
+ ↳ pip install git+https://github.com/ethe/log2array.git
+
+ ↳ python3 -m log2array
+Usage: python -m log2array [OPTIONS] COMMAND [ARGS]...
+
+Options:
+  --help  Show this message and exit.
+
+Commands:
+  clean
+  run
+```
+
+*Log2Array requires Python 3.9+*
+
+## How much do I pay?
+Log2Array uses GPT-4 to extract the regex of each log community. Each extraction costs hundreds to thousands of GPT-4 tokens, which means each log community detection costs roughly $0.01 to $0.10.
+
+## What is next?
+- [ ] auto-detect multiple parts of log templates
+  ```
+  I, [2023-09-13T00:00:04.832375 #6]  INFO -- : [a8e9e534-bf8b-42df-8ade-dd3f84af5bf0] Started GET "/" for 172.31.6.222 at 2023-09-13 00:00:04 +0000
+  ▲                                                                                  ▲ ▲                                                           ▲
+  └──────────────────────────────────────────────────────────────────────────────────┘ └───────────────────────────────────────────────────────────┘
+  │^(I), \[([\d\-T:.]+) #(\d+)\]\s+(INFO) -- : \[([a-f0-9\-]+)\]$                      │^(Started) (GET) ("\/") for (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) at ([\d\- :+]+)$
+  ```
+- [ ] more steps by GPT
+  - [ ] type columns of each pattern
+  - [ ] parse datetime
+- [ ] not just a command-line tool but a filebeat-like sidecar component that consumes logs as a stream
+- [ ] GPT-3.5 compatible
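
The regex-sieve step that the README text in this patch describes (match each log against a list of patterns, group the captured variables, buffer whatever matches nothing) can be sketched roughly as follows. This is an illustration only, not Log2Array's actual code: the `sieve` function, the `PATTERNS` list and the first regex are hypothetical, the second regex is adapted from the TODO example above, and a plain in-memory dict stands in for the DuckDB tables.

```python
import re
from collections import defaultdict

# Hypothetical pattern list. In Log2Array these would come from GPT-4;
# here they are hand-written for illustration.
PATTERNS = [
    re.compile(r"^(\S+) pid=(\d+) tid=(\w+) INFO: queueing (\w+) \((\w+)\)$"),
    # Adapted from the second regex in the TODO example.
    re.compile(
        r'^(Started) (GET) ("/") for (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) '
        r"at ([\d\- :+]+)$"
    ),
]


def sieve(lines, patterns):
    """Group each line under the first pattern it matches; buffer the rest."""
    matched = defaultdict(list)  # pattern index -> list of captured-variable tuples
    buffered = []                # unmatched lines, kept for later community detection
    for line in lines:
        for index, pattern in enumerate(patterns):
            m = pattern.match(line)
            if m:
                matched[index].append(m.groups())
                break
        else:
            # No pattern matched this line.
            buffered.append(line)
    return matched, buffered


logs = [
    "2023-09-13T00:00:00.128Z pid=6 tid=1xlq INFO: queueing SendDailyWorker (send_wechat_daily)",
    'Started GET "/" for 172.31.6.222 at 2023-09-13 00:00:04 +0000',
    "2023-09-13T00:00:00.130Z pid=6 tid=1xkm class=SendDailyWorker jid=61aac7d3fd0653ce9d0ff7f9 INFO: start",
]
matched, buffered = sieve(logs, PATTERNS)
```

With these three sample lines, the first two are captured by the two patterns and the third, which neither pattern covers, ends up in `buffered`. In the flow described above, that buffer is what would be handed to the text embedding model.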