diff --git a/.github/workflows/python-package.yml b/.github/workflows/python-package.yml
index 046353b..b0e29a9 100644
--- a/.github/workflows/python-package.yml
+++ b/.github/workflows/python-package.yml
@@ -37,4 +37,4 @@ jobs:
     - name: Test Installation
       run: |
         pip install -e '.'
-        python -m log2row --help
+        python -m bakalog --help
diff --git a/README.md b/README.md
index 597b737..ed27829 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,5 @@
-# Log2Row
-A command-line tool that detects, extracts log templates, and structures logs to in-process database, leveraging template patterns generated by GPT-4.
+# BakaLog
+A command-line tool that structures structures logs to in-process OLAP database without any preparation, leveraging template patterns generated by GPT-4. *(previous name was log2row)*
 
 ## What it does?
 If you have several kinds of logs are mixed together (samples from [loghub](https://github.com/logpai/loghub/blob/master/Apache/Apache_2k.log)):
@@ -33,10 +33,10 @@ If you have several kinds of logs are mixed together (samples from [loghub](http
 [Sun Dec 04 04:52:15 2005] [error] mod_jk child workerEnv in error state 6
 ```
 
-Log2Row could detect and extract several log templates, it would take a while, all extracted log would be stored into an embedded in-memory DB [DuckDB](http://duckdb.org/docs/archive/0.9.0/), and opens an IPython REPL:
+BakaLog could detect and extract several log templates, it would take a while, all extracted log would be stored into an embedded in-memory DB [DuckDB](http://duckdb.org/docs/archive/0.9.0/), and opens an IPython REPL:
 
 ```ipython
- ↳ OPENAI_API_KEY="***" python -m log2row run "loghub/Apache/*.log" --gpt-base https://api.openai.com/v1 --max-lines 0 --buf-size 1MB --threshold 0.9
+ ↳ OPENAI_API_KEY="***" python -m bakalog run "loghub/Apache/*.log" --gpt-base https://api.openai.com/v1 --max-lines 0 --buf-size 1MB --threshold 0.9
 Python 3.11.5 (main, Aug 24 2023, 15:09:45) [Clang 14.0.3 (clang-1403.0.22.14.1)]
 Type 'copyright', 'credits' or 'license' for more information
 IPython 8.16.0 -- An enhanced Interactive Python. Type '?' for help.
@@ -172,29 +172,29 @@ DuckDB also supports saving results to various output types such as CSV, JSON, a
 + + > pattern flow
 ```
 
-Log2Row processes all logs through a list of regex patterns. If a log matches a pattern successfully, it's grouped and variables are inserted into DuckDB. If a log doesn't match any patterns, it's buffered. These buffered logs are used to detect log communities via a text embedding model. Samples from each community are then sent to GPT-4 to extract their regex patterns.
+BakaLog processes all logs through a list of regex patterns. If a log matches a pattern successfully, it's grouped and variables are inserted into DuckDB. If a log doesn't match any patterns, it's buffered. These buffered logs are used to detect log communities via a text embedding model. Samples from each community are then sent to GPT-4 to extract their regex patterns.
 
-The pattern flow isn't part of the main processing, which means that after an initial bootstrap, the processing speed increases significantly. Thus, the longer Log2Row runs, the higher the logs/sec rate it has.
+The pattern flow isn't part of the main processing, which means that after an initial bootstrap, the processing speed increases significantly. Thus, the longer BakaLog runs, the higher the logs/sec rate it has.
 
 ## How to install it?
 ```
- ↳pip install git@https://github.com/ethe/log2row.git
+ ↳pip install git@https://github.com/ethe/bakalog.git
 
- ↳ python3 -m log2row
-Usage: python -m log2row [OPTIONS] COMMAND [ARGS]...
+ ↳ python3 -m bakalog
+Usage: python -m bakalog [OPTIONS] COMMAND [ARGS]...
 
 Options:
   --help  Show this message and exit.
 
 Commands:
-  clean  log2row cache all extracted patterns to each files as default,...
+  clean  bakalog cache all extracted patterns to each files as default,...
   run
 ```
 
 *Logs2Array requires Python3.9+*
 
 ## How much does it cost?
-Log2Row uses GPT-4 to extract the regex of log community, each extraction would costs hundreds to thousands tokens of GPT-4. This means each log community detection would costs 0.01$ to 0.1$.
+BakaLog uses GPT-4 to extract the regex of log community, each extraction would costs hundreds to thousands tokens of GPT-4. This means each log community detection would costs 0.01$ to 0.1$.
 
 ## What is next?
 - [ ] auto-detect multiple parts of log templates
@@ -211,4 +211,4 @@ Log2Row uses GPT-4 to extract the regex of log community, each extraction would
 - [ ] GPT-3.5 compatible
 
 ## More information
-Currently, log2row is still in early stage. If you are interested in it, let's discuss it on [Hacker News](https://news.ycombinator.com/item?id=37789903)
+Currently, bakalog is still in early stage. If you are interested in it, let's discuss it on [Hacker News](https://news.ycombinator.com/item?id=37789903)
diff --git a/log2row/__init__.py b/bakalog/__init__.py
similarity index 100%
rename from log2row/__init__.py
rename to bakalog/__init__.py
diff --git a/log2row/__main__.py b/bakalog/__main__.py
similarity index 89%
rename from log2row/__main__.py
rename to bakalog/__main__.py
index 84f3d4b..ea8233f 100644
--- a/log2row/__main__.py
+++ b/bakalog/__main__.py
@@ -7,10 +7,10 @@
 from IPython import embed
 from rich.logging import RichHandler
 
-from log2row import Match, Sink, collect
-from log2row.cluster import Cluster
-from log2row.extract import extract
-from log2row.util import Memory, parse_size
+from bakalog import Match, Sink, collect
+from bakalog.cluster import Cluster
+from bakalog.extract import extract
+from bakalog.util import Memory, parse_size
 
 
 @click.group()
@@ -19,7 +19,7 @@ def main():
 
 
 @main.command(
-    help="log2row cache all extracted patterns to each files as default, clean the cache as needed."
+    help="bakalog cache all extracted patterns to each files as default, clean the cache as needed."
 )
 def clean():
     files = glob(f"{Memory.PATH}/*")
diff --git a/log2row/cluster.py b/bakalog/cluster.py
similarity index 100%
rename from log2row/cluster.py
rename to bakalog/cluster.py
diff --git a/log2row/extract.py b/bakalog/extract.py
similarity index 99%
rename from log2row/extract.py
rename to bakalog/extract.py
index 1f00baa..ea9aa1b 100644
--- a/log2row/extract.py
+++ b/bakalog/extract.py
@@ -115,7 +115,7 @@ def get_messages(samples):
             "role": "system",
             "content": "You are a senior regular expression developer."
             "Create a regular expression which could match, group and extract the log template of several logs below."
-            "Exactly the same part between logs must be the part of template."
+            "The exactly same part between logs must be the part of template."
             "Pattern should start with `^` and end with `$`",
         },
         {
diff --git a/log2row/util.py b/bakalog/util.py
similarity index 99%
rename from log2row/util.py
rename to bakalog/util.py
index 2db2428..462c438 100644
--- a/log2row/util.py
+++ b/bakalog/util.py
@@ -43,7 +43,7 @@ def register(self, type, encoder):
 
 class Memory(Singleton):
     home = os.environ["HOME"]
-    PATH = f"{home}/.log2row"
+    PATH = f"{home}/.bakalog"
 
     def __init__(
         self,
diff --git a/pyproject.toml b/pyproject.toml
index c94ffac..40987b5 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -1,5 +1,5 @@
 [tool.poetry]
-name = "log2row"
+name = "bakalog"
 version = "0.1.0"
 description = ""
 authors = ["Gwo Tzu-Hsing "]
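
Note on the pattern flow described in the README hunk: the match-or-buffer step can be pictured with the minimal sketch below. It is an illustration only, not BakaLog's implementation. The names `route`, `PATTERNS`, `LINES`, the regex itself, and the second sample line are all hypothetical. The sketch leaves out the embedding-based community detection, the GPT-4 call, and the DuckDB insert, and assumes the first matching pattern wins.

```python
import re

# Hypothetical pattern list. In the README's description, such regexes
# are produced by GPT-4; this one is hand-written for illustration.
PATTERNS = [
    re.compile(r"^\[(.+?)\] \[error\] mod_jk child workerEnv in error state (\d+)$"),
]


def route(lines, patterns):
    """Group each line under the first pattern that matches it; buffer the rest."""
    matched = {p.pattern: [] for p in patterns}
    buffered = []
    for line in lines:
        for p in patterns:
            m = p.match(line)
            if m:
                # Captured variables; a real sink would insert these as a row.
                matched[p.pattern].append(m.groups())
                break
        else:
            # No pattern matched: keep the line for later pattern extraction.
            buffered.append(line)
    return matched, buffered


LINES = [
    "[Sun Dec 04 04:52:15 2005] [error] mod_jk child workerEnv in error state 6",
    "an unrecognized line",  # made-up input that no pattern matches
]
MATCHED, BUFFERED = route(LINES, PATTERNS)
```

Under these assumptions, the buffered list is what would feed the slower pattern-extraction path, which is consistent with the README's claim that throughput rises once most lines hit a known pattern.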