From 6ab3ee644fbe98d7e526a77ee157a4a9d365b35e Mon Sep 17 00:00:00 2001
From: Gwo Tzu-Hsing
Date: Tue, 10 Oct 2023 22:57:36 +0800
Subject: [PATCH] update docs

---
 docs/404.html                                 | 24 ------------------
 docs/_config.yml                              | 14 +++--------
 .../2023-10-07-welcome-to-jekyll.markdown     | 25 -------------------
 docs/about.md                                 | 18 -------------
 docs/index.md                                 | 17 +++++++++++++
 5 files changed, 20 insertions(+), 78 deletions(-)
 delete mode 100644 docs/404.html
 delete mode 100644 docs/_posts/2023-10-07-welcome-to-jekyll.markdown
 delete mode 100644 docs/about.md

diff --git a/docs/404.html b/docs/404.html
deleted file mode 100644
index c472b4e..0000000
--- a/docs/404.html
+++ /dev/null
@@ -1,24 +0,0 @@
----
-layout: default
----
-
-
-
-

404

- -

Page not found :(

-

The requested page could not be found.

-
diff --git a/docs/_config.yml b/docs/_config.yml index 0089889..49bf9d5 100644 --- a/docs/_config.yml +++ b/docs/_config.yml @@ -13,21 +13,13 @@ # you will see them accessed via {{ site.title }}, {{ site.email }}, and so on. # You can create any custom variable you would like, and they will be accessible # in the templates via {{ site.myvariable }}. -title: Your awesome title -email: your-email@example.com -description: >- # this means to ignore newlines until "baseurl:" - Write an awesome description for your new site here. You can edit this - line in _config.yml. It will appear in your document head meta (for - Google search results) and in your feed.xml site description. -baseurl: "" # the subpath of your site, e.g. /blog -url: "" # the base hostname & protocol for your site, e.g. http://example.com +title: STRUCTURED LOGS IS AMAZING, BUT YOU DON'T NEED TO PREPARE FOR IT +email: gotzehsing@gmail.com github_username: ethe -github: [metadata] - # Build settings markdown: kramdown -remote_theme: pages-themes/minimal@v0.2.0 +remote_theme: "just-the-docs/just-the-docs" plugins: - jekyll-remote-theme # add this line to the plugins list if you already have one diff --git a/docs/_posts/2023-10-07-welcome-to-jekyll.markdown b/docs/_posts/2023-10-07-welcome-to-jekyll.markdown deleted file mode 100644 index b19ce36..0000000 --- a/docs/_posts/2023-10-07-welcome-to-jekyll.markdown +++ /dev/null @@ -1,25 +0,0 @@ ---- -layout: post -title: "Welcome to Jekyll!" -date: 2023-10-07 18:46:14 +0800 -categories: jekyll update ---- -You’ll find this post in your `_posts` directory. Go ahead and edit it and re-build the site to see your changes. You can rebuild the site in many different ways, but the most common way is to run `jekyll serve`, which launches a web server and auto-regenerates your site when a file is updated. - -To add new posts, simply add a file in the `_posts` directory that follows the convention `YYYY-MM-DD-name-of-post.ext` and includes the necessary front matter. 
Take a look at the source for this post to get an idea about how it works.
-
-Jekyll also offers powerful support for code snippets:
-
-{% highlight ruby %}
-def print_hi(name)
-  puts "Hi, #{name}"
-end
-print_hi('Tom')
-#=> prints 'Hi, Tom' to STDOUT.
-{% endhighlight %}
-
-Check out the [Jekyll docs][jekyll-docs] for more info on how to get the most out of Jekyll. File all bugs/feature requests at [Jekyll’s GitHub repo][jekyll-gh]. If you have questions, you can ask them on [Jekyll Talk][jekyll-talk].
-
-[jekyll-docs]: https://jekyllrb.com/docs/home
-[jekyll-gh]: https://github.com/jekyll/jekyll
-[jekyll-talk]: https://talk.jekyllrb.com/
diff --git a/docs/about.md b/docs/about.md
deleted file mode 100644
index 8b4e0b2..0000000
--- a/docs/about.md
+++ /dev/null
@@ -1,18 +0,0 @@
----
-layout: page
-title: About
-permalink: /about/
----
-
-This is the base Jekyll theme. You can find out more info about customizing your Jekyll theme, as well as basic Jekyll usage documentation at [jekyllrb.com](https://jekyllrb.com/)
-
-You can find the source code for Minima at GitHub:
-[jekyll][jekyll-organization] /
-[minima](https://github.com/jekyll/minima)
-
-You can find the source code for Jekyll at GitHub:
-[jekyll][jekyll-organization] /
-[jekyll](https://github.com/jekyll/jekyll)
-
-
-[jekyll-organization]: https://github.com/jekyll
diff --git a/docs/index.md b/docs/index.md
index 829c03a..f546877 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -1,3 +1,9 @@
+---
+layout: minimal
+---
+
+# STRUCTURED LOGS IS USEFUL, AND YOU DON'T NEED TO PREPARE FOR IT
+
 Ever found yourself in the same predicament as I have? As a backend developer, I often find myself in a dilemma during service program diagnostics: I regret not having collected and structured more logs in advance for insertion into Elasticsearch, while also being panicked by the complexity of grep, awk and sed commands.
 
 Recently, however, I seem to have seen a turning point: I've made a new attempt based on some recent innovations. I've created a command-line tool that leverages the LLM's automatic structuring capabilities to structure logs after the fact, and uses an in-process localized OLAP database, Python REPL, and NumPy / Pandas to provide a quick and powerful querying and processing workstation. You can [check out the results here](https://github.com/ethe/bakalog). Below are some of my thoughts on the issue of log processing.
@@ -102,6 +108,17 @@ After clustering the logs, we need to identify the log patterns to facilitate th
 └────────────────────────────────────┘ └──────────────────────────────────────────────┘
 ```
 
+An LLM is a perfect fit for this job: LLMs perform exceptionally well at pattern recognition and summarization. I once attempted to submit each log to GPT-4 for pattern recognition, [but the LLM was too slow](https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow). Ultimately, I chose to provide GPT-4 only with samples from each log group, prompt it to generate regular expressions, and use these for direct log matching. Only when a log matches none of the regular expressions does further log clustering and summarization occur. The advantage of this approach is that after a brief cold-start period, the vast majority of logs will be processed via regular expressions and will no longer rely on GPT-4.
+
+In the end, we have a complete tool:
+
+1. Use regular expressions to match and extract log variables;
+2. Cluster logs using a text embedding model;
+3. Use GPT-4 to extract regular expressions from samples of each log category;
+4. Automatically create DuckDB tables and open IPython/Jupyter.
+
+You can see the final demonstration in [bakalog/README.md](https://github.com/ethe/bakalog#what-it-does).
+
 ## WHAT'S NEXT?
 
 This is just a Proof of Concept (PoC); it has the potential to grow into a real localized ad-hoc log analysis platform: