LLVim uses Large Language Models (LLMs) to operate on text documents through a Vim client: it drives a headless Neovim instance that executes LLM-generated Vim commands, yielding verifiable and efficient text extraction. Because the model emits commands rather than reproducing the text itself, extracted content is guaranteed to exist in the source document, eliminating the hallucinations common in traditional LLM extraction. LLVim cuts token usage by over 95% compared to verbatim extraction methods, and it remains robust on weakly supported languages that even frontier models struggle with.
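The core mechanism can be illustrated with a short sketch. The snippet below is not LLVim's actual API; it is a minimal, assumed example using `pynvim` to drive an embedded headless Neovim, execute a hypothetical LLM-generated command, and read the result back from a register:

```python
# Minimal sketch (assumptions: pynvim installed, nvim on PATH; this is
# NOT LLVim's actual interface). Drive a headless Neovim, execute an
# LLM-generated Vim command, and read the extracted text from a register.
import pynvim

# Start an embedded, headless Neovim child process.
nvim = pynvim.attach("child", argv=["nvim", "--embed", "--headless", "--clean"])

# Load the source document into the current buffer.
with open("document.txt") as f:
    nvim.current.buffer[:] = f.read().splitlines()

# Hypothetical LLM output: jump to line 3, select linewise to the end
# of the paragraph, and yank the selection into register a.
llm_command = '3GV}"ay'
nvim.command(f"normal! {llm_command}")

# The extracted text is yanked from the buffer itself, so it is
# guaranteed to exist verbatim in the source document.
extracted = nvim.funcs.getreg("a")
print(extracted)
```

Note that the model's output is only the short command string, so the output-token cost is independent of how much text gets extracted.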
- Vim emulator with helpers for LLM interaction.
- End-to-end single-turn proof-of-concept on Hamming's *You and Your Research*.
- Token savings metric, answering "how many tokens do we save by doing this?" (see the first sketch after this list).
- Plot token savings vs. extracted length, compared to verbatim extraction methods.
- Plot pipeline latency vs. extracted length, compared to verbatim extraction methods.
- Plot partial-ratio existence (verifiable extraction) vs. extracted length, compared to verbatim extraction methods (see the second sketch after this list).
- Ablate the Vim window size.
- End-to-end multi-turn proof-of-concept (navigating a large document efficiently).
- Replicate results on open-source models.
- Concise & direct whitepaper presenting the findings.
- [Maybe] synthetically bootstrap fine-tuning data (outputs are easily verifiable, so synthetic data generation is viable).
- [Maybe] fine-tune a lightweight open-source model on this data.
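As a concrete reading of the token-savings metric, here is a minimal sketch (the tokenizer choice and function names are assumptions, not LLVim's code): savings are the fraction of output tokens avoided by emitting a Vim command instead of the extracted text verbatim.

```python
# Sketch of the token-savings metric (assumed names and tokenizer;
# not LLVim's actual implementation).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def token_savings(vim_command: str, extracted_text: str) -> float:
    """Fraction of output tokens saved vs. emitting the text verbatim."""
    command_tokens = len(enc.encode(vim_command))
    verbatim_tokens = len(enc.encode(extracted_text))
    return 1.0 - command_tokens / verbatim_tokens

# A short command standing in for a long passage yields large savings.
print(token_savings('3GV}"ay', "a long extracted passage " * 100))
```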
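The partial-ratio existence check can be sketched as follows (the library choice is an assumption): fuzzy matching confirms that the extracted span actually appears in the source, which is what makes the extraction verifiable.

```python
# Sketch of the verifiability check via fuzzy partial-ratio matching
# (rapidfuzz is an assumed library choice, not necessarily LLVim's).
from rapidfuzz import fuzz

def exists_in_source(extracted: str, source: str, threshold: float = 95.0) -> bool:
    """True if the extracted text appears (near-)verbatim in the source."""
    return fuzz.partial_ratio(extracted, source) >= threshold
```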
```bibtex
@misc{llvim,
  author  = {Adham Elarabawy},
  title   = {LLVim: Verifiable and Token-Efficient Text Extraction Using LLMs and Vim},
  year    = {2024},
  version = {0.1.0},
  url     = {https://github.com/adham-elarabawy/llvim},
  doi     = {10.5281/zenodo.13835827},
}
```
This work is licensed under a Creative Commons Attribution 4.0 International License, which requires that you provide proper attribution (the citation above) when referencing, using, or deriving from this work.