
Release wasm version #152

Open
do-me opened this issue Apr 23, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@do-me

do-me commented Apr 23, 2024

Hey Ben,

I would love to use a wasm version of text-splitter in the web application https://github.com/do-me/SemanticFinder. Currently it only supports chars, words, sentences, regex and tokens, but all of these separators are too "stiff". I found that your unicode-based approach generally works quite well, which would give users more flexibility and hopefully even better results.

Do you think you could release a wasm-compiled version for the web?

@benbrandt
Owner

Hi @do-me, cool project! I would definitely love to support this.

Are you ok if it only supports character-based chunking? The reason is that I think I need to do some workarounds, or even check whether it is possible, to use tokenizer libs in wasm...

If character-based is fine, then I think it could be possible. I would also need to check whether markdown can be supported, but anything is better than nothing for your use case, perhaps.

@do-me
Author

do-me commented Apr 24, 2024

Yes, absolutely! Token-based chunking is absolute overkill for my use case.

However, if you still wanted to offer a way to include it for some reason, transformers.js offers a very convenient tokenizing API out of the box. See here, for example: https://huggingface.co/docs/transformers.js/api/tokenizers

import { AutoTokenizer } from '@xenova/transformers';

const tokenizer = await AutoTokenizer.from_pretrained('Xenova/bert-base-uncased');
const { input_ids } = await tokenizer('I love transformers!');
// Tensor {
//   data: BigInt64Array(6) [101n, 1045n, 2293n, 19081n, 999n, 102n],
//   dims: [1, 6],
//   type: 'int64',
//   size: 6,
// }

So if you shifted the task of calculating tokens to the user instead of including it directly in Rust/wasm, maybe that would make the most sense. But again, for me it's not really necessary.
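To illustrate the idea (this is a hypothetical sketch, not the text-splitter API): if the splitter only needs a user-supplied size callback, then a transformers.js tokenizer, a word counter, or plain `String.length` could all plug in interchangeably. The `chunkBySentence` helper below is an assumption for demonstration.

```javascript
// Hypothetical chunker: splits text on sentence boundaries and packs
// sentences into chunks whose size, per a user-supplied callback,
// stays under maxSize. The callback stands in for any tokenizer.
function chunkBySentence(text, maxSize, sizeFn) {
  const sentences = text.match(/[^.!?]+[.!?]*\s*/g) || [text];
  const chunks = [];
  let current = "";
  for (const sentence of sentences) {
    if (current && sizeFn(current + sentence) > maxSize) {
      chunks.push(current.trim());
      current = "";
    }
    current += sentence;
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}

// Character-based sizing: no tokenizer needed at all.
const byChars = chunkBySentence("One. Two two. Three three three.", 12, s => s.length);
console.log(byChars); // ["One.", "Two two.", "Three three three."]
```

An async token counter (like the transformers.js tokenizer above) would work the same way, just with an awaited callback.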

> If character-based is fine, then I think it could be possible. I would also need to check whether markdown can be supported, but anything is better than nothing for your use case, perhaps.

For me, certainly: whatever is feasible for you. If markdown were supported, that would allow for a really great pipeline, as I just discovered https://r.jina.ai/, which converts any web input to LLM-ready markdown. Pairing that tool with your performant chunking and SemanticFinder would deliver a great user experience :)

@benbrandt benbrandt added the enhancement New feature or request label Apr 24, 2024
@benbrandt
Owner

Awesome. Yeah, I think I'd likely do something similar to what I have in the Python bindings and accept a callback/lambda function so the user can bring custom logic that isn't compiled in. It has the downside of making an FFI call quite often, which isn't always performant, but it at least provides the functionality.
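A rough sketch of why the callback can be chatty (assumed names, not the actual bindings): a splitter that binary-searches for the largest fitting chunk invokes the size callback once per probe, and in a wasm build each of those invocations would be an FFI round trip to the JS tokenizer.

```javascript
// Hypothetical splitter step: binary-search the largest prefix whose
// size, per the callback, fits within maxSize, counting how often the
// callback (the stand-in for an FFI hop) is invoked.
function largestFittingPrefix(text, maxSize, sizeFn) {
  let lo = 0, hi = text.length, calls = 0;
  while (lo < hi) {
    const mid = Math.ceil((lo + hi) / 2);
    calls++;
    if (sizeFn(text.slice(0, mid)) <= maxSize) lo = mid;
    else hi = mid - 1;
  }
  return { prefix: text.slice(0, lo), calls };
}

const { prefix, calls } = largestFittingPrefix("a".repeat(1000), 100, s => s.length);
console.log(prefix.length, calls); // ~log2(1000) ≈ 10 callback invocations per chunk
```

So the per-chunk cost is logarithmic in candidate length rather than constant, which is tolerable but worth knowing when the callback crosses a wasm/JS boundary.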

Well, cool. Assuming the markdown crate works, it should be quite easy to support a wasm target for this use case, I think. It would also enable building a playground of sorts, so people can play with the effect of different chunk settings and see the result visually, which is something I've been wanting to do anyway.
