ViT5 is a pretrained Transformer-based encoder-decoder model for the Vietnamese language. With T5-style self-supervised pretraining, ViT5 is trained on a large corpus of high-quality and diverse Vietnamese text. We benchmark ViT5 on two downstream text generation tasks: Abstractive Text Summarization and Named Entity Recognition. All experiments are reported in our paper ViT5: Pretrained Text-to-Text Transformer for Vietnamese Language Generation.
Vocabulary: ViT5_vocab
Model | Gin File Location | Checkpoint Location | 🤗 HuggingFace Model |
---|---|---|---|
ViT5-Base | ViT5_base.gin | gs://vietai_public/viT5/ViT5_base/checkpoint_1000000 | ViT5-Base-1024 (1M) |
ViT5-Large | ViT5_large.gin | gs://vietai_public/viT5/ViT5_large/checkpoint_1500000 | ViT5-Large-1024 (1.5M) |
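The original checkpoints live in a public GCS bucket, so you can download them with the Cloud SDK (e.g. `gsutil -m cp -r gs://vietai_public/viT5/ViT5_base/checkpoint_1000000 .`) or inspect them from Python. A minimal sketch, assuming TensorFlow is installed with GCS filesystem support:

```python
import tensorflow as tf

# List the files of the public ViT5-Base checkpoint (path from the table above).
ckpt_dir = "gs://vietai_public/viT5/ViT5_base/checkpoint_1000000"
for name in tf.io.gfile.listdir(ckpt_dir):
    print(name)
```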
📄 Example with Flaxformer: finetune_vit5x_example.ipynb
📄 Example with Hugging Face: finetune_huggingface_example.ipynb
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the ViT5-Large checkpoint finetuned on the Vietnews summarization dataset.
tokenizer = AutoTokenizer.from_pretrained("VietAI/vit5-large-vietnews-summarization")
model = AutoModelForSeq2SeqLM.from_pretrained("VietAI/vit5-large-vietnews-summarization")
model.to("cuda")

# Input document (Vietnamese); prepend the task prefix and append the end-of-sequence token.
sentence = "VietAI là tổ chức phi lợi nhuận với sứ mệnh ươm mầm tài năng về trí tuệ nhân tạo và xây dựng một cộng đồng các chuyên gia trong lĩnh vực trí tuệ nhân tạo đẳng cấp quốc tế tại Việt Nam."
text = "vietnews: " + sentence + " </s>"

encoding = tokenizer(text, return_tensors="pt")
input_ids, attention_masks = encoding["input_ids"].to("cuda"), encoding["attention_mask"].to("cuda")
outputs = model.generate(
    input_ids=input_ids,
    attention_mask=attention_masks,
    max_length=256,
    early_stopping=True,
)
for output in outputs:
    line = tokenizer.decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    print(line)
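Note that `early_stopping` only takes effect when beam search is enabled; the greedy call above ignores it. A minimal beam-search variant of the same call (`num_beams=4` is an illustrative choice, not a tuned value):

```python
# Beam-search variant of the generate() call above.
outputs = model.generate(
    input_ids=input_ids,
    attention_mask=attention_masks,
    max_length=256,
    num_beams=4,          # illustrative beam width
    early_stopping=True,  # stop once num_beams finished candidates exist
)
```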
Load our pretrained models from HuggingFace:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
# Base
tokenizer = AutoTokenizer.from_pretrained("VietAI/vit5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("VietAI/vit5-base")
# Large
tokenizer = AutoTokenizer.from_pretrained("VietAI/vit5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("VietAI/vit5-large")
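These pretrained checkpoints can be finetuned on any task cast as text-to-text. A minimal sketch of a single training step with Hugging Face Transformers, reusing the `vietnews:` prefix from the summarization example (the example pair and length limits are illustrative assumptions; see finetune_huggingface_example.ipynb for the full recipe):

```python
# Illustrative (document, summary) pair; in practice these come from your dataset.
article = "Nội dung bài báo cần tóm tắt."  # input document to summarize
summary = "Tóm tắt ngắn gọn."              # target summary

inputs = tokenizer(
    "vietnews: " + article + " </s>",  # same text-to-text format as above
    max_length=1024, truncation=True, return_tensors="pt",
)
labels = tokenizer(summary, max_length=256, truncation=True, return_tensors="pt").input_ids

# Transformers computes the seq2seq cross-entropy loss from `labels`.
loss = model(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask, labels=labels).loss
loss.backward()
```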
To make our results easy to reproduce, we also provide the ViT5 checkpoint finetuned on the Vietnews summarization dataset, which you can use directly from HuggingFace 🤗 as in the example above.
@inproceedings{phan-etal-2022-vit5,
title = "{V}i{T}5: Pretrained Text-to-Text Transformer for {V}ietnamese Language Generation",
author = "Phan, Long and Tran, Hieu and Nguyen, Hieu and Trinh, Trieu H.",
booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop",
year = "2022",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.naacl-srw.18",
pages = "136--142",
}
We would like to thank Google for supporting this work with Cloud credits and TPU quota!