A curated compilation of AWESOME Large Language Models (LLMs). Each LLM is described through a structured summary that highlights shared characteristics for easy comparison. This structured approach aligns with next-generation scholarly communication standards, allowing seamless integration into any Scholarly Knowledge Graph (SKG). Prime examples are the machine-actionable review article on the Open Research Knowledge Graph (ORKG) and its comprehensive comparison of LLMs.
- awesome-LLMs
- Organizations
- OpenAI
- Google, and CMU
- Pengcheng Lab, and Baidu
- Google, and University of Washington
- Salesforce
    - DeepMind
- Microsoft
- Huggingface
- Google, and Imperial College London
- Google, and Stanford
    - NVIDIA
- EleutherAI
- Facebook, Google, and UC Berkeley
- UC Berkeley
- AI21
- Anthropic
- EleutherAI, Stability.ai, and LMU Munich
- Amazon
- Tsinghua University
- BigScience
- Huggingface, and Big Science
- Meta
- Meta AI, University of Washington, and University of Hong Kong
- Meta AI
- Stanford University
- Large Model Systems Organization
- MosaicML
- BigCode Project
- Technology Innovation Institute
- Berkeley AI Research
- Cerebras, Mohamed bin Zayed University of Artificial Intelligence, and Inception
-
Title: Improving Language Understanding by Generative Pre-Training model family: GPT date created: 2018-06-01 organization: OpenAI innovation: The paper introduces a framework for natural language understanding by first using generative pre-training on a diverse corpus and then fine-tuning for specific tasks. This approach improved state-of-the-art results on 9 out of 12 datasets, highlighting the potential of unsupervised learning combined with discriminative tasks. pretraining architecture: Decoder pretraining task: Causal language modeling fine-tuning task: Supervised discriminative finetuning training corpus: BookCorpus, Supervised Finetuning on several task-specific datasets for Natural Language Inference, Question Answering, Sentence similarity, and Classification. optimizer: Adam optimizer tokenization: byte pair encoding number of parameters: 117M maximum number of parameters (in million): 117 application: Text generation, but adaptable to many other NLP tasks when fine-tuned. has source code: https://github.com/openai/finetune-transformer-lm, https://huggingface.co/docs/transformers/model_doc/openai-gpt blog post: https://medium.com/@hyponymous/paper-summary-improving-language-understanding-by-generative-pre-training-7b77babd7086, https://www.reddit.com/r/MachineLearning/comments/n36htr/p_gpt1_annotated_paper_paper_summary/ license: closed source research problem: Large Language Models (LLMs), transformer model
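GPT's byte pair encoding tokenizer is built by repeatedly merging the most frequent adjacent symbol pair. A minimal sketch of one BPE merge step over a toy word-frequency table (the corpus and frequencies here are invented for illustration):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a toy corpus of symbol tuples."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy word-frequency table: characters plus an end-of-word marker.
corpus = {("h", "u", "g", "</w>"): 10, ("h", "u", "t", "</w>"): 4}
pair = most_frequent_pair(corpus)   # ("h", "u"), seen 14 times
corpus = merge_pair(corpus, pair)
```

Running the merge loop for a few thousand iterations yields the subword vocabulary actually used at training time.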
-
Title: Language models are unsupervised multitask learners model family: GPT date created: 2019-02-01 organization: OpenAI innovation: It can generate up to 768 words (about 1.5 pages) and demonstrates that language models begin to learn tasks such as question answering, machine translation, reading comprehension, and summarization without any explicit supervision when trained on a task-agnostic, diverse dataset of millions of web-scraped webpages. The work proposes a central research question: do WebText LMs transfer well across domains and datasets? pretraining architecture: Decoder pretraining task: Causal language modeling training corpus: https://github.com/openai/gpt-2-output-dataset, 8 million web pages (40 GB), roughly 10X the data used for GPT-1. The WebText dataset is created by scraping all outbound links from Reddit posts with at least 3 karma. tokenization: byte pair encoding number of parameters: 124M, 355M, 774M, 1.5B maximum number of parameters (in million): 1500 extension: GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than 10X the amount of data., Minor extensions to the GPT architecture (e.g. layer normalization moved to the input of each sub-layer, or increased context size from 512 to 1024) application: Text generation, but adaptable to many other NLP tasks when fine-tuned. has source code: https://huggingface.co/docs/transformers/model_doc/gpt2 blog post: https://openai.com/research/better-language-models, https://www.philschmid.de/fine-tune-a-non-english-gpt-2-model-with-huggingface license: closed source research problem: Large Language Models (LLMs), transformer model
-
Title: Language Models are Few-Shot Learners model family: GPT date created: 2020-05-01 organization: OpenAI innovation: GPT-3's primary innovation in the context of Large Language Models is its exceptional few-shot learning capabilities, allowing it to make accurate predictions using just a natural language prompt and a few task demonstrations. The model also introduced prompt-based and in-context learning methodologies. However, its vast size (175B parameters) poses challenges for real-world applications., It can generate up to 1,536 words (equivalent to 3 pages) pretraining architecture: Decoder pretraining task: Causal language modeling training corpus: ~ 500B tokens including CommonCrawl (410B), WebText2 (19B), Books1 (12B), Books2 (55B), and Wikipedia (3B) number of parameters: 125M, 350M, 774M, 1.3B, 2.7B, 6.7B, 13B, 175B maximum number of parameters (in million): 175000 hardware used: Nvidia V100 GPU hardware information: All models were trained on V100 GPUs on part of a high-bandwidth cluster provided by Microsoft. extension: Same as GPT-2 with the only addition of alternating dense and locally banded sparse attention patterns, inspired by the Sparse Transformer application: Initially text generation, but has over time been used for a large range of applications in areas such as code generation, but also image and audio generation has source code: https://platform.openai.com/docs/models/gpt-3-5, https://github.com/openai/gpt-3 blog post: https://medium.com/analytics-vidhya/openai-gpt-3-language-models-are-few-shot-learners-82531b3d3122, https://openai.com/blog/gpt-3-apps license: closed source research problem: Large Language Models (LLMs), transformer model
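GPT-3's few-shot ability works by placing task demonstrations directly in the prompt, with no gradient updates. A minimal sketch of how such an in-context prompt can be assembled (the template and the translation examples are illustrative, not OpenAI's exact format):

```python
def few_shot_prompt(instruction, demonstrations, query):
    """Build an in-context learning prompt: a task description,
    a few input->output demonstrations, then the unanswered query."""
    lines = [instruction, ""]
    for inp, out in demonstrations:
        lines.append(f"Input: {inp}")
        lines.append(f"Output: {out}")
    lines.append(f"Input: {query}")
    lines.append("Output:")  # the model completes from here
    return "\n".join(lines)

prompt = few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("bread", "pain")],
    "milk",
)
```

The completion the model generates after the trailing "Output:" is taken as its answer.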
-
Title: Zero-Shot Text-to-Image Generation model family: GPT date created: 2021-01-01 organization: OpenAI innovation: The paper introduces a model with remarkable generalization, capable of creatively interpreting and combining unusual textual concepts into images. It also demonstrates combinatorial generalization and zero-shot image-to-image translation controlled by natural language, showcasing advancements in LLMs for text-to-image synthesis. pretraining architecture: Decoder pretraining task: Caption prediction training corpus: 250 million text-image pairs from the internet optimizer: Adam optimizer tokenization: byte pair encoding number of parameters: 12B maximum number of parameters (in million): 12000 hardware used: NVIDIA V100 (16GB) GPU extension: A differential variational auto-encoder is used to learn the visual codebook. The transformer is a variation of GPT-3 application: Text to image has source code: https://github.com/openai/DALL-E, https://github.com/borisdayma/dalle-mini blog post: https://openai.com/blog/dall-e/, https://ml.berkeley.edu/blog/posts/dalle2/ license: N/A research problem: Large Language Models (LLMs), transformer model
-
Title: Learning Transferable Visual Models From Natural Language Supervision model family: CLIP (also using ResNet, ViT, and a vanilla transformer for text) date created: 2021-02-01 organization: OpenAI innovation: CLIP, in the context of Large Language Models, introduces a novel approach by leveraging natural language supervision with a dataset of 400 million (image, text) pairs. It excels in zero-shot learning, allowing it to classify images using textual descriptions without prior training on specific categories. This integration of vision and language offers a flexible, scalable solution with potential for diverse applications. pretraining architecture: Encoder pretraining task: predict which of the N × N possible (image, text) pairings across a batch actually occurred training corpus: WIT (WebImageText) - 400 million (text, image) pairs optimizer: Adam optimizer hardware used: Nvidia V100 GPU hardware information: The largest ResNet model, RN50x64, took 18 days to train on 592 V100 GPUs while the largest Vision Transformer took 12 days on 256 V100 GPUs. extension: Combines ResNet and ViT for the visual encoding with a Transformer for the textual encoder application: Image/Object classification has source code: https://github.com/openai/CLIP, https://huggingface.co/docs/transformers/model_doc/clip blog post: https://openai.com/research/clip, https://medium.com/axinc-ai/clip-learning-transferable-visual-models-from-natural-language-supervision-4508b3f0ea46 license: Open, MIT license research problem: Large Language Models (LLMs), transformer model
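CLIP's pretraining task, predicting which of the N × N possible (image, text) pairings in a batch actually occurred, reduces to a symmetric cross-entropy over an image-text similarity matrix. A minimal pure-Python sketch (the tiny hand-made similarity matrices stand in for real encoder outputs):

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy of one row of logits against the target index."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def clip_loss(sim):
    """Symmetric contrastive loss over an N x N image-text similarity matrix:
    image i's correct text is column i, and text j's correct image is row j."""
    n = len(sim)
    img_to_txt = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    txt_to_img = sum(cross_entropy([sim[i][j] for i in range(n)], j)
                     for j in range(n)) / n
    return (img_to_txt + txt_to_img) / 2

# Diagonal-dominant similarities (matched pairs scored highest) give low loss.
loss_matched = clip_loss([[5.0, 0.0], [0.0, 5.0]])
loss_shuffled = clip_loss([[0.0, 5.0], [5.0, 0.0]])
```

In the real model the similarities are scaled cosine similarities between L2-normalized image and text embeddings.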
-
Title: GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models model family: Diffusion models date created: 2021-12-01 organization: OpenAI innovation: The paper introduces two guidance techniques for text-guided image synthesis: CLIP guidance and classifier-free guidance. Of the two, classifier-free guidance produces higher-quality, photorealistic images that align closely with textual descriptions, outperforming previous models like DALL-E in evaluations. pretraining architecture: Encoder pretraining task: Caption prediction training corpus: Same as DALL-E number of parameters: 3.5B diffusion model (2.3B for visual encoding, 1.2B for textual) + 1.5B for model for upsampling maximum number of parameters (in million): 3500 extension: GLIDE can be seen as an extension of the ADM (Ablated Diffusion Model) by the same authors. However, ADM is not per se a transformer architecture although it does resemble one in some of the configurations the authors use. Given that ADM is by the same authors and was quickly followed up by GLIDE, I think it is fair to consider GLIDE as the first of its kind. application: Text to image has source code: https://github.com/openai/glide-text2im license: Open, MIT license research problem: Large Language Models (LLMs), transformer model
-
Title: Training language models to follow instructions with human feedback model family: GPT date created: 2022-01-01 organization: OpenAI innovation: Better alignment of LLMs with human expectations using reinforcement learning through human feedback pretraining architecture: Decoder pretraining task: Causal language modeling fine-tuning task: Reinforcement Learning from Human Feedback training corpus: Same as GPT3 for pretraining, but finetuned and optimized using labeler data and prompts number of parameters: Same as GPT3 extension: InstructGPT starts off with a pretrained GPT3 model and adds reward modeling through reinforcement learning after a supervised finetuning application: Knowledge-intensive dialog or language tasks has source code: https://github.com/openai/following-instructions-human-feedback blog post: https://sh-tsang.medium.com/review-instructgpt-training-language-models-to-follow-instructions-with-human-feedback-7fce4bf9059a, https://openai.com/research/instruction-following license: Closed source, accessible through API research problem: Large Language Models (LLMs), transformer model
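The reward-modeling stage of the InstructGPT recipe trains a scalar reward model on human preference comparisons. A minimal sketch of the standard pairwise ranking loss, -log σ(r_chosen − r_rejected), where the reward values below are made up for illustration:

```python
import math

def reward_model_loss(r_chosen, r_rejected):
    """Pairwise ranking loss for reward-model training:
    -log(sigmoid(r_chosen - r_rejected)). The loss is small when the
    reward model scores the human-preferred response higher than the
    rejected one, and large when the ordering is inverted."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

good_ordering = reward_model_loss(2.0, 0.0)   # preferred response scored higher
bad_ordering = reward_model_loss(0.0, 2.0)    # preferred response scored lower
```

The trained reward model then supplies the scalar signal that PPO maximizes in the final reinforcement-learning step.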
-
Title: Hierarchical Text-Conditional Image Generation with CLIP Latents model family: GLIDE, CLIP date created: 2022-04-01 organization: OpenAI pretraining architecture: Encoder/Decoder pretraining task: Caption prediction training corpus: Combination of the DALL-E and CLIP datasets number of parameters: 3.5B maximum number of parameters (in million): 3500 extension: Combines CLIP encoder and Diffusion decoder similar to GLIDE application: Text to image blog post: https://openai.com/product/dall-e-2, https://labs.openai.com/ license: Closed source, accessible through API research problem: Large Language Models (LLMs), transformer model
-
Title: Introducing ChatGPT model family: GPT date created: 2022-11-30 organization: OpenAI innovation: trained using Reinforcement Learning from Human Feedback (RLHF) to obtain better model alignment, It can generate up to 3,000 words (equivalent to 6 pages), Supports an input context length of 2,048 tokens pretraining architecture: Decoder pretraining task: Causal language modeling fine-tuning task: Step 1. Supervised fine-tuning, Step 2. Collect comparison data and train a reward model, Step 3. RLHF using Proximal Policy Optimization training corpus: Human written prompt and interaction dataset collected through the OpenAI API number of parameters: 175B maximum number of parameters (in million): 175000 hardware information: trained on an Azure AI supercomputing infrastructure application: provide human-like conversational interactions and assist users in answering questions, generating text, providing recommendations, and engaging in natural language conversations. blog post: https://openai.com/blog/chatgpt license: Closed source, accessible through API research problem: transformer model, Large Language Models (LLMs)
-
Title: GPT-4 Technical Report model family: GPT date created: 2023-03-14 organization: OpenAI innovation: It can generate up to 24,000 words (equivalent to 48 pages), Supports an input context length between 8,192 and 32,768 tokens depending on the model version pretraining architecture: Decoder pretraining task: Causal language modeling fine-tuning task: Reinforcement Learning from Human Feedback, Rule-Based Reward Model number of parameters: undisclosed extension: a large-scale, multimodal model which can accept image and text inputs and produce text outputs application: Creating highly realistic and contextually accurate human-like text generation blog post: https://openai.com/research/gpt-4 license: Closed source, accessible through API research problem: transformer model, Large Language Models (LLMs)
-
Title: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding model family: BERT date created: 2018-10-01 organization: Google innovation: BERT's primary innovation in language model pretraining is the "masked language model" (MLM) approach, inspired by the Cloze task. This method masks random tokens in a sentence and trains the model to predict them, enabling bidirectional context understanding. pretraining architecture: Encoder pretraining task: Masked Language Modeling, Next Sentence Prediction training corpus: Toronto Book Corpus and Wikipedia (3.3B Tokens) optimizer: Adam optimizer tokenization: WordPiece number of parameters: Base = 110M, Large = 340M maximum number of parameters (in million): 340 application: General Language Understanding and Question Answering. Many other language applications followed has source code: https://huggingface.co/docs/transformers/model_doc/bert blog post: https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/BERT/Fine_tuning_BERT_(and_friends)_for_multi_label_text_classification.ipynb, https://www.philschmid.de/bert-text-classification-in-a-different-language license: Open, Apache 2.0 research problem: Large Language Models (LLMs), transformer model
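Masked language modeling hides a fraction of the input tokens and asks the model to recover them from bidirectional context. A simplified sketch of the masking step (the real recipe selects ~15% of tokens and only replaces 80% of those with [MASK], leaving 10% unchanged and swapping 10% for random tokens; this version masks every selected position):

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Select ~mask_rate of the positions, replace them with [MASK],
    and record the original tokens as prediction targets."""
    rng = random.Random(seed)
    n = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = "[MASK]"
    return masked, targets

tokens = "the quick brown fox jumps over the lazy dog and runs far away home".split()
masked, targets = mask_tokens(tokens)
```

The training loss is computed only at the masked positions, against the recorded targets.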
-
Title: ALBERT: A Lite BERT for Self-supervised Learning of Language Representations model family: BERT date created: 2019-09-01 organization: Google innovation: The main innovation of the work is ALBERT, a language model that improves on existing large models like BERT by employing parameter reduction techniques, such as factorized embeddings and cross-layer parameter sharing. This allows ALBERT to achieve better performance and efficiency on natural language understanding tasks by training larger models with fewer parameters. pretraining architecture: Encoder pretraining task: Next Sentence Prediction, Masked Language Modeling training corpus: Same as BERT optimizer: LAMB optimizer tokenization: sentencepiece number of parameters: Base = 12M, Large = 18M, XLarge = 60M maximum number of parameters (in million): 60 hardware used: Cloud TPUv3 extension: Compressed version of BERT using parameter sharing, which is much more efficient given the same number of parameters application: Same as BERT has source code: https://github.com/google-research/albert, https://huggingface.co/docs/transformers/model_doc/albert blog post: https://ai.googleblog.com/2019/12/albert-lite-bert-for-self-supervised.html license: Open, Apache 2.0 research problem: Large Language Models (LLMs), transformer model
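ALBERT's factorized embedding replaces the V × H embedding matrix with a V × E matrix followed by an E × H projection. A quick arithmetic sketch of the parameter savings (the vocabulary size and dimensions below are illustrative round numbers, not the paper's exact configurations):

```python
def embedding_params(vocab, hidden, factor=None):
    """Embedding parameter count: V*H for a standard embedding table,
    vs V*E + E*H when factorized through a small dimension E."""
    if factor is None:
        return vocab * hidden
    return vocab * factor + factor * hidden

# BERT-large-like numbers: 30k vocabulary, hidden size 1024, E = 128.
bert_style = embedding_params(30000, 1024)          # 30,720,000 parameters
albert_style = embedding_params(30000, 1024, 128)   # 3,971,072 parameters
```

Combined with cross-layer parameter sharing, this is how ALBERT reaches BERT-large-scale capacity with a fraction of the parameters.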
-
Title: Exploring the limits of transfer learning with a unified text-to-text transformer model family: T5 date created: 2019-10-01 organization: Google innovation: The main innovation of Google's T5 language model is its "text-to-text" framework, where various tasks are formulated as converting input text to output text. This unified approach allows T5 to achieve state-of-the-art performance on diverse tasks without task-specific modifications, simplifying training and deployment. This innovation enhances efficiency and effectiveness in real-world applications of large language models. pretraining architecture: Encoder/Decoder pretraining task: Span Corruption fine-tuning task: finetuning on downstream tasks one at a time training corpus: Colossal Clean Crawled Corpus optimizer: AdaFactor tokenization: sentencepiece number of parameters: 60M, 220M, 770M, 3B, and 11B maximum number of parameters (in million): 11000 hardware used: TPUv3 hardware information: we use a combination of model and data parallelism and train models on “slices” of Cloud TPU Pods. TPU Pods are multi-rack ML supercomputers that contain 1,024 TPU v3 chips connected via a high-speed 2D mesh interconnect with supporting CPU host machines. extension: Same as original Transformer with some additions such as relative positional embeddings like Transformer XL application: Diverse set of downstream tasks including machine translation, question answering, abstractive summarization, and text classification has source code: https://github.com/google-research/text-to-text-transfer-transformer, https://huggingface.co/docs/transformers/model_doc/t5 blog post: https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html license: Apache 2.0 research problem: Large Language Models (LLMs), transformer model
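The text-to-text framework casts every task as string in, string out by prepending a task prefix to the input. A sketch using prefixes in the style of the paper (the small template table here is illustrative, not T5's full set):

```python
def to_text_to_text(task, **fields):
    """Cast a task as plain text-in, text-out by prefixing a task name,
    the way T5 frames translation, summarization, classification, etc."""
    templates = {
        "translate": "translate English to German: {text}",
        "summarize": "summarize: {text}",
        "cola": "cola sentence: {text}",
    }
    return templates[task].format(**fields)

inp = to_text_to_text("translate", text="That is good.")
# Target for training would likewise be plain text, e.g. "Das ist gut."
```

Because inputs and targets are both strings, one model, one loss, and one decoding procedure cover all tasks.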
-
Title: Big Bird: Transformers for Longer Sequences model family: BERT date created: 2020-07-01 organization: Google innovation: BigBird introduces a sparse attention mechanism, allowing it to efficiently handle sequences up to 8 times longer than traditional models like BERT. It combines global, sliding window, and random attention patterns to capture both local and long-range dependencies. This innovation enables superior performance on various NLP tasks without sacrificing efficiency. pretraining architecture: Encoder pretraining task: Masked Language Modeling training corpus: Books, CC-News, Stories and Wikipedia tokenization: byte pair encoding number of parameters: Depends on the overall architecture extension: Big Bird can extend other architectures such as BERT, Pegasus, or RoBERTa by using a sparse attention mechanism that eliminates the quadratic dependency thus making it more suitable for longer sequences application: Particularly well suited for longer sequences, not only in text but also e.g. in genomics has source code: https://github.com/google-research/bigbird, https://huggingface.co/docs/transformers/model_doc/big_bird blog post: https://ai.googleblog.com/2021/03/constructing-transformers-for-longer.html, https://huggingface.co/blog/big-bird license: Open, Apache 2.0 research problem: Large Language Models (LLMs), transformer model
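BigBird's sparse attention is the union of three patterns: a few global tokens that attend everywhere, a sliding window around the diagonal, and a handful of random links per token. A minimal sketch that builds such a boolean attention mask (the window size and counts are illustrative, not the paper's defaults):

```python
import random

def bigbird_mask(n, window=1, n_global=1, n_random=1, seed=0):
    """Boolean n x n attention mask combining three sparse patterns:
    global tokens (attend everywhere, attended by everyone), a sliding
    window around the diagonal, and a few random links per row."""
    rng = random.Random(seed)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i < n_global or j < n_global:      # global attention
                mask[i][j] = True
            elif abs(i - j) <= window:            # sliding window
                mask[i][j] = True
        for j in rng.sample(range(n), n_random):  # random links
            mask[i][j] = True
    return mask

mask = bigbird_mask(8)
density = sum(sum(row) for row in mask) / 64  # fraction of allowed pairs
```

Because the number of allowed pairs per row is constant, attention cost grows linearly in sequence length instead of quadratically.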
-
Title: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale model family: BERT date created: 2020-10-01 organization: Google innovation: The Vision Transformer (ViT) applies Transformers, typically used in NLP, directly to image patches without image-specific biases. It excels when pre-trained on larger datasets, outperforming traditional convolutional models like ResNets. This approach challenges the dominance of convolutional architectures in computer vision, mirroring the Transformer's rise in NLP. pretraining architecture: Encoder pretraining task: image classification training corpus: From standard Imagenet to JFT-300M (large in-house dataset) optimizer: Adam optimizer number of parameters: 86M (Base) to 632M (Huge) maximum number of parameters (in million): 632 hardware used: Cloud TPUv3 hardware information: the ViT-L/16 model pre-trained on the public ImageNet-21k dataset could be trained using a standard cloud TPUv3 with 8 cores in approximately 30 days. extension: Extension of BERT architecture to train on patches of images application: image classification has source code: https://github.com/google-research/vision_transformer, https://huggingface.co/docs/transformers/model_doc/vit blog post: https://www.v7labs.com/blog/vision-transformer-guide license: N/A research problem: Large Language Models (LLMs), transformer model
-
Title: Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity model family: T5 date created: 2021-01-01 organization: Google innovation: The Switch Transformer introduces a sparsely-activated model approach, enhancing the Mixture of Experts (MoE) models by simplifying their routing algorithm and reducing computational costs. It enables training large models with lower precision formats like bfloat16 and achieves up to 7x faster pre-training speeds. This innovation pushes LLM boundaries, scaling up to trillion parameter models with significant efficiency gains. pretraining architecture: Encoder/Decoder pretraining task: denoising autoencoder training corpus: Colossal Clean Crawled Corpus number of parameters: 1T maximum number of parameters (in million): 1000000 hardware used: TPUv3 hardware information: All models are trained with the same amount of computation (32 cores) and on the same hardware (TPUv3). extension: Goal to increase parameter count while keeping FLOP operations constant by using efficient routing of MoE (Mixture of Experts) application: General language tasks (e.g. question answering) has source code: https://github.com/google-research/t5x, https://github.com/tensorflow/mesh/blob/master/mesh_tensorflow/transformer/moe.py blog post: https://www.alexanderthamm.com/en/blog/switch-transformer-upscaling-to-over-a-billion-parameters/ license: Open, Apache 2.0 research problem: Large Language Models (LLMs), transformer model
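The Switch Transformer's simplified routing sends each token to exactly one expert: the router's top-1 choice, with the expert output scaled by the router probability so gradients still flow through the gate. A minimal sketch of the routing decision (the router logits below are made-up values):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def switch_route(router_logits):
    """Top-1 routing: pick the single highest-probability expert and
    return its index plus the gate value used to scale its output."""
    probs = softmax(router_logits)
    expert = max(range(len(probs)), key=probs.__getitem__)
    return expert, probs[expert]

# One token's router logits over four experts.
expert, gate = switch_route([0.1, 2.0, -1.0, 0.5])
```

Routing to a single expert (rather than the top-k of classic MoE) keeps the FLOPs per token constant no matter how many experts, and hence parameters, the model has.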
-
Title: GLaM: Efficient Scaling of Language Models with Mixture-of-Experts model family: Transformer date created: 2021-12-01 organization: Google innovation: GLaM introduces a sparsely activated mixture-of-experts architecture, allowing it to scale to 1.2 trillion parameters while consuming only 1/3 of GPT-3's training energy. Despite its size, it achieves superior performance on 29 NLP tasks and is more energy-efficient than dense models like GPT-3. pretraining architecture: Decoder pretraining task: Causal language modeling training corpus: 1.6T tokens including web pages filtered for quality using a classifier trained on Wikipedia and books optimizer: AdaFactor tokenization: sentencepiece number of parameters: 1.2T across 64 experts, but only 96B get activated for inference maximum number of parameters (in million): 1200000 hardware used: cloud TPU-v4 hardware information: the GLaM (64B/64E) training after 600B tokens consumes 456 MWh, about 1/3 of the energy cost of 1287 MWh used by GPT-3. Moreover, to reach similar (and slightly exceeded) scores as GPT-3, we train using 1,024 TPU-v4 chips for 574 hours (with 280B tokens). This consumes 213 MWh or 1/6 of the GPT-3 energy cost. extension: GLaM introduces a Mixture of 64 Experts to increase parameter count and generalization properties in a somewhat standard decoder-only Transformer architecture. Only two experts get activated at a time per token, which makes the model also more efficient in training and inference. application: General language modeling - tested across 29 NLP tasks blog post: https://ai.googleblog.com/2021/12/more-efficient-in-context-learning-with.html license: closed source research problem: Large Language Models (LLMs), transformer model
-
Title: LaMDA: Language Models for Dialog Applications model family: LaMDA-PT date created: 2022-01-01 organization: Google innovation: LaMDA is a specialized dialog model that emphasizes safety and factual grounding. The model's innovation lies in its fine-tuning with annotated data and its ability to consult external knowledge sources. This approach aims to produce more accurate and safer dialog responses compared to traditional LLMs. pretraining architecture: Decoder pretraining task: Causal language modeling fine-tuning task: based on multi-turn crowdsourced dialog datasets, LaMDA-PT is finetuned in a mix of generative tasks that generate response given contexts, and discriminative tasks that evaluate quality and safety of a response in context training corpus: 1.56T words from public dialog data and other public web documents: 2.97B documents, 1.12B dialogs, and 13.39B dialog utterances tokenization: sentencepiece number of parameters: 137B maximum number of parameters (in million): 137000 hardware used: TPUv3 hardware information: LaMDA was pretrained on 1024 TPU-v3 chips for a total of about 57.7 days, and 256K tokens per batch extension: LaMDA focuses on how to improve safety, quality, and groundedness using different fine-tuning strategies application: General language modeling, such as translation, summarization, question and answers has source code: https://github.com/conceptofmind/LaMDA-rlhf-pytorch blog post: https://ai.googleblog.com/2022/01/lamda-towards-safe-grounded-and-high.html, https://blog.google/technology/ai/lamda/ license: closed source research problem: Large Language Models (LLMs), transformer model
-
Title: Finetuned language models are zero-shot learners model family: LaMDA-PT date created: 2022-02-08 organization: Google innovation: The primary innovation of FLAN in the context of Large Language Models is instruction tuning, where models are finetuned on datasets described via natural language instructions. This method significantly enhances zero-shot learning abilities, with FLAN outperforming the 175B GPT-3 on numerous tasks. The approach emphasizes human-like prompts over traditional model-specific prompts used in models like GPT-3 and T5. pretraining architecture: Decoder fine-tuning task: Instruction Tuning training corpus: FLAN is instruction tuned on 25 tasks spanning 62 datasets., LaMDA-PT is pretrained on a collection of web documents (including those with computer code), dialog data, and Wikipedia, tokenized into 2.49T BPE tokens with a 32k vocabulary optimizer: AdaFactor tokenization: sentencepiece number of parameters: 137B maximum number of parameters (in million): 137000 hardware used: TPUv3 hardware information: instruction tuning takes around 60 hours on a TPUv3 with 128 cores extension: Zero-shot task learning. The output space for a given task is either one of several classes (classification) or free text (generation). application: language understanding and generation tasks such as inference, sentiment analysis, paraphrase, closed-book QA, reading comprehension, coreference, summarization, translation, commonsense reasoning, and struct-to-text has source code: https://github.com/google-research/FLAN blog post: http://rylanschaeffer.github.io/blog_posts/2022-01-20-google-brain-flan.html, https://ai.googleblog.com/2021/10/introducing-flan-more-generalizable.html license: Apache 2.0 research problem: Large Language Models (LLMs), transformer model
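Instruction tuning turns each dataset example into a natural-language task description paired with a target string. A sketch of one such record in the spirit of FLAN's templates (the template wording and the NLI example below are illustrative, not the paper's exact templates):

```python
def instruction_example(instruction, options, target):
    """One instruction-tuning record: a natural-language task description
    with answer options, paired with the target string the model should
    generate."""
    prompt = instruction + "\n\nOPTIONS:\n" + "\n".join(f"- {o}" for o in options)
    return {"input": prompt, "target": target}

ex = instruction_example(
    'Premise: "A dog runs across the yard." '
    'Hypothesis: "An animal is moving." '
    "Does the premise entail the hypothesis?",
    ["yes", "it is not possible to tell", "no"],
    "yes",
)
```

Finetuning on many such records across dozens of tasks is what lets the model follow unseen instructions zero-shot.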
-
Title: PaLM: Scaling Language Modeling with Pathways model family: PaLM date created: 2022-04-01 organization: Google innovation: To demonstrate the first large-scale use of Pathways -- a new ML system which enables training a single model across thousands or tens of thousands of accelerator chips in a highly efficient manner. With Pathways, they trained a 540B parameter language model on 6144 TPU v4 chips at efficiency levels that could not be reached before for models of this scale. E.g., GPT-3 (175B), Gopher (280B), Megatron-Turing-NLG (530B). pretraining architecture: Decoder pretraining task: Causal language modeling training corpus: 780B tokens from multilingual social media conversations (50%), multilingual filtered webpages (27%), books in English (13%), code from Github (5%), multilingual Wikipedia (4%), and news in English (1%). Code includes 24 programming languages. optimizer: AdaFactor tokenization: sentencepiece number of parameters: 8B, 62B, and 540B maximum number of parameters (in million): 540000 hardware used: TPUv4 hardware information: PaLM 540B is trained over two TPU v4 Pods connected over data center network (DCN) using a combination of model and data parallelism. Each Pod has 3072 TPU v4 chips attached to 768 hosts. 
extension: PaLM uses a typical decoder-only transformer architecture, but adds quite a few extensions: SwiGLU activations, parallel layers, multi-query attention, RoPE embeddings, Shared Input-Output Embeddings, no biases, and a 256k SentencePiece vocabulary generated from the training data application: PaLM is designed as a general purpose language model with applicability to hundreds of different language tasks has source code: https://github.com/lucidrains/PaLM-pytorch blog post: https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/, https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html license: closed source research problem: Large Language Models (LLMs), transformer model
-
Title: Ul2: Unifying language learning paradigms model family: Transformer date created: 2022-05-01 organization: Google innovation: The paper introduces the UL2 model, a unified framework for pre-training in NLP, featuring a novel Mixture-of-Denoisers (MoD) objective. This objective smoothly integrates various pre-training paradigms, such as span corruption and prefix language modeling. Additionally, UL2 introduces dynamic "mode switching" between different denoisers and showcases superior performance across diverse NLP tasks. pretraining architecture: Encoder/Decoder pretraining task: Mixture-of-Denoisers, which combines diverse pretraining paradigms together training corpus: 1 trillion tokens on C4 optimizer: AdaFactor tokenization: sentencepiece number of parameters: 20B maximum number of parameters (in million): 20000 hardware used: TPUv4 hardware information: We use a batch size of 1024 and 512 TPUv4 chips for pretraining this model. UL20B is trained with Jax and T5X infrastructure. We release and open source T5X-based model checkpoints of this 20B model extension: UL2-20B (Unifying Language Learning) can be interpreted as a model that is quite similar to T5 but trained with a different objective and slightly different scaling knobs. application: A unified framework for pre-training models that are universally effective across datasets and setups. has source code: https://github.com/google-research/google-research/tree/master/ul2 blog post: https://blog.research.google/2022/10/ul2-20b-open-source-unified-language.html license: Open, Apache 2.0 research problem: Large Language Models (LLMs), transformer model
-
Title: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding model family: Diffusion models, CLIP, T5 date created: 2022-06-01 organization: Google innovation: The "Imagen" model innovatively merges transformer language models with high-fidelity diffusion techniques to produce photorealistic images from text descriptions. This demonstrates that embeddings from text-only pretrained large language models are highly effective for text-to-image synthesis. pretraining architecture: T5 (or CLIP or BERT) for frozen text encoder + U-net architecture for cascaded diffusion models for text to image pretraining task: image/text pair prediction training corpus: a combination of internal datasets, with ≈460M image-text pairs, and the publicly available LAION dataset, with ≈400M image-text pairs optimizer: AdaFactor number of parameters: 2B maximum number of parameters (in million): 2000 hardware used: TPUv4 hardware information: use 256 TPU-v4 chips for our base 64 x 64 model, and 128 TPU-v4 chips for both super-resolution models extension: Imagen adds a few extensions to the U-net diffusion architecture (pooled embedding vector, cross attention over text embeddings, and Layer Normalizations) application: Text to image blog post: https://imagen.research.google/ license: closed source research problem: Large Language Models (LLMs), transformer model
-
Title: Solving Quantitative Reasoning Problems with Language Models model family: PaLM date created: 2022-06-01 organization: Google pretraining architecture: Decoder pretraining task: Causal language modeling training corpus: Same as PaLM + 118GB dataset of scientific papers from the arXiv preprint server and web pages that contain mathematical expressions using LaTeX, MathJax, or other mathematical typesetting formats number of parameters: 540B maximum number of parameters (in million): 540000 extension: Extends PaLM by fine-tuning on the mathematical dataset application: Mathematical reasoning blog post: https://ai.googleblog.com/2022/06/minerva-solving-quantitative-reasoning.html license: closed source research problem: Large Language Models (LLMs), transformer model
-
Title: Scaling instruction-finetuned language models model family: T5 date created: 2022-11-01 organization: Google innovation: This paper explores instruction finetuning with a particular focus on (1) scaling the number of tasks (1.8K fine-tuning tasks), (2) scaling the model size, and (3) finetuning on chain-of-thought data. This approach is compatible with various model sizes and architectures, with Flan-T5 models notably outperforming baseline T5 models. pretraining architecture: Encoder/Decoder pretraining task: Span Corruption fine-tuning task: Instruction Tuning training corpus: Flan finetuned with tasks in Muffin, T0-SF, NIV2, and CoT optimizer: AdaFactor number of parameters: 80M (Flan-T5-Small), 250M (Flan-T5-Base), 780M (Flan-T5-Large), 3B (Flan-T5-XL), and 11B (Flan-T5-XXL). maximum number of parameters (in million): 11000 hardware used: TPUv3 extension: instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data application: The primary use is to understand how to improve large language models with the right kind of instruction fine-tuning. The focus is research on zero-shot and in-context few-shot learning NLP tasks, such as reasoning and question answering; advancing fairness and safety research, and understanding limitations of current large language models has source code: https://github.com/google-research/t5x, https://huggingface.co/docs/transformers/model_doc/flan-t5 blog post: https://ai.googleblog.com/2023/02/the-flan-collection-advancing-open.html license: Apache 2.0 research problem: Large Language Models (LLMs), transformer model
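The core data-side move in Flan-style instruction tuning is rendering each supervised example through a natural-language instruction template, optionally with a chain-of-thought trigger. The templates and task names below are invented for illustration; the Flan collection uses many templates per task:

```python
# Illustrative instruction templates (not the actual Flan templates).
TEMPLATES = {
    "nli": "Premise: {premise}\nHypothesis: {hypothesis}\nDoes the premise entail the hypothesis?",
    "qa": "Answer the following question.\nQ: {question}\nA:",
}

def to_instruction(task, example, chain_of_thought=False):
    """Render a raw supervised example as an instruction-style prompt."""
    prompt = TEMPLATES[task].format(**example)
    if chain_of_thought:
        # CoT mixtures append a reasoning trigger and train on rationales.
        prompt += "\nLet's think step by step."
    return prompt

p = to_instruction("qa", {"question": "What is 2 + 2?"}, chain_of_thought=True)
```

Scaling then happens along two axes the paper studies: the number of such task templates in the mixture, and the size of the model being finetuned.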
-
Title: Scaling instruction-finetuned language models model family: PaLM date created: 2022-11-01 organization: Google innovation: The paper introduced an extended instruction fine-tuning for the Flan-PaLM model, scaling it to a 540B-parameter size and 1.8K fine-tuning tasks. They incorporated chain-of-thought (CoT) data, which enhanced performance across evaluations. This approach is compatible with various model sizes and architectures. pretraining architecture: Decoder pretraining task: Causal language modeling fine-tuning task: Instruction Tuning training corpus: Flan finetuned with tasks in Muffin, T0-SF, NIV2, and CoT optimizer: AdaFactor number of parameters: 8B, 62B, 540B maximum number of parameters (in million): 540000 hardware used: TPUv4 hardware information: use 0.2% of the pre-training compute to instruction-finetune Flan-PaLM 540B (approximately 512 v4 TPU chips for 37 hours) extension: Flan-PaLM is generated by "Flan Finetuning" the PaLM models: (1) scaling the number of tasks to 1,836, (2) scaling the model size, and (3) finetuning on chain-of-thought data. application: Same as Flan-T5. The goal is to show Flan finetuning can even improve on the largest Google LMs (+9.4% improvement average across tasks), with improvements to chain of thought, self consistency, multilingual tasks, arithmetic reasoning license: closed source research problem: Large Language Models (LLMs), transformer model
-
Title: Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context date created: 2019-01-01 organization: Google, CMU innovation: Transformer-XL introduces a segment-level recurrence mechanism and a novel positional encoding scheme to overcome the fixed-length context limitations of traditional Transformers. This allows it to capture dependencies 80% longer than RNNs and 450% longer than vanilla Transformers, addressing context fragmentation and improving efficiency in language modeling. pretraining architecture: Decoder pretraining task: Causal language modeling training corpus: Different training datasets depending on experiments, but baseline is Wikitext-103 tokenization: byte pair encoding number of parameters: 151M maximum number of parameters (in million): 151 hardware information: state-of-the-art results reported in the paper were obtained by training the model on a large-scale TPU cluster extension: Relative positional embeddings enable longer-context attention when compared to vanilla Transformer model application: General language tasks has source code: https://github.com/chiayewken/transformer_xl, https://huggingface.co/docs/transformers/model_doc/transfo-xl blog post: https://ai.googleblog.com/2019/01/transformer-xl-unleashing-potential-of.html license: N/A research problem: Large Language Models (LLMs), transformer model
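The segment-level recurrence can be sketched with a toy cache. In the real model the cached items are hidden-state tensors per layer, excluded from backpropagation; plain Python lists of token ids stand in here, and the class name is invented for illustration:

```python
class SegmentMemory:
    """Toy sketch of Transformer-XL segment-level recurrence: each new
    segment attends over cached states from earlier segments, so the
    effective context grows beyond the segment length without
    recomputation."""

    def __init__(self, mem_len):
        self.mem_len = mem_len  # how many past positions to cache
        self.mem = []

    def forward(self, segment):
        # Keys/values visible to attention = cached context + current segment.
        context = self.mem + segment
        # ... self-attention over `context` would happen here ...
        # Update the cache (in the paper, with stop-gradient), keeping
        # only the most recent mem_len positions.
        self.mem = (self.mem + segment)[-self.mem_len:]
        return context

m = SegmentMemory(mem_len=4)
m.forward([1, 2, 3])
ctx = m.forward([4, 5, 6])  # sees the previous segment through the cache
```

This is what lets the second segment condition on tokens from the first, avoiding the context fragmentation of a vanilla fixed-window Transformer.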
-
Title: XLNet: Generalized Autoregressive Pretraining for Language Understanding model family: Transformer XL date created: 2019-05-01 organization: Google, CMU innovation: XLNet introduces a generalized autoregressive pretraining method that captures bidirectional context by considering all possible permutations of the factorization order. This approach overcomes BERT's limitations related to data corruption and token independence. Additionally, XLNet integrates techniques from Transformer-XL and offers architectural improvements for permutation-based modeling. pretraining architecture: Decoder pretraining task: Causal language modeling training corpus: Same as BERT + Giga5 (16GB text), an aggressively filtered ClueWeb 2012-B (19GB), and Common Crawl (110 GB) optimizer: Adam weight decay optimizer number of parameters: Base=117M, Large=360M maximum number of parameters (in million): 360 hardware used: TPUv3 hardware information: train on 512 TPU v3 chips for 500K steps with an Adam weight decay optimizer, linear learning rate decay, and a batch size of 8192, which takes about 5.5 days extension: This model basically adapts Transformer XL architecture to permutation-based LM application: General language tasks has source code: https://huggingface.co/docs/transformers/model_doc/xlnet blog post: https://towardsdatascience.com/xlnet-explained-in-simple-terms-255b9fb2c97c license: Open, MIT license research problem: Large Language Models (LLMs), transformer model
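The permutation-language-modeling idea reduces to an attention mask derived from a sampled factorization order: position i may attend to position j only if j precedes i in the permutation. The sketch below builds that mask (content stream only; XLNet's two-stream query/content mechanism is omitted):

```python
import random

def permutation_mask(n, rng):
    """Attention mask for one sampled factorization order (sketch).
    mask[i][j] is True iff position i may attend to position j, i.e.
    j comes earlier than i in the sampled permutation."""
    order = list(range(n))
    rng.shuffle(order)                              # sample a factorization order
    rank = {pos: k for k, pos in enumerate(order)}  # position -> place in order
    mask = [[rank[j] < rank[i] for j in range(n)] for i in range(n)]
    return mask, order
```

Averaged over permutations, every position eventually conditions on every other, which is how XLNet gets bidirectional context from an autoregressive objective.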
-
Title: ERNIE: Enhanced Language Representation with Informative Entities model family: BERT date created: 2019-05-01 organization: Pengcheng Lab, Baidu innovation: ERNIE innovatively incorporates knowledge from knowledge graphs (KGs) into language representation models. It fuses lexical, syntactic, and knowledge information, enabling enhanced performance on knowledge-driven tasks. This approach sets ERNIE apart from traditional models like BERT, which primarily rely on textual context. pretraining architecture: Encoder pretraining task: Masked Language Modeling training corpus: English Wikipedia + Wikidata for entities (note that they initialize the model to the original BERT parameter values) optimizer: Adam optimizer number of parameters: Ernie-ViLG 2.0 = 10B, Ernie 3.0 Titan = 260B maximum number of parameters (in million): 260000 extension: Uses BERT for Encoder architecture, but stacks and aggregates two of them for text and entities. This architecture could be understood as BERT for text + knowledge graphs application: Knowledge intensive related tasks that might benefit from knowledge graphs or entities such as entity recognition has source code: https://github.com/thunlp/ERNIE blog post: http://research.baidu.com/Blog/index-view?id=160 license: closed source research problem: Large Language Models (LLMs), transformer model
-
Title: RoBERTa: A Robustly Optimized BERT Pretraining Approach model family: BERT date created: 2019-07-01 organization: Google, University of Washington innovation: The work introduced RoBERTa, an improved version of BERT, by optimizing design choices: extending training duration, removing the next-sentence-prediction objective, and dynamically altering the masking pattern. These modifications led RoBERTa to achieve state-of-the-art results on benchmarks like GLUE, RACE, and SQuAD, emphasizing the importance of refining pretraining strategies in Large Language Models. pretraining architecture: Encoder pretraining task: Masked Language Modeling training corpus: Same as BERT + CC News + OpenWebText + Stories (~33B Tokens) optimizer: Adam optimizer number of parameters: 125M Base, and 356M Large maximum number of parameters (in million): 356 extension: Extension of BERT with optimized training procedure and more data application: Same as BERT has source code: https://github.com/facebookresearch/fairseq/tree/main/examples/roberta, https://huggingface.co/docs/transformers/model_doc/roberta blog post: https://ai.facebook.com/blog/roberta-an-optimized-method-for-pretraining-self-supervised-nlp-systems/ license: N/A research problem: Large Language Models (LLMs), transformer model
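The "dynamically altering the masking pattern" change is simple to sketch: instead of fixing one mask per sequence at preprocessing time (BERT's static masking), a fresh mask is drawn every time the sequence is fed to the model. The 80/10/10 replacement split of the original recipe is omitted here for brevity:

```python
import random

MASK, RATE = "[MASK]", 0.15

def dynamic_mask(tokens, rng):
    """RoBERTa-style dynamic masking sketch: draw a fresh random mask on
    every pass over the data rather than reusing one static mask."""
    return [MASK if rng.random() < RATE else t for t in tokens]

rng = random.Random(42)
toks = "the cat sat on the mat".split()
epoch1 = dynamic_mask(toks, rng)
epoch2 = dynamic_mask(toks, rng)  # a different mask on the next pass
```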
-
Title: CTRL: A Conditional Transformer Language Model for Controllable Generation date created: 2019-09-01 organization: Salesforce innovation: The main innovation of CTRL is conditioning a Transformer language model on control codes that specify domain, style, topics, and task-specific behavior, making text generation explicitly controllable rather than purely prompt-driven. pretraining architecture: Decoder pretraining task: Causal language modeling training corpus: 140 GB of text including: Wikipedia (En, De, Es, Fr), Project Gutenberg, 45 subreddits, OpenWebText2, Amazon Reviews, Europarl and UN data from WMT, question-answer pairs from ELI5, and the MRQA shared task, which includes the Stanford Question Answering Dataset, NewsQA, TriviaQA, SearchQA, HotpotQA, and Natural Questions optimizer: Adagrad optimizer tokenization: fastBPE number of parameters: 1.63B maximum number of parameters (in million): 1630 hardware used: Cloud TPUv3 extension: model can generate text conditioned on control codes that specify domain, style, topics, dates, entities, relationships between entities, plot points, and task-related behavior application: Controllable text generation has source code: https://github.com/salesforce/ctrl, https://huggingface.co/docs/transformers/model_doc/ctrl blog post: https://blog.salesforceairesearch.com/introducing-a-conditional-transformer-language-model-for-controllable-generation/ license: Open, BSD-3-Clause license research problem: Large Language Models (LLMs), transformer model
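Mechanically, a control code is just a reserved token prepended to the sequence, so the model learns p(text | code). The sketch below shows that prompt construction; the particular code set here is illustrative, not CTRL's full list:

```python
# Illustrative subset of CTRL-style control codes (not the full set).
CONTROL_CODES = {"Wikipedia", "Books", "Reviews", "Horror", "Questions"}

def build_prompt(code, text=""):
    """Prepend a control code to the prompt, as in CTRL-style conditioning.
    During training the model sees every sequence prefixed this way, so at
    inference the code steers domain and style."""
    if code not in CONTROL_CODES:
        raise ValueError(f"unknown control code: {code}")
    return f"{code} {text}".rstrip()

prompt = build_prompt("Horror", "It was a dark")
```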
-
Title: Highly accurate protein structure prediction with AlphaFold model family: SE(3)-Transformer date created: 2021-07-01 organization: Deepmind innovation: The main innovation of "Highly accurate protein structure prediction with AlphaFold" is the creation of AlphaFold, a deep learning model that accurately predicts protein structures. This extends the capabilities of Large Language Models by showcasing their potential to solve complex scientific challenges beyond language-related tasks, marking a significant advancement in the intersection of machine learning and biology. pretraining architecture: Encoder pretraining task: Protein structure prediction from amino-acid sequences training corpus: 170,000 proteins from a public repository of protein sequences and structures number of parameters: Base = 12M, Large = 18M, XLarge = 60M maximum number of parameters (in million): 60 extension: The original Alphafold used a BERT-style transformer. The details of Alphafold’s Transformer are not known, but it is believed it is an extension of the SE(3)-Transformer, a 3-D equivariant Transformer (see this blog post). application: Protein folding has source code: https://github.com/deepmind/alphafold blog post: https://www.deepmind.com/publications/highly-accurate-protein-structure-prediction-with-alphafold, https://fabianfuchsml.github.io/alphafold2/ license: Apache 2.0 research problem: Large Language Models (LLMs), transformer model
-
Title: Scaling Language Models: Methods, Analysis & Insights from Training Gopher model family: GPT date created: 2021-12-01 organization: Deepmind innovation: The paper "Scaling Language Models: Methods, Analysis & Insights from Training Gopher" emphasizes the benefits and limitations of scaling LLMs. It highlights significant performance advances with data quality and scale but notes uneven gains across tasks, especially in mathematical reasoning. The study also delves into the impact of scale on toxicity and bias in model outputs. pretraining architecture: Decoder pretraining task: Causal language modeling training corpus: Massive Text (2.35 billion documents, or about 10.5 TB of text including Massive Web, Books, Github, News, C4, and Wikipedia). optimizer: Adam optimizer tokenization: sentencepiece number of parameters: 44M, 117M, 417M, 1.4B, 7.1B, and 280B maximum number of parameters (in million): 280000 hardware used: TPUv3 hardware information: We built our training and evaluation codebase with JAX and Haiku. In particular, we use JAX’s pmap transformation to efficiently express both data and model parallelism. We trained and evaluated all models on TPUv3 chips. The half-precision parameters and single-precision Adam state for Gopher occupy 2.5 TiB, which far exceeds the 16 GiB of memory available on each TPUv3 core. To address these memory concerns, we use optimiser state partitioning, model parallelism, and rematerialisation to partition the model state and reduce the activations so that they fit in TPU memory. extension: Same as GPT-2 but uses RMSNorm instead of LayerNorm and relative positional encoding rather than absolute application: Mostly Language Modeling and NLU, but also extensible like GPT blog post: https://www.deepmind.com/blog/language-modelling-at-scale-gopher-ethical-considerations-and-retrieval license: closed source research problem: Large Language Models (LLMs), transformer model
-
Title: Training Compute-Optimal Large Language Models model family: GPT date created: 2022-03-01 organization: Deepmind innovation: The paper "Training Compute-Optimal Large Language Models" introduces a methodology to optimally balance model size and training tokens under a fixed compute budget. The findings emphasize equal scaling of parameters and training tokens, highlighting the importance of high-quality dataset scaling. The research also underscores ethical concerns with training on vast datasets sourced from the web. pretraining architecture: Decoder pretraining task: Causal language modeling training corpus: 1.4 trillion training tokens. Massive Text (2.35 billion documents, or about 10.5 TB of text including Massive Web, Books, Github, News, C4, and Wikipedia). optimizer: AdamW tokenization: a slightly modified SentencePiece (Kudo and Richardson, 2018) tokenizer that does not apply NFKC normalisation number of parameters: 70B maximum number of parameters (in million): 70000 hardware used: TPUv4, TPUv3 hardware information: All models in this analysis have been trained on TPUv3/TPUv4 with JAX and Haiku. extension: Same as Gopher but with optimizations to reduce model size and therefore training/inference time with equal or superior performance application: Same as Gopher/GPT3 blog post: https://towardsdatascience.com/a-new-ai-trend-chinchilla-70b-greatly-outperforms-gpt-3-175b-and-gopher-280b-408b9b4510, https://medium.com/mlearning-ai/language-models-need-proper-training-c71484727f00 license: closed source research problem: Large Language Models (LLMs), transformer model
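The paper's headline rule of thumb can be captured in a tiny calculator. Using the common approximation that training compute is C ≈ 6·N·D (N parameters, D tokens), and the Chinchilla finding that parameters and tokens should scale equally (roughly D ≈ 20·N), a fixed FLOP budget pins down both. The constants below are approximations, not the paper's fitted coefficients:

```python
import math

def compute_optimal(flops):
    """Chinchilla-style sizing sketch: C ≈ 6·N·D with D ≈ 20·N,
    so N ≈ sqrt(C / 120) and D = 20·N. Constants are approximate."""
    n_params = math.sqrt(flops / (6 * 20))
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Chinchilla itself: 70B parameters paired with about 1.4T training tokens.
n, d = compute_optimal(6 * 70e9 * 1.4e12)
```

This is the arithmetic behind the claim that Gopher-sized compute was better spent on a smaller model trained on far more tokens.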
-
Title: Teaching language models to support answers with verified quotes model family: Gopher date created: 2022-03-01 organization: Deepmind innovation: GopherCite, in the context of large language models, is designed to support its answers with verified quotes from sources. It employs reinforcement learning with unique training techniques and can decline to answer questions if uncertain about the quality of its response. This approach differentiates it from other models by emphasizing evidence-backed answers and selective response generation. pretraining architecture: Decoder pretraining task: Causal language modeling training corpus: Same as Gopher plus specific dataset generated in the RLHP process optimizer: AdaFactor tokenization: sentencepiece number of parameters: 280B maximum number of parameters (in million): 280000 hardware used: TPUv3 hardware information: shard the networks across 128 TPU v3 machines. extension: GopherCite is based on Gopher but adds a step using RLHP (Reinforcement Learning from Human Preferences) to learn whether a response is not only plausible but also supported application: Dialog systems, Q&A, general language generation tasks blog post: https://www.deepmind.com/blog/gophercite-teaching-language-models-to-support-answers-with-verified-quotes license: closed source research problem: Large Language Models (LLMs), transformer model
-
Title: Flamingo: a Visual Language Model for Few-Shot Learning model family: Chinchilla date created: 2022-04-01 organization: Deepmind innovation: Flamingo is a Visual Language Model designed to bridge powerful pretrained vision-only and language-only models, allowing it to handle sequences of interleaved visual and textual data. It's trained on large-scale multimodal web corpora with interleaved text and images, enabling rapid adaptation to new tasks using few-shot learning. This approach allows Flamingo to outperform models fine-tuned on significantly more task-specific data. pretraining architecture: Decoder pretraining task: Log likelihood of text given some visual input training corpus: MultiModal MassiveWeb (M3W): 185 million images and 182 GB text + a number of text paired with image datasets: ALIGN + LTIP (Long Text & Image Pairs) = 312 million images, and VTP (Video & Text Pairs) = 27 million short videos (approximately 22 seconds on average) optimizer: AdamW number of parameters: 80B (largest) maximum number of parameters (in million): 80000 hardware used: TPUv4 hardware information: Our model and associated infrastructure were implemented using JAX and Haiku. All training and evaluation was performed on TPUv4 instances. The largest model containing 80 billion parameters is trained on 1536 chips for 15 days and sharded across 16 devices. Megatron type sharding is used to enable 16-way model parallelism for all Embedding / Self-Attention / Cross-Attention / FFW layers, while the NFNet vision layers were unsharded. ZeRO stage 1 is used to shard the optimizer state. 
extension: It uses a frozen textual language model (like Chinchilla) conditioned on the visual representation, which is encoded from a Normalizer-Free ResNet application: Vision-language tasks such as visual question answering, image and video captioning, and multimodal dialogue has source code: https://github.com/lucidrains/flamingo-pytorch blog post: https://medium.com/geekculture/3-overlooked-things-deepminds-flamingo-a-large-model-for-computer-vision-84cd9d2f738c, https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model license: closed source research problem: Large Language Models (LLMs), transformer model
-
Title: A Generalist Agent model family: “Control Transformers” (not per se a family, but grouping here those transformers that try to model more general control, RL-like, tasks) date created: 2022-05-01 organization: Deepmind innovation: Gato is a multi-modal, multi-task agent inspired by Large Language Models, capable of handling diverse tasks like playing games, captioning images, and robotics using a single neural network. It employs a unique tokenization approach to process varied data types, from text to images. This innovation allows Gato to generalize across a vast range of tasks, setting a new standard in the realm of LLMs. pretraining architecture: Decoder pretraining task: CLM (where tokens are either text or agent actions) training corpus: 1.5T tokens including standard text (e.g. MassiveText), vision (e.g. ALIGN), and simulation environments (e.g. ALE Atari, or RGB Stacking Real Robot) optimizer: Adam optimizer tokenization: sentencepiece number of parameters: 79M, 364M, and 1.18B maximum number of parameters (in million): 1180 extension: The standard decoder-only transformer architecture is preceded by an embedding layer that can embed text and images, plus add position encodings to add spatial information when applicable. application: Gato presents a generalizable agent that can be used beyond text to tasks such as playing Atari or controlling a robot arm. has source code: https://github.com/OrigamiDream/gato blog post: https://www.deepmind.com/blog/a-generalist-agent, https://www.deepmind.com/publications/a-generalist-agent license: closed source research problem: Large Language Models (LLMs), transformer model
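Gato's "unique tokenization approach" amounts to flattening every modality into one token sequence: text uses subword ids, while continuous observations and actions are squashed and discretized into 1024 bins whose ids are offset past the text vocabulary. The sketch below uses plain clipping and uniform bins; the real model applies mu-law companding before binning, and the vocabulary size here is an illustrative placeholder:

```python
# Illustrative constants: text vocab size is a placeholder, 1024 bins as in Gato.
TEXT_VOCAB, BINS = 32000, 1024

def tokenize_continuous(x):
    """Map a continuous value (assumed already squashed into [-1, 1]) to a
    discrete token id in a range disjoint from the text vocabulary."""
    x = max(-1.0, min(1.0, x))                   # clip (Gato uses mu-law first)
    b = min(int((x + 1) / 2 * BINS), BINS - 1)   # uniform bin index in [0, 1023]
    return TEXT_VOCAB + b                        # offset past text token ids

def detokenize_continuous(tok):
    """Recover an approximate continuous value: the centre of the bin."""
    b = tok - TEXT_VOCAB
    return (b + 0.5) / BINS * 2 - 1
```

With text, images, and actions all living in one id space, a single decoder-only Transformer can be trained autoregressively across every task.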
-
Title: Improving alignment of dialogue agents via targeted human judgements model family: GPT date created: 2022-09-01 organization: Deepmind innovation: Sparrow, developed by DeepMind, introduces innovations in the context of LLMs by utilizing targeted human judgments for alignment. It employs a unique approach of using a single external knowledge fragment for evidence and focuses on breaking down goals into detailed rules, enhancing the model's helpfulness and accuracy in responses. The model also integrates techniques like self-play, search, and fine-grained rules to shape its behavior. pretraining architecture: Decoder pretraining task: Causal language modeling fine-tuning task: Supervised fine-tuning (SFT) training corpus: Same as Chinchilla + interactive data gathering with human annotators during the RLHF process optimizer: AdaFactor number of parameters: 70B maximum number of parameters (in million): 70000 hardware used: TPUv3 hardware information: shard the models across 64 TPU v3 machines extension: Starts from the Chinchilla 70B model but adds RLHF (Reinforcement Learning with Human Feedback). It also adds inline evidence a la GopherCite application: Dialog agents and general language generation applications like Q&A blog post: https://www.deepmind.com/blog/building-safer-dialogue-agents, https://medium.com/to-cut-a-long-paper-short/sparrow-improving-alignment-of-dialogue-agents-via-targeted-human-judgments-e0876402d800 license: closed source research problem: Large Language Models (LLMs), transformer model
-
Title: BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension model family: BERT for encoder, GPT for Decoder date created: 2019-10-01 organization: Facebook innovation: The main innovation of the work is the introduction of BART, a pre-training approach that combines bidirectional and auto-regressive modeling. BART excels in generating coherent text, making it particularly effective for tasks like summarization, leveraging denoising tasks for multi-task learning, and achieving state-of-the-art results in text generation, especially abstractive summarization. pretraining architecture: Encoder/Decoder pretraining task: denoising autoencoder training corpus: Same as RoBERTa (160Gb of news, books, stories, and web text) number of parameters: Base = 140M, Large = 400M. In general, roughly 10% larger than equivalently sized BERT models maximum number of parameters (in million): 400 extension: It can be seen as a generalization of BERT and GPT in that it combines ideas from both in the encoder and decoder application: Mostly text generation but also some text understanding tasks has source code: https://huggingface.co/docs/transformers/model_doc/bart license: Open, Apache 2.0 research problem: Large Language Models (LLMs), transformer model
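Two of BART's denoising corruptions are easy to sketch. Text infilling replaces a whole span with a single mask token (so the model must also infer how many tokens are missing), and sentence permutation shuffles sentence order. The span length is fixed here for simplicity; the paper samples it from a Poisson distribution:

```python
import random

def text_infill(tokens, rng, span_len=3):
    """BART-style text infilling sketch: replace a contiguous span with a
    single <mask> token (the paper draws span lengths from Poisson(3))."""
    i = rng.randrange(len(tokens) - span_len)
    return tokens[:i] + ["<mask>"] + tokens[i + span_len:]

def sentence_permute(sentences, rng):
    """Sentence permutation: shuffle full sentences into a random order."""
    out = sentences[:]
    rng.shuffle(out)
    return out
```

The encoder reads the corrupted input bidirectionally and the autoregressive decoder reconstructs the original text, which is what "generalization of BERT and GPT" refers to.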
-
Title: Unsupervised Cross-lingual Representation Learning at Scale model family: RoBERTa date created: 2019-10-01 organization: Facebook innovation: The main innovation of the work is the development of the XLM-R model, trained on data from 100 languages, achieving superior performance on cross-lingual tasks. The research highlights the challenges of scaling multilingual models and suggests that increasing model capacity can address some limitations. This approach is especially beneficial for low-resource languages. pretraining architecture: Encoder pretraining task: Masked Language Modeling training corpus: Cleaned Common Crawl in 100 languages tokenization: sentencepiece number of parameters: Base = 270M Large = 550M maximum number of parameters (in million): 550 hardware used: NVIDIA V100 (32GB) GPUs hardware information: train the XLM-R model for 1.5 Million updates on five-hundred 32GB Nvidia V100 GPUs with a batch size of 8192. extension: An extension of RoBERTa that introduces small parameter tuning insights in the context of multilingual applications application: Translation and other cross-lingual language tasks has source code: https://huggingface.co/docs/transformers/model_doc/xlm-roberta blog post: https://ai.meta.com/blog/-xlm-r-state-of-the-art-cross-lingual-understanding-through-self-supervision/ research problem: Large Language Models (LLMs), transformer model
-
Title: Multilingual Denoising Pre-training for Neural Machine Translation model family: BART date created: 2020-01-01 organization: Facebook innovation: mBART introduces a multilingual denoising pre-training method for neural machine translation using a sequence-to-sequence auto-encoder model. Unlike other models like XLM and MASS, mBART pre-trains both the encoder and decoder, making it more adaptable for translation tasks. The model's versatility is further showcased by its ability to handle various levels of multilinguality, from monolingual to 25 languages. pretraining architecture: Encoder/Decoder pretraining task: denoising autoencoder training corpus: CC25 Corpus includes 25 monolingual corpora in different languages. Largest corpora are English (300 GB) and Russian (280GB) optimizer: Adam optimizer tokenization: sentencepiece number of parameters: Same as BART hardware used: NVIDIA V100 (32GB) GPUs hardware information: The full model (including 25 languages) is trained on 256 Nvidia V100 GPUs (32GB) for 500K steps. The total batch size is around 128K tokens per GPU, matching the BART configuration. extension: Extends BART's denoising pre-training to the multilingual setting, pre-training both encoder and decoder (unlike XLM or MASS) on monolingual corpora in 25 languages. application: Translation has source code: https://github.com/facebookresearch/fairseq/tree/main/examples/mbart, https://huggingface.co/docs/transformers/model_doc/mbart blog post: https://medium.com/syncedreview/facebook-ai-mbart-the-tower-of-babels-silicon-solution-610dfb494f98 license: Open, MIT license research problem: Large Language Models (LLMs), transformer model
-
Title: HTLM: Hyper-Text Pre-Training and Prompting of Language Models model family: BART date created: 2021-07-01 organization: Facebook innovation: The HTLM model's primary innovation is its ability to directly model hyper-text (HTML) from web crawls, enabling structured prompting and auto-prompting. This approach leverages the inherent structure of HTML for tasks like zero-shot summarization and offers improved performance and data efficiency compared to traditional Large Language Models. pretraining architecture: Encoder/Decoder pretraining task: denoising autoencoder training corpus: 23TB of simplified HTML extracted from CommonCrawl optimizer: Adam optimizer number of parameters: 400M maximum number of parameters (in million): 400 hardware information: We trained our augmented BART model for a total of 330,000 steps on 256 GPUs with an effective batch size of 8192. extension: As opposed to BART, they don’t do sentence shuffling application: General purpose language model that allows structured HTML prompting license: N/A research problem: Large Language Models (LLMs), transformer model
-
Title: CM3: A Causal Masked Multimodal Model of the Internet model family: HTLM date created: 2022-01-01 organization: Facebook innovation: The CM3 model introduces a causally masked approach for generative modeling, blending the strengths of causal and masked language models. Trained on large-scale multi-modal documents, it can generate rich structured outputs and offers impressive zero-shot capabilities across text and image tasks. This innovation positions CM3 as a powerful advancement in the realm of Large Language Models. pretraining architecture: Decoder pretraining task: Causal language modeling training corpus: CC-News, English Wikipedia optimizer: Adam optimizer number of parameters: 125M, 800M, 2.7B, and 13B maximum number of parameters (in million): 13000 hardware used: NVIDIA A100 GPU, Nvidia V100 GPU hardware information: HTLM-Medium was trained on 240 V100 GPU for 28 days, while HTLM-Large was trained on 384 A100 GPU for 24 days. extension: This is somewhat similar to HTLM in its use of structured training data. However, it is a different architecture and uses causal masking, which makes the model predict, at the end of the sequence, an entire missing span of text. It also includes image input via Vector Quantized Variational Autoencoding (VQ-VAE) tokens. application: Multimodal language model with the ability to do structured prompting, zero-shot captioning, image generation, and entity linking (via target text prediction of hyperlinks) blog post: https://lilianweng.github.io/posts/2022-06-09-vlm/ license: N/A research problem: Large Language Models (LLMs), transformer model
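The causally masked objective described in the extension field (predicting a missing span at the end of the sequence) can be sketched directly. Span selection here is uniform for simplicity, and the `<mask:0>` sentinel name is illustrative:

```python
import random

def causal_mask(tokens, rng):
    """CM3-style causally masked objective sketch: cut out one contiguous
    span, replace it with a sentinel, and append sentinel + span at the
    end, so a left-to-right decoder can still infill the middle."""
    n = len(tokens)
    i = rng.randrange(n - 1)          # span start
    j = rng.randrange(i + 1, n)       # span end (exclusive)
    span = tokens[i:j]
    return tokens[:i] + ["<mask:0>"] + tokens[j:] + ["<mask:0>"] + span
```

This rotation is what lets a purely causal decoder get bidirectional-style infilling: at generation time the span is conditioned on both its left and right context.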
-
Title: Language Models that Seek for Knowledge: Modular Search & Generation for Dialogue and Prompt Completion model family: GPT (but can extend any family) date created: 2022-03-01 organization: Facebook innovation: The main innovation of the model, SeeKeR, is its modular approach that combines internet search, knowledge generation, and response generation for more factual and up-to-date outputs. This method addresses the limitations of traditional large language models by ensuring accurate and current information in responses. SeeKeR outperforms existing models in open-domain knowledge-grounded conversations. pretraining architecture: Encoder/decoder or decoder only, depending on the base model it’s extending pretraining task: LM training, Dialogue training fine-tuning task: dialogue-based fine-tuning training corpus: Wizard of the Internet/Wikipedia, PersonaChat, Blended Skill Talk, Empathetic Dialogues, Multi-Session Chat, MS MARCO, Natural Questions, SQuAD, TriviaQA optimizer: Adam optimizer number of parameters: SeeKeR Dialogue: 400M, 3B; SeeKeR LM: 365M, 762M, 1.5B, R2C2 BlenderBot: 400M, 3B hardware used: Nvidia V100 GPU hardware information: The SeeKeR language models were fine-tuned on all of the search, knowledge, and response tasks simultaneously, with training occurring on 32 V100 GPUs for around 17, 21, and 31 hours for the XL, Large, and Medium models, respectively. The SeeKeR 2.7B R2C2 model was fine-tuned on all of the search, knowledge, and dialogue response tasks simultaneously, with training occurring on 64 V100 GPUs for around 20 hours; SeeKeR 2.7B R2C2 was pre-trained on 128 V100 GPUs for approximately 25 days extension: SeeKeR is an extension that can be applied to any Transformer architecture by introducing “search”, “knowledge”, and “response” modules during pretraining application: Same as base models has source code: https://parl.ai/projects/seeker/ license: the code is open sourced research problem: Large Language Models (LLMs), transformer model
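The modular design above amounts to calling one shared language model several times with different control signals. The sketch below is a hypothetical, minimal rendering of that flow; `lm`, `retrieve`, and the `module=` argument are placeholders invented for the example and are not the ParlAI API.

```python
def seeker_respond(context, lm, retrieve):
    """Toy sketch of a SeeKeR-style modular flow: the same model is invoked
    three times -- to write a search query, to distill a knowledge sentence
    from retrieved documents, and to generate the final response grounded
    on that knowledge. `lm` and `retrieve` are placeholder callables."""
    query = lm(context, module="search")
    documents = retrieve(query)
    knowledge = lm(context + " ||| " + " ".join(documents), module="knowledge")
    return lm(context + " ||| " + knowledge, module="response")

# Trivial stubs, just to exercise the control flow.
def fake_lm(prompt, module):
    return f"[{module}] " + prompt.split(" ||| ")[-1][:30]

def fake_retrieve(query):
    return ["doc about " + query]

reply = seeker_respond("who won the 2018 world cup?", fake_lm, fake_retrieve)
```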
-
Title: OPT: Open Pre-trained Transformer Language Models model family: GPT date created: 2022-05-01 organization: Facebook innovation: The authors introduced OPT, a collection of auto-regressive language models ranging from 125M to 175B parameters, replicating GPT-3's performance. They applied the latest best practices in data curation and training efficiency and emphasized the importance of community collaboration for responsible LLM guidelines and ethical considerations. pretraining architecture: Decoder pretraining task: Causal language modeling training corpus: 180B tokens = RoBERTa + the Pile + PushShift.io Reddit optimizer: AdamW tokenization: GPT-2 byte level BPE tokenizer number of parameters: 125M, 350M, 1.3B, 2.7B, 6.7B, 13B, 30B, 66B, and 175B maximum number of parameters (in million): 175000 hardware used: A100-80GB GPU hardware information: training OPT-175B on 992 80GB A100 GPUs, reaching 147 TFLOP/s utilization per GPU. From this implementation, and from using the latest generation of NVIDIA hardware, we are able to develop OPT-175B using only 1/7th the carbon footprint of GPT-3. extension: Basically same architecture as GPT-3 but with some training improvements introduced in Megatron-LM application: Same as GPT-3 has source code: https://github.com/facebookresearch/metaseq, https://huggingface.co/facebook/opt-350m blog post: https://ai.facebook.com/blog/democratizing-access-to-large-scale-language-models-with-opt-175b/ license: MIT license research problem: Large Language Models (LLMs), transformer model
-
Title: BlenderBot 3: a deployed conversational agent that continually learns to responsibly engage model family: GPT date created: 2022-08-01 organization: Facebook innovation: BlenderBot 3 enhances its predecessor by grounding conversations with internet-based knowledge retrieval. It emphasizes fine-tuning with diverse datasets and incorporates advanced safety mechanisms, while also focusing on continual learning from public deployments. pretraining architecture: Decoder pretraining task: Causal language modeling training corpus: 180B tokens = RoBERTa + the Pile + PushShift.io Reddit optimizer: Adam optimizer number of parameters: 3B, 30B and 175B maximum number of parameters (in million): 175000 hardware used: NVIDIA V100 (32GB) GPUs, A100-40GB GPU hardware information: The 30B and 175B parameter BlenderBot 3 models were each trained for one epoch of the training data on 64 (30B) or 128 (175B) 40GB A100 GPUs; the 3B parameter BlenderBot 3 model was trained on 64 32GB V100 GPUs for 27k updates with a batch size of 64 extension: BlenderBot 3 is based on a pre-trained OPT. It adds features needed for a dialog agent such as long-term memory or the ability to search the internet. It is also fine-tuned for some specific tasks given human feedback on them. application: same as GPT-3 has source code: https://parl.ai/projects/bb3/, https://github.com/facebookresearch/ParlAI/blob/main/parlai/zoo/bb3/model_card.md, https://github.com/facebookresearch/ParlAI/blob/main/projects/bb3/agents/README.md blog post: https://ai.facebook.com/blog/blenderbot-3-a-175b-parameter-publicly-available-chatbot-that-improves-its-skills-and-safety-over-time/ license: Limited, non-commercial, research only research problem: Large Language Models (LLMs), transformer model
-
Title: DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation model family: GPT date created: 2019-10-01 organization: Microsoft innovation: The main innovation of this work is DialoGPT, a large language model designed for open-domain conversations. DialoGPT is trained on a Reddit dataset, allowing it to generate coherent responses in multi-turn dialogues and adapt to different conversational domains through fine-tuning. It addresses context handling, offensive content mitigation, and employs human evaluation to assess its performance and variants. pretraining architecture: Decoder pretraining task: Causal language modeling training corpus: 140M Reddit conversations number of parameters: 117M, 345M and 762M maximum number of parameters (in million): 762 hardware used: Nvidia V100 GPU hardware information: trained on 16 Nvidia V100 machines with NVLink. extension: GPT-2 architecture trained on dialog data application: Text generation in dialog settings has source code: https://github.com/microsoft/DialoGPT, https://huggingface.co/docs/transformers/model_doc/dialogpt blog post: https://huggingface.co/microsoft/DialoGPT-medium?text=Hey+my+name+is+Mariama%21+How+are+you%3F license: Open, MIT license research problem: Large Language Models (LLMs), transformer model
-
Title: DeBERTa: Decoding-enhanced BERT with Disentangled Attention model family: BERT date created: 2020-06-01 organization: Microsoft innovation: The DeBERTa model introduces a disentangled attention mechanism, representing each word with separate vectors for content and position. It also incorporates an Enhanced Mask Decoder for better prediction of masked tokens during pre-training. These innovations lead to improved efficiency and performance over models like BERT and RoBERTa. pretraining architecture: Encoder pretraining task: Masked Language Modeling training corpus: For DeBERTa pre-training, the authors use Wikipedia (English Wikipedia dump; 12GB), BookCorpus (Zhu et al., 2015) (6GB), OPENWEBTEXT (public Reddit content (Gokaslan & Cohen, 2019); 38GB) and STORIES (a subset of CommonCrawl (Trinh & Le, 2018); 31GB). The total data size after data deduplication (Shoeybi et al., 2019) is about 78GB. optimizer: Adam optimizer tokenization: byte pair encoding number of parameters: 100M (Base), 350M (Large), 700M (X-large), and 1.5B (XX-large) maximum number of parameters (in million): 1500 hardware used: Nvidia V100 GPU hardware information: The base model of DeBERTa was trained on 4 DGX-2 machines equipped with 64 V100 GPUs. The training took 10 days to complete 1M training steps with a batch size of 2048. For the larger version of DeBERTa with 1.5 billion parameters (DeBERTa1.5B), the model was trained on a pre-training dataset of 160GB. The training was conducted on a DGX-2 machine with 16 V100 GPUs. extension: Separate positional embedding vector independent from the content embedding using disentangled attention matrices for contents and relative positions application: Same as BERT has source code: https://github.com/microsoft/DeBERTa, https://huggingface.co/microsoft/deberta-v2-xxlarge, https://huggingface.co/microsoft/deberta-v2-xlarge, https://huggingface.co/microsoft/deberta-xlarge, https://huggingface.co/microsoft/deberta-large blog post: https://www.microsoft.com/en-us/research/blog/microsoft-deberta-surpasses-human-performance-on-the-superglue-benchmark/ license: Open, MIT license research problem: Large Language Models (LLMs), transformer model
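The disentangled attention idea above can be sketched as a decomposition of one attention logit into content-to-content, content-to-position, and position-to-content terms. This is a minimal scalar sketch with made-up vectors, not DeBERTa's actual implementation (which uses relative-position embedding tables and per-head projections).

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def disentangled_score(content_q, content_k, rel_pos_q, rel_pos_k):
    """Toy decomposition of a DeBERTa-style attention logit: content and
    position are kept in separate vectors, and the logit sums their
    cross-interactions instead of mixing them into a single embedding."""
    c2c = dot(content_q, content_k)   # content-to-content
    c2p = dot(content_q, rel_pos_k)   # content-to-position
    p2c = dot(rel_pos_q, content_k)   # position-to-content
    return c2c + c2p + p2c

score = disentangled_score([1.0, 0.0], [0.5, 0.5], [0.1, 0.0], [0.0, 0.2])
```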
-
Title: Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows model family: ViT date created: 2021-03-01 organization: Microsoft innovation: The Swin Transformer introduces a hierarchical representation for visual elements of varying scales and achieves linear computational complexity by computing self-attention within non-overlapping, shifted windows. This design efficiently adapts the Transformer architecture for high-resolution images, making it a robust backbone for computer vision tasks. The shifted window approach enhances modeling power while ensuring lower latency. pretraining architecture: Encoder pretraining task: Same as ViT training corpus: ImageNet-1K and ImageNet-22K optimizer: AdamW number of parameters: Swin-Tiny (29M), Swin-Small (50M), Swin-Base (88M), and Swin-Large (197M) maximum number of parameters (in million): 197 hardware used: Nvidia V100 GPU extension: Extends ViT by replacing the standard multi-head self-attention (MSA) module with a module based on shifted windows (Swin), allowing ViT-like architectures to generalize to higher-resolution images application: Image tasks (object detection, image classification, etc.) has source code: https://github.com/microsoft/Swin-Transformer blog post: https://www.section.io/engineering-education/an-overview-of-swin-transformer/ license: Open, MIT license research problem: Large Language Models (LLMs), transformer model
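The window partitioning and shifting described above can be sketched on a grid of token coordinates. This is a toy illustration (real Swin operates on feature maps and uses cyclic shifts with attention masking for efficiency): attention is computed only inside each window, and the shifted variant in alternating layers lets information cross window boundaries.

```python
def window_partition(h, w, win):
    """Partition an h x w grid of token coordinates into non-overlapping
    win x win windows; self-attention is computed only within a window."""
    windows = []
    for top in range(0, h, win):
        for left in range(0, w, win):
            windows.append([(top + i, left + j)
                            for i in range(win) for j in range(win)])
    return windows

def shifted_partition(h, w, win):
    """Cyclically shift coordinates by win // 2 before partitioning (toy
    version of the shifted-window step), so successive layers mix
    information across the previous layer's window boundaries."""
    s = win // 2
    return [[((r + s) % h, (c + s) % w) for (r, c) in window]
            for window in window_partition(h, w, win)]

regular = window_partition(4, 4, 2)   # 4 windows of 4 tokens each
moved = shifted_partition(4, 4, 2)
```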
-
Title: GODEL: Large-Scale Pre-Training for Goal-Directed Dialog model family: T5, GPT date created: 2022-06-01 organization: Microsoft innovation: The main innovation of the GODEL model lies in its novel approach to integrating external knowledge into dialogue generation using a knowledge selection mechanism. This enables the model to produce contextually relevant and accurate responses by incorporating relevant information from external sources, enhancing the capabilities of LLMs in generating grounded and informed dialogues. pretraining architecture: Decoder pretraining task: Causal language modeling training corpus: 147M dialog sessions totaling 6B tokens of Reddit comment chains (as used for DialoGPT), plus grounded dialog corpora such as the DSTC7 Task 2 corpus, MS MARCO, UnifiedQA, and Schema-Guided Dialog. tokenization: Byte level BPE tokenizer number of parameters: 220M (base), 770M (large), and 175B (XL) maximum number of parameters (in million): 175000 hardware used: Nvidia V100 GPU hardware information: GODEL_B and GODEL_L were trained on 16 Nvidia V100 machines, and GODEL_XL was trained with 128 Nvidia V100 GPUs. extension: In contrast with earlier models such as DialoGPT, GODEL leverages a new phase of grounded pre-training designed to better support adapting GODEL to a wide range of downstream dialog tasks that require information external to the current conversation (e.g., a database or document) to produce good responses. application: open-domain goal-directed dialog tasks such as knowledge-grounded response generation, task-oriented dialog, and conversational QA has source code: https://huggingface.co/microsoft/GODEL-v1_1-large-seq2seq?text=Hey+my+name+is+Mariama%21+How+are+you%3F, https://huggingface.co/microsoft/GODEL-v1_1-base-seq2seq?text=Hey+my+name+is+Julien%21+How+are+you%3F, https://github.com/microsoft/GODEL blog post: https://www.microsoft.com/en-us/research/blog/godel-combining-goal-oriented-dialog-with-real-world-conversations/ license: MIT License research problem: Large Language Models (LLMs), transformer model
-
Title: Text Embeddings by Weakly-Supervised Contrastive Pre-training model family: BERT date created: 2022-12-01 organization: Microsoft innovation: The paper introduces E5, a text embedding model trained contrastively using a curated dataset named CCPairs. A unique consistency-based filtering refines this dataset, ensuring high-quality training data. E5's efficiency allows it to match or outperform much larger models in various tasks. pretraining architecture: Encoder pretraining task: Contrastive pretraining fine-tuning task: Supervised fine-tuning (SFT) training corpus: the CCPairs dataset, built by combining various semi-structured data sources such as CommunityQA, Common Crawl and scientific papers, with aggressive consistency-based filtering; fine-tuned on a combination of 3 datasets: NLI (Natural Language Inference), the MS MARCO passage ranking dataset, and the NQ (Natural Questions) dataset optimizer: AdamW number of parameters: 300M maximum number of parameters (in million): 300 hardware used: Nvidia V100 GPU hardware information: takes {16; 32; 64} V100 GPUs and {1; 1; 2} days for the {small, base, large} models. To improve training efficiency and reduce GPU memory usage, the authors adopt mixed precision training and gradient checkpointing. extension: Fine-tunes BERT-based models to create text string embeddings optimized for semantic relatedness application: Text embeddings for semantic relatedness tasks such as text clustering or search retrieval has source code: https://huggingface.co/intfloat/e5-large-v2, https://huggingface.co/intfloat/e5-base-v2, https://huggingface.co/intfloat/e5-small-v2, https://github.com/microsoft/unilm/tree/master/e5 blog post: https://blog.vespa.ai/simplify-search-with-multilingual-embeddings/ license: Open, MIT license research problem: Large Language Models (LLMs), transformer model
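The contrastive pretraining objective behind embedding models of this kind is typically an InfoNCE-style loss: a query should score its paired passage higher than the other passages in the batch. The sketch below is a minimal, dependency-free version of that loss (the vectors and temperature value are made up for illustration):

```python
import math

def info_nce(query, passages, positive_index, temperature=0.05):
    """Toy InfoNCE loss for contrastive embedding training: negative log
    softmax probability of the positive passage among in-batch candidates,
    using temperature-scaled cosine similarity as the logit."""
    def cos(u, v):
        nu = math.sqrt(sum(x * x for x in u))
        nv = math.sqrt(sum(x * x for x in v))
        return sum(a * b for a, b in zip(u, v)) / (nu * nv)
    logits = [cos(query, p) / temperature for p in passages]
    log_z = math.log(sum(math.exp(l) for l in logits))
    return log_z - logits[positive_index]

loss_good = info_nce([1.0, 0.0], [[0.9, 0.1], [0.0, 1.0]], positive_index=0)
loss_bad = info_nce([1.0, 0.0], [[0.9, 0.1], [0.0, 1.0]], positive_index=1)
```

A well-aligned query/positive pair yields a much lower loss than a mismatched one, which is what drives the embeddings of paired texts together.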
-
Title: WizardLM: Empowering Large Language Models to Follow Complex Instructions model family: LLaMa date created: 2023-04-24 organization: Microsoft innovation: The paper introduces the Evol-Instruct methodology, allowing Large Language Models (LLMs) to autonomously generate diverse instructional data. This innovation shifts away from traditional human-generated instructions, leading to more robust models. The resulting model, WizardLM, demonstrated superior performance, highlighting the effectiveness of this approach. pretraining architecture: Decoder fine-tuning task: supervized open-domain complex instruction finetuning training corpus: 250K instructions generated from OpenAI ChatGPT API auto-generated based on the Evol-Instruct method optimizer: Adam optimizer number of parameters: 7B maximum number of parameters (in million): 7000 hardware used: Nvidia V100 GPU hardware information: train our model on 8 V100 GPUs with Deepspeed Zero-3 for 70 hours on 3 epochs. extension: Evol-Instruct application: assistant in foundational NLP tasks such as reasoning or multi-domain and multi-genre QA, code generation, etc. has source code: https://github.com/nlpxucan/WizardLM blog post: https://medium.com/@preangelleo/wizardlm-enhancing-large-language-models-with-ai-evolved-instructions-7fd4425afe80 license: Non-commercial license research problem: Large Language Models (LLMs), transformer model
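The Evol-Instruct loop can be sketched as repeatedly rewriting a pool of seed instructions with evolution prompts and keeping the union of all rounds. In the sketch below the string templates are simplified stand-ins for the paper's in-depth/in-breadth evolution prompts, which are issued to an LLM rather than applied as plain string formatting:

```python
import random

# Simplified stand-ins for Evol-Instruct's in-depth and in-breadth
# evolution prompts (in the real method an LLM performs the rewrite).
DEEPEN = "Add one more constraint to: {q}"
BREADTH = "Write a new question inspired by: {q}"

def evolve(seed_instructions, rounds, rng=random.Random(0)):
    """Toy Evol-Instruct loop: each round rewrites every instruction with
    a randomly chosen evolution template and keeps the union of all rounds
    as the fine-tuning pool."""
    pool = list(seed_instructions)
    current = list(seed_instructions)
    for _ in range(rounds):
        current = [rng.choice([DEEPEN, BREADTH]).format(q=q) for q in current]
        pool.extend(current)
    return pool

pool = evolve(["Sort a list in Python."], rounds=3)
```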
-
Title: WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct model family: LLaMa date created: 2023-04-24 organization: Microsoft innovation: The main innovation of WizardMath in the context of Large Language Models (LLMs) is its enhanced mathematical reasoning capabilities. Through a case study, the model demonstrated its ability to solve complex problems using step-by-step reasoning, showcasing its advancement over other LLMs in mathematical tasks. pretraining architecture: Decoder fine-tuning task: Reinforcement Learning from Evol-Instruct Feedback training corpus: few-shot re-generate 15k answers for GSM8k and MATH with an Alpha version of the WizardLM 70B model to produce solutions in a step-by-step format, keep those with a correct answer, and use this data to fine-tune the base Llama model; evolve the original math (GSM8k + MATH) instructions by 8 turns, increasing the data size from 15k to 96k; to enhance the model’s ability to adhere to natural and diverse instructions, 1.5k open-domain conversations sampled from WizardLM’s training data are merged with the above math corpus as the final supervised fine-tuning training data number of parameters: 7B, 13B, and 70B maximum number of parameters (in million): 70000 extension: it enhances the mathematical reasoning abilities of the open-source pretrained large language model Llama-2 application: surpasses all other open-source LLMs by a substantial margin in terms of mathematical reasoning, including Llama-2 70B, Llama-1 65B, Falcon-40B, MPT-30B, Baichuan-13B Chat and ChatGLM2 12B on both GSM8k and MATH. has source code: https://github.com/nlpxucan/WizardLM blog post: https://ollama.ai/blog/wizardmath-examples license: Non-commercial license research problem: Large Language Models (LLMs), transformer model
-
Title: Orca: Progressive learning from complex explanation traces of gpt-4 model family: LLaMa date created: 2023-06-05 organization: Microsoft innovation: The main innovation of the Orca model is Explanation Tuning, which leverages detailed explanations rather than just input-response pairs for training. This approach enhances the model's reasoning capabilities, using system instructions from GPT-4 and ChatGPT as a teaching assistant for generating explanations. pretraining architecture: Decoder fine-tuning task: Explanation tuning training corpus: the Flan 2022 Collection with its extensive public assortment of tasks and instructions, ⟨query, response⟩ pairs augmented with detailed responses from GPT-4 that explain the reasoning process of the teacher as it generates the response tokenization: LLaMA Byte Pair Encoding (BPE) number of parameters: 13B maximum number of parameters (in million): 13000 hardware used: A100-80GB GPU hardware information: trained Orca on 20 NVIDIA A100 GPUs with 80GB memory. It took 160 hours to train Orca on FLAN-5M (ChatGPT augmentations) for 4 epochs, and 40 hours to continue training on FLAN-1M (GPT-4 augmentations) for the same number of epochs. extension: finetuning on complex explanation traces obtained from GPT-4 application: various NLP tasks including Bio Olympiad, Forming Inequalities, Counterfactual Question Answering, Compound Interest Problems, Spatial Reasoning, Commonsense Question Answering, Hallucination, Quadratic Equation Solving, Meeting Transcript Processing blog post: https://www.microsoft.com/en-us/research/publication/orca-progressive-learning-from-complex-explanation-traces-of-gpt-4/ license: Non-commercial license research problem: Large Language Models (LLMs), transformer model
-
Title: WizardCoder: Empowering Code Large Language Models with Evol-Instruct model family: StarCoder date created: 2023-06-14 organization: Microsoft innovation: The paper introduces WizardCoder, a model that empowers Code Large Language Models with complex instruction fine-tuning using the Evol-Instruct method tailored for code. This innovation allows WizardCoder to surpass other open-source Code LLMs and even outperform major closed LLMs on key benchmarks. The model fills a gap in the field by emphasizing instruction fine-tuning specifically for the code domain. pretraining architecture: Decoder fine-tuning task: supervised complex code-based instruction fine-tuning training corpus: initialized with the 20K-sample instruction-following dataset Code Alpaca; the Evol-Instruct technique is iteratively applied to this dataset to produce evolved data. After each round of data evolution, the evolved data from all previous rounds is merged with the original dataset to fine-tune StarCoder number of parameters: 1B, 3B, 7B, 13B, 15B, and 34B maximum number of parameters (in million): 34000 extension: Code Evol-Instruct application: code understanding, code generation, instruction fine-tuning for code tasks, tested on benchmarks (HumanEval, HumanEval+, MBPP, DS-1000), generating accurate code responses with clear explanations. has source code: https://github.com/nlpxucan/WizardLM license: Non-commercial license research problem: Large Language Models (LLMs), transformer model
-
Title: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter model family: BERT date created: 2019-10-01 organization: Huggingface innovation: The main innovation of this work is the creation of DistilBERT, a smaller language model derived from BERT using knowledge distillation during pre-training. It retains 97% of BERT's performance while being 40% smaller and 60% faster, making it efficient for on-device tasks and resource-constrained environments, thus addressing challenges related to large-scale language models' deployment. pretraining architecture: Encoder pretraining task: Masked Language Modeling with knowledge distillation (a triple loss combining distillation, MLM, and cosine-embedding terms; BERT's next-sentence-prediction objective is dropped) training corpus: Same as BERT number of parameters: 66M maximum number of parameters (in million): 66 hardware used: Nvidia V100 GPU hardware information: DistilBERT was trained on 8 16GB V100 GPUs for approximately 90 hours. extension: Compressed version of BERT using distillation, which is much more efficient given the same number of parameters application: Same as BERT has source code: https://huggingface.co/docs/transformers/model_doc/distilbert blog post: https://medium.com/huggingface/distilbert-8cf3380435b5 license: Open, Apache 2.0 research problem: Large Language Models (LLMs), transformer model
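The distillation component of DistilBERT's training can be sketched as a soft-target cross-entropy between the teacher's temperature-softened output distribution and the student's. The sketch below shows only that term, with made-up logits; DistilBERT combines it with a masked-LM loss and a cosine loss on hidden states.

```python
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(l / temperature) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Toy soft-target distillation term: cross-entropy between the
    teacher's temperature-softened distribution and the student's.
    The temperature exposes the teacher's 'dark knowledge' about
    near-miss classes rather than just the argmax label."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return -sum(ti * math.log(si) for ti, si in zip(t, s))

aligned = distillation_loss([4.0, 1.0, 0.1], [4.0, 1.0, 0.1])
misaligned = distillation_loss([4.0, 1.0, 0.1], [0.1, 1.0, 4.0])
```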
-
Title: PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization model family: Transformer date created: 2019-12-01 organization: Google, Imperial College London innovation: PEGASUS introduces a novel pre-training objective called Gap Sentence Generation (GSG), where "important" sentences are removed from a document and the model is trained to regenerate them. The method of selecting these principal sentences is based on their relevance to the entire document, measured using the ROUGE1-F1 score. This unique approach tailors the model for abstractive summarization, achieving state-of-the-art performance on various tasks. pretraining architecture: Encoder/Decoder pretraining task: DAE (more concretely GSG) and MLM training corpus: C4 (750GB) + HugeNews (3.8 TB) optimizer: Adam optimizer tokenization: sentencepiece number of parameters: Base = 223M, Large = 568M maximum number of parameters (in million): 568 extension: Extends the vanilla Transformer encoder/decoder with the Gap Sentence Generation (GSG) pre-training objective, combined with MLM during pre-training. application: abstractive text summarization has source code: https://github.com/google-research/pegasus, https://huggingface.co/docs/transformers/model_doc/pegasus blog post: https://ai.googleblog.com/2020/06/pegasus-state-of-art-model-for.html license: N/A research problem: Large Language Models (LLMs), transformer model
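The principal-sentence selection behind GSG can be sketched by scoring each sentence's unigram-overlap F1 against the rest of the document. This is a toy approximation (real PEGASUS uses proper ROUGE1-F1 and several selection strategies); the example document is invented.

```python
def unigram_f1(candidate, reference):
    """Toy ROUGE1-F1 stand-in: unigram-overlap F1 between two strings."""
    cand, ref = set(candidate.split()), set(reference.split())
    overlap = len(cand & ref)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def select_gap_sentences(sentences, k=1):
    """Score each sentence against the rest of the document and return the
    indices of the top-k 'principal' sentences, which GSG removes from the
    input and asks the model to regenerate."""
    scored = []
    for i, sentence in enumerate(sentences):
        rest = " ".join(sentences[:i] + sentences[i + 1:])
        scored.append((unigram_f1(sentence, rest), i))
    top = sorted(scored, reverse=True)[:k]
    return sorted(i for _, i in top)

doc = ["the cat sat on the mat and the dog sat on the rug",
       "the cat sat on the mat",
       "the dog sat on the rug",
       "unrelated quantum text"]
gaps = select_gap_sentences(doc, k=1)
```

The first sentence summarizes the next two, so it scores highest against the rest of the document, mirroring how GSG favors summary-like sentences.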
-
Title: ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators model family: BERT date created: 2020-03-01 organization: Google, Stanford innovation: ELECTRA introduces a novel pre-training task called "replaced token detection" where, instead of masking tokens like in BERT, input tokens are replaced with plausible alternatives from a generator network. The model then acts as a discriminator, predicting if each token was replaced or original. This approach is more computationally efficient than traditional Masked Language Modeling (MLM) and offers superior performance, especially for smaller models. pretraining architecture: Encoder pretraining task: replaced token detection training corpus: Same as BERT, except for Large, which uses the same corpus as XLNet number of parameters: Small = 14M, Base = 110M, Large = 335M maximum number of parameters (in million): 335 hardware used: TPUv3, Nvidia V100 GPU hardware information: ELECTRA-Small was trained for 4 days on 1 V100 GPU; ELECTRA-Base was trained for 4 days on 16 TPUv3s. extension: Applied new training techniques including Replaced Token Detection application: Same as BERT has source code: https://github.com/google-research/electra, https://huggingface.co/docs/transformers/model_doc/electra blog post: https://sh-tsang.medium.com/brief-review-electra-pre-training-text-encoders-as-discriminators-rather-than-generators-9568050d3a86 license: Open, Apache 2.0 research problem: Large Language Models (LLMs), transformer model
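The replaced-token-detection data construction can be sketched as follows. The generator here is a trivial stand-in for ELECTRA's small masked-LM network (the vocabulary and masking probability are made up); the point is that the discriminator gets a binary label for every position, not just the masked ones.

```python
import random

def replaced_token_detection(tokens, generator, mask_prob=0.3,
                             rng=random.Random(0)):
    """Toy ELECTRA data construction: sample some positions, let a small
    `generator` propose plausible replacements, and label each position
    0 (original) or 1 (replaced) for the discriminator to predict."""
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            proposal = generator(tok, rng)
            corrupted.append(proposal)
            # If the generator happens to reproduce the original token,
            # the position is labeled 'original', as in the paper.
            labels.append(0 if proposal == tok else 1)
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels

def toy_generator(original, rng):
    # Stand-in for the small masked-LM generator network.
    vocabulary = ["the", "a", "cat", "dog", "sat", "ran"]
    return rng.choice(vocabulary)

sentence = ["the", "cat", "sat", "on", "the", "mat"]
corrupted, labels = replaced_token_detection(sentence, toy_generator)
```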
-
Title: Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism model family: T5, BERT, GPT date created: 2020-03-01 organization: NVidia innovation: Megatron-LM introduces an efficient intra-layer model parallelism approach for training large transformer models. Implemented seamlessly in PyTorch without custom modifications, it allows for the training of models with billions of parameters, achieving state-of-the-art results on datasets like WikiText103 and LAMBADA. This innovation pushes the boundaries of Large Language Models, offering a scalable solution for the research community. pretraining architecture: Encoder or decoder, depending on the base model pretraining task: Same as base model training corpus: Original paper uses an aggregate dataset consisting of Wikipedia, CC-Stories, RealNews, and OpenWebText number of parameters: 8.3B (GPT-like), 3.9B (BERT-like) maximum number of parameters (in million): 8300 hardware used: Tesla V100 (32GB) GPU hardware information: Their experiments use up to 32 DGX-2H servers (a total of 512 Tesla V100 SXM3 32GB GPUs). The infrastructure is optimized for multi-node deep learning applications, with 300 GB/sec bandwidth between GPUs inside a server via NVSwitch and 100 GB/sec of interconnect bandwidth between servers using 8 InfiniBand adapters per server. extension: Megatron is a family of models that extend previously known architectures (namely GPT-2 and BERT originally, but also T5 more recently) by introducing model parallelism primitives. In the case of BERT, the authors also replace the next sentence prediction head with sentence order prediction and use whole word n-gram masking. application: Same as base model has source code: https://github.com/NVIDIA/Megatron-LM blog post: https://developer.nvidia.com/blog/scaling-language-model-training-to-a-trillion-parameters-using-megatron/, https://huggingface.co/blog/megatron-training license: Limited, Non-commercial usage research problem: Large Language Models (LLMs), transformer model
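The core intra-layer model-parallelism idea can be sketched with a matrix-vector product: split a weight matrix along its output dimension across devices, let each device compute its slice independently, and concatenate. This is a toy single-process sketch, not the Megatron-LM implementation (which pairs column- and row-parallel linear layers and uses collective communication):

```python
def matvec(matrix, vector):
    """Dense matrix-vector product; rows of `matrix` are output features."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def tensor_parallel_matvec(weight_shards, vector):
    """Toy tensor parallelism: the weight matrix is split along the output
    dimension across 'devices'; each shard computes its slice of the
    output independently, and the slices are concatenated (an all-gather
    collective in a real multi-GPU implementation)."""
    output = []
    for shard in weight_shards:          # one shard per device
        output.extend(matvec(shard, vector))
    return output

full_weight = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
shards = [full_weight[:2], full_weight[2:]]   # 2-way split of the outputs
x = [1.0, 1.0]
parallel = tensor_parallel_matvec(shards, x)
serial = matvec(full_weight, x)
```

The parallel result matches the serial one exactly, which is why the split requires no change to the model's mathematics, only communication to reassemble activations.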
-
Title: Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model model family: GPT date created: 2021-10-01 organization: NVidia innovation: The Megatron-Turing NLG 530B (MT-NLG) was, at the time of its release, the largest monolithic language model trained, with 530 billion parameters. It was developed using advanced 3D parallelism techniques, a collaboration between NVIDIA Megatron-LM and Microsoft DeepSpeed. The model showcases superior in-context learning capabilities and sets new benchmarks in zero-shot and few-shot learning. pretraining architecture: Decoder pretraining task: Causal language modeling training corpus: The Pile (800GB dataset) + 2 Common Crawl snapshots number of parameters: 530B maximum number of parameters (in million): 530000 hardware used: A100-80GB GPU hardware information: Model training is done with mixed precision using bfloat16 on NVIDIA’s Selene supercomputer with 560 DGX A100 nodes. Each cluster node has 8 NVIDIA 80-GB A100 GPUs, connected to each other by NVLink and NVSwitch. Each node has eight NVIDIA Mellanox 200Gbps HDR Infiniband HCAs for application communication, with an additional two HCAs per node for dedicated storage. The nodes are connected in a three-level (leaf, spine, core) fat-tree topology with 850 switches. The cluster uses an all-NVME shared parallel filesystem for high-performance data access and storage. extension: Uses parallelization similar to Megatron to train a LM double the size of GPT-3 application: Language generation and others (similar to GPT-3) blog post: https://developer.nvidia.com/megatron-turing-natural-language-generation, https://developer.nvidia.com/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/ license: Limited, Non-commercial usage research problem: Large Language Models (LLMs), transformer model
-
Title: Global Context Vision Transformers model family: ViT date created: 2022-06-01 organization: NVidia innovation: The Global Context Vision Transformer (GC ViT) introduces a hierarchical structure optimized for compute and parameter efficiency. It employs global query tokens for capturing contextual information and a novel downsampling module with Fused MB-Conv blocks to enhance inter-channel dependencies. These innovations lead to state-of-the-art performance across various computer vision tasks. pretraining architecture: Encoder pretraining task: Image Classification training corpus: ImageNet-1K and other task-dependent datasets optimizer: AdamW number of parameters: 90M maximum number of parameters (in million): 90 hardware used: NVIDIA A40 GPU, NVIDIA A100 GPU hardware information: Object detection and instance segmentation models as well as semantic segmentation models were trained using one computational node with 8 NVIDIA A40 GPUs using a total batch size of 16, hence a batch size of 2 per GPU. For image classification, GC ViT models were trained using four computational nodes with 32 NVIDIA A100 GPUs extension: hierarchical ViT architecture consisting of local and global self-attention modules application: image classification, object detection, and instance and semantic segmentation has source code: https://github.com/NVlabs/GCVit blog post: https://towardsdatascience.com/global-context-vision-transformers-nvidias-new-sota-image-model-2923bdaf438e license: Limited, non-commercial license CC-BY-NC-SA-4.0 research problem: Large Language Models (LLMs), transformer model
-
Title: GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow model family: GPT date created: 2021-03-01 organization: EleutherAI innovation: GPT-Neo by EleutherAI is an open-source alternative to proprietary models like GPT-3. Trained on the diverse "The Pile" dataset, its significance lies in democratizing access to large language models, fostering community-driven development, and promoting ethical considerations in AI research. pretraining architecture: Decoder pretraining task: Causal language modeling training corpus: Pile - 840 GB open-source text dataset that combines 22 pre-existing datasets number of parameters: 125M, 350M, 1.3B, and 2.7B maximum number of parameters (in million): 2700 extension: Similar to GPT-2 but uses local attention in every other layer with a window size of 256 tokens application: Text generation, but adaptable to many other NLP tasks when fine-tuned. has source code: https://github.com/EleutherAI/gpt-neo, https://huggingface.co/docs/transformers/model_doc/gpt_neo blog post: https://huggingface.co/blog/few-shot-learning-gpt-neo-and-inference-api, https://www.section.io/engineering-education/leveraging-gptneo-to-generate-ai-based-blog-content/ license: Open, MIT license research problem: Large Language Models (LLMs), transformer model
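The alternating local attention described in the extension field can be sketched as a mask schedule. This is a toy illustration (the choice of which layer parity is local is arbitrary here, and a tiny window is used so the effect is visible): even layers see the full causal context, odd layers only a sliding window of recent tokens.

```python
def attention_mask(seq_len, layer_index, window=256):
    """Toy GPT-Neo-style mask schedule: even layers use full causal
    attention; odd layers restrict each position to a local window over
    the previous `window` tokens. mask[i][j] is True when position i may
    attend to position j."""
    local = layer_index % 2 == 1
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            causal = j <= i                    # never attend to the future
            in_window = i - j < window         # within the local window
            row.append(causal and (in_window or not local))
        mask.append(row)
    return mask

full_mask = attention_mask(6, layer_index=0, window=2)
local_mask = attention_mask(6, layer_index=1, window=2)
```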
-
Title: GPT-J-6B: A 6 billion parameter autoregressive language model model family: GPT date created: 2021-05-01 organization: EleutherAI innovation: GPT-J-6B by EleutherAI stands out for its open-source accessibility, training on the diverse "Pile" dataset, and its development by an independent research organization. The model emphasizes democratization in AI, showcasing that state-of-the-art advancements aren't limited to large corporations. EleutherAI's transparent approach and emphasis on open research further distinguish GPT-J in the LLM landscape. pretraining architecture: Decoder pretraining task: Causal language modeling training corpus: Pile corpus, a large-scale curated dataset created by EleutherAI tokenization: same set of Byte Pair Encoding (BPE) Tokenizer as GPT-2/GPT-3 number of parameters: 6B maximum number of parameters (in million): 6000 hardware used: TPU v3-256 pod extension: GPT-J 6B is a Transformer model trained using Mesh Transformer JAX and same tokenizer as GPT2/3 application: generating text from a prompt, but is advised to be finetuned for more effective performance has source code: https://huggingface.co/EleutherAI/gpt-j-6b, https://github.com/kingoflolz/mesh-transformer-jax blog post: https://en.wikipedia.org/wiki/GPT-J license: Apache 2.0 research problem: Large Language Models (LLMs), transformer model
-
Title: GPT-NeoX-20B: An Open-Source Autoregressive Language Model model family: GPT date created: 2022-04-01 organization: EleutherAI innovation: GPT-NeoX-20B is a 20 billion parameter autoregressive language model with notable architectural differences from GPT-3, including the use of rotary positional embeddings. It excels in few-shot reasoning and is distinctively open-sourced, making it a significant contribution to the public research community. The model was trained on the Pile dataset. pretraining architecture: Decoder pretraining task: Causal language modeling training corpus: Pile — 840 GB open source text dataset that combines 22 preexisting datasets optimizer: AdamW (with ZeRO optimizer state sharding) tokenization: train a new BPE tokenizer based on the Pile number of parameters: 20B maximum number of parameters (in million): 20000 hardware used: AMD EPYC 7532 CPU, A100-SXM4-40GB GPU hardware information: trained GPT-NeoX-20B on twelve Supermicro AS-4124GO-NART servers, each with eight NVIDIA A100-SXM4-40GB GPUs and configured with two AMD EPYC 7532 CPUs extension: Similar to GPT-3 with rotary positional embeddings instead of learned ones, parallel attention and feed-forward layers, different initialization, and all dense layers instead of alternating dense/sparse application: range of language-understanding, mathematics and knowledge-based tasks has source code: https://github.com/EleutherAI/gpt-neox, other gpt-neo models, https://huggingface.co/EleutherAI/gpt-neox-20b blog post: https://blog.eleuther.ai/announcing-20b/ license: Apache 2.0 research problem: Large Language Models (LLMs), transformer model
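The rotary positional embeddings mentioned above can be sketched in a few lines: each consecutive pair of query/key features is rotated by an angle proportional to the token position, so the attention dot product ends up depending only on the relative offset between tokens. A minimal sketch with made-up vectors (real implementations apply this per head, often to only a fraction of the dimensions):

```python
import math

def rope(vector, position, base=10000.0):
    """Toy rotary positional embedding (RoPE): rotate each consecutive
    pair of features by an angle proportional to the token position."""
    out = []
    for k in range(0, len(vector), 2):
        theta = position * base ** (-k / len(vector))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vector[k], vector[k + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q, k = [1.0, 0.0, 0.5, 0.5], [0.3, 0.7, 1.0, 0.0]
# The rotated query/key inner product depends only on the relative offset:
same_offset_a = dot(rope(q, 3), rope(k, 1))    # offset 2
same_offset_b = dot(rope(q, 10), rope(k, 8))   # offset 2 again
```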
-
Title: Pythia: A suite for analyzing large language models across training and scaling model family: Pythia date created: 2023-03-13 organization: EleutherAI innovation: The paper introduces Pythia, a suite of 16 Large Language Models (LLMs) trained on consistent public data, spanning from 70M to 12B parameters. Pythia provides public access to 154 checkpoints for each model, facilitating research into LLM training dynamics, memorization, and bias reduction. This unique setup offers novel insights into LLM behavior and evolution. pretraining architecture: Decoder pretraining task: Causal language modeling training corpus: Two versions of each of the eight models are released: one trained on the original Pile corpus and another trained on the deduplicated Pile corpus., Pile optimizer: ZeRO, Adam tokenization: BPE tokenizer that is trained specifically on the Pile same as used for GPT-NeoX-20B number of parameters: 70M, 160M, 410M, 1B, 1.4B, 2.8B, 6.9B, 12B maximum number of parameters (in million): 12000 hardware used: A100-40GB GPU hardware information: training the 70M, 160M, and 410M models required 32 A100 GPUs with 40GB RAM; the 1.0B, 1.4B, and 2.8B models used 64 A100-40GB GPUs; the 6.9B model used 128 GPUs; and the 12B model used 256 GPUs. extension: Trained with the library GPT-NeoX application: Research on language models’ behavior, functionality, and limitations has source code: pythia-deduped, pythia, https://github.com/EleutherAI/pythia blog post: https://www.eleuther.ai/papers-blog/pythia-a-suite-for-analyzing-large-language-modelsacross-training-and-scaling license: Apache 2.0 research problem: Large Language Models (LLMs), transformer model
-
Title: Decision Transformer: Reinforcement Learning via Sequence Modeling model family: GPT, “Control Transformers” (not per se a family, but grouping here those transformers that try to model more general control, RL-like, tasks) date created: 2021-06-01 organization: Facebook, Google, UC Berkeley innovation: The Decision Transformer reframes Reinforcement Learning (RL) as a sequence modeling task, leveraging the Transformer architecture used in Large Language Models like GPT. It autoregressively models trajectories by conditioning on desired returns, past states, and actions, enabling it to generate optimal future actions. This approach offers a simplified and scalable solution to RL, especially in sparse reward settings. pretraining architecture: Decoder pretraining task: Next action prediction training corpus: Different corpus for different experiments optimizer: AdamW number of parameters: Same as GPT extension: Decision transformers use a GPT architecture and extend it by encoding trajectories in a way that they can be learned by an auto-regressive task application: General RL (reinforcement learning tasks) has source code: https://github.com/kzl/decision-transformer, https://huggingface.co/docs/transformers/main/en/model_doc/decision_transformer blog post: https://sites.google.com/berkeley.edu/decision-transformer license: Open, MIT license research problem: Large Language Models (LLMs), transformer model
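The return conditioning described above can be made concrete: at each timestep the Decision Transformer conditions on the return-to-go, i.e. the sum of rewards from that step onward. A minimal sketch (illustrative, not the authors' code):

```python
def returns_to_go(rewards: list[float]) -> list[float]:
    """Suffix sums of the reward sequence: R_t = r_t + r_{t+1} + ... + r_T."""
    rtg, running = [], 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    return rtg[::-1]

# The trajectory is then fed to the GPT backbone interleaved as
# (R_1, s_1, a_1, R_2, s_2, a_2, ...).
```

At evaluation time the first return-to-go is simply set to the desired target return, and the model generates actions conditioned on it.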
-
Title: Offline Reinforcement Learning as One Big Sequence Modeling Problem model family: GPT, “Control Transformers” (not per se a family, but grouping here those transformers that try to model more general control, RL-like, tasks) date created: 2021-06-01 organization: UC Berkeley innovation: The paper presents a novel approach to reinforcement learning (RL) by treating it as a sequence modeling problem. Using the Transformer architecture, traditionally employed in natural language processing, they model trajectories in RL tasks. This perspective simplifies design decisions and proves versatile across various RL challenges, including long-horizon tasks. pretraining architecture: Decoder pretraining task: predict most likely sequence training corpus: D4RL dataset and other RL datasets depending on the task at hand optimizer: Adam optimizer number of parameters: Smaller architecture than GPT extension: Similarly to the Decision transformers, the main extension introduced by Trajectory Transformers is a way to encode a trajectory (state, actions, rewards) application: General RL (reinforcement learning tasks) has source code: https://trajectory-transformer.github.io/, https://github.com/JannerM/trajectory-transformer blog post: https://bair.berkeley.edu/blog/2021/11/19/trajectory-transformer/ license: Open, MIT license research problem: Large Language Models (LLMs), transformer model
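The trajectory encoding mentioned above amounts to discretizing each state, action, and reward dimension into a fixed vocabulary and flattening the result into one token stream. A hedged sketch (the bin count and value range are illustrative choices, not the paper's):

```python
def discretize(x: float, low: float, high: float, bins: int) -> int:
    """Map a continuous value to one of `bins` uniform buckets over [low, high]."""
    x = min(max(x, low), high)
    return min(int((x - low) / (high - low) * bins), bins - 1)

def flatten_trajectory(traj, low=-1.0, high=1.0, bins=100):
    """Interleave discretized (state, action, reward) tuples into one sequence."""
    tokens = []
    for state, action, reward in traj:
        tokens += [discretize(v, low, high, bins) for v in state]
        tokens += [discretize(v, low, high, bins) for v in action]
        tokens.append(discretize(reward, low, high, bins))
    return tokens
```

Once flattened, the sequence can be modeled with an ordinary GPT-style next-token objective, which is the paper's central point.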
-
Title: Jurassic-1: Technical details and evaluation model family: GPT date created: 2021-09-01 organization: AI21 innovation: The Jurassic-1 model's primary innovation in the context of Large Language Models is its enhanced tokenization efficiency, achieved through a larger 256K vocabulary SentencePiece tokenizer. This tokenizer captures a mix of word pieces, whole words, and multi-word expressions, allowing Jurassic-1 to represent text with 28% fewer tokens than GPT-3, leading to faster processing and broader domain prediction capabilities. pretraining architecture: Decoder pretraining task: Causal language modeling training corpus: 300B tokens (same as GPT-3) tokenization: self-trained sentencepiece number of parameters: 178B (Jumbo), 17B (Grande), 7.5B (Large) maximum number of parameters (in million): 178000 hardware used: graphics processing unit extension: Very similar to GPT-3, but far more parameters and improved training efficiency mostly because of the improved tokenizer. Also, different ratio of depth to breadth application: Similar to GPT-3 has source code: https://github.com/ai21labs/lm-evaluation blog post: https://www.ai21.com/blog/ai21-studio-use-cases, https://www.ai21.com/blog/announcing-ai21-studio-and-jurassic-1 license: Closed source, accessible through API research problem: Large Language Models (LLMs), transformer model
-
Title: A General Language Assistant as a Laboratory for Alignment model family: Transformer date created: 2021-12-01 organization: Anthropic innovation: The paper introduces techniques to improve the alignment of Large Language Models (LLMs) with human values, specifically HHH: helpful, honest, and harmless. It delves into methods like context distillation and preference modeling over imitation learning. The goal is to ensure LLMs provide responses that are both relevant and in line with human expectations. pretraining architecture: Decoder pretraining task: Causal language modeling training corpus: 400B tokens from filtered Common Crawl and Books. They also create several Dialogue Preference datasets for the RLHF training. number of parameters: 13M, 42M, 197M, 810M, 2.7B, 13B, and 52B maximum number of parameters (in million): 52000 extension: These models do not introduce novelties at the architecture/pretraining level (they are based on GPT-3) but rather focus on how to improve alignment through fine-tuning and prompting. Note that the Anthropic Assistant includes several models optimized for different tasks. Latest versions of this work focus on the benefits of RLHF. application: Different models with different applications from general dialog to code assistant. license: N/A research problem: Large Language Models (LLMs), transformer model
-
Title: High-Resolution Image Synthesis with Latent Diffusion Models model family: Diffusion date created: 2021-12-01 organization: EleutherAI, Stability.ai, LMU Munich pretraining architecture: Encoder/Decoder pretraining task: Caption prediction training corpus: LAION-5B, a publicly available dataset derived from Common Crawl number of parameters: 890M (although there are different, smaller, variants) maximum number of parameters (in million): 890 extension: Stable Diffusion is essentially the Latent Diffusion model developed by LMU Munich researchers, plus learnings on conditional diffusion from DALL-E and Imagen application: Text to image has source code: https://github.com/CompVis/latent-diffusion, https://huggingface.co/CompVis/stable-diffusion, https://huggingface.co/spaces/stabilityai/stable-diffusion, https://github.com/Stability-AI/stablediffusion blog post: https://stability.ai/blog/stable-diffusion-public-release license: open, CreativeML Open RAIL++-M License research problem: Large Language Models (LLMs), transformer model
-
Title: DQ-BART: Efficient Sequence-to-Sequence Model via Joint Distillation and Quantization model family: BART date created: 2022-03-01 organization: Amazon innovation: The paper introduces DQ-BART, which innovatively combines model distillation and quantization to compress sequence-to-sequence models. It uniquely initializes student models by copying specific layers from the teacher and distills both the encoder and decoder. This approach achieves significant model compression with minimal performance loss. pretraining architecture: Encoder/Decoder pretraining task: denoising autoencoder training corpus: CNN/DM, XSUM, ELI5, WMT16 En-Ro (~1M tokens) number of parameters: Up to 30x reduction in parameters compared to standard BART hardware used: A100 GPU extension: Adds quantization and distillation to a BART model to improve performance and model size application: Text generation and understanding has source code: https://github.com/amazon-science/dq-bart blog post: https://www.amazon.science/publications/dq-bart-efficient-sequence-to-sequence-model-via-joint-distillation-and-quantization license: Open, Apache 2.0 research problem: Large Language Models (LLMs), transformer model
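The quantization half of DQ-BART's compression can be illustrated with symmetric uniform quantization, which maps each weight to a small signed integer plus one shared scale (a toy sketch of the general idea; the paper combines this with distillation-aware training):

```python
def quantize(weights: list[float], bits: int = 8):
    """Symmetric uniform quantization: ints in [-qmax, qmax] plus one scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax or 1.0  # guard all-zero input
    return [round(w / scale) for w in weights], scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]

q, scale = quantize([0.5, -1.0, 0.25])
# The largest-magnitude weight maps to ±qmax; every reconstruction
# error stays below one scale step.
```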
-
Title: AlexaTM 20B: Few-shot learning using a large-scale multilingual seq2seq model model family: transformer date created: 2022-08-01 organization: Amazon innovation: The paper introduces AlexaTM 20B, the largest multilingual seq2seq model, trained on denoising and Causal Language Modeling tasks. This model excels at in-context learning with long contexts and showcases efficiency by outperforming much larger models like GPT3 175B in certain benchmarks. Additionally, it emphasizes a new paradigm for one-shot machine translation, especially for low-resource languages. pretraining architecture: Encoder/Decoder pretraining task: Optimizes denoising (80%) and Prefix LM (20%) training corpus: Wikipedia and mC4 datasets in 12 languages. optimizer: Adam optimizer number of parameters: 20B maximum number of parameters (in million): 20000 hardware used: NVIDIA A100 GPU hardware information: trained AlexaTM 20B for 120 days on 128 A100 GPUs for the total of 500k updates with the accumulated batch size of 2 million tokens (total of 1 trillion token updates) extension: Derived from BART, with layernorms located exactly at the beginning of each layer. Encoder initialized with internal 10B pre-trained encoder. application: Summarization, multi-lingual machine translation and NLU tasks has source code: https://github.com/amazon-science/alexa-teacher-models blog post: https://www.amazon.science/blog/20b-parameter-alexa-model-sets-new-marks-in-few-shot-learning license: Limited, non-commercial research problem: Large Language Models (LLMs), transformer model
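The 80/20 denoising/Prefix-LM mixture can be sketched as a per-example task draw (this illustrates only the mixing ratio, not Amazon's actual data pipeline):

```python
import random

def sample_pretraining_task(rng: random.Random) -> str:
    """Pick denoising with probability 0.8, Prefix LM with probability 0.2."""
    return "denoising" if rng.random() < 0.8 else "prefix_lm"

rng = random.Random(0)
draws = [sample_pretraining_task(rng) for _ in range(10_000)]
# Roughly 80% of draws come out as "denoising".
```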
-
Title: GLM: General language model pretraining with autoregressive blank infilling model family: GLM (General Language Model) date created: 2022-03-01 organization: Tsinghua University innovation: GLM's main innovation lies in its use of gap sentences for pretraining, along with a blank infilling objective during fine-tuning. This approach enhances the model's ability to generate coherent and contextually relevant text. By combining autoregressive decoding and text infilling, GLM achieves competitive performance on a variety of NLU tasks. pretraining architecture: Encoder/Decoder pretraining task: Auto regressive blank infilling training corpus: Pile, GLM-130B Chinese corpora, P3, DeepStruct finetuning dataset optimizer: Adam optimizer tokenization: uncased wordpiece number of parameters: Base = 110M, Large = 335M, and also 2B, 10B, 130B maximum number of parameters (in million): 130000 hardware used: Nvidia V100 GPU hardware information: The models are trained on 64 V100 GPUs for 200K steps with batch size of 1024 and maximum sequence length of 512, which takes about 2.5 days for GLMLarge. extension: GLM has a bidirectional encoder and a unidirectional decoder in a unified model application: a General Language Model pretrained with an autoregressive blank-filling objective and can be finetuned on various natural language understanding and generation tasks. has source code: https://github.com/THUDM/GLM-130B blog post: http://keg.cs.tsinghua.edu.cn/glm-130b/posts/glm-130b/ license: Open, MIT license research problem: Large Language Models (LLMs), transformer model
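GLM's autoregressive blank infilling replaces each sampled span with a single [MASK] token in the corrupted input (Part A), while the span contents (Part B) are regenerated autoregressively. A minimal sketch (the sentinel token names are illustrative):

```python
def blank_infill(tokens: list[str], spans: list[tuple[int, int]]):
    """Split into Part A (corrupted input) and Part B (spans to regenerate)."""
    part_a, part_b, last = [], [], 0
    for start, end in spans:
        part_a += tokens[last:start] + ["[MASK]"]
        part_b += ["[S]"] + tokens[start:end]
        last = end
    part_a += tokens[last:]
    return part_a, part_b

a, b = blank_infill(list("abcdef"), [(1, 3), (4, 5)])
```

GLM additionally shuffles the Part B spans and uses 2D positional encodings; those details are omitted here.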
-
Title: Multitask prompted training enables zero-shot task generalization model family: T5 date created: 2022-03-01 organization: BigScience innovation: The paper introduces a model named T0, based on the T5 architecture, which is trained using a unique method of mapping natural language tasks into prompted forms. This approach, combined with training on multiple prompts and datasets, enables T0 to achieve robust generalization to unseen tasks, often surpassing larger models like GPT-3 in performance. The innovation lies in the effective use of prompts and multitask training to enhance zero-shot learning capabilities. pretraining architecture: Encoder/Decoder pretraining task: Masked Language Modeling fine-tuning task: Natural language prompts training corpus: T0 (Multiple-choice QA, Extractive QA, Closed-Book QA, Structure-To-Text, Sentiment, Summarization, Topic Classification, Paraphrase Identification). T0p (same as T0, with additional datasets from GPT-3’s evaluation suite). T0pp (same as T0p, with additional datasets from SuperGLUE, excluding NLI sets) number of parameters: T0-3B: 3 billion, T0, T0p, T0pp: 11 billion maximum number of parameters (in million): 11000 hardware used: TPUv3 hardware information: training runs corresponded to about 270 total hours of training on a v3-512 Cloud TPU device. Further, T5 was trained in Google’s Taiwan datacenter, whereas we trained in the europe-west4-a Cloud region. The gCO2eq/kWh published by Google for these datacenters are 540 and 410 respectively extension: T0 stands for "T5 for Zero Shot", obtained by fine-tuning the T5 model on multitask mixture covering many different NLP tasks. Compared with T0, T0p and T0pp were fine-tuned with more datasets. T0pp is recommended as it leads (on average) to the best performances on a variety of NLP tasks. application: Perform zero-shot inference tasks by specifying the query in natural language, and the models will generate a prediction. 
has source code: https://huggingface.co/bigscience/T0 license: Open, Apache 2.0 research problem: Large Language Models (LLMs), transformer model
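The task-to-prompt mapping at the heart of T0 can be illustrated with a hypothetical NLI template (the wording below is an invented example in the spirit of the P3 prompt collection, not one of the actual prompts):

```python
def nli_to_prompt(example: dict) -> str:
    """Render a hypothetical NLI example as a natural-language prompt."""
    return (f"{example['premise']} Question: does this imply that "
            f"\"{example['hypothesis']}\"? Yes, no, or maybe?")

prompt = nli_to_prompt({"premise": "A dog is running.",
                        "hypothesis": "An animal is moving."})
```

Each dataset gets several such templates, and the model is fine-tuned on the resulting text-to-text pairs, which is what enables zero-shot transfer to unseen tasks phrased the same way.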
-
Title: BLOOM: A 176B-Parameter Open-Access Multilingual Language Model model family: GPT date created: 2022-07-01 organization: Huggingface, Big Science innovation: BLOOM is an open-access Large Language Model (LLM) collaboratively designed by hundreds of researchers. Trained on the ROOTS corpus with 59 languages, it achieves competitive performance, especially after multitask prompted finetuning. The model and code are publicly released under the Responsible AI License. pretraining architecture: Decoder pretraining task: Causal language modeling training corpus: https://openreview.net/forum?id=UoEw6KigkUn, 366B tokens (1.5 TB of text data) multilingual dataset (46 natural languages and 13 programming languages) optimizer: Adam optimizer number of parameters: 560m, 1.1B, 1.7B, 3B, 7.1B, and 176B maximum number of parameters (in million): 176000 hardware used: A100-80GB GPU hardware information: Training was conducted on 48 nodes, each having 8 NVIDIA A100 80GB GPUs (a total of 384 GPUs); due to possible hardware failures during training, we also maintained a reserve of 4 spare nodes. The nodes were equipped with 2x AMD EPYC 7543 32-Core CPUs and 512 GB of RAM, while the storage was handled by mix of full flash and hard disk drives using a SpectrumScale (GPFS) parallel file system shared between all nodes and users of the supercomputer. 4 NVLink GPU-to-GPU interconnects per node enabled intra-node communications while 4 Omni-Path 100 Gbps links per node, arranged in an enhanced hypercube 8D global topology, were used for inter-node communications. 
extension: The main difference from GPT-3 is that it uses full attention instead of sparse attention application: Same as GPT-3 has source code: https://huggingface.co/docs/transformers/model_doc/bloom blog post: https://huggingface.co/blog/bloom-megatron-deepspeed, https://huggingface.co/blog/bloom-inference-pytorch-scripts, https://huggingface.co/blog/bloom-inference-optimization license: Open, but need to follow restrictions in Attachment A, BigScience RAIL License v1.0 research problem: Large Language Models (LLMs), transformer model
-
Title: Galactica: A large language model for science model family: transformer date created: 2022-11-01 organization: Meta innovation: Galactica is designed to revolutionize the way we access scientific knowledge, moving beyond the traditional store-and-retrieve approach. It efficiently absorbs technical knowledge, including complex LaTeX equations and chemical reactions, and demonstrates superior context-associative capabilities over traditional search engines. This model's architecture and approach highlight its potential to serve as a new, more efficient interface for scientific inquiry. pretraining architecture: Decoder pretraining task: Causal language modeling training corpus: Trained on 106 billion tokens of open-access scientific text and data. This includes papers, textbooks, scientific websites, encyclopedias, reference material, knowledge bases, and more optimizer: AdamW tokenization: byte pair encoding number of parameters: mini: 125M, base: 1.3B, standard: 6.7B, large: 30B, huge: 120B maximum number of parameters (in million): 120000 hardware used: A100-80GB GPU hardware information: We use the metaseq library for training the models, built by the NextSys team at Meta AI. For training the largest 120B model, we use 128 NVIDIA A100 80GB nodes. For inference Galactica 120B requires a single A100 node. extension: Transformer based architecture in a decoder-only setup with a few modifications. Data extensions include special tokens for working memory, citations, genetic data, and a few other biology related tasks. application: The models are designed to perform scientific tasks, including but not limited to citation prediction, scientific QA, mathematical reasoning, summarization, document generation, molecular property prediction and entity extraction. blog post: https://galactica.org/ license: Limited, non-commercial CC BY-NC 4.0 license research problem: Large Language Models (LLMs), transformer model
-
Title: LLaMA: Open and Efficient Foundation Language Models model family: LLaMa date created: 2023-02-27 organization: Meta AI innovation: The LLaMA models prioritize efficiency, with LLaMA-13B outperforming larger models like GPT-3 using fewer parameters. They exclusively train on publicly available data, emphasizing both inference efficiency and democratizing access to powerful language models. pretraining architecture: Decoder pretraining task: Language Modeling fine-tuning task: instructions training corpus: Stack Exchange, ArXiv, Gutenberg and Books3, Wikipedia, Github, C4, CommonCrawl, approximately 1.4T tokens from various sources: 3.3 TB CommonCrawl (67%), 783GB C4 (15%), 328GB Github (4.5%), 83GB Wikipedia (4.5%), 85GB Books (4.5%), 92GB ArXiv (2.5%), and 78GB StackExchange (2.0%) optimizer: AdamW tokenization: the bytepair encoding (BPE) algorithm, using the implementation from SentencePiece. Notably, we split all numbers into individual digits, and fall back to bytes to decompose unknown UTF-8 characters. number of parameters: 6.7B, 13.0B, 32.5B, and 65.2B maximum number of parameters (in million): 65200 hardware used: A100-80GB GPU hardware information: When training a 65B-parameter model, our code processes around 380 tokens/sec/GPU on 2048 A100 GPU with 80GB of RAM. This means that training over our dataset containing 1.4T tokens takes approximately 21 days. extension: LLaMA uses a Transformer architecture, and with extensions: Pre-normalization, SwiGLU activations, RoPE embeddings, reduced memory usage and runtime through efficient implementation of the causal multi-head attention, checkpointing to reduce the amount of activations that are recomputed during the backward pass, model and sequence parallelism to reduce memory usage of the model, and uses 1.4T BPE tokens after tokenization. application: Zero and few shot Commonsense reasoning, Question answering, mathematical reasoning, Code generation, Reading comprehension, and multitask understanding. 
has source code: https://huggingface.co/docs/transformers/main/model_doc/llama, https://github.com/facebookresearch/llama blog post: https://ai.facebook.com/blog/large-language-model-llama-meta-ai/ license: Limited, Non-commercial bespoke license research problem: Large Language Models (LLMs), transformer model
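Among the LLaMA extensions listed above, the SwiGLU activation is easy to sketch: the feed-forward block gates one linear projection with the SiLU of another. An element-wise sketch (the surrounding weight matrices and the down-projection are omitted):

```python
import math

def swiglu(gate: list[float], up: list[float]) -> list[float]:
    """Element-wise SwiGLU: SiLU(gate) * up, where SiLU(x) = x * sigmoid(x)."""
    return [g / (1.0 + math.exp(-g)) * u for g, u in zip(gate, up)]
```

Here `gate` and `up` stand for the outputs of the two parallel linear projections of the same input; the gated product is then projected back down to the model dimension.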
-
Title: Llama 2: Open Foundation and Fine-Tuned Chat Models model family: LLaMa date created: 2023-07-18 organization: Meta AI innovation: The paper introduces Llama 2-Chat, a fine-tuned Large Language Model optimized for dialogue use cases, ranging from 7 billion to 70 billion parameters. It outperforms most open-source chat models and emphasizes safety through specific data annotation and tuning. The primary architectural differences from Llama 1 include increased context length and grouped-query attention (GQA). pretraining architecture: Decoder pretraining task: Self-supervised learning fine-tuning task: both instruction tuning and reinforcement learning from human feedback for Llama-2-Chat training corpus: No specific information available other than "A new mix of publicly available online data" optimizer: AdamW tokenization: sentencepiece number of parameters: 7B, 13B, 34B, 70B maximum number of parameters (in million): 70000 hardware used: A100-80GB GPU hardware information: models were pretrained on Meta’s Research Super Cluster (RSC) as well as internal production clusters, both of which use NVIDIA A100s. extension: Llama-2-Chat application: Research on LLMs demonstrating multitask abilities such as reading comprehension, commonsense reasoning, math and coding abilities has source code: https://github.com/facebookresearch/llama blog post: https://ai.meta.com/blog/llama-2/, https://ai.meta.com/resources/models-and-libraries/llama/ license: Llama 2 Community License Agreement research problem: Large Language Models (LLMs), transformer model
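The benefit of the grouped-query attention (GQA) mentioned above is mostly about the KV cache at inference time: queries keep all their heads, but keys and values are shared across head groups. A back-of-the-envelope sketch (the layer/head numbers below are illustrative, not Meta's published configuration):

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes for the K and V caches of one sequence (factor 2 = K plus V)."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical config: 80 layers, head_dim 128, 4096-token context, fp16.
mha = kv_cache_bytes(80, 64, 128, 4096)  # one KV head per query head
gqa = kv_cache_bytes(80, 8, 128, 4096)   # 8 KV groups shared by 64 query heads
# GQA shrinks the cache by the query-heads / KV-groups ratio (8x here).
```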
-
Title: One Embedder, Any Task: Instruction-Finetuned Text Embeddings model family: T5 date created: 2022-12-01 organization: Meta AI, University of Washington, University of Hong Kong innovation: InstructOR is built on the GTR model architecture, initialized from T5 models, and fine-tuned on information search datasets. It uses a single encoder to encode input text and task instructions. The training objective distinguishes between good and bad candidate outputs based on their cosine similarity in embeddings. pretraining architecture: Encoder/Decoder fine-tuning task: Wide variety of instruction based text-to-text tasks training corpus: Finetuned on MEDI optimizer: AdamW number of parameters: 330M maximum number of parameters (in million): 330 hardware used: A100-40GB GPU extension: Fine-tunes T5 explicitly to optimize encoder to produce a general purpose text string embedding useful for many NLU tasks. application: Any NLU task requiring a single text string embedding. As of April 2023 InstructOR is the top-ranked system on the Massive Text Embedding Benchmark (MTEB). has source code: https://huggingface.co/hkunlp/instructor-xl blog post: https://pub.towardsai.net/paper-review-instructor-one-embedder-any-task-6a846b0d3ba license: Open, Apache 2.0 research problem: Large Language Models (LLMs), transformer model
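The training objective described above scores candidate outputs by the cosine similarity of their embeddings to the query embedding. A minimal sketch of that similarity (illustrative; InstructOR's actual loss is a temperature-scaled softmax over good and bad candidates):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# During training, a good candidate should score higher than a bad one
# for the same (instruction + input) embedding.
```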
-
Title: Alpaca: A strong, replicable instruction-following model model family: LLaMa date created: 2023-03-01 organization: Stanford University innovation: Alpaca 7B is fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations. On preliminary evaluation of single-turn instruction following, Alpaca behaves qualitatively similarly to OpenAI’s text-davinci-003, while being surprisingly small and easy/cheap to reproduce (<$600). pretraining architecture: Decoder fine-tuning task: supervised open-domain instruction finetuning training corpus: 52K instruction-following examples generated from OpenAI’s text-davinci-003 using the self-instruct mechanism, bootstrapped from 175 human-written instruction-output pairs, were leveraged to finetune the model number of parameters: 7B, 13B, 33B, 65B maximum number of parameters (in million): 65000 hardware used: A100-80GB GPU hardware information: fine-tuned the LLaMA models using Hugging Face’s training framework, taking advantage of techniques like Fully Sharded Data Parallel and mixed precision training. For our initial run, fine-tuning a 7B LLaMA model took 3 hours on 8 80GB A100s, which costs less than $100 on most cloud compute providers. extension: Alpaca is fine-tuned from a 7B LLaMA model application: Evaluated on a variety of text generation and classification tasks. has source code: https://github.com/tatsu-lab/stanford_alpaca blog post: https://medium.com/version-1/stanford-alpaca-a-small-yet-mighty-language-model-for-instruction-following-tasks-af9e92e87d9a license: Limited, non-commercial license CC-BY-NC-SA-4.0 research problem: Large Language Models (LLMs), transformer model
-
Title: Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality model family: LLaMa date created: 2023-03-30 organization: Large Model Systems Organization innovation: Vicuna is an open-source chatbot trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT. Preliminary evaluation using GPT-4 as a judge shows Vicuna-13B achieves more than 90%* quality of OpenAI ChatGPT and Google Bard while outperforming other models like LLaMA and Stanford Alpaca in more than 90%* of cases. pretraining architecture: Decoder fine-tuning task: supervised open-domain instruction finetuning training corpus: Vicuna is finetuned with 70K user-shared ChatGPT conversations collected from ShareGPT.com number of parameters: 7B, 13B, 33B maximum number of parameters (in million): 33000 hardware used: A100-80GB GPU hardware information: The training is done with 8x A100 GPUs. The longest single training run takes around 2 days. They utilized SkyPilot managed spot instances for saving training costs and FlashAttention for memory optimizations. extension: stable-vicuna-13b-delta, Vicuna v1.3 is fine-tuned from LLaMA with supervised instruction fine-tuning. application: chatbot has source code: https://chat.lmsys.org/, https://huggingface.co/lmsys/vicuna-33b-v1.3, https://huggingface.co/lmsys/vicuna-13b-v1.3, https://huggingface.co/lmsys/vicuna-7b-v1.3, https://github.com/lm-sys/FastChat blog post: https://lmsys.org/blog/2023-03-30-vicuna/ license: Non-commercial license research problem: transformer model, Large Language Models (LLMs)
-
Title: Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs model family: MosaicPretrainedTransformer (MPT) models date created: 2023-05-05 organization: MosaicML innovation: MosaicML tries to provide a commercially-usable, open-source model that matches (and in many ways surpasses) LLaMA-7B. It is trained on a large amount of data (1T tokens like LLaMA). pretraining architecture: Decoder pretraining task: Language Modeling fine-tuning task: Conversations for the Chat model, Short instructions for the Instruct model, Excerpts of fiction books for StoryWriter training corpus: Semantic Scholar ORC, The Stack Dedup, RedPajama, C4, mC4 optimizer: LION tokenization: EleutherAI's gpt-neox-20b BPE tokenizer trained on the Pile corpus number of parameters: 7B maximum number of parameters (in million): 7000 hardware used: A100-40GB GPU hardware information: A100-80GB GPUs were used for model finetuning; pretraining took ~9.5 days on 440 A100-40GB GPUs and cost ~$200k extension: MPT-7B-StoryWriter-65k+, MPT-7B-Chat, MPT-7B-Instruct application: Multiple tasks similar to LLaMA such as reasoning or code generation has source code: https://huggingface.co/mosaicml/mpt-7b blog post: https://www.youtube.com/watch?v=KSlWkrByc0o&t=9s, https://www.mosaicml.com/blog/mpt-7b license: Apache 2.0 research problem: Large Language Models (LLMs), transformer model
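MPT-7B was trained with the LION optimizer, whose update uses only the sign of an interpolated momentum plus decoupled weight decay. A pure-Python sketch of one step over flat parameter lists (illustrative; the default hyperparameters below follow the Lion paper, not MosaicML's training config):

```python
def sign(x: float) -> int:
    return (x > 0) - (x < 0)

def lion_step(w, g, m, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    """One Lion update: sign of interpolated momentum, then momentum EMA."""
    new_w, new_m = [], []
    for wi, gi, mi in zip(w, g, m):
        c = beta1 * mi + (1 - beta1) * gi            # direction interpolation
        new_w.append(wi - lr * (sign(c) + wd * wi))  # sign update + decay
        new_m.append(beta2 * mi + (1 - beta2) * gi)  # momentum EMA
    return new_w, new_m
```

Because the update magnitude is always ±lr per parameter (plus decay), Lion needs only one momentum buffer, halving optimizer state relative to Adam.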
-
Title: Introducing MPT-30B: Raising the bar for open-source foundation models model family: MosaicPretrainedTransformer (MPT) models date created: 2023-06-22 organization: MosaicML pretraining architecture: Decoder pretraining task: Language Modeling fine-tuning task: Finetuning on longer-context data (only samples with at least 4096 tokens from the base model dataset) for the 8K context window model, a large collection of chat datasets for the Chat model, and instruction-following data for the Instruct model training corpus: RedPajama - StackExchange, RedPajama - arXiv, RedPajama - Books, Semantic Scholar ORC, The Stack - Markdown, RedPajama - Wikipedia, The Stack - Selected Languages, RedPajama - CommonCrawl, c4 - English - SemDedup, mC4 3.1.0 - English optimizer: LION tokenization: EleutherAI's gpt-neox-20b BPE tokenizer trained on the Pile corpus number of parameters: 30B maximum number of parameters (in million): 30000 hardware used: A100-40GB GPU, H100-80GB GPU hardware information: The model was trained in three stages using the MosaicML Platform: (i) First it was trained on 440 A100-40GBs with a batch size of 1760. (ii) Then, on 216 A100-40GBs with a batch size of 1728. (iii) Training was completed on 256 H100-80GBs with a batch size of 512 with 8k context length and 50B tokens. The model was trained with sharded data parallelism using FSDP and used the LION optimizer. extension: MPT-30B 8k Context Window, MPT-30B-Chat, MPT-30B-Instruct application: Multiple text generation tasks. This model demonstrates coding ability better than its predecessor as well as specifically finetuned code writers. has source code: https://huggingface.co/mosaicml/mpt-30b blog post: https://www.mosaicml.com/blog/mpt-30b license: Apache 2.0 research problem: transformer model, Large Language Models (LLMs)
-
Title: StarCoder: may the source be with you! model family: SantaCoder date created: 2023-05-09 organization: BigCode Project innovation: StarCoder introduces novel features like an 8K context length, Fill-in-the-Middle (FIM) infilling, and Multi-Query-Attention (MQA) for fast inference. It outperforms other open LLMs for code and matches the OpenAI code-cushman-001 model. Additionally, it emphasizes safe open model release with improved PII redaction and an integrated attribution tool. pretraining architecture: Decoder fine-tuning task: Python-specific variant training corpus: StarCoderBase was trained using the dataset "The Stack v1.2" (Kocetkov et al., 2022), which exclusively contains data from permissively licensed GitHub repositories. The training set was further refined by selecting 86 languages out of the 358 programming languages present in The Stack, based on criteria like data volume, popularity, and active support. The assignment of data to programming languages was done based on file extensions. optimizer: Adam optimizer tokenization: use the Hugging Face Tokenizers library (MOI et al., 2022) to train a byte-level Byte-Pair-Encoding with a vocabulary size of 49,152 tokens—including the sentinel tokens number of parameters: 15.5B maximum number of parameters (in million): 15500 hardware used: A100-80GB GPU hardware information: trained our model on a GPU cluster with 512 A100 80 GB GPUs distributed across 64 nodes. We partitioned the model with a 3D-parallel layout that shards the model with both tensor and pipeline parallelism rank 4, requiring 16 GPUs (two nodes) for one replica. application: potential in generating code across multiple programming languages, and acting as a virtual technical assistant without the need for instruction-tuning or RLHF has source code: https://github.com/bigcode-project/starcoder blog post: https://huggingface.co/blog/starcoder license: OpenRAIL-M license research problem: Large Language Models (LLMs), transformer model
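The Fill-in-the-Middle (FIM) training mentioned above rearranges a document so the model learns to generate a missing middle from its prefix and suffix. A sketch of the prefix-suffix-middle (PSM) rearrangement using StarCoder-style sentinel tokens:

```python
def fim_transform(text: str, start: int, end: int) -> str:
    """Prefix-Suffix-Middle (PSM) rearrangement with sentinel tokens."""
    prefix, middle, suffix = text[:start], text[start:end], text[end:]
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"

# At inference time the model is prompted up to <fim_middle> and generates
# the missing middle, which is what enables in-editor code infilling.
```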
-
Title: Falcon-40B: an open large language model with state-of-the-art performance model family: transformer date created: 2023-05-25 organization: Technology Innovation Institute innovation: Falcon-7B and Falcon-40B were trained on 1.5 trillion and 1 trillion tokens respectively, in line with modern models optimized for inference. The key ingredient behind the high quality of the Falcon models is their training data, predominantly based (>80%) on RefinedWeb, a novel massive web dataset built on CommonCrawl. Another notable feature of the Falcon models is their use of multi-query attention. pretraining architecture: Decoder pretraining task: Causal language modeling training corpus: Falcon RefinedWeb optimizer: AdamW number of parameters: 7B, 40B maximum number of parameters (in million): 40000 hardware used: A100-40GB GPU hardware information: Falcon-40B-Instruct was trained on AWS SageMaker on 64 A100 40GB GPUs in P4d instances; Falcon-7B-Instruct was trained on AWS SageMaker on 32 A100 40GB GPUs in P4d instances; Falcon-40B was trained on 384 A100 40GB GPUs using a 3D parallelism strategy (TP=8, PP=4, DP=12) combined with ZeRO; Falcon-7B was trained on 384 A100 40GB GPUs using a 2D parallelism strategy (PP=2, DP=192) combined with ZeRO. extension: Falcon-Instruct application: research on large language models; as a foundation for further specialization and fine-tuning for specific use cases (e.g., summarization, text generation, chatbots) has source code: https://huggingface.co/tiiuae/falcon-7b, https://huggingface.co/tiiuae/falcon-40b blog post: https://huggingface.co/blog/falcon license: Apache 2.0 research problem: transformer model, Large Language Models (LLMs)
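The inference benefit of multi-query attention comes from sharing a single key/value head across all query heads, which shrinks the KV cache by roughly the number of heads. A back-of-the-envelope sketch; the layer and head counts below are illustrative assumptions, not Falcon's published configuration.

```python
# Rough KV-cache size: 2 tensors (K and V) per layer, each of shape
# [seq_len, n_kv_heads, head_dim], stored in 16-bit precision by default.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Multi-head attention: one K/V head per query head (64 here, illustrative).
mha = kv_cache_bytes(n_layers=60, n_kv_heads=64, head_dim=64, seq_len=2048)
# Multi-query attention: a single K/V head shared by all query heads.
mqa = kv_cache_bytes(n_layers=60, n_kv_heads=1, head_dim=64, seq_len=2048)
print(f"MQA shrinks the KV cache by {mha // mqa}x")  # 64x with 64 query heads
```

Smaller caches mean longer sequences and larger batches fit in GPU memory during generation, which is the "optimized for inference" angle noted above.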
-
Title: OpenLLaMA: An Open Reproduction of LLaMA model family: LLaMa date created: 2023-06-16 organization: Berkeley AI Research innovation: The only difference between the OpenLLaMA setting and the original LLaMA is the dataset used: OpenLLaMA employs open datasets rather than the one utilized by the original LLaMA. pretraining architecture: Decoder pretraining task: Causal language modeling training corpus: The models are trained on the RedPajama dataset released by Together, a reproduction of the LLaMA training dataset containing over 1.2 trillion tokens., RedPajama optimizer: AdamW number of parameters: 3B, 7B, 13B maximum number of parameters (in million): 13000 hardware used: cloud TPU-v4 hardware information: The models are trained on cloud TPU-v4s using EasyLM, a JAX-based training pipeline developed for training and fine-tuning large language models. Training employs a combination of normal data parallelism and fully sharded data parallelism (also known as ZeRO stage 3) to balance training throughput and memory usage. extension: public preview of OpenLLaMA, a permissively licensed open-source reproduction of Meta AI's LLaMA. application: same as LLaMA has source code: https://github.com/openlm-research/open_llama, https://huggingface.co/openlm-research/open_llama_13b, https://huggingface.co/openlm-research/open_llama_7b, https://huggingface.co/openlm-research/open_llama_3b blog post: https://github.com/openlm-research/open_llama license: Apache 2.0 research problem: transformer model, Large Language Models (LLMs)
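The fully sharded data parallelism (ZeRO stage 3) mentioned in the hardware notes amounts to each of N workers owning a 1/N slice of every parameter tensor, gathering the other slices only when a layer is actually needed. A minimal sketch of the ownership scheme; `shard_parameter` is a hypothetical helper, not EasyLM's API.

```python
# ZeRO stage 3 idea: a flat parameter of `num_elements` values is split into
# contiguous, near-equal slices, one per rank, so per-worker memory for
# parameters (and their optimizer state) scales as 1/world_size.
def shard_parameter(num_elements: int, world_size: int, rank: int) -> range:
    """Return the index range of the flat parameter owned by `rank`."""
    per_rank = -(-num_elements // world_size)  # ceiling division
    start = rank * per_rank
    return range(start, min(start + per_rank, num_elements))

shards = [shard_parameter(10, world_size=4, rank=r) for r in range(4)]
# Every element is owned by exactly one rank:
assert sum(len(s) for s in shards) == 10
```

In practice this is combined with plain data parallelism, as the OpenLLaMA notes describe, to trade communication volume against memory savings.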
-
Title: OpenLLaMA: An Open Reproduction of LLaMA model family: LLaMa date created: 2023-07-16 organization: Berkeley AI Research pretraining architecture: Decoder pretraining task: Causal language modeling training corpus: The v2 models are trained on a mixture of the Falcon refined-web dataset, the StarCoder dataset, and the Wikipedia, arXiv, Books, and StackExchange portions of the RedPajama dataset., StarCoder, Falcon refined-web optimizer: AdamW number of parameters: 3B, 7B maximum number of parameters (in million): 7000 hardware used: cloud TPU-v4 extension: public preview of OpenLLaMA, a permissively licensed open-source reproduction of Meta AI's LLaMA. application: same as LLaMA has source code: https://huggingface.co/openlm-research/open_llama_7b_v2, https://huggingface.co/openlm-research/open_llama_3b_v2 blog post: https://github.com/openlm-research/open_llama license: Apache 2.0 research problem: transformer model, Large Language Models (LLMs)
-
Title: Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models model family: GPT date created: 2023-08-30 organization: Cerebras, Mohamed bin Zayed University of Artificial Intelligence, Inception innovation: The paper introduces Jais and Jais-chat, state-of-the-art Arabic-centric LLMs. These models are trained bilingually on Arabic and English, addressing the underrepresentation of Arabic in the LLM space. They combine specialized Arabic text processing with advanced model architectures, offering superior performance in Arabic while remaining competitive in English. pretraining architecture: Decoder pretraining task: Causal language modeling fine-tuning task: Instruction Tuning training corpus: An English dataset comprising 232B tokens from Pile-CC, Books3, ArXiv, PubMed Central, OpenWebText2, Wikipedia (en), FreeLaw, PubMed Abstracts, DeepMind Math, Project Gutenberg, BookCorpus2, EuroParl, PhilPapers, Youtube Subtitles, NIH Grant Abstracts, and Enron Emails, plus 46B tokens from its GitHub subset., 3B tokens from English Wikipedia and 15B tokens from the Books3 corpus translated to Arabic., An Arabic dataset comprising 55B tokens from Abu El-Khair, Aranews, ArabicText 2022, the Arabic subset of C4, Arabic Wikipedia, ArabicNews 202, Maktabah, UN Meeting transcripts, and other sources. optimizer: AdamW tokenization: Jais tokenizer number of parameters: 1.3B, 6.7B, and 13B maximum number of parameters (in million): 13000 hardware used: Condor Galaxy 1 AI supercomputer hardware information: All training, hyper-parameter tuning, and instruction-tuning experiments were executed on the Condor Galaxy 1 (CG-1) AI supercomputer from Cerebras, built in partnership with G42. The final training and fine-tuning runs for Jais were performed on 16 CS-2 systems within CG-1. CG-1 is a Cerebras Wafer-Scale Cluster composed of Cerebras CS-2 systems, MemoryX, SwarmX, management, and input worker nodes.
extension: Jais-chat application: various NLP tasks encompassing world knowledge and commonsense reasoning has source code: https://huggingface.co/inception-mbzuai/jais-13b blog post: https://inceptioniai.org/jais/ license: Apache 2.0 research problem: Large Language Models (LLMs), transformer model
- If you would like to add a new language model, you can just click on the small edit button in the top-right corner of the README file (see below).
- Add the language model under an organization. If an organization does not exist, introduce it.
- Add the language model name as a new bullet point.
- Beneath it, add values for as many of the following properties as are known.
Title:
model family:
date created:
organization:
innovation:
pretraining architecture:
pretraining task:
fine-tuning task:
training corpus:
optimizer:
tokenization:
number of parameters:
maximum number of parameters (in million):
hardware used:
hardware information:
extension:
application:
has source code:
blog post:
license:
research problem:
BONUS: You can also add your structured description to the Open Research Knowledge Graph by selecting Add new Paper from the homepage and using the following template to describe the model.
Aim: to convert the contents of this repo into a csv file that can be imported into the ORKG via https://orkg.org/csv-import.
Coming soon...
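Until the converter is available, the intended mapping can be sketched in stdlib-only Python: parse one entry's run of "key: value" pairs into a dict and write it out with the `csv` module. `FIELDS` is truncated to a few of the template properties, and the parsing heuristics are assumptions about this README's layout, not the final import script.

```python
import csv
import io
import re

# A subset of the template properties above (hypothetical field selection).
FIELDS = ["Title", "model family", "date created", "organization", "license"]

def parse_entry(text: str) -> dict:
    """Split a flattened 'key: value key: value ...' entry into a dict."""
    keys = "|".join(re.escape(f) for f in FIELDS)
    row = {}
    # Each value runs until the next known key (or the end of the entry).
    for m in re.finditer(rf"({keys}):\s*(.*?)(?=(?:{keys}):|$)", text, re.S):
        row[m.group(1)] = m.group(2).strip()
    return row

entry = "Title: Falcon-40B model family: transformer date created: 2023-05-25"
row = parse_entry(entry)

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=FIELDS)  # missing keys become ""
writer.writeheader()
writer.writerow(row)
```

One row per model, with the template properties as the CSV header, should match what https://orkg.org/csv-import expects for tabular paper descriptions.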
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.