Meta Releases Llama 3.1 – the First Set of Open-Source Models to Outperform the Top Proprietary Models

AI Ecosystem Updates | Issue #3 [August 06, 2024]

What?

Meta introduced Llama 3.1, its most advanced open-source Large Language Model (LLM) family to date. Llama 3.1 includes the flagship 405B model (405 billion parameters), which boasts superior capabilities across domains such as general knowledge, steerability, math, tool use, and multilingual translation. The release also includes updated 8B (8 billion parameters) and 70B (70 billion parameters) models. Each model comes in base and instruction-tuned versions.

Key Features and Improvements

Expanded Context and Multilingual Support
  • The Llama 3.1 models, part of Meta’s ‘herd’ of language models known as Llama 3, which also includes the updated Llama 3 8B and 70B models, now support a context length of up to 128K tokens, enabling more extensive inputs and applications such as long-form text summarization, multilingual conversational agents, and coding assistants. The model weights have been made available here; a minimal loading sketch follows this list.
  • The models support eight languages, enhancing their usability in diverse linguistic contexts.
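
As a quick illustration of getting started with the expanded context window, here is a minimal, hedged sketch of loading an instruction-tuned Llama 3.1 model with the Hugging Face transformers library. It assumes you have accepted the license for the gated meta-llama/Meta-Llama-3.1-8B-Instruct checkpoint and authenticated with the Hub; this is a getting-started sketch, not Meta’s reference code.

```python
# Minimal sketch: load Llama 3.1 8B Instruct and generate a response.
# Requires torch, transformers, and access to the gated checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the native 16-bit weights
    device_map="auto",
)

# The chat template formats messages the way the instruction-tuned model expects.
messages = [{"role": "user", "content": "Summarize this long report: ..."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# With a 128K-token context, very long documents can go straight into the prompt.
output = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
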
Unmatched Flexibility and Control
  • The Llama 3.1 models offer state-of-the-art capabilities that rival top closed-source models.
  • Newly enabled workflows include synthetic data generation, which facilitates the training of specialized models, and model distillation, which improves the quality of smaller models, broadening the scope of AI applications (a minimal sketch of this workflow follows below).
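
To make the synthetic-data workflow concrete, here is an illustrative sketch: a large "teacher" model answers seed prompts, and its outputs are saved as supervised fine-tuning data for a smaller "student" model. The model ID, prompts, and file name are placeholders rather than Meta’s actual pipeline, and the 405B teacher would in practice require a multi-GPU server (any capable instruct model can stand in).

```python
# Illustrative sketch of distillation via synthetic data generation.
import json
from transformers import pipeline

# A large teacher model produces the training signal (placeholder ID).
teacher = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3.1-405B-Instruct",
    device_map="auto",
)

seed_prompts = [
    "Explain gradient descent to a new programmer.",
    "Write a Python function that merges two sorted lists.",
]

# Generate candidate responses; a real pipeline would filter these for quality.
records = []
for prompt in seed_prompts:
    completion = teacher(prompt, max_new_tokens=256, return_full_text=False)
    records.append({"instruction": prompt, "response": completion[0]["generated_text"]})

# The resulting JSONL file can then feed an SFT run for a smaller model.
with open("synthetic_sft_data.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```
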
The Llama System and Commitment towards Responsible AI
  • The Llama models are part of a broader system that includes components, such as Llama Guard 3 and Prompt Guard.
  • Meta continues to advocate for openly accessible AI and intends to go beyond the foundation models by giving the developer community the access and tools required to create custom agents and new behaviors with the Llama models. This thinking dates back to when Meta first began shipping components beyond the core LLMs.
  • To promote the responsible use of AI with the Llama models, Meta is introducing a reference system with various sample applications, along with new safety and security tools: Llama Guard 3, a multilingual safety model, and Prompt Guard, a prompt-injection filter (a usage sketch follows below).
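
As a hedged sketch of how a safety layer like Llama Guard 3 can sit in front of the main model, the example below classifies a user message before it is answered. It assumes the meta-llama/Llama-Guard-3-8B checkpoint and its chat template as published on the Hugging Face Hub; the exact output format should be checked against the model card.

```python
# Hedged sketch: screen a user message with Llama Guard 3 before answering it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

guard_id = "meta-llama/Llama-Guard-3-8B"  # as published on the Hub
tokenizer = AutoTokenizer.from_pretrained(guard_id)
guard = AutoModelForCausalLM.from_pretrained(
    guard_id, torch_dtype=torch.bfloat16, device_map="auto"
)

chat = [{"role": "user", "content": "How do I pick a lock?"}]
input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(guard.device)

# The guard model replies with "safe" or "unsafe" plus the violated category codes.
output = guard.generate(input_ids, max_new_tokens=32)
verdict = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)

if verdict.strip().startswith("unsafe"):
    print("Blocked:", verdict)  # reroute or refuse instead of calling the main model
else:
    print("Safe to forward to the main Llama 3.1 model.")
```
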

Ecosystem and Partnerships

Collaborative Ecosystem
  • Over 25 partners, including AWS, NVIDIA, Databricks, Dell, Azure, Google Cloud, and Snowflake, support the Llama 3.1 models.
  • The ecosystem is designed to support developers from day one, offering advanced capabilities and immediate development opportunities with the Llama 3.1 405B model.
Llama Stack
  • A standardized interface, the Llama Stack API, is proposed to facilitate third-party projects that leverage the Llama models. A request for comment is available here.
  • The Llama Stack aims to simplify integration and enhance interoperability within the AI community.

Model Architecture and Training

Training and Optimizations
  • Llama 3.1 405B was trained on over 15 trillion tokens, using 16,000 H100 GPUs.
  • The training process involved significant optimizations and improved data pre-processing and post-training curation.
  • For inference, the models were quantized from 16-bit (BF16) to 8-bit (FP8) numerics to reduce compute requirements, enabling the 405B model to run within a single server node (an illustrative sketch follows below).
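
The sketch below illustrates the idea behind 8-bit inference. Note that Meta’s production setup quantizes to FP8, whereas this example uses int8 via the bitsandbytes integration in transformers as an accessible stand-in, and the smaller 8B model is used purely for illustration.

```python
# Illustrative sketch of the memory savings from 8-bit inference.
# Meta quantizes the 405B model to FP8; int8 via bitsandbytes shows the same idea.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",  # smaller model, for illustration
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# Weights now take roughly half the memory of the bf16 checkpoint; it is this
# kind of saving that lets the 405B model fit within a single server node.
print(f"{model.get_memory_footprint() / 1e9:.1f} GB")
```
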
Architecture
  • To keep model development simple and scalable, the team chose a standard decoder-only transformer architecture with minor adaptations, rather than a mixture-of-experts model, in order to maximize training stability.
  • Compared to the earlier Llama versions, the team used more and better-quality data, improved data preparation, and applied stricter quality checks.
Instruction and Chat Fine-Tuning
  • After pre-training, the 405B model went through multiple rounds of alignment, each involving Supervised Fine-Tuning (SFT), Rejection Sampling (RS), and Direct Preference Optimization (DPO); a minimal DPO sketch follows this list.
  • Synthetic data generation was used extensively to produce high-quality training data, ensuring the model’s effectiveness in responding to user instructions. The iterations improved not only the quality of the generated synthetic data but also the performance of the model.
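
For a concrete picture of the DPO stage mentioned above, here is a minimal sketch using the trl library. The tiny inline preference dataset and the training arguments are illustrative only; Meta’s actual recipe chained SFT, rejection sampling, and DPO over many rounds at vastly larger scale.

```python
# Minimal DPO sketch with trl: each row pairs a preferred and a dispreferred response.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # placeholder starting point
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

prefs = Dataset.from_dict({
    "prompt": ["Explain overfitting in one sentence."],
    "chosen": ["Overfitting is when a model memorizes training noise instead of the underlying pattern."],
    "rejected": ["Overfitting is when your code has too many bugs."],
})

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="llama-dpo", per_device_train_batch_size=1),
    train_dataset=prefs,
    processing_class=tokenizer,
)
trainer.train()  # nudges the model toward the "chosen" responses
```
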

Performance

The Llama 3.1 models have demonstrated significant performance improvements across multiple benchmarks, often surpassing leading models like GPT-4o and Claude 3.5 Sonnet. The 405B model, in particular, excels in a variety of tasks, showcasing its superior capabilities. A performance summary of the Llama 3.1 405B, 70B, and 8B models follows.

Llama 3.1 405B

The 405B model was compared to other leading models on 16 public benchmarks, tying or surpassing them on seven, despite some differences in prompting methods on benchmarks such as GSM8K and MMLU (zero-shot chain-of-thought). It set new records on benchmarks like IFEval (instruction following), ARC Challenge (reasoning), and Nexus (tool use). Its performance is comparable with GPT-4o overall, and it outperformed GPT-4o on the ARC Challenge (reasoning), GSM8K, Nexus (tool use), ZeroSCROLLS/QuALITY (long context), and Multilingual MGSM benchmarks.

  • General Knowledge: Achieved an 88.6 score on MMLU-Chat, indicating strong general knowledge comparable to GPT-4o.
  • Math Proficiency: Scored 96.8 on GSM8K, highlighting its exceptional mathematical capabilities, outperforming GPT-4o.
  • Coding Tasks: Earned an 89 on HumanEval, making it a top performer in coding tasks.
  • Reasoning and Tool Use: Outperformed GPT-4o on the ARC Challenge and Nexus benchmarks, setting new standards in reasoning and tool use.
  • Long Context and Multilingual Benchmarks: Excelled on the ZeroSCROLLS/QuALITY (long context) and Multilingual MGSM benchmarks.
Llama 3.1 70B

The 70B model established new benchmarks in general knowledge, coding, math, and reasoning, outperforming other models in its size category.

  • General Knowledge: Scored 86 on MMLU-Chat.
  • Math Proficiency: Achieved 95.1 on GSM8K.
  • Coding Tasks: Recorded an 80 on HumanEval, proving robust capabilities across these domains.
Llama 3.1 8B

The 8B model demonstrated dominance in general knowledge, coding, and math benchmarks, surpassing other similarly sized models.

  • General Knowledge: Scored 73 on MMLU-Chat.
  • Math Proficiency: Scored 84.5 on GSM8K.
  • Coding Tasks: Achieved a 72.6 on HumanEval, marking significant improvements over its predecessor, Llama 3 8B.

Openness and Innovation

Driving Innovation through Openness
  • Meta has made the weights of the Llama models available, including on Hugging Face, allowing developers to customize them, train them on new datasets, and fine-tune them for specific applications.
  • Meta has long argued that open-source access democratizes AI, making the technology broadly accessible, preventing a concentration of power, and enabling wider societal benefits.
  • Llama 3.1 is, in all probability, the first open-weights model to outperform some of the top proprietary models across multiple benchmarks.
Community Contributions
  • The Llama community has developed various applications, including educational tools and healthcare solutions, demonstrating the potential of open-source AI. More here.
  • Developers are encouraged to explore workflows, such as synthetic data generation and model distillation using the provided tools and partnerships.
License

The Llama 3.1 models are available under a custom license permitting commercial use by companies with up to 700 Mn monthly active users as of Llama 3.1’s release. The license also allows training other models on data generated by Llama 3.1. This setup gives the vast majority of organizations the freedom to use the models as needed, while requiring the biggest tech companies to secure a separate commercial license from Meta.

Future Prospects

Meta plans to continue expanding the Llama models, exploring more device-friendly sizes, additional modalities, and enhancing the agent platform layer. But more than that, Meta is excited about the prospect of the innovative products that the community will build with these models in the future.

Thoughts

The AI landscape changes every day as the AI community delivers innovations that further alter that landscape and the way we interact with the world. We are glad that Meta is pushing the boundary by making cutting-edge innovation open and accessible, and by encouraging the developer community to build applications that solve specific use cases with the Llama 3.1 models.

Additionally, the published details on model architecture, training, inference, fine-tuning, and evaluation, alongside best-practice guidelines, will only help foster innovation through future research, as pointed out in this wonderful coverage by DeepLearning.AI. The article emphasizes that data-centric AI, the practice of systematically engineering data to build successful AI systems, was key to the training process and significantly enhanced the performance of the Llama 3.1 models.

The capability to handle 128K tokens of context makes the Llama 3.1 models a strong candidate for Retrieval-Augmented Generation (RAG) applications; a minimal sketch follows below. However, the Llama 3.1 models are not yet multimodal, and Meta is working on changing that. We would love to see future versions of the Llama models handle multimodal use cases.
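
As a minimal sketch of the RAG pattern (using common open components, not a prescribed stack): embed a document collection, retrieve the passages closest to the query, and pack them into the model’s long context.

```python
# Minimal RAG sketch: retrieve relevant passages, then build the model prompt.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # a common open embedding model

docs = [
    "Llama 3.1 supports a context length of up to 128K tokens.",
    "The 405B model was trained on over 15 trillion tokens.",
    "Llama Guard 3 is a multilingual safety model.",
]
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

query = "How long a context does Llama 3.1 support?"
q_vec = embedder.encode([query], normalize_embeddings=True)[0]

# On normalized vectors, cosine similarity reduces to a dot product.
top = np.argsort(doc_vecs @ q_vec)[::-1][:2]
context = "\n".join(docs[i] for i in top)

# The assembled prompt would then go to a Llama 3.1 instruct model; with a
# 128K-token window, far more context than this toy example can be included.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)
```
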

Thank you for reading through! I genuinely hope you found the content useful. Feel free to reach out to us at [email protected] and share your feedback and thoughts to help us make it better for you next time.


Acronyms used in the blog that have not been defined earlier: (a) Artificial Intelligence (AI), (b) Billion (B), (c) kilo (K) – Denotes thousand, (d) Amazon Web Services (AWS), (e) Application Programming Interface (API), (f) Graphics Processing Unit (GPU), and (g) Million (Mn).