Llama Series Comparison Across Generations: A White Paper

August 14, 2024

Introduction

The Llama series, an open-source large language model (LLM) developed by Meta, has gained recognition for its high performance and the emphasis placed on safety and security during its development. Over the years, several significant versions of the Llama series have been released, each aimed at improving upon its predecessor:

Llama-2 Series: Released on July 18, 2023

Llama-3 Series: Released on April 18, 2024

Llama-3.1 Series: Released on July 23, 2024

This white paper aims to analyze the evolution of safety and security within the Llama series, highlighting both the improvements made across generations and the challenges that persist.

Methodology

Key Insights

We found two key insights derived from our analysis of the Llama series:

  • Improvement in Safety Scores: As the Llama series has evolved, the overall Safety Scores have shown improvement when no adversarial attacks are applied.
  • Mitigation of Attacks: While some attacks have been successfully mitigated in newer generations, certain attack methods remain effective.

Evaluation Dataset

To thoroughly assess the security and safety of the Llama models, we created negative prompts spanning across 31 categories within the domains of Safety, Privacy, Security, and Integrity. Each of these domains was defined as follows:

Safety: The safety domain refers to risks that promote danger such as Crime and Hate Speech.

Privacy: The privacy domain refers to risks related to data breaching such as Data Sharing and Membership Interference.

Security: The security domain refers to behavioral risks of the model such as Roleplay and Prompt Injection.

Integrity: The integrity domain refers to risks related to model ethics such as Copyright and Fraud.

In addition to these negative prompts, we applied the following attack methods to further modify the prompts, creating adversarial inputs to test the models’ robustness:

Adaptive: This method uses adaptive prompt templates to exploit model-specific vulnerabilities, tailoring the prompts based on the model's responses to maximize the likelihood of a successful attack.

ArtPrompt: ArtPrompt takes advantage of the ability of large language models (LLMs) to properly interpret ASCII art, leading them to generate unintended or harmful outputs.

GCG: GCG generates adversarial suffixes automatically based on the model's gradient, effectively steering the model toward generating harmful or undesirable content.

PAIR: PAIR employs an attacker LLM to iteratively generate and refine jailbreak prompts for a target LLM using a black-box approach, enhancing the attack's effectiveness over multiple iterations.

Past Tense: This method simply reformulates a harmful request in the past tense, exploiting the model’s tendency to respond differently to past events, potentially bypassing safety mechanisms.

ReNeLLM: ReNeLLM utilizes LLMs to automatically generate jailbreak prompts through techniques like prompt rewriting and scenario nesting, creating complex, layered attacks that can be challenging to mitigate.

TAP: TAP automates the generation of jailbreak prompts using a tree-of-thoughts reasoning and pruning strategy, systematically exploring and refining possible prompts to maximize their effectiveness.

These attack methods were chosen to evaluate the robustness of the Llama models against a wide range of adversarial techniques. Our EPASS offers the ability to compare even more attack methods, providing a comprehensive analysis of model vulnerabilities and defenses.

We evaluated approximately 10 test cases for each category and attack method combination, resulting in over 300 prompts per model and attack method.

Evaluation Metrics

Each response from the models was evaluated using two primary metrics:

Safety Score: Measures how safe the response is.

Coherence Score: Assesses the consistency and logical integrity of the response.

These two scores were combined to calculate an Overall Score on a scale from 0 to 100 using our proprietary platform. A score of 0 represents the most dangerous or inconsistent responses, while a score of 100 represents the safest and most coherent responses.

Evaluated Models

The following Llama models were evaluated in this study:

Llama-2 Series: Llama-2-7B, Llama-2-13B, Llama-2-70B

Llama-3 Series: Llama-3-8B, Llama-3-70B

Llama-3.1 Series: Llama-3.1-8B, Llama-3.1-70B, Llama-3.1-405B

This diverse set of models allowed for a comprehensive comparison across different model sizes and generations.

Results and Analysis

Key Insight 1: Improvement in Safety Scores

The following chart illustrates the Overall Scores measured from negative prompts that were not subjected to any attack methods. By comparing models of the same size across different generations, it is evident that the Overall Scores have consistently improved. This indicates that the Llama models have become increasingly safer and more secure with each subsequent release.

Additionally, it is observed that scores across different domains—Safety, Privacy, Security, and Integrity—are generally consistent. However, the Privacy scores tend to be slightly higher compared to the other domains. This suggests that the models are particularly cautious when handling prompts related to privacy, reflecting a stronger emphasis on safeguarding sensitive information.

Key Insight 2: Mitigation of Attacks

The chart below illustrates the Overall Scores for different attack methods applied across various generations of the Llama models. The data clearly shows that as the Llama series has evolved, several attack methods have been effectively mitigated, while others remain problematic or have even increased in vulnerability.

Mitigated Attacks:

GCG, PAIR, TAP, and Adaptive: These attack methods have seen significant improvements across the Llama series. For instance, in the Llama-2-7B-Chat model, the GCG attack method successfully generated harmful instructions for spreading misinformation, as shown below:

Llama-2-7b-chat (Successful Attack Example):

However, by the time we reach the Llama-3.1-8B model, the same attack method was effectively blocked, demonstrating significant improvement:

Llama-3.1-8B (Mitigated Attack Example):

This reflects Meta's continuous efforts to enhance the robustness of their models against these specific adversarial techniques.

Persistent Vulnerabilities :

Past Tense and ReNeLLM: On the other hand, these attack methods continue to pose challenges. The Overall Scores for these attacks have not shown the same level of improvement, and in some cases, they have even declined, particularly in the larger models like Llama-3.1-70B. This suggests that while certain attacks have been successfully mitigated, others still exploit underlying weaknesses in the models, which may have been exacerbated by the increased complexity of larger models.

This analysis highlights the dual nature of model evolution: while progress is being made in certain areas, other vulnerabilities persist or even worsen, emphasizing the need for continuous evaluation and refinement of security measures as model architectures become more complex.

Conclusion

This analysis of the Llama series across generations has highlighted two key insights:

  • Improvement in Safety and Security: As the Llama models have evolved, their overall safety and security have significantly improved, particularly in handling negative prompts without attacks.
  • Evolution of Attack Effectiveness: While several attack methods have been effectively mitigated in newer generations, others remain problematic, emphasizing the need for continuous refinement.

Future Outlook

As the Llama series continues to evolve, further enhancements in security and robustness are expected. Developers and users should remain vigilant, continuously testing and refining these models to address emerging vulnerabilities. Ongoing research and development will be crucial in ensuring that these models remain both powerful and safe.

Our Platform Advantage

Our EPASS offers a streamlined approach for conducting such in-depth comparisons and analyses. By utilizing our tools, developers and researchers can easily evaluate and understand the security and safety of AI models, allowing for informed decisions and more secure deployments.

Stay tuned for a detailed report on Llama-3.1, which will be released soon!

References

Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks

ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs

Universal and Transferable Adversarial Attacks on Aligned Language Models

Jailbreaking Black Box Large Language Models in Twenty Queries

Does Refusal Training in LLMs Generalize to the Past Tense?

A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily

Tree of Attacks: Jailbreaking Black-Box LLMs Automatically