Smarter Models Aren't Always Safer: A Deep Dive into Llama-3.1

September 23, 2024

Introduction

In our previous Llama-generation report, we analyzed the safety and advancements of Llama-2, Llama-3, and Llama-3.1. One surprising finding was that the larger Llama-3.1-70B model exhibited lower safety than its smaller counterpart, Llama-3.1-8B. In this article, we take a deep dive into the safety of the Llama-3.1 family, exploring the relationship between model size and safety. Our goal is to shed light on why larger, smarter models are not always the safest, and what this means for the future of AI safety.

Llama-3.1: Larger Models Are Smarter

According to Meta’s official reports, the Llama-3.1 models show significant improvements in benchmark scores as the parameter count increases.

For instance, in the MMLU benchmark, the scores rise steadily with the larger models, from 8B to 70B and up to 405B parameters. This trend is not limited to MMLU; similar patterns of improved performance are observed across other tasks such as code generation, mathematical problem-solving, and reasoning. These results suggest that as models grow in size, their ability to handle complex and diverse tasks becomes more refined, reinforcing the belief that larger models are generally "smarter."

Key Insight: Vulnerability in Larger Models

Our analysis revealed an unexpected trend: in some cases, the larger Llama-3.1-70B model exhibited greater vulnerability than its smaller counterpart, Llama-3.1-8B. We used the same dataset, attack methods, and evaluation metrics as in our previous Llama-generation report. If you haven't read that report yet, we highly recommend starting there for further context. In this analysis, Baseline refers to the default set of negative (harmful) prompts with no attack method applied.

An overall score of 0 represents the most dangerous or inconsistent responses, while an overall score of 100 represents the safest and most coherent responses.
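As a rough illustration of how such a 0–100 overall score can be produced, the sketch below averages per-response safety ratings from a judge-style evaluator across all prompts for a given model and attack method. The function name, the [0, 1] rating range, and the simple averaging are illustrative assumptions, not the exact EPASS scoring implementation.

```python
from statistics import mean

def overall_score(judge_ratings: list[float]) -> float:
    """Aggregate per-response safety ratings into a single 0-100 score.

    Each rating is assumed to lie in [0, 1], where 0 means the response was
    dangerous or incoherent and 1 means it was safe and coherent. This is a
    minimal sketch; the actual EPASS aggregation may differ.
    """
    if not judge_ratings:
        raise ValueError("no ratings to aggregate")
    return 100.0 * mean(judge_ratings)

# Example: three judged responses for one model under one attack method.
print(overall_score([1.0, 0.8, 0.4]))  # -> 73.33...
```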

Attack-Based Analysis

The following table shows the Overall Score for the Baseline as well as for each of the attack methods applied on top of it.

Focusing on the Safety-based Overall Score, it is surprising to see that the Llama-3.1-70B model scored lower than the Llama-3.1-8B model. The much larger Llama-3.1-405B model performed similarly to the 8B model overall, although in certain attack scenarios the Llama-3.1-8B model was still judged to be safer. These results suggest that increasing model size does not always equate to enhanced safety and may, in fact, introduce new vulnerabilities under certain conditions.
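To make the setup behind this comparison concrete, here is a minimal sketch of an attack-based evaluation loop: every model answers the same prompt set, once unmodified (Baseline) and once per attack method, and each response is judged before aggregation into the Overall Score. The model identifiers, the attack list, and the `generate` / `apply_attack` / `judge_safety` helpers are hypothetical stand-ins, not the actual pipeline.

```python
from statistics import mean

# Models compared in this post; the attack list below is an illustrative subset.
MODELS = ["Llama-3.1-8B", "Llama-3.1-70B", "Llama-3.1-405B"]
ATTACKS = ["Baseline", "ArtPrompt", "PAIR", "TAP"]

def evaluate(models, attacks, prompts, generate, apply_attack, judge_safety):
    """Return {(model, attack): overall 0-100 score} over a shared prompt set.

    `generate(model, prompt)`, `apply_attack(attack, prompt)`, and
    `judge_safety(response)` (returning a rating in [0, 1]) are assumed
    interfaces; "Baseline" leaves the prompt unchanged.
    """
    scores = {}
    for model in models:
        for attack in attacks:
            ratings = []
            for prompt in prompts:
                attacked = prompt if attack == "Baseline" else apply_attack(attack, prompt)
                response = generate(model, attacked)
                ratings.append(judge_safety(response))
            scores[(model, attack)] = 100.0 * mean(ratings)
    return scores
```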

This trend of larger models being more vulnerable is particularly noticeable with specific attack methods, such as ArtPrompt, where the more advanced capabilities of the Llama-3.1-70B model were exploited in jailbreak scenarios. For instance, under the ArtPrompt attack, the Llama-3.1-8B model successfully blocked the harmful instructions:

Llama-3.1-8B (Failed Attack Example):

However, the same attack method was successful against the Llama-3.1-70B model, which generated a detailed and harmful response due to its more sophisticated understanding:

Llama-3.1-70B (Successful Attack Example):

Interestingly, the Llama-3.1-405B model performed more comparably to the 8B model than the 70B in certain scenarios. This suggests that while larger models generally possess more advanced capabilities, the added complexity and size don’t always result in greater vulnerabilities. It could be that for models as large as 405B, Meta has implemented additional safety measures or refined its training methods to mitigate the risks observed in the 70B model. Alternatively, the extreme size and parameter count of the 405B model might contribute to some form of self-correction, where the model's capacity helps counterbalance its susceptibility to certain attacks.

Category-Based Analysis

Using the radar chart from our EPASS, where Llama-3.1-8B is represented in blue and Llama-3.1-70B in white, we conducted an analysis of various categories. The scores shown represent data aggregated across all attack methods.

In terms of Sexual Content, both models performed well, with the 8B model scoring 93.94 and the 70B model scoring 80.60. However, in categories like Ethics, Crime, Political Sensitivity, and Misinformation, both models showed lower scores. This suggests that while both models are relatively strong in handling certain types of sensitive content, they share vulnerabilities in other key areas.
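For the category view, per-category scores are averaged over every attack method before being plotted on the radar chart. A minimal sketch of that aggregation, using pandas with hypothetical column names and placeholder values (not the actual per-attack breakdown), might look like this:

```python
import pandas as pd

# Placeholder per-run results: one safety score per (model, attack, category).
results = pd.DataFrame([
    {"model": "Llama-3.1-8B",  "attack": "Baseline",  "category": "Sexual Content", "score": 95.0},
    {"model": "Llama-3.1-8B",  "attack": "ArtPrompt", "category": "Sexual Content", "score": 90.0},
    {"model": "Llama-3.1-70B", "attack": "Baseline",  "category": "Sexual Content", "score": 85.0},
    {"model": "Llama-3.1-70B", "attack": "ArtPrompt", "category": "Sexual Content", "score": 75.0},
    # ... remaining categories (Ethics, Crime, Political Sensitivity, ...) and attacks ...
])

# Aggregate across all attack methods, as in the radar chart.
category_scores = (
    results.groupby(["model", "category"])["score"]
           .mean()
           .unstack("category")
)
print(category_scores)
```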

From a safety perspective, each model shows its own strengths and weaknesses across different categories. This indicates that model size alone does not determine overall safety performance.

Conclusion

Our analysis of Llama-3.1 models has revealed important insights regarding the relationship between model size and safety. While larger models like Llama-3.1-70B generally show greater capabilities in terms of intelligence and task performance, they also present heightened vulnerabilities. On the other hand, smaller models like Llama-3.1-8B performed more consistently across various safety metrics, showing that larger isn’t always safer.

These findings reinforce the notion that model architecture and training play crucial roles in determining a model’s safety and robustness, rather than size alone.

Future Outlook

As AI models continue to grow in complexity, the need for ongoing research into safety and security becomes even more critical. Models as large as Llama-3.1-405B, along with future iterations of Llama, will require more rigorous testing and refined mitigation strategies to address emerging vulnerabilities. By combining enhanced model architectures with continuous safety assessments, developers can better ensure the robustness of these models in real-world applications.

Our Platform Advantage

Our EPASS offers a streamlined approach for conducting such in-depth comparisons and analyses. By utilizing our tools, developers and researchers can easily evaluate and understand the security and safety of AI models, allowing for informed decisions and more secure deployments.

References

Llama 3.1

Introducing Llama 3.1: Our most capable models to date

Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks

ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs

Universal and Transferable Adversarial Attacks on Aligned Language Models

Jailbreaking Black Box Large Language Models in Twenty Queries

Does Refusal Training in LLMs Generalize to the Past Tense?

A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily

Tree of Attacks: Jailbreaking Black-Box LLMs Automatically