The best part is that you can define your own “safety taxonomy” with it — custom policies for what is safe vs unsafe interactions between humans (prompts) and AI (responses).
I wanted to see how “safe” conversations with OpenAI’s ChatGPT were, so I ran a bunch of prompts (a mixture of innocuous and inappropriate) and asked LLaMA Guard to classify the interactions as safe/unsafe.
My key takeaways from the exercise: 1. OpenAI has done a good job of adding guardrails for its models. LLaMA Guard helped confirm this.
2. What makes this really cool is I may have a very specific set of policies I want to enforce ON TOP of the standard guardrails that a model ships with. LLaMA Guard makes this possible.
3. This kind of model chaining — passing responses from OpenAI models to LLaMA is becoming increasingly common, and I think we’ll have even more complex pipelines in the near future. It helped to have a consistent interface to store this multi-model pipeline as a config, especially because that same config also contains my safety taxonomy.
Try it out yourself:
GitHub: https://github.com/lastmile-ai/aiconfig/tree/main/cookbooks/LLaMA-Guard
Colab: https://colab.research.google.com/drive/1CfF0Bzzkd5VETmhsniksSpekpS-LKYtX
YouTube: <https://www.youtube.com/watch?v=XxggqoqIVdg>
Would love the community's feedback on the overall approach.