1 year ago

Mon Dec 11, 2023 3:31pm PST

Show HN: Use Purple LLaMA to test ChatGPT safeguards

I spent some time this weekend playing with LLaMA Guard, a fine-tuned LLaMA-7B model by Meta that lets you add guardrails around generative AI. I recorded a quick demo showing what it does and how to use it.

The best part is that you can define your own “safety taxonomy” with it — custom policies for what is safe vs unsafe interactions between humans (prompts) and AI (responses).

I wanted to see how “safe” conversations with OpenAI’s ChatGPT were, so I ran a bunch of prompts (a mixture of innocuous and inappropriate) and asked LLaMA Guard to classify the interactions as safe/unsafe.

My key takeaways from the exercise: 1. OpenAI has done a good job of adding guardrails for its models. LLaMA Guard helped confirm this.

2. What makes this really cool is I may have a very specific set of policies I want to enforce ON TOP of the standard guardrails that a model ships with. LLaMA Guard makes this possible.

3. This kind of model chaining — passing responses from OpenAI models to LLaMA is becoming increasingly common, and I think we’ll have even more complex pipelines in the near future. It helped to have a consistent interface to store this multi-model pipeline as a config, especially because that same config also contains my safety taxonomy.

Try it out yourself:

GitHub: https://github.com/lastmile-ai/aiconfig/tree/main/cookbooks/LLaMA-Guard

Colab: https://colab.research.google.com/drive/1CfF0Bzzkd5VETmhsniksSpekpS-LKYtX