Fri Jun 2, 2023 6:29pm PST

Show HN: Benchmarking AI Chatbots with Game Prompts

I’ve been using these prompts to compare how different LLMs perform, and the differences have been surprisingly large.

The toughest one is Wheel of Fortune, which only works consistently on GPT-4.

GPT-3.5 Turbo rarely works, and when it does, it shows only a surface-level grasp of the gameplay.

Bard never works.

Bing Chat kinda works, but sometimes gets sassy and ends the chat.
