Fri Jun 2, 2023 6:29pm PST
Show HN: Benchmarking AI Chatbot with Game Prompts
I’ve been using these prompts to compare how different LLMs perform, and the differences have been staggering.

The toughest one is Wheel of Fortune, which only works consistently on GPT4.

3.5 Turbo rarely works, and when it does, it shows only a surface-level understanding of the gameplay.

Bard never works.

BingChat kinda works, but sometimes gets sassy and ends the chat.
