Tues Jun 10, 2025 1:52pm PST

Data manipulation using natural language prompts

I've got a dataset of ~100K input-output pairs that I want to use for fine-tuning Llama. Unfortunately, it's not the cleanest dataset, so I'm having to spend some time tidying it up. For example, I only want records in English, and I only want records where the input contains foul language (that's what I need for my use case). There are loads more checks like these that I want to run, and in general I can't run them deterministically because they require understanding natural language.

It's relatively straightforward to get GPT-4o to tell me (for a single record) whether or not it's in English, and whether or not it contains foul language. But if I want to run these checks over my entire dataset, I need to set up some async pipelines and it all becomes very tedious.
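For anyone wondering what I mean by "async pipelines", here's a minimal sketch of the shape of it, not my actual code. Assumptions: `stub_judge` is a hypothetical placeholder that fakes the yes/no answer with crude keyword rules so the example runs offline (in practice it would be swapped for a real model call), and the names `CHECKS`, `keep_record`, `filter_records` and the `max_concurrency` default are made up for illustration.

```python
import asyncio
from typing import Awaitable, Callable

Record = dict[str, str]
# A judge takes (question, text) and returns a yes/no verdict.
Judge = Callable[[str, str], Awaitable[bool]]

# Each check is a plain-language yes/no question asked about the record's input.
CHECKS = {
    "is_english": "Is the following text written in English? Answer yes or no.",
    "has_foul_language": "Does the following text contain foul language? Answer yes or no.",
}


async def stub_judge(question: str, text: str) -> bool:
    """Hypothetical stand-in for a model call; NOT a real language or profanity detector."""
    await asyncio.sleep(0)  # yield control, as a network request would
    if "English" in question:
        return text.isascii()  # crude assumption: ASCII-only means English
    return any(word in text.lower() for word in ("damn", "crap"))


async def keep_record(record: Record, judge: Judge, limiter: asyncio.Semaphore) -> bool:
    """Run every check on one record; keep it only if all checks answer yes."""

    async def ask(question: str) -> bool:
        async with limiter:  # cap the number of in-flight requests
            return await judge(question, record["input"])

    answers = await asyncio.gather(*(ask(q) for q in CHECKS.values()))
    return all(answers)


async def filter_records(
    records: list[Record], judge: Judge, max_concurrency: int = 8
) -> list[Record]:
    """Return the records that pass all checks, preserving the original order."""
    limiter = asyncio.Semaphore(max_concurrency)
    verdicts = await asyncio.gather(*(keep_record(r, judge, limiter) for r in records))
    return [r for r, keep in zip(records, verdicts) if keep]


if __name__ == "__main__":
    sample = [
        {"input": "Damn, this is broken", "output": "..."},
        {"input": "Good morning everyone", "output": "..."},
    ]
    print(asyncio.run(filter_records(sample, stub_judge)))
```

Even this toy version leaves out retries, rate-limit handling, caching of verdicts, and batching so that 100K records aren't all scheduled at once, which is where the tedium comes in.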

Collectively, this cleaning process is taking me ages. What do y'all use for this? Are there solutions out there that could help me go faster? I expected there to be some nice product where I can upload my dataset and interact with it via prompts (e.g. 'remove all records without foul language in them'), but I can't really find anything. Am I missing something super obvious?
