3 weeks ago
Sun Apr 27, 2025 3:29pm PST
Show HN: SemHash – Semantic Text Deduplication, Outlier Filtering and Sampling
We’ve just released SemHash v0.3.0, a major rework of our open-source text pre-processing library. We’ve added two new functionalities: outlier filtering & representative sampling. The core API has been reworked to make sure all of these features can be used together in an intuitive way. Our new features use the existing approximate nearest neighbors index that we already used for semantic deduplication, so they can be ran very quickly after building the index on your dataset. The core package can now be used for:

- Semantic Deduplication: Remove semantic duplicates from your dataset. This can prevent train/test set overlap in classification tasks, or prevent duplicate samples in RAG/semantic search.

- Outlier Filtering: Surface and filter the most anomalous samples from your dataset. This can help with automated removal of low quality data, or data that should not be in your dataset.

- Representative Sampling: Select the most central and diverse examples using Maximal Marginal Relevance. This can help you quickly explore and understand a dataset, or even build a small, diverse, high quality dataset, for example for LLM finetuning.

We’ve designed these features in the same way as our semantic deduplication: CPU friendly, lightweight, and explainable.

We hope these features help you create cleaner datasets, or simply understand your data better. We’re curious to hear your feedback, and whether there are any other features you think would improve SemHash further!

read article
comments:
add comment
loading comments...