Show HN: Zero-shot object detection demo
Hello HN!

We've built an interactive demo of Google AI's OWL-ViT [1] zero-shot object detection model, using Hugging Face transformers.

Regular object detection models are trained on a fixed set of categories, for example cats, dogs, and birds. If you want to detect a new type of object, such as a horse, you have to collect and label images containing horses and retrain the model.

A zero-shot object detection model is a so-called open-vocabulary model: it can detect a huge number of object categories without being retrained. These categories are not predefined: you can provide any free-form text query, like “yellow boat”, and the model will attempt to detect objects that match that description.

Zero-shot object detection models like OWL-ViT are trained on massive datasets of image-text pairs, often scraped from the internet. The heavy lifting is done by a CLIP-based image classification network trained on 400 million image-text pairs, which is then adapted to work as an object detector. The largest model took 18 days to train on 592 V100 GPUs.

We used the Hugging Face implementation [2] of the OWL-ViT model and deployed it to a cloud GPU. Inference takes about 300ms, making interactive exploration possible: make sure to tweak the text queries and thresholds to find the ones that work best for your images!
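
If you want to try it locally, here's a rough sketch of what the inference code looks like with recent versions of the transformers library (based on the Hugging Face docs [2]; the checkpoint name, image URL, queries, and threshold are just example values, not what our demo uses):

    import requests
    import torch
    from PIL import Image
    from transformers import OwlViTProcessor, OwlViTForObjectDetection

    # Load a pretrained OWL-ViT checkpoint (example checkpoint name)
    processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
    model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)

    # Free-form text queries: one list of queries per image
    texts = [["yellow boat", "a photo of a cat"]]

    inputs = processor(text=texts, images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Convert raw outputs to boxes in image coordinates;
    # the threshold is the knob to tweak per image
    target_sizes = torch.tensor([image.size[::-1]])
    results = processor.post_process_object_detection(
        outputs, threshold=0.1, target_sizes=target_sizes
    )[0]

    for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
        print(texts[0][label], round(score.item(), 3), box.tolist())

The demo essentially wraps this loop behind a web UI on a GPU instance, so you can iterate on queries and thresholds without writing any code.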

[1] https://arxiv.org/abs/2205.06230

[2] https://huggingface.co/docs/transformers/model_doc/owlvit
