Sun Jan 19, 2025 12:11pm PST
Ask HN: How Do AI Apps Like ChatGPT Achieve Such High Speeds?
I’ve been experimenting with building my own Retrieval-Augmented Generation (RAG) application on an NVIDIA H100 with 80GB of memory. It uses web search APIs like Serper for retrieval, and for generation I’ve tried GPT models hosted on OpenAI and Azure as well as quantized open models served locally through Ollama. Smaller models are reasonably fast, but anything beyond the 7-billion-parameter range is significantly slower.
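
For concreteness, my pipeline is roughly the sketch below (simplified: the Serper key and model name are placeholders, and chunking/embedding/error handling are omitted):

    import requests

    SERPER_API_KEY = "..."  # placeholder

    def search(query):
        # Serper returns Google results as JSON; I keep the organic snippets.
        resp = requests.post(
            "https://google.serper.dev/search",
            headers={"X-API-KEY": SERPER_API_KEY, "Content-Type": "application/json"},
            json={"q": query},
            timeout=10,
        )
        resp.raise_for_status()
        return [r.get("snippet", "") for r in resp.json().get("organic", [])]

    def answer(query):
        context = "\n".join(search(query))
        # Local quantized model via Ollama's HTTP API; this blocking call
        # is where nearly all of my latency goes on larger models.
        resp = requests.post(
            "http://localhost:11434/api/generate",
            json={
                "model": "llama3:8b",  # placeholder; whatever is pulled locally
                "prompt": f"Context:\n{context}\n\nQuestion: {query}\nAnswer:",
                "stream": False,
            },
            timeout=300,
        )
        resp.raise_for_status()
        return resp.json()["response"]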

In contrast, AI applications like ChatGPT and Perplexity demonstrate impressive speed in real-time searching, web scraping, content generation, and reasoning. Their ability to deliver results so quickly, even with large-scale models, is quite remarkable.
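
The one technique visible from the outside is token streaming, which hides total generation time behind a fast first token; with Ollama that looks roughly like this (sketch, model name again a placeholder):

    import json
    import requests

    def stream_answer(prompt):
        # With "stream": true, Ollama emits one JSON object per line as
        # tokens are generated, so output starts almost immediately.
        with requests.post(
            "http://localhost:11434/api/generate",
            json={"model": "llama3:8b", "prompt": prompt, "stream": True},
            stream=True,
            timeout=300,
        ) as resp:
            resp.raise_for_status()
            for line in resp.iter_lines():
                if not line:
                    continue
                chunk = json.loads(line)
                print(chunk.get("response", ""), end="", flush=True)
                if chunk.get("done"):
                    break

But streaming only improves perceived latency, so it can't be the whole answer.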

I’m curious to understand the engineering strategies and optimizations these companies use to achieve such high performance. Are there any insightful engineering blogs or technical resources that explain how they optimize their infrastructure, parallelize workloads, or manage latency effectively? Any insights into their backend architecture, caching mechanisms, or inference optimization techniques would be greatly appreciated.
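
For example, is it mostly aggressive caching at every layer? The kind of thing I have in mind is naive memoization of the search step (toy sketch, reusing search() from above), which presumably is only the crudest version of what they actually do:

    from functools import lru_cache

    @lru_cache(maxsize=4096)
    def cached_search(query):
        # Identical queries skip the network round-trip entirely;
        # tuple so the cached value is immutable.
        return tuple(search(query))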
