Today we want to share Durable Swarm, a drop-in replacement for OpenAI’s Swarm that augments it with durable execution to help you build reliable and scalable multi-agent systems. Durable Swarm makes your agentic workflows resilient to failures, so that if they are interrupted or restarted, they automatically resume from their last completed steps.
https://github.com/dbos-inc/durable-swarm
We believe that as multi-agent workflows become more common, longer-running, and more interactive, it's important to make them reliable. If an agent spends hours waiting for user inputs or processing complex workflows, it needs to be resilient to transient failures, such as a server restart. However, reliable multi-agent orchestration isn't easy—it requires complex rearchitecting like routing agent communication through SQS or Kafka.
Durable execution helps you write reliable agents while preserving the ease of use of a framework like Swarm. The idea is to automatically persist the execution state of your Swarm workflow in a Postgres database. That way, if your program is interrupted, it can automatically resume your agentic workflows from the last completed step.
Here’s an example application–a durable refund agent that automatically recovers from interruptions when processing refunds:
https://github.com/dbos-inc/durable-swarm/tree/main/examples...
We also converted all of OpenAI’s example applications to Durable Swarm:
https://github.com/dbos-inc/durable-swarm/tree/main/examples
Under the hood, we implemented Durable Swarm using DBOS (https://github.com/dbos-inc/dbos-transact-py), an open-source lightweight durable execution library that (full disclosure) we developed. The entire implementation of Durable Swarm is 24 lines of code, declaring the main loop of swarm to be a durable workflow and each chat completion or tool call to be a step in that workflow. Check it out here:
https://github.com/dbos-inc/durable-swarm/blob/main/durable_...
Me and qianli_cs are here to answer any questions!