Today me and qianli_cs want to share a new open-source project we've been working on called Durable Swarm. It's a drop-in replacement for OpenAI’s Swarm that augments it with durable execution to make your agentic workflows resilient to failures, so that if they are interrupted or restarted, they automatically resume from their last completed steps.
https://github.com/dbos-inc/durable-swarm
We believe that as multi-agent workflows become more common, longer-running, and more interactive, it's important to make them reliable. If an agent spends hours waiting for user inputs or processing complex workflows, it needs to be resilient to transient failures, such as a server restart. However, reliable multi-agent orchestration isn't easy—it requires complex rearchitecting like routing agent communication through SQS or Kafka.
Durable execution helps you write reliable agents while preserving the ease of use of a framework like Swarm. The idea is to automatically persist the execution state of your Swarm workflow in a Postgres database. That way, if your program is interrupted, it can automatically resume your agentic workflows from the last completed step.
Here’s an example application–a durable refund agent that automatically recovers from interruptions when processing refunds:
https://github.com/dbos-inc/durable-swarm/tree/main/examples...
We also converted all of OpenAI’s example applications to Durable Swarm:
https://github.com/dbos-inc/durable-swarm/tree/main/examples
Under the hood, we implemented Durable Swarm using DBOS (https://github.com/dbos-inc/dbos-transact-py), an open-source lightweight durable execution library that we developed. The entire implementation of Durable Swarm is 24 lines of code, declaring the main loop of Swarm to be a durable workflow and each chat completion or tool call to be a step in that workflow. Check it out here:
https://github.com/dbos-inc/durable-swarm/blob/main/durable_...