Quay.io is down right now. The status page won’t admit it, but I’ve confirmed it with multiple people: image pulls are failing for me locally, in GitHub Actions, and on my VPS in Europe.
This is very inconvenient, because Quay – being a community service of Red Hat – is the only home for official Docker images of the Keycloak project. And while I have no concrete evidence for this supposition, I really feel like this may be at least partially caused by Red Hat’s recent buy-in on “everything AI”. We have seen this time and again, with Amazon, with Microsoft, with Cloudflare, and with numerous others. Letting AI agents run free on production systems results in outages because they fundamentally cannot reason about why things are configured as they are.
I think AI agents can be quite helpful for automating tasks, but their changes always need to be checked by humans before being sent to production. This should go without saying! We keep saying AI will “replace” junior engineers, but we don’t let junior engineers push directly to production. Why would we let AI?
Anyway, Quay’s outage is very annoying because it blocks work on our production stack: there’s no other place to retrieve legitimate Keycloak images from. They really should be mirroring them somewhere. There is an entry on Docker Hub, but it isn’t even a verified organisation or “Sponsored OSS”, so I don’t know if that’s really Keycloak. And the images don’t appear to have an SBOM, so they’re hard to reason about or audit. With all the supply chain attacks going around lately, it’s hard to trust “random” Docker images, even if they do appear to be the actual upstream.
I should note that I still firmly believe that open source projects should self-host, but they should also always have a backup mechanism somewhere. And those backup mechanisms need to be trustworthy and easy to verify as well.
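For what it’s worth, a backup mirror doesn’t have to be complicated. Here’s a minimal sketch using skopeo to copy an upstream image to a second registry and to read its digest, so consumers can pull by digest instead of by tag. The mirror registry name is a placeholder, and the tag is illustrative; substitute your own:

```shell
# Copy the Keycloak image from Quay to a hypothetical backup registry.
# registry.example.org is a placeholder, not a real mirror.
skopeo copy \
  docker://quay.io/keycloak/keycloak:latest \
  docker://registry.example.org/mirror/keycloak:latest

# Read the upstream manifest digest, so consumers can pull by digest
# rather than by tag -- the mirror operator then can't silently swap
# the image contents underneath them:
skopeo inspect --format '{{.Digest}}' docker://quay.io/keycloak/keycloak:latest
```

Pulling by digest (`…/mirror/keycloak@sha256:<digest>`) is what makes the backup verifiable: the digest is content-addressed, so it’s the same whichever registry serves the bytes.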
Oh well. The downtime gives me some time to write this blog and go through my work emails, I guess.
Update: it came back up around 01:00 BST, 1.5 hours after this article was written. The incident was also finally logged on their status page, though with a start time a full hour after the service actually stopped responding to requests.