Summary
On March 4th, our email services experienced a critical incident caused by a cascade of unforeseen hardware failures. Multiple SSDs (solid-state drives) failed in quick succession, causing an application server to become unresponsive. This first impacted email services for some users and then spread to other parts of the system.
Users who were already logged in to webmail experienced fewer disruptions, but new logins and IMAP connections were significantly impacted. Our support system also lost operability during the outage.
We have since replaced the failed hardware, rebuilt the affected systems, and fully restored services with additional redundancy. No user data was lost, and we have taken immediate and longer-term steps to prevent a recurrence.
Details
On March 4th, 2026, at 12:05 PM CET, our team began investigating connection issues with our email services. Once we had gathered accurate information and confirmed the scope of the issue, we made a public update on our status page. What started as a suspected network problem quickly revealed itself to be a series of compounding hardware failures.
Timeline of events
March 4th – Initial Failure
An application server experienced a disk failure in its RAID (Redundant Array of Independent Disks). The failed SSD placed additional load on the remaining disks in the array, triggering a chain reaction as other disks began to fail. RAID is designed for redundancy, but losing one drive can put extreme stress on the surviving drives and lead to cascading failures across the array.
Action: Our system administrators immediately deployed to the data center to replace the failed disks and rebuild the data. Email access remained functional through the late afternoon and evening (CET), but the situation continued to deteriorate as compounding issues arose.
Note: None of the disks had reported any prior issues in our monitoring systems, nor were they near their maximum write limits.
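To illustrate why a second or third failure in the same array is so damaging, here is a minimal Python sketch of how many disk failures common RAID levels are guaranteed to survive. This report does not state which RAID level the affected server used, so the levels and the function names below are generic illustrations rather than a description of our setup.

```python
# Illustrative only: the RAID level of the affected server is not stated in
# this report. RAID10 is assumed to use two-way mirrors.

def guaranteed_fault_tolerance(level: str, disks: int) -> int:
    """Return how many disk failures the array is guaranteed to survive."""
    if level == "RAID0":
        return 0            # striping only, no redundancy
    if level == "RAID1":
        return disks - 1    # every disk holds a full copy of the data
    if level == "RAID5":
        return 1            # single parity
    if level == "RAID6":
        return 2            # double parity
    if level == "RAID10":
        return 1            # worst case: both failures hit the same mirror pair
    raise ValueError(f"unknown RAID level: {level}")


def array_survives(level: str, disks: int, failed: int) -> bool:
    """True if the array is still guaranteed readable after `failed` failures."""
    return failed <= guaranteed_fault_tolerance(level, disks)
```

For example, `array_survives("RAID5", 4, 1)` is true while `array_survives("RAID5", 4, 2)` is false: once one disk is gone, a single-parity array has no margin left until the rebuild completes.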
Overnight – Domino Effect
The loss of the disks impacted the virtual machines running various email-related services, including the POP, IMAP, and SMTP interfaces. Our support system was also indirectly impacted because it uses Runbox’s own SMTP service to send email to our users. As a result, users could not receive our replies by email and could only read them via the support web interface.
Action: Our system administrators worked through the night performing disk replacements and data rebuilding, and two new physical servers (hypervisors) were installed in order to share the load by migrating services from the impacted existing physical servers.
March 5th – Response and Repair
Our team continued working non-stop to replace failed disks, rebuild data arrays, and configure new servers to offload the struggling hardware. During this process, capacity was added to improve performance and resilience in the longer term. Services gradually normalized, and by evening all email functions (web, IMAP, POP, SMTP) were restored and queued incoming email was delivered.
March 6th – 8th – Lingering Effects
Some users experienced persistent login issues with email clients and IMAP. After increasing resources on our IMAP servers, we identified and resolved underlying NFS (Network File System) configuration issues. By March 8th, 2026, 10:00 AM CET, all services were operating normally for all users.
No data was lost. All user data and information remained intact through this incident.
Lessons Learned & Next Steps
Redundancy
The experiences and customer feedback from this incident emphasized the crucial role redundancy plays in preventing outages. Our system architecture already has layers of redundancy in place, but not enough for services to continue operating when several disks in multiple physical servers fail simultaneously. This incident has underscored the need for even greater hardware resilience, which we have already acted on by adding more physical and virtual servers running user-facing services.
New Hardware
As one of the actions already implemented, we have installed new hypervisors (physical servers), a major step toward improving reliability and performance. This allows us to spread clusters of virtual application servers (running web, auth, mail, IMAP, and other services) more effectively, significantly reducing the risk of future service outages.
Monitoring
We’re reviewing our hardware monitoring and disk failure detection, virtual and physical server redundancy, and response protocols. The deployment of advanced disk health monitoring tools to detect early signs of failure is critical.
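As a rough illustration of what early-warning checks on disk health can look like, here is a minimal Python sketch. The `DiskHealth` fields, the threshold values, and the function name are assumptions made for this example; they do not describe the tools or limits we actually use.

```python
from dataclasses import dataclass


@dataclass
class DiskHealth:
    """Hypothetical snapshot of health counters reported by one SSD."""
    device: str
    percent_life_used: int    # 0 = new, 100 = rated write endurance reached
    reallocated_sectors: int  # sectors remapped after errors
    media_errors: int         # unrecovered read/write errors


# Assumed thresholds, chosen only for illustration.
WEAR_WARN_PERCENT = 80
REALLOCATED_WARN = 1
MEDIA_ERROR_WARN = 1


def warnings_for(disk: DiskHealth) -> list[str]:
    """Return a human-readable warning for each threshold the disk crosses."""
    found = []
    if disk.percent_life_used >= WEAR_WARN_PERCENT:
        found.append(f"{disk.device}: {disk.percent_life_used}% of rated life used")
    if disk.reallocated_sectors >= REALLOCATED_WARN:
        found.append(f"{disk.device}: {disk.reallocated_sectors} reallocated sectors")
    if disk.media_errors >= MEDIA_ERROR_WARN:
        found.append(f"{disk.device}: {disk.media_errors} media errors")
    return found
```

Threshold checks of this kind would not necessarily have caught this incident, since none of the failed disks had reported issues or were near their write limits beforehand. That is why monitoring is being reviewed together with redundancy and response protocols.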
Support System
We recognize that the outage also severely impacted our support system, adding to your frustration and to ours, as we were unable to communicate with you. We are actively exploring ways to separate and strengthen our support infrastructure so that it remains operational and responsive even during service disruptions. We are also assessing ways to communicate better with our customers when incidents occur.
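One possible way to reduce this kind of dependency is to let the support system fall back to an independent mail relay when the primary SMTP service is unreachable. The Python sketch below shows the idea using only the standard library. The relay host names and the `send_with_fallback` function are hypothetical, and authentication and STARTTLS are omitted for brevity; this is not a description of a solution we have chosen.

```python
import smtplib
from email.message import EmailMessage

# Hypothetical relays: the primary service first, then an independent fallback.
RELAYS = [("smtp.primary.example", 587), ("smtp.fallback.example", 587)]


def send_with_fallback(msg, relays=RELAYS, connect=smtplib.SMTP, timeout=10):
    """Try each relay in order and return the host that accepted the message."""
    last_error = None
    for host, port in relays:
        try:
            # A real deployment would also call starttls() and login() here.
            with connect(host, port, timeout=timeout) as server:
                server.send_message(msg)
            return host
        except (OSError, smtplib.SMTPException) as exc:
            last_error = exc  # remember the failure and try the next relay
    raise RuntimeError("all SMTP relays failed") from last_error
```

The `connect` parameter exists so the function can be exercised with a stand-in for `smtplib.SMTP` without touching the network.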
Thank You
This was a difficult 24+ hours for all of us: for you, our customers, and for our entire team. We continue to monitor services closely to ensure that operations remain normal and to resolve any remaining minor issues that might be uncovered. If you continue to experience issues, please don’t hesitate to reach out via support.runbox.com or support@runbox.com.
Many of you have been with us for a long time, and some of you are new to Runbox. Please know that we take this incident seriously and are deeply committed to providing the sustainable, reliable, secure, and high-quality email service you deserve. Your trust is a precondition for our business, and we will continue to work tirelessly to earn it every day.
—The Runbox Team
