Rentman outage on May 9th 2017
Yesterday at 13:53, Rentman experienced a server outage that lasted until 14:21. The error occurred in the server responsible for monitoring our complete infrastructure. Because this server was consuming all available resources, anyone who tried to access Rentman or perform an action received an error message.
To prevent a single error from harming other accounts, each Rentman client runs in an isolated process. The overload was caused by an event at 12:30 that triggered a chain reaction of extensive logging, which filled the disk of our general database server. Because the disk was full, actions performed in our system could not be stored and triggered no response in the software.
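To illustrate this failure mode, here is a minimal sketch of a guard that sheds log writes once the disk nears capacity, so runaway logging cannot block unrelated database writes. This is not Rentman's actual code; the mount path and threshold are assumptions for illustration only.

```python
import shutil

def disk_usage_ratio(path="/"):
    """Return the fraction of disk space currently used at the given mount point."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

def safe_log(write_log, message, path="/", threshold=0.90):
    """Write a log line only while the disk is below the threshold.

    Above the threshold the line is dropped, trading log completeness
    for keeping the database writable. Returns True if the line was written.
    """
    if disk_usage_ratio(path) >= threshold:
        return False  # shed the log line instead of filling the disk
    write_log(message)
    return True
```

In practice the same effect is usually achieved with log rotation and per-process disk quotas, but the principle is identical: logging must never be able to exhaust a resource that the primary workload depends on.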
Since the error occurred on a lower-level server and gradually spread, the cause was not immediately clear. When we noticed that the database was reaching its maximum storage capacity, we responded by switching to our backup systems. This did not solve the root cause of the issue, but it allowed users to work in Rentman again.
Immediately after the incident, we started investigating what caused the root error to pile up and stop our servers from working. The repeating error was detected by one of our automatic emergency response systems and automatically stopped at 14:05. We then began manually investigating the cause by going over the individual actions that occurred when the event was triggered. With more than 10,000 events happening per minute, this took a while. Because we had not yet found the root cause, a similar event occurred today at 10:57. It led to a slowdown of the system for some users over a period of eight minutes. The impact was less severe because of the precautionary safety measures we had already put in place.
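A repeating error like this is commonly caught with a circuit-breaker pattern: trip once the same error signature repeats too often inside a time window, then stop the offending process. The sketch below is a hypothetical illustration of that pattern, assuming our own class name, thresholds, and signature scheme; it is not the actual emergency response system.

```python
from collections import deque
import time

class RepeatErrorBreaker:
    """Trip when the same error signature repeats too often within a
    sliding time window, so a runaway failure can be stopped automatically."""

    def __init__(self, max_repeats=100, window_seconds=60.0):
        self.max_repeats = max_repeats
        self.window = window_seconds
        self.events = {}  # error signature -> deque of timestamps

    def record(self, signature, now=None):
        """Record one occurrence; return True when the breaker should trip."""
        now = time.monotonic() if now is None else now
        q = self.events.setdefault(signature, deque())
        q.append(now)
        # Drop occurrences that have aged out of the window.
        while q and now - q[0] > self.window:
            q.popleft()
        return len(q) > self.max_repeats
```

The caller would stop or restart the responsible process whenever `record` returns True, instead of letting the error repeat unchecked.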
The root cause turned out to be a large volume of malformed external API calls that triggered the error in our processing servers. The global monitoring system was not able to keep up with logging these failures and consumed all available resources. We have since isolated the responsible process so that it can no longer interfere with our general database server, and we are investigating extra isolation measures to make sure that similar events cannot cause a chain reaction that affects all users.
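One common defense is to reject malformed calls at the edge, before they reach the processing servers or flood the monitoring logs. The sketch below is purely illustrative: the field names `action` and `params` are assumptions, not Rentman's real API schema.

```python
def validate_api_call(payload):
    """Check the basic shape of an incoming API call.

    Returns (True, None) for a well-formed call, or (False, reason)
    so the caller can reject it with a 400 response instead of passing
    it on to the processing servers.
    """
    if not isinstance(payload, dict):
        return False, "payload must be a JSON object"
    action = payload.get("action")
    if not isinstance(action, str) or not action:
        return False, "missing or invalid 'action' field"
    if not isinstance(payload.get("params", {}), dict):
        return False, "'params' must be an object"
    return True, None
```

Rejecting bad input early keeps a flood of malformed requests from ever becoming a flood of error-log writes downstream.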
We have several safety measures in place that should prevent events like this from happening. Unfortunately, these measures were not sufficient to prevent an outage this time. That is why we are currently working with our data infrastructure provider to investigate why the service did not perform according to its standard.
We realize that an outage of our services has a severe impact on our users' ability to perform their work. Any amount of downtime is unacceptable to us, and we are sorry for the inconvenience this has caused. We have learned from this incident and are putting everything in place to prevent problems like this in the future.