On Monday 20th February vzaar video playback was unavailable for almost 2 hours. We sincerely apologize for this. We know how much your business relies on vzaar. The dependability of our service is our top priority, but in this instance we failed to maintain the level of uptime that you – and we – expect.
I would like to share with you how this happened and explain the steps we are taking to prevent it from happening again.
We unintentionally deployed a change to our database which included altering a column name in one of the tables that video playback is dependent on. This change was not yet ready to go out, and it caused the system which loads the videos to fail.
How Did It Happen?
Our routine deployment process involves creating a branch from our Master branch. During development, the team make their code changes and then merge their working branch back into the Master branch ready for release.
At this point we create a Release branch. We have intensive QA procedures in place around our Release branch, designed to catch any bugs in our code before they reach the live site.
So why didn’t they catch this one?
The problem in this case is that we didn’t deploy the Release branch. We unintentionally deployed our Master branch, not the fully tested Release branch. The Master branch contained the code that hadn’t been tested yet and wasn’t ready for release.
How Did We Fix It?
Ultimately we restored the database column to its original state, but this took longer than we expected.
Within a matter of minutes our developers had initiated this fix. However, we were deploying it to our Release branch, when the problem lay in the Master branch that had actually been deployed.
When our initial fix didn’t solve the issue, it was important to take a step back, reassess the situation and tackle the problem calmly. If we rushed to deploy untested fixes we risked compounding the error. We absolutely understood how important it was to get playback restored quickly. However, we didn’t want to risk database corruption by acting before we were sure we understood the problem correctly.
We needed to unpick the chain of events and untangle the confusion over which branches had been deployed. Once we had done that, the fix was clear and we were able to execute it safely.
How Will We Prevent This From Happening Again?
First, we have put a safeguard in place to protect against human error. It is no longer possible to deploy the Master branch (i.e. code that has not been through our QA procedure) live to production. Our deploy script looks for a current_release tag to know which branch it should put live. In future, the current_release tag can only be applied to the correct Release branch.
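A minimal sketch of this kind of safeguard, written in Python for illustration only. It is not vzaar's actual deploy script: the `release/<version>` branch naming scheme, the in-memory `tags` mapping and the function names are all assumptions made for the example. The idea is simply that the current_release tag is refused for anything that is not a Release branch, and the deploy step checks the tag again before putting a branch live.

```python
import re
from typing import Dict

# Assumption: Release branches are named "release/<version>", e.g. "release/2.4.1".
RELEASE_BRANCH = re.compile(r"^release/\d+(\.\d+)*$")
TAG = "current_release"


class DeployRefused(Exception):
    """Raised when a tag or deploy request would put untested code live."""


def apply_release_tag(tags: Dict[str, str], branch: str) -> None:
    """Point the current_release tag at `branch`, but only if it is a Release branch."""
    if not RELEASE_BRANCH.match(branch):
        raise DeployRefused(f"{TAG} cannot be applied to {branch!r}")
    tags[TAG] = branch


def branch_to_deploy(tags: Dict[str, str]) -> str:
    """Return the branch the deploy step should put live, re-checking the tag."""
    branch = tags.get(TAG)
    if branch is None:
        raise DeployRefused(f"no {TAG} tag found")
    if not RELEASE_BRANCH.match(branch):
        raise DeployRefused(f"{TAG} points at {branch!r}, which is not a Release branch")
    return branch
```

Checking at both points is deliberate in this sketch: the first check stops the tag being misapplied, and the second stops a deploy if the tag was somehow set by another route.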
Secondly, we’ve put in place a policy change to mitigate the risk when table columns need to be altered. The new policy is to create a new column, ensure that the code referencing this column is deployed and working, and only then clear the old column. This is slower, but the risk mitigation is worth it.
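The three steps of this policy can be sketched with Python's built-in sqlite3 module. This is an illustration under stated assumptions, not the real schema: the `videos` table and the `url` and `media_url` column names are hypothetical, and a production database would run each step as a separate, reviewed migration.

```python
import sqlite3


def add_new_column(conn: sqlite3.Connection) -> None:
    """Step 1: create the new column and copy existing data into it."""
    conn.execute("ALTER TABLE videos ADD COLUMN media_url TEXT")
    conn.execute("UPDATE videos SET media_url = url")


def load_video_url(conn: sqlite3.Connection, video_id: int):
    """Step 2: application code that reads only the new column."""
    row = conn.execute(
        "SELECT media_url FROM videos WHERE id = ?", (video_id,)
    ).fetchone()
    return row[0] if row else None


def clear_old_column(conn: sqlite3.Connection) -> None:
    """Step 3: clear the old column once the new code is confirmed working."""
    conn.execute("UPDATE videos SET url = NULL")


def run_demo():
    # Hypothetical table and data, held in memory for the example.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE videos (id INTEGER PRIMARY KEY, url TEXT)")
    conn.execute("INSERT INTO videos (id, url) VALUES (1, 'https://example.com/v/1.mp4')")

    add_new_column(conn)
    before = load_video_url(conn, 1)   # new code works while the old column still exists
    clear_old_column(conn)
    after = load_video_url(conn, 1)    # and keeps working after the old column is cleared
    old = conn.execute("SELECT url FROM videos WHERE id = 1").fetchone()[0]
    conn.close()
    return before, after, old
```

At every point in this sequence, either the old column or the new one holds valid data for whichever version of the code is live, which is what makes the slower approach safer than renaming a column in place.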
We realize that outages are unacceptable, and we are the first to accept that mistakes in our deployment process led to this issue. We thought we had enough QA procedures in place to catch problematic code, but this has highlighted where our current policies need to be updated. Rest assured those updates have been made and our platform is now more stable as a result.
Thank you for your patience and understanding. Once again, on behalf of the entire team, we are truly sorry. We hope you understand we’re working hard to guard against any further outages.