On Monday, March 22, Android users around the world suddenly saw notifications on their devices saying that apps had stopped running. Critical apps such as Gmail, Google Pay, and various banking apps failed to open, creating widespread consumer concern. Google later revealed that the issue resided in the Android System WebView, and many users were able to remediate it by uninstalling the latest update. While the issue can be resolved by relying on consumers to manually update, major crashes and the manual fixes they require have a lasting impact on the end user and on overall brand reputation.
I would be remiss if I didn’t add that bugs are inevitable, and engineering teams needn’t aim for 100% error-free software. They should, however, have pre-production QA measures in place that act as a safety net for situations like this, enabled by tools for comprehensive error diagnostics and actionable insights. This allows engineering organizations to prioritize the bugs creating the most damaging user experience. Even giants like Google and Facebook still experience lapses in this process, but it is a critical step in delivering consistent, quality software.
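To make the idea of prioritizing by user impact concrete, here is a minimal sketch of one way a team might rank grouped errors. Everything in it is an assumption for illustration: the `ErrorGroup` type, its fields, and the crash weight of 3.0 are hypothetical and do not represent Bugsnag's API or scoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorGroup:
    """Hypothetical summary of one grouped error; not a Bugsnag API type."""
    name: str
    affected_users: int  # distinct users who hit the error
    occurrences: int     # total events recorded
    crashes_app: bool    # True if the error terminates the app


def impact_score(group: ErrorGroup, crash_weight: float = 3.0) -> float:
    """Score a group by users affected, weighting app-terminating errors higher.

    The default weight of 3.0 is an arbitrary assumption for illustration.
    """
    weight = crash_weight if group.crashes_app else 1.0
    return group.affected_users * weight


def prioritize(groups: list[ErrorGroup]) -> list[ErrorGroup]:
    """Return groups ordered from most to least damaging.

    Ties on score are broken by the raw number of occurrences.
    """
    return sorted(
        groups,
        key=lambda g: (impact_score(g), g.occurrences),
        reverse=True,
    )


# Made-up sample data, used only to exercise the functions above.
groups = [
    ErrorGroup("Slow image load", affected_users=900, occurrences=4000, crashes_app=False),
    ErrorGroup("Native crash on launch", affected_users=500, occurrences=2500, crashes_app=True),
    ErrorGroup("Settings typo", affected_users=20, occurrences=25, crashes_app=False),
]
ranked = prioritize(groups)
for g in ranked:
    print(f"{g.name}: score={impact_score(g):.0f}")
```

In this sketch a crash that affects fewer users can still outrank a non-fatal error that affects more, which reflects the point that the most damaging experience is not always the most frequent one. A real team would tune the weighting to its own product.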
That said, operating system components like the Android System WebView should never crash an application. This is one of the tenets of good component design.
Before and throughout the outage, Bugsnag did not drop a single error, thanks to its elastic queuing architecture. As outlined in the blog post on Facebook’s SDK outages in May and July of 2020, our auto-scaling capabilities helped our customers this time around too.
From the start of the Android app outage, Bugsnag monitored errors across the Android ecosystem. Our platform tracked the situation in real time on Monday night and revealed the following key findings:
In this scenario, because an operating system component was at fault, there was little development teams could have done to prevent their applications from crashing. Even so, immediate visibility into issues affecting customers’ application experience is critical. Engineering teams using Bugsnag were able to give their support and customer success teams clear guidance, so those teams could respond to customers quickly and confidently.
Although the steps below would not have applied to this situation since an OS component was at fault, here are some proactive steps engineering teams can take to protect their applications from similar problems that impact application stability:
Additionally, this Android WebView error was a Native Development Kit (NDK) error, which can be detected only if your crash reporting tool supports NDK crash detection and has it enabled. Bugsnag’s monitoring capabilities are critical in situations like this one because you don’t have to opt in to NDK monitoring as you do with other systems. It is available by default.
Since consumers rely heavily on mobile apps to navigate day-to-day life, application stability is absolutely critical, especially in today’s relentlessly competitive environment. The silver lining of outages such as this one is that they draw attention to good software design and process. They showcase where software engineering teams need to introduce new best practices or fine-tune existing ones.