Technology

Atlassian implements ‘soft-delete’ policy and improves backups to avoid another outage


Outgoing Atlassian CTO Sri Viswanath has announced that the company will establish a “soft delete” policy across all systems, one of several steps it is taking to avoid a repeat of the disastrous outage that disabled several cloud services and took nearly two weeks to fix.

Atlassian first acknowledged the outage on its Status Page on April 5. The company did not restore services for all impacted customers until April 18.

The company blamed the outage on a recent maintenance script which, according to Viswanath, resulted in the immediate deletion of 883 sites, representing 775 customers. The deleted sites also contained some customer contact information, which meant that customers could not file support tickets as they normally would, and Atlassian could not immediately contact impacted customers, Viswanath said.

After reviewing the incident, Atlassian said it has taken a number of immediate actions to avoid similar situations in the future. These include blocking the deletion of customer data and metadata that has not gone through a “soft-delete” process. All new operations that require deletion will first be tested within Atlassian’s own sites to validate the approach, and once that validation is complete, the company will progressively move customers through the same process.

“Deletion of an entire site should be prohibited; and, soft-delete should require multi-level protections to prevent error,” Viswanath outlined in a blog post.

“We will implement a ‘soft delete’ policy, preventing external scripts or systems from deleting customer data in a production environment. Our ‘soft delete’ policy will allow for sufficient data retention so that data recovery can be executed quickly and safely. The data will only be deleted from the production environment after a retention period has expired.”

Any activity to soft-delete data must also have a tested rollback plan, Atlassian added.
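The policy Viswanath describes can be illustrated with a minimal Python sketch. This is not Atlassian's implementation: the `SoftDeleteStore` class, its method names, and the in-memory storage are all hypothetical, and the retention period is left as a parameter because the blog post does not state its length. The sketch shows the three pieces the quotes mention: a delete that only marks data, a rollback that undoes the mark, and a purge that removes data only after the retention period has expired.

```python
from datetime import datetime, timedelta, timezone


class SoftDeleteStore:
    """Toy in-memory store illustrating soft delete (hypothetical names)."""

    def __init__(self, retention: timedelta):
        self.retention = retention   # assumed value; not given in the article
        self._records = {}           # site_id -> data
        self._deleted_at = {}        # site_id -> time of the soft delete

    def put(self, site_id, data):
        self._records[site_id] = data

    def get(self, site_id):
        # Soft-deleted records are hidden from normal reads but still stored.
        if site_id in self._deleted_at:
            return None
        return self._records.get(site_id)

    def soft_delete(self, site_id, now):
        # Only mark the record; the data itself is left untouched.
        if site_id not in self._records:
            raise KeyError(site_id)
        self._deleted_at.setdefault(site_id, now)

    def restore(self, site_id):
        # Rollback: clear the mark. Returns True if something was restored.
        return self._deleted_at.pop(site_id, None) is not None

    def purge_expired(self, now):
        # Hard-delete only records whose retention period has fully elapsed.
        expired = [
            site_id
            for site_id, deleted in self._deleted_at.items()
            if now - deleted >= self.retention
        ]
        for site_id in expired:
            del self._records[site_id]
            del self._deleted_at[site_id]
        return expired


if __name__ == "__main__":
    store = SoftDeleteStore(retention=timedelta(days=30))  # 30 days is an assumption
    now = datetime.now(timezone.utc)
    store.put("site-a", {"owner": "example"})
    store.soft_delete("site-a", now=now)
    print(store.get("site-a"))      # hidden from reads while soft-deleted
    print(store.restore("site-a"))  # rollback succeeds within the retention window
    print(store.get("site-a"))      # data is readable again
```

In this sketch a mistaken deletion stays recoverable for the whole retention window, because the only code path that removes data is `purge_expired`.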

Additionally, Atlassian said it will accelerate its disaster recovery program so that restoration can be automated for multi-site, multi-product deletion events for a larger set of customers, and ensure this process is regularly tested and updated to reduce recovery time.

According to Viswanath, Atlassian will also revise its incident management process for large-scale incidents and conduct simulated exercises, as well as improve the backup of key contacts and retrofit support tooling so customers without a valid site URL or Atlassian ID can still directly contact technical support.

The additional steps Atlassian said it will take include investing in a unified, account-based escalation system and workflows that allow multiple objects, such as tickets and tasks, to be stored under a single customer account object; revisiting the company’s incident communication playbook; and running an escalation management function that is consistent for customers across all geographies.

In a letter, the company’s co-founders and co-CEOs Scott Farquhar and Mike Cannon-Brookes said the aim of these actions centres on regaining the trust of its customers.

“We want to acknowledge the outage that disrupted service for customers earlier this month. We understand that our products are mission critical to your business, and we don’t take that responsibility lightly. The buck stops with us. Full stop. For those customers affected, we are working to regain your trust,” they wrote.

For the third quarter ended March 31, the company reported a net loss of $31 million, compared with net income of nearly $160 million during Q3 in fiscal year 2021. Meanwhile, revenue grew 30% year-on-year to $740.5 million.

Atlassian ended its third quarter with a total customer count, on an active subscription or maintenance agreement basis, of 234,575 customers, adding 8,054 net new customers during the quarter.

The company added that, during the quarter, its customer count was reduced by approximately 1,800 due primarily to Russia-based customers that were unable to pay, as a consequence of sanctions levied on their payment networks.

Alongside the results, the company also named Rajeev Rajan as its new CTO, to take over when Viswanath finishes up at the end of the 2022 financial year.
