How to avoid downtime in your business

Kumar Abhishek · Published in Fyipe · Aug 8, 2018


Here are some practical tips to help you avoid downtime and be available when your customers look for you.

Reliability, often used synonymously with uptime, isn't just a number. It is the key to running a successful business today. Your revenue, growth, productivity and customer trust all depend on it.

In simple terms, uptime is the percentage of time for which your product or service is available out of the total time under observation, and it is usually expressed as a number of nines.

For example, a SaaS company with an uptime of four nines, i.e. 99.99%, can suffer at most about 52.56 minutes of downtime in a year. That's a demanding goal for a highly complex system, and although it can be pushed even further, some downtime is inevitable.
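To see where the 52.56-minute figure comes from, here is a quick back-of-the-envelope calculation (a minimal Python sketch, assuming a 365-day year):

```python
# Back-of-the-envelope "nines" arithmetic (assumes a 365-day year).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def allowed_downtime_minutes(uptime_percent: float) -> float:
    """Maximum yearly downtime, in minutes, for a given uptime percentage."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)

for nines, uptime in [(2, 99.0), (3, 99.9), (4, 99.99), (5, 99.999)]:
    print(f"{nines} nines ({uptime}%): ~{allowed_downtime_minutes(uptime):.2f} minutes/year")
# Four nines (99.99%) works out to roughly 52.56 minutes of downtime per year.
```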

Downtime happens — it's an ugly truth

Downtime is inevitable — but it doesn't have to be a customer service disaster. Here are some statistics on how downtime leads to huge losses:

  1. Most Fortune 500 companies experience a minimum of 1.6 hours of downtime per week. According to IDC, this results in costs and losses exceeding $1.2 billion per year.
  2. Lost productivity due to downtime costs businesses 127+ million person-hours per year, collectively.
  3. Manual configuration errors cost companies up to $72,000 per hour.
  4. An average company with 175+ downtime hours a year loses more than $7 million annually.
  5. Businesses in the US lose an average of $96,000 for every hour of IT system downtime.
  6. Total data center outages cost companies more than $5,000 per minute.

Getting back up and running in the shortest possible time is the goal for every company, but staying reliable over the long term requires a different strategy.

“Nearly 70 percent of consumers leave a digital app or service especially related to banking and e-commerce within 15 minutes or less if it stops working or the service slows down.” — Forrester survey.

Your customers should determine what’s critical

Reliability means different things to different companies. Not every IT incident is a critical failure, and you don't treat the routine operational issues you face every day as major issues either.

Hence, it is important to understand that it is your customer and not your system that should define what’s critical and what’s not. Your critical metrics should align with what matters most to your customers.

For example, if you are an e-commerce company, your customers don't care if one of your servers goes down while other servers can easily absorb the traffic; that doesn't count as a critical issue affecting your reliability. But if login, checkout or browsing is down, that is an extremely critical issue that needs your attention.

Customers don’t need to be aware of every issue your business may face, but your customer-facing reliability should be the top priority.

Best practices to avoid downtime

Downtime is the worst nightmare for any company out there. It's like seeing your storefront crumble and crash in front of your eyes while your customers are in line to buy something they need from you.

Here are some best practices that have become the norm and that every successful company follows today in order to avoid serious downtime.

1. Deliberately inject failures into the system

Keeping backups, and backups for those backups, is really great, but is it enough?

Of course not! A backup that fires up when the system goes down might hide non-performing code that works perfectly in the test environment but fails in production when the load kicks in. Hence, keeping a backup isn't enough.

This is why some big companies with big budgets test out these situations continuously by creating automated tools that introduce artificial latency, test applications for failure resilience or even shut down an entire availability zone.

For most companies, building tooling like this is a hefty task; instead, they can run these tests manually on a periodic schedule. Most successful companies follow this practice religiously.

For instance, at Netflix, Site Reliability Engineer Corey Bertram calls it 'promoting chaos': on a scheduled day, the team manually injects failures into the system to test its reliability.

TIP: Sometimes these failures turn out to be bigger and worse than expected, so the entire IT support team responsible for the functionality being tested should be ready as if it weren't a drill. They should watch how the injected failure affects the different metrics on the dashboard. If you don't have a dashboard, you can request one here on Fyipe.

Scheduled attacks allow you to be proactive about finding vulnerabilities in your system and getting adept at incident response, going beyond just fixing problems to preventing them from occurring in the first place.
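As an illustration, here is a minimal sketch of such an attack in Python (the downstream call, parameters and timings are all hypothetical): it adds artificial latency to a dependency for a short, bounded window while the team watches the dashboards and alerts.

```python
import random
import time

# A minimal, manual "chaos" drill: for a short attack window, wrap calls to a
# downstream dependency and add artificial latency, then watch your dashboards.
# `fetch_inventory` is a made-up stand-in for a real downstream call.

def fetch_inventory():
    return {"sku-123": 42}

def call_with_latency(fn, max_extra_seconds=2.0, probability=0.5):
    """Call fn, randomly delaying it to simulate a slow dependency."""
    if random.random() < probability:
        delay = random.uniform(0, max_extra_seconds)
        print(f"[chaos] injecting {delay:.2f}s of latency")
        time.sleep(delay)
    return fn()

attack_ends = time.time() + 5 * 60  # keep the attack short, e.g. 5 minutes
while time.time() < attack_ends:
    call_with_latency(fetch_inventory)
    time.sleep(10)  # pace the probes while you watch metrics and alerts
```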

Tips for success with this strategy:

  1. Plan multiple attacks that affect different pieces of functionality, and keep each attack short, say 5 minutes.
  2. Between attacks, bring your services back to a fully functional state.
  3. Confirm that everything is operating as expected before moving on to the next attack. Perform a full system check (even on functionality that wasn't planned to be affected) to ensure that no other system was impacted.
  4. Check your dashboards to understand which metrics point to the issue you are facing and how that issue affects other systems.
  5. If you use a monitoring and alert service, keep notes on when you received the alert, who received it and how it helped with the process. If you don't use one, you can request one here at Fyipe and we'll get it up and running in under 5 minutes.

2. Prevent the same incidents from happening again

It might look tough, but mining historical performance data, analyzing the root cause of issues and setting up an alert and response system will help ensure that the issues that caused downtime in the past, especially those caused by human error, don't crop up again. Here is a five-step process to make sure the same incidents don't happen again.

Step 1: Review the historical data on your performance issues and understand the cause by drilling down into the smallest specifics. Clear data will help you make a better analysis and avoid the issue in future.

Step 2: Once you have the data in place, perform a root cause analysis using it. This analysis can be cumbersome if the data sets are huge with lots of dependencies. In that case you can use a tool like Fyipe to view historical data and performance issues in a single window and understand the state of your environment at all times, drilling down into the performance of, and dependencies between, individual servers, websites and applications.

Step 3: Once you have a solid grasp of the reason behind the incident, you can set goals. These goals should be based on the needs of your business, past performance and how that performance translated into the overall accessibility of your business operations.
For example, if you are an e-commerce company, adding items to cart is an important part of your business. Improving the response time and success rate at this stage could be a goal for you.

Step 4: Now that you have your goals, you need to convert them into thresholds for your alerts. Any good monitoring and alerting service lets you set these thresholds so that you are alerted as soon as an issue begins, not when it has already happened.
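As a rough illustration (a minimal Python sketch with made-up metric names and thresholds, not any particular product's API), a goal like "adding to cart should respond within 800 ms with a 99% success rate" translates into thresholds roughly like this:

```python
# Hypothetical thresholds derived from the goal set in Step 3.
THRESHOLDS = {
    "add_to_cart_p95_ms": 800,         # alert if 95th-percentile latency exceeds 800 ms
    "add_to_cart_success_rate": 0.99,  # alert if success rate drops below 99%
}

def check_metrics(metrics: dict) -> list:
    """Return alert messages for any metric that breaches its threshold."""
    alerts = []
    if metrics["add_to_cart_p95_ms"] > THRESHOLDS["add_to_cart_p95_ms"]:
        alerts.append(f"Latency breach: {metrics['add_to_cart_p95_ms']} ms")
    if metrics["add_to_cart_success_rate"] < THRESHOLDS["add_to_cart_success_rate"]:
        alerts.append(f"Success-rate breach: {metrics['add_to_cart_success_rate']:.2%}")
    return alerts

# In practice these numbers would come from your monitoring service.
print(check_metrics({"add_to_cart_p95_ms": 950, "add_to_cart_success_rate": 0.97}))
```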

Step 5: Use a dashboard to view and manage all your event data in one place. This will help you escalate an incident's priority and keep the team in the loop in case the incident gets more severe over time. Keep your incident communication plan and team ready in case the situation calls for them.

3. Use continuous integration practices

Continuous integration (CI) is a software development practice in which team members merge their work frequently to reduce problems and conflicts. In essence, each merge is verified to ensure no new bugs are introduced, and any bugs found while testing the source are discovered and fixed early.

You can also use automated tests to simplify testing: once a bug is detected and a test is written for it, that test ensures the bug never goes untested again. Once all the tests pass, it is always a good idea to have another team member verify the change before the final release.

By using continuous integration, you’ll create a baseline quality of software that will lower the risk of every release.

Here are the five tests that you should consider performing to ensure a healthy release.

Test 1: Semantic test - used to study and verify the relationships between data.

Test 2: Unit test - isolates a piece of code and verifies its correctness. It also exercises the design and flexibility of the code; good design makes it easy for other team members to quickly pick up and understand it.

Test 3: Functional test - the greatest concern here is not how the code is written, but whether it works.

Test 4: Integration test - ensures everything works when combined with all the other services in your production environment and alongside third-party services.

Test 5: Load test - helps determine volume capabilities and where performance bottlenecks might occur, so that your system can handle the expected load.
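To make the unit-test idea concrete, here is a minimal sketch (plain Python with pytest; apply_discount is a made-up helper used only for illustration) of the kind of test a CI server would run on every merge:

```python
# test_checkout.py - a hypothetical unit test a CI server would run on every merge.
import pytest

def apply_discount(total: float, percent: float) -> float:
    """Return the total after applying a percentage discount."""
    if not 0 <= percent <= 100:
        raise ValueError("discount must be between 0 and 100")
    return round(total * (1 - percent / 100), 2)

def test_apply_discount_basic():
    assert apply_discount(100.0, 10) == 90.0

def test_apply_discount_rejects_bad_input():
    # A regression test written after a bug: negative discounts used to pass through.
    with pytest.raises(ValueError):
        apply_discount(100.0, -5)
```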

4. Test third-party integrations end to end

Unreliable third-party integrations are one of the biggest causes of partial or even complete outages. We rely on many third-party services to deliver reliable products to our customers: just as e-commerce companies rely on delivery services, every web service depends on cloud providers. When AWS goes down, not only does Amazon.com suffer an outage, but all the websites hosted on Amazon's cloud go down as well. Hence it is important to continuously test all third-party integrations.

This can be done by using monitoring and alert services that make you aware the moment one or more of your integrations goes down, so you can take action right away before it spreads and turns into a full-blown outage.
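A very basic version of such a check looks like the Python sketch below (the endpoint URL is made up, and it uses the requests library); a monitoring service essentially runs checks like this continuously and alerts you on failure:

```python
import requests  # third-party HTTP library; install with `pip install requests`

# Hypothetical health check for a third-party dependency (URL is made up).
PAYMENT_PROVIDER_HEALTH_URL = "https://api.example-payments.com/health"

def check_integration(url: str, timeout_seconds: float = 5.0) -> bool:
    """Return True if the dependency responds with HTTP 200 within the timeout."""
    try:
        response = requests.get(url, timeout=timeout_seconds)
        return response.status_code == 200
    except requests.RequestException:
        return False

if not check_integration(PAYMENT_PROVIDER_HEALTH_URL):
    print("ALERT: payment provider health check failed")  # hand off to your alerting
```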

Some additional tips:

Although it may not seem important, testing the service providers through whom you receive your alerts is equally important. If you can't receive an SMS alert on time, you might end up in much bigger trouble. So test the SMS alert feature with different providers by injecting a failure into the system and measuring how long it takes for the SMS to arrive.

Running this test a few times helps you choose the best provider based on reliability and the time taken to receive the alert.
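Here is a small sketch (Python, with made-up numbers and provider names) of how you might compare providers after a few such test runs:

```python
from statistics import mean

# Hypothetical measurements: seconds from failure injection to SMS receipt,
# collected over a few test runs per provider.
delays = {
    "provider_a": [45, 62, 51],
    "provider_b": [120, 95, 160],
}

for provider, samples in delays.items():
    print(f"{provider}: avg {mean(samples):.0f}s, worst {max(samples)}s")
# Pick the provider with the lowest and most consistent delivery time.
```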

You can also use a monitoring service like Fyipe, which alerts you not just via SMS and email but also via phone call and third-party integrations such as Slack. It also has an on-call schedule feature that lets you define who receives the alert at different times of the day based on their availability, so you don't waste a minute and can start working on the issue right away.
