How to Maximize Project Uptime

To many, 100% project uptime is a myth. The market offers many solutions that promise maximum availability or claim to increase it.

But they do not always work well in practice, and in some cases they can even reduce availability. What errors and problems lead to this?

6 uptime problems

To get as close as possible to 100% availability, it is important to weigh the cost of redundancy against the cost of downtime. Why? The answer is simple: even a few minutes of unavailability can lead to financial losses. To avoid repeating other people's mistakes, you should know which mistakes are made in practice. The best companies have already solved all of these problems. Perhaps the solutions considered here will help you avoid missteps.
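To make the comparison between costs and downtime concrete, here is a minimal Python sketch that converts an availability percentage into minutes of downtime per year and a yearly loss. The function names and the cost-per-minute figure are illustrative assumptions, not numbers from a real project.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes of downtime per year implied by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def yearly_downtime_cost(availability_percent: float, cost_per_minute: float) -> float:
    """Estimated yearly loss, assuming every minute of downtime costs the same."""
    return downtime_minutes_per_year(availability_percent) * cost_per_minute


# Hypothetical figure: one minute of downtime costs 100 currency units.
for percent in (99.0, 99.9, 99.99):
    minutes = downtime_minutes_per_year(percent)
    loss = yearly_downtime_cost(percent, cost_per_minute=100.0)
    print(f"{percent}% -> {minutes:.1f} min/year, about {loss:.0f} lost")
```

If the yearly loss at the current availability level is smaller than the price of the next level of redundancy, the extra redundancy may not pay for itself.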


1. The whole project is hosted in one place (one cloud hosting, one data center)

There is a mistaken belief that cloud hosting is not tied to hardware, that is, that infrastructure in the cloud simply cannot fail. But reality is cruel, and cloud outages do happen (remember cloud4y or cloudmouse!). The result is several hours of downtime.

The situation is similar with physical data centers: the user places all servers in one location. As soon as an incident occurs there, the outage can affect not just a few racks but the entire data center.

Solution: try the approach AWS uses. The project is spread across several availability zones, so the failure of one zone does not take everything down. This is how uptime close to 100% is achieved.
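A rough way to see why several zones help is to compute the chance that at least one of them is up. The sketch below assumes that zones fail independently of each other, which real outages do not always respect, so treat the result as an optimistic upper bound. The function name is hypothetical.

```python
def combined_availability(zone_availability: float, zones: int) -> float:
    """Probability that at least one of `zones` zones is up.

    Assumption: every zone has the same availability and zones fail
    independently of each other.
    """
    if zones < 1:
        raise ValueError("zones must be at least 1")
    if not 0.0 <= zone_availability <= 1.0:
        raise ValueError("zone_availability must be between 0 and 1")
    # The project is down only if all zones are down at the same time.
    return 1 - (1 - zone_availability) ** zones


for n in (1, 2, 3):
    print(n, "zone(s):", f"{combined_availability(0.99, n):.6f}")
```

Under this model each additional zone shrinks the expected downtime, but for any zone availability below 1 the formula stays below 1, which is why 100% remains a target rather than a guarantee.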

2. The reserve site is not adequately synchronized

Based on the previous situation, you decide to create a reserve site to get maximum uptime. But switching to it has its own precondition: the reserve must accurately reflect the production site. That means synchronizing changes:

  • in the data, including “cluster to cluster” replication (for a complex site);
  • in the file structure (plus tracking of the synchronization lag);
  • in the server configuration;
  • when adding core services/projects (plus fine-tuning the control processes);
  • when connecting secondary services.

For maximum effectiveness, do not forget to monitor all of this constantly!
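One way to monitor the file structure is to compare checksums of the files on both sites. The following is a minimal sketch that compares two local directory trees; the function names are hypothetical, and in a real setup each site would compute its own checksums and only the resulting maps would be compared.

```python
import hashlib
from pathlib import Path


def tree_checksums(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to the SHA-256 of its contents."""
    checksums = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            checksums[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return checksums


def find_drift(production: Path, reserve: Path) -> dict[str, list[str]]:
    """Report files that are missing, extra, or different on the reserve site."""
    prod = tree_checksums(production)
    res = tree_checksums(reserve)
    shared = prod.keys() & res.keys()
    return {
        "missing_on_reserve": sorted(prod.keys() - res.keys()),
        "extra_on_reserve": sorted(res.keys() - prod.keys()),
        "content_differs": sorted(k for k in shared if prod[k] != res[k]),
    }
```

An empty report in all three categories suggests the file structure is in sync at the moment of the check; it says nothing about databases or running services, which need their own checks.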

3. Switchovers to the reserve site are not performed regularly and have not been tested

Even with the most careful monitoring, there is no guarantee that the site prepared as a reserve will be ready to take over at the necessary moment. Therefore, everything needs to be checked under real conditions. The Stack Overflow example shows that it may take more than one test switchover to the fallback site.

Solution: build test checks and test switchovers to the reserve site into the availability plan in advance. Keep in mind that none of these tests rules out a real emergency.
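Regularity is easy to lose track of, so it can help to check automatically how long ago the last successful test switchover took place. The sketch below is only an illustration: the function name and the 30-day interval are assumptions, not a recommendation from any standard.

```python
from datetime import datetime, timedelta
from typing import Optional


def drill_is_overdue(
    last_successful_drill: Optional[datetime],
    now: datetime,
    max_age: timedelta = timedelta(days=30),  # assumed interval, adjust to your plan
) -> bool:
    """Return True if a test switchover has never succeeded or is too old."""
    if last_successful_drill is None:
        return True
    return now - last_successful_drill > max_age
```

A monitoring job could call this once a day and raise an alert when it returns True.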

4. The reserve site is located in the same place as the main site (the same network channel or cloud region)

A single hosting provider may place both the product and its reserve site in one location. The consequences of this mistake are grim: in an emergency, everything goes down at once!

Solution: set up one reserve site with your current hosting provider (its standard tools will help you switch over to it) plus a secondary reserve site with a different hosting provider.

5. The reserve site holds exactly the same data as the main site

The solution to the previous problem still does not guarantee that the prepared reserve will be ready at the crucial moment to work at full capacity and take on the production load. This follows from the very nature of redundancy: whatever reaches the production site is copied to the reserve, so exactly the same fatal condition will arise on the reserve site as on the production site. The result is a complete outage of the project.

Solution: design a mechanism for rolling the product back to the previous version on another site. Backups are relevant here as well, in particular delayed replication, where the replica lags behind by a fixed interval (for example, 1 hour). During an incident, this lets you switch to a database that the damaging changes have not yet reached.
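The idea of delayed replication can be shown with a small in-memory model. This is a toy sketch, not how any particular database implements it: the class and method names are hypothetical, timestamps are plain seconds, and changes are assumed to arrive in timestamp order.

```python
from dataclasses import dataclass, field


@dataclass
class DelayedReplica:
    """Toy replica that applies changes only after a fixed delay."""

    delay_seconds: float = 3600.0  # 1 hour, as in the example above
    data: dict = field(default_factory=dict)
    _pending: list = field(default_factory=list, init=False)

    def receive(self, timestamp: float, key: str, value: str) -> None:
        """Queue a change from the primary without applying it yet."""
        self._pending.append((timestamp, key, value))

    def apply_due(self, now: float) -> None:
        """Apply only the changes that are at least delay_seconds old."""
        still_pending = []
        for timestamp, key, value in self._pending:
            if timestamp <= now - self.delay_seconds:
                self.data[key] = value
            else:
                still_pending.append((timestamp, key, value))
        self._pending = still_pending

    def discard_pending(self) -> None:
        """Drop queued changes, e.g. once a bad change is noticed on the primary."""
        self._pending.clear()
```

If a destructive change hits the primary and is noticed within the delay window, the queued changes can be discarded and the replica still holds the last good state.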

6. The project depends on external services

Projects often rely on external services: SMS for authentication on the site, delivery services for online stores, third-party wallets for paying for services, and so on. If such a service becomes unavailable, you can forget about high-quality customer service.

Solution: duplicate critical external services and monitor their availability. And, of course, include switching between them in the emergency plan.
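Duplicating an external service usually means having a second provider ready and switching to it when the first one fails. Here is a minimal sketch of that switch, using SMS delivery as the example; the function name and the provider callables are hypothetical stand-ins for real client libraries.

```python
from typing import Callable, Sequence


def send_with_fallback(
    providers: Sequence[tuple[str, Callable[[str], None]]],
    message: str,
) -> str:
    """Try each provider in order and return the name of the first that succeeds."""
    errors = []
    for name, send in providers:
        try:
            send(message)
            return name
        except Exception as exc:  # a real system would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

The returned provider name can be logged or exported as a metric, so that monitoring shows how often the project is running on its backup provider.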
