99.99% reliable, but availability needs rethinking

Applications and service software should always be available to users and work without fail. That is the ideal scenario. Of course, anyone involved in the IT industry will tell you that 100% trouble-free operation is impossible. But what if that is what the customer wants? Do they understand how much such stability actually costs?

During a discussion of one of our new projects at Emat EOOD, we once again recalled the article The Calculus of Service Availability and Google's ‘four nines’ rule: a mathematically derived budget for the time it takes to detect a failure, the downtime itself, and the time it takes to restore the service.

At first glance, Google's level is unattainable for a small IT team or start-up, and none of this seems to apply to us. But the point is not the numbers. It is the approach: how to think about reliability and what can be improved even with limited resources. So today we will talk about the ideal and how to strive for it.
‘Almost always’ from Google: 52.56 minutes of downtime per year
The book Site Reliability Engineering: How Google Runs Production Systems describes the principle that a given service must be in good working order ‘almost always’.

The reasoning behind an SLO (service-level objective) is that no user can tell the difference between 100% and 99.999% availability. Between the user and the service there are many fragile links that the service's developers cannot influence (laptop, home Wi-Fi, provider, power grid, etc.), so it makes no sense to fight for the last fractions of a percent.

Google's reasoning:
  • Most services must provide 99.99% availability to users.
  • In the worst case, this allows a total error budget of 0.01% of the 525,600 minutes in a 365-day year, or 52.56 minutes.
  • Five independent critical components are each allotted 0.001%, for a total of 0.005% of 525,600 minutes per year, or about 26 minutes.
  • That leaves about 53 − 26 = 27 minutes (in rounded figures) for errors in the service itself.
  • Each critical component of the system must be more reliable than the service as a whole.
  • A service cannot be more available than the combination of its critical components allows.
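The arithmetic behind this budget can be reproduced in a few lines. The following is a minimal sketch, not Google's tooling: the function and variable names are our own, and it uses unrounded figures, so the service's own share comes out at about 26.3 minutes rather than the rounded 27.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a 365-day year


def downtime_budget_minutes(unavailability_percent: float) -> float:
    """Yearly downtime, in minutes, allowed by an unavailability percentage."""
    return MINUTES_PER_YEAR * unavailability_percent / 100


# 99.99% availability leaves 0.01% of the year as the total error budget.
total_budget = downtime_budget_minutes(0.01)

# Five independent critical components, 0.001% each (figures from the list above).
dependency_budget = downtime_budget_minutes(5 * 0.001)

# Whatever is left belongs to failures of the service itself.
own_budget = total_budget - dependency_budget

print(f"total:        {total_budget:.2f} min/year")
print(f"dependencies: {dependency_budget:.2f} min/year")
print(f"service:      {own_budget:.2f} min/year")
```

Changing the number of critical components or their individual share shows immediately how quickly the service's own budget shrinks.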
Ideal versus reality
But should small IT companies and developers strive to reach the level of Google, Amazon or Microsoft? How can you use their approach if you don't have an army of DevOps engineers and round-the-clock support? Not all customers have a banking API, and not every crash threatens to cost millions. And yet, even small projects benefit from understanding these principles.

Numbers force us to think more precisely. The ‘four nines’ model helps us take a fresh look at project architecture: which components are critical, which can be replicated, and which should be isolated.

99% availability means 3.65 days of downtime per year, 99.9% means about 8 hours and 45 minutes, and 99.99% means the same 53 minutes that Google allows. Simply put, each ‘nine’ in a customer contract costs resources and effort. On the other hand, when everything seems to be working ‘normally’ and ‘almost never crashes,’ you need to understand exactly what ‘almost’ means. Is it 3 minutes? 3 hours?
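To make ‘almost’ concrete for any figure in a contract, the conversion from an availability percentage to yearly downtime can be sketched as follows. The helper name is our own, and the result is rounded to whole seconds to avoid floating-point noise.

```python
from datetime import timedelta

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a 365-day year


def yearly_downtime(availability_percent: float) -> timedelta:
    """Downtime per year implied by an availability percentage."""
    minutes = MINUTES_PER_YEAR * (100 - availability_percent) / 100
    # Round to whole seconds so float error does not leak into the result.
    return timedelta(seconds=round(minutes * 60))


for availability in (99.0, 99.9, 99.99):
    print(f"{availability}% -> {yearly_downtime(availability)}")
```

Running the same function on a promised SLA figure is a quick way to check whether the team can realistically stay within it.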
MTTR is more important than the number of nines
Not everyone needs Google's level, say the developers at Emat Ltd. Let's be honest: it is an ideal. But the principle matters more than the percentage. It is a tool for thinking and a sign of maturity, not a religion. In practical terms, MTTR (mean time to repair) is something you can genuinely improve even in a small team:
  • Reduce fault detection time — build a system in which incidents are detected before the customer becomes aware of them, using customised monitoring, logging, alerting and metrics.
  • Reduce specialist response time — ensure that the person responsible sees the relevant notification and immediately understands what needs to be done. This is not just about duty rosters. Clear instructions, automatic incident routing, and team training are also important.
  • Reduce recovery time with monitoring, one-button rescue actions (e.g., rollback to a previous state or adding backup capacity), operational readiness practices, etc.
  • Reduce average downtime not only through technical solutions, but also through architectural approaches: infrastructure segmentation, geographic distribution, and partial degradation capabilities. Even if one segment remains operational during a failure, the overall impact is reduced. This means that users can continue working, albeit with limited functionality, and the team can restore the service without panic and rush.
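The effect of these four levers on the yearly budget can be estimated with a simple model: downtime per incident is the sum of detection, response and recovery time, scaled by the share of users affected. This is a minimal sketch under that assumption. The class, the function and all the minute values are hypothetical illustrations, not measurements.

```python
from dataclasses import dataclass


@dataclass
class IncidentProfile:
    """Typical incident timings in minutes (hypothetical values)."""
    detect_min: float
    respond_min: float
    repair_min: float
    impact_fraction: float = 1.0  # share of users affected, 0..1

    @property
    def downtime_per_incident(self) -> float:
        total = self.detect_min + self.respond_min + self.repair_min
        return total * self.impact_fraction


def incidents_allowed_per_year(budget_min: float, profile: IncidentProfile) -> float:
    """How many such incidents fit into a yearly downtime budget."""
    return budget_min / profile.downtime_per_incident


BUDGET_MIN = 52.56  # yearly budget for 99.99% availability

# Assumed example: customers report the failure, manual recovery, all users hit.
slow = IncidentProfile(detect_min=30, respond_min=20, repair_min=60)

# Assumed example: alerting, one-button rollback, only half the users affected.
fast = IncidentProfile(detect_min=2, respond_min=5, repair_min=10, impact_fraction=0.5)

print(f"slow team: {incidents_allowed_per_year(BUDGET_MIN, slow):.2f} incidents/year")
print(f"fast team: {incidents_allowed_per_year(BUDGET_MIN, fast):.2f} incidents/year")
```

With these assumed numbers, the slow profile cannot afford even one incident a year, while the fast one can absorb several, without any change in how often failures occur.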

From mathematics to a culture of responsibility
Discussions about ‘nines’ often turn into arguments about minutes, incidents, and recovery times. This is especially true when we are talking about small and medium-sized companies rather than IT market giants. A failure is always a test of teamwork. Who finds out first? Who makes the decision? Do we have a recovery plan? Does the customer have an alternative? How prepared is your service for reality?

The sooner the team starts asking itself these questions, the easier it will be to answer them in the future. Because reliability is not about numbers. It's a way of thinking. It's a habit of checking. It is an internal discipline that, over time, gives rise to trust.

At Emat Development, we believe that service stability is defined less by the numbers than by how available the technology actually is to users. Every minute of downtime is more than a mark against the SLA. It is a verdict on the team's work, its approach to engineering, and its attitude towards users.

What can be done in a small project
  • Monitoring: UptimeRobot, BetterStack, Telegram bot
  • Automatic database backup in the cloud
  • Notifying responsible parties in case of problems: Slack/messenger
  • Response scenarios: Google Doc with actions (rollback, restart, adding backup capacity, switching DNS)
  • A failure affecting one client should not affect others, including during updates (segmentation, geographical isolation, graceful degradation)
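Where a hosted monitoring service is not an option, the monitoring and notification items above can be approximated with the Python standard library alone. The sketch below is a hypothetical illustration, not a replacement for the tools listed: `http_probe`, `FailureTracker` and `monitor` are our own names, and `notify` defaults to `print` where a real setup would send a Slack or messenger message.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx or 3xx status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # Covers HTTP errors, timeouts, DNS failures and malformed URLs.
        return False


class FailureTracker:
    """Notify once after N consecutive failures and once on recovery."""

    def __init__(self, threshold: int = 3,
                 notify: Callable[[str], None] = print) -> None:
        self.threshold = threshold
        self.notify = notify
        self.consecutive_failures = 0
        self.alerting = False

    def record(self, healthy: bool) -> None:
        if healthy:
            if self.alerting:
                self.notify("RECOVERED: service is responding again")
            self.consecutive_failures = 0
            self.alerting = False
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold and not self.alerting:
            self.alerting = True
            self.notify(f"DOWN: {self.consecutive_failures} consecutive failed checks")


def monitor(url: str, tracker: FailureTracker, checks: int,
            interval_s: float = 60.0,
            probe: Callable[[str], bool] = http_probe) -> None:
    """Probe the URL a fixed number of times, pausing between checks."""
    for i in range(checks):
        tracker.record(probe(url))
        if i < checks - 1:
            time.sleep(interval_s)
```

A call such as `monitor("https://example.com/health", FailureTracker(), checks=10)` would run ten checks one minute apart; the URL is a placeholder. Requiring several consecutive failures before alerting is a deliberate choice: it trades a slightly longer detection time for fewer false alarms.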
    Info
    Emat EOOD
    Bulgaria, Sofia 1404, Stolichna Municipality,
    district. Triaditsa, st. Yasna Polyana 110