Data center maintenance and operation: the Yondr way

By: Andy Hoogeland

Our clients count on us. They rely on us to run our operations smoothly so they can focus on running theirs. The less they hear from us, the better.

The availability of a data center is expressed as a percentage of time. The industry target is known as “the 5 nines”. That’s 99.999% annual uptime, allowing for only about five minutes 15 seconds of downtime per year.

Read that again. 

Only five minutes of downtime is allowed every year.

Seems a little extreme, doesn’t it? After all, what’s five minutes out of an entire year? However, in that time companies could lose millions in revenue.

In fact, research shows that 93% of businesses that lose data center availability for 10 days file for bankruptcy within a year. 

There’s a lot at stake. It’s no surprise large organisations invest a lot of money in the company that guarantees that availability.
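
As a rough check on the five-nines figure, the annual downtime budget can be worked out directly from the availability percentage. The sketch below assumes a 365-day year, and the function name is purely illustrative rather than part of any Yondr tooling.

```python
def downtime_budget_seconds(availability_percent: float, days_per_year: float = 365.0) -> float:
    """Return the downtime allowed per year, in seconds, for a given availability."""
    seconds_per_year = days_per_year * 24 * 60 * 60
    return seconds_per_year * (1 - availability_percent / 100)


# Compare a few common availability targets.
for label, pct in [("three nines", 99.9), ("four nines", 99.99), ("five nines", 99.999)]:
    minutes, seconds = divmod(downtime_budget_seconds(pct), 60)
    print(f"{label} ({pct}%): {int(minutes)} min {seconds:.0f} s per year")
```

Each extra nine cuts the budget by a factor of ten: five nines leaves roughly 315 seconds a year, or a little over five minutes.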

Human or hardware?
What’s to blame for data center failures, tech or technicians? 

Well, a bit of both. However, a study from the Uptime Institute found that between 70% and 75% of major failures are caused by human error.

At Yondr, we do a few things differently to make sure we’re best in class for maintenance and operations, allowing us to boast an impressive availability and win our clients’ trust (and dare I say hearts?). 

Six elements that make up the Yondr way

Monitor and control
Our data centers are monitored 24/7, 365 days a year. If a customer has to call us to report server problems, they’ll lose confidence in us. Our approach to monitoring and control means we can alert customers to an incident immediately, giving them time to relocate data and inform their own clients if needed.

At all times, a lead engineer, an electrical engineer and a mechanical engineer occupy the monitoring room. This ensures there are always engineers on site from every technical discipline to monitor the facility and act on any failures that occur.

We notice disturbances before they grow into incidents. For example, all of our generators contain sensors which are observed from the monitoring room. If one gets hotter than the others, we investigate. And we have backup systems that take over when the main systems fail. By monitoring these systems, we can be sure they’ll kick in and do their job when needed.
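
The generator example can be sketched as a simple peer comparison: flag any unit running noticeably hotter than the rest. This is a minimal illustration only. The generator IDs, the readings and the 10 °C margin are all assumptions made for the example, not values from a real monitoring system.

```python
from statistics import median


def find_hot_generators(temps_c: dict[str, float], margin_c: float = 10.0) -> list[str]:
    """Return IDs of generators running hotter than the median of their peers by more than margin_c."""
    flagged = []
    for gen_id, temp in temps_c.items():
        # Compare each generator against all the others, not against itself.
        peers = [t for other, t in temps_c.items() if other != gen_id]
        if peers and temp - median(peers) > margin_c:
            flagged.append(gen_id)
    return flagged


# Hypothetical sensor readings in degrees Celsius.
readings = {"GEN-1": 82.0, "GEN-2": 84.5, "GEN-3": 97.0, "GEN-4": 83.0}
print(find_hot_generators(readings))
```

Comparing against peers rather than a fixed limit means a unit can be investigated while it is still within its rated range, which is the point of catching a disturbance before it becomes an incident.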

Our round-the-clock monitoring keeps everything healthy and operational. We take a proactive approach, rather than waiting until something breaks before taking action. If something happens which is going to affect uptime, we want to take action within a matter of seconds.  

Manage the risks
All works on critical components follow predefined, reviewed and approved step-by-step procedures. They are written in a simple, easy-to-follow format and must be carried out the same way every time. 

There’s no room for guesswork here. Even experienced engineers with good instincts must follow protocol. If anything unexpected occurs during maintenance, work stops immediately until the procedures are updated. 

A rollback plan is written as part of this procedure, detailing the steps necessary to return the database or application to the point before the procedure began.
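
The idea of a step-by-step procedure with a built-in rollback plan can be illustrated with a short sketch. The `Step` and `run_procedure` names are hypothetical. The sketch assumes each step reports whether the expected result was observed, and that only fully completed steps need undoing.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    description: str
    action: Callable[[], bool]    # returns True when the expected result was observed
    rollback: Callable[[], None]  # returns the system to its state before this step


def run_procedure(steps: list[Step]) -> bool:
    """Run steps in order; on any unexpected result, stop and undo completed steps in reverse."""
    completed: list[Step] = []
    for step in steps:
        if step.action():
            completed.append(step)
            continue
        # Unexpected result: stop work immediately and follow the rollback plan.
        for done in reversed(completed):
            done.rollback()
        return False
    return True


# Hypothetical two-step procedure where the second step does not go as expected.
log: list[str] = []
steps = [
    Step("isolate unit A", lambda: log.append("isolate A") or True, lambda: log.append("restore A")),
    Step("transfer load", lambda: False, lambda: log.append("restore load")),
]
print(run_procedure(steps), log)
```

In this sketch a failed run returns `False` so that the procedure can be reviewed and updated before anyone tries again.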

Prepare for emergencies
Immediate action is required during an emergency. There’s no time to ask questions. 

If, for example, the electricity grid goes down, our generators automatically kick in and so too do our emergency operating procedures (the first of which is to check that the generators and cooling equipment are working). Our engineers are trained to respond immediately, and regular drills mean their actions become second nature.

Plan maintenance and register assets
All maintenance works are carefully planned and approved in our facility management system. 

To prevent one maintenance task impacting another, automated processes create a standardised, controlled way of working.

The history of all assets is also registered in the facility management system. This means we can review an asset after an incident to see what happened and decide a course of action so the problem doesn’t repeat. 

Implement processes and procedures
Clearly defined processes and procedures are integral to the operation of a data center. They ensure all critical works are fully under control and allow us to produce predictable outcomes. 

Because our industry moves at such a pace, we regularly review and refine our processes and procedures. 

They are a vital part of Yondr’s growth because continuous improvement is built into our day-to-day way of working.

Follow the model
Although we’re covering this at the end, the maintain and operate model was actually created before the processes were. Because all processes are interlinked, it acts as a way to make sure nothing is missed. 

There are a few key aspects of the model that are worth briefly covering. 

They are: 

  • Incident management – this covers everything that impacts the service we deliver. Simply put, it is fixing the problems as they arise.
  • Problem management – involves investigating to find the root cause of an incident and coming up with a solution so it doesn’t happen again. This is how lasting improvements are made. 
  • Change management – makes sure every new activity is approved, reviewed and approved once more. Reviews are conducted by people with the relevant expertise. 
  • Asset registration – allows us to track the entire history of our assets, so we can take any action required. For example, if we notice a piece of equipment has recurring problems, we know there’s likely a faulty component and we can contact the supplier to do something about it.
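
The asset registration example above can be sketched as a simple query over an incident history: count incidents per asset within a recent window and flag anything that keeps coming back. The record layout, asset IDs, one-year window and threshold of three are all assumptions made for illustration.

```python
from collections import Counter
from datetime import date, timedelta


def recurring_assets(
    incidents: list[tuple[str, date]],
    today: date,
    window_days: int = 365,
    threshold: int = 3,
) -> list[str]:
    """Return asset IDs with at least `threshold` incidents in the last `window_days` days."""
    cutoff = today - timedelta(days=window_days)
    counts = Counter(asset_id for asset_id, when in incidents if when >= cutoff)
    return sorted(asset_id for asset_id, n in counts.items() if n >= threshold)


# Hypothetical incident history as (asset ID, incident date) pairs.
history = [
    ("CHILLER-02", date(2024, 1, 10)),
    ("CHILLER-02", date(2024, 3, 4)),
    ("UPS-07", date(2024, 3, 9)),
    ("CHILLER-02", date(2024, 5, 21)),
]
print(recurring_assets(history, today=date(2024, 6, 1)))
```

An asset that shows up in this list is a candidate for root cause investigation and a conversation with the supplier.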

Our data centers are like the airline industry
Every time a plane crashes or experiences a technical malfunction an investigation takes place. 

The error is isolated and corrections are made which prevent the same error happening to other aircraft. The entire industry grows stronger.

That’s why you’re now more likely to get struck by lightning than be involved in a plane crash. Each error fixed makes every other flight safer. 

Our data centers work the same way. 

Our approach to maintenance and operations means we’re able to offer staggeringly high availability. And every challenge we overcome improves the safety and security of all our data centers moving forward.
