Business applications play a critical role in the organisation and represent the “face” of the business.
Meanwhile, the ecosystems in which applications reside are now larger than ever and are becoming ever more distributed and complex.
Lurking within this labyrinthine environment is a common, though not well-publicised, storage performance issue that affects applications and impacts the user experience.
This culprit is known as “tail latency”.
As technologies become more advanced and faster, humans become more impatient, and that can cost the business.
For example, at Amazon, every 100 milliseconds of latency causes a 1% decrease in sales. And at Bing, a two-second slowdown was found to reduce revenue per user by 4.3%.
Conversely, when latency is improved, it generates business. At Shopzilla, reducing latency from seven seconds to two seconds increased page views by 25%, and revenue by 7%.
What is tail latency?
Tail latency is the small percentage of a system’s responses to the input/output (I/O) requests it serves that take the longest in comparison to the bulk of its response times.
These slowest responses are, quite literally, the tail end of a system’s response time spectrum, and are often expressed as the 98th or 99th percentile response times.
To use an analogy, think of a shepherd who needs to herd his or her sheep back to the farm before calling the job done. About 99% of the sheep return pretty quickly, but 1% of slow sheep take much longer to return, and the shepherd cannot call the job done until they are all back. That 1% is the tail latency problem for the shepherd.
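To make percentile response times concrete, here is a minimal sketch (the latency values are invented purely for illustration) that computes the median and the 99th percentile of a batch of response times using the nearest-rank method:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the value at or below which roughly
    `pct` percent of the samples fall."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

# 98 fast responses around 5 ms plus two slow 80 ms stragglers --
# the "slow sheep" in the shepherd analogy (invented figures).
latencies_ms = [5.0] * 98 + [80.0, 80.0]

print(percentile(latencies_ms, 50))  # typical response: 5.0
print(percentile(latencies_ms, 99))  # tail response: 80.0
```

The median looks healthy, yet the 99th percentile is sixteen times slower, which is why averages alone hide tail latency.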
Typically, it is only the engineers responsible for systems that serve business-critical applications who are aware of this challenge.
Application and business managers can only see that apps are slow, and they want the issues resolved. In most cases, it does not matter to them whether apps are slow because of storage issues, network issues, load issues, or in this case, tail latency issues. Their only concern is that it is fixed.
Causes of tail latency
There can be many different causes of tail latency, and some might not have anything to do with storage I/O at all.
At its simplest, the issue may be caused by a dated storage solution that cannot keep up with new applications. But often, it is a combination of growing demand, changing workloads, increasingly distributed applications, complex services, and unpredictable “noise” or contention in the system.
Throw more storage at it?
Instead of investigating the causes of a tail latency problem, a typical reaction is to “throw more storage at the problem”.
But really, the first thing to do is to ask some questions. There could be a number of causes, such as oversubscribed storage resources, changing workloads, background processes, or perhaps a combination of these.
Well before a deployment, organisations have the option to assess current production applications’ workload profiles, and design a network or storage infrastructure to eliminate or minimise tail latency for the current load, as well as forecasted growth.
Over-provisioning is a costly and wasteful knee-jerk reaction, which brings no guarantee whatsoever of curing the problem.
Business impact: The twist in the tail
Despite the obvious impact on business performance, tail latency is not often explicitly named in publicised real-world examples. This is mainly because a) only technical teams are familiar with it, and b) it is often one piece of a larger or systemic problem that is more newsworthy.
There are, however, some real-world examples where tail latency has been directly highlighted as an issue for organisations.
In one case, LinkedIn described the criticality of tail latency to its services, and discussed tracking the issue down in one of its distributed systems.
According to a LinkedIn blog post, tail latency matters.
It said: “A 99th percentile latency of 30ms means that every 1 in 100 requests experience 30ms of delay. For a high traffic website like LinkedIn, this could mean that for a page with 1 million page views per day, 10,000 of those page views experience (noticeable) delay.”
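The arithmetic behind that quote scales straightforwardly: the share of requests beyond the chosen percentile, multiplied by total traffic, gives the number of affected requests. A quick sketch:

```python
def affected_requests(total_requests, pct):
    """Number of requests slower than the given percentile threshold,
    i.e. the requests that fall into the tail."""
    return int(total_requests * (100 - pct) / 100)

# The LinkedIn example: 1 million page views at the 99th percentile.
print(affected_requests(1_000_000, 99))  # 10000
```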
Most organisations experience tail latency to some extent, because no infrastructure is designed with infinite resources. And from time to time there are peak loads caused by unexpected failures or events.
Solving tail latency: Diagnosis and monitoring
While it may not yet be possible to eliminate tail latency entirely in all cases, it can be minimised to the point where it makes sense from a business cost-benefit perspective.
There are many solutions, depending on the root cause. So, the first step is to identify and define the root cause.
If tail latency is entirely due to the current storage being too slow, then faster storage might be the quick answer. Cutting-edge storage technologies such as NVMe can solve this partially or completely, depending on the nature of the storage performance issues. With today’s trend towards more cores per server, NVMe also provides better efficiency through highly parallel storage access.
If demand has grown significantly, exacerbating storage tail latency, then more storage, rather than faster storage, might be the better answer.
Deploying localised storage closer to the apps, or at the edge in cases where the application is highly distributed, can be a good idea.
If it is due to noise or changing workload patterns, then this indicates a need to re-evaluate the nature of application workloads, and to design and/or source storage infrastructures that are optimised for those workloads.
If it is due to background jobs (such as replication, or garbage collection in flash storage) clashing with business-hours workloads, then the recommendation would be to reschedule those background jobs, if feasible.
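As a sketch of that rescheduling idea, a scheduler can simply refuse to kick off heavy background work during a defined business window. The window below is an assumption for illustration, not a recommendation:

```python
import datetime

# Assumed business window of 09:00-17:59; adjust to the organisation.
BUSINESS_HOURS = range(9, 18)

def should_run_background_job(now=None):
    """Defer heavy background work (e.g. replication, flash garbage
    collection) to outside business hours."""
    now = now or datetime.datetime.now()
    return now.hour not in BUSINESS_HOURS

# Only start replication when the check passes.
print(should_run_background_job(datetime.datetime(2024, 1, 1, 3)))   # True
print(should_run_background_job(datetime.datetime(2024, 1, 1, 11)))  # False
```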
If it is an issue that is not storage-related, but perhaps due to web application design, then the tail latency problem may be resolved in a number of ways that range from simple to complex depending on the nature of the issue.
On the simple end of the spectrum, tail latency might be resolved by redesigning the way content is delivered to the end user so that it appears to be faster, using caching for example. At the complex end of the spectrum, an architectural redesign at one or more layers may be necessary.
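As an illustration of the simple end, a small in-process time-to-live (TTL) cache can keep frequently requested content out of the slow path. This is a hypothetical sketch, not a production caching layer:

```python
import time

class TTLCache:
    """Minimal time-to-live cache: serve recently fetched content from
    memory instead of taking the slow path on every request."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expiry timestamp)

    def get(self, key, fetch):
        """Return a fresh cached value, or call `fetch` (the slow
        path) and cache its result."""
        now = time.monotonic()
        entry = self._store.get(key)
        if entry and entry[1] > now:
            return entry[0]
        value = fetch()
        self._store[key] = (value, now + self.ttl)
        return value

cache = TTLCache(ttl_seconds=60)
page = cache.get("/home", fetch=lambda: "rendered page")        # slow path once
page_again = cache.get("/home", fetch=lambda: "never called")   # served from cache
print(page_again)  # rendered page
```

Only cache misses ever touch the slow path, so the tail of the response time distribution shrinks for any content that is requested repeatedly within the TTL.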
But often the problem cannot be reduced to a single issue, and the solution will therefore be a combination of several of the above.
Once a solution is in place, it is critical to have a full stack monitoring strategy that allows the mapping of applications to the underlying infrastructure resources they consume, down the stack and across the infrastructure.
It is vital to understand the business value of applications that run on a shared infrastructure so problems can be traced and prioritised accordingly.
Most importantly, an artificial intelligence for IT operations (AIOps) hybrid infrastructure management platform can be leveraged to monitor performance degradations, proactively investigate and identify issues, recommend solutions to those issues, and make sure they do not recur as future problems.
Henry He is director of product management at Virtual Instruments.