Mitigating IT Downtime with Scoutbees

Gadi Feldman

October 18, 2020

“Service currently unavailable”

Those who have encountered this dreadful message themselves know what thoughts follow (not to mention running around in desperation on what would otherwise have been a delightful Tuesday morning):

S#%t.
What happened?
How long has this been going on?
Will I get fired?

While the first and last thoughts won’t solve the problem, the middle two are critical in resolving the issue.

The reason we dread any kind of downtime related to IT resources is the negative impact on business operations. Compounded across lines of business, downtime can add up to a massive negative financial impact.

In 2014, Gartner estimated that IT downtime costs businesses, on average, more than $5,600 per minute. A 2016 Ponemon Institute report put the figure higher still, at nearly $9,000 per minute. Almost five years later, we can safely assume the figure is larger again, given our increased reliance on technology.

Unfortunately, it’s impossible to avoid outages altogether. Some reports say that in the past three years, 97% of companies in the U.K. have faced outages of some sort. Of those, 41% have encountered between one and four outages, with an unlucky 8% having to deal with 50+ outages in the same time period.

The Connectivity Supply Chain

Even though most service providers promise uptime of greater than 99%, we are not immune to failures.

Let’s take a look at a typical connection between an end user and a hosted application they need to access via their virtual workspace:

End-user device → Connection to the corporate gateway (typically either Citrix Gateway or VMware UAG) → Authentication with the on-prem Active Directory (or any other directory, such as Azure AD) and, optionally, 2FA → Get the list of published resources (here is where the broker comes into play) → Connection to the published resource (the broker needs to find an available server to run the session on) → Published resource session started (a session successfully started on a hosting server or VDI-based machine)

And all the way back to the end user.

As expected, the path between an end user and an application runs through multiple technologies, each with its own vulnerabilities. If one element in this connection supply chain fails, the whole process fails. So, even with >99% availability for each element, outages add up to several hours of downtime per year, and that costs a lot of money. That said, a large share of the downtime suffered by businesses is caused by human error; some estimates put it as high as 49%. Some errors are especially bad: the CenturyLink outage, in which a misconfiguration took down 3.5% of global internet traffic, is a classic example.
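To make the "it adds up" point concrete, here is a minimal back-of-the-envelope sketch. It assumes the components fail independently and that the chain is strictly serial (every element must be up), and the 99.9% figures are illustrative placeholders, not measurements of any real gateway, directory, or broker.

```python
# Illustrative numbers only: the per-component availabilities below are
# assumptions made for this example, not vendor figures.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def chain_availability(components):
    """Availability of a serial chain: every component must be up at once.

    Assumes failures are independent, so the availabilities multiply.
    """
    total = 1.0
    for availability in components.values():
        total *= availability
    return total


def downtime_hours_per_year(availability):
    """Expected hours per year during which the service is unavailable."""
    return (1.0 - availability) * HOURS_PER_YEAR


components = {
    "gateway": 0.999,
    "directory_and_2fa": 0.999,
    "broker": 0.999,
    "hosting_server": 0.999,
}

end_to_end = chain_availability(components)
print(f"End-to-end availability: {end_to_end:.3%}")
print(f"Expected downtime: {downtime_hours_per_year(end_to_end):.1f} hours/year")
```

Under these assumptions, four links that are each 99.9% available give an end-to-end availability of about 99.6%, or roughly 35 hours of expected downtime per year, compared with under 9 hours for any single link on its own.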

So, when we ask “what happened,” we know that there are numerous points of vulnerability that can cause a service to be unavailable. The key is to accurately identify the connection segment that failed so you can troubleshoot efficiently.

Identifying Issues Before They Cause Damage

But a brief spell of unavailability is not necessarily the problem. Damage is caused when outages last for prolonged periods of time, stopping operations and preventing workers and customers from using services. In fact, a 2020 report states that 79% of respondents agree that when IT issues go unnoticed, they always lead to bigger problems. So it makes sense that the less time your services spend unavailable, the fewer disruptions and financial losses you suffer.

To answer the question “How long has this been going on,” we need some sort of mechanism to regularly probe the availability of resources. This way, we can control the level of granularity we have when monitoring resource uptime. For example, if we run proactive tests every ten minutes, we would have greater than 99% confidence that we know the status of a resource in every ten-minute block. We need this level of confidence when we’re using virtual desktop infrastructure (VDI), such as VMware Horizon and Citrix Virtual Apps & Desktops, which are used to deliver applications and workspaces to end users.
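The idea of regular probing can be sketched in a few lines. This is a generic illustration, not how Scoutbees is implemented: `ProbeResult`, `run_probe`, and `outage_window` are hypothetical names, and the `check` callable stands in for whatever real availability test you would run. The point is that a probe history lets you bound when an outage began to within one probe interval.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, List, Optional, Tuple


@dataclass
class ProbeResult:
    timestamp: datetime
    available: bool


def run_probe(check: Callable[[], bool], now: datetime) -> ProbeResult:
    """Run one availability check; any exception counts as unavailable."""
    try:
        return ProbeResult(now, bool(check()))
    except Exception:
        return ProbeResult(now, False)


def outage_window(
    history: List[ProbeResult],
) -> Optional[Tuple[Optional[datetime], datetime]]:
    """Bound the start of the ongoing outage from a time-ordered history.

    Returns (last_success, first_failure): the outage began somewhere in
    between. last_success is None if no successful probe is on record.
    Returns None if there is no history or the latest probe succeeded.
    """
    if not history or history[-1].available:
        return None
    first_failure = history[-1].timestamp
    last_success = None
    for result in reversed(history):
        if result.available:
            last_success = result.timestamp
            break
        first_failure = result.timestamp
    return last_success, first_failure


# Simulated history: one probe every ten minutes, failing from 06:50 onward.
interval = timedelta(minutes=10)
start = datetime(2020, 10, 13, 6, 20)
outcomes = [True, True, True, False, False]
history = [
    run_probe(lambda up=up: up, start + i * interval)
    for i, up in enumerate(outcomes)
]

window = outage_window(history)
if window is not None:
    last_success, first_failure = window
    print(f"Outage began between {last_success:%H:%M} and {first_failure:%H:%M}")
```

With a ten-minute interval, the answer to "how long has this been going on" is only ever accurate to within ten minutes; a shorter interval narrows the window at the cost of running more tests.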

ControlUp Scoutbees has the advantage of monitoring resources, from an end-user perspective, at regular time intervals, as often as every five minutes. Scoutbees offers end-to-end visibility into the connectivity status of VDI resources rather than monitoring isolated parts of the network. With this approach, Scoutbees was able to immediately pick up a recent Azure Active Directory outage and alert users with comprehensive details about the nature of the problem.

With Scoutbees, you can initiate Scouts from Cloud Hives (this is what we call our fleet of cloud runners deployed across the globe) or from any other internal or custom locations, such as branches or stores. Scouts test your selected internal and external gateways (including 2FA-based gateways) and then attempt to open the selected published virtual workspace or application. Upon opening the workspace or application, the Scout not only checks the availability of the published resource, but also validates the health of the different components of the connection flow (e.g., authentication might be working, but what if it takes more than 10 seconds to authenticate the user?).

All of the above is consolidated in a report that contains the outcome of the tests, the times taken for each step in the connection, and also the reason for failure (where necessary). This reporting gives you all the intelligence you need to perform root cause analysis and discover exactly where in the connection chain the issue lies.
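As a rough illustration of what such a per-step report involves, the sketch below times each step of a connection flow, flags steps that succeed but exceed a threshold, and records the reason when a step fails. It is not Scoutbees code: `StepResult`, `run_flow`, and `first_problem` are hypothetical names, the step actions are placeholders, and the 10-second threshold simply echoes the authentication example above.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class StepResult:
    name: str
    status: str  # "ok", "slow", or "failed"
    seconds: float
    reason: Optional[str] = None


def run_flow(
    steps: List[Tuple[str, Callable[[], None]]],
    slow_after_seconds: float = 10.0,
    clock: Callable[[], float] = time.monotonic,
) -> List[StepResult]:
    """Run the steps in order, timing each one.

    A step that raises is recorded as "failed" and ends the run, because
    later steps depend on it. A step that succeeds but exceeds the
    threshold is recorded as "slow" and the run continues.
    """
    results = []
    for name, action in steps:
        started = clock()
        try:
            action()
        except Exception as exc:
            results.append(StepResult(name, "failed", clock() - started, str(exc)))
            break
        elapsed = clock() - started
        if elapsed > slow_after_seconds:
            reason = f"took longer than {slow_after_seconds:g}s"
            results.append(StepResult(name, "slow", elapsed, reason))
        else:
            results.append(StepResult(name, "ok", elapsed))
    return results


def first_problem(results: List[StepResult]) -> Optional[StepResult]:
    """Return the earliest step that was slow or failed, if any."""
    return next((r for r in results if r.status != "ok"), None)


# Placeholder actions: a real test would contact the gateway, directory,
# broker, and published resource here.
demo_steps = [
    ("gateway", lambda: None),
    ("authentication", lambda: None),
    ("enumerate resources", lambda: None),
    ("launch session", lambda: None),
]

for result in run_flow(demo_steps):
    detail = f" ({result.reason})" if result.reason else ""
    print(f"{result.name}: {result.status} in {result.seconds:.3f}s{detail}")
```

Stopping at the first hard failure while still reporting slow steps is a design choice for this sketch: it points root cause analysis at the earliest segment of the chain that misbehaved.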

The detailed reports combined with Scoutbees’ built-in notifications and alerts can help you remedy any outages as soon as they happen, before you suffer any material losses from any unavailable services.

So, if you ever get another “Service currently unavailable” message on a Tuesday morning, and your manager asks what happened, you will know that someone, somewhere, tripped over a cable at 6:47 a.m., and that your job is safe.