The Looming Competency Crisis

Those of you who’ve taken a class with me know that I’m an IT nerd, rather than someone who jumps out of aircraft or eats things that would make a billy goat choke – so today, July 19th, was a big day in the news for me.

In the space of about 6 hours, we’ve seen 2 outages – a large outage affecting some Microsoft customers, and a huge global outage affecting customers of a security company called CrowdStrike.  Let’s break them down and look at the near-term and long-term consequences.

1. Around 18:00 on 7/18, Microsoft reported an outage of their Office 365 and Azure platforms.  Both of these platforms are cloud-based and service a large volume of customers – Office 365 is the popular product that provides email, SharePoint, Teams, and OneDrive functionality to home and corporate users, and Azure provides a cloud-based server hosting and services environment.  What that means is that if you tried to check your email after 18:00, well, you probably couldn’t.  It also means all those files you had in OneDrive were unavailable – yikes.

Microsoft had pretty much everything back up in a couple of hours, so the impact wasn’t terrible – to me, it falls into the “this is what happens when you pay someone else to do your work” category.  Outages happen, whether they’re in the cloud or on premises.  It’s part of IT.

2. The second outage, which began around 01:00 EDT, is the earth-shattering kaboom.  A company called CrowdStrike, which, among other things, provides a product called Falcon, deployed a faulty update to Falcon that resulted in the infamous BSOD on Windows machines.

Falcon is an endpoint security tool – an agent deployed to an endpoint that watches for certain activity and then works to either stop or mitigate it.  That can be good or bad – for years, I’ve called Falcon “Blue Falcon”, because while it can stop illegitimate activity, it also sometimes stops legitimate activity, making an IT guy’s life much harder.  It runs at the kernel level – a privilege level above anything we, as users or administrators, will ever have, even root – because it interacts directly with the kernel of the operating system.

Yeah, okay – that’s not great.

The CrowdStrike team deployed an update that apparently included a bad file, which caused Falcon and Windows to suddenly disagree on things and blue-screened Windows.  We’ve all had blue screens – they suck, but a reboot often fixes them (IT Tip number 1 – if you’re having problems with Windows, reboot it till it works).  The problem is that Falcon loads WITH Windows, at the kernel level, so you can’t reboot yourself out of this mess.  The solution at this point in time is to reboot into safe mode, rename a specific file, reboot into “regular” mode, and hope you have good point-in-time backups if it doesn’t come back up.  I’m pretty good at what I do, so I figure I could get that taken care of in 10-15 minutes, soup to nuts.  No big deal, right?
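If you want a sense of what that per-machine fix looks like, here’s a minimal Python sketch of the “rename the bad file” step.  The directory and filename pattern are my assumptions based on the guidance circulating publicly at the time – treat them as illustrative, not official, and defer to the vendor’s actual instructions before touching driver files in safe mode.

    # Rough sketch of the per-machine fix, run from safe mode.
    # The path and pattern below are ASSUMPTIONS for illustration, not official guidance.
    from pathlib import Path

    DRIVER_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")  # assumed location
    BAD_PATTERN = "C-00000291*.sys"                                # assumed channel-file pattern

    def quarantine_bad_channel_files(driver_dir: Path = DRIVER_DIR,
                                     pattern: str = BAD_PATTERN) -> int:
        """Rename matching files out of the way so Windows can boot normally."""
        renamed = 0
        for bad_file in driver_dir.glob(pattern):
            # Rename rather than delete, so the file can be restored later if needed.
            bad_file.rename(bad_file.with_name(bad_file.name + ".bak"))
            renamed += 1
        return renamed

    if __name__ == "__main__":
        count = quarantine_bad_channel_files()
        print(f"Renamed {count} file(s); reboot into normal mode and verify.")

The actual change takes seconds – the 10-15 minutes per box comes from getting console access, booting into safe mode, and rebooting back out again.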

Well, here’s the wrinkle – those of us on the infrastructure side of the IT house don’t manage 10-20 servers – we manage hundreds or thousands of them.  Virtualization has allowed us to cram a huge number of servers (guests) onto a relatively small amount of physical hardware (hosts), and since IT guys no longer have to deal with staggering amounts of physical hardware, the bean counters decided we can each manage more guests – it’s not uncommon to find enterprises with hundreds of physical hosts and thousands of guests being managed by a handful of irritable IT guys.  1000 servers X 10-15 minutes a server?  Yeah, you see where I’m going with this – there’s some back-of-envelope math after the small-business example below.

But wait – there’s more.  It’s not just servers that are endpoints – desktops and laptops can be endpoints, too.  So, let’s think about that in a small business context – you have a company with 200 employees, 6 hosts, maybe 50-60 guests, and one or two IT guys.  Well, first off, not all IT guys do the same thing – you may have just one guy who does all the infrastructure work, while the other manages a software platform that’s mission critical.  Look at the timing – 200 employees probably means close to 200 desktops/laptops, and split the difference on guests and call it 55.  The thing is, you don’t fix desktops and laptops first, because if you do, they don’t have anything to connect to – you bring up the infrastructure first.  Using the worst case of 15 minutes each, that’s nearly 14 hours spent on servers before you even touch the desktops/laptops, and then roughly 50 more hours for the desktops/laptops themselves.
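Here’s that arithmetic spelled out, using the 10-15 minutes of hands-on time per machine assumed above.  The machine and technician counts are just the illustrative numbers from the scenarios in this post, not data from any particular shop.

    # Back-of-envelope recovery math using the per-machine times assumed above.
    def serial_hours(machines: int, minutes_each: float, technicians: int = 1) -> float:
        """Wall-clock hours if each technician fixes one machine at a time."""
        return machines * minutes_each / 60.0 / technicians

    # Big enterprise: ~1000 servers at ~12.5 minutes apiece.
    print(serial_hours(1000, 12.5, technicians=1))   # ~208 hours for one person
    print(serial_hours(1000, 12.5, technicians=5))   # ~42 hours even with five people

    # Small business from the example: servers first, then endpoints.
    print(serial_hours(55, 15))    # ~13.75 hours of server work
    print(serial_hours(200, 15))   # ~50 more hours of desktop/laptop work

Even with generous parallelism, you’re measuring this in days, not hours – which is the point.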

I was caught up in the 2016 Shamoon attacks – essentially the cyber side of the proxy war between Iran and Saudi Arabia, and widely regarded as one of the most destructive cyberattacks in history.  I can tell you, you don’t just start fixing things right away – you start by figuring out what happened.  This issue, unlike a cyberattack, is a known one – but there is still planning involved before mitigation, particularly if you roll to a disaster recovery site.  The correct way to handle this would be to figure out the extent of what happened, declare a disaster (if necessary), roll to the DR site, ensure everything’s working, and THEN start looking to solve the production problem.  That’s a perfect world – not every company is set up to do that, for one reason or another.  I think we’re about to see who’s got a good DR plan, and who doesn’t.  I won’t even start getting into business continuity plans.

The final wrinkle – this is all on-premises work – if you can fix this remotely, you’re lucky.  The vast majority of enterprises use single sign-on tied to Windows Active Directory accounts – including for their VPN access.  Do you sign into your VPN using your Windows credentials?  That’s SSO.  Did CrowdStrike just smoke everything that manages your Windows credentials?  They did.  Time to put on some pants and head into the office, and expect a 2- or 3-hour delay in response times because of it.

So, now you’re starting to see the scope of this – no matter the size of the enterprise, this is going to take time to fix.  Some may be back in a few hours as they declare a disaster and roll to their DR sites (which are hopefully unaffected), but some are going to be down for days, as they slowly repair or restore their servers. 

Expect the disruption from this to go on for quite a while.

The past few months have brought some notable service disruptions – AT&T had some big issues, Microsoft had a big issue, and CrowdStrike – well, let’s just say CrowdStrike stock is going to be “on sale” for a while.  I’ve been asked, on all these occasions, “was this a cyberattack?” – which is a valid question, given today’s world.  My response is usually:

Never attribute to malice what can be attributed to stupidity.

The AT&T outage is widely accepted to have been some bad code in an update that got pushed, which affected the signal handover between cell towers – what was it, exactly?  They’ll never tell us, but don’t be surprised if somebody missed a semicolon somewhere.

The Office 365 and Azure outage – again, most likely a bad update, and again, we’ll never know for sure.

CrowdStrike?  Well, they’ve pretty much admitted what happened, or we can deduce that based on their mitigation procedure – and yep, bad update.  Basically, if you had auto-update turned on, you’re in trouble.

So, what does this all mean in the longer-term, larger scope of the world?  Well, we’re all becoming more interconnected, relying on a smaller and smaller base of people to provide services to an ever-larger set of complex systems.  How many vital systems are going to be either unavailable or delayed today because a small team, or even a single person, slipped up and wrote some bad code?  Well, check out United Airlines and see what their delays look like – or better yet, ask Boeing how the 737 Max worked out for them – or the two folks stuck on the space station (again, looking at you, Boeing) because their ride was built by multiple vendors who hated each other, overseen by folks who only cared about the bottom line.  This is what happens when, to quote Boeing, you “fired all the amazingly talented assholes.”

We are facing a crisis of competence, and we’re only in the beginning stages.  New technology, old technology, everything is affected.  How many guys do you know who can run your water treatment plant?  Have you tried fixing a fridge lately?  What do you do when a vendor you trust deploys a bad update and brings down your enterprise?

This country is simply not producing the professionals it needs in the numbers it needs, and that, coupled with the offshoring mania the MBAs have embraced, means that while technology will continue to improve, we are building an ever more complex environment that we’re just not going to be able to maintain.  Things will get brighter and shinier, and do more things better and faster, but they will become less and less reliable, until boom – nothing works anymore.  Welcome to Idiocracy, or the end-state competency crisis.

— Kirk
