
Tech Talk

3 Posts tagged with the it_automation tag

One of the most interesting hats I wear at xMatters involves visiting our clients and learning about what our products and services have done for their companies.  We have a very diverse group of companies in our 1000+ client community so I hear lots of anecdotes like:

"We got our customer facing web services back online faster than ever before"

"It used to take us 30min to assemble our major incident management SWAT team, we've been doing it in <5 with our IT relevance engine"

"We support more workloads with less people because of self service options built right into our job abend notifications"

"It keeps our airport operations running smoothly"

"We've shaved off 100s of hours [annually] of our outage durations across all the incidents that we track. That's real money."

etc.

 

You can always catch these great stories live or in recorded form in our client webinars.

 

What's always great to hear about is how xMatters impacts the lives of the actual people in IT.  As IT organizations continue to become more agile, they demand ever more of their people.  Be on call all the time, communicate relentlessly to keep everyone aware of what's going on, resolve business impacting issues more quickly, jump onto the myriad of calls that we need you on, get projects done more efficiently, here's some new tech – make it work, and so on. And all this while making sure that you follow process, keep things documented, and juggle flaming torches while standing on one leg.

 

 

Sounds about as easy as doing this:

 

 

 

So what role can xMatters play in making things a little easier for IT personnel?  A fair bit, as it turns out.  Without the right tools to foster effective communication between people, tools, and processes, you will engage the wrong resources at the wrong time (or all the time).  Each alert sent to the wrong person at best interrupts the task they were working on, and at worst makes problems last much longer.  Here are some of the best quotes that I've heard in the last few weeks from IT people who have had relevance engines in their lives for a while:

"I got to sleep through the night without pages that turned out to be false alarms! Felt as good as when our kids stopped waking us in the middle of the night."

"Since our notifications got more targeted, I've finally started to get over my 'phantom vibration' syndrome"

"Making our service desk available on mobile devices finally gave me the business case to get the boss to buy the field techs iPads. And yes, we do use them to get actual work done all the time!"

 

 

Great IT people = Great IT services.  Are you doing everything that you can to attract and retain the best people?

 

 

Share your quotes and thoughts in the comments!  I'll update the post with the best ones.

 

 

Abbas Haider Ali.


Earlier this year, I ran across an interesting post on the CIO website talking about IT services firm Atos Origin and their battle against productivity losses due to email overload. 

 

A couple of weeks ago it came up again with stories from ABC News, TechCrunch, Business Insider, and even Engadget.  The hook?  CEO Thierry Breton wants to eliminate email entirely and move people over to social collaboration tools, text messages, phone calls, and face-to-face meetings (!).

 

I do agree that email overload is a real issue and has resulted in some really off the wall behavior: 

  • volume bragging - "I've been on a tear, getting 1000 emails a day"
  • wrestling competitions - "Hit Inbox zero today"
  • admissions of failure - "Finally gave up and just declared email bankruptcy. Deleted everything and starting from scratch."

 

However, killing email is probably a little extreme. 

 

One area where email is out of control is notifications for IT teams.  The xMatters Advisors team spends a lot of time talking to large enterprises about how they manage their IT environments, and in particular the communication approaches they use to engage people to take actions (fix stuff, approve things) and to give people a heads up (service outages, change windows).

 

A common element that we've observed is that there is a LOT of email traffic internal to the IT teams and a fair amount that is sent outside the team about what's happening with IT services.  Here are some common complaints:

  • I'm on an email distro for IT alerts and get so many that I just ignore them (or have a rule that redirects them to a folder I never look at)
  • We send emails out but for important stuff have to follow up with a text or phone call to actually make something happen
  • When we get emails, we can't do anything with the messages themselves.  We have to call someone to approve a change or confirm that we're working on the issue. 
  • We have to power up a laptop to get connected to the system that sent the email to figure out what's really going on
  • We get emails for stuff on the weekend even when we're not on call
  • IT doesn't let us know what's going on.  We're kept in the dark.
  • IT keeps sending us so many emails about what's going on that we just ignore all of them now.

 

In my view, the core of the problem isn't email itself.  It's how we use email, rely on it exclusively, and treat it like a one way communication channel.

 

When we deliver an IT relevance engine to a client, there are a lot of benefits that directly address this email overload problem:

  • Individuals get fewer emails because messages pass through many filters to ensure that communications only get sent to the people who should get them, based on schedule, on call rules, escalation structure, location, etc.
  • When time is of the essence, either bypass email entirely, or use it just for a backup.  Primary communication switches over to SMS / text messages, or automated phone calls.
  • Enable people to respond to the messages they receive in context of the issue and their role in the organization. Examples: I'll take the incident ticket, this change request needs more detail about impact, I'll be on the war room conference call in 15min, etc.
  • Provide mobile access to IT management tools so whether you need to just get more information before making a decision, or you want to start working from wherever you are, you can do so quickly using any smartphone or tablet at your disposal.
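
To make the first two points concrete, here is a minimal Python sketch of that kind of filtering. It is only an illustration under stated assumptions, not how the xMatters product is implemented: the `Responder` class, the `route_alert` and `pick_channels` functions, and the sample people and shifts are all made up for this example.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class Responder:
    # All fields are hypothetical; a real system would load them from schedules.
    name: str
    team: str
    on_call_days: frozenset      # weekday numbers, Monday = 0
    shift_start: time
    shift_end: time
    devices: tuple = ("email",)

    def is_on_call(self, when: datetime) -> bool:
        """True if `when` falls inside this person's on-call window."""
        return (when.weekday() in self.on_call_days
                and self.shift_start <= when.time() < self.shift_end)


def pick_channels(responder: Responder, urgent: bool) -> list:
    """Urgent alerts go to SMS or voice first; email is kept only as a backup."""
    if not urgent:
        return ["email"]
    fast = [d for d in responder.devices if d in ("sms", "voice")]
    return fast + ["email"] if fast else ["email"]


def route_alert(team: str, urgent: bool, when: datetime, responders: list) -> dict:
    """Return {name: channels} for only the people who should be notified."""
    return {
        r.name: pick_channels(r, urgent)
        for r in responders
        if r.team == team and r.is_on_call(when)
    }


WEEKDAYS = frozenset(range(5))
WEEKEND = frozenset({5, 6})
people = [
    Responder("ana", "db", WEEKDAYS, time(9), time(17), ("email", "sms")),
    Responder("raj", "db", WEEKEND, time(0), time(23, 59), ("email", "voice")),
    Responder("lee", "network", WEEKDAYS, time(9), time(17), ("email", "sms")),
]

saturday_night = datetime(2011, 12, 10, 22, 30)  # a Saturday
print(route_alert("db", True, saturday_night, people))
```

In this sketch, an urgent database alert on a Saturday night reaches only the weekend on-call person, by phone first with email as the backup. Everyone else on the old distribution list gets nothing.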

 

If you read any of the articles about the challenges at Atos Origin and thought you have the same type of problems in your IT organization, you might find this information on IT relevance engines useful.

 

If you're an existing xMatters client and have seen these benefits come to life at your company, feel free to share stories.

 

And finally, here's a Monday morning challenge:  If you record a video of yourself saying the following phrase (or a creative alternative), I'll make sure you get some awesome xMatters schwag for the holidays. Limited to the first 20 responders who post as comments or send to info@xmatters.com.

 

"If you're having notification problems I feel bad for you son, I got a relevance engine so I got none"

 

And yes, I realize that the grammatical structure of that phrase is far from perfect, but I have rules to comply with: Jay-Z 99 Problems

 

Abbas Haider Ali.


Call it the holy grail of IT management, but picture an Enterprise IT environment that is monitored, detects faults, determines service impact, reroutes workload around failures, and automatically fixes things, all while communicating what is happening to an Ops team. Maybe it's just too much time at the movies, but the visual for that in my head looks something like this:

 

[Images: two stills from Terminator 2: Judgment Day]

Most IT teams that I work with have a clearly stated goal of automating whatever parts of IT management they can so they can minimize the cost of operations and support.  Spending time fixing things is certainly not going to be considered as valuable as investing in new projects or services that align with their company's business goals.

 

There are lots of reasons why this is more an objective than a reality in most places.  There are a lot of moving parts in a typical Enterprise IT environment, all from different vendors: different management stacks, interconnected applications maintained independently, 3rd party apps, different levels of operations support built into systems, and the list can keep going on.  The most mature IT service organizations have a service assurance stack that includes application, infrastructure, and platform capabilities, all combined together into a business service management view.  The processes and data for making changes, dealing with incidents, and managing problems all live in their service support stack.  And some even have decent service automation deployments where they can at least address some types of issues at the push of a couple of buttons.

 

However, even with all that in place, I can't think of a single client who could honestly say that they have even one self-healing IT service.

 

That's why the post by Pat Power at Facebook, Making Facebook Self-Healing, really caught my attention.  Pat started building the scripts that eventually became Facebook Auto-Remediation (FBAR), which covers everything from monitoring and operations logging through scheduling hardware replacements, validating fixes, and bringing systems back into production.  It's all pretty damn impressive.

 


 

Interested in what the return on something like this is?  It takes 2 full time engineers to maintain FBAR, and it does the work of ~200 full time sysadmins.  It manages more than 50% of Facebook's environment.

 

Some of my clients would kill (or at least kidnap Pat) to have something like that in their environment.  The key, of course, is that everyone at Facebook is absolutely clear on the fact that downtime is unacceptable.  It encourages DevOps-type collaboration to the extreme.  As part of developing new features that initially terrify us and that we later learn to love, engineers can build customized remediation plugins into their projects so that, once operational, they can heal themselves.
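
For a rough feel of what a remediation plugin could look like, here is a small Python sketch. To be clear, this is not FBAR's actual code or API, which I haven't seen; the `remediation` decorator, the `handle_alert` function, the alert names, and the fake host dictionaries are all assumptions invented for illustration. The idea is simply: look up a fix for the alert, run it, validate the result, and page a human only when the automation can't help.

```python
from typing import Callable, Dict, List

# Registry of remediation plugins, keyed by alert type (illustrative names).
REMEDIATIONS: Dict[str, Callable[[dict], bool]] = {}


def remediation(alert_type: str):
    """Decorator that registers a fix for one alert type."""
    def register(fn):
        REMEDIATIONS[alert_type] = fn
        return fn
    return register


@remediation("disk_full")
def clear_temp_files(host: dict) -> bool:
    host["disk_used_pct"] = 40           # stand-in for a real cleanup job
    return host["disk_used_pct"] < 90    # validate that the fix worked


@remediation("service_down")
def restart_service(host: dict) -> bool:
    # A restart cannot help if the underlying hardware is faulty.
    host["service_up"] = not host.get("hardware_fault", False)
    return host["service_up"]


def handle_alert(alert_type: str, host: dict, log: List[str]) -> str:
    """Try the registered fix; escalate to a person if none exists or it fails."""
    fix = REMEDIATIONS.get(alert_type)
    if fix is None:
        log.append(f"{host['name']}: no plugin for {alert_type}, paging a human")
        return "escalated"
    if fix(host):
        log.append(f"{host['name']}: {fix.__name__} succeeded, back in production")
        return "healed"
    log.append(f"{host['name']}: {fix.__name__} failed, paging a human")
    return "escalated"


log: List[str] = []
print(handle_alert("disk_full", {"name": "web1", "disk_used_pct": 97}, log))
print(handle_alert("service_down", {"name": "web2", "hardware_fault": True}, log))
print("\n".join(log))
```

The point of the registry is that the team shipping a feature also ships the function that knows how to fix it, so the on-call person only hears about the cases the plugin could not handle.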

 

Most companies can't be as single-minded in building out self-healing environments all the way up to the application & service layers, but over the next few weeks, I'll share some stories of best practices that I've seen clients put in place that get them as close as I've seen.  Maybe some worst practices too, just to keep things entertaining.

 

Abbas Haider Ali.

 

img credit: jamescameron.org