Nutanix
Aug 14, 2017

4 Ways You Can Use AIOps For Better CloudOps Efficiency

The world of IT has evolved rapidly over the last decade, with cloud becoming the new normal for enterprise businesses. From on-premises data centers to the rise of cloud and converged architecture, IT operations have undergone a wave of process and procedural changes driven by the DevOps philosophy. Companies that were not traditional IT vendors, such as Amazon, Microsoft and Google, have disrupted traditional infrastructure and IT operations by removing the heavy lifting of building data centers and managing servers, networks and storage, so that engineers can put their focus back on applications and business operations instead of IT plumbing.

Above all, the DevOps philosophy aims to save time and improve performance by bridging the gap between engineers and IT operations. However, DevOps hasn’t truly delivered what was expected of it, because engineers still have to handle every issue and event in their infrastructure, whether in the cloud or in an on-prem data center.
A new philosophy is emerging around the question: “What if humans could solve new, complex problems while we let machines resolve known, repetitive, and identifiable problems in cloud infrastructure management?” This is known as the AIOps philosophy, and it is slowly taking root in cloud-native and cloud-first companies as a way to reduce the dependency on engineers to resolve problems.

Many enterprises have already adopted cloud as a key component of their IT and have limited their DevOps to configuration management and automated application deployments. Nurturing the AIOps philosophy will further reduce the need for engineers to manage repetitive everyday operations, freeing precious engineering time for business problems. While cloud has made automation easy for engineers, the lack of intelligence powering their day-to-day operations still causes operational fatigue, even in the cloud world.

The adoption of cloud and the emergence of AI and ML technologies allow companies to use intelligent software automation, rather than vanilla scripting, to make decisions on known problems, predict issues, and provide diagnostic information for new problems, reducing the operational overhead for engineers. The era of pagers waking engineers in the middle of the night for downtime and known issues will be a thing of the past within the next 18 to 24 months.

In the traditional IT world, the main focus of operations engineers was to keep the lights on. In the cloud world, there are new dimensions, such as variable costs, API-driven infrastructure provisioning, no centralised security, and dynamic changes, that further increase the workload. The only way to help companies reduce their cloud costs, improve security compliance for on-demand provisioning, reduce alert fatigue for engineers, and bring intelligent machine operations to problems caused by dynamic changes is through AIOps: put AI to work to make cloud work for your business.

Managing Enterprise Cloud Costs

According to the RightScale 2017 State of the Cloud Report, managing cloud costs is the #1 priority for companies using cloud computing. Cloud cost challenges cause massive headaches for finance, product and engineering teams because of dynamic provisioning, auto-scaling, and the lack of garbage collection for unused cloud resources. When hundreds of engineers within an enterprise use cloud platforms like AWS, Azure and Google for their applications, it becomes impossible for one person to keep track of spend or to enforce a centralised approval process. Companies such as Botmetric are using machine intelligence and AI technologies to detect cost spikes, provide deep visibility into who used what, and help companies deploy intelligent automation that cleans up unused resources and automatically resizes over-provisioned servers, storage and other resources in the cloud.
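As a rough illustration of what detecting a cost spike can mean in practice, here is a minimal sketch that compares each day's spend with the median of the preceding days. The function name, the 7-day window and the 1.5x ratio are assumptions chosen for the example, not anything the tools mentioned above are known to use.

```python
from statistics import median

def find_cost_spikes(daily_costs, window=7, ratio=1.5):
    """Return the indexes of days whose spend exceeds `ratio` times
    the median spend of the previous `window` days."""
    spikes = []
    for day in range(window, len(daily_costs)):
        baseline = median(daily_costs[day - window:day])
        if baseline > 0 and daily_costs[day] > ratio * baseline:
            spikes.append(day)
    return spikes

# Hypothetical daily spend in dollars; day 8 jumps well above the trend.
costs = [100, 102, 98, 101, 99, 103, 100, 104, 180, 101]
print(find_cost_spikes(costs))  # [8]
```

A median baseline is used here so that one earlier spike does not inflate the baseline for the days that follow it.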

IT infrastructure is an important factor in your business success, and so is understanding the optimal usage limits for your organisation’s infrastructure needs. Compared with on-prem, cloud looks pretty easy because of its pay-as-you-go model. However, when you grow exponentially you scale your cloud the same way, and this gives you bill shock. A lot of teams put tagging policies and rigorous monitoring in place, but because controlling cost is not what engineers naturally focus on, they still miss savings. Continuous automation helps reduce those misses, and AIOps makes saving a continuous process in your cloud operations. For example, you can automate the purchase of Reserved Instances in AWS with simple code running on AWS Lambda. Another common and favourable use case is turning off dev instances over the weekend and automatically turning them back on at the start of the working week, which saves up to 36% for most cloud users.
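The weekend shutdown idea can be sketched as a small scheduling rule. The schedule below (stop on Friday at 20:00, start on Monday at 08:00) is an assumption made for illustration; the article does not say which schedule its savings figure is based on. The actual stop and start calls to the cloud provider are left out, so this only shows the decision logic and the arithmetic.

```python
from datetime import datetime

# Assumed schedule (not from the article): dev instances run from
# Monday 08:00 until Friday 20:00 and are stopped the rest of the week.
START_HOUR_OF_WEEK = 0 * 24 + 8    # Monday 08:00 (Monday is weekday 0)
STOP_HOUR_OF_WEEK = 4 * 24 + 20    # Friday 20:00
HOURS_PER_WEEK = 7 * 24

def should_be_running(moment: datetime) -> bool:
    """True if a dev instance should be up at the given local time."""
    hour_of_week = moment.weekday() * 24 + moment.hour
    return START_HOUR_OF_WEEK <= hour_of_week < STOP_HOUR_OF_WEEK

def weekly_savings_fraction() -> float:
    """Share of the week during which the instance is stopped."""
    running_hours = STOP_HOUR_OF_WEEK - START_HOUR_OF_WEEK
    return (HOURS_PER_WEEK - running_hours) / HOURS_PER_WEEK

print(should_be_running(datetime(2017, 8, 12, 12)))  # Saturday noon -> False
print(f"{weekly_savings_fraction():.0%}")            # 60 of 168 hours off -> 36%
```

Under this assumed schedule the instance is off for 60 of the 168 hours in a week. The saving applies to compute billed by running time; resources that are billed while an instance is stopped, such as attached storage, are not covered by this calculation.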

Ensuring Cloud Security Compliance

When any engineer within the organisation can provision a cloud resource through an API call, how can businesses ensure that every cloud resource is provisioned with the security configuration their business needs and that it satisfies regulatory requirements such as PCI-DSS, ISO 27001 and HIPAA? This requires detecting compliance violations in real time, informing the user who provisioned the resource, and taking actions such as shutting down non-compliant machines so that your business stays protected. The most important part of security these days is continuous monitoring, which you can achieve with a mechanism that detects and reports an issue the moment an alert is received. A lot of organisations are developing tools that not only detect security vulnerabilities but also resolve them automatically. By leveraging AIOps and the real-time event and configuration management data from cloud providers, companies can stay compliant and reduce their business risk.
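The detect, notify and act loop described above might look something like the following sketch. The resource record is a hypothetical, simplified shape (real provider events are richer and differ per provider), and the two rules shown are illustrative assumptions, not requirements taken from PCI-DSS, ISO 27001 or HIPAA.

```python
# Ports that, in this example, must never be open to the whole internet.
SENSITIVE_PORTS = {22, 3389}

def find_violations(resource):
    """Check a (hypothetical) resource record against two sample rules."""
    violations = []
    for rule in resource.get("ingress_rules", []):
        if rule["cidr"] == "0.0.0.0/0" and rule["port"] in SENSITIVE_PORTS:
            violations.append(f"port {rule['port']} open to the internet")
    if not resource.get("encrypted", False):
        violations.append("storage is not encrypted")
    return violations

def decide_action(resource):
    """Decide what to do and whom to tell for a newly provisioned resource."""
    violations = find_violations(resource)
    if not violations:
        return {"action": "none", "notify": None, "violations": []}
    # Non-compliant: stop the machine and inform whoever provisioned it.
    return {"action": "stop", "notify": resource["owner"], "violations": violations}

new_server = {
    "owner": "alice",
    "encrypted": False,
    "ingress_rules": [{"cidr": "0.0.0.0/0", "port": 22}],
}
print(decide_action(new_server))
```

Keeping the rules as plain data and small functions makes it easy to run the same check on every provisioning event and to add rules as your compliance needs change.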

Reducing Alert Fatigue

The problem of too many alerts is a known issue in the data center world, popularly called Ops fatigue. The traditional chain, in which the NOC team looked at alert emails, the IT support team reviewed and responded to tickets, and engineers then looked into the critical problems, broke down in the cloud world, where DevOps engineers manage all of these tasks.

Anybody who has managed production infrastructure, business services and applications, or architected systems, knows that most problems are caused by known events or identifiable patterns. Noisy alerts are the common denominator in any IT operations management. With a swarm of alerts flooding inboxes, it becomes very difficult to tell which ones really matter and need an engineer’s attention. A great solution, powered by anomaly detection, would filter out unnecessary alerts and suppress duplicates, leaving a more concise alert stream in which to detect real issues and predict problems. Engineers already have an idea of what to do when certain events or symptoms occur in their application or production infrastructure. Yet when events or alerts are triggered, most current tools just provide a text of what happened instead of the context of what is happening and why. So as DevOps engineers, it’s important for you to create diagnostic scripts or programs that give you that context: Why did CPU spike? Why did an application go down? Why did API latency increase? Essentially, the goal is to get to the root cause faster, powered by intelligence. You should encourage your team to deploy anomaly detection powered by machine intelligence, along with smart automated actions (response mechanisms) for known events with business logic embedded, so the team can sleep peacefully and never sweat again.
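Suppressing duplicate alerts is the simplest piece of this to show in code. The sketch below is one possible approach under stated assumptions: the alert fields (`time`, `source`, `check`) and the 5-minute window are made up for the example and do not come from any particular alerting tool.

```python
def suppress_duplicates(alerts, window_seconds=300):
    """Keep the first alert for each (source, check) pair and drop repeats
    that arrive within `window_seconds` of the last alert that was kept."""
    last_kept = {}
    kept = []
    for alert in sorted(alerts, key=lambda a: a["time"]):
        key = (alert["source"], alert["check"])
        previous = last_kept.get(key)
        if previous is None or alert["time"] - previous >= window_seconds:
            kept.append(alert)
            # Only kept alerts reset the window, so a problem that keeps
            # firing is re-reported once per window instead of never again.
            last_kept[key] = alert["time"]
    return kept

# Hypothetical alerts; times are seconds since the first alert.
alerts = [
    {"time": 0, "source": "web-1", "check": "cpu"},
    {"time": 30, "source": "db-1", "check": "disk"},
    {"time": 60, "source": "web-1", "check": "cpu"},
    {"time": 120, "source": "web-1", "check": "cpu"},
    {"time": 400, "source": "web-1", "check": "cpu"},
]
print([a["time"] for a in suppress_duplicates(alerts)])  # [0, 30, 400]
```

Five alerts shrink to three: the two repeats from web-1 inside the window are dropped, while the one at 400 seconds is kept as a reminder that the problem persists.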

Intelligent Automation For Operations

The engineers responsible for managing production operations, from the ITOps era to the DevOps era, have been frustrated with static tooling that is mostly not intelligent. With the rise of machine intelligence and the adoption of deep learning, we will see more dynamic tooling that can help them in day-to-day operations. In the cloud world, the only magic wand for solving operational problems is to use code and automation as a weapon. Operating your cloud infrastructure without intelligent automation will only increase complexity for your DevOps teams. You can create everything from automated remediation actions to alert diagnostics. As a team and as a DevOps engineer, you need to focus on using code as the mechanism for resolving problems. If you are building a CI/CD pipeline today, you should certainly include a trigger that monitors each deployment’s health metrics and invokes a rollback if it detects performance or SLA issues. Simple remedies like this can save hours after every deployment and handle failures gracefully.
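A post-deployment rollback trigger can be sketched as a small decision function. Everything specific here is an assumption for illustration: the metric names, the thresholds (5% error rate, 500 ms p95 latency) and the `rollback` callback stand in for whatever your monitoring system and pipeline actually provide.

```python
def should_roll_back(samples, max_error_rate=0.05,
                     max_p95_latency_ms=500, min_samples=3):
    """Decide whether health samples taken after a deployment breach the SLA.

    Rolls back only when more than half of the samples are unhealthy,
    so a single blip does not undo a good release.
    """
    if len(samples) < min_samples:
        return False  # not enough data yet to judge the release
    breaches = [
        s for s in samples
        if s["error_rate"] > max_error_rate
        or s["p95_latency_ms"] > max_p95_latency_ms
    ]
    return len(breaches) * 2 > len(samples)

def run_post_deploy_check(samples, rollback):
    """Call the pipeline's rollback hook if the deployment looks unhealthy."""
    if should_roll_back(samples):
        rollback()
        return "rolled_back"
    return "healthy"

# Hypothetical samples collected in the minutes after a deployment.
samples = [
    {"error_rate": 0.10, "p95_latency_ms": 300},
    {"error_rate": 0.01, "p95_latency_ms": 900},
    {"error_rate": 0.02, "p95_latency_ms": 200},
]
print(run_post_deploy_check(samples, rollback=lambda: print("rolling back")))
```

Here two of the three samples breach a threshold, so the hook fires and the check reports `rolled_back`.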

We will also see various ITSM vendors bringing AI and ML into their offerings, such as Intelligent Monitoring (dynamic alerts instead of static thresholds), Intelligent Deployment (with cluster management and auto-healing tooling), Intelligent APM (not just what is happening but why it is happening and what caused it), Intelligent Log Management (real-time streaming of log events and automatic detection of relevant anomalies based on the application stack) and Intelligent Incident Management (suppressing noise from different alerting systems and providing diagnostics so engineers can get to the root cause faster).

Cloud platforms and ITSM offerings are evolving at a rapid pace. We have yet to see the newer concepts, powered by AI and ML, that will disrupt cloud operations and infrastructure management and ease the pain for engineers, letting them sleep peacefully at night without worrying every time a pager goes off!

You can also read the original post here.
