Three 'Next Practices' That Leverage AI And Machine Learning

If you are a CIO, VP of IT operations or some other type of IT leader, you are under constant pressure to ensure that IT systems operate at maximum efficiency. Systems must meet increasing service-level expectations in terms of performance, availability and security. In fact, you're probably already anticipating that this challenge is only going to get bigger. After all, you must deal with skills shortages and are tasked with supporting a growing number of IT initiatives such as cloud migrations, digital transformation, M&A integrations and other strategic projects. To address these challenges, you need to think about leveraging "next practices," not best practices. Let me explain.

The pace of change in today’s increasingly digitalized business environment means that what has worked in the past (as codified by "best practices") increasingly will not work moving forward. This has given rise to the concept of next practices. Next practices do not focus on improving existing processes since existing processes are becoming increasingly obsolete due to transformative technologies. Instead, they deal with the best ways to rethink your processes for the future, leveraging transformative technologies like artificial intelligence (AI) and machine learning (ML) to make your processes smarter.

Let me give you three interconnected examples of how AI-powered next practices can be applied to the system incident detection and resolution process. If properly applied, they will address limitations in managing your current setup while transforming your process in a way that allows you to both meet service-level targets today and create scalability for tomorrow.

Managing Alert Fatigue

The first step in keeping your IT systems running is system monitoring. Today, your team gets an alert each time any one of your monitoring tools detects something that exceeds a threshold. Because your team relies on a hodgepodge of unintelligent, legacy monitoring tools to watch over your growing landscape of systems and solution stack layers, your team is inundated with alerts, the vast majority of which they probably ignore reflexively. This is despite the risk of dismissing consequential warnings or prematurely acting on the wrong alerts, either one of which can hurt team productivity and service levels.

Instead, today you have the opportunity to use AI/ML to intelligently detect anomalies so your team is only alerted when they need to attend to something important. AI and ML can be used to learn from and profile system behavior and automatically define and adjust dynamic thresholds that trigger notifications based on statistical deviations from expected behavior. AI can also be used to prioritize and score these anomalies, considering factors like anomaly magnitude, frequency and clustering. This allows your team to manage by exception, giving time and attention only to true anomalies that deserve investigation.
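As a rough illustration of what a dynamic threshold based on statistical deviation might look like, here is a minimal Python sketch. It is not any particular vendor's method: the function name `detect_anomalies`, the trailing-window size, the z-score cutoff of 3.0 and the sample CPU readings are all assumptions chosen for the example. It scores each new reading by how many standard deviations it sits from recent behavior, which covers only the magnitude factor mentioned above, not frequency or clustering.

```python
from statistics import mean, stdev


def detect_anomalies(values, window=5, z_threshold=3.0):
    """Flag readings that deviate strongly from the trailing window.

    Returns a list of (index, value, z_score) tuples. The window size and
    z_threshold are illustrative assumptions, not recommended settings.
    """
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Assumption: a perfectly flat history gives no scale to
            # measure deviation against, so this sketch skips the point.
            continue
        z = (values[i] - mu) / sigma
        if abs(z) > z_threshold:
            anomalies.append((i, values[i], round(z, 2)))
    return anomalies


# Hypothetical CPU utilization samples (percent), one per minute.
cpu_percent = [41, 43, 42, 44, 42, 43, 41, 95, 44, 42]

for index, value, z in detect_anomalies(cpu_percent):
    print(f"minute {index}: {value}% (z-score {z})")
```

Because the threshold is recomputed from recent history at every step, it moves with the system's normal behavior instead of staying fixed. One limitation of this simple version is that a spike stays in the trailing window for a few readings afterward and temporarily widens the threshold.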

Making Sense Of Your Data

Your team is also challenged in isolating the root cause of system performance problems, which is essential to applying fixes quickly and getting things running again. Root cause isolation is extraordinarily difficult today because there is no efficient and repeatable way to make sense of the huge volume and variety of system performance and behavior data coming in from your many monitoring tools. Also, your team’s expertise is spread across different organizational silos (including your outsourcing partners), so team members find themselves relying on tribal knowledge, which makes it hard to arrive quickly at a holistic view and have high confidence in the analysis.

In contrast, AI/ML can automatically make sense of your data to contextualize incident analysis, help isolate root causes quickly and reduce ongoing reliance on specialist tribal knowledge. AI/ML is well suited to narrowing down probable root causes by applying algorithms that determine metric correlation, incident co-occurrence and seasonality effects based on time series and log analysis. Further, it can be used to generate a limited, curated set of recommended remediation actions.
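To make the idea of metric correlation concrete, the following Python sketch ranks candidate metrics by how closely they move with a symptom metric. It is a simplified illustration under stated assumptions: the helper names `pearson` and `rank_candidates`, the metric names and all sample values are hypothetical, and plain Pearson correlation stands in for whatever algorithms a real product would use. It does not attempt the co-occurrence or seasonality analysis mentioned above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / sqrt(var_x * var_y)


def rank_candidates(symptom, candidates):
    """Sort candidate metrics by absolute correlation with the symptom."""
    scores = {name: pearson(symptom, series)
              for name, series in candidates.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Hypothetical samples taken at the same six points in time.
latency_ms = [120, 125, 118, 300, 310, 122]
candidates = {
    "db_connections": [40, 42, 39, 98, 99, 41],
    "disk_io_wait": [5, 6, 5, 6, 5, 6],
    "cache_hit_rate": [0.95, 0.94, 0.96, 0.93, 0.95, 0.94],
}

for name, score in rank_candidates(latency_ms, candidates):
    print(f"{name}: {score:+.2f}")
```

A high correlation only narrows the search; it does not prove that one metric caused the other, which is why the output is best treated as a shortlist for investigation.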

Eliminating Trial-And-Error Remediation

Finally, your team is forced to take a painstaking trial-and-error approach to resolve incidents. This is extremely time-consuming and doesn’t lend itself to continuous improvement and more effective and predictable outcomes via systematic learning. Instead, it’s feasible to combine AI/ML with collaboration technologies to remediate incidents proactively. The collaboration content (think chat messages that the team exchanges regarding incidents) can, in turn, be mined to enable closed-loop learning, ensuring that information about what worked and what didn’t is fed back into the knowledge base so that automated recommendations improve over time. It’s also not hard to believe that this can form the foundation for a future “self-healing” system.
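The closed-loop idea can be sketched in a few lines of Python. This is a deliberately simple illustration, not a description of any real product: the class `RemediationKnowledgeBase`, its method names, the incident and action labels, and the smoothed success-rate scoring are all assumptions made for the example. It also assumes outcomes have already been extracted from the team's chat messages, which is the hard part in practice.

```python
from collections import defaultdict


class RemediationKnowledgeBase:
    """Tracks which remediation actions worked for which incident types."""

    def __init__(self):
        self._stats = defaultdict(lambda: {"tried": 0, "worked": 0})

    def record_outcome(self, incident_type, action, worked):
        """Feed back the result of one remediation attempt."""
        entry = self._stats[(incident_type, action)]
        entry["tried"] += 1
        if worked:
            entry["worked"] += 1

    def recommend(self, incident_type, top_n=3):
        """Return up to top_n actions, best smoothed success rate first."""
        scored = []
        for (itype, action), s in self._stats.items():
            if itype == incident_type:
                # Add-one smoothing so a single lucky attempt does not
                # outrank an action with a longer track record.
                rate = (s["worked"] + 1) / (s["tried"] + 2)
                scored.append((action, rate))
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return [action for action, _ in scored[:top_n]]


# Hypothetical feedback gathered from past incidents.
kb = RemediationKnowledgeBase()
kb.record_outcome("db_timeout", "restart_connection_pool", worked=True)
kb.record_outcome("db_timeout", "restart_connection_pool", worked=True)
kb.record_outcome("db_timeout", "reboot_host", worked=False)

print(kb.recommend("db_timeout"))
```

Each recorded outcome shifts the ranking, so recommendations reflect what has actually worked rather than what the team remembers trying.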

Conclusion

Next practices, like the application of AI and ML technology to your mission-critical IT systems management processes, do not merely tune your current processes incrementally. They transform the way you run your IT operations and reset expectations about what is possible.

Forbes Technology Council is an invitation-only community for world-class CIOs, CTOs and technology executives.
