AIOPS Prediction for Server Stability Based on ARIMA Model


Pralhad P. Teggi

ITOM Product Professional Services Micro Focus Pvt Ltd.

Bangalore, India

Dr. Harivinod N.

Department of CSE

St Joseph Engineering College Mangalore, India

Dr. Bharathi Malakreddy Department of AI and ML BMSIT & M

Bangalore, India

Abstract – In today's competitive landscape, enterprises must constantly improve and speed up their IT operations to stay ahead. The application of AI and ML to IT operations enhances their efficiency while resolving IT issues quickly and reducing alert noise. AIOps is short for Artificial Intelligence for IT Operations and refers to multi-layered technology platforms that automate and enhance IT operations through analytics and machine learning (ML). One important use of AIOps is predictive alerting, which lets IT teams address potential problems before they lead to serious issues or outages. One such use case in IT infrastructure is predicting the stability of a server. The stability of a server depends on various performance metrics such as CPU, memory, disk (storage health), NIC (network health) and network services. IT teams can proactively work on a server when its stability is predicted to be low. To predict server stability, AIOps involves data collection, modelling and prediction. In this paper, we propose an AIOps-based framework for alert prediction for server stability. The framework includes data collection and preprocessing, time series modelling using ARIMA, and alert prediction. We implemented data collection, preprocessing and time series modelling, and we evaluate and compare the forecasts of Naïve, Average, and best-fitted ARIMA models. Event prediction is out of scope for this paper. For server stability, we have considered CPU utilization as the metric, and CPU utilization data was collected for a few data center servers.

  1. INTRODUCTION

Today's IT environment is highly dynamic, increasingly distributed, and heterogeneous, which makes the environment more complex. This complex IT environment poses challenges for IT operators performing tedious IT operations, which encompass all the activities involved in the setup, design, configuration, deployment, and maintenance of the IT infrastructure that supports business services. There are various products to monitor IT infrastructure at the hardware, service, and application levels. These monitoring products collect logs, metrics, events, and traces from distributed systems. The final goal of monitoring is to reach a level of technological development where tools conduct root cause analysis with high accuracy and enable us to autonomously recover systems. To achieve this goal, we still need to shift from a data collection stage to an insight- and action-driven paradigm.

There are many enterprise monitoring software products in the market, including Micro Focus Operations Bridge (MF OpsBridge), Dynatrace, Datadog, AppDynamics and Splunk Enterprise. Micro Focus offers a broad set of monitoring tools as part of its IT operations management (ITOM) portfolio. OpsBridge is an enterprise platform for ITOM that monitors the entire infrastructure, gathering raw operations data and consolidating it with data from existing tools (e.g., network and virtual machine management platforms). OpsBridge centralizes data from various data sources, providing a single source of ITOM data.

To show the health and performance of infrastructure and applications across the IT environment, hundreds of hourly and daily notifications and alarms are delivered by OpsBridge to IT operations teams. Figure 1 shows the MF OpsBridge operator console with the Event Browser, Health Status, Service Health and performance dashboard.

The Event Browser window displays an overview of the active events in the monitored IT environment. The details include the date and time when the event occurred, the host system (node) where the event occurred, the application that caused the event, the severity of the event, and the user responsible for solving the problem that caused the event, if assigned. The OpsBridge Health Status window displays the health information of the OpsBridge deployment; to ensure efficient operations, OpsBridge keeps track of the health of its components and reports problems so that corrective or preventive action can be taken. The Service Health window helps to recognize the impact of a problem on the organization's business services and applications: different levels of measurement build on and combine based on predefined rules to present a broad picture of system health, rooted in fine-grained data. The performance dashboard window visualizes performance metrics in the form of a performance dashboard.

    Figure 1 – Operations bridge operator console with Event Browser, Health Status, Service Health and performance dashboard windows in it. Source: [1]

The important features of OpsBridge with respect to event reduction are topology-based and stream-based event correlation, along with AIOps-based anomaly detection. By using these features effectively, event noise can be reduced to a certain extent. The event noise can be reduced further with a prediction framework that can efficiently and accurately predict future server utilization loads, peak loads, and their timing. Such a prediction framework can be built using AIOps methods.

Predictive alerting is one of the abilities of AIOps; it uses machine learning and analytics to identify abnormalities in operational data and predictively alert ITOps about abnormalities that could potentially impact an application or service. The prediction framework activates predictive alerting based on forecasted server utilization. The prediction framework has a prediction window, which can be short-term (i.e., hours) or long-term (i.e., a week). If the server utilization is high in the next prediction window, predictive events can be raised so that operators can proactively react to the situation and ensure the server utilization remains stable.

To design and build the prediction framework, the first step is data collection and pre-processing: fetching the unstructured data collected by OpsBridge for server utilization, processing it as time series, and storing it as big data. In the second step, AIOps-based time series forecasting methods are applied to predict the future server utilization. In this paper, our focus is on collecting and preprocessing the data, transforming it into time series, exploring the statistical characteristics of the time series, and then performing prediction of the time series using ARIMA methods. The final step is a comparative evaluation of the best-fitted ARIMA model against Naïve and Average models, deciding which method best suits the prediction framework.

  2. METHODOLOGY

The following figure 2 shows the high-level architecture of the predictive alerting AIOps pipeline, which includes (1) data collection and preprocessing, (2) time series modeling, and (3) predictive alerting. The following subsections describe these components and their working in detail.

    Figure. 2: High level architecture of prediction framework and predictive alerting.

    1. Data Collection

The MF OpsBridge collects huge amounts of data from various data sources and stores it in a data lake. For the system stability check, we extract only the CPU utilization metric data using the MF OpsBridge APIs. The extracted data is ingested into a local NoSQL DB for quick processing of the large volume of data. The MF OpsBridge collectors collect the CPU utilization data at 5-minute intervals. In our predictive alerting framework, we have a dynamically configurable window to decide the data collection interval. In our study, we set the interval to 15 minutes, so the predictive alerting framework extracts 3 CPU utilization samples every 15 minutes.

After data collection, the pre-processing step performs the following on the data, and actions are taken accordingly:

      • detecting N/A values

      • detecting gaps and

      • outlier analysis

        Affected instances can be either removed, replaced with interpolated values, or preserved depending on application needs.
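The N/A and gap handling above can be sketched with pandas on a toy 15-minute series (the series, its values, and the column layout are illustrative, not the actual OpsBridge extract):

```python
import pandas as pd
import numpy as np

# Toy 15-minute CPU-utilization series with one missing value and one gap
idx = pd.date_range("2021-01-01 00:00", periods=8, freq="15min")
cpu = pd.Series([21.0, 22.5, np.nan, 24.0, 23.5, 25.0, 24.5, 26.0], index=idx)
cpu = cpu.drop(idx[5])          # simulate a gap in the time index

# 1) detect N/A values
na_count = int(cpu.isna().sum())

# 2) detect gaps: compare against the expected 15-minute grid
full_idx = pd.date_range(cpu.index.min(), cpu.index.max(), freq="15min")
gaps = full_idx.difference(cpu.index)

# 3) repair: reindex to the full grid, then linearly interpolate NAs and gaps
clean = cpu.reindex(full_idx).interpolate(method="linear")

print(na_count, len(gaps), int(clean.isna().sum()))
```

Depending on application needs, the affected instances could instead be dropped or kept as-is rather than interpolated.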

    2. ARIMA

In autoregressive models, forecasting of a variable of interest is done using a linear combination of the past values of the variable. If there is some correlation between the present value and the past values of the variable, then autoregressive models are a good choice for forecasting. The autoregressive model of order p can be written as

$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t$$

where $\varepsilon_t$ is the white noise and $y_{t-1}, \ldots, y_{t-p}$ are the lagged values of the time series $y$.

In autoregressive models, the past values of a variable are used to forecast it; if past forecast errors are used instead, the approach is called a moving average model. The moving average model of order q can be written as

$$y_t = c + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \cdots + \theta_q \varepsilon_{t-q}$$

where $\varepsilon_t$ is the white noise and $\varepsilon_{t-1}, \ldots, \varepsilon_{t-q}$ are the lagged forecast errors of the time series $y$.

When we combine differencing with the autoregressive and moving average models, we get the ARIMA model. ARIMA stands for Autoregressive Integrated Moving Average. The ARIMA model of order (p, d, q) can be written as

$$y'_t = c + \phi_1 y'_{t-1} + \cdots + \phi_p y'_{t-p} + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q} + \varepsilon_t$$

where $y'_t$ is the differenced time series, and the model includes both the lagged values of $y'_t$ and the lagged errors. The orders p, d and q refer to the following:

• p refers to the number of autoregressive terms

• d refers to the degree of differencing

• q refers to the number of moving average terms
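To make the AR(p) form concrete, this illustrative NumPy sketch (not part of the paper's pipeline) simulates an AR(2) process and recovers its coefficients by ordinary least squares, the same estimation idea used later for the tentative models:

```python
import numpy as np

rng = np.random.default_rng(0)
phi1, phi2, c = 0.5, 0.3, 1.0          # true AR(2) coefficients (chosen stationary)

# Simulate y_t = c + phi1*y_{t-1} + phi2*y_{t-2} + eps_t
n = 5000
y = np.zeros(n)
eps = rng.normal(0.0, 1.0, n)
for t in range(2, n):
    y[t] = c + phi1 * y[t - 1] + phi2 * y[t - 2] + eps[t]

# OLS: regress y_t on [1, y_{t-1}, y_{t-2}]
X = np.column_stack([np.ones(n - 2), y[1:-1], y[:-2]])
beta, *_ = np.linalg.lstsq(X, y[2:], rcond=None)
c_hat, phi1_hat, phi2_hat = beta
print(round(phi1_hat, 2), round(phi2_hat, 2))
```

With 5000 observations, the OLS estimates land close to the true coefficients, which is the correlation-with-lagged-values property the AR model relies on.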

    3. Building the ARIMA Model

To build an ARIMA model, the time series must be stationary. If the time series is not stationary, differencing is carried out on the series to make it stationary. To test stationarity, the Dickey-Fuller test is used to detect whether the time series is stationary or not. As a first step, we collect the server CPU utilization data for a few selected servers in the data center and, after preprocessing, mark them as the original series.

$$y(t) = \{y_1, y_2, y_3, \ldots, y_N\}$$

Figure 4 – Server 1 CPU Utilization TS analysis results: (A) original TS, (B) TS with marked anomalies, (C) ACF and PACF plots along with quantile and histogram, (D) TS decomposition.

Figure 5 – Server 2 CPU Utilization TS analysis results: (A) original TS, (B) TS with marked anomalies, (C) ACF and PACF plots along with quantile and histogram, (D) TS decomposition.

Figure 6 – Server 3 CPU Utilization TS analysis results: (A) original TS, (B) TS with marked anomalies, (C) ACF and PACF plots along with quantile and histogram, (D) TS decomposition.

Figure 7 – Server 4 CPU Utilization TS analysis results: (A) original TS, (B) TS with marked anomalies, (C) ACF and PACF plots along with quantile and histogram, (D) TS decomposition.

TABLE I. SUMMARY OF THE TIME SERIES ANALYSIS

| Server   | Trend | Seasonality | Stationary | Total Records | ARIMA Order | SARIMA Order       |
|----------|-------|-------------|------------|---------------|-------------|--------------------|
| Server 1 | Yes   | No          | No         | 773           | (2,1,2)     | (2,1,2)(0,0,0)[24] |
| Server 2 | Yes   | No          | No         | 1129          | (2,1,3)     | (2,1,3)(0,0,0)[24] |
| Server 3 | Yes   | Yes         | Yes        | 1129          | (2,0,2)     | (2,0,2)(1,0,1)[24] |
| Server 4 | Yes   | No          | Yes        | 1129          | (2,0,2)     | (2,0,2)(1,0,0)[24] |

Perform preprocessing of the original time series to ensure the series is smoothed. Calculate the ACF (auto-correlation function) coefficients and the PACF (partial auto-correlation function) coefficients. Estimate all tentative model orders (p, d and q) using the ACF and PACF values. For each tentative model, estimate the model parameters using the OLS method, then compute the sigma square, adjusted R-square, AIC, SBIC and the number of significant coefficients. Choose the best model: the one with the lowest AIC, SBIC and sigma square and the highest adjusted R-square.

3. RESULTS AND DISCUSSIONS

    1. Data Analysis

The MF OpsBridge collects huge amounts of data from various data sources and stores it in a data lake. For our study, we extract only the CPU utilization metric data for a few data center servers. On these servers, the CPU utilization data is generated at five-minute intervals. The collected data spans two months, and it needs preprocessing as it contains missing and inconsistent data. The processing step involves cleaning and exploratory data analysis (EDA). EDA is geared towards understanding the data through exploratory visualizations (histograms, box plots, scatter plots, etc.) and correlation analysis, which bring out relationships and other patterns. The cleaning step can include (1) detecting N/A values, (2) detecting gaps and (3) outlier analysis. Affected instances may be removed, replaced with interpolated values, or preserved, depending on application needs.

Before time series modelling, the following preprocessing steps were carried out on the collected time series.

1. Detecting and treating missing values in time series – Time series models require that there be no gaps in the data along the time index, so simply omitting observations with missing values (and re-indexing as if there were no gaps) is not an option, as it might be for non-time-indexed data. Instead, the missing values need to be replaced with judiciously chosen values before fitting a model. Here, we used interpolation to estimate the missing values. One of the simplest methods is linear interpolation, which requires knowledge of two points and the constant rate of change between them.

2. Identifying outliers in time series – Outliers are extreme observations relative to the rest of the data. Outliers can corrupt model estimates and consequently result in less accurate predictions. The isolation forest approach is used to detect anomalies in the time series; it identifies anomalies by isolating outliers in the data. The contamination parameter refers to the expected proportion of outliers in the data set and is set to 0.01.

3. Resampling the time series to hourly averages – A common solution to handle imbalanced time series is to resample the data. Resampling strategies change the distribution of the learning data in order to balance the number of rare and normal cases, attempting to reduce the skewness of the data [3]. So, after fixing missing data and outliers, all the time series were resampled to hourly averages.

4. Identifying trend, seasonality and correlation in time series – The most important EDA on time series data is identifying trend, seasonality and correlation. To inspect each time series for trend and seasonality, the series is decomposed into three components: trend, seasonality and noise. To inspect correlation, the ACF (Auto-Correlation Function) and PACF (Partial Auto-Correlation Function) are used. We set lags = 50 to compute the ACF and PACF for all the time series.

5. Evaluating stationarity of time series – To test the stationarity of the time series, we used the Augmented Dickey-Fuller (ADF) test. The null hypothesis (presence of a unit root) is rejected if the ADF statistic is less than the critical value at the 1% level, in which case the time series is stationary; otherwise, it is non-stationary.

Figures 4 to 7 provide the EDA results for the CPU utilization time series corresponding to the four servers in the data center. Table I provides the summary of the time series analysis, derived from the output results in figures 4 to 7. We divided each time series into two sets of data, train data and test data, following a 95% train : 5% test ratio. The train data is used to create a model and forecast, and the test data is used to validate the forecast.
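The 95:5 chronological split can be sketched as below (the exact cut-off rounding is an assumption, since the paper does not state how fractional indices were handled):

```python
import numpy as np

def train_test_split_ts(series, test_frac=0.05):
    """Chronological split: the last `test_frac` of points form the test set."""
    series = np.asarray(series)
    cut = int(len(series) * (1 - test_frac))
    return series[:cut], series[cut:]

y = np.arange(1129)                # same length as servers 2-4 in Table I
train, test = train_test_split_ts(y, test_frac=0.05)
print(len(train), len(test))       # 1072 train points, 57 test points
```

Unlike a random split, the test set must be the most recent block of observations, since the model forecasts forward in time.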

    2. Modelling

Once the time series trend and seasonality components are analyzed in the data analysis step, checking the stationarity of the time series is very important, as ARIMA modelling applies to stationary time series. When the series is not stationary, time series differencing or detrending is applied. For seasonal data, seasonal differencing may be required in addition to non-seasonal differencing.

After time series differencing and ensuring the series is stationary, the optimal AR and MA terms are decided. Using the ACF and PACF plots, we get a range within which the parameters p, P, q and Q should lie.

      1. Modelling TS – Server 1 CPU Utilization

ARIMA – Performed the ADF test for checking the series stationarity; it implied that the series is non-stationary (P = 0.144). After one difference, the time series became stationary (ADF test P = 0.01). Using the auto.arima function with d = 1, the optimal model was selected as ARIMA(2,1,2) with the lowest AICc of -1511.544. The residual analysis was performed on the fitted model residuals. The Ljung-Box test of the residuals demonstrated no autocorrelation in the residuals. The Shapiro-Wilk and Jarque-Bera tests were performed to check the normality of the residual distribution; the tests fail to reject the null hypothesis (H0: the data are normally distributed), indicating that a certain portion of the residual is still not captured by the ARIMA model.
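The residual diagnostics described above can be sketched as follows on white-noise stand-in residuals. For transparency, the Ljung-Box Q statistic is computed directly from its definition rather than through a library routine (a real pipeline would typically call one), and SciPy supplies the Shapiro-Wilk and Jarque-Bera tests:

```python
import numpy as np
from scipy import stats

def ljung_box(resid, lags=20):
    """Ljung-Box Q statistic and p-value (H0: no autocorrelation up to `lags`)."""
    resid = np.asarray(resid, dtype=float) - np.mean(resid)
    n = len(resid)
    denom = np.dot(resid, resid)
    r = np.array([np.dot(resid[:-k], resid[k:]) / denom
                  for k in range(1, lags + 1)])
    q = n * (n + 2) * np.sum(r**2 / (n - np.arange(1, lags + 1)))
    return q, 1.0 - stats.chi2.cdf(q, df=lags)

rng = np.random.default_rng(5)
resid = rng.normal(0.0, 1.0, 500)        # stand-in for fitted-model residuals

q, lb_p = ljung_box(resid)               # autocorrelation check
sw_stat, sw_p = stats.shapiro(resid)     # normality check 1
jb_stat, jb_p = stats.jarque_bera(resid) # normality check 2
print(round(q, 1), round(lb_p, 3), round(sw_p, 3), round(jb_p, 3))
```

A strongly autocorrelated series (e.g. a cumulative sum of these residuals) drives the Ljung-Box p-value to essentially zero, which is how remaining structure in ARIMA residuals is detected.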

SARIMA – There is no pronounced seasonality component in the time series. The best fitted SARIMA model remains the same as ARIMA(2,1,2).

      2. Modelling TS – Server 2 CPU Utilization

ARIMA – Performed the ADF test for checking the series stationarity; it implied that the series is non-stationary (P = 0.445). After one difference, the time series became stationary (ADF test P = 0.01). Using the auto.arima function with d = 1, the optimal model was selected as ARIMA(2,1,3) with the lowest AICc of -456.957. The residual analysis was performed on the fitted model residuals. The Ljung-Box test of the residuals demonstrated no autocorrelation in the residuals: the test statistic is 22.992274 and the p-value is 0.2891744, which is greater than 0.05, so we fail to reject the null hypothesis of the test and conclude that the residuals are independent. The Shapiro-Wilk and Jarque-Bera tests were performed to check the normality of the residual distribution; the tests fail to reject the null hypothesis (H0: the data are normally distributed), indicating that a certain portion of the residual is still not captured by the ARIMA model.

SARIMA – There is no pronounced seasonality component in the time series. The best fitted SARIMA model remains the same as ARIMA(2,1,3).

      3. Modelling TS – Server 3 CPU Utilization

ARIMA – Performed the ADF test for checking the series stationarity; it implied that the series is stationary (P = 0.043). Using the auto.arima function with d = 0, the optimal model was selected as ARIMA(1,0,2) with the lowest AICc of -457.512. The residual analysis was performed on the fitted model residuals. The Ljung-Box test of the residuals demonstrated no autocorrelation in the residuals. The Shapiro-Wilk and Jarque-Bera tests were performed to check the normality of the residual distribution; the tests fail to reject the null hypothesis (H0: the data are normally distributed), indicating that a certain portion of the residual is still not captured by the ARIMA model. Inspecting the ACF/PACF of the residuals revealed spikes at seasonal periods such as lag = 24.

SARIMA – The auto.arima function with d = 0 and D = 0 selected SARIMA(2,0,1)(1,0,1)[24] as the optimal model, with the lowest AICc of -507.926. The Ljung-Box test of the residuals, whose p-value is 0.094022, demonstrated no autocorrelation in the residuals. The Shapiro-Wilk and Jarque-Bera tests were performed to check the normality of the residual distribution; the tests fail to reject the null hypothesis, and the residual distribution is centred around the mean (0.002).

4. Modelling TS – Server 4 CPU Utilization

ARIMA – Performed the ADF test for checking the series stationarity; it implied that the series is stationary (P = 0.01). Using the auto.arima function with d = 0, the optimal model was selected as ARIMA(2,0,2) with the lowest AICc of -457.512. The Ljung-Box test of the residuals, whose p-value is 0.601513, demonstrated no autocorrelation in the residuals. The Shapiro-Wilk and Jarque-Bera tests were performed to check the normality of the residual distribution; the tests fail to reject the null hypothesis (H0: the data are normally distributed), indicating that a certain portion of the residual is still not captured by the ARIMA model. Inspecting the ACF/PACF of the residuals and of the squared residuals showed spikes at lag = 2, indicating a heteroskedasticity effect in the time series.

SARIMA – There is no pronounced seasonality component in the time series. The best fitted SARIMA model remains the same as ARIMA(2,0,2).

3. Forecasting

All the evaluated and finalized models were used to forecast the future values of the CPU utilization time series, and the forecasts were used to evaluate the accuracy of the models with the metrics below. Figures 8 to 11 provide the forecasting plots of the Naive, Average, ARIMA and SARIMA models applied to the CPU utilization time series of the 4 servers in the data center.

Evaluation Metrics – There are several performance metrics to measure the accuracy of forecasting methods. With $A_t$ the actual value, $F_t$ the forecasted value, and $n$ the number of observations, the important metrics we have considered to evaluate the performance of the four models are:

i) Mean Absolute Percentage Error (MAPE) – a measure of the prediction accuracy of a forecasting method, in percentage. A forecast with a MAPE value below 10% is observed as highly accurate, a value between 10% and 20% is considered good, 21% to 50% is supposed to be reasonable, and a value greater than 50% is considered inaccurate forecasting [2].

$$\mathrm{MAPE} = \frac{100}{n}\sum_{t=1}^{n}\left|\frac{A_t - F_t}{A_t}\right|$$

v) Root Mean Square Error (RMSE) – the standard deviation of the prediction error.

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(A_t - F_t\right)^2}$$

(The remaining metrics, ii) ME, iii) MAE and iv) MPE, are defined alongside the tables below.)

Forecasting results – Tables II, III, IV and V provide the summary of the forecasting results. In most cases no seasonality behavior exists, as the workload on these servers is very dynamic, so a number of data transformations were needed to build a decent forecasting model. Only in the case of server 3 does a seasonality component exist, and there the SARIMA model provides better predictions than ARIMA: the SARIMA model's MAPE value is 0.104, which is less than that of the ARIMA model.
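The evaluation metrics can be collected into one small helper that follows the definitions in this section (`forecast_metrics` and the toy inputs are illustrative names/values, not the paper's data):

```python
import numpy as np

def forecast_metrics(actual, forecast):
    """MAPE, ME, MAE, MPE and RMSE, as defined in the evaluation-metrics list."""
    a = np.asarray(actual, dtype=float)
    f = np.asarray(forecast, dtype=float)
    err = a - f
    return {
        "MAPE": 100.0 * np.mean(np.abs(err / a)),   # percent, absolute
        "ME":   np.mean(err),                       # signed bias
        "MAE":  np.mean(np.abs(err)),
        "MPE":  100.0 * np.mean(err / a),           # percent, signed
        "RMSE": np.sqrt(np.mean(err**2)),
    }

m = forecast_metrics([100, 200, 400], [110, 190, 400])
print({k: round(v, 3) for k, v in m.items()})
```

Note that ME and MPE are signed, so over- and under-forecasts can cancel out; MAE, MAPE and RMSE do not cancel, which is why the tables report all five.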

  4. CONCLUSION AND FUTURE WORK

We developed ARIMA and SARIMA models for CPU utilization forecasting for data center servers in an IT environment using the Box-Jenkins time series approach. The historical CPU utilization data were used to develop several models, and the adequate one was selected according to performance criteria such as AIC, SBC, standard error, and maximum likelihood. CPU usage is dynamic, and these fluctuations in the CPU usage of data center servers affect the accuracy of time series forecasting models. In this study, ARIMA models with the first 2 AR terms and the first 2 MA terms performed better without overfitting the data.

The results obtained show that these models can be used for modeling and forecasting the CPU utilization of servers in an IT environment; these results will help IT operators make proactive decisions. As future work, we will develop other models using a combination of qualitative and quantitative techniques to generate reliable forecasts and increase forecast accuracy. We will also try a neural network approach and compare it with ARIMA's results to assess the ANN's strength.

TABLE II – PERFORMANCE EVALUATION RESULTS FOR CPU UTILIZATION OF SERVER 1

| Model                     | MAPE  | ME     | MAE   | MPE    | RMSE  |
|---------------------------|-------|--------|-------|--------|-------|
| Naive                     | 0.193 | 0.145  | 0.153 | 0.185  | 0.210 |
| Average                   | 0.171 | -0.116 | 0.169 | -0.096 | 0.191 |
| ARIMA(2,1,2)              | 0.178 | 0.124  | 0.142 | 0.163  | 0.196 |
| SARIMA (2,1,2)(0,0,0)[24] | 0.178 | 0.124  | 0.141 | 0.162  | 0.196 |

ii) Mean Error (ME) – the average of the differences between the actual and forecasted values.

$$\mathrm{ME} = \frac{1}{n}\sum_{t=1}^{n}\left(A_t - F_t\right)$$

TABLE III – PERFORMANCE EVALUATION RESULTS FOR CPU UTILIZATION OF SERVER 2

| Model                     | MAPE  | ME     | MAE    | MPE    | RMSE  |
|---------------------------|-------|--------|--------|--------|-------|
| Naive                     | 0.035 | -0.051 | 0.051  | -0.035 | 0.061 |
| Average                   | 0.020 | 0.015  | 0.0294 | 0.011  | 0.036 |
| ARIMA(2,1,3)              | 0.017 | -0.014 | 0.025  | -0.009 | 0.035 |
| SARIMA (2,1,3)(0,0,0)[24] | 0.017 | -0.014 | 0.025  | -0.009 | 0.035 |

iii) Mean Absolute Error (MAE) – the average of the absolute values of the differences between the actual and forecasted values.

$$\mathrm{MAE} = \frac{1}{n}\sum_{t=1}^{n}\left|A_t - F_t\right|$$

iv) Mean Percentage Error (MPE) – the average of the percent differences between the actual and forecasted values.

$$\mathrm{MPE} = \frac{100}{n}\sum_{t=1}^{n}\frac{A_t - F_t}{A_t}$$

TABLE IV – PERFORMANCE EVALUATION RESULTS FOR CPU UTILIZATION OF SERVER 3

| Model                     | MAPE  | ME     | MAE   | MPE    | RMSE  |
|---------------------------|-------|--------|-------|--------|-------|
| Naive                     | 0.074 | 0.268  | 0.379 | 0.055  | 0.539 |
| Average                   | 0.129 | -0.696 | 0.752 | -0.117 | 0.838 |
| ARIMA(2,0,2)              | 0.110 | -0.573 | 0.640 | -0.096 | 0.722 |
| SARIMA (2,0,1)(1,0,1)[24] | 0.104 | -0.495 | 0.596 | -0.082 | 0.665 |

TABLE V – PERFORMANCE EVALUATION RESULTS FOR CPU UTILIZATION OF SERVER 4

| Model                     | MAPE  | ME     | MAE   | MPE    | RMSE  |
|---------------------------|-------|--------|-------|--------|-------|
| Naive                     | 0.293 | -0.918 | 3.305 | 0.043  | 4.146 |
| Average                   | 0.280 | -1.408 | 3.310 | -0.003 | 4.281 |
| ARIMA(2,0,2)              | 0.291 | -1.173 | 3.379 | 0.023  | 4.3   |
| SARIMA (2,0,2)(0,0,0)[24] | 0.291 | -1.173 | 3.379 | 0.023  | 4.3   |

Figure 8 – Server 1 CPU Utilization forecasting plots: (A) Naive model, (B) Average model, (C) ARIMA(2,1,2) model.

Figure 9 – Server 2 CPU Utilization forecasting plots: (A) Naive model, (B) Average model, (C) ARIMA(2,1,3) model.

Figure 10 – Server 3 CPU Utilization forecasting plots: (A) Naive model, (B) Average model, (C) ARIMA(2,0,2) model.

Figure 11 – Server 4 CPU Utilization forecasting plots: (A) Naive model, (B) Average model, (C) ARIMA(2,0,2) model.

REFERENCES

  1. S. Gupta and D. A. Dinesh, "Resource usage prediction of cloud workloads using deep bidirectional long short-term memory networks," in 2017 IEEE International Conference on Advanced Networks and Telecommunications Systems (ANTS), Bhubaneswar: IEEE, 2017, pp. 1-6.

  2. J. J. M. Moreno, A. P. Pol, A. S. Abad, and B. C. Blasco, "Using the R-MAPE index as a resistant measure of forecast accuracy," Psicothema, vol. 25, no. 4, pp. 500-506, 2013.

  3. N. Moniz, P. Branco, and L. Torgo, "Resampling strategies for imbalanced time series," DOI: 10.1109/DSAA.2016.37, October 2016.

  4. Digital transformation and enterprise software modernization: Micro Focus. [Online]. Available: https://www.microfocus.com/enus/home
