The Saturated-Server Challenge!

Data Science

Challenge designed by: Siddartha Kshirsagar

Maturity level: Amateur | Coding required: No | Challenge starts on: Fri, Oct. 14th | Challenge ends on: Sun, Nov. 6th

A tech company has recently been struggling with its tech-infrastructure management. Some of its servers are behaving erratically with respect to their CPU and memory usage. It is important for the organization to understand and predict its servers’ behaviour; otherwise, applications on those servers will crash, resulting in huge losses for the organization. The organization is seeking your help to use your Data Science acumen and gather some insights. To craft a solution, you’ll have to complete the following 3 tasks:

Task 1: Create groups of servers with similar resource utilization

You first need to understand how each server is consuming its CPU and Memory. To do that, you need to perform the following activities:

  • You have been given a time series of CPU and Memory utilization for each server. Compute a representative CPU and Memory utilization value for each server (Hint: mean values).
  • Next, group these servers based on the similarity of their CPU and Memory utilization. We are looking for 3 clusters of servers (Hint: create a CPU-Memory scatter plot of the 50 pairs of data points and then apply hierarchical clustering).
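
The two activities above can be sketched in plain Python as follows. This is only an illustration: the server names and readings are made up, and average linkage is an assumption, since the challenge does not say which linkage method to use. A library routine such as SciPy's hierarchical clustering would normally replace the hand-written merge loop.

```python
from itertools import combinations
from math import dist
from statistics import mean

# Hypothetical daily readings: server -> list of (CPU %, Memory %) pairs.
readings = {
    "srv-a": [(10, 20), (14, 22)],
    "srv-b": [(12, 18), (16, 24)],
    "srv-c": [(48, 52), (52, 50)],
    "srv-d": [(50, 55), (54, 51)],
    "srv-e": [(85, 80), (89, 84)],
    "srv-f": [(88, 86), (90, 82)],
}

def server_means(readings):
    """Collapse each server's time series to one (mean CPU, mean Memory) point."""
    return {
        name: (mean(cpu for cpu, _ in rows), mean(mem for _, mem in rows))
        for name, rows in readings.items()
    }

def average_linkage(points, a, b):
    """Mean pairwise Euclidean distance between the members of clusters a and b."""
    return mean(dist(points[i], points[j]) for i in a for j in b)

def agglomerate(points, k):
    """Start with one cluster per server and merge the closest pair until k remain."""
    clusters = [[name] for name in points]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_linkage(points, clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] += clusters.pop(j)  # j > i, so index i stays valid
    return clusters

def centroid(points, cluster):
    """X-Y centroid of a cluster: the mean of its members' coordinates."""
    return (mean(points[n][0] for n in cluster), mean(points[n][1] for n in cluster))

points = server_means(readings)
clusters = agglomerate(points, k=3)
for members in clusters:
    print(sorted(members), centroid(points, members))
```

The printed lines give both the server-to-cluster mapping and the centroid of each cluster for the toy data.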

What to submit:

  • X-Y values of the centroids of the 3 clusters
  • Mapping of servers to clusters
  • Dunn index of the clusters
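
For the Dunn index, a minimal sketch is shown below. Several variants of the index exist; this one assumes the common definition of smallest distance between points in different clusters divided by the largest distance between points in the same cluster. The example points are hypothetical.

```python
from itertools import combinations
from math import dist

def dunn_index(clusters):
    """clusters: list of lists of (x, y) points.

    At least one cluster must hold two distinct points,
    otherwise the largest diameter is zero or undefined.
    """
    # Smallest distance between two points that sit in different clusters.
    min_separation = min(
        dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    # Largest distance between two points that share a cluster (the diameter).
    max_diameter = max(
        dist(p, q)
        for cluster in clusters
        for p, q in combinations(cluster, 2)
    )
    return min_separation / max_diameter

# Hypothetical clusters: each has diameter 1, and the closest pair of
# points from different clusters is 5 apart.
example = [
    [(0, 0), (0, 1)],
    [(5, 0), (5, 1)],
    [(0, 6), (1, 6)],
]
print(dunn_index(example))
```

A higher value indicates compact clusters that are well separated from each other.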

Task 2: Find the servers nearing saturation

As we know, mean values are often misleading. Many servers with otherwise harmless-looking mean values contain risky trends and patterns. The next task is to look for such servers. To do that, you need to perform the following activities:

  • Select the cluster of servers with moderate CPU utilization (between 40%-60%).
  • Build a linear regression equation between CPU utilization and DOM (Day of Month) for each of these servers.
  • Use the regression equations to find the servers with increasing trends in CPU utilization and predict their expected CPU utilization on Sep 30th.
  • Identify the servers that are expected to cross the 90% mark by Sep 30th.
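
The regression steps above could look roughly like this. The sketch assumes ordinary least squares with day of month as the only predictor; the server names and CPU values are invented for illustration.

```python
from statistics import mean

def fit_line(days, cpu):
    """Ordinary least squares for cpu = slope * day + intercept."""
    day_bar, cpu_bar = mean(days), mean(cpu)
    sxx = sum((d - day_bar) ** 2 for d in days)
    sxy = sum((d - day_bar) * (c - cpu_bar) for d, c in zip(days, cpu))
    slope = sxy / sxx
    return slope, cpu_bar - slope * day_bar

def likely_to_saturate(series, target_day=30, threshold=90.0):
    """Return {server: (slope, intercept, predicted)} for servers with a rising
    trend whose fitted line exceeds the threshold on target_day."""
    flagged = {}
    for server, (days, cpu) in series.items():
        slope, intercept = fit_line(days, cpu)
        predicted = slope * target_day + intercept
        if slope > 0 and predicted > threshold:
            flagged[server] = (slope, intercept, predicted)
    return flagged

# Hypothetical 10-day series (Sep 1st to Sep 10th) for three servers.
days = list(range(1, 11))
series = {
    "srv-rising": (days, [40 + 2 * d for d in days]),
    "srv-flat": (days, [50] * 10),
    "srv-slow": (days, [45 + 0.5 * d for d in days]),
}

for server, (slope, intercept, predicted) in likely_to_saturate(series).items():
    print(f"{server}: cpu = {slope:.2f} * day + {intercept:.2f}, day 30 -> {predicted:.1f}%")
```

Note that a straight-line extrapolation can predict values above 100%; here that simply means the server is expected to cross the 90% mark before Sep 30th.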

What to submit:

  • The list of servers that are likely to cross the 90% mark by Sep 30th
  • The regression equation for only those servers that are likely to cross the 90% mark by Sep 30th
  • The r-squared value for each of these equations (Hint: use the regression equation to predict the CPU values for the first 10 days and calculate r-squared by comparing these predicted values against the actual given values).
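
The hint above can be sketched as follows, assuming the usual definition of r-squared as one minus the ratio of the residual sum of squares to the total sum of squares. The equation and the observed values are hypothetical.

```python
from statistics import mean

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    actual_bar = mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - actual_bar) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical equation cpu = 2.0 * day + 40.0 compared against
# hypothetical observed CPU values for Sep 1st to Sep 10th.
days = list(range(1, 11))
actual = [43, 43, 47, 47, 51, 51, 55, 55, 59, 59]
predicted = [2.0 * d + 40.0 for d in days]
print(round(r_squared(actual, predicted), 4))
```

A value close to 1 means the line explains most of the day-to-day variation in CPU utilization.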

Task 3: Characterizing the servers nearing saturation

Now that we have identified the servers nearing saturation, the natural next question is to understand what makes these servers different from the other servers with similar average CPU and Memory utilization. This can then help us better understand the server properties that lead to saturation. To do that, you need to perform the following activities:

  • Consider only those servers that belong to the cluster with moderate CPU utilization, and prepare an additional Boolean attribute – “likely to saturate” based on the output of Task 2.
  • Feel free to derive additional attributes by splitting existing columns or combining multiple columns. (Hint: this step will help you derive a more accurate classification).
  • Create the classification tree with the “likely to saturate” attribute as the target attribute.
  • Identify the properties of servers that separate the servers that are likely to saturate from the other servers.
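
To illustrate how a classification tree picks its decision boxes, the sketch below finds the single best split for the root node by Gini impurity. It is a simplified, CART-style stand-in rather than a full tree builder: a real tree repeats this search on each branch, and a library such as scikit-learn would normally do that work. The rows and the attribute values are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the list is pure)."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def best_split(rows, target):
    """Find the 'attribute == value' test with the lowest weighted Gini impurity.

    Returns (score, attribute, value), or None if no test separates the rows.
    """
    best = None
    attributes = [k for k in rows[0] if k != target]
    for attr in attributes:
        for value in sorted({r[attr] for r in rows}):
            left = [r[target] for r in rows if r[attr] == value]
            right = [r[target] for r in rows if r[attr] != value]
            if not left or not right:
                continue  # this test does not separate anything
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, attr, value)
    return best

# Hypothetical servers from the moderate-CPU cluster. Server Name is left out
# because a unique identifier would split the data trivially.
rows = [
    {"OS": "Linux", "Environment": "Prod", "likely_to_saturate": True},
    {"OS": "Linux", "Environment": "Prod", "likely_to_saturate": True},
    {"OS": "Windows", "Environment": "Prod", "likely_to_saturate": True},
    {"OS": "Linux", "Environment": "Dev", "likely_to_saturate": False},
    {"OS": "Windows", "Environment": "Dev", "likely_to_saturate": False},
    {"OS": "Windows", "Environment": "Test", "likely_to_saturate": False},
]
print(best_split(rows, "likely_to_saturate"))
```

In this toy data the Environment attribute separates the two classes perfectly, which is the kind of distinguishing property the last activity asks you to identify.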

What to submit:

  1. The image of the classification tree with clearly defined decision boxes and leaf boxes, and the count of records on each edge
  2. The confusion matrix (for this, do not split the data into training and test sets. Use the entire dataset for training, and use the same data to build the confusion matrix)
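
A confusion matrix for the Boolean “likely to saturate” target can be tallied as in the sketch below. The actual and predicted labels are hypothetical; in the challenge, the predictions would come from the tree evaluated on the same data it was trained on.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count (actual, predicted) pairs for a Boolean target."""
    counts = Counter(zip(actual, predicted))
    return {
        "TP": counts[(True, True)],    # saturating servers flagged correctly
        "FN": counts[(True, False)],   # saturating servers that were missed
        "FP": counts[(False, True)],   # healthy servers flagged by mistake
        "TN": counts[(False, False)],  # healthy servers left unflagged
    }

# Hypothetical labels for six servers.
actual = [True, True, True, False, False, False]
predicted = [True, True, False, False, False, True]
print(confusion_matrix(actual, predicted))
```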

We are providing you with the following two datasets:

Server metrics data:

  • This data contains time-series of historical data of CPU and memory utilization of all servers
  • Each server has 10 days of data from Sep 1st to Sep 10th
  • Data is at one-day granularity

Server attributes data:

  • This data contains different attributes of each server
  • It has 8 columns: Server Name, Model, OS, Country, Application, Owner, Environment, and BU (Business Unit)

Submission and Evaluation:

  • Refer to the above sample output file for output format.
  • Submit a single PDF file with your answers for all tasks.
  • Each subtask has some points associated with it. The closer you get to the optimal results, the more points you get.
  • You can make multiple submissions. We will consider your most recent submission for assessment.
  • The last date to submit your answers is Sunday, Nov. 6th, midnight IST.

What is the deadline to submit the solution?

You can submit your solution by Sunday, Nov. 6th, midnight IST.

Can I submit partial results?

Yes. The challenge involves multiple tasks and steps. Your submission will be assessed based on the number of correctly attempted tasks.

Can I work with fellow associates to submit as a team?

No. The challenge is designed for individual submission.

Can I make multiple submissions?

Yes. You can make multiple submissions. But we will consider your most recent submission for assessment.
