Sunday, February 13, 2011

Order out of Chaos


In stark certainty, we were wandering amidst a death which had reigned at least 500,000 years, and in all probability even longer.


- "At the Mountains of Madness", H.P. Lovecraft

Coding requires that we make decisions all the time, ranging from the small (what should I call this variable?) to the large (what framework should I use?). While sometimes the information we base our answers on is certain (the data we are given will never be null), more often we need to come up with an answer based on uncertain data (what SLA should we advertise for our system, given what we know about the SLAs of its various components?). Coming up with answers to the latter type of question may seem daunting, but it isn't impossible: there is a whole branch of mathematics that deals with exactly these situations, namely probability.

To move this away from the abstract, let's suppose you've just started work at the Corporate Cthulhu Containment Center and you are put in charge of improving the alerting system. When you ask about the system, you are told it exists because of the incredible job the Center has done logging and reporting errors in its various sub-systems. It turns out that 99% of these errors are minor and can wait for action (e.g., the hot water has quit working in Cultist cell 42), yet the remaining 1% need immediate response (e.g., the Cthonian force field is failing). So a system was put in place to alert people only when a critical event happens. The Center is proud of this system, stating that
  • 99% of the time a critical event occurs, an alert is created (i.e., the filtering is highly sensitive).
  • 99% of the time a non-critical event occurs, no alert is created (i.e., the filtering also has high specificity).
However, the on-call people who respond to the alerts feel that too many of the things they investigate are not critical and would like to see the system improved. Your first job is to answer the following question:
  • When an alert occurs, what is the likelihood that it was caused by a critical error?
You're also told that once you have an answer, you should then choose one of the following options for improving the alerting system:
  1. Increase the sensitivity to 99.9%
  2. Increase the specificity to 99.9%
  3. Decrease the percentage of critical events to 0.1%.
Answering the first question is simple if you've had a class in probability and remember Bayes' theorem. I'll let you read that article on your own; what I want to do here is derive the answer by looking at actual numbers. This will make the answer to the first question easier to understand, and it will also let us determine the answer to the second question before crunching the numbers.

Let's start by looking at a sample pool of 100,000 errors. From what the Center has told you, you can conclude that 1,000 of these should be critical and the remaining 99,000 non-critical. Of the critical errors, alerts are generated for 99% of them, i.e. 990 alerts come from critical errors. Similarly, the filtering system will not create an alert for 99% of the non-critical errors; however, this means there will be alerts for the other 1%, i.e. 990 alerts come from non-critical errors. To summarize:

                 Alert    No Alert    Row Subtotal
Critical           990          10           1,000
Non-critical       990      98,010          99,000
Column Total     1,980      98,020         100,000
Distribution of 100,000 errors
(Initial Alerting System)

So, overall there are 1,980 alerts, of which 990 are from critical errors. In other words, when an alert occurs there is only a 50% chance that it is due to a critical error. No wonder the on-call people are complaining.
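
If you'd rather have a computer do the arithmetic, here's a minimal Python sketch that reproduces the table above; the variable names are my own, purely for illustration, not anything from the Center's actual system.

```python
# Reproduce the baseline breakdown of a pool of 100,000 errors.
total_errors = 100000
critical_rate = 0.01    # 1% of all errors are critical
sensitivity = 0.99      # P(alert | critical error)
specificity = 0.99      # P(no alert | non-critical error)

critical = total_errors * critical_rate           # 1,000 critical errors
non_critical = total_errors - critical            # 99,000 non-critical errors
true_alerts = critical * sensitivity              # 990 alerts from critical errors
false_alerts = non_critical * (1 - specificity)   # 990 alerts from non-critical errors

p = true_alerts / (true_alerts + false_alerts)
print("P(critical | alert) = %.2f%%" % (100 * p))  # 50.00%
```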

It should be fairly obvious that the best way to improve the accuracy of the alerting system is to reduce the number of non-critical errors that generate alerts (relative to the number of critical errors that cause alerts). In other words, improve the specificity of the filtering system.

To confirm this, here are tables with the breakdown for each of the three proposed solutions.

                 Alert    No Alert    Row Subtotal
Critical           999           1           1,000
Non-critical       990      98,010          99,000
Column Total     1,989      98,011         100,000
Distribution of 100,000 errors
(Increased Sensitivity from 99% to 99.9%)
When an alert occurs, the chance that it was from a critical error is 50.22%. This is an improvement over 50%, but not much!

                 Alert    No Alert    Row Subtotal
Critical           990          10           1,000
Non-critical        99      98,901          99,000
Column Total     1,089      98,911         100,000
Distribution of 100,000 errors
(Increased Specificity from 99% to 99.9%)
When an alert occurs, the chance that it was from a critical error is 90.91%.  This is a huge improvement.

                 Alert    No Alert    Row Subtotal
Critical            99           1             100
Non-critical       999      98,901          99,900
Column Total     1,098      98,902         100,000
Distribution of 100,000 errors
(Critical errors reduced from 1% to 0.1%)
When an alert occurs, the chance that it was from a critical error is 9.02%.  Improving the overall system by reducing the errors actually made the alerting system less accurate.
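
Here is the same arithmetic packaged as a small Python function, which reproduces all three results above; the function name and the scenario labels are mine, chosen just for this sketch.

```python
# Probability that an alert came from a critical error, given the
# critical-error rate, the sensitivity, and the specificity of the filter.
def p_critical_given_alert(critical_rate, sensitivity, specificity):
    true_alerts = critical_rate * sensitivity
    false_alerts = (1 - critical_rate) * (1 - specificity)
    return true_alerts / (true_alerts + false_alerts)

scenarios = [
    ("Initial system",               0.01,  0.99,  0.99),
    ("Sensitivity raised to 99.9%",  0.01,  0.999, 0.99),
    ("Specificity raised to 99.9%",  0.01,  0.99,  0.999),
    ("Critical errors cut to 0.1%",  0.001, 0.99,  0.99),
]
for name, rate, sens, spec in scenarios:
    print("%s: %.2f%%" % (name, 100 * p_critical_given_alert(rate, sens, spec)))
# Prints roughly 50%, 50.2%, 90.9%, and 9.0%, matching the tables.
```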

So, to summarize: if your focus is entirely on improving the relevance of the alerting system, then you should focus on eliminating false positives. Reducing the relative percentage of critical errors will only make things worse. Note too that this is not isolated to this particular example. If you assign values as follows:
  • Let x be the fraction of errors that are critical (0 < x < 1).
  • Let N be the specificity of the filtering system (0 < N < 1).
  • Let M be the sensitivity of the filtering system (0 < M < 1).
Then, applying Bayes' theorem, you obtain the following expression for the likelihood that an alert is due to a critical error: Mx / (Mx + (1 - N)(1 - x)). If you regard M and N as fixed, then the resulting function of x always tends to 0 as x goes to 0. In short, just reducing the percentage of critical errors means an ever larger share of the alerts will be false ones.
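
To see that limiting behaviour concretely, here is a quick sketch of the formula in Python (again, the names are just for illustration): with M and N both held at 99%, the probability that an alert is critical shrinks right along with x.

```python
# P(critical | alert) = M*x / (M*x + (1 - N)*(1 - x)),
# where x is the critical-error fraction, M the sensitivity, N the specificity.
def posterior(x, m_sensitivity, n_specificity):
    return (m_sensitivity * x) / (m_sensitivity * x + (1 - n_specificity) * (1 - x))

for x in (0.01, 0.001, 0.0001, 0.00001):
    print("x = %-7g  P(critical | alert) = %.4f" % (x, posterior(x, 0.99, 0.99)))
# 0.5000, 0.0902, 0.0098, 0.0010 -- heading toward 0, as claimed.
```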
