In stark certainty, we were wandering amidst a death which had reigned at least 500,000 years, and in all probability even longer.

- "At the Mountains of Madness", H.P. Lovecraft

Coding requires that we make decisions all the time, ranging from the small (what should I call this variable?) to the large (what framework should I use?). While sometimes the information we base our answers on is certain (the data we are given will never be null), more often we need to come up with an answer based on uncertain data (what SLA should we advertise for our system, given what we know of the SLAs of its various components?). Answering the latter type of question may seem daunting; however, it is not so impossible, given that there is a whole branch of mathematics that deals with exactly these situations, namely probability.

To move this away from the abstract, let's suppose you've just started work at the Corporate Cthulhu Containment Center and you are put in charge of making improvements to the alerting system. When you ask about the system, you are told it's needed because of the incredible job the Center has done with logging and reporting errors in its various sub-systems. It turns out that 99% of these errors are minor and can wait for action (e.g., the hot water has quit working in Cultist cell 42), yet the remaining 1% need an immediate response (e.g., the Cthonian force field is failing). So a system was put in place to alert people only when a critical event happens. The Center is proud of this system, stating that:

- 99% of the time a critical event occurs, an alert is created (i.e. the filtering is highly *sensitive*).
- 99% of the time a non-critical event occurs, no alert is created (i.e. the filtering also has high *specificity*).

Your first question is: when an alert occurs, what is the likelihood that it was caused by a critical error?

You are also asked to pick *one* of the following options for improving the alerting system:

- Increase the sensitivity to 99.9%
- Increase the specificity to 99.9%
- Decrease the percentage of critical events to 0.1%.

Try to guess the answers *before* crunching the numbers.

Let's start by looking at a sample pool of 100,000 errors. From what the Center has told you, you can conclude that 1,000 of these should be critical and the remaining 99,000 non-critical. Of the critical errors, you will get alerts for 99% of them, i.e. 990 alerts will be generated from critical errors. Similarly, the filtering system will **not** create an alert for 99% of the non-critical errors; however, this means there will be alerts for the remaining 1% of them, i.e. 990 alerts will be created for non-critical errors. To summarize:

| | Alert | No Alert | Row Subtotal |
|---|---|---|---|
| Critical | 990 | 10 | 1,000 |
| Non-Critical | 990 | 98,010 | 99,000 |
| Column Total | 1,980 | 98,020 | 100,000 |

*Distribution of 100,000 errors (Initial Alerting System)*

So, overall there are 1,980 alerts, of which 990 are from critical errors. In other words, there's only a 50% chance that when an alert occurs it is due to a critical error. No wonder the on-call people are complaining.
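The arithmetic above can be checked with a few lines of Python (a quick sketch; the helper name `alert_precision` is mine, not part of any real system):

```python
def alert_precision(base_rate, sensitivity, specificity, pool=100_000):
    """Fraction of alerts that come from critical errors."""
    critical = pool * base_rate            # 1,000 critical errors
    non_critical = pool - critical         # 99,000 non-critical errors
    true_alerts = critical * sensitivity          # critical errors that alert
    false_alerts = non_critical * (1 - specificity)  # non-critical errors that alert
    return true_alerts / (true_alerts + false_alerts)

print(alert_precision(0.01, 0.99, 0.99))  # 0.5
```

Note that the pool size cancels out of the final ratio; it's only there to mirror the table above.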

It should be fairly obvious that the best way to improve the accuracy of the alerting system is to reduce the number of non-critical errors that generate alerts (relative to the number of critical errors that do). In other words, improve the specificity of the filtering system.

To confirm this, here are tables with the breakdown for each of the three proposed solutions.

| | Alert | No Alert | Row Subtotal |
|---|---|---|---|
| Critical | 999 | 1 | 1,000 |
| Non-Critical | 990 | 98,010 | 99,000 |
| Column Total | 1,989 | 98,011 | 100,000 |

*Distribution of 100,000 errors (Increased Sensitivity from 99% to 99.9%)*

| | Alert | No Alert | Row Subtotal |
|---|---|---|---|
| Critical | 990 | 10 | 1,000 |
| Non-Critical | 99 | 98,901 | 99,000 |
| Column Total | 1,089 | 98,911 | 100,000 |

*Distribution of 100,000 errors (Increased Specificity from 99% to 99.9%)*

Now 990 of the 1,089 alerts (roughly 91%) come from critical errors, a *huge* improvement.

| | Alert | No Alert | Row Subtotal |
|---|---|---|---|
| Critical | 99 | 1 | 100 |
| Non-Critical | 999 | 98,901 | 99,900 |
| Column Total | 1,098 | 98,902 | 100,000 |

*Distribution of 100,000 errors (Critical errors reduced from 1% to 0.1%)*
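The three tables above can be reproduced with a short script (again a sketch; the function and option labels are my own):

```python
def alert_precision(base_rate, sensitivity, specificity):
    """P(critical | alert) for the given rates."""
    true_alerts = base_rate * sensitivity
    false_alerts = (1 - base_rate) * (1 - specificity)
    return true_alerts / (true_alerts + false_alerts)

options = {
    "baseline":               alert_precision(0.01,  0.99,  0.99),
    "sensitivity -> 99.9%":   alert_precision(0.01,  0.999, 0.99),
    "specificity -> 99.9%":   alert_precision(0.01,  0.99,  0.999),
    "critical rate -> 0.1%":  alert_precision(0.001, 0.99,  0.99),
}
for name, p in options.items():
    print(f"{name}: {p:.1%}")
```

Running this prints roughly 50.0%, 50.2%, 90.9%, and 9.0%, matching the four tables.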

So, to summarize: if your focus is entirely on improving the relevance of the alerting system, then you should concentrate on eliminating false positives. Reducing the relative percentage of critical errors will only make things worse. Note too that this is *not* isolated to just this example. For example, if you assign values as follows:

- Let x represent the percentage of errors that are critical (0 < x < 1).
- Let N be the specificity of the filtering system (0 < N < 1).
- Let M be the sensitivity of the filtering system (0 < M < 1).

then the probability that an alert corresponds to a critical error works out to xM / (xM + (1 - x)(1 - N)). As x goes to 0 the numerator goes to 0 while the denominator tends to (1 - N), which is strictly positive, so this probability will *always* tend to 0 as x goes to 0. In short, just reducing the percentage of critical errors alone will lead to proportionally *more* false alerts.
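Using the variables just defined (and plugging in the 99%/99% figures from the example as defaults), here's a quick demonstration of how the probability collapses as x shrinks:

```python
# P(critical | alert) = xM / (xM + (1 - x)(1 - N)), where
# x = fraction of errors that are critical, M = sensitivity, N = specificity.

def precision(x, M=0.99, N=0.99):
    return (x * M) / (x * M + (1 - x) * (1 - N))

for x in (0.01, 0.001, 0.0001, 0.00001):
    print(f"x = {x}: P(critical | alert) = {precision(x):.2%}")
```

Each tenfold drop in x drops the probability roughly tenfold once x is small, since the denominator is dominated by the fixed false-positive term (1 - x)(1 - N).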
