Importance of visualizing outliers - L3 cache miss rates

Zooming in on the lowest L3 miss rates shows that some packets experience cache miss rates an order of magnitude lower than the rest. Why?

L3 cache miss rates
L3 cache miss rates as a function of the length of the network function chain. The red arrow on the right side of the figure points to some of the outliers: packets that had much lower cache miss rates than those inside the box of the box plot (the range from the first to the third quartile, which contains the median). However, these packets underwent exactly the same processing as all of the other packets.

Data provided by Georgios Katsikas, doctoral student at Network Systems Lab (NSL), CoS


Transcript

And what does that tell us? As soon as I saw this data, I said, "Hold on just a minute, these outliers are getting lucky somehow as they were being processed": they were all in memory, and so we didn't have any misses of the cache. And so, of course, an obvious question becomes: how did they get lucky?

So not only in the L1 cache, but if we zoom in and look at the L3 cache, we also see down here that there are a number of cases for which (yes) these packets got lucky. So even though the median and the first and third quartiles are way up here, and of course there are the ones that got unlucky and had a very high miss rate, these got lucky.

And then looking into the "why" of this led Georgios and others, including myself and his advisor Dejan Kostic, to dig deep into exactly what is going on. Because, of course, what we would like to do is move the median performance down here to this level, so that in fact all the packets, or at least most of the packets, are getting lucky and actually experience a low cache miss rate. This is one of the major reasons why it's very important in your box plots to pay attention to your outliers.

Georgios initially thought, "Ah! I should just ignore those, as those are probably due to some sort of experimental error and aren't really important", where in fact they turned out to be the most important data points in this particular plot.
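Low outliers like these can also be pulled out of the data programmatically, so that they get inspected rather than discarded. The following is a minimal Python sketch, not the analysis used for the figure. It assumes the common Tukey rule (a point is an outlier if it lies more than 1.5 times the interquartile range below the first quartile); the source does not say which whisker rule the plot used. The function name `low_outliers` and the miss-rate values are hypothetical and purely illustrative, not the measured data.

```python
from statistics import quantiles


def low_outliers(values, k=1.5):
    """Return (lower_fence, sorted values below it) using Tukey's k*IQR rule.

    The k=1.5 default is an assumption; the original plot may use another rule.
    """
    # Three cut points for n=4: first quartile, median, third quartile.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    lower_fence = q1 - k * (q3 - q1)
    return lower_fence, sorted(v for v in values if v < lower_fence)


# Synthetic, illustrative L3 miss rates in percent (NOT the measured data).
miss_rates = [42.0, 44.5, 45.0, 46.2, 47.0, 47.5, 48.1, 49.0, 50.3, 4.1, 3.8]

fence, lucky = low_outliers(miss_rates)
print(f"lower fence = {fence:.2f}%, low outliers = {lucky}")
```

Listing the "lucky" packets this way is only a starting point: the interesting question is what those packets have in common, which is what the investigation described in the transcript went on to examine.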