Let’s take a stroll through a few performance graphs that didn’t quite make the cut, and then I’ll show you a new graph that makes me really happy.
A command-line performance listener that works over 0mq, UDP, and Redis. Written in PHP.
It’s easy to see how much time any particular system or its parent takes to execute during a pageload. But one thing I often wonder looking at it is “am I missing anything?”
That’s right, Extended ASCII is back in fashion! (well.. maybe not)
So that looks interesting, but it still doesn’t tell us where we failed to instrument code. There may be undetected lumps of unoptimized code hiding in the shadows!
I wanted to find gaps in performance coverage.
This is an old game-engine I wrote for Xbox 360 with common video-game performance UIs: a flat-list and a tree.
A flat-list performance UI is good in a pinch. Hierarchical UIs are better because they give you context and make it obvious which sub-system a performance item belongs to; they’re great for organizing data.
But is there something we aren’t measuring, and if so, where is it?
New scheduler-style web renderer
None of the previous ways of rendering performance data exposed scheduling patterns, or revealed gaps between important events.
Sophisticated game-engines use scheduler-style graphs. Check this screenshot of Uncharted 2 by Naughty Dog:
Scary but insightful!
Scheduler style graphs reveal gaps in measured data, and make scale more apparent. So let’s give that a try on the web:
The graph above looks mysterious: lots of different colors, no hierarchical context.
It looks like a stacked bar chart. What if we make each child node’s height a fraction of its parent’s height?
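That height rule is simple to sketch. Here’s a minimal illustration (the node shape, the 60% ratio, and all names are my assumptions, not the post’s actual renderer):

```php
<?php
// Sketch: give every node a bar height that is a fixed fraction of its
// parent's height, so deeper nesting reads as thinner bars nested
// inside the parent's bar.
function assignHeights(array &$node, float $height = 100.0, float $ratio = 0.6): void
{
    $node['height'] = $height;
    foreach ($node['children'] as &$child) {
        assignHeights($child, $height * $ratio, $ratio);
    }
}

$root = [
    'label' => 'pageload',
    'children' => [
        ['label' => 'controller', 'children' => [
            ['label' => 'db-query', 'children' => []],
        ]],
    ],
];
assignHeights($root);
// root = 100, child = 60, grandchild = 36
```

The exact ratio doesn’t matter much; what matters is that a child can never be taller than its parent, which restores the hierarchical context a flat stacked chart loses.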
Suddenly, a picture begins emerging:
Red = bootstrapping, routing, and framework overhead
Light Green = Controller
Dark Green = ViewModels and DataModels
Orange = database queries
Blue = template rendering
See that first tall orange bar? That’s the SQL connection opening. It’s fast, right? Compared to the thickness of the other orange blobs, the open-connection phase is a small fraction of the overall DB work. That’s a good sign.
Even without knowing the absolute numbers, we can quickly see that the overhead of opening a DB connection is low compared to the amount of DB work done. This is the power of scheduler-graphs.
Another scheduler-graph superpower is making gaps apparent. There’s a 1.5ms gap at the 6ms mark. The DB queries are also spread out; if we bunched them closer together, could we disconnect from the DB sooner? All sorts of refactoring ideas spring up!
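Spotting gaps like that 1.5ms one can be automated: walk the timing entries in start order and report any stretch of unmeasured time. A rough sketch (the entry shape and threshold are assumptions):

```php
<?php
// Sketch: find unmeasured gaps between consecutive timing entries.
// Each entry is ['start' => ms since page start, 'duration' => ms].
function findGaps(array $entries, float $thresholdMs = 0.5): array
{
    usort($entries, fn($a, $b) => $a['start'] <=> $b['start']);
    $gaps = [];
    $cursor = 0.0; // end of the latest measured activity so far
    foreach ($entries as $e) {
        if ($e['start'] - $cursor > $thresholdMs) {
            $gaps[] = ['at' => $cursor, 'length' => $e['start'] - $cursor];
        }
        $cursor = max($cursor, $e['start'] + $e['duration']);
    }
    return $gaps;
}

$gaps = findGaps([
    ['start' => 0.0, 'duration' => 6.0],
    ['start' => 7.5, 'duration' => 3.0], // 1.5ms of unmeasured time before this
]);
// one gap reported: at 6.0ms, 1.5ms long
```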
But wait, there’s more: when you mouse over any of the boxes above, you get a drill-down:
I love the above drill-down. Vertical lines guide your eye, colors make things easy to scan.
That big mysterious blue box on the right of the chart is Mustache template rendering. It looks like it takes up a lot of space, so let’s sprinkle in some timing code and see what we can learn:
function render($templateName, $data)
{
    $p = $this->perf->start("Mustache Engine instantiation");
    $m = new Mustache_Engine([
        // engine options elided
    ]);
    $tpl = $m->loadTemplate($templateName);
    $p->next("Render to string");
    $output = $tpl->render($data);
    return $output;
}
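The `$p` helper itself isn’t shown in the post; only `start()` and `next()` appear. A hypothetical sketch of what such a timer might look like (the entry shape and the `close()` method are my assumptions):

```php
<?php
// Hypothetical sketch of the $p timer: start a labeled span, next()
// closes it and opens the following one under a new label.
class PerfSpan
{
    private $entries = [];
    private $label;
    private $begun;
    private $epoch;

    public function __construct($label, $epoch)
    {
        $this->label = $label;
        $this->epoch = $epoch;       // when page execution began
        $this->begun = microtime(true);
    }

    // Close the current span and open the next one under a new label.
    public function next($label)
    {
        $this->record();
        $this->label = $label;
        $this->begun = microtime(true);
    }

    // Close the current span and return all recorded entries.
    public function close()
    {
        $this->record();
        return $this->entries;
    }

    private function record()
    {
        $now = microtime(true);
        $this->entries[] = [
            'start'    => ($this->begun - $this->epoch) * 1000, // ms since page start
            'label'    => $this->label,
            'duration' => ($now - $this->begun) * 1000,
        ];
    }
}
```

Chaining `next()` calls this way means every instant between `start()` and `close()` is attributed to some label, which is exactly what keeps gaps out of your own instrumentation.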
After the above instrumentation ($p) our performance renderer has more details:
Template loading is really fast thanks to Mustache’s cache. The above visualization really puts its speed in context.
The next box represents Mustache’s template rendering (concatenating a giant string). This graph makes it easy to see that the overhead of spawning a Mustache_Engine object, and loading the template are irrelevant compared to the cost of actual template interpolation. You could continue drilling down like this as far as you want.
A short rant on granularity of instrumentation: I think 1ms chunks of activity are about where you can safely stop with web languages. Below that, the overhead of instrumentation plus VM, JIT, GC, and network/socket jitter (even on a local machine) will vary wildly. Simply letting an opcode cache warm up and persistent sockets get established can change your measurements by an order of magnitude (or two).
The scheduler-renderer needs fairly simple data:
Each timing entry has only three mandatory parts:
- time since page execution began (local epoch)
- a meaningful label for the activity
- the duration of the activity
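For example, a few entries might serialize like this (the field names, labels, and numbers here are illustrative, not the widget’s actual wire format):

```json
[
  { "start": 0.0, "label": "Bootstrap",           "duration": 2.2 },
  { "start": 2.2, "label": "Routing",             "duration": 0.4 },
  { "start": 6.0, "label": "SQL: open connection", "duration": 0.9 }
]
```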
You can include extra metadata to aid further discovery:
- SQL query
- Cache-Hit status
- A stacktrace of the code section being timed, in case you can’t find the label in your codebase and want to know which file and function are being measured.
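Capturing that stacktrace in PHP is cheap if you skip the arguments. A sketch of building an entry with metadata attached (the entry shape and function name are assumptions):

```php
<?php
// Sketch: a timing entry carrying optional metadata next to the three
// mandatory fields, including a compact stacktrace so a label can be
// traced back to a file and line.
function timingEntry($start, $label, $duration, array $meta = [])
{
    // DEBUG_BACKTRACE_IGNORE_ARGS keeps the captured trace small and cheap.
    $meta['stacktrace'] = array_map(
        function ($frame) {
            return ($frame['file'] ?? '?') . ':' . ($frame['line'] ?? 0);
        },
        debug_backtrace(DEBUG_BACKTRACE_IGNORE_ARGS, 5)
    );
    return ['start' => $start, 'label' => $label, 'duration' => $duration] + $meta;
}

$entry = timingEntry(6.0, 'SQL: SELECT bugs', 1.2, ['sql' => 'SELECT * FROM bugs']);
// $entry now has start/label/duration plus 'sql' and 'stacktrace' keys
```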
I went a step further and instrumented partial (section) rendering within Mustache:
Now I get to see how much it costs to render each partial. Each row in my bug-report list above took almost half a millisecond to interpolate!
Here’s the scheduler-graph widget at work: