Why we focus on a single, reliable metric

By Bill Harding at 9:47pm on February 5, 2019

It's no secret that we've invested a tremendous amount of energy over the past three years into crafting Line Impact into a single, reliable metric that managers can use to inform some of their most important decisions. In effect, we've bet thousands of working hours on the belief that this single metric can accurately represent which developers are having the greatest impact on your repos, and your company.

To a software developer approaching Static Object for the first time, it might seem like a reckless bet to make. Books could be filled with the internet screeds chronicling the futility of quantifying developer output. Given the current state of GitHub's stats (which focus primarily on lines of code added/deleted, and commits made), these screeds have no shortage of evidence to support their premise. GitHub stats are exactly as ineffectual as advertised.

Even our friends at Gitprime hedge their bets when it comes to embracing a single metric. Their approach is to provide roughly fifteen metrics they believe combine to paint a picture of how much work a team of developers is getting done. They have a number of pages like this one that spread developer evaluation across many dimensions:

Source: Getapp's Gitprime profile.

It's reasonable to assume they've adopted this approach based on conversations with their customers, who are surely skeptical that any single metric could be so reliable as to fully capture how much productive output is being generated.

Nowhere on Static Object do we explicitly provide "commits made" or "code churn" metrics, though it would be trivial for us to do so (the latter is one factor in our calculation of Line Impact; the former, we believe, possesses no intrinsic signal). What explanation could possibly justify a conscious decision to withhold information from our customers on behalf of this "single, reliable metric" philosophy that drives Line Impact?

For starters, the more metrics you have to sift through and consider, the less clarity there is about what conclusion to draw from them. Second, spreading data across many metrics makes it impossible to facet data across secondary dimensions. For example, if you want to know who's the most prolific test writer, you can rank all developers' Line Impact on the secondary dimension of "test code." This isn't possible if you don't have a reliable metric to serve as your primary axis.
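To make the faceting idea concrete, here is a minimal sketch in Python. The data, field names, and numbers are all invented for illustration; the actual Line Impact pipeline is more involved.

```python
# Hypothetical sketch: ranking committers by a single primary metric
# ("line_impact") faceted on a secondary dimension ("test" vs. "app" code).
# All data and field names below are invented for illustration.
from collections import defaultdict

commits = [
    {"author": "alice", "category": "test", "line_impact": 120},
    {"author": "alice", "category": "app",  "line_impact": 300},
    {"author": "bob",   "category": "test", "line_impact": 450},
    {"author": "bob",   "category": "app",  "line_impact": 90},
]

def rank_by_facet(commits, facet):
    """Sum the primary metric per author, restricted to one facet,
    and return authors ranked from highest to lowest total."""
    totals = defaultdict(int)
    for c in commits:
        if c["category"] == facet:
            totals[c["author"]] += c["line_impact"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

print(rank_by_facet(commits, "test"))  # [('bob', 450), ('alice', 120)]
```

Because every commit rolls up to the same primary metric, the same ranking function works for any facet you slice on; with fifteen incommensurable metrics, there is no single axis to rank against.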

But there's an even more fundamental reason we've chosen to focus on creating a single metric to rule them all™️.


The only way we know to build a code metric that earns developer trust

The impetus for our decision came from a realization we made around 2016, when we had just set out on our quest to quantify code output. Our epiphany was that empirical analysis was the one and only way to create sufficient confidence in a metric that managers would feel comfortable making critical decisions based upon it.

What do I mean by "empirical analysis" in this context? That any metric purporting to accurately measure "productive code output per developer" must correspond to a human's evaluation of the same. More specifically, the results provided by Line Impact must correspond to the judgement of a company's most reliable code evaluators: its CTO, VP of Engineering, and other Senior Tech Managers. These are the ultimate arbiters with sufficient expertise and credibility to judge which developers are currently making the biggest positive impact on the code base.

Put another way, what we set out to build was a system that was flexible enough to be configured such that it corresponds to the judgement of your best software experts.

Further, we recognized that expert opinions would vary. The ideal software engineering metric needed to pick smart defaults, informed by past efforts to correlate empirical findings with the algorithm's output. These empirically derived defaults serve as the foundation for a first-pass approximation of Line Impact when a repo is initially imported. It has been essential for us to make every reasonable effort toward getting the default values right. But when the stakes are high, default settings aren't enough. We need to allow experts to tune the system to ensure that their intuition is reflected in the Line Impact values they see. Once your resident experts have calibrated the system, their confidence will naturally flow down to the developers themselves.

A well-tuned measurement tool is what permits managers to evaluate the biggest, most important questions -- like work from home productivity, or the performance of a new hire.


Putting the idea into practice

Every time a new customer imports their repo(s) with Static Object, we ask them the same question: do our results correspond with your prior beliefs?

The question might seem a bit counterintuitive at first. If our value proposition is to reveal how much productive work is being done in your repo, why would our first question essentially be "how much productive work is being done in your repo?" But this calibration step is essential to match up the vision of your company's tech leadership with the result of our algorithm. A good analogy is purchasing a new clock: the item you've bought can be very precise at measuring the passage of time -- but until you've set the initial value (i.e., the current time), the accuracy of your new clock depends on the manufacturer happening to share your time zone.

Once a technical manager has confirmed that their judgement corresponds to the graphs we present, they can then move forward with the peace of mind that their beliefs are reflected in the Line Impact values we present them. This is a key aspect of why we describe ourselves as a "Gitprime alternative for technical managers." When non-technical managers contact us for a demo, we strive to emphasize that their results will be more reliably accurate if they're willing to work with their technical leadership team -- or our technical support team -- to match up the specifics of their project with the various settings we make available to them.


Calibration in action: file type multipliers

Let's briefly delve into a specific example to make these concepts more tangible. Programmers know that not all file types are created equal when it comes to the ease of adding or removing lines. In web apps, one of the most basic manifestations of this idea is CSS files. A single line of CSS (or its close cousin, SCSS) can typically be written in 30-50% as much time as a line of Ruby, PHP, Java, or Python. Failure to account for this intrinsic property of the CSS file type leads to front-end developers (who spend the most time in such files) being heralded as the most productive developers at the company. To be clear, sometimes they might be the most productive -- but the fact that they tend to add, update, and remove more lines of code than anyone else isn't itself sufficient proof of their dominance.

In the interests of furnishing smart defaults, we automatically reduce the file type multiplier for files like CSS that have short lines and high redundancy. But we also provide a settings page that facilitates more granular control -- along with a preview of how a potential change would affect the credit given to a sample set of your committers:


This real-time preview gives engineering managers the opportunity to consider whether a proposed multiplier change brings Line Impact into better alignment with their own intuitions. File type multipliers are one of many ways we allow Line Impact to be calibrated to your tastes -- here are a few others.
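As an illustrative sketch of how a file type multiplier might work, consider the following Python snippet. The multiplier values and function names here are invented for the example; they are not Static Object's actual defaults.

```python
# Hypothetical sketch of per-file-type multipliers: raw line counts are
# scaled down for file types (like CSS/SCSS) whose lines are quicker to
# produce. Multiplier values below are invented, not actual defaults.
DEFAULT_MULTIPLIERS = {".css": 0.4, ".scss": 0.4, ".rb": 1.0, ".py": 1.0}

def weighted_impact(changes, multipliers=DEFAULT_MULTIPLIERS):
    """changes: list of (file_extension, raw_lines_changed) pairs.
    Unknown extensions fall back to a neutral multiplier of 1.0."""
    return sum(raw * multipliers.get(ext, 1.0) for ext, raw in changes)

# A front-end-heavy change no longer automatically dwarfs a back-end one:
print(weighted_impact([(".css", 200)]))  # 80.0
print(weighted_impact([(".rb", 100)]))   # 100.0
```

The settings page described above effectively lets a manager edit the multiplier table and preview how the re-weighted totals would shift credit across their committers.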


When reliability is paramount, get empirical

At this point, I should probably re-emphasize that you don't need to spend hours poring over custom settings to start gaining insights from Static Object's code processing. We've gone to great lengths to infer the best default settings for your repo based on our past experience processing similar code. Compared to classic developer evaluation techniques like "manager intuition," you can get a long way toward discovering new truths about your team's output without tweaking a single setting.

When you're ready to start making serious business decisions based on your quantified code stats, you can do it with confidence by empirically calibrating Line Impact. If you're a CTO, VP of Engineering, or manager with experience writing code, you're ideally equipped to match Line Impact to your own judgement. Better still, if you're at a large company with multiple tech managers spread throughout the enterprise, each manager can individually tune their team's Line Impact calculation to their tastes (we allow per-repo and per-organization configurations that supersede company-wide settings). After you've fine-tuned Line Impact to align with your vision, you no longer need to sink your own time into reading code to manually assess who's stuck, or which meetings/policies are sapping your team's productivity.
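The per-repo and per-organization override behavior can be pictured as a simple precedence chain. The sketch below is a hypothetical illustration of that idea; the key names and lookup order are assumptions for the example, not Static Object's actual configuration API.

```python
# Hypothetical sketch of settings precedence: a repo-level setting
# overrides an org-level one, which overrides the company-wide default.
# Key names and values are invented for illustration.
def resolve_setting(key, repo_cfg, org_cfg, company_cfg):
    """Return the most specific configured value for `key`."""
    for cfg in (repo_cfg, org_cfg, company_cfg):
        if key in cfg:
            return cfg[key]
    raise KeyError(key)

company = {"css_multiplier": 0.4}   # company-wide default
org     = {"css_multiplier": 0.5}   # one org tunes it differently
repo    = {}                        # this repo inherits from the org

print(resolve_setting("css_multiplier", repo, org, company))  # 0.5
```

This is what lets each manager calibrate their own team's view without disturbing the calibration chosen elsewhere in the company.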

If you're a non-technical manager, our technical support team can still help you work out how to craft the best possible configuration to rely upon for your business-critical decisions. Get in touch via our demo page, or just seize the day and start a free trial when you're ready to move beyond guessing at who your top performers are.