Stop using GitHub as a measure of open source contributions

With Microsoft and the rest of the tech industry trying to flaunt their open source bona fides, people keep trotting out deeply flawed analyses of “who contributes most to open source” based on … what they can measure on GitHub. For many reasons, using GitHub as the gold standard for open source contributions is extremely biased and doesn’t begin to give an accurate picture of which companies are actually making real contributions to open source.

GitHub is not the home for all open source

It should go without saying, but apparently doesn’t, that GitHub hosts only a fraction of open source projects and activity.

GitHub launched about 10 years ago. Open source and free software development predates GitHub’s existence by twenty years or so. A lot of projects have picked up and moved from their previous homes to GitHub, but many haven’t. GNU projects, for example, aren’t hosted there. Canonical’s Launchpad hosts a lot of projects that aren’t on GitHub. Fedora has Pagure, and the Eclipse project has its own source control for its projects, as does the Apache Software Foundation, and so on.

Some of those may mirror projects on GitHub, but it’s unclear to me how people who don’t have GitHub accounts are counted when people survey GitHub. I’m skeptical that using GitHub APIs to pull user data to see “what company does so-and-so work for?” is effective when that person hasn’t created a GitHub account.
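To make that concern concrete, here is a minimal sketch of the kind of tally a GitHub-only survey might run. It assumes commit records shaped like items from GitHub's "list commits" REST response, where, as I understand it, the "author" entry is null when the commit email isn't linked to any GitHub account, and it assumes employers are read from the free-form "company" field on user profiles. The function name, the sample commits, and "ExampleCorp" are all made up for illustration.

```python
from collections import Counter

UNATTRIBUTED = "(unattributed)"


def count_commits_by_company(commits, company_by_login):
    """Tally commits per employer the way a GitHub-only survey might.

    commits: dicts shaped like "list commits" API items; "author" is None
        when the commit email maps to no GitHub account (assumption).
    company_by_login: hypothetical lookup of login -> profile "company"
        value, which users are free to leave empty.
    """
    totals = Counter()
    for commit in commits:
        author = commit.get("author")
        login = author["login"] if author else None
        company = company_by_login.get(login) if login else None
        # No account, or an account with no company listed: no credit given.
        totals[company or UNATTRIBUTED] += 1
    return totals


# Hypothetical sample: two contributors with accounts, two without.
sample_commits = [
    {"sha": "a1", "author": {"login": "alice"}},
    {"sha": "b2", "author": {"login": "bob"}},
    {"sha": "c3", "author": None},
    {"sha": "d4", "author": None},
]
sample_profiles = {"alice": "ExampleCorp", "bob": None}

totals = count_commits_by_company(sample_commits, sample_profiles)
print(totals)
```

In this made-up sample, only one of the four commits can be credited to an employer. The rest fall into a bucket that a "who contributes most" ranking simply drops.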

GitHub metrics are biased towards newer projects, corporate-founded projects, and projects that have a bent towards non-reciprocal licenses.

One metric, even if it’s wrong

We’ve established that GitHub is only a slice of open source development activity. What size slice, I’m not sure, but if I were feeling generous I might say GitHub hosts between 40% and 60% of active and important projects today.

But 1) it’s a biased set of projects because it misses a lot of important and established projects (as well as those that have a bent towards software freedom and not just expediency in development), and 2) it captures only the activity you can measure via GitHub’s tools. (It also misses people who contribute via personal email addresses or project addresses.)
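The parenthetical about email addresses is easy to illustrate. Below is a rough sketch of domain-based affiliation matching, assuming a survey maps commit email domains to employers. The domain table, the addresses, and the helper name are hypothetical, not taken from any real report.

```python
def affiliation_from_email(email, company_domains):
    """Return the employer for an email's domain, or None if unknown."""
    domain = email.rsplit("@", 1)[-1].lower()
    return company_domains.get(domain)


# Hypothetical table a survey might maintain.
company_domains = {"example-corp.com": "ExampleCorp"}

emails = [
    "dev@example-corp.com",  # corporate address: credited to the employer
    "dev@gmail.com",         # personal address: no employer found
    "dev@debian.org",        # project address: no employer found
]
results = {e: affiliation_from_email(e, company_domains) for e in emails}
print(results)
```

The same person could be behind all three addresses, but only the first one would be counted toward the company.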

Let’s look at some of the things measuring via GitHub automatically excludes.

Non-development activity: Assuming you’ve identified users properly, etc., you can then measure their activity in contributing code and answering issues, how popular the projects are in the context of GitHub, and some other snazzy metrics.

What you don’t see are non-code contributions in the form of documentation, UX work, legal work, and other activities that healthy projects depend on. Yes, some projects do also host docs on GitHub, but many don’t. And you don’t see the work that a company contributes in the form of lawyers approving or vetting code that moves from proprietary to open source, or verifying whether a project’s CLA is acceptable.

Microsoft, for example, has probably had teams of lawyers working very hard as they move things to GitHub. That work? Completely silent and unobserved on GitHub, but essential. I know the legal teams at companies like SUSE and Red Hat spend a great deal of time supporting open source projects, and that work isn’t going to be surfaced when asking GitHub’s API for commits and PR data.

Quality assurance and testing: In many cases, when you see a PR or commit, that is the result of a developer fixing a bug somebody found in testing work outside GitHub. While GitHub repos have ways to report issues, if you look at projects that are shipped by companies like Microsoft or Red Hat, there’s a ton of bug reporting that happens outside GitHub. After extensive testing and examination there might be an issue filed on GitHub or it might be reflected in a PR, but you don’t see the bulk of work that’s gone into it.

Distribution: GitHub also doesn’t show you the work that goes into, for example, putting together Fedora or OpenStack. You don’t see the work that goes into packaging software to work with other software, or the work that goes into maintaining build systems and so forth that don’t live at GitHub.

Every six months or so, Fedora puts out a release with multiple ISO images for installation, huge sets of packages in repositories for multiple arches, as well as container images, rpm-ostree images, and more. All of that comes with documentation, artwork, promotional activity, and infrastructure, all of which are open source contributions from thousands of people (many are Red Hat employees, many aren’t) that happen almost entirely away from GitHub.

The same is true for Ubuntu, and there’s also Linux Mint, Debian, and openSUSE on various release cadences. So measuring only GitHub ignores all of that and wrongly credits newcomers like Microsoft with more influence than they actually have.

Effing the ineffable

Finally, measuring GitHub projects en masse without any real metric for the importance of the projects, their adoption, or their value to the ecosystem and the companies contributing fails to provide an accurate picture. You can count GitHub stars and downloads and PRs and issues, but that doesn’t really give a full view of the importance of a project.

One company (or project, like Debian or Apache) may make its entire portfolio available as open source, while other companies focus their activities on open source they hope will add value to their core platforms. To be blunt, Microsoft’s strategy is to embrace open source so that its key platforms (Azure, Windows) have workloads to run.

This is not to say Microsoft isn’t a valuable contributor in those communities or that its contributions aren’t worthwhile. But if the company is just helping to entrench proprietary platforms by making open source work better with them, it’s premature to celebrate that company as a champion of open source.

It’s also faulty to claim that Microsoft or anybody else contributes “the most” to open source based on a survey of a single platform. (Especially when that platform is owned by one of the companies. Of course more of their developers are going to be doing work there!)

In short, people need to stop trotting out reports that only focus on GitHub if they want to claim an accurate picture of “open source.” It might be fair to use GitHub when examining a single project (like Kubernetes), but it’s not a fair representation of open source overall. Failing to understand that is perhaps understandable when talking to laypersons who don’t participate in open source or claim to be experts. But for anyone who does, failing to understand that demonstrates a questionable understanding of open source. Simply ignoring it or excusing it and using GitHub as a metric anyway is deceptive.