Stop using GitHub as a measure of open source contributions

With Microsoft and the rest of the tech industry trying to flaunt their open source bona fides, people keep trotting out deeply flawed analyses of “who contributes most to open source” based on … what they can measure on GitHub. For many reasons, using GitHub as the gold standard for open source contributions is extremely biased and doesn’t begin to give an accurate picture of which companies are actually making real contributions to open source.

GitHub is not the home for all open source

It should go without saying, but apparently doesn’t, that GitHub hosts only a fraction of open source projects and activity.

GitHub launched about 10 years ago. Open source and free software development predates GitHub’s existence by twenty years or so. A lot of projects have picked up and moved from their previous homes to GitHub, but many haven’t. GNU projects, for example, aren’t hosted there. Canonical’s Launchpad hosts a lot of projects that aren’t on GitHub. Fedora has Pagure, and the Eclipse project and the Apache Software Foundation each have their own source control for their projects, etc.

Some of those may mirror projects on GitHub, but it’s unclear to me how people who don’t have GitHub accounts are counted when people survey GitHub. I’m skeptical that using GitHub APIs to pull user data to see “what company does so-and-so work for?” is effective when that person hasn’t created a GitHub account.

GitHub metrics are biased towards newer projects, corporate-founded projects, and projects that have a bent towards non-reciprocal licenses.

One metric, even if it’s wrong

We’ve established that GitHub is only a slice of open source development activity. What size slice, I’m not sure, but if I were feeling generous I might say GitHub hosts between 40% and 60% of active and important projects today.

But 1) it’s a biased set of projects because it misses a lot of important and established projects (as well as those that have a bent towards software freedom and not just expediency in development), and 2) it captures only the activity you can measure via GitHub’s tools. (It also misses people who contribute via personal email addresses or project addresses.)
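To make the email problem concrete, here is a minimal sketch of how an attribution-by-domain count might work. This is my own illustration, not the method of any particular report: the DOMAIN_TO_COMPANY table, the attribute_commits function, and the sample addresses are all hypothetical.

```python
from collections import Counter

# Hypothetical lookup table from email domain to employer.
# A real survey would need some equivalent of this, and it is an assumption here.
DOMAIN_TO_COMPANY = {
    "microsoft.com": "Microsoft",
    "redhat.com": "Red Hat",
}


def attribute_commits(author_emails):
    """Count commits per company using only the author's email domain."""
    counts = Counter()
    for email in author_emails:
        # Take everything after the last "@" as the domain.
        domain = email.rsplit("@", 1)[-1].lower()
        counts[DOMAIN_TO_COMPANY.get(domain, "unattributed")] += 1
    return counts


# Made-up sample data, one entry per commit.
sample = [
    "dev1@microsoft.com",
    "dev2@redhat.com",
    "dev2.personal@gmail.com",  # a company employee using a personal address
    "maintainer@apache.org",    # a project address
]

result = attribute_commits(sample)
print(dict(result))
```

In this made-up sample, half of the commits end up credited to nobody, even though one of them came from someone a company pays to do the work.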

Let’s look at some of the things measuring via GitHub automatically excludes.

Non-development activity: Assuming you’ve identified users properly, etc., you can then measure their activity in contributing code and answering issues, how popular the projects are in the context of GitHub, and some other snazzy metrics.

What you don’t see are non-code contributions in the form of documentation, UX work, legal work, and other activities that healthy projects depend on. Yes, some projects do also host docs on GitHub, but many don’t. And you don’t see the work that a company contributes in the form of lawyers approving or vetting code that moves from proprietary to open source, or verifying whether a project’s CLA is acceptable.

Microsoft, for example, has probably had teams of lawyers working very hard as they move things to GitHub. That work? Completely silent and unobserved on GitHub – but essential. I know the legal teams at companies like SUSE and Red Hat spend a great deal of time supporting open source projects — and that work isn’t going to be surfaced when asking GitHub’s API for commits and PR data.

Quality assurance and testing: In many cases, when you see a PR or commit, that is the result of a developer fixing a bug somebody found in testing work outside GitHub. While GitHub repos have ways to report issues, if you look at projects that are shipped by companies like Microsoft or Red Hat, there’s a ton of bug reporting that happens outside GitHub. After extensive testing and examination there might be an issue filed on GitHub or it might be reflected in a PR, but you don’t see the bulk of work that’s gone into it.

Distribution: GitHub also doesn’t show you the work that goes into, for example, putting together Fedora or OpenStack. You don’t see the work that goes into packaging software to work with other software, or the work that goes into maintaining build systems and so forth that don’t live at GitHub.

Every six months or so, Fedora puts out a release with multiple ISO images for installation, huge sets of packages in repositories for multiple arches, as well as container images, rpm-ostree images, and more. All of that comes with documentation, artwork, promotional activity, and infrastructure, all of which are open source contributions from thousands of people (many are Red Hat employees, many aren’t) and happen almost entirely away from GitHub.

The same is true for Ubuntu, and there’s also Linux Mint, Debian, and openSUSE on various release cadences. So measuring only GitHub ignores all of that and wrongly credits newcomers like Microsoft with more influence than they actually have.

Effing the ineffable

Finally, measuring GitHub projects en masse without any real metric for the importance of the projects, their adoption, or their value to the ecosystem and the companies contributing fails to provide an accurate picture. You can count GitHub stars and downloads and PRs and issues, but that doesn’t really give a full view of the importance of a project.

One company (or project, like Debian or Apache) may make its entire portfolio available as open source, while other companies focus their activities on open source they hope will add value to their core platforms. To be blunt, Microsoft’s strategy is to embrace open source so that its key platforms (Azure, Windows) have workloads to run.

This is not to say Microsoft isn’t a valuable contributor in those communities or that its contributions aren’t worthwhile. But if the company is just helping to entrench proprietary platforms by making open source work better with them, it’s premature to celebrate that company as a champion of open source.

It’s also faulty to claim that Microsoft or anybody else contributes “the most” to open source based on a survey of a single platform. (Especially when that platform is owned by one of the companies. Of course more of their developers are going to be doing work there!)

In short, people need to stop trotting out reports that only focus on GitHub if they want to claim an accurate picture of “open source.” It might be fair to use GitHub when examining a single project (like Kubernetes), but it’s not a fair representation of open source overall. Failing to understand that is perhaps understandable when talking to laypersons who don’t participate in open source or claim to be experts. But coming from people who do, failing to understand that demonstrates a questionable understanding of open source. Simply ignoring it or excusing it and using GitHub as a metric anyway is deceptive.

FOSDEM Distributions Developer Room: Call for Participation

Once again, FOSDEM will have a cross-distribution miniconference on 1 & 2 February 2014. We’d like to invite submissions of talks, Birds of a Feather (BoF) sessions, or round-table discussions from any interested representatives of Linux distributions or individuals who have a topic of interest related to Linux distributions.

Proposals should be submitted through the FOSDEM proposal system (Pentabarf) here:

https://penta.fosdem.org/submission/FOSDEM14

You’ll add your session title, speaker bio, and abstract for the talk. If you’ve presented or submitted at FOSDEM previously, you should have an account in Pentabarf. If you haven’t created an account but have presented at FOSDEM previously, please contact me before creating one – the odds are you have an account that was created previously by the FOSDEM organizers.

Deadline for submissions is 22 December 2013. Since we’re on a tight timeline, this is unlikely to be extended.

In addition to speakers, we also need one moderator for each day, and a video volunteer for each day. The moderator will introduce the speaker, keep time, and pass the microphone around for questions. The video volunteer will handle recording of sessions with provided equipment. (Don’t worry, we’ll provide training as well.)

The call for participation is going out a bit late, so please do speak up quickly if you’re interested in participating! Also, please do help spread the word so we can ensure the best possible program for this year’s FOSDEM.

A Response I’d Never Like to Hear or See Again: “Just Don’t Use X”

Let me say this up-front. I’m guilty of this myself. I’ll own it, I’ve said variations of this about plenty of technologies or services.

Someone complains about a mobile OS, “oh, don’t use that. Use [insert speaker’s favorite mobile OS here].” Someone complains about Windows/Mac/Linux, “simple, just use [Windows|Mac|Linux] and your problem goes away.”

Someone complains about a problem with Facebook, Gmail, Google+, Twitter, etc. “Oh, just don’t use it. Simple.”

You get the idea.

The speaker may be the best kind of correct, technically correct, but they risk invoking the “fail mode of clever” which is (as John Scalzi so eloquently put it) “asshole.”

You may think your absolutist, well-thought-out, well-reasoned manifesto against $thing is convincing. It may even be convincing to anyone willing to 1) take the several hours it takes to hear the diatribe, and 2) trade off the benefits or perceived benefits of their choice to embrace the alternative. (This is assuming you offer an alternative. Many folks like to bash things and then not even offer up an alternative, which isn’t a winning strategy. Yes, I’m looking at the whole “Defective by Design” campaign when I say this.)

It’s totally OK for you to refuse to use a service, operating system, program, or whatever. More power to you. Just don’t assume that your choices are applicable to others.

People use Facebook for complicated reasons, and often are well aware of how annoying the service is and how shitty it is that Facebook continually tweaks privacy options/settings and the flow of posts, etc. People use Windows for complicated reasons that depend a lot on their level of comfort with computers, applications they need, etc.

“Just don’t use X,” is not a constructive comment. That’s not to say offering an alternative is bad or wrong, if done reasonably. But “just don’t use X” is pretty much a non-starter.

And don’t even get me started on the folks who, when you mention that others are encountering problems with X, recommend: “simple, just tell them not to use X and to use a better service/technology.” Yes, because what will win users/customers is to reply to their issues with an invitation to make changes on their end that will be perceived as disruptive. Way to go, champ, pick up your prize for customer service at the front desk.

You can advocate for better options, but leading with “just don’t use X” as an absolutist statement pretty much guarantees you’re going to annoy the other person or people and be ignored. Take a stab at being empathetic with others and realize that your set of choices and values may not apply well to their situation.