Stop using GitHub as a measure of open source contributions

With Microsoft and the rest of the tech industry trying to flaunt their open source bona fides, people keep trotting out deeply flawed analyses of “who contributes most to open source” based on … what they can measure on GitHub. For many reasons, using GitHub as the gold standard for open source contributions is extremely biased and doesn’t begin to give an accurate picture of which companies are actually making real contributions to open source.

GitHub is not the home for all open source

It should go without saying, but apparently doesn’t, that GitHub hosts only a fraction of open source projects and activity.

GitHub launched about 10 years ago. Open source and free software development predates GitHub’s existence by twenty years or so. A lot of projects have picked up and moved from their previous homes to GitHub, but many haven’t. GNU projects, for example, aren’t hosted there. Canonical’s Launchpad hosts a lot of projects that aren’t on GitHub. Fedora has Pagure, and the Eclipse Foundation and the Apache Software Foundation each run their own source control for their projects.

Some of those projects may be mirrored on GitHub, but it’s unclear to me how people who don’t have GitHub accounts are counted when people survey GitHub. I’m skeptical that using GitHub APIs to pull user data to answer “what company does so-and-so work for?” is effective when that person has never created a GitHub account.
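For the curious, here’s roughly what that kind of company lookup boils down to. This is a minimal sketch against GitHub’s public REST API, my own illustration rather than any particular survey’s methodology, and the username is just an example:

```python
import requests

# Look up the self-reported "company" field on a GitHub profile.
# Unauthenticated requests are rate-limited (60/hour), the field is
# free-form and frequently blank, and contributors who work via email
# and never created a GitHub account don't show up here at all.
def company_for_user(username):
    resp = requests.get(f"https://api.github.com/users/{username}")
    resp.raise_for_status()
    return resp.json().get("company")  # e.g. "@redhat", "Microsoft", or None

print(company_for_user("torvalds"))
```

Everything a survey can say about employer affiliation hangs on that one free-form field, which is exactly the problem.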

GitHub metrics are biased towards newer projects, corporate-founded projects, and projects that have a bent towards non-reciprocal licenses.

One metric, even if it’s wrong

We’ve established that GitHub is only a slice of open source development activity. What size slice, I’m not sure, but if I were feeling generous I might say GitHub hosts between 40% and 60% of the active and important projects today.

But 1) it’s a biased set of projects, because it misses a lot of important and established projects (as well as those with a bent towards software freedom and not just expediency in development), and 2) it captures only the activity you can measure via GitHub’s tools. (It also misses people who contribute via personal email addresses or project addresses.)

Let’s look at some of the things measuring via GitHub automatically excludes.

Non-development activity: Assuming you’ve identified users properly, you can measure their activity contributing code and answering issues, how popular their projects are within GitHub, and some other snazzy metrics.
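To be fair about what is measurable, here’s a sketch of the sort of numbers a GitHub-only survey can pull via the public API. Kubernetes is an arbitrary example repo, and again, unauthenticated requests are heavily rate-limited:

```python
import requests

repo = "kubernetes/kubernetes"  # arbitrary example

# Repository-level popularity: stars and forks.
meta = requests.get(f"https://api.github.com/repos/{repo}").json()
print(repo, "stars:", meta["stargazers_count"], "forks:", meta["forks_count"])

# Per-contributor commit counts -- commits only; docs, QA, legal,
# and packaging work are invisible to this endpoint.
contributors = requests.get(
    f"https://api.github.com/repos/{repo}/contributors",
    params={"per_page": 5},
).json()
for c in contributors:
    print(c["login"], c["contributions"])
```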

What you don’t see are non-code contributions in the form of documentation, UX work, legal work, and other activities that healthy projects depend on. Yes, some projects also host docs on GitHub, but many don’t. And you don’t see the work a company contributes in the form of lawyers vetting and approving code as it moves from proprietary to open source, or verifying whether a project’s CLA is acceptable.

Microsoft, for example, has probably had teams of lawyers working very hard as it moves things to GitHub. That work? Completely silent and unobserved on GitHub, but essential. I know the legal teams at companies like SUSE and Red Hat spend a great deal of time supporting open source projects, and that work isn’t going to be surfaced by asking GitHub’s API for commit and PR data.

Quality assurance and testing: In many cases, a PR or commit you see is the result of a developer fixing a bug somebody found in testing work outside GitHub. GitHub repos have ways to report issues, but if you look at projects shipped by companies like Microsoft or Red Hat, there’s a ton of bug reporting that happens outside GitHub. After extensive testing and examination there might be an issue filed on GitHub, or the result might show up in a PR, but you don’t see the bulk of the work that went into it.

Distribution: GitHub also doesn’t show you the work that goes into, for example, putting together Fedora or OpenStack. You don’t see the work that goes into packaging software to work with other software, or the work that goes into maintaining build systems and so forth that don’t live at GitHub.

Every six months or so, Fedora puts out a release: multiple ISO images for installation, huge sets of packages in repositories for multiple architectures, plus container images, rpm-ostree images, and more. All of that ships with documentation, artwork, promotional activity, and infrastructure, all of it open source contributions from thousands of people (many are Red Hat employees, many aren’t), and almost all of it happens away from GitHub.

The same is true for Ubuntu, and there’s also Linux Mint, Debian, and openSUSE on various release cadences. So measuring only GitHub ignores all of that and wrongly credits newcomers like Microsoft with more influence than they actually have.

Effing the ineffable

Finally, measuring GitHub projects en masse, without any real metric for the projects’ importance, their adoption, or their value to the ecosystem and to the companies contributing, fails to provide an accurate picture. You can count GitHub stars and downloads and PRs and issues, but that doesn’t really give a full view of a project’s importance.

One company (or project, like Debian or Apache) may make its entire portfolio available as open source, while other companies focus their activities on open source they hope will add value to their core platforms. To be blunt, Microsoft’s strategy is to embrace open source so that its key platforms (Azure, Windows) have workloads to run.

This is not to say Microsoft isn’t a valuable contributor in those communities or that its contributions aren’t worthwhile. But if the company is just helping to entrench proprietary platforms by making open source work better with them, it’s premature to celebrate that company as a champion of open source.

It’s also faulty to claim that Microsoft or anybody else contributes “the most” to open source based on a survey of a single platform. (Especially when that platform is owned by one of the companies being compared. Of course more of its developers are going to be doing work there!)

In short, people need to stop trotting out reports that focus only on GitHub if they want to claim an accurate picture of “open source.” It might be fair to use GitHub when examining a single project (like Kubernetes), but it’s not a fair representation of open source overall. Failing to understand that is perhaps forgivable in laypersons who don’t participate in open source or claim to be experts. In anyone who does claim expertise, it demonstrates a questionable understanding of open source. And simply ignoring the problem, or excusing it and using GitHub as a metric anyway, is deceptive.

If you hitch a ride with a scorpion…

I haven’t seen a blog post or notice about this, but according to the Twitters, Coverity has stopped supporting online scanning for open source projects. Is anybody shocked by this? Anybody?

Chris Aniszczyk (@cra) tweets: "sigh coverity stopped supporting their online scanning for open source projects... C/C++ code scan tool that integrates beautifully with github?"

This comes the same week that Slack announced it’s ending support for its IRC/XMPP gateways: the very tools that persuaded a number of people it was OK to adopt a proprietary chat service, because they’d always be able to use open clients to connect.

Not sure what the story is with Coverity, but it probably has something to do with 1) they haven’t been able to monetize the service the way they hoped, or 2) they’ve been able to monetize the service but don’t fancy spending the money anymore, or 3) they’ve pivoted entirely and just aren’t doing the scanning thing. Not sure which, don’t really care: the end result is the same. Open source projects that have come to depend on this now have to scramble to replace the service.

We’ve seen this before, with a litany of variations: BitKeeper pulling the plug on its freebies for kernel developers. SourceForge.net taking a turn for the worse and driving a number of projects away. Google Chat/Hangouts dropping XMPP federation with clients outside its network. Transifex closing its source code… I could go on; those are just the ones that jump to mind.

I’m not going to go all RMS, but the only way to prevent this is to have open tools and services. And pay for them.

Amazon’s open source aspirations and actions

I stayed up last night to watch Amazon’s Tuesday night keynote for AWS re:Invent. Lemme tell you, I am not at all sad to be missing the crowds at re:Invent, and kudos to Amazon for its high-quality production values for the keynotes.

One of the things that really interested me, but wasn’t deeply explored, was the mention of Amazon’s home-grown, KVM-based hypervisor and its Nitro setup, which offloads networking, management, and storage to separate hardware and gives instances all the resources of the machine. (This is going by Peter DeSantis’ description and my following along with the keynote past midnight, so…)

Later in the keynote, when they brought Netflix on, they made some noises about open source and talked about their TLS implementation, s2n. I haven’t dived deeply into s2n, but it sounds like they’re doing the right thing with the project, and a strong encryption alternative with deep-pocketed backing is not a bad thing at all.

But what struck me is the dichotomy of talking up open source and its importance for s2n while completely glossing over their modifications to, or plans for, KVM as a project. There’s a huge KVM community, and I’m sure they’d love to have Amazon participating actively. As far as I know, though, that isn’t happening.

Amazon has made moves to start an open source office and is doing more work in open source, but there’s a huge deficit between what Amazon builds on top of open source and what it contributes back. If the company is serious about open source, it has an opportunity to make an enormous impact. I just hope the plan isn’t to limit its contributions to fringe or non-crucial projects while keeping vital work like Nitro/KVM behind closed doors, away from the rest of the industry.

Flock Day Two: Everything is a Container! (Kinda)

Day two at Flock was, once again, a pretty container-riffic experience, at least if that’s what you were interested in. The day kicked off with Dan Walsh giving an overview of new container technologies and a roadmap for things like the cri-o project. (Look here for a longer post on cri-o and such shortly.)

Dan’s talk was excellent all around, but he offered one piece of perspective I plan to use going forward: everything running on Linux is in a “container,” even if it’s the “host” container. What this means is that all processes use the same technologies that make up “containers”: cgroups, SELinux, namespaces, and so on. What container runtimes do is set up more restrictive containers that have a different view of the system than unconstrained processes. (For certain values of “unconstrained.”)
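You can check Dan’s point on any Linux box. A quick illustrative sketch (Linux-only, no special privileges needed, and my example rather than anything from the talk): even a plain, “uncontained” process has cgroup membership and a full set of namespace handles.

```python
import os

# Show the cgroup membership of this supposedly uncontained process.
with open("/proc/self/cgroup") as f:
    print(f.read())

# Show its namespace handles -- the same namespaces container runtimes
# use, just shared with the rest of the "host" container.
for ns in sorted(os.listdir("/proc/self/ns")):
    print(ns, "->", os.readlink(f"/proc/self/ns/{ns}"))  # e.g. pid -> pid:[4026531836]
```

A container runtime doesn’t add magic; it hands a process a more restrictive set of these same namespaces and cgroups.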


Flock Day One: All Containers, All the Time

This year, Fedora’s Flock conference is being held on Cape Cod, Massachusetts, following the tick/tock cadence of North America/Europe. Last year, I was helping to organize the conference (in Prague), and this year I get to turn up and enjoy the event while other folks (like Brian Exelbierd, Jen Madriaga, and many others) do the wrangling. Spoiler alert: it’s a lot more fun attending than running a conference.

Day one kicked off with Matthew Miller (Fedora Project Leader, for those folks not heavily involved in the Fedora Project) giving a “State of Fedora” overview. I’ll probably write more about this later, but the tl;dr: things are good, as far as uptake of Fedora goes, though they could be better. Fedora 25 and 26 have seen great uptake, people seem to like the latest releases, and they’re getting good reviews.

Communication Anti-Patterns

Let’s get this out of the way: yes, I’m old and grumpy. I have more than a few “get off my lawn!” moments. But sometimes… sometimes they’re justified. Especially when confronted with some of the common communication anti-patterns I run into day after day when working with distributed communities/workers. Here are a few things you shouldn’t do, or should stop doing if you already do them.


Project Fi and replacement phones: Android could learn from Fedora…

I’ve had really good luck with smartphones (/me knocks on wood) over the years. I’ve dropped phones a number of times, but other than a few scuffs and scratches, no permanent damage. (My first-generation iPhone did have an unfortunate encounter with a softball years ago, but since then – smooth sailing.) This weekend, though, I biffed the Nexus 6 just wrong on the tile floor and the screen got the worst of it.

Happy New Year! (Foiled by DDoS…)

So – one of the resolutions I was kicking around for 2016 was to blog more often, perhaps daily. I got up bright and early on January 1st… ok, that’s a lie. I got up around 8 a.m. after the cat batted my nose repeatedly. But I got up, and after the morning ritual of feeding the cats, thought I would log into the blog and write a little something.

Unfortunately, my hosting provider (Linode) was suffering a DDoS attack, and connecting to my server between January 1 and yesterday proved difficult, if not impossible. Here’s hoping the rest of 2016 goes a little smoother!