Stop using GitHub as a measure of open source contributions

With Microsoft and the rest of the tech industry trying to flaunt their open source bona fides, people keep trotting out deeply flawed analyses of “who contributes most to open source” based on … what they can measure on GitHub. For many reasons, using GitHub as the gold standard for open source contributions is extremely biased and doesn’t begin to give an accurate picture of which companies are actually making real contributions to open source.

GitHub is not the home for all open source

It should go without saying, but apparently doesn’t, that GitHub hosts only a fraction of open source projects and activity.

GitHub launched about 10 years ago. Open source and free software development predates GitHub’s existence by twenty years or so. A lot of projects have picked up and moved from their previous homes to GitHub, but many haven’t. GNU projects, for example, aren’t hosted there. Canonical’s Launchpad repository hosts a lot of projects that aren’t on GitHub. Fedora has Pagure, and the Eclipse project and the Apache Software Foundation each run their own source control for their projects.

Some of those may mirror projects on GitHub, but it’s unclear to me how people who don’t have GitHub accounts are counted when people survey GitHub. I’m skeptical that using GitHub APIs to pull user data to see “what company does so-and-so work for?” is effective when that person hasn’t created a GitHub account.
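To make that attribution gap concrete, here is a minimal sketch in Python. It assumes a survey that maps each commit author's email to a "company" field on a hosting-platform profile. The sample emails, the `profiles` lookup, and the `tally_by_company` helper are all hypothetical and made up for illustration; this is not how any particular report was produced.

```python
from collections import Counter

# Hypothetical sample data: one entry per commit, keyed by author email.
commits = [
    "alice@example.com",
    "alice@example.com",
    "bob@example.org",
    "carol@example.net",
    "dave@example.net",
]

# Hypothetical lookup: emails that resolve to a platform profile, and the
# company listed on that profile. carol and dave have no account at all.
profiles = {
    "alice@example.com": "Example Corp",
    "bob@example.org": None,  # has an account, but left the company blank
}

def tally_by_company(commit_emails, profile_company):
    """Count commits per company; anything unresolvable lands in 'unknown'."""
    counts = Counter()
    for email in commit_emails:
        company = profile_company.get(email)
        counts[company or "unknown"] += 1
    return counts

result = tally_by_company(commits, profiles)
print(result)
```

In this toy data set, three of the five commits cannot be credited to any company, either because the author has no account or because the profile is blank. A ranking built on the remaining two would say more about who fills in profiles than about who contributes.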

GitHub metrics are biased towards newer projects, corporate-founded projects, and projects that have a bent towards non-reciprocal licenses.

One metric, even if it’s wrong

We’ve established that GitHub is only a slice of open source development activity. What size slice, I’m not sure, but if I were feeling generous I might say GitHub hosts between 40% and 60% of active and important projects today.

But 1) it’s a biased set of projects because it misses a lot of important and established projects (as well as those that have a bent towards software freedom and not just expediency in development), and 2) it captures only the activity you can measure via GitHub’s tools. (It also misses people who contribute via personal email addresses or project addresses.)

Let’s look at some of the things measuring via GitHub automatically excludes.

Non-development activity: Assuming you’ve identified users properly, etc., you can then measure their activity in contributing code and answering issues, how popular the projects are in the context of GitHub, and some other snazzy metrics.

What you don’t see are non-code contributions in the form of documentation, UX work, legal work, and other activities that healthy projects depend on. Yes, some projects do also host docs on GitHub, but many don’t. And you don’t see the work that a company contributes in the form of lawyers approving or vetting code that moves from proprietary to open source, or verifying whether a project’s CLA is acceptable.

Microsoft, for example, has probably had teams of lawyers working very hard as they move things to GitHub. That work? Completely silent and unobserved on GitHub – but essential. I know the legal teams at companies like SUSE and Red Hat spend a great deal of time supporting open source projects — and that work isn’t going to be surfaced when asking GitHub’s API for commits and PR data.

Quality assurance and testing: In many cases, when you see a PR or commit, that is the result of a developer fixing a bug somebody found in testing work outside GitHub. While GitHub repos have ways to report issues, if you look at projects that are shipped by companies like Microsoft or Red Hat, there’s a ton of bug reporting that happens outside GitHub. After extensive testing and examination there might be an issue filed on GitHub or it might be reflected in a PR, but you don’t see the bulk of work that’s gone into it.

Distribution: GitHub also doesn’t show you the work that goes into, for example, putting together Fedora or OpenStack. You don’t see the work that goes into packaging software to work with other software, or the work that goes into maintaining build systems and so forth that don’t live at GitHub.

Every six months or so, Fedora puts out a release with multiple ISO images for installation, huge sets of packages in repositories for multiple arches, as well as container images, rpm-ostree images, and more. All of that comes with documentation, artwork, promotional activity, and infrastructure. These are all open source contributions from thousands of people (many are Red Hat employees, many aren’t), and they happen almost entirely away from GitHub.

The same is true for Ubuntu, and there’s also Linux Mint, Debian, and openSUSE on various release cadences. So measuring only GitHub ignores all of that and wrongly credits newcomers like Microsoft with more influence than they actually have.

Effing the ineffable

Finally, measuring GitHub projects en masse without any real metric for the importance of the projects, their adoption, or their value to the ecosystem and the companies contributing fails to provide an accurate picture. You can count GitHub stars and downloads and PRs and issues, but that doesn’t really give a full view of the importance of a project.

One company (or project, like Debian or Apache) may make its entire portfolio available as open source, while other companies focus their activities on open source they hope will add value to their core platforms. To be blunt, Microsoft’s strategy is to embrace open source so that its key platforms (Azure, Windows) have workloads to run.

This is not to say Microsoft isn’t a valuable contributor in those communities or that its contributions aren’t worthwhile. But if the company is just helping to entrench proprietary platforms by making open source work better with them, it’s premature to celebrate that company as a champion of open source.

It’s also faulty to claim that Microsoft or anybody else contributes “the most” to open source based on a survey of a single platform. (Especially when that platform is owned by one of the companies. Of course more of their developers are going to be doing work there!)

In short, people need to stop trotting out reports that only focus on GitHub if they want to claim an accurate picture of “open source.” It might be fair to use GitHub when examining a single project (like Kubernetes), but it’s not a fair representation of open source overall. Failing to understand that is perhaps understandable when talking to laypersons who don’t participate in open source or claim to be experts. But for anyone who does claim expertise, failing to understand that demonstrates a questionable understanding of open source. Simply ignoring it or excusing it and using GitHub as a metric anyway is deceptive.

Amazon’s open source aspirations and actions

I stayed up last night to watch Amazon’s Tuesday night keynote for AWS re:Invent. Lemme tell you, I am not at all sad to be missing the crowds at re:Invent, and kudos to Amazon for its high-quality production values for the keynotes.

One of the things that really interested me, but wasn’t deeply explored, was the mention of Amazon’s home-grown KVM hypervisor and its Nitro setup, where it offloads networking, management, and storage to separate hardware and gives instances all the resources on the machine. (This is going by Peter DeSantis’ description and my following along with the keynote past midnight, so…)

Later in the keynote session when they brought Netflix on, they made some noises about open source and talked about their TLS implementation s2n. Haven’t dived deeply into s2n, but it sounds like they’re doing the right thing with this project, and a strong encryption alternative that has deep-pocket backing is not a bad thing at all.

But what struck me is the dichotomy of talking about open source and its importance for s2n, but glossing over completely their modifications or plans for KVM as a project. There’s a huge KVM community and I’m sure that they’d love to have Amazon participating actively. As far as I know, though, this isn’t happening.

Amazon has made moves to start an open source office and is doing more work in open source, but there’s a huge deficit between what Amazon builds off of open source and what it contributes back. If the company is serious about open source, it has an opportunity to make an enormous impact. I just hope the plan isn’t to limit its contributions to fringe or non-crucial projects and keep vital projects like Nitro/KVM behind closed doors away from the rest of the industry.

Communication Anti-Patterns

Let’s get this out of the way: Yes, I’m old and grumpy. I have more than a few “get off my lawn!” moments. But sometimes… sometimes, they’re justified. Especially when confronted with some of the common communication anti-patterns I run into day after day when working with distributed communities/workers. Here are a few things you shouldn’t do, or should stop doing if you do them.

Proprietary tools for FOSS projects

My position on free and open source software is somewhere in the spectrum between the hard-core FSF/GNU position on Free Software and the corporate open source pragmatism that looks at open source as being great for some things but really not a goal in and of itself. I don’t eschew all proprietary software, and I’m not going to knock people for using tools and devices that fit their needs rather than sticking only to FOSS.

At the same time, I think it’s important that we trend towards everything being open, and I find myself troubled by the increasing acceptance of proprietary tools and services by FOSS developers/projects. It shouldn’t be the end of the world for a FOSS developer, advocate, project, or company to use proprietary tools if necessary. Sometimes the FOSS tools aren’t a good fit, and the need for something right now overrides the luxury of choosing a tool just based on licensing preference. And, of course, there’s a big difference between having that discussion for a project like Fedora, or an Apache podling/TLP, or a company that works with open source.

Fedora is generally averse to adopting anything proprietary; even using things like YouTube or Twitter to promote Fedora tends to generate discussion and questions about whether it’s proper to use proprietary services. Grudgingly, though, most folks have accepted that to promote Fedora you have to go where the people are, even if that means using non-FOSS services. Apache has been more willing to adopt non-free services (e.g., Jira) even where acceptable FOSS services exist. Not surprising, because Apache’s culture is more “use open source because it’s pragmatic” rather than driven by ideology. (That is painting with a very broad brush, and I think you can find a diverse set of opinions within Apache, including mine.)

Generally, though, I worry about making too many concessions to non-free software. I worry that we’ve gone too far towards business concerns, and too far away from wanting to change the world for the better. There’s a balance to be struck, I think, where we put food on the table, build successful companies and successful and sustainable communities. Where we use tools we’ve built to do our work, and tools we can improve, but don’t rake people over the coals because of the tools they choose or make bad business decisions out of a desire for purity.

This post asking people not to use Slack really resonates with me. I see this as a wholly unnecessary adoption of proprietary software where there’s a reasonable and serviceable alternative. The good news, I think, is that Slack seems to be spurring some development of better IRC alternatives that might not have developed without Slack. And it’s spurred more people to think about the tools they use, and whether they’re open, and what that means. Full disclosure, I have a personal Slack account. I’ll use it to chat with friends, just like I’ll use Facebook or Google Hangouts. But I don’t see myself recommending it as an official channel for, say, Project Atomic.

Marketing is not a spectator sport…

A piece over on Fedora Magazine, following a talk I did at Flock this summer. The short version: open source projects need all the help they can get spreading the word. Fedora, Apache projects, GNU projects, Debian, etc., all depend on word of mouth to reach users. By reaching users, we find new contributors, and it’s the new contributors that help keep projects going and reaching new users. We don’t have megabucks to throw at ad campaigns, but we have millions of users, and the impact would be enormous if even 10% of those users spent a little time spreading the word about Fedora (or another project).

More users mean more contributors. More contributors mean better projects. Better projects mean more users, and fewer people choosing proprietary solutions. Don’t wait for somebody else to spread the word; jump in and lend a hand.

The marketing group was a bit disorganized in the F23 cycle, and we can do much better. I hope to do more in the F24 cycle, but I can’t do it alone, and don’t really want to! So if you want to see Fedora succeed wildly, I hope you’ll find a way to join our efforts. Read the full piece on Fedora Magazine, and feel free to ask if you need help jumping in!

Just what is “open,” anyway?

Here’s something I spend a lot of time thinking about: What constitutes “real” open source? Not just the license; I think the OSI has done just fine in defining an open source license. (And the GNU/FSF folks have done just fine in defining a Free software license as well.)

I’m asking, what constitutes a real open source project? What are the specific things you need to say “yep, this is a genuine open source project that really deserves the title”?

Curious what other folks think. It probably comes as no surprise that I don’t consider a project “open” just because there’s a public repository with code that is under an OSI-approved license.

Also curious whether any bodies like the OSI have a working definition. So many projects and companies lay claim to open source, but I see very little of it in practice.

The evils of top-posting…

Answer: Because it makes it hard to follow conversations in email.

Question: Why is top-posting evil?

I know I’m fighting a losing battle here, but I occasionally feel compelled to remind people just how inefficient top-posting is for multiple-participant conversations. This is doubly true for people added after the conversation is started.

It takes a little longer, but it’s so much nicer if you can read an email thread from top to bottom rather than having to scroll to the bottom, read, scroll backward, read, scroll backward, read, etc. Yes, it’s the easiest way to reply to a message, but it’s an enemy of comprehension for recipients.

What good is open source nobody knows about?

Here’s a pet peeve of mine, because I see it time and time again: Folks work on software or projects, put in a ton of effort, and then do nothing to promote the project or release. (And, for bonus points, complain that they don’t understand why the project isn’t getting more attention!)

This doesn’t mean developers have to do double-duty as marketeers and public relations folks. Well, not if they can pass the torch on to interested contributors who are happy to do it for them, anyway. It requires a little coordination and effort, but why put all the work into a project and then not get the attention of the users (and potential contributors) you’re trying to reach?

Additionally, it really helps to blog, tweet, and otherwise spread the word about projects while they’re in progress. If you want people to collaborate, they really need to know that you’re doing something.

This isn’t necessarily intuitive for folks, I understand. But it is absolutely, vitally, necessary. Maybe, occasionally, a project is just so darn awesome that somebody happens to stumble on it via GitHub or whatever and word of mouth makes it a success – but typically, things get out into the world via consistent updates and communications to the right channels to get the word out.

Is that the right mailing list? Is that the right audience?

Quick thought for the day: are you sending that message to the right mailing list?

If you work in open source, odds are you spend a lot of time working with people via email. At Red Hat we have internal mailing lists for developers that work on projects, and external mailing lists for projects, as well as internal lists for specific groups, topics, etc. I’m also, less than I’d like these days, involved in the Apache Software Foundation (ASF), which has user lists and developer lists for projects, announce lists, and a variety of private lists for projects and specifically for members, fundraising, and so on.

I could write volumes about what’s good and bad about various Mail User Agents (MUAs) and mailing list software. But this is not about that–this is about bad habits that people fall into when opening and conducting discussions on mailing lists. Specifically, whether they’re going to the right place.

All too often, I see people opting to go for the least-public list when opening discussions. Part of this, I think, is just human laziness. You get into a routine, and stick with it. This is doubly hard to overcome when an initiative starts “behind the firewall” and then moves into the public.

Part of this is a tendency to stay with a familiar group. It can be “scary” to expose your ideas, commentary, plans, or whatever to a large audience. It can also, honestly, be annoying. Everybody has an opinion, and filtering through all the opinions and commentary can be a royal pain in the posterior. Separating the wheat from the chaff can be tricky when you do opt for openness and then have to filter through all of the digressions, uninformed opinions, and (occasionally) dissent to come to a decision.

I could probably write volumes on this topic, but I promised a quick thought. So, in a nutshell: Think before you start a conversation on a mailing list. Are you sending it to a private list to avoid discussion or exposure, or is there a good reason the conversation needs to be private? (Alternatively, are you sure of the audience you’re sending to? Are you sending anything group/company confidential to too wide an audience? It happens infrequently, but it can be a big problem when it does.) If there’s no good reason for privacy, then break the habit and opt for openness. You might just be surprised how effective that can be, so give it a shot.