Verification is hard

A while back I interviewed some core contributors to Wikipedia Project Medicine. While I’m wildly enthusiastic about their work, I can’t help feeling that Wikipedia falls far short of being an ideal medium for sharing medical knowledge.

A thought experiment demonstrates this. Suppose you read a claim on a Wikipedia page such as “Rubbing toothpaste on your feet has been shown to cause permanent stains in up to 20% of people”. How many people are you trusting? How difficult is this statement to verify?

Well, to begin with, there’s always the possibility of vandalism. You can view the history of the page and see if this particular statement was added recently. If it was, and by someone without much of an edit history, it may very well be vandalism. If the statement has made it through many page edits or was added by a trusted user, it’s probably not. This level of verification requires a relatively sophisticated understanding of how Wikipedia works, but is not a huge barrier.

The statement may not be vandalism, but it may be in dispute. Hopefully if this is the case it will be tagged as such, allowing you to learn about Wikipedia’s disputed statement procedures and visit the talk page to view the argument.

Let’s say that Wikipedia has come to the consensus that the statement is true, but without much discussion. The statement cites a research article from a scientific journal. The most obvious obstacle is that the article may be closed access. In this case, you have multiple options, each of which presents additional barriers:

  • You can pay $30 or so to access the article.
  • You can email the authors of the article and ask for a copy. This requires a certain amount of social bravado, and is frequently unsuccessful.
  • You can ask a friend with access to get you a copy. Having friends with access in the first place requires a significant amount of privilege, and sharing copies this way is often illegal.
  • You can just read the abstract. If you only read the abstract, you are not really verifying the research – only the fact that the researchers really came to the conclusions that Wikipedia said they did.

Let’s say you can access the article. Reading and understanding it presents another significant barrier, as most articles are not written to be accessible to laypeople, even educated ones. Understanding the article may take hours or days of background research.

Furthermore, most journal articles are written in a traditional format, which does not allow the reader to verify many of the details of the work. (I’ve written about the failures of the traditional journal article format here.) So even reading and understanding the article may not be enough to verify the claim yourself.

In the end, you may spend days or weeks trying to verify the claim, only to be unsuccessful. If you don’t try to verify, you end up trusting the specific Wikipedia user who added the claim, the general Wikipedia community’s system for attaining accuracy, and the researchers who wrote the cited article. Even with the best of intentions, each one of these trusted actors can easily fail.

Obviously these are not trivial problems to fix, and I don’t blame Wikipedia for not having it all figured out yet. Our world is a messy and corrupt one, not an ideal one.

It’s still worth stating:

An ideal verification system allows people to verify as deeply and as broadly as they want to, without either overloading them with information or providing unnecessary hurdles to them learning more. Not coincidentally, the ideal education system allows people to learn as deeply and as broadly as they want to, without either overloading them with information or providing unnecessary hurdles to them learning more.

I hope that eventually we’ll find our way towards better platforms and knowledge systems, where the only limits to verification are the abilities and inclinations of our collective minds.

Just one in a crowd

A few months ago, I joined a project called Crowdstorming a Dataset. It’s a project affiliated with the Center for Open Science, and its basic premise is this: what if you gave a single dataset to dozens of researchers, and asked them to test a particular hypothesis? What are the different analytical approaches they might take? Would they all give similar answers? Once they’re given the opportunity to give and receive feedback, would their answers and methodologies converge?

The project is far from finished, and those answers are still mostly unknown, but as I finished with my role this week, I thought I’d take a moment to reflect.

Deciding to Join

I hesitated before joining this project. Not because I thought it wasn’t valuable, but because I worried my skills would be inadequate. Many of the researchers involved have a lot more training, experience, and resources than I do when it comes to data analysis. What if my proposal was flat-out wrong?

In the end, I decided that any contribution I made would be valuable. While a typical researcher might have more than a single stats class to their credit, education and experience are no guarantee against making mistakes. If my analysis plan was poor, it would test whether reviewers could identify those flaws. If my execution was off, it would signal that conceptual review is insufficient without technical review.

In order to facilitate the discovery of errors – and also because I like to use and promote good tools – I did my analysis in the form of an IPython Notebook, with ample documentation and commentary. You can find the notebooks here.

The Basic Structure

Researchers were given two research questions: (1) Are soccer referees more likely to give red cards to dark skin toned players than light skin toned players? and (2) Are soccer referees from countries high in skin-tone prejudice more likely to award red cards to dark skin toned players?

We were also supplied with a large dataset of player-referee dyads, which included information such as the number of red cards given by said referee to said player, the number of games in which they both participated, the number of goals scored by the player during those games, bias scores for the referee’s country of origin, skin tone ratings of the player by two independent raters, and more.

We were asked to create and implement an analysis plan. We reported the plan and results separately to the organizers, who set up a system for us to peer-review the former. Each research group was asked to give feedback on at least three other analysis plans. We then altered our own analysis plans as we felt we needed in response to feedback, re-did our analysis, and reported back to the organizers. We also rated our confidence in the hypotheses at various points throughout the study.

You can read more details about the project here.


There were a few hiccups along the way, which is perfectly natural for a first-time project. Hopefully future iterations will address them.

  • The dataset was not thoroughly described. Most importantly, the organizers did not document the exact nature of the ‘club’ and ‘leagueCountry’ variables included in the dataset. Many researchers, including me, assumed that these variables meant “the club and league that the player was in when this data was gathered”, but they turned out to mean “the club and league that the player began their career with”, which covered an unknown fraction of the data. As a result, the many comments during feedback about how to address the multi-level nature of the data (with players nested in clubs nested in leagues) may have been inappropriate or even inaccurate. It’s worth thinking about best practices for documenting datasets and methodologies. How can we minimize omissions like these?
  • Some plans did not receive enough feedback. One of the most interesting aspects of this project was the opportunity to see whether participants converged on which analysis plans were most likely to be effective and accurate. However, due to how this process was designed, there was significant variation in the number of ratings received. The average team received 5 ratings and responses, but many received only 2 or 3. How much is enough to indicate consensus? Surely it was too much to ask everyone to rate all 31 approaches, but I’m not sure how informative the ratings data actually are. I also found the qualitative feedback to be somewhat lacking, with some groups skipping it entirely and a few providing commentary that was too terse to be particularly useful.
  • For the final analysis, the organizers requested that we provide our results in the format of an odds ratio or Cohen’s d. This presented a problem for me, as the result of a Poisson regression is not easily converted to either of these statistics. I ended up submitting an incidence rate ratio, which will hopefully be useful. There is a tension here: constraining the result format too tightly artificially limits the kinds of approaches researchers can take, but accepting many different formats practically limits the ways in which results can be compared.
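For what it’s worth, the conversion to an incidence rate ratio is mechanical once you have a fitted model: in a log-link Poisson model, exponentiating a coefficient gives an IRR. A minimal sketch, with a made-up coefficient (not a value from the actual project):

```python
import math

# In a Poisson model with a log link, each coefficient is a log rate ratio,
# so exponentiating it yields an incidence rate ratio (IRR).
# Hypothetical coefficient for a binary dark-skin-tone indicator:
beta = 0.25
irr = math.exp(beta)
print(f"IRR = {irr:.3f}")  # 1.284: about 28% more red cards per game
```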

Educational Value

Regardless of the meta-analytical results, I think this protocol has strong value as an educational tool. Here are just a few topics I gained further understanding of:

  • Possibly the most helpful piece of feedback I received was that ‘games’ should have been an offset or exposure variable in my regression. This was not a concept I had heard of before, but a little reading made clear that the reviewer was absolutely correct. Offset (or exposure) variables are used with count data when the opportunity for events to occur – usually time – differs across observations. Hence the term ‘exposure’.
  • Although I was familiar with multicollinearity before this project, I had never grappled with it in a practical context. Multicollinearity occurs when two or more predictor variables in a model are highly correlated with each other. Including multicollinear variables doesn’t harm the predictive power of the model as a whole, but it can make the coefficient estimates for individual predictors wildly inaccurate. Since the hypotheses here were questions not about predicting red cards overall, but about the influence of the particular predictors skin tone rating, mean implicit bias, and mean explicit bias, this was a serious issue. One site I read suggested splitting the data and comparing coefficient values, but it was not clear to me how to interpret the results. Couldn’t high variance in a regression coefficient mean that there’s no true effect, rather than an effect being obscured by multicollinearity?
  • A piece of feedback on one of the other analysis plans mentioned the Akaike Information Criterion (AIC). This turns out to be an abstracted way to compare models fit to the same dataset. It combines the likelihood of the observed data under a given model with a penalty for the number of parameters, discouraging overfitting. I would be interested in seeing the AIC values for the different models submitted in this project!
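The offset idea from the feedback above can be sketched with a toy calculation. The dyads and counts below are invented, but they show why ignoring exposure misleads: two dyads with identical red-card counts can have very different red-card rates.

```python
import math

# Two hypothetical player-referee dyads with identical red-card counts but
# very different exposure (games played together):
cards_a, games_a = 2, 20
cards_b, games_b = 2, 200

# Modeling raw counts treats these dyads identically; modeling the rate
# via an exposure term does not:
rate_a = cards_a / games_a   # 0.10 red cards per game
rate_b = cards_b / games_b   # 0.01 red cards per game

# In a Poisson regression with a log link, exposure enters as an offset,
# a term whose coefficient is fixed at 1:
#   log(E[cards]) = log(games) + X @ beta
# which is equivalent to modeling the per-game rate rather than the count.
offset_a = math.log(games_a)
offset_b = math.log(games_b)
```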
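On the multicollinearity point, one standard diagnostic I later learned about is the variance inflation factor. A small sketch, using simulated data rather than the project’s dataset, and predictor names chosen only to echo the variables discussed above:

```python
import numpy as np

# VIF_i = 1 / (1 - R^2_i), where R^2_i comes from regressing predictor i
# on the other predictors. A large VIF (often > 5 or 10) flags a predictor
# whose coefficient estimate is inflated by multicollinearity.
rng = np.random.default_rng(42)
n = 500
skin_tone = rng.normal(size=n)
implicit_bias = 0.9 * skin_tone + 0.1 * rng.normal(size=n)  # nearly collinear
explicit_bias = rng.normal(size=n)                          # independent
X = np.column_stack([skin_tone, implicit_bias, explicit_bias])

def vif(X, i):
    """Variance inflation factor for column i of the predictor matrix X."""
    y = X[:, i]
    others = np.column_stack([np.ones(len(X)), np.delete(X, i, axis=1)])
    coef, *_ = np.linalg.lstsq(others, y, rcond=None)
    resid = y - others @ coef
    r_squared = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1.0 / (1.0 - r_squared)

for i, name in enumerate(["skin_tone", "implicit_bias", "explicit_bias"]):
    print(f"{name}: VIF = {vif(X, i):.1f}")
```

Here the two correlated predictors show very large VIFs while the independent one stays near 1, which is exactly the pattern that would make their individual coefficients untrustworthy.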
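And as a sketch of how the AIC comparison works: AIC = 2k − 2 ln L, and the model with the lower AIC is preferred. The log-likelihoods and parameter counts below are invented for illustration, not results from the project:

```python
# AIC = 2k - 2*ln(L), where k is the number of estimated parameters and
# L is the maximized likelihood. Values here are invented.
loglik_small, k_small = -812.4, 3   # e.g. skin tone rating only
loglik_big,   k_big   = -811.9, 7   # extra covariates, barely better fit

aic_small = 2 * k_small - 2 * loglik_small   # ~1630.8
aic_big   = 2 * k_big   - 2 * loglik_big     # ~1637.8
# Lower AIC is better; here the extra parameters don't pay for themselves.
```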

Looking Forward

Although we await the organizers’ report, I can already say that I found this to be a valuable and informative project. I thank Raphael Silberzahn, Eric Luis Uhlmann, Dan Martin and Brian Nosek for conducting it, and I hope it is not the last of its kind.

Attention Rob Thomas

I could write plenty about my experiences at Hackers on Planet Earth (HOPE) this weekend, and I probably will, later. But I had a quick thought during Steve Rambam’s talk on privacy loss yesterday that I wanted to share with you all.

How great would it have been if the Veronica Mars movie had focused on big data, surveillance and privacy issues instead of that throwaway murder mystery plot? I’m imagining Mac as a major player – a sys-admin for the government and perhaps a Snowden-esque leaker, or a corporate whistleblower. The Kanes are, in-story, technology tycoons – you could easily make them a stand-in for Facebook, Google, etc. There could even be a subplot about the Neptune police department buying drones and using new tools to invade people’s privacy.

It would have been relevant, provocative and even educational for viewers. Mac could’ve name-dropped real privacy tools like Tor, SecureDrop, CryptoCat, etc and Veronica could wrestle with the hypocrisy of fighting for privacy rights while she invades people’s privacy all the damn time.

Bonus: the focus would be on the two smart, complicated female leads and their friendship, rather than on predictable romantic subplots.

Maybe there’s hope for the next movie?

It’s my time

A little over a year ago I did a survey at the job fair of a major tech conference. At each booth I asked whether they were hiring people part time. The response was almost entirely no way, nuh-uh, never.

There is plenty of research showing that more time at work does not equal more productivity. (Caveat: I have not read the primary research here.)

I want to offer some anecdotal evidence.

I am a freelancer (or a contractor, or self-employed, whatever you want to call it). My main client right now is OpenHatch. Last year the hours I spent on OpenHatch worked out to approximately quarter time. This year my hours are the equivalent of half time. This means that at the end of June I had worked the same number of hours as if I had been hired on full-time for six months. So, what did I have to show for myself at my internal “six month review”?

  • I organized or co-organized 21 Open Source Comes to Campus events, personally running 15 of them.
  • I spoke at two conferences on behalf of OpenHatch, and wrote a proposal to speak at Grace Hopper this year. (The Grace Hopper proposal took an unexpectedly long time, as the process is quite competitive – approximately 20% of submissions are accepted.) I have also run workshops at three conferences.
  • I improved and documented our event planning process and made it far more efficient to use, both internally and for those who want to “fork” the project.
  • I created multiple tools that have been useful for OpenHatch and that other projects have shown interest in using or adapting. These are our In Person Event Guide, WelcomeBot, and Merge Stories.
  • I wrote 25 blog posts for the OpenHatch blog.
  • I helped redesign and maintain the program website.
  • I’ve done interviews (Wired, In Beta, Linux Magazine) which resulted in good publicity for OpenHatch.
  • I made other smaller contributions, including leading documentation sprints, creating and instituting a Code of Conduct for the IRC channel, organizational planning, and helping with the fundraising drive.
  • I have answered an uncountable number of questions via email and IRC.

I am almost certainly forgetting things.

I think this is more than most people can do in six months of full time work. It’s more than I could do in six months of full time work! Clearly, OpenHatch is benefiting from this arrangement.

And I’m benefiting too. I have tons of free time with which I can pursue other opportunities, whether that means working for other clients, or personal pursuits such as writing novels and children’s books, maintaining and writing for the Open Science Collaboration blog, taking online classes and reading non-fiction, and being there for family in the hardest times.

I wish more organizations were open to hiring contractors, because I know that I – and others! – can be amazing assets when given flexibility and independence. It’s funny how the capitalist desire to wring every last drop of productivity out of a worker often extinguishes the spark that makes them productive.

I’m guarding my spark. If I never work full time again, I don’t think I’ll regret it.

This is just to say

I got interviewed about OpenHatch / Open Source Comes to Campus for an article in Wired. Pretty cool!