When everything looks like a nail

[Image: a hammer hanging on a nail sticking out of a wooden fence. “Hammer” by Jerry Swiatek, CC BY 2.0]

I’ve long known the adage “When you have a hammer, everything looks like a nail,” but I’ve only recently come to appreciate its truth.

As I’ve mentioned on this blog before, my main client for the last year and a half has been OpenHatch, for whom I’ve been organizing events and event series. (I do many other things for them but the event series are my biggest focus.) A few months ago I was chatting with science librarian Thea Atwood about the disappointing lack of interest in open science at most of the Five College schools, especially my alma mater, Hampshire. We decided to throw together an event in mid-October to address that.

Event-planning has become a “hammer” for me: a tool for addressing problems that feels easy and natural. When I see an issue that could plausibly be fixed by throwing an event, I instinctively think about doing so.

I have several other hammers. The first one I picked up was writing stories. I write for fun, yes, but I also write to fix problems: my first novel is a way of articulating flaws in libertarianism, my children’s book is a response to the way society was gendering my best friends’ children, and my current project is meant to encourage girls to go into technology.

In college, I gained another hammer: doing experiments. When I have a question, I often find myself mentally designing the process that would let me answer it, whether that’s observation, controlled manipulation, or analysis of pre-existing data. Sometimes I even get to carry out these experiments, though of course that was more common when I worked in a lab.

When I learned to program, I added the hammer of “make a website!” though my grip can be somewhat shaky. I tend to brainstorm static websites, simple apps, or sites that use basic MySQL-style databases, because that’s what I feel comfortable creating. There’s still a great deal of web development that I’m unfamiliar with and therefore don’t think of when faced with a problem.

Which brings me to my worry: with so many tools in my belt, do I have a false sense of security that I’m choosing the best method for approaching a particular problem? There are still so many approaches I can’t take. My response to a problem is almost never “Make a business!” or “Create a physical object!” or “Write a song!” or “Check/change the law!”, because those aren’t things that I know how to do.

The next time I think of a story, an experiment, a website or an event as the best answer to a given problem, I want to take a step back and think about what the best solution really is. I want to force myself to come up with an approach that is outside my comfort zone. And I want to keep improving my toolset.

What are your hammers, and what hammers do you wish you had?

Supporting AdaCamp and the Ada Initiative

If you’ve never been to a feminist conference, you’re missing out. If you’ve never found yourself surrounded by dozens of brilliant, empathetic, creative and determined women, you should consider giving it a try. If you’ve never gone from learning about how open source cloud computing platforms work straight to a discussion of microaggressions and how to deal with them, finishing things off by sharing your favorite feminist response gifs – well, maybe you should come to AdaCamp.

I’ve been to a number of open source and technical conferences over the last few years, most of which I’ve thoroughly enjoyed. But AdaCamp is a special kind of experience. It’s given me so much:

It gave me the ability to see how a major conference’s code of conduct was deeply flawed, and the confidence to approach its organizers with suggestions for how to fix it.

It’s encouraged me to speak frankly about diversity in our communities and how to improve it. (Including tomorrow!)

And it’s helped me to meet so many incredible women who are now my colleagues and my friends.

For all of this, I was glad to donate today to the Ada Initiative. If you’d like to do the same, you can do so here:

Donate now

Verification is hard

A while back I interviewed some core contributors to Wikipedia’s WikiProject Medicine. While I’m wildly enthusiastic about their work, I can’t help feeling that Wikipedia falls far short of being an ideal medium for sharing medical knowledge.

A thought experiment demonstrates this. Suppose you read a claim on a Wikipedia page such as “Rubbing toothpaste on your feet has been shown to cause permanent stains in up to 20% of people”. How many people are you trusting? How difficult is this statement to verify?

Well, to begin with, there’s always the possibility of vandalism. You can view the history of the page and see if this particular statement was added recently. If it was, and by someone without much of an edit history, it may very well be vandalism. If the statement has made it through many page edits or was added by a trusted user, it’s probably not. This level of verification requires a relatively sophisticated understanding of how Wikipedia works, but is not a huge barrier.
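(For the technically inclined: here’s a minimal sketch of what that history check could look like in code, using the public MediaWiki API via Python’s requests library. The article title and the number of revisions to fetch are made-up examples.)

    # A minimal sketch: pull the most recent revisions of a Wikipedia article
    # through the public MediaWiki API and see who changed what, and when.
    # The article title and revision limit here are arbitrary examples.
    import requests

    response = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "prop": "revisions",
            "titles": "Toothpaste",            # hypothetical article of interest
            "rvprop": "timestamp|user|comment",
            "rvlimit": 20,                     # how far back to look
            "format": "json",
        },
    )

    for page in response.json()["query"]["pages"].values():
        for rev in page.get("revisions", []):
            print(rev["timestamp"], rev["user"], rev.get("comment", ""))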

The statement may not be vandalism, but it may be in dispute. Hopefully if this is the case it will be tagged as such, allowing you to learn about Wikipedia’s disputed statement procedures and visit the talk page to view the argument.

Let’s say that Wikipedia has come to the consensus that the statement is true, but without much discussion. The claim cites a research article from a scientific journal. The most obvious obstacle is that the article may be closed access. In this case, you have multiple options, each of which presents additional barriers:

  • You can pay $30 or so to access the article.
  • You can email the authors of the article and ask for a copy. This requires a certain amount of social bravado, and is frequently unsuccessful.
  • You can ask a friend with access to get you a copy. To have friends with access in the first place requires a significant amount of privilege, and is also often illegal.
  • You can just read the abstract. If you only read the abstract, you are not really verifying the research – only the fact that the researchers really came to the conclusions that Wikipedia said they did.

Let’s say you can access the article. Reading and understanding it presents another significant barrier, as most articles are not written to be accessible to laypeople, even educated ones. Understanding the article may take hours or days of background research.

Furthermore, most journal articles are written in a traditional format, which does not allow the reader to verify many of the details of the work. (I’ve written about the failures of the traditional journal article format here.) So even reading and understanding the article may not be enough to verify the claim yourself.

In the end, you may spend days or weeks trying to verify the claim, only to be unsuccessful. If you don’t try to verify, you end up trusting the specific Wikipedia user who added the claim, the general Wikipedia community’s system for attaining accuracy, and the researchers who wrote the cited article. Even with the best of intentions, each one of these trusted actors can easily fail.

Obviously these are not trivial problems to fix, and I don’t blame Wikipedia for not having it all figured out yet. Our world is a messy and corrupt one, not an ideal one.

It’s still worth stating:

An ideal verification system allows people to verify as deeply and as broadly as they want to, without either overloading them with information or providing unnecessary hurdles to them learning more. Not coincidentally, the ideal education system allows people to learn as deeply and as broadly as they want to, without either overloading them with information or providing unnecessary hurdles to them learning more.

I hope that eventually we’ll find our way towards better platforms and knowledge systems, where the only limits to verification are the abilities and inclinations of our collective minds.

Just one in a crowd

A few months ago, I joined a project called Crowdstorming a Dataset. It’s a project affiliated with the Center for Open Science, and its basic premise is this: what if you gave a single dataset to dozens of researchers and asked them to prove or disprove a particular hypothesis? What are the different analytical approaches they might take? Would they all give similar answers? Once they’re given the opportunity to give and receive feedback, would their answers and methodologies converge?

The project is far from finished, and those answers are still mostly unknown, but as I finished with my role this week, I thought I’d take a moment to reflect.

Deciding to Join

I hesitated before joining this project. Not because I thought it wasn’t valuable, but because I worried my skills would be inadequate. Many of the researchers involved have far more training, experience, and resources than I do when it comes to data analysis. What if my proposal was flat-out wrong?

In the end, I decided that any contribution I made would be valuable. While a typical researcher might have more than a single stats class to their credit, education and experience are no guarantee against making mistakes. If my analysis plan was poor, it would test whether reviewers could identify its flaws. If my execution was off, it would signal that conceptual review is insufficient without technical review.

In order to facilitate the discovery of errors – and also because I like to use and promote good tools – I did my analysis in the form of an IPython Notebook, with ample documentation and commentary. You can find the notebooks here.

The Basic Structure

Researchers were given two research questions: (1) Are soccer referees more likely to give red cards to dark skin toned players than light skin toned players? and (2) Are soccer referees from countries high in skin-tone prejudice more likely to award red cards to dark skin toned players?

We were also supplied with a large dataset of player-referee dyads, which included information such as the number of red cards given by said referee to said player, the number of games in which they both participated, the number of goals scored by the player during those games, bias scores for the referee’s country of origin, skin tone ratings of the player by two independent raters, and more.

We were asked to create and implement an analysis plan. We reported the plan and the results separately to the organizers, who set up a system for us to peer-review the former. Each research group was asked to give feedback on at least three other analysis plans. We then altered our own analysis plans as we felt necessary in response to the feedback, re-did our analysis, and reported back to the organizers. We also rated our confidence in the hypotheses at various points throughout the study.
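To give a concrete flavor of what an analysis plan can look like in practice, here is a heavily stripped-down sketch of a Poisson regression on the dyad data, using pandas and statsmodels. The file and column names are stand-ins for the variables described above, not necessarily the dataset’s exact names, and the real notebooks do far more cleaning and checking than this:

    # A stripped-down sketch of one possible analysis plan: a Poisson
    # regression of red cards received on player skin tone and
    # referee-country bias. File and column names are stand-ins.
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    dyads = pd.read_csv("player_referee_dyads.csv")   # hypothetical filename

    # Average the two independent skin tone ratings into a single score.
    dyads["skintone"] = (dyads["rater1"] + dyads["rater2"]) / 2

    # Question 1: does skin tone predict red cards?
    # Question 2: do the referee country's bias scores moderate that effect?
    model = smf.glm(
        "redCards ~ skintone + skintone:meanIAT + skintone:meanExp",
        data=dyads,
        family=sm.families.Poisson(),
    )
    result = model.fit()
    print(result.summary())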

You can read more details about the project here.

Flaws

There were a few hiccups along the way, which is perfectly natural for a first-time project. Hopefully, if there are future iterations, these issues will be addressed.

  • The dataset was not thoroughly described. Most importantly, the organizers did not document the exact nature of the ‘club’ and ‘leagueCountry’ variables included in the dataset. Many researchers, including me, assumed that these variables meant “the club and league that the player was in when this data was gathered”, but they turned out to mean “the club and league that the player began their career with”, which covered an unknown fraction of the data. As a result, the many comments during feedback about how to address the multi-level nature of the data (with players nested in clubs nested in leagues) may have been inappropriate or even inaccurate. It’s worth thinking about best practices for documenting datasets and methodologies. How can we minimize omissions like these?
  • Some plans did not receive enough feedback. One of the most interesting aspects of this project was the opportunity to see whether participants converged on which analysis plans were most likely to be effective and accurate. However, due to how this process was designed, there was significant variation in the number of ratings received. The average team received 5 ratings and responses, but many received only 2 or 3. How much is enough to indicate consensus? Surely it was too much to ask everyone to rate all 31 approaches, but I’m not sure how informative the ratings data actually are. I also found the qualitative feedback to be somewhat lacking, with some groups skipping it entirely and a few providing commentary that was too terse to be particularly useful.
  • For the final analysis, the organizers requested that we provide our results in the format of an odds ratio or Cohen’s d. This presented a problem for me, as the results of a Poisson regression are not easily converted to either of these statistics. I ended up submitting an incidence rate ratio (see the sketch just after this list), which will hopefully be useful. There is a tension here: constraining the result format too tightly falsely limits the kinds of approaches researchers can take, but accepting many different formats practically limits the ways in which results can be compared.
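To illustrate that last point: with a Poisson model like the sketch above, exponentiating a coefficient gives an incidence rate ratio – the multiplicative change in the red-card rate per unit increase in a predictor – rather than an odds ratio. Continuing from the hypothetical result object above:

    # Exponentiating Poisson regression coefficients yields incidence rate
    # ratios (IRRs). Continues the hypothetical `result` from the earlier sketch.
    import numpy as np
    import pandas as pd

    irr = np.exp(result.params)           # point estimates on the IRR scale
    irr_ci = np.exp(result.conf_int())    # 95% confidence intervals

    print(pd.concat([irr.rename("IRR"), irr_ci], axis=1))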

Educational Value

Regardless of the meta-analytical results, I think this protocol has strong value as an educational tool. Here are just a few topics I gained further understanding of:

  • Possibly the most helpful piece of feedback I received was that ‘games’ should have been an offset or exposure variable in my regression. This was not a concept I had heard of before, but a little reading made clear that the reviewer was absolutely correct. Offset/exposure variables are used with count data when the opportunity for events to occur – usually time – differs across observations; hence the term ‘exposure’. (The sketch after this list shows what that looks like in practice.)
  • Although I was familiar with multicollinearity before this project, I had never grappled with it in a practical context. Multicollinearity occurs when two or more predictor variables in a model are highly correlated with each other. Including multicollinear variables doesn’t harm the predictive power of the model as a whole, but it can make the estimates for individual predictors wildly inaccurate. Since the hypotheses were not about predicting red cards as a whole but about the influence of particular predictors – skin tone rating, mean implicit bias, and mean explicit bias – this was a serious issue. One site I read suggested splitting the data and comparing coefficient values, but it was not clear to me how to interpret those results. Couldn’t high variance in a regression coefficient mean that there’s no true effect, as opposed to an effect being obscured by multicollinearity? (The sketch below checks for multicollinearity with variance inflation factors instead.)
  • A piece of feedback on one of the other analysis plans mentioned the Akaike Information Criterion (AIC). This turns out to be a sort of abstracted way to compare models for a given dataset: it combines the likelihood of the observed data under a specific model with a penalty for the number of parameters, which discourages overfitting (the corrected variant, AICc, also accounts for sample size). I would be interested in seeing the AIC values for the different models submitted in this project! (The sketch below prints the AIC for two candidate models.)
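Tying those three threads together, here is a hedged sketch of how the revised model might look, continuing from the hypothetical dyads data frame in the earlier sketch: games enters as an exposure via an offset, variance inflation factors flag multicollinearity among the predictors, and AIC lets you compare candidate models.

    # Continues the hypothetical `dyads` DataFrame from the earlier sketch:
    # (1) treat games as exposure via an offset, (2) check multicollinearity
    # with variance inflation factors, (3) compare candidate models by AIC.
    import numpy as np
    import statsmodels.api as sm
    import statsmodels.formula.api as smf
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # (1) Offset/exposure: model the red-card *rate* per game rather than the
    # raw count, since dyads differ in how many games they share.
    revised = smf.glm(
        "redCards ~ skintone + skintone:meanIAT + skintone:meanExp",
        data=dyads,
        family=sm.families.Poisson(),
        offset=np.log(dyads["games"]),
    ).fit()

    # (2) Multicollinearity: VIFs well above roughly 5-10 suggest predictors
    # are largely redundant and their individual coefficients are unstable.
    predictors = sm.add_constant(dyads[["skintone", "meanIAT", "meanExp"]].dropna())
    for i, name in enumerate(predictors.columns):
        if name != "const":
            print(name, variance_inflation_factor(predictors.values, i))

    # (3) AIC: lower is better, with a penalty for extra parameters.
    simple = smf.glm(
        "redCards ~ skintone",
        data=dyads,
        family=sm.families.Poisson(),
        offset=np.log(dyads["games"]),
    ).fit()
    print("skin tone only, AIC:         ", simple.aic)
    print("with bias interactions, AIC: ", revised.aic)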

Looking Forward

Although we await the organizers’ report, I can already say that I found this to be a valuable and informative project. I thank Raphael Silberzahn, Eric Luis Uhlmann, Dan Martin and Brian Nosek for conducting it, and I hope it is not the last of its kind.

Attention Rob Thomas

I could write plenty about my experiences at Hackers on Planet Earth (HOPE) this weekend, and I probably will, later. But I had a quick thought during Steve Rambam’s talk on privacy loss yesterday that I wanted to share with you all.

How great would it have been if the Veronica Mars movie had focused on big data, surveillance and privacy issues instead of that throwaway murder mystery plot? I’m imagining Mac as a major player – a sysadmin for the government and perhaps a Snowden-esque leaker, or a corporate whistleblower. The Kanes are technology tycoons in the story – you could easily make them a stand-in for Facebook, Google, etc. There could even be a subplot about the Neptune police department buying drones and using new tools to invade people’s privacy.

It would have been relevant, provocative and even educational for viewers. Mac could’ve name-dropped real privacy tools like Tor, SecureDrop, CryptoCat, etc., and Veronica could’ve wrestled with the hypocrisy of fighting for privacy rights while she invades people’s privacy all the damn time.

Bonus: the focus would be on the two smart, complicated female leads and their friendship, rather than on predictable romantic subplots.

Maybe there’s hope for the next movie?