A few months ago, I joined a project called Crowdstorming a Dataset. It’s a project affiliated with the Center for Open Science, and its basic premise is this: what if you gave a single dataset to dozens of researchers and asked them to prove or disprove a particular hypothesis? What are the different analytical approaches they might take? Would they all give similar answers? Once given the opportunity to give and receive feedback, would their answers and methodologies converge?
The project is far from finished, and those answers are still mostly unknown, but as I finished with my role this week, I thought I’d take a moment to reflect.
Deciding to Join
I hesitated before joining this project. Not because I thought it wasn’t valuable, but because I worried my skills would be inadequate. Many of the researchers involved have far more training, experience, and resources than I do when it comes to data analysis. What if my proposal was flat-out wrong?
In the end, I decided that any contribution I made would be valuable. While a typical researcher might have more than a single stats class to their credit, education and experience are no guarantee against making mistakes. If my analysis plan was poor, it would test whether reviewers could identify those flaws. If my execution was off, it would signal that conceptual review is insufficient without technical review.
In order to facilitate the discovery of errors – and also because I like to use and promote good tools – I did my analysis in the form of an IPython Notebook, with ample documentation and commentary. You can find the notebooks here.
The Basic Structure
Researchers were given two research questions: (1) Are soccer referees more likely to give red cards to dark-skin-toned players than to light-skin-toned players? and (2) Are soccer referees from countries high in skin-tone prejudice more likely to award red cards to dark-skin-toned players?
We were also supplied with a large dataset of player-referee dyads, which included information such as the number of red cards given by said referee to said player, the number of games in which they both participated, the number of goals scored by the player during those games, bias scores for the referee’s country of origin, skin tone ratings of the player by two independent raters, and more.
We were asked to create and implement an analysis plan. We reported the plan and results separately to the organizers, who set up a system for us to peer-review the former. Each research group was asked to give feedback on at least three other analysis plans. We then altered our own analysis plans as we felt necessary in response to the feedback, re-did our analyses, and reported back to the organizers. We also rated our confidence in the hypotheses at various points throughout the study.
You can read more details about the project here.
There were a few hiccups along the way, which is perfectly natural for a first-time project. Hopefully, if there are future iterations, these will be addressed.
- The dataset was not thoroughly described. Most importantly, the organizers did not document the exact nature of the ‘club’ and ‘leagueCountry’ variables included in the dataset. Many researchers, including myself, assumed that these variables meant “the club and league that the player was in when this data was gathered”, but they turned out to mean “the club and league that the player began their career with”, a definition that covered an unknown fraction of the data. As a result, the many comments during feedback about how to address the multi-level nature of the data (with players nested in clubs nested in leagues) may have been inappropriate or even inaccurate. It’s worth thinking about best practices for documenting datasets and methodologies. How can we minimize omissions like these?
- Some plans did not receive enough feedback. One of the most interesting aspects of this project was the opportunity to see whether participants reached consensus about which analysis plans were most likely to be effective/accurate. However, because of how this process was designed, there was significant variation in the number of ratings received. The average team received 5 ratings and responses, but many received only 2 or 3. How much is enough to indicate consensus? Surely it was too much to ask everyone to rate all 31 approaches, but I’m not sure how informative the ratings data actually are. I also found the qualitative feedback to be somewhat lacking, with some groups skipping it entirely and a few providing commentary that was too terse to be particularly useful.
- For the final analysis, the organizers requested that we provide our results in the format of an odds ratio or Cohen’s d. This presented a problem for me, as the result of a Poisson regression is not easily converted to either of these statistics. I ended up submitting an incidence rate ratio, which will hopefully be useful (a quick sketch of that conversion follows this list). There is a tension here: to constrain result format too tightly is to falsely limit the kinds of approaches researchers take, but to accept many different formats is to practically limit the ways in which results can be compared.
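For anyone curious about that conversion: a Poisson model uses a log link, so exponentiating a coefficient gives an incidence rate ratio rather than an odds ratio. Here is a minimal sketch with made-up numbers, purely for illustration:

```python
import numpy as np

# Hypothetical coefficient and 95% CI bounds from a Poisson (log-link) regression;
# these values are invented for illustration, not results from the actual analysis.
beta, ci_low, ci_high = 0.15, 0.05, 0.25

# Exponentiating log rate ratios yields incidence rate ratios (IRRs).
print("IRR = %.2f (95%% CI %.2f to %.2f)" % (np.exp(beta), np.exp(ci_low), np.exp(ci_high)))
```

An IRR of 1.16, for example, would be read as “a 16% higher red-card rate per unit increase in the predictor” – a rate-based analogue of an odds ratio, not the odds ratio itself.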
Regardless of the meta-analytical results, I think this protocol has strong value as an educational tool. Here are just a few topics I gained further understanding of:
- Possibly the most helpful piece of feedback I received was that ‘games’ should have been an offset or exposure variable in my regression. This was not a concept I had heard of before, but a little reading made clear that the reviewer was absolutely correct. Offset/exposure variables are used with count data when the opportunity for events to occur – usually time, in this case games played – differs across observations. Hence the term ‘exposure’. (A minimal sketch of an exposure term appears after this list.)
- Although I was familiar with multicollinearity before this project, I had never grappled with it in a practical context. Multicollinearity is when two or more predictor variables in a model are highly correlated with each other. Including multicollinear variables doesn’t harm the predictive power of the model as a whole, but it can make the estimates for individual predictors unstable, so conclusions about any one of them can be wildly inaccurate. Since this hypothesis was a question not about how to predict red cards as a whole, but about the influence of particular predictors – skin tone rating, mean implicit bias, and mean explicit bias – this was a serious issue. One site I read suggested splitting the data and comparing coefficient values, but it was not clear to me how to interpret those results. Couldn’t high variance in a regression coefficient mean that there’s no true effect, as opposed to an effect being obscured by multicollinearity? (One common diagnostic, the variance inflation factor, is sketched after this list.)
- A piece of feedback on one of the other analysis plans mentioned the Akaike Information Criterion (AIC). This turns out to be a sort of abstracted way to compare models fit to the same dataset. It combines the likelihood of the observed data under a given model with a penalty for the number of parameters, discouraging overfitting (a corrected variant, AICc, also accounts for sample size). I would be interested in seeing the AIC values for the different models submitted in this project! (A sketch of an AIC comparison also follows this list.)
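To make the offset/exposure idea concrete, here is a minimal sketch of how it might look in statsmodels, the kind of thing I would try in an IPython Notebook. The column names and synthetic data are assumptions for illustration, not the project’s actual variables:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Synthetic stand-in for player-referee dyads (column names are hypothetical)
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "games": rng.integers(1, 50, size=n),       # shared games: the exposure
    "skintone": rng.uniform(0.0, 1.0, size=n),  # mean of the two raters' scores
})
df["redCards"] = rng.poisson(0.02 * df["games"])  # counts scale with opportunity

# Passing 'games' as the exposure is equivalent to adding log(games) as an
# offset with its coefficient fixed at 1, so the model estimates a rate per game.
fit = smf.glm(
    "redCards ~ skintone",
    data=df,
    family=sm.families.Poisson(),
    exposure=df["games"],
).fit()
print(fit.summary())
```

Passing the exposure rather than adding ‘games’ as an ordinary predictor means the model estimates a red-card rate per game, which is, as I understand it, what the reviewer’s comment was getting at.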
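On the multicollinearity point, a common diagnostic (not something I used in my own plan) is the variance inflation factor (VIF), which statsmodels can compute. The sketch below uses fabricated, deliberately correlated predictors whose names merely echo the real ones:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Fabricated predictors: explicit bias built mostly from implicit bias,
# so the two are strongly correlated by construction.
rng = np.random.default_rng(1)
n = 1000
implicit = rng.normal(size=n)
explicit = 0.9 * implicit + 0.2 * rng.normal(size=n)
skintone = rng.uniform(0.0, 1.0, size=n)

X = sm.add_constant(pd.DataFrame({
    "skintone": skintone,
    "meanIAT": implicit,
    "meanExp": explicit,
}))

# VIF measures how much a coefficient's variance is inflated by correlation
# with the other predictors; values well above ~5-10 are a common warning sign.
for i, name in enumerate(X.columns):
    if name == "const":
        continue
    print(f"{name:10s} VIF = {variance_inflation_factor(X.values, i):.2f}")
```

With the explicit-bias column built mostly from the implicit-bias one, its VIF comes out far above the usual rule-of-thumb thresholds, while the uncorrelated skin-tone column stays near 1.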
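And since I wished for AIC values above: statsmodels exposes an aic attribute on fitted results, so comparing specifications is straightforward once the models are fit. Again, the data and column names below are fabricated purely for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Fabricated dyad-level data (hypothetical column names)
rng = np.random.default_rng(2)
n = 1000
df = pd.DataFrame({
    "games": rng.integers(1, 50, size=n),
    "skintone": rng.uniform(0.0, 1.0, size=n),
    "meanIAT": rng.normal(size=n),
})
df["redCards"] = rng.poisson(0.02 * df["games"] * (1 + 0.5 * df["skintone"]))

# Two candidate specifications for the same data; lower AIC is preferred,
# since AIC penalizes extra parameters to discourage overfitting.
m1 = smf.glm("redCards ~ skintone", data=df,
             family=sm.families.Poisson(), exposure=df["games"]).fit()
m2 = smf.glm("redCards ~ skintone + meanIAT", data=df,
             family=sm.families.Poisson(), exposure=df["games"]).fit()
print(f"skintone only:      AIC = {m1.aic:.1f}")
print(f"skintone + meanIAT: AIC = {m2.aic:.1f}")
```

Lower is better; because the penalty grows with the number of parameters, the richer model only “wins” if the extra predictor improves the likelihood enough to pay for itself.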
Although we await the organizers’ report, I can already say that I found this to be a valuable and informative project. I thank Raphael Silberzahn, Eric Luis Uhlmann, Dan Martin and Brian Nosek for conducting it, and I hope it is not the last of its kind.