4 Things I Thought About At Transparency Camp
Following up from last week with a more in-depth post. While I heard a bunch of compelling stories and found out about a ton of amazing projects, ultimately what I appreciated most about TCamp was the chance to talk about the common issues that have arisen for those of us doing open government work.
The largest issue, to me, is accessibility. Obviously the first step to making data accessible is making it available – getting it out of government hands and into the public’s, whether through open data initiatives, FOIA requests, or asking very nicely. But how that information is made available very much impacts who will access it:
Not all obstacles are the result of malevolence, but they diminish accessibility just the same. Mike Morisy of MuckRock often talks about how they’ll request digitized information – databases, emails – and receive it printed on thousands of pages of paper. PDFs aren’t much better. At our last open government meetup we spent an hour debating the best text recognition program for searching meeting minutes and proposed regulations only available in PDF format. Non-programmers may face the opposite problem. I know plenty of people who’d be confounded by a CSV file, never mind the raw contents of an Access or MySQL database. They’d be happy to get their information on printed paper or in PDF files. What works best for one person with one set of skills will be a constant frustration for another.
Once you’ve got the data in your preferred format, you need to have the training to manipulate it: to know software programs like Excel or Calc or SPSS or JMP, or scripting languages like R or Matlab or Octave. And you’ll need to understand at least some statistics – no simple feat when people who do analysis for a living often fall prey to common mistakes. Not to mention the healthy amount of civic literacy necessary to understand the meaning behind the numbers: how bill amendment works, or how federal contracts are awarded, or how the FDA’s clinical trial system works.
Accessibility isn’t simple. But it isn’t something we can ignore either, not if we want to be truthful when we say that we’re advocating for better data access for all. As transparency bloggers have talked about before, efforts to increase transparency can have unexpectedly oppressive effects:
A very interesting and well-documented example of this empowering of the empowered can be found in the work of Solly Benjamin and his colleagues looking at the impact of the digitization of land records in Bangalore. Their findings were that newly available access to land ownership and title information in Bangalore was primarily being put to use by middle and upper income people and by corporations to gain ownership of land from the marginalized and the poor. The newly digitized and openly accessible data allowed the well to do to take the information provided and use that as the basis for instructions to land surveyors and lawyers and others to challenge titles, exploit gaps in title, take advantage of mistakes in documentation, identify opportunities and targets for bribery, among others. They were able to directly translate their enhanced access to the information along with their already available access to capital and professional skills into unequal contests around land titles, court actions, offers of purchase and so on for self-benefit and to further marginalize those already marginalized.
The digital divide exists within the United States as well. Last year Boston unveiled a new initiative called Street Bump, the goal of which was to map potholes by collecting and analyzing accelerometer data from smartphones. My friends and I were eager to join in the effort, until a friend pointed out that smartphone users were likely to live in – and therefore travel over the potholes of – mostly well-off neighborhoods. And in a session at TCamp, an activist (whose name unfortunately I didn’t catch) pointed out how increased access to crime rate data and the creation of apps like CrimeReports have the potential to stigmatize and further disadvantage poorer neighborhoods.
Which brings me to another issue I’ve been wrestling with – how can you build a narrative while maintaining accuracy? Data is, by itself, meaningless. Take this table:
Group A    0    0    1    0
Group B    1    1    0    1
What does it mean? Is it a count of men and women at a particular event? Is it a record of coin flips? Is it the political affiliation of people being canvassed at a sporting event? Even if we label our variables – say, assigning “Group A” as men and “Group B” as women – we’re still creating a narrative. We’re implying that men and women are the only two possible categories. We’re assuming that our sample is representative. We’re asserting that sex ratios at this event are a topic worth considering, if only for a moment. And this with a set of only four data points!
More concretely, let’s look at a semi-random dataset from Data.gov, labeled EPA Toxics Release Inventory Program. It’s a dataset with a few dozen columns and nearly two million rows – there’s no way a human mind could understand this holistically. We have to group the data together some way, maybe by location, or by the parent company, or maybe by whether the parent company was military, or whether the released toxic was a carcinogen. And as we organize, stories emerge. Maybe we see that toxics are disproportionately released in southern states, or that the vast majority of toxics are released by the military (or vice versa – I have not actually analyzed this data). These are good and useful stories but they come at a price: lost nuance. A quick skim of this EPA page suggests that facilities do not have to report a toxics release if it is under a certain amount per year. So we can’t necessarily say that release is greater in southern states – only that reported release by larger facilities is. And what are the potencies of the various carcinogens? If we say that some areas or companies or industries release “more carcinogens” than others, we may be misleading people if others are releasing small amounts of much more hazardous materials.
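To make the grouping step concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the rows are invented, the field names (state, chemical, carcinogen, pounds) are hypothetical rather than the real dataset's columns, and the totals say nothing about actual releases.

```python
from collections import defaultdict

# Invented rows for illustration only. These are NOT real Toxics Release
# Inventory records, and the field names are hypothetical.
RELEASES = [
    {"state": "TX", "chemical": "benzene", "carcinogen": True, "pounds": 1200.0},
    {"state": "TX", "chemical": "ammonia", "carcinogen": False, "pounds": 5400.0},
    {"state": "MA", "chemical": "benzene", "carcinogen": True, "pounds": 300.0},
    {"state": "MA", "chemical": "toluene", "carcinogen": False, "pounds": 150.0},
]


def total_by(rows, key):
    """Sum the reported pounds for each distinct value of `key`."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row["pounds"]
    return dict(totals)


# Two different groupings of the same rows suggest two different stories.
by_state = total_by(RELEASES, "state")
by_carcinogen = total_by(RELEASES, "carcinogen")

print(by_state)
print(by_carcinogen)
```

Each grouping collapses the rows along one dimension and hides the others. Neither total can account for releases that fell under a reporting threshold, and summing pounds treats every chemical as equally hazardous.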
This might seem like nit-picking. In many ways it is. But the language that some transparency advocates use worries me. From the Data Journalism Handbook, which was released freely and very recently – on the first day of TCamp, as it happens:
Data analysis can reveal “a story’s shape” (Sarah Cohen), or provides us with a “new camera” (David McCandless). Using data the job of journalists shifts its main focus from being the first ones to report to being the ones telling us what a certain development might actually mean. The range of topics can be far and wide. The next financial crisis that is in the making. The economics behind the products we use. The misuse of funds or political blunders, presented in a compelling data visualization that leaves little room to argue with it.
I haven’t had the chance to read the book “cover to cover”, though I have skimmed it. I see a lot of quotes like the above, and not much about how to interrogate data or avoid common statistical mistakes. (Although to be fair, there is some discussion.) If we want to set ourselves up as storytellers, if we want to turn data into something meaningful, then we have a responsibility to make sure that what we’re saying is, well, true. Or as close to true as we can get it, with qualifications and caveats as our epilogue.
One last issue which I found myself talking about a lot was, ironically, communication. This came up first in an early session with the creator of Purple Binder, an online directory of Chicago social services. There is apparently a wealth of information being stored in the paper directories of social workers around the country – information that is duplicated and deprecated a little more every day. An online directory seems like an obvious solution, but it’s hard to get social workers to buy in to the process – they’re already stressed to the limit with a heavy workload, and entering data online is more cost (transferring information, adapting to new formats, dealing with bugs) than benefit to early adopters. And there have been efforts like this before, efforts which have failed due to disagreements about how to organize information, and power struggles over who gets to play the gatekeeper.
There were also debates about data standards. If we can agree on taxonomies and formats, we can combine and share data more easily, making it more accessible, more meaningful, more powerful. I’m not going to talk much more about this, because it’s not an area of expertise for me (although one talk at TCamp about RDF organization has me bookmarking pages to learn more).
Finally, there’s a question of community. TCamp was, as most conferences are, primarily a networking opportunity. I met so many people working on a variety of projects, most of whom I’d never heard about before. I exchanged email addresses and project URLs with people who were doing work very similar to mine, and left the camp each day pleased to have all these new resources. But where are the tools for building a community beyond TCamp, and for those who couldn’t get there? The Sunlight Foundation has a Google Group, which is well-trafficked but difficult to search through, and an IRC channel, which is pretty quiet. Open Congress has a wiki, but I haven’t seen it promoted much and consequently, it stores only a fraction of the community’s knowledge.
My very last TCamp session was with a dozen other local open government organizers from around the country. During the session, we did an exercise where we wrote down what we need to be successful. When we compared notes, we realized that we were all starved for communication. We needed to talk to our communities and find out what they wanted from the open government movement. We needed to talk to our local officials and figure out if they were willing to work with us, and how. We needed to talk to experts who could give us legal and technical advice. And most of all, we needed to talk to each other. To share resources and insights, to keep ourselves from needlessly duplicating others’ hard work, and, perhaps most importantly, to build a community. Because organizing can be hard, frustrating work, and it’s good to do it with friends. And you don’t really need a better reason to do something than that.







