Saturday, January 18, 2014

Big Data, Predictive Algorithms and the Virtues of Transparency (Part One)



Transparency is a much-touted virtue of the internet age. Slogans such as the “democratisation of information” and “information wants to be free” trip lightly off the tongues of many commentators; classic quotes, like Brandeis’s “sunlight is the best disinfectant”, are trotted out with predictable regularity. But why exactly is transparency virtuous? Should we aim for transparency in all endeavours? Over the next two posts, I look at four possible answers to that question.

The immediate context for this is the trend toward “big data” projects, and specifically the trend toward the use of predictive algorithms by governmental agencies and corporations. Recent years have seen such institutions mine large swathes of personal data in an attempt to predict the future behaviour of citizens and customers. For example, in the U.S. (and elsewhere) the revenue service (the IRS) uses data-mining algorithms to single out individuals for potential audits. These algorithms work on the basis that certain traits and behaviours make it more likely that an individual is understating income on a tax return.

There are laws in place covering the use of such data, but I’m not interested in those here; I’m interested in the moral and political question as to whether such use should be “transparent”. In other words, should an institution like the IRS be forced to disclose how their predictive algorithms work? To answer that question, I’m going to enlist the help of Tal Zarsky’s recent article “Transparent Predictions”, which appeared in last year’s University of Illinois Law Review.

Zarsky’s article is a bit of a sprawling mess (typical, I’m afraid, of many law journal articles). It covers the legal, political and moral background on this topic in a manner that is not always as analytically sharp as it could be. Nevertheless, there are some useful ideas buried within the article, and I will draw heavily upon them here.

The remainder of this post is divided into two sections. The first looks at the nature of the predictive process and follows Zarsky in separating it out into three distinct stages. Each of those stages could raise different transparency issues. The second section then looks at the first of four possible justifications for transparency.

Note: the title of this post refers solely to the “virtues” of transparency. This is no accident: the series will deal specifically with the alleged virtues. There are, of course, alleged vices. I’ll address them at a later date.


1. Transparency and the Predictive Process
Take a predictive algorithm like the one used by the IRS when targeting potential tax cheats. Such an algorithm will work by collecting various data points from a person’s financial transactions, analysing those data and then generating a prediction as to the likelihood that the individual is guilty of tax fraud. This prediction will then be used by human agents to select cases for auditing. The subsequent auditing process will determine whether or not the algorithm was correct in its predictions.
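To make this concrete, here is a minimal sketch (in Python) of what such a risk-scoring step might look like. The feature names, weights and audit threshold are invented purely for illustration; the IRS’s actual models are, of course, not public.

```python
# A minimal, hypothetical sketch of a risk-scoring step.
# Feature names, weights and the audit threshold are invented;
# they do not reflect any real IRS model.

WEIGHTS = {
    "cash_intensive_business": 2.0,
    "deductions_to_income_ratio": 3.5,
    "round_number_amounts": 1.0,
}

AUDIT_THRESHOLD = 4.0  # arbitrary cut-off for this sketch


def risk_score(tax_return: dict) -> float:
    """Combine observed features into a single fraud-risk score."""
    return sum(w * tax_return.get(f, 0.0) for f, w in WEIGHTS.items())


def flag_for_audit(tax_return: dict) -> bool:
    """The 'prediction' that human agents would then act upon."""
    return risk_score(tax_return) >= AUDIT_THRESHOLD
```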

The process here divides into three distinct stages, stages that will be shared by all data-mining and predictive programmes (such as those employed by other government agencies and corporations). They are as follows (a simplified sketch of the full pipeline appears after the list):

Collection Stage: Data points/sets are collected, cleansed and warehoused. Decisions must be made as to which data points are relevant and will be collected.
Analytical Stage: The collected data is “mined”, analysed for rules and associations and then processed in order to generate some prediction.
Usage Stage: The prediction that has been generated is used to guide particular decisions. Strategies and protocols are developed for the effective use of the predictions.
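
As a rough illustration of how these stages separate out in practice, a data-mining pipeline might be organised along the following lines. This is a deliberately simplified sketch: the field names, scoring rule and stage boundaries are all invented, and real systems are rarely this clean.

```python
# Simplified sketch of the three-stage structure described above.
# The function bodies are toy placeholders, not a real mining system.

def collection_stage(raw_records):
    """Stage 1: decide which data points matter; cleanse and warehouse them."""
    relevant_fields = ["income", "deductions", "business_type"]  # a policy choice
    return [{k: r[k] for k in relevant_fields if k in r} for r in raw_records]


def score_record(record):
    """Toy scoring rule standing in for a trained model."""
    return record.get("deductions", 0) / max(record.get("income", 1), 1)


def analytical_stage(warehoused_data):
    """Stage 2: mine the data for rules/associations; generate predictions."""
    return [score_record(rec) for rec in warehoused_data]


def usage_stage(predictions, audit_budget):
    """Stage 3: turn predictions into decisions, e.g. pick the top cases to audit."""
    ranked = sorted(range(len(predictions)), key=lambda i: predictions[i], reverse=True)
    return ranked[:audit_budget]
```

Note that transparency could, in principle, attach to any one of these stages independently: disclosing which fields the collection stage keeps, how the analytical stage scores, or how the usage stage converts scores into audits.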

For the time being, human agency is relevant to all three stages: humans programme the algorithms, deciding which data points/sets are to be used and how they are to be analysed, and humans make decisions about how the algorithmic predictions are to be leveraged. It is possible that, as technology develops, human agency will become less prominent in all three stages.

Transparency is also a factor at all three stages. Before we consider the specifics of transparency, we must consider some general issues. The most important of these is the question of to whom the process must be transparent. It could be the population as a whole, or some specific subset thereof. In general, the wider the scope of transparency, the more truly “transparent” the process is. Nevertheless, sometimes the twin objectives of transparency and privacy (or some other important goal) dictate that a more partial or selective form of transparency is desirable.

So how might transparency arise at each stage? At the first stage, transparency would seem to demand the disclosure of the data points or sets that are going to be used in the process. Thus, potential “victims” of the IRS might be told which of their financial details are going to be collected and, ultimately, analysed. At the second stage, transparency would seem to demand some disclosure of how the analytical process works. The analytical stage is quite technical. One form of transparency would be to simply release the source code to the general public in the hope that the interested few would be able to figure it out.

Nevertheless, as Zarsky is keen to point out, there are policy decisions to be made at this stage about how “opaque” the technical process really is. It is open to the programmers to develop an algorithm that is “interpretable” by the general public: in other words, a programme with a coherent theory of causation underlying it, one that can be communicated to and understood by any who care to listen. Finally, at the third stage, transparency would seem to require some disclosure of how the prediction generated by the algorithm is used and, perhaps more importantly, how accurate the prediction really is (how many false positives did it generate, and so on).
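The accuracy disclosures envisaged at this third stage are, computationally at least, straightforward. Here is a sketch of the kind of error measure that might be reported; the outcome figures are invented for illustration.

```python
# Sketch: measuring how often the algorithm's audit flags were wrong.
# 'flagged'    = the algorithm selected the return for audit;
# 'fraudulent' = what the subsequent audit actually found.
# All figures below are invented.

audit_outcomes = [
    {"flagged": True,  "fraudulent": True},
    {"flagged": True,  "fraudulent": False},   # a false positive
    {"flagged": False, "fraudulent": False},
    {"flagged": True,  "fraudulent": False},   # another false positive
]

false_positives = sum(1 for o in audit_outcomes
                      if o["flagged"] and not o["fraudulent"])
flagged_total = sum(1 for o in audit_outcomes if o["flagged"])

# Of everyone the algorithm singled out, what share were innocent?
# (Strictly speaking, this is the false discovery rate; a full
# disclosure would report other error measures as well.)
share_innocent = false_positives / flagged_total
print(f"{false_positives} false positives; "
      f"{share_innocent:.0%} of flagged returns were innocent")
```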

But all this is to talk about transparency in essentially descriptive and value-free terms. The more important question is: why bother? Why bother making the process transparent? What moral ends does it serve? Zarsky singles out four rationales for transparency in his discussion. They are: (i) to promote efficiency and fairness; (ii) to promote innovation and crowdsourcing; (iii) to protect privacy; and (iv) to promote autonomy.

I’m not a huge fan of this quadripartite conceptual framework. For example, I don’t think Zarsky does nearly enough to delineate the differences between the first and second rationales. Since Zarsky doesn’t talk about procedural fairness as distinct from substantive fairness, nor explain why innovation and crowdsourcing are valuable, it seems to me like both rationales could collapse into one another. Both could just be talking about promoting a (non-specified) morally superior outcome through transparency.

Still, there probably is some distinction to be drawn. It is likely that the first rationale is concerned with substantive fairness and the minimisation of bias/discrimination; that the second rationale is concerned with enhancing overall societal levels of well-being (through innovation); and that the third and fourth rationales are about other specific moral ends, privacy and autonomy, respectively. Thus, I think it is possible to rescue the conceptual framework suggested by Zarsky from its somewhat imprecise foundations. That said, I think that more work would need to be done on this.


2. Transparency as a Means of Promoting Fairness
But let’s start with the first rationale, imprecise as it may be. This is the one claiming that transparency will promote fairness. The concern underlying this rationale is that opaque data-mining systems may contain implicit or explicit biases, ones which may unfairly discriminate against particular segments of the population. For example, the data points that are fed into the algorithm may be unfairly skewed towards a particular section of the population because of biases among those who engineer the programme. The claim made by this rationale is that transparency will help to stamp out this type of discrimination.

For this claim to be compelling, some specific mechanism linking transparency to the minimisation of bias (and the promotion of fairness) must be spelled out. Zarsky does this by appealing to the notion of accountability. He suggests that one of the virtues of transparency is that it forces public officials, bureaucrats and policy makers to take responsibility for the predictive algorithms they create and endorse. And how exactly does that work? Zarsky uses work done by Lessig to suggest that there are two distinct mechanisms at play: (a) the shaming mechanism; and (b) market and democratic forces. The first mechanism keeps the algorithms honest because those involved in their creation will want to avoid feeling ashamed for what they have created; and the second mechanism keeps things honest by ensuring that those who fail to promote fairness will be “punished” by the market or by the democratic vote.

To encapsulate all of this in a syllogism, we can say that proponents of the first rationale for transparency adopt the following argument:


  • (1) The predictive policies and protocols we adopt ought to promote fairness and minimise bias.
  • (2) Transparency promotes fairness and minimises bias through (a) shaming and (b) market and democratic forces.
  • (3) Therefore, we ought to incorporate transparency into our predictive policies and protocols.


At the moment, this argument is crucially vague. It fails to specify the extent of the transparency envisaged by the rationale, and it fails to specify at which stage of the predictive process transparency may become relevant. Until we add in these specifications, we will be unable to determine the plausibility of the argument. One major reason for this is that the argument, when left in its original form, seems to rest on a questionable assumption, viz. that a sufficient number of the population will take an interest in shaming and disciplining those responsible for implementing the predictive programme. Is this really true? We’ll only be able to tell if we take each stage of the predictive process in turn.

We start with stage one, the data collection stage. It seems safe to say that those whose behaviour is being analysed by the algorithm will take some interest in which bits of their personal data are being “mined” for predictive insights. Transparency at this stage of the process could take advantage of this natural interest and thereby be harnessed to promote fairness. This would seem to be particularly true if the predictions issued by the algorithm will have deleterious personal consequences: people want to avoid IRS audits, so they are probably going to want to know what kinds of data the IRS mines in order to single people out for auditing.

It is a little more uncertain whether transparency will have similar virtues if the predictions being utilised have positive or neutral consequences. Arguably, many people don’t pay attention to the data-mining exploits of corporations because the end result of such exploits seems to entail little personal loss (maybe some nuisance advertising, but little more) and some potential gain. This nonchalance could be a recipe for disaster.

Moving on then to stage two, the analytical stage. It is much more doubtful whether transparency will facilitate shaming and disciplining here. After all, the majority of people will not have the technical expertise needed to evaluate the algorithm and hold those responsible to account. Furthermore, if it is analysts and programmers who are responsible for many of the details of the algorithm, there may be a problem: such individuals may be insulated from social shaming and discipline in ways that policy makers and politicians are not. For transparency to be effective at this stage, a sufficient number of technically adept persons would need to take an interest in the details and have some way to shame and discipline those responsible, perhaps through some “trickle down” mechanism (i.e. shaming and disciplining of public officials trickles down to those with the technical know-how).

Finally, we must consider stage three. It seems superficially plausible to say that people will take an interest in how the predictions generated by the algorithms are used and how accurate they are. Again, this would seem to be particularly true if the end result is costly to those singled out by the algorithm. That said, Zarsky suggests that the usage protocols might involve technical terms that are too subtle to generate shame and discipline.

There is also one big risk when it comes to using transparency to promote fairness: populism. There is a danger that the people who are held to account will be beholden to popular prejudices and be overly conservative in the policies they adopt. This may actually prevent the development of truly fair and unbiased predictive algorithms.

So that brings us to the end of this first alleged virtue of transparency. In part two, we will consider the remaining three.
