Tuesday, September 02, 2008

Privacy & sharing

An article by David Rhind, chair of the old Statistics Commission, brought it home to me that I have been thinking only very superficially about the question of privacy & data losses

Rhind lists 10 conflicting priorities for (Government) statistical services, & ends with an 11th: protect confidentiality at all costs. The wording makes it plain that its 11th placing reflects not lack of importance but the need to go away with that one ringing in the ears

But there is so much to this issue of confidentiality that it could be - & may well be - an academic discipline all on its own.

One would start at a high level – the whole principle of privacy in law, sociology, philosophy, politics. And history – revisiting the late 18th century objections to the holding of the first English census would be instructive. Especially as, then as now, part of the justification for government intrusion was a threat to national security, in this case from the French

Another, more recent, piece of history to read could be the report of the independent external audit of arrangements for keeping individual details secure in the 1971 Census. Although the world of IT & data processing has moved on so far that much of this now seems irrelevant & quaint (if memory serves, concerns were expressed about someone possibly sneaking a peek through a window), the importance attached to the issue is salutary. Bernard Levin's column in The Times of 15 April 1971 (available via the Times Archive), which helped ignite the debate, is also entertaining

Angela Dale won the RSS West medal for her work on supporting the census needs of users outside government. This has been a long battle, particularly for the academic social sciences. The result is that 5 different microdata files are available for the 2001 Census. NONE has any names or addresses, & care is taken, by limiting the amount of detail which is available, to ensure that identities cannot be deduced

There are 3 different levels of access to these files. The most draconian: “The file can only be accessed within 1 of 4 ONS offices in England & Wales & there is a rigorous application procedure …. & careful vetting of outputs”

The census is of course compulsory, but statisticians generally operate on the basis that, even in voluntary surveys, no individual details will be disclosed. Social statisticians in particular are used to sharing data, & for many years the UK Data Archive at Essex University has performed the useful function of collecting, storing & making available in electronic form to researchers a wide range of academic & government surveys with appropriate confidentiality safeguards. Those thinking about modes of administrative safeguards might usefully look into this experience

The Essex experience also brings out another aspect of data & information sharing, namely coding standards & the need to have these well documented. There may ultimately need to be agreed international standards, as there are already, for example, for cause of death & international trade. But at the very detailed level there is a need to allow flexibility or else risk being overwhelmed with too much detail

Not the least of the standards which need to be clearly understood, if not agreed, is the treatment of missing data. I have come to grief myself with my own private small scale applications when the software automatically treats the absence of any data in a field, or an X instead of a number, as a value of 0. This could be disastrous in large scale administrative applications
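To make the point concrete, here is a minimal sketch in Python of the difference between treating a blank or an X as missing & silently turning it into 0. The function names (parse_count, mean_of_known) & the sample values are my own inventions for illustration, not taken from any particular package

```python
from typing import List, Optional


def parse_count(raw: str) -> Optional[int]:
    """Return the integer held in a field, or None if the field is blank or an X."""
    text = raw.strip()
    if text == "" or text.upper() == "X":
        return None  # missing is recorded as missing, never as 0
    return int(text)


def mean_of_known(values: List[Optional[int]]) -> Optional[float]:
    """Average only the values that are actually present."""
    known = [v for v in values if v is not None]
    return sum(known) / len(known) if known else None


# Hypothetical raw fields: two real numbers, one blank, one X
fields = ["4", "", "X", "2"]

parsed = [parse_count(f) for f in fields]
print(parsed)                 # missing entries stay visible as None
print(mean_of_known(parsed))  # average of the two known values

# The behaviour complained of above: anything non-numeric becomes 0
naive = [int(f) if f.strip().isdigit() else 0 for f in fields]
print(sum(naive) / len(naive))  # dragged down by the two invented zeros
```

Keeping the missing entries as None means any later calculation has to decide explicitly what to do about them, rather than inheriting a 0 nobody entered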

There are also the problems of data validation & error control – because errors will arise. Even the widely used Six Sigma system for quality control in industry allows for 3.4 defects per million opportunities. Applied to population databases covering 60 million or more people whose circumstances are liable to change, that is a lot of errors
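The arithmetic can be sketched in a few lines of Python. The rate of 3.4 per million & the 60 million records come from the paragraph above; the number of fields held per record is purely my assumption, there to show how quickly the total grows once each record offers more than one opportunity for error

```python
DEFECTS_PER_MILLION = 3.4  # the Six Sigma figure quoted above


def expected_errors(records: int, fields_per_record: int = 1) -> float:
    """Expected number of errors if every field is one opportunity for a defect."""
    opportunities = records * fields_per_record
    return opportunities * DEFECTS_PER_MILLION / 1_000_000


population = 60_000_000

# One opportunity per person: about 204 faulty records
print(expected_errors(population))

# Assumed 50 fields per person (an illustrative figure only): about 10,200 errors
print(expected_errors(population, fields_per_record=50))
```

And that is for a single pass at Six Sigma quality; every change of circumstances that has to be keyed in is a further opportunity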

Which is one reason why the idea of a National Identity database is so worrying

If this were to be the definitive identifier for everybody (a fond hope) then the effect of any mistake or error in a record would multiply alarmingly - & blight lives. As Sir Fred Hoyle said, in another context, If you try to make everything consistent, the penalty is that you might be wrong in everything

The government is also vastly underestimating what it will cost in money & manpower to administer such a monster

Now the whole purpose of statistics is to provide aggregated numerical data, while the whole purpose of administrative databases is precisely to provide personal information, usually to aid the delivery of some kind of personal service or benefit, though not always. As well as medical treatment or pensions, it may be to arrest you as a terrorist or to confirm that, when asked if you are fit to work with children, the powers that be can issue a ringing Nothing Known

But quis custodiet? Surely all those who have access to such databases should be vetted & registered too?

Access & usage should also be monitored

& annual reports should be made to Parliament