Wednesday 26 February 2014

We need data literate journalists.

So what do you think happened recently with NHS data? Do you think that the NHS handed over the records of millions of patients to insurers who then looked up their credit records and suggested that their insurance premiums should be changed?

That would be awful, wouldn't it? Just as well it didn't happen.

I came back from visiting family on Sunday night to see the Telegraph story "Hospital records of all NHS patients sold to insurers" being tweeted. It was picked up by many other papers including the Guardian.

The story is about research done by a group of actuaries, the Critical Illness Definitions and Geographical Variations Working Party. You can find the full report here. It's a 200-plus-page document written to explain to other actuaries how 'geodemographic' data might help predict how likely someone is to develop a critical illness. If you have ever applied for life insurance, critical illness cover or an income protection policy, you will know that your policy is priced based on your risk: your age, weight, smoking status and what illnesses you have. If you know anything about health inequalities you will know that beyond our own personal risk factors (age, weight etc.) our social circumstances are important in determining how long we will live and whether we will get sick. And you can tell a lot about our social circumstances from where we live. The relationship is so strong that there is a postcode mortality lottery. Your postcode *might* reflect your lifestyle, your wealth, your education: all the things that predict how likely you are to get sick or to live long.

Geodemographic systems such as CACI's Acorn and Experian's Mosaic classify postcodes into strange-sounding groups like 'happy families' and 'twilight subsistence' based on information obtained from public and commercial sources. The research by the actuaries was about whether these postcode classifications could predict when people developed serious illnesses. You can read the report to find out more but the short answer is that they do.

You may disagree with the idea that your postcode should be used to predict your risk to insurers. Is it a smart way of doing things? You can read some discussion of this in Tony Hirst's blog post here.

Most of the discussion was not about this though. It was about the fact that 'hospital records' were given to insurers.

So what actually happened in the research? 

The above tweet by Roger quotes the Guardian's coverage of this story. Are Acorn and Mosaic 'credit ratings data'? Well, yes, they may have been originally developed to predict how likely you were to be able to pay back a loan. But as we can see they can also predict how likely you are to get sick or to die.

What did the hospital records look like? They were Hospital Episode Statistics. This is what the data looked like (from this presentation).
Is that what you thought the 'hospital records' would look like?

Did the researchers have full postcodes and dates of birth? It was a bit hard to tell this from the report. I presumed they didn't because I didn't see why they needed it. And I didn't think that the NHS was likely to give away information that would make it easy for individuals to be re-identified. But the full postcode was needed to be able to assign a 'geodemographic profile' to each person in the dataset. The following screenshot is from page 10 of the report.

I read this as meaning that the geodemographics were added to the HES dataset by the NHSIC who provided the dataset to the insurers. But others, including Tony Hirst, first read this as meaning that it was the researchers that did the datalinking. Who did the datalinking was important because whoever did it needed the full postcode.

Today, after reading an article in Wired which stated that the hospital data was given to the Institute and Faculty of Actuaries (IFOA) and "was then combined with secondary sources, including Experian credit ratings data, in order to influence insurance premiums", I decided that I had to find out. So I phoned the IFOA press office on the number I found on the press release of their rebuttal to the Telegraph article.

I got straight through. The press officer directed me to page 10 above. I asked who had done the datalinking and they said it was the NHSIC. This made sense and fitted with their statement that they had no identifiable information for the individuals in the dataset. They only had an age group, and the 1st part of their postcode.
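
To make the distinction concrete, here is a minimal Python sketch of what "link first, then coarsen" might look like. It is an illustration only, not the NHSIC's actual process: the postcodes, the lookup table and the function names are all invented, and it assumes the released fields were a diagnosis code, a five-year age band and the first part of the postcode, as the IFOA described them to me.

```python
# Hypothetical lookup from FULL postcode to a geodemographic group.
# Both the postcodes and the assignments are made up for illustration.
GEODEMOGRAPHIC_GROUP = {
    "AB1 2CD": "happy families",
    "AB1 2CE": "twilight subsistence",
}


def age_band(age, width=5):
    """Turn an exact age into a band such as '50-54'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def outward_code(postcode):
    """Keep only the first part of the postcode, e.g. 'AB1' from 'AB1 2CD'."""
    return postcode.strip().upper().split()[0]


def link_and_coarsen(record):
    """Done by the data holder: use the full postcode for the lookup,
    then release only the coarsened fields."""
    full_postcode = record["postcode"].strip().upper()
    return {
        "diagnosis_code": record["diagnosis_code"],
        "age_band": age_band(record["age"]),
        "postcode_district": outward_code(full_postcode),
        "geodemographic_group": GEODEMOGRAPHIC_GROUP.get(full_postcode, "unknown"),
    }


if __name__ == "__main__":
    raw = {"diagnosis_code": "I21", "age": 52, "postcode": "AB1 2CD"}
    print(link_and_coarsen(raw))
```

The point of the sketch is that the full postcode is only needed inside `link_and_coarsen`. Whoever runs that step sees identifiable data; whoever receives its output does not see the full postcode or the exact age.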

So how many people do you think contacted the IFOA to try and make the same clarifications as me? Every journalist that had written a story about this perhaps? No. Three people: the BBC and two bloggers. I was one of them.

Why didn't other journalists get in touch with them? Didn't they understand the significance of this? Didn't they care?

In the next few months and years we are going to be having many conversations about big data. We need to have journalists who know how to ask the right questions. And at the moment it looks as if we don't.

If you think that the problem is that actuaries were given NHS data at all then see this.
EDIT: In the past GPRD data was provided to actuaries. This is no longer the case, although at least one application was made recently to CPRD. They rejected this.


  1. Thanks. Whilst I remain increasingly sceptical - "don't trust 'em" probably sums me up now - this is an excellent piece. Thanks

  2. Well, one of the people who could have made that phone call, but didn't, was someone from HSCIC who could have then given their director the information required to debunk the story at the HSA yesterday. Instead, they adopted the "we've lost the records of what data we handed over, and can't tell you anything" line and looked, to be honest, like shifty liars.

    You're an adult. You watched the HSC meeting yesterday. How trustworthy would you say Jones, Poulter and Kelsey appeared? How impressed would you say the committee were?

    It's all very well providing the esprit d'escalier responses after the fact, but Jones set out his stall yesterday: "we don't know and we won't tell you". So the interesting thing would be whether HSCIC could or would confirm that they did the match to Mosaic data because, right now, they're claiming they don't know.

    I'd also ask why you're so keen to defend NHS data being provided, very cheap, so that private companies can do research work solely of interest to those private companies. The NHS doesn't give a shit about the premiums for critical illness cover, and should not be providing data for such work under any circumstances.

    "We need to have journalists who know how to ask the right questions."

    Do we? Max Jones was sat in front of a large audience. He was asked about exactly this issue. He stone-walled rather than provide an answer. He could have killed the story stone dead, but instead was either shockingly badly briefed or couldn't be bothered to read his brief.

  3. Jonathan Richards, 26 February 2014 at 20:16

    I think (confession time) that the NHS has taken some of this for granted. I made a mistake 20 years ago and a patient was identified so I have always set the bar as high as possible in my work on anonymised data, registers and so on. I have had complaints made about my refusal to release data for research and for NHS service development in Wales because of the issues about consent and confidentiality. I have been shocked at how colleagues have not appreciated how important consent is when there is any chance of identifying someone. However, this current debate has really shaken me.
    Are the public aware of QMAS and Audit+? Are they aware of how Public Health and other agencies already access GP data? Are they aware of QResearch for EMIS practices and of the CPRD? Some years ago, I was looking at All Wales Public Health data about long term conditions and realised that anyone who knew how to "look underneath" the statistics would be able to identify the GP practice. Then, especially for unusual people (a 90-year-old with a recent MI for example) it would be possible to have a good idea about the person who had contributed that unusual data. Nearly everyone would still have been anonymous, but not every single person whose data contributed to the analysis. The AWPHO then changed the data presentation to prevent this from happening.
    I realise now that others will accuse me of clinical or professional arrogance and taking people's wishes for granted. My starting point has been that anonymised data is not personal and so therefore consent is not required. I have trusted the people in Wales who do this work; they do have the highest standards.
    The paradox is perhaps that it is the unusual people whose data may contribute the greatest insights into this work, since their contribution to the data is the most powerful (why did that 24 year old develop pancreatic cancer?). However their informed refusal of consent because of this risk would render the data analysis as good as meaningless.

  4. Hello,

    I have no idea why HSCIC response to this has been so rubbish. I didn't watch the session live as I was working as a GP. And when I got home I thought I had heard enough about it. It sounded as if watching it would be a painful experience.

    You should have a look at the storify I made a few days ago

    NHS Data is being provided to insurers through CPRD. I asked some questions there about whether it was appropriate but haven't had any responses yet.

    I completely stand by my claim that we need better journalists.


  5. "However their informed refusal of consent because of this risk would render the data analysis as good as meaningless."

    I don't mind having my arbitrary data, fully identifiable, released to people I have a rational basis to trust. Hence being involved in biobank: the data is either identifiable or as near as, certainly in the form biobank themselves hold it, and I trust both them and their projects. I occasionally look at the list of projects, and they are all beyond reproach. The investigators are open, transparent and rigorous. They accept risk, make clear moves to mitigate them, and can clearly articulate benefits to society.

    Contrast Is it for commissioning, audit, payment, fundamental research, treatment comparison? Who knows? Is the data protected by anonymisation, pseudonymisation, aggregation, contract, encryption? Who knows: they can't be bothered to publish a code of practice, so we're guessing. Who gets access to the data: insurers, government, drug companies, private health care? Well, on past showing, all of them, and in the absence of the code of practice they haven't written, we don't know what the future looks like.

    Biobank show how to do it. Hence my opt-in (even though I'm terrified of needles and the original sampling to join was a very traumatic experience). shows how not to do it. Hence my opt-out.

    As I said below, it's all very well coming up with responses to the media narrative on blogs read by nerds. Tim Kelsey is paid to seize control of the media narrative: it's his _job_. If the story in the Torygraph was wrong, should have clearly corrected it, at the HSC meeting. They had the platform. Instead, they've let it run because their own incompetence and arrogance means they don't see it as important.

    "We flog your data to insurers, give it to Atos and might give it to DWP" is completely fatal. Jones and Kelsey are paid to manage that. They need to manage it.

  6. Hello Jonathan,

    I think we need to know how easy it would be to reidentify someone from the data given here. I'm in no position to say but I hope that someone will leave a comment who knows more than us!

    If people can be identified then it is personal information and it shouldn't be shared without consent.

    Thanks again, AM

  7. "I think we need to know how easy it would be to reidentify someone from the data given here."

    It depends on what other data the attacker has. k-anonymity has a substantial literature; k-anon is the property of there always being at least k people in the dataset who are indistinguishable from each other. But actually providing that property in a real-world dataset where the rows are different (it's not just "fuzz GPS locations so that in a dataset of phone numbers and locations there are always at least ten people in a given vicinity") and the attacker has an unknown amount of other data to join it with is hideously difficult.

    Fuzz some medical data so that you can show it's k-anonymous (as a side issue, deciding what k you need is not easy, either). But now imagine the attacker has supermarket records: they'll be able to spot pregnancies by seeing changes in purchasing habits of sanitary products followed by purchases of nappies etc, so given a list of births they're good to go. Or imagine the attacker has insurance data: they can simply match the health declarations people make with the fuzzed data, and get a load of matches. And so on, and so on, and so on.
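
    As a rough illustration of the k-anonymity property described above, here is a minimal Python sketch. The rows, field names and postcodes are invented, and the function only measures k over the columns you name; it says nothing about what an attacker could do by joining in other data, which is the hard part.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group of rows that share the same
    values for the given quasi-identifier columns (0 for an empty dataset)."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values(), default=0)


# Invented example rows: the same four people at two levels of detail.
rows = [
    {"postcode": "AB1 2CD", "district": "AB1", "age": 52, "age_band": "50-54", "code": "I21"},
    {"postcode": "AB1 2CE", "district": "AB1", "age": 53, "age_band": "50-54", "code": "I21"},
    {"postcode": "AB1 2CF", "district": "AB1", "age": 90, "age_band": "90-94", "code": "I21"},
    {"postcode": "AB1 2CG", "district": "AB1", "age": 91, "age_band": "90-94", "code": "I21"},
]

if __name__ == "__main__":
    # Full postcode and exact age: every row is in a group of its own.
    print(k_anonymity(rows, ["postcode", "age", "code"]))
    # First part of postcode and five-year age band: rows fall into shared groups.
    print(k_anonymity(rows, ["district", "age_band", "code"]))
```

    A result of 1 means at least one person is unique on those columns alone. A larger k is only a statement about this dataset in isolation, not about an attacker who also holds supermarket or insurance records.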

  8. Jonathan Richards, 26 February 2014 at 20:56

    Thank you. I think that this is in many ways the gold standard. I am a GP and we had a number of people walking in the other day with the opt out forms. They were not the people I would have expected (i.e. I had taken their views and wishes for granted) and I could say to them that did not apply in Wales. That did not ease my conscience. Then I felt overwhelmed by the thought of having to engage with 11,500 people and obtain their opt in.
    Some years ago we had a real battle about digital retinal photographs and diabetes screening: as personal as a fingerprint. We were able to prevent their sharing without fully informed consent for commercial purposes.

  9. Like I say, I can't make head nor tail of why this seems to have been so confused. Maybe they didn't see it as important. Who knows.

    Conspiracy or cock-up?

  10. "I think we need to know how easy it would be to reidentify someone from the data given here"

    Well, I live in a smallish village. A code + postcode + age would probably uniquely identify me. If it's only part 1 of postcode and a fuzzy age, then it would probably be OK. The difference is important.

  11. Code(s) and 1st part of post code and 5 year age band...

  12. With respect, AM, that's wrong.

    Their analysis groups people into five-year age bands.

    That doesn't say that the raw data the analysts had was in five-year age bands. And it clearly wasn't, because in Appendix 6 they show raw rates broken down into one-year buckets. They had at least one-year resolution ages, because otherwise they would not have been able to derive Appendix 6.

  13. Jonathan - I'd been looking at SAIL last week - comparing and contrasting the Celtic models (Wales/Scotland) with - appears also to extract identifiable data and link with a range of datasets - Are you familiar with it and can educate us there? to share in anonymous formats. That said, with the limited detail on the file, it is hard to see if disseminated data is anonymous or pseudonymised. Person identifiable data is shared with them in order to enable linkage. I have asked directly but yet to receive response. I wondered if the person level data is more pseudonymised than anonymised from the limited detail here and whether or not it is opt in or compulsory for primary care too? because it confirms individuals do not give consent it states, "because it is anonymised" - however, if it extracts identifiable data sets then links, then pseudonymises - that's not anonymous -" because we are only dealing with anonymous data, we are not required to obtain informed consent from individuals whose records come to SAIL." Sounds just like the DLES at HSCIC? Any thoughts?

  14. Jonathan Richards, 27 February 2014 at 13:59

    SAIL is an academic project that Welsh Government is very proud of and has publicised often. It builds on many years of projects about collecting and sharing data in Wales. The data is processed by NHS Wales IT and then released to SAIL. There is robust Information Governance in place that includes service users and any requests for data that could inadvertently identify someone with a rare condition or in a small community are turned down unless informed consent is obtained. I have been involved in this work and can be confident about what I have been involved in personally.
    The fiasco has really shaken me up because it has shown how paternalistic I have been: "Trust me, I am an experienced clinician who is fanatical about consent and confidentiality and about patient rights and professional duties."
    I am aware of Nassim Taleb's warnings that things we never expect are bound to happen. I still trust the Welsh processes.
    I wonder if any sceptic could ever be reassured by fine words. As Onora O'Neill commented in her Reith lectures, trust cannot be earned, it can only be given.

  15. Hi Anne Marie,

    Here's a view on the 'NHS data release to actuaries' from a 'conduct / insurance' perspective -

    About contacting the IFoA - I considered doing so, but the way in which they wrote their rebuttal indicated that they didn't seem that open to much of a conversation.

  16. Hello
    They were pretty open to me. Your post is very interesting. Thanks for sharing.

  17. Joining late!

    I completely agree we need better journalists. But doesn't this sorry saga (and many others) show that we need better everybodies - politicians, civil servants, members of the public? As noted elsewhere, education doesn't seem to be the solution. At this point, my Dad would start going on about the benefits of a Brave New World arrangement.....

  18. Hello
    Sorry, I'm a little bit confused by your comment. Which 'publically available sources' do you mean?
    We discussed above and on Tony's post whether use of geodemographics makes 'jigsawing' easier. I don't think it was concluded that it does.
    If the ethical debate is in part about when data becomes relatively non-identifiable then who did the data linking is important as personal data was needed for that process.

  19. The risk I raised in my own blog was disabled people who want to make only potential disclosures of disability to a potential employer given the overwhelming incidence of disability discrimination in recruitment. Taking me as an example, there probably aren't going to be too many 50-somethings with dyspraxia, chronic pain syndrome and hypermobility syndrome in most post-code areas (and no point trying to hide those, the crutches say something is going on and I need to justify reasonable adjustment). Considering the history of engineering industry-wide blacklisting by 'the Consulting Association', it's easy enough to envisage a similar 'service' for disability. Even if you don't have named data, it would be easy enough to match me by disability mix and post-code, and with few enough matches to decide to take the worst case, and everything else it reveals. Which would then tell an employer about all the non-disclosed invisible disabilities. You can also envisage an existing employer accessing such a facility during Employment Tribunal proceedings, and in such a case they are likely to be able to provide details like dates of hospital appointments or admissions to aid in a de-anonymization attack.

  20. All good reasons to follow example of SAIL in Wales and have the data only accessed through a portal?

  21. Let's just say I have good reason to be personally sceptical of employers following the law with respect to my disability....

