Unique Identifiers Which Don’t Identify Anyone!

  1. Why might we need Unique Identifiers?
  2. How might a Unique ID Be Generated?
  3. But we don’t want our data to be identified!
  4. Appendix: Some Technical Ideas For Generating Anonymous Unique Identifiers

Why might we need Unique Identifiers?

So far early data is coming through on the LAMP project, and partners have been anonymising student data before they send it to us. This anonymisation process removes their internal unique identifiers (student number, say, or registration number) as well as any other identifying data. A sample of the data that we are seeing might be something like the following (which I have just made up on the spot, any similarity to persons living or deceased is unintentional):

Age Starting Course UCAS points Subject Final Grade Number of library loans
18 300 Physics and Astronomy Upper Second 3
19 250 Cake Decorating and Icing Manufacture First 153
43 400 Medieval History Lower Second 96

This is great for us to get started with, and there’s a lot we can do statistically with this data. However, different partners are keeping (and submitting to this project) different levels of data, and so it’s conceivable that we might later get updates to this information (in subsequent academic years, say), or even other tables of information from partners. Maybe something like this detailing use of electronic and on-site facilities:

Number of PC logins e-resource logins Library barrier entires
1307 4 260
51 0 2741
254 98 5

Now, let’s suppose row 2 in our second table is actually the same person as row 1 in our first. If we know this, there are perhaps more meaningful questions which we can ask of our statistics routines. The problem, however, is how we can possibly find this out. When we receive the tables, we could give every row a number at our end, but this in itself doesn’t help us cross-reference between tables. The only way for us to know for sure that rows refer to the same people, is if the partners somehow supply us with this information. Of course, if partners are certain that they will never send updates, they can always merge tables like these themselves and just send us one single table. Otherwise…

How might a Unique ID Be Generated?

The most straightforward way to cross-reference between tables and updates would be if the partners added a unique number, or identifier, to records for every person they send us data about. Most partners will already have a unique ID in use internally, like the student number or registration number. However, sending us this information directly would destroy the anonymity of the data and would raise concerns about data protection.

The best solution to this problem would be for partners to generate a different unique ID, which cannot be back-traced to their internal ID, but instead is only of use for cross-referencing purposes. Some example ideas as to how an anonymous unique ID could be generated are appended to this post.

The lookup table (or routine) between the original UID and the new anonymous UID would be something private to the partners, maintained by them, that we would never get to see.

But we don’t want our data to be identified!

The question then becomes whether or not this new ID is, in itself, an identifying feature which should cause partners concern, and below I will attempt to set out some thoughts as to why assigning an anonymous ID should be worry-free.

Fundamentally speaking, if the anonymisation process which generates the new ID is good, then this extra piece of information will tell us absolutely nothing. It will purely be a means for us to join up different tables at our end — to illustrate, from the examples above, we would end up with something like:

Anonymous UID Age Starting Course UCAS points Subject Final Grade Number of library loans Number of PC logins e-resource logins Library barrier entires
765003 18 300 Physics and Astronomy Upper Second 3 51 0 2741
331224 19 250 Cake Decorating and Icing Manufacture First 153
903876 43 400 Medieval History Lower Second 96

In other words, it’s the same as if we received a single table from the providers, with just one now-meaningless column added — those anonymous values may as well be 1,2,3 for all the information they give us.

The next concern, then, is that looking at data in the table above could somehow be identifying even if the anonymous IDs are meaningless. This is a completely valid concern — if there is only one person on a given course who is over forty, then looking at the data in this way would immediately make it clear that we were talking about that person.

To answer this concern, we need to be clear about what the LAMP project will actually be letting users look at. Tables of the type shown above will only be held confidentially on LAMP systems. The LAMP project will not be offering data it holds at anything like the level of granularity given above. Aside from the fact that we want to protect anonymity, there are issues surrounding statistical significance. Simply put, we will be offering users the chance to observe and extract meaningful patterns from the data held in our systems. In order for us to definitely say there is a pattern, we must be able to say with confidence that an observed trend in the data is actually happening, and not just some effect in the numbers that we might see from any random sample. The way in which we make such statements is using statistical significance, and we cannot make statistically significant statements unless we reduce the level of granularity in our data by grouping data into ‘bins’.

Here’s an example — the first row in the following table shows a theoretical full granularity of data we might receive from some imaginary partner, but the second row contains the same information after we have normalised the data into categories for use in LAMP. You can immediately see that the chances of identifying someone from the data drop significantly.

Age Course Degree Result FTE Country of Domicile
23 Physics with Study in Europe 67% 0.75 Spain
Mature Physics Upper Second Part Time EU

Furthermore, our user interface will only be displaying data after we have performed statistical analysis on it. Nobody will even have access to row two of the data above — instead you should expect to see results such as the following:

Number of loans vs final grade
loans
final grade

Or if you query the LAMP API, you will be able to get the same data in tabular form:

Loans Final Grade
3 2
22 1
30 2
10 3
0 5
0 4
7 3
6 4
15 3
17 2

The user interfaces, both graphical and our API, will be limited to returning results at this level of depth.

Hopefully that covers some of the concerns surrounding generating and supplying LAMP with unique identifiers. For some more technical ideas about how to generate anonymous UIDs, I have given some examples in the appendix below to set people off. Next up, I’ll hopefully be posting about the design of our LAMP database, and how we will be applying the normalisations I talked about above so that we can achieve statistical significance.

Appendix: Some Technical Ideas For Generating Anonymous Unique Identifiers

The main issue with generating your own identifiers is that they must be unique. Beyond that, the issue of anonymity means that you might want a routine which is unguessable. Some possible ideas are below, but this is by no means an exhaustive list and anything else which follows these principles will be fine — at the end of the day, it’s better if we don’t know how the anonymous UIDs were generated!

  1. Pick some ‘main’ table and use the row number from it

    In the example tables above, the first table could be considered the ‘main’ one as it contains info about the student themselves, whereas the second one is just peripheral usage information, so is really a sub-table. If you have a situation like this, then whilst your original UIDs are still present in your data, you could use their presence in sub-tables to look up the row number from the main table, and add that to the sub-tables as the anonymous UID. Limitations of this approach are that sorting your table by original UID before you begin may result in your anonymous UIDs being guessable. On the other hand, if you don’t sort by UID, then if new data is added to the main table it may interrupt or otherwise disturb your numbering scheme.

  2. Make up your own anonymiser routine

    As long as you choose a process that will return a unique number for every input, then you can transform your old UIDs yourself and tell nobody how you did it. For example, you could multiply all the old IDs by 3 and add 4. As I mentioned before, too simple a routine may be guessable, but more complicated ones risk not returning a unique number — for example, subtracting a student’s age from their student ID would result in a clash between student 331, aged 19 and student 359, aged 47…

  3. Use a hash function

    This is the most technological approach — a hash function will return a unique code for every possible input. It is so computationally difficult to get the original data back from the hashed code as to make it effectively impossible, which is why they are used in digital certificates and for securing passwords. The returned codes don’t follow any numerical sequence, so id’s 1, 2, 3, 4 will not come out with hash codes in the same order. At the moment, SHA-2 is probably a reasonable set of hash functions to use. You can install and use the openssl library (it may well already be installed on your system) — for example, on any UNIX based system, echo id | openssl sha256 will return the hash code for id.

3 Comments

  1. This is a really useful explanation of the proposed approach. However, based on the work we did on the legal aspects of this (see CETIS report – http://publications.cetis.ac.uk/2012/500), I’d like to comment on this part of your LAMP use case: “To answer this concern, we need to be clear about what the LAMP project will actually be letting users look at. Tables of the type shown above will only be held confidentially on LAMP systems.”

    Whilst this is the right sort of principle, I would suggest that in legal terms you need to be tighter than this with reference to confidentiality and the Data Protection Act, as follows:
    a) Is the source HEI covered in terms of confidentiality for giving that level of data to external organisations? You need to check what the source university says to its students / staff – my guess is that this will be OK
    b) Is the LAMP project covered registered to handle data for these purposes under the DPA? Almost certainly not as a project – so how about the University of Manchester and the University of Huddersfield and New Jisc separately? The answer needs to be Yes and will be ok in most cases. DPA registration is essential for those accessing the data within the project.
    c) Finally, how does this work for allowing data access between project partners? e.g. Mimas stores it and allows Hudd to see it. I’d be reasonably confident that this is ok if they are each DPA registered for these purposes.

    Conclusion – Every partner handling or having access to the non-aggregated data needs to be appropriately registered under the DPA.

  2. Joy

    Thanks so much for this, Dave. We’ve been having some good discussions here about the approaches we might take to deal with these issues, and of course tackling it from a legal perspective as opposed to a technical one is going to ensure a much more sustainable model. We think you’re exactly right that the answer lies with each institutions being DPA registered. Manchester obviously is, and we’re already putting things in place to take this angle with Wolverhampton. Look forward to discussing this further on Monday!

Trackbacks for this post

  1. LAMP project: a let's-pretend Post-it post-mortem | Paul Stainthorp

Leave a Reply