- Why might we need Unique Identifiers?
- How might a Unique ID Be Generated?
- But we don’t want our data to be identified!
- Appendix: Some Technical Ideas For Generating Anonymous Unique Identifiers
So far early data is coming through on the LAMP project, and partners have been anonymising student data before they send it to us. This anonymisation process removes their internal unique identifiers (student number, say, or registration number) as well as any other identifying data. A sample of the data that we are seeing might be something like the following (which I have just made up on the spot, any similarity to persons living or deceased is unintentional):
|Age Starting Course||UCAS points||Subject||Final Grade||Number of library loans||…|
|18||300||Physics and Astronomy||Upper Second||3||…|
|19||250||Cake Decorating and Icing Manufacture||First||153||…|
|43||400||Medieval History||Lower Second||96||…|
This is great for us to get started with, and there’s a lot we can do statistically with this data. However, different partners are keeping (and submitting to this project) different levels of data, and so it’s conceivable that we might later get updates to this information (in subsequent academic years, say), or even other tables of information from partners. Maybe something like this detailing use of electronic and on-site facilities:
|Number of PC logins||e-resource logins||Library barrier entires||…|
Now, let’s suppose row 2 in our second table is actually the same person as row 1 in our first. If we know this, there are perhaps more meaningful questions which we can ask of our statistics routines. The problem, however, is how we can possibly find this out. When we receive the tables, we could give every row a number at our end, but this in itself doesn’t help us cross-reference between tables. The only way for us to know for sure that rows refer to the same people, is if the partners somehow supply us with this information. Of course, if partners are certain that they will never send updates, they can always merge tables like these themselves and just send us one single table. Otherwise…
The most straightforward way to cross-reference between tables and updates would be if the partners added a unique number, or identifier, to records for every person they send us data about. Most partners will already have a unique ID in use internally, like the student number or registration number. However, sending us this information directly would destroy the anonymity of the data and would raise concerns about data protection.
The best solution to this problem would be for partners to generate a different unique ID, which cannot be back-traced to their internal ID, but instead is only of use for cross-referencing purposes. Some example ideas as to how an anonymous unique ID could be generated are appended to this post.
The lookup table (or routine) between the original UID and the new anonymous UID would be something private to the partners, maintained by them, that we would never get to see.
The question then becomes whether or not this new ID is, in itself, an identifying feature which should cause partners concern, and below I will attempt to set out some thoughts as to why assigning an anonymous ID should be worry-free.
Fundamentally speaking, if the anonymisation process which generates the new ID is good, then this extra piece of information will tell us absolutely nothing. It will purely be a means for us to join up different tables at our end — to illustrate, from the examples above, we would end up with something like:
|Anonymous UID||Age Starting Course||UCAS points||Subject||Final Grade||Number of library loans||Number of PC logins||e-resource logins||Library barrier entires||…|
|765003||18||300||Physics and Astronomy||Upper Second||3||51||0||2741||…|
|331224||19||250||Cake Decorating and Icing Manufacture||First||153||…|
|903876||43||400||Medieval History||Lower Second||96||…|
In other words, it’s the same as if we received a single table from the providers, with just one now-meaningless column added — those anonymous values may as well be 1,2,3 for all the information they give us.
The next concern, then, is that looking at data in the table above could somehow be identifying even if the anonymous IDs are meaningless. This is a completely valid concern — if there is only one person on a given course who is over forty, then looking at the data in this way would immediately make it clear that we were talking about that person.
To answer this concern, we need to be clear about what the LAMP project will actually be letting users look at. Tables of the type shown above will only be held confidentially on LAMP systems. The LAMP project will not be offering data it holds at anything like the level of granularity given above. Aside from the fact that we want to protect anonymity, there are issues surrounding statistical significance. Simply put, we will be offering users the chance to observe and extract meaningful patterns from the data held in our systems. In order for us to definitely say there is a pattern, we must be able to say with confidence that an observed trend in the data is actually happening, and not just some effect in the numbers that we might see from any random sample. The way in which we make such statements is using statistical significance, and we cannot make statistically significant statements unless we reduce the level of granularity in our data by grouping data into ‘bins’.
Here’s an example — the first row in the following table shows a theoretical full granularity of data we might receive from some imaginary partner, but the second row contains the same information after we have normalised the data into categories for use in LAMP. You can immediately see that the chances of identifying someone from the data drop significantly.
|Age||Course||Degree Result||FTE||Country of Domicile|
|23||Physics with Study in Europe||67%||0.75||Spain|
|Mature||Physics||Upper Second||Part Time||EU|
Furthermore, our user interface will only be displaying data after we have performed statistical analysis on it. Nobody will even have access to row two of the data above — instead you should expect to see results such as the following:
Or if you query the LAMP API, you will be able to get the same data in tabular form:
The user interfaces, both graphical and our API, will be limited to returning results at this level of depth.
Hopefully that covers some of the concerns surrounding generating and supplying LAMP with unique identifiers. For some more technical ideas about how to generate anonymous UIDs, I have given some examples in the appendix below to set people off. Next up, I’ll hopefully be posting about the design of our LAMP database, and how we will be applying the normalisations I talked about above so that we can achieve statistical significance.
The main issue with generating your own identifiers is that they must be unique. Beyond that, the issue of anonymity means that you might want a routine which is unguessable. Some possible ideas are below, but this is by no means an exhaustive list and anything else which follows these principles will be fine — at the end of the day, it’s better if we don’t know how the anonymous UIDs were generated!
Pick some ‘main’ table and use the row number from it
In the example tables above, the first table could be considered the ‘main’ one as it contains info about the student themselves, whereas the second one is just peripheral usage information, so is really a sub-table. If you have a situation like this, then whilst your original UIDs are still present in your data, you could use their presence in sub-tables to look up the row number from the main table, and add that to the sub-tables as the anonymous UID. Limitations of this approach are that sorting your table by original UID before you begin may result in your anonymous UIDs being guessable. On the other hand, if you don’t sort by UID, then if new data is added to the main table it may interrupt or otherwise disturb your numbering scheme.
Make up your own anonymiser routine
As long as you choose a process that will return a unique number for every input, then you can transform your old UIDs yourself and tell nobody how you did it. For example, you could multiply all the old IDs by 3 and add 4. As I mentioned before, too simple a routine may be guessable, but more complicated ones risk not returning a unique number — for example, subtracting a student’s age from their student ID would result in a clash between student 331, aged 19 and student 359, aged 47…
Use a hash function
This is the most technological approach — a hash function will return a unique code for every possible input. It is so computationally difficult to get the original data back from the hashed code as to make it effectively impossible, which is why they are used in digital certificates and for securing passwords. The returned codes don’t follow any numerical sequence, so id’s 1, 2, 3, 4 will not come out with hash codes in the same order. At the moment, SHA-2 is probably a reasonable set of hash functions to use. You can install and use the openssl library (it may well already be installed on your system) — for example, on any UNIX based system,
echo id | openssl sha256will return the hash code for