Custom Normalisations


At the recent 17 October LAMP CAP meeting, I gave a presentation starting from my last blog posts and detailing some of the progress I have been making along a number of separate lines. I know in my last post I said I would be talking about statistics next, but during parallel development of the various architectural components it turned out that some other pieces of work slotted in before then, and so I want to do a quick series of posts going into more detail from my talk.

In this post, I want to build upon previous discussions of normalising data for statistical analysis and discuss the idea of LAMP users being able to use custom normalisations. The other posts I want to write will cover data content standardisation, our database schema, our API, and the graph drawing portion of our user interface respectively.

Problems with Fixed Normalisations

At the moment, the LAMP database uses fixed data normalisation. This means we are collecting records together into similar groups based on their content (e.g. Ethnicities of ‘White’, ‘White African’ and ‘White Irish’ all become grouped together as ‘White’ for statistical purposes), but that the LAMP team have decided what those groupings should be based on an overview of the data which providers have submitted to us.

We’re expecting that these fixed normalisations will cover the majority of use cases fro LAMP, but let me use a user story example to illustrate a situation in which the fixed normalisations will not be sufficient, and to introduce custom normalisations as a solution.

Suppose someone is interested in what percentage of people get first class degrees at their institution, and for diversity purposes wants to break that down by ethnicity. They could ask the LAMP API for this information and get back the following (fictional) graph, which has been normalised (as described above) by grouping related records together into common categories for anonymisation and statistical analysis.

Graph 1: Percentage of first class degrees by ethnicity

Not KnownOtherAsianBlackWhiteChinese00.

Normalisations used to build Graph 1

Content from provider Normalised group
“Asian – Pakistani” “Asian”
“Black – Other” “Black”
“White/Black Caribbn” “Other”
“Asian – Indian” “Asian”
“White and Asian” “Other”
“White Irish” “White”
“Other White Background” “White”
“Black – Caribbean” “Black”
“White” “White”
“Asian – Other” “Asian”
“Black – African” “Black”
“Information Refused” “Not known”
“Asian – Chinese” “Chinese”
“Asian – Bangladeshi” “Asian”
“White/Black African” “Other”
“Latin American” “Other”
“Other Mixed” “Other”
“Other” “Other”

Now, let’s say this particular institution is looking to break into South America and so is interested in the performance of its Latin American Students in particular, but not bothered at all about separating out Chinese students from the rest of Asia. At the moment, as mentioned above, the ethnicity normalisations which appear in graph 1 are a fixed default set of normalisations. In other words, you currently cannot change the categories to focus in on Latin America.

This seems quite restrictive, and so it has been suggested that we should not tie users of LAMP to the groupings we have chosen. The only way to get around this is if users are able to supply a list of their own ‘custom’ groupings if they prefer. Initially, I assumed this would just be a case of the user submitting a list (like the one next to graph 1) to our API when they run their query, and asking the API to use that list instead.

An Example of Custom Normalisations

Since the normalisations table doing the work is similar to the one above, then my initial idea was that altering the normalisations groupings as below and submitting those up to the API would solve the problem of custom normalisations.

Content from provider LAMP Normalised group Custom Normalised group
“Asian – Pakistani” “Asian” “Asian”
“Black – Other” “Black” “Black”
“White/Black Caribbn” “Other” “Other”
“Asian – Indian” “Asian” “Asian”
“White and Asian” “Other” “Other”
“White Irish” “White” “White”
“Other White Background” “White” “White”
“Black – Caribbean” “Black” “Black”
“White” “White” “White”
“Asian – Other” “Asian” “Asian”
“Black – African” “Black” “Black”
“Information Refused” “Not known” “Not known”
“Asian – Chinese” “Chinese” “Asian”
“Asian – Bangladeshi” “Asian” “Asian”
“White/Black African” “Other” “Other”
“Latin American” “Other” “Latin American”
“Other Mixed” “Other” “Other”
“Other” “Other” “Other”

Making the changes detailed in the right-hand column and submitting them to the API would then result in the following graph being output instead:

Not KnownOtherAsianBlackWhiteLatin American00.

Unfortunately, although this is the right general idea, the tables above are limited to the output from just one (fictitious) provider. When I thought about this process in the context of multiple providers, a barrier to this approach became evident. I’ll talk about this barrier more, and how standardising the content of data fields like those in the left-hand columns above will be necessary for a solution, in my next post.

Standardising Data Content to Allow For Custom Normalisations


In my previous post, I set out to describe a potential infrastructure which would allow users of LAMP to choose how records are normalised by defining their own groupings, but then I mentioned that I had spotted a barrier to doing this easily — that every institution which has submitted data to LAMP describes the same content in a different way. This is like the difference on a web page between a free text field and a drop down list — at the moment, institutions are submitting their data to us as if they had filled in a free text field.

For this post, I’ll explain a little more about standardising the content of the data fields. The process of standardising data content is analogous to converting the ‘free text’ values mentioned above into a fixed set such as the ones you might find on a drop-down list.

I’ve already introduced the idea of standardising, but last time it was with field names from different providers where they hold the same conceptual data (e.g. Country of Domicile from one provider means the same thing as Nationality from another). This time, we’ll be focusing on the content of the fields instead.

Our Current Fixed Normalisations and the Barriers to Custom Normalisations

Let’s consider an example which exposes how difficult it would currently be for users to supply custom normalisations. For the sake of this example, I have only focused on all the various types of Ethnicity that LAMP are currently grouping together as ‘Asian’, with the exception of Chinese, which we are currently putting in a different ‘Chinese’ group.

You can see a sample of the fixed normalisations table we currently have inside the LAMP database below. So far this covers content from four different providers.

1. a sample of Ethnicities from our current comprehensive normalisations table (visualised below as a flow diagram)
institution original_contents normalised_contents
1 Asian Other Asian
1 Bangladeshi Asian
1 Indian Asian
1 Pakistani Asian
1 Chinese Chinese
2 Asian or Asian British – Bangladeshi Asian
2 Asian or Asian British – Indian Asian
2 Asian or Asian British – Pakistani Asian
2 Other Asian background Asian
2 Chinese Chinese
3 Asian or Asian Brisith – Pakistani Asian
3 Asian or Asian British – Bangladeshi Asian
3 Asian or Asian British – Indian Asian
3 Asian – Other Asian
3 Chinese Chinese
4 3[^4] Asian
4 6 Asian
4 34 Chinese

Error generating Graphviz image

Graphviz cannot generate graph
Command: /usr/bin/dot '-Kdot' '-Tpng' '-o/blogs/wordpress/wp-content/tfo-graphviz/85d3ed9ea91310e55bc27a8ee1f3a365.png'
Warning: : syntax error in line 21 near '-'

Original DOT:
    2 digraph table1{
    3 graph[rankdir="LR"];
    4 subgraph clusterNormalised {
    5 graph[label="normalised_contents"];
    6 node[shape="trapezium"];
    7 "Chinese (n)";
    8 "Asian";
    9 }
   10 subgraph clusterInstitutions {
   11 graph[label="institution"];
   12 node[shape="parallelogram"];
   13 1;
   14 2;
   15 3;
   16 4;
   17 }
   18 subgraph clusterRaw {
   19 graph[label="original_contents"];
   20 node[shape="rectangle"];
   21 1->"Asian Other"->"Asian";
   22 1->"Bangladeshi"->"Asian";
   23 1->"Indian"->"Asian";
   24 1->"Pakistani"->"Asian";
   25 1->"Chinese"->"Chinese (n)";
   26 edge[color="red"];
   27 2->"Asian or Asian British - Bangladeshi"->"Asian";
   28 2->"Asian or Asian British - Indian"->"Asian";
   29 2->"Asian or Asian British - Pakistani"->"Asian";
   30 2->"Other Asian background"->"Asian";
   31 2->"Chinese";
   32 edge[color="blue"];
   33 3->"Asian or Asian Brisith - Pakistani"->"Asian";
   34 3->"Asian or Asian British - Bangladeshi";
   35 3->"Asian or Asian British - Indian";
   36 3->"Asian - Other"->"Asian";
   37 3->"Chinese";
   38 edge[color="green"];
   39 4->"3[^4]"->"Asian";
   40 4->6->"Asian";
   41 4->34->"Chinese (n)";
   42 }
   43 }

In this simplified example, let’s suppose our user doesn’t want to use ‘Chinese’ as a grouping, but would prefer for their business purposes to only use ‘Asian’ for everything in that region. My assumption from my last post was that the user would be able to achieve this by submitting a simple custom normalisations table up to the API. On the other hand, in order for the LAMP application to offer all the same functionality as with our default normalisations, the user would effectively have to submit their own custom implementation of table 1.

The word simple above is the key — as you can see, table 1 is quite complicated owing to different institutions representing ethnicities differently (institution 1 uses ‘Indian’, for example, whereas institution 2 uses ‘Asian or Asian British – Indian’, and so on). Thinking along these lines, in order for a user to draw up their custom copy of table 1, they would need to know all of the possible entries for every different institution. This is not simple at all for the user!

Data Content Standardisation as a Solution

In order to simplify the table which users will need to supply in order to generate custom normalisations, we will need to insert an extra step into table 1. At the moment, we look at the content of the data and decide which normalised grouping it belongs in. If instead, we first look at the content of the data and replace it with a value chosen from a standard list of LAMP-certified values, we can then look at the standardised values and group them into normalisations as a second step. Finally, in order to supply a custom normalisation, a user only needs to know our list of LAMP-standardised values and put them into their own groupings.

In our example, to standardise the values, we would replace the values in table 2a) with the corresponding value from table 2b):

2. Standardising data content
a) the original list of possible ethnicities from our normalisations table
Asian Other
Asian or Asian British – Bangladeshi
Asian or Asian British – Indian
Asian or Asian British – Pakistani
Other Asian background
Asian or Asian Brisith – Pakistani
Asian – Other
b) A suggested list of standardised replacements
Asian – Other
Asian British
Asian – Any

Splitting table 1 into two steps can now be achieved as shown in table 3. The first part would be to use a lookup table to standardise all the different values from different institutions into the ones in table 2b. Ideally, the end user generally doesn’t need to see or know about this step — it would be something we did inside the LAMP application. The result would look something like table 3a):

3. Replacing the normalisations table with two tables: a) a standardisations table…
institution original_contents standardised_contents
1 Asian Other Asian – Other
1 Bangladeshi Bangladeshi
1 Indian Indian
1 Pakistani Pakistani
1 Chinese Chinese
2 Asian or Asian British – Bangladeshi Bangladeshi
2 Asian or Asian British – Indian Indian
2 Asian or Asian British – Pakistani Pakistani
2 Other Asian background Asian – Other
2 Chinese Chinese
3 Asian or Asian Brisith – Pakistani Pakistani
3 Asian or Asian British – Bangladeshi Bangladeshi
3 Asian or Asian British – Indian Indian
3 Asian – Other Asian – Other
3 Chinese Chinese
4 3[^4] Asian – Any
4 6 Pakistani
4 34 Chinese
… and b) a normalisations table
standardised contents normalised_contents
Asian – Other Asian
Bangladeshi Asian
Indian Asian
Pakistani Asian
Chinese Chinese
Asian British Asian
Asian – Any Asian

Error generating Graphviz image

Graphviz cannot generate graph
Command: /usr/bin/dot '-Kdot' '-Tpng' '-o/blogs/wordpress/wp-content/tfo-graphviz/9727d7839350636bb5c025654663ff79.png'
Warning: : syntax error in line 31 near '-'

Original DOT:
    2 digraph table3{
    3 graph[rankdir="LR"];
    4 subgraph clusterNormalised {
    5 graph[label="normalised_contents"];
    6 node[shape="trapezium"];
    7 "Chinese (n)";
    8 "Asian";
    9 }
   10 subgraph clusterInstitutions {
   11 graph[label="institution"];
   12 node[shape="parallelogram"];
   13 1;
   14 2;
   15 3;
   16 4;
   17 }
   18 subgraph clusterStandard {
   19 graph[label="standardisation"];
   20 node[shape="diamond"];
   21 "Asian - Any";
   22 "Pakistani (s)";
   23 "Asian - Other (s)";
   24 "Indian (s)";
   25 "Bangladeshi (s)";
   26 "Chinese (s)";
   27 }
   28 subgraph clusterRaw {
   29 graph[label="original_contents"];
   30 node[shape="rectangle"];
   31 1->"Asian Other"->"Asian - Other (s)"->"Asian";
   32 1->"Bangladeshi"->"Bangladeshi (s)"->"Asian";
   33 1->"Indian"->"Indian (s)"->"Asian";
   34 1->"Pakistani"->"Pakistani (s)"->"Asian";
   35 1->"Chinese"->"Chinese (s)"->"Chinese (n)";
   36 edge[color="red"];
   37 2->"Asian or Asian British - Bangladeshi"->"Bangladeshi (s)";
   38 2->"Asian or Asian British - Indian"->"Indian (s)";
   39 2->"Asian or Asian British - Pakistani"->"Pakistani (s)";
   40 2->"Other Asian background"->"Asian - Other (s)";
   41 2->"Chinese";
   42 edge[color="blue"];
   43 3->"Asian or Asian Brisith - Pakistani"->"Pakistani (s)";
   44 3->"Asian or Asian British - Bangladeshi";
   45 3->"Asian or Asian British - Indian";
   46 3->"Asian - Other"->"Asian - Other (s)";
   47 3->"Chinese";
   48 edge[color="green"];
   49 4->"3[^4]"->"Asian - Any"->"Asian";
   50 4->6->"Pakistani (s)";
   51 4->34->"Chinese (s)";
   52 }
   53 }

The normalisations step is now performed by a much simpler second lookup table (3b) ) which groups these new standard field contents into the LAMP default categories. In this new table it no longer matters which institution the data originally came from, which makes it a much simpler table.

Custom Normalisations, Revisited

The end result after introducing data content standardisation will be that if you want to specify your own custom normalisation, you will only need to submit something like the following (which groups ‘Chinese’ into ‘Asian’ instead) up to the API:

5. Proposed structure of an alternative normalisation for the data in table 1, which could be submitted by a user to our API
standardised contents target_column normalised_contents
Asian – Other Ethnicity Asian
Bangladeshi Ethnicity Asian
Indian Ethnicity Asian
Pakistani Ethnicity Asian
Chinese Ethnicity Asian
Asian British Ethnicity Asian
Asian – Any Ethnicity Asian

Hopefully this clears up why we want to focus on content standardisation, as well as how we’ll be going about it! The ideas above will obviously result in some slight changes to our database schema, and how our API works, and so in my next posts I want to talk about both of those aspects of our architecture.

Creating the LAMP database

We now have data from three different LAMP partners, and we’ve started looking at the structure of the data. On one hand, we are interested in how we normalise the data for statistical analysis, but on the other hand, we also need to start thinking about how data is going to be consumed by the LAMP application. In my previous post regarding the architecture of the LAMP system as a whole, I looked at some theoretical architectures which might be a good fit for the application’s requirements as we understand them right now, and the common point in all of the architecture options I have considered is the LAMP database.

Data Structure Concerns

It has been suggested that we retain the full granularity of data as supplied to us from the partners (although in my anonymous UIDs post I noted that this data will not be available to end users), which in itself raises some interesting challenges. Each partner is storing and sending different pieces of information, kept in different columns. For example, from one provider we might see something like column one of the following table, and from another something like column 2. As you’ll see below, some columns clearly mean the same thing (and I have lined them up accordingly), some are possibly similar, and others have no analogues between different partners.

Provider One Provider Two
user# Identifier
Ethnicity ethnicity
Country of Domicile nationality
Country of Domicile Region
Gender gender
Disability disability
UCAS Tariff Points Tariff
Age on entry
Mode of Attendance Attendance Mode
Class Code
Course Code Prog Code
Course Name Course name
Course Type Course type
JACS code
Location of Study Campus
Franchised Out
Enrolment End Date Date of graduation
Agreed Award final award
Agreed Classification
Agreed Overall mark
loans Loans per borrower
total E Number of different E-resources accessed
all visits Number of visits

In designing a database, one sets out to work out which pieces of data will be present in a database table. For any one partner, this is easy — we are being given the tables directly. Add more than one feed like the ones above, however, and there are some choices to make in our database design regarding how we standardise the column names when they clearly mean the same thing.

Access Control Concerns

The next factor which may influence our choice of database structure is security. Database products such as Postgresql come with built in security, so access can be restricted to certain database users on a per-table basis. This is certainly an appealing model for controlling partners having full access to their own data — we could create one database user for each LAMP partner, and perform access control accordingly.

The difficulty with that access model, however, comes when we want to run analyses which compare data from multiple partners, which implies a degree of access to each others’ data. Of course, we would only be allowing access to other partners data after it had been standardised and normalised, but in order to perform the calculations, access will still be required at a level which is far more granular than the LAMP project wants to expose. And this is the crux of the matter — the kind of statistical comparisons people require mean that it is simply not possible to avoid the LAMP statistics layer having read access to all the normalised data.

The implication of this is that we will not be able to use database access control alone to restrict access to the full granularity of the normalised data. Instead, either our API or our statistics layer will have to use software routines to implement its own filters and sanity checks based on credentials supplied to the API. Only after these have been passed will results from cross-provider queries be returned to the user.

Outside of cross-provider queries, there will still be the need to restrict access to raw data solely to the provider who submitted it, and it remains to be seen whether or not database access control will be able to play a part in achieving this or whether it too will be purely achieved via software routines.

Some Design Options

  1. One table per provider, with normalisations and column name standardisations in a lookup table:

    Error generating Graphviz image

    Graphviz cannot generate graph
    Command: /usr/bin/dot '-Kdot' '-Tpng' '-o/blogs/wordpress/wp-content/tfo-graphviz/c3047af0fad4741ce1fe25fae80c5c55.png'
    Warning: : syntax error in line 6 near '-'
    Original DOT:
        2 digraph op1{
        3 graph[rankdir="LR"]
        4 subgraph clusterDB{
        5 graph[label="Database"]
        6 Normalisations -> Standardisations -> "Provider 1 table";
        7 Standardisations -> "Provider 2 table";
        8 Standardisations -> "Provider 3 table";
        9 node[shape="trapezium"]
       10 "Statistical Comparison Queries" -> Normalisations;
       11 "Private Queries" -> "Provider 1 table";
       12 "Private Queries" -> "Provider 2 table";
       13 "Private Queries" -> "Provider 3 table";
       14 }}

    The benefit of this approach would be that the original detail submitted by providers would appear in the database, one provider per table. This would also mean making use of database access control on a per-table basis was an option. However, the need to constantly perform lookup routines in order to standardise and normalise the data for comparison could impact performance and result in quite complicated queries, and there would also have to be at least one database user with read access to all the tables for the purposes of running such comparisons.
  2. One table with standardised column names for all providers, with normalisations in a separate lookup table:

    Error generating Graphviz image

    Graphviz cannot generate graph
    Command: /usr/bin/dot '-Kdot' '-Tpng' '-o/blogs/wordpress/wp-content/tfo-graphviz/cfc16ef1a9a1e902bc0a4c29ad4aeb4e.png'
    Warning: : syntax error in line 6 near '-'
    Original DOT:
        2 digraph op1{
        3 graph[rankdir="LR"]
        4 subgraph clusterDB{
        5 graph[label="Database"]
        6 Standardisations -> "All Providers' Standardised Data";
        7 "Filter for only onenProvider's data" -> "All Providers' Standardised Data";
        8 Normalisations -> "All Providers' Standardised Data";
        9 node[shape="trapezium"]
       10 "Statistical Comparison Queries" -> Normalisations;
       11 "Private Queries" -> "Filter for only onenProvider's data";
       12 }
       13 "Provider 1 raw" -> Standardisations;
       14 "Provider 2 raw" -> Standardisations;
       15 "Provider 3 raw" -> Standardisations;
       16 }

    This may be a ‘best of both worlds’ solution — we would hold the full level of detail submitted to us by providers in our database, in a single table with standardised headings. Queries could then be run through normalisation routines when comparison and statistics are required, but institutions would still be able to get at the data they submitted (albeit standardised) for other types of analysis which would be private to their LAMP dashboard. From an access control pespective, this scenario would rely entirely on software checks — at the provider filter, and on the comparison query results — in order to protect data.
  3. One normalised, standardised table, with no access to original data:

    Error generating Graphviz image

    Graphviz cannot generate graph
    Command: /usr/bin/dot '-Kdot' '-Tpng' '-o/blogs/wordpress/wp-content/tfo-graphviz/26bc031cec0deeb4314ee2db64e8c63e.png'
    Warning: : syntax error in line 6 near '-'
    Original DOT:
        2 digraph op1{
        3 graph[rankdir="LR"]
        4 subgraph clusterDB{
        5 graph[label="Database"]
        6 Standardisations -> Normalisations -> "All Providers' Normalised Data";
        7 node[shape="trapezium"]
        8 "Statistical Comparison Queries" -> "All Providers' Normalised Data";
        9 }
       10 "Provider 1 raw" -> Standardisations;
       11 "Provider 2 raw" -> Standardisations;
       12 "Provider 3 raw" -> Standardisations;
       13 }

    This scenario would probably perform best and keep queries simple, but has the drawback that the full level of detail in the data as submitted by the providers would not be held in the database, which would limit us to only running queries on the normalised data. Since a number of the LAMP use cases seem to involve providers wanting to store their data with us and query it in place, this option is pretty much ruled out. Restricting access to the detailed normalised data, as in the previous example, would be completely done in software.
  4. Using redundancy and holding both raw individual tables as well as a standardised/normalised combined one

    Error generating Graphviz image

    Graphviz cannot generate graph
    Command: /usr/bin/dot '-Kdot' '-Tpng' '-o/blogs/wordpress/wp-content/tfo-graphviz/a1c5501eadf600068966f36dc1a9009b.png'
    Warning: : syntax error in line 6 near '-'
    Original DOT:
        2 digraph op1{
        3 graph[rankdir="LR"]
        4 subgraph clusterDB{
        5 graph[label="Database"]
        6 Standardisations -> Normalisations -> "All Providers' Normalised Data";
        7 "Provider 1 table";
        8 "Provider 2 table";
        9 "Provider 3 table";
       10 node[shape="trapezium"]
       11 "Private Queries" -> "Provider 1 table";
       12 "Private Queries" -> "Provider 2 table";
       13 "Private Queries" -> "Provider 3 table";
       14 "Statistical Comparison Queries" -> "All Providers' Normalised Data";
       15 }
       16 "Provider 1 raw" -> Standardisations;
       17 "Provider 2 raw" -> Standardisations;
       18 "Provider 3 raw" -> Standardisations;
       19 "Provider 1 raw" -> "Provider 1 table";
       20 "Provider 2 raw" -> "Provider 2 table";
       21 "Provider 3 raw" -> "Provider 3 table";
       23 }

    Another option exists whereby we could combine options one and three — the raw data goes back in the database, and database user accounts are reintroduced to control access to it as in option one. However, as in option three, the normalised table is also present in the database, for combined queries which are regulated by the API. For even higher levels of security, the raw and the normalised tables don’t strictly speaking even need to be in the same database!

    This last option would really be an implementation suited to high levels of paranoia regarding the raw data and our API’s software safeguards, and faith that the normalisation routines do a good enough job of anonymising that data to justify the combined table not being subject to the same levels of security.

At the moment I’m leaning towards option two — we can reverse-lookup the standardisations if partners absolutely need their original column headings back, but having all the data in one standardised table will help with both private and combined queries. Since database access control cannot offer us the security required in our application, we will need to implement software checks in any case, so we may as well embrace the fact and get on with how those checks will work! It’s conceivable that option 4 might perform better, as the need to do standardisation and/or normalisation lookups at query-time is removed, but we’ll keep an eye on that as we test and build the database.

In my next post, I’m hoping to go into more detail about the statistics layer, and how we implement some of the routines Ellen blogged about, leading into how we build our API!

Unique Identifiers Which Don’t Identify Anyone!

  1. Why might we need Unique Identifiers?
  2. How might a Unique ID Be Generated?
  3. But we don’t want our data to be identified!
  4. Appendix: Some Technical Ideas For Generating Anonymous Unique Identifiers

Why might we need Unique Identifiers?

So far early data is coming through on the LAMP project, and partners have been anonymising student data before they send it to us. This anonymisation process removes their internal unique identifiers (student number, say, or registration number) as well as any other identifying data. A sample of the data that we are seeing might be something like the following (which I have just made up on the spot, any similarity to persons living or deceased is unintentional):

Age Starting Course UCAS points Subject Final Grade Number of library loans
18 300 Physics and Astronomy Upper Second 3
19 250 Cake Decorating and Icing Manufacture First 153
43 400 Medieval History Lower Second 96

This is great for us to get started with, and there’s a lot we can do statistically with this data. However, different partners are keeping (and submitting to this project) different levels of data, and so it’s conceivable that we might later get updates to this information (in subsequent academic years, say), or even other tables of information from partners. Maybe something like this detailing use of electronic and on-site facilities:

Number of PC logins e-resource logins Library barrier entires
1307 4 260
51 0 2741
254 98 5

Now, let’s suppose row 2 in our second table is actually the same person as row 1 in our first. If we know this, there are perhaps more meaningful questions which we can ask of our statistics routines. The problem, however, is how we can possibly find this out. When we receive the tables, we could give every row a number at our end, but this in itself doesn’t help us cross-reference between tables. The only way for us to know for sure that rows refer to the same people, is if the partners somehow supply us with this information. Of course, if partners are certain that they will never send updates, they can always merge tables like these themselves and just send us one single table. Otherwise…

How might a Unique ID Be Generated?

The most straightforward way to cross-reference between tables and updates would be if the partners added a unique number, or identifier, to records for every person they send us data about. Most partners will already have a unique ID in use internally, like the student number or registration number. However, sending us this information directly would destroy the anonymity of the data and would raise concerns about data protection.

The best solution to this problem would be for partners to generate a different unique ID, which cannot be back-traced to their internal ID, but instead is only of use for cross-referencing purposes. Some example ideas as to how an anonymous unique ID could be generated are appended to this post.

The lookup table (or routine) between the original UID and the new anonymous UID would be something private to the partners, maintained by them, that we would never get to see.

But we don’t want our data to be identified!

The question then becomes whether or not this new ID is, in itself, an identifying feature which should cause partners concern, and below I will attempt to set out some thoughts as to why assigning an anonymous ID should be worry-free.

Fundamentally speaking, if the anonymisation process which generates the new ID is good, then this extra piece of information will tell us absolutely nothing. It will purely be a means for us to join up different tables at our end — to illustrate, from the examples above, we would end up with something like:

Anonymous UID Age Starting Course UCAS points Subject Final Grade Number of library loans Number of PC logins e-resource logins Library barrier entires
765003 18 300 Physics and Astronomy Upper Second 3 51 0 2741
331224 19 250 Cake Decorating and Icing Manufacture First 153
903876 43 400 Medieval History Lower Second 96

In other words, it’s the same as if we received a single table from the providers, with just one now-meaningless column added — those anonymous values may as well be 1,2,3 for all the information they give us.

The next concern, then, is that looking at data in the table above could somehow be identifying even if the anonymous IDs are meaningless. This is a completely valid concern — if there is only one person on a given course who is over forty, then looking at the data in this way would immediately make it clear that we were talking about that person.

To answer this concern, we need to be clear about what the LAMP project will actually be letting users look at. Tables of the type shown above will only be held confidentially on LAMP systems. The LAMP project will not be offering data it holds at anything like the level of granularity given above. Aside from the fact that we want to protect anonymity, there are issues surrounding statistical significance. Simply put, we will be offering users the chance to observe and extract meaningful patterns from the data held in our systems. In order for us to definitely say there is a pattern, we must be able to say with confidence that an observed trend in the data is actually happening, and not just some effect in the numbers that we might see from any random sample. The way in which we make such statements is using statistical significance, and we cannot make statistically significant statements unless we reduce the level of granularity in our data by grouping data into ‘bins’.

Here’s an example — the first row in the following table shows a theoretical full granularity of data we might receive from some imaginary partner, but the second row contains the same information after we have normalised the data into categories for use in LAMP. You can immediately see that the chances of identifying someone from the data drop significantly.

Age Course Degree Result FTE Country of Domicile
23 Physics with Study in Europe 67% 0.75 Spain
Mature Physics Upper Second Part Time EU

Furthermore, our user interface will only be displaying data after we have performed statistical analysis on it. Nobody will even have access to row two of the data above — instead you should expect to see results such as the following:

Number of loans vs final grade
final grade

Or if you query the LAMP API, you will be able to get the same data in tabular form:

Loans Final Grade
3 2
22 1
30 2
10 3
0 5
0 4
7 3
6 4
15 3
17 2

The user interfaces, both graphical and our API, will be limited to returning results at this level of depth.

Hopefully that covers some of the concerns surrounding generating and supplying LAMP with unique identifiers. For some more technical ideas about how to generate anonymous UIDs, I have given some examples in the appendix below to set people off. Next up, I’ll hopefully be posting about the design of our LAMP database, and how we will be applying the normalisations I talked about above so that we can achieve statistical significance.

Appendix: Some Technical Ideas For Generating Anonymous Unique Identifiers

The main issue with generating your own identifiers is that they must be unique. Beyond that, the issue of anonymity means that you might want a routine which is unguessable. Some possible ideas are below, but this is by no means an exhaustive list and anything else which follows these principles will be fine — at the end of the day, it’s better if we don’t know how the anonymous UIDs were generated!

  1. Pick some ‘main’ table and use the row number from it

    In the example tables above, the first table could be considered the ‘main’ one as it contains info about the student themselves, whereas the second one is just peripheral usage information, so is really a sub-table. If you have a situation like this, then whilst your original UIDs are still present in your data, you could use their presence in sub-tables to look up the row number from the main table, and add that to the sub-tables as the anonymous UID. Limitations of this approach are that sorting your table by original UID before you begin may result in your anonymous UIDs being guessable. On the other hand, if you don’t sort by UID, then if new data is added to the main table it may interrupt or otherwise disturb your numbering scheme.

  2. Make up your own anonymiser routine

    As long as you choose a process that will return a unique number for every input, then you can transform your old UIDs yourself and tell nobody how you did it. For example, you could multiply all the old IDs by 3 and add 4. As I mentioned before, too simple a routine may be guessable, but more complicated ones risk not returning a unique number — for example, subtracting a student’s age from their student ID would result in a clash between student 331, aged 19 and student 359, aged 47…

  3. Use a hash function

    This is the most technological approach — a hash function will return a unique code for every possible input. It is so computationally difficult to get the original data back from the hashed code as to make it effectively impossible, which is why they are used in digital certificates and for securing passwords. The returned codes don’t follow any numerical sequence, so id’s 1, 2, 3, 4 will not come out with hash codes in the same order. At the moment, SHA-2 is probably a reasonable set of hash functions to use. You can install and use the openssl library (it may well already be installed on your system) — for example, on any UNIX based system, echo id | openssl sha256 will return the hash code for id.

Planning the LAMP Architecture

As data starts to come in from various LAMP partners, we’re considering the architectural design of our application. Some ideas came out of the first LAMP CAP meeting as well which might have a bearing on how we put our system together.

For each partner, we will receive a series of datasets covering various aspects of library analytic data (usage, degree attainment, journal subscriptions, etc.), which we’ll be importing into a database. One of the goals of LAMP is to be able to perform statistical analysis of this data, and so at some point there will need to be a component in our application which is capable of statistical calculation.

Our project will deliver a user interface front-end for the purposes of viewing the analysed data and customising which analysis to apply. Mention has also been made, however, of an API being delivered so that LAMP users can, if needs be, get the results of statistical analysis for use in their own applications without using our user interface.

As soon as we’re considering provision of an API, a common best-practise concept is that we should eat our own dog food — namely, that our own LAMP application should be built on top of our API. This process helps keep our API useable and relevant, as it means that any functionality we’ll need for our front end is available to everyone else through our API as well.

One thing which was mentioned at the CAP meeting, however, was the idea that we could combine our data with that from other APIs out there which offer different analytics data, such as JUSP and IRIS. This would increase the number of sources from which we could pull data and perform statistical analysis.

Before other APIs were mentioned, my initial feeling was that a statistical application layer such as R, integrated with the database, might be the most efficient way to offer up analyses from the LAMP data. A reasonable structure for this arrangement might be:

Error generating Graphviz image

Graphviz cannot generate graph
Command: /usr/bin/dot '-Kdot' '-Tpng' '-o/blogs/wordpress/wp-content/tfo-graphviz/3620b6bd6a5926bcbae6b850e1ddfc30.png'
Warning: : syntax error in line 7 near '-'

Original DOT:
    2 digraph architecture{
    3 node[shape=trapezium];
    4 "LAMP statistics API interface";
    5 "UI Interface";
    6 node[shape=rectangle];
    7 "LAMP database"->"LAMP data statistics layer"->"LAMP statistics API"->"LAMP statistics API interface";
    8 "LAMP statistics API"->"UI layer"->"UI Interface";
    9 }

(system components are rectangles, interfaces to the system are trapeziums)

However, the additional requirement to bring in other APIs pulls such a structure into question. Depending on what content is in other APIs and whether we can successfully cross-reference and/or make use of it, there may be a corresponding requirement to carry out statistical analysis across results from multiple APIs. If this is the case, then either the statistics processing layer needs to move (so that processing occurs after the other data has been pulled in), or a second processing layer will be necessary. Questions also arise as to whether or not a LAMP API should serve aggregated statistical results after data has been consumed from other APIs, or whether we only want to offer results from our own database via a LAMP API, leaving analysis across multiple APIs for consumers to implement according to their own individual requirements.

The following diagram attempts to sum up these options, with component options represented with dashed boundaries. Statistical processing options are in purple, and API options in blue:

Error generating Graphviz image

Graphviz cannot generate graph
Command: /usr/bin/dot '-Kdot' '-Tpng' '-o/blogs/wordpress/wp-content/tfo-graphviz/d52fddb4ca8130df1614206e3ba698b2.png'
Warning: : syntax error in line 11 near '-'

Original DOT:
    2 digraph architecture{
    3 node[shape=trapezium];
    4 "UI Interface";
    6 node[style=dashed,penwidth=2,color=blue];
    7 "LAMP statistics API interface";
    8 "aggregated statistics API interface";
   10 node[shape=rectangle]
   11 "LAMP statistics API"->"LAMP statistics API interface";
   12 "aggregated statistics API"->"aggregated statistics API interface";
   14 node[color=purple];
   15 "LAMP data statistics layer"->"LAMP statistics API";
   16 "aggregated statistics layer"->"aggregated statistics API";
   18 node[style=solid, penwidth=1, color=black];
   19 "LAMP database"->"LAMP data statistics layer";
   20 "LAMP statistics API"->"aggregator";
   21 "external APIs"->"aggregator"->"aggregated statistics layer";
   22 "aggregated statistics API"->"UI layer"->"UI Interface";
   23 }

From comments within the LAMP team, we’re leaning towards implementing a processing layer and API on our own LAMP data for now, and only including data from other APIs in the UI for comparison. Further statistical analysis of the type described above could then be an option for the a future release phase of LAMP. Whilst we consider these options further, I’ll be working on the LAMP database structure and trying to import some of the early data from our partners, which I’ll cover in a future post!