Creating the LAMP database

We now have data from three different LAMP partners, and we’ve started looking at the structure of the data. On one hand, we are interested in how we normalise the data for statistical analysis, but on the other hand, we also need to start thinking about how data is going to be consumed by the LAMP application. In my previous post regarding the architecture of the LAMP system as a whole, I looked at some theoretical architectures which might be a good fit for the application’s requirements as we understand them right now, and the common point in all of the architecture options I have considered is the LAMP database.

Data Structure Concerns

It has been suggested that we retain the full granularity of data as supplied to us by the partners (although in my anonymous UIDs post I noted that this data will not be available to end users), which in itself raises some interesting challenges. Each partner is storing and sending different pieces of information, kept in different columns. For example, from one provider we might see something like column one of the following table, and from another something like column two. As you’ll see below, some columns clearly mean the same thing (and I have lined them up accordingly), some are possibly similar, and others have no analogues between different partners.

Provider One Provider Two
user# Identifier
Ethnicity ethnicity
Country of Domicile nationality
Country of Domicile Region
Gender gender
Disability disability
UCAS Tariff Points Tariff
Age on entry
AcadYear
Mode of Attendance Attendance Mode
progression_code
Class Code
Course Code Prog Code
Course Name Course name
Course Type Course type
JACS code
student_status
Faculty
Location of Study Campus
Franchised Out
Enrolment End Date Date of graduation
Agreed Award final award
Agreed Classification
Agreed Overall mark
loans Loans per borrower
total E Number of different E-resources accessed
all visits Number of visits

In designing a database, one sets out to work out which pieces of data will be present in a database table. For any one partner, this is easy — we are being given the tables directly. Add more than one feed like the ones above, however, and there are some choices to make in our database design regarding how we standardise the column names when they clearly mean the same thing.
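As a rough illustration of what standardising column names could look like, here is a minimal Python sketch. The provider headings are taken from the table above; the standardised names (`tariff_points`, `course_code` and so on), the `standardise_row` helper and the sample values are all assumptions made up for this example, not a decided LAMP schema.

```python
# Hypothetical mapping from each provider's headings to shared names.
COLUMN_MAP = {
    "provider_one": {
        "Ethnicity": "ethnicity",
        "Gender": "gender",
        "UCAS Tariff Points": "tariff_points",
        "Course Code": "course_code",
        "loans": "loans",
    },
    "provider_two": {
        "ethnicity": "ethnicity",
        "gender": "gender",
        "Tariff": "tariff_points",
        "Prog Code": "course_code",
        "Loans per borrower": "loans",
    },
}


def standardise_row(provider, row):
    """Rename a provider's columns to the shared names.

    Columns without a mapping are returned separately so nothing is lost.
    """
    mapping = COLUMN_MAP[provider]
    standard, extras = {}, {}
    for column, value in row.items():
        if column in mapping:
            standard[mapping[column]] = value
        else:
            extras[column] = value
    return standard, extras


# Invented sample rows; "Faculty" stands in for a column with no mapping.
row_one = {"Gender": "F", "UCAS Tariff Points": 300, "loans": 3, "Faculty": "Science"}
row_two = {"gender": "M", "Tariff": 250, "Loans per borrower": 153}

print(standardise_row("provider_one", row_one))
print(standardise_row("provider_two", row_two))
```

Keeping the unmapped columns to one side, rather than dropping them, fits the suggestion that we retain the full granularity of what partners send us.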

Access Control Concerns

The next factor which may influence our choice of database structure is security. Database products such as PostgreSQL come with built-in access control, so access can be restricted to certain database users on a per-table basis. This is certainly an appealing model for giving partners full access to their own data: we could create one database user for each LAMP partner, and perform access control accordingly.

The difficulty with that access model, however, comes when we want to run analyses which compare data from multiple partners, which implies a degree of access to each other’s data. Of course, we would only be allowing access to other partners’ data after it had been standardised and normalised, but in order to perform the calculations, access will still be required at a level which is far more granular than the LAMP project wants to expose. And this is the crux of the matter: the kind of statistical comparisons people require mean that it is simply not possible to avoid the LAMP statistics layer having read access to all the normalised data.

The implication of this is that we will not be able to use database access control alone to restrict access to the full granularity of the normalised data. Instead, either our API or our statistics layer will have to use software routines to implement its own filters and sanity checks based on credentials supplied to the API. Only after these have been passed will results from cross-provider queries be returned to the user.

Outside of cross-provider queries, there will still be the need to restrict access to raw data solely to the provider who submitted it, and it remains to be seen whether or not database access control will be able to play a part in achieving this or whether it too will be purely achieved via software routines.
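To make the idea of software filters a little more concrete, here is a minimal Python sketch of the two kinds of check. Everything in it is hypothetical: the `credentials` dictionary, the function names, and in particular the `MIN_GROUP_SIZE` rule (suppressing very small groups from cross-provider results), which is just one plausible sanity check and not something the project has settled on.

```python
MIN_GROUP_SIZE = 5  # hypothetical threshold; smaller groups are suppressed


def private_rows(rows, credentials):
    """Return only the rows submitted by the provider named in the credentials."""
    return [r for r in rows if r["provider"] == credentials["provider"]]


def comparison_counts(rows, column):
    """Count rows per category across all providers, hiding small groups."""
    counts = {}
    for r in rows:
        counts[r[column]] = counts.get(r[column], 0) + 1
    return {value: n for value, n in counts.items() if n >= MIN_GROUP_SIZE}


# Invented data: provider A sends 4 rows, provider B sends 3.
rows = (
    [{"provider": "A", "final_award": "Upper Second"}] * 4
    + [{"provider": "B", "final_award": "Upper Second"}] * 2
    + [{"provider": "B", "final_award": "First"}] * 1
)

print(len(private_rows(rows, {"provider": "B"})))  # 3
print(comparison_counts(rows, "final_award"))      # {'Upper Second': 6}
```

The statistics layer sees every row, but a caller only ever receives either their own rows or aggregated counts that have passed the check.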

Some Design Options

  1. One table per provider, with normalisations and column name standardisations in a lookup table:

    Diagram (Graphviz DOT source):

        digraph op1{
          graph[rankdir="LR"]
          subgraph clusterDB{
            graph[label="Database"]
            Normalisations -> Standardisations -> "Provider 1 table";
            Standardisations -> "Provider 2 table";
            Standardisations -> "Provider 3 table";
            node[shape="trapezium"]
            "Statistical Comparison Queries" -> Normalisations;
            "Private Queries" -> "Provider 1 table";
            "Private Queries" -> "Provider 2 table";
            "Private Queries" -> "Provider 3 table";
          }
        }


    The benefit of this approach would be that the original detail submitted by providers would appear in the database, one provider per table. This would also mean making use of database access control on a per-table basis was an option. However, the need to constantly perform lookup routines in order to standardise and normalise the data for comparison could impact performance and result in quite complicated queries, and there would also have to be at least one database user with read access to all the tables for the purposes of running such comparisons.
  2. One table with standardised column names for all providers, with normalisations in a separate lookup table:

    Diagram (Graphviz DOT source):

        digraph op1{
          graph[rankdir="LR"]
          subgraph clusterDB{
            graph[label="Database"]
            Standardisations -> "All Providers' Standardised Data";
            "Filter for only one\nProvider's data" -> "All Providers' Standardised Data";
            Normalisations -> "All Providers' Standardised Data";
            node[shape="trapezium"]
            "Statistical Comparison Queries" -> Normalisations;
            "Private Queries" -> "Filter for only one\nProvider's data";
          }
          "Provider 1 raw" -> Standardisations;
          "Provider 2 raw" -> Standardisations;
          "Provider 3 raw" -> Standardisations;
        }


    This may be a ‘best of both worlds’ solution: we would hold the full level of detail submitted to us by providers in our database, in a single table with standardised headings. Queries could then be run through normalisation routines when comparison and statistics are required, but institutions would still be able to get at the data they submitted (albeit standardised) for other types of analysis which would be private to their LAMP dashboard. From an access control perspective, this scenario would rely entirely on software checks (at the provider filter, and on the comparison query results) in order to protect data.
  3. One normalised, standardised table, with no access to original data:

    Diagram (Graphviz DOT source):

        digraph op1{
          graph[rankdir="LR"]
          subgraph clusterDB{
            graph[label="Database"]
            Standardisations -> Normalisations -> "All Providers' Normalised Data";
            node[shape="trapezium"]
            "Statistical Comparison Queries" -> "All Providers' Normalised Data";
          }
          "Provider 1 raw" -> Standardisations;
          "Provider 2 raw" -> Standardisations;
          "Provider 3 raw" -> Standardisations;
        }


    This scenario would probably perform best and keep queries simple, but has the drawback that the full level of detail in the data as submitted by the providers would not be held in the database, which would limit us to only running queries on the normalised data. Since a number of the LAMP use cases seem to involve providers wanting to store their data with us and query it in place, this option is pretty much ruled out. Restricting access to the detailed normalised data, as in the previous example, would be completely done in software.
  4. Using redundancy and holding both raw individual tables as well as a standardised/normalised combined one:

    Diagram (Graphviz DOT source):

        digraph op1{
          graph[rankdir="LR"]
          subgraph clusterDB{
            graph[label="Database"]
            Standardisations -> Normalisations -> "All Providers' Normalised Data";
            "Provider 1 table";
            "Provider 2 table";
            "Provider 3 table";
            node[shape="trapezium"]
            "Private Queries" -> "Provider 1 table";
            "Private Queries" -> "Provider 2 table";
            "Private Queries" -> "Provider 3 table";
            "Statistical Comparison Queries" -> "All Providers' Normalised Data";
          }
          "Provider 1 raw" -> Standardisations;
          "Provider 2 raw" -> Standardisations;
          "Provider 3 raw" -> Standardisations;
          "Provider 1 raw" -> "Provider 1 table";
          "Provider 2 raw" -> "Provider 2 table";
          "Provider 3 raw" -> "Provider 3 table";
        }

    Another option exists whereby we could combine options one and three — the raw data goes back in the database, and database user accounts are reintroduced to control access to it as in option one. However, as in option three, the normalised table is also present in the database, for combined queries which are regulated by the API. For even higher levels of security, the raw and the normalised tables don’t strictly speaking even need to be in the same database!

    This last option would really be an implementation suited to high levels of paranoia regarding the raw data and our API’s software safeguards, and faith that the normalisation routines do a good enough job of anonymising that data to justify the combined table not being subject to the same levels of security.

At the moment I’m leaning towards option two: we can reverse-lookup the standardisations if partners absolutely need their original column headings back, but having all the data in one standardised table will help with both private and combined queries. Since database access control cannot offer us the security required in our application, we will need to implement software checks in any case, so we may as well embrace the fact and get on with how those checks will work! It’s conceivable that option four might perform better, as the need to do standardisation and/or normalisation lookups at query-time is removed, but we’ll keep an eye on that as we test and build the database.
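For anyone who prefers to see option two as tables rather than boxes and arrows, here is a small sketch using Python's built-in `sqlite3` module as a stand-in for whichever database product we end up with. The table names (`student_data`, `standardisations`), the column names and the rows are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE standardisations (
        provider TEXT, original_name TEXT, standard_name TEXT
    );
    CREATE TABLE student_data (
        provider TEXT, tariff_points INTEGER, loans INTEGER
    );
    """
)
conn.executemany(
    "INSERT INTO standardisations VALUES (?, ?, ?)",
    [
        ("provider_one", "UCAS Tariff Points", "tariff_points"),
        ("provider_two", "Tariff", "tariff_points"),
    ],
)
conn.executemany(
    "INSERT INTO student_data VALUES (?, ?, ?)",
    [("provider_one", 300, 3), ("provider_two", 250, 153), ("provider_two", 400, 96)],
)

# Private query: the provider filter is applied in software, not by the database.
private = conn.execute(
    "SELECT tariff_points, loans FROM student_data "
    "WHERE provider = ? ORDER BY loans",
    ("provider_two",),
).fetchall()

# Reverse lookup of a provider's original column heading.
original = conn.execute(
    "SELECT original_name FROM standardisations "
    "WHERE provider = ? AND standard_name = ?",
    ("provider_two", "tariff_points"),
).fetchone()[0]

print(private)   # [(400, 96), (250, 153)]
print(original)  # Tariff
```

The single `provider` column is what the "filter for only one provider's data" box in the diagram would act on.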

In my next post, I’m hoping to go into more detail about the statistics layer, and how we implement some of the routines Ellen blogged about, leading into how we build our API!

Unique Identifiers Which Don’t Identify Anyone!

  1. Why might we need Unique Identifiers?
  2. How might a Unique ID Be Generated?
  3. But we don’t want our data to be identified!
  4. Appendix: Some Technical Ideas For Generating Anonymous Unique Identifiers

Why might we need Unique Identifiers?

So far early data is coming through on the LAMP project, and partners have been anonymising student data before they send it to us. This anonymisation process removes their internal unique identifiers (student number, say, or registration number) as well as any other identifying data. A sample of the data that we are seeing might be something like the following (which I have just made up on the spot, any similarity to persons living or deceased is unintentional):

| Age Starting Course | UCAS points | Subject | Final Grade | Number of library loans |
|---|---|---|---|---|
| 18 | 300 | Physics and Astronomy | Upper Second | 3 |
| 19 | 250 | Cake Decorating and Icing Manufacture | First | 153 |
| 43 | 400 | Medieval History | Lower Second | 96 |

This is great for us to get started with, and there’s a lot we can do statistically with this data. However, different partners are keeping (and submitting to this project) different levels of data, and so it’s conceivable that we might later get updates to this information (in subsequent academic years, say), or even other tables of information from partners. Maybe something like this detailing use of electronic and on-site facilities:

| Number of PC logins | e-resource logins | Library barrier entries |
|---|---|---|
| 1307 | 4 | 260 |
| 51 | 0 | 2741 |
| 254 | 98 | 5 |

Now, let’s suppose row 2 in our second table is actually the same person as row 1 in our first. If we know this, there are perhaps more meaningful questions which we can ask of our statistics routines. The problem, however, is how we can possibly find this out. When we receive the tables, we could give every row a number at our end, but this in itself doesn’t help us cross-reference between tables. The only way for us to know for sure that rows refer to the same people is if the partners somehow supply us with this information. Of course, if partners are certain that they will never send updates, they can always merge tables like these themselves and just send us one single table. Otherwise…

How might a Unique ID Be Generated?

The most straightforward way to cross-reference between tables and updates would be if the partners added a unique number, or identifier, to records for every person they send us data about. Most partners will already have a unique ID in use internally, like the student number or registration number. However, sending us this information directly would destroy the anonymity of the data and would raise concerns about data protection.

The best solution to this problem would be for partners to generate a different unique ID, which cannot be back-traced to their internal ID, but instead is only of use for cross-referencing purposes. Some example ideas as to how an anonymous unique ID could be generated are appended to this post.

The lookup table (or routine) between the original UID and the new anonymous UID would be something private to the partners, maintained by them, that we would never get to see.

But we don’t want our data to be identified!

The question then becomes whether or not this new ID is, in itself, an identifying feature which should cause partners concern, and below I will attempt to set out some thoughts as to why assigning an anonymous ID should be worry-free.

Fundamentally speaking, if the anonymisation process which generates the new ID is good, then this extra piece of information will tell us absolutely nothing. It will purely be a means for us to join up different tables at our end — to illustrate, from the examples above, we would end up with something like:

| Anonymous UID | Age Starting Course | UCAS points | Subject | Final Grade | Number of library loans | Number of PC logins | e-resource logins | Library barrier entries |
|---|---|---|---|---|---|---|---|---|
| 765003 | 18 | 300 | Physics and Astronomy | Upper Second | 3 | 51 | 0 | 2741 |
| 331224 | 19 | 250 | Cake Decorating and Icing Manufacture | First | 153 | | | |
| 903876 | 43 | 400 | Medieval History | Lower Second | 96 | | | |

In other words, it’s the same as if we received a single table from the providers, with just one now-meaningless column added — those anonymous values may as well be 1,2,3 for all the information they give us.
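The join itself is straightforward once both tables carry the same anonymous UID. Here is a minimal Python sketch using the made-up rows from the examples above; the dictionary layout and the `join_on_uid` helper are assumptions for illustration only.

```python
# Main table, keyed by anonymous UID (values from the made-up example above).
students = {
    765003: {"age": 18, "ucas_points": 300, "subject": "Physics and Astronomy",
             "final_grade": "Upper Second", "loans": 3},
    331224: {"age": 19, "ucas_points": 250,
             "subject": "Cake Decorating and Icing Manufacture",
             "final_grade": "First", "loans": 153},
    903876: {"age": 43, "ucas_points": 400, "subject": "Medieval History",
             "final_grade": "Lower Second", "loans": 96},
}

# A later table of usage data; only one UID is known to match.
usage = {
    765003: {"pc_logins": 51, "e_resource_logins": 0, "barrier_entries": 2741},
}


def join_on_uid(main, extra):
    """Merge two tables keyed by anonymous UID, keeping every main-table row."""
    return {uid: {**record, **extra.get(uid, {})} for uid, record in main.items()}


merged = join_on_uid(students, usage)
print(merged[765003]["barrier_entries"])  # 2741
print("pc_logins" in merged[331224])      # False
```

Nothing in the join depends on what the UID values are, only on them being the same in both tables.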

The next concern, then, is that looking at data in the table above could somehow be identifying even if the anonymous IDs are meaningless. This is a completely valid concern — if there is only one person on a given course who is over forty, then looking at the data in this way would immediately make it clear that we were talking about that person.

To answer this concern, we need to be clear about what the LAMP project will actually be letting users look at. Tables of the type shown above will only be held confidentially on LAMP systems. The LAMP project will not be offering data it holds at anything like the level of granularity given above. Aside from the fact that we want to protect anonymity, there are issues surrounding statistical significance. Simply put, we will be offering users the chance to observe and extract meaningful patterns from the data held in our systems. In order for us to definitely say there is a pattern, we must be able to say with confidence that an observed trend in the data is actually happening, and not just some effect in the numbers that we might see from any random sample. The way in which we make such statements is using statistical significance, and we cannot make statistically significant statements unless we reduce the level of granularity in our data by grouping data into ‘bins’.

Here’s an example — the first row in the following table shows a theoretical full granularity of data we might receive from some imaginary partner, but the second row contains the same information after we have normalised the data into categories for use in LAMP. You can immediately see that the chances of identifying someone from the data drop significantly.

| Age | Course | Degree Result | FTE | Country of Domicile |
|---|---|---|---|---|
| 23 | Physics with Study in Europe | 67% | 0.75 | Spain |
| Mature | Physics | Upper Second | Part Time | EU |
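A normalisation routine that turns the first row of the table above into the second might look something like the following Python sketch. Every threshold and grouping here is an assumption chosen so that the example works: the age of 21 for ‘Mature’, the conventional UK degree class boundaries, the tiny `EU_COUNTRIES` set, and the `COURSE_GROUPS` lookup are not the project's agreed bins.

```python
# All thresholds and groupings below are assumptions made for illustration.
EU_COUNTRIES = {"Spain", "France", "Germany"}  # deliberately incomplete
COURSE_GROUPS = {"Physics with Study in Europe": "Physics"}


def classify_mark(percent):
    """Map an overall mark to a degree class (conventional UK boundaries)."""
    if percent >= 70:
        return "First"
    if percent >= 60:
        return "Upper Second"
    if percent >= 50:
        return "Lower Second"
    if percent >= 40:
        return "Third"
    return "Fail"


def normalise(record):
    """Replace detailed values with coarse categories ('bins')."""
    return {
        "age": "Mature" if record["age"] >= 21 else "Young",
        "course": COURSE_GROUPS.get(record["course"], "Other"),
        "degree_result": classify_mark(record["mark_percent"]),
        "fte": "Full Time" if record["fte"] >= 1.0 else "Part Time",
        "domicile": "EU" if record["domicile"] in EU_COUNTRIES else "Other",
    }


raw = {
    "age": 23,
    "course": "Physics with Study in Europe",
    "mark_percent": 67,
    "fte": 0.75,
    "domicile": "Spain",
}
print(normalise(raw))
```

Many different raw rows collapse onto the same normalised row, which is exactly what makes an individual harder to pick out.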

Furthermore, our user interface will only be displaying data after we have performed statistical analysis on it. Nobody will even have access to row two of the data above — instead you should expect to see results such as the following:

[Figure: number of loans vs final grade (axes: loans, final grade)]

Or if you query the LAMP API, you will be able to get the same data in tabular form:

| Loans | Final Grade |
|---|---|
| 3 | 2 |
| 22 | 1 |
| 30 | 2 |
| 10 | 3 |
| 0 | 5 |
| 0 | 4 |
| 7 | 3 |
| 6 | 4 |
| 15 | 3 |
| 17 | 2 |

The user interfaces, both graphical and our API, will be limited to returning results at this level of depth.

Hopefully that covers some of the concerns surrounding generating and supplying LAMP with unique identifiers. For some more technical ideas about how to generate anonymous UIDs, I have given some examples in the appendix below to set people off. Next up, I’ll hopefully be posting about the design of our LAMP database, and how we will be applying the normalisations I talked about above so that we can achieve statistical significance.

Appendix: Some Technical Ideas For Generating Anonymous Unique Identifiers

The main issue with generating your own identifiers is that they must be unique. Beyond that, the issue of anonymity means that you might want a routine which is unguessable. Some possible ideas are below, but this is by no means an exhaustive list and anything else which follows these principles will be fine — at the end of the day, it’s better if we don’t know how the anonymous UIDs were generated!

  1. Pick some ‘main’ table and use the row number from it

    In the example tables above, the first table could be considered the ‘main’ one as it contains info about the student themselves, whereas the second one is just peripheral usage information, so is really a sub-table. If you have a situation like this, then whilst your original UIDs are still present in your data, you could use their presence in sub-tables to look up the row number from the main table, and add that to the sub-tables as the anonymous UID. Limitations of this approach are that sorting your table by original UID before you begin may result in your anonymous UIDs being guessable. On the other hand, if you don’t sort by UID, then if new data is added to the main table it may interrupt or otherwise disturb your numbering scheme.

  2. Make up your own anonymiser routine

    As long as you choose a process that will return a unique number for every input, then you can transform your old UIDs yourself and tell nobody how you did it. For example, you could multiply all the old IDs by 3 and add 4. As I mentioned before, too simple a routine may be guessable, but more complicated ones risk not returning a unique number — for example, subtracting a student’s age from their student ID would result in a clash between student 331, aged 19 and student 359, aged 47…

  3. Use a hash function

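    This is the most technological approach. A cryptographic hash function returns a fixed-length code for any input, and although two different inputs could in principle produce the same code, for a function such as SHA-256 this is so unlikely that the codes can be treated as unique. Recovering an arbitrary input from its hash is so computationally difficult as to be effectively impossible, which is why hash functions are used in digital certificates and for securing passwords. Be aware, though, that student numbers come from a small, predictable range, so anyone who knows which hash function you used could simply hash every possible ID and compare the results; mixing in a secret value (a salt or key) that only you know guards against this. The returned codes don’t follow any numerical sequence, so IDs 1, 2, 3, 4 will not come out with hash codes in the same order. At the moment, SHA-2 is probably a reasonable set of hash functions to use. You can install and use the openssl library (it may well already be installed on your system): for example, on any UNIX-based system, echo id | openssl sha256 will return the hash code for id (note that echo appends a newline, which gets hashed too, so use the same command every time).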
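For partners who would rather script this than do it by hand, here is a minimal Python sketch of ideas two and three. The `has_clash` helper checks whether a home-made routine gives two people the same anonymous ID, using the two students from the example above. The `anonymous_uid` function is a variation on plain hashing rather than the openssl one-liner: it uses HMAC with SHA-256, that is, a hash mixed with a secret key, so that nobody without the key can rebuild the IDs by hashing every possible student number. The key shown is a placeholder, and both function names are invented for this example.

```python
import hashlib
import hmac


def has_clash(transform, records):
    """Return True if two different records get the same anonymous ID."""
    outputs = [transform(r) for r in records]
    return len(set(outputs)) != len(outputs)


records = [{"id": 331, "age": 19}, {"id": 359, "age": 47}]

print(has_clash(lambda r: 3 * r["id"] + 4, records))     # False
print(has_clash(lambda r: r["id"] - r["age"], records))  # True


def anonymous_uid(student_id, secret_key):
    """Keyed SHA-256 (HMAC) of the ID; the key never leaves the partner."""
    message = str(student_id).encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()


key = b"replace-with-a-long-random-secret"  # placeholder, not a real key
uid = anonymous_uid(331, key)
print(len(uid))  # 64 hexadecimal characters
```

As long as the same key is used for every table and every update, the same student always gets the same anonymous UID, which is all we need for cross-referencing.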

Planning the LAMP Architecture

As data starts to come in from various LAMP partners, we’re considering the architectural design of our application. Some ideas came out of the first LAMP CAP meeting as well which might have a bearing on how we put our system together.

For each partner, we will receive a series of datasets covering various aspects of library analytic data (usage, degree attainment, journal subscriptions, etc.), which we’ll be importing into a database. One of the goals of LAMP is to be able to perform statistical analysis of this data, and so at some point there will need to be a component in our application which is capable of statistical calculation.

Our project will deliver a user interface front-end for the purposes of viewing the analysed data and customising which analysis to apply. Mention has also been made, however, of an API being delivered so that LAMP users can, if needs be, get the results of statistical analysis for use in their own applications without using our user interface.

As soon as we’re considering provision of an API, a common best-practice concept is that we should eat our own dog food: namely, that our own LAMP application should be built on top of our API. This process helps keep our API useable and relevant, as it means that any functionality we’ll need for our front end is available to everyone else through our API as well.

One thing which was mentioned at the CAP meeting, however, was the idea that we could combine our data with that from other APIs out there which offer different analytics data, such as JUSP and IRIS. This would increase the number of sources from which we could pull data and perform statistical analysis.

Before other APIs were mentioned, my initial feeling was that a statistical application layer such as R, integrated with the database, might be the most efficient way to offer up analyses from the LAMP data. A reasonable structure for this arrangement might be:

Diagram (Graphviz DOT source):

    digraph architecture{
      node[shape=trapezium];
      "LAMP statistics API interface";
      "UI Interface";
      node[shape=rectangle];
      "LAMP database"->"LAMP data statistics layer"->"LAMP statistics API"->"LAMP statistics API interface";
      "LAMP statistics API"->"UI layer"->"UI Interface";
    }


(system components are rectangles, interfaces to the system are trapeziums)
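To show how the layers in this arrangement might hang together in code, here is a toy Python sketch in which each function stands in for one component. All of the function names and the sample rows are invented; in a real system the database would be an actual database and the statistics layer might well be R rather than Python.

```python
def database_rows():
    """Stand-in for the LAMP database: invented (loans, final grade) pairs."""
    return [(3, 2), (22, 1), (30, 2), (10, 3)]


def statistics_layer(rows):
    """Stand-in for the statistics layer: mean loans per final grade."""
    totals = {}
    for loans, grade in rows:
        total, count = totals.get(grade, (0, 0))
        totals[grade] = (total + loans, count + 1)
    return {grade: total / count for grade, (total, count) in totals.items()}


def api_mean_loans_by_grade():
    """The API: the only way for any client to reach the statistics layer."""
    return statistics_layer(database_rows())


def ui_render():
    """The UI layer, built on the same API call that other users would make."""
    results = api_mean_loans_by_grade()
    return [f"grade {grade}: {mean:.1f} loans"
            for grade, mean in sorted(results.items())]


for line in ui_render():
    print(line)
```

Because `ui_render` only ever calls `api_mean_loans_by_grade`, anything the front end can show is also available to other API consumers.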

However, the additional requirement to bring in other APIs pulls such a structure into question. Depending on what content is in other APIs and whether we can successfully cross-reference and/or make use of it, there may be a corresponding requirement to carry out statistical analysis across results from multiple APIs. If this is the case, then either the statistics processing layer needs to move (so that processing occurs after the other data has been pulled in), or a second processing layer will be necessary. Questions also arise as to whether or not a LAMP API should serve aggregated statistical results after data has been consumed from other APIs, or whether we only want to offer results from our own database via a LAMP API, leaving analysis across multiple APIs for consumers to implement according to their own individual requirements.

The following diagram attempts to sum up these options, with component options represented with dashed boundaries. Statistical processing options are in purple, and API options in blue:

Diagram (Graphviz DOT source):

    digraph architecture{
      node[shape=trapezium];
      "UI Interface";

      node[style=dashed,penwidth=2,color=blue];
      "LAMP statistics API interface";
      "aggregated statistics API interface";

      node[shape=rectangle]
      "LAMP statistics API"->"LAMP statistics API interface";
      "aggregated statistics API"->"aggregated statistics API interface";

      node[color=purple];
      "LAMP data statistics layer"->"LAMP statistics API";
      "aggregated statistics layer"->"aggregated statistics API";

      node[style=solid, penwidth=1, color=black];
      "LAMP database"->"LAMP data statistics layer";
      "LAMP statistics API"->"aggregator";
      "external APIs"->"aggregator"->"aggregated statistics layer";
      "aggregated statistics API"->"UI layer"->"UI Interface";
    }

From comments within the LAMP team, we’re leaning towards implementing a processing layer and API on our own LAMP data for now, and only including data from other APIs in the UI for comparison. Further statistical analysis of the type described above could then be an option for a future release phase of LAMP. Whilst we consider these options further, I’ll be working on the LAMP database structure and trying to import some of the early data from our partners, which I’ll cover in a future post!