Creating the LAMP database

We now have data from three different LAMP partners, and we’ve started looking at the structure of the data. On one hand, we are interested in how we normalise the data for statistical analysis, but on the other hand, we also need to start thinking about how data is going to be consumed by the LAMP application. In my previous post regarding the architecture of the LAMP system as a whole, I looked at some theoretical architectures which might be a good fit for the application’s requirements as we understand them right now, and the common point in all of the architecture options I have considered is the LAMP database.

Data Structure Concerns

It has been suggested that we retain the full granularity of data as supplied to us from the partners (although in my anonymous UIDs post I noted that this data will not be available to end users), which in itself raises some interesting challenges. Each partner is storing and sending different pieces of information, kept in different columns. For example, from one provider we might see something like column one of the following table, and from another something like column 2. As you’ll see below, some columns clearly mean the same thing (and I have lined them up accordingly), some are possibly similar, and others have no analogues between different partners.

Provider One Provider Two
user# Identifier
Ethnicity ethnicity
Country of Domicile nationality
Country of Domicile Region
Gender gender
Disability disability
UCAS Tariff Points Tariff
Age on entry
AcadYear
Mode of Attendance Attendance Mode
progression_code
Class Code
Course Code Prog Code
Course Name Course name
Course Type Course type
JACS code
student_status
Faculty
Location of Study Campus
Franchised Out
Enrolment End Date Date of graduation
Agreed Award final award
Agreed Classification
Agreed Overall mark
loans Loans per borrower
total E Number of different E-resources accessed
all visits Number of visits

In designing a database, one sets out to work out which pieces of data will be present in a database table. For any one partner, this is easy — we are being given the tables directly. Add more than one feed like the ones above, however, and there are some choices to make in our database design regarding how we standardise the column names when they clearly mean the same thing.

Access Control Concerns

The next factor which may influence our choice of database structure is security. Database products such as Postgresql come with built in security, so access can be restricted to certain database users on a per-table basis. This is certainly an appealing model for controlling partners having full access to their own data — we could create one database user for each LAMP partner, and perform access control accordingly.

The difficulty with that access model, however, comes when we want to run analyses which compare data from multiple partners, which implies a degree of access to each others’ data. Of course, we would only be allowing access to other partners data after it had been standardised and normalised, but in order to perform the calculations, access will still be required at a level which is far more granular than the LAMP project wants to expose. And this is the crux of the matter — the kind of statistical comparisons people require mean that it is simply not possible to avoid the LAMP statistics layer having read access to all the normalised data.

The implication of this is that we will not be able to use database access control alone to restrict access to the full granularity of the normalised data. Instead, either our API or our statistics layer will have to use software routines to implement its own filters and sanity checks based on credentials supplied to the API. Only after these have been passed will results from cross-provider queries be returned to the user.

Outside of cross-provider queries, there will still be the need to restrict access to raw data solely to the provider who submitted it, and it remains to be seen whether or not database access control will be able to play a part in achieving this or whether it too will be purely achieved via software routines.

Some Design Options

  1. One table per provider, with normalisations and column name standardisations in a lookup table:

    Error generating Graphviz image:

    Graphviz cannot generate graph
    Command: /usr/bin/dot '-Kdot' '-Tpng' '-o/blogs/wordpress/wp-content/tfo-graphviz/c3047af0fad4741ce1fe25fae80c5c55.png'
    Output: 
    Warning: : syntax error in line 6 near '-'
    
    Original DOT:
        1 
        2 digraph op1{
        3 graph[rankdir="LR"]
        4 subgraph clusterDB{
        5 graph[label="Database"]
        6 Normalisations -> Standardisations -> "Provider 1 table";
        7 Standardisations -> "Provider 2 table";
        8 Standardisations -> "Provider 3 table";
        9 node[shape="trapezium"]
       10 "Statistical Comparison Queries" -> Normalisations;
       11 "Private Queries" -> "Provider 1 table";
       12 "Private Queries" -> "Provider 2 table";
       13 "Private Queries" -> "Provider 3 table";
       14 }}
       15 
    


    The benefit of this approach would be that the original detail submitted by providers would appear in the database, one provider per table. This would also mean making use of database access control on a per-table basis was an option. However, the need to constantly perform lookup routines in order to standardise and normalise the data for comparison could impact performance and result in quite complicated queries, and there would also have to be at least one database user with read access to all the tables for the purposes of running such comparisons.
  2. One table with standardised column names for all providers, with normalisations in a separate lookup table:

    Error generating Graphviz image:

    Graphviz cannot generate graph
    Command: /usr/bin/dot '-Kdot' '-Tpng' '-o/blogs/wordpress/wp-content/tfo-graphviz/cfc16ef1a9a1e902bc0a4c29ad4aeb4e.png'
    Output: 
    Warning: : syntax error in line 6 near '-'
    
    Original DOT:
        1 
        2 digraph op1{
        3 graph[rankdir="LR"]
        4 subgraph clusterDB{
        5 graph[label="Database"]
        6 Standardisations -> "All Providers' Standardised Data";
        7 "Filter for only onenProvider's data" -> "All Providers' Standardised Data";
        8 Normalisations -> "All Providers' Standardised Data";
        9 node[shape="trapezium"]
       10 "Statistical Comparison Queries" -> Normalisations;
       11 "Private Queries" -> "Filter for only onenProvider's data";
       12 }
       13 "Provider 1 raw" -> Standardisations;
       14 "Provider 2 raw" -> Standardisations;
       15 "Provider 3 raw" -> Standardisations;
       16 }
       17 
    


    This may be a ‘best of both worlds’ solution — we would hold the full level of detail submitted to us by providers in our database, in a single table with standardised headings. Queries could then be run through normalisation routines when comparison and statistics are required, but institutions would still be able to get at the data they submitted (albeit standardised) for other types of analysis which would be private to their LAMP dashboard. From an access control pespective, this scenario would rely entirely on software checks — at the provider filter, and on the comparison query results — in order to protect data.
  3. One normalised, standardised table, with no access to original data:

    Error generating Graphviz image:

    Graphviz cannot generate graph
    Command: /usr/bin/dot '-Kdot' '-Tpng' '-o/blogs/wordpress/wp-content/tfo-graphviz/26bc031cec0deeb4314ee2db64e8c63e.png'
    Output: 
    Warning: : syntax error in line 6 near '-'
    
    Original DOT:
        1 
        2 digraph op1{
        3 graph[rankdir="LR"]
        4 subgraph clusterDB{
        5 graph[label="Database"]
        6 Standardisations -> Normalisations -> "All Providers' Normalised Data";
        7 node[shape="trapezium"]
        8 "Statistical Comparison Queries" -> "All Providers' Normalised Data";
        9 }
       10 "Provider 1 raw" -> Standardisations;
       11 "Provider 2 raw" -> Standardisations;
       12 "Provider 3 raw" -> Standardisations;
       13 }
       14 
    


    This scenario would probably perform best and keep queries simple, but has the drawback that the full level of detail in the data as submitted by the providers would not be held in the database, which would limit us to only running queries on the normalised data. Since a number of the LAMP use cases seem to involve providers wanting to store their data with us and query it in place, this option is pretty much ruled out. Restricting access to the detailed normalised data, as in the previous example, would be completely done in software.
  4. Using redundancy and holding both raw individual tables as well as a standardised/normalised combined one

    Error generating Graphviz image:

    Graphviz cannot generate graph
    Command: /usr/bin/dot '-Kdot' '-Tpng' '-o/blogs/wordpress/wp-content/tfo-graphviz/a1c5501eadf600068966f36dc1a9009b.png'
    Output: 
    Warning: : syntax error in line 6 near '-'
    
    Original DOT:
        1 
        2 digraph op1{
        3 graph[rankdir="LR"]
        4 subgraph clusterDB{
        5 graph[label="Database"]
        6 Standardisations -> Normalisations -> "All Providers' Normalised Data";
        7 "Provider 1 table";
        8 "Provider 2 table";
        9 "Provider 3 table";
       10 node[shape="trapezium"]
       11 "Private Queries" -> "Provider 1 table";
       12 "Private Queries" -> "Provider 2 table";
       13 "Private Queries" -> "Provider 3 table";
       14 "Statistical Comparison Queries" -> "All Providers' Normalised Data";
       15 }
       16 "Provider 1 raw" -> Standardisations;
       17 "Provider 2 raw" -> Standardisations;
       18 "Provider 3 raw" -> Standardisations;
       19 "Provider 1 raw" -> "Provider 1 table";
       20 "Provider 2 raw" -> "Provider 2 table";
       21 "Provider 3 raw" -> "Provider 3 table";
       22 
       23 }
       24 
    

    Another option exists whereby we could combine options one and three — the raw data goes back in the database, and database user accounts are reintroduced to control access to it as in option one. However, as in option three, the normalised table is also present in the database, for combined queries which are regulated by the API. For even higher levels of security, the raw and the normalised tables don’t strictly speaking even need to be in the same database!

    This last option would really be an implementation suited to high levels of paranoia regarding the raw data and our API’s software safeguards, and faith that the normalisation routines do a good enough job of anonymising that data to justify the combined table not being subject to the same levels of security.

At the moment I’m leaning towards option two — we can reverse-lookup the standardisations if partners absolutely need their original column headings back, but having all the data in one standardised table will help with both private and combined queries. Since database access control cannot offer us the security required in our application, we will need to implement software checks in any case, so we may as well embrace the fact and get on with how those checks will work! It’s conceivable that option 4 might perform better, as the need to do standardisation and/or normalisation lookups at query-time is removed, but we’ll keep an eye on that as we test and build the database.

In my next post, I’m hoping to go into more detail about the statistics layer, and how we implement some of the routines Ellen blogged about, leading into how we build our API!