Custom Normalisations


At the recent 17 October LAMP CAP meeting, I gave a presentation starting from my last blog posts and detailing some of the progress I have been making along a number of separate lines. I know in my last post I said I would be talking about statistics next, but during parallel development of the various architectural components it turned out that some other pieces of work slotted in before then, and so I want to do a quick series of posts going into more detail from my talk.

In this post, I want to build upon previous discussions of normalising data for statistical analysis and discuss the idea of LAMP users being able to use custom normalisations. The other posts I want to write will cover data content standardisation, our database schema, our API, and the graph drawing portion of our user interface respectively.

Problems with Fixed Normalisations

At the moment, the LAMP database uses fixed data normalisation. This means we are collecting records together into similar groups based on their content (e.g. Ethnicities of ‘White’, ‘White African’ and ‘White Irish’ all become grouped together as ‘White’ for statistical purposes), but that the LAMP team have decided what those groupings should be based on an overview of the data which providers have submitted to us.

We’re expecting that these fixed normalisations will cover the majority of use cases fro LAMP, but let me use a user story example to illustrate a situation in which the fixed normalisations will not be sufficient, and to introduce custom normalisations as a solution.

Suppose someone is interested in what percentage of people get first class degrees at their institution, and for diversity purposes wants to break that down by ethnicity. They could ask the LAMP API for this information and get back the following (fictional) graph, which has been normalised (as described above) by grouping related records together into common categories for anonymisation and statistical analysis.

Graph 1: Percentage of first class degrees by ethnicity

Not KnownOtherAsianBlackWhiteChinese00.

Normalisations used to build Graph 1

Content from provider Normalised group
“Asian – Pakistani” “Asian”
“Black – Other” “Black”
“White/Black Caribbn” “Other”
“Asian – Indian” “Asian”
“White and Asian” “Other”
“White Irish” “White”
“Other White Background” “White”
“Black – Caribbean” “Black”
“White” “White”
“Asian – Other” “Asian”
“Black – African” “Black”
“Information Refused” “Not known”
“Asian – Chinese” “Chinese”
“Asian – Bangladeshi” “Asian”
“White/Black African” “Other”
“Latin American” “Other”
“Other Mixed” “Other”
“Other” “Other”

Now, let’s say this particular institution is looking to break into South America and so is interested in the performance of its Latin American Students in particular, but not bothered at all about separating out Chinese students from the rest of Asia. At the moment, as mentioned above, the ethnicity normalisations which appear in graph 1 are a fixed default set of normalisations. In other words, you currently cannot change the categories to focus in on Latin America.

This seems quite restrictive, and so it has been suggested that we should not tie users of LAMP to the groupings we have chosen. The only way to get around this is if users are able to supply a list of their own ‘custom’ groupings if they prefer. Initially, I assumed this would just be a case of the user submitting a list (like the one next to graph 1) to our API when they run their query, and asking the API to use that list instead.

An Example of Custom Normalisations

Since the normalisations table doing the work is similar to the one above, then my initial idea was that altering the normalisations groupings as below and submitting those up to the API would solve the problem of custom normalisations.

Content from provider LAMP Normalised group Custom Normalised group
“Asian – Pakistani” “Asian” “Asian”
“Black – Other” “Black” “Black”
“White/Black Caribbn” “Other” “Other”
“Asian – Indian” “Asian” “Asian”
“White and Asian” “Other” “Other”
“White Irish” “White” “White”
“Other White Background” “White” “White”
“Black – Caribbean” “Black” “Black”
“White” “White” “White”
“Asian – Other” “Asian” “Asian”
“Black – African” “Black” “Black”
“Information Refused” “Not known” “Not known”
“Asian – Chinese” “Chinese” “Asian”
“Asian – Bangladeshi” “Asian” “Asian”
“White/Black African” “Other” “Other”
“Latin American” “Other” “Latin American”
“Other Mixed” “Other” “Other”
“Other” “Other” “Other”

Making the changes detailed in the right-hand column and submitting them to the API would then result in the following graph being output instead:

Not KnownOtherAsianBlackWhiteLatin American00.

Unfortunately, although this is the right general idea, the tables above are limited to the output from just one (fictitious) provider. When I thought about this process in the context of multiple providers, a barrier to this approach became evident. I’ll talk about this barrier more, and how standardising the content of data fields like those in the left-hand columns above will be necessary for a solution, in my next post.

Personas, job stories and simple planes: wireframing a LAMP interface

Let me start by introducing myself – my name is Benjamin Perry  and I am the Creative Design Coordinator at Mimas. My role on this project is alongside Leigh Morris as the User Experience (UX) Team who are responsible for designing the website itself. As part of this team it’s crucial that I not only understand but am involved with the Information Architecture and the User Interaction Design (the Skeleton Plane – more about planes below), However my particular focus is the visual appearance of the website itself (the Surface Plane).

Having recently read the seminal work of Jesse James Garrett on The Elements of User Experience , it’s clear that where I join this project it has already been through some good User Centered Design (UCD) processes. In Garrett’s book he talks about UX design consisting of 5 layers; the Strategy Plane, the Scope Plane, the Structure Plane, the Skeleton Plane and finally the Surface Plane. “These five planes provide a conceptual framework for talking about user experience problems and the tools we use to solve them”


Jesse James Garrett's simple planes Jesse James Garrett’s simple planes

As you work through each of the planes the issues that you have to deal with move from being abstract to more concrete. Each of these planes is dependent on the ones below it, but this does not mean that each plane needs to be finalized before the next can be started. It’s much better to be flexible; sometimes decisions made on an upper plane may force a reevaluation (or an evaluation for the first time) of an issue from a lower plane.

As I said before my dealings are mainly with the Skeleton and Surface planes, so I’m very much at the sharp end of the process. My work requires the planes before these to have been clearly defined (but not finalized!). As much as I’d like it, I don’t ever expect to be handed all the project brief and documentation then sail through the work without having any questions and get sign off first go – It’s never going to happen. User Centered Design is core to our business at Mimas so the UX team need to be the ones preaching this gospel the loudest. If we can enlighten people to the process we use and give them the tools to help work through the issues that they are faced with, then we will not only build better products but our jobs will be made much easier too.

So lets take a step back and look at what’s happened so far:

David Kay has been looking at the Epic Level narrative with the User Stories. These have been essential to help provide a clear business case for this project and define its strategy – the Strategy Plane.

Bethan Ruddock  has then used these to create some Personas, which have been used to create workflows. These are step-by-step guides that detail how a user would potentially perform a tasks based around of the features of the website (we’ve made these available). These have been used to start to flesh out the Scope Plane and also start to inform the Structure Plane.

The UX team has then taken these workflows along with some early prototype wireframes and visuals to create interactive walkthroughs of how these tasks might be achieved in the UI (we used InVision to do this – its fantastic!) – Structure, Skeleton and Surface Planes.

These have been essential to help generate discussion not only within the team but the wider LAMP CAP group. Seeing something physical in front of you is very powerful and it certainly highlighted some of the issues on the lower planes that needed to be readdressed or hadn’t yet been addressed.

In discussing these issues I was reminded of this blog post by Alan Klement  that Leigh had found which introduced the idea of Job Stories. In his post he summarises –

“… the problem with user stories is that it’s too many assumptions and doesn’t acknowledge causality. When a task is put in the format of a user story (As a [type of user], I want [some action], so that [outcome] ) there’s no room to ask ‘why’ – you’re essentially locked into a particular sequence with no context.”

Klement says with mature defined projects his team were able to talk through and understand proposed changes however “because our canvas is blank, we are having trouble getting on the same page when it comes to customer motivations, events and expectations. But today, things have turned around. I’ve come across a great way to use the jobs to be done philosophy to help define features. I call them Job Stories”

This immediately drew parallels with our project and seemed a natural solution for us for us to gather more information to inform our UCD process. We needed to get a more definitive idea of what people want to do with LAMP but also wrap that in real world context and expectations. With the CAP group full of future users of the site we thought this would be the perfect opportunity to introduce this idea and get them to tell us about their own Job Stories. So we gave the group blank forms to fill in following Klement’s process: (situation) When _____ , (Motivation) I want to _____ , (Expected Outcome) so I can _____ .

Getting the CAP group to think in this way was really positive. The information we collected is helping to define the features for this first phase of the project as well as generating discussion around future enhancements.

As a first outing using Job Stories we found this to be very successful. I’m not sure we did this in the same way that Alan Klement did it with his team, but it certainly generated lots of discussion, highlighted issues and gave us much more information to work with than we had before. What’s more, this information is not based on assumptions, as with the personas, but based on real life users, which is what we set out to achieve. You can see the job stories the CAP created.

We’ll be using the information and insights from these job stories as we work on the next stage of developing the LAMP interface.

Standardising Data Content to Allow For Custom Normalisations


In my previous post, I set out to describe a potential infrastructure which would allow users of LAMP to choose how records are normalised by defining their own groupings, but then I mentioned that I had spotted a barrier to doing this easily — that every institution which has submitted data to LAMP describes the same content in a different way. This is like the difference on a web page between a free text field and a drop down list — at the moment, institutions are submitting their data to us as if they had filled in a free text field.

For this post, I’ll explain a little more about standardising the content of the data fields. The process of standardising data content is analogous to converting the ‘free text’ values mentioned above into a fixed set such as the ones you might find on a drop-down list.

I’ve already introduced the idea of standardising, but last time it was with field names from different providers where they hold the same conceptual data (e.g. Country of Domicile from one provider means the same thing as Nationality from another). This time, we’ll be focusing on the content of the fields instead.

Our Current Fixed Normalisations and the Barriers to Custom Normalisations

Let’s consider an example which exposes how difficult it would currently be for users to supply custom normalisations. For the sake of this example, I have only focused on all the various types of Ethnicity that LAMP are currently grouping together as ‘Asian’, with the exception of Chinese, which we are currently putting in a different ‘Chinese’ group.

You can see a sample of the fixed normalisations table we currently have inside the LAMP database below. So far this covers content from four different providers.

1. a sample of Ethnicities from our current comprehensive normalisations table (visualised below as a flow diagram)
institution original_contents normalised_contents
1 Asian Other Asian
1 Bangladeshi Asian
1 Indian Asian
1 Pakistani Asian
1 Chinese Chinese
2 Asian or Asian British – Bangladeshi Asian
2 Asian or Asian British – Indian Asian
2 Asian or Asian British – Pakistani Asian
2 Other Asian background Asian
2 Chinese Chinese
3 Asian or Asian Brisith – Pakistani Asian
3 Asian or Asian British – Bangladeshi Asian
3 Asian or Asian British – Indian Asian
3 Asian – Other Asian
3 Chinese Chinese
4 3[^4] Asian
4 6 Asian
4 34 Chinese

Error generating Graphviz image:

Graphviz cannot generate graph
Command: /usr/bin/dot '-Kdot' '-Tpng' '-o/blogs/wordpress/wp-content/tfo-graphviz/85d3ed9ea91310e55bc27a8ee1f3a365.png'
Warning: : syntax error in line 21 near '-'

Original DOT:
    2 digraph table1{
    3 graph[rankdir="LR"];
    4 subgraph clusterNormalised {
    5 graph[label="normalised_contents"];
    6 node[shape="trapezium"];
    7 "Chinese (n)";
    8 "Asian";
    9 }
   10 subgraph clusterInstitutions {
   11 graph[label="institution"];
   12 node[shape="parallelogram"];
   13 1;
   14 2;
   15 3;
   16 4;
   17 }
   18 subgraph clusterRaw {
   19 graph[label="original_contents"];
   20 node[shape="rectangle"];
   21 1->"Asian Other"->"Asian";
   22 1->"Bangladeshi"->"Asian";
   23 1->"Indian"->"Asian";
   24 1->"Pakistani"->"Asian";
   25 1->"Chinese"->"Chinese (n)";
   26 edge[color="red"];
   27 2->"Asian or Asian British - Bangladeshi"->"Asian";
   28 2->"Asian or Asian British - Indian"->"Asian";
   29 2->"Asian or Asian British - Pakistani"->"Asian";
   30 2->"Other Asian background"->"Asian";
   31 2->"Chinese";
   32 edge[color="blue"];
   33 3->"Asian or Asian Brisith - Pakistani"->"Asian";
   34 3->"Asian or Asian British - Bangladeshi";
   35 3->"Asian or Asian British - Indian";
   36 3->"Asian - Other"->"Asian";
   37 3->"Chinese";
   38 edge[color="green"];
   39 4->"3[^4]"->"Asian";
   40 4->6->"Asian";
   41 4->34->"Chinese (n)";
   42 }
   43 }

In this simplified example, let’s suppose our user doesn’t want to use ‘Chinese’ as a grouping, but would prefer for their business purposes to only use ‘Asian’ for everything in that region. My assumption from my last post was that the user would be able to achieve this by submitting a simple custom normalisations table up to the API. On the other hand, in order for the LAMP application to offer all the same functionality as with our default normalisations, the user would effectively have to submit their own custom implementation of table 1.

The word simple above is the key — as you can see, table 1 is quite complicated owing to different institutions representing ethnicities differently (institution 1 uses ‘Indian’, for example, whereas institution 2 uses ‘Asian or Asian British – Indian’, and so on). Thinking along these lines, in order for a user to draw up their custom copy of table 1, they would need to know all of the possible entries for every different institution. This is not simple at all for the user!

Data Content Standardisation as a Solution

In order to simplify the table which users will need to supply in order to generate custom normalisations, we will need to insert an extra step into table 1. At the moment, we look at the content of the data and decide which normalised grouping it belongs in. If instead, we first look at the content of the data and replace it with a value chosen from a standard list of LAMP-certified values, we can then look at the standardised values and group them into normalisations as a second step. Finally, in order to supply a custom normalisation, a user only needs to know our list of LAMP-standardised values and put them into their own groupings.

In our example, to standardise the values, we would replace the values in table 2a) with the corresponding value from table 2b):

2. Standardising data content
a) the original list of possible ethnicities from our normalisations table
Asian Other
Asian or Asian British – Bangladeshi
Asian or Asian British – Indian
Asian or Asian British – Pakistani
Other Asian background
Asian or Asian Brisith – Pakistani
Asian – Other
b) A suggested list of standardised replacements
Asian – Other
Asian British
Asian – Any

Splitting table 1 into two steps can now be achieved as shown in table 3. The first part would be to use a lookup table to standardise all the different values from different institutions into the ones in table 2b. Ideally, the end user generally doesn’t need to see or know about this step — it would be something we did inside the LAMP application. The result would look something like table 3a):

3. Replacing the normalisations table with two tables: a) a standardisations table…
institution original_contents standardised_contents
1 Asian Other Asian – Other
1 Bangladeshi Bangladeshi
1 Indian Indian
1 Pakistani Pakistani
1 Chinese Chinese
2 Asian or Asian British – Bangladeshi Bangladeshi
2 Asian or Asian British – Indian Indian
2 Asian or Asian British – Pakistani Pakistani
2 Other Asian background Asian – Other
2 Chinese Chinese
3 Asian or Asian Brisith – Pakistani Pakistani
3 Asian or Asian British – Bangladeshi Bangladeshi
3 Asian or Asian British – Indian Indian
3 Asian – Other Asian – Other
3 Chinese Chinese
4 3[^4] Asian – Any
4 6 Pakistani
4 34 Chinese
… and b) a normalisations table
standardised contents normalised_contents
Asian – Other Asian
Bangladeshi Asian
Indian Asian
Pakistani Asian
Chinese Chinese
Asian British Asian
Asian – Any Asian

Error generating Graphviz image:

Graphviz cannot generate graph
Command: /usr/bin/dot '-Kdot' '-Tpng' '-o/blogs/wordpress/wp-content/tfo-graphviz/9727d7839350636bb5c025654663ff79.png'
Warning: : syntax error in line 31 near '-'

Original DOT:
    2 digraph table3{
    3 graph[rankdir="LR"];
    4 subgraph clusterNormalised {
    5 graph[label="normalised_contents"];
    6 node[shape="trapezium"];
    7 "Chinese (n)";
    8 "Asian";
    9 }
   10 subgraph clusterInstitutions {
   11 graph[label="institution"];
   12 node[shape="parallelogram"];
   13 1;
   14 2;
   15 3;
   16 4;
   17 }
   18 subgraph clusterStandard {
   19 graph[label="standardisation"];
   20 node[shape="diamond"];
   21 "Asian - Any";
   22 "Pakistani (s)";
   23 "Asian - Other (s)";
   24 "Indian (s)";
   25 "Bangladeshi (s)";
   26 "Chinese (s)";
   27 }
   28 subgraph clusterRaw {
   29 graph[label="original_contents"];
   30 node[shape="rectangle"];
   31 1->"Asian Other"->"Asian - Other (s)"->"Asian";
   32 1->"Bangladeshi"->"Bangladeshi (s)"->"Asian";
   33 1->"Indian"->"Indian (s)"->"Asian";
   34 1->"Pakistani"->"Pakistani (s)"->"Asian";
   35 1->"Chinese"->"Chinese (s)"->"Chinese (n)";
   36 edge[color="red"];
   37 2->"Asian or Asian British - Bangladeshi"->"Bangladeshi (s)";
   38 2->"Asian or Asian British - Indian"->"Indian (s)";
   39 2->"Asian or Asian British - Pakistani"->"Pakistani (s)";
   40 2->"Other Asian background"->"Asian - Other (s)";
   41 2->"Chinese";
   42 edge[color="blue"];
   43 3->"Asian or Asian Brisith - Pakistani"->"Pakistani (s)";
   44 3->"Asian or Asian British - Bangladeshi";
   45 3->"Asian or Asian British - Indian";
   46 3->"Asian - Other"->"Asian - Other (s)";
   47 3->"Chinese";
   48 edge[color="green"];
   49 4->"3[^4]"->"Asian - Any"->"Asian";
   50 4->6->"Pakistani (s)";
   51 4->34->"Chinese (s)";
   52 }
   53 }

The normalisations step is now performed by a much simpler second lookup table (3b) ) which groups these new standard field contents into the LAMP default categories. In this new table it no longer matters which institution the data originally came from, which makes it a much simpler table.

Custom Normalisations, Revisited

The end result after introducing data content standardisation will be that if you want to specify your own custom normalisation, you will only need to submit something like the following (which groups ‘Chinese’ into ‘Asian’ instead) up to the API:

5. Proposed structure of an alternative normalisation for the data in table 1, which could be submitted by a user to our API
standardised contents target_column normalised_contents
Asian – Other Ethnicity Asian
Bangladeshi Ethnicity Asian
Indian Ethnicity Asian
Pakistani Ethnicity Asian
Chinese Ethnicity Asian
Asian British Ethnicity Asian
Asian – Any Ethnicity Asian

Hopefully this clears up why we want to focus on content standardisation, as well as how we’ll be going about it! The ideas above will obviously result in some slight changes to our database schema, and how our API works, and so in my next posts I want to talk about both of those aspects of our architecture.

Creating the LAMP database

We now have data from three different LAMP partners, and we’ve started looking at the structure of the data. On one hand, we are interested in how we normalise the data for statistical analysis, but on the other hand, we also need to start thinking about how data is going to be consumed by the LAMP application. In my previous post regarding the architecture of the LAMP system as a whole, I looked at some theoretical architectures which might be a good fit for the application’s requirements as we understand them right now, and the common point in all of the architecture options I have considered is the LAMP database.

Data Structure Concerns

It has been suggested that we retain the full granularity of data as supplied to us from the partners (although in my anonymous UIDs post I noted that this data will not be available to end users), which in itself raises some interesting challenges. Each partner is storing and sending different pieces of information, kept in different columns. For example, from one provider we might see something like column one of the following table, and from another something like column 2. As you’ll see below, some columns clearly mean the same thing (and I have lined them up accordingly), some are possibly similar, and others have no analogues between different partners.

Provider One Provider Two
user# Identifier
Ethnicity ethnicity
Country of Domicile nationality
Country of Domicile Region
Gender gender
Disability disability
UCAS Tariff Points Tariff
Age on entry
Mode of Attendance Attendance Mode
Class Code
Course Code Prog Code
Course Name Course name
Course Type Course type
JACS code
Location of Study Campus
Franchised Out
Enrolment End Date Date of graduation
Agreed Award final award
Agreed Classification
Agreed Overall mark
loans Loans per borrower
total E Number of different E-resources accessed
all visits Number of visits

In designing a database, one sets out to work out which pieces of data will be present in a database table. For any one partner, this is easy — we are being given the tables directly. Add more than one feed like the ones above, however, and there are some choices to make in our database design regarding how we standardise the column names when they clearly mean the same thing.

Access Control Concerns

The next factor which may influence our choice of database structure is security. Database products such as Postgresql come with built in security, so access can be restricted to certain database users on a per-table basis. This is certainly an appealing model for controlling partners having full access to their own data — we could create one database user for each LAMP partner, and perform access control accordingly.

The difficulty with that access model, however, comes when we want to run analyses which compare data from multiple partners, which implies a degree of access to each others’ data. Of course, we would only be allowing access to other partners data after it had been standardised and normalised, but in order to perform the calculations, access will still be required at a level which is far more granular than the LAMP project wants to expose. And this is the crux of the matter — the kind of statistical comparisons people require mean that it is simply not possible to avoid the LAMP statistics layer having read access to all the normalised data.

The implication of this is that we will not be able to use database access control alone to restrict access to the full granularity of the normalised data. Instead, either our API or our statistics layer will have to use software routines to implement its own filters and sanity checks based on credentials supplied to the API. Only after these have been passed will results from cross-provider queries be returned to the user.

Outside of cross-provider queries, there will still be the need to restrict access to raw data solely to the provider who submitted it, and it remains to be seen whether or not database access control will be able to play a part in achieving this or whether it too will be purely achieved via software routines.

Some Design Options

  1. One table per provider, with normalisations and column name standardisations in a lookup table:

    Error generating Graphviz image:

    Graphviz cannot generate graph
    Command: /usr/bin/dot '-Kdot' '-Tpng' '-o/blogs/wordpress/wp-content/tfo-graphviz/c3047af0fad4741ce1fe25fae80c5c55.png'
    Warning: : syntax error in line 6 near '-'
    Original DOT:
        2 digraph op1{
        3 graph[rankdir="LR"]
        4 subgraph clusterDB{
        5 graph[label="Database"]
        6 Normalisations -> Standardisations -> "Provider 1 table";
        7 Standardisations -> "Provider 2 table";
        8 Standardisations -> "Provider 3 table";
        9 node[shape="trapezium"]
       10 "Statistical Comparison Queries" -> Normalisations;
       11 "Private Queries" -> "Provider 1 table";
       12 "Private Queries" -> "Provider 2 table";
       13 "Private Queries" -> "Provider 3 table";
       14 }}

    The benefit of this approach would be that the original detail submitted by providers would appear in the database, one provider per table. This would also mean making use of database access control on a per-table basis was an option. However, the need to constantly perform lookup routines in order to standardise and normalise the data for comparison could impact performance and result in quite complicated queries, and there would also have to be at least one database user with read access to all the tables for the purposes of running such comparisons.
  2. One table with standardised column names for all providers, with normalisations in a separate lookup table:

    Error generating Graphviz image:

    Graphviz cannot generate graph
    Command: /usr/bin/dot '-Kdot' '-Tpng' '-o/blogs/wordpress/wp-content/tfo-graphviz/cfc16ef1a9a1e902bc0a4c29ad4aeb4e.png'
    Warning: : syntax error in line 6 near '-'
    Original DOT:
        2 digraph op1{
        3 graph[rankdir="LR"]
        4 subgraph clusterDB{
        5 graph[label="Database"]
        6 Standardisations -> "All Providers' Standardised Data";
        7 "Filter for only onenProvider's data" -> "All Providers' Standardised Data";
        8 Normalisations -> "All Providers' Standardised Data";
        9 node[shape="trapezium"]
       10 "Statistical Comparison Queries" -> Normalisations;
       11 "Private Queries" -> "Filter for only onenProvider's data";
       12 }
       13 "Provider 1 raw" -> Standardisations;
       14 "Provider 2 raw" -> Standardisations;
       15 "Provider 3 raw" -> Standardisations;
       16 }

    This may be a ‘best of both worlds’ solution — we would hold the full level of detail submitted to us by providers in our database, in a single table with standardised headings. Queries could then be run through normalisation routines when comparison and statistics are required, but institutions would still be able to get at the data they submitted (albeit standardised) for other types of analysis which would be private to their LAMP dashboard. From an access control pespective, this scenario would rely entirely on software checks — at the provider filter, and on the comparison query results — in order to protect data.
  3. One normalised, standardised table, with no access to original data:

    Error generating Graphviz image:

    Graphviz cannot generate graph
    Command: /usr/bin/dot '-Kdot' '-Tpng' '-o/blogs/wordpress/wp-content/tfo-graphviz/26bc031cec0deeb4314ee2db64e8c63e.png'
    Warning: : syntax error in line 6 near '-'
    Original DOT:
        2 digraph op1{
        3 graph[rankdir="LR"]
        4 subgraph clusterDB{
        5 graph[label="Database"]
        6 Standardisations -> Normalisations -> "All Providers' Normalised Data";
        7 node[shape="trapezium"]
        8 "Statistical Comparison Queries" -> "All Providers' Normalised Data";
        9 }
       10 "Provider 1 raw" -> Standardisations;
       11 "Provider 2 raw" -> Standardisations;
       12 "Provider 3 raw" -> Standardisations;
       13 }

    This scenario would probably perform best and keep queries simple, but has the drawback that the full level of detail in the data as submitted by the providers would not be held in the database, which would limit us to only running queries on the normalised data. Since a number of the LAMP use cases seem to involve providers wanting to store their data with us and query it in place, this option is pretty much ruled out. Restricting access to the detailed normalised data, as in the previous example, would be completely done in software.
  4. Using redundancy and holding both raw individual tables as well as a standardised/normalised combined one

    Error generating Graphviz image:

    Graphviz cannot generate graph
    Command: /usr/bin/dot '-Kdot' '-Tpng' '-o/blogs/wordpress/wp-content/tfo-graphviz/a1c5501eadf600068966f36dc1a9009b.png'
    Warning: : syntax error in line 6 near '-'
    Original DOT:
        2 digraph op1{
        3 graph[rankdir="LR"]
        4 subgraph clusterDB{
        5 graph[label="Database"]
        6 Standardisations -> Normalisations -> "All Providers' Normalised Data";
        7 "Provider 1 table";
        8 "Provider 2 table";
        9 "Provider 3 table";
       10 node[shape="trapezium"]
       11 "Private Queries" -> "Provider 1 table";
       12 "Private Queries" -> "Provider 2 table";
       13 "Private Queries" -> "Provider 3 table";
       14 "Statistical Comparison Queries" -> "All Providers' Normalised Data";
       15 }
       16 "Provider 1 raw" -> Standardisations;
       17 "Provider 2 raw" -> Standardisations;
       18 "Provider 3 raw" -> Standardisations;
       19 "Provider 1 raw" -> "Provider 1 table";
       20 "Provider 2 raw" -> "Provider 2 table";
       21 "Provider 3 raw" -> "Provider 3 table";
       23 }

    Another option exists whereby we could combine options one and three — the raw data goes back in the database, and database user accounts are reintroduced to control access to it as in option one. However, as in option three, the normalised table is also present in the database, for combined queries which are regulated by the API. For even higher levels of security, the raw and the normalised tables don’t strictly speaking even need to be in the same database!

    This last option would really be an implementation suited to high levels of paranoia regarding the raw data and our API’s software safeguards, and faith that the normalisation routines do a good enough job of anonymising that data to justify the combined table not being subject to the same levels of security.

At the moment I’m leaning towards option two — we can reverse-lookup the standardisations if partners absolutely need their original column headings back, but having all the data in one standardised table will help with both private and combined queries. Since database access control cannot offer us the security required in our application, we will need to implement software checks in any case, so we may as well embrace the fact and get on with how those checks will work! It’s conceivable that option 4 might perform better, as the need to do standardisation and/or normalisation lookups at query-time is removed, but we’ll keep an eye on that as we test and build the database.

In my next post, I’m hoping to go into more detail about the statistics layer, and how we implement some of the routines Ellen blogged about, leading into how we build our API!