Standardising Data Content to Allow For Custom Normalisations

Introduction

In my previous post, I set out to describe a potential infrastructure which would allow users of LAMP to choose how records are normalised by defining their own groupings. I also mentioned that I had spotted a barrier to doing this easily: every institution which has submitted data to LAMP describes the same content in a different way. This is like the difference on a web page between a free text field and a drop-down list. At the moment, institutions are submitting their data to us as if they had filled in a free text field.

For this post, I’ll explain a little more about standardising the content of the data fields. The process of standardising data content is analogous to converting the ‘free text’ values mentioned above into a fixed set such as the ones you might find on a drop-down list.

I’ve already introduced the idea of standardising, but last time it was with field names from different providers where they hold the same conceptual data (e.g. Country of Domicile from one provider means the same thing as Nationality from another). This time, we’ll be focusing on the content of the fields instead.

Our Current Fixed Normalisations and the Barriers to Custom Normalisations

Let’s consider an example which exposes how difficult it would currently be for users to supply custom normalisations. For the sake of this example, I have focused only on the types of Ethnicity that LAMP currently groups together as ‘Asian’, together with Chinese, which we currently put in a separate ‘Chinese’ group.

You can see a sample of the fixed normalisations table we currently have inside the LAMP database below. So far this covers content from four different providers.

1. A sample of Ethnicities from our current comprehensive normalisations table (visualised below as a flow diagram)
institution  original_contents                     normalised_contents
1            Asian Other                           Asian
1            Bangladeshi                           Asian
1            Indian                                Asian
1            Pakistani                             Asian
1            Chinese                               Chinese
2            Asian or Asian British – Bangladeshi  Asian
2            Asian or Asian British – Indian       Asian
2            Asian or Asian British – Pakistani    Asian
2            Other Asian background                Asian
2            Chinese                               Chinese
3            Asian or Asian Brisith – Pakistani    Asian
3            Asian or Asian British – Bangladeshi  Asian
3            Asian or Asian British – Indian       Asian
3            Asian – Other                         Asian
3            Chinese                               Chinese
4            3[^4]                                 Asian
4            6                                     Asian
4            34                                    Chinese

Original DOT for the table 1 flow diagram:
    digraph table1 {
        graph[rankdir="LR"];
        subgraph clusterNormalised {
            graph[label="normalised_contents"];
            node[shape="trapezium"];
            "Chinese (n)";
            "Asian";
        }
        subgraph clusterInstitutions {
            graph[label="institution"];
            node[shape="parallelogram"];
            1;
            2;
            3;
            4;
        }
        subgraph clusterRaw {
            graph[label="original_contents"];
            node[shape="rectangle"];
            1->"Asian Other"->"Asian";
            1->"Bangladeshi"->"Asian";
            1->"Indian"->"Asian";
            1->"Pakistani"->"Asian";
            1->"Chinese"->"Chinese (n)";
            edge[color="red"];
            2->"Asian or Asian British - Bangladeshi"->"Asian";
            2->"Asian or Asian British - Indian"->"Asian";
            2->"Asian or Asian British - Pakistani"->"Asian";
            2->"Other Asian background"->"Asian";
            2->"Chinese";
            edge[color="blue"];
            3->"Asian or Asian Brisith - Pakistani"->"Asian";
            3->"Asian or Asian British - Bangladeshi";
            3->"Asian or Asian British - Indian";
            3->"Asian - Other"->"Asian";
            3->"Chinese";
            edge[color="green"];
            4->"3[^4]"->"Asian";
            4->6->"Asian";
            4->34->"Chinese (n)";
        }
    }

In this simplified example, let’s suppose our user doesn’t want to use ‘Chinese’ as a grouping, but would prefer for their business purposes to only use ‘Asian’ for everything in that region. My assumption from my last post was that the user would be able to achieve this by submitting a simple custom normalisations table up to the API. However, in order for the LAMP application to offer all the same functionality as with our default normalisations, the user would effectively have to submit their own custom implementation of table 1.

The word simple above is the key — as you can see, table 1 is quite complicated owing to different institutions representing ethnicities differently (institution 1 uses ‘Indian’, for example, whereas institution 2 uses ‘Asian or Asian British – Indian’, and so on). Thinking along these lines, in order for a user to draw up their custom copy of table 1, they would need to know all of the possible entries for every different institution. This is not simple at all for the user!
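
To make the problem concrete, here is a minimal Python sketch. It assumes a handful of rows from table 1 are held in a dictionary keyed by (institution, original_contents). The names DEFAULT_NORMALISATIONS and rows_to_override are made up for illustration and are not part of LAMP, and the values are written with plain hyphens, as in the DOT source.

```python
# Hypothetical in-memory copy of a few rows from table 1.
# Key: (institution, original_contents); value: normalised_contents.
DEFAULT_NORMALISATIONS = {
    (1, "Indian"): "Asian",
    (1, "Chinese"): "Chinese",
    (2, "Asian or Asian British - Indian"): "Asian",
    (2, "Chinese"): "Chinese",
    (3, "Chinese"): "Chinese",
    (4, "34"): "Chinese",
}


def rows_to_override(table, old_group):
    """Return every (institution, original_contents) key currently in old_group."""
    return sorted(key for key, group in table.items() if group == old_group)


# To fold 'Chinese' into 'Asian', the user must find each institution's own
# spelling or code for Chinese and override every one of those rows.
needed = rows_to_override(DEFAULT_NORMALISATIONS, "Chinese")
custom = {**DEFAULT_NORMALISATIONS, **{key: "Asian" for key in needed}}
print(needed)
```

Even in this cut-down version, the override list includes institution 4's code 34, which a user could not be expected to guess.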

Data Content Standardisation as a Solution

In order to simplify the table which users will need to supply in order to generate custom normalisations, we will need to insert an extra step into table 1. At the moment, we look at the content of the data and decide which normalised grouping it belongs in. If, instead, we first look at the content of the data and replace it with a value chosen from a standard list of LAMP-certified values, we can then look at the standardised values and group them into normalisations as a second step. As a result, in order to supply a custom normalisation, a user only needs to know our list of LAMP-standardised values and put them into their own groupings.

In our example, to standardise the values, we would replace the values in table 2a) with the corresponding value from table 2b):

2. Standardising data content
a) the original list of possible ethnicities from our normalisations table
Asian Other
Bangladeshi
Indian
Pakistani
Chinese
Asian or Asian British – Bangladeshi
Asian or Asian British – Indian
Asian or Asian British – Pakistani
Other Asian background
Asian or Asian Brisith – Pakistani
Asian – Other
3[^4]
6
34
b) A suggested list of standardised replacements
Asian – Other
Bangladeshi
Indian
Pakistani
Chinese
Asian British
Asian – Any

Splitting table 1 into two steps can now be achieved as shown in table 3. The first part would be to use a lookup table to standardise all the different values from different institutions into the ones in table 2b. Ideally, the end user wouldn’t need to see or know about this step, as it would be something we did inside the LAMP application. The result would look something like table 3a):

3. Replacing the normalisations table with two tables: a) a standardisations table…
institution  original_contents                     standardised_contents
1            Asian Other                           Asian – Other
1            Bangladeshi                           Bangladeshi
1            Indian                                Indian
1            Pakistani                             Pakistani
1            Chinese                               Chinese
2            Asian or Asian British – Bangladeshi  Bangladeshi
2            Asian or Asian British – Indian       Indian
2            Asian or Asian British – Pakistani    Pakistani
2            Other Asian background                Asian – Other
2            Chinese                               Chinese
3            Asian or Asian Brisith – Pakistani    Pakistani
3            Asian or Asian British – Bangladeshi  Bangladeshi
3            Asian or Asian British – Indian       Indian
3            Asian – Other                         Asian – Other
3            Chinese                               Chinese
4            3[^4]                                 Asian – Any
4            6                                     Pakistani
4            34                                    Chinese
… and b) a normalisations table
standardised_contents  normalised_contents
Asian – Other          Asian
Bangladeshi            Asian
Indian                 Asian
Pakistani              Asian
Chinese                Chinese
Asian British          Asian
Asian – Any            Asian

Original DOT for the table 3 flow diagram:
    digraph table3 {
        graph[rankdir="LR"];
        subgraph clusterNormalised {
            graph[label="normalised_contents"];
            node[shape="trapezium"];
            "Chinese (n)";
            "Asian";
        }
        subgraph clusterInstitutions {
            graph[label="institution"];
            node[shape="parallelogram"];
            1;
            2;
            3;
            4;
        }
        subgraph clusterStandard {
            graph[label="standardisation"];
            node[shape="diamond"];
            "Asian - Any";
            "Pakistani (s)";
            "Asian - Other (s)";
            "Indian (s)";
            "Bangladeshi (s)";
            "Chinese (s)";
        }
        subgraph clusterRaw {
            graph[label="original_contents"];
            node[shape="rectangle"];
            1->"Asian Other"->"Asian - Other (s)"->"Asian";
            1->"Bangladeshi"->"Bangladeshi (s)"->"Asian";
            1->"Indian"->"Indian (s)"->"Asian";
            1->"Pakistani"->"Pakistani (s)"->"Asian";
            1->"Chinese"->"Chinese (s)"->"Chinese (n)";
            edge[color="red"];
            2->"Asian or Asian British - Bangladeshi"->"Bangladeshi (s)";
            2->"Asian or Asian British - Indian"->"Indian (s)";
            2->"Asian or Asian British - Pakistani"->"Pakistani (s)";
            2->"Other Asian background"->"Asian - Other (s)";
            2->"Chinese";
            edge[color="blue"];
            3->"Asian or Asian Brisith - Pakistani"->"Pakistani (s)";
            3->"Asian or Asian British - Bangladeshi";
            3->"Asian or Asian British - Indian";
            3->"Asian - Other"->"Asian - Other (s)";
            3->"Chinese";
            edge[color="green"];
            4->"3[^4]"->"Asian - Any"->"Asian";
            4->6->"Pakistani (s)";
            4->34->"Chinese (s)";
        }
    }

The normalisations step is now performed by a second lookup table (3b), which groups these new standard field contents into the LAMP default categories. In this new table it no longer matters which institution the data originally came from, which makes it much simpler than table 1.
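
The two-step lookup can be sketched in Python as follows. This is only an illustration under my own assumptions, not how LAMP is actually implemented: the dictionary and function names are invented, the values use plain hyphens as in the DOT source, and institution 4's entry 3[^4] is treated as a literal string even though it may really be a pattern.

```python
# Table 3a: (institution, original_contents) -> standardised_contents.
STANDARDISATIONS = {
    (1, "Asian Other"): "Asian - Other",
    (1, "Bangladeshi"): "Bangladeshi",
    (1, "Indian"): "Indian",
    (1, "Pakistani"): "Pakistani",
    (1, "Chinese"): "Chinese",
    (2, "Asian or Asian British - Bangladeshi"): "Bangladeshi",
    (2, "Asian or Asian British - Indian"): "Indian",
    (2, "Asian or Asian British - Pakistani"): "Pakistani",
    (2, "Other Asian background"): "Asian - Other",
    (2, "Chinese"): "Chinese",
    (3, "Asian or Asian Brisith - Pakistani"): "Pakistani",  # spelling as supplied
    (3, "Asian or Asian British - Bangladeshi"): "Bangladeshi",
    (3, "Asian or Asian British - Indian"): "Indian",
    (3, "Asian - Other"): "Asian - Other",
    (3, "Chinese"): "Chinese",
    (4, "3[^4]"): "Asian - Any",  # assumption: stored and matched as a literal key
    (4, "6"): "Pakistani",
    (4, "34"): "Chinese",
}

# Table 3b: standardised_contents -> normalised_contents (no institution needed).
NORMALISATIONS = {
    "Asian - Other": "Asian",
    "Bangladeshi": "Asian",
    "Indian": "Asian",
    "Pakistani": "Asian",
    "Chinese": "Chinese",
    "Asian British": "Asian",
    "Asian - Any": "Asian",
}


def standardise(institution, original):
    """Step 1: replace an institution-specific value with a standard one."""
    return STANDARDISATIONS[(institution, original)]


def normalise(institution, original, groupings=NORMALISATIONS):
    """Step 2: group the standardised value using the given groupings."""
    return groupings[standardise(institution, original)]


print(normalise(2, "Asian or Asian British - Indian"))
print(normalise(4, "34"))
```

Because only the second dictionary decides the grouping, swapping in different groupings leaves the standardisation step untouched.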

Custom Normalisations, Revisited

The end result after introducing data content standardisation will be that if you want to specify your own custom normalisation, you will only need to submit something like the following (which groups ‘Chinese’ into ‘Asian’ instead) up to the API:

5. Proposed structure of an alternative normalisation for the data in table 1, which could be submitted by a user to our API
standardised_contents  target_column  normalised_contents
Asian – Other          Ethnicity      Asian
Bangladeshi            Ethnicity      Asian
Indian                 Ethnicity      Asian
Pakistani              Ethnicity      Asian
Chinese                Ethnicity      Asian
Asian British          Ethnicity      Asian
Asian – Any            Ethnicity      Asian
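
As a rough sketch of how a submission like this might be handled, the Python below checks each submitted row against the list of standard values and builds a per-column grouping. Everything here is an assumption for illustration: the API does not exist yet, the names STANDARD_VALUES and build_custom_normalisation are invented, and the values use plain hyphens as in the DOT source.

```python
# Hypothetical list of LAMP-standardised values per target column (table 2b).
STANDARD_VALUES = {
    "Ethnicity": {
        "Asian - Other", "Bangladeshi", "Indian", "Pakistani",
        "Chinese", "Asian British", "Asian - Any",
    },
}


def build_custom_normalisation(rows, standard_values=STANDARD_VALUES):
    """Turn submitted rows into {target_column: {standardised: normalised}}.

    Each row is (standardised_contents, target_column, normalised_contents).
    Rows that use a value outside the standard list are rejected.
    """
    result = {}
    for standardised, column, normalised in rows:
        if standardised not in standard_values.get(column, set()):
            raise ValueError(
                f"{standardised!r} is not a standard value for {column!r}"
            )
        result.setdefault(column, {})[standardised] = normalised
    return result


# The rows a user would submit to group 'Chinese' into 'Asian'.
submitted = [
    ("Asian - Other", "Ethnicity", "Asian"),
    ("Bangladeshi", "Ethnicity", "Asian"),
    ("Indian", "Ethnicity", "Asian"),
    ("Pakistani", "Ethnicity", "Asian"),
    ("Chinese", "Ethnicity", "Asian"),
    ("Asian British", "Ethnicity", "Asian"),
    ("Asian - Any", "Ethnicity", "Asian"),
]
custom = build_custom_normalisation(submitted)
print(custom["Ethnicity"]["Chinese"])
```

The user never has to mention an institution or its raw values; the seven standard values are all they need to know.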

Hopefully this clears up why we want to focus on content standardisation, as well as how we’ll be going about it! The ideas above will obviously result in some slight changes to our database schema, and how our API works, and so in my next posts I want to talk about both of those aspects of our architecture.
