At the LAMP CAP meeting on 17 October, I gave a presentation that picked up from my last blog posts and detailed some of the progress I have been making along a number of separate lines. I know I said in my last post that I would be talking about statistics next, but during parallel development of the various architectural components some other pieces of work slotted in first, so I want to write a quick series of posts going into more detail from my talk.
In this post, I want to build on previous discussions of normalising data for statistical analysis and discuss the idea of LAMP users being able to use custom normalisations. The other posts in the series will cover data content standardisation, our database schema, our API, and the graph drawing portion of our user interface.
Problems with Fixed Normalisations
At the moment, the LAMP database uses fixed data normalisation. This means we collect records into similar groups based on their content (e.g. Ethnicities of ‘White’, ‘White African’ and ‘White Irish’ are all grouped together as ‘White’ for statistical purposes), and that the LAMP team has decided what those groupings should be, based on an overview of the data that providers have submitted to us.
We’re expecting that these fixed normalisations will cover the majority of use cases for LAMP, but let me use a user story to illustrate a situation in which the fixed normalisations will not be sufficient, and to introduce custom normalisations as a solution.
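To make the idea of a fixed normalisation concrete, here is a minimal sketch in Python. It is not LAMP's actual implementation: the names `FIXED_ETHNICITY_GROUPS` and `normalise`, and the choice of ‘Not known’ as a fallback for unmapped values, are assumptions made purely for illustration.

```python
# Hypothetical fixed normalisation table: provider value -> LAMP group.
# Only the three example values from the text are included.
FIXED_ETHNICITY_GROUPS = {
    "White": "White",
    "White African": "White",
    "White Irish": "White",
}


def normalise(value, groups=FIXED_ETHNICITY_GROUPS, fallback="Not known"):
    """Return the group for a provider-supplied value.

    Values missing from the table fall back to "Not known"; that
    fallback is an assumption, not documented LAMP behaviour.
    """
    return groups.get(value, fallback)


records = ["White Irish", "White", "White African"]
normalised = [normalise(r) for r in records]
print(normalised)
```

The point of the sketch is that the table is fixed in advance: the person running the query has no say in which group each provider value ends up in.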
Suppose someone is interested in what percentage of people get first class degrees at their institution, and for diversity purposes wants to break that down by ethnicity. They could ask the LAMP API for this information and get back the following (fictional) graph, which has been normalised (as described above) by grouping related records together into common categories for anonymisation and statistical analysis.
Graph 1: Percentage of first class degrees by ethnicity
Normalisations used to build Graph 1
Now, let’s say this particular institution is looking to break into South America and so is interested in the performance of its Latin American students in particular, but is not bothered at all about separating out Chinese students from the rest of Asia. At the moment, as mentioned above, the ethnicity normalisations which appear in Graph 1 are a fixed default set. In other words, you currently cannot change the categories to focus in on Latin America.
This seems quite restrictive, and so it has been suggested that we should not tie users of LAMP to the groupings we have chosen. The way around this is to let users supply a list of their own ‘custom’ groupings if they prefer. Initially, I assumed this would just be a case of the user submitting a list (like the one next to Graph 1) to our API when they run their query, and asking the API to use that list instead.
An Example of Custom Normalisations
Since the normalisations table doing the work is similar to the one above, my initial idea was that altering the normalisation groupings as shown below, and submitting them to the API, would solve the problem of custom normalisations.
| Content from provider | LAMP Normalised group | Custom Normalised group |
| “Asian – Pakistani” | “Asian” | “Asian” |
| “Black – Other” | “Black” | “Black” |
| “Asian – Indian” | “Asian” | “Asian” |
| “White and Asian” | “Other” | “Other” |
| “Other White Background” | “White” | “White” |
| “Black – Caribbean” | “Black” | “Black” |
| “Asian – Other” | “Asian” | “Asian” |
| “Black – African” | “Black” | “Black” |
| “Information Refused” | “Not known” | “Not known” |
| “Asian – Chinese” | “Chinese” | “Asian” |
| “Asian – Bangladeshi” | “Asian” | “Asian” |
| “Latin American” | “Other” | “Latin American” |
Making the changes detailed in the right-hand column and submitting them to the API would then result in the following graph being output instead:
[Graph: percentage of first class degrees by ethnicity using the custom normalisations, with categories Not Known, Other, Asian, Black, White and Latin American]
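The single-provider version of this idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the mapping values come from the table above, but the function names (`build_groups`, `first_class_share`), the record format and the handful of sample records are invented for the example and do not reflect the real LAMP API or data.

```python
from collections import Counter

# Default groupings, copied from the "LAMP Normalised group" column above.
LAMP_GROUPS = {
    "Asian – Pakistani": "Asian",
    "Black – Other": "Black",
    "Asian – Indian": "Asian",
    "White and Asian": "Other",
    "Other White Background": "White",
    "Black – Caribbean": "Black",
    "Asian – Other": "Asian",
    "Black – African": "Black",
    "Information Refused": "Not known",
    "Asian – Chinese": "Chinese",
    "Asian – Bangladeshi": "Asian",
    "Latin American": "Other",
}

# Only the rows that differ in the "Custom Normalised group" column.
CUSTOM_OVERRIDES = {
    "Asian – Chinese": "Asian",
    "Latin American": "Latin American",
}


def build_groups(defaults, overrides):
    """Return a new mapping: the defaults with the user's overrides applied."""
    merged = dict(defaults)
    merged.update(overrides)
    return merged


def first_class_share(records, groups):
    """Fraction of first class degrees per normalised group.

    `records` is an iterable of (provider_value, got_first) pairs;
    this record shape is hypothetical.
    """
    totals, firsts = Counter(), Counter()
    for provider_value, got_first in records:
        group = groups.get(provider_value, "Not known")
        totals[group] += 1
        if got_first:
            firsts[group] += 1
    return {group: firsts[group] / totals[group] for group in totals}


# Made-up records, purely to exercise the code.
sample = [
    ("Latin American", True),
    ("Latin American", False),
    ("Asian – Chinese", True),
    ("Asian – Indian", False),
    ("White and Asian", False),
]

default_result = first_class_share(sample, LAMP_GROUPS)
custom_result = first_class_share(sample, build_groups(LAMP_GROUPS, CUSTOM_OVERRIDES))
```

With the default groupings, the Latin American records are absorbed into ‘Other’ and ‘Chinese’ appears as its own category; with the overrides applied, ‘Latin American’ becomes a category in its own right and the Chinese records are counted under ‘Asian’. Keeping the overrides as a small separate mapping means a user would only need to submit the rows they want to change.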
Unfortunately, although this is the right general idea, the tables above are limited to the output from just one (fictitious) provider. When I thought about this process in the context of multiple providers, a barrier to this approach became evident. I’ll talk about this barrier more, and how standardising the content of data fields like those in the left-hand columns above will be necessary for a solution, in my next post.