Custom Normalisations

Introduction

At the recent 17 October LAMP CAP meeting, I gave a presentation starting from my last blog posts and detailing some of the progress I have been making along a number of separate lines. I know in my last post I said I would be talking about statistics next, but during parallel development of the various architectural components it turned out that some other pieces of work slotted in before then, and so I want to do a quick series of posts going into more detail from my talk.

In this post, I want to build upon previous discussions of normalising data for statistical analysis and discuss the idea of LAMP users being able to use custom normalisations. The other posts I want to write will cover data content standardisation, our database schema, our API, and the graph drawing portion of our user interface respectively.

Problems with Fixed Normalisations

At the moment, the LAMP database uses fixed data normalisation. This means we are collecting records together into similar groups based on their content (e.g. Ethnicities of ‘White’, ‘White African’ and ‘White Irish’ all become grouped together as ‘White’ for statistical purposes), but that the LAMP team have decided what those groupings should be based on an overview of the data which providers have submitted to us.

We’re expecting that these fixed normalisations will cover the majority of use cases fro LAMP, but let me use a user story example to illustrate a situation in which the fixed normalisations will not be sufficient, and to introduce custom normalisations as a solution.

Suppose someone is interested in what percentage of people get first class degrees at their institution, and for diversity purposes wants to break that down by ethnicity. They could ask the LAMP API for this information and get back the following (fictional) graph, which has been normalised (as described above) by grouping related records together into common categories for anonymisation and statistical analysis.

Graph 1: Percentage of first class degrees by ethnicity

Not KnownOtherAsianBlackWhiteChinese00.20.40.60.811.21.41.61.822.2%

Normalisations used to build Graph 1

Content from provider Normalised group
“Asian – Pakistani” “Asian”
“Black – Other” “Black”
“White/Black Caribbn” “Other”
“Asian – Indian” “Asian”
“White and Asian” “Other”
“White Irish” “White”
“Other White Background” “White”
“Black – Caribbean” “Black”
“White” “White”
“Asian – Other” “Asian”
“Black – African” “Black”
“Information Refused” “Not known”
“Asian – Chinese” “Chinese”
“Asian – Bangladeshi” “Asian”
“White/Black African” “Other”
“Latin American” “Other”
“Other Mixed” “Other”
“Other” “Other”

Now, let’s say this particular institution is looking to break into South America and so is interested in the performance of its Latin American Students in particular, but not bothered at all about separating out Chinese students from the rest of Asia. At the moment, as mentioned above, the ethnicity normalisations which appear in graph 1 are a fixed default set of normalisations. In other words, you currently cannot change the categories to focus in on Latin America.

This seems quite restrictive, and so it has been suggested that we should not tie users of LAMP to the groupings we have chosen. The only way to get around this is if users are able to supply a list of their own ‘custom’ groupings if they prefer. Initially, I assumed this would just be a case of the user submitting a list (like the one next to graph 1) to our API when they run their query, and asking the API to use that list instead.

An Example of Custom Normalisations

Since the normalisations table doing the work is similar to the one above, then my initial idea was that altering the normalisations groupings as below and submitting those up to the API would solve the problem of custom normalisations.

Content from provider LAMP Normalised group Custom Normalised group
“Asian – Pakistani” “Asian” “Asian”
“Black – Other” “Black” “Black”
“White/Black Caribbn” “Other” “Other”
“Asian – Indian” “Asian” “Asian”
“White and Asian” “Other” “Other”
“White Irish” “White” “White”
“Other White Background” “White” “White”
“Black – Caribbean” “Black” “Black”
“White” “White” “White”
“Asian – Other” “Asian” “Asian”
“Black – African” “Black” “Black”
“Information Refused” “Not known” “Not known”
“Asian – Chinese” “Chinese” “Asian”
“Asian – Bangladeshi” “Asian” “Asian”
“White/Black African” “Other” “Other”
“Latin American” “Other” “Latin American”
“Other Mixed” “Other” “Other”
“Other” “Other” “Other”

Making the changes detailed in the right-hand column and submitting them to the API would then result in the following graph being output instead:

Not KnownOtherAsianBlackWhiteLatin American00.20.40.60.811.21.41.61.822.2%

Unfortunately, although this is the right general idea, the tables above are limited to the output from just one (fictitious) provider. When I thought about this process in the context of multiple providers, a barrier to this approach became evident. I’ll talk about this barrier more, and how standardising the content of data fields like those in the left-hand columns above will be necessary for a solution, in my next post.

Creating the LAMP database

We now have data from three different LAMP partners, and we’ve started looking at the structure of the data. On one hand, we are interested in how we normalise the data for statistical analysis, but on the other hand, we also need to start thinking about how data is going to be consumed by the LAMP application. In my previous post regarding the architecture of the LAMP system as a whole, I looked at some theoretical architectures which might be a good fit for the application’s requirements as we understand them right now, and the common point in all of the architecture options I have considered is the LAMP database.

Data Structure Concerns

It has been suggested that we retain the full granularity of data as supplied to us from the partners (although in my anonymous UIDs post I noted that this data will not be available to end users), which in itself raises some interesting challenges. Each partner is storing and sending different pieces of information, kept in different columns. For example, from one provider we might see something like column one of the following table, and from another something like column 2. As you’ll see below, some columns clearly mean the same thing (and I have lined them up accordingly), some are possibly similar, and others have no analogues between different partners.

Provider One Provider Two
user# Identifier
Ethnicity ethnicity
Country of Domicile nationality
Country of Domicile Region
Gender gender
Disability disability
UCAS Tariff Points Tariff
Age on entry
AcadYear
Mode of Attendance Attendance Mode
progression_code
Class Code
Course Code Prog Code
Course Name Course name
Course Type Course type
JACS code
student_status
Faculty
Location of Study Campus
Franchised Out
Enrolment End Date Date of graduation
Agreed Award final award
Agreed Classification
Agreed Overall mark
loans Loans per borrower
total E Number of different E-resources accessed
all visits Number of visits

In designing a database, one sets out to work out which pieces of data will be present in a database table. For any one partner, this is easy — we are being given the tables directly. Add more than one feed like the ones above, however, and there are some choices to make in our database design regarding how we standardise the column names when they clearly mean the same thing.

Access Control Concerns

The next factor which may influence our choice of database structure is security. Database products such as Postgresql come with built in security, so access can be restricted to certain database users on a per-table basis. This is certainly an appealing model for controlling partners having full access to their own data — we could create one database user for each LAMP partner, and perform access control accordingly.

The difficulty with that access model, however, comes when we want to run analyses which compare data from multiple partners, which implies a degree of access to each others’ data. Of course, we would only be allowing access to other partners data after it had been standardised and normalised, but in order to perform the calculations, access will still be required at a level which is far more granular than the LAMP project wants to expose. And this is the crux of the matter — the kind of statistical comparisons people require mean that it is simply not possible to avoid the LAMP statistics layer having read access to all the normalised data.

The implication of this is that we will not be able to use database access control alone to restrict access to the full granularity of the normalised data. Instead, either our API or our statistics layer will have to use software routines to implement its own filters and sanity checks based on credentials supplied to the API. Only after these have been passed will results from cross-provider queries be returned to the user.

Outside of cross-provider queries, there will still be the need to restrict access to raw data solely to the provider who submitted it, and it remains to be seen whether or not database access control will be able to play a part in achieving this or whether it too will be purely achieved via software routines.

Some Design Options

  1. One table per provider, with normalisations and column name standardisations in a lookup table:

    Error generating Graphviz image:

    Graphviz cannot generate graph
    Command: /usr/bin/dot '-Kdot' '-Tpng' '-o/blogs/wordpress/wp-content/tfo-graphviz/c3047af0fad4741ce1fe25fae80c5c55.png'
    Output: 
    Warning: : syntax error in line 6 near '-'
    
    Original DOT:
        1 
        2 digraph op1{
        3 graph[rankdir="LR"]
        4 subgraph clusterDB{
        5 graph[label="Database"]
        6 Normalisations -> Standardisations -> "Provider 1 table";
        7 Standardisations -> "Provider 2 table";
        8 Standardisations -> "Provider 3 table";
        9 node[shape="trapezium"]
       10 "Statistical Comparison Queries" -> Normalisations;
       11 "Private Queries" -> "Provider 1 table";
       12 "Private Queries" -> "Provider 2 table";
       13 "Private Queries" -> "Provider 3 table";
       14 }}
       15 
    


    The benefit of this approach would be that the original detail submitted by providers would appear in the database, one provider per table. This would also mean making use of database access control on a per-table basis was an option. However, the need to constantly perform lookup routines in order to standardise and normalise the data for comparison could impact performance and result in quite complicated queries, and there would also have to be at least one database user with read access to all the tables for the purposes of running such comparisons.
  2. One table with standardised column names for all providers, with normalisations in a separate lookup table:

    Error generating Graphviz image:

    Graphviz cannot generate graph
    Command: /usr/bin/dot '-Kdot' '-Tpng' '-o/blogs/wordpress/wp-content/tfo-graphviz/cfc16ef1a9a1e902bc0a4c29ad4aeb4e.png'
    Output: 
    Warning: : syntax error in line 6 near '-'
    
    Original DOT:
        1 
        2 digraph op1{
        3 graph[rankdir="LR"]
        4 subgraph clusterDB{
        5 graph[label="Database"]
        6 Standardisations -> "All Providers' Standardised Data";
        7 "Filter for only onenProvider's data" -> "All Providers' Standardised Data";
        8 Normalisations -> "All Providers' Standardised Data";
        9 node[shape="trapezium"]
       10 "Statistical Comparison Queries" -> Normalisations;
       11 "Private Queries" -> "Filter for only onenProvider's data";
       12 }
       13 "Provider 1 raw" -> Standardisations;
       14 "Provider 2 raw" -> Standardisations;
       15 "Provider 3 raw" -> Standardisations;
       16 }
       17 
    


    This may be a ‘best of both worlds’ solution — we would hold the full level of detail submitted to us by providers in our database, in a single table with standardised headings. Queries could then be run through normalisation routines when comparison and statistics are required, but institutions would still be able to get at the data they submitted (albeit standardised) for other types of analysis which would be private to their LAMP dashboard. From an access control pespective, this scenario would rely entirely on software checks — at the provider filter, and on the comparison query results — in order to protect data.
  3. One normalised, standardised table, with no access to original data:

    Error generating Graphviz image:

    Graphviz cannot generate graph
    Command: /usr/bin/dot '-Kdot' '-Tpng' '-o/blogs/wordpress/wp-content/tfo-graphviz/26bc031cec0deeb4314ee2db64e8c63e.png'
    Output: 
    Warning: : syntax error in line 6 near '-'
    
    Original DOT:
        1 
        2 digraph op1{
        3 graph[rankdir="LR"]
        4 subgraph clusterDB{
        5 graph[label="Database"]
        6 Standardisations -> Normalisations -> "All Providers' Normalised Data";
        7 node[shape="trapezium"]
        8 "Statistical Comparison Queries" -> "All Providers' Normalised Data";
        9 }
       10 "Provider 1 raw" -> Standardisations;
       11 "Provider 2 raw" -> Standardisations;
       12 "Provider 3 raw" -> Standardisations;
       13 }
       14 
    


    This scenario would probably perform best and keep queries simple, but has the drawback that the full level of detail in the data as submitted by the providers would not be held in the database, which would limit us to only running queries on the normalised data. Since a number of the LAMP use cases seem to involve providers wanting to store their data with us and query it in place, this option is pretty much ruled out. Restricting access to the detailed normalised data, as in the previous example, would be completely done in software.
  4. Using redundancy and holding both raw individual tables as well as a standardised/normalised combined one

    Error generating Graphviz image:

    Graphviz cannot generate graph
    Command: /usr/bin/dot '-Kdot' '-Tpng' '-o/blogs/wordpress/wp-content/tfo-graphviz/a1c5501eadf600068966f36dc1a9009b.png'
    Output: 
    Warning: : syntax error in line 6 near '-'
    
    Original DOT:
        1 
        2 digraph op1{
        3 graph[rankdir="LR"]
        4 subgraph clusterDB{
        5 graph[label="Database"]
        6 Standardisations -> Normalisations -> "All Providers' Normalised Data";
        7 "Provider 1 table";
        8 "Provider 2 table";
        9 "Provider 3 table";
       10 node[shape="trapezium"]
       11 "Private Queries" -> "Provider 1 table";
       12 "Private Queries" -> "Provider 2 table";
       13 "Private Queries" -> "Provider 3 table";
       14 "Statistical Comparison Queries" -> "All Providers' Normalised Data";
       15 }
       16 "Provider 1 raw" -> Standardisations;
       17 "Provider 2 raw" -> Standardisations;
       18 "Provider 3 raw" -> Standardisations;
       19 "Provider 1 raw" -> "Provider 1 table";
       20 "Provider 2 raw" -> "Provider 2 table";
       21 "Provider 3 raw" -> "Provider 3 table";
       22 
       23 }
       24 
    

    Another option exists whereby we could combine options one and three — the raw data goes back in the database, and database user accounts are reintroduced to control access to it as in option one. However, as in option three, the normalised table is also present in the database, for combined queries which are regulated by the API. For even higher levels of security, the raw and the normalised tables don’t strictly speaking even need to be in the same database!

    This last option would really be an implementation suited to high levels of paranoia regarding the raw data and our API’s software safeguards, and faith that the normalisation routines do a good enough job of anonymising that data to justify the combined table not being subject to the same levels of security.

At the moment I’m leaning towards option two — we can reverse-lookup the standardisations if partners absolutely need their original column headings back, but having all the data in one standardised table will help with both private and combined queries. Since database access control cannot offer us the security required in our application, we will need to implement software checks in any case, so we may as well embrace the fact and get on with how those checks will work! It’s conceivable that option 4 might perform better, as the need to do standardisation and/or normalisation lookups at query-time is removed, but we’ll keep an eye on that as we test and build the database.

In my next post, I’m hoping to go into more detail about the statistics layer, and how we implement some of the routines Ellen blogged about, leading into how we build our API!

All about data normalisation

I missed the last CAP meeting and I’m really disappointed because it sounded great. I’m also sorry because it seems that there were a few questions about the approach we’re taking to data categorisation and normalisation. Since I’m the difficult person who is setting the parameters and making the demands on this, I thought perhaps I ought to explain myself in a blog post!

One of the big challenges for this project is reconciling the big set of things that institutions might want to do with their data, and the much smaller subset of things that are statistically acceptable. That’s not to say that institutions won’t be able to make their own decisions about how they analyse their data – quite the reverse, in fact – but we need to make sure that if they are doing statistical analysis it is done right. We have to pay attention to statistical principles, and this has big repercussions for the way that data is structured within the project.

Whenever you run a statistical test, you are basically trying to understand the relationship between two or more variables – for example, discipline and book borrowing, or country of origin and e-resource usage. Now, because we’re working with samples (in our case, a single year of students, rather than every student who has ever studied at a university, ever!) we have to accept that there might be a random blip in the data which means the relationship only exists in the sample (the year group), not the wider population (all students ever at the university). Significance testing allows you to say how confident you are that any relationships you find do exist within the wider population, not just within your particular sample. It does this by calculating the probability of you finding your result if the relationship did not exist in the wider population and spits out a number between 0 and 1. You compare this number to your ‘critical value’ – usually .05 in the social sciences – and if your result is smaller, your finding is statistically significant.

‘This is all very interesting’, you may be thinking, ‘but what on earth does it have to do with data normalisation?’ Allow me to explain!

Some of the tests that we are using need data that has been separated into groups: for example, to understand the relationship between discipline and library usage, you need to group your students into different disciplines. You can then look and see whether these groups have different levels of library usage. Let’s take a hypothetical example where you have three groups, A, B and C, and you want to see whether these groups have significantly different levels of e-resource use.

The first thing that you do is run the Kruskal-Wallis test, which tells you whether any of the groups have a level of e-resource use that is significantly different from any of the other groups. Crucially, it only tells you that this difference exists somewhere: it doesn’t tell you whether the difference is between A&B, A&C, or B&C – or, indeed, any combination of the above. So that’s not especially helpful, if you want to use the information to decide which of your groups needs support in learning how to use e-resources.

If you find that your Kruskal-Wallis test is significant, you then need to go on and run Mann-Whitney tests on the same data. You take each pair – A&B, A&C and B&C – and run three separate tests, to see whether there is a difference between the two groups in each pair. For reasons that I’m not even going to try to explain, when you run lots of Mann Whitney tests on the same set of data, you increase your risk of a Type I error, which is statistical jargon for thinking a relationship exists where it actually doesn’t. In this example, it would result in libraries spending a lot of time educating a group of people in e-resource use where, in fact, the group is already perfectly competent. Again, not particularly helpful!

To avoid Type I errors you apply a Bonferroni correction, which basically means dividing your critical value –.05 – by the number of tests you’re running, to give you a new critical value. So, in our A, B, C example, you would divide .05 by 3, giving you a critical value of .017. For your Mann-Whitney test on a single pair to be statistically significant, it now needs to spit out a number which is smaller than .017. If you had four groups and you wanted to compare them all, your critical value would be .008. That’s pretty small, and makes it much less likely that you’ll find a statistically significant relationship.

So this is where – finally! – we get to data categorisation. With LAMP, we want to maximise the chances of telling universities something useful about their datasets. So we need to keep our number of groups as small as possible. You can minimise the number of tests that you run by taking a control group for each variable – the biggest group, let’s say – and comparing all the others to it with Mann-Whitney tests, without comparing them to each other. But if you have six groups, this will still mean running five tests and therefore working with a critical value of .01. So we really want to keep the numbers of groups down.

In short, then: if we have too many groups, we starkly reduce our chances of finding a statistically significant relationship between variables. This would make LAMP much less useful to institutions which want to build an evidence base in order to develop their library services. So we need to be a bit prescriptive.

The situation is further complicated by the fact that one aim of LAMP is to aggregate data across institutions. If we’re going to do this, obviously we need everybody to be working with the same set of definitions: it’s no use one university using groups A, B and C if another is using D, E and F and another is using red, blue and green!

In principle, there’s no reason that universities shouldn’t run a separate analysis on their own data using different groupings, if that makes more sense for them. But they’ll still have to think about the number of groups they include if they want to get statistically significant results.

Another option we’re thinking about is allowing people to compare sub-groups within each group: so, for example, within group A you might have subgroups 1, 2 and 3, and within group B you might have subgroups 4, 5 and 6. You can use the same Kruskal-Wallis/Mann-Whitney procedure to compare groups 1, 2 and 3 and groups 4, 5 and 6: but – crucially – you can’t compare 1, 4 and 6, and you can’t compare all six groups with each other. This should be helpful with something like discipline.

I hope that clears things up a bit! If not, let me know in the comments and I’ll do my best to answer any questions…

Unique Identifiers Which Don’t Identify Anyone!

  1. Why might we need Unique Identifiers?
  2. How might a Unique ID Be Generated?
  3. But we don’t want our data to be identified!
  4. Appendix: Some Technical Ideas For Generating Anonymous Unique Identifiers

Why might we need Unique Identifiers?

So far early data is coming through on the LAMP project, and partners have been anonymising student data before they send it to us. This anonymisation process removes their internal unique identifiers (student number, say, or registration number) as well as any other identifying data. A sample of the data that we are seeing might be something like the following (which I have just made up on the spot, any similarity to persons living or deceased is unintentional):

Age Starting Course UCAS points Subject Final Grade Number of library loans
18 300 Physics and Astronomy Upper Second 3
19 250 Cake Decorating and Icing Manufacture First 153
43 400 Medieval History Lower Second 96

This is great for us to get started with, and there’s a lot we can do statistically with this data. However, different partners are keeping (and submitting to this project) different levels of data, and so it’s conceivable that we might later get updates to this information (in subsequent academic years, say), or even other tables of information from partners. Maybe something like this detailing use of electronic and on-site facilities:

Number of PC logins e-resource logins Library barrier entires
1307 4 260
51 0 2741
254 98 5

Now, let’s suppose row 2 in our second table is actually the same person as row 1 in our first. If we know this, there are perhaps more meaningful questions which we can ask of our statistics routines. The problem, however, is how we can possibly find this out. When we receive the tables, we could give every row a number at our end, but this in itself doesn’t help us cross-reference between tables. The only way for us to know for sure that rows refer to the same people, is if the partners somehow supply us with this information. Of course, if partners are certain that they will never send updates, they can always merge tables like these themselves and just send us one single table. Otherwise…

How might a Unique ID Be Generated?

The most straightforward way to cross-reference between tables and updates would be if the partners added a unique number, or identifier, to records for every person they send us data about. Most partners will already have a unique ID in use internally, like the student number or registration number. However, sending us this information directly would destroy the anonymity of the data and would raise concerns about data protection.

The best solution to this problem would be for partners to generate a different unique ID, which cannot be back-traced to their internal ID, but instead is only of use for cross-referencing purposes. Some example ideas as to how an anonymous unique ID could be generated are appended to this post.

The lookup table (or routine) between the original UID and the new anonymous UID would be something private to the partners, maintained by them, that we would never get to see.

But we don’t want our data to be identified!

The question then becomes whether or not this new ID is, in itself, an identifying feature which should cause partners concern, and below I will attempt to set out some thoughts as to why assigning an anonymous ID should be worry-free.

Fundamentally speaking, if the anonymisation process which generates the new ID is good, then this extra piece of information will tell us absolutely nothing. It will purely be a means for us to join up different tables at our end — to illustrate, from the examples above, we would end up with something like:

Anonymous UID Age Starting Course UCAS points Subject Final Grade Number of library loans Number of PC logins e-resource logins Library barrier entires
765003 18 300 Physics and Astronomy Upper Second 3 51 0 2741
331224 19 250 Cake Decorating and Icing Manufacture First 153
903876 43 400 Medieval History Lower Second 96

In other words, it’s the same as if we received a single table from the providers, with just one now-meaningless column added — those anonymous values may as well be 1,2,3 for all the information they give us.

The next concern, then, is that looking at data in the table above could somehow be identifying even if the anonymous IDs are meaningless. This is a completely valid concern — if there is only one person on a given course who is over forty, then looking at the data in this way would immediately make it clear that we were talking about that person.

To answer this concern, we need to be clear about what the LAMP project will actually be letting users look at. Tables of the type shown above will only be held confidentially on LAMP systems. The LAMP project will not be offering data it holds at anything like the level of granularity given above. Aside from the fact that we want to protect anonymity, there are issues surrounding statistical significance. Simply put, we will be offering users the chance to observe and extract meaningful patterns from the data held in our systems. In order for us to definitely say there is a pattern, we must be able to say with confidence that an observed trend in the data is actually happening, and not just some effect in the numbers that we might see from any random sample. The way in which we make such statements is using statistical significance, and we cannot make statistically significant statements unless we reduce the level of granularity in our data by grouping data into ‘bins’.

Here’s an example — the first row in the following table shows a theoretical full granularity of data we might receive from some imaginary partner, but the second row contains the same information after we have normalised the data into categories for use in LAMP. You can immediately see that the chances of identifying someone from the data drop significantly.

Age Course Degree Result FTE Country of Domicile
23 Physics with Study in Europe 67% 0.75 Spain
Mature Physics Upper Second Part Time EU

Furthermore, our user interface will only be displaying data after we have performed statistical analysis on it. Nobody will even have access to row two of the data above — instead you should expect to see results such as the following:

Number of loans vs final grade
loans
final grade

Or if you query the LAMP API, you will be able to get the same data in tabular form:

Loans Final Grade
3 2
22 1
30 2
10 3
0 5
0 4
7 3
6 4
15 3
17 2

The user interfaces, both graphical and our API, will be limited to returning results at this level of depth.

Hopefully that covers some of the concerns surrounding generating and supplying LAMP with unique identifiers. For some more technical ideas about how to generate anonymous UIDs, I have given some examples in the appendix below to set people off. Next up, I’ll hopefully be posting about the design of our LAMP database, and how we will be applying the normalisations I talked about above so that we can achieve statistical significance.

Appendix: Some Technical Ideas For Generating Anonymous Unique Identifiers

The main issue with generating your own identifiers is that they must be unique. Beyond that, the issue of anonymity means that you might want a routine which is unguessable. Some possible ideas are below, but this is by no means an exhaustive list and anything else which follows these principles will be fine — at the end of the day, it’s better if we don’t know how the anonymous UIDs were generated!

  1. Pick some ‘main’ table and use the row number from it

    In the example tables above, the first table could be considered the ‘main’ one as it contains info about the student themselves, whereas the second one is just peripheral usage information, so is really a sub-table. If you have a situation like this, then whilst your original UIDs are still present in your data, you could use their presence in sub-tables to look up the row number from the main table, and add that to the sub-tables as the anonymous UID. Limitations of this approach are that sorting your table by original UID before you begin may result in your anonymous UIDs being guessable. On the other hand, if you don’t sort by UID, then if new data is added to the main table it may interrupt or otherwise disturb your numbering scheme.

  2. Make up your own anonymiser routine

    As long as you choose a process that will return a unique number for every input, then you can transform your old UIDs yourself and tell nobody how you did it. For example, you could multiply all the old IDs by 3 and add 4. As I mentioned before, too simple a routine may be guessable, but more complicated ones risk not returning a unique number — for example, subtracting a student’s age from their student ID would result in a clash between student 331, aged 19 and student 359, aged 47…

  3. Use a hash function

    This is the most technological approach — a hash function will return a unique code for every possible input. It is so computationally difficult to get the original data back from the hashed code as to make it effectively impossible, which is why they are used in digital certificates and for securing passwords. The returned codes don’t follow any numerical sequence, so id’s 1, 2, 3, 4 will not come out with hash codes in the same order. At the moment, SHA-2 is probably a reasonable set of hash functions to use. You can install and use the openssl library (it may well already be installed on your system) — for example, on any UNIX based system, echo id | openssl sha256 will return the hash code for id.

Planning the LAMP Architecture

As data starts to come in from various LAMP partners, we’re considering the architectural design of our application. Some ideas came out of the first LAMP CAP meeting as well which might have a bearing on how we put our system together.

For each partner, we will receive a series of datasets covering various aspects of library analytic data (usage, degree attainment, journal subscriptions, etc.), which we’ll be importing into a database. One of the goals of LAMP is to be able to perform statistical analysis of this data, and so at some point there will need to be a component in our application which is capable of statistical calculation.

Our project will deliver a user interface front-end for the purposes of viewing the analysed data and customising which analysis to apply. Mention has also been made, however, of an API being delivered so that LAMP users can, if needs be, get the results of statistical analysis for use in their own applications without using our user interface.

As soon as we’re considering provision of an API, a common best-practise concept is that we should eat our own dog food — namely, that our own LAMP application should be built on top of our API. This process helps keep our API useable and relevant, as it means that any functionality we’ll need for our front end is available to everyone else through our API as well.

One thing which was mentioned at the CAP meeting, however, was the idea that we could combine our data with that from other APIs out there which offer different analytics data, such as JUSP and IRIS. This would increase the number of sources from which we could pull data and perform statistical analysis.

Before other APIs were mentioned, my initial feeling was that a statistical application layer such as R, integrated with the database, might be the most efficient way to offer up analyses from the LAMP data. A reasonable structure for this arrangement might be:

Error generating Graphviz image:

Graphviz cannot generate graph
Command: /usr/bin/dot '-Kdot' '-Tpng' '-o/blogs/wordpress/wp-content/tfo-graphviz/3620b6bd6a5926bcbae6b850e1ddfc30.png'
Output: 
Warning: : syntax error in line 7 near '-'

Original DOT:
    1 
    2 digraph architecture{
    3 node[shape=trapezium];
    4 "LAMP statistics API interface";
    5 "UI Interface";
    6 node[shape=rectangle];
    7 "LAMP database"->"LAMP data statistics layer"->"LAMP statistics API"->"LAMP statistics API interface";
    8 "LAMP statistics API"->"UI layer"->"UI Interface";
    9 }
   10 


(system components are rectangles, interfaces to the system are trapeziums)

However, the additional requirement to bring in other APIs pulls such a structure into question. Depending on what content is in other APIs and whether we can successfully cross-reference and/or make use of it, there may be a corresponding requirement to carry out statistical analysis across results from multiple APIs. If this is the case, then either the statistics processing layer needs to move (so that processing occurs after the other data has been pulled in), or a second processing layer will be necessary. Questions also arise as to whether or not a LAMP API should serve aggregated statistical results after data has been consumed from other APIs, or whether we only want to offer results from our own database via a LAMP API, leaving analysis across multiple APIs for consumers to implement according to their own individual requirements.

The following diagram attempts to sum up these options, with component options represented with dashed boundaries. Statistical processing options are in purple, and API options in blue:

Error generating Graphviz image:

Graphviz cannot generate graph
Command: /usr/bin/dot '-Kdot' '-Tpng' '-o/blogs/wordpress/wp-content/tfo-graphviz/d52fddb4ca8130df1614206e3ba698b2.png'
Output: 
Warning: : syntax error in line 11 near '-'

Original DOT:
    1 
    2 digraph architecture{
    3 node[shape=trapezium];
    4 "UI Interface";
    5 
    6 node[style=dashed,penwidth=2,color=blue];
    7 "LAMP statistics API interface";
    8 "aggregated statistics API interface";
    9 
   10 node[shape=rectangle]
   11 "LAMP statistics API"->"LAMP statistics API interface";
   12 "aggregated statistics API"->"aggregated statistics API interface";
   13 
   14 node[color=purple];
   15 "LAMP data statistics layer"->"LAMP statistics API";
   16 "aggregated statistics layer"->"aggregated statistics API";
   17 
   18 node[style=solid, penwidth=1, color=black];
   19 "LAMP database"->"LAMP data statistics layer";
   20 "LAMP statistics API"->"aggregator";
   21 "external APIs"->"aggregator"->"aggregated statistics layer";
   22 "aggregated statistics API"->"UI layer"->"UI Interface";
   23 }
   24 

From comments within the LAMP team, we’re leaning towards implementing a processing layer and API on our own LAMP data for now, and only including data from other APIs in the UI for comparison. Further statistical analysis of the type described above could then be an option for the a future release phase of LAMP. Whilst we consider these options further, I’ll be working on the LAMP database structure and trying to import some of the early data from our partners, which I’ll cover in a future post!