All about data normalisation

I missed the last CAP meeting and I’m really disappointed because it sounded great. I’m also sorry because it seems that there were a few questions about the approach we’re taking to data categorisation and normalisation. Since I’m the difficult person who is setting the parameters and making the demands on this, I thought perhaps I ought to explain myself in a blog post!

One of the big challenges for this project is reconciling the big set of things that institutions might want to do with their data, and the much smaller subset of things that are statistically acceptable. That’s not to say that institutions won’t be able to make their own decisions about how they analyse their data – quite the reverse, in fact – but we need to make sure that if they are doing statistical analysis it is done right. We have to pay attention to statistical principles, and this has big repercussions for the way that data is structured within the project.

Whenever you run a statistical test, you are basically trying to understand the relationship between two or more variables – for example, discipline and book borrowing, or country of origin and e-resource usage. Now, because we’re working with samples (in our case, a single year of students, rather than every student who has ever studied at a university, ever!) we have to accept that there might be a random blip in the data which means the relationship only exists in the sample (the year group), not the wider population (all students ever at the university). Significance testing allows you to say how confident you are that any relationships you find do exist within the wider population, not just within your particular sample. It does this by calculating the probability of finding a result at least as strong as yours if the relationship did not exist in the wider population, and spits out a number between 0 and 1 (the p-value). You compare this number to your ‘critical value’ (strictly speaking, your significance level) – usually .05 in the social sciences – and if your result is smaller, your finding is statistically significant.

‘This is all very interesting’, you may be thinking, ‘but what on earth does it have to do with data normalisation?’ Allow me to explain!

Some of the tests that we are using need data that has been separated into groups: for example, to understand the relationship between discipline and library usage, you need to group your students into different disciplines. You can then look and see whether these groups have different levels of library usage. Let’s take a hypothetical example where you have three groups, A, B and C, and you want to see whether these groups have significantly different levels of e-resource use.

The first thing that you do is run the Kruskal-Wallis test, which tells you whether any of the groups have a level of e-resource use that is significantly different from any of the other groups. Crucially, it only tells you that this difference exists somewhere: it doesn’t tell you whether the difference is between A&B, A&C, or B&C – or, indeed, any combination of the above. So that’s not especially helpful, if you want to use the information to decide which of your groups needs support in learning how to use e-resources.
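To make this concrete, here is a minimal sketch of the Kruskal-Wallis calculation in plain Python. It is an illustration rather than the code LAMP runs: the login counts for groups A, B and C are invented, and the function names (`average_ranks`, `chi2_sf`, `kruskal_wallis`) are made up for this example. The p-value comes from the usual chi-square approximation, which gets rough when groups are very small.

```python
import math
from collections import Counter


def average_ranks(values):
    """Rank values from 1 to N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution with df degrees of freedom."""
    half = x / 2
    if df % 2 == 0:
        a, q = 1.0, math.exp(-half)
    else:
        a, q = 0.5, math.erfc(math.sqrt(half))
    # Step up to df/2 using Q(a + 1, x) = Q(a, x) + x^a * e^-x / Gamma(a + 1).
    while a < df / 2:
        q += half ** a * math.exp(-half) / math.gamma(a + 1)
        a += 1
    return q


def kruskal_wallis(groups):
    """Return (H, p) for a dict mapping group name -> list of observations."""
    pooled = [v for obs in groups.values() for v in obs]
    n = len(pooled)
    ranks = average_ranks(pooled)
    h, start = 0.0, 0
    for obs in groups.values():
        rank_sum = sum(ranks[start:start + len(obs)])
        h += rank_sum ** 2 / len(obs)
        start += len(obs)
    h = 12 / (n * (n + 1)) * h - 3 * (n + 1)
    # Standard correction for tied values (equals 1 when there are no ties).
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    h /= 1 - ties / (n ** 3 - n)
    return h, chi2_sf(h, len(groups) - 1)


# Invented e-resource logins per student, for illustration only.
logins = {
    "A": [12, 15, 9, 20, 7],
    "B": [30, 28, 35, 25, 40],
    "C": [55, 60, 48, 70, 52],
}
h, p = kruskal_wallis(logins)
print(f"H = {h:.2f}, p = {p:.4f}, significant at .05: {p < 0.05}")
```

A significant result here only says that at least one group differs; it does not say which.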

If you find that your Kruskal-Wallis test is significant, you then need to go on and run Mann-Whitney tests on the same data. You take each pair – A&B, A&C and B&C – and run three separate tests, to see whether there is a difference between the two groups in each pair. For reasons that I’m not even going to try to explain, when you run lots of Mann-Whitney tests on the same set of data, you increase your risk of a Type I error, which is statistical jargon for thinking a relationship exists where it actually doesn’t. In this example, it would result in libraries spending a lot of time educating a group of people in e-resource use where, in fact, the group is already perfectly competent. Again, not particularly helpful!
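For anyone who does want a feel for those reasons, the arithmetic can be sketched roughly as below. It assumes the tests are independent of each other, which pairwise tests on the same data are not quite, so treat the numbers as a guide rather than exact figures. The function name is just for illustration.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n_tests independent tests."""
    return 1 - (1 - alpha) ** n_tests


# One test at .05, then the three pairwise tests for A, B and C,
# then the six pairwise tests you would need for four groups.
for n_tests in (1, 3, 6):
    rate = familywise_error_rate(0.05, n_tests)
    print(f"{n_tests} test(s): about {rate:.1%} chance of at least one Type I error")
```

So with three uncorrected tests the risk of at least one false alarm is already well above the 5% you thought you were working with.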

To reduce the risk of Type I errors you apply a Bonferroni correction, which basically means dividing your critical value – .05 – by the number of tests you’re running, to give you a new critical value. So, in our A, B, C example, you would divide .05 by 3, giving you a critical value of .017. For your Mann-Whitney test on a single pair to be statistically significant, it now needs to spit out a number which is smaller than .017. If you had four groups and you wanted to compare them all, that would be six pairs and so six tests, and your critical value would be .008. That’s pretty small, and makes it much less likely that you’ll find a statistically significant relationship.
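The sums in that paragraph can be checked with a few lines of Python. This is only a sketch of the arithmetic; `pairwise_tests` and `bonferroni_threshold` are names invented for the example.

```python
from math import comb


def pairwise_tests(n_groups):
    """Number of distinct pairs you can form from n_groups groups."""
    return comb(n_groups, 2)


def bonferroni_threshold(alpha, n_tests):
    """Bonferroni-corrected critical value for a family of n_tests tests."""
    return alpha / n_tests


for n_groups in (3, 4, 6):
    n_tests = pairwise_tests(n_groups)
    threshold = bonferroni_threshold(0.05, n_tests)
    print(f"{n_groups} groups: {n_tests} pairwise tests, critical value {threshold:.3f}")
```

The number of pairs grows much faster than the number of groups, which is why the corrected critical value shrinks so quickly.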

So this is where – finally! – we get to data categorisation. With LAMP, we want to maximise the chances of telling universities something useful about their datasets. So we need to keep our number of groups as small as possible. You can minimise the number of tests that you run by taking a control group for each variable – the biggest group, let’s say – and comparing all the others to it with Mann-Whitney tests, without comparing them to each other. But if you have six groups, this will still mean running five tests and therefore working with a critical value of .01. So we really want to keep the numbers of groups down.
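The control-group approach might look something like the sketch below. Everything here is an assumption made for illustration: the data are invented, the function names are mine, and the Mann-Whitney p-value uses a plain normal approximation with no tie or continuity correction, so it is cruder than what a proper statistics package would give you.

```python
import math


def average_ranks(values):
    """Rank values from 1 to N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def mann_whitney_p(x, y):
    """Two-sided p-value from the normal approximation to the U statistic."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))


def compare_to_control(groups, alpha=0.05):
    """Compare every group with the biggest one, using a Bonferroni threshold."""
    control = max(groups, key=lambda name: len(groups[name]))
    others = [name for name in groups if name != control]
    threshold = alpha / len(others)  # one test per non-control group
    results = {}
    for name in others:
        p = mann_whitney_p(groups[control], groups[name])
        results[name] = (p, p < threshold)
    return control, threshold, results


# Invented e-resource logins per student, for illustration only.
groups = {
    "A": [2, 4, 6, 8, 10, 12, 14, 16, 18, 20],
    "B": [41, 44, 47, 50, 53, 56, 59, 62],
    "C": [3, 7, 11, 15, 19, 9],
}
control, threshold, results = compare_to_control(groups)
print(f"control group: {control}, critical value: {threshold:.3f}")
for name, (p, significant) in results.items():
    print(f"{control} vs {name}: p = {p:.4f}, significant: {significant}")
```

With three groups this runs two tests rather than three; with six groups it runs five rather than fifteen, at the cost of never comparing the non-control groups with each other.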

In short, then: if we have too many groups, we starkly reduce our chances of finding a statistically significant relationship between variables. This would make LAMP much less useful to institutions which want to build an evidence base in order to develop their library services. So we need to be a bit prescriptive.

The situation is further complicated by the fact that one aim of LAMP is to aggregate data across institutions. If we’re going to do this, obviously we need everybody to be working with the same set of definitions: it’s no use one university using groups A, B and C if another is using D, E and F and another is using red, blue and green!

In principle, there’s no reason that universities shouldn’t run a separate analysis on their own data using different groupings, if that makes more sense for them. But they’ll still have to think about the number of groups they include if they want to get statistically significant results.

Another option we’re thinking about is allowing people to compare sub-groups within each group: so, for example, within group A you might have subgroups 1, 2 and 3, and within group B you might have subgroups 4, 5 and 6. You can use the same Kruskal-Wallis/Mann-Whitney procedure to compare subgroups 1, 2 and 3 with each other, and subgroups 4, 5 and 6 with each other: but – crucially – you can’t compare 1, 4 and 6, and you can’t compare all six subgroups with each other. This should be helpful with something like discipline.

I hope that clears things up a bit! If not, let me know in the comments and I’ll do my best to answer any questions…


  1. Lee Baylis

    Great post, which clears up a couple of questions I had about the number of groupings we’re using in the normalisations!

    One thing that I did wonder though was whether statistical theory has anything to say about the mechanism via which the groupings are chosen — my instinct is convinced that you wouldn’t expect anything significant to come from comparing groups which you had populated at random, but if you were too clever with your grouping you might be inadvertently causing the significance that you end up finding?

    If we’re going to allow LAMP users to customise their own groupings in the application (normalising countries to world regions of interest to their institution, say) then this will definitely need bearing in mind…

  2. Ellen Collins

    Lee, great question, and you have put your finger on a very important point.

    When you run a statistical test, you are doing exactly that – testing a hypothesis that there is a relationship between two variables. (Actually, you are usually testing the null hypothesis that there isn’t a relationship…but I digress). The hypothesis ought to be informed by theory: you should have a valid reason for why you’ve grouped the data points within a variable in the way that you have.

    So, for example, when we ran the analysis at Huddersfield, we grouped the data for country of origin into UK, New EU, Old EU and rest of world. This was due, in part, to technical reasons (getting enough data points into each group to run the test), but also because we thought that new and old EU students might use resources differently: as, indeed, they did – particularly e-resources.

    Now, we might have seen even bigger differences if we had, for example, grouped England in one place, Scotland, Africa and Australasia in another, Germany and the Middle East in another… you get the picture. But we would have no theoretical justification for doing this – there’s no reason to think that students in Germany might be like students in the Middle East. So, even though we might see statistically significant differences in usage, we would have no theory to explain *why* we saw those differences, so the analysis would be useless.

    How we protect against this in LAMP is a tricky question, and it comes back to that point I made in the blog about what institutions might want to do compared to what, statistically, they are allowed to do. Part of it might be setting up a range of valid detailed options for comparison that they can choose from; part might be very clear guidance. Definitely one to think about.

  3. Lee Baylis

    There was another point which I was wondering about, and sorry if this is a second blog post in itself!

    One of the other concerns which we have touched on in CAP meetings is the risk that we’ll output a piece of data which could somehow be identifying even after normalisation. For example, we might produce a plot of normalised country grouping versus normalised average degree classification for a given course when, in reality, there’s only one person on that course from all the countries grouped under ‘other’, so the ‘average classification’ is actually just that one person’s degree result.

    Do you know of any statistical (or possibly information theory) methods which would help us to identify whether we are about to output a result in one of the normalised ‘bins’ on our charts which differs from the selection as a whole in such a way as to inadvertently identify someone?

    Maybe a check on the sample sizes in each of the normalised ‘bins’ before the statistics are run (or outputted), and then either refusing to run the analysis on that sample, or insisting that the bin be left out in the output? Or possibly something more sophisticated which will help us to calculate what the minimum sample size per bin has to be before we leave it out?

    Or is my thinking (like my example plot above) about the types of chart we will be outputting not quite correct, making this an unnecessary worry?

  4. Ellen Collins

    Hmmm. Yes, identification is a challenge. The sort of output you’re talking about is really descriptive statistics (i.e. ways of ‘describing’, visually or otherwise, your dataset). So because you’re just cross-tabulating, it should be relatively easy to identify, behind the scenes, any cells which have an uncomfortably low number of people in them before you generate an output for public viewing. Once you’ve done this, there are a number of ways to mitigate the risk that people will be identified: the Office for National Statistics has some quite detailed information in PDFs available to download. What I tend to do is just to lump everybody who looks risky into an ‘other’ category, which becomes a bit of a dumping ground, but there we go…

    Does that answer the question?
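
The check discussed in comments 3 and 4 (count the people in each bin, lump the small ones into an ‘other’ category, and hold back anything that is still too small) could be sketched along these lines. The threshold of 5 is a placeholder chosen for the example, not a recommended value, and the function and variable names are hypothetical.

```python
from collections import Counter

MIN_CELL_SIZE = 5  # placeholder threshold, an assumption for this sketch


def lump_small_bins(labels, min_size=MIN_CELL_SIZE, other="Other"):
    """Merge bins with fewer than min_size people into `other`.

    Returns the merged counts plus a list of bins that are still too
    small to publish, which can include the `other` bin itself.
    """
    counts = Counter(labels)
    small = {label for label, n in counts.items() if n < min_size and label != other}
    merged = Counter()
    for label, n in counts.items():
        merged[other if label in small else label] += n
    still_too_small = [label for label, n in merged.items() if n < min_size]
    return dict(merged), still_too_small


# Invented country groupings for the students on one course.
students = ["UK"] * 40 + ["Old EU"] * 8 + ["New EU"] * 3 + ["Rest of world"] * 1
merged, still_too_small = lump_small_bins(students)
print(merged)
print("leave out of the output:", still_too_small)
```

In this made-up case, lumping alone is not enough: the merged ‘other’ bin still holds fewer than five people, so it would need to be left out of the output as well.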
