I missed the last CAP meeting and I’m really disappointed because it sounded great. I’m also sorry because it seems that there were a few questions about the approach we’re taking to data categorisation and normalisation. Since I’m the difficult person who is setting the parameters and making the demands on this, I thought perhaps I ought to explain myself in a blog post!
One of the big challenges for this project is reconciling the big set of things that institutions might want to do with their data, and the much smaller subset of things that are statistically acceptable. That’s not to say that institutions won’t be able to make their own decisions about how they analyse their data – quite the reverse, in fact – but we need to make sure that if they are doing statistical analysis it is done right. We have to pay attention to statistical principles, and this has big repercussions for the way that data is structured within the project.
Whenever you run a statistical test, you are basically trying to understand the relationship between two or more variables – for example, discipline and book borrowing, or country of origin and e-resource usage. Now, because we’re working with samples (in our case, a single year of students, rather than every student who has ever studied at a university, ever!) we have to accept that there might be a random blip in the data which means the relationship only exists in the sample (the year group), not the wider population (all students ever at the university). Significance testing allows you to say how confident you are that any relationships you find do exist within the wider population, not just within your particular sample. It does this by calculating the probability of finding a result at least as strong as yours if the relationship did not exist in the wider population, and spits out a number between 0 and 1 (the p-value). You compare this number to your ‘critical value’ (strictly speaking, the significance level) – usually .05 in the social sciences – and if your result is smaller, your finding is statistically significant.
‘This is all very interesting’, you may be thinking, ‘but what on earth does it have to do with data normalisation?’ Allow me to explain!
Some of the tests that we are using need data that has been separated into groups: for example, to understand the relationship between discipline and library usage, you need to group your students into different disciplines. You can then look and see whether these groups have different levels of library usage. Let’s take a hypothetical example where you have three groups, A, B and C, and you want to see whether these groups have significantly different levels of e-resource use.
The first thing that you do is run the Kruskal-Wallis test, which tells you whether any of the groups have a level of e-resource use that is significantly different from any of the other groups. Crucially, it only tells you that this difference exists somewhere: it doesn’t tell you whether the difference is between A&B, A&C, or B&C – or, indeed, any combination of the above. So that’s not especially helpful, if you want to use the information to decide which of your groups needs support in learning how to use e-resources.
If you find that your Kruskal-Wallis test is significant, you then need to go on and run Mann-Whitney tests on the same data. You take each pair – A&B, A&C and B&C – and run three separate tests, to see whether there is a difference between the two groups in each pair. For reasons that I’m not even going to try to explain, when you run lots of Mann-Whitney tests on the same set of data, you increase your risk of a Type I error, which is statistical jargon for thinking a relationship exists where it actually doesn’t. In this example, it would result in libraries spending a lot of time educating a group of people in e-resource use where, in fact, the group is already perfectly competent. Again, not particularly helpful!
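For anyone who wants to see the two steps in action, here is a minimal sketch in Python. It assumes the SciPy library is installed and uses its `kruskal` and `mannwhitneyu` functions; the group names and login counts are made up purely for illustration and are not LAMP data.

```python
from itertools import combinations

from scipy.stats import kruskal, mannwhitneyu

# Hypothetical e-resource login counts per student (made-up numbers).
usage = {
    "A": [1, 3, 5, 7, 9, 11, 13, 15, 17, 19],
    "B": [2, 4, 6, 8, 10, 12, 14, 16, 18, 20],
    "C": [41, 42, 43, 44, 45, 46, 47, 48, 49, 50],
}

ALPHA = 0.05  # the usual social-science threshold

# Step 1: Kruskal-Wallis asks whether a difference exists somewhere.
_, kw_p = kruskal(*usage.values())

# Step 2: only if step 1 is significant, test each pair separately.
pair_p = {}
if kw_p < ALPHA:
    for g1, g2 in combinations(usage, 2):
        _, p = mannwhitneyu(usage[g1], usage[g2], alternative="two-sided")
        pair_p[(g1, g2)] = p

print(f"Kruskal-Wallis p = {kw_p:.5f}")
for pair, p in pair_p.items():
    print(pair, f"p = {p:.5f}")
```

Note that the pairwise numbers printed here shouldn’t be compared against .05 directly, because of the Type I error problem just described.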
To keep the risk of Type I errors under control you apply a Bonferroni correction, which basically means dividing your critical value – .05 – by the number of tests you’re running, to give you a new critical value. So, in our A, B, C example, you would divide .05 by 3, giving you a critical value of .017. For your Mann-Whitney test on a single pair to be statistically significant, it now needs to spit out a number which is smaller than .017. If you had four groups and you wanted to compare them all, you would have six pairs to test, so your critical value would be .05 divided by 6, or .008. That’s pretty small, and makes it much less likely that you’ll find a statistically significant relationship.
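The arithmetic above is easy to check for yourself. The sketch below uses only the Python standard library; the function names are my own, and the second function assumes the tests are independent of each other, which is a simplification but gives a feel for why running several uncorrected tests is risky.

```python
from math import comb

ALPHA = 0.05


def bonferroni_threshold(n_tests, alpha=ALPHA):
    """Divide the original threshold by the number of tests being run."""
    return alpha / n_tests


def familywise_error(n_tests, alpha=ALPHA):
    """Chance of at least one false positive, ASSUMING independent tests."""
    return 1 - (1 - alpha) ** n_tests


for n_groups in (3, 4):
    n_tests = comb(n_groups, 2)  # every possible pair of groups
    print(
        f"{n_groups} groups: {n_tests} tests, "
        f"corrected threshold {bonferroni_threshold(n_tests):.4f}, "
        f"uncorrected error risk {familywise_error(n_tests):.3f}"
    )
```

With three uncorrected tests at .05, the chance of at least one false alarm under that independence assumption is roughly 14% rather than 5%, which is the problem the correction is there to fix.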
So this is where – finally! – we get to data categorisation. With LAMP, we want to maximise the chances of telling universities something useful about their datasets. So we need to keep our number of groups as small as possible. You can minimise the number of tests that you run by taking a control group for each variable – the biggest group, let’s say – and comparing all the others to it with Mann-Whitney tests, without comparing them to each other. But if you have six groups, this will still mean running five tests and therefore working with a critical value of .01. So we really want to keep the numbers of groups down.
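To make the saving concrete, here is a small sketch of the control-group approach next to the compare-everything approach. The six group names and their sizes are invented for the example, and picking the biggest group as the control is just the "let’s say" option mentioned above.

```python
from math import comb

ALPHA = 0.05


def control_group_pairs(group_sizes):
    """Pair the largest group (the control) with every other group."""
    control = max(group_sizes, key=group_sizes.get)
    return [(control, g) for g in group_sizes if g != control]


# Hypothetical student counts for six groups.
sizes = {"A": 900, "B": 400, "C": 350, "D": 300, "E": 250, "F": 200}

pairs = control_group_pairs(sizes)
control_threshold = ALPHA / len(pairs)        # five tests
all_pairs_threshold = ALPHA / comb(len(sizes), 2)  # fifteen tests

print("Tests against control:", pairs)
print(f"Threshold with control group: {control_threshold:.4f}")
print(f"Threshold comparing all pairs: {all_pairs_threshold:.4f}")
```

So the control-group approach helps a lot, but five tests is still enough to cut the threshold to a fifth of where it started.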
In short, then: if we have too many groups, we sharply reduce our chances of finding a statistically significant relationship between variables. This would make LAMP much less useful to institutions which want to build an evidence base in order to develop their library services. So we need to be a bit prescriptive.
The situation is further complicated by the fact that one aim of LAMP is to aggregate data across institutions. If we’re going to do this, obviously we need everybody to be working with the same set of definitions: it’s no use one university using groups A, B and C if another is using D, E and F and another is using red, blue and green!
In principle, there’s no reason that universities shouldn’t run a separate analysis on their own data using different groupings, if that makes more sense for them. But they’ll still have to think about the number of groups they include if they want to get statistically significant results.
Another option we’re thinking about is allowing people to compare sub-groups within each group: so, for example, within group A you might have subgroups 1, 2 and 3, and within group B you might have subgroups 4, 5 and 6. You can use the same Kruskal-Wallis/Mann-Whitney procedure to compare subgroups 1, 2 and 3 with each other, and subgroups 4, 5 and 6 with each other: but – crucially – you can’t compare 1, 4 and 6, and you can’t compare all six subgroups with each other. This should be helpful with something like discipline.
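Here is a rough sketch of which comparisons that rule would allow, using the group and subgroup labels from the example above. The dictionary layout and the function name are just my illustration, not how LAMP actually stores anything.

```python
from itertools import combinations
from math import comb

# Hypothetical hierarchy: each top-level group with its subgroups.
subgroups = {"A": [1, 2, 3], "B": [4, 5, 6]}


def allowed_pairs(hierarchy):
    """List the subgroup pairs that share the same parent group."""
    return {
        parent: list(combinations(children, 2))
        for parent, children in hierarchy.items()
    }


pairs = allowed_pairs(subgroups)
for parent, parent_pairs in pairs.items():
    print(f"Within group {parent}: {parent_pairs}")

# For comparison: the number of pairs if all six were compared freely.
n_subgroups = sum(len(children) for children in subgroups.values())
print("Pairs if everything were compared:", comb(n_subgroups, 2))
```

Each group only needs three pairwise tests this way, rather than the fifteen you would face if all six subgroups were compared with each other.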
I hope that clears things up a bit! If not, let me know in the comments and I’ll do my best to answer any questions…