All about data normalisation

I missed the last CAP meeting and I’m really disappointed because it sounded great. I’m also sorry because it seems that there were a few questions about the approach we’re taking to data categorisation and normalisation. Since I’m the difficult person who is setting the parameters and making the demands on this, I thought perhaps I ought to explain myself in a blog post!

One of the big challenges for this project is reconciling the big set of things that institutions might want to do with their data, and the much smaller subset of things that are statistically acceptable. That’s not to say that institutions won’t be able to make their own decisions about how they analyse their data – quite the reverse, in fact – but we need to make sure that if they are doing statistical analysis it is done right. We have to pay attention to statistical principles, and this has big repercussions for the way that data is structured within the project.

Whenever you run a statistical test, you are basically trying to understand the relationship between two or more variables – for example, discipline and book borrowing, or country of origin and e-resource usage. Now, because we’re working with samples (in our case, a single year of students, rather than every student who has ever studied at a university, ever!) we have to accept that there might be a random blip in the data which means the relationship only exists in the sample (the year group), not the wider population (all students ever at the university). Significance testing allows you to say how confident you are that any relationships you find do exist within the wider population, not just within your particular sample. It does this by calculating the probability that you would have found your result if the relationship did not exist in the wider population, and it spits out a number between 0 and 1. You compare this number to your ‘critical value’ – usually .05 in the social sciences – and if your result is smaller, your finding is statistically significant.
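To make the mechanics concrete, here is a minimal sketch in Python; the numbers are invented purely for illustration:

```python
# A significance test spits out a p-value: the probability of seeing a result
# at least this extreme if no relationship existed in the wider population.
p_value = 0.03            # hypothetical output from a test
critical_value = 0.05     # the usual threshold in the social sciences

if p_value < critical_value:
    print("Statistically significant: unlikely to be a random blip in the sample")
else:
    print("Not significant: the apparent relationship may just be sampling noise")
```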

‘This is all very interesting’, you may be thinking, ‘but what on earth does it have to do with data normalisation?’ Allow me to explain!

Some of the tests that we are using need data that has been separated into groups: for example, to understand the relationship between discipline and library usage, you need to group your students into different disciplines. You can then look and see whether these groups have different levels of library usage. Let’s take a hypothetical example where you have three groups, A, B and C, and you want to see whether these groups have significantly different levels of e-resource use.

The first thing that you do is run the Kruskal-Wallis test, which tells you whether any of the groups have a level of e-resource use that is significantly different from any of the other groups. Crucially, it only tells you that this difference exists somewhere: it doesn’t tell you whether the difference is between A&B, A&C, or B&C – or, indeed, any combination of the above. So that’s not especially helpful if you want to use the information to decide which of your groups needs support in learning how to use e-resources.
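As a rough illustration of this first step, here is how it might look in Python using scipy; the download counts are invented for the example:

```python
from scipy.stats import kruskal

# Invented e-resource download counts for three groups of students
group_a = [12, 5, 8, 20, 3, 15]
group_b = [30, 25, 41, 18, 27, 33]
group_c = [10, 7, 22, 14, 9, 16]

# Kruskal-Wallis asks: is there a significant difference *somewhere* among the groups?
statistic, p_value = kruskal(group_a, group_b, group_c)
print(f"Kruskal-Wallis p-value: {p_value:.3f}")

# A p-value below .05 says at least one group differs from another,
# but not which pair (or pairs) is responsible.
```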

If you find that your Kruskal-Wallis test is significant, you then need to go on and run Mann-Whitney tests on the same data. You take each pair – A&B, A&C and B&C – and run three separate tests, to see whether there is a difference between the two groups in each pair. When you run lots of Mann-Whitney tests on the same set of data, you increase your risk of a Type I error – statistical jargon for thinking a relationship exists where it actually doesn’t – simply because each test carries its own chance of a fluke result, so the more tests you run, the more likely it is that at least one of them will be a false positive. In this example, it would result in libraries spending a lot of time educating a group of people in e-resource use where, in fact, the group is already perfectly competent. Again, not particularly helpful!
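Continuing the sketch, with the same invented data, the follow-up tests might look something like this:

```python
from itertools import combinations
from scipy.stats import mannwhitneyu

# The same invented download counts as in the Kruskal-Wallis sketch
groups = {
    "A": [12, 5, 8, 20, 3, 15],
    "B": [30, 25, 41, 18, 27, 33],
    "C": [10, 7, 22, 14, 9, 16],
}

# One Mann-Whitney test per pair: A&B, A&C and B&C
for (name_1, data_1), (name_2, data_2) in combinations(groups.items(), 2):
    statistic, p_value = mannwhitneyu(data_1, data_2, alternative="two-sided")
    print(f"{name_1} vs {name_2}: p = {p_value:.3f}")
```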

To avoid Type I errors you apply a Bonferroni correction, which basically means dividing your critical value – .05 – by the number of tests you’re running, to give you a new critical value. So, in our A, B, C example, you would divide .05 by 3, giving you a critical value of .017. For your Mann-Whitney test on a single pair to be statistically significant, it now needs to spit out a number which is smaller than .017. If you had four groups and you wanted to compare them all, you would be running six tests (A&B, A&C, A&D, B&C, B&D and C&D), so your critical value would be .008. That’s pretty small, and makes it much less likely that you’ll find a statistically significant relationship.
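The correction itself is just a division; a quick sketch of the arithmetic:

```python
critical_value = 0.05

# Three groups compared in every combination give three pairs: A&B, A&C, B&C
n_tests = 3
print(f"3 groups: corrected critical value = {critical_value / n_tests:.3f}")  # .017

# Four groups compared in every combination give six pairs (4 * 3 / 2 = 6)
n_tests = 6
print(f"4 groups: corrected critical value = {critical_value / n_tests:.3f}")  # .008
```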

So this is where – finally! – we get to data categorisation. With LAMP, we want to maximise the chances of telling universities something useful about their datasets. So we need to keep our number of groups as small as possible. You can minimise the number of tests that you run by taking a control group for each variable – the biggest group, let’s say – and comparing all the others to it with Mann-Whitney tests, without comparing them to each other. But if you have six groups, this will still mean running five tests and therefore working with a critical value of .01. So we really want to keep the number of groups down.
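A sketch of that control-group approach, with the group data still invented:

```python
from scipy.stats import mannwhitneyu

# Invented counts again; suppose B is the biggest group, so it acts as the
# control and every other group is compared to it rather than to each other.
control = [30, 25, 41, 18, 27, 33]        # group B
others = {
    "A": [12, 5, 8, 20, 3, 15],
    "C": [10, 7, 22, 14, 9, 16],
    # ...with six groups there would be five entries here
}

threshold = 0.05 / len(others)            # one test per non-control group
                                          # (five tests for six groups gives .05 / 5 = .01)

for name, data in others.items():
    _, p_value = mannwhitneyu(control, data, alternative="two-sided")
    verdict = "significant" if p_value < threshold else "not significant"
    print(f"control (B) vs {name}: p = {p_value:.3f} ({verdict})")
```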

In short, then: if we have too many groups, we starkly reduce our chances of finding a statistically significant relationship between variables. This would make LAMP much less useful to institutions which want to build an evidence base in order to develop their library services. So we need to be a bit prescriptive.

The situation is further complicated by the fact that one aim of LAMP is to aggregate data across institutions. If we’re going to do this, obviously we need everybody to be working with the same set of definitions: it’s no use one university using groups A, B and C if another is using D, E and F and another is using red, blue and green!

In principle, there’s no reason that universities shouldn’t run a separate analysis on their own data using different groupings, if that makes more sense for them. But they’ll still have to think about the number of groups they include if they want to get statistically significant results.

Another option we’re thinking about is allowing people to compare sub-groups within each group: so, for example, within group A you might have subgroups 1, 2 and 3, and within group B you might have subgroups 4, 5 and 6. You can use the same Kruskal-Wallis/Mann-Whitney procedure to compare subgroups 1, 2 and 3 with each other, or subgroups 4, 5 and 6 with each other: but – crucially – you can’t compare 1, 4 and 6, and you can’t compare all six subgroups with each other. This should be helpful for something like discipline, where broad subject groupings contain more specific sub-disciplines.

I hope that clears things up a bit! If not, let me know in the comments and I’ll do my best to answer any questions…

LAMP Principles

As the project begins to engage with institutions and existing library systems vendors and services it’s important that we make it very clear what we plan to do with the data, and more broadly how the project will undertake its work.

With this in mind the team have come up with a set of principles by which the project will operate.

Let us know what you think and how we could improve them…

The following project principles are designed to ensure that the work of LAMP and its partners, contributors and contractors is aligned to all relevant legal and ethical frameworks.

These principles will help ensure:
• An understanding of the status of data provided by contributors and third parties to the project;
• Legal and ethical guidelines for the project, partners and contributors;
• Clarity on issues of competition and market differentiation.

Data Protection

Any and all raw data supplied to the project will remain under the ownership of the originating institution or organisation. Data can be taken down or removed at any time upon request from the owner. At the completion of the project, all data will be returned to the owner or deleted by the project.

The project will ensure protection of data and confidentiality to persons and organisations through appropriate measures (such as anonymisation of records linked to individuals) in accordance with the Data Protection Act.

Commercial Confidentiality

In order to protect any commercially confidential data or information, the project will seek to use other sources of openly available data, or will ensure that such information and/or data is not made publicly available.

Access to the service will be via UK Access Management Federation, ensuring confidential data cannot be accessed outside of an institution.

Licensing and standards

The project will ensure its reports and technical specifications are licensed under an appropriate open licence (such as Creative Commons) to encourage standardisation and reuse. All reports will be made available via the project website.

LAMP will look to adopt and implement existing technical standards and make use of structured data principles to facilitate interoperability with other systems.

Ethics

Wherever the project handles data that pertains to the analysis of learning and research, it will act in accordance with ethical principles that treat the wellbeing and interests of the individual as paramount and as the basis for the good reputation of the sector and its institutions. [Link to Legal and Ethical framework for the project].

Development

The ongoing development of LAMP will be driven by engagement with the UK library community.

The prototype service will look to add value to existing institutional systems and services through the possibilities of data aggregation and benchmarking. It will not look to duplicate the functionality of existing systems or services where the market is functioning and healthy.

Participation

LAMP has convened a Community Advisory and Planning Group to ensure the project meets the requirements, values and aspirations of the UK academic library community. The project also has a website and blog which will regularly be updated with new information.

The project is also directly working with six institutions who are supplying data to the project. A full list of the participating institutions can be found here.

As LAMP progresses and prototypes are developed, the project will plan ways of gathering input and feedback from the wider library community, including international libraries and commercial suppliers.

The prototype service will be available to the UK academic community upon its release in December 2013.