All about data normalisation

I missed the last CAP meeting and I’m really disappointed because it sounded great. I’m also sorry because it seems that there were a few questions about the approach we’re taking to data categorisation and normalisation. Since I’m the difficult person who is setting the parameters and making the demands on this, I thought perhaps I ought to explain myself in a blog post!

One of the big challenges for this project is reconciling the big set of things that institutions might want to do with their data, and the much smaller subset of things that are statistically acceptable. That’s not to say that institutions won’t be able to make their own decisions about how they analyse their data – quite the reverse, in fact – but we need to make sure that if they are doing statistical analysis it is done right. We have to pay attention to statistical principles, and this has big repercussions for the way that data is structured within the project.

Whenever you run a statistical test, you are basically trying to understand the relationship between two or more variables – for example, discipline and book borrowing, or country of origin and e-resource usage. Now, because we’re working with samples (in our case, a single year of students, rather than every student who has ever studied at a university, ever!) we have to accept that there might be a random blip in the data which means the relationship only exists in the sample (the year group), not the wider population (all students ever at the university). Significance testing allows you to say how confident you are that any relationships you find do exist within the wider population, not just within your particular sample. It does this by calculating the probability of you finding your result if the relationship did not exist in the wider population and spits out a number between 0 and 1. You compare this number to your ‘critical value’ – usually .05 in the social sciences – and if your result is smaller, your finding is statistically significant.

‘This is all very interesting’, you may be thinking, ‘but what on earth does it have to do with data normalisation?’ Allow me to explain!

Some of the tests that we are using need data that has been separated into groups: for example, to understand the relationship between discipline and library usage, you need to group your students into different disciplines. You can then look and see whether these groups have different levels of library usage. Let’s take a hypothetical example where you have three groups, A, B and C, and you want to see whether these groups have significantly different levels of e-resource use.

The first thing that you do is run the Kruskal-Wallis test, which tells you whether any of the groups have a level of e-resource use that is significantly different from any of the other groups. Crucially, it only tells you that this difference exists somewhere: it doesn’t tell you whether the difference is between A&B, A&C, or B&C – or, indeed, any combination of the above. So that’s not especially helpful, if you want to use the information to decide which of your groups needs support in learning how to use e-resources.

If you find that your Kruskal-Wallis test is significant, you then need to go on and run Mann-Whitney tests on the same data. You take each pair – A&B, A&C and B&C – and run three separate tests, to see whether there is a difference between the two groups in each pair. For reasons that I’m not going to explain in detail (essentially, every extra test is another chance for a random fluke to look like a real difference), when you run lots of Mann-Whitney tests on the same set of data, you increase your risk of a Type I error, which is statistical jargon for thinking a relationship exists where it actually doesn’t. In this example, it would result in libraries spending a lot of time educating a group of people in e-resource use where, in fact, the group is already perfectly competent. Again, not particularly helpful!

To avoid Type I errors you apply a Bonferroni correction, which basically means dividing your critical value – .05 – by the number of tests you’re running, to give you a new critical value. So, in our A, B, C example, you would divide .05 by 3, giving you a critical value of .017. For your Mann-Whitney test on a single pair to be statistically significant, it now needs to spit out a number which is smaller than .017. If you had four groups and you wanted to compare them all, you would be running six pairwise tests, so your critical value would be .008. That’s pretty small, and makes it much less likely that you’ll find a statistically significant relationship.
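To make the procedure concrete, here is a minimal sketch in Python using SciPy’s implementations of these tests. The e-resource usage counts for groups A, B and C are made up purely for illustration, and the default SciPy test settings are my assumptions rather than anything prescribed by LAMP:

    # Sketch of the Kruskal-Wallis -> pairwise Mann-Whitney -> Bonferroni procedure
    # described above. The usage counts are invented for illustration.
    from itertools import combinations
    from scipy.stats import kruskal, mannwhitneyu

    groups = {
        "A": [3, 12, 7, 0, 25, 9, 14],
        "B": [45, 60, 31, 22, 58, 41, 37],
        "C": [5, 8, 2, 19, 11, 6, 13],
    }

    # Step 1: does a difference in e-resource use exist *somewhere* among the groups?
    h_stat, p_overall = kruskal(*groups.values())
    print(f"Kruskal-Wallis p = {p_overall:.4f}")

    if p_overall < 0.05:
        # Step 2: pairwise Mann-Whitney tests, with the critical value divided by the
        # number of tests (Bonferroni): .05 / 3 = .017 for the pairs A&B, A&C and B&C.
        pairs = list(combinations(groups, 2))
        critical = 0.05 / len(pairs)
        for a, b in pairs:
            _, p = mannwhitneyu(groups[a], groups[b])
            verdict = "significant" if p < critical else "not significant"
            print(f"{a} vs {b}: p = {p:.4f} -> {verdict} at the {critical:.3f} level")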

So this is where – finally! – we get to data categorisation. With LAMP, we want to maximise the chances of telling universities something useful about their datasets. So we need to keep our number of groups as small as possible. You can minimise the number of tests that you run by taking a control group for each variable – the biggest group, let’s say – and comparing all the others to it with Mann-Whitney tests, without comparing them to each other. But if you have six groups, this will still mean running five tests and therefore working with a critical value of .01. So we really want to keep the numbers of groups down.

In short, then: if we have too many groups, we starkly reduce our chances of finding a statistically significant relationship between variables. This would make LAMP much less useful to institutions which want to build an evidence base in order to develop their library services. So we need to be a bit prescriptive.

The situation is further complicated by the fact that one aim of LAMP is to aggregate data across institutions. If we’re going to do this, obviously we need everybody to be working with the same set of definitions: it’s no use one university using groups A, B and C if another is using D, E and F and another is using red, blue and green!

In principle, there’s no reason that universities shouldn’t run a separate analysis on their own data using different groupings, if that makes more sense for them. But they’ll still have to think about the number of groups they include if they want to get statistically significant results.

Another option we’re thinking about is allowing people to compare sub-groups within each group: so, for example, within group A you might have subgroups 1, 2 and 3, and within group B you might have subgroups 4, 5 and 6. You can use the same Kruskal-Wallis/Mann-Whitney procedure to compare groups 1, 2 and 3 and groups 4, 5 and 6: but – crucially – you can’t compare 1, 4 and 6, and you can’t compare all six groups with each other. This should be helpful with something like discipline.

I hope that clears things up a bit! If not, let me know in the comments and I’ll do my best to answer any questions…

LAMP Principles

As the project begins to engage with institutions and existing library systems vendors and services, it’s important that we make it very clear what we plan to do with the data and, more broadly, how the project will undertake its work.

With this in mind the team have come up with a set of principles by which the project will operate.

Let us know what you think and how we could improve them…

The following project principles are designed to ensure that the work of LAMP and its partners, contributors and contractors is aligned to all relevant legal and ethical frameworks.

These principles will help ensure:
• An understanding of the status of data provided by contributors and third parties to the project;
• Legal and ethical guidelines for the project, partners and contributors;
• Clarity on issues of competition and market differentiation.

Data Protection

Any and all raw data supplied to the project will remain under the ownership of the originating institution or organisation. Data can be taken down or removed at any time upon request from the owner. At the completion of the project, all data will be returned to the owner or deleted by the project.

The project will ensure protection of data and confidentiality to persons and organisations through appropriate measures (such as anonymisation of records linked to individuals) in accordance with the Data Protection Act.

Commercial Confidentiality

In order to protect any commercially confidential data or information, the project will seek to use other sources of openly available data, or ensure that this information and/or data is not made publicly available.

Access to the service will be via UK Access Management Federation, ensuring confidential data cannot be accessed outside of an institution.

Licensing and standards

The project will ensure its reports and technical specifications are licensed under an appropriate open license (such as Creative Commons) to encourage standardisation and reuse. All reports will be made available via the project website.

LAMP will look to adopt and implement existing technical standards and make use of structured data principles to facilitate interoperability with other systems.

Ethics

Wherever the project handles data that pertains to the analysis of learning and research, it will act in accordance with ethical principles that treat the wellbeing and interests of the individual as paramount and as the basis for the good reputation of the sector and its institutions. [Link to Legal and Ethical framework for the project].

Development

The ongoing development of LAMP will be driven by engagement with the UK library community.

The prototype service will look to add value to existing institutional systems and services through the possibilities of data aggregation and benchmarking. It will not look to duplicate the functionality of existing systems or services where the market is functioning and healthy.

Participation

LAMP has convened a Community Advisory and Planning Group to ensure the project meets the requirements, values and aspirations of the UK academic library community. The project also has a website and blog which will regularly be updated with new information.

The project is also directly working with six institutions who are supplying data to the project. A full list of the participating institutions can be found here.

As LAMP progresses and prototypes are developed, the project will plan ways of gathering input and feedback from the wider library community, including international libraries and commercial suppliers.

The prototype service will be available to the UK academic community upon its release in December 2013.

Unique Identifiers Which Don’t Identify Anyone!

  1. Why might we need Unique Identifiers?
  2. How might a Unique ID Be Generated?
  3. But we don’t want our data to be identified!
  4. Appendix: Some Technical Ideas For Generating Anonymous Unique Identifiers

Why might we need Unique Identifiers?

So far early data is coming through on the LAMP project, and partners have been anonymising student data before they send it to us. This anonymisation process removes their internal unique identifiers (student number, say, or registration number) as well as any other identifying data. A sample of the data that we are seeing might be something like the following (which I have just made up on the spot, any similarity to persons living or deceased is unintentional):

Age Starting Course | UCAS points | Subject | Final Grade | Number of library loans
18 | 300 | Physics and Astronomy | Upper Second | 3
19 | 250 | Cake Decorating and Icing Manufacture | First | 153
43 | 400 | Medieval History | Lower Second | 96

This is great for us to get started with, and there’s a lot we can do statistically with this data. However, different partners are keeping (and submitting to this project) different levels of data, and so it’s conceivable that we might later get updates to this information (in subsequent academic years, say), or even other tables of information from partners. Maybe something like this detailing use of electronic and on-site facilities:

Number of PC logins | e-resource logins | Library barrier entries
1307 | 4 | 260
51 | 0 | 2741
254 | 98 | 5

Now, let’s suppose row 2 in our second table is actually the same person as row 1 in our first. If we know this, there are perhaps more meaningful questions which we can ask of our statistics routines. The problem, however, is how we can possibly find this out. When we receive the tables, we could give every row a number at our end, but this in itself doesn’t help us cross-reference between tables. The only way for us to know for sure that rows refer to the same people, is if the partners somehow supply us with this information. Of course, if partners are certain that they will never send updates, they can always merge tables like these themselves and just send us one single table. Otherwise…

How might a Unique ID Be Generated?

The most straightforward way to cross-reference between tables and updates would be if the partners added a unique number, or identifier, to records for every person they send us data about. Most partners will already have a unique ID in use internally, like the student number or registration number. However, sending us this information directly would destroy the anonymity of the data and would raise concerns about data protection.

The best solution to this problem would be for partners to generate a different unique ID, which cannot be back-traced to their internal ID, but instead is only of use for cross-referencing purposes. Some example ideas as to how an anonymous unique ID could be generated are appended to this post.

The lookup table (or routine) between the original UID and the new anonymous UID would be something private to the partners, maintained by them, that we would never get to see.

But we don’t want our data to be identified!

The question then becomes whether or not this new ID is, in itself, an identifying feature which should cause partners concern, and below I will attempt to set out some thoughts as to why assigning an anonymous ID should be worry-free.

Fundamentally speaking, if the anonymisation process which generates the new ID is good, then this extra piece of information will tell us absolutely nothing. It will purely be a means for us to join up different tables at our end — to illustrate, from the examples above, we would end up with something like:

Anonymous UID | Age Starting Course | UCAS points | Subject | Final Grade | Number of library loans | Number of PC logins | e-resource logins | Library barrier entries
765003 | 18 | 300 | Physics and Astronomy | Upper Second | 3 | 51 | 0 | 2741
331224 | 19 | 250 | Cake Decorating and Icing Manufacture | First | 153 | | |
903876 | 43 | 400 | Medieval History | Lower Second | 96 | | |

In other words, it’s the same as if we received a single table from the providers, with just one now-meaningless column added — those anonymous values may as well be 1,2,3 for all the information they give us.
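For the curious, the join itself is trivial once the anonymous UID is in place. Here is a minimal sketch of the kind of cross-referencing we would do at our end, assuming each partner table arrives as a CSV file with an anonymous UID column (the file names and the column name are hypothetical):

    # Sketch: joining two partner tables on a shared anonymous UID.
    # File names and column names are hypothetical examples.
    import pandas as pd

    students = pd.read_csv("partner_student_data.csv")  # age, UCAS points, subject, grade, loans + anonymous_uid
    usage = pd.read_csv("partner_usage_data.csv")       # PC logins, e-resource logins, barrier entries + anonymous_uid

    # Left join: keep every student row and attach usage data wherever the UID matches.
    merged = students.merge(usage, on="anonymous_uid", how="left")
    print(merged.head())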

The next concern, then, is that looking at data in the table above could somehow be identifying even if the anonymous IDs are meaningless. This is a completely valid concern — if there is only one person on a given course who is over forty, then looking at the data in this way would immediately make it clear that we were talking about that person.

To answer this concern, we need to be clear about what the LAMP project will actually be letting users look at. Tables of the type shown above will only be held confidentially on LAMP systems. The LAMP project will not be offering data it holds at anything like the level of granularity given above. Aside from the fact that we want to protect anonymity, there are issues surrounding statistical significance. Simply put, we will be offering users the chance to observe and extract meaningful patterns from the data held in our systems. In order for us to definitely say there is a pattern, we must be able to say with confidence that an observed trend in the data is actually happening, and not just some effect in the numbers that we might see from any random sample. The way in which we make such statements is using statistical significance, and we cannot make statistically significant statements unless we reduce the level of granularity in our data by grouping data into ‘bins’.

Here’s an example — the first row in the following table shows a theoretical full granularity of data we might receive from some imaginary partner, but the second row contains the same information after we have normalised the data into categories for use in LAMP. You can immediately see that the chances of identifying someone from the data drop significantly.

Age | Course | Degree Result | FTE | Country of Domicile
23 | Physics with Study in Europe | 67% | 0.75 | Spain
Mature | Physics | Upper Second | Part Time | EU
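To make the binning idea concrete, here is a minimal sketch of the kind of rules that could turn the first row into the second. The specific category boundaries (mature at 21 and over, the degree classification bands, the EU country list) are illustrative assumptions, not LAMP’s agreed definitions:

    # Illustrative normalisation of a full-granularity record into categories.
    # All category boundaries here are assumptions made for this example only.
    EU_COUNTRIES = {"Spain", "France", "Germany", "Ireland", "Italy"}  # truncated illustrative list

    def normalise(record):
        return {
            "Age": "Mature" if record["age"] >= 21 else "Young",
            "Course": record["course"].split(" with ")[0],  # "Physics with Study in Europe" -> "Physics"
            "Degree Result": ("First" if record["result"] >= 70 else
                              "Upper Second" if record["result"] >= 60 else
                              "Lower Second" if record["result"] >= 50 else
                              "Third"),
            "FTE": "Full Time" if record["fte"] >= 1.0 else "Part Time",
            "Country of Domicile": "EU" if record["domicile"] in EU_COUNTRIES else "Non-EU",
        }

    print(normalise({"age": 23, "course": "Physics with Study in Europe",
                     "result": 67, "fte": 0.75, "domicile": "Spain"}))
    # -> {'Age': 'Mature', 'Course': 'Physics', 'Degree Result': 'Upper Second',
    #     'FTE': 'Part Time', 'Country of Domicile': 'EU'}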

Furthermore, our user interface will only be displaying data after we have performed statistical analysis on it. Nobody will even have access to row two of the data above — instead you should expect to see results such as the following:

[Chart: number of loans plotted against final grade]

Or if you query the LAMP API, you will be able to get the same data in tabular form:

Loans | Final Grade
3 | 2
22 | 1
30 | 2
10 | 3
0 | 5
0 | 4
7 | 3
6 | 4
15 | 3
17 | 2

The user interfaces, both graphical and our API, will be limited to returning results at this level of depth.

Hopefully that covers some of the concerns surrounding generating and supplying LAMP with unique identifiers. For some more technical ideas about how to generate anonymous UIDs, I have given some examples in the appendix below to set people off. Next up, I’ll hopefully be posting about the design of our LAMP database, and how we will be applying the normalisations I talked about above so that we can achieve statistical significance.

Appendix: Some Technical Ideas For Generating Anonymous Unique Identifiers

The main issue with generating your own identifiers is that they must be unique. Beyond that, the issue of anonymity means that you might want a routine which is unguessable. Some possible ideas are below, but this is by no means an exhaustive list and anything else which follows these principles will be fine — at the end of the day, it’s better if we don’t know how the anonymous UIDs were generated!

  1. Pick some ‘main’ table and use the row number from it

    In the example tables above, the first table could be considered the ‘main’ one as it contains info about the student themselves, whereas the second one is just peripheral usage information, so is really a sub-table. If you have a situation like this, then whilst your original UIDs are still present in your data, you could use their presence in sub-tables to look up the row number from the main table, and add that to the sub-tables as the anonymous UID. Limitations of this approach are that sorting your table by original UID before you begin may result in your anonymous UIDs being guessable. On the other hand, if you don’t sort by UID, then if new data is added to the main table it may interrupt or otherwise disturb your numbering scheme.

  2. Make up your own anonymiser routine

    As long as you choose a process that will return a unique number for every input, then you can transform your old UIDs yourself and tell nobody how you did it. For example, you could multiply all the old IDs by 3 and add 4. As I mentioned before, too simple a routine may be guessable, but more complicated ones risk not returning a unique number — for example, subtracting a student’s age from their student ID would result in a clash between student 331, aged 19 and student 359, aged 47…

  3. Use a hash function

    This is the most technological approach — a cryptographic hash function returns a code that is, for all practical purposes, unique to every input. It is so computationally difficult to get the original data back from the hashed code as to make it effectively impossible, which is why hash functions are used in digital certificates and for securing passwords. The returned codes don’t follow any numerical sequence, so IDs 1, 2, 3, 4 will not come out with hash codes in the same order. At the moment, SHA-2 is probably a reasonable family of hash functions to use. You can install and use the openssl library (it may well already be installed on your system) — for example, on any UNIX-based system, echo id | openssl sha256 will return the hash code for id.
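    To show what this looks like away from the command line, here is a minimal Python sketch of the same idea using the standard-library hashlib module. The secret salt is an extra assumption of mine — it isn’t required, but prepending a value that only the partner knows makes it impractical for anyone else to reproduce the mapping by simply hashing every plausible student number:

        # Sketch: anonymous UIDs via SHA-256 (hashlib is in the Python standard library).
        # The salt is a partner-held secret and an assumption of this sketch, not a LAMP requirement.
        import hashlib

        SECRET_SALT = "keep-this-value-private-at-your-institution"  # hypothetical value

        def anonymous_uid(internal_id: str) -> str:
            return hashlib.sha256((SECRET_SALT + internal_id).encode("utf-8")).hexdigest()

        # The same internal ID always yields the same code, so tables can be
        # cross-referenced, but the code itself reveals nothing about the student.
        print(anonymous_uid("331"))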

Planning the LAMP Architecture

As data starts to come in from various LAMP partners, we’re considering the architectural design of our application. Some ideas came out of the first LAMP CAP meeting as well which might have a bearing on how we put our system together.

For each partner, we will receive a series of datasets covering various aspects of library analytic data (usage, degree attainment, journal subscriptions, etc.), which we’ll be importing into a database. One of the goals of LAMP is to be able to perform statistical analysis of this data, and so at some point there will need to be a component in our application which is capable of statistical calculation.

Our project will deliver a user interface front-end for the purposes of viewing the analysed data and customising which analysis to apply. Mention has also been made, however, of an API being delivered so that LAMP users can, if needs be, get the results of statistical analysis for use in their own applications without using our user interface.

As soon as we’re considering provision of an API, a common best-practice principle is that we should eat our own dog food — namely, that our own LAMP application should be built on top of our API. This approach helps keep our API usable and relevant, as it means that any functionality we’ll need for our front end is available to everyone else through our API as well.

One thing which was mentioned at the CAP meeting, however, was the idea that we could combine our data with that from other APIs out there which offer different analytics data, such as JUSP and IRIS. This would increase the number of sources from which we could pull data and perform statistical analysis.

Before other APIs were mentioned, my initial feeling was that a statistical application layer such as R, integrated with the database, might be the most efficient way to offer up analyses from the LAMP data. A reasonable structure for this arrangement might be:

    digraph architecture {
        node [shape=trapezium];
        "LAMP statistics API interface";
        "UI Interface";
        node [shape=rectangle];
        "LAMP database" -> "LAMP data statistics layer" -> "LAMP statistics API" -> "LAMP statistics API interface";
        "LAMP statistics API" -> "UI layer" -> "UI Interface";
    }

(system components are rectangles, interfaces to the system are trapeziums)

However, the additional requirement to bring in other APIs pulls such a structure into question. Depending on what content is in other APIs and whether we can successfully cross-reference and/or make use of it, there may be a corresponding requirement to carry out statistical analysis across results from multiple APIs. If this is the case, then either the statistics processing layer needs to move (so that processing occurs after the other data has been pulled in), or a second processing layer will be necessary. Questions also arise as to whether or not a LAMP API should serve aggregated statistical results after data has been consumed from other APIs, or whether we only want to offer results from our own database via a LAMP API, leaving analysis across multiple APIs for consumers to implement according to their own individual requirements.

The following diagram attempts to sum up these options, with component options represented with dashed boundaries. Statistical processing options are in purple, and API options in blue:

    digraph architecture {
        node [shape=trapezium];
        "UI Interface";

        node [style=dashed, penwidth=2, color=blue];
        "LAMP statistics API interface";
        "aggregated statistics API interface";

        node [shape=rectangle];
        "LAMP statistics API" -> "LAMP statistics API interface";
        "aggregated statistics API" -> "aggregated statistics API interface";

        node [color=purple];
        "LAMP data statistics layer" -> "LAMP statistics API";
        "aggregated statistics layer" -> "aggregated statistics API";

        node [style=solid, penwidth=1, color=black];
        "LAMP database" -> "LAMP data statistics layer";
        "LAMP statistics API" -> "aggregator";
        "external APIs" -> "aggregator" -> "aggregated statistics layer";
        "aggregated statistics API" -> "UI layer" -> "UI Interface";
    }

From comments within the LAMP team, we’re leaning towards implementing a processing layer and API on our own LAMP data for now, and only including data from other APIs in the UI for comparison. Further statistical analysis of the type described above could then be an option for a future release phase of LAMP. Whilst we consider these options further, I’ll be working on the LAMP database structure and trying to import some of the early data from our partners, which I’ll cover in a future post!

Community Advisory and Planning Group – Meeting Notes

We recently had the first LAMP community advisory and planning (CAP) group meeting.

The meeting was roughly divided into two parts. The first covered many of the agenda items you’d expect to see at a project group meeting, such as discussion and agreement of the group’s terms of reference, a review of workpackages and so on.

The second part was focussed on presenting the group with the initial use-cases and design sketches that the project team had developed. The idea was to present this initial work as a way to stimulate ideas and new use-cases as well as sense-check the focus of the project so far.

[Image: CAP meeting notes]

The discussions were also framed by a few assumptions: that the designs are simply sketches, not wireframes; that the types of data implied by the use-cases would be available and usable; and that all feedback was welcome.

The discussions were rich and varied, and provided plenty for the project to take away and think about. Listed below are some of the discussion themes and issues raised:

Flexible Configuration

There was a significant amount of discussion around the need for flexibility when it comes to the data and the ‘views’ on that data.

Data might be fed in from various and disparate sources, including UCAS, NSS, KIS and even employment data once students have left the university. There was a sense that LAMP could serve a number of interesting new use-cases.

In addition to this variety of external sources, there was also a feeling that the service should enable local configuration. This would allow local, closed data sets to be fed into the institutional view.

This ability to tailor the data is also reflected in the need for tailored views on that data. The audiences and use-cases for the data are multiple, so it makes sense to provide flexible views and outputs. These might range from reports that help directors make business cases and strategic decisions, to daily service reports or occasional flags (like a fuel gauge on a car dashboard).

Audiences

Connected to the flexibility of the system configuration is the consideration of who will be the primary audience(s) for the service.

It quickly became clear that an analytics service like LAMP would potentially have multiple audiences, from librarians to library directors to external users. The LAMP data may well be surfaced in existing systems across the campus, with entirely new users interacting with its data.

Ultimately, the project needs to understand who this decision-making tool is for. This audience may expand and morph as the project develops, but it needs to ensure it doesn’t fail its primary audience(s) by trying to serve the needs of everyone.

Intuitive layers

A few times the issue of intuitive interfaces and visualisations bubbled up during various conversations. Two particular issues emerged:

  1. An intuitive view over the data – so you may not be interacting directly with the data itself but with visualisations of that data. This does, however, raise interesting questions about how the data and the UI relate.

  2. The possibility that the visualisations and the tools used to interact with the data should already be doing some of the interpretive work (or at least displaying it in such a way as to make analysis and interpretation easier).

This is potentially a very rich area, and one where a clear understanding of what users might want to do will be critical.

Preserving context

An interesting point was raised about the importance of ensuring context is captured and preserved within the service. While this is a relatively simple piece of functionality (you can imagine a notes field or similar), the implications are interesting.

Capturing the context for certain data points would ensure that any future analysis is able to take into account extraordinary events or circumstances that would affect the interpretation of the data. Such events might be a fire or a building closure, which in the short term would be accounted for in any analysis, but over the long term might be forgotten.

Benchmarking

In discussing the use-cases the potential for the service to support benchmarking was something that interested the group. In particular, benchmarking that extended beyond the boundaries of the UK to international institutions.

Increasingly, UK academic institutions are judging their performance in an international or global context, not just a national one.

There was also an interesting discussion about the possibility of ‘internal benchmarking’: comparing the performance of departments and subjects within the local institution.

What’s next…

This first meeting of the community group was very rich, and resulted in a lot of ideas and potential ways forward. So, given the limits of resource and time, here are a few of the next steps the project will take:

  • Continue to develop the prototype as a way to get solid feedback on the potential use-cases and functionality. The community meeting made it clear that it would be useful to have something people can actually interact with, so that we can test our assumptions and refine the kinds of functionality users require. Data is the key component in this next step – both the existing data sets we might be able to use and the institutional data that will help drive some of the impact-type use-cases.

  • Sketch out a development roadmap. This is a way both to manage expectations (i.e., we’re not going to be able to deliver everything by December) and to prioritise the design as we progress.

  • User testing – make sure we are able to call upon small samples of potential users to test and refine the prototypes in between the CAP group meetings. These will likely be small, guerrilla in nature, and aimed at ensuring a very iterative approach to the development.

 Our next meeting is planned for July, so we’ll have plenty to do between now and then!

The full minutes from the meeting can be found here: LAMP CAP meeting minutes 16 April 2013.

Library Analytics – Community Survey Results

The team is currently prepping for our first Community Advisory Board (CAB) meeting for the Jisc LAMP project. There’s a great deal to discuss, not least the use case ideas we have been drafting for feedback. Ben Showers and I met last week to talk about setting the context for the meeting, and we agreed that it would be useful to share more broadly the findings of the survey we ran back in November 2012. With the support and input of RLUK and SCONUL, Mimas worked with Jisc to run a community-wide survey. We wanted to gauge the potential demand for data analytics services that could enhance business intelligence at the institutional level and so support strategic decision-making within libraries and more broadly. Below is a summary of the results, available through Slideshare.

Library Analytics – Community Survey Results (Nov 2012) from joypalmer

We wanted to get a better handle on how important analytics will be to academic libraries now and in the future, and what demand might be for a service in this area – for example, a shared service that centrally ingests and processes raw usage data and provides data visualisations back to local institutions (and this, of course, is what LAMP is exploring further in more practical detail). We had responses from 66 UK HE institutions, and asked a good number of questions. For example, we asked whether the following functions might be potentially useful:
  • Automated provision of analytics demonstrating the relationship between student attainment and resource/library usage within institutions
  • Automated provision of analytics demonstrating e-resource and collections (e.g. monographs) usage according to demographics (e.g. discipline, year, age, nationality, grade)
  • Resource recommendation functions for discovery services

Perhaps not surprisingly, the overwhelming response was positive – these tools would be valuable, yes (over 90% ‘yes’ rate each time). But we also asked respondents to indicate which strategic drivers were informing their responses, i.e. supporting research excellence, enhancing the student experience, collection management, creating business efficiencies, demonstrating value for money, and others. What we found (based on our sample) was that the dominant driver was ‘enhancing the student experience,’ closely followed by the ability to demonstrate value for money, and then to support research excellence.

We also asked whether institutions would find the ability to compare and benchmark against other institutions of value. Whilst there was general consensus that this would be useful, respondents also indicated a strong willingness (91%) to share data for other institutions to benchmark against, provided it were anonymised and made available by a category such as Jisc Band (this compared to a 47% ‘yes’ rate when asked whether they would, in principle, be willing to make this data available where users could see the source institution’s name). So, there appears to be a strong willingness to share business intelligence data with the wider community, so long as this is done in a carefully managed way that does not potentially expose too much about individual institutions. In addition, there was far more hesitation over sharing UCAS and student data than other forms of transactional data (again, not surprising).

Are analytics a current strategic priority for institutions? Only nine respondents said analytics were a top priority at the present moment, with 39 stating that they were important but not essential. However, when asked whether analytics would become a strategic priority in the next five years, 40 respondents indicated they would become a ‘top priority.’

However, the question of where the decision-making in this area would reside evoked a wide range of responses, indicating the organisational complexities we’d be dealing with here. Clearly the situation at each institution is complex and highly variable. Overall, Library Directors and IT Directors are seen as the key decision-makers, but respondents also referenced Vice Chancellors, Registrars and Deputy Vice Chancellors. At some institutions the University Planning Office would need to be involved; at others, the Director of Finance.

Other potential barriers to sharing include concerns over data privacy and sharing business intelligence, and our results revealed a mixed picture in terms of concerns over data quality, lack of technical expertise, and the fact that there are strong competing demands at the institutional level.

The LAMP project is now working to build on these findings and develop live prototypes to fully test out these use cases, working with data from several volunteer institutions.  Our major challenge will be to ascertain to what extent the data available can help us support these functions, and that’s very much what the next six months is going to be focused on.

 

  • Sketch of suggested main dashboard page
  • Sketch of sub-dashboard on collection management
  • Sketch of sub-dashboard on student use

We had a brainstorming meeting at Mimas in late March, with Ellen, Graham, Joy, Lee and Bethan, to look at potential use cases and to sketch out what we thought the dashboard might look like. This will be taken to the Community Advisory and Planning Group meeting in April for feedback.

Mimas’s designer Ben will be working to prettify our rough sketches, but here are our first ideas of what we think we might include in the dashboard, and how it might look. Click on each of the images to go to a larger version. Thoughts? Questions? Comments? We’d love to hear what you think about the dashboards or the use cases.

Lighting-up time

It’s becoming increasingly important for libraries and institutions to capitalise on the data they’re collecting as part of their day-to-day activities. The Jisc Library Analytics and Metrics Project (JiscLAMP) aims to help libraries bring together this data into a prototype shared library analytics service for UK academic libraries.

We want to help libraries bring together disparate data sets in an attractive and informative dashboard, allowing them to make connections, and use these connections and insights to inform strategy and highlight the role the library is playing in the success of the institution. For more details of what we hope the project will achieve, see Ben Showers’s introductory blog post.

LAMP is a partnership project between Jisc, The University of Huddersfield, and Mimas at the University of Manchester. It is funded under the Jisc Digital Infrastructure: information and library infrastructure programme.