WHAT DID WE EXPERIMENT WITH?
Ughh, it’s Friday afternoon, and you’re knee-deep in spreadsheets, changing and extracting arbitrary values so that you can pass some data on to someone else who, let’s face it, might be doing the same thing. Only a few more hours and you can be free – except you can’t, not until this is done, because your line manager ‘really needs’ them for first thing Monday morning. The worst part is that it’s the same job as last week and the week before that… and the week before that. Furthermore, these spreadsheets seem to be getting bigger and are being requested more frequently. Let’s face it, the reason you’ve left it until Friday is that you can’t bear doing it again. You dream of your next job: maybe someone else will have to do this sort of work, and you can concentrate on what you’re good at.
Sound familiar? This is more common than you think.
The problem with this sort of work, other than morale, is that it just doesn’t scale and it costs more than you think. When these tasks get bigger, the person’s ability to churn through them stays the same. In the short to medium term, this results in longer lead times, more resource being needed and, potentially, the quality of other work being affected. In the long run, this could mean missed project deadlines and an inability to take on new and more exciting tasks.
WHAT DID WE DO?
The SAGE Research Methods Editorial (SRM) team approached us to help with this problem. They have a goal of commissioning 500-800 case studies a year for publication on the SRM platform, which means sending out approximately 50,000-80,000 emails. We sat down with them to see if we could help automate parts of their process and alleviate some pain points.
HOW DID WE DO IT?
Step 1: Understanding workflow
Our first job was to really understand their workflow by spending a morning with the team talking through and listing out what they did, why they did it and how they did it. Many Post-its (Other brands are available), coffee and questions later we had mapped out the process the team goes through for generating leads.
The process turned out to be very manual, with the team downloading a tonne of article data and then sifting through it to try to find relevant articles. Once they have identified the relevant articles, the team aims to send a personalised email to the author of each one, which can involve a lot of hunting around the data looking for first names.
This ‘understanding’ step is the most important one, as we wanted to build something that actually helped, not something we assumed would help and that then never got used.
Step 2: Identify IPOs (Ideas, Pain Points and Opportunities)
During the workshop, we talked a lot about high-effort, time-consuming tasks and how we could potentially help. However, it wasn’t until we went away that we could really examine and propose areas where we could be of benefit. We identified several areas where we thought we could help:
Auto download articles from API
Build a Machine Learning model to filter relevant articles
Author Name finding (data cleaning)
Remove authors who have already been emailed
Knowing we only had a few weeks to work on this, we didn’t want to take on too much and risk not delivering or helping at all so we asked the team which of the above would provide the most relief. They decided on filtering articles and name finding.
Step 3: Build
With this in mind, we went about prototyping. Using MS Flow and MS Teams, we created a workflow where the SRM team could easily drop the article data they download into a folder, where it would then be automatically processed.
Using Flow, we could pass the data into a serverless function, which would manipulate the data as we needed and then output it again.
It turned out that, for most authors, their name was somewhere in the article data; it was just a case of hunting it down by parsing lots of text and running some interesting regular expressions.
With this approach, we managed to cut about 4-6 man-hours per week down to approx. 10-30 seconds.
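To give a flavour of what this kind of name hunting can look like, here is a minimal sketch in Python. It is not the code we ran: the two patterns, the `find_first_name` function and the sample text formats are all hypothetical, and real article data would need more (and messier) expressions.

```python
import re

# Hypothetical pattern: "Corresponding author: Dr. Jane Smith" style text.
# An optional title is skipped so that the capture group holds the first name.
CORRESPONDING_RE = re.compile(
    r"Corresponding author:\s*(?:(?:Dr|Prof|Professor|Mr|Mrs|Ms)\.?\s+)?([A-Z][a-z'-]+)"
)

# Hypothetical fallback: an email address shaped like first.last@example.org
EMAIL_RE = re.compile(r"\b([A-Za-z]+)[._][A-Za-z-]+@[\w.-]+\.\w+")


def find_first_name(text):
    """Return a best-guess first name found in free text, or None."""
    match = CORRESPONDING_RE.search(text)
    if match:
        return match.group(1)
    # Fall back to the local part of an email address, if one is present.
    match = EMAIL_RE.search(text)
    if match:
        return match.group(1).capitalize()
    return None


print(find_first_name("Corresponding author: Dr. Jane Smith, Department of Sociology"))
print(find_first_name("Contact: john.doe@example.org"))
```

Anything that returns `None` can be left for a person to look at, so the automated pass only has to handle the easy majority.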
Article filtering turned out to be a bit trickier. Ian Mulvany went about building a machine learning model, which we trained using the abstracts from previously successful commissioned articles. We applied supervised machine learning techniques and ran the abstracts through spaCy’s entity recognition models to help identify scientific phrases and words. From this, we could then try to infer whether an article fell within the remit that the team wanted.
From the small amount of data that the team had started to collect over previous weeks, the model was giving about a 60% probability that an article fell within the remit. This was quite positive but not yet very useful, so we advised the team to keep their data so that we can continue training the model. With more data, we expect the probability to increase until the model becomes a usable product. From there, we can tweak the algorithm further.
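For readers who have not seen supervised text classification before, the sketch below shows the general idea of learning from labelled abstracts and then scoring a new one. It is deliberately not the model described above: it swaps the spaCy-based approach for a tiny bag-of-words naive Bayes classifier written with the standard library only, and the function names and the four training abstracts are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """Count words per label. `examples` is a list of (abstract, in_remit) pairs.

    Assumes at least one example of each label is present.
    """
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab


def probability_in_remit(model, text):
    """Return the naive Bayes probability that an abstract is in remit."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for label in (True, False):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for word in tokenize(text):
            if word not in vocab:
                continue  # words never seen in training carry no evidence
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        log_scores[label] = score
    # Normalise the two log scores into a probability for the "in remit" label.
    top = max(log_scores.values())
    exp_true = math.exp(log_scores[True] - top)
    exp_false = math.exp(log_scores[False] - top)
    return exp_true / (exp_true + exp_false)


# Made-up training data: True means the abstract was judged to be in remit.
examples = [
    ("We conducted semi-structured interviews and thematic analysis", True),
    ("A survey design with regression analysis of questionnaire responses", True),
    ("The enzyme binds the protein receptor in cell cultures", False),
    ("Protein folding observed in cell membranes", False),
]
model = train(examples)
print(probability_in_remit(model, "Thematic analysis of interviews"))
```

With only a handful of examples the scores are not trustworthy, which is the same reason we asked the team to keep collecting labelled data.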
MORE IDEAS AND WHAT DID WE LEARN?
The whole process was really interesting for both the Tech Lab and the SRM team. It was eye-opening – for both teams! – to see how much work actually goes into generating the leads.
It was also interesting to see how much time could be saved with a ‘first pass’ and a small amount of code. As a result, we’re aiming to revisit the project with the SRM team in a few months to help them further and hopefully implement the article and unique email filtering workflow.
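The unique email filtering has not been built yet, so the following is only a rough sketch of one way it might work, assuming each author record is a dictionary with an `email` key and that a list of previously contacted addresses is available. Both of those, along with the function name, are assumptions made for illustration.

```python
def remove_already_emailed(authors, emailed_addresses):
    """Keep only authors whose email address has not been contacted before."""
    # Normalise addresses so that case and stray spaces do not hide a match.
    seen = {address.strip().lower() for address in emailed_addresses}
    fresh = []
    for author in authors:
        key = author["email"].strip().lower()
        if key not in seen:
            fresh.append(author)
            seen.add(key)  # also drops duplicates within the new batch
    return fresh


# Hypothetical data for illustration only.
authors = [
    {"name": "Jane", "email": "Jane.Smith@example.org"},
    {"name": "John", "email": "john.doe@example.org"},
    {"name": "John", "email": "john.doe@example.org "},
]
already_emailed = ["jane.smith@example.org"]
print(remove_already_emailed(authors, already_emailed))
```

Using a set for the addresses keeps each lookup cheap, which matters when the list of previous emails runs into the tens of thousands.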
We’ll also be working with other teams within SAGE to help them with their processes.
Thanks to Kasia Figiel and the rest of the SRM Editorial team for letting us mess with their workflow!