In my experience, building software is difficult, really difficult.
Having been a vocational developer for the last 10 years, I can't think of any two projects that were the same. Similar, yes, but never the same. As a result, the code needed to deliver each project was very different. For a lot of those projects, early architecture decisions were made without a second thought: that was just how we did it. They were not bad decisions. Mostly they were the right decisions, but some were terrible. I'm not just talking about choices on which framework or CMS to use, but rather decisions which affected the longevity and maintainability of a project. Decisions such as scale, and ease of implementation vs cost of ownership. Easy vs simple.
You see, if there is anything I've learnt, it's that simple is not easy. Our goal should always be to implement something simple, even if it's difficult.
When we talk about 'simple vs complex' and 'easy vs difficult', what we're really talking about is the definitions and origins of the words:
Simplicity: "The state of being simple, uncomplicated, or uncompounded"
Simplicity (origin): "singleness of nature, unity, indivisibility; immutability"
Complexity: "composite nature, quality or state of being composed of interconnected parts"
Easy: "causing or involving little difficulty or discomfort"
Difficult: "Hard to do, make, or carry out"
From the definitions of simplicity and complexity, we can see that they have absolutely nothing to do with how easy something is to make! Their origins lie in how something is connected or 'coupled' to the things around it. That coupling determines how flexible things are in the long term.
I won't go into too much detail here, but for software development this is a big problem. As we build, we tend to code things in an "easy" way, often tying components, features and code together. Then, as time progresses and we need to change or update them, it's tough, expensive and often not even possible.
It's easy to create a feature, but when that feature needs to do something it was never designed to do, you're in trouble.
Imagine building a house. You have the specifications all laid out: beautiful designs for your modern luxury home.
The builder comes along and starts laying the foundations for your home. They then start building the outside walls, but instead of using bricks to build the walls, they use concrete. They pour concrete moulds to shape the house. Not only that, they use concrete for all the internal walls and embed the pipework and electrics in there too. They even make your bath out of concrete. 'It's fine,' they say, 'it meets the specifications, plus you get your house in 1/10th of the time.' So you move in.
12 months later, after much user testing, you realise the house is just not quite right: you need some new features and some adjustments. The wall to the dining room just needs to move about 1 metre out, and a doorway needs shifting to the left by a few centimetres. In fact, you want to go in a whole new direction and knock out several walls to make more room and add more sockets for convenience. Except you can't, not easily. The concrete shell, inner walls and fixtures are all tied together, and moving one would mean having to knock out and rearchitect the entire floor.
Suddenly the simple tweaks become expensive, labour-heavy tasks needing many different skillsets, and they could cause stability issues, also known as bugs. Sound familiar? That's because this is how we often build software.
But it doesn't have to be this way. Instead of building easy, we can build simple. We can engineer software to be decoupled, documented and maintainable. We can implement design patterns which encourage these principles.
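To make 'decoupled' a little more concrete, here is a minimal Python sketch. It is only an illustration: the names MetadataSource, InMemorySource and JournalPage are made up for this example and are not taken from any real project. The point is that the page depends on a small interface rather than on one specific system, so the source behind it can be swapped without touching the page.

```python
from typing import Protocol


class MetadataSource(Protocol):
    """Anything that can return metadata for a journal (hypothetical interface)."""

    def fetch(self, journal_id: str) -> dict: ...


class InMemorySource:
    """Stand-in data source; a real one might call a separate system."""

    def __init__(self, records: dict) -> None:
        self._records = records

    def fetch(self, journal_id: str) -> dict:
        return self._records[journal_id]


class JournalPage:
    """Depends only on the MetadataSource interface, not on a concrete system."""

    def __init__(self, source: MetadataSource) -> None:
        self._source = source

    def title_line(self, journal_id: str) -> str:
        record = self._source.fetch(journal_id)
        return f"{record['title']} ({journal_id})"


page = JournalPage(InMemorySource({"abc": {"title": "Example Journal"}}))
print(page.title_line("abc"))  # prints: Example Journal (abc)
```

Moving the 'wall' here means writing a new class with a fetch method; JournalPage stays exactly as it is.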
It's easy to make something complex; it's hard to make something simple. But it's worth it.
WHAT DID WE EXPERIMENT WITH?
With this in mind, we were approached by the SAGE Journals' team to help with displaying journal metadata, such as editorial boards and submission guidelines, on the site. The metadata is hosted in separate systems and would cost a lot of development time and money to import into the Journals' platform. We wanted to see if we could build a separate API 'service' which the platform could use. The platform would look up the data and display it on the fly without needing to import anything.
WHAT DID WE DO?
We knew our API was not going to be the 'source of truth' but rather just an endpoint to the data source. We also knew it needed to be robust - the Journals' platform gets tens of millions of hits per month. This meant that we would need to index the data into a separate repository that was fast and scalable. We also wanted the data to be as easy to access as possible.
Based on this, we knew that a traditional database approach was not going to be ideal, so we started to look at NoSQL technologies.
NoSQL is famed for being fast, as its 'flat' data structures mean it does not need to spend computing power on database joins. However, what you gain in speed you trade off in normalisation, and challenges can occur when you need to update 'related' data, e.g. the last name of an author for a series of books.
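The author example can be sketched in a few lines of Python. The documents and field names below are invented purely for illustration. Because the author's last name is copied into every book document, there is no join at read time, but a rename has to touch every copy.

```python
# Hypothetical flat documents: the author's name is copied into every book.
books = [
    {"id": 1, "title": "Book One", "author_id": "a1", "author_last_name": "Smith"},
    {"id": 2, "title": "Book Two", "author_id": "a1", "author_last_name": "Smith"},
    {"id": 3, "title": "Other", "author_id": "a2", "author_last_name": "Jones"},
]


def rename_author(docs: list, author_id: str, new_last_name: str) -> int:
    """Update every document that embeds the author; return how many changed."""
    changed = 0
    for doc in docs:
        if doc["author_id"] == author_id:
            doc["author_last_name"] = new_last_name
            changed += 1
    return changed


updated = rename_author(books, "a1", "Smythe")
print(updated)  # prints: 2
```

In a normalised database the same rename would be a single row update; here the cost grows with the number of documents that embed the value.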
We also wanted to explore using more modern API techniques such as GraphQL (as opposed to REST). GraphQL gives more control to applications using the API to decide which data they want.
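As a rough illustration of what 'deciding which data they want' means, the Python sketch below mimics the idea of a GraphQL selection set with plain dictionaries. It is not a GraphQL implementation, and the record and field names (title, editorialBoard and so on) are assumptions made up for the example.

```python
def select_fields(record: dict, selection: dict) -> dict:
    """Return only the requested fields, recursing into nested selections.

    `selection` maps a field name to True (take the whole value) or to a
    nested selection dict. Fields missing from the record come back as None.
    """
    result = {}
    for field, sub in selection.items():
        value = record.get(field)
        if isinstance(sub, dict) and isinstance(value, dict):
            result[field] = select_fields(value, sub)
        elif isinstance(sub, dict) and isinstance(value, list):
            result[field] = [select_fields(item, sub) for item in value]
        else:
            result[field] = value
    return result


# Hypothetical journal record with more data than any one page needs.
journal = {
    "title": "Example Journal",
    "issn": "0000-0000",
    "submissionGuidelines": "long text",
    "editorialBoard": [
        {"name": "A. Editor", "role": "Editor-in-Chief", "email": "a@example.org"},
    ],
}

# The caller, not the API, decides the shape of the response.
wanted = {"title": True, "editorialBoard": {"name": True, "role": True}}
print(select_fields(journal, wanted))
```

With a typical REST endpoint the caller would receive the whole record, or need a separate endpoint per shape; here one endpoint can serve many shapes.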
HOW DID WE DO IT?
We looked at several NoSQL/flat solutions, including MongoDB and Elasticsearch. We decided to go with Elasticsearch due to its design with distributed sharding and replicas, which meant we could quickly scale the service if needed. This, coupled with the ability to do complex searches on the data, made it a natural choice.
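For readers new to sharding and replicas, here is a small Python sketch of the index settings involved. number_of_shards and number_of_replicas are real Elasticsearch index settings, but the values and helper names below are illustrative assumptions, not the configuration we actually ran.

```python
import json


def index_settings(primary_shards: int, replicas: int) -> dict:
    """Build a settings body for creating an index (values are illustrative)."""
    if primary_shards < 1 or replicas < 0:
        raise ValueError("need at least one primary shard and zero or more replicas")
    return {
        "settings": {
            "number_of_shards": primary_shards,
            "number_of_replicas": replicas,
        }
    }


def total_shard_copies(primary_shards: int, replicas: int) -> int:
    """Each primary shard has `replicas` additional copies."""
    return primary_shards * (1 + replicas)


body = index_settings(primary_shards=3, replicas=2)
print(json.dumps(body))
print(total_shard_copies(3, 2))  # prints: 9
```

The replica count can be raised on a live index to spread read load across more nodes, which is the kind of quick scaling that appealed to us.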
We also talked to a developer within SAGE about using a MarkLogic database, which could also have been a contender. MarkLogic has many of the same features as Elasticsearch and is an XML database, which suits SAGE as most of its data is in XML.
However, Elasticsearch is more widely used and is a technology which is gaining more and more adoption. We felt that the expertise to scale Elasticsearch would be more accessible than MarkLogic, which would probably require a substantial support agreement.
We indexed the data into Elasticsearch using a series of serverless functions which pull in data from several sources.
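As a sketch of what one such indexing function might look like, the Python below turns a batch of records into the newline-delimited JSON format that the Elasticsearch _bulk API expects. The handler shape, the event fields and the 'journals' index name are assumptions for illustration, and the HTTP call itself is left out so the example stays self-contained.

```python
import json


def to_bulk_payload(index_name: str, records: list) -> str:
    """Serialise records as newline-delimited JSON for the _bulk API.

    Each record becomes two lines: an action line, then the document itself.
    The payload must end with a newline.
    """
    if not records:
        return ""
    lines = []
    for record in records:
        action = {"index": {"_index": index_name, "_id": record["id"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(record))
    return "\n".join(lines) + "\n"


def handler(event: dict) -> dict:
    """Hypothetical serverless entry point; the event carries fetched records."""
    payload = to_bulk_payload(event["index"], event["records"])
    # A real function would POST `payload` to the cluster's _bulk endpoint here.
    return {"lines": payload.count("\n"), "payload": payload}


result = handler(
    {"index": "journals", "records": [{"id": "j1", "title": "Example Journal"}]}
)
print(result["lines"])  # prints: 2
```

Keeping the payload building separate from the network call makes each function easy to test without a running cluster.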
We also implemented a GraphQL endpoint which sits in front of Elasticsearch and can be called by the Journals' platform.
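One way to picture an endpoint sitting in front of Elasticsearch is a resolver that translates 'these fields, for this journal' into a search body. The Python sketch below is an assumption-heavy illustration rather than our actual code: the issn field, the resolver name and the fake_search stand-in are all hypothetical. The term query and _source filtering are standard parts of the Elasticsearch search API, and the term query assumes issn is indexed as an exact-match (keyword) field.

```python
def build_search_body(issn: str, requested_fields: list) -> dict:
    """Build a search body: exact match on issn, returning only wanted fields."""
    return {
        "query": {"term": {"issn": issn}},
        "_source": sorted(set(requested_fields)),
        "size": 1,
    }


def resolve_journal(issn: str, requested_fields: list, search):
    """Hypothetical resolver; `search` is any callable that runs the body."""
    hits = search(build_search_body(issn, requested_fields))
    return hits[0] if hits else None


def fake_search(body: dict) -> list:
    """In-memory stand-in for the search call, for demonstration only."""
    store = [
        {"issn": "0000-0000", "title": "Example Journal", "editorialBoard": ["A. Editor"]},
    ]
    wanted = body["query"]["term"]["issn"]
    return [
        {key: doc[key] for key in body["_source"] if key in doc}
        for doc in store
        if doc["issn"] == wanted
    ]


print(resolve_journal("0000-0000", ["title"], fake_search))
print(resolve_journal("1111-1111", ["title"], fake_search))
```

Passing the search function in as a parameter keeps the resolver decoupled from the client library, in the spirit of the first half of this post.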
MORE IDEAS AND WHAT DID WE LEARN?
XML to JSON conversion is hard to do right and requires strict schemas to do properly, mainly because dealing with missing data is difficult and assumptions can't be made.
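To show why a strict schema helps, here is a minimal Python sketch using the standard library's ElementTree. The schema, element names and sample document are invented for illustration and do not reflect SAGE's real XML. Every field states up front whether it is required and whether it repeats, so a missing element becomes either an explicit null, an empty list or a loud error, never a guess.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical schema: field name -> (element path, required?, repeated?)
SCHEMA = {
    "title": ("title", True, False),
    "issn": ("issn", False, False),
    "editors": ("board/editor", False, True),
}


def xml_to_record(xml_text: str) -> dict:
    """Convert one XML document into a dict with a predictable shape."""
    root = ET.fromstring(xml_text)
    record = {}
    for field, (path, required, repeated) in SCHEMA.items():
        values = [
            el.text.strip()
            for el in root.findall(path)
            if el.text and el.text.strip()
        ]
        if required and not values:
            raise ValueError(f"missing required field: {field}")
        if repeated:
            record[field] = values  # always a list, possibly empty
        else:
            record[field] = values[0] if values else None  # explicit null
    return record


doc = (
    "<journal><title>Example Journal</title>"
    "<board><editor>A. Editor</editor></board></journal>"
)
print(json.dumps(xml_to_record(doc)))
```

Without the schema, a journal with one editor might convert to a string and a journal with two to a list, and every consumer of the JSON would have to handle both.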
Although Elasticsearch is an open-source technology, it does require expertise to get the most out of it. I'd recommend doing some training on it. There is a lot of good content and courses out there.
Andy Hails - Software Engineer
With special thanks to:
The SAGE Journals' Team - Dan Huke, Helen Duce and Dan Hurley
John Cooper for his input on MarkLogic and XSLT
Tony Davies for navigating support agreements
Blog post written by Andy Hails