The social and behavioural sciences have always come with their fair share of jargon. Territoriality. Kinesics. Affective events theory. Pygmalion effect. Dystopia. Cacotopia. Those last two even mean the same thing. Sometimes it seems like doing social science means learning another language.
That's why we're releasing everything that we've learned about social science terminology as the first installment in our new Terminology Service. Go to https://concepts.sagepub.com and click around to discover information on over 61,000 concepts in the social and behavioural sciences, including definitions of key terms and how they relate to other concepts. You can search for a specific concept, or click the 'Hierarchy' tab and browse through the social sciences from the top down.
WHAT DID WE EXPERIMENT WITH?
This experiment has two parts. First of all, I asked myself how we could gather information on key concepts in the social sciences and structure that in a useful way. We publish lots of material that helps people understand concepts, such as glossaries, encyclopedias and even ‘jargon busters’. How can we make that information more discoverable and useful? I decided to see whether we could take all these concepts that appear across our content and represent them in a controlled vocabulary that would act as a sort of taxonomy of the social sciences.
Secondly, once we have that controlled vocabulary, what’s the best way of making it available to people in a useful way? We have internal vocabulary management tools to help us organise and arrange concepts, but we can’t open these up to end users, and even if we could, they wouldn’t be very user-friendly.
HOW DID WE DO IT?
Once I’d identified some of the implicit structures in our content, I set about extracting them in a structured way. Encyclopedias proved extremely fruitful. These reference works are already tagged with general subject areas, and by definition each one is a list of the couple of thousand most important concepts in its subject area. Most encyclopedias also contain ‘Readers’ Guides’ that further break down the entries by thematic area, such as ‘American social theory’ in an encyclopedia on social theory. I could use these guides to add more structure and hierarchy to the vocabulary. What’s more, encyclopedia entries are helpfully written in such a way that they often define a term before discussing it. I created a simple script to extract the first two sentences of each entry and store them as a ‘definition’ in the vocabulary file. Other information that turned out to be presented in a consistent, semi-structured way included the date an event happened or a person was born, and the type of thing an entry referred to, such as whether it was a person or an organization. In information science terms, we had the makings of a thesaurus.
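For the curious, here is a minimal sketch of what a 'first two sentences' extractor might look like in Python. It is not the original script: the function name, the naive sentence-splitting rule and the sample entry text are all invented for illustration, and a rule this simple will trip over abbreviations such as 'e.g.' or 'Dr.'.

```python
import re

# Assumption: a sentence ends at ., ! or ? followed by whitespace and then
# a capital letter or an opening quote. Real entries need something smarter.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z\"'‘“])")


def first_sentences(text: str, n: int = 2) -> str:
    """Return the first n sentences of text, joined by single spaces."""
    normalised = " ".join(text.split())  # collapse line breaks and repeated spaces
    sentences = _SENTENCE_END.split(normalised)
    return " ".join(sentences[:n])


# Invented sample entry, not taken from any real encyclopedia.
entry = (
    "Territoriality is the behaviour by which an individual or group lays "
    "claim to an area. It is studied in both humans and animals. Early work "
    "focused on birds."
)
definition = first_sentences(entry)
```

The result would still need a human eye: some entries open with context rather than a definition, so two sentences is only a rule of thumb.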
It took a long time to get to a point where I was happy with all that mess of data. The initial pass of terms took in over 120,000 entries, many of which were duplicates appearing in different contexts. Simply de-duping the file brought that number down to around 70,000, but made for a Frankenstein’s monster where concepts had dozens of parent concepts and appeared in hundreds of places in the thesaurus. Several periods of manual clean-up tamed the overall structure and whittled the number of terms down to about 61,000.
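To make the de-duping problem concrete, here is a rough Python sketch. It assumes each scraped entry is just a (label, parent) pair and that two entries are duplicates when their labels match once case and surrounding spaces are ignored; the function names, the threshold and the sample data are hypothetical, not the process actually used.

```python
from collections import defaultdict


def merge_duplicates(entries):
    """Collapse entries sharing a label into one concept.

    Each entry is a (label, parent) pair. The result maps a normalised
    label to the set of every parent it appeared under.
    """
    merged = defaultdict(set)
    for label, parent in entries:
        merged[label.strip().lower()].add(parent)
    return dict(merged)


def needs_review(merged, max_parents=3):
    """List concepts with more than max_parents parents, for manual clean-up."""
    return sorted(
        label for label, parents in merged.items() if len(parents) > max_parents
    )


# Invented sample data for illustration only.
entries = [
    ("Dystopia", "Social theory"),
    ("dystopia", "Political science"),
    ("Dystopia ", "Literature"),
    ("Kinesics", "Communication"),
]
merged = merge_duplicates(entries)
flagged = needs_review(merged, max_parents=2)
```

Merging is the easy part. Every merged concept inherits all the parents of its duplicates, which is how the monster grows, so a report like `needs_review` only tells you where the manual work is waiting.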
Once we were happy with the thesaurus, we decided to release it to see what people think. There are a few off-the-shelf tools that are designed to publish vocabularies such as taxonomies and thesauri to the web. After experimenting with a couple, we found that Skosmos, an open source vocabulary browser and publishing tool, provided the best user experience. During the next few weeks, I gave myself a crash course in just enough web hosting and development to put the Skosmos repository to use: the Linux command line, web servers, dependencies and SSL certificates, and how to recover all your work after the mail server you try to install attempts to overwrite everything!
WHAT DID WE LEARN AND OTHER IDEAS?
At the start of the project, I dove in a little too keenly without stepping back to look at the horizon. For starters, if you’re managing a large vocabulary it helps to have the right tool. The tool that I first chose loads the entire vocabulary structure into memory at startup. With 120,000 terms to start with, tools like this tend to grind to a halt! Switching to a tool that loads concepts as requested saved my sanity. This hiccup was also a product of scraping all the data I could find and lumping it together, deciding to clean it afterwards. On reflection, when merging huge data sets together I think it’s easier to keep track if you clean up the data as you add it.
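The difference between the two kinds of tool can be sketched in a few lines of Python. This is only an illustration of the 'load concepts as requested' idea, not how either tool is actually built; the class name, the `fetch_concept` callback and the tiny in-memory store standing in for a real database are all assumptions.

```python
class LazyVocabulary:
    """Fetch each concept the first time it is asked for, then cache it."""

    def __init__(self, fetch_concept):
        self._fetch = fetch_concept  # hypothetical callback: concept id -> record
        self._cache = {}
        self.fetch_count = 0         # how many times the backing store was hit

    def get(self, concept_id):
        if concept_id not in self._cache:
            self._cache[concept_id] = self._fetch(concept_id)
            self.fetch_count += 1
        return self._cache[concept_id]


# Stand-in for a large backing store; nothing is read until get() is called.
store = {"c1": {"label": "Kinesics"}, "c2": {"label": "Territoriality"}}
vocab = LazyVocabulary(store.__getitem__)

first = vocab.get("c1")
again = vocab.get("c1")  # served from the cache, no second fetch
```

Startup cost stays the same whether the vocabulary holds 61,000 terms or 120,000, because only the concepts somebody actually opens are ever loaded.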
The terminology service is live at https://concepts.sagepub.com, where you can browse the Social Science Thesaurus, download it, or run API queries against it. We will shortly be adding other vocabularies to the site that we want to share with the organization or the world. As for the Social Science Thesaurus, we’re already using it to tag much of our social science content with keywords, and I’m keen to see what other uses it can serve. And that includes you: the thesaurus is free to download, adapt and use for non-commercial purposes, so let us know what you think you can use it for!
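If you want to try the API, the sketch below shows roughly what a search request could look like in Python. It assumes the site exposes the standard Skosmos REST API under `/rest/v1/` at the address above and that results come back under a `results` key, as they do in a default Skosmos installation; I haven't listed the exact paths here, so treat the URL layout and the helper names as assumptions and check them against the live service.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://concepts.sagepub.com"  # address given above


def search_url(query, lang="en", base_url=BASE_URL):
    """Build a Skosmos-style search URL (path is an assumption, see above)."""
    params = urlencode({"query": query, "lang": lang})
    return f"{base_url}/rest/v1/search?{params}"


def search(query, lang="en"):
    """Run the search and return the list of matching concepts.

    Needs network access, so it is defined here but not called.
    """
    with urlopen(search_url(query, lang)) as response:
        return json.load(response).get("results", [])


url = search_url("dystopia")
```

Calling `search("dystopia")` would then hand back whatever the service returns for that term, ready to feed into your own tagging or lookup tools.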
Alan Maloney, with the support and encouragement of dozens of colleagues across SAGE Publishing.
Blog post written by Alan Maloney