How We Used NLP Technology to Find Hidden Facts in Centuries' Worth of Architecture History

One of the things I love about my work is that I’m in a unique position to observe both the academic and business applications of natural language processing (NLP) and AI technology. Through my work with NexLP and teaching at the Illinois Institute of Technology (IIT), I have opportunities to contribute to and see numerous ways that NLP can be used creatively to solve complex problems across academic research and business.

I wanted to highlight a unique application that was recently published by one of my former students, Dan Baciu. Both his published and ongoing work use NLP technology to quantify significant cultural moments within architectural history. His work is significant for the research community because it advances the use of NLP for understanding cultural history on an unprecedented scale.

It is also an interesting case study in how NLP can be used creatively to solve complex problems that are relevant to both the academic and business communities.

The Challenge: Use NLP to Quantify Cultural History

The project I want to focus on is one that my colleague Dr. Dan Roth and I collaborated on. Our interest was in using NLP technology to help quantify and evaluate historical references to The Chicago School. The full research paper, The Chicago School: Evolving Systems of Value, is available for download.

The Chicago School is an interesting challenge for NLP for a few reasons:

  1. There are hundreds of thousands of references to it, spread across two centuries of publications.
  2. It is a highly nuanced subject with many variations in meaning.
  3. There is a lot of noise within the data, such as references to The Chicago School that are not especially important.

What is The Chicago School and is it a Big Data problem?

The Chicago School is a term that has been widely used to describe a set of architectural principles, but it has also been used over the years to describe similar developments in other disciplines such as philosophy. It can also refer to actual schools, such as The Chicago School of Professional Psychology. The key goals of this project were to quantify historically significant references to The Chicago School and to better understand how usage and variations evolved over time.

Why we used NLP technology: Our data included more than 100,000 volumes (including books and newspapers) from the HathiTrust Digital Library with volumes published over the course of two centuries. It would take years to label and classify such a massive unstructured data set without the help of NLP.

The data set also presented a creative challenge for using NLP effectively. Traditional text mining approaches would likely be inaccurate with so much nuance present. Text analytics technology also lacks the ability to extract contextual meaning.

This made NLP technology a natural choice, and, indeed, Dan was able to uncover an entire subarea of The Chicago School that had previously been unknown.

We solved the challenge of context using a process called Wikification, which automatically links named entities (and sometimes concepts) to their entries on Wikipedia. Wikification has proven to be effective for adding much-needed context for text analytics technology. However, it has historically been used on far smaller data sets because of its heavy computational requirements.

The other significant element of our work was a filtering technique based on human associative memory. This enabled us to identify truly important mentions of The Chicago School and eliminate a considerable amount of noise from the data.

The Results: A Step Forward in Using NLP for Research

Our project successfully used Wikification on an unprecedented scale, with two to three times greater accuracy than previous methods. We were able to extract 500 GB worth of relevant data, including named entities, metadata and 190,000 text snippets mentioning The Chicago School.

In addition, we were able to discover facts that were not obvious to researchers who studied the original texts without NLP technology: we identified a subarea of the Chicago School of Architecture that was not known before.

The convergence of several factors made it possible for NLP technology to understand context and deal with far larger volumes of data.

Developments in techniques like Wikification have given NLP the power to understand context in a more sophisticated way. Additionally, advances in the processing and storage power of computers have made it possible to use resource-intensive techniques on far larger data sets. Finally, partnerships like the HathiTrust have given researchers access to large volumes of data that were formerly scattered across libraries all over the world.

Dan’s use of NLP is unique because of the scale; Wikification had never been used on a data set that large before.

Our work also presented a new use case for Wikification in general; previous work utilized it to annotate texts for readers. However, we used it to collect structured data from vast volumes of newspaper and book data.

Using Wikification in this way also meant we needed a method for eliminating irrelevant mentions of The Chicago School to focus on culturally significant references. One of the reasons Wikification is a powerful technique is that it understands context. When The Chicago School is mentioned in text, Wikification can determine if it refers to The Chicago School of Architecture or The Chicago School of Psychology. Wikifier uses the context in which the term is mentioned, as well as the context of the relevant Wikipedia articles. It also considers the links to other connected articles, thus broadening the analyzed context.
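To make the idea of context-based disambiguation concrete, here is a minimal sketch in Python. It is not the Wikifier we used; it is a toy that scores each candidate entry by how many words from the surrounding text also appear in a short description of that entry. The candidate titles, the description strings and the function names are all assumptions made up for illustration, standing in for real Wikipedia article text.

```python
import re
from collections import Counter

# Hypothetical candidate entries. The descriptions are illustrative stand-ins
# for the text of the corresponding Wikipedia articles, not real article text.
CANDIDATES = {
    "Chicago School (architecture)":
        "architecture architects skyscraper steel frame buildings commercial design",
    "The Chicago School of Professional Psychology":
        "psychology graduate university students degree campus",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def disambiguate(context, candidates=CANDIDATES):
    """Pick the candidate whose description shares the most words with the context.

    Returns the best title and the full score table. On a tie, the candidate
    listed first wins.
    """
    context_counts = Counter(tokenize(context))
    scores = {}
    for title, description in candidates.items():
        description_words = set(tokenize(description))
        # Count every context token that also occurs in the description.
        scores[title] = sum(context_counts[word] for word in description_words)
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    mention = ("The Chicago School pioneered the steel frame skyscraper "
               "in commercial buildings.")
    title, scores = disambiguate(mention)
    print(title, scores)
```

A real Wikifier goes well beyond word overlap, for example by also weighing the links between connected articles, but the principle is the same: the words around a mention decide which entry it points to.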

Finding culturally important authors and mentions of a concept requires some context. For instance, traditional text mining techniques might look at word frequency to determine the importance of a named entity. However, important figures are often not mentioned many times in a particular text. This is where context becomes critical.

Our filtering technique looked at the relationships among named entities as well as the role they play within the context of the overall text. By understanding context, technology can be used to find truly important information instead of forcing users to manually remove irrelevant data.
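The difference between counting mentions and looking at relationships can be sketched in a few lines of Python. This is not the filtering technique from our paper; it is a simplified stand-in that compares raw frequency with a basic connectivity measure, namely how many distinct other entities an entity appears alongside. The function name and the placeholder entity names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def rank_by_association(snippets):
    """Compare raw mention counts with co-occurrence connectivity.

    `snippets` is a list of snippets, each given as a list of the entity
    names found in it. Returns (frequency, connectivity), where connectivity
    is the number of distinct other entities each entity co-occurs with.
    """
    frequency = Counter()
    neighbors = defaultdict(set)
    for entities in snippets:
        frequency.update(entities)
        # Link every pair of distinct entities that share a snippet.
        for a, b in combinations(sorted(set(entities)), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    connectivity = {entity: len(neighbors[entity]) for entity in frequency}
    return frequency, connectivity


if __name__ == "__main__":
    # Placeholder names, not entities from the actual data set.
    snippets = [
        ["Architect A", "Architect A", "Architect A"],
        ["Architect B", "Critic C"],
        ["Architect B", "Historian D"],
        ["Architect B", "Critic C", "Engineer E"],
    ]
    frequency, connectivity = rank_by_association(snippets)
    for entity in sorted(frequency, key=lambda e: -connectivity[e]):
        print(entity, frequency[entity], connectivity[entity])
```

In this toy data, Architect A and Architect B are mentioned equally often, yet only Architect B is connected to other entities. A frequency count cannot tell them apart, while a relationship-based measure can.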

The Broader Implications for NLP Technology

At NexLP, we often talk about context in terms of AI Features – individual elements of a data set that are used to extract meaning. In an investigation, these might be referred to as signals (e.g. signals of fraud), but regardless of the terminology used, the power of understanding more Features cannot be overstated. In a business context, the more Features technology understands, the more sophisticated the analysis that technology can provide. In our research into The Chicago School, context-sensitive technology helped reveal something entirely new to architectural history research.

Furthermore, once the structure of the unstructured data is revealed, it can be used to identify regularities, broader trends and anomalies in the data. This allows NLP technology users to discover information that would otherwise stay in the dark.
