One of the oldest and most common ways of storing information is text, used for thousands of years to record data long before the existence of Microsoft Excel or databases. And to this day, most information in any given organization exists in some form of text document, such as emails, Google documents or even post-its on the wall. That’s where NLP, or natural language processing, comes into play.
NLP has several applications, which can be grouped into two main categories: understanding meaning or feeling, and identifying patterns. Both are incredibly useful for the largely text-based SEM industry. For SEM, NLP powers tools such as search term reports and keyword expansion tools, which analyze search queries, detect associated keywords and then suggest related ones. Other, more complex uses involve generating audiences based on search term content, or QuanticMind’s use of NLP to come up with a baseline bid for long-tail keywords.
So How Does NLP Work? (we break it down)
The first step involved in an NLP process is phrase segmentation. Specifically, the phrase is broken down into sections, typically using full stops or commas.
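To make the segmentation step concrete, here is a minimal sketch in Python. It assumes the simplest possible rule, splitting on full stops and commas as described above. The function name `segment` and the sample text are made up for illustration, and real systems use far more careful rules (abbreviations such as “U.K.” would trip this one up).

```python
import re

def segment(text):
    """Split text into segments at full stops and commas (a naive rule)."""
    parts = re.split(r"[.,]\s*", text)
    # Drop the empty strings left behind by trailing punctuation.
    return [p.strip() for p in parts if p.strip()]

print(segment("Buy running shoes online, free shipping. Best prices today."))
```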
After that comes tokenization, which is a fancy way of saying that we are going to “define the unit of the language.” In the case of the English language, that unit is the word, which can be easily separated because it is contained between spaces. Then comes a first-grade refresher: deciding what type of word (verb, noun or adjective) we are talking about, a step known as part-of-speech tagging. This can be done easily for certain words such as “car,” but might need some context in cases where a word could have multiple meanings, such as “bitter” or “fair.”
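The sketch below illustrates both ideas: splitting a phrase into word tokens and then guessing each word's type. Everything in it is an assumption made for illustration. `LEXICON` is a tiny hand-written dictionary, and the single context rule for “fair” is hypothetical. Production taggers learn these decisions from large annotated corpora rather than from hand-written rules.

```python
import re

# Hypothetical toy lexicon: each word lists its possible word types,
# with the default choice first.
LEXICON = {
    "car": ["NOUN"],
    "price": ["NOUN"],
    "the": ["DET"],
    "a": ["DET"],
    "we": ["PRON"],
    "is": ["VERB"],
    "visited": ["VERB"],
    "fair": ["ADJ", "NOUN"],  # ambiguous: needs context
}

def tokenize(phrase):
    """Lowercase the phrase and pull out the words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", phrase.lower())

def tag(tokens):
    """Attach a word type to each token, using one naive context rule."""
    tagged = []
    for i, tok in enumerate(tokens):
        options = LEXICON.get(tok, ["UNK"])
        choice = options[0]
        if len(options) > 1:
            prev_tag = tagged[i - 1][1] if i > 0 else None
            next_options = LEXICON.get(tokens[i + 1], ["UNK"]) if i + 1 < len(tokens) else []
            # "the fair" (nothing noun-like after it) reads as a noun;
            # "a fair price" keeps the default adjective reading.
            if prev_tag == "DET" and "NOUN" not in next_options:
                choice = "NOUN"
        tagged.append((tok, choice))
    return tagged

print(tag(tokenize("The price is fair")))
print(tag(tokenize("We visited the fair")))
```

The same word, “fair,” comes out as an adjective in the first phrase and a noun in the second, which is the kind of context dependence described above.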
After that, we go to lemmatization, which means reducing a word to its base form. Some words look quite different from their root word, like “geese” and “goose,” and lemmatization essentially tracks down the base word associated with all of those variations. From there, we remove stop words, or filler words. These can be thought of as stock words that don’t really add any meaning to the phrase itself, such as “and,” “is” and “the,” and they need to be removed to reduce the noise while interpreting the phrase.
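Here is a rough sketch of those two steps, in the order described above. The `LEMMAS` table and `STOP_WORDS` set are tiny, hand-picked examples rather than real linguistic resources. Actual lemmatizers rely on full dictionaries and on the word type found in the previous step.

```python
# Hypothetical lookup table mapping word variations to their base form.
LEMMAS = {"geese": "goose", "mice": "mouse", "ran": "run", "flying": "fly"}

# A handful of filler words; real stop-word lists contain a few hundred.
STOP_WORDS = {"and", "is", "the", "a", "of", "to"}

def lemmatize(tokens):
    """Replace each token with its base form when one is known."""
    return [LEMMAS.get(tok, tok) for tok in tokens]

def remove_stop_words(tokens):
    """Drop filler words that add little meaning to the phrase."""
    return [tok for tok in tokens if tok not in STOP_WORDS]

tokens = ["the", "geese", "and", "the", "mice", "ran"]
print(remove_stop_words(lemmatize(tokens)))
```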
The following step is perhaps the most complex: using machine learning to understand how each component of the phrase relates to the others. Once we have moved past the grammar portion of this process, the NLP funnel moves into noun recognition, which again splits the phrase but uses nouns as the segments from which to extract information. It works like this: say our NLP system has detected nouns like “EU,” “Trump” and “California.” Our Named Entity Recognition algorithm (as it is technically called) would recognize that California and the EU are geographical locations and that Trump is an American politician.
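As a rough illustration of entity recognition, the sketch below swaps the machine-learning model for a plain lookup table (often called a gazetteer). This is a deliberate simplification, not how production systems work. The `GAZETTEER` entries and labels are assumptions that simply mirror the example above, and the table only knows the names it has been given.

```python
# Hypothetical gazetteer: known names mapped to an entity label.
GAZETTEER = {
    "EU": "LOCATION",
    "California": "LOCATION",
    "Trump": "PERSON",
}

def recognize_entities(tokens):
    """Return (token, label) pairs for every token found in the gazetteer."""
    return [(tok, GAZETTEER[tok]) for tok in tokens if tok in GAZETTEER]

phrase = "Trump discussed trade between California and the EU"
print(recognize_entities(phrase.split()))
```

A trained model goes further than this table can: it uses the surrounding words to label names it has never seen before.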
The final step of the NLP funnel is coreference resolution, which aims to understand pronouns. While humans can determine through context to which noun the pronoun refers, it becomes trickier for a computer.
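To show why this is tricky, here is a deliberately naive sketch that resolves each pronoun to the most recently mentioned noun. The function name, the `PRONOUNS` set and the `known_nouns` argument are all hypothetical. Real coreference systems weigh gender, number and sentence structure, usually with a trained model.

```python
PRONOUNS = {"it", "he", "she", "they"}

def resolve_pronouns(tokens, known_nouns):
    """Replace each pronoun with the most recent noun seen before it."""
    resolved = []
    last_noun = None
    for tok in tokens:
        low = tok.lower()
        if low in known_nouns:
            last_noun = tok
            resolved.append(tok)
        elif low in PRONOUNS and last_noun is not None:
            resolved.append(last_noun)
        else:
            resolved.append(tok)
    return resolved

tokens = "The car broke down because it was old".split()
print(" ".join(resolve_pronouns(tokens, known_nouns={"car"})))
```

The rule works for this sentence but breaks quickly. In “The car hit the wall and it started smoking,” the nearest noun is “wall,” even though a human reader would most likely pick “car.”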
NLP is not without its challenges. Computer programming is based on understanding the literal meaning of structured languages, so equipping computers to understand unstructured natural language, with its contextual references, metaphors, spelling mistakes and all the other idiosyncrasies of our written and oral communication, is a huge leap. Take, for example, the following headline from a major news publication: “Labor admits Brexit could lead UK to freefall.” A literal interpretation of this phrase is that the physical act of work (labor) has somehow gained consciousness and admits that, were Brexit to occur, the entire physical United Kingdom would somehow plunge into a literal freefall. Of course, through context and social understanding, we immediately know the real meaning of the phrase. But we only acquire that understanding through years of practice interpreting and reading between the lines. Only context gives the phrase the meaning it really has: that a political party admitted to a potential economic impact were the UK to leave the EU.
Here’s the problem: the computer not only has to understand the literal meaning of every word (even terms such as “UK,” an acronym, or “Brexit,” a coined word not found in a traditional English dictionary), but it also has to derive the potential contextual meaning from the combination of words in the phrase.
Solving that problem will open up a whole spectrum of new developments, such as more advanced chatbots that are indistinguishable from a human counterpart, or the elimination of the UI concept (the “invisible UI”), in which the user communicates with the software only via speech, eliminating the need for clicks, swipes and scrolls. NLP will also allow for a new generation of search engines in which users search the way they speak, eliminating keywords and topics (this is already present in Apple’s Siri and Amazon’s Alexa). The potential of NLP looks not only to the future but also to the past: once NLP is sufficiently powerful, researchers can use it to analyze all the text data acquired from past activities, creating a sort of backfill for all the unstructured data that we have accumulated.
NLP is a branch of data science that uses a series of steps to segment and extract information from text and speech. It aims to overcome a limitation of software that understands only formal, or “structured,” data. Text and language are “unstructured,” so converting them into structured data would allow us to turn a collection of information stored in this way into actionable insights. And it’s becoming a valuable tool in SEM.
The SEM industry is flooded with tools that use NLP at some level for such functions as keyword expansion, search query analysis, predictive search and speech search engines. This is just the beginning of its many use cases. Going forward, it will be leveraged in new and innovative ways, further bridging the gap between human language and computer data.