Natural language processing has many applications across both business and software development, but the complexity of human language has made text challenging to analyze and replicate. Why can’t computers seem to get it exactly right? Mariana Romanyshyn from Grammarly sheds light on why and discusses what you need to know about the linguistics behind NLP.
One major function of NLP lies in text simplification, and there are several reasons for building these tools. Consider, for example, the famous episode of Friends in which Joey tries to write a letter of recommendation for Monica and Chandler’s adoption agency and uses a thesaurus to change
“They are warm, nice people with big hearts.”
into “They are humid, prepossessing Homo Sapiens with full-sized aortic pumps.” (Cue laugh track.)
Other applications of text simplification have been built by companies like Newsela, which simplifies news for second-language learners, and Hemingway, which helps lower the reading level of a text to reach a broader audience. These motivations are genuine, but the process can be a bit difficult.
Building a text simplification program begins with a primary pipeline. It starts with pre-processing raw text and then moves through the stages covered below: identifying complex words, generating candidate replacements, filtering them, and ranking the results.
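As a rough illustration, a minimal version of such a pipeline might look like the Python sketch below. Every stage is a simplified placeholder standing in for the far richer components a production system (or the one described in the talk) would use.

```python
import re

# A toy lexical simplification pipeline: each stage is a placeholder.

def preprocess(text):
    """Lowercase and split raw text into word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def identify_complex_words(tokens):
    """Flag tokens that look complex (here, a crude length-based rule)."""
    return [t for t in tokens if len(t) > 9]

def generate_candidates(word):
    """Propose simpler substitutions (a real system would query a thesaurus or embeddings)."""
    toy_thesaurus = {"prepossessing": ["charming", "nice", "pleasant"]}
    return toy_thesaurus.get(word, [])

def rank_candidates(word, candidates):
    """Order candidates from most to least suitable (here, shortest first as a stand-in)."""
    return sorted(candidates, key=len)

def simplify(text):
    for word in identify_complex_words(preprocess(text)):
        candidates = rank_candidates(word, generate_candidates(word))
        if candidates:
            text = text.replace(word, candidates[0])
    return text

print(simplify("They are humid, prepossessing Homo Sapiens with full-sized aortic pumps."))
# -> "They are humid, nice Homo Sapiens with full-sized aortic pumps."
```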
The team knows the process is successful because it has a good F-measure on the test set and it runs relatively quickly. However, because the team is not the final consumer of the product, other success criteria apply. The tool should also meet the consumers' own criteria, the first of which is consistency.
In many cases, the final product meets the developers' success criteria but not the consumers'. So what can developers do to make a better product? Why is it so difficult to build a usable text simplification tool in NLP? One major reason is complex word identification.
How difficult could complex word identification be? You merely access a large corpus, tokenize it, and count word frequency.
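That naive approach takes only a few lines. In the sketch below, the corpus path and frequency threshold are placeholders, not values from the talk.

```python
from collections import Counter
import re

# Count word frequencies in a large corpus and treat rare words as "complex".
# "corpus.txt" is a placeholder path; 1000 is an arbitrary cutoff.
with open("corpus.txt", encoding="utf-8") as f:
    tokens = re.findall(r"[a-z]+", f.read().lower())

freq = Counter(tokens)

def is_complex(word, threshold=1000):
    return freq[word.lower()] < threshold
```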
Not quite. There are a few roadblocks to this process because language is a lot more complex than that.
Counting alone doesn’t meet the first criterion, consistency, because it ignores parts of speech and inflection. To make your program work, it has to identify words the way a linguist would, i.e., as the collection of all of a word's forms. This also includes alternate spellings such as the British “accessorise.” Scraping resources that list a word's inflectional morphology, the sum of all its forms, gives your program a consistent basis for chunking word forms together.
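One way to approximate that grouping is to count lemmas rather than surface forms, for example with NLTK's WordNet lemmatizer. This is only a sketch: a real system would also use part-of-speech tags and normalize spelling variants like “accessorise.”

```python
from collections import Counter
from nltk.stem import WordNetLemmatizer  # requires nltk.download("wordnet") beforehand

lemmatizer = WordNetLemmatizer()

def lemma_frequencies(tokens):
    """Aggregate counts by lemma so that, e.g., 'friend' and 'friends' share one entry."""
    counts = Counter()
    for token in tokens:
        counts[lemmatizer.lemmatize(token.lower())] += 1
    return counts

print(lemma_frequencies(["friend", "friends", "Friend"]))  # Counter({'friend': 3})
```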
Controlling for word length also leaves inconsistencies. “Friend” could be marked simple while “friendliness” is marked complex only because of its length. Long words aren’t necessarily complex: if you can derive the meaning from word parts (e.g., “satisfactory” from “satisfy”), they are actually simple. Building a morphological (word-part) analyzer into your program gives you a more consistent read on which words are complex than a length check alone.
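A very rough way to capture this is to strip common suffixes and check whether what remains is a word the reader already knows. The suffix list and the set of known words below are illustrative; real morphological analyzers are far more sophisticated.

```python
SUFFIXES = ["liness", "ness", "ation", "ment", "ful", "ly", "er", "ing", "ed", "s"]

def has_simple_stem(word, known_simple_words):
    """Return True if stripping a common suffix leaves a familiar word,
    e.g. 'friendliness' -> 'friend'. Long words with familiar parts are not really complex."""
    word = word.lower()
    if word in known_simple_words:
        return True
    for suffix in SUFFIXES:
        if word.endswith(suffix) and word[:-len(suffix)] in known_simple_words:
            return True
    return False

print(has_simple_stem("friendliness", {"friend", "satisfy"}))  # True
```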
You can take both of these one step further and analyze words for unusual letter combinations. Complex words tend to contain rare letter sequences; “abhorrence,” for example, contains “abho.” Compare that with the simpler word “anger”: you can immediately think of several words containing “ange,” but probably none (at least quickly) containing “abho.”
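Character n-gram statistics make this measurable: count how often each four-letter sequence appears across a word list, then flag words containing very rare ones. The threshold below is arbitrary and the word list is a placeholder.

```python
from collections import Counter

def char_ngrams(word, n=4):
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def build_ngram_counts(words, n=4):
    """Count character 4-grams across a word list (ideally weighted by corpus frequency)."""
    counts = Counter()
    for w in words:
        counts.update(char_ngrams(w.lower(), n))
    return counts

def has_rare_ngram(word, counts, threshold=5):
    """Flag words like 'abhorrence' whose 4-grams (e.g., 'abho') rarely occur elsewhere."""
    return any(counts[gram] < threshold for gram in char_ngrams(word.lower()))
```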
On an even more basic level, words can be analyzed for how they sound. In English, complex words tend to have a higher ratio of consonants to vowels, while simple words have more even proportions. For example, “procrastination” has nine consonant letters and six vowels, versus “information” with six consonants and five vowels.
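This feature is trivial to compute from spelling, though English letters only approximate the actual sounds, so treat the sketch below as a rough proxy.

```python
VOWELS = set("aeiou")

def consonant_vowel_ratio(word):
    """Ratio of consonant letters to vowel letters, a rough proxy for pronounceability."""
    letters = [c for c in word.lower() if c.isalpha()]
    vowels = sum(1 for c in letters if c in VOWELS)
    consonants = len(letters) - vowels
    return consonants / vowels if vowels else float("inf")

print(consonant_vowel_ratio("procrastination"))  # 9 consonants / 6 vowels = 1.5
print(consonant_vowel_ratio("information"))      # 6 consonants / 5 vowels = 1.2
```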
Working with word meaning is notoriously tricky. However, complex words tend to have fewer meanings than simple words, because we use simple words in many different ways in everyday language. The word “report,” for example, has seven noun meanings and six verb meanings, whereas “abhorrence” has only one noun meaning.
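Sense counts are easy to pull from a lexical database. The talk does not name a specific resource; NLTK's WordNet interface is used below purely as an illustration.

```python
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

def sense_counts(word):
    """Number of noun and verb senses a word has in WordNet."""
    return {
        "noun": len(wn.synsets(word, pos=wn.NOUN)),
        "verb": len(wn.synsets(word, pos=wn.VERB)),
    }

print(sense_counts("report"))      # e.g. {'noun': 7, 'verb': 6}
print(sense_counts("abhorrence"))  # e.g. {'noun': 1, 'verb': 0}
```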
Psycholinguistics deals with language comprehension and production. For example, words for which your brain readily produces an image tend to be simple (“mouse” versus “abhorrence”). Another useful feature is the average age of acquisition (again, “mouse” versus “abhorrence”: children learn “mouse” far earlier).
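These features usually come from published psycholinguistic norms rather than from raw text. As a sketch, loading an age-of-acquisition lookup could look like this; the file name and column names are hypothetical, not a real dataset referenced in the talk.

```python
import csv

def load_aoa_norms(path="aoa_norms.csv"):
    """Load a word -> age-of-acquisition mapping from a norms file (hypothetical columns).
    Words learned late ('abhorrence') are likelier to be complex than words
    learned early ('mouse')."""
    norms = {}
    with open(path, encoding="utf-8") as f:
        for row in csv.DictReader(f):
            norms[row["word"].lower()] = float(row["aoa"])
    return norms
```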
As you find replacements for a complex word, you must rerank them by how much they actually simplify the text. However, it’s possible to go too simple.
For example, “dipsomania” is a mostly unknown word with several synonyms. “Inebriacy” isn’t any simpler than the original, but the simplest synonym, “habit,” actually changes the meaning and cannot be considered a suitable replacement; you’ve gone too simple. Instead, a word like “alcoholism” strikes the right balance, because the user is likely to know the familiar part “alcohol.”
Filtering synonyms means removing options that are just as complex as the original, too simple to be an adequate replacement, or not grammatically correct in context (including options that break common collocations).
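A sketch of that filtering step, using corpus frequency as a crude complexity proxy, could look like the following. The ratio thresholds are placeholders, and the upper cap is only a stand-in for a real meaning-preservation check such as the “habit” for “dipsomania” case.

```python
def filter_candidates(original, candidates, freq, min_ratio=2.0, max_ratio=50.0):
    """Keep candidates noticeably more frequent (simpler) than the original word,
    but not so generic that they are likely to distort the meaning.
    A real system would also check part of speech, agreement, and collocations."""
    kept = []
    for cand in candidates:
        if freq[original] == 0 or freq[cand] == 0:
            continue
        ratio = freq[cand] / freq[original]
        if min_ratio <= ratio <= max_ratio:
            kept.append(cand)
    return kept
```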
From there, the team can rerank the suggestions using a language model. The most appropriate synonyms rise toward the top, while options suitable only in certain situations fall toward the bottom of the list. Ranking approaches fall into two broad categories.
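As one hedged illustration of language-model reranking, a pretrained masked language model can score each candidate in its sentence context. The specific library and model below (Hugging Face transformers with bert-base-uncased) are choices made for this example, not something named in the talk.

```python
from transformers import pipeline

# Score substitution candidates by how well they fit the surrounding sentence.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def rerank(sentence_with_mask, candidates):
    """sentence_with_mask uses the model's mask token, e.g.
    'They are [MASK], nice people with big hearts.'"""
    results = fill_mask(sentence_with_mask, targets=candidates)
    return [(r["token_str"], r["score"]) for r in sorted(results, key=lambda r: -r["score"])]

print(rerank("They are [MASK], nice people with big hearts.", ["warm", "humid", "kind"]))
```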
All these pieces produce a much more complex, but way more accurate pipeline for text simplification.
Romanyshyn believes that a working knowledge of linguistics gives developers more power in an increasingly complex machine learning landscape and helps them build smarter, more accurate programs. Because researchers aren’t the final consumers of any NLP model, it’s vital that developers consider the real needs of those end users when building.
She believes that although language still presents significant roadblocks to accurate NLP models, it’s better to dive into a problem from a linguistics standpoint rather than ignore it.
This video was taken at ODSC East 2018. Attend ODSC East 2019 this April 30 to May 3 for more unique content! Subscribe to our YouTube channel for more videos taken at past conferences.