What is NLP (Natural Language Processing) Tokenization?
Natural Language Processing (NLP) enables machine learning algorithms to organize and understand human language. NLP enables machines to not only gather text and speech but also identify the core meaning they should respond to. Human language is complex and constantly evolving, which makes natural language processing a considerable challenge. Tokenization is one of the many pieces of the puzzle in how NLP works.
In this article, we’ll give a quick overview of what natural language processing is before diving into how tokenization enables this complex process.
What is Natural Language Processing?
Natural Language Processing uses both linguistics and mathematics to connect the languages of humans with the language of computers. Natural language usually comes in one of two forms, text or speech. Through NLP algorithms, these natural forms of communication are broken down into data that can be understood by a machine.
There are many complications working with natural language, especially with humans who aren’t accustomed to tailoring their speech for algorithms. Although there are rules for speech and written text that we can create programs out of, humans don’t always adhere to these rules. The study of the official and unofficial rules of language is called linguistics.
The issue with using formal linguistics to create NLP models is that the rules for any language are complex. The rules of language alone often pose problems when converted into formal mathematical rules. Although linguistic rules work well to define how an ideal person would speak in an ideal world, human language is also full of shortcuts, inconsistencies, and errors.
Because of the limitations of formal linguistics, computational linguistics has become a growing field. Using large datasets, linguists can discover more about how human language works and use those findings to inform natural language processing. This version of NLP, statistical NLP, has come to dominate the field of natural language processing. Using statistics derived from large amounts of data, statistical NLP bridges the gap between how language is supposed to be used and how it is actually used.
How does Tokenization Work in Natural Language Processing?
Tokenization is a simple process that takes raw data and converts it into smaller, more useful units called tokens. While tokenization is well known for its use in cybersecurity and in the creation of NFTs, tokenization is also an important part of the NLP process. Tokenization is used in natural language processing to split paragraphs and sentences into smaller units that can be more easily assigned meaning.
The first step of the NLP process is gathering the data (a sentence) and breaking it into understandable parts (words).
Here’s an example of a string of data:
“What restaurants are nearby?”
In order for this sentence to be understood by a machine, tokenization is performed on the string to break it into individual parts. With tokenization, we’d get something like this:
‘what’ ‘restaurants’ ‘are’ ‘nearby’
This may seem simple, but breaking a sentence into its parts allows a machine to understand the parts as well as the whole: each word by itself, and how it functions in the larger text. This is especially important for larger amounts of text, as it allows the machine to count how often certain words occur and where they frequently appear, which later steps in natural language processing rely on.
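The example above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the helper name simple_word_tokenize is hypothetical, and it assumes English text where lowercasing and dropping punctuation are acceptable.

```python
import re
from collections import Counter

def simple_word_tokenize(text):
    """Lowercase the text and return runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = simple_word_tokenize("What restaurants are nearby?")
print(tokens)  # ['what', 'restaurants', 'are', 'nearby']

# Once text is tokenized, counting word frequencies is straightforward.
counts = Counter(simple_word_tokenize("The cat saw the other cat."))
print(counts["the"], counts["cat"], counts["other"])  # 2 2 1
```

The frequency table built by Counter is the kind of information later NLP steps can draw on.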
Tokenization Challenges in NLP
While breaking down sentences seems simple (after all, we build sentences from words all the time), it can be a bit more complex for machines.
A large challenge is being able to segment words when spaces or punctuation marks don’t define the boundaries of the word. This is especially common for languages that are usually written without spaces between words, like Chinese, Japanese, and Thai. Korean does use spaces, but the units they separate often bundle several meaningful parts together, so it poses a similar problem.
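One classic way to segment text that has no spaces is greedy longest-match against a word list (often called forward maximum matching). The sketch below is only an illustration under assumptions: the function name max_match, the toy dictionary, and the use of English text with its spaces removed are all choices made here for readability, not something prescribed by any particular NLP library.

```python
def max_match(text, dictionary, max_len=11):
    """Greedy segmentation: at each position take the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character
        # so the loop always makes progress, even on unknown text.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens

dictionary = {"what", "restaurants", "are", "nearby"}
print(max_match("whatrestaurantsarenearby", dictionary))
# ['what', 'restaurants', 'are', 'nearby']
```

Greedy matching can pick the wrong split when a long dictionary word happens to overlap the start of the next one, which is one reason segmentation without spaces is considered hard.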
Another challenge is symbols that change the meaning of the word significantly. We intuitively understand that a ‘$’ sign with a number attached to it ($100) means something different than the number itself (100). Punctuation, especially in less common situations, can cause an issue for machines trying to isolate its meaning as a part of a data string.
Contractions such as ‘you’re’ and ‘I’m’ also need to be properly broken down into their respective parts. Failing to properly tokenize every part of the sentence can lead to misunderstandings later in the NLP process.
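As a rough sketch of how these cases might be handled, the regular expression below keeps a currency amount such as $100 together and splits a contraction into its stem and its suffix. The pattern and the tokenize helper are illustrative assumptions, not a standard; real tokenizers cover many more cases.

```python
import re

# Alternatives are tried left to right, so the more specific ones come first.
TOKEN_PATTERN = re.compile(
    r"""
    \$\d+(?:\.\d+)?      # currency amounts such as $100 or $9.99
    | '\w+               # contraction suffixes such as 're or 'm
    | \w+                # ordinary words and numbers
    | [^\w\s]            # any other single punctuation mark
    """,
    re.VERBOSE,
)

def tokenize(text):
    # Normalize curly apostrophes so both apostrophe styles behave the same.
    text = text.replace("\u2019", "'")
    return TOKEN_PATTERN.findall(text)

print(tokenize("You're paying $100, not 100."))
# ['You', "'re", 'paying', '$100', ',', 'not', '100', '.']
```

Here ‘$100’ and ‘100’ come out as different tokens, so later steps can treat the price and the bare number differently.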
Tokenization is the start of the NLP process, converting sentences into understandable bits of data that a program can work with. Without a strong foundation built through tokenization, the NLP process can quickly devolve into a messy telephone game.
Kinds of Tokenization
There are several different methods that are used to separate words to tokenize them, and these methods will fundamentally change later steps of the NLP process.
Word tokenization is the most common version of tokenization. It takes natural breaks, like pauses in speech or spaces in text, and splits the data into its respective words using delimiters (characters like a space, ‘,’ or ‘;’). While this is the simplest way to separate speech or text into its parts, it does come with some drawbacks.
It’s difficult for word tokenization to separate unknown words or Out Of Vocabulary (OOV) words. This is often solved by replacing unknown words with a simple token that communicates that a word is unknown. This is a rough solution, especially since 5 ‘unknown’ word tokens could be 5 completely different unknown words or could all be the exact same word.
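The unknown-token approach can be sketched as follows. The token text "<unk>" and the helper name encode are assumptions chosen for illustration; different systems use different placeholder names.

```python
UNK = "<unk>"  # hypothetical placeholder for any out-of-vocabulary word

def encode(tokens, vocabulary):
    """Replace every token that is not in the vocabulary with UNK."""
    return [tok if tok in vocabulary else UNK for tok in tokens]

vocabulary = {"what", "restaurants", "are", "nearby"}

print(encode(["what", "bistros", "are", "nearby"], vocabulary))
# ['what', '<unk>', 'are', 'nearby']

# Two different unknown words collapse into the same token,
# so the distinction between them is lost.
print(encode(["bistros", "trattorias"], vocabulary))
# ['<unk>', '<unk>']
```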
Word tokenization’s accuracy is based on the vocabulary it is trained with. These models have to find the balance between loading words for maximum accuracy and maximum efficiency. While adding an entire dictionary’s worth of vocabulary would make an NLP model more accurate, it’s often not the most efficient method. This is especially true for models that are being trained for a more niche purpose.
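One simple way to picture this trade-off is to cap the vocabulary at the most frequent words and measure how much of the text it still covers. The sketch below is a toy under assumptions: build_vocabulary, coverage, and the tiny corpus are hypothetical names and data invented for illustration.

```python
from collections import Counter

def build_vocabulary(corpus_tokens, max_size):
    """Keep only the max_size most frequent words."""
    counts = Counter(corpus_tokens)
    return {word for word, _ in counts.most_common(max_size)}

def coverage(corpus_tokens, vocabulary):
    """Fraction of tokens in the corpus that the vocabulary recognizes."""
    known = sum(1 for tok in corpus_tokens if tok in vocabulary)
    return known / len(corpus_tokens)

corpus = "the cat sat on the mat the cat".split()
small_vocab = build_vocabulary(corpus, max_size=2)

print(sorted(small_vocab))            # ['cat', 'the']
print(coverage(corpus, small_vocab))  # 0.625
```

A larger max_size raises coverage but also increases the number of entries the model has to store and learn.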
Character tokenization was created to address some of the issues that come with word tokenization. Instead of breaking text into words, it completely separates text into characters. This allows the tokenization process to retain information about OOV words that word tokenization cannot.
Character tokenization doesn’t have the same vocabulary issues as word tokenization as the size of the ‘vocabulary’ is only as many characters as the language needs. For English, for example, a character tokenization vocabulary would need only the 26 letters of the alphabet, plus whatever digits, punctuation marks, and uppercase forms the model chooses to keep.
While character tokenization solves OOV issues, it isn’t without its own complications. By breaking even simple sentences into characters instead of words, the length of the output is increased dramatically. With word tokenization, our previous example “what restaurants are nearby” is broken down into four tokens. By contrast, character tokenization breaks this down into 24 tokens (not counting the spaces), a 6X increase in tokens to work with.
Character tokenization also adds an additional step of understanding the relationship between the characters and the meaning of the words. Sure, character tokenization can make additional inferences, like the fact that there are 5 “a” tokens in the above sentence. However, this tokenization method moves an additional step away from the purpose of NLP, interpreting meaning.
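The counts for the example sentence can be checked with a short sketch. The helper name char_tokenize and its keep_spaces option are assumptions made for illustration; whether spaces count as tokens is a design choice that varies between systems.

```python
def char_tokenize(text, keep_spaces=False):
    """Split text into single characters, optionally dropping whitespace."""
    return [ch for ch in text if keep_spaces or not ch.isspace()]

sentence = "what restaurants are nearby"

print(len(sentence.split()))                           # 4 word tokens
print(len(char_tokenize(sentence)))                    # 24 character tokens
print(len(char_tokenize(sentence, keep_spaces=True)))  # 27 if spaces are kept
print(char_tokenize(sentence).count("a"))              # 5
```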
Sub Word Tokenization
Sub word tokenization is similar to word tokenization, but it breaks individual words down a little bit further using specific linguistic rules. One of the main techniques it relies on is breaking off affixes. Because prefixes, suffixes, and infixes change the inherent meaning of words, they can also help programs understand a word’s function. This can be especially valuable for out of vocabulary words, as identifying an affix can give a program additional insight into how unknown words function.
The sub word model will search for these sub words and break down words that include them into distinct parts. For example, the query “What is the tallest building?” would be broken down into ‘what’ ‘is’ ‘the’ ‘tall’ ‘est’ ‘build’ ‘ing’
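A toy version of this suffix splitting might look like the sketch below. The suffix list, the minimum stem length, and the function names are all assumptions made for illustration. Widely used sub word methods such as byte-pair encoding learn their splits from data rather than from a hand-written list.

```python
import re

SUFFIXES = ("est", "ing")  # hypothetical, deliberately tiny suffix list

def split_suffix(word, min_stem=3):
    """Split off a known suffix if enough of the word is left as a stem."""
    for suffix in SUFFIXES:
        stem = word[:-len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem:
            return [stem, suffix]
    return [word]

def subword_tokenize(text):
    tokens = []
    for word in re.findall(r"[a-z]+", text.lower()):
        tokens.extend(split_suffix(word))
    return tokens

print(subword_tokenize("What is the tallest building?"))
# ['what', 'is', 'the', 'tall', 'est', 'build', 'ing']
```

A rule this naive also splits words where the ending is not really a suffix (‘string’ would become ‘str’ and ‘ing’), which is why practical systems need more than a fixed list.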
How does this method help the issue of OOV words? Let’s look at an example:
Perhaps a machine receives a more complicated word, like ‘machinating’ (the present participle of the verb ‘machinate’, which means to scheme or engage in plots). It’s unlikely that ‘machinating’ is a word included in many basic vocabularies.
If the NLP model was using word tokenization, this word would simply be converted into an unknown token. However, if the NLP model was using sub word tokenization, it would be able to separate the word into an ‘unknown’ token and an ‘ing’ token. From there it can make valuable inferences about how the word functions in the sentence.
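The difference between the two outcomes can be sketched as follows. The vocabulary, the "<unk>" placeholder, and the function name encode_subwords are hypothetical, chosen only to mirror the ‘machinating’ example.

```python
UNK = "<unk>"  # hypothetical placeholder for an unknown word or stem

def encode_subwords(word, vocabulary, suffixes=("ing",)):
    """Return the word itself, a stem plus suffix, or UNK as a last resort."""
    if word in vocabulary:
        return [word]
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix):
            stem = word[:-len(suffix)]
            # The stem may still be unknown, but the suffix is preserved.
            return [stem if stem in vocabulary else UNK, suffix]
    return [UNK]

vocabulary = {"build", "run", "what", "is"}

print(encode_subwords("machinating", vocabulary))  # ['<unk>', 'ing']
print(encode_subwords("building", vocabulary))     # ['build', 'ing']
print(encode_subwords("zyx", vocabulary))          # ['<unk>']
```

Even when the stem is unknown, the surviving ‘ing’ token carries information that a single unknown token would not.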
But what information can a machine gather from a single suffix? The common ‘ing’ suffix, for example, functions in a few easily defined ways. It can turn a verb into a noun, like the verb ‘build’ turned into the noun ‘building’. It can also turn a verb into its present participle, like the verb ‘run’ becoming ‘running.’
If an NLP model is given this information about the ‘ing’ suffix, it can make several valuable inferences about any word that uses the sub word ‘ing.’ If ‘ing’ is being used in a word, it knows that the word is functioning either as a verb turned into a noun or as a present participle. This dramatically narrows down how the unknown word, ‘machinating,’ may be used in a sentence.
There are multiple ways that text or speech can be tokenized, although each method’s success relies heavily on the strength of the programming integrated in other parts of the NLP process. Tokenization serves as the first step, taking a complicated data input and transforming it into useful building blocks for the natural language processing program to work with.
As natural language processing continues to evolve using deep learning models, humans and machines are able to communicate more efficiently. This is just one of many ways that tokenization is providing a foundation for revolutionary technological leaps.