Topics Covered in Language Module
In language, we will cover how Artificial Intelligence is used to process human language, convert it into meaningful information that the system can understand, and then convert that information back into a form that a human can understand.
Natural Language Processing is a subfield of Artificial Intelligence that deals with the interaction between human language and computers. It covers the programs and techniques used to analyze and process large amounts of human language data.
Examples of Natural Language Processing tasks include machine translation, sentiment analysis, information extraction, and question answering.
Syntax: Syntax is the structure of text. It is the arrangement of the words in a sentence in such a way that the sentence makes sense. While speaking our native language, we don’t pay much attention to the grammatical mistakes in our sentences and yet still understand a sentence’s meaning.
For example, people often mix up “your” and “you’re,” “its” and “it’s,” “affect” and “effect,” and so on.
In NLP, syntactic analysis is used to analyze natural language. Different syntactic techniques (algorithms) apply grammatical rules to a text document and extract meaning from it.
Semantics: Semantics is the meaning of words or sentences. Two completely different sentences can sometimes have the same meaning. For example, the sentence “Just before nine o’clock Sherlock Holmes stepped briskly into the room” is syntactically different from “Sherlock Holmes stepped briskly into the room just before nine o’clock,” but the meaning of both sentences is the same. Similarly, the sentence “A few minutes before nine, Sherlock Holmes walked quickly into the room” uses different words from the previous sentences, but it still carries the same meaning.
Also, a sentence can be perfectly grammatical and nonsensical at the same time. So semantic analysis is used to understand the content of sentences. Different algorithms and techniques are used to understand the meaning and structure of sentences.
When certain rules are used for generating sentences in a language, the set of rules is called a formal grammar.
In a context-free grammar, we abstract away from the meaning of a sentence and represent the sentence’s structure using formal grammar rules.
For Example:
She saw the city.
This is a simple grammatical sentence, and we want to generate a syntax tree that represents its structure.
So, we want our AI system to be able to look at the sentence and figure out its structure, because to answer any question about it, the AI system needs to know that structure. For example, if we ask the AI “What did she see?”, it should answer “the city,” and for that it needs some understanding of how the sentence is put together.
In the above example, each of the words in the sentence is called a terminal symbol, and each word is associated with a non-terminal symbol such as N, V, or D:
N ——– She
V——– saw
D——– the
N——– city
We assign each word a part of speech. In the above example, “she” and “city” are nouns, so we mark them as N. “Saw” is a verb, marked as V. And “the” is a determiner, which marks the noun as definite or indefinite, so we mark it as D. The above sentence can now be written as
N V D N
To translate these non-terminal symbols into terminal symbols, we have some rewriting rules, which are:
N = she | city | car | Harry | ...
D = the | a | an | ...
V = saw | ate | walked | ...
P = to | on | over | ...
ADJ = blue | busy | old | ...
N stands for noun, V for verb, D for determiner, P for preposition, and ADJ for adjective.
So, when we are defining the structure of any language, we define these types of rules. But the rules above deal only with single nouns and single verbs. When multiple words together operate as a noun or a verb, we call them noun phrases and verb phrases, and to handle those we need to introduce more rules, which means more non-terminal symbols, like:
NP = N | D N
It means that a noun phrase can be a noun or a determiner followed by a noun.
VP = V | V NP
It means that a verb phrase is just a verb or a verb followed by a noun phrase.
S = NP VP
S is a sentence, which consists of a noun phrase followed by a verb phrase. Likewise, we can have many more rules to define the structure of a sentence, and using these rules we can construct a syntax tree for it.
Many libraries have been written to implement the idea of context-free grammar discussed above. In the case of Python, one such library is nltk (the Natural Language Toolkit). This library provides a wide variety of functions and classes that deal with natural language. One of them is ChartParser, which can parse a context-free grammar and construct the syntax tree for a sentence.
Python Code:
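A minimal sketch of such a parser is shown below. The grammar rules and word lists here are purely illustrative; they cover the example sentences used in this section and would need to be extended for anything else.

import nltk

# A small, hand-written context-free grammar. The word lists are
# illustrative; "big" is included so the example sentence parses.
grammar = nltk.CFG.fromstring("""
    S -> NP VP

    AP -> ADJ | ADJ AP
    NP -> N | D NP | AP NP | N PP
    PP -> P NP
    VP -> V | V NP | V NP PP

    ADJ -> "big" | "blue" | "busy" | "old"
    D -> "the" | "a" | "an"
    N -> "she" | "city" | "car" | "street" | "dog"
    P -> "to" | "on" | "over"
    V -> "saw" | "ate" | "walked"
""")

parser = nltk.ChartParser(grammar)

# Split the input sentence into words and try to parse it.
sentence = input("Sentence: ").lower().split()
try:
    for tree in parser.parse(sentence):
        tree.pretty_print()
except ValueError:
    print("No parse tree possible.")

With this grammar, a sentence such as “she saw a big dog on the street” produces more than one tree, since “on the street” can attach either to “dog” or to “saw.”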
Explanation:
For example, the input sentence is: “she saw a big dog on the street.”
Output: a syntax tree for the input sentence. nltk’s chart-parsing algorithm is able to find the different possible structures of a sentence, which lets us extract some useful meaning from it as well.
A context-free grammar only deals with the structure of a sentence; it knows nothing about which sequences of words are actually likely. So we also need a way to capture how sequences of words are likely to relate to each other in terms of the actual words themselves.
For example, a context-free grammar can understand and generate sentences like “I ate a banana.”
But a sentence like “I ate a blue car” is also syntactically correct according to the context-free grammar rules, even though it doesn’t make any sense. So we need our AI to capture the idea that certain word sequences are more or less likely than others, and to deal with that, we introduce the notion of the n-gram.
An n-gram is a contiguous sequence of n items inside our text, and those items can take different forms, such as characters or words.
“How often have I said to you that when you have eliminated the impossible, whatever remains, however improbable, must be the truth?”
In the above example, if we take sequences of three words, the first trigram is “How often have,” the second trigram is “often have I,” the third is “have I said,” and so on.
Extracting trigrams, bigrams, and so on is often helpful when analyzing a lot of text, because we don’t have to analyze the whole text at once. Instead, we split it into segments that we can analyze individually: our AI may never have seen an entire sentence before, but it has probably seen the trigrams or bigrams inside it. Segments therefore make the analysis easier for the AI.
Tokenization is the task of extracting these sequences. Basically, we need to take our input and somehow separate it into pieces, also known as tokens. Sometimes the tokens are words and sometimes they are sentences; the tasks are called word tokenization and sentence tokenization, respectively.
Text is split into words based on punctuation such as periods, spaces, and commas. Splitting on punctuation is not always perfect: it runs into trouble with cases like “Mr. Holmes,” and we face more challenges with “o’clock” and hyphenated words such as “pearl-grey.” These are things our algorithm has to decide. There are usually rules we can rely on, such as knowing that the period in “Mr. Holmes” is not the end of the sentence, and we can encode these rules in our AI system so that it tokenizes the text the way we want.
Once we have the ability to tokenize a particular passage, we can begin to extract its n-grams and check which n-grams are the most popular.
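As a sketch of how this might look in practice, the following uses nltk’s word_tokenize function on a short passage; the sample sentence is the one quoted earlier in this section.

import nltk
from nltk.tokenize import word_tokenize

# The punkt tokenizer models may need to be downloaded the first time:
# nltk.download("punkt")

data = "Just before nine o'clock Sherlock Holmes stepped briskly into the room."

# Split the passage into word-level tokens and print them.
tokens = word_tokenize(data)
print(tokens)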
Output: a list of the individual word tokens. In the above program, we use the word_tokenize function from nltk to tokenize the data.
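Building on that, a sketch of a program that generates n-grams with nltk’s ngrams function might look like the following; the sample text is the Sherlock Holmes quotation used above, and printing unigrams through trigrams is just an illustrative choice.

from nltk.tokenize import word_tokenize
from nltk.util import ngrams

def extract_ngrams(text, n):
    # Tokenize the text and return the list of n-grams of size n.
    tokens = word_tokenize(text)
    return list(ngrams(tokens, n))

data = ("How often have I said to you that when you have eliminated the "
        "impossible, whatever remains, however improbable, must be the truth?")

# Print unigrams, bigrams, and trigrams of the sample text.
for n in range(1, 4):
    print(f"{n}-grams:", extract_ngrams(data, n))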
Output: the unigrams, bigrams, trigrams, and so on extracted from the text. In the above program, we import ngrams from the nltk library, create a function to generate the n-grams, and print the unigrams, bigrams, trigrams, and so on.
The next program calculates the most frequent bigrams in a text file named English-KJV.text; the bigrams are computed using the bigrams function from nltk.
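A sketch of such a program, assuming a plain-text file named English-KJV.text is present in the working directory, might look like this:

from collections import Counter
from nltk.tokenize import word_tokenize
from nltk import bigrams

# Read the corpus from the file (assumed to be in the working directory).
with open("English-KJV.text") as f:
    text = f.read().lower()

# Tokenize the text and count how often each bigram occurs.
tokens = word_tokenize(text)
bigram_counts = Counter(bigrams(tokens))

# Print the ten most frequent bigrams and their counts.
for bigram, count in bigram_counts.most_common(10):
    print(bigram, count)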
Output: the most frequent bigrams in the text file, together with how many times each one occurs.
Text categorization is a classification problem in which we take some text and place it into one of several categories. Every time we have a sample of text, we have to put it inside some category.
For example, given a product review, we might want to classify it as positive or negative; this kind of task is known as sentiment analysis.
The Bag-of-Words Model is an approach where we model a sample of text without caring about its structure, caring only about the unordered collection of words inside the sample. Basically, we pay no attention to the sequence of the words or to which noun goes with which adjective; we only care about the words themselves.
Let’s analyze a few sentences:
“My son loved it! It was fun!”
“Table broke after a few days.”
“This was one of the best games I’ve played in a long time.”
“This is kind of cheap and flimsy, not worth it.”
Looking only at the words in each sentence and ignoring the grammar, we can conclude that sentences 1 and 3 are positive because they contain words like “loved”, “fun”, and “best”, and that sentences 2 and 4 are negative because they contain words like “broke”, “cheap”, and “flimsy”.
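As a rough sketch of this idea, the four reviews above can be reduced to unordered collections of word counts, for example with Python’s collections.Counter; the lowercasing and punctuation stripping here are just one simple way to do it.

import string
from collections import Counter

reviews = [
    "My son loved it! It was fun!",
    "Table broke after a few days.",
    "This was one of the best games I've played in a long time.",
    "This is kind of cheap and flimsy, not worth it.",
]

for review in reviews:
    # Strip punctuation, lowercase, and split into words, keeping
    # only the unordered counts of each word (the "bag of words").
    words = review.lower().translate(
        str.maketrans("", "", string.punctuation)).split()
    print(Counter(words))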
This approach tends to work well for classifying text into categories like positive or negative sentiment. There are different approaches to classification, but in Natural Language Processing the most popular is Naïve Bayes.
Naïve Bayes classifiers are based on Bayes theorem. They are a family of classification algorithms that share a common principle: every pair of features being classified is assumed to be independent of each other.
Bayes Theorem
Bayes theorem gives a formula, also known as the Bayes formula, for calculating the conditional probability of an event. It is expressed as:

P(A | B) = P(B | A) × P(A) / P(B)

Where:

P(A | B) is the probability of event A given that event B has occurred,
P(B | A) is the probability of event B given that event A has occurred,
P(A) and P(B) are the probabilities of events A and B on their own.
For example, for sentiment analysis, we would use the above formula to find the conditional probability, i.e.
P (sentiment | text)
For example, P (positive | “my son loved it”)
First, we tokenize the input so that we get
P (positive | “my”, “son”, “loved”, “it”)
Now, after applying Bayes rule, we get the following expression:

P (“my”, “son”, “loved”, “it” | positive) × P (positive) / P (“my”, “son”, “loved”, “it”)

This expression gives the exact value of the probability.
Now, in the above expression, we can see that the denominator stays the same whether we are computing the positive or the negative probability, since it doesn’t mention positive or negative at all. So we can say that the probability is proportional only to the numerator.
For now we can ignore the denominator; we know what the probability is proportional to, and at the end we can normalize the resulting values to make sure the probability distribution ultimately sums up to 1.
Now, using the definition of joint probability, the numerator P (“my”, “son”, “loved”, “it” | positive) × P (positive) can be rewritten as the joint probability:

P (positive, “my”, “son”, “loved”, “it”)
Now the question is: how do we calculate this joint probability?
Calculating it exactly would require knowing how all the words interact with each other, so Naïve Bayes makes the naive assumption that the words are independent of one another given the sentiment. Under that assumption, we can approximate the joint probability by multiplying the individual conditional probabilities.
Now, we can convert the above expression to:

P (positive) × P (“my” | positive) × P (“son” | positive) × P (“loved” | positive) × P (“it” | positive)
Now, we can see that the calculation looks more complex, but we can compute these probabilities when we have some data available to us, and this is what Natural Language Processing is about, i.e., analyzing data. If we have data with a bunch of reviews labeled as positive or negative, then we can estimate the probability of a review being positive or negative, and the probability of each word appearing in positive and negative reviews:

P (positive) = number of positive samples / number of total samples

And,

P (“loved” | positive) = number of positive samples containing “loved” / number of positive samples
For example,
Suppose the data available to us is summarized in the two tables below.

The table below shows the total positive and total negative probabilities:

positive    0.49
negative    0.51

And, the next table shows the positive and negative probabilities of each word, that is, how likely the word is to appear in a positive or a negative sentence:

word     positive    negative
my       0.30        0.20
son      0.01        0.02
loved    0.32        0.08
it       0.30        0.40
Now we have to calculate whether the sentence “My son loved it” is positive or negative, so we calculate the values of the expressions given below:
P (positive | “my”, “son”, “loved”, “it”) = 0.49 × (0.30 × 0.01 × 0.32 × 0.30) = 0.00014112
P (negative | “my”, “son”, “loved”, “it”) = 0.51 × (0.20 × 0.02 × 0.08 × 0.40) = 0.00006528
As we can see above, the positive and negative values don’t make sense as probabilities on their own (they don’t sum to 1), so we need to normalize them by dividing each value by the sum of the two.
P (positive | “my”, “son”, “loved”, “it”) = 0.6837
P (negative | “my”, “son”, “loved”, “it”) = 0.3163
Now we can say that it is a positive sentence with 68.37% probability.
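To make the arithmetic concrete, here is a small sketch that carries out the same calculation in Python, with the probabilities from the tables above hard-coded as dictionaries; in a real system they would be estimated from the labeled data.

# Prior probabilities of each sentiment (from the first table).
priors = {"positive": 0.49, "negative": 0.51}

# P(word | sentiment) for each word (from the second table).
word_probs = {
    "positive": {"my": 0.30, "son": 0.01, "loved": 0.32, "it": 0.30},
    "negative": {"my": 0.20, "son": 0.02, "loved": 0.08, "it": 0.40},
}

tokens = ["my", "son", "loved", "it"]

# Multiply the prior by the conditional probability of each word.
scores = {}
for sentiment in priors:
    score = priors[sentiment]
    for word in tokens:
        score *= word_probs[sentiment][word]
    scores[sentiment] = score

# Normalize so that the two values sum to 1 and print the result.
total = sum(scores.values())
for sentiment in scores:
    print(sentiment, round(scores[sentiment] / total, 4))

Running this prints approximately 0.6837 for positive and 0.3163 for negative, matching the values worked out above.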
One problem with this approach is that if a word has never appeared in a certain type of sentence in our data, its estimated probability becomes zero. Suppose none of the positive sentences in our data contained the word “son”; then P(“son” | positive) would be zero, and the final value calculated for the positive case would also come out as zero, no matter what the other words are. To avoid this kind of scenario, there are different smoothing methods, such as additive smoothing, where we add a small value α to each word count before estimating the probabilities, and Laplace smoothing, where we add 1 to each count, pretending we have seen each word one more time than we actually have.
These are the approaches we can use while applying Naïve Bayes. Given enough training data, we can train our AI to look at natural language, human words, and categorize them accordingly.