Fundamental concepts in NLP
- Distinction between Natural Languages and Artificial Languages.
- Natural languages are those that have naturally evolved, like English, German, etc.
- Artificial languages are those that are invented for a specific purpose, like programming languages or constructed languages such as Dothraki.
NLP is either NLU or NLG:
- When people say NLP, they mean NLU or NLG.
- NLU: Natural language understanding. (70%) - most companies are doing this.
- NLG: Natural language generation. (30%) - many want to do this.
- Rise of NLG.
Applications of NLP:
- Text annotation: Given a web page or document, you want to annotate what the document is about. These annotations can later be used for SEO, for instance.
- Tagging : Example could be creating a word cloud of a document or a web page or a corpus of tweets. The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging.
- Metadata extraction : Metadata is not part of the document’s content; it is information about the document. Example: extracting the name of the author from a text/document. You use NLU to figure out exactly where the author’s name appears.
- Classification :
- Document summarization : Most document summarization is NLU summarization. Advanced things like NLG summarization are hard to do.
- Corpus analytics: In corpus analytics, you are not looking just at one single document, but you are looking at the entire corpus of documents.
- Theme extraction :
- Clustering : You can cluster the documents. Ex: take a bunch of recipes for everything under the sun, and if you run the right kind of NLU algorithm, then maybe all the “sweet” and “spicy” recipes form clusters. Document clustering is an unsupervised learning process, where we try to segment documents into separate categories by having the machine learn about the various text documents, their features, similarities, and the differences among them.
- Taxonomy analysis :
- Sentiment analysis :
- Search applications:
- Query repair: “Did you mean?”
- Query refinement: “Did you mean NLP as in Natural Language Processing or Neuro-Linguistic Programming?”
- Results postprocessing: Search engine results show the most relevant text in the search results.
What is the need for NLP ?
Computers are great at working with structured data like spreadsheets and database tables. But us humans usually communicate in words, not in tables. That’s unfortunate for computers.
A lot of information in the world is unstructured — raw text in English or another human language. How can we get a computer to understand unstructured text and extract data from it?
Natural Language Processing, or NLP, is the sub-field of AI that is focused on enabling computers to understand and process human languages. Let’s check out how NLP works and learn how to write programs that can extract information out of raw text using Python!
Can computers understand language ?:
As long as computers have been around, programmers have been trying to write programs that understand languages like English. The reason is pretty obvious — humans have been writing things down for thousands of years and it would be really helpful if a computer could read and understand all that data.
Computers can’t yet truly understand English in the way that humans do — but they can already do a lot! In certain limited areas, what you can do with NLP already seems like magic. You might be able to save a lot of time by applying NLP techniques to your own projects.
And even better, the latest advances in NLP are easily accessible through open source Python libraries like spaCy, textacy, and neuralcoref. What you can do with just a few lines of python is amazing.
Extracting Meaning from text is hard !!:
The process of reading and understanding English is very complex — and that’s not even considering that English doesn’t follow logical and consistent rules. For example, what does this news headline mean?
“Environmental regulators grill business owners over illegal coal fires”
Are the regulators questioning business owners about burning coal illegally? Or are the regulators literally cooking the business owners? As you can see, parsing English with a computer is going to be complicated.
Doing complicated things means building a pipeline:
Doing anything complicated in machine learning usually means building a pipeline. The idea is to break up your problem into very small pieces and then use machine learning to solve each smaller piece separately. Then by chaining together several machine learning models that feed into each other, you can do very complicated things.
What is the need for NLP pipeline?:
And that’s exactly the strategy we are going to use for NLP. We’ll break down the process of understanding English into small chunks and see how each one works.
Building an NLP pipeline, step-by-step:
Let’s look at this piece of text:
London is the capital and most populous city of England and the United Kingdom. Standing on the River Thames in the south east of the island of Great Britain, London has been a major settlement for two millennia. It was founded by the Romans, who named it Londinium.
This paragraph contains several useful facts. It would be great if a computer could read this text and understand that:
- London is a city
- London is located in England
- London was settled by Romans and so on.
But to get there, we have to first teach our computer the most basic concepts of written language and then move up from there.
Step 1: Sentence Segmentation: The first step in the pipeline is to break the text apart into separate sentences. That gives us a list of sentences to work with. We can use NLTK’s sentence tokenizer to get this step done.
Step 2: Word Tokenization: Now that we’ve split our document into sentences, we can process them one at a time. The next step in our pipeline is to break this sentence into separate words or tokens. This is called tokenization. Tokenization is easy to do in English. We’ll just split apart words whenever there’s a space between them. And we’ll also treat punctuation marks as separate tokens since punctuation also has meaning. This can also be easily accomplished using NLTK’s word tokenizer.
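Steps 1 and 2 can be sketched with nothing but regular expressions. This is a minimal stand-in, not NLTK's tokenizers: the function names `split_sentences` and `tokenize` and the splitting rules are assumptions made for illustration, and they will stumble on abbreviations such as "Dr." that a trained sentence tokenizer handles.

```python
import re

text = (
    "London is the capital and most populous city of England and the United Kingdom. "
    "Standing on the River Thames in the south east of the island of Great Britain, "
    "London has been a major settlement for two millennia. "
    "It was founded by the Romans, who named it Londinium."
)

def split_sentences(text):
    # Naive rule (assumption): a sentence ends at '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    # A token is either a run of word characters (with optional internal
    # apostrophes) or a single punctuation mark.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)

sentences = split_sentences(text)
tokens = tokenize(sentences[0])
print(len(sentences))
print(tokens)
```

Punctuation comes out as separate tokens, as described above, so the final token of the first sentence is the full stop.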
Step 3: Predicting Parts of Speech for Each Token: Next, we’ll look at each token and try to guess its part of speech — whether it is a noun, a verb, an adjective and so on. Knowing the role of each word in the sentence will help us start to figure out what the sentence is talking about.
We can do this by feeding each word (and some extra words around it for context) into a pre-trained part-of-speech classification model:
The part-of-speech model was originally trained by feeding it millions of English sentences with each word’s part of speech already tagged and having it learn to replicate that behavior.
Keep in mind that the model is completely based on statistics — it doesn’t actually understand what the words mean in the same way that humans do. It just knows how to guess a part of speech based on similar sentences and words it has seen before.
After processing the whole sentence, we’ll have a result like this:
With this information, we can already start to glean (extract) some very basic meaning. For example, we can see that the nouns in the sentence include “London” and “capital”, so the sentence is probably talking about London.
What are Neural word embeddings?
The vectors we use to represent words are called neural word embeddings, and representations are strange. One thing describes another, even though those two things are radically different. As Elvis Costello said: “Writing about music is like dancing about architecture.” Word2vec “vectorizes” words, and by doing so it makes natural language computer-readable: we can start to perform powerful mathematical operations on words to detect their similarities.
So a neural word embedding represents a word with numbers. It’s a simple, yet unlikely, translation.
word2vec trains words against other words that neighbor them in the input corpus. It does so in one of two ways:
- either using context to predict a target word (a method known as continuous bag of words, or CBOW)
- or using a word to predict a target context, which is called skip-gram.
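The difference between the two training schemes shows up in how the training examples are built from a sentence. The sketch below only generates the (input, label) pairs; it does not train a network, and the helper names and the `window` parameter are assumptions made for illustration rather than part of any word2vec library.

```python
def context_windows(tokens, window=2):
    """Yield (target, context_words) for every position in tokens."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

def cbow_examples(tokens, window=2):
    # CBOW: the context words are the input, the centre word is the label.
    return [(context, target) for target, context in context_windows(tokens, window)]

def skipgram_examples(tokens, window=2):
    # Skip-gram: the centre word is the input, each context word is a label.
    return [
        (target, c)
        for target, context in context_windows(tokens, window)
        for c in context
    ]

tokens = "london is the capital of england".split()
print(cbow_examples(tokens, window=1)[1])
print(skipgram_examples(tokens, window=1)[:3])
```

With a window of 1, the word "is" yields one CBOW example (both neighbours predict "is") but two skip-gram examples ("is" predicts each neighbour in turn).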
Paper Summary: Efficient Estimation of Word Representations in Vector Space:
- We propose two novel model architectures for computing continuous vector representations of words from very large data sets.
Goal: The main goal of this paper is to introduce techniques that can be used for learning high-quality word vectors from huge data sets with billions of words, and with millions of words in the vocabulary.
Many different types of models were proposed for estimating continuous representations of words, including the well-known Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA).
A lot of people reference this page.
- What is word embedding ?
Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing where words or phrases from the vocabulary are mapped to vectors of real numbers.
- Word embeddings are another name for word vectors. The idea is that we can start to make sense of words mathematically.
- These are vectors that represent words. A vector’s length and direction carry meaning.
- A vector is a sequence of numbers. We don’t say that the first number represents gender, and so on; individual dimensions are not assigned a meaning.
- Word vectors typically end up with 50-300 dimensions.
- What is Word2Vec?
Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.
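"Close proximity" in the vector space is commonly measured with cosine similarity, the cosine of the angle between two vectors. The sketch below uses hand-made three-dimensional vectors purely for illustration; they are assumptions, not the output of a trained word2vec model, and real vectors would have far more dimensions.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings (not learned from any corpus).
toy_vectors = {
    "london": [0.9, 0.8, 0.1],
    "paris": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.2, 0.9],
}

print(cosine_similarity(toy_vectors["london"], toy_vectors["paris"]))
print(cosine_similarity(toy_vectors["london"], toy_vectors["banana"]))
```

In this toy setup the two city vectors point in nearly the same direction, so their similarity is much higher than that of "london" and "banana".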
What is a bag-of-words model?:
We will begin by using word counts with a bag-of-words approach. Bag of words is a very simple and basic method for finding topics in a text.
For bag of words, you need to first create tokens using tokenization, and then count up all the tokens you have. The theory is that the more frequent a word or token is, the more central or important it might be to the text.
Bag of words can be a great way to determine the significant words in a text, based on the number of times they are used.
We can use NLP fundamentals such as tokenization with NLTK to create a list of tokens.
We will use the Counter class from the built-in collections module.
The list of tokens generated using word_tokenize can be passed as the initialization argument for the Counter class.
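A minimal sketch of the counting step follows. To keep it runnable without downloading NLTK data, a simple regular expression stands in for `word_tokenize`; that substitution and the sample sentence are assumptions made for illustration.

```python
import re
from collections import Counter

text = "The cat sat on the mat. The cat liked the mat."

# Stand-in tokenizer (assumption): lowercase the text and keep runs of letters.
tokens = re.findall(r"[a-z']+", text.lower())

# The list of tokens is passed straight to Counter, which tallies each one.
counts = Counter(tokens)
print(counts.most_common(3))
```

The most frequent token here is "the", which hints at why stop words are usually removed before raw counts are interpreted.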
Levels of Analysis in NLP
Levels of Analysis: Lower level
So we started at a very crude level of words. Then we learned how to put words together grammatically. Then we learned that depending on context, a word can mean different things. And then we even learned how to infer statements that were unspoken but implied by spoken sentences. And we do all that by age 3 or 4 or 5 or whatever. Absolutely amazing.
If you understand those levels right there, if you understand how every one of us already went– we sort of graduated up through those levels– then you understand, essentially, all the levels of NLP because they’re the same as that, more or less. We have what we call lexical analysis. A lexicon means a dictionary. It’s the definitive list of all the words in our vocabulary. And so that’s just trying to identify what word it is.
And it’s very nontrivial in speech recognition because words run together. You have to separate phonetic sounds into individual words– and so just figuring out what word was intended. Even in written texts, we have typos and things. We have new words being invented. So just figuring out what word we’re looking at is sort of the elementary level.
Then we get into grammar. We parse sentences. We tag different parts of speech. This is a noun. This is a verb. This is an adjective. That’s the next thing we do. It assumes that you’ve already kind of normalized the vocabulary to what words there are before you move up to grammar. So we have that thing of levels, moving up levels.
Then we get into semantics. We normally do not do a lot of semantics in NLP until after we’ve already done some syntax. So in other words, we’ve usually parsed sentences out into this is a verb, this is a noun. And only then do we go, OK, what does this verb mean compared to other verbs, what does this noun mean. So we do semantics usually after grammatical analysis or syntax.
When would you need to do stemming ? We might have a custom vocabulary on purpose that’s not the whole English language. It’s just the words that our organization cares about. And we would just pull only those words out of a corpus of documents and construct a cloud like this, right?
So to do this, this is where the stemming would come in handy, right? Because you don’t want to get an improper count of how many times a word occurred, because you had moral, morals, you had the plural, the singular, you had the present, the past tense of a verb. And you don’t want to be counting those as if those were all separate words that have nothing to do with each other. You want to stem all the words first.
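The effect of stemming on counts can be sketched with a deliberately naive suffix stripper. This is not the Porter stemmer or any NLTK stemmer; the function `naive_stem` and its suffix list are assumptions for illustration and will mis-stem many real words.

```python
from collections import Counter

def naive_stem(word):
    # Strip one common inflectional suffix, keeping at least three letters.
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[:-len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word

words = ["moral", "morals", "run", "running"]

raw_counts = Counter(words)
stemmed_counts = Counter(naive_stem(w) for w in words)
print(raw_counts)
print(stemmed_counts)
```

Without stemming, the four surface forms are counted as four unrelated words; after stemming they collapse into two entries with a count of two each.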
What are some of the applications of lexical analysis ? OR Why is it important to do lexical analysis ?
- Spell checking is essentially lexical analysis.
- You can also use lexical analysis to disambiguate a word by looking it up in a good lexical knowledge base like WordNet.
- Terminology extraction. Terminology extraction means here’s a big pile of documents, right, and extract for me the key terms, the key terminology across this collection of documents. And so what we can do there is we can compare the words in the documents to all the words of the language. We can start gathering statistics, and we can look at word frequencies. And so we don’t have to do syntax. We don’t have to look at grammar. We don’t have to look at semantics.
- Just having a good dictionary and being really good with statistics, we can extract the most prominent terminology from a collection of documents. So these are things we can do while we stay down on that lower level of analysis, just lexical. We don’t have to move up into the world of syntax or semantics very much. Just if we have a good lexicon and we’re good at counting statistics of things, we can make a lot of really cool applications.
- Another one would be lexical diversity. Lexical diversity is important in the field of education for vocabulary building and so on. So a simple lexical diversity measure is the total unique vocabulary, for example, in a book compared to how long the book is. So you just divide one over the other and you get a number.
- You could also use this in student essays, right? Very low lexical diversity is not a sign of being a good writer. If you’re a teacher in education, trying to teach 10th graders or even college students how to write good essays, you can imagine automatically measuring the lexical diversity, and you have a threshold that, if it’s really low, you know that this student just is kind of using a small vocabulary and is just using the same words over and over again.
- If they’re writing about a baby, they probably just say baby, baby, baby. They don’t say baby, infant, toddler, youngster, little one. They don’t vary their vocabulary very much. They don’t have a lot of lexical diversity. And that would be a signal for you to intervene and say, hey, maybe vary your language a little bit more. So we use lexical measures and using AI to automate and add scalability to language and writing education. Lots of other applications, but these are some of them.
- So we’ll discover more applications of lexical analysis as we continue down the road. And you’ll see that it’s amazing what you can do without even having to get into grammar or semantics. A lot of times people in NLP make the mistake of getting so excited and interested in syntax and semantics that they just take shortcuts and skip over the lexical part. They’ll use a not very good dictionary and then just race forward to trying to do really fancy semantics on it.
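The simple lexical diversity measure described above (unique vocabulary divided by total length) can be sketched as follows. The regular-expression tokenizer and the two sample texts are assumptions made for illustration.

```python
import re

def lexical_diversity(text):
    # Ratio of unique tokens to total tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

repetitive = "The baby cried. The baby slept. The baby ate."
varied = "The baby cried. The infant slept. The toddler ate."

print(lexical_diversity(repetitive))
print(lexical_diversity(varied))
```

The essay that keeps repeating "baby" scores lower than the one that varies its vocabulary. Because this ratio tends to fall as texts get longer, it is best compared across texts of similar length.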
Syntax analysis: Syntax analysis works basically sentence by sentence. Grammar works by making a complete sentence, so you have to feed a grammar parser one sentence at a time. If you can’t first divide the text into sentences, then you can’t get grammar parsing off the ground. So we first break a text into sentences, and before we do full grammar parsing, we typically just tag the parts of speech.
When would you do Syntax analysis ? In a nutshell, anytime you’re trying to really get at the way words relate to each other in patterns, and the patterns have rules, then what you’re doing is syntactical. And that’s an important thing to do, and you usually have to do that before you can really get into the semantics very much.
Stemming: reducing a word to its root (the stem) by stripping affixes such as suffixes. Example: stemming “running” gives “run”.
Lemmatization: Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
Normalization: Normalizing means getting consistent nomenclature and units of measure for how information is conveyed.
How does syntactic analysis get used in semantic analysis ? So here’s an example where, you can imagine, we have already done the named-entity recognition, and we found Tim Cook. That’s a person.
And we found Apple. That’s a corporation. But then there’s something else, CEO. What’s that?
CEO is not a named entity. That’s the relationship between the other two entities. OK. And so in order to do this, you’ve got to have syntax analysis at your disposal.
You've got to be able to look at the grammar of the sentence and see what's attached to what. (Remember: dependency parsers and constituency parsers ? )
Essence of Semantic analysis: Because somebody’s name might be mentioned in the same sentence as Apple, it doesn’t mean they’re the CEO of Apple, right? You’ve got to put a bunch of this stuff together in order to make that inference. And it means that you have to get into the world of semantics because you have to have a representation of the order of things in the world, a representation of what the world is like.
What is Ontology ? (20 Questions game) It really involves something we call ontology. What is ontology? It just sort of means an organized view of all the types of beings that there are and the relations they can have one to another. So when you used to play the game 20 Questions, what is it?
And you’d say, is it animal, mineral, or vegetable? That’s doing ontology. Everything’s an animal, mineral, or vegetable, and there’s animals and then there’s people. And then there’s plants, and then there’s companies, and you know, there’s CEOs. And a person can be a CEO, but a dog or a table cannot be.
So knowing what kinds of things there are and what kinds of relationships they’re allowed to have with each other in order to make sense, that’s having an ontology.
And that means you’re doing semantics. And so when you have all of that at your disposal, then you can look at this sentence– your software can look at it with an algorithm– and say, oh, this means that Tim Cook has a relationship to Apple.
A person has a relationship to a company. What kind of relationship? The “being CEO of” relationship, Tim Cook is the CEO of Apple, and you can go on and on and on and on and on with that.
So that’s taking a step beyond regular named-entity recognition with a lot of semantics to extract a lot of really interesting relationships. Now you don’t just know what the entities are. You know how they relate to one another.
Types of Discourse Analysis:
Paper 2: GloVe: Global Vectors for Word Representation:
Semantic vector space models of language represent each word with a real-valued vector.
These vectors can be used as features in a variety of applications, such as information retrieval (Manning et al., 2008), document classification (Sebastiani, 2002), question answering (Tellex et al., 2003), named entity recognition (Turian et al., 2010), and parsing (Socher et al., 2013).
Most word vector methods rely on the distance or angle between pairs of word vectors as the primary method for evaluating the intrinsic quality of such a set of word representations.
Trade-offs in NLP
Understand the trade-offs in different approaches to NLP.
What are some of the trade-offs you come across while doing NLP ?
- Shallow vs Deep NLP: Most of the time you won’t need to do deep NLP. Deep parsing of each and every sentence, discourse analysis and the like are considered deep NLP. Shallow NLP will suffice most of the time. Deep parsing may also introduce more errors than shallow parsing. Shallow has a lot of advantages. Sometimes deep is overkill. Sometimes it isn’t really needed to deliver the actionable insight your business stakeholders need. So there is no right or wrong answer. That’s a big trade-off.
- Statistical vs Symbolic:
- Feature Engineering vs Feature Learning:
- Top down vs Bottom up:
- Transparent vs Opaque AI:
Are you going to do shallow or deep NLP ?:
I’m not talking about doing shallow learning versus deep learning with neural nets and machine learning. I’m using the word shallow and deep in a different way, more from the semantics point of view, or even the syntax.
- Are you going to do a really deep parse of every sentence?
- Are you going to do really deep semantics and know every nuance of every word sense of every word in every sentence?
That would be really deep NLP.
Or are you going to do something shallow, where you just kind of scrape the surface of all the documents, and you pull out sort of a lightweight representation of every document and do a shallow treatment?
Statistical vs symbolic
Feature Engineering vs Feature Learning:
If you don’t use a human, then you’re doing what’s called feature learning. You’re having the machine learn the features. If you have a human tell the machine what features it should be considering, then you’re doing feature engineering.
Top-down vs Bottom-up
Transparent vs Opaque
Precision and Recall:
Precision and recall, do you remember that from your machine learning class maybe, or something else? If you had that class, precision is out of the times that the system said this is a conservative post, how many times was it right? 84% of the time. Right? Recall is kind of the converse. Out of all the posts that were actually conservative, how many did it recognize, did the system recognize, and say, yeah, this is conservative? 61%. Right?
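Both measures fall out of three counts: true positives, false positives and false negatives. The sketch below computes them for a tiny set of hypothetical labels; the label lists are assumptions made for illustration and do not reproduce the 84% and 61% figures quoted above.

```python
def precision_recall(y_true, y_pred, positive):
    pairs = list(zip(y_true, y_pred))
    # Predicted positive and actually positive.
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    # Predicted positive but actually something else.
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    # Actually positive but the system missed it.
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical gold labels and system predictions for six posts.
y_true = ["conservative", "conservative", "conservative", "conservative", "other", "other"]
y_pred = ["conservative", "conservative", "other", "other", "conservative", "other"]

precision, recall = precision_recall(y_true, y_pred, positive="conservative")
print(precision, recall)
```

Here the system labelled three posts as conservative and was right on two of them (precision 2/3), while it found only two of the four truly conservative posts (recall 1/2).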
What is the relation between NLP and Data Science?:
You use NLP to create a structured representation from unstructured data.
On the flip side, NLP also uses Machine Learning to do certain things like Document Clustering.
What does it mean to do text preprocessing?
Before we do most NLP, we want to be able to examine individual sentences and individual words. It means we need to do:
- Sentence segmentation (sentence tokenization). Use NLTK’s sentence tokenizer; there are many others that also do a fairly decent job.
- Lexical analysis (word tokenization). Word tokenization is not that easy, since we need to deal with stripping punctuation and quotes and expanding contractions.
What does it mean to do text Normalization?
To normalize a word-tokenized text source, we often address the following:
- Expand contractions
- Remove stop words. Sometimes you may want to add your own stopwords based on the problem you are solving.
- Deal with misspelled words. (Thanks, Twitter)
- Stemming - not always necessary. (I personally hate this).
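The first two normalization steps in the list above can be sketched in a few lines. The contraction table and stop-word set below are tiny hand-written samples, not a complete resource, and the `normalize` function and its `extra_stop_words` parameter are assumptions made for illustration.

```python
import re

# Small illustrative tables (assumptions); real lists are much longer.
CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "cannot", "we'll": "we will"}
STOP_WORDS = {"a", "an", "the", "is", "it", "do", "not", "we", "will", "to", "of"}

def normalize(text, extra_stop_words=()):
    # Lowercase and unify curly apostrophes so the contraction lookup matches.
    text = text.lower().replace("\u2019", "'")
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    tokens = re.findall(r"[a-z]+", text)
    # Application-specific stop words are merged with the default set.
    stop = STOP_WORDS | set(extra_stop_words)
    return [t for t in tokens if t not in stop]

print(normalize("It's the capital. We'll visit, don't worry."))
print(normalize("It's the capital. We'll visit, don't worry.", extra_stop_words=["visit"]))
```

The second call shows how a problem-specific stop word can be added without touching the default list.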
How do you deal with stop-words?
There are content words, function words and stop words.
- Function words:
- are a closed class, meaning they are a fixed list to which no new words are added.
- lack content, in that they don’t answer the “6 Ws” of who, what, why, when, where, and how (though they may set up the structure for an answer).
- Examples are is, am, are, was, were, he, she, you, we, they, if, then, therefore, possibly.
- Content words:
- are an open class, meaning they are an unfinished list to which new words are readily added.
- bear “content” in that they answer the “6 Ws” of who, what, why, when, where, and how.
- Examples are dog, cat, angrily, run, clever, democracy, green, God, wisdom.
- So the two sets of function and content words do not intersect, and function words are usually outnumbered by content words.
- Stop words: Like function words, but the stop-word list is application-specific. For example, if we want to gather only personal names from text, we may add personal titles to the stop-word list: Mr., Mrs., Ms., Dr., Prof., Fr., etc.
How to deal with misspellings? Two approaches:
- Edit-distance method
- Fuzzy string compare
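Both approaches can be sketched with the standard library. The edit-distance function below is a plain Levenshtein implementation, and the fuzzy comparison uses `difflib.get_close_matches`; the small vocabulary and the `correct` helper are assumptions made for illustration.

```python
import difflib

def edit_distance(a, b):
    # Dynamic-programming Levenshtein distance, one row at a time.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]

# Hypothetical vocabulary of known-good words.
vocabulary = ["london", "capital", "england", "kingdom"]

def correct(word):
    # Edit-distance method: pick the vocabulary word with the fewest edits.
    return min(vocabulary, key=lambda candidate: edit_distance(word, candidate))

print(edit_distance("kitten", "sitting"))
print(correct("lodnon"))
# Fuzzy string compare: difflib ranks candidates by a similarity ratio.
print(difflib.get_close_matches("lodnon", vocabulary, n=1))
```

The edit-distance method counts single-character edits, while `difflib` scores the proportion of matching characters, so the two can rank candidates differently on longer words.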
Lexical Knowledge Base
A lexicon goes beyond a “dictionary” in that it is machine-readable and carries the information needed to perform major NLP functions, such as:
- Parts of speech
- Transitive vs intransitive verbs
Transitive verbs: e.g. hit something (hit is a transitive verb, because it needs an object to act on).
Intransitive verbs: e.g. scream (scream is an intransitive verb, because it can stand on its own without an object).
This type of information is stored in a lexicon.
Lexical Knowledge bases are built around a lexicon. They go further, in that, they create a rich database of how all these words relate to each other.
Lexical knowledge bases are things like WordNet, which holds far more knowledge about words than a plain lexicon does.
What is WordNet ?
- WordNet is a lexical knowledge base made by Princeton University and is widely used in NLP.
- WordNet is a lexical database for the English language. It groups English words into sets of synonyms, or synsets, provides short definitions and usage examples, and records a number of relations among these synonym sets or their members.
- WordNet can thus be seen as a combination of dictionary and thesaurus.
- While it is accessible to human users via a web browser, its primary use is in automatic text analysis and artificial intelligence applications.
- WordNet includes the lexical categories: nouns, verbs, adjectives and adverbs but ignores prepositions and determiners.
- Words from the same lexical category that are roughly synonymous are grouped into synsets.
A lexicon is more than a dictionary. A lexical knowledge base contains more information than a lexicon, and is built around one. You know what a thesaurus is: it captures synonyms and antonyms. WordNet goes beyond that; it has things like hyponyms and hypernyms, holonyms and meronyms, and more.
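Hyponym: a word of more specific meaning than a general or superordinate term applicable to it. For example, spoon is a hyponym of cutlery.
Hypernym: a word with a broad meaning that more specific words fall under; a superordinate. For example, color is a hypernym of red.
Holonym: a word that names the whole of which another word names a part. For example, car is a holonym of wheel.
Meronym: a word that names a part of a larger whole. For example, wheel is a meronym of car.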
WordNet contains all this above and beyond what a thesaurus contains.
What is the use of all these ? So how does this help us?
Well, we talked recently about feature extraction, pulling features out of documents that are illustrative of what makes a document different from all the other documents around it. And so we can improve that feature extraction by using WordNet. Because what we’ve been doing with feature extraction is we’ve just been like counting up when the same word occurs.
But think about the problems with that. If I count up all the times that a document has a certain word– say the word dog. And I count how many times another document also says the word dog, that’s an obvious way to compare them. If this document says dog a lot and so does this one, that’s one clue that maybe these documents are similar.
But what if this document says Springer Spaniel, Springer Spaniel, Springer Spaniel, and it doesn’t actually say dog exactly. Then my technique would miss that. I would look at it as though there was no relationship or no closeness whatsoever. That seems wrong. But a lexical knowledge base like WordNet can save me from that.
Because if I use it properly, it can let me know, wait a second. You know, the Springer Spaniel is a kind of dog. It’s close to it. In the big wide space of WordNet that stretches all over hundreds of thousands of words, these words are sort of close in semantic space. And so it could let me know that those deserve being treated as comparables. And so we could start talking about lexical space or semantic space or WordNet space, and measuring the distance therein.
Calculate the ontological distance between table and chair by looking in WordNet:
The lower the ontological distance between two words, the more semantically similar they are.
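The idea of ontological distance can be sketched as a shortest-path search over "is-a" links. The tiny taxonomy below is hand-written for illustration and is not WordNet's actual data; the names `HYPERNYM`, `neighbours` and `ontological_distance` are likewise assumptions. With NLTK and the WordNet corpus installed, the same question would be asked of real synsets instead.

```python
from collections import deque

# Hand-written toy "is-a" links (assumption): child -> parent.
HYPERNYM = {
    "chair": "seat",
    "seat": "furniture",
    "table": "furniture",
    "furniture": "artifact",
    "spaniel": "dog",
    "dog": "animal",
    "artifact": "entity",
    "animal": "entity",
}

def neighbours(word):
    # A word is linked to its parent and to all of its children.
    up = [HYPERNYM[word]] if word in HYPERNYM else []
    down = [child for child, parent in HYPERNYM.items() if parent == word]
    return up + down

def ontological_distance(start, goal):
    # Breadth-first search returns the number of links on the shortest path.
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        word, dist = queue.popleft()
        if word == goal:
            return dist
        for nxt in neighbours(word):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two words are not connected

print(ontological_distance("chair", "table"))
print(ontological_distance("chair", "dog"))
```

In this toy graph, chair and table are three links apart because they meet at "furniture", while chair and dog only meet at the very top of the hierarchy and are six links apart.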
Calculate the distance between two documents in WordNet space
What are the limitations of using a WordNet ?
- Some words have several senses. For instance, the word chair can mean a piece of furniture, the person in charge of an academic department, or an instrument for executing criminals. When we did the example of chair and table, we assumed that chair meant furniture. But how do we really know that ? Well, it was sense number one. WordNet attempts to put the most common sense of a word first.
Why is getting a word’s sense so important ?
And honestly, that’s what we do in a lot of NLP projects. We don’t try to be certain which one of these senses was meant. We assume that it’s sense one. In general, sense number one is the correct sense something like 70% of the time, depending on the corpus, depending on the application. In other words, you’ll be right a lot more often than you’re wrong, if you just always take sense number one.
But it can be a problem that this is not always going to be right. What if we had taken the wrong sense of chair a few minutes ago? What if we had taken the second sense of it being the person in charge of an academic department, and then calculated the ontological distance from there to table. We would have got a hugely wrong number, right?
You’d have to go forever and a day traveling around WordNet’s semantic network before you traveled all the way from chairperson of a department to table. And the distance wouldn’t be three. It probably wouldn’t be even 10. It might be 12 or 15, or something like that. I didn’t check it. It would be a big number. It would make it look like the words were utterly different, which is wrong. So getting the correct word sense is really important.
Monosemy vs Polysemy:
The words that only have one sense– we wish there were more of them. Our lives would be so much easier doing NLP if all words in the English language were monosemous, meaning they have only one sense. It can only mean one thing. And a lot of them are that way. A lot of the words are. But unfortunately, quite a lot of them are polysemous, or they exhibit polysemy, meaning they have more than one possible sense.
Although there are more monosemous words than polysemous words in the English language, polysemous words occur far more often in text. It turns out that the more frequently a word is used, the more senses it tends to have. Thus it is vital to be able to correctly identify the parts of speech.
Example: We were looking at the word chair. It can be a noun and a verb. So it exhibits the multiple senses as a noun and multiple senses as a verb.
How to fix this then ?:
But that gives a clue to how to fix this. Because when we move on to grammatical analysis, syntax analysis, we’re going to learn how to automatically tag which part of speech a word is. And then that’ll be a way to eliminate some of the possible senses a word could have and narrow down to the right point in WordNet that we should be looking at when we want to judge that word’s relationship to others in our document.
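The narrowing step can be sketched with a toy sense inventory. The glosses below are hand-written paraphrases of the senses of "chair" mentioned earlier, not WordNet entries, and the `SENSES` table and `candidate_senses` function are assumptions made for illustration.

```python
# Toy sense inventory (assumption): each sense is a (part of speech, gloss) pair.
SENSES = {
    "chair": [
        ("noun", "a seat for one person"),
        ("noun", "the person in charge of an academic department"),
        ("noun", "an instrument for executing criminals"),
        ("verb", "to preside over a meeting"),
    ],
}

def candidate_senses(word, pos=None):
    # Without a POS tag every sense is a candidate; with one, only matching senses remain.
    senses = SENSES.get(word, [])
    if pos is None:
        return senses
    return [sense for sense in senses if sense[0] == pos]

print(len(candidate_senses("chair")))
print(candidate_senses("chair", pos="verb"))
```

Knowing the tag does not pick the final sense by itself, but it shrinks the set of senses that a later disambiguation step has to choose from.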
Summary of WordNet:
So for now, super important that we appreciate what a valuable resource WordNet is, and how easy and straightforward it is to calculate ontological distance in it. It lets us see how similar words are, and therefore how similar documents are to each other. And all of the word sense disambiguation notwithstanding, it’s still an incredibly powerful thing to do. And we’ll learn to face some of those challenges as we move forward with other techniques.
What is POS tagging ?
It is a way of labeling every word in a sentence with its part of speech: for example, whether it is a noun, verb, adjective or adverb.
Why would you need to do POS tagging ?
Look here at the word “take.” We talked about word sense disambiguation: working out which sense of the word was intended. Just imagine how nice it would be if we knew that, in a particular occurrence of the word take, it was a noun and not a verb, or a verb and not a noun. We could then eliminate all the noun senses, or all the verb senses, as candidates for what was meant by the word take this time.
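The idea of using a POS tag to prune senses can be sketched in a few lines. The sense inventory below is a hypothetical, hand-written stand-in (the glosses are paraphrased for illustration, not copied from WordNet), and `candidate_senses` is an assumed helper name.

```python
# Hypothetical mini sense inventory for "take"; glosses are illustrative only.
SENSES = {
    "take": [
        ("n", "the income or profit from a transaction"),
        ("n", "one uninterrupted recording of a film scene"),
        ("v", "get into one's hands"),
        ("v", "require time or effort"),
        ("v", "travel by means of"),
    ]
}

def candidate_senses(word, pos):
    """Keep only the senses whose part of speech matches the tag we were given."""
    return [gloss for sense_pos, gloss in SENSES.get(word, []) if sense_pos == pos]

# Knowing that "take" is a noun here removes all the verb senses at once.
print(candidate_senses("take", "n"))
```

The tagger does not pick the final sense for us, but it shrinks the list that a later disambiguation step has to choose from.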
Define a tagset
The tagset is the set of tags that we use to assign parts of speech to all the words in a sentence. The tagset in WordNet has just four tags, as we saw. The Penn Treebank tagset has 45.
What are some of the other tagsets ?
There are lots of tagsets that have been carefully constructed by linguists and NLP people that we have to choose from. By far, the most widely used is the Penn Treebank tagset. The Penn Treebank is a collection of sentences that have been fully grammar parsed. We’ll see later when we do grammar parsers that they usually create a tree that shows how all the grammar of a sentence fits together.
Penn Treebank Tagset:
The Penn Treebank tagset is the most widely used tagset, and it has far more than 4 tags: 45, to be exact. Other tagsets are excessively detailed (beyond what’s needed for most of our applications).
What is meant by a Treebank ?:
A treebank just means a repository of a whole bunch of parse trees of a whole bunch of sentences. People at the University of Pennsylvania made a large treebank, just a collection of tree diagrams that show how the grammar of a bunch of sentences work. In order to do that project, the people at the University of Pennsylvania made a tagset, meaning a set of part-of-speech tags that has 45 tags in it, to be exact.
Why part-of-speech tags == Penn Treebank tags ? Turns out that’s the one that everybody in the NLP world uses 90% of the time, mainly for a couple of simple reasons. It’s a lot more than four, like we have in WordNet, and we need more than four. And the other reason is, it’s a lot less than 87 or 145. It’s one of those things where it’s kind of overkill to have 145 different parts of speech. So 45 is this manageable medium size number.
It gives us enough granularity to say specifically which part of speech this is. But it doesn’t create 145 different parts of speech that you would rarely need. This is it. So most of us in NLP have been doing this for a few years. We almost know by heart what every one of these little symbols means. So I know that IN means it’s a preposition, and JJ means it’s an adjective, and so on.
So if you’re doing NLP at all, you’re going to want to download this from the web and save it on your laptop, maybe print it out and pin it to your cork board, because it’s just assumed. Papers will be published that don’t even mention Penn Treebank and just show parse diagrams with parts of speech that are exactly Penn Treebank; they assume it goes without saying that, if not stated otherwise, part-of-speech tags are Penn Treebank tags.
NOTE: Think of word ambiguity when you think of the need for Part-of-speech Tagging!! Same word can have many meanings. Thus, you need proper POS tagging.
What is a Brown Corpus ?
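The Brown corpus is a freely available corpus of about one million words of American English, compiled at Brown University from texts published in 1961 and covering a range of genres (news, fiction, academic writing and so on). It’s used as a reference corpus in the NLP world. You’ll hear about it a lot. The Wall Street Journal articles you often hear about belong to the Penn Treebank, not the Brown corpus.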
What is the difference between types and tokens ?
Remember the word tokenizer we talked about: every individual occurrence of a word is a token, while a type is a unique vocabulary word. So if a text contains the word run, the word dog, and the word cat, that’s three types so far. But if the word run occurs a whole bunch of times, then there are that many tokens of the word run.
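A quick way to see the type/token distinction is to count both on a tiny made-up text. The whitespace split below is a deliberately naive tokenizer, used only as an assumption to keep the sketch short.

```python
from collections import Counter

text = "run dog run cat run"
tokens = text.split()        # naive whitespace tokenizer (an assumption)
counts = Counter(tokens)     # maps each type to its number of tokens

num_tokens = len(tokens)     # every occurrence counts
num_types = len(counts)      # each distinct word counts once

print(num_tokens, num_types, counts["run"])  # 5 3 3
```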
Summary of part-of-speech tagging:
So that sets up what a part-of-speech tagger does. The tagset gives you the tags that it outputs. It helps you appreciate how big the problem is, that it’s a non-trivial task. The good news is we have a lot of good part-of-speech taggers out there that you can choose from. But it’s going to be important to understand how they work and what the differences are. So that’s what we’ll talk about next.
How do POS taggers work ? Like nltk.pos_tag and spacy’s:
**Intro to how POS taggers work:**
We talked about what part-of-speech tags are, what they represent, and how the Penn tagset is the particular set of part-of-speech tags that we use the most. We appreciated that you need a part-of-speech tagger that’s smart, one that can look at a sentence and correctly discern what the part of speech really is, because words can have different parts of speech in different sentences. But now we need to look at how these part-of-speech taggers really work. And are there different kinds? Yes, there are different kinds.
Note: All you need to know is that there are some “rule-based” part-of-speech taggers, some POS taggers that take a “statistical approach” to tagging, and some “machine learning” based POS taggers.
Rule-based POS tagger:
Eric Brill invented one of the best-known rule-based POS taggers. It uses two basic steps:
1. Assign each word its most common POS tag initially (naively, that is).
2. Check what occurs before and after that word and apply a rule-based transformation.
In general, these rule-based POS taggers have several rules for correcting the initial errors until we get the right part-of-speech tags. It is a great approach, but let’s look at a different approach.
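The two steps can be sketched as follows. This is only an illustration in the spirit of Brill's approach: the lexicon and the two transformation rules are hand-written, hypothetical examples, whereas the real tagger learns its rules from a tagged corpus.

```python
# Hypothetical lexicon: each word's most frequent tag (step 1).
MOST_COMMON_TAG = {
    "she": "PRP", "will": "MD", "chair": "NN", "the": "DT",
    "meeting": "NN", "is": "VBZ", "new": "JJ",
}

# Hypothetical transformation rules (step 2):
# change from_tag to to_tag when the previous word has prev_tag.
RULES = [
    ("NN", "VB", "MD"),  # a "noun" right after a modal is really a verb
    ("NN", "VB", "TO"),  # a "noun" right after "to" is really a verb
]

def brill_style_tag(words):
    # Step 1: naive initial tagging, defaulting unknown words to NN.
    tags = [MOST_COMMON_TAG.get(w.lower(), "NN") for w in words]
    # Step 2: apply each correction rule using the left-hand context.
    for from_tag, to_tag, prev_tag in RULES:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return list(zip(words, tags))

print(brill_style_tag("She will chair the meeting".split()))
print(brill_style_tag("The chair is new".split()))
```

The word chair starts out as NN in both sentences; only in the first one does the modal "will" before it trigger the correction to VB.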
Statistical approach to POS tagging:
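A statistical model, like the Hidden Markov Model (HMM), chooses the most probable sequence of tags for a sentence, using probabilities estimated from a tagged corpus: how likely each tag is to follow the previous tag, and how likely each word is given its tag.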
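As a rough sketch of how an HMM tagger picks tags, here is the Viterbi algorithm over a simplified three-tag set. All the probabilities are made-up numbers chosen for illustration (a real tagger estimates them from a tagged corpus), and the table names `START`, `TRANS` and `EMIT` are assumptions.

```python
STATES = ["DT", "NN", "VB"]
START = {"DT": 0.5, "NN": 0.4, "VB": 0.1}           # P(first tag)
TRANS = {                                           # P(next tag | current tag)
    "DT": {"DT": 0.05, "NN": 0.9, "VB": 0.05},
    "NN": {"DT": 0.1, "NN": 0.3, "VB": 0.6},
    "VB": {"DT": 0.6, "NN": 0.3, "VB": 0.1},
}
EMIT = {                                            # P(word | tag)
    "DT": {"the": 1.0},
    "NN": {"people": 0.4, "chair": 0.2, "meetings": 0.4},
    "VB": {"chair": 0.5, "meet": 0.5},
}

def viterbi(words):
    # best[t][s]: probability of the best tag sequence ending in tag s at position t
    best = [{s: START[s] * EMIT[s].get(words[0], 0.0) for s in STATES}]
    back = [{}]
    for t in range(1, len(words)):
        best.append({})
        back.append({})
        for s in STATES:
            prev, score = max(
                ((p, best[t - 1][p] * TRANS[p][s]) for p in STATES),
                key=lambda item: item[1],
            )
            best[t][s] = score * EMIT[s].get(words[t], 0.0)
            back[t][s] = prev
    # Follow the back-pointers from the best final tag.
    path = [max(STATES, key=lambda s: best[-1][s])]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

print(viterbi(["the", "chair"]))                 # ['DT', 'NN']
print(viterbi(["people", "chair", "meetings"]))  # ['NN', 'VB', 'NN']
```

The same word chair comes out as a noun after "the" and as a verb after "people", purely because of the transition probabilities.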
There is a wealth of POS taggers in the NLTK toolkit, which a lot of people in the NLP world use. So it is fairly easy to use one of them in your own projects.
Full parse tree: A full parse tree lets you visualize the grammatical structure of a sentence in great detail. It breaks the sentence down into noun phrases and verb phrases, and then further subdivides them until only POS tags are left.
Shallowest Parsing: POS-tagging is considered to be the shallowest parsing, as we are just assigning a POS tag and we are not really parsing the sentence grammatically.
Shallow Parse tree: This is in between the above two. It does noun phrase and verb phrase recognition but does not go beyond that. It is sometimes called a verb-and-noun-phrase chunker. A chunker is the same thing as a shallow parser.
Oftentimes, people rely on shallow parsing because it breaks the sentence down into usable chunks and it is also easy on the eye (that is, simple to use). As we begin to deal with long sentences, shallow parsing proves to be more useful.
Why do we need to Chunk ? OR Why would we use a Shallow Parser ?
- Full parsers do more work and are thus computationally expensive.
- Full parsers are also not super-accurate when the sentences being fed in are not clean. Most UGC does not contain perfect grammar, and these full parsers were mostly trained on WSJ and NYTimes type of content, which is professionally written with good grammar, so they know how to make proper trees for proper grammar. But when the grammar is less than perfect, the trees coming out are less than perfect.
- Whereas a chunker, because it’s more modest, it’s more humble, it’s not trying to do as much work, it’s just trying to pull the main chunks out (noun phrase, verb phrase, verb phrase, noun phrase), has less opportunity to make mistakes. And so there are all kinds of cases where, if you run your chunker and your full parser in parallel on the same sentence and the sentence is a hairy, crazy sentence, the full parse tree will have all kinds of things that are wrong, but the chunker will have all the chunks correct.
- Another reason for using a chunker is that sometimes it is all your application needs. Unless you are doing some crazy AI stuff like machine translation or a question-answering application, you will get away with using just a shallow parser.
Why would I use a Chunker instead of using just POS tagging ? OR Why would I ever need to do Parsing?
- Simply because POS tagging sometimes just isn’t smart enough. It doesn’t group the words together at all. It just sticks part-of-speech tags on all of them. So sometimes we really need to know that this is a noun phrase, this is a verb phrase, this is a prepositional phrase, because we want to pull such things out.
- An instance of use for a shallow parser is when you need to summarize a document. Just pull out all the noun phrases, then you will have a decent enough idea about what that document is all about.
- So pulling out Phrases is a super useful thing to do, which we cannot do with just POS-tagging.
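Pulling out noun phrases from POS-tagged text can be sketched with one simple pattern: an optional run of determiners and adjectives followed by one or more nouns. The pattern and the `np_chunks` helper are assumptions made for illustration; real chunkers use richer grammars or trained models, and the input here is tagged by hand.

```python
def np_chunks(tagged):
    """Collect noun-phrase chunks from a list of (word, Penn Treebank tag) pairs."""
    chunks, current, has_noun = [], [], False
    for word, tag in tagged:
        if tag.startswith("NN"):
            current.append(word)          # nouns extend the chunk
            has_noun = True
        elif tag in ("DT", "JJ") and not has_noun:
            current.append(word)          # determiners/adjectives may open a chunk
        else:
            if has_noun:
                chunks.append(" ".join(current))
            current, has_noun = [], False
            if tag in ("DT", "JJ"):
                current.append(word)      # this word starts the next chunk
    if has_noun:
        chunks.append(" ".join(current))
    return chunks

# Hand-tagged example sentence (the tags are supplied, not predicted).
sentence = [("the", "DT"), ("little", "JJ"), ("dog", "NN"), ("barked", "VBD"),
            ("at", "IN"), ("the", "DT"), ("cat", "NN")]
print(np_chunks(sentence))  # ['the little dog', 'the cat']
```

Listing only these noun phrases already gives a rough idea of what the sentence is about, which is the summarization use mentioned above.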
UGC : User Generated Content: All the stuff you and I write on Twitter/Facebook/blogs etc. In other words, text generated by non-professional writers.
Summary of Shallow Parsing: So in essence, you realize that there are levels of syntactic parsing. Without realizing it, you already did a little bit just by part of speech tagging. Now you realize you can go a little deeper and do a shallow parse. You’ll probably use that most of the time. And on occasion, you’ll want to do full parse trees.
Chunk and Chink : A chunk is what you want, like an NP, VP etc. A chink is something you don’t want: the stuff in between.
NP Chunker: The most common type of chunker is an NP chunker, which is primarily used as a pre-step for Named-Entity Recognition (NER).
Constituency Parser vs Dependency Parser:
- A Constituency Parser is nothing but a full parser. It divides the sentence into its constituent parts.
- A dependency parser gives us labelled relations between words. This can give us, for example, the subject and object of the main verb.
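- The output of a dependency parser is represented as a directed graph over the words of the sentence, typically a tree (and therefore a Directed Acyclic Graph, DAG) rooted at the main verb.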
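To show what "labelled relations between words" look like as data, here is a minimal sketch with a hand-written parse. The triples are an assumed example of what a dependency parser might return, and the relation names (`nsubj`, `dobj`, `det`) follow one common labelling convention rather than any specific parser's output.

```python
# Hand-written (hypothetical) parse of "The committee rejected the proposal",
# stored as (head, relation, dependent) edges pointing from head to dependent.
EDGES = [
    ("rejected", "nsubj", "committee"),
    ("rejected", "dobj", "proposal"),
    ("committee", "det", "The"),
    ("proposal", "det", "the"),
]

def root(edges):
    """The root is the only head that is never anyone's dependent."""
    dependents = {dep for _, _, dep in edges}
    heads = {head for head, _, _ in edges}
    return (heads - dependents).pop()

def dependents_of(head, relation, edges=EDGES):
    return [dep for h, rel, dep in edges if h == head and rel == relation]

main_verb = root(EDGES)
print(main_verb, dependents_of(main_verb, "nsubj"), dependents_of(main_verb, "dobj"))
```

Following the labelled arrows out of the main verb is all it takes to read off the subject and the object.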
There are different kinds of parsers:
- CYK parser - this you learnt how to construct in the lectures.
- Stanford parser - it is essentially a Constituency Parser.
Valence: Typically used in Sentiment Analysis. Valence is how emotionally loaded a particular word is, that is, how positive or negative it is.
What is the need for Similarity ? Measure semantic similarity on multiple levels: One of the most important things we need to do in NLP is to have automated methods of measuring semantic similarity on multiple levels. We can start with word similarity.
Generally, we take statistical approaches to this. So we look at how a word shows up in a whole collection of documents compared to another word. And we can use statistics to study if they seem to be similar or not. And of course there are several different flavors of that or different styles of how to do that.
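So when two words have a joint probability (the probability of seeing them together) that’s much higher than the product of their individual probabilities (what we would expect if they were independent), it means that there’s some kind of relationship between these words.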
PPMI: Why are negative numbers rounded to 0 ? When we get zero, it means that two words are unrelated. If we get a negative number, it means they’re even more unrelated, and a more negative number means more unrelated still. Since all we care about is how related two words are, PPMI (positive PMI) treats every negative value as 0.
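The comparison of joint and independent probabilities can be written down directly. This is a minimal sketch of pointwise mutual information with the negative values clipped to zero; the counts in the example calls are made up, and the function and argument names are assumptions.

```python
import math

def ppmi(pair_count, count_x, count_y, total):
    """Positive PMI: log2 of P(x, y) / (P(x) * P(y)), with negatives clipped to 0."""
    if pair_count == 0:
        return 0.0  # never seen together: treat as unrelated
    p_xy = pair_count / total
    p_x = count_x / total
    p_y = count_y / total
    return max(0.0, math.log2(p_xy / (p_x * p_y)))

# Made-up counts: the pair occurs far more often than chance predicts.
print(ppmi(pair_count=10, count_x=20, count_y=20, total=1000))
# Made-up counts: the pair occurs less often than chance predicts, so PPMI is 0.
print(ppmi(pair_count=1, count_x=100, count_y=100, total=1000))
```

In the first call the ratio is 25, so the score is log2(25); in the second the ratio is 0.1, the raw PMI would be negative, and the clipping turns it into 0.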
TWO DOCUMENTS ARE MORE SIMILAR IF THEIR VECTORS ARE MORE SIMILAR!!
- This is where you come across CountVectorizer and TF-IDF. Instead of having the columns represent documents, we could have them represent context words (e.g., words occurring within a ±10 word window of each target word throughout the corpus).
- In a real-world application, that word-by-context matrix would be on the order of 50,000 x 50,000.
- And the vectors would be sparse rather than dense, meaning most of the values would be 0.
Why do we need to reduce dimensions? Vectors that large and sparse are unwieldy. For this reason we have algorithms to “reduce the dimensions” of the vector space to a more manageable number (e.g., about 300), keeping the dimensions where the variance between the values is the greatest (hence where the vectors are the most informative). The most famous such algorithm is LSA (“latent semantic analysis”).
What is LSA doing, and why is it valuable? The smaller set of dense vectors is valuable: it tells us which words are most associated by vector semantics without needing huge vectors.
Pitfalls of LSA: LSA is still subject to “garbage-in, garbage-out.”
- Use cosine similarity measure as we did in the homework.
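For reference, cosine similarity over simple word-count vectors can be sketched like this. The whitespace tokenizer and raw counts are simplifying assumptions; in practice the vectors would more likely come from TF-IDF or from LSA-reduced dimensions.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)

print(cosine_similarity("the cat sat", "the cat ran"))    # two of three words shared
print(cosine_similarity("the cat sat", "dogs bark loudly"))  # no words shared
```

Documents with more similar vectors get a score closer to 1, and documents with no words in common get 0.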
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers.
One of the main applications of semantic similarity is Document Clustering. Having a good semantic similarity measure is a pre-requisite for doing this.
So what is document clustering?: It’s not document classification, where somebody says, I know what my categories are. In document clustering, we organize a set of documents into groups having similar characteristics.
Difference between Document Clustering and Document Classification
Methods of clustering documents:
- Centroid-based methods:
- Each cluster has a central representative member. In a centroid-based method, there is a prototypical document that represents all the other documents in that cluster, and we cluster together the documents that are similar to that centroid document. This is where the similarity measure comes into the picture.
- Hierarchical methods:
- There are top-down and bottom-up methods, but the main idea here is that you don’t pre-select the number of clusters; you let the algorithm do that for you. Think of a dendrogram when talking about these methods.
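As one concrete instance of a centroid-based method, here is a minimal k-means sketch. The notes do not name a specific algorithm, so k-means is an assumption, as are the toy recipe vectors (counts of the words "sugar" and "chili"), the starting centroids, and the use of Euclidean distance instead of a semantic similarity measure.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(vectors, centroids, iterations=10):
    """Assign each vector to its nearest centroid, then move each centroid to its members' mean."""
    centroids = [list(c) for c in centroids]  # work on a copy
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(len(centroids)), key=lambda i: squared_distance(v, centroids[i]))
            for v in vectors
        ]
        for i in range(len(centroids)):
            members = [v for v, label in zip(vectors, labels) if label == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy recipes as [count of "sugar", count of "chili"]; values are made up.
recipes = [[5, 0], [4, 1], [0, 5], [1, 4]]
labels = kmeans(recipes, centroids=[recipes[0], recipes[2]])
print(labels)  # [0, 0, 1, 1]
```

The two "sweet" recipes end up in one cluster and the two "spicy" ones in the other, without anyone having told the algorithm what the categories are.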
There are two main types of Document Classification (or) Text Classification that we do:
- Content-based Classification
- Descriptor-based Classification
95% of the time, people talk about content-based classification.
In descriptor-based classification, someone gives you a description of what they want. Say you have 1000 documents and you want to classify them based on that description: then you are doing descriptor-based classification.
How is descriptor-based classification different from a search task? Descriptor-based classification is not like a keyword search that you run in a search engine. The request is very descriptive, and it’s 10 or 30 sentences long, so a keyword search for those words is not going to do it. You have to get the documents that conceptually match the description. So it is a classification task, not a search task, but it’s classification by description.
What are topic models?
- Topics give us a quick idea what a document is about.
- A topic can be thought of as a label for a collection of words that often occur together. E.g.: if the topic of the conversation was weather, then you would expect to see words like rain, storm, snow, winds, ice.
- Topic Modeling is the process of finding collections of words that best represent a set of unknown topics.
- Topic models give a way to quickly make a judgement about contents of a collection of documents.
Topic models are for those who do not like to read
“Topic models have been designed specifically for the purpose of extracting various distinguishing concepts or topics from a large corpus containing various types of documents, where each document talks about one or more concepts.”
“The main aim of topic modeling is to use mathematical and statistical techniques to discover hidden and latent semantic structures in a corpus.”
“Topic modeling involves extracting features from document terms and using mathematical structures and frameworks like matrix factorization and SVD to generate clusters or groups of terms that are distinguishable from each other - these cluster of words form topics or concepts.”
“These concepts can be used to interpret the main themes of a corpus and also make semantic connections among words that co-occur together frequently in various documents.” EXAMPLE: JAMES BOND
“A topic consists of a group of words that frequently occur together.”
Since the topics are quantified, it is possible to track topic prevalence through time, compute similarity between documents, and use tools like linear regression to estimate causal effects.
Why Topic Modeling? Topic models are used as a foundation for more technical applications like text segmentation or classification.
How to do Topic Modeling? There are many algorithms that produce topic models. Shown below are three such algorithms.
Brief notes on LDA:
- For its input, LDA takes a document-term matrix. (A document-term matrix is a bag-of-words representation of a collection of documents - it records frequencies of word occurrence, but ignores word order).
- LDA returns two matrices: one contains prevalence of topics in the documents, the other: probability of words belonging to those topics.
- LDA is an unsupervised clustering algorithm, but we need to specify the number of topics (clusters) we seek from the start.
- The result of LDA is two tables.
- The first table shows the probabilities of words belonging to topics. For instance, the probability of the word opened belonging to topic 2 is 36%.
- The second table shows the probabilities of documents belonging to topics. For instance, document 3 has a 20% probability of belonging to topic 1 and an 80% probability of belonging to topic 2.
- When doing topic modeling, you need to make a choice of which words you want to keep for analysis.
- There are several Control Parameters that strongly affect the results. To be fully competent with Topic Modeling, you must have an idea of what these parameters control.
Example of LDA:
Question: We have a topic defined by the following terms: site, settlement, evidence, inhabit, region, period, earliest, ancient, reconstruct. Which concept is reflected in this topic?
Answer: The answer is archeology. You probably relied on the associations in your mind; these associations are just non-quantified word co-occurrences.
document_topics: Contains probabilities of a document belonging to a particular topic.
word_topics: Contains probabilities of each and every word belonging to a particular topic.
In order to fit a topic model, we must prepare a document-term matrix that will contain counts of word occurrences in documents. When we fit a topic model, we get back an LDA model object. This object contains two matrices:
- Beta: contains probabilities of words in topics.
- Gamma: contains probabilities of topics in documents.
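To make the two output matrices less abstract, here is a minimal sketch of fitting LDA with collapsed Gibbs sampling in plain Python. The notes do not say which fitting procedure their library uses, so the sampler, the smoothing values `alpha` and `eta`, the tiny corpus, and the function name `lda_gibbs` are all assumptions for illustration; a real project would use an established library.

```python
import random

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, eta=0.01, seed=0):
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]                  # topic counts per document
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics
    assignments = []
    for d, doc in enumerate(docs):                                # random initial topic per token
        zs = [rng.randrange(num_topics) for _ in doc]
        assignments.append(zs)
        for w, z in zip(doc, zs):
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]                             # remove the token's current topic
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha) * (topic_word[k][w] + eta)
                    / (topic_total[k] + eta * len(vocab))
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]  # resample it
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    # Beta: probabilities of words in topics. Gamma: probabilities of topics in documents.
    beta = [
        {w: (topic_word[k][w] + eta) / (topic_total[k] + eta * len(vocab)) for w in vocab}
        for k in range(num_topics)
    ]
    gamma = [
        [(doc_topic[d][k] + alpha) / (len(doc) + alpha * num_topics) for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    return beta, gamma

docs = [
    ["rain", "storm", "snow", "wind"],
    ["rain", "snow", "ice", "storm"],
    ["site", "settlement", "ancient", "region"],
    ["ancient", "site", "evidence", "period"],
]
beta, gamma = lda_gibbs(docs, num_topics=2)
for row in gamma:
    print([round(p, 2) for p in row])
```

Each row of `gamma` sums to 1 across topics for one document, and each entry of `beta` sums to 1 across the vocabulary for one topic, which matches the two tables described above.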
We’ve talked about a number of scenarios where somebody is sitting on a big pile of documents that they feel they don’t know everything they’d like to know about their big collection of documents. That’s some of the cases where you get asked to classify or cluster the documents so people can better understand the collection.
- Canonical Topic Modeling
- Organic Topic Modeling
- Entity-centric Topic Modeling
The predominant style of topic modeling in the AI world is organic topic modeling.
How can LSA be used to discover the clusters of words that make up one topic, then another topic, and another, across a collection of documents?
It starts with the big term-document matrix which, remember, is our vector space with our vector semantics. Out of that, it creates a topic-to-topic matrix. Imagine topics along both the x-axis and the y-axis: each cell records how many times two topics co-occurred in the same document.
LDA: Intuitively, it groups together words that have a high co-occurrence across the variety of documents in your corpus.
It uses a probability distribution for how likely it is that a word is in a topic. And it uses a probability distribution for how likely it is that a topic is assigned to a document.
These assignments are what we call Dirichlet allocations.
Compare LSA and LDA: LDA is like LSA except that it uses probability distributions over words, rather than a topic-topic matrix, to guide the assignment of words to topics.
NMF vs LDA: Use NMF when you don’t have enough data. Use LSA or LDA when you have a lot of data.
There are two approaches to sentiment analysis:
- ML approach using supervised ML techniques.
  - Quick to implement when a large amount of training data is ready at hand.
  - Don’t need to develop a coded vocabulary.
  - Not transparent.
  - Only as granular as the training data annotations (usually not very).
- Lexicon-based approach using unsupervised techniques.
  - When there is no training data, you have to use this.
  - Ability to customize the lexicon based on your domain.
  - Someone has to maintain the lexicon as new words keep getting added.
So far, we used labeled training data to learn patterns using features from the movie reviews and their corresponding sentiment. Then we applied this knowledge learned on new movie reviews (the testing dataset) to predict their sentiment. Often, you may not have the convenience of a well-labeled training dataset.
In those situations, you need to use unsupervised techniques for predicting the sentiment by using knowledgebases, ontologies, databases, and lexicons that have detailed information specially curated and prepared just for sentiment analysis.
As mentioned, a lexicon is a dictionary, vocabulary, or a book of words. In our case, lexicons are special dictionaries or vocabularies that have been created for analyzing sentiment. Most of these lexicons have a list of positive and negative polar words with some score associated with them, and using various techniques like the position of words, surrounding words, context, parts of speech, phrases, and so on, scores are assigned to the text documents for which we want to compute the sentiment.
After aggregating these scores, we get the final sentiment. More advanced analyses can also be done, including detecting the subjectivity, mood, and modality. Various popular lexicons are used for sentiment analysis, including the following:
- AFINN lexicon
- Bing Liu’s lexicon
- MPQA subjectivity lexicon
- Vader lexicon
- Pattern lexicon
This is not an exhaustive list of lexicons that can be leveraged for sentiment analysis, and there are several other lexicons which can be easily obtained from the Internet.
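The score-and-aggregate idea behind lexicon-based sentiment can be sketched as below. The `TOY_LEXICON` scores are invented for this example and are not taken from AFINN, VADER or any other real lexicon; the tokenization and the sign-based labelling are also simplifying assumptions.

```python
# Hypothetical toy lexicon: the words and scores are made up for illustration.
TOY_LEXICON = {"good": 2, "great": 3, "love": 3, "bad": -2, "terrible": -3, "boring": -2}

def lexicon_sentiment(text, lexicon=TOY_LEXICON):
    """Sum the polarity scores of known words and label the text by the sign of the total."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    score = sum(lexicon.get(t, 0) for t in tokens)
    if score > 0:
        label = "positive"
    elif score < 0:
        label = "negative"
    else:
        label = "neutral"
    return score, label

print(lexicon_sentiment("I love this movie!"))                           # (3, 'positive')
print(lexicon_sentiment("Great cast, but a boring and terrible plot."))  # (-2, 'negative')
```

Real lexicons add refinements such as intensity modifiers and context rules, but the core is still looking words up and aggregating their scores.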
Using chunks with Sentiment Analysis: We can stem any verb to its lemma and then take the gerund form.
Should we ignore negation or not?
- When doing general sentiment detection (merely negative or positive) over multi-sentence text units, we can often just neglect negation since it “cancels out” over multiple instances of sentiment.
- But we’ll see that when we are refining our treatment of sentiment with higher dimensionality, it becomes awkward to turn out false readings because of negation.
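When negation cannot be ignored, one simple heuristic is to flip the polarity of the next few words after a negator. This windowed flip is an assumption chosen for illustration, not the method used in the course; the negator list, the window size and the toy scores are likewise made up.

```python
NEGATORS = {"not", "no", "never"}
TOY_SCORES = {"good": 2, "bad": -2}  # hypothetical polarity scores

def score_with_negation(tokens, lexicon=TOY_SCORES, window=2):
    """Sum word scores, inverting the sign of up to `window` words after a negator."""
    score, flips_left = 0, 0
    for tok in tokens:
        if tok in NEGATORS:
            flips_left = window
            continue
        value = lexicon.get(tok, 0)
        if flips_left > 0:
            value = -value
            flips_left -= 1
        score += value
    return score

print(score_with_negation("this is good".split()))           # 2
print(score_with_negation("this is not very good".split()))  # -2
```

Without the flip, "not very good" would be scored as positive, which is exactly the kind of false reading described above.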
What are some of the Advanced Sentiment Analysis methods?
- Run a dependency parser and follow the parser until an object is found.
- Tracing from “rejection,” follow arrows until you get to an NP.
Dimensionality of Sentiment: By default we have one dimension (polarity) or two (negative-positive), but in real life there are many more dimensions. There are about 150 different types of emotions. Yes 150.
Leveraging the hierarchy:
The nice thing is we can use all the labels of deeper levels as clues (trigger words) to identify the first level—so we have a ready-made starter vocabulary.
How do you present your Sentiment Analysis?:
In a commercial setting, sentiment analysis is not an end in itself. Your stakeholders want you to deliver actionable insights. You can leverage sentiment for insights. Roll-ups with examples are usually helpful.
- For product reviews: roll-up aggregate results by theme, and show examples.
- For politics: make a candidate-issue matrix with net sentiment balance, and link to examples.
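A roll-up by theme can be as simple as grouping already-scored reviews and keeping a few examples per group. The reviews, themes and sentiment values below are invented placeholders, and `roll_up` is an assumed helper name.

```python
from collections import defaultdict

# Hypothetical, already-scored reviews: (theme, sentiment, text).
REVIEWS = [
    ("battery", 1, "Lasts all day"),
    ("battery", -1, "Died after a month"),
    ("battery", 1, "Charges fast"),
    ("screen", -1, "Scratches easily"),
]

def roll_up(reviews):
    """Net sentiment and example texts per theme."""
    summary = defaultdict(lambda: {"net": 0, "examples": []})
    for theme, sentiment, text in reviews:
        summary[theme]["net"] += sentiment
        summary[theme]["examples"].append(text)
    return dict(summary)

for theme, info in roll_up(REVIEWS).items():
    print(theme, info["net"], info["examples"][:2])
```

Showing the net number next to a couple of real examples is what makes the result actionable for a stakeholder.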
Visualization of Sentiment:
Combining sentiment with demographics opens up all kinds of possibilities for visualizations. Break down the sentiment analysis by:
This will lead to some interesting and engaging presentations.