"You shall know a word by the company it keeps." - John Firth
Beginner's guide to understanding Vector Space Models
We will delve into the concept of vector spaces by developing an intuition for the idea through some simple examples. There are many applications and algorithms that make use of vector space models, but this post focuses on the general idea behind vector spaces.
Suppose you have two questions. The first one is, "Where are you heading?" and the second one is, "Where are you from?" These sentences have identical words except for the last ones, yet they mean different things.
On the other hand, say you have two more questions whose words are completely different but which mean the same thing.
Vector space models will help you identify whether the first pair of questions (or the second pair) are similar in meaning even if they do not share the same words.
They can be used to identify similarity for question answering, paraphrasing, and summarization.
Vector space models will also allow you to capture dependencies between words.
Example 1: Consider the sentence "You eat cereal from a bowl." Here, you can see that the word cereal and the word bowl are related.
Example 2: Now let's look at this other sentence: "You buy something and someone else sells it." What it's saying is that someone sells something because someone else buys it; the second half of the sentence depends on the first half. With vector space models, you will be able to capture this and many other types of relationships among different sets of words.
Vector space models are used in information extraction to answer questions in the style of who, what, where, and how; in machine translation; and in chatbot programming.
Essence of Vector Space Models: With vector space models, representations are built by identifying the context around each word in the text, and this context captures the relative meaning.
To recap, vector space models allow you to represent words and documents as vectors that capture their relative meaning.
We can construct our vectors based on a co-occurrence matrix; depending on the task we are trying to solve, several designs are possible. We will also see how to encode a word or a document as a vector. To get a vector space model using a word-by-word design, we'll build a co-occurrence matrix and extract vector representations for the words in our corpus.
Similarly, we can get a vector space model using a word-by-document design.
Finally, we’ll see how in a vector space we can find relationships between words and vectors, also known as their similarity.
The co-occurrence of two different words is the number of times that they appear in your corpus together within a certain word distance k.
For instance, suppose that your corpus has the following two sentences: "I like simple data" and "I prefer simple raw data."
Consider the row of the co-occurrence matrix corresponding to the word data, with a k value equal to two.
For the column corresponding to the word simple, you'd get a value equal to two, because data and simple co-occur in the first sentence within a distance of one word and in the second sentence within a distance of two words.
Considering the co-occurrences of data with the words simple, raw, like, and I, the vector representation of the word data would be equal to [2, 1, 1, 0].
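As a minimal sketch, the word-by-word counts above could be computed like this. The two corpus sentences are an assumption inferred from the counts in the text, and `co_occurrence_row` is a hypothetical helper written for illustration:

```python
from collections import defaultdict

def co_occurrence_row(corpus, target, k=2):
    """Count how often each word appears within distance k of `target`."""
    counts = defaultdict(int)
    for sentence in corpus:
        words = sentence.lower().split()
        for i, w in enumerate(words):
            if w != target:
                continue
            # Look at neighbors within k positions of the target word.
            for j in range(max(0, i - k), min(len(words), i + k + 1)):
                if j != i:
                    counts[words[j]] += 1
    return dict(counts)

corpus = ["I like simple data", "I prefer simple raw data"]
print(co_occurrence_row(corpus, "data"))
# {'like': 1, 'simple': 2, 'raw': 1}  (and "I" is never within distance 2, so it stays at 0)
```

Note that simple gets a count of two (once per sentence), while I never falls within the k = 2 window, matching the vector [2, 1, 1, 0].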
What is n here? With a word-by-word design, you get a representation with n entries, where n ranges from one up to the size of your entire vocabulary.
For a word-by-document design, the process is quite similar. In this case, you count the number of times that words from your vocabulary appear in documents that belong to specific categories.
For instance, you could have a corpus consisting of documents on different topics, such as entertainment, economy, and machine learning. Here, you'd count the number of times that your words appear in the documents that belong to each of the three categories.
In this example, suppose that the word data appears 500 times in documents from your corpus related to entertainment, 6,620 times in economy documents, and 9,320 times in documents related to machine learning. The word film appears in each document category 7,000, 4,000, and 1,000 times respectively.
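A minimal sketch of building such a word-by-document matrix; the toy documents and the `word_by_document` helper here are illustrative assumptions, not part of the post's example:

```python
def word_by_document(docs_by_category, vocab):
    """Count how often each vocabulary word appears in the documents
    of each category; each word's row is its vector representation."""
    categories = list(docs_by_category)
    matrix = {
        word: [
            sum(doc.lower().split().count(word) for doc in docs_by_category[c])
            for c in categories
        ]
        for word in vocab
    }
    return categories, matrix

# Tiny made-up corpus, grouped by category.
docs = {
    "entertainment": ["the film was fun", "a great film"],
    "economy": ["raw data on markets", "data and more data"],
}
categories, matrix = word_by_document(docs, ["data", "film"])
print(matrix)  # {'data': [0, 3], 'film': [2, 0]}
```

Each row of `matrix` is a word vector with one entry per category, exactly the shape of the data/film table described above.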
Once you’ve constructed the representations for multiple sets of documents or words, you’ll get your vector space.
Let's take the matrix from the last example. You could take the representations for the words data and film from the rows of the table. However, you can also take the representation for every category of documents by looking at the columns, so the vector space will have two dimensions (one per word).
If the number of times that the words data and film appear in each type of document is plotted on a 2-D plane, we get the vector representations for the categories: entertainment, economy, and machine learning.
Note that in this space, it is easy to see that the economy and machine learning documents are much more similar to each other than either is to the entertainment category.
In my next post, we'll compare vector representations using cosine similarity and Euclidean distance in order to get the angle and distance between them.
- So far, you've seen how to get vector spaces with two different designs, word-by-word and word-by-document, by counting either the co-occurrences of words within a distance k or the occurrences of words in documents of each category.
- I also showed you that in vector spaces you can determine relationships between types of documents, such as their similarity.