NLP projects
Topic Modeling with NMF and LDA using sklearn
Topic modeling extracts features from document terms and uses mathematical techniques such as matrix factorization and SVD to generate clusters of terms that are distinguishable from each other; these clusters of words form topics or concepts. These concepts can be used to interpret the main themes of a corpus and to make semantic connections among words that frequently co-occur across documents. There are various frameworks and algorithms for building topic models. Here, I will explore two: Non-negative Matrix Factorization (NMF) and Latent Dirichlet Allocation (LDA).
Document Clustering
Document clustering or cluster analysis is an interesting area in NLP and text analytics that applies unsupervised ML concepts and techniques. The main premise of document clustering is similar to that of document categorization, where you start with a whole corpus of documents and are tasked with segregating them into various groups based on some distinctive properties, attributes, and features of the documents. Document classification needs pre-labeled training data to build a model and then categorize documents. Document clustering uses unsupervised ML algorithms to group the documents into various clusters.
Clustering movies based on their plots
In this post, I will show how we can cluster movies based on their plot summaries from IMDb and Wikipedia. We will quantify how similar the movies are based on these summaries, then separate them into groups, also known as clusters. We’ll create a dendrogram to represent how closely the movies are related to each other.
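The idea can be sketched roughly as follows, assuming TF-IDF vectors, cosine distance, and average-linkage hierarchical clustering from SciPy. The movie titles and plot texts are made up for the example, and the post itself may use a different linkage method or distance.

```python
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up plot summaries (placeholders, not real IMDb or Wikipedia data)
plots = {
    "Space Movie A": "astronauts travel through space to save earth from an asteroid",
    "Space Movie B": "a crew of astronauts is stranded in space far from earth",
    "Crime Movie A": "a detective hunts a killer through the city streets",
    "Crime Movie B": "a young detective investigates a murder in the city",
}
titles = list(plots)

# Turn each plot into a TF-IDF vector
X = TfidfVectorizer(stop_words="english").fit_transform(list(plots.values()))

# Agglomerative clustering on cosine distances between plots
Z = linkage(X.toarray(), method="average", metric="cosine")

# Cut the tree into two flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")

# no_plot=True returns the dendrogram layout without requiring matplotlib;
# drop it (with matplotlib installed) to draw the actual figure
tree = dendrogram(Z, labels=titles, no_plot=True)

for title, label in zip(titles, labels):
    print(label, title)
```

Movies that merge low in the tree have the most similar plots; the height at which two branches join is the distance between them.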
Ranking McDonald's reviews using NLP based on rudeness
McDonald’s receives thousands of customer comments on their website per day, and many of them are negative. Their corporate employees don’t have time to read every single comment, but they do want to read the subset of comments they are most interested in. In particular, the media has recently portrayed their employees as being rude, so they want to review comments about rude service. This project uses NLP and Multinomial Naive Bayes to rank customer reviews based on their rudeness.
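A minimal sketch of the ranking step, assuming a bag-of-words representation and scikit-learn's `MultinomialNB`. The training comments and their rude/not-rude labels below are invented for the example; the real project would train on labeled customer comments.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented training data: 1 = complaint about rude service, 0 = anything else
train_texts = [
    "the cashier was rude and yelled at me",
    "staff were rude and ignored us",
    "rude employee rolled her eyes at my order",
    "the fries were cold and the burger was stale",
    "my order was missing a drink",
    "the restaurant was clean and the food was hot",
]
train_labels = [1, 1, 1, 0, 0, 0]

# Word counts feed directly into Multinomial Naive Bayes
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

new_comments = [
    "the food was cold",
    "the manager was rude and yelled at my kids",
    "drive through was slow",
]

# Column 1 of predict_proba is the predicted probability of the "rude" class
rude_prob = model.predict_proba(new_comments)[:, 1]

# Rank comments from most to least likely to be about rude service
ranked = sorted(zip(new_comments, rude_prob), key=lambda pair: pair[1], reverse=True)
for comment, prob in ranked:
    print(f"{prob:.2f}  {comment}")
```

Ranking by predicted probability rather than by the hard class label lets a reader work down the list from the most likely rude-service comments and stop wherever time runs out.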
Document Similarity
Two documents are similar if their vectors are similar. In this post, we will explore this idea through an example. A heatmap of the similarity between Amazon books is displayed to find the most similar and most dissimilar books.
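To make "similar vectors" concrete, here is a small sketch that assumes TF-IDF vectors compared with cosine similarity. The book titles and descriptions are placeholders I made up, not the Amazon data from the post.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder book descriptions (not the actual Amazon dataset)
books = {
    "Python Cookbook": "recipes for python programming and data structures",
    "Learning Python": "an introduction to python programming for beginners",
    "Garden Basics": "how to grow vegetables and flowers in a small garden",
}
titles = list(books)

# One TF-IDF vector per book
X = TfidfVectorizer(stop_words="english").fit_transform(list(books.values()))

# sim[i, j] is the cosine of the angle between book i and book j (1.0 = identical)
sim = cosine_similarity(X)

# Mask the diagonal so a book is never matched with itself
off_diag = sim.copy()
np.fill_diagonal(off_diag, -1.0)
i, j = np.unravel_index(off_diag.argmax(), off_diag.shape)
most_similar = (titles[i], titles[j])

print(np.round(sim, 2))
print("Most similar pair:", most_similar)
```

The `sim` matrix is what a heatmap would visualize: it can be passed to a plotting function such as `seaborn.heatmap` along with the titles as axis labels.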
Scrape IMDB movie reviews
It is often necessary to construct a dataset by scraping a website and extracting the relevant information. I will use the IMDb website to pull user reviews for the top 250 thriller movies and construct a dataset that will later be used for NLP tasks such as shallow parsing, clustering, and sentiment analysis. In this post, the focus is on how to create the dataset and how to do shallow parsing by breaking each user review down into noun chunks.