What comes to mind when you think of a neural network? In this post, we will develop the intuition for understanding how a neural network works and introduce...
Visually and statistically detect stationarity in a time series. Methods to convert a non-stationary time series to a stationary one are discussed with examples.
Understand the basics of autoregressive models and how we can use PACF and AIC/BIC to identify the order of an AR model.
How do you know if your time series is white noise or a random walk? How do you detect a stationary process? What is the relation between these concepts?
Autocorrelation is the correlation of a single time series with a lagged copy of itself. In this post, we will explore this concept with a couple of examples.
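As a rough illustration of the definition above, the sketch below computes a lag-1 autocorrelation with pandas in two ways. The toy series is hypothetical and is not taken from the post.

```python
import pandas as pd

# Hypothetical toy series: a steadily increasing sequence.
series = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Series.autocorr(lag) is the Pearson correlation between the series
# and a copy of itself shifted by `lag` steps.
lag1 = series.autocorr(lag=1)

# The same quantity computed by hand: correlate the series with its lagged copy.
manual = series.corr(series.shift(1))

print(lag1, manual)
```

Because the toy series is perfectly linear, the lagged copy moves in lockstep with the original and both values come out at (approximately) 1.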
How to calculate correlation and regression coefficients for two time series?
Using pandas we will explore two types of window functions: rolling and expanding metrics for time series data.
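A minimal sketch of the two window types, assuming a made-up daily series (the values and dates are illustrative only): a rolling window has a fixed size, while an expanding window grows to include every observation so far.

```python
import pandas as pd

# Hypothetical daily values, for illustration only.
prices = pd.Series(
    [1.0, 2.0, 3.0, 4.0],
    index=pd.date_range("2020-01-01", periods=4, freq="D"),
)

# Rolling: mean over a fixed window of the last 2 observations.
# The first value is NaN because the window is not yet full.
rolling_mean = prices.rolling(window=2).mean()

# Expanding: mean over all observations seen so far.
expanding_mean = prices.expanding().mean()

print(rolling_mean.tolist())
print(expanding_mean.tolist())
```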
Downsample and aggregate time series to observe trends, and compare time series that have different frequencies.
Changing the time series frequency is a very common operation, as there are many cases when you want to compare time series with different frequencies.
Basics of manipulating time series, such as slicing, changing frequency, shifting, lags, diffs and percent changes, are explored in this post.
Content-based recommendation using Jaccard similarity.
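For reference, Jaccard similarity between two sets is the size of their intersection divided by the size of their union. The sketch below is a plain-Python version; the function name and the genre tags are hypothetical and not taken from the post.

```python
def jaccard_similarity(a: set, b: set) -> float:
    """Return |a & b| / |a | b|, or 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical genre tags for two movies.
movie_a = {"action", "drama"}
movie_b = {"action", "comedy"}

# One shared tag out of three distinct tags overall.
score = jaccard_similarity(movie_a, movie_b)
print(score)
```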
Make recommendations based on the knowledge of the crowd. For example, people who watch Gladiator have also watched The Matrix. We will take a dataset and genera...
What are recommendation engines and what type of data is well suited for these problems?
Learn how to design, configure, secure and test HTTP endpoints, using AWS Lambda as the backend.
Spark’s use of functional programming is illustrated with an example. The intuition for using pure functions and DAGs is explained.
When to use Spark? What are the modes in which Spark can run? When not to use Spark? What are some alternatives to Spark? These questions are answered ...
Answer business questions using simple SQL filtering, aggregations and joins.
Use PostgreSQL JSON operators and functions to explore and understand a couple of datasets stored inside a database.
The general steps for using SageMaker are discussed in this post. The training and inference processes are explained in detail.
As an introduction to using SageMaker’s High Level Python API we will look at a relatively simple problem. Namely, we will use the Boston Housing Dataset to ...
KSQL is a SQL-like interface for building stream processing applications. In this post, I will show how to convert Kafka topics into Streams and Tables and the...
A data pipeline captures the movement and transformation of data from one place/format to another. It often runs on schedule and feeds data into multiple das...
A common use-case in building stream processing applications is to filter and enrich an incoming stream of data and send it to a new topic or create an inter...
Faust is a stream processing Python library that allows us to read a stream of data from a Kafka Topic, process it and store the processed data into another...
Faust provides a few options for state storage that we need to understand before we start building streaming applications in production. Here, we are going t...
AWS Developer tools: CodeBuild, CodeDeploy and CodePipeline are discussed in the context of a CICD pipeline.
It is worth understanding how Kafka stores data to better appreciate how the brokers achieve such high throughput. Kafka simply has a data directory on disk:...
Process CodeCommit events with a Lambda Function to create custom SNS notifications, containing useful information about branch, author and message for each ...
Process AWS S3 events using AWS Lambda. A simple example of uploading an image to a bucket will create a zipped version of it in the same bucket with zip pre...
An overview of AWS Lambda service and its key features.
Cloud9 is a cloud-based IDE to build Cloud-Native applications. In this post, I will show how to develop, test and deploy a Serverless app using Cloud9.
To understand the EC2 Container Service (AWS ECS), I will approach it from an Architectural Context, then I’ll cover the Computational Context, and then finally,...
Create RDS subnet groups and RDS database clusters, manage access to the RDS cluster using security groups, and finally connect to your RDS cluster and create tables.
AWS RDS and AWS non-relational databases are the two main families of databases offered by AWS. In this post I will introduce the various databases available on AW...
Before we can leverage text data in a machine learning model, we must first transform it into a series of columns of numbers or vectors. There are many diffe...
The role of data schemas, Apache Avro and Schema Registry in Kafka is explained using an example.
A variety of SQL LeetCode questions that I solved. This post captures Day 1 problems and their solutions.
How to create and administer Kafka topics? In this post, I will show 3 methods to manage Kafka topics. This will be useful when troubleshooting arcane issues.
The Confluent REST Proxy provides a RESTful interface to a Kafka cluster. Learn how to produce, consume, view and administer a Kafka cluster using simple pytho...
Use Kafka Connect to stream data from a log file and a SQL table into Kafka using Python and the Kafka Connect HTTP REST API.
Covers commonly used commands when working in the Kafka ecosystem, namely Kafka CLI, Kafka Connect, Kafka REST Proxy, KSQL and Faust.
When data is stored in arrays, we can make use of some special Postgres operators like ANY and CONTAINS to query, filter and aggregate records.
Using Flask + Pandas + Plotly + Dash, deploy an interactive data dashboard to Heroku.
Group Airbnb listings based on similarity. Folium maps are used to display the clustering results for the Chelsea neighborhood.
An interactive 3-D scatter plot of t-SNE features showing similar Airbnb listings is shown here.
Used Dimensionality Reduction to reduce 2100 features to 50 principal features
Create a binary bag-of-words representation for amenities and host verifications and a TF-IDF representation for the description of each Airbnb listing.
Dealing with outliers and choosing the type of scaling are covered in this post.
Extract data from S3, clean the dataset, deal with missing values and load the cleaned dataset to S3
Visualize the Airbnb price distribution across the 5 boroughs of New York City. Some interesting visualizations are created using GeoPandas and Folium maps.
I will show how to build a simple data pipeline using Apache Airflow to retrieve data from S3 and load it into a Redshift cluster.
Are James Bond movies the best in the thriller movies category? I will answer this using 1500 user movie reviews of 500 top thriller movies using NLP and Topic...
Using NLP and dimensionality reduction (t-SNE), group cosmetics based on their chemical ingredients.
1000 user reviews are clustered using K-means and Agglomerative Hierarchical clustering algorithms. Silhouette analysis is conducted to evaluate the effectiv...
In this post, I will show how we can cluster movies based on IMDB and Wiki plot summaries. We will quantify the similarity of movies based on their plot summ...
Two documents are similar if their vectors are similar. In this post, we will explore this idea through an example. A heatmap of Amazon books similarity is d...
Scrape IMDB movie reviews and construct a dataset. Perform shallow parsing on user reviews using spaCy and pattern.
In this post, we will see what AWS Step Functions are and which types of problems this service is best suited to solve, with an example.
Scrape, clean and normalize 100+ Gutenberg texts and apply basic text analysis.
Understand what the EKS service provides, then create an EKS cluster using eksctl and connect to the cluster using kubectl. Finally, destroy the resources.
Kubernetes provides ConfigMaps and Secrets resource kinds to allow you to separate configuration from pod specs. This separation makes it easier to manage ...
Use the requests library to send GET requests to an API endpoint, parse the response, handle pagination, and store the transformed data in a Postgres table.
Volumes, PersistentVolumes, PersistentVolumeClaims are explored in depth with the use of an example to demonstrate the need for these and how to use them.
Init Containers let you perform some tasks or check some preconditions before the main application container starts. In this post, I will use an example to s...
Is your pod ready as soon as the container starts? This is the key question we will explore in this post through the use of Kubernetes liveness and readiness...
Kubernetes by default uses the RollingUpdate strategy. In this post, we will learn how to trigger, pause, resume and view a rollout and demonstrate a rollback.
Instead of creating Pods, we will create deployments and use an example 3-tier application to illustrate scaling and load balancing.
Autoscaling uses the metrics server to collect metrics about the cluster and uses this info to make scaling decisions. In this demo, I will show how to create me...
In this post, I will show how to use Kubernetes service discovery mechanisms: env variables and DNS, to design a multi-pod n-tier application.
In this post, I will show how to run a multi-container pod that implements a three-tier application in a Kubernetes namespace.
Learn why you can’t live without these Services in Kubernetes and what exactly they solve.
In this post, I will first cover what pods are, how to create, destroy and configure them. I will then run an Nginx web server on a Kubernetes cluster.
A basic architectural overview of Kubernetes is the focus of this post. Clusters, Nodes, Control Plane, Pods, Services and Deployments are touched upon.
Once you’ve decided on using Kubernetes, you have a variety of methods for deploying it: single-node, multi-node, vendor-managed, on-prem, etc.
Overview of Kubernetes, what it does, how it does it and alternatives to Kubernetes.
Set up the dvdrental database locally to demonstrate features of PostgreSQL.
Use PostgreSQL JSON operators and functions to explore and understand a couple of datasets stored inside a database.
Lexical diversity is a measure of how many different words are used in a text. The goal of this notebook is to use NLTK to explore the lexical diversity...
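One common way to put a number on this is the ratio of unique words to total words. The sketch below assumes that definition and uses plain whitespace tokenization instead of NLTK, so it is only an approximation of what the notebook does; the function name and sample sentence are hypothetical.

```python
def lexical_diversity(text: str) -> float:
    """Ratio of unique tokens to total tokens (assumed definition)."""
    # Naive tokenization: lowercase and split on whitespace.
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Hypothetical sample: 6 tokens, 5 of them distinct ("the" repeats).
sample = "the cat sat on the mat"
ratio = lexical_diversity(sample)
print(ratio)
```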
Web scraping using Python
A glossary of NLP terms and notes
Use a Faster Region-based Convolutional Neural Network (Faster R-CNN) to detect objects under different driving conditions
The following topics are covered: Benefits of containerizing, why choose containers over VMs, Docker basics, install Postgres from Docker Hub and finally build...
We will delve into the concept of Vector Spaces by developing an intuition behind this idea through some simple examples. There are many applications and alg...
Predict school budgets using a machine learning pipeline. The target variable is multi-class-multi-label and we have a mix of numeric and text features. We w...
A glossary of terms used in CNN.
Basic networking resources within AWS are covered through the use of LucidChart diagrams.
How to construct an MLP in PyTorch. A glossary of terms.
This post will introduce how to use PyTorch to build and train neural networks.
Approach to solving a binary classification problem
Everything about manipulation of dataframes using Pandas
Unsupervised Learning: Clustering using R and Python
IaaC is one of the pillars of Cloud DevOps and CloudFormation is the tool that lets us provision resources using declarative programming.
Let's understand the rise and popularity of Data Lakes and the need for modern Cloud-based Data Lakes.
Melting turns columns into rows, whereas pivot takes the unique values from a column and creates new columns from them.
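A small round trip can make the two operations concrete. The sketch below uses a hypothetical wide table (the city names and numbers are made up) and the pandas `melt` and `pivot` methods.

```python
import pandas as pd

# Hypothetical wide table: one column per year.
wide = pd.DataFrame(
    {"city": ["NYC", "LA"], "2019": [10, 20], "2020": [11, 21]}
)

# melt: the year columns become rows, one (city, year, value) row each.
long = wide.melt(id_vars="city", var_name="year", value_name="value")

# pivot: the unique values of "year" become columns again.
back = long.pivot(index="city", columns="year", values="value")

print(long)
print(back)
```

Here `long` has four rows (two cities times two years), and `back` recovers one column per year, indexed by city.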
The power of infrastructure-as-code is illustrated by launching a 4-node AWS Redshift cluster, performing some analysis, and destroying the resources, all us...
World of DataFrames, Series and their indexes
Creating dummy variables
Pandas indexes are the most confusing thing about pandas. So let’s try to understand it. What are the advantages of using indices instead of...
Text Analysis is a major application field for machine learning algorithms. However, the raw data, a sequence of symbols, cannot be fed directly to the algorit...
A gentle introduction to sklearn
When working with text data, you are often required to write regular expressions to pre-process the data, extract useful information from the data, create ne...
What are the different ways in which you can create a DataFrame? Using the pd.DataFrame() constructor. Zipping lists to build a DataFrame. Building dataframes w...
Predict Seismic bumps using Logistic Regression in R
The story of how an object goes from Pending to Persisted.
Build a simple Flask application which connects to Postgres using SQLAlchemy.
SQLAlchemy offers several layers of abstraction and convenient tools for interacting with a database. In this post, we will understand the purpose of each la...
For a long time I have been a fan of psycopg2 and resisted using SQLAlchemy, but maybe not anymore.
Learn when to use ElastiCache Service within your applications to improve your overall performance.
Learn to process SNS notifications with an AWS Lambda function that uploads a file to S3 upon receipt of a message.
Get an overview of Amazon Kinesis to collect, process and analyze real-time streaming data.
Host a static gallery website on S3 and distribute the content to the edge locations using a CloudFront Web Distribution.
Convert a SQLite database to a Postgres database. For example, if you are given database.sqlite and want to do your analysis in Postgres, read on.
The MapReduce programming technique was designed to analyze massive data sets across a cluster. In this post, we will get a sense of how Hadoop MapReduce works.
Work with dates and times in SQL
Since Postgres has no DESCRIBE TABLE command, we need to write this query again and again, so I am documenting it here.
We will look at real-world scenarios using a health-care company's data and learn how to query JSON data using normal PostgreSQL.
Install Postgres, administer it using the CLI, and connect to it from a Jupyter notebook.
PCA in R
Principal Components Analysis and Linear Discriminant Analysis applied to the Breast Cancer Wisconsin Diagnostic dataset in R
Basics of linear algebra using NumPy, such as calculating the norm and the dot product.
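As a quick sketch of those two operations, assuming two hypothetical 2-D vectors chosen so the arithmetic is easy to check by hand:

```python
import numpy as np

# Hypothetical vectors, for illustration only.
v = np.array([3.0, 4.0])
w = np.array([1.0, 2.0])

# Euclidean (L2) norm: sqrt(3^2 + 4^2).
norm_v = np.linalg.norm(v)

# Dot product: 3*1 + 4*2.
dot_vw = np.dot(v, w)

print(norm_v, dot_vw)
```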
In this post, we will think about the graph patterns to apply to the graph database and then we will perform the queries using Cypher.
What is a property graph, and how do you create one in Neo4j using Cypher? Both questions are answered in this post using an example.
In this post, I cover how to install Neo4j Desktop on a Mac and explore what is available in the tool. The basics of graph databases are discussed.
Why we need the AWS X-Ray service and what its key components are is covered in this post.
We will explore how AWS Elastic Beanstalk can be used to help deploy and scale applications without having to worry about provisioning resources manually.
An introduction to Elastic Load Balancing and the types of ELBs are covered in this post.
Filter inbound and outbound traffic using NACLs and Security Groups within your VPC.
Configure a static website with S3 and distribute it using CloudFront.
Everything about VPC subnets and how to design for high availability and resiliency.
Continuous Delivery requirements and tools are discussed extensively in this post.
In this post, we will look at 5 ways to encrypt data in S3: SSE-S3, SSE-KMS, SSE-C, CSE-KMS and CSE-C.
Rotation of CMKs, importing key material for CMKs and Deletion of CMKs are covered in this post.
Key policies are resource-based policies which are tied to your CMK. If you want a principal to be able to access your CMK, then a key policy must be in ...
Many of the AWS services rely on KMS for their encryption needs. KMS allows encryption of data at rest. In this post, we will look at the components that mak...
AWS Elastic Beanstalk, AWS Lambda and AWS Batch are covered in this post.
Amazon EC2 Container Service and Elastic Container Registry are explained.
Install Docker on AWS, find and use images from the public Docker registry and finally build your Docker images using Dockerfiles.
AWS Virtual Private Cloud (VPC) allows you to provision a logically isolated section of the AWS Cloud where you can launch AWS resources in a virtual network...
Create an Amazon EC2 Linux instance, connect to it and extract instance metadata.
AMI, instance types, instance purchase options, tenancy, user data, storage options and security of EC2 instances are covered in this post.
Learn about networking fundamentals in Docker. I explore the three pre-configured networks with examples.
What is the difference between Expose, PublishAll and Publish when talking about port mapping in Docker? Also learn how to run a Docker container in detached m...
Build a Docker image from a container using the docker commit command and modify the default command using the --change flag.
Build your first Docker image from a Dockerfile and run a simple Go binary file inside the container.
Suppose your grandma caught you twiddling with a Blue Whale icon on your laptop and you have to explain what it is. How would you explain it?
A high-level overview of AWS SQS, SNS and SES
Storage fundamentals of AWS are discussed here. Non-S3 is covered in-depth.
Storage fundamentals of AWS are discussed here. S3 is covered in-depth.
The Identity and Access Management service, commonly referred to as IAM, is a key security service within AWS and is likely the first service you will encount...
Everything about conda channels
Configure Jenkins plugins to talk to S3 and GitHub and build a simple pipeline which will upload a file checked into GitHub to S3. Each time you make a chang...
When you have multiple GitHub profiles, you want to be prompted for your username and password.
This post covers the installation of Jenkins on EC2 and the configuration of the Blue Ocean plugin for building pipelines.
A glossary of terms pertaining to AWS