Find similar documents using Neural Topic Modeling (NTM)

35 minute read

In this post, we will use the Amazon SageMaker NTM algorithm to train a model on the incidents data set.

The main goals of this post are as follows:

  1. Understand what topic modeling is and when to use it.
  2. Learn what Neural Topic Modeling is and how it differs from LDA.
  3. Learn how to obtain and store data for use in Amazon SageMaker.
  4. Create an Amazon SageMaker training job on a data set to produce an NTM model.
  5. Use the model to perform inference with an Amazon SageMaker endpoint.
  6. Explore the trained model and visualize the learned topics.

Introduction

What is topic modeling in simple terms?

The technical definition of topic modeling is that each topic is a distribution of words and each document is a mixture of topics across a set of documents (also referred to as a corpus). For example, a collection of documents that contains frequent occurrences of words such as “bike,” “car,” “mile,” or “brake” are likely to share a topic on “transportation.” If another collection of documents shares words such as “SCSI,” “port,” “floppy,” or “serial” it is likely that they are discussing a topic on “computers.” The process of topic modeling is to infer hidden variables such as word distribution for all topics and topic mixture distribution for each document by observing the entire collection of documents. The figure that follows shows the relationships among words, topics, and documents.
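To make the generative story concrete, here is a minimal sketch (plain numpy, with made-up numbers) that runs it forward: each topic is a distribution over the vocabulary, each document mixes topics, and words are drawn accordingly. Topic modeling works in the opposite direction, inferring these hidden quantities from observed documents.

import numpy as np

rng = np.random.default_rng(0)
vocab = ["bike", "car", "brake", "scsi", "port", "floppy"]

# Each topic is a probability distribution over the vocabulary (rows sum to 1).
topics = np.array([
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],   # a "transportation" topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],   # a "computers" topic
])

# Each document is a mixture of topics: here 80% transportation, 20% computers.
doc_topic_mix = np.array([0.8, 0.2])

# Generate a 10-word document: pick a topic per word, then a word from that topic.
topic_ids = rng.choice(len(topics), size=10, p=doc_topic_mix)
print([rng.choice(vocab, p=topics[t]) for t in topic_ids])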

When to use topic modeling?

There are many practical use cases for topic modeling, such as:

  • document classification based on the topics detected,
  • automatic content tagging using tags mapped to a set of topics,
  • document summarization using the topics found in the document,
  • information retrieval using topics, and
  • content recommendation based on topic similarities.

Topic modeling can also be used as a feature engineering step for downstream text-related machine learning tasks. It is also worth mentioning that topic modeling is a general approach that attempts to describe a set of observations with underlying themes. Although we focus on text documents here, the approach can be applied to other types of data. For example, topic models can also be used for modeling other discrete-data use cases, such as discovering peer-to-peer applications on the network of an internet service provider or corporate network.

Amazon SageMaker NTM is an unsupervised learning algorithm that is used to organize a corpus of documents into topics that contain word groupings based on their statistical distribution. Documents that contain frequent occurrences of words such as “bike”, “car”, “train”, “mileage”, and “speed” are likely to share a topic on “transportation” for example.

Topic modeling can be used to classify or summarize documents based on the topics detected or to retrieve information or recommend content based on topic similarities.

Topics are latent representations.

The topics from documents that NTM learns are characterized as a latent representation because the topics are inferred from the observed word distributions in the corpus. The semantics of topics are usually inferred by examining the top-ranking words they contain. Because the method is unsupervised, only the number of topics, not the topics themselves, is pre-specified. In addition, the topics are not guaranteed to align with how a human might naturally categorize documents.

Topic modeling provides a way to visualize the contents of a large document corpus in terms of the learned topics. Documents relevant to each topic might be indexed or searched for based on their soft topic labels. The latent representations of documents might also be used to find similar documents in the topic space.

What is Neural Topic Modeling?

NTM takes the high-dimensional word count vectors in documents as inputs, maps them into lower-dimensional hidden representations, and reconstructs the original input back from the hidden representations. The hidden representation learned by the model corresponds to the mixture weights of the topics associated with the document. The semantic meaning of the topics can be determined by the top-ranking words in each topic as learned by the reconstruction layer.

Each document is described as a mixture of topics.

Why SageMaker NTM?

SageMaker NTM is trained in a highly distributed cluster environment for large scale model training. It supports three data channels for the training job, including the required train channel, and the optional validation and test channels. The validation channel is used to decide when to stop the training job. You have the option to replicate or shard the training and validation data to each of the training nodes or you can stream the data when the streaming feature is available. At inference time, SageMaker NTM takes data inputs in CSV or RecordIO-wrapped-Protobuf file formats.

Difference between LDA and NTM

The SageMaker LDA (Latent Dirichlet Allocation, not to be confused with Linear Discriminant Analysis) model works by assuming that documents are formed by sampling words from a finite set of topics.

It is made of two moving parts:

  • the word composition per topic and
  • the topic composition per document

SageMaker NTM, on the other hand, doesn’t explicitly learn a word distribution per topic. It is a neural network that passes each document through a bottleneck layer and tries to reproduce the input document (presumably a Variational Auto-Encoder (VAE), according to AWS documentation). That means the bottleneck layer ends up containing all the information necessary to predict document composition, and its coefficients can be considered the topics.
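AWS doesn’t publish the exact architecture, so the following is only a conceptual numpy sketch of the shapes involved: a document’s word-count vector is squeezed through a bottleneck into a topic mixture and decoded back into a word distribution. The real model is a trained neural network; the random weights here are purely illustrative.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

vocab_size, num_topics = 2000, 20

# Random stand-ins for learned weights.
W_enc = 0.01 * np.random.randn(vocab_size, num_topics)   # encoder
W_dec = 0.01 * np.random.randn(num_topics, vocab_size)   # decoder

bow = np.random.poisson(0.01, size=vocab_size).astype(np.float32)  # a stand-in BOW vector

theta = softmax(bow @ W_enc)             # bottleneck: topic mixture weights, sums to 1
reconstruction = softmax(theta @ W_dec)  # predicted word distribution
print(theta.shape, reconstruction.shape)  # (20,) (2000,)

Each column of the decoder weight matrix plays the role of a topic; we will inspect the analogous matrix of the trained model in the Model Exploration section.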

The high-level diagram of SageMaker NTM is shown below:

When to use NTM?

Amazon Comprehend is a fully managed text analytics service that provides a pre-configured topic modeling API, best suited for the most popular use cases like organizing customer feedback, support incidents, or workgroup documents. Amazon Comprehend is the suggested topic modeling choice for most customers, as it removes many of the routine steps associated with topic modeling, like tokenization, training a model, and adjusting parameters.

Amazon SageMaker’s Neural Topic Model (NTM) caters to use cases where finer control of the training, optimization, and/or hosting of a topic model is required, such as training models on text corpora of a particular writing style or domain, or hosting topic models as part of a web application.

Data Preparation

Our dataset comprises 17,409 incidents that occurred over a period of time. Each record contains the id of the incident and text, the full description of the incident. We will use the text column as our corpus of documents for training, validating, and testing the model.

import boto3
import colab.utils.sagemaker
from io import StringIO
import pandas as pd

bucket = 'skuchkula'

file_name = 'pre_processed_input_wo_batch.csv'
s3 = boto3.client('s3')
obj = s3.get_object(Bucket= bucket, Key= file_name)
df = pd.read_csv(StringIO(obj['Body'].read().decode('ISO-8859-1')))

print(df.shape)
df.head()
(17409, 2)
id text
0 181112008494 Subset of customers received email alerts for ...
1 181028002958 A subset of Mastercard customers are experienc...
2 181113008051 Multiple clients not seeing reporting dataFrom...
3 191202019482 Token Authorization errorsOn Monday, December ...
4 190927000700 COLAB ACCESS Portal was unavailableBetween ...

List of strings: Convert the text column to a list of strings. This list of strings will be treated as the corpus of documents.

data = df['text'].to_list()
# show the first record
data[0]
"Subset of customers received email alerts for address changes which did not occurOn Monday, November 12 at 11:14 ET, CSS IM stated that a subset of customers reported receiving emails indicating there was an address change made on their account, but the customer did not did not change their address. This issue is drove higher than normal calls into call centers.; ; At 12:30 ET, CIS L2 reported the root cause of the issue was via ITSM (C03988360) which was part of the overall Core Customer Data Remediation project (DM52477) for Home Lending, Card, and Auto on Sunday, November 11. While the remediation work was underway, Saturday's batch job that sends email alerts to customers was suppressed however, Sunday's batch job was kicked off as scheduled, which sent the messages to the impacted customers. The intent was that when the clean-up was completed, the file that logs changes to customer profiles would have been deleted preventing messages from reaching customers.; ; The ticket was raised to a P1S1 at 14:07 ET due to the actual number of customers provided by CSS IM. Line of Business communications for Home Lending, Auto and Card were generated for customers who received the email alerts. No compliance issue were identified as the alerts were generic in nature and did not contain any sensitive data.; ; At 15:45 ET, CCB Communications and Legal and Compliance teams developed an approved customer communication for the 114,712 impacted customers. At 18:00 ET, the CCB Communications team began rolling out email alerts to Secure Message Center to disregard the previous alerts. Digital Operations confirmed the issue was mitigated at 19:05 ET when the final alerts were sent. No further communications will be sent for this issue.During a ~1.5 day period beginning on November 11, ~115k consumer and card customers erroneously received an email indicating that an address change had been made on their account. ~25k customers acknowledged receipt of the email by contacting the call center. The issue was triggered during a scheduled ITSM for the Core Customer Data Remediation project, when a failure to suppress a batch job resulted in the generation of emails for remediated accounts. The issue was mitigated when the impacted customers were sent another email informing them to disregard the previous message. Root cause investigation is underway to review the implementation plan for opportunities."

Text pre-processing steps


From Plain Text to Bag-of-Words (BOW)

The input documents to the algorithm, both in training and inference, need to be vectors of integers representing word counts. This is the so-called bag-of-words (BOW) representation. To convert plain text to BOW, we first need to “tokenize” our documents, i.e., identify words and assign an integer ID to each of them.

Then, we count the occurrence of each of the tokens in each document and form BOW vectors.
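As a quick illustration on a toy two-document corpus (not our incidents data), here is what the tokenized vocabulary and the resulting BOW count vectors look like:

from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["the bike brake failed", "the port failed and the port reset"]
cv = CountVectorizer()
bow = cv.fit_transform(toy_docs)
print(cv.get_feature_names())  # ['and', 'bike', 'brake', 'failed', 'port', 'reset', 'the']
print(bow.toarray())
# [[0 1 1 1 0 0 1]
#  [1 0 0 1 2 1 2]]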

Also, note that many real-world applications have large vocabulary sizes, so it may be necessary to represent the input documents in sparse format. Finally, the use of stemming and lemmatization in data preprocessing provides several benefits. Doing so can improve training and inference compute time, since it reduces the effective vocabulary size. More importantly, it can improve the quality of the learned topic-word probability matrices and inferred topic mixtures. For example, the words “parliament”, “parliaments”, “parliamentary”, “parliament’s”, and “parliamentarians” are all essentially the same word, “parliament”, with different inflections. For the purposes of detecting topics, such as a “politics” or “government” topic, including all five does not add much value, as they all essentially describe the same feature.

In this example, we will use a simple lemmatizer from nltk package and use CountVectorizer in scikit-learn to perform the token counting. For more details please refer to their documentation respectively. Alternatively, spaCy also offers easy-to-use tokenization and lemmatization functions.

In the following cell, we use a tokenizer and a lemmatizer from nltk. In the list comprehension, we implement a simple rule: only consider words that are at least 2 characters long, start with a letter, and match the token_pattern.

!pip install nltk
import nltk
nltk.download('punkt')
nltk.download('wordnet')
from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer
import re
token_pattern = re.compile(r"(?u)\b\w\w+\b")
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc) if len(t) >= 2 and re.match("[a-z].*",t)
                and re.match(token_pattern, t)]
Looking in indexes: https://frs-art.server.net/artifactory/api/pypi/pppp-public-pypi/simple/
Requirement already satisfied: nltk in /opt/colab/software/Miniconda/lib/python3.6/site-packages (3.4.5)
Requirement already satisfied: six in /opt/colab/software/Miniconda/lib/python3.6/site-packages (from nltk) (1.14.0)


[nltk_data] Downloading package punkt to
[nltk_data]     /opt/colab/software/Miniconda/lib/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /opt/colab/software/Miniconda/lib/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!

With the tokenizer defined, we perform token counting next, while limiting the vocabulary size to vocab_size.

Vectorize the corpus

import time
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vocab_size = 2000
print('Tokenizing and counting, this may take a few minutes...')
start_time = time.time()
vectorizer = CountVectorizer(input='content',
                             analyzer='word',
                             stop_words='english',
                             max_features=vocab_size,
                             tokenizer=LemmaTokenizer()
                            )
                             # max_features=vocab_size, max_df=0.95, min_df=0.2)

vectors = vectorizer.fit_transform(data)
vocab_list = vectorizer.get_feature_names()
print('vocab size:', len(vocab_list))

# random shuffle
idx = np.arange(vectors.shape[0])
np.random.shuffle(idx)
vectors = vectors[idx]

print('Done. Time elapsed: {:.2f}s'.format(time.time() - start_time))
Tokenizing and counting, this may take a few minutes...


/opt/colab/software/Miniconda/lib/python3.6/site-packages/sklearn/feature_extraction/text.py:385: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['ha', 'le', 'u', 'wa'] not in stop_words.
  'stop_words.' % sorted(inconsistent))


vocab size: 2000
Done. Time elapsed: 35.29s

Drop documents/vectors with 25 or fewer words

Optionally, we may consider removing very short documents; the following cell removes documents with 25 or fewer words. This certainly depends on the application, but there are also some general justifications: it is hard to imagine a very short document expressing more than one topic, and topic modeling tries to model each document as a mixture of multiple topics, so it may not be the best choice for modeling short documents.

threshold = 25
vectors = vectors[np.array(vectors.sum(axis=1)>threshold).reshape(-1,)]
print('removed short docs (<{} words)'.format(threshold))        
print(vectors.shape)
removed short docs (<25 words)
(13349, 2000)

The output from CountVectorizer is a sparse matrix whose elements are integers.

print(type(vectors), vectors.dtype)
print(vectors[0])
<class 'scipy.sparse.csr.csr_matrix'> int64
  (0, 43)	2
  (0, 803)	1
  (0, 900)	1
  (0, 1718)	6
  (0, 1350)	1
  (0, 1014)	1
  (0, 1760)	2
  (0, 529)	1
  (0, 1600)	1
  (0, 1942)	1
  (0, 1655)	1
  (0, 79)	2
  (0, 776)	1
  (0, 1347)	1
  (0, 98)	1
  (0, 550)	1
  (0, 1652)	1
  (0, 866)	1
  (0, 1573)	1
  (0, 1540)	1
  (0, 1982)	1
  (0, 224)	1
  (0, 24)	1
  (0, 730)	1
  (0, 330)	1
  (0, 1763)	1
  (0, 867)	6
  (0, 257)	2
  (0, 123)	1
  (0, 1867)	2
  (0, 1200)	1
  (0, 1628)	1
  (0, 305)	1
  (0, 1299)	1
  (0, 362)	6
  (0, 696)	2
  (0, 1514)	1
  (0, 931)	1
  (0, 675)	1
  (0, 1836)	1
  (0, 554)	1
  (0, 0)	1
  (0, 1254)	1
  (0, 904)	1
  (0, 616)	1
  (0, 692)	1
  (0, 728)	1
  (0, 411)	1
  (0, 1887)	1
  (0, 1901)	3

Convert the vectors to a sparse matrix of type np.float32

Because all the parameters (weights and biases) in the NTM model are of type np.float32, we need the input data to also be np.float32. It is better to do this type-casting upfront rather than repeatedly casting during mini-batch training.

import scipy.sparse as sparse
vectors = sparse.csr_matrix(vectors, dtype=np.float32)
print(type(vectors), vectors.dtype)
<class 'scipy.sparse.csr.csr_matrix'> float32

Split the vectors into train/test/validation

As is common practice in model training, we should have a training set, a validation set, and a test set. The training set is the data the model is actually trained on. But what we really care about is not the model’s performance on the training set but its performance on future, unseen data. Therefore, during training, we periodically calculate scores (or losses) on the validation set to validate the performance of the model on unseen data. By assessing the model’s ability to generalize, we can stop the training at the optimal point via early stopping to avoid over-training.

Note that when we only have a training set and no validation set, the NTM model will rely on scores on the training set to perform early stopping, which could result in over-training. Therefore, we recommend always supplying a validation set to the model.

Here we use 80% of the data set as the training set and the rest for validation set and test set. We will use the validation set in training and use the test set for demonstrating model inference.

n_train = int(0.8 * vectors.shape[0])

# split train and test
train_vectors = vectors[:n_train, :]
test_vectors = vectors[n_train:, :]

# further split test set into validation set (val_vectors) and test  set (test_vectors)
n_test = test_vectors.shape[0]
val_vectors = test_vectors[:n_test//2, :]
test_vectors = test_vectors[n_test//2:, :]
print(train_vectors.shape, test_vectors.shape, val_vectors.shape)
(10679, 2000) (1335, 2000) (1335, 2000)

Get a Sagemaker session, role and bucket

import sagemaker
import colab.utils.sagemaker
from sagemaker.amazon.amazon_estimator import get_image_uri
import os
import boto3

# All training algorithm parameters must be wrapped with this module
from colab.utils.sagemaker import get_pysdk_training_params

sagemaker_session = sagemaker.Session()

# Set Sagemaker's default bucket to correct S3 bucket
sagemaker_session._default_bucket = colab.utils.sagemaker.bucket

bucket = colab.utils.sagemaker.bucket

Prepare to upload data to S3

prefix = 'incidents'

train_prefix = os.path.join(prefix, 'train')
val_prefix = os.path.join(prefix, 'val')
output_prefix = os.path.join(prefix, 'output')

s3_train_data = os.path.join('s3://', bucket, train_prefix)
s3_val_data = os.path.join('s3://', bucket, val_prefix)
output_path = os.path.join('s3://', bucket, output_prefix)
print('Training set location', s3_train_data)
print('Validation set location', s3_val_data)
print('Trained model will be saved at', output_path)
Training set location s3://skuchkula/incidents/train
Validation set location s3://skuchkula/incidents/val
Trained model will be saved at s3://skuchkula/incidents/output

Utility function to help split, convert and upload to S3

The NTM algorithm, as well as other first-party SageMaker algorithms, accepts data in RecordIO Protobuf format. The SageMaker Python API provides helper functions for easily converting your data into this format. Below we convert the data from numpy/scipy format and upload it to an Amazon S3 destination for the model to access during training.

A word about the RecordIO format and the CSV format: the train, validation, and test data channels for NTM support both:

  • recordIO-wrapped-protobuf (dense and sparse) and
  • CSV file formats.

For the CSV format, each row must be represented densely, with zero counts for words not present in the corresponding document, so the data has dimension (number of records) × (vocabulary size).
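We use the protobuf format below, but for reference, a minimal sketch of densifying the sparse vectors and uploading them as CSV might look like the following (the object key is illustrative; note the file grows quickly because every zero count is written out):

import io
import boto3
import numpy as np

def upload_dense_csv(sparray, bucket, key):
    buf = io.BytesIO()
    # each row is a full, dense vocabulary-length vector of counts
    np.savetxt(buf, np.asarray(sparray.todense(), dtype=np.float32), delimiter=',', fmt='%g')
    buf.seek(0)
    boto3.resource('s3').Bucket(bucket).Object(key).upload_fileobj(buf)

# e.g. upload_dense_csv(train_vectors, bucket, 'incidents/train/train.csv')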

You can use either File mode or Pipe mode to train models on data that is formatted as recordIO-wrapped-protobuf or as CSV.

What are File mode and Pipe mode? Check the documentation.

Here we define a helper function to convert the data to RecordIO Protobuf format and upload it to S3. In addition, we will have the option to split the data into several parts specified by n_parts.

The algorithm inherently supports multiple files in the training folder (“channel”), which can be very helpful for large data sets. In addition, when we use distributed training with multiple workers (compute instances), having multiple files allows us to conveniently distribute different portions of the training data to different workers.

Inside this helper function we use the write_spmatrix_to_sparse_tensor function provided by the SageMaker Python SDK to convert the scipy sparse matrix into RecordIO Protobuf format.

def split_convert_upload(sparray, bucket, prefix, fname_template='data_part{}.pbr', n_parts=2):
    import io
    import boto3
    import sagemaker.amazon.common as smac

    chunk_size = sparray.shape[0]// n_parts
    for i in range(n_parts):

        # Calculate start and end indices
        start = i*chunk_size
        end = (i+1)*chunk_size
        if i+1 == n_parts:
            end = sparray.shape[0]

        # Convert to record protobuf
        buf = io.BytesIO()
        smac.write_spmatrix_to_sparse_tensor(array=sparray[start:end], file=buf, labels=None)
        buf.seek(0)

        # Upload to s3 location specified by bucket and prefix
        fname = os.path.join(prefix, fname_template.format(i))
        boto3.resource('s3').Bucket(bucket).Object(fname).upload_fileobj(buf)
        print('Uploaded data to s3://{}'.format(os.path.join(bucket, fname)))

split_convert_upload(train_vectors, bucket=bucket, prefix=train_prefix, fname_template='train_part{}.pbr', n_parts=8)
split_convert_upload(val_vectors, bucket=bucket, prefix=val_prefix, fname_template='val_part{}.pbr', n_parts=1)
Uploaded data to s3://skuchkula/incidents/train/train_part0.pbr
Uploaded data to s3://skuchkula/incidents/train/train_part1.pbr
Uploaded data to s3://skuchkula/incidents/train/train_part2.pbr
Uploaded data to s3://skuchkula/incidents/train/train_part3.pbr
Uploaded data to s3://skuchkula/incidents/train/train_part4.pbr
Uploaded data to s3://skuchkula/incidents/train/train_part5.pbr
Uploaded data to s3://skuchkula/incidents/train/train_part6.pbr
Uploaded data to s3://skuchkula/incidents/train/train_part7.pbr
Uploaded data to s3://skuchkula/incidents/val/val_part0.pbr

Model Training

We have created the training and validation data sets and uploaded them to S3. Next, we configure a SageMaker training job to use the NTM algorithm on the data we prepared.

Get the container, set the hyperparams, build your estimator

SageMaker uses Amazon Elastic Container Registry (ECR) Docker containers to host the NTM training image. ECR containers for SageMaker NTM training are available in multiple regions. For the latest Docker container registry, please refer to Amazon SageMaker: Common Parameters.

The code in the cell below automatically chooses an algorithm container based on the current region. In the API call to sagemaker.estimator.Estimator we also specify the type and count of instances for the training job. Because the incidents data set is relatively small, we chose a CPU-only instance (ml.c4.xlarge), but feel free to change to other instance types.

Note: The ml.c4.xlarge instance didn’t work out for us, so we switched to ml.m4.xlarge.

NTM takes full advantage of GPU hardware and in general trains roughly an order of magnitude faster on a GPU than on a CPU. Multi-GPU or multi-instance training further improves training speed roughly linearly if communication overhead is low compared to compute time.

Hyperparameters to consider

Here we highlight a few hyperparameters. For information about the full list of available hyperparameters, please refer to NTM Hyperparameters.

  • feature_dim - the “feature dimension”; it should be set to the vocabulary size
  • num_topics - the number of topics to extract
  • mini_batch_size - the batch size for each worker instance. Note that in multi-GPU instances, this number is further divided by the number of GPUs. Therefore, if we plan to train on an 8-GPU machine (such as ml.p2.8xlarge) and wish each GPU to have 1024 training examples per batch, mini_batch_size should be set to 8192.
  • epochs - the maximum number of epochs to train for; training may stop early
  • num_patience_epochs and tolerance control the early stopping behavior. Roughly speaking, the algorithm stops training if there has been no improvement in validation loss within the last num_patience_epochs epochs. Improvements smaller than tolerance are considered non-improvements.
  • optimizer and learning_rate - by default we use the adadelta optimizer, and learning_rate does not need to be set. For other optimizers, the choice of an appropriate learning rate may require experimentation.
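For reference, a more fully specified hyperparameter dictionary might look like the following; the values here are illustrative starting points, not tuned recommendations:

hyperparameters = {
    'num_topics': 20,           # number of topics to extract
    'feature_dim': vocab_size,  # must equal the vocabulary size
    'mini_batch_size': 256,     # per-worker batch size
    'epochs': 100,              # upper bound; early stopping may end training sooner
    'num_patience_epochs': 3,   # stop after 3 epochs without validation improvement
    'tolerance': 0.001,         # minimum loss change that counts as an improvement
    'optimizer': 'adadelta',    # default optimizer; others may need a learning_rate
}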
container = get_image_uri(boto3.Session().region_name, 'ntm')

ntm_params = {"train_instance_type":"ml.m4.xlarge",
                  "train_instance_count": 1,
                  "train_volume_size": 30,
                  "output_path":output_path,
                  "hyperparameters":{'num_topics': 20, 'feature_dim': vocab_size}
             }


estimator = sagemaker.estimator.Estimator(container, **get_pysdk_training_params(ntm_params),
                                      sagemaker_session=sagemaker_session)
INFO:colab.utils.sagemaker:Running colab-Sagemaker Parameters Checks for successful launch...
INFO:colab.utils.sagemaker:Bucket correctly set to: skuchkula
INFO:colab.utils.sagemaker:Setting Sagemaker role to: arn:aws:iam::888888888888:role/sagemaker/colab-system-sm-role
INFO:colab.utils.sagemaker:Setting Sagemaker KMS key to: arn:aws:kms:us-east-1:888888888888:key/f43c82fd-a0ff-4db7-bb5e-cd8ee4e8bdde
INFO:colab.utils.sagemaker:Setting Sagemaker Volume KMS key to: arn:aws:kms:us-east-1:888888888888:key/f43c82fd-a0ff-4db7-bb5e-cd8ee4e8bdde
INFO:colab.utils.sagemaker:Setting Sagemaker Security Group to: sg-08fa1a35165b38c8f
INFO:colab.utils.sagemaker:Setting Sagemaker Enable Inter Container Traffic Encryption to True
INFO:colab.utils.sagemaker:Setting Sagemaker Subnets to: ['subnet-03de0646d7a46a997', 'subnet-0e2973e7785ef53e1']
INFO:colab.utils.sagemaker:Tagging Sagemaker Job with SID: ABCDEFG
INFO:colab.utils.sagemaker:Finished. Please verify that your Sagemaker s3input python object contains the kms key: arn:aws:kms:us-east-1:888888888888:key/f43c82fd-a0ff-4db7-bb5e-cd8ee4e8bdde and is reading from bucket: skuchkula
WARNING:colab.config:Config section 'Environment' not found in '/opt/colab/work/instance1/jupyterinstall/bin/config.ini'

Next, we need to specify how the training data and validation data will be distributed to the workers during training. There are two modes for data channels:

  • FullyReplicated: all data files will be copied to all workers
  • ShardedByS3Key: data files will be sharded to different workers, i.e. each worker will receive a different portion of the full data set.

At the time of writing, by default, the Python SDK uses FullyReplicated mode for all data channels. This is desirable for the validation (test) channel but not suitable for the training channel. The reason is that when we use multiple workers, we would like each of them to go through a different portion of the data set, so as to provide different gradients within epochs. Using FullyReplicated mode on training data not only results in slower training time per epoch (nearly 1.5X in this example), but also defeats the purpose of distributed training. To set the training data channel correctly, we specify the distribution to be ShardedByS3Key for the training data channel, as follows.

Data Channels: Amazon SageMaker Neural Topic Model supports four data channels: train, validation, test, and auxiliary. The validation, test, and auxiliary data channels are optional. If you specify any of these optional channels, set the value of the S3DataDistributionType parameter for them to FullyReplicated. If you provide validation data, the loss on this data is logged at every epoch, and the model stops training as soon as it detects that the validation loss is not improving. If you don’t provide validation data, the algorithm stops early based on the training data, but this can be less efficient. If you provide test data, the algorithm reports the test loss from the final model.

from sagemaker.session import s3_input

## sharding only helps with multiple workers; since we train on a single
## instance over the full data set, we keep the default FullyReplicated mode
#s3_train = s3_input(s3_train_data, distribution='ShardedByS3Key')

s3_train = s3_input(s3_train_data)

Start the training job

Now we are ready to train. The following cell takes a few minutes to run. The command below will first provision the required hardware. You will see a series of dots indicating the progress of the hardware provisioning process. Once the resources are allocated, training logs will be displayed. With multiple workers, the log color and the ID following INFO identifies logs emitted by different workers.

estimator.fit({'train': s3_train, 'validation': s3_val_data})
INFO:sagemaker:Creating training-job with name: ntm-2020-12-03-15-41-25-233
2020-12-03 15:41:25 Starting - Starting the training job...
2020-12-03 15:41:28 Starting - Launching requested ML instances......
2020-12-03 15:42:45 Starting - Preparing the instances for training......
2020-12-03 15:43:55 Downloading - Downloading input data
2020-12-03 15:43:55 Training - Downloading the training image.....Docker entrypoint called with argument(s): train


2020-12-03 15:46:06 Completed - Training job completed
Training seconds: 151
Billable seconds: 151

If you see the message

===== Job Complete =====

at the bottom of the output logs, then training completed successfully and the output NTM model was stored in the specified output path. You can also view information about, and the status of, a training job using the AWS SageMaker console. Just click on the “Jobs” tab and select the training job matching the training job name printed below:

print('Training job name: {}'.format(estimator.latest_training_job.job_name))
Training job name: ntm-2020-12-03-15-41-25-233

Model Hosting and Inference

A trained NTM model does nothing on its own. We now want to use the model we computed to perform inference on data. For this example, that means predicting the topic mixture representing a given document.

We create an inference endpoint with the SageMaker Python SDK deploy() function on the estimator from the training job we ran above. We specify the instance type on which inference is computed, as well as the initial number of instances to spin up.

NOTE: We need to specify kms_key as per colab standards

Deploy the model

ntm_predictor = estimator.deploy(initial_instance_count=1,
                                 kms_key=colab.utils.sagemaker.kms_key,
                                 instance_type='ml.m4.xlarge')
INFO:sagemaker:Creating model with name: ntm-2020-12-03-15-41-25-233
INFO:sagemaker:Creating endpoint with name ntm-2020-12-03-15-41-25-233
-------------------!

You now have a functioning SageMaker NTM inference endpoint. You can confirm the endpoint configuration and status by navigating to the “Endpoints” tab in the AWS SageMaker console and selecting the endpoint matching the endpoint name, below:

print('Endpoint name: {}'.format(ntm_predictor.endpoint))
Endpoint name: ntm-2020-12-03-15-41-25-233

Inference


Data Serialization/Deserialization

We can pass data in a variety of formats to our inference endpoint.

1. Inference with CSV

First, we will demonstrate passing CSV-formatted data.

We make use of the SageMaker Python SDK utilities csv_serializer and json_deserializer when configuring the inference endpoint.

from sagemaker.predictor import csv_serializer, json_deserializer

ntm_predictor.content_type = 'text/csv'
ntm_predictor.serializer = csv_serializer
ntm_predictor.deserializer = json_deserializer

Let’s pass 5 examples from the test set to the inference endpoint.

# let's check the type of test_vectors.
# Recall, this was created by CountVectorizer, which produces a sparse integer matrix.
# We then converted that integer representation to float since the model expects it that way.
print(type(test_vectors))
<class 'scipy.sparse.csr.csr_matrix'>
test_data = np.array(test_vectors.todense())
results = ntm_predictor.predict(test_data[:5])
print(results)
{'predictions': [{'topic_weights': [0.0567261837, 0.0671920627, 0.0289357956, 0.0599537827, 0.0297795124, 0.1187543124, 0.0279128887, 0.0327085704, 0.0593245439, 0.0196267124, 0.0641732216, 0.0345759168, 0.021036407, 0.0515591167, 0.0228312258, 0.0155250067, 0.1406449229, 0.0705114976, 0.0339260809, 0.0443022214]}, {'topic_weights': [0.1340247095, 0.037689887, 0.016485434, 0.0101175765, 0.0359941758, 0.022102369, 0.0186521877, 0.0161334891, 0.016403731, 0.0202505533, 0.0254321769, 0.0144165037, 0.0220218971, 0.0558746085, 0.0210400559, 0.0101732574, 0.2344350666, 0.2526515126, 0.0201292485, 0.0159715991]}, {'topic_weights': [0.023021739, 0.0144292209, 0.0203277655, 0.0586762503, 0.0872816145, 0.3128592372, 0.0127118388, 0.0125361746, 0.0196874347, 0.0246341676, 0.0538540818, 0.0141086187, 0.0158732701, 0.0195614435, 0.0201502163, 0.0144626331, 0.0898360834, 0.1529843211, 0.0185426343, 0.0144612845]}, {'topic_weights': [0.014288554, 0.0230178218, 0.0314109735, 0.0217012279, 0.0141560072, 0.0156232975, 0.1495890468, 0.0163509138, 0.0203847513, 0.1579205245, 0.0181005951, 0.0358017012, 0.0296010468, 0.0330678225, 0.0157390963, 0.2580412328, 0.1023192704, 0.0139160687, 0.0148121016, 0.0141580012]}, {'topic_weights': [0.0130719347, 0.0199746974, 0.0226689205, 0.0153542599, 0.2011407316, 0.0494823605, 0.0132068573, 0.0112162801, 0.0131191825, 0.0138364518, 0.0148863122, 0.0224576276, 0.0182857532, 0.0157599114, 0.0274943076, 0.148061797, 0.0121065462, 0.2481649071, 0.1063650027, 0.013346158]}]}

We can see the output format of SageMaker NTM inference endpoint is a Python dictionary with the following format.

{
  'predictions': [
    {'topic_weights': [ ... ] },
    {'topic_weights': [ ... ] },
    {'topic_weights': [ ... ] },
    ...
  ]
}

We extract the topic weights corresponding to each of the input documents.

# convert the dict to a list of lists
predictions = np.array([prediction['topic_weights'] for prediction in results['predictions']])
print(predictions)
[[0.05672618 0.06719206 0.0289358  0.05995378 0.02977951 0.11875431
  0.02791289 0.03270857 0.05932454 0.01962671 0.06417322 0.03457592
  0.02103641 0.05155912 0.02283123 0.01552501 0.14064492 0.0705115
  0.03392608 0.04430222]
 [0.13402471 0.03768989 0.01648543 0.01011758 0.03599418 0.02210237
  0.01865219 0.01613349 0.01640373 0.02025055 0.02543218 0.0144165
  0.0220219  0.05587461 0.02104006 0.01017326 0.23443507 0.25265151
  0.02012925 0.0159716 ]
 [0.02302174 0.01442922 0.02032777 0.05867625 0.08728161 0.31285924
  0.01271184 0.01253617 0.01968743 0.02463417 0.05385408 0.01410862
  0.01587327 0.01956144 0.02015022 0.01446263 0.08983608 0.15298432
  0.01854263 0.01446128]
 [0.01428855 0.02301782 0.03141097 0.02170123 0.01415601 0.0156233
  0.14958905 0.01635091 0.02038475 0.15792052 0.0181006  0.0358017
  0.02960105 0.03306782 0.0157391  0.25804123 0.10231927 0.01391607
  0.0148121  0.014158  ]
 [0.01307193 0.0199747  0.02266892 0.01535426 0.20114073 0.04948236
  0.01320686 0.01121628 0.01311918 0.01383645 0.01488631 0.02245763
  0.01828575 0.01575991 0.02749431 0.1480618  0.01210655 0.24816491
  0.106365   0.01334616]]

2. Inference with RecordIO Protobuf

The inference endpoint also supports JSON-formatted data and RecordIO Protobuf; see Common Data Formats—Inference for more information.

At the time of writing SageMaker Python SDK does not yet have a RecordIO Protobuf serializer, but it is fairly straightforward to create one as follows.

def recordio_protobuf_serializer(spmatrix):
    import io
    import sagemaker.amazon.common as smac
    buf = io.BytesIO()
    smac.write_spmatrix_to_sparse_tensor(array=spmatrix, file=buf, labels=None)
    buf.seek(0)
    return buf

Now we specify the serializer to be the one we just created and the content_type to be 'application/x-recordio-protobuf', and inference can be carried out in RecordIO Protobuf format.

ntm_predictor.content_type = 'application/x-recordio-protobuf'
ntm_predictor.serializer = recordio_protobuf_serializer
ntm_predictor.deserializer = json_deserializer

# notice here, we didn't have to convert test_vectors to dense
results = ntm_predictor.predict(test_vectors[:5])
print(results)
{'predictions': [{'topic_weights': [0.0567261837, 0.0671920627, 0.0289357956, 0.0599537827, 0.0297795124, 0.1187543124, 0.0279128887, 0.0327085704, 0.0593245439, 0.0196267124, 0.0641732216, 0.0345759168, 0.021036407, 0.0515591167, 0.0228312258, 0.0155250067, 0.1406449229, 0.0705114976, 0.0339260809, 0.0443022214]}, {'topic_weights': [0.1340247095, 0.037689887, 0.016485434, 0.0101175765, 0.0359941758, 0.022102369, 0.0186521877, 0.0161334891, 0.016403731, 0.0202505533, 0.0254321769, 0.0144165037, 0.0220218971, 0.0558746085, 0.0210400559, 0.0101732574, 0.2344350666, 0.2526515126, 0.0201292485, 0.0159715991]}, {'topic_weights': [0.023021739, 0.0144292209, 0.0203277655, 0.0586762503, 0.0872816145, 0.3128592372, 0.0127118388, 0.0125361746, 0.0196874347, 0.0246341676, 0.0538540818, 0.0141086187, 0.0158732701, 0.0195614435, 0.0201502163, 0.0144626331, 0.0898360834, 0.1529843211, 0.0185426343, 0.0144612845]}, {'topic_weights': [0.014288554, 0.0230178218, 0.0314109735, 0.0217012279, 0.0141560072, 0.0156232975, 0.1495890468, 0.0163509138, 0.0203847513, 0.1579205245, 0.0181005951, 0.0358017012, 0.0296010468, 0.0330678225, 0.0157390963, 0.2580412328, 0.1023192704, 0.0139160687, 0.0148121016, 0.0141580012]}, {'topic_weights': [0.0130719347, 0.0199746974, 0.0226689205, 0.0153542599, 0.2011407316, 0.0494823605, 0.0132068573, 0.0112162801, 0.0131191825, 0.0138364518, 0.0148863122, 0.0224576276, 0.0182857532, 0.0157599114, 0.0274943076, 0.148061797, 0.0121065462, 0.2481649071, 0.1063650027, 0.013346158]}]}

If you decide to compare these results to the topic weights generated above, keep in mind that SageMaker NTM discovers topics in no particular order. That is, the topic mixtures from one training run may be (approximate) permutations of those from another run on the same documents.
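If you ever need to compare topics across two training runs, one way to handle this permutation ambiguity is to match topics by maximizing the correlation between their topic-weight columns, e.g. with the Hungarian algorithm. A sketch (the helper name and approach are ours, not part of the SDK):

import numpy as np
from scipy.optimize import linear_sum_assignment

def match_topics(pred_a, pred_b):
    # pred_a, pred_b: (n_docs, n_topics) topic-weight arrays from two runs
    k = pred_a.shape[1]
    corr = np.corrcoef(pred_a.T, pred_b.T)[:k, k:]  # cross-run topic correlations
    row, col = linear_sum_assignment(-corr)          # maximize total correlation
    return col  # col[i] is the topic in pred_b best matching topic i in pred_a

# e.g. perm = match_topics(predictions_run1, predictions_run2)  # hypothetical arrays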

Visualize the results

Now we can take a look at how the 20 topics are assigned to the 5 test documents with a bar plot. How do we interpret this plot? Recall that each document is represented as a mixture of topics. Since we specified 20 topics, each document is represented as a mixture of these 20 topics. For example, for the second document (shown in green), the most dominant topic is topic 5.

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

fs = 12
df=pd.DataFrame(predictions.T)
df.plot(kind='bar', figsize=(16,4), fontsize=fs)
plt.ylabel('Topic assignment', fontsize=fs+2)
plt.xlabel('Topic ID', fontsize=fs+2)
Text(0.5, 0, 'Topic ID')

[bar plot: topic assignments for the 5 test documents]

But what are these topics made of? We will see that in the next section, where we explore the trained model.

Model Exploration

Note: The following section is meant as a deeper dive into exploring the trained models. The demonstrated functionalities may not be fully supported or guaranteed. For example, the parameter names may change without notice.

The trained model artifact is a compressed package of the MXNet models from the workers (in our case only one worker was used). To explore the model, we first need to install mxnet.

# If you use conda_mxnet_p36 kernel, mxnet is already installed, otherwise, uncomment the following line to install.
# !pip install mxnet
import mxnet as mx

Here we download & unpack the artifact

model_path = os.path.join(output_prefix, estimator._current_job_name, 'output/model.tar.gz')
print(model_path)
boto3.resource('s3').Bucket(bucket).download_file(model_path, 'downloaded_model.tar.gz')
'incidents/output/ntm-2020-12-03-15-41-25-233/output/model.tar.gz'
!tar -xzvf 'downloaded_model.tar.gz'
model_algo-1
# use flag -o to overwrite previous unzipped content
!unzip -o model_algo-1
Archive:  model_algo-1
 extracting: meta.json               
 extracting: symbol.json             
 extracting: params                  

We can load the model parameters and extract the weight matrix W of the decoder as follows:

model = mx.ndarray.load('params')
W = model['arg:projection_weight']

Matrix W corresponds to the W in the NTM diagram at the beginning of this notebook. Each column of W corresponds to a learned topic, and the elements in a column correspond to the pseudo-probabilities of words within that topic. We can visualize each topic as a word cloud, with the size of each word proportional to its pseudo-probability of appearing under that topic.

WordCloud to visualize the top words in each topic

import wordcloud as wc

num_topics=20

word_to_id = dict()
for i, v in enumerate(vocab_list):
    word_to_id[v] = i

limit = 24
n_col = 4
counter = 0

plt.figure(figsize=(20,16))
for ind in range(num_topics):

    if counter >= limit:
        break

    title_str = 'Topic{}'.format(ind)

    #pvals = mx.nd.softmax(W[:, ind]).asnumpy()
    pvals = mx.nd.softmax(mx.nd.array(W[:, ind])).asnumpy()

    word_freq = dict()
    for k in word_to_id.keys():
        i = word_to_id[k]
        word_freq[k] =pvals[i]

    wordcloud = wc.WordCloud(background_color='white').fit_words(word_freq)

    plt.subplot(limit // n_col, n_col, counter+1)
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.title(title_str)

    counter +=1

[word clouds: top words for each of the 20 topics]

How to interpret this wordcloud?: The wordcloud is just a visual tool to check what each topic is made of; in other words, we are visually checking whether the topics make any sense. It can also guide fine-tuning of your topic model, either by increasing or decreasing the number of topics, or by dropping certain stopwords or frequent words that occur in your domain (for example: server, string, etc.), as in the sketch below.
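For example, to drop domain-specific high-frequency words during vectorization, you could extend the stop-word list and re-fit the vectorizer; the extra words below are illustrative, chosen by inspecting the word clouds:

from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# illustrative domain words; pick them by inspecting the word clouds
domain_stop_words = {'server', 'string', 'issue', 'team'}
custom_stop_words = list(ENGLISH_STOP_WORDS.union(domain_stop_words))

vectorizer = CountVectorizer(stop_words=custom_stop_words,
                             max_features=vocab_size,
                             tokenizer=LemmaTokenizer())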

Our objective is to find similar incidents, given a new incident. Since each document is represented as a mixture of topics (topic_weights), we can use a similarity metric such as cosine similarity to determine the nearest documents to a given document and return the top 10 nearest (most similar) incidents.
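Note that the helper functions below actually score similarity with a plain dot product of topic weights; a true cosine similarity additionally normalizes by vector length. A sketch of the normalized version (the helper name is ours):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def top_k_similar(query_weights, corpus_weights, k=10):
    # query_weights: (n_topics,); corpus_weights: (n_docs, n_topics)
    sims = cosine_similarity(np.asarray(query_weights).reshape(1, -1),
                             np.asarray(corpus_weights)).ravel()
    return np.argsort(sims)[::-1][:k]  # indices of the k most similar documents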

Find similar incidents using topic weight and cosine similarity

Getting inference for the entire corpus

In order to calculate the nearest incidents, we need the topic_weights for all the documents. We achieve this by sending the entire data set to the model in batches, which gives us the topic_weights for each document (using the get_inference helper defined below).

# send the corpus to the endpoint in batches of 174 documents
n = 174
final = [data[i * n:(i + 1) * n] for i in range((len(data) + n - 1) // n)]
# get topic weights for each batch (get_inference is defined below)
final_inferences = [get_inference(item, ntm_predictor, vectorizer) for item in final]
import itertools
merged = list(itertools.chain(*final_inferences))
len(merged)
17409
merged[0]['topic_weights']
[0.0220480561,
 0.078898944,
 0.0286504366,
 0.0273015518,
 0.0183144696,
 0.0093562854,
 0.0121184299,
 0.0167128388,
 0.0132112149,
 0.0922882855,
 0.0414835848,
 0.0355786607,
 0.0236057937,
 0.0242753942,
 0.0530637614,
 0.0156719908,
 0.0858293027,
 0.0153687485,
 0.2848448157,
 0.1013774574]
merged_tw = [item['topic_weights'] for item in merged]
bucket = 'skuchkula'

file_name = 'pre_processed_input_wo_batch.csv'
s3 = boto3.client('s3')
obj = s3.get_object(Bucket= bucket, Key= file_name)
incidents_df = pd.read_csv(StringIO(obj['Body'].read().decode('ISO-8859-1')))
incidents_df.shape
(17409, 2)
incidents_df['topic_weights'] = merged_tw
incidents_df.head()
id text topic_weights
0 181112008494 Subset of customers received email alerts for ... [0.0220480561, 0.078898944, 0.0286504366, 0.02...
1 181028002958 A subset of Mastercard customers are experienc... [0.015776258, 0.0221211389, 0.0240189731, 0.03...
2 181113008051 Multiple clients not seeing reporting dataFrom... [0.012877184, 0.0996591225, 0.0216155462, 0.01...
3 191202019482 Token Authorization errorsOn Monday, December ... [0.0162641034, 0.0242145658, 0.0188226271, 0.0...
4 190927000700 COLAB ACCESS Portal was unavailableBetween ... [0.0148392115, 0.0320588984, 0.0286968369, 0.1...

Calculate dominant topic

incidents_df['topic'] = incidents_df['topic_weights'].apply(np.argmax)
incidents_df.sample(n=10)
id text topic_weights topic
5617 200819002819 Missing Patches(2008 and 2012)YGVAMRSSRSD102; ... [0.0739082694, 0.0920438617, 0.037747249, 0.03... 1
8385 200731005653 File Mover Error[NP2406]:NA PRODError during l... [0.0611550733, 0.1208992675, 0.0331611075, 0.0... 1
12303 201023009001 10/23/2020 14:47:36GMT - z1s51imdmgrp - alert... [0.0320595875, 0.0467956141, 0.0392448604, 0.0... 8
6375 200807007048 Update password for user EMEA/SLDNIMAPPSMARTPR... [0.1059061289, 0.1753362119, 0.0294531044, 0.0... 1
4691 200616005029 Some CCB customers experienced an error when p... [0.0169876423, 0.0419127829, 0.0224173274, 0.0... 19
14183 201020012574 10/20/2020 16:33:03GMT - vsin23p4618.svr.us.jp... [0.0185295306, 0.0217058826, 0.0233596079, 0.0... 8
12609 201006000773 LIQUIDITY_SERVER_Ufix nmfp get popup data issue [0.0838106275, 0.0918867439, 0.0397975631, 0.0... 1
4748 200925000242 Alacrity#25468776 - PIQ1_EQRSKLN space constra... [0.0216869693, 0.0364853889, 0.0256613437, 0.0... 5
9722 200604011407 06/04/2020 15:48:38GMT - vsin11p9931.svr.us.jp... [0.0159330759, 0.0206041448, 0.0246804114, 0.0... 8
5503 200813000057 08/12/2020 23:02:36GMT - manta-envoy-p - Unabl... [0.0112355184, 0.007043316, 0.0170210488, 0.17... 3

Get inference for a new incident

#new_incident = ['Sybase slowness experienced in CDC1 datacenter']
new_incident = ['Due to issues with the ARCOT authentication server in the DR environment, for International Private Retail (IPB) On-Line , clients were intermittentlyunable to log in to IPB On-Line through their web browser, from 04:10 ET to 05:39 ET. As a workaround, the mobile site was available for use. There was no impact to Geneva, APAC, and MX users. There were five (5) client calls into CSS due to this issue. TECHnology teams advised that, traffic was directed to the ARCOT Authentication DR nodes, during the Apache upgrade on the CC nodes. However one of the authentication DR nodes encountered a cache mishandling error causing intermittent log in errors. To mitigate the issue, TECHnology teams restarted the problematicserver. TECHnology teams will conduct full root cause analysis and will provide findings, as needed. Impacted: IPB']


def get_inference(incident, predictor, vectorizer):
    predictor.content_type = 'text/csv'
    predictor.serializer = csv_serializer
    predictor.deserializer = json_deserializer

    # transform the incident text using the fitted vectorizer
    incident_vector = vectorizer.transform(incident)

    # cast to np.float32, since the vectorizer returns integer counts
    incident_vector = sparse.csr_matrix(incident_vector, dtype=np.float32)

    # convert to dense format
    incident_vector_dense = np.array(incident_vector.todense())

    # get the prediction
    incident_weights = predictor.predict(incident_vector_dense)

    return incident_weights['predictions']
get_inference(new_incident, ntm_predictor, vectorizer)
[{'topic_weights': [0.0269878507,
   0.0280393343,
   0.0311581101,
   0.1714375615,
   0.0792532042,
   0.0155932307,
   0.0204415135,
   0.0243986864,
   0.0215448905,
   0.0799083039,
   0.0849156678,
   0.0286211241,
   0.0196325146,
   0.0216842052,
   0.1017097756,
   0.0181447566,
   0.1239855811,
   0.0236457027,
   0.0343143195,
   0.0445837118]}]

Given a new incident, what topic does it belong to?

def get_most_similar_incident(incident):
    weights = get_inference(incident, ntm_predictor, vectorizer)

    # currently this function takes only 1 incident at a time.
    weights = weights[0]['topic_weights']

    # get the most dominant topic
    topic = np.argmax(weights)

    # score similarity with the dot product of topic weights
    # (an unnormalized cosine similarity; see the note above)
    incidents_df['cos_similarity'] = incidents_df.topic_weights.apply(lambda x: np.dot(x, weights))

    # find the index of the most similar incident
    index = np.argmax(incidents_df['cos_similarity'])

    # return the ticket text
    return (topic, incidents_df.loc[index, 'text'])

new_incident = ['PRISM batch marker delays impacting US SOD business']
weights = get_inference(new_incident, ntm_predictor, vectorizer)
weights[0]['topic_weights']
[0.0665333644,
 0.0553925261,
 0.0256432425,
 0.0248773713,
 0.0795705169,
 0.1314134002,
 0.0276308432,
 0.020379303,
 0.0416415259,
 0.0669960082,
 0.0353013016,
 0.0246028583,
 0.0285938624,
 0.0526068211,
 0.035232313,
 0.0256939959,
 0.065179944,
 0.1240279526,
 0.0437880531,
 0.0248947758]
# topic 5 is the dominant topic for this incident;
# list the incidents in the corpus whose dominant topic is also topic 5
incidents_df[incidents_df.topic == 5]
id text topic_weights topic cos_similarity
30 181119006474 Apollo stride trading cash instance failureFro... [0.019139871, 0.0267465133, 0.0195751935, 0.05... 5 0.053427
31 190107007909 [Orion NY] NA Rates Options Trading - Incorrec... [0.074222438, 0.0233598407, 0.0261603035, 0.02... 5 0.053875
46 190307005721 APAC Global Booking Service ( GBS ) FailureExe... [0.0403782725, 0.0358727239, 0.0251196455, 0.0... 5 0.054873
49 190624017387 RDT-PUMA slowness being reportedSince approxim... [0.0291204974, 0.0430362076, 0.0298821498, 0.1... 5 0.053771
60 190719000484 EDGE - Replaced wave from EDGE to GMOC did not... [0.0242208745, 0.0174499918, 0.0177331418, 0.0... 5 0.053625
... ... ... ... ... ...
17212 200901015162 09/01/2020 19:36:24GMT - ccpbvorainfprd09.svr.... [0.0142116211, 0.0220372397, 0.0646731779, 0.0... 5 0.043876
17233 200921018528 Stone Oak Building C , 20855 Stone Oak Pkwy, S... [0.074029237, 0.0733176395, 0.0546116829, 0.05... 5 0.052419
17261 200927011160 09/27/2020 20:51:13GMT - lspbvaradev03.svr.eme... [0.0148627209, 0.0452507846, 0.0366315506, 0.0... 5 0.046609
17262 200927011156 09/27/2020 20:50:30GMT - lspbvaradev03.svr.eme... [0.0125397649, 0.0219310503, 0.0332243107, 0.0... 5 0.045614
17297 200921005296 09/21/2020 07:06:21GMT - mrpbvorainfprd09.svr.... [0.0148063675, 0.029510336, 0.0901147649, 0.01... 5 0.044055

893 rows × 5 columns

Get top 5 closest incidents to the current incident

Given a new incident, return the 5 closest incidents based on the training corpus.

incidents_df.head()
id text topic_weights topic
0 181112008494 Subset of customers received email alerts for ... [0.0220480561, 0.078898944, 0.0286504366, 0.02... 18
1 181028002958 A subset of Mastercard customers are experienc... [0.015776258, 0.0221211389, 0.0240189731, 0.03... 18
2 181113008051 Multiple clients not seeing reporting dataFrom... [0.012877184, 0.0996591225, 0.0216155462, 0.01... 16
3 191202019482 Token Authorization errorsOn Monday, December ... [0.0162641034, 0.0242145658, 0.0188226271, 0.0... 16
4 190927000700 COLAB ACCESS Portal was unavailableBetween ... [0.0148392115, 0.0320588984, 0.0286968369, 0.1... 16
new_incident = ['Due to issues with the ARCOT authentication server in the DR environment, for International Private Retail (IPB) On-Line , clients were intermittentlyunable to log in to IPB On-Line through their web browser, from 04:10 ET to 05:39 ET. As a workaround, the mobile site was available for use. There was no impact to Geneva, APAC, and MX users. There were five (5) client calls into CSS due to this issue. TECHnology teams advised that, traffic was directed to the ARCOT Authentication DR nodes, during the Apache upgrade on the CC nodes. However one of the authentication DR nodes encountered a cache mishandling error causing intermittent log in errors. To mitigate the issue, TECHnology teams restarted the problematicserver. TECHnology teams will conduct full root cause analysis and will provide findings, as needed. Impacted: IPB']
new_incident[0]
'Due to issues with the ARCOT authentication server in the DR environment, for International Private Retail (IPB) On-Line , clients were intermittentlyunable to log in to IPB On-Line through their web browser, from 04:10 ET to 05:39 ET. As a workaround, the mobile site was available for use. There was no impact to Geneva, APAC, and MX users. There were five (5) client calls into CSS due to this issue. TECHnology teams advised that, traffic was directed to the ARCOT Authentication DR nodes, during the Apache upgrade on the CC nodes. However one of the authentication DR nodes encountered a cache mishandling error causing intermittent log in errors. To mitigate the issue, TECHnology teams restarted the problematicserver. TECHnology teams will conduct full root cause analysis and will provide findings, as needed. Impacted: IPB'
weights = get_inference(new_incident, ntm_predictor, vectorizer)
weights = weights[0]['topic_weights']
topic = np.argmax(weights)

# score similarity with the dot product of topic weights
incidents_df['cos_similarity'] = incidents_df.topic_weights.apply(lambda x: np.dot(x, weights))
incidents_df.head()
id text topic_weights topic cos_similarity
0 181112008494 Subset of customers received email alerts for ... [0.0220480561, 0.078898944, 0.0286504366, 0.02... 18 0.054804
1 181028002958 A subset of Mastercard customers are experienc... [0.015776258, 0.0221211389, 0.0240189731, 0.03... 18 0.054456
2 181113008051 Multiple clients not seeing reporting dataFrom... [0.012877184, 0.0996591225, 0.0216155462, 0.01... 16 0.060389
3 191202019482 Token Authorization errorsOn Monday, December ... [0.0162641034, 0.0242145658, 0.0188226271, 0.0... 16 0.063748
4 190927000700 COLAB ACCESS Portal was unavailableBetween ... [0.0148392115, 0.0320588984, 0.0286968369, 0.1... 16 0.085969

Put this into a function

def get_similar_incidents(incident, n=10):
    weights = get_inference(incident, ntm_predictor, vectorizer)

    # currently this function takes only 1 incident at a time.
    weights = weights[0]['topic_weights']

    # get the most dominant topic
    topic = np.argmax(weights)

    # score similarity with the dot product of topic weights
    incidents_df['cos_similarity'] = incidents_df.topic_weights.apply(lambda x: np.dot(x, weights))

    # sort based on similarity
    sorted_incidents_df = incidents_df.sort_values(by='cos_similarity', ascending=False)

    # return the top n most similar incidents
    # (head(n) rather than .loc[:n], which slices by label, not position)
    return sorted_incidents_df.head(n)['text']
new_incident = ['CDC1 network issues impacting multiple AWM applications']
output = get_similar_incidents(new_incident)

Other approaches to improve the performance

Approach 1:

  • Group the tickets based on topics.
  • Given a new incident, find the dominant topic first, then return the top 5 tickets within that topic.
  • Come up with a ranking strategy.

Approach 2:

  • Fine-tune the model by adjusting the hyperparameters.
  • Play around with the number of topics to see what makes sense.
  • Fine-tune the vectorizer.
  • Find the nearest vector in the entire corpus.

Approach 3:

  • Treat the topic weights as features and run a k-nearest-neighbors (KNN) search (see the sketch below).

Approach 4:

  • Treat the weights as vectors and compute cosine similarity.
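As a sketch of Approach 3, scikit-learn’s NearestNeighbors can index the topic-weight vectors directly; this assumes incidents_df['topic_weights'] is populated as above:

import numpy as np
from sklearn.neighbors import NearestNeighbors

# index every incident by its topic-weight vector
X = np.vstack(incidents_df['topic_weights'])
nn = NearestNeighbors(n_neighbors=5, metric='cosine').fit(X)

# query with a new incident's inferred topic weights
weights = get_inference(new_incident, ntm_predictor, vectorizer)[0]['topic_weights']
dist, idx = nn.kneighbors(np.asarray(weights).reshape(1, -1))
print(incidents_df['text'].iloc[idx.ravel()])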