Building a model using SageMaker

SageMaker is a combination of two very useful tools:

  • A managed Jupyter notebook instance
  • An API that simplifies training and deploying an ML model

So, why use SageMaker? It makes many of the machine learning tasks we need to perform much easier. Consider the ML workflow that I discussed in detail in my earlier post.

SageMaker provides tools to make each of the steps involved simpler. The notebook can be used to explore and process the data, and the API can help simplify the modeling and deployment steps.

So, how does SageMaker actually work? For the most part, when we talk about using SageMaker, we really mean working in the managed notebook. This notebook has all the benefits that we talked about earlier, plus access to the SageMaker API. The SageMaker API itself can be thought of as a collection of tools that deal with the training process and the inference process.

Training Process: The training process is exactly what you think it is:

  1. First, a computational task is constructed. Generally this task is meant to fit a machine learning model to some data.
  2. Then this task is executed on a virtual machine. The resulting model, such as the tree constructed in a decision tree model or the layers of a neural network, is then saved to a file. This saved data is called the model artifacts.

Inference Process: The inference process is very similar to the training process:

  1. First, a computational task is constructed for the purpose of performing inference.
  2. Then this task is executed on a virtual machine. In this case, however, the VM waits for us to send it some data. When we do, it takes that data along with the model artifacts created during the training process, performs inference, and returns the result.
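To make the split between the two processes concrete, here is a minimal local sketch. It does not use the SageMaker API at all: the function names, the JSON file format, and the toy data are assumptions chosen for illustration. The point is only that training writes model artifacts to a file, and inference later reads that file to answer requests.

```python
import json
import os
import tempfile


def train(xs, ys, artifact_path):
    """Training task: fit y = slope * x + intercept by least squares,
    then save the fitted parameters (the "model artifacts") to a file."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    with open(artifact_path, "w") as f:
        json.dump({"slope": slope, "intercept": intercept}, f)


def predict(artifact_path, x):
    """Inference task: load the saved artifacts and apply them to new data."""
    with open(artifact_path) as f:
        model = json.load(f)
    return model["slope"] * x + model["intercept"]


# Toy data generated from y = 2x + 1 (illustrative only).
artifact_path = os.path.join(tempfile.mkdtemp(), "model.json")
train([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0], artifact_path)
prediction = predict(artifact_path, 5.0)
print(prediction)
```

In SageMaker the two functions would run on separate virtual machines, with the artifacts file handed from one to the other through storage rather than a local path.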

Setting up the notebook instance

The first thing we are going to need to do is set up a notebook instance!

This will be the primary way in which we interact with the SageMaker ecosystem. Of course, this is not the only way to interact with SageMaker’s functionality, but it is the way that we will use for now.

Note: Once a notebook instance has been set up, by default, it will be InService which means that the notebook instance is running. This is important to know because the cost of a notebook instance is based on the length of time that it has been running. This means that once you are finished using a notebook instance you should Stop it so that you are no longer incurring a cost. Don’t worry though, you won’t lose any data provided you don’t delete the instance. Just start the instance back up when you have time and all of your saved data will still be there.
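If you prefer to stop the instance from code rather than the console, something like the following sketch should work. It assumes a boto3 SageMaker client is passed in; the helper name stop_if_running and the instance name "my-notebook" are placeholders, not anything defined in this post.

```python
def stop_if_running(sm_client, notebook_name):
    """Stop the notebook instance if it is InService.

    sm_client is expected to be a boto3 SageMaker client.
    Returns the status that was observed before any stop request.
    """
    response = sm_client.describe_notebook_instance(
        NotebookInstanceName=notebook_name
    )
    status = response["NotebookInstanceStatus"]
    if status == "InService":
        sm_client.stop_notebook_instance(NotebookInstanceName=notebook_name)
    return status


# Real usage (requires boto3 and AWS credentials; the name is a placeholder):
#   import boto3
#   stop_if_running(boto3.client("sagemaker"), "my-notebook")
```

Passing the client in as an argument keeps the helper easy to try out with a stand-in object before pointing it at a real account.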

Create a new notebook instance and give it a name. We also need to make sure we set a role. A role acts as a security certificate, letting Amazon know what other resources our notebook will have access to. We need to make sure that we allow our notebook to access S3. To do that, create a new IAM role. The default selections will be fine for us.

Note: Select None here, since we only want buckets with sagemaker in their name to be accessible from our notebook. If you need access to another specific bucket, select the first option and enter the bucket name.

Now that your notebook instance has been set up and is running, it’s time to get the notebooks that we will be using from here:

sh-4.2$ pwd
/home/ec2-user
sh-4.2$ ls
anaconda3  examples  LICENSE  Nvidia_Cloud_EULA.pdf  README  SageMaker  sample-notebooks  sample-notebooks-1606850748  src  tools  tutorials
sh-4.2$ cd SageMaker/
sh-4.2$ ls
lost+found
sh-4.2$ git clone https://github.com/udacity/sagemaker-deployment.git
Cloning into 'sagemaker-deployment'...
remote: Enumerating objects: 10, done.
remote: Counting objects: 100% (10/10), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 259 (delta 3), reused 4 (delta 1), pack-reused 249
Receiving objects: 100% (259/259), 258.78 KiB | 23.52 MiB/s, done.
Resolving deltas: 100% (156/156), done.
sh-4.2$ ls
lost+found  sagemaker-deployment
sh-4.2$ cd sagemaker-deployment/
sh-4.2$ ls
LICENSE  Mini-Projects  Project  README.md  Tutorials
sh-4.2$

Boston Housing Example

As our first example of using SageMaker, we are going to take the Boston Housing dataset and use it to predict the median cost of a house in the Boston area. Refer to this notebook.

SageMaker Sessions & Execution Roles

SageMaker has some unique objects and terminology that will become more familiar over time. There are a few objects that you’ll see come up, over and over again:

  • Session - A session is a special object that allows you to do things like manage data in S3 and create and train machine learning models. The upload_data function should be close to the top of the list! You’ll also see functions like train, tune, and create_model, all of which we will use regularly.

  • Role - Sometimes called the execution role, this is the IAM role that you created when you created your notebook instance. The role defines which AWS resources, such as S3 buckets, your notebook is allowed to access, and therefore where the data it uses or creates can be stored. You can even try printing out the role with print(role) to see its identifier.
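As a rough sketch of how these two objects are typically used together, the helper below uploads a local file through a session and returns the resulting S3 location. The helper name upload_training_file, the file name "train.csv", and the prefix "boston-example" are placeholders for illustration; the commented lines assume the sagemaker Python SDK is available, as it is on a notebook instance.

```python
def upload_training_file(session, local_path, prefix):
    """Upload one local file via the session and return its S3 URI.

    session is expected to be a sagemaker.Session; upload_data places the
    file in the session's default bucket under the given key prefix.
    """
    return session.upload_data(path=local_path, key_prefix=prefix)


# On a notebook instance (requires the sagemaker package and AWS access;
# the file name and prefix below are placeholders):
#   import sagemaker
#   session = sagemaker.Session()
#   role = sagemaker.get_execution_role()
#   print(role)
#   train_uri = upload_training_file(session, "train.csv", "boston-example")
```

The returned URI is what a training job would later be pointed at, while the role is passed along so the job is allowed to read from that location.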

FAQs:

What is AWS SageMaker?

AWS (or Amazon) SageMaker is a fully managed service that provides the ability to build, train, tune, deploy, and manage large-scale machine learning (ML) models quickly. SageMaker provides tools to make each of the following steps simpler:

  1. Explore and process data
    • Retrieve
    • Clean and explore
    • Prepare and transform
  2. Modeling
    • Develop and train the model
    • Validate and evaluate the model
  3. Deployment
    • Deploy to production
    • Monitor and update the model and data

What tools are provided in SageMaker? Amazon SageMaker provides the following tools:

  • Ground Truth - To manage labeling jobs, datasets, and workforces
  • Notebook - To create Jupyter notebook instances, configure the lifecycle of the notebooks, and attach Git repositories
  • Training - To choose an ML algorithm, define the training jobs, and tune the hyperparameters
  • Inference - To compile and configure the trained models and the endpoints for deployment

SageMaker Instance types start with ml.*

SageMaker instances are dedicated VMs that are optimized for different machine learning (ML) use cases. The supported instance types, names, and pricing in SageMaker differ from those of EC2. Refer to the following links for more detail: