Storage fundamentals of AWS part 2: Glacier, EBS, EFS, EC2 Instance storage

Amazon Glacier

Amazon Glacier is similar in part to Amazon S3. It even directly interacts with the Amazon S3 lifecycle rules discussed in the previous lecture. However, the fundamental difference with Amazon Glacier is that it’s a fraction of the cost when it comes to storing the same amount of data. So what’s the catch?

Well, it doesn’t provide the same features, and more importantly, it doesn’t provide instant access to your data. So what is Amazon Glacier exactly?

It’s an extremely low-cost, long-term, durable storage solution, often referred to as cold storage, that is ideally suited to long-term backup and archival requirements. It’s capable of storing the same data types as Amazon S3, effectively any object. However, it does not provide instant access to your data. In addition to this, there are other fundamental differences which make this service fit for purpose for other use cases.

The service itself again has eleven 9s of durability, making it just as durable as Amazon S3. Again, this is achieved by replicating your data across multiple availability zones within a single region. But it provides the storage at a considerably lower cost than Amazon S3. This is because retrieving data stored in Glacier is not an instant process. When retrieving your data, it can take up to several hours to gain access to it, depending on certain criteria. The data structure within Glacier is centered around vaults and archives; buckets and folders are not used, as they are purely for S3.

aws_glacier

Vaults and Archives

A Glacier Vault simply acts as a container for Glacier Archives. These vaults are regional, and as such, during the creation of a vault you are asked to supply the region in which it will reside. Within these vaults, we then have our data, which is stored as an archive. These archives can be any object, similar to S3, for example any document, audio, movie or image file, and each will be saved as an archive. Thankfully, you can have unlimited archives within your Glacier Vaults. So from a capacity perspective, it follows the same rule as S3: effectively, you have access to an unlimited quantity of storage for your archives and vaults.

Now, whereas Amazon S3 provides a graphical user interface to view, manage and retrieve your data within buckets and folders, Amazon Glacier does not. The Glacier dashboard within the AWS Management Console only allows you to create your vaults. Any operational process to upload and retrieve data has to be done using some form of code, either with the Glacier web service APIs or by using the AWS SDKs, which simplify the process of authentication.

Moving data into Glacier: When it comes to moving data into Amazon Glacier for the first time, it’s effectively a two-step process. aws_glacier1

  • Firstly, you need to create your vaults as containers for your archives, and this can be completed using the Glacier console.
  • Secondly, you need to move your data into the Glacier vault using the API or SDKs. As you may be thinking, there’s also another method of moving data into Glacier if it already exists in S3: the S3 Lifecycle rules. There is an option to move your S3 data into Glacier after a set period of time, as defined by you within the Lifecycle rule. When the rule is met, S3 will move the data into Glacier as required.
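
As a rough sketch of the second route, an S3 lifecycle configuration that transitions objects into Glacier might look like the following. The bucket rule ID, prefix and 90-day threshold are illustrative choices, not values from the lecture; in practice you would apply this document with the SDK or CLI.

```python
import json

# Illustrative S3 lifecycle configuration: transition objects under the
# "logs/" prefix to the GLACIER storage class 90 days after creation.
# The rule ID and prefix are placeholder names.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-logs-to-glacier",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 90, "StorageClass": "GLACIER"}
            ],
        }
    ]
}

print(json.dumps(lifecycle_configuration, indent=2))
```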

Retrieving data from Glacier: When it comes to retrieving your archives, you will again have to use some form of code to do so, either the APIs, SDKs or the AWS CLI. Either way, you must first create an archive retrieval job, requesting access to all or part of the archive. When doing so, you can also specify a retrieval option, which can be one of three options.

  • Expedited: This is used when you have an urgent requirement to retrieve your data, but the request has to be less than 250 MB. The data is then made available to you in one to five minutes. The cost of this service is three cents per gigabyte and one cent per request.

  • Standard: This can be used to retrieve any of your archives, no matter their size, but your data will be available in three to five hours, so much longer than the Expedited option. The cost for this service is one cent per gigabyte retrieved and five cents per thousand requests.

  • Bulk: Finally, this option is used to retrieve petabytes of data at a time. However, this typically takes between five and twelve hours to complete. This is the cheapest of the retrieval options, set at 0.25 cents per gigabyte and 2.5 cents per thousand requests. So the retrieval option you choose really depends on how much data you need and how quickly you need it.
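
The trade-off between the three options can be made concrete with a small cost comparison using the per-gigabyte and per-request figures quoted above (note that the Expedited option’s one cent per request equals $10 per thousand requests). These prices vary by region and change over time, so treat them as illustrative only.

```python
# Retrieval pricing from the figures above: (dollars per GB, dollars per
# 1,000 requests). Illustrative only; real prices are region-dependent.
RETRIEVAL_PRICING = {
    "expedited": (0.03, 10.00),
    "standard":  (0.01, 0.05),
    "bulk":      (0.0025, 0.025),
}

def retrieval_cost(option, gigabytes, requests):
    """Return the retrieval cost in dollars for a given option."""
    per_gb, per_thousand = RETRIEVAL_PRICING[option]
    return gigabytes * per_gb + (requests / 1000) * per_thousand

# Retrieving 100 GB across 10 requests under each option:
for option in RETRIEVAL_PRICING:
    print(option, round(retrieval_cost(option, 100, 10), 4))
```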

Glacier Security

By default, Amazon Glacier encrypts your data using the AES-256 algorithm, the Advanced Encryption Standard with 256-bit keys, and manages all of the encryption keys on your behalf. In addition to the IAM policies which govern access control to all of your AWS services, Glacier also uses additional methods of access control for protecting your data, in the form of Vault Access Policies and Vault Lock Policies.

Whereas IAM policies are identity-based, meaning the policies are associated with a particular user, group or role, Vault Access Policies are classed as resource-based policies, as they are applied directly to your vault resource. This is similar to a bucket policy in Amazon S3. These Vault Access Policies govern access control to a particular vault, and each vault can only have a single associated Vault Access Policy. They follow the same policy format as identity policies, using JSON; however, they also include a principal component to identify who is permitted or refused access.

If a user has access to a vault through an identity policy, and there also happens to be a Vault Access Policy attached to the vault, then access will be evaluated across both policies to see if it is allowed. If there is an explicit deny in either policy, the identity will be refused access. For more information, see the AWS documentation on policy evaluation logic.

aws_glacier2

Vault Lock Policies are similar to Vault Access Policies; however, once set, they cannot be changed. This allows you to implement stringent security controls to help you abide by specific governance and compliance controls. For example, you may not be allowed to delete archives for three years due to regulatory requirements. In this case, you could set a Vault Lock Policy denying anyone from deleting archives that are less than 1,095 days old, ensuring that no data less than three years old can be deleted. You would use Vault Access Policies to govern access control features that may change over time, and Vault Lock Policies to help you maintain compliance using access controls that must not be changed.
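
As a sketch, the three-year Vault Lock Policy described above could be expressed as the following JSON document, using Glacier’s `glacier:ArchiveAgeInDays` condition key. The account ID, region and vault name are placeholders.

```python
import json

# Sketch of a Vault Lock Policy: deny deletion of any archive younger than
# 1,095 days (three years). Account ID and vault name are placeholders.
vault_lock_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "deny-deletes-for-three-years",
            "Principal": "*",
            "Effect": "Deny",
            "Action": "glacier:DeleteArchive",
            "Resource": "arn:aws:glacier:eu-west-2:123456789012:vaults/example-vault",
            "Condition": {
                "NumericLessThan": {"glacier:ArchiveAgeInDays": "1095"}
            },
        }
    ],
}

print(json.dumps(vault_lock_policy, indent=2))
```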

Glacier Pricing

The pricing structure for Amazon Glacier is very simple. It has a single storage cost for all data, regardless of how much data you are storing. However, this will vary between regions. For example, within the London region, the cost of storage is set at 0.45 cents per gigabyte. Similarly to Amazon S3, there are also additional costs, such as data transfer, request and retrieval pricing. Data transfer into Glacier is free; however, there is a charge for data transferred out of the service. For example, transfer to another region is set at two cents per gigabyte, and the price varies if transferring out across the internet, depending on the quantity of data transferred. As we know, there are three variants of data retrieval, each with its own cost per gigabyte, as already covered for Standard, Expedited and Bulk, and there’s also a cost associated with how many of these retrieval requests you make. For the latest pricing information, please refer to the Amazon Glacier pricing page.
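
Putting the two London-region figures above into a quick calculation shows how cheap long-term archiving is relative to moving data back out. Both rates are taken from the lecture and are illustrative; check the pricing page for current values.

```python
# London-region figures quoted above, illustrative only:
STORAGE_PER_GB_MONTH = 0.0045   # 0.45 cents per GB per month
TRANSFER_OUT_PER_GB = 0.02      # two cents per GB to another region

def monthly_storage_cost(gigabytes):
    """Monthly cost of keeping this much data archived in Glacier."""
    return gigabytes * STORAGE_PER_GB_MONTH

def cross_region_transfer_cost(gigabytes):
    """One-off cost of transferring data out to another region."""
    return gigabytes * TRANSFER_OUT_PER_GB

# Archiving 10 TB costs about $45 per month; moving 500 GB of it
# to another region costs about $10.
print(round(monthly_storage_cost(10_000), 2))
print(round(cross_region_transfer_cost(500), 2))
```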

Glacier Summary

To quickly summarize, Amazon Glacier is designed to archive data for extended periods of time in cold storage for a very small cost. And so it is ideally suited for retaining data for regulatory reasons. However, it is not a good choice if you need to store data that changes frequently, and that requires immediate access.

EC2 Instance level storage

Let’s look into EC2 instance-level storage, which is referred to as instance store volumes. The first point to make about EC2 instance store volumes is that the volumes physically reside on the same host that provides your EC2 instance itself, acting as local disk drives and allowing you to store data locally to that instance. Up until now we have discussed persistent storage options, but instance store volumes provide ephemeral storage for your EC2 instances.

Ephemeral storage means that the block-level storage it provides offers no means of persistence. Any data stored on these volumes is considered temporary. With this in mind, it is not recommended to store critical or valuable data on these ephemeral instance store volumes, as it could be lost should an event occur. Let me explain under what conditions your data would be lost if it were stored on one of these volumes.

If your instance is either stopped or terminated, then any data you have stored on the instance store volumes associated with this instance will be deleted, without any means of data recovery. However, if your instance is simply rebooted, your data remains intact. Although you can control when your instances are stopped or terminated, giving you the opportunity to either back up the data or move it to a persistent volume store such as the Elastic Block Store service, this control is not always possible. Let’s say you had critical data stored on an ephemeral instance store volume, and then the underlying host that provided your EC2 instance and storage failed. You had no warning that this failure was going to occur, and as a result the instance was stopped or terminated; now all of your data on these volumes is lost. When a stop and start, or a termination, occurs, all the blocks on the storage volume are reset, essentially wiping the data. So, you might be thinking, why use these volumes? What use do they have if there is a chance that you are going to lose data? They do, in fact, have a number of benefits.

From a cost perspective, the storage used is included in the price of the EC2 instance, so you don’t have any additional spend on storage. The I/O speed on these volumes can far exceed that of the alternative block storage, EBS for example. When using storage-optimized instance families, such as the I3 family, it’s possible to reach 3.3 million random read IOPS and 1.4 million write IOPS. With speeds like this, they are ideal for handling the high demands of NoSQL databases. However, any data that needs to persist would need to be replicated or copied to a persistent data store in this scenario. Instance store volumes are generally used for data that is frequently changing and doesn’t need to be retained; as such, they are great for use as a cache or buffer. They are also commonly used for servers within a load balancing group, where data is replicated across the fleet, such as a web server pool.

Not all instance types support instance store volumes. So, if you do have a need where these instance store volumes would work for your use case, then be sure to check the latest AWS documentation to ascertain if the instance type you’re looking to use supports the volume. The size of your volumes, however, will increase as you increase the EC2 instance size.

From a security stance, instance store volumes don’t offer any additional security features, because they are not a separate service like the previous storage options I have already explained. They are simply storage volumes attached to the same host as the EC2 instance, provided as a part of the EC2 service. So, they effectively have the same security mechanisms provided by EC2, such as IAM policies dictating which instances can and can’t be launched and what actions you can perform on the EC2 instance itself. If you have data that needs to remain persistent, or that needs to be accessed and shared by others, then EC2 instance store volumes are not recommended. If you need block-level storage and want a quick and easy method of maintaining persistence, then there is another block-level service that is recommended: the Elastic Block Store service.

Elastic Block Store

EBS is another method of providing storage to your EC2 instances, which has different benefits to that of the instance store volumes.

aws_ebs

Much like instance store volumes, EBS provides block-level storage to your EC2 instances. However, it also offers persistent and durable data storage, which is a huge plus. As a result, it offers far more flexibility for the data you store on an EBS volume than what you can do with the data stored on instance store volumes.

EBS volumes can be attached to your EC2 instances, and are primarily used for data that is rapidly changing that perhaps requires specific IOPS. As they provide persistent level storage to your instances, they are ideally suited for retaining important data. And as such, can be used to store personally identifiable information. In any environment where this is the case, it’s essential that the data is encrypted to protect the data on the volume from malicious activity.

EBS volumes are independent of the EC2 instance, meaning that they exist as two different resources. They are network attached to the instance instead of directly attached like instance store volumes. However, a single EBS volume can only ever be attached to a single EC2 instance.

Multiple EBS volumes can however be attached to a single instance. Due to the persistency of maintaining data, it doesn’t matter if your instances are intentionally or unintentionally stopped, restarted, or even terminated, settings permitting. The data will remain intact.

Snapshots

EBS offers the ability to take point-in-time backups of an entire volume as and when you need to. These backups are known as snapshots. You can manually invoke a snapshot of your volume at any time, or write some code to perform this automatically on a scheduled basis. The snapshots themselves are stored on Amazon S3, and so are very durable and reliable. Snapshots are incremental, meaning that each snapshot only copies data that has changed since the previous snapshot was taken. Once you have a snapshot of an EBS volume, you can then create a new volume from that snapshot. So, if for any reason you lose access to your EBS volume through some form of incident or disaster, you can recreate the data volume from an existing snapshot and attach that volume to a new EC2 instance. To add additional flexibility and resilience, it is possible to copy a snapshot from one region to another.
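
A toy model makes the incremental behaviour clear: the first snapshot copies every block, while each later snapshot stores only the blocks changed since the previous one, so a series of snapshots costs far less storage than repeated full copies. The block counts below are invented purely for illustration.

```python
# Toy illustration of incremental EBS snapshots: first snapshot is a full
# copy, subsequent snapshots store only the changed blocks.
def snapshot_sizes(total_blocks, changed_per_snapshot):
    """Return the number of blocks each snapshot in the series stores."""
    sizes = [total_blocks]              # first snapshot: full copy
    sizes += list(changed_per_snapshot)  # later snapshots: deltas only
    return sizes

# A 1,000-block volume snapshotted three times, with 50 and then 20
# blocks changed between snapshots: stores 1,070 blocks, not 3,000.
sizes = snapshot_sizes(1000, [50, 20])
print(sizes, sum(sizes))
```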

HA and Resiliency of EBS Volumes

Looking at the subject of high availability and resiliency, your EBS volumes are created with reliability in mind by default. Every write to an EBS volume is replicated multiple times within the same availability zone of your region to help prevent the complete loss of data. This means that your EBS volume itself is only available in a single availability zone, shown here on the left. aws_ebs0

As a result, should your availability zone fail, you will lose access to your EBS volume. Should this occur, you can simply recreate the volume from your previous snapshot, which is accessible from all availability zones within that region, and attach it to another instance in another availability zone.

Types of EBS Volumes

There are two types of EBS volumes available, each with its own characteristics: SSD (solid state drive) backed storage and HDD (hard disk drive) backed storage. This allows you to optimize your storage to fit your requirements from a cost and performance perspective. aws_ebs1

SSD vs HDD backed storage: SSD-backed storage is better suited to scenarios that work with smaller blocks, such as databases using transactional workloads, or as boot volumes for your EC2 instances. HDD-backed volumes, on the other hand, are designed for workloads that require a higher rate of throughput in megabytes per second, such as processing big data and logging information, essentially working with larger blocks of data.

Looking at these volume types further, SSD and HDD volumes can be broken down into the following: General Purpose SSD, known as GP2, and Provisioned IOPS SSD, known as IO1; and Cold HDD, known as SC1, and Throughput Optimized HDD, known as ST1.

The key points of the General Purpose SSD volume include single-digit millisecond latency and the ability to burst up to 3,000 IOPS. It has a baseline performance of three IOPS per gigabyte, up to a maximum of 10,000 IOPS, and a throughput of 128 megabytes per second for volumes up to 170 gigabytes. Above this, throughput increases by 768 kilobytes per second per gigabyte, up to a maximum of 160 megabytes per second.
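
These GP2 formulas are easy to turn into a small calculator. One assumption beyond the text: AWS also applies a 100 IOPS floor to the baseline, which the lecture does not mention, so it is noted in a comment.

```python
# GP2 performance characteristics from the figures above.
def gp2_baseline_iops(size_gb):
    """Baseline IOPS: 3 IOPS per GB, capped at 10,000.
    (AWS also applies a 100 IOPS floor, not mentioned in the lecture.)"""
    return max(100, min(3 * size_gb, 10_000))

def gp2_max_throughput_mbps(size_gb):
    """Throughput: 128 MB/s up to 170 GB, then 768 KB/s per GB,
    capped at 160 MB/s."""
    if size_gb <= 170:
        return 128
    return min(160.0, size_gb * 768 / 1000)

# A 100 GB volume gets a 300 IOPS baseline at 128 MB/s; a 5 TB volume
# hits both the 10,000 IOPS and 160 MB/s ceilings.
print(gp2_baseline_iops(100), gp2_max_throughput_mbps(100))
print(gp2_baseline_iops(5000), gp2_max_throughput_mbps(5000))
```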

aws_ebs2

Provisioned IOPS SSD volumes deliver enhanced, predictable performance for applications requiring I/O-intensive workloads. They give you the ability to specify the IOPS rate during the creation of a new EBS volume, and when the volume is attached to an EBS-optimized instance, EBS will deliver the defined IOPS to within 10 percent, 99.9 percent of the time throughout the year. The volumes range from four to sixteen terabytes in size, and the maximum possible per volume is 20,000 IOPS.

Cold HDD volumes offer the lowest cost of all the EBS volume types. They are suited to workloads that are large in size and accessed infrequently. Their key performance attribute is throughput in megabytes per second. They have the ability to burst up to 80 megabytes per second per terabyte, with a maximum burst capacity for each volume set at 250 megabytes per second, and they will deliver the expected throughput 99 percent of the time over a given year. Due to the nature of these volumes, it’s not possible to use them as boot volumes for your EC2 instances.

Throughput Optimized HDD volumes are designed for frequently accessed data and are ideally suited to large data sets requiring throughput-intensive workloads, such as data streaming, big data and log processing. They have the ability to burst up to 250 megabytes per second per terabyte, with a maximum burst of 500 megabytes per second per volume. Again, these will deliver the expected throughput 99 percent of the time over a given year, and again, due to the nature of these volumes, it’s not possible to use them as boot volumes for your instances.
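
The burst figures for the two HDD-backed types scale with volume size up to a per-volume cap, which can be sketched as a single lookup. This uses the per-terabyte rates and caps quoted above (SC1: 80 MB/s per TB, capped at 250 MB/s; ST1: 250 MB/s per TB, capped at 500 MB/s).

```python
# Burst throughput for the HDD-backed EBS volume types, per the figures
# above: (MB/s per TB, per-volume cap in MB/s).
HDD_BURST = {
    "sc1": (80, 250),
    "st1": (250, 500),
}

def hdd_burst_mbps(volume_type, size_tb):
    """Burst throughput in MB/s for an SC1 or ST1 volume of a given size."""
    per_tb, cap = HDD_BURST[volume_type]
    return min(per_tb * size_tb, cap)

# A 2 TB SC1 volume bursts to 160 MB/s; a 4 TB ST1 volume hits the
# 500 MB/s per-volume cap.
print(hdd_burst_mbps("sc1", 2))
print(hdd_burst_mbps("st1", 4))
```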

Security of EBS volumes

One great feature of EBS is its ability to enhance the security of your data, both at rest and when in transit, through data encryption. This is especially useful when you have sensitive data, including personally identifiable information, stored in your EBS volume. And in this case you may be required to have some form of encryption from a regulatory or governance perspective. EBS offers a very simple encryption mechanism. Simple in the fact that you don’t have to worry about managing the data keys to perform the encryption process yourself. It’s all managed and implemented by EBS. All you are required to do is to select if you want the volume encrypted or not during its creation via a checkbox.

aws_ebs3

The encryption process uses the AES-256 encryption algorithm and provides its encryption process by interacting with another AWS service, the key management service known as KMS. KMS uses customer master keys, CMKs, to create data encryption keys, DEKs, enabling the encryption of data across a range of AWS services, such as EBS in this instance. Any snapshot taken from an encrypted volume will also be encrypted, and also any volume created from this encrypted snapshot will also be encrypted. You should also be aware that this encryption option is only available on selected instance types.

Creating EBS volumes

As EBS volumes are separate from EC2 instances, you can create an EBS volume in a couple of different ways from within the Management Console: during the creation of a new instance, attaching it at the time of launch, or from within the EC2 dashboard as a standalone volume, ready to be attached to an instance when required. When creating an EBS volume during an EC2 instance launch, at step four of creating the instance you are presented with the storage configuration options. Here you can either create a new blank volume or create one from an existing snapshot, as shown here:

aws_ebs4

You can also specify the size in gigabytes and the volume type, which we discussed previously. Importantly, you can decide what happens to the volume when the instance terminates. You can either have the volume to be deleted with the termination of the EC2 instance, or retain the volume, allowing you to maintain the data and attach it to another EC2 instance. Lastly, you also have the option of encrypting the data if required.

And like I said, you can also create the EBS volume as a standalone volume. By selecting the volume option under EBS from within the EC2 dashboard of the management console, you can create a new EBS volume where you’ll be presented with the following screen.

aws_ebs5

Here you will have many of the same options. However, you can specify which availability zone that the volume will exist in, allowing you to attach it to any EC2 instance within that same availability zone. As you remember, EBS volumes can only be attached to EC2 instances that exist within the same availability zone.

Resizing EBS volumes

EBS volumes also offer the additional flexibility of being resized elastically should the requirement arise. Perhaps you’re running out of disk space and need to scale up your volume. This can be achieved by modifying the volume within the console or via the AWS CLI. Once the volume has been increased, you would then need to extend the filesystem within the OS of the EC2 instance to make use of the additional space. You could also perform the same resize of a volume by creating a snapshot of your existing volume and then creating a new volume from that snapshot with an increased capacity.

EBS pricing

Whereas with Amazon S3 you pay only for the storage you consume, with Amazon EBS you are charged for the storage provisioned to you per month. This means that if you provision an eight-gigabyte EBS volume, you are charged for those eight gigabytes regardless of whether you are only using one gigabyte of storage on the volume. There is a different cost for each volume type I discussed earlier, across the General Purpose SSD, Provisioned IOPS, Throughput Optimized and Cold HDD volumes, and again, the cost is region-dependent. For these volumes, the charges are billed on a per-second basis and charged pro rata, depending on how many seconds you have the storage provisioned for within that month. You must also remember that when an EBS snapshot is created and stored on Amazon S3, you will also be charged for that storage; for the London region, this is $0.053 per gigabyte-month of data stored.
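
The pro rata, per-second billing can be sketched as follows. The $0.116 per GB-month GP2 rate is an assumed illustrative figure, not one from the lecture, and the model assumes a fixed 30-day billing month for simplicity; check the EBS pricing page for real rates and billing details.

```python
# Sketch of EBS pro rata billing: you pay for provisioned capacity by the
# second, against a monthly per-GB rate. Assumes a 30-day month for
# simplicity; the $0.116/GB-month GP2 rate below is illustrative only.
SECONDS_PER_MONTH = 30 * 24 * 3600

def ebs_cost(provisioned_gb, seconds_provisioned, price_per_gb_month):
    """Cost of a volume provisioned for part of a month, pro rata."""
    fraction_of_month = seconds_provisioned / SECONDS_PER_MONTH
    return provisioned_gb * price_per_gb_month * fraction_of_month

# An 8 GB volume provisioned for 10 days costs a third of its monthly rate,
# regardless of how much of the 8 GB is actually used.
print(round(ebs_cost(8, 10 * 24 * 3600, 0.116), 4))
```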

EBS Anti-patterns

As we can see, EBS offers a number of benefits over EC2 instance store volumes. But EBS is not well suited for all storage requirements.

aws_ebs6

For example, if you only need temporary storage, or multi-instance storage access, then EBS is not recommended or even possible, as EBS volumes can only be accessed by one instance at a time. Also, if you need very high durability and availability of data storage, then you would be better suited to Amazon S3 or EFS, the Elastic File System.

Elastic File System

Elastic File System, known as EFS, provides file-level storage, whereas the likes of EBS provide block-level storage and Amazon S3 object-level storage. EFS is a fully managed, highly available and durable service that allows you to create shared file systems that can easily scale with your demands and serve tens, hundreds or even thousands of EC2 instances concurrently.

aws_efs

Being a managed service, there is no need for you to provision file servers to manage the storage elements, or to provide any maintenance of those servers. This makes it a very simple option for providing file-level storage within your environment. From a capacity perspective, it’s very similar to S3 in that it’s seemingly limitless: as you store more and more files in EFS, your storage simply grows elastically with demand.

EFS vs EBS: There is no need to provision a set size of data storage, like you do with EBS for example, where you need to specify the size of the volume that you require. As the file system can be accessed by multiple instances, it makes it a very good storage option for applications that span across multiple instances, allowing for parallel access of data.

The EFS file system is also regional, and so any application deployments that span across multiple availability zones can all access the same file systems, providing a level of high availability of your application’s storage layer. At the time of writing, EFS is not currently available within all regions.

aws_efs1

EFS has been designed to maintain a high level of throughput, in addition to low latency access response. These performance factors make EFS a desirable storage solution for a wide variety of workload and use cases.

Creating EFS

Creating your EFS file system is a very simple and quick process, which can be managed from within the AWS Management Console. From within the EFS dashboard, you select to create a new file system, and are then required to enter some configuration information. You must select which VPC this file system will exist in, and once selected, EFS automatically creates mount targets for you across the availability zones where you have EC2 instances. These mount targets allow you to connect to the EFS file system from your EC2 instances, using a configured mount target IP address.

aws_efs2

When mounting the EFS file system, be aware that it is only compatible with NFS versions 4.0 and 4.1. With this in mind, EFS does not currently support the Windows operating system. You must ensure that your Linux EC2 instance has an NFS client installed for the mounting process; NFS client version 4.1 is recommended. For each mount point, you are able to select which subnet it exists in, as well as define your security groups to control access at the instance level.

Set performance mode and encryption settings

The next step of creating your file system involves defining your tags, performance mode and encryption settings. The two performance modes of operation are General Purpose and Max I/O. For most use cases and requirements, you will likely use General Purpose. It has the lower latency of the two modes and works well with many different application workloads. There is a limitation to this mode, however: it allows only up to 7,000 file system operations per second to your EFS file system.

aws_efs3

If, however, you have architectures of huge scale, or your EFS file system is likely to be used by many thousands of EC2 instances concurrently and will exceed 7,000 operations per second, then you will need to consider Max I/O. This offers a virtually unlimited amount of throughput and IOPS, at the cost of slightly higher latency on each I/O.

The best way to understand which performance option you need is to run tests alongside your application. If your application sits comfortably within the upper limit of 7,000 operations per second, then General Purpose will be best suited, with the added benefit of lower latency. However, if your testing suggests that 7,000 operations per second may be tight, then select Max I/O. When using the General Purpose mode of operation, EFS provides a CloudWatch metric, PercentIOLimit, which allows you to view your operations per second as a percentage of the 7,000 limit. This allows you to decide whether to migrate to a Max I/O file system, should your operations be approaching the limit.
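
The decision rule above can be captured in a couple of lines. The percentage calculation mirrors what the PercentIOLimit metric reports; the 80 percent headroom threshold is an illustrative rule of thumb, not an AWS recommendation.

```python
# Helper mirroring the PercentIOLimit decision described above.
GP_OPS_LIMIT = 7000  # General Purpose mode ceiling, in ops per second

def percent_io_limit(ops_per_second):
    """Observed load as a percentage of the General Purpose limit."""
    return 100 * ops_per_second / GP_OPS_LIMIT

def recommend_mode(ops_per_second, headroom=0.8):
    """Suggest Max I/O once sustained load exceeds 80% of the limit.
    (The 80% threshold is an illustrative rule of thumb.)"""
    if ops_per_second > headroom * GP_OPS_LIMIT:
        return "Max I/O"
    return "General Purpose"

print(percent_io_limit(3500), recommend_mode(3500))
print(percent_io_limit(6500), recommend_mode(6500))
```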

Encryption: You also have the opportunity to implement encryption of your EFS file system, using a simple checkbox, and selecting your desired CMK.

aws_efs4

Much like EBS, EFS uses the services offered by the Key Management Service to provide encryption of your storage. However, at this stage, encryption is only offered at rest, and not in transit. The final stage requires you to review and create your EFS file system, based on the configuration that you have specified.

Once your file system is created, you are presented with your mount target points, allowing you to connect your EC2 instances as required.

aws_efs5

This shows the mount targets generated from the configuration given in these screenshots. In addition to being able to mount the new EFS file system to the EC2 instances in your VPC, you can also use these mount points on your on-premises servers, as long as you connect via Direct Connect or a third-party VPN.

Moving data in EFS using EFS file sync

aws_efs6

If you have existing data, either on your own servers in your on-premises data center, or data that already exists in AWS, such as on EC2 instances, and you want to move that data into EFS, then you can use the File Sync feature. File Sync can be configured from within the EFS dashboard of the Management Console, and it allows you to securely manage the transfer of files between an existing storage location and your EFS file system via a File Sync agent.

aws_efs7

If you need to sync source files from your on-premises environment, then you can download the File Sync agent as a VMware ESXi appliance. If you are syncing source files from within AWS, then it provides a community-based AMI to be used with an EC2 instance. This agent is then configured with your source and the destination mount target of your EFS file system, and logically sits between them. From within the EFS dashboard, you can then start the file sync and monitor its progress with Amazon CloudWatch.

EFS Pricing

The pricing structure for EFS is very simple. There are no charges for data transfer or for requests, as we have in Amazon S3 and Glacier. You are, however, charged for the amount of data you consume, in gigabyte-months. This allows for the calculation of space used as it fluctuates over the month. For example, let’s say that within the Ireland region you used 80 gigabytes of storage for 10 days in March, and 175 gigabytes of storage for the remaining 21 days. During March, you would have 80 GB × 10 days × 24 hours plus 175 GB × 21 days × 24 hours, or 107,400 gigabyte-hours of usage. This is then converted to gigabyte-months, by dividing by the 744 hours in March, to give a final figure of roughly 144.35 gigabyte-months. For the latest information on EFS pricing, please visit the Amazon EFS pricing page.
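
The worked example above can be checked in a few lines of code: metering accumulates gigabyte-hours and converts them to gigabyte-months at the end of the billing period. The $0.33 per GB-month Ireland rate used for the final bill is an assumed illustrative figure, not one stated in the lecture.

```python
# The March example above: 80 GB for 10 days, then 175 GB for 21 days.
HOURS_IN_MARCH = 31 * 24  # 744 hours

gb_hours = 80 * 10 * 24 + 175 * 21 * 24   # 19,200 + 88,200 GB-hours
gb_months = gb_hours / HOURS_IN_MARCH     # convert to GB-months

# At an assumed Ireland rate of $0.33 per GB-month (illustrative; check
# the pricing page), the March bill would be roughly:
print(gb_hours, round(gb_months, 2), round(gb_months * 0.33, 2))
```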

EFS Anti-patterns

aws_efs8

As I have already explained, EFS provides file-level storage capabilities, and as a result it is not an ideal solution for data archiving; instead, you would use Amazon Glacier. Relational database data is better suited to EBS, and finally, EFS is not recommended for temporary storage, which should be fulfilled by EC2 instance store volumes instead.
