Storage Fundamentals of AWS and S3 features
Introduction: One of the core building blocks of Infrastructure as a Service (IaaS) is that of storage, and AWS provides a wide range of storage services that allow you to architect the correct solution for your needs. Understanding what each of these services is and what they have been designed and developed for, gives you the knowledge to implement best practices ensuring your data is stored, transmitted and backed up in the most efficient and scalable way.
Background: More and more organizations are migrating to the cloud for the many benefits it brings, such as flexibility, scalability, cost efficiency, security, and more. AWS offers many different services that allow almost any solution, whether migrated or newly built, to run in the cloud and take advantage of these benefits.
This means that from a foundational, infrastructure-as-a-service perspective, AWS has to provide services, components, and features that cover the core infrastructure elements of compute, storage, database, and networking, and AWS does this very well. In this post, I will focus on the storage element of these components.
So, why does AWS provide so many different storage services, if all you need to do is store your data in the cloud? Well, it's effectively for the same reasons you have a range of storage products and solutions in your own on-premises environment. For example, you are likely using different storage devices, such as a storage area network (SAN), network-attached storage (NAS), directly attached storage, and tape backup, to name but a few. Now, it's not important to understand in detail what each of these solutions is and does; the point I'm trying to make is that they all perform the same function: the ability to store data.
But at the same time, each solution also provides different benefits and features, such as cost variants, storage capacity, security features, such as encryption and access control, varied levels of durability and availability, different read/write speeds, different accessibility options, different media types, some can be auditable and traceable, and also use case, such as backup and file storage.
AWS is fully aware that not all of your data needs to be treated exactly the same, and that data can sometimes come with very specific requirements. This is the reason why AWS has so many different storage services available: to allow you to select the most appropriate service for your needs. Understanding which AWS storage service can provide these features and more is critical to selecting the most appropriate service, allowing you to implement an effective and efficient solution.
Block, File and Object storage
Data storage can generally be categorized as block, file, or object storage. So, what's the difference between these?
Block storage: Block storage stores the data in chunks of data known as blocks, and these blocks are stored in a volume, and attached to a single instance. They generally provide very low latency, and can be considered similar to your directly attached disks within your own data center.
File storage: Your data is stored as separate files within a series of directories, forming a data structure hierarchy. The data is then stored on top of a file system, and provides shared access, allowing for multiple users to access the data. File storage in AWS can be associated to your network attached storage systems you may have in your own data center.
Object storage: Each object does not conform to a data structure hierarchy. Instead, it exists across a flat address space and is referenced by a unique key. Each object can also have associated metadata to help categorize and identify the object.
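To make this flat address space concrete, here is a minimal Python sketch of the object-storage model. The keys and data are made up for illustration; the point is that keys like "photos/2024/cat.jpg" look hierarchical, but the "folders" are just a naming convention over a flat key space:

```python
# A toy model of object storage: a flat key/value space.
object_store = {}

def put_object(key, data, metadata=None):
    """Store data under a single unique key, with optional metadata."""
    object_store[key] = {"data": data, "metadata": metadata or {}}

put_object("photos/2024/cat.jpg", b"\x89PNG...", {"content-type": "image/png"})
put_object("logs/app.log", b"started", {"content-type": "text/plain"})

# Listing "photos/" is really just a prefix filter over the flat key space.
photo_keys = [k for k in object_store if k.startswith("photos/")]
print(photo_keys)  # ['photos/2024/cat.jpg']
```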
Now that you have an understanding of why AWS has curated and developed a range of storage services for you to select from, let me now start by introducing each of these services, explaining exactly what the service is and does, highlighting its key features, and covering when and why you might select it.
Amazon Simple Storage Service
Amazon Simple Storage Service, commonly known as S3, is probably the most used storage service provided by AWS, simply down to the fact that it suits many different use cases and is called upon by many different AWS services.
Amazon S3 is a fully managed, object-based storage service that is highly available, highly durable, very cost effective, and widely accessible.
When you upload data to S3, you as the customer are required to specify the regional location for that data to be placed in. By specifying a region for your data, Amazon S3 will then store and duplicate it multiple times across multiple Availability Zones within that region to increase both its durability and availability.
Objects stored in S3 have a durability of eleven nines (99.999999999%), so the likelihood of losing data is extremely rare; this is down to the fact that S3 stores numerous copies of the same data in different Availability Zones. The availability of S3 data objects is currently four nines (99.99%). The difference between availability and durability is this: availability refers to the uptime of Amazon S3, which AWS ensures is 99.99%, enabling you to access your stored data. The durability percentage refers to the probability of maintaining your data without it being lost through corruption, degradation, or other potentially damaging effects.
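To put eleven nines into perspective, here is a quick back-of-the-envelope calculation. The ten-million-object figure is just an illustrative assumption:

```python
# With an annual durability of 99.999999999%, storing ten million objects you
# would expect to lose, on average, roughly one object every ten thousand years.
annual_loss_rate = 1 - 0.99999999999       # probability of losing a given object in a year
objects = 10_000_000
expected_losses_per_year = objects * annual_loss_rate
print(expected_losses_per_year)            # approx 0.0001, i.e. one object per 10,000 years
```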
When uploading objects to S3, buckets are used to organize and manage your data.
To store objects in S3 you first need to define and create a bucket. You can think of a bucket as a parent folder for your data. This bucket name must be completely unique, not just within the region you specify, but globally across all other S3 buckets that exist, of which there are many millions. Once you have created your bucket you can begin to upload your data. You can upload your data directly into that bucket or you can, if required, create folders under your bucket to store your data in for easier management. By default, there is a limit of 100 buckets that you are allowed to create within your AWS account, but this can be increased on request through AWS.
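As an illustration of the naming constraint, here is a small sketch of a bucket-name validator. This is a simplified subset of S3's actual naming rules (the full rules also forbid IP-address-style names, among other things), so treat it as illustrative only:

```python
import re

def is_valid_bucket_name(name: str) -> bool:
    """Rough check of S3 bucket-naming rules: 3-63 characters, lowercase
    letters, digits, hyphens and dots, starting and ending with a letter
    or digit. A simplified sketch, not the complete rule set."""
    if not 3 <= len(name) <= 63:
        return False
    return re.fullmatch(r"[a-z0-9][a-z0-9.-]*[a-z0-9]", name) is not None

print(is_valid_bucket_name("my-example-bucket"))  # True
print(is_valid_bucket_name("My_Bucket"))          # False: uppercase and underscore
```

Note that passing this check does not guarantee the name is free; global uniqueness is only confirmed when the bucket is actually created.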
Objects that are then stored in these buckets have a unique object key that defines the object across the flat address space of S3. Although folders can provide additional management from a data organization point of view, Amazon S3 is not a file system and so, specific features of Amazon S3 work at the bucket level rather than specific folder levels. Let’s now take a closer look at some of the features offered by Amazon S3, starting with an overview of the different storage classes that are available.
There are a number of different storage classes within S3, all of which offer different performance features and costs, and it's down to you to select the storage class that you require for your data. These classes are as follows:
- Standard,
- Standard Infrequent Access,
- Intelligent Tiering,
- One Zone Infrequent Access, and
- Reduced Redundancy.
But this Reduced Redundancy option is no longer recommended by AWS, and I'll explain why as we go. It's best to review the differences between these classes in a table to understand the key points of difference:

Class                 Durability        Availability
Standard              99.999999999%     99.99%
Standard-IA           99.999999999%     99.9%
Intelligent Tiering   99.999999999%     99.9%
One Zone-IA           99.999999999%     99.5%
Reduced Redundancy    99.99%            99.99%

As you can see, the main differences between the classes are the durability and availability percentages each class offers. To help us look at these differences, we can split the classes into two categories: data that is accessed frequently and data that is accessed infrequently.
For data that is accessed frequently, you have two options: either Standard or Reduced Redundancy Storage (RRS). The default storage class for your objects is the Standard class unless you specify otherwise. The Reduced Redundancy Storage class is an older option and is no longer recommended by AWS, as the Standard class is now more cost effective and offers greater durability.
For data that is accessed infrequently, there are Standard-IA and One Zone-IA. Although they offer the same speed of access as Standard, these infrequent access storage classes charge an additional cost to retrieve and access the data, so the data in these classes is typically long-lived data that rarely needs to be accessed. The difference between One Zone-IA and Standard-IA is resilience: One Zone-IA does not replicate its data across multiple Availability Zones, and so only offers a 99.5% availability SLA. With this in mind, it should only be used for data that can be reproduced. Between the two, One Zone-IA is the cheaper, due to its limitation to a single Availability Zone.
Finally, we have the Intelligent Tiering storage class, which is great for unpredictable access patterns. It has the ability to optimize your storage costs by moving data objects between different tiers based upon their usage. Intelligent Tiering moves data between two tiers: a Frequent Access tier and a more cost-effective Infrequent Access tier. If data has not been accessed for 30 days or more, Intelligent Tiering will move it into the Infrequent Access tier. The next time the object is accessed, it will be moved back into the Frequent Access tier and the 30-day timer will be reset. There are no retrieval costs for your data like there are with Standard-IA and One Zone-IA. However, do be aware that there is a small monthly cost associated with each object monitored by the Intelligent Tiering class, and each object must be larger than 128 kilobytes.
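The 30-day rule described above can be sketched in a few lines of Python. This is a toy model of the behaviour, not the actual Intelligent Tiering implementation:

```python
from datetime import datetime, timedelta

def current_tier(last_access: datetime, now: datetime) -> str:
    """Objects idle for 30 or more days sit in the infrequent tier;
    any access moves them back to the frequent tier and resets the timer."""
    if (now - last_access) >= timedelta(days=30):
        return "INFREQUENT"
    return "FREQUENT"

now = datetime(2024, 6, 1)
print(current_tier(now - timedelta(days=45), now))  # INFREQUENT
print(current_tier(now - timedelta(days=5), now))   # FREQUENT
```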
So before selecting a class for your data, you really need to ask yourself questions such as: how frequently will this data be accessed, how critical is it, can it be reproduced if lost, and is its access pattern predictable?
When looking at the data within your bucket, the S3 console will also show you which storage class each object belongs to. Amazon Glacier is effectively another storage class; however, it is also a separate service from Amazon S3. There are interactions between the two: S3 allows you to use lifecycle rules to move data from the S3 storage classes into Glacier for archival purposes.
Security in S3
Let me now talk a little bit about the different security features offered by S3.
Bucket Policies
Bucket policies can be very detailed and specific, allowing, for example, access only for a specific user within your account, only within a specified time range, and only when coming from a specific IP address.
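As a sketch of how granular such a policy can be, here is an example bucket policy document built in Python. The account ID, user name, bucket name, IP range, and dates are all hypothetical:

```python
import json

# A bucket policy allowing a single IAM user to read objects, but only
# within a date window and only from a specific source IP range.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AllowAliceFromOfficeDuringQ1",
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::111122223333:user/alice"},
        "Action": ["s3:GetObject"],
        "Resource": "arn:aws:s3:::my-example-bucket/*",
        "Condition": {
            "IpAddress": {"aws:SourceIp": "203.0.113.0/24"},
            "DateGreaterThan": {"aws:CurrentTime": "2024-01-01T00:00:00Z"},
            "DateLessThan": {"aws:CurrentTime": "2024-04-01T00:00:00Z"},
        },
    }],
}

print(json.dumps(policy, indent=2))  # the JSON you would attach to the bucket
```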
Access Control Lists (ACLs)
Access control lists, or ACLs, are another method of controlling who has access to your bucket. However, they only control access for users outside of your own AWS account, such as other AWS accounts or public access. ACLs are not as granular as bucket policies, and so the permissions they grant are broad, for example, list objects and write objects. You may be familiar with a number of recent security incidents in which huge amounts of data were unnecessarily exposed on Amazon S3 because the owners of that data failed to restrict public access to their buckets, which may have contained personally identifiable information. Understanding who has access to your buckets and data is essential when using Amazon S3, due to the potential for it to be accessible across the internet.
S3 offers a number of different encryption mechanisms to allow you to encrypt your data. These cover both server-side and client-side encryption options. The main difference between the two is the location where the encryption takes place: server-side encryption takes place within S3, while client-side encryption occurs on your client prior to uploading your objects. S3 also fully supports encryption in transit via SSL (Secure Sockets Layer).
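To illustrate the server-side options, here are the request parameters you might pass, sketched as plain dictionaries mirroring the boto3 put_object keyword arguments. Bucket and key names are hypothetical, and no call to AWS is made:

```python
# SSE-S3: S3 encrypts the object at rest using S3-managed keys.
sse_s3_request = {
    "Bucket": "my-example-bucket",
    "Key": "confidential/report.pdf",
    "Body": b"...",
    "ServerSideEncryption": "AES256",
}

# SSE-KMS: S3 encrypts the object using a key managed in AWS KMS.
sse_kms_request = {
    "Bucket": "my-example-bucket",
    "Key": "confidential/report.pdf",
    "Body": b"...",
    "ServerSideEncryption": "aws:kms",
    # "SSEKMSKeyId": "...",  # optionally pin a specific KMS key
}
```

With client-side encryption, by contrast, you would encrypt the body yourself before building any such request, so S3 only ever sees ciphertext.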
S3 Data Management
Versioning, when you enable versioning on a bucket it allows multiple versions of the same object to exist. This is useful for retrieving previous versions of a file, or for recovering from accidental, or indeed malicious, deletion of an object. Versions are created automatically by the bucket when you overwrite the same object. For easier management, S3 will only display the latest version of the object within the console, but it does provide a way of viewing all versions as and when you need them. Versioning is not enabled by default; however, once you have enabled it you need to be aware of two main points. Firstly, you can't disable versioning. You can suspend it on the bucket, which will prevent any further versions of your objects from being created, but you can't disable it altogether. Secondly, versioning will add to your costs, as you are storing multiple versions of the same object and the Amazon S3 cost model is based on actual usage of data.
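A toy model of what versioning does on overwrite, purely for illustration:

```python
# Each overwrite of the same key appends a new version; a console-style
# view surfaces only the latest one, but every version is retained (and billed).
versions = {}

def put(key, data):
    versions.setdefault(key, []).append(data)

def latest(key):
    return versions[key][-1]

put("notes.txt", "v1: first draft")
put("notes.txt", "v2: final draft")   # overwrites, but v1 is still stored

print(latest("notes.txt"))            # v2: final draft
print(len(versions["notes.txt"]))     # 2
```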
Lifecycle rules, lifecycle rules provide an automatic method of managing the life of your data while it is being stored on Amazon S3. By adding a lifecycle rule to a bucket, you are able to configure specific criteria that can automatically move your data from one class to another, move it to Amazon Glacier, or delete it from Amazon S3 altogether. You may want to do this as a cost-saving exercise, by moving data to a cheaper storage class after a set period of time, for example 30 days. Once those 30 days are up, Amazon S3 will automatically change the storage class of that data as per the lifecycle rule. Another example would be that you may only be required to keep data for a set period of time before it can be deleted, again saving you money on storage. In this scenario, you can set the bucket's lifecycle policy to automatically delete anything older than, say, 90 days. The time frames are up to you and your own requirements.
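The two examples above, a transition after 30 days and deletion after 90, can be expressed in a lifecycle configuration like the following sketch. This mirrors the structure boto3's put_bucket_lifecycle_configuration expects; the rule ID and prefix are hypothetical:

```python
# Transition objects under "logs/" to Standard-IA after 30 days,
# then delete them entirely after 90 days.
lifecycle_configuration = {
    "Rules": [{
        "ID": "archive-then-expire",
        "Filter": {"Prefix": "logs/"},
        "Status": "Enabled",
        "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
        "Expiration": {"Days": 90},
    }]
}
```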
Common S3 use cases
Amazon S3 is commonly used in a number of different use cases, due to the features I've already discussed making it widely accessible and usable for different data types. So let me now cover a few scenarios where Amazon S3 would be a good solution for your storage requirements, starting with data backup.
Data Backup: Many people find the highly scalable and reliable nature of Amazon S3 an ideal choice for storing data backups, either for existing AWS resources that you are using or for your own on-premises data.
AWS also offers solutions to help you manage the transfer of your on-premises production data to Amazon S3 as a backup to your primary on-site storage. With its ability to scale enormously and the flexibility of being able to retrieve your data with ease, it's easy to see why S3 makes a great data backup solution. When your data is stored on S3 it can be, permissions allowing, accessed from anywhere you have an internet connection. This is another reason S3 makes a great service for data backup solutions, allowing you to retrieve your data from anywhere should you need it.
Static content and websites, S3 is perfect for storing static data such as images and video, which are used on almost every website. Every object can also be referenced directly via a unique URL, or by a content delivery network such as Amazon CloudFront, which interacts closely with S3. If your website is entirely static, then your whole website can in fact be hosted on Amazon S3, providing a highly scalable and cost-effective method for running your website.
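When hosting a static website on S3, the bucket is given a website configuration; here is a minimal sketch of that structure, mirroring what boto3's put_bucket_website expects. The document names are typical defaults, not requirements:

```python
# Serve index.html at the site root and error.html for missing pages.
website_configuration = {
    "IndexDocument": {"Suffix": "index.html"},
    "ErrorDocument": {"Key": "error.html"},
}
```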
Large data sets, S3 is also great for storing computational, scientific, and statistical data, allowing you to perform big data analytics. Thanks to the horizontal scaling abilities of S3, this kind of data can be accessed by multiple parties at once for analysis without impacting performance. Due to the size of some of this data, it's also a very cost-effective method of storing large amounts of data that can be easily accessed and shared by a number of people.
Integration with other AWS services, Amazon S3 is widely used by a number of other AWS services to help them perform their own functionality and features behind the scenes. For example, the Elastic Block Store service, known as EBS, which I'll be discussing in more detail in an upcoming lecture, is able to create a backup of itself and store this backup on Amazon S3 as a snapshot. However, unlike the buckets that you create and the data that you store on S3, your EBS snapshots are not visible in any S3 buckets that you own. The snapshots are managed by AWS and hidden from the S3 console, as they are only backed by S3 and there is no need for you to manage their storage requirements yourself. Using S3 for this purpose makes your EBS snapshots highly available and highly resilient.
Another example would be logging. Many services use Amazon S3 to store their logs, such as AWS CloudTrail. AWS CloudTrail is a service that records and tracks all API calls made within your AWS account. These API calls are recorded as events and stored within a log file, which is then stored on S3. Again, due to the highly scalable and reliable nature of the service, it makes sense for other services to use S3 for purposes such as this. In this instance, you are able to view the CloudTrail logs in one of your configured buckets, specified during the creation of the CloudTrail trail.
Amazon S3 can also be used as an origin for an Amazon CloudFront distribution. During the configuration of your CloudFront distribution, you are able to specify a bucket that stores the files to be distributed out to AWS edge locations, helping to reduce web access latency for your end users.
As with many services, the cost of S3 storage varies depending on the region you select. Let me look at the example of the London region across the different storage classes. The first thing I want to point out is that the Reduced Redundancy storage class is actually more expensive than Standard storage. Originally, Reduced Redundancy was introduced to reduce cost over that of the Standard class as a trade-off against lower durability. However, for many regions this is no longer the case; it is now more cost effective to use the Standard class, which provides a greater level of resilience for a cheaper price. If you want to optimize your costs, the preferable option would be to use the Infrequent Access storage class; however, the availability of this class drops to 99.9% instead of 99.99%. The durability remains the same, at eleven nines. The more storage you use on S3, the more the cost of each gigabyte reduces as certain usage thresholds are reached.
When looking at your S3 costs, you might assume you are only charged for the storage itself, per gigabyte; however, there are a number of other cost elements to S3, and it's worth being aware of a couple of these, including request costs and data transfer costs. Your request costs are based on actions such as PUT, COPY, POST, and GET requests, and are charged per 10,000 requests. Again, the cost of these requests depends on your storage class. As an example, when using the Standard storage class in the London region, it will cost you just over five cents per 10,000 PUT, COPY, POST, or LIST requests. For GET requests and all other types of request, it will cost you just over four cents per 10,000 requests. Data being transferred into S3 is free; however, transferring data out to another region costs two cents per gigabyte.
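Using the example figures above, a rough request-and-transfer cost calculation might look like this. The prices are illustrative only; real prices vary by region and change over time:

```python
# Assumed illustrative prices based on the figures quoted above (London region).
PUT_PER_10K = 0.053      # "just over five cents" per 10,000 PUT/COPY/POST/LIST
GET_PER_10K = 0.042      # "just over four cents" per 10,000 GET/other requests
XFER_OUT_PER_GB = 0.02   # two cents per GB transferred out to another region

def request_and_transfer_cost(puts: int, gets: int, gb_out: float) -> float:
    """Estimate the non-storage portion of an S3 bill in dollars."""
    return ((puts / 10_000) * PUT_PER_10K
            + (gets / 10_000) * GET_PER_10K
            + gb_out * XFER_OUT_PER_GB)

# e.g. 100,000 PUTs, 1,000,000 GETs, 50 GB transferred out
print(round(request_and_transfer_cost(100_000, 1_000_000, 50), 2))  # 5.73
```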
Although I have covered a number of reasons why S3 is great for different storage solutions, it's not a catch-all storage service. For example, it's not ideal for the following scenarios: archiving data for long-term use, perhaps for compliance; data that is dynamic and changing very fast; data that requires a file system; and structured data that needs to be queried. There are other storage services that I will discuss that are far more suited to these functionalities.