Database Fundamentals on AWS
The goal of this post is to introduce the various database services offered by Amazon Web Services. There are two main families of databases provided by Amazon Web Services: relational and non-relational. Each suits different use cases.
Overview of the AWS Database Services:
AWS Relational Database Service (RDS):
- Amazon RDS for MySQL
- Amazon RDS for Microsoft SQL Server
- Amazon RDS for Oracle
- Amazon RDS for MariaDB
- Amazon RDS for PostgreSQL
- Amazon Aurora
AWS Non-Relational Databases
- Amazon DynamoDB
- Amazon ElastiCache
- Amazon Neptune
Amazon RDS family
What is Amazon RDS?: Amazon RDS is Amazon’s managed relational database service, and it is the core platform for running relational databases on AWS. Before we delve in, let’s make sure we understand the benefits of using the RDS service. Keep in mind that we don’t have to use Amazon RDS to run a relational database on AWS. We could run our own relational database on an EC2 instance, for example.
Benefits of using Amazon RDS: Let’s get to know the common services RDS provides across the range of database types that are provided within it.
- Firstly, one of the key benefits of the RDS service is the ability to scale individual resources independently. This means you can alter the compute resources (memory and processor size), the amount of storage, or the IOPS independently of each other. You don’t need to buy a whole new machine if you want to increase one aspect of performance.
- The second big benefit is automatic backups and patching. RDS provides automatic backups of your database, which are on by default, and that simplifies backup and recovery for you. You can set the daily backup window and how long backups are retained, and you can still create manual snapshots of a database at any time. Along with the backup routine, Amazon RDS also manages the software patching of the database platform, and you can configure the maintenance window in which patches are applied.
- The third big benefit of RDS is high availability. RDS provides the option to run databases in multiple availability zones within a region. Depending on the engine you choose, RDS can keep a synchronously or asynchronously replicated copy of your database in a different availability zone. How this replication and failover works is specific to each engine. Multi-AZ deployments for Oracle, PostgreSQL, MySQL, and MariaDB instances use Amazon’s failover technology, whereas Microsoft SQL Server DB instances use SQL Server mirroring. Lastly, Amazon Aurora copies the data in a DB cluster across multiple availability zones in a single region by default. So regardless of whether the instances in the DB cluster span multiple availability zones, you still have multi-AZ support for your storage with Amazon Aurora.
- The fourth benefit is automatic failure detection and recovery. Along with creating backups and running a standby database for you in another availability zone, Amazon RDS can also detect failures and recover from them automatically.
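To make these benefits a little more concrete, here is a minimal sketch of how they might map onto the settings used when creating an RDS instance. The key names follow the RDS CreateDBInstance request as I understand it, but the helper function, the `claims-db` identifier, the instance class, and the window values are all assumptions chosen for illustration. Credentials and networking settings are deliberately left out, and nothing here calls AWS.

```python
from typing import Any, Dict


def build_rds_instance_params(
    identifier: str,
    engine: str = "mysql",
    instance_class: str = "db.t3.medium",
    storage_gib: int = 100,
    multi_az: bool = True,
    backup_retention_days: int = 7,
) -> Dict[str, Any]:
    """Assemble settings shaped like an RDS CreateDBInstance request (hypothetical helper)."""
    # RDS keeps automated backups for 0 to 35 days; 0 turns them off.
    if not 0 <= backup_retention_days <= 35:
        raise ValueError("backup retention must be between 0 and 35 days")
    return {
        "DBInstanceIdentifier": identifier,
        "Engine": engine,
        "DBInstanceClass": instance_class,  # compute: memory and processor size
        "AllocatedStorage": storage_gib,    # storage, sized separately from compute
        "MultiAZ": multi_az,                # keep a standby in another availability zone
        "BackupRetentionPeriod": backup_retention_days,
        "PreferredBackupWindow": "03:00-04:00",               # daily window, UTC (assumed value)
        "PreferredMaintenanceWindow": "sun:05:00-sun:06:00",  # when patches may be applied (assumed value)
    }


params = build_rds_instance_params("claims-db", backup_retention_days=14)
print(params["MultiAZ"], params["BackupRetentionPeriod"])
```

In a real deployment a dictionary like this would be handed to an SDK or an infrastructure template, together with the credentials and network settings omitted here.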
Summary: Amazon RDS provides you with a managed service which takes care of provisioning the hardware, networking, and database software. While you retain some control of the configuration, Amazon RDS manages the patching of the database software and the compute platform. Amazon RDS enables you to run a database in multiple availability zones so your database service is highly available. And by default, RDS also manages backing up your database for you. This one service gives you a managed platform for the following database engines:
- MySQL
- Oracle
- Microsoft SQL Server
- PostgreSQL: An open source database and arguably the leading open source database.
- MariaDB: MariaDB is the community-developed fork of the MySQL relational database management system.
- Aurora: Amazon Aurora is Amazon’s own MySQL- and PostgreSQL-compatible database, which provides significantly faster processing and better availability because it has its own cloud native database engine.
A bit more about Amazon Aurora: Amazon Aurora is the cloud native database from Amazon. It is a MySQL- and PostgreSQL-compatible relational database service that provides significantly faster processing and better availability. Amazon Aurora was designed and built from the ground up to be cloud native, so it’s a high-performance database service. As such, there are speed and availability benefits in choosing the Aurora service in RDS.
The first key difference is how Amazon Aurora manages the underlying storage. When you create an Amazon Aurora instance, the Aurora service also deploys a cloud native database cluster, and the Aurora instances use this database cluster as the underlying data store. The cluster’s data is replicated across three availability zones by default, with each availability zone holding a copy of the database cluster data. Each cluster has one primary instance, which performs all of the data modifications to the cluster volume and supports both read and write operations. A cluster can also have Aurora replicas, which support only read operations, and each Aurora DB cluster can have up to 15 Aurora replicas alongside the primary instance. This makes response and recovery times for Amazon Aurora significantly faster, and the storage more durable, than on most other RDS engines. The multiple Aurora replicas distribute the read workload, and by locating Aurora replicas in separate availability zones you can increase your database availability while also increasing read performance.
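The split between one writable primary and several read-only replicas can be sketched in a few lines. This is only an illustration of the idea, not how Aurora itself routes traffic: the class name, the host names, and the naive "starts with SELECT" check are all assumptions made for the example.

```python
from itertools import cycle
from typing import Iterator, List


class AuroraEndpointRouter:
    """Send writes to the primary and spread reads across replicas (hypothetical helper)."""

    def __init__(self, primary: str, replicas: List[str]) -> None:
        if len(replicas) > 15:
            raise ValueError("an Aurora cluster supports at most 15 replicas")
        self.primary = primary
        # With no replicas configured, the primary serves reads as well.
        self._readers: Iterator[str] = cycle(replicas or [primary])

    def endpoint_for(self, sql: str) -> str:
        """Return the host that should run the given statement."""
        # Simplification: treat anything starting with SELECT as a read.
        is_read = sql.lstrip().lower().startswith("select")
        return next(self._readers) if is_read else self.primary


router = AuroraEndpointRouter(
    "primary.example.internal",
    ["replica-a.example.internal", "replica-b.example.internal"],
)
print(router.endpoint_for("UPDATE claims SET status = 'open' WHERE id = 1"))
print(router.endpoint_for("SELECT * FROM claims WHERE id = 1"))
```

In practice an Aurora cluster exposes a reader endpoint that balances connections across the replicas, so an application would normally point reads at that endpoint rather than rotate through hosts itself.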
So that’s the Amazon RDS family. But within relational databases, there’s also Amazon Redshift. Amazon Redshift is Amazon’s data warehouse solution. It was built as a cloud native application, so it provides speed and availability by default, and it is a well-priced data warehousing service.
Amazon Non-Relational Databases
In comparison to relational databases, non-relational databases provide a simple tabular structure without a processing engine built into the database software. The key difference with non-relational databases is the lack of a fixed schema and a transaction engine. This makes non-relational databases a little lighter, simpler, and perhaps less dependent on native database code.
A non-relational database can still be accessed and worked with, but in a different way from the Structured Query Language we use to access a relational database.
With a relational database, we have a persistent connection to the database and then we use the Structured Query Language to work with the data within it. With a non-relational database, we generally use a RESTful HTTP interface. Before your application can access a database, it must be authenticated to ensure that it is allowed to use that database, and it must be authorized so that it can only perform the actions for which it has permissions. For example, how we work with DynamoDB is different from how we would work with a relational database like Microsoft SQL Server.
With DynamoDB, we use the Query action to retrieve data. The Query action lets you retrieve data directly from the physical location where the data is stored, so the syntax and operations are different from SQL. We can use the Query action with any table that has a composite primary key, which is a partition key and a sort key. In DynamoDB, you must use expression attribute values as placeholders for values in expression parameters such as the key condition expression and the filter expression.
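As a rough sketch of what those placeholders look like, the helper below builds the parameters for a Query request against a table with a partition key and a sort key. The parameter names (`KeyConditionExpression`, `ExpressionAttributeNames`, `ExpressionAttributeValues`) are the ones DynamoDB uses, but the helper itself, the `TravelAlerts` table, and the `Region` and `IssuedAt` attributes are hypothetical, and the example only builds the request rather than sending it.

```python
from typing import Any, Dict, Optional


def build_query_params(
    table: str,
    partition_key: str,
    partition_value: str,
    sort_key: Optional[str] = None,
    sort_prefix: Optional[str] = None,
) -> Dict[str, Any]:
    """Build Query parameters using placeholders instead of inline values."""
    # "#..." placeholders stand in for attribute names, ":..." for values.
    names = {"#pk": partition_key}
    values = {":pk": {"S": partition_value}}  # "S" marks a string value
    condition = "#pk = :pk"
    if sort_key is not None and sort_prefix is not None:
        names["#sk"] = sort_key
        values[":sk"] = {"S": sort_prefix}
        condition += " AND begins_with(#sk, :sk)"
    return {
        "TableName": table,
        "KeyConditionExpression": condition,
        "ExpressionAttributeNames": names,
        "ExpressionAttributeValues": values,
    }


# Hypothetical table: alerts for one region issued in a given year.
params = build_query_params("TravelAlerts", "Region", "east-coast", "IssuedAt", "2024-")
print(params["KeyConditionExpression"])
```

With the AWS SDK for Python installed and credentials configured, a dictionary like this could be passed to the DynamoDB client’s `query` call as keyword arguments.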
In general, the AWS non-relational databases can scale faster than relational databases. With a non-relational database, you don’t need to define a schema for the tables first, which means changes to a non-relational database can be made faster. Non-relational databases are designed specifically for handling non-structured data types, e.g. videos, images, or data objects that are not uniform in structure.
Amazon DynamoDB is a cloud native database designed for managing high volumes of records and transactions without you needing to provision capacity up front. DynamoDB is a fully managed service, and its simplicity, scalability, and speed have made it the go-to database for online services that deal with high volumes of internet-based transactions.
DynamoDB supports both document and key-value data models. That is a big point: it means DynamoDB can store multiple data types at the same time, without you needing to define a new schema or field type. That makes it a good choice if you need a database that can keep growing to meet demand with many different types of objects stored in it. DynamoDB runs as a web service, which we provision from the AWS console or via the API. AWS also provides a downloadable version of DynamoDB that you can run locally on your computer or server. The downloadable version lets you write and test applications locally without accessing the DynamoDB web service.
DynamoDB supports encryption at rest, so it can meet many compliance and security requirements. So to summarize, DynamoDB is all about speed and performance. If we need to scale something up really quickly, we’re not quite sure what type of information we’re going to be collecting, and we just need a flexible service, then DynamoDB is a fantastic fit.
The next in the family of non-relational databases is Amazon ElastiCache. Amazon ElastiCache is a managed data cache service built from the open source Redis and Memcached engines. As a managed service, Amazon ElastiCache can improve your application performance by providing a frontline cache to respond to read requests made to an application or to a database. Let’s delve into the differences between a cache and a permanent data store so we are clear on the distinctions. The purpose of a cache is generally to act as a fast-access copy of data that is read a lot.
A cache will hold a copy of frequently requested information as a way to reduce load on other parts of a service or application. Imagine we have a travel alert page within our claims management application. People check this travel alert page frequently to see the state of the weather before booking or commencing their travel. If a storm is set to hit the east coast, many of our customers will check this page to see whether their travel insurance includes storm cover. Over time we might experience some slowing of this business app, because when the travel alert page is visited that frequently, there is a lot of load on the database and on the application. We didn’t envisage this would be an issue when we designed the claims management service.
One way to reduce the load on our claims database would be to implement a cache between the database and the claims application. A cache is ideal for holding frequently requested data so the web application does not need to read those components from our permanent data store. A cache will generally hold data for a finite period of time, and if a record is changed, the cache will compare, flush, and store the latest version of the record. So while ElastiCache is a reliable and durable service by nature, ElastiCache is more about speed than persistence. As a cache, it is different by design from a permanent data store. Using Amazon ElastiCache, we could easily provision a cache to sit between our main database and our web application. Read requests that are made over and over will be stored temporarily within the ElastiCache cache. ElastiCache will respond and send that frequently requested data to the web application front end, meaning our main database does not receive so many requests.
So which cache engine should we choose? ElastiCache supports two engines, Memcached and Redis.
What’s the difference, I hear you ask. Both engines provide a great cache solution, easily provisioned using the ElastiCache service. A simple way I use to remember the difference between them is Memcached for simplicity and speed, Redis for features. Choose Memcached if the following applies to your situation:
- You need the simplest model possible.
- You need to run large nodes with multiple cores or threads.
- You need the ability to scale in and out, adding and removing nodes as demanded by your system.
Use Redis if any of the following applies:
- You have complex data types, such as strings, hashes, lists, sets, or bitmaps.
- You need persistence in your data store.
- You need to encrypt your data.
- You need to replicate your cache data.
- You need automatic failover if your primary node should fail.
- You want backup and restore capabilities.
- You need support for multiple caches.
- You need to sort or rank in-memory data sets.
It’s all about the features, and Redis delivers on those.
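Whichever engine we pick, the behaviour we want for the travel alert page is the same: serve repeated reads from the cache, hold each entry for a finite time, and flush an entry when the underlying record changes. Here is a minimal in-memory sketch of that behaviour. It stands in for ElastiCache rather than using it, and the `TtlCache` class, the `load` callback, and the 60-second lifetime are assumptions made for the example.

```python
import time
from typing import Callable, Dict, Tuple


class TtlCache:
    """Serve repeated reads from memory and reload entries after they expire."""

    def __init__(
        self,
        load: Callable[[str], str],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._load = load          # reads from the permanent data store
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}  # key -> (expiry time, value)

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]        # cache hit: the database is not touched
        value = self._load(key)    # cache miss or expired: go to the database
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key: str) -> None:
        """Flush an entry, for example after the record has been updated."""
        self._entries.pop(key, None)


database = {"travel-alert": "Storm warning for the east coast"}
cache = TtlCache(load=lambda key: database[key], ttl_seconds=60)
print(cache.get("travel-alert"))  # first read loads from the database
print(cache.get("travel-alert"))  # second read is served from the cache
```

The short lifetime is the trade-off to tune: a longer one takes more load off the database, while a shorter one keeps the alert page closer to the latest data.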
Amazon Neptune is a graph database service that makes it easy to build and run applications that need to run a lot of queries and lookups across connected data. Working with graph data can require a large number of connections and related queries. As a managed service, Amazon Neptune reduces the need for hardware provisioning, software patching, configuration, and backup setup, all of which become harder to manage when you have a lot of these queries to run. Amazon Neptune is an AWS native graph database engine, optimized for storing data relationships and querying a graph quickly and efficiently.
Neptune suits use cases such as knowledge graphs, recommendation engines, and network security, to name a few examples. Neptune supports the popular graph models Property Graph and W3C’s RDF, and their respective query languages, Apache TinkerPop Gremlin and SPARQL, which allow you to easily build queries that efficiently navigate highly connected data sets.
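To give a feel for the kind of question a graph database answers, here is a small sketch of a recommendation lookup over connected data. It uses a plain in-memory dictionary rather than Neptune, Gremlin, or SPARQL, and the `recommend` function and the sample names are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List, Set


def recommend(graph: Dict[str, Set[str]], user: str, limit: int = 3) -> List[str]:
    """Rank people two hops away from `user` by how many connections they share."""
    direct = graph.get(user, set())
    scores: Counter = Counter()
    for friend in direct:
        for candidate in graph.get(friend, set()):
            # Skip the user and anyone they are already connected to.
            if candidate != user and candidate not in direct:
                scores[candidate] += 1
    # Most shared connections first, ties broken alphabetically.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:limit]]


connections = {
    "ana": {"ben", "cy"},
    "ben": {"ana", "dee", "eli"},
    "cy": {"ana", "dee"},
    "dee": {"ben", "cy"},
    "eli": {"ben"},
}
print(recommend(connections, "ana"))
```

A graph engine runs this sort of multi-hop traversal natively, which is why it suits data sets where the relationships matter as much as the records.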
Comparing the AWS Database Services
AWS Non-relational databases