Amazon Kinesis makes it easy to collect, process, and analyze
real-time streaming dataso you can get timely insights and react quickly to new information.
With Amazon Kinesis you can ingest real-time data such as application logs, website clickstreams, IoT telemetry data, and more into your databases, your data lakes, and data warehouses, or build your own real-time applications using this data.
Amazon Kinesis enables you to process and analyze data as it arrives and respond in real-time instead of having to wait until all your data is collected before the processing can begin.
In choosing a big data processing solution from within the available AWS service offerings, it is important to determine whether you need the latency of response from the process to be in seconds, minutes, or hours.
This will typically drive the decision on which AWS service is the best for that processing pattern or use case. AWS Kinesis is primarily designed to deliver processing orientated around real-time streaming.
One of the interesting things when we look at the storage patterns is that Amazon Kinesis does not store persistent data itself, unlike many of the other Amazon big data services. AWS Amazon Kinesis needs to be deployed as part of a larger solution where you define a target big data solution that will store the results of the streaming process.
Note that each Amazon Kinesis Firehose delivery stream stores data records for up to 24 hours in case the delivery destination is unavailable, and the Kinesis Stream stores records from 24 hours by default but this can be extended to retain the data for up to seven days.
Amazon Kinesis can continuously capture and store terabytes of data per hour from hundreds or thousands of sources such as website clickstreams, financial transactions, social media feeds, IT logs, and location-tracking events.
Amazon Kinesis provides three different solution capabilities.
- Amazon Kinesis Streams enables you to build custom applications that process or analyze streaming data for specialized needs.
- Amazon Kinesis Firehose enables you to load streaming data into the Amazon Kinesis Analytics, Amazon S3, Amazon Redshift, and the Amazon Elasticsearch services.
- Amazon Kinesis Analytics enables you to write standard SQL queries on streaming data.
Amazon Kinesis Streams is based on a platform as a service style architecture where you determine the throughput of the capacity you require and the architecture and components are automatically provisioned and stored and configured for you. You have no need or ability to change the way these architectural components are deployed.
Unlike some of the other Amazon Big Data services which have a container that the service sits within, for an example a DB instance within Amazon RDS, Amazon Kinesis doesn’t. The container is effectively the combination of the account and the region you are provisioning the Kinesis streams within.
An Amazon Kinesis Stream is an ordered sequence of data records.
A record is the unit of data in an Amazon Kinesis stream.
Each record in the stream is composed of a sequence number, a partition key, and a data blob.
A data blob is the data of interest your data producer adds to a stream. The data records in the stream are distributed into shards. A shard is the base throughput unit of an Amazon Kinesis stream. One shard provides a capacity of one megabyte per second of data input and two megabytes per second of data output, and can support up to 1,000 put-records per second. You specify the number of shards needed when you create a stream. The data capacity of your stream is a function of the number of shards that you specify for that stream. The total capacity of the stream is the sum of the capacity of its shards. If your data rate increases, you can increase or decrease the number of shards allocated to your stream. The producers continuously push data to Kinesis Streams and the consumers process the data in real-time. For example a web service sending log data to a stream is a producer.
Consumers: Consumers receive records from Amazon Kinesis Streams and process them. These consumers are known as Amazon Kinesis Streams applications. Consumers can store the result using an AWS service such as Amazon DynamoDB, Amazon Redshift, or Amazon S3. An Amazon Kinesis application is a data consumer that reads and processes data from an Amazon Kinesis Stream and typically runs on a fleet of EC2 instances. You need to build your applications using either the Amazon Kinesis API or the Amazon Kinesis Client Library or KCL. Before we go into each option and detail, let’s have a quick look at how AWS makes things easier for you.
One of the great things about AWS is they always try and make things easy for you. So when you go to create a new Amazon Kinesis Stream definition in the console, there are a couple of simple parameters we need to complete to create the stream. We just need to enter in a stream name and the number of shards and then we are ready to go.
An Amazon Kinesis stream is an ordered sequence of data records.
Each record in the stream has a sequence number that is assigned by Kinesis Streams.
A record is the unit of data stored in the Amazon Kinesis stream.
A record is composed of a sequence number, partition key, and data blob.
A data blob is the data of interest your data producer adds to a stream. The maximum size of a data blob, the data payload before Base64 encoding is one megabyte. A partition key is used to segregate and route records to different shards of a stream. The Kinesis Streams service segregates the data records belonging to a stream into multiple shards using the partition key associated with each data record to determine which shard a given data record belongs to. Partition keys are Unicoded streams with a maximum length of 256 bytes. An MD5 hash function is used to map partition keys to a 128-bit integer value and to map associated data records to shards.
A partition key is specified by your data producer while adding data to an Amazon Kinesis stream.
For example, assuming you have a stream with two shards, shard one and shard two, you can configure your data producer to use two partition keys, key A and key B, so that all records within key A are added to shard one, and all records with key B are added to shard two.
A sequence number is a unique identifier for each record.
Sequence numbers are assigned by Amazon Kinesis when a data producer calls
PutRecords operation to add data to an Amazon Kinesis stream. Sequence numbers for the same partition key generally increase over time. The longer the time period between
PutRecords requests, the larger the sequence number becomes.
A shard is a group of data records in a stream.
When you create a stream, you specify the number of shards for the stream.
Each shard can support up to five transactions per second for reads and up to a maximum total data read rate of two megabytes per second, and up to a 1,000 records per second for writes and up to a maximum total data write rate of one megabyte per second including partition keys. The total capacity of a stream is the sum of the capacities of its shards. You can increase or decrease the number of shards in a stream as needed, however note that you are charged on a per shard basis. Before you create a stream, you need to determine an initial size for the stream.
After you create the stream, you can dynamically scale your shard capacity up or down using the AWS Management Console or the
UpdateShardCount API. You can make updates while there is an Amazon Kinesis Stream application consuming data from the stream. You can calculate the initial number of shards you need to provision using the formula at the bottom of the image.
Kinesis Streams support changes to the data record retention period for your stream. A Kinesis stream is an ordered sequence of data records meant to be written to and read from in real-time.
A Kinesis Stream is an ordered sequence of data records
Data records are therefore stored in shards in your stream temporarily.
The time period from when a record is added to when it is no longer accessible is called the retention period.
Kinesis Streams supports re-sharding which enables you to adjust the number of shards in your stream in order to adapt to changes in the rate of data flow through the stream. There are two types of re-sharding operations, a shard split and a shard merge.
As the names suggest, in a shard split you divide a single shard into two shards, in a shard merge you combine the two shards into a single shard. You cannot split into more than two shards in a single operation, and you cannot merge more than two shards in a single operation. The shard or pair of shards that the re-sharding operation acts on are referred to as parent shards. The shard or pair of shards that result from the re-sharding operation are referred to as child shards.
After you call a re-sharding operation, you need to wait for the stream to become activated. Remember Kinesis Streams is a real-time data streaming service which is to say that your application should assume that the data is continuously flowing through the shards in your stream.
When you re-shard, data records that were flowing to the parent shards are rerouted to flow to the child shards based on the hash key values that the data record partition keys mapped to. However any data records that were in the parent shards before the re-shards remain in those shards. In other words the parent shards do not disappear when the re-shard occurs. They persist along with the data they contained prior to the re-shard.
A producer puts data records into Kinesis streams. For example a web server sending log data to a Kinesis stream is a producer. A consumer processes the data records from a stream.
To put data into the stream, you must specify:
- the name of the stream,
- a partition key, and
- the data blob to be added to the stream.
The partition key is used to determine which shard in the stream the data record is added to. All the data in the shard is sent to the same worker that is processing the shard. Which partition key you use depends on your application logic. The number of partition key should typically be much greater than the number of shards. This is because the partition key is used to determine how to map a data record to a particular shard.
If you have enough partition keys, the data can be evenly distributed across the shards in a stream.
A consumer gets records from the Kinesis stream. A consumer, known as an Amazon Kinesis Streams application, processes the data records from a stream. You need to create your own consumer applications. Each consumer must have a unique name that is scoped to the AWS account and region used by the application. This name is used as a name for the control table in Amazon DynamoDB and the namespace for Amazon CloudWatch metrics.
When your application starts up, it creates an Amazon DynamoDB table to store the application state, connects to the specified stream, and then starts consuming data from the stream. You can view the Kinesis stream metrics using the CloudWatch console.
You can deploy the consumer to an EC2 instance. You can use the Kinesis Client Library or KCL to simplify parallel processing of the stream by a fleet of workers running on a fleet of EC2 instances. The KCL simplifies writing code to read from the shards in the stream and it ensures that there is a worker allocated to every shard in the stream. The KCL also provides help with fault-tolerance by providing check-pointing capabilities.
Each consumer reads from a particular shard using a shard iterator. A shard iterator represents the position in the stream from which the consumer will read. When they start reading from a stream, consumers get a shard iterator which can be used to change where the consumers read from the stream.
When the consumer performs a read operation, it receives a batch of data records based on the position specified by the shard iterator. There are a number of limits within the Amazon Kineses Streams service you need to be aware of.
While still under the Kinesis moniker, the Amazon Kinesis Firehose architecture is different to that of Amazon Kinesis Streams.
Amazon Kinesis Firehouse
Amazon Kinesis Firehose is still based on a platform as a service style architecture where you determine the throughput of the capacity you require and the architecture and components are automatically provisioned and stored and configured for you. You have no need or ability to change the way these architectural components are deployed.
Amazon Kinesis Firehose is a fully-managed service for delivering real-time streaming data to destinations such as Amazon Simple Storage Service or S3, Amazon Redshift, or Amazon Elasticsearch service.
With Kinesis Firehose you do not need to write applications or manage resources. You configure your data producers to send data to Kinesis Firehose and it automatically delvers the data to the destination that you specify. You can also configure Amazon Kinesis Firehose to transform your data before data delivery.
Unlike some of the other Amazon big data services which have a container that the service sits within, for example a DB instance within Amazon RDS, Amazon Kinesis Firehose doesn’t. The container is effectively the combination of the account and the region you provision the Kinesis delivery streams within. The delivery stream is the underlying entity of Kinesis Firehose. You use Kinesis Firehose by creating a Kinesis Firehose delivery stream and then sending data to it, which means each delivery stream is effectively defined by the target system that receives the restreamed data.
Data producers send records to Kinesis Firehose delivery streams. For example a web service sending log data to Kinesis Firehose delivery stream is a data producer. Each delivery stream stores data records for up to 24 hours in case the delivery destination is unavailable.
The Kinesis Firehose destination is the data store where the data will be delivered. Amazon Kinesis Firehose currently supports Amazon S3, Amazon Redshift, and Amazon Elasticsearch as service destinations.
Within the delivery stream is a data flow which is effectively the transfer and load process. The data flow is predetermined based on what target data source you configure your delivery stream to load data into. So for example if you are loading into Amazon Redshift, the data flow defines the process of landing the data into an S3 bucket and then invoking the copy command to load the Redshift table.
Kinesis Firehose can also invoke an AWS Lambda function to transform incoming data before delivering it to the selected destination. You can configure a new Lambda function using one of the Lambda blueprints AWS provides or choosing an existing Lambda function.
Before we go into each of the options in detail, let’s have a quick look at how AWS makes things easier for you. One of the great things about AWS is that they always try and make things easy for you. So when you go to create a new Amazon Kinesis Firehose definition in the console, there are a number of pre-baked destinations that will help you with streaming data into a AWS big data storage solution.
As you can see, you can select one of the three data services currently available as a target, S3, Redshift, or Elasticsearch. Selecting one of these destinations will create additional parameter options for you to complete to assist in creating the data flow. If we chose Amazon S3 as a destination data source, then the relevant parameters are displayed to be completed. If we chose Amazon Redshift as a destination target, you can see we get a different set of parameters as you would expect.
Note that we are required to define both an S3 bucket and a Redshift target database in this scenario as Amazon Kinesis Firehose is leveraging the Amazon Redshift copy capability to load the data.
Going back to the Amazon S3 scenario, we also have the ability to consume an AWS Lambda function as part of the loading process to transform the data on the way through.
AWS makes Kinesis Firehose simple to use by predefining the data flows that are required to load data into the destinations. For Amazon S3 destinations, streaming data is delivered to your S3 bucket. If data transformation is enabled, you can optionally back up source data to another Amazon S3 bucket. For Amazon Redshift destinations, streaming data is delivered to your S3 bucket first. Kinesis Firehose then issues an Amazon Redshift copy command to load data from your S3 bucket to your Amazon Redshift cluster. If data transformation is enabled, you can optionally back up source data to another Amazon S3 bucket.
Note that you need to configure your Amazon Redshift cluster to be publicly accessible and unblock the Kinesis Firehose IP addresses. Also note that Kinesis Firehose doesn’t delete the data from your S3 bucket after loading it to your Amazon Redshift cluster. For Amazon Elasticsearch destinations, streaming data is delivered to your Amazon Elasticsearch cluster and can optionally be backed up to your S3 bucket concurrently.
Streams vs Firehose
Let’s have a quick look at the difference between Amazon Kinesis Streams and Firehose.
Amazon Kinesis Streams is a service for workloads that require custom processing, per incoming record, with sub-one-second processing latency, and a choice of stream processing frameworks. Amazon Kinesis Firehose is a service for workloads that require zero administration, the ability to use existing analytical tools based on S3, Amazon Redshift, and Amazon Elasticsearch with data latency of 60 seconds or higher. You use Firehose by creating a delivery stream to a specified destination and send data to it, you do not have to create a stream or provision shards, you do not have to create a custom application as the destination, and you do not have to specify partition keys unlike Streams. But Firehose is limited to S3, Redshift, and Elasticsearch as the data destinations.
Okay so as we come to the end of this post on AWS Kinesis, let’s have a quick look at a customer example from AWS where Amazon Kinesis has been used.
Sushiro uses Amazon Kinesis to stream data from sensors attached to plates in its 380 stores. This is used to monitor the conveyor belt sushi chain in each store and is used to help decide in real time what plates chefs should be making next.