AWS Kinesis data streams in real-time
INTRODUCTION
This article describes how to collect and process large data streams in real time using AWS. It introduces Amazon Kinesis Data Streams, a service that handles this workload with rapid, continuous processing, high performance, and durability.
WHAT IS KINESIS DATA STREAMS?
Definition
Kinesis Data Streams is a fully managed, serverless data streaming service that ingests and stores streaming data of many kinds in real time at any scale.
Use cases
Stream log and event data
Run real-time analytics
Power event-driven applications
INFRASTRUCTURE DIAGRAM
The following architectures are common patterns recommended by AWS.
Producers → Kinesis Data Streams → Consumers
Users → ALB → Producer (Fargate) → Kinesis Data Streams → Consumers (Fargate)
Producer: an application that puts user data records into a Kinesis data stream (also called data ingestion). Records can be put into Kinesis Data Streams from many sources, such as EC2, Fargate behind an ALB, an API Gateway proxy, mobile applications, or any other server.
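A producer can be sketched in a few lines of Python. This is a minimal sketch, not a definitive implementation: it assumes `client` is a boto3 Kinesis client (for example `boto3.client("kinesis")`), and the function name `put_event`, the JSON encoding, and the stream name used below are illustrative choices rather than anything Kinesis requires.

```python
import json


def put_event(client, stream_name, event, partition_key):
    """Serialize one event as JSON and put it into a Kinesis data stream.

    `client` is assumed to be a boto3 Kinesis client; it is passed in as a
    parameter so the function can be exercised without AWS credentials.
    """
    data = json.dumps(event).encode("utf-8")
    response = client.put_record(
        StreamName=stream_name,
        Data=data,
        # Records with the same partition key are routed to the same shard.
        PartitionKey=partition_key,
    )
    # The response identifies the shard and position where the record landed.
    return response["ShardId"], response["SequenceNumber"]
```

With real credentials, a call could look like `put_event(boto3.client("kinesis"), "demo-stream", {"user": "u1"}, "u1")`, where `demo-stream` is a hypothetical stream name.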
Consumer: an application that reads and processes the data from a Kinesis data stream. Consumers can run on many resources, such as EC2, Lambda, EMR, or Fargate, and can perform many tasks with the records, such as storing them in DynamoDB, Aurora, Redshift, or S3.
LIMITATIONS
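As one illustration of a consumer, the sketch below shows a Lambda function handling a batch of Kinesis records. Lambda delivers each record's payload base64-encoded under `record["kinesis"]["data"]`. The assumption that every payload is a JSON document, and the shape of the return value, are choices made only for this example.

```python
import base64
import json


def handler(event, context):
    """Decode the batch of Kinesis records that Lambda passes to the function."""
    decoded = []
    for record in event["Records"]:
        # Lambda delivers each record's payload base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"])
        # Assumption: producers wrote JSON documents.
        decoded.append(json.loads(payload))
    # A real consumer would store `decoded` somewhere (DynamoDB, S3, ...) here.
    return {"processed": len(decoded), "items": decoded}
```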
Data record
- Put records: a single record can be up to 1 MB (PutRecord API); a PutRecords request can carry up to 500 records and up to 5 MB in total, with each record still limited to 1 MB.
- Pull records: a single GetRecords call can retrieve up to 10 MB of data from a single shard, and up to 10,000 records.
Data retention period: 24 hours by default, extendable up to 8,760 hours (365 days).
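To show how the put limits shape producer code, here is a minimal sketch that groups records into batches that fit a single PutRecords request. It assumes 1 MB means 1,048,576 bytes, that the partition key counts toward a record's size, and that a request holds at most 500 records. The helper name `batch_records` is hypothetical.

```python
MAX_RECORD_BYTES = 1024 * 1024        # 1 MB per record (assumed to mean 1 MiB)
MAX_REQUEST_BYTES = 5 * 1024 * 1024   # 5 MB per PutRecords request
MAX_REQUEST_RECORDS = 500             # record-count limit per PutRecords request


def batch_records(records):
    """Group (partition_key, data_bytes) pairs into PutRecords-sized batches."""
    batches, current, current_bytes = [], [], 0
    for partition_key, data in records:
        size = len(partition_key.encode("utf-8")) + len(data)
        if size > MAX_RECORD_BYTES:
            raise ValueError("record exceeds the 1 MB per-record limit")
        # Start a new batch when adding this record would break either limit.
        if current and (
            len(current) >= MAX_REQUEST_RECORDS
            or current_bytes + size > MAX_REQUEST_BYTES
        ):
            batches.append(current)
            current, current_bytes = [], 0
        current.append({"Data": data, "PartitionKey": partition_key})
        current_bytes += size
    if current:
        batches.append(current)
    return batches
```

Each batch has the shape that the boto3 call `put_records(StreamName=..., Records=batch)` expects, so a producer could send the batches one after another.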
PERFORMANCE
Kinesis Data Streams relies on shards, which are the units of throughput and of parallelism. One shard provides an ingest throughput of up to 1 MB per second or 1,000 records per second, whichever limit is reached first, and an outbound throughput of up to 2 MB per second. As you ingest more data, you can add more shards to the stream (in on-demand capacity mode, Kinesis Data Streams adjusts capacity automatically). Customers often run thousands of shards in a single stream.
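The per-shard figures above translate into a simple sizing rule: the stream needs enough shards to satisfy whichever limit is tightest. The following sketch applies that rule. The function name `required_shards` and the sample workload numbers are hypothetical, and the result is a rough estimate, not an official sizing formula.

```python
import math


def required_shards(ingest_mb_per_s, records_per_s, read_mb_per_s):
    """Estimate the shard count from the per-shard limits quoted in the text."""
    by_ingest_bytes = math.ceil(ingest_mb_per_s / 1.0)   # 1 MB/s write per shard
    by_ingest_count = math.ceil(records_per_s / 1000)    # 1,000 records/s per shard
    by_read_bytes = math.ceil(read_mb_per_s / 2.0)       # 2 MB/s read per shard
    # A stream always has at least one shard.
    return max(1, by_ingest_bytes, by_ingest_count, by_read_bytes)


# Hypothetical workload: 5 MB/s in, 2,000 records/s, 15 MB/s total read demand.
print(required_shards(5, 2000, 15))  # prints 8 (read throughput is the bottleneck)
```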
When a consumer uses enhanced fan-out, it gets its own 2 MB/sec allotment of read throughput per shard, so multiple consumers can read from the same stream in parallel without contending with one another for read throughput.
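The difference can be made concrete with a little arithmetic. The sketch below assumes that standard consumers split a shard's 2 MB/sec evenly, which is a simplification; in practice the split depends on how each consumer polls. The function name is hypothetical.

```python
SHARD_READ_MB_PER_S = 2.0  # per-shard read throughput quoted in the text


def read_throughput_per_consumer(num_consumers, enhanced_fan_out):
    """Return the per-shard read throughput (MB/s) each consumer can expect."""
    if num_consumers < 1:
        raise ValueError("num_consumers must be at least 1")
    if enhanced_fan_out:
        # Each enhanced fan-out consumer gets its own 2 MB/s allotment.
        return SHARD_READ_MB_PER_S
    # Standard consumers share the shard's 2 MB/s (even split assumed).
    return SHARD_READ_MB_PER_S / num_consumers
```

Under these assumptions, three standard consumers on one shard get roughly 0.67 MB/sec each, while three enhanced fan-out consumers get 2 MB/sec each.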
Comparison between consumers without enhanced fan-out and consumers with enhanced fan-out
PERFORMANCE MEASUREMENT
Assume the following test setup:
Record size: 76 KB per record, 4,000 records in total.
Consumers: standard consumers (without enhanced fan-out) and enhanced fan-out (EFO) consumers.
Stream: a single shard with 3 Lambda consumers.
As the chart above shows, each of the enhanced fan-out functions processed the 4,000 records in under 2 seconds, while each of the standard functions took just over 2.5 seconds. When processing millions of records in real time, this latency gap between standard and enhanced fan-out consumers becomes much more significant.
CONCLUSION
In conclusion, Kinesis Data Streams is one of the best choices for collecting and analyzing massive amounts of data: enhanced fan-out gives each consumer dedicated read throughput, records are stored durably for up to 365 days, a stream scales out by adding shards as the data volume grows, and the service is cost-effective. It is therefore an easy way to stream large volumes of data in real time in an era where performance is very important.
Let’s get started: try implementing this service if you have the chance!