Retain more for less with tiered storage for Amazon MSK

Organizations are adopting Apache Kafka and Amazon Managed Streaming for Apache Kafka (Amazon MSK) to capture and analyze data in real time. Amazon MSK lets you build and run production applications on Apache Kafka without needing Kafka infrastructure management expertise or having to deal with the complex overhead of running Apache Kafka on your own. With increasing maturity, customers seek to build sophisticated use cases that combine aspects of real-time and batch processing. For instance, you may want to train machine learning (ML) models on historical data and then use these models for real-time inference. Or you may want to recompute previous results when the application logic changes, for example when a new KPI is added to a streaming analytics application or when a bug that caused incorrect output is fixed. These use cases often require storing data for several weeks, months, or even years.

Apache Kafka is well positioned to support these kinds of use cases. Data is retained in the Kafka cluster for as long as required by configuring the retention policy. In this way, the most recent data can be processed in real time for low-latency use cases while historical data remains accessible in the cluster and can be processed in a batch fashion.

However, retaining data in a Kafka cluster can become expensive because storage and compute are tightly coupled in a cluster. To scale storage, you need to add more brokers. But adding brokers solely to increase storage wastes the other compute resources they bring, such as CPU and memory. Also, a large cluster with more nodes adds operational complexity, with a longer time to recover and rebalance when a broker fails. To avoid that operational complexity and higher cost, you can move your data to Amazon Simple Storage Service (Amazon S3) for long-term access, and with the cost-effective storage classes in Amazon S3 you can optimize your overall storage cost. This solves the cost challenge, but now you have to build and maintain the part of the architecture that moves data to a different data store. You also need to build different data processing logic using different APIs for consuming data (the Kafka API for streaming, the Amazon S3 API for historical reads).

Today, we’re announcing Amazon MSK tiered storage, which brings a virtually unlimited and low-cost storage tier to Amazon MSK, making it simpler and more cost-effective for developers to build streaming data applications. Since the launch of Amazon MSK in 2019, we have enabled capabilities such as vertical scaling and automatic scaling of broker storage so you can operate your Kafka workloads in a cost-effective way. Earlier this year, we launched provisioned throughput, which enables seamless scaling of I/O without having to provision additional brokers. Tiered storage makes it even more cost-effective for you to run Kafka workloads. You can now store data in Apache Kafka without worrying about limits. You can effectively balance performance and cost by using the performance-optimized primary storage for real-time data and the new low-cost tier for historical data. With a few clicks, you can move streaming data into a lower-cost tier and pay only for what you use.

Tiered storage frees you from making hard trade-offs between supporting the data retention needs of your application teams and the operational complexity that comes with it. This enables you to use the same code to process both real-time and historical data, minimizing redundant workflows and simplifying architectures. With Amazon MSK tiered storage, you can implement a Kappa architecture (a streaming-first software architecture deployment pattern) and use the same data processing pipeline for correctness and completeness of data over a much longer time horizon for business analysis.

How Amazon MSK tiered storage works

Let’s look at how tiered storage works for Amazon MSK. Apache Kafka stores data in files called log segments. As each segment completes, based on the segment size configured at the cluster or topic level, it’s copied to the low-cost storage tier. Data is held in performance-optimized storage for a specified retention time, or up to a specified size, and then deleted. There is a separate time and size limit setting for the low-cost storage, which must be longer than the limit for the performance-optimized storage tier. If clients request data from segments stored in the low-cost tier, the broker reads the data from it and serves the data in the same way as if it were being served from the performance-optimized storage. The APIs and existing clients work with minimal changes. When your application starts reading data from the low-cost tier, you can expect an increase in read latency for the first few bytes. As you start reading the remaining data sequentially from the low-cost tier, you can expect latencies similar to those of the primary storage tier. With tiered storage, you pay for the amount of data you store and the amount of data you retrieve.

For a pricing example, let’s consider a workload where your ingestion rate is 15 MB/s, with a replication factor of 3, and you want to retain data in your Kafka cluster for 7 days. Such a workload requires 6x m5.large brokers with 32.4 TB of EBS storage, which costs $4,755. But if you use tiered storage for the same workload with local retention of 4 hours and overall data retention of 7 days, it requires 3x m5.large brokers with 0.8 TB of EBS storage and 9 TB of tiered storage, which costs $1,584. If you want to read all the historical data at once, it costs $13 ($0.0015 per GB retrieval cost). In this example, tiered storage saves around 66% of your overall cost.
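The storage figures in this example can be roughly checked with integer shell arithmetic. This is only a sketch under stated assumptions: decimal units (1 TB = 1,000,000 MB), no storage headroom, and data in the low-cost tier counted once rather than per replica. The raw EBS numbers it produces are lower than the 32.4 TB and 0.8 TB quoted, presumably because the quoted sizes include headroom, which the text does not state. The variable names are illustrative.

```shell
# Rough sizing check for the example workload.
# Assumptions: decimal units, no headroom, low-cost tier counted once.

ingest_mb_per_s=15
replication=3
seconds_per_day=86400

# Without tiered storage: 7 days of data on EBS, replicated 3 times.
full_retention_mb=$(( ingest_mb_per_s * seconds_per_day * 7 * replication ))

# With tiered storage: only 4 hours of data stays on EBS (still replicated).
local_retention_mb=$(( ingest_mb_per_s * 3600 * 4 * replication ))

# Low-cost tier: 7 days of data, counted once (assumption).
tiered_mb=$(( ingest_mb_per_s * seconds_per_day * 7 ))

# Savings from the two totals quoted in the text, as a whole percentage
# (integer division, so the result is rounded down).
cost_without=4755
cost_with=1584
savings_pct=$(( (cost_without - cost_with) * 100 / cost_without ))

echo "EBS without tiered storage: ${full_retention_mb} MB"
echo "EBS with tiered storage:    ${local_retention_mb} MB"
echo "Low-cost tier:              ${tiered_mb} MB"
echo "Savings:                    ${savings_pct}%"
```

The low-cost tier works out to about 9.07 TB, in line with the 9 TB quoted, and the savings figure matches the roughly 66% stated above.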

Get started using Amazon MSK tiered storage

To enable tiered storage on your existing cluster, upgrade your MSK cluster to Kafka version 2.8.2.tiered and then choose Tiered storage and EBS storage as your cluster storage mode on the Amazon MSK console.

After tiered storage is enabled at the cluster level, run the following command to enable tiered storage on an existing topic. In this example, you’re enabling tiered storage on a topic called msk-ts-topic with 7 days’ retention (local.retention.ms=604800000) for the local high-performance storage tier, setting about 180 days’ retention (retention.ms=15550000000) to retain the data in the low-cost storage tier, and updating the log segment size to 48 MB:

bin/kafka-configs.sh --bootstrap-server $bsrv --alter --entity-type topics --entity-name msk-ts-topic --add-config 'remote.storage.enable=true,local.retention.ms=604800000,retention.ms=15550000000,segment.bytes=50331648'
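The millisecond and byte values in that command are easy to get wrong by hand, so it can help to derive them. The following is a minimal sketch, not part of the MSK tooling: it converts days to milliseconds, checks that the overall retention is longer than the local retention (as described in the section on how tiered storage works), and assembles the value for --add-config. The variable names are illustrative.

```shell
# Derive the retention and segment-size values instead of typing them by hand.

ms_per_day=$(( 24 * 60 * 60 * 1000 ))

local_retention_ms=$(( 7 * ms_per_day ))     # 7 days in the primary tier
total_retention_ms=$(( 180 * ms_per_day ))   # exactly 180 days overall
segment_bytes=$(( 48 * 1024 * 1024 ))        # 48 MB log segments

# The overall retention must be longer than the local retention.
if [ "$local_retention_ms" -ge "$total_retention_ms" ]; then
  echo "local.retention.ms must be smaller than retention.ms" >&2
  exit 1
fi

add_config="remote.storage.enable=true,local.retention.ms=${local_retention_ms},retention.ms=${total_retention_ms},segment.bytes=${segment_bytes}"
echo "$add_config"
```

Note that exactly 180 days is 15552000000 ms, so the value 15550000000 used in the command above is slightly under 180 days. The resulting string can be passed as "$add_config" to the --add-config option of the kafka-configs.sh command.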

Availability and pricing

Amazon MSK tiered storage is available in all AWS Regions where Amazon MSK is available, except the AWS China and AWS GovCloud Regions. This low-cost storage tier scales to virtually unlimited storage and requires no upfront provisioning. You pay only for the volume of data retained and retrieved in the low-cost tier.

For more information about this feature and its pricing, see the Amazon MSK developer guide and the Amazon MSK pricing page. To find the right sizing for your cluster, see the best practices page.


With Amazon MSK tiered storage, you don’t need to provision storage for the low-cost tier or manage the infrastructure. Tiered storage enables you to scale to virtually unlimited storage. You can access data in the low-cost tier using the same clients you currently use to read data from the high-performance primary storage tier. Apache Kafka’s consumer API, streams API, and connectors consume data from both tiers without changes. You can modify the retention limits on the low-cost storage tier in the same way you modify the retention limits on the high-performance storage.

Enable tiered storage on your MSK clusters today to retain data longer at a lower cost.

About the Author

Masudur Rahaman Sayem is a Streaming Architect at AWS. He works with AWS customers globally to design and build data streaming architectures to solve real-world business problems. He is passionate about distributed systems. He also likes to read, especially classic comic books.

