Integrate YugabyteDB with Amazon Redshift Using AWS Managed Kafka Stream and CDC Connector

Balachandar Seetharaman

In this blog, we’ll walk through how to integrate YugabyteDB with the Amazon Redshift database using the Amazon Web Services(AWS) Managed Kafka Stream (MSK) and the YugabyteDB Change Data Capture (CDC) connector.

Introducing YugabyteDB, Amazon Redshift, and AWS MSK

YugabyteDB is an open-source distributed SQL database for transactional (OLTP) applications. YugabyteDB is designed to be a powerful cloud-native database that can run in any cloud – private, public, or hybrid.

Amazon Redshift is a fully managed cloud-based data warehouse service provided by Amazon Web Services (AWS). It is designed to handle petabyte-scale data warehousing and analytics workloads for businesses of all sizes.

Amazon Managed Streaming for Apache Kafka (AWS MSK) is a fully managed, highly available, and secure Apache Kafka service offered by Amazon Web Services (AWS). AWS MSK makes it easy for customers to build and run applications using Apache Kafka without having to manage and operate their own Kafka clusters.

Integration of YugabyteDB (OLTP) with AWS Redshift (Data Warehouse)

YugabyteDB can seamlessly connect to Amazon Redshift using the YugabyteDB CDC connector and AWS MSK. The connection is achieved through the YugabyteDB CDC connector, helping users consume data from a data warehouse and an OLTP (transactional database) without additional processing or batch loads.

Benefits of YugabyteDB Integration with Amazon Redshift:

YugabyteDB and Amazon Redshift provide the following benefits:

  1. Real-time data synchronization: With YugabyteDB’s Change Data Capture (CDC) feature, data changes automatically sync between YugabyteDB and Redshift in real-time, enabling real-time analytics.
  2. Better performance and scalability: By using YugabyteDB as your transactional database, you can offload read-heavy analytical workloads to Redshift, which is optimized for analytical processing. This improves the performance and scalability of both systems.
  3. Multi-cloud and hybrid cloud deployment: YugabyteDB supports multi-cloud and hybrid cloud deployments, which helps users fetch transactional data from different cloud or on-premise environments and data warehouse data from AWS Redshift.
  4. No additional software: The data streams from YugabyteDB to Amazon Redshift goes through AWS MSK which doesn’t require any additional ETL or data integration tools at AWS.

YugabyteDB to Amazon Redshift Architecture

The diagram below (Figure 1) shows the end-to-end integration architecture of YugabyteDB to Amazon Redshift.

Figure 1: Integration architecture of YugabyteDB to Amazon Redshift

The streamed change data from the YugabyteDB Debezium/Kafka connector is streamed directly to Amazon Redshift using AWS MSK.

Data Flow Sequence #Operations/TasksComponent Involved
1YugabyteDB CDC enabled and create the stream ID for specific YSQL database (e.g. your db name)YugabyteDB
2Download the YugabyteDB CDC Connector JARYugabyteDB Debezium Connector JAR
3Setup the AWS MSK Cluster (refer to the IAM policies and roles)AWS MSK, AWS IAM Roles & Policies
4Setup Amazon Redshift and keep the credentials ready for testingAmazon Redshift

Best Use Cases For YugabyteDB Integration with Amazon Redshift

Real-time Pricing Strategy using Customer LifeTime Value (LTV): Customer LTV helps retailers determine the optimal pricing strategy for their products and services. By knowing the expected revenue from each customer, businesses can set prices that maximize revenue while remaining competitive in the market.

YugabyteDB is used widely in the retail industry. It helps store real-time orders and transactions; however, it needs a constant lookup from the customer’s LTV score which (in turn) needs historical data from a data warehouse database, like Amazon Redshift.  Thanks to the CDC integration, YugabyteDB can continuously stream the customer’s transactional data changes to Amazon Redshift. That database stores the customer’s data for longer periods, helping to calculate the LTV score using past purchase histories, customer retention rates, and customer acquisition costs. Using the LTV score from Amazon Redshift, the retail order application determines the right pricing strategy and discounts for the customers in real-time.

Credit Risk Evaluation: Credit risk evaluation is a critical business process for lenders or banks to minimize the risk of default and ensure their firm’s long-term viability.. By collecting and analyzing various data points related to a borrower’s creditworthiness, banks or lenders can make informed decisions and provide credit to borrowers who are most likely to repay their obligations.

YugabyteDB is used as an OLTP database in many financial institutions and banks to collect and store loan or credit card application information. It is difficult to know the customer’s current risk rating since it varies based on their transaction history, debt-to-income ratio, blacklisting details, etc. Using CDC, the Yugabyte database continuously pushes transactional data to Amazon Redshift which does all the risk evaluation by collecting various data coming from different sources (e.g. credit bureaus, payment applications, etc.). The OLTP application can look up the credit risk rating from Redshift in real-time and prove or reject loan or credit applications.

How to Set Up Amazon Redshift Sink

  1. Install YugabyteDB

    You have multiple options to install or deploy YugabyteDB if you don’t have one already available. Note: If you’re running a Windows Machine then you can leverage Docker on Windows with YugabyteDB.

  2. Install and Setup AWS MSK

    Install and setup AWS MSK, using the article, Stream Data From YugabyteDB CDC to AWS MSK Using Debezium. It helps configure IAM Policies and roles required for AWS MSK in AWS. Post creation of IAM Role and setup the MSK cluster using the same article.

  3. Create Cluster Configuration and Worker configuration in MSK

    Cluster Configuration:

    In AWS MSK, under MSK Clusters menu (on the left side) will show cluster configuration. Click to create a new configuration and enter the values below on the configuration settings. Then save it.

    Figure 2: myclusterV2
    auto.create.topics.enable=true
    default.replication.factor=3
    min.insync.replicas=2
    num.io.threads=8
    num.network.threads=5
    num.partitions=3
    num.replica.fetchers=2
    replica.lag.time.max.ms=30000
    socket.receive.buffer.bytes=102400
    socket.request.max.bytes=104857600
    socket.send.buffer.bytes=102400
    unclean.leader.election.enable=true
    zookeeper.session.timeout.ms=18000

    Worker Configuration:

    In AWS MSK, under “MSK Connect”, create a worker configuration (like below) and enter the properties if you are going to receive the data in JSON format.

    key.converter=org.apache.kafka.connect.json.JsonConverter
    key.converter.schemas.enable=true
    value.converter=org.apache.kafka.connect.json.JsonConverter
    value.converter.schemas.enable=true

    NOTE: It is highly recommended to use schema registry with AVRO. In such a case, the above worker configuration is not required.

    Figure 3: Worker configuration in Amazon MSK
  4. Create MSK Connector for YugabyteDB

    Create the Configuration of AWS MSK connector for YugabyteDB as referred in point #6 and choose the worker configuration that you created in Step #3.

    Use the values below in the connector configuration and change the yellow highlights with new data according to your specific YugabyteDB configuration.

    connector.class=io.debezium.connector.yugabytedb.YugabyteDBConnector
    database.streamid=684d878b37e94b279454b0b8d6a2a305
    database.user=yugabyte
    database.dbname=yugabyte
    tasks.max=1
    time.precision.mode=connect
    transforms=unwrap
    database.server.name=dbserver12
    database.port=5433
    include.schema.changes=true
    database.master.addresses=10.9.205.161:7100
    key.converter.schemas.enable=true
    database.hostname=10.9.205.161
    database.password=xxxxx
    transforms.unwrap.drop.tombstones=false
    value.converter.schemas.enable=true
    transforms.unwrap.type=io.debezium.connector.yugabytedb.transforms.YBExtractNewRecordState
    table.include.list=public.target_cdctest
    snapshot.mode=initial
  5. Install Amazon Redshift

    Install Amazon Redshift from your AWS Console and create a cluster and configure.

    Figure 4: Provisioned clusters dashboard in Amazon Redshift
  6. Create Custom Plugin for Amazon Redshift

    Download the Amazon Redshift Sink connector and upload this zip file to your AWS MSK Custom Plugin. You can also rename it.

    Figure 5: Create a custom plugin for Amazon Redshift
  7. Create MSK Sink Connect for Amazon Redshift

    Create the Configuration of AWS MSK connector (see point #6). ou will need to keep it with the Amazon Redshift configuration instead of the YugabyteDB configuration like below. (NOTE:Replace the values in yellow per your specific Amazon Redshift details and AWS MSK Kafka Topic name)

    connector.class=io.confluent.connect.aws.redshift.RedshiftSinkConnector
    aws.redshift.port=5439
    table.name.format=public.target_cdctest
    confluent.topic.bootstrap.servers=b-4.bseetharamanmsk.acsf.c10.kafka.us-west-2.amazonaws.com:9092
    tasks.max=1
    topics=dbserver12.public.target_cdctest
    aws.redshift.password=xxxxx
    aws.redshift.domain=bseetharaman-redshift-demo-cluster.xxgs.us-west-2.redshift.amazonaws.com
    aws.redshift.database=dev
    delete.enabled=true
    auto.evolve=true
    confluent.topic.replication.factor=1
    aws.redshift.user=awsuser
    auto.create=false
    insert.mode=insert
    pk.mode=record_key
    pk.fields=sno
  8. Validate the Data in Amazon Redshift

    Launch the Amazon Redshift query editor and query the tables that are synced from YugabyteDB.

    Figure 6: Amazon Redshift query editor

    Figure 7: View query details

Conclusion and Summary

In this blog post we walked through step-by-step how to integrate YugabyteDB with Amazon Redshift using YugabyteDB’s CDC connector.

These steps help to integrate the YugabyteDB transactional database with your data warehouse (Amazon Redshift DB) so that it meets the requirements for your application. The metrics collected by YugabyteDB CDC provide an intuitive, visual way to monitor the health, performance, and progress of these CDC operations. For detailed information on CDC metrics, please refer to Monitoring Change Data Capture (CDC) Using YugabyteDB’s Metrics Dashboard.

Integrating YugabyteDB and Amazon Redshift through AWS MSK delivers real time analytics and allows you to store large amounts of data that you’ve collected and store it for business intelligence and analytics in your data warehouse. Both pull significant business value from your data.

Additional Resources on YugabyteDB CDC

Balachandar Seetharaman

Related Posts

Explore Distributed SQL and YugabyteDB in Depth

Discover the future of data management.
Learn at Yugabyte University
Get Started
Browse Yugabyte Docs
Explore docs
PostgreSQL For Cloud Native World
Read for Free