Snowflake CDC: Publishing Data Using Amazon S3 and YugabyteDB
We’ll use YugabyteDB’s CDC SDK server, an open source project that provides a streaming platform for CDC. We’ll also use Snowpipe to automate data loading to files as soon as they are available in a S3 bucket.
Before you begin
Before getting started, make sure you have access to the following:
- Data modify access to YugabyteDB
- Write access to an AWS S3 bucket with the below permissions:
- (“s3:ListBucket”, “s3:GetBucketLocation”, “s3:PutObject”,”s3:GetObject”, “s3:GetObjectVersion”, “s3:DeleteObject”, “s3:DeleteObjectVersion”)
- Read and write access to a Snowflake cluster
Snowflake CDC overall architecture
Here’s a quick snapshot of the architecture we’ll be building:
Snowflake CDC setup
For starters, configure the CDCSDK server to work with the S3 sink. Below is a sample configuration:
Please make sure you have the following Amazon IAM permissions:
Finally, get the CDC data to Snowflake by automating it through Snowpipe with these steps.
Example of a working setup
We’ll push the CDC events for the below table “orders”. That means we’ll use the command having three columns: “order_id”, “item”, and “quantity” to create the table.
The created table’s description will then be displayed as shown below:
Below is the CDCSK server we used. Please note we will stream the records to the S3 location (s3://abharadwaj-snowpipe-setup/json_source/). We have also configured the CDCSK server to create and push one file to S3 per 10,000 records (cdcsdk.sink.s3.flush.records).
You’ll see we have created a CDC stream with the StreamId: “f70d731e8dbf4869b2849bdfd86864d2”. Please refer to our documentation to create a CDC stream.
From here, we’ll push 10,000 rows into the YB table “orders” using the below command:
This will also push a file to the configured S3 location, as shown below:
In this example, we have configured a table called: “ORDERS_RECORDS” in Snowflake. Upon querying the table, we see there are indeed 10,000 records:
Below we show what a particular record looks like. In this instance, we show the record associated with order_id = 878 using the following Snowflake query:
Additionally, please note all the column data is stored as a single JSON record in the Snowflake table.
In this short post, we set up a pipeline to stream CDC data from YugabyteDB to Snowflake. We saw how all the column data was stored as a single record (i.e., JSON format) in the Snowflake collection.