Implementing Change Data Capture (CDC) in YugabyteDB
Databases are the systems of record in any enterprise. However, unlocking the full business potential of data requires making it available across the entire enterprise ecosystem: the different applications, services, and systems that power the business processes and use cases of the organization. Change data capture (CDC) is an ideal mechanism for integrating downstream applications and services with a database efficiently, reliably, and scalably.
In this post, we explore YugabyteDB’s pull-based approach to CDC, introduced in YugabyteDB 2.13, which scans changes from the database’s write-ahead log (WAL).
What is CDC?
CDC is a process that captures changes made to data in a database and streams those changes in near real time to external processes, applications, or other databases. There are several options for implementing this process, each with its own advantages and disadvantages across ecosystems and use cases. CDC can be push based or pull based, and it can be implemented using row versioning, triggers, database log scanning, PubSub queues, or change scanning.
The CDC implementation introduced in YugabyteDB 2.13 depends on the Debezium connector. A functional end-to-end implementation of YugabyteDB CDC generally has four components: YugabyteDB Streams, the Debezium Connector, Apache Kafka, and a consumer application, as illustrated below.
The core component of YugabyteDB CDC is a ‘Stream’. Streams are YugabyteDB endpoints from which applications, processes, and systems fetch database changes. YugabyteDB streams are created at the database level and can be used to access change data from all the tables under a particular database. Any compatible downstream service can consume these changes; in this case, the consumer is the YugabyteDB Debezium Connector.
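To make the pull-based model concrete, here is a minimal sketch of a client that repeatedly asks a change log for everything after its last checkpoint. The `ToyChangeLog` class, its `get_changes` method, and the record layout are hypothetical stand-ins for illustration only; they are not the YugabyteDB stream API.

```python
from dataclasses import dataclass, field


@dataclass
class ToyChangeLog:
    """In-memory stand-in for a write-ahead log (hypothetical, not a YugabyteDB API)."""
    records: list = field(default_factory=list)

    def append(self, op, key, value=None):
        # Each committed change is appended in commit order.
        self.records.append({"op": op, "key": key, "value": value})

    def get_changes(self, checkpoint, max_records=100):
        """Return the records after `checkpoint` and the new checkpoint."""
        batch = self.records[checkpoint:checkpoint + max_records]
        return batch, checkpoint + len(batch)


def poll_once(log, checkpoint, handler):
    """One pull iteration: fetch changes since the checkpoint, hand them off, advance."""
    batch, new_checkpoint = log.get_changes(checkpoint)
    for record in batch:
        handler(record)
    return new_checkpoint


log = ToyChangeLog()
log.append("insert", 1, {"name": "alice"})
log.append("update", 1, {"name": "alicia"})

seen = []
checkpoint = poll_once(log, 0, seen.append)           # first poll reads both records
log.append("delete", 1)
checkpoint = poll_once(log, checkpoint, seen.append)  # second poll reads only the delete
```

The client, not the database, decides when to fetch and remembers how far it has read, which is the defining property of a pull-based design.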
YugabyteDB Debezium Connector
Debezium is an open source distributed platform for CDC. The YugabyteDB Debezium connector acts as a YugabyteDB stream client. It captures row-level changes in YugabyteDB database schemas and pushes these changes onto a queue for consumption. The connector captures every committed row-level change (i.e., insert, update, and delete), produces change records for each table, and streams them to a Kafka topic specific to that table.
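The per-table routing described above can be sketched as follows. This assumes the common Debezium naming convention of `<server name>.<schema>.<table>` for topics; the server name `dbserver1`, the table names, and the record layout are hypothetical, and the connector's actual topic naming should be confirmed in its documentation.

```python
def topic_for(server_name, schema, table):
    """Build a topic name using the assumed <server>.<schema>.<table> convention."""
    return f"{server_name}.{schema}.{table}"


def route(server_name, changes):
    """Group change records by the topic of the table they belong to."""
    topics = {}
    for change in changes:
        topic = topic_for(server_name, change["schema"], change["table"])
        topics.setdefault(topic, []).append(change)
    return topics


# Hypothetical change records from two tables in the same database.
changes = [
    {"schema": "public", "table": "orders", "op": "c"},
    {"schema": "public", "table": "users", "op": "u"},
    {"schema": "public", "table": "orders", "op": "d"},
]
topics = route("dbserver1", changes)
for name in sorted(topics):
    print(name, len(topics[name]))
```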
Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, and data integration. YugabyteDB CDC uses Kafka as a queue for change updates generated by the Debezium connector. This enables any downstream Kafka-compatible consumer applications to easily integrate with this process and consume changes in the database.
Kafka’s distributed architecture is highly scalable and resilient, making it the preferred queuing mechanism for YugabyteDB CDC. Users can connect their microservices applications, data analytics, OLAP applications, storage solutions, or any other applications with Kafka connectors and start streaming changes from YugabyteDB.
A consumer application for YugabyteDB CDC can be another database, a data warehouse, an analytics application, enterprise search, or cloud storage. It can also be any standard or custom application that can consume events from a Kafka queue using the right sink connectors. These consumer applications receive near real-time change updates asynchronously from YugabyteDB tables without requiring complex code or sapping performance from the source database.
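As a rough illustration of what a custom consumer does with each event, the sketch below applies Debezium-style change events to an in-memory replica. It assumes a simplified envelope with an `op` code (`c` create, `u` update, `d` delete, `r` snapshot read) and plain `before`/`after` row images keyed by a hypothetical `id` column; the events emitted by the YugabyteDB connector may be structured differently, and reading from Kafka is left out.

```python
import json


def apply_change(replica, raw_event):
    """Apply one JSON-encoded change event to a dict keyed by primary key."""
    event = json.loads(raw_event)
    op = event["op"]
    if op in ("c", "r", "u"):
        # Creates, snapshot reads, and updates all carry the new row image.
        row = event["after"]
        replica[row["id"]] = row
    elif op == "d":
        # Deletes carry the old row image; a missing key is not an error.
        replica.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unsupported op: {op!r}")


# Hypothetical events, in the order a consumer would read them from a topic.
events = [
    '{"op": "c", "before": null, "after": {"id": 1, "status": "new"}}',
    '{"op": "c", "before": null, "after": {"id": 2, "status": "new"}}',
    '{"op": "u", "before": {"id": 1, "status": "new"}, "after": {"id": 1, "status": "paid"}}',
    '{"op": "d", "before": {"id": 2, "status": "new"}, "after": null}',
]

replica = {}
for raw in events:
    apply_change(replica, raw)
```

After the loop, `replica` holds only row 1 with its updated status, mirroring the source table without ever querying it.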
YugabyteDB CDC becomes even more useful when multiple applications need to consume database changes and Kafka is already present in the ecosystem, which is generally true for large enterprise ecosystems. We have built a sample Java console client that consumes YugabyteDB changes from watched tables using the Kafka JDBC connector and prints those to the console.
YugabyteDB CDC Capabilities
YugabyteDB CDC streams row-level changes from YugabyteDB tables to downstream applications and systems efficiently. Its Debezium connector allows it to integrate with queues that have Debezium support, and the connector is compatible with the Postgres Debezium connector in terms of data type mapping.
Owing to the distributed architecture of YugabyteDB and Apache Kafka, YugabyteDB CDC is highly scalable and resilient. Change updates follow at-least-once delivery semantics: each row change is emitted at least once, but in some outlier conditions it may be repeated. YugabyteDB CDC also supports automatic schema discovery, so streams need not be reconfigured or restarted when schema changes are made to the tables. The consumer application, however, must be able to handle those changes.
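Because a change may be delivered more than once, consumers are usually written to be idempotent. The sketch below shows one common approach: remember the highest sequence number applied per row and skip anything at or below it. The `seq` field and the event layout are assumptions made for illustration; a real consumer would need an equivalent ordering marker from the event metadata.

```python
def apply_once(state, last_seq, event):
    """Apply an event only if it is newer than what was already applied for its key.

    Returns True if the event changed the state, False if it was a duplicate.
    """
    key = event["key"]
    if event["seq"] <= last_seq.get(key, -1):
        return False  # already applied: a redelivered or stale event
    if event["op"] == "delete":
        state.pop(key, None)
    else:
        state[key] = event["value"]
    last_seq[key] = event["seq"]
    return True


# Hypothetical stream in which the update with seq=2 is delivered twice.
stream = [
    {"key": "a", "seq": 1, "op": "insert", "value": 10},
    {"key": "a", "seq": 2, "op": "update", "value": 11},
    {"key": "a", "seq": 2, "op": "update", "value": 11},
    {"key": "b", "seq": 1, "op": "insert", "value": 20},
]

state, last_seq = {}, {}
results = [apply_once(state, last_seq, e) for e in stream]
```

With this guard in place, a repeated event is a no-op, so side effects such as counters or downstream writes are not triggered twice.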
Prerequisites and limitations
Since this is v1 of YugabyteDB CDC, there are some prerequisites and limitations we will address in future releases. YugabyteDB CDC requires Debezium and Kafka to work and integrate with other applications. Also, tables that need to be watched using YugabyteDB CDC must have a primary key.
Currently, YugabyteDB CDC does not support YCQL tables, system tables, or tables with a UDT data type. DROP and TRUNCATE table commands are also not supported today, although we plan to add support for them. Capturing changes from newly added tables requires creating new streams, since existing streams cannot pick up tables added after the stream was created.
We’re working on push-based CDC to remove Debezium and Kafka dependencies while reducing user overhead and implementation complexity. We plan to build connectors to stream changes directly from YugabyteDB to Kafka Connect, cloud storage solutions such as AWS S3, Azure, and Snowflake, as well as Webhook endpoints.
In upcoming YugabyteDB releases, we also plan to build gRPC-based CDC client APIs that can be implemented in different programming languages for microservices consumption. These APIs would allow developers to quickly build and deploy applications that need to consume row-level changes in YugabyteDB schemas.
To learn more about YugabyteDB CDC, check out the documentation. We have also built a Java CDC console client for you to try locally on your computer. You can explore more details about our design and architecture here.
If you have any questions, feedback, or just want to give us a shout out, please join the YugabyteDB community Slack channel.