Sizing CDC Agents
This document explains the factors that affect pipeline performance. There are three key determinants:
- CPU: The number of CPU cores.
- Memory: The available memory to process pipeline data.
- Network: The quality, speed, and capacity of your network connection.
Review the information below before choosing a configuration. If the requirements of the pipeline change, you can resize the agent by redeploying it with a different resource configuration. For example, if snapshot processing is CPU-limited, the agent can initially be deployed with a larger vCPU allocation and then downsized once the snapshot has completed, leaving it to handle the ongoing changes.
Note that the pipeline only reaches a point where it can be resumed without restarting a full snapshot some time after its status has changed to streaming.
Insufficient CPU availability will limit the maximum throughput of the pipeline. While the pipeline can continue to stream changes, computational delays in writing changes from the data source to the storage platform mean the captured data will lag progressively further behind the source and become less accurate. In these circumstances the true rate of changes also becomes unclear: the reported rate appears constant while the agent shows sustained maximum CPU usage.
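As a rough illustration of the symptom described above, the following Python sketch flags a likely CPU-bound agent from a window of monitoring samples. The function name, the 95% CPU threshold, and the 5% rate-variation tolerance are assumptions chosen for illustration, not values from the CDC agent; the samples would come from whatever monitoring your platform provides.

```python
from statistics import mean, pstdev


def looks_cpu_saturated(cpu_percent, change_rates,
                        cpu_threshold=95.0, max_rate_variation=0.05):
    """Heuristic: CPU pinned near its maximum while the observed rate stays flat.

    cpu_percent  -- CPU utilisation samples for the agent (0-100)
    change_rates -- observed changes per second over the same window
    Both thresholds are illustrative assumptions, not documented values.
    """
    if not cpu_percent or not change_rates:
        return False

    # Every sample in the window must be at or above the CPU threshold.
    cpu_pinned = min(cpu_percent) >= cpu_threshold

    avg_rate = mean(change_rates)
    if avg_rate == 0:
        return False

    # A flat rate has a small standard deviation relative to its mean.
    rate_flat = pstdev(change_rates) / avg_rate <= max_rate_variation

    return cpu_pinned and rate_flat
```

A result of `True` does not prove the agent is undersized, but it suggests the reported rate may be the agent's ceiling rather than the real change rate at the source.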
The longer this delay persists, the more inaccurate the pipeline will become. Both the pipeline and data source will experience errors and may even fail. For example, Postgres retains transaction logs back to the point at which the replication slot is positioned. As the pipeline falls further behind, the retained transaction logs will grow, compounding the issue.
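To see why a persistent delay compounds, consider a simple linear model, sketched below in Python. It assumes a constant incoming change rate and a constant maximum agent throughput; real workloads fluctuate, and the function names and figures are hypothetical.

```python
def backlog_after(incoming_rate, max_throughput, seconds, initial_backlog=0):
    """Unprocessed changes after `seconds`, assuming constant rates.

    incoming_rate  -- changes per second produced by the data source
    max_throughput -- changes per second the agent can process
    """
    net_growth = incoming_rate - max_throughput
    # The backlog can shrink to zero but never go negative.
    return max(0, initial_backlog + net_growth * seconds)


def seconds_to_catch_up(backlog, incoming_rate, max_throughput):
    """Time needed to clear `backlog`, or None if the agent can never catch up."""
    spare_capacity = max_throughput - incoming_rate
    if spare_capacity <= 0:
        return None
    return backlog / spare_capacity
```

For example, an agent capped at 10,000 changes per second against a source producing 12,000 per second accumulates `backlog_after(12_000, 10_000, 3600)`, that is 7,200,000 unprocessed changes, in one hour. Until the agent is resized, `seconds_to_catch_up` returns `None` for those rates.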
Insufficient memory will typically cause the pipeline to fail. In these circumstances you would need to increase the available memory or reduce the volume of data, or number of tables, being processed by the pipeline.
The required memory scales directly with the number of tables being processed. The CDC agent runs a parallel task for each table, and each task has its own independent memory footprint because it keeps its own data buffer for writing changes out to cloud storage.
To illustrate the earlier points, the table below shows the results of testing a pipeline that captures evenly distributed changes across 83 tables, run on different services and configurations. It lists the snapshot and streaming rates for each service provider at the given CPU and memory allocation.
Absolute values will vary depending on the structure and content of the data being captured, and values across different services cannot be directly compared as the resource abstractions are not necessarily comparable.
| Service | CPU Allocation | Memory Allocation | Snapshot Rate | Streaming Rate |
|---|---|---|---|---|
| AWS Fargate | 1 vCPU | 2GB | 22k | 11k |
| Google Compute Engine | 2 vCPU | 4GB | 25k | 15k |
| Azure Container Instances | 1 vCPU | 2GB | 19k | 9k |
The tables below show a general recommendation for CPU and memory allocation according to your anticipated or actual change rate and the number of tables. Please be aware that each pipeline has a unique performance profile and it's recommended that you monitor the agent resource utilization and ensure that the pipeline is keeping up with the incoming changes.
| Change Rate | CPU Allocation |
|---|---|
| Up to 5k/s | 1 vCPU |
| Up to 10k/s | 2 vCPU |
| Up to 20k/s | 4 vCPU |

| Number of Tables | Memory Allocation |
|---|---|
| Up to 100 | 2GB |
| Up to 200 | 4GB |
| Up to 400 | 8GB |
Cloud container services don't allow arbitrary vCPU and memory allocations. When choosing resources, take the lowest available configuration that satisfies both the CPU and the memory requirement.
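The selection rule can be sketched in Python as follows. The two tier lists are taken from the recommendation tables above. The `AVAILABLE_CONFIGS` list is a hypothetical set of vCPU and memory pairs, so replace it with the combinations your container service actually offers. Treating "lowest" as fewest vCPUs first and least memory second is also an assumption.

```python
# Recommendation tiers from this document: (upper bound, allocation).
CPU_BY_CHANGE_RATE = [(5_000, 1), (10_000, 2), (20_000, 4)]  # changes/s -> vCPU
MEMORY_BY_TABLES = [(100, 2), (200, 4), (400, 8)]            # tables -> GB

# Hypothetical (vCPU, memory GB) pairs; not taken from any provider's documentation.
AVAILABLE_CONFIGS = [(1, 2), (1, 4), (2, 4), (2, 8), (4, 8), (4, 16)]


def lookup(tiers, value):
    """Return the allocation of the first tier whose upper bound covers `value`."""
    for upper_bound, allocation in tiers:
        if value <= upper_bound:
            return allocation
    return None  # Beyond the range covered by the recommendation tables.


def recommend(change_rate, table_count, configs=AVAILABLE_CONFIGS):
    """Pick the lowest available configuration meeting both requirements."""
    cpu_needed = lookup(CPU_BY_CHANGE_RATE, change_rate)
    memory_needed = lookup(MEMORY_BY_TABLES, table_count)
    if cpu_needed is None or memory_needed is None:
        return None

    candidates = [
        (vcpu, memory_gb)
        for vcpu, memory_gb in configs
        if vcpu >= cpu_needed and memory_gb >= memory_needed
    ]
    # Tuples compare by vCPU first, then memory.
    return min(candidates, default=None)
```

With this list, `recommend(3_000, 350)` returns `(2, 8)`: the change rate alone needs only 1 vCPU, but 350 tables need 8GB, and the smallest listed configuration offering 8GB has 2 vCPUs. The result is only a starting point, since each pipeline has its own performance profile and should still be monitored.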