Data Transfer Workflow: Source to Target in DBConvert Streams
Overview
DBConvert Streams is a high-performance data transfer system designed to efficiently move data between different database systems. This document outlines the conceptual workflow of how data flows from a source database to a target database, explaining the key components, processes, and mechanisms involved.
Architecture Components
The data transfer process involves several key components:
- Source Reader: Reads data from the source database
- Event Hub: Manages the flow of events between components
- Target Writer: Processes events and writes data to the target database
- Sequencer: Ensures events are processed in the correct order
- Statistics Tracker: Monitors and reports on the transfer process
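The concrete types behind these components are internal to DBConvert Streams; the Go sketch below only illustrates how the roles listed above could fit together. All type, field, and method names are hypothetical.

```go
package streams

import "context"

// Event is a hypothetical representation of one change batch captured from
// the source database (see "Data Reading Phase" below for the fields the
// real events carry).
type Event struct {
	TableID  string
	Action   string           // INSERT, UPDATE, DELETE
	Position uint64           // log position
	Before   []map[string]any // row states before the change
	After    []map[string]any // row states after the change
}

// SourceReader reads data from the source database and emits events.
type SourceReader interface {
	Read(ctx context.Context, out chan<- Event) error
}

// TargetWriter consumes events and applies them to the target database.
type TargetWriter interface {
	Write(ctx context.Context, in <-chan Event) error
}

// Sequencer decides whether an ordered event may be processed yet.
type Sequencer interface {
	Ready(e Event) bool
	MarkDone(e Event)
}
```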
Detailed Workflow
1. Initialization Phase
Connection Setup:
- Source reader establishes a connection to the source database
- Target writer establishes a connection to the target database
- Both components connect to the Event Hub (NATS JetStream)
Stream Creation:
- Event Hub creates streams for data and control messages
- Consumers are set up to process these streams (see the sketch after this list)
Table Structure Creation (if needed):
- Target writer may create the necessary table structures in the target database
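As a rough illustration of the steps above, the sketch below connects to NATS, creates JetStream streams for data and control messages, and sets up a durable consumer. The stream, subject, and consumer names are placeholders, not the ones DBConvert Streams actually uses.

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Connect to the NATS server acting as the Event Hub.
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// One stream for data events, one for control messages
	// (names and subjects are placeholders).
	for _, cfg := range []*nats.StreamConfig{
		{Name: "DATA", Subjects: []string{"data.>"}},
		{Name: "CONTROL", Subjects: []string{"control.>"}},
	} {
		if _, err := js.AddStream(cfg); err != nil {
			log.Fatalf("create stream %s: %v", cfg.Name, err)
		}
	}

	// A durable consumer lets the target side resume where it left off.
	if _, err := js.AddConsumer("DATA", &nats.ConsumerConfig{
		Durable:   "target-writer",
		AckPolicy: nats.AckExplicitPolicy,
	}); err != nil {
		log.Fatal(err)
	}
}
```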
2. Data Reading Phase
Source Reading:
- Source reader reads data from tables in batches
- Data is converted into events containing:
  - Table ID
  - Log position
  - Action type (INSERT, UPDATE, DELETE)
  - Row data (before/after states)
Event Publishing:
- Events are published to the Event Hub
- Events are marked as ordered or unordered based on ordering requirements
- Source reader tracks and reports statistics on read events
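A minimal sketch of the publishing step, reusing the hypothetical Event type from the Architecture Components sketch. The per-table subject scheme is an assumption, not the documented one.

```go
package streams

import (
	"encoding/json"

	"github.com/nats-io/nats.go"
)

// publishBatch serializes one event batch and sends it to the Event Hub.
func publishBatch(js nats.JetStreamContext, e Event) error {
	payload, err := json.Marshal(e)
	if err != nil {
		return err
	}
	// Publishing on a per-table subject keeps ordered events for one table
	// in sequence while different tables can proceed independently.
	_, err = js.Publish("data."+e.TableID, payload)
	return err
}
```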
3. Data Transfer Phase
Event Consumption:
- Target writer subscribes to data streams from the Event Hub
- Events are received in batches
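One plausible shape for batch consumption is a JetStream pull subscription that fetches messages in groups; the subject, durable name, and batch size below are assumptions, and the Event type is the hypothetical one sketched earlier.

```go
package streams

import (
	"encoding/json"
	"time"

	"github.com/nats-io/nats.go"
)

// consumeBatches pulls events from the Event Hub in batches and forwards
// them to the processing pipeline.
func consumeBatches(js nats.JetStreamContext, out chan<- Event) error {
	sub, err := js.PullSubscribe("data.>", "target-writer")
	if err != nil {
		return err
	}
	for {
		msgs, err := sub.Fetch(100, nats.MaxWait(2*time.Second))
		if err == nats.ErrTimeout {
			continue // nothing to read right now; keep polling
		}
		if err != nil {
			return err
		}
		for _, m := range msgs {
			var e Event
			if err := json.Unmarshal(m.Data, &e); err != nil {
				return err
			}
			out <- e
			m.Ack() // acknowledge so JetStream does not redeliver the message
		}
	}
}
```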
Deduplication:
- Each event batch is assigned a unique key based on:
  - First and last row IDs
  - Number of rows
  - Action type
  - Table ID
  - Data hash
- Duplicate batches are detected and skipped
Event Sequencing:
- Ordered events are processed through the sequencer
- Sequencer ensures events are processed in the correct order
- Unordered events bypass the sequencer for faster processing
4. Data Writing Phase
Event Processing:
- Events are dispatched to worker goroutines
- Workers process events concurrently
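A minimal sketch of the fan-out to workers; the pool size and error handling in the real writer differ, and the handle callback stands in for the database operations described next.

```go
package streams

import "sync"

// dispatch fans events out to a fixed pool of worker goroutines and waits
// for them to drain the channel.
func dispatch(events <-chan Event, workers int, handle func(Event) error) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for e := range events {
				if err := handle(e); err != nil {
					// A real worker would retry or report the failure
					// (see "Error Handling" below).
					continue
				}
			}
		}()
	}
	wg.Wait()
}
```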
Database Operations:
- For INSERT events: Bulk insert or upsert operations
- For UPDATE events: Update existing records
- For DELETE events: Remove records based on primary keys
Transaction Management:
- Operations are executed within transactions
- Transactions are committed or rolled back based on success
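The sketch below shows this transactional pattern with database/sql: each batch is applied inside a transaction that is committed on success and rolled back otherwise. The table, columns, and $n placeholders are illustrative only; the real writer builds bulk statements per database dialect.

```go
package streams

import (
	"context"
	"database/sql"
	"fmt"
)

// applyBatch applies one event batch to the target database inside a
// single transaction.
func applyBatch(ctx context.Context, db *sql.DB, e Event) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // harmless if the transaction was already committed

	switch e.Action {
	case "INSERT":
		for _, row := range e.After {
			// Stand-in for a bulk INSERT ... ON CONFLICT (upsert) statement.
			if _, err := tx.ExecContext(ctx,
				"INSERT INTO orders (id, total) VALUES ($1, $2)",
				row["id"], row["total"]); err != nil {
				return err
			}
		}
	case "UPDATE":
		// Assumes Before and After hold matching row states.
		for i, row := range e.After {
			if _, err := tx.ExecContext(ctx,
				"UPDATE orders SET total = $1 WHERE id = $2",
				row["total"], e.Before[i]["id"]); err != nil {
				return err
			}
		}
	case "DELETE":
		for _, row := range e.Before {
			if _, err := tx.ExecContext(ctx,
				"DELETE FROM orders WHERE id = $1", row["id"]); err != nil {
				return err
			}
		}
	default:
		return fmt.Errorf("unsupported action %q", e.Action)
	}
	return tx.Commit()
}
```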
5. Monitoring and Reporting
Statistics Tracking:
- Per-table statistics (events, data size)
- Overall transfer statistics (total events, processing rate)
Progress Reporting:
- Periodic reporting of transfer progress
- Final report of completed transfer
Event Count Monitoring:
- Continuous monitoring of source vs. processed event counts
- Detection of potential issues (large differences, stuck processing)
6. Completion Phase
Finalization:
- Verification that all events have been processed
- Grace period for processing remaining events
- Status set to FINISHED when complete
Cleanup:
- Connections closed
- Resources released
- Final statistics reported
Data Flow Diagram

```
┌─────────────┐      ┌───────────────┐      ┌─────────────┐
│             │      │               │      │             │
│   Source    │─────▶│   Event Hub   │─────▶│   Target    │
│  Database   │      │   (NATS/JS)   │      │  Database   │
│             │      │               │      │             │
└─────────────┘      └───────────────┘      └─────────────┘
       │                     ▲                     │
       │                     │                     │
       ▼                     │                     ▼
┌─────────────┐      ┌───────────────┐      ┌─────────────┐
│             │      │               │      │             │
│   Source    │─────▶│ Event Streams │─────▶│   Target    │
│   Reader    │      │  & Consumers  │      │   Writer    │
│             │      │               │      │             │
└─────────────┘      └───────────────┘      └─────────────┘
                                                   │
                                                   │
                                                   ▼
                                            ┌─────────────┐
                                            │             │
                                            │   Worker    │
                                            │    Pool     │
                                            │             │
                                            └─────────────┘
```
Key Mechanisms
Event Batching
Events are processed in batches to optimize performance. Each batch contains:
- Table identifier
- Log position
- Action type
- Row data (before/after states)
- Size information
Deduplication Strategy
To prevent duplicate processing:
- Each batch is assigned a unique key
- The key includes first ID, last ID, row count, action type, table ID, and data hash
- Processed batch keys are stored in a sync.Map
- Periodic cleanup prevents memory leaks
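A small sketch of this strategy; the exact key layout and cleanup policy are not specified here, so the composition below is only an illustration.

```go
package streams

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"sync"
)

// seen holds keys of batches that were already processed; a periodic sweep
// of this map would prevent unbounded growth, as noted above.
var seen sync.Map

// batchKey combines first/last row IDs, row count, action type, table ID
// and a hash of the row data, mirroring the components listed above.
func batchKey(e Event, firstID, lastID string) string {
	raw, _ := json.Marshal(e.After)
	sum := sha256.Sum256(raw)
	return fmt.Sprintf("%s|%s|%d|%s|%s|%s",
		firstID, lastID, len(e.After), e.Action, e.TableID,
		hex.EncodeToString(sum[:8]))
}

// alreadyProcessed reports whether a key was seen before and records it if
// not; LoadOrStore is atomic, so concurrent workers cannot both claim the
// same batch.
func alreadyProcessed(key string) bool {
	_, loaded := seen.LoadOrStore(key, struct{}{})
	return loaded
}
```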
Sequencing Logic
For ordered events:
- Events are checked against the sequencer
- If not ready for processing, they're requeued
- When ready, they're processed and marked as complete
- Unordered events bypass this check
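A hedged sketch of per-table sequencing. It assumes that an event is ready when its log position is the next one expected for its table and that positions advance by one, which is a simplification of the real logic.

```go
package streams

import "sync"

// positionSequencer tracks, per table, the next log position that may be
// processed.
type positionSequencer struct {
	mu   sync.Mutex
	next map[string]uint64 // table ID -> next expected log position
}

func newPositionSequencer() *positionSequencer {
	return &positionSequencer{next: make(map[string]uint64)}
}

// Ready reports whether the event may be processed now; if not, the caller
// requeues it, as described above.
func (s *positionSequencer) Ready(e Event) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	expected, ok := s.next[e.TableID]
	return !ok || e.Position == expected
}

// MarkDone advances the expected position once an event has been processed.
func (s *positionSequencer) MarkDone(e Event) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.next[e.TableID] = e.Position + 1
}
```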
Error Handling
- Transactional Safety: Database operations use transactions
- Retry Mechanism: Failed operations can be retried
- Pause/Resume: Processing can be paused and resumed
- Monitoring: Continuous monitoring detects issues
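The retry mechanism could look roughly like the helper below, which retries a failing operation with exponential backoff; the actual retry policy used by DBConvert Streams is not specified here.

```go
package streams

import "time"

// withRetry runs op up to attempts times, doubling the delay between tries,
// and returns the last error if every attempt fails.
func withRetry(attempts int, delay time.Duration, op func() error) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		if lastErr = op(); lastErr == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2 // exponential backoff between attempts
		}
	}
	return lastErr
}
```

A write could then be wrapped as withRetry(3, time.Second, func() error { return applyBatch(ctx, db, e) }), keeping the retry policy separate from the database logic.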
Performance Considerations
- Worker Pool: Up to 40 concurrent workers process events
- Channel Buffering: Channels are buffered to handle bursts
- Batch Processing: Data is processed in batches for efficiency
- Ordered vs. Unordered: Events can be processed in order or out of order
Statistics and Monitoring
The system tracks and reports:
Per-Table Statistics:
- Number of events processed
- Data size transferred
- Percentage of total transfer
Overall Statistics:
- Total events from source
- Total events processed
- Processing rate
- Elapsed time
Health Metrics:
- Difference between source and processed events
- Channel utilization
- Processing status
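The counters behind these reports can be as simple as a pair of atomic integers per table; the sketch below is a minimal illustration, not the actual tracker.

```go
package streams

import "sync/atomic"

// tableStats holds the per-table counters described above; rates and
// percentages are derived from these values at reporting time.
type tableStats struct {
	events atomic.Int64 // events processed for the table
	bytes  atomic.Int64 // data size transferred for the table
}

func (s *tableStats) record(size int64) {
	s.events.Add(1)
	s.bytes.Add(size)
}

// lag is the health metric mentioned above: how far processing trails the
// number of events reported by the source.
func lag(sourceEvents, processedEvents int64) int64 {
	return sourceEvents - processedEvents
}
```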