Connecting to Sources
We will be working with Pagila, a sample database that represents a DVD rental store. The database contains information about films, actors, customers, and stores. Postgres is used as the database engine. To quickly setup postgres, you can use docker-compose.
version: '3.8'
services:
postgres:
container_name: dozer-pagila-postgres
image: debezium/postgres:13
volumes:
- ./data/init.sql:/docker-entrypoint-initdb.d/init.sql
environment:
POSTGRES_DB: pagila
POSTGRES_USER: postgres
POSTGRES_PASSWORD: postgres
ALLOW_IP_RANGE: 0.0.0.0/0
ports:
- '5433:5432'
healthcheck:
test: [ "CMD-SHELL", "pg_isready -U postgres -d pagila" ]
interval: 5s
timeout: 5s
retries: 5
Don't forget to download data using the following command:
#!/bin/bash
# https://github.com/devrimgunduz/pagila
mkdir -p data
curl https://raw.githubusercontent.com/devrimgunduz/pagila/726c724df9f86406577c47790d6f8e6f2be06186/pagila-schema.sql \
https://raw.githubusercontent.com/devrimgunduz/pagila/726c724df9f86406577c47790d6f8e6f2be06186/pagila-data.sql \
| cat > data/init.sql
Sourcing trips data from Postgres
Open a new file dozer-config.yaml and add the following configuration section:
cache_max_map_size: 10737418240
version: 1
app_name: simple-pg-sample
connections:
- name: pagila_conn
config: !Postgres
user: postgres
password: postgres
host: localhost
port: 5433
database: pagila
sources:
- name: actors
table_name: actor
connection: pagila_conn
sinks:
- table_name: actors
config: !Dummy
Start Dozer with the command:
dozer run --enable-progress
You should be seeing a screen like the following
INFO Initializing app: simple-pg-sample
INFO Data directory: ./.dozer
INFO [API] Configuration
+------+----+------+
| Type | IP | Port |
+------+----+------+
| REST | - | - |
+------+----+------+
| GRPC | - | - |
+------+----+------+
INFO [API] Endpoints
+------+------+
| Path | Name |
+------+------+
INFO Source: Initializing input schema: actor
+-------------+-----------+----------+-------+
| Field | Type | Nullable | PK |
+-------------+-----------+----------+-------+
| actor_id | Int | false | true |
+-------------+-----------+----------+-------+
| first_name | String | false | false |
+-------------+-----------+----------+-------+
| last_name | String | false | false |
+-------------+-----------+----------+-------+
| last_update | Timestamp | false | false |
+-------------+-----------+----------+-------+
INFO [pipeline] Validation completed
INFO Source: Initializing input schema: actor
+-------------+-----------+----------+-------+
| Field | Type | Nullable | PK |
+-------------+-----------+----------+-------+
| actor_id | Int | false | true |
+-------------+-----------+----------+-------+
| first_name | String | false | false |
+-------------+-----------+----------+-------+
| last_name | String | false | false |
+-------------+-----------+----------+-------+
| last_update | Timestamp | false | false |
+-------------+-----------+----------+-------+
INFO [pipeline] Validation completed
INFO Starting Internal Server on 0.0.0.0:50053
INFO Starting Rest Api Server on http://0.0.0.0:8080 with security: None
▹▹▹▹▹ actor: 0: 0/s
INFO Snapshotting for connection pagila_conn took 5.471011ms
This means Dozer is succesfully running. Now we can add one more connection to our dozer-config.yaml.
Adding an additonal source
We will now be adding a second connection to a PostgreSQL database. For simplicity, we will use Supabase, a fully managed PostgreSQL database service.
Head over to Supabase and create a new project. Create a new table named zones and import the data from this CSV file. To do that, click on the button Import data via spreadsheet when creating a new table in Supabase. Upload the CSV file and select LocationID as primary key.
Once created, head over to your project settings in Supabase, click on Database and take note of the Connection Info details. These will be used in the Dozer configuration to connect to this database instance.
Now edit your dozer-config.yaml and add a new PostgreSQL connection, a new source and a new endpoint. The new dozer-config.yaml file should look like this:
cache_max_map_size: 10737418240
version: 1
app_name: simple-pg-sample
connections:
- name: pagila_conn
config: !Postgres
user: postgres
password: postgres
host: localhost
port: 5433
database: pagila
- config: !Postgres
user: ** Your Supabase user ** (generally postgres)
password: ** Your Supabase password **
host: ** Your Supabase host **
port: ** Your Supabase port ** (generally 5432)
database: postgres
name: pg
sources:
- name: actors
table_name: actor
connection: pagila_conn
- name: films
table_name: film
connection: pagila_conn
columns:
- film_id
- title
- name: zones
table_name: zones
connection: pg
sinks:
- table_name: actors
config: !Dummy
- table_name: zones
config: !Dummy
Now you can restart Dozer. You will notice that a new collection called zones is available. In the next section we will be adding SQL transformations to our data sources.
In the next section we will be connecting to a real sink instead of dummy for the replication pipeline.