Object Stores
In Dozer, object store connectors provide interfaces to different data storage solutions. A fundamental component of these connectors is the tables attribute, which denotes specific datasets or collections within the storage. Through these table configurations, Dozer enables a modular approach, allowing various storage mediums to be paired with distinct data formats.
Consider the example:
- name: local_dataset
config: !LocalStorage
details:
path: /tmp/data
tables:
- !Table
name: zones
config: !CSV
path: {{file_path}}
extension: .csv
marker_file: true
marker_extension: .marker
Here, local_dataset represents a connector for local storage. Within it, the zones table is defined to use the CSV format, illustrating Dozer's flexibility in combining storages, such as LocalStorage, with data formats like CSV.
Storage Types
Local Storage
The Local Storage connector is used to connect to a local file system, and use it as a source for data ingestion, like any other Object Store.
Configuration
connections:
- name: local_dataset
config: !LocalStorage
details:
path: /tmp/data
tables:
...
Parameters
| Name | Type | Description |
|---|---|---|
path | Path | The path to the local storage folder. |
tables | List | A list of tables to be ingested from the local storage. Refer to the File Formats section for more details. |
AWS S3
The AWS S3 connector is used to connect to an S3 bucket, and use it as a source for data ingestion.
Configuration
connections:
- name: local_dataset
config: !LocalStorage
details:
path: /tmp/data
tables:
...
Parameters
| Name | Type | Description |
|---|---|---|
access_key_id | String | The access key id of the AWS account. |
secret_access_key | String | The secret access key of the AWS account. |
region | String | The region of the S3 bucket. |
bucket_name | String | The name of the S3 bucket. |
tables | List | A list of tables to be ingested from the S3 bucket. Refer to the File Formats section for more details. |
File Formats
CSV
The Dozer CSV reader operates in an "append" mode, continually monitoring a specified directory for new CSV files. Upon detecting new files, it triggers an automatic ingestion process. To enhance the ingestion control, there's a "marker file" mechanism. If this feature is active, a new CSV file will only be ingested if a corresponding marker file is also present in the directory. This ensures deliberate and controlled data ingestion.
Configuration
- !Table
name: zones
config: !CSV
path: {{file_path}}
extension: .csv
marker_file: true
marker_extension: .marker
Parameters
| Name | Type | Description |
|---|---|---|
path | String | The path to folder containing CSV files. |
extension | String | The extension of the CSV file. |
marker_file | Boolean | Optional. Indicates whether to require marker files for ingestion. If true, only files with corresponding marker files are ingested. |
marker_extension | String | Optional. The extension of the marker files. Only relevant if marker_file is set to true. |
Parquet
The Dozer Parquet reader operates in an "append" mode, continually monitoring a specified directory for new Parquet files. Upon detecting new files, it triggers an automatic ingestion process. To enhance the ingestion control, there's a "marker file" mechanism. If this feature is active, a new Parquet file will only be ingested if a corresponding marker file is also present in the directory. This ensures deliberate and controlled data ingestion.
Configuration
- !Table
name: trips
config: !Parquet
path: {{file_path}}
extension: .parquet
marker_file: true
marker_extension: .marker
Parameters
| Name | Type | Description |
|---|---|---|
path | String | The path to the folder containing the Parquet files. |
extension | String | Optional. The extension of the Parquet files. |
marker_file | Boolean | Optional. Indicates whether to require marker files for ingestion. If true, only files with corresponding marker files are ingested. |
marker_extension | String | Optional. The extension of the marker files. Only relevant if marker_file is set to true. |
Trying it out
To test a MySQL sample, clone the dozer-samples GitHub repo and follow the steps described here.