The tap-s3 extractor pulls data from S3 that can then be sent to a destination using a loader.
Airbyte Usage Notice
Container-based connectors can introduce deployment challenges, including the potential need to run Docker-in-Docker (not currently supported by services like AWS ECS, Meltano Cloud, etc.; see the FAQ and Airbyte's ECS deployment docs for more details). Before using this variant, we recommend considering if/how you will be able to deploy container-based connectors to production.
For more context on how this Airbyte integration works, please check out the FAQ in the Meltano Docs.
Getting Started
Prerequisites
If you haven't already, follow the initial steps of the Getting Started guide:
Installation and configuration
- Add the tap-s3 extractor to your project using meltano add
- Configure the tap-s3 settings using meltano config
- Test that the extractor settings are valid using meltano config
meltano add extractor tap-s3
meltano config tap-s3 set --interactive
meltano config tap-s3 test
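Once added, the extractor's configuration lives in your project's meltano.yml. A minimal sketch is shown below; the dataset name, bucket, and path pattern are hypothetical placeholders, and secrets such as the AWS keys should be supplied via environment variables (e.g. TAP_S3_AIRBYTE_CONFIG_PROVIDER_AWS_SECRET_ACCESS_KEY) rather than committed to the file:

```yaml
plugins:
  extractors:
    - name: tap-s3
      variant: airbyte
      config:
        airbyte_config:
          dataset: my_stream            # hypothetical stream name
          path_pattern: "**"            # pick up all files
          provider:
            bucket: my-example-bucket   # hypothetical bucket name
          format:
            filetype: csv
```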
Next steps
Follow the remaining steps of the Getting Started guide:
If you run into any issues, learn how to get help.
Capabilities
The current capabilities for
tap-s3
may have been automatically set when originally added to the Hub. Please review the
capabilities when using this extractor. If you find they are out of date, please
consider updating them by making a pull request to the YAML file that defines the
capabilities for this extractor.
This plugin has the following capabilities:
- about
- catalog
- discover
- schema-flattening
- state
- stream-maps
You can override these capabilities or specify additional ones in your meltano.yml by adding the capabilities key.
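For example, the capabilities key can be set on the plugin definition in meltano.yml; the list below simply mirrors the capabilities documented above:

```yaml
plugins:
  extractors:
    - name: tap-s3
      capabilities:
        - about
        - catalog
        - discover
        - schema-flattening
        - state
        - stream-maps
```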
Settings
The
tap-s3
settings that are known to Meltano are documented below. To quickly
find the setting you're looking for, click on any setting name from the list:
airbyte_config.dataset
airbyte_config.format.additional_reader_options
airbyte_config.format.advanced_options
airbyte_config.format.batch_size
airbyte_config.format.block_size
airbyte_config.format.buffer_size
airbyte_config.format.columns
airbyte_config.format.delimiter
airbyte_config.format.double_quote
airbyte_config.format.encoding
airbyte_config.format.escape_char
airbyte_config.format.filetype
airbyte_config.format.infer_datatypes
airbyte_config.format.newlines_in_values
airbyte_config.format.quote_char
airbyte_config.format.unexpected_field_behavior
airbyte_config.path_pattern
airbyte_config.provider.aws_access_key_id
airbyte_config.provider.aws_secret_access_key
airbyte_config.provider.bucket
airbyte_config.provider.endpoint
airbyte_config.provider.path_prefix
airbyte_config.schema
airbyte_spec.image
airbyte_spec.tag
docker_mounts
flattening_enabled
flattening_max_depth
stream_map_config
stream_maps
You can also list these settings using the meltano config list subcommand:
meltano config tap-s3 list
You can override these settings or specify additional ones in your meltano.yml by adding the settings key.
Please consider adding any settings you have defined locally to this definition on MeltanoHub by making a pull request to the YAML file that defines the settings for this plugin.
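As a sketch, an extra setting can be declared under the settings key in meltano.yml; the setting name below is hypothetical and only illustrates the shape of such a declaration:

```yaml
plugins:
  extractors:
    - name: tap-s3
      settings:
        - name: airbyte_config.custom_option  # hypothetical extra setting
          kind: string
```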
Airbyte Config Dataset (airbyte_config.dataset)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_DATASET
The name of the stream you would like this source to output. Can contain letters, numbers, or underscores.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config dataset [value]
Airbyte Config Format Additional Reader Options (airbyte_config.format.additional_reader_options)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_FORMAT_ADDITIONAL_READER_OPTIONS
Optionally add a valid JSON string here to provide additional options to the csv reader. Mappings must correspond to options detailed here. 'column_types' is used internally to handle schema so overriding that would likely cause problems.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config format.additional_reader_options [value]
Airbyte Config Format Advanced Options (airbyte_config.format.advanced_options)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_FORMAT_ADVANCED_OPTIONS
Optionally add a valid JSON string here to provide additional Pyarrow ReadOptions. Specify 'column_names' here if your CSV doesn't have header, or if you want to use custom column names. 'block_size' and 'encoding' are already used above, specify them again here will override the values above.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config format.advanced_options [value]
Airbyte Config Format Batch Size (airbyte_config.format.batch_size)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_FORMAT_BATCH_SIZE
Maximum number of records per batch read from the input files. Batches may be smaller if there aren’t enough rows in the file. This option can help avoid out-of-memory errors if your data is particularly wide.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config format.batch_size [value]
Airbyte Config Format Block Size (airbyte_config.format.block_size)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_FORMAT_BLOCK_SIZE
The chunk size in bytes to process at a time in memory from each file. If your data is particularly wide and failing during schema detection, increasing this should solve it. Beware of raising this too high as you could hit OOM errors.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config format.block_size [value]
Airbyte Config Format Buffer Size (airbyte_config.format.buffer_size)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_FORMAT_BUFFER_SIZE
Perform read buffering when deserializing individual column chunks. By default every group column will be loaded fully to memory. This option can help avoid out-of-memory errors if your data is particularly wide.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config format.buffer_size [value]
Airbyte Config Format Columns (airbyte_config.format.columns)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_FORMAT_COLUMNS
If you only want to sync a subset of the columns from the file(s), add the columns you want here as a comma-delimited list. Leave it empty to sync all columns.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config format.columns [value]
Airbyte Config Format Delimiter (airbyte_config.format.delimiter)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_FORMAT_DELIMITER
The character delimiting individual cells in the CSV data. This may only be a 1-character string. For tab-delimited data enter '\t'.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config format.delimiter [value]
Airbyte Config Format Double Quote (airbyte_config.format.double_quote)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_FORMAT_DOUBLE_QUOTE
Whether two quotes in a quoted CSV value denote a single quote in the data.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config format.double_quote [value]
Airbyte Config Format Encoding (airbyte_config.format.encoding)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_FORMAT_ENCODING
The character encoding of the CSV data. Leave blank to default to UTF8. See list of python encodings for allowable options.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config format.encoding [value]
Airbyte Config Format Escape Char (airbyte_config.format.escape_char)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_FORMAT_ESCAPE_CHAR
The character used for escaping special characters. To disallow escaping, leave this field blank.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config format.escape_char [value]
Airbyte Config Format Filetype (airbyte_config.format.filetype)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_FORMAT_FILETYPE
The format of the files to replicate. One of: csv, parquet, avro, jsonl.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config format.filetype [value]
Airbyte Config Format Infer Datatypes (airbyte_config.format.infer_datatypes)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_FORMAT_INFER_DATATYPES
Configures whether a schema for the source should be inferred from the current data or not. If set to false and a custom schema is set, then the manually enforced schema is used. If a schema is not manually set, and this is set to false, then all fields will be read as strings.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config format.infer_datatypes [value]
Airbyte Config Format Newlines In Values (airbyte_config.format.newlines_in_values)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_FORMAT_NEWLINES_IN_VALUES
Whether newline characters are allowed in values (applies to the CSV and JSONL formats). Turning this on may affect performance. Leave blank to default to False.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config format.newlines_in_values [value]
Airbyte Config Format Quote Char (airbyte_config.format.quote_char)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_FORMAT_QUOTE_CHAR
The character used for quoting CSV values. To disallow quoting, make this field blank.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config format.quote_char [value]
Airbyte Config Format Unexpected Field Behavior (airbyte_config.format.unexpected_field_behavior)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_FORMAT_UNEXPECTED_FIELD_BEHAVIOR
How JSON fields outside of explicit_schema (if given) are treated. Check PyArrow documentation for details
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config format.unexpected_field_behavior [value]
Airbyte Config Path Pattern (airbyte_config.path_pattern)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_PATH_PATTERN
A regular expression which tells the connector which files to replicate. All files which match this pattern will be replicated. Use | to separate multiple patterns. See this page to understand pattern syntax (GLOBSTAR and SPLIT flags are enabled). Use pattern ** to pick up all files.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config path_pattern [value]
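For instance, a pattern that replicates only CSV files under two prefixes might look like the sketch below; the folder names are hypothetical, and | separates the alternative patterns:

```yaml
plugins:
  extractors:
    - name: tap-s3
      config:
        airbyte_config:
          # hypothetical prefixes; GLOBSTAR (**) matches across folder levels
          path_pattern: "reports/**/*.csv|archive/*.csv"
```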
Airbyte Config Provider AWS Access Key Id (airbyte_config.provider.aws_access_key_id)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_PROVIDER_AWS_ACCESS_KEY_ID
In order to access private Buckets stored on AWS S3, this connector requires credentials with the proper permissions. If accessing publicly available data, this field is not necessary.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config provider.aws_access_key_id [value]
Airbyte Config Provider AWS Secret Access Key (airbyte_config.provider.aws_secret_access_key)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_PROVIDER_AWS_SECRET_ACCESS_KEY
In order to access private Buckets stored on AWS S3, this connector requires credentials with the proper permissions. If accessing publicly available data, this field is not necessary.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config provider.aws_secret_access_key [value]
Airbyte Config Provider Bucket (airbyte_config.provider.bucket)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_PROVIDER_BUCKET
Name of the S3 bucket where the file(s) exist.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config provider.bucket [value]
Airbyte Config Provider Endpoint (airbyte_config.provider.endpoint)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_PROVIDER_ENDPOINT
Endpoint to an S3 compatible service. Leave empty to use AWS.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config provider.endpoint [value]
Airbyte Config Provider Path Prefix (airbyte_config.provider.path_prefix)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_PROVIDER_PATH_PREFIX
By providing a path-like prefix (e.g. myFolder/thisTable/) under which all the relevant files sit, we can optimize finding these in S3. This is optional but recommended if your bucket contains many folders/files which you don't need to replicate.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config provider.path_prefix [value]
Airbyte Config Schema (airbyte_config.schema)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_SCHEMA
Optionally provide a schema to enforce, as a valid JSON string. Ensure this is a mapping of { "column" : "type" }, where types are valid JSON Schema datatypes. Leave as {} to auto-infer the schema.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config schema [value]
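As an illustration, an enforced schema is passed as a JSON string mapping column names to JSON Schema types; the column names below are hypothetical:

```yaml
plugins:
  extractors:
    - name: tap-s3
      config:
        airbyte_config:
          # hypothetical columns; values must be valid JSON Schema datatypes
          schema: '{ "id": "integer", "name": "string", "created_at": "string" }'
```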
Airbyte Spec Image (airbyte_spec.image)
-
Environment variable:
TAP_S3_AIRBYTE_SPEC_IMAGE
-
Default Value:
airbyte/source-s3
Airbyte image to run
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_spec image [value]
Airbyte Spec Tag (airbyte_spec.tag)
-
Environment variable:
TAP_S3_AIRBYTE_SPEC_TAG
-
Default Value:
latest
Airbyte image tag
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_spec tag [value]
Docker Mounts (docker_mounts)
-
Environment variable:
TAP_S3_DOCKER_MOUNTS
Docker mounts to make available to the Airbyte container. Expects a list of maps containing source, target, and type as is documented in the docker --mount documentation
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set docker_mounts [value]
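A sketch of a single bind mount, following the source/target/type shape from the docker --mount documentation; the host path is hypothetical:

```yaml
plugins:
  extractors:
    - name: tap-s3
      config:
        docker_mounts:
          - source: /home/user/.aws   # hypothetical host path
            target: /root/.aws        # path inside the Airbyte container
            type: bind
```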
SDK Settings
Flattening Enabled (flattening_enabled)
-
Environment variable:
TAP_S3_FLATTENING_ENABLED
'True' to enable schema flattening and automatically expand nested properties.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set flattening_enabled [value]
Flattening Max Depth (flattening_max_depth)
-
Environment variable:
TAP_S3_FLATTENING_MAX_DEPTH
The max depth to flatten schemas.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set flattening_max_depth [value]
Stream Map Config (stream_map_config)
-
Environment variable:
TAP_S3_STREAM_MAP_CONFIG
User-defined config values to be used within map expressions.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set stream_map_config [value]
Stream Maps (stream_maps)
-
Environment variable:
TAP_S3_STREAM_MAPS
Config object for stream maps capability. For more information check out Stream Maps.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set stream_maps [value]
Something missing?
This page is generated from a YAML file that you can contribute changes to. Edit it on GitHub!
Looking for help?
If you get stuck, ask for help in the #plugins-general channel.