The tap-s3 extractor pulls data from S3 that can then be sent to a destination using a loader.
Airbyte Usage Notice
Container-based connectors can introduce deployment challenges, including the potential need to run Docker-in-Docker (not currently supported by services like AWS ECS, Meltano Cloud, etc.; see the FAQ and Airbyte's ECS deployment docs for more details). Before using this variant, we recommend considering if/how you will be able to deploy container-based connectors to production.
For more context on how this Airbyte integration works, please check out the FAQ in the Meltano Docs.
Getting Started
Prerequisites
If you haven't already, follow the initial steps of the Getting Started guide:
Installation and configuration
- Add the tap-s3 extractor to your project using meltano add
- Configure the tap-s3 settings using meltano config
- Test that the extractor settings are valid using meltano config
meltano add extractor tap-s3
meltano config tap-s3 set --interactive
meltano config tap-s3 test
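Once added, the extractor's configuration lives in your project's meltano.yml. A minimal sketch is shown below; the dataset name, bucket, and path pattern are hypothetical placeholders, and secrets such as the AWS keys should be supplied via environment variables (e.g. TAP_S3_AIRBYTE_CONFIG_PROVIDER_AWS_SECRET_ACCESS_KEY) rather than committed to the file:

```yaml
plugins:
  extractors:
    - name: tap-s3
      variant: airbyte
      config:
        airbyte_config:
          dataset: my_stream            # hypothetical stream name
          path_pattern: "**"            # pick up all files
          provider:
            bucket: my-example-bucket   # hypothetical bucket name
          format:
            filetype: csv
```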
Next steps
Follow the remaining steps of the Getting Started guide:
If you run into any issues, learn how to get help.
Capabilities
The current capabilities for
tap-s3
may have been automatically set when originally added to the Hub. Please review the
capabilities when using this extractor. If you find they are out of date, please
consider updating them by making a pull request to the YAML file that defines the
capabilities for this extractor.
This plugin has the following capabilities:
- about
- catalog
- discover
- schema-flattening
- state
- stream-maps
You can override these capabilities or specify additional ones in your meltano.yml by adding the capabilities key.
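For example, the capabilities key can be set on the plugin definition in meltano.yml; the list below simply mirrors the capabilities documented above:

```yaml
plugins:
  extractors:
    - name: tap-s3
      capabilities:
        - about
        - catalog
        - discover
        - schema-flattening
        - state
        - stream-maps
```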
Settings
The
tap-s3
settings that are known to Meltano are documented below. To quickly
find the setting you're looking for, click on any setting name from the list:
airbyte_config.dataset
airbyte_config.format.additional_reader_options
airbyte_config.format.advanced_options
airbyte_config.format.batch_size
airbyte_config.format.block_size
airbyte_config.format.buffer_size
airbyte_config.format.columns
airbyte_config.format.delimiter
airbyte_config.format.double_quote
airbyte_config.format.encoding
airbyte_config.format.escape_char
airbyte_config.format.filetype
airbyte_config.format.infer_datatypes
airbyte_config.format.newlines_in_values
airbyte_config.format.quote_char
airbyte_config.format.unexpected_field_behavior
airbyte_config.path_pattern
airbyte_config.provider.aws_access_key_id
airbyte_config.provider.aws_secret_access_key
airbyte_config.provider.bucket
airbyte_config.provider.endpoint
airbyte_config.provider.path_prefix
airbyte_config.schema
airbyte_spec.image
airbyte_spec.tag
docker_mounts
flattening_enabled
flattening_max_depth
stream_map_config
stream_maps
You can also list these settings using the meltano config list subcommand:
meltano config tap-s3 list
You can override these settings or specify additional ones in your meltano.yml by adding the settings key.
Please consider adding any settings you have defined locally to this definition on MeltanoHub by making a pull request to the YAML file that defines the settings for this plugin.
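As a sketch, an extra setting can be declared under the settings key in meltano.yml; the setting name below is hypothetical and only illustrates the shape of such a declaration:

```yaml
plugins:
  extractors:
    - name: tap-s3
      settings:
        - name: airbyte_config.custom_option  # hypothetical extra setting
          kind: string
```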
Airbyte Config Dataset (airbyte_config.dataset)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_DATASET
The name of the stream you would like this source to output. Can contain letters, numbers, or underscores.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config dataset [value]
Airbyte Config Format Additional Reader Options (airbyte_config.format.additional_reader_options)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_FORMAT_ADDITIONAL_READER_OPTIONS
Optionally add a valid JSON string here to provide additional options to the csv reader. Mappings must correspond to options detailed here. 'column_types' is used internally to handle schema so overriding that would likely cause problems.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config format.additional_reader_options [value]
Airbyte Config Format Advanced Options (airbyte_config.format.advanced_options)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_FORMAT_ADVANCED_OPTIONS
Optionally add a valid JSON string here to provide additional Pyarrow ReadOptions. Specify 'column_names' here if your CSV doesn't have header, or if you want to use custom column names. 'block_size' and 'encoding' are already used above, specify them again here will override the values above.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config format.advanced_options [value]
Airbyte Config Format Batch Size (airbyte_config.format.batch_size)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_FORMAT_BATCH_SIZE
Maximum number of records per batch read from the input files. Batches may be smaller if there aren’t enough rows in the file. This option can help avoid out-of-memory errors if your data is particularly wide.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config format.batch_size [value]
Airbyte Config Format Block Size (airbyte_config.format.block_size)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_FORMAT_BLOCK_SIZE
The chunk size in bytes to process at a time in memory from each file. If your data is particularly wide and failing during schema detection, increasing this should solve it. Beware of raising this too high as you could hit OOM errors.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config format.block_size [value]
Airbyte Config Format Buffer Size (airbyte_config.format.buffer_size)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_FORMAT_BUFFER_SIZE
Perform read buffering when deserializing individual column chunks. By default every group column will be loaded fully to memory. This option can help avoid out-of-memory errors if your data is particularly wide.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config format.buffer_size [value]
Airbyte Config Format Columns (airbyte_config.format.columns)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_FORMAT_COLUMNS
If you only want to sync a subset of the columns from the file(s), add the columns you want here as a comma-delimited list. Leave it empty to sync all columns.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config format.columns [value]
Airbyte Config Format Delimiter (airbyte_config.format.delimiter)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_FORMAT_DELIMITER
The character delimiting individual cells in the CSV data. This may only be a 1-character string. For tab-delimited data enter '\t'.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config format.delimiter [value]
Airbyte Config Format Double Quote (airbyte_config.format.double_quote)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_FORMAT_DOUBLE_QUOTE
Whether two quotes in a quoted CSV value denote a single quote in the data.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config format.double_quote [value]
Airbyte Config Format Encoding (airbyte_config.format.encoding)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_FORMAT_ENCODING
The character encoding of the CSV data. Leave blank to default to UTF8. See list of python encodings for allowable options.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config format.encoding [value]
Airbyte Config Format Escape Char (airbyte_config.format.escape_char)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_FORMAT_ESCAPE_CHAR
The character used for escaping special characters. To disallow escaping, leave this field blank.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config format.escape_char [value]
Airbyte Config Format Filetype (airbyte_config.format.filetype)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_FORMAT_FILETYPE
The format of the files to replicate. One of: csv, parquet, avro, jsonl.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config format.filetype [value]
Airbyte Config Format Infer Datatypes (airbyte_config.format.infer_datatypes)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_FORMAT_INFER_DATATYPES
Configures whether a schema for the source should be inferred from the current data or not. If set to false and a custom schema is set, then the manually enforced schema is used. If a schema is not manually set, and this is set to false, then all fields will be read as strings.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config format.infer_datatypes [value]
Airbyte Config Format Newlines In Values (airbyte_config.format.newlines_in_values)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_FORMAT_NEWLINES_IN_VALUES
Whether newline characters are allowed in values (applies to the CSV and JSONL formats). Turning this on may affect performance. Leave blank to default to False.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config format.newlines_in_values [value]
Airbyte Config Format Quote Char (airbyte_config.format.quote_char)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_FORMAT_QUOTE_CHAR
The character used for quoting CSV values. To disallow quoting, make this field blank.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config format.quote_char [value]
Airbyte Config Format Unexpected Field Behavior (airbyte_config.format.unexpected_field_behavior)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_FORMAT_UNEXPECTED_FIELD_BEHAVIOR
How JSON fields outside of explicit_schema (if given) are treated. Check PyArrow documentation for details
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config format.unexpected_field_behavior [value]
Airbyte Config Path Pattern (airbyte_config.path_pattern)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_PATH_PATTERN
A regular expression which tells the connector which files to replicate. All files which match this pattern will be replicated. Use | to separate multiple patterns. See this page to understand pattern syntax (GLOBSTAR and SPLIT flags are enabled). Use pattern ** to pick up all files.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config path_pattern [value]
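For instance, a pattern that replicates only CSV files under two prefixes might look like the sketch below; the folder names are hypothetical, and | separates the alternative patterns:

```yaml
plugins:
  extractors:
    - name: tap-s3
      config:
        airbyte_config:
          # hypothetical prefixes; GLOBSTAR (**) matches across folder levels
          path_pattern: "reports/**/*.csv|archive/*.csv"
```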
Airbyte Config Provider AWS Access Key Id (airbyte_config.provider.aws_access_key_id)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_PROVIDER_AWS_ACCESS_KEY_ID
In order to access private Buckets stored on AWS S3, this connector requires credentials with the proper permissions. If accessing publicly available data, this field is not necessary.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config provider.aws_access_key_id [value]
Airbyte Config Provider AWS Secret Access Key (airbyte_config.provider.aws_secret_access_key)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_PROVIDER_AWS_SECRET_ACCESS_KEY
In order to access private Buckets stored on AWS S3, this connector requires credentials with the proper permissions. If accessing publicly available data, this field is not necessary.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config provider.aws_secret_access_key [value]
Airbyte Config Provider Bucket (airbyte_config.provider.bucket)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_PROVIDER_BUCKET
Name of the S3 bucket where the file(s) exist.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config provider.bucket [value]
Airbyte Config Provider Endpoint (airbyte_config.provider.endpoint)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_PROVIDER_ENDPOINT
Endpoint to an S3 compatible service. Leave empty to use AWS.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config provider.endpoint [value]
Airbyte Config Provider Path Prefix (airbyte_config.provider.path_prefix)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_PROVIDER_PATH_PREFIX
By providing a path-like prefix (e.g. myFolder/thisTable/) under which all the relevant files sit, we can optimize finding these in S3. This is optional but recommended if your bucket contains many folders/files which you don't need to replicate.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config provider.path_prefix [value]
Airbyte Config Schema (airbyte_config.schema)
-
Environment variable:
TAP_S3_AIRBYTE_CONFIG_SCHEMA
Optionally provide a schema to enforce, as a valid JSON string. Ensure this is a mapping of { "column" : "type" }, where types are valid JSON Schema datatypes. Leave as {} to auto-infer the schema.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_config schema [value]
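As an illustration, an enforced schema is passed as a JSON string mapping column names to JSON Schema types; the column names below are hypothetical:

```yaml
plugins:
  extractors:
    - name: tap-s3
      config:
        airbyte_config:
          # hypothetical columns; values must be valid JSON Schema datatypes
          schema: '{ "id": "integer", "name": "string", "created_at": "string" }'
```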
Airbyte Spec Image (airbyte_spec.image)
-
Environment variable:
TAP_S3_AIRBYTE_SPEC_IMAGE
-
Default Value:
airbyte/source-s3
Airbyte image to run
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_spec image [value]
Airbyte Spec Tag (airbyte_spec.tag)
-
Environment variable:
TAP_S3_AIRBYTE_SPEC_TAG
-
Default Value:
latest
Airbyte image tag
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set airbyte_spec tag [value]
Docker Mounts (docker_mounts)
-
Environment variable:
TAP_S3_DOCKER_MOUNTS
Docker mounts to make available to the Airbyte container. Expects a list of maps containing source, target, and type as is documented in the docker --mount documentation
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set docker_mounts [value]
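A sketch of a single bind mount, following the source/target/type shape from the docker --mount documentation; the host path is hypothetical:

```yaml
plugins:
  extractors:
    - name: tap-s3
      config:
        docker_mounts:
          - source: /home/user/.aws   # hypothetical host path
            target: /root/.aws        # path inside the Airbyte container
            type: bind
```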
SDK Settings
Flattening Enabled (flattening_enabled)
-
Environment variable:
TAP_S3_FLATTENING_ENABLED
'True' to enable schema flattening and automatically expand nested properties.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set flattening_enabled [value]
Flattening Max Depth (flattening_max_depth)
-
Environment variable:
TAP_S3_FLATTENING_MAX_DEPTH
The max depth to flatten schemas.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set flattening_max_depth [value]
Stream Map Config (stream_map_config)
-
Environment variable:
TAP_S3_STREAM_MAP_CONFIG
User-defined config values to be used within map expressions.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set stream_map_config [value]
Stream Maps (stream_maps)
-
Environment variable:
TAP_S3_STREAM_MAPS
Config object for stream maps capability. For more information check out Stream Maps.
Configure this setting directly using the following Meltano command:
meltano config tap-s3 set stream_maps [value]
Something missing?
This page is generated from a YAML file that you can contribute changes to. Edit it on GitHub!
Looking for help?
If you get stuck, ask for help in the #plugins-general channel.