The tap-spreadsheets-anywhere extractor pulls data from CSV, JSON, and Excel files stored locally or remotely (for example on S3 or behind an HTTP URL) that can then be sent to a destination using a loader.
Alternate Implementations
- Andrew Stewart
- Eric Simmerman (default)
- Jules Huisman
- Laurent Savaete
- Mashey
- Markus Meyer
- QuickBI Oy
- RFA
- William Liu
Getting Started
Prerequisites
If you haven't already, follow the initial steps of the Getting Started guide:
Installation and configuration
- Add the tap-spreadsheets-anywhere extractor to your project using meltano add:
meltano add extractor tap-spreadsheets-anywhere
- Configure the tap-spreadsheets-anywhere settings using meltano config:
meltano config tap-spreadsheets-anywhere set --interactive
- Test that extractor settings are valid using meltano config:
meltano config tap-spreadsheets-anywhere test
Next steps
Follow the remaining steps of the Getting Started guide:
If you run into any issues, learn how to get help.
Capabilities
The current capabilities for
tap-spreadsheets-anywhere
may have been automatically set when originally added to the Hub. Please review the
capabilities when using this extractor. If you find they are out of date, please
consider updating them by making a pull request to the YAML file that defines the
capabilities for this extractor.
This plugin has the following capabilities:
- catalog
- discover
- state
You can override these capabilities or specify additional ones in your meltano.yml by adding the capabilities key.
Settings
The
tap-spreadsheets-anywhere
settings that are known to Meltano are documented below. To quickly
find the setting you're looking for, click on any setting name from the list:
You can also list these settings using the meltano config list subcommand:
meltano config tap-spreadsheets-anywhere list
You can override these settings or specify additional ones in your meltano.yml by adding the settings key.
Please consider adding any settings you have defined locally to this definition on MeltanoHub by making a pull request to the YAML file that defines the settings for this plugin.
Tables (tables)
-
Environment variable:
TAP_SPREADSHEETS_ANYWHERE_TABLES
An array of JSON objects, each describing a set of targeted source files.
Each object in the 'tables' array describes one or more CSV or Excel spreadsheet files that adhere to the same schema and are meant to be tapped as the source for a Singer-based data flow. The available keys are:
- path: A string describing the transport and bucket/root directory holding the targeted source files.
- name: A string describing the "table" (aka Singer stream) into which the source data should be loaded.
- search_prefix: (optional) A prefix, applied after the bucket, used to filter files in the listing request from the targeted system. This prefix potentially reduces the number of files returned from the listing request.
- pattern: An escaped regular expression that the tap uses to filter the result set returned from the listing request. This pattern potentially reduces the number of listed files that are considered as sources for the declared table. Note that because this is an escaped string inside another escaped string, any backslashes in the regex must be double-escaped.
- start_date: This is the datetime that the tap will use to filter files, based on the modified timestamp of the file.
- key_properties: These are the "primary keys" of the CSV files, to be used by the target for deduplication and primary key definitions downstream in the destination.
- format: Must be either 'csv', 'json', 'excel', or 'detect'. Note that csv can be further customized with delimiter and quotechar variables below.
- invalid_format_action: (optional) By default, the tap will raise an exception if a source file cannot be read. Set this key to "ignore" to skip such source files and continue the run.
- field_names: (optional) An array holding the names of the columns in the targeted files. If not supplied, the first row of each file must hold the desired values.
- universal_newlines: (optional) Whether the source file parsers should honor universal newlines. Setting this to false will instruct the parser to only consider '\n' a valid newline identifier.
- sample_rate: (optional) The sampling rate to apply when reading a source file for sampling in discovery mode. A sampling rate of 1 will sample every line. A sampling rate of 10 (the default) will sample every 10th line.
- max_sampling_read: (optional) How many lines of the source file should be sampled when in discovery mode attempting to infer a schema. The default is 1000 samples.
- max_sampled_files: (optional) The maximum number of files in the targeted set that will be sampled. The default is 5.
- max_records_per_run: (optional) The maximum number of records that should be written to this stream in a single sync run. The default is unlimited.
- prefer_number_vs_integer: (optional) If the discovery mode sampling process sees only integer values for a field, should number be used anyway so that floats are not considered errors? The default is false but true can help in situations where floats only appear rarely in sources and may not be detected through discovery sampling.
- selected: (optional) Whether this table should be synced. Defaults to true. Setting to false will skip this table on a sync run.
- worksheet_name: (optional) The worksheet name to pull from in the targeted xls file(s). Only required when format is 'excel'.
- delimiter: (optional) The delimiter to use when format is 'csv'. Defaults to a comma ',', but you can set delimiter to 'detect' to leverage the csv "Sniffer" for delimiter auto-detection.
- quotechar: (optional) The character used to surround values that may contain delimiters. Defaults to a double quote '"'.
- json_path: (optional) The JSON key under which the list of objects to use is located. Defaults to None, corresponding to an array at the top level of the JSON tree.
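Because the pattern value is parsed once as JSON/YAML and again as a regular expression, any backslash must be written twice in the config source. A minimal Python sketch of that double-escaping (the file names here are hypothetical, not from the tap):

```python
import json
import re

# In the raw JSON below, the four backslashes in this Python literal become
# "\\" in the JSON text, which json.loads turns into a single "\" for the regex.
config = json.loads('{"pattern": "subfolder/report_\\\\d{4}\\\\.csv"}')

# The compiled pattern is: subfolder/report_\d{4}\.csv
pattern = re.compile(config["pattern"])

assert pattern.match("subfolder/report_2023.csv")
assert not pattern.match("subfolder/report_abc.csv")
```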
For example:
config:
tables:
- path: s3://my-s3-bucket
name: target_table_name
pattern: subfolder/common_prefix.*
start_date: 2017-05-01T00:00:00Z
key_properties: []
format: csv
delimiter: "|"
quotechar: '"'
universal_newlines: false
sample_rate: 10
max_sampling_read: 2000
max_sampled_files: 3
prefer_number_vs_integer: true
selected: true
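Setting delimiter to 'detect' defers detection to the csv "Sniffer" mentioned above. Python's standard-library csv.Sniffer illustrates what that detection does on a pipe-delimited sample like the example config targets (the sample rows below are made up):

```python
import csv
import io

# A small pipe-delimited sample, as the sniffer would see it.
sample = "id|name|amount\n1|widget|9.99\n2|gadget|4.50\n"
dialect = csv.Sniffer().sniff(sample)

# The sniffed dialect can then drive a normal csv.reader.
reader = csv.reader(io.StringIO(sample), dialect)
rows = list(reader)
assert dialect.delimiter == "|"
assert rows[0] == ["id", "name", "amount"]
```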
See the Common Config Examples
section below for more examples or see the repo README for more details.
Configure this setting directly using the following Meltano command:
meltano config tap-spreadsheets-anywhere set tables [value]
Common Config Examples
Debugging
If you're having trouble syncing records, try running the following command to view more verbose error messages:
meltano invoke tap-spreadsheets-anywhere --dev
AWS Public IP Ranges JSON File
The JSON response from https://ip-ranges.amazonaws.com/ip-ranges.json can be parsed into records for each prefix in the prefixes
array, using the json_path
key.
config:
tables:
- path: https://ip-ranges.amazonaws.com
format: json
start_date: "2022-12-01T00:00:00Z"
key_properties: [ip_prefix]
name: aws_ips
pattern: "ip-ranges.json"
json_path: "prefixes"
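The json_path key tells the tap which top-level key holds the record array. A Python sketch of the equivalent extraction, using an abbreviated stand-in for the ip-ranges.json payload (the prefix values are illustrative only):

```python
import json

# Abbreviated stand-in for the ip-ranges.json response body.
payload = json.loads("""
{
  "syncToken": "1700000000",
  "prefixes": [
    {"ip_prefix": "3.5.140.0/22", "region": "ap-northeast-2", "service": "AMAZON"},
    {"ip_prefix": "13.34.37.64/27", "region": "ap-southeast-4", "service": "AMAZON"}
  ]
}
""")

json_path = "prefixes"
# With json_path set, records come from payload["prefixes"];
# with json_path omitted, the top-level value itself must be an array.
records = payload[json_path] if json_path else payload
assert len(records) == 2
assert records[0]["ip_prefix"] == "3.5.140.0/22"
```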
S3 JSONL files
config:
tables:
- path: s3://my-bucket-name
format: json
key_properties: [user_id]
name: user_names
start_date: '2020-01-01T00:00:00Z'
pattern: my_prefix/user_names.json
The user_names.json file's contents would look like:
{"user_id": 1, "first_name": "John", "last_name": "Doe"}
{"user_id": 2, "first_name": "Sarah", "last_name": "Smith"}
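Each line of such a file is an independent JSON object (JSONL). A Python sketch of parsing records like the two lines above:

```python
import json

# The same two records shown above, one JSON object per line.
jsonl = (
    '{"user_id": 1, "first_name": "John", "last_name": "Doe"}\n'
    '{"user_id": 2, "first_name": "Sarah", "last_name": "Smith"}\n'
)

# Parse line by line, skipping any blank lines.
records = [json.loads(line) for line in jsonl.splitlines() if line.strip()]
assert [r["user_id"] for r in records] == [1, 2]
```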
Local Excel files
A local data directory containing an Excel file like /Users/my_local_path/data/excel_file.xlsx could be synced using:
config:
tables:
- path: "file:///Users/my_local_path"
name: "my_excel_files"
format: "excel"
worksheet_name: "Sheet1"
pattern: ".*"
key_properties: [id]
search_prefix: "data"
start_date: '2020-01-01T00:00:00Z'
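How path, search_prefix, and pattern interact can be sketched in Python: the listing is rooted at path, narrowed to entries under search_prefix, then filtered by the regex pattern. The extra file names below are hypothetical, added only to show filtering:

```python
import re

# Hypothetical listing of files under file:///Users/my_local_path
listing = [
    "data/excel_file.xlsx",
    "data/archive/old_file.xlsx",
    "other/notes.txt",
]

search_prefix = "data"
pattern = re.compile(".*")

# Keep only entries under the prefix that also match the pattern.
matched = [
    key for key in listing
    if key.startswith(search_prefix) and pattern.search(key)
]
assert "data/excel_file.xlsx" in matched
assert "other/notes.txt" not in matched
```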
Something missing?
This page is generated from a YAML file that you can contribute changes to. Edit it on GitHub!
Looking for help?
If you run into any issues, ask for help in the #plugins-general channel on the Meltano Slack.