Spreadsheets Anywhere

tap-spreadsheets-anywhere (ets variant)

Data extractor for CSV and Excel files from any smart_open-supported transport (S3, SFTP, localhost, etc.).

The tap-spreadsheets-anywhere extractor pulls data from Spreadsheets Anywhere that can then be sent to a destination using a loader.

Getting Started

Prerequisites

If you haven't already, follow the initial steps of the Getting Started guide:

  1. Install Meltano
  2. Create your Meltano project

Installation and configuration

  1. Add the tap-spreadsheets-anywhere extractor to your project using meltano add:

     meltano add extractor tap-spreadsheets-anywhere

  2. Configure the tap-spreadsheets-anywhere settings using meltano config:

     meltano config tap-spreadsheets-anywhere set --interactive

  3. Test that the extractor settings are valid using meltano config:

     meltano config tap-spreadsheets-anywhere test
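After these steps, the plugin is recorded in your project's meltano.yml. As a rough sketch only, the generated entry might look something like this (the pip_url and config values shown are assumptions and placeholders, not defaults; check the entry Meltano actually wrote):

  plugins:
    extractors:
    - name: tap-spreadsheets-anywhere
      variant: ets
      pip_url: git+https://github.com/ets/tap-spreadsheets-anywhere.git  # assumed; verify against your generated entry
      config:
        tables:
        - path: file:///tmp/data   # placeholder path
          name: my_table           # placeholder stream name
          format: detect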

Next steps

If you run into any issues, learn how to get help.

Capabilities

The current capabilities for tap-spreadsheets-anywhere may have been automatically set when originally added to the Hub. Please review the capabilities when using this extractor. If you find they are out of date, please consider updating them by making a pull request to the YAML file that defines the capabilities for this extractor.

This plugin has the following capabilities:

  • catalog
  • discover
  • state

You can override these capabilities or specify additional ones in your meltano.yml by adding the capabilities key.
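For illustration, overriding the capabilities key in meltano.yml might look like the following sketch (it simply restates the capabilities listed above):

  plugins:
    extractors:
    - name: tap-spreadsheets-anywhere
      capabilities:
      - catalog
      - discover
      - state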

Settings

The tap-spreadsheets-anywhere settings that are known to Meltano are documented below.

You can also list these settings using meltano config with the list subcommand:

meltano config tap-spreadsheets-anywhere list

You can override these settings or specify additional ones in your meltano.yml by adding the settings key.
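As an illustrative sketch only, a settings override in meltano.yml might look like this (the second setting is hypothetical and included only to show the shape of an additional entry):

  plugins:
    extractors:
    - name: tap-spreadsheets-anywhere
      settings:
      - name: tables
        kind: array
      - name: my_custom_setting   # hypothetical setting, for illustration only
        kind: string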

Please consider adding any settings you have defined locally to this definition on MeltanoHub by making a pull request to the YAML file that defines the settings for this plugin.

Tables (tables)

  • Environment variable: TAP_SPREADSHEETS_ANYWHERE_TABLES

An array holding JSON objects that each describe a set of targeted source files.

Each object in the 'tables' array describes one or more CSV or Excel spreadsheet files that adhere to the same schema and are meant to be tapped as the source for a Singer-based data flow. The available keys are:

  • path: A string describing the transport and bucket/root directory holding the targeted source files.
  • name: A string describing the "table" (aka Singer stream) into which the source data should be loaded.
  • search_prefix: (optional) A prefix applied after the bucket that is used to filter files in the listing request to the targeted system, potentially reducing the number of files returned.
  • pattern: An escaped regular expression that the tap uses to filter the result set returned from the listing request, potentially reducing the number of listed files that are considered as sources for the declared table. Because this is an escaped string inside of an escaped string, any backslashes in the regex need to be double-escaped.
  • start_date: This is the datetime that the tap will use to filter files, based on the modified timestamp of the file.
  • key_properties: These are the "primary keys" of the CSV files, to be used by the target for deduplication and primary key definitions downstream in the destination.
  • format: Must be either 'csv', 'json', 'excel', or 'detect'. Note that csv can be further customized with delimiter and quotechar variables below.
  • invalid_format_action: (optional) By default, the tap raises an exception if a source file cannot be read. Set this key to "ignore" to skip such source files and continue the run.
  • field_names: (optional) An array holding the names of the columns in the targeted files. If not supplied, the first row of each file must hold the desired values.
  • universal_newlines: (optional) Should the source file parsers honor universal newlines? Setting this to false will instruct the parser to only consider '\n' as a valid newline identifier.
  • sample_rate: (optional) The sampling rate to apply when reading a source file for sampling in discovery mode. A sampling rate of 1 will sample every line. A sampling rate of 10 (the default) will sample every 10th line.
  • max_sampling_read: (optional) How many lines of the source file should be sampled when in discovery mode attempting to infer a schema. The default is 1000 samples.
  • max_sampled_files: (optional) The maximum number of files in the targeted set that will be sampled. The default is 5.
  • max_records_per_run: (optional) The maximum number of records that should be written to this stream in a single sync run. The default is unlimited.
  • prefer_number_vs_integer: (optional) If the discovery-mode sampling process sees only integer values for a field, should number be used anyway so that floats are not considered errors? The default is false, but true can help in situations where floats appear only rarely in sources and may not be detected through discovery sampling.
  • selected: (optional) Should this table be synced? Defaults to true. Setting to false will skip this table on a sync run.
  • worksheet_name: (optional) The worksheet name to pull from in the targeted Excel file(s). Only required when format is 'excel'.
  • delimiter: (optional) The delimiter to use when format is 'csv'. Defaults to a comma ',', but you can set delimiter to 'detect' to leverage the csv "Sniffer" for auto-detecting the delimiter.
  • quotechar: (optional) The character used to surround values that may contain delimiters. Defaults to a double quote '"'.
  • json_path: (optional) The JSON key under which the list of objects to use is located. Defaults to None, corresponding to an array at the top level of the JSON tree.

For example:

config:
  tables:
  - path: s3://my-s3-bucket
    name: target_table_name
    pattern: subfolder/common_prefix.*
    start_date: 2017-05-01T00:00:00Z
    key_properties: []
    format: csv
    delimiter: "|"
    quotechar: '"'
    universal_newlines: false
    sample_rate: 10
    max_sampling_read: 2000
    max_sampled_files: 3
    prefer_number_vs_integer: true
    selected: true

See the Common Config Examples section below for more examples or see the repo README for more details.


Configure this setting directly using the following Meltano command:

meltano config tap-spreadsheets-anywhere set tables [value]
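For an array setting like tables, the value is supplied as JSON. A hypothetical single-table example (the path and names are placeholders):

meltano config tap-spreadsheets-anywhere set tables '[{"path": "s3://my-s3-bucket", "name": "my_table", "format": "detect"}]'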

Common Config Examples

Debugging

If you're having trouble syncing records, try running the following command to view more verbose error messages:

meltano --log-level=debug invoke tap-spreadsheets-anywhere

AWS Public IP Ranges JSON File

The JSON response from https://ip-ranges.amazonaws.com/ip-ranges.json can be parsed into records for each prefix in the prefixes array, using the json_path key.

  config:
    tables:
    - path: https://ip-ranges.amazonaws.com
      format: json
      start_date: "2022-12-01T00:00:00Z"
      key_properties: [ip_prefix]
      name: aws_ips
      pattern: "ip-ranges.json"
      json_path: "prefixes"

S3 JSONL files

  config:
    tables:
    - path: s3://my-bucket-name
      format: json
      key_properties: [user_id]
      name: user_names
      start_date: '2020-01-01T00:00:00Z'
      pattern: my_prefix/user_names.json

The user_names.json file's contents would look like:

  {"user_id": 1, "first_name": "John", "last_name": "Doe"}
  {"user_id": 2, "first_name": "Sarah", "last_name": "Smith"}

Local Excel files

A local data directory containing an Excel file at /Users/my_local_path/data/excel_file.xlsx could be synced using:

  config:
    tables:
    - path: "file:///Users/my_local_path"
      name: "my_excel_files"
      format: "excel"
      worksheet_name: "Sheet1"
      pattern: ".*"
      key_properties: [id]
      search_prefix: "data"
      start_date: '2020-01-01T00:00:00Z'

Something missing?

This page is generated from a YAML file that you can contribute changes to.


Looking for help?

If you're having trouble getting the tap-spreadsheets-anywhere extractor to work, look for an existing issue in its repository, file a new issue, or join the Meltano Slack community and ask for help in the #plugins-general channel.

Repo

https://github.com/ets/tap-spreadsheets-anywhere

Maintainer

  • Eric Simmerman

Keywords

  • file
  • s3
  • csv
  • tsv
  • json
  • jsonl
  • ssh
  • scp
  • sftp
  • webhdfs
  • gcs
  • google cloud storage
  • azure blob storage
  • abs
  • zero auth
  • zero config