Singer Spec

The Singer Specification is an open source standard for defining the format of data exchange. The standard is useful because it enables data professionals to move data between arbitrary systems, as long as the programs generating and ingesting the data understand this format.

This documentation is our attempt to distill the canonical specification into an easier-to-understand format for people who are new to the Singer community. The full specification is available in the Singer project on GitHub.

Version

The current version of the spec is 0.3.0 and is versioned using Semantic Versioning.

Basics

Messages

The full specification for data exchange consists of three types of JSON-formatted messages: schema, record, and state. The record message contains the actual data being communicated, the schema message defines the structure of the data, and the state message keeps track of the progress of an extraction.

An example of what these messages look like is here:

{"type": "SCHEMA", "stream": "users", "key_properties": ["id"], "schema": {"required": ["id"], "type": "object", "properties": {"id": {"type": "integer"}}}}
{"type": "RECORD", "stream": "users", "record": {"id": 1, "name": "Chris"}}
{"type": "RECORD", "stream": "users", "record": {"id": 2, "name": "Mike"}}
{"type": "SCHEMA", "stream": "locations", "key_properties": ["id"], "schema": {"required": ["id"], "type": "object", "properties": {"id": {"type": "integer"}}}}
{"type": "RECORD", "stream": "locations", "record": {"id": 1, "name": "Philadelphia"}}
{"type": "STATE", "value": {"users": 2, "locations": 1}}

Each record message contains a stream identifier which specifies the unique name of that data. For data coming from an API this can be thought of as the name of the endpoint. For data coming from a database this might be the table name. The schema message will have a matching stream identifier for the records it describes. The term "stream" will be used in the rest of the documentation to identify a set of data being extracted.
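As a sketch of how a tap produces such output, here is a minimal Python program (the stream name and fields are invented for illustration) that writes one message per line to stdout:

```python
import json
import sys

def emit(message):
    # Singer messages are newline-delimited JSON written to stdout.
    sys.stdout.write(json.dumps(message) + "\n")

def run_tap():
    # Announce the structure of the "users" stream first...
    emit({
        "type": "SCHEMA",
        "stream": "users",
        "key_properties": ["id"],
        "schema": {
            "type": "object",
            "required": ["id"],
            "properties": {"id": {"type": "integer"}},
        },
    })
    # ...then the data itself, and finally the extraction progress.
    emit({"type": "RECORD", "stream": "users", "record": {"id": 1, "name": "Chris"}})
    emit({"type": "RECORD", "stream": "users", "record": {"id": 2, "name": "Mike"}})
    emit({"type": "STATE", "value": {"users": 2}})

if __name__ == "__main__":
    run_tap()
```

Running this prints four single-line JSON messages in the order schema, records, state.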

Taps

These 3 message types are generated by programs called taps. A tap can be written in any programming language (note that Meltano will only run Python-based taps). Taps output the 3 messages to standard output, aka stdout.

Taps are required to accept 1 file, called a configuration (config) file, and can optionally accept 2 other files called state and catalog files.

  • config.json - JSON-formatted file containing any information needed for a tap to run. This can include authorization information such as a username and password, date parameters to specify when to start extracting, and anything else useful for pulling a specific set of data.
  • state.json - JSON-formatted file used to store information between runs of a tap. There is no specification for the format of a state file other than the JSON requirement. If a tap is able to accept a state file, it is expected to output state messages as well.
  • catalog.json - JSON-formatted file that specifies which streams, and which entities within those streams (such as columns or fields), to extract. It can also define how streams are replicated and include extra metadata about a particular stream.

Targets

The 3 message types are consumed by programs called targets. A target can be written in any programming language (note that Meltano will only run Python-based targets). Targets ingest the 3 messages from standard input, aka stdin.

Targets can optionally accept a configuration file if the target system requires authentication information. For a simple target like a csv file, this would not be required, but for a more complicated target like a SaaS database, the config file would be required.

Taps | Targets

Since taps and targets communicate via the Singer spec, they can be used together to move data between systems. This can be done on the command line by sending the messages from a tap to a target using a Unix pipe, |. A pipe takes the stdout of one process, in this case a tap, and redirects it to the stdin of a second process, in this case a target. This means any tap and target can be combined as simply as tap | target.
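To make the pipe concrete, here is a simulated version (the tap-foo and target-bar names are hypothetical; printf stands in for a tap and a tiny Python reader stands in for a target):

```shell
# A real invocation might look like:
#   tap-foo --config config.json | target-bar --config target_config.json
# Simulated: printf emits one RECORD message; the reader consumes stdin
# and prints each message's type.
printf '%s\n' '{"type": "RECORD", "stream": "users", "record": {"id": 1}}' \
  | python3 -c 'import sys, json; [print(json.loads(line)["type"]) for line in sys.stdin]'
```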

Details

Messages

Each of the messages has a defined schema with required and optional fields. Note that while example messages are shown here on multiple lines for readability, each message output from a tap must be on its own line.

Records

Record messages contain the actual data being extracted from a source system. Every record message must have the following properties:

  • type - this will always be RECORD
  • stream - the unique identifier of the data stream
  • record - a JSON object containing the data being extracted

Record messages can optionally have:

  • time_extracted - The time the record was observed in the source. This should be an RFC3339 formatted date-time, like "2022-11-20T16:45:33.000Z".

Putting it together, a full record message looks like this:

{
  "type": "RECORD",
  "stream": "tools",
  "time_extracted": "2021-11-20T16:45:33.000Z",
  "record": {
    "id": 1,
    "name": "Meltano",
    "active": true,
    "updated_at": "2021-10-20T16:45:33.000Z"
  }
}

Note that in the above example the message is formatted for readability, but when output from a tap the entire message will be on a single line.

Schemas

Schema messages define the structure of the data sent in a record message. Every schema message must have the following properties:

  • type - this will always be SCHEMA
  • stream - the unique identifier of the data stream. This will match the stream property in record messages
  • schema - a JSON Schema describing the record property of record messages for a given stream
  • key_properties - a list of strings indicating which properties make up the primary key for this stream. Each item in the list must be the name of a top-level property defined in the schema. An empty list may be used to indicate there is no primary key for the stream

What is a JSON Schema?

A JSON Schema is a way to annotate and validate JSON objects. The data types available in raw JSON are limited compared to the variety of types available in many targets. Within the Singer Spec, JSON schema definitions are used to tell a target the exact data type to use when storing data.

Using the record example shown previously, the JSON schema for that record could be:

{
    "properties": {
      "id": {
        "type": "integer"
      },
      "name": {
        "type": "string"
      },
      "active": {
        "type": "boolean"
      },
      "updated_at": {
        "type": "string",
        "format": "date-time"
      }
    }
}

This definition now explicitly defines what kind of data is expected in a record and how to handle it when loading the data.

Also of note, there are several different versions of JSON Schema. The most common is Draft 4, which both Meltano and the SDK support.
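To illustrate how a target might use such a schema, here is a deliberately minimal Python type check. A real target would typically use a full JSON Schema validation library; this sketch covers only the primitive types in the example above:

```python
# Maps JSON Schema primitive type names to the Python types a decoded
# record value would have. "number" accepts ints too, per JSON Schema.
JSON_TYPES = {
    "integer": (int,),
    "number": (int, float),
    "string": (str,),
    "boolean": (bool,),
}

def check_record(schema, record):
    """Return a list of property names whose values violate the schema."""
    errors = []
    for name, prop in schema.get("properties", {}).items():
        if name not in record:
            continue
        expected = JSON_TYPES.get(prop.get("type"))
        if expected and not isinstance(record[name], expected):
            errors.append(name)
    return errors

schema = {
    "properties": {
        "id": {"type": "integer"},
        "name": {"type": "string"},
        "active": {"type": "boolean"},
        "updated_at": {"type": "string", "format": "date-time"},
    }
}
record = {"id": 1, "name": "Meltano", "active": True,
          "updated_at": "2021-10-20T16:45:33.000Z"}
print(check_record(schema, record))  # [] (no violations)
```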

Optional SCHEMA message properties

Schema messages can optionally have:

  • bookmark_properties - a list of strings indicating which properties the tap is using as bookmarks. Each item in the list must be the name of a top-level property defined in the schema. This is discussed more in the bookmarks section.

Putting it together, a full schema message looks like this:

{
  "type": "SCHEMA",
  "stream": "tools",
  "schema": {
    "properties": {
      "id": {
        "type": "integer"
      },
      "name": {
        "type": "string"
      },
      "active": {
        "type": "boolean"
      },
      "updated_at": {
        "type": "string",
        "format": "date-time"
      }
    }
  },
  "key_properties": ["id"],
  "bookmark_properties": ["updated_at"]
}

Note that in the above example the message was formatted for readability, but when output from a tap the entire message will be on a single-line.

Ordering of SCHEMA and RECORD Messages

Before a tap outputs any record messages for a given data stream, it must output a schema message for that stream. If a record message arrives without a preceding schema message, the extraction will still work, but the record will be treated as schema-less and may be loaded in an unexpected manner.
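One way a consumer could enforce this ordering is to track which streams have announced a schema; this is a sketch, not part of the spec itself:

```python
import json

def check_ordering(lines):
    """Return the set of streams whose RECORD messages arrived before any SCHEMA."""
    announced = set()
    orphans = set()
    for line in lines:
        msg = json.loads(line)
        if msg["type"] == "SCHEMA":
            announced.add(msg["stream"])
        elif msg["type"] == "RECORD" and msg["stream"] not in announced:
            orphans.add(msg["stream"])
    return orphans

good = [
    '{"type": "SCHEMA", "stream": "users", "key_properties": ["id"], "schema": {}}',
    '{"type": "RECORD", "stream": "users", "record": {"id": 1}}',
]
bad = [
    '{"type": "RECORD", "stream": "users", "record": {"id": 1}}',
]
print(check_ordering(good))  # set()
print(check_ordering(bad))   # {'users'}
```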

State

State messages contain any information that a tap is designed to persist. These are used to inform the target of the current place in the extraction of a data stream. Each state message must have the following properties:

  • type - this will always be STATE
  • value - this is a JSON object of the state values to be stored

The structure of the value property is not defined by the spec; it is determined by each tap independently. However, the following structure is recommended:

{
  "bookmarks": {
    "tools": {
      "updated_at": "2021-10-20T16:45:33.000Z"
    },
    "team": {
      "id": 123
    }
  }
}

The bookmarks key should be familiar since it's an optional key in a schema message. Each property within the bookmarks JSON object is a data stream from a previously defined schema and record. Each stream maps to a JSON object storing the last data point seen in the extraction.

In the above example, the tools stream has extracted data up to the timestamp shown in the updated_at field. Similarly, the team stream has extracted up to id = 123.

Putting it together, a full state message looks like this:

{
  "type": "STATE",
  "value": {
    "bookmarks": {
      "tools": {
        "updated_at": "2021-10-20T16:45:33.000Z"
      },
      "team": {
        "id": 123
      }
    }
  }
}

Note that in the above example the message is formatted for readability, but when output from a tap the entire message will be on a single line.
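A tap that tracks bookmarks might build and emit such a state message along these lines (the stream and field names here are illustrative, not prescribed by the spec):

```python
import json
import sys

def write_state(state):
    # STATE messages are emitted as single-line JSON on stdout.
    sys.stdout.write(json.dumps({"type": "STATE", "value": state}) + "\n")

# Hypothetical records from an illustrative "tools" stream.
records = [
    {"id": 1, "updated_at": "2021-10-19T12:00:00.000Z"},
    {"id": 2, "updated_at": "2021-10-20T16:45:33.000Z"},
]

state = {"bookmarks": {}}
for record in records:
    # Advance the bookmark to the latest updated_at value seen so far.
    bookmark = state["bookmarks"].setdefault("tools", {})
    bookmark["updated_at"] = max(bookmark.get("updated_at", ""),
                                 record["updated_at"])
write_state(state)
```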

Taps

When taps are run, they can accept three files that provide information necessary for them to work properly: config, state, and catalog files. Taps are required to accept the config file and can optionally accept the state and catalog files.

Config Files

The config file contains the parameters required by the tap to successfully extract data from the source. Typically this will include credentials for an API or database connection.

There is no required specification, but it is recommended to have the following fields:

  • start_date - this is used on the first sync to define how far back in time to pull data. Start dates should conform to the RFC3339 specification.
  • user_agent - this should be an email address or other contact information, in case the API provider needs to contact you for any reason.

Putting this all together, a config file may look like:

# config.json
{
  "api_key" : "asd23ayzxz80adf",
  "start_date" : "2022-01-01T00:00:00Z",
  "user_agent" : "your_email@domain"
}

State Files

Taps can optionally use a state file to start replication from a previous point in a data stream. The structure of the state file and the state message described previously should be nearly identical. The value property in the state message will be the contents of any state.json file passed to a tap.

Using the previous example, a state file would look like this:

# state.json
{
  "bookmarks": {
    "tools": {
      "updated_at": "2021-10-20T16:45:33.000Z"
    },
    "team": {
      "id": 123
    }
  }
}
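A tap might read this file at startup and use the bookmark to filter its extraction, roughly like the following sketch (the get_bookmark helper and the rows are illustrative, not part of the spec):

```python
import json

def get_bookmark(state, stream, key, default=None):
    """Look up a saved bookmark value for a stream, or fall back to a default."""
    return state.get("bookmarks", {}).get(stream, {}).get(key, default)

# Contents of a state.json file passed to the tap.
state = json.loads('''{
  "bookmarks": {
    "tools": {"updated_at": "2021-10-20T16:45:33.000Z"},
    "team": {"id": 123}
  }
}''')

# Fall back to a configured start_date when no bookmark exists yet.
start = get_bookmark(state, "tools", "updated_at",
                     default="2022-01-01T00:00:00Z")
rows = [
    {"id": 1, "updated_at": "2021-09-01T00:00:00Z"},  # already replicated
    {"id": 2, "updated_at": "2021-12-01T00:00:00Z"},  # new since bookmark
]
new_rows = [r for r in rows if r["updated_at"] > start]
print(len(new_rows))  # 1
```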

Catalog Files

Catalog files define the structure of one or many data streams. Taps are capable of both using and generating catalog files.

The structure of a catalog file is a JSON object with a single top-level property:

  • streams - this is a list containing information for each data stream that can be extracted

Each item within the streams list is another JSON object with the following required properties:

  • stream - this is the primary identifier of the stream as it will be passed to the target (tools, team, etc.)
  • tap_stream_id - this is the unique identifier of the stream which can differ from the stream name since some sources may have multiple available streams with the same name
  • schema - this is the JSON schema of the stream, which will be passed to the target in a SCHEMA message

Optional properties within the list are:

  • table_name - this is only used for a database source and is the name of the table
  • metadata - this is a list that defines extra information about items within a stream. This is discussed more in the Metadata section below

An example catalog with a single stream and no metadata is as follows:

{
  "streams": [
    {
      "stream": "tools",
      "tap_stream_id": "tools",
      "schema": {
        "type": ["null", "object"],
        "additionalProperties": false,
        "properties": {
          "id": {
            "type": [
              "string"
            ]
          },
          "name": {
            "type": [
              "string"
            ]
          },
          "updated_at": {
            "type": [
              "string"
            ],
            "format": "date-time"
          }
        }
      }
    }
  ]
}

Discovery Mode

Discovery mode is how taps generate catalogs. When a tap is invoked with the --discover flag, it will output the full catalog of streams available for extraction to stdout. This can then be saved to a catalog.json file.

tap --config config.json --discover > catalog.json

Note that some older taps use properties.json as the catalog file.
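A tap's discovery code can be sketched as a function that builds the catalog object and dumps it to stdout. The stream and schema below are invented for illustration:

```python
import json
import sys

def discover():
    """Build a catalog describing the streams this hypothetical tap offers."""
    return {
        "streams": [
            {
                "stream": "tools",
                "tap_stream_id": "tools",
                "schema": {
                    "type": ["null", "object"],
                    "properties": {
                        "id": {"type": ["string"]},
                        "name": {"type": ["string"]},
                    },
                },
            }
        ]
    }

if __name__ == "__main__":
    # Invoked as: tap --config config.json --discover > catalog.json
    json.dump(discover(), sys.stdout, indent=2)
```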

Metadata

Metadata is the preferred method of associating extra information about streams and properties within a stream.

There are two kinds of metadata:

  • discoverable - this metadata should be written and read by a tap
  • non-discoverable - this metadata is written by other systems, such as Meltano, and should only be read by the tap

A tap is free to write any metadata it finds useful for describing fields in the schema, although several reserved keywords exist. A tap that extracts data from a database should use additional metadata to describe the properties of the database.

Non-discoverable Metadata

  • selected (All) - Either true or false. Indicates that this node in the schema has been selected by the user for replication.
  • replication-method (All) - Either FULL_TABLE, INCREMENTAL, or LOG_BASED. The replication method to use for a stream. See Data Integration for more details on replication methods.
  • replication-key (All) - The name of a property in the source to use as a bookmark. For example, this will often be an updated_at field or an auto-incrementing primary key (requires replication-method).
  • view-key-properties (Database) - List of key properties for a database view.

Discoverable Metadata

  • inclusion (All) - Either available, automatic, or unsupported. available means the field is available for selection, and the tap will only emit values for that field if it is marked with "selected": true; automatic means that the tap will emit values for the field; unsupported means that the field exists in the source data but the tap is unable to provide it.
  • selected-by-default (All) - Either true or false. Indicates if a node in the schema should be replicated if a user has not expressed any opinion on whether or not to replicate it.
  • valid-replication-keys (All) - List of the fields that could be used as replication keys.
  • forced-replication-method (All) - Used to force the replication method to either FULL_TABLE or INCREMENTAL.
  • table-key-properties (All) - List of key properties for a database table.
  • schema-name (Database) - Name of the schema.
  • is-view (Database) - Either true or false. Indicates whether a stream corresponds to a database view.
  • row-count (Database) - Number of rows in a database table/view.
  • database-name (Database) - Name of the database.
  • sql-datatype (Database) - Represents the datatype of a database column.

Each piece of metadata has two primary keys:

  • metadata - this is a JSON object containing all of the metadata for either the stream or a property of the stream
  • breadcrumb - this identifies whether the metadata applies to the entire stream or a property of the stream. An empty list means the metadata applies to the stream. For specific properties within the stream, the breadcrumb will have the properties key followed by the name of the property being described.

An example of a valid metadata object is as follows:

"metadata": [
  {
    "metadata": {
      "inclusion": "available",
      "table-key-properties": ["id"],
      "selected": true,
      "valid-replication-keys": ["date_modified"],
      "schema-name": "users",
    },
    "breadcrumb": []
  },
  {
    "metadata": {
      "inclusion": "automatic"
    },
    "breadcrumb": ["properties", "id"]
  },
  {
    "metadata": {
      "inclusion": "available",
    "selected": true
    },
    "breadcrumb": ["properties", "name"]
  },
  {
    "metadata": {
      "inclusion": "automatic"
    },
    "breadcrumb": ["properties", "updated_at"]
  }
]
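Given a metadata list in this shape, a consumer can look up the entry for a breadcrumb with a small helper; this is an illustrative sketch, not an official API:

```python
def get_metadata(metadata_list, breadcrumb):
    """Return the metadata dict whose breadcrumb matches, or an empty dict."""
    for entry in metadata_list:
        if entry["breadcrumb"] == breadcrumb:
            return entry["metadata"]
    return {}

metadata = [
    # Empty breadcrumb: metadata for the stream as a whole.
    {"metadata": {"inclusion": "available", "selected": True}, "breadcrumb": []},
    # Breadcrumb into "properties": metadata for one field.
    {"metadata": {"inclusion": "automatic"}, "breadcrumb": ["properties", "id"]},
]

print(get_metadata(metadata, []).get("selected"))                 # True
print(get_metadata(metadata, ["properties", "id"])["inclusion"])  # automatic
```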

Putting it Together

Putting this all together, a complete catalog example looks like this:

{
  "streams": [
    {
      "stream": "tools",
      "tap_stream_id": "tools",
      "schema": {
        "type": ["null", "object"],
        "additionalProperties": false,
        "properties": {
          "id": {
            "type": ["string"]
          },
          "name": {
            "type": ["string"]
          },
          "updated_at": {
            "type": ["string"],
            "format": "date-time"
          }
        }
      },
      "metadata": [
        {
          "metadata": {
            "inclusion": "available",
            "table-key-properties": ["id"],
            "selected": true,
            "valid-replication-keys": ["updated_at"]
          },
          "breadcrumb": []
        },
        {
          "metadata": {
            "inclusion": "automatic"
          },
          "breadcrumb": ["properties", "id"]
        },
        {
          "metadata": {
            "inclusion": "available",
            "selected": true
          },
          "breadcrumb": ["properties", "name"]
        },
        {
          "metadata": {
            "inclusion": "automatic"
          },
          "breadcrumb": ["properties", "updated_at"]
        }
      ]
    }
  ]
}

Note that the metadata list lives inside each entry of the streams list, alongside that stream's schema.

Metrics

A tap can periodically emit structured log messages containing metrics about read operations. Consumers of the tap logs can parse these metrics out of the logs for monitoring or analysis. Metrics appear in the log output with the following structure:

INFO METRIC: <metrics-json>

where <metrics-json> is a JSON object with the following keys:

  • type - The type of the metric, indicating how consumers of the data should interpret the value field. There are two types of metrics:
    counter - the value should be interpreted as a number that is added to a cumulative or running total
    timer - the value is the duration in seconds of some operation
  • metric - The name of the metric. This should consist only of letters, numbers, underscore, and dash characters. For example, "http_request_duration".
  • value - The value of the datapoint, either an integer or a float. For example, 1234 or 1.234.
  • tags - Mapping of tags describing the data. The keys can be any strings consisting solely of letters, numbers, underscores, and dashes. For consistency's sake, we recommend the following tags when they are relevant; note that for many metrics, some of these tags will not apply:
    endpoint - for a tap that pulls data from an HTTP API, a descriptive name for the endpoint, such as "users", "deals", or "orders"
    http_status_code - the HTTP status code, for example 200 or 500
    job_type - for a process being timed, a description of the type of job. For example, if a tap does a POST to an HTTP API to generate a report and then polls with a GET until the report is done, it could use a job type of "run_report"
    status - either "succeeded" or "failed"
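A tap could produce such a line with a small helper along these lines (a sketch, not the singer-python library's implementation; the timed operation is stubbed out):

```python
import json
import time

def format_metric(metric_type, metric, value, tags):
    """Render a metric in the 'INFO METRIC: <json>' form consumers expect."""
    payload = {"type": metric_type, "metric": metric, "value": value, "tags": tags}
    return "INFO METRIC: " + json.dumps(payload)

start = time.monotonic()
# ... the timed operation (e.g. an HTTP request) would go here ...
line = format_metric(
    "timer",
    "http_request_duration",
    round(time.monotonic() - start, 3),
    {"endpoint": "orders", "http_status_code": 200, "status": "succeeded"},
)
print(line)
```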

Here are some examples of metrics and how those metrics should be interpreted.

Timer for Successful HTTP GET
INFO METRIC: {"type": "timer", "metric": "http_request_duration", "value": 1.23, "tags": {"endpoint": "orders", "http_status_code": 200, "status": "succeeded"}}

The following is what the object looks like expanded:

{
    "type": "timer",
    "metric": "http_request_duration",
    "value": 1.23,
    "tags": {
        "endpoint": "orders",
        "http_status_code": 200,
        "status": "succeeded"
    }
}

This can be interpreted as: an HTTP request to an "orders" endpoint was made that took 1.23 seconds and succeeded with a status code of 200.

Timer for Failed HTTP GET
INFO METRIC: {"type": "timer", "metric": "http_request_duration", "value": 30.01, "tags": {"endpoint": "orders", "http_status_code": 500, "status": "failed"}}

This can be interpreted as: an HTTP request to an "orders" endpoint was made that took 30.01 seconds and failed with a status code of 500.

Counter for Records
INFO METRIC: {"type": "counter", "metric": "record_count", "value": 100, "tags": {"endpoint": "orders"}}
INFO METRIC: {"type": "counter", "metric": "record_count", "value": 100, "tags": {"endpoint": "orders"}}
INFO METRIC: {"type": "counter", "metric": "record_count", "value": 100, "tags": {"endpoint": "orders"}}
INFO METRIC: {"type": "counter", "metric": "record_count", "value": 14, "tags": {"endpoint": "orders"}}

This can be interpreted as: a total of 314 records were fetched from an "orders" endpoint.
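Aggregating counter metrics out of a log can be sketched as follows (the log lines mirror the examples above; the helper itself is illustrative):

```python
import json

PREFIX = "INFO METRIC: "

def total_records(log_lines, endpoint):
    """Sum record_count counter values for one endpoint across log lines."""
    total = 0
    for line in log_lines:
        if not line.startswith(PREFIX):
            continue  # not a metric line
        metric = json.loads(line[len(PREFIX):])
        if (metric["type"] == "counter"
                and metric["metric"] == "record_count"
                and metric["tags"].get("endpoint") == endpoint):
            total += metric["value"]
    return total

log = [
    'INFO METRIC: {"type": "counter", "metric": "record_count", "value": 100, "tags": {"endpoint": "orders"}}',
    'INFO METRIC: {"type": "counter", "metric": "record_count", "value": 14, "tags": {"endpoint": "orders"}}',
]
print(total_records(log, "orders"))  # 114
```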

Log Output

Metric messages are interspersed with the three primary message types, so parsing them should be handled programmatically. This is an example of what realistic log output might look like:

INFO Using API Token authentication.
INFO tickets: Skipping - not selected
{"type": "SCHEMA", "stream": "groups", "schema": {"properties": {"name": {"type": ["string"]}, "created_at": {"format": "date-time", "type": ["string"]}, "url": {"type": ["string"]}, "updated_at": {"format": "date-time", "type": ["string"]}, "deleted": {"type": ["boolean"]}, "id": {"type": ["integer"]}}, "type": ["object"]}, "key_properties": ["id"]}
INFO groups: Starting sync
INFO METRIC: {"type": "timer", "metric": "http_request_duration", "value": 0.6276309490203857, "tags": {"status": "succeeded"}}
{"type": "RECORD", "stream": "groups", "record": {"id": 360007960773, "updated_at": "2020-01-09T09:57:16.000000Z"}}
{"type": "STATE", "value": {"bookmarks": {"groups": {"updated_at": "2020-01-09T09:57:16Z"}}}}

Targets

When targets are run, they can accept a single config file that provides the information necessary for them to work properly.

Config Files

Similar to taps, targets take a configuration file. There is no specification for the structure of a target config file beyond the requirement that it be JSON-formatted.

State Files

Unlike taps, targets do not take a state file. Targets are expected to read state messages from stdin, but typically they do nothing with them beyond writing them to stdout. A state message is only emitted once all data that appeared in the stream before that state message has been processed by the target.
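This behavior can be sketched as a minimal target loop (persisting records is stubbed out; a real target would write them to the destination system before echoing state):

```python
import json
import sys

def flush(records):
    # Stand-in for loading records into the destination system.
    pass

def run_target(stdin=sys.stdin, stdout=sys.stdout):
    """Read Singer messages; 'load' records, then echo STATE once processed."""
    buffered = []
    for line in stdin:
        msg = json.loads(line)
        if msg["type"] == "RECORD":
            buffered.append(msg["record"])
        elif msg["type"] == "STATE":
            # All records before this STATE are processed; flush, then emit it.
            flush(buffered)
            buffered.clear()
            stdout.write(json.dumps(msg["value"]) + "\n")
```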

Schema Files

Targets do not take a schema file. However, they are expected to read the schema messages from stdin and validate incoming records against the provided schema.