Salmon data package specification
Canonical markdown source: https://github.com/dfo-pacific-science/smn-data-pkg/blob/main/SPECIFICATION.md
This page provides a DSU-hosted copy and overview. For canonical field definitions and version history, use the GitHub source above.
This page is a schema reference. For the actual package creation, review, and validation workflow, use the maintained metasalmon docs.
1 Overview
The salmon data package (SDP) is a lightweight, frictionless-style specification for exchanging salmon datasets between scientists, assessment biologists, and data stewards.
It is designed to:
- be simple to adopt with Excel and CSV files
- be tool-friendly for R/Python packages and guided assistants
- be ontology-aware, linking columns and codes to the DFO Salmon Ontology and related vocabularies
- support package-first operational workflows such as SPSR intake without forcing one rigid dataset schema
Each salmon data package instance is a small directory of data files plus a set of metadata CSVs that describe the dataset, tables, columns, and controlled codes.
2 Design Goals
- Interoperable but flexible: support multiple schemas (FSAR, SPSR-like, project-specific) while providing shared semantics via IRIs
- Frictionless-compatible: align conceptually with frictionless and datapackage.json patterns so existing tooling can be reused or extended
- Ontology-linked: allow columns and codes to reference DFO Salmon Ontology and SKOS vocab IRIs
- Incremental adoption: existing tables can be wrapped in a package with minimal changes
- Machine- and human-readable: scientists can work in spreadsheets; machines can consume the same metadata
3 Package Layout
A salmon data package consists of:
- one root directory (for example, sdp_myproject/)
- an optional root datapackage.json for tooling that expects a frictionless-style entry file
- a metadata/ directory containing:
  - dataset.csv: dataset-level metadata
  - tables.csv: one row per logical table
  - column_dictionary.csv: one row per column in each table
  - codes.csv: optional controlled value lists and SKOS links
- a data/ directory containing one or more data tables referenced from metadata/tables.csv
Minimal example layout:
sdp_myproject/
├── datapackage.json # optional
├── metadata/
│ ├── dataset.csv
│ ├── tables.csv
│ ├── column_dictionary.csv
│ └── codes.csv # optional
└── data/
├── your_table_1.csv
└── your_table_2.csv
The salmon data package specification does not prescribe a single canonical salmon schema. Instead, it standardizes the metadata about whatever schema a project uses and ties it to ontology terms and vocabularies.
Older examples that place metadata CSVs at the package root should be treated as legacy layout. Use metadata/ + data/ for current work.
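The layout above can be scaffolded and checked with a few lines of Python. This is a minimal sketch, not part of the specification: the helper names scaffold_package and missing_paths are hypothetical, and the list of required paths is an assumption read from the layout (codes.csv and datapackage.json are treated as optional).

```python
from pathlib import Path
import tempfile

# Paths taken from the layout above; codes.csv and datapackage.json are optional.
REQUIRED_PATHS = [
    "metadata/dataset.csv",
    "metadata/tables.csv",
    "metadata/column_dictionary.csv",
    "data",
]


def scaffold_package(root):
    """Create the metadata/ + data/ skeleton with empty required metadata files."""
    root = Path(root)
    (root / "metadata").mkdir(parents=True, exist_ok=True)
    (root / "data").mkdir(parents=True, exist_ok=True)
    for rel in REQUIRED_PATHS:
        if rel.endswith(".csv"):
            (root / rel).touch()  # empty placeholder; headers are added later
    return root


def missing_paths(root):
    """List required paths that do not exist under the package root."""
    root = Path(root)
    return [rel for rel in REQUIRED_PATHS if not (root / rel).exists()]


with tempfile.TemporaryDirectory() as tmp:
    empty_report = missing_paths(Path(tmp) / "not_a_package")
    pkg = scaffold_package(Path(tmp) / "sdp_myproject")
    scaffold_report = missing_paths(pkg)

print(empty_report)     # every required path is reported as missing
print(scaffold_report)  # []
```

A check like missing_paths is also a quick way to spot legacy packages that still keep metadata CSVs at the root.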
4 Identifiers and Conventions
- dataset_id
  - Short string identifier unique within your organization or project (for example, fsar_spsr_chinook_2025)
  - Used to link rows across metadata/dataset.csv, metadata/tables.csv, metadata/column_dictionary.csv, and metadata/codes.csv
- table_id
  - Short ID unique within a dataset (for example, cu_year_index, survey_events)
- column_name
  - Exact column name as it appears in the data file (for example, TOTAL_SPAWNERS, run_size_total)
- URIs / IRIs
  - Fields such as dataset_iri, entity_iri, term_iri, concept_scheme_iri, and unit_iri should use persistent HTTP IRIs where available
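Because dataset_id and table_id act as join keys across the metadata files, a simple referential check can catch typos early. The sketch below is illustrative only: the function orphan_columns is hypothetical, and the inline CSV content is made-up sample data that reuses the example identifiers from this section.

```python
import csv
import io

# Made-up sample content for metadata/tables.csv and metadata/column_dictionary.csv.
tables_csv = """dataset_id,table_id,file_name,table_label,description
fsar_spsr_chinook_2025,cu_year_index,data/cu_year_index.csv,CU year index,One row per CU and year
"""

column_dictionary_csv = """dataset_id,table_id,column_name
fsar_spsr_chinook_2025,cu_year_index,TOTAL_SPAWNERS
fsar_spsr_chinook_2025,survey_events,survey_event_id
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by header name."""
    return list(csv.DictReader(io.StringIO(text)))


def orphan_columns(tables, columns):
    """Return (table_id, column_name) pairs whose table is not declared in tables.csv."""
    known = {(row["dataset_id"], row["table_id"]) for row in tables}
    return [
        (row["table_id"], row["column_name"])
        for row in columns
        if (row["dataset_id"], row["table_id"]) not in known
    ]


orphans = orphan_columns(read_rows(tables_csv), read_rows(column_dictionary_csv))
print(orphans)  # [('survey_events', 'survey_event_id')]
```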
5 metadata/dataset.csv Schema
One row per logical dataset (often one per salmon data package directory). It can describe multiple related tables.
Columns (the Required column marks which are mandatory):
| Column | Type | Required | Description |
|---|---|---|---|
| dataset_id | string | yes | Stable identifier used to join to other metadata files. |
| title | string | yes | Human-readable dataset title. |
| description | string | yes | Short description of the dataset contents and purpose. |
| creator | string | yes | Name(s) of primary creator(s) or project. |
| contact_name | string | yes | Primary contact person. |
| contact_email | string | yes | Contact email address. |
| license | string | yes | License name or URL (for example, CC-BY-4.0). |
| temporal_start | date | no | Start date or year covered by the dataset (ISO 8601 where possible). |
| temporal_end | date | no | End date or year covered by the dataset. |
| spatial_extent | string | no | Textual description of spatial coverage. |
| dataset_type | string | no | High-level type (for example, cu_year_index, survey_timeseries). |
| dataset_iri | string | no | IRI for this dataset in a catalog or knowledge graph. |
| source_citation | string | no | Free-text citation for publications, reports, or internal docs. |
Optional additional columns (implementation-specific):
- provenance_note: narrative about data lineage
- created / modified: timestamps
- spec_version: package spec version tag when needed
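As an illustration of the required dataset-level fields, the following sketch flags blank required values in a metadata/dataset.csv row. The row content is a placeholder example, and missing_required_fields is a hypothetical helper rather than part of any existing package.

```python
import csv
import io

# Columns marked "yes" in the schema table for metadata/dataset.csv.
REQUIRED_DATASET_COLUMNS = [
    "dataset_id",
    "title",
    "description",
    "creator",
    "contact_name",
    "contact_email",
    "license",
]

# Placeholder row; contact_email is deliberately left blank.
dataset_csv = """dataset_id,title,description,creator,contact_name,contact_email,license,temporal_start
fsar_spsr_chinook_2025,Example Chinook index,Illustrative dataset,Example Project,Jane Doe,,CC-BY-4.0,2025
"""


def missing_required_fields(row):
    """Return required column names that are absent or blank in one row."""
    return [c for c in REQUIRED_DATASET_COLUMNS if not (row.get(c) or "").strip()]


rows = list(csv.DictReader(io.StringIO(dataset_csv)))
report = {row["dataset_id"]: missing_required_fields(row) for row in rows}
print(report)  # {'fsar_spsr_chinook_2025': ['contact_email']}
```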
6 metadata/tables.csv Schema
One row per table in the package.
Columns (the Required column marks which are mandatory):
| Column | Type | Required | Description |
|---|---|---|---|
| dataset_id | string | yes | References dataset_id in metadata/dataset.csv. |
| table_id | string | yes | Short ID for the table (for example, cu_year_index). |
| file_name | string | yes | Relative path to the data file (for example, data/cu_year_index.csv). |
| table_label | string | yes | Human-readable label for the table. |
| description | string | yes | Description of what each row represents and how the table is used. |
| entity_type | string | no | Human-readable entity type (for example, CU-year index, survey event). |
| entity_iri | string | no | IRI of ontology class representing the row-level entity, if applicable. |
| primary_key | string | no | Comma-separated list of columns forming a primary key. |
Notes:
- entity_iri should reference a class in the DFO Salmon Ontology when available.
- primary_key is advisory; it guides downstream validation and graph loading.
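One way downstream tooling might use the advisory primary_key is to split the comma-separated list and look for duplicate key tuples in the data table. This is a sketch under assumptions: the key value, the sample rows, and the helpers parse_primary_key and duplicate_keys are all hypothetical.

```python
import csv
import io

# primary_key as it might appear in metadata/tables.csv (hypothetical value).
primary_key = "cu_id, return_year"

# Made-up data table with one repeated (cu_id, return_year) pair.
data_csv = """cu_id,return_year,TOTAL_SPAWNERS
CK-01,2023,1200
CK-01,2024,1350
CK-01,2024,1400
"""


def parse_primary_key(value):
    """Split a comma-separated primary_key string into clean column names."""
    return [name.strip() for name in value.split(",") if name.strip()]


def duplicate_keys(rows, key_columns):
    """Return key tuples that occur more than once, in order of detection."""
    seen = set()
    dupes = []
    for row in rows:
        key = tuple(row[name] for name in key_columns)
        if key in seen:
            dupes.append(key)
        seen.add(key)
    return dupes


key_columns = parse_primary_key(primary_key)
dupes = duplicate_keys(list(csv.DictReader(io.StringIO(data_csv))), key_columns)
print(key_columns)  # ['cu_id', 'return_year']
print(dupes)        # [('CK-01', '2024')]
```

Note that a multi-column primary_key contains commas, so the cell has to be quoted when it is written to tables.csv.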
7 metadata/column_dictionary.csv Schema
One row per column in each table. This is the core of how the salmon data package links data columns to ontology terms and vocabularies.
Columns (the Required column marks which are mandatory):
| Column | Type | Required | Description |
|---|---|---|---|
| dataset_id | string | yes | References dataset_id in metadata/dataset.csv. |
| table_id | string | yes | References table_id in metadata/tables.csv. |
| column_name | string | yes | Exact column name in the data file. |
| column_label | string | yes | Short human-readable label. |
| column_description | string | yes | Clear definition of the column’s meaning. |
| column_role | string | yes | One of: identifier, attribute, temporal, categorical, measurement. |
| value_type | string | yes | Basic type: integer, number, string, boolean, date, datetime. |
| required | boolean | no | TRUE if the column is required for each row, otherwise FALSE or blank. |
| unit_label | string | no | Human-readable unit label (for example, number of fish, proportion). |
| unit_iri | string | no | IRI for the unit. |
| term_iri | string | no | IRI for the ontology term from the DFO Salmon Ontology. |
| term_type | string | no | Type of ontology term (for example, owl_class, owl_object_property, skos_concept). |
column_role quick definitions:
| Role | When to use it | Example columns |
|---|---|---|
| identifier | Keys and IDs that identify rows or entities | cu_id, survey_event_id, record_id |
| measurement | Quantitative or measured values | spawner_count, run_size, exploitation_rate |
| temporal | Time and date fields | return_year, sample_date, datetime |
| categorical | Controlled codes or enumerated classes | status_code, species_code, run_timing |
| attribute | Descriptive attributes that are not IDs, time, or measurements | waterbody_name, observer_name, population_name |
Guidance:
- Use column_role = "identifier" for identifiers such as cu_id, survey_event_id, POP_ID.
- Use column_role = "measurement" for numeric quantities such as escapement or run size.
- Use column_role = "temporal" for time-related columns such as year or date.
- Use column_role = "categorical" for columns with controlled vocabularies or enumerated values.
- Use column_role = "attribute" for descriptive attributes that do not fit the other roles.
- term_iri should reference a term in the DFO Salmon Ontology when available.
- term_type indicates the type of ontology term.
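The allowed column_role and value_type values listed in the schema table lend themselves to a simple membership check. The sketch below is an assumption-laden illustration, not the metasalmon validator: check_column_entry is a hypothetical helper and the entries are made-up rows.

```python
# Allowed values as listed in the column_dictionary.csv schema table.
ALLOWED_ROLES = {"identifier", "attribute", "temporal", "categorical", "measurement"}
ALLOWED_VALUE_TYPES = {"integer", "number", "string", "boolean", "date", "datetime"}


def check_column_entry(row):
    """Return a list of problems found in one column_dictionary.csv row."""
    problems = []
    if row.get("column_role") not in ALLOWED_ROLES:
        problems.append(f"unknown column_role: {row.get('column_role')!r}")
    if row.get("value_type") not in ALLOWED_VALUE_TYPES:
        problems.append(f"unknown value_type: {row.get('value_type')!r}")
    return problems


# Made-up entries; the last one uses a misspelled role and a value_type
# that is not in the allowed list.
entries = [
    {"column_name": "TOTAL_SPAWNERS", "column_role": "measurement", "value_type": "number"},
    {"column_name": "return_year", "column_role": "temporal", "value_type": "integer"},
    {"column_name": "run_size_total", "column_role": "measure", "value_type": "double"},
]

report = {entry["column_name"]: check_column_entry(entry) for entry in entries}
for name, problems in report.items():
    print(name, problems)
```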
8 metadata/codes.csv Schema
Optional file used for columns that have controlled enumerated values (status codes, methods, gear, and so on). This is where SKOS vocabularies plug in.
Each row corresponds to one code value for one column.
Columns (the Required column marks which are mandatory):
| Column | Type | Required | Description |
|---|---|---|---|
| dataset_id | string | yes | References dataset_id in metadata/dataset.csv. |
| table_id | string | yes | References table_id in metadata/tables.csv. |
| column_name | string | yes | Name of the column in the data file that uses this code. |
| code_value | string | yes | Stored value in the data. |
| code_label | string | no | Human-readable label corresponding to the code. |
| code_description | string | no | Longer description of what the code means. |
| concept_scheme_iri | string | no | IRI of the SKOS concept scheme. |
| term_iri | string | no | IRI of the specific ontology term represented by code_value. |
| term_type | string | no | Type of ontology term (for example, skos_concept, owl_class). |
Guidance:
- code_value must match exactly the values present in the data table.
- code_label and code_description can be used to auto-generate documentation.
- concept_scheme_iri identifies the SKOS concept scheme containing the controlled vocabulary.
- term_iri and term_type align each coded value with the broader ontology.
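The exact-match rule for code_value can be checked by comparing the distinct values in a data column with the codes declared for it. In this sketch the status codes, the sample rows, and the helper undeclared_values are hypothetical; the comparison is case-sensitive on purpose, because the guidance asks for an exact match.

```python
import csv
import io

# Made-up codes.csv content (optional columns omitted for brevity).
codes_csv = """dataset_id,table_id,column_name,code_value,code_label
fsar_spsr_chinook_2025,cu_year_index,status_code,G,Green
fsar_spsr_chinook_2025,cu_year_index,status_code,A,Amber
fsar_spsr_chinook_2025,cu_year_index,status_code,R,Red
"""

# Made-up data table; "a" differs from the declared "A" only by case.
data_csv = """cu_id,status_code
CK-01,G
CK-02,a
CK-03,R
"""


def undeclared_values(data_rows, code_rows, table_id, column_name):
    """Return sorted data values that have no exactly matching code_value."""
    allowed = {
        row["code_value"]
        for row in code_rows
        if row["table_id"] == table_id and row["column_name"] == column_name
    }
    return sorted({row[column_name] for row in data_rows} - allowed)


data_rows = list(csv.DictReader(io.StringIO(data_csv)))
code_rows = list(csv.DictReader(io.StringIO(codes_csv)))
unknown = undeclared_values(data_rows, code_rows, "cu_year_index", "status_code")
print(unknown)  # ['a']
```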
9 Relationship to SPSR Intake
For current operational workflows, the salmon data package is the package-first staging format for SPSR contributions.
Working expectation:
- prepare one salmon data package first
- choose the SPSR route scope that matches the table grain
- derive wizard or bulk upload files from the same package
- keep upload files and package metadata versioned together
At minimum, route-scoped support currently covers:
- CU/composite
- SMU
- Population
10 Relationship to Ontology, R Package, and Guided Assistants
- The ontology defines the formal semantics for:
  - measurement types, entities, dimensions, and concepts (via term_iri and term_type)
  - entities (entity_iri in metadata/tables.csv)
  - concept schemes (concept_scheme_iri in metadata/codes.csv)
- The R package can:
  - read salmon data package metadata and data files
  - validate packages against this specification
  - join in ontology and vocabulary information
  - help reshape data into analysis-friendly formats
- Guided assistants can:
  - help users draft metadata/column_dictionary.csv and metadata/codes.csv
  - suggest ontology-aligned IRIs and term types
  - assist in decomposing compound terms into multiple columns with appropriate roles and IRIs
  - propose package structures compliant with the salmon data package specification
11 Validation Workflow
Run validation locally before publishing or intake.
Generic R path:
metasalmon::validate_salmon_datapackage(
  "path/to/salmon-data-package",
  require_iris = FALSE
)

Use require_iris = TRUE when you want the pre-flight check to fail on missing measurement IRIs.
If you need the maintained end-to-end review loop around this validator, use the metasalmon workflow docs rather than duplicating that procedure here.
Project-specific helper scripts should live in the repo that owns the workflow (for example, the CU escapement cookbook), not in the FADS Open Science Hub repo.
Use value_type = "number" for numeric measurements such as escapement counts. Older examples that say double should be updated before running the current validator.
12 Versioning and Backwards Compatibility
- Include a spec_version field in metadata/dataset.csv when needed.
- New columns should be added in a backwards-compatible way.
- Breaking changes to required columns or semantics should bump the major version.
13 Implementation Notes
- CSV files should use UTF-8 encoding and a comma delimiter by default.
- Dates should follow ISO 8601 where possible (for example, YYYY-MM-DD); years can be encoded as integers.
- R and Python tooling should treat missing optional columns as NA or null without failing.
- The specification is intentionally minimal; domain-specific extensions (for example, SPSR route profiles) can be layered on top as recommended profiles.
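The notes on UTF-8 encoding and tolerant handling of optional columns could look like this in Python. It is a minimal sketch using only the standard library: read_tables_metadata is a hypothetical helper, and the file content is made-up sample data in which entity_type and entity_iri are absent and primary_key is blank.

```python
import csv
import tempfile
from pathlib import Path

# Columns marked "no" in the metadata/tables.csv schema table.
OPTIONAL_TABLE_COLUMNS = ["entity_type", "entity_iri", "primary_key"]


def read_tables_metadata(path):
    """Read tables.csv as UTF-8 and fill absent or blank optional columns with None."""
    with open(path, newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))
    for row in rows:
        for column in OPTIONAL_TABLE_COLUMNS:
            # row.get() covers an absent column; "or None" covers a blank cell
            row[column] = row.get(column) or None
    return rows


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "tables.csv"
    path.write_text(
        "dataset_id,table_id,file_name,table_label,description,primary_key\n"
        "fsar_spsr_chinook_2025,cu_year_index,data/cu_year_index.csv,"
        "CU year index,One row per CU and year,\n",
        encoding="utf-8",
    )
    tables = read_tables_metadata(path)

print(tables[0]["entity_iri"])   # None
print(tables[0]["primary_key"])  # None
```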