Salmon Data Exchange Package Specification
1 Overview
The Salmon Data Exchange Package (SDEP) is a lightweight, frictionless-style specification for exchanging salmon datasets between scientists, assessment biologists, and data stewards.
It is designed to:
- Be simple to adopt with Excel and CSV files.
- Be tool-friendly for R/Python packages and custom GPT assistants.
- Be ontology-aware, linking columns and codes to the DFO Salmon Ontology and related vocabularies.
- Be compatible with frictionless data packages and Darwin Core–style semantic layers without forcing a single rigid schema.
Each SDEP instance is a small directory of CSV data files plus a set of metadata CSVs that describe the dataset, tables, columns, and controlled codes.
2 Design Goals
- Interoperable but flexible: support multiple schemas (FSAR, SPSR-like, project-specific) while providing shared semantics via URIs.
- Frictionless-compatible: align conceptually with frictionless/datapackage ideas so existing tooling can be reused or extended.
- Ontology-linked: allow columns and codes to reference DFO Salmon Ontology and SKOS vocab IRIs.
- Incremental adoption: existing tables can be wrapped in a package with minimal changes (no need to refactor everything at once).
- Machine- and human-readable: scientists can work in spreadsheets; machines can consume the same metadata.
3 Package Layout
A Salmon Data Exchange Package consists of:
- One root directory (e.g., SDEP_myproject/) containing:
  - dataset.csv – dataset-level metadata.
  - tables.csv – one row per logical table.
  - column_dictionary.csv – one row per column in each table.
  - codes.csv – optional: controlled value lists and SKOS links.
- One or more data files (typically CSV) referenced from tables.csv, usually in a data/ subdirectory.
Minimal example layout:
- dataset.csv
- tables.csv
- column_dictionary.csv
- codes.csv (optional for fully numeric or unconstrained columns)
- data/your_table_1.csv
- data/your_table_2.csv
- etc.
The SDEP spec does not prescribe a single canonical salmon schema. Instead, it standardizes the metadata about whatever schema a project uses and ties it to ontology and vocabularies.
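A layout check along these lines could be sketched as follows. This is a minimal illustration, assuming the metadata file names listed in the layout; the function name missing_metadata_files is hypothetical and not part of any existing SDEP tooling.

```python
from pathlib import Path

# Metadata files named in the package layout. codes.csv is optional,
# so it is deliberately left out of the required set.
REQUIRED_METADATA = ("dataset.csv", "tables.csv", "column_dictionary.csv")


def missing_metadata_files(package_dir):
    """Return the required metadata files absent from package_dir, in spec order."""
    root = Path(package_dir)
    return [name for name in REQUIRED_METADATA if not (root / name).is_file()]
```

An empty list means the three required metadata files are present; it says nothing about whether their contents are valid.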
4 Identifiers and Conventions
- dataset_id
  - Short string identifier unique within your organization or project (e.g., fsar_spsr_chinook_2025).
  - Used to link rows across dataset.csv, tables.csv, column_dictionary.csv, and codes.csv.
- table_id
  - Short ID unique within a dataset (e.g., cu_year_index, survey_events).
- column_name
  - Exact column name as it appears in the data file (e.g., TOTAL_SPAWNERS, run_size_total).
- URIs / IRIs
  - Fields such as dataset_iri, entity_iri, term_iri, concept_scheme_iri, and unit_iri should use persistent HTTP IRIs where available (e.g., w3id for DFO Salmon Ontology terms, vocabulary concepts, and units).
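Because dataset_id is the join key across the metadata files, a simple referential check can catch typos early. The sketch below assumes each metadata file has already been read into a list of dictionaries keyed by column name; unknown_dataset_ids is a hypothetical helper name.

```python
def unknown_dataset_ids(dataset_rows, child_rows):
    """Return dataset_id values used in child_rows but not declared in dataset_rows.

    dataset_rows: rows from dataset.csv.
    child_rows: rows from tables.csv, column_dictionary.csv, or codes.csv.
    """
    declared = {row["dataset_id"] for row in dataset_rows}
    used = {row["dataset_id"] for row in child_rows}
    return sorted(used - declared)
```

The same pattern would apply to table_id, comparing tables.csv against column_dictionary.csv and codes.csv.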
5 dataset.csv Schema
One row per logical dataset (often one per SDEP directory). A single dataset can describe multiple related tables.
Columns:
| Column | Type | Required | Description |
|---|---|---|---|
| dataset_id | string | yes | Stable identifier used to join to other metadata files. |
| title | string | yes | Human-readable dataset title. |
| description | string | yes | Short description of the dataset contents and purpose. |
| creator | string | yes | Name(s) of primary creator(s) or project. |
| contact_name | string | yes | Primary contact person. |
| contact_email | string | yes | Contact email address. |
| license | string | yes | License name or URL (e.g., CC-BY-4.0). |
| temporal_start | date | no | Start date or year covered by the dataset (ISO 8601 where possible). |
| temporal_end | date | no | End date or year covered by the dataset. |
| spatial_extent | string | no | Textual description of spatial coverage (e.g., CUs, regions, coordinates). |
| dataset_type | string | no | High-level type (e.g., cu_year_index, survey_timeseries, benchmark). |
| dataset_iri | string | no | IRI for this dataset in a catalog or knowledge graph. |
| source_citation | string | no | Free-text citation for publications, reports, or internal docs. |
Optional additional columns (implementation-specific):
- provenance_note – narrative about data lineage.
- created / modified – timestamps.
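As an illustration of how the required dataset.csv columns might be checked, the following sketch uses only the Python standard library. The function name check_dataset_rows and the wording of its messages are assumptions made for this example, not part of the specification.

```python
import csv
import io

# Columns marked "yes" in the Required column of the dataset.csv schema.
DATASET_REQUIRED = (
    "dataset_id", "title", "description", "creator",
    "contact_name", "contact_email", "license",
)


def check_dataset_rows(csv_text):
    """Return a list of problems found in the text of a dataset.csv file."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    problems = [f"missing column: {c}" for c in DATASET_REQUIRED if c not in header]
    for n, row in enumerate(reader, start=1):
        for c in DATASET_REQUIRED:
            # Only report empty values for columns that exist in the header.
            if c in header and not (row[c] or "").strip():
                problems.append(f"data row {n}: empty {c}")
    return problems
```

Optional columns such as dataset_iri are ignored here, so their absence is never reported.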
6 tables.csv Schema
One row per table in the package.
Columns:
| Column | Type | Required | Description |
|---|---|---|---|
| dataset_id | string | yes | References dataset_id in dataset.csv. |
| table_id | string | yes | Short ID for the table (e.g., cu_year_index). |
| file_name | string | yes | Relative path to the data file (e.g., data/cu_year_index.csv). |
| table_label | string | yes | Human-readable label for the table. |
| description | string | yes | Description of what each row represents and how the table is used. |
| entity_type | string | no | Human-readable entity type (e.g., CU-year index, survey event). |
| entity_iri | string | no | IRI of ontology class representing the row-level entity, if applicable. |
| primary_key | string | no | Comma-separated list of columns forming a primary key (e.g., cu_id,year). |
Notes:
- entity_iri should reference a class in the DFO Salmon Ontology (e.g., a CU-year index class, survey event class) when available.
- primary_key is advisory; it guides downstream validation and graph loading.
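One way downstream validation could use the advisory primary_key value is to look for repeated key tuples in the data table. This is a minimal sketch, assuming the data rows are already loaded as dictionaries; duplicate_keys is a hypothetical helper name.

```python
from collections import Counter


def duplicate_keys(rows, primary_key):
    """Return key tuples that occur more than once in rows.

    rows: data table rows as dictionaries keyed by column name.
    primary_key: the comma-separated string from tables.csv (e.g., "cu_id,year").
    """
    key_columns = [c.strip() for c in primary_key.split(",") if c.strip()]
    if not key_columns:
        # No key declared, so there is nothing to check.
        return []
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)
```

Since the key is advisory, a tool would more plausibly report duplicates as warnings than reject the package.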
7 column_dictionary.csv Schema
One row per column in each table. This is the core of how SDEP links data columns to ontology and vocabularies.
Columns:
| Column | Type | Required | Description |
|---|---|---|---|
| dataset_id | string | yes | References dataset_id in dataset.csv. |
| table_id | string | yes | References table_id in tables.csv. |
| column_name | string | yes | Exact column name in the data file. |
| column_label | string | yes | Short human-readable label. |
| column_description | string | yes | Clear definition of the column’s meaning. |
| column_role | string | yes | One of: identifier, attribute, temporal, categorical, measurement. |
| value_type | string | yes | Basic type: integer, double, string, boolean, date, datetime. |
| required | boolean | no | TRUE if the column is required for each row, otherwise FALSE or blank. |
| unit_label | string | no | Human-readable unit label (e.g., number of fish, proportion). |
| unit_iri | string | no | IRI for the unit (e.g., UCUM or other unit ontology). |
| term_iri | string | no | IRI for the ontology term (e.g., measurement type, dimension, concept) from the DFO Salmon Ontology. |
| term_type | string | no | Type of ontology term (e.g., owl_class, owl_object_property, skos_concept). |
Guidance:
- Use column_role = "identifier" for identifiers such as cu_id, survey_event_id, POP_ID.
- Use column_role = "measurement" for numeric quantities (e.g., escapement, run size, exploitation rate).
- Use column_role = "temporal" for time-related columns (e.g., year, date, datetime).
- Use column_role = "categorical" for columns with controlled vocabularies or enumerated values (e.g., species, run type, status codes).
- Use column_role = "attribute" for descriptive attributes that are not identifiers, measurements, temporal, or categorical (e.g., population name, waterbody name).
- term_iri should reference a term in the DFO Salmon Ontology (e.g., measurement types, dimensions, concepts) when available.
- term_type indicates the type of ontology term (e.g., owl_class for classes, owl_object_property for properties, skos_concept for SKOS concepts).
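The enumerated values for column_role and value_type lend themselves to a direct check. The sketch below takes its allowed values from the schema table; the function name check_dictionary_row and its message format are assumptions for illustration.

```python
# Allowed values as listed in the column_dictionary.csv schema table.
ALLOWED_ROLES = {"identifier", "attribute", "temporal", "categorical", "measurement"}
ALLOWED_TYPES = {"integer", "double", "string", "boolean", "date", "datetime"}


def check_dictionary_row(row):
    """Return problems found in one column_dictionary.csv row (a dict of strings)."""
    problems = []
    role = row.get("column_role", "")
    value_type = row.get("value_type", "")
    if role not in ALLOWED_ROLES:
        problems.append(f"unknown column_role: {role!r}")
    if value_type not in ALLOWED_TYPES:
        problems.append(f"unknown value_type: {value_type!r}")
    return problems
```

The comparison is case-sensitive, which is a choice made for this sketch; the specification does not say whether values such as "Measurement" should be accepted.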
8 codes.csv Schema
Optional file used for columns that have controlled enumerated values (status codes, methods, gear, etc.). This is where SKOS vocabularies plug in.
Each row corresponds to one code value for one column.
Columns:
| Column | Type | Required | Description |
|---|---|---|---|
| dataset_id | string | yes | References dataset_id in dataset.csv. |
| table_id | string | yes | References table_id in tables.csv. |
| column_name | string | yes | Name of the column in the data file that uses this code. |
| code_value | string | yes | Stored value in the data (e.g., R, Y, G, critical, adipose_intact). |
| code_label | string | no | Human-readable label corresponding to the code. |
| code_description | string | no | Longer description of what the code means. |
| concept_scheme_iri | string | no | IRI of the SKOS concept scheme (e.g., the DFO Salmon Status scheme). |
| term_iri | string | no | IRI of the specific ontology term (SKOS concept or other) that code_value represents. |
| term_type | string | no | Type of ontology term (e.g., skos_concept, owl_class). |
Guidance:
- code_value must match exactly the values present in the data table.
- code_label and code_description can be used to auto-generate documentation.
- concept_scheme_iri identifies the SKOS concept scheme containing the controlled vocabulary.
- term_iri and term_type enable alignment with the larger salmon ontology and SKOS vocabularies, linking each code value to its corresponding ontology term.
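The exact-match rule for code_value can be checked by comparing the values observed in a data column against the codes declared for it. This is a sketch under two assumptions: rows are already loaded as dictionaries, and empty strings in the data are treated as missing values rather than as codes. The helper name undeclared_codes is hypothetical.

```python
def undeclared_codes(data_rows, code_rows, table_id, column_name):
    """Return values in a data column that have no matching code_value in codes.csv."""
    declared = {
        row["code_value"]
        for row in code_rows
        if row["table_id"] == table_id and row["column_name"] == column_name
    }
    # Assumption: an empty string means "no value", not an undeclared code.
    observed = {row[column_name] for row in data_rows if row[column_name] != ""}
    return sorted(observed - declared)
```

The comparison is exact, so "r" and "R" are treated as different codes, in line with the guidance that code_value must match the stored data values exactly.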
9 Relationship to Ontology, R Package, and GPT
- The ontology defines the formal semantics for:
  - measurement types, dimensions, and concepts (via term_iri and term_type),
  - entities (entity_iri in tables.csv),
  - concept schemes (concept_scheme_iri in codes.csv).
- The R package will:
  - read SDEP metadata and data files,
  - validate packages against this specification,
  - join in ontology/vocabulary information,
  - help reshape data into analysis-friendly tidy formats.
- The custom GPT will:
  - help users draft column_dictionary.csv and codes.csv,
  - suggest ontology-aligned IRIs and term types,
  - assist in decomposing compound terms into multiple columns with appropriate roles and IRIs,
  - propose SDEP-compliant package structures for new projects.
10 Versioning and Backwards Compatibility
- Include a spec_version field in dataset.csv in future iterations if needed (e.g., sdep-0.1.0).
- New columns should be added in a backwards-compatible way:
  - existing tooling must continue to work when extra columns are present.
- Breaking changes to required columns or semantics should bump the major version.
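If spec_version is adopted, tooling could compare major versions to decide whether a package is readable. The sketch below assumes the sdep-MAJOR.MINOR.PATCH form suggested by the example value sdep-0.1.0; the specification does not yet fix this format, and both function names are hypothetical.

```python
import re

# Assumed format, based on the example value "sdep-0.1.0".
_VERSION = re.compile(r"^sdep-(\d+)\.(\d+)\.(\d+)$")


def parse_spec_version(value):
    """Parse a spec_version string into a (major, minor, patch) tuple of ints."""
    match = _VERSION.match(value.strip())
    if match is None:
        raise ValueError(f"unrecognized spec_version: {value!r}")
    return tuple(int(part) for part in match.groups())


def same_major_version(package_version, tool_version):
    """Return True when both versions share a major number."""
    return parse_spec_version(package_version)[0] == parse_spec_version(tool_version)[0]
```

Under the rule that only breaking changes bump the major version, a tool built for sdep-0.3.2 would accept a package declaring sdep-0.1.0 and reject one declaring sdep-1.0.0.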
11 Implementation Notes
- CSV files should use UTF-8 encoding and a comma delimiter by default.
- Dates should follow ISO 8601 where possible (e.g., YYYY-MM-DD); years can be encoded as integers.
- R and Python tooling should treat missing optional columns as NA/null without failing.
- The spec is intentionally minimal; domain-specific extensions (e.g., FSAR/SPSR templates) can be layered on top as recommended profiles.
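In Python, the tolerant handling of optional columns might look like the sketch below, which uses the standard csv module. Treating empty cells as None is an assumption of this example, and read_rows is a hypothetical helper name.

```python
import csv


def read_rows(handle, optional_columns=()):
    """Read CSV rows from an open text file, filling absent optional columns with None.

    handle: a text-mode file object, for example one opened with
            open(path, newline="", encoding="utf-8").
    optional_columns: names of optional columns that may be missing from the file.
    """
    rows = []
    for raw in csv.DictReader(handle):
        # Assumption: an empty cell is read as a missing value.
        row = {key: (value if value != "" else None) for key, value in raw.items()}
        for column in optional_columns:
            row.setdefault(column, None)
        rows.append(row)
    return rows
```

Callers can then read optional fields such as dataset_iri without first checking whether the column exists in the file.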