Overview

This guide follows the 5-Minute Quickstart and focuses on the manual package assembly path. If you have not yet generated a starter dictionary, start with the 5-Minute Quickstart.

If you already created a package with create_sdp(), reviewed it in Excel, and now need the follow-on publication workflow, use After Excel Review: Finalize and Publish Your Package.

When all of the pieces are ready, metasalmon writes files matching the Salmon Data Package specification so you can upload or hand the folder to someone else with confidence. create_sdp() is the main one-shot path; this article covers the more explicit workflow where you assemble the metadata tables yourself and then call write_salmon_datapackage().

1) Start with your data

library(metasalmon)
library(readr)

# Replace with your own data path
df <- read_csv("my-table.csv")

If you already have a dictionary and metadata from the quickstart, skip directly to Describe the dataset and tables.

Keep a working copy of your data frame handy so you can re-run these steps whenever the source data changes.

2) Build a starter column dictionary

If you already ran the quickstart and already have dict, skip this section.

dict <- infer_dictionary(
  df,
  dataset_id = "my-dataset-2026",
  table_id = "main-table"
)

The dictionary lists every column and assigns a column_role (identifier, attribute, measurement, temporal, or categorical). Move through the rows and fill in column_label, column_description, and value_type so reviewers understand what each field means, and mark columns as required when they must appear in every row.
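
As a minimal sketch of that review pass, the snippet below fills in the descriptive fields with base R. It works on a small hand-built stand-in for dict, so it runs without the package. The example columns (POP_ID, RUN_TYPE, MEAN_LENGTH), the helper fill_row(), and the value_type strings are illustrative assumptions, not values confirmed by metasalmon.

```r
# Stand-in for the dictionary returned by infer_dictionary(); the rows are
# illustrative assumptions, not output from the package.
dict_demo <- data.frame(
  column_name = c("POP_ID", "RUN_TYPE", "MEAN_LENGTH"),
  column_role = c("identifier", "categorical", "measurement"),
  column_label = NA_character_,
  column_description = NA_character_,
  value_type = NA_character_,
  required = FALSE,
  stringsAsFactors = FALSE
)

# Hypothetical helper: fill in the descriptive fields for one column.
fill_row <- function(dict, name, label, description, value_type, required = FALSE) {
  i <- which(dict$column_name == name)
  stopifnot(length(i) == 1)  # each column should appear exactly once
  dict$column_label[i] <- label
  dict$column_description[i] <- description
  dict$value_type[i] <- value_type  # "string" below is an assumed value
  dict$required[i] <- required
  dict
}

dict_demo <- fill_row(dict_demo, "POP_ID", "Population ID",
                      "Identifier for the surveyed population", "string",
                      required = TRUE)
dict_demo <- fill_row(dict_demo, "RUN_TYPE", "Run type",
                      "Run timing category for the population", "string")

# Columns that still lack a label or description need attention.
incomplete <- dict_demo$column_name[
  is.na(dict_demo$column_label) | is.na(dict_demo$column_description)
]
print(incomplete)
```

Listing the incomplete rows at the end gives a quick to-do list before moving on to the dataset and table metadata.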

3) Describe the dataset and tables

dataset_meta <- tibble::tibble(
  dataset_id = "my-dataset-2026",
  title = "My Project Data",
  description = "Sample data describing salmon measurements",
  creator = "Your Team",
  contact_name = "Data Steward",
  contact_email = "data@example.gov",
  license = "Open Government License - Canada"
)

table_meta <- tibble::tibble(
  dataset_id = "my-dataset-2026",
  table_id = "main-table",
  file_name = "data/main-table.csv",
  table_label = "Main Salmon Table",
  description = "Escapement and effort data by population"
)

Include optional columns such as spatial_extent, temporal_start, or table_label (used in the example above) when they help others understand the scope.

4) Add codes lists when needed

Only create codes.csv when a column uses categorical values (species, run_type, gear, etc.). Each row ties a code_value to a short label and, ideally, the ontology term that explains what the code means.

codes <- tibble::tibble(
  dataset_id = "my-dataset-2026",
  table_id = "main-table",
  column_name = "RUN_TYPE",
  code_value = "FALL",
  code_label = "Fall run timing"
)

If the column reuses a published controlled vocabulary (like the DFO Salmon Ontology), include the matching IRI in term_iri so automated tools can link to the definition. In the one-shot create_sdp() workflow, code-level semantic suggestions are seeded automatically, by default only for factor and low-cardinality character source columns; set semantic_code_scope = "all" to widen that scope.
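
A codes table usually needs one row per distinct value that appears in the data. The sketch below derives those rows with base R rather than typing them by hand. The demo data frame, the run-timing values, and the label lookup are made up for illustration. term_iri is left as NA instead of guessing an IRI.

```r
# Illustrative data; the RUN_TYPE values are assumptions for this sketch.
df_demo <- data.frame(
  RUN_TYPE = c("FALL", "SPRING", "FALL", "SUMMER"),
  stringsAsFactors = FALSE
)

# Hypothetical label lookup, written by hand during review.
run_labels <- c(
  FALL = "Fall run timing",
  SPRING = "Spring run timing",
  SUMMER = "Summer run timing"
)

# One codes row per distinct value found in the data.
code_values <- sort(unique(df_demo$RUN_TYPE))

codes_demo <- data.frame(
  dataset_id = "my-dataset-2026",
  table_id = "main-table",
  column_name = "RUN_TYPE",
  code_value = code_values,
  code_label = unname(run_labels[code_values]),
  term_iri = NA_character_,  # fill in only with a verified ontology IRI
  stringsAsFactors = FALSE
)

print(codes_demo)
```

Any value without an entry in the lookup shows up with an NA label, which makes gaps easy to spot before the package is written.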

5) Create the package

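# Name the list element after its table_id ("main-table") so the data frame
# lines up with the matching table_meta row (assumed matching rule).
resources <- list("main-table" = df)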

pkg_path <- write_salmon_datapackage(
  resources = resources,
  dataset_meta = dataset_meta,
  table_meta = table_meta,
  dict = dict,
  codes = codes,
  path = "my-data-package",
  format = "csv",
  overwrite = TRUE
)

list.files(pkg_path, recursive = TRUE)

This writes the canonical metadata CSV files under metadata/, the data tables under data/, and a derived datapackage.json at the package root. The metadata/*.csv files are the source of truth; if they disagree with datapackage.json, fix the CSV metadata and rewrite the package. The folder is now ready for publication, archiving, or sharing with colleagues. Share the whole folder (or a zip of the whole folder), not just datapackage.json.
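
Before sharing the folder, it can help to confirm that the three pieces described above are present. The following is a minimal sketch in base R. The helper check_package_layout() is hypothetical and not part of metasalmon. The demonstration runs on a throwaway folder of empty files that mimics the layout, not on a real package.

```r
# Hypothetical helper: report whether the expected pieces exist.
check_package_layout <- function(pkg_path) {
  c(
    datapackage_json = file.exists(file.path(pkg_path, "datapackage.json")),
    metadata_csv = length(list.files(file.path(pkg_path, "metadata"),
                                     pattern = "\\.csv$")) > 0,
    data_files = length(list.files(file.path(pkg_path, "data"))) > 0
  )
}

# Throwaway folder that mimics the layout described above.
demo_path <- file.path(tempdir(), "demo-package")
dir.create(file.path(demo_path, "metadata"), recursive = TRUE,
           showWarnings = FALSE)
dir.create(file.path(demo_path, "data"), showWarnings = FALSE)
file.create(file.path(demo_path, "datapackage.json"))
file.create(file.path(demo_path, "metadata", "dataset.csv"))
file.create(file.path(demo_path, "data", "main-table.csv"))

layout_ok <- check_package_layout(demo_path)
print(layout_ok)
```

With a real package, pass the pkg_path returned by write_salmon_datapackage() instead of the demo folder. This check only confirms that files exist; it does not validate their contents.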

Optional: include DwC-DP export hints

You can attach optional Darwin Core Data Package (DwC-DP) mappings when you need an export view for biodiversity tools. The default is OFF to keep SDP canonical.

dict <- readr::read_csv("inst/extdata/column_dictionary.csv", show_col_types = FALSE)
sem <- suggest_semantics(dict, include_dwc = TRUE)
attr(sem, "dwc_mappings") |>
  dplyr::filter(dwc_table %in% c("event", "occurrence")) |>
  dplyr::select(column_name, dwc_table, dwc_field, term_iri)

Keep the SDP column names intact; use the DwC mappings only when exporting a DwC-DP view.

Optional: export EDH XML metadata

When your publication workflow includes DFO Enterprise Data Hub / GeoNetwork, edh_build_hnap_xml() writes the HNAP-aware EDH export.

edh_hnap_xml <- file.path(pkg_path, "metadata", "metadata-edh-hnap.xml")
edh_build_hnap_xml(dataset_meta, output_path = edh_hnap_xml)

file.exists(edh_hnap_xml)

The HNAP-aware export recognizes extra optional columns when present, including created, modified, status, distribution_url, reference_system, bbox_*, and localized fields like title_fr / description_fr.

If you are creating the package in one shot, create_sdp(..., include_edh_xml = TRUE) now writes the same XML automatically to metadata/metadata-edh-hnap.xml.

After manual review/editing of metadata/dataset.csv, regenerate the XML from the finalized package with the post-review helper that wraps the canonical edh_build_hnap_xml() builder:

That post-review rebuild is the preferred path when the first-pass package gets edited in Excel before EDH submission.

Validate and enrich the XML output against your local EDH profile before production upload.

Using suggest_dwc_mappings() directly

For more control over DwC-DP mapping suggestions, use suggest_dwc_mappings():

dict <- tibble::tibble(
  column_name = c("event_date", "decimal_latitude", "scientific_name"),
  column_label = c("Event Date", "Decimal Latitude", "Scientific Name"),
  column_description = c("Date the event occurred", "Latitude in decimal degrees", "Species scientific name")
)
dict <- suggest_dwc_mappings(dict)
attr(dict, "dwc_mappings")
# Shows suggested DwC-DP table/field mappings with term IRIs

Semantic suggestions with role-aware sources

When using suggest_semantics(), the function automatically queries role-appropriate sources:

# Default: ontology suggestions only (DwC mappings OFF)
sem <- suggest_semantics(dict)

# Include DwC-DP mappings alongside ontology suggestions
sem_with_dwc <- suggest_semantics(dict, include_dwc = TRUE)

# View ontology suggestions
suggestions <- attr(sem, "semantic_suggestions")

# View DwC mappings (only when include_dwc = TRUE)
dwc_maps <- attr(sem_with_dwc, "dwc_mappings")

The ontology suggestions use role-aware ranking (Phase 2) that prefers:

  • QUDT for units
  • GBIF/WoRMS for taxa/entities
  • STATO/OBA for properties
  • gcdfo patterns for methods

Terms from Wikidata are flagged with alignment_only = TRUE and ranked lower.
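
To illustrate how the alignment_only flag might be used when choosing a primary term, here is a small base R sketch on a mock suggestions table. This guide does not show the real columns of the suggestions table, so everything here except alignment_only is an assumption: the column names source and score, the example rows, and the scores.

```r
# Mock suggestions table; only alignment_only is named in this guide, the
# other columns and all values are illustrative assumptions.
suggestions_demo <- data.frame(
  column_name = c("species", "species", "species"),
  source = c("wikidata", "gbif", "worms"),
  score = c(0.95, 0.80, 0.70),
  alignment_only = c(TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Put non-alignment terms first, then sort by descending score within each group.
ranked <- suggestions_demo[
  order(suggestions_demo$alignment_only, -suggestions_demo$score),
]

# Candidates for a primary term: everything not flagged as alignment-only.
primary <- ranked[!ranked$alignment_only, ]

print(ranked$source)
print(primary$source)
```

In this sketch the alignment-only row sorts last even though its assumed score is highest, which mirrors the "ranked lower" behaviour described above.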

Validation before publication

  • Run validate_dictionary(dict) to ensure the dictionary has required columns and valid column_role/value_type combinations.
  • If you generated codes.csv, double-check that every code used in the data has an entry there.
  • Re-open the package with read_salmon_datapackage(pkg_path) to confirm the metadata, dictionary, and data align.
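
The codes check in the list above can be done by hand, but a short base R sketch makes it repeatable. The helper find_missing_codes() is hypothetical and not part of metasalmon, and the demo data are made up. It relies only on the column_name and code_value columns used earlier in this guide.

```r
# Hypothetical helper: values used in a data column that have no codes entry.
find_missing_codes <- function(data, codes, column) {
  used <- unique(as.character(data[[column]]))
  used <- used[!is.na(used)]  # missing values do not need a code
  listed <- codes$code_value[codes$column_name == column]
  setdiff(used, listed)
}

# Illustrative data and a codes table that only covers "FALL".
df_demo <- data.frame(
  RUN_TYPE = c("FALL", "SPRING", NA, "FALL"),
  stringsAsFactors = FALSE
)
codes_demo <- data.frame(
  column_name = "RUN_TYPE",
  code_value = "FALL",
  code_label = "Fall run timing",
  stringsAsFactors = FALSE
)

missing_codes <- find_missing_codes(df_demo, codes_demo, "RUN_TYPE")
print(missing_codes)
```

An empty result means every value in that column is documented. Anything returned should get a row in the codes table before publication.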

Next steps