Overview

This guide follows the 5-Minute Quickstart and focuses on the hardening steps you take before publishing. If you have not generated a starter dictionary yet, start with the 5-Minute Quickstart.

When all of the pieces are ready, metasalmon writes files matching the Salmon Data Package specification so you can upload or hand the folder to someone else with confidence. create_sdp() is the main one-shot path; this article covers the more explicit, manual workflow where you assemble the metadata tables yourself and then call write_salmon_datapackage().

1) Start with your data

library(metasalmon)
library(readr)

# Replace with your own data path
df <- read_csv("my-table.csv")

If you already have a dictionary and metadata from the quickstart, skip directly to Describe the dataset and tables.

Keep a working copy of your data frame handy so you can re-run these steps whenever the source data changes.
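
When the source data changes, the first thing to confirm is whether the columns still match the dictionary you already have. The sketch below uses only base R and small stand-in data frames; the column names POP_ID, ESCAPEMENT, and EFFORT are hypothetical, and it assumes the dictionary carries a column_name column.

```r
# Stand-in for the current source data (hypothetical columns)
df <- data.frame(
  POP_ID = c("P01", "P02"),
  RUN_TYPE = c("FALL", "FALL"),
  ESCAPEMENT = c(1200, 850),
  stringsAsFactors = FALSE
)

# Stand-in for a dictionary built from an earlier version of the data
dict_old <- data.frame(
  column_name = c("POP_ID", "RUN_TYPE", "EFFORT"),
  stringsAsFactors = FALSE
)

# Columns present in the data but not yet documented
new_cols <- setdiff(names(df), dict_old$column_name)

# Columns documented in the dictionary but no longer in the data
dropped_cols <- setdiff(dict_old$column_name, names(df))

if (length(new_cols) > 0 || length(dropped_cols) > 0) {
  message("Dictionary is out of date; re-run the steps in this guide.")
}
```

If both vectors are empty, the existing dictionary still covers the data and you can go straight to rewriting the package.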

2) Build a starter column dictionary

If you ran the quickstart and already have dict, skip this section.

dict <- infer_dictionary(
  df,
  dataset_id = "my-dataset-2026",
  table_id = "main-table"
)

The dictionary lists every column and assigns a column_role (identifier, attribute, measurement, temporal, or categorical). Move through the rows and fill in column_label, column_description, and value_type so reviewers understand what each field means, and mark columns as required when they must appear in every row.
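
One way to fill in the dictionary reproducibly is to keep the hand-written documentation in a small table and copy it across by column name. This is a minimal base R sketch, not the package's own workflow: the dict below is a hand-built stand-in for the output of infer_dictionary(), the column name required is an assumption, and the value_type strings are placeholders to check against what your metasalmon version accepts.

```r
# Stand-in for infer_dictionary() output; the real result may carry more columns
dict <- data.frame(
  column_name = c("POP_ID", "RUN_TYPE", "ESCAPEMENT"),
  column_role = c("identifier", "categorical", "measurement"),
  column_label = NA_character_,
  column_description = NA_character_,
  value_type = NA_character_,
  required = FALSE,  # assumed name for the "required" flag
  stringsAsFactors = FALSE
)

# Hand-written documentation, keyed by column name
edits <- data.frame(
  column_name = c("POP_ID", "RUN_TYPE", "ESCAPEMENT"),
  column_label = c("Population ID", "Run type", "Escapement"),
  column_description = c(
    "Identifier of the population",
    "Run timing category",
    "Estimated number of returning spawners"
  ),
  value_type = c("string", "string", "integer"),  # placeholder values
  required = c(TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Copy the edits into the matching dictionary rows
idx <- match(edits$column_name, dict$column_name)
stopifnot(!anyNA(idx))  # every edit must refer to a known column
fields <- c("column_label", "column_description", "value_type", "required")
dict[idx, fields] <- edits[, fields]

# Columns that still lack a label or description
incomplete <- dict$column_name[
  is.na(dict$column_label) | is.na(dict$column_description)
]
```

Keeping the edits in their own table means they can be reapplied after infer_dictionary() is re-run on updated data.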

3) Describe the dataset and tables

dataset_meta <- tibble::tibble(
  dataset_id = "my-dataset-2026",
  title = "My Project Data",
  description = "Sample data describing salmon measurements",
  creator = "Your Team",
  contact_name = "Data Steward",
  contact_email = "data@example.gov",
  license = "Open Government License - Canada"
)

table_meta <- tibble::tibble(
  dataset_id = "my-dataset-2026",
  table_id = "main-table",
  file_name = "data/main-table.csv",
  table_label = "Main Salmon Table",
  description = "Escapement and effort data by population"
)

Add optional columns such as spatial_extent or temporal_start when they help others understand the scope; table_label, used in the example above, is one such optional column.
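
Because dataset_id and table_id are repeated across the metadata tables, a typo in one of them is easy to miss. The following base R sketch cross-checks the identifiers using stand-in data frames; it assumes the dictionary also carries dataset_id and table_id columns, which is suggested by the infer_dictionary() arguments but not confirmed here.

```r
# Stand-ins that reuse the identifiers from this guide
dataset_meta <- data.frame(
  dataset_id = "my-dataset-2026",
  stringsAsFactors = FALSE
)
table_meta <- data.frame(
  dataset_id = "my-dataset-2026",
  table_id = "main-table",
  file_name = "data/main-table.csv",
  stringsAsFactors = FALSE
)
dict <- data.frame(
  dataset_id = "my-dataset-2026",  # assumed dictionary column
  table_id = "main-table",         # assumed dictionary column
  column_name = c("POP_ID", "RUN_TYPE"),
  stringsAsFactors = FALSE
)

# Tables that point at a dataset missing from dataset_meta
orphan_tables <- setdiff(table_meta$dataset_id, dataset_meta$dataset_id)

# Dictionary rows that point at a table missing from table_meta
table_keys <- paste(table_meta$dataset_id, table_meta$table_id, sep = "/")
dict_keys <- unique(paste(dict$dataset_id, dict$table_id, sep = "/"))
orphan_dict <- setdiff(dict_keys, table_keys)

# Tables listed more than once
dup_tables <- table_meta$table_id[duplicated(table_keys)]
```

All three results should be empty before you move on to writing the package.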

4) Add codes lists when needed

Only create codes.csv when a column uses categorical values (species, run_type, gear, etc.). Each row ties a code_value to a short label and, ideally, the ontology term that explains what the code means.

codes <- tibble::tibble(
  dataset_id = "my-dataset-2026",
  table_id = "main-table",
  column_name = "RUN_TYPE",
  code_value = "FALL",
  code_label = "Fall run timing"
)

If the column reuses a published controlled vocabulary (like the DFO Salmon Ontology), include the matching IRI in term_iri so automated tools can link to the definition. In the one-shot create_sdp() workflow, code-level semantic suggestions are seeded automatically only for factor/categorical source columns unless you opt into semantic_code_scope = "all".
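
A categorical column usually has more than one code, so it can help to seed the codes table from the distinct values in the data and then fill in labels by hand. This is a base R sketch under stated assumptions: the data frame is a stand-in, the SPRING and SUMMER values and their labels are hypothetical, and term_iri is left empty until you have a published term to point to.

```r
# Stand-in data with a categorical column
df <- data.frame(
  RUN_TYPE = c("FALL", "SPRING", "FALL", NA, "SUMMER"),
  stringsAsFactors = FALSE
)

# Distinct, non-missing codes actually used in the column
values <- sort(unique(df$RUN_TYPE[!is.na(df$RUN_TYPE)]))

# One row per code; labels and IRIs start empty
codes <- data.frame(
  dataset_id = "my-dataset-2026",
  table_id = "main-table",
  column_name = "RUN_TYPE",
  code_value = values,
  code_label = NA_character_,
  term_iri = NA_character_,  # fill in only when a published term exists
  stringsAsFactors = FALSE
)

# Hand-written labels, looked up by code value (hypothetical wording)
labels <- c(
  FALL = "Fall run timing",
  SPRING = "Spring run timing",
  SUMMER = "Summer run timing"
)
codes$code_label <- unname(labels[codes$code_value])
```

Any code that is missing from the labels vector ends up with an NA label, which makes gaps easy to spot.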

5) Create the package

resources <- list(main = df)

pkg_path <- write_salmon_datapackage(
  resources = resources,
  dataset_meta = dataset_meta,
  table_meta = table_meta,
  dict = dict,
  codes = codes,
  path = "my-data-package",
  format = "csv",
  overwrite = TRUE
)

list.files(pkg_path, recursive = TRUE)

This writes the canonical metadata CSV files under metadata/, the data tables under data/, and a derived datapackage.json at the package root. The metadata/*.csv files are the source of truth; if they disagree with datapackage.json, fix the CSV metadata and rewrite the package. The folder is now ready for publication, archiving, or sharing with colleagues. Share the whole folder (or a zip of the whole folder), not just datapackage.json.
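
Before sharing the folder, a quick layout check can confirm that the three pieces described above are present. The sketch below is base R only and builds a throwaway folder to stand in for real output; check_layout() is a hypothetical helper, and metadata/dataset.csv is a placeholder file name rather than the name metasalmon necessarily uses.

```r
# Throwaway folder that mimics the layout described above
pkg_path <- file.path(tempdir(), "my-data-package-demo")
dir.create(file.path(pkg_path, "metadata"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(pkg_path, "data"), showWarnings = FALSE)
writeLines("dataset_id", file.path(pkg_path, "metadata", "dataset.csv"))  # placeholder name
writeLines("POP_ID", file.path(pkg_path, "data", "main-table.csv"))
writeLines("{}", file.path(pkg_path, "datapackage.json"))

# Hypothetical helper: report which parts of the layout exist
check_layout <- function(path) {
  c(
    has_descriptor = file.exists(file.path(path, "datapackage.json")),
    has_metadata_csv = length(
      list.files(file.path(path, "metadata"), pattern = "\\.csv$")
    ) > 0,
    has_data_files = length(list.files(file.path(path, "data"))) > 0
  )
}

layout <- check_layout(pkg_path)
print(layout)
```

With a real package, pass the pkg_path returned by write_salmon_datapackage() instead of the demo folder.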

Optional: include DwC-DP export hints

You can attach optional Darwin Core Data Package (DwC-DP) mappings when you need an export view for biodiversity tools. DwC mappings are off by default so that the SDP metadata stays canonical.

dict <- readr::read_csv("inst/extdata/column_dictionary.csv", show_col_types = FALSE)
sem <- suggest_semantics(dict, include_dwc = TRUE)
attr(sem, "dwc_mappings") |>
  dplyr::filter(dwc_table %in% c("event", "occurrence")) |>
  dplyr::select(column_name, dwc_table, dwc_field, term_iri)

Keep the SDP column names intact; use the DwC mappings only when exporting a DwC-DP view.

Optional: export EDH XML metadata

When your publication workflow includes DFO Enterprise Data Hub / GeoNetwork, edh_build_iso19139_xml() now defaults to the richer HNAP-aware EDH export and still offers the older compact ISO 19139 path as an explicit fallback.

edh_hnap_xml <- file.path(pkg_path, "metadata", "metadata-edh-hnap.xml")
edh_build_iso19139_xml(dataset_meta, output_path = edh_hnap_xml)

edh_iso_xml <- file.path(pkg_path, "metadata", "metadata-iso19139.xml")
edh_build_iso19139_xml(
  dataset_meta,
  output_path = edh_iso_xml,
  profile = "iso19139"
)

file.exists(edh_hnap_xml)
file.exists(edh_iso_xml)

The default HNAP-aware path recognizes extra optional columns when present, including created, modified, status, distribution_url, reference_system, bbox_*, and localized fields like title_fr / description_fr.

Validate and enrich either XML output against your local EDH profile before production upload.

Using suggest_dwc_mappings() directly

For more control over DwC-DP mapping suggestions, use suggest_dwc_mappings():

dict <- tibble::tibble(
  column_name = c("event_date", "decimal_latitude", "scientific_name"),
  column_label = c("Event Date", "Decimal Latitude", "Scientific Name"),
  column_description = c(
    "Date the event occurred",
    "Latitude in decimal degrees",
    "Species scientific name"
  )
)
dict <- suggest_dwc_mappings(dict)
attr(dict, "dwc_mappings")
# Shows suggested DwC-DP table/field mappings with term IRIs

Semantic suggestions with role-aware sources

suggest_semantics() automatically queries sources appropriate to each column's role:

# Default: ontology suggestions only (DwC mappings OFF)
sem <- suggest_semantics(dict)

# Include DwC-DP mappings alongside ontology suggestions
sem_with_dwc <- suggest_semantics(dict, include_dwc = TRUE)

# View ontology suggestions
suggestions <- attr(sem, "semantic_suggestions")

# View DwC mappings (only when include_dwc = TRUE)
dwc_maps <- attr(sem_with_dwc, "dwc_mappings")

The ontology suggestions use role-aware ranking (Phase 2) that prefers:

  • QUDT for units
  • GBIF/WoRMS for taxa/entities
  • STATO/OBA for properties
  • gcdfo patterns for methods

Terms from Wikidata are flagged with alignment_only = TRUE and ranked lower.
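
When reviewing suggestions, it can be useful to look at the alignment-only terms separately from the primary candidates. The sketch below shows the idea in base R on a stand-in table: only the alignment_only column comes from the description above, while column_name, term_iri, and the example.org IRIs are placeholders and may not match the real structure of the semantic_suggestions attribute.

```r
# Stand-in for attr(sem, "semantic_suggestions"); real columns may differ
suggestions <- data.frame(
  column_name = c("ESCAPEMENT", "ESCAPEMENT", "RUN_TYPE"),
  term_iri = c(
    "https://example.org/term/a",
    "https://example.org/term/b",
    "https://example.org/term/c"
  ),
  alignment_only = c(FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Primary candidates: terms that are not alignment-only
primary <- suggestions[!suggestions$alignment_only, , drop = FALSE]

# Alignment-only terms, kept aside for cross-referencing
alignment <- suggestions[suggestions$alignment_only, , drop = FALSE]
```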

Validation before publication

  • Run validate_dictionary(dict) to ensure the dictionary has required columns and valid column_role/value_type combinations.
  • If you generated codes.csv, double-check that every code used in the data has an entry there.
  • Re-open the package with read_salmon_datapackage(pkg_path) to confirm the metadata, dictionary, and data align.
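
The codes check in the list above can be scripted rather than done by eye. This is a minimal base R sketch on stand-in data: missing_codes() is a hypothetical helper, not part of metasalmon, and the SPRING value is invented to show what an undocumented code looks like.

```r
# Stand-in data and codes table
df <- data.frame(
  RUN_TYPE = c("FALL", "SPRING", "FALL", NA),
  stringsAsFactors = FALSE
)
codes <- data.frame(
  column_name = "RUN_TYPE",
  code_value = "FALL",
  code_label = "Fall run timing",
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each coded column, list values in the data
# that have no matching row in the codes table
missing_codes <- function(df, codes) {
  out <- list()
  for (col in unique(codes$column_name)) {
    if (!col %in% names(df)) next
    used <- unique(as.character(df[[col]]))
    used <- used[!is.na(used)]
    known <- codes$code_value[codes$column_name == col]
    gap <- setdiff(used, known)
    if (length(gap) > 0) out[[col]] <- gap
  }
  out
}

gaps <- missing_codes(df, codes)
print(gaps)
```

An empty list means every code used in the data has an entry; otherwise each element names a column and the codes that still need rows in codes.csv.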

Next steps

  • See the “How It Fits Together” section in the README for a visual map of how the components interact.
  • Read the Linking to Standard Vocabularies guide when you want to align your dictionary with published vocabularies.
  • Try the Using AI to Document Your Data workflow for drafting descriptions and ontology-aligned metadata quickly.