
The Problem

You’ve spent years collecting salmon data. But when you try to share it:

  • Colleagues ask “What does SPAWN_EST mean?”
  • Combining datasets fails because everyone uses different column names
  • Your future self opens old data and can’t remember what the codes mean
  • Other researchers can’t use your data without emailing you for explanations

The Solution

metasalmon wraps your salmon data with a data dictionary that travels with it, explaining every column and every code and linking each to standard scientific definitions. These definitions come from the DFO Salmon Ontology and other published controlled vocabularies, and the data is packaged according to the Salmon Data Package Specification. For extra help, our custom Salmon Data Standardizer GPT can draft metadata and salmon data packages and guide your data dictionary creation in coordination with this R package.

Think of it like adding a detailed legend to your spreadsheet that never gets lost.

What You Get

Your Data              + metasalmon          = Data Package
Raw CSV files          Data dictionary       Self-documenting dataset
Cryptic column names   Clear descriptions    Anyone can understand it
Inconsistent codes     Linked to standards   Works with other datasets

Quick Example

First, install the package from GitHub:

# Install from GitHub (recommended)
install.packages("remotes")
remotes::install_github("dfo-pacific-science/metasalmon")

Then use it to create a data package:

library(metasalmon)

# Load your escapement data
df <- read.csv("my-coho-data.csv")

# Generate a data dictionary automatically
dict <- infer_dictionary(df, dataset_id = "fraser-coho-2024", table_id = "escapement")

# Check that it looks right
validate_dictionary(dict)
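
# (Optional) Refine the inferred entries before packaging -- the dictionary is
# a plain data frame, so you can edit descriptions in place. The field names
# below are assumptions; check names(dict) for the actual columns:
# dict$description[dict$column_name == "SPAWN_EST"] <- "Estimated spawner count"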

# Add metadata about your dataset
dataset_meta <- tibble::tibble(
  dataset_id = "fraser-coho-2024",
  title = "Fraser River Coho Escapement Data",
  description = "Escapement monitoring data for coho salmon",
  creator = "Your Name",
  contact_name = "Your Name",
  contact_email = "your.email@dfo-mpo.gc.ca",
  license = "Open Government License - Canada"
)

table_meta <- tibble::tibble(
  dataset_id = "fraser-coho-2024",
  table_id = "escapement",
  file_name = "escapement.csv",
  table_label = "Escapement Data",
  description = "Coho escapement counts by population and year"
)

# Create a shareable package
create_salmon_datapackage(
  resources = list(escapement = df),
  dataset_meta = dataset_meta,
  table_meta = table_meta,
  dict = dict,
  path = "my-data-package"
)

Result: A folder containing your data + documentation that anyone can understand.
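
The datapackage.json suggests the output builds on the Frictionless Data Package format, so it can be read back with general-purpose tooling. A minimal sketch using the frictionless R package (listed among the development dependencies below); compatibility here is an assumption, not part of the documented workflow:

library(frictionless)

# Read the machine-readable metadata written by create_salmon_datapackage()
pkg <- read_package("my-data-package/datapackage.json")

# List the declared resources, then load the data itself
resources(pkg)
escapement <- read_resource(pkg, "escapement")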

Semantic validation loop (new)

Fetch the ontology, check your dictionary’s semantic links, and deduplicate GPT-proposed terms before opening ontology issues:

# Fetch the latest DFO Salmon Ontology (content-negotiated, cached locally)
onto_path <- fetch_salmon_ontology()

# Run semantic validation and surface missing IRIs early
validate_semantics(dict)

# If GPT proposed many new terms, deduplicate before filing issues
deduped <- deduplicate_proposed_terms(readr::read_csv("gpt_proposed_terms.csv"))
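
Two hedged follow-ups to the loop above (term_iri is the dictionary field used for ontology links; whether unmapped rows are NA is an assumption):

# Columns that still lack an ontology link
dict[is.na(dict$term_iri), ]

# Write the deduplicated proposals back out before filing ontology issues
readr::write_csv(deduped, "gpt_proposed_terms_deduped.csv")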

Who Is This For?

If you are…                                 Start here
A biologist who wants to share data         5-Minute Quickstart
Curious how it works                        How It Fits Together
A data steward standardizing datasets       Data Dictionary & Publication
Interested in AI-assisted documentation     AI Assistance (Advanced)
Reading CSVs from private GitHub repos      GitHub CSV Access

Installation

# Install from GitHub
install.packages("remotes")
remotes::install_github("dfo-pacific-science/metasalmon")

What’s In a Data Package?

When you create a package, you get a folder containing:

my-data-package/
  ├── escapement.csv          # Your data
  ├── column_dictionary.csv   # What each column means
  ├── codes.csv               # What each code value means (if applicable)
  └── datapackage.json        # Machine-readable metadata

Anyone opening this folder - whether a colleague, a reviewer, or your future self - can immediately understand your data.
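
Because datapackage.json is plain JSON, you can inspect it directly. A minimal sketch with jsonlite (the field names follow Frictionless Data Package conventions; the exact layout is defined by the Salmon Data Package Specification):

# Load the package metadata as a nested list
meta <- jsonlite::read_json("my-data-package/datapackage.json")

# Top-level fields, and the resources (tables) the package declares
names(meta)
sapply(meta$resources, function(r) r$name)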

Key Features

For everyday use:

  • Automatically generate data dictionaries from your data frames
  • Validate that your dictionary is complete and correct
  • Create shareable packages that work across R, Python, and other tools
  • Read CSVs directly from private GitHub repositories

For data stewards (optional):

  • Link columns to standard DFO Salmon Ontology terms
  • Add I-ADOPT measurement metadata (property, entity, unit, constraint)
  • Use AI assistance to help write descriptions
  • Suggest Darwin Core Data Package table/field mappings for biodiversity data
  • Opt in to DwC-DP export hints via suggest_semantics(..., include_dwc = TRUE) while keeping the Salmon Data Package as the canonical deliverable.
  • Role-aware vocabulary search with find_terms() and sources_for_role() (see the sketch after this list):
    • Units: QUDT preferred, then NVS P06
    • Entities/taxa: GBIF and WoRMS taxon resolvers
    • Properties: STATO/OBA measurement ontologies
    • Cross-source agreement boosting for high-confidence matches
  • Per-source diagnostics, scoring, and an optional rerank step explain why find_terms() matches rank where they do and surface failures, so you can tune role-aware queries with confidence.
  • End-to-end semantic QA loop with fetch_salmon_ontology() + validate_semantics(), plus deduplicate_proposed_terms() to prevent term proliferation before opening ontology issues.
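
A hedged sketch of the role-aware search (the role argument and the values shown are assumptions about the interface; see ?find_terms and ?sources_for_role for the actual signatures):

# Search measurement-property vocabularies (e.g. STATO/OBA) for a candidate term
hits <- find_terms("spawner abundance", role = "property")
head(hits)

# Inspect which vocabularies back a given role, e.g. QUDT then NVS P06 for units
sources_for_role("unit")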

How It Fits Together

metasalmon brings together four pieces: your raw data, the Salmon Data Package specification, the DFO Salmon Ontology (and other vocabularies), and the Salmon Data Standardizer GPT. When you finish the workflow, the dictionary, dataset/table metadata, and optional code lists are already aligned with the specification, which makes the package ready to publish. The ontology keeps the column meanings consistent, and the GPT assistant helps draft descriptions and term choices so you can close the loop without juggling multiple tools.

The high-level flow is:

  • Raw tables feed into column_dictionary.csv (and codes.csv when there are categorical columns; see the sketch after this list).
  • Dataset/table metadata fill the required specification fields (title, description, creator, contact, etc.), so the package folder can be shared or uploaded.
  • The DFO Salmon Ontology and published vocabularies supply term_iri/entity_iri links that describe what each column and row represents.
  • create_salmon_datapackage() consumes the metadata, dictionary, codes, and data to write the files in the Salmon Data Package format, while the GPT assistant helps polish the metadata and suggests vocabulary links.
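
For categorical columns, codes.csv spells out what each code value means. A hedged sketch of such a code list (the column names and the example method codes are illustrative, not the specification’s exact fields):

# Hypothetical code list for a categorical "estimate_method" column
codes <- tibble::tibble(
  dataset_id  = "fraser-coho-2024",
  table_id    = "escapement",
  column_name = "estimate_method",
  code        = c("AUC", "PEAK"),
  code_label  = c("Area-under-the-curve estimate", "Peak live count")
)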

For Developers

Development setup and package structure

Installation for Development

install.packages(c("devtools", "roxygen2", "testthat", "knitr", "rmarkdown",
                   "tibble", "readr", "jsonlite", "cli", "rlang", "dplyr",
                   "tidyr", "purrr", "withr", "frictionless"))

Build and Check

devtools::document()
devtools::test()
devtools::check()
devtools::build_vignettes()
pkgdown::build_site()

Package Structure

  • R/: Core functions for dictionary and package operations
  • inst/extdata/: Example data files and templates
  • tests/testthat/: Automated tests
  • vignettes/: Long-form documentation
  • docs/: pkgdown site output

DFO Salmon Ontology

This package can optionally link your data to the DFO Salmon Ontology. You don’t need to understand ontologies to use metasalmon; the linking is handled for you if you opt in.

See the Reusing Standards for Salmon Data Terms guide for details.