Linking to Standard Vocabularies • metasalmon

Overview

This article is the deep-dive for vocabulary mapping. Start with the 5-Minute Quickstart for the baseline infer/validate/package flow, then come back here to map columns to standard terms.

When you reuse published salmon data terms, your dictionary becomes easier for other scientists and machines to understand. This guide explains when to reuse an existing term, how to point to it, and how to discover terms without drowning in jargon.

Why reuse shared terms?

Consistency: Everyone who links to the same term knows they are talking about the same thing. For example, connecting SPAWN_EST to a DFO term called “Natural spawner count” removes guesswork.
Automation: Tools can read a standard IRI (Internationalized Resource Identifier, a kind of web address for a definition) and know what to expect without human explanation.
Future-proofing: Standards, like the DFO Salmon Ontology, evolve alongside policy. By linking to them now, you can pick up improvements later.

Choosing a term

Look for a term that matches the column. Use find_terms("spawner count") to search DFO, OLS, and other vocabularies.

library(metasalmon)
devtools::load_all(".")
find_terms("spawner count",
           role = "property",
           sources = sources_for_role("property")) |>
  dplyr::select(label, source, ontology, score, alignment_only) |>
  head()

This example uses the role-aware source set and surfaces alignment_only so you can down-weight Wikidata crosswalks when reviewing candidates. The score column shows the computed ranking, which factors in ontology preferences, cross-source agreement, and ZOOMA confidence.

Available sources by role

find_terms() can query multiple vocabulary sources. Use sources_for_role() to get the recommended sources for each I-ADOPT role:

sources_for_role("unit")
# Returns: c("qudt", "nvs", "ols")

sources_for_role("entity")
# Returns: c("smn", "gcdfo", "gbif", "worms", "bioportal", "ols")

The shared Salmon Domain Ontology is queried first for salmon-domain roles. Its reusable shared terms are canonically served under the smn namespace (for example https://w3id.org/smn/Stock), while gcdfo remains the DFO-specific source with canonical IRIs under https://w3id.org/gcdfo/salmon#. metasalmon does not silently rewrite legacy salmon:-namespace variants.

Role	Recommended Sources	Notes
`unit`	QUDT, NVS P06, OLS	QUDT preferred for SI units
`property`	SMN, GCDFO, NVS P01, OLS, ZOOMA	Shared salmon-domain properties first; broader fallbacks after
`entity`	SMN, GCDFO, GBIF, WoRMS, BioPortal, OLS	Shared salmon-domain entities first; taxon resolvers after
`method`	SMN, GCDFO, BioPortal, OLS, ZOOMA	Shared/domain method semantics first
`variable`	SMN, GCDFO, NVS, OLS, ZOOMA	Shared salmon-domain variables first
`constraint`	SMN, GCDFO, OLS	Shared/context vocabularies first

Searching for units with QUDT

For unit columns, QUDT provides authoritative unit IRIs:

find_terms("kilogram", role = "unit", sources = sources_for_role("unit")) |>
  dplyr::select(label, iri, source, score) |>
  head()

Searching for taxa with GBIF/WoRMS

For species or organism columns, use taxon resolvers:

find_terms("Oncorhynchus kisutch", role = "entity", sources = c("gbif", "worms")) |>
  dplyr::select(label, iri, source, ontology, score) |>
  head()

Interpreting results

The results include several columns for transparency:

score: Computed ranking incorporating source preferences, role boosts, and cross-source agreement
alignment_only: TRUE for Wikidata terms (useful for crosswalks, not canonical modeling)
agreement_sources: How many sources returned this term (higher = more confidence)
zooma_confidence/zooma_annotator: ZOOMA annotation confidence when applicable

Filter out alignment-only terms when selecting canonical IRIs:

results <- find_terms("salmon", role = "entity", sources = sources_for_role("entity"))
canonical <- results[!results$alignment_only, ]

Debugging slow or empty searches

If a search returns unexpected results, check the diagnostics:

results <- find_terms("temperature", role = "property", sources = c("gcdfo", "ols", "nvs", "zooma"))
diagnostics <- attr(results, "diagnostics")
print(diagnostics)
# Shows: source, query, status (success/error), count, elapsed_secs, error message

Decide what kind of term it is:
- Use a controlled vocabulary term (SKOS concept) when the column holds one of a set of codes (species, run type, etc.).
- Use an ontology class (OWL) when the column names a category you would treat as a type in data (for example, what kind of unit or entity each row is about).
Capture the link: place the chosen URI in term_iri for the dictionary row. Mention it once; you do not need to repeat the same URI in multiple columns unless they genuinely mean different things.

dict <- infer_dictionary(
  df,
  dataset_id = "my-dataset-2026",
  table_id = "main-table"
)

dict$term_iri[dict$column_name == "SPAWN_EST"] <- "https://w3id.org/gcdfo/salmon#NaturalSpawnerCount"

Working with semantic web terms (plain language)

IRI: think of it as the web address that points to a formal definition. You only need to copy-paste it; you do not need to understand the underlying formal logic.
term_iri: attaches the chosen web address to a column so the column is self-explanatory.
entity_iri and property_iri: optional links for measurement columns. Use property_iri to specify what characteristic was measured (e.g., “count”), and entity_iri to specify what was measured (e.g., “spawning salmon”).
constraint_iri: an optional I-ADOPT component that qualifies the measurement (e.g., “maximum”, “annual average”). Use it only when it adds clarity.

When to skip linking

If a column is purely administrative or custom to your survey, it is fine to leave term_iri blank and rely on your own column_description.
If you cannot find a fitting term, use metadata-first workflows (gpt_proposed_terms.csv) or the new gap detection tools in metasalmon:
- detect_semantic_term_gaps() identifies candidates where SMN is missing but fallback sources found useful matches.
- render_ontology_term_request() lets you choose shared SMN vs profile-specific requests and generate ready-to-post issue text.
- submit_term_request_issues() creates GitHub issues (dry-run first) against the ontology request template.

Building vocabulary-aware code lists

If a categorical column such as SPECIES uses a published vocabulary, add a code_value row that matches the vocabulary notation and include the term_iri for the concept.
If you do not have a matching vocabulary, describe the codes clearly in code_label and code_description so reviewers do not have to guess.

Exploring suggestions with metasalmon

suggest_semantics() can look at your dictionary and offer term_iri, entity_iri, or I-ADOPT components based on the bundled catalog.

dict_suggested <- suggest_semantics(df, dict)
suggestions <- attr(dict_suggested, "semantic_suggestions")
head(suggestions)

# Optional explicit merge step: fills only missing fields unless overwrite = TRUE
# and matches suggestions by both column_name and dictionary_role.
dict <- apply_semantic_suggestions(dict_suggested, columns = "SPAWN_EST")

Treat the suggestions as a starting point, not gospel. The helper above is just a safe accelerator; you should still review the picked IRIs and tweak anything that misses your domain nuance.

Next steps

See the “How It Fits Together” section in the README for context on how the dictionary, ontology, and GPT assistant team up.
Follow the Publishing Data Packages guide when you finalize your metadata before publishing.
Try the Using AI to Document Your Data workflow to draft column descriptions or propose new terms when the catalog does not yet cover your concept.