
Linking to Standard Vocabularies
reusing-standards-salmon-data-terms.RmdOverview
This article is the deep-dive for vocabulary mapping. Start with the 5-Minute Quickstart for the baseline infer/validate/package flow, then come back here to map columns to standard terms.
When you reuse published salmon data terms, your dictionary becomes easier for other scientists and machines to understand. This guide explains when to reuse an existing term, how to point to it, and how to discover terms without drowning in jargon.
Why reuse shared terms?
-
Consistency: Everyone who links to the same term
knows they are talking about the same thing. For example, connecting
SPAWN_ESTto a DFO term called “Natural spawner count” removes guesswork. -
Automation: Tools can read a standard
IRI(Internationalized Resource Identifier, a kind of web address for a definition) and know what to expect without human explanation. - Future-proofing: Standards, like the DFO Salmon Ontology, evolve alongside policy. By linking to them now, you can pick up improvements later.
Choosing a term
-
Look for a term that matches the column. Use
find_terms("spawner count")to search DFO, OLS, and other vocabularies.
library(metasalmon)
devtools::load_all(".")
find_terms("spawner count",
role = "property",
sources = sources_for_role("property")) |>
dplyr::select(label, source, ontology, score, alignment_only) |>
head()This example uses the role-aware source set and surfaces
alignment_only so you can down-weight Wikidata crosswalks
when reviewing candidates. The score column shows the
computed ranking, which factors in ontology preferences, cross-source
agreement, and ZOOMA confidence.
Available sources by role
find_terms() can query multiple vocabulary sources. Use
sources_for_role() to get the recommended sources for each
I-ADOPT role:
sources_for_role("unit")
# Returns: c("qudt", "nvs", "ols")
sources_for_role("entity")
# Returns: c("smn", "gcdfo", "gbif", "worms", "bioportal", "ols")The shared Salmon Domain Ontology is queried first
for salmon-domain roles. Its reusable shared terms are canonically
served under the smn namespace (for example
https://w3id.org/smn/Stock), while gcdfo
remains the DFO-specific source with canonical IRIs under
https://w3id.org/gcdfo/salmon#. metasalmon does not
silently rewrite legacy salmon:-namespace variants.
| Role | Recommended Sources | Notes |
|---|---|---|
unit |
QUDT, NVS P06, OLS | QUDT preferred for SI units |
property |
SMN, GCDFO, NVS P01, OLS, ZOOMA | Shared salmon-domain properties first; broader fallbacks after |
entity |
SMN, GCDFO, GBIF, WoRMS, BioPortal, OLS | Shared salmon-domain entities first; taxon resolvers after |
method |
SMN, GCDFO, BioPortal, OLS, ZOOMA | Shared/domain method semantics first |
variable |
SMN, GCDFO, NVS, OLS, ZOOMA | Shared salmon-domain variables first |
constraint |
SMN, GCDFO, OLS | Shared/context vocabularies first |
Searching for units with QUDT
For unit columns, QUDT provides authoritative unit IRIs:
find_terms("kilogram", role = "unit", sources = sources_for_role("unit")) |>
dplyr::select(label, iri, source, score) |>
head()Searching for taxa with GBIF/WoRMS
For species or organism columns, use taxon resolvers:
find_terms("Oncorhynchus kisutch", role = "entity", sources = c("gbif", "worms")) |>
dplyr::select(label, iri, source, ontology, score) |>
head()Interpreting results
The results include several columns for transparency:
- score: Computed ranking incorporating source preferences, role boosts, and cross-source agreement
-
alignment_only:
TRUEfor Wikidata terms (useful for crosswalks, not canonical modeling) - agreement_sources: How many sources returned this term (higher = more confidence)
- zooma_confidence/zooma_annotator: ZOOMA annotation confidence when applicable
Filter out alignment-only terms when selecting canonical IRIs:
results <- find_terms("salmon", role = "entity", sources = sources_for_role("entity"))
canonical <- results[!results$alignment_only, ]Debugging slow or empty searches
If a search returns unexpected results, check the diagnostics:
results <- find_terms("temperature", role = "property", sources = c("gcdfo", "ols", "nvs", "zooma"))
diagnostics <- attr(results, "diagnostics")
print(diagnostics)
# Shows: source, query, status (success/error), count, elapsed_secs, error message-
Decide what kind of term it is:
- Use a controlled vocabulary term (SKOS concept) when the column holds one of a set of codes (species, run type, etc.).
- Use an ontology class (OWL) when the column names a category you would treat as a type in data (for example, what kind of unit or entity each row is about).
-
Capture the link: place the chosen URI in
term_irifor the dictionary row. Mention it once; you do not need to repeat the same URI in multiple columns unless they genuinely mean different things.
dict <- infer_dictionary(
df,
dataset_id = "my-dataset-2026",
table_id = "main-table"
)
dict$term_iri[dict$column_name == "SPAWN_EST"] <- "https://w3id.org/gcdfo/salmon#NaturalSpawnerCount"Working with semantic web terms (plain language)
- IRI: think of it as the web address that points to a formal definition. You only need to copy-paste it; you do not need to understand the underlying formal logic.
- term_iri: attaches the chosen web address to a column so the column is self-explanatory.
-
entity_iri and property_iri:
optional links for measurement columns. Use
property_irito specify what characteristic was measured (e.g., “count”), andentity_irito specify what was measured (e.g., “spawning salmon”). - constraint_iri: an optional I-ADOPT component that qualifies the measurement (e.g., “maximum”, “annual average”). Use it only when it adds clarity.
When to skip linking
- If a column is purely administrative or custom to your survey, it is
fine to leave
term_iriblank and rely on your owncolumn_description. - If you cannot find a fitting term, use metadata-first workflows
(
gpt_proposed_terms.csv) or the new gap detection tools in metasalmon:-
detect_semantic_term_gaps()identifies candidates where SMN is missing but fallback sources found useful matches. -
render_ontology_term_request()lets you choose shared SMN vs profile-specific requests and generate ready-to-post issue text. -
submit_term_request_issues()creates GitHub issues (dry-run first) against the ontology request template.
-
Building vocabulary-aware code lists
- If a categorical column such as
SPECIESuses a published vocabulary, add acode_valuerow that matches the vocabulary notation and include theterm_irifor the concept. - If you do not have a matching vocabulary, describe the codes clearly
in
code_labelandcode_descriptionso reviewers do not have to guess.
Exploring suggestions with metasalmon
suggest_semantics() can look at your dictionary and
offer term_iri, entity_iri, or I-ADOPT
components based on the bundled catalog.
dict_suggested <- suggest_semantics(df, dict)
suggestions <- attr(dict_suggested, "semantic_suggestions")
head(suggestions)
# Optional explicit merge step: fills only missing fields unless overwrite = TRUE
# and matches suggestions by both column_name and dictionary_role.
dict <- apply_semantic_suggestions(dict_suggested, columns = "SPAWN_EST")Treat the suggestions as a starting point, not gospel. The helper above is just a safe accelerator; you should still review the picked IRIs and tweak anything that misses your domain nuance.
Next steps
- See the “How It Fits Together” section in the README for context on how the dictionary, ontology, and GPT assistant team up.
- Follow the Publishing Data Packages guide when you finalize your metadata before publishing.
- Try the Using AI to Document Your Data workflow to draft column descriptions or propose new terms when the catalog does not yet cover your concept.