
Linking to Standard Vocabularies
Overview
When you reuse published salmon data terms, your dictionary becomes easier for other scientists and machines to understand. This guide explains when to reuse an existing term, how to point to it, and how to discover terms without drowning in jargon.
Why reuse shared terms?

- Consistency: Everyone who links to the same term knows they are talking about the same thing. For example, connecting `SPAWN_EST` to a DFO term called “Natural spawner count” removes guesswork.
- Automation: Tools can read a standard IRI (Internationalized Resource Identifier, a kind of web address for a definition) and know what to expect without human explanation.
- Future-proofing: Standards, like the DFO Salmon Ontology, evolve alongside policy. By linking to them now, you can pick up improvements later.
Choosing a term
- Look for a term that matches the column. Use `find_terms("spawner count")` to search DFO, OLS, and other vocabularies.
```r
library(metasalmon)  # or devtools::load_all(".") when developing the package itself

find_terms("spawner count",
           role = "property",
           sources = sources_for_role("property")) |>
  dplyr::select(label, source, ontology, score, alignment_only) |>
  head()
```

This example uses the role-aware source set and surfaces `alignment_only` so you can down-weight Wikidata crosswalks when reviewing candidates. The `score` column shows the computed ranking, which factors in ontology preferences, cross-source agreement, and ZOOMA confidence.
Available sources by role
`find_terms()` can query multiple vocabulary sources. Use `sources_for_role()` to get the recommended sources for each I-ADOPT role:

```r
sources_for_role("unit")
# Returns: c("qudt", "nvs", "ols")
sources_for_role("entity")
# Returns: c("gbif", "worms", "bioportal", "ols")
```

| Role | Recommended Sources | Notes |
|---|---|---|
| unit | QUDT, NVS P06, OLS | QUDT preferred for SI units |
| property | NVS P01, OLS, ZOOMA | Measurement ontologies like STATO/OBA |
| entity | GBIF, WoRMS, BioPortal, OLS | Taxon resolvers for species |
| method | BioPortal, OLS, ZOOMA | gcdfo SKOS methods preferred |
| variable | NVS, OLS, ZOOMA | Compound observables |
Searching for units with QUDT
For unit columns, QUDT provides authoritative unit IRIs:

```r
find_terms("kilogram", role = "unit", sources = sources_for_role("unit")) |>
  dplyr::select(label, iri, source, score) |>
  head()
```

Searching for taxa with GBIF/WoRMS

For species or organism columns, use taxon resolvers:

```r
find_terms("Oncorhynchus kisutch", role = "entity", sources = c("gbif", "worms")) |>
  dplyr::select(label, iri, source, ontology, score) |>
  head()
```

Interpreting results
The results include several columns for transparency:

- `score`: Computed ranking incorporating source preferences, role boosts, and cross-source agreement
- `alignment_only`: `TRUE` for Wikidata terms (useful for crosswalks, not canonical modeling)
- `agreement_sources`: How many sources returned this term (higher = more confidence)
- `zooma_confidence` / `zooma_annotator`: ZOOMA annotation confidence when applicable
Filter out alignment-only terms when selecting canonical IRIs:

```r
results <- find_terms("salmon", sources = c("ols", "nvs"))
canonical <- results[!results$alignment_only, ]
```

Debugging slow or empty searches
If a search returns unexpected results, check the diagnostics:

```r
results <- find_terms("temperature", sources = c("ols", "nvs", "zooma"))
diagnostics <- attr(results, "diagnostics")
print(diagnostics)
# Shows: source, query, status (success/error), count, elapsed_secs, error message
```

- Decide what kind of term it is:
  - Use a controlled vocabulary term (SKOS concept) when the column holds one of a set of codes (species, run type, etc.).
  - Use an ontology class (OWL) when the column names a category you would treat as a type in data (for example, what kind of unit or entity each row is about).
- Capture the link: place the chosen URI in `term_iri` for the dictionary row. Mention it once; you do not need to repeat the same URI in multiple columns unless they genuinely mean different things.
```r
dict <- infer_dictionary(
  df,
  dataset_id = "my-dataset-2026",
  table_id = "main-table"
)
dict$term_iri[dict$column_name == "SPAWN_EST"] <- "https://w3id.org/gcdfo/salmon#NaturalSpawnerCount"
```

Working with semantic web terms (plain language)
- IRI: think of it as the web address that points to a formal definition. You only need to copy-paste it; you do not need to understand the underlying formal logic.
- `term_iri`: attaches the chosen web address to a column so the column is self-explanatory.
- `entity_iri` and `property_iri`: optional links for measurement columns. Use `property_iri` to specify what characteristic was measured (e.g., “count”), and `entity_iri` to specify what was measured (e.g., “spawning salmon”).
- `constraint_iri`: an optional I-ADOPT component that qualifies the measurement (e.g., “maximum”, “annual average”). Use it only when it adds clarity.
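As a sketch, the I-ADOPT fields above can be filled the same way as `term_iri`. The IRIs below are placeholders, not real terms; look up real ones with `find_terms()`:

```r
row <- dict$column_name == "SPAWN_EST"

# Placeholder IRIs for illustration only -- replace with terms found via find_terms()
dict$property_iri[row] <- "https://example.org/property/count"          # the measured characteristic
dict$entity_iri[row]   <- "https://example.org/entity/spawning-salmon"  # what it was measured on

# constraint_iri is optional; set it only when a qualifier such as "maximum" adds clarity
```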
When to skip linking
- If a column is purely administrative or custom to your survey, it is fine to leave `term_iri` blank and rely on your own `column_description`.
- If you cannot find a fitting term, note the idea in `gpt_proposed_terms.csv` and move on. You can revisit it later when new terms are published.
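A small helper like the following can append a parked idea to `gpt_proposed_terms.csv`. This is only a sketch: the column names (`column_name`, `proposed_label`, `rationale`) are assumptions for illustration, not the file's documented schema.

```r
# Assumed columns -- adjust to match your project's gpt_proposed_terms.csv layout
proposed <- data.frame(
  column_name    = "REDD_COUNT",
  proposed_label = "Redd count",
  rationale      = "No matching term found in DFO/OLS/NVS"
)

# Write a header only when creating the file; append thereafter
exists_already <- file.exists("gpt_proposed_terms.csv")
write.table(proposed, "gpt_proposed_terms.csv",
            sep = ",", row.names = FALSE,
            col.names = !exists_already,
            append = exists_already)
```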
Building vocabulary-aware code lists
- If a categorical column such as `SPECIES` uses a published vocabulary, add a `code_value` row that matches the vocabulary notation and include the `term_iri` for the concept.
- If you do not have a matching vocabulary, describe the codes clearly in `code_label` and `code_description` so reviewers do not have to guess.
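For example, a vocabulary-aware code list row might look like this. The field names follow the guide above, and the IRI is a placeholder rather than a real concept (resolve real taxa via GBIF or WoRMS with `find_terms()`):

```r
codes <- data.frame(
  column_name      = "SPECIES",
  code_value       = "CO",
  code_label       = "Coho salmon",
  code_description = "Oncorhynchus kisutch (coho salmon)",
  term_iri         = "https://example.org/taxon/coho-salmon"  # placeholder IRI
)
```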
Exploring suggestions with metasalmon
`suggest_semantics()` can look at your dictionary and offer `term_iri`, `entity_iri`, or I-ADOPT components based on the bundled catalog.

```r
dict_suggested <- suggest_semantics(df, dict, sources = c("ols", "nvs"))
suggestions <- attr(dict_suggested, "semantic_suggestions")
head(suggestions)
```

It is safe to accept suggestions as a starting point, then tweak the term labels to match your domain-specific language.
Next steps
- See the “How It Fits Together” section in the README for context on how the dictionary, ontology, and GPT assistant team up.
- Follow the Publishing Data Packages guide when you finalize your metadata before publishing.
- Try the Using AI to Document Your Data workflow to draft column descriptions or propose new terms when the catalog does not yet cover your concept.