Support Controlled Vocabularies
📘 Guide: End-to-End Process for Supporting Controlled Vocabularies in DFO
🧾 Step 1: Scientist Submits a Data Dictionary
- The scientist downloads and completes the data dictionary template (Excel)
- They describe each variable: name, definition, method, units, etc.
- They may include URIs from known vocabularies if possible
- The spreadsheet includes:
  - A `README` sheet with instructions
  - A `Metadata` tab for project-level context
  - A `Data Dictionary` tab with one row per variable
🔍 Step 2: Validate Against DFO Vocabulary
- The data dictionary is passed into your R validation function (`validate_data_dictionary()`)
- It compares submitted variables to your reference vocabulary, hosted as JSON on GitHub Pages
- The validator returns:
  - Matched terms
  - Unmatched terms
  - Suggested mappings from fuzzy string matching (e.g., Jaro-Winkler distance via the `stringdist` package)
- The scientist or steward reviews and adjusts terms as needed
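The matched / unmatched / suggested split can be sketched in a few lines. This is a Python illustration of the idea only, not the R `validate_data_dictionary()` function itself. It swaps in the similarity ratio from the standard library's `difflib` in place of Jaro-Winkler, and the function name `validate_terms`, the `0.8` threshold, and the sample terms are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def validate_terms(submitted, vocabulary, threshold=0.8):
    """Sort submitted variable names into matched, suggested, and unmatched."""
    # Normalize vocabulary labels once so comparison is case-insensitive.
    normalized = {label.strip().lower(): label for label in vocabulary}
    result = {"matched": [], "suggested": {}, "unmatched": []}
    for name in submitted:
        key = name.strip().lower()
        if key in normalized:
            result["matched"].append(name)
            continue
        # Score every vocabulary label and keep the closest one.
        best_label, best_score = None, 0.0
        for norm, label in normalized.items():
            score = SequenceMatcher(None, key, norm).ratio()
            if score > best_score:
                best_label, best_score = label, score
        if best_score >= threshold:
            result["suggested"][name] = best_label  # needs human review
        else:
            result["unmatched"].append(name)
    return result


# Hypothetical submitted names and reference vocabulary.
report = validate_terms(
    ["fork_length", "fork_lenght", "water temp"],
    ["fork_length", "spawner_abundance"],
)
print(report)
```

Suggestions are only candidates: the threshold trades missed matches against false positives, so the review step above still applies.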
🧠 Step 3: Consider Alignment to NCEAS Salmon Ontology
- For each submitted or candidate term:
  - Use an automated lookup tool to check whether a concept already exists in the NCEAS Salmon Ontology. Start with the BioPortal Annotator or API to match term labels and definitions. Optionally, build a lightweight script that compares local vocabulary terms to the ontology by label, synonyms, or definition text using exact or fuzzy string matching. Consider integrating the result into your pipeline for batch validation.
  - If it exists:
    - Use `skos:exactMatch` or `skos:closeMatch` in your `.ttl` (Turtle) or `.jsonld` RDF vocab files. These mappings are typically stored in the ontology files generated from your `.csv` source. If you add a column such as `exact_match_uri` or `close_match_uri` to your `.csv`, your export script can incorporate those values automatically during serialization to `.ttl` or `.jsonld`.
  - If it doesn’t exist, but is broadly applicable:
    - Prepare a Pull Request or issue to suggest the term to the NCEAS maintainers
    - Include:
      - Label, definition, synonyms
      - Usage example or data sources
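As a rough sketch of the export step described above, the snippet below reads rows from a CSV and writes SKOS mapping triples in Turtle. The column names `term_id` and `label`, the function name `rows_to_turtle`, and the `https://example.org/ontology#ForkLength` target are placeholders invented for the example; substitute the real ontology URIs and your own column layout. It builds the Turtle text by hand to stay dependency-free, whereas a production script would more likely use an RDF library.

```python
import csv
import io

BASE = "https://w3id.org/dfo/spsi#"
# Assumed CSV column names mapped to the SKOS property each one feeds.
MAPPING_COLUMNS = {
    "exact_match_uri": "skos:exactMatch",
    "close_match_uri": "skos:closeMatch",
}


def rows_to_turtle(rows):
    """Serialize vocabulary rows as SKOS concepts in Turtle."""
    lines = [
        f"@prefix dfo-spsi: <{BASE}> .",
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
        "",
    ]
    for row in rows:
        # Labels are assumed to contain no double quotes or backslashes.
        statements = ["a skos:Concept", f'skos:prefLabel "{row["label"]}"@en']
        for column, prop in MAPPING_COLUMNS.items():
            uri = (row.get(column) or "").strip()
            if uri:  # only emit a mapping when the cell is filled in
                statements.append(f"{prop} <{uri}>")
        subject = f"dfo-spsi:{row['term_id']} "
        lines.append(subject + " ;\n    ".join(statements) + " .")
    return "\n".join(lines)


# Hypothetical rows; the example.org URI stands in for a real ontology term.
sample_csv = """term_id,label,exact_match_uri,close_match_uri
fork_length,Fork length,https://example.org/ontology#ForkLength,
smolt_count,Smolt count,,
"""
turtle = rows_to_turtle(csv.DictReader(io.StringIO(sample_csv)))
print(turtle)
```

Rows with empty mapping cells still produce a concept, just without a mapping statement, so terms awaiting alignment do not block the export.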
🧰 Step 4: Curate and Extend DFO-Controlled Vocabularies
- Maintain DFO vocabularies in GitHub as:
  - `.csv` for editing
  - `.json` for R/Python access and validation
  - `.jsonld` and `.ttl` for ontology use
- Assign persistent URIs using `https://w3id.org/dfo/spsi#term-id`
- Organize vocabularies by data domain or project (e.g., monitoring, habitat, biology)
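One way to keep the `.csv` as the single editing surface is to generate the `.json` from it. The sketch below assumes columns named `term_id`, `label`, `definition`, and `domain`, none of which are prescribed by this guide, and mints each URI by appending the term ID to the base shown above. The sample row is illustrative only.

```python
import csv
import io
import json

BASE_URI = "https://w3id.org/dfo/spsi#"


def csv_to_json(csv_text):
    """Convert the editable CSV into a JSON document keyed by term ID."""
    terms = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        term_id = row["term_id"].strip()
        if term_id in terms:
            # Duplicate IDs would produce duplicate URIs, so fail early.
            raise ValueError(f"duplicate term_id: {term_id}")
        terms[term_id] = {
            "uri": BASE_URI + term_id,
            "label": row["label"].strip(),
            "definition": row["definition"].strip(),
            "domain": row["domain"].strip(),
        }
    return json.dumps(terms, indent=2, sort_keys=True)


# Hypothetical row for illustration.
sample_csv = """term_id,label,definition,domain
fork_length,Fork length,Illustrative definition text.,biology
"""
vocab_json = csv_to_json(sample_csv)
print(vocab_json)
```

Sorting the keys keeps the generated file stable between runs, which makes Git diffs easier to review.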
🧱 Step 5: Build DFO-Specific Ontology for Salmon
- Extend your vocabularies with relationships:
  - `broader`, `narrower`, `related`
  - `hasUnit`, `hasMethod`, `derivedFrom`
- Use tools like `rdflib` or `skosify` to model in Turtle or OWL
- Add namespace info: `@prefix dfo-spsi: <https://w3id.org/dfo/spsi#> .`
- Over time, link this to:
  - NCEAS ontology
  - NERC units
  - ENVO/OBI for environment/sample methods
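Because `broader` and `narrower` are inverse properties in SKOS, only one direction needs to be maintained by hand; the other can be derived. The sketch below shows that derivation, plus a walk up the hierarchy, over a plain dictionary. The term IDs and the hierarchy are made up for the example, and a real build would more likely do this over an RDF graph.

```python
from collections import defaultdict

# Hypothetical terms; each key maps to its broader (parent) term.
broader = {
    "fork_length": "body_length",
    "total_length": "body_length",
    "body_length": "morphometric_measurement",
}


def invert(relation):
    """Derive narrower links from broader links."""
    narrower = defaultdict(list)
    for child, parent in relation.items():
        narrower[parent].append(child)
    return {parent: sorted(children) for parent, children in narrower.items()}


def ancestors(term, relation):
    """Walk broader links upward, nearest ancestor first."""
    chain = []
    seen = {term}
    while term in relation:
        term = relation[term]
        if term in seen:
            break  # guard against accidental cycles in the source data
        seen.add(term)
        chain.append(term)
    return chain


narrower = invert(broader)
print(narrower["body_length"])
print(ancestors("fork_length", broader))
```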
🔁 Step 6: Publish and Reuse
- Host vocab on GitHub Pages + register URIs at w3id.org
- Link from your Quarto site (`/data-standards/vocab-index.html`)
- Allow download in multiple formats (JSON, JSON-LD, TTL)
- Use vocab in:
  - Data validation tools
  - Metadata editors
  - APIs and dashboards
  - AI pipelines (semantic RAG, search, integration)
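For the JSON-LD download, a single term can be described as a SKOS concept along the lines below. This is a minimal sketch: the function name `term_to_jsonld`, the `fork_length` ID, and the definition text are placeholders, and a published file would normally carry more properties and share one context across all terms.

```python
import json

SKOS = "http://www.w3.org/2004/02/skos/core#"
BASE_URI = "https://w3id.org/dfo/spsi#"


def term_to_jsonld(term_id, label, definition):
    """Describe one vocabulary term as a SKOS concept in JSON-LD."""
    return {
        "@context": {"skos": SKOS},
        "@id": BASE_URI + term_id,
        "@type": "skos:Concept",
        "skos:prefLabel": {"@value": label, "@language": "en"},
        "skos:definition": {"@value": definition, "@language": "en"},
    }


# Hypothetical term for illustration.
doc = term_to_jsonld("fork_length", "Fork length", "Illustrative definition text.")
print(json.dumps(doc, indent=2))
```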
This workflow will enable DFO Pacific to build a structured, version-controlled, and interoperable vocabulary and ontology infrastructure for salmon data, with clear ties to domain-wide standards and extensibility for AI use cases.
🤔 Why Bother with Controlled Vocabularies and Ontologies?
It’s a fair question. Why invest all this time defining terms, aligning with ontologies, and assigning URIs?
Because it unlocks a future where your data works harder for you. Controlled vocabularies and ontologies aren’t just academic exercises — they’re the foundation for better discovery, smarter integration, and automation across the science lifecycle.
Here are some tangible ways this work adds value:
🔍 1. Enhanced Data Discovery
“I know someone collected salmon smolt data… but where do I find it?”
By tagging datasets, columns, and metadata with terms from your SPSI vocabulary, you enable:
- Keyword search that actually understands synonyms and related terms
- Filters based on data type, units, methods, or ecological domain
- Smart discovery interfaces (e.g., “Show me everything related to juvenile survival”)
➡️ Example: A data catalog that lets users search by controlled term rather than inconsistent column names across spreadsheets.
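A toy version of that catalog search is sketched below. A query is resolved to a controlled term through its label or synonyms, and the term then selects the datasets tagged with it. The vocabulary entries, synonyms, and file names are invented for the example and are not taken from any real vocabulary.

```python
# Hypothetical vocabulary entries and dataset tags.
VOCAB = {
    "smolt": {"label": "Smolt", "synonyms": ["outmigrant", "juvenile migrant"]},
    "spawner": {"label": "Spawner", "synonyms": ["adult return"]},
}
DATASETS = {
    "trap_counts_2021.csv": ["smolt"],
    "stream_walks_2020.csv": ["spawner"],
}


def search(query):
    """Return datasets tagged with the term that the query resolves to."""
    q = query.strip().lower()
    hits = set()
    for term_id, entry in VOCAB.items():
        names = [entry["label"].lower()] + [s.lower() for s in entry["synonyms"]]
        if q in names:
            hits.add(term_id)
    # A dataset matches when it carries at least one resolved term.
    return sorted(name for name, tags in DATASETS.items() if hits.intersection(tags))


print(search("outmigrant"))
```

The datasets never need to contain the word the user typed; the synonym lives in the vocabulary, not in each spreadsheet.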
🔄 2. Semi-Automated Data Integration
“These datasets report the same metric but use different terms, formats, or units.”
Using controlled terms with defined relationships (e.g., `broader`, `exactMatch`, `hasUnit`), you can:
- Detect overlapping fields across submissions automatically
- Align columns and units across multiple datasets
- Standardize value domains (e.g., `Red/Green/Amber` vs `Critical/Stable/Concerned`)
➡️ Example: A script that reads in new FSAR data and automatically maps it to the SPSR schema for validation and loading into a database.
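The column and value alignment can be sketched as two lookup tables applied to each record. Everything in the tables below is an assumption made for illustration, including the source column names and in particular the pairing of Red/Amber/Green with Critical/Concerned/Stable, which would need to be confirmed by a data steward before use.

```python
# Hypothetical mappings from source column names to controlled terms.
COLUMN_TERMS = {"FL_mm": "fork_length", "ForkLen": "fork_length", "Status": "status"}
# Assumed correspondence between the two status schemes (not confirmed).
STATUS_VALUES = {"red": "Critical", "amber": "Concerned", "green": "Stable"}


def harmonize(record):
    """Rename known columns to controlled terms and standardize status values."""
    out = {}
    for column, value in record.items():
        term = COLUMN_TERMS.get(column)
        if term is None:
            continue  # unmapped columns are dropped here and left for manual review
        if term == "status":
            # Unknown status values pass through unchanged.
            value = STATUS_VALUES.get(str(value).strip().lower(), value)
        out[term] = value
    return out


print(harmonize({"FL_mm": 82, "Status": "Amber", "notes": "checked"}))
```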
🧠 3. Smarter Applications (AI & Beyond)
“Can’t AI just figure this stuff out?”
Only if you feed it structure.
Controlled vocabularies and ontologies allow you to:
- Ground large language models (LLMs) in your domain’s specific terminology
- Build Retrieval-Augmented Generation (RAG) systems for question answering
- Use semantic search tools to find relevant variables, concepts, and datasets
- Train AI to assist in metadata generation, anomaly detection, or dataset classification
➡️ Example: A chatbot that helps scientists describe their data using your vocabulary, or recommends matching fields from existing standards.
🔗 4. Future-Proof Interoperability
“What if we want to share this with other agencies or join a broader platform?”
Standardized vocabularies with persistent URIs and ontology alignments (e.g., with NCEAS, ENVO, OBO) make it easy to:
- Share DFO terms with external partners
- Convert metadata to international schemas (e.g., Darwin Core, ISO 19115)
- Plug into federated data platforms without starting over
➡️ Example: Registering your vocabulary’s URIs at w3id.org lets others reference and reuse your terms as persistent global identifiers.
🧰 5. Better Metadata, Better Stewardship
“I just want people to fill out metadata that makes sense.”
Controlled vocabularies:
- Reduce ambiguity
- Improve machine readability
- Make metadata easier to validate and automate
➡️ Example: A dropdown menu in a data intake form linked to your vocab that auto-fills units, definitions, and example values — while storing clean, machine-readable metadata under the hood.
🧭 TL;DR
Controlled vocabularies and ontologies are not the end — they’re the beginning.
They allow you to:
- Find data more easily
- Integrate it more reliably
- Build tools more effectively
- And use AI more meaningfully
They turn your data into infrastructure.
And they help ensure that knowledge created today can be used, re-used, and trusted — long into the future.