Support Controlled Vocabularies

📘 Guide: End-to-End Process for Supporting Controlled Vocabularies in DFO

🧾 Step 1: Scientist Submits a Data Dictionary

The scientist downloads and completes the data dictionary template (Excel)
- They describe each variable: name, definition, method, units, etc.
- They may include URIs from known vocabularies if possible
The spreadsheet includes:
- A README sheet with instructions
- A Metadata tab for project-level context
- A Data Dictionary tab with one row per variable

🔍 Step 2: Validate Against DFO Vocabulary

The data dictionary is passed into your R validation function (validate_data_dictionary())
It compares submitted variables to your reference vocabulary, hosted as JSON on GitHub Pages
The validator returns:
- Matched terms
- Unmatched terms
- Suggested mappings using fuzzy logic (e.g., stringdist or Jaro-Winkler)
The scientist or steward reviews and adjusts terms as needed

🧠 Step 3: Consider Alignment to NCEAS Salmon Ontology

For each submitted or candidate term:
- Use an automated lookup tool to check if a concept already exists in the NCEAS Salmon Ontology. Start by using the BioPortal Annotator or API to match term labels and definitions. Optionally, build a lightweight script that compares local vocabulary terms to the ontology by label, synonyms, or definition text using exact or fuzzy string matching. Consider integrating the result into your pipeline for batch validation.
- If it exists:
  - Use skos:exactMatch or skos:closeMatch in your .ttl (Turtle) or .jsonld RDF vocab files. These mappings are typically stored in the ontology files generated from your .csv source. Add a column like exact_match_uri or close_match_uri to your .csv, your export script can incorporate those values automatically during serialization to .ttl or .jsonld.
- If it doesn’t exist, but is broadly applicable:
  - Prepare a Pull Request or issue to suggest the term to the NCEAS maintainers
  - Include:
    - Label, definition, synonyms
    - Usage example or data sources

🧰 Step 4: Curate and Extend DFO-Controlled Vocabularies

Maintain DFO vocabularies in GitHub as:
- .csv for editing
- .json for R/python access and validation
- .jsonld and .ttl for ontology use
Assign persistent URIs using https://w3id.org/dfo/spsi#term-id
Organize vocabularies by data domain or project (e.g., monitoring, habitat, biology)

🧱 Step 5: Build DFO-Specific Ontology for Salmon

Extend your vocabularies with relationships:
- broader, narrower, related
- hasUnit, hasMethod, derivedFrom
Use tools like rdflib or skosify to model in Turtle or OWL
Add namespace info:
- @prefix dfo-spsi: <https://w3id.org/dfo/spsi#>
Over time, link this to:
- NCEAS ontology
- NERC units
- ENVO/OBI for environment/sample methods

🔁 Step 6: Publish and Reuse

Host vocab on GitHub Pages + register URIs at w3id.org
Link from your Quarto site (/data-standards/vocab-index.html)
Allow download in multiple formats (JSON, JSON-LD, TTL)
Use vocab in:
- Data validation tools
- Metadata editors
- APIs and dashboards
- AI pipelines (semantic RAG, search, integration)

This workflow will enable DFO Pacific to build a structured, version-controlled, and interoperable vocabulary and ontology infrastructure for salmon data, with clear ties to domain-wide standards and extensibility for AI use cases.

🤔 Why Bother with Controlled Vocabularies and Ontologies?

It’s a fair question. Why invest all this time defining terms, aligning with ontologies, and assigning URIs?

Because it unlocks a future where your data works harder for you. Controlled vocabularies and ontologies aren’t just academic exercises — they’re the foundation for better discovery, smarter integration, and automation across the science lifecycle.

Here are some tangible ways this work adds value:

🔍 1. Enhanced Data Discovery

“I know someone collected salmon smolt data… but where do I find it?”

By tagging datasets, columns, and metadata with terms from your SPSI vocabulary, you enable: - Keyword search that actually understands synonyms and related terms - Filters based on data type, units, methods, or ecological domain - Smart discovery interfaces (e.g., “Show me everything related to juvenile survival”)

➡️ Example: A data catalog that lets users search by controlled term rather than inconsistent column names across spreadsheets.

🔄 2. Semi-Automated Data Integration

“These datasets report the same metric but use different terms, formats, or units.”

Using controlled terms with defined relationships (e.g., broader, exactMatch, hasUnit), you can: - Detect overlapping fields across submissions automatically - Align columns and units across multiple datasets - Standardize value domains (e.g., Red/Green/Amber vs Critical/Stable/Concerned)

➡️ Example: A script that reads in new FSAR data and automatically maps it to the SPSR schema for validation and loading into a database.

🧠 3. Smarter Applications (AI & Beyond)

“Can’t AI just figure this stuff out?”

Only if you feed it structure.

Controlled vocabularies and ontologies allow you to: - Ground large language models (LLMs) in your domain’s specific terminology - Build Retrieval-Augmented Generation (RAG) systems for question answering - Use semantic search tools to find relevant variables, concepts, and datasets - Train AI to assist in metadata generation, anomaly detection, or dataset classification

➡️ Example: A chatbot that helps scientists describe their data using your vocabulary, or recommends matching fields from existing standards.

🔗 4. Future-Proof Interoperability

“What if we want to share this with other agencies or join a broader platform?”

Standardized vocabularies with persistent URIs and ontology alignments (e.g., with NCEAS, ENVO, OBO) make it easy to: - Share DFO terms with external partners - Convert metadata to international schemas (e.g., Darwin Core, ISO 19115) - Plug into federated data platforms without starting over

➡️ Example: Publishing your vocabulary to w3id.org lets others reference and reuse your terms as global identifiers.

🧰 5. Better Metadata, Better Stewardship

“I just want people to fill out metadata that makes sense.”

Controlled vocabularies: - Reduce ambiguity - Improve machine readability - Make metadata easier to validate and automate

➡️ Example: A dropdown menu in a data intake form linked to your vocab that auto-fills units, definitions, and example values — while storing clean, machine-readable metadata under the hood.

🧭 TL;DR

Controlled vocabularies and ontologies are not the end — they’re the beginning.

They allow you to: - Find data more easily - Integrate it more reliably - Build tools more effectively - And use AI more meaningfully

They turn your data into infrastructure.

And they help ensure that knowledge created today can be used, re-used, and trusted — long into the future.