
5-Minute Quickstart
Installation
First, install metasalmon from GitHub:
# Install from GitHub (recommended)
install.packages("remotes")
remotes::install_github("dfo-pacific-science/metasalmon")
What You’ll Learn
By the end of this guide, you’ll be able to:
- Turn your salmon data spreadsheet into a shareable “data package”
- Create a data dictionary that explains what each column means
- Share data that colleagues can immediately understand
What You’ll Need
- R installed (version 4.4 or higher)
- Your salmon data as a CSV file (or use our example data)
- About 5 minutes
Step 1: Load Your Data
First, let’s load the metasalmon package and some example data. We’ll use a sample of NuSEDS escapement data for Fraser River coho that comes with the package.
library(metasalmon)
# Load the example data included with the package
data_path <- system.file("extdata", "nuseds-fraser-coho-sample.csv",
                         package = "metasalmon")
df <- readr::read_csv(data_path, show_col_types = FALSE)
# Take a look at what we have
head(df)
What you see: A table with columns like POP_ID, SPECIES, ANALYSIS_YR, MAX_ESTIMATE, etc.
The problem: What does POP_ID mean? What are the valid values for SPECIES? If you shared this CSV with a colleague, they’d have questions.
Using your own data? Just replace the data_path line with your file path:
df <- readr::read_csv("path/to/your-data.csv", show_col_types = FALSE)
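Before generating a dictionary from your own file, it can help to glance at each column's class and how many values are missing. The sketch below uses base R only; the data frame `df_own` and its values are made up for illustration and stand in for whatever `readr::read_csv()` returns for your file.

```r
# Hypothetical stand-in for your own data (values are invented for illustration)
df_own <- data.frame(
  POP_ID       = c(101L, 102L, 103L),
  SPECIES      = c("Coho", "Coho", "Coho"),
  ANALYSIS_YR  = c(2001L, 2002L, 2003L),
  MAX_ESTIMATE = c(1500, NA, 320)
)

# One row per column: its name, its R class, and the number of missing values
overview <- data.frame(
  column    = names(df_own),
  class     = vapply(df_own, function(x) class(x)[1], character(1)),
  n_missing = vapply(df_own, function(x) sum(is.na(x)), integer(1)),
  row.names = NULL
)

print(overview)
```

Columns that show an unexpected class (for example, a year read as text) are worth a second look when you review the inferred dictionary in Step 2.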
Step 2: Generate a Data Dictionary
metasalmon can look at your data and create a starter dictionary - a table that describes each column.
dict <- infer_dictionary(
  df,
  dataset_id = "fraser-coho-2024",
  table_id = "escapement"
)
# See what it created
print(dict)
What you get: A table with one row per column in your data, describing:
| Field | What it means |
|---|---|
| column_name | The column name from your data |
| column_label | A human-readable label (you can edit this) |
| value_type | Is it text (string), a number (integer, number), a date? |
| column_role | Is this an ID, a measurement, a category, or a date? |
The dictionary is a starting point - metasalmon makes educated guesses, but you can (and should) review and improve the descriptions.
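To give a feel for what an "educated guess" about a column type can look like, here is a minimal base R sketch that maps R classes to the value types used in this guide. It is an illustration only and not metasalmon's actual inference logic; the helper `guess_value_type()` and the `SURVEY_DATE` column are hypothetical.

```r
# Hypothetical helper: map an R vector to one of the value types named in this guide.
# This is NOT how infer_dictionary() is implemented; it only illustrates the idea.
guess_value_type <- function(x) {
  if (inherits(x, "POSIXct")) return("datetime")
  if (inherits(x, "Date"))    return("date")
  if (is.logical(x))          return("boolean")
  if (is.integer(x))          return("integer")
  if (is.numeric(x))          return("number")
  "string"
}

# Made-up example data; SURVEY_DATE is an invented column for illustration
example <- data.frame(
  POP_ID       = c(101L, 102L),
  MAX_ESTIMATE = c(1500.5, 320),
  SPECIES      = c("Coho", "Coho"),
  SURVEY_DATE  = as.Date(c("2001-10-15", "2002-10-20"))
)

guessed <- vapply(example, guess_value_type, character(1))
print(guessed)
```

A guess based on class alone can be wrong, for example when a year is stored as text, which is why reviewing the dictionary by hand matters.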
Tip: To see all columns in the dictionary, use View(dict) in RStudio or print(dict, width = Inf).
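When reviewing a dictionary, one quick way to find rows that still need attention is to list the columns whose description is missing or blank. The sketch below assumes the dictionary has `column_name` and `column_description` columns (both appear elsewhere in this guide) and that unfilled descriptions are `NA` or empty strings, which is an assumption. `dict_demo` is a small hand-built stand-in, not real output from `infer_dictionary()`.

```r
# Hand-built stand-in for a dictionary (not real infer_dictionary() output)
dict_demo <- data.frame(
  column_name = c("POP_ID", "SPECIES", "MAX_ESTIMATE"),
  column_description = c(
    "Unique population identifier from the NuSEDS database",
    NA,
    ""
  )
)

# Columns whose description is missing (NA) or blank
is_blank <- is.na(dict_demo$column_description) |
  trimws(dict_demo$column_description) == ""
needs_description <- dict_demo$column_name[is_blank]

print(needs_description)
```

The same two lines should work on your own `dict` if it uses these column names.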
Need Help Finding Standard Terms?
Not sure what the official salmon data standard term is for a column?
The suggest_semantics() function can automatically suggest
standard terminology from scientific vocabularies:
# Get semantic suggestions for your dictionary
dict_suggested <- suggest_semantics(
  df,
  dict,
  sources = c("ols", "nvs")
)
# View the suggestions
suggestions <- attr(dict_suggested, "semantic_suggestions")
head(suggestions)
This will search standard ontologies and vocabularies to find matching terms for your columns, helping you link your data to recognized scientific standards.
Want faster results? Use the Salmon Data Standardizer GPT to get AI-powered suggestions for terminology, descriptions, and metadata. Just upload your dictionary and data sample!
# Optional: include DwC-DP export mappings alongside ontology suggestions
sem <- suggest_semantics(dict) # default: DwC export off
sem_with_dwc <- suggest_semantics(dict, include_dwc = TRUE)
DwC-DP mappings stay optional; keep SDP columns as canonical and use the DwC view only when exporting to biodiversity tooling.
Step 3: Check Your Dictionary
Before packaging, let’s make sure the dictionary is valid:
validate_dictionary(dict)
What happens:
- Green checkmarks = everything looks good
- Warnings = optional improvements you could make
- Errors = things you need to fix before proceeding
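To make the kinds of problems a validator can catch more concrete, here is a rough base R sketch of two checks: every data column has a dictionary row, and every value_type is one of the types listed in the Troubleshooting section. This is an assumption-laden illustration, not the implementation of `validate_dictionary()`; the helper `check_dictionary()` and the demo objects are hypothetical.

```r
# Value types named in this guide's Troubleshooting section
valid_types <- c("string", "integer", "number", "boolean", "date", "datetime")

# Hypothetical helper: returns a character vector of problems (empty if none)
check_dictionary <- function(dict, data) {
  problems <- character(0)

  # Data columns that have no row in the dictionary
  missing_cols <- setdiff(names(data), dict$column_name)
  if (length(missing_cols) > 0) {
    problems <- c(problems, paste("No dictionary row for column:", missing_cols))
  }

  # Dictionary rows whose value_type is not in the allowed set
  bad_type <- dict$column_name[!dict$value_type %in% valid_types]
  if (length(bad_type) > 0) {
    problems <- c(problems, paste("Unrecognised value_type for column:", bad_type))
  }

  problems
}

# Made-up demo: SPECIES has no dictionary row and POP_ID has a misspelled type
data_demo <- data.frame(POP_ID = 101L, SPECIES = "Coho", ANALYSIS_YR = 2001L)
dict_demo <- data.frame(
  column_name = c("POP_ID", "ANALYSIS_YR"),
  value_type  = c("intger", "integer")
)

problems <- check_dictionary(dict_demo, data_demo)
print(problems)
```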
If you see errors, the message will tell you what’s wrong. Common fixes:
# Example: Fix a column type that was guessed incorrectly
dict$value_type[dict$column_name == "ANALYSIS_YR"] <- "integer"
# Example: Add a better description
dict$column_description[dict$column_name == "POP_ID"] <-
  "Unique population identifier from the NuSEDS database"
# Validate again
validate_dictionary(dict)
Step 4: Describe Your Dataset
Now we need to add some basic information about the dataset as a whole - who created it, what it contains, and how others can use it.
# Dataset-level metadata (describes the overall dataset)
dataset_meta <- tibble::tibble(
  dataset_id = "fraser-coho-2024",
  title = "Fraser River Coho Escapement Data",
  description = "Sample escapement monitoring data for coho salmon in PFMA 29",
  creator = "DFO Pacific Science",
  contact_name = "Your Name",
  contact_email = "your.email@dfo-mpo.gc.ca",
  license = "Open Government Licence - Canada",
  temporal_start = "2001",
  temporal_end = "2024",
  spatial_extent = "PFMA 29, Fraser River watershed"
)
# Table-level metadata (describes this specific table)
table_meta <- tibble::tibble(
  dataset_id = "fraser-coho-2024",
  table_id = "escapement",
  file_name = "escapement.csv",
  table_label = "Escapement Data",
  description = "Coho escapement counts by population and year",
  primary_key = "POP_ID"
)
What these fields mean:
| Field | Purpose |
|---|---|
| dataset_id | A short identifier (letters, numbers, hyphens) |
| title | Human-readable title for the dataset |
| description | What the data contains |
| creator | Who created/collected the data |
| contact_name/email | Who to contact with questions |
| license | How others can use the data |
| table_id | Identifier for this table (matches dict) |
| file_name | What to name the output CSV file |
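The table describes identifiers as made of letters, numbers, and hyphens. A small base R check along those lines is sketched below. Whether metasalmon enforces this exact rule is an assumption; the helper `is_valid_id()` is hypothetical and only encodes the description in the table.

```r
# Hypothetical helper: TRUE when an identifier uses only letters, digits, and hyphens
is_valid_id <- function(id) {
  grepl("^[A-Za-z0-9-]+$", id)
}

# Candidate identifiers: the first follows the rule, the others do not
ids <- c("fraser-coho-2024", "Fraser Coho 2024", "fraser_coho")
valid <- is_valid_id(ids)

print(data.frame(id = ids, valid = valid))
```

Running a check like this on `dataset_id` and `table_id` before packaging can catch spaces or underscores early.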
Step 5: Create Your Data Package
Now let’s bundle everything together into a shareable package:
pkg_path <- create_salmon_datapackage(
  resources = list(escapement = df),
  dataset_meta = dataset_meta,
  table_meta = table_meta,
  dict = dict,
  path = "my-first-package",
  overwrite = TRUE
)
# See what was created
list.files(pkg_path)
What you get: A folder called my-first-package/ containing:
| File | Purpose |
|---|---|
| escapement.csv | Your data |
| column_dictionary.csv | What each column means |
| datapackage.json | Machine-readable metadata |
| dataset.csv | Dataset-level information |
| tables.csv | Table-level information |
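A quick way to confirm that a package folder contains the files listed above is to compare the expected names against what is on disk. The sketch below is base R only; the helper `missing_package_files()` is hypothetical, and the demo runs against a throwaway folder with empty placeholder files rather than a real package.

```r
# File names taken from the table above
expected_files <- c(
  "escapement.csv", "column_dictionary.csv", "datapackage.json",
  "dataset.csv", "tables.csv"
)

# Hypothetical helper: which expected files are absent from a folder?
missing_package_files <- function(path, expected = expected_files) {
  setdiff(expected, list.files(path))
}

# Demo on a temporary folder that deliberately holds only two placeholder files
demo_dir <- file.path(tempdir(), "demo-package")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, c("escapement.csv", "datapackage.json"))))

missing_files <- missing_package_files(demo_dir)
print(missing_files)
```

On a real package you would call `missing_package_files(pkg_path)`; an empty result means every expected file is present.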
Step 6: Share It!
Your data package is ready. You can:
- Email the folder to a colleague (zip it first)
- Upload to a data repository like Zenodo or CIOOS
- Archive it for your future self
- Include it in a research compendium
When someone opens your package, they’ll find not just data, but complete documentation explaining what every column means.
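For emailing or uploading, the folder needs to become a single archive. The sketch below uses `utils::tar()` to write a gzip-compressed `.tar.gz`, chosen here because `utils::zip()` relies on an external zip program being installed. The folder is a temporary stand-in with one placeholder file; with a real package you would point `pkg_dir` at your package folder.

```r
# Temporary stand-in for a package folder (one placeholder file only)
pkg_dir <- file.path(tempdir(), "my-first-package")
dir.create(pkg_dir, showWarnings = FALSE)
writeLines("{}", file.path(pkg_dir, "datapackage.json"))

# Write the archive next to the folder, using paths relative to its parent
archive <- file.path(tempdir(), "my-first-package.tar.gz")
old_wd <- setwd(dirname(pkg_dir))
utils::tar(archive, files = basename(pkg_dir), compression = "gzip")
setwd(old_wd)

# List the archive contents to confirm what was packed
archived <- utils::untar(archive, list = TRUE)
print(archived)
```

Archiving from the parent directory keeps the folder name inside the archive, so it unpacks into `my-first-package/` rather than scattering files.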
Reading a Package Back
Later, you (or a colleague) can load the package back into R:
# Read the package
pkg <- read_salmon_datapackage(pkg_path)
# What's inside?
names(pkg)
# Access the components
pkg$dataset # Dataset metadata
pkg$tables # Table metadata
pkg$dictionary # Column descriptions
pkg$resources # Your actual data
# Get your data back as a tibble
head(pkg$resources$escapement)
What’s Next?
You’ve created your first Salmon Data Package! Here are some ways to go deeper:
- Using AI to Document Your Data - Use the Salmon Data Standardizer GPT to get AI-powered suggestions for terminology, descriptions, and metadata
- Publishing Data Packages - More control over metadata and publishing
- Linking to Standard Vocabularies - Connect your data to scientific standards
- Accessing Data from GitHub - Read CSVs from private repositories
- Glossary of Terms - Definitions of technical terms
- FAQ - Common questions and troubleshooting
Troubleshooting
“validate_dictionary() shows errors”
This usually means a column type was guessed incorrectly. Check the error message and fix:
# See what types are valid
# string, integer, number, boolean, date, datetime
# Fix a specific column
dict$value_type[dict$column_name == "PROBLEM_COLUMN"] <- "string"
“Column not found in dictionary”
Make sure your table_id in table_meta matches the table_id you used in infer_dictionary().
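A mismatch of this kind is often a typo, and comparing the two sets of identifiers makes it easy to spot. The sketch below assumes the dictionary carries a `table_id` column (plausible given that `infer_dictionary()` takes a `table_id` argument, but still an assumption). The demo objects are hand-built, with a deliberate misspelling.

```r
# Hand-built stand-ins; "escapment" is a deliberate typo in the table metadata
dict_demo <- data.frame(
  table_id    = c("escapement", "escapement"),
  column_name = c("POP_ID", "SPECIES")
)
table_meta_demo <- data.frame(
  table_id  = "escapment",
  file_name = "escapement.csv"
)

# table_id values in the table metadata that never appear in the dictionary
unmatched <- setdiff(table_meta_demo$table_id, unique(dict_demo$table_id))
print(unmatched)
```

An empty result means every table in `table_meta` has matching dictionary rows.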
“I don’t understand what a field means”
See the Glossary for plain-English definitions of terms like “column_role”, “value_type”, etc.
“I want to add better descriptions”
Edit the dictionary directly before creating the package:
# View and edit in RStudio
View(dict)
# Or edit programmatically
dict$column_description[dict$column_name == "MAX_ESTIMATE"] <-
  "Maximum escapement estimate for the population in a given year"
dict$column_label[dict$column_name == "MAX_ESTIMATE"] <-
  "Maximum Estimate"