
Frequently Asked Questions
General Questions
Do I need to understand ontologies to use this package?
No. You can create perfectly good data packages without knowing anything about ontologies, SKOS, OWL, or IRIs. metasalmon handles the technical details automatically.
If you’re curious about what these terms mean, see the Glossary. But you don’t need to understand them to use the package effectively.
What’s the difference between this and just sharing a CSV?
A CSV file contains only data - numbers and text. Anyone opening it has to guess what the columns mean, what units are used, and what the codes represent.
A data package contains:
| Component | What it provides |
|---|---|
| Your data | The actual numbers (same as a CSV) |
| Data dictionary | What each column means |
| Code lists | What each code value means (e.g., “CO” = “Coho Salmon”) |
| Metadata | Who created it, when, license, contact info |
| Standard links | (Optional) Links to official scientific definitions |
Think of it like shipping a package: A CSV is like sending a box with no label. A data package is like sending a box with a detailed packing list, return address, and handling instructions.
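To make that concrete, a metasalmon package is just a folder. A typical layout might look like the sketch below (the individual metadata file names are illustrative; only `datapackage.json`, `data/*.csv`, and `metadata/*.csv` are described elsewhere in this FAQ):

```
my-data-package/
├── datapackage.json          # machine-readable descriptor (derived export)
├── data/
│   └── escapement.csv        # the actual data, same as any CSV
└── metadata/
    ├── dataset_metadata.csv  # who, when, license, contact
    ├── table_metadata.csv    # what each table contains
    ├── column_dictionary.csv # what each column means
    └── code_list.csv         # what each code value means
```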
How long does it take to create a data package?
| Scenario | Time |
|---|---|
| Simple dataset, using defaults | 5-10 minutes |
| Adding custom descriptions | 15-30 minutes |
| Complex dataset with many categorical columns | 30-60 minutes |
| Using AI assistance for descriptions | 30-60 minutes |
The 5-Minute Quickstart walks you through the fastest path.
Can I edit the data package after creating it?
Yes. A data package is just a folder with files. You can:
- Edit the `metadata/*.csv` and `data/*.csv` files directly in Excel or R
- Use `create_sdp()` for the main one-shot workflow from raw tables
- Use `write_salmon_datapackage()` when you have already assembled metadata and want the advanced/manual writer
- Regenerate `datapackage.json` by rewriting the package after metadata changes
The package structure is designed to be human-readable and editable.
The `metadata/*.csv` files are canonical; `datapackage.json` is a derived export, so avoid treating the JSON as the source of truth.
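A sketch of the edit-then-regenerate cycle, assuming a package folder with the layout described above (the metadata file names and the single `escapement` resource are illustrative; check the function reference for exact signatures):

```r
library(readr)

# Re-read the hand-edited canonical metadata files
dict         <- read_csv("my-package/metadata/column_dictionary.csv")
dataset_meta <- read_csv("my-package/metadata/dataset_metadata.csv")
table_meta   <- read_csv("my-package/metadata/table_metadata.csv")

# Check the edits before regenerating
validate_dictionary(dict)

# Rewrite the package so datapackage.json reflects the edited CSVs
write_salmon_datapackage(
  resources    = list(escapement = read_csv("my-package/data/escapement.csv")),
  dataset_meta = dataset_meta,
  table_meta   = table_meta,
  dict         = dict,
  path         = "my-package"
)
```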
Can colleagues open my package without installing metasalmon?
Yes! The data package format is based on Frictionless Data, an international standard. Your colleagues can:
- Open the CSVs in Excel, R, Python, or any spreadsheet software
- Read the dictionary to understand what columns mean
- Use the metadata to understand the dataset’s context
metasalmon makes it easy to create packages, but anyone can read them without special software. When you share one, send the whole folder (or a zip of the whole folder) so the canonical metadata stays with the data files.
Does this work with Python?
The data packages created by metasalmon follow the Frictionless Data standard, which has excellent Python support:
```python
from frictionless import Package

# Load a metasalmon-created package
package = Package("path/to/my-data-package/datapackage.json")

# Access the data
for resource in package.resources:
    print(resource.read_rows())
```

See the Frictionless Python documentation for more details.
Technical Questions
validate_dictionary() shows errors. What do I do?
Here are the most common errors and how to fix them:
| Error | Meaning | Fix |
|---|---|---|
| “Missing required column” | Your dictionary is missing a required column | Check the template: `system.file("extdata", "column_dictionary.csv", package = "metasalmon")` |
| “Invalid value_type” | You used a type that doesn’t exist | Use one of: `string`, `integer`, `number`, `boolean`, `date`, `datetime` |
| “Invalid column_role” | You used a role that doesn’t exist | Use one of: `identifier`, `attribute`, `measurement`, `temporal`, `categorical` |
| “Duplicate column names” | Two rows have the same `column_name` | Remove or rename the duplicate |
Example fix:

```r
# See what the error message says, then fix it:
dict$value_type[dict$column_name == "PROBLEM_COLUMN"] <- "string"
dict$column_role[dict$column_name == "ANOTHER_COLUMN"] <- "attribute"

# Validate again
validate_dictionary(dict)
```

My data has special characters. Will that cause problems?
metasalmon handles UTF-8 encoded data correctly. If you have special characters (accents, non-English letters):
- Make sure your CSV is saved as UTF-8
- Use `readr::read_csv()`, which defaults to UTF-8
- If using base R, specify the encoding: `read.csv("file.csv", encoding = "UTF-8")`
Can I use this with data in formats other than CSV?
metasalmon works with data frames in R. You can read data from any format into R first:
```r
# From Excel
library(readxl)
df <- read_excel("your-data.xlsx")

# From a database
library(DBI)
con <- dbConnect(...)
df <- dbReadTable(con, "your_table")

# From Parquet
library(arrow)
df <- read_parquet("your-data.parquet")

# Then use metasalmon as usual
dict <- infer_dictionary(df, dataset_id = "my-data", table_id = "main")
```

What’s the difference between dataset_id and table_id?
| Field | Scope | Example |
|---|---|---|
| `dataset_id` | The overall dataset (may contain multiple tables) | `"fraser-coho-monitoring-2024"` |
| `table_id` | A specific table within the dataset | `"escapement"`, `"age-composition"`, `"catch"` |
If you have a single table, use the same approach with descriptive IDs:

```r
# Single table dataset
dict <- infer_dictionary(df,
  dataset_id = "fraser-coho-2024",
  table_id = "escapement"
)
```

If you have multiple related tables:

```r
# Multi-table dataset
dict_esc <- infer_dictionary(df_escapement,
  dataset_id = "fraser-coho-2024",
  table_id = "escapement"
)

dict_age <- infer_dictionary(df_age,
  dataset_id = "fraser-coho-2024",  # Same dataset_id
  table_id = "age-composition"     # Different table_id
)
```

How do I add code lists for categorical columns?
If you have columns with coded values (like `SPECIES = "CO"` for Coho), you can add a code list:

- In the default `create_sdp()` workflow, code-level semantic suggestions are seeded automatically only for factor/categorical source columns.
- If you want broader code-level suggestion seeding, use `semantic_code_scope = "all"`.
```r
codes <- tibble::tibble(
  dataset_id = "fraser-coho-2024",
  table_id = "escapement",
  column_name = "SPECIES",
  code_value = c("CO", "CH", "PK", "SO", "CM"),
  code_label = c("Coho Salmon", "Chinook Salmon", "Pink Salmon",
                 "Sockeye Salmon", "Chum Salmon"),
  code_description = NA_character_
)

# Include codes when creating the package
pkg_path <- write_salmon_datapackage(
  resources = list(escapement = df),
  dataset_meta = dataset_meta,
  table_meta = table_meta,
  dict = dict,
  codes = codes,  # Add this parameter
  path = "my-package"
)
```

Can I include multiple tables in one package?
Yes! Just include multiple data frames in the `resources` list:
```r
resources <- list(
  escapement = df_escapement,
  age_composition = df_age,
  catch = df_catch
)

# Create dictionaries for each table
dict_all <- dplyr::bind_rows(
  dict_escapement,
  dict_age,
  dict_catch
)

# Create table metadata for each table
table_meta <- tibble::tibble(
  dataset_id = rep("fraser-coho-2024", 3),
  table_id = c("escapement", "age_composition", "catch"),
  file_name = c("data/escapement.csv", "data/age_composition.csv", "data/catch.csv"),
  table_label = c("Escapement Data", "Age Composition", "Catch Data"),
  description = c("Spawner counts", "Age structure", "Harvest numbers")
)

pkg_path <- write_salmon_datapackage(
  resources = resources,
  dataset_meta = dataset_meta,
  table_meta = table_meta,
  dict = dict_all,
  path = "multi-table-package"
)
```

Workflow Questions
Should I edit the dictionary before or after validation?
After - the validation tells you what needs fixing:

1. Generate the dictionary: `dict <- infer_dictionary(...)`
2. Validate: `validate_dictionary(dict)` - note any errors/warnings
3. Fix issues: edit `dict` to fix the problems
4. Validate again: `validate_dictionary(dict)` - should pass now
5. Create the package: `write_salmon_datapackage(...)`
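Put together, the loop might look like this (the dataset, column names, and the specific fix are illustrative; your validator output will tell you what actually needs changing):

```r
dict <- infer_dictionary(df,
  dataset_id = "fraser-coho-2024",
  table_id = "escapement"
)

validate_dictionary(dict)  # note any errors/warnings

# Fix whatever the validator flagged, e.g. an invalid value_type
dict$value_type[dict$column_name == "SPAWNER_COUNT"] <- "integer"

validate_dictionary(dict)  # should pass now

pkg_path <- write_salmon_datapackage(
  resources = list(escapement = df),
  dataset_meta = dataset_meta,
  table_meta = table_meta,
  dict = dict,
  path = "fraser-coho-package"
)
```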
How detailed should my descriptions be?
Aim for descriptions that would help a colleague (or your future self) understand the data without asking questions:
| Too brief | Good | Too detailed |
|---|---|---|
| “Count” | “Estimated count of naturally spawning adult coho salmon” | “This field contains the estimated count of naturally spawning adult coho salmon as determined by visual surveys conducted during peak spawning season (typically October-November) using standard area-under-the-curve methodology…” |
A good rule of thumb: one sentence that answers “What is this and how was it measured?”
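Upgrading a too-brief description uses the same editing idiom shown elsewhere in this FAQ (the column name here is illustrative):

```r
# Replace a bare "Count" with a one-sentence answer to
# "What is this and how was it measured?"
dict$column_description[dict$column_name == "SPAWNER_COUNT"] <-
  "Estimated count of naturally spawning adult coho salmon"
```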
Do I need to fill in all the semantic fields (IRI, property_iri, etc.)?
No. These fields are optional and are primarily for data stewards who want to link data to standard scientific vocabularies.
For most users, the basic fields (`column_name`, `column_label`, `column_description`, `value_type`, `column_role`) are sufficient for creating useful, shareable data packages.
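For instance, a minimal hand-written dictionary using only the basic fields might look like this (the column names, labels, and descriptions are illustrative; see the package template via `system.file()` for the full set of recognised columns):

```r
dict <- tibble::tibble(
  dataset_id = "fraser-coho-2024",
  table_id = "escapement",
  column_name = c("YEAR", "SPECIES", "SPAWNER_COUNT"),
  column_label = c("Year", "Species code", "Spawner count"),
  column_description = c(
    "Calendar year of the survey",
    "Species code (e.g. \"CO\" = Coho Salmon)",
    "Estimated count of naturally spawning adults"
  ),
  value_type = c("integer", "string", "integer"),
  column_role = c("temporal", "categorical", "measurement")
)
```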
Getting More Help
Is there more documentation?
Yes! Here are the key resources:
| Resource | Best for |
|---|---|
| 5-Minute Quickstart | Getting started fast |
| Publishing Data Packages | Detailed control over metadata and publishing |
| Linking to Standard Vocabularies | Connecting data to scientific standards |
| Accessing Data from GitHub | Reading CSVs from private repositories |
| Glossary of Terms | Understanding technical terms |
| Function Reference | Looking up specific functions |