Standardize Your Data
Cookbook Guide
Step 1: Assess Your Current Data
What you need: - Your existing dataset - List of column names and their current meanings
How to do it:
- Open your dataset and list all columns
- Document what each column represents in plain language
- Identify columns that might map to standard terms:
- Geographic identifiers (CU, SMU, DU)
- Temporal fields (brood year, return year)
- Biological measurements (escapement, spawners)
- Categorical codes (run type, life stage)
Example assessment:
| Current Column | What It Means | Potential Standard Term |
|---|---|---|
CU_code |
Conservation unit identifier | CU |
BY |
Brood year | Brood Year |
Esc |
Escapement count | Escapement |
Step 2: Map to Standard Terms
What you need: - Your column assessment - Access to the Controlled Vocabulary & Thesauri
How to do it:
- For each column, search the vocabulary for matching terms
- Review term definitions to confirm they match your usage
- Copy the URI for each confirmed match
- Create a mapping table:
Example mapping:
current_column,standard_term_label,standard_term_uri
CU_code,Conservation Unit,http://purl.dataone.org/odo/SALMON_00000239
BY,Brood Year,http://purl.dataone.org/odo/SALMON_00000782
Esc,Escapement,http://purl.dataone.org/odo/SALMON_00000123
Tips: - If no exact match exists, find the closest broader term - Document any gaps for future vocabulary expansion - Consider synonyms and alternative names
Step 3: Update Your Data Dictionary
What you need: - Your mapping table - Data Dictionary Template
How to do it:
- Download the data dictionary template
- Create one row per column in your dataset
- Fill in required fields:
variable_name: Your current column namelabel: Human-readable labeldata_type: Appropriate type (string, integer, float, boolean, categorical)definition: Clear explanation of what the column containsstandard_term_uri: URI from your mapping (if available)unit: Unit of measurement (for numeric fields)accepted_values: List of valid values (for categorical fields)
Example dictionary entry:
variable_name,label,data_type,definition,standard_term_uri,unit
CU_code,Conservation Unit,string,The conservation unit identifier,http://purl.dataone.org/odo/SALMON_00000239,
BY,Brood Year,integer,The year in which spawning occurred,http://purl.dataone.org/odo/SALMON_00000782,year
Esc,Escapement,float,Number of fish returning to spawn,http://purl.dataone.org/odo/SALMON_00000123,number of fish
Step 4: Standardize Code Lists
What you need: - List of categorical columns - Access to controlled vocabularies for codes
How to do it:
- Identify all categorical columns (run type, life stage, etc.)
- List all unique values currently in use
- Search for matching controlled vocabularies
- Map your values to standard codes
- Document mappings in
codes.csv:
Example code mapping:
variable_name,current_value,standard_value,concept_scheme_uri
run_type,Spring,Spring Run,http://purl.dataone.org/odo/SALMON_00000456
run_type,Summer,Summer Run,http://purl.dataone.org/odo/SALMON_00000457
run_type,Fall,Fall Run,http://purl.dataone.org/odo/SALMON_00000458
Tips: - Create a lookup table for data transformation - Document any values that don’t have standard equivalents - Consider creating a new controlled vocabulary if needed
Step 5: Apply Standards to Your Data
What you need: - Standardized data dictionary - Code mappings - Your original dataset
How to do it:
- Rename columns (optional): Update column names to match standard labels
- Transform codes: Replace current values with standard codes using your mapping
- Validate data types: Ensure numeric fields are properly formatted
- Add metadata: Include your data dictionary with the dataset
Example R code for transformation:
library(dplyr)
# Read your data
data <- read.csv("your_data.csv")
# Apply code mappings
code_mapping <- read.csv("code_mapping.csv")
data <- data %>%
left_join(code_mapping, by = c("run_type" = "current_value")) %>%
mutate(run_type = standard_value) %>%
select(-current_value, -standard_value)
# Validate data types
data <- data %>%
mutate(
BY = as.integer(BY),
Esc = as.numeric(Esc)
)
# Save standardized data
write.csv(data, "standardized_data.csv", row.names = FALSE)Step 6: Validate and Document
What you need: - Standardized dataset - Completed data dictionary - Validation checklist
How to do it:
- Validate completeness:
- All columns have entries in the dictionary
- All URIs are valid and accessible
- Required fields are populated
- Check consistency:
- Code values match controlled vocabularies
- Data types are correct
- Units are documented
- Document changes:
- Create a changelog of transformations
- Note any assumptions or decisions made
- Record version information
Validation checklist: - [ ] All columns documented in data dictionary - [ ] Standard term URIs included where available - [ ] Code lists mapped to controlled vocabularies - [ ] Data types validated - [ ] Units documented for numeric fields - [ ] Transformation steps documented
Next Steps
- Review the Controlled Vocabulary & Thesauri for available terms
- Check the Salmon Data Exchange Package Specification for package structure
- Explore R tools for automated standardization
- Contact the Data Stewardship Unit for support