Integrate Data
Cookbook Guide
Step 1: Identify Data Sources
What you need: - List of datasets you want to integrate - Understanding of what each dataset contains
How to do it:
- Inventory your sources:
- List all datasets to be integrated
- Document the source system or origin
- Note the temporal and geographic scope
- Assess compatibility:
- Review column names and structures
- Identify common fields (CU, year, etc.)
- Note differences in formats or codes
Example source inventory:
| Dataset | Source | Key Columns | Temporal Scope | Geographic Scope |
|---|---|---|---|---|
| Escapement 2023 | Stream surveys | CU, BY, Esc | 2023 | Fraser River CUs |
| Catch 2023 | FOS | CU, Year, Catch | 2023 | All Areas |
| Stock status | SPSR | CU, BY, Status | 2010-2023 | All CUs |
Step 2: Map Common Fields
What you need: - Column lists from each dataset - Access to standard vocabularies
How to do it:
- Identify common fields:
- Geographic identifiers (CU, SMU, DU)
- Temporal fields (year, brood year)
- Biological measures (escapement, catch, spawners)
- Create a field mapping table:
Example field mapping:
| Source Dataset | Source Column | Standard Term | Standard URI | Target Column |
|---|---|---|---|---|
| Escapement | CU_code | Conservation Unit | http://purl.dataone.org/odo/SALMON_00000239 | CU |
| Escapement | BY | Brood Year | http://purl.dataone.org/odo/SALMON_00000782 | BroodYear |
| Catch | CU | Conservation Unit | http://purl.dataone.org/odo/SALMON_00000239 | CU |
| Catch | Year | Return Year | http://purl.dataone.org/odo/SALMON_00000891 | ReturnYear |
- Resolve differences:
- Standardize column names
- Align code values
- Convert units if needed
Step 3: Standardize Code Values
What you need: - Code lists from each dataset - Controlled vocabularies for codes
How to do it:
- Extract unique values from categorical columns across all datasets
- Map to standard codes using controlled vocabularies
- Create transformation rules:
Example code standardization:
source_dataset,source_column,source_value,standard_value,concept_scheme_uri
Escapement,run_type,Spring,Spring Run,http://purl.dataone.org/odo/SALMON_00000456
Escapement,run_type,S,Spring Run,http://purl.dataone.org/odo/SALMON_00000456
Catch,run_type,SPRING,Spring Run,http://purl.dataone.org/odo/SALMON_00000456
- Apply transformations to each dataset before integration
Step 4: Align Data Structures
What you need: - Standardized field mappings - Code transformations - Your source datasets
How to do it:
- Rename columns to match standard names
- Transform codes to standard values
- Standardize data types (ensure numeric fields are numeric, dates are dates)
- Handle missing values consistently
Example R code for alignment:
library(dplyr)
library(readr)
# Read source datasets
escapement <- read.csv("escapement_2023.csv")
catch <- read.csv("catch_2023.csv")
# Standardize escapement data
escapement_std <- escapement %>%
rename(
CU = CU_code,
BroodYear = BY,
Escapement = Esc
) %>%
mutate(
BroodYear = as.integer(BroodYear),
Escapement = as.numeric(Escapement)
)
# Standardize catch data
catch_std <- catch %>%
rename(
CU = CU,
ReturnYear = Year,
Catch = TotalCatch
) %>%
mutate(
ReturnYear = as.integer(ReturnYear),
Catch = as.numeric(Catch)
)
# Save standardized datasets
write.csv(escapement_std, "escapement_standardized.csv", row.names = FALSE)
write.csv(catch_std, "catch_standardized.csv", row.names = FALSE)Step 5: Join and Integrate
What you need: - Standardized datasets - Join keys (common fields)
How to do it:
- Identify join keys:
- Common fields across datasets (CU, year, etc.)
- Ensure keys are standardized
- Perform joins:
- Use appropriate join type (inner, left, full)
- Handle one-to-many relationships
- Preserve data integrity
Example integration:
# Join escapement and catch data
# Note: Escapement uses BroodYear, Catch uses ReturnYear
# Need to align temporal fields
integrated <- escapement_std %>%
# Convert brood year to return year (assuming 4-year cycle)
mutate(ReturnYear = BroodYear + 4) %>%
left_join(
catch_std,
by = c("CU", "ReturnYear")
) %>%
# Calculate exploitation rate
mutate(
ExploitationRate = ifelse(
!is.na(Catch) & !is.na(Escapement),
Catch / (Catch + Escapement),
NA
)
)
# Save integrated dataset
write.csv(integrated, "integrated_data.csv", row.names = FALSE)Tips: - Document join logic and assumptions - Handle temporal mismatches carefully - Validate join results (check for unexpected duplicates or missing matches)
Step 6: Validate Integration
What you need: - Integrated dataset - Source datasets for comparison - Validation rules
How to do it:
- Check completeness:
- Verify expected records are present
- Identify missing matches
- Document exclusions
- Validate relationships:
- Check join keys are correct
- Verify calculated fields
- Ensure data ranges are reasonable
- Compare summaries:
- Compare totals before and after integration
- Check for unexpected changes
- Validate derived metrics
Example validation:
# Compare totals
escapement_total <- sum(escapement_std$Escapement, na.rm = TRUE)
catch_total <- sum(catch_std$Catch, na.rm = TRUE)
integrated_esc_total <- sum(integrated$Escapement, na.rm = TRUE)
integrated_catch_total <- sum(integrated$Catch, na.rm = TRUE)
# Check totals match
stopifnot(abs(escapement_total - integrated_esc_total) < 0.01)
stopifnot(abs(catch_total - integrated_catch_total) < 0.01)
# Check for unexpected values
summary(integrated$ExploitationRate) # Should be between 0 and 1Step 7: Document Integration
What you need: - Integrated dataset - Integration code/logic - Source metadata
How to do it:
- Create integration documentation:
- List all source datasets
- Document join logic and keys
- Note any transformations applied
- Record assumptions and decisions
- Create integrated data dictionary:
- Document all columns in integrated dataset
- Include standard term URIs
- Note source of each field
- Version control:
- Tag version of integrated dataset
- Document date of integration
- Note any issues or limitations
Example documentation template:
# Integrated Dataset Documentation
## Source Datasets
- Escapement 2023 (source: stream surveys)
- Catch 2023 (source: FOS)
## Integration Date
2024-01-15
## Join Logic
- Joined on CU and ReturnYear
- Escapement BroodYear converted to ReturnYear (BroodYear + 4)
## Transformations Applied
- Column name standardization
- Code value standardization
- ExploitationRate calculation: Catch / (Catch + Escapement)
## Known Issues
- Some CUs missing catch data for 2023
- Temporal alignment assumes 4-year cycle (may vary by CU)Next Steps
- Review the Controlled Vocabulary & Thesauri for standard terms
- Check the Salmon Data Exchange Package Specification for integration patterns
- Explore R tools for automated integration workflows
- Contact the Data Stewardship Unit for integration support