# Integrate Data

Cookbook Guide
## Step 1: Inventory Sources
Document each source dataset:
- owner/source system
- temporal scope
- geographic scope
- key identifiers
- known caveats
Example inventory:
| Dataset | Source | Key Columns | Temporal Scope | Geographic Scope |
|---|---|---|---|---|
| Escapement series | Survey program | CU_code, BY, Esc | 2010–2024 | Fraser CUs |
| Catch series | FOS | CU, Year, Catch | 2010–2024 | Multiple areas |
| Status outputs | SPSR | CU, BY, STATUS | 2010–2024 | All CUs |
## Step 2: Build a Common Semantic Map
Create a crosswalk from source fields to target standard fields and canonical IRIs.
Example (illustrative — verify exact term IRIs in WIDOCO):
| Source Dataset | Source Column | Standard Label | Standard IRI | Target Column |
|---|---|---|---|---|
| Escapement | CU_code | Conservation Unit | https://w3id.org/gcdfo/salmon#ConservationUnit | CU |
| Escapement | BY | Brood Year | https://w3id.org/gcdfo/salmon#BroodYear | BroodYear |
| Catch | Year | Return Year | https://w3id.org/gcdfo/salmon#ReturnYear | ReturnYear |
Use full IRIs only.
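The crosswalk above can be applied mechanically to rename source columns. A minimal sketch with pandas, assuming the crosswalk has been loaded into a DataFrame (the `apply_crosswalk` helper and sample data are illustrative, not part of any standard tooling):

```python
# Sketch: rename a source dataset's columns to target standard names
# using the Step 2 crosswalk. Column and dataset names are illustrative.
import pandas as pd

crosswalk = pd.DataFrame({
    "source_dataset": ["Escapement", "Escapement", "Catch"],
    "source_column": ["CU_code", "BY", "Year"],
    "standard_iri": [
        "https://w3id.org/gcdfo/salmon#ConservationUnit",
        "https://w3id.org/gcdfo/salmon#BroodYear",
        "https://w3id.org/gcdfo/salmon#ReturnYear",
    ],
    "target_column": ["CU", "BroodYear", "ReturnYear"],
})

def apply_crosswalk(df, dataset_name, crosswalk):
    """Rename columns using the crosswalk rows for this dataset;
    report columns the crosswalk does not cover."""
    rows = crosswalk[crosswalk["source_dataset"] == dataset_name]
    rename_map = dict(zip(rows["source_column"], rows["target_column"]))
    unmapped = set(df.columns) - set(rename_map)
    if unmapped:
        print(f"{dataset_name}: unmapped columns kept as-is: {sorted(unmapped)}")
    return df.rename(columns=rename_map)

escapement = pd.DataFrame({"CU_code": ["CK-01"], "BY": [2018], "Esc": [1200]})
escapement = apply_crosswalk(escapement, "Escapement", crosswalk)
print(list(escapement.columns))  # ['CU', 'BroodYear', 'Esc']
```

Keeping the crosswalk as data (rather than hard-coding renames) means the same script can be rerun when the semantic map is updated.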
## Step 3: Harmonize Controlled Values
Normalize categorical values before joining.
```
source_dataset,source_column,source_value,standard_value,concept_iri
Escapement,run_type,Spring,Spring Run,https://w3id.org/gcdfo/salmon#SpringRun
Escapement,run_type,S,Spring Run,https://w3id.org/gcdfo/salmon#SpringRun
Catch,run_type,SPRING,Spring Run,https://w3id.org/gcdfo/salmon#SpringRun
```
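A mapping table like the one above can drive the normalization directly. A sketch with pandas, assuming the mapping CSV is read into a DataFrame (the inline string and `harmonize` helper are illustrative):

```python
# Sketch: normalize categorical values via a source->standard mapping table,
# flagging any values the mapping does not cover.
import io
import pandas as pd

mapping_csv = """source_dataset,source_column,source_value,standard_value,concept_iri
Escapement,run_type,Spring,Spring Run,https://w3id.org/gcdfo/salmon#SpringRun
Escapement,run_type,S,Spring Run,https://w3id.org/gcdfo/salmon#SpringRun
Catch,run_type,SPRING,Spring Run,https://w3id.org/gcdfo/salmon#SpringRun
"""
mapping = pd.read_csv(io.StringIO(mapping_csv))

def harmonize(df, dataset, column, mapping):
    """Replace source values with standard values; report unmapped values."""
    rows = mapping[(mapping["source_dataset"] == dataset)
                   & (mapping["source_column"] == column)]
    lookup = dict(zip(rows["source_value"], rows["standard_value"]))
    out = df[column].map(lookup)
    unmapped = df.loc[out.isna() & df[column].notna(), column].unique()
    if len(unmapped):
        print(f"Unmapped {column} values in {dataset}: {list(unmapped)}")
    return df.assign(**{column: out})

escapement = pd.DataFrame({"run_type": ["Spring", "S", "Fall"]})
escapement = harmonize(escapement, "Escapement", "run_type", mapping)
# "Fall" has no mapping row, so it is reported and becomes NaN for review
```

Unmapped values become nulls rather than passing through silently, so they surface in the Step 6 exception list.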
## Step 4: Standardize Structures
- align column names
- align data types
- align temporal semantics (brood year vs return year)
- document all transformation assumptions
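The temporal-semantics point deserves a concrete step: escapement is keyed by brood year while catch is keyed by return year, so one must be derived from the other before joining. A sketch, where the age-at-return offset is a hypothetical placeholder (use the documented value for the species and stock):

```python
# Sketch: derive ReturnYear from BroodYear so both tables share the same
# temporal key. AGE_AT_RETURN is an illustrative assumption, not a real value.
import pandas as pd

AGE_AT_RETURN = 4  # hypothetical: dominant age at return for this stock

escapement = pd.DataFrame({"CU": ["CK-01"], "BroodYear": [2016], "Esc": [1200]})

# Cast to int first so the join key has the same dtype on both sides,
# then shift brood year forward to the return year.
escapement["ReturnYear"] = escapement["BroodYear"].astype(int) + AGE_AT_RETURN
```

Record the offset (and its justification) in the transformation assumptions, since it changes which rows line up in the join.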
## Step 5: Integrate with Explicit Join Logic
- define join keys and expected cardinality
- run joins in script
- check for dropped/unmatched records
- compute derived fields only after validating joins
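These four checks map directly onto pandas `merge` options. A sketch, with sample data and the derived `TotalReturn` field as illustrative assumptions:

```python
# Sketch: explicit join with declared keys, enforced cardinality
# (validate=), and an unmatched-record check (indicator=).
import pandas as pd

escapement = pd.DataFrame({"CU": ["CK-01", "CK-02"],
                           "ReturnYear": [2020, 2020],
                           "Esc": [1200, 800]})
catch = pd.DataFrame({"CU": ["CK-01"], "ReturnYear": [2020], "Catch": [300]})

merged = escapement.merge(
    catch,
    on=["CU", "ReturnYear"],
    how="outer",
    validate="one_to_one",   # raises if keys are unexpectedly duplicated
    indicator=True,          # records which side each row came from
)

# Check for dropped/unmatched records before going further
unmatched = merged[merged["_merge"] != "both"]
print(unmatched[["CU", "ReturnYear", "_merge"]])
# CK-02 appears only in the escapement table ("left_only")

# Compute derived fields only after the join is validated
merged["TotalReturn"] = merged["Esc"].fillna(0) + merged["Catch"].fillna(0)
```

Using `how="outer"` keeps unmatched rows visible for inspection; switch to an inner join only after the unmatched set is understood and documented.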
## Step 6: Validate the Integrated Output
Validation checks:
- source totals vs integrated totals
- row counts by key strata (CU/year/etc.)
- derived metric sanity checks (ranges, null rate)
- unresolved mapping exceptions list
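The checks above can be expressed as assertions that fail loudly if integration changed the data. A sketch, with sample frames and thresholds that are illustrative only:

```python
# Sketch: reconciliation and sanity checks on the integrated output.
import pandas as pd

source_esc = pd.DataFrame({"CU": ["CK-01", "CK-02"], "Esc": [1200, 800]})
integrated = pd.DataFrame({"CU": ["CK-01", "CK-02"],
                           "Esc": [1200, 800],
                           "Catch": [300, None]})

# 1. Source totals vs integrated totals
assert source_esc["Esc"].sum() == integrated["Esc"].sum(), \
    "escapement total changed during integration"

# 2. Row counts by key strata (here: one row per CU)
counts = integrated.groupby("CU").size()
assert (counts == 1).all(), "duplicate CU rows after join"

# 3. Derived metric sanity checks: ranges and null rate
assert (integrated["Esc"].dropna() >= 0).all(), "negative escapement"
null_rate = integrated["Catch"].isna().mean()
print(f"Catch null rate: {null_rate:.0%}")  # flag for review above a threshold
```

Rows that fail these checks, plus the unmapped values from Step 3, form the unresolved-exception list.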
## Step 7: Package for Reuse
Package the integrated output with:
- an updated `column_dictionary.csv`
- code mappings (`codes.csv`)
- an integration README (join logic + assumptions)
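Writing the package can be scripted so it is reproducible. A minimal sketch, assuming the file names above; the directory name and placeholder contents are illustrative:

```python
# Sketch: write the packaged outputs (dictionary, code mappings, README).
from pathlib import Path
import pandas as pd

out = Path("integrated_package")  # hypothetical output directory
out.mkdir(exist_ok=True)

column_dictionary = pd.DataFrame({
    "column": ["CU", "ReturnYear", "TotalReturn"],
    "definition": ["Conservation Unit code", "Return year",
                   "Escapement plus catch"],
})
column_dictionary.to_csv(out / "column_dictionary.csv", index=False)

codes = pd.DataFrame({"source_value": ["S"], "standard_value": ["Spring Run"]})
codes.to_csv(out / "codes.csv", index=False)

(out / "README.md").write_text(
    "# Integration README\n\n"
    "Join keys: CU, ReturnYear (outer join).\n"
    "Assumptions: brood year converted to return year "
    "with an age-at-return offset.\n"
)
```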
If this dataset is moving into SPSR/FSAR workflows, continue to the Salmon data package + SPSR intake path.