Integrate Data

Cookbook Guide

Step 1: Inventory Sources

Document each source dataset:

  • owner/source system
  • temporal scope
  • geographic scope
  • key identifiers
  • known caveats

Example inventory:

Dataset             Source           Key Columns        Temporal Scope   Geographic Scope
Escapement series   Survey program   CU_code, BY, Esc   2010–2024        Fraser CUs
Catch series        FOS              CU, Year, Catch    2010–2024        Multiple areas
Status outputs      SPSR             CU, BY, STATUS     2010–2024        All CUs
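An inventory like the one above is most useful when it is machine-readable and versioned alongside the data. A minimal sketch in Python (the field names and CSV layout are illustrative, not a required format):

```python
import csv
import io

# Illustrative inventory records mirroring the table above.
inventory = [
    {"dataset": "Escapement series", "source": "Survey program",
     "key_columns": "CU_code, BY, Esc", "temporal_scope": "2010-2024",
     "geographic_scope": "Fraser CUs"},
    {"dataset": "Catch series", "source": "FOS",
     "key_columns": "CU, Year, Catch", "temporal_scope": "2010-2024",
     "geographic_scope": "Multiple areas"},
    {"dataset": "Status outputs", "source": "SPSR",
     "key_columns": "CU, BY, STATUS", "temporal_scope": "2010-2024",
     "geographic_scope": "All CUs"},
]

# Write the inventory as CSV (to a string here; use a file path in practice).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(inventory[0].keys()))
writer.writeheader()
writer.writerows(inventory)
inventory_csv = buf.getvalue()
```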

Step 2: Build a Common Semantic Map

Create a crosswalk from source fields to target standard fields and canonical IRIs.

Example (illustrative — verify exact term IRIs in WIDOCO):

Source Dataset   Source Column   Standard Label      Standard IRI                                     Target Column
Escapement       CU_code         Conservation Unit   https://w3id.org/gcdfo/salmon#ConservationUnit   CU
Escapement       BY              Brood Year          https://w3id.org/gcdfo/salmon#BroodYear          BroodYear
Catch            Year            Return Year         https://w3id.org/gcdfo/salmon#ReturnYear         ReturnYear

Use full IRIs only; compact prefixes are ambiguous outside the document that defines them.
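Once the crosswalk exists, applying it should be mechanical. A minimal sketch, assuming each (source dataset, source column) pair maps to exactly one target column; the CU code in the usage line is made up for illustration:

```python
# Crosswalk keyed by (source_dataset, source_column) -> (target, IRI).
# The IRI travels with the mapping so provenance is never lost.
CROSSWALK = {
    ("Escapement", "CU_code"): ("CU", "https://w3id.org/gcdfo/salmon#ConservationUnit"),
    ("Escapement", "BY"): ("BroodYear", "https://w3id.org/gcdfo/salmon#BroodYear"),
    ("Catch", "Year"): ("ReturnYear", "https://w3id.org/gcdfo/salmon#ReturnYear"),
}

def rename_record(dataset: str, record: dict) -> dict:
    """Rename a record's columns to the target standard names.

    Columns without a crosswalk entry are kept as-is so that
    unmapped fields are never silently dropped.
    """
    out = {}
    for col, value in record.items():
        target, _iri = CROSSWALK.get((dataset, col), (col, None))
        out[target] = value
    return out

row = rename_record("Escapement", {"CU_code": "CK-9007", "BY": 2015})
```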

Step 3: Harmonize Controlled Values

Normalize categorical values before joining.

source_dataset,source_column,source_value,standard_value,concept_iri
Escapement,run_type,Spring,Spring Run,https://w3id.org/gcdfo/salmon#SpringRun
Escapement,run_type,S,Spring Run,https://w3id.org/gcdfo/salmon#SpringRun
Catch,run_type,SPRING,Spring Run,https://w3id.org/gcdfo/salmon#SpringRun
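A value mapping in that shape can be loaded and applied before any join. A stdlib-only sketch using the CSV rows shown above; unmapped values are collected as exceptions rather than guessed (the "FALL??" value is an invented example of an unmapped input):

```python
import csv
import io

MAPPING_CSV = """source_dataset,source_column,source_value,standard_value,concept_iri
Escapement,run_type,Spring,Spring Run,https://w3id.org/gcdfo/salmon#SpringRun
Escapement,run_type,S,Spring Run,https://w3id.org/gcdfo/salmon#SpringRun
Catch,run_type,SPRING,Spring Run,https://w3id.org/gcdfo/salmon#SpringRun
"""

# Build a lookup keyed by (dataset, column, raw value).
lookup = {}
for r in csv.DictReader(io.StringIO(MAPPING_CSV)):
    lookup[(r["source_dataset"], r["source_column"], r["source_value"])] = r["standard_value"]

def harmonize(dataset, column, value, exceptions):
    """Return the standard value; record unmapped values instead of guessing."""
    key = (dataset, column, value)
    if key in lookup:
        return lookup[key]
    exceptions.append(key)
    return value

exceptions = []
v1 = harmonize("Catch", "run_type", "SPRING", exceptions)
v2 = harmonize("Catch", "run_type", "FALL??", exceptions)
```

Keeping the exception list feeds directly into the unresolved-mapping check in Step 6.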

Step 4: Standardize Structures

  • align column names
  • align data types
  • align temporal semantics (brood year vs return year)
  • document all transformation assumptions
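Aligning brood year and return year is the subtlest of these steps: the offset is the age at return, which varies by species and stock. A hedged sketch, in which a single fixed total age is an explicitly documented simplification, not a recommendation:

```python
# ASSUMPTION: a fixed total age (4 here, purely illustrative) converts
# brood year to return year. Real conversions should use age composition
# data, and the chosen offset must be recorded with the other
# transformation assumptions.
TOTAL_AGE = 4

def brood_to_return_year(brood_year: int, total_age: int = TOTAL_AGE) -> int:
    """Return year of the cohort spawned in `brood_year` (simplified)."""
    return brood_year + total_age

ry = brood_to_return_year(2015)
```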

Step 5: Integrate with Explicit Join Logic

  • define join keys and expected cardinality
  • run joins in script
  • check for dropped/unmatched records
  • compute derived fields only after validating joins
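The bullets above can be made concrete in a small scripted join. A stdlib-only sketch assuming a one-to-one join on (CU, Year); the same checks correspond to merge indicator/validation flags in a dataframe library, and the example rows are made up:

```python
def keyed(rows, key_cols):
    """Index rows by join key; fail loudly on duplicates so the
    expected one-to-one cardinality is actually enforced."""
    index = {}
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        if key in index:
            raise ValueError(f"duplicate join key: {key}")
        index[key] = row
    return index

def join_one_to_one(left, right, key_cols):
    """Inner-join two row lists; also report unmatched keys on each side."""
    li, ri = keyed(left, key_cols), keyed(right, key_cols)
    matched = [{**li[k], **ri[k]} for k in li if k in ri]
    left_only = sorted(k for k in li if k not in ri)
    right_only = sorted(k for k in ri if k not in li)
    return matched, left_only, right_only

# Illustrative rows (CU codes and values are made up).
esc = [{"CU": "A", "Year": 2020, "Esc": 1200}, {"CU": "B", "Year": 2020, "Esc": 800}]
catch = [{"CU": "A", "Year": 2020, "Catch": 300}]
rows, left_only, right_only = join_one_to_one(esc, catch, ["CU", "Year"])
```

Any non-empty `left_only` or `right_only` list is a dropped/unmatched record that must be explained before derived fields are computed.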

Step 6: Validate the Integrated Output

Validation checks:

  • source totals vs integrated totals
  • row counts by key strata (CU/year/etc.)
  • derived metric sanity checks (ranges, null rate)
  • unresolved mapping exceptions list
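The totals and null-rate checks can be scripted as assertions that run after every rebuild. A minimal sketch (column names reuse the illustrative schema from the earlier steps):

```python
def total(rows, col):
    """Sum a numeric column, skipping missing values."""
    return sum(r[col] for r in rows if r.get(col) is not None)

def null_rate(rows, col):
    """Fraction of rows where the column is missing or None."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(col) is None) / len(rows)

source = [{"Esc": 1200}, {"Esc": 800}]
integrated = [{"Esc": 1200, "Catch": 300}, {"Esc": 800, "Catch": None}]

# Check 1: no escapement was lost or duplicated by the join.
assert total(source, "Esc") == total(integrated, "Esc")
# Check 2: track (and bound) the null rate of joined/derived fields.
catch_null_rate = null_rate(integrated, "Catch")
```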

Step 7: Package for Reuse

Package the integrated output with:

  • updated column_dictionary.csv
  • code mappings (codes.csv)
  • integration README (logic + assumptions)
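A packaging step can end with a completeness check so a release never ships without its documentation files. A sketch, assuming the file names listed above ("README.md" is an assumed name for the integration README):

```python
import os
import tempfile

REQUIRED_FILES = ["column_dictionary.csv", "codes.csv", "README.md"]

def missing_package_files(package_dir: str) -> list:
    """Return the required files absent from the package directory."""
    return [f for f in REQUIRED_FILES
            if not os.path.exists(os.path.join(package_dir, f))]

# Demonstrate against a throwaway directory containing only the README.
with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "README.md"), "w").close()
    missing = missing_package_files(d)
```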

If this dataset is moving into SPSR/FSAR workflows, continue to the Salmon data package + SPSR intake path.

Next Steps