3 Step 1: Source Data Verification
3.1 Objective
Confirm that inputs are current, identifiable, and interpretable before you start selecting sites or building series.
3.2 Build one inventory, not a pile of guesses
Create a simple source inventory with at least these columns:
- file_name
- file_role
- annual_static_conditional
- source_owner
- pull_date
- expected_year_start
- expected_year_end
- hard_coded_name_required
- notes
Keeping this inventory catches many avoidable errors early: stale files, missing static lookups, and scripts that expect literal file names.
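A minimal sketch of such an inventory in R, assuming it is kept as a data frame. The file names, owner labels, dates, the 365-day threshold, and the `flag_stale` helper are hypothetical placeholders, not part of the documented workflow.

```r
# Hypothetical inventory rows; replace with the real files for your run.
inventory <- data.frame(
  file_name = c("escapement_main.csv", "site_lookup.csv"),
  file_role = c("annual escapement input", "static site lookup"),
  annual_static_conditional = c("annual", "static"),
  source_owner = c("owner A", "owner B"),
  pull_date = as.Date(c("2024-11-15", "2021-03-02")),
  expected_year_start = c(1950L, NA),
  expected_year_end = c(2023L, NA),
  hard_coded_name_required = c(TRUE, FALSE),
  notes = c("", "static lookup, no year range"),
  stringsAsFactors = FALSE
)

# Flag annual files whose pull date is older than max_age_days.
# The threshold is an assumption; pick one that matches your update cycle.
flag_stale <- function(inv, as_of = Sys.Date(), max_age_days = 365) {
  age_days <- as.numeric(as_of - inv$pull_date)
  inv$annual_static_conditional == "annual" & age_days > max_age_days
}

inventory$stale <- flag_stale(inventory, as_of = as.Date("2025-01-10"))

# Files with hard-coded names must exist on disk exactly as written.
inventory$found <- file.exists(file.path("DATA_IN", inventory$file_name))
```

Static lookups are never flagged as stale here, on the assumption that they are meant to persist across years. Whether that holds for a given lookup belongs in the `notes` column.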
3.3 Do this exactly
- Inventory every file you expect to use.
- Check identifiers and naming fields.
- Check unit meaning and missing-value semantics.
- Build a quick year-coverage matrix for key sources.
- Write anomaly notes and mismatch files before moving on.
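The year-coverage matrix from the list above could be sketched as follows, assuming each key source is already loaded as a data frame with a `Year` column. The source names, the sample years, and the `year_coverage` helper are hypothetical.

```r
# sources: named list of data frames, each assumed to have a Year column.
# Returns a logical matrix: rows are years, columns are sources.
year_coverage <- function(sources, years) {
  cover <- vapply(sources, function(d) years %in% d$Year,
                  logical(length(years)))
  matrix(cover, nrow = length(years),
         dimnames = list(as.character(years), names(sources)))
}

# Hypothetical example sources.
sources <- list(
  escapement = data.frame(Year = c(2019, 2020, 2021, 2022)),
  lookup_ages = data.frame(Year = c(2019, 2021))
)
cov <- year_coverage(sources, 2019:2022)

# List every source-year combination with no rows, ready for a mismatch file.
idx <- which(!cov, arr.ind = TRUE)
missing_years <- data.frame(
  source = colnames(cov)[idx[, "col"]],
  year = as.integer(rownames(cov)[idx[, "row"]]),
  stringsAsFactors = FALSE
)
```

The `missing_years` table can then be saved with `write.csv()` alongside the anomaly notes, so that gaps are recorded before site selection starts.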
3.4 What to verify
3.4.1 Identifiers and names
Confirm that the fields you need for joins are present and readable:
- CU_ID
- site or project identifiers
- population identifiers where relevant
- year fields
- run timing, species, or estimate-class fields if the workflow depends on them
Also check for obvious name drift. A small spelling change can quietly break a large join.
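One way to surface name drift before a join, sketched under the assumption that a lookup table holds the reference spellings. The helpers `normalise_name` and `find_name_drift` and the stream names are hypothetical.

```r
# Reduce a name to a comparable form: upper case, punctuation and
# repeated whitespace collapsed to single spaces. ASCII names assumed.
normalise_name <- function(x) {
  x <- toupper(trimws(x))
  x <- gsub("[^A-Z0-9]+", " ", x)
  trimws(x)
}

# Names in the data that do not match the lookup exactly, with the lookup
# name they equal after normalisation (NA when there is no such candidate).
find_name_drift <- function(data_names, lookup_names) {
  unmatched <- setdiff(unique(data_names), lookup_names)
  hit <- match(normalise_name(unmatched), normalise_name(lookup_names))
  data.frame(
    name_in_data = unmatched,
    likely_lookup_name = lookup_names[hit],
    stringsAsFactors = FALSE
  )
}

# Hypothetical names for illustration only.
lookup <- c("Alpha Creek", "Bravo River", "Charlie Lake")
incoming <- c("Alpha Creek", "ALPHA  CREEK", "Bravo R.", "Charlie Lake")
drift <- find_name_drift(incoming, lookup)
```

Here "ALPHA  CREEK" is paired with "Alpha Creek", while "Bravo R." has no candidate and needs a manual decision. Normalisation only catches case, spacing, and punctuation differences; abbreviations and respellings still need review by someone who knows the sites.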
3.4.2 Meaning of the numbers
For each main input, confirm:
- wild vs total,
- expanded vs observed,
- biological zero vs missing,
- current-year placeholder behaviour,
- whether any values were already adjusted upstream.
A clean numeric column is still dangerous if it means something different from what it meant last year.
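Zero-versus-missing treatment and current-year placeholder behaviour can be profiled per year before anyone interprets the values. This is a sketch that assumes a `Year` column and one numeric value column; the column name `Estimate`, the helper `profile_values`, and the sample values are hypothetical.

```r
# Count rows, missing values, and zeros per year for one value column.
profile_values <- function(d, value_col, year_col = "Year") {
  by_year <- split(d[[value_col]], d[[year_col]])
  data.frame(
    year = as.integer(names(by_year)),
    n = vapply(by_year, length, integer(1)),
    n_missing = vapply(by_year, function(z) sum(is.na(z)), integer(1)),
    n_zero = vapply(by_year, function(z) sum(z == 0, na.rm = TRUE), integer(1)),
    row.names = NULL
  )
}

# Hypothetical sample data.
d <- data.frame(
  Year = c(2021, 2021, 2022, 2022, 2023, 2023),
  Estimate = c(120, 0, 85, NA, 0, 0)
)
prof <- profile_values(d, "Estimate")

# A year in which every value is zero may be a placeholder, not a true zero.
prof$all_zero <- prof$n_zero == prof$n
```

The profile does not decide what a zero means. It only shows where zeros and missing values cluster, so the question can be put to the source owner and the answer recorded in the anomaly notes.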
3.5 Quick validation script
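# check.names = FALSE keeps headers as written; the default would rename
# a blank header to "X", so the empty-name check below could never fail.
x <- read.csv("DATA_IN/<input-file>.csv", check.names = FALSE)
# Every column must have a non-empty name.
stopifnot(!any(is.na(names(x)) | names(x) == ""))
# The year field must exist and hold at least one value.
stopifnot("Year" %in% names(x))
stopifnot(!all(is.na(x$Year)))
# Fields that together should identify exactly one row.
key <- c("Year")
if (all(key %in% names(x))) {
  stopifnot(!any(duplicated(x[key])))
}

Adapt the key fields to the actual file. The point is to fail early on obvious schema or duplication problems.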
3.6 Species-specific verification notes
- Sockeye: verify stream names, timing labels, and decoder alignment before any matching or CU roll-up.
- Coho: verify tributary names against the POPID lookup before downstream joins; check zero-versus-missing treatment for estimate types.
- Chum: verify Harrison and major-system year alignment before aggregation.
- Pink: verify official CU year coverage separately from the historical NuSEDS layer.