
Deduplicate proposed ontology terms
deduplicate_proposed_terms.Rd
Applies I-ADOPT compositional deduplication to a gpt_proposed_terms data frame. This prevents term proliferation by doing the following (a brief sketch follows the list):
Removing duplicates across tables (same term_label)
Collapsing age-stratified variants (X Age 1..7) into one base term
Collapsing phase-stratified variants (Ocean/Terminal/Mainstem X) into one base term
Identifying terms that should use constraint_iri instead of a new term_iri
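The collapsing step can be pictured as stripping age suffixes and phase prefixes from term_label and grouping on the result. The sketch below illustrates that idea only, not the package implementation; collapse_stratified_label() is a hypothetical helper and the toy data are invented.
# Hypothetical helper (not exported by the package): strip age suffixes and
# phase prefixes from a label to recover the base term
collapse_stratified_label <- function(label) {
  label |>
    stringr::str_remove(stringr::regex("\\s*Age\\s*\\d+$", ignore_case = TRUE)) |>
    stringr::str_remove(stringr::regex("^(Ocean|Terminal|Mainstem)\\s+", ignore_case = TRUE)) |>
    stringr::str_squish()
}
# Toy input: four stratified variants that should collapse to two base labels
toy <- tibble::tibble(
  term_label = c("Spawners Age 1", "Spawners Age 2", "Ocean Harvest", "Terminal Harvest")
)
toy |>
  dplyr::mutate(base_label = collapse_stratified_label(term_label)) |>
  dplyr::count(base_label, name = "collapsed_from")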
Value
A tibble with the deduplicated terms and additional columns (a toy illustration of the shape follows the list):
is_base_term: TRUE if this is the canonical base term for a pattern
needs_age_facet: TRUE if age variants should use constraint_iri
needs_phase_facet: TRUE if phase variants should use constraint_iri
collapsed_from: Count of how many variants were collapsed into this term
dedup_notes: Explanation of the deduplication applied
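Illustrative only: the rows below are invented, but they show the shape of the returned tibble with the added columns alongside term_label.
tibble::tibble(
  term_label        = c("SpawnerCount", "HarvestCount"),
  is_base_term      = c(TRUE, TRUE),
  needs_age_facet   = c(TRUE, FALSE),
  needs_phase_facet = c(FALSE, TRUE),
  collapsed_from    = c(7L, 3L),
  dedup_notes       = c("Collapsed 'Spawners Age 1..7' variants",
                        "Collapsed Ocean/Terminal/Mainstem variants")
)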
Details
The function returns a deduplicated data frame with added columns for facet handling.
Target ratio: for a dictionary with N measurement columns, expect ~N/10 to N/5 distinct base terms, NOT N terms. For example, a dictionary with 100 measurement columns should collapse to roughly 10-20 base terms. If the output still has >30 rows, consider further manual review.
Anti-patterns detected (a detection sketch follows this list):
"Spawners Age 1", "Spawners Age 2", ... patterns → collapsed to "SpawnerCount"
Duplicate term_labels across different tables → deduplicated
Phase-stratified variants (Ocean X, Terminal X) → collapsed to the base term
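One way to picture the detection step is a pair of regular-expression flags over term_label. The sketch below is an assumption-laden illustration; flag_anti_patterns() and its patterns are not the package's actual rules.
# Hypothetical illustration: flag age- and phase-stratified labels with regexes
flag_anti_patterns <- function(terms) {
  terms |>
    dplyr::mutate(
      needs_age_facet   = stringr::str_detect(
        term_label, stringr::regex("\\bAge\\s*\\d+\\b", ignore_case = TRUE)),
      needs_phase_facet = stringr::str_detect(
        term_label, stringr::regex("^(Ocean|Terminal|Mainstem)\\b", ignore_case = TRUE))
    )
}
flag_anti_patterns(tibble::tibble(
  term_label = c("Spawners Age 3", "Ocean Harvest", "SpawnerCount")
))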
Examples
if (FALSE) { # \dontrun{
# Load raw proposed terms
proposed <- readr::read_csv("work/semantics/gpt_proposed_terms.csv")
# Deduplicate
deduped <- deduplicate_proposed_terms(proposed)
# Review collapsed terms
deduped |> dplyr::filter(collapsed_from > 1)
# Write cleaned output
readr::write_csv(deduped, "work/semantics/gpt_proposed_terms_deduped.csv")
} # }