Skip to contents

Primary one-shot wrapper: infer dictionary/table metadata/codes/dataset metadata from raw data tables and immediately write a review-ready Salmon Data Package.

Usage

create_sdp(
  resources,
  path = NULL,
  dataset_id = "dataset-1",
  table_id = "table_1",
  guess_types = TRUE,
  seed_semantics = TRUE,
  semantic_sources = c("smn", "gcdfo", "ols", "nvs"),
  semantic_max_per_role = 1,
  seed_verbose = TRUE,
  seed_codes = NULL,
  seed_table_meta = TRUE,
  seed_dataset_meta = TRUE,
  semantic_code_scope = c("factor", "all", "none"),
  llm_assess = FALSE,
  llm_provider = c("openai", "openrouter", "openai_compatible", "chapi"),
  llm_model = NULL,
  llm_api_key = NULL,
  llm_base_url = NULL,
  llm_top_n = 5L,
  llm_context_files = NULL,
  llm_context_text = NULL,
  llm_timeout_seconds = 60,
  llm_request_fn = NULL,
  check_updates = interactive(),
  format = "csv",
  overwrite = FALSE,
  include_edh_xml = FALSE,
  ...
)

Arguments

resources

Either a named list of data frames (one per resource table) or a single data frame (converted internally to a one-table list).

path

Character; directory path where package will be written. If omitted, defaults to file.path(getwd(), paste0(<dataset_id>-sdp)) using a filesystem-safe dataset id slug.

dataset_id

Dataset identifier applied to all inferred metadata rows.

table_id

Fallback table identifier when resources is a single data frame.

guess_types

Logical; if TRUE (default), infer value_type for each dictionary column.

seed_semantics

Logical; if TRUE (default), seed semantic suggestions during inference.

semantic_sources

Vector of vocabulary sources passed to suggest_semantics().

semantic_max_per_role

Maximum number of suggestions retained per I-ADOPT role.

seed_verbose

Logical; if TRUE, emit progress messages while seeding semantic suggestions.

seed_codes

Optional codes.csv-style seed metadata.

seed_table_meta

Optional tables.csv-style seed metadata. Use TRUE (default) to infer starter table metadata from resources.

seed_dataset_meta

Optional dataset.csv-style seed metadata. Use TRUE (default) to infer starter dataset metadata from resources.

semantic_code_scope

Character string controlling which codes.csv rows are sent through suggest_semantics() during one-shot seeding. "factor" (default) analyzes codes sourced from factor columns and low-cardinality character columns in the original data frame(s); "all" analyzes all inferred or supplied code rows; "none" skips code-level semantic suggestions.

llm_assess

Logical; if TRUE, run the optional LLM shortlist assessment inside suggest_semantics().

llm_provider

LLM provider preset forwarded to suggest_semantics().

llm_model

Optional LLM model identifier forwarded to suggest_semantics().

llm_api_key

Optional API key override forwarded to suggest_semantics().

llm_base_url

Optional OpenAI-compatible base URL forwarded to suggest_semantics().

llm_top_n

Maximum number of retrieved candidates sent to the LLM per target.

llm_context_files

Optional local context files forwarded to suggest_semantics(). See that function for supported file types, including HTML, DOCX, .R, .Rmd, .qmd, PDF, and Excel context files.

llm_context_text

Optional inline context snippets forwarded to suggest_semantics().

llm_timeout_seconds

Timeout for each LLM request in seconds.

llm_request_fn

Advanced/test hook overriding the low-level OpenAI-compatible request function.

check_updates

Logical; if TRUE, run a short, non-fatal check_for_updates() call after writing the package and mention newer releases only when one is available. Defaults to interactive().

format

Character; resource format: "csv" (default, only format supported)

overwrite

Logical; if FALSE (default), errors if path exists. If TRUE, replacement is only allowed for empty directories or directories previously written by metasalmon.

include_edh_xml

Logical; when TRUE, writes an HNAP-aware EDH XML metadata file to metadata/metadata-edh-hnap.xml using edh_build_hnap_xml(). The default is FALSE.

...

Deprecated legacy EDH arguments accepted for backwards compatibility: edh_profile, EDH_Profile, and EDH_profile all enable EDH XML export and must be "dfo_edh_hnap" when supplied. edh_xml_path is ignored with a warning because XML now always writes to the default metadata path. Any other extra arguments error.

Value

Invisibly returns the package path.

Details

This one-shot helper creates a review-ready package by default: semantic suggestions are seeded and the top-ranked column-level suggestions are auto-applied only into missing dictionary IRI fields. Table-level observation-unit suggestions stay enabled, and create_sdp() can auto-apply them into missing tables.csv$observation_unit_iri values when the suggestion still looks lexically compatible with the available table context (prefer observation_unit/description, otherwise fall back to table_label/table_id); compatible suggestions can also backfill tables.csv$observation_unit labels when missing. To reduce review noise conservatively, code-level suggestions default to factor and low-cardinality character source columns only; set semantic_code_scope = "all" to broaden that or "none" to disable it. The package root contains README-review.txt, semantic_suggestions.csv (when available), datapackage.json, metadata/, and data/. Review the prefilled values already written into metadata/tables.csv and metadata/column_dictionary.csv first; use semantic_suggestions.csv as a fallback shortlist when you want more context or a better match. To keep that review file usable, semantic_suggestions.csv trims code-level suggestions that do not have enough human-readable context to review safely. When llm_assess = TRUE, the same review file also carries llm_* columns so the shortlisted LLM judgment stays explicit and reviewable. Any auto-applied column/table IRI draft is written back into the metadata CSVs as a REVIEW:-prefixed value for manual confirmation there. Required-field review placeholders are also inserted into the inferred metadata files. In interactive use, create_sdp() can also mention an available package update; set check_updates = FALSE to skip that network check. The package bundles two Fraser coho examples: nuseds-fraser-coho-sample.csv (30 rows across 1996-2024) for the quickest demo, and nuseds-fraser-coho-2023-2024.csv (173 rows from the official Open Government Canada Fraser and BC Interior workbook) for a fuller multi-year example. The bundled system.file("extdata", "example-data-README.md", package = "metasalmon") note points to the upstream record/resource URLs, licensing, and the repository data-raw/ script used to derive the fuller example.

Examples

if (FALSE) { # \dontrun{
data_path <- system.file("extdata", "nuseds-fraser-coho-sample.csv", package = "metasalmon")
fraser_coho <- readr::read_csv(data_path, show_col_types = FALSE)

pkg <- create_sdp(
  fraser_coho,
  dataset_id = "fraser-coho-2024",
  table_id = "escapement",
  overwrite = FALSE
)
} # }