
Overview

metasalmon includes helper functions for reading CSV files directly from private GitHub repositories. This is useful when your data lives in a private repo and you want to:

  • Access data without manually downloading files
  • Pin to specific versions (tags or commits) for reproducibility
  • Share analysis scripts that automatically fetch the latest data

One-Time Setup

Before reading from private repositories, you need to authenticate with GitHub. The ms_setup_github() function guides you through this process:

library(metasalmon)

# Run once to set up authentication
ms_setup_github()

This function will:

  1. Check for git - Verify that git is installed on your system
  2. Create a PAT - Open a browser to create a GitHub Personal Access Token (PAT) with repo scope if you don’t have one
  3. Store the PAT - Save your token securely using gitcreds so you don’t need to enter it repeatedly
  4. Verify access - Confirm that authentication works by testing against a repository
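
If you want to confirm the stored credential yourself, gh::gh_whoami() (from the gh package, not metasalmon) reports the account and token scopes GitHub sees:

# Optional manual check: which account and scopes does the stored PAT carry?
gh::gh_whoami()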

Verifying Access to a Specific Repository

By default, ms_setup_github() verifies access to a test repository. You can specify your own repository to verify:

# Verify access to a specific private repository
ms_setup_github(repo = "your-org/your-private-repo")

Troubleshooting Authentication

If you encounter authentication issues:

# Check if you have a stored PAT
gh::gh_token()

# Re-set your credentials if needed
gitcreds::gitcreds_set()

# Then run setup again
ms_setup_github(repo = "your-org/your-repo")

Common issues:

  • “No GitHub PAT found”: run ms_setup_github() to create and store a token
  • “Access blocked by org SSO”: re-authorize your PAT for the organization at https://github.com/settings/tokens
  • “Unable to reach repository”: check that the repo slug is correct (owner/name format)
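
To narrow an issue down to a specific repository, you can query its metadata directly with the gh package; this is a generic sketch (the owner and repo below are placeholders):

# Try to fetch repository metadata with the stored PAT; a failure here usually
# points to a token, scope, or SSO problem rather than anything in metasalmon
tryCatch(
  {
    repo_info <- gh::gh("GET /repos/{owner}/{repo}", owner = "your-org", repo = "your-repo")
    message("Can reach: ", repo_info$full_name)
  },
  error = function(e) message("Cannot reach repository: ", conditionMessage(e))
)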

Reading CSV Files

Once authenticated, use read_github_csv() to read CSV files directly into R:

# Read a CSV from a private repository
my_data <- read_github_csv(
  path = "data/my-dataset.csv",
  repo = "your-org/your-repo"
)

# View the data
head(my_data)

Specifying Branches, Tags, or Commits

By default, files are read from the main branch. You can specify a different reference:

# Read from a specific branch
dev_data <- read_github_csv(
  path = "data/my-dataset.csv",
  ref = "develop",
  repo = "your-org/your-repo"
)

# Pin to a release tag for reproducibility
stable_data <- read_github_csv(
  path = "data/my-dataset.csv",
  ref = "v1.0.0",
  repo = "your-org/your-repo"
)

# Pin to a specific commit SHA for exact reproducibility
exact_data <- read_github_csv(
  path = "data/my-dataset.csv",
  ref = "abc1234",
  repo = "your-org/your-repo"
)

Best practice: For reproducible analyses, pin to a tag or commit rather than a branch name. A branch like main moves as new commits are added, so the same script can return different data over time.
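
If you want to record the exact commit behind a tag (for example, in a script header or analysis log), one option is to ask the GitHub API via the gh package; a sketch with placeholder repo and tag:

# Resolve a tag to the commit SHA it points at, for precise provenance
tag_commit <- gh::gh(
  "GET /repos/{owner}/{repo}/commits/{ref}",
  owner = "your-org", repo = "your-repo", ref = "v1.0.0"
)
tag_commit$sha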

Passing Options to read_csv()

Additional arguments are passed through to readr::read_csv():

# Specify column types
typed_data <- read_github_csv(
  path = "data/my-dataset.csv",
  repo = "your-org/your-repo",
  col_types = "ccin"  # character, character, integer, number
)

# Skip rows or limit reading
partial_data <- read_github_csv(
  path = "data/my-dataset.csv",
  repo = "your-org/your-repo",
  skip = 1,
  n_max = 100
)
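
If the compact col_types string is hard to read, the same specification can be written out with readr::cols(); the column names below are hypothetical stand-ins for whatever your file contains:

# Equivalent long-form column specification (hypothetical column names)
spec_data <- read_github_csv(
  path = "data/my-dataset.csv",
  repo = "your-org/your-repo",
  col_types = readr::cols(
    site    = readr::col_character(),
    species = readr::col_character(),
    year    = readr::col_integer(),
    count   = readr::col_number()
  )
)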

Reading All CSVs from a Directory

If you have multiple CSV files in a directory, you can read them all at once using read_github_csv_dir(). This is similar to using dir() with lapply() for local files:

# Read all CSV files from a directory
data_list <- read_github_csv_dir(
  path = "data/observations",
  repo = "your-org/your-repo"
)

# Access individual data frames by name (file name without .csv extension)
observations <- data_list$observations
metadata <- data_list$metadata

# List all available datasets
names(data_list)
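
If the files share the same columns, the list can be stacked into a single data frame; a sketch using dplyr::bind_rows(), assuming every file has compatible columns:

# Combine all files into one data frame, keeping the source file as a column
combined <- dplyr::bind_rows(data_list, .id = "source_file")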

Filtering Files with Patterns

You can use a regular expression pattern to filter which files to read:

# Only read files matching a pattern
subset <- read_github_csv_dir(
  path = "data",
  repo = "your-org/your-repo",
  pattern = "^obs_.*\\.csv$"  # Files starting with "obs_" and ending in .csv
)

Pinning to Specific Versions

Like read_github_csv(), you can pin to specific tags or commits:

# Read all CSVs from a pinned version
data_v1 <- read_github_csv_dir(
  path = "data/observations",
  ref = "v1.0.0",
  repo = "your-org/your-repo"
)

Passing Options to read_csv()

Additional arguments are passed through to readr::read_csv() for each file:

# Apply the same read_csv options to all files
data_typed <- read_github_csv_dir(
  path = "data/observations",
  repo = "your-org/your-repo",
  col_types = "ccin"  # Applied to all CSV files
)

Note: The function returns a named list where names are the file names without the .csv extension. If no CSV files are found, it returns an empty list.
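
In a scripted pipeline it can be worth failing early when the directory turns out to be empty; a small sketch:

# Stop early if no CSV files were found at the requested path and ref
if (length(data_list) == 0) {
  stop("No CSV files found in data/observations at the requested ref")
}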

Getting Raw URLs

Sometimes you need the raw GitHub URL rather than the data itself. Use github_raw_url():

# Get the stable raw URL for a file
url <- github_raw_url(
  path = "data/my-dataset.csv",
  repo = "your-org/your-repo"
)

# Pin to a specific version
versioned_url <- github_raw_url(
  path = "data/my-dataset.csv",
  ref = "v1.0.0",
  repo = "your-org/your-repo"
)

# The URL can be used in documentation or shared with others
# (they'll still need authentication to access private repos)
print(versioned_url)

Note: The raw URL does not contain your token - authentication is handled separately via headers when you use read_github_csv().
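
As an illustration of what “handled via headers” means (not something you normally need to do yourself), here is a sketch of fetching a raw URL manually with httr2, sending the PAT as a bearer token:

# Manual download of a raw URL, with the PAT sent in a header rather than the URL
resp <- httr2::request(versioned_url) |>
  httr2::req_auth_bearer_token(gh::gh_token()) |>
  httr2::req_perform()
csv_text <- httr2::resp_body_string(resp)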

Working with Full URLs

If you already have a GitHub URL (blob or raw), you can pass it directly:

# From a GitHub blob URL (the kind you see in the browser)
data1 <- read_github_csv(
  path = "https://github.com/your-org/your-repo/blob/main/data/file.csv"
)

# From a raw.githubusercontent.com URL
data2 <- read_github_csv(
  path = "https://raw.githubusercontent.com/your-org/your-repo/main/data/file.csv"
)

When using full URLs, the repo and ref parameters are extracted automatically and don’t need to be specified.
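
For illustration only (the package handles this for you), here is roughly how a blob URL decomposes into the repo, ref, and path arguments:

# Rough sketch of how a blob URL maps onto repo, ref, and path
url <- "https://github.com/your-org/your-repo/blob/main/data/file.csv"
parts <- regmatches(url, regexec("github\\.com/([^/]+)/([^/]+)/blob/([^/]+)/(.*)$", url))[[1]]
repo <- paste(parts[2], parts[3], sep = "/")  # "your-org/your-repo"
ref  <- parts[4]                              # "main"
path <- parts[5]                              # "data/file.csv"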

Example Workflow

Here’s a complete example workflow for a reproducible analysis:

library(metasalmon)

# First time only: set up authentication
# ms_setup_github(repo = "dfo-pacific-science/my-data-repo")

# Read the latest data from main branch
current_data <- read_github_csv(
  path = "data/escapement-2024.csv",
  repo = "dfo-pacific-science/my-data-repo"
)

# For a published analysis, pin to a specific version
# This ensures anyone running your script gets the exact same data
archived_data <- read_github_csv(
  path = "data/escapement-2024.csv",
  ref = "v2.1.0",  # Use a release tag
  repo = "dfo-pacific-science/my-data-repo"
)

# Document the data source in your analysis
cat("Data source:", github_raw_url(
  "data/escapement-2024.csv",
  ref = "v2.1.0",
  repo = "dfo-pacific-science/my-data-repo"
))

Security Notes

  • Tokens are not embedded in URLs - Your PAT is sent via HTTP headers, not in the URL itself
  • Tokens are stored locally - The gitcreds package stores your token in your system’s credential manager
  • Don’t commit tokens - Never include your PAT in scripts or version-controlled files
  • Review token scopes - The repo scope is required for private repositories; consider using fine-grained PATs for additional security

Function Reference

  • ms_setup_github(): one-time authentication setup
  • read_github_csv(): read a CSV file from GitHub into R
  • read_github_csv_dir(): read all CSV files from a GitHub directory into a named list
  • github_raw_url(): get the raw URL for a file (no data fetched)

For detailed documentation, see the help page for each function (e.g. ?read_github_csv).