
Reading CSVs from Private GitHub Repositories
github-csv-access.RmdOverview
metasalmon includes helper functions for reading CSV files directly from private GitHub repositories. This is useful when your data lives in a private repo and you want to:
- Access data without manually downloading files
- Pin to specific versions (tags or commits) for reproducibility
- Share analysis scripts that automatically fetch the latest data
One-Time Setup
Before reading from private repositories, you need to authenticate
with GitHub. The ms_setup_github() function guides you
through this process:
library(metasalmon)
# Run once to set up authentication
ms_setup_github()This function will:
- Check for git - Verify that git is installed on your system
-
Create a PAT - Open a browser to create a GitHub
Personal Access Token (PAT) with
reposcope if you don’t have one -
Store the PAT - Save your token securely using
gitcredsso you don’t need to enter it repeatedly - Verify access - Confirm that authentication works by testing against a repository
Verifying Access to a Specific Repository
By default, ms_setup_github() verifies access to a test
repository. You can specify your own repository to verify:
# Verify access to a specific private repository
ms_setup_github(repo = "your-org/your-private-repo")Troubleshooting Authentication
If you encounter authentication issues:
# Check if you have a stored PAT
gh::gh_token()
# Re-set your credentials if needed
gitcreds::gitcreds_set()
# Then run setup again
ms_setup_github(repo = "your-org/your-repo")Common issues:
| Problem | Solution |
|---|---|
| “No GitHub PAT found” | Run ms_setup_github() to create and store a token |
| “Access blocked by org SSO” | Re-authorize your PAT for the organization at https://github.com/settings/tokens |
| “Unable to reach repository” | Check the repo slug is correct (owner/name format) |
Reading CSV Files
Once authenticated, use read_github_csv() to read CSV
files directly into R:
# Read a CSV from a private repository
my_data <- read_github_csv(
path = "data/my-dataset.csv",
repo = "your-org/your-repo"
)
# View the data
head(my_data)Specifying Branches, Tags, or Commits
By default, files are read from the main branch. You can
specify a different reference:
# Read from a specific branch
dev_data <- read_github_csv(
path = "data/my-dataset.csv",
ref = "develop",
repo = "your-org/your-repo"
)
# Pin to a release tag for reproducibility
stable_data <- read_github_csv(
path = "data/my-dataset.csv",
ref = "v1.0.0",
repo = "your-org/your-repo"
)
# Pin to a specific commit SHA for exact reproducibility
exact_data <- read_github_csv(
path = "data/my-dataset.csv",
ref = "abc1234",
repo = "your-org/your-repo"
)Best practice: For reproducible analyses, pin to a
tag or commit rather than a branch name. Branch names like
main can change over time.
Passing Options to read_csv()
Additional arguments are passed through to
readr::read_csv():
# Specify column types
typed_data <- read_github_csv(
path = "data/my-dataset.csv",
repo = "your-org/your-repo",
col_types = "ccin" # character, character, integer, number
)
# Skip rows or limit reading
partial_data <- read_github_csv(
path = "data/my-dataset.csv",
repo = "your-org/your-repo",
skip = 1,
n_max = 100
)Reading All CSVs from a Directory
If you have multiple CSV files in a directory, you can read them all
at once using read_github_csv_dir(). This is similar to
using dir() with lapply() for local files:
# Read all CSV files from a directory
data_list <- read_github_csv_dir(
path = "data/observations",
repo = "your-org/your-repo"
)
# Access individual data frames by name (file name without .csv extension)
observations <- data_list$observations
metadata <- data_list$metadata
# List all available datasets
names(data_list)Filtering Files with Patterns
You can use a regular expression pattern to filter which files to read:
# Only read files matching a pattern
subset <- read_github_csv_dir(
path = "data",
repo = "your-org/your-repo",
pattern = "^obs_.*\\.csv$" # Files starting with "obs_" and ending in .csv
)Pinning to Specific Versions
Like read_github_csv(), you can pin to specific tags or
commits:
# Read all CSVs from a pinned version
data_v1 <- read_github_csv_dir(
path = "data/observations",
ref = "v1.0.0",
repo = "your-org/your-repo"
)Passing Options to read_csv()
Additional arguments are passed through to
readr::read_csv() for each file:
# Apply the same read_csv options to all files
data_typed <- read_github_csv_dir(
path = "data/observations",
repo = "your-org/your-repo",
col_types = "ccin" # Applied to all CSV files
)Note: The function returns a named list where names
are the file names without the .csv extension. If no CSV
files are found, it returns an empty list.
Getting Raw URLs
Sometimes you need the raw GitHub URL rather than the data itself.
Use github_raw_url():
# Get the stable raw URL for a file
url <- github_raw_url(
path = "data/my-dataset.csv",
repo = "your-org/your-repo"
)
# Pin to a specific version
versioned_url <- github_raw_url(
path = "data/my-dataset.csv",
ref = "v1.0.0",
repo = "your-org/your-repo"
)
# The URL can be used in documentation or shared with others
# (they'll still need authentication to access private repos)
print(versioned_url)Note: The raw URL does not contain your token -
authentication is handled separately via headers when you use
read_github_csv().
Working with Full URLs
If you already have a GitHub URL (blob or raw), you can pass it directly:
# From a GitHub blob URL (the kind you see in the browser)
data1 <- read_github_csv(
path = "https://github.com/your-org/your-repo/blob/main/data/file.csv"
)
# From a raw.githubusercontent.com URL
data2 <- read_github_csv(
path = "https://raw.githubusercontent.com/your-org/your-repo/main/data/file.csv"
)When using full URLs, the repo and ref
parameters are extracted automatically and don’t need to be
specified.
Example Workflow
Here’s a complete example workflow for a reproducible analysis:
library(metasalmon)
# First time only: set up authentication
# ms_setup_github(repo = "dfo-pacific-science/my-data-repo")
# Read the latest data from main branch
current_data <- read_github_csv(
path = "data/escapement-2024.csv",
repo = "dfo-pacific-science/my-data-repo"
)
# For a published analysis, pin to a specific version
# This ensures anyone running your script gets the exact same data
archived_data <- read_github_csv(
path = "data/escapement-2024.csv",
ref = "v2.1.0", # Use a release tag
repo = "dfo-pacific-science/my-data-repo"
)
# Document the data source in your analysis
cat("Data source:", github_raw_url(
"data/escapement-2024.csv",
ref = "v2.1.0",
repo = "dfo-pacific-science/my-data-repo"
))Security Notes
- Tokens are not embedded in URLs - Your PAT is sent via HTTP headers, not in the URL itself
-
Tokens are stored locally - The
gitcredspackage stores your token in your system’s credential manager - Don’t commit tokens - Never include your PAT in scripts or version-controlled files
-
Review token scopes - The
reposcope is required for private repositories; consider using fine-grained PATs for additional security
Function Reference
| Function | Purpose |
|---|---|
ms_setup_github() |
One-time authentication setup |
read_github_csv() |
Read a CSV file from GitHub into R |
read_github_csv_dir() |
Read all CSV files from a GitHub directory into a named list |
github_raw_url() |
Get the raw URL for a file (no data fetched) |
For detailed documentation, see: