data-designer-github

data-designer-github is a Data Designer seed reader for repository files. It turns GitHub repositories or local git repositories into seed rows that carry file content, path metadata, repository provenance, and commit identifiers.

Use it when a workflow needs code repository data as the starting point for generation, review, transformation, or indexing tasks. The reader is intentionally file-oriented: each matching text file becomes one seed row, and downstream Data Designer columns decide how to summarize, critique, rewrite, label, or enrich that row.

Installation

uv add data-designer data-designer-github

The plugin is discovered through the data_designer.plugins entry point once it is installed in the same environment as Data Designer.

Seed source

Use the github seed source when the seed dataset should come from one or more repositories.

| Field | Required | Description |
| --- | --- | --- |
| path | No | A local git repository path, or a directory whose immediate children are git repositories. |
| repositories | No | GitHub repositories to clone. Entries may be owner/name, https://github.com/owner/name, or https://github.com/owner/name.git. |
| repository_paths | No | Additional explicit local git repository paths to read. |
| ref | No | Branch, tag, or commit to check out for cloned GitHub repositories. |
| clone_depth | No | Shallow clone depth for GitHub repositories. Defaults to 1; set to None for a full clone. |
| clone_timeout_seconds | No | Timeout for each clone or checkout operation. Defaults to 300. |
| file_pattern | No | Inherited file glob from Data Designer's filesystem seed source. For example, *.py. |
| recursive | No | Whether file_pattern is applied recursively. |
| include_extensions | No | File extensions to include after the glob match. Defaults to common code and documentation extensions. Set to None to allow every extension. |
| include_file_names | No | Extensionless file names to include, such as Dockerfile and Makefile. |
| exclude_patterns | No | Relative path glob patterns to skip, including .git, cache, build, virtualenv, and dependency directories by default. |
| max_file_size_bytes | No | Maximum file size to hydrate into content. Defaults to 1_000_000. |
| encoding | No | Text encoding used when reading file contents. Defaults to utf-8. |

At least one of path, repositories, or repository_paths is required.
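
The three accepted forms of a repositories entry, and the rule that at least one source must be set, can be sketched in plain Python. This is an illustration only, not the plugin's own validation code: the helper names `normalize_repository` and `check_sources` and the regular expressions are assumptions.

```python
import re

# Assumed patterns for the documented entry forms.
_GITHUB_URL = re.compile(
    r"^https://github\.com/(?P<owner>[^/]+)/(?P<name>[^/]+?)(?:\.git)?/?$"
)
_SHORTHAND = re.compile(r"^(?P<owner>[^/\s]+)/(?P<name>[^/\s]+)$")


def normalize_repository(entry: str) -> str:
    """Map owner/name or a github.com HTTPS URL to the owner/name form."""
    entry = entry.strip()
    match = _GITHUB_URL.match(entry) or _SHORTHAND.match(entry)
    if match is None:
        raise ValueError(f"unrecognized repository entry: {entry!r}")
    return f"{match['owner']}/{match['name']}"


def check_sources(path=None, repositories=None, repository_paths=None) -> None:
    """Raise if none of the three source fields is set."""
    if not (path or repositories or repository_paths):
        raise ValueError(
            "at least one of path, repositories, or repository_paths is required"
        )
```

All three documented forms normalize to the same owner/name string, which matches the repo_id format described for GitHub repositories.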

Output columns

| Column | Description |
| --- | --- |
| repo_id | Repository identifier. GitHub repositories use owner/name; local repositories use their GitHub remote when available, otherwise the directory name. |
| repo_url | Remote origin URL when available. |
| commit_sha | Checked-out commit SHA for the repository. |
| source_kind | github for cloned repositories, or git_repository for local repositories. |
| repository_path | Local path used by the reader. GitHub repositories are cloned into a temporary runtime directory. |
| source_path | Absolute path to the file that produced the seed row. |
| relative_path | File path relative to the repository root. |
| file_name | Basename of the file. |
| file_extension | Lowercase file extension. |
| code_lang | Language hint inferred from the file name or extension. |
| size_bytes | File size at manifest time. |
| content_sha256 | SHA-256 hash of the hydrated file bytes. |
| content | Decoded text content. |
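
For orientation, the file-level columns can be approximated with the standard library. This is a hedged sketch, not the plugin's implementation: `build_row` is a hypothetical helper, the leading dot in file_extension is an assumption, and size_bytes is taken from the bytes read here rather than from a separate manifest step.

```python
import hashlib
from pathlib import Path


def build_row(repo_root: Path, file_path: Path, encoding: str = "utf-8") -> dict:
    """Build the file-level columns for one file under repo_root."""
    raw = file_path.read_bytes()
    return {
        "source_path": str(file_path.resolve()),
        "relative_path": file_path.relative_to(repo_root).as_posix(),
        "file_name": file_path.name,
        # Assumption: the extension keeps its leading dot, e.g. ".py".
        "file_extension": file_path.suffix.lower(),
        "size_bytes": len(raw),
        # The hash covers the raw bytes, before decoding.
        "content_sha256": hashlib.sha256(raw).hexdigest(),
        "content": raw.decode(encoding),
    }
```

Because content_sha256 is computed over bytes, identical files in different repositories hash to the same value, which can help with downstream deduplication.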

Behavior

When the reader is attached, it resolves local repository roots, clones any configured GitHub repositories, records the checked-out commit, and builds a manifest of matching files. File content is read during row hydration, so Data Designer can batch and sample repository content using the same seed reader interfaces as other filesystem-backed datasets.
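
The split between building a manifest and hydrating content can be sketched as two small functions. This is an illustration under stated assumptions, not the plugin's code: `build_manifest` and `hydrate` are hypothetical names, extensions are assumed to carry a leading dot, patterns are matched with `fnmatch`, and oversized files are simply skipped here, which may differ from what the reader actually does.

```python
from fnmatch import fnmatch
from pathlib import Path


def build_manifest(repo_root, include_extensions, include_file_names,
                   exclude_patterns, max_file_size_bytes):
    """Collect metadata for matching files without reading their content."""
    manifest = []
    for path in sorted(repo_root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(repo_root).as_posix()
        if any(fnmatch(rel, pattern) for pattern in exclude_patterns):
            continue
        ext_ok = (include_extensions is None
                  or path.suffix.lower() in include_extensions)
        if not (ext_ok or path.name in include_file_names):
            continue
        size = path.stat().st_size
        if size > max_file_size_bytes:
            continue  # assumption: skip files over the limit
        manifest.append({"relative_path": rel, "size_bytes": size})
    return manifest


def hydrate(repo_root, entry, encoding="utf-8"):
    """Read file content only when a row is actually requested."""
    text = (repo_root / entry["relative_path"]).read_text(encoding=encoding)
    return {**entry, "content": text}
```

Keeping the manifest free of content means sampling or batching can pick rows from metadata alone and pay the read cost only for the rows that are used.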

The plugin reads repository files only. It does not parse code into functions, classes, symbols, dependency graphs, or AST nodes. If a workflow needs those structures, use this reader to collect stable file-level inputs and add downstream columns that perform the language-specific analysis.

The plugin shells out to git for repository operations and does not manage GitHub API tokens. Public repositories work directly. Private repositories require the execution environment's git credential configuration to already have access.
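
Shelling out to git in this way might look like the sketch below. The helper names `clone_command` and `head_commit` are hypothetical and the exact commands the plugin runs are not documented here. The sketch uses only standard git invocations (`git clone --depth` and `git rev-parse HEAD`) and relies on whatever credentials git is already configured with.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def clone_command(url: str, dest: Path,
                  clone_depth: Optional[int] = 1) -> List[str]:
    """Build a git clone command; clone_depth=None requests a full clone."""
    cmd = ["git", "clone"]
    if clone_depth is not None:
        cmd += ["--depth", str(clone_depth)]
    cmd += [url, str(dest)]
    return cmd


def head_commit(repo_path: Path, timeout: float = 300) -> str:
    """Return the checked-out commit SHA of a local repository."""
    result = subprocess.run(
        ["git", "-C", str(repo_path), "rev-parse", "HEAD"],
        check=True, capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout.strip()
```

A caller would pass the command to `subprocess.run(..., check=True, timeout=...)`. No token handling appears anywhere, so access to private repositories depends entirely on the environment's git credential setup.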