From e565ab2049a2a4aeadf039d14c368da6c2736d4a Mon Sep 17 00:00:00 2001 From: Dahlia Li Date: Mon, 17 Nov 2025 13:52:22 -0500 Subject: [PATCH 01/10] Add initial MkDocs documentation site --- docs/index.md | 3 +++ mkdocs.yml | 11 +++++++++++ 2 files changed, 14 insertions(+) create mode 100644 docs/index.md create mode 100644 mkdocs.yml diff --git a/docs/index.md b/docs/index.md new file mode 100644 index 0000000..3a6fe78 --- /dev/null +++ b/docs/index.md @@ -0,0 +1,3 @@ +# TaxonoPy Documentation + +Welcome! This is the initial MkDocs site for the TaxonoPy project. diff --git a/mkdocs.yml b/mkdocs.yml new file mode 100644 index 0000000..9733d23 --- /dev/null +++ b/mkdocs.yml @@ -0,0 +1,11 @@ +site_name: TaxonoPy +site_description: Documentation for the TaxonoPy package + +repo_url: https://github.com/Imageomics/taxonopy +repo_name: Imageomics/taxonopy + +theme: + name: material + +nav: + - Home: index.md From 8bef3761d00eca5820e4b94f6eefe7770509d022 Mon Sep 17 00:00:00 2001 From: Dahlia Li Date: Fri, 21 Nov 2025 20:22:24 -0500 Subject: [PATCH 02/10] Add MkDocs documentation pages and navigation --- docs/acknowledgements.md | 3 + docs/command_line_usage/help.md | 29 ++++++++++ docs/command_line_usage/tutorial.md | 90 +++++++++++++++++++++++++++++ docs/index.md | 69 +++++++++++++++++++++- mkdocs.yml | 32 +++++++++- 5 files changed, 218 insertions(+), 5 deletions(-) create mode 100644 docs/acknowledgements.md create mode 100644 docs/command_line_usage/help.md create mode 100644 docs/command_line_usage/tutorial.md diff --git a/docs/acknowledgements.md b/docs/acknowledgements.md new file mode 100644 index 0000000..ffe3664 --- /dev/null +++ b/docs/acknowledgements.md @@ -0,0 +1,3 @@ +# Acknowledgments + +The [Imageomics Institute](https://imageomics.org/) is supported by the National Science Foundation under [Award No. 
2118240](https://www.nsf.gov/awardsearch/showAward?AWD_ID=2118240) "HDR Institute: Imageomics: A New Frontier of Biological Information Powered by Knowledge-Guided Machine Learning." Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. \ No newline at end of file diff --git a/docs/command_line_usage/help.md b/docs/command_line_usage/help.md new file mode 100644 index 0000000..f6e5b40 --- /dev/null +++ b/docs/command_line_usage/help.md @@ -0,0 +1,29 @@ +# Help +You may view the help for the command line interface by running: + +```bash +taxonopy --help +``` +This will show you the available commands and options: + +``` +usage: taxonopy [-h] [--cache-dir CACHE_DIR] [--show-cache-path] [--cache-stats] [--clear-cache] [--show-config] [--version] {resolve,trace,common-names} ... + +TaxonoPy: Resolve taxonomic names using GNVerifier and trace data provenance. + +positional arguments: + {resolve,trace,common-names} + resolve Run the taxonomic resolution workflow + trace Trace data provenance of TaxonoPy objects + common-names Merge vernacular names (post-process) into resolved outputs + +options: + -h, --help show this help message and exit + --cache-dir CACHE_DIR + Directory for TaxonoPy cache (can also be set with TAXONOPY_CACHE_DIR environment variable) (default: None) + --show-cache-path Display the current cache directory path and exit (default: False) + --cache-stats Display statistics about the cache and exit (default: False) + --clear-cache Clear the TaxonoPy object cache. May be used in isolation. 
(default: False) + --show-config Show current configuration and exit (default: False) + --version Show version number and exit +``` \ No newline at end of file diff --git a/docs/command_line_usage/tutorial.md b/docs/command_line_usage/tutorial.md new file mode 100644 index 0000000..a699f45 --- /dev/null +++ b/docs/command_line_usage/tutorial.md @@ -0,0 +1,90 @@ +# Command Line Tutorial + +**Command ```resolve```:** +The ```resolve``` command is used to perform taxonomic resolution on a dataset. It takes a directory of Parquet partitions as input and outputs a directory of resolved Parquet partitions. +``` +usage: taxonopy resolve [-h] -i INPUT -o OUTPUT_DIR [--output-format {csv,parquet}] [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}] [--log-file LOG_FILE] [--force-input] [--batch-size BATCH_SIZE] [--all-matches] + [--capitalize] [--fuzzy-uninomial] [--fuzzy-relaxed] [--species-group] [--refresh-cache] + +options: + -h, --help show this help message and exit + -i, --input INPUT Path to input Parquet or CSV file/directory + -o, --output-dir OUTPUT_DIR + Directory to save resolved and unsolved output files + --output-format {csv,parquet} + Output file format + --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL} + Set logging level + --log-file LOG_FILE Optional file to write logs to + --force-input Force use of input metadata without resolution + +GNVerifier Settings: + --batch-size BATCH_SIZE + Max number of name queries per GNVerifier API/subprocess call + --all-matches Return all matches instead of just the best one + --capitalize Capitalize the first letter of each name + --fuzzy-uninomial Enable fuzzy matching for uninomial names + --fuzzy-relaxed Relax fuzzy matching criteria + --species-group Enable group species matching + +Cache Management: + --refresh-cache Force refresh of cached objects (input parsing, grouping) before running. +``` +It is recommended to keep GNVerifier settings at their defaults. 
+ +**Command ```trace```**: +The ```trace``` command is used to trace the provenance of a taxonomic entry. It takes a UUID and an input path as arguments and outputs the full path of the entry through TaxonoPy. +``` +usage: taxonopy trace [-h] {entry} ... + +positional arguments: + {entry} + entry Trace an individual taxonomic entry by UUID + +options: + -h, --help show this help message and exit + +usage: taxonopy trace entry [-h] --uuid UUID --from-input FROM_INPUT [--format {json,text}] [--verbose] + +options: + -h, --help show this help message and exit + --uuid UUID UUID of the taxonomic entry + --from-input FROM_INPUT + Path to the original input dataset + --format {json,text} Output format + --verbose Show full details including all UUIDs in group +``` + +**Command ```common-names```:** +The ```common-names``` command is used to merge vernacular names into the resolved output. It takes a directory of resolved Parquet partitions as input and outputs a directory of resolved Parquet partitions with common names. + +``` +usage: taxonopy common-names [-h] --resolved-dir ANNOTATION_DIR --output-dir OUTPUT_DIR + +options: + -h, --help show this help message and exit + --resolved-dir ANNOTATION_DIR + Directory containing your *.resolved.parquet files + --output-dir OUTPUT_DIR + Directory to write annotated .parquet files +``` + +Note that the ```common-names``` command is a post-processing step and should be run after the ```resolve``` command. + +## Example Usage +To perform taxonomic resolution on a dataset with subsequent common name annotation, run: +``` +taxonopy resolve \ + --input /path/to/formatted/input \ + --output-dir /path/to/resolved/output +``` +``` +taxonopy common-names \ + --resolved-dir /path/to/resolved/output \ + --output-dir /path/to/resolved_with_common-names/output +``` +TaxonoPy creates a cache of the objects associated with input entries for use with the ```trace``` command. 
By default, this cache is stored in the ```~/.cache/taxonopy``` directory. + +## Development + +See the [Wiki Development Page](https://github.com/Imageomics/TaxonoPy/wiki/Development) for development instructions. \ No newline at end of file diff --git a/docs/index.md b/docs/index.md index 3a6fe78..e87ad7e 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,3 +1,68 @@ -# TaxonoPy Documentation +# TaxonoPy + +Welcome! This is the initial MkDocs site for the TaxonoPy project. + +TaxonoPy (taxon-o-py) is a command-line tool for creating an internally consistent taxonomic hierarchy using the [Global Names Verifier (gnverifier)](https://github.com/gnames/gnverifier) +. See below for the structure of inputs and outputs. + +## Purpose +The motivation for this package is to create an internally consistent and standardized classification set for organisms in a large biodiversity dataset composed from different data providers that may use very similar and overlapping but not identical taxonomic hierarchies. + +Its development has been driven by its application in the TreeOfLife-200M (TOL) dataset. This dataset contains over 200 million samples of organisms from four core data providers: + +- [The Global Biodiversity Information Facility (GBIF)](https://www.gbif.org/) +- [BIOSCAN-5M](https://biodiversitygenomics.net/projects/5m-insects/) +- [FathomNet](https://www.fathomnet.org/) +- [The Encyclopedia of Life (EOL)](https://eol.org/) + +The names (and classification) of taxa may be (and often are) inconsistent across these resources. This package addresses this problem by creating an internally consistent classification set for such taxa. + +## Input +A directory containing Parquet partitions of the seven-rank Linnaean taxonomic metadata for organisms in the dataset. Labels should include: + +- **uuid**: a unique identifier for each sample (required). 
+- **kingdom, phylum, class, order, family, genus, species**: the taxonomic ranks of the organism (required, may have sparsity). +- **scientific_name**: the scientific name to the most specific rank available (optional). +- **common_name**: the common (i.e. vernacular) name of the organism (optional). + + + +See the example data in: +``` + +- `examples/input/sample.parquet` +- `examples/resolved/sample.resolved.parquet` (generated with `taxonopy resolve`) +- `examples/resolved_with_common_names/sample.resolved.parquet` (generated with `taxonopy common-names`) + +``` + + +## Challenges +This taxonomy information is provided by each data provider and original sources, but the classification can be: + +- **Inconsistent** — between and within sources (e.g., kingdom *Metazoa* vs. *Animalia*) +- **Incomplete** — missing ranks or containing "holes" +- **Incorrect** — spelling errors, nonstandard terms, or outdated classifications +- **Ambiguous** — homonyms, synonyms, and terms with multiple interpretations + +Taxonomic authorities exist to standardize classification, but: + +- There are multiple authorities +- They may disagree +- A given organism may be missing from some + +## Solution +TaxonoPy uses the taxonomic hierarchies provided by the TOL core data providers to query GNVerifier and create a standardized classification for each sample in the TOL dataset. It prioritizes the [GBIF Backbone Taxonomy](https://verifier.globalnames.org/data_sources/11), since this represents the largest part of the TOL dataset. Where GBIF misses, backup sources such as the [Catalogue of Life](https://verifier.globalnames.org/data_sources/1) and [Open Tree of Life (OTOL) Reference Taxonomy](https://verifier.globalnames.org/data_sources/179) are used. + +## Installation +TaxonoPy can be installed with pip after setting up a virtual environment. + +### User Installation with pip +To install the latest version of TaxonoPy, run: + +``` bash + +pip install taxonopy + +``` -Welcome! 
This is the initial MkDocs site for the TaxonoPy project. diff --git a/mkdocs.yml b/mkdocs.yml index 9733d23..499ae04 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -4,8 +4,34 @@ site_description: Documentation for the TaxonoPy package repo_url: https://github.com/Imageomics/taxonopy repo_name: Imageomics/taxonopy -theme: - name: material - nav: - Home: index.md + - Command Line Usage: + - Tutorial: command_line_usage/tutorial.md + - Help: command_line_usage/help.md + - Acknowledgements: acknowledgements.md + +theme: + name: material + features: + - navigation.tabs + - navigation.tabs.sticky + - content.code.copy + - content.code.annotate + +plugins: +- search +- mkdocstrings: + handlers: + python: + paths: [src] # search packages in the src folder + options: + docstring_style: google + merge_init_into_class: true +markdown_extensions: + - admonition + - attr_list + - md_in_html + - pymdownx.blocks.caption + - pymdownx.details + - pymdownx.superfences From d5b5e270670e11fe86f3e006c7ebe08c0044b6c5 Mon Sep 17 00:00:00 2001 From: Matt Thompson Date: Mon, 24 Nov 2025 15:16:03 -0500 Subject: [PATCH 03/10] Add mkdocs documentation dependencies --- pyproject.toml | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/pyproject.toml b/pyproject.toml index 46de2a7..827952d 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -52,6 +52,12 @@ nb = [ "jupyterlab", "marimo[recommended]" ] +docs = [ + "mkdocs", + "mkdocs-material", + "mkdocs-material-extensions", + "mkdocstrings-python" +] [project.urls] Documentation = "https://github.com/Imageomics/TaxonoPy" From 497f517918d6c9e71b883653f1c9ef4f8349c70c Mon Sep 17 00:00:00 2001 From: Matthew Thompson Date: Tue, 27 Jan 2026 13:51:35 -0500 Subject: [PATCH 04/10] Banner and logo images --- docs/_assets/taxonopy_banner.svg | 150 +++++++++++++++++++++++++++++++ docs/_assets/taxonopy_logo.svg | 129 ++++++++++++++++++++++++++ 2 files changed, 279 insertions(+) create mode 100755 docs/_assets/taxonopy_banner.svg create mode 100755 
docs/_assets/taxonopy_logo.svg diff --git a/docs/_assets/taxonopy_banner.svg b/docs/_assets/taxonopy_banner.svg new file mode 100755 index 0000000..488f2b2 --- /dev/null +++ b/docs/_assets/taxonopy_banner.svg @@ -0,0 +1,150 @@ + + + + diff --git a/docs/_assets/taxonopy_logo.svg b/docs/_assets/taxonopy_logo.svg new file mode 100755 index 0000000..471f7af --- /dev/null +++ b/docs/_assets/taxonopy_logo.svg @@ -0,0 +1,129 @@ + + + + From 63fa90ff5a090b23fdf62903333eda8e82d0c247 Mon Sep 17 00:00:00 2001 From: Matthew Thompson Date: Wed, 28 Jan 2026 18:41:40 -0500 Subject: [PATCH 05/10] Expand docs navigation and quick reference [AI-assisted session] --- docs/_scripts/gen_cli_help_docs.py | 61 +++++++++++ docs/development/contributing/index.md | 5 + docs/index.md | 57 +++++----- docs/stylesheets/extra.css | 123 ++++++++++++++++++++++ docs/user-guide/cache.md | 36 +++++++ docs/user-guide/installation.md | 26 +++++ docs/user-guide/io/index.md | 9 ++ docs/user-guide/io/input.md | 14 +++ docs/user-guide/io/output.md | 21 ++++ docs/user-guide/quick-reference.md | 137 +++++++++++++++++++++++++ mkdocs.yml | 59 +++++++++-- 11 files changed, 515 insertions(+), 33 deletions(-) create mode 100644 docs/_scripts/gen_cli_help_docs.py create mode 100644 docs/development/contributing/index.md create mode 100644 docs/stylesheets/extra.css create mode 100644 docs/user-guide/cache.md create mode 100644 docs/user-guide/installation.md create mode 100644 docs/user-guide/io/index.md create mode 100644 docs/user-guide/io/input.md create mode 100644 docs/user-guide/io/output.md create mode 100644 docs/user-guide/quick-reference.md diff --git a/docs/_scripts/gen_cli_help_docs.py b/docs/_scripts/gen_cli_help_docs.py new file mode 100644 index 0000000..e9be7f5 --- /dev/null +++ b/docs/_scripts/gen_cli_help_docs.py @@ -0,0 +1,61 @@ +"""Generate CLI help pages for MkDocs without shell execution.""" + +from __future__ import annotations + +import argparse +import sys +from pathlib import Path + 
+import mkdocs_gen_files + +ROOT = Path(__file__).resolve().parents[2] +SRC = ROOT / "src" +if str(SRC) not in sys.path: + sys.path.insert(0, str(SRC)) + +from taxonopy.cli import create_parser # noqa: E402 + + +def get_subparser(parser: argparse.ArgumentParser, name: str) -> argparse.ArgumentParser: + for action in parser._actions: + if isinstance(action, argparse._SubParsersAction): + if name in action.choices: + return action.choices[name] + raise KeyError(f"Subparser '{name}' not found") + + +def render_section(title: str, help_text: str) -> str: + return f"## `{title}`\n\n```console\n{help_text.rstrip()}\n```\n" + + +def main() -> None: + parser = create_parser() + parser.prog = "taxonopy" + resolve_parser = get_subparser(parser, "resolve") + trace_parser = get_subparser(parser, "trace") + trace_entry_parser = get_subparser(trace_parser, "entry") + common_parser = get_subparser(parser, "common-names") + + resolve_parser.prog = "taxonopy resolve" + trace_parser.prog = "taxonopy trace" + trace_entry_parser.prog = "taxonopy trace entry" + common_parser.prog = "taxonopy common-names" + + sections = [ + ("taxonopy --help", parser.format_help()), + ("taxonopy resolve --help", resolve_parser.format_help()), + ("taxonopy trace --help", trace_parser.format_help()), + ("taxonopy trace entry --help", trace_entry_parser.format_help()), + ("taxonopy common-names --help", common_parser.format_help()), + ] + + with mkdocs_gen_files.open("command_line_usage/help.md", "w") as file_handle: + file_handle.write("# Help\n\n") + file_handle.write("Command reference for the TaxonoPy CLI.\n\n") + for title, help_text in sections: + file_handle.write(render_section(title, help_text)) + file_handle.write("\n") + + +if __name__ in {"__main__", "<run_path>"}: + main() diff --git a/docs/development/contributing/index.md b/docs/development/contributing/index.md new file mode 100644 index 0000000..be90067 --- /dev/null +++ b/docs/development/contributing/index.md @@ -0,0 +1,5 @@ +# 
Contributing + +We
welcome contributions to TaxonoPy. More detailed guidance will be added here. + +If you have suggestions or run into a bug, please open an issue at [https://github.com/Imageomics/TaxonoPy/issues](https://github.com/Imageomics/TaxonoPy/issues). diff --git a/docs/index.md b/docs/index.md index e87ad7e..10a039e 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,12 +1,35 @@ -# TaxonoPy +--- +title: Home +hide: + - title +--- + +# TaxonoPy {: .taxonopy-home-title } + +![TaxonoPy banner](_assets/taxonopy_banner.svg) + +

Cleanly Aligned Biodiversity Taxonomy

+ Welcome! This is the initial MkDocs site for the TaxonoPy project. -TaxonoPy (taxon-o-py) is a command-line tool for creating an internally consistent taxonomic hierarchy using the [Global Names Verifier (gnverifier)](https://github.com/gnames/gnverifier) -. See below for the structure of inputs and outputs. +TaxonoPy (taxon-o-py) is a command-line tool for creating an internally consistent 7-rank Linnaean taxonomic hierarchy using the [Global Names Verifier (gnverifier)](https://github.com/gnames/gnverifier). +It does not define its own authority; instead it leans on trusted sources indexed by GNVerifier, such as the Catalogue of Life and the GBIF Backbone Taxonomy. See the full list of [GNVerifier data sources](https://verifier.globalnames.org/data_sources). -## Purpose -The motivation for this package is to create an internally consistent and standardized classification set for organisms in a large biodiversity dataset composed from different data providers that may use very similar and overlapping but not identical taxonomic hierarchies. +Support for flexible source selection is still evolving. Today, TaxonoPy ships with a pinned default GNVerifier source configuration (currently GBIF Backbone Taxonomy, source 11), while additional sources remain available through GNVerifier. + +## Package Purpose +TaxonoPy helps build a single, internally consistent classification across large biodiversity datasets assembled from multiple providers, each of which may use overlapping but non‑identical taxonomic hierarchies. The goal is AI-ready biodiversity data with clean, aligned taxonomy. Its development has been driven by its application in the TreeOfLife-200M (TOL) dataset. 
This dataset contains over 200 million samples of organisms from four core data providers: @@ -15,26 +38,11 @@ Its development has been driven by its application in the TreeOfLife-200M (TOL) - [FathomNet](https://www.fathomnet.org/) - [The Encyclopedia of Life (EOL)](https://eol.org/) -The names (and classification) of taxa may be (and often are) inconsistent across these resources. This package addresses this problem by creating an internally consistent classification set for such taxa. - -## Input -A directory containing Parquet partitions of the seven-rank Linnaean taxonomic metadata for organisms in the dataset. Labels should include: - -- **uuid**: a unique identifier for each sample (required). -- **kingdom, phylum, class, order, family, genus, species**: the taxonomic ranks of the organism (required, may have sparsity). -- **scientific_name**: the scientific name to the most specific rank available (optional). -- **common_name**: the common (i.e. vernacular) name of the organism (optional). - - +Across these resources, taxon names and classifications often conflict. TaxonoPy resolves those differences into a coherent, standardized taxonomy for the combined dataset. -See the example data in: -``` - -- `examples/input/sample.parquet` -- `examples/resolved/sample.resolved.parquet` (generated with `taxonopy resolve`) -- `examples/resolved_with_common_names/sample.resolved.parquet` (generated with `taxonopy common-names`) - -``` +!!! warning + TaxonoPy does not guarantee perfect alignment or edge case coverage; it is a progressive effort to improve taxonomic coverage in an evolving landscape. + If you have suggestions or encounter bugs, please see the [Contributing](development/contributing/index.md) page. 
## Challenges @@ -65,4 +73,3 @@ To install the latest version of TaxonoPy, run: pip install taxonopy ``` - diff --git a/docs/stylesheets/extra.css b/docs/stylesheets/extra.css new file mode 100644 index 0000000..eeb4204 --- /dev/null +++ b/docs/stylesheets/extra.css @@ -0,0 +1,123 @@ +/* Brand palette aligned to the TaxonoPy banner/logo. */ + +/* Light mode */ +:root, +[data-md-color-scheme="default"] { + /* Deep forest green from the wordmark */ + --md-primary-fg-color: #16381f; + --md-primary-fg-color--light: #2a5a3b; + --md-primary-fg-color--dark: #0f2a17; + + /* Teal-blue from the branches/"Py" */ + --md-accent-fg-color: #2d94b8; + --md-accent-fg-color--transparent: rgba(45, 148, 184, 0.16); + + /* Link styling: green by default, blue on hover for clarity. */ + --taxonopy-link-color: #2e7a45; + --taxonopy-link-hover-color: #2d94b8; + --md-typeset-a-color: var(--taxonopy-link-color); + + /* Subtle brand tint for UI surfaces */ + --taxonopy-surface-tint: #dcebd2; + + /* Resolution table highlights */ + --taxonopy-cell-added-bg: rgba(46, 122, 69, 0.22); + --taxonopy-cell-changed-bg: rgba(255, 193, 7, 0.24); +} + +/* Dark mode tweaks to keep contrast crisp on slate */ +[data-md-color-scheme="slate"] { + --md-primary-fg-color: #1f5a32; + --md-primary-fg-color--light: #2e7a45; + --md-primary-fg-color--dark: #163d23; + + --md-accent-fg-color: #2d94b8; + --md-accent-fg-color--transparent: rgba(45, 148, 184, 0.20); + + /* Link styling: green by default, blue on hover for clarity. */ + --taxonopy-link-color: #43a463; + --taxonopy-link-hover-color: #2d94b8; + --md-typeset-a-color: var(--taxonopy-link-color); + + --taxonopy-surface-tint: #1b2a20; + + /* Resolution table highlights */ + --taxonopy-cell-added-bg: rgba(88, 214, 141, 0.28); + --taxonopy-cell-changed-bg: rgba(255, 214, 102, 0.28); +} + +/* Apply the tint very lightly to header/nav surfaces. 
*/ +.md-header, +.md-tabs { + background-color: var(--md-primary-fg-color); +} + +.md-tabs { + box-shadow: inset 0 -1px 0 rgba(255, 255, 255, 0.08); +} + +/* Make the header logo slightly larger. */ +.md-header__button.md-logo img { + height: 1.9rem; +} + +/* Hide the home page title so the banner is the first visible element. */ +.taxonopy-home-title { + display: none; +} + +/* Ensure link hover color matches the accent in both themes. */ +.md-typeset a:hover, +.md-typeset a:focus { + color: var(--taxonopy-link-hover-color); +} + +/* Tighten row height in the quick-reference table. */ +.table-cell-scroll table th, +.table-cell-scroll table td { + padding: 0.25rem 0.5rem; + line-height: 1.15; + vertical-align: top; +} + +/* Attempt per-cell horizontal scrolling without extra markup. */ +.table-cell-scroll table { + table-layout: fixed; + width: 100%; +} + +.table-cell-scroll th, +.table-cell-scroll td { + max-width: 16ch; + overflow-x: auto; + white-space: nowrap; +} + +.table-cell-scroll th::-webkit-scrollbar, +.table-cell-scroll td::-webkit-scrollbar { + height: 6px; +} + +figure.table-caption > figcaption { + margin: 0 0 0.45rem; + text-align: left; + font-style: normal; + font-size: 0.95em; + color: var(--md-default-fg-color--light); +} + +/* Highlight resolved output differences vs input. */ +.cell-added, +.cell-changed { + display: inline-block; + padding: 0 0.2rem; + border-radius: 0.2rem; +} + +.cell-added { + background: var(--taxonopy-cell-added-bg, rgba(46, 122, 69, 0.18)); +} + +.cell-changed { + background: var(--taxonopy-cell-changed-bg, rgba(255, 193, 7, 0.22)); +} diff --git a/docs/user-guide/cache.md b/docs/user-guide/cache.md new file mode 100644 index 0000000..91a8e80 --- /dev/null +++ b/docs/user-guide/cache.md @@ -0,0 +1,36 @@ +# Cache + +TaxonoPy caches intermediate results (like parsed inputs and grouped entries) to +speed up repeated runs on the same dataset. 
+ +## Location + +By default, the cache lives under: + +- `~/.cache/taxonopy` + +You can override this with: + +- `TAXONOPY_CACHE_DIR` environment variable, or +- `--cache-dir` CLI flag. + +## Namespaces and Reproducibility + +Each `resolve` run uses a cache namespace derived from: + +- the command name, +- the TaxonoPy version, and +- a fingerprint of the input files (paths + size + modified time). + +This keeps caches isolated across datasets and releases. + +## Useful CLI Flags + +- `--show-cache-path` — print the active cache directory and exit. +- `--cache-stats` — show cache statistics and exit. +- `--clear-cache` — remove cached objects. +- `--refresh-cache` (resolve only) — ignore cached parse/group results. +- `--full-rerun` (resolve only) — clears cache for the input and overwrites outputs. + +If you change input files or want to force a clean run, use `--refresh-cache` or +`--full-rerun`. diff --git a/docs/user-guide/installation.md b/docs/user-guide/installation.md new file mode 100644 index 0000000..d3b2181 --- /dev/null +++ b/docs/user-guide/installation.md @@ -0,0 +1,26 @@ +# Installation + +TaxonoPy can be installed with pip after setting up a virtual environment. + +```console +pip install taxonopy +``` + +## GNVerifier Dependency + +TaxonoPy relies on the GNVerifier CLI to resolve taxonomic names. When you run +`taxonopy resolve`, it will automatically try the following: + +1. **Docker (recommended).** If Docker is available, TaxonoPy checks for the + configured GNVerifier image (default: `gnames/gnverifier:v1.2.5`) and pulls it + if needed. The first run may take a bit longer while the image downloads. + See [gnames/gnverifier on Docker Hub](https://hub.docker.com/r/gnames/gnverifier). +2. **Local GNVerifier.** If Docker is unavailable or the image pull fails, + TaxonoPy looks for a local `gnverifier` binary on your `PATH`. The version + used will be whatever is installed on your system, which may differ from the + pinned container version. 
For install instructions, see the GNVerifier README: + [gnverifier installation](https://github.com/gnames/gnverifier?tab=readme-ov-file#installation). + +If neither Docker nor a local GNVerifier is available, TaxonoPy will raise an +error when you attempt to resolve names. In that case, install Docker or install +GNVerifier locally and ensure the `gnverifier` command is available. diff --git a/docs/user-guide/io/index.md b/docs/user-guide/io/index.md new file mode 100644 index 0000000..9600f4f --- /dev/null +++ b/docs/user-guide/io/index.md @@ -0,0 +1,9 @@ +# IO + +TaxonoPy accepts CSV or Parquet inputs with the same schema. Use the pages below +for the exact input columns, the structure of resolved/unsolved outputs, and how +the cache supports provenance and transparency throughout the resolution process. + +- [Input](input.md) +- [Output](output.md) +- [Cache](../cache.md) diff --git a/docs/user-guide/io/input.md b/docs/user-guide/io/input.md new file mode 100644 index 0000000..13b60eb --- /dev/null +++ b/docs/user-guide/io/input.md @@ -0,0 +1,14 @@ +# Input + +Provide a file or directory containing 7-rank Linnaean taxonomic metadata. +Inputs may be CSV or Parquet. Expected columns include: + +- **uuid**: a unique identifier for each sample (required). +- **kingdom, phylum, class, order, family, genus, species**: the taxonomic ranks of the organism (required, may have sparsity). +- **scientific_name**: the scientific name to the most specific rank available (optional). +- **common_name**: the common (i.e. vernacular) name of the organism (optional). 
+ +## Example Files + +- [sample.parquet](https://raw.githubusercontent.com/Imageomics/TaxonoPy/main/examples/input/sample.parquet) +- [sample.csv](https://raw.githubusercontent.com/Imageomics/TaxonoPy/main/examples/input/sample.csv) diff --git a/docs/user-guide/io/output.md b/docs/user-guide/io/output.md new file mode 100644 index 0000000..614e5e8 --- /dev/null +++ b/docs/user-guide/io/output.md @@ -0,0 +1,21 @@ +# Output + +When you run `taxonopy resolve`, TaxonoPy writes two outputs for each input file: + +- **Resolved**: `.resolved.` +- **Unsolved**: `.unsolved.` + +The output directory mirrors the input directory structure. Output format is +controlled by the `--output-format` flag (`csv` or `parquet`). + +## What’s Inside + +Each output row corresponds to one input record. Resolved entries contain the +standardized taxonomy where available, while unsolved entries preserve the +original input ranks. Both outputs include resolution metadata such as status +and strategy information. + +## Example Files + +- `examples/resolved/sample.resolved.parquet` (generated with `taxonopy resolve`) +- `examples/resolved_with_common_names/sample.resolved.parquet` (generated with `taxonopy common-names`) diff --git a/docs/user-guide/quick-reference.md b/docs/user-guide/quick-reference.md new file mode 100644 index 0000000..e28e095 --- /dev/null +++ b/docs/user-guide/quick-reference.md @@ -0,0 +1,137 @@ +# Quick Reference + +## Install + +```console +pip install taxonopy +``` + +For detailed setup instructions including GNVerifier and troubleshooting, see [Installation](installation.md). + +## Sample Input + +Download the same sample dataset in either format and place it in `examples/input/`: + +- [sample.parquet](https://raw.githubusercontent.com/Imageomics/TaxonoPy/main/examples/input/sample.parquet) +- [sample.csv](https://raw.githubusercontent.com/Imageomics/TaxonoPy/main/examples/input/sample.csv) + +Sample contents: + +
+ +| uuid | kingdom | phylum | class | order | family | genus | species | scientific_name | +| --- | --- | --- | --- | --- | --- | --- | --- | --- | +| bc2a3f9f-c1f9-48df-9b01-d045475b9d5f | Metazoa | Chordata | Mammalia | Primates | Hominidae | Homo | Homo sapiens | Homo sapiens | +| 21ed76d8-9a3b-406e-a1a3-ef244422bf8e | Plantae | Tracheophyta | `null` | Fagales | Fagaceae | Quercus | Quercus alba | Quercus alba | +| 4d166a61-b6e5-4709-91ba-b623111014e9 | Animalia | `null` | `null` | Hymenoptera | Apidae | Apis | Apis mellifera | Apis mellifera | +| 85b96dc2-70ab-446e-afb5-6a4b92b0a450 | `null` | `null` | `null` | `null` | `null` | `null` | Amanita muscaria | `null` | +| 38327554-ffbf-4180-b4cf-63c311a26f4e | Animalia | `null` | `null` | `null` | `null` | `null` | Laelia rosea | `null` | +| 8f688a17-1f7a-42b2-b3dc-bd4c8fc0eee3 | Plantae | `null` | `null` | `null` | `null` | `null` | Laelia rosea | `null` | +| a95f3e29-ed48-41f4-9577-64d4243a0396 | `null` | `null` | `null` | `null` | `null` | `null` | `null` | `null` | + +
+ +In the final example entry, there is no available taxonomic data, which can happen in large datasets where there may be a corresponding image but incomplete annotation. + +## Execute a Basic Resolution + +```console +taxonopy resolve --input examples/input --output-dir examples/output +``` + +!!! note "Input values" + There are three kinds of values you can pass to `--input`: + + - A single file path (CSV or Parquet). + - A flat directory of partitioned files (TaxonoPy will glob everything inside). + - A directory tree (TaxonoPy will glob recursively and preserve the folder structure in the output). + + In all three cases, the base filename is preserved in the output. That is, the output keeps the original filename(s) and adds `.resolved` / `.unsolved` before the extension. + + If you download both `sample.csv` and `sample.parquet` into `examples/input/`, resolve will fail due to mixed input formats; keep only one format per input directory. + +The command above will read in the sample data from `examples/input/`, execute resolution, and write the results to `examples/output/`. + +By default, outputs are written in Parquet format, whether the input is CSV or Parquet. To set the output format to CSV, use the `--output-format csv` flag. + +The output files consist of: + +- `sample.resolved.parquet` +- `sample.unsolved.parquet` +- `resolution_stats.json` + +The `sample.resolved.parquet` file contains all the entries where some resolution strategy was applied. In this example, it contains: + +
+ +Green highlights show values added during resolution. Yellow highlights indicate values that changed from the input. + +| uuid | kingdom | phylum | class | order | family | genus | species | scientific_name | common_name | +| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | +| bc2a3f9f-c1f9-48df-9b01-d045475b9d5f | Animalia | Chordata | Mammalia | Primates | Hominidae | Homo | Homo sapiens | Homo sapiens | `null` | +| 21ed76d8-9a3b-406e-a1a3-ef244422bf8e | Plantae | Tracheophyta | Magnoliopsida | Fagales | Fagaceae | Quercus | Quercus alba | Quercus alba | `null` | +| 4d166a61-b6e5-4709-91ba-b623111014e9 | Animalia | Arthropoda | Insecta | Hymenoptera | Apidae | Apis | Apis mellifera | Apis mellifera | `null` | +| 85b96dc2-70ab-446e-afb5-6a4b92b0a450 | Fungi | Basidiomycota | Agaricomycetes | Agaricales | Amanitaceae | Amanita | Amanita muscaria | `null` | `null` | +| 38327554-ffbf-4180-b4cf-63c311a26f4e | Animalia | Arthropoda | Insecta | Lepidoptera | Erebidae | Laelia | Laelia rosea | `null` | `null` | +| 8f688a17-1f7a-42b2-b3dc-bd4c8fc0eee3 | Plantae | Tracheophyta | Liliopsida | Asparagales | Orchidaceae | Laelia | Laelia rosea | `null` | `null` | + +
+ +/// table-caption +Sample resolved output (selected columns) +/// + +The `sample.unsolved.parquet` file contains entries that could not be resolved (for example, rows with no usable taxonomy information). In this example, it contains: + +
+ +| uuid | kingdom | phylum | class | order | family | genus | species | scientific_name | common_name | +| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | +| a95f3e29-ed48-41f4-9577-64d4243a0396 | `null` | `null` | `null` | `null` | `null` | `null` | `null` | `null` | `null` | + +
+ +/// table-caption +Sample unsolved output (selected columns) +/// + +The `resolution_stats.json` file summarizes counts of how many entries from the input fell into each final status across the `resolved` and `unsolved` files. + +TaxonoPy also writes cache data to disk (default: `~/.cache/taxonopy`) so it can trace provenance and avoid reprocessing. Use `--show-cache-path`, `--cache-stats`, or `--clear-cache` if you want to inspect or manage it, or see the [Cache](cache.md) guide for details. + +## Trace an Entry + +You can trace how a single UUID was resolved. For example, let's trace one of the _Laelia rosea_ entries: + +```console +taxonopy trace entry --uuid 8f688a17-1f7a-42b2-b3dc-bd4c8fc0eee3 --from-input examples/input/sample.csv +``` + +TaxonoPy uses whatever rank context you provide (even if sparse) to disambiguate identical names. _Laelia rosea_ is a hemihomonym: it resolves differently under Animalia vs. Plantae. If the higher ranks were missing, TaxonoPy would not be able to disambiguate the two entries. 
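The role of rank context can be illustrated with a small, self-contained Python sketch. It does not call TaxonoPy or GNVerifier; the helper names `group_by_name` and `names_needing_context` are hypothetical, and the inline rows are copied from the sample input (an empty string stands for a null kingdom). It only shows why the kingdom value is needed to tell the two _Laelia rosea_ entries apart.

```python
from collections import defaultdict

# (uuid, kingdom, species) rows copied from the sample input; "" means null.
SAMPLE_ROWS = [
    ("85b96dc2-70ab-446e-afb5-6a4b92b0a450", "", "Amanita muscaria"),
    ("38327554-ffbf-4180-b4cf-63c311a26f4e", "Animalia", "Laelia rosea"),
    ("8f688a17-1f7a-42b2-b3dc-bd4c8fc0eee3", "Plantae", "Laelia rosea"),
]


def group_by_name(rows):
    """Hypothetical helper: map each species name to the kingdoms seen with it."""
    kingdoms = defaultdict(set)
    for _uuid, kingdom, species in rows:
        seen = kingdoms[species]  # registers the name even when kingdom is null
        if kingdom:
            seen.add(kingdom)
    return dict(kingdoms)


def names_needing_context(rows):
    """Names seen under more than one kingdom cannot be told apart by name alone."""
    return sorted(
        name for name, seen in group_by_name(rows).items() if len(seen) > 1
    )


print(names_needing_context(SAMPLE_ROWS))
```

In this sketch only _Laelia rosea_ is flagged, because it is the only name in the sample that appears under two different kingdoms.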
+ +Excerpt (incomplete) from the trace output: + +```json +{ + "query_plan": { + "term": "Laelia rosea", + "rank": "species", + "source_id": 11 + }, + "resolution_attempts": [ + { + "status": "EXACT_MATCH_PRIMARY_SOURCE_ACCEPTED_INNER_RANK_DISAMBIGUATION", + "resolution_strategy_name": "ExactMatchPrimarySourceAcceptedInnerRankDisambiguation", + "resolved_classification": { + "kingdom": "Plantae", + "phylum": "Tracheophyta", + "class_": "Liliopsida", + "order": "Asparagales", + "family": "Orchidaceae", + "genus": "Laelia", + "species": "Laelia rosea" + } + } + ] +} +``` diff --git a/mkdocs.yml b/mkdocs.yml index 499ae04..0214979 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -1,24 +1,54 @@ -site_name: TaxonoPy +site_name: TaxonoPy User Guide site_description: Documentation for the TaxonoPy package repo_url: https://github.com/Imageomics/taxonopy repo_name: Imageomics/taxonopy nav: - - Home: index.md - - Command Line Usage: - - Tutorial: command_line_usage/tutorial.md - - Help: command_line_usage/help.md - - Acknowledgements: acknowledgements.md + - TaxonoPy: + - User guide: index.md + - Quick reference: user-guide/quick-reference.md + - CLI help reference: command_line_usage/help.md + - Installation: user-guide/installation.md + - IO: + - user-guide/io/index.md + - Input: user-guide/io/input.md + - Output: user-guide/io/output.md + - Cache: user-guide/cache.md + - Development: + - Contributing: + - development/contributing/index.md + - Acknowledgements: acknowledgements.md -theme: +theme: name: material + logo: _assets/taxonopy_logo.svg + favicon: _assets/taxonopy_logo.svg + palette: + - media: "(prefers-color-scheme: light)" + scheme: default + primary: green + accent: blue + toggle: + icon: material/brightness-7 + name: Switch to dark mode + - media: "(prefers-color-scheme: dark)" + scheme: slate + primary: green + accent: blue + toggle: + icon: material/brightness-4 + name: Switch to light mode features: - navigation.tabs - navigation.tabs.sticky + - 
navigation.indexes - content.code.copy - content.code.annotate +extra_css: +- stylesheets/extra.css + plugins: - search - mkdocstrings: @@ -28,10 +58,23 @@ plugins: options: docstring_style: google merge_init_into_class: true +- gen-files: + scripts: + # Generates command_line_usage/help.md from argparse during build. + - docs/_scripts/gen_cli_help_docs.py markdown_extensions: - admonition - attr_list - md_in_html - - pymdownx.blocks.caption + - pymdownx.blocks.caption: + types: + - name: figure-caption + prefix: "" + - name: table-caption + prefix: "" + classes: "table-caption" + auto: false + prepend: true - pymdownx.details + - pymdownx.highlight - pymdownx.superfences From 38c0faa61ae74cd593e786e963ef133d6d3e7248 Mon Sep 17 00:00:00 2001 From: Matthew Thompson Date: Wed, 28 Jan 2026 18:42:08 -0500 Subject: [PATCH 06/10] Update sample input datasets [AI-assisted session] --- examples/input/sample.csv | 8 ++++++++ examples/input/sample.parquet | Bin 3968 -> 5892 bytes 2 files changed, 8 insertions(+) create mode 100644 examples/input/sample.csv diff --git a/examples/input/sample.csv b/examples/input/sample.csv new file mode 100644 index 0000000..9fc00ab --- /dev/null +++ b/examples/input/sample.csv @@ -0,0 +1,8 @@ +uuid,kingdom,phylum,class,order,family,genus,species,scientific_name +bc2a3f9f-c1f9-48df-9b01-d045475b9d5f,Metazoa,Chordata,Mammalia,Primates,Hominidae,Homo,Homo sapiens,Homo sapiens +21ed76d8-9a3b-406e-a1a3-ef244422bf8e,Plantae,Tracheophyta,,Fagales,Fagaceae,Quercus,Quercus alba,Quercus alba +4d166a61-b6e5-4709-91ba-b623111014e9,Animalia,,,Hymenoptera,Apidae,Apis,Apis mellifera,Apis mellifera +85b96dc2-70ab-446e-afb5-6a4b92b0a450,,,,,,,Amanita muscaria, +38327554-ffbf-4180-b4cf-63c311a26f4e,Animalia,,,,,,Laelia rosea, +8f688a17-1f7a-42b2-b3dc-bd4c8fc0eee3,Plantae,,,,,,Laelia rosea, +a95f3e29-ed48-41f4-9577-64d4243a0396,,,,,,,, diff --git a/examples/input/sample.parquet b/examples/input/sample.parquet index 
23696633ea47bc4148746fcdc1350a80c24b5748..c9eda3a12e29f1edbfc439f3d351d2a14932c297 100644 GIT binary patch literal 5892 (binary data omitted) literal 3968 (binary data omitted) From c8da99052d6aa05cdcd0af1f5184920e0c10e46a Mon Sep 17 00:00:00 2001 From: Matthew Thompson Date: Wed, 28 Jan 2026 18:42:38 -0500 Subject: [PATCH 07/10] Add mkdocs-gen-files to docs extras [AI-assisted session] --- pyproject.toml | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/pyproject.toml b/pyproject.toml index aee68b4..5e57402 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -57,7 +57,8 @@ docs = [ "mkdocs", "mkdocs-material", "mkdocs-material-extensions", - "mkdocstrings-python" + "mkdocstrings-python", + "mkdocs-gen-files" ] [project.urls] From e76b1be790f0a90984ad3d0932fc7662c61951a8 Mon Sep 17 00:00:00 2001 From: Matthew Thompson Date: Fri, 30 Jan 2026 13:01:01 -0500 
Subject: [PATCH 08/10] Add docs page and PR preview deployment workflow --- .github/workflows/deploy-docs.yaml | 103 +++++++++++++++++++++++++++++ 1 file changed, 103 insertions(+) create mode 100644 .github/workflows/deploy-docs.yaml diff --git a/.github/workflows/deploy-docs.yaml b/.github/workflows/deploy-docs.yaml new file mode 100644 index 0000000..3e4da89 --- /dev/null +++ b/.github/workflows/deploy-docs.yaml @@ -0,0 +1,103 @@ +name: Build & Deploy MkDocs (gh-pages with PR previews) + +on: + workflow_dispatch: + pull_request: + branches: [ main ] + types: [opened, synchronize, reopened, closed] + push: + branches: [ main ] + +permissions: + contents: write + pages: write + +jobs: + build: + # Run for push, workflow dispatch, PRs from SAME repo that are not closed + if: | + github.event_name == 'push' || + github.event_name == 'workflow_dispatch' || + (github.event_name == 'pull_request' && + github.event.pull_request.head.repo.fork == false && + github.event.action != 'closed') + runs-on: ubuntu-latest + concurrency: + group: ${{ github.workflow }}-${{ github.ref }} + cancel-in-progress: true + steps: + - uses: actions/checkout@v4 + with: + fetch-depth: 0 + - uses: actions/setup-python@v5 + with: + python-version: "3.11" + - name: Install deps + run: | + python -m pip install --upgrade pip + pip install '.[docs]' + - name: Build with MkDocs + run: mkdocs build + - name: Upload built site as artifact + uses: actions/upload-artifact@v4 + with: + name: site + path: ./site + + deploy: + needs: build + # Deploy on push to main (root) or PRs from SAME repo (not closed) -> pr-/ + if: | + github.event_name == 'push' || + (github.event_name == 'pull_request' && + github.event.pull_request.head.repo.fork == false && + github.event.action != 'closed') + runs-on: ubuntu-latest + concurrency: + group: ${{ github.workflow }}-${{ github.ref }} + cancel-in-progress: true + steps: + - name: Download built site + uses: actions/download-artifact@v4 + with: + name: site + path: 
./site + - name: Deploy to gh-pages + uses: peaceiris/actions-gh-pages@v4 + with: + github_token: ${{ secrets.GITHUB_TOKEN }} + publish_branch: gh-pages + publish_dir: ./site + keep_files: true + destination_dir: ${{ github.event_name == 'pull_request' && format('pr-{0}', github.event.number) || '' }} + + cleanup: + # Only when a same-repo PR closes + if: > + github.event_name == 'pull_request' && + github.event.pull_request.head.repo.fork == false && + github.event.action == 'closed' + runs-on: ubuntu-latest + steps: + - name: Checkout gh-pages + uses: actions/checkout@v4 + with: + ref: gh-pages + fetch-depth: 0 + - name: Configure git author + run: | + git config user.name "github-actions[bot]" + git config user.email "github-actions[bot]@users.noreply.github.com" + - name: Remove preview folder + shell: bash + run: | + set -euo pipefail + PR_DIR="pr-${{ github.event.number }}" + echo "Attempting to remove $PR_DIR" + if [ -d "$PR_DIR" ]; then + git rm -r "$PR_DIR" + git commit -m "Remove preview for PR #${{ github.event.number }}" + git push origin gh-pages + else + echo "No preview folder $PR_DIR found; nothing to do." + fi From 3e5f1340abd85f7fddc7701f9e7464d1da3c7d3d Mon Sep 17 00:00:00 2001 From: Matthew Thompson Date: Fri, 30 Jan 2026 16:48:02 -0500 Subject: [PATCH 09/10] Wordsmith landing page --- docs/index.md | 43 ++++++++++++++++++------------------------- 1 file changed, 18 insertions(+), 25 deletions(-) diff --git a/docs/index.md b/docs/index.md index 10a039e..c4468ac 100644 --- a/docs/index.md +++ b/docs/index.md @@ -8,7 +8,7 @@ hide: ![TaxonoPy banner](_assets/taxonopy_banner.svg) -

Cleanly Aligned Biodiversity Taxonomy

+

Reproducibly Aligned Biological Taxonomies

-Welcome! This is the initial MkDocs site for the TaxonoPy project. - -TaxonoPy (taxon-o-py) is a command-line tool for creating an internally consistent 7-rank Linnaean taxonomic hierarchy using the [Global Names Verifier (gnverifier)](https://github.com/gnames/gnverifier). -It does not define its own authority; instead it leans on trusted sources indexed by GNVerifier, such as the Catalogue of Life and the GBIF Backbone Taxonomy. See the full list of [GNVerifier data sources](https://verifier.globalnames.org/data_sources). - -Support for flexible source selection is still evolving. Today, TaxonoPy ships with a pinned default GNVerifier source configuration (currently GBIF Backbone Taxonomy, source 11), while additional sources remain available through GNVerifier. +TaxonoPy (taxon-o-py) is a command-line tool for creating reproducibly aligned biological taxonomies using the [Global Names Verifier (gnverifier)](https://github.com/gnames/gnverifier). ## Package Purpose -TaxonoPy helps build a single, internally consistent classification across large biodiversity datasets assembled from multiple providers, each of which may use overlapping but non‑identical taxonomic hierarchies. The goal is AI-ready biodiversity data with clean, aligned taxonomy. +TaxonoPy aligns data to a single, internally consistent 7-rank Linnaean taxonomic hierarchy across large biodiversity datasets assembled from multiple providers, each of which may use overlapping but nonuniform taxonomies. The goal is AI-ready biodiversity data with clean, aligned taxonomy. -Its development has been driven by its application in the TreeOfLife-200M (TOL) dataset. This dataset contains over 200 million samples of organisms from four core data providers: +Its development has been driven by its application in the [TreeOfLife-200M dataset](https://huggingface.co/datasets/imageomics/TreeOfLife-200M). 
This dataset contains over 200 million labeled images of organisms from four core data providers: - [The Global Biodiversity Information Facility (GBIF)](https://www.gbif.org/) - [BIOSCAN-5M](https://biodiversitygenomics.net/projects/5m-insects/) @@ -40,13 +35,8 @@ Its development has been driven by its application in the TreeOfLife-200M (TOL) Across these resources, taxon names and classifications often conflict. TaxonoPy resolves those differences into a coherent, standardized taxonomy for the combined dataset. -!!! warning - TaxonoPy does not guarantee perfect alignment or edge case coverage; it is a progressive effort to improve taxonomic coverage in an evolving landscape. - If you have suggestions or encounter bugs, please see the [Contributing](development/contributing/index.md) page. - - ## Challenges -This taxonomy information is provided by each data provider and original sources, but the classification can be: +The taxonomy information is provided by each data provider and original sources, but the classification can be: - **Inconsistent** — between and within sources (e.g., kingdom *Metazoa* vs. *Animalia*) - **Incomplete** — missing ranks or containing "holes" @@ -60,16 +50,19 @@ Taxonomic authorities exist to standardize classification, but: - A given organism may be missing from some ## Solution -TaxonoPy uses the taxonomic hierarchies provided by the TOL core data providers to query GNVerifier and create a standardized classification for each sample in the TOL dataset. It prioritizes the [GBIF Backbone Taxonomy](https://verifier.globalnames.org/data_sources/11), since this represents the largest part of the TOL dataset. Where GBIF misses, backup sources such as the [Catalogue of Life](https://verifier.globalnames.org/data_sources/1) and [Open Tree of Life (OTOL) Reference Taxonomy](https://verifier.globalnames.org/data_sources/179) are used. 
+TaxonoPy uses the taxonomic lineages provided by diverse sources to submit batched queries to GNVerifier and resolve to a standardized classification path for each sample in the dataset. It is currently configured to prioritize alignment to the [GBIF Backbone Taxonomy](https://verifier.globalnames.org/data_sources/11). Where GBIF misses, backup sources of the [Catalogue of Life](https://verifier.globalnames.org/data_sources/1) and [Open Tree of Life (OTOL) Reference Taxonomy](https://verifier.globalnames.org/data_sources/179) are used. -## Installation -TaxonoPy can be installed with pip after setting up a virtual environment. +## Getting Started +To get started with TaxonoPy, see the [Quick Reference](user-guide/quick-reference.md) guide. -### User Installation with pip -To install the latest version of TaxonoPy, run: - -``` bash - -pip install taxonopy +--- -``` +!!! warning + Taxonomic classifications are human-constructed models of biological diversity, not direct representations of biological reality. + Names and ranks reflect taxonomic concepts that may vary between authorities, evolve over time, and differ in scope or interpretation. + + TaxonoPy aims to produce a **consistent, transparent, and fit-for-purpose classification** suitable for large-scale data integration and AI workflows. + It prioritizes internal coherence and interoperability across datasets and providers by aligning source data to a selected reference taxonomy. + + It is a progressive effort to improve taxonomic alignment in an evolving landscape. + If you have suggestions or encounter bugs, please see the [Contributing](development/contributing/index.md) page. 
\ No newline at end of file From 3ad259f34967001b1ba356f27a478c6f505e4d91 Mon Sep 17 00:00:00 2001 From: Matthew Thompson Date: Fri, 30 Jan 2026 17:15:31 -0500 Subject: [PATCH 10/10] =?UTF-8?q?Add=20tooltip=20explaning=20Metazoa?= =?UTF-8?q?=E2=86=92Animalia=20resolution?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- docs/user-guide/quick-reference.md | 21 +++++++-------------- mkdocs.yml | 1 + 2 files changed, 8 insertions(+), 14 deletions(-) diff --git a/docs/user-guide/quick-reference.md b/docs/user-guide/quick-reference.md index e28e095..1f504e4 100644 --- a/docs/user-guide/quick-reference.md +++ b/docs/user-guide/quick-reference.md @@ -15,8 +15,7 @@ Download the same sample dataset in either format and place it in `examples/inpu - [sample.parquet](https://raw.githubusercontent.com/Imageomics/TaxonoPy/main/examples/input/sample.parquet) - [sample.csv](https://raw.githubusercontent.com/Imageomics/TaxonoPy/main/examples/input/sample.csv) -Sample contents: - +_**Sample input**: Note the divergence in kingdoms (Metazoa vs Animalia), missing interior ranks, and fully null entry._
| uuid | kingdom | phylum | class | order | family | genus | species | scientific_name | @@ -31,7 +30,7 @@ Sample contents:
-In the final example entry, there is no available taxonomic data, which can happen in large datasets where there maybe a corresponding image but incomplete annotation. +In the final example entry, there is no available taxonomic data, which can happen in large datasets where there may be a corresponding image but incomplete annotation. ## Execute a Basic Resolution @@ -62,13 +61,13 @@ The output files consist of: The `sample.resolved.parquet` file contains all the entries where some resolution strategy was applied. In this example, it contains: -
-Green highlights show values added during resolution. Yellow highlights indicate values that changed from the input. +_**Sample resolved output (selected columns)**: Green highlights show values added during resolution. Yellow highlights indicate values that changed from the input._ +
| uuid | kingdom | phylum | class | order | family | genus | species | scientific_name | common_name | | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | -| bc2a3f9f-c1f9-48df-9b01-d045475b9d5f | Animalia | Chordata | Mammalia | Primates | Hominidae | Homo | Homo sapiens | Homo sapiens | `null` | +| bc2a3f9f-c1f9-48df-9b01-d045475b9d5f | Animalia[?](https://verifier.globalnames.org/?all_matches=on&capitalize=on&ds=11&format=html&names=Homo+sapiens "The input lineage here mirrors what the Encyclopedia of Life provides (Metazoa as the clade that maps to the kingdom rank); when queried against GNVerifier, this rank maps to Animalia. Click to see the GNVerifier result.") | Chordata | Mammalia | Primates | Hominidae | Homo | Homo sapiens | Homo sapiens | `null` | | 21ed76d8-9a3b-406e-a1a3-ef244422bf8e | Plantae | Tracheophyta | Magnoliopsida | Fagales | Fagaceae | Quercus | Quercus alba | Quercus alba | `null` | | 4d166a61-b6e5-4709-91ba-b623111014e9 | Animalia | Arthropoda | Insecta | Hymenoptera | Apidae | Apis | Apis mellifera | Apis mellifera | `null` | | 85b96dc2-70ab-446e-afb5-6a4b92b0a450 | Fungi | Basidiomycota | Agaricomycetes | Agaricales | Amanitaceae | Amanita | Amanita muscaria | `null` | `null` | @@ -77,12 +76,10 @@ Green highlights show values added during resolution. Yellow highlights indicate
-/// table-caption -Sample resolved output (selected columns) -/// - The `sample.unsolved.parquet` file contains entries that could not be resolved (for example, rows with no usable taxonomy information). In this example, it contains: + +_**Sample unsolved output: Sequestered entries with no usable taxonomy information.**_
| uuid | kingdom | phylum | class | order | family | genus | species | scientific_name | common_name | @@ -91,10 +88,6 @@ The `sample.unsolved.parquet` file contains entries that could not be resolved (
-/// table-caption -Sample unsolved output (selected columns) -/// - The `resolution_stats.json` file summarizes counts of how many entries from the input fell into each final status across the `resolved` and `unsolved` files. TaxonoPy also writes cache data to disk (default: `~/.cache/taxonopy`) so it can trace provenance and avoid reprocessing. Use `--show-cache-path`, `--cache-stats`, or `--clear-cache` if you want to inspect or manage it, or see the [Cache](cache.md) guide for details. diff --git a/mkdocs.yml b/mkdocs.yml index 0214979..52ea42c 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -45,6 +45,7 @@ theme: - navigation.indexes - content.code.copy - content.code.annotate + - content.tooltips extra_css: - stylesheets/extra.css