docs: document year=-1/month=-1 sentinel values for lab cross samples #1104

Open
khushthecoder wants to merge 1 commit into malariagen:master from khushthecoder:GH1092-document-year-sentinel

Conversation

@khushthecoder

docs: document year=-1/month=-1 sentinel values for lab cross samples

Closes #1092

Summary

Add a notes section to the sample_metadata() docstring explaining the year=-1 and month=-1 sentinel convention for lab cross samples (mosquitoes bred in the laboratory with no real collection date), along with a code example showing how to filter them out.

This is a documentation-only change — no runtime behavior is modified.

Background

Lab cross samples in the dataset use year=-1 and month=-1 as sentinel values meaning "no real collection date exists." This convention is internally consistent (e.g., quarter is also set to -1 when month == -1, and plot_samples_bar() filters year > 0), but it is not documented in the public API. A user encountering the data for the first time cannot tell from the Python API alone that -1 is a sentinel value.

This can lead to:

  • ValueError when attempting date arithmetic (e.g., pd.to_datetime(df["year"].astype(str) + "-01-01"))
  • NaN values in cohort metadata columns for lab crosses (expected, but confusing without context)
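
As a rough illustration of the date-arithmetic pitfall, here is a minimal sketch of building collection dates while skipping the sentinel rows. The DataFrame is invented toy data standing in for the output of sample_metadata(). Only the year and month column names come from this PR, and collection_date is a hypothetical column added for the example.

```python
import pandas as pd

# Toy stand-in for sample metadata; the values are invented for illustration.
df = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3"],
    "year": [2012, -1, 2014],
    "month": [3, -1, 11],
})

# Rows with a real collection date have positive year and month.
has_date = (df["year"] > 0) & (df["month"] > 0)

# Build dates only for the rows that have one (first day of the month).
wild = df.loc[has_date].copy()
wild["collection_date"] = pd.to_datetime(wild[["year", "month"]].assign(day=1))

# Carry the dates back to the full table; lab cross rows get NaT.
df["collection_date"] = wild["collection_date"].reindex(df.index)
```

Keeping the lab cross rows with NaT rather than dropping them is one possible choice; it lets later steps decide explicitly whether to include them.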

What Changed

malariagen_data/anoph/sample_metadata.py

Added a notes section to the @doc decorator on sample_metadata():

notes="""
    Some samples in the dataset are lab crosses — mosquitoes bred in
    the laboratory that have no real collection date. These samples
    use ``year=-1`` and ``month=-1`` as sentinel values. They may
    cause unexpected results in date-based analyses (e.g.,
    ``pd.to_datetime`` will fail on negative year values).

    To exclude lab cross samples, use::

        df = api.sample_metadata(sample_query="year >= 0")
""",

What Did NOT Change

  • ❌ No new parameters (users can already filter with sample_query or sample_sets)
  • ❌ No runtime warnings (lab crosses are a known property of the data, not an error)
  • ❌ No changes to _parse_general_metadata(), general_metadata(), or any other method
  • ❌ No changes to plotting functions or caching logic

Design Rationale

Per maintainer feedback, adding an exclude_lab_crosses parameter would be redundant since sample_query="year >= 0" already accomplishes this. A runtime warning would be noisy for experienced researchers. The right fix is documentation — making the sentinel convention discoverable through the API docs.
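
To make the redundancy argument concrete, the sketch below applies the same query string to a toy DataFrame with pandas directly. It assumes sample_query is evaluated in the same way as DataFrame.query, which this PR does not state explicitly. The rows and sample IDs are invented, and only the year, month, and quarter column names come from the description above.

```python
import pandas as pd

# Toy stand-in for the DataFrame returned by sample_metadata().
df = pd.DataFrame({
    "sample_id": ["wild-1", "cross-1", "wild-2"],
    "year": [2009, -1, 2015],
    "month": [7, -1, 2],
    "quarter": [3, -1, 1],
})

# Same expression as the documented sample_query value.
field_samples = df.query("year >= 0")

# The inverse selects only the lab crosses.
lab_crosses = df.query("year < 0")
```

Because the inverse filter is just as short, a dedicated boolean parameter would add little beyond what the query string already expresses.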

Closes malariagen#1092

Add a notes section to the sample_metadata() @doc decorator explaining
that some samples are lab crosses (mosquitoes bred in the laboratory)
that use year=-1 and month=-1 as sentinel values for 'no collection date'.

Include an example showing how to filter them out using sample_query.

This is a documentation-only change — no runtime behavior is modified.

Development

Successfully merging this pull request may close these issues.

year=-1 sentinel value in lab cross samples causes crashes and silent NaN values in sample_metadata() downstream analysis
