
Fix NotFittedError in umap_embedding when fewer than 2 rows have nonzero variance #19

Merged
lappalainenj merged 4 commits into main from copilot/fix-umap-embedding-error on Feb 27, 2026

Conversation


Copilot AI commented Feb 27, 2026

  • Explore codebase and understand the issue
  • Fix umap_embedding in flyvis/analysis/clustering.py to handle the case where fewer than 2 rows have non-zero variance (return early with NaN embedding and None reducer)
  • Update return type annotation to Optional[UMAP]
  • Clarify mask semantics in docstring for the early-return case
  • Fix typo: 'contant' → 'constant'
  • Add unit tests in tests/test_clustering.py for the edge cases (all 3 pass)
  • Code review and security scan passed
Original prompt

This section details the original issue you should resolve

<issue_title>Edge case with umap_embedding</issue_title>
<issue_description>When a matrix has only one row with nonzero variance and all other rows have zero variance, the UMAP model is never fit.
The error is:
NotFittedError: This UMAP instance is not fitted yet.
It is raised on the line:
connected_vertices_mask = ~disconnected_vertices(reducer)
Proposed solution:


```python
from typing import Optional, Tuple

import numpy as np
from umap import UMAP
from umap.utils import disconnected_vertices


def umap_embedding(
    X: np.ndarray,
    n_neighbors: int = 5,
    min_dist: float = 0.12,
    spread: float = 9.0,
    random_state: int = 42,
    n_components: int = 2,
    metric: str = "correlation",
    n_epochs: int = 1500,
    **kwargs,
) -> Tuple[np.ndarray, np.ndarray, Optional[UMAP]]:
    """
    Perform UMAP embedding on input data.

    Args:
        X: Input data with shape (n_samples, n_features).
        n_neighbors: Number of neighbors to consider for each point.
        min_dist: Minimum distance between points in the embedding space.
        spread: Determines how spread out all embedded points are overall.
        random_state: Random seed for reproducibility.
        n_components: Number of dimensions in the embedding space.
        metric: Distance metric to use.
        n_epochs: Number of training epochs for embedding optimization.
        **kwargs: Additional keyword arguments for UMAP.

    Returns:
        A tuple containing:
        - embedding: The UMAP embedding (n_samples, n_components). May be NaN if insufficient data.
        - mask: Boolean mask (length n_samples) showing which rows had nonzero variance and were connected.
        - reducer: The fitted UMAP object or None if insufficient data.

    Raises:
        ValueError: If n_components is too large relative to sample size.

    Note:
        This function handles reshaping of input data and removes constant rows.
    """
    if n_components > X.shape[0] - 2:
        raise ValueError(
            "number of components must be 2 smaller than sample size. "
            "See: https://github.com/lmcinnes/umap/issues/201"
        )

    if len(X.shape) > 2:
        # Flatten (n_samples, n_features_1, ...) → (n_samples, n_features)
        X = X.reshape(X.shape[0], -1)

    # Prepare an output array of NaNs.
    n_samples = X.shape[0]
    embedding = np.full((n_samples, n_components), np.nan)

    # Mask out rows that have zero (or near-zero) variance.
    mask = ~np.isclose(X.std(axis=1), 0)
    X_nonconst = X[mask]

    # If fewer than 2 rows remain, skip UMAP and return embedding of NaNs.
    if X_nonconst.shape[0] < 2:
        return embedding, mask, None

    # Fit UMAP
    reducer = UMAP(
        n_neighbors=n_neighbors,
        min_dist=min_dist,
        random_state=random_state,
        n_components=n_components,
        metric=metric,
        spread=spread,
        n_epochs=n_epochs,
        **kwargs,
    )
    _embedding = reducer.fit_transform(X_nonconst)

    # Remove any “disconnected” vertices UMAP couldn’t place
    # (e.g. if the graph is disjoint).
    connected_vertices_mask = ~disconnected_vertices(reducer)

    # Incorporate the connected-vertices mask into our existing mask.
    mask[mask] = mask[mask] & connected_vertices_mask

    # Place the valid embeddings back into the final array.
    embedding[mask] = _embedding[connected_vertices_mask]

    return embedding, mask, reducer
```
</issue_description>

## Comments on the Issue (you are @copilot in this section)

<comments>
<comment_new><author>@lappalainenj</author><body>
Thanks for reporting this! Would you mind creating a PR and adding a little unit test for this case?</body></comment_new>
</comments>



…tant rows exist

Co-authored-by: lappalainenj <34949352+lappalainenj@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix edge case in UMAP embedding function Fix NotFittedError in umap_embedding when fewer than 2 rows have nonzero variance Feb 27, 2026
@lappalainenj lappalainenj requested a review from Copilot February 27, 2026 17:40

Copilot AI left a comment

Pull request overview

Fixes an edge case in umap_embedding where UMAP was never fit (due to too few non-constant rows), but downstream logic still assumed a fitted reducer—previously resulting in NotFittedError.

Changes:

  • Add an early-return guard in umap_embedding to return an all-NaN embedding and reducer=None when fewer than 2 non-constant rows remain.
  • Update umap_embedding docstring to document the reducer=None / all-NaN embedding behavior.
  • Add unit tests covering all-constant input and single non-constant-row input.
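
The guard described in the bullets above can be sketched without UMAP itself. This is only an illustration, not the code merged in this PR: `split_nonconstant_rows` and its `can_fit` flag are hypothetical names, and the sketch assumes the same `np.isclose` variance check as the issue's proposed solution.

```python
import numpy as np


def split_nonconstant_rows(X, n_components=2):
    """Hypothetical helper: build the NaN output and the variance mask."""
    X = np.asarray(X, dtype=float)
    if X.ndim > 2:
        # Flatten (n_samples, d1, d2, ...) to (n_samples, n_features).
        X = X.reshape(X.shape[0], -1)

    # Output starts as all NaN so unplaced rows stay NaN.
    embedding = np.full((X.shape[0], n_components), np.nan)

    # True for rows whose standard deviation is not (near) zero.
    mask = ~np.isclose(X.std(axis=1), 0)

    # Fewer than 2 usable rows means the caller should skip fitting.
    can_fit = int(mask.sum()) >= 2
    return embedding, mask, can_fit


# Five rows, only one of which varies: the case reported in the issue.
X = np.zeros((5, 4))
X[2] = [0.0, 1.0, 2.0, 3.0]
embedding, mask, can_fit = split_nonconstant_rows(X)
```

When `can_fit` is false, a caller following the PR's approach would return the all-NaN `embedding`, the `mask`, and `None` in place of a reducer instead of calling `disconnected_vertices`.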

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| flyvis/analysis/clustering.py | Adds early return for insufficient non-constant rows; updates docstring to describe the new return behavior. |
| tests/test_clustering.py | Adds regression tests for the insufficient-data edge cases. |


Comment on lines 459 to 463
- embedding: The UMAP embedding (n_samples, n_components). May be NaN
if insufficient data.
- mask: Boolean mask (length n_samples) showing which rows had nonzero
variance and were connected.
- reducer: The fitted UMAP object or None if insufficient data.

Copilot AI Feb 27, 2026

The docstring now says reducer may be None, but the function’s return type annotation still indicates a non-optional UMAP. Please update the type hint to reflect the new behavior (e.g., Optional[UMAP]) so static type checking matches runtime behavior. Also consider clarifying the mask semantics in the reducer is None early-return case (it currently reflects non-constant rows, not connectivity).
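
To illustrate why the annotation matters for callers, here is a small sketch of handling an optional reducer. `StandInReducer` and `project_rows` are hypothetical and exist only for this example; the stand-in is not the real `UMAP` class, it merely exposes a `transform` method of the same name.

```python
from typing import List, Optional, Sequence


class StandInReducer:
    """Hypothetical stand-in for a fitted reducer; not the real UMAP class."""

    def transform(self, rows: Sequence[Sequence[float]]) -> List[List[float]]:
        # Map every row to the origin; a real reducer would project it.
        return [[0.0, 0.0] for _ in rows]


def project_rows(
    reducer: Optional[StandInReducer], rows: Sequence[Sequence[float]]
) -> Optional[List[List[float]]]:
    """Project rows only if a reducer was actually fitted."""
    if reducer is None:
        # Early-return case: nothing was fitted, so there is nothing to apply.
        return None
    return reducer.transform(rows)


no_fit = project_rows(None, [[1.0, 2.0]])
fitted = project_rows(StandInReducer(), [[1.0, 2.0], [3.0, 4.0]])
```

With an `Optional[...]` annotation, a static type checker can flag call sites that use the reducer without the `is None` check.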

@copilot apply changes based on this feedback

Done in 512596e. Updated the return type to Optional[UMAP] and expanded the mask docstring entry to clarify the semantics in both the early-return case (only non-constant rows are True) and the normal case (non-constant rows that were also connected in the UMAP graph).
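
The normal-case mask semantics can be shown with plain NumPy. The values below are made up for illustration: `connected` stands in for the result of `~disconnected_vertices(reducer)`, and `_embedding` for the output of `fit_transform` on the non-constant rows.

```python
import numpy as np

# Variance mask over 4 input rows: row 1 is constant.
mask = np.array([True, False, True, True])

# Assumed connectivity result for the 3 rows that were passed to the fit:
# the second of them (input row 2) is treated as disconnected.
connected = np.array([True, False, True])

# Assumed embedding of the 3 fitted rows.
_embedding = np.arange(6, dtype=float).reshape(3, 2)

# Narrow the mask so True means "non-constant AND connected".
mask[mask] = connected

# Place only the connected rows back into the full-size NaN array.
embedding = np.full((4, 2), np.nan)
embedding[mask] = _embedding[connected]
```

After this, `mask` is `[True, False, False, True]`: row 1 is dropped for zero variance and row 2 for being disconnected. In the early-return case the second step never runs, so `mask` reflects variance only.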

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…tics in docstring

Co-authored-by: lappalainenj <34949352+lappalainenj@users.noreply.github.com>
@lappalainenj lappalainenj marked this pull request as ready for review February 27, 2026 20:59
@lappalainenj lappalainenj merged commit 0f59d6c into main Feb 27, 2026
6 checks passed
Development

Successfully merging this pull request may close these issues.

Edge case with umap_embedding

3 participants