Skip to content

Comments

feat: extract brain area anatomy from NWB location fields#1807

Open
bendichter wants to merge 6 commits intomasterfrom
add-brain-area-anatomy
Open

feat: extract brain area anatomy from NWB location fields#1807
bendichter wants to merge 6 commits intomasterfrom
add-brain-area-anatomy

Conversation

@bendichter
Copy link
Member

Summary

Closes #1806

  • Extracts brain location values from NWB files (ImagingPlane.location, electrodes.location, IntracellularElectrode.location)
  • Matches locations against bundled Allen Mouse Brain Atlas CCF structures (~94KB JSON, 1,327 structures)
  • Populates BioSample.anatomy within the wasDerivedFrom field using MBAO identifiers
  • Only applies to mouse species (NCBITaxon_10090); non-mouse species are skipped

New files

  • dandi/data/allen_ccf_structures.json — bundled Allen CCF structures
  • dandi/data/generate_allen_structures.py — regeneration script (python -m dandi.data.generate_allen_structures)
  • dandi/metadata/brain_areas.py — parsing and matching module
  • dandi/tests/test_brain_areas.py — 29 unit tests

Modified files

  • dandi/pynwb_utils.py_get_brain_locations() extracts location strings from NWB
  • dandi/metadata/util.pyextract_wasDerivedFrom() now adds anatomy to deepest BioSample
  • dandi/tests/test_metadata.py — 3 integration tests for anatomy in wasDerivedFrom

Test plan

  • python -m pytest dandi/tests/test_brain_areas.py -v — 29 unit tests pass
  • python -m pytest dandi/tests/test_metadata.py -v -k anatomy — 3 integration tests pass
  • pre-commit run --files <all changed files> — all hooks pass
  • CI passes

🤖 Generated with Claude Code

Extract brain location values from NWB files (ImagingPlane.location,
electrodes.location, IntracellularElectrode.location), match them
against bundled Allen Mouse Brain Atlas CCF structures, and populate
BioSample.anatomy within wasDerivedFrom.

Closes #1806

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
except ValueError:
tokens.append(val)
return tokens
except (ValueError, SyntaxError):

Check notice

Code scanning / CodeQL

Empty except Note

'except' clause does nothing but pass and there is no explanatory comment.

Copilot Autofix

AI 4 days ago

In general, to fix an “empty except” issue, you either (a) narrow the exception scope and add at least logging or a comment explaining why it is safe to ignore, or (b) re-raise or otherwise react to the error instead of silently discarding it. Here the intended behavior is to ignore dict-literal parse failures and let later heuristics run, so the best fix is to keep catching ValueError/SyntaxError but log the failure at a low level (e.g., debug) and/or document it.

Concretely, in dandi/metadata/brain_areas.py around lines 91–115, replace the bare except (ValueError, SyntaxError): pass with an except block that logs the exception using the existing module logger lgr (already defined using get_logger()) and then continues. This does not change the external behavior (the function still just falls through to the subsequent parsing logic) but addresses the CodeQL concern by making the handling explicit and non-empty. No new imports are needed since lgr is already in scope.

Suggested changeset 1
dandi/metadata/brain_areas.py

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/dandi/metadata/brain_areas.py b/dandi/metadata/brain_areas.py
--- a/dandi/metadata/brain_areas.py
+++ b/dandi/metadata/brain_areas.py
@@ -112,7 +112,11 @@
                             tokens.append(val)
                 return tokens
         except (ValueError, SyntaxError):
-            pass
+            lgr.debug(
+                "Failed to parse brain location as dict literal %r; "
+                "falling back to alternative parsing strategies",
+                location,
+            )
 
     # Try key-value format (e.g. "area: VISp, depth: 175")
     if re.search(r"\w+\s*:", location) and "://" not in location:
EOF
@@ -112,7 +112,11 @@
tokens.append(val)
return tokens
except (ValueError, SyntaxError):
pass
lgr.debug(
"Failed to parse brain location as dict literal %r; "
"falling back to alternative parsing strategies",
location,
)

# Try key-value format (e.g. "area: VISp, depth: 175")
if re.search(r"\w+\s*:", location) and "://" not in location:
Copilot is powered by AI and may make mistakes. Always verify output.
Unable to commit as this autofix suggestion is now outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bendichter something should be at least logged here or best to give example on when is it happening -- is that a broken structure from allen (report upstream?) or expected and unhandled explicitly above ? may be it would signal need to just add one more _TRIVIAL_VALUES ?

overall -- unclear

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea here is to try our best to process the variety of different forms of the location string.

I disagree with your reviewer bot- there is a clear explanation on the next line.

Above, Claude is trying to parse the string as a dict literal. If it can't it proceeds to other parsers. It is using try/except here instead of if/else. It's easier to try to parse a string as a dict literal and handle failure than it is to have a test for a dict literal. It's not how I usually code, but it's valid and understandable. Would you rather we always used if/else for flow control?

False positives are not a problem, because in the end all tokens will be matched against the CCF.

If you want, we can get rid of this and just reduce the variety of edge cases we can handle here. It just means we will miss area extraction on a few outlier datasets.

At the same time, I am working on a PR to NWB Inspector to try to clamp down on this variety.

val = val.decode("utf-8", errors="replace")
if val and isinstance(val, str):
locations.append(val)
except Exception:

Check notice

Code scanning / CodeQL

Empty except Note

'except' clause does nothing but pass and there is no explanatory comment.

Copilot Autofix

AI 4 days ago

General approach: avoid silent exception swallowing. At minimum, record what went wrong (e.g., via logging) so that missing brain locations can be diagnosed later, while still allowing the function to continue and return whatever locations were successfully collected.

Best concrete fix here: replace the bare except Exception: pass at lines 452–453 with an except Exception as exc: block that logs a warning using the existing logger lgr. The log message should clearly indicate that an error occurred while extracting icephys electrode locations and include the exception details. We should not re-raise, to avoid breaking existing callers that expect best-effort behavior. No new imports are needed because lgr is already defined at line 55 via get_logger().

Specific change:

  • File dandi/pynwb_utils.py, in _get_brain_locations, replace:
            except Exception:
                pass

with something like:

            except Exception as exc:
                lgr.warning(
                    "Failed to extract brain locations from intracellular electrodes: %s",
                    exc,
                )

This keeps functionality the same (errors are still non-fatal) but eliminates the empty except and preserves diagnostic information.

Suggested changeset 1
dandi/pynwb_utils.py

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/dandi/pynwb_utils.py b/dandi/pynwb_utils.py
--- a/dandi/pynwb_utils.py
+++ b/dandi/pynwb_utils.py
@@ -449,8 +449,11 @@
                             val = val.decode("utf-8", errors="replace")
                         if val and isinstance(val, str):
                             locations.append(val)
-            except Exception:
-                pass
+            except Exception as exc:
+                lgr.warning(
+                    "Failed to extract brain locations from intracellular electrodes: %s",
+                    exc,
+                )
 
     return locations
 
EOF
@@ -449,8 +449,11 @@
val = val.decode("utf-8", errors="replace")
if val and isinstance(val, str):
locations.append(val)
except Exception:
pass
except Exception as exc:
lgr.warning(
"Failed to extract brain locations from intracellular electrodes: %s",
exc,
)

return locations

Copilot is powered by AI and may make mistakes. Always verify output.
Unable to commit as this autofix suggestion is now outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

too wide -- not even log message... RF to become more specific, and add logging here even if at .debug level. Otherwise we might endup wild chasing bugs later here

@bendichter bendichter added the minor Increment the minor version when merged label Feb 21, 2026
- Cast json.load() return to explicit type
- Use str() for AnyHttpUrl identifier comparisons
- Add None-checks before indexing wasDerivedFrom in tests
- Add explanatory comments to bare except clauses

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@codecov
Copy link

codecov bot commented Feb 21, 2026

Codecov Report

❌ Patch coverage is 80.25078% with 63 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.25%. Comparing base (da76ac3) to head (d6b5be1).

Files with missing lines Patch % Lines
dandi/data/generate_allen_structures.py 0.00% 23 Missing ⚠️
dandi/metadata/brain_areas.py 84.49% 20 Missing ⚠️
dandi/pynwb_utils.py 60.97% 16 Missing ⚠️
dandi/metadata/util.py 86.36% 3 Missing ⚠️
dandi/tests/test_metadata.py 98.14% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1807      +/-   ##
==========================================
+ Coverage   75.12%   75.25%   +0.12%     
==========================================
  Files          84       87       +3     
  Lines       11925    12244     +319     
==========================================
+ Hits         8959     9214     +255     
- Misses       2966     3030      +64     
Flag Coverage Δ
unittests 75.25% <80.25%> (+0.12%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@bendichter
Copy link
Member Author

sister PR: NeurodataWithoutBorders/nwbinspector#671

Copy link
Member

@yarikoptic yarikoptic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would work but we need to make it a little "better", in particular a little more specific to how it processes user entered data found in NWB so we do not end up chasing silently "handled" bugs

except ValueError:
tokens.append(val)
return tokens
except (ValueError, SyntaxError):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bendichter something should be at least logged here or best to give example on when is it happening -- is that a broken structure from allen (report upstream?) or expected and unhandled explicitly above ? may be it would signal need to just add one more _TRIVIAL_VALUES ?

overall -- unclear

try:
float(val)
except ValueError:
tokens.append(val)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is very non-specific -- so when there is a number -- we ignore, and when string -- we take? could it just be a regex right away and avoid guess work?

d = ast.literal_eval(location)
if isinstance(d, dict):
# Look for known area keys
for key in ("area", "location", "region", "brain_area", "brain_region"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it matter what key it is , as does nwb prescribe it or could be 'Region', 'AREA' (hence need to be lower cased at least) or even some 'brain-area' or 'brain area'? my point for that would be to swap the loop as to go through keys, harmonize their names (strip spaces, -, _), lower case -- and then compare and also add 'name', ' areaname', ... ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NWB says that the value of location should simply be the name of the brain area, but there's no validation of this, and while we usually do get some sort of brain area name, there are also other types of things. Sometimes this makes sense, as the user wanted to encode additional information like depth or multiple areas. This is useful information, but we are really only interested in the brain area right now. The _parse_location_string function was written to handle the most common edge cases I saw. That's what is meant by "known" area keys -- we've seen them before. There are lots of other words that could be used here but honestly this is an edge case so I'm not too worried if we miss a dataset or two.

"10090" in species_str
or "mus musculus" in species_str
or "mouse" in species_str
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have matching for species elsewhere -- reuse here if you can about matching to mice.

Comment on lines 12 to 63
@pytest.mark.ai_generated
class TestParseLocationString:
def test_simple_acronym(self) -> None:
assert _parse_location_string("VISp") == ["VISp"]

def test_simple_name(self) -> None:
assert _parse_location_string("Primary visual area") == ["Primary visual area"]

def test_comma_separated(self) -> None:
assert _parse_location_string("VISp,VISrl,VISlm") == ["VISp", "VISrl", "VISlm"]

def test_comma_separated_with_spaces(self) -> None:
assert _parse_location_string("VISp, VISrl, VISlm") == [
"VISp",
"VISrl",
"VISlm",
]

def test_dict_literal_with_area(self) -> None:
result = _parse_location_string("{'area': 'VISp', 'depth': '20'}")
assert result == ["VISp"]

def test_dict_literal_no_area_key(self) -> None:
result = _parse_location_string("{'region_name': 'VISp', 'depth': '20'}")
# Should return non-numeric string values
assert "VISp" in result

def test_key_value_pairs(self) -> None:
result = _parse_location_string("area: VISp, depth: 175")
assert result == ["VISp"]

def test_trivial_unknown(self) -> None:
assert _parse_location_string("unknown") == []

def test_trivial_none(self) -> None:
assert _parse_location_string("none") == []

def test_trivial_na(self) -> None:
assert _parse_location_string("n/a") == []

def test_trivial_brain(self) -> None:
assert _parse_location_string("brain") == []

def test_empty_string(self) -> None:
assert _parse_location_string("") == []

def test_whitespace_only(self) -> None:
assert _parse_location_string(" ") == []

def test_comma_list_with_trivial(self) -> None:
result = _parse_location_string("VISp, unknown, CA1")
assert result == ["VISp", "CA1"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ask agent to RF into a single parametric test -- this way imho too much clutter.

val = val.decode("utf-8", errors="replace")
if val and isinstance(val, str):
locations.append(val)
except Exception:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

too wide -- not even log message... RF to become more specific, and add logging here even if at .debug level. Otherwise we might endup wild chasing bugs later here

tox.ini Outdated
[pytest]
addopts = --tb=short --durations=10 --timeout=300
markers =
ai_generated: marks tests as AI-generated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this, my bad, it was added in a PR and now separate PR I am about to merge:


[tool.codespell]
skip = "_version.py,due.py,versioneer.py,*.vcr.yaml,venv,venvs,pyproject.toml"
skip = "_version.py,due.py,versioneer.py,*.vcr.yaml,venv,venvs,pyproject.toml,allen_ccf_structures.json"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❯ codespell dandi/data/allen_ccf_structures.json
dandi/data/allen_ccf_structures.json:1: AONd ==> and
dandi/data/allen_ccf_structures.json:1: interpolar ==> interpolator
dandi/data/allen_ccf_structures.json:1: fpr ==> for, far, fps
dandi/data/allen_ccf_structures.json:1: ND ==> AND, 2ND
dandi/data/allen_ccf_structures.json:1: OT ==> TO, OF, OR, NOT, IT
dandi/data/allen_ccf_structures.json:1: ECT ==> ETC
dandi/data/allen_ccf_structures.json:1: onl ==> only
dandi/data/allen_ccf_structures.json:1: accesory ==> accessory

the last one should be reported/fixed upstream... in principle we could whitelist those and ensure that we do not try to match into typos.

@yarikoptic
Copy link
Member

I hope I left detailed enough comments for an agentic beast to address them ;)

bendichter and others added 4 commits February 24, 2026 21:56
Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>
Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>
Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>
- Refactor test_brain_areas.py into parametrized tests to reduce clutter
- Add debug logging to empty except clauses in brain_areas.py and pynwb_utils.py
- Make dict key matching case-insensitive and flexible (normalise hyphens,
  underscores, spaces) via _extract_area_from_dict / _normalise_key
- Apply docstring suggestions for _parse_location_string
- Extract _is_mouse() that reuses species_map for mouse detection with
  walrus operator
- Remove ai_generated marker from tox.ini (handled by PR #1812)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

minor Increment the minor version when merged

Projects

None yet

Development

Successfully merging this pull request may close these issues.

extract asset brain area from location field

2 participants