feat: extract brain area anatomy from NWB location fields by bendichter · Pull Request #1807 · dandi/dandi-cli

bendichter · 2026-02-21T15:20:48Z

Summary

Closes #1806

Extracts brain location values from NWB files (ImagingPlane.location, electrodes.location, IntracellularElectrode.location)
Matches locations against bundled Allen Mouse Brain Atlas CCF structures (~94KB JSON, 1,327 structures)
Populates BioSample.anatomy within the wasDerivedFrom field using MBAO identifiers
Only applies to mouse species (NCBITaxon_10090); non-mouse species are skipped

New files

dandi/data/allen_ccf_structures.json — bundled Allen CCF structures
dandi/data/generate_allen_structures.py — regeneration script (python -m dandi.data.generate_allen_structures)
dandi/metadata/brain_areas.py — parsing and matching module
dandi/tests/test_brain_areas.py — 29 unit tests

Modified files

dandi/pynwb_utils.py — _get_brain_locations() extracts location strings from NWB
dandi/metadata/util.py — extract_wasDerivedFrom() now adds anatomy to deepest BioSample
dandi/tests/test_metadata.py — 3 integration tests for anatomy in wasDerivedFrom

Test plan

python -m pytest dandi/tests/test_brain_areas.py -v — 29 unit tests pass
python -m pytest dandi/tests/test_metadata.py -v -k anatomy — 3 integration tests pass
pre-commit run --files <all changed files> — all hooks pass
CI passes

🤖 Generated with Claude Code

Extract brain location values from NWB files (ImagingPlane.location, electrodes.location, IntracellularElectrode.location), match them against bundled Allen Mouse Brain Atlas CCF structures, and populate BioSample.anatomy within wasDerivedFrom. Closes #1806 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

yarikoptic · 2026-02-24T22:58:01Z

dandi/metadata/brain_areas.py

+                        except ValueError:
+                            tokens.append(val)
+                return tokens
+        except (ValueError, SyntaxError):


In general, to fix an “empty except” issue, you either (a) narrow the exception scope and add at least logging or a comment explaining why it is safe to ignore, or (b) re-raise or otherwise react to the error instead of silently discarding it. Here the intended behavior is to ignore dict-literal parse failures and let later heuristics run, so the best fix is to keep catching ValueError/SyntaxError but log the failure at a low level (e.g., debug) and/or document it.

Concretely, in dandi/metadata/brain_areas.py around lines 91–115, replace the bare except (ValueError, SyntaxError): pass with an except block that logs the exception using the existing module logger lgr (already defined using get_logger()) and then continues. This does not change the external behavior (the function still just falls through to the subsequent parsing logic) but addresses the CodeQL concern by making the handling explicit and non-empty. No new imports are needed since lgr is already in scope.

@bendichter something should be at least logged here or best to give example on when is it happening -- is that a broken structure from allen (report upstream?) or expected and unhandled explicitly above ? may be it would signal need to just add one more _TRIVIAL_VALUES ?

overall -- unclear

The idea here is to try our best to process the variety of different forms of the location string.

I disagree with your reviewer bot- there is a clear explanation on the next line.

Above, Claude is trying to parse the string as a dict literal. If it can't it proceeds to other parsers. It is using try/except here instead of if/else. It's easier to try to parse a string as a dict literal and handle failure than it is to have a test for a dict literal. It's not how I usually code, but it's valid and understandable. Would you rather we always used if/else for flow control?

False positives are not a problem, because in the end all tokens will be matched against the CCF.

If you want, we can get rid of this and just reduce the variety of edge cases we can handle here. It just means we will miss area extraction on a few outlier datasets.

At the same time, I am working on a PR to NWB Inspector to try to clamp down on this variety.

yarikoptic · 2026-02-24T23:10:59Z

dandi/pynwb_utils.py

+                            val = val.decode("utf-8", errors="replace")
+                        if val and isinstance(val, str):
+                            locations.append(val)
+            except Exception:


General approach: avoid silent exception swallowing. At minimum, record what went wrong (e.g., via logging) so that missing brain locations can be diagnosed later, while still allowing the function to continue and return whatever locations were successfully collected.

Best concrete fix here: replace the bare except Exception: pass at lines 452–453 with an except Exception as exc: block that logs a warning using the existing logger lgr. The log message should clearly indicate that an error occurred while extracting icephys electrode locations and include the exception details. We should not re-raise, to avoid breaking existing callers that expect best-effort behavior. No new imports are needed because lgr is already defined at line 55 via get_logger().

Specific change:

File dandi/pynwb_utils.py, in _get_brain_locations, replace:

except Exception: pass

with something like:

except Exception as exc: lgr.warning( "Failed to extract brain locations from intracellular electrodes: %s", exc, )

This keeps functionality the same (errors are still non-fatal) but eliminates the empty except and preserves diagnostic information.

too wide -- not even log message... RF to become more specific, and add logging here even if at .debug level. Otherwise we might endup wild chasing bugs later here

- Cast json.load() return to explicit type - Use str() for AnyHttpUrl identifier comparisons - Add None-checks before indexing wasDerivedFrom in tests - Add explanatory comments to bare except clauses Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

codecov · 2026-02-21T15:29:18Z

Codecov Report

❌ Patch coverage is 80.25078% with 63 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.25%. Comparing base (da76ac3) to head (d6b5be1).

Files with missing lines	Patch %	Lines
dandi/data/generate_allen_structures.py	0.00%	23 Missing ⚠️
dandi/metadata/brain_areas.py	84.49%	20 Missing ⚠️
dandi/pynwb_utils.py	60.97%	16 Missing ⚠️
dandi/metadata/util.py	86.36%	3 Missing ⚠️
dandi/tests/test_metadata.py	98.14%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1807      +/-   ##
==========================================
+ Coverage   75.12%   75.25%   +0.12%     
==========================================
  Files          84       87       +3     
  Lines       11925    12244     +319     
==========================================
+ Hits         8959     9214     +255     
- Misses       2966     3030      +64

Flag	Coverage Δ
unittests	`75.25% <80.25%> (+0.12%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

bendichter · 2026-02-21T16:10:47Z

sister PR: NeurodataWithoutBorders/nwbinspector#671

yarikoptic

I think it would work but we need to make it a little "better", in particular a little more specific to how it processes user entered data found in NWB so we do not end up chasing silently "handled" bugs

yarikoptic · 2026-02-24T22:58:01Z

dandi/metadata/brain_areas.py

+                        except ValueError:
+                            tokens.append(val)
+                return tokens
+        except (ValueError, SyntaxError):


@bendichter something should be at least logged here or best to give example on when is it happening -- is that a broken structure from allen (report upstream?) or expected and unhandled explicitly above ? may be it would signal need to just add one more _TRIVIAL_VALUES ?

overall -- unclear

yarikoptic · 2026-02-24T22:59:54Z

dandi/metadata/brain_areas.py

+                        try:
+                            float(val)
+                        except ValueError:
+                            tokens.append(val)


this is very non-specific -- so when there is a number -- we ignore, and when string -- we take? could it just be a regex right away and avoid guess work?

dandi/metadata/brain_areas.py

yarikoptic · 2026-02-24T23:05:06Z

dandi/metadata/brain_areas.py

+            d = ast.literal_eval(location)
+            if isinstance(d, dict):
+                # Look for known area keys
+                for key in ("area", "location", "region", "brain_area", "brain_region"):


does it matter what key it is , as does nwb prescribe it or could be 'Region', 'AREA' (hence need to be lower cased at least) or even some 'brain-area' or 'brain area'? my point for that would be to swap the loop as to go through keys, harmonize their names (strip spaces, -, _), lower case -- and then compare and also add 'name', ' areaname', ... ?

NWB says that the value of location should simply be the name of the brain area, but there's no validation of this, and while we usually do get some sort of brain area name, there are also other types of things. Sometimes this makes sense, as the user wanted to encode additional information like depth or multiple areas. This is useful information, but we are really only interested in the brain area right now. The _parse_location_string function was written to handle the most common edge cases I saw. That's what is meant by "known" area keys -- we've seen them before. There are lots of other words that could be used here but honestly this is an edge case so I'm not too worried if we miss a dataset or two.

yarikoptic · 2026-02-24T23:06:48Z

dandi/metadata/util.py

+        "10090" in species_str
+        or "mus musculus" in species_str
+        or "mouse" in species_str
+    )


we have matching for species elsewhere -- reuse here if you can about matching to mice.

yarikoptic · 2026-02-24T23:08:07Z

dandi/tests/test_brain_areas.py

+@pytest.mark.ai_generated
+class TestParseLocationString:
+    def test_simple_acronym(self) -> None:
+        assert _parse_location_string("VISp") == ["VISp"]
+
+    def test_simple_name(self) -> None:
+        assert _parse_location_string("Primary visual area") == ["Primary visual area"]
+
+    def test_comma_separated(self) -> None:
+        assert _parse_location_string("VISp,VISrl,VISlm") == ["VISp", "VISrl", "VISlm"]
+
+    def test_comma_separated_with_spaces(self) -> None:
+        assert _parse_location_string("VISp, VISrl, VISlm") == [
+            "VISp",
+            "VISrl",
+            "VISlm",
+        ]
+
+    def test_dict_literal_with_area(self) -> None:
+        result = _parse_location_string("{'area': 'VISp', 'depth': '20'}")
+        assert result == ["VISp"]
+
+    def test_dict_literal_no_area_key(self) -> None:
+        result = _parse_location_string("{'region_name': 'VISp', 'depth': '20'}")
+        # Should return non-numeric string values
+        assert "VISp" in result
+
+    def test_key_value_pairs(self) -> None:
+        result = _parse_location_string("area: VISp, depth: 175")
+        assert result == ["VISp"]
+
+    def test_trivial_unknown(self) -> None:
+        assert _parse_location_string("unknown") == []
+
+    def test_trivial_none(self) -> None:
+        assert _parse_location_string("none") == []
+
+    def test_trivial_na(self) -> None:
+        assert _parse_location_string("n/a") == []
+
+    def test_trivial_brain(self) -> None:
+        assert _parse_location_string("brain") == []
+
+    def test_empty_string(self) -> None:
+        assert _parse_location_string("") == []
+
+    def test_whitespace_only(self) -> None:
+        assert _parse_location_string("   ") == []
+
+    def test_comma_list_with_trivial(self) -> None:
+        result = _parse_location_string("VISp, unknown, CA1")
+        assert result == ["VISp", "CA1"]


ask agent to RF into a single parametric test -- this way imho too much clutter.

yarikoptic · 2026-02-24T23:10:59Z

dandi/pynwb_utils.py

+                            val = val.decode("utf-8", errors="replace")
+                        if val and isinstance(val, str):
+                            locations.append(val)
+            except Exception:


too wide -- not even log message... RF to become more specific, and add logging here even if at .debug level. Otherwise we might endup wild chasing bugs later here

yarikoptic · 2026-02-24T23:15:24Z

tox.ini

 [pytest]
 addopts = --tb=short --durations=10 --timeout=300
+markers =
+    ai_generated: marks tests as AI-generated


remove this, my bad, it was added in a PR and now separate PR I am about to merge:

ai: fix up CLAUDE.md as to pytest markers placement #1812

yarikoptic · 2026-02-24T23:19:12Z

pyproject.toml


 [tool.codespell]
-skip = "_version.py,due.py,versioneer.py,*.vcr.yaml,venv,venvs,pyproject.toml"
+skip = "_version.py,due.py,versioneer.py,*.vcr.yaml,venv,venvs,pyproject.toml,allen_ccf_structures.json"


❯ codespell dandi/data/allen_ccf_structures.json dandi/data/allen_ccf_structures.json:1: AONd ==> and dandi/data/allen_ccf_structures.json:1: interpolar ==> interpolator dandi/data/allen_ccf_structures.json:1: fpr ==> for, far, fps dandi/data/allen_ccf_structures.json:1: ND ==> AND, 2ND dandi/data/allen_ccf_structures.json:1: OT ==> TO, OF, OR, NOT, IT dandi/data/allen_ccf_structures.json:1: ECT ==> ETC dandi/data/allen_ccf_structures.json:1: onl ==> only dandi/data/allen_ccf_structures.json:1: accesory ==> accessory

the last one should be reported/fixed upstream... in principle we could whitelist those and ensure that we do not try to match into typos.

yarikoptic · 2026-02-24T23:21:31Z

I hope I left detailed enough comments for an agentic beast to address them ;)

Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>

- Refactor test_brain_areas.py into parametrized tests to reduce clutter - Add debug logging to empty except clauses in brain_areas.py and pynwb_utils.py - Make dict key matching case-insensitive and flexible (normalise hyphens, underscores, spaces) via _extract_area_from_dict / _normalise_key - Apply docstring suggestions for _parse_location_string - Extract _is_mouse() that reuses species_map for mouse detection with walrus operator - Remove ai_generated marker from tox.ini (handled by PR #1812) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-advanced-security bot found potential problems Feb 21, 2026

View reviewed changes

bendichter added the minor Increment the minor version when merged label Feb 21, 2026

bendichter mentioned this pull request Feb 21, 2026

Check that mouse NWB files use Allen CCF (MBAO) ontology terms for location fields NeurodataWithoutBorders/nwbinspector#670

Open

yarikoptic requested changes Feb 24, 2026

View reviewed changes

bendichter and others added 4 commits February 24, 2026 21:56

Apply suggestion from @yarikoptic

e07f69f

Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>

Apply suggestion from @yarikoptic

8d06a2b

Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>

Apply suggestion from @yarikoptic

b6810cb

Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>

@@ -112,7 +112,11 @@
                                         tokens.append(val)
                             return tokens
                     except (ValueError, SyntaxError):
-                        pass
+                        lgr.debug(
+                            "Failed to parse brain location as dict literal %r; "
+                            "falling back to alternative parsing strategies",
+                            location,
+                        )
                 # Try key-value format (e.g. "area: VISp, depth: 175")
                 if re.search(r"\w+\s*:", location) and "://" not in location:

@@ -449,8 +449,11 @@
                                         val = val.decode("utf-8", errors="replace")
                                     if val and isinstance(val, str):
                                         locations.append(val)
-                        except Exception:
-                            pass
+                        except Exception as exc:
+                            lgr.warning(
+                                "Failed to extract brain locations from intracellular electrodes: %s",
+                                exc,
+                            )
                 return locations

Comments

Conversation

bendichter commented Feb 21, 2026

Summary

New files

Modified files

Test plan

Uh oh!

Check notice

Copilot Autofix

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Check notice

Copilot Autofix

Choose a reason for hiding this comment

Uh oh!

codecov bot commented Feb 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

bendichter commented Feb 21, 2026

Uh oh!

yarikoptic left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yarikoptic commented Feb 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov bot commented Feb 21, 2026 •

edited

Loading