From 4f33046820dd183998c31ffba1d290c9df17a832 Mon Sep 17 00:00:00 2001
From: Michael Dietze
Date: Wed, 28 Jan 2026 09:21:03 -0500
Subject: [PATCH 1/7] Update gsoc_ideas.mdx

---
 src/pages/gsoc_ideas.mdx | 155 ++++++++++++++++++---------------------
 1 file changed, 73 insertions(+), 82 deletions(-)

diff --git a/src/pages/gsoc_ideas.mdx b/src/pages/gsoc_ideas.mdx
index 65c07fa0..ee579bbe 100644
--- a/src/pages/gsoc_ideas.mdx
+++ b/src/pages/gsoc_ideas.mdx
@@ -1,10 +1,10 @@
---
-title: 'GSoC 2025 - PEcAn Project Ideas'
+title: 'GSoC 2026 - PEcAn Project Ideas'
---

# GSoC - PEcAn Project Ideas{#background}

-PEcAn is an open-source ecosystem modeling framework integrating data, models, and uncertainty quantification. Below is a list of potential ideas where contributors can help improve and expand PEcAn. To get started contributing to PEcAn, check out [this guide](https://github.com/PecanProject/pecan/discussions/3469). Come find us on Slack to discuss. If you have questions or would like to propose your own idea, contact @kooper in or join our **[#gsoc-2025](https://pecanproject.slack.com/archives/C0853U6GF71)** channel in Slack!
+PEcAn is an open-source ecosystem modeling framework integrating data, models, and uncertainty quantification. Below is a list of potential ideas where contributors can help improve and expand PEcAn. To get started contributing to PEcAn, check out [this guide](https://github.com/PecanProject/pecan/discussions/3469). Come find us on Slack to discuss. If you have questions or would like to propose your own idea, contact @kooper or join our **[#gsoc](https://pecanproject.slack.com/archives/C0853U6GF71)** channel in Slack!

---

@@ -12,64 +12,87 @@ PEcAn is an open-source ecosystem modeling framework integrating data, models, a
## 1. Project List{#projects}

Below is a list of project ideas. Feel free to contact the listed mentors on Slack to discuss further, or contact @kooper with new ideas and he can help connect you with mentors.

-1. [Global Sensitivity Analysis / Uncertainty Partitioning](#sa)
-2. [Parallelization of Model Runs on HPC](#hpc)
-3. [Database and Data Improvements](#db)
-4. [Development of Notebook-based PEcAn Workflows](#notebook)
-5. [Refactoring Compile-time Flags to Runtime Flags in SIPNET](#sipnet)
+1. [Refactor and Parallelize Input Processing Pipelines](#input)
+2. [Benchmarking and Validation Framework](#validation)
+3. [Increase PEcAn modularity](#module)

---

-### 1. Global Sensitivity Analysis / Uncertainty Partitioning{#sa}
+### 1. Refactor and Parallelize Input Processing Pipelines{#input}

-This project would extend PEcAn's existing uncertainty partitioning routines, which are primarily one-at-a-time and focused on model parameters, to also consider ensemble-based uncertainties in other model inputs (meteorology, soils, vegetation, phenology, etc). This project would employ Sobol' methods and some uncommitted code exists that manually prototyped how this would be done in PEcAn. The goal would be to refactor/reimplement this prototype into a reliable, automated system and apply it to some key test cases in both natural and managed ecosystems.
+Input-processing code in PEcAn (e.g., meteorological data preparation) is currently centered on monolithic orchestration functions such as `do.conversions` and `met.process`. These functions mix low-level data transformations with sequential control flow, implicit dependencies, and caching behavior, making them difficult to test, debug, scale, or parallelize across sites and ensemble members.
+
+This project will deprecate `do.conversions` as currently implemented and replace it with input preprocessing workflows that are explicitly structured around data dependencies and are naturally parallelizable across data streams, sites, and ensemble members. The work will refactor or deprecate `met.process` to remove monolithic orchestration and reduce or eliminate opaque caching, while retaining and strengthening the existing low-level transformation functions.
+
+As part of the refactor, orchestration logic should be rebuilt to make inputs, outputs, and dependencies explicit. A workflow tool such as {targets} may be used to help define and validate the dependency graph and caching behavior, but must not become a required or exclusive execution path for PEcAn.
+
+This refactoring should also reduce or eliminate implicit dependencies on the global settings object (see Project 3), enabling clearer APIs and improved testability.
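To make the intended design concrete, here is a minimal sketch under assumed names: `prep_site_met` and its arguments are hypothetical illustrations, not existing PEcAn functions. The step takes explicit inputs, returns its output path, reads no global settings object, and can therefore be mapped over sites in parallel with the {future.apply} package:

```r
# Minimal sketch, not PEcAn's current API: prep_site_met and its arguments
# are hypothetical. The point is explicit inputs/outputs and no global state.
library(future)
library(future.apply)

prep_site_met <- function(site_id, start_date, end_date, out_dir) {
  out_file <- file.path(out_dir, paste0(site_id, "_met.nc"))
  # ... fetch the raw product, standardize variable names and units to the
  # PEcAn standard, and write out_file here ...
  out_file  # return the artifact path so data dependencies stay visible
}

plan(multisession)  # swap in a cluster plan on HPC; the step itself is unchanged

met_files <- future_lapply(
  c("US-NR1", "US-Ha1", "US-MMS"),  # example site IDs
  prep_site_met,
  start_date = "2015-01-01",
  end_date   = "2015-12-31",
  out_dir    = tempdir()
)
```

Because every input and output is explicit, the same function could be registered as a {targets} target, submitted as a batch job, or called from a plain `lapply()` without modification.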
**Expected outcomes:**

A successful project would complete the following tasks:

-* Reliable, automated Sobol sensitivity analyss and uncertainty partitioning across multiple model inputs.
-* Applications to test case(s) in natural and / or managed ecosystems.
+* A deprecation plan for `do.conversions`, with a replacement that provides a modular suite of preprocessing tools that:
+  * explicitly defines inputs and outputs, and
+  * supports parallel execution across products, sites, and ensemble members.
+
+  Key here is a high-level plan for development that will continue beyond what is accomplished this summer.
+
+* A refactoring and/or deprecation plan for `met.process` that:
+  * removes monolithic orchestration and hidden control flow,
+  * reduces or eliminates over-engineered caching, and
+  * retains and documents the low-level transformation functions.
+
+* A demonstration of parallel execution on a multi-site or multi-ensemble example.
+
+* Basic correctness and performance benchmarks, including unit and integration tests and validation of PEcAn-standard inputs (formats and units).
+
+* Updated developer documentation covering:
+  * the new input-processing architecture,
+  * how to add a new preprocessing step, and
+  * migration guidance from the legacy entry points.

**Prerequisites:**

- Required: R (the existing workflows are in R)
-- Helpful: familiarity with sensitivity analyses
+- Helpful: familiarity with parallel computing and workflow refactoring

**Contact person:**

-Mike @Dietze
+David LeBauer (@dlebauer), @Henry Priest

**Duration:**

-Flexible to work as either a Medium (175hr) or Large (350 hr)
+Large (350 hr)

**Difficulty:**

-Medium
+High

---

-### 2. Parallelization of Model Runs on HPC{#hpc}
+### 2. Benchmarking and Validation Framework{#validation}

-This project would extend PEcAn's existing run mechanisms to be able to run on a High Performance Compute cluster (HPC) using [Apptainer](https://apptainer.org). For uncertaintity analysis, PEcAn will run the same model 1000s of times with small permutations. This is a perfect use for an HPC run. The goal is to not submit 1000s of jobs, but have a single job with multiple nodes that will run all of the ensembles efficiently. Running can be orchistrated using RabbitMQ, but other methods are also encouraged. The end goal should be for the PEcAn system to be launched, and run the full workflow on the HPC from start to finish leveraging as many nodes as it is given during the submission.
+A key task in any modeling workflow is the validation of model outputs against held-out observations. When a validation dataset is used repeatedly, and a broad community agrees that it has particular value in assessing model performance, it is often elevated to the status of a persistent "benchmark" dataset. In PEcAn, there is a need to replace our earlier benchmarking module, whose design was never fully implemented, with a simpler framework. In designing this framework, we encourage participants to build on the low-level infrastructure already in the benchmarking module, such as its model-data alignment tools and comparison metrics (RMSE, MAE, R²). Work should also build upon and generalize existing examples of "one-off" validation scripts (e.g., the CARB cropland validations and the North American data assimilation validations).
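As a rough sketch of the kind of low-level building block the framework would formalize (the data frame layout and column names here are assumptions for illustration, not an existing PEcAn interface):

```r
# Minimal sketch: align paired model and observation records, then score them.
# Column names (site, time, model, obs) are assumptions for illustration.
validate <- function(model_df, obs_df, by = c("site", "time")) {
  paired <- merge(model_df, obs_df, by = by)  # model-data alignment
  err <- paired$model - paired$obs
  data.frame(
    n    = nrow(paired),
    rmse = sqrt(mean(err^2)),
    mae  = mean(abs(err)),
    r2   = cor(paired$model, paired$obs)^2
  )
}

# Toy example
mod <- data.frame(site = "CA-1", time = 1:5, model = c(2.1, 2.4, 3.0, 2.8, 3.2))
obs <- data.frame(site = "CA-1", time = 1:5, obs   = c(2.0, 2.6, 2.9, 3.1, 3.0))
validate(mod, obs)
```

In this framing, a benchmark is simply a validation like the one above whose observation table, alignment rules, and metric set are versioned and reused by the community.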
**Expected outcomes:**

A successful project would complete the following tasks:

-* Show different ways to launch jobs (rabbitmq, lock files, simple round robin, etc)
-* Report of different options and how they can be enabled.
+* A high-level design, and a plan for development that will continue beyond what is accomplished this summer
+* Unit and integration tests
+* A generalized example of a validation workflow and/or notebook using California cropland datasets spanning multiple sites and crop types
+* Documentation

**Prerequisites:**

-- Required: R (existing workflow and prototype is in R), Docker
-- Helpful: Familiarity with HPC and Apptainer
+- Required: R (the existing workflows are in R), familiarity with statistical methods for model validation
+- Helpful: familiarity with existing benchmarking workflow systems

**Contact person:**

-Rob @Kooper
+Chris Black (@infotroph)

**Duration:**

Flexible to work as either a Medium (175hr) or Large (350 hr)

**Difficulty:**

Medium

---

-### 3. Database and Data Improvements{#db}
+### 3. Increase PEcAn modularity{#module}

-PEcAn relies on the BETYdb database to store trait and yield data as well as model provenance information. This project aims to separate trait data from provenance tracking, ensure that PEcAn is able to run without the server currently required to run the Postgres database used by BETYdb, and enable flexible data sharing in place of a server-reliant sync mechanism. The goal is to make PEcAn workflows easier to test, deploy, and use while also making data more accessible.
+Existing PEcAn workflows rely heavily on reading a large `settings` object and writing `.RData` files or other opaque artifacts to disk to pass state between steps. This behavior reduces transparency, testability, and user understanding. The high-level goal of this project is to make PEcAn's core functionality more modular and transparent, so that users can more easily build, maintain, and expand PEcAn workflows.

-**Potential Directions**
-
-- **Minimal BETYdb Database:** Create a simplified version of BETYdb for demonstrations and Integration tests, which might include:
-  - Review the provenance information we currently log, identify components that no longer need to be tracked or that should be temporary rather than permanent records, and build tools to clean unneeded records from the database.
-  - Design and create a freestanding version of the trait data, including choosing the format and distribution method, implementing whatever pipelines are needed to move the data over, and documenting how to use and update the result.
-  - Review the information we currently log, identify components that no longer need to be tracked or that should be temporary rather than permanent, and build tools to clean unneeded/expired records from the database.
+This project refactors a single, well-defined workflow so that functions return explicit R objects (e.g., data frames or lists) instead of relying on hidden on-disk side effects.
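A minimal sketch of the pattern (every function and field name below is hypothetical, chosen to illustrate the refactor rather than PEcAn's current API):

```r
# Stand-in for the real computation; returns an explicit, inspectable object.
compute_traits <- function(pfts) {
  data.frame(pft = pfts, trait = "SLA", value = rnorm(length(pfts)))
}

# Before: state is passed downstream via a hidden .RData side effect.
get_traits_old <- function(settings) {
  trait.data <- compute_traits(settings$pfts)
  save(trait.data, file = file.path(settings$outdir, "trait.data.RData"))
}

# After: the function returns the object; callers decide what to persist.
get_traits <- function(pfts) {
  compute_traits(pfts)
}

# A thin backward-compatible wrapper retains the legacy on-disk artifact.
get_traits_compat <- function(settings) {
  trait.data <- get_traits(settings$pfts)
  save(trait.data, file = file.path(settings$outdir, "trait.data.RData"))
  invisible(trait.data)
}

s <- list(pfts = c("temperate.deciduous", "temperate.coniferous"),
          outdir = tempdir())
str(get_traits_compat(s))  # object returned AND legacy file still written
```

The wrapper is the sort of backward-compatible shim the expected outcomes below call for: legacy callers keep working while the `.RData` artifact is phased out.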
-- **Non-Database Setup:** Enable workflows that do not require PostgreSQL or a web front-end, potentially including:
-  - Identify PEcAn modules that are still DB-dependent and refactor them to allow freestanding use
-  - Implement mechanisms for decoupling the DB from the model pipelines in time and space while still tracking provenance. Perhaps this could involve separate prep/execution/post-logging phases, but we encourage your creative suggestions.
-  - Create tools that maximize interoperability with data from other sources, including from external databases or the user's own observations.
-  - Identify functionality from the "BETYdb network" sync system that is out of date and replace or remove it as needed.
+To minimize disruption to existing workflows, the preferred approach would be to:
+* Begin by documenting existing functionality
+* Where needed, write tests for existing functionality
+* Document new functionality
+* Write tests for new functionality (test-driven development)
+* Refactor functions to return objects
+* Then refactor downstream functions to use those objects
+* Only after that is working, stop writing out the files
+* If time permits, analyze how PEcAn's high-level modules use the `settings` object and, where possible, refactor function inputs to pass only the required subset of variables or variable lists
+* Along the way, reassess which functions need to be exported (fewer exported functions make it easier for new users to see what PEcAn's core modules actually are) and better document the core functions and modules we expect users to learn and use

**Expected outcomes**:

-A successful project would complete a subset of the following tasks:
-- A lightweight, distributable demo Postgres database.
-- A distributable dataset of the existing trait and yield records in a maximally reusable format (i.e. maybe _not_ Postgres)
-- A workflow that is independent of the Postgres database.
+* Refactored functions that return explicit R objects instead of writing `.RData`
+* Clear definition and documentation of the object structures passed between steps
+* Backward-compatible wrappers where needed to avoid breaking existing workflows
+* Unit tests that no longer depend on on-disk state or `output_dir`
+* Documentation describing `.RData` deprecation, migration guidance, and examples

**Skills Required**:

-- Familiarity with database concepts required
-- Postgres experience helpful (and required if proposing DB cleanup tasks)
-- R experience helpful (and required if proposing PEcAn code changes)
+- Required: R (the existing workflows are in R) and R package development
+- Helpful: familiarity with code refactoring

**Contact person:**

-Chris Black (@infotroph)
+Mike @Dietze

**Duration:**

Suitable for a Medium (175hr) or Large (350 hr) project.

**Difficulty:**

-Intermediate to hard
+Medium