
MIBM Performance Optimizations #1157

Open

danieljvickers wants to merge 16 commits into MFlowCode:master from danieljvickers:gpu-optimizations

Conversation

@danieljvickers
Member

@danieljvickers danieljvickers commented Feb 17, 2026

Description

Following the levelset refactor, several performance optimizations remained to be made to the code. This PR introduces optimizations that make multi-particle MIBM runs viable. It also raises the upper bound on the allowed number of immersed boundaries to 1000. Performance was measured on 1-4 ranks of ACC GPU compute using A100 GPUs.

This PR also extends the optimizations to STL IBs, which are now fully capable of running on the GPU. An example benchmark case is given below.

Type of change

  • New feature
  • Refactor
  • Other: Performance Tuning

Testing

All changes pass the IBM section of the test suite on GPUs with the NVHPC compiler. Performance was measured with a case of 1000 particles with viscosity enabled. The particles are all resolved 3D spheres placed at random, non-overlapping positions generated by the following case file:

import json
import argparse

num_cells = [240, 240, 240]
dim = [8., 8., 8.]
num_particles = 1000

# create a stationary fluid
case = {
    "run_time_info": "T",
    "parallel_io": "T",
    "m": num_cells[0]-1,
    "n": num_cells[1]-1,
    "p": num_cells[2]-1,
    "dt": 0.005,
    "t_step_start": 0,
    "t_step_stop": 2,
    "t_step_save": 1,
    "num_patches": 1,
    "model_eqns": 2,
    "alt_soundspeed": "F",
    "num_fluids": 1,
    "mpp_lim": "F",
    "mixture_err": "T",
    "time_stepper": 3,
    "recon_type": 1,
    "weno_eps": 1e-16,
    "riemann_solver": 2,
    "wave_speeds": 1,
    "avg_state": 2,
    "precision": 2,
    "format": 1,
    "prim_vars_wrt": "T",
    "E_wrt": "T",
    "viscous": "T",
    "x_domain%beg": -0.5*dim[0],
    "x_domain%end": 0.5*dim[0],
    "y_domain%beg": -0.5*dim[1],
    "y_domain%end": 0.5*dim[1],
    "z_domain%beg": -0.5*dim[2],
    "z_domain%end": 0.5*dim[2],
    "bc_x%beg": -3,
    "bc_x%end": -3,
    "bc_y%beg": -3,
    "bc_y%end": -3,
    "bc_z%beg": -3,
    "bc_z%end": -3,
    "patch_icpp(1)%geometry": 9,
    "patch_icpp(1)%z_centroid": 0.0,
    "patch_icpp(1)%length_z": dim[2],
    "patch_icpp(1)%y_centroid": 0.0,
    "patch_icpp(1)%length_y": dim[1],
    "patch_icpp(1)%x_centroid": 0.0,
    "patch_icpp(1)%length_x": dim[0],
    "weno_order": 5,
    "patch_icpp(1)%pres": 1.0,
    "patch_icpp(1)%alpha_rho(1)": 1.0,
    "patch_icpp(1)%alpha(1)": 1.0,
    "patch_icpp(1)%vel(1)": 0.0,
    "patch_icpp(1)%vel(2)": 0.0,
    "patch_icpp(1)%vel(3)": 0.0,
    "fluid_pp(1)%gamma": 2.5000000000000004,
    "fluid_pp(1)%pi_inf": 0.0,
    "fluid_pp(1)%Re(1)": 2500000,
}

import random
random.seed(42)

dx = [float(dim[i]) / float(num_cells[i]) for i in range(3)]

# set particle properties
particle_radius_cells = 5  # particle radius in grid cells
particle_cell_spacing = particle_radius_cells*2 + 5  # spacing of twice the radius plus 5 cells to guarantee no image points land in other IBs
mpi_cell_spacing = particle_radius_cells + 5  # keep particles safely away from MPI halo regions to prevent out-of-bounds errors
generation_bounds = [6., 6., 6.]  # generate particles in this box, safely away from the domain boundary
velocity_magnitude = 1.

# convert non-dimensional grid-cell units to the units of the simulation
radius_units = float(particle_radius_cells) * dx[0]
particle_units_spacing = float(particle_cell_spacing) * dx[0]
particle_units_spacing_squared = particle_units_spacing**2
mpi_units_spacing = [float(mpi_cell_spacing) * dx[i] for i in range(3)]

# generate an array of xyz positions that guarantees non-overlapping particles
particles = []
while len(particles) < num_particles:
  # generate a completely random position
  position = [random.random() for i in range(3)]
  position = [(position[i] - 0.5) * generation_bounds[i] for i in range(3)]

  # optionally check that the particle is not too close to the MPI halo regions (disabled here)
  valid = True
  # for i in range(3):
  #   valid = valid and abs(position[i]) >= mpi_units_spacing[i]

  # check the minimum spacing against all existing particles, exiting early once a violation is found
  for particle in particles:
    distance_squared = sum([(particle[i] - position[i])**2 for i in range(3)])
    valid = valid and distance_squared >= particle_units_spacing_squared
    if not valid:
      break

  if valid:
    particles.append(position)
    # print(f"\rProgress: {100.*float(len(particles))/float(num_particles)}%", end="", flush=True)

# print()
# convert our array of positions into valid JSON entries for the immersed boundaries
ib_properties = {"ib": "T", "num_ibs": num_particles,}
for i in range(len(particles)):
  ib_properties[f'patch_ib({i+1})%radius'] = radius_units
  ib_properties[f'patch_ib({i+1})%slip'] = 'F'
  ib_properties[f'patch_ib({i+1})%geometry'] = 8
  ib_properties[f'patch_ib({i+1})%moving_ibm'] = 2
  ib_properties[f'patch_ib({i+1})%mass'] = 1.
  ib_properties[f'patch_ib({i+1})%x_centroid'] = particles[i][0]
  ib_properties[f'patch_ib({i+1})%y_centroid'] = particles[i][1]
  ib_properties[f'patch_ib({i+1})%z_centroid'] = particles[i][2]
  # move each particle radially outward to guarantee they never touch during the simulation
  position_mag = (sum([(particles[i][j])**2 for j in range(3)]))**0.5
  for j in range(3):
    ib_properties[f'patch_ib({i+1})%vel({j+1})'] = particles[i][j] * velocity_magnitude / position_mag

print(json.dumps({**case, **ib_properties}))
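
The script above assembles the case dictionary and prints it as JSON, so it can be sanity-checked before handing it to the toolchain. A minimal sketch of such a check, assuming the script is saved as case.py (the file name is an assumption for illustration):

import json
import subprocess

# Run the case file and parse the JSON it prints to stdout (file name case.py assumed).
raw = subprocess.run(["python3", "case.py"], capture_output=True, text=True, check=True).stdout
case = json.loads(raw)
# Spot-check a few of the generated IB entries.
print(case["num_ibs"], case["patch_ib(1)%radius"], case["patch_ib(1)%geometry"])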

These optimizations yield nearly a 1000x speedup in the moving IBM propagation and generation code. Prior to these optimizations, profiling the benchmark case with NVIDIA Nsight showed about 45 seconds per RK substep:

[Nsight profile screenshot, 2026-02-16: ~45 s per RK substep]

After these optimizations, the same profile shows roughly 50 ms per RK substep:
[Nsight profile screenshot, 2026-02-17: ~50 ms per RK substep]

Checklist

  • I updated documentation if user-facing behavior changed

See the developer guide for full coding standards.

GPU changes (expand if you modified src/simulation/)
  • GPU results match CPU results
  • Tested on NVIDIA GPU or AMD GPU

Summary by CodeRabbit

Release Notes

  • New Features

    • Model-based levelset patch geometry support
  • Enhancements

    • Increased maximum patches to 1000 (previously 10)
    • Increased maximum immersed boundaries to 1000 (previously 10)
    • Enhanced STL model processing for immersed boundary computations
    • Added performance profiling instrumentation

@coderabbitai
Contributor

coderabbitai bot commented Feb 17, 2026

📝 Walkthrough

Walkthrough

This PR implements GPU-accelerated immersed boundary method (IBM) support by introducing GPU-friendly data structures for STL models, refactoring IB patch handling with type-based markers, adding bounding-index optimizations, expanding model limits from 10 to 1000, and integrating profiling instrumentation across multiple computational kernels.
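
The "bounding-index optimizations" mentioned above can be pictured as converting each IB patch's physical bounding box into grid-index windows, so marker loops only visit nearby cells instead of the full grid. A minimal NumPy sketch of that idea (get_bounding_indices is named in the file summary below, but this body is illustrative, not the PR's Fortran implementation):

import numpy as np

def get_bounding_indices(lo, hi, x0, dx, n_cells, pad=2):
    # Map a physical interval [lo, hi] to clamped cell-index bounds, with a small halo.
    i_lo = max(int(np.floor((lo - x0)/dx)) - pad, 0)
    i_hi = min(int(np.ceil((hi - x0)/dx)) + pad, n_cells - 1)
    return i_lo, i_hi

def mark_spheres(markers, centers, radius, x0, dx):
    # markers: (m, n, p) integer array; centers: list of 3-vectors in physical units.
    for pid, c in enumerate(centers, start=1):
        (i0, i1), (j0, j1), (k0, k1) = [
            get_bounding_indices(c[d] - radius, c[d] + radius, x0[d], dx[d], markers.shape[d])
            for d in range(3)]
        for i in range(i0, i1 + 1):
            for j in range(j0, j1 + 1):
                for k in range(k0, k1 + 1):
                    xc = np.array([x0[0] + (i + 0.5)*dx[0],
                                   x0[1] + (j + 0.5)*dx[1],
                                   x0[2] + (k + 0.5)*dx[2]])
                    if np.sum((xc - c)**2) <= radius**2:
                        markers[i, j, k] = pid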

Changes

  • GPU Parallelization Infrastructure (src/common/include/parallel_macros.fpp): Added END_GPU_ATOMIC_CAPTURE() macro for atomic capture ending directives, mirroring existing GPU parallel loop patterns.
  • Configuration & Limits (src/common/m_constants.fpp, toolchain/mfc/params/definitions.py, toolchain/mfc/case_validator.py, toolchain/mfc/params_tests/test_definitions.py): Increased num_patches_max from 10 to 1000, and updated the corresponding IBM validation limits and parameter count expectations from ~3400 to ~40000.
  • Model Data Structures (src/common/m_derived_types.fpp, src/common/m_helper.fpp): Changed t_model_array's interpolate field from logical to integer; added GPU-optimized fields (ntrs, trs_v, trs_n). Added a GPU routine annotation to the f_cross function.
  • GPU Model Data & Functions (src/common/m_model.fpp): Introduced public GPU-friendly data arrays (gpu_ntrs, gpu_trs_v, gpu_trs_n, gpu_boundary_v, gpu_interpolate, etc.) and new subroutines s_pack_model_for_gpu, f_model_is_inside_flat, and f_distance_normals_3d_flat for GPU computation pathways (a sketch of the packing idea follows this list).
  • Immersed Boundary Patch Framework (src/simulation/m_ib_patches.fpp): Replaced the raw integer array ib_markers_sf with the typed ib_markers across all patch routines (circle, airfoil, model, sphere, cylinder, etc.); added get_bounding_indices for GPU loop optimization; substantially expanded s_instantiate_STL_models with GPU data packing and transfer logic.
  • Levelset Computation (src/simulation/m_compute_levelset.fpp): Refactored to use GPU data structures (gpu_interpolate, gpu_boundary_edge_count, gpu_total_vertices) and flat-variant distance/normal functions; replaced boolean interpolation checks with integer comparisons.
  • IBM & Time Stepping (src/simulation/m_ibm.fpp, src/simulation/m_time_steppers.fpp): Introduced NVTX profiling ranges for IBM setup/computation steps; unified 2D/3D logic with GPU parallelization scaffolding; added atomic counters for ghost-point tracking; integrated GPU data transfers and synchronization around IB patch application and force computation.
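
The s_pack_model_for_gpu step listed above can be pictured as flattening ragged per-model triangle lists into rectangular arrays plus a per-model triangle count, which is trivial to copy to the device. A rough NumPy sketch of that layout (array names mirror gpu_ntrs/gpu_trs_v/gpu_trs_n, but the axis ordering and shapes here are illustrative assumptions, not the Fortran code):

import numpy as np

def pack_models(models):
    # models: list of dicts with 'verts' of shape (ntrs, 3, 3) and 'normals' of shape (ntrs, 3).
    n_models = len(models)
    max_ntrs = max(m["verts"].shape[0] for m in models)
    ntrs = np.zeros(n_models, dtype=np.int64)      # per-model triangle count
    trs_v = np.zeros((n_models, max_ntrs, 3, 3))   # padded triangle vertices
    trs_n = np.zeros((n_models, max_ntrs, 3))      # padded triangle normals
    for p, m in enumerate(models):
        k = m["verts"].shape[0]
        ntrs[p] = k
        trs_v[p, :k] = m["verts"]
        trs_n[p, :k] = m["normals"]
    return ntrs, trs_v, trs_n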

Sequence Diagram(s)

sequenceDiagram
    participant CPU as CPU/Host
    participant GPU as GPU Device
    participant Model as Model Data
    participant LevelSet as LevelSet Computation
    participant IBM as IBM Forces

    CPU->>Model: s_instantiate_STL_models()
    Note over Model: Read & transform STL models
    Model->>Model: Compute bounding boxes
    Model->>Model: Pack data (ntrs, trs_v, trs_n)
    
    CPU->>GPU: s_pack_model_for_gpu()
    Note over GPU: Upload gpu_ntrs, gpu_trs_v, gpu_trs_n<br/>gpu_boundary_v, gpu_interpolate
    
    CPU->>LevelSet: s_compute_levelset()
    LevelSet->>GPU: f_distance_normals_3d_flat()
    Note over GPU: Parallel computation using<br/>pre-packed GPU data arrays
    GPU->>LevelSet: Return distance & normals
    
    CPU->>CPU: s_apply_ib_patches(ib_markers)
    Note over CPU: Type-based ib_markers<br/>with bounding-index constraints
    
    CPU->>IBM: s_update_mib()
    IBM->>GPU: NVTX profiled GPU kernels
    Note over GPU: Unified 2D/3D logic<br/>Atomic counters for ghost points<br/>Image point interpolation
    GPU->>IBM: Force data
    
    IBM->>CPU: Return computed forces

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

Review effort 4/5, size:XXL

Suggested reviewers

  • sbryngelson
  • wilfonba

Poem

🐰 A GPU path unfolds so bright,
Where models pack and take their flight,
From CPU threads to CUDA cores—
IBM forces through GPU doors!
With atomic counts and bounding grace,
We've found immersion's perfect place. 🚀

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)

  • Title check — ✅ Passed: The title 'MIBM Performance Optimizations' clearly and concisely describes the main objective of the PR, which is to optimize the Moving Immersed Boundary Method code for GPU execution.
  • Description check — ✅ Passed: The PR description is comprehensive and well-structured. It includes a clear summary of changes, motivation, expanded upper bounds, testing results with performance metrics, and a practical example case. However, the GPU changes checklist items were not marked as completed despite GPU changes being made.
  • Docstring Coverage — ✅ Passed: Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.


@codecov

codecov bot commented Feb 17, 2026

Codecov Report

❌ Patch coverage is 57.08812% with 112 lines in your changes missing coverage. Please review.
✅ Project coverage is 44.24%. Comparing base (c5493c1) to head (bc972ca).
⚠️ Report is 4 commits behind head on master.

Files with missing lines | Patch % | Lines
src/simulation/m_ib_patches.fpp | 63.12% | 56 missing, 3 partials ⚠️
src/simulation/m_ibm.fpp | 50.00% | 23 missing, 13 partials ⚠️
src/common/m_model.fpp | 37.03% | 17 missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1157      +/-   ##
==========================================
+ Coverage   44.07%   44.24%   +0.17%     
==========================================
  Files          70       70              
  Lines       20431    20505      +74     
  Branches     1974     1950      -24     
==========================================
+ Hits         9004     9073      +69     
- Misses      10291    10306      +15     
+ Partials     1136     1126      -10     

☔ View full report in Codecov by Sentry.

@danieljvickers danieljvickers marked this pull request as ready for review February 19, 2026 18:39
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


7 issues found across 12 files

Confidence score: 2/5

  • High risk due to concrete breakages: stl_bounding_boxes re-allocation in src/simulation/m_ib_patches.fpp can crash at runtime with multiple STL patches, and src/simulation/m_compute_levelset.fpp changes make if (interpolate) invalid in Fortran 3D path.
  • An unconditional interpolate = .false. in src/simulation/m_ib_patches.fpp disables STL interpolation, making the intended logic dead and likely altering results.
  • Pay close attention to src/simulation/m_ib_patches.fpp and src/simulation/m_compute_levelset.fpp - crash/compile risks around interpolation and allocation.
Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="toolchain/mfc/case_validator.py">

<violation number="1" location="toolchain/mfc/case_validator.py:687">
P3: Update the validation error message to match the new upper bound (1000) so users get accurate guidance when the check fails.</violation>
</file>

<file name="src/simulation/m_compute_levelset.fpp">

<violation number="1" location="src/simulation/m_compute_levelset.fpp:647">
P1: Changing interpolate to INTEGER breaks the `if (interpolate) then` check in the 3D-model path; INTEGER values are not valid logical expressions in Fortran, so this will not compile. Either keep interpolate LOGICAL (and assign with a logical comparison) or update the condition to compare against 1.</violation>
</file>

<file name="src/simulation/m_time_steppers.fpp">

<violation number="1" location="src/simulation/m_time_steppers.fpp:805">
P3: Fix the NVTX range label typo (“IMERSED”) so profiling markers are consistent and searchable.</violation>
</file>

<file name="src/common/m_model.fpp">

<violation number="1" location="src/common/m_model.fpp:550">
P2: Ray directions should be random around zero; adding point(k) skews them toward the position vector and biases inside/outside tests. Use only the random offset for directions.</violation>
</file>

<file name="src/simulation/m_ib_patches.fpp">

<violation number="1" location="src/simulation/m_ib_patches.fpp:1026">
P1: Debugging override left in: `interpolate = .false.` unconditionally disables STL model interpolation, making the interpolation check dead code. This appears to be a leftover from testing — remove this line to restore the intended interpolation behavior.</violation>

<violation number="2" location="src/simulation/m_ib_patches.fpp:1069">
P1: Runtime crash with multiple STL models: `stl_bounding_boxes` is allocated inside the per-patch loop. When a second STL patch is encountered, Fortran will raise an error because the array is already allocated. Move the allocation before the loop or guard it with a check and use `num_ibs` for the first dimension.</violation>
</file>

<file name="src/simulation/m_ibm.fpp">

<violation number="1" location="src/simulation/m_ibm.fpp:114">
P3: `s_populate_ib_buffers` is now dead code — it is a private subroutine that is no longer called from anywhere after the refactoring removed its callsites in `s_ibm_setup` and `s_update_mib`. Consider removing the subroutine to avoid confusion for future contributors.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.


        integer :: i, j, k, patch_id, boundary_edge_count, total_vertices
-       logical :: interpolate
+       integer :: interpolate

P1: Changing interpolate to INTEGER breaks the if (interpolate) then check in the 3D-model path; INTEGER values are not valid logical expressions in Fortran, so this will not compile. Either keep interpolate LOGICAL (and assign with a logical comparison) or update the condition to compare against 1.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/simulation/m_compute_levelset.fpp, line 647:

<comment>Changing interpolate to INTEGER breaks the `if (interpolate) then` check in the 3D-model path; INTEGER values are not valid logical expressions in Fortran, so this will not compile. Either keep interpolate LOGICAL (and assign with a logical comparison) or update the condition to compare against 1.</comment>

<file context>
@@ -655,7 +644,7 @@ contains
 
         integer :: i, j, k, patch_id, boundary_edge_count, total_vertices
-        logical :: interpolate
+        integer :: interpolate
         real(wp), dimension(1:3) :: center, xyz_local
         real(wp) :: normals(1:3) !< Boundary normal buffer
</file context>

write (*, "(A, 3(2X, F20.10))") " > Max:", grid_mm(:, 2)
end if

allocate (stl_bounding_boxes(patch_id, 1:3, 1:3))

P1: Runtime crash with multiple STL models: stl_bounding_boxes is allocated inside the per-patch loop. When a second STL patch is encountered, Fortran will raise an error because the array is already allocated. Move the allocation before the loop or guard it with a check and use num_ibs for the first dimension.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/simulation/m_ib_patches.fpp, line 1069:

<comment>Runtime crash with multiple STL models: `stl_bounding_boxes` is allocated inside the per-patch loop. When a second STL patch is encountered, Fortran will raise an error because the array is already allocated. Move the allocation before the loop or guard it with a check and use `num_ibs` for the first dimension.</comment>

<file context>
@@ -865,83 +806,368 @@ contains
+                    write (*, "(A, 3(2X, F20.10))") "    >         Max:", grid_mm(:, 2)
+                end if
+
+                allocate (stl_bounding_boxes(patch_id, 1:3, 1:3))
+                stl_bounding_boxes(patch_id, 1, 1:3) = [bbox%min(1), (bbox%min(1) + bbox%max(1))/2._wp, bbox%max(1)]
+                stl_bounding_boxes(patch_id, 2, 1:3) = [bbox%min(2), (bbox%min(2) + bbox%max(2))/2._wp, bbox%max(2)]
</file context>

else
call f_check_interpolation_2D(boundary_v, boundary_edge_count, (/dx_local, dy_local, 0._wp/), interpolate)
end if
interpolate = .false.

P1: Debugging override left in: interpolate = .false. unconditionally disables STL model interpolation, making the interpolation check dead code. This appears to be a leftover from testing — remove this line to restore the intended interpolation behavior.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/simulation/m_ib_patches.fpp, line 1026:

<comment>Debugging override left in: `interpolate = .false.` unconditionally disables STL model interpolation, making the interpolation check dead code. This appears to be a leftover from testing — remove this line to restore the intended interpolation behavior.</comment>

<file context>
@@ -865,83 +806,368 @@ contains
+                else
+                    call f_check_interpolation_2D(boundary_v, boundary_edge_count, (/dx_local, dy_local, 0._wp/), interpolate)
+                end if
+                interpolate = .false.
+
+                ! Show the number of edges and boundary edges in 2D STL models
</file context>

! random jitter in the origin helps us estimate volume fraction instead of only at the cell center
ray_origins(i, k) = point(k) + (f_model_random_number(rand_seed) - 0.5_wp)*spacing(k)
! cast sample rays in all directions
ray_dirs(i, k) = point(k) + f_model_random_number(rand_seed) - 0.5_wp

P2: Ray directions should be random around zero; adding point(k) skews them toward the position vector and biases inside/outside tests. Use only the random offset for directions.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/common/m_model.fpp, line 550:

<comment>Ray directions should be random around zero; adding point(k) skews them toward the position vector and biases inside/outside tests. Use only the random offset for directions.</comment>

<file context>
@@ -493,58 +526,116 @@ contains
+                ! random jitter in the origin helps us estimate volume fraction instead of only at the cell center
+                ray_origins(i, k) = point(k) + (f_model_random_number(rand_seed) - 0.5_wp)*spacing(k)
+                ! cast sample rays in all directions
+                ray_dirs(i, k) = point(k) + f_model_random_number(rand_seed) - 0.5_wp
+            end do
             ray_dirs(i, :) = ray_dirs(i, :)/sqrt(sum(ray_dirs(i, :)*ray_dirs(i, :)))
</file context>
Suggested change
-ray_dirs(i, k) = point(k) + f_model_random_number(rand_seed) - 0.5_wp
+ray_dirs(i, k) = f_model_random_number(rand_seed) - 0.5_wp
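
For intuition, a minimal Python sketch of the corrected ray setup (origin jittered around the query point, direction drawn purely from a zero-centered random offset, normalization guarded by a small epsilon; names here are illustrative, not the Fortran routine):

import random

EPS = 1e-12  # stand-in for the code's small epsilon (e.g. sgm_eps)

def sample_ray(point, spacing, rng=random):
    # Origin: cell-center position plus sub-cell jitter to sample the volume fraction.
    origin = [point[k] + (rng.random() - 0.5)*spacing[k] for k in range(3)]
    # Direction: random offset only, independent of the point's absolute coordinates.
    direction = [rng.random() - 0.5 for _ in range(3)]
    mag = max(sum(d*d for d in direction)**0.5, EPS)  # guard against a zero-length vector
    return origin, [d/mag for d in direction]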

self.prohibit(ib and n <= 0,
              "Immersed Boundaries do not work in 1D (requires n > 0)")
-self.prohibit(ib and (num_ibs <= 0 or num_ibs > 10),
+self.prohibit(ib and (num_ibs <= 0 or num_ibs > 1000),

P3: Update the validation error message to match the new upper bound (1000) so users get accurate guidance when the check fails.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At toolchain/mfc/case_validator.py, line 687:

<comment>Update the validation error message to match the new upper bound (1000) so users get accurate guidance when the check fails.</comment>

<file context>
@@ -684,7 +684,7 @@ def check_ibm(self):
         self.prohibit(ib and n <= 0,
                      "Immersed Boundaries do not work in 1D (requires n > 0)")
-        self.prohibit(ib and (num_ibs <= 0 or num_ibs > 10),
+        self.prohibit(ib and (num_ibs <= 0 or num_ibs > 1000),
                      "num_ibs must be between 1 and num_patches_max (10)")
         self.prohibit(not ib and num_ibs > 0,
</file context>

integer :: i
logical :: forces_computed

call nvtxStartRange("PROPAGATE-IMERSED-BOUNDARIES")

P3: Fix the NVTX range label typo (“IMERSED”) so profiling markers are consistent and searchable.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/simulation/m_time_steppers.fpp, line 805:

<comment>Fix the NVTX range label typo (“IMERSED”) so profiling markers are consistent and searchable.</comment>

<file context>
@@ -802,6 +802,8 @@ contains
         integer :: i
         logical :: forces_computed
 
+        call nvtxStartRange("PROPAGATE-IMERSED-BOUNDARIES")
+
         forces_computed = .false.
</file context>
Suggested change
-call nvtxStartRange("PROPAGATE-IMERSED-BOUNDARIES")
+call nvtxStartRange("PROPAGATE-IMMERSED-BOUNDARIES")

@@ -28,6 +28,8 @@ module m_ibm


P3: s_populate_ib_buffers is now dead code — it is a private subroutine that is no longer called from anywhere after the refactoring removed its callsites in s_ibm_setup and s_update_mib. Consider removing the subroutine to avoid confusion for future contributors.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/simulation/m_ibm.fpp, line 114:

<comment>`s_populate_ib_buffers` is now dead code — it is a private subroutine that is no longer called from anywhere after the refactoring removed its callsites in `s_ibm_setup` and `s_update_mib`. Consider removing the subroutine to avoid confusion for future contributors.</comment>

<file context>
@@ -96,19 +100,18 @@ contains
-        call s_apply_ib_patches(ib_markers%sf(0:m, 0:n, 0:p))
         $:GPU_UPDATE(device='[ib_markers%sf]')
-        call s_populate_ib_buffers()
+        call s_apply_ib_patches(ib_markers)
         $:GPU_UPDATE(host='[ib_markers%sf]')
         do i = 1, num_ibs
</file context>

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (8)
src/simulation/m_time_steppers.fpp (1)

805-855: ⚠️ Potential issue | 🟡 Minor

Fix NVTX range label typo for consistent profiling.

The label reads “IMERSED” instead of “IMMERSED”, which can break grouping in profiling tools.

✏️ Proposed fix
-        call nvtxStartRange("PROPAGATE-IMERSED-BOUNDARIES")
+        call nvtxStartRange("PROPAGATE-IMMERSED-BOUNDARIES")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/simulation/m_time_steppers.fpp` around lines 805 - 855, The NVTX range
label passed to nvtxStartRange in the MTimeSteppers routine is misspelled
("PROPAGATE-IMERSED-BOUNDARIES"); update the string to the correct label
"PROPAGATE-IMMERSED-BOUNDARIES" so profiling tools group this region correctly
(look for the nvtxStartRange(...) call in the do-loop that updates patch_ib and
ensure nvtxEndRange remains paired).
src/simulation/m_compute_levelset.fpp (1)

641-709: ⚠️ Potential issue | 🔴 Critical

Fix integer interpolate used as logical in 3D model path (won't compile).

Line 691 uses if (interpolate) then but interpolate is declared as integer (line 647). This is not valid Fortran and will fail compilation. The 2D branch already correctly uses if (interpolate == 1) then (line 704).

Proposed fix
             ! Get the shortest distance between the cell center and the interpolated model boundary
-            if (interpolate) then
+            if (interpolate == 1) then
                 gp%levelset = f_interpolated_distance(gpu_interpolated_boundary_v(:, :, patch_id), total_vertices, xyz_local)
             else
                 gp%levelset = distance
             end if
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/simulation/m_compute_levelset.fpp` around lines 641 - 709, The integer
variable interpolate in subroutine s_model_levelset is being used as a logical
in the 3D branch (if (interpolate) then) which will not compile; update that
conditional to explicitly compare the integer (e.g., if (interpolate == 1) then)
or change the variable to a logical and adjust all uses accordingly so that the
3D path (the block that currently calls f_interpolated_distance or assigns
gp%levelset = distance) uses a valid boolean check on interpolate.
src/simulation/m_ibm.fpp (3)

589-676: ⚠️ Potential issue | 🟠 Major

Add patch_id to the GPU private list in s_find_ghost_points.

patch_id is assigned in every iteration (lines 628, 666) and used only within that iteration. Without being declared private, multiple GPU threads will access a shared patch_id variable, causing race conditions when reading from patch_ib(patch_id)%....

Proposed fix
-        $:GPU_PARALLEL_LOOP(private='[i,j,k,ii,jj,kk,is_gp,local_idx]', copyin='[count,count_i, x_domain, y_domain, z_domain]', firstprivate='[gp_layers,gp_layers_z]', collapse=3)
+        $:GPU_PARALLEL_LOOP(private='[i,j,k,ii,jj,kk,is_gp,local_idx,patch_id]', copyin='[count,count_i, x_domain, y_domain, z_domain]', firstprivate='[gp_layers,gp_layers_z]', collapse=3)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/simulation/m_ibm.fpp` around lines 589 - 676, The variable patch_id in
s_find_ghost_points is written per-iteration but not listed as private in the
$:GPU_PARALLEL_LOOP, causing concurrent threads to race when reading
patch_ib(patch_id); update the GPU pragma for s_find_ghost_points to include
patch_id in the private list (alongside i,j,k,ii,jj,kk,is_gp,local_idx) so each
thread gets its own patch_id before using patch_ib(patch_id)%... .

441-534: ⚠️ Potential issue | 🟠 Major

Fix pointer association and silent out-of-bounds risk in GPU kernel image-point search.

Inside the GPU-parallel loop, s_cc => x_cc/y_cc/z_cc is pointer association within device code, and the bounds safety check (lines 509–521) is compiled out for GPU builds (#if !defined(MFC_OpenACC)). This means the do while loop can walk index beyond [-buff_size, bound] silently, storing invalid values in ip_grid without error or warning on GPU.

Recommended fixes:

  • Add explicit bounds clamping inside the loop: if index reaches the boundary, exit and clamp to a valid index.
  • Alternatively, replace the linear search with a binary search over monotone s_cc for both safety and performance.
Current unsafe loop (lines 506–524)
do while ((temp_loc < s_cc(index) &
           .or. temp_loc > s_cc(index + 1)))
    index = index + dir
#if !defined(MFC_OpenACC) && !defined(MFC_OpenMP)
    if (index < -buff_size .or. index > bound) then
        ! error handling (skipped on GPU)
        error stop "Ghost Point and Image Point on Different Processors"
    end if
#endif
end do
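
For reference, the binary-search alternative amounts to a clamped lookup over the monotone array of cell-center coordinates. A small Python sketch of that idea (bisect stands in for a hand-rolled, device-safe binary search; names are illustrative):

from bisect import bisect_right

def find_cell(cell_centers, x, lo, hi):
    # Return i in [lo, hi-1] such that cell_centers[i] <= x < cell_centers[i+1],
    # clamped to the local block so a point outside it cannot walk out of bounds.
    i = lo + bisect_right(cell_centers[lo:hi + 1], x) - 1
    return min(max(i, lo), hi - 1)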
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/simulation/m_ibm.fpp` around lines 441 - 534, s_compute_image_points
contains a GPU-parallel do-while that advances index using pointer s_cc and can
run out-of-bounds on device builds because the host-only error check is compiled
out; fix by adding an in-device-safe boundary clamp or replace the linear scan
with a binary search over s_cc. Specifically, inside the loop that updates index
(references: s_cc, index, dir, temp_loc, bound, buff_size) add an explicit check
that if index reaches the lower/upper limits (-buff_size or bound) you break the
loop and clamp index to the nearest valid cell before assigning
ghost_points_in(q)%ip_grid(:); alternatively, implement a device-safe
binary_search helper that takes s_cc and temp_loc and returns the correct index
and use that in place of the do-while to guarantee safety and better
performance.

967-1120: ⚠️ Potential issue | 🟠 Major

Add GPU_UPDATE(device=...) after MPI reductions to maintain self-contained GPU synchronization.

The MPI reductions at lines 1108–1109 occur on the host, followed by host-side writes to patch_ib(i)%force and patch_ib(i)%torque. While the caller (s_update_mib at line 940 in m_time_steppers.fpp) does eventually sync patch_ib to device, making s_compute_ib_forces self-contained by adding GPU_UPDATE(device='[patch_ib(1:num_ibs)]') after the final loop ensures coherence is handled within the function and aligns with the guideline: "Wrap MPI calls with GPU_UPDATE to ensure GPU-host synchronization before and after MPI communication."

Suggested fix
         do i = 1, num_ibs
             patch_ib(i)%force(:) = forces(i, :)
             patch_ib(i)%torque(:) = matmul(patch_ib(i)%rotation_matrix_inverse, torques(i, :))
         end do
+
+        ! Sync updated patch_ib to device for GPU kernels
+        $:GPU_UPDATE(device='[patch_ib(1:num_ibs)]')
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/simulation/m_ibm.fpp` around lines 967 - 1120, The MPI reductions in
s_compute_ib_forces (calls to s_mpi_allreduce_vectors_sum for forces and
torques) are host-side and must be bracketed by a GPU update to ensure
device/host coherence; insert GPU_UPDATE(device='[patch_ib(1:num_ibs)]')
immediately after the two s_mpi_allreduce_vectors_sum calls and before any
writes to patch_ib(i)%force or patch_ib(i)%torque so the routine handles
synchronization itself.
src/common/m_model.fpp (1)

498-576: ⚠️ Potential issue | 🟠 Major

Ray direction generation is incorrect and risks biased/degenerate rays.

Ray directions in both f_model_is_inside (line 550) and f_model_is_inside_flat (line 608) are computed as point(k) + f_model_random_number(rand_seed) - 0.5_wp. This couples the direction to absolute coordinates, causing heavy bias for large coordinate values. Additionally, both functions normalize without guarding against zero-length vectors, risking division by zero.

Ray directions should be independent of the point's absolute location—only the ray origin should depend on point. The fix removes the point(k) term and adds max(..., sgm_eps) guards on normalization (consistent with existing patterns in the codebase):

Proposed changes
Line 550 (f_model_is_inside):
-                ray_dirs(i, k) = point(k) + f_model_random_number(rand_seed) - 0.5_wp
+                ray_dirs(i, k) = f_model_random_number(rand_seed) - 0.5_wp

Line 552:
-            ray_dirs(i, :) = ray_dirs(i, :)/sqrt(sum(ray_dirs(i, :)*ray_dirs(i, :)))
+            ray_dirs(i, :) = ray_dirs(i, :)/max(sqrt(sum(ray_dirs(i, :)*ray_dirs(i, :))), sgm_eps)

Line 608 (f_model_is_inside_flat):
-                dir(k) = point(k) + f_model_random_number(rand_seed) - 0.5_wp
+                dir(k) = f_model_random_number(rand_seed) - 0.5_wp

Lines 610–611:
-            dir_mag = sqrt(dir(1)*dir(1) + dir(2)*dir(2) + dir(3)*dir(3))
-            dir(:) = dir(:)/dir_mag
+            dir_mag = sqrt(dir(1)*dir(1) + dir(2)*dir(2) + dir(3)*dir(3))
+            dir(:) = dir(:)/max(dir_mag, sgm_eps)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/common/m_model.fpp` around lines 498 - 576, The ray direction generation
in f_model_is_inside (and similarly f_model_is_inside_flat) incorrectly adds
point(k) to the random direction (coupling direction to absolute coordinates)
and normalizes without guarding against zero length; fix by computing
ray_dirs(i,k) = f_model_random_number(rand_seed) - 0.5_wp (no point(k) term) and
when normalizing replace direct division by sqrt(sum(...)) with division by
max(sqrt(sum(ray_dirs(i,:)*ray_dirs(i,:))), sgm_eps_wp) (or the project-wide
small epsilon symbol) to avoid divide-by-zero; update both f_model_is_inside and
f_model_is_inside_flat and keep ray_origins assignment as-is using
f_model_random_number(rand_seed).
src/simulation/m_ib_patches.fpp (2)

169-323: ⚠️ Potential issue | 🟠 Major

Add loop-local scalars xa, yc, and dycdxc to the private list in GPU_PARALLEL_LOOP.

Lines 265–266: These three scalars are assigned per-iteration inside the GPU-parallel region (lines 274–280) but are not declared private, creating a race condition where multiple GPU threads write to the same global variables. Update the private clause to private='[i,j,xy_local,k,f,xa,yc,dycdxc]'.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/simulation/m_ib_patches.fpp` around lines 169 - 323, The
GPU_PARALLEL_LOOP in subroutine s_ib_airfoil currently omits loop-local scalars
xa, yc, and dycdxc from its private list, causing a race condition; update the
GPU_PARALLEL_LOOP directive's private clause to include xa, yc, and dycdxc (so
it becomes private='[i,j,xy_local,k,f,xa,yc,dycdxc]') so each GPU thread gets
its own copies when evaluating the airfoil logic that assigns xa, yc, and dycdxc
and writes into ib_markers%sf.

28-112: ⚠️ Potential issue | 🟠 Major

Public API change: s_apply_ib_patches now takes type(integer_field) — all callers verified and updated correctly.

s_apply_ib_patches changed its argument type (line 60). Both call sites in src/simulation/m_ibm.fpp (lines 114, 945) correctly pass type(integer_field) as argument. This breaking signature change affects the public API and is appropriately flagged for review.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/simulation/m_ib_patches.fpp` around lines 28 - 112, s_apply_ib_patches
now expects an argument of type(integer_field) (the signature changed), so
update all call sites and any interface/wrapper that referenced the old
signature; specifically ensure the calls in m_ibm.fpp (the two places that were
already updated) match the new prototype and remove/adjust any legacy
declarations or interface blocks that declared s_apply_ib_patches with the old
argument type, and update any documentation/comments that describe
s_apply_ib_patches to reflect it takes a type(integer_field) argument.
🧹 Nitpick comments (2)
src/common/m_constants.fpp (1)

25-25: Run all three targets after updating shared constants.

This change is in src/common/, so please ensure pre_process, simulation, and post_process targets are exercised to catch any cross-target regressions.
Based on learnings: “Changes to src/common/ affect all three executables (pre_process, simulation, post_process); test all three targets.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/common/m_constants.fpp` at line 25, You changed the shared constant
num_patches_max in src/common/m_constants.fpp which can impact all executables;
rebuild and run the pre_process, simulation, and post_process targets (or their
test suites) to exercise cross-target behavior and catch regressions —
specifically rebuild the three binaries that consume num_patches_max
(pre_process, simulation, post_process), run their full test inputs or
integration runs, and verify no runtime assertions, array bounds, or config
mismatches occur after the change.
src/simulation/m_ibm.fpp (1)

536-586: Use reduction instead of atomic operations to improve scaling for grid-cell classification.

num_gps_local and num_inner_gps_local are currently incremented via GPU_ATOMIC(update) for every classified cell, causing high contention on large grids. Since these are simple sums, use GPU_PARALLEL_LOOP reductions instead:

$:GPU_PARALLEL_LOOP(collapse=3, private='[i,j,k,ii,jj,kk,is_gp]', firstprivate='[gp_layers,gp_layers_z]', &
    & reduction='[[num_gps_local, num_inner_gps_local]]', reductionOp='[+, +]')

Replace the copy directive and remove both GPU_ATOMIC calls. Gang/vector reductions avoid per-thread atomics and scale significantly better.
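
Conceptually, the reduction variant classifies every cell first and then accumulates the counts with a single sum, rather than bumping shared counters cell by cell. A NumPy sketch of that shape (the neighborhood rule below is a simplified stand-in for MFC's actual ghost-point test, and np.roll wraps periodically, which a real implementation would handle at the edges):

import numpy as np

def count_ghost_points(markers, layers=2):
    inner = markers > 0                       # cells inside any immersed boundary
    near_fluid = np.zeros_like(inner)
    for axis in range(inner.ndim):
        for shift in range(1, layers + 1):
            near_fluid |= ~np.roll(inner, shift, axis=axis)
            near_fluid |= ~np.roll(inner, -shift, axis=axis)
    is_gp = inner & near_fluid                # IB cells with fluid within `layers` cells
    num_gps = int(np.count_nonzero(is_gp))               # one reduction, no atomics
    num_inner_gps = int(np.count_nonzero(inner & ~is_gp))
    return num_gps, num_inner_gps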

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/simulation/m_ibm.fpp` around lines 536 - 586, The loop in subroutine
s_find_num_ghost_points uses per-cell GPU_ATOMIC updates on num_gps_local and
num_inner_gps_local causing contention; replace the copy-based GPU_PARALLEL_LOOP
with a reduction-based directive so the counts are accumulated via GPU
reductions instead of atomics (use GPU_PARALLEL_LOOP with collapse=3,
private='[i,j,k,ii,jj,kk,is_gp]', firstprivate='[gp_layers,gp_layers_z]' and a
reduction clause for [num_gps_local, num_inner_gps_local] with '+'), remove the
two GPU_ATOMIC(atomic='update') regions and keep the rest of the
ghost-classification logic unchanged, then assign num_gps_out and
num_inner_gps_out from the reduced locals as before.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/common/m_derived_types.fpp`:
- Around line 186-199: The t_model_array derived type declares scalar fields
boundary_edge_count, total_vertices, interpolate and ntrs without defaults which
can leave garbage values; add default initializers for these scalars in the type
definition (set boundary_edge_count=0, total_vertices=0, interpolate=0, ntrs=0)
so all instances start deterministic before any assignment, and then run and
verify pre_process, post_process and simulation builds/tests to ensure the
initialization change doesn't break their workflows; update any constructors or
code that relied on uninitialized behavior if necessary.

In `@src/common/m_model.fpp`:
- Around line 1209-1267: The subroutine f_distance_normals_3D_flat can index
trs_n(:, tri_idx, pid) with tri_idx==0 when ntrs <= 0; add a guard at the top of
f_distance_normals_3D_flat that checks if ntrs <= 0 and in that case set
distance to a safe value (e.g. a large positive sentinel or 0._wp as
appropriate) and set normals to a defined default (e.g. zero vector) and return
immediately to avoid using tri_idx/uninitialized indices; ensure you reference
and update tri_idx, distance and normals in this early-return path so no
out-of-bounds access to trs_n occurs.
- Around line 20-42: Add a module finalizer in m_model.fpp (e.g., subroutine
s_finalize_model or a MODULE PROCEDURE bound to final) that checks
ALLOCATED(...) and calls DEALLOCATE(...) for all public allocatable arrays
gpu_ntrs, gpu_trs_v, gpu_trs_n, gpu_boundary_v, gpu_interpolated_boundary_v,
gpu_interpolate, gpu_boundary_edge_count, gpu_total_vertices, stl_bounding_boxes
and models to ensure GPU/host memory is freed during simulation cleanup; export
or document the finalizer so the cleanup path in your driver or cleanup routine
calls it. Also remove the unnecessary "use m_model" import from
src/pre_process/m_icpp_patches.fpp (or the specific pre_process file mentioned)
since it does not reference the gpu_* arrays or models, eliminating the unwanted
coupling.
- Around line 1402-1415: s_pack_model_for_gpu currently unconditionally
allocates ma%trs_v and ma%trs_n; make it re-entrant by checking existing
allocations before allocating: set ma%ntrs = ma%model%ntrs, then for ma%trs_v
and ma%trs_n use allocated(...) to detect prior allocation and only allocate if
not allocated or if the current allocation dimensions don't match ma%ntrs (in
that case deallocate then allocate with the correct bounds); ensure you copy
into ma%trs_v(:, :, i) and ma%trs_n(:, i) after the allocation checks complete.

In `@src/simulation/m_ib_patches.fpp`:
- Around line 954-1169: s_instantiate_STL_models has four fixes: allocate
stl_bounding_boxes once before the patch loop with shape (num_ibs,1:3,1:3)
instead of calling allocate(stl_bounding_boxes(patch_id,...)) inside the loop;
ensure total_vertices is initialized or only copied into
models(patch_id)%total_vertices when interpolate is true (do not propagate
uninitialized total_vertices when interpolation is skipped); pass spacing
magnitudes to the interpolation routines by using (/dx_local, dy_local,
dz_local/) when calling f_interpolate_3D/f_interpolate_2D (and remove or fix the
unconditional interpolate = .false. that currently masks interpolation logic);
and protect module-level gpu_* allocations (gpu_ntrs, gpu_trs_v, gpu_trs_n,
gpu_interpolate, gpu_boundary_edge_count, gpu_total_vertices, gpu_boundary_v,
gpu_interpolated_boundary_v) with if (.not. allocated(...)) checks before each
allocate to allow repeated calls to s_instantiate_STL_models.
- Around line 828-953: s_update_mib’s restart path must refresh the
device-resident model arrays before generating markers: ensure
s_instantiate_STL_models() (or an equivalent GPU_ENTER_DATA/refresh routine for
module arrays gpu_ntrs, gpu_trs_v, gpu_trs_n) is called or re-entered before
s_apply_ib_patches() in the restart branch so device data reflects any changes
from s_update_ib_rotation_matrix(); in s_ib_model and s_ib_3d_model, add
explicit device presence to the GPU_PARALLEL_LOOP clauses by including
present='[gpu_ntrs, gpu_trs_v, gpu_trs_n]' (in addition to existing copyin) to
avoid relying on implicit persistence; and narrow the 2D scan in s_ib_model by
reusing the bounding-index logic from s_ib_3d_model (compute corner_distance
from stl_bounding_boxes and call get_bounding_indices to set i/j limits) to
avoid scanning the full grid for sparse STL patches.

In `@toolchain/mfc/case_validator.py`:
- Around line 687-688: The error message passed to self.prohibit for the num_ibs
check is stale (mentions 10) while the actual upper bound is 1000; update the
message to reflect the real max and, to prevent future drift, construct the
message using the authoritative constraint value (e.g. the CONSTRAINTS constant
or the class attribute used for bounds) rather than a hardcoded literal—locate
the self.prohibit call that checks num_ibs (the line with self.prohibit(ib and
(num_ibs <= 0 or num_ibs > 1000), ...)) and interpolate/format the upper-bound
from CONSTRAINTS (or self.CONSTRAINTS / CONSTRAINTS['num_ibs']['max'] as
appropriate) so the message reads "num_ibs must be between 1 and <max>".


Comment on lines 186 to 199
type :: t_model_array
! Original CPU-side fields (unchanged)
type(t_model), allocatable :: model
real(wp), allocatable, dimension(:, :, :) :: boundary_v
real(wp), allocatable, dimension(:, :) :: interpolated_boundary_v
integer :: boundary_edge_count
integer :: total_vertices
-logical :: interpolate
+integer :: interpolate

! GPU-friendly flattened arrays
integer :: ntrs ! copy of model%ntrs
real(wp), allocatable, dimension(:, :, :) :: trs_v ! (3, 3, ntrs) - triangle vertices
real(wp), allocatable, dimension(:, :) :: trs_n ! (3, ntrs) - triangle normals
end type t_model_array

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, check the current state of the file at the specified lines
cat -n src/common/m_derived_types.fpp | sed -n '180,210p'

Repository: MFlowCode/MFC

Length of output: 1607


🏁 Script executed:

# Search for all uses of %interpolate across the codebase
rg -nP '%interpolate\b' src --type-add 'fpp:*.fpp' --type 'fpp'

Repository: MFlowCode/MFC

Length of output: 344


🏁 Script executed:

# Look for suspicious logical patterns with %interpolate
rg -nP '%interpolate\b.*(\.(true|false)|\.not\.)' src --type-add 'fpp:*.fpp' --type 'fpp'

Repository: MFlowCode/MFC

Length of output: 39


🏁 Script executed:

# Check if there are any conditional patterns that may need review
rg -nP 'if\s*\(.*%interpolate' src --type-add 'fpp:*.fpp' --type 'fpp' -A 2

Repository: MFlowCode/MFC

Length of output: 854


🏁 Script executed:

# Check for assignments to %interpolate in the codebase
rg -nP '%interpolate\s*=' src --type-add 'fpp:*.fpp' --type 'fpp' -B 1 -A 1

Repository: MFlowCode/MFC

Length of output: 429


t_model_array scalars must be initialized to prevent garbage values reaching GPU or uninitialized field reads.

All uses of %interpolate correctly employ integer semantics (0/1, no logical operators); however, the type definition still lacks default initialization for boundary_edge_count, total_vertices, interpolate, and ntrs. Uninitialized scalars can leak garbage into GPU staging arrays or cause undefined behavior if any code path reads these fields before assignment.

Apply default initialization
         integer :: boundary_edge_count = 0
         integer :: total_vertices = 0
         integer :: interpolate = 0

         integer :: ntrs = 0  ! copy of model%ntrs

Additionally, verify that changes do not break pre_process and post_process executables, not just simulation.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 type :: t_model_array
     ! Original CPU-side fields (unchanged)
     type(t_model), allocatable :: model
     real(wp), allocatable, dimension(:, :, :) :: boundary_v
     real(wp), allocatable, dimension(:, :) :: interpolated_boundary_v
-    integer :: boundary_edge_count
-    integer :: total_vertices
-    integer :: interpolate
+    integer :: boundary_edge_count = 0
+    integer :: total_vertices = 0
+    integer :: interpolate = 0
     ! GPU-friendly flattened arrays
-    integer :: ntrs ! copy of model%ntrs
+    integer :: ntrs = 0 ! copy of model%ntrs
     real(wp), allocatable, dimension(:, :, :) :: trs_v ! (3, 3, ntrs) - triangle vertices
     real(wp), allocatable, dimension(:, :) :: trs_n ! (3, ntrs) - triangle normals
 end type t_model_array
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/common/m_derived_types.fpp` around lines 186 - 199, The t_model_array
derived type declares scalar fields boundary_edge_count, total_vertices,
interpolate and ntrs without defaults which can leave garbage values; add
default initializers for these scalars in the type definition (set
boundary_edge_count=0, total_vertices=0, interpolate=0, ntrs=0) so all instances
start deterministic before any assignment, and then run and verify pre_process,
post_process and simulation builds/tests to ensure the initialization change
doesn't break their workflows; update any constructors or code that relied on
uninitialized behavior if necessary.

Comment on lines +20 to 42
public :: f_model_read, s_model_write, s_model_free, f_model_is_inside, models, gpu_ntrs, &
gpu_trs_v, gpu_trs_n, gpu_boundary_v, gpu_interpolated_boundary_v, gpu_interpolate, gpu_boundary_edge_count, &
gpu_total_vertices, stl_bounding_boxes

! Subroutines for STL immersed boundaries
public :: f_check_boundary, f_register_edge, f_check_interpolation_2D, &
f_check_interpolation_3D, f_interpolate_2D, f_interpolate_3D, &
f_interpolated_distance, f_normals, f_distance, f_distance_normals_3D, f_tri_area
f_interpolated_distance, f_normals, f_distance, f_distance_normals_3D, f_tri_area, s_pack_model_for_gpu, &
f_model_is_inside_flat, f_distance_normals_3d_flat

!! array of STL models that can be allocated and then used in IB marker and levelset compute
type(t_model_array), allocatable, target :: models(:)
!! GPU-friendly flat arrays for STL model data
integer, allocatable :: gpu_ntrs(:)
real(wp), allocatable, dimension(:, :, :, :) :: gpu_trs_v
real(wp), allocatable, dimension(:, :, :) :: gpu_trs_n
real(wp), allocatable, dimension(:, :, :, :) :: gpu_boundary_v
real(wp), allocatable, dimension(:, :, :) :: gpu_interpolated_boundary_v
integer, allocatable :: gpu_interpolate(:)
integer, allocatable :: gpu_boundary_edge_count(:)
integer, allocatable :: gpu_total_vertices(:)
real(wp), allocatable :: stl_bounding_boxes(:, :, :)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find allocation and deallocation sites for the new module variables
echo "=== Allocation/Deallocation patterns ==="
rg -nP --type fpp '(allocate|deallocate)\s*\(' src/common/m_model.fpp | head -30

echo -e "\n=== GPU array allocations in entire src/common ==="
rg -nP --type fpp 'gpu_ntrs|gpu_trs_v|gpu_trs_n|gpu_boundary_v|gpu_interpolated_boundary_v|gpu_interpolate|gpu_boundary_edge_count|gpu_total_vertices|stl_bounding_boxes' src/common/ | grep -i allocate | head -20

echo -e "\n=== Where m_model is imported ==="
rg -n 'use.*m_model' src/ | head -20

Repository: MFlowCode/MFC

Length of output: 622


🏁 Script executed:

#!/bin/bash
# Check the finalization routine referenced in the public interface
echo "=== s_model_free subroutine definition ==="
rg -A 50 'subroutine\s+s_model_free' src/common/m_model.fpp | head -60

echo -e "\n=== s_model_write subroutine definition ==="
rg -A 30 'subroutine\s+s_model_write' src/common/m_model.fpp | head -40

Repository: MFlowCode/MFC

Length of output: 2632


🏁 Script executed:

#!/bin/bash
# Check if pre_process or post_process reference the new public variables or m_model
echo "=== pre_process usage of m_model ==="
rg -n 'gpu_ntrs|gpu_trs_v|gpu_trs_n|gpu_boundary_v|gpu_interpolated_boundary_v|stl_bounding_boxes|models' src/pre_process/ 2>/dev/null | head -10

echo -e "\n=== post_process usage of m_model ==="
rg -n 'gpu_ntrs|gpu_trs_v|gpu_trs_n|gpu_boundary_v|gpu_interpolated_boundary_v|stl_bounding_boxes|models' src/post_process/ 2>/dev/null | head -10

echo -e "\n=== Check if pre_process or post_process import m_model ==="
rg -n 'use.*m_model' src/pre_process/ src/post_process/ 2>/dev/null

Repository: MFlowCode/MFC

Length of output: 518


🏁 Script executed:

#!/bin/bash
# Get the full m_model.fpp file to understand context
wc -l src/common/m_model.fpp

Repository: MFlowCode/MFC

Length of output: 84


🏁 Script executed:

#!/bin/bash
# Read the module to find allocation patterns
# Focus on where the public arrays are allocated
sed -n '20,100p' src/common/m_model.fpp | cat -n

Repository: MFlowCode/MFC

Length of output: 3661


🏁 Script executed:

#!/bin/bash
# Search for allocation patterns - try broader search
rg -n 'allocate' src/common/m_model.fpp | head -30

Repository: MFlowCode/MFC

Length of output: 786


🏁 Script executed:

#!/bin/bash
# Look for initialization subroutines that might allocate these arrays
rg -n 'subroutine.*init\|subroutine.*allocate\|subroutine.*setup' src/common/m_model.fpp | head -20

Repository: MFlowCode/MFC

Length of output: 39


🏁 Script executed:

#!/bin/bash
# Search for where models(:) is allocated
rg -B 3 -A 3 'models\s*\(' src/common/m_model.fpp | head -50

Repository: MFlowCode/MFC

Length of output: 430


🏁 Script executed:

#!/bin/bash
# Look for gpu_ntrs allocation pattern
rg -B 3 -A 3 'gpu_ntrs\s*\(' src/common/m_model.fpp | head -50

Repository: MFlowCode/MFC

Length of output: 490


🏁 Script executed:

#!/bin/bash
# Check how these are used in simulation to understand lifecycle
echo "=== Usage in m_ibm.fpp ==="
rg -B 2 -A 2 'gpu_ntrs\|gpu_trs_v\|gpu_boundary_v\|models\(' src/simulation/m_ibm.fpp | head -40

echo -e "\n=== Usage in pre_process/m_icpp_patches.fpp ==="
rg -B 2 -A 2 'models\(' src/pre_process/m_icpp_patches.fpp | head -40

Repository: MFlowCode/MFC

Length of output: 132


🏁 Script executed:

#!/bin/bash
# Search for allocation of the MODULE-LEVEL arrays in calling modules
echo "=== Search for allocate of gpu_ntrs in entire codebase ==="
rg 'allocate.*gpu_ntrs\|gpu_ntrs.*allocate' src/ | head -20

echo -e "\n=== Search for s_pack_model_for_gpu ==="
rg -n 's_pack_model_for_gpu' src/ | head -20

echo -e "\n=== Look for where gpu arrays are assigned/populated ==="
rg -n 'gpu_ntrs\s*=' src/ | head -20

Repository: MFlowCode/MFC

Length of output: 591


🏁 Script executed:

#!/bin/bash
# Search for deallocate of module-level arrays
echo "=== Deallocate of module-level gpu arrays ==="
rg 'deallocate.*gpu_ntrs\|deallocate.*gpu_trs_v\|deallocate.*models\(' src/ | head -20

echo -e "\n=== Look for finalization in m_ibm.fpp ==="
rg -n 'finalize\|cleanup\|deallocate' src/simulation/m_ibm.fpp | head -20

Repository: MFlowCode/MFC

Length of output: 146


🏁 Script executed:

#!/bin/bash
# Look at s_pack_model_for_gpu implementation
echo "=== s_pack_model_for_gpu subroutine ==="
rg -A 80 'subroutine\s+s_pack_model_for_gpu' src/common/m_model.fpp | head -100

Repository: MFlowCode/MFC

Length of output: 519


🏁 Script executed:

#!/bin/bash
# Final verification: confirm module-level arrays have no lifecycle
echo "=== Complete scan for any allocation pattern ==="
rg 'allocate\s*\(' src/common/m_model.fpp src/simulation/m_ib_patches.fpp src/simulation/m_ibm.fpp | grep -E 'gpu_ntrs|gpu_trs_v|gpu_boundary_v|models\s*\(' || echo "NO ALLOCATIONS FOUND for public module arrays"

echo -e "\n=== Confirm these are truly module-level (public scope) ==="
rg -B 1 'integer.*allocatable.*gpu_ntrs|real.*allocatable.*gpu_trs' src/common/m_model.fpp

Repository: MFlowCode/MFC

Length of output: 763


🏁 Script executed:

#!/bin/bash
# Search for deallocation of these specific arrays
echo "=== Deallocate patterns for public module arrays ==="
rg -n 'deallocate.*gpu_ntrs\|deallocate.*gpu_trs_v\|deallocate.*gpu_boundary_v\|deallocate.*models' src/ | head -20

echo -e "\n=== Broader search for deallocate in m_ib_patches.fpp ==="
rg -n 'deallocate' src/simulation/m_ib_patches.fpp | head -20

echo -e "\n=== Context around gpu array allocations in m_ib_patches.fpp ==="
rg -B 5 -A 10 'allocate.*gpu_ntrs' src/simulation/m_ib_patches.fpp | head -40

Repository: MFlowCode/MFC

Length of output: 890


🏁 Script executed:

#!/bin/bash
# Check what pre_process actually uses from m_model
echo "=== What does pre_process use from m_model? ==="
sed -n '1,50p' src/pre_process/m_icpp_patches.fpp | grep -A 15 'use m_model'

echo -e "\n=== Any direct usage of gpu_* or models arrays in pre_process? ==="
rg -n 'gpu_ntrs|gpu_trs_v|gpu_boundary_v|models\(' src/pre_process/m_icpp_patches.fpp | head -10

Repository: MFlowCode/MFC

Length of output: 553


Add deallocate() calls for all public module arrays in a finalization routine, and remove unnecessary m_model import from pre_process.

The public module arrays (gpu_ntrs, gpu_trs_v, gpu_trs_n, gpu_interpolate, gpu_boundary_edge_count, gpu_total_vertices) are allocated in m_ib_patches.fpp but never deallocated, causing GPU memory leaks. Create a finalization subroutine in m_model.fpp to deallocate these arrays and call it during simulation cleanup. Additionally, src/pre_process/m_icpp_patches.fpp imports m_model but does not use any of the gpu or models arrays—remove this unnecessary coupling per src/common/ guidelines.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/common/m_model.fpp` around lines 20 - 42, Add a module finalizer in
m_model.fpp (e.g., subroutine s_finalize_model or a MODULE PROCEDURE bound to
final) that checks ALLOCATED(...) and calls DEALLOCATE(...) for all public
allocatable arrays gpu_ntrs, gpu_trs_v, gpu_trs_n, gpu_boundary_v,
gpu_interpolated_boundary_v, gpu_interpolate, gpu_boundary_edge_count,
gpu_total_vertices, stl_bounding_boxes and models to ensure GPU/host memory is
freed during simulation cleanup; export or document the finalizer so the cleanup
path in your driver or cleanup routine calls it. Also remove the unnecessary
"use m_model" import from src/pre_process/m_icpp_patches.fpp (or the specific
pre_process file mentioned) since it does not reference the gpu_* arrays or
models, eliminating the unwanted coupling.

Comment on lines +1209 to +1267
subroutine f_distance_normals_3D_flat(ntrs, trs_v, trs_n, pid, point, normals, distance)

$:GPU_ROUTINE(parallelism='[seq]')

integer, intent(in) :: ntrs
real(wp), dimension(:, :, :, :), intent(in) :: trs_v
real(wp), dimension(:, :, :), intent(in) :: trs_n
integer, intent(in) :: pid
real(wp), dimension(1:3), intent(in) :: point
real(wp), dimension(1:3), intent(out) :: normals
real(wp), intent(out) :: distance

real(wp), dimension(1:3, 1:3) :: tri
real(wp) :: dist_min, dist_t_min
real(wp) :: dist_min_normal, dist_buffer_normal
real(wp), dimension(1:3) :: midp
real(wp), dimension(1:3) :: dist_buffer
integer :: i, j, tri_idx

dist_min = 1.e12_wp
dist_min_normal = 1.e12_wp
distance = 0._wp

tri_idx = 0
do i = 1, ntrs
do j = 1, 3
tri(j, 1) = trs_v(j, 1, i, pid)
tri(j, 2) = trs_v(j, 2, i, pid)
tri(j, 3) = trs_v(j, 3, i, pid)
dist_buffer(j) = sqrt((point(1) - tri(j, 1))**2 + &
(point(2) - tri(j, 2))**2 + &
(point(3) - tri(j, 3))**2)
end do

do j = 1, 3
midp(j) = (tri(1, j) + tri(2, j) + tri(3, j))/3
end do

dist_t_min = minval(dist_buffer(1:3))
dist_buffer_normal = sqrt((point(1) - midp(1))**2 + &
(point(2) - midp(2))**2 + &
(point(3) - midp(3))**2)

if (dist_t_min < dist_min) then
dist_min = dist_t_min
end if

if (dist_buffer_normal < dist_min_normal) then
dist_min_normal = dist_buffer_normal
tri_idx = i
end if
end do

normals(1) = trs_n(1, tri_idx, pid)
normals(2) = trs_n(2, tri_idx, pid)
normals(3) = trs_n(3, tri_idx, pid)
distance = dist_min

end subroutine f_distance_normals_3D_flat

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

rg -nP --type-add 'fpp:*.fpp' --type 'fpp' '\bf_distance_normals_3D_flat\s*\(' src

Repository: MFlowCode/MFC

Length of output: 345


🏁 Script executed:

sed -n '680,700p' src/simulation/m_compute_levelset.fpp

Repository: MFlowCode/MFC

Length of output: 913


🏁 Script executed:

rg -nP 'gpu_ntrs' src/simulation/m_compute_levelset.fpp | head -20

Repository: MFlowCode/MFC

Length of output: 186


🏁 Script executed:

rg -nP '\bgpu_ntrs\b' src --type-add 'fpp:*.fpp' | grep -E '(allocate|dimension|intent|gpu_ntrs\s*=)'

Repository: MFlowCode/MFC

Length of output: 206


🏁 Script executed:

sed -n '1110,1160p' src/simulation/m_ib_patches.fpp

Repository: MFlowCode/MFC

Length of output: 2729


🏁 Script executed:

sed -n '650,710p' src/simulation/m_compute_levelset.fpp

Repository: MFlowCode/MFC

Length of output: 2630


Add guard in f_distance_normals_3D_flat for ntrs <= 0 to prevent out-of-bounds access.

When ntrs is 0 (which can occur if a patch has no model allocated), tri_idx keeps its initial value of 0 and is then used to index trs_n(:, tri_idx, pid) at the end, causing an out-of-bounds read. The caller at src/simulation/m_compute_levelset.fpp:688 does not validate gpu_ntrs(patch_id) > 0 before invoking this function.
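
A minimal sketch of the guard, placed immediately after the local declarations; the zero normal and the large sentinel distance are assumptions, and any defined defaults would do:

    ! No packed triangles for this patch: return defined values instead of
    ! indexing trs_n with tri_idx = 0.
    if (ntrs <= 0) then
        normals(1:3) = 0._wp
        distance = 1.e12_wp  ! large sentinel so this patch never wins a nearest-surface test
        return
    end if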

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/common/m_model.fpp` around lines 1209 - 1267, The subroutine
f_distance_normals_3D_flat can index trs_n(:, tri_idx, pid) with tri_idx==0 when
ntrs <= 0; add a guard at the top of f_distance_normals_3D_flat that checks if
ntrs <= 0 and in that case set distance to a safe value (e.g. a large positive
sentinel or 0._wp as appropriate) and set normals to a defined default (e.g.
zero vector) and return immediately to avoid using tri_idx/uninitialized
indices; ensure you reference and update tri_idx, distance and normals in this
early-return path so no out-of-bounds access to trs_n occurs.

Comment on lines +1402 to +1415
subroutine s_pack_model_for_gpu(ma)
type(t_model_array), intent(inout) :: ma
integer :: i

ma%ntrs = ma%model%ntrs
allocate (ma%trs_v(1:3, 1:3, 1:ma%ntrs))
allocate (ma%trs_n(1:3, 1:ma%ntrs))

do i = 1, ma%ntrs
ma%trs_v(:, :, i) = ma%model%trs(i)%v(:, :)
ma%trs_n(:, i) = ma%model%trs(i)%n(:)
end do
end subroutine

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find all call sites to s_pack_model_for_gpu
rg -nP --type-add 'fpp:*.fpp' --type 'fpp' '\bs_pack_model_for_gpu\s*\(' src || true

Repository: MFlowCode/MFC

Length of output: 220


🏁 Script executed:

#!/bin/bash
# Also search for the subroutine definition and nearby code
rg -nP --type-add 'fpp:*.fpp' --type 'fpp' -B 5 -A 20 'subroutine s_pack_model_for_gpu' src || true

Repository: MFlowCode/MFC

Length of output: 1111


🏁 Script executed:

#!/bin/bash
# Check if ma%trs_v or ma%trs_n are referenced elsewhere to understand lifecycle
rg -nP --type-add 'fpp:*.fpp' --type 'fpp' 'ma%trs_v|ma%trs_n' src || true

Repository: MFlowCode/MFC

Length of output: 367


🏁 Script executed:

#!/bin/bash
# Get context around the call site in m_ib_patches.fpp
sed -n '1080,1120p' src/simulation/m_ib_patches.fpp

Repository: MFlowCode/MFC

Length of output: 1850


🏁 Script executed:

#!/bin/bash
# Search for surrounding loop context (broader range)
sed -n '1050,1130p' src/simulation/m_ib_patches.fpp

Repository: MFlowCode/MFC

Length of output: 4004


🏁 Script executed:

#!/bin/bash
# Check if models is deallocated/re-initialized somewhere
rg -nP --type-add 'fpp:*.fpp' --type 'fpp' 'models\s*\(|deallocate.*models' src/simulation/ | head -30

Repository: MFlowCode/MFC

Length of output: 3130


🏁 Script executed:

#!/bin/bash
# Check if s_instantiate_STL_models or models are re-initialized in any time-stepping/simulation loop
rg -nP --type-add 'fpp:*.fpp' --type 'fpp' 's_instantiate_STL_models|allocate.*models\(' src/simulation/ | grep -v '\.fpp-' | head -20

Repository: MFlowCode/MFC

Length of output: 1135


🏁 Script executed:

#!/bin/bash
# Search for any re-initialization or deallocate of models during simulation
rg -nP --type-add 'fpp:*.fpp' --type 'fpp' 'deallocate.*models|models.*=|trs_v.*deallocate|trs_n.*deallocate' src | head -20

Repository: MFlowCode/MFC

Length of output: 1318


Make s_pack_model_for_gpu re-entrant by checking allocations before allocating.

The function is called only once per object during initialization (in s_instantiate_STL_models), so there is currently no runtime risk. However, the unconditional allocations of ma%trs_v and ma%trs_n violate defensive-programming practice and would raise a runtime allocation error if the function were ever called again on the same object. Add pre-allocation checks:

Proposed fix
    ma%ntrs = ma%model%ntrs
+   if (allocated(ma%trs_v)) deallocate(ma%trs_v)
+   if (allocated(ma%trs_n)) deallocate(ma%trs_n)
    allocate (ma%trs_v(1:3, 1:3, 1:ma%ntrs))
    allocate (ma%trs_n(1:3, 1:ma%ntrs))
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
subroutine s_pack_model_for_gpu(ma)
type(t_model_array), intent(inout) :: ma
integer :: i
ma%ntrs = ma%model%ntrs
allocate (ma%trs_v(1:3, 1:3, 1:ma%ntrs))
allocate (ma%trs_n(1:3, 1:ma%ntrs))
do i = 1, ma%ntrs
ma%trs_v(:, :, i) = ma%model%trs(i)%v(:, :)
ma%trs_n(:, i) = ma%model%trs(i)%n(:)
end do
end subroutine
subroutine s_pack_model_for_gpu(ma)
type(t_model_array), intent(inout) :: ma
integer :: i
ma%ntrs = ma%model%ntrs
if (allocated(ma%trs_v)) deallocate(ma%trs_v)
if (allocated(ma%trs_n)) deallocate(ma%trs_n)
allocate (ma%trs_v(1:3, 1:3, 1:ma%ntrs))
allocate (ma%trs_n(1:3, 1:ma%ntrs))
do i = 1, ma%ntrs
ma%trs_v(:, :, i) = ma%model%trs(i)%v(:, :)
ma%trs_n(:, i) = ma%model%trs(i)%n(:)
end do
end subroutine
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/common/m_model.fpp` around lines 1402 - 1415, s_pack_model_for_gpu
currently unconditionally allocates ma%trs_v and ma%trs_n; make it re-entrant by
checking existing allocations before allocating: set ma%ntrs = ma%model%ntrs,
then for ma%trs_v and ma%trs_n use allocated(...) to detect prior allocation and
only allocate if not allocated or if the current allocation dimensions don't
match ma%ntrs (in that case deallocate then allocate with the correct bounds);
ensure you copy into ma%trs_v(:, :, i) and ma%trs_n(:, i) after the allocation
checks complete.

Comment on lines +828 to +953
!> The STL patch is a 2D geometry that is imported from an STL file.
!! @param patch_id is the patch identifier
!! @param ib_markers_sf Array to track patch ids
!! @param STL_levelset STL levelset
!! @param STL_levelset_norm STL levelset normals
subroutine s_ib_model(patch_id, ib_markers_sf)
!! @param ib_markers Array to track patch ids
subroutine s_ib_model(patch_id, ib_markers)

integer, intent(in) :: patch_id
integer, dimension(0:m, 0:n, 0:p), intent(inout) :: ib_markers_sf

integer :: i, j, k !< Generic loop iterators
type(integer_field), intent(inout) :: ib_markers

type(t_model), pointer :: model
integer :: i, j, k !< Generic loop iterators
integer :: spc

real(wp) :: eta
real(wp) :: eta, threshold
real(wp), dimension(1:3) :: point, local_point, offset
real(wp), dimension(1:3) :: center, xyz_local
real(wp), dimension(1:3) :: center, xy_local
real(wp), dimension(1:3, 1:3) :: inverse_rotation

model => models(patch_id)%model
center = 0._wp
center(1) = patch_ib(patch_id)%x_centroid
center(2) = patch_ib(patch_id)%y_centroid
if (p > 0) center(3) = patch_ib(patch_id)%z_centroid
inverse_rotation(:, :) = patch_ib(patch_id)%rotation_matrix_inverse(:, :)
offset(:) = patch_ib(patch_id)%centroid_offset(:)
spc = patch_ib(patch_id)%model_spc
threshold = patch_ib(patch_id)%model_threshold

do i = 0, m
do j = 0, n
do k = 0, p
$:GPU_PARALLEL_LOOP(private='[i,j, xy_local, eta]',&
& copyin='[patch_id,center,inverse_rotation, offset, spc, threshold]', collapse=2)
do i = -gp_layers, m + gp_layers
do j = -gp_layers, n + gp_layers

xyz_local = [x_cc(i) - center(1), y_cc(j) - center(2), 0._wp]
if (p > 0) then
xyz_local(3) = z_cc(k) - center(3)
end if
xy_local = [x_cc(i) - center(1), y_cc(j) - center(2), 0._wp]
xy_local = matmul(inverse_rotation, xy_local)
xy_local = xy_local - offset

if (grid_geometry == 3) then
xy_local = f_convert_cyl_to_cart(xy_local)
end if

eta = f_model_is_inside_flat(gpu_ntrs(patch_id), &
gpu_trs_v, gpu_trs_n, &
patch_id, &
xy_local, (/dx(i), dy(j), 0._wp/), &
spc)

! Reading STL boundary vertices and compute the levelset and levelset_norm
if (eta > threshold) then
ib_markers%sf(i, j, 0) = patch_id
end if

end do
end do
$:END_GPU_PARALLEL_LOOP()

end subroutine s_ib_model

!> The STL patch is a 3D geometry that is imported from an STL file.
!! @param patch_id is the patch identifier
!! @param ib_markers Array to track patch ids
subroutine s_ib_3d_model(patch_id, ib_markers)

integer, intent(in) :: patch_id
type(integer_field), intent(inout) :: ib_markers

integer :: i, j, k, il, ir, jl, jr, kl, kr !< Generic loop iterators
integer :: spc

real(wp) :: eta, threshold, corner_distance
real(wp), dimension(1:3) :: point, local_point, offset
real(wp), dimension(1:3) :: center, xyz_local
real(wp), dimension(1:3, 1:3) :: inverse_rotation

center = 0._wp
center(1) = patch_ib(patch_id)%x_centroid
center(2) = patch_ib(patch_id)%y_centroid
center(3) = patch_ib(patch_id)%z_centroid
inverse_rotation(:, :) = patch_ib(patch_id)%rotation_matrix_inverse(:, :)
offset(:) = patch_ib(patch_id)%centroid_offset(:)
spc = patch_ib(patch_id)%model_spc
threshold = patch_ib(patch_id)%model_threshold

il = -gp_layers
jl = -gp_layers
kl = -gp_layers
ir = m + gp_layers
jr = n + gp_layers
kr = p + gp_layers
corner_distance = 0._wp
do i = 1, 3
corner_distance = corner_distance + maxval(abs(stl_bounding_boxes(patch_id, i, 1:3)))**2 ! distance to rim of cylinder
end do
corner_distance = sqrt(corner_distance)
call get_bounding_indices(center(1) - corner_distance, center(1) + corner_distance, x_cc, il, ir)
call get_bounding_indices(center(2) - corner_distance, center(2) + corner_distance, y_cc, jl, jr)
call get_bounding_indices(center(3) - corner_distance, center(3) + corner_distance, z_cc, kl, kr)

$:GPU_PARALLEL_LOOP(private='[i,j,k, xyz_local, eta]',&
& copyin='[patch_id,center,inverse_rotation, offset, spc, threshold]', collapse=3)
do i = il, ir
do j = jl, jr
do k = kl, kr

xyz_local = [x_cc(i) - center(1), y_cc(j) - center(2), z_cc(k) - center(3)]
xyz_local = matmul(inverse_rotation, xyz_local)
xyz_local = xyz_local - offset

if (grid_geometry == 3) then
xyz_local = f_convert_cyl_to_cart(xyz_local)
end if

if (p == 0) then
eta = f_model_is_inside(model, xyz_local, (/dx(i), dy(j), 0._wp/), patch_ib(patch_id)%model_spc)
else
eta = f_model_is_inside(model, xyz_local, (/dx(i), dy(j), dz(k)/), patch_ib(patch_id)%model_spc)
end if
eta = f_model_is_inside_flat(gpu_ntrs(patch_id), &
gpu_trs_v, gpu_trs_n, &
patch_id, &
xyz_local, (/dx(i), dy(j), dz(k)/), &
spc)

! Reading STL boundary vertices and compute the levelset and levelset_norm
if (eta > patch_ib(patch_id)%model_threshold) then
ib_markers_sf(i, j, k) = patch_id
ib_markers%sf(i, j, k) = patch_id
end if

end do
end do
end do
$:END_GPU_PARALLEL_LOOP()

end subroutine s_ib_model
end subroutine s_ib_3d_model

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# 1. Find s_instantiate_STL_models and s_apply_ib_patches calls to check ordering
echo "=== Looking for s_instantiate_STL_models and s_apply_ib_patches calls ==="
rg -n 's_instantiate_STL_models|s_apply_ib_patches' src/simulation/ --type fpp

echo -e "\n=== Checking if they're in correct order within m_ibm.fpp ==="
rg -n 's_instantiate_STL_models\|s_apply_ib_patches' src/simulation/m_ibm.fpp -B 2 -A 2

Repository: MFlowCode/MFC

Length of output: 238


🏁 Script executed:

#!/bin/bash
# 2. Find where gpu_ntrs/gpu_trs_v/gpu_trs_n are allocated and managed
echo "=== Finding gpu_ntrs/gpu_trs_v/gpu_trs_n allocations and GPU_ENTER_DATA ==="
rg -n 'allocate.*gpu_ntrs|allocate.*gpu_trs_v|allocate.*gpu_trs_n' src/simulation/ --type fpp

echo -e "\n=== Searching for GPU_ENTER_DATA calls with these variables ==="
rg -n 'GPU_ENTER_DATA' src/simulation/m_ib_patches.fpp -A 5 || rg -n 'GPU_ENTER_DATA' src/simulation/m_ibm.fpp -A 5

Repository: MFlowCode/MFC

Length of output: 821


🏁 Script executed:

#!/bin/bash
# 3. Verify bounding-box logic in both functions
echo "=== Checking s_ib_model for bounding-box logic ==="
sed -n '828,875p' src/simulation/m_ib_patches.fpp | grep -n 'il\|ir\|jl\|jr\|bounding' || echo "No bounding-box narrowing found in s_ib_model"

echo -e "\n=== Checking s_ib_3d_model for bounding-box logic ==="
sed -n '877,953p' src/simulation/m_ib_patches.fpp | grep -n 'il\|ir\|jl\|jr\|kl\|kr\|bounding'

Repository: MFlowCode/MFC

Length of output: 1111


🏁 Script executed:

#!/bin/bash
# 4. Check data movement for gpu_ntrs/gpu_trs_v/gpu_trs_n in the GPU loop context
echo "=== Checking GPU_PARALLEL_LOOP copyin clause for these variables ==="
sed -n '828,875p' src/simulation/m_ib_patches.fpp | grep -A 2 'GPU_PARALLEL_LOOP'

echo -e "\n=== In s_ib_3d_model ==="
sed -n '877,953p' src/simulation/m_ib_patches.fpp | grep -A 2 'GPU_PARALLEL_LOOP'

echo -e "\n=== Checking if gpu_trs_v/gpu_trs_n are explicitly referenced ==="
rg -n 'gpu_trs_v|gpu_trs_n' src/simulation/m_ib_patches.fpp | head -20

Repository: MFlowCode/MFC

Length of output: 1440


🏁 Script executed:

#!/bin/bash
# Check the broader context of s_apply_ib_patches and see if GPU_ENTER_DATA creates persistent region
rg -n 's_apply_ib_patches' src/simulation/m_ibm.fpp -B 10 -A 20 | head -80

Repository: MFlowCode/MFC

Length of output: 2981


🏁 Script executed:

#!/bin/bash
# Look for GPU_DATA wrapping context around the kernel calls
rg -n 'GPU_DATA|GPU_ENTER_DATA|GPU_EXIT_DATA' src/simulation/m_ibm.fpp -B 2 -A 2

Repository: MFlowCode/MFC

Length of output: 493


🏁 Script executed:

#!/bin/bash
# Check f_model_is_inside_flat signature to see if it expects device arrays
rg -n 'subroutine\s+f_model_is_inside_flat|function\s+f_model_is_inside_flat' src/simulation/ -A 15

Repository: MFlowCode/MFC

Length of output: 39


🏁 Script executed:

#!/bin/bash
# Verify where GPU_ENTER_DATA is called and what scope it governs
rg -n 'subroutine s_instantiate_STL_models' src/simulation/m_ib_patches.fpp -A 300 | tail -150

Repository: MFlowCode/MFC

Length of output: 6345


🏁 Script executed:

#!/bin/bash
# Check if there's a persistent GPU_ENTER_DATA region wrapping both calls
# Look for GPU_ENTER_DATA without matching GPU_EXIT_DATA
rg -n 'GPU_ENTER_DATA|GPU_EXIT_DATA' src/simulation/m_ibm.fpp | head -40

Repository: MFlowCode/MFC

Length of output: 185


🏁 Script executed:

#!/bin/bash
# Verify the exact scope of GPU_ENTER_DATA at line 1160 - is it persistent?
sed -n '1140,1170p' src/simulation/m_ib_patches.fpp

Repository: MFlowCode/MFC

Length of output: 1881


🏁 Script executed:

#!/bin/bash
# Check where gpu_ntrs/gpu_trs_v/gpu_trs_n might be declared with present() or accessed post-ENTER_DATA
rg -n 'present\|gpu_ntrs\|gpu_trs_v\|gpu_trs_n' src/simulation/m_ib_patches.fpp | grep -A 3 -B 3 'present'

Repository: MFlowCode/MFC

Length of output: 39


🏁 Script executed:

#!/bin/bash
# Search for how GPU data persistence is managed - look for surrounding GPU_DATA or block context
sed -n '1100,1170p' src/simulation/m_ib_patches.fpp | head -80

Repository: MFlowCode/MFC

Length of output: 3699


🏁 Script executed:

#!/bin/bash
# Check if there's a GPU_EXIT_DATA anywhere in the file
rg -n 'GPU_EXIT_DATA' src/simulation/m_ib_patches.fpp || echo "No GPU_EXIT_DATA found"

# Check broader context around GPU_ENTER_DATA call at line 1160
sed -n '1090,1170p' src/simulation/m_ib_patches.fpp

Repository: MFlowCode/MFC

Length of output: 4049


🏁 Script executed:

#!/bin/bash
# Look for how f_model_is_inside_flat uses the arrays - check if it's marked GPU_ROUTINE
rg -n 'f_model_is_inside_flat' src/simulation/ -A 5 | head -40

Repository: MFlowCode/MFC

Length of output: 1164


🏁 Script executed:

#!/bin/bash
# Check the GPU macro definitions to understand data region semantics
fd 'gpuParallelization\|gpu.*\.fpp' src/simulation/ && cat src/simulation/gpuParallelization.fpp 2>/dev/null | head -200 || echo "File not found, searching for macro definitions"
rg -n '@:GPU_PARALLEL_LOOP|@:GPU_DATA|@:GPU_ENTER_DATA' src/ -B 2 | head -50

Repository: MFlowCode/MFC

Length of output: 39


🏁 Script executed:

#!/bin/bash
# Check if there's a broad GPU_DATA or GPU context managing device persistence around the entire flow
rg -n 'GPU_DATA|GPU_ENTER_DATA|GPU_EXIT_DATA|GPU_ROUTINE' src/simulation/m_ibm.fpp | head -20

Repository: MFlowCode/MFC

Length of output: 280


🏁 Script executed:

#!/bin/bash
# Look at the main timestepping loop structure to understand device data persistence scope
rg -n 'subroutine s_time_step\|end subroutine s_time_step' src/simulation/m_ibm.fpp -A 300 | head -150

Repository: MFlowCode/MFC

Length of output: 39


🏁 Script executed:

#!/bin/bash
# Check if the GPU arrays like gpu_trs_v are declared as module-level allocatables that persist
rg -n 'gpu_ntrs|gpu_trs_v|gpu_trs_n' src/simulation/m_ib_patches.fpp | grep -i 'allocatable\|dimension' | head -10

Repository: MFlowCode/MFC

Length of output: 39


🏁 Script executed:

#!/bin/bash
# Search for module-level declarations of gpu arrays
rg -n 'allocatable.*gpu_ntrs|allocatable.*gpu_trs_v|allocatable.*gpu_trs_n' src/simulation/m_ib_patches.fpp -B 2 -A 2

Repository: MFlowCode/MFC

Length of output: 39


🏁 Script executed:

#!/bin/bash
# Find where these gpu arrays are declared at module scope
rg -n 'integer.*allocatable|real.*allocatable' src/simulation/m_ib_patches.fpp | grep -i gpu

Repository: MFlowCode/MFC

Length of output: 39


🏁 Script executed:

#!/bin/bash
# Get the entire module-level variable declarations section
sed -n '1,100p' src/simulation/m_ib_patches.fpp | grep -n 'allocatable\|dimension'

Repository: MFlowCode/MFC

Length of output: 39


🏁 Script executed:

#!/bin/bash
# Search for the module declaration section more broadly
head -200 src/simulation/m_ib_patches.fpp | tail -100

Repository: MFlowCode/MFC

Length of output: 3972


🏁 Script executed:

#!/bin/bash
# Check module-level variable declarations in m_ib_patches
grep -n 'module m_ib_patches' src/simulation/m_ib_patches.fpp -A 50 | head -60

Repository: MFlowCode/MFC

Length of output: 2313


🏁 Script executed:

#!/bin/bash
# Verify whether gpu_trs_v/gpu_trs_n require present() or explicit data movement in nested GPU regions
# Look for documentation or examples in gpuParallelization guidance
rg -n 'present\|GPU_PARALLEL_LOOP' src/simulation/m_ibm.fpp -B 2 -A 3 | head -80

Repository: MFlowCode/MFC

Length of output: 39


🏁 Script executed:

#!/bin/bash
# Check if any other GPU_PARALLEL_LOOP kernels use variables allocated via GPU_ENTER_DATA elsewhere
rg -n 'GPU_PARALLEL_LOOP' src/simulation/ -A 5 | grep -A 5 'copyin\|present' | head -60

Repository: MFlowCode/MFC

Length of output: 4582


🏁 Script executed:

#!/bin/bash
# Critical check: verify gpu_ntrs/gpu_trs_v/gpu_trs_n are referenced but NOT in data clauses
echo "=== In s_ib_model GPU_PARALLEL_LOOP (line ~850) ==="
sed -n '850,870p' src/simulation/m_ib_patches.fpp

echo -e "\n=== In s_ib_3d_model GPU_PARALLEL_LOOP (line ~920) ==="
sed -n '920,940p' src/simulation/m_ib_patches.fpp

echo -e "\n=== Checking for 'present' in GPU_PARALLEL_LOOP patterns ==="
rg -n 'present=' src/simulation/ | head -10

Repository: MFlowCode/MFC

Length of output: 2233


🏁 Script executed:

#!/bin/bash
# Check if gpu_ntrs/gpu_trs_v/gpu_trs_n are module-level in m_model or elsewhere
fd -e fpp -e f90 | xargs grep -l 'gpu_ntrs\|gpu_trs_v' | head -5
rg -n 'integer.*allocatable.*gpu_ntrs|real.*allocatable.*gpu_trs' src/ -B 2 | head -30

Repository: MFlowCode/MFC

Length of output: 549


🏁 Script executed:

#!/bin/bash
# Check if s_instantiate_STL_models is called before s_apply_ib_patches in BOTH initial and restart paths
echo "=== Initial setup (lines ~100-120) ==="
sed -n '100,120p' src/simulation/m_ibm.fpp

echo -e "\n=== Restart/update path (lines ~935-950) ==="
sed -n '935,950p' src/simulation/m_ibm.fpp

Repository: MFlowCode/MFC

Length of output: 1505


🏁 Script executed:

#!/bin/bash
# Verify that GPU arrays are module-level and persisted
echo "=== Module scope declaration in m_model.fpp ==="
sed -n '30,40p' src/common/m_model.fpp

echo -e "\n=== Confirm they're never deallocated ==="
rg -n 'deallocate.*gpu_ntrs|deallocate.*gpu_trs_v|deallocate.*gpu_trs_n' src/

Repository: MFlowCode/MFC

Length of output: 810


Critical: GPU model data not refreshed in s_update_mib restart path; also declare persistent GPU arrays explicitly in kernel data clauses.

The gpu_ntrs/gpu_trs_v/gpu_trs_n arrays (module-level in m_model.fpp) are allocated and pushed to device via GPU_ENTER_DATA in s_instantiate_STL_models() with persistent lifetime. However, in the restart path (s_update_mib), s_apply_ib_patches() is called at line 945 without re-calling s_instantiate_STL_models() to refresh GPU data. If s_update_ib_rotation_matrix() modifies model state, the device arrays will be stale on the next marker generation.

Additionally, both s_ib_model and s_ib_3d_model reference gpu_ntrs/gpu_trs_v/gpu_trs_n inside GPU_PARALLEL_LOOP but do not explicitly declare them in the copyin or present clauses—they rely on implicit device persistence. For clarity and robustness, add present='[gpu_ntrs, gpu_trs_v, gpu_trs_n]' to the loop declarations.
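
A sketch of the 3D loop header with the explicit presence declaration; whether GPU_PARALLEL_LOOP forwards a present argument in exactly this form is an assumption to check against the macro definition:

    $:GPU_PARALLEL_LOOP(private='[i,j,k, xyz_local, eta]',&
        & present='[gpu_ntrs, gpu_trs_v, gpu_trs_n]',&
        & copyin='[patch_id, center, inverse_rotation, offset, spc, threshold]', collapse=3)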

Finally, s_ib_model (2D) scans the full grid; adding the same bounding-box narrowing logic from s_ib_3d_model would improve performance for sparse STL objects.
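
A sketch of the 2D narrowing, mirroring s_ib_3d_model; il, ir, jl, jr, and corner_distance would need local declarations matching the 3D routine, and summing only the x and y extents of stl_bounding_boxes for the corner distance is an assumption about how 2D patches store their bounds:

    ! Default to the full ghost-padded extent, then shrink to the patch's bounding box.
    il = -gp_layers; ir = m + gp_layers
    jl = -gp_layers; jr = n + gp_layers
    corner_distance = 0._wp
    do i = 1, 2
        corner_distance = corner_distance + maxval(abs(stl_bounding_boxes(patch_id, i, 1:3)))**2
    end do
    corner_distance = sqrt(corner_distance)
    call get_bounding_indices(center(1) - corner_distance, center(1) + corner_distance, x_cc, il, ir)
    call get_bounding_indices(center(2) - corner_distance, center(2) + corner_distance, y_cc, jl, jr)
    ! The parallel loop then runs i = il, ir and j = jl, jr instead of the full range.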

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/simulation/m_ib_patches.fpp` around lines 828 - 953, s_update_mib’s
restart path must refresh the device-resident model arrays before generating
markers: ensure s_instantiate_STL_models() (or an equivalent
GPU_ENTER_DATA/refresh routine for module arrays gpu_ntrs, gpu_trs_v, gpu_trs_n)
is called or re-entered before s_apply_ib_patches() in the restart branch so
device data reflects any changes from s_update_ib_rotation_matrix(); in
s_ib_model and s_ib_3d_model, add explicit device presence to the
GPU_PARALLEL_LOOP clauses by including present='[gpu_ntrs, gpu_trs_v,
gpu_trs_n]' (in addition to existing copyin) to avoid relying on implicit
persistence; and narrow the 2D scan in s_ib_model by reusing the bounding-index
logic from s_ib_3d_model (compute corner_distance from stl_bounding_boxes and
call get_bounding_indices to set i/j limits) to avoid scanning the full grid for
sparse STL patches.

Comment on lines +954 to +1169
subroutine s_instantiate_STL_models()

! Variables for IBM+STL
real(wp) :: normals(1:3) !< Boundary normal buffer
integer :: boundary_vertex_count, boundary_edge_count, total_vertices !< Boundary vertex
real(wp), allocatable, dimension(:, :, :) :: boundary_v !< Boundary vertex buffer
real(wp), allocatable, dimension(:, :) :: interpolated_boundary_v !< Interpolated vertex buffer
real(wp) :: dx_local, dy_local, dz_local !< Levelset distance buffer
logical :: interpolate !< Logical variable to determine whether or not the model should be interpolated

integer :: i, j, k !< Generic loop iterators
integer :: patch_id

type(t_bbox) :: bbox, bbox_old
type(t_model) :: model
type(ic_model_parameters) :: params

real(wp) :: eta
real(wp), dimension(1:3) :: point, model_center
real(wp) :: grid_mm(1:3, 1:2)

real(wp), dimension(1:4, 1:4) :: transform, transform_n

dx_local = minval(dx); dy_local = minval(dy)
if (p /= 0) dz_local = minval(dz)

do patch_id = 1, num_ibs
if (patch_ib(patch_id)%geometry == 5 .or. patch_ib(patch_id)%geometry == 12) then
allocate (models(patch_id)%model)
print *, " * Reading model: "//trim(patch_ib(patch_id)%model_filepath)

model = f_model_read(patch_ib(patch_id)%model_filepath)
params%scale(:) = patch_ib(patch_id)%model_scale(:)
params%translate(:) = patch_ib(patch_id)%model_translate(:)
params%rotate(:) = patch_ib(patch_id)%model_rotate(:)
params%spc = patch_ib(patch_id)%model_spc
params%threshold = patch_ib(patch_id)%model_threshold

if (f_approx_equal(dot_product(params%scale, params%scale), 0._wp)) then
params%scale(:) = 1._wp
end if

if (proc_rank == 0) then
print *, " * Transforming model."
end if

! Get the model center before transforming the model
bbox_old = f_create_bbox(model)
model_center(1:3) = (bbox_old%min(1:3) + bbox_old%max(1:3))/2._wp

! Compute the transform matrices for vertices and normals
transform = f_create_transform_matrix(params, model_center)
transform_n = f_create_transform_matrix(params)

call s_transform_model(model, transform, transform_n)

! Recreate the bounding box after transformation
bbox = f_create_bbox(model)

! Show the number of vertices in the original STL model
if (proc_rank == 0) then
print *, ' * Number of input model vertices:', 3*model%ntrs
end if

call f_check_boundary(model, boundary_v, boundary_vertex_count, boundary_edge_count)

! Check if the model needs interpolation
if (p > 0) then
call f_check_interpolation_3D(model, (/dx_local, dy_local, dz_local/), interpolate)
else
call f_check_interpolation_2D(boundary_v, boundary_edge_count, (/dx_local, dy_local, 0._wp/), interpolate)
end if
interpolate = .false.

! Show the number of edges and boundary edges in 2D STL models
if (proc_rank == 0 .and. p == 0) then
print *, ' * Number of 2D model boundary edges:', boundary_edge_count
end if

! Interpolate the STL model along the edges (2D) and on triangle facets (3D)
if (interpolate) then
if (proc_rank == 0) then
print *, ' * Interpolating STL vertices.'
end if

if (p > 0) then
call f_interpolate_3D(model, (/dx, dy, dz/), interpolated_boundary_v, total_vertices)
else
call f_interpolate_2D(boundary_v, boundary_edge_count, (/dx, dy, dz/), interpolated_boundary_v, total_vertices)
end if

if (proc_rank == 0) then
print *, ' * Total number of interpolated boundary vertices:', total_vertices
end if
end if

if (proc_rank == 0) then
write (*, "(A, 3(2X, F20.10))") " > Model: Min:", bbox%min(1:3)
write (*, "(A, 3(2X, F20.10))") " > Cen:", (bbox%min(1:3) + bbox%max(1:3))/2._wp
write (*, "(A, 3(2X, F20.10))") " > Max:", bbox%max(1:3)

grid_mm(1, :) = (/minval(x_cc(0:m)) - 0.5_wp*dx_local, maxval(x_cc(0:m)) + 0.5_wp*dx_local/)
grid_mm(2, :) = (/minval(y_cc(0:n)) - 0.5_wp*dy_local, maxval(y_cc(0:n)) + 0.5_wp*dy_local/)

if (p > 0) then
grid_mm(3, :) = (/minval(z_cc(0:p)) - 0.5_wp*dz_local, maxval(z_cc(0:p)) + 0.5_wp*dz_local/)
else
grid_mm(3, :) = (/0._wp, 0._wp/)
end if

write (*, "(A, 3(2X, F20.10))") " > Domain: Min:", grid_mm(:, 1)
write (*, "(A, 3(2X, F20.10))") " > Cen:", (grid_mm(:, 1) + grid_mm(:, 2))/2._wp
write (*, "(A, 3(2X, F20.10))") " > Max:", grid_mm(:, 2)
end if

allocate (stl_bounding_boxes(patch_id, 1:3, 1:3))
stl_bounding_boxes(patch_id, 1, 1:3) = [bbox%min(1), (bbox%min(1) + bbox%max(1))/2._wp, bbox%max(1)]
stl_bounding_boxes(patch_id, 2, 1:3) = [bbox%min(2), (bbox%min(2) + bbox%max(2))/2._wp, bbox%max(2)]
stl_bounding_boxes(patch_id, 3, 1:3) = [bbox%min(3), (bbox%min(3) + bbox%max(3))/2._wp, bbox%max(3)]

models(patch_id)%model = model
models(patch_id)%boundary_v = boundary_v
models(patch_id)%boundary_edge_count = boundary_edge_count
if (interpolate) then
models(patch_id)%interpolate = 1
else
models(patch_id)%interpolate = 0
end if
if (interpolate) then
models(patch_id)%interpolated_boundary_v = interpolated_boundary_v
models(patch_id)%total_vertices = total_vertices
end if

end if
end do

! Pack and upload flat arrays for GPU (AFTER the loop)
block
integer :: pid, max_ntrs
integer :: max_bv1, max_bv2, max_bv3, max_iv1, max_iv2

max_ntrs = 0
max_bv1 = 0; max_bv2 = 0; max_bv3 = 0
max_iv1 = 0; max_iv2 = 0

do pid = 1, num_ibs
if (allocated(models(pid)%model)) then
call s_pack_model_for_gpu(models(pid))
max_ntrs = max(max_ntrs, models(pid)%ntrs)
end if
if (allocated(models(pid)%boundary_v)) then
max_bv1 = max(max_bv1, size(models(pid)%boundary_v, 1))
max_bv2 = max(max_bv2, size(models(pid)%boundary_v, 2))
max_bv3 = max(max_bv3, size(models(pid)%boundary_v, 3))
end if
if (allocated(models(pid)%interpolated_boundary_v)) then
max_iv1 = max(max_iv1, size(models(pid)%interpolated_boundary_v, 1))
max_iv2 = max(max_iv2, size(models(pid)%interpolated_boundary_v, 2))
end if
end do

if (max_ntrs > 0) then
allocate (gpu_ntrs(1:num_ibs))
allocate (gpu_trs_v(1:3, 1:3, 1:max_ntrs, 1:num_ibs))
allocate (gpu_trs_n(1:3, 1:max_ntrs, 1:num_ibs))
allocate (gpu_interpolate(1:num_ibs))
allocate (gpu_boundary_edge_count(1:num_ibs))
allocate (gpu_total_vertices(1:num_ibs))

gpu_ntrs = 0
gpu_trs_v = 0._wp
gpu_trs_n = 0._wp
gpu_interpolate = 0
gpu_boundary_edge_count = 0
gpu_total_vertices = 0

if (max_bv1 > 0) then
allocate (gpu_boundary_v(1:max_bv1, 1:max_bv2, 1:max_bv3, 1:num_ibs))
gpu_boundary_v = 0._wp
end if

if (max_iv1 > 0) then
allocate (gpu_interpolated_boundary_v(1:max_iv1, 1:max_iv2, 1:num_ibs))
gpu_interpolated_boundary_v = 0._wp
end if

do pid = 1, num_ibs
if (allocated(models(pid)%model)) then
gpu_ntrs(pid) = models(pid)%ntrs
gpu_trs_v(:, :, 1:models(pid)%ntrs, pid) = models(pid)%trs_v
gpu_trs_n(:, 1:models(pid)%ntrs, pid) = models(pid)%trs_n
gpu_interpolate(pid) = models(pid)%interpolate
gpu_boundary_edge_count(pid) = models(pid)%boundary_edge_count
gpu_total_vertices(pid) = models(pid)%total_vertices
end if
if (allocated(models(pid)%boundary_v)) then
gpu_boundary_v(1:size(models(pid)%boundary_v, 1), &
1:size(models(pid)%boundary_v, 2), &
1:size(models(pid)%boundary_v, 3), pid) = models(pid)%boundary_v
end if
if (allocated(models(pid)%interpolated_boundary_v)) then
gpu_interpolated_boundary_v(1:size(models(pid)%interpolated_boundary_v, 1), &
1:size(models(pid)%interpolated_boundary_v, 2), pid) = models(pid)%interpolated_boundary_v
end if
end do

$:GPU_ENTER_DATA(copyin='[gpu_ntrs, gpu_trs_v, gpu_trs_n, gpu_interpolate, gpu_boundary_edge_count, gpu_total_vertices]')
if (allocated(gpu_boundary_v)) then
$:GPU_ENTER_DATA(copyin='[gpu_boundary_v]')
end if
if (allocated(gpu_interpolated_boundary_v)) then
$:GPU_ENTER_DATA(copyin='[gpu_interpolated_boundary_v]')
end if
end if
end block

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for all occurrences of stl_bounding_boxes in the codebase
echo "=== stl_bounding_boxes allocations and usage ==="
rg -n "stl_bounding_boxes" src/simulation/m_ib_patches.fpp src/common/m_model.fpp 2>/dev/null || echo "Not found in specified files"

# Broader search
rg -n "stl_bounding_boxes" --type f90 --type fpp 2>/dev/null || true

Repository: MFlowCode/MFC

Length of output: 999


🏁 Script executed:

#!/bin/bash
# Check total_vertices initialization and usage pattern
echo "=== total_vertices usage pattern ==="
rg -n "total_vertices" src/simulation/m_ib_patches.fpp | head -50

# Check gpu_total_vertices
echo "=== gpu_total_vertices pattern ==="
rg -n "gpu_total_vertices" src/simulation/m_ib_patches.fpp

Repository: MFlowCode/MFC

Length of output: 1328


🏁 Script executed:

#!/bin/bash
# Check gpu_* allocations and deallocations
echo "=== gpu_* array allocations ==="
rg -n "allocate.*gpu_" src/simulation/m_ib_patches.fpp | head -30

echo "=== gpu_* deallocations ==="
rg -n "deallocate.*gpu_" src/simulation/m_ib_patches.fpp | head -30

# Check if there's a finalize/cleanup subroutine
echo "=== Potential cleanup routines ==="
rg -n "subroutine.*finalize|subroutine.*cleanup" src/simulation/m_ib_patches.fpp

Repository: MFlowCode/MFC

Length of output: 855


🏁 Script executed:

#!/bin/bash
# Check the module-level declarations to understand data structure
echo "=== Module-level declarations for gpu_* and stl_bounding_boxes ==="
rg -n "real.*gpu_|integer.*gpu_|allocatable.*gpu_|real.*stl_bounding|allocatable.*stl_bounding" src/simulation/m_ib_patches.fpp | head -40

# Look for module/global scope context
head -150 src/simulation/m_ib_patches.fpp | tail -100

Repository: MFlowCode/MFC

Length of output: 4044


s_instantiate_STL_models has critical allocation/initialization issues breaking multi-model scenarios.

All four issues are confirmed:

  1. stl_bounding_boxes allocation inside loop: Line 1069 allocates with allocate(stl_bounding_boxes(patch_id, 1:3, 1:3)) within the do patch_id = 1, num_ibs loop. This will fail on the second iteration—the first allocates shape (1,3,3), the second attempts (2,3,3), but the array is already allocated. Should allocate once before the loop with shape (num_ibs, 1:3, 1:3).

  2. total_vertices uninitialized when interpolate=false: The local variable total_vertices (line 958) is only assigned at lines 1040 or 1042 inside the if (interpolate) block. Line 1084 unconditionally copies it into models(patch_id)%total_vertices, propagating garbage values when interpolation is skipped. Initialize or conditionally assign.

  3. Incorrect spacing vector in interpolation: Lines 1040 and 1042 pass (/dx, dy, dz/) (array elements, not spacing magnitudes), though this is currently masked by line 1037 forcing interpolate = .false.. If interpolation is re-enabled, use (/dx_local, dy_local, dz_local/).

  4. Missing allocation guards on gpu_* arrays: Lines 1116–1121 allocate module-level gpu_* arrays without checking whether they are already allocated. Calling s_instantiate_STL_models twice will fail. Add guards such as if (.not. allocated(gpu_ntrs)) allocate(...), as shown in the sketch below.
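
A minimal sketch for items 1 and 4, following the variable names in the snippet above; the guarded re-use assumes max_ntrs and num_ibs do not change between calls, otherwise the arrays would need to be reallocated:

    ! Item 1: allocate once for all patches, before the `do patch_id = 1, num_ibs` loop.
    if (.not. allocated(stl_bounding_boxes)) allocate (stl_bounding_boxes(1:num_ibs, 1:3, 1:3))

    ! Item 4: guard the module-level staging arrays so a repeated call does not re-allocate.
    if (.not. allocated(gpu_ntrs)) allocate (gpu_ntrs(1:num_ibs))
    if (.not. allocated(gpu_trs_v)) allocate (gpu_trs_v(1:3, 1:3, 1:max_ntrs, 1:num_ibs))
    if (.not. allocated(gpu_trs_n)) allocate (gpu_trs_n(1:3, 1:max_ntrs, 1:num_ibs))
    if (.not. allocated(gpu_interpolate)) allocate (gpu_interpolate(1:num_ibs))
    if (.not. allocated(gpu_boundary_edge_count)) allocate (gpu_boundary_edge_count(1:num_ibs))
    if (.not. allocated(gpu_total_vertices)) allocate (gpu_total_vertices(1:num_ibs))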

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/simulation/m_ib_patches.fpp` around lines 954 - 1169,
s_instantiate_STL_models has four fixes: allocate stl_bounding_boxes once before
the patch loop with shape (num_ibs,1:3,1:3) instead of calling
allocate(stl_bounding_boxes(patch_id,...)) inside the loop; ensure
total_vertices is initialized or only copied into
models(patch_id)%total_vertices when interpolate is true (do not propagate
uninitialized total_vertices when interpolation is skipped); pass spacing
magnitudes to the interpolation routines by using (/dx_local, dy_local,
dz_local/) when calling f_interpolate_3D/f_interpolate_2D (and remove or fix the
unconditional interpolate = .false. that currently masks interpolation logic);
and protect module-level gpu_* allocations (gpu_ntrs, gpu_trs_v, gpu_trs_n,
gpu_interpolate, gpu_boundary_edge_count, gpu_total_vertices, gpu_boundary_v,
gpu_interpolated_boundary_v) with if (.not. allocated(...)) checks before each
allocate to allow repeated calls to s_instantiate_STL_models.

Comment on lines +687 to 688
self.prohibit(ib and (num_ibs <= 0 or num_ibs > 1000),
"num_ibs must be between 1 and num_patches_max (10)")

⚠️ Potential issue | 🟡 Minor

Fix outdated error message for num_ibs upper bound.

The validation allows up to 1000, but the message still says 10, which will mislead users. Consider deriving the bound from CONSTRAINTS to avoid future drift.

✅ Suggested fix
-        self.prohibit(ib and (num_ibs <= 0 or num_ibs > 1000),
-                     "num_ibs must be between 1 and num_patches_max (10)")
+        max_ibs = CONSTRAINTS["num_ibs"]["max"]
+        self.prohibit(ib and (num_ibs <= 0 or num_ibs > max_ibs),
+                     f"num_ibs must be between 1 and num_patches_max ({max_ibs})")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@toolchain/mfc/case_validator.py` around lines 687 - 688, The error message
passed to self.prohibit for the num_ibs check is stale (mentions 10) while the
actual upper bound is 1000; update the message to reflect the real max and, to
prevent future drift, construct the message using the authoritative constraint
value (e.g. the CONSTRAINTS constant or the class attribute used for bounds)
rather than a hardcoded literal—locate the self.prohibit call that checks
num_ibs (the line with self.prohibit(ib and (num_ibs <= 0 or num_ibs > 1000),
...)) and interpolate/format the upper-bound from CONSTRAINTS (or
self.CONSTRAINTS / CONSTRAINTS['num_ibs']['max'] as appropriate) so the message
reads "num_ibs must be between 1 and <max>".
