
Conversation

@bcdandurand
Collaborator

Summary
@akohlmey @stanmoore1
Please forward to anyone else as warranted.

Class VerletSplitRK : Verlet (run_style option verlet/split/rk)
An alternative to VerletSplit in which the responsibility of the K processes is reduced to receiving charge densities, computing grid potentials by solving the Poisson equations via FFTs, and communicating the potentials to the R processes.
The R processes thus take over computation (accumulating charge densities, interpolating long-range forces from the grid potentials) that the K processes would have done with VerletSplit.

These are the motivations for this:

  1. Improved parallel scalability: The parts of the long-range computation that have good parallel scalability with increasing number of processes are moved to the R-processes. Thus, the parts that scale the least well are isolated to relatively few K-processes.
  2. Reduced need for communication: No atom-specific information needs to be communicated between R and K processes.
    In particular, no data-structure information needs to be exchanged during re-neighboring, which removes the need for calls to VerletSplit::rk_setup. The inter-RK communication buffers depend only on the grid parameters, not on the number of atoms.
  3. Setup for future developments: The inter-RK communication of grid potentials instead of atom-wise long-range forces is more amenable to approximation in light of asynchronous communication. (This feature is described in our paper but is not part of the current pull request).

Class PPPM_RK : PPPM (kspace_style pppm/rk)
This class is needed to have a clean implementation of VerletSplitRK.
The computation of long-range forces is carried out in a heterogeneously parallel manner
with the same distinction between R processes and K processes as in VerletSplit.
Analogs for other kspace styles, e.g., PPPMDipoleRK : PPPMDipole, may be contributed later.
Much of the partitioned inter-RK communication that was previously in VerletSplit is moved to PPPM_RK.
The following virtual functions, which were added to kspace.h, are implemented in pppm_rk.cpp:

        virtual void compute_grid_potentials(int, int) override;
        virtual void r2k_comm(int &eflag, int &vflag) override;
        virtual void k2r_comm(int eflag, int vflag) override;

Effectively, compute has been split into three parts, with inter-RK communication between each part.
R processes compute the first and third part, K processes the second (compute_grid_potentials) part.
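
To make the call pattern concrete, here is a simplified sketch of how an integrator built on these hooks can drive the k-space part of a timestep. This is not the literal VerletSplitRK::run() code; the rproc flag, the eflag/vflag variables, the placement of the short-range work, and whether both partitions enter the two communication calls are simplifications for illustration only:

        if (rproc) {
          // R processes: part 1 -- accumulate charge densities and send them to K
          force->kspace->r2k_comm(eflag, vflag);
          // ... short-range (pair, bond, ...) forces can be computed here ...
          // part 3 -- receive the grid potentials and interpolate long-range forces
          force->kspace->k2r_comm(eflag, vflag);
        } else {
          // K processes: part 2 -- receive densities, solve the Poisson equations
          // via FFTs, and return the grid potentials to the R partition
          force->kspace->r2k_comm(eflag, vflag);
          force->kspace->compute_grid_potentials(eflag, vflag);
          force->kspace->k2r_comm(eflag, vflag);
        }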

Related Issue(s)

This pull request does not address a currently open issue. It is related to a closed issue: Issue #4149

Author(s)

The person responsible for the code development is:
Brian Dandurand
MSCA Research Fellow
Electronics, Electrical Engineering, and Computer Science (EEECS)
Queen's University Belfast

Institutional Email: b.dandurand@qub.ac.uk
Long-term Email: bcdandurand@gmail.com

The developments behind this pull request were supported by the project
Project: Asynchronous Scientific Continuous Computations Exploiting Disaggregation (ASCCED)
Funding source: UKRI EPSRC EP/X01794X/1
Members of this project:
Research Fellow: Brian Dandurand
Supervisor 1: Prof. Hans Vandierendonck (Queen's University Belfast --- EEECS)
Supervisor 2: Prof. Bronis R. de Supinski (Queen's University Belfast --- EEECS)

The developments of this pull request may be described as a preliminary implementation of the enhanced baseline in

Dandurand, B., Vandierendonck, H., de Supinski, B. "Improving parallel scalability for molecular dynamics simulations in the exascale era". 39th IEEE International Parallel & Distributed Processing Symposium. June 3-7, 2025. Milan, Italy. To appear.

Licensing

By submitting this pull request, I agree, that my contribution will be included in LAMMPS and redistributed under either the GNU General Public License version 2 (GPL v2) or the GNU Lesser General Public License version 2.1 (LGPL v2.1).

Backward Compatibility

There should be no issues with backward compatibility. As for core LAMMPS source files, only de minimis additions were made to the KSpace class:

$ diff kspace.h ~/lammps/src/kspace.h
84,85d83
<   int rk_flag;         /* 1 if a solver uses two distinct communicator worlds for
<                             r-space and k-space computations*/
178,182d175
<
<   // to be overridden in the *RK subclasses for which rk_flag == 1
<   virtual void r2k_comm(int &eflag, int &vflag) {};  //rk_flag == 1
<   virtual void k2r_comm(int eflag, int vflag) {};    //rk_flag == 1
<   virtual void compute_grid_potentials(int, int) {}; //rk_flag == 1
$ diff kspace.cpp ~/lammps/src/kspace.cpp
44c44
<     dipoleflag = spinflag = rk_flag = 0;
---
>     dipoleflag = spinflag = 0;

Implementation Notes

No other features should be affected. The contributed features are invoked by specifying the following pair of options in a LAMMPS script:

	kspace_style	pppm/rk		0.0001  #Or whatever precision
	run_style 	verlet/split/rk

These two options correspond to two new classes: PPPM_RK : PPPM and VerletSplitRK : Verlet.

PPPM_RK: Due to the communication between R and K processes, buffers for PPPM data structures are allocated:

  FFT_SCALAR ***density_brick_buf;
  FFT_SCALAR ***vdx_brick_buf, ***vdy_brick_buf, ***vdz_brick_buf;
  FFT_SCALAR ***u_brick_buf;

  double energy_buf;
  double virial_mpi_buf[6];
  int flags_buf[2];
  double domain_boxbuf[6];
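
As a rough illustration of how one of these buffers moves between partitions (simplified; partner_rank, nbrick_vals, the message tag, the use of MPI_DOUBLE for FFT_SCALAR, and the assumption of a single contiguous allocation indexed from zero are all illustrative, not the actual code):

  // sketch only: ship the charge-density brick from an R process to its K partner
  MPI_Request request;
  MPI_Isend(&density_brick_buf[0][0][0], nbrick_vals, MPI_DOUBLE,
            partner_rank, 0, universe->uworld, &request);
  // ... other work can be overlapped here ...
  MPI_Wait(&request, MPI_STATUS_IGNORE);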

PPPM_RK: Parallel topology information similar to what is maintained in VerletSplit:

  int rproc;           // 1 if an Rspace proc, 0 if Kspace (corresponds to int master in VerletSplit)
  int me_block;        // proc ID within Rspace/Kspace block
  int ratio;           // ratio of Rspace procs to Kspace procs
  MPI_Comm block;      // communicator within one block
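
As a rough sketch of what this topology setup amounts to (simplified; the grouping convention and the local names P and K are illustrative, not necessarily those of the actual code):

  // with "mpirun ... -partition P K", the first P ranks form the R partition
  // (universe->iworld == 0) and the remaining K ranks form the K partition
  int P = universe->procs_per_world[0];
  int K = universe->procs_per_world[1];
  rproc = (universe->iworld == 0) ? 1 : 0;
  ratio = P / K;
  // one possible grouping: assign every K-th R process to the same K process
  int color = rproc ? universe->me % K : universe->me - P;
  MPI_Comm_split(universe->uworld, color, universe->me, &block);
  MPI_Comm_rank(block, &me_block);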

PPPM_RK: Most of the collective and point-to-point communications use the immediate (nonblocking) MPI_I* forms with corresponding MPI_Wait calls.
PPPM_RK therefore keeps several MPI_Comm duplicates of the block communicator, along with corresponding MPI_Request structures, so that in-flight MPI_I* collectives cannot collide with one another.
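
This is the standard MPI pattern for keeping independent nonblocking collectives apart; a simplified sketch (the duplicate names and the choice of root rank are illustrative only):

  MPI_Comm block_energy, block_virial;        // duplicates of block, created once at setup
  MPI_Comm_dup(block, &block_energy);
  MPI_Comm_dup(block, &block_virial);

  MPI_Request req_energy, req_virial;
  MPI_Ibcast(&energy_buf, 1, MPI_DOUBLE, 0, block_energy, &req_energy);
  MPI_Ibcast(virial_mpi_buf, 6, MPI_DOUBLE, 0, block_virial, &req_virial);
  // ... other work proceeds while both broadcasts are in flight ...
  MPI_Wait(&req_energy, MPI_STATUS_IGNORE);
  MPI_Wait(&req_virial, MPI_STATUS_IGNORE);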

PPPM_RK: The main feature is that compute is split into three parts (see the sketch below):

  • compute_charge_densities(int, int) (protected) is called on the R processes to accumulate the charge densities on the grid. It is called from r2k_comm(int &eflag, int &vflag) (public), which also communicates the charge densities to the K processes.
  • compute_grid_potentials(int, int) (public) is called on the K processes to compute the grid potentials from the charge densities by solving the Poisson equations via FFTs. The grid potentials are then sent to the appropriate R processes by calling k2r_comm(int eflag, int vflag) (public).
  • compute_interpolate_forces(int, int) (protected) is called from k2r_comm on the R processes to interpolate the long-range forces from the grid potentials.
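
A simplified sketch of how this split can be arranged inside PPPM_RK (the bodies and the if/else structure below are illustrative, with the communication details elided):

  void PPPM_RK::r2k_comm(int &eflag, int &vflag)
  {
    if (rproc) {
      compute_charge_densities(eflag, vflag);    // part 1: R processes only
      // pack density_brick_buf and send it to the K partition (MPI_Isend / MPI_Wait)
    } else {
      // receive the charge densities sent by the R processes
    }
  }

  void PPPM_RK::k2r_comm(int eflag, int vflag)
  {
    if (!rproc) {
      // send the grid potentials (u/vd bricks) back to the R partition
    } else {
      // receive the grid potentials, then interpolate the long-range forces
      compute_interpolate_forces(eflag, vflag);  // part 3: R processes only
    }
  }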

VerletSplitRK: When the KSpace style in use has rk_flag == 1, the implementation of VerletSplitRK : Verlet is analogous to that of VerletSplit : Verlet, but in a more streamlined form. Due to the different task assignments of the R versus K processors, certain function calls are no longer necessary.

Correctness was verified on three applications from the examples directory.
For all three tests, kspace_style pppm/rk and run_style verlet/split/rk were used in place of the default settings in the in.* script.

  • HEAT: changed kspace_style ewald --> kspace_style pppm/rk; the total energy remained nearly constant, as in the baseline.
  • VISCOSITY: statistical outputs for the first 1000 iterations were very close to the baseline. (After 1000 iterations, deviations are observed with different parallel topologies even among baseline runs, due to the butterfly effect and the fact that floating-point addition is not perfectly associative.)
  • peptide: the total energy remained nearly constant, similar to the baseline.

Post Submission Checklist

  • The feature or features in this pull request is complete
  • Licensing information is complete
  • Corresponding author information is complete
  • The source code follows the LAMMPS formatting guidelines
  • Suitable new documentation files and/or updates to the existing docs are included
  • The added/updated documentation is integrated and tested with the documentation build system
  • The feature has been verified to work with the conventional build system
  • The feature has been verified to work with the CMake based build system
  • Suitable tests have been added to the unittest tree.
  • A package specific README file has been included or updated
  • One or more example input decks are included

Further Information, Files, and Links

The developments of this pull request may be described as a preliminary implementation of the enhanced baseline in

Dandurand, B., Vandierendonck, H., de Supinski, B. "Improving parallel scalability for molecular dynamics simulations in the exascale era".
39th IEEE International Parallel & Distributed Processing Symposium. June 3-7, 2025. Milan, Italy. To appear.

With input from LAMMPS developers, I hope to add analogous /rk variants to other KSpace subclasses as warranted in future pull requests, so as to make the verlet/split/rk option more broadly applicable.

@bcdandurand bcdandurand requested a review from sjplimp as a code owner June 11, 2025 12:56
@ndtrung81
Contributor

@bcdandurand this is a promising idea. I will look at the implementation and give it a try.

@ndtrung81
Contributor

@bcdandurand Thanks for contributing this work; it is potentially very helpful for large-scale simulations with long-range electrostatics on a large number of MPI processes. Have you observed any speedup with this implementation in your test runs?

I can build with CMake and test with a modified version of bench/in.rhodo and can run with the GPU package with pair/only yes. I have a couple of (early) comments:

  1. The implementation logic is clear and concise. Communicating only the grid-based charge density (R->K) and potential (K->R) is an improvement over the existing VerletSplit that significantly reduces the amount of data transferred between R and K processes.
  2. Changing where the R-K communication is done requires changes to the logic of the run() function compared to VerletSplit. The addition of rk_flag and 3 new virtual functions to the core KSpace class makes VerletSplitRK::run() look cleaner than having to check whether the force->kspace style name matches pppm/rk in the init() function and to cast force->kspace to (PPPM_RK*).
  3. PPPM_RK currently comments out compute(), but there are quite a few classes that invoke force->kspace->compute(). Do you think this function should be enabled? Also, is it possible to support compute_group_group()?
  4. Please add sections for verlet/split/rk and pppm/rk in the corresponding doc pages (run_style.rst and kspace_style.rst).
  5. I think you could remove the commented out sections with #if 0 .. #endif in verlet_split_rk.cpp.

@bcdandurand
Collaborator Author

@ndtrung81 Thank you very much for reviewing this contribution.
First, regarding the question about observed speedup: yes, improvement is observed where there would otherwise be a communication bottleneck between the R- and K-processes. At small node counts and with a small R-to-K ratio, there is not much of a bottleneck and thus little improvement. For larger node counts or a larger R-to-K ratio, the improvement is noticeable. For example, if I run VISCOSITY using the non-default settings

replicate       4 4 4
kspace_style   pppm/rk 1.0e-4
run_style      verlet/split/rk

then the simulation takes about 47 seconds, versus about 77 seconds with the non-default (otherwise baseline) settings

replicate       4 4 4
kspace_style   pppm 1.0e-4
run_style      verlet/split

In both runs, I use 96 R processes and 4 K processes with an sbatch script similar to

#!/bin/bash

#SBATCH --ntasks=100                            # Number of tasks to run (for MPI tasks this is number of cores)
#SBATCH --nodes=1                               # Maximum number of nodes on which to launch tasks
#SBATCH --exclusive
#SBATCH --mem=0

mpirun -n 100 ~/lammpsfork/lammps/build/lmp -partition 96 4 -screen vsrk_r96k4 -log none -in in.cos.1000SPCE

@bcdandurand
Collaborator Author

@ndtrung81
PPPM::compute: Currently, PPPM::compute would be invoked from an instance of PPPM_RK. Whether this is the ideal default behavior is up for debate. It probably is not sensible, since the same computation would be done redundantly by both the communicator of R processes and the communicator of K processes. I'll reply again after I've thought about what the use of pppm/rk requires outside of the use of verlet/split/rk.

Commented-out code: Otherwise, the currently commented-out implementation of PPPM_RK::compute might be refactored as a comment, to indicate the contents of a decomposed alternative to PPPM::compute.
Other commented-out code might likewise be kept as comments if deemed useful, and otherwise deleted.

compute_group_group: I'm not as familiar with this (though I'm aware of its existence); I can look into it next.

Please advise as to the preferred way to provide updated changes in addition to pushing to the bcdandurand fork.

doc pages: I'll look into this and enquire as needed.


virtual void compute_grid_potentials(int, int) override;
//virtual void compute(int, int) override;
virtual void compute(int, int) override;
Member

@bcdandurand Please note that member functions should have either virtual or override, but not both. We prefer using override, except for the base-class function where virtual is required. We're currently cleaning this up and making things consistent in PR #4634.

Collaborator Author

Understood, will do.

@bcdandurand
Collaborator Author

I'm looking at incorporating an analogous decomposition for PPPM::compute_group_group. For examples that would use this, I see
examples/PACKAGES/tally/in.pe
examples/PACKAGES/tally/in.force

@akohlmey akohlmey moved this from In Progress to After Stable Release in LAMMPS Pull Requests Jun 27, 2025
@bcdandurand
Collaborator Author

I'm looking at incorporating an analogous decomposition for PPPM::compute_group_group. For examples that would use this, I see examples/PACKAGES/tally/in.pe examples/PACKAGES/tally/in.force

Testing in.pe under the script modifications:

kspace_style    pppm/rk 1.0e-4
replicate       4 4 4
run             500

with sbatch command, e.g.,
mpirun -n 60 ~/lammpsfork/lammps/build/lmp -partition 30 30 -screen verlet_r30k30_pppm_rk -log none -in in.pe

I'll soon have an update adding an implementation of PPPM_RK::compute_group_group and ancillary structures. While it separates the execution between R- and K-processes and produces the same output for tally/in.pe as the baseline use of kspace_style pppm 1.0e-4, it does not appear to provide much speedup. For it to be useful, there would need to be a modified version of the ComputeGroupGroup class analogous to the VerletSplitRK variant of VerletSplit. Developing that variant of ComputeGroupGroup, if it's worth it, should probably be part of a later pull request. I need to do a bit more cleanup of the PPPM_RK::compute_group_group addition before I push to the bcdandurand forked repo. (I won't be able to do so for a couple of days.)

I can then turn to the doc pages after the above is finished.

@akohlmey
Member

akohlmey commented Jul 2, 2025

@bcdandurand Before working on additional features, you should probably first pay attention to clean integration of your code into the LAMMPS code base. Your pull request branch currently has a merge conflict with upstream and fails to compile on all our test platforms.

@bcdandurand
Collaborator Author

@akohlmey @ndtrung81 et al.
I'm working on adding sections for verlet/split/rk and pppm/rk in the corresponding doc pages (run_style.rst and kspace_style.rst). This is in tandem with deciding what to do about PPPM_RK::compute_group_group. The issue I'm raising in this comment is analogous to an issue with the implementation of PPPM_RK::compute, which is already integrated in this fork.

The issue is due to the assumption in PPPM_RK that the initial MPI communicator is partitioned (see PPPM_RK::setupRKBlock). While it's possible to implement PPPM_RK::compute and PPPM_RK::compute_group_group within the partitioned framework, invoking these two functions from, e.g., Verlet::run and ComputeGroupGroup::kspace_contribution, respectively (which do not assume a partitioning of the MPI communicator), would be detrimental to runtime performance. This is due to the default behavior in which identical execution occurs on the R-processes and the K-processes: if there are 100 processes, of which 96 are R-processes and 4 are K-processes, the speedup is bottlenecked as if only 4 processes were being used.

Currently, there is only beneficial use of MPI parallelism if kspace_style pppm/rk is paired with run_style verlet/split/rk. We would need an analogous group/group/rk compute style implemented in, say, a ComputeGroupGroupRK class for any benefit from a call to PPPM_RK::compute_group_group to be realized.

My initial thought to prevent users from using an ineffective combination of styles (from a parallel computing perspective) is to have an rk_flag in the Verlet class (or higher up in Integrate) and other relevant classes (like Compute), which would be visible to instances of KSpace. An error can be thrown if there is not consistency between the rk_flag values of these classes when their respective styles are used.
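
Such a check would only take a few lines in the usual LAMMPS error-handling style; the sketch below assumes a hypothetical rk_flag member added to the Integrate class (set to 1 by VerletSplitRK) and is meant only to illustrate the idea:

  void PPPM_RK::init()
  {
    PPPM::init();
    // rk_flag on Integrate does not exist yet; it is the proposed addition
    if (update->integrate && update->integrate->rk_flag == 0)
      error->all(FLERR, "kspace_style pppm/rk requires run_style verlet/split/rk");
  }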

The issue described in this comment is also relevant to updating the manual, because users would need to have consistency in the rk option for run_style, kspace_style, etc.

I hope this is clear; I'll clarify where helpful.

@akohlmey akohlmey moved this from After Stable Release to In Progress in LAMMPS Pull Requests Jul 22, 2025
@ndtrung81
Contributor

@bcdandurand If you don't have time to complete the work on compute group/group/rk at the moment, we can leave it to a later PR. If possible, please wrap up the documentation for this PR and get it ready to the final round of review and to merge.

@bcdandurand
Collaborator Author

@bcdandurand If you don't have time to complete the work on compute group/group/rk at the moment, we can leave it to a later PR. If possible, please wrap up the documentation for this PR and get it ready to the final round of review and to merge.

@ndtrung81 Understood. The cluster I normally work on is under maintenance; I expect to have access to it again on Friday. I'll then address the documentation as soon as I can.

Contributor

@sjplimp sjplimp left a comment


I would like to review this PR when it is closer to merging

@stanmoore1 stanmoore1 assigned sjplimp and unassigned ndtrung81 Sep 9, 2025
@sjplimp
Contributor

sjplimp commented Dec 4, 2025

Thanks @bcdandurand for submitting these enhancements. And for the thorough explanation in the PR description of what you've done. Also thanks to @ndtrung81 for looking it over carefully and testing it out.

I have just a couple of high-level suggestions.

(1) In the doc pages for run_style and/or kspace_style for the new variant styles, could you provide some numbers for sample speed-ups? This is to motivate users to try out the new options. Ideally, the numbers would compare vanilla verlet vs verlet/split (previous) vs verlet/split/rk. This could be explained in a few sentences, or in a simple table with 3 columns of numbers.

(2) If you like, you can add a "citation" in the source code for your pending paper, which will trigger a prompt for the user to cite it whenever the verlet/split/rk option is used.
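
If you go that route, the usual mechanism is a CiteMe entry registered in the style's constructor, something along these lines (the variable name and exact wording below are just placeholders, using the reference from the PR description):

  static const char cite_verlet_split_rk[] =
    "run_style verlet/split/rk:\n\n"
    "@inproceedings{Dandurand2025,\n"
    " author =    {Dandurand, B. and Vandierendonck, H. and de Supinski, B.},\n"
    " title =     {Improving Parallel Scalability for Molecular Dynamics Simulations in the Exascale Era},\n"
    " booktitle = {39th IEEE International Parallel and Distributed Processing Symposium},\n"
    " year =      2025\n"
    "}\n\n";

  VerletSplitRK::VerletSplitRK(LAMMPS *lmp, int narg, char **arg) :
    Verlet(lmp, narg, arg)
  {
    if (lmp->citeme) lmp->citeme->add(cite_verlet_split_rk);
    // ... rest of the constructor ...
  }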

Contributor

@sjplimp sjplimp left a comment


see my recent comment

@bcdandurand
Collaborator Author

bcdandurand commented Dec 9, 2025 via email

