Merzbild.jl benchmarks

Various benchmarks and comparisons to other open-source codes are provided here for reference test cases.

Couette flow, serial, small grid

Comparisons with SPARTA are provided for a single-species (argon) Couette flow test case with 50000 particles and 50 cells (50000 timesteps, averaging over the last 36k timesteps, i.e. after timestep 14000). The computation is serial. Timing in Merzbild.jl is provided by TimerOutputs.jl; timing in SPARTA is provided by its built-in timers. No surface quantities are computed. The input can be found in simulations/1D/couette_benchmarking.jl.

Merzbild.jl version 0.7.0, run with --check-bounds=no -O3.

SPARTA version 20Jan2025, compiled with -O3.
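The Merzbild.jl tables below are in the format printed by TimerOutputs.jl. As a rough illustration of how such a table is produced, the following minimal sketch times a dummy loop with `@timeit`. The section labels ("convect", "sort", "collide") are taken from the tables; the function `timed_loop` and the loop bodies are hypothetical placeholders and are not Merzbild.jl code, and the actual instrumentation in the benchmark script may be organised differently.

```julia
using TimerOutputs

# Hypothetical stand-in for a simulation loop; the bodies are dummy array
# operations, not the Merzbild.jl kernels.
function timed_loop(n_steps::Int, n_cells::Int)
    to = TimerOutput()
    x = rand(1000)                 # dummy particle positions in [0, 1)
    cell = zeros(Int, length(x))   # dummy cell index of each particle

    for _ in 1:n_steps
        @timeit to "convect" begin
            x .= mod.(x .+ 0.01, 1.0)
        end
        @timeit to "sort" begin
            cell .= min.(floor.(Int, x .* n_cells) .+ 1, n_cells)
        end
        # Timing inside the cell loop makes ncalls equal to n_steps * n_cells
        for c in 1:n_cells
            @timeit to "collide" begin
                x[c] = 1.0 - x[c]
            end
        end
    end
    return to
end

to = timed_loop(100, 50)
print_timer(to)                                # prints a table like the ones below
println(TimerOutputs.ncalls(to["collide"]))    # number of timed "collide" calls
println(TimerOutputs.ncalls(to["sort"]))       # number of timed "sort" calls
```

In the small grid tables, "collide" has 2.50M calls against 50.0k for "sort", which is consistent with the collision step being timed once per cell per timestep (50000 × 50), as in the sketch.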

Intel Core i9-13900K, 128 GB RAM

Ubuntu 22.04.5, Julia version 1.11.2, SPARTA compiled with gcc version 11.4.0.

Merzbild.jl

──────────────────────────────────────────────────────────────────────────
                                 Time                    Allocations      
                        ───────────────────────   ────────────────────────
   Tot / % measured:         28.4s /  97.3%            126MiB /   9.8%    

Section         ncalls     time    %tot     avg     alloc    %tot      avg
──────────────────────────────────────────────────────────────────────────
sort             50.0k    10.5s   37.9%   209μs     0.00B    0.0%    0.00B
convect          50.0k    6.29s   22.8%   126μs     0.00B    0.0%    0.00B
collide          2.50M    5.90s   21.3%  2.36μs     0.00B    0.0%    0.00B
props compute    36.0k    4.88s   17.7%   136μs     0.00B    0.0%    0.00B
I/O                  1   85.4ms    0.3%  85.4ms   9.32MiB   75.3%  9.32MiB
avg physprops    36.0k   5.54ms    0.0%   154ns     0.00B    0.0%    0.00B
sampling             1   2.52ms    0.0%  2.52ms   3.05MiB   24.7%  3.05MiB
──────────────────────────────────────────────────────────────────────────

SPARTA

Loop time of 30.4417 on 1 procs for 50000 steps with 50000 particles

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Move    | 8.0618     | 8.0618     | 8.0618     |   0.0 | 26.48
Coll    | 10.862     | 10.862     | 10.862     |   0.0 | 35.68
Sort    | 2.793      | 2.793      | 2.793      |   0.0 |  9.18
Comm    | 0.0034328  | 0.0034328  | 0.0034328  |   0.0 |  0.01
Modify  | 8.719      | 8.719      | 8.719      |   0.0 | 28.64
Output  | 0.00057459 | 0.00057459 | 0.00057459 |   0.0 |  0.00
Other   |            | 0.002392   |            |       |  0.01

M1 Pro (MacBook Pro), 32 GB RAM

macOS 15.4.1, Julia version 1.11.2, SPARTA compiled with Apple clang version 17.0.0.

Merzbild.jl

──────────────────────────────────────────────────────────────────────────
                                 Time                    Allocations      
                        ───────────────────────   ────────────────────────
   Tot / % measured:         33.5s /  97.7%            125MiB /   9.9%    

Section         ncalls     time    %tot     avg     alloc    %tot      avg
──────────────────────────────────────────────────────────────────────────
sort             50.0k    11.6s   35.5%   232μs     0.00B    0.0%    0.00B
collide          2.50M    8.01s   24.4%  3.20μs     0.00B    0.0%    0.00B
convect          50.0k    7.44s   22.7%   149μs     0.00B    0.0%    0.00B
props compute    36.0k    5.60s   17.1%   156μs     0.00B    0.0%    0.00B
I/O                  1   91.8ms    0.3%  91.8ms   9.30MiB   75.3%  9.30MiB
avg physprops    36.0k   5.33ms    0.0%   148ns     0.00B    0.0%    0.00B
sampling             1   2.78ms    0.0%  2.78ms   3.05MiB   24.7%  3.05MiB
──────────────────────────────────────────────────────────────────────────

SPARTA

Loop time of 47.4825 on 1 procs for 50000 steps with 50000 particles

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Move    | 12.001     | 12.001     | 12.001     |   0.0 | 25.28
Coll    | 13.299     | 13.299     | 13.299     |   0.0 | 28.01
Sort    | 2.8246     | 2.8246     | 2.8246     |   0.0 |  5.95
Comm    | 0.0020463  | 0.0020463  | 0.0020463  |   0.0 |  0.00
Modify  | 19.352     | 19.352     | 19.352     |   0.0 | 40.76
Output  | 0.0019748  | 0.0019748  | 0.0019748  |   0.0 |  0.00
Other   |            | 0.0009062  |            |       |  0.00

Couette flow, serial, large grid

The physical parameters for this test case are the same as for the previous one, but a larger grid (2000 cells) is used, with 250 particles per cell at t=0. The number of grid cells is therefore 40x higher than in the small grid test case, and the number of particles is 10x higher.

In addition, surface properties are computed and averaged. The numerical setup corresponds to the simulations/1D/couette_with_surface_quantities.jl file, with the following parameters for the run command:

run(1234, 300.0, 500.0, 5e-4, 5e22, 2000, 250, 2.59e-9, 1000, 50000, 14000; do_benchmark=true)

Setting do_benchmark to true turns off computation of the degree of particle index fragmentation.
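The grid and particle counts quoted for this case can be cross-checked with a few lines of arithmetic. The variable names below are illustrative only; the numbers are those stated in the text for the two test cases.

```julia
# Small grid case: 50 cells, 50000 particles (from the previous section)
small_cells = 50
small_particles = 50_000

# Large grid case: 2000 cells, 250 particles per cell at t=0
large_cells = 2000
particles_per_cell = 250
large_particles = large_cells * particles_per_cell

println(large_particles)                     # 500000
println(large_cells ÷ small_cells)           # 40
println(large_particles ÷ small_particles)   # 10

# With 50000 timesteps, one collision call per cell per timestep gives:
n_steps = 50_000
println(large_cells * n_steps)               # 100000000
```

The particle total matches the 500000 particles reported in the SPARTA log, and the last value matches the 100M calls listed for "collide" in the Merzbild.jl tables.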

Intel Core i9-13900K, 128 GB RAM

Ubuntu 22.04.5, Julia version 1.11.2, SPARTA compiled with gcc version 11.4.0.

Merzbild.jl

──────────────────────────────────────────────────────────────────────────────────────
                                             Time                    Allocations      
                                    ───────────────────────   ────────────────────────
         Tot / % measured:                795s /  99.5%            192MiB /  17.1%    

Section                     ncalls     time    %tot     avg     alloc    %tot      avg
──────────────────────────────────────────────────────────────────────────────────────
sort                         50.0k     269s   34.0%  5.37ms     0.00B    0.0%    0.00B
convect + surface compute    36.0k     209s   26.4%  5.80ms   2.20MiB    6.7%    64.0B
props compute                36.0k     131s   16.6%  3.63ms     0.00B    0.0%    0.00B
collide                       100M     114s   14.4%  1.14μs     0.00B    0.0%    0.00B
convect                      14.0k    68.1s    8.6%  4.86ms     0.00B    0.0%    0.00B
avg physprops                36.0k    196ms    0.0%  5.43μs     0.00B    0.0%    0.00B
sampling                         1   35.1ms    0.0%  35.1ms   30.5MiB   93.3%  30.5MiB
avg surfprops                36.0k   12.6ms    0.0%   351ns     0.00B    0.0%    0.00B
I/O                             15   2.26ms    0.0%   151μs   3.58KiB    0.0%     244B
──────────────────────────────────────────────────────────────────────────────────────

SPARTA

Loop time of 1155.34 on 1 procs for 50000 steps with 500000 particles

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Move    | 598.55     | 598.55     | 598.55     |   0.0 | 51.81
Coll    | 262.52     | 262.52     | 262.52     |   0.0 | 22.72
Sort    | 87.103     | 87.103     | 87.103     |   0.0 |  7.54
Comm    | 0.018935   | 0.018935   | 0.018935   |   0.0 |  0.00
Modify  | 207.14     | 207.14     | 207.14     |   0.0 | 17.93
Output  | 0.002845   | 0.002845   | 0.002845   |   0.0 |  0.00
Other   |            | 0.01218    |            |       |  0.00

M1 Pro (MacBook Pro), 32 GB RAM

macOS 15.4.1, Julia version 1.11.2, SPARTA compiled with Apple clang version 17.0.0.

Merzbild.jl

──────────────────────────────────────────────────────────────────────────────────────
                                             Time                    Allocations      
                                    ───────────────────────   ────────────────────────
         Tot / % measured:               1164s /  99.5%            190MiB /  17.2%    

Section                     ncalls     time    %tot     avg     alloc    %tot      avg
──────────────────────────────────────────────────────────────────────────────────────
convect + surface compute    36.0k     359s   31.0%  10.0ms   2.20MiB    6.7%    64.0B
sort                         50.0k     337s   29.0%  6.73ms     0.00B    0.0%    0.00B
props compute                36.0k     181s   15.6%  5.02ms     0.00B    0.0%    0.00B
collide                       100M     171s   14.8%  1.71μs     0.00B    0.0%    0.00B
convect                      14.0k     110s    9.5%  7.87ms     0.00B    0.0%    0.00B
avg physprops                36.0k    198ms    0.0%  5.51μs     0.00B    0.0%    0.00B
sampling                         1   26.1ms    0.0%  26.1ms   30.5MiB   93.3%  30.5MiB
avg surfprops                36.0k   14.8ms    0.0%   410ns     0.00B    0.0%    0.00B
I/O                             15   1.77ms    0.0%   118μs   3.58KiB    0.0%     244B
──────────────────────────────────────────────────────────────────────────────────────

SPARTA

Loop time of 1418.29 on 1 procs for 50000 steps with 500000 particles

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Move    | 885.53     | 885.53     | 885.53     |   0.0 | 62.44
Coll    | 274.42     | 274.42     | 274.42     |   0.0 | 19.35
Sort    | 45.409     | 45.409     | 45.409     |   0.0 |  3.20
Comm    | 0.0055656  | 0.0055656  | 0.0055656  |   0.0 |  0.00
Modify  | 212.91     | 212.91     | 212.91     |   0.0 | 15.01
Output  | 0.0032787  | 0.0032787  | 0.0032787  |   0.0 |  0.00
Other   |            | 0.006827   |            |       |  0.00

AMD EPYC 9374F, 378 GB RAM

Ubuntu 24.04.3, Julia version 1.11.6.

Merzbild.jl

──────────────────────────────────────────────────────────────────────────────────────
                                             Time                    Allocations      
                                    ───────────────────────   ────────────────────────
         Tot / % measured:               1467s /  99.5%            192MiB /  17.0%    

Section                     ncalls     time    %tot     avg     alloc    %tot      avg
──────────────────────────────────────────────────────────────────────────────────────
convect + surface compute    36.0k     584s   40.0%  16.2ms   2.20MiB    6.7%    64.0B
sort                         50.0k     369s   25.3%  7.39ms     0.00B    0.0%    0.00B
collide                       100M     173s   11.8%  1.73μs     0.00B    0.0%    0.00B
convect                      14.0k     170s   11.6%  12.1ms     0.00B    0.0%    0.00B
props compute                36.0k     164s   11.2%  4.54ms     0.00B    0.0%    0.00B
avg physprops                36.0k    306ms    0.0%  8.51μs     0.00B    0.0%    0.00B
sampling                         1   31.4ms    0.0%  31.4ms   30.5MiB   93.3%  30.5MiB
avg surfprops                36.0k   21.8ms    0.0%   605ns     0.00B    0.0%    0.00B
I/O                             15   7.44ms    0.0%   496μs   3.58KiB    0.0%     244B
──────────────────────────────────────────────────────────────────────────────────────
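To put the serial results side by side, the sketch below computes the ratio of the SPARTA loop time to the Merzbild.jl total time for the four cases where both are reported. The numbers are copied from the tables above; the names `runs` and `ratio` are illustrative. Note that the two figures come from different timers (the TimerOutputs.jl total and the SPARTA loop time) and may not cover exactly the same work, so the ratios are only a rough indication.

```julia
# Times in seconds, copied from the tables above.
runs = [
    (machine = "i9-13900K", grid = "small", merzbild = 28.4,   sparta = 30.4417),
    (machine = "M1 Pro",    grid = "small", merzbild = 33.5,   sparta = 47.4825),
    (machine = "i9-13900K", grid = "large", merzbild = 795.0,  sparta = 1155.34),
    (machine = "M1 Pro",    grid = "large", merzbild = 1164.0, sparta = 1418.29),
]

# Ratio above 1 means the Merzbild.jl total is smaller than the SPARTA loop time
ratio(r) = r.sparta / r.merzbild

for r in runs
    println(rpad(r.machine, 10), " ", rpad(r.grid, 6),
            " SPARTA / Merzbild.jl = ", round(ratio(r), digits = 2))
end
```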

Couette flow, multi-threaded, large grid

The numerical and physical parameters are the same as for the serial large grid case (2000 cells, 250 particles per cell at t=0). The simulation file is simulations/1D/couette_multithreaded.jl.

Intel Core i9-13900K, 128 GB RAM

Ubuntu 22.04.5, Julia version 1.11.2. Shown is the speed-up compared to a serial execution on the same computer (see above). DLB denotes dynamic load balancing (currently not used).

Configuration              | 2 cores | 4 cores | 8 cores
--------------------------------------------------------
n_chunks=n_threads, no DLB | 1.85    | 3.24    | 5.06

M1 Pro (MacBook Pro), 32 GB RAM

macOS 15.4.1, Julia version 1.11.2. Shown is the speed-up compared to a serial execution on the same computer (see above). DLB denotes dynamic load balancing (currently not used).

Configuration              | 2 cores | 4 cores | 8 cores
--------------------------------------------------------
n_chunks=n_threads, no DLB | 2.27    | 3.75    | 5.76

AMD EPYC 9374F, 378 GB RAM

Ubuntu 24.04.3, Julia version 1.11.6. Shown is the speed-up compared to a serial execution on the same computer (see above). DLB denotes dynamic load balancing (currently not used).

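Configuration              | 2 cores | 4 cores | 8 cores | 16 cores | 32 cores
------------------------------------------------------------------------------
n_chunks=n_threads, no DLB | 1.88    | 3.68    | 5.24    | 5.70     | 6.57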
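The speed-ups above can be converted to parallel efficiency, defined here as speed-up divided by the number of cores. The sketch below does this for the three machines; the speed-up values are copied from the tables, while `results` and `efficiency` are illustrative names.

```julia
# (machine, core counts, measured speed-ups relative to the serial run)
results = [
    ("Intel Core i9-13900K", [2, 4, 8],         [1.85, 3.24, 5.06]),
    ("M1 Pro",               [2, 4, 8],         [2.27, 3.75, 5.76]),
    ("AMD EPYC 9374F",       [2, 4, 8, 16, 32], [1.88, 3.68, 5.24, 5.70, 6.57]),
]

# Parallel efficiency: 1.0 corresponds to ideal linear scaling
efficiency(speedup, cores) = speedup / cores

for (machine, cores, speedups) in results
    println(machine)
    for (n, s) in zip(cores, speedups)
        println("  ", lpad(n, 2), " cores: speed-up ", s,
                ", efficiency ", round(efficiency(s, n), digits = 2))
    end
end
```

An efficiency above 1, as for the M1 Pro on 2 cores, means the measured speed-up exceeds the core count relative to the serial reference run.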