😱 Status quo stories: Niklaus Builds a Hydrodynamics Simulator

🚧 Warning: Draft status 🚧

This is a draft "status quo" story submitted as part of the brainstorming period. It is derived from real-life experiences of actual Rust users and is meant to reflect some of the challenges that Async Rust programmers face today.

If you would like to expand on this story, or adjust the answers to the FAQ, feel free to open a PR making edits (but keep in mind that, as they reflect peoples' experiences, status quo stories cannot be wrong, only inaccurate). Alternatively, you may wish to add your own status quo story!

The story

Problem

Niklaus is a professor of physics at the University of Rustville. He needed to build a tool to solve hydrodynamics simulations; there is a common method for this that subdivides a region into a grid and computes the solution for each grid patch. All the patches in a grid for a point in time are independent and can be computed in parallel, but they are dependent on neighboring patches in the previously computed frame in time. This is a well known computational model and the patterns for basic parallelization are well established.

Niklaus wanted to write a performant tool to compute the solutions to the simulations of his research. He chose Rust because he needed high performance but he also wanted something that could be maintained by his students, who are not professional programmers. Rust's safety guarantees giver him confidence that his results are not going to be corrupted by data races or other programming errors. After implementing the core mathematical formulas, Niklaus began implementing the parallelization architecture.

His first attempt to was to emulate a common CFD design pattern: using message passing to communicate between processes that are each assigned a specific patch in the grid. So he assign one thread to each patch and used messages to communicate solution state to dependent patches. With one thread per patch this usually meant that there were 5-10x more threads than CPU cores.

This solution worked, but Niklaus had two problems with it. First, it gave him no control over CPU usage so the solution would greedily use all available CPU resources. Second, using messages to communicate solution values between patches did not scale when his team added a new feature (tracer particles) the additional messages caused by this change created so much overhead that parallel processing was no faster than serial. So, Niklaus decided to find a better solution.

Solution Path

To address the first problem: Niklaus' new design decoupled the work that needed to be done (solving physics equations for each patch in the grid) from the workers (threads), this would allow him to set the number of threads and not use all the CPU resources. So, he began looking for a tool in Rust that would meet this design pattern. When he read about async and how it allowed the user to define units of work and send those to an executor which would manage the execution of those tasks across a set of workers, he thought he'd found exactly what he needed. He also thought that the .await semantics would give a much better way of coordinating dependencies between patches. Further reading indicated that tokio was the runtime of choice for async in the community and, so, he began building a new CFD solver with async and tokio.

After making some progress, Niklaus ran into his first problem. Niklaus had been under a false impression about what async executors do. He had assumed that a multi-threaded executor could automatically move the execution of an async block to a worker thread. When this turned out to wrong, he went to Stackoverflow and learned that async tasks must be explicitly spawned into a thread pool if they are to be executed on a worker thread. This meant that the algorithm to be parallelized became strongly coupled to both the spawner and the executor. Code that used to cleanly express a physics algorithm now had interspersed references to the task spawner, not only making it harder to understand, but also making it impossible to try different execution strategies, since with Tokio the spawner and executor are the same object (the Tokio runtime). Niklaus felt that a better design for data parallelism would enable better separation of concerns: a group of interdependent compute tasks, and a strategy to execute them in parallel.

Niklaus second problem came as he tried to fully replace the message passing from the first design: sharing data between tasks. He used the async API to coordinate computation of patches so that a patch would only go to a worker when all its dependencies had completed. But he also needed to account for the solution data which was passed in the messages. He setup a shared data structure to track the solutions for each patch now that messages would not be passing that data. Learning how to properly use shared data with async was a new challenge. The initial design:

#![allow(unused)]
fn main() {
    let mut stage_primitive_and_scalar = |index: BlockIndex, state: BlockState<C>, hydro: H, geometry: GridGeometry| {
        let stage = async move {
            let p = state.try_to_primitive(&hydro, &geometry)?;
            let s = state.scalar_mass / &geometry.cell_volumes / p.map(P::lorentz_factor);
            Ok::<_, HydroError>( ( p.to_shared(), s.to_shared() ) )
        };
        stage_map.insert(index, runtime.spawn(stage).map(|f| f.unwrap()).shared());
    };
}

lacked performance because he needed to clone the value for every task. So, Niklaus switched over to using Arc to keep a thread safe RC to the shared data. But this change introduced a lot of .map and .unwrap function calls, making the code much harder to read. He realized that managing the dependency graph was not intuitive when using async for concurrency.

As the program matured, a new problem arose: a steep learning curve with error handling. The initial version of his design used panic!s to fail the program if an error was encountered, but the stack traces were almost unreadable. He asked his teammate Grace to migrate over to using the Result idiom for error handling and Grace found a major inconvenience. The Rust type inference inconsistently breaks when propagating Result in async blocks. Grace frequently found that she had to specify the type of the error when creating a result value:

#![allow(unused)]
fn main() {
Ok::<_, HydroError>( ( p.to_shared(), s.to_shared() ) )  
}

And she could not figure out why she had to add the ::<_, HydroError> to some of the Result values.

Finally, once Niklaus' team began using the new async design for their simulations, they noticed an important issue that impacted productivity: compilation time had now increased to between 30 and 60 seconds. The nature of their work requires frequent changes to code and recompilation and 30-60 seconds is long enough to have a noticeable impact on their quality of life. What he and his team want is for compilation to be 2 to 3 seconds. Niklaus believes that the use of async is a major contributor to the long compilation times.

This new solution works, but Niklaus is not satisfied with how complex his code became after the move to async and that compilation time is now 30-60 seconds. The state sharing adding a large amount of cruft with Arc and async is not well suited for using a dependency graph to schedule tasks so implementing this solution created a key component of his program that was difficult to understand and pervasive. Ultimately, his conclusion was that async is not appropriate for parallelizing computational tasks. He will be trying a new design based upon Rayon in the next version of her application.

🤔 Frequently Asked Questions

What are the morals of the story?

async looks to be the wrong choice for parallelizing compute bound/computational work
There is a lack of guidance to help people solving such problems get started on the right foot
Quality of Life issues (compilation time, type inference on Result) can create a drag on users ability to focus on their domain problem

What are the sources for this story?

This story is based on the experience of building the kilonova hydrodynamics simulation solver.

Why did you choose Niklaus and Grace to tell this story?

I chose Niklaus as the primary character in this story because this work was driven by someone who only uses programming for a small part of their work. Grace was chosen as a supporting character because of that persons experience with C/C++ programming and to avoid repeating characters.

How would this story have played out differently for the other characters?

Alan: there's a good chance he would have already had experience working with either async workflows in another language or doing parallelization of compute bound tasks; and so would already know from experience that async was not the right place to start.
Grace: likewise, might already have experience with problems like this and would know what to look for when searching for tools.
Barbara: the experience would likely be fairly similar, since the actual subject of this story is quite familiar with Rust by now

wg-async