Boost.MPI C++
1. Introduction
Boost.MPI is a library for message passing in high-performance parallel applications. A Boost.MPI program is one or more processes that can communicate either via sending and receiving individual messages (point-to-point communication) or by coordinating as a group (collective communication). Unlike communication in threaded environments or using a shared-memory library, Boost.MPI processes can be spread across many different machines, possibly with different operating systems and underlying architectures.
Boost.MPI is not a completely new parallel programming library. Rather, it is a C++-friendly interface to the standard Message Passing Interface (MPI), the most popular library interface for high-performance, distributed computing. MPI defines a library interface, available from C, Fortran, and C++, for which there are many MPI implementations. Although there exist C++ bindings for MPI, they offer little functionality over the C bindings. The Boost.MPI library provides an alternative C++ interface to MPI that better supports modern C++ development styles, including complete support for user-defined data types and C++ Standard Library types, arbitrary function objects for collective algorithms, and the use of modern C++ library techniques to maintain maximal efficiency.
2. Getting started
2.1. MPI Implementation
To get started with Boost.MPI, you will first need a working MPI implementation. There are many conforming MPI implementations available; Boost.MPI should work with any of them, although it has only been tested extensively with a small number of them. You can test your implementation using the following simple program, which passes a message from one processor to another. Each processor prints a message to standard output.
#include <mpi.h>
#include <iostream>

int main(int argc, char* argv[])
{
  MPI_Init(&argc, &argv);

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0) {
    int value = 17;
    int result = MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    if (result == MPI_SUCCESS)
      std::cout << "Rank 0 OK!" << std::endl;
  } else if (rank == 1) {
    int value;
    int result = MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                          MPI_STATUS_IGNORE);
    if (result == MPI_SUCCESS && value == 17)
      std::cout << "Rank 1 OK!" << std::endl;
  }
  MPI_Finalize();
  return 0;
}
You should compile and run this program on two processors. To do this, consult the documentation for your MPI implementation. With LAM/MPI, for instance, you compile with the mpiCC or mpic++ compiler, boot the LAM/MPI daemon, and run your program via mpirun. For instance, if your program is called mpi-test.cpp, use the following commands:
mpiCC -o mpi-test mpi-test.cpp
lamboot
mpirun -np 2 ./mpi-test
lamhalt
When you run this program, you will see both Rank 0 OK! and Rank 1 OK! printed to the screen. However, they may be printed in any order and may even overlap each other. The following output is perfectly legitimate for this MPI program:
Rank Rank 1 OK! 0 OK!
If your output looks something like the above, your MPI implementation appears to be working with a C++ compiler and we’re ready to move on.
3. Configure and Build
As with the rest of Boost, Boost.MPI uses version 2 of the Boost.Build system for configuring and building the library binary. Please refer to the general Boost installation instructions for Unix variants (including Unix, Linux and MacOS) or Windows. The simplified build instructions should apply on most platforms, with a few specific modifications described below.
4. Bootstrap
As explained in the Boost installation instructions, running the bootstrap script (./bootstrap.sh for Unix variants or bootstrap.bat for Windows) from the Boost root directory will produce a project-config.jam file. You need to edit that file and add the following line:
using mpi ;
Alternatively, you can explicitly provide the list of Boost libraries you want to build. Please refer to the --help option of the bootstrap script.
4.1. Setting up your MPI Implementation
First, you need to scan the include/boost/mpi/config.hpp file and check if some settings need to be modified for your MPI implementation or preferences.
In particular, you will need to comment out the BOOST_MPI_HOMOGENEOUS macro if you plan to run on a heterogeneous set of machines. See the optimization notes below.
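For reference, the relevant line in include/boost/mpi/config.hpp looks roughly like the following (the exact surrounding context may differ between Boost versions); commenting it out disables the homogeneous-machine optimization:

// #define BOOST_MPI_HOMOGENEOUS   // comment this definition out for heterogeneous clusters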
Most MPI implementations require specific compilation and link options. In order to hide these details from the user, most MPI implementations provide wrappers which silently pass those options to the compiler.
Depending on your MPI implementation, some work might be needed to tell Boost which specific MPI options to use. This is done through the using mpi ; directive in the project-config.jam file, whose general form is (do not forget to leave spaces around : and before ;):
using mpi [<MPI compiler wrapper>] [<compilation and link options>] [<mpi runner>] ;
Depending on your installation and MPI distribution, the build system might be able to find all the required information, in which case you just need to specify:
using mpi ;
Troubleshooting
Most of the time, especially with production HPC clusters, some work will need to be done. Here is a list of the most common issues and suggestions on how to fix them.
- Your wrapper is not in your path or does not have a standard name.
You will need to tell the build system how to call it, using the first parameter:
using mpi : /opt/mpi/bullxmpi/1.2.8.3/bin/mpicc ;
Warning
Boost.MPI only uses the C interface, so specifying the C wrapper should be enough. But some implementations will insist on importing the C++ bindings.
- Your wrapper is really eccentric or does not exist.
With some implementations, or with some specific integrations, you will need to provide the compilation and link options through the second parameter, using 'jam' directives. The following type of configuration used to be required for some specific Intel MPI implementations (in such a case, the name of the wrapper can be left blank):
using mpi : mpiicc :
      <library-path>/softs/intel/impi/5.0.1.035/intel64/lib
      <library-path>/softs/intel/impi/5.0.1.035/intel64/lib/release_mt
      <include>/softs/intel/impi/5.0.1.035/intel64/include
      <find-shared-library>mpifort
      <find-shared-library>mpi_mt
      <find-shared-library>mpigi
      <find-shared-library>dl
      <find-shared-library>rt ;
As a convenience, MPI wrappers usually have an option that provides the required information, which usually starts with --show. You can use it to find out the required jam directives:
$ mpiicc -show
icc -I/softs/.../include ... -L/softs/.../lib ... -Xlinker -rpath -Xlinker /softs/.../lib ..
$ mpicc --showme
icc -I/opt/.../include -pthread -L/opt/.../lib -lmpi -ldl -lm -lnuma -Wl,--export-dynamic -l
$ mpicc --showme:compile
-I/opt/mpi/bullxmpi/1.2.8.3/include -pthread
$ mpicc --showme:link
-pthread -L/opt/.../lib -lmpi -ldl -lm -lnuma -Wl,--export-dynamic -lrt -lnsl -lutil -lm -ld
To see the results of MPI auto-detection, pass --debug-configuration on the bjam command line.
- The launch syntax cannot be detected.
Note
This is only used when running the tests.
If you need to use a special command to launch an MPI program, you will need to specify it through the third parameter of the using mpi directive.
So, assuming you launch the all_gather_test program with:
$mpiexec.hydra -np 4 all_gather_test
The directive will look like:
using mpi : mpiicc : [<compilation and link options>] : mpiexec.hydra -n ;
Build
To build the whole Boost distribution:
$cd <boost distribution>
$./b2
To build the Boost.MPI library and its dependencies:
$cd <boost distribution>/libs/mpi/build
$../../../b2
Tests
You can run the regression tests with:
$cd <boost distribution>/libs/mpi/test
$../../../b2
Installation
To install the whole Boost distribution:
$cd <boost distribution>
$./b2 install
5. Using Boost.MPI
To build applications based on Boost.MPI, compile and link them as you normally would for MPI programs, but remember to link against the boost_mpi and boost_serialization libraries, e.g.,
mpic++ -I/path/to/boost/mpi my_application.cpp -Llibdir \
  -lboost_mpi -lboost_serialization
If you plan to use the Python bindings for Boost.MPI in conjunction with the C++ Boost.MPI, you will also need to link against the boost_mpi_python library, e.g., by adding -lboost_mpi_python-gcc to your link command. This step will only be necessary if you intend to register C++ types or use the skeleton/content mechanism from within Python.
6. Tutorial
A Boost.MPI program consists of many cooperating processes (possibly running on different computers) that communicate among themselves by passing messages. Boost.MPI is a library (as is the lower-level MPI), not a language, so the first step in a Boost.MPI program is to create an mpi::environment object that initializes the MPI environment and enables communication among the processes. The mpi::environment object is initialized with the program arguments (which it may modify) in your main program. The creation of this object initializes MPI, and its destruction will finalize MPI. In the vast majority of Boost.MPI programs, an instance of mpi::environment will be declared in main at the very beginning of the program.
Warning
Declaring an mpi::environment at global scope is undefined behavior.
Communication with MPI always occurs over a communicator, which can be created by simply default-constructing an object of type mpi::communicator. This communicator can then be queried to determine how many processes are running (the "size" of the communicator) and to give a unique number to each process, from zero up to (but not including) the size of the communicator (i.e., the "rank" of the process):
#include <boost/mpi/environment.hpp>
#include <boost/mpi/communicator.hpp>
#include <iostream>
namespace mpi = boost::mpi;

int main()
{
  mpi::environment env;
  mpi::communicator world;
  std::cout << "I am process " << world.rank() << " of " << world.size()
            << "." << std::endl;
  return 0;
}
If you run this program with 7 processes, for instance, you will receive output such as:
I am process 5 of 7.
I am process 0 of 7.
I am process 1 of 7.
I am process 6 of 7.
I am process 2 of 7.
I am process 4 of 7.
I am process 3 of 7.
Of course, the processes can execute in a different order each time, so the ranks might not be strictly increasing. More interestingly, the text could come out completely garbled, because one process can start writing "I am process" before another process has finished writing "of 7.".
If you still have an MPI library supporting only MPI 1.1, you will need to pass the command-line arguments to the environment constructor, as shown in this example:
#include <boost/mpi/environment.hpp>
#include <boost/mpi/communicator.hpp>
#include <iostream>
namespace mpi = boost::mpi;

int main(int argc, char* argv[])
{
  mpi::environment env(argc, argv);
  mpi::communicator world;
  std::cout << "I am process " << world.rank() << " of " << world.size()
            << "." << std::endl;
  return 0;
}
6.1. Point-to-Point communication
6.1.1. Blocking communication
As a message passing library, MPI's primary purpose is to route messages from one process to another, i.e., point-to-point. MPI contains routines that can send messages, receive messages, and query whether messages are available. Each message has a source process, a target process, a tag, and a payload containing arbitrary data. The source and target processes are the ranks of the sender and receiver of the message, respectively. Tags are integers that allow the receiver to distinguish between different messages coming from the same sender.
The following program uses two MPI processes to write "Hello, world!" to the screen (hello_world.cpp):
#include <boost/mpi.hpp>
#include <iostream>
#include <string>
#include <boost/serialization/string.hpp>
namespace mpi = boost::mpi;

int main()
{
  mpi::environment env;
  mpi::communicator world;

  if (world.rank() == 0) {
    world.send(1, 0, std::string("Hello"));
    std::string msg;
    world.recv(1, 1, msg);
    std::cout << msg << "!" << std::endl;
  } else {
    std::string msg;
    world.recv(0, 0, msg);
    std::cout << msg << ", ";
    std::cout.flush();
    world.send(0, 1, std::string("world"));
  }

  return 0;
}
The first processor (rank 0) passes the message "Hello" to the second processor (rank 1) using tag 0. The
second processor prints the string it receives, along with a comma, then passes the message "world" back to processor 0 with a different tag. The first processor then writes this message with the "!" and exits. All sends are accomplished with the communicator::send method and all receives use a corresponding communicator::recv call.
6.1.2. Non-blocking communication
The default MPI communication operations—send and recv—may have to wait until the entire transmission is completed before they can return. Sometimes this blocking behavior has a negative impact on performance, because the sender could be performing useful computation while it is waiting for the transmission to occur. More important, however, are the cases where several communication operations must occur simultaneously, e.g., a process will both send and receive at the same time.
Let’s revisit our "Hello, world!" program from the previous section. The core of this program transmits two messages:
if (world.rank() == 0) {
  world.send(1, 0, std::string("Hello"));
  std::string msg;
  world.recv(1, 1, msg);
  std::cout << msg << "!" << std::endl;
} else {
  std::string msg;
  world.recv(0, 0, msg);
  std::cout << msg << ", ";
  std::cout.flush();
  world.send(0, 1, std::string("world"));
}
The first process passes a message to the second process, then prepares to receive a message. The second process does the send and receive in the opposite order. However, this sequence of events is just that—a sequence—meaning that there is essentially no parallelism. We can use non-blocking communication to ensure that the two messages are transmitted simultaneously (hello_world_nonblocking.cpp):
#include <boost/mpi.hpp>
#include <iostream>
#include <string>
#include <boost/serialization/string.hpp>
namespace mpi = boost::mpi;

int main()
{
  mpi::environment env;
  mpi::communicator world;

  if (world.rank() == 0) {
    mpi::request reqs[2];
    std::string msg, out_msg = "Hello";
    reqs[0] = world.isend(1, 0, out_msg);
    reqs[1] = world.irecv(1, 1, msg);
    mpi::wait_all(reqs, reqs + 2);
    std::cout << msg << "!" << std::endl;
  } else {
    mpi::request reqs[2];
    std::string msg, out_msg = "world";
    reqs[0] = world.isend(0, 1, out_msg);
    reqs[1] = world.irecv(0, 0, msg);
    mpi::wait_all(reqs, reqs + 2);
    std::cout << msg << ", ";
  }

  return 0;
}
We have replaced calls to the communicator::send and communicator::recv members with similar calls to their non-blocking counterparts, communicator::isend and communicator::irecv. The prefix i indicates that the operations return immediately with an mpi::request object, which allows one to query the status of a communication request (see the test method) or wait until it has completed (see the wait method). Multiple requests can be completed at the same time with the wait_all operation.
Important
Regarding communication completion/progress: the MPI standard requires users to keep the request handle for a non-blocking communication, and to call the "wait" operation (or successfully test for completion) to complete the send or receive. Unlike most C MPI implementations, which allow the user to discard the request for a non-blocking send, Boost.MPI requires the user to call "wait" or "test", since the request object might contain temporary buffers that have to be kept until the send is completed. Moreover, the MPI standard does not guarantee that the receive makes any progress before a call to "wait" or "test", although most C MPI implementations do allow receives to progress before the call to "wait" or "test". Boost.MPI, on the other hand, generally requires "test" or "wait" calls to make progress. More specifically, Boost.MPI guarantees that calling "test" multiple times will eventually complete the communication (this is because a serialized communication is potentially a multi-step operation).
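As an illustration of this requirement, here is a minimal polling sketch (not one of the library's shipped examples; it assumes a communicator world and a std::string msg, with a matching send posted by rank 0): repeated test() calls both check for completion and give Boost.MPI a chance to advance a serialized, multi-step transfer.

mpi::request req = world.irecv(0, 0, msg);
while (!req.test()) {
  // not finished yet: do useful local work here, then poll again
}

Returning to the hello_world_nonblocking.cpp example: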
If you run this program multiple times, you may see some strange results: namely, some runs will produce:
Hello, world!
while others will produce:
world! Hello,
or even some garbled version of the letters in "Hello" and "world". This indicates that there is some parallelism in the program, because after both messages are (simultaneously) transmitted, both processes will concurrently execute their print statements. For both performance and correctness, non-blocking communication operations are critical to many parallel applications using MPI.
6.2. Collective operations
Point-to-point operations are the core message passing primitives in Boost.MPI. However, many message-passing applications also require higher-level communication algorithms that combine or summarize the data stored on many different processes. These algorithms support many common tasks such as "broadcast this value to all processes", "compute the sum of the values on all processors" or "find the global minimum."
6.2.1. Broadcast
The broadcast algorithm is by far the simplest collective operation. It broadcasts a value from a single process to all other processes within a communicator. For instance, the following program broadcasts "Hello, World!" from process 0 to every other process. (hello_world_broadcast.cpp)
#include <boost/mpi.hpp>
#include <iostream>
#include <string>
#include <boost/serialization/string.hpp>
namespace mpi = boost::mpi;

int main()
{
  mpi::environment env;
  mpi::communicator world;

  std::string value;
  if (world.rank() == 0) {
    value = "Hello, World!";
  }

  broadcast(world, value, 0);

  std::cout << "Process #" << world.rank() << " says " << value << std::endl;
  return 0;
}
Running this program with seven processes will produce a result such as:
Process #0 says Hello, World!
Process #2 says Hello, World!
Process #1 says Hello, World!
Process #4 says Hello, World!
Process #3 says Hello, World!
Process #5 says Hello, World!
Process #6 says Hello, World!
6.2.2. Gather
The gather collective gathers the values produced by every process in a communicator into a vector of values on the "root" process (specified by an argument to gather). The i-th element in the vector will correspond to the value gathered from the i-th process. For instance, in the following program each process computes its own random number. All of these random numbers are gathered at process 0 (the "root" in this case), which prints out the values that correspond to each processor. (random_gather.cpp)
#include <boost/mpi.hpp>
#include <iostream>
#include <vector>
#include <cstdlib>
namespace mpi = boost::mpi;

int main()
{
  mpi::environment env;
  mpi::communicator world;

  std::srand(time(0) + world.rank());
  int my_number = std::rand();
  if (world.rank() == 0) {
    std::vector<int> all_numbers;
    gather(world, my_number, all_numbers, 0);
    for (int proc = 0; proc < world.size(); ++proc)
      std::cout << "Process #" << proc << " thought of "
                << all_numbers[proc] << std::endl;
  } else {
    gather(world, my_number, 0);
  }

  return 0;
}
Executing this program with seven processes will result in output such as the following. Although the random values will change from one run to the next, the order of the processes in the output will remain the same because only process 0 writes to std::cout.
Process #0 thought of 332199874
Process #1 thought of 20145617
Process #2 thought of 1862420122
Process #3 thought of 480422940
Process #4 thought of 1253380219
Process #5 thought of 949458815
Process #6 thought of 650073868
The gather operation collects values from every process into a vector at one process. If instead the values from every process need to be collected into identical vectors on every process, use the all_gather algorithm, which is semantically equivalent to calling gather followed by a broadcast of the resulting vector.
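For instance, the gather example above could be rewritten with all_gather roughly as follows (a sketch reusing the world and my_number variables from that example; note there is no root argument, because every process receives the complete vector):

std::vector<int> all_numbers;
mpi::all_gather(world, my_number, all_numbers);
// every process now holds one value per rank, in rank order
std::cout << "Process #" << world.rank() << " sees "
          << all_numbers.size() << " numbers." << std::endl;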
6.2.3. Scatter
The scatter collective scatters the values from a vector in the "root" process of a communicator into values in all the processes of the communicator. The i-th element in the vector will correspond to the value received by the i-th process. For instance, in the following program, the root process produces a vector of random numbers and sends one value to each process, which then prints it. (random_scatter.cpp)
#include <boost/mpi.hpp>
#include <boost/mpi/collectives.hpp>
#include <iostream>
#include <cstdlib>
#include <vector>
#include <algorithm>
namespace mpi = boost::mpi;

int main(int argc, char* argv[])
{
  mpi::environment env(argc, argv);
  mpi::communicator world;

  std::srand(time(0) + world.rank());
  std::vector<int> all;
  int mine = -1;
  if (world.rank() == 0) {
    all.resize(world.size());
    std::generate(all.begin(), all.end(), std::rand);
  }
  mpi::scatter(world, all, mine, 0);
  for (int r = 0; r < world.size(); ++r) {
    world.barrier();
    if (r == world.rank()) {
      std::cout << "Rank " << r << " got " << mine << '\n';
    }
  }
  return 0;
}
Executing this program with seven processes will result in output such as the following. Although the random values will change from one run to the next, the order of the processes in the output will remain the same because of the barrier.
Rank 0 got 1409381269
Rank 1 got 17045268
Rank 2 got 440120016
Rank 3 got 936998224
Rank 4 got 1827129182
Rank 5 got 1951746047
Rank 6 got 2117359639
6.2.4. Reduce
The reduce collective summarizes the values from each process into a single value at the user-specified "root" process. The Boost.MPI reduce operation is similar in spirit to the STL accumulate operation, because it takes
a sequence of values (one per process) and combines them via a function object. For instance, we can randomly generate values in each process and then compute the minimum value over all processes via a call to reduce (random_min.cpp):
#include <boost/mpi.hpp>
#include <iostream>
#include <cstdlib>
namespace mpi = boost::mpi;

int main()
{
  mpi::environment env;
  mpi::communicator world;

  std::srand(time(0) + world.rank());
  int my_number = std::rand();

  if (world.rank() == 0) {
    int minimum;
    reduce(world, my_number, minimum, mpi::minimum<int>(), 0);
    std::cout << "The minimum value is " << minimum << std::endl;
  } else {
    reduce(world, my_number, mpi::minimum<int>(), 0);
  }

  return 0;
}
The use of mpi::minimum<int> indicates that the minimum value should be computed. mpi::minimum<int> is a binary function object that compares its two parameters via < and returns the smaller value. Any associative binary function or function object will work provided it’s stateless. For instance, to concatenate strings with reduce one could use the function object std::plus<std::string> (string_cat.cpp):
#include <boost/mpi.hpp>
#include <iostream>
#include <string>
#include <functional>
#include <boost/serialization/string.hpp>
namespace mpi = boost::mpi;

int main()
{
  mpi::environment env;
  mpi::communicator world;

  std::string names[10] = { "zero ", "one ", "two ", "three ", "four ",
                            "five ", "six ", "seven ", "eight ", "nine " };

  std::string result;
  reduce(world,
         world.rank() < 10 ? names[world.rank()] : std::string("many "),
         result, std::plus<std::string>(), 0);

  if (world.rank() == 0)
    std::cout << "The result is " << result << std::endl;

  return 0;
}
In this example, we compute a string for each process and then perform a reduction that concatenates all of the strings together into one, long string. Executing this program with seven processors yields the following output:
The result is zero one two three four five six
Binary operations for reduce
Any kind of binary function object can be used with reduce. There are many such function objects in the C++ standard <functional> header and in the Boost.MPI header <boost/mpi/operations.hpp>, or you can create your own function object. Function objects used with reduce must be associative, i.e. f(x, f(y, z)) must be equivalent to f(f(x, y), z). If they are also commutative (i.e., f(x, y) == f(y, x)), Boost.MPI can use a more efficient implementation of reduce. To state that a function object is commutative, you will need to specialize the class is_commutative. For instance, we could modify the previous example by telling Boost.MPI that string concatenation is commutative:
namespace boost { namespace mpi {

  template<>
  struct is_commutative<std::plus<std::string>, std::string>
    : mpl::true_ { };

} } // end namespace boost::mpi
By adding this code prior to main(), Boost.MPI will assume that string concatenation is commutative and employ a different parallel algorithm for the reduce operation. Using this algorithm, the program outputs the following when run with seven processes:
The result is zero one four five six two three
Note how the numbers in the resulting string are in a different order: this is a direct result of Boost.MPI reordering operations. The result in this case differed from the non-commutative result because string concatenation is not commutative: f("x", "y") is not the same as f("y", "x"), because argument order matters. For truly commutative operations (e.g., integer addition), the more efficient commutative algorithm will produce the same result as the non-commutative algorithm. Boost.MPI also performs direct mappings from function objects in <functional> to MPI_Op values predefined by MPI (e.g., MPI_SUM, MPI_MAX); if you have your own function objects that can take advantage of this mapping, see the class template is_mpi_op.
Warning
Due to the underlying MPI limitations, it is important to note that the operation must be stateless.
All process variant
Like gather, reduce has an "all" variant called all_reduce that performs the reduction operation and broadcasts the result to all processes. This variant is useful, for instance, in establishing global minimum or maximum values.
The following code (global_min.cpp) shows a broadcasting version of the random_min.cpp example:
#include <boost/mpi.hpp>
#include <iostream>
#include <cstdlib>
namespace mpi = boost::mpi;

int main(int argc, char* argv[])
{
  mpi::environment env(argc, argv);
  mpi::communicator world;

  std::srand(world.rank());
  int my_number = std::rand();
  int minimum;

  mpi::all_reduce(world, my_number, minimum, mpi::minimum<int>());

  if (world.rank() == 0) {
    std::cout << "The minimum value is " << minimum << std::endl;
  }

  return 0;
}
In that example we provide both input and output values, requiring twice as much space, which can be a problem depending on the size of the transmitted data. If there is no need to preserve the input value, the output value can be omitted. In that case the input value will be overwritten with the output value, and Boost.MPI is able, in some situations, to implement the operation with a more space-efficient solution (using the MPI_IN_PLACE flag of the MPI C mapping), as in the following example (in_place_global_min.cpp):
#include <boost/mpi.hpp>
#include <iostream>
#include <cstdlib>
namespace mpi = boost::mpi;

int main(int argc, char* argv[])
{
  mpi::environment env(argc, argv);
  mpi::communicator world;

  std::srand(world.rank());
  int my_number = std::rand();

  mpi::all_reduce(world, my_number, mpi::minimum<int>());

  if (world.rank() == 0) {
    std::cout << "The minimum value is " << my_number << std::endl;
  }

  return 0;
}
6.3. User-defined data types
The inclusion of boost/serialization/string.hpp in the previous examples is very important: it makes values of type std::string serializable, so that they can be transmitted using Boost.MPI. In general, built-in C++ types (ints, floats, characters, etc.) can be transmitted over MPI directly, while user-defined and library-defined types will need to first be serialized (packed) into a format that is amenable to transmission. Boost.MPI relies on the Boost.Serialization library to serialize and deserialize data types.
For types defined by the standard library (such as std::string or std::vector) and some types in Boost (such as boost::variant), the Boost.Serialization library already contains all of the required serialization code. In these cases, you need only include the appropriate header from the boost/serialization directory.
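For example, a std::vector<int> becomes transmittable simply by including boost/serialization/vector.hpp; the following minimal sketch (not one of the library's shipped examples; run with at least two processes) illustrates the idea:

#include <boost/mpi.hpp>
#include <boost/serialization/vector.hpp>  // makes std::vector serializable
#include <vector>
namespace mpi = boost::mpi;

int main()
{
  mpi::environment env;
  mpi::communicator world;

  if (world.rank() == 0) {
    std::vector<int> values(3, 42);
    world.send(1, 0, values);          // serialized automatically
  } else if (world.rank() == 1) {
    std::vector<int> values;
    world.recv(0, 0, values);          // deserialized on arrival
  }
  return 0;
}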
For types that do not already have a serialization header, you will first need to implement serialization code before the types can be transmitted using Boost.MPI. Consider a simple class gps_position that contains members degrees, minutes, and seconds. This class is made serializable by making it a friend of boost::serialization::access and introducing the templated serialize() function, as follows:
class gps_position
{
private:
  friend class boost::serialization::access;

  template<class Archive>
  void serialize(Archive & ar, const unsigned int version)
  {
    ar & degrees;
    ar & minutes;
    ar & seconds;
  }

  int degrees;
  int minutes;
  float seconds;

public:
  gps_position() {}
  gps_position(int d, int m, float s)
    : degrees(d), minutes(m), seconds(s)
  {}
};
Complete information about making types serializable is beyond the scope of this tutorial. For more information, please see the Boost.Serialization library tutorial from which the above example was extracted. One important side benefit of making types serializable for Boost.MPI is that they become serializable for any other usage, such as storing the objects to disk and manipulating them in XML.
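As a sketch of that side benefit (assuming the gps_position class above and the Boost.Serialization archive headers), the very same serialize() member lets us write an object to a file:

#include <fstream>
#include <boost/archive/text_oarchive.hpp>

void save_position(const gps_position& p)
{
  std::ofstream out("position.txt");
  boost::archive::text_oarchive ar(out);
  ar << p;   // uses gps_position::serialize(), just as Boost.MPI does
}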
Some serializable types, like gps_position above, have a fixed amount of data stored at fixed offsets and are fully defined by the values of their data members (most PODs with no pointers are a good example). When this is the case, Boost.MPI can optimize their serialization and transmission by avoiding extraneous copy operations. To enable this optimization, users must specialize the type trait is_mpi_datatype, e.g.:
namespace boost { namespace mpi {
  template <>
  struct is_mpi_datatype<gps_position> : mpl::true_ { };
} }
For non-template types, we have defined a macro to simplify declaring a type as an MPI datatype:
BOOST_IS_MPI_DATATYPE(gps_position)
For composite traits, the specialization of is_mpi_datatype may depend on is_mpi_datatype itself. For instance, a boost::array object is fixed only when the type of the parameter it stores is fixed:
namespace boost { namespace mpi {
  template <typename T, std::size_t N>
  struct is_mpi_datatype<array<T, N> >
    : public is_mpi_datatype<T> { };
} }
The redundant copy elimination optimization can only be applied when the shape of the data type is completely fixed. Variable-length types (e.g., strings, linked lists) and types that store pointers cannot use the optimization, but Boost.MPI will be unable to detect this error at compile time. Attempting to perform this optimization when it is not correct will likely result in segmentation faults and other strange program behavior.
Boost.MPI can transmit any user-defined data type from one process to another. Built-in types can be transmitted without any extra effort; library-defined types require the inclusion of a serialization header; and user-defined types will require the addition of serialization code. Fixed data types can be optimized for transmission using the is_mpi_datatype type trait.
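Putting this together, once gps_position is serializable (and, optionally, marked with is_mpi_datatype) it can be transmitted exactly like a built-in type. A sketch, assuming a communicator world and at least two processes:

if (world.rank() == 0) {
  world.send(1, 0, gps_position(34, 12, 7.5f));
} else if (world.rank() == 1) {
  gps_position p;
  world.recv(0, 0, p);
}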
6.4. Communicators
6.4.1. Managing communicators
Communication with Boost.MPI always occurs over a communicator. A communicator contains a set of processes that can send messages among themselves and perform collective operations. There can be many communicators within a single program, each of which contains its own isolated communication space that acts independently of the other communicators.
When the MPI environment is initialized, only the "world" communicator (called MPI_COMM_WORLD in the MPI C and Fortran bindings) is available. The "world" communicator, accessed by default-constructing a mpi::communicator object, contains all of the MPI processes present when the program begins execution.
Other communicators can then be constructed by duplicating or building subsets of the "world" communicator. For instance, in the following program we split the processes into two groups: one for processes generating data and the other for processes that will collect the data. (generate_collect.cpp)
#include <boost/mpi.hpp>
#include <iostream>
#include <cstdlib>
#include <boost/serialization/vector.hpp>
namespace mpi = boost::mpi;

enum message_tags { msg_data_packet, msg_broadcast_data, msg_finished };

void generate_data(mpi::communicator local, mpi::communicator world);
void collect_data(mpi::communicator local, mpi::communicator world);

int main()
{
  mpi::environment env;
  mpi::communicator world;

  bool is_generator = world.rank() < 2 * world.size() / 3;
  mpi::communicator local = world.split(is_generator? 0 : 1);
  if (is_generator) generate_data(local, world);
  else collect_data(local, world);

  return 0;
}
When communicators are split in this way, their processes retain membership in both the original communicator (which is not altered by the split) and the new communicator. However, the ranks of the processes may be different from one communicator to the next, because the rank values within a communicator are always contiguous values starting at zero. In the example above, the first two thirds of the processes become "generators" and the remaining processes become "collectors". The ranks of the "collectors" in the world communicator will be 2/3 world.size() and greater, whereas the ranks of the same collector processes in the local communicator will start at zero. The following excerpt from collect_data() (in generate_collect.cpp) illustrates how to manage multiple communicators:
mpi::status msg = world.probe();
if (msg.tag() == msg_data_packet) {
  // Receive the packet of data
  std::vector<int> data;
  world.recv(msg.source(), msg.tag(), data);

  // Tell each of the collectors that we'll be broadcasting some data
  for (int dest = 1; dest < local.size(); ++dest)
    local.send(dest, msg_broadcast_data, msg.source());

  // Broadcast the actual data.
  broadcast(local, data, 0);
}
The code in this excerpt is executed by the "master" collector, e.g., the node with rank 2/3 world.size() in the world communicator and rank 0 in the local (collector) communicator. It receives a message from a generator via the world communicator, then broadcasts the message to each of the collectors via the local communicator.
For more control in the creation of communicators for subgroups of processes, the Boost.MPI group class provides facilities to compute the union (|), intersection (&), and difference (-) of two groups, generate arbitrary subgroups, etc.
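A sketch of what such group manipulation can look like (the ranks chosen here are purely illustrative and assume at least five processes):

mpi::group world_group = world.group();

// Build a subgroup from an explicit list of ranks...
int ranks[] = { 0, 2, 4 };
mpi::group chosen = world_group.include(ranks, ranks + 3);

// ...or combine groups with set operations.
mpi::group rest = world_group - chosen;          // difference

// Derive a new communicator (collective over world); it is only
// usable on the processes that belong to the chosen group.
mpi::communicator sub(world, chosen);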
6.4.2. Cartesian communicator
A communicator can be organised as a Cartesian grid; here is a basic example:
#include <vector>
#include <iostream>

#include <boost/mpi/communicator.hpp>
#include <boost/mpi/collectives.hpp>
#include <boost/mpi/environment.hpp>
#include <boost/mpi/cartesian_communicator.hpp>

#include <boost/test/minimal.hpp>

namespace mpi = boost::mpi;

int test_main(int argc, char* argv[])
{
  mpi::environment env;
  mpi::communicator world;

  if (world.size() != 24) return -1;

  mpi::cartesian_dimension dims[] = {{2, true}, {3, true}, {4, true}};
  mpi::cartesian_communicator cart(world, mpi::cartesian_topology(dims));

  for (int r = 0; r < cart.size(); ++r) {
    cart.barrier();
    if (r == cart.rank()) {
      std::vector<int> c = cart.coordinates(r);
      std::cout << "rk :" << r << " coords: "
                << c[0] << ' ' << c[1] << ' ' << c[2] << '\n';
    }
  }
  return 0;
}
6.5. Threads
There is an increasing number of hybrid parallel applications that mix distributed and shared memory parallelism. To know how to support that model, one needs to know what level of threading support is guaranteed by the MPI implementation. There are four ordered levels of possible threading support, described by mpi::threading::level. At the lowest level, you should not use threads at all; at the highest level, any thread can perform MPI calls.
If you want to use multi-threading in your MPI application, you should indicate your preferred threading support in the environment constructor. Then probe the level the library actually provides, and decide what you can do with it (it could be nothing, in which case aborting is a valid option):
#include <boost/mpi/environment.hpp>
#include <boost/mpi/communicator.hpp>
#include <iostream>
namespace mpi = boost::mpi;
namespace mt  = mpi::threading;

int main()
{
  mpi::environment env(mt::funneled);
  if (env.thread_level() < mt::funneled) {
    env.abort(-1);
  }
  mpi::communicator world;
  std::cout << "I am process " << world.rank() << " of " << world.size()
            << "." << std::endl;
  return 0;
}
6.6. Separating structure from content
When communicating data types over MPI that are not fundamental to MPI (such as strings, lists, and user-defined data types), Boost.MPI must first serialize these data types into a buffer and then communicate them; the receiver then copies the results into a buffer before deserializing into an object on the other end. For some data types, this overhead can be eliminated by using is_mpi_datatype. However, variable-length data types such as
strings and lists cannot be MPI data types.
Boost.MPI supports a second technique for improving performance by separating the structure of these variable-length data structures from the content stored in the data structures. This feature is only beneficial when the shape of the data structure remains the same but the content of the data structure will need to be communicated several times. For instance, in a finite element analysis the structure of the mesh may be fixed at the beginning of computation but the various variables on the cells of the mesh (temperature, stress, etc.) will be communicated many times within the iterative analysis process. In this case, Boost.MPI allows one to first send the "skeleton" of the mesh once, then transmit the "content" multiple times. Since the content need not contain any information about the structure of the data type, it can be transmitted without creating separate communication buffers.
To illustrate the use of skeletons and content, we will take a somewhat more limited example wherein a master process generates random number sequences into a list and transmits them to several slave processes. The length of the list will be fixed at program startup, so the content of the list (i.e., the current sequence of numbers) can be transmitted efficiently. The complete example is available in example/random_content.cpp. We begin with the master process (rank 0), which builds a list, communicates its structure via a skeleton, then repeatedly generates random number sequences to be broadcast to the slave processes via content:
// Generate the list and broadcast its structure
std::list<int> l(list_len);
broadcast(world, mpi::skeleton(l), 0);

// Generate content several times and broadcast out that content
mpi::content c = mpi::get_content(l);
for (int i = 0; i < iterations; ++i) {
  std::generate(l.begin(), l.end(), &random);

  // Broadcast the new content of l
  broadcast(world, c, 0);
}

// Notify the slaves that we're done by sending all zeroes
std::fill(l.begin(), l.end(), 0);
broadcast(world, c, 0);
The slave processes have a very similar structure to the master. They receive (via the broadcast() call) the skeleton of the data structure, then use it to build their own lists of integers. In each iteration, they receive via another broadcast() the new content in the data structure and compute some property of the data:
// Receive the content and build up our own list
std::list<int> l;
broadcast(world, mpi::skeleton(l), 0);

mpi::content c = mpi::get_content(l);
int i = 0;
do {
  broadcast(world, c, 0);

  if (std::find_if(l.begin(), l.end(),
                   std::bind1st(std::not_equal_to<int>(), 0)) == l.end())
    break;

  // Compute some property of the data.

  ++i;
} while (true);
The skeletons and content of any Serializable data type can be transmitted either via the send and recv members of the communicator class (for point-to-point communication) or broadcast via the broadcast() collective. When separating a data structure into a skeleton and content, be careful not to modify the data structure (either on the sender side or the receiver side) without transmitting the skeleton again. Boost.MPI cannot detect these accidental modifications to the data structure, which will likely result in incorrect data being transmitted or unstable programs.
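The same skeleton/content pattern works with point-to-point calls. A sketch, assuming a communicator world and a std::list<int> l that is already populated on rank 0 and default-constructed on rank 1:

if (world.rank() == 0) {
  world.send(1, 0, mpi::skeleton(l));      // structure: send once
  world.send(1, 1, mpi::get_content(l));   // content: can be resent cheaply
} else if (world.rank() == 1) {
  world.recv(0, 0, mpi::skeleton(l));      // rebuilds the shape of l
  mpi::content c = mpi::get_content(l);
  world.recv(0, 1, c);                     // fills in the values
}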
6.7. Performance optimizations
6.7.1. Serialization optimizations
To obtain optimal performance for small fixed-length data types not containing any pointers it is very important to mark them using the type traits of Boost.MPI and Boost.Serialization.
It has already been discussed that fixed-length types containing no pointers can be marked as is_mpi_datatype, e.g.:
namespace boost { namespace mpi {
  template <>
  struct is_mpi_datatype<gps_position> : mpl::true_ { };
} }
or the equivalent macro
BOOST_IS_MPI_DATATYPE(gps_position)
In addition it can give a substantial performance gain to turn off tracking and versioning for these types, if no pointers to these types are used, by using the traits classes or helper macros of Boost.Serialization:
BOOST_CLASS_TRACKING(gps_position,track_never)
BOOST_CLASS_IMPLEMENTATION(gps_position,object_serializable)
6.7.2. Homogeneous Machines
More optimizations are possible on homogeneous machines by avoiding MPI_Pack/MPI_Unpack calls and using direct bitwise copies instead. This feature is enabled by default by defining the macro BOOST_MPI_HOMOGENEOUS in the include file boost/mpi/config.hpp. That definition must be consistent when building Boost.MPI and when building the application.
In addition all classes need to be marked both as is_mpi_datatype and as is_bitwise_serializable, by using the helper macro of Boost.Serialization:
BOOST_IS_BITWISE_SERIALIZABLE(gps_position)
Usually it is safe to serialize a class for which is_mpi_datatype is true by using a binary copy of the bits. The exceptions are classes for which some members should be skipped for serialization.
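As a summary sketch, a fixed-layout, pointer-free class such as gps_position would typically carry all of the traits discussed in this section (the corresponding Boost.MPI and Boost.Serialization headers must be included first):

BOOST_IS_MPI_DATATYPE(gps_position)
BOOST_IS_BITWISE_SERIALIZABLE(gps_position)
BOOST_CLASS_TRACKING(gps_position,track_never)
BOOST_CLASS_IMPLEMENTATION(gps_position,object_serializable)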