Unknown fatal error while calling C++ program with Spawn in mpi4py

I have a Python script called manager.py that looks like this:

#!/usr/bin/python3
from mpi4py import MPI
import os
import sys

print("Running C++ workers")
cpp_mpi_info = MPI.Info.Create()
cpp_mpi_info.Set("path", os.getcwd())
cpp_comm = MPI.COMM_WORLD.Spawn("./worker", maxprocs=4, info=cpp_mpi_info)
print("Done")

And I have a C++ program, worker.cpp, which looks like this:

#include <stdio.h>
#include <iostream>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm comm;
    int rank;
    int size;
    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&comm);
    if (comm == MPI_COMM_NULL)
    {
        std::cout << "Running worker without manager" << std::endl;
    }
    else
    {
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        std::cout << "C++ worker " << rank << " of " << size << std::endl;
    }

    MPI_Finalize();
    return 0;
}

But when I run manager.py (either with ./manager.py or with mpiexec -n 1 ./manager.py), I get the following output:

Running C++ workers
[melchior:228633] [[22763,2],1] selected pml ucx, but peer [[22763,1],0] on melchior selected pml ob1
[melchior:228633] [[22763,2],1] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line 493
[melchior:228635] [[22763,2],3] selected pml ucx, but peer [[22763,1],0] on melchior selected pml ob1
[melchior:228635] [[22763,2],3] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line 493
[melchior:228634] [[22763,2],2] selected pml ucx, but peer [[22763,1],0] on melchior selected pml ob1
[melchior:228634] [[22763,2],2] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line 493
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_dpm_dyn_init() failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
[melchior:228635] *** An error occurred in MPI_Init
[melchior:228635] *** reported by process [1491795970,3]
[melchior:228635] *** on a NULL communicator
[melchior:228635] *** Unknown error
[melchior:228635] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[melchior:228635] ***    and potentially your MPI job)
[melchior:228625] 2 more processes have sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
[melchior:228625] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[melchior:228625] 2 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
--------------------------------------------------------------------------
A request has timed out and will therefore fail:

  Operation:  LOOKUP: orted/pmix/pmix_server_pub.c:345

Your job may terminate as a result of this problem. You may want to
adjust the MCA parameter pmix_server_max_wait and try again. If this
occurred during a connect/accept operation, you can adjust that time
using the pmix_base_exchange_timeout parameter.
--------------------------------------------------------------------------

After that, execution hangs for an indeterminate amount of time and I have to kill it manually with SIGKILL.

However, when I write both the manager and the worker in the same language (either both in Python or both in C++), it works just fine; a rough sketch of the pure-Python worker is shown below.
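For comparison, the pure-Python worker is essentially a minimal sketch along these lines (the file name worker.py is illustrative, and the manager spawns it with something like MPI.COMM_WORLD.Spawn(sys.executable, args=["./worker.py"], maxprocs=4)):

#!/usr/bin/python3
# Minimal sketch of a pure-Python worker (illustrative)
from mpi4py import MPI

# Get the intercommunicator to the spawning manager, if any
comm = MPI.Comm.Get_parent()

if comm == MPI.COMM_NULL:
    print("Running worker without manager")
else:
    # Rank and size within the worker's local group of the intercommunicator
    rank = comm.Get_rank()
    size = comm.Get_size()
    print("Python worker {} of {}".format(rank, size))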

So, what is the problem here and how do I fix it?

Tags: python, c++, mpi, mpi4py

0 Answers
