Set CPU affinity in PMC


    • Type: New Feature
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: master, 4.6.0
    • Affects Version/s: master
    • Component/s: PMC
    • Labels: None

      LIGO is running 8-core tasks using PMC on Stampede, whose nodes have 2 sockets with 8 cores each. They would like each task to be bound to a different socket so that its threads can share the L3 cache. For PMC this requires us to call sched_setaffinity() to bind each task to the right set of cores.
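
      A minimal sketch of the worker-side call, assuming Linux and a core numbering where socket s owns cores s*cores_per_socket through (s+1)*cores_per_socket - 1 (the helper name is illustrative, not existing PMC code). In PMC this would run in the child process after fork() and before exec() of the task:

      #define _GNU_SOURCE
      #include <sched.h>
      #include <stdio.h>

      /* Bind the calling process to every core of one socket. */
      int bind_to_socket(int socket, int cores_per_socket) {
          cpu_set_t mask;
          CPU_ZERO(&mask);
          for (int c = socket * cores_per_socket;
               c < (socket + 1) * cores_per_socket; c++) {
              CPU_SET(c, &mask);
          }
          /* pid 0 means "the calling process" */
          if (sched_setaffinity(0, sizeof(mask), &mask) < 0) {
              perror("sched_setaffinity");
              return -1;
          }
          return 0;
      }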

      In the long term we need a generic and robust method for scheduling tasks on CPUs that allows for periodic rebalancing to remove fragmentation. That will require changing the architecture of PMC so that each worker is responsible for an entire node and can schedule its tasks appropriately. In the near term we just need to handle the case where all the tasks require the same number of CPUs.

      The near-term solution is for the PMC master to keep track of the mapping from each task to a set of physical cores. The master needs to communicate this mapping to the worker so that the worker can call sched_setaffinity(). In addition, the master needs to be able to identify sockets so that it can pack each task onto as few sockets as possible (e.g. we don't want half of a task's threads on one socket and the other half on another socket if they can all fit on one socket).
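
      The master-side bookkeeping could look roughly like the sketch below, which packs an n-core task onto the socket with the fewest free cores that can still hold it (best fit), so that emptier sockets stay available for later large tasks. The data structures and names are hypothetical, not PMC's actual code:

      #define NUM_SOCKETS      2
      #define CORES_PER_SOCKET 8

      /* 1 = core is free, 0 = core is assigned to a running task */
      static int core_free[NUM_SOCKETS][CORES_PER_SOCKET];

      /* Fill 'cores' with n physical core ids on a single socket, or return -1
       * if no socket has n free cores. The chosen core ids would then be sent
       * to the worker along with the task. */
      int allocate_cores(int n, int *cores) {
          int best = -1, best_free = CORES_PER_SOCKET + 1;
          for (int s = 0; s < NUM_SOCKETS; s++) {
              int free_cores = 0;
              for (int c = 0; c < CORES_PER_SOCKET; c++)
                  free_cores += core_free[s][c];
              if (free_cores >= n && free_cores < best_free) {
                  best = s;
                  best_free = free_cores;
              }
          }
          if (best < 0)
              return -1; /* no single socket can hold the task */
          for (int c = 0, k = 0; c < CORES_PER_SOCKET && k < n; c++) {
              if (core_free[best][c]) {
                  core_free[best][c] = 0;
                  cores[k++] = best * CORES_PER_SOCKET + c;
              }
          }
          return 0;
      }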

      This solution won't work when there is a mixed workload of tasks with different core counts. For example, suppose we have a mixture of single-core tasks and 8-core tasks on a dual-socket machine with 8 cores per socket, and we bind 8 single-core tasks to socket 0 and 1 single-core task to socket 1. When one of the tasks on socket 0 finishes, 8 cores are free, but they are spread across the two sockets. If an 8-core task then comes along, we need to defragment the allocation by moving the task from socket 1 to the empty core on socket 0 and binding the 8-core task to socket 1. That involves resetting the affinity of running processes, which is not something the current architecture of PMC allows.

      The long-term solution is to change PMC so that there is 1 MPI rank per host and to use threads to schedule multiple tasks. One thread will be responsible for MPI messaging and another will respond to requests and launch threads to execute tasks. This split is necessary because MPI doesn't provide any file descriptors for select() or poll(); basically, we have to dedicate one thread to listening for messages from the master. This is even more complicated for rank 0, which has to act as both a master and a worker. This design gives us a single thread that controls the resources for the entire node, which will enable us to periodically rebalance CPU affinity (since that thread will know about all the running tasks, including their PIDs and required numbers of cores). It will also enable us to implement other things like work stealing.
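
      A rough sketch of that structure using pthreads is below; the message format, tag, and division of labor between threads are simplified and made up (the listener launches a task thread directly, and shutdown handling and result reporting are omitted):

      #include <mpi.h>
      #include <pthread.h>
      #include <stdlib.h>
      #include <string.h>

      #define TAG_TASK 1

      static void *run_task(void *arg) {
          char *cmd = (char *)arg;
          /* In PMC this would fork/exec the task and set its CPU affinity;
           * system() stands in for that here. */
          system(cmd);
          free(cmd);
          return NULL;
      }

      /* MPI provides no file descriptor for select()/poll(), so this thread
       * does nothing but block waiting for messages from the master (rank 0).
       * The master is assumed to send a NUL-terminated command line. */
      static void *mpi_listener(void *arg) {
          while (1) {
              char buf[4096];
              MPI_Status status;
              MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, TAG_TASK,
                       MPI_COMM_WORLD, &status);
              pthread_t t;
              pthread_create(&t, NULL, run_task, strdup(buf));
              pthread_detach(t);
          }
          return NULL;
      }

      int main(int argc, char **argv) {
          int provided;
          /* Threads other than main call MPI, so ask for full thread support. */
          MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
          pthread_t listener;
          pthread_create(&listener, NULL, mpi_listener, NULL);
          pthread_join(listener, NULL);
          MPI_Finalize();
          return 0;
      }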

            Assignee:
            Gideon Juve (Inactive)
            Reporter:
            Duncan Brown
            Watchers:
            1

              Created:
              Updated:
              Resolved: