DeepSparse achieves better performance on multiple-socket systems as well as with hyperthreading disabled; models with larger batch sizes are likely to see an improvement. One standard way of controlling compute/memory resources when running processes is to use the numactl utility. numactl can be used when multiple processes need to run on the same hardware but require their own CPU/memory resources to run optimally.
To run DeepSparse on a single socket (N) of a multi-socket system, you would start the DeepSparse using numactl. For example:
numactl --cpunodebind N <deepsparseengine-process>
To run DeepSparse on multiple sockets (N,M), run:
numactl --cpunodebind N,M <deepsparseengine-process>
It is advised to also allocate memory from the same socket on which the engine is running. So,
--preferred should be used when using
--cpunodebind. For example:
1 numactl --cpunodebind N --preferred N <deepsparseengine-process>2 or3 numactl --cpunodebind N --membind N <deepsparseengine-process>
The difference between
--preferred is that
--preferred allows memory from other sockets to be allocated if the current socket is out of memory.
--membind does not allow memory to be allocated outside the specified socket.
For more fine-grained control, numactl can be used to bind the process running DeepSparse to a set of specific CPUs using
--physcpubind. CPUs are numbered from 0-N, where N is the maximum number of logical cores available on the system. On systems with hyper-threading (or SMT), there may be more than one logical thread per physical CPU. Usually, the logical CPUs/threads are numbered after all the physical CPUs/threads. For example, in a system with two threads per CPU and N physical CPUs, the threads for a particular CPU (K) will be K and K+N for all 0<=K<N. DeepSparse currently works best with hyper-threading/SMT disabled, so only one set of threads should be selected using numactl, i.e., 0 through (N-1) or N through (N-1).
Similarly, for a multi-socket system with N sockets and C physical CPUs per socket, the CPUs located on a single socket will range from KC to ((K+1)C)-1 where 0<=K<N. For multi-socket, multi-thread systems, the logical threads are separated by N*C. For example, for a two socket, two thread per CPU system with 8 cores per CPU, the logical threads for socket 0 would be numbered 0-7 and 16-23, and the threads for socket 1 would be numbered 8-15 and 24-31.
Given the architecture above, to run DeepSparse on the first four CPUs on the second socket, you would use:
numactl --physcpubind 8-11 --preferred 1 <deepsparseengine-process>
--preferred 1 is needed here since DeepSparse is being bound to CPUs on the second socket.
Note: When running on multiple sockets, using a batch size that is evenly divisible by the number of sockets will yield the best performance.
When using numactl to specify the CPUs/sockets on which the engine is allowed to run, there is no restriction as to the CPU on which a particular computation thread is executed. A single thread of computation may run on one or more CPUs during the course of execution. This is desirable if the system is being shared between multiple processes so that idle CPU threads are not prevented from doing other work.
However, the engine works best when threads are pinned (i.e., not allowed to migrate from one CPU to another). Thread pinning can be enabled using the
NM_BIND_THREADS_TO_CORES environment variable. For example:
1 NM_BIND_THREADS_TO_CORES=1 <deepsparseengine-process>2 or3 export NM_BIND_THREADS_TO_CORES=1 <deepsparseengine-process>
NM_BIND_THREADS_TO_CORES with care since it forces DeepSparse to run on only the threads it has been allocated at startup. If any other process ends up running on the same threads, it could result in a major degradation of performance.
Note: The threads-to-cores mappings described above are specific to Intel only. AMD has a different mapping. For AMD, all the threads for a single core are consecutive; that is, if each core has two threads and there are N cores, the threads for a particular core K are 2K and 2K+1. The mapping of cores to sockets is also straightforward. For an N socket system with C cores per socket, the cores for a particular socket S are numbered SC to ((S+1)C)-1.
This displays the inventory of available sockets/CPUs on a system:
This displays the resources available to the current process:
For further details about these and other parameters, see the man page on numactl: