Mapping Options

Tensor Comprehensions (TC) can be transformed, or mapped, into CUDA kernels almost automatically. Because there is more than one possible way to execute tensor operations in parallel on modern GPUs, for example using different CUDA grids or different relative execution orders, the TC engine requires the user to make a set of choices about the mapping process and to provide them through the mapping options. Given specific options, the translation process becomes fully automatic.

The mapping options provide a relatively high-level declarative interface to the GPU mapping process. They are not expressed in terms of loops, other control flow constructs, or individual tensors. Instead, they enable or parameterize certain classes of transformations, similarly to regular compiler options. In particular, they control the resources allocated to the GPU kernel: the number of threads, the amount of shared memory to use, the amount of computation per thread, etc. These resources affect occupancy and ultimately the performance of the generated kernel. Mapping options are mostly intended for programmatic use: they can be configured through API calls, and saved to and loaded from a Protocol Buffer.

How to choose starting mapping options?

Don’t.

We recommend not setting up the mapping options manually unless you understand how TCs map to CUDA code and how the latter can be optimized. Use the Autotuner or the operation- and GPU-specific options provided with TC, see Defaults provided.

Options API

Options can be set up programmatically using the C++ or Python API. Both implement a fluent interface through method chaining. Mapping options construction always starts from the naive options, which enable some kernel code to be generated but often provide poor performance. TC provides more efficient mapping options for some common deep learning operations, see Defaults provided. Individual mapping parameters can be modified by calling option-specific functions, for example:

C++

#include <tc/core/cuda/cuda_mapping_options.h>

auto options = MappingOptions::makeNaiveMappingOptions()  // start from the naive defaults
    .mapToBlocks(100, 20)     // CUDA grid: 100 x 20 blocks
    .mapToThreads(32, 4, 4);  // CUDA block: 32 x 4 x 4 threads

Python

from tensor_comprehensions.tc import Options

options = Options("naive")        # start from the naive defaults
options.mapToBlocks([100, 20])    # CUDA grid: 100 x 20 blocks
options.mapToThreads([32, 4, 4])  # CUDA block: 32 x 4 x 4 threads

When an option allows for multiple arguments, the Python API accepts a list while the C++ API provides variadic-argument overloads along with vector- and initializer_list-based versions. See Available options for the full list.
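
For instance, in C++ the two calls below are meant to be interchangeable (a minimal sketch based on the overloads described above; the values themselves are arbitrary):

auto variadic = MappingOptions::makeNaiveMappingOptions()
    .mapToThreads(32, 4, 4);    // variadic-argument form
auto braced = MappingOptions::makeNaiveMappingOptions()
    .mapToThreads({32, 4, 4});  // initializer_list-based form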

Defaults provided

TC comes with a list of pre-tuned mapping options for some common classes of deep learning operations. Although these options were tested on recent production GPUs, performance remains sensitive both to the available GPU resources (number of SMs, shared memory size) and to the input sizes. We highly recommend using the autotuner for cases that require competitive performance.

The mapping options for the following classes of operations are provided as static methods of the MappingOptions class.

  • makePointwiseMappingOptions(): Mapping options for point-wise arithmetic operations (e.g. bias).
  • makeMlpMappingOptions(): Mapping options for multilayer perceptrons (sequences of fully connected layers followed by non-linearity).
  • makeConvolutionMappingOptions(): Mapping options for convolutional layers.
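
For example, one can start from the convolution defaults and override individual parameters in C++ (a minimal sketch; the overridden values are illustrative, not tuned):

auto options = MappingOptions::makeConvolutionMappingOptions()
    .mapToThreads(32, 8)     // override the block configuration of the defaults
    .useSharedMemory(true);  // keep shared-memory promotion enabled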

Available options

The following options are currently available:

  • .mapToBlocks(<list of 1..3 positive integers>): The configuration of the CUDA grid, i.e. the number of CUDA blocks along three dimensions. Must be within the range allowed by CUDA (maximum 2^31-1 for the first value and 65535 for the second and third). Note that the TC mapper eliminates empty blocks and the actual launch size may be smaller than requested.
  • .mapToThreads(<list of 1..3 positive integers>): The configuration of the CUDA block, i.e. the number of CUDA threads in each block along three dimensions. Must be within the range allowed by CUDA (maximum 1024 for the first and second value, 32 for the third, product not exceeding 1024). Note that the TC mapper eliminates empty threads and the actual launch size may be smaller than requested.
  • .tile(<list of positive integers>): Perform loop tiling on the generated code with the given sizes. Independent of mapping to a grid of thread blocks.
  • .useSharedMemory(<boolean>): Create block-local copies of data in shared memory when this can leverage data reuse or global memory access coalescing.
  • .maxSharedMemory(<positive integer>): The amount of shared memory to use, in bytes. If not provided, TC will query the active GPU and use all available shared memory.
  • .unroll(<positive integer>): Perform loop unrolling on the generated code and produce at most the given number of statements.
  • .unrollCopyShared(<boolean>): Also unroll the copies to and from shared memory introduced by the TC mapper. Has no effect if the unroll value is not provided.
  • .useReadOnlyCache(<boolean>): Emit loads through the read-only cache when appropriate.
  • .matchLibraryCalls(<boolean>): Replace computation patterns with calls to highly optimized libraries (such as CUB, CUTLASS) when possible.
  • .fixParametersBeforeScheduling(<boolean>): Perform automatic loop scheduling taking into account specific tensor sizes. May produce faster kernels but significantly increases compilation time. Note that the mapping will be performed for specific tensor sizes anyway.
  • .outerScheduleFusionStrategy(<choice of Max, Preserve3Coincident, Min>): Require TC to try and execute different TC expressions interleaved (Max), separately (Min) or interleaved as long as sufficient parallelism is exploited (Preserve3Coincident) by performing loop fusion and fission. Applies before tiling.
  • .intraTileFusionStrategy(<choice of Max, Preserve3Coincident, Min>): Require TC to try and execute different TC expressions interleaved (Max), separately (Min) or interleaved as long as sufficient parallelism is exploited (Preserve3Coincident) by performing loop fusion and fission. Applies to inner loops created by tiling.
  • .scheduleFusionStrategy(<choice of Max, Preserve3Coincident, Min>): Set both outerScheduleFusionStrategy and intraTileFusionStrategy to the given value.
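
As an illustration, the C++ sketch below chains several of these options on top of the naive defaults (all values are placeholders chosen for readability, not tuned for any particular operation or GPU):

auto options = MappingOptions::makeNaiveMappingOptions()
    .tile(4, 8)                // tile the generated loops with sizes 4 and 8
    .mapToBlocks(256, 32)      // CUDA grid: 256 x 32 blocks
    .mapToThreads(32, 8)       // CUDA block: 32 x 8 threads
    .useSharedMemory(true)     // promote reused data to shared memory
    .maxSharedMemory(16384)    // use at most 16 KB of shared memory
    .unroll(16)                // unroll loops, producing at most 16 statements
    .unrollCopyShared(true)    // also unroll the shared-memory copies
    .matchLibraryCalls(true);  // allow replacing patterns with library calls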

Note

Other, experimental options may be exposed in the API. Unless explained in the documentation, their behavior is undefined: they may or may not affect the generated kernel, and may even change its outputs. Use them at your own risk.

Impact on Performance

There is no general approach to choosing the best mapping options. We provide several recommendations that have repeatedly proven successful in the past.

  • First and foremost, explore the mapping options together with a profiling tool that indicates the bottlenecks of your kernel. Since CUDA kernel performance is mostly affected by GPU occupancy, identify the occupancy-limiting factor and change the options that may affect it.
  • While dimensions of the LHS tensor are typically transformed into loops, some of which may be mapped to CUDA blocks and threads, you should not assume any correspondence between these dimensions, the generated loops, or the positions of the mapping options arguments. To get more comfortable with the mapping options, analyze how the generated CUDA code changes when an option changes.
  • The amount of parallelism and computation per thread is controlled by a combination of grid and block sizes. If the total number of threads (number of blocks times number of threads per block) equals the number of LHS tensor elements, then each thread computes a single element of that tensor. As different loops are generated for iterating over different tensor dimensions, and these loops end up mapped to GPU threads, consider grid/block size pairs that correspond to tensor sizes along different dimensions. Using a factor of the tensor size as the total number of threads makes each thread compute multiple elements of the tensor. A number of threads that does not evenly divide the tensor size leads to thread divergence: some threads do the computation while others do not. While divergence is generally detrimental to performance, you may still want to consider multiples of the warp size (32) as the number of threads. Also keep in mind the limit on the number of threads per block (typically 1024). Note that the TC mapping engine eliminates any blocks and threads that do not compute anything, e.g., if the total number of threads is greater than the number of LHS tensor elements that can be computed independently. A worked sizing sketch follows this list.
  • Different pairs of grid and block sizes can result in the same total number of threads. If there is data reuse, i.e. the same elements of the RHS tensors are needed to compute different elements of the LHS tensor, larger blocks allow the mapper to place more of the reused data into faster shared memory. However, the larger the block, the more shared memory it requires, which may end up limiting occupancy. In this case you may want to set the shared memory size to a value smaller than the physically available shared memory. Eventually, the data reused inside a block may stop fitting in shared memory.
  • Tiling may leverage the caches by making reuse more localized. Elements of the LHS tensor in TC can be computed independently, yet, when not computed in parallel, they are computed in some order. While this order is optimized for maximal parallelism and reuse by an automatic procedure, the procedure only changes the order in which tensor dimensions are processed. One can think of it as an extension to tensors of per-row or per-column matrix traversals. In any case, an entire slice (row, plane, hyperplane) of the LHS tensor is computed before the next slice starts. If some RHS tensor element is reused for computing LHS values in the same column, but the order was chosen to be row by row, this element is likely to be evicted from the cache before it is needed again. Tiling changes the order in which LHS elements are computed by creating smaller blocks inside each slice. Tile sizes define the number of elements along each dimension in such a block. This transformation is reminiscent of how iterations are mapped to the CUDA grid of thread blocks; in fact, mapping to blocks implicitly performs tiling. Unlike the thread block mapping, tiling does not require all elements to be computed independently of each other as long as other validity conditions hold. Note that the TC engine performs tiling independently of mapping to the CUDA grid, i.e., the tiled dimensions may or may not be mapped to blocks or threads. Similarly to block and grid sizes, tile sizes that are divisors of the input tensor size are a reasonable choice. Keep them relatively small to benefit from caches.
  • Using shared memory is profitable in many cases. Even when there is no reuse, data may be preloaded into shared memory in a more efficient way than it is accessed during computation, in particular using memory coalescing. However, it may limit the amount of parallelism. Copying to shared memory also uses barrier synchronization inside blocks, which may be undesirable for short kernels. Promotion to shared memory may be disabled for cases where global memory access is not the principal bottleneck of the kernel.
  • Unrolling eliminates control flow by introducing copies of statements. This reduces the number of integer instructions but may significantly increase the compilation time.
  • Fusion strategy controls how different TC expressions will be interleaved with each other. Maximal fusion will attempt to “pipeline” the computation of tensor elements whenever possible, while minimal fusion will try to ensure that all elements of one LHS tensor are computed before starting the next one. Fusion often makes reuse more local, but it increases memory resource requirements and, more importantly, may lead to a loss of parallelism. Maximal fusion is sometimes required at the outer level to produce kernels mappable to more than one block (or requiring a global synchronization); minimal fusion at the inner level can decrease the resource requirements at the cost of additional synchronizations inside the loop.
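
To make the thread-count and tiling recommendations above concrete, consider a hypothetical point-wise operation whose LHS tensor has 1024 x 1024 elements (a C++ sketch; all sizes and values are illustrative only, not tuned for any particular GPU):

// One thread per LHS element: 1024 * 1024 = 1,048,576 threads in total,
// with 256 threads per block (a multiple of the warp size, 32),
// hence 1,048,576 / 256 = 4096 blocks.
auto onePerThread = MappingOptions::makeNaiveMappingOptions()
    .mapToBlocks(4096)
    .mapToThreads(256);

// Several elements per thread: keep 256 threads per block but request
// only 1024 blocks, so each thread computes 4 elements. Tiling and a
// shared memory cap localize reuse and leave headroom for occupancy.
auto fourPerThread = MappingOptions::makeNaiveMappingOptions()
    .mapToBlocks(1024)
    .mapToThreads(256)
    .tile(32, 32)
    .useSharedMemory(true)
    .maxSharedMemory(16384);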

Possible compiler issues

  • Mapping failures: Some combinations of mapping options are forbidden, for example using more than 1024 threads per block or more shared memory than is physically available on the device. In these cases, the TC mapper will throw an exception. In some extreme cases of catastrophic failure, TC may abort completely. Please report such cases to us.
  • Long compilation times: TC internally relies on solving a mathematical optimization problem that may be hard. Mapping options related to scheduling, fusion and unrolling are known to affect compilation time significantly. Large unroll values and some cases of fixParametersBeforeScheduling may lead to minutes of compilation time even for simple kernels. We recommend disabling these options if compilation takes too long (a sketch follows this list) or using the autotuner, which prunes options that result in long compilation times.
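
If compilation times become a problem, a possible starting point is to dial these options back explicitly (a C++ sketch; it assumes that an unroll value of 1 effectively disables unrolling, since at most that many statements are produced):

auto options = MappingOptions::makeNaiveMappingOptions()
    .fixParametersBeforeScheduling(false)  // schedule without fixing tensor sizes
    .unroll(1);                            // at most one statement, i.e. effectively no unrolling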