The acc dialect is an MLIR dialect for representing the OpenACC programming model. OpenACC is a standardized directive-based model which is used with C, C++, and Fortran to enable programmers to expose parallelism in their code. The descriptive approach used by OpenACC allows targeting of parallel multicore and accelerator targets like GPUs by giving the compiler the freedom to decide how to parallelize for specific architectures. OpenACC also provides the ability to optimize the parallelism through increasingly prescriptive clauses.
This dialect models the constructs from the OpenACC 3.3 specification.
This document describes the design of the OpenACC dialect in MLIR. It lists and explains design goals and design choices along with their rationale. It also describes specifics with regards to acc dialect operations, types, and attributes.
acc pragmas in MLIR. Additionally, this dialect is expected to be further lowered when materializing its semantics. Without a complete representation, a frontend might choose a lower abstraction (such as a direct runtime call), but this would impact the ability to do analysis and optimizations on the dialect.
recipe along with the private operation which can be packaged neatly with the acc dialect operations.
hlfir, fir, llvm, cir.
acc dialect coexisting with other dialect(s) is necessary by construction. Through proper abstractions, neither the acc dialect nor the source language dialect should have dependencies on each other; where needed, interfaces should be used to ensure the acc dialect can verify expected properties.
acc copyin clause. After the acc.copyin operation, a pointer which lives on the device should be distinguishable from one that lives in host memory.
MemoryEffects, are the key way MLIR transformations and analyses are designed to interact with the IR. In order for the operations in the acc dialect to be optimizable (either directly or even indirectly by not blocking optimizations of nested IR), implementing relevant common interfaces is needed.
The design philosophy of the acc dialect is one where the design goals are adhered to. Current and planned operations, attributes, and types must adhere to the design goals.
The OpenACC dialect includes three groups of operations: high-level operations (which retain the same semantic meaning as their OpenACC language equivalent), intermediate-level operations (which are used to decompose clauses from constructs), and low-level operations (which encode specifics associated with the source language in a generic way).
The high-level operations list contains the following OpenACC language constructs and their corresponding operations:
acc parallel → acc.parallel
acc kernels → acc.kernels
acc serial → acc.serial
acc data → acc.data
acc loop → acc.loop
acc enter data → acc.enter_data
acc exit data → acc.exit_data
acc host_data → acc.host_data
acc init → acc.init
acc shutdown → acc.shutdown
acc update → acc.update
acc set → acc.set
acc wait → acc.wait
acc atomic read → acc.atomic.read
acc atomic write → acc.atomic.write
acc atomic update → acc.atomic.update
acc atomic capture → acc.atomic.capture
This second group contains operations which are used to represent either decomposed constructs or clauses for more accurate modeling:
acc routine → acc.routine + acc.routine_info attribute
acc declare → acc.declare_enter + acc.declare_exit or acc.declare
acc {construct} copyin → acc.copyin (before region) + acc.delete (after region)
acc {construct} copy → acc.copyin (before region) + acc.copyout (after region)
acc {construct} copyout → acc.create (before region) + acc.copyout (after region)
acc {construct} attach → acc.attach (before region) + acc.detach (after region)
acc {construct} create → acc.create (before region) + acc.delete (after region)
acc {construct} present → acc.present (before region) + acc.delete (after region)
acc {construct} no_create → acc.nocreate (before region) + acc.delete (after region)
acc {construct} deviceptr → acc.deviceptr
acc {construct} private → acc.private
acc {construct} firstprivate → acc.firstprivate
acc {construct} reduction → acc.reduction
acc cache → acc.cache
acc update device → acc.update_device
acc update host → acc.update_host
acc host_data use_device → acc.use_device
acc declare device_resident → acc.declare_device_resident
acc declare link → acc.declare_link
acc exit data delete → acc.delete (with structured flag as false)
acc exit data detach → acc.detach (with structured flag as false)
acc {construct} {data_clause}(var[lb:ub]) → acc.bounds
The low-level operations are:
acc.private.recipe
acc.reduction.recipe
acc.firstprivate.recipe
acc.global_ctor
acc.global_dtor
acc.yield
acc.terminator
The semantics of the low-level operations and the reasoning behind them are further explained in the sections below.
The data clauses are decomposed from their constructs for better dataflow modeling in MLIR. There are multiple reasons for this which are consistent with the dialect goals:
MemoryEffects to a single operation. This can better reflect semantics (like the fact that an acc.copyin operation only reads host memory).
CSE).
Each of the acc dialect data operations represents either the entry or the exit portion of the data action specification. Thus, acc.copyin represents the semantics defined in section 2.7.7 copyin clause, whose wording starts with "At entry to a region". The decomposed exit operation acc.delete represents the second part of that section, whose wording starts with "At exit from the region". The delete action may be performed after checking and updating of the relevant reference counters noted.
The acc data operations, even when decomposed, retain their original data clause in a dataClause field on the operation so that this information can be recovered during debugging. For example, acc copy does not translate to an acc.copy operation, but instead to acc.copyin for entry and acc.copyout for exit. Both decomposed operations hold a dataClause field that specifies this was an acc copy.
The link between the decomposed entry and exit operations is the SSA value produced by the entry operation. Namely, it is the accPtr result which is used both in the dataOperands of the operation used for the construct and in the accPtr operand of the exit operation.
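The decomposition described above can be sketched as a small Python model. This is only an illustration, not the dialect's implementation: the entry/exit pairs are copied from the clause list in this document, while the DataOp class, the decompose function, and the %acc_ naming of the SSA value are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Entry/exit operation pairs, copied from the clause list in this document.
# None means no exit operation is listed for that clause.
DECOMPOSITION = {
    "copyin": ("acc.copyin", "acc.delete"),
    "copy": ("acc.copyin", "acc.copyout"),
    "copyout": ("acc.create", "acc.copyout"),
    "attach": ("acc.attach", "acc.detach"),
    "create": ("acc.create", "acc.delete"),
    "present": ("acc.present", "acc.delete"),
    "no_create": ("acc.nocreate", "acc.delete"),
    "deviceptr": ("acc.deviceptr", None),
}


@dataclass(frozen=True)
class DataOp:
    """Hypothetical stand-in for a decomposed acc data operation."""
    name: str                     # operation name, e.g. "acc.copyin"
    operand: str                  # host variable (entry) or accPtr value (exit)
    data_clause: str              # original clause, kept so it can be recovered
    result: Optional[str] = None  # accPtr value produced by an entry operation


def decompose(clause: str, var: str) -> Tuple[DataOp, Optional[DataOp]]:
    """Split one data clause on `var` into its entry and exit operations."""
    entry_name, exit_name = DECOMPOSITION[clause]
    acc_ptr = f"%acc_{var}"  # hypothetical SSA name linking entry and exit
    original = f"acc {clause}"
    entry = DataOp(entry_name, var, original, acc_ptr)
    exit_op = None
    if exit_name is not None:
        # The exit operation consumes the accPtr produced by the entry operation.
        exit_op = DataOp(exit_name, acc_ptr, original)
    return entry, exit_op


entry, exit_op = decompose("copy", "a")
print(entry.name, exit_op.name, entry.data_clause)
```

In this model, both halves of an acc copy carry the original clause name, and the exit operation's operand is the value produced by the entry operation, which mirrors the accPtr link described above.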
OpenACC data clauses allow the use of bounds specifiers as per 2.7.1 Data Specification in Data Clauses. However, array dimensions for the data are not always required in the clause if the source language's type system captures this information - the user can just specify the variable name in the data clause. So the acc.bounds operation is an important piece to ensure uniform representation of both explicit user-set dimensions and implicit type-based dimensions. It contains several key features that allow sizes to be encoded in a manner that is flexible and agnostic to the source language's dialect:
acc.bounds operations.
PointerLikeType requirement in data clauses - since a lowerbound of 0 means looking at data at the zero offset from the pointer. This requirement also works well in ensuring the acc dialect is agnostic to the source language dialect since it prevents ambiguity such as the case of Fortran arrays where the lower bound is not a fixed value.
!fir.array<?x?xi32>) but instead encodes it in some other way (such as through descriptors), then the frontend must fill in the acc.bounds operands with appropriate information (such as loads from the descriptor). The acc.bounds operation also permits a lossy source dialect, such as when the frontend uses aggressive pointer decay and cannot represent the dimensions in the type system (e.g. using !llvm.ptr for arrays). Both of these aspects show the flexibility of the acc.bounds operation in allowing the representation to be agnostic, since the acc dialect is not expected to be able to understand how to extract dimension information from the types of the source dialect.
acc.bounds operation is rich enough to accept either or both - for convenience in lowering to the dialect and for the ability to precisely capture the meaning from the clause.
acc.bounds operation. This is also an important part of being able to accept a source language's arrays without forcing the frontend to normalize them in some way. For example, consider a case where, in a parent function, a whole array is mapped to the device. Then only a view with a non-1 stride is passed to a child function (e.g. a Fortran array slice with non-1 stride). A copy operation of this data in the child should be able to avoid remapping this array. If instead the operation required normalizing the array (such as making it contiguous), then unexpected disjoint mapping of the same host data would be error-prone since it would result in multiple mappings to the device.
The data operations also maintain semantics described in the OpenACC specification related to runtime counters. More specifically, consider the specification of the entry portion of acc copyin in section 2.7.7:
At entry to a region, the structured reference counter is used. On an enter data directive, the dynamic reference counter is used.
- If var is present and is not a null pointer, a present increment action with the appropriate reference counter is performed.
- If var is not present, a copyin action with the appropriate reference counter is performed.
- If var is a pointer reference, an attach action is performed.
The acc.copyin operation includes these semantics, including those related to attach, which is specified through the varPtrPtr operand. The structured flag on the operation is important since the structured reference counter should be used when the flag is true, and the dynamic reference counter should be used when it is false.
At exit from structured regions (acc data, acc kernels), the acc copyin clause is decomposed to acc.delete (with the structured flag as true). The semantics of acc.delete are also consistent with the OpenACC specification noted for the exit portion of the acc copyin clause:
At exit from the region:
- If the structured reference counter for var is zero, no action is taken.
- Otherwise, a detach action is performed if var is a pointer reference, and a present decrement action with the structured reference counter is performed if var is not a null pointer. If both structured and dynamic reference counters are zero, a delete action is performed.
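The counter rules quoted above can be traced with a small Python model. This is a hedged sketch rather than a runtime implementation: the Device and Counters classes and their method names are hypothetical, the attach/detach and null-pointer cases are left out, and the assumption that a copyin action sets the selected counter to one comes from the OpenACC specification's description of that action, not from this document.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Counters:
    structured: int = 0
    dynamic: int = 0


@dataclass
class Device:
    """Hypothetical model of device presence tracking for one device."""
    present: Dict[str, Counters] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)

    def copyin_entry(self, var: str, structured: bool) -> None:
        """Entry semantics of acc.copyin; the flag selects the counter."""
        which = "structured" if structured else "dynamic"
        counters = self.present.get(var)
        if counters is None:
            # Not present: copyin action (assumed to set the counter to one).
            counters = Counters()
            setattr(counters, which, 1)
            self.present[var] = counters
            self.log.append(f"copyin {var}")
        else:
            # Present: present increment action on the selected counter.
            setattr(counters, which, getattr(counters, which) + 1)
            self.log.append(f"present_increment {var}")

    def delete_exit(self, var: str) -> None:
        """Exit semantics of acc.delete with the structured flag as true."""
        counters = self.present.get(var)
        if counters is None or counters.structured == 0:
            return  # structured counter is zero: no action is taken
        counters.structured -= 1
        self.log.append(f"present_decrement {var}")
        if counters.structured == 0 and counters.dynamic == 0:
            del self.present[var]
            self.log.append(f"delete {var}")


dev = Device()
dev.copyin_entry("a", structured=True)  # e.g. an outer acc data region
dev.copyin_entry("a", structured=True)  # e.g. a nested acc kernels region
dev.delete_exit("a")                    # exit of the nested region
dev.delete_exit("a")                    # exit of the outer region
print(dev.log)
```

In this trace only the first entry performs a copyin and only the last exit performs a delete; the nested region just increments and decrements the structured counter.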
There are a few acc dialect type categories to describe:
varPtr
varPtr must be pointer-like. This is done by attaching the PointerLikeType interface to the appropriate MLIR type. Although the memory/storage concept is a lower level abstraction, it is useful because the OpenACC model distinguishes between host and device memory explicitly - and the mapping between the two is done through pointers. Thus, by explicitly requiring it in the dialect, the appropriate language frontend must create storage or use a type that satisfies the mapping constraint.
varPtr. This was done intentionally instead of introducing an acc.ref/ptr type so that IR compatibility and the dialect's existing strong type checking can be maintained. This is needed since the acc dialect must live within another dialect whose type system is unknown to it. The only constraint is that the appropriate dialect type must use the PointerLikeType interface.
acc.bounds and acc.declare_enter produce types to allow their results to be used only in specific operations.
Recipes are a generic way to express source language specific semantics.
There are currently two categories of recipes, but the recipe concept can be extended for any additional low-level information that needs to be captured for successful lowering of OpenACC. The two categories are:
The intention of the recipes is to specify how materialization of an action, such as privatization, should be done when the semantics of the action need to be interpreted and lowered, such as before generating the LLVM dialect.
The recipes used for privatization provide a source-language independent way of specifying the creation of a local variable of that type. This means using the appropriate alloca instruction and being able to specify default initialization or a default constructor.
The routine directive is used to note that a procedure should be made available for the accelerator in a way that is consistent with its modifiers, such as those that describe the parallelism. In the acc dialect, an acc routine is represented through two joint pieces - an attribute and an operation:
The acc.routine operation is simply a specifier which notes which symbol (or string) the acc routine is needed for, along with the associated parallelism. This defines a symbol that can be referenced in the attribute.
The acc.routine_info attribute is an attribute used on the source dialect specific operation which specifies one or multiple acc.routine symbols. Typically, this is attached to func.func which either provides the declaration (in case of externals) or provides the actual body of the acc routine in the dialect that the source language was translated to.
OpenACC declare is a mechanism which declares a definition of a global or a local to be accessible to the accelerator, with an implicit lifetime matching that of the scope in which it was declared. Thus, declare semantics are represented through multiple operations and attributes:
acc.declare - This is a structured operation which contains an MLIR region and can be used in a similar manner as acc.data to specify an implicit data region with specific procedure lifetime. This is typically used inside func.func after variable declarations.
acc.declare_enter - This is an unstructured operation which is used as a decomposed form of acc declare. It effectively allows the entry operation to exist in a scope different than the exit operation. It can also be used along with acc.declare_exit, which consumes its token, to define a scoped region without using an MLIR region. This operation is also used in acc.global_ctor.
acc.declare_exit - The matching equivalent of acc.declare_enter except that it specifies exit semantics. This operation is typically used inside a func.func at the exit points or with acc.global_dtor.
acc.global_ctor - Lives at the same level as source dialect globals and is used to specify data actions to be done at program entry. This is used in conjunction with source dialect globals whose lifetime is not just a single procedure.
acc.global_dtor - Defines the exit data actions that should be done at program exit. Typically used to revert the actions of acc.global_ctor.
The attributes:
acc.declare - This is a facility for easier determination of variables which are acc declare'd. This attribute is used on operations producing globals and on operations producing locals such as dialect specific alloca's. Having this attribute is required in order to appear in a data mapping operation associated with any of the acc.declare* operations.
acc.declare_action - Since the OpenACC specification allows declaration of variables that have yet to be allocated, this attribute is used at the allocation and deallocation points. More specifically, this attribute captures symbols of functions to be called to perform an action either pre-allocate, post-allocate, pre-deallocate, or post-deallocate. Calls to these functions should be materialized when lowering OpenACC semantics to ensure proper data actions are done after the allocation/deallocation.
The design goal for the acc dialect is to be friendly to MLIR optimization passes including CSE and LICM. Additionally, since it is designed to recover original clauses, it makes late verification and analysis possible in the MLIR framework outside of the frontend.
This section describes a few MLIR-level passes that the acc dialect design should be friendly to. It currently only outlines the possibilities intended by the design and does not necessarily describe existing passes.
Since the OpenACC dialect is not lossy with regards to its representation, it is possible to do OpenACC language semantic checking at the MLIR-level. What follows is a list of various semantic checks needed.
This first list is required to be done in the frontend because the acc dialect operations must be valid when constructed:
However, the following are semantic checks that can be done at the MLIR-level (either in a separate pass or as part of the operation verifier):
Note that some of these checks can be even more precise when done at the MLIR level because optimizations like inlining and constant propagation expose detail that wouldn't have been visible in the frontend.
The OpenACC specification includes section 2.6.2 Variables with Implicitly Determined Data Attributes. This section describes the data actions that should be applied to a variable for which the user did not specify a data action. The action depends on the construct being used and also on the default clause. However, the point to note here is that variables which are live-in to the acc region must employ some data mapping so the data can be passed to the accelerator.
One possible optimization that affects the data attributes needed is Scalar Replacement of Aggregates (SROA). The acc dialect should not prevent this from happening on the source dialect.
Because it is intended to be possible to apply optimizations across an acc region, the analysis/transformation pass that applies the implicit data attributes should be run as late as possible - ideally right before any outlining process which uses the acc region body to create an accelerator procedure. It is expected that existing MLIR facilities, such as mlir::Liveness, will work for the acc region and thus can be used to perform this analysis.
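The idea of finding variables that still need an implicit data attribute can be illustrated with a simplified Python sketch. It does not use mlir::Liveness; it is a stand-in that assumes the region body is a single straight-line block, and the function names and the (defs, uses) encoding of operations are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

# Each operation is modeled as (values it defines, values it uses).
Op = Tuple[List[str], List[str]]


def region_live_ins(region_ops: Iterable[Op]) -> Set[str]:
    """Values used inside the region that are not defined inside it first."""
    defined: Set[str] = set()
    live_in: Set[str] = set()
    for defs, uses in region_ops:
        for value in uses:
            if value not in defined:
                live_in.add(value)
        defined.update(defs)
    return live_in


def needs_implicit_mapping(region_ops: Iterable[Op], explicit: Set[str]) -> List[str]:
    """Live-in values that have no explicit data clause on the construct."""
    return sorted(region_live_ins(region_ops) - explicit)


# Region body: %t = f(%a, %i); %u = g(%t, %b); store %u, %c
body = [(["%t"], ["%a", "%i"]), (["%u"], ["%t", "%b"]), ([], ["%u", "%c"])]
print(needs_implicit_mapping(body, explicit={"%a"}))
```

Running such an analysis late matters because earlier optimizations may remove or split values, which changes the set of live-ins that need a mapping.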
The data operations are modeled in a way where data entry operations look like loads and data exit operations look like stores. Thus these operations are intended to be optimized in the following ways:
acc.copyin dominates another.
[include "Dialects/OpenACCDialectOps.md"]