 <!--===- docs/DoConcurrentMappingToOpenMP.md

    Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
    See https://llvm.org/LICENSE.txt for license information.
    SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

 -->

 # `DO CONCURRENT` mapping to OpenMP

 ```{contents}
 ---
 local:
 ---
 ```

 This document describes the effort to parallelize `do concurrent` loops
 by mapping them to OpenMP worksharing constructs. The goals of this document
 are:
 * Describing how to instruct `flang` to map `DO CONCURRENT` loops to OpenMP
   constructs.
 * Tracking the current status of such mapping.
 * Describing the limitations of the current implementation.
 * Describing next steps.
 * Tracking the current upstreaming status (from the AMD ROCm fork).

 ## Usage

 In order to enable `do concurrent` to OpenMP mapping, `flang` adds a new
 compiler flag: `-fdo-concurrent-to-openmp`. This flag has 3 possible values:
 1. `host`: this maps `do concurrent` loops to run in parallel on the host CPU.
    This maps such loops to the equivalent of `omp parallel do`.
 2. `device`: this maps `do concurrent` loops to run in parallel on a target device.
    This maps such loops to the equivalent of
    `omp target teams distribute parallel do`.
 3. `none`: this disables `do concurrent` mapping altogether. In that case, such
    loops are emitted as sequential loops.

 The `-fdo-concurrent-to-openmp` compiler switch is currently available only when
 OpenMP is also enabled. So you need to provide the following options to flang in
 order to enable it:
 ```
 flang ... -fopenmp -fdo-concurrent-to-openmp=[host|device|none] ...
 ```
 For mapping to device, the target device architecture must be specified as well.
 See `-fopenmp-targets` and `--offload-arch` for more info.
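
 As a rough illustration of the `host` mapping, the sketch below places a
 `do concurrent` loop next to the hand-written `omp parallel do` loop it is
 described as being equivalent to. The program name, array names, and sizes are
 illustrative assumptions and are not taken from the flang test suite.

 ```fortran
 program dc_usage
   implicit none
   integer, parameter :: n = 1000   ! illustrative size (assumption)
   integer :: i
   real :: a(n), b(n)

   ! Candidate for mapping: every iteration is independent of the others.
   do concurrent (i = 1:n)
     a(i) = 2.0 * real(i)
   end do

   ! Hand-written OpenMP loop that the `host` mapping is described as
   ! equivalent to. The loop variable `i` is private by OpenMP's rules.
   !$omp parallel do
   do i = 1, n
     b(i) = 2.0 * real(i)
   end do
   !$omp end parallel do

   ! Both loops must produce the same values.
   if (any(a /= b)) error stop "results differ"
   print *, "ok"
 end program dc_usage
 ```

 Assuming the file is saved as `dc_usage.f90` (a hypothetical name), it can be
 built with the options listed above, for example
 `flang -fopenmp -fdo-concurrent-to-openmp=host dc_usage.f90`.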

 ## Current status

 Under the hood, `do concurrent` mapping is implemented in the
 `DoConcurrentConversionPass`. This is still an experimental pass which means
 that:
 * It has been tested in a very limited way so far.
 * It has been tested mostly on simple synthetic inputs.

 ### Loop nest detection

 On the `FIR` dialect level, the following loop:
 ```fortran
   do concurrent(i=1:n, j=1:m, k=1:o)
     a(i,j,k) = i + j + k
   end do
 ```
 is modelled as a nest of `fir.do_loop` ops such that an outer loop's region
 contains **only** the following:
   1. The operations needed to assign/update the outer loop's induction variable.
   1. The inner loop itself.

 So the MLIR structure for the above example looks similar to the following:
 ```
   fir.do_loop %i_idx = %34 to %36 step %c1 unordered {
     %i_idx_2 = fir.convert %i_idx : (index) -> i32
     fir.store %i_idx_2 to %i_iv#1 : !fir.ref<i32>

     fir.do_loop %j_idx = %37 to %39 step %c1_3 unordered {
       %j_idx_2 = fir.convert %j_idx : (index) -> i32
       fir.store %j_idx_2 to %j_iv#1 : !fir.ref<i32>

       fir.do_loop %k_idx = %40 to %42 step %c1_5 unordered {
         %k_idx_2 = fir.convert %k_idx : (index) -> i32
         fir.store %k_idx_2 to %k_iv#1 : !fir.ref<i32>

         ... loop nest body goes here ...
       }
     }
   }
 ```
 This applies to multi-range loops in general; they are represented in the IR as
 a nest of `fir.do_loop` ops with the above nesting structure.

 Therefore, the pass detects such "perfectly" nested loop ops to identify multi-range
 loops and map them as "collapsed" loops in OpenMP.
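
 At the source level, mapping a multi-range loop as a "collapsed" loop
 corresponds roughly to what one would write by hand with OpenMP's `collapse`
 clause. The following is a minimal sketch under that assumption; the program
 name and array sizes are illustrative, and the directive shown is the
 hand-written analogue rather than the exact output of the pass.

 ```fortran
 program dc_collapse
   implicit none
   integer, parameter :: n = 4, m = 5, o = 6   ! illustrative sizes
   integer :: i, j, k
   integer :: a(n, m, o), b(n, m, o)

   ! Multi-range loop from the example above.
   do concurrent (i = 1:n, j = 1:m, k = 1:o)
     a(i, j, k) = i + j + k
   end do

   ! Hand-written analogue: three perfectly nested loops collapsed into a
   ! single iteration space that is shared out among the threads.
   !$omp parallel do collapse(3)
   do i = 1, n
     do j = 1, m
       do k = 1, o
         b(i, j, k) = i + j + k
       end do
     end do
   end do
   !$omp end parallel do

   if (any(a /= b)) error stop "results differ"
   print *, "ok"
 end program dc_collapse
 ```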

 #### Further info regarding loop nest detection

 Loop nest detection is currently limited to the scenario described in the previous
 section: a single multi-range `do concurrent` header. This is restrictive and can be
 extended in the future to cover more cases. At the moment, for the following loop
 nest, even though both loops are perfectly nested, only the outer loop is
 parallelized:
 ```fortran
 do concurrent(i=1:n)
   do concurrent(j=1:m)
     a(i,j) = i * j
   end do
 end do
 ```

 Similarly, for the following loop nest, even though the intervening statement `x = 41`
 does not have any memory effects that would affect parallelization, the nest is
 not detected as a perfect loop nest, so only the outer loop is parallelized.

 ```fortran
 do concurrent(i=1:n)
   x = 41
   do concurrent(j=1:m)
     a(i,j) = i * j
   end do
 end do
 ```

 The above also has the consequence that the `j` variable will **not** be
 privatized in the OpenMP parallel/target region. In other words, it will be
 treated as if it were a `shared` variable. For more details about privatization,
 see the "Data environment" section below.

 See `flang/test/Transforms/DoConcurrent/loop_nest_test.f90` for more examples
 of what is and is not detected as a perfect loop nest.
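
 One source-level workaround that follows from the limitation above is to
 express the nest with a single multi-range header, which is the form the pass
 detects. The sketch below assumes there is no intervening code between the two
 headers that needs to be preserved; names and sizes are illustrative.

 ```fortran
 program dc_merge_headers
   implicit none
   integer, parameter :: n = 8, m = 6   ! illustrative sizes
   integer :: i, j
   integer :: a(n, m), b(n, m)

   ! Nested form: per the text above, only the outer loop is parallelized.
   do concurrent (i = 1:n)
     do concurrent (j = 1:m)
       a(i, j) = i * j
     end do
   end do

   ! Multi-range form: the same iteration space in a single header, which
   ! is the pattern that loop nest detection currently recognizes.
   do concurrent (i = 1:n, j = 1:m)
     b(i, j) = i * j
   end do

   ! Both forms compute the same values.
   if (any(a /= b)) error stop "results differ"
   print *, "ok"
 end program dc_merge_headers
 ```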

 ### Single-range loops

 Given the following loop:
 ```fortran
   do concurrent(i=1:n)
     a(i) = i * i
   end do
 ```

 #### Mapping to `host`

 Mapping this loop to the `host` generates MLIR operations of the following
 structure:

 ```
 %4 = fir.address_of(@_QFEa) ...
 %6:2 = hlfir.declare %4 ...

 omp.parallel {
   // Allocate private copy for `i`.
   // TODO Use delayed privatization.
   %19 = fir.alloca i32 {bindc_name = "i"}
   %20:2 = hlfir.declare %19 {uniq_name = "_QFEi"} ...

   omp.wsloop {
     omp.loop_nest (%arg0) : index = (%21) to (%22) inclusive step (%c1_2) {
       %23 = fir.convert %arg0 : (index) -> i32
       // Use the privatized version of `i`.
       fir.store %23 to %20#1 : !fir.ref<i32>
       ...

       // Use "shared" SSA value of `a`.
       %42 = hlfir.designate %6#0
       hlfir.assign %35 to %42
       ...
       omp.yield
     }
     omp.terminator
   }
   omp.terminator
 }
 ```

 #### Mapping to `device`

 <!-- TODO -->

 ### Multi-range loops

 The pass currently supports multi-range loops as well. Given the following
 example:

 ```fortran
    do concurrent(i=1:n, j=1:m)
        a(i,j) = i * j
    end do
 ```

 The generated `omp.loop_nest` operation looks like:

 ```
 omp.loop_nest (%arg0, %arg1)
     : index = (%17, %19) to (%18, %20)
     inclusive step (%c1_2, %c1_4) {
   fir.store %arg0 to %private_i#1 : !fir.ref<i32>
   fir.store %arg1 to %private_j#1 : !fir.ref<i32>
   ...
   omp.yield
 }
 ```

 It is worth noting that we have privatized versions for both iteration
 variables: `i` and `j`. These are locally allocated inside the parallel/target
 OpenMP region similar to what the single-range example in the previous section
 shows.

 ### Data environment

 By default, variables that are used inside a `do concurrent` loop nest are
 either treated as `shared` in case of mapping to `host`, or mapped into the
 `target` region using a `map` clause in case of mapping to `device`. The only
 exceptions to this are:
   1. the loop's iteration variable(s) (IV) of **perfect** loop nests. In that
      case, for each IV, we allocate a local copy as shown by the mapping
      examples above.
   1. any values that are from allocations outside the loop nest and used
      exclusively inside of it. In such cases, a local privatized
      copy is created in the OpenMP region to prevent multiple teams of threads
      from accessing and destroying the same memory block, which causes runtime
      issues. For an example of such cases, see
      `flang/test/Transforms/DoConcurrent/locally_destroyed_temp.f90`.

 Implicit mapping detection (for mapping to the target device) is still quite
 limited and work to make it smarter is underway for both OpenMP in general
 and `do concurrent` mapping.
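
 The default data environment described above can be spelled out with explicit
 OpenMP clauses. The following is only a hand-written sketch: the `tofrom` map
 type is an assumption (the text above only says that a `map` clause is used),
 and the program and array names are illustrative.

 ```fortran
 program dc_data_env
   implicit none
   integer, parameter :: n = 1000   ! illustrative size
   integer :: i
   integer :: a(n), b(n)

   ! Host-style data environment: the array is shared, the IV is private.
   !$omp parallel do shared(a) private(i)
   do i = 1, n
     a(i) = i * i
   end do
   !$omp end parallel do

   ! Device-style data environment: the array is mapped into the target
   ! region (map type assumed to be tofrom), the IV is private.
   !$omp target teams distribute parallel do map(tofrom: b) private(i)
   do i = 1, n
     b(i) = i * i
   end do
   !$omp end target teams distribute parallel do

   if (any(a /= b)) error stop "results differ"
   print *, "ok"
 end program dc_data_env
 ```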

 #### Non-perfectly-nested loops' IVs

 For non-perfectly-nested loops, the IVs are still treated as `shared` or
 `map` entries as pointed out above. This **might not** be consistent with what
 the Fortran specification tells us. In particular, taking the following
 snippets from the spec (version 2023) into account:

 > § 3.35
 > ------
 > construct entity
 > entity whose identifier has the scope of a construct

 > § 19.4
 > ------
 >  A variable that appears as an index-name in a FORALL or DO CONCURRENT
 >  construct [...] is a construct entity. A variable that has LOCAL or
 >  LOCAL_INIT locality in a DO CONCURRENT construct is a construct entity.
 > [...]
 > The name of a variable that appears as an index-name in a DO CONCURRENT
 > construct, FORALL statement, or FORALL construct has a scope of the statement
 > or construct. A variable that has LOCAL or LOCAL_INIT locality in a DO
 > CONCURRENT construct has the scope of that construct.

 From the above quotes, it seems there is an equivalence between the IV of a `do
 concurrent` loop and a variable with a `LOCAL` locality specifier (equivalent
 to OpenMP's `private` clause). This suggests that we should probably
 localize/privatize a `do concurrent` loop's IV even if it is not perfectly
 nested in the nest we are parallelizing. For now, however, we **do not** do
 that, as pointed out previously. In the near future, we propose a middle-ground
 solution (see the Next steps section for more details).

 <!--
 More details about current status will be added along with relevant parts of the
 implementation in later upstreaming patches.
 -->

 ## Next steps

 This section describes some of the open questions/issues that are not tackled yet
 even in the downstream implementation.

 ### Separate MLIR op for `do concurrent`

 At the moment, both increment and concurrent loops are represented by one MLIR
 op: `fir.do_loop`, where we differentiate concurrent loops with the `unordered`
 attribute. This is not ideal since the `fir.do_loop` op supports only single
 iteration ranges. Consequently, to model multi-range `do concurrent` loops, flang
 emits a nest of `fir.do_loop` ops which we have to detect in the OpenMP conversion
 pass to handle multi-range loops. Instead, it would be better to model multi-range
 concurrent loops using a separate op, which would make the IR more representative
 of the input Fortran code and also easier to detect and transform.

 ### Delayed privatization

 So far, we emit the privatization logic for IVs inline in the parallel/target
 region. This is enough for our purposes right now since we don't
 localize/privatize any sophisticated types of variables yet. Once we have need
 for more advanced localization through `do concurrent`'s locality specifiers
 (see below), delayed privatization will enable us to have a much cleaner IR.
 Once the upstream implementation of delayed privatization supports the
 constructs required by the pass, we will move to it rather than inlined/early
 privatization.

 ### Locality specifiers for `do concurrent`

 Locality specifiers will enable the user to control the data environment of the
 loop nest in a more fine-grained way. Implementing these specifiers on the
 `FIR` dialect level is needed in order to support this in the
 `DoConcurrentConversionPass`.

 Such specifiers will also unlock a potential solution to the
 non-perfectly-nested loops' IVs issue described above. In particular, for a
 non-perfectly nested loop, one middle-ground proposal/solution would be to:
 * Emit the loop's IV as shared/mapped just like we do currently.
 * Emit a warning that the IV of the loop is emitted as shared/mapped.
 * Given support for `LOCAL`, we can recommend the user to explicitly
   localize/privatize the loop's IV if they choose to.
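
 For reference, explicit localization of an inner loop's IV would look roughly
 like the sketch below, which uses the standard `LOCAL` locality specifier on
 the outer loop. This only illustrates the source-level syntax; it does not
 claim that the pass currently honors the specifier. Names and sizes are
 illustrative.

 ```fortran
 program dc_local_iv
   implicit none
   integer, parameter :: n = 8, m = 6   ! illustrative sizes
   integer :: i, j
   integer :: a(n, m)

   ! `local(j)` asks for a separate `j` in every iteration of the outer
   ! loop, which is what the user would be advised to request explicitly.
   do concurrent (i = 1:n) local(j)
     do concurrent (j = 1:m)
       a(i, j) = i * j
     end do
   end do

   ! Spot-check one element of the result.
   if (a(n, m) /= n * m) error stop "unexpected value"
   print *, "ok"
 end program dc_local_iv
 ```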

 #### Sharing TableGen clause records from the OpenMP dialect

 At the moment, the FIR dialect does not have a way to model locality specifiers
 on the IR level. Instead, something similar to early/eager privatization in OpenMP
 is done for the locality specifiers in `fir.do_loop` ops. Having locality specifiers
 modelled in a way similar to delayed privatization (i.e. the `omp.private` op) and
 reductions (i.e. the `omp.declare_reduction` op) can make mapping `do concurrent`
 to OpenMP (and other parallel programming models) much easier.

 Therefore, one way to approach this problem is to extract the TableGen records
 for relevant OpenMP clauses in a shared dialect for "data environment management"
 and use these shared records for OpenMP, `do concurrent`, and possibly OpenACC
 as well.

 #### Supporting reductions

 Similar to locality specifiers, mapping reductions from `do concurrent` to OpenMP
 is also still an open TODO. We can potentially extend the MLIR infrastructure
 proposed in the previous section to share reduction records among the different
 relevant dialects as well.

 ### More advanced detection of loop nests

 As pointed out earlier, any intervening code between the headers of 2 nested
 `do concurrent` loops prevents us from detecting this as a loop nest. In some
 cases this is overly conservative. Therefore, a more flexible detection logic
 of loop nests needs to be implemented.

 ### Data-dependence analysis

 Right now, we map loop nests without analysing whether such mapping is safe to
 do or not. We probably need to at least warn the user of unsafe loop nests due
 to loop-carried dependencies.
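
 To make the concern concrete, the sketch below contrasts a loop with a
 loop-carried dependency against one with independent iterations. The first loop
 is deliberately written as an ordinary `do` loop: expressing it as
 `do concurrent` would not be conforming Fortran, and, as noted above, the pass
 would not detect the problem. Names and sizes are illustrative.

 ```fortran
 program dc_dependence
   implicit none
   integer, parameter :: n = 16   ! illustrative size
   integer :: i
   integer :: a(n), b(n)

   ! Loop-carried dependency: iteration i reads a(i-1), which is written by
   ! iteration i-1, so the iterations must run in order.
   a(1) = 1
   do i = 2, n
     a(i) = a(i-1) + 1
   end do

   ! Independent iterations: each one touches only b(i), so this loop is a
   ! valid candidate for `do concurrent` and for parallel mapping.
   do concurrent (i = 1:n)
     b(i) = i
   end do

   ! Both loops fill their array with 1, 2, ..., n.
   if (any(a /= b)) error stop "results differ"
   print *, "ok"
 end program dc_dependence
 ```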

 ### Non-rectangular loop nests

 So far, we have not needed to use the pass for non-rectangular loop nests. For
 example:
 ```fortran
 do concurrent(i=1:n)
   do concurrent(j=i:n)
     ...
   end do
 end do
 ```
 We defer this to the (hopefully) near future when we get the conversion in
 good shape for the samples/projects at hand.

 ### Generalizing the pass to other parallel programming models

 Once we have a stable and capable `do concurrent` to OpenMP mapping, we can take
 this in a more generalized direction and allow the pass to target other models;
 e.g. OpenACC. This goal should be kept in mind from the get-go even while only
 targeting OpenMP.


 ## Upstreaming status

 - [x] Command line options for `flang` and `bbc`.
 - [x] Conversion pass skeleton (no transformations happen yet).
 - [x] Status description and tracking document (this document).
 - [x] Loop nest detection to identify multi-range loops.
 - [ ] Basic host/CPU mapping support.
 - [ ] Basic device/GPU mapping support.
 - [ ] More advanced host and device support (expanded to multiple items as needed).