Hi @gerhardu,

indeed, using such a non-primitive datatype is not currently possible, and I am not sure it is something we would want to support, as it would considerably complicate applying stencil-computation-specific optimisations.

Using, e.g., `x = @zeros(nx,ny,nz,N)`, where `x[:,:,:,1]` would then refer to the first vector component of the vector field, as you suggest, should work fine with a `@parallel_indices` kernel, but not with the available finite differences macros, as the macros require just a symbol that represents an array as input (you would have to write your own set of macros, which is not difficult though). I don't see any problems in terms of performance - at least as long as your array sizes are multiples of 32.
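
To illustrate what I mean, here is a minimal sketch of such a `@parallel_indices` kernel working on the 4-D array (the kernel name `sum_components!` and the computation are just made up for the example; it assumes a 3-D initialisation with the CUDA backend):

```
using ParallelStencil
@init_parallel_stencil(CUDA, Float64, 3);

nx = ny = nz = 8
N  = 2
x  = @zeros(nx, ny, nz, N)   # last dimension holds the N vector components
B  = @zeros(nx, ny, nz)

@parallel_indices (ix,iy,iz) function sum_components!(B, x)
    # access the components via the last index (arbitrary example computation)
    B[ix,iy,iz] = x[ix,iy,iz,1] + x[ix,iy,iz,2]
    return
end

# pass the ranges explicitly to be on the safe side with the extra array dimension
@parallel (1:nx, 1:ny, 1:nz) sum_components!(B, x)
```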

A better option would probably be to create a vector of `N` arrays, where you create each of these arrays with the macro `@zeros` (or `@ones` or `@rand`). Then, you will likely want to **splat these arrays into the kernels** you are writing, as I suggested in a similar question concerning the co-developed package ImplicitGlobalGrid.jl. That way, you get a symbol for each array corresponding to one of the `N` components, which you can then work with in the kernel. With this approach, your code also becomes compatible with ImplicitGlobalGrid.jl in this respect, enabling you to run your simulations on multiple distributed GPUs (or CPUs); see the halo-update sketch at the end of this post.

Here is a short MWE demonstrating the above approach in 2-D:

```
using ParallelStencil
using ParallelStencil.FiniteDifferences2D
@init_parallel_stencil(CUDA, Float64, 2);
@parallel function compute_something!(B, Ax)
    @inn(B) = @d2_xi(Ax)
    return
end

@parallel function compute_something!(B, Ax, Ay)
    @inn(B) = @d2_xi(Ax) + @d2_yi(Ay)
    return
end
nx = ny = 4
# Test kernel with N = 1
A = [@rand(nx,ny)]
B = @zeros(nx,ny)
@parallel compute_something!(B, A...)
# Compare with the approach used in the mini-apps
A_x = copy(A[1])
B_ref = @zeros(nx,ny)
@parallel compute_something!(B_ref, A_x)
B_ref ≈ B
# Test kernel with N = 2
A = [@rand(nx,ny), @rand(nx,ny)]
B = @zeros(nx,ny)
@parallel compute_something!(B, A...)
# Compare with the approach used in the mini-apps
A_x = copy(A[1])
A_y = copy(A[2])
B_ref = @zeros(nx,ny)
@parallel compute_something!(B_ref, A_x, A_y)
B_ref ≈ B
```

If you execute this in your REPL, you will see that it works:

```
julia> using ParallelStencil
julia> using ParallelStencil.FiniteDifferences2D
julia> @init_parallel_stencil(CUDA, Float64, 2);
julia> @parallel function compute_something!(B, Ax)
    @inn(B) = @d2_xi(Ax)
    return
end
compute_something! (generic function with 1 method)
julia> @parallel function compute_something!(B, Ax, Ay)
    @inn(B) = @d2_xi(Ax) + @d2_yi(Ay)
    return
end
compute_something! (generic function with 2 methods)
julia> nx = ny = 4
4
julia> # Test kernel with N = 1
A = [@rand(nx,ny)]
1-element Vector{CUDA.CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}}:
[0.31322130418832184 0.9644593790849063 0.8421951080146028 0.8577772923657494; 0.5088175190632842 0.4177057769207575 0.4015573941471502 0.6532746859034653; 0.7671006039929464 0.8138802704821577 0.3195476513308886 0.21118995895483494; 0.9856881342175574 0.22716610365281564 0.03380085555816348 0.3556196653870136]
julia> B = @zeros(nx,ny)
4×4 CUDA.CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
julia> @parallel compute_something!(B, A...)
julia> # Compare with the approach used in the mini-apps
A_x = copy(A[1])
4×4 CUDA.CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
0.313221 0.964459 0.842195 0.857777
0.508818 0.417706 0.401557 0.653275
0.767101 0.81388 0.319548 0.21119
0.985688 0.227166 0.0338009 0.35562
julia> B_ref = @zeros(nx,ny)
4×4 CUDA.CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
julia> @parallel compute_something!(B_ref, A_x)
julia> B_ref ≈ B
true
julia> # Test kernel with N = 2
A = [@rand(nx,ny), @rand(nx,ny)]
2-element Vector{CUDA.CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}}:
[0.11648030379108043 0.7016918423717395 0.3813365394157098 0.43472254813509736; 0.6678345880903644 0.8885012054871835 0.6491284039029075 0.8524135900243921; 0.749178219318515 0.3771530364220903 0.1633290416089188 0.6699294296751979; 0.22483500649211696 0.8619593996199939 0.5208627399150867 0.632580958632444]
[0.1456817774423882 0.0599120126893542 0.6458892447881615 0.3584689704419577; 0.2664573828960062 0.5348710989611567 0.4653876225401121 0.931384898871823; 0.32879853389624825 0.010032647605189515 0.5631038579919612 0.10461129086618959; 0.9139882178251839 0.55668895067905 0.2144905738173768 0.03700448570805115]
julia> B = @zeros(nx,ny)
4×4 CUDA.CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
julia> @parallel compute_something!(B, A...)
julia> # Compare with the approach used in the mini-apps
A_x = copy(A[1])
4×4 CUDA.CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
0.11648 0.701692 0.381337 0.434723
0.667835 0.888501 0.649128 0.852414
0.749178 0.377153 0.163329 0.669929
0.224835 0.861959 0.520863 0.632581
julia> A_y = copy(A[2])
4×4 CUDA.CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
0.145682 0.059912 0.645889 0.358469
0.266457 0.534871 0.465388 0.931385
0.328799 0.0100326 0.563104 0.104611
0.913988 0.556689 0.214491 0.0370045
julia> B_ref = @zeros(nx,ny)
4×4 CUDA.CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
julia> @parallel compute_something!(B_ref, A_x, A_y)
julia> B_ref ≈ B
true
```
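
Regarding the ImplicitGlobalGrid.jl compatibility mentioned above: `update_halo!` accepts multiple arrays, so you can splat the vector of component arrays there in the same way. A minimal sketch of how that could look (not a complete example; it assumes the 2-D vector-of-arrays `A` from the MWE above, with nz set to 1 for a 2-D global grid):

```
using ImplicitGlobalGrid
me, dims = init_global_grid(nx, ny, 1)   # nz = 1 for a 2-D simulation
# ... allocate the vector of N component arrays A and run the kernels as above ...
update_halo!(A...)                       # update the halos of all N component arrays at once
finalize_global_grid()
```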

Let us know if this approach suits you!