towards improvements to stencil

APL-related discussions - a stream of APL consciousness.
Not sure where to start a discussion ? Here's the place to be
Forum rules
This forum is for discussing APL-related issues. If you think that the subject is off-topic, then the Chat forum is probably a better place for your thoughts !

towards improvements to stencil

Postby Roger|Dyalog on Thu Apr 23, 2020 6:46 pm

Background

The stencil operator ⌺ was introduced in Dyalog version 16.0 in 2017. Recently we received some feedback (OK, complaints) that (a) stencil does padding which is unwanted sometimes and needs to be removed from the result and (b) stencil is too slow when it is not supported by special code.

First, stencil in cases supported by special code is much faster than when it is not. The special cases are as follows, reproduced from the Dyalog '17 Workshop SA3.

      {⍵}       {⊢⍵}      {,⍵}      {⊂⍵}
{+/,⍵} {∧/,⍵} {∨/,⍵} {=/,⍵} {≠/,⍵}
{ +/,A×⍵} { +/⍪A×⍤2⊢⍵}
{C<+/,A×⍵} {C<+/⍪A×⍤2⊢⍵}

C: a single number or a variable whose value is a single number
A: a variable whose value is a rank-2 or -3 array

The comparison can be < ≤ ≥ > = ≠
odd window size; movement 1; matrix argument

You can test whether a particular case is supported by using a circumlocution to defeat the special case recognizer, and do some timings.

      )copy dfns cmpx

cmpx '{⍵}⌺3 5⊢y' '{⊢⊢⍵}⌺3 5⊢y' ⊣ y←?100 200⍴0
{⍵}⌺3 5⊢x → 4.22E¯4 | 0%
{⊢⊢⍵}⌺3 5⊢x → 5.31E¯2 | +12477% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

cmpx '{⌽⍵}⌺3 5⊢y' '{⊢⊢⌽⍵}⌺3 5⊢y'
{⌽⍵}⌺3 5⊢y → 2.17E¯1 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
{⊢⊢⌽⍵}⌺3 5⊢y → 2.21E¯1 | +1% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

If the timings are the same then there is no special code.

Padding and performance improvements will take a lot of work. For example, for padding (i.e. the treatment of cells at the edges of the universe) multiple options are possible: no padding, padding, wrap from opposite edge, etc. While working on these improvements I hit upon the idea of writing a stencil function which produces the stencil cells. It only works with no padding and only for movements of 1 (which I understand are common cases), but turns out to be an interesting study.

A Stencil Function

⍺ stencell ⍵ produces the stencil cells of size ⍺ from ⍵, and is equivalent to {⍵}⌺⍺⊢⍵ after the padded cells are removed.

      stencell←{
⎕io←0 ⍝ ⎕io delenda est!
s←(≢⍺)↑⍴⍵
f←1+s-⍺ ⍝ frame AKA outer shape
m←⊖×⍀⊖1↓s,1 ⍝ multiplier for each axis
i←⊃∘.+⌿(m,m)×⍳¨f,⍺ ⍝ indices
(⊂i) ⌷ ⍵ ⍴⍨ (×⌿(≢⍺)↑⍴⍵),(≢⍺)↓⍴⍵
}

For example, stencell is applied to x with cell shape 3 5.

      ⊢ x←6 10⍴⍳60                    ⍝ (a)
0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
50 51 52 53 54 55 56 57 58 59

c←3 5 stencell x ⍝ (b)
⍴c
4 6 3 5

c ≡ 1 2 ↓ ¯1 ¯2 ↓ {⍵}⌺3 5 ⊢x ⍝ (c)
1

⊢ e←⊂⍤2 ⊢c ⍝ (d)
┌──────────────┬──────────────┬──────────────┬──────────────┬──────────────┬──────────────┐
│ 0 1 2 3 4│ 1 2 3 4 5│ 2 3 4 5 6│ 3 4 5 6 7│ 4 5 6 7 8│ 5 6 7 8 9│
│10 11 12 13 14│11 12 13 14 15│12 13 14 15 16│13 14 15 16 17│14 15 16 17 18│15 16 17 18 19│
│20 21 22 23 24│21 22 23 24 25│22 23 24 25 26│23 24 25 26 27│24 25 26 27 28│25 26 27 28 29│
├──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┤
│10 11 12 13 14│11 12 13 14 15│12 13 14 15 16│13 14 15 16 17│14 15 16 17 18│15 16 17 18 19│
│20 21 22 23 24│21 22 23 24 25│22 23 24 25 26│23 24 25 26 27│24 25 26 27 28│25 26 27 28 29│
│30 31 32 33 34│31 32 33 34 35│32 33 34 35 36│33 34 35 36 37│34 35 36 37 38│35 36 37 38 39│
├──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┤
│20 21 22 23 24│21 22 23 24 25│22 23 24 25 26│23 24 25 26 27│24 25 26 27 28│25 26 27 28 29│
│30 31 32 33 34│31 32 33 34 35│32 33 34 35 36│33 34 35 36 37│34 35 36 37 38│35 36 37 38 39│
│40 41 42 43 44│41 42 43 44 45│42 43 44 45 46│43 44 45 46 47│44 45 46 47 48│45 46 47 48 49│
├──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┤
│30 31 32 33 34│31 32 33 34 35│32 33 34 35 36│33 34 35 36 37│34 35 36 37 38│35 36 37 38 39│
│40 41 42 43 44│41 42 43 44 45│42 43 44 45 46│43 44 45 46 47│44 45 46 47 48│45 46 47 48 49│
│50 51 52 53 54│51 52 53 54 55│52 53 54 55 56│53 54 55 56 57│54 55 56 57 58│55 56 57 58 59│
└──────────────┴──────────────┴──────────────┴──────────────┴──────────────┴──────────────┘


∪¨ ,¨ e-⍬⍴e ⍝ (e)
┌──┬──┬──┬──┬──┬──┐
│0 │1 │2 │3 │4 │5 │
├──┼──┼──┼──┼──┼──┤
│10│11│12│13│14│15│
├──┼──┼──┼──┼──┼──┤
│20│21│22│23│24│25│
├──┼──┼──┼──┼──┼──┤
│30│31│32│33│34│35│
└──┴──┴──┴──┴──┴──┘

(a) The matrix x is chosen to make stencil results easier to understand.
(b) stencell is applied to x with cell shape 3 5.
(c) The result of stencell is the same as that for {⍵}⌺ after cells with padding are dropped.
(d) Enclose the matrices in c (the cells) to make the display more compact and easier to understand. (The misaligned displays are due to defects in the APL Chat Forum software.)
(e) Subsequent discussion is based on the observation that each cell is some scalar integer added to the first cell.

Indices

The key expression in the computation is:

      ⊃∘.+⌿(m,m)×⍳¨f,⍺

⍝ m: 10 1; multiplier for each axis
⍝ f: 4 6; frame; outer shape; lengths of the leading axes of the result
⍝ ⍺: 3 5; cell shape

We discuss a more verbose but equivalent version of this expression, (⊃∘.+⌿m×⍳¨f)∘.+(⊃∘.+⌿m×⍳¨⍺), starting with the right half, ⊃∘.+⌿m×⍳¨⍺, which produces the first cell.

      ⍳⍺               ⍝ ⍳3 5
┌───┬───┬───┬───┬───┐
│0 0│0 1│0 2│0 3│0 4│
├───┼───┼───┼───┼───┤
│1 0│1 1│1 2│1 3│1 4│
├───┼───┼───┼───┼───┤
│2 0│2 1│2 2│2 3│2 4│
└───┴───┴───┴───┴───┘
(⍴⍵)∘⊥¨⍳⍺ ⍝ 6 10∘⊥¨ ⍳3 5
0 1 2 3 4
10 11 12 13 14
20 21 22 23 24

Alternatively, this last result obtains by multiplying by m the corresponding indices for each axis, where an element of m is the increment for a unit in an axis. That is, m←⊖×⍀⊖1↓s,1 where s←(≢⍺)↑⍴⍵ is a prefix of the shape of ⍵. The multipliers are with respect to the argument ⍵ because the indices are required to be with respect to the argument ⍵.

      ⍳¨⍺              ⍝ ⍳¨3 5
┌─────┬─────────┐
│0 1 2│0 1 2 3 4│
└─────┴─────────┘
m×⍳¨⍺ ⍝ 10 1×⍳¨3 5
┌───────┬─────────┐
│0 10 20│0 1 2 3 4│
└───────┴─────────┘
∘.+⌿ m×⍳¨⍺ ⍝ ∘.+⌿ 10 1×⍳¨3 5
┌──────────────┐
│ 0 1 2 3 4│
│10 11 12 13 14│
│20 21 22 23 24│
└──────────────┘
((⍴⍵)∘⊥¨⍳⍺) ≡ ⊃∘.+⌿m×⍳¨⍺
1

This alternative computation is more efficient because it avoids creating and working on lots of small nested vectors and because the intermediate results for ∘.+⌿ grows fast from one to the next (i.e., O(⍟n) iterations in the main loop).

The left half, ⊃∘.+⌿m×⍳¨f, is similar, and computes the necessary scalar integers to be added to the result of the right half.

      ⊃ ∘.+⌿ m×⍳¨f     ⍝ ⊃ ∘.+⌿ 10 1×⍳¨4 6
0 1 2 3 4 5
10 11 12 13 14 15
20 21 22 23 24 25
30 31 32 33 34 35

The shorter expression derives from the more verbose one by some simple algebra.

      (⊃∘.+⌿m×⍳¨f)∘.+(⊃∘.+⌿m×⍳¨⍺)    ⍝ verbose version
⊃∘.+⌿(m×⍳¨f),m×⍳¨⍺ ⍝ ∘.+ is associative
⊃∘.+⌿(m,m)×(⍳¨f),⍳¨⍺ ⍝ m× distributes over ,
⊃∘.+⌿(m,m)×⍳¨f,⍺ ⍝ ⍳¨ distributes over ,

I am actually disappointed that the shorter expression was found :-); it would have been amusing to have a non-contrived and short expression with three uses of ∘.+ .

Cells

Having the indices i in hand, the stencil cells obtain by indexing into an appropriate reshape or ravel of the right argument ⍵. In general, the expression is

      (⊂i) ⌷ ⍵ ⍴⍨ (×⌿(≢⍺)↑⍴⍵),(≢⍺)↓⍴⍵

⍺ specifies the cell shape. If (≢⍺)=≢⍴⍵, that is, if a length is specified for each axis of ⍵, the general expression is equivalent to (⊂i)⌷,⍵ or (,⍵)[i]; if (≢⍺)<≢⍴⍵, that is, if there are some trailing unstencilled axes, the general expression is equivalent to (,[⍳≢⍺]⍵)[i;...;] (the leading ≢⍺ axes are ravelled) or ↑(,⊂⍤((≢⍴⍵)-≢⍺)⊢⍵)[i] (as if the trailing axes were shielded from indexing). The general expression covers both cases.

Application

stencell makes it possible to workaround current shortcomings in ⌺. The alternative approach is to use stencell to get all the stencil cells, all at once, and then work on the cells using ⍤, +.×, and other efficient primitives.

The following example is from Aaron Hsu. In the original problem the size of x is 512 512 64.

      K←?64 3 3 64⍴0
x←?256 256 64⍴0

t←1 1↓¯1 ¯1↓{+/⍪K×⍤3⊢⍵}⌺3 3⊢x
⍴t
256 256 64

cmpx '1 1↓¯1 ¯1↓{+/⍪K×⍤3⊢⍵}⌺3 3⊢x'
6.76E0

The computation is slow because the cells are rank-3, not supported by special code. Aaron then devised a significant speed-up using a simpler left operand to create the ravels of the cells (but still no special code):

      t ≡ (1 1↓¯1 ¯1↓{,⍵}⌺3 3⊢x)+.×⍉⍪K
1
cmpx '(1 1↓¯1 ¯1↓{,⍵}⌺3 3⊢x)+.×⍉⍪K'
1.67E0

Use of stencell would improve the performance a bit further:

      t ≡ (,⍤3 ⊢3 3 stencell x)+.×⍉⍪K
1
cmpx '(,⍤3 ⊢3 3 stencell x)+.×⍉⍪K'
1.09E0
cmpx '3 3 stencell x'
6.10E¯2

The last timing shows that the stencell computation is 6% (6.10e¯2÷1.09e0) of the total time.

Materializing all the cells does take more space than if the computation is incorporated in the left operand of ⌺, and is practicable only if the workspace sufficiently large.

      )copy dfns wsreq

wsreq '1 1↓¯1 ¯1↓{+/⍪K×⍤3⊢⍵}⌺3 3⊢x'
110649900
wsreq '(1 1↓¯1 ¯1↓{,⍵}⌺3 3⊢x)+.×⍉⍪K'
647815900
wsreq '(,⍤3 ⊢3 3 stencell x)+.×⍉⍪K'
333462260

Performance

stencell is competitive with {⍵}⌺ on matrices, where it is supported by special code written in C, and is faster when there is no special code. The benchmarks are done on a larger argument to reduce the effects of the padding/unpadding done in {⍵}⌺.

      y2←?200 300⍴0

cmpx '3 5 stencell y2' '1 2↓¯1 ¯2↓{⍵}⌺3 5⊢y2'
3 5 stencell y → 1.85E¯3 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
1 2↓¯1 ¯2↓{⍵}⌺3 5⊢y → 2.91E¯3 | +57% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

cmpx '3 5 stencell y' '{⍵}⌺3 5⊢y'
3 5 stencell y → 1.85E¯3 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
* {⍵}⌺3 5⊢y → 1.04E¯3 | -45% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

y3←?200 300 64⍴0

cmpx '3 5 stencell y3' '1 2↓¯1 ¯2↓{⍵}⌺3 5⊢y3'
3 5 stencell y3 → 8.90E¯2 | 0% ⎕⎕⎕
1 2↓¯1 ¯2↓{⍵}⌺3 5⊢y3 → 7.78E¯1 | +773% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

cmpx '3 5 stencell y3' '{⍵}⌺3 5⊢y3'
3 5 stencell y3 → 9.38E¯2 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕
* {⍵}⌺3 5⊢y3 → 3.34E¯1 | +256% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

There is an interesting question of whether the shorter version of the key computation (in the Indices section above) is faster than the more verbose version.

      m←10 1 ⋄ f←4 6 ⋄ a←3 5

cmpx '⊃∘.+⌿(m,m)×⍳¨f,a' '(⊃∘.+⌿m×⍳¨f)∘.+(⊃∘.+⌿m×⍳¨a)'
⊃∘.+⌿(m,m)×⍳¨f,a → 3.75E¯6 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
(⊃∘.+⌿m×⍳¨f)∘.+(⊃∘.+⌿m×⍳¨a) → 5.20E¯6 | +38% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

In this case, it is faster, and I expect it will be faster for common cases which arise in stencil calculations, where the argument size is larger than the cell size. But it is easy to think of arguments where ∘.+⌿ is slower than ∘.+ done with a different grouping:

      cmpx '((⍳0)∘.+⍳100)∘.+⍳200' '(⍳0)∘.+((⍳100)∘.+⍳200)' '⊃∘.+/⍳¨0 100 200'
((⍳0)∘.+⍳100)∘.+⍳200 → 7.86E¯7 | 0% ⎕⎕
(⍳0)∘.+((⍳100)∘.+⍳200) → 1.05E¯5 | +1234% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
⊃∘.+/⍳¨0 100 200 → 1.11E¯5 | +1310% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

This question will be explored further in a later post.
Roger|Dyalog
 
Posts: 207
Joined: Thu Jul 28, 2011 10:53 am

Return to APL Chat

Who is online

Users browsing this forum: No registered users and 1 guest