* versus ⍟ performance

by **Roger|Dyalog** on Tue Mar 02, 2021 4:47 pm

0. A benchmark was posted in the APL Orchard on 2021-02-27:

      b←1+?⍨1e5  ⍝ with ⎕io←0; ⎕io delenda est!

      cmpx '0.1*b' '5⍟b' '10⍟b' '17⍟b'
  0.1*b → 9.0E¯3 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
* 5⍟b   → 1.4E¯3 | -84% ⎕⎕⎕⎕⎕⎕
* 10⍟b  → 1.2E¯3 | -87% ⎕⎕⎕⎕⎕
* 17⍟b  → 1.4E¯3 | -84% ⎕⎕⎕⎕⎕⎕

The questions are, why is ⍺*⍵ so much slower than ⍺⍟⍵? And shouldn't power be faster than logarithm?

Explaining the relative speeds of * and ⍟ (both monadic and dyadic) is complicated.

1. *⍵ is computed by exp() and ⍟⍵ is computed by log() or log2() or log10(), C library functions all.

2. Mathematically and in the Dyalog implementation, ⍺⍟⍵ ←→ (⍟⍵)÷⍟⍺.

Mathematically, ⍺*⍵ ←→ *⍵×⍟⍺. In the Dyalog implementation, when ⍺≥0, ⍺*⍵ is computed by pow(), a C library function.

3. scalar⍟⍵ invokes different C library functions when scalar is 2 (log2), *1 (log), or 10 (log10). These do not make a big difference in speed.

      e←*1
      cmpx '2⍟b' 'e⍟b' '10⍟b' '5⍟b' '17⍟b'
  2⍟b  → 1.70E¯3 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
* e⍟b  → 1.09E¯3 | -37% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕          
* 10⍟b → 1.18E¯3 | -31% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕        
* 5⍟b  → 1.39E¯3 | -19% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕      
* 17⍟b → 1.38E¯3 | -19% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

In fact:

      cmpx '2⍟b' '(⍟b)÷⍟2'
  2⍟b     → 1.70E¯3 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  (⍟b)÷⍟2 → 8.94E¯4 | -48% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕              

      cmpx '10⍟b' '(⍟b)÷⍟10'
  10⍟b     → 1.18E¯3 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  (⍟b)÷⍟10 → 8.96E¯4 | -25% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕      
      x←(⍴b)⍴10
      cmpx 'x⍟b' '(⍟b)÷⍟x'
  x⍟b     → 2.82E¯3 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  (⍟b)÷⍟x → 1.56E¯3 | -45% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

I suspect the verboser APL expressions are faster because the divides are done using vector instructions, whereas the Dyalog implementation of ⍺⍟⍵ does scalar divides in a C loop, log(⍵[j])/log(⍺[j]) or log(⍵[j])/d, where d←log(scalar).

4. As stated above, ⍺*⍵ when ⍺≥0 is done using the C library function pow(). Again, using a mathematically equivalent expression, *⍵×⍟⍺, is faster, I suspect for the same reason why (⍟⍵)÷⍟⍺ is faster than ⍺⍟⍵--the divides are done using vector instructions.

      cmpx '0.1*b' '*b×⍟0.1'
  0.1*b   → 7.79E¯3 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
* *b×⍟0.1 → 4.19E¯3 | -47% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕   

      reldiff←{⌈/,(|⍺-⍵)÷(|⍺)⌈(|⍵)}   ⍝ maximum relative difference           
      (0.1*b) reldiff *b×⍟0.1
8.72981E¯14

(The last expression indicates that 0.1*b and *b×⍟0.1 are "equal", despite the asterisk in the first column of the cmpx result.)

In addition, I suspect 0.1*b and *b×⍟0.1 suffer in speed from hitting lots of underflows. (⍺*⍵ for finite arguments should not be 0, other than underflows. Likewise *⍵.)

      0.1*12345
0
      +/ 0 = 0.1*b
99677

What if there are very few or no underflows?

      b1←b÷1e3
      +/ 0 = 0.1*b1
0
      cmpx '0.1*b1' '*b1×⍟0.1'
  0.1*b1   → 5.91E¯3 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
* *b1×⍟0.1 → 1.14E¯3 | -81% ⎕⎕⎕⎕⎕⎕                        
      (0.1*b1) reldiff *b1×⍟0.1
3.12521E¯14

5. In a hypothetical situation where I get to write the C library functions, I believe log() can be faster than exp():

You can have a precomputed table so that ⍟⍵ only need to be computed in detail in the range ÷√2 and √2 (about 0.7 and 1.4). For ⍵ in that range, equation 4.1.27 in Abramowitz and Stegun

      ⍟⍵ ←→ 2 × +/ j ÷⍨ ((⍵-1)÷(⍵+1)) * j←1+2×⍳n

provides a fast converging series for t←(⍵-1)÷(⍵+1), or 0.171573 or less. The terms of the series are odd powers of t, which means each term is smaller than the previous by a factor of t*2, or 0.0294373, which declines to 0 rapidly. See the J Wiki essay, Extended Precision Functions.

On the other hand, *⍵ would be computed by +/(⍵*j)÷!j←⍳n. You can have a precomputed table here too so that *⍵ only need to use the series for small |⍵ (<1), but you are still operating on both even and odd powers.

6. Let us do one big happy benchmark:

      x←?1e5⍴0
      y←?1e5⍴0
      cmpx '⍟x' 'x⍟y' '(⍟y)÷⍟x' '*x' 'x*y' '*y×⍟x'
  ⍟x      → 6.43E¯4 |    0% ⎕⎕⎕                          
* x⍟y     → 2.81E¯3 | +337% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕                
* (⍟y)÷⍟x → 1.59E¯3 | +148% ⎕⎕⎕⎕⎕⎕⎕⎕                      
* *x      → 1.11E¯3 |  +72% ⎕⎕⎕⎕⎕                        
* x*y     → 6.21E¯3 | +865% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
* *y×⍟x   → 1.82E¯3 | +182% ⎕⎕⎕⎕⎕⎕⎕⎕⎕

If you compare ⍟x to *x (6.43e¯4÷⍨1.11e¯3 or 1.73), and (⍟y)÷⍟x to *y×⍟x (1.59e¯3÷⍨1.82e¯3 or 1.14), the differences aren't so large, certainly not large enough (no greater than a factor of 2) to sweat over.

7. Some morals of this story:

There are lies, damned lies, and benchmarks.
Is it really worth it to have extra code for 2⍟⍵ and (*1)⍟⍵ and 10⍟⍵, when the benefits are not that significant?
Speed improvements made in one area (e.g. divide using vector instructions) create opportunities in other areas.

Finally, I don't recommend that y'all start coding (⍟⍵)÷(⍟⍺) instead of ⍺⍟⍵, or *⍵×⍟⍺ instead of ⍺*⍵. Personally, I expect Dyalog will improve * and ⍟ sometime, although not in the immediate future, and certainly not to the extent of writing replacements for C library functions.

The tool of thought for

software solutions

* versus ⍟ performance

* versus ⍟ performance

Who is online

QUICK LINKS