## * versus ⍟ performance

**Forum rules**

This forum is for discussing APL-related issues. If you think that the subject is off-topic, then the Chat forum is probably a better place for your thoughts !

1 post
• Page

**1**of**1**### * versus ⍟ performance

0. A benchmark was posted in the APL Orchard on 2021-02-27:

The questions are, why is ⍺*⍵ so much slower than ⍺⍟⍵? And shouldn't power be faster than logarithm?

Explaining the relative speeds of * and ⍟ (both monadic and dyadic) is complicated.

1. *⍵ is computed by exp() and ⍟⍵ is computed by log() or log2() or log10(), C library functions all.

2. Mathematically and in the Dyalog implementation, ⍺⍟⍵ ←→ (⍟⍵)÷⍟⍺.

Mathematically, ⍺*⍵ ←→ *⍵×⍟⍺. In the Dyalog implementation, when ⍺≥0, ⍺*⍵ is computed by pow(), a C library function.

3. scalar⍟⍵ invokes different C library functions when scalar is 2 (log2), *1 (log), or 10 (log10). These do not make a big difference in speed.

In fact:

I suspect the verboser APL expressions are faster because the divides are done using vector instructions, whereas the Dyalog implementation of ⍺⍟⍵ does scalar divides in a C loop, log(⍵[j])/log(⍺[j]) or log(⍵[j])/d, where d←log(scalar).

4. As stated above, ⍺*⍵ when ⍺≥0 is done using the C library function pow(). Again, using a mathematically equivalent expression, *⍵×⍟⍺, is faster, I suspect for the same reason why (⍟⍵)÷⍟⍺ is faster than ⍺⍟⍵--the divides are done using vector instructions.

(The last expression indicates that 0.1*b and *b×⍟0.1 are "equal", despite the asterisk in the first column of the cmpx result.)

In addition, I suspect 0.1*b and *b×⍟0.1 suffer in speed from hitting lots of underflows. (⍺*⍵ for finite arguments should not be 0, other than underflows. Likewise *⍵.)

What if there are very few or no underflows?

5. In a hypothetical situation where I get to write the C library functions, I believe log() can be faster than exp():

You can have a precomputed table so that ⍟⍵ only need to be computed in detail in the range ÷√2 and √2 (about 0.7 and 1.4). For ⍵ in that range, equation 4.1.27 in Abramowitz and Stegun

provides a fast converging series for t←(⍵-1)÷(⍵+1), or 0.171573 or less. The terms of the series are odd powers of t, which means each term is smaller than the previous by a factor of t*2, or 0.0294373, which declines to 0 rapidly. See the J Wiki essay, Extended Precision Functions.

On the other hand, *⍵ would be computed by +/(⍵*j)÷!j←⍳n. You can have a precomputed table here too so that *⍵ only need to use the series for small |⍵ (<1), but you are still operating on both even and odd powers.

6. Let us do one big happy benchmark:

If you compare ⍟x to *x (6.43e¯4÷⍨1.11e¯3 or 1.73), and (⍟y)÷⍟x to *y×⍟x (1.59e¯3÷⍨1.82e¯3 or 1.14), the differences aren't so large, certainly not large enough (no greater than a factor of 2) to sweat over.

7. Some morals of this story:

b←1+?⍨1e5 ⍝ with ⎕io←0; ⎕io delenda est!

cmpx '0.1*b' '5⍟b' '10⍟b' '17⍟b'

0.1*b → 9.0E¯3 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

* 5⍟b → 1.4E¯3 | -84% ⎕⎕⎕⎕⎕⎕

* 10⍟b → 1.2E¯3 | -87% ⎕⎕⎕⎕⎕

* 17⍟b → 1.4E¯3 | -84% ⎕⎕⎕⎕⎕⎕

The questions are, why is ⍺*⍵ so much slower than ⍺⍟⍵? And shouldn't power be faster than logarithm?

Explaining the relative speeds of * and ⍟ (both monadic and dyadic) is complicated.

1. *⍵ is computed by exp() and ⍟⍵ is computed by log() or log2() or log10(), C library functions all.

2. Mathematically and in the Dyalog implementation, ⍺⍟⍵ ←→ (⍟⍵)÷⍟⍺.

Mathematically, ⍺*⍵ ←→ *⍵×⍟⍺. In the Dyalog implementation, when ⍺≥0, ⍺*⍵ is computed by pow(), a C library function.

3. scalar⍟⍵ invokes different C library functions when scalar is 2 (log2), *1 (log), or 10 (log10). These do not make a big difference in speed.

e←*1

cmpx '2⍟b' 'e⍟b' '10⍟b' '5⍟b' '17⍟b'

2⍟b → 1.70E¯3 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

* e⍟b → 1.09E¯3 | -37% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

* 10⍟b → 1.18E¯3 | -31% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

* 5⍟b → 1.39E¯3 | -19% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

* 17⍟b → 1.38E¯3 | -19% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

In fact:

cmpx '2⍟b' '(⍟b)÷⍟2'

2⍟b → 1.70E¯3 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

(⍟b)÷⍟2 → 8.94E¯4 | -48% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

cmpx '10⍟b' '(⍟b)÷⍟10'

10⍟b → 1.18E¯3 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

(⍟b)÷⍟10 → 8.96E¯4 | -25% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

x←(⍴b)⍴10

cmpx 'x⍟b' '(⍟b)÷⍟x'

x⍟b → 2.82E¯3 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

(⍟b)÷⍟x → 1.56E¯3 | -45% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

I suspect the verboser APL expressions are faster because the divides are done using vector instructions, whereas the Dyalog implementation of ⍺⍟⍵ does scalar divides in a C loop, log(⍵[j])/log(⍺[j]) or log(⍵[j])/d, where d←log(scalar).

4. As stated above, ⍺*⍵ when ⍺≥0 is done using the C library function pow(). Again, using a mathematically equivalent expression, *⍵×⍟⍺, is faster, I suspect for the same reason why (⍟⍵)÷⍟⍺ is faster than ⍺⍟⍵--the divides are done using vector instructions.

cmpx '0.1*b' '*b×⍟0.1'

0.1*b → 7.79E¯3 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

* *b×⍟0.1 → 4.19E¯3 | -47% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

reldiff←{⌈/,(|⍺-⍵)÷(|⍺)⌈(|⍵)} ⍝ maximum relative difference

(0.1*b) reldiff *b×⍟0.1

8.72981E¯14

(The last expression indicates that 0.1*b and *b×⍟0.1 are "equal", despite the asterisk in the first column of the cmpx result.)

In addition, I suspect 0.1*b and *b×⍟0.1 suffer in speed from hitting lots of underflows. (⍺*⍵ for finite arguments should not be 0, other than underflows. Likewise *⍵.)

0.1*12345

0

+/ 0 = 0.1*b

99677

What if there are very few or no underflows?

b1←b÷1e3

+/ 0 = 0.1*b1

0

cmpx '0.1*b1' '*b1×⍟0.1'

0.1*b1 → 5.91E¯3 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

* *b1×⍟0.1 → 1.14E¯3 | -81% ⎕⎕⎕⎕⎕⎕

(0.1*b1) reldiff *b1×⍟0.1

3.12521E¯14

5. In a hypothetical situation where I get to write the C library functions, I believe log() can be faster than exp():

You can have a precomputed table so that ⍟⍵ only need to be computed in detail in the range ÷√2 and √2 (about 0.7 and 1.4). For ⍵ in that range, equation 4.1.27 in Abramowitz and Stegun

⍟⍵ ←→ 2 × +/ j ÷⍨ ((⍵-1)÷(⍵+1)) * j←1+2×⍳n

provides a fast converging series for t←(⍵-1)÷(⍵+1), or 0.171573 or less. The terms of the series are odd powers of t, which means each term is smaller than the previous by a factor of t*2, or 0.0294373, which declines to 0 rapidly. See the J Wiki essay, Extended Precision Functions.

On the other hand, *⍵ would be computed by +/(⍵*j)÷!j←⍳n. You can have a precomputed table here too so that *⍵ only need to use the series for small |⍵ (<1), but you are still operating on both even and odd powers.

6. Let us do one big happy benchmark:

x←?1e5⍴0

y←?1e5⍴0

cmpx '⍟x' 'x⍟y' '(⍟y)÷⍟x' '*x' 'x*y' '*y×⍟x'

⍟x → 6.43E¯4 | 0% ⎕⎕⎕

* x⍟y → 2.81E¯3 | +337% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

* (⍟y)÷⍟x → 1.59E¯3 | +148% ⎕⎕⎕⎕⎕⎕⎕⎕

* *x → 1.11E¯3 | +72% ⎕⎕⎕⎕⎕

* x*y → 6.21E¯3 | +865% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

* *y×⍟x → 1.82E¯3 | +182% ⎕⎕⎕⎕⎕⎕⎕⎕⎕

If you compare ⍟x to *x (6.43e¯4÷⍨1.11e¯3 or 1.73), and (⍟y)÷⍟x to *y×⍟x (1.59e¯3÷⍨1.82e¯3 or 1.14), the differences aren't so large, certainly not large enough (no greater than a factor of 2) to sweat over.

7. Some morals of this story:

- There are lies, damned lies, and benchmarks.
- Is it really worth it to have extra code for 2⍟⍵ and (*1)⍟⍵ and 10⍟⍵, when the benefits are not that significant?
- Speed improvements made in one area (e.g. divide using vector instructions) create opportunities in other areas.

- Roger|Dyalog
**Posts:**238**Joined:**Thu Jul 28, 2011 10:53 am

1 post
• Page

**1**of**1**### Who is online

Users browsing this forum: No registered users and 1 guest

Powered by phpBB © 2000, 2002, 2005, 2007 phpBB Group