Wohoo 14

by **Roger|Dyalog** on Wed Feb 04, 2015 7:25 pm

The index-of family of functions (⍳ ∊ ∪ ∩ ~) is a complex topic. For an in-depth discussion please see Index-Of, A 30-Year Quest (http://www.jsoftware.com/papers/indexof/indexofscript.htm), presented at the J Conference in Toronto, 2014-07-25. An alternative presentation was the workshop SP4: Exploring Index-Of (http://www.dyalog.com/uploads/conference/dyalog14/workshops/SP4_Exploring_IndexOf.zip) at the Dyalog Conference, 2014-09-21. (The link is from http://www.dyalog.com/user-meetings/dyalog14.htm.) The workshop consists of 21 questions, but they are "leading questions" which, in answering them, will (I claim) provide useful insights on index-of.

The ideas discussed in the paper and the workshop are not yet fully implemented in Dyalog APL, but we are working on them. So Paul is indeed correct that currently (v14.0) if x and y are char matrices then inverted table index-of (8⌶) is the fastest way to do index-of on them:

      a←(' ',⎕a,⎕d)[?1000 22⍴37]
      x←a[?1e6⍴≢a;]
      y←a[?1.1e6⍴≢a;]
      cmpx 'x⍳y' 'x{(↓⍺)⍳↓⍵}y' '(,⊂x)(8⌶)(,⊂y)'
  x⍳y            → 3.42E¯1 |   0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  x{(↓⍺)⍳↓⍵}y    → 3.41E¯1 |  -1% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
  (,⊂x)(8⌶)(,⊂y) → 8.70E¯2 | -75% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

x⍳y and x{(↓⍺)⍳↓⍵}y invoke the same code and so run at the same speed. (Timing differences of less than 5% are artifacts of the timing process and are not significant, or repeatable.) 8⌶ implements the latest thinking, including the use of the CRC instruction available in SSE4.1 (any Intel CPU after 2008). It is expected that in v14.1 the faster code would be used in ⍳, so employing the circumlocutory (,⊂x)(8⌶)(,⊂y) in place of x⍳y is not recommended.

As mentioned above, we are working on implementing the latest ideas. The following benchmarks show the speed-ups that can be expected once the implementation is complete:

      x←?1e6⍴2e9
      y←?1.1e6⍴2e9
      cmpx 'x f y' 'x⍳y'
  x f y → 5.78E¯2 |    0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕                           
  x⍳y   → 1.83E¯1 | +216% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
      cmpx 'y g x' 'y∊x'
  y g x → 4.61E¯2 |    0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕                             
  y∊x   → 1.74E¯1 | +276% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

      x←?1e6⍴2e6
      y←?1.1e6⍴2e6
      cmpx 'x f y' 'x⍳y'
  x f y → 1.64E¯2 |    0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕                               
  x⍳y   → 7.62E¯2 | +365% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
      cmpx 'y g x' 'y∊x'
  y g x → 1.06E¯2 |    0% ⎕⎕⎕⎕⎕⎕                                  
  y∊x   → 7.50E¯2 | +608% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

      x←?1e6⍴4e6
      y←?1.1e6⍴4e6
      cmpx 'x f y' 'x⍳y'
  x f y → 4.92E¯2 |    0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕                            
  x⍳y   → 1.58E¯1 | +221% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕
      cmpx 'y g x' 'y∊x'
  y g x → 9.81E¯3 |     0% ⎕⎕⎕                                     
  y∊x   → 1.50E¯1 | +1426% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

The benchmarks also show that it is tricky to do benchmarks. Your faithful implementers are sneaky guys and will exploit properties of the data to effect faster computation. In this case, small-range is exploited with the range being computed efficiently using vector instructions available in any non-ancient CPU. To give you an idea of the details involved: ∊ on 2e9 uses hashing; ∊ on small-range uses a table, a faster computation. The factor for 2e6 is smaller than the factor for 4e6 because the v14.0 implementation also uses a table for that range. The factor for 4e6 is bigger because the v14.0 implementation uses hashing where v14.1 would use a table, but without using more temporary space. I conjecture that in practice all integers arguments to the ⍳ family are small range, so that small-rangeness is a special case but not an unusual case. And on and on.

Stig's example is interesting and I am glad he brought it up. I have a suggestion regarding the handling of floats: unless you have reason to believe that ⎕ct is significant, try setting a local ⎕ct←0. As well, there is a possibility that ⍳ could incorporate the same efficient code that is already available in 8⌶. That is, there is the possibility that x xFindn y can be done using plain old x⍳y, with x⍳y being the faster. The following benchmark shows the upper bound on the amount of speed-up:

      xData←{(?⍵⍴100),a[?⍵⍴≢a←'Stig' 'Paul' 'Morten' 'John' 'Jay' 'Nick' 'Fi' 'Roger'],⍪0.1×?⍵⍴200}
      xConvert←{t←↓[⎕IO]⍵ ⋄ t[1+⎕IO]←↑¨t[1+⎕IO] ⋄ t}

      x←xData 1e6
      y←xData 1.1e6
      x1←xConvert x
      y1←xConvert y

      cmpx 'x1(8⌶)y1' 'x xFindn y'
  x1(8⌶)y1   → 2.34E¯1 |    0% ⎕⎕⎕⎕⎕⎕⎕⎕                                
  x xFindn y → 1.14E0  | +388% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

Of course, until the time when ⍳ handles it internally, you have to account for the time needed for the conversion:

      cmpx 'x1←xConvert x⊣y1←xConvert y'
5.03E¯1

I should point out that compared to x1 (the "inverted table" format), x is extravagantly profligate in its use of space:

      ⎕size 'x' 'x1'
128000040 15000160

Inefficient use of space leads directly to unavoidable inefficiencies in time. I will be in touch with Stig regarding additional details on the matrices in his application (not to twist his arm to change his data, but to get details on how best to enhance ⍳ on his data :-).

Finally, the extension to ⍳ in v14.0 ("looking for rows") greatly improved it as a tool of thought, and the possibilities take time to sink in. Using ⍳ (and improving its implementation) for Stig's application is one example; the blog post A Speed-Up Story (http://www.dyalog.com/blog/2014/11/a-speed-up-story-2/) is another.

by **stignielsen** on Thu Feb 05, 2015 10:15 am

Thomas, sorry for the comment alignment ;-!

I forgot to mention that xFindn is a matrix lookup that is used to find occurences of rows from matrix into another, both of same shape and datatypes column wise.

It was introduced when we had an example where we had to find unique rows in a matrix (i.e. mat {(↓⍺)⍳↓⍵} mat). This operation took 6+ hours(!) with {(↓⍺)⍳↓⍵} and was reduced to 5 sec. with the xFindn implementation.

by **Roger|Dyalog** on Thu Feb 05, 2015 5:05 pm

I note that xFindn makes assumptions that the interpreter can not make either with ⍳ or with {(↓⍺)⍳↓⍵}. Specifically, the problem is that xFindn assumes that the result is the same if you replace a floating point column c by c⍳c. The following example demonstrates that the results are not necessarily the same.

      ⎕ct
1E¯14
      x←1+(0.5×⎕ct)×⍉4 4⊤⍳16
      y←⍉↑⍳⍨¨↓⍉x
      t←x ⋄ t[;0]←t[;0]⍳t[;0] ⋄ t[;1]←t[;1]⍳t[;1]
      y ≡ t
1
      x⍳x
0 0 0 1 0 0 0 1 0 0 0 1 4 4 4 5
      y⍳y
0 0 0 3 0 0 0 3 0 0 0 3 12 12 12 15

If ⎕ct≠0, ⍳ on multiple floating point columns is a difficult problem if it has to be fast. The difficulties are unavoidable from the way tolerance and ⍳ are defined -- basically each value is not a single point but an interval of equality. If you know (and the interpreter can not know) that ⎕ct is not a factor, you should set ⎕ct←0.

8⌶ ignores the current value of ⎕ct and proceeds as if ⎕ct=0.

by **Roger|Dyalog** on Tue Mar 31, 2015 4:25 am

In the 14.1 beta ⍳ has special code for relational tables, the arguments accepted by Stig's xFindn function. Relational tables are matrices in which items of a column have the same type.

First, some "random" data.

Code: Select all: u←(?m⍴5e5),(↓⎕A[?(m,10)⍴26]),(↓⎕AV[?(m,5)⍴256]),⍪0.01×?m⍴20000 x←u[?1e6⍴≢u;] y←u[?1e6⍴≢u;]

Since the data in this case contains floating point numbers, the special code requires non-tolerant comparisons. (xFindn also requires non-tolerant comparisons to work correctly. See the previous message in this thread.)

Code: Select all: ⎕ct←0

Now the benchmarks. The second benchmark shows that ⍳ exploits "selfies", the case where the left and right arguments are "the same". x is the same as x (of course); x is not "the same" as (⍴x)⍴x. Checking whether the arguments are "the same" is instantaneous for the interpreter.

Code: Select all: cmpx 'x⍳y' 'x xFindn y' x⍳y → 9.69E¯1 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕ x xFindn y → 2.39E0 | +146% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕ cmpx 'x⍳x' 'x xFindn x' x⍳x → 4.77E¯1 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕ x xFindn x → 2.32E0 | +387% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

The Fine Print:

• As stated above, if the data contains floating point numbers, ⎕ct←0 is required for the special code to be invoked. (If ⎕ct≠0, x⍳y would still be correct, but slower.)

• The items of a column need not have the same length, but x⍳y would be a bit faster if they do have the same length (as above).

• The special code uses the CRC32 instruction (http://en.wikipedia.org/wiki/SSE4#SSE4.2) if available; it would be available on a CPU newer than 2008. x⍳y would be a bit slower otherwise.

Finally, if you have a choice, storing the data as inverted tables is more efficient both in space and in time. 8⌶ computes index-of on inverted tables and has been available since 14.0.

Code: Select all: xi←↑¨↓[1] x ⍝ the equivalent inverted tables yi←↑¨↓[1] y ⎕size 'x' 'xi' 'y' 'yi' 190378352 32000208 190378512 32000208 cmpx 'x⍳y' 'xi(8⌶)yi' x⍳y → 9.53E¯1 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕ xi(8⌶)yi → 3.25E¯1 | -66% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕ cmpx 'x⍳x' 'xi(8⌶)xi' x⍳x → 4.71E¯1 | 0% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕ xi(8⌶)xi → 1.59E¯1 | -67% ⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕⎕

The tool of thought for

software solutions

Wohoo 14

Re: Wohoo 14

Re: Wohoo 14

Re: Wohoo 14

Re: Wohoo 14

Who is online

QUICK LINKS