To Each or Not to Each

by **paulmansour** on Wed May 17, 2017 3:37 pm

I'm contemplating ¨ with ⎕S.

My main use-case for regex is a character matrix, where I am searching each row. I first convert this to a vector of vectors V.

My first instinct was to pass V to ⎕S in line mode. However, there is a fair amount of work involved in lining up the results with the original input, which is critical for my use case. So I tried it with ¨, which eliminates the lining-up issue. I assumed this would be much slower, and sometimes it is (though really not by much), but oddly, on very simple search patterns ¨ appears to be faster, even ignoring the extra work involved in lining up the results in the no-each case.
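To make the lining-up issue concrete, here is a minimal sketch of the two forms. The matrix `M`, the pattern `'an'` and the names `flat` and `each` are made up for illustration; transformation code 0 asks ⎕S for match offsets.

```apl
M←3 6⍴'bananacat   pandas'   ⍝ hypothetical 3-row character matrix
V←↓M                         ⍝ one character vector per row
flat←('an'⎕S 0)V             ⍝ one call in line mode: a flat list of offsets, rows not identified
each←('an'⎕S 0)¨V            ⍝ one call per row: one (possibly empty) result per row
(≢each)=≢V                   ⍝ the each form has exactly one item per input row
```

With the each form the result has the same shape as `V`, so no further bookkeeping is needed to relate matches to rows.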

In addition to the advantage of lining up results with input, the ¨ allows the use of the Document mode variant, which means you can search for line ending chars.

Aside from speed (con most of the time), and searching for newlines (pro - I think) I'm wondering if I am missing anything with respect to the difference of ¨ versus no ¨.

by **Richard|Dyalog** on Thu May 18, 2017 10:19 am

The most likely explanation for the speed up you are seeing is that in simplifying the input to ⎕S to single character vectors, ⎕S is switching to an optimised mode whereby it uses ⍷ to find matches on the data rather than use the PCRE search engine. Using PCRE is slower for a number of reasons - principally that it itself is slower than ⍷ because it can do far more complex searches, and also because it requires that the data in the workspace be re-encoded and reconstructed into a format suitable for it to use.

⎕S is quite a complicated function and passing data to it in sections (using each) has some significant effects on what you can and cannot do with the data and may or may not be beneficial to you. ⎕S in line mode is not equivalent to ⎕S¨ with document mode set, because the former will split the data into logical lines wherever it finds a line ending character - implicitly between the vectors of character vectors and explicitly in the data itself, which is why you can search for line ending characters. However, ⎕S allows you to enquire which "block" (line) the match occurred on, which may be useful especially when you use a transformation function, and this will not be available if ⎕S is invoked multiple times. Which form suits you better will very much depend on what you are trying to do.
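As a rough sketch of the two alternatives described above, assuming made-up input and the pattern `'an'`: transformation code 2 reports the block (line) number of each match, and the `Mode` variant option selects document mode. The names `hits`, `rows` and `doc` are illustrative only.

```apl
V←'banana' 'cat' 'pan'           ⍝ hypothetical input
hits←('an'⎕S 2 0)V               ⍝ per match: block (line) number and offset, both origin zero
rows←⊃¨hits                      ⍝ block number of each match, usable to group matches by line
doc←('an'⎕S 0⍠'Mode' 'D')¨V      ⍝ each element searched separately as its own document
```

In the single-call form the block numbers do the job of lining matches up with the input. In the each form every call sees only one block, so the block number carries no information about position in `V`.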

by **paulmansour** on Thu May 18, 2017 1:43 pm

Thanks Richard.
