RegEx: identify fenced code blocks in Markdown

by **kai** on Tue May 31, 2016 9:00 am

I have this with ⎕IO←0 and ⎕ML←3:

Code: Select all:

          q←1↓∊(⎕UCS 10),¨'Para 1 ' '' ' ~~~' '{' '+/⍳⍵' '}' '   ~~~' '' 'para 2' '' '    ~~~'
          q
    Para 1

     ~~~
    {
    +/⍳⍵
    }
       ~~~

    para 2

        ~~~

The rules for identifying a fenced code block are:

It starts with a line that has zero to three whitespace characters, followed by at least 3 "~" characters (= no upper limit) followed by a newline character. The same rule defines the end of a code block.

The number of whitespace characters as well as the number of "~" defining the fence can vary between start and end definition.
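As a cross-check outside APL, the fence rule above can be sketched as a per-line predicate. This is a minimal Python sketch using Python's `re` module, not Dyalog's ⎕S; the name `is_fence` is made up for illustration, and it assumes each line is tested without its trailing newline.

```python
import re

# 0 to 3 whitespace characters, then 3 or more "~", and nothing else on the line.
FENCE = re.compile(r"\s{0,3}~{3,}")

def is_fence(line: str) -> bool:
    """Return True if `line` (without its newline) qualifies as a fence line."""
    return FENCE.fullmatch(line) is not None

lines = ["Para 1 ", "", " ~~~", "{", "+/⍳⍵", "}", "   ~~~", "", "para 2", "", "    ~~~"]

# Indices of the lines that qualify as fences; the last line has 4 leading
# spaces and is therefore rejected.
fence_rows = [i for i, line in enumerate(lines) if is_fence(line)]
print(fence_rows)
```

Testing line by line sidesteps the multi-line matching questions discussed below, at the cost of pairing opening and closing fences by hand.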

This regular expression seems to work fine:

Code: Select all:

          '^\s{0,3}~{3,}[^~].*^\s{0,3}~{3,}$'⎕S 0 1 ⍠('Mode' 'M')('DotAll' 1)⊣q
    8 21
          21↑8↓q

     ~~~
    {
    +/⍳⍵
    }
       ~~~

However, when I convert q into a nested variable:

Code: Select all: q2←(⎕UCS 10){⎕ML←1 ⋄ 1↓¨⍺{⍵⊂⍨⍺=⍵}⍺,⍵}q

and then try again:

Code: Select all:

          '^\s{0,3}~{3,}[^~].*^\s{0,3}~{3,}$'⎕S 0 1 ⍠('Mode' 'M')('DotAll' 1)⊣q2
    9 26

The result has changed, and it is wrong! I have no idea why that is. Surely the nested version (q2) should be treated like the original version (q) !

May I ask the RegEx authorities for an explanation for this?

A second obstacle: if the APL code block contains a "~" character then I expected the regular expression to go wrong. For example:

Code: Select all:

          q←1↓∊(⎕UCS 10),¨'Para 1 ' '' ' ~~~' '{' '~0 1⍷⍵' '}' '   ~~~' '' 'para 2' '' '    ~~~'

It should go wrong because [^~] now fails when it reaches the ~ in ~0 1⍷⍵ but it keeps working:

Code: Select all:

          '^\s{0,3}~{3,}[^~].*^\s{0,3}~{3,}$'⎕S 0 1 ⍠('Mode' 'M')('DotAll' 1)⊣q
    8 23
          23↑8↓q

     ~~~
    {
    ~0 1⍷⍵
    }
       ~~~

It seems as if the [^~] is not needed and indeed it isn't:

Code: Select all:

          '^\s{0,3}~{3,}[^~].*^\s{0,3}~{3,}$'⎕S 0 1 ⍠('Mode' 'M')('DotAll' 1)⊣q
    8 23
          '^\s{0,3}~{3,}.*^\s{0,3}~{3,}$'⎕S 0 1 ⍠('Mode' 'M')('DotAll' 1)⊣q
    8 23

But shouldn't the .* then consume all characters until the end of the document because it is greedy?

by **MBaas** on Tue May 31, 2016 3:24 pm

Hi Kai,

there is a small problem with your q: the "~~~" below para2 are prefixed with 4 whitespaces, so the regex doesn't match that part...! If you remove that one whitespace and add a

Code: Select all: ('Greedy' 0)

or

Code: Select all: ('Greedy' 1)

, you'll see different results :-)

by **kai** on Tue May 31, 2016 3:32 pm

Yes, but that has a purpose: it confirms that it is NOT found because of the 4 white spaces.

by **kai** on Tue May 31, 2016 3:51 pm

Michael has a point regarding greediness. If I make sure that even the third fence has just three leading whitespace characters, things get worse.

Code: Select all:

          q←1↓∊(⎕UCS 10),¨'Para 1 ' '' ' ~~~' '{' '+/⍳⍵' '}' '   ~~~' '' 'para 2' '' '   ~~~'
          '^\s{0,3}~{3,}[^~].*^\s{0,3}~{3,}$'⎕S 0 1 ⍠('Mode' 'M')('DotAll' 1)⊣q
    0 37

That's obviously because it's greedy, so we must improve by making it non-greedy:

Code: Select all:

          '^\s{0,3}~{3,}[^~].*?^\s{0,3}~{3,}$'⎕S 0 1 ⍠('Mode' 'M')('DotAll' 1)⊣q
    8 21

Okay, that's fine, but my problem with the nested array remains unsolved:

Code: Select all:

          q←,¨'Para 1 ' '' ' ~~~' '{' '+/⍳⍵' '}' '   ~~~' '' 'para 2' '' '   ~~~'
          '^\s{0,3}~{3,}[^~].*?^\s{0,3}~{3,}$'⎕S 0 1⍠('Mode' 'M')('DotAll' 1)⊣q
    9 26

by **MBaas** on Tue May 31, 2016 5:28 pm

Ok, sorry - I misunderstood the part about ".*" consuming everything...
The nested case is interesting indeed and I look forward to a Guru-explanation for that one ;-)

by **DanB|Dyalog** on Tue May 31, 2016 10:57 pm

Kai,
searching through a string and searching through a list of strings (VTV) are not entirely the same.
For VTVs the line separator (the EOL option) is CR,LF by default, which means that a match gains one extra character for every line boundary it spans, because you used a single LF (UCS 10) in your string.

You can see this with a simple example:

Code: Select all:

          ⎕ucs⊃'.*'⎕s'&'⎕OPT('Mode' 'M')('DotAll' 1) ,7⍴'aaa',⎕ucs 10
    97 97 97 10 97 97 97
          ⎕ucs⊃'.*'⎕s'&'⎕OPT('Mode' 'M')('DotAll' 1) ,'aaa' 'aaa'
    97 97 97 13 10 97 97 97

You need to add the EOL option:

Code: Select all:

          ⎕UCS⊃'∧\s{0,3}~{3,}[∧~].*∧\s{0,3}~{3,}$'⎕S'&'⎕OPT('Mode' 'M')('DotAll' 1)('EOL' 'LF')⊣q
    10 32 126 126 126 10 123 10 43 47 9075 9077 10 125 10 32 32 32 126 126 126
          ⎕UCS⊃'∧\s{0,3}~{3,}[∧~].*∧\s{0,3}~{3,}$'⎕S'&'⎕OPT('Mode' 'M')('DotAll' 1)('EOL' 'LF')⊣q2
    10 32 126 126 126 10 123 10 43 47 9075 9077 10 125 10 32 32 32 126 126 126

EOL has no effect in the string case, but it doesn't hurt to add it.

As for the [^~] "not working" it is actually working.
You specified "...~{3,}[∧~].*..." and the engine happily matched a minimum of 3 ~s. It stopped as soon as it found a character that was NOT ~. For sure the next character after that match was NOT a ~ and the [^~] matched naturally. It had to, unless we were at the end of the whole string. Here it matched the UCS 10 between the lines. It was superfluous as you found out.

It won't mind the ~ in "~0 1⍷⍵". The requirement is "∧\s{0,3}~{3,}", a minimum of 3 times, and that doesn't match so it matches a single "." (any char) instead.
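The point about `[^~]` being superfluous can be checked outside APL as well. The following is a minimal sketch in Python's `re` module, which is a different engine from the PCRE engine behind ⎕S but agrees on these constructs; the helper name `offset_and_length` is made up for illustration, and the text is assumed to be joined with LF as in the simple-string case.

```python
import re

# Same lines as the second q in the thread: the code block contains a "~"
# and the last fence has 4 leading spaces, so it never qualifies as a fence.
lines = ["Para 1 ", "", " ~~~", "{", "~0 1⍷⍵", "}", "   ~~~", "", "para 2", "", "    ~~~"]
q = "\n".join(lines)

# MULTILINE makes ^ and $ match at line boundaries; DOTALL lets "." match newlines.
flags = re.MULTILINE | re.DOTALL
with_class = re.compile(r"^\s{0,3}~{3,}[^~].*^\s{0,3}~{3,}$", flags)
without_class = re.compile(r"^\s{0,3}~{3,}.*^\s{0,3}~{3,}$", flags)

def offset_and_length(pattern, text):
    """Return (offset, length) of the first match, like ⎕S 0 1 does per match."""
    m = pattern.search(text)
    return (m.start(), m.end() - m.start())

a = offset_and_length(with_class, q)
b = offset_and_length(without_class, q)
print(a, b)  # both patterns select the same span
```

In this sketch `[^~]` simply consumes the newline that follows the opening fence, which `.*` would have consumed anyway, so dropping it changes nothing.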

You are right, the .* should consume all characters until the end of the document because it is greedy BUT, as Michael points out, your last line doesn't match because of the extra space at the beginning of the line.

Here is a simpler example:

Code: Select all:

          '∧D.*?∧D'⎕S'&'⎕OPT('Mode' 'M')('DotAll' 1),¨'D' 'l2' 'D' 'l4' 'D' 'l6' 'D'
    ┌──┬──┐
    │D │D │
    │  │  │
    │l2│l6│
    │  │  │
    │D │D │
    └──┴──┘

Here the text delimiter is 'D' which MUST start at the beginning of a line. Instead of using the 'Greedy' 0 option I used a 'local' lazy option (the ? after .* which only applies to it). If you remove it it will match the entire text.
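The same lazy-versus-greedy behaviour can be reproduced in Python's `re` module as a rough cross-check. This is only a sketch: it is not ⎕S, and it assumes the lines are joined with a single LF rather than the CR,LF that ⎕S inserts by default for a list of strings.

```python
import re

text = "\n".join(["D", "l2", "D", "l4", "D", "l6", "D"])

# MULTILINE anchors ^ at every line start; DOTALL lets "." cross line ends.
flags = re.MULTILINE | re.DOTALL

# Lazy: stop at the first following line that starts with D.
lazy = re.findall(r"^D.*?^D", text, flags)

# Greedy: run to the last line that starts with D.
greedy = re.findall(r"^D.*^D", text, flags)

print(lazy)
print(greedy)
```

The lazy pattern yields two matches (the `l2` and `l6` blocks, with `l4` left between them), while the greedy one yields a single match covering the entire text.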

So your delimiter, here, is "^\s{0,3}~{3,}" and your text is ".*?". Instead of repeating the delimiter (and risk typos) you can ask the engine to reuse it by grouping it in parentheses and referring to it a second time with (?1) like this:

Code: Select all: '(∧\s{0,3}~{3,}).*?(?1)' ⎕S '&' ⎕opt ('Mode' 'M')('DotAll' 1)⊣q

This reads:
(∧\s{0,3}~{3,}) define and use group 1
.*? match as little as possible
(?1) reuse group 1 to match

You don't really need the $

Hope this helps.

p.s. I used Classic to test my assertions, and ^ doesn't work the same way there as in Unicode (the keyboard enters the APL ∧ but ⎕S needs the ASCII ^), so be careful when you cut and paste :(

by **kai** on Wed Jun 01, 2016 8:09 am

Thanks for the explanation. Very helpful.

That in case of the nested array a different result is returned is in my opinion a bug.

The `^∧` business is one more good reason to bury the classical version sooner rather than later.

by **Richard|Dyalog** on Wed Jun 01, 2016 11:04 am

> That in case of the nested array a different result is returned is in my opinion a bug.

Kai - let me expand on Dan's explanation to clarify why it is working correctly:

In specifying mixed mode (Mode M) you instructed the interpreter (using the PCRE search engine) to process the text in its entirety rather than line-by-line (line mode). Line mode reduces memory requirements and is set by default, but does not allow a search pattern to match across multiple lines because the search engine never sees beyond the end of any one line at a time - and you correctly chose a non-line mode because your search pattern needs to do exactly that.

However, in specifying that you wanted the text to be processed in its entirety but providing it as a vector of vectors (i.e. separate lines) it was necessary for the interpreter to construct the entire text and this required that the missing line ending characters be added. The default line ending CRLF (13 10) was assumed but you had previously used 10 (LF) - thus the text being processed was different in your two examples, and the results were different (and correct in each case).

When you removed the line endings from q to create q2, an essential piece of information was taken away; when q2 was presented to ⎕S it had no indication of what line ending you had in mind and assumed CRLF by default. However, ⎕S does allow you to specify the line ending using the EOL variant option that Dan mentioned: you'll get the behaviour you expected if you specify that the missing line ending characters are LF:

Code: Select all:

          '^\s{0,3}~{3,}[^~].*^\s{0,3}~{3,}$'⎕S 0 1 ⍠('Mode' 'M')('DotAll' 1)('EOL' 'LF')⊣q2
    8 21
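The arithmetic behind the two results can be reproduced outside APL by joining the lines explicitly. The following is a minimal Python sketch and rests on several assumptions: Python's `re` module has no EOL option and is not the PCRE engine used by ⎕S, so the two joins are done by hand; the trailing `$` is dropped because Python's `$` does not treat CR as part of a line ending; and `offset_and_length` is a made-up helper name.

```python
import re

lines = ["Para 1 ", "", " ~~~", "{", "+/⍳⍵", "}", "   ~~~", "", "para 2", "", "    ~~~"]

# Lazy version of the fence pattern, without the trailing "$".
pattern = re.compile(r"^\s{0,3}~{3,}.*?^\s{0,3}~{3,}", re.MULTILINE | re.DOTALL)

def offset_and_length(text):
    """Return (offset, length) of the first match in `text`."""
    m = pattern.search(text)
    return (m.start(), m.end() - m.start())

# What the simple string q contains: lines separated by LF only.
joined_lf = "\n".join(lines)

# What gets searched when a list of lines is joined with the default CR,LF.
joined_crlf = "\r\n".join(lines)

result_lf = offset_and_length(joined_lf)
result_crlf = offset_and_length(joined_crlf)
print(result_lf)
print(result_crlf)
```

Each line boundary before the match start adds one character to the offset, and each boundary inside the match adds one to the length, which is where the difference between the two results comes from.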

by **DanB|Dyalog** on Wed Jun 01, 2016 11:05 am

You don't get a different result if you use the EOL option.
You will get a different result only if you don't. And that is because YOU chose LF as line delimiter. The program has no idea what you will choose.

You may not like the default but this is not a bug, it's a feature :)

by **kai** on Wed Jun 01, 2016 11:09 am

Dan and Richard: point taken.

I wonder whether that should be mentioned somewhere in the documentation.
