RegExp: Matching Balanced Parenthesis and Quotes (greedy, non-recursive)

Solution

I need to match all the text within balanced parens, single quotes, or double quotes, but don’t snag on the content within, i.e. be greedy! “Balanced” means there should be a corresponding single or double quote, but not mix them, or there should be a closing paren for each opening paren.

PCRE:

The match will be in capturing group #3.

In short, this regular expression uses a positive lookahead (?=...)  within the lookaround conditional  (?...)  and the backreference \2  to match the initiating character. If it is a single quote or double quote, it was contained within \2 , but if it was an open paren ( , it is not contained in a capture group and therefore it should match a close paren. In retrospect, this could be enhanced, and also needs some tweaks for JavaScript compatibility since lookaround conditionals are not supported.

Do note that this matches the beginning and ending of the string ^$ , so may require tweaking for other applications.

Test Cases

regex101 link: https://regex101.com/r/fG4rZ8/3

Matches:

These will match hello , hello world , or in some cases 'hello world  for the last test case.

Doesn’t Match:

Node.js: Split/Count Number of Lines (Unicode Compatible)

The task was to read n number of a lines from a file in Node.js. The assumption is that the read buffer is UTF-8. The caveat is that there is more to just new lines in Unicode than “New Line (ASCII 10 or \n)” and “Carriage Return (ASCII 13 or \r)”.  This SO comment put me in the right direction.

Unicode Technical Standard #18 § 1.6 describes the following rule regarding line boundaries, (PS and LS are defined in Unicode 6.1 § 16.2 Layout Controls):

RL1.6 Line Boundaries
To meet this requirement, if an implementation provides for line-boundary testing, it shall recognize not only CRLF, LF, CR, but also NEL (U+0085), PS (U+2029) and LS (U+2028).

To put it into practice, the number of lines can be tallied while the ReadStream has incoming data:

The key is the regular expression, derived straight from RL1.6,  /\r\n|[\n\r\u0085\u2028\u2029]/ , which looks for any of the possible new line characters with precedence to CRLF (greedy). Don’t forget to listen for the “end” event and also remove the “data” listener on “end” or “error!”

In the context of splitting incoming data by line, simply calling  chunk.toString('utf8').split(/\r\n|[\n\r\u0085\u2028\u2029]/g) will do the trick. Note that if working with CSV files, parsing will require a bit more effort if escaped new-lines will need to be taken into consideration, and PS, LS and NEL are probably not characters that should act as line delimiters in that context anyway.

MySQL UDF: Perl Regular Expression Clauses

Currently working on migrating a database in MySQL. I needed to to some perl-like regex find and replaces on cells. MySQL does not support this natively. It does support REGEX/RLIKE, which is basically a LIKE clause with regular expression support – this is crap: is only useful for lookup queries and not data manipulation. One may argue that relational databases should only be used to load and serve static data and any manipulation of data should be done externally of the database. Well I say, “Bollocks!” in this case. When I’m on a utility server doing one-time, one-way updates to row data I don’t care if there’s a performance hit – of course I’m not stupid enough to implement and utilize these types of queries in a production environment (in this case the data coming in should be prepared and optimized before hand in order to maximize query times).

So after some really brief searching I found this little library called “lib_mysqludf_preg.” I’ll just document my installation procedure. Remember, I’m on CentOS 5.8 (recently upgraded from 5.6, went pretty smooth), oh and per usual I’m doing this as root since I haven’t broken that habit yet.

Download and Build the Module

First, a pcre module is required, so I went ahead and grabbed that:

While you’re at it, make sure you have things like make:

Create a compilation directory and grab the lib_mysqludf_preg source (double check the site for the latest stable build, at the time of this writing it is 1.0.1):

Now you should be in the directory full of source code. Go ahead and perform the preliminary configuration and checks for the upcoming build:

Everything went smoothly for me. It found mysqlbin but threw the notice “ERROR 1045 (28000): Access denied for user ‘root’@’localhost’ (using password: NO)” don’t be alarmed, it is just making sure mysqlbin is available. It also found mysql_config at /usr/bin/mysql_config and PCRE at /usr/bin/pcre-config (with pcre v6.6).

OK, so let’s “make” it:

OOPS! Looks like the initial make crapped out. I’m missing mysql dev files… soooo:

Now reconfigure and try again:

Finally! Looks like everything went smooth. OK let’s install it:

Success! Now to see if I can load the so module within MySQL:

Uh-Oh, can’t find the .so module…. read on.

Register and Install the Module in MySQL

We’re not done yet. So far all we’ve done is built the .so plugin, but we need MySQL to find it. By default, my distro put it in /usr/local/lib, but MySQL doesn’t know that exists. Why? Well my plugin_dir configuration for MySQL is blank, which means it falls back to the system’s dynamic link resolver. So I go look that one up:

…which gives me “/usr/lib64/mysql” This is where I need to copy those modules. I’ll be honest here, I don’t know if I need just the .so or all 3 files that the build created, so i’ll copy all 3 just to be safe and give them execute permissions:

Now let’s register them with ldconfig and restart mysql (again, tbh, not sure of the mysql restart is required, actually, I don’t think it is but better safe than sorry).

Now install the user-defined functions:

Finally, test it all to make sure the installed UDFs are working:

All test came back ok. Should be done now. That is all. kthxbye.