How to join tables on regex

As @Milen already mentioned, regexp_matches() is probably the wrong function for your purpose. You want a simple regular expression match (~). Actually, the LIKE operator (~~) will be faster:

Presumably fastest with LIKE

SELECT msg.message
      ,msg.src_addr
      ,msg.dst_addr
      ,mnc.name
FROM   mnc
JOIN   msg ON msg.src_addr ~~ ('%38' || mnc.code || '%')
           OR msg.dst_addr ~~ ('%38' || mnc.code || '%')
WHERE  length(mnc.code) = 3

In addition, you only want mnc.code of exactly 3 characters (your original pattern '38(...)' captures exactly three), hence the WHERE clause.
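To try the LIKE query, a minimal sketch with made-up data may help. The column types (text) and all sample values below are assumptions, not taken from the question:

```sql
-- Hypothetical sample tables; types and values are assumptions for illustration.
CREATE TEMP TABLE mnc (code text, name text);
CREATE TEMP TABLE msg (message text, src_addr text, dst_addr text);

INSERT INTO mnc VALUES
  ('050', 'Operator A'),
  ('067', 'Operator B'),
  ('99',  'Too short');   -- filtered out by length(mnc.code) = 3

INSERT INTO msg VALUES
  ('hello', '380501234567', '380671234567'),
  ('world', '380991234567', '380501111111');

-- Same join as above, run against the sample data.
SELECT msg.message
      ,mnc.name
FROM   mnc
JOIN   msg ON msg.src_addr ~~ ('%38' || mnc.code || '%')
           OR msg.dst_addr ~~ ('%38' || mnc.code || '%')
WHERE  length(mnc.code) = 3;
```

With this data, the join should pair 'hello' with both operators and 'world' with Operator A only, in no guaranteed order.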


With regexp

You could write the same with regular expressions, but it will most definitely be slower. Here is a working example close to your original:

SELECT msg.message
      ,msg.src_addr
      ,msg.dst_addr
      ,mnc.name
FROM   mnc
JOIN   msg ON (msg.src_addr || '+' || msg.dst_addr) ~ ('38' || mnc.code)
           AND length(mnc.code) = 3

This also requires msg.src_addr and msg.dst_addr to be NOT NULL: concatenating a NULL with || yields NULL, so a row with either address missing can never match.
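The NULL behavior can be checked directly. If NULL addresses do occur, wrapping each side in COALESCE is one possible workaround; this is a sketch under that assumption, not part of the original query:

```sql
-- Concatenation with NULL yields NULL, so the regex test is NULL (not true).
SELECT ('380501234567' || '+' || NULL::text) ~ '38050';

-- Assumed workaround: substitute an empty string for a missing address.
SELECT msg.message
      ,msg.src_addr
      ,msg.dst_addr
      ,mnc.name
FROM   mnc
JOIN   msg ON (COALESCE(msg.src_addr, '') || '+' || COALESCE(msg.dst_addr, '')) ~ ('38' || mnc.code)
           AND length(mnc.code) = 3;
```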

The second query demonstrates how the additional check length(mnc.code) = 3 can go into the JOIN condition or a WHERE clause. Same effect here.


With regexp_matches()

You could make this work with regexp_matches():

SELECT msg.message
      ,msg.src_addr
      ,msg.dst_addr
      ,mnc.name
FROM   mnc
JOIN   msg ON EXISTS (
    SELECT * 
    FROM   regexp_matches(msg.src_addr ||'+'|| msg.dst_addr, '38(...)', 'g') x(y)
    WHERE  y[1] = mnc.code
    )

But it will be slow in comparison, or so I assume.

Explanation:
Without the 'g' switch, your regexp_matches() expression just returns an array of all captured substrings of the first match. As you only capture one substring (one pair of parentheses in your pattern), you will exclusively get arrays with one element.

You get all matches with the additional "globally" switch 'g', but in multiple rows. So you need a sub-select to test them all (or aggregate). Put that in an EXISTS semi-join and you arrive at what you wanted.
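The difference the 'g' switch makes can be seen on a literal string. This small sketch uses '38foo+38bar' as a stand-in for the concatenated addresses:

```sql
-- Without 'g': a single row holding the captures of the first match only.
SELECT regexp_matches('38foo+38bar', '38(...)');        -- {foo}

-- With 'g': one row per match.
SELECT regexp_matches('38foo+38bar', '38(...)', 'g');   -- {foo} and {bar}

-- EXISTS tests every match, so the second one is found as well.
SELECT EXISTS (
    SELECT 1
    FROM   regexp_matches('38foo+38bar', '38(...)', 'g') x(y)
    WHERE  y[1] = 'bar'
    );                                                   -- true
```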

Maybe you can report back with a performance test of all three? Use EXPLAIN ANALYZE for that.
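As a sketch, prefixing any of the three queries with EXPLAIN ANALYZE is enough; shown here for the LIKE variant. Note that EXPLAIN ANALYZE actually executes the query and reports real timings, so repeat each run a few times before comparing:

```sql
-- Executes the query and prints the plan with actual run times.
EXPLAIN ANALYZE
SELECT msg.message
      ,msg.src_addr
      ,msg.dst_addr
      ,mnc.name
FROM   mnc
JOIN   msg ON msg.src_addr ~~ ('%38' || mnc.code || '%')
           OR msg.dst_addr ~~ ('%38' || mnc.code || '%')
WHERE  length(mnc.code) = 3;
```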

Author: z4y4ts

Updated on July 18, 2022

Comments

  • z4y4ts, almost 2 years ago

    Say I have two tables: msg for messages and mnc for mobile network codes. They share no relations, but I want to join them:

    SELECT msg.message,
        msg.src_addr,
        msg.dst_addr,
        mnc.name
    FROM "msg"
    JOIN "mnc"
    ON array_to_string(regexp_matches(msg.src_addr || '+' || msg.dst_addr, '38(...)'), '') = mnc.code
    

    But the query fails with this error:

    psql:marketing.sql:28: ERROR:  argument of JOIN/ON must not return a set
    LINE 12: ON array_to_string(regexp_matches(msg.src_addr || '+' || msg...
    

    Is there a way to do such a join? Or am I going about this the wrong way?

  • Erwin Brandstetter, over 12 years ago
    Actually, without the 'g' switch, regexp_matches() returns exactly 1 row with an array of all captured substrings of the first match. However, the OP would need the 'g' switch to get that result for all matches.
  • Milen A. Radev, over 12 years ago
    It could return multiple rows and that's what's important to the parser hence the error message.
  • Erwin Brandstetter, over 12 years ago
    This is going to fail because substring() only returns the first match, but one of the additional matches could be mnc.code. Consider: SELECT substring('38foo+38bar', '38(...)') = 'bar'. That's probably the reason why the OP tried regexp_matches().
  • z4y4ts, over 12 years ago
    Hi @erwin, thank you for the solid answer. Here are some performance numbers: gist.github.com/1691021. Just as you said, the query with LIKE is the fastest one, followed by regexp and regexp_matches(). No surprises, but I think real numbers can be interesting.
  • Erwin Brandstetter, over 12 years ago
    @z4y4ts: Thanks for the feedback. Exactly as expected, but it's always good to verify. :)
  • rup, about 2 years ago
    Agreed, it is a weird way to join tables. However, this is useful if you are ever running a one-off query and want to find identifiers for each table via a name or other vague connection.