Oracle - need to extract text between given strings

16,220

Solution 1

I'm not that familiar with the POSIX [[:print:]] character class but I got your query functioning using the wildcard .. You need to specify the n match parameter in REGEXP_REPLACE() so that . can match the newline character:

WITH phrases AS (
  SELECT 'stackoverflow is awesome. Begin beginHello,
 World!End end It has everything!' AS phrase
    FROM dual
)
SELECT REGEXP_REPLACE(phrase, '.+Begin begin(.+)End end.+', '\1', 1, 1, 'n')
  FROM phrases;

I used the \1 backreference as I didn't see the need to capture the other groups from the regular expression. It might also be a good idea to use the * quantifier (instead of +) in case there is nothing preceding or following the delimiters. If you want to capture all of the groups then you can use the following:

WITH phrases AS (
  SELECT 'stackoverflow is awesome. Begin beginHello,
 World!End end It has everything!' AS phrase
    FROM dual
)
SELECT REGEXP_REPLACE(phrase, '(.+Begin begin)(.+)(End end.+)', '\2', 1, 1, 'n')
  FROM phrases;

UPDATE - FYI, I tested with [[:print:]] and it doesn't work. This is not surprising since [[:print:]] is supposed to match printable characters. It doesn't match anything with an ASCII value below 32 (a space). You need to use ..

UPDATE #2 - per update to question - I don't think a regex will work the way you want it to. Adding the lazy quantifier to (.+) has no effect and Oracle regular expressions don't have lookahead. There are a couple of things you might do, one is to use INSTR() and SUBSTR():

WITH phrases AS (
  SELECT 'stackoverflow is awesome. Begin beginHello,
 World!End end It has everything!End endTESTTESTTEST' AS phrase
    FROM dual
)
SELECT SUBSTR(phrase, str_start, str_end - str_start) FROM (
    SELECT INSTR(phrase, 'Begin begin') + LENGTH('Begin begin') AS str_start
         , INSTR(phrase, 'End end') AS str_end, phrase
      FROM phrases
);

Another is to combine INSTR() and SUBSTR() with a regular expression:

WITH phrases AS (
  SELECT 'stackoverflow is awesome. Begin beginHello,
 World!End end It has everything!End endTESTTESTTEST' AS phrase
    FROM dual
)
SELECT REGEXP_REPLACE(SUBSTR(phrase, 1, INSTR(phrase, 'End end') + LENGTH('End end')), '.+Begin begin(.+)End end.+', '\1', 1, 1, 'n')
  FROM phrases;

Solution 2

Try this regex:

([[:print:]]+Begin begin)(.+?)(End end[[:print:]]+)

Sample usage:

SELECT regexp_replace(
         phrase ,
         '([[:print:]]+Begin begin)(.+?)(End end[[:print:]]+)',
         '\2',
         1,  -- Start at the beginning of the phrase
         0,  -- Replace ALL occurences
         'n' -- Let dot meta character matches new line character
)
FROM
  (SELECT 'stackoverflow is awesome. Begin beginHello, '
    || chr(10)
    || ' World!End end It has everything!' AS phrase
  FROM DUAL
  )

The dot meta character (.) matches any character in the database character set and the new line character. However, when regexp_replace is called, the match_parameter must contain n switch for dot matches new lines.

Share:
16,220
user1209216
Author by

user1209216

Updated on June 04, 2022

Comments

  • user1209216
    user1209216 almost 2 years

    Example - need to extract everything between "Begin begin" and "End end". I tried this way:

    with phrases as (
      select 'stackoverflow is awesome. Begin beginHello, World!End end It has everything!' as phrase
        from dual
             )
    select regexp_replace(phrase
         , '([[:print:]]+Begin begin)([[:print:]]+)(End end[[:print:]]+)', '\2')
      from phrases
           ;
    

    Result: Hello, World!

    However it fails if my text contains new line characters. Any tip how to fix this to allow extracting text containing also new lines?

    [edit]How does it fail:

    with phrases as (
      select 'stackoverflow is awesome. Begin beginHello, 
      World!End end It has everything!' as phrase
        from dual
             )
    select regexp_replace(phrase
         , '([[:print:]]+Begin begin)([[:print:]]+)(End end[[:print:]]+)', '\2')
      from phrases
           ;
    

    Result:

    stackoverflow is awesome. Begin beginHello, World!End end It has everything!

    Should be:

    Hello,
    World!

    [edit]

    Another issue. Let's see to this sample:

    WITH phrases AS (
      SELECT 'stackoverflow is awesome. Begin beginHello,
     World!End end It has everything!End endTESTESTESTES' AS phrase
        FROM dual
    )
    SELECT REGEXP_REPLACE(phrase, '.+Begin begin(.+)End end.+', '\1', 1, 1, 'n')
      FROM phrases;
    

    Result:

    Hello,
    World!End end It has everything!

    So it matches last occurence of end string and this is not what I want. Subsgtring should be extreacted to first occurence of my label, so result should be:

    Hello,
    World!

    Everything after first occurence of label string should be ignored. Any ideas?