Oracle - need to extract text between given strings
Solution 1
I'm not that familiar with the POSIX [[:print:]]
character class but I got your query functioning using the wildcard .
. You need to specify the n
match parameter in REGEXP_REPLACE()
so that .
can match the newline character:
WITH phrases AS (
SELECT 'stackoverflow is awesome. Begin beginHello,
World!End end It has everything!' AS phrase
FROM dual
)
SELECT REGEXP_REPLACE(phrase, '.+Begin begin(.+)End end.+', '\1', 1, 1, 'n')
FROM phrases;
I used the \1
backreference as I didn't see the need to capture the other groups from the regular expression. It might also be a good idea to use the *
quantifier (instead of +
) in case there is nothing preceding or following the delimiters. If you want to capture all of the groups then you can use the following:
WITH phrases AS (
SELECT 'stackoverflow is awesome. Begin beginHello,
World!End end It has everything!' AS phrase
FROM dual
)
SELECT REGEXP_REPLACE(phrase, '(.+Begin begin)(.+)(End end.+)', '\2', 1, 1, 'n')
FROM phrases;
UPDATE - FYI, I tested with [[:print:]]
and it doesn't work. This is not surprising since [[:print:]]
is supposed to match printable characters. It doesn't match anything with an ASCII value below 32 (a space). You need to use .
.
UPDATE #2 - per update to question - I don't think a regex will work the way you want it to. Adding the lazy quantifier to (.+)
has no effect and Oracle regular expressions don't have lookahead. There are a couple of things you might do, one is to use INSTR()
and SUBSTR()
:
WITH phrases AS (
SELECT 'stackoverflow is awesome. Begin beginHello,
World!End end It has everything!End endTESTTESTTEST' AS phrase
FROM dual
)
SELECT SUBSTR(phrase, str_start, str_end - str_start) FROM (
SELECT INSTR(phrase, 'Begin begin') + LENGTH('Begin begin') AS str_start
, INSTR(phrase, 'End end') AS str_end, phrase
FROM phrases
);
Another is to combine INSTR()
and SUBSTR()
with a regular expression:
WITH phrases AS (
SELECT 'stackoverflow is awesome. Begin beginHello,
World!End end It has everything!End endTESTTESTTEST' AS phrase
FROM dual
)
SELECT REGEXP_REPLACE(SUBSTR(phrase, 1, INSTR(phrase, 'End end') + LENGTH('End end')), '.+Begin begin(.+)End end.+', '\1', 1, 1, 'n')
FROM phrases;
Solution 2
Try this regex:
([[:print:]]+Begin begin)(.+?)(End end[[:print:]]+)
Sample usage:
SELECT regexp_replace(
phrase ,
'([[:print:]]+Begin begin)(.+?)(End end[[:print:]]+)',
'\2',
1, -- Start at the beginning of the phrase
0, -- Replace ALL occurences
'n' -- Let dot meta character matches new line character
)
FROM
(SELECT 'stackoverflow is awesome. Begin beginHello, '
|| chr(10)
|| ' World!End end It has everything!' AS phrase
FROM DUAL
)
The dot meta character (.
) matches any character in the database character set and the new line character. However, when regexp_replace is called, the match_parameter must contain n
switch for dot
matches new lines.
user1209216
Updated on June 04, 2022Comments
-
user1209216 almost 2 years
Example - need to extract everything between "Begin begin" and "End end". I tried this way:
with phrases as ( select 'stackoverflow is awesome. Begin beginHello, World!End end It has everything!' as phrase from dual ) select regexp_replace(phrase , '([[:print:]]+Begin begin)([[:print:]]+)(End end[[:print:]]+)', '\2') from phrases ;
Result: Hello, World!
However it fails if my text contains new line characters. Any tip how to fix this to allow extracting text containing also new lines?
[edit]How does it fail:
with phrases as ( select 'stackoverflow is awesome. Begin beginHello, World!End end It has everything!' as phrase from dual ) select regexp_replace(phrase , '([[:print:]]+Begin begin)([[:print:]]+)(End end[[:print:]]+)', '\2') from phrases ;
Result:
stackoverflow is awesome. Begin beginHello, World!End end It has everything!
Should be:
Hello,
World![edit]
Another issue. Let's see to this sample:
WITH phrases AS ( SELECT 'stackoverflow is awesome. Begin beginHello, World!End end It has everything!End endTESTESTESTES' AS phrase FROM dual ) SELECT REGEXP_REPLACE(phrase, '.+Begin begin(.+)End end.+', '\1', 1, 1, 'n') FROM phrases;
Result:
Hello,
World!End end It has everything!So it matches last occurence of end string and this is not what I want. Subsgtring should be extreacted to first occurence of my label, so result should be:
Hello,
World!Everything after first occurence of label string should be ignored. Any ideas?