Seeking regex in Notepad++ to search and replace CRLF between two quotation marks ["] only

11,551

Solution 1

This will work using one regex only (tested in Notepad++):

Enter this regex in the Find what field:

((?:^|\r\n)[^"]*+"[^\r\n"]*+)\r\n([^"]*+")

Enter this string in the Replace with field:

$1 $2

Make sure the Wrap around check box (and Regular expression radio button) are selected.

Do a Replace All as many times as required (until the "0 occurrences were replaced" dialog pops up).

Explanation:

(
  (?:^|\r\n)     Begin at start of file or before the CRLF before the start of a record
  [^"]*+         Consume all chars up to the opening "
  "              Consume the opening "
  [^\r\n"]*+     Consume all chars up to either the first CRLF or the closing "
)                Save as capturing group 1 (= everything in record before the target CRLF)
\r\n             Consume the target CRLF without capturing it
(
  [^"]*+         Consume all chars up to the closing "
  "              Consume the closing "
)                Save as capturing group 2 (= the rest of the string after the target CRLF)

Note: The *+ is a possessive quantifier. Use them appropriately to speed up execution.

Update:

This more general version of the regex will work with any line break sequence (\r\n, \r or \n):

((?:^|[\r\n]+)[^"]*+"[^\r\n"]*+)[\r\n]+([^"]*+")

Solution 2

In this case the source data is generated by the export function in GMail for your contacts. After the modification outlined below (without RegEx) the result can be used to tidy up your contacts database and re-import it to GMail or to MS Outlook. Yes, I am standing on the shoulders of @alan and @robinCTS. Thank you both.

Instructions in 5 steps:

use Notepad++ / find replace / extended search mode / wrap around = on

-1- replace all [CRLF] with a unique set characters or a string (I used [~~])

find: \r\n and replace with: ~~ The file contents are now on one line only.

-2- Now we need to separate the header line. For this move to where the first record starts exactly before the 88th. comma (including the word after the 87th. comma [,]) and enter the [CRLF] manually by hitting the return key. There are two lines now: header and records.

-3- now find all [,~~] and replace with [,\r\n] The result is one record per line.

-4- remove the remaining [~~] find: ~~ and replace with: [ ] a space. The file is now clean of unwanted [CRLF]s.

-5- Save the file and use it as intended.

Share:
11,551
snahl
Author by

snahl

Updated on June 14, 2022

Comments

  • snahl
    snahl about 2 years

    I've got a CSV file with some 600 records where I need to replace some [CRLF] with a [space] but only when the [CRLF] is positioned between two ["] (quotation marks). When the second ["] is encountered then it should skip the rest of the line and go to the next line in the text.

    I don't really have a starting point. Hope someone comes up with a suggestion.

    Example:

    John und Carol,,Smith,,,J.S.,,,,,,,,,,,,,+11 22 333 4444,,,,,"streetx 21[CRLF]
    New York City[CRLF]
    USA",streetx 21,,,,New York City,,,USA,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Normal,,My Contacts,[CRLF]
    

    In this case the two [CRLF] after the first ["] need to be replaced with a space [ ]. When the second ["] is encountered, skip the end of the line and go to next line.

    Then again, now on the next line, after the first ["] is encountered replace all [CRLF] until the second ["] is encountered. The [CRLF]s vary in numbers. In the CSV-file the amount of commas [,] before (23) and after (65) the 2 quotation marks ["] is constant.

    So maybe a comma counter could be used. I don't know.

    Thanks for feedback.

  • snahl
    snahl over 11 years
    Thanks. In step 2 NP++ responds saying: can't find the text: "((?:[^,]*?,){88})~" I assume there is a syntax problem. Regex is On, wrap around is Off.
  • snahl
    snahl over 11 years
    Thanks. Same here: NP++ responds saying: can't find the text: "((?:^|\r\n)[^"]*+"[^\r\n"]*+)\r\n([^"]*+")". Now I assume it isn't a syntax problem anymore. Rather a NP++ setup issue. I am lost; NP++ is updated.
  • robinCTS
    robinCTS over 11 years
    @snahl - You need Wrap around ON :-)
  • robinCTS
    robinCTS over 11 years
    @snahl - You need Wrap around ON :-)
  • snahl
    snahl over 11 years
    Wrap around is now turned ON (as per instructions). Result: "0 occurrences were found" (after first run). Nothing has changed in the CSV-file.
  • robinCTS
    robinCTS over 11 years
    @snahl - My bad I think! Try ((?:^|[\r\n]+)[^"]*+"[^\r\n"]*+)[\r\n]+([^"]*+"). I hardwired the CRLF for Windows rather than a generic version to cater for *nix/OS X style as well. Make sure Wrap around is ticked, Regular expression is selected and you press Replace All. If this still fails, try copying and pasting your posted example into a newly created file in NotePad++, and then running the replace on it. Let me know what happens. I'm running v6.2.3 What version do you have?
  • snahl
    snahl over 11 years
    sorry still NoGo, same result: Result: "0 occurrences were found" Meanwhile I got it done similar as @dda suggested. I will post the my solution below. Nevertheless I am still curious how this could be done in one step.
  • robinCTS
    robinCTS over 11 years
    Seems like this "header" might be the culprit. I forgot to mention that on the example you posted, I got both my and alan's version working in NP++. My regex also worked in the EditPad Pro program. If you could add the header and the first couple of records to your original question (with any sensitive material obsfucated, of course), I'm sure we could get a 1 step solution going.
  • robinCTS
    robinCTS over 11 years
    @snahl - Are you using NotePad++ v6.2.3? Because this version gives the result "0 occurrences were replaced." when a replace fails.
  • robinCTS
    robinCTS over 11 years
    Just tried your solution, skipping step 2, of course. Worked perfectly. No surprise there! Are you saying if you skip step 2 it doesn't work?
  • snahl
    snahl over 11 years
    @robinCTS No I don't say that. I just separated the header from the records because the header has 87 commas while the records have 88 commas each. But as you say this is not really necessary. Anyway, how is your regex doing? I would really like to use that one-liner. Are you still working on it? Do you still need sample data or can you try with your own from GMail contacts export?
  • robinCTS
    robinCTS over 11 years
    I'm still working on it. I've just upgrade my Google account to GMail. I'll have a go with that first.
  • robinCTS
    robinCTS over 11 years
    OK. Created 2 dummy GMail contacts; exported the contacts; did Replace All 3 times. Worked perfectly! I have a plan, though. [Step 1] Download the CSV file from here. Try Replace All with my Regex on that. Let me know if it works. Should we continue in chat?
  • alan
    alan over 11 years
    snahl. Glad you have a solution working. If @robinCTS answer helped you, you may want to upvote it. Cheers.
  • snahl
    snahl over 11 years
    @alan both of you helped me
  • snahl
    snahl over 11 years
    @robinCTS: I downloaded the file and applied you regex as per instructions. Result: "0 occurrences were found". I am puzzled(?). You say: 3 times, but there are only 2 records. Did you try with a file downloaded from within your GMail? Chat would be an option but I am probably in a different timezone (GMT+1). Suggest we go on as is.
  • snahl
    snahl over 11 years
    @robinCTS: Also with this regex: ((?:[^,]*?,){88})~ I get the result: "0 occurrences were found". Maybe the cause IS some issue with my notepad++ settings. Though other regex strings work fine. Any idea?
  • robinCTS
    robinCTS over 11 years
    GMT+8 here. Not planning on going to bed anytime soon :) Moving to this room.
  • alan
    alan over 11 years
    @snahl I don't have NP++ so can't test it there, but I did test it on your test data and it works in EditPad Pro. The only thing I can guess is make sure that you have run Step 1 first so that there are some ~ characters. If all the EOL are replace by ~ and 88 is the correct number, then I don't know why it wouldn't work in NP++
  • robinCTS
    robinCTS over 11 years
    I'm pretty sure it's a problem with snahl's NP++. I tested your version and mine and they both work in my NP++ and my EditPad Pro. I even sent @snahl a copy of a test file, and that didn't work. I've moved the discussion to chat to help him sort the problem out.
  • alan
    alan over 11 years
    @robinCTS ok. cool. glad it worked for someone besides me ;-) Cheers.
  • snahl
    snahl over 11 years
    I thought I was sure after updating NP++. I double checked it was v5.9.x. So I initiated another update, now on 6.2.2 et volia your regex works in 2 runs.
  • snahl
    snahl over 11 years
    Before I downloaded a trial of EditPad Pro and your regex worked flawlessly in 1 run. Thank you very much. With this we can easily edit the GMail contact export google.csv and contacts.csv files. very USEFUL