Java Regex - How to replace a pattern or how to

Solution 1

Try these:

PATTERN = "(<img[^>]*\\ssrc=\")\\./"
REPLACEMENT = "$1"

Basically, you capture everything except the ./ in group #1, then plug it back in using the $1 placeholder, effectively stripping off the ./.
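
As a minimal sketch, the pattern and replacement above can be dropped into a small helper like this. The class name ImgSrcFixer, the method name stripDotSlash, and the sample HTML are assumptions made for illustration; only PATTERN and REPLACEMENT come from the answer.

```java
import java.util.regex.Pattern;

class ImgSrcFixer {
    // Group 1 captures everything from "<img" up to and including src=" ;
    // the trailing \\./ matches the unwanted "./" outside the group.
    private static final Pattern PATTERN =
            Pattern.compile("(<img[^>]*\\ssrc=\")\\./", Pattern.CASE_INSENSITIVE);

    // $1 re-inserts group 1, so only the "./" is dropped.
    private static final String REPLACEMENT = "$1";

    static String stripDotSlash(String html) {
        return PATTERN.matcher(html).replaceAll(REPLACEMENT);
    }

    public static void main(String[] args) {
        String html = "<img alt=\"\" src=\"./Suitbert_files/233px-Suitbertus.jpg\" width=\"233\" />";
        // prints: <img alt="" src="Suitbert_files/233px-Suitbertus.jpg" width="233" />
        System.out.println(stripDotSlash(html));
    }
}
```

Tags whose src does not start with ./ are left unchanged, because the pattern does not match them at all.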

Notice how I changed your .* to [^>]*, too. If there happened to be two IMG tags on the same line, like this:

<img src="good" /><img src="./bad" />

...your regex would match this:

<img src="good" /><img src="./

It would do that even if you used a non-greedy .*?. [^>]* makes sure the match is always contained within the one tag.
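
The difference can be checked with a short sketch that runs the greedy, non-greedy, and [^>]* variants against the two-tag line from above. The class name GreedyVsNegatedClass and the helper firstMatch are hypothetical names used only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsNegatedClass {
    // Returns the first match of regex in input, or null if nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String line = "<img src=\"good\" /><img src=\"./bad\" />";

        // Greedy .* : the match starts at the first tag and runs into the second.
        // prints: <img src="good" /><img src="./
        System.out.println(firstMatch("<img.*\\ssrc=\"\\./", line));

        // Non-greedy .*? : still starts at the first <img, so the result is the same.
        // prints: <img src="good" /><img src="./
        System.out.println(firstMatch("<img.*?\\ssrc=\"\\./", line));

        // [^>]* cannot cross the first tag's closing >, so only the second tag matches.
        // prints: <img src="./
        System.out.println(firstMatch("<img[^>]*\\ssrc=\"\\./", line));
    }
}
```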

Solution 2

Don't use regex for HTML. Use a parser, obtain the src attribute and replace it.

Solution 3

Your replacement is incorrect. replaceAll() substitutes the replacement string for each match, and that string is not interpreted as a regexp. To achieve what you want, you need to use groups. A group is delimited by parentheses in the regexp, and each opening parenthesis starts a new group. In the replacement string, $i reproduces whatever group i matched, where i is the group number. See the documentation of appendReplacement for the details.

// Here is an example (it looks a bit like your case but not exactly)
String input = "<img name=\"foobar\" src=\"img.png\">";
String regexp = "<img(.+)src=\"[^\"]+\"(.*)>";
Matcher m = Pattern.compile(regexp).matcher(input);
StringBuffer sb = new StringBuffer();
while(m.find()) {
    // Found a match!
    // Append all chars before the match, then replace the match with the
    // replacement. The replacement refers to groups 1 and 2 via $1 and $2,
    // which match, respectively, everything between '<img' and 'src', and
    // everything between the src value and the closing '>'.
    m.appendReplacement(sb, "<img$1src=\"something else\"$2>");
}
m.appendTail(sb);// No more match, we append the end of input
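
For comparison, here is a small sketch of what the PATTERN and REPLACEMENT from the question do when passed to replaceAll() unchanged. The class name LiteralReplacementDemo, the method name replaceAsInQuestion, and the sample input are assumptions for illustration; the two string constants are copied from the question.

```java
import java.util.regex.Pattern;

class LiteralReplacementDemo {
    // The pattern and replacement from the question, unchanged.
    static final String PATTERN = "<img\\.*\\ssrc=\"\\./";
    static final String REPLACEMENT = "<img\\.*\\ssrc=\"";

    static String replaceAsInQuestion(String html) {
        return Pattern.compile(PATTERN, Pattern.CASE_INSENSITIVE)
                      .matcher(html)
                      .replaceAll(REPLACEMENT);
    }

    public static void main(String[] args) {
        // In a replacement string a backslash only escapes the next character,
        // so "\." becomes "." and "\s" becomes "s"; nothing is treated as a regexp.
        // prints: <img.*ssrc="pic.jpg" />
        System.out.println(replaceAsInQuestion("<img src=\"./pic.jpg\" />"));
    }
}
```

This appears to be where the <img.*ssrc= text in the broken output files comes from. Note also that \\.* in the pattern matches literal dots only, so this pattern matches only when src is the first attribute after <img.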

Hope this helps you

Author by

mrd

Java and Android Development. PHP, Java, Android, iOS, Laravel, MySQL

Updated on August 02, 2022

Comments

  • mrd
    mrd almost 2 years

    I have a bunch of HTML files. In these files I need to correct the src attribute of the IMG tags. The IMG tags look typically like this:

    <img alt="" src="./Suitbert_files/233px-Suitbertus.jpg" class="thumbimage" height="243" width="233" />
    

    where the attributes are NOT in any specific order. I need to remove the dot and the forward slash at the beginning of the src attribute of the IMG tags so they look like this:

    <img alt="" src="Suitbert%20%E2%80%93%20Wikipedia_files/233px-Suitbertus.jpg" class="thumbimage" height="243" width="233" />
    

    I have the following class so far:

    import java.util.regex.*;
    
    
    public class Replacer {
    
        // this PATTERN should find all img tags with 0 or more attributes before the src-attribute
        private static final String PATTERN = "<img\\.*\\ssrc=\"\\./";
        private static final String REPLACEMENT = "<img\\.*\\ssrc=\"";
        private static final Pattern COMPILED_PATTERN = Pattern.compile(PATTERN,  Pattern.CASE_INSENSITIVE);
    
    
        public static void findMatches(String html){
            Matcher matcher = COMPILED_PATTERN.matcher(html);
            // Check all occurrences
            System.out.println("------------------------");
            System.out.println("Following Matches found:");
            while (matcher.find()) {
                System.out.print("Start index: " + matcher.start());
                System.out.print(" End index: " + matcher.end() + " ");
                System.out.println(matcher.group());
            }
            System.out.println("------------------------");
        }
    
        public static String replaceMatches(String html){
            //Pattern replace = Pattern.compile("\\s+");
            Matcher matcher = COMPILED_PATTERN.matcher(html);
            html = matcher.replaceAll(REPLACEMENT);
            return html;
        }
    }
    

    So, my method findMatches(String html) seems to correctly find all IMG tags where the src attribute starts with ./.

    Now my method replaceMatches(String html) does not replace the matches correctly. I am a newbie to regex, but I assume that either the REPLACEMENT regex is incorrect, or the usage of the replaceAll method, or both. As you can see, the replacement String contains 2 parts which are identical in all IMG tags: <img and src="./. In between these 2 parts, there should be the 0 or more HTML attributes from the original string. How do I formulate such a REPLACEMENT string? Can somebody please enlighten me?

  • mrd
    mrd over 12 years
    @ggreiner: yes I do, from a different class, like Replacer.replaceMatches(html)
  • mrd
    mrd over 12 years
    I should add: when I examine the html output files, the replaced tags look like this: <img.*ssrc="suitbert &ndash;="" wikipedia_files="" 233px-suitbertus.jpg="" name="Grafik1" align="BOTTOM" width="236" height="246" border="0"></img.*ssrc="suitbert>
  • mrd
    mrd over 12 years
    As you can see, it is completely messed up, so the replacement takes place, but incorrectly
  • Dave Newton
    Dave Newton over 12 years
    IMO if you're simply searching for a pretty specific thing, and it's pretty controlled like this, a regex is fine. In this case it'd be the first thing I'd try. That said, I already have directory-based XML-like search/replace tools, so if it didn't succeed essentially immediately, I'd use those.
  • mrd
    mrd over 12 years
    I already had that idea, but there is no guarantee that the src attribute only occurs in IMG tags. In particular, the src attribute is valid for quite a lot of HTML tags, so that's a pretty unpredictable approach.
  • mrd
    mrd over 12 years
    I already suspected that and looked at appendReplacement. But I am confused about how to do that. Any link to an example or tutorial would be helpful
  • Alan Moore
    Alan Moore over 12 years
    There's no need to resort to appendReplacement() and appendTail() here (though it's certainly good to know about them). replaceAll() is perfectly capable of handling this job, as I demonstrated in my answer.
  • Guillaume Polet
    Guillaume Polet over 12 years
    Yes, I was only providing an example to the previous comment.
  • mrd
    mrd over 12 years
    Great, thx, this does the trick. And finally I understand how this thingie with $-sign and its use in the REPLACEMENT string works.
  • mrd
    mrd over 12 years
    @GuillaumePolet: Thx, your post and Alan's above enlightened me and solved the problem. Very interesting and exactly what I was looking for
  • mrd
    mrd over 12 years
    So this is the final solution, kudos to Alan Moore and Guillaume Polet: `private static final String PATTERN = "(<img[^>]*\\ssrc=\")\\./"; private static final String REPLACEMENT = "$1";`