Java Regex - How to replace a pattern or how to
Solution 1
Try these:
PATTERN = "(<img[^>]*\\ssrc=\")\\./"
REPLACEMENT = "$1"
Basically, you capture everything except the ./ in group #1, then plug it back in using the $1 placeholder, effectively stripping off the ./ prefix.
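As a rough sketch of how the pattern and replacement above could be wired into replaceAll(): the class name SrcPrefixDemo and the helper stripDotSlash are made up for illustration, the CASE_INSENSITIVE flag is carried over from the question's code, and the sample tag is a shortened version of the one in the question.

```java
import java.util.regex.Pattern;

public class SrcPrefixDemo {
    // Pattern and replacement as given in the answer above.
    // Group 1 captures everything from "<img" up to and including src=" .
    private static final Pattern IMG_SRC =
            Pattern.compile("(<img[^>]*\\ssrc=\")\\./", Pattern.CASE_INSENSITIVE);
    private static final String REPLACEMENT = "$1";

    // Hypothetical helper: re-emits group 1 and drops the "./" that followed it.
    static String stripDotSlash(String html) {
        return IMG_SRC.matcher(html).replaceAll(REPLACEMENT);
    }

    public static void main(String[] args) {
        String html = "<img alt=\"\" src=\"./Suitbert_files/233px-Suitbertus.jpg\" class=\"thumbimage\" />";
        // prints: <img alt="" src="Suitbert_files/233px-Suitbertus.jpg" class="thumbimage" />
        System.out.println(stripDotSlash(html));
    }
}
```

Because the pattern requires the literal <img at the start, a src attribute on some other tag is left untouched.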
Notice how I changed your .* to [^>]*, too. If there happened to be two IMG tags on the same line, like this:
<img src="good" /><img src="./bad" />
...your regex would match this:
<img src="good" /><img src="./
It would do that even if you used a non-greedy .*? instead. Using [^>]* makes sure the match is always contained within the one tag.
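To see the difference concretely, here is a small sketch that runs the three variants against the two-tag line from above. The class GreedyVsNegatedClass and the helper firstMatch are hypothetical names used only for this comparison.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsNegatedClass {
    // Hypothetical helper: returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String line = "<img src=\"good\" /><img src=\"./bad\" />";

        // Greedy .* : the match starts at the first tag and runs into the second one.
        // prints: <img src="good" /><img src="./
        System.out.println(firstMatch("<img.*\\ssrc=\"\\./", line));

        // Non-greedy .*? : still starts at the first <img, so the match is the same.
        // prints: <img src="good" /><img src="./
        System.out.println(firstMatch("<img.*?\\ssrc=\"\\./", line));

        // [^>]* cannot cross the closing > of the first tag, so only the second tag matches.
        // prints: <img src="./
        System.out.println(firstMatch("<img[^>]*\\ssrc=\"\\./", line));
    }
}
```

The non-greedy version does not help because the regex engine still begins at the leftmost possible starting point and simply extends the match until the rest of the pattern fits.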
Solution 2
Don't use regex for HTML. Use a parser, obtain the src attribute and replace it.
Solution 3
Your replacement is incorrect. The matched string is replaced by the replacement string, and the replacement is not interpreted as a regexp. To achieve what you want, you need to use groups. A group is delimited by parentheses in the regexp; each opening parenthesis starts a new group.
You can use $i in the replacement string to reproduce what a group has matched, where i is the group number. See the documentation of appendReplacement for the details.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Here is an example (it looks a bit like your case but not exactly)
String input = "<img name=\"foobar\" src=\"img.png\">";
String regexp = "<img(.+)src=\"[^\"]+\"(.*)>";
Matcher m = Pattern.compile(regexp).matcher(input);
StringBuffer sb = new StringBuffer();
while (m.find()) {
    // Found a match!
    // Append all chars before the match, then replace the match with the
    // replacement. The replacement refers to groups 1 and 2 with $1 and $2,
    // which match respectively everything between '<img' and 'src', and
    // everything between the src value and the closing >
    m.appendReplacement(sb, "<img$1src=\"something else\"$2>");
}
m.appendTail(sb); // No more matches, so we append the rest of the input
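For comparison, the following sketch shows what the replacement string from the question does when passed to replaceAll(). In a Java replacement string, a backslash only makes the next character literal, so \. becomes a plain . and \s becomes a plain s, which would account for the <img.*ssrc=" seen in the broken output. The class and method names are hypothetical, and the $base text in the second call is an invented literal used to show Matcher.quoteReplacement().

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReplacementStringDemo {
    // PATTERN from the question, unchanged.
    static final Pattern ORIGINAL =
            Pattern.compile("<img\\.*\\ssrc=\"\\./", Pattern.CASE_INSENSITIVE);

    // Uses the question's REPLACEMENT as-is: the backslashes are consumed as
    // escapes and the remaining characters are inserted literally.
    static String withOriginalReplacement(String html) {
        return ORIGINAL.matcher(html).replaceAll("<img\\.*\\ssrc=\"");
    }

    // Inserts arbitrary text literally; quoteReplacement escapes any $ or \ in it
    // so they are not read as group references or escapes.
    static String withLiteralReplacement(String html, String literal) {
        return ORIGINAL.matcher(html).replaceAll(Matcher.quoteReplacement(literal));
    }

    public static void main(String[] args) {
        String html = "<img src=\"./a.jpg\">";
        // prints: <img.*ssrc="a.jpg">
        System.out.println(withOriginalReplacement(html));
        // prints: <img src="$base/a.jpg">
        System.out.println(withLiteralReplacement(html, "<img src=\"$base/"));
    }
}
```

Neither call can carry the original attributes over into the result; that is what the $1 and $2 group references in the example above are for.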
Hope this helps you
mrd
Java and Android Development. PHP, Java, Android, iOS, Laravel, MySQL
Updated on August 02, 2022
Comments
-
mrd almost 2 years
I have a bunch of HTML files. In these files I need to correct the src attribute of the IMG tags. The IMG tags typically look like this:
<img alt="" src="./Suitbert_files/233px-Suitbertus.jpg" class="thumbimage" height="243" width="233" />
where the attributes are NOT in any specific order. I need to remove the dot and the forward slash at the beginning of the src attribute of the IMG tags so they look like this:
<img alt="" src="Suitbert%20%E2%80%93%20Wikipedia_files/233px-Suitbertus.jpg" class="thumbimage" height="243" width="233" />
I have the following class so far:
import java.util.regex.*;

public class Replacer {

    // this PATTERN should find all img tags with 0 or more attributes before the src-attribute
    private static final String PATTERN = "<img\\.*\\ssrc=\"\\./";
    private static final String REPLACEMENT = "<img\\.*\\ssrc=\"";
    private static final Pattern COMPILED_PATTERN = Pattern.compile(PATTERN, Pattern.CASE_INSENSITIVE);

    public static void findMatches(String html) {
        Matcher matcher = COMPILED_PATTERN.matcher(html);
        // Check all occurrences
        System.out.println("------------------------");
        System.out.println("Following Matches found:");
        while (matcher.find()) {
            System.out.print("Start index: " + matcher.start());
            System.out.print(" End index: " + matcher.end() + " ");
            System.out.println(matcher.group());
        }
        System.out.println("------------------------");
    }

    public static String replaceMatches(String html) {
        //Pattern replace = Pattern.compile("\\s+");
        Matcher matcher = COMPILED_PATTERN.matcher(html);
        html = matcher.replaceAll(REPLACEMENT);
        return html;
    }
}
So, my method findMatches(String html) seems to correctly find all IMG tags where the src attribute starts with ./
Now my method replaceMatches(String html) does not correctly replace the matches. I am a newbie to regex, but I assume that either the REPLACEMENT regex is incorrect, or the usage of the replaceAll method, or both. As you can see, the replacement String contains 2 parts which are identical in all IMG tags, namely <img and src="./ and in between these 2 parts there should be the 0 or more HTML attributes from the original string. How do I formulate such a REPLACEMENT string? Can somebody please enlighten me?
-
mrd over 12 years
@ggreiner: yes I do, from a different class, like Replacer.replaceMatches(html)
-
mrd over 12 years
I should add: when I examine the html output files, the replaced tags look like this: <img.*ssrc="suitbert –="" wikipedia_files="" 233px-suitbertus.jpg="" name="Grafik1" align="BOTTOM" width="236" height="246" border="0"></img.*ssrc="suitbert>
-
mrd over 12 years
As you can see, it is completely messed up, so replacement takes place, but incorrectly
-
Dave Newton over 12 years
IMO if you're simply searching for a pretty specific thing, and it's pretty controlled like this, a regex is fine. In this case it'd be the first thing I'd try. That said, I already have directory-based XML-like search/replace tools, so if it didn't succeed essentially immediately, I'd use those.
-
mrd over 12 years
I already had that idea, but there is no guarantee that the src attribute only occurs in IMG tags. In particular, the src attribute is valid for quite a lot of HTML tags, so that's a pretty unpredictable approach.
-
mrd over 12 years
I already suspected that and looked at appendReplacement. But I am confused about how to do that. Any link to an example or tutorial would be helpful
-
Alan Moore over 12 years
There's no need to resort to appendReplacement() and appendTail() here (though it's certainly good to know about them). replaceAll() is perfectly capable of handling this job, as I demonstrated in my answer.
-
Guillaume Polet over 12 years
Yes, I was only providing an example to the previous comment.
-
mrd over 12 years
Great, thx, this does the trick. And finally I understand how this thingie with the $-sign and its use in the REPLACEMENT string works.
-
mrd over 12 years
@GuillaumePolet: Thx, yours and Alan's post above did enlighten me and solved the problem. Very interesting and exactly what I was looking for
-
mrd over 12 years
So this is the final solution, kudos to Alan Moore and Guillaume Polet:
private static final String PATTERN = "(<img[^>]*\\ssrc=\")\\./";
private static final String REPLACEMENT = "$1";