Java: I have a big string of html and need to extract the href="..." text

java html regex html-parsing

14,501

Solution 1

.*

This is a greedy operation that will take any character, including the quotes, so the match runs on to the last quote it can reach.

Try something like:

"href=\"([^\"]*)\""

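As a rough sketch of the difference, the snippet below runs the greedy pattern from the question and the negated-class pattern against the same input. The class name, the helper method, and the two-link sample string are illustrative assumptions, not part of the original question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsNegatedClass {
    // Returns capture group 1 of the first match, or null when nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical sample input with two links.
        String html = "<a href=\"a.html\">A</a> <a class=\"x\" href=\"b.html\">B</a>";

        // Greedy: the group runs on to the last quote in the input.
        System.out.println(firstGroup("href=\"(.*)\"", html));

        // Negated class: the group stops at the first closing quote.
        System.out.println(firstGroup("href=\"([^\"]*)\"", html));
    }
}
```

With this input, the first call should return everything from a.html up to b.html, while the second returns only a.html.
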
Solution 2

There are two problems with the code you've posted:

Firstly, the .* in your regular expression is greedy. This causes it to match all characters up to the last " character that can be found. You can make the match non-greedy by changing it to .*?.

Secondly, to pick up all the matches, you need to keep iterating with Matcher.find rather than looping over groups. Groups give you access to each parenthesized section of the regex within a single match. You, however, are looking for every place where the whole regular expression matches.

Putting these together gives you the following code, which should do what you need:

Pattern p = Pattern.compile("href=\"(.*?)\"", Pattern.DOTALL);
Matcher m = p.matcher(innerHTML);

while (m.find()) 
{
    System.out.println(m.group(1));
}
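
For reference, here is a self-contained variant of the same loop that collects the values into a list instead of printing them. It is only a sketch: the class name, the findHrefs helper, and the second link in the sample input are assumptions added for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefFinder {
    // Reluctant quantifier: each match stops at the first closing quote.
    private static final Pattern HREF = Pattern.compile("href=\"(.*?)\"", Pattern.DOTALL);

    // Collects capture group 1 from every match, in document order.
    static List<String> findHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {          // each find() advances to the next match
            hrefs.add(m.group(1));  // group(0) would be the whole href="..." text
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // First link taken from the question; the second is a made-up extra.
        String html = "<a class=\"Wrap\" href=\"item.php?id=43241\"><input type=\"button\"></a>\n"
                    + "<a href=\"other.php\">Other</a>";
        System.out.println(findHrefs(html));
    }
}
```

Compiling the pattern once in a static field avoids recompiling it on every call, which matters if the method is used on many strings.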

Solution 3

Use a built-in parser, such as the one that ships with Swing. Something like:

    // Needs javax.swing.text.* and javax.swing.text.html.*;
    // reader is any java.io.Reader over the HTML.
    EditorKit kit = new HTMLEditorKit();
    HTMLDocument doc = (HTMLDocument)kit.createDefaultDocument();
    doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
    kit.read(reader, doc, 0);

    // Walk every <a> element in the parsed document.
    HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A);

    while (it.isValid())
    {
        AttributeSet s = it.getAttributes();
        String href = (String)s.getAttribute(HTML.Attribute.HREF);
        System.out.println( href );
        it.next();
    }

Or use the ParserCallback:

import java.io.*;
import java.net.*;
import javax.swing.text.*;
import javax.swing.text.html.parser.*;
import javax.swing.text.html.*;

public class ParserCallbackText extends HTMLEditorKit.ParserCallback
{
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet a, int pos)
    {
        if (tag.equals(HTML.Tag.A))
        {
            String href = (String)a.getAttribute(HTML.Attribute.HREF);
            System.out.println(href);
        }
    }

    public static void main(String[] args)
        throws Exception
    {
        Reader reader = getReader(args[0]);
        ParserCallbackText parser = new ParserCallbackText();
        new ParserDelegator().parse(reader, parser, true);
    }

    static Reader getReader(String uri)
        throws IOException
    {
        // Retrieve from Internet (plain or TLS).
        if (uri.startsWith("http:") || uri.startsWith("https:"))
        {
            URLConnection conn = new URL(uri).openConnection();
            return new InputStreamReader(conn.getInputStream());
        }
        // Retrieve from file.
        else
        {
            return new FileReader(uri);
        }
    }
}

The Reader could be a StringReader.
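
Since the question starts from a string rather than a URL or file, a StringReader version might look like the sketch below. The class name and the collect helper are hypothetical, and the sample HTML is a shortened form of the string in the question. The parser calls are the same ParserDelegator and ParserCallback used above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class StringHrefCollector extends HTMLEditorKit.ParserCallback {
    private final List<String> hrefs = new ArrayList<>();

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet a, int pos) {
        if (tag == HTML.Tag.A) {
            Object href = a.getAttribute(HTML.Attribute.HREF);
            if (href != null) {         // skip anchors without an href
                hrefs.add(href.toString());
            }
        }
    }

    // Parses an in-memory HTML string and returns the href of every <a> start tag.
    static List<String> collect(String html) {
        StringHrefCollector collector = new StringHrefCollector();
        try {
            new ParserDelegator().parse(new StringReader(html), collector, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return collector.hrefs;
    }

    public static void main(String[] args) {
        String html = "<a class=\"Wrap\" href=\"item.php?id=43241\">"
                    + "<span class=\"chevron\"></span></a>";
        System.out.println(collect(html));
    }
}
```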

Solution 4

Regex is great, but it is not the right tool for this particular purpose. Normally you want to use a stack-based parser for this. Have a look at Java HTML parser APIs like jTidy.
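
To make the limitation concrete, here is a small sketch of inputs that a quote-based href regex handles badly. The class name, helper, and sample markup are made up for illustration, and the pattern is the double-quote one suggested in Solution 1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexLimits {
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");

    // Collects every double-quoted href value the regex can see.
    static List<String> findHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            hrefs.add(m.group(1));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html =
              "<a href='single.html'>one</a>\n"                  // single quotes: missed
            + "<!-- <a href=\"commented-out.html\">two</a> -->\n" // inside a comment: still matched
            + "<a href=\"real.html\">three</a>";                 // the only case handled as intended
        System.out.println(findHrefs(html));
    }
}
```

The regex has no notion of tags or comments, so it skips the single-quoted link and reports the commented-out one. A parser that tracks document structure avoids both problems.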

Solution 5

Another easy and reliable way to do it is to use Jsoup:

Document doc = Jsoup.connect("http://example.com/").get();
Elements links = doc.select("a[href]");
for (Element link : links){
  System.out.println(link.attr("abs:href"));
}

Author by

Legend

Updated on June 05, 2022

Comments

Legend almost 2 years
I have a string containing a large chunk of HTML and am trying to extract the link from the href="..." portion of the string. The href could be in one of the following forms:
```
<a href="..." />
<a class="..." href="..." />
```
I don't really have a problem with regex but for some reason when I use the following code:
```
String innerHTML = getHTML();
Pattern p = Pattern.compile("href=\"(.*)\"", Pattern.DOTALL);
Matcher m = p.matcher(innerHTML);
if (m.find()) {
    // Get all groups for this match
    for (int i = 0; i <= m.groupCount(); i++) {
        String groupStr = m.group(i);
        System.out.println(groupStr);
    }
}
```
Can someone tell me what is wrong with my code? I did this kind of thing in PHP, but in Java I am somehow doing something wrong... What happens is that it prints the whole HTML string whenever I try to print the match...

EDIT: Just so that everyone knows what kind of a string I am dealing with:
```
<a class="Wrap" href="item.php?id=43241"><input type="button">
    <span class="chevron"></span>
  </a>
  <div class="menu"></div>
```
Every time I run the code, it prints the whole string... That's the problem...

And about using jTidy... I'm on it but it would be interesting to know what went wrong in this case as well...

Legend over 14 years

It still prints the entire string and not the capture group :(

Peter Boughton over 14 years

Probably because he's missed the quantifier after the negated quote. But anyway, stop trying to use RegEx for this, it's the wrong tool for the job!

Kugel over 14 years

But it's the fastest tool for the job (development-wise). HTML parsers can be a pain.

Peter Boughton over 14 years

Regex cannot match HTML nodes correctly. Even with the non-regular extensions of many modern regex engines, HTML is too complex.

Legend over 14 years

Sorry! This works... There was something wrong with my string... Thanks a ton!

Denis Tulskiy over 14 years

It is, in fact, the fastest for the given task (performance-wise). But XPath would be faster and more scalable development-wise.

Kugel over 14 years

XPath works on HTML too? @Peter I understand that, but the job here was not to match HTML nodes, but simply to find the links.