Simplest way to correctly load html from web page into a string in Java

Solution 1

An extremely common error is failing to convert an HTTP response from bytes to characters correctly. To do this, you have to know the character encoding of the response. Ideally, it is specified as the "charset" parameter of the "Content-Type" header. But it can also be declared in the body itself, in a meta tag with an "http-equiv" attribute.

So it is surprisingly complicated to load a page into a String correctly, and even third-party libraries like HttpClient don't offer a general solution.

Here's a simple implementation that will handle the most common case:

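import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.net.URLConnection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

URL url = new URL("http://stackoverflow.com/questions/1381617");
URLConnection con = url.openConnection();
/* Allow optional whitespace after the semicolon and optional quotes
 * around the charset value; header values are case-insensitive. */
Pattern p = Pattern.compile("text/html;\\s*charset=\"?([^\\s\";]+)\"?\\s*", Pattern.CASE_INSENSITIVE);
/* getContentType() returns null when the header is missing. */
String contentType = con.getContentType();
Matcher m = p.matcher(contentType == null ? "" : contentType);
/* If Content-Type doesn't match this pre-conception, choose default and 
 * hope for the best. */
String charset = m.matches() ? m.group(1) : "ISO-8859-1";
StringBuilder buf = new StringBuilder();
/* try-with-resources closes the reader and the underlying stream. */
try (Reader r = new InputStreamReader(con.getInputStream(), charset)) {
  while (true) {
    int ch = r.read();
    if (ch < 0)
      break;
    buf.append((char) ch);
  }
}
String str = buf.toString();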
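
The implementation above only consults the Content-Type header. For the other case mentioned, where the encoding is declared in a meta tag in the body, a minimal sketch might look like the following. The class name MetaCharsetSniffer, its method names, and the 1024-byte scan limit are assumptions made for illustration, not part of the original answer, and a regular expression is only a rough stand-in for a real HTML parser.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: looks for a charset declared in a <meta> tag.
class MetaCharsetSniffer {
    // Matches charset=... in <meta charset="..."> as well as in
    // <meta http-equiv="Content-Type" content="text/html; charset=...">.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]*?charset\\s*=\\s*[\"']?([^\\s\"'/>;]+)",
        Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in a meta tag near the start of the body, or the fallback. */
    static Charset sniff(byte[] body, Charset fallback) {
        // Inspect only the first 1024 bytes (an assumed limit). ISO-8859-1 maps
        // every byte to exactly one char, so this preliminary decode cannot fail.
        int len = Math.min(body.length, 1024);
        String head = new String(body, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Malformed or unsupported charset name: use the fallback instead.
            }
        }
        return fallback;
    }

    /** Decodes the whole body with the sniffed charset. */
    static String decode(byte[] body, Charset fallback) {
        return new String(body, sniff(body, fallback));
    }
}
```

This requires the raw bytes of the response, so the body has to be read into a byte array before any Reader is created. A charset from the Content-Type header, when present, would normally be tried before this fallback.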

Solution 2

You can still simplify it a bit using org.apache.commons.io.IOUtils:

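URL url = new URL("http://stackoverflow.com/questions/1381617");
URLConnection con = url.openConnection();
/* Allow optional whitespace after the semicolon and optional quotes. */
Pattern p = Pattern.compile("text/html;\\s*charset=\"?([^\\s\";]+)\"?\\s*", Pattern.CASE_INSENSITIVE);
/* getContentType() returns null when the header is missing. */
String contentType = con.getContentType();
Matcher m = p.matcher(contentType == null ? "" : contentType);
/* If Content-Type doesn't match this pre-conception, choose default and 
 * hope for the best. */
String charset = m.matches() ? m.group(1) : "ISO-8859-1";
String str;
/* IOUtils.toString does not close the stream, so close it here. */
try (InputStream in = con.getInputStream()) {
  str = IOUtils.toString(in, charset);
}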
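
If adding Commons IO is not an option, the same step can be sketched with the standard library alone, assuming Java 9 or later for InputStream.readAllBytes(). The class and method names here (StreamToString, readFully) are made up for illustration, and wrapping IOException in UncheckedIOException is just one possible design choice.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical helper: a stdlib-only stand-in for IOUtils.toString.
class StreamToString {
    /** Reads the stream to the end, closes it, and decodes the bytes with the given charset. */
    static String readFully(InputStream in, Charset charset) {
        try (InputStream stream = in) {
            // readAllBytes() (Java 9+) blocks until end of stream is reached.
            return new String(stream.readAllBytes(), charset);
        } catch (IOException e) {
            // Rethrown unchecked so callers are not forced to declare IOException.
            throw new UncheckedIOException(e);
        }
    }
}
```

With the connection and charset from the snippet above, this would be called as StreamToString.readFully(con.getInputStream(), Charset.forName(charset)).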

Solution 3

I use this:

        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader, replacing an explicit
        // close() in a finally block.
        try ( BufferedReader bufferedReader = new BufferedReader( 
                                     new InputStreamReader( 
                                          new URL(urlToSearch)
                                              .openConnection()
                                              .getInputStream() )) ) {
            String line = null;
            while( ( line = bufferedReader.readLine() ) != null ) {
                 sb.append( line ) ;
                 sb.append( "\n");
            }
        }

It works most of the time.
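
One likely reason it only works most of the time: an InputStreamReader created without a charset decodes with the JVM's default charset, which may not be the one the server used. The sketch below shows what such a mismatch does to a single accented character. The class name CharsetMismatchDemo and the sample text are assumptions for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo: the same bytes decoded with two different charsets.
class CharsetMismatchDemo {
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "café" encoded as UTF-8 is 5 bytes: the é takes two (0xC3 0xA9).
        byte[] utf8Bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        String right = decode(utf8Bytes, StandardCharsets.UTF_8);
        String wrong = decode(utf8Bytes, StandardCharsets.ISO_8859_1);

        System.out.println(right.length()); // 4
        System.out.println(wrong.length()); // 5: é became the two chars Ã and ©
        System.out.println(right.equals(wrong)); // false
    }
}
```

Pure ASCII pages decode identically either way, which is why the problem often goes unnoticed until a page contains non-ASCII text.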

Author: Mark

CTO at CareSwitch

Updated on September 27, 2020

Comments

  • Mark
    Mark over 3 years

    Just what the title says.

    Help greatly appreciated!

  • Ar5hv1r
    Ar5hv1r over 12 years
    I know this is a really old question, but if you could check out stackoverflow.com/questions/7615014/… I'd really appreciate it.
  • Tal Weiss
    Tal Weiss almost 12 years
    Please change default encoding to "UTF-8" (trends.builtwith.com/encoding). People are learning from your (very good) answer!
  • Tal Weiss
    Tal Weiss almost 12 years
    There will be an extra "\n" at the end of the resulting string.
  • brady
    brady almost 12 years
    @TalWeiss Popularity doesn't matter; ISO-8859-1 is the specified default: "When no explicit charset parameter is provided by the sender, media subtypes of the 'text' type are defined to have a default charset value of 'ISO-8859-1' when received via HTTP. Data in character sets other than 'ISO-8859-1' or its subsets MUST be labeled with an appropriate charset value."
  • Tal Weiss
    Tal Weiss almost 12 years
    @erickson I do see your point, but this is code for reading the web and people just want their code to work. As you remarked "hope for the best" - I'm just not sure what the best is, in terms of probability of your code actually working when the encoding is not specified. I'm GUESSING that globally you have better odds with UTF-8.