Jsoup set accept-header request doesn't work

It would be so easy if tempobet just looked at the Accept-Language header...

They serve tr (tempobet22.com) and en (tempobet.com) on different domains. The first call to the en domain is redirected to the tr domain. If you choose another language, they do two redirects plus their magic session sharing. The first redirect needs a GAMBLINGSESS cookie from the first domain, and the second needs one from the second domain. Jsoup does not know this when it follows a redirect...

String userAgent = "Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36";
// get a session for tr and en domain
String tempobetSession = Jsoup.connect("https://www.tempobet.com/").userAgent(userAgent).execute().cookie("GAMBLINGSESS");
String tempobet22Session = Jsoup.connect("https://www.tempobet22.com/").userAgent(userAgent).execute().cookie("GAMBLINGSESS");
// tell the tr domain that we want to go to en, without following the redirect
String redirect = Jsoup.connect("https://www.tempobet22.com/?change_lang=https://www.tempobet.com/")
    .userAgent(userAgent).cookie("GAMBLINGSESS", tempobet22Session).followRedirects(false).execute().header("Location");
// The redirect goes to the en domain with our hashed tr cookie as a parameter - but this request needs an en cookie
Response response = Jsoup.connect(redirect).userAgent(userAgent).cookie("GAMBLINGSESS", tempobetSession).execute();
// finally...
Document doc = Jsoup.connect("https://www.tempobet.com/league191_5_0.html").userAgent(userAgent).cookies(response.cookies()).get();
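
The cookie juggling above comes down to picking the session cookie by the host of the URL being requested. Below is a minimal sketch of that idea using only the JDK. The class `SessionPerHost`, its methods, the placeholder session values, and the made-up query string are all hypothetical; none of them come from Jsoup or from the site.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: remembers one session cookie value per host. */
public class SessionPerHost {
    private final Map<String, String> sessions = new HashMap<>();

    /** Stores the session value under the host part of the given URL. */
    public void remember(String url, String session) {
        sessions.put(URI.create(url).getHost(), session);
    }

    /** Returns the session stored for the URL's host, or null if none is known. */
    public String sessionFor(String url) {
        return sessions.get(URI.create(url).getHost());
    }

    public static void main(String[] args) {
        SessionPerHost store = new SessionPerHost();
        // Placeholder values; in practice these would be the GAMBLINGSESS cookies
        // obtained from each domain.
        store.remember("https://www.tempobet.com/", "en-session-placeholder");
        store.remember("https://www.tempobet22.com/", "tr-session-placeholder");

        // Made-up Location value pointing at the en domain.
        String redirect = "https://www.tempobet.com/?some_param=value";
        System.out.println(store.sessionFor(redirect));
    }
}
```

With a store like this, the value passed to `.cookie("GAMBLINGSESS", ...)` for each hop could be looked up from the `Location` header instead of being chosen by hand. This assumes the header holds an absolute URL, since a relative one has no host to look up.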
Author by

quartaela

Updated on June 14, 2022

Comments

  • quartaela
    quartaela almost 2 years

    I'm trying to parse data from tempobet.com in English. When I use Google's REST client, it returns the HTML exactly as I want it. When I fetch the page with Jsoup, however, the dates come back in my locale's format. This is the test code:

    import java.io.IOException;
    import java.util.Date;
    import java.util.ListIterator;
    import java.util.Locale;
    
    import org.apache.commons.lang3.time.DateUtils;
    import org.jsoup.Connection.Response;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    import org.junit.Test;
    
    public class ParseHtmlTest {
    
        @Test
        public void testName() throws IOException {
    
            Response response = Jsoup.connect("https://www.tempobet.com/league191_5_0.html")
                                     .userAgent("Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36")
                                     .execute();
    
            Document doc = Jsoup.connect("https://www.tempobet.com/league191_5_0.html")
                                .userAgent("Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36")
                                .header("Accept-Language", "en-US")
                                .header("Accept-Encoding", "gzip,deflate,sdch")
                                .cookies(response.cookies())
                                .get();
    
            Elements tableElement = doc.select("table[class=table-a]");
            ListIterator<Element> trElementIterator = tableElement.select("tr:gt(2)").listIterator();
    
            while (trElementIterator.hasNext()) {
    
                ListIterator<Element> tdElementIterator = trElementIterator.next().select("td").listIterator();
    
                while (tdElementIterator.hasNext()) {
    
                    System.out.println(tdElementIterator.next());
                }
            }
        }
    }
    

    Here is an example line from the response:

    <td width="40" class="grey">21 Nis 20:00</td>
    

    The date should be "21 Apr 20:00". I would appreciate any help. Thanks.

  • quartaela
    quartaela about 10 years
    Wow, works like a charm! I have a couple of questions, though I'm a newbie at HTTP :). There are a few more cookies when I check the cookie list, so how is GAMBLINGSESS enough? Secondly, when I make a GET request from Chrome's Advanced REST Client, it shows two redirections: the first is https://www.tempobet.com/ and the second is https://www.tempobet22.com/. Is that how I can tell there are two redirects?
  • lefloh
    lefloh about 10 years
    GAMBLINGSESS is the session cookie where they seem to store the requested language. I don't know what they use the last_domain_id cookie for. Maybe tracking how often people switch from tr to en to de… The other cookies are Google Analytics tracking cookies. The first redirect is the response to ?change_lang... (4th line). If you follow this redirect, they send another redirect to https://www.tempobet.com/. You can see this easily in the Network tab of Chrome DevTools if you change the language like a normal user would on the website.
  • quartaela
    quartaela about 10 years
    Yeah, I got it. Thanks :)