How can I capture all network requests and full response data when loading a page in Chrome?


Solution 1

You can enable request interception with page.setRequestInterception(), and then, inside the page.on('request') handler, use the request-promise-native module as a middle man to gather the response data before continuing the request with request.continue() in Puppeteer.

Here's a full working example:

'use strict';

const puppeteer = require('puppeteer');
const request_client = require('request-promise-native');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const result = [];

  await page.setRequestInterception(true);

  page.on('request', request => {
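    // Fetch the same URL with request-promise-native to capture the full response
    // data. Note that this sends a separate request, in addition to the one the
    // browser makes once the intercepted request is continued below.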
    request_client({
      uri: request.url(),
      resolveWithFullResponse: true,
    }).then(response => {
      const request_url = request.url();
      const request_headers = request.headers();
      const request_post_data = request.postData();
      const response_headers = response.headers;
      const response_size = response_headers['content-length'];
      const response_body = response.body;

      result.push({
        request_url,
        request_headers,
        request_post_data,
        response_headers,
        response_size,
        response_body,
      });

      console.log(result);
      request.continue();
    }).catch(error => {
      console.error(error);
      request.abort();
    });
  });

  await page.goto('https://example.com/', {
    waitUntil: 'networkidle0',
  });

  await browser.close();
})();

Solution 2

Puppeteer-only solution

This can be done with Puppeteer alone. The problem you are describing, that response.buffer() is cleared on navigation, can be circumvented by processing the requests one after another.

How it works

The code below uses page.setRequestInterception to intercept all requests. If a request is currently being processed or awaited, new requests are put into a queue. response.buffer() can then be used without the risk that other requests asynchronously wipe the buffer, as there are no parallel requests. As soon as the currently processed request/response is handled, the next request is processed.

Code

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const [page] = await browser.pages();

    const results = []; // collects all results

    let paused = false;
    let pausedRequests = [];

    const nextRequest = () => { // continue the next request or "unpause"
        if (pausedRequests.length === 0) {
            paused = false;
        } else {
            // continue first request in "queue"
            (pausedRequests.shift())(); // calls the request.continue function
        }
    };

    await page.setRequestInterception(true);
    page.on('request', request => {
        if (paused) {
            pausedRequests.push(() => request.continue());
        } else {
            paused = true; // pause, as we are processing a request now
            request.continue();
        }
    });

    page.on('requestfinished', async (request) => {
        const response = await request.response();

        const responseHeaders = response.headers();
        let responseBody;
        if (request.redirectChain().length === 0) {
            // body can only be accessed for non-redirect responses
            responseBody = await response.buffer();
        }

        const information = {
            url: request.url(),
            requestHeaders: request.headers(),
            requestPostData: request.postData(),
            responseHeaders: responseHeaders,
            responseSize: responseHeaders['content-length'],
            responseBody,
        };
        results.push(information);

        nextRequest(); // continue with next request
    });
    page.on('requestfailed', (request) => {
        // handle failed request
        nextRequest();
    });

    await page.goto('...', { waitUntil: 'networkidle0' });
    console.log(results);

    await browser.close();
})();

Solution 3

I would suggest searching for a fast proxy server that can write request logs together with the actual response content.

The target setup is to let the proxy server simply write a log file, and then analyze that log for the information you need.

Don't intercept requests while the proxy is running (this will slow things down).

The performance issues you may encounter with a proxy-as-logger setup are mostly related to TLS support; make sure the proxy allows a quick TLS handshake and supports HTTP/2.

E.g. Squid benchmarks show that it can process hundreds of requests per second, which should be enough for testing purposes.
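As a rough sketch of such a setup, assuming a logging proxy (for example Squid or mitmproxy) is already listening locally, Puppeteer can be pointed at it with Chromium's --proxy-server flag; the address, port and credentials below are placeholders:

const puppeteer = require('puppeteer');

(async () => {
  // Assumes a logging proxy is already running on 127.0.0.1:3128 (placeholder address).
  const browser = await puppeteer.launch({
    args: ['--proxy-server=http://127.0.0.1:3128'],
  });
  const page = await browser.newPage();

  // If the proxy requires authentication, credentials can be supplied like this:
  // await page.authenticate({ username: 'user', password: 'pass' });

  await page.goto('https://example.com/', { waitUntil: 'networkidle0' });
  await browser.close();

  // The captured requests and response bodies are then read from the proxy's
  // log files rather than from Puppeteer itself.
})();

This keeps the capture work out of the browser process entirely, which is what avoids the interception slowdown mentioned above.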

Solution 4

I would suggest using a tool called Fiddler. It will capture all the information that you mentioned when you load a URL.

Author: Matt Zeunert

Web developer in London working on JavaScript apps and tools for developers. Currently working on a front-end monitoring tool: DebugBear

Updated on February 25, 2021

Comments

  • Matt Zeunert
    Matt Zeunert about 3 years

    Using Puppeteer, I'd like to load a URL in Chrome and capture the following information:

    • request URL
    • request headers
    • request post data
    • response headers text (including duplicate headers like set-cookie)
    • transferred response size (i.e. compressed size)
    • full response body

    Capturing the full response body is what causes the problems for me.

    Things I've tried:

    • Getting response content with response.buffer - this does not work if there are redirects at any point, since buffers are wiped on navigation
    • intercepting requests and using getResponseBodyForInterception - this means I can no longer access the encodedLength, and I also had problems getting the correct request and response headers in some cases
    • Using a local proxy works, but this slowed down page load times significantly (and also changed some behavior for e.g. certificate errors)

    Ideally the solution should only have a minor performance impact and have no functional differences from loading a page normally. I would also like to avoid forking Chrome.

  • Md. Abu Taher
    Md. Abu Taher over 5 years
    Was expecting you to write an answer, otherwise I would write the same answer. :D
  • Matt Zeunert
    Matt Zeunert over 5 years
    Thanks! This approach breaks some sites because at request interception time some headers aren't included yet (e.g. Accept and Cookie), see github.com/GoogleChrome/puppeteer/issues/3436. I want the outgoing request to have the same headers as without request interception.
  • Matt Zeunert
    Matt Zeunert over 5 years
    I think request.continue will make a new request rather than use the same data, but request.respond should work.
  • Matt Zeunert
    Matt Zeunert over 5 years
    That's using response.buffer which gets wiped on navigation.
  • Jose Rodriguez
    Jose Rodriguez over 5 years
    there's a checkbox to preserve the log, so you can reload the page and you will not lose the request log
  • Matt Zeunert
    Matt Zeunert over 5 years
    It doesn't work, it only shows a "Failed to load response data" message after navigation.
  • Matt Zeunert
    Matt Zeunert over 5 years
    Thanks! I wasn't too keen on using a proxy because of the performance problems I was having, but I'll look into it again.
  • Andrii Muzalevskyi
    Andrii Muzalevskyi over 5 years
    @MattZeunert, thank you, please let me know if you need any help with it
  • Nisim Joseph
    Nisim Joseph over 4 years
    I tried to manipulate the request URL but it doesn't allow it, and I couldn't see the different URL in Chrome's tracing. Any ideas on how to do it?
  • FelipeKunzler
    FelipeKunzler about 4 years
    request-promise-native seems to be deprecated as of now.
  • onassar
    onassar over 3 years
    Why do you need to pause requests? Why can't you simply let requests continue, and use the requestfinished event to check for the URL and response headers and store those? In my case, all I want are the headers associated with a particular request URL.
  • Thomas Dondorf
    Thomas Dondorf over 3 years
    @onassar Your use case is different from the OP's. The question was how to capture "full response data", not just headers.
  • onassar
    onassar over 3 years
    Ahh okay. So if all I care about is the response headers, I could simplify the approach yah? In my case, I call setRequestInterception with true, and then call continue on request objects in the following events: request, requestfailed and requestfinished. The exception is I store the headers in requestfinished event calls. That make sense?
  • Thomas Dondorf
    Thomas Dondorf over 3 years
    @onassar Yes, if you don't need the buffer you can simplify it (a minimal sketch of that simplified variant follows these comments).
  • Willis
    Willis over 3 years
    what is [page] used for? I didn't see it used in your code.
  • Thomas Dondorf
    Thomas Dondorf over 3 years
    @Willis It's a destructuring assignment: developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/…
  • Willis
    Willis over 3 years
    @ThomasDondorf got it, sorry I'm not very familiar with js, thank you!
  • NeuronButter
    NeuronButter almost 3 years
    I think you've confused Puppeteer with Pyppeteer. Puppeteer is for JavaScript, and the Pyppeteer library is just a port from the JavaScript one.
  • Gergely M
    Gergely M almost 3 years
    Hi @NeuronButter, I'm not confused, just trying to help those who need help with Pyppeteer - which is in fact not the same as Puppeteer. I did that because it's hard to find info for Pyppeteer. Search engines - like Google's - keep returning Puppeteer-related hits. That's how I ended up on this page. Regardless, I take your -1 gracefully since mine isn't a solution for the OP indeed.
  • NeuronButter
    NeuronButter almost 3 years
    That actually makes a lot of sense. I can't remove my -1 vote (sorry!), but in the future, try using "pyppeteer" (with the quotes) on Google, so you get an exact search match, and hopefully more relevant results :)
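For readers who, like the commenter above, only need the response headers rather than the full body, a minimal sketch of the simplified variant discussed in the comments could skip request interception and the queue entirely, since response.buffer() is never read; the URL is a placeholder:

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    const results = [];

    // No pausing/queueing is needed here: the response body is never read,
    // so parallel requests cannot invalidate anything this code depends on.
    page.on('response', response => {
        results.push({
            url: response.url(),
            status: response.status(),
            responseHeaders: response.headers(),
        });
    });

    await page.goto('https://example.com/', { waitUntil: 'networkidle0' });
    console.log(results);

    await browser.close();
})();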