Puppeteer, save webpage and images

Let's go back to the first approach: you can use fullPage to take the screenshot.

await page.screenshot({path: 'example.png', fullPage: true});

If you really want to download all resources for offline use, yes you can:

const fse = require('fs-extra');

page.on('response', async (res) => {
    // save all the data to SOMEWHERE_TO_STORE
    await fse.outputFile(SOMEWHERE_TO_STORE, await res.buffer());
});
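
As a slightly more concrete sketch of that saving step (not the only way to do it): the snippet below assumes fs-extra plus a hypothetical urlToPath() helper that maps each response URL to a file under ./offline; the helper name and directory layout are my own choice, not part of Puppeteer.

const fse = require('fs-extra');
const path = require('path');

// hypothetical helper: map a response URL to a local file path under ./offline
const urlToPath = (u) => {
    const { hostname, pathname } = new URL(u);
    return path.join('offline', hostname, pathname === '/' ? 'index.html' : pathname);
};

page.on('response', async (res) => {
    try {
        // redirect responses have no body to buffer, so skip them
        if (res.status() >= 300 && res.status() <= 399) return;
        await fse.outputFile(urlToPath(res.url()), await res.buffer());
    } catch (err) {
        console.log('could not save', res.url(), err.message);
    }
});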

Then you can browse the website offline through Puppeteer with everything in place.

await page.setRequestInterception(true);
page.on('request', async (req) => {
    // handle the request by responding with the data you stored in SOMEWHERE_TO_STORE
    // and of course, don't forget THE_FILE_TYPE
    await req.respond({
        status: 200,
        contentType: THE_FILE_TYPE,
        body: await fse.readFile(SOMEWHERE_TO_STORE),
    });
});
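
A sketch of the replay step under the same assumptions (the urlToPath() helper from the sketch above plus a small extension-to-MIME table that you would extend as needed):

// hypothetical extension-to-MIME lookup; extend it for the file types your page uses
const MIME = {
    '.html': 'text/html',
    '.css': 'text/css',
    '.js': 'application/javascript',
    '.png': 'image/png',
    '.jpg': 'image/jpeg',
    '.gif': 'image/gif',
    '.svg': 'image/svg+xml',
};

await page.setRequestInterception(true);
page.on('request', async (req) => {
    const file = urlToPath(req.url()); // same mapping that was used while saving
    if (await fse.pathExists(file)) {
        await req.respond({
            status: 200,
            contentType: MIME[path.extname(file)] || 'application/octet-stream',
            body: await fse.readFile(file),
        });
    } else {
        // nothing stored for this URL, fail fast instead of going to the network
        await req.abort();
    }
});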

Author by Johan Hoeksma

Updated on June 11, 2022

Comments

  • Johan Hoeksma almost 2 years

    I'm trying to save a webpage for offline usage with Node.js and Puppeteer. I see a lot of examples with:

    await page.screenshot({path: 'example.png'});
    

    But with a bigger webpage that's not an option. So a better option in Puppeteer is to load the page and then save it like:

    const html = await page.content();
    // ... write to file
    

    OK, that works. Now I want to scroll through long pages like Twitter. So I decided to block all images in the Puppeteer page:

    page.on('request', request => {
        if (request.resourceType() === 'image') {
            const imgUrl = request.url()
            download(imgUrl, 'download').then((output) => {
                images.push({url: output.url, filename: output.filename})
            }).catch((err) => {
                console.log(err)
            })
            request.abort()
        } else {
            request.continue()
        }
    })
    

    OK, here I used the npm 'download' lib to download all the images. Yes, the downloaded images are fine :D.

    Now when I save the content, I want to point it to the offline images in the source.

    const html = await page.content();
    

    But now I would like to replace all the

    <img src="/pic.png?id=123"> 
    <img src="https://twitter.com/pics/1.png">
    

    And also things like:

    <div style="background-image: url('this_also.gif')"></div>
    

    So is there a way (in Puppeteer) to scrape a big page and store the whole content offline?

    JavaScript and CSS would also be nice.

    Update

    For now I will open the big HTML file again with Puppeteer.

    And then intercept all requests for files such as: https://dom.com/img/img.jpg, /file.jpg, ....

    request.respond({
        status: 200,
        contentType: 'image/jpeg',
        body: '..'
    });
    

    I can also do it with a Chrome extension. But I would like to have a function with some options, e.g. page.html(), the same as page.pdf().

    • Cody G over 5 years
      I would think webpages are too dynamic to do something like this... (depending on how much of your life you want to spend on it) what is your end goal, just viewing it?
    • pguardiario over 5 years
      Is the question how to manipulate the HTML? If so you would use cheerio from Node or jQuery from page.evaluate (a rough sketch of that approach follows after these comments).
    • Johan Hoeksma over 5 years
      The question is how to point to the local downloads when you have CSS, JavaScript and images.
    • Johan Hoeksma almost 5 years
      @Cody, the goal is to save big websites (like Twitter, Facebook etc.) for offline usage.
  • pouya about 4 years
    Great, but any MITM proxy can do the same.
  • Denis Glotov over 3 years
    Better to rely on the requestfinished event (see the sketch below).
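
On Denis Glotov's point: listening to requestfinished instead of response means the response body has fully arrived before buffer() is called, which avoids some of the buffer() failures you can hit with responses that are still streaming. A minimal variant of the saving sketch above, under the same urlToPath() assumption:

page.on('requestfinished', async (req) => {
    const res = req.response();
    if (!res || (res.status() >= 300 && res.status() <= 399)) return; // skip redirects
    try {
        await fse.outputFile(urlToPath(res.url()), await res.buffer());
    } catch (err) {
        console.log('could not save', res.url(), err.message);
    }
});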
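
And on pguardiario's suggestion to manipulate the HTML from page.evaluate: a rough sketch of rewriting <img src> attributes and inline background-image URLs before serializing with page.content() could look like the snippet below. The toLocal() mapping is hypothetical and has to mirror whatever file names were used when the images were downloaded.

// rewrite image references to local file names before calling page.content()
await page.evaluate(() => {
    // assumption: images were saved flat under ./images using their base file name
    const toLocal = (u) => 'images/' + u.split('/').pop().split('?')[0];

    document.querySelectorAll('img[src]').forEach((img) => {
        img.setAttribute('src', toLocal(img.src));
    });

    document.querySelectorAll('[style*="background-image"]').forEach((el) => {
        el.style.backgroundImage = el.style.backgroundImage.replace(
            /url\((['"]?)(.*?)\1\)/g,
            (match, quote, u) => `url('${toLocal(u)}')`
        );
    });
});

const html = await page.content();
await fse.outputFile('offline/index.html', html);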