Puppeteer, save webpage and images

Let's go back to the first approach: you can use fullPage to take the screenshot.

await page.screenshot({path: 'example.png', fullPage: true});

If you really want to download all resources for offline use, yes you can:

const fse = require('fs-extra');

page.on('response', async (res) => {
    // save all the data to SOMEWHERE_TO_STORE
    await fse.outputFile(SOMEWHERE_TO_STORE, await res.buffer());
});
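
As a slightly more concrete sketch of that saving step (not the only way to do it): the snippet below assumes fs-extra plus a hypothetical urlToPath() helper that maps each response URL to a file under ./offline; the helper name and directory layout are my own choice, not part of Puppeteer.

const fse = require('fs-extra');
const path = require('path');

// hypothetical helper: map a response URL to a local file path under ./offline
const urlToPath = (u) => {
    const { hostname, pathname } = new URL(u);
    return path.join('offline', hostname, pathname === '/' ? 'index.html' : pathname);
};

page.on('response', async (res) => {
    try {
        // redirect responses have no body to buffer, so skip them
        if (res.status() >= 300 && res.status() <= 399) return;
        await fse.outputFile(urlToPath(res.url()), await res.buffer());
    } catch (err) {
        console.log('could not save', res.url(), err.message);
    }
});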

Then you can browse the website offline through Puppeteer with everything in place.

await page.setRequestInterception(true);
page.on('request', async (req) => {
    // handle the request by responding with the data you stored in SOMEWHERE_TO_STORE
    // and of course, don't forget THE_FILE_TYPE
    await req.respond({
        status: 200,
        contentType: THE_FILE_TYPE,
        body: await fse.readFile(SOMEWHERE_TO_STORE),
    });
});
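
A sketch of the replay step under the same assumptions (the urlToPath() helper from the sketch above plus a small extension-to-MIME table that you would extend as needed):

// hypothetical extension-to-MIME lookup; extend it for the file types your page uses
const MIME = {
    '.html': 'text/html',
    '.css': 'text/css',
    '.js': 'application/javascript',
    '.png': 'image/png',
    '.jpg': 'image/jpeg',
    '.gif': 'image/gif',
    '.svg': 'image/svg+xml',
};

await page.setRequestInterception(true);
page.on('request', async (req) => {
    const file = urlToPath(req.url()); // same mapping that was used while saving
    if (await fse.pathExists(file)) {
        await req.respond({
            status: 200,
            contentType: MIME[path.extname(file)] || 'application/octet-stream',
            body: await fse.readFile(file),
        });
    } else {
        // nothing stored for this URL, fail fast instead of going to the network
        await req.abort();
    }
});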

Author by Johan Hoeksma

Updated on June 11, 2022

Comments

  • Johan Hoeksma almost 2 years

    I'm trying to save a webpage for offline usage with Node.js and Puppeteer. I see a lot of examples with:

    await page.screenshot({path: 'example.png'});
    

    But with a bigger webpage that's not an option. So a better option in Puppeteer is to load the page and then save it like:

    const html = await page.content();
    // ... write to file
    

    OK, that works. Now I want to scroll through long pages like Twitter. So I decided to block all images in the Puppeteer page:

    page.on('request', request => {
        if (request.resourceType() === 'image') {
            const imgUrl = request.url()
            download(imgUrl, 'download').then((output) => {
                images.push({url: output.url, filename: output.filename})
            }).catch((err) => {
                console.log(err)
            })
            request.abort()
        } else {
            request.continue()
        }
    })
    

    OK, here I used the npm 'download' lib to download all the images. Yes, the downloaded images are fine :D.

    Now when I save the content, I want to point it to the offline images in the source.

    const html = await page.content();
    

    But now I would like to replace all the

    <img src="/pic.png?id=123"> 
    <img src="https://twitter.com/pics/1.png">
    

    And also things like:

    <div style="background-image: url('this_also.gif')"></div>
    

    So is there a way (in Puppeteer) to scrape a big page and store the whole content offline?

    JavaScript and CSS would also be nice.

    Update

    For now I will open the big HTML file again with Puppeteer.

    And then intercept all requests for files such as: https://dom.com/img/img.jpg, /file.jpg, ....

    request.respond({
        status: 200,
        contentType: 'image/jpeg',
        body: '..'
    });
    

    I can also do it with a Chrome extension. But I would like to have a function with some options, e.g. page.html(), the same as page.pdf().

    • Cody G over 5 years
      I would think webpages are too dynamic to do something like this... (depending on how much of your life you want to spend on it) what is your end goal, just viewing it?
    • pguardiario over 5 years
      Is the question how to manipulate the HTML? If so you would use cheerio from Node or jQuery from page.evaluate (a rough sketch of that approach follows after these comments).
    • Johan Hoeksma over 5 years
      The question is how to point to the local downloads when you have CSS, JavaScript and images.
    • Johan Hoeksma almost 5 years
      @Cody, the goal is to save big websites (like Twitter, Facebook etc.) for offline usage.
  • pouya about 4 years
    Great, but any MITM proxy can do the same.
  • Denis Glotov over 3 years
    Better to rely on the requestfinished event (see the sketch below).
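
On Denis Glotov's point: listening to requestfinished instead of response means the response body has fully arrived before buffer() is called, which avoids some of the buffer() failures you can hit with responses that are still streaming. A minimal variant of the saving sketch above, under the same urlToPath() assumption:

page.on('requestfinished', async (req) => {
    const res = req.response();
    if (!res || (res.status() >= 300 && res.status() <= 399)) return; // skip redirects
    try {
        await fse.outputFile(urlToPath(res.url()), await res.buffer());
    } catch (err) {
        console.log('could not save', res.url(), err.message);
    }
});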
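
And on pguardiario's suggestion to manipulate the HTML from page.evaluate: a rough sketch of rewriting <img src> attributes and inline background-image URLs before serializing with page.content() could look like the snippet below. The toLocal() mapping is hypothetical and has to mirror whatever file names were used when the images were downloaded.

// rewrite image references to local file names before calling page.content()
await page.evaluate(() => {
    // assumption: images were saved flat under ./images using their base file name
    const toLocal = (u) => 'images/' + u.split('/').pop().split('?')[0];

    document.querySelectorAll('img[src]').forEach((img) => {
        img.setAttribute('src', toLocal(img.src));
    });

    document.querySelectorAll('[style*="background-image"]').forEach((el) => {
        el.style.backgroundImage = el.style.backgroundImage.replace(
            /url\((['"]?)(.*?)\1\)/g,
            (match, quote, u) => `url('${toLocal(u)}')`
        );
    });
});

const html = await page.content();
await fse.outputFile('offline/index.html', html);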