Puppeteer, save webpage and images
Let's go back to the first option: you can use fullPage to take the screenshot.
await page.screenshot({path: 'example.png', fullPage: true});
If you really want to download all resources for offline use, yes, you can:
const fse = require('fs-extra');
page.on('response', async (res) => {
  // the handler must be async so we can await the body;
  // save the response data to SOMEWHERE_TO_STORE
  await fse.outputFile(SOMEWHERE_TO_STORE, await res.buffer());
});
Then you can browse the website offline through Puppeteer with everything working.
await page.setRequestInterception(true);
page.on('request', async (req) => {
  // handle the request by responding with the data you stored in
  // SOMEWHERE_TO_STORE — and of course, don't forget THE_FILE_TYPE
  await req.respond({
    status: 200,
    contentType: THE_FILE_TYPE,
    body: await fse.readFile(SOMEWHERE_TO_STORE),
  });
});
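THE_FILE_TYPE also has to be filled in per request. A small sketch that guesses the content type from the URL's extension — this hand-rolled table is just an illustration; the mime package on npm covers far more types:

```javascript
// Minimal extension-to-MIME-type table for the common web resources.
const MIME_TYPES = {
  '.html': 'text/html',
  '.css': 'text/css',
  '.js': 'application/javascript',
  '.png': 'image/png',
  '.jpg': 'image/jpeg',
  '.jpeg': 'image/jpeg',
  '.gif': 'image/gif',
  '.svg': 'image/svg+xml',
};

// Guess a contentType value from a resource URL; query strings are
// stripped via the URL parser, and unknown extensions fall back to a
// generic binary type.
function guessContentType(url) {
  const pathname = new URL(url, 'https://example.com').pathname;
  const dot = pathname.lastIndexOf('.');
  const ext = dot === -1 ? '' : pathname.slice(dot).toLowerCase();
  return MIME_TYPES[ext] || 'application/octet-stream';
}
```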
Johan Hoeksma
Updated on June 11, 2022

Comments
-
Johan Hoeksma almost 2 years
I'm trying to save a webpage, for offline usage with Nodejs and puppeteer. I see a lot of examples with:
await page.screenshot({path: 'example.png'});
But with a bigger webpage that's not an option. So a better option in Puppeteer is to load the page and then save it like:
const html = await page.content(); // ... write to file
OK, that works. Now I want to scroll through long pages like Twitter's. So I decided to block all images in the Puppeteer page:
page.on('request', request => {
  if (request.resourceType() === 'image') {
    const imgUrl = request.url();
    download(imgUrl, 'download').then((output) => {
      images.push({ url: output.url, filename: output.filename });
    }).catch((err) => {
      console.log(err);
    });
    request.abort();
  } else {
    request.continue();
  }
});
OK, I used the npm download library to download all the images, and yes, the downloaded images are fine :D.
Now, when I save the content, I want to point the source at the offline images.
const html = await page.content();
But now I'd like to replace all the image references, such as:
<img src="/pic.png?id=123"> <img src="https://twitter.com/pics/1.png">
And also things like:
<div style="background-image: url('this_also.gif')"></div>
So is there a way (in Puppeteer) to scrape a big page and store the whole content offline?
JavaScript and CSS would also be nice.
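The replacement described above can be sketched with two regexes — a rough illustration only, since regexes won't cover every way HTML and CSS can reference a resource (cheerio or page.evaluate would be more robust). The mapping object and its contents are hypothetical:

```javascript
// Rewrite <img src="..."> attributes and CSS url(...) references in a
// saved HTML string so they point at local filenames. `mapping` maps
// each original URL to its downloaded filename; unknown URLs are kept.
function rewriteResourceUrls(html, mapping) {
  const swap = (url) => mapping[url] || url;
  return html
    .replace(/(<img\b[^>]*\bsrc=")([^"]+)(")/g,
             (_, pre, url, post) => pre + swap(url) + post)
    .replace(/(url\(['"]?)([^'")]+)(['"]?\))/g,
             (_, pre, url, post) => pre + swap(url) + post);
}
```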
Update
For now I will open the big HTML file again with Puppeteer, and then intercept all requests for files such as https://dom.com/img/img.jpg, /file.jpg, ...
request.respond({ status: 200, contentType: 'image/jpeg', body: '..' });
I could also do it with a Chrome extension, but I'd like to have a function with some options, page.html(), the same as page.pdf().
-
Cody G over 5 years
I would think webpages are too dynamic to do something like this... (depending on how much of your life you want to spend on it). What is your end goal, just viewing it?
-
pguardiario over 5 years
Is the question how to manipulate the HTML? If so, you would use cheerio from Node or jQuery from page.evaluate.
-
Johan Hoeksma over 5 years
The question is how to point to the local downloads when you have CSS, JavaScript, and images.
-
Johan Hoeksma almost 5 years
@Cody, the goal is to save big websites (like Twitter, Facebook, etc.) for offline usage.
-
pouya about 4 years
Great, but any MITM proxy can do the same.
-
Denis Glotov over 3 years
Better to rely on the requestfinished event.