How to parse a data URI in Python?
Solution 1
Split the data URI on the comma to get the base64 encoded data without the header. Call base64.b64decode
to decode that to bytes. Last, write the bytes to a file.
from base64 import b64decode

data_uri = "data:image/png;base64,iVBORw0KGg..."

# Python 2 and Python < 3.4 (also works on newer versions)
header, encoded = data_uri.split(",", 1)
data = b64decode(encoded)

# Python 3.4+
# from urllib import request
# with request.urlopen(data_uri) as response:
#     data = response.read()

with open("image.png", "wb") as f:
    f.write(data)
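The question also asks for the content type separately from the data. A minimal sketch of that, building on the same comma split: `split_data_uri` is a hypothetical helper name, and it assumes the header contains no commas and that parameters are plain `key=value` pairs.

```python
from base64 import b64decode
from urllib.parse import unquote_to_bytes


def split_data_uri(data_uri):
    """Return (media_type, params, data) for a data URI.

    Hypothetical helper: assumes the header itself contains no commas.
    """
    if not data_uri.startswith("data:"):
        raise ValueError("not a data URI")
    header, payload = data_uri[len("data:"):].split(",", 1)
    # Header fields are separated by semicolons: media type first, then parameters.
    media_type, *fields = header.split(";")
    is_base64 = "base64" in fields
    params = dict(f.split("=", 1) for f in fields if "=" in f)
    # Base64 payloads are base64-decoded; anything else is percent-decoded to bytes.
    data = b64decode(payload) if is_base64 else unquote_to_bytes(payload)
    # RFC 2397 defaults to text/plain (with charset=US-ASCII) when the media type is omitted.
    return media_type or "text/plain", params, data


media_type, params, data = split_data_uri("data:text/plain;charset=utf-8;base64,SGVsbG8=")
print(media_type, params, data)  # text/plain {'charset': 'utf-8'} b'Hello'
```

Returning the parameters as a dict keeps things like `charset` available without a second parse.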
Solution 2
Since version 3.4, Python supports data URIs directly: urlopen handles them under the hood using urllib.request.DataHandler.
from urllib.request import urlopen

# data_uri is a data URI string such as "data:image/png;base64,iVBORw0KGg..."
with urlopen(data_uri) as response:
    data = response.read()
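The response object also carries the media type, which covers the "content type and data separated" part of the question. A short sketch, using a small text payload chosen only for illustration; it relies on the response exposing its headers as an email-style message object through `response.headers`:

```python
from urllib.request import urlopen

# Example data URI (assumed for illustration): base64 for "Hello".
data_uri = "data:text/plain;base64,SGVsbG8="

with urlopen(data_uri) as response:
    # The media type from the URI header is exposed as a Content-type header.
    content_type = response.headers.get_content_type()
    data = response.read()

print(content_type, data)  # text/plain b'Hello'
```

No network access is involved; the URI is decoded locally.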
Solution 3
w3lib (a library used by Scrapy) has a function to parse data uris:
>>> from w3lib.url import parse_data_uri
>>> parse_data_uri('data:image/png;base64,iVBORw0KGg==')
ParseDataURIResult(media_type='image/png', media_type_parameters={}, data=b'\x89PNG\r\n\x1a')
Solution 4
This example finds base64-encoded images in the <img> tags of an HTML snippet and saves each one to a file:
import re
from base64 import b64decode
from lxml import html

BASE_NAME = "image_"
source_code = """<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUA
AAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO
9TXL0Y4OHwAAAABJRU5ErkJggg==" alt="Red dot" />
<img src="data:image/gif;base64,R0lGODlhAQABAIAAAAUEBAAAACwAAAAAAQABAAACAkQBADs=" alt="Black dot" />"""

tree = html.fromstring(source_code)
for i, image in enumerate(tree.xpath('//img[contains(@src, "data:image")]/@src')):
    image_type, image_content = image.split(',', 1)
    image_type = re.findall(r'data:image/(\w+);base64', image_type)[0]
    with open("{}{}.{}".format(BASE_NAME, i, image_type), "wb") as f:
        # b64decode replaces Python 2's str.decode('base64')
        f.write(b64decode(image_content))
    print("[*] '{}' image found with content: {}\n".format(image_type, image_content))
Output:
[*] 'png' image found with content: iVBORw0KGgoAAAANSUhEUgAAAAUA
AAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO
9TXL0Y4OHwAAAABJRU5ErkJggg==
[*] 'gif' image found with content: R0lGODlhAQABAIAAAAUEBAAAACwAAAAAAQABAAACAkQBADs=
It will save every base64 image found within <img> tags, each with its respective file extension. Each file name is BASE_NAME + an auto-incrementing number provided by enumerate + the image extension.
Solution 5
Correcting JRodDynamite's post:
from base64 import b64decode  # base64.decodestring was removed in Python 3.9

png_arr = "data:image/png;base64,iVBORw0KGg..."
png_arr = png_arr.split(",", 1)
png_arr = png_arr[1]
with open("imageToSave.png", "wb") as fh:
    fh.write(b64decode(png_arr))
blueFast
Updated on June 08, 2022
Comments
-
blueFast almost 2 years
HTML image elements have this simplified format:
<img src='something'>
That something can be a data-uri, for example: data:image/png;base64,iVBORw0KGg...
Is there a standard way of parsing this with Python, so that I get the content_type and the base64 data separated, or should I create my own parser for this?
-
Andrey Belyak over 5 years
The prettiest solution imho: short and produces a well-structured result
-
Darkyen over 2 years
Just splitting on the first comma is not necessarily correct; the MIME type may contain a comma as well, for example:
data:video/webm; codecs="vp8, opus";base64,GkXfowEAAAAAAAAfQoaBAUL3g...
-
Darkyen over 2 years
And quotes won't help, because this is also possible:
data:video/webm;codecs=vp8,opus;base64,GkXfo59...
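One possible way around the comma problem raised in these comments, offered only as a sketch: look for the `;base64,` marker first and fall back to the first comma when it is absent. `split_header_and_payload` is a hypothetical helper, the payload below is a placeholder (base64 for "Hello") rather than real video data, and the heuristic would misparse a non-base64 URI whose data happens to contain the literal text `;base64,`.

```python
from base64 import b64decode
from urllib.parse import unquote_to_bytes

BASE64_MARKER = ";base64,"


def split_header_and_payload(data_uri):
    """Split a data URI into (header, decoded bytes).

    Heuristic: prefer the first ";base64," marker over the first comma.
    """
    body = data_uri[len("data:"):]
    marker_at = body.find(BASE64_MARKER)
    if marker_at != -1:
        # Everything before the marker is the header, commas included.
        header = body[:marker_at]
        return header, b64decode(body[marker_at + len(BASE64_MARKER):])
    # No base64 marker: fall back to the first comma and percent-decode.
    header, payload = body.split(",", 1)
    return header, unquote_to_bytes(payload)


header, data = split_header_and_payload("data:video/webm;codecs=vp8,opus;base64,SGVsbG8=")
print(header, data)  # video/webm;codecs=vp8,opus b'Hello'
```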