Regex to parse image data URI
Solution 1
EDIT: expanded to show usage
var regex = new Regex(@"data:(?<mime>[\w/\-\.]+);(?<encoding>\w+),(?<data>.*)", RegexOptions.Compiled);
var match = regex.Match(input);
var mime = match.Groups["mime"].Value;
var encoding = match.Groups["encoding"].Value;
var data = match.Groups["data"].Value;
NOTE: The regex applies to the input shown in the question. If a charset were also
specified, the pattern would not match and would have to be rewritten.
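For readers who want the charset case covered, here is a minimal JavaScript sketch of the same named-group idea, loosened to accept optional ";name=value" parameters before ";base64". The pattern and the name matchDataUri are illustrative assumptions, not part of the answer above, and the MIME type is still required, as in Solution 1.

```javascript
// Sketch: named-group regex that tolerates optional ";name=value" parameters
// (for example ";charset=utf-8") between the MIME type and ";base64".
// DATA_URI and matchDataUri are hypothetical names used for illustration.
const DATA_URI =
  /^data:(?<mime>[\w\/\-.+]+)(?<params>(?:;[\w\-]+=[\w\-.+]+)*)(?:;(?<encoding>base64))?,(?<data>.*)$/s;

function matchDataUri(input) {
  const match = DATA_URI.exec(input);
  if (!match) return null; // not a data URI this pattern understands
  const { mime, params, encoding, data } = match.groups;
  // params is the raw ";name=value" run (empty string when there is none);
  // encoding is null when ";base64" is absent.
  return { mime, params, encoding: encoding ?? null, data };
}
```

The parameters are kept as one raw string here; splitting them on ';' and '=' is left to the caller.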
Solution 2
Actually, you don't need a regex for that. According to Wikipedia, the data URI format is
data:[<MIME-type>][;charset=<encoding>][;base64],<data>
so just do the following:
byte[] imagedata = Convert.FromBase64String(imageSrc.Substring(imageSrc.IndexOf(",") + 1));
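The same comma-splitting idea can also recover the MIME type and encoding, and can guard against a missing comma. The following is a minimal JavaScript sketch; splitDataUri is a hypothetical helper name, and it assumes the header contains no quoted parameter values with commas in them.

```javascript
// Sketch: split a data URI at the first comma instead of using a regex.
// splitDataUri is a hypothetical helper, not an existing library function.
function splitDataUri(uri) {
  const comma = uri.indexOf(',');
  // indexOf returns -1 when there is no comma, so check before slicing.
  if (!uri.startsWith('data:') || comma === -1) return null;

  // Header is everything between "data:" and the comma, e.g. "image/gif;base64".
  const parts = uri.slice('data:'.length, comma).split(';');
  const isBase64 = parts[parts.length - 1] === 'base64';
  if (isBase64) parts.pop();

  return {
    mime: parts[0] || null,     // null when the MIME type is omitted
    params: parts.slice(1),     // e.g. ["charset=utf-8"]
    isBase64,
    data: uri.slice(comma + 1), // everything after the comma
  };
}
```

Because the payload is never run through a regex, this approach stays cheap even for very long data strings.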
Solution 3
Data URIs have a bit of complexity to them: they can contain parameters, a media type, and so on, and sometimes you need this information, not just the data.
To parse a data URI and extract all of the relevant parts, try this:
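/**
 * Parse a data URI and return an object with information about the different parts.
 * @param {string} data_uri
 * @returns {object|null} the parsed parts, or null if the input does not match
 */
function parseDataURI(data_uri) {
  // Note: type, subtype and parameter values are limited to letters and a few
  // symbols, so inputs such as ";charset=utf-8" (digit) or "vnd.ms-excel" (dot)
  // do not match this pattern.
  let regex = /^\s*data:(?<media_type>(?<mime_type>[a-z\-]+\/[a-z\-\+]+)(?<params>(;[a-z\-]+\=[a-z\-]+)*))?(?<encoding>;base64)?,(?<data>[a-z0-9\!\$\&\'\,\(\)\*\+\,\;\=\-\.\_\~\:\@\/\?\%\s]*\s*)$/i;
  let result = regex.exec(data_uri);
  // exec() returns null when the input does not match; avoid reading .groups of null
  if (!result)
    return null;
  let info = {
    media_type: result.groups.media_type,
    mime_type: result.groups.mime_type,
    params: result.groups.params,
    encoding: result.groups.encoding,
    data: result.groups.data
  };
  // ";foo=bar;bar=baz" -> { foo: "bar", bar: "baz" }
  if (info.params)
    info.params = Object.fromEntries(info.params.split(';').slice(1).map(param => param.split('=')));
  // ";base64" -> "base64"
  if (info.encoding)
    info.encoding = info.encoding.replace(';', '');
  return info;
}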
This will give you an object that has all the relevant bits parsed out, with the params as a dictionary such as {foo: 'bar'}.
Example (mocha test with assert):
describe("Parse data URI", () => {
  it("Should extract data URI parts correctly", async () => {
    let uri = 'data:text/vnd-example+xyz;foo=bar;bar=baz;base64,R0lGODdh';
    let info = parseDataURI(uri);
    assert.equal(info.media_type, 'text/vnd-example+xyz;foo=bar;bar=baz');
    assert.equal(info.mime_type, 'text/vnd-example+xyz');
    assert.equal(info.encoding, 'base64');
    assert.equal(info.data, 'R0lGODdh');
    assert.equal(info.params.foo, 'bar');
    assert.equal(info.params.bar, 'baz');
  });
});
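Once the parts are known, the payload still has to be decoded and, as raised in the comments, written out with an extension that matches the MIME type. The following is a minimal Node.js sketch of that step; toFile and the EXTENSIONS table are hypothetical, the table is deliberately tiny, and the non-base64 branch assumes the payload is percent-encoded UTF-8 text.

```javascript
// Sketch: turn already-parsed parts into bytes plus a file extension.
// EXTENSIONS is a small illustrative table, not a complete MIME registry.
const EXTENSIONS = {
  'image/png': '.png',
  'image/gif': '.gif',
  'image/jpeg': '.jpg',
};

// toFile is a hypothetical helper; it relies on Node's global Buffer.
function toFile(mimeType, encoding, data) {
  const bytes = encoding === 'base64'
    ? Buffer.from(data, 'base64')                     // base64 payload
    : Buffer.from(decodeURIComponent(data), 'utf8');  // assumed percent-encoded text
  // Fall back to ".bin" for MIME types the table does not know.
  const extension = EXTENSIONS[(mimeType || '').toLowerCase()] || '.bin';
  return { bytes, extension };
}
```

With the values from the test above, toFile('image/gif', 'base64', 'R0lGODdh') yields a six-byte buffer and the extension '.gif'.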
Solution 4
I also needed to parse the data URI scheme. As a result, I adapted the regular expression given on this page specifically for C#, aiming to fit any data URI (the scheme can be checked against the references linked in the original answer).
Here is my solution for C#:
private class DataUriModel {
public string MediaType { get; set; }
public string Type { get; set; }
public string[] Tree { get; set; }
public string Subtype { get; set; }
public string Suffix { get; set; }
public string[] Params { get; set; }
public string Encoding { get; set; }
public string Data { get; set; }
}
static void Main(string[] args) {
string s = "data:image/prs.jpeg+gzip;charset=UTF-8;page=21;page=22;base64,/9j/4AAQSkZJRgABAQAAAQABAAD";
var parsedUri = GetDataURI(s);
Console.WriteLine(parsedUri.Type);
Console.WriteLine(parsedUri.Subtype);
Console.WriteLine(parsedUri.Encoding);
}
private static DataUriModel GetDataURI(string data) {
var result = new DataUriModel();
Regex regex = new Regex(@"^\s*data:(?<media_type>(?<type>[a-z\-]+){1}\/(?<tree>([a-z\-]+\.)+)?(?<subtype>[a-z\-]+){1}(?<suffix>\+[a-z]+)?(?<params>(;[a-z\-]+\=[a-z0-9\-\+]+)*)?)?(?<encoding>;base64)?(?<data>,+[a-z0-9\\\!\$\&\'\,\(\)\*\+\,\;\=\-\.\~\:\@\/\?\%\s]*\s*)?$", RegexOptions.IgnoreCase | RegexOptions.Compiled | RegexOptions.Multiline);
var match = regex.Match(data);
if (!match.Success)
return result;
var names = regex.GetGroupNames();
foreach (var name in names) {
var group = match.Groups[name];
switch (name) {
case "media_type": result.MediaType = group.Value; break;
case "type": result.Type = group.Value; break;
case "tree": result.Tree = !string.IsNullOrWhiteSpace(group.Value) && group.Value.Length > 1 ? group.Value[0..^1].Split(".") : null; break;
case "subtype": result.Subtype = group.Value; break;
case "suffix": result.Suffix = !string.IsNullOrWhiteSpace(group.Value) && group.Value.Length > 1 ? group.Value[1..] : null; break;
case "params": result.Params = !string.IsNullOrWhiteSpace(group.Value) && group.Value.Length > 1 ? group.Value[1..].Split(";") : null; break;
case "encoding": result.Encoding = !string.IsNullOrWhiteSpace(group.Value) && group.Value.Length > 1 ? group.Value[1..] : null; break;
case "data": result.Data = !string.IsNullOrWhiteSpace(group.Value) && group.Value.Length > 1 ? group.Value[1..] : null; break;
}
}
return result;
}
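The type, tree, subtype and suffix fields of DataUriModel can also be derived without a regex once the media type string has been isolated. The following JavaScript sketch shows one way to do it; splitMediaType is a hypothetical helper, and it assumes the input is a bare media type such as "image/prs.jpeg+gzip" with the parameters already removed.

```javascript
// Sketch: break a media type into the same fields as DataUriModel above.
// splitMediaType is a hypothetical helper used only for illustration.
function splitMediaType(mediaType) {
  const slash = mediaType.indexOf('/');
  if (slash === -1) return null; // not of the form "type/subtype"
  const type = mediaType.slice(0, slash);
  let rest = mediaType.slice(slash + 1);

  // Optional suffix after the last "+", e.g. "gzip" in "prs.jpeg+gzip".
  let suffix = null;
  const plus = rest.lastIndexOf('+');
  if (plus !== -1) {
    suffix = rest.slice(plus + 1);
    rest = rest.slice(0, plus);
  }

  // Everything before the last "." is the tree, e.g. ["prs"]; the last facet is the subtype.
  const facets = rest.split('.');
  const subtype = facets.pop();
  return { type, tree: facets.length ? facets : null, subtype, suffix };
}
```

For the sample string used in Main, this gives type "image", tree ["prs"], subtype "jpeg" and suffix "gzip".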
Steven
Updated on June 24, 2022
Comments
-
Steven almost 2 years
If I have:
<img src="data:image/gif;base64,R0lGODlhtwBEANUAAMbIypOVmO7v76yusOHi49AsSDY1N2NkZvvs6VVWWPDAutZOWJ+hpPPPyeqmoNlcYXBxdNTV1nx+gN51c4iJjEdHSfbc19M+UOeZk7m7veSMiNtpauGBfu2zrc4RQSMfIP///wAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACH5BAAAAAAALAAAAAC3AEQAAAb/QJBwSCwaj8ikcslsOp/QqHRKrVqv2Kx2y+16v+CweEwum8/otHrNbrvf8Lh8Tq/b7/i8fs" />
How can I parse the data part into:
- Mime type (image/gif)
- Encoding (base64)
- Image data (the binary data)
-
Steven about 13 years
Thanks for the quick response, but I would like to know the MIME type as well, so that I can write the data into a file with the right extension: .png if the user submits image/png, .gif if the user submits image/gif, etc.
-
Steven about 13 years
I tried this, but using matches.Groups[0].ToString() in C# returned everything instead of the MIME part. Can you expand the code?
-
Steven about 13 years
Here's my code:
string pattern = @"data:(?<mime>[\w/]+);(?<encoding>\w+),(?<data>.*)";
Match matches = Regex.Match(imgsrc, pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
HttpContext.Current.Response.Write("<br />match: " + matches.Groups[0].ToString());
-
Steven about 13 years
I think this part: imageSrc.IndexOf(",") should be imageSrc.IndexOf(",")-1 to prevent the "," from being included in the data.
-
eselk almost 11 years
I'd probably use this method to remove the data part, then use Split(';') to get the other parts. Also, I think it should be IndexOf(",")+1, not -1. Of course, in real code you would also want to check for a -1 (not found) result.
-
Per Lundberg over 10 years
I updated the MIME part of the regexp slightly so that it can be used with MIME types like application/vnd.openxmlformats-officedocument.spreadsheetml.sheet, where \w and / are not enough to match (since . and - are not "word" characters).
-
Roger Far over 7 years
Doesn't work for data:text/plain;charset=utf-8;base64,IVRSTlMJVFJ
-
Rhult about 7 years
I know it's explicit in the question, but I found it was easier to parse without a regular expression by getting the position of the comma and then dealing with the media type/encoding part. The logic was clearer than a complicated regular expression, I could handle cases like @YesMan85 above, and I was getting a bit worried about regex performance for a long data string.
-
jazzcat about 7 years
@Steven you're right, but it's actually +1, not -1. Edited my answer.
-
Randy Burden about 3 years
Thanks, this worked for me. This was the first time I ran into the C# range operator, and since I'm using an older version of C# I had to replace the uses of the range operator with calls to Substring(). For example, instead of group.Value[1..], I used group.Value.Substring(1).
-
werehamster over 2 years
The final part of the regex captures trailing whitespace, so your info.data can end with spaces, tabs and the like. Move the final bracket before the \s, e.g.: regex = /^\s*data:(?<media_type>(?<mime_type>[a-z\-]+\/[a-z\-\+]+)(?<params>(;[a-z\-]+\=[a-z\-]+)*))?(?<encoding>;base64)?,(?<data>[a-z0-9\!\$\&\'\,\(\)\*\+\,\;\=\-\.\_\~\:\@\/\?\%\s]*)\s*$/i;