How to decode base64 text in xml file in Linux?

6,162

Solution 1

I'll say what I always say: please NEVER use regular expressions to parse XML. It's bad news. XML allows a variety of formatting, which means semantically identical XML may or may not match a given regular expression. Simple things like line wrapping, unary tags, etc. are enough to make the difference.

This means you create brittle code, which one day might mysteriously break because of an upstream and perfectly valid change to your data flow.
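To make that concrete, here is a minimal shell sketch using a value from the question. The two variables hold the same element, once on one line and once wrapped, and `count_matching_lines` is a hypothetical helper name introduced only for this illustration. The pattern is the kind of line-oriented expression a regex-based approach tends to rely on.

```shell
#!/bin/sh
# Two <value> elements with the same meaning; only the whitespace differs.
one_line='<value encoding="base64">aGVsbG8gd29ybGQ=</value>'
wrapped='<value encoding="base64">
aGVsbG8gd29ybGQ=
</value>'

# A line-oriented pattern: the attribute, then printable characters
# on the same line.
pattern='"base64">[[:graph:]]+'

# Hypothetical helper: print how many lines of $1 match the extended
# regular expression $2. grep -c exits non-zero when the count is 0,
# so "|| true" keeps the helper from reporting a failure in that case.
count_matching_lines() {
    printf '%s\n' "$1" | grep -cE -- "$2" || true
}

count_matching_lines "$one_line" "$pattern"   # the one-line form matches
count_matching_lines "$wrapped" "$pattern"    # the wrapped form does not
```

An XML parser treats both forms as the same element, which is why the rest of this answer uses one.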

For parsing your XML I would suggest using Perl and the quite excellent XML::Twig module.

#!/usr/bin/perl
use strict;
use warnings;

use XML::Twig;
use MIME::Base64;

# Take a "value" element; if it carries encoding="base64", replace its
# content with the decoded text and delete that attribute from the XML.
sub decode_value {
    my ( $twig, $value ) = @_;
    if (    $value->att('encoding')
        and $value->att('encoding') eq "base64" )
    {
        my $decoded_text = decode_base64( $value->text );

        # Anything beyond whitespace, word characters and = - , . is
        # replaced by the placeholder "decoded", as the question asks.
        if ( $decoded_text =~ m/[^\s\d\w\=\-\,\.]/ ) {
            $decoded_text = "decoded";
        }
        $value->set_text($decoded_text);
        $value->del_att('encoding');
    }
}


# twig_handlers 'fires' a piece of code each time the parser hits a 'value'
# element and passes it that chunk of XML to handle, which means you can
# do things like dynamic XML rewrites.
# pretty_print controls how the output XML is rendered; there's a variety
# of options, check the manpage.
my $twig = XML::Twig->new(
    pretty_print  => "indented",
    twig_handlers => { 'value' => \&decode_value, }
);
$twig->parsefile('your_xml_file');
$twig->print;

This will give:

<directory-entries>
  <entry dn="ads">
    <attr name="memberof">
      <value>CN=VPN-employee</value>
      <value>hello world</value>
      <value>decoded</value>
      <value>decoded</value>
    </attr>
  </entry>
</directory-entries>

You could alternatively transform $decoded_text like this:

$decoded_text =~ s/[^\s\w=,.\-]+/_/g;

(The URI::Escape module is worth a look here too, as it 'percent encodes' text, URL style.)

That would give instead:

  <value>CN=Floppy - _ _,OU=Device Control,OU=Groups,OU=_,DC=hq,DC=bc</value>
  <value>CN=USB-_ - _ _,OU=Device Control,OU=Groups,OU=_,DC=hq,DC=bc</value>
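For anyone staying in the shell, the same idea (collapse every run of characters outside a safe set into one underscore) can be sketched with sed. This is an illustration, not part of the Perl script: `sanitize_dn` is a hypothetical helper name, and the safe set is my reading of the Perl character class. It assumes a `base64` command that accepts `-d`, as GNU coreutils does.

```shell
#!/bin/sh
# Hypothetical helper: read text on stdin and replace each run of bytes
# outside the safe set (whitespace, ASCII letters, digits and _ = , . -)
# with a single underscore. LC_ALL=C makes sed work byte by byte, so the
# bytes of multi-byte UTF-8 characters count as "unsafe" in any locale.
sanitize_dn() {
    LC_ALL=C sed 's/[^[:space:][:alnum:]_=,.-]\{1,\}/_/g'
}

# First wrapped value from the question, kept on its two original lines.
encoded='Q049RmxvcHB5IC0g0LTQvtGB0YLRg9C/INC30LDQutGA0YvRgixPVT1EZXZpY2UgQ29udHJv
bCxPVT1Hcm91cHMsT1U90JHQkNCd0JosREM9aHEsREM9YmM='

# Strip the line break, decode, then sanitize.
printf '%s' "$encoded" | tr -d '[:space:]' | base64 -d | sanitize_dn
printf '\n'
```

Run on that value, this prints the first of the two lines shown above.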

But you might also find that Net::LDAP does what you need, by querying the directory directly instead of post-processing the ldapsearch output:

#!/usr/bin/perl
use strict;
use warnings;

use Net::LDAP;

my $ldap = Net::LDAP->new('host')
    or die "Error connecting to LDAP server: $@";
my $result = $ldap->bind(
    'CN=informatica,OU=Accounts for System Purposes,OU=System Accounts,DC=hq,DC=bc',
    password => 'password'
);
if ( $result->code ) { die "LDAP bind failed: " . $result->error; }

my $ldap_search = $ldap->search(
    base   => 'DC=hq,DC=bc',
    scope  => 'subtree',
    filter => '(&(objectClass=organizationalPerson)(CN=*))',
    attrs  => [ 'employeeID', 'memberOf' ],
);

foreach my $entry ( $ldap_search->entries ) {
    print "dn:\t", $entry->dn(), "\n";
    foreach my $attr ( $entry->attributes ) {
        print "$attr:";
        foreach my $value ( $entry->get_value($attr) ) {
            next unless defined $value;
            if ( $value =~ m/[^\s\d\w,-=+@\'.()]/ ) { $value = "binary_data" }
            chomp($value);
            print "\t$value\n";
        }
    }
}

Solution 2

Compact Script

Assuming the XML is in file.xml, just do:

sed -r 's/("base64">)([[:graph:]]+)/\1'"$(grep -oP '"base64">\K[[:graph:]]+' file.xml | base64 -d)"'/g' file.xml

This is a compact one-liner that does the task as long as the file holds a single base64 value written on one line. Let me break it down and explain.

Break Down

First I select the base64 string using grep and decode it:

grep -oP '"base64">\K[[:graph:]]+' file.xml | base64 -d

I could save this in a variable:

baseString=$(grep -oP '"base64">\K[[:graph:]]+' file.xml | base64 -d)
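The values in the question carry trailing spaces, and some are wrapped over several lines. As far as I know, GNU `base64 -d` tolerates newlines in its input but not spaces, so stripping all whitespace before decoding is a safe default. A minimal sketch, where `decode_payload` is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: decode one base64 payload after removing every
# whitespace character (spaces, tabs, line breaks) from it.
decode_payload() {
    printf '%s' "$1" | tr -d '[:space:]' | base64 -d
}

# Value with trailing spaces, as in the question.
decode_payload 'aGVsbG8gd29ybGQ=   '
printf '\n'

# The same payload wrapped over two lines decodes to the same text.
decode_payload 'aGVsbG8g
d29ybGQ='
printf '\n'
```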

Then use sed to replace the base64 with the decoded string saved in the variable:

sed -r 's/("base64">)([[:graph:]]+)/\1'"$baseString"'/g' file.xml
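Two limits of this command come up in the comments: grep returns every base64 string in the file at once, and sed can fail with "unknown option to `s'", which typically happens when the replacement text contains the `/` delimiter. The sketch below is one possible way around both, under stated assumptions: it handles only values that start and end on a single line, it leaves the encoding attribute in place just as the command above does, and it does not XML-escape the decoded text. Both function names are hypothetical.

```shell
#!/bin/sh
# Escape the characters that are special in the replacement part of sed's
# s/// command: backslash, the / delimiter and &.
escape_sed_replacement() {
    printf '%s' "$1" | sed 's|[\\/&]|\\&|g'
}

# Read XML on stdin and decode base64 values that start and end on one
# line, one value at a time. Every other line, including values wrapped
# over several lines, is passed through unchanged.
decode_single_line_values() {
    while IFS= read -r line; do
        if printf '%s\n' "$line" | grep -qE '"base64">[^<]+<'; then
            payload=$(printf '%s\n' "$line" | sed -E 's|.*"base64">([^<]+)<.*|\1|')
            decoded=$(printf '%s' "$payload" | tr -d '[:space:]' | base64 -d)
            safe=$(escape_sed_replacement "$decoded")
            printf '%s\n' "$line" | sed -E 's/("base64">)[^<]+/\1'"$safe"'/'
        else
            printf '%s\n' "$line"
        fi
    done
}

# Q049Ui9E decodes to CN=R/D, which contains the / delimiter.
decode_single_line_values <<'EOF'
<attr name="memberof">
  <value>CN=VPN-employee</value>
  <value encoding="base64">aGVsbG8gd29ybGQ=   </value>
  <value encoding="base64">Q049Ui9E</value>
</attr>
EOF
```

Decoded text that itself contains line breaks would still break the sed call, and wrapped values are not decoded at all, so for the file in the question an XML parser remains the more robust route.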

Author: Meruyert

Updated on September 18, 2022

Comments

  • Meruyert
    Meruyert over 1 year

    I'm new to Linux (shell). I need to decode base64 text in an XML file using a Linux shell script. Could you please help me write a shell script that decodes the values of those tags whose attribute is encoding="base64"? The structure of my file is:

        <directory-entries>
            <entry dn="ads">
                <attr name="memberof">
                    <value>CN=VPN-employee</value>
                    <value encoding="base64">aGVsbG8gd29ybGQ=   </value>
                    <value encoding="base64">
                    Q049RmxvcHB5IC0g0LTQvtGB0YLRg9C/INC30LDQutGA0YvRgixPVT1EZXZpY2UgQ29udHJv
                    bCxPVT1Hcm91cHMsT1U90JHQkNCd0JosREM9aHEsREM9YmM=
                    </value>
                    <value encoding="base64">
                    Q049VVNCLdC00LjRgdC60LggLSDRgtC+0LvRjNC60L4g0YfRgtC10L3QuNC1LE9VPURldmlj
                    ZSBDb250cm9sLE9VPUdyb3VwcyxPVT3QkdCQ0J3QmixEQz1ocSxEQz1iYw==
                    </value>
                </attr>
            </entry>
        </directory-entries>
    

    The desired output is:

        <directory-entries>
            <entry dn="ads">
                <attr name="memberof">
                    <value>CN=VPN-employee</value>
                    <value encoding="base64">Hello world  </value>
                    <value encoding="base64"> decoded         </value>
                    <value encoding="base64">    decoded         </value>
                </attr>
            </entry>
        </directory-entries>
    

    I'm generating XML from Active Directory using ldapsearch. The script that I used to obtain this file is:

    ldapsearch -h host -p 389 -D "CN=informatica,OU=Accounts for System Purposes,OU=System Accounts,DC=hq,DC=bc" -w password -s sub -B -E UTF-8 -X "(&(objectClass=organizationalPerson)(CN=*))" employeeID memberof > ldap_logins.xml
    

    I don't know if it is possible to decode the text while generating the xml file. Thank you in advance!

    • Stephen Kitt
      Stephen Kitt about 9 years
      I don't have a complete answer, but a couple of hints. On the ldapsearch side, you can use the -t option to output "non-printable" text to temporary files rather than Base64-encoded values. If you want to parse XML, check out XMLStarlet. Also, does the output need to be valid XML? Shouldn't the "encoding" attribute be dropped from the output?
    • Meruyert
      Meruyert about 9 years
      Thank you for the feedback. Yes, the output should be valid XML. I need the decoded value; the attribute itself can be dropped from the output.
    • shivams
      shivams about 9 years
      @Meruyert I've provided a proper answer using an xml parser called xmlstarlet. Just check it, if it helps.
  • Meruyert
    Meruyert about 9 years
    Thank you for your answer! The script works for cases where values do not have line breaks. I have line breaks in values. I've updated the structure of the file in the question, added more examples. Do you have any ideas how to deal with those line breaks?
  • shivams
    shivams about 9 years
    Oh! Multi-line regex is very tricky using bash. For such cases, it is better advised to go for some proper xml parser. However, I will provide some solution using regex. Wait.
  • shivams
    shivams about 9 years
    Yes. Using an XML parser is always the only sane option. @Meruyert please use this solution (if it works fine), rather than going for my regex-based solution.
  • shivams
    shivams about 9 years
    It is unclear which language you are using. @Sobrique.
  • Sobrique
    Sobrique about 9 years
    Wow, that's impressive on my part. Amended answer to indicate that I do mean perl here ;)
  • shivams
    shivams about 9 years
    Sorry for my ignorance. But I am really a new kid. Born in the era of Python, rather than Perl. Done a lot of bash but never touched Perl :/ Perhaps, I should be ashamed :|
  • Sobrique
    Sobrique about 9 years
    Hardly. Perl and Python have very similar use cases. I'm crusty enough to pre-date python, and learned perl back when it was really the only option for extending shell scripting. Still like it though, not least because it remains pretty similar to shell, and very widely supported.
  • elysch
    elysch almost 6 years
    I know this is old. I want to use the sed command, but it says "test" is not defined. Do you remember how it was defined?
  • shivams
    shivams almost 6 years
    @elysch: test is not a command here. I used it to denote the file-name. I should have used file.xml instead. I am correcting it.
  • elysch
    elysch almost 6 years
    I tried that but I get an error: sed: -e expression #1, char 297: unknown option to `s'. I don't know how to find which value is causing the problem.
  • elysch
    elysch almost 6 years
    Another question: how would it know how to "select" each base64 string in the right place? Testing the grep command on its own, it shows all the base64 strings, not just one.