Working with Unicode-encoded strings from Active Directory via python-ldap


First, know that printing to a Windows console is often the step that garbles data, so for your tests, you should print repr(s) to see the precise bytes you have in your string.

You need to find out how the data from AD is encoded. Again, print repr(s) will let you see the content of the data.
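As a small illustration of why repr helps here, the sketch below compares two made-up values modeled on the outputs quoted in the question: one holds real UTF-8 bytes, the other holds literal backslash characters. It is written in Python 3 syntax (the thread itself uses Python 2.7), and both variable names are hypothetical.

```python
# Hypothetical sample values, modeled on the outputs quoted in the question.
real_utf8 = b"M\xc3\xbcller"   # real UTF-8 bytes for "Müller"
escaped = u"M\\xc3\\xbcller"   # text with literal backslashes, no umlaut anywhere

# repr shows exactly what each value contains, independent of the console encoding.
print(repr(real_utf8))
print(repr(escaped))

# The lengths differ: 7 bytes versus 13 characters.
print(len(real_utf8), len(escaped))
```

If repr shows doubled backslashes, the value contains backslash characters, not the bytes they appear to describe.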

UPDATED:

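OK, the repr shows that you're getting a unicode string that contains literal backslash escapes (a backslash followed by xc3, and so on) rather than the bytes themselves. There might be a way to get the data in better shape at the source, but you can convert it in any case, though it isn't pretty: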

u.decode('unicode_escape').encode('iso8859-1').decode('utf8')

You might want to look into whether you can get the data in a more natural format.
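The one-line conversion above can be read as three steps. The sketch below spells them out in Python 3 syntax, where str has no decode method, so an extra encode('ascii') is needed first; unescape_utf8 is a hypothetical helper name, and the sample input is the value quoted in the question.

```python
def unescape_utf8(text):
    """Hypothetical helper: turn text like 'M\\xc3\\xbcller' into 'Müller'."""
    # Step 1: interpret the literal \xNN sequences. Each one becomes a single
    # code point U+00NN, so the result is still mojibake at this point.
    as_code_points = text.encode("ascii").decode("unicode_escape")
    # Step 2: ISO 8859-1 maps code points 0-255 to the identical byte values,
    # which recovers the original raw bytes.
    raw_bytes = as_code_points.encode("iso8859-1")
    # Step 3: those bytes are UTF-8, so decode them as such.
    return raw_bytes.decode("utf8")


print(unescape_utf8("M\\xc3\\xbcller"))
```

This assumes the input is pure ASCII with \xNN escapes. Note that unicode_escape also interprets other backslash sequences such as \n, so values that legitimately contain backslashes would be altered.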

Author by

Raptor


Updated on June 04, 2022

Comments

  • Raptor
    Raptor almost 2 years

    I have already asked about this problem, but after some testing I decided to create a new question with more specific details:

    I am reading user accounts from our Active Directory with python-ldap (and Python 2.7). This works well, but I have problems with special characters. They look like UTF-8 encoded strings when printed on the console. The goal is to write them into a MySQL DB, but I can't get those strings into proper UTF-8 in the first place.

    Example (fullentries is my array with all the AD entries):

    fullentries[23][1].decode('utf-8', 'ignore')    
    print fullentries[23][1].encode('utf-8', 'ignore')
    print fullentries[23][1].encode('latin1', 'ignore')
    print repr(fullentries[23][1])
    

    A second test with a string inserted by hand as follows:

    testentry = "M\xc3\xbcller"
    testentry.decode('utf-8', 'ignore')
    print testentry.encode('utf-8', 'ignore')
    print testentry.encode('latin1', 'ignore')
    print repr(testentry)
    

    The output of the first example is:

    M\xc3\xbcller
    M\xc3\xbcller
    u'M\\xc3\\xbcller'
    

    Edit: If I try to replace the double backslashes with .replace('\\\\','\\'), the output remains the same.
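A likely reason the replace attempt above changes nothing: repr doubles each backslash for display, so the string itself contains single backslashes and the search pattern never matches. A minimal check in Python 3 syntax, using the value quoted above (the variable name is hypothetical):

```python
# Hypothetical variable holding the value whose repr is shown above.
value = u"M\\xc3\\xbcller"

# The string contains two single backslashes, not two pairs.
print(value.count("\\"))

# Searching for two consecutive backslashes finds nothing,
# so replace returns an equal string.
print(value.replace("\\\\", "\\") == value)
```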

    The output of the second example:

    Müller
    M�ller
    'M\xc3\xbcller'
    

    Is there any way to get the AD output properly encoded? I already read a lot of documentation, but it all states that LDAPv3 gives you strictly UTF-8 encoded strings. Active Directory uses LDAPv3.
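For comparison, this is roughly what handling well-formed results would look like. The sketch assumes data shaped like python-ldap search results, a list of (dn, attributes) tuples where each attribute maps to a list of byte strings; sample_results is made-up data and decode_entry is a hypothetical helper, written in Python 3 syntax.

```python
# Made-up sample shaped like python-ldap search results:
# a list of (dn, {attribute: [bytes, ...]}) tuples.
sample_results = [
    (
        "CN=M\u00fcller,OU=Users,DC=example,DC=com",
        {"sn": [b"M\xc3\xbcller"], "mail": [b"mueller@example.com"]},
    ),
]


def decode_entry(attrs, encoding="utf-8"):
    """Hypothetical helper: decode every value of every attribute to text."""
    return {
        name: [value.decode(encoding) for value in values]
        for name, values in attrs.items()
    }


for dn, attrs in sample_results:
    decoded = decode_entry(attrs)
    print(dn, decoded["sn"][0])
```

Binary attributes such as objectGUID or objectSid are not text and would need to be skipped before decoding.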

    My older question on this topic is here: Writing UTF-8 String to MySQL with Python

    Edit: Added repr(s) output