How can I convert Unicode to uppercase to print it?


Solution 1

I think it's as simple as not converting to ASCII first.

>>> print u'exámple'.upper()
EXÁMPLE
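For comparison, in Python 3 every str is already a Unicode string, so the u prefix is unnecessary:

```python
# Python 3: plain string literals are Unicode, so upper() handles accents directly
print('exámple'.upper())  # EXÁMPLE
```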

Solution 2

In Python 2.x, just convert the string to Unicode before calling upper(). Using your string, which is in UTF-8 on this webpage:

>>> s = 'exámple'
>>> s
'ex\xc3\xa1mple'  # my terminal is not UTF-8; \xc3\xa1 is the UTF-8 encoding of á
>>> s.decode('utf-8').upper()
u'EX\xc1MPLE'  # \xc1 is the Unicode code point U+00C1, i.e. Á

The call to decode converts the string from its current encoding to Unicode. You can then convert it to some other encoding, such as UTF-8, using encode. If the string were in, say, ISO-8859-2 (Czech etc., in this case), you would instead use s.decode('iso-8859-2').upper().
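A minimal sketch of that round trip in Python 3 syntax, using the byte values shown above:

```python
# UTF-8 bytes -> text -> upper-case -> UTF-8 bytes
raw = b'ex\xc3\xa1mple'              # "exámple" encoded as UTF-8
upper = raw.decode('utf-8').upper()  # 'EXÁMPLE' (Á is U+00C1)
print(upper.encode('utf-8'))         # b'EX\xc3\x81MPLE'

# The same word arriving as ISO-8859-2, where á is the single byte 0xE1
print(b'ex\xe1mple'.decode('iso-8859-2').upper())  # EXÁMPLE
```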

If, as in my case, your terminal is not Unicode/UTF-8 capable, the best you can hope for is either the hex representation of the characters (like mine) or a lossy conversion with s.decode('utf-8').upper().encode('ascii', 'replace'), which yields 'EX?MPLE'. If you can't make your terminal show Unicode, write the output to a file in UTF-8 and open that in your favourite editor.
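Both fallbacks sketched in Python 3 syntax (the filename is illustrative):

```python
text = b'ex\xc3\xa1mple'.decode('utf-8').upper()

# Lossy fallback for a non-Unicode terminal: unmappable characters become '?'
print(text.encode('ascii', 'replace'))  # b'EX?MPLE'

# Or write proper UTF-8 to a file and open it in an editor ('out.txt' is illustrative)
with open('out.txt', 'w', encoding='utf-8') as f:
    f.write(text)
```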

Solution 3

First off, I only use Python 3.1 these days; its central merit is having disambiguated byte strings from Unicode objects. This makes the vast majority of text manipulations much safer than they used to be. Weighing in the trillions of user questions about Python 2.x encoding problems, the u'äbc' convention of Python 2 was simply a mistake; with explicit bytes and bytearray, life becomes so much easier.

Secondly, if Py3k is not your flavor, then try from __future__ import unicode_literals, as this mimics Py3k's behavior on Python 2.6 and 2.7. It would have avoided the (easily committed) blunder of writing print 'exámple'.upper(). Essentially, that is the same as this in Py3k: print('exámple'.encode('utf-8').upper()). Compare these versions (for Py3k):

print('exámple'.encode('utf-8').upper())
print('exámple'.encode('utf-8').upper().decode('utf-8'))
print('exámple'.upper())

The first one is basically what you did with the bare string 'exámple', provided your default encoding is UTF-8. (According to a BDFL pronouncement, setting the default encoding at run time is a bad idea, so in Python 2 you would have to trick it with import sys; reload(sys); sys.setdefaultencoding('utf-8'); I present a better solution for Py3k below.) When you look at the output of these three lines:

b'EX\xc3\xa1MPLE'
EXáMPLE
EXÁMPLE

You can see that when upper() was applied to the first text, it acted on bytes, not on characters. Python allows the upper() method on bytes, but it is defined only for the US-ASCII interpretation of those bytes. Since UTF-8 uses byte values outside of US-ASCII (128 through 255, which US-ASCII does not use), those bytes are not affected by upper(), so when we decode back in the second line we get the lower-case á. The third line does it right, and yes, surprise, Python is aware that Á is the upper-case letter corresponding to á. I ran a quick test to see which characters Python 3 does not convert between upper and lower case:

# print each of the first 3000 code points that has neither an
# upper- nor a lower-case variant
for cid in range(3000):
  my_chr = chr(cid)
  if my_chr == my_chr.upper() and my_chr == my_chr.lower():
    print(my_chr)

Perusing the list reveals very few Latin, Cyrillic, or Greek letters; most of the output is non-European characters and punctuation. The only characters I could find that Python got wrong are Ԥ/ԥ (\u0524, \u0525, CYRILLIC {CAPITAL|SMALL} LETTER PE WITH DESCENDER), so as long as you stay outside of the Latin Extended-X blocks (check those out, they might yield surprises), you can actually use this method. Of course, I did not check the correctness of the mappings themselves.

Lastly, here is what I put into my Py3k application boot section: a function that redefines the encoding sys.stdout uses, with numerical character references (NCRs) as the fallback; the effect is that printing to standard output will never raise a Unicode encoding error. When I work on Ubuntu, sys.stdout.encoding is utf-8; when the same program runs on Windows, it might be something quaint like cp850. The output might look strange, but the application runs without raising an exception on those dim-witted terminals.

#===========================================================================================================
# MAKE STDOUT BEHAVE IN A FAILSAFE MANNER
#-----------------------------------------------------------------------------------------------------------
import io  as _sys_io
import sys as _sys

def _harden_stdout():
  """Ensure that unprintable output to STDOUT does not cause encoding errors; use XML character references
  so any kind of output gets a chance to render in a decipherable way."""
  global _sys_TRM
  _sys.stdout       = _sys_TRM = _sys_io.TextIOWrapper(
    _sys.stdout.buffer,
    encoding        = _sys.stdout.encoding,
    errors          = 'xmlcharrefreplace',
    line_buffering  = True )
#...........................................................................................................
_harden_stdout()
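On Python 3.7 and later, a sketch of the same idea without building the wrapper by hand, using TextIOWrapper.reconfigure():

```python
import sys

# Python 3.7+: change only the error handler of the existing stdout wrapper;
# unencodable characters come out as numeric character references like &#225;
sys.stdout.reconfigure(errors='xmlcharrefreplace')
print('ex\u00e1mple')
```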

One more piece of advice: when testing, always print repr(x) or something similar that reveals the identity of x. All kinds of misunderstandings crop up if you just print x in Python 2 and x may be either a byte string or a unicode object; it is very puzzling and prone to cause a lot of head-scratching. As I said, try to move at least to Python 2.6 with that from __future__ import unicode_literals incantation.
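A quick sketch of why repr() helps: in Python 3 it makes the bytes/str distinction immediately visible:

```python
s = 'exámple'
b = s.encode('utf-8')
print(repr(s))  # 'exámple'          -> unambiguously text
print(repr(b))  # b'ex\xc3\xa1mple'  -> unambiguously bytes
```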

And to close, quoting a quote: Glyph Lefkowitz says it best in his article "Encoding":

"I believe that in the context of this discussion, the term 'string' is meaningless. There is text, and there is byte-oriented data (which may very well represent text, but is not yet converted to it). In Python types, Text is unicode. Data is str. The idea of 'non-Unicode text' is just a programming error waiting to happen."

Update: I just found that Python 3 correctly converts ſ (LATIN SMALL LETTER LONG S) to S when uppercasing. Neat!
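That mapping is easy to verify, along with the better-known one-to-many case of ß:

```python
# Full Unicode case mapping in Python 3, including one-to-many cases
print('ſ'.upper())  # S   (U+017F LATIN SMALL LETTER LONG S)
print('ß'.upper())  # SS  (one character uppercases to two)
```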

Solution 4

I think there's a bit of background we're missing here:

>>> type('hello')
<type 'str'>

>>> type(u'hello')
<type 'unicode'>

As long as you're using "unicode" strings instead of "native" strings, methods like upper() will operate with Unicode in mind. FWIW, Python 3 uses Unicode by default, making the distinction largely irrelevant.

Taking a string from unicode to str and then back to unicode is suboptimal in many ways, and many libraries will produce unicode output if you ask for it, so try to use only unicode objects for strings internally whenever you can.
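That advice can be sketched as the usual "unicode sandwich": decode bytes once at the input boundary, work with text internally, encode once on output (the helper names here are illustrative):

```python
def load_text(path):
    with open(path, 'rb') as f:          # boundary: bytes in
        return f.read().decode('utf-8')  # text from here on

def save_text(path, text):
    with open(path, 'wb') as f:          # boundary: bytes out
        f.write(text.encode('utf-8'))

save_text('example.txt', 'exámple')      # 'example.txt' is illustrative
print(load_text('example.txt').upper())  # EXÁMPLE
```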

Author: NextNightFlyer

Updated on March 31, 2020
