How to use unicode characters in Windows command line?

Solution 1

My background: I have used Unicode input/output in a console for years (and do it a lot daily; moreover, I develop support tools for exactly this task). There are very few problems, as long as you understand the following facts/limitations:

  • CMD and “console” are unrelated factors. CMD.exe is just one of the programs which are ready to “work inside” a console (“console applications”).
  • AFAIK, CMD has perfect support for Unicode; you can enter/output all Unicode chars when any codepage is active.
  • Windows’ console has A LOT of support for Unicode — but it is not perfect (just “good enough”; see below).
  • chcp 65001 is very dangerous. Unless a program was specially designed to work around defects in the Windows’ API (or uses a C runtime library which has these workarounds), it would not work reliably. Win8 fixes ½ of these problems with cp65001, but the rest is still applicable to Win10.
  • I work in cp1252. As I already said: To input/output Unicode in a console, one does not need to set the codepage.
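
One defect reported in the comments on this page is that, under code page 65001, calls that should return a number of bytes may return a number of characters instead. The following is a minimal, platform-independent sketch of why those two numbers differ for UTF-8 text. It is an illustration only: it does not call the Windows API and does not reproduce the bug itself.

```python
# Illustration only: "number of characters" and "number of bytes"
# diverge as soon as text leaves the ASCII range.
text = "αβγ"  # three Greek letters, U+03B1..U+03B3

encoded = text.encode("utf-8")

char_count = len(text)     # code points in the string
byte_count = len(encoded)  # bytes that a UTF-8 stream would carry

print(char_count, byte_count)  # 3 6

# For pure ASCII the two counts coincide, which is why programs that
# confuse them can appear to work until non-ASCII text shows up.
ascii_text = "abc"
ascii_counts_match = len(ascii_text) == len(ascii_text.encode("utf-8"))
print(ascii_counts_match)  # True
```

A caller that expects a byte count but receives a character count may misjudge how much data was actually transferred, which fits the symptoms described in the comments (incomplete reads, hangs).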

The details

  • To read/write Unicode to a console, an application (or its C runtime library) should be smart enough to use the Console-I/O API rather than the File-I/O API. (For an example, see how Python does it.)
  • Likewise, to read Unicode command-line arguments, an application (or its C runtime library) should be smart enough to use the corresponding API.
  • Console font rendering supports only Unicode characters in the BMP (in other words: below U+10000). Only simple text rendering is supported (so European languages, and some East Asian ones, should work fine, as long as one uses precomposed forms). [There is a minor fine print here for East Asian and for characters U+0000, U+0001, U+30FB.]
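
The two rendering limitations above (BMP only, precomposed forms only) can be checked for a given string before sending it to a console. The sketch below is an assumption-laden illustration in Python, not part of the original answer: `utf16_units` is a hypothetical helper name, and the sample characters are arbitrary.

```python
import unicodedata


def utf16_units(ch: str) -> int:
    """Return how many 16-bit code units UTF-16 needs for this string."""
    return len(ch.encode("utf-16-le")) // 2


bmp_char = "š"               # U+0161, inside the BMP
astral_char = "\U0001F600"   # U+1F600, beyond the BMP

print(utf16_units(bmp_char))     # 1
print(utf16_units(astral_char))  # 2, i.e. a surrogate pair

# Precomposed (NFC) versus decomposed (NFD) forms of the same letter.
precomposed = unicodedata.normalize("NFC", "s\u030c")  # s + combining caron
decomposed = unicodedata.normalize("NFD", "š")

print(len(precomposed), len(decomposed))  # 1 2
```

A character that needs two UTF-16 code units lies outside the BMP, and a string whose NFC form is shorter than the original contains combining sequences; normalizing to NFC before output is one way to stay within the "precomposed forms" limitation.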

Practical considerations

  • The defaults on Windows are not very helpful. For the best experience, one should tune up 3 pieces of configuration:

    • For output: a comprehensive console font. For best results, I recommend my builds. (The installation instructions are present there — and also listed in other answers on this page.)
    • For input: a capable keyboard layout. For best results, I recommend my layouts.
    • For input: allow HEX input of Unicode.
  • One more gotcha with “Pasting” into a console application (very technical):

    • HEX input delivers a character on KeyUp of Alt; all the other ways to deliver a character happen on KeyDown; so many applications are not ready to see a character on KeyUp. (Only applicable to applications using Console-I/O API.)
    • Conclusion: many applications would not react to HEX input events.
    • Moreover, what happens with a “Pasted” character depends on the current keyboard layout: if the character can be typed without using prefix keys (but with arbitrary complicated combination of modifiers, as in Ctrl-Alt-AltGr-Kana-Shift-Gray*) then it is delivered on an emulated keypress. This is what any application expects — so pasting anything which contains only such characters is fine.
    • However, the “other” characters are delivered by emulating HEX input.

    Conclusion: unless your keyboard layout supports input of A LOT of characters without prefix keys, some buggy applications may skip characters when you Paste via Console’s UI: Alt-Space E P. (This is why I recommend using my keyboard layouts!)

One should also keep in mind that the “alternative, ‘more capable’ consoles” for Windows are not consoles at all. They do not support Console-I/O APIs, so the programs which rely on these APIs to work would not function. (The programs which use only “File-I/O APIs to the console filehandles” would work fine, though.)

One example of such a non-console is a part of Microsoft’s PowerShell. I do not use it; to experiment, press and release WinKey, then type powershell.


(On the other hand, there are programs such as ConEmu or ANSICON which try to do more: they “attempt” to intercept Console-I/O APIs to make “true console applications” work too. This definitely works for toy example programs; in real life, this may or may not solve your particular problems. Experiment.)

Summary

  • set font, keyboard layout (and optionally, allow HEX input).

  • use only programs which go through Console-I/O APIs, and accept Unicode command-line arguments. For example, any cygwin-compiled program should be fine. As I already said, CMD is fine too.

UPD: Initially, for a bug in cp65001, I was mixing up Kernel and CRTL layers (UPD²: and Windows user-mode API!). Also: Win8 fixes one half of this bug; I clarified the section about “better console” application, and added a reference to how Python does it.

Solution 2

Try:

chcp 65001

which will change the code page to UTF-8. Also, you need to use the Lucida Console font.

Solution 3

I had the same problem (I'm from the Czech Republic). I have an English installation of Windows, and I have to work with files on a shared drive. Paths to the files include Czech-specific characters.

The solution that works for me is:

In the batch file, change the code page

My batch file:

chcp 1250
copy "O:\VEŘEJNÉ\ŽŽŽŽŽŽ\Ž.xls" c:\temp

The batch file has to be saved in CP 1250.

Note that the console will not show characters correctly, but it will understand them...
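
Why the file must be saved in CP 1250, and why the console may still display the wrong glyphs, comes down to which byte stands for each character. The following Python sketch illustrates this under stated assumptions: cp437 is used as an example of an OEM code page on an English installation, and the actual code page on a given machine may differ.

```python
# The path from the batch file above.
path = r"O:\VEŘEJNÉ\ŽŽŽŽŽŽ\Ž.xls"

# In CP 1250 every character of this path is a single byte.
cp1250_bytes = path.encode("cp1250")
round_trip = cp1250_bytes.decode("cp1250")

# The same letter has different byte representations in different encodings.
z_cp1250 = "Ž".encode("cp1250")  # b'\x8e'
z_utf8 = "Ž".encode("utf-8")     # b'\xc5\xbd'

# If the CP 1250 byte is interpreted in another code page (cp437 is an
# assumed example), a different character appears.
z_misread = z_cp1250.decode("cp437")

print(round_trip == path)  # True
print(z_cp1250, z_utf8)    # b'\x8e' b'\xc5\xbd'
print(z_misread)           # Ä
```

This is consistent with the note above: after chcp 1250 the bytes in the file are interpreted as intended, so the path is correct, while what is drawn on screen also depends on the console font.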

Solution 4

Check the language for non-Unicode programs. If you have problems with Russian in the Windows console, then you should set Russian here:

Changing language for non-Unicode programs

Solution 5

It is quite difficult to change the default code page of the Windows console. When you search the web you find different proposals; however, some of them may break your Windows installation entirely, i.e. your PC does not boot anymore.

The safest solution is this one: go to the registry key HKEY_CURRENT_USER\Software\Microsoft\Command Processor and add a String value Autorun = chcp 65001.

Or you can use this small Batch-Script for the most common code pages.

@ECHO off
REM Registry hive to write to; HKEY_CURRENT_USER affects the current user only.
SET ROOT_KEY="HKEY_CURRENT_USER"
REM Read the system default OEM code page so that option 9 can display it.
FOR /f "skip=2 tokens=3" %%i in ('reg query HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage /v OEMCP') do set OEMCP=%%i
ECHO System default values:
ECHO.
ECHO ...............................................
ECHO Select Codepage 
ECHO ...............................................
ECHO.
ECHO 1 - CP1252
ECHO 2 - UTF-8
ECHO 3 - CP850
ECHO 4 - ISO-8859-1
ECHO 5 - ISO-8859-15
ECHO 6 - US-ASCII
ECHO.
ECHO 9 - Reset to System Default (CP%OEMCP%)
ECHO 0 - EXIT
ECHO.
SET /P  CP="Select a Codepage: "
REM Each branch stores a chcp command in the Autorun value, which CMD runs at startup.
REM Both sides of the comparison are quoted so that an empty answer is not a syntax error.
if "%CP%"=="1" (
    echo Set default Codepage to CP1252
    reg add "%ROOT_KEY%\Software\Microsoft\Command Processor" /v Autorun /t REG_SZ /d "@chcp 1252>nul" /f
) else if "%CP%"=="2" (
    echo Set default Codepage to UTF-8
    reg add "%ROOT_KEY%\Software\Microsoft\Command Processor" /v Autorun /t REG_SZ /d "@chcp 65001>nul" /f
) else if "%CP%"=="3" (
    echo Set default Codepage to CP850
    reg add "%ROOT_KEY%\Software\Microsoft\Command Processor" /v Autorun /t REG_SZ /d "@chcp 850>nul" /f
) else if "%CP%"=="4" (
    echo Set default Codepage to ISO-8859-1
    reg add "%ROOT_KEY%\Software\Microsoft\Command Processor" /v Autorun /t REG_SZ /d "@chcp 28591>nul" /f
) else if "%CP%"=="5" (
    echo Set default Codepage to ISO-8859-15
    reg add "%ROOT_KEY%\Software\Microsoft\Command Processor" /v Autorun /t REG_SZ /d "@chcp 28605>nul" /f
) else if "%CP%"=="6" (
    echo Set default Codepage to ASCII
    reg add "%ROOT_KEY%\Software\Microsoft\Command Processor" /v Autorun /t REG_SZ /d "@chcp 20127>nul" /f
) else if "%CP%"=="9" (
    echo Reset Codepage to System Default
    reg delete "%ROOT_KEY%\Software\Microsoft\Command Processor" /v AutoRun /f
) else if "%CP%"=="0" (
    echo Bye
) else (
    echo Invalid choice
    pause
)

Using @chcp 65001>nul instead of chcp 65001 suppresses the output "Active code page: 65001" that you would otherwise get every time you start a new command line window.

A full list of all available code page numbers can be found at Code Page Identifiers.

Note that the settings will apply only to the current user. If you would like to set it for all users, replace the line SET ROOT_KEY="HKEY_CURRENT_USER" with SET ROOT_KEY="HKEY_LOCAL_MACHINE"
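
The menu branches in the script differ only in the code page number, so the same idea can be expressed as a lookup table. The Python sketch below is an illustration only: `CODE_PAGES` and `autorun_command` are hypothetical names, the numbers are copied from the script above, and the function only builds the command text without touching the registry.

```python
# Code page identifiers as used in the batch script above.
CODE_PAGES = {
    "CP1252": 1252,
    "UTF-8": 65001,
    "CP850": 850,
    "ISO-8859-1": 28591,
    "ISO-8859-15": 28605,
    "US-ASCII": 20127,
}


def autorun_command(name: str, root_key: str = "HKEY_CURRENT_USER") -> str:
    """Build the 'reg add' command line that sets the Autorun value.

    Hypothetical helper: it returns a string and does not execute anything.
    """
    code_page = CODE_PAGES[name]
    key = rf"{root_key}\Software\Microsoft\Command Processor"
    return (
        f'reg add "{key}" /v Autorun /t REG_SZ '
        f'/d "@chcp {code_page}>nul" /f'
    )


print(autorun_command("UTF-8"))
print(autorun_command("CP850", root_key="HKEY_LOCAL_MACHINE"))
```

Printing the command first, rather than executing it, makes it easy to review exactly what would be written before changing the registry.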

Author by

Vilx-

Just your average everyday programmer. #SOreadytohelp

Updated on July 16, 2022

Comments

  • Vilx-
    Vilx- 5 months

    We have a project in Team Foundation Server (TFS) that has a non-English character (š) in it. When trying to script a few build-related things we've stumbled upon a problem - we can't pass the š letter to the command-line tools. The command prompt or what not else messes it up, and the tf.exe utility can't find the specified project.

    I've tried different formats for the .bat file (ANSI, UTF-8 with and without BOM) as well as scripting it in JavaScript (which is Unicode inherently) - but no luck. How do I execute a program and pass it a Unicode command line?

  • user3176001
    user3176001 over 13 years
    This is probably a bit dangerous as you could get naming conflict. e.g., if you have two files both which render as "???", and you enter "cd ???" it wouldn't know which to use (or worse would choose an arbitrary one).
  • User
    User over 13 years
    You don't enter ???, you enter the real name it's just being displayed as ???. Think of it as of a password input box. Whatever you enter is displayed as ***, but submitted is the original text.
  • AnnanFay
    AnnanFay about 11 years
    Do you know if there's a way to make this the default?
  • Danubian Sailor
    Danubian Sailor about 11 years
    By me Lucida font stays chosen, but chcp must be typed each time... anyway great thanx for this tip, I didn't even thought it is possible :)
  • Amit Patil
    Amit Patil almost 11 years
    Note there are serious implementation bugs in Windows's code page 65001 support which will break many applications that rely on the C standard library IO methods, so this is very fragile. (Batch files also just stop working in 65001.) Unfortunately UTF-8 is a second-class citizen in Windows.
  • Vilx-
    Vilx- almost 11 years
    Upvotes for everyone and accepted this answer because it's the most upvoted one. We moved away from TFS not long after this question was posted, so it's not relevant anymore. I also can't say if it works or not because we don't have a TFS server anymore to test on.
  • Lea Hayes
    Lea Hayes over 10 years
    Cheers! I needed this so that I could input the copyright character within my batch file.
  • Vilx-
    Vilx- about 10 years
    Windows-1253 isn't a Unicode code page. It's a standard 256-character code page. Apparently you only used characters that can be displayed in that code page, but it won't be universal.
  • Roman Starkov
    Roman Starkov about 10 years
    @bobince Do you have an example of a bug in the Windows code page 65001 support? I'm curious because I've never run into one, and googling didn't turn anything up either. (Batch files do stop working, of course, but UTF-8 is hardly a second-class citizen...)
  • Amit Patil
    Amit Patil about 10 years
    @romkyns: My understanding is that calls that return a number-of-bytes (such as fread/fwrite/etc) actually return a number-of-characters. This causes a wide variety of symptoms, such as incomplete input-reading, hangs in fflush, the broken batch files and so on. Some background. The default code pages used for CJK "multibyte" locales have special handling built in to fix this, but 65001 doesn't - it is not supported.
  • Roman Starkov
    Roman Starkov about 10 years
    @bobince ah, thank you, that was interesting. Also found this, which has more info about the status of the bug...
  • Amit Patil
    Amit Patil about 10 years
    @romkyns: aha! Thanks, I knew I had read more about it on Kaplan's blog but couldn't dig the post out. Depressing how long this has gone without fix (or even adequate doc).
  • Admin
    Admin over 9 years
    @romkyns, and though I'm late, here is a bug, with Python 3.3.2 on Windows XP and a console with chcp 65001 and Lucida Console: just build a string "s" with characters 945 to 969 (it's the Greek alphabet). Then just try to show "s" (not even calling "print"). It's printed on three lines, with "s" on the first and garbage on the two others.
  • Basic
    Basic about 9 years
    Interesting question here though - is the bug because it should report bytes and instead reports characters - or because the applications using it have assumed bytes=characters incorrectly? In other words, is it an API fail or an API usage fail?
  • Seany84
    Seany84 almost 9 years
    This worked perfectly for me too in an almost identical situation to yours. Instead my path contained Irish Gaelic characters i.e. á, é, í, ó, and ú.
  • caglaror about 8 years
    @vanna that solves my "Turkish characters and spaces in path on network problem". you are great.
  • alexchandel
    alexchandel over 7 years
    Updated Kaplan blog on broken UTF-8 in windows available here, since Microsoft deleted all his blog posts after he rubbed a higher-up the wrong way.
  • Admin
    Admin about 7 years
    nice idea and usable example too!
  • Vilx-
    Vilx- almost 7 years
    In this (old) case, the issue was with a script rather than a console. Would using bash scripts solve this?
  • Steve Barnes
    Steve Barnes almost 7 years
    Yes, indeed they would: bash scripts can be flagged as UTF-8 and just work, with a lot more power than Windows batch files. I know that it was an old case, but I thought the option was worth flagging for future reference, as MS doesn't seem to be getting much better at Unicode.
  • Vlastimil Ovčáčík
    Vlastimil Ovčáčík almost 7 years
    You probably just needed to use different font to also display the characters correctly, Lucida Console worked for me.
  • endolith
    endolith about 6 years
    "Windows-1250 is a code page used under Microsoft Windows to represent texts in Central European and Eastern European languages that use Latin script, such as Polish, Czech, Slovak, Hungarian, Slovene, Bosnian, Croatian, Serbian (Latin script), Romanian (before 1993 spelling reform) and Albanian."
  • Ohad Schneider
    Ohad Schneider almost 6 years
    Doesn't work for me with Hebrew characters in Windows 10 (Lucida console + chcp 65001),
  • Peter Mortensen
    Peter Mortensen almost 6 years
    grep, find, and less.
  • Tony Wall almost 6 years
    Thanks it works! I don't know why people voted this down, it is a valid alternative for some people.. This codepage 1252 does fix the problem also on Windows Server 2012, where the same code with CP 65001 did not work for me. I suppose it depends in what codepage the batch script was edited with, or the OS defaults. In this case it was created with Notepad on a German MUI machine with en-US base OS..
  • asmaier
    asmaier almost 6 years
    Better use the font "Consolas". Lucida Console is missing unicode characters like 02B9 .
  • maviz
    maviz almost 6 years
    To make utf-8 the default encoding: go to [HKEY_LOCAL_MACHINE\Software\Microsoft\Command Processor\Autorun] and set it to chcp 65001
  • Eryk Sun
    Eryk Sun almost 6 years
    The console's (conhost.exe) support for codepage 65001 is fundamentally broken (for both input and output in Windows 7, but still broken for input in Windows 10). Please remove this suggestion to avoid repeating this bad advice in an endless loop of naive 'help'. The cmd shell is a Unicode application that uses the console's UTF-16 API and base APIs CreateProcessW and ShellExecuteExW. If there's a problem with handling the command-line, it's because the application is using the ANSI encoded char * version from a standard C main instead of the wchar_t * from a wmain entry point.
  • Eryk Sun
    Eryk Sun over 5 years
    In general using codepage 65001 will only work without bugs in Windows 10 with the Creators update. In Windows 7 it will have both output and input bugs. In Windows 8 and older versions of Windows 10 it only has the input bug, which limits input to 7-bit ASCII.
  • wisbucky
    wisbucky about 5 years
    This does indeed work for commands run directly in the command prompt. However, when running a .cmd batch file, I still need to put chcp 65001 at the top of the batch file.
  • ivan_pozdeev
    ivan_pozdeev about 5 years
    Due to the poor support, you're better off using alternative consoles if you need reliable Unicode. Like Console2 for Windows programs and mintty for Cygwin ones (that's the reason why they rolled out mintty in the first place).
  • ivan_pozdeev
    ivan_pozdeev about 5 years
    That doesn't enable support for Unicode in cmd, it only switches the default codepage to cp866 which is still an 8-bit character set. It even uses cp866 instead of cp1251 which adds its own shitload of trouble.
  • ivan_pozdeev
    ivan_pozdeev about 5 years
    @eryksun what about the font? I get an impression that cmd fundamentally uses 8-bit character points for display, so it cannot possibly support more than 256 at a time.
  • ivan_pozdeev
    ivan_pozdeev about 5 years
    cp1250 is still an 8-bit character set, it still only supports 256 characters, just changes what those characters are.
  • Eryk Sun
    Eryk Sun about 5 years
    @ivan_pozdeev, CMD is a standard I/O shell, not a console or terminal. For console handles, it uses the Unicode console functions ReadConsoleW and WriteConsoleW, which read and write UTF-16 text from and to its attached console host process, conhost.exe. If a file handle is not a console (e.g. reading a batch file or reading piped input from a for /f loop, or redirecting dir to a pipe), CMD's built-in commands use the console's input or output codepage as the encoding. For output, you can override this to UTF-16 via CMD's /u option.
  • Eryk Sun
    Eryk Sun about 5 years
    @ivan_pozdeev, the console uses 16-bit character cells. In principle it can display any character in the BMP. However, it doesn't use Uniscribe/DirectWrite, so it doesn't support complex scripts (e.g. right-to-left text) or automatic fallback fonts. Manual font linking in the registry is possible, but the results aren't very good, so in practice it's limited to what the current font supports. A character beyond the BMP is written as a UTF-16 surrogate pair in two logically separate cells, so it renders as two default glyphs (e.g. empty boxes), but it can be copied to the clipboard fine.
  • WesternGun
    WesternGun about 5 years
    In your case, it is a font problem... the content is there, just no proper font to display it. But OP is different.
  • Cheers and hth. - Alf
    Cheers and hth. - Alf about 5 years
    –1 UTF-8 in consoles works only partially and only for output. Additionally the question isn't about i/o but about command line arguments. Over 300 incompetents so far have upvoted this advice. That's impressive.
  • Vilx-
    Vilx- about 5 years
    OK, for something this thorough, you deserve to be the accepted answer! Awesome!
  • Ssuching Yu
    Ssuching Yu almost 5 years
    This didn't work for me. The Chinese characters in the output of point command are still garbled.
  • code4j
    code4j almost 5 years
    @SiqingYu I give up the crazy setting. Just use blog.miniasp.com/post/2015/09/27/Useful-tool-Cmder.aspx
  • Ssuching Yu
    Ssuching Yu almost 5 years
    I used Cmder before, but it cannot replace the developer console used by Visual Studio.
  • code4j
    code4j almost 5 years
    @SiqingYu Do you mean the c# interactive powershell?
  • Ssuching Yu
    Ssuching Yu almost 5 years
    Not the interactive power shell, but the developer console, used by Visual C++ too. It is the default debug console in Win32 Console Application projects.
  • Wernfried Domscheit
    Wernfried Domscheit almost 5 years
    @Cheersandhth.-Alf, the header is quite generic, I assume that's the reason why many search-engines will hit this page first. However, apart from undoubted limitation/bug I think chcp 65001 is sufficient for 99% of people having problem with "Unicode in command line"
  • Cheers and hth. - Alf
    Cheers and hth. - Alf almost 5 years
    @WernfriedDomscheit: What was the first part of “UTF-8 in consoles works only partially and only for output” that you failed to understand?
  • Wernfried Domscheit
    Wernfried Domscheit almost 5 years
    @Cheersandhth.-Alf, I understand the issue. However for a typical usecase for example echo € > euro.txt and type euro.txt the solution is sufficient for most of the people. Such commands do not work with codepage 850 (the default for western europe)
  • Cheers and hth. - Alf
    Cheers and hth. - Alf almost 5 years
    "the solution is sufficient for most of the people" It's not a solution. It's advice akin to pouring sugar in a car's gas tank, plain sabotage. And regarding "I understand the issue", no you do not. Given that claim I advise you to read up on the Dunning-Kruger effect.
  • NikolaDjokic almost 5 years
    @Cheers and hth. - Alf: Almost 300K people came to this question, because of the title. The vast majority didn't read the body of the question. They immediately copied and pasted the code from the first answer, it worked for them, up voted and continued with their lives. They most probably won't have to deal again with Windows Command Prompt intricacies. They just wanted to run a simple program and get on with their work. They don't need the deep expertise, you obviously possess and they aren't incompetents. You don't have to be rude.
  • Rick
    Rick over 4 years
    Outputting UTF-8 encoded characters are fine. But input is still encoded by system codepage.
  • Rick
    Rick over 4 years
    I am a newbie to C++ and can't understand this answer even after reading it carefully. Can somebody help me with this or give an easier explanation?
  • Rick
    Rick over 4 years
    @OhadSchneider windows version <=1709 can't use chcp and I failed too.
  • Vilx-
    Vilx- over 4 years
    Ahh, no, I'm sorry, but you missed the question. This is for when I'm writing a program that will receive the unicode characters. My question was about sending the unicode characters to another program (which hopefully supports receiving them, but I really have no way to know except disassembly).
  • Ilya Zakharevich
    Ilya Zakharevich over 4 years
    @Bachi Thanks to Bachi, I found out that v73 of my keyboard layout (mentioned above) was missing some support files. Now fixed! (Judging by my .log files, it is an intermittent bug in zip -ru [?!]. Have no clue how to debug it — or avoid in the future…)
  • Ilya Zakharevich
    Ilya Zakharevich over 4 years
    @Rick: Right! I added a link to a workaround in Python (but I cannot find a direct link to the patch right now…).
  • Rick
    Rick over 4 years
    @IlyaZakharevich :D Thank you. But I have somewhat given up on using Unicode on Windows. I am going to use Linux later.
  • Eryk Sun
    Eryk Sun over 4 years
    Bugs in the console are not in the kernel. The APIs in kernel32.dll and kernelbase.dll typically interface to system calls exported by ntdll.dll. The console API ultimately makes either I/O calls (e.g. NtReadFile, NtDeviceIoControlFile) in Windows 8+ or LPC calls in older versions. These system calls go through the kernel (e.g. via the ConDrv device in Win 8+), but ultimately they're implemented in the user-mode console host process. This is either an instance of conhost.exe in Windows 7+ or, in older versions, the session subsystem process, csrss.exe. Console bugs are usually here.
  • skomisa
    skomisa about 4 years
    Just to add that Windows users may already have a bash shell if you use Git: just open a Git > Git Bash window.
  • Pontiac_CZ
    Pontiac_CZ about 4 years
    Finally useful answer! Displayed chars still garbled but the arguments (filenames with accents) are now passed to called programs correctly. Thank you! (I'm from CZ as well)
  • vulcan raven
    vulcan raven about 4 years
    It seems like the real (Unix-grade) UTF-8 support in Windows consoles is under way: github.com/Microsoft/console/issues/190 and github.com/Microsoft/WSL/issues/75.
  • phuclv
    phuclv almost 4 years
    the Windows 10 cmd supports UTF-8 much better than previous versions Windows Command-Line: Unicode and UTF-8 Output Text Buffer
  • Ilya Zakharevich
    Ilya Zakharevich almost 4 years
    @phuclv: they claim that they do — but I did not see any example of what would work better than what is on Win7. Moreover, IIUC, this is going to appear at some moment — last time I checked, it looked like their changes were not accessible from outside of the kernel. (So: IIUC, one would need to open a handle to a certain driver — it is not “just writing to STDOUT”. I may be wrong — but it is hard to extract technical details from all the flack they create).
  • zvi
    zvi over 3 years
    See also my answer below for a new option in newer Windows 10 versions
  • Corey
    Corey about 3 years
    How to achieve this by using powershell or cmd?
  • akinuri
    akinuri over 2 years
    I'm trying to display Chinese characters in the console and doing this didn't work on Windows 10 64-bit (Installed in Turkish and later changed to English). Next, I'll try to install Chinese language and see if it works.
  • user31708
    user31708 over 2 years
    "more capable consoles" can now be real consoles by using the Pseudo Console API. Microsoft now makes an official "more capable console", Windows Terminal.
  • PHD over 2 years
    Changing the font to DejaVu Sans Mono Unifont displays broken characters for Korean and Chinese in CMD, and Unifont is not available for CMD, while it works in Microsoft Word.
  • Ilya Zakharevich
    Ilya Zakharevich about 2 years
    You need to be more specific in your complaints. (Especially in your “Unifont is not available for CMD”.) I may only guess that you mean that the “Mono” variant includes only those characters which make sense in 3:2 aspect ratio. (I have plans to make another flavor “including the remaining characters anyway”, but could not find time to work on this during the last couple of years.)
  • Green
    Green about 2 years
    I tried using this method, and now the font is super small and it seems to be permanent.
  • Alon Or
    Alon Or almost 2 years
    Just be careful with this, it broke the functionality of some old and crappy programs that were working fine in server 2019.
  • phuclv
    phuclv over 1 year
    new disks have 8.3 name generation disabled by default and this won't work