Determine and change filename encoding on Windows
Solution 1
Based on JosefZ's script, here is a modified version that works recursively:
Get-ChildItem "X:\" -Recurse | ForEach-Object {
    $y = $_.Name.Normalize("FormC")
    $file = $_.Fullname
    if ( $y.Length -ne $_.Name.Length ) {
        Rename-Item -LiteralPath "$file" -NewName "$y" -WhatIf
        Write-Host "renamed file $file"
    }
}
Remove -WhatIf after testing. I had problems with paths that were too long, but that's a topic for another post.
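One thing to watch with a recursive run is renaming order: if a folder name is itself decomposed and gets renamed before its contents, the paths already collected for its children may no longer exist. A minimal sketch of one way around that, assuming it is acceptable to collect the whole listing first and process the longest paths first. The function name Rename-ToFormC is hypothetical, and the length check is kept from the script above:

```powershell
# Hypothetical wrapper around the loop above; the function name and the
# deepest-first ordering are additions, not part of the original script.
function Rename-ToFormC {
    [CmdletBinding(SupportsShouldProcess)]
    param([Parameter(Mandatory)][string]$Root)

    # Sort-Object collects the whole listing before anything is renamed, and a
    # child's full path is always longer than its parent's, so children go first.
    Get-ChildItem -LiteralPath $Root -Recurse |
        Sort-Object { $_.FullName.Length } -Descending |
        ForEach-Object {
            $y = $_.Name.Normalize("FormC")
            if ($y.Length -ne $_.Name.Length) {
                Rename-Item -LiteralPath $_.FullName -NewName $y
            }
        }
}

# Dry run first, as with the script above:
# Rename-ToFormC -Root 'X:\' -WhatIf
```

If an item with the composed name already exists next to the decomposed one, Rename-Item should report an error for that item rather than overwrite it, so such pairs would need to be resolved by hand.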
Solution 2
I can reproduce your problem using the following simple PowerShell script:
$RatedName = "šöü" # set sample string
$FormDName = $RatedName.Normalize("FormD") # its Canonical Decomposition
$FormCName = $FormDName.Normalize("FormC") # followed by Canonical Composition
# list each string character by character
($RatedName,$FormDName,$FormCName) | ForEach-Object {
    $charArr = [char[]]$_
    "$_" # display string in new line for better readability
    # display each character together with its Unicode codepoint
    For( $i=0; $i -lt $charArr.Count; $i++ ) {
        $charInt = [int]$charArr[$i]
        # next "Try-Catch-Finally" code snippet adopted from my "Alt KeyCode Finder"
        # http://superuser.com/a/1047961/376602
        Try {
            # Get-CharInfo module downloadable from http://poshcode.org/5234
            # to add it into the current session: use Import-Module cmdlet
            $charInt | Get-CharInfo |% {
                $ChUCode = $_.CodePoint
                $ChCtgry = $_.Category
                $ChDescr = $_.Description
            }
        }
        Catch {
            $ChUCode = "U+{0:x4}" -f $charInt
            if ( $charInt -le 0x1F -or ($charInt -ge 0x7F -and $charInt -le 0x9F))
                { $ChCtgry = "Control" } else { $ChCtgry = "" }
            $ChDescr = ""
        }
        Finally { $ChOut = $charArr[$i] }
        "{0} {1,-2} {2} {3,5} {4}" -f $i, $charArr[$i], $ChUCode, $charInt, $ChDescr
    }
}
# create sample files
$RatedName | Out-File "D:\test\1097217Rated$RatedName.txt" -Encoding utf8
$FormDName | Out-File "D:\test\1097217FormD$FormDName.txt" -Encoding utf8
$FormCName | Out-File "D:\test\1097217FormC$FormCName.txt" -Encoding utf8
"" # very artless draft of possible solution
Get-ChildItem "D:\test\1097217*" | ForEach-Object {
    $y = $_.Name.Normalize("FormC")
    if ( $y.Length -ne $_.Name.Length ) {
        Rename-Item -NewName $y -LiteralPath $_ -WhatIf
    } else {
        " : file name is already normalized $_"
    }
}
The script above has been updated in two ways: first, it shows more information about composed and decomposed Unicode characters, i.e. their Unicode names (see the Get-CharInfo module); second, it embeds a very artless draft of a possible solution.
Output from the cmd prompt:
==> powershell -c D:\PShell\SU\1097217.ps1
šöü
0 š U+0161 353 Latin Small Letter S With Caron
1 ö U+00F6 246 Latin Small Letter O With Diaeresis
2 ü U+00FC 252 Latin Small Letter U With Diaeresis
šöü
0 s U+0073 115 Latin Small Letter S
1 ̌ U+030C 780 Combining Caron
2 o U+006F 111 Latin Small Letter O
3 ̈ U+0308 776 Combining Diaeresis
4 u U+0075 117 Latin Small Letter U
5 ̈ U+0308 776 Combining Diaeresis
šöü
0 š U+0161 353 Latin Small Letter S With Caron
1 ö U+00F6 246 Latin Small Letter O With Diaeresis
2 ü U+00FC 252 Latin Small Letter U With Diaeresis
: file name is already normalized D:\test\1097217FormCšöü.txt
What if: Performing the operation "Rename File" on target "Item: D:\test\1097217
FormDšöü.txt Destination: D:\test\1097217FormDšöü.txt".
: file name is already normalized D:\test\1097217Ratedšöü.txt
==> dir /b D:\test\1097217*
1097217FormCšöü.txt
1097217FormDšöü.txt
1097217Ratedšöü.txt
In fact, the dir output above looks like 1097217FormDsˇo¨u¨.txt in the cmd window. My Unicode-aware browser composes the strings as listed above, but a Unicode analyzer shows the characters as they truly are, as does the latest image.
However, the next example shows the full extent of the problem: a for loop changes the combining accents to ordinary (spacing) ones:
==> for /F "delims=" %G in ('dir /b /S D:\test\1097217*') do @echo %~nxG & dir /B %~fG
1097217FormCšöü.txt
1097217FormCšöü.txt
1097217FormDsˇo¨u¨.txt
File Not Found
1097217Ratedšöü.txt
1097217Ratedšöü.txt
==>
Here's a very artless draft of a possible solution (see the output above):
"" # very artless draft of possible solution
Get-ChildItem "D:\test\1097217*" | ForEach-Object {
    $y = $_.Name.Normalize("FormC")
    if ( $y.Length -ne $_.Name.Length ) {
        Rename-Item -NewName $y -LiteralPath $_ -WhatIf
    } else {
        " : file name is already normalized $_"
    }
}
(ToDo: invoke Rename-Item only if necessary):
Get-ChildItem "D:\test\1097217*" | ForEach-Object {
    $y = $_.Name.Normalize("FormC")
    if ($true) { ### ToDo
        Rename-Item -NewName $y -LiteralPath $_ -WhatIf
    }
}
and its output (again, the strings are rendered composed here; the image below shows how the cmd window really looks):
What if: Performing the operation "Rename File" on target "Item: D:\test\1097217
FormCšöü.txt Destination: D:\test\1097217FormCšöü.txt".
What if: Performing the operation "Rename File" on target "Item: D:\test\1097217
FormDšöü.txt Destination: D:\test\1097217FormDšöü.txt".
What if: Performing the operation "Rename File" on target "Item: D:\test\1097217
Ratedšöü.txt Destination: D:\test\1097217Ratedšöü.txt".
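For the ToDo above (calling Rename-Item only when necessary), one option besides comparing lengths is the .NET method String.IsNormalized, which asks the question directly. A minimal sketch: the helper names Get-FormCName and Invoke-FormCRenameDraft are made up for illustration, and the default path is simply the test folder used in this answer.

```powershell
# Hypothetical helper (not from the original script): returns the FormC name
# when a rename is needed, or nothing when the name is already composed.
function Get-FormCName {
    param([Parameter(Mandatory)][string]$Name)

    $formC = [System.Text.NormalizationForm]::FormC
    if (-not $Name.IsNormalized($formC)) {
        $Name.Normalize($formC)
    }
}

# Same loop as the draft, but Rename-Item is only reached when needed.
# Defined here, not invoked; -WhatIf is kept so nothing is renamed yet.
function Invoke-FormCRenameDraft {
    param([string]$Pattern = 'D:\test\1097217*')

    Get-ChildItem $Pattern | ForEach-Object {
        $y = Get-FormCName -Name $_.Name
        if ($y) { Rename-Item -NewName $y -LiteralPath $_.FullName -WhatIf }
        else    { " : file name is already normalized $_" }
    }
}
```

With the three sample files, only the FormD one should reach Rename-Item, which matches the output of the length-based draft shown earlier.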
[Image: updated cmd output]
nixer
Updated on September 18, 2022
Comments
-
nixer 3 months
I have files on a Windows server that have certain accented characters in the name. In Windows Explorer the files are displayed normally, but running 'dir' at the command prompt with default settings displays substituted characters.
For example, the character ö is displayed as o" in the listing. This causes problems when accessing these files from other platforms over SMB, presumably because of conflicting encoding/code pages. The problem is not present with all files and I don't know where the problem files came from.
Example:
E:\folder\files>dir
 Volume in drive E is data
 Volume Serial Number is 5841-C30E

 Directory of E:\folder\files

07/05/2016  07:46 PM    <DIR>          .
07/05/2016  07:46 PM    <DIR>          ..
12/01/2015  11:12 AM            14,105 file with o" character.xlsx
01/22/2015  05:30 PM            11,598 file with correct ö character.xlsx
               2 File(s)         25,703 bytes
               2 Dir(s)  2,727,491,600,384 bytes free
I've changed file and directory names, but you'll get the idea.
Any ideas how the names could have gotten this way? Perhaps they were copied or created using another platform or tool?
How could I batch find and rename all the problem files? I looked at a couple of GUI renaming utilities but they don't see the problem and only work with the name shown in Windows Explorer.
Filesystem on the drive is ReFS, could that have something to do with it?
Edit: ran PowerShell command
Y:\test>powershell -c Get-ChildItem ^|ForEach-Object {$x=$_.Name; For ($i=0;$i -lt $x.Length; $i++) {\"{0} {1} {2}\" -f $x,$x[$i],[int]$x[$i]}}
file with o¨ character.xlsx o 111
file with o¨ character.xlsx ¨ 776
Cleaned up to show only the relevant part.
So it looks like it's really a combining diaeresis and not a vertical quotation mark, which is what I would expect, as I understand it, when talking about Unicode normalization.
-
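The two spellings of ö can be compared without any files at all. A minimal sketch, where the helper name Get-CodePointList is made up for illustration; it lists UTF-16 code units, which is enough for these characters:

```powershell
# Hypothetical helper: list each UTF-16 code unit of a string as U+XXXX.
function Get-CodePointList {
    param([Parameter(Mandatory)][string]$Text)

    ($Text.ToCharArray() | ForEach-Object { 'U+{0:X4}' -f [int]$_ }) -join ' '
}

$composed   = [string][char]0x00F6                                        # ö as one character
$decomposed = $composed.Normalize([System.Text.NormalizationForm]::FormD) # o + combining diaeresis

Get-CodePointList -Text $composed     # U+00F6
Get-CodePointList -Text $decomposed   # U+006F U+0308
```

The decimal values 111 and 776 in the output above are the same U+006F and U+0308.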
DavidPostill over 6 years
Use chcp in the cmd shell to set an appropriate code page. See chcp - Change the active console Code Page. The default code page is determined by the Windows Locale.
-
JosefZ over 6 years
nixer please edit your question and add a real example of such dir output (Copy & Paste from the cmd window). @DavidPostill chcp would not suffice; it looks like a Canonical or Compatibility Decomposition ö (U+006F Latin Small Letter O followed by U+0308 Combining Diaeresis) is displayed instead of the ö character (U+00F6 Latin Small Letter O With Diaeresis).
-
nixer over 6 years
@DavidPostill @JosefZ I played around with chcp but couldn't get the name to show up correctly. It just changes the " to some other character like ?. So it seems to have been originally saved with decomposition and the command prompt shows the actual name, while Windows Explorer combines it back on the fly.
-
JosefZ over 6 years
I can't believe that there is a " (Quotation Mark) in a file name, as this character is reserved (disallowed in a filename) according to the Naming Files, Paths, and Namespaces article. This should apply to both the NTFS and ReFS file systems. Please run the one-liner powershell -c Get-ChildItem ^|ForEach-Object {$x=$_.Name; For ($i=0;$i -lt $x.Length; $i++) {\"{0} {1} {2}\" -f $x,$x[$i],[int]$x[$i]}} instead of dir, then edit the question again and Copy&Paste only the relevant output lines (the numbers should suffice). FYI, the " code is 34.
-
-
nixer over 6 years
Thanks for the tip. I tried several locales but none of them affected the decomposition and I wasn't able to replicate the issue. I also tried several file renaming utilities, but none of them knew how to operate with decomposition. This leads me to believe that the files were transferred from another machine or platform using some tool that mangled the names. I'm still searching for a bulk renaming tool that could find and fix all the files with this issue.
-
miroxlav over 6 years
@nixer, regarding bulk renaming, I already wrote how it can be done. More details: inside TCMD, use Search&Replace in the Multi-rename tool (accessible from the main menu). Be careful, though, and create a backup first: you can get yourself into a logical trap by using an incorrect renaming order. I think the best option (if viable) would be to use the files to determine who uploaded them and focus on that user's machine.
-
nixer over 6 years
Very nice detective work! At the moment a PowerShell script seems like the best option for correcting the issue. I haven't found a file renaming utility that understands decomposed Unicode.
-
JosefZ over 6 years
@nixer please note the updated answer: the renaming part could help!
-
nixer over 6 years
The draft script works wonderfully in the current directory. I tried to modify it to do the renaming recursively, but due to my poor PowerShell skills I haven't been able to yet.
-
JosefZ over 6 years
@nixer please search Stack Overflow for your additional request.