Determine and change filename encoding on Windows

Solution 1

Based on JosefZ's script, here is a modified version that works recursively:

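Get-ChildItem "X:\" -Recurse | ForEach-Object {
    $y = $_.Name.Normalize("FormC")     # canonical composition (FormC) of the name
    $file = $_.FullName
    # a composed name is shorter than its decomposed form, so a length
    # difference signals that the name is not yet in FormC
    if ( $y.Length -ne $_.Name.Length ) {
        Rename-Item -LiteralPath "$file" -NewName "$y" -WhatIf
        Write-Host "renamed file $file"   # printed even while -WhatIf is present
    }
}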

Remove -WhatIf after testing. I had problems with paths that were too long, but that's a topic for another post.
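
One thing to watch in a recursive run: if a directory is renamed before its contents are visited, the paths collected for its children may no longer be valid. The following is a minimal sketch of the same idea in Python (not a translation of the PowerShell above), assuming a bottom-up walk so that a directory's contents are renamed before the directory itself. The names normalize_tree and dry_run are hypothetical; dry_run plays the role of -WhatIf.

```python
import os
import unicodedata


def normalize_tree(root, dry_run=True):
    """Rename entries under root whose names are not in NFC.

    Returns the list of (source, destination) pairs that were found.
    With dry_run=True nothing is renamed (similar in spirit to -WhatIf).
    """
    renames = []
    # topdown=False yields the deepest directories first, so a directory's
    # contents are handled before the directory itself is renamed.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            composed = unicodedata.normalize("NFC", name)
            if composed == name:
                # Python compares strings code point by code point,
                # so equality here means the name is already in NFC.
                continue
            src = os.path.join(dirpath, name)
            dst = os.path.join(dirpath, composed)
            renames.append((src, dst))
            # Skip the rename if a composed twin already exists.
            if not dry_run and not os.path.exists(dst):
                os.rename(src, dst)
    return renames
```

Calling normalize_tree("X:\\") only reports what would change; passing dry_run=False performs the renames. The sketch does nothing about overly long paths.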

Solution 2

I can reproduce your problem using the following simple PowerShell script:

$RatedName = "šöü"                            # set sample string
$FormDName = $RatedName.Normalize("FormD")    # its Canonical Decomposition
$FormCName = $FormDName.Normalize("FormC")    #     followed by Canonical Composition
                                              # list each string character by character
($RatedName,$FormDName,$FormCName) | ForEach-Object {
    $charArr = [char[]]$_ 
    "$_"      # display string in new line for better readability
              # display each character together with its Unicode codepoint
    For( $i=0; $i -lt $charArr.Count; $i++ ) { 
        $charInt = [int]$charArr[$i]
        # next "Try-Catch-Finally" code snippet adopted from my "Alt KeyCode Finder"
        #                                       http://superuser.com/a/1047961/376602
        Try {    
            # Get-CharInfo module downloadable from http://poshcode.org/5234
            #        to add it into the current session: use Import-Module cmdlet
            $charInt | Get-CharInfo |% {
                $ChUCode = $_.CodePoint
                $ChCtgry = $_.Category
                $ChDescr = $_.Description
            }
        }
        Catch {
            $ChUCode = "U+{0:x4}" -f $charInt
            if ( $charInt -le 0x1F -or ($charInt -ge 0x7F -and $charInt -le 0x9F)) 
                 { $ChCtgry = "Control" } else { $ChCtgry = "" }
            $ChDescr = ""
        }
        Finally { $ChOut = $charArr[$i] }
        "{0} {1,-2} {2} {3,5} {4}" -f $i, $charArr[$i], $ChUCode, $charInt, $ChDescr
    }
}
# create sample files
$RatedName | Out-File "D:\test\1097217Rated$RatedName.txt" -Encoding utf8
$FormDName | Out-File "D:\test\1097217FormD$FormDName.txt" -Encoding utf8
$FormCName | Out-File "D:\test\1097217FormC$FormCName.txt" -Encoding utf8
""                                 # very rough draft of a possible solution
Get-ChildItem "D:\test\1097217*" | ForEach-Object {
    $y = $_.Name.Normalize("FormC")
    if ( $y.Length -ne $_.Name.Length ) {
        # pass the full path explicitly instead of relying on how $_ converts to a string
        Rename-Item -NewName $y -LiteralPath $_.FullName -WhatIf
    } else {
        "       : file name is already normalized $_"
    }
}

The script above was updated in two ways: first, it shows more information about composed and decomposed Unicode characters, i.e. their Unicode names (see the Get-CharInfo module); second, it embeds a very rough draft of a possible solution.
Output from cmd prompt:

==> powershell -c D:\PShell\SU\1097217.ps1
šöü
0 š  U+0161   353 Latin Small Letter S With Caron
1 ö  U+00F6   246 Latin Small Letter O With Diaeresis
2 ü  U+00FC   252 Latin Small Letter U With Diaeresis
šöü
0 s  U+0073   115 Latin Small Letter S
1 ̌  U+030C   780 Combining Caron
2 o  U+006F   111 Latin Small Letter O
3 ̈  U+0308   776 Combining Diaeresis
4 u  U+0075   117 Latin Small Letter U
5 ̈  U+0308   776 Combining Diaeresis
šöü
0 š  U+0161   353 Latin Small Letter S With Caron
1 ö  U+00F6   246 Latin Small Letter O With Diaeresis
2 ü  U+00FC   252 Latin Small Letter U With Diaeresis
       : file name is already normalized D:\test\1097217FormCšöü.txt
What if: Performing the operation "Rename File" on target "Item: D:\test\1097217
FormDšöü.txt Destination: D:\test\1097217FormDšöü.txt".
       : file name is already normalized D:\test\1097217Ratedšöü.txt
==> dir /b D:\test\1097217*
1097217FormCšöü.txt
1097217FormDšöü.txt
1097217Ratedšöü.txt

In fact, the dir output above looks like 1097217FormDsˇo¨u¨.txt in the cmd window. My Unicode-aware browser composes the strings as listed above, but a Unicode analyzer shows the actual characters, as does the following image:

[Image: combining accents]
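
The composed and decomposed forms listed in the output above can also be inspected without the Get-CharInfo module. This is a minimal sketch in Python, assuming only the standard unicodedata module; the helper name describe is hypothetical. It mirrors the FormD and FormC steps of the script and lists each code point with its Unicode name.

```python
import unicodedata


def describe(text):
    """Return (character, 'U+XXXX', decimal code, Unicode name) per code point."""
    return [
        (ch, "U+%04X" % ord(ch), ord(ch), unicodedata.name(ch, ""))
        for ch in text
    ]


rated = "\u0161\u00f6\u00fc"                    # the sample string, precomposed
form_d = unicodedata.normalize("NFD", rated)     # canonical decomposition
form_c = unicodedata.normalize("NFC", form_d)    # followed by canonical composition

for label, s in (("Rated", rated), ("FormD", form_d), ("FormC", form_c)):
    print(label, "length", len(s))
    for ch, code, dec, name in describe(s):
        print("   ", code, dec, name)
```

The decomposed string has six code points (each base letter followed by a combining mark), while the composed strings have three, which is the length difference the rename draft relies on.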

However, the next example shows the problem in full: a for /F loop changes the combining accents to non-combining ones, so dir can no longer find the file:

==> for /F "delims=" %G in ('dir /b /S D:\test\1097217*') do @echo %~nxG & dir /B %~fG
1097217FormCšöü.txt
1097217FormCšöü.txt
1097217FormDsˇo¨u¨.txt
File Not Found
1097217Ratedšöü.txt
1097217Ratedšöü.txt

==>

Here is a very rough draft of a possible solution (see the output above):

""                                 # very rough draft of a possible solution
Get-ChildItem "D:\test\1097217*" | ForEach-Object {
    $y = $_.Name.Normalize("FormC")
    if ( $y.Length -ne $_.Name.Length ) {
        # pass the full path explicitly instead of relying on how $_ converts to a string
        Rename-Item -NewName $y -LiteralPath $_.FullName -WhatIf
    } else {
        "       : file name is already normalized $_"
    }
}

(ToDo: invoke Rename-Item only if necessary):

Get-ChildItem "D:\test\1097217*" | ForEach-Object {
    $y = $_.Name.Normalize("FormC")
    if ($true) {                                         ### ToDo
        Rename-Item -NewName $y -LiteralPath $_.FullName -WhatIf
    }
}

and its output (again, the strings are rendered composed here; the image below shows how the cmd window really displays them):

What if: Performing the operation "Rename File" on target "Item: D:\test\1097217
FormCšöü.txt Destination: D:\test\1097217FormCšöü.txt".
What if: Performing the operation "Rename File" on target "Item: D:\test\1097217
FormDšöü.txt Destination: D:\test\1097217FormDšöü.txt".
What if: Performing the operation "Rename File" on target "Item: D:\test\1097217
Ratedšöü.txt Destination: D:\test\1097217Ratedšöü.txt".

[Image: combining accents]

Updated cmd output:

[Image: updated cmd output]
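
The ToDo above (rename only when necessary) comes down to testing whether a name is already in FormC. The draft compares lengths; a stricter test compares the name with its normalized form code point by code point. Here is a minimal sketch in Python, assuming the standard unicodedata module; needs_rename is a hypothetical helper name.

```python
import unicodedata


def needs_rename(name):
    """True if name differs from its NFC form (exact code point comparison)."""
    return unicodedata.normalize("NFC", name) != name


composed = "1097217FormC\u0161\u00f6\u00fc.txt"
decomposed = unicodedata.normalize("NFD", composed)

print(needs_rename(composed))     # already composed
print(needs_rename(decomposed))   # contains combining marks

# Edge case: combining marks in non-canonical order. Normalization only
# reorders them, so the length stays the same although the name changes.
reordered = "q\u0307\u0323"
print(needs_rename(reordered), len(reordered),
      len(unicodedata.normalize("NFC", reordered)))
```

For names like the ones in this question the length check and the exact comparison agree; they differ only in cases such as the reordering example, where a length check alone would skip the file.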

Author: nixer

Updated on September 18, 2022

Comments

  • nixer (3 months ago)

    I have files on a Windows server that have certain accented characters in their names. Windows Explorer displays the names normally, but running 'dir' at the command prompt with default settings displays substituted characters.

    For example, the character ö is displayed as o" in the listing. This causes problems when accessing these files from other platforms over SMB, presumably because of conflicting encoding/code pages. The problem is not present with all files and I don't know where the problem files came from.

    Example:

    E:\folder\files>dir
     Volume in drive E is data
     Volume Serial Number is 5841-C30E
     Directory of E:\folder\files  
    07/05/2016  07:46 PM    <DIR>          .
    07/05/2016  07:46 PM    <DIR>          ..
    12/01/2015  11:12 AM            14,105 file with o" character.xlsx
    01/22/2015  05:30 PM            11,598 file with correct ö character.xlsx
                   2 File(s)         25,703 bytes
                   2 Dir(s)  2,727,491,600,384 bytes free
    

    I've changed file and directory names, but you'll get the idea.

    Any ideas how the names could have gotten this way? Perhaps they were copied or created using another platform or tool?

    How could I batch find and rename all the problem files? I looked at a couple of GUI renaming utilities but they don't see the problem and only work with the name shown in Windows Explorer.

    Filesystem on the drive is ReFS, could that have something to do with it?

    Edit: I ran this PowerShell command:

    Y:\test>powershell -c Get-ChildItem ^|ForEach-Object {$x=$_.Name; For ($i=0;$i
    -lt $x.Length; $i++) {\"{0} {1} {2}\" -f $x,$x[$i],[int]$x[$i]}}
    file with o¨ character.xlsx o 111
    file with o¨ character.xlsx ¨ 776
    

    The output is trimmed to the relevant part.

    So it looks like the character really is a combining diaeresis (code 776) and not a quotation mark, which, as I understand it, is what Unicode decomposition would produce.

    • DavidPostill (over 6 years ago)
      Use chcp in the cmd shell to set an appropriate code page. See chcp - Change the active console Code Page. The default code page is determined by the Windows Locale.
    • JosefZ (over 6 years ago)
      nixer, please edit your question and add a real example of such dir output (copy and paste from the cmd window). @DavidPostill chcp would not suffice; it looks like what is displayed is a Canonical or Compatibility Decomposition, o ̈ (U+006F Latin Small Letter O followed by U+0308 Combining Diaeresis), instead of the ö character (U+00F6 Latin Small Letter O With Diaeresis).
    • nixer (over 6 years ago)
      @DavidPostill @JosefZ I played around with chcp but couldn't get the name to show up correctly. It just changes the " to some other character, such as ?. So the name seems to have been saved in decomposed form originally: the command prompt shows the actual name, while Windows Explorer combines it back on the fly.
    • JosefZ (over 6 years ago)
      I can't believe that there is a " (Quotation Mark) in a file name, as this character is reserved (disallowed in file names) according to the Naming Files, Paths, and Namespaces article. That should apply to both the NTFS and ReFS file systems. Please run the one-liner powershell -c Get-ChildItem ^|ForEach-Object {$x=$_.Name; For ($i=0;$i -lt $x.Length; $i++) {\"{0} {1} {2}\" -f $x,$x[$i],[int]$x[$i]}} instead of dir, then edit the question again and paste only the relevant output lines (the numbers should suffice). FYI, the code of " is 34.
  • nixer (over 6 years ago)
    Thanks for the tip. I tried several locales, but none of them affected the decomposition and I wasn't able to replicate the issue. I also tried several file renaming utilities, but none of them could handle decomposed names. This leads me to believe that the files were transferred from another machine or platform using some tool that mangled the names. I'm still searching for a bulk renaming tool that could find and fix all the files with this issue.
  • miroxlav (over 6 years ago)
    @nixer – regarding bulk renaming, I already wrote how it can be done. More details: inside TCMD, use Search & Replace in the Multi-rename tool (accessible from the main menu). Be careful and create a backup first, though: an incorrect renaming order can lead you into a logical trap. I think the best option (if viable) would be to use the files to determine who uploaded them and focus on that user's machine.
  • nixer (over 6 years ago)
    Very nice detective work! At the moment a PowerShell script seems like the best option for correcting the issue. I haven't found a file renaming utility that understands decomposed Unicode.
  • JosefZ (over 6 years ago)
    @nixer please note the updated answer: the renaming part could help!
  • nixer (over 6 years ago)
    The draft script works wonderfully in the current directory. I tried to modify it to do renaming recursively but due to my poor PowerShell skills, I haven't been able to yet.
  • JosefZ (over 6 years ago)
    @nixer please search Stack Overflow for your additional request.