Read UTF-8 files correctly with PowerShell

powershell encoding utf-8

31,780

Solution 1

If the file is supposed to be UTF8 why don't you try to read it decoding UTF8 :

Get-Content -Path test.txt -Encoding UTF8

Solution 2

Really JPBlanc is right. If you want it read as UTF8 then specify that when the file is read.

On a side note, you're losing formatting in here with the [String]+[String] stuff. Not to mention your regex match doesn't work. Check out the regex search changes, and the changes made to the $newMsgs, and the way I'm outputting your data to the file.

# Read data if exists
$data = ""
$startRev = 1;
if (Test-Path test.txt)
{
    $data = Get-Content -Path test.txt #-Encoding UTF8
    if($data -match "\br([0-9]+)\b"){
        $startRev = [int]([regex]::Match($data,"\br([0-9]+)\b")).groups[1].value + 1
    }
}
Write-Host Next revision is $startRev

# Define example data to add
$startRev = $startRev + 10
$newMsgs = @"
2014-04-01 - r$startRev`r`n`r`n
    Line 1`r`n
    Line 2`r`n`r`n
"@

# Write new data back
$newmsgs,$data | Out-File test.txt -Encoding UTF8

Solution 3

Get-Content doesn't seem to handle UTF-files without BOM at all (if you omit the Encoding-flag). System.IO.File.ReadLines seems to be an alternative, examples:

PS C:\temp\powershellutf8> $a = Get-Content .\utf8wobom.txt
PS C:\temp\powershellutf8> $b = Get-Content .\utf8wbom.txt
PS C:\temp\powershellutf8> $a2 = Get-Content .\utf8wbom.txt -Encoding UTF8
PS C:\temp\powershellutf8> $a
ABCDEFGHIJKLMNOPQRSTUVWXYZÃ…Ã„Ã–  <== This doesnt seem to be right at all
PS C:\temp\powershellutf8> $b
ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ
PS C:\temp\powershellutf8> $a2
ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ
PS C:\temp\powershellutf8>
PS C:\temp\powershellutf8> $c = [IO.File]::ReadLines('.\utf8wbom.txt');
PS C:\temp\powershellutf8> $c
ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ
PS C:\temp\powershellutf8> $d = [IO.File]::ReadLines('.\utf8wobom.txt');
PS C:\temp\powershellutf8> $d
ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ <== Works!

31,780

Author by

ygoe

Software developer and consultant, specialising in .NET data, web and cross-platform solutions. Into clean code, usability, design and typography. Curious. Not a coffee drinker. The passion for creating useful applications for people and continuous improvement in coding has kept me in the software development business for over 20 years and counting. I started with Basic, Pascal and C++, however for the past 15 years my focus has been on C# and .NET technologies on several platforms, together with databases, JavaScript and PHP. This includes customer projects for data processing and analysis in sectors like power supply, finance, medical and manufacturing, as well as my own product ideas that I draw from everyday work. I also enjoy contributing to open source projects. In my leisure time I also like travelling and photographing the beauty of our world. Read more about me on my website.

Updated on February 24, 2020

Comments

ygoe about 4 years
Following situation:
- A PowerShell script creates a file with UTF-8 encoding
- The user may or may not edit the file, possibly losing the BOM, but should keep the encoding as UTF-8, and possibly changing the line separators
- The same PowerShell script reads the file, adds some more content and writes it all as UTF-8 back to the same file
- This can be iterated many times
With Get-Content and Out-File -Encoding UTF8 I have problems reading it correctly. It's stumbling over the BOM it has written before (putting it in the content, breaking my parsing regex), does not use UTF-8 encoding and even deletes line breaks in the original content part.

I need a function that can read any file with UTF-8 encoding, ignore and delete the BOM and not modify the content. What should I use?

Update

I have added a little test script that shows what I'm trying to do and what happens instead.
```
# Read data if exists
$data = ""
$startRev = 1;
if (Test-Path test.txt)
{
    $data = Get-Content -Path test.txt
    if ($data -match "^[0-9-]{10} - r([0-9]+)")
    {
        $startRev = [int]$matches[1] + 1
    }
}
Write-Host Next revision is $startRev

# Define example data to add
$startRev = $startRev + 10
$newMsgs = "2014-04-01 - r" + $startRev + "`r`n`r`n" + `
    "Line 1`r`n" + `
    "Line 2`r`n`r`n"

# Write new data back
$data = $newMsgs + $data
$data | Out-File test.txt -Encoding UTF8
```
After running it a few times, new sections should be added to the beginning of the file, the existing content should not be altered in any way (currently loses line breaks) and no additional new lines should be added at the end of the file (seems to happen sometimes).

Instead, the second run gives me an error.
ygoe about 10 years

Because, according to the official documentation, this parameter doesn't even exist? How could I know about it? I'll give it a try.
ygoe about 10 years

That improved it. The regex itself was good, just not how I used it. I found that somewhere else... Isn't there a way without duplicating the regex string? Also, what does the comma in the last command do? I see lots of additional new lines added at the end initially.
ygoe about 10 years

Found it, must be an array. Unfortunately the empty $data for the first run causes extra lines. – And why does the + operator of two strings change their actual content? That's new to me in any programming language.
ygoe about 10 years

Okay, it's Get-Content's fault. It gives me an array of lines, not a single multiline string. That causes all sorts of chaos. I've switched to [System.IO.File]::ReadAllText() and [System.IO.File]::WriteAllText() and now I get much more predictable results.
Polymorphix almost 7 years

Get-Content -raw gives you the single multiline string you're looking for.
ygoe over 5 years

Sorry, 5 years later I don't know that anymore. I haven't used PS much in a while.
phuclv about 5 years

the parameter has existed since at least PowerShell 3.0