How to convert a file to UTF-8 in PHP?

Solution 1

Before you can convert the file to UTF-8, you need to know which character set it is in. If you can't figure that out, there is no sane way to convert it. An insane way, usable as a fallback when the encoding cannot be determined, is to simply strip any bytes that aren't valid UTF-8.

Warning: untested code (I'm suddenly in a hurry), but it may look something like this:

$row = array ();
foreach ( $datas as $data ) {
    $encoding = guess_encoding ( $data );
    if (empty ( $encoding )) {
        // encoding cannot be determined...
        // as a fallback, we simply strip any bytes that can't be converted.
        // obviously this isn't a reliable conversion scheme, and how
        // //IGNORE treats invalid input depends on the iconv implementation.
        $data = iconv ( "ASCII", "UTF-8//TRANSLIT//IGNORE", $data );
    } else {
        $data = mb_convert_encoding ( $data, 'UTF-8', $encoding );
    }
    $row [] = explode ( ',', $data );
}
function guess_encoding(string $str): string {
    $blacklist = array (
            'pass',
            'auto',
            'wchar',
            'byte2be',
            'byte2le',
            'byte4be',
            'byte4le',
            'BASE64',
            'UUENCODE',
            'HTML-ENTITIES',
            '7bit',
            '8bit' 
    );
    $encodings = array_flip ( mb_list_encodings () );
    foreach ( $blacklist as $tmp ) {
        unset ( $encodings [$tmp] );
    }
    $encodings = array_keys ( $encodings );
    $detected = mb_detect_encoding ( $str, $encodings, true );
    return ( string ) $detected;
}
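
The strip-invalid-bytes fallback can also be done with mbstring alone. A minimal sketch, assuming PHP 7 or later with the mbstring extension; strip_invalid_utf8 is a hypothetical helper name, not part of the answer above:

```php
<?php
// Hypothetical helper: drop every byte sequence that is not valid UTF-8.
function strip_invalid_utf8(string $str): string {
    // Remember the current substitute character so it can be restored.
    $previous = mb_substitute_character();
    // "none" tells mbstring to drop invalid input instead of writing "?".
    mb_substitute_character('none');
    // Converting UTF-8 to UTF-8 re-validates the string.
    $clean = mb_convert_encoding($str, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);
    return $clean;
}
```

Like the iconv fallback, this loses data: every character that was not valid UTF-8 simply disappears.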

Solution 2

Try this out.
The example comes from something I was doing in a test environment, so you might need to change the code slightly.

I had a text file containing the following data:

test
café
áÁÁÁááá
žžœš¥±
ÆÆÖÖÖasØØ
ß

Then I had a form that took a file input and ran the following code:

<?php
// Regroup PHP's per-field upload layout (name[], type[], ...) into one array per uploaded file.
function neatify_files(&$files) {
    $tmp = array();
    foreach ($files as $field => $info) {
        foreach (array_keys($info["name"]) as $j) {
            foreach (array("name", "type", "tmp_name", "error", "size") as $key) {
                $tmp[$field][$j][$key] = $info[$key][$j];
            }
        }
    }
    return $files = $tmp;
}

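if (isset($_POST["submit"])) {
    neatify_files($_FILES);
    $file = $_FILES["file"][0];

    $handle = fopen($file["tmp_name"], "r");
    while (($line = fgets($handle)) !== false) {
        // With only "UTF-8" in the list and strict mode on, this returns
        // "UTF-8" for valid input and false for anything else.
        $enc = mb_detect_encoding($line, "UTF-8", true);
        if ($enc === false) {
            // The real source encoding is unknown at this point;
            // ISO-8859-1 is an assumption, adjust it to match your data.
            $line = iconv("ISO-8859-1", "UTF-8", $line);
        }
        echo "<p>" . htmlspecialchars($line) . "</p>";
    }
    fclose($handle);
}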
?>
<form action="<?= $_SERVER["PHP_SELF"]; ?>" method="POST" enctype="multipart/form-data">
    <input type="file" name="file[]" />
    <input type="submit" name="submit" value="Submit" />
</form>

The function neatify_files is something I wrote to make the $_FILES array more logical in its layout.

The form is a standard form that simply POSTs the data to the server.
Note: Echoing $_SERVER["PHP_SELF"] without escaping it is a security risk (it opens the form to cross-site scripting).

When the data is posted I store the file in a variable. Obviously, if you are using the multiple attribute your code won't look quite like this.

$handle is a file handle opened in read-only mode (hence the "r" argument); fgets then reads the file from it one line at a time.

$enc uses the mb_detect_encoding function to check the encoding of each line.
At first I was having trouble obtaining the correct encoding. Setting the encoding_list to contain only UTF-8 and setting strict to true fixed that: the function then returns false for any line that isn't valid UTF-8.

If the line is already UTF-8 I simply print it; if it isn't, I convert it to UTF-8 using the iconv function.
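
The per-line check described above can be wrapped in a small function. This is only a sketch: read_lines_as_utf8 is a hypothetical name, and the Windows-1252 default for lines that are not valid UTF-8 is an assumption that has to match the real data.

```php
<?php
// Hypothetical helper: read a file line by line and return UTF-8 lines.
// $assumed is the encoding used for lines that are not valid UTF-8;
// Windows-1252 is only an assumption and should match the real data.
function read_lines_as_utf8(string $path, string $assumed = 'Windows-1252'): array {
    $handle = fopen($path, 'r');
    if ($handle === false) {
        return array();
    }
    $lines = array();
    while (($line = fgets($handle)) !== false) {
        // Only convert lines that fail UTF-8 validation.
        if (!mb_check_encoding($line, 'UTF-8')) {
            $line = mb_convert_encoding($line, 'UTF-8', $assumed);
        }
        // Drop the trailing line break before storing the line.
        $lines[] = rtrim($line, "\r\n");
    }
    fclose($handle);
    return $lines;
}
```

When printing the result into HTML, each line should still go through htmlspecialchars.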

Solution 3

You can convert the file's text into binary data by using the following:

FUNCTION bin2text($bin_str) 
{ 
    $text_str = ''; 
    $chars = EXPLODE("\n", CHUNK_SPLIT(STR_REPLACE("\n", '', $bin_str), 8)); 
    $_I = COUNT($chars); 
    FOR($i = 0; $i < $_I; $text_str .= CHR(BINDEC($chars[$i])), $i++); 
    RETURN $text_str; 
} 

FUNCTION text2bin($txt_str) 
{ 
    $len = STRLEN($txt_str); 
    $bin = ''; 
    FOR($i = 0; $i < $len; $i++) 
    { 
        $bin .= STRLEN(DECBIN(ORD($txt_str[$i]))) < 8 ? STR_PAD(DECBIN(ORD($txt_str[$i])), 8, 0, STR_PAD_LEFT) : DECBIN(ORD($txt_str[$i])); 
    } 
    RETURN $bin; 
}

After converting the data into binary, you simply pass the text to the PHP function mb_convert_encoding($fileText, "UTF-8");
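
If the aim of the binary dump is to see which bytes the file really contains, a hex view is usually easier to read than a string of 0s and 1s. A small sketch, where hex_bytes is a hypothetical name:

```php
<?php
// Hypothetical helper: show the raw bytes of a string as space-separated hex.
function hex_bytes(string $str): string {
    return implode(' ', str_split(bin2hex($str), 2));
}

// "é" is one byte (e9) in ISO-8859-1 but two bytes (c3 a9) in UTF-8.
echo hex_bytes("caf\xE9"), "\n";      // 63 61 66 e9
echo hex_bytes("caf\xC3\xA9"), "\n";  // 63 61 66 c3 a9
```

Comparing the bytes of a known character this way is a quick check of which encoding a file is likely in.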

Solution 4

Let's try this:

function encode_utf8($data)
{
    if ($data === null || $data === '') {
        return $data;
    }
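    if (!mb_check_encoding($data, 'UTF-8')) {
        // No source encoding is given, so mbstring falls back to its internal
        // encoding (usually UTF-8) and replaces bytes it can't interpret with "?".
        return mb_convert_encoding($data, 'UTF-8');
    } else {
        return $data;
    }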
}

Usage:

$content = file_get_contents($_FILES['file']['tmp_name']);
$content = encode_utf8($content);

$rows = explode("\n", $content);
foreach ($rows as $row) {
    print_r($row);
}
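
When no source encoding is passed, mb_convert_encoding assumes the string is already in mbstring's internal encoding, which may be why characters such as a curly quote get lost. A hedged variant that names the source encoding explicitly; the function name encode_utf8_from and the Windows-1252 default are assumptions, not part of the answer above:

```php
<?php
// Hypothetical variant of encode_utf8() with an explicit source encoding.
function encode_utf8_from(?string $data, string $from = 'Windows-1252'): ?string {
    if ($data === null || $data === '') {
        return $data;
    }
    // Leave the input alone if it is already valid UTF-8.
    if (mb_check_encoding($data, 'UTF-8')) {
        return $data;
    }
    return mb_convert_encoding($data, 'UTF-8', $from);
}

// 0x92 is the right single quotation mark in Windows-1252.
$fixed = encode_utf8_from("it\x92s");
```

The default only helps if the uploads really are Windows-1252; for other 8-bit charsets the right name has to be passed in.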

Solution 5

function convert_file_to_utf8($source, $target) {
    $content=file_get_contents($source);
    # detect original encoding
    $original_encoding=mb_detect_encoding($content, "UTF-8, ISO-8859-1, ISO-8859-15", true);
    # now convert
    if ($original_encoding!='UTF-8') {
        $content=mb_convert_encoding($content, 'UTF-8', $original_encoding);

    }
    $bom=chr(239) . chr(187) . chr(191); # use BOM to be on safe side
    file_put_contents($target, $bom.$content);
}
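
One thing to watch with the BOM: if the source file already starts with one, prepending another leaves a stray U+FEFF in the data. A minimal sketch of stripping an existing BOM first; strip_utf8_bom is a hypothetical helper name:

```php
<?php
// Hypothetical helper: remove a leading UTF-8 BOM (bytes EF BB BF) if present.
function strip_utf8_bom(string $content): string {
    $bom = "\xEF\xBB\xBF";
    if (strncmp($content, $bom, 3) === 0) {
        // Cast because substr() can return false on older PHP versions.
        return (string) substr($content, 3);
    }
    return $content;
}
```

In convert_file_to_utf8 this would be applied to $content after the conversion and before the BOM is prepended.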

Author: Julio de Leon

Updated on June 22, 2022

Comments

  • Julio de Leon 3 months

    Is it possible to convert a file into UTF-8 on my end?

    I have access to the file after the submission through

    $_FILES['file']['tmp_name']

    Note: The user can upload a CSV file in any charset; I usually encounter an unknown 8-bit charset.

    I tried

    $row = array();
    $datas = file($_FILES['file']['tmp_name']);
    foreach($datas as $data) {
        $data = mb_convert_encoding($data, 'UTF-8');
        $row[] = explode(',', $data);
    }

    But the problem is that this code removes special characters such as the single quote.

    My first question is: does htmlspecialchars remove the value inside the array?

    I mention it for additional information. Thanks to those who can help!

  • zessx almost 5 years
    Why are you uppercasing PHP keywords?
  • zessx almost 5 years
    Not an issue, but this is kinda weird. Does this mean you never use editors' autocompletion and snippets?
  • JustCarty over 4 years
    @OblivionCoder Any chance you could confirm if this answer solved your problem?
  • Eric P 12 months
    Worked brilliantly for my issue of processing unpredictably formatted CSVs via LOAD DATA LOCAL INFILE.