PHP UTF-8 questions - If I create a string in PHP... is it in UTF-8?


Solution 1

First question: it depends on what exactly goes in the string.

In PHP, strings are just sequences of bytes (this was written about PHP 5, and later versions have not changed it). There is no implied or explicit character set associated with them; that's something the programmer must keep track of. So, if you only put valid UTF-8 bytes between the quotes (fairly easy if the file itself is encoded as UTF-8), then the string will be UTF-8, and you can safely use mb_strlen() on it.

Also, if you're using mbstring functions, you need to tell them explicitly which character set your string uses, either through the internal encoding (the mbstring.internal_encoding setting or mb_internal_encoding()) or through the encoding argument that mbstring functions accept, usually as the last parameter.
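
To make the byte-versus-character distinction concrete, here is a minimal sketch. The variable names are only illustrative, and the accented character is written as explicit byte escapes so the result does not depend on how the source file itself is saved.

```php
<?php

// "café" with the "é" spelled out as its two UTF-8 bytes.
$str = "caf\xC3\xA9";

// strlen() counts bytes, because PHP strings are byte sequences.
$byteLength = strlen($str);

// mb_strlen() counts characters in the encoding it is told about.
$charLength = mb_strlen($str, 'UTF-8');

// The same bytes measured as ISO-8859-1 give one character per byte.
$latin1Length = mb_strlen($str, 'ISO-8859-1');

echo "$byteLength bytes, $charLength characters as UTF-8, ",
     "$latin1Length characters as ISO-8859-1\n";
// prints "5 bytes, 4 characters as UTF-8, 5 characters as ISO-8859-1"
```

Passing the encoding on every call is more verbose than relying on the internal encoding, but it keeps the result independent of server configuration.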

Second question: yes, with caveats.

Two strings that are both independently valid UTF-8 can be safely byte-wise concatenated (like with PHP's . operator) and still be valid UTF-8. However, you can never be sure, without doing some work yourself, that a POSTed string is valid UTF-8. Database strings are a little easier, if you carefully set the connection character set, because most DBMSs will do any conversion for you.
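
As a rough illustration of "doing some work yourself", the sketch below checks a value with mb_check_encoding() before concatenating it. The helper name utf8_or_null() is hypothetical, and the hard-coded byte strings stand in for values that would normally come from $_POST.

```php
<?php

/**
 * Hypothetical helper: returns the input unchanged if it is a valid
 * UTF-8 string, or null if it is not.
 */
function utf8_or_null($input): ?string
{
    if (!is_string($input)) {
        return null;
    }
    return mb_check_encoding($input, 'UTF-8') ? $input : null;
}

$literal = "Name: ";        // plain ASCII, therefore valid UTF-8
$valid   = "Zo\xC3\xAB";    // "Zoë" as UTF-8 bytes
$invalid = "Zo\xEB";        // "Zoë" as ISO-8859-1 bytes, not valid UTF-8

$checkedValid   = utf8_or_null($valid);
$checkedInvalid = utf8_or_null($invalid);

// Two valid UTF-8 strings stay valid after byte-wise concatenation.
$joined        = $literal . $checkedValid;
$joinedIsValid = mb_check_encoding($joined, 'UTF-8');

// Concatenating without checking can produce an invalid result.
$blindJoinIsValid = mb_check_encoding($literal . $invalid, 'UTF-8');

var_dump($checkedInvalid, $joinedIsValid, $blindJoinIsValid);
// prints NULL, bool(true), bool(false)
```

Rejecting invalid input is only one option; converting it from a known source encoding is another, if you know what that encoding is.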

Solution 2

If your source code is saved as UTF-8, then the string is in UTF-8; if not, it isn't. Since your example string is English-only (plain ASCII), it is valid UTF-8 either way.

PHP itself doesn't know about charsets. If you pass a string to an mb_* function, the function interprets it in the encoding you give it or, failing that, in mbstring's internal encoding, so make sure that encoding is UTF-8.

Concatenation should work fine no matter what, if I understand UTF-8 right :-) Just make sure both strings are UTF-8, otherwise you will get a strange string as a result.
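
Here is a small sketch of what a "strange string" looks like in practice and one way to avoid it. It assumes the non-UTF-8 string is known to be ISO-8859-1; the variable names are illustrative.

```php
<?php

$utf8   = "na\xC3\xAFve ";   // "naïve " as UTF-8 bytes
$latin1 = "caf\xE9";         // "café" as ISO-8859-1 bytes

// Mixing encodings: the concatenation succeeds, but the result
// is no longer valid UTF-8.
$mixed        = $utf8 . $latin1;
$mixedIsValid = mb_check_encoding($mixed, 'UTF-8');

// Convert the ISO-8859-1 string to UTF-8 first, then concatenate.
$converted    = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
$clean        = $utf8 . $converted;
$cleanIsValid = mb_check_encoding($clean, 'UTF-8');

var_dump($mixedIsValid, $cleanIsValid);
// prints bool(false), bool(true)
```

The conversion is only correct if the source encoding really is what you pass as the third argument; mb_convert_encoding() cannot verify that for single-byte encodings.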

Solution 3

Make sure your default_charset directive is set to UTF-8 before any of this code runs.

Either modify php.ini directly or set it at runtime with:

<?php

ini_set( 'default_charset', 'UTF-8' );
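
Since what default_charset influences beyond HTTP headers is disputed and has varied between PHP versions, a cautious sketch is to set it and also pin mbstring's internal encoding explicitly, rather than assuming one setting implies the other. The variable names are illustrative.

```php
<?php

// Set the default charset at runtime, as in the snippet above.
ini_set('default_charset', 'UTF-8');

// Pin mbstring's internal encoding explicitly instead of relying on
// whatever it would otherwise inherit from configuration.
mb_internal_encoding('UTF-8');

$defaultCharset   = ini_get('default_charset');
$internalEncoding = mb_internal_encoding();

// With the internal encoding pinned, the encoding argument may be omitted.
$length = mb_strlen("\xC3\xA9");   // "é" as UTF-8 bytes

echo "$defaultCharset / $internalEncoding / $length\n";
// prints "UTF-8 / UTF-8 / 1"
```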
Author: Keith Palmer Jr.


Updated on June 06, 2022

Comments

  • Keith Palmer Jr.
    Keith Palmer Jr. almost 2 years

    In PHP, if I create a string like this:

    $str = "bla bla here is my string";
    

    Will I then be able to use the mbstring functions to operate on that string as UTF8?

    // Will this work?
    $len = mb_strlen($str);
    

    Further, if I then have another string that I know is UTF-8 (say it was a POSTed form value, or a UTF-8 string from a database), can I then concatenate these two and not have any problems?

    // What about this, will this work? 
    $str = $str . $utf8_string_from_database;
    
  • chazomaticus
    chazomaticus about 15 years
    All this does is control the headers sent to the client. It doesn't actually affect anything about how PHP handles strings.
  • Peter Bailey
    Peter Bailey about 15 years
    It does more than that. Try executing urldecode('%C3%A9') with a default_charset of ISO-8859-1 and then again with a default_charset of UTF-8. But you are correct, it has no bearing on how PHP treats strings at the bit level.
  • chazomaticus
    chazomaticus about 15 years
    The ONLY reason you would see different results from that is because your browser is interpreting those bytes differently. Like I said, it affects NOTHING about how PHP actually handles strings, WHATSOEVER.
  • Peter Bailey
    Peter Bailey about 15 years
    I don't mean to start an argument here, but I think you're missing my point. I'm talking about how the string "%C3%A9" may be interpreted as a single 2-byte sequence, or two 1-byte sequences. This issue exists with or without a browser, although that's certainly where it occurs the most.
  • chazomaticus
    chazomaticus about 15 years
    Also, like I said, setting default_charset only affects the HTTP headers sent to the client. "Browsers" (or anything that does HTTP) then are the only agents that setting default_charset will affect. Execute your example on the command line and you'll see no difference.
  • Anthony Rutledge
    Anthony Rutledge over 4 years
    About the second answer, this is why encoding validation/conversion must be part of any input validation routine. To that end, using a PHP stream input filter (iconv) will enable you to get UTF-8 strings into your application without having to use manual loops and MB/Iconv functions. Learning to get input from the php://input stream wrapper is the key.
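
The stream-filter approach mentioned in the last comment might look roughly like the sketch below. It is not the commenter's code: php://memory stands in for php://input so the example can run from the command line, and ISO-8859-1 is an assumed source encoding. It relies on PHP's built-in convert.iconv.* stream filter, which requires the iconv extension.

```php
<?php

// Stand-in for php://input so the sketch runs without a web request.
$stream = fopen('php://memory', 'w+');
fwrite($stream, "caf\xE9");   // "café" as ISO-8859-1 bytes
rewind($stream);

// Convert on read from ISO-8859-1 (an assumption about the client)
// to UTF-8 using the built-in iconv conversion filter.
stream_filter_append(
    $stream,
    'convert.iconv.ISO-8859-1/UTF-8',
    STREAM_FILTER_READ
);

$body = stream_get_contents($stream);
fclose($stream);

// The bytes that reach the application should now be valid UTF-8.
$bodyIsValid = mb_check_encoding($body, 'UTF-8');
var_dump($bodyIsValid);
```

A filter like this converts; it does not validate. If the real source encoding is unknown, checking the result with mb_check_encoding() is still worthwhile.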