NSString initWithData returns null

Solution 1

You say that it “is definitely UTF-8”, but without a Content-Type header, you don't really know that. (And even if you did have a header saying that, it could still be wrong.)

My guess is that your data is usually ASCII, which always parses correctly as UTF-8, but you sometimes are trying to parse data that's actually encoded in ISO 8859-1 or Windows codepage 1252. Such data will generally be mostly ASCII, but with some bytes outside the 0–127 range ASCII defines. UTF-8 expects every such byte to be part of a multi-byte sequence (a lead byte followed by one to three continuation bytes, each within a specific range), but in those other encodings, any byte, regardless of value, is a complete character on its own. Trying to interpret non-ASCII non-UTF-8 data as UTF-8 will almost always get you either wrong results (wrong characters) or no results at all (cannot decode; decoder returns nil), because the data was never encoded in UTF-8 in the first place.
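
To make the point above concrete, here is a minimal sketch of a UTF-8 validity check in plain C (which also compiles inside an Objective-C file). The function names utf8_seq_len and is_valid_utf8 are made up for this illustration and are not part of any Apple or iconv API. The word "prété" passes when encoded as UTF-8, but fails when the é is the single Latin-1 byte 0xE9, because that byte announces a three-byte sequence that never arrives.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns the length (1-4) of the well-formed UTF-8
   sequence starting at s[0], or 0 if the bytes there are not valid UTF-8.
   n is the number of bytes available. */
static size_t utf8_seq_len(const unsigned char *s, size_t n) {
    if (n == 0) return 0;
    unsigned char b = s[0];
    unsigned char lo = 0x80, hi = 0xBF; /* allowed range for the 2nd byte */
    size_t len;

    if (b < 0x80) {
        return 1;                        /* plain ASCII */
    } else if (b >= 0xC2 && b <= 0xDF) {
        len = 2;
    } else if (b >= 0xE0 && b <= 0xEF) {
        len = 3;
        if (b == 0xE0) lo = 0xA0;        /* reject overlong forms */
        if (b == 0xED) hi = 0x9F;        /* reject UTF-16 surrogates */
    } else if (b >= 0xF0 && b <= 0xF4) {
        len = 4;
        if (b == 0xF0) lo = 0x90;        /* reject overlong forms */
        if (b == 0xF4) hi = 0x8F;        /* reject values above U+10FFFF */
    } else {
        return 0;                        /* stray continuation or bad lead */
    }

    if (n < len) return 0;               /* sequence is truncated */
    if (s[1] < lo || s[1] > hi) return 0;
    for (size_t i = 2; i < len; i++) {
        if (s[i] < 0x80 || s[i] > 0xBF) return 0;
    }
    return len;
}

/* Hypothetical helper: true if the whole buffer is well-formed UTF-8. */
static bool is_valid_utf8(const unsigned char *s, size_t n) {
    size_t i = 0;
    while (i < n) {
        size_t len = utf8_seq_len(s + i, n - i);
        if (len == 0) return false;
        i += len;
    }
    return true;
}
```

Running a check like this on the bytes you received (for example on data.bytes and data.length) before decoding tells you whether a nil result is due to the content rather than to your code.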

You should try UTF-8 first, and if it fails, use ISO 8859-1. If you're letting the user retrieve any web page, you should let them change the encoding you use to decode the data, in case they discover that it was actually 8859-9 or codepage-1252 or some other 8-bit encoding.
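
A rough sketch of that "UTF-8 first, then ISO 8859-1" strategy in plain C follows. All function names here are invented for the example. It relies on the fact that every Latin-1 byte value equals its Unicode code point, so the fallback can never fail; it can only produce the wrong characters if the data was really in some other 8-bit encoding. In Cocoa, the same idea is to call initWithData:encoding: with NSUTF8StringEncoding and, if that returns nil, again with NSISOLatin1StringEncoding.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: length of the valid UTF-8 sequence at s, or 0. */
static size_t utf8_seq_len(const unsigned char *s, size_t n) {
    if (n == 0) return 0;
    unsigned char b = s[0], lo = 0x80, hi = 0xBF;
    size_t len;
    if (b < 0x80) return 1;
    else if (b >= 0xC2 && b <= 0xDF) len = 2;
    else if (b >= 0xE0 && b <= 0xEF) {
        len = 3;
        if (b == 0xE0) lo = 0xA0;
        if (b == 0xED) hi = 0x9F;
    } else if (b >= 0xF0 && b <= 0xF4) {
        len = 4;
        if (b == 0xF0) lo = 0x90;
        if (b == 0xF4) hi = 0x8F;
    } else return 0;
    if (n < len || s[1] < lo || s[1] > hi) return 0;
    for (size_t i = 2; i < len; i++)
        if (s[i] < 0x80 || s[i] > 0xBF) return 0;
    return len;
}

static bool is_valid_utf8(const unsigned char *s, size_t n) {
    for (size_t i = 0; i < n; ) {
        size_t len = utf8_seq_len(s + i, n - i);
        if (len == 0) return false;
        i += len;
    }
    return true;
}

/* Re-encode ISO 8859-1 bytes as UTF-8. Bytes >= 0x80 map to code points
   U+0080..U+00FF, which always take exactly two bytes in UTF-8. */
static char *latin1_to_utf8(const unsigned char *s, size_t n) {
    char *out = malloc(2 * n + 1);
    if (out == NULL) return NULL;
    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        if (s[i] < 0x80) {
            out[j++] = (char)s[i];
        } else {
            out[j++] = (char)(0xC0 | (s[i] >> 6));
            out[j++] = (char)(0x80 | (s[i] & 0x3F));
        }
    }
    out[j] = '\0';
    return out;
}

/* Returns a NUL-terminated UTF-8 string the caller must free():
   the input itself if it is valid UTF-8, otherwise the input
   reinterpreted as ISO 8859-1. */
static char *decode_with_fallback(const unsigned char *s, size_t n) {
    if (is_valid_utf8(s, n)) {
        char *copy = malloc(n + 1);
        if (copy == NULL) return NULL;
        memcpy(copy, s, n);
        copy[n] = '\0';
        return copy;
    }
    return latin1_to_utf8(s, n);
}
```

The order matters: Latin-1 accepts any byte sequence, so trying it first would never give UTF-8 a chance.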

If you're downloading the data from a specific server, and especially if you have influence on what runs on that server, you should make it serve up an accurate Content-Type header and/or fix whatever bug is causing it to serve up text that isn't in UTF-8.

Solution 2

As Peter said, the Content-Type header is just a "hint" of what the content is expected to be. On the server side you can set any Content-Type and still send any byte sequence, including an invalid one.

I had exactly the same issue dealing with incorrect UTF-8 data, which included ISO-8859-1 (Latin-1) characters (French accents).

Wikipedia about UTF-8 is worth reading to understand this issue and how to handle encoding errors.

The fact is that NSString's initWithData:encoding: is strict: it simply returns nil when a decoding error occurs (unlike Java, for instance, which substitutes a replacement character).

Peter's solution of decoding mostly-UTF-8 data as Latin-1 did not satisfy me: every multi-byte UTF-8 character comes out wrong just because of one stray Latin-1 character.

The best option would be a fix on the server side, sure, but I'm not responsible for that side...

So I looked deeper and found a solution using the GNU libiconv C library (available on OSX and iOS). The principle is to use iconv to remove byte sequences that are not valid UTF-8 (e.g. "prété" with its é stored as Latin-1 bytes will become "prt").

Here is some sample code, equivalent to the command line iconv -c -f UTF-8 -t UTF-8 invalid.txt > cleaned.txt

#include <iconv.h>

- (NSData *)cleanUTF8:(NSData *)data {
  iconv_t cd = iconv_open("UTF-8", "UTF-8"); // convert to UTF-8 from UTF-8
  if (cd == (iconv_t)-1) {
    return nil; // the conversion descriptor could not be created
  }
  int one = 1;
  iconvctl(cd, ICONV_SET_DISCARD_ILSEQ, &one); // discard invalid characters

  size_t inbytesleft, outbytesleft;
  inbytesleft = outbytesleft = data.length;
  char *inbuf  = (char *)data.bytes;
  char *outbuf = malloc(data.length + 1); // +1 so empty input never asks for 0 bytes
  if (outbuf == NULL) {
    iconv_close(cd);
    return nil;
  }
  char *outptr = outbuf;
  NSData *result = nil;
  if (iconv(cd, &inbuf, &inbytesleft, &outptr, &outbytesleft)
      == (size_t)-1) {
    NSLog(@"this should not happen, seriously");
  } else {
    result = [NSData dataWithBytes:outbuf length:data.length - outbytesleft];
  }
  // Release the descriptor and the buffer on both the success and error paths.
  iconv_close(cd);
  free(outbuf);
  return result;
}

The resulting NSData can then be safely decoded using NSUTF8StringEncoding.

Note that recent versions of iconv also allow fallback handlers, registered with:

iconvctl(cd, ICONV_SET_FALLBACKS, &fallbacks);

With a fallback on Unicode errors, you can insert a replacement character or, better, try another encoding. In my case I managed to fall back to Latin-1 wherever UTF-8 failed, which gave correct conversions about 99% of the time. The iconv source code is the best place to understand how this works.
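
For readers who want the idea without the iconv fallback machinery, here is a hand-rolled sketch in plain C. It does not use ICONV_SET_FALLBACKS at all: it simply keeps every well-formed UTF-8 sequence and re-encodes any byte that does not fit as if it were a Latin-1 character. The names utf8_seq_len and repair_mixed_utf8 are hypothetical, and treating the stray bytes as Latin-1 is an assumption about the data.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: length of the valid UTF-8 sequence at s, or 0. */
static size_t utf8_seq_len(const unsigned char *s, size_t n) {
    if (n == 0) return 0;
    unsigned char b = s[0], lo = 0x80, hi = 0xBF;
    size_t len;
    if (b < 0x80) return 1;
    else if (b >= 0xC2 && b <= 0xDF) len = 2;
    else if (b >= 0xE0 && b <= 0xEF) {
        len = 3;
        if (b == 0xE0) lo = 0xA0;
        if (b == 0xED) hi = 0x9F;
    } else if (b >= 0xF0 && b <= 0xF4) {
        len = 4;
        if (b == 0xF0) lo = 0x90;
        if (b == 0xF4) hi = 0x8F;
    } else return 0;
    if (n < len || s[1] < lo || s[1] > hi) return 0;
    for (size_t i = 2; i < len; i++)
        if (s[i] < 0x80 || s[i] > 0xBF) return 0;
    return len;
}

/* Returns a NUL-terminated, valid UTF-8 string the caller must free().
   Valid sequences are copied through; every other byte is assumed to be
   ISO 8859-1 and is re-encoded as a two-byte UTF-8 sequence. */
static char *repair_mixed_utf8(const unsigned char *s, size_t n) {
    char *out = malloc(2 * n + 1);       /* worst case: every byte doubles */
    if (out == NULL) return NULL;
    size_t i = 0, j = 0;
    while (i < n) {
        size_t len = utf8_seq_len(s + i, n - i);
        if (len > 0) {
            memcpy(out + j, s + i, len); /* already good UTF-8 */
            i += len;
            j += len;
        } else {
            out[j++] = (char)(0xC0 | (s[i] >> 6));
            out[j++] = (char)(0x80 | (s[i] & 0x3F));
            i += 1;
        }
    }
    out[j] = '\0';
    return out;
}
```

Unlike discarding, this keeps the accented characters: "prété" with one Latin-1 é and one UTF-8 é comes out with both é intact. It can still guess wrong if the stray bytes came from some other encoding.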

Solution 3

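HTTP/1.1 as originally specified (RFC 2616) defines ISO-8859-1 as the default character set for text media types when none is specified. If the response complies with that specification and doesn't declare a character set, that is the encoding it is nominally using. (The later revision, RFC 7231, dropped this default, so treat it as a likely guess rather than a guarantee.)

Try decoding the data with NSISOLatin1StringEncoding.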

Solution 4

The data might be in another Unicode encoding, such as UTF-16, or in some totally different encoding.

There are libraries that can guess the encoding of a piece of data, but that should be a last resort. If you're using a web service, it should have documentation that says which encoding it uses. Look for it, or ask the provider. If neither is available, get some sample data, determine its encoding, and use that in your program.
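
One cheap check that is worth doing before any real guessing is to look for a byte order mark at the start of the data. The sketch below is only an illustration in plain C; the enum and the function name sniff_bom are invented here. It only recognises the UTF-8 and UTF-16 marks, many files carry no mark at all, and a UTF-32 little-endian mark (FF FE 00 00) would be reported as UTF-16 LE, so treat the result as a hint.

```c
#include <stddef.h>

/* Hypothetical result type for this example. */
typedef enum {
    BOM_NONE,      /* no recognised byte order mark */
    BOM_UTF8,      /* EF BB BF */
    BOM_UTF16_LE,  /* FF FE */
    BOM_UTF16_BE   /* FE FF */
} bom_kind;

/* Inspect the first bytes of a buffer for a byte order mark. */
static bom_kind sniff_bom(const unsigned char *s, size_t n) {
    if (n >= 3 && s[0] == 0xEF && s[1] == 0xBB && s[2] == 0xBF)
        return BOM_UTF8;
    if (n >= 2 && s[0] == 0xFF && s[1] == 0xFE)
        return BOM_UTF16_LE;
    if (n >= 2 && s[0] == 0xFE && s[1] == 0xFF)
        return BOM_UTF16_BE;
    return BOM_NONE;
}
```

If a UTF-16 mark is found, decoding with NSUTF16StringEncoding instead of NSUTF8StringEncoding is the obvious next thing to try.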

On a side note, is pulling web pages and data from an API good practice, i.e. buffering the data, converting into a string, and manipulating the string afterwards?

That depends on the size of the data. If it's small, that would be perfectly fine. If it's big, it would be better to deal with the data piecemeal.
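
One thing to watch for when working piecemeal: a chunk boundary can fall in the middle of a multi-byte UTF-8 character, and decoding that chunk on its own will then fail even though the data as a whole is fine. A possible way around it, sketched here in plain C with a made-up function name, is to work out how many bytes at the end of a chunk belong to a character that is not complete yet, decode everything before them, and prepend the leftover bytes to the next chunk. This sketch assumes the data is otherwise valid UTF-8 and does no validation of its own.

```c
#include <stddef.h>

/* Hypothetical helper: returns how many bytes at the end of the chunk
   (0 to 3) start a multi-byte UTF-8 character that is still incomplete. */
static size_t utf8_incomplete_tail(const unsigned char *s, size_t n) {
    /* A character is at most 4 bytes, so an incomplete one is at most 3. */
    for (size_t k = 1; k <= 3 && k <= n; k++) {
        unsigned char b = s[n - k];
        if (b < 0x80) {
            return 0;                 /* ASCII: nothing is pending */
        }
        if (b >= 0xC0) {              /* lead byte of a multi-byte character */
            size_t need = (b >= 0xF0) ? 4 : (b >= 0xE0) ? 3 : 2;
            return (need > k) ? k : 0;
        }
        /* Otherwise a continuation byte (0x80..0xBF): keep looking back. */
    }
    return 0;
}
```

With NSURLConnection that would mean decoding only the first length minus tail bytes of what has arrived in connection:didReceiveData: and carrying the tail over to the next call.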

Author by

dmkc

Updated on December 08, 2020

Comments

  • dmkc
    dmkc over 3 years

    I am pulling data from a website via NSURLConnection and stashing the received data away in an instance of NSMutableData. In the connectionDidFinishLoading delegate method the data is converted into a string with a call to NSString's appropriate method:

    NSString *result = [[NSString alloc] initWithData:data
                                             encoding:NSUTF8StringEncoding];


    The resulting string turns out to be nil. If I use NSASCIIStringEncoding, however, I do obtain the appropriate string, albeit with Unicode characters garbled as expected. The server's Content-Type header does not specify the UTF-8 encoding, but I have tried a number of different websites with a similar setup, and there the string conversion happens just fine. It seems like the problem only pertains to the given web service, but I have no clue why.

    On a side note, is pulling web pages and data from an API good practice, i.e. buffering the data, converting into a string, and manipulating the string afterwards?

    Much appreciated!

    • Nico
      Nico almost 14 years
      For debugging, you should save the data to a file in the temporary directory if the method fails, so that you can open the file in TextWrangler or something to see what encoding it's actually in.
  • dmkc
    dmkc almost 14 years
    It is definitely UTF-8. It's almost like a certain character is causing it to freak out.
  • Yuji
    Yuji almost 14 years
    Could you post the exact string which causes the problem? Maybe it's malformed, etc.
  • dmkc
    dmkc almost 14 years
    This is so strange. It started working fine now.. I found another site it failed at, hypem.com. But that now also works fine.. I want to blame the simulator or my network somehow, but I honestly don't know.. In general though, what could possibly cause such an error given it's not my device? Could a network failure possibly produce that, or would one of the proper delegate methods get called in case of an error? Thank you for sticking around to answer!
  • Yuji
    Yuji almost 14 years
    I guess the data itself from the website is sometimes corrupt, due to a failure to convert to UTF-8 to start with, etc. Encoding problems are very dear to me, coming from Japan where three encodings were competing with each other. Gradual adoption of UTF-8, although not perfect, is a real blessing to me.
  • dmkc
    dmkc almost 14 years
    This is probably the most full and complete answer. In the interests of those following in my steps googling for this question I shall make the answer available as the answer :). To sum up, it seems like decoding as UTF, and falling back to other encodings might be the best bet in case something happens.
  • dmkc
    dmkc about 13 years
    I don't understand how you can just discard characters? What if you're dealing with Cyrillic, for instance? You'd be discarding every single character in your input.
  • Vincent Guerci
    Vincent Guerci about 13 years
    My answer to the question is a way to ensure that text is valid UTF-8. That's why the iconv code I posted just removes invalid UTF-8 characters. Cyrillic can be encoded in UTF-8, and in other encodings too, which is off topic.
  • htafoya
    htafoya over 11 years
    Nice, actually NSASCIIStringEncoding worked fine as mitjak says, but I think it's good practice to test against several encodings in case one fails. I'll save that for my IO utility classes.
  • Nico
    Nico over 11 years
    @htafoya: NSASCIIStringEncoding should not work on any string containing any character values above 127, since that is not valid ASCII. You should just get nil. In practice, last I checked, Cocoa treats that constant as synonymous with ISO 8859-1. I can only assume that the reason why Apple has not fixed this is because applications exist that say “ASCII” when they mean “ISO 8859-1” that would break under the correct behavior.
  • fatfreddyscat
    fatfreddyscat over 9 years
    WOW!!! THANK YOU!!! I've been banging my head against the wall dealing with invalid UTF-8 that comes from a server which I have no control over - my NSString would always be (null) :( I wish ios just put square blocks or question marks or something like Android's String class. Thank you again for this!!
  • fatfreddyscat
    fatfreddyscat over 9 years
    btw, you forgot to free() your outbuf before returning nil in the error case; you might want to fix it up so that someone who copies and pastes it doesn't have a memory leak (yeah yeah, should seriously never happen, I know, but it's still good practice). (oh, and also close your iconv handle)
  • InsaneRabbit
    InsaneRabbit almost 7 years
    Thank you! I just have an email converted by ICU to UTF8 in C++, but cannot convert to NSString. I tried your method it works!