When did C++ compilers start considering more than two hex digits in string literal character escapes?

14,602

Solution 1

GCC is only following the standard. #877: "Each [...] hexadecimal escape sequence is the longest sequence of characters that can constitute the escape sequence."
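
As a rough illustration of the "longest sequence" rule, the sketch below checks at compile time how the two forms from the question are parsed. The names `split_literal` and `wide_literal` are made up for this example, and it assumes a C++11 (or later) compiler.

```cpp
#include <cstddef>

// "\xAB" "Echo": the escape ends at the closing quote of the first literal;
// the two adjacent literals are concatenated only after escapes are processed.
constexpr char split_literal[] = "\xAB" "Echo";

// Six elements: 0xAB, 'E', 'c', 'h', 'o' and the terminating '\0'.
static_assert(sizeof(split_literal) == 6, "one escaped char + 4 letters + NUL");
static_assert(static_cast<unsigned char>(split_literal[0]) == 0xAB, "first char is 0xAB");
static_assert(split_literal[1] == 'E', "'E' is an ordinary character here");

// In a wide literal the escape swallows every hex digit it can: \xABEc.
constexpr wchar_t wide_literal[] = L"\xABEcho";

// Four elements: 0xABEC, 'h', 'o' and the terminating L'\0'.
static_assert(sizeof(wide_literal) / sizeof(wchar_t) == 4, "0xABEC, 'h', 'o', NUL");
static_assert(wide_literal[0] == 0xABEC, "E and c were consumed by the escape");
```

If this translation unit compiles, the assertions hold. Writing `"\xABEcho"` as a narrow literal is what triggers the "hex escape sequence out of range" error on a platform with 8-bit `char`.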

Solution 2

I have found answers to my questions:

  • C++ has always been this way (checked Stroustrup 3rd edition, didn't have any earlier). K&R 1st edition did not mention \x at all (the only character escapes available at that time were octal). K&R 2nd edition states:

    '\xhh'
    

    where hh is one or more hexadecimal digits (0...9, a...f, A...F).

    so it appears this behaviour has been around since ANSI C.

  • While it might be possible for the compiler to only accept >2 characters for wide string literals, this would unnecessarily complicate the grammar.

  • There is indeed a less awkward workaround:

    char foo[] = "\u00ABEcho";
    

    The \u escape always takes exactly four hex digits, so the E that follows is not absorbed into it.

Update: The use of \u isn't applicable in all situations because most ASCII characters (anything below 0xA0 other than $, @ and `) are, for some reason, not permitted to be specified using \u. Here's a snippet from GCC:

/* The standard permits $, @ and ` to be specified as UCNs.  We use
     hex escapes so that this also works with EBCDIC hosts.  */
  else if ((result < 0xa0
            && (result != 0x24 && result != 0x40 && result != 0x60))
           || (result & 0x80000000)
           || (result >= 0xD800 && result <= 0xDFFF))
    {
      cpp_error (pfile, CPP_DL_ERROR,
                 "%.*s is not a valid universal character",
                 (int) (str - base), base);
      result = 1;
    }
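
A small sketch of how the \u form behaves, assuming a C++11 (or later) compiler. It uses `u8` and `u` literals because their encodings are fixed by the standard; for a plain narrow literal the bytes produced by \u00AB depend on the compiler's execution character set. The constant names are invented for this example.

```cpp
#include <cstddef>

// \x yields one char with the value 0xAB: 0xAB, 'E', 'c', 'h', 'o', NUL.
constexpr std::size_t narrow_hex_len = sizeof("\xAB" "Echo");

// \u takes exactly four hex digits, so "Echo" stays separate. U+00AB needs
// two bytes in UTF-8: 0xC2 0xAB, 'E', 'c', 'h', 'o', NUL.
constexpr std::size_t utf8_len = sizeof(u8"\u00ABEcho");

// In UTF-16 the same code point is a single code unit.
constexpr std::size_t utf16_len = sizeof(u"\u00ABEcho") / sizeof(char16_t);

static_assert(narrow_hex_len == 6, "\\x produces a single char");
static_assert(utf8_len == 7, "U+00AB occupies two UTF-8 code units");
static_assert(utf16_len == 6, "U+00AB occupies one UTF-16 code unit");
```

The differing lengths show that \u names a code point rather than a raw value, so it is only a drop-in replacement for \x when the target encoding happens to map that code point to the byte you want.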

Solution 3

I'm pretty sure that C++ has always been this way. In any case, CHAR_BIT may be greater than 8, in which case '\xABE' or '\xABEc' could be valid.
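
To make the CHAR_BIT point concrete, here is a minimal sketch that computes how many hex digits can be needed to spell any value of a type with a given bit width. `hex_digits_for_bits` is a hypothetical helper, not a standard function, and C++11 is assumed.

```cpp
#include <climits>

// Hypothetical helper: hex digits needed to write any value that fits in
// `bits` bits (each hex digit covers four bits, rounded up).
constexpr int hex_digits_for_bits(int bits) {
    return (bits + 3) / 4;
}

// Digits needed for a char on the platform this is compiled for.
constexpr int hex_digits_per_char = hex_digits_for_bits(CHAR_BIT);

static_assert(CHAR_BIT >= 8, "the C and C++ standards require at least 8 bits");
static_assert(hex_digits_for_bits(8) == 2, "8-bit char: \\xAB is the widest escape");
static_assert(hex_digits_for_bits(16) == 4, "16-bit char: \\xABEc would fit");
```

On a platform with a 16-bit `char`, a four-digit escape such as `'\xABEc'` is therefore in range, which is one reason the grammar cannot simply stop after two digits.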

Solution 4

I solved this by writing the following character as \xnn too. Unfortunately, you have to keep doing this for as long as the characters that follow are hex digits (0-9, a-f, A-F). For example, "\xnneceg" is replaced by "\xnn\x65\x63\x65g".
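
Since the literal in the question is generated, this rule can be applied mechanically. The following is a minimal sketch under stated assumptions: `escape_for_literal` is a hypothetical helper, and it assumes 8-bit `char` and an ASCII-compatible source character set.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: returns `bytes` spelled as a quoted C++ narrow string
// literal. Assumes 8-bit char and an ASCII-compatible character set.
std::string escape_for_literal(const std::string& bytes) {
    static const char kHex[] = "0123456789ABCDEF";
    std::string out = "\"";
    bool after_hex_escape = false;  // was the previous character written as \xNN?
    for (unsigned char c : bytes) {
        const bool printable = c >= 0x20 && c < 0x7F && c != '"' && c != '\\';
        // A hex digit right after \xNN would be absorbed into that escape,
        // so it has to be written as an escape itself.
        const bool would_extend = after_hex_escape && std::isxdigit(c) != 0;
        if (printable && !would_extend) {
            out += static_cast<char>(c);
            after_hex_escape = false;
        } else {
            out += "\\x";
            out += kHex[c >> 4];
            out += kHex[c & 0x0F];
            after_hex_escape = true;
        }
    }
    out += '"';
    return out;
}
```

For the bytes 0xAB, 'e', 'c', 'e', 'g' this produces `"\xAB\x65\x63\x65g"`, matching the pattern described above: the three hex-digit letters are escaped, and `g` ends the run because it is not a hex digit.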

Author: Greg Hewgill

Software geek. Twitter: @ghewgill

Updated on June 08, 2022

Comments

  • Greg Hewgill
    Greg Hewgill almost 2 years

    I've got a (generated) literal string in C++ that may contain characters that need to be escaped using the \x notation. For example:

    char foo[] = "\xABEcho";
    

    However, g++ (version 4.1.2 if it matters) throws an error:

    test.cpp:1: error: hex escape sequence out of range
    

    The compiler appears to be considering the Ec characters as part of the preceding hex number (because they look like hex digits). Since a four digit hex number won't fit in a char, an error is raised. Obviously for a wide string literal L"\xABEcho" the first character would be U+ABEC, followed by L"ho".

    It seems this has changed sometime in the past couple of decades and I never noticed. I'm almost certain that old C compilers would only consider two hex digits after \x, and not look any further.

    I can think of one workaround for this:

    char foo[] = "\xAB""Echo";
    

    but that's a bit ugly. So I have three questions:

    • When did this change?

    • Why doesn't the compiler only accept >2-digit hex escapes for wide string literals?

    • Is there a workaround that's less awkward than the above?

  • Ben Voigt
    Ben Voigt about 13 years
    Doesn't change the behavior at all, the standard says "There is no limit to the number of digits in a hexadecimal sequence." So now "\x00ABEc" is treated as a single hexadecimal character.
  • user1066101
    user1066101 about 13 years
    @Ben Voigt: "Specifying \xnn in a wchar_t string literal is equivalent to specifying \x00nn". It seems that some compilers are at odds with your interpretation.
  • Ignacio Vazquez-Abrams
    Ignacio Vazquez-Abrams about 13 years
    But what does it say about \xnnn? Is that considered equivalent to \x00nnn?
  • user1066101
    user1066101 about 13 years
    @Ignacio Vazquez-Abrams: Nothing.
  • Wiz
    Wiz almost 11 years
    -1. Note that this answer is not correct: only a hexadecimal escape sequence is the longest possible sequence of hexadecimal digits. Octal escape sequences, on the other hand, are limited to at most three octal digits. That's how the standard dictates it. (C++11, $2.14.3 Character literals).
  • Ignacio Vazquez-Abrams
    Ignacio Vazquez-Abrams almost 11 years
    @Wiz: You do know that 4.1.2 had no support for C++11, right?
  • Wiz
    Wiz almost 11 years
    @IgnacioVazquez-Abrams I am not sure what you mean by that. This rule has been in existence since the original C standard back in C89. It's the same in C89/C99/C11/C++98/C++11. I only happen to quote the latest standard, that's all.
  • Ignacio Vazquez-Abrams
    Ignacio Vazquez-Abrams almost 11 years
    @Wiz: Does that then mean that the standard is contradicting itself?
  • Wiz
    Wiz almost 11 years
    I am not sure what you mean by that. The standard just says that octal escape sequence can be at most 3 octal digits while hex escape sequences have no upper limit as to their length.
  • Ignacio Vazquez-Abrams
    Ignacio Vazquez-Abrams almost 11 years
    @Wiz: Oh, I see. You're fixated on the "octal" bit when the question doesn't even bring it up. I get it now.
  • Wiz
    Wiz almost 11 years
    The quoted text in your answer is simply wrong with respect to octal escape sequences. That's all I have an issue with.
  • Brian Bi
    Brian Bi almost 10 years
    Also \u is not really equivalent to \x in the sense that \x produces a particular integer value, whereas \u produces a certain ISO 10646 code point, so the numerical value depends on encoding.
  • supercat
    supercat almost 9 years
    On some systems, a char may require three or four hex digits (or even more). While CHAR_BIT is usually eight, there are some systems still in production (such as digital signal processors) where char is some other size (16 probably being the most common size other than eight).
  • Adrian McCarthy
    Adrian McCarthy about 6 years
    It's interesting that the number of hexadecimal digits in an escape is unbounded but the number of octal digits must be one, two, or three. And why the heck do the longer universal character names require eight digits when the first two must necessarily be 0?
  • phuclv
    phuclv almost 6 years
    There are better ways, such as "\u00nnEcho" or "\xnn" "Echo".
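
The octal versus hexadecimal difference raised in the comments can be checked at compile time. This is a small sketch assuming a C++11 (or later) compiler and an ASCII-compatible execution character set; the constant names are invented for the example.

```cpp
#include <cstddef>

// An octal escape stops after three digits: '\123', then '4', then NUL.
constexpr std::size_t octal_len = sizeof("\1234");

// A hex escape has no length limit: all seven digits form one char (0x41).
constexpr std::size_t hex_len = sizeof("\x0000041");

static_assert(octal_len == 3, "\\123 followed by an ordinary '4'");
static_assert("\1234"[1] == '4', "the fourth digit is a separate character");
static_assert(hex_len == 2, "one char plus the terminating NUL");
static_assert("\x0000041"[0] == 0x41, "leading zeros do not end the escape");
```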