Pasting binary data into a Unicode terminal

Solution 1

ef bf bd is the UTF-8 encoding of REPLACEMENT CHARACTER (�), which is "used to replace an incoming character whose value is unknown or unrepresentable in Unicode".
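As a minimal sketch of where those three bytes come from, assume the terminal decodes pasted bytes as UTF-8 and substitutes U+FFFD for anything invalid. Python's `errors="replace"` behaves this way; the terminal's actual code path may differ.

```python
# A lone 0x80 byte is a continuation byte with no lead byte,
# so on its own it is not valid UTF-8.
raw = b"\x80"

# Lenient decoding substitutes U+FFFD REPLACEMENT CHARACTER for the invalid byte.
text = raw.decode("utf-8", errors="replace")

# Re-encoding that character as UTF-8 yields the three bytes seen in the hex dump.
pasted = text.encode("utf-8")
print(pasted.hex(" "))  # ef bf bd
```

The original byte value is lost at the decoding step, which is why every high byte ends up as the same sequence.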

What you describe is not "extended ASCII" but binary data. Bytes in the range 0x80-0x9f are not printable characters in any ISO 8859 encoding, and some ISO 8859 parts leave further positions unassigned, so it's understandable that some programs treat such bytes as unknown characters.

You could try using an 8-bit character encoding that assigns a character to all 256 positions, such as IBM code page 850.

But then the program you're copying from might be interpreting the data too. And what happens when you paste a null byte or a terminal escape sequence? The whole approach seems destined to fail.

Solution 2

There are several comments which did not get a suitable answer. Here are some points:

  • xterm doesn't accept "arbitrary binary data". It accepts (depending on locale) UTF-8 or ISO-8859-1. The latter follows the ICCCM; the former is an extension from XFree86. In either encoding, xterm may interpret these characters to (attempt to) provide the data from the selection. If pasting UTF-8 text from a selection into ISO-8859-1 encoding, it will approximate the most commonly used characters (including line-drawing).

  • Selection (and pasting) depends on both the source (where the selection is made) and the target (where the text is pasted). Both have to agree on the format of the data to select/paste. xterm provides and accepts several formats (see button.c in the sources). Konsole and gnome-terminal use fewer formats.

  • Konsole, for instance, does X11 selection as an afterthought. It uses the QClipboard::Selection mode. The section Notes for X11 Users in Qt's documentation is interesting reading in that regard. But read the code and you will see that it only supports COMPOUND_TEXT:

    if (*format == 8 && *type == ATOM(COMPOUND_TEXT)) {
        // convert COMPOUND_TEXT to a multibyte string
        XTextProperty textprop;
        textprop.encoding = *type;
        textprop.format = *format;
        textprop.nitems = buffer_offset;
        textprop.value = (unsigned char *) buffer->data();
    
        char **list_ret = 0;
        int count;
        if (XmbTextPropertyToTextList(display, &textprop, &list_ret,
                     &count) == Success && count && list_ret) {
            offset = buffer_offset = strlen(list_ret[0]);
            buffer->resize(offset);
            memcpy(buffer->data(), list_ret[0], offset);
        }
        if (list_ret) XFreeStringList(list_ret);
    }
    
  • Likewise, GNOME's VTE uses gtk_clipboard_get_for_display, generally following Qt's lead.

  • IBM 850 is an 8-bit encoding (like ISO-8859-1) and cannot represent the Unicode replacement character (U+FFFD), so your terminal substitutes ? (0x3f, the default character).
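The last point can be illustrated with Python's `cp850` codec, used here only as a stand-in for the terminal's own conversion, which is an assumption. The codec itself round-trips every byte value, but once a byte has become U+FFFD there is nothing to map it back to except the fallback character.

```python
# Every one of the 256 byte values maps to some character in Python's
# cp850 codec, so decoding and re-encoding is lossless.
all_bytes = bytes(range(256))
round_trip = all_bytes.decode("cp850").encode("cp850")

# U+FFFD has no slot in an 8-bit code page; with errors="replace"
# the encoder emits "?" (0x3f) instead.
fallback = "\ufffd".encode("cp850", errors="replace")

print(round_trip == all_bytes, fallback)
```

This suggests the 0x3f bytes come from converting an already-replaced character, not from the code page lacking positions for the high bytes.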

Solution 3

Terminals are generally not designed to accept binary input: they expect control characters to have a special meaning in applications, and they process some control characters themselves (mostly by turning a few of them into signals).
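To get a feel for which bytes in a payload would be intercepted, here is a rough sketch. The table holds typical Linux default tty special characters; these values are an assumption, and `stty -a` shows the real settings on a given system. The helper name `risky_bytes` is hypothetical.

```python
# Typical Linux default tty special characters (assumed defaults;
# check `stty -a` for the values actually in effect).
ASSUMED_TTY_SPECIALS = {
    0x03: "VINTR (sends SIGINT)",
    0x04: "VEOF (end of input)",
    0x11: "VSTART (XON)",
    0x13: "VSTOP (XOFF)",
    0x15: "VKILL (erase line)",
    0x1A: "VSUSP (sends SIGTSTP)",
    0x1C: "VQUIT (sends SIGQUIT)",
    0x7F: "VERASE (erase character)",
}


def risky_bytes(data: bytes) -> list[tuple[int, str]]:
    """Return (offset, meaning) for each byte the tty would likely intercept."""
    return [
        (offset, ASSUMED_TTY_SPECIALS[value])
        for offset, value in enumerate(data)
        if value in ASSUMED_TTY_SPECIALS
    ]


print(risky_bytes(b"\x80\x03ok\x1a"))
```

An application that puts the terminal into raw mode disables most of this processing, so the list only applies to the default (cooked) mode.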

An exception is Emacs's term mode (or one of its variants), which treats pasted data as raw text that's passed on to the application.

The normal method of providing binary input to an application would be to redirect its input from a file or pipe. If the data is in the X clipboard, you can use xclip or xsel:

xclip -o | myapp
xsel -o | myapp
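The same idea can be sketched in Python: hand the bytes to the program through a pipe so that no terminal ever interprets them. The helper `feed_bytes` and the echoing child process are hypothetical stand-ins for myapp.

```python
import subprocess
import sys


def feed_bytes(command: list[str], data: bytes) -> bytes:
    """Run command with data on its stdin via a pipe and return its stdout."""
    result = subprocess.run(command, input=data, stdout=subprocess.PIPE, check=True)
    return result.stdout


# Stand-in for "myapp": a child process that copies stdin to stdout unchanged.
echo_child = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())",
]

# Null byte, escape, and two high bytes: all survive the pipe untouched.
payload = bytes([0x00, 0x1B, 0x80, 0xFF])
print(feed_bytes(echo_child, payload) == payload)
```

Because a pipe carries bytes without any character-set conversion or control-character handling, null bytes and escape sequences arrive intact.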

Solution 4

The expected behaviour worked here using the yakuake terminal. I ran echo -en "\x5" | xclip and then middle-clicked in a screen session that had a serial port open. The device echoed the byte just as expected.

Author: mtvec
Updated on September 18, 2022

Comments

  • mtvec
    mtvec over 1 year

    I need to be able to paste binary data into a terminal. For some reason, every byte outside the ASCII range (0x80-0xff) is pasted as the same three-byte sequence 0xef 0xbf 0xbd.

    For example:

    $ echo -en "\x80" | xclip
    $ hd
    <paste><EOF>
    00000000  ef bf bd                                       |...|
    00000004
    

    It has something to do with the character encoding used by the terminal: if I change it from UTF-8 to ISO 8859 or similar, every character in the extended range is translated to 0x3f instead.

    Does anybody have an idea on how to paste arbitrary binary data into the terminal?

    Edit: This seems to be very terminal-dependent. The example above is in Konsole. I get the desired behavior in xterm, and Gnome Terminal does not allow pasting characters in the extended range at all. Any Konsole-specific solution would still be appreciated.

    • Admin
      Admin about 8 years
      xterm doesn't accept "arbitrary binary data". It accepts (depending on locale) UTF-8 or ISO-8859-1. It sounds as if OP is selecting text which is not UTF-8 (which would explain the problems with konsole and gnome-terminal). None of the suggested answers seem to have focused on the root cause.
  • mtvec
    mtvec almost 12 years
    That sounds reasonable but why then does it work in xterm?
  • Mikel
    Mikel almost 12 years
    I guess it's passing through unrecognized data rather than trying to interpret it.
  • mtvec
    mtvec almost 12 years
    The problem is that after having provided the binary data, the program accepts normal user input again. This can be tackled by using named pipes, something like cat pipe - | myapp. But since this is meant to be a demonstration for people who do not necessarily have a programming background, I wanted to keep it as simple as possible and just find a way to paste binary data in the terminal.
  • mtvec
    mtvec almost 12 years
    IBM 850 also seems to translate those "high" bytes to 0x3f.
  • Marius
    Marius almost 8 years
    The example used in this answer does not address the question: it uses a byte in the normal ASCII range, so UTF-8 encoding is not involved.