How to force shell script character encoding from within the script

Bash stores strings as byte strings, and performs operations according to the current LC_CTYPE setting. So there is no need to restart bash: just set the LC_CTYPE or LC_ALL variable to your desired locale. Note that if you store a string in a variable or function, what matters is the encoding at the time the variable is expanded or the relevant command in the function is executed. Here's a script that demonstrates that:

#!/bin/bash
LC_CTYPE=en_US.utf8
v_utf8='é'
n_utf8=${#v_utf8}
f_utf8 () { tmp='é'; echo ${#tmp}; }
echo "UTF-8 in UTF-8: $n_utf8 $(f_utf8)"
LC_CTYPE=en_US
v_latin1='é'
n_latin1=${#v_latin1}
f_latin1 () { tmp='é'; echo ${#tmp}; }
echo "Latin 1 in Latin 1: $n_latin1 $(f_latin1)"
echo "UTF-8 in Latin 1: ${#v_utf8} $(f_utf8)"
LC_CTYPE=en_US.utf8
echo "Latin 1 in UTF-8: ${#v_latin1} $(f_latin1)"

Output:

UTF-8 in UTF-8: 1 1
Latin 1 in Latin 1: 2 2
UTF-8 in Latin 1: 2 2
Latin 1 in UTF-8: 1 1

As you can see, the length of the string is calculated according to the value of LC_CTYPE at the time of expansion, regardless of its value when the variable or function was defined.
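In practice, this means a script can simply export a suitable locale near its top. Since locale names vary between systems, here is a hedged sketch; the candidate names below are assumptions, so check `locale -a` on the target machine:

```shell
#!/bin/bash
# Sketch: select the first available UTF-8 locale from a candidate list
# (the names are assumptions; adjust to what `locale -a` reports).
for loc in C.UTF-8 en_US.utf8 en_US.UTF-8; do
    if locale -a 2>/dev/null | grep -qixF "$loc"; then
        export LC_ALL=$loc
        break
    fi
done

s='héllo'
echo "${#s}"   # 5 under a UTF-8 locale, 6 under a single-byte locale
```

If none of the candidates exist, LC_ALL is left untouched and the script falls back to whatever locale the caller provided.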

Author: serhatg
Updated on September 18, 2022

Comments

  • serhatg
    serhatg over 1 year

    I have a few shell scripts with UTF-8-encoded characters inside, and I want to be sure that they are decoded correctly regardless of the machine's locale settings.

    Is it possible to force the shell (bash or sh) to detect the correct script encoding? (Something similar to the Python or Ruby encoding cookie.)

    The solution could be a custom shebang like:

    #!/bin/bash --utf8
    

    The solution should aim for portability, so it is not necessary to stick with bash.

    EDIT: maybe I've found a possible solution using a recursive script call:

    # check if the current locale is UTF-8-based (otherwise this script may not work correctly)
    if ! locale | grep -q 'UTF-8'; then
        export LC_ALL=en_GB.UTF-8
        # re-execute this script with the modified environment
        # (exec avoids a second process; quoting "$0" handles paths with spaces)
        exec "$0" "$@"
    fi
    
    • mikeserv
      mikeserv over 8 years
      use the locale command to get a list of encodings available to you and put the one most suitable to you in LC_ALL for the duration of your special chars evaluation.
    • serhatg
      serhatg over 8 years
      this can only be done BEFORE the script is launched; I want to force the encoding from within the script, to make it easier for the user.
    • serhatg
      serhatg over 8 years
      If I set LC_ALL inside the script, the shell won't re-decode the strings (I guess string decoding happens when the script is loaded). So it can only be done before the shell instance is created.
    • mikeserv
      mikeserv over 8 years
      that's not true.
    • Admin
      Admin over 8 years
      Just add a line LC_ALL=en_GB.UTF-8 in your script, all text read from then on will be re-coded if needed. That will not change the encoding of the parent shell, but that is not what you seek, correct?
    • serhatg
      serhatg over 8 years
      Can I assume most shells have the same behavior? I think restarting the shell is safer/more portable.
    • mikeserv
      mikeserv over 8 years
      restarting the shell is not necessary for one that handles encoding according to the standard (dash is not among those), but you can restart it as many times as you like and it still won't handle anything but the C locale.
  • serhatg
    serhatg over 8 years
    OK, but can I assume most shells have the same behavior? (As I've said, I am looking for the most portable/reliable solution.)
  • Gilles 'SO- stop being evil'
    Gilles 'SO- stop being evil' over 8 years
    @eadmaster I see the same behavior with bash (at least as far back as 3.1.17), ATT ksh (at least as far back as 93r) and zsh (at least as far back as 4.3.6). Dash (as of 0.5.7), posh (as of 0.12.3), mksh (as of 50d) and BusyBox ash (as of 1.22.0) don't support multibyte locales anyway.
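Given that some shells never handle multibyte locales, a script can probe at run time whether the shell it is running under is multibyte-aware. A minimal sketch, assuming a UTF-8 locale is available to export ('é' is one character but two bytes in UTF-8):

```shell
#!/bin/bash
export LC_ALL=C.UTF-8   # assumption: C.UTF-8 exists on this system
# 'é' is 1 character but 2 bytes in UTF-8; a multibyte-aware shell
# reports length 1, while a byte-oriented one (e.g. dash) reports 2.
v='é'
if [ "${#v}" -eq 1 ]; then
    echo "shell is multibyte-aware"
else
    echo "shell is byte-oriented"
fi
```

Running the same file under bash versus dash should print different results, matching the shell support matrix described in the comment above.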