How to force shell script character encoding from within the script
Bash stores strings as byte strings and performs operations according to the current LC_CTYPE setting. So there is no need to restart bash: just set the LC_CTYPE or LC_ALL variable to your desired locale. Note that if you store a string in a variable or function, what matters is the encoding at the time the variable is expanded or the relevant command in the function is executed. Here's a script that demonstrates that:
#!/bin/bash
LC_CTYPE=en_US.utf8
v_utf8='é'
n_utf8=${#v_utf8}
f_utf8 () { tmp='é'; echo ${#tmp}; }
echo "UTF-8 in UTF-8: $n_utf8 $(f_utf8)"
LC_CTYPE=en_US
v_latin1='é'
n_latin1=${#v_latin1}
f_latin1 () { tmp='é'; echo ${#tmp}; }
echo "Latin 1 in Latin 1: $n_latin1 $(f_latin1)"
echo "UTF-8 in Latin 1: ${#v_utf8} $(f_utf8)"
LC_CTYPE=en_US.utf8
echo "Latin 1 in UTF-8: ${#v_latin1} $(f_latin1)"
Output:
UTF-8 in UTF-8: 1 1
Latin 1 in Latin 1: 2 2
UTF-8 in Latin 1: 2 2
Latin 1 in UTF-8: 1 1
As you can see, the length of the string is calculated according to the current value of LC_CTYPE, regardless of the value at the time of definition.
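In practice, the takeaway is simply to set (and export) the locale variable at the top of your script, before any string handling happens. A minimal sketch, assuming a UTF-8 locale is installed (C.UTF-8 is used here; check what your system offers with locale -a and substitute as needed):

```shell
#!/bin/bash
# Assumption: the C.UTF-8 locale is installed on this system.
# Exporting LC_ALL makes child processes inherit the setting too.
export LC_ALL=C.UTF-8

s='é'        # two bytes in UTF-8
echo "${#s}" # counted as one character under a UTF-8 locale
```

Exporting matters: without export, external commands spawned by the script would still run under the caller's locale.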
serhatg
Updated on September 18, 2022

Comments
-
serhatg over 1 year
I have a few shell scripts with UTF-8-encoded characters inside and I want to be sure that they are decoded correctly regardless of the machine's locale settings.
Is it possible to force the shell (bash or sh) to detect the correct script encoding? (Something similar to the Python or Ruby encoding cookie.)
The solution could be a custom shebang like:
#!/bin/bash --utf8
The solution should aim for good portability, so it is not necessary to stick with bash.
EDIT: maybe I've found a possible solution using a recursive script call:
# check if current locale is UTF8-based (otherwise this script may not work correctly)
locale | grep -q 'UTF-8'
if [ $? -ne 0 ]; then
    export LC_ALL=en_GB.UTF-8
    # recursively call this script with the modified environment
    $0 "$@"
    exit $?
fi
-
mikeserv over 8 years: use the locale command to get a list of encodings available to you and put the one most suitable to you in LC_ALL for the duration of your special-characters evaluation. -
serhatg over 8 years: this can only be done BEFORE the script is launched; I want to force the encoding from within the script, to make it easier for the user.
-
serhatg over 8 years: if I set LC_ALL inside the script, the shell won't re-decode the strings (I guess string decoding happens when the script is loaded). So it can only be done before the shell instance is created.
-
mikeserv over 8 years: that's not true.
-
Admin over 8 years: Just add a line LC_ALL=en_GB.UTF-8 in your script; all text read from then on will be re-coded if needed. That will not change the encoding of the parent shell, but that is not what you seek, correct? -
serhatg over 8 years: can I assume most shells have the same behavior? I think restarting the shell is safer/more portable.
-
mikeserv over 8 years: restarting the shell is not necessary for one that handles encoding according to the standard; dash is not among those, but you can restart it as many times as you like and it still won't handle anything but a C locale.
-
serhatg over 8 years: ok, but can I assume most shells have the same behavior? (As I've said, I am looking for the most portable/reliable solution.)
-
Gilles 'SO- stop being evil' over 8 years: @eadmaster I see the same behavior with bash (at least as far back as 3.1.17), ATT ksh (at least as far back as 93r) and zsh (at least as far back as 4.3.6). Dash (as of 0.5.7), posh (as of 0.12.3), mksh (as of 50d) and BusyBox ash (as of 1.22.0) don't support multibyte locales anyway.