Short way to escape HTML in Bash?

Solution 1

Escaping HTML really just involves replacing three characters: <, >, and &. For extra points, you can also replace " and '. So, it's not a long sed script:

sed 's/&/\&amp;/g; s/</\&lt;/g; s/>/\&gt;/g; s/"/\&quot;/g; s/'"'"'/\&#39;/g'
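As a minimal sketch of how that script might be used, the snippet below wraps it in a function that filters stdin. The name `html_escape_sed` is hypothetical, not something from the answer. The second call illustrates why `&` is replaced first and why escaping is not idempotent: input that already looks escaped is escaped again.

```shell
#!/usr/bin/env bash

# Hypothetical wrapper: reads text on stdin, writes HTML-escaped text to stdout.
html_escape_sed() {
    # & is handled first; otherwise the & in the entities produced by the
    # later rules would itself be escaped.
    sed 's/&/\&amp;/g; s/</\&lt;/g; s/>/\&gt;/g; s/"/\&quot;/g; s/'"'"'/\&#39;/g'
}

printf '%s\n' '<a href="x?a=1&b=2">Tom'"'"'s</a>' | html_escape_sed
# prints: &lt;a href=&quot;x?a=1&amp;b=2&quot;&gt;Tom&#39;s&lt;/a&gt;

# Already-escaped input is escaped again, so a browser renders it as typed.
printf '%s\n' 'AT&amp;T' | html_escape_sed
# prints: AT&amp;amp;T
```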

Solution 2

You can use the recode utility:

    echo 'He said: "Not sure that - 2<1"' | recode ascii..html

Output:

    He said: &quot;Not sure that - 2&lt;1&quot;

Solution 3

Pure bash, no external programs:

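function htmlEscape () {
    local s
    # Escape & first so the entities added below are not escaped again.
    # The backslash keeps & literal in Bash 5.2+, where an unquoted & in the
    # replacement otherwise stands for the matched text.
    s=${1//&/\&amp;}
    s=${s//</\&lt;}
    s=${s//>/\&gt;}
    s=${s//'"'/\&quot;}
    printf -- %s "$s"
}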

Just simple string substitution.
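A possible usage sketch for the pure-Bash approach follows. It assumes a variant named `html_escape_attr` (a hypothetical name) that also escapes the single quote, so the result can be placed inside either kind of quoted attribute. The `\&` form is used so the replacement stays a literal `&` on Bash 5.2 and later as well as on older releases.

```shell
#!/usr/bin/env bash

# Hypothetical variant of the function above that also escapes single quotes.
html_escape_attr() {
    local s=$1
    s=${s//&/\&amp;}     # must come first
    s=${s//</\&lt;}
    s=${s//>/\&gt;}
    s=${s//'"'/\&quot;}
    s=${s//"'"/\&#39;}
    printf '%s' "$s"
}

title='Tom & "Jerry"'
printf '<img alt="%s">\n' "$(html_escape_attr "$title")"
# prints: <img alt="Tom &amp; &quot;Jerry&quot;">
```

Because the function takes its input as an argument rather than on stdin, it spawns no external process, which suits loops over many short strings.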

Solution 4

Or use xmlstarlet, which can escape and unescape special XML characters:

$ echo '<abc&def>'| xml esc
&lt;abc&amp;def&gt;
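Where xmlstarlet is not installed, roughly the same result can be sketched with awk, which the asker lists as available. The function name `html_escape_awk` is hypothetical. In awk's `gsub`, a bare `&` in the replacement means the matched text, so it is written as `\\&` to get a literal ampersand.

```shell
#!/usr/bin/env bash

# Hypothetical awk-based filter: stdin in, HTML-escaped text out.
html_escape_awk() {
    awk '{
        gsub(/&/, "\\&amp;")    # ampersand first, as in the sed script
        gsub(/</, "\\&lt;")
        gsub(/>/, "\\&gt;")
        gsub(/"/, "\\&quot;")
        print
    }'
}

printf '%s\n' '<abc&def>' | html_escape_awk
# prints: &lt;abc&amp;def&gt;
```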

Author: James Evans

Updated on July 09, 2022

Comments

  • James Evans
    James Evans 6 months

    The box has no Ruby/Python/Perl etc.

    Only bash, sed, and awk.

    One way is to replace characters using a map, but that becomes tedious.

    Perhaps there is some built-in functionality I'm not aware of?

  • Ruud Helderman
    Ruud Helderman about 7 years
    Big mistake. When I HTML-encode a string &amp;, it is because I want it to be rendered by some web browser as &amp;. That is why it must be turned into &amp;amp;. That way, HTML-encoding and HTML-decoding are in balance. You don't suppress HTML-encoding just because the input looks like it has already been HTML-encoded. HTML-encoding is not idempotent. Failure to grasp that eventually leads to XSS vulnerabilities.
  • Brian McCutchon
    Brian McCutchon almost 7 years
    @Ruud is right; the right way to accomplish this is to escape ampersands first, like in ruakh's answer.
  • tbodt
    tbodt about 6 years
    Probably not available if there's no Python/Ruby/Perl.
  • kmkaplan
    kmkaplan almost 6 years
    I totally agree with what @Ruud said, except that he should have emphasized this: failure to grasp that leads to XSS vulnerabilities.
  • WinEunuuchs2Unix
    WinEunuuchs2Unix over 5 years
    +1 for elegance and efficiency. You should post your answer here: stackoverflow.com/questions/5929492/… where they recommend installing recode, perl, php, xmlstarlet and w3m (a web browser, for crying out loud). The last answer recommends using Python3, which, although installed by default (in Ubuntu at least), is overkill too.
  • ruakh
    ruakh over 5 years
    @WinEunuuchs2Unix: Thanks for your kind words! That question is asking about the opposite direction (&lt; to <), and the answers there are trying to cover the possibility of random other entity references like &eacute; and numeric character references like &#201;, rather than minimally-escaped HTML. For many purposes that might be overengineering, but on Stack Overflow it can be hard to tell exactly what someone's purpose is, so I don't blame the answerers there for wanting to provide something universal.
  • WinEunuuchs2Unix
    WinEunuuchs2Unix over 5 years
    @ruakh You're welcome :) Can't your sed search and replace simply be reversed to accomplish the same result as those answers?
  • ruakh
    ruakh over 5 years
    @WinEunuuchs2Unix: There are many ways to HTML-escape a given piece of text; for example, &lt;, &#60;, and &#x3C; are all valid ways to escape <. My sed script only does one kind of HTML-escaping, since you only need one; but if you want to do HTML-unescaping, then either you need to handle all valid ways of escaping, or you need to know beforehand exactly what way of escaping was used. Do you see what I mean?
  • WinEunuuchs2Unix
    WinEunuuchs2Unix over 5 years
    Yes. My HTML-unescaping is limited to the Stack Exchange site Ask Ubuntu, and so far I've only noticed &amp;, &lt; and &quot;. The goal is to compare all the scripts on my drive that I've published on Ask Ubuntu to see if they have been changed locally or revised by someone else on Ask Ubuntu. For fun I'm also extracting upvotes from the HTML file and putting them in the local file. This is the work in progress from a few nights ago: askubuntu.com/questions/894888/…
  • geotheory
    geotheory about 4 years
    Really useful. Same as a function: escape_html() { sed $1 's/&/\&amp;/g; s/</\&lt;/g; s/>/\&gt;/g; s/"/\&quot;/g; s/'"'"'/\&#39;/g'; }
  • Ohiovr
    Ohiovr almost 3 years
    I want to try this but I don't know how to install xml esc. I don't even know what it is. Could you elaborate?
  • schemacs
    schemacs almost 3 years
    Just brew install xmlstarlet if you are using MacOS.
  • vhs
    vhs over 2 years
    Tested on 30 or so textfiles containing ASCII and it even handles the null character \0. Use to sandbox textfile contents for srcdoc attribute of a sandboxed iframe in HTML and allow background styling via parent frame to cascade.