How to reliably strip invisible characters that break code?

Solution 1

The easiest way I can think of is to use sed:

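sed -i 's/[^[:print:]]//g' your_script.js
# [:print:] is locale-dependent: prefix the command with LC_ALL=C to treat
# every non-ASCII byte as non-printable. This pattern also deletes tabs.
# [:ascii:] is not a POSIX character class, so sed may reject it.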

Or use tr, which here keeps only tab, line feed, carriage return, and printable ASCII (space through ~):

tr -cd '\11\12\15\40-\176' < old_script.js > new_script.js
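For readers without a Unix shell, the same whitelist can be sketched in JavaScript. This is only an illustration of the tr command above, not part of the original answer; the function name keepPrintableAscii and the sample string are made up for the example.

```javascript
// Hypothetical helper mirroring: tr -cd '\11\12\15\40-\176'
// Keeps tab (\x09), line feed (\x0A), carriage return (\x0D) and
// printable ASCII (\x20-\x7E); deletes every other character.
function keepPrintableAscii(text) {
  return text.replace(/[^\x09\x0A\x0D\x20-\x7E]/g, "");
}

// Sample input (an assumption): a zero-width space (U+200B) hides after "var".
const dirty = "var\u200B x = 1;\n\tx++;";

// JSON.stringify makes the surviving tab and newline visible as escapes.
console.log(JSON.stringify(keepPrintableAscii(dirty)));
```

Unlike the sed pattern with [:print:], this whitelist leaves tabs in place, so indentation survives.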

Solution 2

Is there some sort of website somewhere that will strip all characters other than ASCII?

You could use this website

You can recreate the website using this code:

<!DOCTYPE html>
<html>
    <head>
        <meta http-equiv="content-type" content="text/html; charset=UTF-8">
        <title>- jsFiddle demo</title>
        <script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js"></script>
        <style type="text/css">
            textarea {
                width: 800px;
                height: 480px;
                outline: none;
                font-family: Monaco, Consolas, monospace;
                border: 0;
                padding: 15px;
                color: hsl(0, 0%, 27%);
                background-color: #F6F6F6;
            }
        </style>
        <script type="text/javascript">
            //<![CDATA[ 
            $(function () {
                $("button").click(function () {
                    $("textarea").val(
                        $("textarea").val().replace(/[^\u0000-\u007E]/g, "")
                    );
                    $("textarea").focus()[0].select();
                });
            }); //]]>
        </script>
    </head>
    <body>
        <textarea></textarea>
        <button>Remove</button>
    </body>
</html>

Solution 3

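You can use a regex to remove everything outside the range 0-127. For example, in JavaScript:

text = text.replace(/[^\x00-\x7F]/g, "")

\x00 is 0 and \x7F is 127, the bounds of the ASCII range. Note that replace returns a new string, so the result has to be assigned.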
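Removing everything above 127 also deletes legitimate non-ASCII text inside string literals. A narrower alternative is to remove only a list of known invisible characters. The sketch below is not from the original answers; the list in INVISIBLE (soft hyphen, zero-width space, zero-width non-joiner and joiner, word joiner, byte order mark) and the name stripInvisible are assumptions to adapt as needed.

```javascript
// Assumed list of invisible code points; extend it for your own input.
// U+00AD soft hyphen, U+200B-U+200D zero-width space/non-joiner/joiner,
// U+2060 word joiner, U+FEFF byte order mark / zero-width no-break space.
const INVISIBLE = /[\u00AD\u200B-\u200D\u2060\uFEFF]/g;

// Hypothetical helper: delete only the characters listed above.
function stripInvisible(text) {
  return text.replace(INVISIBLE, "");
}

// Sample input (an assumption): a visible é plus a hidden zero-width space.
const sample = 'alert("caf\u00E9\u200B");';

console.log(stripInvisible(sample));              // the é is kept
console.log(sample.replace(/[^\x00-\x7F]/g, "")); // the é is removed too
```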

Author: Steven Lu

Play a multitouch HTML5 Tetris clone -- http://htmltetris.com (Interesting note about this site. It used to be my site, then Tetris Co. sent me a cease and desist, then I forgot about it, and now it’s back: someone brought it back and put MY code back on the site.) A huge fan of tmux and vim.

Updated on July 10, 2022

Comments

  • Steven Lu 2 months

    I am trying to build a bookmarklet and got slammed with this issue, which I was just able to figure out: a \u8203 character (8203 is the decimal code point; this is the zero-width space, U+200B, written \u200B in JavaScript), which Chrome unhelpfully reports in my block of code (upon pasting into the JS console) as "Invalid character ILLEGAL".

    Luckily Safari was the one that told me it was a \u8203.

    I am editing the code in the Sublime Text 2 editor and somehow copying in and out of it (I also tried TextEdit) fails to remove it.

    Is there some sort of website somewhere that will strip all characters other than ASCII?

    When I try to save as ISO 8859, it saves the file back as UTF-8 "because of unsupported characters".

    ... Yeah, that's the point. Get rid of my unsupported evil characters.

    What am I supposed to do? Edit my file in a hex editor?

    FYI I actually solved it by re-typing the code (which originated from this site by the way).
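As an alternative to a hex editor, a few lines of JavaScript can report where the offending characters sit before anything is deleted. This is a minimal sketch rather than anything from the original post; the function name findNonAscii, the shape of the returned objects, and the sample source are assumptions.

```javascript
// Hypothetical helper: list every character above U+007F with its position.
function findNonAscii(text) {
  const hits = [];
  text.split("\n").forEach((line, index) => {
    let column = 1; // 1-based, counted in code points
    // for...of walks the string by code point, so astral characters count once.
    for (const ch of line) {
      const cp = ch.codePointAt(0);
      if (cp > 0x7f) {
        hits.push({
          line: index + 1,
          column: column,
          codePoint: "U+" + cp.toString(16).toUpperCase().padStart(4, "0"),
        });
      }
      column += 1;
    }
  });
  return hits;
}

// Sample input (an assumption): a zero-width space after "var" on line 2.
const source = "var a = 1;\nvar\u200B b = 2;";
console.log(findNonAscii(source));
```

The reported code point is in hexadecimal, so the zero-width space shows up as U+200B rather than as the decimal 8203.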