Fastest way to strip all non-printable characters from a Java String

24,246

Solution 1

If it is reasonable to embed this method in a class which is not shared across threads, then you can reuse the buffer:

char [] oldChars = new char[5];

String stripControlChars(String s)
{
    final int inputLen = s.length();
    if ( oldChars.length < inputLen )
    {
        oldChars = new char[inputLen];
    }
    s.getChars(0, inputLen, oldChars, 0);

etc...

This is a big win - 20% or so, as I understand the current best case.

If this is to be used on potentially large strings and the memory "leak" is a concern, a weak reference can be used.

Solution 2

using 1 char array could work a bit better

int length = s.length();
char[] oldChars = new char[length];
s.getChars(0, length, oldChars, 0);
int newLen = 0;
for (int j = 0; j < length; j++) {
    char ch = oldChars[j];
    if (ch >= ' ') {
        oldChars[newLen] = ch;
        newLen++;
    }
}
s = new String(oldChars, 0, newLen);

and I avoided repeated calls to s.length();

another micro-optimization that might work is

int length = s.length();
char[] oldChars = new char[length+1];
s.getChars(0, length, oldChars, 0);
oldChars[length]='\0';//avoiding explicit bound check in while
int newLen=-1;
while(oldChars[++newLen]>=' ');//find first non-printable,
                       // if there are none it ends on the null char I appended
for (int  j = newLen; j < length; j++) {
    char ch = oldChars[j];
    if (ch >= ' ') {
        oldChars[newLen] = ch;//the while avoids repeated overwriting here when newLen==j
        newLen++;
    }
}
s = new String(oldChars, 0, newLen);

Solution 3

Well I've beaten the current best method (freak's solution with the preallocated array) by about 30% according to my measures. How? By selling my soul.

As I'm sure everyone that has followed the discussion so far knows this violates pretty much any basic programming principle, but oh well. Anyways the following only works if the used character array of the string isn't shared between other strings - if it does whoever has to debug this will have every right deciding to kill you (without calls to substring() and using this on literal strings this should work as I don't see why the JVM would intern unique strings read from an outside source). Though don't forget to make sure the benchmark code doesn't do it - that's extremely likely and would help the reflection solution obviously.

Anyways here we go:

    // Has to be done only once - so cache those! Prohibitively expensive otherwise
    private Field value;
    private Field offset;
    private Field count;
    private Field hash;
    {
        try {
            value = String.class.getDeclaredField("value");
            value.setAccessible(true);
            offset = String.class.getDeclaredField("offset");
            offset.setAccessible(true);
            count = String.class.getDeclaredField("count");
            count.setAccessible(true);
            hash = String.class.getDeclaredField("hash");
            hash.setAccessible(true);               
        }
        catch (NoSuchFieldException e) {
            throw new RuntimeException();
        }

    }

    @Override
    public String strip(final String old) {
        final int length = old.length();
        char[] chars = null;
        int off = 0;
        try {
            chars = (char[]) value.get(old);
            off = offset.getInt(old);
        }
        catch(IllegalArgumentException e) {
            throw new RuntimeException(e);
        }
        catch(IllegalAccessException e) {
            throw new RuntimeException(e);
        }
        int newLen = off;
        for(int j = off; j < off + length; j++) {
            final char ch = chars[j];
            if (ch >= ' ') {
                chars[newLen] = ch;
                newLen++;
            }
        }
        if (newLen - off != length) {
            // We changed the internal state of the string, so at least
            // be friendly enough to correct it.
            try {
                count.setInt(old, newLen - off);
                // Have to recompute hash later on
                hash.setInt(old, 0);
            }
            catch(IllegalArgumentException e) {
                e.printStackTrace();
            }
            catch(IllegalAccessException e) {
                e.printStackTrace();
            }
        }
        // Well we have to return something
        return old;
    }

For my teststring that gets 3477148.18ops/s vs. 2616120.89ops/s for the old variant. I'm quite sure the only way to beat that could be to write it in C (probably not though) or some completely different approach nobody has thought about so far. Though I'm absolutely not sure if the timing is stable across different platforms - produces reliable results on my box (Java7, Win7 x64) at least.

Solution 4

You could split the task into a several parallel subtasks, depending of processor's quantity.

Solution 5

I was so free and wrote a small benchmark for different algorithms. It's not perfect, but I take the minimum of 1000 runs of a given algorithm 10000 times over a random string (with about 32/200% non printables by default). That should take care of stuff like GC, initialization and so on - there's not so much overhead that any algorithm shouldn't have at least one run without much hindrance.

Not especially well documented, but oh well. Here we go - I included both of ratchet freak's algorithms and the basic version. At the moment I randomly initialize a 200 chars long string with uniformly distributed chars in the range [0, 200).

Share:
24,246

Related videos on Youtube

GreyCat
Author by

GreyCat

Updated on July 08, 2022

Comments

  • GreyCat
    GreyCat almost 2 years

    What is the fastest way to strip all non-printable characters from a String in Java?

    So far I've tried and measured on 138-byte, 131-character String:

    • String's replaceAll() - slowest method
      • 517009 results / sec
    • Precompile a Pattern, then use Matcher's replaceAll()
      • 637836 results / sec
    • Use StringBuffer, get codepoints using codepointAt() one-by-one and append to StringBuffer
      • 711946 results / sec
    • Use StringBuffer, get chars using charAt() one-by-one and append to StringBuffer
      • 1052964 results / sec
    • Preallocate a char[] buffer, get chars using charAt() one-by-one and fill this buffer, then convert back to String
      • 2022653 results / sec
    • Preallocate 2 char[] buffers - old and new, get all chars for existing String at once using getChars(), iterate over old buffer one-by-one and fill new buffer, then convert new buffer to String - my own fastest version
      • 2502502 results / sec
    • Same stuff with 2 buffers - only using byte[], getBytes() and specifying encoding as "utf-8"
      • 857485 results / sec
    • Same stuff with 2 byte[] buffers, but specifying encoding as a constant Charset.forName("utf-8")
      • 791076 results / sec
    • Same stuff with 2 byte[] buffers, but specifying encoding as 1-byte local encoding (barely a sane thing to do)
      • 370164 results / sec

    My best try was the following:

        char[] oldChars = new char[s.length()];
        s.getChars(0, s.length(), oldChars, 0);
        char[] newChars = new char[s.length()];
        int newLen = 0;
        for (int j = 0; j < s.length(); j++) {
            char ch = oldChars[j];
            if (ch >= ' ') {
                newChars[newLen] = ch;
                newLen++;
            }
        }
        s = new String(newChars, 0, newLen);
    

    Any thoughts on how to make it even faster?

    Bonus points for answering a very strange question: why using "utf-8" charset name directly yields better performance than using pre-allocated static const Charset.forName("utf-8")?

    Update

    • Suggestion from ratchet freak yields impressive 3105590 results / sec performance, a +24% improvement!
    • Suggestion from Ed Staub yields yet another improvement - 3471017 results / sec, a +12% over previous best.

    Update 2

    I've tried my best to collected all the proposed solutions and its cross-mutations and published it as a small benchmarking framework at github. Currently it sports 17 algorithms. One of them is "special" - Voo1 algorithm (provided by SO user Voo) employs intricate reflection tricks thus achieving stellar speeds, but it messes up JVM strings' state, thus it's benchmarked separately.

    You're welcome to check it out and run it to determine results on your box. Here's a summary of results I've got on mine. It's specs:

    • Debian sid
    • Linux 2.6.39-2-amd64 (x86_64)
    • Java installed from a package sun-java6-jdk-6.24-1, JVM identifies itself as
      • Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
      • Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)

    Different algorithms show ultimately different results given a different set of input data. I've ran a benchmark in 3 modes:

    Same single string

    This mode works on a same single string provided by StringSource class as a constant. The showdown is:

     Ops / s  │ Algorithm
    ──────────┼──────────────────────────────
    6 535 947 │ Voo1
    ──────────┼──────────────────────────────
    5 350 454 │ RatchetFreak2EdStaub1GreyCat1
    5 249 343 │ EdStaub1
    5 002 501 │ EdStaub1GreyCat1
    4 859 086 │ ArrayOfCharFromStringCharAt
    4 295 532 │ RatchetFreak1
    4 045 307 │ ArrayOfCharFromArrayOfChar
    2 790 178 │ RatchetFreak2EdStaub1GreyCat2
    2 583 311 │ RatchetFreak2
    1 274 859 │ StringBuilderChar
    1 138 174 │ StringBuilderCodePoint
      994 727 │ ArrayOfByteUTF8String
      918 611 │ ArrayOfByteUTF8Const
      756 086 │ MatcherReplace
      598 945 │ StringReplaceAll
      460 045 │ ArrayOfByteWindows1251
    

    In charted form: Same single string chart
    (source: greycat.ru)

    Multiple strings, 100% of strings contain control characters

    Source string provider pre-generated lots of random strings using (0..127) character set - thus almost all strings contained at least one control character. Algorithms received strings from this pre-generated array in round-robin fashion.

     Ops / s  │ Algorithm
    ──────────┼──────────────────────────────
    2 123 142 │ Voo1
    ──────────┼──────────────────────────────
    1 782 214 │ EdStaub1
    1 776 199 │ EdStaub1GreyCat1
    1 694 628 │ ArrayOfCharFromStringCharAt
    1 481 481 │ ArrayOfCharFromArrayOfChar
    1 460 067 │ RatchetFreak2EdStaub1GreyCat1
    1 438 435 │ RatchetFreak2EdStaub1GreyCat2
    1 366 494 │ RatchetFreak2
    1 349 710 │ RatchetFreak1
      893 176 │ ArrayOfByteUTF8String
      817 127 │ ArrayOfByteUTF8Const
      778 089 │ StringBuilderChar
      734 754 │ StringBuilderCodePoint
      377 829 │ ArrayOfByteWindows1251
      224 140 │ MatcherReplace
      211 104 │ StringReplaceAll
    

    In charted form: Multiple strings, 100% concentration
    (source: greycat.ru)

    Multiple strings, 1% of strings contain control characters

    Same as previous, but only 1% of strings was generated with control characters - other 99% was generated in using [32..127] character set, so they couldn't contain control characters at all. This synthetic load comes the closest to real world application of this algorithm at my place.

     Ops / s  │ Algorithm
    ──────────┼──────────────────────────────
    3 711 952 │ Voo1
    ──────────┼──────────────────────────────
    2 851 440 │ EdStaub1GreyCat1
    2 455 796 │ EdStaub1
    2 426 007 │ ArrayOfCharFromStringCharAt
    2 347 969 │ RatchetFreak2EdStaub1GreyCat2
    2 242 152 │ RatchetFreak1
    2 171 553 │ ArrayOfCharFromArrayOfChar
    1 922 707 │ RatchetFreak2EdStaub1GreyCat1
    1 857 010 │ RatchetFreak2
    1 023 751 │ ArrayOfByteUTF8String
      939 055 │ StringBuilderChar
      907 194 │ ArrayOfByteUTF8Const
      841 963 │ StringBuilderCodePoint
      606 465 │ MatcherReplace
      501 555 │ StringReplaceAll
      381 185 │ ArrayOfByteWindows1251
    

    In charted form: Multiple strings, 1% concentration
    (source: greycat.ru)

    It's very hard for me to decide on who provided the best answer, but given the real-world application best solution was given/inspired by Ed Staub, I guess it would be fair to mark his answer. Thanks for all who took part in this, your input was very helpful and invaluable. Feel free to run the test suite on your box and propose even better solutions (working JNI solution, anyone?).

    References

    • Gustav Barkefors
      Gustav Barkefors over 12 years
      "This question shows research effort" - hmm... yeah, pass. +1
    • Admin
      Admin over 12 years
      StringBuilder will be marginally faster than StringBuffer as it is un-synchronized, I just mention this because you tagged this micro-optimization
    • home
      home over 12 years
      @Jarrod Roberson: ok, so let's make all read-only fields final and extract s.length() out of the for loop as well :-)
    • Vishy
      Vishy over 12 years
      Some characters below space are printable e.g. \t and \n. Many characters above 127 are non-printable in your character set.
    • ratchet freak
      ratchet freak over 12 years
      did you init the string buffer with a capacity of s.length()?
    • Admin
      Admin over 12 years
      @home I agree I make everything I can final, but for other reasons as well, and if if you are looking at micro-optimization making the .length() read once is another micro-optimization. I mean this is tagged micro-optimization isn't it!
    • home
      home over 12 years
      @Jarrod Roberson: full ack, that's why I mentioned it - btw I make all fields final for exactly the same reason :-)
    • GreyCat
      GreyCat over 12 years
      @ratchet freak: Yes, I did new StringBuffer(s.length())
    • Ed Staub
      Ed Staub over 12 years
      @GreyCat, do you want to post the code of the current best case? It's a bit hard to tease together from the answers and your comments.
    • GreyCat
      GreyCat over 12 years
      Please check out question update - I've posted the testing framework and massive showdown for all 17 proposed algorithms.
    • ratchet freak
      ratchet freak over 12 years
      looking back all we did was avoid the superfluous allocations of the char arrays (and the new String object when there was no change)
    • GreyCat
      GreyCat over 12 years
      Yeah, but these are really microoptimizations - so they depend on input data a lot - a tiny shift in a few coefficient makes algorithms with or without extra checks switch places.
    • dube
      dube over 2 years
      It bugs me more than it should that in ten years nobody mentioned that the regex for replaceAll is not precompiled. Then it would at least beat the "StringBuilderChar"
  • GreyCat
    GreyCat over 12 years
    Thanks! Your version yields 3105590 strings / sec - a massive improvement!
  • Thomas
    Thomas over 12 years
    newLen++;: what about using preincrement ++newLen;? - (++j in the loop as well). Have a look here: stackoverflow.com/questions/1546981/…
  • GreyCat
    GreyCat over 12 years
    Yeah, I thought of it too, but it won't yield any performance gains in my situation - this stripping algorithm would be called in already massively parallel system.
  • GreyCat
    GreyCat over 12 years
    And, besides, I might guess that forking off a few threads for processing per every 50-100 byte string would be a huge overkill.
  • GreyCat
    GreyCat over 12 years
    Adding final to this algorithm and using oldChars[newLen++] (++newLen is an error - the whole string would be off by 1!) yields no measureable performance gains (i.e. I get ±2..3% differences, which are comparable to differences of different runs)
  • ratchet freak
    ratchet freak over 12 years
    @grey I made another version with some other optimizations
  • Thomas
    Thomas over 12 years
    @GreyCat well you could initialize newLen with -1 to account for that.
  • Thomas
    Thomas over 12 years
    Shouldn't that be oldChars[length]='\0?
  • GreyCat
    GreyCat over 12 years
    Probably you've meant oldChars[length] = '\0' instead oldChars='\0'? But even then if fails me with erroneous result... Let me check...
  • ratchet freak
    ratchet freak over 12 years
    @gray I made the char array 1 bigger to account for that
  • GreyCat
    GreyCat over 12 years
    @ratchet freak Your new version doesn't strip first found non-printable character - it could be quickly patched by adding newLen-- after a while loop...
  • umbr
    umbr over 12 years
    Yes, forking threads for every small string is not good idea. But load balancer could improve performance. BTW, did you test performance with StringBuilder instead of StringBuffer which has performance lacks because it synchronized.
  • ratchet freak
    ratchet freak over 12 years
    @grey well it would only improve in my second version when the start of the strings are all for a large part printable (i.e. the non printable chars are rare and mostly near the end)
  • GreyCat
    GreyCat over 12 years
    Hmm! That's a brilliant idea! 99.9% of strings in my production environment won't really require stripping - I can improve it further to eliminate even first char[] allocation and return the String as is, if no stripping happened.
  • GreyCat
    GreyCat over 12 years
    My production setup runs spawns several separate processes and utilizes as much parallel CPUs and cores as possible, so I can freely use StringBuilder everywhere without any issues at all.
  • GreyCat
    GreyCat over 12 years
    Great idea! So far it brought the counts up to 3471017 strings per second - i.e. a +12% improvement over previous best version.
  • GreyCat
    GreyCat over 12 years
    +1 for the effort - but you should've asked me - I already have a similar benchmarking suite - that's where I was testing my algorithms ;)
  • GreyCat
    GreyCat over 12 years
    I doubt that any unrolling could be done here, as there are (a) dependencies on following steps of algorithm on previous steps, (b) I haven't even heard of anybody doing manual loop unrolling in Java producing any stellar results; JIT usually does a good job at unrolling whatever it sees fitting the task. Thanks for the suggestion and a link, though :)
  • Voo
    Voo over 12 years
    @GreyCat Well I could've, but just throwing that together (out of existing code anyways) was probably faster ;)
  • GreyCat
    GreyCat over 12 years
    Thanks for the solution, please check out question update - I've published my testing framework and added 3 test run results for 17 algorithms. Your algorithm is always on the top, but it changes internal state of Java String, thus breaking the "immutable String" contract => it would pretty hard to use it in real world application. Test-wise, yeah, it's the best result, but I guess I'll announce it as a separate nomination :)
  • Voo
    Voo over 12 years
    @GreyCat Yeah it has certainly some big strings attached and honestly I pretty much only wrote it up because I'm pretty sure there's no noticeable way to improve your current best solution further. There are situations where I'm certain that it'll work fine (no substring or intern calls before stripping it ), but that's because of knowledge about one current Hotspot version (ie afaik it won't intern strings read from IO - wouldn't be especially useful). It may be useful if one really needs those extra x%, but otherwise more a baseline to see how much you can still improve ;)
  • Voo
    Voo over 12 years
    Though I'm attempted to try a JNI version if I find time - never used it so far so that'd be interesting. But I'm pretty sure it'll be slower because of the additional calling overhead (strings are way too small) and the fact that the JIT shouldn't have such a hard time optimizing the functions. Just don't use new String() in case your string wasn't changed, but I think you got that already.
  • GreyCat
    GreyCat over 12 years
    I've already tried to do exactly the same thing in pure C - and, well, it doesn't really show much improvement over your reflection-based version. C version runs something like +5..10% faster, not really that great - I thought it would be at least like 1.5x-1.7x...
  • Peter Cordes
    Peter Cordes almost 3 years
    Another option to avoid repeated allocation of the tmp buf would be to have the caller pass in (by reference) a scratch buffer of sufficient size. You'd have to check if that ended up defeating some optimizations, but hopefully the JVM can be sure the String and the tmp buf aren't the same object just because of their types. It's not like a C function taking two char* args. (If the JVM even does any optimizations that would matter; presumably it's not going to use SIMD to scan for characters that aren't >= ' ', unfortunately.)
  • Peter Cordes
    Peter Cordes almost 3 years
    It's surprising there's anything to gain from using System.arraycopy - if you have to compare characters one at a time anyway, it only costs a couple extra asm instructions to store and increment a pointer. Except this is Java so it probably involves a bounds-check, and the JIT isn't wonderful. So that's probably why it ends up being fast to scan read-only to find the length of a block of printable characters. Especially if they're rare so you have decent sized chunks for memcpy.
  • Peter Cordes
    Peter Cordes almost 3 years
    while (length > newLength) newLength *= 2; seems a bit silly; why not max(length + 16, copy.length * 2). Or max( length * 3 / 2, copy.length*2) or something. Probably it's not important to worry about 32-bit overflow in the *3 before the /2; maybe use >>>1 to make sure we do unsigned division to allow the *3 temporary to have twice the range of signed int.
  • Peter Cordes
    Peter Cordes almost 3 years
    You talk about "late bail" - Do you mean the if (newLen == length) return s; check in github.com/GreyCat/java-string-benchmark/blob/master/src/ru/‌​…? Are you talking about putting it right after the for (j loop? Perhaps that helps the JIT see that if that loop condition is false the first time, it should jump to code that does return s;, otherwise fall into the loop. Surprised it made that much difference to just not check the same condition twice, but maybe it made the JIT do a worse job in some way.
  • Stelios Adamantidis
    Stelios Adamantidis almost 3 years
    @PeterCordes Let's discuss in the chat :)
  • Stelios Adamantidis
    Stelios Adamantidis almost 3 years
    @PeterCordes meh, I got overwhelmed lately and didn't respond in time so now the chat is deleted :( My bad. You mentioned something about arraycopy and memcpy but Java (openJDK ) apparently loops over it! Anyway, too late now. But thanks for the useful input regardless!
  • Peter Cordes
    Peter Cordes almost 3 years
    That was back in OpenJDK7, and had basically a "todo" comment. Even with that C++ source, it's possible that a C++ compiler could compile that loop into a memcpy or memmove call, if it can inline and optimize away the oop class's copy-constructor. Yes, CPUs can copy much faster than 1 byte at a time when they don't need to look at each byte separately. (My earlier point was that your algorithm already does need to look at each byte separately, so it might as well get the copying done while looking. Unless that leads to much worse JIT asm, perhaps because of bounds-check or aliasing)
  • Stelios Adamantidis
    Stelios Adamantidis almost 3 years
    @PeterCordes I tried several iterations when attempting to optimize that part. I even tried arraycopy to another array. What boosted performance was copying to the same array. Also note that I don't copy each byte, I copy blocks, from the first printable to the last ;) If every second or third character was non printable I'm not sure if result would have been the same.
  • Peter Cordes
    Peter Cordes almost 3 years
    Yeah, I know how you're using ArrayCopy. My point was that doing copy[idx] = c; or whatever on the fly while searching must be making surrounding asm worse for some reason, for it to be worth reloading the same source data with wide loads, instead of storing each correct char while you're already looking at it one at a time. It's the same reason libc strcpy isn't implemented as strlen + memcpy - instead it combines both operations. (Of course it's normally hand-written in asm, unlike with portable C or Java where you may have fast primitive operations like memcpy you can't replicate)
  • Stelios Adamantidis
    Stelios Adamantidis almost 3 years
    @PeterCordes I get your point, perhaps copy[idx] = c; yields more checks that the arraycopy doesn't? I have to be honest I don't have answer for everything :) (although I'd love to sometimes). But it might be a good opportunity for further investigation (if time permits :( )
  • Peter Cordes
    Peter Cordes almost 3 years
    Yes, like I already said, my two main theories are that bounds-checking on every byte (instead of maybe just once for each ArrayCopy) are costing a lot in the asm, or that doing a store at all defeats some other optimizations, like perhaps not keeping all the loop vars in registers. e.g. the compiler can no longer prove as many loop invariants, because it thinks the store might change the value of some part of some variable. (Figuring out when stores can/can't affect other things is called "alias analysis"). A read-only loop makes alias analysis trivial / unnecessary.