Fastest way to strip all non-printable characters from a Java String
Solution 1
If it is reasonable to embed this method in a class which is not shared across threads, then you can reuse the buffer:
char [] oldChars = new char[5];
String stripControlChars(String s)
{
final int inputLen = s.length();
if ( oldChars.length < inputLen )
{
oldChars = new char[inputLen];
}
s.getChars(0, inputLen, oldChars, 0);
etc...
This is a big win - 20% or so, as I understand the current best case.
If this is to be used on potentially large strings and the memory "leak" is a concern, a weak reference can be used.
Solution 2
using 1 char array could work a bit better
int length = s.length();
char[] oldChars = new char[length];
s.getChars(0, length, oldChars, 0);
int newLen = 0;
for (int j = 0; j < length; j++) {
char ch = oldChars[j];
if (ch >= ' ') {
oldChars[newLen] = ch;
newLen++;
}
}
s = new String(oldChars, 0, newLen);
and I avoided repeated calls to s.length();
another micro-optimization that might work is
int length = s.length();
char[] oldChars = new char[length+1];
s.getChars(0, length, oldChars, 0);
oldChars[length]='\0';//avoiding explicit bound check in while
int newLen=-1;
while(oldChars[++newLen]>=' ');//find first non-printable,
// if there are none it ends on the null char I appended
for (int j = newLen; j < length; j++) {
char ch = oldChars[j];
if (ch >= ' ') {
oldChars[newLen] = ch;//the while avoids repeated overwriting here when newLen==j
newLen++;
}
}
s = new String(oldChars, 0, newLen);
Solution 3
Well I've beaten the current best method (freak's solution with the preallocated array) by about 30% according to my measures. How? By selling my soul.
As I'm sure everyone that has followed the discussion so far knows this violates pretty much any basic programming principle, but oh well. Anyways the following only works if the used character array of the string isn't shared between other strings - if it does whoever has to debug this will have every right deciding to kill you (without calls to substring() and using this on literal strings this should work as I don't see why the JVM would intern unique strings read from an outside source). Though don't forget to make sure the benchmark code doesn't do it - that's extremely likely and would help the reflection solution obviously.
Anyways here we go:
// Has to be done only once - so cache those! Prohibitively expensive otherwise
private Field value;
private Field offset;
private Field count;
private Field hash;
{
try {
value = String.class.getDeclaredField("value");
value.setAccessible(true);
offset = String.class.getDeclaredField("offset");
offset.setAccessible(true);
count = String.class.getDeclaredField("count");
count.setAccessible(true);
hash = String.class.getDeclaredField("hash");
hash.setAccessible(true);
}
catch (NoSuchFieldException e) {
throw new RuntimeException();
}
}
@Override
public String strip(final String old) {
final int length = old.length();
char[] chars = null;
int off = 0;
try {
chars = (char[]) value.get(old);
off = offset.getInt(old);
}
catch(IllegalArgumentException e) {
throw new RuntimeException(e);
}
catch(IllegalAccessException e) {
throw new RuntimeException(e);
}
int newLen = off;
for(int j = off; j < off + length; j++) {
final char ch = chars[j];
if (ch >= ' ') {
chars[newLen] = ch;
newLen++;
}
}
if (newLen - off != length) {
// We changed the internal state of the string, so at least
// be friendly enough to correct it.
try {
count.setInt(old, newLen - off);
// Have to recompute hash later on
hash.setInt(old, 0);
}
catch(IllegalArgumentException e) {
e.printStackTrace();
}
catch(IllegalAccessException e) {
e.printStackTrace();
}
}
// Well we have to return something
return old;
}
For my teststring that gets 3477148.18ops/s
vs. 2616120.89ops/s
for the old variant. I'm quite sure the only way to beat that could be to write it in C (probably not though) or some completely different approach nobody has thought about so far. Though I'm absolutely not sure if the timing is stable across different platforms - produces reliable results on my box (Java7, Win7 x64) at least.
Solution 4
You could split the task into a several parallel subtasks, depending of processor's quantity.
Solution 5
I was so free and wrote a small benchmark for different algorithms. It's not perfect, but I take the minimum of 1000 runs of a given algorithm 10000 times over a random string (with about 32/200% non printables by default). That should take care of stuff like GC, initialization and so on - there's not so much overhead that any algorithm shouldn't have at least one run without much hindrance.
Not especially well documented, but oh well. Here we go - I included both of ratchet freak's algorithms and the basic version. At the moment I randomly initialize a 200 chars long string with uniformly distributed chars in the range [0, 200).
Related videos on Youtube
GreyCat
Updated on July 08, 2022Comments
-
GreyCat almost 2 years
What is the fastest way to strip all non-printable characters from a
String
in Java?So far I've tried and measured on 138-byte, 131-character String:
- String's
replaceAll()
- slowest method- 517009 results / sec
- Precompile a Pattern, then use Matcher's
replaceAll()
- 637836 results / sec
- Use StringBuffer, get codepoints using
codepointAt()
one-by-one and append to StringBuffer- 711946 results / sec
- Use StringBuffer, get chars using
charAt()
one-by-one and append to StringBuffer- 1052964 results / sec
- Preallocate a
char[]
buffer, get chars usingcharAt()
one-by-one and fill this buffer, then convert back to String- 2022653 results / sec
- Preallocate 2
char[]
buffers - old and new, get all chars for existing String at once usinggetChars()
, iterate over old buffer one-by-one and fill new buffer, then convert new buffer to String - my own fastest version- 2502502 results / sec
- Same stuff with 2 buffers - only using
byte[]
,getBytes()
and specifying encoding as "utf-8"- 857485 results / sec
- Same stuff with 2
byte[]
buffers, but specifying encoding as a constantCharset.forName("utf-8")
- 791076 results / sec
- Same stuff with 2
byte[]
buffers, but specifying encoding as 1-byte local encoding (barely a sane thing to do)- 370164 results / sec
My best try was the following:
char[] oldChars = new char[s.length()]; s.getChars(0, s.length(), oldChars, 0); char[] newChars = new char[s.length()]; int newLen = 0; for (int j = 0; j < s.length(); j++) { char ch = oldChars[j]; if (ch >= ' ') { newChars[newLen] = ch; newLen++; } } s = new String(newChars, 0, newLen);
Any thoughts on how to make it even faster?
Bonus points for answering a very strange question: why using "utf-8" charset name directly yields better performance than using pre-allocated static const
Charset.forName("utf-8")
?Update
- Suggestion from ratchet freak yields impressive 3105590 results / sec performance, a +24% improvement!
- Suggestion from Ed Staub yields yet another improvement - 3471017 results / sec, a +12% over previous best.
Update 2
I've tried my best to collected all the proposed solutions and its cross-mutations and published it as a small benchmarking framework at github. Currently it sports 17 algorithms. One of them is "special" - Voo1 algorithm (provided by SO user Voo) employs intricate reflection tricks thus achieving stellar speeds, but it messes up JVM strings' state, thus it's benchmarked separately.
You're welcome to check it out and run it to determine results on your box. Here's a summary of results I've got on mine. It's specs:
- Debian sid
- Linux 2.6.39-2-amd64 (x86_64)
- Java installed from a package
sun-java6-jdk-6.24-1
, JVM identifies itself as- Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
- Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)
Different algorithms show ultimately different results given a different set of input data. I've ran a benchmark in 3 modes:
Same single string
This mode works on a same single string provided by
StringSource
class as a constant. The showdown is:Ops / s │ Algorithm ──────────┼────────────────────────────── 6 535 947 │ Voo1 ──────────┼────────────────────────────── 5 350 454 │ RatchetFreak2EdStaub1GreyCat1 5 249 343 │ EdStaub1 5 002 501 │ EdStaub1GreyCat1 4 859 086 │ ArrayOfCharFromStringCharAt 4 295 532 │ RatchetFreak1 4 045 307 │ ArrayOfCharFromArrayOfChar 2 790 178 │ RatchetFreak2EdStaub1GreyCat2 2 583 311 │ RatchetFreak2 1 274 859 │ StringBuilderChar 1 138 174 │ StringBuilderCodePoint 994 727 │ ArrayOfByteUTF8String 918 611 │ ArrayOfByteUTF8Const 756 086 │ MatcherReplace 598 945 │ StringReplaceAll 460 045 │ ArrayOfByteWindows1251
In charted form:
(source: greycat.ru)Multiple strings, 100% of strings contain control characters
Source string provider pre-generated lots of random strings using (0..127) character set - thus almost all strings contained at least one control character. Algorithms received strings from this pre-generated array in round-robin fashion.
Ops / s │ Algorithm ──────────┼────────────────────────────── 2 123 142 │ Voo1 ──────────┼────────────────────────────── 1 782 214 │ EdStaub1 1 776 199 │ EdStaub1GreyCat1 1 694 628 │ ArrayOfCharFromStringCharAt 1 481 481 │ ArrayOfCharFromArrayOfChar 1 460 067 │ RatchetFreak2EdStaub1GreyCat1 1 438 435 │ RatchetFreak2EdStaub1GreyCat2 1 366 494 │ RatchetFreak2 1 349 710 │ RatchetFreak1 893 176 │ ArrayOfByteUTF8String 817 127 │ ArrayOfByteUTF8Const 778 089 │ StringBuilderChar 734 754 │ StringBuilderCodePoint 377 829 │ ArrayOfByteWindows1251 224 140 │ MatcherReplace 211 104 │ StringReplaceAll
In charted form:
(source: greycat.ru)Multiple strings, 1% of strings contain control characters
Same as previous, but only 1% of strings was generated with control characters - other 99% was generated in using [32..127] character set, so they couldn't contain control characters at all. This synthetic load comes the closest to real world application of this algorithm at my place.
Ops / s │ Algorithm ──────────┼────────────────────────────── 3 711 952 │ Voo1 ──────────┼────────────────────────────── 2 851 440 │ EdStaub1GreyCat1 2 455 796 │ EdStaub1 2 426 007 │ ArrayOfCharFromStringCharAt 2 347 969 │ RatchetFreak2EdStaub1GreyCat2 2 242 152 │ RatchetFreak1 2 171 553 │ ArrayOfCharFromArrayOfChar 1 922 707 │ RatchetFreak2EdStaub1GreyCat1 1 857 010 │ RatchetFreak2 1 023 751 │ ArrayOfByteUTF8String 939 055 │ StringBuilderChar 907 194 │ ArrayOfByteUTF8Const 841 963 │ StringBuilderCodePoint 606 465 │ MatcherReplace 501 555 │ StringReplaceAll 381 185 │ ArrayOfByteWindows1251
In charted form:
(source: greycat.ru)It's very hard for me to decide on who provided the best answer, but given the real-world application best solution was given/inspired by Ed Staub, I guess it would be fair to mark his answer. Thanks for all who took part in this, your input was very helpful and invaluable. Feel free to run the test suite on your box and propose even better solutions (working JNI solution, anyone?).
References
- GitHub repository with a benchmarking suite
-
Gustav Barkefors over 12 years"This question shows research effort" - hmm... yeah, pass. +1
-
Admin over 12 years
StringBuilder
will be marginally faster thanStringBuffer
as it is un-synchronized, I just mention this because you tagged thismicro-optimization
-
home over 12 years@Jarrod Roberson: ok, so let's make all read-only fields final and extract
s.length()
out of thefor
loop as well :-) -
Vishy over 12 yearsSome characters below space are printable e.g.
\t
and\n
. Many characters above 127 are non-printable in your character set. -
ratchet freak over 12 yearsdid you init the string buffer with a capacity of
s.length()
? -
Admin over 12 years@home I agree I make everything I can
final
, but for other reasons as well, and if if you are looking at micro-optimization making the.length()
read once is another micro-optimization. I mean this is taggedmicro-optimization
isn't it! -
home over 12 years@Jarrod Roberson: full ack, that's why I mentioned it - btw I make all fields
final
for exactly the same reason :-) -
GreyCat over 12 years@ratchet freak: Yes, I did
new StringBuffer(s.length())
-
Ed Staub over 12 years@GreyCat, do you want to post the code of the current best case? It's a bit hard to tease together from the answers and your comments.
-
GreyCat over 12 yearsPlease check out question update - I've posted the testing framework and massive showdown for all 17 proposed algorithms.
-
ratchet freak over 12 yearslooking back all we did was avoid the superfluous allocations of the char arrays (and the new String object when there was no change)
-
GreyCat over 12 yearsYeah, but these are really microoptimizations - so they depend on input data a lot - a tiny shift in a few coefficient makes algorithms with or without extra checks switch places.
-
dube over 2 yearsIt bugs me more than it should that in ten years nobody mentioned that the regex for
replaceAll
is not precompiled. Then it would at least beat the "StringBuilderChar"
- String's
-
GreyCat over 12 yearsThanks! Your version yields 3105590 strings / sec - a massive improvement!
-
Thomas over 12 years
newLen++;
: what about using preincrement++newLen;
? - (++j
in the loop as well). Have a look here: stackoverflow.com/questions/1546981/… -
GreyCat over 12 yearsYeah, I thought of it too, but it won't yield any performance gains in my situation - this stripping algorithm would be called in already massively parallel system.
-
GreyCat over 12 yearsAnd, besides, I might guess that forking off a few threads for processing per every 50-100 byte string would be a huge overkill.
-
GreyCat over 12 yearsAdding
final
to this algorithm and usingoldChars[newLen++]
(++newLen
is an error - the whole string would be off by 1!) yields no measureable performance gains (i.e. I get ±2..3% differences, which are comparable to differences of different runs) -
ratchet freak over 12 years@grey I made another version with some other optimizations
-
Thomas over 12 years@GreyCat well you could initialize
newLen
with -1 to account for that. -
Thomas over 12 yearsShouldn't that be
oldChars[length]='\0
? -
GreyCat over 12 yearsProbably you've meant
oldChars[length] = '\0'
insteadoldChars='\0'
? But even then if fails me with erroneous result... Let me check... -
ratchet freak over 12 years@gray I made the char array 1 bigger to account for that
-
GreyCat over 12 years@ratchet freak Your new version doesn't strip first found non-printable character - it could be quickly patched by adding
newLen--
after a while loop... -
umbr over 12 yearsYes, forking threads for every small string is not good idea. But load balancer could improve performance. BTW, did you test performance with StringBuilder instead of StringBuffer which has performance lacks because it synchronized.
-
ratchet freak over 12 years@grey well it would only improve in my second version when the start of the strings are all for a large part printable (i.e. the non printable chars are rare and mostly near the end)
-
GreyCat over 12 yearsHmm! That's a brilliant idea! 99.9% of strings in my production environment won't really require stripping - I can improve it further to eliminate even first
char[]
allocation and return the String as is, if no stripping happened. -
GreyCat over 12 yearsMy production setup runs spawns several separate processes and utilizes as much parallel CPUs and cores as possible, so I can freely use
StringBuilder
everywhere without any issues at all. -
GreyCat over 12 yearsGreat idea! So far it brought the counts up to 3471017 strings per second - i.e. a +12% improvement over previous best version.
-
GreyCat over 12 years+1 for the effort - but you should've asked me - I already have a similar benchmarking suite - that's where I was testing my algorithms ;)
-
GreyCat over 12 yearsI doubt that any unrolling could be done here, as there are (a) dependencies on following steps of algorithm on previous steps, (b) I haven't even heard of anybody doing manual loop unrolling in Java producing any stellar results; JIT usually does a good job at unrolling whatever it sees fitting the task. Thanks for the suggestion and a link, though :)
-
Voo over 12 years@GreyCat Well I could've, but just throwing that together (out of existing code anyways) was probably faster ;)
-
GreyCat over 12 yearsThanks for the solution, please check out question update - I've published my testing framework and added 3 test run results for 17 algorithms. Your algorithm is always on the top, but it changes internal state of Java String, thus breaking the "immutable String" contract => it would pretty hard to use it in real world application. Test-wise, yeah, it's the best result, but I guess I'll announce it as a separate nomination :)
-
Voo over 12 years@GreyCat Yeah it has certainly some big strings attached and honestly I pretty much only wrote it up because I'm pretty sure there's no noticeable way to improve your current best solution further. There are situations where I'm certain that it'll work fine (no substring or intern calls before stripping it ), but that's because of knowledge about one current Hotspot version (ie afaik it won't intern strings read from IO - wouldn't be especially useful). It may be useful if one really needs those extra x%, but otherwise more a baseline to see how much you can still improve ;)
-
Voo over 12 yearsThough I'm attempted to try a JNI version if I find time - never used it so far so that'd be interesting. But I'm pretty sure it'll be slower because of the additional calling overhead (strings are way too small) and the fact that the JIT shouldn't have such a hard time optimizing the functions. Just don't use
new String()
in case your string wasn't changed, but I think you got that already. -
GreyCat over 12 yearsI've already tried to do exactly the same thing in pure C - and, well, it doesn't really show much improvement over your reflection-based version. C version runs something like +5..10% faster, not really that great - I thought it would be at least like 1.5x-1.7x...
-
Peter Cordes almost 3 yearsAnother option to avoid repeated allocation of the tmp buf would be to have the caller pass in (by reference) a scratch buffer of sufficient size. You'd have to check if that ended up defeating some optimizations, but hopefully the JVM can be sure the String and the tmp buf aren't the same object just because of their types. It's not like a C function taking two
char*
args. (If the JVM even does any optimizations that would matter; presumably it's not going to use SIMD to scan for characters that aren't>= ' '
, unfortunately.) -
Peter Cordes almost 3 yearsIt's surprising there's anything to gain from using
System.arraycopy
- if you have to compare characters one at a time anyway, it only costs a couple extra asm instructions to store and increment a pointer. Except this is Java so it probably involves a bounds-check, and the JIT isn't wonderful. So that's probably why it ends up being fast to scan read-only to find the length of a block of printable characters. Especially if they're rare so you have decent sized chunks for memcpy. -
Peter Cordes almost 3 years
while (length > newLength) newLength *= 2;
seems a bit silly; why notmax(length + 16, copy.length * 2)
. Ormax( length * 3 / 2, copy.length*2)
or something. Probably it's not important to worry about 32-bit overflow in the*3
before the/2
; maybe use>>>1
to make sure we do unsigned division to allow the*3
temporary to have twice the range of signed int. -
Peter Cordes almost 3 yearsYou talk about "late bail" - Do you mean the
if (newLen == length) return s;
check in github.com/GreyCat/java-string-benchmark/blob/master/src/ru/…? Are you talking about putting it right after thefor (j
loop? Perhaps that helps the JIT see that if that loop condition is false the first time, it should jump to code that doesreturn s;
, otherwise fall into the loop. Surprised it made that much difference to just not check the same condition twice, but maybe it made the JIT do a worse job in some way. -
Stelios Adamantidis almost 3 years@PeterCordes Let's discuss in the chat :)
-
Stelios Adamantidis almost 3 years@PeterCordes meh, I got overwhelmed lately and didn't respond in time so now the chat is deleted :( My bad. You mentioned something about
arraycopy
andmemcpy
but Java (openJDK ) apparently loops over it! Anyway, too late now. But thanks for the useful input regardless! -
Peter Cordes almost 3 yearsThat was back in OpenJDK7, and had basically a "todo" comment. Even with that C++ source, it's possible that a C++ compiler could compile that loop into a
memcpy
ormemmove
call, if it can inline and optimize away theoop
class's copy-constructor. Yes, CPUs can copy much faster than 1 byte at a time when they don't need to look at each byte separately. (My earlier point was that your algorithm already does need to look at each byte separately, so it might as well get the copying done while looking. Unless that leads to much worse JIT asm, perhaps because of bounds-check or aliasing) -
Stelios Adamantidis almost 3 years@PeterCordes I tried several iterations when attempting to optimize that part. I even tried arraycopy to another array. What boosted performance was copying to the same array. Also note that I don't copy each byte, I copy blocks, from the first printable to the last ;) If every second or third character was non printable I'm not sure if result would have been the same.
-
Peter Cordes almost 3 yearsYeah, I know how you're using ArrayCopy. My point was that doing
copy[idx] = c;
or whatever on the fly while searching must be making surrounding asm worse for some reason, for it to be worth reloading the same source data with wide loads, instead of storing each correct char while you're already looking at it one at a time. It's the same reason libcstrcpy
isn't implemented asstrlen
+memcpy
- instead it combines both operations. (Of course it's normally hand-written in asm, unlike with portable C or Java where you may have fast primitive operations like memcpy you can't replicate) -
Stelios Adamantidis almost 3 years@PeterCordes I get your point, perhaps
copy[idx] = c;
yields more checks that the arraycopy doesn't? I have to be honest I don't have answer for everything :) (although I'd love to sometimes). But it might be a good opportunity for further investigation (if time permits :( ) -
Peter Cordes almost 3 yearsYes, like I already said, my two main theories are that bounds-checking on every byte (instead of maybe just once for each ArrayCopy) are costing a lot in the asm, or that doing a store at all defeats some other optimizations, like perhaps not keeping all the loop vars in registers. e.g. the compiler can no longer prove as many loop invariants, because it thinks the store might change the value of some part of some variable. (Figuring out when stores can/can't affect other things is called "alias analysis"). A read-only loop makes alias analysis trivial / unnecessary.