Is there a memory-efficient replacement of java.lang.String?

java string optimization memory performance

21,825

Solution 1

With a Little Bit of Help From the JVM...

WARNING: This solution is now obsolete in newer Java SE versions. See other ad-hoc solutions further below.

If you use an HotSpot JVM, since Java 6 update 21, you can use this command-line option:

-XX:+UseCompressedStrings

The JVM Options page reads:

Use a byte[] for Strings which can be represented as pure ASCII. (Introduced in Java 6 Update 21 Performance Release)

UPDATE: This feature was broken in a later version and was supposed to be fixed again in Java SE 6u25 as mentioned by the 6u25 b03 release notes (however we don't see it in the 6u25 final release notes). The bug report 7016213 is not visible for security reasons. So, use with care and check first. Like any -XX option, it is deemed experimental and subject to change without much notice, so it's probably not always best to not use that in the startup scrip of a production server.

UPDATE 2013-03 (thanks to a comment by Aleksey Maximus): See this related question and its accepted answer. The option now seems to be deceased. This is further confirmed in the bug 7129417 report.

The End Justifies the Means

Warning: (Ugly) Solutions for Specific Needs

This is a bit out of the box and lower-level, but since you asked... don't hit the messenger!

Your Own Lighter String Representation

If ASCII is fine for you needs, then why don't you just roll out your own implementation?

As you mentioned, you could byte[] instead of char[] internally. But that's not all.

To do it even more lightweight, instead of wrapping your byte arrays in a class, why not simply use an helper class containing mostly static methods operating on these byte arrays that you pass around? Sure, it's going to feel pretty C-ish, but it would work, and would save you the huge overhead that goes with String objects.

And sure, it would miss some nice functionalities... unless your re-implement them. If you really need them, then there's not much choice. Thanks to OpenJDK and a lot of other good projects, you could very well roll out your own fugly LiteStrings class that just operate on byte[] parameters. You'll feel like taking a shower every time you need to call a function, but you'll have saved heaps of memory.

I'd recommend to make it resemble closely the String class's contract and to provide meaningful adapters and builders to convert from and to String, and you might want to also have adapters to and from StringBuffer and StringBuilder, as well as some mirror implementations of other things you might need. Definitely some piece of work, but might be worth it (see a bit below the "Make it Count!" section).

On-the-Fly Compression/Decompression

You could very well compress your strings in memory and decompress them on the fly when you need them. After all, you only need to be able to read them when you access them, right?

Of course, being that violent will mean:

more complex (thus less maintainable) code,
more processing power,
relatively long strings are needed for the compression to be relevant (or to compact multiple strings into one by implementing your own store system, to make the compression more effective).

Do Both

For a full-headache, of course you can do all of that:

C-ish helper class,
byte arrays,
on-the-fly compressed store.

Be sure to make that open-source. :)

Make it Count!

By the way, see this great presentation on Building Memory-Efficient Java Applications by N. Mitchell and G. Sevitsky: [2008 version], [2009 version].

From this presentation, we see that an 8-char string eats 64 bytes on a 32-bit system (96 for a 64-bit system!!), and most of it is due to JVM overhead. And from this article we see that an 8-byte array would eat "only" 24 bytes: 12 bytes of header, 8 x 1 byte + 4 bytes of alignment).

Sounds like this could be worth it if you really manipulate a lot of that stuff (and possibly speed up things a bit, as you'd spend less time allocating memory, but don't quote me on that and benchmark it; plus it would depend greatly on your implementation).

Solution 2

At Terracotta, we have some cases where we compress big Strings as they are sent around the network and actually leave them compressed until decompression is necessary. We do this by converting the char[] to byte[], compressing the byte[], then encoding that byte[] back into the original char[]. For certain operations like hash and length, we can answer those questions without decoding the compressed string. For data like big XML strings, you can get substantial compression this way.

Moving the compressed data around the network is a definite win. Keeping it compressed is dependent on the use case. Of course, we have some knobs to turn this off and change the length at which compression turns on, etc.

This is all done with byte code instrumentation on java.lang.String which we've found is very delicate due to how early String is used in startup but is stable if you follow some guidelines.

Solution 3

The article points out two things:

Character arrays increase in chunks of 8 bytes.
There is a large difference in size between char[] and String objects.

The overhead is due to including a char[] object reference, and three ints: an offset, a length, and space for storing the String's hashcode, plus the standard overhead of simply being an object.

Slightly different from String.intern(), or a character array used by String.substring() is using a single char[] for all Strings, this means you do not need to store the object reference in your wrapper String-like object. You would still need the offset, and you introduce a (large) limit on how many characters you can have in total.

You would no longer need the length if you use a special end of string marker. That saves four bytes for the length, but costs you two bytes for the marker, plus the additional time, complexity, and buffer overrun risks.

The space-time trade-off of not storing the hash may help you if you do not need it often.

For an application that I've worked with, where I needed super fast and memory efficient treatment of a large number of strings, I was able to leave the data in its encoded form, and work with byte arrays. My output encoding was the same as my input encoding, and I didn't need to decode bytes to characters nor encode back to bytes again for output.

In addition, I could leave the input data in the byte array it was originally read into - a memory mapped file.

My objects consisted of an int offset (the limit suited my situation), an int length, and an int hashcode.

java.lang.String was the familiar hammer for what I wanted to do, but not the best tool for the job.

Solution 4

An internal UTF-8 encoding has its advantages (such as the smaller memory footprint that you pointed out), but it has disadvantages too.

For example, determining the character-length (rather than the byte-length) of a UTF-8 encoded string is an O(n) operation. In a java string, the cost of determining the character-length is O(1), while generating the UTF-8 representation is O(n).

It's all about priorities.

Data-structure design can often be seen as a tradeoff between speed and space. In this case, I think the designers of the Java string API made a choice based on these criteria:

The String class must support all possible unicode characters.
Although unicode defines 1 byte, 2 byte, and 4-byte variants, the 4-byte characters are (in practice) pretty rare, so it's okay to represent them as surrogate pairs. That's why java uses a 2-byte char primitive.
When people call length(), indexOf(), and charAt() methods, they're interested in the character position, not the byte position. In order to create fast implementations of these methods, it's necessary to avoid the internal UTF-8 encoding.
Languages like C++ make the programmer's life more complicated by defining three different character types and forcing the programmer to choose between them. Most programmers start off using simple ASCII strings, but when they eventually need to support international characters, the process of modifying the code to use multibyte characters is extremely painful. I think the Java designers made an excellent compromise choice by saying that all strings consist of 2-byte characters.

Solution 5

I think you should be very cautious about basing any ideas and/or assumptions off of a javaworld.com article from 2002. There have been many, many changes to the compiler and JVM in the six years since then. At the very least, test your hypothesis and solution against a modern JVM first to make sure that the solution is even worth the effort.

View more solutions

21,825

Author by

the.duckman

Updated on April 17, 2020

Comments

the.duckman about 4 years
After reading this old article measuring the memory consumption of several object types, I was amazed to see how much memory Strings use in Java:
```
length: 0, {class java.lang.String} size = 40 bytes
length: 7, {class java.lang.String} size = 56 bytes
```
While the article has some tips to minimize this, I did not find them entirely satisfying. It seems to be wasteful to use char[] for storing the data. The obvious improvement for most western languages would be to use byte[] and an encoding like UTF-8 instead, as you only need a single byte to store the most frequent characters then instead of two bytes.

Of course one could use String.getBytes("UTF-8") and new String(bytes, "UTF-8"). Even the overhead of the String instance itself would be gone. But then there you lose very handy methods like equals(), hashCode(), length(), ...

Sun has a patent on byte[] representation of Strings, as far as I can tell.

Frameworks for efficient representation of string objects in Java programming environments
... The techniques can be implemented to create Java string objects as arrays of one-byte characters when it is appropriate ...

But I failed to find an API for that patent.

Why do I care?
In most cases I don't. But I worked on applications with huge caches, containing lots of Strings, which would have benefitted from using the memory more efficiently.

Does anybody know of such an API? Or is there another way to keep your memory footprint for Strings small, even at the cost of CPU performance or uglier API?

Please don't repeat the suggestions from the above article:
- own variant of String.intern() (possibly with SoftReferences)
- storing a single char[] and exploiting the current String.subString(.) implementation to avoid data copying (nasty)
Update

I ran the code from the article on Sun's current JVM (1.6.0_10). It yielded the same results as in 2002.
caseyamcl over 15 years

Few bytes? For many environments (ASCII only data), Java's storage requirements are slightly more than double the required amount. For large volumes of data, this is indeed a large block of wasted memory.
the.duckman over 15 years

As I wrote, in most cases no. But yes, I have written more than one app, where the biggest part of the heap were String instances and the corresponding char[]. The few bytes are several hundreds of MB.
the.duckman over 15 years

Zip only works on Strings bigger than some hundreds chars. I did Huffman coding with static lookups once - that worked. But this means, we store the data in byte[] again. Unfortunately the javolution classes are not memory efficient, as a Google code search showed - you were right.
the.duckman over 15 years

True. I just ran the code from the article on Sun's newest 1.6.0_10 JVM. Same results as 2002.
Bill K over 15 years

Yes, nkr1pt, you are right. They often point to the same object in memory, and "abc" and "abcdef" can even point to the same exact array since "length" is stored independently.
the.duckman over 15 years

Some time ago I read that interned Strings are stored in the PermGen and are never freed again. Don't know how this is today. This page wiki.eclipse.org/index.php/Performance_Bloopers lists using String.intern() as a blooper in the implementation of Eclipse 3.0.
caseyamcl over 15 years

They can be interned so that all equal strings are shared, but my assumption is that he wasn't wanting to do that (possibly long strings with not much duplication?). Large strings are not automatically shared.
the.duckman over 15 years

Sorry, my answer was not precise enough. I meant: No they are not "less memory intensive for some time now". And yes, you are right in a special case: The compilers are clever enough nowadays to merge equal String instances in a single Class to the same instance. That's why "a"=="a" yields true.
caseyamcl over 15 years

Good ? regarding permgen... I don't know if the VMs do that or not. I think most of the time the problem with inter is just that the strings you are interning end up not being duplicated as much as you think. The intern() calls can end up destroying your perf gains. Or perhaps depending on use.
caseyamcl over 15 years

Yes, zip won't work for that reason (headers too big)... but I think gzip crosses over at smaller values, albeit probably still in the 100+ char range. It is kind of surprising that noone has developed one with memory efficiency as the primary goal.
Alex Miller over 15 years

I wouldn't suggest using StringBuffer but if you were going to go that route, you should use StringBuilder as it is unsynchronized vs StringBuffer which is synchronized and is thus much faster in the vast majority of use cases.
Daniel Liu over 15 years

Well, I suppose you could just pass your strings around as byte arrays and then use a CharsetDecoder to convert them to strings on demand. I agree that it'd be nice if the String class provided a constructor that would do it for you, but I don't think it'd be worth having a whole different class.
Kevin Day over 15 years

the problem with indiscriminate use of intern() is that interned strings can not be garbage collected (i.e. permgen). In other words, a memory leak.
haylem over 13 years

@Stephen: Really? Never paid attention to that but it might be. Thanks for the heads-up.
BeeOnRope about 13 years

The UTF-16 encoding has all the same disadvantages you mention about the UTF-8 one: it isn't one code unit per code point either (only UTF-32 has that), so length in characters (unicode characters, not Java 16-bit code point characters) is still O(N). Sure, these characters are rare, but you are either correct or not. When the original design decision occurred, surrogates were non-existent so it may have made sense then. All of the existing methods on String could be made to operate in a similar way to the existing ones, with string efficiency with UTF-8. Show me a counter example!
oligofren over 12 years

@Alex: the performance difference between stringbuffer and stringbuilder is negligable.
Alex Miller over 12 years

@oligofren - in most cases, yes. However, StringBuilder is never slower (as it is identical but eliminates synchronization), thus it should be your default choice. In a few cases (where you are doing tight loops of string concatenation), the difference will be significant. Why would choose to do something that can only be slower?
the.duckman over 12 years

UseCompressedStrings is not a compiler option, but a JVM option, and a rather recent one, compared to my question: thevirtualmachinist.blogspot.com/2010/12/… But it sound very promising, thanks!
Oleksii Kolotylenko about 11 years

Some update for this information stackoverflow.com/questions/8833385/…
haylem about 11 years

@AlekseyMaximus: thanks. I've integrated your answer and the link to the related question and its answer in mine, plus the link to the bug report explaining why the Java technical documentation still mentions this option for Java SE 7 post Update 2.
haylem about 11 years

Interesting, I hadn't even noticed your answer when I wrote mine mentioning a possible approach like this. Nice insight on Terracotta :).
supercat about 10 years

When people use methods like indexOf, what they generally want is some means of identifying a string position. Code which wants the first eight characters following the string "NAME=" often won't care whether the value returned by IndexOf represents the displacement in bytes, char-sized units, or code points, provided it's consistent with other string operations. Given a method to return the index of the code point some number of code points forward or backward from a given index, there shouldn't be much need for code-point-based indices.
Navin about 10 years

This is misinformation. Number of characters is still O(n) in UTF-16.
Navin about 10 years

Old answer, but UTF-16 is not more efficient than UTF-8. Some characters are 32bits long so it takes O(n) time to find the length of a string.
supercat almost 10 years

One thing I'd like to see that could really help the performance of strings and any other immutable type that too advantage of it, would be a TelescopingReference [TR]with the following special treadment from the GC: the first field of a TelescopingReference` would also be a TelescopingReference called link, and whenever a TR Foo was scanned by the GC, Foo.link was non-null, and Foo.link.link was non-null, it would change Foo.link to identify the last non-null item in the linked list. Such an approach would mean that if two strings were compared and found to be equal, ...
supercat almost 10 years

...one could be made to hold a link to the other so they could be recgonized as equivalent without having to examine data. Discovery that any member of one equivalence set matched any member of another would allow instant recognition that all members of both sets matched, and a hash code computed for one member of a set would be cached for all. Such a thing could almost be implemented now, at reasonable cost, but for the fast that the right sequence of comparing objects and abandoning them could cause memory usage to grow without bound until the proper object was abandoned.
Lii almost 6 years

Since Java 9 this Compact Strings feature in incorporated into the standard library and used by default.