Any optimization for random access on a very big array when the value in 95% of cases is either 0 or 1?

Solution 1

A simple possibility that comes to mind is to keep a compressed array of 2 bits per value for the common cases, and a separate sorted array of 4 bytes per value (24 bits for the original element index, 8 bits for the actual value, so (idx << 8) | value) for the other ones.

When you look up a value, you first do a lookup in the 2bpp array (O(1)); if you find 0, 1 or 2 it's the value you want; if you find 3 it means that you have to look it up in the secondary array. Here you'll perform a binary search to look for the index of your interest left-shifted by 8 (O(log(n)), with a small n, as this should be the 1%), and extract the value from the 4-byte thingie.

std::vector<uint8_t> main_arr;
std::vector<uint32_t> sec_arr;

uint8_t lookup(unsigned idx) {
    // extract the 2 bits of our interest from the main array
    uint8_t v = (main_arr[idx>>2]>>(2*(idx&3)))&3;
    // usual (likely) case: value between 0 and 2
    if(v != 3) return v;
    // bad case: lookup the index<<8 in the secondary array
    // lower_bound finds the first >=, so we don't need to mask out the value
    auto ptr = std::lower_bound(sec_arr.begin(), sec_arr.end(), idx<<8);
#ifdef _DEBUG
    // some coherency checks
    if(ptr == sec_arr.end()) std::abort();
    if((*ptr >> 8) != idx) std::abort();
#endif
    // extract our 8-bit value from the 32 bit (index, value) thingie
    return (*ptr) & 0xff;
}

void populate(uint8_t *source, size_t size) {
    main_arr.clear(); sec_arr.clear();
    // size the main storage (round up)
    main_arr.resize((size+3)/4);
    for(size_t idx = 0; idx < size; ++idx) {
        uint8_t in = source[idx];
        uint8_t &target = main_arr[idx>>2];
        // if the input doesn't fit, cap to 3 and put in secondary storage
        if(in >= 3) {
            // top 24 bits: index; low 8 bit: value
            sec_arr.push_back((idx << 8) | in);
            in = 3;
        }
        // store in the target according to the position
        target |= in << ((idx & 3)*2);
    }
}

For an array such as the one you proposed, this should take 10000000 / 4 = 2500000 bytes for the first array, plus 10000000 * 1% * 4 B = 400000 bytes for the second array; hence 2900000 bytes, i.e. less than one third of the original array, and the most used portion is all kept together in memory, which should be good for caching (it may even fit L3).

If you need more than 24-bit addressing, you'll have to tweak the "secondary storage"; a trivial way to extend it is to have a 256 element pointer array to switch over the top 8 bits of the index and forward to a 24-bit indexed sorted array as above.
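For illustration, a rough sketch of such a bucketed secondary storage (my own, not from the original answer; names like sec_buckets are made up) could keep 256 sorted vectors, one per top byte of the index, each holding (low 24 bits of index << 8) | value exactly like sec_arr above:

#include <algorithm>
#include <stdint.h>
#include <vector>

// Sketch only: one sorted bucket per top byte of the index.
std::vector<uint32_t> sec_buckets[256];

void sec_insert(uint32_t idx, uint8_t value) {
    std::vector<uint32_t> &b = sec_buckets[idx >> 24];
    uint32_t key = ((idx & 0xffffffu) << 8) | value;
    // keep the bucket sorted so lookups can still binary search
    b.insert(std::upper_bound(b.begin(), b.end(), key), key);
}

uint8_t sec_lookup(uint32_t idx) {
    const std::vector<uint32_t> &b = sec_buckets[idx >> 24];
    // same trick as above: lower_bound on index<<8, value is in the low byte
    auto ptr = std::lower_bound(b.begin(), b.end(), (idx & 0xffffffu) << 8);
    return uint8_t(*ptr & 0xff);  // caller guarantees the entry exists
}

Each bucket covers a 16M range, so the per-bucket binary search stays as short as in the single-array case.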


Quick benchmark

#include <algorithm>
#include <vector>
#include <stdint.h>
#include <chrono>
#include <stdio.h>
#include <math.h>

using namespace std::chrono;

/// XorShift32 generator; extremely fast, 2^32-1 period, way better quality
/// than an LCG, but fails some test suites
struct XorShift32 {
    /// This stuff allows this class to be used wherever a library function
    /// requires a UniformRandomBitGenerator (e.g. std::shuffle)
    typedef uint32_t result_type;
    static uint32_t min() { return 1; }
    static uint32_t max() { return uint32_t(-1); }

    /// PRNG state
    uint32_t y;

    /// Initializes with seed
    XorShift32(uint32_t seed = 0) : y(seed) {
        if(y == 0) y = 2463534242UL;
    }

    /// Returns a value in the range [1, 1<<32)
    uint32_t operator()() {
        y ^= (y<<13);
        y ^= (y>>17);
        y ^= (y<<15);
        return y;
    }

    /// Returns a value in the range [0, limit); this conforms to the RandomFunc
    /// requirements for std::random_shuffle
    uint32_t operator()(uint32_t limit) {
        return (*this)()%limit;
    }
};

struct mean_variance {
    double rmean = 0.;
    double rvariance = 0.;
    int count = 0;

    void operator()(double x) {
        ++count;
        double ormean = rmean;
        rmean     += (x-rmean)/count;
        rvariance += (x-ormean)*(x-rmean);
    }

    double mean()     const { return rmean; }
    double variance() const { return rvariance/(count-1); }
    double stddev()   const { return std::sqrt(variance()); }
};

std::vector<uint8_t> main_arr;
std::vector<uint32_t> sec_arr;

uint8_t lookup(unsigned idx) {
    // extract the 2 bits of our interest from the main array
    uint8_t v = (main_arr[idx>>2]>>(2*(idx&3)))&3;
    // usual (likely) case: value between 0 and 2
    if(v != 3) return v;
    // bad case: lookup the index<<8 in the secondary array
    // lower_bound finds the first >=, so we don't need to mask out the value
    auto ptr = std::lower_bound(sec_arr.begin(), sec_arr.end(), idx<<8);
#ifdef _DEBUG
    // some coherency checks
    if(ptr == sec_arr.end()) std::abort();
    if((*ptr >> 8) != idx) std::abort();
#endif
    // extract our 8-bit value from the 32 bit (index, value) thingie
    return (*ptr) & 0xff;
}

void populate(uint8_t *source, size_t size) {
    main_arr.clear(); sec_arr.clear();
    // size the main storage (round up)
    main_arr.resize((size+3)/4);
    for(size_t idx = 0; idx < size; ++idx) {
        uint8_t in = source[idx];
        uint8_t &target = main_arr[idx>>2];
        // if the input doesn't fit, cap to 3 and put in secondary storage
        if(in >= 3) {
            // top 24 bits: index; low 8 bit: value
            sec_arr.push_back((idx << 8) | in);
            in = 3;
        }
        // store in the target according to the position
        target |= in << ((idx & 3)*2);
    }
}

volatile unsigned out;

int main() {
    XorShift32 xs;
    std::vector<uint8_t> vec;
    int size = 10000000;
    for(int i = 0; i<size; ++i) {
        uint32_t v = xs();
        if(v < 1825361101)      v = 0; // 42.5% cumulative
        else if(v < 4080218931) v = 1; // 95.0% cumulative
        else if(v < 4252017623) v = 2; // 99.0% cumulative
        else {
            while((v & 0xff) < 3) v = xs();
        }
        vec.push_back(v);
    }
    populate(vec.data(), vec.size());
    mean_variance lk_t, arr_t;
    for(int i = 0; i<50; ++i) {
        {
            unsigned o = 0;
            auto beg = high_resolution_clock::now();
            for(int i = 0; i < size; ++i) {
                o += lookup(xs() % size);
            }
            out += o;
            int dur = (high_resolution_clock::now()-beg)/microseconds(1);
            fprintf(stderr, "lookup: %10d µs\n", dur);
            lk_t(dur);
        }
        {
            unsigned o = 0;
            auto beg = high_resolution_clock::now();
            for(int i = 0; i < size; ++i) {
                o += vec[xs() % size];
            }
            out += o;
            int dur = (high_resolution_clock::now()-beg)/microseconds(1);
            fprintf(stderr, "array:  %10d µs\n", dur);
            arr_t(dur);
        }
    }

    fprintf(stderr, " lookup |   ±  |  array  |   ±  | speedup\n");
    printf("%7.0f | %4.0f | %7.0f | %4.0f | %0.2f\n",
            lk_t.mean(), lk_t.stddev(),
            arr_t.mean(), arr_t.stddev(),
            arr_t.mean()/lk_t.mean());
    return 0;
}

(code and data always updated in my Bitbucket)

The code above populates a 10M element array with random data distributed as OP specified in their post, initializes my data structure and then:

  • performs a random lookup of 10M elements with my data structure
  • does the same through the original array.

(notice that in case of sequential lookup the array always wins by a huge margin, as it's the most cache-friendly lookup you can do)

These last two blocks are repeated 50 times and timed; at the end, the mean and standard deviation for each type of lookup are calculated and printed, along with the speedup (lookup_mean/array_mean).

I compiled the code above with g++ 5.4.0 (-O3 -static, plus some warnings) on Ubuntu 16.04, and ran it on several machines; most of them are running Ubuntu 16.04, some an older Linux, some a newer one. I don't think the OS should be relevant at all in this case.

            CPU           |  cache   |  lookup (µs)   |     array (µs)  | speedup (x)
Xeon E5-1650 v3 @ 3.50GHz | 15360 KB |  60011 ±  3667 |   29313 ±  2137 | 0.49
Xeon E5-2697 v3 @ 2.60GHz | 35840 KB |  66571 ±  7477 |   33197 ±  3619 | 0.50
Celeron G1610T  @ 2.30GHz |  2048 KB | 172090 ±   629 |  162328 ±   326 | 0.94
Core i3-3220T   @ 2.80GHz |  3072 KB | 111025 ±  5507 |  114415 ±  2528 | 1.03
Core i5-7200U   @ 2.50GHz |  3072 KB |  92447 ±  1494 |   95249 ±  1134 | 1.03
Xeon X3430      @ 2.40GHz |  8192 KB | 111303 ±   936 |  127647 ±  1503 | 1.15
Core i7 920     @ 2.67GHz |  8192 KB | 123161 ± 35113 |  156068 ± 45355 | 1.27
Xeon X5650      @ 2.67GHz | 12288 KB | 106015 ±  5364 |  140335 ±  6739 | 1.32
Core i7 870     @ 2.93GHz |  8192 KB |  77986 ±   429 |  106040 ±  1043 | 1.36
Core i7-6700    @ 3.40GHz |  8192 KB |  47854 ±   573 |   66893 ±  1367 | 1.40
Core i3-4150    @ 3.50GHz |  3072 KB |  76162 ±   983 |  113265 ±   239 | 1.49
Xeon X5650      @ 2.67GHz | 12288 KB | 101384 ±   796 |  152720 ±  2440 | 1.51
Core i7-3770T   @ 2.50GHz |  8192 KB |  69551 ±  1961 |  128929 ±  2631 | 1.85

The results are... mixed!

  1. In general, on most of these machines there is some kind of speedup, or at least they are on a par.
  2. The two cases where the array truly trumps the "smart structure" lookup are on machines with lots of cache that are not particularly busy: the Xeon E5-1650 above (15 MB cache) is a night build machine, at the moment quite idle; the Xeon E5-2697 (35 MB cache) is a machine for high performance calculations, also in an idle moment. It does make sense: the original array fits completely in their huge cache, so the compact data structure only adds complexity.
  3. At the opposite end of the "performance spectrum" - but where, again, the array is slightly faster - there's the humble Celeron that powers my NAS; it has so little cache that neither the array nor the "smart structure" fits in it at all. Other machines with a small enough cache perform similarly.
  4. The Xeon X5650 results must be taken with some caution - they are virtual machines on a quite busy dual-socket virtual machine server; it may well be that, although it nominally has a decent amount of cache, during the test it gets preempted several times by completely unrelated virtual machines.

Solution 2

Another option could be

  • check if the result is 0, 1 or 2
  • if not do a regular lookup

In other words something like:

unsigned char lookup(int index) {
    int code = (bmap[index>>2]>>(2*(index&3)))&3;
    if (code != 3) return code;
    return full_array[index];
}

where bmap uses 2 bits per element with the value 3 meaning "other".

This structure is trivial to update, uses 25% more memory, but the big part is looked up in only about 1% of the cases (the values of 3 and above). Of course, as usual, whether it's a good idea or not depends on a lot of other conditions, so the only answer is experimenting with real usage.
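Since the structure is trivial to update, here is a minimal sketch of an in-place update (my own illustration; the set function and the bmap/full_array globals as declared here are assumptions, not part of the original answer):

#include <stdint.h>
#include <vector>

std::vector<uint8_t> bmap;        // 2 bits per element, 3 means "other"
std::vector<uint8_t> full_array;  // the plain uncompressed byte array

void set(int index, uint8_t value) {
    // the full array is the fallback storage, so always keep it current
    full_array[index] = value;
    // refresh the 2-bit code: the value itself if 0..2, otherwise the "other" marker
    uint8_t code = value < 3 ? value : 3;
    uint8_t &b = bmap[index >> 2];
    int shift = 2 * (index & 3);
    b = uint8_t((b & ~(3 << shift)) | (code << shift));
}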

Solution 3

This is more of a "long comment" than a concrete answer

Unless your data is something well-known, I doubt anyone can DIRECTLY answer your question (and I'm not aware of anything that matches your description, but then I don't know EVERYTHING about all kinds of data patterns for all kinds of use-cases). Sparse data is a common problem in high performance computing, but it's typically "we have a very large array, but only some values are non-zero".

For a not-well-known pattern like the one I think yours is, nobody will KNOW directly which is better; it depends on the details: how random is the random access - is the system accessing clusters of data items, or is it completely random, like from a uniform random number generator? Is the table data completely random, or are there sequences of 0 and sequences of 1, with a scattering of other values? Run-length encoding would work well if you have reasonably long sequences of 0 and 1, but it won't work if you have a "checkerboard of 0/1". Also, you'd have to keep a table of "starting points", so you can work your way to the relevant place reasonably quickly; a sketch of that idea follows.
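As an illustration only (not code from this answer), such a run-length layout with a "starting points" table could keep the start index of every run in one sorted vector and the run's value in a parallel one, so a random lookup becomes a binary search over the run starts:

#include <algorithm>
#include <stdint.h>
#include <vector>

// Sketch: run_start[i] is the index where run i begins (run_start[0] == 0),
// run_value[i] is the value repeated throughout that run.
std::vector<uint32_t> run_start;
std::vector<uint8_t>  run_value;

uint8_t rle_lookup(uint32_t idx) {
    // find the last run starting at or before idx
    auto it = std::upper_bound(run_start.begin(), run_start.end(), idx);
    return run_value[(it - run_start.begin()) - 1];
}

This only pays off when runs are long; with a "checkerboard" the two vectors end up bigger than the original array.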

I know from a long time back that some big databases are just a large table in RAM (telephone exchange subscriber data in this example), and one of the problems there is that the caches and page-table optimisations in the processor are pretty useless. The caller is so rarely the same as one who recently called someone that there is no pre-loaded data of any kind; it's just purely random. Big (huge) pages are the best optimisation for that type of access.
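On Linux, one way to get big pages for such a table is to ask for transparent huge pages; a minimal, Linux-specific sketch (my own, with error handling omitted) could look like this:

#include <stdint.h>
#include <stddef.h>
#include <sys/mman.h>

// Linux-specific sketch: back a large, randomly accessed table with huge pages
// to reduce TLB misses; error handling omitted for brevity.
uint8_t *alloc_table(size_t size) {
    void *p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    madvise(p, size, MADV_HUGEPAGE);  // request transparent huge pages for this range
    return static_cast<uint8_t *>(p);
}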

In a lot of cases, the compromise between "speed and small size" is one of those things you have to pick between in software engineering [in other engineering it's not necessarily so much of a compromise]. So, "wasting memory for simpler code" is quite often the preferred choice. In this sense, the "simple" solution is quite likely better for speed, but if you have a "better" use for the RAM, then optimising for the size of the table would give you sufficient performance and a good improvement in size. There are lots of different ways you could achieve this - as suggested in a comment, a 2-bit field where the two or three most common values are stored, plus some alternative data format for the other values - a hash table would be my first approach, but a list or binary tree may work too. Again, it depends on where your "not 0, 1 or 2" values sit in the table: are they in clusters, or are they more evenly distributed?

But a problem with that is that you are still reading the data from RAM. You are then spending more code processing the data, including some code to cope with the "this is not a common value" case.

The problem with most common compression algorithms is that they are based on unpacking sequences, so you can't random-access them. And the overhead of splitting your big data into chunks of, say, 256 entries, uncompressing each chunk into a uint8_t array, fetching the data you want, and then throwing away the uncompressed data is highly unlikely to give you good performance - assuming that's of some importance, of course.

In the end, you will probably have to implement one or a few of the ideas in the comments/answers and test them, to see whether they help solve your problem or whether the memory bus is still the main limiting factor.

Solution 4

What I've done in the past is use a hashmap in front of a bitset.

This halves the space compared to Matteo's answer, but may be slower if "exception" lookups are slow (i.e. there are many exceptions).

Often, however, "cache is king".
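A rough sketch of that layout (my own illustration; the names are made up): one bit per element for the 0/1 majority, plus a hash map keyed by index for everything else, checked first because the bit array alone can't tell an exception apart - roughly half the space of the 2-bit scheme in Solution 1.

#include <stdint.h>
#include <unordered_map>
#include <vector>

// Sketch: 1 bit per element for the common 0/1 values,
// a hash map for the rarer indices holding anything else.
std::vector<bool> bits;                           // 0 or 1 for most indices
std::unordered_map<uint32_t, uint8_t> exceptions; // index -> value (2..255)

uint8_t lookup(uint32_t idx) {
    auto it = exceptions.find(idx);
    if (it != exceptions.end()) return it->second; // rare case
    return bits[idx];                              // common case
}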

Solution 5

Unless there is a pattern to your data, it is unlikely that there is any sensible speed or size optimisation, and - assuming you are targeting a normal computer - 10 MB isn't that big a deal anyway.

There are two assumptions in your question:

  1. The data is being poorly stored because you aren't using all the bits
  2. Storing it better would make things faster.

I think both of these assumptions are false. In most cases the appropriate way to store data is to store the most natural representation. In your case, this is the one you've gone for: a byte for a number between 0 and 255. Any other representation will be more complex and therefore - all other things being equal - slower and more error prone. To divert from this general principle, you need a stronger reason than potentially six "wasted" bits on 95% of your data.

For your second assumption, it will be true if, and only if, changing the size of the array results in substantially fewer cache misses. Whether this will happen can only be definitively determined by profiling working code, but I think it's highly unlikely to make a substantial difference. Because you will be randomly accessing the array either way, the processor will struggle to know which bits of data to cache and keep.

Comments

  • JohnAl
    JohnAl almost 2 years

    Is there any possible optimization for random access on a very big array (I currently use uint8_t, and I'm asking about what's better)

    uint8_t MyArray[10000000];
    

    when the value at any position in the array is

    • 0 or 1 for 95% of all cases,
    • 2 in 4% of cases,
    • between 3 and 255 in the other 1% of cases?

    So, is there anything better than a uint8_t array to use for this? It should be as quick as possible to loop over the whole array in a random order, and this is very heavy on RAM bandwidth: with more than a few threads doing that at the same time for different arrays, the whole RAM bandwidth is quickly saturated.

    I'm asking since it feels very inefficient to have such a big array (10 MB) when it's actually known that almost all values, apart from 5%, will be either 0 or 1. So when 95% of all values in the array would only actually need 1 bit instead of 8 bit, this would reduce memory usage by almost an order of magnitude. It feels like there has to be a more memory efficient solution that would greatly reduce RAM bandwidth required for this, and as a result also be significantly quicker for random access.

  • cmaster - reinstate monica
    cmaster - reinstate monica almost 6 years
    If the larger values (>2) are nearly equally distributed across the array, you can significantly speed up the search by looking up i*small_array_size/large_array_size first.
  • JohnAl
    JohnAl almost 6 years
    Thanks! In the end, I'm just interested in what's quicker when 100% of the CPU is busy with looping over such arrays (different threads over different arrays). Currently, with a uint8_t array, the RAM bandwidth is saturated after ~5 threads are working on that at the same time (on a quad channel system), so using more than 5 threads no longer gives any benefit. I would like this to use >10 threads without running into RAM bandwidth issues, but if the CPU side of the access becomes so slow that 10 threads get less done than 5 threads before, that would obviously not be progress.
  • JohnAl
    JohnAl almost 6 years
    How exactly would a hashmap halve the space compared to Matteo's answer? What should be in that hashmap?
  • meneldal
    meneldal almost 6 years
    I'd say that's a good compromise to get as many cache hits as possible (since the reduced structure can fit in the cache more easily), without losing much on random access time.
  • o11c
    o11c almost 6 years
    @JohnAl Using a 1-bit bitset=bitvec instead of a 2-bit bitvec.
  • JohnAl
    JohnAl almost 6 years
    @o11c I'm not sure if I understand it correctly. You mean to have an array of 1 bit values where 0 means look at main_arr and 1 means look at the sec_arr (in the case of Matteo's code)? That would need overall more space than Matteo's answer though, since it's one additional array. I don't quite understand how you would do it only using half the space compared to Matteo's answer.
  • JohnAl
    JohnAl almost 6 years
    I do sometimes have to change individual values in the array, but only rarely. O(n) for that is quite bad, so I'll have to see if there's enough of a performance benefit to be worth it. I will profile all mentioned solutions of course (and my original one) and see which one is the fastest. @MartinBonner yes, of course std::sort , but I'm wondering if I should create a struct to use as the type for sec_arr and override the < operator for that, or define the sorting differently? I never really had to use std::sort much yet.
  • Martin Bonner supports Monica
    Martin Bonner supports Monica almost 6 years
    @JohnAl You don't need a struct. A uint32_t will be fine. Erasing an element from the secondary buffer will obviously leave it sorted. Inserting an element can be done with std::lower_bound and then insert (rather than appending and re-sorting the whole thing). Updates make the full-size secondary array much more attractive - I'd certainly start with that. (A sketch of such an update appears after this comment list.)
  • Martin Bonner supports Monica
    Martin Bonner supports Monica almost 6 years
    @JohnAl How many cores do you have? If you are CPU bound, there's no point having more threads than cores. Also, maybe time to look at GPU programming?
  • JohnAl
    JohnAl almost 6 years
    @MartinBonner only the first 24 bit of the uint32_t are allowed to be used for the sorting though, so I can't just sort the uint32_t array as if the values would be "real" uint32_t, so that's why I asked. I would use a struct with a bitfield and compare the first 24 bit of the bitfield then in the overridden < operator, and if there's a "cleaner" way I would like to hear about that.
  • Martin Bonner supports Monica
    Martin Bonner supports Monica almost 6 years
    @JohnAl Because the value is (idx << 8) + val you don't have to worry about the value portion - just use a straight compare. It will always compare less than ((idx+1) << 8) + val and less than ((idx-1) << 8) + val
  • JohnAl
    JohnAl almost 6 years
    @MartinBonner I do currently have 12 threads. And I agree, this would probably run very nicely on a GPU.
  • JohnAl
    JohnAl almost 6 years
    @MartinBonner oh, I didn't think about that! that makes it easy of course. Thanks! :)
  • Martin Bonner supports Monica
    Martin Bonner supports Monica almost 6 years
    ahem! My second "less than" should be "greater than" of course.
  • Martin Bonner supports Monica
    Martin Bonner supports Monica almost 6 years
    Could you clarify this? You look up the exceptional cases first, and then look in the bitmap? If so, I suspect the slow lookup in the hash will overwhelm the savings in reducing the size of the bitmap.
  • Matteo Italia
    Matteo Italia almost 6 years
    @JohnAl: if that may be useful, I added a populate function that should populate main_arr and sec_arr according to the format that lookup expects. I didn't actually try it, so don't expect it to really work correctly :-) ; anyhow, it should give you the general idea.
  • Jack Aidley
    Jack Aidley almost 6 years
    @JohnAI: If you are simply running multiple versions of the same inefficient process on multiple threads, you will always see limited progress. There will be bigger wins in designing your algorithm for parallel processing than in tweaking a storage structure.
  • JohnAl
    JohnAl almost 6 years
    @MatteoItalia Very useful, thanks! I haven't tested the populate function yet, but I have tested lookup and used it to calculate the sum of the array. I compared it with a "regular" array, and the sum is same, so seems to work fine. Also did some profiling (random access, both same): Array Size: 1M -- Regular Array: 3.9 ms -- lookup function: 6 ms Array Size: 10M -- Regular Array: 44 ms -- lookup function: 75 ms The lookup becomes a lot faster with an array size >500M, but since the 24 bit index is only valid until 16M I think testing with higher values doesn't make sense, right?
  • JohnAl
    JohnAl almost 6 years
    @MatteoItalia Most interesting: Array Size: 100K -- Regular Array: 1.58 ms -- lookup function: 0.633 ms So the lookup function seems to work very well with small values actually, but less good with higher values? Kinda the opposite of what I expected.
  • Matteo Italia
    Matteo Italia almost 6 years
    @JohnAl: that's interesting, I would have expected the opposite, although the very small case may make the difference between L2 and L3 cache; I'll do some tests as well. For the 500 MB array, yes, it cannot be used as is because of the 24 bit index.
  • JohnAl
    JohnAl almost 6 years
    @MatteoItalia That's the code I used for testing: pastebin.com/L5wWF6a6
  • Matteo Italia
    Matteo Italia almost 6 years
    @JohnAl: with my code and a 10M array I obtain consistently twice the speed with my lookup when performing random access, while plain array access is 10 times faster when performing ordered lookup (as expected). I'll post my code ASAP.
  • Matteo Italia
    Matteo Italia almost 6 years
    ideone.com/YVSgjv over on ideone it's a bit more nuanced - it's 7 vs 10, but still faster (but IDK what compile options they are using - once on Ideone I even got a 32 bit machine); on smaller arrays regular array lookup wins as expected.
  • Matteo Italia
    Matteo Italia almost 6 years
    @JohnAl in your code you are killing the cache by looping through the rand_access_order array! You should generate the indexes for the lookup "online".
  • JohnAl
    JohnAl almost 6 years
    @MatteoItalia Ah, you're right! Generating the index "on the fly" does improve the performance of the lookup function significantly in comparison, but it's still only 0-10% faster than the regular array with this code, far from twice as fast: pastebin.com/ntMGRN57 I'm using an i7 5820k, I will try your code now.
  • Matteo Italia
    Matteo Italia almost 6 years
    @JohnAl: heh, with an i7 5820k you have 15 MB of cache; the whole array fits, so I wouldn't really expect lookup to beat straight array indexing.
  • JohnAl
    JohnAl almost 6 years
    @MatteoItalia With your code, I'm seeing quite big fluctuations in the results: pastebin.com/riA9Wbyp Mostly similar to my test code, but sometimes the lookup is also significantly faster. Probably because I have significantly more overhead than you in my test. In the end, what matters is what happens when 10 threads at the same time run this (with different arrays), then my 15 MB of cache also won't help much any more, so I'll have to do some more tests with threading :)
  • JohnAl
    JohnAl almost 6 years
    @MatteoItalia I've added some simple threading to your code to let it run in 8 different threads, each with its completely own data; that's the code: pastebin.com/fncR5VZV And I do notice that it all becomes way slower (roughly 2.5 times the amount of time), and the difference between array and lookup unfortunately becomes even smaller. Feels more like they are exactly the same speed now. I am testing them only individually to not make one take more RAM bandwidth than the other one. Shouldn't the difference in RAM bandwidth become more noticeable with more threads?
  • Jack Aidley
    Jack Aidley almost 6 years
    I'm giving this +1 just for the benchmarking. Nice to see on a question about efficiency and with results for multiple processor types too! Nice!
  • JohnAl
    JohnAl almost 6 years
    Regarding my threading code, I should probably split up the generation of the array and the lookup, and not measure the lookup in threads while other threads currently use up RAM bandwidth and cache with the filling of the array. Maybe that explains the results I see.
  • Jack Aidley
    Jack Aidley almost 6 years
    @JohnAI You should profile it for your actual use case and nothing else. White room speed doesn't matter.
  • Michael Kay
    Michael Kay almost 6 years
    I suspect "random access" was being used here to indicate that accesses are unpredictable, not that they are actually random. (i.e. it's intended in the sense of "random access files")
  • Frax
    Frax almost 6 years
    That's a nice benchmark, but could you add some details about compiler, compilation options, and OS? Besides, a hash table could do better than a binary search here, at least with a hash function well fitted for the data.
  • Dúthomhas
    Dúthomhas almost 6 years
    Yes, that is likely. OP isn't clear, however. If OP's accesses are in any way not random, then some form of sparse array is indicated, as per the other answers.
  • Matteo Italia
    Matteo Italia almost 6 years
    @Frax: I made the benchmark a bit more systematic, added OS & compiler information and tested on other machines; you can find the results in the table above. Yep, a hash table may be a better idea, but I don't have much time to implement it now. Feel free to try it and report the results!
  • davidbak
    davidbak almost 6 years
    I thought this was called hashlinking - but google turns up no relevant hits so it must be something else. The way it usually worked was to have say a byte array that would hold values the vast majority of which were, say, between 0..254. Then you'd use 255 as a flag, and if you had a 255 element you'd look up the true value in an associated hash table. Can someone remember what it was called? (I think I read about it in an old IBM TR.) Anyway, you could also arrange it the way @o11c suggests - always lookup in the hash first, if it is not there, look in your bit array.
  • davidbak
    davidbak almost 6 years
    ... and to (possibly) answer @MartinBonner's suggestion about the time performance - the time would be dominated (I imagine) not by the hash lookup but by the hash computation on the index - you could substitute something else as the "exceptional" data holder - e.g., an array mapped trie and replace the hash computation with something cheaper (bit shifting&masking, for example).
  • JohnAl
    JohnAl almost 6 years
    @MatteoItalia I have tested now with only measuring time while the threads are busy with the access of the array (after all threads finished populating the array) and I seem to see a constant result of Array: 180 ms - Lookup: 130 ms with 6 threads. With 1 thread it's Array: 57 ms - Lookup: 56 ms. I'm always testing them separately; testing array/lookup simultaneously doesn't make sense since then one can steal cache from the other. So with 6 threads, the lookup seems to be 38% quicker. My code: pastebin.com/XhBrk7iy Have you tested anything with threads?
  • Martin Bonner supports Monica
    Martin Bonner supports Monica almost 6 years
    @davidbak I would expect std::hash of an integer to be a multiplication (possibly with an addition). That's not going to dominate even an L1 cache miss. In this case (where the index is presumably uniformly distributed), I would use the integer as its hash.
  • Matteo Italia
    Matteo Italia almost 6 years
    @JohnAI not yet, will do it tonight. But in your real use case is it going to be each thread with its own data or all threads looking up in a shared data structure?
  • Red.Wave
    Red.Wave almost 6 years
    A bitset for ones/zeros. A set of indices for twos. And a sparse associative array for the rest.
  • JohnAl
    JohnAl almost 6 years
    @MatteoItalia all threads work on their own data, so data is not shared between threads.
  • JVApen
    JVApen almost 6 years
    That's the short summary
  • Red.Wave
    Red.Wave almost 6 years
    Let the OP know the terms, so he can search for alternate implementations of each.
  • John Coleman
    John Coleman almost 6 years
    Interesting idea (+1) but I am somewhat skeptical that it would justify the overhead unless there are a lot of long runs of 0's and/or long runs of 1's. In effect you are suggesting using a run-length encoding of the data. It might be good in some situations but probably isn't a good general approach to this problem.
  • AnoE
    AnoE almost 6 years
    Maybe I'm missing something, but this answer seems to address mostly space usage, while the OP seems to be concerned mostly about bandwidth (i.e. time).
  • Matteo Italia
    Matteo Italia almost 6 years
    @AnoE: AFAICT the only time optimization you can do over a random access array of usually-small elements (which in a "classic", all equally fast-memory model would be the fastest way to access them) is to squeeze it so that it will fit in cache, thus gaining speed by climbing the memory hierarchy.
  • AnoE
    AnoE almost 6 years
    @MatteoItalia, would you mind adding a sentence to your answer to point out that it only makes sense to compress it this way (i.e., by a constant factor which is not dependent on the actual distribution of the data, like the approach you used) if the end results fits in the cache(s)? Only OP knows how large his data (and his cache) actually is; and it would help him deciding. That said, there is more space efficient compression (again, depending on the data => hash tables etc.), which might be in order if the above is not the case.
  • Matteo Italia
    Matteo Italia almost 6 years
    @AnoE: I can add it, but it's pretty much implicit in the question - OP already clearly states (1) how big the data is (2) its distribution (3) that his bottleneck is RAM bandwith. Also, the approach is very dependent from the distribution of data as detailed by OP: it uses a primary fast (O(1) random access) and compact (2 bit per data, quite good for the given distribution) for the most frequent data, and a slower (O(log n)) and bigger (4 byte per element) storage for the rest. A hash table has to store the index and value anyway, so at best it can be used as a better secondary storage.
  • davidbak
    davidbak almost 6 years
    Valiant effort!
  • davidbak
    davidbak almost 6 years
    Try this: Since 4% of the cases are the value 2 ... create a set of exceptional cases (>1). Create a tree somewhat as described for really exceptional cases (>2). If present in set and tree then use value in tree; if present in set and not tree then use value 2, otherwise (not present in set) lookup in your bitvector. Tree will contain only 100000 elements (bytes). Set contains 500000 elements (but no values at all). Does this reduce size while justifying its increased cost? (100% of lookups look in set; 5% of lookups need to look in tree also.)
  • Ingo Schalk-Schupp
    Ingo Schalk-Schupp almost 6 years
    I think you have a point there, since the OP indicated he would loop over the entire array in a random order. For the case that only distributions need to be observed, this is a good answer.
  • supercat
    supercat almost 6 years
    Using a small array to handle the 0/1 cases is going to be helpful, but I would think it best to handle the >1 case using a straight array. Fetching a byte from the straight array will typically result in one cache miss, and will at worst result in two (if it displaces something that would otherwise have been useful). I don't think the binary search is going to result in O(lgN) cache misses, but wouldn't expect it to work brilliantly.
  • kutschkem
    kutschkem almost 6 years
    I think this can be further improved. I have had success in the past with a similar but different problem where exploiting branch predicition helped a lot. It may help to split the if(code != 3) return code; into if(code == 0) return 0; if(code==1) return 1; if(code == 2) return 2;
  • leftaroundabout
    leftaroundabout almost 6 years
    Right. In particular for random access, this is almost certainly slower than a simple array of uint8_t, even if it takes much less memory.
  • Mooing Duck
    Mooing Duck almost 6 years
    @MatteoItalia: Ideas to test: does std::vector<bool> offer any improvement for the small array? (Probably not) Also, right now you're storing 16 "small values" per 32 bits. You can actually cram 20 per 32 bits by using base 3. (0,1,lookup). This appears to scale down to 5 values per byte safely. This may help the tiny-cache machines, at the expense of the large cache machines.
  • Matteo Italia
    Matteo Italia almost 6 years
    @MooingDuck: vector<bool> is probably going to give only overhead - especially because I'd have to combine two values to obtain a single 0-3; trinary encoding is something I was thinking about - I even have some code lying around that performs the encoding/decoding for 16 bit words without using divisions (it was used once in a product where we needed to compactly transmit -1/0/+1 values), but I couldn't find the time to benchmark it - and I still have to write/run the multithread benchmark and one for the simpler solution (with the original array as secondary storage)!
  • Matteo Italia
    Matteo Italia almost 6 years
    TBH I've been taken a bit by surprise by the reception of this answer - it really started just as an idle idea wrote down while on the train, and now I'm struggling to find time to write benchmarks to be run on every pc/server I have access to 😅. #StackOverflowProblems
  • Matteo Italia
    Matteo Italia almost 6 years
    @kutschkem: in that case, __builtin_expect & co or PGO can also help.
  • Surt
    Surt almost 6 years
    You are probably thinking about cache lines, which are usually 32 or 64 bytes, but that won't help much here as the access is random.
  • Peter Mortensen
    Peter Mortensen almost 6 years
    That is probably already the case with the use of uint8_t - the 8 means 8 bits.
  • o11c
    o11c almost 6 years
    You always want to use a CFBS-sorted array when you have an immutable tree, so there is no allocation for the nodes, just the data.
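Following up on the update discussion above (std::lower_bound plus insert/erase on the sorted secondary array), a minimal sketch for the Solution 1 layout - my own illustration, assuming the same main_arr/sec_arr format and 24-bit indexes - could be:

#include <algorithm>
#include <stdint.h>
#include <vector>

// Same layout as Solution 1: 2 bits per value in main_arr,
// (idx << 8) | value entries kept sorted in sec_arr.
std::vector<uint8_t> main_arr;
std::vector<uint32_t> sec_arr;

void update(unsigned idx, uint8_t value) {
    // locate any existing secondary entry for this index
    auto ptr = std::lower_bound(sec_arr.begin(), sec_arr.end(), idx << 8);
    bool had_sec = ptr != sec_arr.end() && (*ptr >> 8) == idx;
    if(value >= 3) {
        // large value: overwrite or insert, keeping sec_arr sorted
        if(had_sec) *ptr = (idx << 8) | value;
        else sec_arr.insert(ptr, (idx << 8) | value);
    } else if(had_sec) {
        // value shrank back to 0..2: drop the stale secondary entry
        sec_arr.erase(ptr);
    }
    // refresh the 2-bit code in the main array
    uint8_t code = value < 3 ? value : 3;
    uint8_t &b = main_arr[idx >> 2];
    int shift = 2 * (idx & 3);
    b = uint8_t((b & ~(3 << shift)) | (code << shift));
}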