hash table for strings in c++

18,906

Solution 1

Hashing can be done anyway you choose. Suppose that the string is ABC. You can employ hashing as A=1, B=2, C=3, Hash = 1+2+3/(length = 3) = 2. But, this is very primitive.

The size of the array will depend on the hash algorithm that you deploy, but it is better to choose an algorithm that returns a definite length hash for every string. For example, if you chose to go with SHA1, you can safely allocate 40 bytes per hash. Refer Storing SHA1 hash values in MySQL Read up on the algorithm http://en.wikipedia.org/wiki/SHA-1. I believe that it can be safely used.

On the other hand, if it just for a simple exercise, you can also use MD5 hash. I wouldn't recommend using it in practical purposes as its rainbow tables are available easily :)

---------EDIT-------

You can try to implement like this ::

#include <iostream>
#include <string>
#include <stdlib.h>
#include <stdio.h>
#define MAX_LEN 30
using namespace std;

typedef struct
{
    string name; // for the filename
    ... change this to your specification
}hashd;

hashd hashArray[MAX_LEN]; // tentative

int returnHash(string s)
{
    // A simple hashing, no collision handled
    int sum=0,index=0;
    for(string::size_type i=0; i < s.length(); i++)
    {
        sum += s[i];
    }
    index = sum % MAX_LEN;
    return index;
}

int main()
{
    string fileName;
    int index;
    cout << "Enter filename        ::\t" ;
    cin >> fileName;
    cout << "Enter filename is     ::\t" + fileName << "\n";
    index = returnHash(fileName);
    cout << "Generated index is ::\t" << index << "\n";
    hashArray[index].name = fileName;
    cout << "Filename in array    ::\t" <<hashArray[index].name ;
    return 0;
}

Then, to achieve O(1), anytime you want to fetch the filename's contents, just run the returnHash(filename) function. It will directly return the index of the array :)

Solution 2

A hash table can be implemented as a simple 2-dimensional array. The question is how to compute the unique key for each item to be stored. Some things have keys built into the data, and for other things you'll have to compute one: MD5 as suggested above is probably just fine for your needs.

The next problem you need to solve is how to lay out, or size, your hash table. That's something that you'll ultimately need to tune to your own needs through some testing. You might start by setting up the 1st dimension of your array with 255 entries -- one for each combination of the first 2 digits of the MD5 hash. Whenever you have a collision, you add another entry along the 2nd dimension of your array at that 1st dimension index. This means that you'll statically define a 1-dimensional array while dynamically allocating the 2nd dimension entries as needed. Hopefully that makes as much sense to you as it does to me.

When doing lookups, you can immediately find the right 1st dimension index using the 1st 2-digits of the MD5 hash. Then a relativley short linear search along the 2nd dimension will quickly bring you to the item you seek.

You might find from experimentation that it's more efficient to use a larger 1st dimension (use the fisrt 3 digits of the MD5 hash) if your data set is sufficiently large. Depending on the size of texts involved and the distribution of their use of the lexicon, your results will probably dictate some of your architecture.

On the other hand, you might just start small and build in some intelligence to automatically resize and layout your table. If your table gets too long in either direction, performance will suffer.

Share:
18,906
captain monk
Author by

captain monk

Updated on June 04, 2022

Comments

  • captain monk
    captain monk almost 2 years

    i've done in the past a small exercise about hashtable but the user was giving array size and also the struct was like this (so the user was giving a number and a word each time as input)

    struct data
    {
       int key;
       char c[20];
    };
    

    So it was quite simple since i knew the array size and also the user was saying how many items he will be give as input. The way i did it was

    • Hashing the keys the user gave me
    • find the position array[hashed(key)] in the array
    • if it was empty i would put the data there
    • if it wasn't i would put it in the next free position i would find.

    But now i have to make inverted index and i am reasearching so i can make a hashtable for it. So the words will be collected from around 30 txts and they will be so many. So in this case how long should the array be? How can i hash words? Should i use hasing with open adressing or with chaining. The exercise sais that we could use a hash table as it is if we find it online. but i prefer to understand and create it by my own. Any clues will help me :)

    In this exerice(inverted index using hash table) the structs looks like this. data type is the type of the hash table i will create.

    struct posting
    {
        string word;
        posting *next
    }
    
    struct data
    {
        string word;
        posting *ptrpostings;
        data *next
    };