Creating your own Tinyurl style uid

11,390

Solution 1

The probability of a collision against one specific ID is:

p = ( 0.5 * ( (0.5*1/10) + (0.5*1/26) ) )^6

which is around 1.7×10^-9.

The probability of a collision after generating n IDs is 1-p^n, so you'll have roughly a 0.17% chance of a collision for each new insertion after 1 million IDs have been inserted, around 1.7% after 10 million IDs, and around 16% after 100 million.

1000 IDs/minute works out to about 43 million/month, so as Sklivvz pointed out, using some incrementing ID is probably going to be a better way to go in this case.

EDIT:

To explain the math, he's essentially flipping a coin and then picking a number or letter 6 times. There's a 0.5 probability that the coin flip matches, and then 50% of the time there's a 1/10 chance of matching and a 50% chance of a 1/26 chance of matching. That happens 6 times independently, so you multiply those probabilities together.

Solution 2

Why do you want to use a random function? I always assumed that tinyurl used a base 62 (0-9A-Za-z) representation of a sequential Id. No clashes and the urls are always as short as possible.

You would have a DB table like

Id  URL
 1  http://google.com
 2  ...
... ...
156 ...
... ...

and the corresponding URLs would be:

http://example.com/1
http://example.com/2
...
http://example.com/2W
...

Solution 3

Look up the Birthday Paradox, it's the exact problem that you're running into.

The question is: How many people do you need to get together in a room, so that you have a 50% chance of any two people having the same birthdate? The answer may surprise you.

Solution 4

Some time ago I did exactly this, and I followed the way Sklivvz mentioned. The whole logic was developed with a SQL server stored procedure and a couple of UDF (user defined functions). The steps were:

  • say that you want to shorten this url: Creating your own Tinyurl style uid
  • Insert the URL in a table
  • Obtain the @@identity value of the last insert (a numeric id)
  • Transform the id in a corresponding alphanumeric value, based on a "domain" of letters and numbers (I actually used this set: "0123456789abcdefghijklmnopqrstuvwxyz")
  • Return that value back, something like 'cc0'

The conversion was realized thru a couple of very short UDF.

Two conversion called one after the other would return "sequential" values like these:

select dbo.FX_CONV (123456) -- returns "1f5n"

select dbo.FX_CONV (123457) -- returns "1f5o"

If you are interested I can share the UDF's code.

Share:
11,390

Related videos on Youtube

MordechayS
Author by

MordechayS

London based .NET developer working at JustEat, keen long distance runner. In my spare time I focus on .NET Core, Docker. Docker for ASP.NET Core (Packt video) Github profile Roadkill .NET Wiki C# Encrypted notes service

Updated on April 18, 2022

Comments

  • MordechayS
    MordechayS about 2 years

    I'm writing a small article on humanly readable alternatives to Guids/UIDs, for example those used on TinyURL for the url hashes (which are often printed in magazines, so need to be short).

    The simple uid I'm generating is - 6 characters: either a lowercase letter (a-z) or 0-9.

    "According to my calculations captain", that's 6 mutually exclusive events, although calculating the probability of a clash gets a little harder than P(A or B) = P(A) + P(B), as obviously it includes numbers and from the code below, you can see it works out whether to use a number or letter using 50/50.

    I'm interested in the clash rate and if the code below is a realistic simulation of anticipated clash rate you'd get from generating a hash. On average I get 40-50 clashes per million, however bare in mind the uid wouldn't be generated a million times at once, but probably only around 10-1000 times a minute.

    What is the probability of a clash each time, and can anyone suggest a better way of doing it?

    static Random _random = new Random();
    
    public static void main()
    {
        // Size of the key, 6
        HashSet<string> set = new HashSet<string>();
        int clashes = 0;
        for (int n=0;n < 1000000;n++)
        {
            StringBuilder builder = new StringBuilder();
    
            for (int i =0;i < 7;i++)
            {
                if (_random.NextDouble() > 0.5)
                {
                    builder.Append((char)_random.Next(97,123));
                }
                else
                {
                    builder.Append(_random.Next(0,9).ToString());
                }
            }
    
            if (set.Contains(builder.ToString()))
            {
                clashes++;
                Console.WriteLine("clash: (" +n+ ")" +builder.ToString());
            }
    
            set.Add(builder.ToString());
            _random.Next();
            //Console.Write(builder.ToString());
        }
    
        Console.WriteLine("Clashes: " +clashes);
        Console.ReadLine();
    }
    

    UPDATE: Here's the resulting article from this question

    I really asked two questions here so I was cheating. The answer I was after was rcar's, however Sklivvz's is also the answer to the 2nd part (an alternative). Is it possible to make a custom unique id generator in a database, or would it be client side (which would be 2 possible reads first)?

    The general idea I was after was using Ids in databases or other stores that can be used by phone or printed material, not a giant 16 byte guid.

    UPDATE 2: I put the formula for two mutually exclusive events above instead of 2 independent ones (as getting an 'a' the first time doesn't mean you can't get an 'a' the second time). Should've been P(A and B) = P(A) x P(B)

  • mattlant
    mattlant over 15 years
    I think the point is this just simulates it but is not the actual hashing algorithm. Its so he can determine metrics, like the clash rate.
  • mattlant
    mattlant over 15 years
    I think the point is this just simulates it but is not the actual hashing algorithm. Its so he can determine metrics, like the clash rate.
  • Jon Skeet
    Jon Skeet over 15 years
    Yes, because the characters aren't distributed evenly. Consider a (really bad) algorithm: if you roll a D100 and the result is exactly 42, generate a very long GUID. Otherwise, result="Doh". Number of possible hashes: massive. Chance of clash? Huge!
  • MSalters
    MSalters over 15 years
    Bad idea to hash the ID - you need to get the unhashed ID back to lookup the row. See Sklivvz' answer.
  • freespace
    freespace over 15 years
    @Jon: any reasonable hash algorithm would have an even distribution. @Ryan: your calculation is correct if the question was "if I generated 2 hash values, what is the probability of them colliding?". However the question here retains previously generated values, so the maths is not so simple.
  • vrbl
    vrbl over 15 years
    I didn't mean an actual hash; I just meant to follow Sklivvz's answer. I'll edit my answer to clarify that.
  • vrbl
    vrbl over 15 years
    The probability of a collision on the next insert after 1,000,000 is 0.17%, not for every collision prior to that. As I said, the probability of a collision is not constant; it depends on the number of IDs already used.
  • MordechayS
    MordechayS over 15 years
    Looks like if you remove the 0-9 factor and remove the mutually exclusive part, the chance of a clash is a lot lower. Maybe just use the ascii code for 0-9 instead. This would be 1/26 x 1/26 x 1/26 x 1/26 x 1/26 x 1/26 = 0.038^6. What's the 0.5*1/26 for above?
  • MordechayS
    MordechayS over 15 years
    I hadn't heard of base62 until now! Seems like the exact way they do it, probably decoding from base62 instead of storing the base62 version of the key, as you put above.
  • Pavel Repin
    Pavel Repin almost 15 years
    I suspect TinyURL uses base 36 or at least base N where N < 62 and N >= 36. I don't think they let you use both lower case and upper case chars. If you want URLs to be easy to enter when dictated by someone over the phone, you don't want case-sensitive sequences.
  • KMån
    KMån over 13 years
    @Skliwz: How would you handle 9999999th id the same day? and probably 999....9th id at the end of the month? Meaning why would I ever want my url to look like: http://site.com/999999999...999999w?
  • Buns of Aluminum
    Buns of Aluminum over 8 years
    This is why nobody should ever ask "should I share the code?" and they should just share the code. It's been over 5 years and KMan is still waiting.