MySQL: Select Random Entry, but Weight Towards Certain Entries

15,032

Solution 1

This guy asks the same question. He says the same as Frank, but the weightings don't come out right and in the comments someone suggests using ORDER BY -LOG(1.0 - RAND()) / Multiplier, which in my testing gave pretty much perfect results.

(If any mathematicians out there want to explain why this is correct, please enlighten me! But it works.)

The disadvantage would be that you couldn't set the weighting to 0 to temporarily disable an option, as you would end up dividing by zero. But you could always filter it out with a WHERE Multiplier > 0.

Solution 2

For a much better performance (specially on big tables), first index the weight column and use this query:

SELECT * FROM tbl AS t1 JOIN (SELECT id FROM tbl ORDER BY -LOG(1-RAND())/weight LIMIT 10) AS t2 ON t1.id = t2.id

On 40MB table the usual query takes 1s on my i7 machine and this one takes 0.04s.

For explanation of why this is faster see MySQL select 10 random rows from 600K rows fast

Solution 3

Don't use 0, 1 and 2 but 1, 2 and 3. Then you can use this value as a multiplier:

SELECT * FROM tablename ORDER BY (RAND() * Multiplier);

Solution 4

Well, I would put the logic of weights in PHP:

<?php
    $weight_array = array(0, 1, 1, 2, 2, 2);
    $multiplier = $weight_array[array_rand($weight_array)];
?>

and the query:

SELECT *
FROM `table`
WHERE Multiplier = $multiplier
ORDER BY RAND()
LIMIT 1

I think it will work :)

Solution 5

While I realise this is an question on MySQL, the following may be useful for someone using SQLite3 which has subtly different implementations of RANDOM and LOG.

SELECT * FROM table ORDER BY (-LOG(abs(RANDOM() % 10000))/weight) LIMIT 1;

weight is a column in table containing integers (I've used 1-100 as the range in my table).

RANDOM() in SQLite produces numbers between -9.2E18 and +9.2E18 (see SQLite docs for more info). I used the modulo operator to get the range of numbers down a bit.

abs() will remove the negatives to avoid problems with LOG which only handles non-zero positive numbers.

LOG() is not actually present in a default install of SQLite3. I used the php SQLite3 CreateFunction call to use the php function in SQL. See the PHP docs for info on this.

Share:
15,032
Admin
Author by

Admin

Updated on June 27, 2022

Comments

  • Admin
    Admin almost 2 years

    I've got a MySQL table with a bunch of entries in it, and a column called "Multiplier." The default (and most common) value for this column is 0, but it could be any number.

    What I need to do is select a single entry from that table at random. However, the rows are weighted according to the number in the "Multiplier" column. A value of 0 means that it's not weighted at all. A value of 1 means that it's weighted twice as much, as if the entry were in the table twice. A value of 2 means that it's weighted three times as much, as if the entry were in the table three times.

    I'm trying to modify what my developers have already given me, so sorry if the setup doesn't make a whole lot of sense. I could probably change it but want to keep as much of the existing table setup as possible.

    I've been trying to figure out how to do this with SELECT and RAND(), but don't know how to do the weighting. Is it possible?

  • Silver Light
    Silver Light about 14 years
    or just add 1: SELECT * FROM tablename ORDER BY (RAND() * (Multiplier+1));
  • Admin
    Admin about 14 years
    Interesting! The possible value for multiplier could theoretically be anything, but will probably go as high as 20. Wouldn't that make the array huge? Is that OK?
  • Admin
    Admin about 14 years
    I thought of doing something like this, but I don't see how multiplying a random number by another number results in anything getting weighted. Also, how does it know which entry to take the multiplier value from?
  • Silver Light
    Silver Light about 14 years
    Well, you can make $weight_array dynamic, so that you dont have to type all the numbers by hand. Don't worry about resources - a thousand of int's is not much.
  • TravisO
    TravisO about 14 years
    @John, then create the weight array dynamically with a for loop, by putting a 2nd for loop inside
  • Admin
    Admin about 14 years
    I'm racking my brain trying to understand what this code is doing, but I see some stuff there that I haven't seen before. Could you explain it in layman's terms?
  • Dor
    Dor about 14 years
    Yes :) I've edited my post with explanation for the PHP code.
  • Admin
    Admin about 14 years
    I'm not sure that this code do what I want it to do: Let's say I have 100 entries in the table: 98 have a multiplier of 0, 1 has a multiplier of 1 (counts as 2 entries), and 1 has a multiplier of 2 (counts as 3 entries). The chance of a 0-multiplier entry being chosen should be 98/103, of a 1-multiplier entry should be 2/103, and of a 2-multiplier entry should be 3/103. However, with your code the chances would be 1/6, 2/6, 3/6. Maybe I need to put every entry's ID into an array, with weighted entried entered multiple times, and then use array_rand?
  • Frank Heikens
    Frank Heikens about 14 years
    @John: RAND() gives you a random number between 0 and 1. A bigger multiplier gives you bigger chance to end up with the biggest result. Sorting on this result makes sense. Do some tests with a large dataset and see the results.
  • Admin
    Admin about 14 years
    Looks good, but the majority of entries will have a multiplier of 0 and it doesn't look like this code will ever select them.
  • Dor
    Dor about 14 years
    I can't see why not... You can assign to $mul the value of ( rand(1, $MaxMul) % rand(1, $MaxMul) )
  • Peter N Lewis
    Peter N Lewis about 14 years
    add "limit 1" to the end of the select as well to just get a single row. This is not an efficient solution though, it'll be slow on big tables.
  • Ken Arnold
    Ken Arnold about 11 years
    1 - RAND() is equivalent to RAND(), which is (ideally) Uniform between 0 and 1. -LOG(RAND())/weight is Exponential with rate weight. Think of an Expo as the time from now until you get an email of a particular kind, and the rate is how fast each kind of email arrives. LIMIT 1 just picks out the next email.
  • Ken Arnold
    Ken Arnold about 11 years
    This does not actually give the correct distribution (as I discovered by accident); limos's answer does.
  • Nathan
    Nathan almost 10 years
    Wouldn't this solution produce a substantial amount of overhead? I'm not sure how resource-intensive the creation of an entire table, manipulation of that table, then deletion would be on the system would be. Would an array of weighted values, dynamically generated, be simpler, less error-prone, and less resource-intensive?
  • khany
    khany over 9 years
    Brilliant! I modified this to weight towards an aggregate value from a related table. SELECT l.name, COUNT(l.id) FROM consignments c INNER JOIN locations l ON c.current_location_id = l.id GROUP BY l.id ORDER BY -LOG(RAND()) / COUNT(l.id) DESC
  • flyingL123
    flyingL123 almost 9 years
    Does this solution mean that the OP has to change their multiplier logic slightly? They originally said a multiplier of 0 indicates it is not weighted, but your solution means a multiplier of 0 is excluded from the result set. The OP would have to change their logic slightly so that a multiplier of 1 means not weighted, 2 means it's in the table twice, etc. This seems to make more sense anyway, but just wanted to confirm the change is necessary.
  • limos
    limos almost 9 years
    @flyingL123 true, good point. Or they could replace Multiplier with Multiplier + 1
  • Arth
    Arth over 7 years
    This gives a horribly skewed distribution.. say there are 98 rows weighted 1 and 1 row weighted 2. RAND() will produce a number between 0 and 1, so 50% of the time the number will be > 0.5. For the row weighted 2, (RAND() * 2) will be greater than 1 50% of the time. This is larger than all (RAND() * 1) results, so the row weighted 2 will be selected at least 50% of the time. It should in fact be selected 2% of the time (2/100).
  • Arth
    Arth over 7 years
    @KenArnold As pointed out by a comment by Crissistian Leonte in the same thread 1 - RAND() is actually slightly 'cleaner' because it removes the tiny chance that you end up doing LOG(0) which returns NULL. This is because RAND() returns 0 <= x < 1. Both solutions should return comparable results however.
  • Arth
    Arth over 7 years
    In fact because this is ORDER BY ASC not DESC, you've actually reduced the chance of the weighted rows being selected.
  • concat
    concat over 7 years
    Can you explain the significance of the subqueries? Why not SELECT * in the innermost subquery and do away with the other two? That then is just the form of the usual query.
  • Ali
    Ali over 7 years
    @concat That's because how SQL works: when you do an order on a big table it loads the whole data and then sorts according to the order by clause, but here the subquery only works on indexed data which are available in memory. see these tests: usual > i.stack.imgur.com/006Ym.jpg, subquery > i.stack.imgur.com/vXU8e.jpg the response time is highlighted.
  • concat
    concat over 7 years
    I can now confirm, and while very unexpected, I think now I understand how this works. Thanks for showing me something new today!
  • Ali
    Ali over 7 years
    You're welcome, there are lots of unexpected things in SQL, this is one of them!
  • Ali Gangji
    Ali Gangji about 6 years
    You don't have to put each entry ID into an array. You could get a count by weight: 98 at 0, 1 at 1, 1 at 2. Put the offset position into the array and repeat (add it to the array again) according to the weight. So the array would contain the numbers 1 to 98 each appearing once, 99 appearing twice, and 100 appearing 3 times. Randomly pick a position from the array, sort your data by weight and take the item at the selected position. This would be more suitable for a larger data set.
  • Klemen Tusar
    Klemen Tusar about 6 years
    I needed just the opposite and simply added 1 to that :) SELECT * FROM table ORDER BY 1.0 + LOG(1.0 - RAND()) / (Multiplier + 1)
  • Jasen
    Jasen almost 6 years
    could be much improved by using window functions, if mysql has that.