Easy interview question got harder: given numbers 1..100, find the missing number(s) given exactly k are missing

arrays algorithm math

308,865

Solution 1

Here's a summary of Dimitris Andreou's link.

Remember sum of i-th powers, where i=1,2,..,k. This reduces the problem to solving the system of equations

a₁ + a₂ + ... + a_k = b₁

a₁² + a₂² + ... + a_k² = b₂

...

a₁^k + a₂^k + ... + a_k^k = b_k

Using Newton's identities, knowing b_i allows to compute

c₁ = a₁ + a₂ + ... a_k

c₂ = a₁a₂ + a₁a₃ + ... + a_k-1a_k

...

c_k = a₁a₂ ... a_k

If you expand the polynomial (x-a₁)...(x-a_k) the coefficients will be exactly c₁, ..., c_k - see Viète's formulas. Since every polynomial factors uniquely (ring of polynomials is an Euclidean domain), this means a_i are uniquely determined, up to permutation.

This ends a proof that remembering powers is enough to recover the numbers. For constant k, this is a good approach.

However, when k is varying, the direct approach of computing c₁,...,c_k is prohibitely expensive, since e.g. c_k is the product of all missing numbers, magnitude n!/(n-k)!. To overcome this, perform computations in Z_q field, where q is a prime such that n <= q < 2n - it exists by Bertrand's postulate. The proof doesn't need to be changed, since the formulas still hold, and factorization of polynomials is still unique. You also need an algorithm for factorization over finite fields, for example the one by Berlekamp or Cantor-Zassenhaus.

High level pseudocode for constant k:

Compute i-th powers of given numbers
Subtract to get sums of i-th powers of unknown numbers. Call the sums b_i.
Use Newton's identities to compute coefficients from b_i; call them c_i. Basically, c₁ = b₁; c₂ = (c₁b₁ - b₂)/2; see Wikipedia for exact formulas
Factor the polynomial x^k-c₁x^k-1 + ... + c_k.
The roots of the polynomial are the needed numbers a₁, ..., a_k.

For varying k, find a prime n <= q < 2n using e.g. Miller-Rabin, and perform the steps with all numbers reduced modulo q.

EDIT: The previous version of this answer stated that instead of Z_q, where q is prime, it is possible to use a finite field of characteristic 2 (q=2^(log n)). This is not the case, since Newton's formulas require division by numbers up to k.

Solution 2

You will find it by reading the couple of pages of Muthukrishnan - Data Stream Algorithms: Puzzle 1: Finding Missing Numbers. It shows exactly the generalization you are looking for. Probably this is what your interviewer read and why he posed these questions.

Also see sdcvvc's directly related answer, which also includes pseudocode (hurray! no need to read those tricky math formulations :)) (thanks, great work!).

Solution 3

We can solve Q2 by summing both the numbers themselves, and the squares of the numbers.

We can then reduce the problem to

k1 + k2 = x
k1^2 + k2^2 = y

Where x and y are how far the sums are below the expected values.

Substituting gives us:

(x-k2)^2 + k2^2 = y

Which we can then solve to determine our missing numbers.

Solution 4

As @j_random_hacker pointed out, this is quite similar to Finding duplicates in O(n) time and O(1) space, and an adaptation of my answer there works here too.

Assuming that the "bag" is represented by a 1-based array A[] of size N - k, we can solve Qk in O(N) time and O(k) additional space.

First, we extend our array A[] by k elements, so that it is now of size N. This is the O(k) additional space. We then run the following pseudo-code algorithm:

for i := n - k + 1 to n
    A[i] := A[1]
end for

for i := 1 to n - k
    while A[A[i]] != A[i] 
        swap(A[i], A[A[i]])
    end while
end for

for i := 1 to n
    if A[i] != i then 
        print i
    end if
end for

The first loop initialises the k extra entries to the same as the first entry in the array (this is just a convenient value that we know is already present in the array - after this step, any entries that were missing in the initial array of size N-k are still missing in the extended array).

The second loop permutes the extended array so that if element x is present at least once, then one of those entries will be at position A[x].

Note that although it has a nested loop, it still runs in O(N) time - a swap only occurs if there is an i such that A[i] != i, and each swap sets at least one element such that A[i] == i, where that wasn't true before. This means that the total number of swaps (and thus the total number of executions of the while loop body) is at most N-1.

The third loop prints those indexes of the array i that are not occupied by the value i - this means that i must have been missing.

Solution 5

I asked a 4-year-old to solve this problem. He sorted the numbers and then counted along. This has a space requirement of O(kitchen floor), and it works just as easy however many balls are missing.

View more solutions

308,865

Author by

polygenelubricants

I mostly contributed in [java] and [regex] from February to August of 2010. I work for Palantir Technologies now, so I may not have much time on stackoverflow as I did then. We're currently hiring; you can e-mail me for a referral. A few habits I've developed on the site: I will no longer cast a downvote. It will stay at 54 forever. I don't like to engage in dramas on stackoverflow. If you really need to discuss politics and other non-technical issues with me, bring it to meta. I delete my comments once they've become obsolete I try to revise my answers periodically, so I prefer that you leave comments and feedbacks instead of editing my answers directly.

Updated on July 08, 2022

Comments

polygenelubricants almost 2 years
I had an interesting job interview experience a while back. The question started really easy:

Q1: We have a bag containing numbers 1, 2, 3, …, 100. Each number appears exactly once, so there are 100 numbers. Now one number is randomly picked out of the bag. Find the missing number.

I've heard this interview question before, of course, so I very quickly answered along the lines of:

A1: Well, the sum of the numbers 1 + 2 + 3 + … + N is (N+1)(N/2) (see Wikipedia: sum of arithmetic series). For N = 100, the sum is 5050.

Thus, if all numbers are present in the bag, the sum will be exactly 5050. Since one number is missing, the sum will be less than this, and the difference is that number. So we can find that missing number in O(N) time and O(1) space.

At this point I thought I had done well, but all of a sudden the question took an unexpected turn:

Q2: That is correct, but now how would you do this if TWO numbers are missing?

I had never seen/heard/considered this variation before, so I panicked and couldn't answer the question. The interviewer insisted on knowing my thought process, so I mentioned that perhaps we can get more information by comparing against the expected product, or perhaps doing a second pass after having gathered some information from the first pass, etc, but I really was just shooting in the dark rather than actually having a clear path to the solution.

The interviewer did try to encourage me by saying that having a second equation is indeed one way to solve the problem. At this point I was kind of upset (for not knowing the answer before hand), and asked if this is a general (read: "useful") programming technique, or if it's just a trick/gotcha answer.

The interviewer's answer surprised me: you can generalize the technique to find 3 missing numbers. In fact, you can generalize it to find k missing numbers.

Qk: If exactly k numbers are missing from the bag, how would you find it efficiently?

This was a few months ago, and I still couldn't figure out what this technique is. Obviously there's a Ω(N) time lower bound since we must scan all the numbers at least once, but the interviewer insisted that the TIME and SPACE complexity of the solving technique (minus the O(N) time input scan) is defined in k not N.

So the question here is simple:
- How would you solve Q2?
- How would you solve Q3?
- How would you solve Qk?
Clarifications
- Generally there are N numbers from 1..N, not just 1..100.
- I'm not looking for the obvious set-based solution, e.g. using a bit set, encoding the presence/absence each number by the value of a designated bit, therefore using O(N) bits in additional space. We can't afford any additional space proportional to N.
- I'm also not looking for the obvious sort-first approach. This and the set-based approach are worth mentioning in an interview (they are easy to implement, and depending on N, can be very practical). I'm looking for the Holy Grail solution (which may or may not be practical to implement, but has the desired asymptotic characteristics nevertheless).
So again, of course you must scan the input in O(N), but you can only capture small amount of information (defined in terms of k not N), and must then find the k missing numbers somehow.
- Heinrich Apfelmus almost 14 years
  
  Note that the run-time can't be independent of k. If you keep track of just k pieces of extra information, you need O(k N) time to update each extra piece whenever you touch a new number from the input. Also, the solution with the sum already keeps track of 2 log N bits.
- Dave O. almost 14 years
  
  @polygenelubricants Thank You for the clarifications. "I'm looking for an algorithm that uses O(N) time and O(K) space where K is the count of absent numbers" would have been clear from the beginning on ;-)
- Jérémie almost 14 years
  
  You should precise, in the statement of Q1 that you cannot access the numbers in order. This probably seems obvious to you, but I've never heard of the question and the term "bag" (which means "multiset" as well) was sort of confusing.
- Admin over 13 years
  
  Please read the following as the answers provided here are ridiculous: stackoverflow.com/questions/4406110/…
- CMR over 12 years
  
  Use the first number's memory for a bit_set! :-) You need space relative to N, as all the N numbers may be missing! (empty set)
- Udo Klein about 11 years
  
  The solution of summing the numbers requires log(N) space unless you consider the space requirement for an unbounded integer to be O(1). But if you allow for unbounded integers, then you have as much space as you want with just one integer.
- Chronial almost 11 years
  
  As @UdoKlein stated, you require log(N) space for big N. For the same reason you also need O(log(N)*N) time for the summation.
- David Prun over 9 years
  
  are the numbers in sequence? if it were like 1, 2, *, 4, 5, * and N = 6, then iterate i -> 1 to N and increment i. Each time you don't find match that number is missing.. I am guessing it was unsorted right? else it wouldn't be a challenge.
- sbeliakov over 8 years
  
  By the way pretty nice alternative solution for Q1 could be computing XOR of all numbers from 1 to n, then xoring result with all numbers in the given array. In the end you have your missing number. In this solution you don't need to care about overflow as in summing up.
- blazs about 8 years
  
  I think the following approach works in general. (The caveat is that it essentially encodes the bitset as a huge number. :-) Let a=sum_{s in S} 2^s. Now infer the missing numbers as indexes of turned-off bits. Both can be done in linear time; the big integer of course takes a lot of space (so it's basically cheating).
- Krish Munot almost 8 years
  
  For Q2 and Qk, I believe you need to return a set of 2 or k terms which add up as the difference of the sum of the numbers of the range. For eg: for range of 1 to 100 and there are 2 missing numbers, and the sum of the numbers is 5040, then the set of possible numbers that are missing are either { (1,9) , (2,8) , (3,7) , (4,6) } The same would imply for 'k' missing numbers as well.
- Nickpick almost 8 years
  
  Bonus question: what is the best space complexity that can be achieved with unlimited time (and vice versa) assuming 2 numbers are missing?
- toongeorges almost 7 years
  
  It is an artificial question. The bag itself already consumes O(N) space, using a bit array to keep track of the elements in the bag would not make this worse.
- Michel Rouzic over 5 years
  
  A bag?? Well assuming it's a real bag with real balls and I must go through it then I'll grab a paper and a pencil and put marks on 10x10 grid as I go through each ball by hand, I'm not gonna start doing polynomials or XORing the bits of each number in my head am I? And wow guess what it scales up to any number of missing balls. I'm not really missing the point of the question if the question is poorly thought up.
- xjcl about 5 years
  
  Dumb question but wouldn't just feeding the list into a Python set and checking if i is in the set for i=1...100 (n=100) be amortized O(n) too?
- xjcl about 5 years
  
  Since Fibonacci Heaps have O(1) insertion and O(log n) deletion shouldn't they admit an O(n + k log n) or O(n + k log k) solution?
polygenelubricants almost 14 years

+1; I've tried the formula in Maple for select numbers and it works. I still couldn't convince myself WHY it works, though.
polygenelubricants almost 14 years

I did propose this obvious answer, but this is not what the interviewer wanted. I explicitly said in the question that this is not the answer I'm looking for. Another obvious answer: sort first. Neither the O(N) counting sort nor O(N log N) comparison sort is what I'm looking for, although they are both very simple solutions.
Chris Lercher almost 14 years

@polygenelubricants: I can't find where you said that in your question. If you consider the bitset to be the result, then there is no second pass. The complexity is (if we consider N to be constant, as the interviewer suggests by saying, that the complexity is "defined in k not N") O(1), and if you need to construct a more "clean" result, you get O(k), which is the best you can get, because you always need O(k) to create the clean result.
hrnt almost 14 years

"Note that I'm not looking for the obvious set-based solution (e.g. using a bit set,". The second last paragraph from the original question.
Anon. almost 14 years

@polygenelubricants: Rearranging the first equation gives k1 = x - k2, and we can substitute that into the second equation. You can also generalize this to higher k - though it becomes computationally expensive quite rapidly.
polygenelubricants almost 14 years

@Anon: let's restrict discussion to Q2 for now. Can you show that this will always result in a solution? (i.e. no ambiguity, no unsolvable, always unique and correct solution).
Chris Lercher almost 14 years

@hmt: Yes, the question was edited a few minutes ago. I'm just giving the answer, that I would expect from an interviewee... Artificially constructing a sub-optimal solution (you can't beat O(n) + O(k) time, no matter what you do) doesn't make sense to me - except if you can't afford O(n) additional space, but the question isn't explicit on that.
polygenelubricants almost 14 years

I've edited the question again to further clarify. I do appreciate the feedback/answer.
Anon. almost 14 years

@polygenelubricants: If you wanted to prove correctness, you would first show that it always provides a correct solution (that is, it always produces a pair of numbers which, when removing them from the set, would result in the remainder of the set having the observed sum and sum-of-squares). From there, proving uniqueness is as simple as showing that it only produces one such pair of numbers.
Chris almost 14 years

The nature of the equations means that you will get two values of k2 from that equation. However, from teh first equation that you use to generate k1 you can see that these two values of k2 will mean that k1 is the other value so you have two solutions that are the same numbers the opposite way around. If you abitrarily declared that k1>k2 then you'd only have one solution to the quadratic equation and thus one solution overall. And clearly by the nature of the question an answer always exists so it always works.
Chris almost 14 years

Oooh... That's interesting. I have to admit I got a bit confused by the maths but I was jsut skimming it. Might leave it open to look at more later. :) And +1 to get this link more findable. ;-)
Heinrich Apfelmus almost 14 years

The google books link doesn't work for me. Here a better version [PostScript File].
Dimitris Andreou almost 14 years

Wow. I didn't expect this to get upvoted! Last time I posted a reference to the solution (Knuth's, in that case) instead of trying to solve it myself, it was actually downvoted: stackoverflow.com/questions/3060104/… The librarian inside me rejoices, thanks :)
Dimitris Andreou almost 14 years

@Apfelmus, note that this is a draft. (I don't blame you of course, I confused the draft for the real things for almost a year before finding the book). Btw if the link didn't work, you can go to books.google.com and search for "Muthukrishnan data stream algorithms" (without quotes), it's the first to pop up.
Heinrich Apfelmus almost 14 years

You don't have to use a prime field, you can also use q = 2^(log n). (How did you make the super- and subscripts?!)
Heinrich Apfelmus almost 14 years

Also, you can calculate the c_k on the fly, without using the power sums, thanks to the formula $c^{k+1}_m = c^k_{m+1} + c^k_m x_{k+1}$ where the superscript $k$ denotes the number of variables and $m$ the degree of the symmetric polynomial.
sdcvvc almost 14 years

@Heinrich Apfelmus Thanks! I'm not sure about the power of two approach - in general factorization would be much tricker, since there are divisors of zero and no uniqueness. To make sub/superscripts in answers, HTML tags can be used, in comments I think it is not available.
back2dos almost 14 years

+1 This is really, really clever. At the same time, it's questionable, whether it's really worth the effort, or whether (parts of) this solution to a quite artificial problem can be reused in another way. And even if this were a real world problem, on many platforms the most trivial O(N^2) solution will probably possibly outperform this beauty for even reasonably high N. Makes me think of this: tinyurl.com/c8fwgw Nonetheless, great work! I wouldn't have had the patience to crawl through all the math :)
Heinrich Apfelmus almost 14 years

@sdcvvc Ah, I mean the finite field with q=2^(log n) elements, not the ring of bit vectors with log n bits. Being a field, the former doesn't have any zero divisors; of course, multiplication is a bit more complicated than a simple bit-wise AND.
sdcvvc almost 14 years

@Heinrich Apfelmus: Thanks, added. @back2dos: The book in Dimitris's answer mentions applications of these techniques in communication complexity (set reconcilation).
phkahler almost 14 years

For a given sum k1+k2, there are many pairs. We can write these pairs as K1=a+b and K2 = a-b where a = (K1+k2/2). a is unique for a given sum. The sum of the squares (a+b)**2 + (a-b)**2 = 2*(a2 + b2). For a given sum K1+K2, the a2 term is fixed and we see that the sum of the squares will be unique due to the b2 term. Therefore, the values x and y are unique for a pair of integers.
DK. over 13 years

This answer is too simple, and we all know that simple answers don't work! ;) Seriously though, original question should probably emphasize O(k) space requirement.
Dan Tao over 13 years

I think the benefit of summing them up is that you don't have to remember which numbers you've already seen (e.g., there's no extra memory requirement). Otherwise the only option is to retain a set of all the values seen and then iterate over that set again to find the one that's missing.
Admin over 13 years

This question is usually asked with the stipulation of O(1) space complexity.
Ankit Roy over 13 years

+1, looks nice but you lost me going from line 4 to line 5 in snippet #1 -- could you explain that further? Thanks!
Admin over 13 years

Please read the following as the answers provided here are ridiculous: stackoverflow.com/questions/4406110/…
Mojo Risin about 13 years

The problem is not that is simple but that you'll have to use O(n) additional memory for the map. The problem bust me solved in constant time and constant memory
corsiKa about 13 years

I think this is a wonderful answer. I think this also illustrates how poor of an interview question it would be to extend the missing numbers beyond one. Even the first is kind of a gotchya, but it's common enough that it basically shows "you did some interview prep." But to expect a CS major to know go beyond k=1 (especially "on the spot" in an interview) is a bit silly.
tmarthal about 13 years

The sum of the first N numbers is N(N+1)/2. For N=100, Sum=100*(101)/2=5050 ;
wall-e over 11 years

I wonder why so few people vote this answer up and even did not mark it as a correct answer. Here is the code in Python. It runs in O(n) time and need extra space O(k). pastebin.com/9jZqnTzV
caf over 11 years

@wall-e: Well, I was several months late to this party.
Nemo over 11 years

How exactly do you factor a degree-k polynomial quickly?
sdcvvc over 11 years

Nemo: I linked the algorithms. Note that for purpose of this question, k is a constant, so you don't need a fast algorithm, only one whose complexity does not depend on n.
sdcvvc over 11 years

No, space complexity of this approach is O(n). Futhermore, this way is already mentioned in the question.
Fox about 11 years

@caf this is quite similar to setting the bits and counting the places where the bit is 0. And I think as you are creating an integer array more memory is occupied.
mr5 over 10 years

It should be System.out.println(k + 1) since 0 is not counted to as missing number. Also, this doesn't work in an unsorted list of numbers.
caf over 10 years

"Setting the bits and counting the places where the bit is 0" requires O(n) extra space, this solution shows how to use O(k) extra space.
comco over 10 years

Doesn't work with streams as input and modifies the input array (though I like it very much and the idea is fruitful).
WestCoastProjects about 10 years

@comco He could have just as easily used separate storage rather than mutating the existing array. The concept/structure is the same. AFA streams: how would any of these algorithms operator on streams?
WestCoastProjects about 10 years

Rather than declaring a "winner" between this and the accepted answer I'd say both are useful but being a non-math nerd find this one more approachable and immediately useful for day-to-day programming.
David Ehrmann about 10 years

This is effectively doing Reed Solomon coding on the input.
v.oddou about 10 years

I bet entering all number in a hash set and iterating over the 1...N suite using lookups to determine if numbers are missing, would be the most generic, fastest in average regarding k variations, most debuggable most maintainable and understandable solution. Of course the math way is impressive but somewhere along the way you need to be an engineer and not a mathematician. Especially when business is involved.
v.oddou about 10 years

You need to think this one again: while A[A[i]] != A[i] swap(A[i], A[A[i]]) if x != y doing swap x,y will not make them equal. Your while loop just never ends. if it was to enter at least once, your algo never finishes.
v.oddou about 10 years

;) your 4 year old must be approaching 5 or/and is a genius. my 4 year old daughter cannot even count properly to 4 yet. well to be fair let's say she just barely finally integrated the "4"'s existence. otherwise until now she would always skip it. "1,2,3,5,6,7" was her usual counting sequence. I asked her to add pencils together and she would manage 1+2=3 by denumbering all again from scratch. I'm worried actually... :'( meh..
v.oddou about 10 years

I bet you can prove the minimal solution is at least O(N). because less, would mean that you didn't even LOOK at some numbers, and since there is no ordering specified, looking at ALL numbers is mandatory.
caf about 10 years

@v.oddou: Nope, it's fine. The swap will change A[i], which means that the next iteration won't be comparing the same two values as the previous one. The new A[i] will be the same as the last loop's A[A[i]], but the new A[A[i]] will be a new value. Try it and see.
v.oddou about 10 years

@caf: oh right, I ran it (on paper) and indeed it does what you say. So nothing to see here, moving on. :)
Charles about 10 years

Did you want (data [n - 1 - odd] % 2 == 1) ++odd;?
Jack over 9 years

While I find it wonderful I wonder if there are any reason to ask this also test logical thinking. I mean, is there a real use of it?
Teepeemm over 9 years

I don't think the sum and xor uniquely define two numbers. Running a loop to get all possible k-tuples that sum to d takes time O(C(n,k-1))=O(n<sup>k-1</sup>), which, for k>2, is bad.
Teepeemm over 9 years

We're specifically not looking for the O(n) space solution of writing everything down.
Teepeemm over 9 years

Could you explain how this works? I don't understand.
gnasher729 over 9 years

The solution would be very, very, simple if I could use an array of (n + k) booleans for temporary storage, but that is not allowed. So I rearrange the data, putting the even numbers at the beginning, and the odd numbers at the end of the array. Now the lowest bits of those n numbers can be used for temporary storage, because I know how many even and odd numbers there are and can reconstruct the lowest bits! These n bits and the k extra bits are exactly the (n + k) booleans that I needed.
user3281743 over 9 years

@Anon. Could you explain this further. I am trying to understanding this. Can you actually input numbers for me so I can visualize what I am not getting
Steven over 9 years

It would be helpful to have a concretely example how the pseudo code works because I am still a little bit confused.
user3692106 over 9 years

Further Optimizations 1) Use bits instead an array of bools 2) Use a link list full of numbers 1-N and remove the ones you find Also, your smart equations still equate to my solution when you boil it down.
SirGuy about 9 years

"do a quick sort"? That doesn't fit inside the O(n) time and O(1) space complexities.
Rahul about 9 years

@GuyGreer, I should have been more precise with words. When I said quick sort , I meant partition around "0". I think you didn't understand at all. You saw quick sort and and jumped to down-vote!.
Caridorc about 9 years

This can be solved as a tiny one or two liner in python, you are you over-engeneering this.
sdcvvc about 9 years

@Caridorc: That one-liner will not satisfy the requirements of the question, which asks for space complexity independent of N.
Tomas Kubes about 9 years

It is theoretically nice, but practically I guess it would be faster to make just array of bits and find the missing number just by settings bits. Furthermore restriction to integers range does not allow to compute such big power(x,i) for bigger numbers. Am I right or not?
sdcvvc about 9 years

@qub1n: It would be faster, but the point of the question is an algorithm which does not use memory proportional to N (see "Clarifications"). As for integer range, I am doing computation modulo some prime q, which means numbers will not get huge.
Edward Doolittle about 9 years

You don't need to factor the resulting polynomial; just divide it in succession by (x-i) for i=1..100. If you get a non-zero remainder, (x-i) is not a factor.
Edward Doolittle about 9 years

In fact, you don't even need to divide it by (x-i). Just substitute i into the polynomial. The same argument works for mod p, where p is a prime greater than 100 or whatever is the largest number on the list. The calculations are entirely straightforward and fast ... I don't know why anyone would solve this problem in another way.
sdcvvc about 9 years

@EdwardDoolittle Trying each i takes Omega(N) time, while the question asked that after scanning the input the time/space complexity should be independent of N. (Technically, there is a log N factor because we are computing with integers from range [0..N] (or similar) but it's usually disregarded in the RAM model.)
Edward Doolittle about 9 years

@sdcvvc Gosh ... you're right. I should read more carefully. I see no alternative to factoring the polynomial. Thanks.
Peter Cordes over 8 years

The sum(x), sum(x^2), etc. method is nothing at all like using a bitset, except that you get the same answer. I guess mergesort equates to quicksort, too?
PabTorre over 8 years

simple yet effective approach.
user3528438 over 8 years

The easiest way is to hijack the takeout() or add() method and keep a record of what's out/in. It won't change the complexity of takeout()/add() but reduce this "K missing" problem to O(1).
AlexKoren over 8 years

This is awesome. @user3281743 here's an example. Let the missing numbers (k1 and k2) be 4 and 6. Sum(1 -> 10) = 55 and Sum(1^2 -> 10^2) = 385. Now let x = 55 - (Sum(All remaining numbers)) and y = 385 - (Sum(Squares of all remaining numbers)) thus x = 10 and y = 52. Substitute as shown which leaves us with: (10 - k2)^2 + k2^2 = 52 which you can simplify to: 2k^2 - 20k + 48 = 0. Solving the quadratic equation gives you 4 and 6 as the answer.
Blastfurnace over 8 years

The question doesn't say the numbers are sorted and explicitly says they're not looking for a sort-first solution.
Teepeemm over 8 years

What do you mean by "partition around 0"? I would interpret that to mean "find which numbers are greater than 0, and which less". But we know that the numbers come from 1 to N, so my interpretation doesn't gain us any information.
Tatarize over 8 years

Actually you can clearly do a bucket sort on such a well defined list like that which is O(n) not n log n.
Tatarize over 8 years

Provably there are no quadratic equations for numbers 3 or greater. Just 2.
Aurast about 8 years

Definitely memorize this solution and if you get this question in an interview pretend you're making up the solution on the spot. Then tell us what question you get next.
Debosmit Ray about 8 years

Did you not read the question? This finds one missing number. OP wanted k missing numbers.
rcgldr about 8 years

A[] does not need to be extended if multiple passes are used. Let m = n-k = size of A[]. Position values x = 0 to m-1 so A[x] = x, then show missing values for x = 0 to m-1. Next position values x = m to min(2m, n)-1 so that A[x-m] = x, then show values for x = m to min(2m,n)-1. If 2m < n, then repeat again for x = 2m to min(3m,n)-1 (A[x-2m] == x). Perform y passes where y is smallest integer such that y*m >= n. Since y is a linear multiplier, the process is still O(n).
Thomas Ahle about 8 years

There is also the counting bloom filter, which allows deletion. Then you can just add all the numbers and delete the ones you see in the stream.
Thomas Ahle about 8 years

The book says that the best space bound is k^2 logn bits, with a randomized factoring algorithm? Isn't that the same as what you'd get by simply hashing {1,...,n} -> {1,...,k^2}? I added an answer with this solution. That should be the most obvious for computer scientists at least.
Thomas Ahle about 8 years

@DavidEhrmann Can you expand on how one might think of this as a coding theoretic problem?
Thomas Ahle about 8 years

@sdcvvc So if we use a finite field, does this only use k log n bits of memory? That seems better than the k^2 logn memory described in the article as optimal. If we just use the natural numbers, each power-sum can be up to n^k and thus k of them takes k^2 logn memory. This is similar to what you get randomized with hashing, stackoverflow.com/a/36851791/205521 but I wonder if this solution is more efficient?
Thomas Ahle about 8 years

If we look at the input as a stream, and n is too large to keep in memory, the O(k) memory requirement makes sense. We can still use hashing though: Just make k^2 buckets and use the simple sum algorithm on each of them. That's only k^2 memory and a few more buckets can be used to get high probability of success.
Thomas Ahle about 8 years

The multiplication gets quite large though.. Also, how do you generalize to more than 2 missing numbers?
Thomas Ahle about 8 years

This uses too much space.
Thomas Ahle about 8 years

Partitioning the set is like using linear space. At least it wouldn't work in a streaming setting.
Thomas Ahle about 8 years

Would this work if you only have O(k^2) extra memory?
Thomas Ahle about 8 years

This only solves the problem for k=1, right? But I like using xor over sums, it seems a bit faster.
Thomas Ahle about 8 years

Note: We can also use xor in each bucket, rather than sum, if that's faster on our machine.
Thomas Ahle about 8 years

This uses too much memory.
Thomas Ahle about 8 years

What does max(x) mean, when x is a number?
Thomas Ahle about 8 years

How much time and memory does it use to calculate x1*x2*x3*...?
Thomas Ahle about 8 years

This wouldn't work if the data were too large to keep in memory, and you only saw it as a stream. Deliciously hacky though :)
DarkDust about 8 years

@ThomasAhle: Why are you adding useless comments to every second answer? What do you mean with it's using too much space?
Thomas Ahle about 8 years

Because the question says that "We can't afford any additional space proportional to N." This solution does exactly that.
FuzzyTree about 8 years

@ThomasAhle see en.wikipedia.org/wiki/Selection_algorithm#Space_complexity. partioning the set in place only requires O(1) additional space - not linear space. In a streaming setting it would be O(k) additional space, however, the original question does not mention streaming.
Thomas Ahle about 8 years

Not directly, but he does write "you must scan the input in O(N), but you can only capture small amount of information (defined in terms of k not N)" which is usually the definition of streaming. Moving all the numbers for partitioning isn't really possible unless you have an array of size N. It's just that the question has a lot of answers witch seem to ignore this constraint.
FuzzyTree about 8 years

@ThomasAhle I think it's a big leap to conclude that "you must scan the input in O(N), but you can only capture small amount of information (defined in terms of k not N)" means op implied the data is being streamed. I interpret it simply as "Your answer must run in O(N) and only use O(k) extra storage" and I would certainly disagree that it's the definition of streaming.
Thomas Ahle about 8 years

I'm not sure, but I do agree your algorithm is very elegant. With regards to the running time, I think its probably just nlogk?
Claudiu about 8 years

@v.oddou: Or even missing_numbers = set(range(100)) - bag - voila!
bashrc about 8 years

@ThomasAhle Yes. I have called that out in my answer.
Thomas Ahle about 8 years

Right. Do you have an idea what a "second order" xor might be, for k=2? Similar to using squares for sum, could we "square" for xor?
FuzzyTree about 8 years

@ThomasAhle the running time is identical as quick select, which is O(n^2) in the worst case but O(n) on average
Thomas Ahle about 8 years

But as you say, the performance may decrease as more numbers are added? We can also use the linear time median algorithm, to always get a perfect cut, but if the k numbers are well spread out in 1,...,n, wont you have to go about logk levels "deep" before you can prune any branches?
FuzzyTree about 8 years

@ThomasAhle I should have said the complexity is identical to quickselect for k < 4. nlogk sounds like a reasonable estimate for larger k
FuzzyTree about 8 years

Interesting but I think this only respects the space constraint when k <= sqrt(n) - at least if u=k^2? Suppose k=11 and n=100, then you would have 121 buckets and the algorithm would end up being similar to having an array of 100 bits that you check off as you read each # from the stream. Increasing u improves the chances of success but there's a limit to how much you can increase it before you exceed the space constraint.
Tuomas Laakkonen about 8 years

@ThomasAhle It is O(n)-time and O(1)-space on the length of the list, but in reality it's more as multiplication (at least in Python) is O(n^1.6)-time on the length of the number and numbers are O(log n)-space on their length.
Tuomas Laakkonen about 8 years

@ThomasAhle No, log(a^n) = n*log(a) so you would have O(l log k)-space to store the number. So given a list of length l and original numbers of length k, you would have O(l)-space but the constant factor (log k) would be lower than just writing them all out. (I don't think my method is a particularly good way of answering the question.)
emu about 8 years

@FuzzyTree Why should the worst case performance be n^2? The input numbers are a contiguous range. You can pick an optimal pivot in the middle, thus assuring O(n) for a constant number of missing items.
FuzzyTree about 8 years

@emu I'm not sure if it's possible to pick an optimal pivot and do in-place partitioning since you don't know where the optimal pivot is.
bashrc about 8 years

@ThomasAhle Modified it to work for 2 missing numbers.
emu about 8 years

@FuzzyTree Of course. The pivot need not be even present and these algorithms will still work the same.
emu about 8 years

The worst-case running time is indeed nlogk because you need to process the whole input at most logk times, and then it's a geometric sequence (one that starts with at most n elements). The space requirements are logn when implemented with plain recursion, but they can be made O(1) by running an actual quickselect and ensuring the correct length of each partition.
emu about 8 years

The space complexity can be O(1). In a first pass, you process all numbers < (n - k) by exactly this algorithm, without using 'extra'. In a second pass, you clear out the parity bits again and use the first k positions for indexing numbers (n-k)..(n).
David G about 8 years

How do they expect someone to give this answer on the spot in an interview?
Thomas Ahle about 8 years

The problem makes most sense for n much larger than k, I think, but you can actually get space down to k logn with a method very similar to the hashing described, while still having constant time updates. It's described in gnunet.org/eppstein-set-reconciliation , like the sum of powers method, but basically you hash to 'two of k' buckets with a strong hash function like tabulation hashing, which guarantees that some bucket will have only one element. To decode, you identify that bucket and removes the element from both of its buckets, which (likely) frees another bucket and so on
Admin almost 8 years

O(kitchen floor) haha - but wouldn't that be O(n^2) ?
JavaHopper almost 8 years

he probably means max from the set of numbers
slebetman over 7 years

@0x499602D2: Presumably the guy who manages to give this answer on the spot gets the job. I've seen companies leaving highly specialist positions (device driver programmer, probability analyst etc.) open for years in search of the candidate that can do the exact types of problems they need solving. In such cases they're not looking for programmers, they're looking for mathematicians or physicists who can program.
dma_k over 7 years

I have tried these formulae on very simple sequence with N = 3 and missing numbers = {1, 2}. I didn't work, as I believe the error is in formulae (2) which should read M1 = M / (a * b) (see that answer). Then it works fine.
dma_k over 7 years

I tried to apply the given formulae but I failed. Let's take N=3 (sequence {1,2,3}) with two missing numbers {a,b} = {1,2}. That results a×b = 6-3, a+b = 6 ⇒ b=6-a, a²-6a+3 = 0 ⇒ wrong.
dma_k over 7 years

It would be great if @sdcvvc could provide a numeric solution for some small N and k, for example, take N=5, k=3 and follow the computations. Thanks!
sdcvvc over 7 years

@FilipHaglund No, this is supposed to be the sum of a_i * a_j where i < j. a1a2 + a2a3 + ... would suggest the cyclic sum, which changes if a_i is permuted (while the roots have no natural order)
Filip Haglund over 7 years

@sdcvvc then what's c3 and c4? I don't see the pattern.
sdcvvc over 7 years

@FilipHaglund in Python: c3 = sum(a[x]*a[y]*a[z] for x in range(k) for y in range(x+1,k) for z in range(y+1,k)), c4 = sum(a[x]*a[y]*a[z]*a[t] for x in range(k) for y in range(x+1,k) for z in range(y+1,k) for t in range(z+1,k)) etc.
SirGuy over 7 years

This uses O(N) extra space and not O(1). This code also raises an exception because you can't append to an int (which you are doing in your loop). The code will also fail if the last number is one of the ones removed, but depending on how exactly you determine N that may not be an issue. Why do you ask people not to downvote your answer? If you think people will just downvote it why did you post this at all? Asking not to downvote is not (and shouldn't) stop people from downvoting incorrect answers.
janicebaratheon over 7 years

@GuyGreer just changed to " arr.append". Thank you for your comment.
SirGuy over 7 years

This code can be summed up by missing = set(range(1, len(arr)+NMissing)) - set(arr). The loop is unnecessary and you can make the set from the range directly. This doesn't change the fact that the whole point of this question is to solve this without allocating an array the length of len(arr) and while only reading through the data once. This solution achieves neither of these.
Viktor Mellgren almost 7 years

O(m²) i guess :)
Ivan almost 7 years

Actually another solution for case k=2 computes the sum, decides from that whether the two missing numbers have the same or different parities. If they have different parities then compute the sum of odd - sum of even. Comparing with the true result this gives you the difference of the two numbers and hence combined with the sum you can find the two numbers uniquely. If the parities are both the same - both even or both odd then you can find out from the difference of odds and evens which this is. You then search only in this restricted set and repeat.
PlsWork over 6 years

Also, for k=2 see geeksforgeeks.org/…
PlsWork over 6 years

Whats the space complexity?
SirGuy over 6 years

While the test-tube-laser method is genuinely interesting, I hope you agree that it doesn't translate well to hardware instructions and so quite unlikely to be O(logn) on a computer.
SirGuy over 6 years

As for your sorting method, that will take an amount of extra space that depends on N, and more than O(N) time (in terms of its dependency on N), which we are intending to do better than.
KRoy over 6 years

@SirGuy I appreciate your concern about test-tube concept and parallel processing memory cost. My post is to share my thoughts about the problem. GPU processors are now doing parallel processing possible. Who knows, if the test-tube concept wont be available in future.
KRoy over 6 years

Hey we need O(N) space to keep the n unique number somewhere. So how is it better than O(N) bit-mask ? The bit-mask is the smallest space required to represent an unique value because each number can either appear or not(hence 2 possibility being 1 or 0).
Rusty Rob over 6 years

this is my favourite way :)
sdcvvc over 6 years

@shuva The algorithm is not remembering numbers that come in the stream. They come as input, the "bag" from the question is not part of the algorithm memory.
Martin Thoma almost 6 years

@corsiKa This question was asked in the German final round of the selection process for the International Olympiad in Informatics ("Bundeswettbewerb Informatik"). So for people before they enter university. Out of ~30 students, I think all of us got a solution for k=1 and k=2. One of the four teams got the solution for arbitrary k, IIRC. Oh, and it was about streaming algorithms, meaning you were not allowed to loop over the input multiple times (which is the obvious solution to the question as asked above)
corsiKa almost 6 years

@MartinThoma To be honest, I know some kids that would have had the skills and patience to do this kind of thing in high school. I could see the sorts of kids who excel at such events knowing this. I think expecting a typical university graduate to know this is complete hogwash.
Sneftel almost 6 years

As you can see from the many other answers, there are generic algorithms that work for any k. The bitset approach does not run in extra space O(k).
jcsahnwaldt Reinstate Monica over 5 years

isInRange is O(log n), not O(1): it compares numbers in range 1..n, so it has to compare O(log n) bits. I don't know to what extent this error affects the rest of the analysis.
jcsahnwaldt Reinstate Monica over 5 years

The line if (i >= odd) ++tmp; should be if (i >= even) ++tmp;.
jcsahnwaldt Reinstate Monica over 5 years

The line for (int i = even; i < n; ++i) data [i] += 1; doesn't correctly restore bits.
jcsahnwaldt Reinstate Monica over 5 years

I fixed these two bugs.
jcsahnwaldt Reinstate Monica over 5 years

Nice answer! One little thing: "If sum modulo 2^i is zero, then i is missing" is incorrect. But it's clear what is intended. I think "if sum modulo 2^(i+1) is less than 2^i, then i is missing" would be correct. (Of course, in most programming languages we would use bit shifting instead of modulo calculation. Sometimes programming languages are a bit more expressive than the usual mathematic notation. :-) )
sfink over 5 years

Thanks, you're totally right! Fixed, though I was lazy and strayed from mathematical notation... oh, and I messed that up too. Fixing again...
sfink over 5 years

Beautiful answer to the problem as posed, and in fact far more practical in that situation (since the "expected" approach of summing powers of numbers requires arbitrary-precision integers and so takes too much space and time.) Though I suspect the spirit of the problem is looking at each of your inputs once only. Maybe a better formulation would be having numbers assigned to people and doing a census in a single pass or something?
jcsahnwaldt Reinstate Monica over 5 years

@sdcvvc Finite field arithmetic doesn't quite work with this algorithm. For example: With n=16, k=2, we can't distinguish (a₁=0, a₂=1) from (a₁=2, a₂=3). Values in GF(16) with primitive polynomial x⁴ + x + 1: For (a₁=0,a₂=1): b₁=a₁+a₂=0+1=1, b₂=a₁²+a₂²=0+1=1. For (a₁=2,a₂=3): b₁=a₁+a₂=2+3=1, b₂=a₁²+a₂²=4+5=1. (I used these addition and multiplication tables.)
jcsahnwaldt Reinstate Monica over 5 years

@HeinrichApfelmus Maybe finite field arithmetic works if we don't use exponents that are powers of two? For example, for k=2, instead of the sum of squares we use the sum of cubes. I'm just guessing... (Of course, we'll have to adapt the algorithm that calculates a₁, a₂, .. from the sums.)
jcsahnwaldt Reinstate Monica over 5 years

I should have said that finite field arithmetic over GF(2^x) with x > 0 doesn't work here. I think finite field arithmetic over GF(p) with p prime works.
sdcvvc over 5 years

@jcsahnwaldt You are right, thank you for catching this. In characteristic 2, x^2+y^2=(x+y)^2, therefore knowing x+y and x^2+y^2 is the same as knowing x+y only - that fixes only one degree of freedom instead of two. The reason is that Newton's identities require to divide by numbers up to k. So the algorithm requires using GF(p^x) where p > k and x > 0.
Wander3r over 5 years

axb is not P1-P2, it's P1/P2
phuclv about 5 years

sorting is definitely not O(n)
xjcl about 5 years

Ahaha yes this is the same solution I came up with for Q2, just with calculating the sum again taking the negatives for all numbers below N/2, but this is even better!
Anthony Labarre about 5 years

@phuclv: the answer stated that "This has a space requirement of O(kitchen floor)". But in any case, this is an instance where sorting can be achieved in O(n) time --- see this discussion.
ldog almost 5 years

Haha this is probably one of the more practical answers, but gets little attention.
dain over 4 years

I don't understand your graph diagram. What do the nodes, edges, and numbers represent? Why are some of the edges directed and not others?
dain over 4 years

In fact I don't really understand your answer at all, can you clarify some more?
ozgeneral over 4 years

if we have more than 2 numbers this solution would be busted
Arefe over 4 years

This is not something one should be asked in a programming interview.
John McClane about 4 years

I think this is the same as Svalorzen's answer, but you explained it in better words. Have any idea how to generalize it to Qk?
Gilad Deutsch about 4 years

Sorry for missing the other answer. I'm not sure if it's possible to generalize it to $Q_k$ since in that case you cannot bound the smallest missing element to some range. You do know that some element must be smaller than $S/k$ but that might be true for multiple elements
TZubiri about 4 years

Seems like a very complex solution, considering OP couldn't find a simple solution, I'd encourage him to ignore this one until he finds a simpler way.
Elliott over 3 years

@Tatarize, yes, but counting sorts [usually] require O(n) space.
Elliott over 3 years

The contains method runs in O(n) time, so you're solution is a O(n^2) solution, which is slower than just sorting the array first then iterating over to find what's missing [O(n*log(n)) time, O(1) space, (with heap sort)].
Elliott over 3 years

The OP said I'm not looking for the obvious set-based solution, e.g. using a bit set, encoding the presence/absence each number by the value of a designated bit, therefore using O(N) bits in additional space. This answer requires far too much memory.
Elliott over 3 years

Interesting. This is very similar to how Radix sort works. I'd say that you're trading space for time a bit here. You have O(1) space, but O(n*log(n)) time (as n grows, you'll have to consider more bits).
Elliott over 3 years

A binary search assumes that the list is sorted. If the list is sorted then the problem is trivially solvable in O(n) time and O(1) space by just iterating over the list and print when you notice that numbers have been skipped.
Elliott over 3 years

This requires O(n) space, and worst-case scenario it's O(n^2) time (generic maps can take O(n) time to add an element - good maps just make it less likely to happen).
Elliott over 3 years

Also, this solution assumes that missing numbers are never consecutive.
Elliott over 3 years

A Bloom Filter (BF) still takes linear space. As you say, it doesn't guarantee a solution. The better version of this is a boolean (bit) array, which takes the minimum amount of linear space, does it in O(n) time, and always gets the right answer. The BF is basically trying to make use of this technique when the key numbers are larger that the array size, which we don't have to worry about in our case, so there's no need for the compromise designed by the BF.
Elliott over 3 years

I'm not sure how this solution is any simpler or more efficient than caf's solution, which was posted a few years before this one.
Tatarize over 3 years

Still worthwhile O(n) time complexity even if you are making n lists of numbers you fill with like 1 number.
Rusty Rob over 3 years

@Elliott sorry i should have been more clear, we are binary searching for missing intervals. E.g. start with (0, 100), in O(N) we see this interval contains less than 100 items, so we change the interval to (0, 50), then (0,25), then we maybe see 25 items from (0,25), so we try (25, 50), and so on, So we use 0 space and no sorting required
Elliott over 3 years

I'm sorry, could you explain this more? At your iteration of size 100, you said that you could "see" in linear time that there are less than 100 numbers (presumably unique numbers). But how? Because this appears to be a kind of divide and conquer method, we no longer have useful bounds on the element values. And what happens if all elements are unique except at indices 5 and 35: when you look at [0,24] you see all unique values, then look at [25,49] and see all unique values. This doesn't seem to help us...
Rusty Rob over 3 years

1+2+..+n = n*(n+1)/2, so if we maintain a count and only add numbers to the count if they're within the interval, then at the end we can see if the interval is the size we would expect e.g. for (a, b) we expect count to be b*(b+1)/2 - (a-1)*a/2. In the problem statement it mentions 'Each number appears exactly once'. People have already mentioned how to solve if theres one or zero missing from an interval. This is away to extend to K that's fairly easy to code, possibly reasonably efficient, and constant space
Elliott over 3 years

Okay, thanks for your explanation. It has best-case time complexity O(kn) and worst-case O(n^2). I downvoted the answer earlier, but I'll remove if/when you edit the answer to explain what you've just said.
Elliott over 3 years

I know' it's four years later, but I figured out how to generalise this (and yes, you could use parallelism to speed it up): stackoverflow.com/a/64285525/8658157
cisnjxqu over 3 years

@DavidEhrmann could you elaborate on how this equates to Reed-Solomon coding?
Tarje Bargheer about 3 years

How about this for Q_k: After bisecting at the average, if you count the summands while taking the sum at side of the bisection, you will know the amount of numbers missing on each side - and the problem has been reduced to Q_l on the left side and Q_r on the right side, where l + r = k where l < k and r < k by the same reasoning as in the answer - so these can be solved recursively.
piepi about 3 years

This modifies the original array in more than just extending it. Can it be said to be O(k) space when the whole array gets actually operated on? Most of the times we are not allowed to mess with the input.
lalala almost 3 years

How do you factor the polynomial efficiently?
Gulzar almost 3 years

Why use modulu only for varying k and not always?
Gulzar almost 3 years

See a video explanation of this algorithm, solved to more detail
jcsahnwaldt Reinstate Monica almost 3 years

That's the same solution as in sdcvvc's answer, isn't it? stackoverflow.com/a/3492967/131160
jcsahnwaldt Reinstate Monica almost 3 years

I don't think space is constant. If I understand correctly, you want to re-use the 4-element array for each input scan. But where do you store the results of the previous input scan?
jcsahnwaldt Reinstate Monica almost 3 years

I think you could simplify the algorithm by using a 2-element array instead of a 4-element array. During input scan i, you only look at bit i (instead of two adjacent bits).