Find the year with the most number of people alive in Python

python algorithm

12,119

Solution 1

>>> from collections import Counter
>>> from itertools import chain
>>> def most_pop(pop):
...     pop_flat = chain.from_iterable(range(i,j+1) for i,j in pop)
...     return Counter(pop_flat).most_common()
...
>>> most_pop([(1920, 1939), (1911, 1944), (1920, 1955), (1938, 1939)])[0]

Solution 2

Another solution I just though of:

Create 2 tables, birthdates and deathdates.
Accumulate birth dates and death dates in those tables.
Browse those tables to accumulate the number of alive people at the time.

Grand total complexity is O(n)

Implementation

from collections import Counter

def most_populated(population, single=True):
    birth = map(lambda x: x[0], population)
    death = map(lambda x: x[1] + 1, population)
    b = Counter(birth)
    d = Counter(death)
    alive = 0
    years = {}
    for year in range(min(birth), max(death) + 1):
        alive = alive + b[year] - d[year]
        years[year] = alive
    return max(years, key=years.get) if single else \
           [key for key, val in years.iteritems() if val == max(years.values())]

Better

from collections import Counter
from itertools import accumulate
import operator

def most_populated(population, single=True):
    delta = Counter(x[0] for x in population)
    delta.subtract(Counter(x[1]+1 for x in population))
    start, end = min(delta.keys()), max(delta.keys())
    years = list(accumulate(delta[year] for year in range(start, end)))
    return max(enumerate(years), key=operator.itemgetter(1))[0] + start if single else \
           [i + start for i, val in enumerate(years) if val == max(years)]

Solution 3

I would go like this:

Sort persons by birth year (unborn list)
Starting from the first born
- Put that person in the alive list
- Using an insertion sort by date of death (the list stays sorted, so use a binary search)
- Until you reach a person that was not born that year
Then, starting from the person in the alive list that dies first, remove it from the list.
Put the size of the alive list in a dict
Increment the year
Loop until the unborn and alive lists are empty

Complexity should be around O((m + n) * log(m)) (each year is considered only once, and each person only twice, multiplied by the insertion cost in the alive list)

Implementation

from bisect import insort

def most_populated(population, single=True):
    years = dict()
    unborn = sorted(population, key=lambda x: -x[0])
    alive = []
    dead = []
    for year in range(unborn[-1][0], max(population, key=lambda x: x[1])[1] + 1):
        while unborn and unborn[-1][0] == year:
            insort(alive, -unborn.pop()[1])
        while alive and alive[-1] == -(year - 1):
            dead.append(-alive.pop())
        years[year] = len(alive)
    return max(years, key=years.get) if single else \
           [key for key, val in years.iteritems() if val == max(years.values())]

Solution 4

We can also use numpy slicing, which is quite neat, and should also be quite efficient:

import numpy as np
from collections import namedtuple

Person = namedtuple('Person', ('birth', 'death'))
people = [Person(1900,2000), Person(1950,1960), Person(1955, 1959)]

START_YEAR = 1900
END_YEAR = 2000
people_alive = np.zeros(END_YEAR - START_YEAR + 1) # Alive each year

for p in people:
    a = p.birth - START_YEAR
    b = p.death - START_YEAR + 1 # include year of death
    people_alive[a:b] += 1

# Find indexes of maximum aliveness and convert to year
most_alive = np.flatnonzero(people_alive == people_alive.max()) + START_YEAR

EDIT It seems like the namedtuple adds a bit of overhead, so to speed up a bit more, remove the namedtuple and do for birth, death in people: instead.

Solution 5

Just put the birth and death years into a dict. If it is birth, increase the value by 1. or vice versa.
Sort the dict by keys and iterate by reading the current number of the alive people.

Follow the 'maxAlive' an 'theYear' to get the first year with the highest number

years = {} 

for p in people:
    if p.birth in years:
        years[p.birth] += 1
    else:
        years[p.birth] = 1

    if p.death in years:
        years[p.death] -= 1
    else:
        years[p.death] = -1

alive = 0
maxAlive = 0
theYear = people[0].birth
for year in sorted(years):
    alive += years[year]
    if alive > maxAlive:
        maxAlive = alive
        theYear = year

View more solutions

12,119

oski86

I graduated from Polish-Japanese Academy of Information Technology with a MSc Eng. degree in computer science (major databases). After 3 years of commercial work I returned to my university to become a PhD student in the field of social informatics. My work is connected with problems in a category of web intelligence, especially swarm creativity and collaborative innovation networks (COINs). I specialize in analysing quality of work in Open Source Software (OSS) repositories on the GitHub portal. My hobbies are Kendo, reading about successful startups, and travelling to Southeast Asia. Currently I work for company called TogetherData as a Data scientist & data engineer.

Updated on October 20, 2022

Comments

oski86 over 1 year
Given a list of people with their birth and end years (all between 1900 and 2000), find the year with the most number of people alive.

Here is my somewhat brute-force solution:
```
def most_populated(population, single=True):
    years = dict()
    for person in population:
        for year in xrange(person[0], person[1]):
            if year in years:
                years[year] += 1
            else:
                years[year] = 0
    return max(years, key=years.get) if single else \
           [key for key, val in years.iteritems() if val == max(years.values())]

print most_populated([(1920, 1939), (1911, 1944),
                      (1920, 1955), (1938, 1939)])
print most_populated([(1920, 1939), (1911, 1944),
                      (1920, 1955), (1938, 1939), (1937, 1940)], False)
```
I'm trying to find a more efficient way to solve this problem in Python. Both - readability and efficiency counts. Moreover, for some reason my code won't print [1938, 1939] while it should.

Update

Input is a list of tuples, where first element of a tuple is a year when person was born, and second element of a tuple is the year of death.

Update 2

End year (2nd part of tuple) counts as well as a year of the person being alive (so If the person dies in Sept 1939 (we don't care about the month), he is actually alive in 1939, at least part of it). That should fix the 1939' missing in results.

Best solution?

While readability counts in favor of @joran-beasley, for bigger input most efficient algorithm was provided by @njzk2. Thanks @hannes-ovrén for providing analysis in IPython notebook on Gist
- njzk2 almost 9 years
  
  years[year] = 1 for starters
- Greg Hilston almost 9 years
  
  @njzk2 Technically I think you're right from a readability standpoint, but every years[n] will be initialized to 0 and will still yield the "most_populated" year correctly. If the count was bring returned, I'd agree with you but only the year is being returned.
- DSM almost 9 years
  
  Let's say that you have 10 people born in 2000 and they all die in 2080. Then there are another 10 who are born in 2080 and they die in 2160. Do you want the maximum people alive to be 10 or 20 or 15 or something else? What is your assumption about what's going on in 2080?
- Colonel Panic almost 9 years
  
  For a harder problem playing with some of the same ideas, see Google Code Jam's The Great Wall code.google.com/codejam/contest/2437488/dashboard#s=p2
oski86 almost 9 years

Thanks, looks very neat. How can I change the code to count the 2nd part of tuple as well ? (death year should also count)
Mazdak almost 9 years

xrange(i,j+1) for i,j in pop in python2 and range(i,j+1) for i,j in pop is better!
DSM almost 9 years

This will be very slow for large populations because of insort.
njzk2 almost 9 years

@DSM: insort is O(n), which makes my initial complexity computation wrong, but nontheless this is equivalent in complexity to considering each year of each person.
njzk2 almost 9 years

you really think people_alive[a:b] += 1 is O(1)?
Hannes Ovrén almost 9 years

That is incredibly true. Not sure what I was thinking. :)
njzk2 almost 9 years

using a table instead of a Counter, we could use itertools.accumulate to compute the values for each year in a single call
DSM almost 9 years

@nkjz2: I think I'd treat the input data as an event queue, with births being (t, 1) and deaths being (t, -1). You can sort that in N log N, and then simply loop over each event either adding or decreasing to the total population, tracking whatever you want.
oski86 almost 9 years

Accepted due to best timeit test: gist.github.com/oskar-j/308aafbb7ae44df7fed1 , result: joran-beasley: 0.0668609038671
njzk2 almost 9 years

@DSM: yep, I realized something like that can be done. Also, since it is a relatively known set of dates, you can use a bucket sort (O(n)), or a Counter, like what I started in another answer.
oski86 almost 9 years

Timeit 10000 times seems to return a constant execution time, probably with bigger test data on the input it may work quite fine, but it needs further tests. +1 for using the numpy
njzk2 almost 9 years

I like that answer because it is short, clear and readable, but beware of the complexity, if you have a more significant number of people or years to handle.
Hannes Ovrén almost 9 years

Considering that the maximum length of the range a:b is (assuming normal humans) in the order of 100, you could consider it to be O(1).
Hannes Ovrén almost 9 years

@oski86, Love the benchmark. It would be interesting to see results for different sizes of the input list as well. I would guess that my numpy version suffers for small N due to the construction of the people_alive array. For the record, I do think this answer should be the accepted one. I like it :)
Hannes Ovrén almost 9 years

Actually, I take that back. This algorithm is by far the slowest of the ones tried so far. It just happens to be fast for really small population sizes. For larger populations njzk2_var2 is fastest. See benchmark plot here: gist.github.com/hovren/bf5335ba45cd7eee811e the hannes_v2 algorithm is exactly the same as my old one, but without the namedtuple, since that added a bit of overhead.
oski86 almost 9 years

That's what I wanted to try yesterday (provide bigger rand test data), but it where already wee hours. I'm not sure about changing the accepted answer, but I have one idea, I'll update content in the question with providing a winning solution separately for a small and big data on input.
Colonel Panic almost 9 years

Neat. This is the dynamic programming algorithm I hinted to.
Nas Banov over 7 years

so if i call it with most_populated([(-10025, -10000), (1960, 2010)]), for 2 people it will do >10200 iterations? This is terrible!
Machavity almost 7 years

A link to a potential solution is always welcome, but please add context around the link so your fellow users will have some idea what it is and why it’s there. Always quote the most relevant part of an important link, in case the target site is unreachable or goes permanently offline. Take into account that being barely more than a link to an external site is a possible reason as to Why and how are some answers deleted?.
Mr. T about 6 years

This is a Java answer, the OP asked for Python solutions. Please read How do I write a good answer?
Hadh almost 6 years

I don't get the use of the second if statement, why are you adding +1 to the year of death? The first if statement is clear though.
Chris over 2 years

Welcome to Stack Overflow. Code dumps without any explanation are rarely helpful. Stack Overflow is about learning, not providing snippets to blindly copy and paste. Please edit your question and explain how it answers the specific question being asked. See How to Answer. This is particularly important when answering old questions (this one is over six years old) with existing answers.