Find the year with the most number of people alive in Python
Solution 1
>>> from collections import Counter
>>> from itertools import chain
>>> def most_pop(pop):
... pop_flat = chain.from_iterable(range(i,j+1) for i,j in pop)
... return Counter(pop_flat).most_common()
...
>>> most_pop([(1920, 1939), (1911, 1944), (1920, 1955), (1938, 1939)])[0]
Solution 2
Another solution I just though of:
- Create 2 tables,
birthdates
anddeathdates
. - Accumulate birth dates and death dates in those tables.
- Browse those tables to accumulate the number of alive people at the time.
Grand total complexity is O(n)
Implementation
from collections import Counter
def most_populated(population, single=True):
birth = map(lambda x: x[0], population)
death = map(lambda x: x[1] + 1, population)
b = Counter(birth)
d = Counter(death)
alive = 0
years = {}
for year in range(min(birth), max(death) + 1):
alive = alive + b[year] - d[year]
years[year] = alive
return max(years, key=years.get) if single else \
[key for key, val in years.iteritems() if val == max(years.values())]
Better
from collections import Counter
from itertools import accumulate
import operator
def most_populated(population, single=True):
delta = Counter(x[0] for x in population)
delta.subtract(Counter(x[1]+1 for x in population))
start, end = min(delta.keys()), max(delta.keys())
years = list(accumulate(delta[year] for year in range(start, end)))
return max(enumerate(years), key=operator.itemgetter(1))[0] + start if single else \
[i + start for i, val in enumerate(years) if val == max(years)]
Solution 3
I would go like this:
- Sort persons by birth year (
unborn
list) - Starting from the first born
- Put that person in the
alive
list - Using an insertion sort by date of death (the list stays sorted, so use a binary search)
- Until you reach a person that was not born that year
- Put that person in the
- Then, starting from the person in the
alive
list that dies first, remove it from the list. - Put the size of the
alive
list in a dict - Increment the year
- Loop until the
unborn
andalive
lists are empty
Complexity should be around O((m + n) * log(m))
(each year is considered only once, and each person only twice, multiplied by the insertion cost in the alive
list)
Implementation
from bisect import insort
def most_populated(population, single=True):
years = dict()
unborn = sorted(population, key=lambda x: -x[0])
alive = []
dead = []
for year in range(unborn[-1][0], max(population, key=lambda x: x[1])[1] + 1):
while unborn and unborn[-1][0] == year:
insort(alive, -unborn.pop()[1])
while alive and alive[-1] == -(year - 1):
dead.append(-alive.pop())
years[year] = len(alive)
return max(years, key=years.get) if single else \
[key for key, val in years.iteritems() if val == max(years.values())]
Solution 4
We can also use numpy slicing, which is quite neat, and should also be quite efficient:
import numpy as np
from collections import namedtuple
Person = namedtuple('Person', ('birth', 'death'))
people = [Person(1900,2000), Person(1950,1960), Person(1955, 1959)]
START_YEAR = 1900
END_YEAR = 2000
people_alive = np.zeros(END_YEAR - START_YEAR + 1) # Alive each year
for p in people:
a = p.birth - START_YEAR
b = p.death - START_YEAR + 1 # include year of death
people_alive[a:b] += 1
# Find indexes of maximum aliveness and convert to year
most_alive = np.flatnonzero(people_alive == people_alive.max()) + START_YEAR
EDIT It seems like the namedtuple adds a bit of overhead, so to speed up a bit more, remove the namedtuple and do
for birth, death in people:
instead.
Solution 5
- Just put the birth and death years into a dict. If it is birth, increase the value by 1. or vice versa.
- Sort the dict by keys and iterate by reading the current number of the alive people.
-
Follow the 'maxAlive' an 'theYear' to get the first year with the highest number
years = {} for p in people: if p.birth in years: years[p.birth] += 1 else: years[p.birth] = 1 if p.death in years: years[p.death] -= 1 else: years[p.death] = -1 alive = 0 maxAlive = 0 theYear = people[0].birth for year in sorted(years): alive += years[year] if alive > maxAlive: maxAlive = alive theYear = year
Related videos on Youtube
oski86
I graduated from Polish-Japanese Academy of Information Technology with a MSc Eng. degree in computer science (major databases). After 3 years of commercial work I returned to my university to become a PhD student in the field of social informatics. My work is connected with problems in a category of web intelligence, especially swarm creativity and collaborative innovation networks (COINs). I specialize in analysing quality of work in Open Source Software (OSS) repositories on the GitHub portal. My hobbies are Kendo, reading about successful startups, and travelling to Southeast Asia. Currently I work for company called TogetherData as a Data scientist & data engineer.
Updated on October 20, 2022Comments
-
oski86 over 1 year
Given a list of people with their birth and end years (all between
1900
and2000
), find the year with the most number of people alive.Here is my somewhat brute-force solution:
def most_populated(population, single=True): years = dict() for person in population: for year in xrange(person[0], person[1]): if year in years: years[year] += 1 else: years[year] = 0 return max(years, key=years.get) if single else \ [key for key, val in years.iteritems() if val == max(years.values())] print most_populated([(1920, 1939), (1911, 1944), (1920, 1955), (1938, 1939)]) print most_populated([(1920, 1939), (1911, 1944), (1920, 1955), (1938, 1939), (1937, 1940)], False)
I'm trying to find a more efficient way to solve this problem in
Python
. Both -readability
andefficiency
counts. Moreover, for some reason my code won't print[1938, 1939]
while it should.Update
Input is a
list
of tuples, where first element of a tuple is ayear
when person was born, and second element of atuple
is the year of death.Update 2
End year (2nd part of tuple) counts as well as a year of the person being alive (so If the person dies in
Sept 1939
(we don't care about the month), he is actually alive in 1939, at least part of it). That should fix the 1939' missing in results.Best solution?
While readability counts in favor of @joran-beasley, for bigger input most efficient algorithm was provided by @njzk2. Thanks @hannes-ovrén for providing analysis in IPython notebook on Gist
-
njzk2 almost 9 years
years[year] = 1
for starters -
Greg Hilston almost 9 years@njzk2 Technically I think you're right from a readability standpoint, but every years[n] will be initialized to 0 and will still yield the "most_populated" year correctly. If the count was bring returned, I'd agree with you but only the year is being returned.
-
DSM almost 9 yearsLet's say that you have 10 people born in 2000 and they all die in 2080. Then there are another 10 who are born in 2080 and they die in 2160. Do you want the maximum people alive to be 10 or 20 or 15 or something else? What is your assumption about what's going on in 2080?
-
Colonel Panic almost 9 yearsFor a harder problem playing with some of the same ideas, see Google Code Jam's The Great Wall code.google.com/codejam/contest/2437488/dashboard#s=p2
-
-
oski86 almost 9 yearsThanks, looks very neat. How can I change the code to count the 2nd part of tuple as well ? (death year should also count)
-
Mazdak almost 9 years
xrange(i,j+1) for i,j in pop
in python2 andrange(i,j+1) for i,j in pop
is better! -
DSM almost 9 yearsThis will be very slow for large populations because of
insort
. -
njzk2 almost 9 years@DSM: insort is
O(n)
, which makes my initial complexity computation wrong, but nontheless this is equivalent in complexity to considering each year of each person. -
njzk2 almost 9 yearsyou really think
people_alive[a:b] += 1
isO(1)
? -
Hannes Ovrén almost 9 yearsThat is incredibly true. Not sure what I was thinking. :)
-
njzk2 almost 9 yearsusing a table instead of a Counter, we could use itertools.accumulate to compute the values for each year in a single call
-
DSM almost 9 years@nkjz2: I think I'd treat the input data as an event queue, with births being (t, 1) and deaths being (t, -1). You can sort that in N log N, and then simply loop over each event either adding or decreasing to the total population, tracking whatever you want.
-
oski86 almost 9 yearsAccepted due to best timeit test: gist.github.com/oskar-j/308aafbb7ae44df7fed1 , result: joran-beasley: 0.0668609038671
-
njzk2 almost 9 years@DSM: yep, I realized something like that can be done. Also, since it is a relatively known set of dates, you can use a bucket sort (
O(n)
), or a Counter, like what I started in another answer. -
oski86 almost 9 yearsTimeit 10000 times seems to return a constant execution time, probably with bigger test data on the input it may work quite fine, but it needs further tests. +1 for using the numpy
-
njzk2 almost 9 yearsI like that answer because it is short, clear and readable, but beware of the complexity, if you have a more significant number of people or years to handle.
-
Hannes Ovrén almost 9 yearsConsidering that the maximum length of the range
a:b
is (assuming normal humans) in the order of 100, you could consider it to be O(1). -
Hannes Ovrén almost 9 years@oski86, Love the benchmark. It would be interesting to see results for different sizes of the input list as well. I would guess that my numpy version suffers for small N due to the construction of the
people_alive
array. For the record, I do think this answer should be the accepted one. I like it :) -
Hannes Ovrén almost 9 yearsActually, I take that back. This algorithm is by far the slowest of the ones tried so far. It just happens to be fast for really small population sizes. For larger populations njzk2_var2 is fastest. See benchmark plot here: gist.github.com/hovren/bf5335ba45cd7eee811e the hannes_v2 algorithm is exactly the same as my old one, but without the namedtuple, since that added a bit of overhead.
-
oski86 almost 9 yearsThat's what I wanted to try yesterday (provide bigger rand test data), but it where already wee hours. I'm not sure about changing the accepted answer, but I have one idea, I'll update content in the question with providing a winning solution separately for a small and big data on input.
-
Colonel Panic almost 9 yearsNeat. This is the dynamic programming algorithm I hinted to.
-
Nas Banov over 7 yearsso if i call it with
most_populated([(-10025, -10000), (1960, 2010)])
, for 2 people it will do >10200 iterations? This is terrible! -
Machavity almost 7 yearsA link to a potential solution is always welcome, but please add context around the link so your fellow users will have some idea what it is and why it’s there. Always quote the most relevant part of an important link, in case the target site is unreachable or goes permanently offline. Take into account that being barely more than a link to an external site is a possible reason as to Why and how are some answers deleted?.
-
Mr. T about 6 yearsThis is a Java answer, the OP asked for Python solutions. Please read How do I write a good answer?
-
Hadh almost 6 yearsI don't get the use of the second if statement, why are you adding +1 to the year of death? The first if statement is clear though.
-
Chris over 2 yearsWelcome to Stack Overflow. Code dumps without any explanation are rarely helpful. Stack Overflow is about learning, not providing snippets to blindly copy and paste. Please edit your question and explain how it answers the specific question being asked. See How to Answer. This is particularly important when answering old questions (this one is over six years old) with existing answers.