Extracting date from a string in Python

Solution 1

If the date is given in a fixed form, you can simply use a regular expression to extract the date and "datetime.datetime.strptime" to parse the date:

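import re
from datetime import datetime

text = "monkey 2010-07-10 love banana"  # example input from the question
match = re.search(r'\d{4}-\d{2}-\d{2}', text)
# re.search returns None when no date-like substring is present
date = datetime.strptime(match.group(), '%Y-%m-%d').date() if match else None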

Otherwise, if the date is given in an arbitrary form, you can't extract it easily.
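If the text may contain several fixed-form dates, or digit runs that only look like dates, the same regex-plus-strptime idea can be extended. The following is a minimal sketch, not part of the original answer; the helper name extract_iso_dates is made up for illustration, and it assumes the YYYY-MM-DD form only.

```python
import re
from datetime import datetime

ISO_DATE = re.compile(r'\d{4}-\d{2}-\d{2}')

def extract_iso_dates(text):
    """Return every valid YYYY-MM-DD date found in text, in order of appearance."""
    dates = []
    for match in ISO_DATE.finditer(text):
        try:
            dates.append(datetime.strptime(match.group(), '%Y-%m-%d').date())
        except ValueError:
            # The digits matched the pattern but are not a real date, e.g. 2010-07-32.
            continue
    return dates

print(extract_iso_dates("monkey 2010-07-10 love 2010-07-32 banana 2011-01-02"))
# → [datetime.date(2010, 7, 10), datetime.date(2011, 1, 2)]
```

Letting strptime do the validation keeps the regex simple: the pattern only finds candidates, and anything that is not a real calendar date is skipped.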

Solution 2

Using python-dateutil:

In [1]: import dateutil.parser as dparser

In [18]: dparser.parse("monkey 2010-07-10 love banana",fuzzy=True)
Out[18]: datetime.datetime(2010, 7, 10, 0, 0)

Invalid dates raise a ValueError:

In [19]: dparser.parse("monkey 2010-07-32 love banana",fuzzy=True)
# ValueError: day is out of range for month

It can recognize dates in many formats:

In [20]: dparser.parse("monkey 20/01/1980 love banana",fuzzy=True)
Out[20]: datetime.datetime(1980, 1, 20, 0, 0)

Note that it makes a guess if the date is ambiguous:

In [23]: dparser.parse("monkey 10/01/1980 love banana",fuzzy=True)
Out[23]: datetime.datetime(1980, 10, 1, 0, 0)

But the way it parses ambiguous dates is customizable:

In [21]: dparser.parse("monkey 10/01/1980 love banana",fuzzy=True, dayfirst=True)
Out[21]: datetime.datetime(1980, 1, 10, 0, 0)
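Because parse raises ValueError for invalid dates, calling code usually needs a guard. Here is a minimal sketch of one way to do that, assuming python-dateutil is installed; try_parse is a hypothetical helper name, not part of dateutil.

```python
from dateutil import parser as dparser

def try_parse(text, **kwargs):
    """Fuzzy-parse text; return None instead of raising when parsing fails."""
    try:
        return dparser.parse(text, fuzzy=True, **kwargs)
    except (ValueError, OverflowError):
        # Invalid dates (and strings without a date) raise ValueError or a subclass of it.
        return None

print(try_parse("monkey 2010-07-10 love banana"))
print(try_parse("monkey 2010-07-32 love banana"))
print(try_parse("monkey 10/01/1980 love banana", dayfirst=True))
```

Extra keyword arguments such as dayfirst=True are passed straight through to parse, so the ambiguity handling shown above still applies.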

Solution 3

For extracting a date from a string in Python, the best module available is datefinder.

You can use it in your Python project by following the easy steps given below.

Step 1: Install datefinder Package

pip install datefinder

Step 2: Use It In Your Project

import datefinder

input_string = "monkey 2010-07-10 love banana"
# find_dates returns a generator; it is converted to a list here. See the note of caution below.
matches = list(datefinder.find_dates(input_string))

if matches:
    # Each match is a datetime.datetime object; only the first match is used here.
    date = matches[0]
    print(date)
else:
    print('No dates found')

Note: if you expect a large number of matches, converting the generator to a list is not recommended, because it adds a significant performance overhead.
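One way to avoid building the full list is to consume the generator lazily. The sketch below shows the pattern with a stand-in generator so that it runs without datefinder installed; find_iso_dates is a hypothetical helper that only recognises YYYY-MM-DD, not datefinder's API.

```python
import re
from datetime import datetime
from itertools import islice

def find_iso_dates(text):
    """Stand-in generator: lazily yields a datetime for each YYYY-MM-DD substring."""
    for match in re.finditer(r'\d{4}-\d{2}-\d{2}', text):
        try:
            yield datetime.strptime(match.group(), '%Y-%m-%d')
        except ValueError:
            continue  # looks like a date but is not a valid one

text = "monkey 2010-07-10 love 2011-01-02 banana 2012-03-04"

# First match only: the generator stops as soon as one date is produced.
first = next(find_iso_dates(text), None)

# First two matches, still without materialising every match.
first_two = list(islice(find_iso_dates(text), 2))

print(first)
print(first_two)
```

Since datefinder.find_dates also returns a generator, the same next(..., None) and islice calls should work on datefinder.find_dates(input_string) directly.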

Solution 4

Using Pygrok, you can define abstracted extensions to the Regular Expression syntax.

The custom patterns can be included in your regex in the format %{PATTERN_NAME}.

You can also create a label for that pattern by separating it with a colon: %{PATTERN_NAME:matched_string}. If the pattern matches, the value will be returned as part of the resulting dictionary (e.g. result.get('matched_string')).

For example:

from pygrok import Grok

input_string = 'monkey 2010-07-10 love banana'
date_pattern = '%{YEAR:year}-%{MONTHNUM:month}-%{MONTHDAY:day}'

grok = Grok(date_pattern)
print(grok.match(input_string))

The resulting value will be a dictionary:

{'month': '07', 'day': '10', 'year': '2010'}

If the date_pattern does not exist in the input_string, the return value will be None. By contrast, if your pattern does not have any labels, it will return an empty dictionary {}
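The dictionary holds strings, so one more step is needed to get an actual date object. This is a minimal sketch using only the standard library; grok_result_to_date is a hypothetical helper name, and it assumes the labels year, month and day from the pattern above.

```python
from datetime import date

def grok_result_to_date(result):
    """Convert a {'year': ..., 'month': ..., 'day': ...} match dict to a date.

    Returns None for a failed match (None) or an unlabeled match ({}).
    """
    if not result:
        return None
    return date(int(result['year']), int(result['month']), int(result['day']))

print(grok_result_to_date({'month': '07', 'day': '10', 'year': '2010'}))  # → 2010-07-10
print(grok_result_to_date(None))  # → None
```

The date constructor raises ValueError for out-of-range values, so an impossible date still surfaces as an error rather than passing silently.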

Solution 5

Hands Down The Best Ways

Two good modules, both available on PyPI and GitHub, make this task easier:

  1. DATEFINDER module, useful for finding dates in strings of text.

Installation: pip install datefinder

EXAMPLE

import datefinder

input_string = "monkey 2010-07-10 love banana"
# find_dates returns a generator; it is converted to a list here.
matches = list(datefinder.find_dates(input_string))

if matches:
    # Each match is a datetime.datetime object; only the first match is used here.
    date = matches[0]
    print(date)
else:
    print('No dates found')

SOURCE: Finny Abraham

  2. DATEPARSER, extremely useful for scraping dates from HTML files in different languages. It supports more than 200 language locales in different formats, as well as the Hijri and Jalali calendars.

Features

  • Generic parsing of dates in over 200 language locales plus numerous formats in a language agnostic fashion.
  • Generic parsing of relative dates like: '1 min ago', '2 weeks ago', '3 months, 1 week and 1 day ago', 'in 2 days', 'tomorrow'.

Advanced Features

  • Generic parsing of dates with time zone abbreviations or UTC offsets like: 'August 14, 2015 EST', 'July 4, 2013 PST', '21 July 2013 10:15 pm +0500'.
  • Date lookup in longer texts.
  • Support for non-Gregorian calendar systems.
  • Extensive test coverage.

SOURCE CODE [Example]

>>> from dateparser import parse
>>> parse('1 hour ago')
datetime.datetime(2015, 5, 31, 23, 0)
>>> parse('Il ya 2 heures')  # French (2 hours ago)
datetime.datetime(2015, 5, 31, 22, 0)
>>> parse('1 anno 2 mesi')  # Italian (1 year 2 months)
datetime.datetime(2014, 4, 1, 0, 0)
>>> parse('yaklaşık 23 saat önce')  # Turkish (23 hours ago)
datetime.datetime(2015, 5, 31, 1, 0)
>>> parse('Hace una semana')  # Spanish (a week ago)
datetime.datetime(2015, 5, 25, 0, 0)
>>> parse('2小时前')  # Chinese (2 hours ago)
datetime.datetime(2015, 5, 31, 22, 0)

Author: dmpop

Updated on June 17, 2021

Comments

  • dmpop almost 3 years
    How can I extract the date from a string like "monkey 2010-07-10 love banana"? Thanks!
  • Hamish Grubijan almost 14 years
    What if it is in European format, such as 20/01/1980 meaning "Jan 20 1980"? What if months/days/years fall outside of a reasonable range?
  • unutbu almost 14 years
    @Hamish: If there are two dates (as in the case of "monkey 10/01/1980 love 7/10/2010 banana"), it may raise a ValueError, or (as in the case of "monkey 10/01/1980 love 2010-07-10 banana") it may misinterpret the second date as denoting hours, minutes, seconds or timezone. fuzzy=True gives it license to guess.
  • saravanan almost 12 years
    @unutbu str = "By flufie · October 14, 2010 at 11:22 pm · 26 replies" By using dateutil I am getting "ValueError: hour must be in 0..23"
  • alvas about 9 years
    What happens if there is more than one date in the text?
  • unutbu about 9 years
    @alvas: The parse function may raise an exception (even if fuzzy=True), or with fuzzy=True, it may return the first date or a mish-mash composed of parts of both dates. So really, parse should only be called on a string containing one date.
  • Kailegh over 6 years
    Is it possible to get the indices of the characters that form the date when using fuzzy?
  • unutbu over 6 years
    @Kailegh: Yes, it would be possible to deduce the indices using fuzzy_with_tokens=True. If you'd like more clarification, please start a new question.
  • vishal almost 6 years
    @lunaryorn In the first statement, does "re" refer to the string where we are searching for our desired pattern?
  • lunaryorn almost 6 years
    @vishal.k It refers to the built-in re module, i.e., import re.
  • CpILL almost 6 years
    I found that datefinder handled ambiguous date matching better than python-dateutil, returning only two possible dates from a random medium.com blog post as opposed to five. Not sure how it handles different locales, however...
  • Peter.k about 5 years
    It matches single numbers!
  • dankal444 almost 5 years
    In case someone else made the same mistake: you need to use "from datetime import datetime" instead of "import datetime".
  • Narahari B M over 4 years
    This is pretty good, except it somehow doesn't work when there is a colon (:) before the date string:
    string = "Assessment Date: 17-May-2017 at 13:31"
    list(datefinder.find_dates(string.lower()))  # []
    string = "Assessment Date 17-May-2017 at 13:31"
    list(datefinder.find_dates(string.lower()))  # [datetime.datetime(2017, 5, 17, 13, 31)]
  • Jay Jung over 4 years
    Agree that datefinder is heaps better than dateparser for ambiguous text.
  • Admin over 3 years
    This lib is very Python 2.