Grab part of filename with Python


Solution 1

Here's a simple solution using the re module as mentioned in other answers.

# Libraries
import re

# Example filenames. Use glob.glob("*.pdf") to collect your own PDF filenames
file_list = ['name_ID_123.pdf', 'name2_ID_456.pdf']

for fname in file_list:
    # Raw string, with the dot escaped so it only matches a literal "."
    res = re.findall(r"ID_(\d+)\.pdf", fname)
    if not res:
        continue
    print(res[0])  # You can append the result to a list instead

And below should be your output. You should be able to adapt this to other patterns.

# Output
123
456
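
Since the question also asks about writing the results to a text file, here is a minimal sketch of how the loop above might be wrapped up for that. The names extract_ids, write_ids and the output file ids.txt are hypothetical, chosen only for illustration, and the pattern assumes every file of interest ends in ID_<digits>.pdf.

```python
import glob
import re

# Assumed pattern: "ID_", then digits, then a literal ".pdf" at the end of the name
ID_PATTERN = re.compile(r"ID_(\d+)\.pdf$")


def extract_ids(filenames):
    """Return the digits after 'ID_' for every filename that matches."""
    ids = []
    for fname in filenames:
        match = ID_PATTERN.search(fname)
        if match:
            ids.append(match.group(1))
    return ids


def write_ids(ids, out_path):
    """Write one ID per line to a text file (out_path is a hypothetical name)."""
    with open(out_path, "w") as handle:
        for id_ in ids:
            handle.write(id_ + "\n")


print(extract_ids(["name_ID_123.pdf", "name2_ID_456.pdf", "notes.txt"]))

# For real files in the current directory, something like:
# write_ids(extract_ids(glob.glob("*.pdf")), "ids.txt")
```

Filenames that do not match are skipped silently, which mirrors the "if not res: continue" check above.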

Good luck!

Solution 2

Here's another alternative, using re.split(), which is probably closer to the spirit of exactly what you're trying to do (although solutions with re.match() and re.search(), among others, are just as valid, useful, and instructive):

>>> import re
>>> re.split("[_.]", "dddddd_ID_4421.pdf")[-2]
'4421'
>>> 
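
As a rough sketch, the same split can be applied to all four filenames from the question. The helper name id_by_split is made up for this example. The approach assumes the ID is always the second-to-last piece once the name is split on underscores and periods, so it is worth knowing where that assumption breaks.

```python
import re


def id_by_split(filename):
    """Split on '_' and '.', then take the piece just before the extension."""
    return re.split(r"[_.]", filename)[-2]


names = [
    "aaa_ID_8423.pdf",
    "bbbb_ID_8852.pdf",
    "ccccc_ID_7413.pdf",
    "dddddd_ID_4421.pdf",
]
print([id_by_split(name) for name in names])

# Caveat: a double extension shifts the pieces, so the ID is no longer at [-2]
print(id_by_split("archive_ID_12.tar.gz"))
```

The first print shows the four IDs; the second shows 'tar', because the extra period adds one more piece to the split.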

Solution 3

If the numbers are of variable length, you'll want the regular expression module, re:

import re

# create and compile a regex pattern
pattern = re.compile(r"_([0-9]+)\.[^\.]+$")

pattern.search("abc_ID_8423.pdf").group(1)
Out[23]: '8423'

Regular expressions are generally used to match variable strings. The pattern I just wrote says:

Find an underscore ("_"), followed by one or more digits ("[0-9]+"), followed by the last period in the string and the extension after it ("\.[^\.]+$").
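
One way to keep that explanation next to the pattern itself is the re.VERBOSE flag, which lets the regex carry inline comments. This is only a sketch of the same pattern written out piece by piece; the test strings are examples, not files from the question.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE so each part is commented
pattern = re.compile(
    r"""
    _           # a literal underscore
    ([0-9]+)    # group 1: one or more digits
    \.          # a literal period
    [^.]+       # the extension: one or more characters that are not a period
    $           # end of the string
    """,
    re.VERBOSE,
)

match = pattern.search("abc_ID_8423.pdf")
print(match.group(1))

# search() returns None when nothing matches, so check before calling group()
print(pattern.search("no_digits.pdf"))
```

The first print shows 8423 and the second shows None.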

Solution 4

You can use the os module in Python and call os.listdir to get a list of the filenames present in a given path, like so:

import os
filenames = os.listdir(path)

Now you can iterate over the filenames list and look for the pattern which you need using regular expressions:

import re
for filename in filenames:
    m = re.search(r'(?<=ID_)\w+', filename)
    if m:
        print(m.group(0))

The above snippet prints the part of the filename that follows ID_. Because \w does not match a period, the match stops before the extension, so for your example it prints 4421, 8423, etc. Note that printing m itself would show a match object rather than the matched text, which is why the snippet calls m.group(0).
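
Putting the two steps together, a self-contained sketch might look like the following. The function name ids_in_directory is hypothetical, and the pattern is tightened (as an assumption about the filenames) to digits that are followed by ".pdf" at the end of the name. A temporary directory with empty files stands in for a real folder so the example can run anywhere.

```python
import os
import re
import tempfile

# Lookbehind for "ID_", digits only, lookahead for ".pdf" at the end of the name
ID_RE = re.compile(r"(?<=ID_)\d+(?=\.pdf$)")


def ids_in_directory(path):
    """Return the ID of every matching file in path, in sorted filename order."""
    ids = []
    for filename in sorted(os.listdir(path)):
        m = ID_RE.search(filename)
        if m:  # search() returns None for names that do not match
            ids.append(m.group(0))
    return ids


# Demo: create a few empty files in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    for name in ["aaa_ID_8423.pdf", "bbbb_ID_8852.pdf", "readme.txt"]:
        open(os.path.join(tmp, name), "w").close()
    print(ids_in_directory(tmp))
```

Sorting the listing is a design choice here: os.listdir returns names in arbitrary order, so sorting keeps the output stable between runs.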

Author: P A N

Updated on June 04, 2022

Comments

  • P A N
    P A N almost 2 years

    Newbie here.

    I've been working with Python/coding for just a few days, but I want to create a script that grabs the parts of filenames that correspond to a certain pattern and outputs them to a text file.

    So in my case, let's say I have four .pdf files like this:

    aaa_ID_8423.pdf
    bbbb_ID_8852.pdf
    ccccc_ID_7413.pdf
    dddddd_ID_4421.pdf
    
    (Note that they are of variable length.)
    

    I want the script to go through these filenames and grab the string after "ID_" and before the filename extension.

    Can you point me toward Python modules, and possibly guides, that could assist me?

    • LampPost
      LampPost almost 9 years
      If those are the only numbers in the string, you can use this (the library is re): ID = re.findall(r"[0-9]+", stringname)
  • Al Wang
    Al Wang almost 9 years
    To elaborate on this, please take a look at the Regular Expressions library documentation at docs.python.org/2/library/re.html. There are also a couple of regular expression cheat sheets floating around the web, including debuggex.com/cheatsheet/regex/python, which explain how KCzar's program works.
  • P A N
    P A N almost 9 years
    Hi, thanks for your answer. I have tried this with an actual file and get this response: <_sre.SRE_Match object at 0x10d10aac0> It seems like it's finding the ID_ at a location, but can't get it to output the string. Any ideas what I'm doing wrong?
  • KCzar
    KCzar almost 9 years
    why put the import inside the for loop?
  • suripoori
    suripoori almost 9 years
    Because I wasn't paying attention when I wrote the answer ;). Thanks for the correction. Editing it.