Grab part of filename with Python
Solution 1
Here's a simple solution using the re module as mentioned in other answers.
# Libraries
import re
# Example filenames. Swap in glob.glob("*.pdf") (after "import glob") to grab your own pdf filenames
file_list = ['name_ID_123.pdf', 'name2_ID_456.pdf']  # glob.glob("*.pdf")
for fname in file_list:
    # Capture the digits between "ID_" and the literal ".pdf" extension
    res = re.findall(r"ID_(\d+)\.pdf", fname)
    if not res:
        continue
    print(res[0])  # You can append the result to a list
And below should be your output. You should be able to adapt this to other patterns.
# Output
123
456
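If you would rather collect the IDs than print them, the loop can append each match to a list. This is a minimal sketch, assuming the same two example filenames plus one made-up non-matching name; the helper name extract_ids is invented for illustration.

```python
import re

# Compile once; the dot is escaped so it matches only a literal "."
ID_PATTERN = re.compile(r"ID_(\d+)\.pdf$")

def extract_ids(file_list):
    """Return the digits between 'ID_' and '.pdf' for each matching name."""
    ids = []
    for fname in file_list:
        match = ID_PATTERN.search(fname)
        if match:  # skip names that do not follow the pattern
            ids.append(match.group(1))
    return ids

# 'notes.txt' is a hypothetical file that should be ignored
ids = extract_ids(['name_ID_123.pdf', 'name2_ID_456.pdf', 'notes.txt'])
print(ids)  # ['123', '456']
```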
Good luck!
Solution 2
Here's another alternative, using re.split(), which is probably closer to the spirit of exactly what you're trying to do (although solutions with re.match() and re.search(), among others, are just as valid, useful, and instructive):
>>> import re
>>> re.split("[_.]", "dddddd_ID_4421.pdf")[-2]
'4421'
>>>
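It can help to look at the whole list that re.split() returns before indexing into it. The sketch below also shows one limitation: the [-2] index assumes the ID is the last field before a single extension. The second filename is a made-up example, not one from the question.

```python
import re

parts = re.split(r"[_.]", "dddddd_ID_4421.pdf")
print(parts)       # ['dddddd', 'ID', '4421', 'pdf']
print(parts[-2])   # 4421

# Hypothetical name with an extra dot: [-2] now picks up "v2", not the ID
tricky = re.split(r"[_.]", "dddddd_ID_4421.v2.pdf")
print(tricky[-2])  # v2
```

If your filenames can contain extra dots or trailing fields, a pattern anchored on "ID_" is likely the safer choice.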
Solution 3
If the numbers vary in length, you'll want the regex module, "re":
import re
# create and compile a regex pattern
pattern = re.compile(r"_([0-9]+)\.[^\.]+$")
pattern.search("abc_ID_8423.pdf").group(1)
Out[23]: '8423'
Regex is generally used to match variable strings. The regex I just wrote says:
Find an underscore ("_"), followed by one or more digits ("[0-9]+", captured by the parentheses), followed by the last period in the string and the extension after it ("\.[^\.]+$")
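As a rough illustration, here is the same pattern applied to the four filenames from the question. The extra name "README" is a made-up entry, added only to show what happens when nothing matches.

```python
import re

# Underscore, one or more digits (captured), then a dot-extension at the end
pattern = re.compile(r"_([0-9]+)\.[^.]+$")

filenames = [
    "aaa_ID_8423.pdf",
    "bbbb_ID_8852.pdf",
    "ccccc_ID_7413.pdf",
    "dddddd_ID_4421.pdf",
    "README",  # hypothetical non-matching name
]

for name in filenames:
    match = pattern.search(name)
    # search() returns None when nothing matches, so check before calling group()
    print(name, "->", match.group(1) if match else "no match")
```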
Solution 4
You can use the os module in Python and call os.listdir() to get a list of the filenames in a directory, like so:
import os
path = "."  # directory that holds your PDFs; "." is only a placeholder
filenames = os.listdir(path)
Now you can iterate over the filenames list and look for the pattern which you need using regular expressions:
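import re
for filename in filenames:
    m = re.search(r'(?<=ID_)\w+', filename)
    if m:  # re.search() returns None when there is no match
        print(m.group(0))  # group(0) is the matched text; print(m) would show the match object itself
The above snippet finds the word characters that follow ID_ and prints them. Because \w does not match a period, the match stops before the extension, so for your example it prints 4421, 8423, etc. rather than 4421.pdf.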
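Putting the two steps together, and writing the result to a text file as the question asks, might look like the sketch below. The function name write_ids, the output name ids.txt, and the throwaway demo directory are all assumptions made for illustration. The pattern uses a lookahead so that only digits directly before ".pdf" are kept.

```python
import os
import re
import tempfile

def write_ids(folder, out_path):
    """Write the ID from each '<name>_ID_<digits>.pdf' in folder to out_path, one per line."""
    ids = []
    for filename in sorted(os.listdir(folder)):
        # Digits preceded by "ID_" and followed by ".pdf" at the end of the name
        m = re.search(r"(?<=ID_)\d+(?=\.pdf$)", filename)
        if m:
            ids.append(m.group(0))
    with open(out_path, "w") as out:
        for file_id in ids:
            out.write(file_id + "\n")
    return ids

# Demo in a temporary directory with empty stand-in files
with tempfile.TemporaryDirectory() as folder:
    for name in ["aaa_ID_8423.pdf", "bbbb_ID_8852.pdf", "notes.txt"]:
        open(os.path.join(folder, name), "w").close()
    result = write_ids(folder, os.path.join(folder, "ids.txt"))
    print(result)  # ['8423', '8852']
```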
P A N
Updated on June 04, 2022
Comments
-
P A N almost 2 years
Newbie here.
I've just been working with Python/coding for a few days, but I want to create a script that grabs the parts of filenames that correspond to a certain pattern and outputs them to a text file.
So in my case, let's say I have four .pdf files like this:
aaa_ID_8423.pdf
bbbb_ID_8852.pdf
ccccc_ID_7413.pdf
dddddd_ID_4421.pdf
(Note that they are of variable length.)
I want the script to go through these filenames, grab the string after "ID_" and before the filename extension.
Can you point me toward the Python modules, and possibly guides, that could assist me?
-
LampPost almost 9 years
If those are the only numbers in the string, you can use this (oh, and the library is re): ID = re.findall(r"[0-9]+", stringname)
-
Al Wang almost 9 years
To elaborate on this, please take a look at the Regular Expressions library documentation at docs.python.org/2/library/re.html. There are also a couple of regular expression cheat sheets floating around the web, including debuggex.com/cheatsheet/regex/python, which explain how KCzar's program works.
-
P A N almost 9 years
Hi, thanks for your answer. I have tried this with an actual file and get this response:
<_sre.SRE_Match object at 0x10d10aac0>
It seems like it's finding the ID_ at a location, but I can't get it to output the string. Any ideas what I'm doing wrong?
-
KCzar almost 9 years
Why put the import inside the for loop?
-
suripoori almost 9 years
Because I wasn't paying attention when I wrote the answer ;). Thanks for the correction. Editing it.