Getting file extension using pattern matching in Python
Solution 1
>>> import re
>>> print(re.compile(r'^.*[.](?P<ext>tar\.gz|tar\.bz2|\w+)$').match('a.tar.gz').group('ext'))
gz
>>> print(re.compile(r'^.*?[.](?P<ext>tar\.gz|tar\.bz2|\w+)$').match('a.tar.gz').group('ext'))
tar.gz
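The ? after .* makes the quantifier non-greedy: it matches as few characters as possible. So instead of .* consuming "a.tar" and leaving only "gz" for the group, .*? stops at the first position that lets the rest of the pattern match, which allows tar.gz to be captured.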
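To check how the non-greedy pattern behaves on more than one input, a small sketch along these lines may help. The helper name get_ext and the sample file names are made up for illustration; only the pattern itself comes from the answer above.

```python
import re

# Non-greedy prefix (.*?) so the compound alternatives get a chance to match.
EXT_RE = re.compile(r'^.*?[.](?P<ext>tar\.gz|tar\.bz2|\w+)$')


def get_ext(filename):
    """Return the extension without its leading dot, or None if there is none.

    get_ext is a hypothetical helper name, not part of any library.
    """
    match = EXT_RE.match(filename)
    return match.group('ext') if match else None


# Sample names chosen for illustration only.
for name in ['a.tar.gz', 'a.tar.bz2', 'a.gz', 'photo.jpeg', 'README']:
    print(name, '->', get_ext(name))
```

Names without any dot, such as README, do not match the pattern at all, so the helper returns None for them rather than raising an AttributeError.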
Solution 2
import os

root, ext = os.path.splitext('a.tar.gz')
if ext in ['.gz', '.bz2']:
    ext = os.path.splitext(root)[1] + ext
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
Solution 3
Starting from phihag's answer:
import os

DOUBLE_EXTENSIONS = ['tar.gz', 'tar.bz2']  # Add extra extensions where desired.

def guess_extension(filename):
    """
    Guess the extension of the given filename.
    """
    root, ext = os.path.splitext(filename)
    if any(filename.endswith(x) for x in DOUBLE_EXTENSIONS):
        # Split once more so the inner extension (e.g. '.tar') is kept too.
        root, first_ext = os.path.splitext(root)
        ext = first_ext + ext
    return root, ext
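A variant of the same idea, sketched under a few assumptions: the compound extensions are stored with their leading dot, matching is case-insensitive, and the name split_extension is made up here. Storing the leading dot means a name such as guitar.gz is never checked against tar.gz in the first place.

```python
import os

# Assumption: compound extensions are listed with their leading dot.
DOUBLE_EXTENSIONS = ('.tar.gz', '.tar.bz2')


def split_extension(filename):
    """Split filename into (root, ext), keeping known compound extensions whole.

    split_extension is a hypothetical name used only for this sketch.
    """
    lowered = filename.lower()
    for double in DOUBLE_EXTENSIONS:
        if lowered.endswith(double):
            cut = len(filename) - len(double)
            # Slice the original string so the caller keeps the original case.
            return filename[:cut], filename[cut:]
    # Fall back to the standard behaviour for everything else.
    return os.path.splitext(filename)


print(split_extension('a.tar.gz'))
print(split_extension('backup.TAR.BZ2'))
print(split_extension('notes.txt'))
print(split_extension('guitar.gz'))
```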
Solution 4
I have an idea which is much easier than breaking your head over regex, though it might sound stupid too.
name="filename.tar.gz"
extensions=('.tar.gz','.py')
[x for x in extensions if name.endswith(x)]
Solution 5
This is simple and works for both single and multiple extensions. It extracts the file name; everything after the first dot is the extension.
In [1]: '/folder/folder/folder/filename.tar.gz'.split('/')[-1].split('.')[0]
Out[1]: 'filename'
In [2]: '/folder/folder/folder/filename.tar'.split('/')[-1].split('.')[0]
Out[2]: 'filename'
In [3]: 'filename.tar.gz'.split('/')[-1].split('.')[0]
Out[3]: 'filename'
Pushpak Dagade
My LinkedIn profile: https://in.linkedin.com/in/pushpak-dagade-47275121
Updated on June 13, 2022
Comments
-
Pushpak Dagade almost 2 years
I am trying to find the extension of a file, given its name as a string. I know I can use the function os.path.splitext, but it does not work as expected when my file extension is .tar.gz or .tar.bz2, as it gives the extensions as gz and bz2 instead of tar.gz and tar.bz2 respectively.
So I decided to find the extension of files myself using pattern matching.
print(re.compile(r'^.*[.](?P<ext>tar\.gz|tar\.bz2|\w+)$').match('a.tar.gz').group('ext'))
>>> gz # I want this to come as 'tar.gz'
print(re.compile(r'^.*[.](?P<ext>tar\.gz|tar\.bz2|\w+)$').match('a.tar.bz2').group('ext'))
>>> bz2 # I want this to come as 'tar.bz2'
I am using (?P<ext>...) in my pattern matching as I also want to get the extension. Please help.
-
Pushpak Dagade almost 13 years
OK, that will work in this case, but I want to solve it using Python regular expressions.
-
user1066101 almost 13 years
@Guanidene: If it's homework, mark the question homework. If it's not homework, don't use a regular expression when the function's already been written, debugged and works.
-
Pushpak Dagade almost 13 years
@S.Lott - It is not homework; I want to tackle the problem using regex, that's all. If just solving it were the aim, I could have done it long ago, as phihag says.
-
Pushpak Dagade almost 13 years
I really thank you for this! This thing took a lot of my time!
-
Pushpak Dagade almost 13 years
Suppose there is also a file extension .gz (assume) which I may want to match. In that case your tuple will be extensions=('.gz','.tar.gz','.py') (and name='filename.tar.gz'), and if I execute [x for x in extensions if name.endswith(x)] it will wrongly match .gz as well, when I want to match it with .tar.gz. Dude, what I want is a universal solution, not a data-specific solution!
-
Kracekumar almost 13 years
There is an option for that: place tar.gz first in the list and return as soon as an entry matches. But this method will not work once you place gz before tar.gz.
>>> extensions=('tar.gz','gz','py')
>>> name
'set.tar.gz'
>>> def test():
...     for x in extensions:
...         if name.endswith(x):
...             return x
...     return ' '
...
>>> test()
'tar.gz'
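One way to remove the dependence on tuple order, sketched here as an assumption rather than as anything proposed in the thread, is to collect every matching entry and keep the longest one. The function name longest_extension is hypothetical.

```python
def longest_extension(name, extensions):
    """Return the longest entry of extensions that name ends with, or ''.

    longest_extension is a hypothetical helper name used for this sketch.
    """
    matches = [ext for ext in extensions if name.endswith(ext)]
    # The longest suffix wins, so '.tar.gz' beats '.gz' regardless of order.
    return max(matches, key=len) if matches else ''


name = 'set.tar.gz'
print(longest_extension(name, ('.gz', '.tar.gz', '.py')))
print(longest_extension(name, ('.tar.gz', '.gz', '.py')))
```

Both calls pick '.tar.gz', because the choice depends on the length of the match and not on its position in the tuple.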
-
Pushpak Dagade almost 13 years
@phihag - One more reason why I wish to use regex: it is compact. If I go your way, I will need 3 lines unnecessarily (which also makes my code clumsy), while with regex I can get everything in a single line!
-
phihag almost 13 years
@Guanidene More compact does not equal more readable and maintainable. Also, why is a complicated regular expression less clumsy than three lines even non-programmers could understand? Anyway, to each his own.
-
Kracekumar almost 13 years
@Guanidene: Yes, agreed, I will not remember, but you can leave a comment. Regex may be a correct solution, and it is, but when answering the question I started by saying it might sound stupid and that there is an option; I am not arguing this is right.
-
four43 over 5 years
In [4]: '/folder/folder/folder/filename.tar.gz'.split('/')[-1].split('.')[1:]
Out[4]: ['tar', 'gz']
Works well for me!
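The same dot-splitting idea is also available through the standard library's pathlib module. The sketch below assumes Python 3.4 or later and reuses the sample path from the comment above; it is an alternative not mentioned in the thread.

```python
from pathlib import PurePosixPath

# PurePosixPath only manipulates the string; it never touches the file system.
path = PurePosixPath('/folder/folder/folder/filename.tar.gz')

print(path.name)               # last path component
print(path.suffixes)           # every dot-separated suffix of that component
print(''.join(path.suffixes))  # the suffixes joined back together
```

Like the split('.') approach, this treats every dot in the file name as the start of a suffix, so a name such as my.file.tar.gz yields '.file' as well.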