How to make a Ruby string safe for a filesystem?

Solution 1

From http://web.archive.org/web/20110529023841/http://devblog.muziboo.com/2008/06/17/attachment-fu-sanitize-filename-regex-and-unicode-gotcha/:

def sanitize_filename(filename)
  returning filename.strip do |name|
   # NOTE: File.basename doesn't work right with Windows paths on Unix
   # get only the filename, not the whole path
   name.gsub!(/^.*(\\|\/)/, '')

   # Replace every character other than ASCII letters, digits,
   # periods and hyphens with an underscore
   name.gsub!(/[^0-9A-Za-z.\-]/, '_')
  end
end
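
Since returning comes from older versions of ActiveSupport, here is a minimal plain-Ruby sketch of the same two substitutions. The method name sanitize_filename_plain is hypothetical and not part of the linked article. It uses the non-bang gsub, so each call returns a string whether or not anything matched.

```ruby
# Hypothetical plain-Ruby variant of the function above (no ActiveSupport).
def sanitize_filename_plain(filename)
  filename.strip
          # Drop everything up to the last slash or backslash
          .gsub(/^.*(\\|\/)/, '')
          # Replace anything other than ASCII letters, digits, '.' and '-'
          .gsub(/[^0-9A-Za-z.\-]/, '_')
end

puts sanitize_filename_plain('C:\\Users\\me\\my file.txt') # prints my_file.txt
puts sanitize_filename_plain('  notes.txt ')                # prints notes.txt
```

Note that this version still keeps every period and does not collapse repeated underscores.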

Solution 2

I'd like to suggest a solution that differs from the old one. Note that the old one uses the deprecated returning, which is in any case specific to Rails, and you didn't explicitly mention Rails in your question (only as a tag). The existing solution also fails to turn .doc.pdf into _doc.pdf, as you requested, and it doesn't collapse runs of underscores into one.

Here's my solution:

def sanitize_filename(filename)
  # Split the name when finding a period which is preceded by some
  # character, and is followed by some character other than a period,
  # if there is no following period that is followed by something
  # other than a period (yeah, confusing, I know)
  fn = filename.split(/(?<=.)\.(?=[^.])(?!.*\.[^.])/m)

  # We now have one or two parts (depending on whether we could find
  # a suitable period). For each of these parts, replace any unwanted
  # sequence of characters with an underscore
  fn.map! { |s| s.gsub(/[^a-z0-9\-]+/i, '_') }

  # Finally, join the parts with a period and return the result
  return fn.join '.'
end

You haven't specified all the details about the conversion. Thus, I'm making the following assumptions:

  • There should be at most one filename extension, which means that there should be at most one period in the filename
  • Trailing periods do not mark the start of an extension
  • Leading periods do not mark the start of an extension
  • Any sequence of characters other than A-Z, a-z, 0-9 and - should be collapsed into a single _ (i.e. the underscore is itself regarded as a disallowed character, so the string '$%__°#' would become '_' rather than '___' from the parts '$%', '__' and '°#')

The complicated part is splitting the filename into the main part and the extension. The regular expression looks for the last period that is followed by something other than a period, meaning that no later period in the string meets the same criterion. The period must also be preceded by some character, which ensures that it is not the first character in the string.

My results from testing the function:

1.9.3p125 :006 > sanitize_filename 'my§document$is°°   very&interesting___thisIs%nice445.doc.pdf'
 => "my_document_is_very_interesting_thisIs_nice445_doc.pdf"

which I think is what you requested. I hope this is nice and elegant enough.
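
To make the splitting rule easier to check, here is a small sketch that applies only the split step to a few edge cases. The constant SPLIT_RE and the helper split_extension are hypothetical names introduced for illustration. The regular expression is the one from the function above.

```ruby
# Same regular expression as in sanitize_filename above.
SPLIT_RE = /(?<=.)\.(?=[^.])(?!.*\.[^.])/m

# Hypothetical helper: returns one or two parts, main name and extension.
def split_extension(filename)
  filename.split(SPLIT_RE)
end

p split_extension('report.doc.pdf') # prints ["report.doc", "pdf"]
p split_extension('.bashrc')        # prints [".bashrc"] (leading period)
p split_extension('notes.')         # prints ["notes."] (trailing period)
p split_extension('archive')        # prints ["archive"] (no period at all)
```

Only the first case yields two parts, which matches the assumptions listed above about leading and trailing periods.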

Solution 3

In Rails you might also be able to use ActiveStorage::Filename#sanitized:

ActiveStorage::Filename.new("foo:bar.jpg").sanitized # => "foo-bar.jpg"
ActiveStorage::Filename.new("foo/bar.jpg").sanitized # => "foo-bar.jpg"

Solution 4

If you use Rails you can also use String#parameterize. It is not designed for filenames, but the result may be good enough. Be aware that, as noted in the comments, it also replaces the periods, so file extensions are not preserved.

"my§document$is°°   very&interesting___thisIs%nice445.doc.pdf".parameterize

Solution 5

In Rails, I wanted to keep the file extensions while using parameterize for the rest of the name:

filename = "my§doc$is°° very&itng___thsIs%nie445.doc.pdf"
cleaned = filename.split(".").map(&:parameterize).join(".")

For implementation details and ideas, see the source: https://github.com/rails/rails/blob/master/activesupport/lib/active_support/inflector/transliterate.rb

def parameterize(string, separator: "-", preserve_case: false)
  # (Excerpt only: parameterized_string is set up earlier in the method.)
  # Turn unwanted chars into the separator.
  parameterized_string.gsub!(/[^a-z0-9\-_]+/i, separator)
  #... some more stuff
end
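
Outside Rails, the split-and-join idea above can be approximated with a few gsub calls. The following is a rough sketch under stated assumptions, not the Rails implementation: simple_parameterize and clean_filename are hypothetical names, and non-ASCII characters are simply replaced by the separator rather than transliterated.

```ruby
# Hypothetical stand-in for parameterize; NOT the Rails implementation.
# Assumption: no transliteration, non-ASCII characters become the separator.
def simple_parameterize(string, separator: "-")
  sep = Regexp.escape(separator)
  result = string.gsub(/[^a-z0-9\-_]+/i, separator) # unwanted runs -> separator
  result = result.gsub(/#{sep}{2,}/, separator)     # collapse repeated separators
  result = result.gsub(/\A#{sep}|#{sep}\z/, "")     # trim leading/trailing separator
  result.downcase
end

# Clean each period-separated part on its own so the periods survive.
def clean_filename(filename)
  filename.split(".").map { |part| simple_parameterize(part) }.join(".")
end

puts clean_filename("My Report (final).doc.pdf") # prints my-report-final.doc.pdf
```

Because every period is kept, a name such as a.b.c.pdf retains all of its periods. Combine this with the single-extension split from Solution 2 if only the last one should survive.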

Author: marcgg

Updated on January 11, 2022

Comments

  • marcgg
    marcgg over 2 years

    I have user entries as filenames. Of course this is not a good idea, so I want to drop everything except [a-z], [A-Z], [0-9], _ and -.

    For instance:

    my§document$is°°   very&interesting___thisIs%nice445.doc.pdf
    

    should become

    my_document_is_____very_interesting___thisIs_nice445_doc.pdf
    

    and then ideally

    my_document_is_very_interesting_thisIs_nice445_doc.pdf
    

    Is there a nice and elegant way for doing this?

  • marcgg
    marcgg over 14 years
    Thanks for the link! BTW, in the article you linked, the poster says that this function has a problem.
  • JP.
    JP. over 11 years
    Getting an "undefined (?...) sequence..." when I attempt to use the code. Any limitations with ruby version?
  • Anders Sjöqvist
    Anders Sjöqvist over 10 years
    @JP. Sorry for the extremely late reply, and you've probably figured it out yourself by now. Haven't tested it, but I believe that look-behinds (which is what the question mark indicates) appeared in Ruby 1.9. So yes, there are limitations. See for example stackoverflow.com/q/7605615/1117365
  • Rob Yurkowski
    Rob Yurkowski almost 10 years
    This isn't technically accurate because it will also remove the decimal character, which is somewhat essential in preserving extensions. Fortunately, the code behind parameterize is relatively simple and can be implemented with just a few gsub calls.
  • Aleks
    Aleks over 9 years
    the name.gsub!(/[^0-9A-Za-z.\-]/, '_') is the only part I have used after 5 years :D
  • wgp
    wgp over 9 years
    Won't the use of gsub! cause the function to return nil if no replacement is performed? If so, won't this now create a need to assign the value of the gsub'd string to a new variable and test for nil before returning anything?
  • lfender6445
    lfender6445 about 9 years
    looks like this matches filenames with numbers such as '2015-03-09-ruby-block-procs-and-method-call.md'
  • Huliax
    Huliax about 8 years
    gsub! will return nil if there are no matches, but it is called inside the block passed to returning. When that block finishes, returning hands back the stripped string, which is what the method returns. Funky but just fine.
  • Adriano Resende
    Adriano Resende about 7 years
    It is better to use fn[0] = fn[0].parameterize and then return fn.join '.'
  • Joshua Pinter
    Joshua Pinter over 2 years
    This is sexy. I was using parameterize before but it was a little too heavy handed and would remove "safe" characters like spaces and the ampersand. But this doesn't do that. e.g. ActiveStorage::Filename.new("foo:bar & baz.jpg").sanitized #=> "foo-bar & baz.jpg". Nice! Also, you can easily add an initializer that monkey patches String and adds a String#sanitized method that essentially just calls this, so you can do something like "foo:bar & baz.jpg".sanitized #=> "foo-bar & baz.jpg". Dead sexy.