R getting substrings and regular expressions?

Solution 1

Here is a one-liner solution:

gsub(".*\\#(.*)\\..*", "\\1", c("HelloWorld#you.txt"))

Output:

you

To explain the code: the pattern matches everything up to #, then captures all characters up to the last ., and finally matches the rest of the string. Replacing the whole match with the captured group (\\1) leaves just the in-between string, which is what you are looking for.

Edit:

The above solution matches the file name up to the last ., i.e. it allows the file name to contain multiple dots. If you want to extract the name only up to the first ., you can use the regex .*\\#(\\w*)\\..* instead.
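
To see the difference between the two patterns, here is a small sketch run against a file name with several dots. The input string and the variable names are only illustrative and are not part of the original question.

```r
# Illustrative input; not from the original question
f <- "HelloWorld#you.and.me.txt"

# Greedy (.*): the capture runs up to the last dot
up_to_last_dot <- gsub(".*\\#(.*)\\..*", "\\1", f)

# (\\w*): word characters only, so the capture stops at the first dot
up_to_first_dot <- gsub(".*\\#(\\w*)\\..*", "\\1", f)

print(up_to_last_dot)   # "you.and.me"
print(up_to_first_dot)  # "you"
```

Note that \\w matches only letters, digits and underscores, so the second pattern assumes the part right after # contains nothing else before the first dot.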

Solution 2

strapplyc

To extract the word immediately after #, try strapplyc from the gsubfn package:

> library(gsubfn)
>
> strapplyc("HelloWorld#you.txt", "#(\\w+)")[[1]]
[1] "you"

or this, which allows the file name to contain dots:

> strapplyc("HelloWorld#you.txt", "#(.*)\\.")[[1]]
[1] "you"

file_path_sans_ext

Another, more filename-oriented approach uses the tools package (which comes bundled with R, so no extra packages need to be installed):

> library(tools)
>
> file_path_sans_ext(sub(".*#", "", "HelloWorld#you.txt")) 
[1] "you"
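
As a quick check of how this approach behaves on other inputs, the sketch below applies it to a vector of names. The extra file names are made up for illustration. One thing worth noticing: when there is no #, sub() leaves the string unchanged, so the whole base name comes back rather than an empty result.

```r
library(tools)  # ships with R

# Hypothetical inputs for illustration
files <- c("HelloWorld#you.txt",
           "HelloWorld#you.and.me.txt",
           "HelloWorldyou.txt")

# Drop everything up to and including '#', then drop the extension
after_hash <- file_path_sans_ext(sub(".*#", "", files))

print(after_hash)
# expected values: "you", "you.and.me", "HelloWorldyou"
```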

ADDED: additional solutions

Solution 3

You can use gsub. The advantage is that the capture runs up to the last ., so the file name itself may contain multiple dots.

> s <- 'HelloWorld#you.and.me.txt'
> gsub('.*#(.*)\\.+.*','\\1', s)
[1] "you.and.me"

Solution 4

This solution is easy for those not wanting to learn regex, but it doesn't quite align with the poster's intent (it is more for future searchers). It also covers the case where there is no #: the function then returns character(0).

library(qdap)
x <- c("HelloWorld#you.txt", "HelloWorldyou.txt")
genXtract(x, "#", ".")

Yields:

> genXtract(x, "#", ".")
$`#  :  right1`
[1] "you"

$`#  :  right2`
character(0)

I think there's a bug in the labels, though not in the actual return values.

EDIT: This is indeed a bug that has been fixed in the development version. Output with devel. ver.:

> genXtract(x, "#", ".")
$`#  :  .1`
[1] "you"

$`#  :  .2`
character(0)
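
For readers without qdap, a roughly similar result can be sketched in base R with regexec() and regmatches(). The helper name extract_after_hash is hypothetical, and this is not how genXtract is implemented; it only aims for the same shape of output, including character(0) when there is no #. It also assumes you want everything up to the last dot.

```r
# Hypothetical helper, base R only; not qdap's implementation
extract_after_hash <- function(x) {
  # regexec() records the full match and the capture group for each element
  m <- regexec("#(.*)\\.", x)
  # Drop the full match and keep only the captured group;
  # elements with no match yield character(0)
  lapply(regmatches(x, m), function(hit) hit[-1])
}

x <- c("HelloWorld#you.txt", "HelloWorldyou.txt")
res <- extract_after_hash(x)
print(res)
```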

Solution 5

grep returns the index in terms of item numbers, not character placement (HelloWorld#you.txt has only one item, so it should return 1).

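You want regexpr instead; it returns a character position rather than an item index. Likewise, use nchar rather than length for the end position, since length(name) is the number of elements in the vector, not the number of characters.

name <- "HelloWorld#you.txt"
hashPos = regexpr("#", name, fixed=TRUE) + 1  # first character after '#'
dotPos = nchar(name) - 4                      # last character before ".txt"
finalText = substring(name, hashPos, dotPos)  # "you"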
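
The question also asks what happens when the string has no #. Here is a short sketch of how the two functions differ in that case (the second file name is made up for illustration): grep() returns the indices of the matching elements, or an empty integer vector when nothing matches, whereas regexpr() returns one character position per element and -1 where there is no match.

```r
# Second name is illustrative; it has no '#'
name <- c("HelloWorld#you.txt", "HelloWorldyou.txt")

# grep(): which elements contain '#'
grep_hits <- grep("#", name, fixed = TRUE)

# regexpr(): position of the first '#' in each element, -1 if absent
pos <- as.integer(regexpr("#", name, fixed = TRUE))

# Extract only where '#' was found; NA marks names without '#'
# (assumes a three-letter extension such as ".txt")
after_hash <- ifelse(pos > 0,
                     substring(name, pos + 1, nchar(name) - 4),
                     NA_character_)

print(grep_hits)   # 1
print(pos)         # 11 -1
print(after_hash)  # "you" NA
```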
Author by

CodeKingPlusPlus

Updated on July 09, 2022

Comments

  • CodeKingPlusPlus
    CodeKingPlusPlus almost 2 years

    I have a set of strings that are file names. I want to extract all characters after the # symbol but before the file extension. For example, one of the file names is:

    HelloWorld#you.txt
    

    I would want to return the string you

    Here is my code:

        hashPos = grep("#", name, fixed=TRUE)
        dotPos = length(name)-3
        finalText = substring(name, hashPos, dotPos)
    

    I read online that grep is supposed to return the index where the first parameter occurs (in this case the # symbol). So, I was expecting the above to work but it does not.

    Or how would I use a regular expression to extract this string? Also, what happens when the string does not have a # symbol? Would the function return a special value such as -1?

  • CHP
    CHP about 11 years
    removed my erroneous comment.
  • Ehsan88
    Ehsan88 almost 9 years
    If a reader is still confused, they can check the table at the bottom of this page : endmemo.com/program/R/gsub.php. That helped me a lot.
  • Stan
    Stan about 6 years
    The endmemo post was very helpful. Also, I thought @Chinmay Patil's answer below superior in that it handles multiple ".".