How can I trim leading and trailing white space?

Solution 1

Probably the best way is to handle the trailing white space when you read your data file. If you use read.csv or read.table, you can set the parameter strip.white=TRUE.
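As a minimal sketch of the read-time approach, the snippet below reads a small inline CSV twice, once with the defaults and once with strip.white=TRUE. The CSV text and the names csv_text, raw and clean are made up for illustration; the fields are unquoted, which is the case strip.white is documented to handle.

```r
# Hypothetical two-row CSV; the country fields carry stray spaces.
csv_text <- "code,country\nAUT, Austria \nBEL,Belgium  "

# Default behaviour: white space around unquoted fields is kept.
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# strip.white = TRUE removes it while the file is being parsed.
clean <- read.csv(text = csv_text, strip.white = TRUE,
                  stringsAsFactors = FALSE)

raw$country    # " Austria " "Belgium  "
clean$country  # "Austria"   "Belgium"
```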

If you want to clean strings afterwards you could use one of these functions:

# Returns string without leading white space
trim.leading <- function (x)  sub("^\\s+", "", x)

# Returns string without trailing white space
trim.trailing <- function (x) sub("\\s+$", "", x)

# Returns string without leading or trailing white space
trim <- function (x) gsub("^\\s+|\\s+$", "", x)

To use one of these functions on myDummy$country:

 myDummy$country <- trim(myDummy$country)
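A quick check of the three helpers on a made-up character vector (x is only an illustration, not data from the question) might look like this. Note that the combined pattern needs gsub rather than sub, because sub replaces only the first match and would leave the trailing white space in place whenever leading white space is present.

```r
# The three helpers from above, repeated so the example runs on its own.
trim.leading  <- function(x) sub("^\\s+", "", x)
trim.trailing <- function(x) sub("\\s+$", "", x)
trim          <- function(x) gsub("^\\s+|\\s+$", "", x)

# Hypothetical input with spaces and a tab.
x <- c("  Austria ", "Belgium  ", "\tFrance")

trim.leading(x)   # "Austria "  "Belgium  " "France"
trim.trailing(x)  # "  Austria" "Belgium"   "\tFrance"
trim(x)           # "Austria"   "Belgium"   "France"
```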

To 'show' the white space you could use:

 paste(myDummy$country)

which will show you the strings surrounded by quotation marks (") making white spaces easier to spot.
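If eyeballing the quotes is not enough, one possible way to flag the affected values programmatically is sketched below. The vector country and the name has_edge_space are assumptions for illustration only.

```r
# Hypothetical values; the first and last have stray white space.
country <- c("Austria ", "Belgium", " France")

# TRUE where a value starts or ends with a white space character.
has_edge_space <- grepl("^\\s|\\s$", country)
has_edge_space            # TRUE FALSE TRUE

# Show only the offending values, wrapped in quotes.
writeLines(encodeString(country[has_edge_space], quote = '"'))

# Character counts also give the problem away.
nchar(country)            # 8 7 7
```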

Solution 2

As of R 3.2.0 a new function was introduced for removing leading/trailing white spaces:

trimws()

See: Remove Leading/Trailing Whitespace
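A short sketch of trimws() on a made-up vector follows; the which argument selects the side to trim and defaults to both.

```r
# Hypothetical input (requires R >= 3.2.0 for trimws).
x <- c("  Austria  ", "Belgium ")

both  <- trimws(x)                   # "Austria"    "Belgium"
left  <- trimws(x, which = "left")   # "Austria  "  "Belgium "
right <- trimws(x, which = "right")  # "  Austria"  "Belgium"
```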

Solution 3

To manipulate the white space, use str_trim() in the stringr package. The package has a manual dated Feb 15, 2013 and is on CRAN. The function can also handle string vectors.

install.packages("stringr", dependencies=TRUE)
require(stringr)
example(str_trim)
d4$clean2<-str_trim(d4$V2)

(Credit goes to commenter: R. Cotton)
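For a self-contained illustration that does not depend on the d4 data frame, the sketch below applies str_trim() to a made-up vector and uses its side argument. It assumes stringr is installed and is guarded so that it does nothing otherwise.

```r
# Runs only when stringr is available; x is a hypothetical input.
if (requireNamespace("stringr", quietly = TRUE)) {
  x <- c("  Austria ", "Belgium  ")

  both  <- stringr::str_trim(x)                  # "Austria"   "Belgium"
  left  <- stringr::str_trim(x, side = "left")   # "Austria "  "Belgium  "
  right <- stringr::str_trim(x, side = "right")  # "  Austria" "Belgium"
}
```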

Solution 4

A simple function to remove leading and trailing whitespace:

trim <- function( x ) {
  gsub("(^[[:space:]]+|[[:space:]]+$)", "", x)
}

Usage:

> text = "   foo bar  baz 3 "
> trim(text)
[1] "foo bar  baz 3"

Solution 5

Ad 1) To see white spaces you could directly call print.data.frame with modified arguments:

print(head(iris), quote=TRUE)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width  Species
# 1        "5.1"       "3.5"        "1.4"       "0.2" "setosa"
# 2        "4.9"       "3.0"        "1.4"       "0.2" "setosa"
# 3        "4.7"       "3.2"        "1.3"       "0.2" "setosa"
# 4        "4.6"       "3.1"        "1.5"       "0.2" "setosa"
# 5        "5.0"       "3.6"        "1.4"       "0.2" "setosa"
# 6        "5.4"       "3.9"        "1.7"       "0.4" "setosa"

See also ?print.data.frame for other options.

Author: mropa

Updated on August 16, 2022

Comments

  • mropa
    mropa almost 2 years

    I am having some trouble with leading and trailing white space in a data.frame.

    For example, I look at a specific row in a data.frame based on a certain condition:

    > myDummy[myDummy$country == c("Austria"),c(1,2,3:7,19)] 
    [1] codeHelper     country        dummyLI    dummyLMI       dummyUMI
    [6] dummyHInonOECD dummyHIOECD    dummyOECD
    <0 rows> (or 0-length row.names)
    

    I was wondering why I didn't get the expected output since the country Austria obviously existed in my data.frame. After looking through my code history and trying to figure out what went wrong I tried:

    > myDummy[myDummy$country == c("Austria "),c(1,2,3:7,19)]
       codeHelper  country dummyLI dummyLMI dummyUMI dummyHInonOECD dummyHIOECD
    18        AUT Austria        0        0        0              0           1
       dummyOECD
    18         1
    

    All I have changed in the command is an additional white space after Austria.

    Further annoying problems obviously arise, for example when I want to merge two frames based on the country column. One data.frame uses "Austria " while the other frame has "Austria". The matching doesn't work.

    1. Is there a nice way to 'show' the white space on my screen so that I am aware of the problem?
    2. And can I remove the leading and trailing white space in R?

    So far I have used a simple Perl script which removes the white space, but it would be nice if I could somehow do it inside R.

  • hadley
    hadley over 14 years
    Or, a little more succinctly, "^\\s+|\\s+$"
  • Jay
    Jay over 14 years
    As hadley pointed out, the regex "^\\s+|\\s+$" will identify leading and trailing whitespace, so x <- gsub("^\\s+|\\s+$", "", x) will remove it. Many of R's read functions also have this option: strip.white = FALSE
  • Aleksey Balenko
    Aleksey Balenko over 14 years
    @Jay: Thanks for the hint. I changed the regexps in my answer to use the shorter "\\s" instead of "[ \t]".
  • Aleksey Balenko
    Aleksey Balenko over 14 years
    Just wanted to point out, that one will have to use gsub instead of sub with hadley's regexp. With sub it will strip trailing whitespace only if there is no leading whitespace...
  • Jyotirmoy Bhattacharya
    Jyotirmoy Bhattacharya over 14 years
    Didn't know you could use \s etc. with perl=FALSE. The docs say that POSIX syntax is used in that case, but the syntax accepted is actually a superset defined by the TRE regex library laurikari.net/tre/documentation/regex-syntax
  • Richie Cotton
    Richie Cotton over 14 years
    See also str_trim in the stringr package.
  • Chris Beeley
    Chris Beeley over 12 years
    Plus one for "Trim function now stored for future use"- thanks!
  • Thieme Hennis
    Thieme Hennis almost 10 years
    is there a trim param in read.spss? I tried trim_values = TRUE and trim.factor.names = TRUE but to no avail...
  • Thieme Hennis
    Thieme Hennis almost 10 years
    FYI: I trimmed all trailing spaces of the entire dataframe using apply: df_trimmed <- as.data.frame(apply(df,2,function (x) sub("\\s+$", "", x)))
  • A5C1D2H2I1M1N2O1R2T1
    A5C1D2H2I1M1N2O1R2T1 about 9 years
    It depends on the definition of a best answer. This answer is nice to know of (+1) but in a quick test, it wasn't as fast as some of the alternatives out there.
  • Rodrigo
    Rodrigo almost 9 years
    Unfortunately, strip.white=TRUE only works on non-quoted strings.
  • Alex
    Alex over 8 years
    There is a much easier way to trim whitespace in R 3.2.0. See the next answer!
  • Jubbles
    Jubbles over 8 years
    doesn't seem to work for multi-line strings, despite \n being in the covered character class. trimws("SELECT\n blah\n FROM foo;") still contains newlines.
  • wligtenberg
    wligtenberg over 8 years
    @Jubbles That is the expected behaviour. In the string you pass to trimws there are no leading or trailing white spaces. If you want to remove leading and trailing white spaces from each of the lines in the string, you will first have to split it up. Like this: trimws(strsplit("SELECT\n blah\n FROM foo;", "\n")[[1]])
  • Jack Wasey
    Jack Wasey over 8 years
    Although a built-in function for recent versions of R, it does 'just' do a PERL style regex under the hood. I might have expected some fast custom C code to do this. Maybe the trimws regex is fast enough. stringr::str_trim (based on stringi) is also interesting in that it uses a completely independent internationalized string library. You'd think whitespace would be immune from problems with internationalization, but I wonder. I've never seen a comparison of results of native vs stringr/stringi or any benchmarks.
  • Richard Telford
    Richard Telford over 7 years
    This solution removed some mutant whitespace that trimws() was unable to remove.
  • wligtenberg
    wligtenberg over 7 years
    @RichardTelford could you provide an example? Because that might be considered a bug in trimws.
  • PatrickT
    PatrickT over 6 years
    For some reason I could not figure out, trimws() did not remove my leading white spaces, while Bryan's trim.strings() below (only 1 vote, mine!) did...
  • moodymudskipper
    moodymudskipper almost 6 years
    or df[] <- lapply(df, trimws) to be more compact. But it will in both cases coerce columns to character. df[sapply(df,is.character)] <- lapply(df[sapply(df,is.character)], trimws) to be safe.
  • EcologyTom
    EcologyTom almost 6 years
    Also need to include stringsAsFactors = FALSE when using read.csv, as this won't work on factors. trimws() detailed below will work regardless, but by silently converting factor to character. Both useful answers though, thanks!
  • Peter
    Peter over 4 years
    IMO this is the best solution. Not much code and highly performant.
  • pgee70
    pgee70 over 4 years
    Thanks for the require(stringr); their documentation or examples did not have this required line of code!
  • Tomas
    Tomas almost 4 years
    From which package? This function doesn't exist by default.
  • Gmichael
    Gmichael almost 4 years
    I don't think this is a good idea, since we don't know how many countries/levels the df actually have. Additionally, R would encode the first element of Dummy$Country as "Austria", even if it were "Spain".
  • tjebo
    tjebo over 3 years
    @JackWasey I've added a benchmark - the example might be somewhat simple, but it should give an idea about the performance
  • JasTonAChair
    JasTonAChair over 2 years
    Just not a useful answer without providing the package name, @J.Dan
  • Resource
    Resource about 2 years
    Removes trailing \r\n in all columns, unlike any other solutions I've seen that claim to work at the data frame level. I can get rid of tons of untidy and inelegant per-column trims.
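The data-frame-level approaches mentioned in the comments can be sketched as follows. This is only one possible variant: it trims the character columns and leaves everything else untouched, since applying trimws to every column would coerce them all to character. The data frame df and the name is_chr are made up for illustration.

```r
# Hypothetical data frame with stray white space in two columns.
df <- data.frame(
  code    = c(" AUT", "BEL "),
  country = c("Austria ", " Belgium"),
  n       = c(1, 2),
  stringsAsFactors = FALSE
)

# Identify the character columns and trim only those.
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], trimws)

df$country  # "Austria" "Belgium"
str(df)     # n is still numeric
```

Factor columns are skipped by this sketch; they would need to be converted to character first, or have their levels trimmed separately.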