Match word containing characters beyond a-zA-Z


Solution 1

Vim (as of version 7.3) has very limited support for non-ASCII characters in patterns. In particular, \w matches only ASCII letters, digits and the underscore, which is of limited use for non-English text.
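The effect can be reproduced outside Vim. The following is a minimal Python sketch, not Vim's own engine: it assumes that Python's `re.ASCII` flag is a fair stand-in for Vim's \w, since both restrict the class to [0-9A-Za-z_]. The sample word is written with escapes so that it is guaranteed to be in precomposed form.

```python
import re

# "prästgården", written with escapes so the accented letters are precomposed
word = "pr\u00e4stg\u00e5rden"

# With re.ASCII, \w is [a-zA-Z0-9_], the same set Vim documents for \w,
# so the word is split at every non-ASCII letter.
ascii_matches = re.findall(r"\w+", word, flags=re.ASCII)
print(ascii_matches)

# Without the flag, Python's \w is Unicode-aware and matches the whole word.
unicode_matches = re.findall(r"\w+", word)
print(unicode_matches)
```

The first call returns the three fragments pr, stg and rden; the second returns the whole word.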

There are a few character class patterns that do support Unicode. Of interest to you is \I, which by and large matches letters and only letters, plus _ and @. At least on Debian squeeze (in a UTF-8 locale), there are errors: for example, × and ÷ are matched as letters, but all Latin accented letters seem to be recognized correctly. \I can be configured through the isident option, at least for the ASCII part.

If you want serious Unicode support, you'll need to rely on an external tool. For example, use perl -C -e '/\p{L}/' to match UTF-8 letters (assuming a UTF-8 locale).
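As an illustration of what such an external tool might look like, here is a minimal sketch in Python rather than Perl (a swapped-in language, not part of the original answer). The helper name `letter_words` is hypothetical. Python's built-in `re` module has no \p{L}, so the sketch approximates it with `[^\W\d_]`, which also admits a few non-letter numeric characters such as superscript digits.

```python
import re

# [^\W\d_] = word characters minus decimal digits and underscore.
# This approximates \p{L}; it is not an exact equivalent.
LETTERS = re.compile(r"[^\W\d_]+")

def letter_words(text):
    """Return all maximal runs of (approximately) Unicode letters in text."""
    return LETTERS.findall(text)

# "prästgården treść злото 3×4 foo_bar", written with escapes
sample = ("pr\u00e4stg\u00e5rden tre\u015b\u0107 "
          "\u0437\u043b\u043e\u0442\u043e 3\u00d74 foo_bar")
print(letter_words(sample))
```

Unlike the \I behaviour described above, × is not treated as a letter here, and the underscore splits foo_bar into two words.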

Solution 2

Use \k, which matches keyword characters as defined by the iskeyword option (see :help 'iskeyword').

Solution 3

This also works for Cyrillic:

\v\k

This is a bit more complicated and fails with Cyrillic:

\v(\c[0-9a-z_[=a=][=c=][=e=][=i=][=l=][=n=][=o=][=r=][=s=][=t=][=u=][=y=][=z=]])

Doc.

Tested on Vim 7.4.


Author: Marco

Updated on September 18, 2022

Comments

  • Marco
    Marco over 1 year

    To match a word one can use

    \v(\w+)
    

    From the vim help :h \w:

    \w word character: [0-9A-Za-z_]

    This works exactly as described in the manual. However, I want to match words that contain characters beyond a-z, e.g. prästgården. Matching the regular expression \v(\w+) against prästgården yields three matches instead of one:

    prästgården
    ^^ ^^^ ^^^^
    

    How can I match words containing characters beyond a-z? My locale is set to English and, if possible, I'd like to keep it that way.

    Edit: The words might not belong to a single locale, e.g.

    prästgården
    treść
    
    • Warren Young
      Warren Young over 11 years
      POSIX character classes (e.g. [[:alpha:]]\+ in this case) are supposed to do what you want here, but according to the Vim docs (:help regex) they don't: "These items only work for 8-bit characters." It does happen to work here with Vim 7.3 on OS X 10.8, but not with Vim 7.3 on Linux, so I assume there's something Apple-specific about this Vim that allows it. You'll also find that doing it through the Vim Perl binding fails, even though Perl has very good Unicode support. You might need to switch to an external Perl script, so you can turn on full Unicode support.
    • Warren Young
      Warren Young over 11 years
      By the way, if you do go with Perl, you want to use \p{Word} instead of a POSIX character class. There are a lot of exception cases in Perl's POSIX character class handling, which you avoid when you use Unicode properties instead.
  • Marco
    Marco almost 9 years
    I'd add [=l=] to the list which would cover ł (e.g. złoty), etc. as well. But this already fails for Russian. Anyway, thanks for sharing.