Difference between .* and * in regular expressions
Solution 1
notation (.*)
The * in the regular expressions .* and * refers to a count, not to characters per se; more exactly, it means 'zero or more'. The . means 'any single character'.
So when you put them together you get 'zero or more of any character'. For example, strings like these:
- linux
- linnnnnx
- lnx
- hi linux
- lx
Would be matched by \<l.*x\>. The last one is important: it shows that .* can match nothing at all.
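A quick way to check this yourself, as a minimal sketch assuming a POSIX shell and any standard grep. It uses the plain pattern l.*x without word boundaries, and the variable name matches is only for illustration:

```shell
# Feed the five example strings to grep, one per line.
matches=$(printf '%s\n' 'linux' 'linnnnnx' 'lnx' 'hi linux' 'lx' | grep 'l.*x')

# All five lines come back, including "lx", where .* matched nothing.
printf '%s\n' "$matches"
```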
notation (*)
As I said, * on its own is a counter. So when you put it after a letter such as 'l', the * is saying 'zero or more of l'.
Notice that if we grep for l*x, this will match l...x, but probably not for the reason you'd think.
% echo "l...x" | grep "l*x"
l...x
It's matching on the trailing 'x'. The 'l' has nothing to do with why this is getting matched, other than the fact that the 'x' is preceded by 'zero or more l's'.
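To see exactly which part of the line was matched, you can ask grep to print only the matched text. This sketch assumes a grep that supports -o (GNU and BSD grep do; it is not required by POSIX), and the variable names are only for illustration:

```shell
# -o prints only the part of the line that the pattern matched.
only_x=$(echo "l...x" | grep -o "l*x")
echo "$only_x"    # prints: x

# When the l's sit directly in front of the x, they are part of the match.
with_ls=$(echo "lllx" | grep -o "l*x")
echo "$with_ls"   # prints: lllx
```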
Solution 2
For the shell (e.g. bash), when wildcards are used to match filenames, * and ? stand for characters themselves: they represent the character(s) being matched.
In regular expressions, on the other hand, *, ?, {n,m} (range of occurrences) and + (egrep only) are nothing by themselves. They always refer to the previous character/atom, whether this is an actual character (e.g. L or 5), the . (wildcard) which can represent any character, a range of characters (e.g. [a-f]) or a pattern of several characters (egrep only; e.g. (abba), where "abba" is considered a unit). The * and ? thus represent nothing by themselves, but say how many times the previous character (which may be a wildcard for any character, or a group treated as a unit) should be repeated.
Once you remember this distinction between the way the shell and regex use * and ?, it should fall into place.
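The same three characters, l*x, can be tried both ways. This is a minimal sketch assuming a POSIX shell; it uses a case statement for the shell-style pattern so no real files are needed, and the variable names are only for illustration:

```shell
word="linux"

# Shell pattern: * by itself stands for any run of characters.
case "$word" in
  l*x) glob_result="match" ;;
  *)   glob_result="no match" ;;
esac

# Regex: * only repeats the preceding 'l'; -x requires the whole line to match.
if printf '%s\n' "$word" | grep -q -x 'l*x'; then
  regex_result="match"
else
  regex_result="no match"
fi

echo "shell pattern l*x: $glob_result"    # match
echo "regex l*x:         $regex_result"   # no match
```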
So for regex:
- . represents exactly one occurrence of any character
- a..a matches two a's with two characters of any sort between
- .* matches 0, 1 or more occurrences of any character
- B* matches 0, 1 or more occurrences of "B"
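These can be checked with grep -x, which only accepts a line if the whole line matches the pattern. A minimal sketch with made-up sample strings and illustrative variable names:

```shell
# a..a : an 'a', any two characters, then an 'a'.
# "abba" and "axya" match; "aba" has only one character in between.
count_a=$(printf '%s\n' 'abba' 'axya' 'aba' | grep -c -x 'a..a')
echo "$count_a"   # prints: 2

# B* : zero or more B's, so the empty line matches as well as "B" and "BBB".
count_b=$(printf '%s\n' '' 'B' 'BBB' 'BAB' | grep -c -x 'B*')
echo "$count_b"   # prints: 3
```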
Solution 3
If you want to match anything starting with "l" and ending in "x", try the regular expression "l.*x". Here "." and "*" are special characters: "." stands for any single character, and "*" means that whatever precedes it is repeated zero or more times. Since what precedes "*" here is ".", the pattern allows any run of characters, including none, between the "l" and the "x".
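One thing to keep in mind is that grep looks for the pattern anywhere in the line, so "starting with" and "ending in" only hold for the whole line if you anchor the pattern. A small sketch, assuming you want whole-line matching; the sample strings and variable names are made up for illustration:

```shell
# Unanchored: "hi linux" also matches, because "linux" appears inside it.
unanchored=$(printf '%s\n' 'linux' 'hi linux' 'lx' | grep -c 'l.*x')

# ^ and $ tie the pattern to the start and end of the line.
anchored=$(printf '%s\n' 'linux' 'hi linux' 'lx' | grep -c '^l.*x$')

echo "$unanchored $anchored"   # prints: 3 2
```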
ravi
Updated on September 18, 2022
Comments
-
ravi over 1 year
I have a file named "test" that contains:
linux
Unixlinux
Linuxunix
it's linux
l...x
Now when I use grep '\<l.*x\>', it matches:
linux
it's linux
l...x
But when I use grep '\<l*x\>', it only matches:
l...x
According to the reference guide, with * the preceding item is matched zero or more times, i.e. it should match anything that starts with 'l' and ends with 'x'. Can anyone explain why it's not showing the desired result, or whether I've understood it wrong?
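For anyone who wants to reproduce this, here is a minimal sketch. It assumes the file holds one entry per line and that grep supports the \< and \> word-boundary extensions (GNU grep does); the sample data is piped in from a variable rather than read from a file:

```shell
# Sample data as in the question, assumed to be one entry per line.
sample=$(printf '%s\n' 'linux' 'Unixlinux' 'Linuxunix' "it's linux" 'l...x')

with_dot=$(printf '%s\n' "$sample" | grep '\<l.*x\>')
without_dot=$(printf '%s\n' "$sample" | grep '\<l*x\>')

echo "with l.*x:"
printf '%s\n' "$with_dot"      # linux, it's linux, l...x
echo "with l*x:"
printf '%s\n' "$without_dot"   # l...x
```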
-
Bernhard about 11 years
Why are you using \< and \>?
-
Bernhard about 11 years
Please note that . is a special character that should be escaped if you want to use it as a literal dot.
-
Pavan Kumar about 11 years
Run grep with the --color option; that will help you understand what happens (hint: x is a word starting with zero l's).
-
ravi about 11 years
Thanks @guido, --color really helped, and will also help in the future.
-
erch about 11 years
@Bernhard \< matches the beginning and \> matches the end of a 'word' [paragraph "GNU Word Boundaries" on regular-expressions.info/wordboundaries.html]
-
Bernhard about 11 years
According to your explanation, l*x should match neither l...x nor linux. Right?
-
slm about 11 years
No, it will match l...x, because the last .x will be matched as zero l's and an x. Let me update my answer to make that clearer, thanks.
-
ravi about 11 years
And as @guido has written in reply to the question, using --color will actually show what's been matched.