Regular Expression for finding double characters in Bash
Solution 1
This really is two questions, and should have been split up. But since the answers are relatively simple, I will put them here. These answers are for GNU grep specifically.
a) egrep is the same as grep -E. Both indicate that "Extended Regular Expressions" should be used instead of grep's default Basic Regular Expressions. grep requires the backslashes for Basic Regular Expressions.
From the man page:
Basic vs Extended Regular Expressions
In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \).
See the man page for additional details about historical conventions and portability.
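As a quick check of that equivalence, the following sketch runs the same search once in each syntax. The sample input is made up for illustration.

```shell
# Hypothetical sample input: one line with a doubled "a", one without.
sample=$(printf 'baab\nabab\n')

# Basic regular expression: the braces must be backslashed.
bre=$(printf '%s\n' "$sample" | grep 'a\{2\}')

# Extended regular expression: the braces are special as they are.
ere=$(printf '%s\n' "$sample" | grep -E 'a{2}')

echo "BRE: $bre"
echo "ERE: $ere"
```

Both commands should select the same line, since only the spelling of the interval differs.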
b) Use egrep '(.)\1{N}' and replace N with the number of characters you wish to match minus one (since the dot matches the first one). So if you want to match a character repeated four times, use egrep '(.)\1{3}'.
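A minimal sketch of b), assuming GNU grep (back-references together with -E are an extension, not standard ERE). The sample lines are invented for the example.

```shell
# Hypothetical sample input; only the second line contains four
# identical characters in a row.
sample=$(printf 'hello\nbzzzzt\naaa\n')

# (.) captures one character, \1{3} requires three more copies of it,
# so the whole pattern needs a run of four.
four=$(printf '%s\n' "$sample" | grep -E '(.)\1{3}')

echo "$four"
```

Note that a line with a longer run (five or more identical characters) would be selected as well, because it contains a run of four.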
Solution 2
This would look for 2 or more occurrences of the same character:
grep -E '(.)\1+' file
If your grep has the -o option, this would print each match on a new line:
grep -Eo '(.)\1+' file
To find runs of exactly 3 characters:
grep -E '(.)\1{2}' file
Or 3 or more:
grep -E '(.)\1{2,}' file
etc..
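To make the difference between the plain and the -o form visible, here is a small sketch. It assumes a grep that accepts both -o and back-references with -E (GNU and BSD grep do); the sample word is only an illustration.

```shell
# Hypothetical sample input: one line with three doubled letters,
# one line without any.
sample=$(printf 'bookkeeper\nabc\n')

# Without -o, grep prints every line that contains a repeated character.
lines=$(printf '%s\n' "$sample" | grep -E '(.)\1+')

# With -o, grep prints only the matched runs, one per output line.
runs=$(printf '%s\n' "$sample" | grep -Eo '(.)\1+')

echo "$lines"
echo "$runs"
```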
edit
Actually, @stephane_chazelas is right about back-references and -E. I had forgotten about that. I tried it in BSD grep and GNU grep and it works there, but it is not supported in some other greps. You would need to use one of the versions below.
Regular grep versions:
grep '\(.\)\1\{1,\}' file
grep -o '\(.\)\1\{1,\}' file
grep '\(.\)\1\{2\}' file
grep '\(.\)\1\{2,\}' file
The -o option is also not standard grep, BTW (probably if your grep understands -o, it can also do the back-reference).
Note: the "exactly 3" commands, grep -E '(.)\1{2}' file and grep '\(.\)\1\{2\}' file, are wrong, as alexis indicated, and should be ignored: they also match longer runs.
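The sketch below shows why the interval alone cannot express "exactly 3", followed by one possible workaround. The workaround is not part of the original answer: it assumes a grep with the non-standard -o option and hands the length check to awk.

```shell
# Hypothetical sample input: a run of three a's and a run of four c's.
sample=$(printf 'aaab\ncccc\n')

# The interval only demands three in a row somewhere on the line,
# so the line with four c's is selected as well.
loose=$(printf '%s\n' "$sample" | grep '\(.\)\1\{2\}')

# Workaround sketch: print every maximal run of one character with -o,
# then keep only the runs whose length is exactly 3.
exact=$(printf '%s\n' "$sample" | grep -o '\(.\)\1*' | awk 'length($0) == 3')

echo "$loose"
echo "$exact"
```

This reports the runs themselves rather than the lines that contain them, so it is a different kind of output than the commands above.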
Solution 3
First, thank you all for your supporting comments and suggestions. As it turns out I was already quite close to the answer.
The Main Issue was about:
Is there a simple way to look for n occurrences of the same character, e.g. aa, tttttt?
Short answer:
The following [variations of] commands match the character a repeated at least once, with no upper limit:
grep 'a\{1,\}'
grep -E '(a){1,}'
egrep 'a{1,}'
or, with GNU Regular Expressions available:
grep 'a\+'
The number of repetitions is set inside the curly brackets, using the pattern {min,max}: {n} repeats exactly n times, {n,} repeats at least n times, and {n,m} repeats at least n but at most m times.
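The three interval forms can be compared with a short sketch. It uses -x so that the pattern has to cover the whole line, which makes the counts easy to see; the sample input is made up.

```shell
# Hypothetical sample input: lines of one to four a's.
sample=$(printf 'a\naa\naaa\naaaa\n')

# -x anchors the pattern to the whole line.
exactly=$(printf '%s\n' "$sample" | grep -E -x 'a{2}')    # exactly 2
at_least=$(printf '%s\n' "$sample" | grep -E -x 'a{2,}')  # 2 or more
between=$(printf '%s\n' "$sample" | grep -E -x 'a{2,3}')  # 2 to 3

echo "exactly:  $exactly"
echo "at least: $at_least"
echo "between:  $between"
```

Without -x (or other anchors), a{2} would also select the lines with three and four a's, because they contain two a's in a row.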
This, as a consequence, raised the secondary issue:
Is the necessity of setting backslashes bound to the command I use?
Short answer: Yes, the use of backslashes depends on whether one uses grep or egrep:
- grep: a backslash activates metacharacters [uses Basic Regular Expressions]
- egrep: a backslash de-activates metacharacters [uses Extended Regular Expressions]
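A small sketch of that toggle, using the braces as the metacharacter. The input is invented: one line holds two a's, the other holds the literal text a{2}.

```shell
# Hypothetical sample input: a doubled "a" and the literal string "a{2}".
sample=$(printf 'aa\na{2}\n')

# grep (BRE): unescaped braces are literal, escaped braces are an interval.
bre_plain=$(printf '%s\n' "$sample" | grep 'a{2}')
bre_escaped=$(printf '%s\n' "$sample" | grep 'a\{2\}')

# grep -E (ERE): unescaped braces are an interval, escaped braces are literal.
ere_plain=$(printf '%s\n' "$sample" | grep -E 'a{2}')
ere_escaped=$(printf '%s\n' "$sample" | grep -E 'a\{2\}')

echo "BRE plain:   $bre_plain"
echo "BRE escaped: $bre_escaped"
echo "ERE plain:   $ere_plain"
echo "ERE escaped: $ere_escaped"
```

The results are mirrored: the pattern that finds aa in one mode finds the literal a{2} in the other.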
As this is only the short answer, I have added, for those who run into comparable issues, my basic summary of what one seemingly has to be aware of when working with grep and egrep.
Basic, Extended, and GNU Regular Expressions
Basic Regular Expressions
Used in the grep, ed and sed commands.
The Basic Regular Expressions feature set:
- Several metacharacters, e.g. ? + { | ( ), are activated through a backslash. Without a backslash they are taken as (part of) the search term.
- ^ $ \< and \> are supported as they are, without an additional backslash
- No shorthand characters [\b, \s, etc.]
GNU Basic Regular Expressions add to these:
- \? repeats a character zero or one time (c\? matches an empty string or c) and is an alternative for \{0,1\}
- \+ repeats a character at least one time (c\+ matches c, cc, cccccccc, etc.) and is an alternative for \{1,\}
- \| is supported (e.g. grep 'a\|b' will look for a or b)
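The three GNU additions can be tried out as follows. This is a sketch that assumes GNU grep; other greps may treat \?, \+ and \| differently in basic mode. The patterns are quoted so that the shell passes the backslashes through to grep unchanged.

```shell
# Hypothetical sample input.
sample=$(printf 'b\nc\ncc\nd\n')

# \? : zero or one "c"; with -x only the single "c" line fits.
optional=$(printf '%s\n' "$sample" | grep -x 'c\?')

# \+ : one or more "c"; with -x both "c" and "cc" fit.
plus=$(printf '%s\n' "$sample" | grep -x 'c\+')

# \| : alternation, here "b" or "d".
either=$(printf '%s\n' "$sample" | grep 'b\|d')

echo "optional: $optional"
echo "plus:     $plus"
echo "either:   $either"
```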
grep -E enables the command to use the whole set of Extended Regular Expressions:
Extended Regular Expressions [ERE]
Used in egrep, awk and emacs; it is the Basic Set plus quite a few features.
- Metacharacters are deactivated through a backslash
- No back-references
- else: a lot of the magic Regular Expressions usually can do for one
GNU Extended Regular Expressions
adds the following features
The two links will direct one to regular-expressions.info which, in addition to the awesome support I've got here, really helped me a lot.
Comments
-
erch over 1 year
I am looking for a regular expression that finds all occurrences of double characters in a text, a listing, etc. on the command line (Bash).
Main Question: Is there a simple way to look for sequences like aa, ll, ttttt, etc., where one defines a regular expression that looks for n occurrences of the same character? What I am looking for is achieving this on a very, very basic level. On the command line. In a Linux shell.
After quite some research I came to the following answers, and to questions resulting from them, since they only gave me a hint of where the solution might be:
a) (e)grep and the backslash issue
- grep 'a\{2\}' looks for aa
- egrep 'a{2}' looks for aa
Question: Is the necessity of setting backslashes really bound to the command I use? If so, can anyone give me a hint what else is to be taken into account when using (e)grep here?
b) I found this answer here for my question, though it isn't exactly what I was looking for: grep -E '(.)\1' filename looks for entries with the same character appearing more than once but doesn't ask how often. This is close to what I am looking for, but I still want to set the number of repetitions.
I probably should split this into two or more questions, but then I don't want to flood this awesome site here.
P.S.: Another question, possibly off topic, but: is it in, inside, at or on the shell? And is on the command line correct?
-
erch about 11 years
Thank you, so far. But: am I right in saying that without the -E option grep wouldn't do much? This would explain quite a lot, for example why I wasted so much time looking for where I was wrong!
-
erch about 11 years
When reading the man page I must have really misunderstood or misinterpreted the part you pointed at. When I worked through some regular expression tutorials there were no hints of such behaviour to be expected. I thought that Regular Expression means something on such a basic level that most applications are working with the same set of symbols. Again, I was proven wrong. Thanks for your help! This really helped me out.
-
depquid about 11 years
@cellar.dweller It is confusing! A lot of the reasoning is historical. I'm more familiar with the Extended form, so I make a habit of always just using egrep if I need regular expressions (as opposed to just simple string matching) so that I don't have to worry about remembering the differences between grep's two types of regular expressions.
-
Stéphane Chazelas about 11 years
Note that standard EREs don't support back-references, while standard BREs do. So grep '\(.\)\1\{3\}' is standard, grep -E '(.)\1{3}' is not.
-
Scrutinizer about 11 years
Without the -E option you can do the same in this case, but you would need to escape more and there is no + operator. I'll post examples too.
-
Scrutinizer about 8 years
Yes, you are absolutely right, that does not work as intended; in fact it is not possible like that.