Regular Expression for finding double characters in Bash

linux bash command-line grep regular-expression

53,968

Solution 1

This really is two questions, and should have been split up. But since the answers are relatively simple, I will put them here. These answers are for GNU grep specifically.

a) egrep is the same as grep -E. Both indicate that "Extended Regular Expressions" should be used instead of grep's default Basic Regular Expressions. With plain grep (Basic Regular Expressions), the backslashes are required.

From the man page:

Basic vs Extended Regular Expressions

In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \).

See the man page for additional details about historical conventions and portability.
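As a small illustration of the man page excerpt, the sketch below runs the same search once with basic and once with extended syntax. The words function and its sample lines are made up for this example, and grep -E is used instead of egrep because recent GNU grep releases flag egrep as obsolescent.

```shell
# Hypothetical sample input: one entry per line.
words() { printf '%s\n' 'aa' 'ab' 'a{2}'; }

# Basic syntax (plain grep): the braces need backslashes to act as a repeat count.
words | grep 'a\{2\}'      # prints: aa

# Extended syntax (grep -E): the braces are special without backslashes.
words | grep -E 'a{2}'     # prints: aa
```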

b) Use egrep '(.)\1{N}' and replace N with the number of characters you wish to match minus one (since the dot matches the first one). So if you want to match a character repeated four times, use egrep '(.)\1{3}'.
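A minimal sketch of the "N minus one" rule, assuming a grep that accepts back references with -E (GNU and BSD grep do). The helper name find_run and the sample words are invented for this example.

```shell
# find_run N: print the input lines that contain some character
# repeated at least N times in a row (hypothetical helper).
find_run() { grep -E "(.)\\1{$(( $1 - 1 ))}"; }

# Four in a row means the pattern (.)\1{3}
printf '%s\n' 'aaab' 'xaaaay' 'abcd' | find_run 4    # prints: xaaaay
```

Because the pattern is not anchored, a longer run (five or more of the same character) also matches.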

Solution 2

This would look for 2 or more occurrences of the same character:

grep -E '(.)\1+' file

If your grep has the -o option, this would print each match on a new line:

grep -Eo '(.)\1+' file

To find a character repeated exactly 3 times:

grep -E '(.)\1{2}' file

Or 3 or more:

grep -E '(.)\1{2,}' file

etc..
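To see what these print, here is a short run on made-up input, assuming a grep with -o and with back references under -E (for example GNU grep):

```shell
# -o prints every match on its own line instead of the whole input line.
printf '%s\n' 'bookkeeper' | grep -Eo '(.)\1+'
# prints:
# oo
# kk
# ee

# Three or more of the same character in a row.
printf '%s\n' 'aab' 'aaab' | grep -E '(.)\1{2,}'    # prints: aaab
```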

edit

Actually @stephane_chazelas is right about back references and -E; I had forgotten about that. I tried it in BSD grep and GNU grep and it works there, but it is not supported in some other greps. There you would need to use one of the versions below.

Regular grep versions:

grep '\(.\)\1\{1,\}' file

grep -o '\(.\)\1\{1,\}' file

grep '\(.\)\1\{2\}' file

grep '\(.\)\1\{2,\}' file

The -o option is also not standard grep BTW (probably if your grep understands -o it can also do the back reference)..

Note: grep -E '(.)\1{2,}' file and grep '\(.\)\1\{2\}' file are wrong as alexis indicated and should be ignored..
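A quick way to see the problem with the "exactly 3" pattern, using made-up input: the pattern is not anchored, so it also succeeds inside longer runs.

```shell
# Lines with a run of 3 and a run of 5 both match.
printf '%s\n' 'aab' 'aaa' 'aaaaa' | grep '\(.\)\1\{2\}'
# prints:
# aaa
# aaaaa

# With -o (where available), a run of six is reported as two separate matches.
printf '%s\n' 'aaaaaa' | grep -o '\(.\)\1\{2\}'
# prints:
# aaa
# aaa
```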

Solution 3

First, thank you all for your supporting comments and suggestions. As it turns out I was already quite close to the answer.

The Main Issue was about:

Is there a simple way to look for n occurrences of the same character, e.g. aa, tttttt

Short answer:

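The following [variations of] commands match a repeated at least once, with no upper limit:

grep 'a\{1,\}'

grep -E '(a){1,}'

egrep 'a{1,}'

or, with GNU Regular Expressions available, grep 'a\+'

Quote the pattern in each case: unquoted, Bash strips the backslashes and brace-expands a{1,} before grep ever sees it.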

The number of repetitions is set inside the curly brackets, following the pattern {min,max}: {n} repeats exactly n times, {n,} repeats at least n times, and {n,m} repeats at least n but at most m times.
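The three forms can be compared side by side. In this sketch the runs function and its input are invented for illustration, and -x (match the whole line) is added so that a longer run cannot satisfy a shorter count.

```shell
# Hypothetical input: runs of one to four a's, one per line.
runs() { printf '%s\n' 'a' 'aa' 'aaa' 'aaaa'; }

runs | grep -Ex 'a{2}'      # exactly 2:      aa
runs | grep -Ex 'a{2,}'     # at least 2:     aa, aaa, aaaa
runs | grep -Ex 'a{2,3}'    # between 2 and 3: aa, aaa
```

Without -x, every line containing at least two a's would match all three patterns.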

This, in turn, raised the secondary issue:

Is the necessity of setting backslashes bound to the command I use?

Short answer: Yes, the use of backslashes depends on whether one uses grep or egrep

grep: backslash activates metacharacters [uses Basic Regular Expressions]
egrep: backslash de-activates metacharacters [uses Extended Regular Expressions]
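The "activates" versus "de-activates" behaviour can be checked with parentheses. This is a sketch on made-up input, with grep -E standing in for egrep; the function name sample is hypothetical.

```shell
# Hypothetical input: a doubled letter and a letter in literal parentheses.
sample() { printf '%s\n' 'aa' '(a)'; }

# Basic: unescaped parentheses are ordinary characters.
sample | grep '(a)'         # prints: (a)

# Extended: escaped parentheses are ordinary characters.
sample | grep -E '\(a\)'    # prints: (a)

# Extended: unescaped parentheses only group, so this is the pattern "a".
sample | grep -E '(a)'      # prints both lines: aa and (a)
```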

Since this is only the short answer, and for those who run into comparable issues, I have added my basic summary of what one seemingly has to be aware of when working with grep and egrep.

Basic, Extended, and GNU Regular Expressions

Basic Regular Expressions

Used by the grep, ed and sed commands

Features of the Basic Regular Expressions set:

The metacharacters { } ( ) are activated through a backslash, i.e. \{ \} \( \). If there is no backslash they will be taken as (part of the) search term.
^ $ \< and \> are supported without a backslash
No shorthand characters [\b, \s, etc.]

GNU Basic Regular Expressions add to these

\? repeats a character zero or one time (ab\? matches a and ab) and is an alternative for \{0,1\}
\+ repeats a character at least one time (c\+ matches c, cc, cccccccc etc.) and is an alternative for \{1,\}
\| is supported (e.g. grep 'a\|b' will look for a or b)
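A short sketch of the three GNU additions, assuming GNU grep; the function name spellings and its sample words are invented for this example, and -x restricts each match to the whole line.

```shell
# Hypothetical input.
spellings() { printf '%s\n' 'color' 'colour' 'colouur' 'cat'; }

spellings | grep -x 'colou\?r'      # zero or one u:  color, colour
spellings | grep -x 'colou\+r'      # one or more u:  colour, colouur
spellings | grep -x 'cat\|color'    # either word:    color, cat
```

The patterns are quoted so that the shell passes the backslashes through to grep unchanged.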

grep -E enables the command to use the whole set of the Extended Regular Expressions:

Extended Regular Expressions [ERE]

Used in egrep, awk and emacs; this is the Basic set plus quite a few features.

Metacharacters are deactivated through a backslash
No back references in the standard (GNU and BSD grep -E accept them as an extension)
else: a lot of the magic Regular Expressions usually can do for one

GNU Extended Regular Expressions

adds the following features

The two links will direct one to regular-expressions.info which, in addition to the awesome support I've got here, really helped me a lot.

erch

Updated on September 18, 2022

Comments

erch over 1 year
I am looking for a regular expression that finds all occurrences of double characters in a text, a listing, etc. on the command line (Bash).

Main Question: Is there a simple way to look for sequences like aa, ll, ttttt, etc., where one defines a regular expression that looks for n occurrences of the same character? What I am looking for is achieving this on a very very basic level. On the command line. In a Linux Shell.

After quite some research I came to the following answers – and questions resulting from them, thus they just gave me a hint where the solution might be. But:

a) (e)grep and the backslash issue
- grep 'a\{2\}' looks for aa
- egrep 'a{2}' looks for aa
Question: Is the necessity of setting backslashes really bound to the command I use? If so, can anyone give me a hint what else is to be taken into account when using (e)grep here?

b) I found this answer here for my question, though it isn't exactly what I was looking for:

grep -E '(.)\1' filename looks for entries with the same character appearing more than once but doesn't ask how often. This is close to what I am looking for, but I still want to set a number of repetitions.

I probably should split this into two or more questions, but then I don't want to flood this awesome site here.

P.S.: Another question, possibly off topic but: is it in, inside, at or on the shell. And is on the command line correct?
erch about 11 years

Thank you, so far. But: Am I right saying that without the -E option grep wouldn't do much? This would explain quite a lot, for example why I wasted so much time looking for where I was wrong!
erch about 11 years

When reading the man page I must have really misunderstood or misinterpreted the part you pointed at. When I worked through some regular expression tutorials there were no hints of such behaviour to be expected. I thought that Regular Expression means something on such a basic level that most applications are working with the same set of symbols. Again, I was proven wrong. Thanks for your help! This really helped me out.
depquid about 11 years

@cellar.dweller It is confusing! A lot of the reasoning is historical. I'm more familiar with the Extended form, so I make a habit of always just using egrep if I need regular expressions (as opposed to just simple string matching) so that I don't have to worry about remembering the differences between grep's two types of regular expressions.
Stéphane Chazelas about 11 years

Note that standard EREs don't support back-references, while standard BREs do. So grep '\(.\)\1\{3\}' is standard, grep -E '(.)\1{3}' is not.
Scrutinizer about 11 years

Without the -E option you can do the same in this case, but you would need to escape more and there is no + operator.. I'll post examples too.
Scrutinizer about 8 years

Yes you are absolutely right, that does not work as intended, in fact it is not possible like that..