How to find repeating sequence of characters in a given array?

25,340

Solution 1

For your examples, my first approach would be to

  1. get the first character of the array (for your last example, that would be C)
  2. get the index of the next appearance of that character in the array (e.g. 9)
  3. if it is found, search for the next appearance of the substring between the two appearances of the character (in this case CARPENTER)
  4. if it is found, you're done (and the result is this substring).

Of course, this works only for a very limited subset of possible arrays, where the same word is repeated over and over again, starting from the beginning, without stray characters in between, and its first character is not repeated within the word. But all your examples fall into this category - and I prefer the simplest solution which could possibly work :-)

If the repeated word contains the first character multiple times (e.g. CACTUS), the algorithm can be extended to look for subsequent occurrences of that character too, not only the first one (so that it finds the whole repeated word, not only a substring of it).

Note that this extended algorithm would give a different result for your second example, namely RONRON instead of RON.

Solution 2

Tongue-in-cheek O(NlogN) solution

Perform an FFT on your string (treating characters as numeric values). Every peak in the resulting graph corresponds to a substring periodicity.

Solution 3

In Python, you can leverage regexes thus:

def recurrence(text):
    import re
    for i in range(1, len(text)/2 + 1):
        m = re.match(r'^(.{%d})\1+$'%i, text)
        if m: return m.group(1)

recurrence('abcabc') # Returns 'abc'

I'm not sure how this would translate to Java or C. (That's one of the reasons I like Python, I guess. :-)

Solution 4

First write a method that find repeating substring sub in the container string as below.

boolean findSubRepeating(String sub, String container);

Now keep calling this method with increasing substring in the container, first try 1 character substring, then 2 characters, etc going upto container.length/2.

Solution 5

Just figured this out myself and wrote some code for this (written in C#) with a lot of comments. Hope this helps someone:

// Check whether the string contains a repeating sequence.
public static bool ContainsRepeatingSequence(string str)
{
    if (string.IsNullOrEmpty(str)) return false;

    for (int i=0; i<str.Length; i++)
    {
        // Every iteration, cut down the string from i to the end.
        string toCheck = str.Substring(i);

        // Set N equal to half the length of the substring. At most, we have to compare half the string to half the string. If the string length is odd, the last character will not be checked against, but it will be checked in the next iteration.
        int N = toCheck.Length / 2;

        // Check strings of all lengths from 1 to N against the subsequent string of length 1 to N.
        for (int j=1; j<=N; j++)
        {
            // Check from beginning to j-1, compare against j to j+j.
            if (toCheck.Substring(0, j) == toCheck.Substring(j, j)) return true;
        }
    }

    return false;
}

Feel free to ask any questions if it's unclear why it works.

Share:
25,340
brainless
Author by

brainless

Once I was: WebMethods, Java developer, Middleware Engineer Then: PHP Developer, Laravel, Application Design and Architecture Today:Android Developer Tomorrow : May be mainframe! or start a restaurant!! SOreadytohelp

Updated on November 20, 2020

Comments

  • brainless
    brainless over 3 years

    My problem is to find the repeating sequence of characters in the given array. simply, to identify the pattern in which the characters are appearing.

       .---.---.---.---.---.---.---.---.---.---.---.---.---.---.
    1: | J | A | M | E | S | O | N | J | A | M | E | S | O | N |
       '---'---'---'---'---'---'---'---'---'---'---'---'---'---'

       .---.---.---.---.---.---.---.---.---.---.---.---.---.---.---.
    2: | R | O | N | R | O | N | R | O | N | R | O | N | R | O | N |
       '---'---'---'---'---'---'---'---'---'---'---'---'---'---'---'

       .---.---.---.---.---.---.---.---.---.---.---.---.
    3: | S | H | A | M | I | L | S | H | A | M | I | L |
       '---'---'---'---'---'---'---'---'---'---'---'---'

       .---.---.---.---.---.---.---.---.---.---.---.---.---.---.---.---.---.---.
    4: | C | A | R | P | E | N | T | E | R | C | A | R | P | E | N | T | E | R |
       '---'---'---'---'---'---'---'---'---'---'---'---'---'---'---'---'---'---'


    Example

    Given the previous data, the result should be:

    1. "JAMESON"
    2. "RON"
    3. "SHAMIL"
    4. "CARPENTER"

    Question

    • How to deal with this problem efficiently?
  • Eyal Schneider
    Eyal Schneider over 13 years
    +1 for simplicity and linear time solution. However, I understood the problem differently. I guess that the question should specify whether characters can repeat in the pattern, and whether we are looking for the largest or smallest pattern that repeats itself.
  • Jonathan
    Jonathan over 13 years
    assuming the word never repeats the first letter is pretty much throwing up your hands and going home.
  • Péter Török
    Péter Török over 13 years
    @Jonathan, in case you haven't noticed, I actually describe how to deal with that case :-)
  • pmg
    pmg over 13 years
    oops p=fgrs("BARBARABARBARAB-RBARA", &t);
  • Dan
    Dan over 13 years
    As you mention the FFT, it got me thinking about using a cross-correlation (whichever method is used) to find matches of the substring in the sequence. Normally my caveman brute-force approach, if I couldn't use an off-the-shelf regex library, would be the "walk the sequence, try to match" approach. But your answer got me thinking -- I wonder if/when a cross correlation would be more efficient. Probably depends on the length of the pattern, the length of the sequence to search, etc... but anyway, your answer got me thinking ("out of the box" as Jonathan said). Thanks.
  • Oliver Charlesworth
    Oliver Charlesworth over 13 years
    Cross-correlation and doing a Fourier transform are effectively the same thing (see en.wikipedia.org/wiki/Convolution_theorem). For anything other than quite small values of N, the FFT will be more efficient.
  • Philipp
    Philipp over 10 years
    Could you explain why every peak corresponds to a substring periodicity? Unfortunately, I cannot fully grasp the idea. I only know FFT for frequency analysis in music (there, multiple frequencies overlay at the same time; but when analyzing text we've got a straight sequence of characters). How does this match up?
  • Péter Török
    Péter Török over 9 years
    @sagivo, care to explain why you think so?
  • Intoxicated Penguin
    Intoxicated Penguin almost 3 years
    This is NOT an answer! Edit: I see this is old! I hope you have bettered yourself in this stretch of time!