Finding shortest repeating cycle in word?

11,368

Solution 1

O(n) solution. Assumes that the entire string must be covered. The key observation is that we generate the pattern and test it, but if we find something along the way that doesn't match, we must include the entire string that we already tested, so we don't have to reobserve those characters.

def pattern(inputv):
    pattern_end =0
    for j in range(pattern_end+1,len(inputv)):

        pattern_dex = j%(pattern_end+1)
        if(inputv[pattern_dex] != inputv[j]):

            pattern_end = j;
            continue

        if(j == len(inputv)-1):
            print pattern_end
            return inputv[0:pattern_end+1];
    return inputv;

Solution 2

Here is a correct O(n) algorithm. The first for loop is the table building portion of KMP. There are various proofs that it always runs in linear time.

Since this question has 4 previous answers, none of which are O(n) and correct, I heavily tested this solution for both correctness and runtime.

def pattern(inputv):
    if not inputv:
        return inputv

    nxt = [0]*len(inputv)
    for i in range(1, len(nxt)):
        k = nxt[i - 1]
        while True:
            if inputv[i] == inputv[k]:
                nxt[i] = k + 1
                break
            elif k == 0:
                nxt[i] = 0
                break
            else:
                k = nxt[k - 1]

    smallPieceLen = len(inputv) - nxt[-1]
    if len(inputv) % smallPieceLen != 0:
        return inputv

    return inputv[0:smallPieceLen]

Solution 3

This is an example for PHP:

<?php
function getrepeatedstring($string) {
    if (strlen($string)<2) return $string;
    for($i = 1; $i<strlen($string); $i++) {
        if (substr(str_repeat(substr($string, 0, $i),strlen($string)/$i+1), 0, strlen($string))==$string)
            return substr($string, 0, $i);
    }
    return $string;
}
?>

Solution 4

Most easiest one in python:

def pattern(self, s):
    ans=(s+s).find(s,1,-1)
    return len(pat) if ans == -1 else ans

View more solutions

11,368

jack44

Updated on September 24, 2021

Comments

jack44 over 2 years

I'm about to write a function which, would return me a shortest period of group of letters which would eventually create the given word.

For example word abkebabkebabkeb is created by repeated abkeb word. I would like to know, how efficiently analyze input word, to get the shortest period of characters creating input word.
- jack44 almost 13 years
  
  @Tony The Tiger, the result (shortest period) does not have to be a real word.
pimvdb almost 13 years

This returns 'abkeb' which should be correct but I'm not sure in what way the OP is asking for 'kebab' rather than 'abkeb'.
jack44 almost 13 years

This is what I'm looking for. But it runs in O(n). Any ideas if this can be speeded up?
mellamokb almost 13 years

@jack44: You can't know if you have the shortest cycle until you've examined the entire string. Unless you have other knowledge, like what the largest possible cycle might be. It might be that the last character in the string throws the whole cycle off, you don't know.
dfb almost 13 years

I don't know PHP, but this looks like it's O(n^2).
Ishtar almost 13 years

Is for pattern_end in range(len(inputv)/2) necessary? I don't think it is.
dfb almost 13 years

@Ishtar - sorry I'm not following. Do you mean the look of the len()/2 part
Ishtar almost 13 years

I mean, replacing that line with pattern_end = 0.
dfb almost 13 years

@Richard86 - String comparison is going to O(n), though, isn't it?
AndersTornkvist almost 13 years

@jack44 The speed could be improved. One solution is to start at pos strlen($string)/2. Then, if it is a match, you would go to strlen($string)/4, strlen($string)/8, and so on, until it is not a match, after that, you would check each char from the current position until match. If it isn't a match at pos strlen($string)/2, you would check at pos strlen($string)-strlen($string)/2, if not a match, continue with strlen($string)-strlen($string)/4, strlen($string)-strlen($string)/8 and so on until a match, and then, check from one step back. I'll use PHP if no one tries with pseudo.
AndersTornkvist almost 13 years

@jack44 I believe the O(n) solution by @spin is to prefer, and there is no need to improve this answer as it will not be able to be as fast as the O(n) solution is.
Eyal Schneider about 11 years

I'm afraid the algorithm is incorrect. Please consider the input: "BCBDBCBCBDBC". The smallest repeating pattern is "BCBDBC", but the algorithm above will miss it. Also, I think it doesn't deal correctly with the case "HELLOHELL" (where it returns "HELLO" instead of the complete string).
Buge over 8 years

Just looking at the first for loop this is O(n^2), because each .equals() can take n time.
Buge over 8 years

This is actually O(n^2). That is because you reset i to be smaller with i = currentRepeat.length-1;. So with a 10 character string ling 'aaaaaaaaab' it takes 46 iterations. With a 20 character string it takes 191 iterations.
davidbuttar over 8 years

So is this a solution you have come up with or is this a known algorithm?
Buge over 8 years

Well KMP is a known algorithm. This question was very similar to a homework problem I had, and this is the answer I came up with for the homework. The instructor's solution was a bit different, but also used KMP.
Charlie Fish over 7 years

Would be good if you gave some detail regarding why exactly this works. Providing more detail helps the entire community and helps to get more up votes.
Lin Ma over 7 years

Hi Buge, love your solution, and vote up. but confused by this line smallPieceLen = len(inputv) - nxt[-1], and nxt[-1] means if the whole string does not match, what index we will be used to compare next. smallPieceLen represents the differences total length of string and nxt[-1], how it could be represented as shortest repetitive string?
greybeard over 7 years

@LinMa: (Buge wasn't active lately) nxt[-1] means if the whole string does not match, what index we will be used to compare next no (contorted grammar, btw.). It is the index to compare next when all of the pattern matched and you want to find its next occurrence in a longer text. nxt[i] = p means pattern[i+1-p:i+1] equals pattern[0:p] (& != for p+1). nxt[-1] is the index to compare next if the "first" mismatch was "at len+1". (In many a presentation/implementation of KMP, there is a special value of -1 at index 0, with the n values as above "shifted to an index higher by one".)
Lin Ma over 7 years

Hi @greybeard, spend time today to debug your comments and Buge's code carefully, and you are correct, thanks for clarifying the meaning of nxt[i]=j. There is one thing I still mis-understand, len(inputv) - nxt[-1], this line checks the longest prefix, which matches suffix of the same string, smallPieceLen is the remaining part length, why if len(inputv) could be divided by smallPieceLen, then it is shortest repetitive string? Is there a simple prove or some explanation? Is it too aggressive to make the conclusion?
Lin Ma over 7 years

Maybe this is a question to both @Buge and greybeard. :)
greybeard over 7 years

@LinMa: (both are notified, anyway) Let me call len(inputv) len and nxt[-1] matchLen. If matchLen < smallPieceLen, the only chance for smallPieceLen to divide len is to be equal to it. If smallPieceLen ≤ matchLen, inputv[0:smallPieceLen] equals inputv[smallPieceLen:2*smallPieceLen], and k never got reset (again): inputv is made up of repetitions of inputv[0:smallPieceLen] - the divisibility check just ensures that it ends with a full repetition.
Lin Ma over 7 years

@greybeard, nice answer, I think about your reply for more than one day, but not sure how do you get this conclusion inputv[0:smallPieceLen] equals inputv[smallPieceLen:2*smallPieceLen], would you elaborate a bit more?
Buge over 7 years

The way I think about it is nxt[-1] is the max amount that the beginning of inputv can overlap with the end of itself (without overlapping the entire thing). For example in abcabcabc nxt[-1]==6 because the last 6 letters are the same as the first 6. I specifically like to think of it as the length of the last characters. So in the example it can be split into 2 parts: abc|abcabc where the first part has length smallPieceLen and the second part has length nxt[-1]. Because of the overlap knowledge, we know abc overlaps at the beginning with abcabc as far as the overlap or `abc
Axel over 6 years

Can someone translate this answer to a known programming language like java or C? It would be much appreciated to someone like me who doesn't understand this one. Thank you.
Axel over 6 years

what does: nxt = [0]*len(inputv) do?
Buge over 6 years

It's python which I consider a known programming language, the same language as the accepted answer. [0]*len(inputv) means create a python list (array) with the same length as inputv and with all elements being 0. Anyway, here it is in Java. C would be a little annoying because of its less flexible strings and arrays.
Alfonso Fernando Alvarez over 6 years

@ward (defn find-pattern-string [string] (let [pattern "" working-str string] (reduce #(if (not (or (empty? (clojure.string/split string (re-pattern %1))) (empty? (second (clojure.string/split string (re-pattern %1)))))) (str %1 %2) %1) pattern working-str)))
Boris Verkhovskiy about 5 years

This doesn't work the way I expected if the pattern ends abruptly. For example [1, 2, 1] should return [1, 2] but this code returns [1, 2, 1].
Boris Verkhovskiy about 5 years

@EyalSchneider returning "HELLO" for "HELLOHELL" makes perfect sense. Otherwise you could be using the length of the string to determine which patterns are even allowed. You only need to check patterns whose lengths are combinations of the prime factors (and 1) of the length of the input. For example if you get a string that's 101 characters long what's the point of even trying a 5 letter subsequence? It's not going to fit perfectly into 101 characters. If your input's length is 13 there's no point in even checking, no pattern (except a 1 letter pattern) repeated n times can ever add up to 13.
Eyal Schneider about 5 years

@Boris: The problem is finding the smallest sub-sequence of S such that K>=1 repetitions of it would result in S itself. The input "HELLOHELL" has no repeating subsequence with K>1, so "HELLOHELL" should be returned.
Buge about 5 years

@Boris I consider [1, 2] wrong. The question says "For example word abkebabkebabkeb is created by repeated abkeb word." But it's not possible to create [1, 2, 1] by repeating [1, 2]. You need to repeat it, then chop off the end to get [1, 2, 1]. But yes, the question is somewhat ambiguous about what the correct answer to [1, 2, 1] is.
Anikeev Gennadiy almost 4 years

I think this would fail for the string "ababcababc"
thisisjaymehta over 3 years

It will be helpful if you explain what you did
Himanshuman almost 2 years

This won't work for the string abaa where expected output is a
Kevin Cruijssen almost 2 years

@HimanshuPoddar If I read the question correctly, the full word is always build from multiple repeating substrings without any other characters, so abab (2x ab) or aaa (3x a) are both valid inputs, but abaa not (unless you count it as 1x abaa).
Himanshuman almost 2 years

Oh yeah, my bad I confused between questions, was reading similar question on the same lines. Thanks for the reply tho