How to find the period of a string

12,847

Solution 1

You can do this in linear time and constant additional space by inductively computing the period of each prefix of the string. I can't recall the details (there are several things to get right), but you can find them in Section 13.6 of "Text algorithms" by Crochemore and Rytter under function Per(x).

Solution 2

Let me assume that the length of the string n is at least twice greater than the period p.

Algorithm

  1. Let m = 1, and S the whole string
  2. Take m = m*2
    • Find the next occurrence of the substring S[:m]
    • Let k be the start of the next occurrence
    • Check if S[:k] is the period
    • if not go to 2.

Example

Suppose we have a string

CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC

For each power m of 2 we find repetitions of first 2^m characters. Then we extend this sequence to it's second occurrence. Let's start with 2^1 so CD.

CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC
CDCD   CDCD   CDCD   CDCD   CD

We don't extend CD since the next occurrence is just after that. However CD is not the substring we are looking for so let's take the next power: 2^2 = 4 and substring CDCD.

CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC
CDCD   CDCD

Now let's extend our string to the first repetition. We get

CDCDFBF

we check if this is periodic. It is not so we go further. We try 2^3 = 8, so CDCDFBFC

CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC
CDCDFBFC      CDCDFBFC      

we try to extend and we get

CDCDFBFCDCDFDF

and this indeed is our period.

I expect this to work in O(n log n) with some KMP-like algorithm for checking where a given string appears. Note that some edge cases still should be worked out here.

Intuitively this should work, but my intuition failed once on this problem already so please correct me if I'm wrong. I will try to figure out a proof.

A very nice problem though.

Solution 3

You can build a suffix tree for the entire string in linear time (suffix tree is easy to look up online), and then recursively compute and store the number of suffix tree leaves (occurences of the suffix prefix) N(v) below each internal node v of the suffix tree. Also recursively compute and store the length of each suffix prefix L(v) at each node of the tree. Then, at an internal node v in the tree, the suffix prefix encoded at v is a repeating subsequence that generates your string if N(v) equals the total length of the string divided by L(v).

Share:
12,847

Related videos on Youtube

Aditya
Author by

Aditya

Updated on October 22, 2022

Comments

  • Aditya
    Aditya over 1 year

    I take a input from the user and its a string with a certain substring which repeats itself all through the string. I need to output the substring or its length AKA period.

    Say

    S1 = AAAA // substring is A
    S2 = ABAB // Substring is AB
    S3 = ABCAB // Substring is ABC
    S4 = EFIEFI // Substring is EFI
    

    I could start with a Single char and check if it is same as its next character if it is not, I could do it with two characters then with three and so on. This would be a O(N^2) algo. I was wondering if there is a more elegant solution to this.

    • takendarkk
      takendarkk over 10 years
      Why is S2 substring not AB?
    • Evgeny Kluev
      Evgeny Kluev over 10 years
      For string "abababxxxxabababxxxx" tortoise and hare gives you period 2 while actual period is 10.
  • takendarkk
    takendarkk over 10 years
    Could you show a code example of how this would work for a couple inputs?
  • Aditya
    Aditya over 10 years
    Dude. Read the paragraph of text below the example in my question :|
  • Mehmet Sedat Güngör
    Mehmet Sedat Güngör over 10 years
    Oh I didn't see that. But since you said it has to include a cycle, that method's complexity will be O(N) because you are just comparing first character with the whole string in the worst case scenario. @csmckelvey Pseudocode will be like this: char first = input[0]; String repeatingString = first; int i = 1; char nextChar = input[i]; while(first!=nextChar) { repeatingString += nextChar; i++; nextChar = input[i]; }
  • Aditya
    Aditya over 10 years
    Its O(N^2) in the worst case cause it would first check with 1 character then 2 characters than 3 .. till N/2 so it is 1+2+3+..N/2 => (N/2)(N/2+1)/2 Dropping all constants you have N^2
  • Mehmet Sedat Güngör
    Mehmet Sedat Güngör over 10 years
    @Aditya No, with the above method, it will check first character with the second one, then third one etc. So it will be 1+1+1.. and in the worst case scenerio, it will be N(length of input) times character comparing. You are not comparing A with AB, and then A with ABC. You are comparing A with B and then A with C. So it will be O(N). Just look at above pseudocode I provided. It will loop maximumly for input string's length.
  • Łukasz Kidziński
    Łukasz Kidziński over 10 years
    Sure, the pseudocode is O(n) but it returns just a string of repeated first character.
  • tmyklebu
    tmyklebu over 10 years
    The point of KMP preprocessing is that it computes suffix-prefix matches. You can read the period of the string off from the last entry of the failure table.
  • tmyklebu
    tmyklebu over 10 years
    You certainly can do this with a suffix tree, but that will be orders of magnitude slower than doing KMP preprocessing.
  • Łukasz Kidziński
    Łukasz Kidziński over 10 years
    Well, here I meant KMP as an O(n) algorithm for finding a pattern in a string.
  • Łukasz Kidziński
    Łukasz Kidziński over 10 years
    +1 for a good reference. The algorithm is really there, page 340. Doesn't seem like an interview solution though :)
  • user2566092
    user2566092 over 10 years
    @tmyklebu How do you figure that the constants for suffix trees are orders of magnitude larger than other algorithms? That means like constants of 1000. A suffix tree certainly does not have hidden constants of 1000.
  • Mehmet Sedat Güngör
    Mehmet Sedat Güngör over 10 years
    @ŁukaszKidziński No, next characters until the match is found will be appended to String. So, it will have repeated cycle String value in it. Look at the code, it appends the nextChar of input to String until a match found.
  • tmyklebu
    tmyklebu over 10 years
    I haven't actually seen an implementation that doesn't. You're welcome to link me to your favourite implementation and I can compare against a good implementation of the linear-time, constant-space algorithm I referenced, but suffix trees being a factor of 1000 slower really wouldn't surprise me.
  • Bernhard Barker
    Bernhard Barker over 10 years
    The pseudo-code doesn't work, consider ababacababac (the correct substring is ababac, not ab). And if you modify it so that it does work, it will no longer be O(n).
  • Mehmet Sedat Güngör
    Mehmet Sedat Güngör over 10 years
    @Dukeling I repeat, this is not a general solution for finding a substring that repeats. OP mentioned that "every character in the input string is part of the repeating substring" and gave some examples, so that was a specific solution. Tortoise and Hare will also give wrong output for that case. So it has to be bigger than O(N), no different thoughts.
  • Peter Taylor
    Peter Taylor over 6 years
    A reference to a list of errata would also help, because there are at least three errors in the subroutine Next_MS and at least two more in Per.
  • R.. GitHub STOP HELPING ICE
    R.. GitHub STOP HELPING ICE over 5 years
    @PeterTaylor: If you're aware of those errors, can you clarify what they are? Leaving a comment on an old answer to an old question saying that the answer has errors but without any clue what those errors might be is mildly helpful but also extremely frustrating.
  • Peter Taylor
    Peter Taylor over 5 years
    @R, looking briefly at the text, in the signature of Next_MS I suspect x[1…m] should be x[1...n] (since m is not used and n is used without definition); then end should almost certainly be then begin; ms := j; j := ms+1; looks suspicious; in Per there's a then which should probably be then begin and an else end which should probably be else begin. But I'm not certain that those are the errors I saw 15 months ago, and an official errata list would in any case be far more valuable than my opinion.