Detect EOL type using PHP

Solution 1

/**
 * Detects the end-of-line character of a string.
 * @param string $str The string to check.
 * @param string $default Default EOL (if not detected).
 * @return string The detected EOL, or default one.
 */
function detectEol($str, $default=''){
    // Byte sequences to look for. The [UNICODE] entries assume UTF-8 input;
    // below U+0080 they are byte-identical to the [ASCII] entries.
    static $eols = array(
        "\r\n",         // [UNICODE] CR+LF: CR (U+000D) followed by LF (U+000A)
        "\n",           // [UNICODE] LF: Line Feed, U+000A
        "\x0B",         // [UNICODE] VT: Vertical Tab, U+000B
        "\x0C",         // [UNICODE] FF: Form Feed, U+000C
        "\r",           // [UNICODE] CR: Carriage Return, U+000D
        "\xC2\x85",     // [UNICODE] NEL: Next Line, U+0085
        "\xE2\x80\xA8", // [UNICODE] LS: Line Separator, U+2028
        "\xE2\x80\xA9", // [UNICODE] PS: Paragraph Separator, U+2029
        "\x0D\x0A",     // [ASCII] CR+LF: Windows, TOPS-10, RT-11, CP/M, MP/M, DOS, Atari TOS, OS/2, Symbian OS, Palm OS
        "\x0A\x0D",     // [ASCII] LF+CR: BBC Acorn, RISC OS spooled text output.
        "\x0A",         // [ASCII] LF: Multics, Unix, Unix-like, BeOS, Amiga, RISC OS
        "\x0D",         // [ASCII] CR: Commodore 8-bit, BBC Acorn, TRS-80, Apple II, Mac OS <=v9, OS-9
        "\x1E",         // [ASCII] RS: QNX (pre-POSIX)
        //"\x76",       // [?????] NEWLINE: ZX80, ZX81 [DEPRECATED]
        "\x15",         // [EBCDIC] NEL: OS/390, OS/400
    );
    $cur_cnt = 0;
    $cur_eol = $default;
    foreach($eols as $eol){
        if(($count = substr_count($str, $eol)) > $cur_cnt){
            $cur_cnt = $count;
            $cur_eol = $eol;
        }
    }
    return $cur_eol;
}

Notes:

  • Needs to check encoding type
  • Needs to somehow know that we may be on an exotic system like ZX8x (since ASCII 0x76 is a regular letter, "v"). @radu raised a good point; in my case, it's not worth the effort to handle ZX8x systems nicely.
  • Should I split the function into two: mb_detect_eol() (multibyte) and detect_eol()?
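
One possible way to act on the first note, sketched under assumptions: only look for the multibyte Unicode separators when the input is valid UTF-8, so that their byte sequences are not matched inside text in some other encoding. The name detectEolUtf8Aware() and the shortened candidate list are illustrative and not part of the function above. preg_match('//u', ...) serves as the validity check because it returns 1 only for well-formed UTF-8.

```php
<?php
// Hypothetical variant of detectEol(): multibyte candidates are UTF-8 only.
function detectEolUtf8Aware(string $str, string $default = ''): string
{
    // Single-byte candidates, longest sequences first
    $eols = array("\r\n", "\n\r", "\n", "\r");

    // Add NEL, LS and PS (as UTF-8 bytes) only if the input is valid UTF-8
    if (preg_match('//u', $str) === 1) {
        array_push($eols, "\xC2\x85", "\xE2\x80\xA8", "\xE2\x80\xA9");
    }

    $bestCount = 0;
    $bestEol = $default;
    foreach ($eols as $eol) {
        $count = substr_count($str, $eol);
        if ($count > $bestCount) {
            $bestCount = $count;
            $bestEol = $eol;
        }
    }
    return $bestEol;
}
```

Because the comparison is strict, an earlier candidate wins a tie, which is why "\r\n" is listed before "\n" and "\r".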

Solution 2

Wouldn't it be easier to just replace everything except new lines using regex?

The dot matches any single character, whatever it is. The only exceptions are newline characters.

With that in mind, we do some magic:

$string = 'some string with new lines';
$newlines = preg_replace('/.*/', '', $string);
// $newlines is now filled with new lines, we only need one
$newline = substr($newlines, 0, 1);

Not sure if we can trust regex to do all this, but I don't have anything to test with.
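
A hedged sketch of one way to make this idea handle CR+LF. By default PCRE usually treats only \n as a newline, so the dot also consumes \r; the (*ANYCRLF) verb makes the dot stop at CR, LF and CR+LF alike. Taking the first complete sequence with preg_match(), instead of one byte with substr(), keeps a two-character EOL intact. Both function names are made up for illustration.

```php
<?php
// Strip every character that is not part of a line ending.
function stripNonNewlines(string $string): string
{
    return (string) preg_replace('/(*ANYCRLF)./', '', $string);
}

// Return the first complete line-ending sequence, or '' if there is none.
// "\r\n" is tried first so CR+LF is not cut down to a lone CR.
function firstNewline(string $string): string
{
    return preg_match('/\r\n|\r|\n/', $string, $m) === 1 ? $m[0] : '';
}
```

firstNewline() works on the original string rather than on the stripped one, so neighbouring line endings of different kinds are not merged into a sequence that never occurred in the text.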

Solution 3

The answers already given here provide enough information. The following code (based on those answers) might help even more:

  • It returns the EOL sequence that was found.
  • The detection also sets a key that an application can use to refer to that EOL.
  • It shows how to use this in a utility class.
  • It shows how to detect the EOL of a file, returning the key name of the EOL found.
  • I hope this is of use to all of you.
    /**
    Newline characters in different Operating Systems
    The names given to the different sequences are:
    ==========================================================================================
    NewL  Chars       Name     Description
    ----- ----------- -------- ---------------------------------------------------------------
    LF    0x0A        UNIX     Apple OSX, UNIX, Linux
    CR    0x0D        TRS80    Commodore, Acorn BBC, ZX Spectrum, TRS-80, Apple II family, etc.
    LFCR  0x0A 0x0D   ACORN    Acorn BBC and RISC OS spooled text output
    CRLF  0x0D 0x0A   WINDOWS  Microsoft Windows, DEC TOPS-10, RT-11 and most other early
                               non-Unix and non-IBM OSes, CP/M, MP/M, DOS (MS-DOS, PC DOS,
                               etc.), OS/2
    ----- ----------- -------- ---------------------------------------------------------------
    */
    const EOL_UNIX    = 'lf';        // Code: \n
    const EOL_TRS80   = 'cr';        // Code: \r
    const EOL_ACORN   = 'lfcr';      // Code: \n \r
    const EOL_WINDOWS = 'crlf';      // Code: \r \n
    

    Then use the following code in a static utility class (named Util here) to detect the EOL:

    /**
    Detects the end-of-line sequence of a string.
    @param string $str      The string to check.
    @param string $key      [out] Key name of the detected EOL ('' if none is found).
    @return string The detected EOL, or '' if none is found.
    */
    public static function detectEOL($str, &$key) {
       static $eols = array(
         Util::EOL_ACORN   => "\n\r",  // 0x0A - 0x0D - acorn BBC
         Util::EOL_WINDOWS => "\r\n",  // 0x0D - 0x0A - Windows, DOS OS/2
         Util::EOL_UNIX    => "\n",    // 0x0A -      - Unix, OSX
         Util::EOL_TRS80   => "\r",    // 0x0D -      - Apple ][, TRS80
      );
    
      $key = "";
      $curCount = 0;
      $curEol = '';
      foreach($eols as $k => $eol) {
         if( ($count = substr_count($str, $eol)) > $curCount) {
            $curCount = $count;
            $curEol = $eol;
            $key = $k;
         }
      }
      return $curEol;
    }  // detectEOL
    

    and then for a file:

    /**
    Detects the EOL of a file by checking its first line.
    @param string  $fileName    File to be tested (full pathname).
    @return string|false  false on failure, otherwise the key: 'cr', 'lf' or 'crlf' ('' if none is found).
    @uses detectEOL
    */
    public static function detectFileEOL($fileName) {
       if (!file_exists($fileName)) {
         return false;
       }
    
       // Open the file and read its first line
       $handle = @fopen($fileName, "r");
       if ($handle === false) {
          return false;
       }
       $line = fgets($handle);
       fclose($handle);
       if ($line === false) {
          return false;   // empty or unreadable file
       }
       $key = "";
       self::detectEOL($line, $key);
    
       return $key;
    }  // detectFileEOL
    

    Both methods are static members of the same class. The code refers to that class as Util (see Util::EOL_*); rename it to match your own implementation class.
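
Putting the three fragments together, a self-contained sketch might look like the following. It keeps the names used in this answer (Util, EOL_*, detectEOL, detectFileEOL) and the first-line approach; closing the handle, the empty-file check and the temporary file in the usage part are additions made for this sketch.

```php
<?php
final class Util
{
    const EOL_UNIX    = 'lf';    // \n
    const EOL_TRS80   = 'cr';    // \r
    const EOL_ACORN   = 'lfcr';  // \n\r
    const EOL_WINDOWS = 'crlf';  // \r\n

    /** Returns the most frequent EOL in $str and stores its key name in $key. */
    public static function detectEOL($str, &$key)
    {
        $eols = array(
            self::EOL_ACORN   => "\n\r",
            self::EOL_WINDOWS => "\r\n",
            self::EOL_UNIX    => "\n",
            self::EOL_TRS80   => "\r",
        );
        $key = '';
        $curCount = 0;
        $curEol = '';
        foreach ($eols as $k => $eol) {
            $count = substr_count($str, $eol);
            if ($count > $curCount) {
                $curCount = $count;
                $curEol = $eol;
                $key = $k;
            }
        }
        return $curEol;
    }

    /** Returns the EOL key for the first line of a file, or false on failure. */
    public static function detectFileEOL($fileName)
    {
        if (!is_file($fileName)) {
            return false;
        }
        $handle = @fopen($fileName, 'rb');
        if ($handle === false) {
            return false;
        }
        $line = fgets($handle);
        fclose($handle);
        if ($line === false) {
            return false; // empty or unreadable file
        }
        $key = '';
        self::detectEOL($line, $key);
        return $key;
    }
}

// Usage with a throwaway file
$tmp = tempnam(sys_get_temp_dir(), 'eol');
file_put_contents($tmp, "first line\r\nsecond line\r\n");
echo Util::detectFileEOL($tmp), PHP_EOL; // prints crlf
unlink($tmp);
```

Since fgets() stops at the first "\n", an LF+CR file is reported as 'lf' by the file variant; only detectEOL() called on a larger string can return 'lfcr'.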

    Solution 4

    I could not get either ohaal's or transilvlad's answer to work, so here is mine:

    function detect_newline_type($content) {
        $arr = array_count_values(
                   explode(
                       ' ',
                       preg_replace(
                           '/[^\r\n]*(\r\n|\n|\r)/',
                           '\1 ',
                           $content
                       )
                   )
               );
        arsort($arr);
        return key($arr);
    }
    

    Explanation:

    The general idea in both proposed solutions is good, but implementation details hinder the usefulness of those answers.

    Indeed, the point of this function is to return the kind of newline used in a file, and that newline can be either one or two characters long.

    This alone makes str_split() the wrong tool. The only way to cut the tokens correctly is to use a function that splits a string into variable-length pieces based on a delimiter. That is where explode() comes into play.

    But to give explode() useful markers, the regular expression has to replace the right characters, in the right amount, with the right match. That is where most of the magic happens.

    Three points have to be considered:

    1. Using .* as suggested by ohaal will not work. While it is true that . will not match newline characters, on a system where \r is not a newline character (or part of one), . will incorrectly match it. (Reminder: we are detecting newlines because they could differ from the ones on our system; otherwise there is no point.)
    2. Replacing /[^\r\n]*/ with anything will "work" to make the text vanish, but becomes a problem as soon as we want a separator: since we remove all characters but the newlines, any character that isn't a newline is a valid separator. Hence the idea of capturing the newline in the match and using a backreference to it in the replacement.
    3. The content may contain several newlines in a row. We do not want to group them, since the rest of the code would see each group as a different type of newline. That is why the alternatives are listed explicitly in the capture group.
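
For comparison, a minimal sketch of the same counting idea without a separator character: preg_match_all() collects every newline token directly and the tokens are then counted. This is an alternative formulation, not the function above. The name detect_newline_type_by_tokens() is made up for illustration, the sketch assumes PHP 7.3+ (for array_key_first()), and it returns null when the content contains no newline.

```php
<?php
function detect_newline_type_by_tokens(string $content): ?string
{
    // "\r\n" is listed first so a CR+LF pair is captured as a single token
    preg_match_all('/\r\n|\n|\r/', $content, $matches);

    // Count how often each token occurs, most frequent first
    $counts = array_count_values($matches[0]);
    arsort($counts);

    // First key is the dominant newline; null for an empty array
    return array_key_first($counts);
}
```

Text after the last newline never becomes a token here, so content such as words separated by spaces cannot end up in the count.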

    Solution 5

    Based on ohaal's answer.

    This can return a one- or two-character EOL, such as LF or CR+LF.

      $eols = array_count_values(str_split(preg_replace("/[^\r\n]/", "", $string)));
      $eola = array_keys($eols, max($eols));
      $eol = implode("", $eola);
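
Wrapped in a function, with a guard for input that has no newline at all (max() of an empty array is an error in PHP 8), the snippet might look like the sketch below. The function name is hypothetical. Because CR and LF are counted separately, a CR+LF text that contains even one extra lone LF is reported as LF.

```php
<?php
function dominantEolChars(string $string): string
{
    // Keep only CR and LF characters
    $stripped = preg_replace("/[^\r\n]/", "", $string);
    if ($stripped === null || $stripped === '') {
        return '';
    }

    // Count CR and LF separately
    $counts = array_count_values(str_split($stripped));

    // Join every character that reaches the highest count, in order of first appearance
    return implode('', array_keys($counts, max($counts), true));
}
```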
    
    Author: Christian

    I push buttons for a living.

    Updated on October 03, 2022

    Comments

    • Christian, over 1 year ago

      Reference: This is a self-answered question. It was meant to share the knowledge, Q&A style.

      How do I detect the type of end of line character in PHP?

      PS: I've been writing this code from scratch for too long now, so I decided to share it on SO. Plus, I'm sure someone will find ways to improve it.

    • KingCrunch, almost 12 years ago
      Are you sure about mixing encodings? At least 0A appears twice. @Alexander The source is linked in the question. Christian just wanted to ask a question that he wants to answer himself.
    • Chibueze Opata, almost 12 years ago
      So which one is "/r/n"? Doesn't the server have a way of taking care of whichever environment it is operating on?
    • Christian, almost 12 years ago
      Alexander, I've answered my own question. See my note in the main question. KingCrunch, to be honest, I didn't think about that. Chibueze Opata, \r\n is ASCII CR+LF (Windows). If it wasn't obvious, my code aims to find the EOL of any string, even if it came from another server, a client or a remote database. PHP is completely oblivious to what your client browser is using as EOL.
    • KingCrunch, almost 12 years ago
      What about "mixed line endings"? To me it does not feel unusual for a vertical tab and a regular line feed to appear in the same file as a paragraph separator. And this code snippet silently assumes that every file is well formed.
    • Christian, almost 12 years ago
      Hmm, that's a good point. It should cater for cases where different EOL types might exist. Then again, I'll have to check which of them make sense to co-exist.
    • rid, almost 12 years ago
      @Christian, also, are you sure 0x1E, 0x76 and 0x15 can't be part of a multibyte character? Maybe it would be a good idea to leave these out if you're not convinced that they're going to be useful (the OSs mentioned look pretty old).
    • Christian, almost 12 years ago
      @Radu Wikipedia seems to claim so. I don't have an IBM mainframe nor a Sinclair ZX8x at hand to check. :D
    • rid, almost 12 years ago
      @Christian, what I mean is, even if they are indeed EOL on these platforms, they might also be part of a UTF-8 character, for example. So if the document contains that character, you would erroneously find that it contains an EOL when in fact it doesn't. For example, there is the Unicode character "latin capital letter sharp s", which has the code U+1E9E. If the document contained this character, your code would conclude that it contains an EOL instead of the "sharp s" character, because you're looking for 0x1E, which is part of the "sharp s" character.
    • rid, almost 12 years ago
      @Christian, if these (very) old systems are not a primary concern, better safe than sorry, I think. Otherwise, maybe try to determine the document's encoding before applying this method.
    • transilvlad, over 10 years ago
      What if you have mixed content? For example, the first few lines end in CR+LF and the rest in LF? I need something that tells me which line ending is used primarily.
    • ohaal, over 10 years ago
      Interesting question. I'm not even sure if my theory works, but if it does, this might work for you, returning the most used newline: $arr = array_count_values(str_split($newlines));arsort($arr);return key($arr);
    • transilvlad, over 10 years ago
      Sorry, it does not work: if the entire document has CR+LF, it returns LF.
    • Richard - Rogue Wave Limited, over 7 years ago
      By default regex considers 'newline' to be only \n. (This can be changed with build options.) However, I did find a regex that will work above instead of the '/.*/', and it is '/(*ANYCRLF)./'. There is a very good article about regex and line endings here: nikic.github.io/2011/12/10/PCRE-and-newlines.html
    • Noel Whitemore, about 6 years ago
      This worked for me. To test whether a script has been saved with Windows or Unix line endings you just need to call strlen() on the string sent back by this function (2 = Windows CR+LF, 1 = Unix LF).
    • Kiser, over 5 years ago
      Interesting subject and interesting discussion. Curious, though, whether we could have a case where the real EOL is two characters (CR+LF, for example) but a lone CR or LF is found elsewhere in the document. Then this lone character will have a higher occurrence count than the real EOL. Should we not, in this case, have a way to give priority to the two-character solution even though the single character has a higher count? Shoot me down if I'm way off base; I have thick skin. :-)
    • Sorin Trimbitas, over 3 years ago
      Check my solution. Why do you care about what is inside a line?