What could cause the file command in Linux to report a text file as binary data?

Solution 1

I found the issue using binary search to locate the problematic lines.

# Split file.cpp into two halves by line count
half=$(( $(wc -l < file.cpp) / 2 ))
head -n "$half" file.cpp > a.txt
tail -n +"$((half + 1))" file.cpp > b.txt

Running file against each half, and repeating the process on whichever half was still reported as data, helped me locate the offending line. I found a Control+P (^P) character embedded in it. Removing it solved the problem. I'll write myself a Perl script to search for these characters (and other extended characters) in the future.
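For anyone who would rather skip the manual bisection, a direct scan for stray control bytes is one option. The sketch below is in Python rather than the Perl mentioned above, and it is only a rough illustration: the function names are made up for this example, and the set of "harmless" control bytes (tab, LF, CR) is an assumption, not the exact table that file uses internally.

```python
import sys

# Assumption: tab, line feed and carriage return are the only control
# bytes treated as harmless here. file's real rules may differ.
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}


def find_control_bytes(data: bytes):
    """Return (line, column, byte) for each suspicious control byte (1-based)."""
    hits = []
    for line_no, line in enumerate(data.split(b"\n"), start=1):
        for col, byte in enumerate(line, start=1):
            if (byte < 0x20 and byte not in ALLOWED_CONTROLS) or byte == 0x7F:
                hits.append((line_no, col, byte))
    return hits


def caret_notation(byte: int) -> str:
    """Render a control byte in caret form, e.g. 0x10 becomes '^P'."""
    return "^?" if byte == 0x7F else "^" + chr(byte ^ 0x40)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Read in binary mode so no decoding step hides the raw bytes.
    with open(sys.argv[1], "rb") as handle:
        for line_no, col, byte in find_control_bytes(handle.read()):
            print(f"line {line_no}, column {col}: 0x{byte:02x} ({caret_notation(byte)})")
```

Saved under a name of your choosing and run with the source file as its only argument, it prints one line per suspicious byte, which should point straight at something like the ^P described above.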

A big thanks to everyone who provided an answer for all the tips!

Solution 2

Vim tries very hard to make sense of whatever you throw at it without complaining. This makes it a relatively poor tool to use to diagnose file's output.

Vim's "[converted]" notice indicates there was something in the file that vim wouldn't expect to see in the text encoding suggested by your locale settings (LANG etc).

Others have already suggested

  • cat -v
  • xxd

You could try grepping for non-ASCII characters.

  • grep -P '[\x7f-\xff]' filename

The other possibility is non-standard line-endings for the platform (i.e. CRLF or CR) but I'd expect file to cope with that and report "DOS text file" or similar.
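The two checks above (bytes outside plain ASCII, and unusual line endings) can also be sketched in Python. This is a rough illustration with hypothetical function names; the byte range simply mirrors the grep pattern above, and nothing here reproduces file's own logic.

```python
import re

# Same byte range as the grep pattern above: DEL plus everything with the high bit set.
NON_ASCII = re.compile(rb"[\x7f-\xff]")


def count_line_endings(data: bytes) -> dict:
    """Count CRLF, bare LF and bare CR line terminators in raw bytes."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        "LF": data.count(b"\n") - crlf,  # LFs that are not part of a CRLF
        "CR": data.count(b"\r") - crlf,  # CRs that are not part of a CRLF
    }


def non_ascii_lines(data: bytes) -> list:
    """Return 1-based numbers of lines containing a byte in 0x7f-0xff."""
    return [
        line_no
        for line_no, line in enumerate(data.splitlines(), start=1)
        if NON_ASCII.search(line)
    ]
```

A file written on Windows would normally show only CRLF in the first result; a mix of styles, or any hits from the second function, gives you specific lines to inspect with cat -v or xxd.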

Solution 3

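If you run file -D filename, file displays debugging information, including the tests it performs. (This answer was written for the file included with OS X; some Linux builds use -d instead, and some have neither flag.) Near the end, it will show which test succeeded in determining the file type.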

For a regular text file, it looks like this:

[31> 0 regex,=^package[ \t]+[0-9A-Za-z_:]+ *;,""]
1 == 0 = 0
ascmagic 1
filename.txt: ISO-8859 text, with CRLF line terminators

This tells you which test led file to its conclusion about the file's type.

Author: Jonah Bishop

Updated on September 18, 2022

Comments

  • Jonah Bishop
    Jonah Bishop over 1 year

    I have a couple of C++ source files (one .cpp and one .h) that are being reported as type data by the file command in Linux. When I run the file -bi command against these files, I'm given this output (same output for each file):

    application/octet-stream; charset=binary
    

    Each file is clearly plain-text (I can view them in vi). What's causing file to misreport the type of these files? Could it be some sort of Unicode thing? Both of these files were created in Windows-land (using Visual Studio 2005), but they're being compiled in Linux (it's a cross-platform application).

    Any ideas would be appreciated.

    Update: I don't see any null characters in either file. I found some extended characters in the .cpp file (in a comment block), removed them, but file still reports the same encoding. I've tried forcing the encoding in SlickEdit, but that didn't seem to have an effect. When I open the file in vim, I see a [converted] line as soon as I open the file. Perhaps I can get vim to force the encoding?

  • Jonah Bishop
    Jonah Bishop about 12 years
    That's an interesting tip. I've run both files through xxd, and I see no BOM in the first character position. Each file starts out with a giant comment block, so I see a bunch of slashes to start.
  • GodEater
    GodEater about 12 years
    Care to share an excerpt?
  • Jonah Bishop
    Jonah Bishop about 12 years
    This search resulted in a comment block containing some extended characters in my .cpp file. However, I don't see any similar characters in the .h...
  • garyjohn
    garyjohn about 12 years
    I updated my answer to include searching for nulls as Mehrdad suggested.
  • Jonah Bishop
    Jonah Bishop about 12 years
    I don't see any null characters in either file. :(
  • Jonah Bishop
    Jonah Bishop about 12 years
    I don't see a -D option in my file install (v5.04)...
  • garyjohn
    garyjohn about 12 years
    Try -d instead. That works with file-5.03 as installed on Fedora 11.
  • HikeMike
    HikeMike about 12 years
    Notifying @JonahBishop about garyjohn's comment. My post was written for the file included with OS X. My Debian 6 has neither -d nor -D though...
  • Jonah Bishop
    Jonah Bishop about 12 years
    The -d flag works for me, but there's so much output I'm not sure what to look for...