How to read binary files in Python using NumPy?

41,865

If you want to make an image out of a binary file, you need to read it in as integer, not float. Currently, the most common format for images is unsigned 8-bit integers.

As an example, let's make an image out of the first 10,000 bytes of /bin/bash:

>>> import numpy as np
>>> import cv2
>>> xbash = np.fromfile('/bin/bash', dtype='uint8')
>>> xbash.shape
(1086744,)
>>> cv2.imwrite('bash1.png', xbash[:10000].reshape(100,100))

In the above, we used the OpenCV library to write the integers to a PNG file. Any of several other imaging libraries could have been used.

This what the first 10,000 bytes of bash "looks" like:

enter image description here

Share:
41,865
Suyash Shetty
Author by

Suyash Shetty

Updated on September 29, 2020

Comments

  • Suyash Shetty
    Suyash Shetty over 3 years

    I know how to read binary files in Python using NumPy's np.fromfile() function. The issue I'm faced with is that when I do so, the array has exceedingly large numbers of the order of 10^100 or so, with random nan and inf values.

    I need to apply machine learning algorithms to this dataset and I cannot work with this data. I cannot normalise the dataset because of the nan values.

    I've tried np.nan_to_num() but that doesn't seem to work. After doing so, my min and max values range from 3e-38 and 3e+38 respectively, so I could not normalize it.

    Is there any way to scale this data down? If not, how should I deal with this?

    Thank you.

    EDIT:

    Some context. I'm working on a malware classification problem. My dataset consists of live malware binaries. They are files of the type .exe, .apk etc. My idea is store these binaries as a numpy array, convert to a grayscale image and then perform pattern analysis on it.

    • maxymoo
      maxymoo over 7 years
      try pandas to read the file
    • Suyash Shetty
      Suyash Shetty over 7 years
      @maxymoo can pandas read binary files?
    • maxymoo
      maxymoo over 7 years
      what do you mean "binary file"? can you add a sample of the file to your post?
    • hpaulj
      hpaulj over 7 years
      What kinds of numbers do you expect? Have you tried specifying the dtype parameter? It's using the default float.
    • hpaulj
      hpaulj over 7 years
      OK, the values don't represent real numbers; no wonder the float version looks odd. Try a dtype like np.uint8, which represents each byte as number between 0 and 255.
  • Suyash Shetty
    Suyash Shetty over 7 years
    This worked! I was reading the binary files as floats, which caused the error. I read it as uint8 and it worked fine. Thanks