how to find offset of one binary file inside another?

5,118

Solution 1

I could not come up with an existing tool.

grep -F --binary --byte-offset --only-matching seems to be close enough - but you can't escape newlines with -F. And cmp only allows to skip characters. diff also does not seem to be of much help.

But it is a few liner in a programming language with a decent library. For example as a C++ program using Boost:

#include <boost/algorithm/string/find.hpp>
#include <boost/iostreams/device/mapped_file.hpp>
#include <cassert>
#include <iostream>
using namespace boost;
using namespace boost::algorithm;
using namespace boost::iostreams;
using namespace std;

int main(int argc, char **argv)
{
  if (argc != 3) {
    cerr << "Call: " << argv[0] << " PATTERN_FILE SRC_FILE\n";
    return 3;
  }
  mapped_file_source pattern(argv[1]);
  mapped_file_source src(argv[2]);
  iterator_range<const char*> p_range(pattern.data(),
      pattern.data() + pattern.size());
  iterator_range<const char*> s_range(src.data(), src.data() + src.size());
  iterator_range<const char*> result = find_first(s_range, p_range);
  if (result) {
    size_t pos = result.begin()-s_range.begin();
    cout << pos << '\n';
    return 0;
  }
  return 1;
}

You can compile it like this (when the program source is saved as find.cc):

$ make CXXFLAGS="-Wall -g" LDLIBS="-lboost_iostreams" searchb

To test it:

$ dd if=WTF_-_EPISODE_277_RACHAEL_HARRIS.mp3 of=t skip=232323 bs=1 count=4K
$ ls -l t
-rw-r--r-- 1 juser users 4096 2012-05-31 15:24 t
$ ./searchb t WTF_-_EPISODE_277_RACHAEL_HARRIS.mp3
232323

The output is the matching position in the source file.

If the file is not contained the exit status is 1.

Update: In the meantime I've implemented this simple tool in several languages (C/C++/Python/Rust/Go) and have included those implementations in my utility repository. Look for searchb*. The Python implementation is the shortest one and doesn't require any external dependencies.

Solution 2

We do this in bioinformatics all the time - except we also want partial matches and we want to know how well they matched.

BLAT is the fastest solution I know: https://en.wikipedia.org/wiki/BLAT_(bioinformatics)

It builds an index and after building the index is ridiculously fast.

Share:
5,118
Cyryl Płotnicki
Author by

Cyryl Płotnicki

Updated on September 18, 2022

Comments

  • Cyryl Płotnicki
    Cyryl Płotnicki over 1 year

    I have two binary files.
    One of few hundreds kilos and other of few gigabytes.
    I want to know whether the whole, smaller, file is contained within the larger one and if so then what is the offset from the start of the larger file.
    I am interested only in exact matches i.e. whether the whole file is contained by the another.
    Both files are binary.
    Is there any existing tool/one-liner that does that ?

    • Behrooz
      Behrooz almost 12 years
      I don't have the code but you may be able to do with CINT and strstr() or similar function
  • Cyryl Płotnicki
    Cyryl Płotnicki almost 12 years
    thanks a lot. does it load the whole pattern file into memory ?
  • Cyryl Płotnicki
    Cyryl Płotnicki almost 12 years
    also it does this: terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail‌​::error_info_injecto‌​r<std::exception> >' what(): std::exception Aborted (core dumped)
  • maxschlepzig
    maxschlepzig almost 12 years
    @CyrylPlotnicki-Chudyk, not really, it (the kernel) memory-maps the file into virtual memory. Thus it depends on the kernel's virtual memory system (and your available resources) how much of the pattern is loaded into memory during the execution of the program.
  • maxschlepzig
    maxschlepzig almost 12 years
    @CyrylPlotnicki-Chudyk, the error means that mapping the file into memory failed. Perhaps a virtual memory limit is reached or the filesystem does not support memory mapping. How large are your files exactly? What distribution and what architecture are you using (64/32 bit)?
  • Cyryl Płotnicki
    Cyryl Płotnicki almost 12 years
    Hey, thanks for your reply ! Actually it's Fedora 16, 32bit, with ~500kb pattern and 5gb haystack files
  • maxschlepzig
    maxschlepzig almost 12 years
    @CyrylPlotnicki-Chudyk, ok, that explains it - 5gb exceeds the virtual address space of a process on a 32 Bit system (which is limited to at most 4GB, or 2GB, 3GB depending on your kernel). Thus the memory mapping is not possible. On a 64 bit system you don't have that limit. Thus, if the hardware has a 64 Bit CPU you should upgrade to a 64 Bit version of your distribution (also for other reasons). One can of course change the program such that the big file is not mapped but read as a stream. I will look into that later.
  • Cyryl Płotnicki
    Cyryl Płotnicki almost 12 years
    I though of writing simple KMP search in C, will look into that later on, just though that it's just a matter of me not knowing of the existence of the simple tool
  • maxschlepzig
    maxschlepzig almost 12 years
    To port the above code to a 32 bit system one has to write a custom input-iterator that reads characters from an input stream - quite less elegant then the above solution that uses memory-mapped IO both for pattern and text.