how to find offset of one binary file inside another?
Solution 1
I could not come up with an existing tool.
grep -F --binary --byte-offset --only-matching
seems to be close enough - but you can't escape newlines with -F
.
And cmp
only allows to skip characters. diff
also does not seem to be of much help.
But it is a few liner in a programming language with a decent library. For example as a C++ program using Boost:
#include <boost/algorithm/string/find.hpp>
#include <boost/iostreams/device/mapped_file.hpp>
#include <cassert>
#include <iostream>
using namespace boost;
using namespace boost::algorithm;
using namespace boost::iostreams;
using namespace std;
int main(int argc, char **argv)
{
if (argc != 3) {
cerr << "Call: " << argv[0] << " PATTERN_FILE SRC_FILE\n";
return 3;
}
mapped_file_source pattern(argv[1]);
mapped_file_source src(argv[2]);
iterator_range<const char*> p_range(pattern.data(),
pattern.data() + pattern.size());
iterator_range<const char*> s_range(src.data(), src.data() + src.size());
iterator_range<const char*> result = find_first(s_range, p_range);
if (result) {
size_t pos = result.begin()-s_range.begin();
cout << pos << '\n';
return 0;
}
return 1;
}
You can compile it like this (when the program source is saved as find.cc
):
$ make CXXFLAGS="-Wall -g" LDLIBS="-lboost_iostreams" searchb
To test it:
$ dd if=WTF_-_EPISODE_277_RACHAEL_HARRIS.mp3 of=t skip=232323 bs=1 count=4K
$ ls -l t
-rw-r--r-- 1 juser users 4096 2012-05-31 15:24 t
$ ./searchb t WTF_-_EPISODE_277_RACHAEL_HARRIS.mp3
232323
The output is the matching position in the source file.
If the file is not contained the exit status is 1
.
Update: In the meantime I've implemented this simple tool in several languages (C/C++/Python/Rust/Go) and have included those implementations in my utility repository. Look for searchb*
. The Python implementation is the shortest one and doesn't require any external dependencies.
Solution 2
We do this in bioinformatics all the time - except we also want partial matches and we want to know how well they matched.
BLAT is the fastest solution I know: https://en.wikipedia.org/wiki/BLAT_(bioinformatics)
It builds an index and after building the index is ridiculously fast.
Cyryl Płotnicki
Updated on September 18, 2022Comments
-
Cyryl Płotnicki over 1 year
I have two binary files.
One of few hundreds kilos and other of few gigabytes.
I want to know whether the whole, smaller, file is contained within the larger one and if so then what is the offset from the start of the larger file.
I am interested only in exact matches i.e. whether the whole file is contained by the another.
Both files are binary.
Is there any existing tool/one-liner that does that ?-
Behrooz almost 12 yearsI don't have the code but you may be able to do with CINT and strstr() or similar function
-
-
Cyryl Płotnicki almost 12 yearsthanks a lot. does it load the whole pattern file into memory ?
-
Cyryl Płotnicki almost 12 yearsalso it does this: terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<std::exception> >' what(): std::exception Aborted (core dumped)
-
maxschlepzig almost 12 years@CyrylPlotnicki-Chudyk, not really, it (the kernel) memory-maps the file into virtual memory. Thus it depends on the kernel's virtual memory system (and your available resources) how much of the pattern is loaded into memory during the execution of the program.
-
maxschlepzig almost 12 years@CyrylPlotnicki-Chudyk, the error means that mapping the file into memory failed. Perhaps a virtual memory limit is reached or the filesystem does not support memory mapping. How large are your files exactly? What distribution and what architecture are you using (64/32 bit)?
-
Cyryl Płotnicki almost 12 yearsHey, thanks for your reply ! Actually it's Fedora 16, 32bit, with ~500kb pattern and 5gb haystack files
-
maxschlepzig almost 12 years@CyrylPlotnicki-Chudyk, ok, that explains it - 5gb exceeds the virtual address space of a process on a 32 Bit system (which is limited to at most 4GB, or 2GB, 3GB depending on your kernel). Thus the memory mapping is not possible. On a 64 bit system you don't have that limit. Thus, if the hardware has a 64 Bit CPU you should upgrade to a 64 Bit version of your distribution (also for other reasons). One can of course change the program such that the big file is not mapped but read as a stream. I will look into that later.
-
Cyryl Płotnicki almost 12 yearsI though of writing simple KMP search in C, will look into that later on, just though that it's just a matter of me not knowing of the existence of the simple tool
-
maxschlepzig almost 12 yearsTo port the above code to a 32 bit system one has to write a custom input-iterator that reads characters from an input stream - quite less elegant then the above solution that uses memory-mapped IO both for pattern and text.