How can I fix "TypeError: cannot serialize '_io.BufferedReader' object" error when trying to multiprocess


File handles don't serialize very well... But you can send the name of the zip file instead of the zip file handle (a string pickles fine between processes). Also avoid zip as a variable name, since it shadows the built-in; I've chosen zip_filename:

p = Process(target=extract_zip, args=(zip_filename, password))

then:

def extract_zip(zip_filename, password):
    try:
        zip_file = zipfile.ZipFile(zip_filename)
        zip_file.extractall(pwd=password)
        print(f"[+] Password for the .zip: {password.decode('utf-8')}")
    except Exception:
        # Wrong password: move on to the next candidate.
        print(f"Incorrect password: {password.decode('utf-8')}")

The other problem is that your code won't run in parallel because of this:

      p.start()
      p.join()

p.join waits for the process to finish before the loop continues, so only one process ever runs at a time... hardly useful. You have to store the process objects and join them all at the end, as sketched below.
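
A minimal sketch of that pattern, reusing the names from above (assuming txt_file is the opened word list, as in your code):

processes = []
for line in txt_file:
    p = Process(target=extract_zip, args=(zip_filename, line.strip()))
    p.start()  # start every worker first...
    processes.append(p)

for p in processes:
    p.join()  # ...and only then wait for them all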

That can cause another problem: creating too many processes in parallel may strain your machine, and it stops helping beyond some point. Consider a multiprocessing.Pool instead, to limit the number of workers.

A trivial example:

def f(x): return x * x  # stand-in worker, just for illustration

with multiprocessing.Pool(5) as p:
    print(p.map(f, [1, 2, 3, 4, 5, 6, 7]))

Adapted to your example:

with multiprocessing.Pool(5) as p:
    p.starmap(extract_zip, [(zip_filename, line.strip()) for line in txt_file])

(starmap unpacks each tuple into the two separate arguments that extract_zip expects, as explained in Python multiprocessing pool.map for multiple arguments.)
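
Putting it together, a minimal sketch of a reworked main (assuming the argparse setup and extract_zip from your question stay as they are; the pool size of 5 is arbitrary, use 8 if you want 8 workers):

import multiprocessing

def main(zip_filename, file):
    # Build the argument list first: only plain strings/bytes are
    # sent to the workers, so pickling succeeds even on Windows.
    with open(file, "rb") as txt_file:
        candidates = [(zip_filename, line.strip()) for line in txt_file]
    with multiprocessing.Pool(5) as p:
        p.starmap(extract_zip, candidates)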

Author: Arszilla
Updated on February 04, 2020

Comments

  • Arszilla, about 4 years ago

    I'm trying to switch the threading in my code to multiprocessing to measure its performance and hopefully achieve better brute-forcing potential, as my program is meant to brute-force password-protected .zip files. But whenever I try to run the program I get this:

    BruteZIP2.py -z "Generic ZIP.zip" -f Worm.txt
    Traceback (most recent call last):
      File "C:\Users\User\Documents\Jetbrains\PyCharm\BruteZIP\BruteZIP2.py", line 40, in <module>
        main(args.zip, args.file)
      File "C:\Users\User\Documents\Jetbrains\PyCharm\BruteZIP\BruteZIP2.py", line 34, in main
        p.start()
      File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\multiprocessing\process.py", line 112, in start
        self._popen = self._Popen(self)
      File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\multiprocessing\context.py", line 223, in _Popen
        return _default_context.get_context().Process._Popen(process_obj)
      File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\multiprocessing\context.py", line 322, in _Popen
        return Popen(process_obj)
      File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
        reduction.dump(process_obj, to_child)
      File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\multiprocessing\reduction.py", line 60, in dump
        ForkingPickler(file, protocol).dump(obj)
    TypeError: cannot serialize '_io.BufferedReader' object
    

    I did find threads with the same issue, but they were both unanswered/unsolved. I also tried inserting Pool above p.start(), as I believe this is caused by the fact that I am on a Windows-based machine, but it was no help. My code is as follows:

      import argparse
      from multiprocessing import Process
      import zipfile
    
      parser = argparse.ArgumentParser(description="Unzips a password protected .zip by performing a brute-force attack using either a word list, password list or a dictionary.", usage="BruteZIP.py -z zip.zip -f file.txt")
      # Creates -z arg
      parser.add_argument("-z", "--zip", metavar="", required=True, help="Location and the name of the .zip file.")
      # Creates -f arg
      parser.add_argument("-f", "--file", metavar="", required=True, help="Location and the name of the word list/password list/dictionary.")
      args = parser.parse_args()
    
    
      def extract_zip(zip_file, password):
          try:
              zip_file.extractall(pwd=password)
              print(f"[+] Password for the .zip: {password.decode('utf-8')} \n")
          except:
              # If a password fails, it moves to the next password without notifying the user. If all passwords fail, it will print nothing in the command prompt.
              print(f"Incorrect password: {password.decode('utf-8')}")
              # pass
    
    
      def main(zip, file):
          if (zip == None) | (file == None):
              # If the args are not used, it displays how to use them to the user.
              print(parser.usage)
              exit(0)
          zip_file = zipfile.ZipFile(zip)
          # Opens the word list/password list/dictionary in "read binary" mode.
          txt_file = open(file, "rb")
          for line in txt_file:
              password = line.strip()
              p = Process(target=extract_zip, args=(zip_file, password))
              p.start()
              p.join()
    
    
      if __name__ == '__main__':
          # BruteZIP.py -z zip.zip -f file.txt.
          main(args.zip, args.file)
    

    As I said before, I believe this is happening mainly because I am on a Windows-based machine right now. I shared my code with a few others who were on Linux-based machines and they had no problem running the code above.

    My main goal here is to get 8 processes/pool workers started to maximize the number of attempts compared to threading, but since I cannot find a fix for the TypeError: cannot serialize '_io.BufferedReader' object message, I am unsure what to do here or how to fix it. Any assistance would be appreciated.

  • Arszilla, about 5 years ago
    I tried pooling, but I don't think I got the right idea. I added pool = Pool(8) above p.start(), but that might not be the way to do it, right? If so, is there a good guide on it?
  • Arszilla, about 5 years ago
    Also, how should I do the start and join then? I looked up guides and documentation from various sources, and many of them did it that way. I also looked at the Python 3 documentation (this part specifically), but I'm unsure how to implement it here as my function (extract_zip) takes 2 args.
  • Arszilla, about 5 years ago
    I assume the with part would go where for line in txt_file is now? Removing the for and placing the with?
  • Jean-François Fabre, about 5 years ago
    Yes, remove the loop; the iteration is done in the argument generation.
  • Arszilla, about 5 years ago
    I took your reply and comments and edited the parts you told me to. As a result I got this. When I try to run it with a ZIP named "Generic ZIP" and a .txt with 5 numbers like 00001, 00214321, 0987654321 etc., I still get the same TypeError. I'm not sure what is wrong or why the error persists, as I removed p.start() and p.join() and replaced the whole for line in txt_file loop with with multiprocessing.Pool(5) as pool.
  • Jean-François Fabre, about 5 years ago
    Note that I've changed zip_file to zip_filename, which was zip in your code, but I don't want to use zip as it's a built-in. Pass the string, not the ZipFile object. Read my answer again.
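
    In other words, the fix is about what gets pickled and sent to the workers (a minimal sketch; zip_filename is the string from args.zip, txt_file the opened word list):

    # Fails: a ZipFile wraps an open file handle, which cannot be pickled.
    pool.starmap(extract_zip, [(zip_file, line.strip()) for line in txt_file])

    # Works: a plain string pickles fine; each worker reopens the zip itself.
    pool.starmap(extract_zip, [(zip_filename, line.strip()) for line in txt_file])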
  • Arszilla, about 5 years ago
    As the site is telling me to move the chatter here to a chat (even though I am 1 rep short), I'll ask some brief questions. In the original code, zip was used as an arg for def main(), i.e. def main(zip, file). Is that the zip you mean, as I don't see any other? Also, I am not entirely sure what you mean by "pass the string". Pass which string to where? Are you talking about def extract_zip?
  • Arszilla, about 5 years ago
    Works! I guess that solves it! One last question before I mark this post as answered: are there any 'dangers' to multiprocessing? Like the same entry being tried twice by any instance, or 'writing to the disk' as some others put it (I'm still not sure what that means)? Also, will Pool prevent the code from working on macOS or Linux by any chance?
  • Jean-François Fabre, about 5 years ago
    No, the same entry would not be done twice unless there's a bug in the input. And danger? Well, the only danger is: don't rely on multiprocessing until you've optimized your code very well. Python isn't very good at intensive computations; a compiled language could do better (run from Python with multithreading).