Parsing pcap files with dpkt (Python)

26,076

Solution 1

I have encountered the same problem while working with HTTP Requests and dpkt.

The problem is that the dpkt's HTTP headers parser uses wrong logic. This exception is raised when the HTTP doesn't end with \r\n\r\n. (And as you say, there are a lot of good packets with no \r\n\r\n at the end.)

Here is the bug report to your problem.

Solution 2

I've added an example to dpkt that parses and displays HTTP Headers. The docs can be found here: http://dpkt.readthedocs.io/en/latest/print_http_requests.html and the example code can be found in dpkt/examples/print_http_requests.py

# For each packet in the pcap process the contents
for timestamp, buf in pcap:

    # Unpack the Ethernet frame (mac src/dst, ethertype)
    eth = dpkt.ethernet.Ethernet(buf)

    # Make sure the Ethernet data contains an IP packet
    if not isinstance(eth.data, dpkt.ip.IP):
        print 'Non IP Packet type not supported %s\n' % eth.data.__class__.__name__
        continue

    # Now grab the data within the Ethernet frame (the IP packet)
    ip = eth.data

    # Check for TCP in the transport layer
    if isinstance(ip.data, dpkt.tcp.TCP):

        # Set the TCP data
        tcp = ip.data

        # Now see if we can parse the contents as a HTTP request
        try:
            request = dpkt.http.Request(tcp.data)
        except (dpkt.dpkt.NeedData, dpkt.dpkt.UnpackError):
            continue

        # Pull out fragment information (flags and offset all packed into off field, so use bitmasks)
        do_not_fragment = bool(ip.off & dpkt.ip.IP_DF)
        more_fragments = bool(ip.off & dpkt.ip.IP_MF)
        fragment_offset = ip.off & dpkt.ip.IP_OFFMASK

        # Print out the info
        print 'Timestamp: ', str(datetime.datetime.utcfromtimestamp(timestamp))
        print 'Ethernet Frame: ', mac_addr(eth.src), mac_addr(eth.dst), eth.type
        print 'IP: %s -> %s   (len=%d ttl=%d DF=%d MF=%d offset=%d)' % \
              (inet_to_str(ip.src), inet_to_str(ip.dst), ip.len, ip.ttl, do_not_fragment, more_fragments, fragment_offset)
        print 'HTTP request: %s\n' % repr(request)

Example Output

Timestamp:  2004-05-13 10:17:08.222534
Ethernet Frame:  00:00:01:00:00:00 fe:ff:20:00:01:00 2048
IP: 145.254.160.237 -> 65.208.228.223   (len=519 ttl=128 DF=1 MF=0 offset=0)
HTTP request: Request(body='', uri='/download.html', headers={'accept-language': 'en-us,en;q=0.5', 'accept-encoding': 'gzip,deflate', 'connection': 'keep-alive', 'keep-alive': '300', 'accept': 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,image/jpeg,image/gif;q=0.2,*/*;q=0.1', 'user-agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.6) Gecko/20040113', 'accept-charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.7', 'host': 'www.ethereal.com', 'referer': 'http://www.ethereal.com/development.html'}, version='1.1', data='', method='GET')

Timestamp:  2004-05-13 10:17:10.295515
Ethernet Frame:  00:00:01:00:00:00 fe:ff:20:00:01:00 2048
IP: 145.254.160.237 -> 216.239.59.99   (len=761 ttl=128 DF=1 MF=0 offset=0)
HTTP request: Request(body='', uri='/pagead/ads?client=ca-pub-2309191948673629&random=1084443430285&lmt=1082467020&format=468x60_as&output=html&url=http%3A%2F%2Fwww.ethereal.com%2Fdownload.html&color_bg=FFFFFF&color_text=333333&color_link=000000&color_url=666633&color_border=666633', headers={'accept-language': 'en-us,en;q=0.5', 'accept-encoding': 'gzip,deflate', 'connection': 'keep-alive', 'keep-alive': '300', 'accept': 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,image/jpeg,image/gif;q=0.2,*/*;q=0.1', 'user-agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.6) Gecko/20040113', 'accept-charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.7', 'host': 'pagead2.googlesyndication.com', 'referer': 'http://www.ethereal.com/download.html'}, version='1.1', data='', method='GET')

Solution 3

In your python code, before assignment ip=eth.data check it that whether the Ethernet type is IP or not. If the Ethernet type is not ip do nothing to that packet. And check whether IP protocol is TCP protocol

        To Check
               1. IP packet or not
               2. TCP protocol or not

modified your program code

 
............            
      eth=dpkt.ethernet.Ethernet(buf)          
      ip=eth.data  
      tcp=ip.data      
      ........   

as

    
............         
     eth=dpkt.ethernet.Ethernet(buf)  
     if eth.type!=2048: #For ipv4, dpkt.ethernet.Ethernet(buf).type =2048        
           continue         
     ip=eth.data
     if ip.p!=6:
           continue
     tcp=ip.data        
     .......
and see whether there is any error issue.        

with regard,
Irengbam Tilokchan Singh

Share:
26,076
Leif
Author by

Leif

Updated on July 09, 2022

Comments

  • Leif
    Leif almost 2 years

    I'm trying to parse a previously-captured trace for HTTP headers using the dpkt module:

    import dpkt
    import sys
    
    f=file(sys.argv[1],"rb")
    pcap=dpkt.pcap.Reader(f)
    
    
    for ts, buf in pcap:
      eth=dpkt.ethernet.Ethernet(buf)
      ip=eth.data
      tcp=ip.data
    
    if tcp.dport==80 and len(tcp.data)>0:
        try:
            http=dpkt.http.Request(tcp.data)
            print http.uri
        except:
            print 'issue'
            continue
    
    
      f.close()
    

    While it seems to effectively parse most of the packets, I'm receiving a NeedData("premature end of headers") exception on some. They appear to be valid packets within WireShark, so I'm a bit confused as to why the exceptions are being thrown.

    Some output:

    /ec/fd/ls/GlinkPing.aspx?IG=4a06eefebcc1495f8f4de7cb41f0ce5c&CID=2265e1228f3451ff8011dcbe5e0cdff7&ID=API.YAds%2C5037.1&1307036510547
    issue
    issue #misses one packet here, two exceptions
    /?ld=4vyO5h1FkjCNjBpThUTGnzF50sB7QUGL0Ok8YefDTWNmO6RXghgDqHXtcp1OqeXATbCAHliIkglLj95-VEwG6ZJN3fblgd3Lh5NvTp4mZPcBGXUyKqXn9FViBAsmt1T96oumpCL5gm7gZ3qlZqSdLNUWjpML_9I8FvB2TLKPSYcJmb_VwwvJhiHpiUIvrjRdzqdVVnuQZVjQmZIIlfaMq0LOmgew_plopjt7hYvOSzBi3VJl4bqOBVk3zdhIvgZK0SfJp3kEWTXAr2_UU_q9KHBpSTnvuhY2W1xo3K2BOHKGk1VAlMiWtWC_nUaJdZmhzzWfb6yRAmY3M9YkUzFGs9z10-70OszkkNpVMSS3-p7xsNXQnC3Zpaxks
    

    Help is appreciated; perhaps an alternative library recommendation is needed.