Log file to Pandas Dataframe

20,390

Solution 1

You can use read_csv with separator \s*\[ - whitespaces with [:

import pandas as pd
from pandas.compat import StringIO

temp=u"""LogLevel    [13/10/2015 00:30:00.650]  [Message Text]
LogLevel    [13/10/2015 00:30:00.650]  [Message Text]
LogLevel    [13/10/2015 00:30:00.650]  [Message Text]
LogLevel    [13/10/2015 00:30:00.650]  [Message Text]"""
#after testing replace StringIO(temp) to filename
df = pd.read_csv(StringIO(temp), sep="\s*\[", names=['Level','Time','Text'], engine='python')

Then remove ] by strip and convert column Time to_datetime:

df.Time = pd.to_datetime(df.Time.str.strip(']'), format='%d/%m/%Y %H:%M:%S.%f')
df.Text = df.Text.str.strip(']')

print (df)
      Level                    Time          Text
0  LogLevel 2015-10-13 00:30:00.650  Message Text
1  LogLevel 2015-10-13 00:30:00.650  Message Text
2  LogLevel 2015-10-13 00:30:00.650  Message Text
3  LogLevel 2015-10-13 00:30:00.650  Message Text

print (df.dtypes)
Level            object
Time     datetime64[ns]
Text             object
dtype: object

Solution 2

I had to parse mine manually since my separator showed up in my message body and the message body would span multiple lines as well, eg if an exception were thrown from my Flask application and the stack track recorded.

Here's my log creation format...

logging.basicConfig( filename="%s/%s_MyApp.log" % ( Utilities.logFolder , datetime.datetime.today().strftime("%Y%m%d-%H%M%S")) , level=logging.DEBUG, format="%(asctime)s,%(name)s,%(process)s,%(levelno)u,%(message)s", datefmt="%Y-%m-%d %H:%M:%S" )

And the parsing code in my Utilities module

Utilities.py

import re
import pandas

logFolder = "./Logs"

logLevelToString = { "50" : "CRITICAL",
                     "40" : "ERROR"   ,
                     "30" : "WARNING" ,
                     "20" : "INFO"    ,
                     "10" : "DEBUG"   ,
                     "0"  : "NOTSET"  } # https://docs.python.org/3.6/library/logging.html#logging-levels

def logFile2DataFrame( filePath ) :
    dfLog = pandas.DataFrame( columns=[ 'Timestamp' , 'Module' , 'ProcessID' , 'Level' , 'Message' ] )
    tsPattern = "^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},"

    with open( filePath , 'r' ) as logFile :
        numRows = -1
        for line in logFile :
            if re.search( tsPattern , line ) :
                tokens    = line.split(",")
                timestamp = tokens[0]
                module    = tokens[1]
                processID = tokens[2]
                level     = logLevelToString[ tokens[3] ]
                message   = ",".join( tokens[4:] )
                numRows += 1
                dfLog.loc[ numRows ] = [ timestamp , module , processID , level , message ]
            else :
                # Multiline message, integrate it into last record
                dfLog.loc[ numRows , 'Message' ] += line
    return dfLog

I actually created this helper message to allow me to view my logs directly from my Flask app as I have a handy template that renders a DataFrame. Should accelerate debugging a bunch since encasing the flaskapp in a Tornado WSGI server prevents the display of the normal debug page visible from Flask when an exception gets thrown. If anyone knows how to restore that functionality in such a usage please share.

Share:
20,390
ukbaz
Author by

ukbaz

Updated on July 26, 2022

Comments

  • ukbaz
    ukbaz almost 2 years

    I have log files, which have many lines in the form of :

    LogLevel    [13/10/2015 00:30:00.650]  [Message Text]
    

    My goal is to convert each line in the log file into a nice Data frame. I have tired to do that, by splitting the lines on the [ character, however I am still not getting a neat dataframe.

    My code:

    level = []
    time = []
    text = []
    
       with open(filename) as inf:
         for line in inf:
           parts = line.split('[')
             if len(parts) > 1:  
               level = parts[0]
               time = parts[1]
               text = parts[2]
            print (parts[0],parts[1],parts[2])
    
     s1 = pd.Series({'Level':level, 'Time': time, 'Text':text})
     df = pd.DataFrame(s1).reset_index()
    

    Heres my printed Data frame:

    Info      10/08/16 10:56:09.843]   In Function CCatalinaPrinter::ItemDescription()]
    
    Info      10/08/16 10:56:09.843]   Sending UPC Description Message ]
    

    How can I improve this to strip the whitespace and the other ']' character

    Thank you

    • ukbaz
      ukbaz over 7 years
      @atkawa7 nope that didnt work
  • ukbaz
    ukbaz over 7 years
    works perfectly until I replace the temp with the filename, perhaps because my file is a .log and not .csv ?
  • jezrael
    jezrael over 7 years
    need replace StringIO(temp) to filenam.log, it is not problem it can endswith .log
  • Vasu
    Vasu almost 4 years
    What if my log file in below format: LogLevel [13/10/2015 00:30:00.650] [Message Text1] [Message Text1A] [Message Text1B] LogLevel [13/10/2015 00:30:00.650] [Message Text2] LogLevel [13/10/2015 00:30:00.650] [Message Text3]