Parsing Apache access logs with Python
I operate an Apache web server. Occasionally I want to see what’s happening on that web server, e.g., which documents are being viewed, what links folks are following to this site, etc. Beyond these basics, I also want to know when my web server is being probed by hackers for vulnerabilities that they might exploit. I haven’t found any free or commercial log analyzer that does exactly what I want.
So I decided to write my own, in Python of course, because that’s my hack-tools-quickly language of choice and has been for a while now. One of the basic things that any Apache log analyzer needs is a bit of code that can parse the Apache access log format. After a superficial bit of Googling, I didn’t see any libraries with a clean enough interface.
Given that this sort of thing isn’t rocket science, I spent a few minutes blasting something out. The resulting module is called apachelogs. I tried to make the API for apachelogs as simple and straightforward as possible. For instance, the following code is all that is required to open a log file, count the number of 40x responses therein, and print the result.
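import apachelogs

if __name__ == '__main__':
    # Open the log file; iterating over it yields one parsed line at a time.
    alf = apachelogs.ApacheLogFile('data/access.log.1')
    num_40xs = 0
    for log_line in alf:
        # http_response_code is a string, so a prefix test catches 400-409.
        if log_line.http_response_code.startswith('40'):
            num_40xs += 1
    print("Saw %d 40x responses." % num_40xs)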
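For a sense of what sits underneath an interface like this, here is a minimal, standalone sketch of parsing access log lines with a regular expression. It is not the apachelogs source. It assumes the log uses Apache's Common or Combined Log Format, and apart from http_response_code every name in it (LOG_LINE_RE, parse_line, count_40x, SAMPLE_LINES, and the other field names) is made up for illustration.

```python
import re

# Assumed layout (Combined Log Format):
#   host ident user [time] "request" status size "referrer" "user agent"
# The trailing referrer/user-agent pair is optional so that plain
# Common Log Format lines match as well.
QUOTED = r'(?:[^"\\]|\\.)*'  # quoted field body, allowing backslash escapes
LOG_LINE_RE = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>' + QUOTED + r')" '
    r'(?P<http_response_code>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referrer>' + QUOTED + r')" "(?P<user_agent>' + QUOTED + r')")?'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_LINE_RE.match(line)
    return match.groupdict() if match else None


def count_40x(lines):
    """Count the lines whose status code starts with '40'."""
    count = 0
    for line in lines:
        fields = parse_line(line)
        if fields and fields['http_response_code'].startswith('40'):
            count += 1
    return count


# Hand-written sample lines, not taken from a real server.
SAMPLE_LINES = [
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
    '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"',
    '127.0.0.1 - - [10/Oct/2000:13:55:40 -0700] '
    '"GET /missing.html HTTP/1.0" 404 209 "-" "Mozilla/4.08"',
    '127.0.0.1 - - [10/Oct/2000:13:55:41 -0700] '
    '"GET /secret HTTP/1.0" 403 - "-" "-"',
]

print("Saw %d 40x responses." % count_40x(SAMPLE_LINES))
```

Lines that don't match the pattern come back as None and are skipped rather than raising an error, which seems like the more forgiving choice for logs that may contain junk written by probing clients.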