Section 8.5. Log Analysis

Successful log analysis begins long before the need for it arises. It starts with the Apache installation, when you are deciding what to log and how. By the time something that requires log analysis happens, you should have the information to perform it.

If you are interested in log forensics, then Scan of the Month 31 (http://www.honeynet.org/scans/scan31/) is the web site you should visit. As an experiment, Ryan C. Barnett kept an Apache proxy open for a month and recorded every transaction in detail. It resulted in almost 300 MB of raw logs. The site includes several analyses of the abuse techniques seen in the logs.

Ensure all Apache installations are configured to log sufficient information, prior to any incidents.
Determine all the log files where relevant information may be located. The access log and the error log are the obvious choices, but many other potential logs may contain useful information: the suEXEC log, the SSL log (it's in the error log on Apache 2), the audit log, and possibly application logs.
The access log is likely to be quite large. You should try to remove the irrelevant entries (e.g., requests for static files) from it to speed up processing. Watch carefully what is being removed; you do not want important information to get lost.
In the access log, try to group requests to sessions, either using the IP address or a session identifier if it appears in logs. Having the unique id token in the access log helps a lot since you can perform access log analysis much faster than you could with the full audit log produced by mod_security. The audit log is more suited for looking at individual requests.
Do not forget the attacker could be working from multiple IP addresses. Attackers often perform reconnaissance from one point but attack from another.

Log analysis is a long and tedious process. It involves looking at large quantities of data trying to make sense out of it. Traditional Unix tools (e.g., grep, sed, awk, and sort) and the command line are very good for text processing and, therefore, are a good choice for log file processing. But they can be difficult to use with web server logs because such logs contain a great deal of information. The bigger problem is that attackers often utilize evasion methods that must be taken into account during analysis, so a special tool is required. I have written one such tool for this book: logscan.

logscan parses log lines and allows field names to be used with regular expressions. For example, the following will examine the access log and list all requests whose status code is 500:

The parameters are the name of the log file, the field name, and the pattern to be used for comparison. By default, logscan understands the following field names, listed in the order in which they appear in access log entries:

logscan also attempts to counter evasion techniques by performing the following operations against the request_uri field:

Decode URL-encoded characters.
Remove multiple occurrences of the slash character.
Remove self-referencing folder occurrences.
Detect null byte attacks.