How to split a huge log file and get only what you want in Linux?

Today I encountered the following task:

I need to extract from a 7.3 GB Apache access log ONLY the entries from 30 August up to today.

Needless to say, it makes no sense to open a 7.3 GB file in a text editor. Even vi under Linux uses a huge amount of RAM on a file this size, and it just doesn't work that way.

So to solve this problem, here are the steps I took, thanks to the following reference:

  1. Find the first occurrence of the exact string in the log file and print the first 5 matching lines. Actually I could print just one line, but I like 5:
    1. grep -n "30/Aug/2015" access_log | head -n 5
  2. The first returned line will be:
    1. 61445828: - - [30/Aug/2015:00:00:01 +0800] "GET <somewebsite>/index.htm HTTP/1.1" 200 10824
    2. The first item, 61445828, is the line number of that entry in the log file
  3. Count the total number of lines in the access log:
    1. wc -l access_log
  4. The output is: 64328208 access_log
  5. Now do this calculation: (total lines in access_log - line number of the first match) + 1 = number of lines to keep from the end of the file. The + 1 makes sure the first matching line itself is included: 64328208 - 61445828 + 1 = 2882381
  6. Export the last 2882381 lines to a new file:
    1. tail -n 2882381 access_log > custom_log_file_for_analysis.log
The new file is just 414 MB, about 94% smaller than the original.
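
The steps above can be tried out safely on a tiny sample before running them against the real log. Here is a minimal sketch of that idea. The sample log contents, the temporary directory and the variable names are all made up for illustration. It also assumes a grep that supports -m (GNU grep on Linux does). Note that this sketch adds 1 to the difference so that the first matching line itself ends up in the output, and at the end it shows `tail -n +N` as an alternative that needs no subtraction at all.

```shell
#!/bin/sh
# Sketch only: the file names and log lines below are hypothetical sample data.
workdir=$(mktemp -d)
log="$workdir/access_log"

# Build a 5-line sample log: two lines from 29 Aug, three from 30 Aug onward.
cat > "$log" <<'EOF'
- - [29/Aug/2015:23:59:58 +0800] "GET /a.htm HTTP/1.1" 200 100
- - [29/Aug/2015:23:59:59 +0800] "GET /b.htm HTTP/1.1" 200 100
- - [30/Aug/2015:00:00:01 +0800] "GET /index.htm HTTP/1.1" 200 10824
- - [30/Aug/2015:00:00:02 +0800] "GET /c.htm HTTP/1.1" 200 100
- - [31/Aug/2015:00:00:03 +0800] "GET /d.htm HTTP/1.1" 200 100
EOF

# Step 1: line number of the first match (-m 1 stops after the first hit).
start=$(grep -n -m 1 "30/Aug/2015" "$log" | cut -d: -f1)

# Step 3: total number of lines (reading stdin so wc prints no file name).
total=$(wc -l < "$log")

# Step 5: number of lines to keep, counting the first matching line itself.
keep=$((total - start + 1))

# Step 6: export the last $keep lines to a new file.
tail -n "$keep" "$log" > "$workdir/custom_log_file_for_analysis.log"

# Alternative: tail -n +N starts printing at line N, so no arithmetic is needed.
tail -n "+$start" "$log" > "$workdir/alt.log"

echo "start=$start total=$total keep=$keep"
# The sample files live in $workdir; remove that directory when done.
```

On the real log you would replace the sample file with access_log and the search string with the first date you want to keep. Both tail commands should produce the same file, so which one to use is mostly a matter of taste.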

You can now do whatever analysis you want with this file.

Hope it helps someone!

