How to split a huge log file and get only what you want in Linux?
Today I encountered a task as follows:
I need to extract a 7.3GB Apache access log and ONLY grab the access log starting from 30 August and up to today.
Needless to say, opening a 7.3 GB file in a text editor is a non-starter; even vi under Linux eats a huge amount of RAM, and it simply doesn't work that way.
So to solve this problem, here are the steps I took, thanks to the following reference:
- http://stackoverflow.com/questions/3066948/how-to-file-split-at-a-line-number
- Find the line number of the first occurrence of the target date string in the log file. One matching line would be enough, but I like to print the first 5:
- grep -n "30/Aug/2015" access_log | head -n 5
- The returned line will be:
- 61445828:203.129.95.51 - - [30/Aug/2015:00:00:01 +0800] "GET <somewebsite>/index.htm HTTP/1.1" 200 10824
- The first field, 61445828, is the line number of the first match
- Count the total number of lines in the access log:
- wc -l access_log
- The output is: 64328208 access_log
- Now do this calculation: (total lines in access_log) - (line number of the first match) + 1 = number of lines to keep, counted from the end of the file. Here: 64328208 - 61445828 + 1 = 2882381 (the + 1 makes sure the first matching line itself is included)
- We export those last 2882381 lines to a new file (see the script sketch after this list for an automated version):
- tail -n 2882381 access_log > custom_log_file_for_analysis.log
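
For convenience, here is the whole procedure as one small script. This is a minimal sketch, not a polished tool: the file names and the date pattern are assumptions taken from the example above, so adjust them for your environment.

    #!/bin/sh
    # Assumed names from the example above: adjust LOG, PATTERN and OUT as needed.
    LOG=access_log
    PATTERN='30/Aug/2015'
    OUT=custom_log_file_for_analysis.log

    # Line number of the first match only (-m 1 stops grep at the first hit).
    start=$(grep -n -m 1 "$PATTERN" "$LOG" | cut -d: -f1)

    # Total number of lines (reading from stdin so wc prints only the number).
    total=$(wc -l < "$LOG")

    # Lines from the first match to the end of the file, inclusive.
    keep=$((total - start + 1))

    tail -n "$keep" "$LOG" > "$OUT"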
Now that file is just 414 MB, about 95% smaller than the original 7.3 GB file.
You can now do whatever analysis you want with this file.
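For example, a quick hypothetical analysis (assuming the standard Apache common/combined log format, where the timestamp is the fourth field) that counts requests per day:

    awk '{print $4}' custom_log_file_for_analysis.log | cut -d: -f1 | sort | uniq -c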
Hope it helps someone!