I would like to get some ideas.
My situation: my Linux server holds a large number of big log files, and each of them contains a lot of unrelated entries. I would like to extract ONLY the login lines with their timestamps and ONLY the email addresses from the logs and collect them in a .txt file.
An example log:
[...]
2019-07-21 03:13:06.939 login
[things not needed between the two]
(mail=>[email protected]< method=>email< cmd=>login<)
[...]
An example output:
************** 2019-07-21 **************
2019-07-21 03:13:06.939 login
[email protected]
2019-07-21 06:22:19.424 login
[email protected]
2019-07-21 12:10:23.665 login
[email protected]
2019-07-21 14:26:19.068 login
[email protected]
************** 2019-07-22 **************
2019-07-22 08:01:50.157 login
[email protected]
2019-07-22 08:12:35.504 login
[email protected]
2019-07-22 09:10:35.416 login
[email protected]
To achieve this I am using this right now:
for i in $(ls); do echo "" && printf "************** " && cat $i | head -c 10 && printf " **************\n"; while read line; do echo $line | grep "login"; echo "$line" | grep -h -o -P '(?<=mail=>).*?(?=<)'; done < $i; done >> ../logins.txt
The for loop is going through the files, cat $i | head -c 10 will get the date (because that is the first thing in every log), the while loop is reading the file line-by-line and greps login and ONLY the mail address (grep between "mail=>" "<"). And at the end it is outputting to logins.txt.
While this works, I find it very slow because it executes a lot of commands (and we are talking about 2 years of logs here). It also looks really dirty.
I am sure there is a more efficient way to do this, but I can't figure out what it would be.
CodePudding user response:
awk would do a nice job of this. You can tell it to print the line only when the line matches a particular regex. Something like:
awk '$0~/[0-9]{4}-[0-9]{2}-[0-9]{2}|\(mail=>/{print $0}' * > output.log
Updated: Noticed you just want the email. In that case, two blocks will suffice. In the second block we split on the characters < or > and then retrieve the email from index 2 of the resulting array.
awk '$1~/^[0-9]{4}-[0-9]{2}-[0-9]{2}/{print $0}$1~/^\(mail=>/{split($1,a,"[<>]");print a[2]}' * > output.log
This awk says:
- If the first field (where fields are delimited by awk's default of whitespace) of the row we are reading starts with a date of format nnnn-nn-nn: $1~/^[0-9]{4}-[0-9]{2}-[0-9]{2}/
- Then print the entire line: {print $0}
- If the first field of the row we are reading starts with the characters (mail=>: $1~/^\(mail=>/
- Then split the first field on either of the characters < or > into an array named a: split($1,a,"[<>]")
- Then print the 2nd item in the array (index 2, since arrays built by split start at 1): print a[2]
- For all of the files in this current directory: *
- Instead of printing to the command line, send the output to a file: > output.log
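To see why index 2 holds the address, here is a minimal check of the split step on a single made-up line. The address user@example.com is a placeholder chosen for illustration, not a value from the real logs.

```shell
# Hypothetical sample line; the address is a placeholder.
line='(mail=>user@example.com< method=>email< cmd=>login<)'

# For the first field "(mail=>user@example.com<", split() on [<>] yields
# a[1]="(mail=", a[2]="user@example.com", a[3]="".
email=$(printf '%s\n' "$line" |
    awk '$1 ~ /^\(mail=>/ { split($1, a, "[<>]"); print a[2] }')

echo "$email"
```

This relies on the address itself containing no spaces, since only the first whitespace-delimited field is split.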
CodePudding user response:
If there's no other way to grab the date than the first 10 characters of the logfiles, then at least you can simplify the grep part:
for logfile in ./*
do
    printf '************** %s **************\n' "$(head -c 10 "$logfile")"
    grep -h -o -P '.* login$|(?<=mail=>)[^<]*' "$logfile"
    echo
done
But the best approach would be to write the whole thing in a single language like perl/awk/ruby/python.
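As one possible single-language version, here is a sketch in awk. It assumes that every login line starts with the date as its first field and ends in "login", and that the address sits between "mail=>" and the next "<". Under those assumptions the header can be taken from the login line's own date instead of the first 10 bytes of each file. The function name extract_logins is made up for this example.

```shell
# extract_logins is a hypothetical helper name; pass the log files as arguments.
extract_logins() {
    awk '
        # Login line: starts with a date and ends in "login".
        /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] .* login *$/ {
            if ($1 != day) {    # first login of a new date: print the header once
                day = $1
                printf "************** %s **************\n", day
            }
            print
            next
        }
        # Mail line: keep only the text between "mail=>" and the next "<".
        match($0, /mail=>[^<]*/) {
            print substr($0, RSTART + 6, RLENGTH - 6)
        }
    ' "$@"
}
```

It could be called as extract_logins ./* > ../logins.txt, which starts a single awk process for all files rather than several processes per line.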
CodePudding user response:
With awk, use -F to set a field separator that isolates the mail account:
sep='************************'
awk -v sep="$sep" -F '(mail=>|<)' '
FNR==1 { printf("%s %s %s\n", sep, substr($0,1,10), sep)}
/mail=>/ {print $2}
/login *$/ {print}
' *
When you have additional requirements and want to use a loop, consider sed:
for f in *; do
    sed -nr '
        # Line 1: print the date header, then restore the line so a login on line 1 is kept.
        1{h; s/(.{10}).*/********* \1 **********/p; g;}
        /login *$/p
        s/.*mail=>([^<]*).*/\1/p
    ' "${f}"
done
