Is there an effective and fast way to catch two matches in a log file?

I would like to get some ideas.

My situation: there are tons of large logs on my Linux server, and each of them contains a lot of unrelated entries. I would like to catch ONLY the login lines (with their timestamps) and ONLY the email addresses from the logs and collect them in a .txt file.

An example log:

[...]
2019-07-21 03:13:06.939 login 
[things not needed between the two]
(mail=>[email protected]< method=>email< cmd=>login<)
[...]

An example output:

************** 2019-07-21 **************
2019-07-21 03:13:06.939 login
[email protected]
2019-07-21 06:22:19.424 login
[email protected]
2019-07-21 12:10:23.665 login
[email protected]
2019-07-21 14:26:19.068 login
[email protected]

************** 2019-07-22 **************
2019-07-22 08:01:50.157 login
[email protected]
2019-07-22 08:12:35.504 login
[email protected]
2019-07-22 09:10:35.416 login
[email protected]

To achieve this I am using this right now:

for i in $(ls); do echo "" && printf "************** " && cat $i | head -c 10 && printf " **************\n"; while read line; do echo $line | grep "login"; echo "$line" | grep -h -o -P '(?<=mail=>).*?(?=<)'; done < $i; done >> ../logins.txt

The for loop goes through the files; cat $i | head -c 10 gets the date (because that is the first thing in every log); the while loop reads the file line by line and greps for login and for ONLY the mail address (the text between "mail=>" and "<"). At the end, the output is appended to logins.txt.

While this works, I find it very slow because it executes a lot of commands for every single line. (And we are talking about 2 years of logs here.) It also looks really dirty.

I am fairly sure there is a more efficient way to do this, but I can't figure out what it would be.

CodePudding user response：

awk would do a nice job of this. You can tell it to print the line only when the line matches a particular regex. Something like:

 awk '$0~/[0-9]{4}-[0-9]{2}-[0-9]{2}|\(mail=>/{print $0}' * > output.log

Updated: I noticed you just want the email. In that case, two blocks will suffice. In the second block we split on the characters < or > and then retrieve the email from index 2 of the resulting array.

 awk '$1~/^[0-9]{4}-[0-9]{2}-[0-9]{2}/{print $0}$1~/^\(mail=>/{split($1,a,"[<>]");print a[2]}' * > output.log

This awk says:

If the first field of the row we are reading (fields are delimited by awk's default separator, runs of spaces or tabs) starts with a date of format nnnn-nn-nn: $1~/^[0-9]{4}-[0-9]{2}-[0-9]{2}/
Then print the entire line {print $0}
If the first field of the row we are reading starts with the characters (mail=>: $1~/^\(mail=>/
Then split the first field by either characters < or > into an array named a: split($1,a,"[<>]")
Then print the 2nd item in the array (split() numbers its elements from 1, so a[1] is (mail= and a[2] is the address): print a[2]
For all of the files in this current directory: *
Instead of printing to the command line, send the output to a file: > output.log
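
To see what split() actually puts into the array, here is a minimal sketch that dumps every element for one sample line. The address user@example.com and the helper name show_split are placeholders made up for illustration; they are not taken from the real logs.

```shell
# Hypothetical sample line; user@example.com is a placeholder address.
sample='(mail=>user@example.com< method=>email< cmd=>login<)'

# Print every element that split() produces from the first field.
show_split() {
  printf '%s\n' "$1" | awk '{
    n = split($1, a, "[<>]")          # split field 1 on "<" or ">"
    for (i = 1; i <= n; i++)          # awk arrays from split() start at 1
      printf "a[%d]=%s\n", i, a[i]
  }'
}

show_split "$sample"
```

The second element, a[2], is the one holding the address, which is why the command above prints a[2].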

CodePudding user response：

If there's no other way to grab the date than the first 10 characters of the logfiles, then at least you can simplify the grep part:

for logfile in ./*
do
    printf '************** %s **************\n' "$(head -c 10 "$logfile")"
    grep -h -o -P '.* login$|(?<=mail=>)[^<]*' "$logfile"
    echo
done

But the best would be to write the whole thing with a single language like perl/awk/ruby/python.
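
As a rough sketch of that idea in awk: the version below does not read the date from the first 10 characters of each file, but takes it from the login lines themselves and starts a new section whenever the date changes. It assumes every login line begins with its date and ends with the word login (as in the example log); the function name extract_logins is made up for illustration.

```shell
# Hypothetical helper: print date headers, login lines, and mail addresses.
extract_logins() {
  awk '
    # Login lines: the first field is assumed to be the date (YYYY-MM-DD).
    /login *$/ {
      if ($1 != day) {                 # date changed: start a new section
        if (day != "") print ""        # blank line between sections
        printf "************** %s **************\n", $1
        day = $1
      }
      sub(/ +$/, "")                   # drop trailing blanks
      print
      next
    }
    # Mail lines: keep only the text between "mail=>" and the next "<".
    match($0, /mail=>[^<]*/) {
      print substr($0, RSTART + 6, RLENGTH - 6)   # 6 = length of "mail=>"
    }
  ' "$@"
}
```

It could be called as extract_logins ./* > ../logins.txt, which starts a single awk process for all files instead of several processes per line.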

CodePudding user response：

With awk, use -F to set a field separator that isolates the mail address:

sep='************************'
awk -v sep="$sep" -F '(mail=>|<)' '
  FNR==1 { printf("%s %s %s\n", sep, substr($0,1,10), sep)}
  /mail=>/ {print $2}
  /login *$/ {print}
' *
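
To illustrate how that -F separator carves up a mail line, the following sketch prints every field awk sees. The sample address and the helper name show_fields are placeholders for illustration only.

```shell
# Hypothetical sample line; user@example.com is a placeholder address.
sample='(mail=>user@example.com< method=>email< cmd=>login<)'

# Print each field produced by the separator regex (mail=>|<).
show_fields() {
  printf '%s\n' "$1" | awk -F '(mail=>|<)' '{
    for (i = 1; i <= NF; i++) printf "$%d=[%s]\n", i, $i
  }'
}

show_fields "$sample"
```

Everything before mail=> ends up in $1, so the address is always $2 regardless of what follows it on the line.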

When you have additional requirements and want to use a loop, consider:

for f in *; do
  sed -nr '
    1s/(.{10}).*/********* \1 **********/p;
    /login *$/p;
    s/.*mail=>([^<]*).*/\1/p
  ' "${f}"
done