Extract all links between ' and ' in a text file, using CLI (Linux)

Time:01-31

I have a very big text (.sql) file, and I want to extract all the links from it into a clean text file with one link per line.

I found the following command from anubhava, which nearly works for me: grep -Eo "https?://\S+?\.html" filename.txt > newFile.txt (from the question "Extract all URLs that start with http or https and end with html from text file").

Unfortunately, it does not quite work. Problem 1: In the linked question, the URLs all end with .html. Mine do not share a common ending, so the match has to stop just before the second ' symbol.

Problem 2: I do not want the ' symbols copied into the output.

To give an example (because I think I am explaining this rather badly):

Say, my file says things like this:

Not him old music think his found enjoy merry. Listening acuteness dependent at or an. 'https://I_want_this' Apartments thoroughly unsatiable terminated sex how themselves. She are ten hours wrong walls stand early. 'https://I_want_this_too'. Domestic perceive on an ladyship extended received do. Why jennings our whatever his learning gay perceive. Is against no he without subject. Bed connection unreserved preference partiality not unaffected. Years merit trees so think in hoped we as.

I would want

https://I_want_this
https://I_want_this_too

as the output file.

Sorry for the easy question, but I am new to all of this, and grep, sed, etc. are not easy for me to understand, especially when I need to search for special characters such as /, ' and ".

CodePudding user response:

You can use a GNU grep command like

grep -Po "'\Khttps?://[^\s']+" file

Details:

  • -P enables the PCRE regex engine
  • -o outputs matches only, not whole matched lines
  • '\Khttps?://[^\s']+ - matches a ', then omits it from the match with \K, then matches http, then an optional s, ://, and then one or more chars other than whitespace and ' chars.

See the online demo:

#!/bin/bash
s="Not him old music think his found enjoy merry. Listening acuteness dependent at or an. 'https://I_want_this' Apartments thoroughly unsatiable terminated sex how themselves. She are ten hours wrong walls stand early. 'https://I_want_this_too'. Domestic perceive on an ladyship extended received do. Why jennings our whatever his learning gay perceive. Is against no he without subject. Bed connection unreserved preference partiality not unaffected. Years merit trees so think in hoped we as."
grep -Po "'\Khttps?://[^\s']+" <<< "$s"

Output:

https://I_want_this
https://I_want_this_too
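
If the local grep has no -P option (BSD/macOS grep, for instance, or a GNU grep built without PCRE support), roughly the same result can be had by leaving the opening ' in the match and cutting it off afterwards. This is only a sketch: the function name extract_quoted_links is made up for illustration, and it assumes a grep that supports -o, which GNU and BSD grep do although POSIX does not require it.

```shell
#!/bin/bash
# Hypothetical helper: read text on stdin and print every link that follows
# an opening ' and starts with http:// or https://, one per line.
extract_quoted_links() {
  # -E: extended regex; -o: print only the matched text.
  # Each match still begins with the opening quote, so cut drops character 1.
  grep -Eo "'https?://[^[:space:]']+" | cut -c2-
}

# Shortened version of the sample text from the question.
printf '%s\n' "at or an. 'https://I_want_this' Apartments. 'https://I_want_this_too'. Domestic" \
  | extract_quoted_links
```

For a real file this would be run as extract_quoted_links < filename.txt > newFile.txt.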

CodePudding user response:

With your shown samples, please try the following awk code. It was written and tested in GNU awk and should work in any awk.

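awk '
{
  # \047 is the octal escape for a single quote
  while(match($0,/\047https?:\/\/[^\047]*/)){
    print substr($0,RSTART+1,RLENGTH-1)   # skip the leading quote
    $0=substr($0,RSTART+RLENGTH)          # keep only the text after this match
  }
}
'  Input_file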

Explanation: The main block runs awk's match function in a while loop. The regex \047https?:\/\/[^\047]* matches 'http or 'https followed by :// and everything up to the next occurrence of '. On each pass, the matched substring is printed without its leading ', and $0 is then cut down to the text after the match so that the loop can find the next link on the same line.
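
To watch how match() drives that loop, the following sketch prints RSTART and RLENGTH next to each extracted link. The function name trace_links and the short input string are made up for illustration; the regex and the substr arithmetic are the ones from the answer above.

```shell
#!/bin/bash
# Hypothetical helper: same loop as above, but each pass also reports the
# values that match() stored in RSTART and RLENGTH.
trace_links() {
  awk '
  {
    while (match($0, /\047https?:\/\/[^\047]*/)) {
      # RSTART points at the opening quote, so the link starts one char later
      printf "RSTART=%d RLENGTH=%d link=%s\n", RSTART, RLENGTH, substr($0, RSTART + 1, RLENGTH - 1)
      # Continue searching in the remainder of the line
      $0 = substr($0, RSTART + RLENGTH)
    }
  }'
}

printf '%s\n' "x 'https://a' y 'https://b'" | trace_links
```

Output:

RSTART=3 RLENGTH=10 link=https://a
RSTART=5 RLENGTH=10 link=https://b

The second RSTART is counted from the start of the shortened $0, not from the start of the original line.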
