Pipe curl to awk to download and unzip files

Time:01-23

I want to download all files from this section of an HTML page:

    <td><a  name="item_1" type="dd" href="/data/24765/dd">Item 1</a></td>
    <td><a  name="item_2" type="dd" href="/data/12345/dd">Item 2</a></td>
    <td><a  name="item_3" type="dd" href="/data/75239/dd">Item 3</a></td>

The download link for the first file is https://foo.bar/data/24765/dd, and as it's a zip file, I'd like to unzip it as well.

My script is this:

#!/bin/bash
curl -s "https://foo.bar/path/to/page" > data.html

gawk 'match($0, /href="\/(data\/[0-9]{5}\/dd)"/, m){print m[1]}' data.html > data.txt

for f in $(cat data.txt); do 
    curl -s "https://foo.bar/$f" > data.zip
    unzip data.zip
done

Is there a more elegant way to write this script? I'd like to avoid saving the intermediate HTML, text and zip files.

CodePudding user response:

Unlike unzip, the bsdtar command can extract a zip archive read from stdin, so you can pipe the download straight into it:

curl -s "https://foo.bar/$f" | bsdtar -xf-

And of course you can pipe the first curl command directly into awk:

curl -s "https://foo.bar/path/to/page" |
gawk 'match($0, /href="\/(data\/[0-9]{5}\/dd)"/, m){print m[1]}' > data.txt

And in fact you might as well just pipe the output of that pipeline directly into a loop:

curl -s "https://foo.bar/path/to/page" |
gawk 'match($0, /href="\/(data\/[0-9]{5}\/dd)"/, m){print m[1]}' |
while IFS= read -r archive; do
    curl -s "https://foo.bar/$archive" | bsdtar -xf-
done
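
The three-argument form of match() used above is a gawk extension, so that step will not work with other awk implementations such as mawk. As a rough sketch, assuming a grep that supports -o (GNU and BSD grep both do), the same paths can be pulled out with grep and sed. The sample rows are copied from the question, and extract_paths is a hypothetical helper name, not part of any tool:

```shell
# Sample rows copied from the question; in the real pipeline this
# text would come from curl instead of a variable.
html='<td><a  name="item_1" type="dd" href="/data/24765/dd">Item 1</a></td>
<td><a  name="item_2" type="dd" href="/data/12345/dd">Item 2</a></td>
<td><a  name="item_3" type="dd" href="/data/75239/dd">Item 3</a></td>'

# Hypothetical helper: reads HTML on stdin and prints one
# data/NNNNN/dd path per matching href attribute.
extract_paths() {
    # grep -o prints only the matched href="..." text;
    # sed then strips the leading href="/ and the closing quote.
    grep -oE 'href="/data/[0-9]{5}/dd"' |
    sed -e 's|^href="/||' -e 's|"$||'
}

paths=$(printf '%s\n' "$html" | extract_paths)
printf '%s\n' "$paths"
```

In the loop above, extract_paths would take the place of the gawk command between the first curl and the while loop.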