Pipe curl to awk to download and unzip files

I want to download all files from this section of an HTML page:

    <td><a  name="item_1" type="dd" href="/data/24765/dd">Item 1</a></td>
    <td><a  name="item_2" type="dd" href="/data/12345/dd">Item 2</a></td>
    <td><a  name="item_3" type="dd" href="/data/75239/dd">Item 3</a></td>

The download link for the first file is https://foo.bar/data/24765/dd, and as it's a zip file, I'd like to unzip it as well.

This is my current script:

    #!/bin/bash
    curl -s "https://foo.bar/path/to/page" > data.html

    gawk 'match($0, /href="\/(data\/[0-9]{5}\/dd)"/, m){print m[1]}' data.html > data.txt

    for f in $(cat data.txt); do
        curl -s "https://foo.bar/$f" > data.zip
        unzip data.zip
    done

Is there a more elegant way to write this script? I'd like to avoid saving the intermediate HTML, text, and zip files.

Answer:

unzip cannot read an archive from stdin, but the bsdtar command (part of libarchive) can, which lets you replace the two lines inside your loop with this:

    curl -s "https://foo.bar/$f" | bsdtar -xf-
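
If bsdtar is not installed, one fallback is to buffer the stream in a temporary file and delete it straight after extraction. This is only a sketch: unzip_stream is a hypothetical helper name, and it assumes mktemp and Info-ZIP's unzip are available. unzip needs a seekable file because a zip archive keeps its central directory at the end, so the temporary file cannot be avoided here, only hidden.

```shell
#!/bin/bash
# unzip_stream (hypothetical helper): read a zip archive from stdin,
# buffer it in a temporary file, extract it into the current directory,
# then remove the temporary file whether or not extraction succeeded.
unzip_stream() {
    local tmp status
    tmp=$(mktemp) || return 1
    cat > "$tmp"          # copy stdin into the temporary file
    unzip -q "$tmp"       # -q: quiet, no per-file listing
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Usage (needs network access, so it is left commented out):
# curl -s "https://foo.bar/$f" | unzip_stream
```

The archive still touches the disk briefly, but no data.zip is left behind in the working directory.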

And of course you can pipe the first curl command directly into awk:

    curl -s "https://foo.bar/path/to/page" |
    gawk 'match($0, /href="\/(data\/[0-9]{5}\/dd)"/, m){print m[1]}' > data.txt
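
To check what the extraction step emits without touching the network, the sample rows from the question can be fed through an equivalent filter. The sketch below is not the answer's method: it swaps gawk's three-argument match() (a gawk extension) for grep -oE and sed, on the assumption that those are what is installed, and extract_links is a hypothetical helper name.

```shell
#!/bin/bash
# Sample rows copied from the HTML in the question.
sample='<td><a  name="item_1" type="dd" href="/data/24765/dd">Item 1</a></td>
<td><a  name="item_2" type="dd" href="/data/12345/dd">Item 2</a></td>
<td><a  name="item_3" type="dd" href="/data/75239/dd">Item 3</a></td>'

# extract_links (hypothetical helper): print each data/NNNNN/dd path
# found on stdin, one per line, without the leading slash.
extract_links() {
    # grep -o prints only the matching href attribute;
    # sed then strips the href="/ prefix and the closing quote.
    grep -oE 'href="/data/[0-9]{5}/dd"' | sed -E 's|^href="/||; s|"$||'
}

links=$(printf '%s\n' "$sample" | extract_links)
printf '%s\n' "$links"
```

Each printed line is a relative path such as data/24765/dd, ready to be appended to https://foo.bar/.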

In fact, you might as well pipe the output of that pipeline directly into a loop, which removes the need for data.txt as well:

    curl -s "https://foo.bar/path/to/page" |
    gawk 'match($0, /href="\/(data\/[0-9]{5}\/dd)"/, m){print m[1]}' |
    while IFS= read -r archive; do
        curl -s "https://foo.bar/$archive" | bsdtar -xf-
    done
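
The loop can also be wrapped in a function so that HTTP errors do not end up in bsdtar. This is a minimal sketch rather than a definitive version: download_archives and BASE_URL are hypothetical names, and the URL is the placeholder from the question.

```shell
#!/bin/bash
# BASE_URL mirrors the placeholder host used in the question.
BASE_URL="https://foo.bar"

# download_archives (hypothetical helper): read one relative path per
# line from stdin and stream each archive straight into bsdtar.
download_archives() {
    local archive
    while IFS= read -r archive; do
        # -f: fail on HTTP errors instead of piping an error page to bsdtar.
        # -S: still print curl's error message even though -s hides progress.
        curl -fsS "$BASE_URL/$archive" | bsdtar -xf -
    done
}

# Usage (needs network access, so it is left commented out):
# curl -fsS "$BASE_URL/path/to/page" |
#     gawk 'match($0, /href="\/(data\/[0-9]{5}\/dd)"/, m){print m[1]}' |
#     download_archives
```

Keeping the loop in a function makes it easy to try with a hand-written list of paths before pointing it at the real page.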