How can we parse tab-delimited data as it's being downloaded from the web and also parse a URL

Time:02-05

I put together some scrappy code that downloads data from a few URLs. I have two problems that I am trying to overcome.

  1. I need to parse this tab-delimited data before it is written to disk, so that the final saved file is a CSV (not a TSV).
  2. I need to download data from a link that has an apostrophe in the URL. The apostrophe is not handled correctly, so the download fails.

My hacked-together code.

import requests
from bs4 import BeautifulSoup
import urllib

all_links = ['/vsoch/hospital-chargemaster/tree/0.0.2/data/ochsner-clinic-foundation',
 '/vsoch/hospital-chargemaster/tree/0.0.2/data/ohio-state-university-hospital',
 '/vsoch/hospital-chargemaster/tree/0.0.2/data/orlando-health',
 'vsoch/hospital-chargemaster/blob/0.0.2/data/st.-joseph\'s-hospital-(tampa)']
for item in all_links:
    #print(item)
    item = item.replace('tree/','')
    #print(item)
    try:
        length = len(item)
        last_slash = item.rfind('/') + 1
        file_name = (length-last_slash)
        file_name = item[-file_name:]
        print(file_name)
        DOWNLOAD_URL = 'https://raw.githubusercontent.com' + item + '/data-latest.tsv'
        r = requests.get(DOWNLOAD_URL)
        soup = BeautifulSoup(r.text, "html.parser")
        DOWNLOAD_PATH = 'C:\\Users\\ryans\\Desktop\\hospital_data\\' + file_name + '.csv'
        urllib.request.urlretrieve(DOWNLOAD_URL,DOWNLOAD_PATH)
    except Exception as e: print(e)

So, how can I parse a TSV into a CSV? Also, how can I download the data from the last URL in the list of four URLs?

Answer:

The following approach should work:

import csv
import io
from urllib.parse import unquote

import requests

all_links = [
    "/vsoch/hospital-chargemaster/tree/0.0.2/data/ochsner-clinic-foundation",
    "/vsoch/hospital-chargemaster/tree/0.0.2/data/ohio-state-university-hospital",
    "/vsoch/hospital-chargemaster/tree/0.0.2/data/orlando-health",
    "/vsoch/hospital-chargemaster/tree/0.0.2/data/st.-joseph’s-hospital-(tampa)",
]

for item in all_links:
    # Drop the 'tree/' segment so the path matches raw.githubusercontent.com.
    item = item.replace('tree/', '')

    try:
        # Use the last path segment (with any % escapes decoded) as the filename.
        file_name = unquote(item.split('/')[-1])
        DOWNLOAD_URL = f'https://raw.githubusercontent.com{item}/data-latest.tsv'
        r_tsv = requests.get(DOWNLOAD_URL)

        if r_tsv.status_code == 404:
            print(f"Not found - {DOWNLOAD_URL}")
        else:
            print(f"Downloaded - {DOWNLOAD_URL}")
            # Parse the tab-delimited text into a list of rows.
            data = list(csv.reader(io.StringIO(r_tsv.text), delimiter='\t'))
            DOWNLOAD_PATH = fr'C:\Users\ryans\Desktop\hospital_data\{file_name}.csv'

            # Write the rows back out comma-delimited; newline='' avoids blank lines on Windows.
            with open(DOWNLOAD_PATH, 'w', newline='') as f_output:
                csv_output = csv.writer(f_output)
                csv_output.writerows(data)
    except Exception as e:
        print(e)
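
The loop above reads each response fully into memory before converting it. Since the question asks about parsing the data while it is being downloaded, a streaming variant may be of interest. The following is only a minimal sketch and not part of the original answer: convert_tsv_to_csv and download_as_csv are hypothetical helper names, the sketch uses urllib.request instead of requests, and it assumes the server returns UTF-8 text.

```python
import csv
import io
import urllib.request


def convert_tsv_to_csv(tsv_lines, csv_file):
    """Copy rows from an iterable of tab-delimited lines into csv_file.

    Returns the number of rows written.
    """
    reader = csv.reader(tsv_lines, delimiter='\t')
    writer = csv.writer(csv_file)
    count = 0
    for row in reader:
        # Each row is written as soon as it is parsed, so the whole
        # file never has to be held in memory at once.
        writer.writerow(row)
        count += 1
    return count


def download_as_csv(url, csv_path):
    """Stream a remote TSV file into a local CSV file (assumes UTF-8 content)."""
    with urllib.request.urlopen(url) as response:
        # Wrap the binary HTTP response so csv.reader receives decoded text lines.
        text_stream = io.TextIOWrapper(response, encoding='utf-8', newline='')
        with open(csv_path, 'w', newline='', encoding='utf-8') as f_output:
            return convert_tsv_to_csv(text_stream, f_output)
```

Calling download_as_csv(DOWNLOAD_URL, DOWNLOAD_PATH) inside the loop would take the place of the requests.get() and csv.writer steps. One caveat with this sketch: the URL handed to urlopen should already be percent-encoded, so a link containing an apostrophe or other special characters would need to be quoted first.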

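Note that file_name is also passed through unquote(), so that any percent-escapes in the link (for example, %27 for an apostrophe) are decoded before the name is used for the local file.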
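
To illustrate the percent-escapes involved, here is a small sketch using quote() and unquote() from the standard library's urllib.parse. The directory name is taken from the last link in the question; whether that exact path exists on the server is not verified here.

```python
from urllib.parse import quote, unquote

# Last path segment from the question's fourth link (with a plain apostrophe).
segment = "st.-joseph's-hospital-(tampa)"

# quote() percent-encodes characters that are not safe in a URL path,
# including the apostrophe and the parentheses.
encoded = quote(segment)
print(encoded)   # st.-joseph%27s-hospital-%28tampa%29

# unquote() reverses the encoding, giving a readable name for the local file.
decoded = unquote(encoded)
print(decoded)   # st.-joseph's-hospital-(tampa)
```

If the directory name actually contains a typographic apostrophe (’) rather than a plain one, quote() encodes it as %E2%80%99 instead, so it is worth checking which character the real link uses.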
