Bulk data from the Europe PMC Annotations API

I have a pmc.txt file that contains at least 20k PMC IDs, and I think the API only accepts 1000 requests at a time. I have written the code for one ID, but I'm not able to do it for the whole file. My main code is below. Please help.

import json

import requests

# json_to_dataframe is defined elsewhere in my script (not shown here).

if __name__ == '__main__':
    URL = 'https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds'

    article_ids = ['PMC:4771370']

    for article_id in article_ids:
        params = {
            'articleIds': article_id,
            'section': 'Abstract',
            'provider': 'Europe PMC',
            'format': 'JSON'
        }
        json_data = requests.get(URL, params=params).content
        r = json.loads(json_data)
        df = json_to_dataframe(r)
        print(df)
        df.to_csv("data.csv")

CodePudding user response：

You can read the IDs from the file like this:

with open('pmc.txt', 'r') as file:
    article_ids = [item.replace('\n', '') for item in file]

Use this in place of article_ids = ['PMC:4771370'].

You will then need to save each result under a different file name (which gives you 20,000 files), or append each result to a single dataframe before writing it to CSV.
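As a rough sketch of the second option, the IDs can be split into small batches and the results gathered into one list before anything is written to disk. The helper names (read_ids, chunked, fetch_batch, collect) are hypothetical, and two things are assumptions to check against the API documentation: that batch_size=8 is an acceptable number of comma-separated IDs per call, and that the JSON response is a list with one entry per article.

```python
URL = 'https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds'


def read_ids(path):
    """Return the non-empty lines of a text file, one article ID per line."""
    with open(path) as fh:
        return [line.strip() for line in fh if line.strip()]


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_batch(batch):
    """Request abstract annotations for one batch of IDs."""
    import requests  # third-party; imported here so the helpers above work without it

    params = {
        'articleIds': ','.join(batch),  # assumption: IDs are sent comma-separated
        'section': 'Abstract',
        'provider': 'Europe PMC',
        'format': 'JSON',
    }
    response = requests.get(URL, params=params, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.json()       # assumption: a list with one entry per article


def collect(article_ids, fetch=fetch_batch, batch_size=8):
    """Fetch every ID batch by batch and return one flat list of results."""
    results = []
    for batch in chunked(article_ids, batch_size):
        results.extend(fetch(batch))
    return results
```

With these in place, collect(read_ids('pmc.txt')) would return a single list that can be passed to json_to_dataframe once and saved with one to_csv call. If the IDs in pmc.txt lack the 'PMC:' prefix used in the question, it would need to be added before the request.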

CodePudding user response：

You can use grequests. Try setting stream=False in grequests.get, or explicitly call response.close() after reading response.content. It's discussed in detail here.

Additionally, you can try requests-futures. grequests is faster, but it brings monkey patching and additional dependency problems. requests-futures is several times slower than grequests, but simply wrapping requests in a ThreadPoolExecutor can be as fast as grequests, without external dependencies. Reference here.
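A minimal sketch of the ThreadPoolExecutor approach, using only the standard library plus requests. The function names fetch_annotations and fetch_concurrently are hypothetical, and max_workers=8 is an arbitrary starting point rather than a limit taken from the API.

```python
from concurrent.futures import ThreadPoolExecutor

URL = 'https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds'


def fetch_annotations(article_id):
    """Fetch abstract annotations for a single article ID."""
    import requests  # third-party; imported here so the pool helper works without it

    params = {
        'articleIds': article_id,
        'section': 'Abstract',
        'provider': 'Europe PMC',
        'format': 'JSON',
    }
    response = requests.get(URL, params=params, timeout=30)
    response.raise_for_status()
    return response.json()


def fetch_concurrently(article_ids, fetch=fetch_annotations, max_workers=8):
    """Call `fetch` for every ID in a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, article_ids))
```

pool.map returns results in the same order as the input IDs, and an exception raised inside any single request is re-raised when its result is read, so a failed request stops the run instead of being silently dropped. Keeping max_workers small is a reasonable way to avoid hammering the service.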