I'm trying to get bulk data from the Europe PMC Annotations API in Python

My code is:

import json
import requests

if __name__ == '__main__':
    json_data = requests.get("https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds?articleIds=PMC:4771370&section=Abstract&provider=Europe PMC&format=JSON").content
    r = json.loads(json_data)
    df = json_to_dataframe(r)
    print(df)

My only problem is how to run this for multiple IDs: I have thousands of IDs in a file. Please help; I'm using Python.

CodePudding user response：

Assuming you know Python and can get all the IDs from the file into a list article_ids, you can use the following script:

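import json
import requests

URL = 'https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds'

# Replace this with the full list of IDs read from your file.
article_ids = ['PMC:4771370']

for article_id in article_ids:
    # requests builds and encodes the query string from this dict.
    params = {
        'articleIds': article_id,
        'section': 'Abstract',
        'provider': 'Europe PMC',
        'format': 'JSON'
    }
    json_data = requests.get(URL, params=params).content
    r = json.loads(json_data)
    # json_to_dataframe is the helper from the question; it is not defined here.
    df = json_to_dataframe(r)
    print(df)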
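
The script above assumes the IDs are already in a list. A minimal sketch of that step, assuming the file holds one ID per line in SOURCE:EXTERNAL_ID form (the function name load_article_ids and the file name article_ids.txt are made up for illustration):

```python
from pathlib import Path


def load_article_ids(path):
    """Return the non-empty lines of a text file as a list of article IDs."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # Strip surrounding whitespace and skip blank lines.
    return [line.strip() for line in lines if line.strip()]
```

You would then write article_ids = load_article_ids("article_ids.txt") in place of the hard-coded list.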

CodePudding user response：

After analyzing the shared URL and reading the URL Encodings article, I observed that each value of the articleIds parameter has the format SOURCE:EXTERNAL_ID.

TEST1: If you hit the url:

https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds?articleIds=PMC

Output is: It must contain values with format SOURCE:EXTERNAL_ID where SOURCE must have one of the following values [PMC, MED, PAT, AGR, CBA, HIR, CTX, ETH, CIT, PPR, NBK] and EXTERNAL_ID must be a number when SOURCE=PMC

The output above lists the possible sources.
Each SOURCE is separated from its EXTERNAL_ID by a colon.
A colon is represented by %3A in the URL Encoding article.
To separate one SOURCE:EXTERNAL_ID pair from the next, you can use a comma.
A comma is represented by %2C in the same URL Encoding article.

ANSWER: To fetch multiple articles, build a string of article IDs in the format SOURCE1:EXTERNAL_ID1,SOURCE2:EXTERNAL_ID2,...,SOURCEn:EXTERNAL_IDn and pass it as the articleIds value in the main URL.
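
As a rough illustration of the string described above, the following sketch joins two IDs with commas and shows what the value looks like once percent-encoded. It uses only the standard library; the second ID is a made-up placeholder, not a real article.

```python
from urllib.parse import quote, urlencode

# The second ID is a placeholder for illustration only.
article_ids = ["PMC:4771370", "MED:12345678"]

# Comma-separated SOURCE:EXTERNAL_ID pairs, as described above.
joined = ",".join(article_ids)

# Percent-encode every reserved character, including ':' and ','.
print(quote(joined, safe=""))  # PMC%3A4771370%2CMED%3A12345678

# urlencode applies the same encoding to each value of a query string.
query = urlencode({"articleIds": joined, "format": "JSON"})
print(query)  # articleIds=PMC%3A4771370%2CMED%3A12345678&format=JSON
```

If you pass the joined string through the params argument of requests.get, as in the first answer, requests takes care of the encoding and you do not need to do it by hand.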

A few limitations:

The maximum URL length may be 2048 characters.
Depending on the length of the IDs, you will be able to fetch around 150 to 200 articles per request.
You could loop over the IDs in batches of 150 and fetch the required information for each batch.
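
The batching idea could be sketched as follows. This is an untested outline, not a definitive implementation: the helper names batches, build_params and fetch_all are made up, the default batch size of 150 is only the estimate given above, and the API may enforce its own cap on IDs per request, so check its documentation and lower batch_size if requests are rejected.

```python
URL = "https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds"


def batches(items, size):
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_params(batch):
    """Build the query parameters for one batch of SOURCE:EXTERNAL_ID values."""
    return {
        "articleIds": ",".join(batch),
        "section": "Abstract",
        "provider": "Europe PMC",
        "format": "JSON",
    }


def fetch_all(article_ids, batch_size=150):
    """Request annotations batch by batch and collect the parsed JSON bodies."""
    # Third-party dependency, imported here so the helpers above
    # can be used and tested without it.
    import requests

    results = []
    for batch in batches(article_ids, batch_size):
        response = requests.get(URL, params=build_params(batch), timeout=30)
        response.raise_for_status()  # stop on HTTP errors instead of parsing them
        results.append(response.json())
    return results
```

Each element of the returned list is the parsed response for one batch, which you can pass to your own json_to_dataframe helper.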