Home > database >  Python Requests Post within a nested Json - retrieve data with a specific value
Python Requests Post within a nested Json - retrieve data with a specific value

Time:01-07

I already look in stackoverflow and I could not find an answer to my problem. I'm accessing an API from the German Government that has a output limit of 10.000 entries. I want all data from a specific city, and since there is more than 10.000 entries in the original database, I need to "do the query" while doing the requests.post.

Here is one entry of Json result, when I simply do request.post to this API:

{
    "results":[
        {
            "_id":"CXPTYYFY807",
            "CREATED_AT":"2019-12-17T14:48:17.130Z",
            "UPDATED_AT":"2019-12-17T14:48:17.130Z",
            "result":{
                "id":"CXPTYYFY807",
                "title":"Bundesstadt Bonn, SGB-315114, Ortsteilzentrum Brüser Berg, Fliesenarbeiten",
                "description":["SGB-315114","Ortsteilzentrum Brüser Berg, Fliesenarbeiten"],
                "procedure_type":"Ex ante Veröffentlichung (§ 19 Abs. 5)",
                "order_type":"VOB",
                "publication_date":"",
                "cpv_codes":["45431000-7","45431100-8"],
                "buyer":{
                    "name":"Bundesstadt Bonn, Referat Vergabedienste",
                    "address":"Berliner Platz 2",
                    "town":"Bonn",
                    "postal_code":"53111"},
                    "seller":{
                        "name":"",
                        "town":"",
                        "country":""
                        
                    },
                    "geo":{
                        "lon":7.0944,
                        "lat":50.73657
                        
                    },
                    "value":"",
                    "CREATED_AT":"2019-12-17T14:48:17.130Z",
                    "UPDATED_AT":"2019-12-17T14:48:17.130Z"}
            
        }
        ],
    "aggregations":{},
    "pagination":{
        "total":47389,
        "start":0,
        "end":0 }}

What I want is all the data which was bought in "town" : "Bonn"

What I already tryed:

import requests 

url = 'https://daten.vergabe.nrw.de/rest/evergabe/aggregation_search'
headers = {'Accept': 'application/json', 'Content-Type': 'application/json'}
data = {"results": [{"result": {"buyer": {"town":"Bonn"}}}]}

#need to put the size limit, otherwise he delivers me less:
params = {'size': 10000}
 
req = requests.post(url, params=params, headers=headers, json=data)

This returns me the post, but not "filtered" by city. I also tryed req = requests.post(url, params=params, headers=headers, data=data) , which returns me ERROR 400 .

Another way is to grab all the data with the pagination parameters on the end of the json code within a loop, but again I'm not being able to writwe down the json path to the pagination, for example : start: 0 , end:500

Can anyone help me solving it?

CodePudding user response:

Try:

url = 'https://daten.vergabe.nrw.de/rest/evergabe/aggregation_search'
 
headers = {'Accept': 'application/json', 'Content-Type': 'application/json'}

query1 = {
    "query": {
        "match": {
            "buyer.town": "Bonn"
        }
    }
}

req = requests.post(url, headers=headers, json=query1)

# Check the output
req.text

Edit: This won't work if the filter matches with more than 10.000 results, but it may be a quick workaround to the problem you are facing.

CodePudding user response:

import json
import requests
import math

url = "https://daten.vergabe.nrw.de/rest/vmp_rheinland"

size = 5000


payload = '{"sort":[{"_id":"asc"}],"query":{"match_all":{}},"size":' str(size) '}'
headers = {
    'accept': "application/json",
    'content-type': "application/json"
    'cache-control': "no-cache"
    }

response = requests.request("POST", url, data=payload, headers=headers)

tenders_array = []

query_data = json.loads(response.text)

tenders_array.extend(query_data['results'])

total_hits = query_data['pagination']['total']
result_size = len(query_data['results'])
last_id = query_data['results'][-1]["_id"]

number_of_loops = ((total_hits - size) // size )
last_loop_size = ((total_hits - size) % size)


for i in range(number_of_loops 1):
    if i == number_of_loops:
        size=last_loop_size
    payload = '{"sort":[{"_id":"asc"}],"query":{"match_all":{}},"size":' str(size) ',"search_after":["' last_id '"]}'
    response = requests.request("POST", url, data=payload, headers=headers)
    query_data = json.loads(response.text)
    result_size = len(query_data['results'])
    if result_size > 0:
        tenders_array.extend(query_data['results'])
        last_id = query_data['results'][-1]["_id"] 
    else:
        break

https://gist.github.com/thiagoalencar/34401e204358499ea3b9aa043a18395f code in the gist.

Some code to paginate through elasticsearch API. This is an API over the elasticsearch API, and the docs where not so clear. Tried scroll, no sucess. This solutions uses search_after parameter without point in time, because the endpoint is not available. Some times the servers refuses the request and it is necessary to verify with response.status_code==502.

The code is messy and need refactoring. But it works. The final tenders_array contains all objects.

  •  Tags:  
  • Related