I already look in stackoverflow and I could not find an answer to my problem.
I'm accessing an API from the German Government that has a output limit of 10.000 entries. I want all data from a specific city, and since there is more than 10.000 entries in the original database, I need to "do the query" while doing the requests.post.
Here is one entry of Json result, when I simply do request.post to this API:
{
"results":[
{
"_id":"CXPTYYFY807",
"CREATED_AT":"2019-12-17T14:48:17.130Z",
"UPDATED_AT":"2019-12-17T14:48:17.130Z",
"result":{
"id":"CXPTYYFY807",
"title":"Bundesstadt Bonn, SGB-315114, Ortsteilzentrum Brüser Berg, Fliesenarbeiten",
"description":["SGB-315114","Ortsteilzentrum Brüser Berg, Fliesenarbeiten"],
"procedure_type":"Ex ante Veröffentlichung (§ 19 Abs. 5)",
"order_type":"VOB",
"publication_date":"",
"cpv_codes":["45431000-7","45431100-8"],
"buyer":{
"name":"Bundesstadt Bonn, Referat Vergabedienste",
"address":"Berliner Platz 2",
"town":"Bonn",
"postal_code":"53111"},
"seller":{
"name":"",
"town":"",
"country":""
},
"geo":{
"lon":7.0944,
"lat":50.73657
},
"value":"",
"CREATED_AT":"2019-12-17T14:48:17.130Z",
"UPDATED_AT":"2019-12-17T14:48:17.130Z"}
}
],
"aggregations":{},
"pagination":{
"total":47389,
"start":0,
"end":0 }}
What I want is all the data which was bought in "town" : "Bonn"
What I already tryed:
import requests
url = 'https://daten.vergabe.nrw.de/rest/evergabe/aggregation_search'
headers = {'Accept': 'application/json', 'Content-Type': 'application/json'}
data = {"results": [{"result": {"buyer": {"town":"Bonn"}}}]}
#need to put the size limit, otherwise he delivers me less:
params = {'size': 10000}
req = requests.post(url, params=params, headers=headers, json=data)
This returns me the post, but not "filtered" by city.
I also tryed req = requests.post(url, params=params, headers=headers, data=data) , which returns me ERROR 400 .
Another way is to grab all the data with the pagination parameters on the end of the json code within a loop, but again I'm not being able to writwe down the json path to the pagination, for example : start: 0 , end:500
Can anyone help me solving it?
CodePudding user response:
Try:
url = 'https://daten.vergabe.nrw.de/rest/evergabe/aggregation_search'
headers = {'Accept': 'application/json', 'Content-Type': 'application/json'}
query1 = {
"query": {
"match": {
"buyer.town": "Bonn"
}
}
}
req = requests.post(url, headers=headers, json=query1)
# Check the output
req.text
Edit: This won't work if the filter matches with more than 10.000 results, but it may be a quick workaround to the problem you are facing.
CodePudding user response:
import json
import requests
import math
url = "https://daten.vergabe.nrw.de/rest/vmp_rheinland"
size = 5000
payload = '{"sort":[{"_id":"asc"}],"query":{"match_all":{}},"size":' str(size) '}'
headers = {
'accept': "application/json",
'content-type': "application/json"
'cache-control': "no-cache"
}
response = requests.request("POST", url, data=payload, headers=headers)
tenders_array = []
query_data = json.loads(response.text)
tenders_array.extend(query_data['results'])
total_hits = query_data['pagination']['total']
result_size = len(query_data['results'])
last_id = query_data['results'][-1]["_id"]
number_of_loops = ((total_hits - size) // size )
last_loop_size = ((total_hits - size) % size)
for i in range(number_of_loops 1):
if i == number_of_loops:
size=last_loop_size
payload = '{"sort":[{"_id":"asc"}],"query":{"match_all":{}},"size":' str(size) ',"search_after":["' last_id '"]}'
response = requests.request("POST", url, data=payload, headers=headers)
query_data = json.loads(response.text)
result_size = len(query_data['results'])
if result_size > 0:
tenders_array.extend(query_data['results'])
last_id = query_data['results'][-1]["_id"]
else:
break
https://gist.github.com/thiagoalencar/34401e204358499ea3b9aa043a18395f code in the gist.
Some code to paginate through elasticsearch API. This is an API over the elasticsearch API, and the docs where not so clear. Tried scroll, no sucess. This solutions uses search_after parameter without point in time, because the endpoint is not available. Some times the servers refuses the request and it is necessary to verify with response.status_code==502.
The code is messy and need refactoring. But it works. The final tenders_array contains all objects.
