CodePudding user response:
The page sends a PHPSESSIONID cookie, and the HTML contains a token like this:
<script>token = "NDQ4MTg3MjMw"
JavaScript on the page reads this value and adds it to the request headers:
num: NDQ4MTg3MjMw,
The server needs both PHPSESSIONID and num before it will send data.
Every connection creates new values for PHPSESSIONID and the token. You could hardcode some values in your code, but a session ID can be valid for only a few minutes, so it is better to get fresh values with a GET request before the POST request.
So you have to use requests.Session to work with cookies, and first send a GET to https://vahaninfos.com/vehicle-details-by-number-plate to get the PHPSESSIONID cookie and the HTML with <script>token = "..."
Next you have to extract this token from the HTML (e.g. using a regex) and add it as the header num: .... in the POST request.
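The token-extraction step can be tried on its own, without touching the site. This is only a sketch: extract_token and sample_html are names I made up for illustration, and the sample HTML is modelled on the snippet quoted above, not on a live response. It fails with a clear error when the page layout changes, instead of an IndexError.

```python
import re

# Matches: token = "..." and captures the value between the quotes.
TOKEN_PATTERN = re.compile(r'token\s*=\s*"([^"]*)"')


def extract_token(html: str) -> str:
    """Return the token embedded in the page's <script>, or raise ValueError."""
    match = TOKEN_PATTERN.search(html)
    if match is None:
        raise ValueError("token not found in HTML")
    return match.group(1)


# Sample HTML modelled on the snippet quoted above (not a live response).
sample_html = '<html><script>token = "NDQ4MTg3MjMw"</script></html>'
print(extract_token(sample_html))
```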
It seems the other headers are not important, not even X-Requested-With.
The page expects the data as a form, so you need data=payload instead of data=json.dumps(payload). requests then automatically sets the Content-Type and Content-Length headers to the correct values.
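To see why the form/JSON difference matters, here is a small sketch that compares the two request bodies using only the standard library. It does not call requests itself; urlencode is used as an approximation of what data=payload produces, so treat the exact bytes as an assumption.

```python
import json
from urllib.parse import urlencode

payload = {
    "number": "UP32AT5472",
    "g-recaptcha-response": "",
}

# Roughly the body produced by data=payload
# (Content-Type: application/x-www-form-urlencoded).
form_body = urlencode(payload)

# The body produced by JSON-encoding the payload first: a single JSON string,
# which a server expecting form fields would typically not split into fields.
json_body = json.dumps(payload)

print(form_body)
print(json_body)
```

The first line printed is key=value pairs joined with &, which is what an HTML form would send; the second is one JSON document.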
# Full example: GET the page for a fresh cookie and token, then POST the form.
import re
import requests

session = requests.Session()  # keeps the PHPSESSIONID cookie between requests

# --- GET ---

headers = {
    # "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:97.0) Gecko/20100101 Firefox/97.0",
}

url = "https://vahaninfos.com/vehicle-details-by-number-plate"

# verify=False skips TLS certificate verification
res = session.get(url, headers=headers, verify=False)

# extract the value from <script>token = "..."
number = re.findall('token = "([^"]*)"', res.text)[0]

# --- POST ---

headers = {
    # "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:97.0) Gecko/20100101 Firefox/97.0",
    # "X-Requested-With": "XMLHttpRequest",
    'num': number,  # token taken from the GET response
}

payload = {
    "number": "UP32AT5472",
    "g-recaptcha-response": "",
}

url = "https://vahaninfos.com/getdetails.php"

# data=payload sends the values as a form
res = session.post(url, data=payload, headers=headers, verify=False)

print(res.text)
Result:
<tr><td>Registration Number</td><td>:</td><td>UP32AT5472</td></tr>
<tr><td>Registration Authority</td><td>:</td><td>LUCKNOW</td></tr>
<tr><td>Registration Date</td><td>:</td><td>2003-06-06</td></tr>
<tr><td>Chassis Number</td><td>:</td><td>487530</td></tr>
<tr><td>Engine Number</td><td>:</td><td>490062</td></tr>
<tr><td>Fuel Type</td><td>:</td><td>PETROL</td></tr>
<tr><td>Engine Capacity</td><td>:</td><td></td></tr>
<tr><td>Model/Model Name</td><td>:</td><td>TVS VICTOR</td></tr>
<tr><td>Color</td><td>:</td><td></td></tr>
<tr><td>Owner Name</td><td>:</td><td>HARI MOHAN PANDEY</td></tr>
<tr><td>Ownership Type</td><td>:</td><td></td></tr>
<tr><td>Financer</td><td>:</td><td>CENTRAL BANK OF INDIA</td></tr>
<tr><td>Vehicle Class</td><td>:</td><td>M-CYCLE/SCOOTER(2WN)</td></tr>
<tr><td>Fitness/Regn Upto</td><td>:</td><td></td></tr>
<tr><td>Insurance Company</td><td>:</td><td>NATIONAL INSURANCE CO LTD.</td></tr>
<tr><td>Insurance Policy No</td><td>:</td><td>4165465465465</td></tr>
<tr><td>Insurance expiry</td><td>:</td><td>2004-06-05</td></tr>
<tr><td>Vehicle Age</td><td>:</td><td></td></tr>
<tr><td>Vehicle Type</td><td>:</td><td></td></tr>
<tr><td>Vehicle Category</td><td>:</td><td></td></tr>
Now you can use BeautifulSoup or lxml (or another module) to extract the values from the HTML.
from bs4 import BeautifulSoup

soup = BeautifulSoup(res.text, 'html.parser')

for row in soup.find_all('tr'):
    cols = row.find_all('td')
    key = cols[0].text   # first cell: field name
    val = cols[-1].text  # last cell: value (the middle cell is ":")
    print(f'{key:22} | {val}')
Result:
Registration Number    | UP32AT5472
Registration Authority | LUCKNOW
Registration Date      | 2003-06-06
Chassis Number         | 487530
Engine Number          | 490062
Fuel Type              | PETROL
Engine Capacity        |
Model/Model Name       | TVS VICTOR
Color                  |
Owner Name             | HARI MOHAN PANDEY
Ownership Type         |
Financer               | CENTRAL BANK OF INDIA
Vehicle Class          | M-CYCLE/SCOOTER(2WN)
Fitness/Regn Upto      |
Insurance Company      | NATIONAL INSURANCE CO LTD.
Insurance Policy No    | 4165465465465
Insurance expiry       | 2004-06-05
Vehicle Age            |
Vehicle Type           |
Vehicle Category       |
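If you would rather keep the values in a dictionary than print them, here is a sketch that uses only the standard-library html.parser as the "other module". RowParser and rows_to_dict are hypothetical names, and the sample string is just two rows copied from the result above; the real response may contain markup this simple parser does not handle.

```python
from html.parser import HTMLParser


class RowParser(HTMLParser):
    """Collect the text of every <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of cell texts per <tr>
        self._in_cell = False   # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def rows_to_dict(html):
    """Map the first cell of each row to its last cell."""
    parser = RowParser()
    parser.feed(html)
    parser.close()
    return {cells[0]: cells[-1] for cells in parser.rows if cells}


# Two rows copied from the result above, used as offline sample input.
sample = (
    "<tr><td>Fuel Type</td><td>:</td><td>PETROL</td></tr>"
    "<tr><td>Color</td><td>:</td><td></td></tr>"
)
details = rows_to_dict(sample)
print(details)
```

With the real response you would call rows_to_dict(res.text) and then read single fields, e.g. details.get("Fuel Type").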
EDIT:
After running the code a few times, the POST started returning only the value R. Maybe it needs some other headers to hide the bot (e.g. User-Agent), or maybe it sometimes needs a correct ReCaptcha code.
At least in Chrome, it stops sending R when I complete the ReCaptcha.
But Firefox still gets R.
Originally I was using the User-Agent from my Firefox, and the server may remember it.
EDIT:
If I use a User-Agent different from my Firefox's, the code gets correct values again, while Firefox still gets only R.
headers = {
    "User-Agent": "Mozilla/5.0",
}
So it seems the code may need to use a random User-Agent in every request to hide the bot.
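One way to vary the User-Agent is sketched below. build_headers and the USER_AGENTS list are my own illustration: the first two strings appear earlier in this answer, the third is just an example of a Chrome-style string, and I have not checked whether rotating them keeps the site from returning R.

```python
import random

# Example User-Agent strings; placeholders chosen for illustration only.
USER_AGENTS = [
    "Mozilla/5.0",
    "Mozilla/5.0 (X11; Linux x86_64; rv:97.0) Gecko/20100101 Firefox/97.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36",
]


def build_headers(token=None):
    """Return headers with a randomly chosen User-Agent and, if given, num."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    if token is not None:
        headers["num"] = token
    return headers


print(build_headers())                 # headers for the GET request
print(build_headers("NDQ4MTg3MjMw"))   # headers for the POST request
```

In the code above you would then pass headers=build_headers() to session.get() and headers=build_headers(number) to session.post().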



