I need a hint from you about an issue I'm handling. Using requests to do some webscraping in python, the URL gives me a file to download, but when I get the content from the request, I get the following result:
b'"PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiIHN0YW5kYWxvbmU9InllcyI/Pg0KPERhZG9zRWNvbm9taWNvRmluYW5jZWlyb3MgeG1sbnM6eHNpPSJodHRwOi8vd3d3LnczLm9yZy8yMDAxL1hNTFNjaGVtYS1pbnN0YW5jZSI DQoJPERhZG9zR2VyYWlzPg0KCQk8Tm9tZUZ1bmRvPkZJSSBCVEdQIExPR0lTVElDQTwvTm9tZUZ1bmRvPg0KCQk8Q05QSkZ1bmRvPjExODM5NTkzMDAwMTA5PC9DTlBKRnVuZG8 DQoJCTxOb21lQWRtaW5pc3RyYWRvcj5CVEcgUGFjdHVhbCBTZXJ2acOnb3MgRmluYW5jZWlyb3MgUy5BLiBEVFZNPC9Ob21lQWRtaW5pc3RyYWRvcj4NCgkJPENOUEpBZG1pbmlzdHJhZG9yPjU5MjgxMjUzMDAwMTIzPC9DTlBKQWRtaW5pc3RyYWRvcj4NCgkJPFJlc3BvbnNhdmVsSW5mb3JtYWNhbz5MdWNhcyBNYXNzb2xhPC9SZXNwb25zYXZlbEluZm9ybWFjYW8 DQoJCTxUZWxlZm9uZUNvbnRhdG8 KDExKSAzMzgzLTI1MTM8L1RlbGVmb25lQ29udGF0bz4NCgkJPENvZElTSU5Db3RhPkJSQlRMR0NURjAwMDwvQ29kSVNJTkNvdGE DQoJCTxDb2ROZWdvY2lhY2FvQ290YT5CVExHMTE8L0NvZE5lZ29jaWFjYW9Db3RhPg0KCTwvRGFkb3NHZXJhaXM DQoJPEluZm9ybWVSZW5kaW1lbnRvcz4NCgkJPFJlbmRpbWVudG8 DQoJCQk8RGF0YUFwcm92YWNhbz4yMDIxLTEyLTE1PC9EYXRhQXByb3ZhY2FvPg0KCQkJPERhdGFCYXNlPjIwMjEtMTItMTU8L0RhdGFCYXNlPg0KCQkJPERhdGFQYWdhbWVudG8 MjAyMS0xMi0yMzwvRGF0YVBhZ2FtZW50bz4NCgkJCTxWYWxvclByb3ZlbnRvQ290YT4wLjcyPC9WYWxvclByb3ZlbnRvQ290YT4NCgkJCTxQZXJpb2RvUmVmZXJlbmNpYT5Ob3ZlbWJybzwvUGVyaW9kb1JlZmVyZW5jaWE DQoJCQk8QW5vPjIwMjE8L0Fubz4NCgkJCTxSZW5kaW1lbnRvSXNlbnRvSVI dHJ1ZTwvUmVuZGltZW50b0lzZW50b0lSPg0KCQk8L1JlbmRpbWVudG8 DQoJCTxBbW9ydGl6YWNhbyB0aXBvPSIiLz4NCgk8L0luZm9ybWVSZW5kaW1lbnRvcz4NCjwvRGFkb3NFY29ub21pY29GaW5hbmNlaXJvcz4="'
and these headers:
{'Date': 'Thu, 13 Jan 2022 13:25:03 GMT', 'Set-Cookie': 'dtCookie=v_4_srv_27_sn_A24AD4C76E5194F3DB0056C40CBABEF7_perc_100000_ol_0_mul_1_app-3A97e61c3a8a7c6a0b_1_rcs-3Acss_0; Path=/; Domain=.bmfbovespa.com.br, JSESSIONID=LWB pcQEPreUbb BtwZ9pyOm.sfnNODE01; Path=/fnet; Secure; HttpOnly, TS01871345=011d592ce1f641d52fa6af8d3b5a924eddc7997db2f6611d8d70aeab610f5e34ea2706a45b6f2c35f2b500d01fc681c74e5caa356c; Path=/; HTTPOnly, TS01e3f871=011d592ce1f641d52fa6af8d3b5a924eddc7997db2f6611d8d70aeab610f5e34ea2706a45b6f2c35f2b500d01fc681c74e5caa356c; path=/; domain=.bmfbovespa.com.br; HTTPonly, TS01d1c2dd=011d592ce1f641d52fa6af8d3b5a924eddc7997db2f6611d8d70aeab610f5e34ea2706a45b6f2c35f2b500d01fc681c74e5caa356c; path=/fnet; HTTPonly', 'X-OneAgent-JS-Injection': 'true', 'X-Frame-Options': 'SAMEORIGIN', 'Cache-Control': 'no-cache, no-store, must-revalidate', 'Pragma': 'no-cache', 'Expires': '0', 'Content-Disposition': 'attachment; filename="08706065000169-ACE28022020V01-000083505.xml"', 'Server-Timing': 'dtRpid;desc="258920448"', 'Connection': 'close', 'Content-Type': 'text/xml', 'X-XSS-Protection': '1; mode=block', 'Transfer-Encoding': 'chunked'}
But it works perfectly and download the .xml file when I point the browser to https://fnet.bmfbovespa.com.br/fnet/publico/downloadDocumento?id=247031 URL address, for example, with the following data
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<DadosEconomicoFinanceiros xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<DadosGerais>
<NomeFundo>FII BTGP LOGISTICA</NomeFundo>
<CNPJFundo>11839593000109</CNPJFundo>
<NomeAdministrador>BTG Pactual Serviços Financeiros S.A. DTVM</NomeAdministrador>
<CNPJAdministrador>59281253000123</CNPJAdministrador>
<ResponsavelInformacao>Lucas Massola</ResponsavelInformacao>
<TelefoneContato>(11) 3383-2513</TelefoneContato>
<CodISINCota>BRBTLGCTF000</CodISINCota>
<CodNegociacaoCota>BTLG11</CodNegociacaoCota>
</DadosGerais>
<InformeRendimentos>
<Rendimento>
<DataAprovacao>2021-12-15</DataAprovacao>
<DataBase>2021-12-15</DataBase>
<DataPagamento>2021-12-23</DataPagamento>
<ValorProventoCota>0.72</ValorProventoCota>
<PeriodoReferencia>Novembro</PeriodoReferencia>
<Ano>2021</Ano>
<RendimentoIsentoIR>true</RendimentoIsentoIR>
</Rendimento>
<Amortizacao tipo=""/>
</InformeRendimentos>
</DadosEconomicoFinanceiros>
It seems to me that the data is cryptographed, but I have no idea how to get the xml data to use the data inside it. Can you help me?
Thank you very much.
EDIT: The example code I've used is quite simple:
Python 3.8.10 (tags/v3.8.10:3d8993a, May 3 2021, 11:48:03) [MSC v.1928 64 bit (AMD64)] on win32 Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> url='fnet.bmfbovespa.com.br/fnet/publico/downloadDocumento?id=247031'
>>> xhtml = requests.get(url,verify=False, headers={'User-Agent':'Mozzila/5.0'})
Then xhtml.content command shows the string. (There is a HTTPS warning due to the verify=False that i will handle after) I have tried a solution using urllib.request, but got the same result
CodePudding user response:
Data seems to be base64 encoded. Try to decode it:
import requests
import base64
url = 'http://fnet.bmfbovespa.com.br/fnet/publico/downloadDocumento?id=247031'
response = requests.get(url,verify=False, headers={'User-Agent':'Mozzila/5.0'})
decoded = base64.b64decode(response.content)
print(decoded)
