I am trying to merge 2 arrays and arrange them as (ip, port) pairs. How can I do that?

As the title says. For context, the project parses IPs, ports, and their type (HTTPS or not) from a free proxy website, and later tests them on Linux to find out whether they work. It saves them as tuples and writes them to a CSV file.

import requests
import lxml
from bs4 import BeautifulSoup
import csv

names = []

url = 'https://free-proxy-list.net/'
page = requests.get(url)
soup = BeautifulSoup(page.content, features='lxml')
headers = soup.find_all('th')
headers_refined = []
headers_refined.append(headers[0])
headers_refined.append(headers[1])
headers_refined.append(headers[6])
ips = soup.find_all('td')


ips = ips[::8]
ports = soup.find_all('td')
ports = ports[1::8]

element_index = 0
for i in ips:
    ips[element_index] = str(ips[element_index])
    element_index += 1
    
element_index = 0
for i in headers_refined:
    headers_refined[element_index] = str(headers_refined[element_index])
    element_index += 1
    
element_index = 0
for i in ports:
    ports[element_index] = str(ports[element_index])
    element_index += 1
    
ips = ' '.join(ips).replace('<td>', '').split()
ips = ' '.join(ips).replace('</td>', '').split()
ips = ips[:-43:]
headers_refined = ' '.join(headers_refined).replace('<th>', '').split()
headers_refined = ' '.join(headers_refined).replace('</th>', '').split()
headers_refined = ' '.join(headers_refined).replace('<th >', '').split()
ports = ' '.join(ports).replace('<td>', '').split()
ports = ' '.join(ports).replace('</td>', '').split()
while len(ports)>len(ips):
    ports=ports[:-1:]
prev_len_ips=len(ips)
index=0
for i in range(prev_len_ips):
    ips.insert(i + 1, ports[i])


# print(headers_refined)
# print(ips)
# print(ports)
print(prev_len_ips)
print(len(ports))



print(ips)
ips = [*zip(ips[::2])]
with open('ips.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerows(ips)

The code above prints out the list in a sequence such as:

['IP','port','port','port','port',...]

This continues until it runs out of available ports. After that, it prints the IPs that are left in the list.
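
A minimal reproduction with short placeholder lists (the values are made up, not real proxies) shows the same pattern, presumably because every insert shifts the remaining IPs one position to the right:

```python
# Placeholder data standing in for the scraped values (hypothetical).
ips = ['ip0', 'ip1', 'ip2']
ports = ['p0', 'p1', 'p2']

merged = list(ips)
for i in range(len(ports)):
    # Same call as in the script: the target index grows by 1 per step,
    # but each insert also pushes the not-yet-paired IPs one place right,
    # so the ports end up next to each other instead of next to their IP.
    merged.insert(i + 1, ports[i])

print(merged)
# ['ip0', 'p0', 'p1', 'p2', 'ip1', 'ip2']
```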

P.S. I will gladly accept any other suggestions about improving and optimizing my code to look better. Thank you in advance!

CodePudding user response:

There are much simpler ways to get what you want from that page. Since you are already using lxml as a parser, this does exactly what you need:

from urllib.request import urlopen, Request
from lxml import etree

# free-proxy-list.net doesn't like Python announcing itself, use at your own risk
req = Request(
    'https://free-proxy-list.net/',
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)

# reading the contents of the page, getting the part you need
with urlopen(req) as f:
    root = etree.parse(f, parser=etree.HTMLParser())
    # get the proxies from the only textarea on the page, skip the description and timestamp
    proxies = root.xpath('*//textarea/text()')[0].split('\n')[3:]

# the format you want; skip any blank lines left over from the split
proxies = [tuple(proxy.split(':')) for proxy in proxies if proxy.strip()]
print(proxies)

This needs no third-party packages besides lxml (no bs4 or requests) and only a few lines of code.

Result:

[('64.17.30.238', '63141'), ('62.33.210.34', '58918'), ... ]
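
If you would rather keep the two lists from your own script, the pairing itself can be done with the built-in zip, and csv.writer can write the resulting tuples directly. This is only a minimal sketch, assuming ips and ports already hold plain strings in matching order. The sample values are taken from the result above, while pair_proxies and the header column names are hypothetical names chosen for illustration:

```python
import csv


def pair_proxies(ips, ports):
    """Pair each IP with the port at the same index.

    zip stops at the shorter list, so no manual trimming is needed.
    """
    return list(zip(ips, ports))


# Sample values taken from the result above.
ips = ['64.17.30.238', '62.33.210.34']
ports = ['63141', '58918']

proxies = pair_proxies(ips, ports)
print(proxies)
# [('64.17.30.238', '63141'), ('62.33.210.34', '58918')]

# Each tuple becomes one row; newline='' avoids blank lines on Windows.
with open('ips.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['ip', 'port'])  # header row (column names are an assumption)
    writer.writerows(proxies)
```

Compared with inserting ports into the IP list one by one, zip never modifies either list, so the indices cannot drift out of step.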