Parse HTML table for specific content in one column and print resulting table to file with python

Time:01-10

I have a file test_input.htm with a table:

    <table>
          <thead>
               <tr>
                    <th>Acronym</th>
                    <th>Full Term</th>
                    <th>Definition</th>
                    <th>Product </th>
                </tr>
         </thead>
         <tbody>
                <tr>
                    <td>a1</td>
                    <td>term</td>
                    <td>
                        <p>texttext.</p>
                        <p>Source: PRISMA-GLO</p>
                    </td>
                    <td>
                        <p>PRISMA</p>
                        <p>SDDS-NG</p>
                    </td>
                </tr>
                <tr>
                    <td>a2</td>
                    <td>term</td>
                    <td>
                        <p>texttext.</p>
                        <p>Source: PRISMA-GLO</p>
                    </td>
                    <td>
                        <p>PRISMA</p>
                    </td>
                </tr>
                <tr>
                    <td>a3</td>
                    <td>term</td>
                    <td>
                        <p>texttext.</p>
                        <p>Source: PRISMA-GLO</p>
                    </td>
                    <td>
                        <p>SDDS-NG</p>
                    </td>
                </tr> 
                <tr>
                    <td>a4</td>
                    <td>term</td>
                    <td>
                        <p>texttext.</p>
                        <p>Source: SD-GLO</p>
                    </td>
                    <td>
                        <p>SDDS-NG</p>
                    </td>
                </tr>         
           </tbody>
    </table>

I would like to write to the file test_output.htm only those table rows that contain the keyword PRISMA in column 4 (Product).

The following script gives me all table rows that contain the keyword PRISMA in any of the 4 columns:

from bs4 import BeautifulSoup

# Read and parse the input file; the with-block closes it afterwards
with open('test_input.htm') as file_input:
    results = BeautifulSoup(file_input.read(), 'html.parser')
inhalte = results.find_all('tr')

with open('test_output.htm', 'a') as f:
    for line in inhalte:  # for every row in the table
        if 'PRISMA' in line.get_text():  # keyword found anywhere in the row
            f.write("%s\n" % str(line))

I really tried hard but could not figure out how to restrict the search to column 4. The following did not work:

data = [[td.findChildren(text=True) for td in tr.findAll('td')[4]] for tr in inhalte]  

I would really appreciate it if someone could help me find the solution.

CodePudding user response:

Select more specifically to get only the elements you expect, for example with CSS selectors. The following line selects only those tr elements of the table whose fourth td contains PRISMA:

soup.select('table tr:has(td:nth-of-type(4):-soup-contains("PRISMA"))')
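
As a quick check of what that selector matches, here is a minimal sketch that runs it against a trimmed inline copy of the table from the question. The names `SAMPLE_HTML`, `SELECTOR` and `matching_acronyms` are illustrative and not part of the original post. It assumes a soupsieve version that supports `:-soup-contains` (2.1 or later, as far as I know).

```python
from bs4 import BeautifulSoup

# Trimmed inline copy of the question's table (hypothetical sample data).
SAMPLE_HTML = """
<table>
  <thead>
    <tr><th>Acronym</th><th>Full Term</th><th>Definition</th><th>Product</th></tr>
  </thead>
  <tbody>
    <tr><td>a1</td><td>term</td><td><p>Source: PRISMA-GLO</p></td><td><p>PRISMA</p><p>SDDS-NG</p></td></tr>
    <tr><td>a2</td><td>term</td><td><p>Source: PRISMA-GLO</p></td><td><p>PRISMA</p></td></tr>
    <tr><td>a3</td><td>term</td><td><p>Source: PRISMA-GLO</p></td><td><p>SDDS-NG</p></td></tr>
    <tr><td>a4</td><td>term</td><td><p>Source: SD-GLO</p></td><td><p>SDDS-NG</p></td></tr>
  </tbody>
</table>
"""

# tr:has(...)          keep rows that contain a matching descendant
# td:nth-of-type(4)    the fourth <td> among its siblings
# :-soup-contains(...) whose text (including child tags) contains the string
SELECTOR = 'table tr:has(td:nth-of-type(4):-soup-contains("PRISMA"))'


def matching_acronyms(html):
    """Return the first-column text of every row matched by SELECTOR."""
    soup = BeautifulSoup(html, 'html.parser')
    return [row.td.get_text(strip=True) for row in soup.select(SELECTOR)]


print(matching_acronyms(SAMPLE_HTML))  # ['a1', 'a2']
```

Row a3 mentions PRISMA only in column 3 (PRISMA-GLO), so it is skipped, and the header row is skipped because it has th cells rather than td cells. Note that the match is a substring test, so a product named, say, PRISMA-X in column 4 would also be selected.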

Example

from bs4 import BeautifulSoup

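# Parse the input file; the with-block closes it afterwards
with open('test_input.htm') as file_input:
    soup = BeautifulSoup(file_input.read(), 'html.parser')

# Append each row whose fourth <td> contains "PRISMA" to the output file
with open('test_output.htm', 'a') as f:
    for line in soup.select('table tr:has(td:nth-of-type(4):-soup-contains("PRISMA"))'):
        f.write("%s\n" % str(line))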
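
If the CSS pseudo-classes are unavailable or feel opaque, the same filter can be sketched with an explicit loop over the rows. It also hints at why the `[4]` attempt in the question fails: Python indices start at 0, so the fourth column is index 3, and indexing a four-cell row with `[4]` raises an IndexError. The helper name `rows_with_keyword`, its parameters and the `sample` string are illustrative assumptions, not part of the original answer.

```python
from bs4 import BeautifulSoup


def rows_with_keyword(html, keyword, column=4):
    """Return the <tr> tags whose column-th <td> (1-based) contains keyword."""
    soup = BeautifulSoup(html, 'html.parser')
    matches = []
    for row in soup.find_all('tr'):
        # Only direct <td> children; header rows have <th> cells, so
        # their list is empty and the length check below skips them.
        cells = row.find_all('td', recursive=False)
        if len(cells) >= column and keyword in cells[column - 1].get_text():
            matches.append(row)
    return matches


# Hypothetical two-row sample: only a1 has PRISMA in column 4.
sample = """
<table>
  <tr><th>Acronym</th><th>Full Term</th><th>Definition</th><th>Product</th></tr>
  <tr><td>a1</td><td>term</td><td><p>Source: SD-GLO</p></td><td><p>PRISMA</p></td></tr>
  <tr><td>a3</td><td>term</td><td><p>Source: PRISMA-GLO</p></td><td><p>SDDS-NG</p></td></tr>
</table>
"""

for row in rows_with_keyword(sample, 'PRISMA'):
    print(row.td.get_text())  # a1
```

To write the result to a file as in the example above, loop over the returned rows and call `f.write("%s\n" % str(row))` for each one.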