WordPress: Generate table of contents based on headlines from content

I want to generate a table of contents list based on the headlines of my article.

I already found a solution to get all headlines from the content and replace the <h2> tags with an <a> tag.

The problem is that I also need to replace the <h3> tags with links and show them in a nested link list.

My result should look like this:

<ul>
    <li><a href="#h2-1">I was a H2 headline</a></li>
    <li>
        <a href="#h2-2">Also a H2 headline</a>
        <ul>
            <li><a href="#h3-1">H3 headline</a></li>
            <li><a href="#h3-2">Another H3 headline</a></li>
        </ul>
    </li>
</ul>

My problem is that some headlines could contain a <strong> element and other headlines don't. At the moment I delete every <strong> with str_replace. It's not the best solution, but it works for me and my very limited understanding of regex.

The following code is my function to get every headline from my content.

I first get the content of the post and store it in $content.

From there I'm getting all headlines (<h2> - <h6>) and store them in $results with this line:

preg_match_all('#<h[2-6]*[^>]*>.*?<\/h[2-6]>#',$content,$results);

At the moment I only use the <h2> headlines because I'm not sure how to handle the other levels in an intelligent way; I would have to repeat the following lines for every headline level:

$toc = str_replace('<h2','<li><a',$toc);
$toc = str_replace('</h2>','</a></li>',$toc);

But my biggest problem is the nesting of the headlines. How could I generate HTML like the example above?

And also important: How could I handle different headline formats like these:

<h2 class="..." id="name">
<h2 id="name" class="...">
<h2 id="name">

Here's my current code:

$content_postid = get_the_ID();
$content_post   = get_post($content_postid);
$content        = $content_post->post_content;
$content        = apply_filters('the_content', $content);
$content        = str_replace(']]>', ']]&gt;', $content);

preg_match_all('#<h[2-6]*[^>]*>.*?<\/h[2-6]>#',$content,$results);

$toc = implode("\n",$results[0]);

// This part is messy because I don't really understand regex :-(
$toc = preg_replace('//', '', $toc);
$toc = str_replace('<strong>','',$toc);
$toc = str_replace('</strong>','',$toc);
$toc = str_replace('<h2','<li><a',$toc);
$toc = str_replace('</h2>','</a></li>',$toc);
$toc = str_replace('id="','href="#',$toc);

//plug the results into appropriate HTML tags
$toc = '<div id="toc">
<ul >
'.$toc.'
</ul>
</div>';

echo $toc;

This is my current output (as you can see, only <h2> headlines):

<ul >
    <li><a href="#h2-1">I was a H2 headline</a></li>
    <li><a href="#h2-2">Also a H2 headline</a></li>
</ul>

EDIT: Here's a sample HTML code that could be inside of $content:

<p>Lorem ipsum dolor sit amet...</p>
<p>consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat</p>
<img src="/path/to/image.jpg" />
<h2  id="name">
<p>Ut wisi enim ad minim veniam, quis nostrud exerci tation ullamcorper suscipit lobortis nisl ut aliquip ex ea commodo consequat</p>
<p>Duis autem vel eum iriure dolor in hendrerit in vulputate velit esse molestie consequat</p> 
<h3  id="name">Headline 3</h3>
<p>vel illum dolore eu feugiat nulla facilisis at vero et accumsan et iusto odio dignissim qui</p>
<h3  id="name">On more Headline 3</h3>
<p>blandit praesent luptatum zzril delenit augue duis dolore te feugait nulla facilisi</p>
<h2 id="name" >Headline 2 with class</h2>
<p>Nam liber tempor cum soluta nobis eleifend option congue nihil imperdiet</p>
<h2 id="name">Another Headline 2 without class</h2>
<p>doming id quod mazim placerat facer possim assum</p>

EDIT 2:

I found a function (here) that looks right, but I couldn't make it work.

I also found a function (here) that explicitly uses DOMDocument. I'm testing it right now; at the moment it outputs the whole content.

Here's the code from that:

$doc = new DOMDocument();
$doc->loadHTML($code);

// create document fragment
$frag = $doc->createDocumentFragment();
// create initial list
$frag->appendChild($doc->createElement('ol'));
$head = &$frag->firstChild;
$xpath = new DOMXPath($doc);
$last = 1;

// get all H1, H2, …, H6 elements
foreach ($xpath->query('//*[self::h1 or self::h2 or self::h3 or self::h4 or self::h5 or self::h6]') as $headline) {
    // get level of current headline
    sscanf($headline->tagName, 'h%u', $curr);

    // move head reference if necessary
    if ($curr < $last) {
        // move upwards
        for ($i=$curr; $i<$last; $i++) {
            $head = &$head->parentNode->parentNode;
        }
    } else if ($curr > $last && $head->lastChild) {
        // move downwards and create new lists
        for ($i=$last; $i<$curr; $i++) {
            $head->lastChild->appendChild($doc->createElement('ol'));
            $head = &$head->lastChild->lastChild;
        }
    }
    $last = $curr;

    // add list item
    $li = $doc->createElement('li');
    $head->appendChild($li);
    $a = $doc->createElement('a', $headline->textContent);
    $head->lastChild->appendChild($a);

    // build ID
    $levels = array();
    $tmp = &$head;
    // walk subtree up to fragment root node of this subtree
    while (!is_null($tmp) && $tmp != $frag) {
        $levels[] = $tmp->childNodes->length;
        $tmp = &$tmp->parentNode->parentNode;
    }
    $id = 'sect'.implode('.', array_reverse($levels));
    // set destination
    $a->setAttribute('href', '#'.$id);
    // add anchor to headline
    $a = $doc->createElement('a');
    $a->setAttribute('name', $id);
    $a->setAttribute('id', $id);
    $headline->insertBefore($a, $headline->firstChild);
}

// append fragment to document
$doc->getElementsByTagName('body')->item(0)->appendChild($frag);

// echo markup
echo $doc->saveHTML();

Answer:

This approach uses the DOM only to parse the HTML source and extract the relevant information. The result is then built as a string.

libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($html);

$xp = new DOMXPath($dom);
$nodes = $xp->query('//*[contains("h1 h2 h3 h4 h5 h6", name())]');

$currentLevel = ['level' => 0 /*, 'count' => 0*/ ];
$stack = [];
$format = '<li><a href="#%s">%s</a></li>';
$result = '';

foreach($nodes as $node) {
    $level = (int)$node->tagName[1]; // extract the digit after h
  
    while($level < $currentLevel['level']) {
        $currentLevel = array_pop($stack);
        $result .= '</ul>';
    }
    
    if ($level === $currentLevel['level']) {
        $currentLevel['count']++;
    } else {
        $stack[] = $currentLevel;
        $currentLevel = ['level' => $level, 'count' => 1];
        $result .= '<ul>';
    }

    $result .= sprintf($format, $node->getAttribute('id'), $node->nodeValue);    
}

$result .= str_repeat('</ul>', count($stack));


To build the expected tree structure step by step, this code uses a stack (FILO) that stores arrays holding the level (the number after h) and the number of nodes already added at that level. When the current node has a higher level than the previous node, the previous array is pushed onto the stack and a new list is opened. When the current node has a lower level than the previous node, elements are popped from the stack, closing one list each time, until the restored level is no longer higher than the current one. When the current and previous nodes have the same level, the stack stays unchanged and the count item of the array is incremented.

After the main loop, the code counts the remaining items in the stack to properly close the ul tags.
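
The string built above places each nested <ul> directly inside its parent <ul>. If the sub-list should instead sit inside the preceding <li>, as in the markup requested in the question, the same stack idea can be adapted by leaving each <li> open until the next heading is known. This is only a minimal sketch under some assumptions: PHP 7 or later with the DOM extension, headings that already carry an id attribute, and only h2 to h6 being relevant. The function name buildToc is hypothetical, and the escaping with htmlspecialchars is an addition that is not part of the code above.

```php
<?php
// Hypothetical helper: builds a nested <ul> table of contents from h2-h6 headings.
function buildToc(string $html): string
{
    libxml_use_internal_errors(true);   // tolerate imperfect real-world markup
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xp = new DOMXPath($dom);
    $nodes = $xp->query('//h2 | //h3 | //h4 | //h5 | //h6');

    $stack = [];   // heading levels of the lists that are currently open
    $result = '';

    foreach ($nodes as $node) {
        $level = (int) substr($node->nodeName, 1);   // digit after "h"

        // Close deeper lists, together with the <li> that contains each of them.
        while ($stack && end($stack) > $level) {
            array_pop($stack);
            $result .= '</li></ul>';
        }

        if ($stack && end($stack) === $level) {
            $result .= '</li>';    // sibling heading: close the previous item
        } else {
            $stack[] = $level;     // deeper heading: open a list inside the open <li>
            $result .= '<ul>';
        }

        // The <li> is left open on purpose so that a sub-list can be nested in it.
        $result .= sprintf(
            '<li><a href="#%s">%s</a>',
            htmlspecialchars($node->getAttribute('id')),
            htmlspecialchars(trim($node->textContent))
        );
    }

    // Close every item and list that is still open.
    return $result . str_repeat('</li></ul>', count($stack));
}
```

In a WordPress template this could be called as echo buildToc($content); after the_content filters have run. Because the text is read from textContent, inline elements such as <strong> inside a headline are dropped without any str_replace calls, and the order of the attributes in the heading tag does not matter.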

xpath query details:

 //*        [contains("h1 h2 h3 h4 h5 h6", name())]
|___|      |_______________________________________|
location   predicate
path

location path:

// everywhere in the DOM tree from the current location (that is by default the root)
* any Element node

predicate:

name() returns the name of the current Element
contains(haystack, needle) returns true when the string haystack contains the string needle
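
To see which elements the predicate selects, the query can be run against a small fragment. This is just an illustrative sketch: the sample markup and the variable $names are made up for the example.

```php
<?php
libxml_use_internal_errors(true);

$dom = new DOMDocument();
// Made-up sample markup: three headings mixed with other elements.
$dom->loadHTML('<h1>Title</h1><p>Text</p><h2 id="a">First</h2><div><h3 id="b">Nested</h3></div>');

$xp = new DOMXPath($dom);

// Collect the element names matched by the predicate, in document order.
$names = [];
foreach ($xp->query('//*[contains("h1 h2 h3 h4 h5 h6", name())]') as $node) {
    $names[] = $node->nodeName;
}

echo implode(' ', $names), "\n";   // prints "h1 h2 h3"
```

Elements such as html, body, p and div are skipped because their names do not occur in the haystack string. Since contains() is a plain substring test, restricting the TOC to certain levels only requires shortening that string, for example to "h2 h3".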