How to achieve parallelism and asynchrony in Web crawling using BFS-CodePudding

This question is going to be quite long but I want to explain my code and thought process as thoroughly as possible, so here goes...

I am coding a web crawler in C# which is supposed to search through Wikipedia from a given source link and find a way to a destination link. For example, you can give it a toaster Wiki page link and a pancake Wiki link and it should output a route which takes you to pancake from toast. In other words - I want to find the shortest path between two Wiki articles.

I think I have correctly coded that up, I created two classes: one is called a CrawlerPage and here is its body:

using HtmlAgilityPack;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

namespace Wikipedia_Crawler
{
    internal class CrawlerPage
    {
        public string mainLink;
        private List<CrawlerPage> _pages = new();
        public CrawlerPage(string mainLink)
        {
            this.mainLink = mainLink;
        }

        public async Task<List<CrawlerPage>> GetPages()
        {
            var pagesLinks = await Task.Run(() => GetPages(this));
            
            foreach(var page in pagesLinks)
            {
                _pages.Add(new CrawlerPage(page));
            }

            return _pages;
        }

        private HashSet<string> GetPages(CrawlerPage page)
        {
            string result = "";

            using (HttpClient client = new HttpClient())
            {
                using (HttpResponseMessage response = client.GetAsync(page.mainLink).Result)
                {
                    using (HttpContent content = response.Content)
                    {
                        result = content.ReadAsStringAsync().Result;
                    }
                }
            }

            var wikiLinksList = ParseLinks(result)
                .Where(x => x.Contains("/wiki/") && !x.Contains("https://") && !x.Contains(".jpg") &&
                            !x.Contains(".png"))
                .AsParallel()
                .ToList();

            var wikiLinksHashSet = new HashSet<string>();
            foreach(var wikiLink in wikiLinksList)
            {
                wikiLinksHashSet.Add("https://en.wikipedia.org"   wikiLink);
            }

            HashSet<string> ParseLinks(string html)
            {
                var doc = new HtmlDocument();
                doc.LoadHtml(html);
                var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
                return nodes == null ? new HashSet<string>() : nodes.AsParallel().ToList().ConvertAll(
                       r => r.Attributes.AsParallel().ToList().ConvertAll(
                       i => i.Value)).SelectMany(j => j).AsParallel().ToHashSet();
            }

            return wikiLinksHashSet;
        }
    }
}

The class above is supposed to represent a Wiki page article. It contains its own link (mainLink field) and a list of every other page that is on that page (_pages field). GetPages() methods are basically reading a page in HTML and parsing them to a HashSet with links that are of my interest (with links to other articles, this way we can discard any other junk links).

Second class is a Crawler class that performs BFS (Breadth-first search). Code below:

using System;
using System.Collections.Generic;
using System.Threading.Tasks;

namespace Wikipedia_Crawler
{
    internal class Crawler
    {
        private int _maxDepth;
        private int _currDepth;

        public Crawler(int maxDepth)
        {
            _currDepth = 0;
            _maxDepth = maxDepth;
        }
        
        public async Task CrawlParallelAsync(string sourceLink, string destinationLink)
        {
            var sourcePage = new CrawlerPage(sourceLink);
            var destinationPage = new CrawlerPage(destinationLink);
            var visited = new HashSet<string>();
            
            Queue <CrawlerPage> queue = new();
            queue.Enqueue(sourcePage);

            while (queue.Count > 0)
            {
                var currPage = queue.Dequeue();
                Console.WriteLine(currPage.mainLink);
                
                var currPageSubpages = await Task.Run(() => currPage.GetPages());

                if (currPage.mainLink == destinationPage.mainLink || _currDepth == _maxDepth)
                {
                    visited.Add(currPage.mainLink);
                    break;
                }

                if (visited.Contains(currPage.mainLink))
                    continue;

                visited.Add(currPage.mainLink);
                
                foreach (var page in currPageSubpages)
                {
                    if (!visited.Contains(page.mainLink))
                    {
                        queue.Enqueue(page);
                    }
                }
            }

            foreach (var visitedPage in visited)
            {
                Console.WriteLine(visitedPage);
            }
        }
    }
}

Note that I am not incrementing currDepth - I want to make it so that if the depth of the search goes too far, the search would stop because of the route being too long. The class above works as follows: it enqueues the page with sourceLink and performs standard BFS: it dequeues the page, checks if it has been visited, checks if this is the destination page and then gets every subpage of that page (using currPage.GetPages() and adds them to the queue. I believe that the algorithm works fine, although it is extremely sluggish and does not provide any use because of that.

My conclusion: it absolutely needs to be done asynchronously and parallel in order to be efficient. I have tried with Tasks as you can tell, but that doesn't improve the performance at all. My intuition tells me that every time we read subpages of a page, we should do that async and parallel and every time we start crawling on a page, we have to do that async and in parallel as well. I have no idea on how to achieve that, do I need to completely refactor my code? Should I create a new crawler every time I enqueue a subpage?

I'm lost, can you help me figure it out?

CodePudding user response：

You could consider using the new (.NET 6) API Parallel.ForEachAsync. This method accepts an enumerable sequence, and invokes an asynchronous delegate for each element in the sequence, with a specific degree of parallelism. One overload of this method is particularly interesting, because it accepts an IAsyncEnumerable<T> as input, which is essentially an asynchronous stream of data. You could create such a stream dynamically with an iterator method (a method that yields), but it is probably easier to use a Channel<T> that exposes its contents as IAsyncEnumerable<T>. Here is a rough demonstration of this idea:

var channel = Channel.CreateUnbounded<CrawlerPage>();
channel.Writer.TryWrite(new CrawlerPage(sourceLink));

var cts = new CancellationTokenSource();
var options = new ParallelOptions()
{
    MaxDegreeOfParallelism = 10,
    CancellationToken = cts.Token
};

await Parallel.ForEachAsync(channel.Reader.ReadAllAsync(), async (page, ct) =>
{
    CrawlerPage[] subpages = await GetPagesAsync(page);
    foreach (var subpage in subpages) channel.Writer.TryWrite(subpage);
});

The parallel loop will continue crunching pages until the channel.Writer.Complete() method is called and then all remaining pages in the channel are consumed, or until the CancellationTokenSource is canceled.

CodePudding user response：

Calling client.GetAsync(page.mainLink).Result makes your code wait synchronously. Use await client.GetAsync(page.mainLink). Doing so you should not use Task.Run. Task.Run can be used to have synchronous work be excecuted in asynchronously.

If you want parallelism you can await several Task using Task.WhenAll.