golang multithreaded web crawler runs into deadlock-CodePudding

I just started to learn multithreaded programming using golang, and I'm trying to write a multithreaded web crawler using BFS traversal, however I cannot get the code working. The error I get is fatal error: all goroutines are asleep - deadlock!

I will paste the code below, but let me explain conceptually how it works:

I have one master thread (the main function itself) and N worker threads. I intentionally chose to use BFS approach with a fixed amount of worker threads, because it seems using a DFS approach I will have to spawn a new thread for each single new URL to crawl, which might become a huge burden for context switch.

I am using two channels:

urlsToCrawl: master thread sends URLs to crawl to worker threads.
urlsDiscovered: worker threads send discovered URLs back to master.

Here is the code implementation, I removed some non relevant details (e.g. how to parse html page etc..)

The trick I'm trying to do here is: I am using the channel as a queue to do BFS, and when the queue's size is 0, it is impossible to know whether it is because "A. there is really no more URLs to crawl" OR because "B. some worker thread(s) are still working so there might be more URLs to crawl soon". Therefore I introduced this count variable, basically whenever a new url is sent to workers to be crawled, count is incremented, therefore when count == 0 and channel is empty, it would mean "A. there is really no more URLs to crawl"; otherwise when count > 0 and channel is empty, it would mean "B. some worker thread(s) are still working so there might be more URLs to crawl soon".

However as I mentioned, this doesn't seem to work and I run into deadlock. Would anyone please shed some light? Thanks!

package main

import (
    "fmt"
)

var (
    count = 0   // This tracks how many worker threads are actively working right now
)

func crawlUrl(urlsToCrawl chan string, urlsDiscovered chan Pair) {
    for url := range urlsToCrawl {
        urls := getUrls(url)  // This returns an array of string, if no URL found, it returns an empty array
        urlsDiscovered <- urls
    }
}

func main() {
    urlsToCrawl := make(chan string)
    urlsDiscovered := make(chan string[])
    i := 0
    for i < 8 {
        go crawlUrl(urlsToCrawl, urlsDiscovered)
        i  
    }
    visited := map[string]bool{"some_seed_url": true}
    count  
    urlsToCrawl <- "some_seed_url"
    for urls := range urlsDiscovered {
        count--  // One message is received by master, meaning one worker thread has finished an job item, therefore decrementing count
        for _, url := range urls {
            _, ok := visited[url]
            if ok {
                continue  // This URL has been crawled before
            }
            visited[url] = true
            count     // One more work item will be sent to worker, therefore first increment count
            urlsToCrawl <- url
        }
        if count == 0 {
            close(urlsDiscovered)
            close(urlsToCrawl)
            break
        }  // else some worker must be working so let's wait to see if there is new msg coming through the channel
    }
}

CodePudding user response：

Try to integrate the WaitGroup package.

CodePudding user response：

Your channels are unbuffered

urlsToCrawl := make(chan string)
urlsDiscovered := make(chan string[])

So a goroutine which reads from or writes to a channel will block until a goroutine on the other side is doing the opposite.

So you start 8 crawlUrl goroutines which all block while reading from urlsToCrawl, meaning that main can send 8 urls before blocking. The crawlUrl goroutines are blocked until main reads from urlsDiscovered. So if you have more than 8 URL's going around, all goroutines are waiting on each other(deadlock).

The solution to this is to use buffered channels with a capacity you are very unlikely to exceed:

urlsToCrawl := make(chan string, 1000)
urlsDiscovered := make(chan string[], 100)

If you expect you might still exceed the capacity of the channel in extreme cases, you can perform non-blocking operations which allow you to for example discard discovered URL's if the channel is full instead of blocking.

select {
case: urlsDiscovered <- urls:
  // on success (url written)
default:
  // channel is full, can't write without blocking
}