Tldr: How do I go about creating a web crawler that runs iteratively and concurrently?
I’m trying to write a web crawler in C# for my own learning. I originally went with a recursive BFS approach: get a URL, scan the HTML and find the anchor tags whose URLs satisfy a condition, then scan the HTML of each of those URLs for more matching anchor tags, and so on. But I don’t like this idea; it’s really quick to hit a stack overflow for obvious reasons. I really want to get better at multi-threaded, concurrent solutions in C#.
I would like to create a solution that accomplishes this concurrently and iteratively, but I can’t quite formulate the steps in my head. Even so, I think an iterative, concurrent approach will be the better, more scalable one (though please correct me if I’m wrong).
Anyone have any ideas or some good resources I could use for inspiration? I’m comfortable with JavaScript but want to write this in C# if possible, and everything I’ve found online is written in functional languages I don’t understand well, so I’m struggling to pick up the concepts.
Thanks in advance.
All recursive operations can be implemented with a queue, and it just so happens that a queue is a nice structure for doling out work between multiple threads. In particular, .NET has a ConcurrentQueue designed for this.
Here’s the basic way that works. Let’s say you have three threads. At the start of the program:
The queue is empty.
All three threads have nothing to do.
To start the process, you put the first URL into the queue. That doesn’t sound like a lot, but the threads should be running code that looks like this:
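Something along these lines (a minimal sketch, assuming a shared ConcurrentQueue<string> and an HttpClient; SatisfiesCondition stands in for whatever filter and de-duplication logic you want):

```csharp
using System;
using System.Collections.Concurrent;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading;

static class CrawlWorker
{
    // Placeholder for whatever filter / de-duplication logic you want.
    static bool SatisfiesCondition(string link) => true;

    public static void Run(ConcurrentQueue<string> queue, HttpClient http)
    {
        while (true)
        {
            if (!queue.TryDequeue(out var url))
            {
                Thread.Sleep(100);          // no work yet; nap briefly, then check again
                continue;
            }

            string html;
            try
            {
                html = http.GetStringAsync(url).GetAwaiter().GetResult();
            }
            catch (HttpRequestException)
            {
                continue;                   // skip pages that fail to download
            }

            // Naive link extraction; a real crawler would use an HTML parser.
            foreach (Match m in Regex.Matches(html, "href=\"(http[^\"]+)\""))
            {
                string link = m.Groups[1].Value;
                if (SatisfiesCondition(link))
                    queue.Enqueue(link);    // new work for some thread to pick up
            }
        }
    }
}
```

Each of the three threads just runs that loop, e.g. new Thread(() => CrawlWorker.Run(queue, http)).Start();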
Once you queue the first URL, one of the threads will eventually wake up and notice there’s some work. It dequeues that work and starts working. If the other threads wake up after this, they won’t see any work. If two threads wake up at the same time, only one of them gets to dequeue the URL.
When the thread that’s working finds a matching URL, it enqueues it. So at some point while it’s working, another thread wakes up and finds there’s work and starts working. Now two threads are crawling and adding links to the queue. Eventually the third thread will wake up, find work to do, and start.
It’s up to you to figure out what “done” means and how to tell the threads when it’s time to stop.
That’s it. This approach won’t really keep things in order, as once all three threads are working there’s no telling what order they’ll add stuff to the list. Making sense of that means putting an object more complex than just a URL in the queue so something further down the pipe can reassemble results into the correct order. A good way for the threads to publish “results” is another queue!
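For instance, the “result” could be a small record with enough context to reassemble order later (the type and field names here are made up for illustration):

```csharp
using System.Collections.Concurrent;

// Carries where the link was found and how deep it is, not just the URL itself.
public record CrawlResult(string ParentUrl, string FoundUrl, int Depth);

public static class Results
{
    public static readonly ConcurrentQueue<CrawlResult> Queue = new();
}

// Inside a worker, after a matching link is found:
//     Results.Queue.Enqueue(new CrawlResult(url, link, depth + 1));
```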
This isn’t the only way. A newer approach would use “Channels”, a pretty cool new type that is a bit less clunky and more async-friendly than ConcurrentQueue. However, it’s the same basic concept; it just uses async/await instead of dedicated threads. I recommend getting the hang of it with threads first before diving into async/await. The main difference is your pseudocode for a “thread” becomes this loop:
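Roughly something like this (an untested sketch of the Channels API; the channel, HttpClient, and placeholder filter are just illustrative):

```csharp
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Channels;
using System.Threading.Tasks;

static class ChannelWorker
{
    static bool SatisfiesCondition(string link) => true;   // placeholder filter

    public static async Task RunAsync(Channel<string> channel, HttpClient http)
    {
        // ReadAllAsync waits for items instead of polling, and the loop ends
        // once the writer is marked complete.
        await foreach (string url in channel.Reader.ReadAllAsync())
        {
            string html = await http.GetStringAsync(url);

            foreach (Match m in Regex.Matches(html, "href=\"(http[^\"]+)\""))
            {
                string link = m.Groups[1].Value;
                if (SatisfiesCondition(link))
                    await channel.Writer.WriteAsync(link);
            }
        }
    }
}
```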
Obviously that’s less complicated, but I’m not as familiar with channels and didn’t want to make mistakes.
The main thing to avoid is manual thread synchronization. Madness lies down that path. Once you start writing your own lock statements, it’s always worth stepping back and asking if you can avoid it!
Thank you for this, I was actually going to write my own lock statements.
And thank you for introducing me to the concept of channels, much appreciated.
I don’t quite get why you don’t like the BFS approach. If you want to improve your skills on concurrency, you could implement a concurrent version of BFS. Think about it this way: You have a thread-safe FIFO queue that acts as a task queue and you have a pool of workers. Each worker just runs the following loop forever:
Grab the first element of the queue.
Parse the HTML and extract the links.
Add the new links to the back of the task queue.
When the task queue is empty and all workers are idle, you are finished. You could probably use the TPL for this, but if you are mainly interested in understanding the concepts, you could start by implementing it from scratch; a rough sketch follows below.
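For example (a minimal sketch; the queue, the pending counter, and the ExtractLinks placeholder are all illustrative, and you would seed it with Enqueue(startUrl) before starting the worker threads):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

class Crawler
{
    private readonly ConcurrentQueue<string> _queue = new();
    private int _pending;   // pages enqueued but not yet fully processed

    public void Enqueue(string url)
    {
        Interlocked.Increment(ref _pending);
        _queue.Enqueue(url);
    }

    // Run this on each worker thread.
    public void Worker()
    {
        // Finished only when nothing is queued AND no worker is mid-page.
        while (Volatile.Read(ref _pending) > 0)
        {
            if (!_queue.TryDequeue(out var url))
            {
                Thread.Sleep(50);       // another worker may still add links
                continue;
            }

            foreach (var link in ExtractLinks(url))   // parse the HTML, filter, dedupe
                Enqueue(link);

            Interlocked.Decrement(ref _pending);      // this page is done
        }
    }

    // Placeholder: real code would download the page and parse its anchor tags.
    private IEnumerable<string> ExtractLinks(string url) => Array.Empty<string>();
}
```

The pending counter is what makes “the task queue is empty and all workers are idle” detectable: it only reaches zero once every dequeued page has also finished being processed.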
First, many thanks for the explanation; it really has helped my mental model.
Regarding my worry about BFS: I could be wrong, but I’m really worried about stack overflows and the scalability of a recursive solution.
My first reaction is to use some sort of producer/consumer mechanism to produce and consume the links to crawl (where the consuming thread will also produce more links).
https://devblogs.microsoft.com/dotnet/an-introduction-to-system-threading-channels/
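A minimal sketch of that producer/consumer shape using BlockingCollection (names are illustrative, and shutdown logic such as CompleteAdding is left out; the Channels type from the article above would work just as well):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

class LinkPipeline
{
    private readonly BlockingCollection<string> _links = new();

    public void Produce(string url) => _links.Add(url);

    // Each consumer blocks until a link is available, and also acts as a
    // producer by adding the links it discovers back into the collection.
    public void Consume()
    {
        foreach (string url in _links.GetConsumingEnumerable())
        {
            foreach (string link in ExtractLinks(url))
                _links.Add(link);
        }
    }

    // Placeholder: real code would download and parse the page.
    private IEnumerable<string> ExtractLinks(string url) => Array.Empty<string>();
}
```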
thanks for the link, much appreciated
Well, what you could do is have a thread that loops indefinitely. First it starts from a page and extracts its URLs. Then it saves those URLs into memory or a database and finishes the loop iteration. On the next iteration, it picks one of the URLs at the top of the database, reads it and extracts its URLs, and adds them to the database (preferably only if they don’t already exist).
This would work for a single thread. If you want more than one thread, then you need a mechanism to flag a row in your database as being worked on, so that no other thread can pick it.
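A rough sketch of the “claim a row” idea, assuming a SQLite table urls(url TEXT PRIMARY KEY, status TEXT) and the Microsoft.Data.Sqlite package (both assumptions, not part of the original suggestion):

```csharp
using Microsoft.Data.Sqlite;

static class UrlStore
{
    // Pick a pending URL and mark it in-progress in one transaction so another
    // worker doesn't grab the same row. Each worker should use its own connection;
    // under heavy contention you may still need to retry on "database is locked".
    public static string? ClaimNextUrl(SqliteConnection conn)
    {
        using var tx = conn.BeginTransaction();

        using var select = conn.CreateCommand();
        select.Transaction = tx;
        select.CommandText = "SELECT url FROM urls WHERE status = 'pending' LIMIT 1";
        var url = select.ExecuteScalar() as string;
        if (url is null) { tx.Commit(); return null; }

        using var update = conn.CreateCommand();
        update.Transaction = tx;
        update.CommandText = "UPDATE urls SET status = 'in_progress' WHERE url = $url";
        update.Parameters.AddWithValue("$url", url);
        update.ExecuteNonQuery();

        tx.Commit();
        return url;
    }
}
```

In a server database the same idea is usually expressed with SELECT ... FOR UPDATE (or an UPDATE ... RETURNING), so two workers cannot claim the same row.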