Advice on how to build a high-performance / high-throughput crawler

Posted: 4 years ago Quote #69
Prior to testing your library I was using HtmlAgilityPack and drilling down through a store site to get item title, UPC, and price. It was a recursive routine that called itself when category links were present, and when it reached a plain item link it parsed the necessary data. My throughput was about 100 items a minute, including updating a SQL database after a pageful of links. This, of course, was taking 3.5 to 5 hours to complete a full parse due to the number of items on the site. I wanted to trim this down by at least fourfold.
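The recursive drill-down described above can be sketched roughly as follows. This is an illustrative Python sketch, not the poster's actual C#/HtmlAgilityPack code; the site map and all names are made up, and `fetch` stands in for an HTTP request plus HTML parse:

```python
# Hypothetical site map standing in for real pages; a category page lists
# sub-categories and/or item links, an item page carries the data to parse.
SITE = {
    "/root":   {"categories": ["/cat/a", "/cat/b"], "items": []},
    "/cat/a":  {"categories": [], "items": ["/item/1", "/item/2"]},
    "/cat/b":  {"categories": ["/cat/b1"], "items": []},
    "/cat/b1": {"categories": [], "items": ["/item/3"]},
    "/item/1": {"title": "Widget", "upc": "0001", "price": 9.99},
    "/item/2": {"title": "Gadget", "upc": "0002", "price": 4.50},
    "/item/3": {"title": "Gizmo",  "upc": "0003", "price": 2.25},
}

def fetch(url):
    """Stand-in for downloading and parsing a page."""
    return SITE[url]

def crawl(url, results):
    page = fetch(url)
    for cat in page.get("categories", []):
        crawl(cat, results)            # category link present: recurse deeper
    for item_url in page.get("items", []):
        item = fetch(item_url)         # plain item link: parse the data
        results.append((item["title"], item["upc"], item["price"]))

results = []
crawl("/root", results)
```

Serial recursion like this is simple and correct, but each page waits for the previous one, which is why the full parse took hours.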
I revamped the program to pull the item links for one site category into a queue, which I referenced in a TaskRequestBase and ran with a TaskBase. I experimented with parallel processing and limited it by running only 10 links at a time, then waiting a second or so before repeating. I tried all different numbers of links at a time and delays in between. I was getting results, but it was missing quite a bit of data here and there. I decided to give your Retry a test. This actually seemed worse, and I didn't see where I was getting any Retry data back.
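The "only 10 links at a time" throttle described above is usually done with a counting semaphore rather than batch-and-sleep, so a new request starts the moment one finishes. A minimal Python sketch of that pattern (illustrative only; the real work would be the HTTP request and parse where the comment sits):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_IN_FLIGHT = 10
sem = threading.Semaphore(MAX_IN_FLIGHT)

# Bookkeeping only, to demonstrate the limit actually holds.
peak = 0
in_flight = 0
lock = threading.Lock()

def fetch_and_parse(url):
    global peak, in_flight
    with sem:                         # at most MAX_IN_FLIGHT run concurrently
        with lock:
            in_flight += 1
            peak = max(peak, in_flight)
        # ... real code would issue the HTTP request and parse the page here ...
        with lock:
            in_flight -= 1
    return url

urls = [f"/item/{i}" for i in range(100)]
with ThreadPoolExecutor(max_workers=32) as pool:
    done = list(pool.map(fetch_and_parse, urls))
```

Compared with "run 10, sleep, run 10", this keeps the pipeline full without ever exceeding the cap, which tends to smooth out the bursts that cause dropped responses.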
Now I'm sure that maybe I'm not adept at how the parallel processing works, and it may be my fault in the way I am approaching it. Maybe it is the limitations of the community edition and I'm just not addressing that limitation correctly in code. I did see where someone wrote about the 600 requests per minute and 2 threads. I did, however, trim down to only a couple of requests every few seconds and still had bottlenecks. I would have expected at least results comparable to serial processing using only HtmlAgilityPack. Have you any other samples of how you would go about this type of process?
The site can take the throughput, or I wouldn't be able to achieve 100 per minute using serial methods. They do no checking of IPs or blocking. They are probably delirious with pleasure reporting to their sponsors how many hits they are receiving.
I have my parsing routines inside the TaskBase. Do you think this is causing it to be too weighty, with too high a memory footprint? I'll look into changing that around to see if it helps. But any suggestions you have would be of great value to me.
I'm sure this would be an awesome library if it weren't for Neanderthals like me going about it the wrong way. Sorry for the trouble, but I did spend about 10 hours on it today.
Posted: 4 years ago Quote #70
I will help you design the task. Parsing inside the task is absolutely OK. I assume you have a test project for the task design. Please send me a zipped version of it. You will get back a working sample with the best performance. I will treat your code as confidential.

Kind regards Tom

Thank you for your response. I know we lack samples at the moment, and the Crawler-Lib Framework is too complex to be understood out of the box. But major parts (like the storage operation processor) are not released yet, so we decided to release all the parts first and come up with the samples later. This kind of feedback is very important for us, and we encourage everybody to share their use cases with us. We will provide specific samples and solutions for your problems.
Crawler-Lib Developer
Posted: 4 years ago Quote #72
I sent a zipped copy of the console test program. Thanks
Posted: 4 years ago Quote #73
With pleasure. Please let me know via the forum when you have sent it. Best regards, Tom
Crawler-Lib Developer
Posted: 4 years ago Quote #74
I sent it about a minute ago. Let me know if you didn't receive it. Thank you so much.
Posted: 4 years ago Quote #75
I have just finished coding the link following (the recursion in your code). The engine retrieves about 600 pages per minute with the community edition (this is in fact the limit of the community edition). That is 10 pages per second.
3000 pages will be crawled in about five minutes. The rest of the time is parsing, which I'm integrating now.
BTW, at the moment I have a very small memory footprint of about 45 MB during the retrieval process.

In a production environment, 10+ pages per second is a lot of load on a single server, so it would be nice to reduce this gracefully to a more reasonable value with a Limiter/Limit workflow element.
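A graceful request-rate limit of this kind is commonly implemented as a token bucket. The sketch below is a generic Python illustration of the idea, not the Crawler-Lib Limiter API; the class, rates, and names are all hypothetical:

```python
import time

class TokenBucket:
    """Allow at most `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self):
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True             # request may go out now
        return False                # caller should wait and retry

bucket = TokenBucket(rate=10, capacity=10)   # roughly 10 requests per second
granted = sum(1 for _ in range(100) if bucket.try_acquire())
```

A crawler worker would call `try_acquire` before each request (sleeping briefly when it returns False), so the crawl degrades to a steady, polite rate instead of hammering the target server at whatever the engine can sustain.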

Crawler-Lib Developer
Posted: 4 years ago Quote #76
Wow! That is fantastic news, Tom. Thank you for all your hard work. I was playing with the outline you had sent me earlier. I really appreciate it!
Posted: 4 years ago Quote #77
I've just sent you links to download my solution for your crawling problem. My tests have shown that 10 pages per second (the 600-tasks-per-minute limit of the community edition) can be retrieved and parsed. The memory footprint and CPU load are low.

I will add a limiter to the workflow (currently the parallel requests are limited by the license restrictions) and test it with the unlimited edition to see what the server can provide at maximum.

In fact, the Crawler-Lib engine can retrieve and process 10,000 pages per second with ease when the bandwidth is high, the crawled servers respond in reasonable time, and the scheduler is correctly implemented. But I don't think the server you're crawling can provide that much throughput.

Crawler-Lib Developer
Posted: 4 years ago Quote #78
Thank you so much, Tom. I have been messing with the code all day! I have a Windows Forms app with a DataSet that I am integrating your code into. Stunning, to say the least. I really appreciate you showing this Neanderthal how your Lib operates. Expect our team to get the full-blown version soon. We will not take the Walmart servers down; 400 a minute is plenty for us and much better than what we were getting. We all appreciate your help in this and may shoot you a few more questions as we go. Again, thanks!
Posted: 4 years ago Quote #79
You're welcome, Peter. Thank you for your patience and the chance you gave us to show how things work. Contact me if you have any questions or suggestions; we will help as best we can. Thanks!
Crawler-Lib Developer