How to use Robots Exclusion Protocol (robots.txt) when crawling?

Posted: one year ago Quote #112
Unless I am mistaken, I have not found anything on the use of the Robots Exclusion Protocol when crawling a site.

The Robots Exclusion Protocol, or robots.txt protocol, is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website that is otherwise publicly viewable.


It would be interesting to have an easy-to-use class to work with robots.txt files (a robots.txt parser).

Posted: one year ago Quote #113
A nice robots.txt implementation for C# / .NET is this project; it can be included without pain in the scheduler or in tasks:
https://code.google.com/p/robotstxt/

It is also available as a NuGet package:
https://www.nuget.org/packages/RobotsTxt

You can fine-tune your crawling process by respecting the crawl-delay directives. I would recommend doing this in the scheduler, not in the task workflow. There are several ways to implement this.

BTW, this library also allows you to retrieve the sitemap URLs from the robots.txt.
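
Roughly, usage could look like the sketch below. Note that I'm writing the Robots class and its Load / IsPathAllowed / CrawlDelay / Sitemaps members from memory, so double-check them against the actual package:

// Sketch only: the RobotsTxt member names below are assumptions, verify them against the package.
using System;
using System.Net;
using RobotsTxt;

class RobotsCheck
{
    static void Main()
    {
        string content;
        using (var client = new WebClient())
        {
            content = client.DownloadString("http://example.com/robots.txt");   // fetch robots.txt
        }

        var robots = Robots.Load(content);                                      // parse the file
        bool allowed = robots.IsPathAllowed("MyCrawler", "/some/page.html");    // allow/deny check
        long delay = robots.CrawlDelay("MyCrawler");                            // crawl-delay for our user agent

        Console.WriteLine("Allowed: " + allowed + ", Crawl-delay: " + delay);

        foreach (var sitemap in robots.Sitemaps)                                // sitemap URLs listed in robots.txt
            Console.WriteLine(sitemap.Url);
    }
}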

  
Crawler-Lib Developer
Posted: one year ago Quote #115
Tom,

Can you help me implement this in the scheduler rather than in the task workflow?
I use the Group workflow (e.g.: How to use WorkFlow Group Element dynamically?).

Thanks.
Posted: one year ago Quote #117
I would like to integrate the robots.txt checking into the Website Crawler Example, because it has a threaded scheduler to perform tasks. I recommend using this code when you design a crawler.

I use this scheduler design for a GUI-based tool I'm currently supervising:
http://www.web-seo-ranking.com/

If you want to do something really big, we also have backend hosting framework components in the pipeline. This is a modularized toolbox to build Windows services and Linux daemons (with Mono). Such a service can host a crawler, has a scheduler module out of the box, and can spread the crawling over several machines. The tasks are sent/received via WCF. This is why all TaskRequest and TaskResult classes have the [DataContract] attribute. The Crawler-Lib Engine is designed to spread work to different computers to build a task processing farm; in your case it is a crawler farm.

BTW, using multiple computers for crawling is relatively painless even when you haven't designed for it from the beginning, because the structure of the Crawler-Lib Engine forces you to use the Request/Task/Result pattern.
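
To illustrate the idea: requests and results are plain data contracts, so WCF can serialize them and ship them to other machines. The class and property names below are just placeholders, not real Crawler-Lib base classes:

// Illustrative sketch only: PageCrawlRequest/PageCrawlResult are made-up names,
// not actual Crawler-Lib classes. The point is that requests and results are
// plain [DataContract] classes, so WCF can serialize them between machines.
using System.Runtime.Serialization;

[DataContract]
public class PageCrawlRequest
{
    [DataMember] public string Url { get; set; }          // page to fetch
    [DataMember] public string UserAgent { get; set; }    // user agent to announce
}

[DataContract]
public class PageCrawlResult
{
    [DataMember] public int StatusCode { get; set; }      // HTTP status of the fetch
    [DataMember] public string Content { get; set; }      // downloaded HTML (or null on failure)
}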
Crawler-Lib Developer
Posted: one year ago Quote #149
What is the Crawler-Lib Engine class/property to set the number of seconds indicated by the Crawl-delay directive in the robots.txt file?
Posted: one year ago Quote #153
You have to include this in your scheduler or in your task workflow if you want to handle delays.
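
On the scheduler side, one simple approach (just a sketch, not part of the Crawler-Lib API) is to remember the last request time per host and wait out the remaining crawl-delay before releasing the next task for that host:

// Sketch of a scheduler-side crawl-delay gate (not part of the Crawler-Lib API).
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public class CrawlDelayGate
{
    private readonly ConcurrentDictionary<string, DateTime> lastRequest =
        new ConcurrentDictionary<string, DateTime>();

    // Waits until at least 'crawlDelay' has passed since the last request to 'host'.
    // Simplified: assumes the scheduler releases at most one task per host at a time.
    public async Task WaitAsync(string host, TimeSpan crawlDelay)
    {
        DateTime last;
        if (lastRequest.TryGetValue(host, out last))
        {
            TimeSpan remaining = last + crawlDelay - DateTime.UtcNow;
            if (remaining > TimeSpan.Zero)
                await Task.Delay(remaining);
        }
        lastRequest[host] = DateTime.UtcNow;
    }
}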
Crawler-Lib Developer
Posted: one year ago Quote #154

With the example you explain in this topic: Workflow Example with Retry & Limit & Group workflow elements, to handle the Crawl-delay directive, I can do this (rough sketch, exact constructors from memory):


int idxCrawlDelay = 0;
long CrawlDelayDirective = 2;   // seconds, taken from the robots.txt Crawl-delay directive

await new Group(group =>
{
    for (int i = 0; i < 10; i++)
    {
        idxCrawlDelay++;
        // each child delays by a multiple of the crawl-delay, staggering the requests
        new Delay(CrawlDelayDirective * 1000 * idxCrawlDelay, async delai =>
        {
            await new Retry(3, async retry =>
            {
                await new Limit("PageRequestLimiter", async limited =>
                {
                    await new HttpRequest(/* ... the actual page request, elided ... */);
                });
            });
        });
    }
});



Do we agree?