Simple Task Sample

In this C# sample we will develop a task that gets a URL of a website in its request and provides a list of links from that website in its result. First of all, the classes for the request and the result:

[DataContract]
public class SimpleTaskRequest : TaskRequestBase
{
	// The URL of the page whose links should be extracted.
	[DataMember]
	public Uri Url { get; set; }

	// Factory method: this is how the engine creates the actual task.
	public override TaskBase CreateTask()
	{
		return new SimpleTask();
	}
}

[DataContract]
public class SimpleTaskResult : TaskResultBase
{
	// All links (href values) found on the page.
	[DataMember]
	public List<string> Links { get; set; }
}

As we can see, the task request contains the CreateTask() method, which is a factory method for the task. This is how the actual task instance is created. Here is the task itself:

public class SimpleTask : TaskBase
{
	// Redefine the request and result properties with the concrete types.
	public new SimpleTaskRequest TaskRequest { get { return (SimpleTaskRequest)base.TaskRequest; } }
	public new SimpleTaskResult TaskResult { get { return (SimpleTaskResult)base.TaskResult; } }

	public override async void StartWork()
	{
		// Assign the result instance first so exceptions can be delivered in it.
		base.TaskResult = new SimpleTaskResult();
		var request = await new HttpRequest(TaskRequest.Url, new HttpRequestQuota { MaxDownloadSize = 100000, OperationTimeoutMilliseconds = 10000, ResponseTimeoutMilliseconds = 5000 });
		this.TaskResult.Links = new List<string>();

		// Collect the href attribute of every anchor element on the page.
		HtmlNodeCollection nodes = request.Html.DocumentNode.SelectNodes("//a[@href]");
		foreach (var node in nodes)
		{
			string href = node.Attributes["href"].Value;
			this.TaskResult.Links.Add(href);
		}
	}
}

The workflow begins in the StartWork() method of the task. It is important to set the TaskResult property before any other code is executed: exceptions thrown in StartWork() are delivered in the FatalException property of the task result, so an instance must already exist to store them. This workflow uses the async/await pattern to specify the success handler for the request as a continuation.
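
To make the error path concrete, a consumer of the result could check FatalException before touching the links. How the engine actually hands the result back (event, callback, queue) is not shown in this sample, so the HandleResult method below is purely a hypothetical illustration:

// Hypothetical consumer-side handler; only the properties mentioned in this
// article (FatalException, Links) are assumed to exist on the result.
void HandleResult(SimpleTaskResult result)
{
	if (result.FatalException != null)
	{
		// The task failed inside StartWork(); the exception was stored here.
		Console.WriteLine("Task failed: " + result.FatalException.Message);
		return;
	}
	foreach (string link in result.Links)
	{
		Console.WriteLine(link);
	}
}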

Building a Task Skeleton – Step By Step

  1. Create a task class and derive it from TaskBase.
  2. Create a task request class, mark it with [DataContract] and derive it from TaskRequestBase.
  3. Create a task result class, mark it with [DataContract] and derive it from TaskResultBase.
  4. In the task request class override the CreateTask() method and return a task instance.
  5. In the task class override the StartWork() method and assign a task result instance to the TaskResult property.

For convenience you can redefine the TaskRequest and TaskResult properties with the correct types, as done in the sample above. After that you can start coding the business logic of your task.
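
Putting these steps together, a bare task skeleton could look like the following sketch. The Dummy* names are placeholders for your own task, not part of the framework:

[DataContract]
public class DummyTaskRequest : TaskRequestBase
{
	// Parameters for the task go here, each marked with [DataMember].

	public override TaskBase CreateTask()
	{
		return new DummyTask();
	}
}

[DataContract]
public class DummyTaskResult : TaskResultBase
{
	// Results produced by the task go here, each marked with [DataMember].
}

public class DummyTask : TaskBase
{
	// Optional: redefine the properties with the concrete types.
	public new DummyTaskRequest TaskRequest { get { return (DummyTaskRequest)base.TaskRequest; } }
	public new DummyTaskResult TaskResult { get { return (DummyTaskResult)base.TaskResult; } }

	public override void StartWork()
	{
		// Always assign the result instance first so exceptions can be reported in it.
		base.TaskResult = new DummyTaskResult();

		// Business logic goes here.
	}
}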

Task Request and Result Design

Somebody will ask why such a seemingly bloated concept of requests and results is introduced in a crawler. In fact it wasn't, until the crawler engine was generalized into a task processor. Since nobody can say in advance what a general task needs in order to start and what it delivers, there must be a mechanism to provide parameters to the task and to deliver its results. We decided to use classes that derive from TaskRequestBase and TaskResultBase for this, for the following reasons:

  • Short Lifetime for the Resource-Consuming Task
    The task itself is only created when it starts, so none of the resources it needs are allocated for waiting task requests. By design the task is disposed after completion, so at this point at the latest all unmanaged resources are released. In fact the workflow elements are disposed as soon as they have finished their work, so most resources are released much earlier. 
  • Memory Footprint
    Normally the requests don't contain much data (in this example a URL), so thousands of them can wait in the engine until they are ready for processing without using much memory. The same is true for the results: they contain only the processed data, which is normally a fraction of the internal data a task uses.
  • Data Contract Serialization
    Task requests and results simplify serialization because they contain only data that should be serialized. All of their members can be marked with [DataMember], while everything in the task itself is temporary and never delivered (see the serialization sketch after this list).
  • Separation of Concerns
    The task request provides the parameters, the task result contains the result, and the task itself does the processing.
  • Acceptable Overhead
    The overhead of three classes instead of cramming everything into one is acceptable given the advantages of this design.
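
To illustrate the serialization point, a request like the one from the sample can be pushed through the standard .NET DataContractSerializer; only the members marked with [DataMember] end up in the output. This is a minimal sketch that is independent of the engine itself:

using System;
using System.IO;
using System.Runtime.Serialization;
using System.Text;

static class SerializationDemo
{
	// Serializes a SimpleTaskRequest to XML; only [DataMember] members are written.
	public static string Serialize(SimpleTaskRequest request)
	{
		var serializer = new DataContractSerializer(typeof(SimpleTaskRequest));
		using (var stream = new MemoryStream())
		{
			serializer.WriteObject(stream, request);
			return Encoding.UTF8.GetString(stream.ToArray());
		}
	}
}

Deserialization works the same way through ReadObject(), which is what makes it straightforward to hand requests and results across process boundaries.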

Memory pressure is one of the main reasons why crawlers lose throughput. So, after all, the concept of task requests and results is lean rather than bloated, and it is needed especially with regard to the memory footprint.