DepSpid
The Dependency Spider
DepSpid is a distributed web crawler (like the ones used by search engines) and has two major goals:
- Build up a database containing the dependencies between individual web sites and groups of web sites.
- Collect statistical data about the structure of the web.
All information collected by the spider will be made publicly available.
Two phases
Every DepSpid task is divided into two phases.
Networking phase (Phase One)
During this non-CPU-intensive phase the DepSpid spider will scan a set of web pages and store the results in a temporary database for later processing during Phase Two.
A DepSpid task consists of multiple jobs. The exact number of jobs per task may vary but will usually be between 10 and 50. A job is normally a domain, a subdomain or a directory under a domain. During Phase One the DepSpid spider cycles through the jobs of a task to limit the load it places on the visited servers.
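The cycling between jobs can be pictured as a simple round-robin schedule. The following is only a minimal sketch of that idea, not DepSpid's actual scheduler: the function name `round_robin` and the representation of a job as a list of pending URLs are assumptions made for illustration.

```python
from collections import deque


def round_robin(jobs):
    """Yield (job_name, url) pairs, taking one pending URL per job in turn.

    `jobs` maps a hypothetical job name (e.g. a domain) to its pending URLs.
    Visiting one URL per job before returning to the same job spreads
    consecutive requests across different servers.
    """
    # Jobs without pending URLs are skipped up front.
    queue = deque((name, deque(urls)) for name, urls in jobs.items() if urls)
    while queue:
        name, pending = queue.popleft()
        yield name, pending.popleft()
        if pending:
            # The job still has work: move it to the back of the queue.
            queue.append((name, pending))


if __name__ == "__main__":
    task = {"a.example": ["/1", "/2"], "b.example": ["/1"]}
    for job, url in round_robin(task):
        print(job, url)
```

With 10 to 50 jobs per task, a schedule like this means each server only sees a request after all other active jobs have had their turn.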
Every job will start by downloading the main page of the domain/subdomain and the corresponding robots.txt (if available). The downloaded page will then be scanned for links (and some other parts). Each of these links will be validated with an HTTP HEAD request. The dependency between the page and its links will be stored in a temporary database. The spider will follow each link that belongs to the domain where it started and will process these pages like the main page if they are not excluded by the robots.txt. Links that would leave the original domain will be marked as external links and will not be processed further by this job.
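The link handling described above can be sketched with Python's standard library. This is an illustration under stated assumptions, not the spider's real code: the names `LinkCollector` and `classify_links` and the user agent string `ExampleSpider` are hypothetical, only `<a href>` links are considered, and the HTTP HEAD validation step is left out so that the example runs without network access.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser


class LinkCollector(HTMLParser):
    """Collect the href values of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value:
                    self.links.append(value)


def classify_links(page_url, html, robots_txt, agent="ExampleSpider"):
    """Split the links of a page into internal, external and excluded URLs."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())

    collector = LinkCollector()
    collector.feed(html)

    home = urlparse(page_url).netloc
    internal, external, excluded = [], [], []
    for href in collector.links:
        url = urljoin(page_url, href)  # resolve relative links
        if urlparse(url).netloc != home:
            external.append(url)       # leaves the domain: record, do not follow
        elif not robots.can_fetch(agent, url):
            excluded.append(url)       # disallowed by robots.txt
        else:
            internal.append(url)       # would be processed like the main page
    return internal, external, excluded
```

In this sketch "the same domain" means an identical host name; whether the spider treats subdomains as part of the starting domain is not stated here, so that comparison is an assumption as well.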
A job will end when there are no more links to visit or when one of the predefined limits is reached. Current limits are the level (depth), the number of visited links and the number of bytes transferred.
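The stop condition can be written as a small check that runs after each visited link. This is a sketch only: the class and field names are hypothetical, and the limit values are placeholders, since the actual limits are not given here.

```python
from dataclasses import dataclass


@dataclass
class JobLimits:
    # Placeholder values for illustration; the real limits are not documented here.
    max_depth: int = 5
    max_links: int = 1000
    max_bytes: int = 10_000_000


@dataclass
class JobState:
    pending_links: int = 0      # links still waiting to be visited
    depth: int = 0              # current level below the main page
    links_visited: int = 0
    bytes_transferred: int = 0


def job_finished(state, limits):
    """Return True when no links remain or any one limit has been reached."""
    return (
        state.pending_links == 0
        or state.depth >= limits.max_depth
        or state.links_visited >= limits.max_links
        or state.bytes_transferred >= limits.max_bytes
    )
```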
Phase One is non-CPU-intensive but will use some of your network bandwidth. If the internet connection goes down, the networking phase will be suspended until the connection is available again. Normally, Phase One will take only a few seconds or minutes for each job but may run for a few hours or days depending on the speed of your internet connection and the response times of the visited servers.
Computational phase (Phase Two)
This phase doesn't require an internet connection but will use more CPU time. As BOINC doesn't allow switching between non-CPU-intensive phases and normal processing phases, this phase will be processed as if it were non-CPU-intensive. This means that it will run permanently and will not be switched in and out as normal BOINC projects would be. However, DepSpid will respect your resource share settings. It calculates the ratio of CPU time to wall clock time and will go to sleep if the value is higher than the preferred resource share.
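The self-throttling idea can be sketched as follows. This is not DepSpid's implementation: the function names are hypothetical, and the resource share is assumed to be expressed as a fraction between 0 and 1 (for example 0.25 for a quarter of one CPU). If the ratio of CPU time to wall clock time exceeds the share, the worker sleeps just long enough to bring the ratio back down to it.

```python
import time


def sleep_needed(cpu_seconds, wall_seconds, share):
    """Return how long to sleep so that cpu_seconds / wall time falls to `share`."""
    if not 0 < share <= 1:
        raise ValueError("share must be in the range (0, 1]")
    if wall_seconds <= 0 or cpu_seconds / wall_seconds <= share:
        return 0.0
    # Solve cpu / (wall + pause) == share for pause.
    return cpu_seconds / share - wall_seconds


def run_throttled(work_items, process, share):
    """Process items one by one, sleeping whenever the CPU share is exceeded."""
    start_wall = time.monotonic()
    start_cpu = time.process_time()
    for item in work_items:
        process(item)
        pause = sleep_needed(
            time.process_time() - start_cpu,
            time.monotonic() - start_wall,
            share,
        )
        if pause > 0:
            time.sleep(pause)
```

For example, after 8 seconds of CPU time in 10 seconds of wall clock time with a share of 0.5, `sleep_needed(8.0, 10.0, 0.5)` gives 6 seconds, after which the ratio is 8 / 16 = 0.5.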
Please note: Respecting your resource share only works this way with BOINC version 5.5.6 and higher. Older clients will be able to participate until one of the new client versions leaves the development stage, but will use a fixed resource share which may be far from your true settings.
Phase Two uses the data collected during Phase One and calculates the dependencies between all pages. An example of how this works will be posted soon.
After all dependencies have been calculated, the dependencies to external links that meet a predefined threshold will be reported to the project server and merged into its main database.
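As a rough illustration of the final filtering step only: the sketch below assumes that a dependency is measured as the number of links from the crawled domain to an external domain, and that the threshold is a minimum link count. Both are assumptions made for this example, as are the function and parameter names; the actual dependency calculation and threshold are not described here.

```python
from collections import Counter
from urllib.parse import urlparse


def external_dependencies(edges, home_domain, threshold):
    """Count links to each external domain and keep those meeting the threshold.

    `edges` is an iterable of (source_url, target_url) pairs, standing in for
    the page-to-link records stored during Phase One.
    """
    counts = Counter()
    for _source, target in edges:
        domain = urlparse(target).netloc
        if domain and domain != home_domain:
            counts[domain] += 1
    return {domain: n for domain, n in counts.items() if n >= threshold}


if __name__ == "__main__":
    edges = [
        ("http://a.example/", "http://b.example/1"),
        ("http://a.example/x", "http://b.example/2"),
        ("http://a.example/", "http://c.example/"),
        ("http://a.example/", "http://a.example/x"),
    ]
    print(external_dependencies(edges, "a.example", threshold=2))
```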

