- 1. Getting Started
- 2. Get the Hosts
- 3. The Search Frontend
- 4. Introducing Referrals
- 5. Results and Testing
Now let's get started!
1. Getting Started
1.1 The Python Script
As I said, we will be using a little Python script called getHosts.py. The script takes a host (say, http://www.onderstekop.nl/) as input and prints a list of the hosts that page links to (for instance http://www.python.org/, http://www.digg.com/ and any other outgoing link that starts with 'http://' or 'https://' and isn't 'http://www.onderstekop.nl/'). It will all become clearer when you test-run getHosts like this:
./getHosts.py -h=http://www.onderstekop.nl/
The results:
http://www.onderstekop.nl/
http://antiguawebsolutions.com/
http://www.blogrush.com/
http://www.blogarama.com/
http://www.schuurtje.net/
(...)
http://www.onderstekop.nl/
http://www.onderstekop.nl/
http://www.onderstekop.nl/
http://beans.seartipy.com/
As you can see, the script prints each host on its own line. This is perfect for use with a pipe (on Linux) or from another scripting language such as PHP, and that's exactly what we're going to do. You might be asking yourself: why aren't we writing getHosts in PHP as well? Sure, we could, but it would take a considerable amount of coding, thinking and debugging, none of which is needed in Python since it comes with a lot of libraries to help us. If you are interested in how the 'a' tags are actually pulled out of the HTML file, I suggest you take a look at the getHosts.py script, learn some Python and get your teeth into parser functions (functional programming languages are also said to be great for this sort of work, so you might want to consider them as well).
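To give a feel for that parsing step, here is a minimal sketch that uses only Python's standard library. It is not the getHosts.py from the package: the names HostCollector and extract_hosts and the sample HTML are made up for illustration, and the filtering simply follows the description above (absolute 'http://' or 'https://' links, minus the host being crawled).

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class HostCollector(HTMLParser):
    """Collects the host of every absolute http(s) link found in 'a' tags."""

    def __init__(self, own_host):
        super().__init__()
        self.own_host = own_host  # host being crawled, e.g. "http://www.onderstekop.nl/"
        self.hosts = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if not href.startswith(("http://", "https://")):
            return  # skip relative links, mailto:, javascript: and so on
        parts = urlsplit(href)
        host = f"{parts.scheme}://{parts.netloc}/"
        if host != self.own_host:
            self.hosts.append(host)


def extract_hosts(html, own_host):
    """Return the outgoing hosts in html, in document order (hypothetical helper)."""
    collector = HostCollector(own_host)
    collector.feed(html)
    return collector.hosts


# Made-up sample page, only here to exercise the sketch.
sample = (
    '<a href="http://www.python.org/doc/">docs</a>'
    '<a href="/about">about</a>'
    '<a href="http://www.onderstekop.nl/page">self</a>'
    '<a href="https://www.digg.com/">digg</a>'
)
for host in extract_hosts(sample, "http://www.onderstekop.nl/"):
    print(host)  # one host per line, ready to be piped
```

Printing one host per line is the only contract the rest of the tutorial relies on, so any parser that honours it could be swapped in.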
1.2 The Database
For the first part of this tutorial we will only be using one simple database table (we will create another one in 'chapter' 4 - Introducing Referrals). If you have the rights to create a database, I would advise you to create one named crawler or search engine to keep things separate from your other data. An effective search engine needs a lot of data, so be sure to have a couple of MB free. Since we're only capturing the host name, a couple will do here, but it would be nowhere near enough for Google (imagine the terabytes of data they must be storing). Storage is a serious point that should not be overlooked when building a full-grown search engine that needs to answer fast. Anyway! Here's the table; run it in phpMyAdmin, the mysql command line, PHP or any other way you can think of:
CREATE TABLE host (ID INT NOT NULL AUTO_INCREMENT, name VARCHAR(254), lastcrawled DATETIME, crawllock BOOLEAN DEFAULT '0', PRIMARY KEY(ID));
Now let's build ourselves a crawler!
2. Get the Hosts
Alright, so we've got our little getHosts script and our database table all set up and we're ready to go, but what exactly should we be doing? We are going to use PHP to feed hosts to the Python script. The Python script in return outputs a number of hosts, which our PHP script stores in the database so it can feed them back later to generate even more hosts, and so on. Let's take a look at the script:
2.1 crawler.php
Because of the layout it would be unwise to show the code here, so you will need to download the package and extract the 'Without referrals' directory. The scripts are fully documented.
*Gentle pause while you download the source and extract the files*
When you run this script from the command line with php crawler.php (NB: you need the php-cli module for this) you will notice that you don't see any results. That's because the host table doesn't contain any data yet. It needs a starting point! So before running the script you will need to perform an SQL command first. Just run something similar to this: 'INSERT INTO host (name) VALUES ("http://www.onderstekop.nl/");' and run the script again. Now the crawler will do its work and will keep doing its work... forever!

Screenshot: running 4 crawlers at the same time
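The feed-and-store loop described in this chapter can be sketched in a few lines. This is not the crawler.php from the package: it is a Python stand-in that uses an in-memory SQLite database instead of MySQL, a made-up link graph (FAKE_WEB) instead of real getHosts.py output, and it assumes that "next host to crawl" means the lowest ID whose lastcrawled is still empty. It also stops after max_steps instead of running forever and ignores the crawllock column.

```python
import sqlite3

# Hypothetical link graph standing in for real getHosts.py output.
FAKE_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://a.example/", "http://c.example/"],
    "http://c.example/": [],
}


def get_hosts(host):
    """Stand-in for running getHosts.py on one host."""
    return FAKE_WEB.get(host, [])


def crawl(db, max_steps):
    """Crawl up to max_steps hosts, picking uncrawled rows in ID order."""
    for _ in range(max_steps):
        row = db.execute(
            "SELECT ID, name FROM host WHERE lastcrawled IS NULL "
            "ORDER BY ID LIMIT 1"
        ).fetchone()
        if row is None:
            break  # nothing left to crawl
        host_id, name = row
        for found in get_hosts(name):
            exists = db.execute(
                "SELECT 1 FROM host WHERE name = ?", (found,)
            ).fetchone()
            if exists is None:  # only store hosts we have not seen yet
                db.execute("INSERT INTO host (name) VALUES (?)", (found,))
        db.execute(
            "UPDATE host SET lastcrawled = CURRENT_TIMESTAMP WHERE ID = ?",
            (host_id,),
        )


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE host (ID INTEGER PRIMARY KEY AUTOINCREMENT, "
    "name VARCHAR(254), lastcrawled DATETIME, crawllock BOOLEAN DEFAULT 0)"
)
# The starting point: without this row the loop has nothing to do.
db.execute("INSERT INTO host (name) VALUES (?)", ("http://a.example/",))
crawl(db, max_steps=10)
names = [r[0] for r in db.execute("SELECT name FROM host ORDER BY ID")]
print(names)
```

Running several crawlers against the same table would need some form of locking so that two of them don't pick the same row; the sketch leaves that out.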
2.2 Some crawl tips
The starting point is very important in our situation because it influences all our contents. When you start on a site about Linux, your first 10,000 results will of course be biased towards Linux. This evens out as more hosts are crawled, but it is good to start on a very general site with outgoing links to many different locations, because the 'host to be crawled next' isn't picked randomly but sequentially (you could certainly change this behaviour, but I didn't include that because I wanted to keep things simple).
3. The Search Frontend
Ah, the easy part! Or is it? Yes, in our case it is. In this little chapter we will be building a frontend to search the database for hosts, because now that we have some hosts in our database, we want to show them! It will only consist of a text input, a button, a result field and some stats. The search algorithm itself is going to be very basic, but we will discuss some techniques to optimize the results. Let us first take a look at the script.
3.1 search.php
Download the source here
Basically the script shows a page, and once a search tag has been submitted you get the results matching that tag. The way the script selects which hosts do and don't match the tag is very simple, but it could be fine-tuned. Think of people submitting a tag containing two words. Do you search once for each word? Do you search for hosts containing both words? You need to think about things like this when creating your search frontend, because in our case a two-word tag would not give us any results.
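One possible answer to the two-word question is to require every word to appear somewhere in the host name. The sketch below is an assumption about how that could look, not what search.php does: build_search_query and its limit parameter are hypothetical, and an in-memory SQLite table with made-up rows stands in for the MySQL host table.

```python
import sqlite3


def build_search_query(tag, limit=50):
    """Return (sql, params) matching hosts that contain every word in tag."""
    words = tag.split()
    # One "name LIKE ?" clause per word; placeholders keep user input out of the SQL.
    where = " AND ".join("name LIKE ?" for _ in words)
    params = [f"%{word}%" for word in words]
    sql = "SELECT name FROM host"
    if where:
        sql += f" WHERE {where}"
    sql += " ORDER BY name LIMIT ?"  # cap the result list, even for an empty tag
    return sql, params + [limit]


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE host (ID INTEGER PRIMARY KEY, name VARCHAR(254))")
db.executemany(
    "INSERT INTO host (name) VALUES (?)",
    [
        ("http://www.python.org/",),
        ("http://planet.python.org/",),
        ("http://www.php.net/",),
    ],
)

sql, params = build_search_query("planet python")
print(db.execute(sql, params).fetchall())
```

Switching the join from AND to OR would give the "search once for each word" behaviour instead; which one feels right depends on how many hosts you have indexed.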
3.2 Search Tips
Notice how the % is used as a wildcard in SQL. You can use this when you search for something. For instance, 'f%o' searches for all the hosts which have an f and then, somewhere after it, an o in their name. It doesn't sound useful, but it could help you find sites with similar names or letter combinations in them. Also notice that currently all the results are shown on one page. This means that when you enter nothing as a tag, the page will try to list all the hosts in the database, and this could be a lot.
4. Introducing Referrals
Now, this is all nice. We can crawl hosts and we can search through what we've crawled, but we have no way of telling which result should be shown first when we search for 'google', 'digg' or 'youtube'. So the search engine is pretty useless right now *unless you just want to discover hosts*. We need to do something about that, but how? By introducing referrals! We will need to keep track of which pages link to which pages. That means we need to alter our crawler.php and search.php scripts and make a new table in our database to store the new information.
4.1 The new table
Run the following SQL command:
CREATE TABLE linkedby (refID INT NOT NULL, hostID INT NOT NULL, PRIMARY KEY (refID, hostID), FOREIGN KEY(refID) REFERENCES host(ID), FOREIGN KEY(hostID) REFERENCES host(ID));
This will create a table with the columns 'refID' and 'hostID'. The refID will contain the ID of the host that is linking; the hostID will contain the ID of the host that is being linked to.
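To see how these two columns can turn into a ranking, here is a small sketch. It is an illustration under assumptions, not the code from the tutorial package: it runs on an in-memory SQLite database with three made-up hosts, add_referral is a hypothetical helper, and 'INSERT OR IGNORE' is SQLite syntax (MySQL spells it 'INSERT IGNORE').

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE host (ID INTEGER PRIMARY KEY, name VARCHAR(254));
    CREATE TABLE linkedby (
        refID INT NOT NULL,
        hostID INT NOT NULL,
        PRIMARY KEY (refID, hostID),
        FOREIGN KEY (refID) REFERENCES host(ID),
        FOREIGN KEY (hostID) REFERENCES host(ID)
    );
    """
)
db.executemany(
    "INSERT INTO host (ID, name) VALUES (?, ?)",
    [(1, "http://a.example/"), (2, "http://b.example/"), (3, "http://c.example/")],
)


def add_referral(db, ref_id, host_id):
    """Record that host ref_id links to host host_id; repeats are ignored."""
    db.execute(
        "INSERT OR IGNORE INTO linkedby (refID, hostID) VALUES (?, ?)",
        (ref_id, host_id),
    )


add_referral(db, 1, 3)  # a links to c
add_referral(db, 2, 3)  # b links to c
add_referral(db, 1, 2)  # a links to b
add_referral(db, 1, 2)  # same link again: the primary key keeps it unique

# Count incoming referrals per host, most referred-to first.
ranking = db.execute(
    "SELECT host.name, COUNT(linkedby.refID) AS referrals "
    "FROM host LEFT JOIN linkedby ON linkedby.hostID = host.ID "
    "GROUP BY host.ID ORDER BY referrals DESC, host.name"
).fetchall()
for name, referrals in ranking:
    print(referrals, name)
```

Because (refID, hostID) is the primary key, a host that links to the same target a hundred times still counts as a single referral.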
4.2 Altering crawler.php
We need the crawler not only to add found hosts to the table but also to keep track of referrals. We do this by looking up the ID of a found host whenever it turns out to be in the list already, so that we can pair it with the ID of the host that is referring to it and insert the result into the new linkedby table. You can see the new source code in the package that was created for this tutorial.
4.3 Altering search.php
The search page should also be changed a little bit so that it shows the hosts with the most referrals first (a crude stand-in for what Google calls PageRank), but that is actually everything. In my own (non-public) engine I eventually added a special tag ('link:http://www.example-site.com/') which shows you what the actual referrals are, but you could write more (and more interesting) plugins if you want. Again, you can see the final results in the tutorial package.
5. Results and Testing
Since this tutorial is coming to an end, we are finally ready to test it out and gather some statistics! One of the joys of projects like this lies in the fact that you can acquire raw information which you can use to find trends, hives of networks and keyword-based results that are hidden most of the time in public search engines.
For this little test I crawled over 5000 hosts and found nearly 60,000 hosts. I could have crawled more, but you've got to stop somewhere and I reckoned this would be enough for a little testing.
5.1 Top 4 Pages

This shows the top 4 hosts with the most referrals
5.2 Popular hosts test
This was a test to check whether popular hosts were listed as the top result when I searched for them. A '+' means it was, a '+/-' means that it was the second result and a '-' means that it wasn't shown in the first two results. I started the crawl with 'http://www.onderstekop.nl/' as the starting point (obviously :P).

57579 indexed hosts. But are the big ones properly listed?

| Result | Host | Result | Host |
|---|---|---|---|
| + | imdb | + | python |
| + | youtube | + | tweakers |
| + | feedburner | + | del.icio.us |
| + | amazon | + | flickr |
| + | | + | myspace |
| + | wikipedia | + | |
| + | technorati | + | nytimes |
| + | digg | + | microsoft |
| +/- | yahoo | +/- | bbc |
| +/- | ebay | - | php |
| - | firefox | | |
As you can see, the test went pretty well and a lot of the hosts claimed the spots they rightfully deserve.

Screenshot: the test taking place
6. Final Words
I hope you got something out of this and that it will help you with your own programming. If you've found any factual errors or if the code doesn't work, don't hesitate to tell me. The same goes if you've used this tutorial to build something yourself and want to share your experience. I'm a giant ear, to quote Black Books. Thanks for reading and digging (if you did :P)!
Download the source here
UPDATE: If you would like to expand your search engine to make the results better, check out my new article Parsing HTML in PHP, which gives you a PHP Class that can extract the title, links and images from a page.
10 Comments
1
Written by: Bart site
2008-05-11 10:49:50
If not, then it's more likely that there is a problem with the next host waiting to be crawled. This can happen when the host is on a very slow connection, when it sends out corrupted data (no ACK packets, for example) or when it just doesn't exist (= the name can't be resolved). In most cases the program finds out about the bad host and moves on to the next, but it takes a while before this happens. Try to wait for a bit.
If this doesn't help, you have to remove the host waiting to be crawled from the database manually using something like phpMyAdmin or the hardcore command line program mysql.
Hope this helped.
2
Written by: Gholamreza Sabery Tabrizy
2009-09-21 09:09:10
I have one question could you Recommend a book on this title to me?
Thx alot.
3
Written by: Gholamreza Sabery Tabrizy
2009-09-21 09:10:12
I have one question could you Recommend a book on this title to me?
Thx alot.
4
Written by: yeshin site
2009-11-04 08:46:26
I want to know how to write this search engine?
I know how to use for php
5
Written by: jacob
2009-11-25 03:57:51
6
Written by: jacob
2009-11-25 04:01:09
7
Written by: jacob
2009-11-25 04:38:03
8
Written by: jacob
2009-11-25 04:39:36
9
Written by: prashant nalawade site
2009-12-15 18:59:37
10
Written by: johnny darwin
2010-05-13 13:47:38
How about your plan to modify and improve for an Open Source project?