14
Oct
How to write a search engine?
As a web developer I have often wondered how one would go about making a search engine. I looked for tutorials to help me with some of the algorithms, but of course I didn't find much, since search engine technology is worth A LOT (GOOGLE!). So I thought it would be nice to write a tutorial myself to explain some search engine basics, and maybe to create an open source search engine, but that's in the very, very long run (unless you want to join me?). I think this is also a great way to learn about some SEO (Search Engine Optimization) techniques, because what better way is there to understand the workings of a crawler than to write one yourself?

That's the informal stuff done with; now let's talk code! By no means are we going to try to write the best search engine or 'the next Google'; I will just be showing you some techniques which I hope will be both inspiring and insightful for your own sites and projects. In fact, we won't build a search engine that looks at a page's content. No, we are going to make a search engine based on host names, which still leads to pretty good results once a decent amount of data has been collected. The crawler we are going to write could of course be extended to index pages as well, but that's something for another tutorial or for you to explore yourself.

The tutorial code is written in PHP and we will be using a little Python script as well, but you don't have to worry (if you did): you won't need any skills in that language. (If you are running Windows you will have to download and install Python first, but that isn't so hard.) We will also be using MySQL to store our data, but this could of course be any database of your choice. Here's a quick overview of what's to come:

1. Getting Started
2. Get the Hosts
3. The Search Frontend
4. Introducing Referrals
5. Results and Testing
6. Final Words

Now let's get started!

1. Getting Started

1.1 The Python Script

As I said, we will be using a little Python script called getHosts.py. The script takes a host (say, http://www.onderstekop.nl/) as input and delivers a list of hosts the webpage is referring to (for instance http://www.python.org/, http://www.digg.com/ and any other outgoing link that starts with 'http://' or 'https://' and isn't 'http://www.onderstekop.nl/'). It will all get clearer when you test run getHosts like this: ./getHosts.py -h=http://www.onderstekop.nl/


The results:
http://www.onderstekop.nl/
http://antiguawebsolutions.com/
http://www.blogrush.com/
http://www.blogarama.com/
http://www.schuurtje.net/
(...)
http://www.onderstekop.nl/
http://www.onderstekop.nl/
http://www.onderstekop.nl/
http://beans.seartipy.com/

As you can see, the script prints each host on its own line. This is perfect for use with a pipe (on Linux) or from another scripting language such as PHP, and that's exactly what we're going to do. You might be asking yourself: why aren't we writing getHosts in PHP as well? Well, sure, we could do that, but it would take a considerable amount of coding, thinking and debugging, none of which is needed in Python since it comes with a lot of libraries to help us with that. For those of you who are interested in how the 'a' tags are actually extracted from the HTML file, I suggest you take a look at the getHosts.py script, learn some Python and get your teeth into parser functions (functional programming languages are also said to be great for this sort of work, so you might want to consider them as well).
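To give an idea of what pulling hosts out of 'a' tags can look like, here is a minimal sketch using only Python's standard library. It is not the actual getHosts.py from the package; the names HostCollector and extract_hosts are made up for this illustration, and it works on an HTML string instead of downloading a page.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class HostCollector(HTMLParser):
    """Collects the scheme://host/ part of every absolute link in an <a href>."""

    def __init__(self):
        super().__init__()
        self.hosts = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                parts = urlsplit(value)
                # Only absolute http(s) links point at a host; relative links are skipped.
                if parts.scheme in ("http", "https") and parts.netloc:
                    self.hosts.append(f"{parts.scheme}://{parts.netloc}/")


def extract_hosts(html):
    """Hypothetical helper: return the hosts linked from an HTML string, in order."""
    collector = HostCollector()
    collector.feed(html)
    return collector.hosts


sample = '<a href="http://www.python.org/doc/">docs</a> <a href="/about">about</a>'
for host in extract_hosts(sample):
    print(host)  # one host per line, ready for a pipe
```

Printing one host per line keeps the output easy to consume from a pipe or from PHP, which is the property the tutorial relies on.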

1.2 The Database

For the first part of this tutorial we will only be using one simple database table (we will create another one in 'chapter' 4 - Introducing Referrals). If you have the rights to create a database, I would advise you to create one named 'crawler' or 'search engine' to keep things separate from your other data. An effective search engine needs a lot of data, so be sure to have a couple of MB free. Since we're only capturing host names a couple will do for us, but it would not be nearly enough for Google (imagine the terabytes of data they must be storing). Storage is a serious point that should not be overlooked when building a full-grown search engine that needs to answer fast. Anyway! Here's the table; run it in phpMyAdmin, the mysql command line, PHP or any other way you can think of:

CREATE TABLE host (ID INT NOT NULL AUTO_INCREMENT, name VARCHAR(254), lastcrawled DATETIME, crawllock BOOLEAN DEFAULT '0', PRIMARY KEY(ID));

Now let's build ourselves a crawler!

2. Get the Hosts

Alright, so we've got our little getHosts script and our database table all set up and we are ready to go, but what exactly should we be doing? We are going to use PHP to feed hosts to the Python script. The Python script in return will output a number of hosts, which our PHP script will store in the database so it can feed them back later to generate even more hosts, and so on. Let's take a look at the script:

2.1 crawler.php

Because of the layout it would be unwise to show the code here, so you will need to download the package and extract the 'Without referrals' directory. The scripts are fully documented.

*Gentle pause while you download the source and extract the files*

When you run this script from the command line with php crawler.php (NB: you need the php-cli module for this) you will notice that you don't see any results. That's because the host table doesn't contain any data yet... it needs a starting point! So before running this script you need to perform an SQL command. Just run something similar to this: 'INSERT INTO host (name) VALUES ("http://www.onderstekop.nl/");' and run the script again. Now the crawler will do its work and will keep doing its work... forever!
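The crawl loop described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the crawler.php from the package: it is written in Python, uses an in-memory SQLite database as a stand-in for MySQL, and replaces the call to getHosts.py with a hypothetical fetch function backed by a tiny fake "web". The check-then-lock step is also not atomic, so a real multi-process crawler would need something stricter.

```python
import sqlite3


def crawl_once(db, fetch_hosts):
    """Crawl the first uncrawled, unlocked host; return its name, or None if there is none."""
    row = db.execute(
        "SELECT ID, name FROM host "
        "WHERE lastcrawled IS NULL AND crawllock = 0 ORDER BY ID LIMIT 1"
    ).fetchone()
    if row is None:
        return None  # empty table, or everything has been crawled
    host_id, name = row
    # Mark the row as taken so another crawler would skip it.
    db.execute("UPDATE host SET crawllock = 1 WHERE ID = ?", (host_id,))
    for found in fetch_hosts(name):
        known = db.execute("SELECT 1 FROM host WHERE name = ?", (found,)).fetchone()
        if known is None:
            db.execute("INSERT INTO host (name) VALUES (?)", (found,))
    db.execute(
        "UPDATE host SET lastcrawled = CURRENT_TIMESTAMP, crawllock = 0 WHERE ID = ?",
        (host_id,),
    )
    db.commit()
    return name


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE host (ID INTEGER PRIMARY KEY AUTOINCREMENT, name VARCHAR(254), "
    "lastcrawled DATETIME, crawllock BOOLEAN DEFAULT 0)"
)

# Hypothetical link data standing in for real network access.
fake_web = {
    "http://www.onderstekop.nl/": ["http://www.python.org/", "http://www.digg.com/"],
}


def fetch(host):
    return fake_web.get(host, [])


print(crawl_once(db, fetch))  # prints None: the table has no starting point yet
db.execute("INSERT INTO host (name) VALUES (?)", ("http://www.onderstekop.nl/",))
while crawl_once(db, fetch) is not None:
    pass  # on the real web this loop would practically never end
print(db.execute("SELECT COUNT(*) FROM host").fetchone()[0])  # prints 3
```

The first call shows why the seed row matters: without it there is nothing to pick up, so nothing happens.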


Running 4 crawlers at the same time

2.2 Some crawl tips

The starting point is very important in our situation because it will influence all our contents. When you start on a site about Linux, your first 10,000 results will of course have a bias towards Linux. This will even out when more hosts have been crawled, but it is good to keep in mind to start on a very general site with outgoing links to many different locations, because the 'host to be crawled next' isn't picked randomly but sequentially (you could certainly change this behaviour, but I didn't include it because I wanted to keep things simple).

3. The Search Frontend

Ah, the easy part! Or is it? Yes, in our case it is. In this little chapter we will be building a frontend to search the database for hosts, because now that we have some hosts in our database... we want to show them! It will only consist of a text input, a button, a result field and some stats. The search algorithm itself is going to be very basic, but we will discuss some techniques to optimize the results. Let us first take a look at the script.

3.1 search.php

Download the source here

Basically the script shows a page, and once a search tag has been submitted you get the results matching that tag. The way the script selects which hosts do and which hosts don't match the tag is very simple, but could be fine-tuned. Think of people submitting a tag containing two words... Do you run a separate search for each word? Do you search for hosts containing both words? You need to think about things like this when creating your search frontend, because in our case a two-word tag would not give us any results.
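One possible answer to the two-word question is to require every word to appear somewhere in the host name. The sketch below shows that idea in Python with an in-memory SQLite table standing in for MySQL. It is not how search.php works; search_hosts is a hypothetical helper and the rows are sample data.

```python
import sqlite3


def search_hosts(db, tag):
    """Return the host names that contain every word of the tag."""
    words = tag.split()
    if not words:
        return []
    # One LIKE condition per word; all of them have to match.
    where = " AND ".join("name LIKE ?" for _ in words)
    params = [f"%{word}%" for word in words]
    rows = db.execute(f"SELECT name FROM host WHERE {where} ORDER BY ID", params)
    return [name for (name,) in rows]


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE host (ID INTEGER PRIMARY KEY, name VARCHAR(254))")
db.executemany(
    "INSERT INTO host (name) VALUES (?)",
    [("http://www.python.org/",), ("http://docs.python.org/",), ("http://www.digg.com/",)],
)

print(search_hosts(db, "python docs"))  # prints ['http://docs.python.org/']
```

Swapping AND for OR would instead return hosts matching any of the words, which gives more results but less precise ones.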

3.2 Search Tips

Notice how the % is used as a wildcard in SQL. You can use this when you search for something. For instance 'f%o' searches for all the hosts which have an f followed somewhere later by an o in their name. It doesn't sound useful, but it could help you find sites with similar names or letter combinations in them. Also notice that currently all the results are shown on one page. This means that when you enter nothing as a tag, the page will try to list all the hosts in the database, and this could be a lot.
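A common way to avoid listing the whole database on one page is to add LIMIT and OFFSET to the query. Here is a small sketch of that, again in Python with SQLite standing in for MySQL. The page size, the search_page helper and the sample rows are assumptions for the demo and are not taken from search.php.

```python
import sqlite3

PAGE_SIZE = 2  # hypothetical value, kept tiny so the demo shows several pages


def search_page(db, tag, page):
    """Return one page (0-based) of host names matching the tag."""
    pattern = f"%{tag}%"  # '%' matches any run of characters, including none
    rows = db.execute(
        "SELECT name FROM host WHERE name LIKE ? ORDER BY ID LIMIT ? OFFSET ?",
        (pattern, PAGE_SIZE, page * PAGE_SIZE),
    )
    return [name for (name,) in rows]


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE host (ID INTEGER PRIMARY KEY, name VARCHAR(254))")
db.executemany(
    "INSERT INTO host (name) VALUES (?)",
    [
        ("http://www.python.org/",),
        ("http://www.flickr.com/",),
        ("http://www.digg.com/",),
        ("http://www.facebook.com/",),
    ],
)

print(search_page(db, "f%o", 0))  # prints ['http://www.flickr.com/', 'http://www.facebook.com/']
print(search_page(db, "", 0))     # prints ['http://www.python.org/', 'http://www.flickr.com/']
print(search_page(db, "", 1))     # prints ['http://www.digg.com/', 'http://www.facebook.com/']
```

An empty tag still matches every host, but each request now returns at most PAGE_SIZE rows.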

4. Introducing Referrals

Now, this is all nice. We can crawl hosts and we can search through what we've crawled, but we have no way of telling which result should be shown first when we search for 'google', 'digg' or 'youtube'. So the search engine is pretty useless right now *unless you just want to discover hosts*. We need to do something about that, but how? By introducing referrals! We will need to keep track of which hosts link to which hosts. That means we need to alter our crawler.php and search.php scripts and make a new table in our database to store the new information.

4.1 The new table

Run the following SQL command:
CREATE TABLE linkedby (refID INT NOT NULL, hostID INT NOT NULL, PRIMARY KEY (refID, hostID), FOREIGN KEY(refID) REFERENCES host(ID), FOREIGN KEY(hostID) REFERENCES host(ID));


This will create a table with the columns 'refID' and 'hostID'. The refID will contain the ID of the host that is linking; the hostID will contain the ID of the host that is being linked to.

4.2 Altering crawler.php

We need the crawler to not only add newly found hosts to the table but also to keep track of referrals. We do this by looking up the ID of a found host whenever it appears to be in the list already, so that we can combine it with the ID of the host that is referring and insert the pair into the new linkedby table. You can see the new source code in the package that was created for this tutorial.
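The bookkeeping described here boils down to "get the ID of the host, creating it if needed, then store the pair". A minimal sketch in Python with SQLite standing in for MySQL follows. The helper names are hypothetical, INSERT OR IGNORE is SQLite syntax (MySQL spells it INSERT IGNORE), and skipping a host's links to itself is my own assumption, made because the sample getHosts output earlier lists the crawled host itself several times.

```python
import sqlite3


def host_id(db, name):
    """Return the ID of a host, inserting it first if it is new."""
    row = db.execute("SELECT ID FROM host WHERE name = ?", (name,)).fetchone()
    if row is not None:
        return row[0]
    return db.execute("INSERT INTO host (name) VALUES (?)", (name,)).lastrowid


def record_links(db, referrer, found_hosts):
    """Store one (refID, hostID) row per distinct host the referrer links to."""
    ref_id = host_id(db, referrer)
    for found in found_hosts:
        if found == referrer:
            continue  # assumption: a host linking to itself is not a referral
        # OR IGNORE skips pairs that already exist (composite primary key).
        db.execute(
            "INSERT OR IGNORE INTO linkedby (refID, hostID) VALUES (?, ?)",
            (ref_id, host_id(db, found)),
        )
    db.commit()


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE host (ID INTEGER PRIMARY KEY AUTOINCREMENT, name VARCHAR(254))")
db.execute(
    "CREATE TABLE linkedby (refID INT NOT NULL, hostID INT NOT NULL, "
    "PRIMARY KEY (refID, hostID), FOREIGN KEY(refID) REFERENCES host(ID), "
    "FOREIGN KEY(hostID) REFERENCES host(ID))"
)

record_links(
    db,
    "http://www.onderstekop.nl/",
    ["http://www.python.org/", "http://www.python.org/", "http://www.onderstekop.nl/"],
)
print(db.execute("SELECT refID, hostID FROM linkedby").fetchall())  # prints [(1, 2)]
```

Because the primary key covers both columns, a host that links to the same target many times still counts as a single referral.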

4.3 Altering search.php

The search page should also be changed a little bit so that it shows the hosts with the most referrals first (a crude stand-in for what Google calls PageRank), but that is actually everything. In my own (non-public) engine I eventually added a special tag ('link:http://www.example-site.com/') which shows you what the actual referrals are, but you could write more (and more interesting) plugins if you want. Again, you can see the final results in the tutorial package.
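Ordering by number of referrals can be done in a single query that joins host with linkedby and counts the rows per host. The sketch below shows one way to write it, using Python and SQLite as a stand-in for MySQL. The ranked_search helper and the sample rows are made up for the illustration, and the query in the package's search.php may well look different.

```python
import sqlite3


def ranked_search(db, tag):
    """Return (name, referrals) for matching hosts, most-referred first."""
    rows = db.execute(
        "SELECT host.name, COUNT(linkedby.refID) AS referrals "
        "FROM host LEFT JOIN linkedby ON linkedby.hostID = host.ID "
        "WHERE host.name LIKE ? "
        "GROUP BY host.ID ORDER BY referrals DESC, host.name",
        (f"%{tag}%",),
    )
    return rows.fetchall()


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE host (ID INTEGER PRIMARY KEY, name VARCHAR(254))")
db.execute(
    "CREATE TABLE linkedby (refID INT NOT NULL, hostID INT NOT NULL, "
    "PRIMARY KEY (refID, hostID))"
)
db.executemany(
    "INSERT INTO host (ID, name) VALUES (?, ?)",
    [
        (1, "http://www.onderstekop.nl/"),
        (2, "http://www.python.org/"),
        (3, "http://docs.python.org/"),
        (4, "http://www.digg.com/"),
    ],
)
# Sample referrals: hosts 1 and 4 link to host 2, host 1 links to host 3.
db.executemany("INSERT INTO linkedby VALUES (?, ?)", [(1, 2), (4, 2), (1, 3)])

print(ranked_search(db, "python"))
# prints [('http://www.python.org/', 2), ('http://docs.python.org/', 1)]
```

The LEFT JOIN keeps hosts that nobody links to in the results, with a count of 0, so they simply end up at the bottom.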

5. Results and Testing

Since this tutorial is coming to an end, we are finally ready to test it out and gather some statistics! One of the joys of projects like this lies in the fact that you can acquire raw information which you can use to find trends, hives of networks and keyword-based results that most of the time are hidden in public search engines.
For this little test I crawled over 5000 hosts and found nearly 60,000 hosts. I could have crawled more, but you've got to stop somewhere and I reckoned this would be enough for a little testing.

5.1 Top 4 Pages


This shows the top 4 hosts with the most referrals

5.2 Popular hosts test

This was a test to check whether popular hosts were listed as the top result when I searched for them. A '+' means it was, a '+/-' means that it was the second result and a '-' means that it wasn't shown in the first two results. I started the crawl with 'http://www.onderstekop.nl/' as the starting point (obviously :P).

57579 indexed hosts
But are the big ones properly listed?
+   imdb          +   python
+   youtube       +   tweakers
+   feedburner    +   del.icio.us
+   amazon        +   flickr
+   facebook      +   myspace
+   wikipedia     +   google
+   technorati    +   nytimes
+   digg          +   microsoft
+/- yahoo         +/- bbc
+/- ebay          -   php
-   firefox


As you can see the test went pretty well and a lot of the hosts claimed the spots they rightfully deserve.


The test taking place

6. Final Words

I hope you got something out of this and that it will help you with your own programming. If you've found some factual errors or if the code doesn't work, then don't hesitate to tell me. The same goes if you've used this tutorial to build something yourself and want to share your experience. I'm a giant ear, to quote Black Books. Thanks for reading and digging (if you did :P)!

Download the source here

UPDATE: If you would like to expand your search engine to make the results better, check out my new article Parsing HTML in PHP, which gives you a PHP Class that can extract the title, links and images from a page.



10 Comments


1
RE: How to write a search engine?
Written by: Bart site
2008-05-11 10:49:50
Did the second crawler you started output anything? If so, could you post the output?

If not, then it's more likely that there is a problem with the next host waiting to be crawled. This can happen when the host is on a very slow connection, when it sends out corrupted data (no ACK packets, for example) or when it just doesn't exist (i.e. the name can't be resolved). In most cases the program finds out about the bad host and moves on to the next, but it takes a while before this happens. Try to wait for a bit.

If this doesn't help, you have to remove the host waiting to be crawled from the database manually using something like phpMyAdmin or the hardcore command line program mysql.

Hope this helped.

2
RE: How to write a search engine?
Written by: Gholamreza Sabery Tabrizy
2009-09-21 09:09:10
Thanks a lot for your information, I used it a lot. Thank you very much.
I have one question: could you recommend a book on this topic to me?
Thx a lot.


4
RE: How to write a search engine?
Written by: yeshin site
2009-11-04 08:46:26
yes..

I want to know how to write this search engine?


I know how to use for php


5
RE: How to write a search engine?
Written by: jacob
2009-11-25 03:57:51
Okay, I use Ubuntu Linux, but where do I start? You're not very clear on where to put the code or whatnot. Can you please explain this to me better? Thank you.


9
RE: How to write a search engine?
Written by: prashant nalawade site
2009-12-15 18:59:37
good better best.u r osam


10
RE: How to write a search engine?
Written by: johnny darwin
2010-05-13 13:47:38
Good introduction on search engine. Truly speaking it's more than an introduction and thanks for this.

How about your plan to modify and improve for an Open Source project?

