Tuesday, January 20, 2009

Crawlers: legal or illegal

Crawlers: loved by every web developer, or at least by everyone involved in collecting data. Who knows...

Anyway, we need our sites to be known to people, so we invite Google, Yahoo! and other web search engines to index our pages. What about other robots? Somebody who tries to steal our data... hmm, bad guy! Not nice of him, not nice... But if he needs your data, that means your data is important, and that means you have to protect it. How? There are several ways to protect your data:
  1. The firewall. If the firewall is well configured, you avoid a lot of problems. The typical bad guy writes a simple application that makes a lot of requests to your site; the firewall can stop that crawler by blocking its IP address. Bye bye!
  2. The application itself should implement some simple security "barriers". First, do not expose e-mail addresses or other contact data that someone could collect directly into a list of users. Second, think about how a purpose-built crawler works. Most of the time, a robot built to grab data from your site walks through your own navigation: the search system, the region/category pages, a public unsecured web service, or other weaknesses of your application. From these pages, the bad guy's application collects the URLs of your "raw" data, the individual data objects; for example, if your site were Google, the robot would grab the search results. So it walks your database in a particular order, and you will see one special "user" going through your data in that order and taking all your results. That means you can stop him. Google does something like this too: if there are 1,000,000 results, the user gets only 1,000. Similarly, if a query matches more than X results, you can return only Y of them (Y less than X, something like 1,000 out of 100,000). If the user needs more results, he has to change his query to a more exact one. Keep in mind that your search system has to work very well if you decide to use this defense.
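
To make the two ideas above a bit more concrete, here is a minimal sketch in Python. It is only an illustration, not how any particular firewall or search engine does it: the names RateLimiter and cap_results, and the limits (5 requests per 10 seconds, 1,000 results), are my own assumptions. The first part counts requests per IP in a sliding time window, the second part returns only the first Y results of a query.

```python
import time
from collections import defaultdict, deque

# All limits below are hypothetical values chosen for the example.
MAX_REQUESTS = 5        # requests allowed per window, per IP
WINDOW_SECONDS = 10.0   # length of the sliding window
MAX_RESULTS = 1000      # the "Y" from the text: results shown per query


class RateLimiter:
    """Sliding-window request counter keyed by client IP."""

    def __init__(self, max_requests=MAX_REQUESTS, window=WINDOW_SECONDS):
        self.max_requests = max_requests
        self.window = window
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip, now=None):
        """Return True if this request may pass, False if the IP is too busy."""
        now = time.monotonic() if now is None else now
        recent = self.hits[ip]
        # Forget requests that have fallen out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False
        recent.append(now)
        return True


def cap_results(matches, max_results=MAX_RESULTS):
    """Return at most max_results items, plus a flag that tells the
    caller whether the user should be asked to narrow the query."""
    truncated = len(matches) > max_results
    return matches[:max_results], truncated
```

A request handler would call allow() with the client IP before doing any work, and pass the query matches through cap_results() before rendering them. When the flag is True, the page can tell the user to refine the search.
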
These are some methods that help you defend yourself against robots. But a good "bad guy" will walk your web site anyway... using a proxy server, a dynamic IP, or by running his application slowly! Here you have to defend yourself against the bad guys in other ways. If someone gets your data, you can get some money from him, but you have to identify him and prove that he has stolen your data. In most cases the bad guy will publish your data on his own site. So you have to find that site and take the bad guy to court under copyright law. At that moment the bad guy gets renamed: the poor guy. But still a bad guy!

How do I proceed? Simple. If it is a simple site with simple data to collect (most of the time your own data, or your friends'/enemies' data, from games, social networks and so on), I consider the robot a necessary tool for you that is not dangerous for the crawled website, and I build the robot as quickly as possible. I see a lot of these robots, and nobody complains that they do any harm.

But when your robot has to collect a lot of data, I will build the robot, but you should have permission from that website. The crawled web site will not be affected by my robot, and you will have your data very quickly. The robot is ready to go, so contact me as quickly as possible!
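
As a rough illustration of what "will not be affected" can mean in practice, here is a small Python sketch of a polite robot. It is not the code of my actual robots, and a robots.txt file is not the same thing as real permission from the site owner; I use it here only as an assumed, machine-readable stand-in. The robot name and the helpers build_rules and plan_fetch are made up for the example; urllib.robotparser is part of the Python standard library.

```python
import urllib.robotparser

USER_AGENT = "example-data-robot"  # hypothetical robot name


def build_rules(robots_txt):
    """Parse the text of a robots.txt file into a rule checker."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def plan_fetch(parser, url, default_delay=1.0):
    """Return how many seconds to wait before fetching url,
    or None if the site's rules do not allow fetching it."""
    if not parser.can_fetch(USER_AGENT, url):
        return None
    delay = parser.crawl_delay(USER_AGENT)
    # Fall back to an assumed default pause when the site names none.
    return float(delay) if delay is not None else default_delay
```

The robot would sleep for the returned number of seconds between requests and skip every URL for which plan_fetch() returns None, so the crawled site sees a slow, steady visitor instead of a flood.
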

Why choose a robot when you could do the work yourself and create an export from one database to another? Because it is quicker. Because it costs less than paying a programmer to work a day or two on this. Because the crawled data is ready to go directly into your own database. There are a lot of reasons to choose a crawler! Contact me: post a comment on this blog, and I will get back to you very quickly!


Copyright 2009 Pop Adrian-Nicolae