<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-7014766648070950269</id><updated>2011-11-27T16:27:21.596-08:00</updated><category term='web application'/><category term='technology'/><category term='theory'/><category term='mysql'/><category term='example'/><category term='needed data'/><category term='pan'/><category term='robots'/><category term='avaible'/><category term='proverbs'/><category term='mssql'/><category term='anonymus'/><category term='legally'/><category term='hidden'/><category term='limitations'/><category term='download'/><category term='defend'/><category term='contact'/><category term='behavior'/><category term='pan-internet.com'/><category term='crawler theory'/><category term='firewall'/><category term='crawler'/><category term='visible'/><category term='yahoo messenger'/><title type='text'>The Data Provider</title><subtitle type='html'>My job, your benefits!</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://dataprovider.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7014766648070950269/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://dataprovider.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Pop Adrian</name><uri>http://www.blogger.com/profile/14420584885621264756</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' 
uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>8</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-7014766648070950269.post-7202648497153915819</id><published>2010-01-10T12:54:00.000-08:00</published><updated>2010-01-10T12:58:50.945-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='pan'/><category scheme='http://www.blogger.com/atom/ns#' term='theory'/><category scheme='http://www.blogger.com/atom/ns#' term='pan-internet.com'/><category scheme='http://www.blogger.com/atom/ns#' term='crawler theory'/><category scheme='http://www.blogger.com/atom/ns#' term='crawler'/><title type='text'>My articles</title><content type='html'>I still write, even if not here.&lt;div&gt;About programing, including crawlers/spiders/robots go to my programing blog on &lt;a href="http://popnadrian.pan-internet.com/"&gt;pan-internet&lt;/a&gt;.com&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Sorry for leaving in this way this domain, probably this will be used when i will write somewhere some articles about the subject that was treated on this blog untill now: crawlers, here&lt;a href="http://popnadrian.pan-internet.com/post/2010/01/10/buildng-simple-crawler-indexing-internet-starting-from-one-page.aspx"&gt; crawler theory&lt;/a&gt;.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I hope that you will enjoy this.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7014766648070950269-7202648497153915819?l=dataprovider.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dataprovider.blogspot.com/feeds/7202648497153915819/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dataprovider.blogspot.com/2010/01/my-articles.html#comment-form' 
title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7014766648070950269/posts/default/7202648497153915819'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7014766648070950269/posts/default/7202648497153915819'/><link rel='alternate' type='text/html' href='http://dataprovider.blogspot.com/2010/01/my-articles.html' title='My articles'/><author><name>Pop Adrian</name><uri>http://www.blogger.com/profile/14420584885621264756</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7014766648070950269.post-6383318898696022219</id><published>2009-05-11T21:03:00.000-07:00</published><updated>2009-05-11T21:06:45.085-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='download'/><category scheme='http://www.blogger.com/atom/ns#' term='needed data'/><category scheme='http://www.blogger.com/atom/ns#' term='robots'/><category scheme='http://www.blogger.com/atom/ns#' term='crawler'/><title type='text'>What kind of robot do you need</title><content type='html'>An ordinary crawler? that is doing anything, anyhow, anywhere? Yes, I will publish something like that soon. Because my visitors expects from me something like this.&lt;br /&gt;&lt;br /&gt;So,&lt;span style="font-weight: bold;"&gt; what do you need?&lt;/span&gt; &lt;span style="font-style: italic;"&gt;Describe your needed data&lt;/span&gt; and I will made an aplication to get this data regulary. 
Describe it in the comments from here, and you will get your robot for free.&lt;br /&gt;&lt;br /&gt;Thanks!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7014766648070950269-6383318898696022219?l=dataprovider.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dataprovider.blogspot.com/feeds/6383318898696022219/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dataprovider.blogspot.com/2009/05/what-kind-of-robot-do-you-need.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7014766648070950269/posts/default/6383318898696022219'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7014766648070950269/posts/default/6383318898696022219'/><link rel='alternate' type='text/html' href='http://dataprovider.blogspot.com/2009/05/what-kind-of-robot-do-you-need.html' title='What kind of robot do you need'/><author><name>Pop Adrian</name><uri>http://www.blogger.com/profile/14420584885621264756</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7014766648070950269.post-1865490146179346349</id><published>2009-04-28T12:49:00.000-07:00</published><updated>2009-05-14T10:01:23.609-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='behavior'/><category scheme='http://www.blogger.com/atom/ns#' term='technology'/><category scheme='http://www.blogger.com/atom/ns#' term='mysql'/><category scheme='http://www.blogger.com/atom/ns#' term='crawler'/><category scheme='http://www.blogger.com/atom/ns#' term='mssql'/><category scheme='http://www.blogger.com/atom/ns#' term='limitations'/><title type='text'>Database 
servers: MySql versus MsSql Express</title><content type='html'>A modern application requires a fast database, in any domain. So when you choose which database to use, think about it not twice but ten times, if possible.&lt;br /&gt;Here is what happened. A friend asked me to get the data from a site for him. It was simple, so I chose to do it with his favorite database engine: &lt;span style="font-weight: bold;"&gt;MySql&lt;/span&gt;. I created the robot and ran it. While it ran, I saw that my computer was working slowly. When it finished, I gave the database to my friend, and then he challenged me to a fight, an &lt;span style="font-weight: bold;"&gt;SEO &lt;/span&gt;fight. Each of us would create an application that publishes that data on the web, and whoever earns more money wins. I accepted on one condition: we can use any technology we want. So I created a little application that moves the data from &lt;span style="font-weight: bold;"&gt;MySQL &lt;/span&gt;to &lt;span style="font-weight: bold;"&gt;MSSql 2005 Express&lt;/span&gt;.&lt;br /&gt;At that moment I realized why my computer had been slow while the robot was running. While &lt;span style="font-weight: bold;"&gt;MSSql &lt;/span&gt;used only a small percentage of the processor, with &lt;span style="font-weight: bold;"&gt;MySql &lt;/span&gt;the processor was used constantly at 30%. That is very bad!&lt;br /&gt;It is very bad because &lt;span style="font-weight: bold;"&gt;MySql &lt;/span&gt;was executing just two instructions: &lt;span style="font-style: italic;"&gt;select top 1 * from x where flag is true &lt;/span&gt;and &lt;span style="font-style: italic;"&gt;update x set processed = 1 where id = @id&lt;/span&gt;. In the same time, the &lt;span style="font-weight: bold;"&gt;MSSql &lt;/span&gt;server was executing one insert operation, plus another 5-6 insert operations using the id from the first insert.&lt;br /&gt;Is there really such a big difference? 
I know that the &lt;span style="font-style: italic; font-weight: bold;"&gt;update command &lt;/span&gt;&lt;span style="font-weight: bold;"&gt;consumes more resources &lt;/span&gt;but... such a big difference?&lt;br /&gt;&lt;br /&gt;The final conclusions:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;MySql &lt;/span&gt;works the processor hard, so keep this in mind when you design an application.&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;MSSql &lt;/span&gt;uses a lot of RAM, but I think this isn't a problem at current RAM prices.&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;MySql &lt;/span&gt;has no limit on the database size. However, in a previous use of this database server I saw that on a big database (10 GB, if I remember correctly) the server struggled or did not work at all.&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;MSSql &lt;/span&gt;2005 Express is limited to 4 GB of database size. Generally this is not a problem, but keep it in mind. There are other limitations too.&lt;/li&gt;&lt;li&gt;Commands generally execute faster on &lt;span style="font-weight: bold;"&gt;MSSql&lt;/span&gt; than on &lt;span style="font-weight: bold;"&gt;MySql&lt;/span&gt;.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;In other words, if you ask me for advice, here it is: &lt;span style="font-weight: bold;"&gt;MSSql&lt;/span&gt; is my recommendation when we speak about databases. It is also simpler to use when all your technologies come from a single provider.&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: left;"&gt;Here you can see the processor load on these two servers. The run took me 10 hours for 70,000 data rows (text only; the pictures were saved on the hard disk). 
Also, when I executed a command in the servers' clients (&lt;span style="font-weight: bold;"&gt;ssmsee for MSSql &lt;/span&gt;and &lt;span style="font-weight: bold;"&gt;Heidi for MySql&lt;/span&gt;), there were some &lt;span style="font-weight: bold;"&gt;horrible results on the MySql side&lt;/span&gt;: timeouts and a long wait to count 2-30000 rows; here &lt;span style="font-weight: bold;"&gt;MSSql did a good job&lt;span style="font-weight: bold;"&gt;. &lt;/span&gt;&lt;/span&gt;For a moment I believed that these two can't be compared. Anyway, the limitations are the only price you pay for MSSql 2005 Express; with MySql you pay with application speed, and that is ugly.&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/__0P_YRb2aNY/SfdkFAZNHNI/AAAAAAAAADQ/nk7Gp_se8gM/s1600-h/mssql_vs_mysql_computer_processes.jpg"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer; width: 400px; height: 379px;" src="http://3.bp.blogspot.com/__0P_YRb2aNY/SfdkFAZNHNI/AAAAAAAAADQ/nk7Gp_se8gM/s400/mssql_vs_mysql_computer_processes.jpg" alt="" id="BLOGGER_PHOTO_ID_5329838721184111826" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;MySql&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/__0P_YRb2aNY/SfdoEwFReaI/AAAAAAAAADo/x5Ogzf_rbiY/s1600-h/mssql_vs_mysql_computer_mysql_samplequery3.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 159px;" src="http://2.bp.blogspot.com/__0P_YRb2aNY/SfdoEwFReaI/AAAAAAAAADo/x5Ogzf_rbiY/s400/mssql_vs_mysql_computer_mysql_samplequery3.jpg" alt="" id="BLOGGER_PHOTO_ID_5329843114852055458" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;MSSql&lt;br /&gt;&lt;/div&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" 
href="http://4.bp.blogspot.com/__0P_YRb2aNY/Sfdm1mEU3qI/AAAAAAAAADg/SBtKYnKMzc8/s1600-h/mssql_vs_mysql_computer_mssql_samplequery2.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 140px;" src="http://4.bp.blogspot.com/__0P_YRb2aNY/Sfdm1mEU3qI/AAAAAAAAADg/SBtKYnKMzc8/s400/mssql_vs_mysql_computer_mssql_samplequery2.jpg" alt="" id="BLOGGER_PHOTO_ID_5329841754954063522" border="0" /&gt;&lt;/a&gt;&lt;p&gt;&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7014766648070950269-1865490146179346349?l=dataprovider.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dataprovider.blogspot.com/feeds/1865490146179346349/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dataprovider.blogspot.com/2009/04/database-servers-mysql-versus-mssql.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7014766648070950269/posts/default/1865490146179346349'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7014766648070950269/posts/default/1865490146179346349'/><link rel='alternate' type='text/html' href='http://dataprovider.blogspot.com/2009/04/database-servers-mysql-versus-mssql.html' title='Database servers: MySql versus MsSql Express'/><author><name>Pop Adrian</name><uri>http://www.blogger.com/profile/14420584885621264756</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/__0P_YRb2aNY/SfdkFAZNHNI/AAAAAAAAADQ/nk7Gp_se8gM/s72-c/mssql_vs_mysql_computer_processes.jpg' height='72' 
width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7014766648070950269.post-2406806662348986392</id><published>2009-04-13T13:25:00.000-07:00</published><updated>2009-05-14T10:02:36.846-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='yahoo messenger'/><category scheme='http://www.blogger.com/atom/ns#' term='hidden'/><category scheme='http://www.blogger.com/atom/ns#' term='visible'/><category scheme='http://www.blogger.com/atom/ns#' term='robots'/><category scheme='http://www.blogger.com/atom/ns#' term='crawler'/><category scheme='http://www.blogger.com/atom/ns#' term='avaible'/><title type='text'>Special Robots: Yahoo!</title><content type='html'>&lt;div style="text-align: justify;"&gt;Yahoo! Messenger: a communication tool used by a lot of people. In Romania, about 95% of Internet users use it.&lt;br /&gt;&lt;br /&gt;This tool lets you set your status: available, hidden, a custom message and other options. But the guys from Yahoo! are not such good programmers, so if you set your status to hidden, someone can still see that you are hidden. This bug has been exploited by a lot of programmers who saw in it an opportunity to make some money. Not very much, but more than nothing, be sure of that. Some of my old partners started their own "business" on the back of this Yahoo! bug.&lt;br /&gt;&lt;br /&gt;I mean &lt;a href="http://www.status-yahoo.ro/"&gt;http://www.status-yahoo.ro/&lt;/a&gt; and &lt;a href="http://www.yahoo-invisible.eu/"&gt;http://www.yahoo-invisible.eu/&lt;/a&gt;. These are simple robots (don't forget, I am an expert in this :)  ) that send requests to Yahoo!'s servers (or Hi5's servers, or any other badly built application) and get some information about the requested IDs. Nice for some of us; for others, not so much.&lt;br /&gt;&lt;br /&gt;Anyway, my ex-partners are happy with their sites because they have earned some money. 
And as you can see in a previous post, I don't agree with this kind of job; it doesn't respect personal privacy. If somebody wants to talk to me but I have no time for them, what should I do? Yahoo! lets me set my account to invisible, but my friends can see my real status because of these guys. I understand that this is Yahoo!'s problem, but what about my privacy? The only way I can be sure of my privacy is to not start Yahoo! Messenger at all. Not a good solution.&lt;br /&gt;&lt;br /&gt;In the end, they tell me that these sites aren't used by my friends. The main visitors/clients of these sites are children between 13 and 18 years old. "Kids", but they can still affect my "reputation"...&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7014766648070950269-2406806662348986392?l=dataprovider.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dataprovider.blogspot.com/feeds/2406806662348986392/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dataprovider.blogspot.com/2009/04/special-robots-yahoo.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7014766648070950269/posts/default/2406806662348986392'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7014766648070950269/posts/default/2406806662348986392'/><link rel='alternate' type='text/html' href='http://dataprovider.blogspot.com/2009/04/special-robots-yahoo.html' title='Special Robots: Yahoo!'/><author><name>Pop Adrian</name><uri>http://www.blogger.com/profile/14420584885621264756</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7014766648070950269.post-4534261178275172618</id><published>2009-04-08T13:06:00.000-07:00</published><updated>2009-05-11T21:01:19.410-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='anonymus'/><category scheme='http://www.blogger.com/atom/ns#' term='download'/><category scheme='http://www.blogger.com/atom/ns#' term='robots'/><category scheme='http://www.blogger.com/atom/ns#' term='crawler'/><title type='text'>My crawlers: downloads</title><content type='html'>Here you can download my crawlers. Except that there is nothing to download here; there is no crawler.&lt;br /&gt;&lt;br /&gt;I am posting this because I received a comment from an anonymous visitor who told me that one of my crawlers did something bad to his computer after he downloaded and installed it.&lt;br /&gt;&lt;br /&gt;Sorry guys, I have not published any crawler. First of all, a crawler means money; this is what I live from :) so be sure that I will not publish one for free. Second, at the moment I don't believe in open source...&lt;br /&gt;&lt;br /&gt;So if you download a robot from "me", check its provenience. 
:))&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7014766648070950269-4534261178275172618?l=dataprovider.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dataprovider.blogspot.com/feeds/4534261178275172618/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dataprovider.blogspot.com/2009/04/my-crawlers-downloads.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7014766648070950269/posts/default/4534261178275172618'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7014766648070950269/posts/default/4534261178275172618'/><link rel='alternate' type='text/html' href='http://dataprovider.blogspot.com/2009/04/my-crawlers-downloads.html' title='My crawlers: downloads'/><author><name>Pop Adrian</name><uri>http://www.blogger.com/profile/14420584885621264756</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7014766648070950269.post-7851316200300372488</id><published>2009-03-04T23:24:00.000-08:00</published><updated>2009-03-04T23:26:03.245-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='robots'/><category scheme='http://www.blogger.com/atom/ns#' term='crawler'/><title type='text'>I'm an data provider! What does that means?</title><content type='html'>&lt;div style="text-align: justify;"&gt;Hello everyone! My name is Adrian and I am a &lt;span style="font-weight: bold;"&gt;data provider. &lt;/span&gt;What means that? 
In the last three years I've been working as a &lt;span style="font-weight: bold;"&gt;crawler &lt;/span&gt;programmer, or &lt;span style="font-weight: bold;"&gt;robot &lt;/span&gt;programmer. What does this kind of programming mean?&lt;br /&gt;&lt;br /&gt;A &lt;span style="font-weight: bold;"&gt;crawler &lt;/span&gt;(also called a &lt;span style="font-weight: bold;"&gt;web spider&lt;/span&gt;) is a program or script that browses the internet automatically and grabs data from it. Where is this kind of data useful? The answer is everywhere; you only have to decide what information you need, and I will grab that data for you.&lt;br /&gt;&lt;br /&gt;Who uses the &lt;span style="font-weight: bold;"&gt;data&lt;/span&gt;? Everybody. The first big users of crawled data are the &lt;span style="font-weight: bold;"&gt;search engines&lt;/span&gt;: Google, Yahoo!, Windows Live and all the other search engines on the internet are based on a robot, a crawler. The robot walks the internet and grabs a lot of information: text content, pictures, different file formats (pdf, html, ...). Once the data is collected, the search engine processes it in several steps: it extracts the keywords and counts how many URLs link to each page. After processing, it has data that can go to the users: the user gets a list of pages relevant to his search. All this is thanks to a &lt;span style="font-weight: bold;"&gt;crawler&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;Who else uses &lt;span style="font-weight: bold;"&gt;robots&lt;/span&gt;? A lot of companies and people. For example, my current company has a real estate site, and the ads come from a lot of real estate agencies that have their own web pages. For every agency web site I made a &lt;span style="font-weight: bold;"&gt;robot&lt;/span&gt; that walks the client's pages every week and collects the details of its ads (the address, phone, price, title, description, no. 
of rooms and other details). My work is available on the internet; just contact me about it.&lt;br /&gt;&lt;br /&gt;This is my first article for this blog, so I don't explain the whole process here; later articles will explain in more detail what I do. For more details, or if you are interested in some data that is on the internet, don't hesitate to contact me.&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7014766648070950269-7851316200300372488?l=dataprovider.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dataprovider.blogspot.com/feeds/7851316200300372488/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dataprovider.blogspot.com/2009/03/im-data-provider-what-does-that-means.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7014766648070950269/posts/default/7851316200300372488'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7014766648070950269/posts/default/7851316200300372488'/><link rel='alternate' type='text/html' href='http://dataprovider.blogspot.com/2009/03/im-data-provider-what-does-that-means.html' title='I&apos;m an data provider! 
What does that means?'/><author><name>Pop Adrian</name><uri>http://www.blogger.com/profile/14420584885621264756</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7014766648070950269.post-8083034380313460410</id><published>2009-01-20T11:29:00.000-08:00</published><updated>2009-01-25T12:06:20.464-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='web application'/><category scheme='http://www.blogger.com/atom/ns#' term='contact'/><category scheme='http://www.blogger.com/atom/ns#' term='firewall'/><category scheme='http://www.blogger.com/atom/ns#' term='defend'/><category scheme='http://www.blogger.com/atom/ns#' term='robots'/><category scheme='http://www.blogger.com/atom/ns#' term='legally'/><category scheme='http://www.blogger.com/atom/ns#' term='crawler'/><title type='text'>Crawler: legally or illegally</title><content type='html'>The crawler: the &lt;span style="font-style: italic;"&gt;love&lt;/span&gt; of every web developer, or probably the love of every dedicated data collector. Who knows...&lt;br /&gt;&lt;br /&gt;Anyway, we need crawlers so that people can find us; we invite Google, Yahoo! and other web search engines to index our pages. What about other robots? Somebody who tries to steal our data... hmm, bad guy! Not nice of him, not nice... But he needs your data, which means that your data is important! And that means you have to protect it. How? There are many ways to protect your data:&lt;br /&gt;&lt;ol style="text-align: justify;"&gt;&lt;li&gt;The firewall. If the firewall is well configured, you will not have problems. A bad guy creates a simple application that makes a lot of requests to your site; here the firewall can stop that crawler by blocking the bad guy's application based on its IP address. 
Bye bye!&lt;br /&gt;&lt;/li&gt;&lt;li&gt;The application should implement some simple security "barriers": you should not expose e-mail addresses or other contact data that a system for harvesting user details could collect directly. Beyond that, think about how a purpose-built crawler works. Most of the time, a robot made to grab the data from your site will walk your web application through its own navigation system: the search system, the region/category division of your site, a public unsecured web service or other weaknesses of your application. From these pages, the bad guy's application collects the URLs of your "raw" data, the individual data objects; for example, if your site were Google, this is where the robot would grab the search results. So it walks your database in a particular order, and you will see a "user", a special user, walking through your database in that particular order. It tries to take all your results, so this is where you can stop it (Google does something like this too: if there are 1,000,000 results, the user gets only 1,000). Similarly, you can implement something like Google's approach: if a query has more than X results, offer the user only Y of them (Y less than X, something like 1,000 out of 100,000). If he needs more results, he has to change his query to a more precise one. Note that your search system has to work very well if you decide to use this defense.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;&lt;div style="text-align: justify;"&gt;These are some methods that help you defend yourself against robots. But a good "bad guy" will walk your web site anyway, using a proxy server, a dynamic IP, or by running his application slowly! Here you have to defend yourself against the bad guys in other ways. If one of them takes your data, you can get some money from him, but you have to identify him and prove that he has stolen your data. In most cases the &lt;span style="font-style: italic;"&gt;bad guy&lt;/span&gt; will publish your data on his site. 
So you have to find this site and take the bad guy to court under copyright law. At this point the bad guy gets a new name: the poor guy. But still a bad guy!&lt;br /&gt;&lt;br /&gt;How do I proceed? Simple. If it is a simple site with simple data to collect (mostly your own data, or your friends'/enemies' data, from games, social networks and so on), I consider the &lt;span style="font-weight: bold;"&gt;robot &lt;/span&gt;a &lt;span style="font-weight: bold;"&gt;necessary &lt;/span&gt;tool for you that is not dangerous for the crawled website, and I make the robot as quickly as possible. I see a lot of these robots, and nobody complains that they do any harm.&lt;br /&gt;&lt;br /&gt;But when your robot has to collect a lot of data, I will create the robot, but you should have permission from that website. The crawled web site will not be affected by my robot, and you will have your data very quickly. The &lt;span style="font-weight: bold;"&gt;robot is ready to go, so contact me&lt;/span&gt; as quickly as possible!&lt;br /&gt;&lt;br /&gt;Why choose a robot when you could do the work yourself and create an export from one database to another? Because it is much quicker. Because it costs less than paying a programmer to work a day or two on this. Because the crawled data is ready to go directly into your own database. There are a lot of reasons to choose a crawler! 
&lt;span style="font-weight: bold;"&gt;Contact me: &lt;/span&gt;&lt;span style="font-style: italic;"&gt;post a comment on this blog&lt;/span&gt;, and you will be contacted by me very quick!&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7014766648070950269-8083034380313460410?l=dataprovider.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dataprovider.blogspot.com/feeds/8083034380313460410/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dataprovider.blogspot.com/2009/01/crawler-legally-or-illegally.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7014766648070950269/posts/default/8083034380313460410'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7014766648070950269/posts/default/8083034380313460410'/><link rel='alternate' type='text/html' href='http://dataprovider.blogspot.com/2009/01/crawler-legally-or-illegally.html' title='Crawler: legally or illegally'/><author><name>Pop Adrian</name><uri>http://www.blogger.com/profile/14420584885621264756</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7014766648070950269.post-8881378742664860319</id><published>2009-01-15T03:26:00.000-08:00</published><updated>2009-01-15T13:42:49.042-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='example'/><category scheme='http://www.blogger.com/atom/ns#' term='proverbs'/><category scheme='http://www.blogger.com/atom/ns#' term='crawler'/><title type='text'>Example: Site that is builded using the crawlers</title><content type='html'>&lt;div style="text-align: 
justify;"&gt;&lt;br /&gt;Last year I had an exam at my faculty (Computer Science, Faculty of Science, University of Oradea) in which we, the students, had to show the teacher a website that we had programmed ourselves in PHP. Even though I am not a PHP guru, I built the site; but the site needed a subject, and I chose one: proverbs. It is a site where internet users can add their own proverbs, read proverbs in several languages, and so on. You can see the application here: &lt;a href="http://www.proverbs.iuliumaniu.ro/"&gt;http://www.proverbs.iuliumaniu.ro/&lt;/a&gt; .&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;To demonstrate properly what my application can do, I created a robot that collects proverbs from a lot of sites; in the end I had ~15,000 proverbs, enough for the application to look good when presented to the teacher. How did I do this? Simple...&lt;br /&gt;&lt;br /&gt;First I take a look at a site that has proverbs. I know what I need: the proverb, its language and its provenience (country, region or era, for example Ancient), and I look at how I can extract this data from that site. Then I write a simple script (I generally work in .NET with C#, so I made a little program for this). After I run my magic program, I have all the proverbs, and for each proverb its language and its provenience, in &lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/__0P_YRb2aNY/SW-oxx7W2WI/AAAAAAAAACQ/lWKupDdYZZY/s1600-h/crawler_brut.PNG"&gt;&lt;img style="margin: 0pt 0pt 10px 10px; float: right; cursor: pointer; width: 320px; height: 246px;" src="http://3.bp.blogspot.com/__0P_YRb2aNY/SW-oxx7W2WI/AAAAAAAAACQ/lWKupDdYZZY/s320/crawler_brut.PNG" alt="" id="BLOGGER_PHOTO_ID_5291633660352977250" border="0" /&gt;&lt;/a&gt;my data storage. 
My data storage can be a text file, an XML file, a MySQL database or, usually, a Microsoft SQL Express database (which is what I prefer). In this data storage the collected data does not look very good. More exactly, I don't know in advance what data my robot will extract (for example, I don't know all the country names or all the languages), so to cope with this I put all the data into a single table with all the fields, as you can see in this picture. One important reason why I don't extract the data straight into a relational database is that when I build a robot, I teach it some patterns for a specific site; the site does not always follow those patterns on every page, and sometimes errors appear, as you can see in this picture. After all the data is grabbed into my database, I start to process it: I remove duplicate data and I repair bad data (for example, in this picture you can see the provenience of the proverb as "Traditional Proverb ( - )"; a good value for the provenience in this case is "Tradition", so I have to update the table from bad data to good data). 
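&lt;br /&gt;&lt;br /&gt;Here is a minimal sketch of that clean-up step in Python with SQLite. It is only an illustration, not the C# program or the SQL Express database mentioned above; the table name raw_proverbs, its columns and the sample rows are assumptions made up for the example.&lt;br /&gt;&lt;pre&gt;
```python
import sqlite3

# Hypothetical flat staging table: every field is plain text,
# because the crawler cannot know the valid values in advance.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_proverbs (proverb TEXT, language TEXT, provenience TEXT)"
)

# Made-up sample rows, including one duplicate and one badly parsed value.
rows = [
    ("A stitch in time saves nine.", "English", "Traditional Proverb ( - )"),
    ("A stitch in time saves nine.", "English", "Traditional Proverb ( - )"),
    ("Haste makes waste.", "English", "Tradition"),
]
conn.executemany("INSERT INTO raw_proverbs VALUES (?, ?, ?)", rows)

# Step 1: repair bad data by mapping the mis-parsed value to the good one.
conn.execute(
    "UPDATE raw_proverbs SET provenience = ? WHERE provenience = ?",
    ("Tradition", "Traditional Proverb ( - )"),
)

# Step 2: remove duplicates, keeping the first copy (lowest rowid)
# of every identical (proverb, language, provenience) combination.
conn.execute(
    """
    DELETE FROM raw_proverbs
    WHERE rowid NOT IN (
        SELECT MIN(rowid)
        FROM raw_proverbs
        GROUP BY proverb, language, provenience
    )
    """
)

cleaned = conn.execute(
    "SELECT proverb, language, provenience FROM raw_proverbs ORDER BY rowid"
).fetchall()
for row in cleaned:
    print(row)
```
&lt;/pre&gt;In this sketch the repair runs before the duplicate removal, so that rows which differ only by a bad value collapse into a single row.&lt;br /&gt;&lt;br /&gt;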
The last step is to break the processed data into relational tables; once that is done, my job regarding the data is finished: &lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/__0P_YRb2aNY/SW-rgunBOAI/AAAAAAAAACY/DPgkgMy_hIU/s1600-h/crawler_processed.PNG"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 200px; height: 114px;" src="http://1.bp.blogspot.com/__0P_YRb2aNY/SW-rgunBOAI/AAAAAAAAACY/DPgkgMy_hIU/s200/crawler_processed.PNG" alt="" id="BLOGGER_PHOTO_ID_5291636665939474434" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;At this point my site is populated and I have more than 15,000 proverbs. I finished the web application, which can be found at &lt;a href="http://www.proverbs.iuliumaniu.ro/"&gt;http://www.proverbs.iuliumaniu.ro/&lt;/a&gt;, gave a presentation to the teacher and my colleagues, and got my 10. As for the resources, I programmed for approximately 15 minutes (I have my own framework for this) and ran the application for ~30 minutes; all this for a ten. Since then I have added some Google ads to the application, it looks nicer, and I get some money from it. All this in 30 minutes!&lt;br /&gt;&lt;br /&gt;Are you interested in this? Don't hesitate to contact me! Write a comment and I will contact you as soon as possible. Thanks!&lt;br /&gt;&lt;br /&gt;And don't forget! A crawler can be the start of your web application! Or, why not, a desktop application! 
So command your own crawler right now; your needed programmer is here!&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7014766648070950269-8881378742664860319?l=dataprovider.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://dataprovider.blogspot.com/feeds/8881378742664860319/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://dataprovider.blogspot.com/2009/01/example-site-that-is-builded-using.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7014766648070950269/posts/default/8881378742664860319'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7014766648070950269/posts/default/8881378742664860319'/><link rel='alternate' type='text/html' href='http://dataprovider.blogspot.com/2009/01/example-site-that-is-builded-using.html' title='Example: Site that is builded using the crawlers'/><author><name>Pop Adrian</name><uri>http://www.blogger.com/profile/14420584885621264756</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/__0P_YRb2aNY/SW-oxx7W2WI/AAAAAAAAACQ/lWKupDdYZZY/s72-c/crawler_brut.PNG' height='72' width='72'/><thr:total>0</thr:total></entry></feed>
