An ordinary crawler that does anything, anyhow, anywhere? Yes, I will publish something like that soon, because my visitors expect something like this from me.
So, what do you need? Describe the data you need and I will make an application that collects it regularly. Describe it in the comments here, and you will get your robot for free.
Thanks!
Monday, May 11, 2009
Tuesday, April 28, 2009
Database servers: MySql versus MsSql Express
A modern application requires a fast database, in any domain. So when you choose which database to use, for your own convenience think about it not twice but ten times, if possible.
What happened... A friend asked me to get the data from a site for him. It was simple, so I chose to do this with his favorite database engine: MySql. I created the robot and ran it. During this time I saw that my computer was working slowly. I finished, I gave my friend the database, and after that he challenged me to a fight, an SEO fight: each of us would create an application that publishes that data on the web, and whoever earns more money wins. I accepted with one condition: we can use any technology we want. So I created a little application that moves the data from MySQL to MSSql 2005 Express.
At this point I realized why my computer was slow while the robot was running. While MSSql used only a small percentage of the processor, with MySql the processor was constantly at 30%. That is very bad!
It is very bad because MySql was executing only two statements: select top 1 * from x where flag is true and update x set processed = 1 where id = @id. Meanwhile, the MSSql server was executing one insert operation, plus another 5-6 insert operations using the id from the first insert.
Is there really such a big difference? I know that an update command consumes more resources, but... such a big difference?
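The polling pattern described above (pick one unprocessed row, then mark it as processed) can be sketched roughly as follows. This is only an illustration using Python's built-in sqlite3 module, not the MySql or MSSql setup from the test. The table x and the processed column come from the statements above; the url column, the index name and the function names are assumptions. Note that select top 1 is MSSql syntax, while MySQL and SQLite write LIMIT 1. The index on the flag column is also an assumption: without one, every "next row" lookup may have to scan the table, which is one possible (unverified) explanation for a high processor load.

```python
import sqlite3


def create_queue(conn, urls):
    """Create table x and fill it with unprocessed rows (the url column is hypothetical)."""
    conn.execute(
        "CREATE TABLE x (id INTEGER PRIMARY KEY, url TEXT NOT NULL, "
        "processed INTEGER NOT NULL DEFAULT 0)"
    )
    # Assumption: an index on the flag keeps the 'next row' lookup cheap.
    conn.execute("CREATE INDEX ix_x_processed ON x (processed, id)")
    conn.executemany("INSERT INTO x (url) VALUES (?)", [(u,) for u in urls])
    conn.commit()


def take_next(conn):
    """Return (id, url) of the next unprocessed row and mark it processed, or None."""
    row = conn.execute(
        "SELECT id, url FROM x WHERE processed = 0 ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    conn.execute("UPDATE x SET processed = 1 WHERE id = ?", (row[0],))
    conn.commit()
    return row


def drain(conn):
    """Take rows one by one until none are pending and return the visited urls."""
    visited = []
    while (row := take_next(conn)) is not None:
        visited.append(row[1])
    return visited
```

A robot would call take_next in a loop, download the page for the returned url, and stop when it gets None.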
The final conclusions:
- MySql works the processor hard, so keep this in mind when you design an application.
- MSSql uses a lot of RAM, but I think this isn't a problem at today's RAM prices.
- MySql has no limit on database size. However, in a previous use of this database server I saw that on a big database (10 GB, if I remember correctly) the server worked too hard, or didn't work at all.
- MSSql 2005 Express is limited to 4 GB of database size. Generally this is not a problem, but keep it in mind. There are other limitations as well.
- Command execution is, in general, faster on MSSql than on MySql.
Here you can see the processor load on these two servers. The run took me 10 hours for 70,000 data rows (text only; the pictures were saved on the hard disk). Also, when I executed commands in the servers' clients (SSMSEE for MSSql and Heidi for MySql), there were some horrible results on the MySql side: timeouts, and a long time to count 2-30000 rows; here MSSql did a good job. For a moment I believed that these two can't even be compared. Anyway, the limitations are the only price you pay for MSSql 2005 Express; with MySql you pay with application speed, and that is ugly.

[Image: processor load with MySql]
Labels:
behavior,
crawler,
limitations,
mssql,
mysql,
technology
Monday, April 13, 2009
Special Robots: Yahoo!
Yahoo! Messenger: a communication tool that is used by a lot of people. In Romania that is about 95% of Internet users...
Well... this tool offers you the opportunity to set your status: available, hidden, a custom message and other options. But the guys from Yahoo! are not such good programmers, so if you set your status to hidden, someone can still see that you are hidden. This bug was exploited by a lot of programmers who saw in it an opportunity to make some money. Not very much... but it is more than nothing, be sure of that. On the back of this bug some of my old partners started their own "business" on Yahoo!'s body...
I am talking about http://www.status-yahoo.ro/ and http://www.yahoo-invisible.eu/. These are simple robots (don't forget, I am an expert in this :) ) that make some requests to Yahoo!'s servers (or Hi5's server, or any other bad application) and get some information about the requested IDs. Nice... for some of us... for others...
Anyway, my ex-partners are happy with their sites because they made some money. And as you can see in a previous post, I don't agree with this kind of job; it doesn't respect personal privacy. If somebody wants to talk to me but I have no time for him or her... what should I do? Yahoo! lets me set my account to invisible... but my friends see my real status because of these guys... I understand, this is a Yahoo! problem, but what about my privacy? The only way I can be sure of my privacy is not to start Yahoo! Messenger... Not a good way.
In the end... they told me that these sites aren't used by my friends. The main visitors/clients of these sites are children between 13 and 18 years old. "Kids", but they can affect my "reputation"...
Wednesday, April 8, 2009
My crawlers: downloads
Here you can download my crawlers. Except that there is nothing to download here; there is no crawler.
I am posting this because I received a comment from an Anonymous who told me that one of my crawlers did something bad to his computer after he downloaded and installed it.
Sorry guys, I haven't published any crawler: first of all because a crawler means money, and I live from this :) so be sure that I will not publish one for free; second, at this moment I don't believe in open source...
So if you download a robot from "me", make sure of its provenance. :))
Wednesday, March 4, 2009
I'm a data provider! What does that mean?
Hello everyone! My name is Adrian and I am a data provider. What does that mean? For the last three years I've been working as a crawler programmer, or robot programmer. What does this kind of programming involve?
A crawler (also called a web spider) is a program or script that surfs the internet automatically and grabs data from it. Where is this kind of data useful? The answer is: everywhere. You only have to see where you need the information, and I will grab that data for you.
Who is using the data? Everybody. The first big users of crawled data are the search engines: Google, Yahoo, Windows Live and all the other search engines on the internet are based on a robot, a crawler. The robot walks the internet and grabs a lot of information: text content, pictures, different file formats (pdf, html, ...). Once the data is collected, the search engine works on it in several steps: it collects the keywords and counts how many URLs link to a page. After the data is processed, it is ready to go to the users: the user gets a list of pages relevant to his search. All this is thanks to a crawler.
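The two processing steps mentioned above (collecting keywords and counting the links that point to a page) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the function names, the shape of the crawled data (a dict of url to text and outgoing links), and above all the ranking rule, which simply orders matching pages by their number of inbound links. Real search engines do far more than this.

```python
import re
from collections import Counter, defaultdict


def build_index(pages):
    """pages maps url -> (text, outgoing links). Returns (keyword index, inbound link counts)."""
    index = defaultdict(set)   # keyword -> set of urls containing it
    inbound = Counter()        # url -> number of distinct pages linking to it
    for url, (text, links) in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
        for target in set(links):
            if target != url:  # ignore self-links
                inbound[target] += 1
    return index, inbound


def search(query, index, inbound):
    """Return urls containing every query word, most-linked first."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    matches = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(matches, key=lambda u: (-inbound[u], u))
```

The crawler's job ends where build_index begins: it only has to deliver the text and the links of each page it visited.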
Who else is using robots? A lot of companies and people. For example, my current company has a real estate site, and the ads come from a lot of real estate agencies that have their own web pages. For every real estate agency website I made a robot that walks the client's pages every week and collects the ad details (address, phone, price, title, description, number of rooms and so on). My work is available on the internet; just contact me about it.
This is my first article for this blog, so I don't explain the whole process here; there will be articles that explain in more detail what I am doing. For more details, or if you are interested in some data that is on the internet, don't hesitate to contact me.
Tuesday, January 20, 2009
Crawler: legally or illegally
The crawler: the love of every web developer; or, probably, the love of everyone involved in data collection. Who knows...
Anyway, we need our site to be known to people; we invite Google, Yahoo! or other web search engines to index our pages. What about other robots? Somebody who tries to steal our data... hmm, bad guy! Not nice of him, not nice... But he needs your data, and that means your data is important! That means you have to protect it. How? There are many ways to protect your data:
- The firewall: if a firewall is well configured, you will not have problems. A bad guy creates a simple application that makes a lot of requests to your site; here the firewall can stop that crawler by blocking the bad guy's application by its IP. Bye bye!
- The application should implement some simple security "barriers": you don't have to expose e-mail addresses or other contact data that can be directly used by a user-collection system. Beyond that, think about how a targeted crawler works. Most of the time, a robot that is made to grab the data from your site will walk your web application using your website's navigation system: the search system, the region/category divisions of your site, a public unsecured web service or other weaknesses of your application. From these pages the bad guy's application will collect the URLs to your "raw" data, to the individual data objects; for example, if your site is Google, this is where the robot will grab the search results. So it will walk your database in a particular order; you will see that a "user", a special user, walks your database in a particular order. He takes all your results, so you can stop him (Google also does something like this: if there are 1,000,000 results, the user gets only 1,000). Similarly, you can implement something like Google: if the query has more than X results, you offer the user only Y results (Y less than X; something like Y = 1,000 and X = 100,000). If he needs more results, he changes his query to a more exact one; note that your search system has to work very well if you decide to use this defense.
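The two defenses in the list above, limiting requests per IP and capping the number of results a query can return, can be sketched as follows. This is a minimal illustration under stated assumptions: the class and function names are hypothetical, the limiter is an in-memory sliding window inside the application (a real firewall does this at the network level), and the default cap of 1,000 simply mirrors the Google example.

```python
import time
from collections import defaultdict, deque


class IpRateLimiter:
    """Allow at most max_requests per window_seconds for each client IP."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock                # injectable so the limiter can be tested
        self.hits = defaultdict(deque)    # ip -> timestamps of recent requests

    def allow(self, ip):
        now = self.clock()
        hits = self.hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False                  # too many requests: block this one
        hits.append(now)
        return True


def cap_results(results, total_found, limit=1000):
    """Return at most `limit` results and a flag saying the query should be narrowed."""
    truncated = total_found > limit
    return results[:limit], truncated
```

A request handler would call allow(ip) first and refuse the request when it returns False; the search page would show the truncated flag as a hint to refine the query.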
These are some methods that help you defend yourself against robots. But a good "bad guy" will walk your website anyway... using a proxy server, a dynamic IP, or running his application slowly! Here you have to defend yourself against the bad guys in other ways. If he gets your data, you can get some money from him, but you have to identify him, and you have to prove that this guy stole your data. In most cases the bad guy will publish your data on his own site. So you have to find that site and take the bad guy to court, invoking copyright law. At this moment the bad guy is renamed: the poor guy. But still a bad guy!
How do I proceed... Simple. If it is a simple site with simple data to be collected (most of all your own data, or your friends'/enemies' data, from games, social networks and so on), I consider the robot a necessary tool for you that is not dangerous for the crawled website, and I make the robot as quickly as possible. I see a lot of these robots, and nobody complains that they do any harm.
But when your robot has to collect a lot of data, I will create the robot, but you should have permission from that website. The crawled website will not be affected by my robot, and you will have your data very quickly. The robot is ready to go, so contact me as quickly as possible!
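One common way for a robot to avoid affecting the crawled site is to read the site's robots.txt file and wait between requests. The sketch below is an assumption about how that could look, not a description of my framework, and it does not replace the explicit permission mentioned above. It uses Python's standard urllib.robotparser; the function names, the "ExampleBot" user agent in the usage note and the 5-second default delay are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def load_rules(robots_txt):
    """Parse the text of a robots.txt file that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def plan_fetch(parser, user_agent, url, default_delay=5.0):
    """Return the number of seconds to wait before fetching url, or None if it is disallowed."""
    if not parser.can_fetch(user_agent, url):
        return None
    # Crawl-delay is a non-standard directive; fall back to our own delay without it.
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default_delay
```

The robot would call plan_fetch(rules, "ExampleBot", url) before every request, skip the page when it returns None, and otherwise sleep for the returned number of seconds.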
Why choose a robot when you could do the work yourself and create an export from one database to another? Because it is more than quick. Because it costs less than paying a programmer to work a day or two on this. Because the crawled data is ready to go directly into your own database. There are a lot of reasons to choose a crawler! Contact me: post a comment on this blog, and I will contact you very quickly!
Thursday, January 15, 2009
Example: A site built using crawlers
Last year I had an exam at my faculty (Computer Science, Faculty of Science, University of Oradea) in which we, the students, had to show the teacher a website programmed by us in PHP. Even though I am not a PHP guru, I built the site; but the idea was that the site should have a subject, and I chose one: proverbs. A site where internet users can add their proverbs, where they can read proverbs in several languages, and so on... You can see the application here: http://www.proverbs.iuliumaniu.ro/ .
And to demonstrate properly what my application can do, I created a robot that collects proverbs from a lot of sites; in the end I got ~15,000 proverbs, enough for my application to look good when presented to the teacher. How did I do this? Simple...
First I take a look at a site that has proverbs. I know what I need: the proverb, its language and its provenance (country/region/time, for example Ancient), and I look at how I can extract this data from that site. After that I create a simple script (I generally work in .NET, C#, so I made a little program for this). After running my magic program I had all the proverbs and, for each proverb, its language and its provenance.
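The extraction step described above could look roughly like this. It is a sketch in Python using only the standard library, not the C# program from the post, and the page markup is entirely hypothetical: it assumes each proverb sits in an li element with class "proverb" and carries its language and provenance in data-lang and data-origin attributes. A real site needs its own rules, found by looking at its HTML.

```python
from html.parser import HTMLParser


class ProverbParser(HTMLParser):
    """Collect proverbs from <li class="proverb" data-lang=... data-origin=...> items."""

    def __init__(self):
        super().__init__()
        self.proverbs = []
        self._current = None  # the proverb being read, or None outside a proverb item

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "proverb":
            self._current = {
                "language": attrs.get("data-lang", ""),
                "provenance": attrs.get("data-origin", ""),
                "text": "",
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "li" and self._current is not None:
            # Collapse runs of whitespace left over from the page layout.
            self._current["text"] = " ".join(self._current["text"].split())
            self.proverbs.append(self._current)
            self._current = None


def extract_proverbs(html):
    """Return a list of dicts with language, provenance and text for one page."""
    parser = ProverbParser()
    parser.feed(html)
    parser.close()
    return parser.proverbs
```

Each returned dict holds exactly the three fields I was looking for, so the rows can go straight into the site's storage.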
At this point my site is populated: I have more than 15,000 proverbs. I finished the web application, which can be found at http://www.proverbs.iuliumaniu.ro/, I gave a presentation to the teacher and my colleagues, and I got my 10. As for resources, I programmed for approximately 15 minutes (I have my own framework for this) and ran the application for ~30 minutes; all this for a ten. Now I have added some Google ads to the application, it looks nicer, and I get some money from it. All this in 30 minutes!
Are you interested in this? Don't hesitate to contact me! Write a comment and I will contact you as soon as possible. Thanks!
And don't forget! A crawler can be the start of your web application! Or, why not, a desktop application! So order your own crawler right now; the programmer you need is here!
