[Haifux] [Haifux Meeting] Crawling in Lightning

Tzafrir Rehan tzafrir.r at gmail.com
Sun Feb 10 09:12:37 IST 2008


Depends on how nice the crawler is...

If it sends a distinctive user agent, respects robots.txt, and keeps a fixed
IP address, then you can block it by any of those means.
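
As a rough illustration of those three handles, here is a minimal Python
sketch. The crawler name "ExampleBot", the block lists and the robots.txt
content are all made up for the example; the addresses come from the
documentation ranges. Note that robots.txt is only a request that a polite
crawler honors, while the user agent and IP checks are what the server itself
would enforce.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical block lists; names and values are assumptions for illustration.
BLOCKED_AGENT_SUBSTRINGS = ("examplebot",)
BLOCKED_IPS = {"192.0.2.10"}  # documentation-range address, not a real crawler

# A robots.txt asking the hypothetical crawler to stay away entirely,
# while leaving the site open to everyone else.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow:
"""


def robots_allows(user_agent, url):
    """What a well-behaved crawler would conclude from robots.txt."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


def should_block(user_agent, remote_ip):
    """Server-side check: refuse known crawler agents and known addresses."""
    agent = user_agent.lower()
    if any(s in agent for s in BLOCKED_AGENT_SUBSTRINGS):
        return True
    return remote_ip in BLOCKED_IPS


if __name__ == "__main__":
    print(robots_allows("ExampleBot", "http://example.com/page"))
    print(should_block("ExampleBot/1.0", "198.51.100.7"))
    print(should_block("Mozilla/4.0 (compatible; MSIE 7.0)", "198.51.100.7"))
```

In practice the same checks usually live in the web server or firewall
configuration rather than in application code; the logic is the same.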

If it sends a user agent of "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT
5.2; .NET CLR 1.1.4322)", ignores robots.txt, crawls a page every 15 seconds
or so, and switches IP addresses after a short while using anonymous
proxies (read: virus-infected computers worldwide), then no program, or
human, can tell that it is not a human surfing.
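
To see why, consider the obvious countermeasure, a per-IP rate limit. The
sketch below is only an illustration: the threshold of 10 requests per 60
seconds and the "new address every 20 requests" rotation are assumptions, not
measurements. At one page every 15 seconds the crawler never has more than 4
requests inside the window, so it is never flagged, while a naive crawler
hitting the server once a second is caught at once.

```python
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Flag an IP that makes more than max_hits requests within window seconds."""

    def __init__(self, max_hits=10, window=60.0):
        self.max_hits = max_hits  # assumed threshold
        self.window = window      # assumed window length, in seconds
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def is_suspicious(self, ip, now):
        recent = self.hits[ip]
        recent.append(now)
        # Drop timestamps that have fallen out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        return len(recent) > self.max_hits


def simulate(interval, requests, ips):
    """Replay a crawl and return True if any request was flagged."""
    limiter = PerIpRateLimiter()
    flagged = False
    for i in range(requests):
        ip = ips[(i // 20) % len(ips)]  # switch address every 20 requests
        if limiter.is_suspicious(ip, now=i * interval):
            flagged = True
    return flagged


if __name__ == "__main__":
    proxies = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]
    print(simulate(15.0, 200, proxies))      # slow crawler behind proxies
    print(simulate(1.0, 200, ["192.0.2.1"]))  # naive fast crawler
```

Tightening the threshold enough to catch the slow crawler would also catch
ordinary visitors, which is the whole problem.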

--

   Tzafrir Rehan.

On Feb 10, 2008 8:47 AM, Shahar Dag <dag at cs.technion.ac.il> wrote:

> Hi
>
> OK, this sounds interesting, but what about the other side?
> How can a webmaster block all those crawlers?
> (I prefer an answer by mail since I can't come to the lecture.)
>
> Thanks
> Shahar Dag
>
> _____________________________________________________________________________________________
> I am looking for old vinyl records.
> If you have any that you don't need, please mail me.
>
> Thanks
> Shahar
>
> ----- Original Message -----
> From: "Eli Billauer" <eli at billauer.co.il>
> To: "Haifa linux club" <haifux at haifux.org>
> Cc: "linux-il" <linux-il at cs.huji.ac.il>
> Sent: Saturday, February 09, 2008 3:15 PM
> Subject: [Haifux Meeting] Crawling in Lightning
>
>
> > Next Monday, 11th of February, at 18:30, the Haifa Linux Club will gather
> > for a lightning talk session
> >
> >             Crawling in Lightning
> >
> > Abstract
> >
> > This is a show-me-the-source meeting, during which several one-liners and
> > scripts will be presented. The core subject is methods for interacting
> > with HTTP web servers ("faking Firefox") in order to fetch information,
> > vote automatically in polls etc.
> >
> > This meeting consists of several short talks, by several speakers (*). The
> > agenda is as follows, 5-10 minutes per item (subject to change):
> >
> > * A very short introduction to HTTP (mainly showing a typical session
> > transcript)
> > * GET
> > * wget
> > * curl
> > * A script in Python with exception handling
> > * A short script in Python for fetching mp3's
> > * Perl script to rip image galleries (LWP) with cookie handling for login
> > * A Ruby script
> > * Perl: Using the POST method to vote automatically
> > * A Perl/Tk GUI script helping in developing crawlers
> >
> > (*) It turned out that there is more interest than experience in the field
> > among Haifuxers. As a result, more than one of the items above will be
> > delivered by yours truly.
> >
> > ======================================================
> >
> > We meet in Taub building, room 6. For location information see:
> > http://www.haifux.org/where.html
> >
> > Attendance is free, and you are all invited!
> >
> > ======================================================
> >
> > Future Lectures:
> >
> > Tapping into the Fountain of CPUs: On Operating System Support for
> > Programmable Devices, by Muli Ben-Yehuda, 25/2/2008
> >
> > ======================================================
> >
> > We are always interested in hearing your talks and ideas. If you wish to
> > give a talk, hold a discussion, or just plan some event Haifux might be
> > interested in, please contact us at webmaster at haifux.org
> >