2:00 PM -- I was recently asked to find a robot, one that looks and acts like a human. It doesn't rip the site quickly, it doesn't send robot headers, and it doesn't respect or even look at robots.txt, yet it was costing this company hundreds of thousands of dollars in lost revenue due to content theft. My job was to seek and destroy it.
Fortunately, in this case it was trivial to find the robot: its IP addresses didn't rotate, and the software it was running was off the shelf, so I could test against it directly. But sometimes it's just not that easy.
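When the tell is a non-rotating IP, plain log triage gets you most of the way. A minimal sketch of that idea, assuming common-log-format lines where the client IP is the first field (the threshold here is arbitrary and would need tuning per site):

```python
from collections import Counter

def flag_suspect_ips(log_lines, threshold=1000):
    """Count requests per client IP and return any address that
    exceeds the threshold -- a crude way to spot a scraper that
    never rotates its source address."""
    counts = Counter(
        line.split()[0] for line in log_lines if line.strip()
    )
    return {ip: n for ip, n in counts.items() if n >= threshold}
```

This obviously fails against a bot behind a rotating proxy pool, which is exactly why the harder cases below exist.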
There are only a few defenses to a spider like that:
- Monitor changes to the document object model
- Modify the page and look for erroneous clicks using something like Click Density
- Request user input based on visual changes to the page (like CAPTCHAs or other Turing tests)
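The second idea, watching for interaction a human would never produce, can also run server side as a honeypot: link a page from everywhere but hide the link from humans with CSS, and anything that requests it is walking the DOM blindly. A minimal sketch, where the trap path and handler are hypothetical names of my own, not part of any product mentioned above:

```python
# Hypothetical honeypot: /trap.html is linked from every page but
# hidden from humans via CSS, so only a robot following every link
# will ever request it.
HONEYPOT_PATH = "/trap.html"
flagged_ips = set()

def handle_request(client_ip, path):
    """Return an HTTP status: 403 for the trap itself and for any
    client that has previously touched it, 200 otherwise."""
    if path == HONEYPOT_PATH:
        flagged_ips.add(client_ip)  # only robots find the trap
        return 403
    return 403 if client_ip in flagged_ips else 200
```

A careful bot author can blacklist the trap URL once discovered, so in practice you'd rotate the path, which is part of why none of these defenses is free.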
All of these defenses can slow down the user and make the page much more complex to build. The age of the robots is here, and they are looking more and more like us every day.