Currently Nutch has an AdaptiveFetchSchedule, which sets the fetch time according to whether a page has been modified or not. What I want to do is to set the fetch time according ...
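For context, the adaptive idea mentioned above (shrink the re-fetch interval when a page has changed, grow it when it has not) can be sketched roughly as follows. This is a minimal standalone illustration, not Nutch's actual class: the name AdaptiveIntervalSketch, the method nextInterval, and all constants are assumptions chosen for the example and are not read from any Nutch configuration.

```java
// Hypothetical, simplified sketch of an adaptive re-fetch interval.
// Not the Nutch implementation; names and constants are illustrative assumptions.
public class AdaptiveIntervalSketch {
    // Assumed rates: grow by 40% when unchanged, shrink by 20% when changed.
    static final float INC_RATE = 0.4f;
    static final float DEC_RATE = 0.2f;
    // Assumed bounds, in seconds: one minute to 365 days.
    static final float MIN_INTERVAL = 60f;
    static final float MAX_INTERVAL = 365f * 24 * 3600;

    /** Returns the next fetch interval in seconds, clamped to the bounds. */
    static float nextInterval(float interval, boolean modified) {
        float next = modified
                ? interval * (1.0f - DEC_RATE)   // page changed: re-fetch sooner
                : interval * (1.0f + INC_RATE);  // page unchanged: re-fetch later
        return Math.max(MIN_INTERVAL, Math.min(MAX_INTERVAL, next));
    }

    public static void main(String[] args) {
        float thirtyDays = 30f * 24 * 3600;
        System.out.println("modified:   " + nextInterval(thirtyDays, true));
        System.out.println("unmodified: " + nextInterval(thirtyDays, false));
    }
}
```

A custom schedule that keys on something other than modification status would replace the condition in nextInterval with its own rule while keeping the clamping step.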
In my crawler system, I have set the fetch interval to 30 days. I initially set my user agent to, say, "....", and many URLs were rejected. But after changing ...
I'm using Nutch 1.2. When I run the crawl command like so:
bin/nutch crawl urls -dir crawl -depth 2 -topN 1000
Injector: starting at 2011-07-11 12:18:37
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected ...
I injected some URLs into the crawl db in Nutch 1.3, but Nutch doesn't fetch as many URLs from each site as the -topN value. How can I make it do that?