What is a good crawler (spider) to use against HTML and XML documents (local or web-based) that also works well in the Lucene / Solr solution space? It could be Java-based but ...
I would like to implement a search engine that crawls a set of web sites, extracts specific information from the pages, and creates a full-text index of that information.
It seems ...
I'm going to download several thousand web pages (for future language-processing purposes). Now I'm wondering which metadata I should save. I am exploring this, but I do not want to neglect ...
I would like to know the best way to design a notification system for website updates:
For example, consider this use case:
Let's suppose you have a site like craiglist.com and any time a new posting ...