I'm blogging this so that anyone (mainly me, the guy with a very short-term memory) can install Nutch and Solr and get them working in the shortest time and with the least fuss.
I'm using Nutch 1.7 and Solr 4.4.
Make sure you have installed Java and have set JAVA_HOME.
I am using Oracle Java on Ubuntu server.
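A quick way to check both prerequisites before going further. This is only a sketch; the function name `check_java` is made up for this post, and it only reports what it finds without changing anything.

```shell
# Report whether java is on the PATH and whether JAVA_HOME points at a Java install.
check_java() {
  if command -v java >/dev/null 2>&1; then
    echo "java found: $(command -v java)"
  else
    echo "java not found on PATH"
  fi
  if [ -n "${JAVA_HOME:-}" ] && [ -x "$JAVA_HOME/bin/java" ]; then
    echo "JAVA_HOME ok: $JAVA_HOME"
  else
    echo "JAVA_HOME is unset or does not contain bin/java"
  fi
}

check_java
```

If the second line complains, set JAVA_HOME in your shell profile before running Nutch.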
1.
Download binaries:
My nearby mirrors:
http://mirror.nus.edu.sg/apache/nutch/1.7/apache-nutch-1.7-bin.tar.gz
http://mirror.nus.edu.sg/apache/lucene/solr/4.4.0/solr-4.4.0.tgz
2.
Unpack Nutch and Solr so that you will have
~/apache-nutch-1.7
~/solr-4.4.0
Copy the Solr 4 schema from Nutch into the Solr directory:
cp apache-nutch-1.7/conf/schema-solr4.xml ~/solr-4.4.0/example/solr/collection1/conf/schema.xml
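Steps 1 and 2 can be strung together roughly as follows. This is a sketch under a few assumptions: `wget` and `tar` are installed, everything goes into your home directory, and the mirror from this post still carries these versions (swap in your own via `MIRROR`). The `run` helper and the `DRY_RUN` switch are my own additions; by default the script only prints what it would do.

```shell
# Dry run by default: commands are printed, not executed. Set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"
# Mirror taken from the post; replace with one near you.
MIRROR="${MIRROR:-http://mirror.nus.edu.sg/apache}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Body runs in a subshell so the cd does not leak into the calling shell.
fetch_and_unpack() (
  cd "$HOME" || exit 1
  run wget -c "$MIRROR/nutch/1.7/apache-nutch-1.7-bin.tar.gz"
  run wget -c "$MIRROR/lucene/solr/4.4.0/solr-4.4.0.tgz"
  run tar -xzf apache-nutch-1.7-bin.tar.gz
  run tar -xzf solr-4.4.0.tgz
  run cp apache-nutch-1.7/conf/schema-solr4.xml \
         solr-4.4.0/example/solr/collection1/conf/schema.xml
)

fetch_and_unpack
```

Once the printed commands look right, rerun with `DRY_RUN=0` to actually download and unpack.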
3.
Edit the schema.xml that you have just copied to the Solr directory.
vi ~/solr-4.4.0/example/solr/collection1/conf/schema.xml
Add additional line in the name field in
Such that it will look like
4.
Start Solr.
cd ~/solr-4.4.0/example
java -jar start.jar
Check if you can load the administrative page:
http://
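If you are on a headless server, the same check can be done from the command line. A minimal sketch, assuming `curl` is installed and that Solr answers under the same `http://localhost:8983/solr` base URL used in the crawl command later in this post; `solr_http_code` is just a helper name I made up.

```shell
# Base URL is an assumption taken from the crawl command; override if yours differs.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

# Print the HTTP status code for a URL, or 000 if nothing answers.
solr_http_code() {
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1" 2>/dev/null) || code=000
  echo "${code:-000}"
}

code=$(solr_http_code "$SOLR_URL/")
case "$code" in
  2??|3??) echo "Solr answered with HTTP $code" ;;
  *)       echo "Solr did not answer (HTTP $code); it may still be starting" ;;
esac
```

Solr can take a few seconds to come up, so rerun the check if the first attempt fails.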
5. Start crawling.
Create a text file with one or more URLs, one per line, at ~/apache-nutch-1.7/urls/seed.txt
Or use the DMOZ example in the Nutch documentation.
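Creating the seed file from the shell might look like this. It is a sketch: `write_seeds` is a helper name I made up, the seed URL is only a placeholder for whatever you want to crawl, and the directory is `urls` to match the path in the crawl command.

```shell
NUTCH_HOME="${NUTCH_HOME:-$HOME/apache-nutch-1.7}"

# write_seeds DIR URL...  writes one URL per line to DIR/seed.txt
write_seeds() {
  dir=$1
  shift
  mkdir -p "$dir" && printf '%s\n' "$@" > "$dir/seed.txt"
}

# Placeholder seed; list as many URLs as you like after the directory.
write_seeds "$NUTCH_HOME/urls" "http://nutch.apache.org/"
cat "$NUTCH_HOME/urls/seed.txt"
```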
6. Execute (from ~/apache-nutch-1.7):
bin/nutch crawl urls/seed.txt -solr http://localhost:8983/solr -depth 3 -topN 50
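To see whether anything actually landed in Solr after the crawl, you can ask the index for a document count. A rough sketch, assuming `curl` is installed, the core is the `collection1` one the schema was copied into, and Solr returns its usual compact JSON; `extract_num_found` is a helper name of my own.

```shell
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

# Pull the numFound value out of a Solr JSON response read from stdin.
extract_num_found() {
  sed -n 's/.*"numFound":\([0-9][0-9]*\).*/\1/p'
}

# Match-all query with zero rows: only the total count is of interest.
count=$(curl -s --max-time 10 "$SOLR_URL/collection1/select?q=*:*&rows=0&wt=json" | extract_num_found)
echo "Documents indexed: ${count:-unknown (is Solr running?)}"
```

A count of zero after a finished crawl usually means the indexing step failed, so check the Nutch and Solr logs.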