Sunday, September 29, 2013

Setting up Nutch (1.7) and Solr (4.4) - Quick Start

I'm blogging this so that anyone (mainly me, the guy with a very short-term memory) can install Nutch and Solr and get them working with the least amount of time and fuss.

I'm using Nutch 1.7 and Solr 4.4.

Make sure you have installed Java and have set JAVA_HOME.
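A quick way to confirm that before going further is a small shell check. This is only a sketch; the function name check_java_home is my own, not something Nutch or Solr provides.

```shell
# Sketch: confirm JAVA_HOME is set and points at a directory with bin/java.
# check_java_home is a hypothetical helper name used for illustration.
check_java_home() {
    if [ -z "${JAVA_HOME:-}" ]; then
        echo "JAVA_HOME is not set"
        return 1
    fi
    if [ ! -x "$JAVA_HOME/bin/java" ]; then
        echo "No java executable under $JAVA_HOME/bin"
        return 1
    fi
    echo "JAVA_HOME looks OK: $JAVA_HOME"
}

check_java_home || echo "Fix JAVA_HOME before continuing"
```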

I am using Oracle Java on Ubuntu server.


Download binaries:

My nearby mirrors:

Unpack Nutch and Solr so that you will have


Copy the Solr 4 schema from Nutch to the Solr directory:

 cp apache-nutch-1.7/conf/schema-solr4.xml ~/solr-4.4.0/example/solr/collection1/conf/schema.xml

Edit the schema.xml that you have just copied to the Solr directory.

vi ~/solr-4.4.0/example/solr/collection1/conf/schema.xml

Add the following line alongside the other field definitions:

  <field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>

Such that it will look like

Start Solr.

cd  ~/solr-4.4.0/example

java -jar start.jar

Check if you can load the administrative page:
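If you are on a headless server, the same check can be done from the command line. This is a rough sketch: it assumes Solr's bundled Jetty is listening on port 8983 (the same URL used in the crawl command further down), and solr_status is just a helper name I made up.

```shell
# Assumption: Solr answers at this URL; override SOLR_URL if yours differs.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/}"

# Hypothetical helper: prints the HTTP status code for a URL.
# curl prints 000 when nothing answers on the other end.
solr_status() {
    curl -s -L -o /dev/null -w '%{http_code}' --max-time 5 "$1" || true
}

code=$(solr_status "$SOLR_URL")
if [ "$code" = "200" ]; then
    echo "Solr is answering at $SOLR_URL"
else
    echo "Solr not reachable at $SOLR_URL (status: $code)"
fi
```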


5. Start crawling.

Create a text file containing one or more URLs, one per line, at ~/apache-nutch-1.7/urls/seed.txt

Or use the DMOZ example in the Nutch documentation.
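For the first option, something like the following would create the seed file. It is a sketch that assumes Nutch was unpacked to ~/apache-nutch-1.7 and uses a urls directory to match the crawl command; write_seeds is a made-up helper name and the seed URL is only a placeholder.

```shell
# Assumption: Nutch lives here; override NUTCH_HOME if you unpacked elsewhere.
NUTCH_HOME="${NUTCH_HOME:-$HOME/apache-nutch-1.7}"

# Hypothetical helper: write_seeds <nutch-dir> <url> [<url> ...]
# Creates <nutch-dir>/urls/seed.txt with one URL per line.
write_seeds() {
    dir=$1
    shift
    mkdir -p "$dir/urls"
    printf '%s\n' "$@" > "$dir/urls/seed.txt"
}

# Placeholder seed; replace it with the sites you actually want to crawl.
write_seeds "$NUTCH_HOME" "http://nutch.apache.org/"
cat "$NUTCH_HOME/urls/seed.txt"
```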

6. Execute from the ~/apache-nutch-1.7 directory:

bin/nutch crawl urls/seed.txt -solr http://localhost:8983/solr -depth 3 -topN 50
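Once the crawl finishes, one way to see whether anything reached the index is to ask Solr for a document count. This is a sketch under a few assumptions: the default collection1 core from the Solr example, the same base URL as in the crawl command, and a helper name (num_found) that I invented.

```shell
# Assumption: same Solr base URL as passed to bin/nutch above.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

# Hypothetical helper: pulls the numFound value out of a Solr JSON response
# read from standard input.
num_found() {
    sed -n 's/.*"numFound": *\([0-9][0-9]*\).*/\1/p'
}

# Match every document but return no rows; only the count is of interest.
# Prints nothing if Solr is not running.
curl -s --max-time 5 "$SOLR_URL/collection1/select?q=*:*&rows=0&wt=json" | num_found
```

A count of 0 after a crawl usually means the indexing step never reached Solr, so the crawl output is the next place to look.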

Saturday, September 7, 2013

PhantomJS and Ubuntu Server

Downloaded PhantomJS on a new LAMP install in an Ubuntu 12.04 VM.

Encountered an error:
"error while loading shared libraries :

It won't run without an additional package:

apt-get install libfontconfig

Problem solved.
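For next time, ldd can list which shared libraries a binary cannot find before you try to run it. A minimal sketch, assuming the phantomjs binary sits in the current directory; missing_libs is a helper name of my own.

```shell
# Hypothetical helper: prints the shared libraries that the dynamic linker
# cannot resolve for the given binary (one per line, empty if none).
missing_libs() {
    ldd "$1" 2>/dev/null | awk '/not found/ {print $1}'
}

# Assumption: the downloaded binary is at ./phantomjs; adjust the path.
missing_libs ./phantomjs
```

Each name it prints is a library you still need to install a package for.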