Behemoth: September 2011

Monday, September 19, 2011

Compound primary key for Solr's Data Import Handler (DIH)

If you ever find yourself with a data source where you need to concatenate multiple columns or values to form the primary or unique key, and that hasn't already been done for you, you can do so on the fly with Solr's DIH using the TemplateTransformer.
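
The config snippet for this post is not preserved in this copy. As a rough illustration of what a template-based key amounts to, the Python sketch below mimics how a template such as `${item.vendor}-${item.sku}` could be resolved against one row. The entity name `item`, the columns `vendor` and `sku`, and the helper `apply_template` are all hypothetical; this is not DIH code, only a model of the concatenation.

```python
import re

# Matches placeholders of the form ${entity.column}.
PLACEHOLDER = re.compile(r"\$\{(\w+)\.(\w+)\}")


def apply_template(template, entity_name, row):
    """Resolve ${entity.column} placeholders against a single row.

    Placeholders that name another entity or a missing column become an
    empty string in this sketch; real DIH behaviour may differ.
    """
    def resolve(match):
        entity, column = match.group(1), match.group(2)
        if entity != entity_name:
            return ""
        return str(row.get(column, ""))

    return PLACEHOLDER.sub(resolve, template)


# Hypothetical row with two columns that together identify a record.
row = {"vendor": "acme", "sku": 1042, "title": "Widget"}
row["id"] = apply_template("${item.vendor}-${item.sku}", "item", row)
print(row["id"])  # acme-1042
```

In DIH itself the same idea would be expressed declaratively, by enabling the TemplateTransformer on the entity and giving the key field a template attribute.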

Wednesday, September 14, 2011

Import data from Amazon RSS feeds into Solr

For this example, let's use the RSS feed for new products that have been tagged as blu-ray. Here's the URL:
- http://www.amazon.com/rss/tag/blu-ray/new/ref=tag_rsh_hl_ersn
- If you are using Firefox, you may not get to see the feed in its raw XML format, because the browser tends to interpret the RSS feed and present it in a user-friendly fashion. That is not very helpful for developers or administrators. To view the raw content of the feed, you can open the same URL via Feed-Proxy: http://persistent.info/cgi-bin/feed-proxy?url=http%3A%2F%2Fwww.amazon.com%2Frss%2Ftag%2Fblu-ray%2Fnew%2Fref%3Dtag_rsh_hl_ersn
- Refer to the following links for info on how to get the RSS feeds you want from Amazon:
  - http://www.amazon.com/gp/help/customer/display.html?nodeId=200202840
  - http://www.amazon.com/gp/tagging/rss-help.html
Before we import any data into Solr, let's take a moment to understand the format of the data from the RSS feed. Here's a sample item from the RSS feed:
1. There are 5 basic fields per item: title, guid, link, pubDate and description.
2. If we take a closer look at the HTML chunk inside description, we will find more information like:
  - the image URL
  - the price for a new item
```
Buy new: $10.99
```
  - the price of used items starting from a lower bound:
```
  22 used and new
  from $6.35
```
  - In fact, the HTML chunk is like a complete webpage; it even has a separate description section inside of itself!
```
...
```
  - All this is rather messy and unpredictable (sometimes it's there, sometimes it's not, sometimes it's a class, sometimes it's an id, sometimes there are duplicates) but we must try to make the best of it.
Having understood the complexities of our source of data, we are now ready to configure the Data Import Handler (DIH) for Solr.
1. Navigate to the directory which has the out-of-the-box sample core for configuring RSS feeds, and edit the rss-data-config.xml file for importing data to look as follows:
```
cd /trunk/solr/example/example-DIH/solr/rss
vi conf/rss-data-config.xml
```
```
 
 
  
   
   
   
  
 
```
In order to grab and add the price from the description:
1. add the price as a field to the schema.xml file
```
cd /trunk/solr/example/example-DIH/solr/rss
vi conf/schema.xml
```
2. add the RegexTransformer to the chain of transformers in the rss-data-config.xml file
```
cd /trunk/solr/example/example-DIH/solr/rss
vi conf/rss-data-config.xml
```
```
   ...
```
3. add the regex to find and extract the price
```
cd /trunk/solr/example/example-DIH/solr/rss
vi conf/rss-data-config.xml
```
```
   ...
   
   
```
  Please keep in mind that this is NOT the best regex that you can use. This one simply grabs the last set of digits with a dollar sign in front of them in the HTML, so it may grab the used price instead of the new price! Feel free to come up with a better regex for yourself.
4. If there are any items where the regex is incapable of pulling the price due to malformed HTML or any other reason, you can either skip those items and not add them to Solr, or introduce your own dummy price for them. Here's how you may do so:
  - add the use of the ScriptTransformer function you'll define into the chain of transformers in the rss-data-config.xml file
```
cd /trunk/solr/example/example-DIH/solr/rss
vi conf/rss-data-config.xml
```
```
   ...
```
  - add the following script to the rss-data-config.xml file if you want to skip the items where a price wasn't available or couldn't be deduced
```
cd /trunk/solr/example/example-DIH/solr/rss
vi conf/rss-data-config.xml
```
```
...
```
  - add the following script to the rss-data-config.xml file if you simply want to inject a dummy price
```
cd /trunk/solr/example/example-DIH/solr/rss
vi conf/rss-data-config.xml
```
```
...
```
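
The two options in step 4 boil down to a small decision per row. The Python sketch below only mirrors that logic for illustration; it is not the ScriptTransformer script from the post, which is not preserved in this copy. The row layout, the function name `handle_missing_price`, and the dummy value `0.00` are assumptions.

```python
DUMMY_PRICE = "0.00"  # assumed placeholder value; pick whatever suits your index


def handle_missing_price(row, skip_missing=True):
    """Return the row to index, or None when the row should be skipped.

    row is a dict of column name -> value as left by earlier transformers;
    a missing or empty "price" means the regex found nothing.
    """
    if row.get("price"):
        return row  # the regex found a price, nothing to do
    if skip_missing:
        return None  # option 1: leave the item out of the index
    patched = dict(row)  # option 2: copy the row and inject a dummy price
    patched["price"] = DUMMY_PRICE
    return patched


rows = [
    {"title": "Item with price", "price": "10.99"},
    {"title": "Item without price"},
]

skipped = [r for r in rows if handle_missing_price(r) is not None]
patched = [handle_missing_price(r, skip_missing=False) for r in rows]
print(len(skipped))         # 1
print(patched[1]["price"])  # 0.00
```

Skipping keeps the index free of made-up values, while a dummy price keeps every item searchable at the cost of polluting price-based sorting and faceting.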
The following can be added to pull out the image URL:
```
   ...
   
   
```
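
The regex used for the image URL is not preserved in this copy. As an illustration of the idea only, the Python sketch below pulls the `src` of the first `<img>` tag out of an HTML chunk. The sample HTML, the pattern, and the helper `first_image_url` are assumptions, not the post's original values.

```python
import re

# Capture the src attribute of the first <img> tag (double- or single-quoted).
IMG_SRC = re.compile(r"""<img[^>]*?\ssrc=["']([^"']+)["']""", re.IGNORECASE)


def first_image_url(html):
    """Return the first image URL in html, or None if there is no <img> tag."""
    match = IMG_SRC.search(html)
    return match.group(1) if match else None


# Hypothetical description chunk, loosely shaped like the feed's HTML.
description = (
    '<div><a href="http://example.com/item">'
    '<img alt="cover" src="http://example.com/images/cover.jpg" /></a>'
    '<span>Buy new: $10.99</span></div>'
)

print(first_image_url(description))        # http://example.com/images/cover.jpg
print(first_image_url("<p>no image</p>"))  # None
```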
The following can be added to pull out the date (SimpleDateFormat was used as a reference to devise a format string to parse the incoming dates from the feed such as Mon, 12 Sep 2011 21:14:23 GMT):
```
   ...
   
   
```
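
To sanity-check the date handling outside of Solr, the sketch below parses the sample date quoted above and prints it in the UTC form that Solr date fields expect. It uses Python's standard library rather than SimpleDateFormat, so it is a stand-in for the DIH config, not a copy of it; the function name `rss_date_to_solr` is hypothetical. In SimpleDateFormat terms, a pattern along the lines of `EEE, dd MMM yyyy HH:mm:ss z` would be the likely counterpart, though the post's exact format string is not preserved here.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def rss_date_to_solr(pub_date):
    """Convert an RFC 822 style RSS date into Solr's UTC date format."""
    parsed = parsedate_to_datetime(pub_date)
    if parsed.tzinfo is None:
        # No usable zone in the input: treat it as UTC (an assumption of this sketch).
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


print(rss_date_to_solr("Mon, 12 Sep 2011 21:14:23 GMT"))  # 2011-09-12T21:14:23Z
```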

Start the Solr server:

cd /trunk/solr/example
java -Dsolr.solr.home="./example-DIH/solr/" -jar start.jar

Navigate to the following URL

http://localhost:8983/solr/rss/admin/dataimport.jsp?handler=/dataimport

Click the Full Import with Cleaning button to import data from the Amazon RSS feed into Solr.

Junk / Errata

url="http://www.amazon.com/rss/tag/blu-ray/new/ref=tag_rsh_hl_ersn"
regex=".*Buy new.*span.class..tgProductPrice...(\d*.\d*)"
stripHTML="true"
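
The regex fragment above can be compared against the "last dollar amount" behaviour that the post warns about. The Python sketch below runs both against a made-up description chunk. The sample HTML and the greedy pattern `.*\$(\d*.\d*)` are assumptions written to reproduce the described behaviour; only the second pattern is taken from the errata. Python's `re` module is used here, and Java's regex engine (which DIH uses) should treat these simple patterns the same way.

```python
import re

# Hypothetical description chunk: a "new" price followed by a lower "used" price.
html = (
    '<span>Buy new: <span class="tgProductPrice">$10.99</span></span>'
    '<span>22 used and new from <span class="price">$6.35</span></span>'
)

# Greedy approach described in the post: the leading .* swallows as much as it
# can, so the capture group ends up on the LAST dollar amount in the chunk.
last_price = re.match(r".*\$(\d*.\d*)", html, re.DOTALL)

# Pattern from the errata, anchored on "Buy new" and the tgProductPrice class.
new_price = re.match(
    r".*Buy new.*span.class..tgProductPrice...(\d*.\d*)", html, re.DOTALL
)

print(last_price.group(1))  # 6.35
print(new_price.group(1))   # 10.99
```

Note that both patterns leave the decimal point as an unescaped `.`, so they would also accept any other character between the digit groups; escaping it as `\.` would be stricter.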

Saturday, September 10, 2011

Multicore master-slave replication in Solr Cloud

If you've already done some work with Solr Cloud then you may want to start fresh by cleaning up any previous ZooKeeper configuration data in order to run this example exercise smoothly.
```
cd /trunk/solr/example/solr
rm -rf zoo_data
```
We will create the following setup:
1. there will be 2 Solr instances, each with 3 cores
2. 1 of the 3 cores will be a master and the other 2 will be slaves
3. the slaves of one instance will be configured to use the master of the other one
4. The infrastructure will look like:
  - Solr-Instance-A
    - master1 (indexes changes for shard1)
    - slave1-master2 (replicates changes from shard2)
    - slave2-master2 (replicates changes from shard2)
  - Solr-Instance-B
    - master2 (indexes changes for shard2)
    - slave1-master1 (replicates changes from shard1)
    - slave2-master1 (replicates changes from shard1)
We can reuse the multicore directory of the out-of-the-box example. It already has the core0 and core1 directories, so let's create an additional core:
```
cd /trunk/solr/example/multicore
cp -r core0 core2
```
If we were NOT using Solr Cloud, which has us upload a universal configuration at startup, then we would perform the following sub-steps:
- ~~Replace any mention of core0 with core2~~
```
sed -ibak 's/core0/core2/g' core2/conf/solrconfig.xml
sed -ibak 's/core0/core2/g' core2/conf/schema.xml
sed -ibak 's/zero/two/g' core2/conf/schema.xml
```
But right now these are pointless because each individual core's configuration will not be used ... instead the configuration in ZooKeeper will be used.
Edit solr.xml as follows:
```
    
    
    
```
Copy the example's zoo.cfg from the single-solr setup over to the multicore setup:
```
cd /trunk/solr/example
cp ./solr/zoo.cfg ./multicore/
```
Copy example to example2 in order to create another Solr instance:
```
cd /trunk/solr
cp -r example example2
```

Edit solr.xml for example2 as follows:

cd /trunk/solr/example2/multicore
vi solr.xml

So where is the configuration that we will be uploading to ZooKeeper? And what should we edit? Well, the most well-formed configuration is sitting in the out-of-the-box single core example, so let us simply upload it from there to ZooKeeper, and have it configured such that it can be applied conditionally to all our cores based on the java params that we specify at startup!

Let us begin by editing the solrconfig.xml file of the single solr core example as follows:

cd /trunk/solr/example/solr/conf
vi solrconfig.xml

       
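<requestHandler name="/replication" class="solr.ReplicationHandler" >
       <lst name="master">
         <str name="enable">${enable.master:false}</str>
         <str name="replicateAfter">commit</str>
         <str name="replicateAfter">startup</str>
         <str name="confFiles">schema.xml,stopwords.txt</str>
       </lst>
       <lst name="slave">
         <str name="enable">${enable.slave:false}</str>
         <str name="masterUrl">http://${masterHost:localhost}:${masterPort:8983}/solr/${masterCoreName:master1}/replication</str>
         <str name="pollInterval">00:00:60</str>
       </lst>
</requestHandler>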

We cannot pass the true/false values via -Denable.master or -Denable.slave at startup because it will end up applying globally to all the cores (1 master & 2 slaves) and there isn't a way to start only one core at a time from the command line. So we must leverage each individual multicore's solr.xml to provide core specific values as follows:
```
cd /trunk/solr/example/multicore
vi solr.xml
  
    
      
    
    
      
    
    
      
    
  


cd /trunk/solr/example2/multicore
vi solr.xml
  
    
      
    
    
      
    
    
      
    
  
```

Now let us start the 1st instance of the multicore Solr with the appropriate java params and let ZooKeeper know exactly where to get its universal-config (bootstrap_confdir) from:

cd /trunk/solr/example
#java -Dbootstrap_confdir=./solr/conf \
#     -Dsolr.solr.home=multicore \
#     -DmasterHost=localhost -DmasterPort=7574 -DmasterCoreName=master2 \
#     -DzkRun \
#     -jar start.jar
java -Dbootstrap_confdir=./solr/conf -Dsolr.solr.home=multicore -DmasterHost=localhost -DmasterPort=7574 -DmasterCoreName=master2 -DzkRun -jar start.jar

Start the 2nd instance of the multicore Solr with the appropriate java params:

cd /trunk/solr/example2
#java -Djetty.port=7574 \
#     -DhostPort=7574 \
#     -Dsolr.solr.home=multicore \
#     -DmasterHost=localhost -DmasterPort=8983 -DmasterCoreName=master1 \
#     -DzkHost=localhost:9983 \
#     -jar start.jar
java -Djetty.port=7574 -DhostPort=7574 -Dsolr.solr.home=multicore -DmasterHost=localhost -DmasterPort=8983 -DmasterCoreName=master1 -DzkHost=localhost:9983 -jar start.jar
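
To see which master each instance's slaves end up polling, it helps to resolve the `${name:default}` placeholders in the masterUrl by hand. The Python sketch below imitates that substitution for the two sets of `-D` flags used above. The `resolve` helper is hypothetical and only models the "use the property if set, else the default" rule; it is not Solr's own implementation.

```python
import re

# Matches ${name:default} placeholders as they appear in solrconfig.xml.
PLACEHOLDER = re.compile(r"\$\{([^:}]+):([^}]*)\}")

MASTER_URL = (
    "http://${masterHost:localhost}:${masterPort:8983}"
    "/solr/${masterCoreName:master1}/replication"
)


def resolve(template, props):
    """Replace each ${name:default} with props[name], or the default if unset."""
    return PLACEHOLDER.sub(lambda m: props.get(m.group(1), m.group(2)), template)


# -D flags passed to each instance in the two startup commands.
instance_a = {"masterHost": "localhost", "masterPort": "7574", "masterCoreName": "master2"}
instance_b = {"masterHost": "localhost", "masterPort": "8983", "masterCoreName": "master1"}

print(resolve(MASTER_URL, instance_a))  # http://localhost:7574/solr/master2/replication
print(resolve(MASTER_URL, instance_b))  # http://localhost:8983/solr/master1/replication
print(resolve(MASTER_URL, {}))          # http://localhost:8983/solr/master1/replication
```

The first two lines match the intended topology: the slaves on the first instance pull from master2 on port 7574, and the slaves on the second instance pull from master1 on port 8983.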

Now, you can check the ZooKeeper status here:

http://localhost:8983/solr/master1/admin/zookeeper.jsp

And that's all there is to it. Feel free to leave any feedback as comments below.

Friday, September 9, 2011

My Solr Cloud Wishlist

Add code in Solr such that the admin may configure a limit on how many documents are too many to hold in a single Solr core, and kick off an automated process to:
1. Either, CREATE another core (same or separate machine?) and add it to the ZooKeeper configuration with a weight that signifies that all new additions should happen to this new core's index only. Though I wonder how an update (delete+add) would work?
2. Or, begin sharding the existing core. This could be done by CREATE-ing a copy of it (core_copy) and distributing the index into two shards (core_shard1,core_shard2) using a scheme/policy that does so in a best-effort manner such that the scoring wouldn't get thrown off by too much due to each individual shard's differing IDF. Then SWAP the two sharded-cores in as replacement for the overloaded core. What would happen to any changes made during this process?

Setup Solr master-slave replication with ZooKeeper

Reading Chapter 9: Scaling Solr from the book Solr 1.4 Enterprise Search Server before jumping into the world of Solr Cloud is essential for anyone who wants to understand what the embedded ZooKeeper can or cannot do. This is because you have to know how to configure all the nuts & bolts in Solr manually before you can gain a natural understanding of what the automation does and does not take care of for you.

If you go through the basic exercises for Solr Cloud, then you will come across Example B: Simple two shard cluster with shard replicas. It is important to note that the wording here can be a bit misleading based on what you are looking to accomplish. It is not replication that is being set up there. Instead, that example uses "replicas" as "copies", to demonstrate high search availability.

Here are the tried & tested steps for replication with a master-slave setup that will fit in with a ZooKeeper-managed Solr Cloud:

If you've already done some work with Solr Cloud then you may want to start fresh by cleaning up any previous ZooKeeper configuration data in order to run this example exercise smoothly.
```
cd /trunk/solr/example/solr
rm -rf zoo_data
```
Collection is a ZooKeeper-oriented term for a group of Solr cores that share the same schema; it has nothing to do with the name of a Solr core itself. Let's keep this fact plain to see by editing the solr.xml file and providing an appropriate name for the core & collection:
```
<cores adminPath="/admin/cores" defaultCoreName="master1">
 <core name="master1" instanceDir="." shard="shard1" collection="collection1"></core>
</cores>
```
Navigate to the configuration directory for the example in the trunk & begin editing solrconfig.xml using your preferred text-editor:
```
cd /trunk/solr/example/solr/conf
vi solrconfig.xml
```

Uncomment and edit the replication requestHandler to be as follows:

<requestHandler name="/replication" class="solr.ReplicationHandler" >
       <lst name="master">
         <str name="enable">${enable.master:false}</str>
         <str name="replicateAfter">commit</str>
         <str name="replicateAfter">startup</str>
         <str name="confFiles">schema.xml,stopwords.txt</str>
       </lst>
       <lst name="slave">
         <str name="enable">${enable.slave:false}</str>
         <str name="masterUrl">http://localhost:8983/solr/replication</str>
         <str name="pollInterval">00:00:60</str>
       </lst>
</requestHandler>

Navigate out of the example directory and create another copy of it:
```
cd /trunk/solr/
cp -r example example2
```
Edit the solr.xml file for the example2 directory:
1. change the name of the core to indicate that it is a slave
2. leave the name of the shard as-is to indicate which shard it is a replica of
3. leave the name of the collection as-is because this slave core should join the same collection as its master in ZooKeeper config
```
cd /trunk/solr/example2/solr
vi solr.xml
```
```
<cores adminPath="/admin/cores" defaultCoreName="slave1">
 <core name="slave1" instanceDir="." shard="shard1" collection="collection1"></core>
</cores>
```

Start the master core. The use of java params allows us to call this out as a master at startup:

cd /trunk/solr/example
java -Dbootstrap_confdir=./solr/conf -Denable.master=true -DzkRun -jar start.jar

Start the slave core. The use of java params allows us to call this out as a slave at startup:

cd /trunk/solr/example2
java -Djetty.port=7574 -DhostPort=7574 -Denable.slave=true -DzkHost=localhost:9983 -jar start.jar

After starting the slave, towards the end of the logs for the slave, you should be able to spot info to affirm that replication is working:

INFO: Updating cloud state from ZooKeeper...
Sep 9, 2011 6:20:00 PM org.apache.solr.handler.SnapPuller fetchLatestIndex
INFO: Slave in sync with master.

Sources:

http://lucene.472066.n3.nabble.com/Solr-Cloud-is-replication-really-a-feature-on-the-trunk-td3317695.html
http://lucene.472066.n3.nabble.com/Replication-setup-with-SolrCloud-Zk-td2952602.html
