Monday, September 19, 2011

Compound primary key for Solr's Data Import Handler (DIH)

If you ever find yourself with a datasource where you need to concatenate multiple columns or values to form the primary or unique key, and it hasn't already been done for you, then you can do so on-the-fly with Solr's DIH using the TemplateTransformer like so:
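For example, here is a minimal sketch assuming a SQL datasource with an entity named item, where the (illustrative) columns source and source_id together identify a row uniquely:

    <entity name="item" transformer="TemplateTransformer" query="select * from item">
      <!-- concatenate the two columns to form a single unique key -->
      <field column="id" template="${item.source}-${item.source_id}" />
    </entity>

The id field must, of course, be declared in schema.xml and marked as the uniqueKey.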

Wednesday, September 14, 2011

Import data from Amazon RSS feeds into Solr

  1. For this example, let's use the RSS feed for new products that have been tagged as blu-ray; here's the URL:
    http://www.amazon.com/rss/tag/blu-ray/new/ref=tag_rsh_hl_ersn
  2. Before we import any data into Solr, let's take a moment to understand the format of the data from the RSS feed. Here's a sample item from the RSS feed:
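    A representative item looks something like this (the values are illustrative and the description is trimmed down from the full HTML chunk):
    <item>
      <title>Some Movie [Blu-ray]</title>
      <guid isPermaLink="false">http://www.amazon.com/dp/B000EXAMPLE</guid>
      <link>http://www.amazon.com/dp/B000EXAMPLE</link>
      <pubDate>Mon, 12 Sep 2011 21:14:23 GMT</pubDate>
      <description>&lt;div&gt;... an entire webpage's worth of HTML: image tag, new price, used prices, a nested description ...&lt;/div&gt;</description>
    </item>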
    1. There are 5 basic fields per item: title, guid, link, pubDate and description.
    2. If we take a closer look at the HTML chunk inside description, we will find more information, like:
      • the image URL
      • the price for a new item, e.g.:
        Buy new: $10.99
      • the price of used items, starting from a lower bound, e.g.:
        22 used and new from $6.35
      • In fact, the HTML chunk is like a complete webpage; it even has a separate description section inside of itself!
      • All this is rather messy and unpredictable (sometimes it's there, sometimes it's not; sometimes it's a class, sometimes it's an id; sometimes there are duplicates), but we must try to make the best of it.
  3. Having understood the complexities of our source of data, we are now ready to configure the Data Import Handler (DIH) for Solr.
    1. Navigate to the directory which has the out-of-the-box sample core for configuring RSS feeds, and edit the rss-data-config.xml file for importing data to look as follows:
      cd /trunk/solr/example/example-DIH/solr/rss
      vi conf/rss-data-config.xml
      <dataConfig>
       <dataSource type="HttpDataSource" />
       <document>
        <entity name="amazon"
                pk="guid"
                url="http://www.amazon.com/rss/tag/blu-ray/new/ref=tag_rsh_hl_ersn"
                processor="XPathEntityProcessor"
                forEach="/rss/channel/item">
         <field column="title" xpath="/rss/channel/item/title" />
         <field column="guid" xpath="/rss/channel/item/guid" />
         <field column="link" xpath="/rss/channel/item/link" />
         <field column="pubdate" xpath="/rss/channel/item/pubDate" />
         <field column="description" xpath="/rss/channel/item/description" />
        </entity>
       </document>
      </dataConfig>

  4. In order to grab and add the price from the description:
    1. add the price as a field to the schema.xml file
      cd /trunk/solr/example/example-DIH/solr/rss
      vi conf/schema.xml
      <!-- add a price field alongside the existing fields -->
      <field name="price" type="float" indexed="true" stored="true" />

    2. add the RegexTransformer to the chain of transformers in the rss-data-config.xml file
      cd /trunk/solr/example/example-DIH/solr/rss
      vi conf/rss-data-config.xml
      <entity name="amazon"
              transformer="RegexTransformer"
              ...>
         ...
      </entity>

    3. add the regex to find and extract the price
      cd /trunk/solr/example/example-DIH/solr/rss
      vi conf/rss-data-config.xml
      <entity ... transformer="RegexTransformer">
         ...
         <field column="price"
                regex=".*Buy new.*span.class..tgProductPrice...(\d*.\d*)"
                sourceColName="description" />
      </entity>

      Please keep in mind that this is NOT the best regex that you can use. This one simply grabs the last set of digits with a dollar sign in front of them in the HTML, so it may grab the used price instead of the new price! Feel free to come up with a better regex for yourself.
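      For instance, switching to reluctant quantifiers should make the pattern capture the first price after the "Buy new" label instead of the last price on the page (untested, so verify it against a real feed):
      regex=".*?Buy new.*?span.class..tgProductPrice...(\d+\.\d+)"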
    4. If there are any items where the regex is incapable of pulling the price due to malformed HTML or any other reason, you can choose to either skip those items and not add them to Solr, or introduce your own dummy price for them. Here's how you may do so:
      • add the ScriptTransformer function you'll define below to the chain of transformers in the rss-data-config.xml file
        cd /trunk/solr/example/example-DIH/solr/rss
        vi conf/rss-data-config.xml
        <!-- checkPrice is an illustrative name for the JavaScript function defined in the script block below -->
        <entity name="amazon"
                transformer="RegexTransformer,script:checkPrice"
                ...>
           ...
        </entity>

      • add the following script to rss-data-config.xml file if you want to skip the items where a price wasn't available or couldn't be deduced
        cd /trunk/solr/example/example-DIH/solr/rss
        vi conf/rss-data-config.xml
        <dataConfig>
          <script><![CDATA[
            function checkPrice(row) {
              // skip this document entirely when no price was extracted
              var price = row.get('price');
              if (price == null || price == '') {
                row.put('$skipDoc', 'true');
              }
              return row;
            }
          ]]></script>
          ...
        </dataConfig>

      • add the following script to rss-data-config.xml file if you simply want to inject a dummy price
        cd /trunk/solr/example/example-DIH/solr/rss
        vi conf/rss-data-config.xml
        <dataConfig>
          <script><![CDATA[
            function checkPrice(row) {
              // inject a dummy price when none was extracted
              var price = row.get('price');
              if (price == null || price == '') {
                row.put('price', '0.00');
              }
              return row;
            }
          ]]></script>
          ...
        </dataConfig>

  5. The following can be added to pull out the image URL:
    <entity ... transformer="RegexTransformer,...">
       ...
       <!-- this regex is illustrative; also remember to add an image_url field to schema.xml -->
       <field column="image_url"
              regex='.*img src="(http[^"]*)".*'
              sourceColName="description" />
    </entity>

  6. The following can be added to pull out the date (SimpleDateFormat was used as a reference to devise a format string to parse the incoming dates from the feed such as Mon, 12 Sep 2011 21:14:23 GMT):
    <entity ... transformer="RegexTransformer,DateFormatTransformer,...">
       ...
       <!-- parses dates like: Mon, 12 Sep 2011 21:14:23 GMT; a date field must exist in schema.xml -->
       <field column="date"
              dateTimeFormat="EEE, dd MMM yyyy HH:mm:ss z"
              sourceColName="pubdate" />
    </entity>

  7. Start the Solr server:
    cd /trunk/solr/example
    java -Dsolr.solr.home="./example-DIH/solr/" -jar start.jar
    
  8. Navigate to the following URL:
    http://localhost:8983/solr/rss/admin/dataimport.jsp?handler=/dataimport
    
  9. Click the Full Import with Cleaning button to import data from the Amazon RSS feed into Solr.
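    Alternatively, the same import can be triggered straight over HTTP (assuming the default port):
    curl 'http://localhost:8983/solr/rss/dataimport?command=full-import&clean=true'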

Junk / Errata
url="http://www.amazon.com/rss/tag/blu-ray/new/ref=tag_rsh_hl_ersn"
regex=".*Buy new.*span.class..tgProductPrice...(\d*.\d*)"
stripHTML="true"

Saturday, September 10, 2011

Multicore master-slave replication in Solr Cloud

  1. If you've already done some work with Solr Cloud then you may want to start fresh by cleaning up any previous ZooKeeper configuration data in order to run this example exercise smoothly.
    cd /trunk/solr/example/solr
    rm -rf zoo_data
    
  2. We will create the following setup:
    1. there will be 2 Solr instances, each with 3 cores
    2. 1 of the 3 cores will be a master and the other 2 will be slaves
    3. the slaves of one instance will be configured to use the master of the other one
    4. The infrastructure will look like:
      • Solr-Instance-A
        • master1 (indexes changes for shard1)
        • slave1-master2 (replicates changes from shard2)
        • slave2-master2 (replicates changes from shard2)
      • Solr-Instance-B
        • master2 (indexes changes for shard2)
        • slave1-master1 (replicates changes from shard1)
        • slave2-master1 (replicates changes from shard1)
  3. We can reuse the multicore directory of the out-of-the-box example. It already has the core0 and core1 directories; let's create an additional core:
    cd /trunk/solr/example/multicore
    cp -r core0 core2
    
  4. If we were NOT using Solr Cloud, which has us upload a universal configuration at startup, then we would perform the following sub-steps:
    • Replace any mention of core0 with core2
      sed -ibak 's/core0/core2/g' core2/conf/solrconfig.xml
      sed -ibak 's/core0/core2/g' core2/conf/schema.xml
      sed -ibak 's/zero/two/g' core2/conf/schema.xml
      
    But right now these are pointless because each individual core's configuration will not be used ... instead the configuration in ZooKeeper will be used.
  5. Edit solr.xml as follows:
    <cores adminPath="/admin/cores">
        <core name="master1" instanceDir="core0" shard="shard1" collection="collection1" />
        <core name="slave1-master2" instanceDir="core1" shard="shard2" collection="collection1" />
        <core name="slave2-master2" instanceDir="core2" shard="shard2" collection="collection1" />
    </cores>

  6. Copy the example's zoo.cfg from the single-Solr setup over to the multicore setup:
    cd /trunk/solr/example
    cp ./solr/zoo.cfg ./multicore/
    
  7. Copy example to example2 in order to create another Solr instance:
    cd /trunk/solr
    cp -r example example2
    
  8. Edit solr.xml for example2 as follows:
    cd /trunk/solr/example2/multicore
    vi solr.xml
    <cores adminPath="/admin/cores">
        <core name="master2" instanceDir="core0" shard="shard2" collection="collection1" />
        <core name="slave1-master1" instanceDir="core1" shard="shard1" collection="collection1" />
        <core name="slave2-master1" instanceDir="core2" shard="shard1" collection="collection1" />
    </cores>

  9. So where is the configuration that we will be uploading to ZooKeeper? And what should we edit? Well, the most well-formed configuration is sitting in the out-of-the-box single core example, so let us simply upload it from there to ZooKeeper, and have it configured such that it can be applied conditionally to all our cores based on the java params that we specify at startup!
    1. Let us begin by editing the solrconfig.xml file of the single solr core example as follows:
      cd /trunk/solr/example/solr/conf
      vi solrconfig.xml
      
      <requestHandler name="/replication" class="solr.ReplicationHandler" >
             <lst name="master">
               <str name="enable">${enable.master:false}</str>
               <str name="replicateAfter">commit</str>
               <str name="replicateAfter">startup</str>
               <str name="confFiles">schema.xml,stopwords.txt</str>
             </lst>
             <lst name="slave">
               <str name="enable">${enable.slave:false}</str>
               <str name="masterUrl">http://${masterHost:localhost}:${masterPort:8983}/solr/${masterCoreName:master1}/replication</str>
               <str name="pollInterval">00:00:60</str>
             </lst>
      </requestHandler>

      
    2. We cannot pass the true/false values via -Denable.master or -Denable.slave at startup because they would end up applying globally to all the cores (1 master & 2 slaves), and there isn't a way to start only one core at a time from the command line. So we must leverage each individual multicore's solr.xml to provide core-specific values as follows:
      cd /trunk/solr/example/multicore
      vi solr.xml
        <cores adminPath="/admin/cores">
          <core name="master1" instanceDir="core0" shard="shard1" collection="collection1">
            <property name="enable.master" value="true" />
          </core>
          <core name="slave1-master2" instanceDir="core1" shard="shard2" collection="collection1">
            <property name="enable.slave" value="true" />
          </core>
          <core name="slave2-master2" instanceDir="core2" shard="shard2" collection="collection1">
            <property name="enable.slave" value="true" />
          </core>
        </cores>
      
      cd /trunk/solr/example2/multicore
      vi solr.xml
        <cores adminPath="/admin/cores">
          <core name="master2" instanceDir="core0" shard="shard2" collection="collection1">
            <property name="enable.master" value="true" />
          </core>
          <core name="slave1-master1" instanceDir="core1" shard="shard1" collection="collection1">
            <property name="enable.slave" value="true" />
          </core>
          <core name="slave2-master1" instanceDir="core2" shard="shard1" collection="collection1">
            <property name="enable.slave" value="true" />
          </core>
        </cores>

    3. Now let us start the 1st instance of the multicore Solr with the appropriate java params and let ZooKeeper know exactly where to get its universal-config (bootstrap_confdir) from:
      cd /trunk/solr/example
      #java -Dbootstrap_confdir=./solr/conf \
      #     -Dsolr.solr.home=multicore \
      #     -DmasterHost=localhost -DmasterPort=7574 -DmasterCoreName=master2 \
      #     -DzkRun \
      #     -jar start.jar
      java -Dbootstrap_confdir=./solr/conf -Dsolr.solr.home=multicore -DmasterHost=localhost -DmasterPort=7574 -DmasterCoreName=master2 -DzkRun -jar start.jar
      
    4. Start the 2nd instance of the multicore Solr with the appropriate java params:
      cd /trunk/solr/example2
      #java -Djetty.port=7574 \
      #     -DhostPort=7574 \
      #     -Dsolr.solr.home=multicore \
      #     -DmasterHost=localhost -DmasterPort=8983 -DmasterCoreName=master1 \
      #     -DzkHost=localhost:9983 \
      #     -jar start.jar
      java -Djetty.port=7574 -DhostPort=7574 -Dsolr.solr.home=multicore -DmasterHost=localhost -DmasterPort=8983 -DmasterCoreName=master1 -DzkHost=localhost:9983 -jar start.jar
      
  10. Now, you can check the ZooKeeper status here:
    http://localhost:8983/solr/master1/admin/zookeeper.jsp
    

And that's all there is to it; feel free to leave any feedback as comments below.

Friday, September 9, 2011

My Solr Cloud Wishlist

  1. Add code in Solr such that the admin may configure a limit on how many documents are too many to hold in a single Solr core, and kick off an automated process to:
    1. Either CREATE another core (on the same or a separate machine?) and add it to the ZooKeeper configuration with a weight that signifies that all new additions should happen to this new core's index only. Though I wonder how an update (delete+add) would work?
    2. Or begin sharding the existing core. This could be done by CREATE-ing a copy of it (core_copy) and distributing the index into two shards (core_shard1, core_shard2) using a scheme/policy that does so in a best-effort manner, such that scoring wouldn't get thrown off by too much due to each individual shard's differing IDF. Then SWAP the two sharded cores in as a replacement for the overloaded core. What would happen to any changes made during this process?

Setup Solr master-slave replication with ZooKeeper

Reading Chapter 9: Scaling Solr from the book Solr 1.4 Enterprise Search Server before jumping into the world of Solr Cloud is essential for anyone who wants to understand what the embedded ZooKeeper can or cannot do. This is because you have to know how to configure all the nuts & bolts in Solr manually before you can gain a natural understanding of what the automation does and does not take care of for you.

If you go through the basic exercises for Solr Cloud, then you will come across Example B: Simple two shard cluster with shard replicas. It is important to note that the wording here can be a bit misleading based on what you are looking to accomplish. It is not replication that is being set up there. Instead, that example uses "replicas" as "copies", to demonstrate high search availability.

Here are the tested & tried steps for replication with a master-slave setup that will fit-in with a ZooKeeper managed Solr Cloud:
  1. If you've already done some work with Solr Cloud then you may want to start fresh by cleaning up any previous ZooKeeper configuration data in order to run this example exercise smoothly.
    cd /trunk/solr/example/solr
    rm -rf zoo_data
    
  2. Collection is ZooKeeper-oriented terminology for a bunch of Solr cores that share the same schema, and it has nothing to do with the name of a Solr core itself. Let's keep this fact plain to see by editing the solr.xml file and providing appropriate names for the core & collection:
    <cores adminPath="/admin/cores" defaultCoreName="master1">
     <core name="master1" instanceDir="." shard="shard1" collection="collection1"></core>
    </cores>
    
  3. Navigate to the configuration directory for the example in the trunk & begin editing solrconfig.xml using your preferred text-editor:
    cd /trunk/solr/example/solr/conf
    vi solrconfig.xml
    
  4. Uncomment and edit the replication requestHandler to be as follows:
    <requestHandler name="/replication" class="solr.ReplicationHandler" >
           <lst name="master">
             <str name="enable">${enable.master:false}</str>
             <str name="replicateAfter">commit</str>
             <str name="replicateAfter">startup</str>
             <str name="confFiles">schema.xml,stopwords.txt</str>
           </lst>
           <lst name="slave">
             <str name="enable">${enable.slave:false}</str>
             <str name="masterUrl">http://localhost:8983/solr/replication</str>
             <str name="pollInterval">00:00:60</str>
           </lst>
    </requestHandler>
    
  5. Navigate out of the examples directory and create another copy of it
    cd /trunk/solr/
    cp -r example example2
    
  6. Edit the solr.xml file for the example2 directory:
    1. change the name of the core to indicate that it is a slave
    2. leave the name of the shard as-is to indicate which shard it is a replica of
    3. leave the name of the collection as-is because this slave core should join the same collection as its master in ZooKeeper config
    cd /trunk/solr/example2/solr
    vi solr.xml
    
    <cores adminPath="/admin/cores" defaultCoreName="slave1">
     <core name="slave1" instanceDir="." shard="shard1" collection="collection1"></core>
    </cores>
    
  7. Start the master core; the use of java params allows us to call this out as a master at startup:
    cd /trunk/solr/example
    java -Dbootstrap_confdir=./solr/conf -Denable.master=true -DzkRun -jar start.jar
    
  8. Start the slave core; the use of java params allows us to call this out as a slave at startup:
    cd /trunk/solr/example2
    java -Djetty.port=7574 -DhostPort=7574 -Denable.slave=true -DzkHost=localhost:9983 -jar start.jar
    
  9. After starting the slave, towards the end of the logs for the slave, you should be able to spot info to affirm that replication is working:
    INFO: Updating cloud state from ZooKeeper...
    Sep 9, 2011 6:20:00 PM org.apache.solr.handler.SnapPuller fetchLatestIndex
    INFO: Slave in sync with master.
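    You can also hit the replication handler on either instance for a quick sanity check (assuming the ports configured above):
    curl 'http://localhost:8983/solr/replication?command=details'
    curl 'http://localhost:7574/solr/replication?command=details'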
    

Sources:
  • http://lucene.472066.n3.nabble.com/Solr-Cloud-is-replication-really-a-feature-on-the-trunk-td3317695.html
  • http://lucene.472066.n3.nabble.com/Replication-setup-with-SolrCloud-Zk-td2952602.html