Wednesday, September 14, 2011

Import data from Amazon RSS feeds into Solr

  1. For this example, let's use the RSS feed for new products that have been tagged as blu-ray. Here's the URL: http://www.amazon.com/rss/tag/blu-ray/new/ref=tag_rsh_hl_ersn
  2. Before we import any data into Solr, let's take a moment to understand the format of the data from the RSS feed. Here's a sample item from the RSS feed:
    1. There are five basic fields per item: title, guid, link, pubDate and description.
    2. If we take a closer look at the HTML chunk inside description, we will find more information, such as:
      • the image URL
      • the price for a new item
        Buy new: $10.99
      • the price of used items, starting from a lower bound:
        22 used and new from $6.35
      • In fact, the HTML chunk is like a complete web page; it even has a separate description section inside itself!
        ...
      • All this is rather messy and unpredictable (sometimes it's there, sometimes it's not, sometimes it's a class, sometimes it's an id, sometimes there are duplicates), but we must try to make the best of it.
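
To make the item layout concrete, here is a minimal Python sketch that parses a hand-written RSS item and lists its fields. The sample XML is hypothetical (it is not copied from the Amazon feed): the title, guid, link and the markup inside the description are placeholders chosen only to mirror the fields described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written item; a real feed item carries far more HTML
# inside <description>. RSS 2.0 spells the date element "pubDate".
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <item>
      <title>Example Blu-ray Title</title>
      <guid>example-guid-1</guid>
      <link>http://www.example.com/product/1</link>
      <pubDate>Mon, 12 Sep 2011 21:14:23 GMT</pubDate>
      <description><![CDATA[<div>Buy new: <span class="tgProductPrice">$10.99</span></div>]]></description>
    </item>
  </channel>
</rss>"""


def item_fields(rss_text):
    """Return one {tag: text} dict per <item> element in the feed."""
    root = ET.fromstring(rss_text)
    return [{child.tag: child.text for child in item} for item in root.iter("item")]


items = item_fields(SAMPLE_RSS)
print(sorted(items[0]))
```

The description comes back as one opaque string of HTML, which is why the later steps fall back on regular expressions to dig values out of it.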
  3. Having understood the complexities of our source of data, we are now ready to configure the Data Import Handler (DIH) for Solr.
    1. Navigate to the directory that holds the out-of-the-box sample core for RSS feeds, and edit the rss-data-config.xml file so that it looks as follows:
      cd /trunk/solr/example/example-DIH/solr/rss
      vi conf/rss-data-config.xml
      
      
       
       
        
         
         
         
        
       
      
      
  4. In order to grab and add the price from the description:
    1. add the price as a field to the schema.xml file
      cd /trunk/solr/example/example-DIH/solr/rss
      vi conf/schema.xml
      
      
      
      
      
    2. add the RegexTransformer to the chain of transformers in the rss-data-config.xml file
      cd /trunk/solr/example/example-DIH/solr/rss
      vi conf/rss-data-config.xml
      
      
         ...
      
      
    3. add the regex to find and extract the price
      cd /trunk/solr/example/example-DIH/solr/rss
      vi conf/rss-data-config.xml
      
      
         ...
         
         
      
      
      Please keep in mind that this is NOT the best regex that you can use. This one simply grabs the last set of digits with a dollar sign in front of them in the HTML, so it may grab the used price instead of the new price! Feel free to come up with a better regex for yourself.
    4. If the regex cannot pull the price from some items, due to malformed HTML or any other reason, you can either skip those items and not add them to Solr, or introduce your own dummy price for them. Here's how you may do so:
      • add the ScriptTransformer function you'll define to the chain of transformers in the rss-data-config.xml file
        cd /trunk/solr/example/example-DIH/solr/rss
        vi conf/rss-data-config.xml
        
        
           ...
        
        
      • add the following script to the rss-data-config.xml file if you want to skip the items where a price wasn't available or couldn't be deduced
        cd /trunk/solr/example/example-DIH/solr/rss
        vi conf/rss-data-config.xml
        
        
        
        
        ...
        
        
      • add the following script to the rss-data-config.xml file if you simply want to inject a dummy price
        cd /trunk/solr/example/example-DIH/solr/rss
        vi conf/rss-data-config.xml
        
        
        
        
        ...
        
        
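
The price handling in this step can be tried outside Solr first. The sketch below is a rough Python model, not DIH configuration. Several things in it are assumptions: the sample HTML is hypothetical, `GREEDY` is only a stand-in for a pattern that grabs the last dollar amount, and the dummy price of 0.00 is arbitrary. `BUY_NEW` is the regex listed under Junk / Errata. Inside DIH the fallback would live in a ScriptTransformer function; the `$skipDoc` key mirrors the flag DIH uses to drop a document.

```python
import re

# Hypothetical description chunk with both a new and a used price.
SAMPLE_HTML = (
    'Buy new: <span class="tgProductPrice">$10.99</span> '
    '<a href="#">22 used and new</a> from <span class="price">$6.35</span>'
)

# Stand-in for "grab the last dollar amount": the greedy .* swallows everything
# and backtracks only as far as the final "$digits.digits".
GREEDY = re.compile(r".*\$(\d+\.\d+)")

# The regex from the Junk / Errata section, anchored on "Buy new".
BUY_NEW = re.compile(r".*Buy new.*span.class..tgProductPrice...(\d*.\d*)")


def extract(pattern, html):
    """Return capture group 1, or None when the pattern does not match."""
    match = pattern.search(html)
    return match.group(1) if match else None


def with_price(row, on_missing="skip", dummy="0.00"):
    """Add a price to a copy of the row; skip it or inject a dummy if none is found."""
    row = dict(row)
    price = extract(BUY_NEW, row.get("description", ""))
    if price is not None:
        row["price"] = price
    elif on_missing == "skip":
        row["$skipDoc"] = "true"  # mirrors DIH's flag for dropping a document
    else:
        row["price"] = dummy  # arbitrary placeholder value
    return row


print(extract(GREEDY, SAMPLE_HTML))
print(extract(BUY_NEW, SAMPLE_HTML))
```

On this sample the greedy stand-in returns 6.35 (the used price), while the pattern anchored on "Buy new" returns 10.99. DIH compiles its patterns with Java's regex engine, so it is worth re-checking any pattern against real feed items before relying on it.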
  5. The following can be added to pull out the image URL:
    
       ...
       
       
    
    
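
As a quick way to experiment with image extraction before touching the config, here is a small Python sketch. The pattern is an assumption of mine rather than the regex used in the config, and the sample HTML is hypothetical; it simply captures the `src` attribute of the first `<img>` tag.

```python
import re

# Hypothetical description chunk; the real markup may differ.
SAMPLE_HTML = (
    '<a href="http://www.example.com/p/1">'
    '<img src="http://www.example.com/img/1.jpg" alt="cover"></a>'
)

# Assumed pattern: capture the src attribute of the first <img> tag.
IMG_SRC = re.compile(r'<img[^>]*?\bsrc="([^"]+)"')


def first_image_url(html):
    """Return the first image URL in the chunk, or None if there is none."""
    match = IMG_SRC.search(html)
    return match.group(1) if match else None


print(first_image_url(SAMPLE_HTML))
```

If the feed ever uses single quotes or unquoted attributes, this pattern will miss the image, so treat it as a starting point only.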
  6. The following can be added to pull out the date. The format string was devised using SimpleDateFormat as a reference, so that it parses incoming dates from the feed such as Mon, 12 Sep 2011 21:14:23 GMT:
    
       ...
       
       
    
    
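
To sanity-check a date layout against the sample date above, a short Python sketch can help. The Python format string is my own counterpart of the feed layout, and the SimpleDateFormat pattern mentioned in the comment is an assumption, since the format string used in the config is not reproduced here. The function assumes the feed always reports GMT.

```python
from datetime import datetime

# Python counterpart of the feed's date layout. In Java's SimpleDateFormat
# notation the same layout would be written "EEE, dd MMM yyyy HH:mm:ss z"
# (an assumption; the config's own format string is not shown here).
FEED_FORMAT = "%a, %d %b %Y %H:%M:%S %Z"


def to_solr_date(pub_date):
    """Convert a GMT feed date into the ISO-8601 form used by Solr date fields."""
    parsed = datetime.strptime(pub_date, FEED_FORMAT)
    # The trailing "Z" is only correct because the input is assumed to be GMT.
    return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")


print(to_solr_date("Mon, 12 Sep 2011 21:14:23 GMT"))
```

For the sample date this prints 2011-09-12T21:14:23Z.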
  7. Start the Solr server:
    cd /trunk/solr/example
    java -Dsolr.solr.home="./example-DIH/solr/" -jar start.jar
    
  8. Navigate to the following URL:
    http://localhost:8983/solr/rss/admin/dataimport.jsp?handler=/dataimport
    
  9. Click the Full Import with Cleaning button to import data from the Amazon RSS feed into Solr.
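
The import can also be triggered without the admin page, by requesting the DIH handler directly. The Python sketch below only builds the request URL. The host, port, core name and handler path are taken from the admin URL above; treating `command=full-import` with `clean=true` as the equivalent of the Full Import with Cleaning button is my assumption.

```python
from urllib.parse import urlencode

# Host, port, core name and handler path come from the admin URL above.
BASE = "http://localhost:8983/solr/rss/dataimport"


def import_url(command="full-import", clean=True):
    """Build a DIH request URL for the given command."""
    params = {"command": command, "clean": "true" if clean else "false"}
    return BASE + "?" + urlencode(params)


print(import_url())
# To actually trigger the import against a running server:
#   from urllib.request import urlopen
#   urlopen(import_url()).read()
```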

Junk / Errata
url="http://www.amazon.com/rss/tag/blu-ray/new/ref=tag_rsh_hl_ersn"
regex=".*Buy new.*span.class..tgProductPrice...(\d*.\d*)"
stripHTML="true"

2 comments:

  1. Hello
    Nice article. I am new to Solr and this helps.
    I tried this example from the Solr example folder and was able to index. But after indexing, how does one search for the docs? What would the URL be? I tried "http://localhost:8983/solr/browse?q=blu-ray" and it does not work.

  2. You have to include the core's name in the URL, like so: http://localhost:8983/solr/rss/browse?q=blu-ray