- For this example, let's use the RSS feed for new products that have been tagged as blu-ray. Here's the URL:
- http://www.amazon.com/rss/tag/blu-ray/new/ref=tag_rsh_hl_ersn
- If you are using Firefox, you may not get to see the feed in its raw XML format, because the browser tends to interpret most of the RSS feed and present it in a user-friendly fashion. This is not very desirable for developers or administrators. To view the raw content of the feed, you can simply open the same URL via Feed-Proxy: http://persistent.info/cgi-bin/feed-proxy?url=http%3A%2F%2Fwww.amazon.com%2Frss%2Ftag%2Fblu-ray%2Fnew%2Fref%3Dtag_rsh_hl_ersn
- Refer to the following links for info on how to get the RSS feeds you want from Amazon:
- Before we import any data into Solr, let's take a moment to understand the format of the data from the RSS feed. Here's a sample item from the RSS feed:
- There are 5 basic fields per item: title, guid, link, pubdate and description.
- If we take a closer look at the HTML chunk inside the description, we will find more information, such as:
- the image URL
- the price for a new item
Buy new: $10.99
- the price of used items, starting from a lower bound:
22 used and new from $6.35
- In fact, the HTML chunk is like a complete webpage; it even has a separate description section inside of itself!...
- All this is rather messy and unpredictable (sometimes it's there, sometimes it's not, sometimes it's a class, sometimes it's an id, sometimes there are duplicates), but we must try to make the best of it.
- Having understood the complexities of our source of data, we are now ready to configure the Data Import Handler (DIH) for Solr.
- Navigate to the directory that holds the out-of-the-box sample core for configuring RSS feeds, and edit the rss-data-config.xml file for importing data so that it looks as follows:
cd /trunk/solr/example/example-DIH/solr/rss
vi conf/rss-data-config.xml
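As a rough sketch of what such a configuration could look like (not necessarily the exact file used here), the fragment below assumes the feed is standard RSS 2.0, so that each item sits under /rss/channel/item. The entity name amazon is made up for illustration, and the title, link and description fields are assumed to exist in the core's schema.xml.

```xml
<dataConfig>
  <!-- URLDataSource fetches the feed over HTTP -->
  <dataSource type="URLDataSource" />
  <document>
    <!-- Assumption: RSS 2.0 layout, one Solr document per item element -->
    <entity name="amazon"
            pk="link"
            url="http://www.amazon.com/rss/tag/blu-ray/new/ref=tag_rsh_hl_ersn"
            processor="XPathEntityProcessor"
            forEach="/rss/channel/item">
      <!-- Each column is filled from the matching child element of the item -->
      <field column="title"       xpath="/rss/channel/item/title" />
      <field column="link"        xpath="/rss/channel/item/link" />
      <field column="description" xpath="/rss/channel/item/description" />
    </entity>
  </document>
</dataConfig>
```

The later steps build on an entity of this shape by adding transformers and extra field elements.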
- In order to grab and add the price from the description:
- add the price as a field to the schema.xml file
cd /trunk/solr/example/example-DIH/solr/rss
vi conf/schema.xml
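A field definition along the following lines would probably do. This is a sketch that assumes a field type named float is declared in the types section of schema.xml; if your schema names its numeric types differently, adjust the type attribute accordingly.

```xml
<!-- Hypothetical definition: assumes a "float" field type exists in this schema -->
<field name="price" type="float" indexed="true" stored="true" />
```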
- add the RegexTransformer to the chain of transformers in the rss-data-config.xml file
cd /trunk/solr/example/example-DIH/solr/rss
vi conf/rss-data-config.xml
...
- add the regex to find and extract the price
cd /trunk/solr/example/example-DIH/solr/rss
vi conf/rss-data-config.xml
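For illustration, the two steps above might combine into something like the sketch below. The regex is an assumption chosen to behave as this section describes (it captures the last dollar amount in the description), and the entity name amazon is hypothetical. RegexTransformer reads the column named in sourceColName and stores the first capture group in the price column.

```xml
<!-- Sketch only: the entity name and the regex are illustrative assumptions -->
<entity name="amazon"
        pk="link"
        url="http://www.amazon.com/rss/tag/blu-ray/new/ref=tag_rsh_hl_ersn"
        processor="XPathEntityProcessor"
        forEach="/rss/channel/item"
        transformer="RegexTransformer">
  <field column="description" xpath="/rss/channel/item/description" />
  <!-- (?s) lets . match newlines so the whole HTML chunk is searched;
       the greedy leading .* pushes the match to the last dollar amount -->
  <field column="price" sourceColName="description" regex="(?s).*\$(\d+\.\d+).*" />
</entity>
```

A pattern like this also ignores thousands separators, so a price such as $1,299.99 would be captured as 299.99.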
Please keep in mind that this is NOT the best regex you can use. It simply grabs the last set of digits with a dollar sign in front of them in the HTML, so it may grab the used price instead of the new price! Feel free to come up with a better regex yourself.
...
- If the regex cannot pull the price from some items, due to malformed HTML or any other reason, you can either skip those items and not add them to Solr, or introduce your own dummy price for them. Here's how you may do so:
- add the ScriptTransformer function you'll define to the chain of transformers in the rss-data-config.xml file
cd /trunk/solr/example/example-DIH/solr/rss
vi conf/rss-data-config.xml
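A minimal sketch of the chained transformer attribute follows. The function name checkPrice and the entity name amazon are hypothetical. Transformers are applied in the order listed, so the script has to come after RegexTransformer in order to see the extracted price.

```xml
<!-- Other entity attributes (url, processor, forEach) are omitted for brevity -->
<entity name="amazon"
        transformer="RegexTransformer,script:checkPrice">
  <!-- field definitions stay as they were -->
</entity>
```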
...
- add the following script to the rss-data-config.xml file if you want to skip the items where a price wasn't available or couldn't be deduced
cd /trunk/solr/example/example-DIH/solr/rss
vi conf/rss-data-config.xml
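One way such a script could look is sketched below. It relies on the special $skipDoc flag that DIH honors for each row, and it assumes the function is registered on the entity as script:checkPrice (a hypothetical name). The script element sits directly under dataConfig.

```xml
<dataConfig>
  <script><![CDATA[
    // Hypothetical function name, referenced from the entity as script:checkPrice
    function checkPrice(row) {
      // The price column is absent when the regex did not match
      if (row.get('price') == null) {
        // Tell DIH not to add this row to the index
        row.put('$skipDoc', 'true');
      }
      return row;
    }
  ]]></script>
  <!-- dataSource and document elements follow as before -->
</dataConfig>
```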
...
- add the following script to the rss-data-config.xml file if you simply want to inject a dummy price
cd /trunk/solr/example/example-DIH/solr/rss
vi conf/rss-data-config.xml
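A sketch of the dummy-price variant might look like this. The placeholder value 0.00 and the function name checkPrice are assumptions made for illustration; pick whatever sentinel value is easy to filter out in your queries.

```xml
<dataConfig>
  <script><![CDATA[
    // Hypothetical function name, referenced from the entity as script:checkPrice
    function checkPrice(row) {
      // The price column is absent when the regex did not match
      if (row.get('price') == null) {
        // Arbitrary placeholder so the document is still indexed
        row.put('price', '0.00');
      }
      return row;
    }
  ]]></script>
  <!-- dataSource and document elements follow as before -->
</dataConfig>
```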
...
- The following can be added to pull out the image URL:
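As a hedged illustration, a field of roughly this shape could capture the first src attribute found in the description. It assumes the product image is the first image in the HTML chunk and that its URL is wrapped in double quotes. The column name imageUrl is hypothetical and would need a matching field in schema.xml, and RegexTransformer must be in the entity's transformer chain.

```xml
<!-- Single-quoted attribute so the regex can contain double quotes unescaped -->
<!-- The lazy leading .*? stops at the first src attribute in the description -->
<field column="imageUrl" sourceColName="description"
       regex='(?s).*?src="([^"]+)".*' />
```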
...
- The following can be added to pull out the date (SimpleDateFormat was used as a reference to devise a format string to parse the incoming dates from the feed such as Mon, 12 Sep 2011 21:14:23 GMT):
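A sketch under the assumption that the item's date element is named pubDate, as in standard RSS 2.0: DateFormatTransformer parses the column with the SimpleDateFormat pattern given in dateTimeFormat. The pattern below matches dates of the form Mon, 12 Sep 2011 21:14:23 GMT. The entity name amazon is hypothetical, and a date-typed pubDate field is assumed to exist in schema.xml.

```xml
<!-- Other entity attributes (url, processor, forEach) are omitted for brevity -->
<entity name="amazon"
        transformer="RegexTransformer,DateFormatTransformer">
  <!-- EEE = day name, dd = day, MMM = month name, z = time zone such as GMT -->
  <field column="pubDate" xpath="/rss/channel/item/pubDate"
         dateTimeFormat="EEE, dd MMM yyyy HH:mm:ss z" />
</entity>
```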
...
- Start the Solr server:
cd /trunk/solr/example
java -Dsolr.solr.home="./example-DIH/solr/" -jar start.jar
- Navigate to the following URL:
http://localhost:8983/solr/rss/admin/dataimport.jsp?handler=/dataimport
- Click the Full Import with Cleaning button to import data from the Amazon RSS feed into Solr.
Junk / Errata
url="http://www.amazon.com/rss/tag/blu-ray/new/ref=tag_rsh_hl_ersn" regex=".*Buy new.*span.class..tgProductPrice...(\d*.\d*)" stripHTML="true"
Hello
Nice article, I am new to Solr and this helps.
However, I was trying this example from the SOLR example folder and was able to index. But after you index, how does one search for the docs. What would be the URL for it. I tried "http://localhost:8983/solr/browse?q=blu-ray" and it does not work.
You have to throw in the core's name into the URL like so: http://localhost:8983/solr/rss/browse?q=blu-ray