Behemoth: Splitting up large XML data files for use with DIH in Solr

Splitting up large XML files pays off enormously if you will be using Solr's Data Import Handler (DIH) to process the data. I personally saw throughput go from 166 entries/minute to 30,860 entries/minute (roughly 186 times faster) after splitting all the large XML data files so that every entity that is to become a Lucene document in Solr gets its own file.

I tried this only on a whim. The one-liner that let me experiment and get the desired results, which I found elsewhere, is:

awk '/<item>/{close("row"count".xml");count++}count{f="row"count".xml";print $0 > f}' *.xml

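So if your file looks something like:

<items>
  <item>
    <title>Item 1</title>
    <description>Description 1</description>
    ...
  </item>
  ...
  <item>
    <title>Item 20000</title>
    <description>Description 20000</description>
    ...
  </item>
</items>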

Then all the items from 1 to 19,999 will be divided up by this script into individual files named row1.xml, row2.xml ... row19999.xml, each of which looks like:

<item>
  <title>Item N</title>
  <description>Description N</description>
  ...
</item>

But the file for the last (20,000th) item will also contain the trailing closing tag of the original file:

  <item>
    <title>Item 20000</title>
    <description>Description 20000</description>
    ...
  </item>
</items>
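
One way to sidestep the trailing tag entirely is to stop writing at each closing item tag. The following is only a sketch and not the one-liner used above: it assumes each <item> and </item> tag sits on its own line, as in the sample, and the sample.xml file and temporary directory are made up for the demonstration.

```shell
#!/bin/sh
# Sketch: copy only the lines from <item> through </item>, so nothing that
# follows the last </item> (such as </items>) ends up in a row file.
# Assumes <item> and </item> each sit on their own line.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical three-item sample file, for demonstration only.
cat > sample.xml <<'EOF'
<items>
  <item>
    <title>Item 1</title>
  </item>
  <item>
    <title>Item 2</title>
  </item>
  <item>
    <title>Item 3</title>
  </item>
</items>
EOF

awk '
  /<item>/   { count++; f = "row" count ".xml"; inside = 1 }  # start a new output file
  inside     { print > f }                                     # copy lines within an item
  /<\/item>/ { close(f); inside = 0 }                          # stop before any trailing tag
' sample.xml

# Show the last file: it should end at the closing item tag.
cat row3.xml
```

With this variant the cleanup pass described next should not be needed, at the cost of a slightly longer awk program.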

If you have processed 10 files, each with 20000 entries, using the splitter command above, then every file whose number is a multiple of 20000 (row20000.xml, row40000.xml ... row200000.xml) needs the trailing tag deleted from it. To that end, use the following script, editing the while loop's comparison to match the number of original files:

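#!/bin/sh
# Delete the trailing items tag from the last row file of each original XML file.
if [ $# -eq 0 ]
then
  echo "Error - number missing from command line argument"
  echo "Syntax : $0 number"
  echo " Deletes the items tag from every row file numbered a multiple of the given number"
  exit 1
fi
n=$1   # entries per original file
i=1
# 10 is the number of original files; edit it to match your data
while [ "$i" -le 10 ]
do
  f="row$((i * n)).xml"
  echo "sed -ibak '/items>/d' $f"
  # -ibak edits in place and keeps a backup with the suffix "bak"
  sed -ibak '/items>/d' "$f"
  i=$((i + 1))
done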

Then run the script, passing it the number of entries in each original file (20000 here); it echoes each sed command as it runs:

./sanitize.sh 20000
sed -ibak '/items>/d' row20000.xml
sed -ibak '/items>/d' row40000.xml
sed -ibak '/items>/d' row60000.xml
sed -ibak '/items>/d' row80000.xml
sed -ibak '/items>/d' row100000.xml
sed -ibak '/items>/d' row120000.xml
sed -ibak '/items>/d' row140000.xml
sed -ibak '/items>/d' row160000.xml
sed -ibak '/items>/d' row180000.xml
sed -ibak '/items>/d' row200000.xml
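
Rather than editing the loop bound by hand, the number of original files could be passed as a second argument. The following is a rough sketch of that idea: sanitize_rows is a made-up name, and it assumes the row files sit in the current directory.

```shell
#!/bin/sh
# Sketch: same cleanup as above, with the number of original files as a parameter.
# sanitize_rows is a hypothetical helper; it works on row*.xml in the current directory.
sanitize_rows() {
  n=$1       # entries per original file
  files=$2   # number of original files
  i=1
  while [ "$i" -le "$files" ]
  do
    f="row$((i * n)).xml"
    # Skip quietly if the expected file is absent.
    if [ -f "$f" ]
    then
      # Edit in place, keeping a backup with the suffix "bak".
      sed -ibak '/items>/d' "$f"
    fi
    i=$((i + 1))
  done
}
```

For the example above, the call would be sanitize_rows 20000 10.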

Wednesday, October 5, 2011
