It is ridiculously beneficial to split up XML files if you will be using Solr's Data Import Handler (DIH) to process the data. I personally saw an improvement from 166 entries/minute to 30,860 entries/minute after splitting the large XML data files into an individual file for every entity that is to become a Lucene document in Solr.
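For context, a directory full of per-entity files is exactly what DIH's FileListEntityProcessor + XPathEntityProcessor combination is built to walk. A minimal data-config.xml sketch along those lines (the baseDir, file-name pattern and field names below are just illustrative placeholders, not a config from this post):

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="/path/to/split/files" fileName="row.*\.xml"
            rootEntity="false" dataSource="null">
      <entity name="item" processor="XPathEntityProcessor"
              url="${files.fileAbsolutePath}" forEach="/item">
        <field column="title" xpath="/item/title"/>
        <field column="description" xpath="/item/description"/>
      </entity>
    </entity>
  </document>
</dataConfig>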
It was only on a whim, but the script that allowed me to experiment with this and get the desired results was this awk one-liner (found online):
awk '/<item>/{close("row"count".xml");count++}count{f="row"count".xml";print $0 > f}' *.xml
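For readability, the same one-liner expands to this (functionally identical; the generated file names are unchanged):

awk '
/<item>/ {                      # a new entity begins on this line
    close("row" count ".xml")   # finish the previous output file
    count++                     # move on to the next file number
}
count {                         # ignore everything before the first <item>
    f = "row" count ".xml"
    print $0 > f                # append the current line to rowN.xml
}' *.xml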
So if your file looks something like:
<items>
<item>
<title>Item 1</title>
<description>Description 1</description>
...
</item>
...
<item>
<title>Item 20000</title>
<description>Description 20000</description>
...
</item>
</items>
Then all the items from 1 to 19,999 will be divided up by this script into individual files named row1.xml, row2.xml ... row19999.xml that look like:

<item>
<title>Item N</title>
<description>Description N</description>
...
</item>

But the last (20,000th) item's file will have a trailing tag:
<item>
<title>Item 20000</title>
<description>Description 20000</description>
...
</item>
</items>
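A quick way to see which row files actually picked up a stray wrapper tag (assuming they all sit in the current directory and the glob stays within the shell's argument limits) is:

grep -l 'items>' row*.xml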
If you have processed 10 files, each with 20,000 entries, using the splitter command mentioned above, then every 20,000th, 20,000*2th ... 20,000*10th numbered file will need to have the trailing tag deleted from it. To that end, the following script can be edited by providing the number of original files in the while loop's comparison statement:
#!/bin/sh
# sanitize.sh - delete the stray wrapper tag from every n-th row file
if [ $# -eq 0 ]
then
    echo "Error - Number missing from command line argument"
    echo "Syntax : $0 number"
    echo " Pass the number of entries per original file, e.g. 20000"
    exit 1
fi
n=$1
i=1
# 10 = the number of original files that were split
while [ $i -le 10 ]
do
    echo "sed -ibak '/items>/d' row`expr $i \* $n`.xml"
    sed -ibak '/items>/d' row`expr $i \* $n`.xml
    i=`expr $i + 1`
done

And then running the script by passing it the number of the last entry in each original file (the 20,000th):
./sanitize.sh 20000
sed -ibak '/items>/d' row20000.xml
sed -ibak '/items>/d' row40000.xml
sed -ibak '/items>/d' row60000.xml
sed -ibak '/items>/d' row80000.xml
sed -ibak '/items>/d' row100000.xml
sed -ibak '/items>/d' row120000.xml
sed -ibak '/items>/d' row140000.xml
sed -ibak '/items>/d' row160000.xml
sed -ibak '/items>/d' row180000.xml
sed -ibak '/items>/d' row200000.xml
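Since the sed expression simply drops any line containing items>, an alternative is to skip the counting entirely and run the cleanup over every row file in one pass; a sketch using find/xargs to avoid argument-list limits:

find . -maxdepth 1 -name 'row*.xml' -print0 | xargs -0 sed -ibak '/items>/d'

The trade-off is that this rewrites every row file (and leaves a .xmlbak backup for each), not just the handful that actually contain a stray tag.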