Data analysis and server optimization -- yesterday was a good day

Server performance has been on my mind lately, specifically Solr server performance. We use Solr as our search and browse infrastructure for MOspace, our DSpace -based institutional repository. I've heard that search performance on MOspace has been slow, so I wanted to get some data on that, and then find ways to improve the performance. I discovered that our Solr log files contain information on how long it takes the server to respond to a query. Which is great, I just needed a way to analyze that data. I've been meaning to play with Miller --- which is a handy little text parsing tool, with a little bit of data analysis added in. It's a lovely tool, but I couldn't figure out the syntax to do what I wanted. Normally, when I run into this issue with open source software, I'll reach out on a related mail list, and ask for help. Miller is new, it doesn't have a community yet, and no mail list. So, I sent an e-mail to John Kerl, the developer of Miller. Astoundingly, John responded right away, with a little snippet that turned into this:

for f in solr.log*; do echo "Processing $f..."; grep search $f | sed 's/.*} //' | mlr --repifs --ifs space stats1 -a min,mode,mean,stddev,max -f QTime; echo; echo; done

Which outputs something like this:

Processing solr.log... QTime_min=0.000000,QTime_mode=1,QTime_mean=8.794904,QTime_stddev=83.023710,QTime_max=6380.000000

Processing solr.log.2015-08-26... QTime_min=0.000000,QTime_mode=3,QTime_mean=15.659905,QTime_stddev=87.355819,QTime_max=5350.000000

Processing solr.log.2015-08-27... QTime_min=0.000000,QTime_mode=3,QTime_mean=10.005723,QTime_stddev=59.013046,QTime_max=2557.000000

Processing solr.log.2015-08-28... QTime_min=0.000000,QTime_mode=3,QTime_mean=9.215227,QTime_stddev=51.547102,QTime_max=2692.000000

Processing solr.log.2015-08-29... QTime_min=0.000000,QTime_mode=3,QTime_mean=8.626997,QTime_stddev=52.084964,QTime_max=2688.000000

Processing solr.log.2015-08-30... QTime_min=0.000000,QTime_mode=3,QTime_mean=8.197308,QTime_stddev=50.787355,QTime_max=4718.000000

Processing solr.log.2015-08-31... QTime_min=0.000000,QTime_mode=3,QTime_mean=8.253708,QTime_stddev=106.475814,QTime_max=9781.000000

Which is kinda cool. But those max response times! I googled around a bit, and found this StackOverflow thread

Which in turn lead me to this wiki page and this wiki page

That last page mentions these settings for tuning the GC to maximize Solr performance:

JVM_OPTS=" -XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=8m -XX:MaxGCPauseMillis=250 -XX:InitiatingHeapOccupancyPercent=75 -XX:+UseLargePages -XX:+AggressiveOpts "

This is info I can use. So, I tried those [OPTS] in our Tomcat server's setenv.sh script:

export CATALINA_OPTS="$CATALINA_OPTS -XX:+UseG1GC"

# suggestion from http://stackoverflow.com/questions/5839359/java-lang-outofmemoryerror-gc-overhead-limit-exceeded # and here https://plumbr.eu/outofmemoryerror/gc-overhead-limit-exceeded # though, really, we ought to figure out where the memory leak is a some point... export CATALINA_OPTS="$CATALINA_OPTS -XX:-UseGCOverheadLimit"

# This is interpreted as a hint to the garbage collector that pause times
# of milliseconds or less are desired. The garbage collector will # adjust the Java heap size and other garbage collection related parameters
# in an attempt to keep garbage collection pauses shorter than milliseconds. # http://java.sun.com/docs/hotspot/gc5.0/ergo5.html
export CATALINA_OPTS="$CATALINA_OPTS -XX:MaxGCPauseMillis=250"

# A hint to the virtual machine that it's desirable that not more than: # 1 / (1 + GCTimeRation) of the application execution time be spent in # the garbage collector. # http://themindstorms.wordpress.com/2009/01/21/advanced-jvm-tuning-for-low-pause/ export CATALINA_OPTS="$CATALINA_OPTS -XX:GCTimeRatio=9"

# more tweaks for Solr performance optimization, from # http://wiki.apache.org/solr/ShawnHeisey export CATALINA_OPTS="$CATALINA_OPTS -XX:+PerfDisableSharedMem" export CATALINA_OPTS="$CATALINA_OPTS -XX:+ParallelRefProcEnabled" export CATALINA_OPTS="$CATALINA_OPTS -XX:G1HeapRegionSize=8m" export CATALINA_OPTS="$CATALINA_OPTS -XX:InitiatingHeapOccupancyPercent=75" export CATALINA_OPTS="$CATALINA_OPTS -XX:+UseLargePages" export CATALINA_OPTS="$CATALINA_OPTS -XX:+AggressiveOpts"

And... so far... so good. I can see we are having far fewer big GC events by watching our system profiler. We are having many more small GC events.

Today, I've been playing a bit more with Miller, specifically grouping and sorting, so I can get a more fine-grained picture of query response, with the tuning noted above in place.

for f in solr.log; do echo "Processing $f..."; grep search $f | sed 's/.*} //' | mlr --repifs --ifs space stats1 -a min,mode,mean,stddev,max -f QTime -g hits then sort -nr hits; echo; echo; done

Which gets me lots of info like this (grouping by hits, reverse-sorting by hits):

hits=18383,QTime_min=5.000000,QTime_mode=5,QTime_mean=5.000000,QTime_stddev=,QTime_max=5.000000 hits=18283,QTime_min=3.000000,QTime_mode=8,QTime_mean=40.402174,QTime_stddev=42.019111,QTime_max=137.000000 hits=18282,QTime_min=2.000000,QTime_mode=6,QTime_mean=18.181592,QTime_stddev=41.169189,QTime_max=577.000000 hits=17403,QTime_min=6.000000,QTime_mode=184,QTime_mean=501.666667,QTime_stddev=816.713577,QTime_max=2068.000000 hits=17043,QTime_min=1.000000,QTime_mode=4,QTime_mean=3.711864,QTime_stddev=1.712538,QTime_max=9.000000 hits=17042,QTime_min=1.000000,QTime_mode=2,QTime_mean=2.278986,QTime_stddev=1.984960,QTime_max=29.000000 hits=17025,QTime_min=1.000000,QTime_mode=2,QTime_mean=3.378571,QTime_stddev=1.990871,QTime_max=11.000000 hits=17024,QTime_min=1.000000,QTime_mode=2,QTime_mean=2.060000,QTime_stddev=1.228379,QTime_max=14.000000 hits=15436,QTime_min=2.000000,QTime_mode=3,QTime_mean=20.684211,QTime_stddev=40.519906,QTime_max=235.000000 hits=14999,QTime_min=5.000000,QTime_mode=6,QTime_mean=8.500000,QTime_stddev=5.686241,QTime_max=17.000000 hits=14878,QTime_min=4.000000,QTime_mode=5,QTime_mean=18.888889,QTime_stddev=25.852681,QTime_max=78.000000 hits=14868,QTime_min=3.000000,QTime_mode=3,QTime_mean=3.750000,QTime_stddev=1.164965,QTime_max=6.000000 hits=11165,QTime_min=7.000000,QTime_mode=7,QTime_mean=490.000000,QTime_stddev=667.476142,QTime_max=1533.000000

I can then see that there might be certain queries that are taking an unusually long time (mode of 184, minimum of 6). I will see if I can find out what that query is and why it's taking so long.

Oh, I should note that John Kerl, the developer of Miller, is right now asking for people to make suggestions for his FAQ page. Miller is awesome, if you have a text parsing or data analysis task you think Miller might be just the tool for, post your challenge to that issue. Someone might help you answer your question.

10/28/2015 UPDATE: it was brought to my attention that the Lucene folks have posted a rather amusing warning against using the G1GC garbage collector and similarly, the Elastic Search folks have done the same (though it's likely just an echo of the Lucene warning). Now, I don't know what to make of this, but I do know that search performance has measurably increased since switching to the G1GC. For now, I think I will choose to quietly worry.