
Delayed Solr Indexing With Sunspot and Resque


Overview

So, you've set up full-text search with WebSolr, sunspot and sunspot_rails. Although it may work perfectly, there are a couple more tweaks that will improve both application speed and Solr index availability.

By default, Sunspot::Rails sends an update to the Solr index after every create/update/delete request, which can add up to 200ms to each request. However, updating the index is a non-critical operation that can be performed in a background process with minimal impact on the app's performance.

Delayed Indexing

Normally, the high-level life cycle of a request in your app looks like this: the app receives a create/update/delete request -> the create/update/delete is performed on the DB -> the record is re-indexed and sent to the Solr index, and the app waits for the response -> the app returns its response. With asynchronous indexing it becomes: the app receives a create/update/delete request -> the create/update/delete is performed on the DB -> a re-index job is queued, with no waiting -> the app returns its response.

This is easy to achieve when using sunspot with delayed_job: all you need to do is add handle_asynchronously :solr_index and handle_asynchronously :remove_from_index to your model after the searchable block (sketched below). However, if you are using resque for background processing, it is a little trickier.
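
For comparison, here is roughly what that delayed_job variant looks like; this is only a sketch, since this post uses Resque instead:

class Job < ActiveRecord::Base
  searchable do
    text :title
  end

  # delayed_job wraps sunspot_rails' indexing methods in background jobs
  handle_asynchronously :solr_index
  handle_asynchronously :remove_from_index
end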

Luckily, I found this gist by sunspot creator Nick Zadorozhny that makes it easy to implement delayed indexing with Sunspot and Resque.

In my case (using the models from the previous post):

app/models/job.rb
class Job < ActiveRecord::Base
  belongs_to :employer

  searchable :auto_index => false, :auto_remove => false do
    text    :title, boost: 2.0
    text    :description, boost: 1.5
    ...
  end

  after_commit   :resque_solr_update, if: :persisted?
  before_destroy :resque_solr_remove

  def resque_solr_update
    Resque.enqueue(SolrUpdate, self.class.to_s, id)
  end

  def resque_solr_remove
    Resque.enqueue(SolrRemove, self.class.to_s, id)
  end
end
app/workers/solr_update.rb
# Loads the record and (re)indexes its document in Solr
class SolrUpdate
  @queue = :solr

  def self.perform(classname, id)
    classname.constantize.find(id).solr_index
  end
end
app/workers/solr_remove.rb
# Removes the document with the given class/id from the Solr index
class SolrRemove
  @queue = :solr

  def self.perform(classname, id)
    Sunspot.remove_by_id(classname, id)
  end
end

A couple of things to notice here. First, note :auto_index => false, :auto_remove => false in the searchable declaration: this turns off automatic indexing. Also note the use of the after_commit callback; it ensures the record is already saved in the database by the time the worker tries to process the job. Finally, notice that I added if: :persisted? to the after_commit callback, since resque_solr_update would otherwise be triggered on delete operations as well as create/update.
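
As an aside, Rails can express the same guard with after_commit's :on option, which restricts the callback to specific actions; a sketch of the equivalent declaration (the if: :persisted? version above works just as well):

  # Only enqueue the re-index job after create and update commits,
  # so destroys never trigger resque_solr_update
  after_commit :resque_solr_update, on: :create
  after_commit :resque_solr_update, on: :update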

Testing this behavior is pretty simple. I use RSpec, resque_spec and test_after_commit in this example.
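
If these gems are not in your Gemfile yet, they belong in the test group (gem names as published on rubygems.org):

Gemfile
group :test do
  gem 'rspec-rails'
  gem 'resque_spec'
  gem 'test_after_commit'
end

With that in place, here are the model specs: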

spec/models/job/job_delayed_indexing_spec.rb
it "creating a Job should que a job in SolrUpdate" do
  job = FactoryGirl.create(:job)
  SolrUpdate.should have_queued("Job", job.id).in(:solr)
end

it "destroying a Job should que a job in SolrRemove" do
  job = FactoryGirl.create(:job)
  job.destroy
  SolrRemove.should have_queued("Job", job.id).in(:solr)
end
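
Note that resque_spec collects enqueued jobs in an in-memory store, so depending on your setup you may also want to clear it between examples, for instance in spec_helper (a sketch):

spec/spec_helper.rb
RSpec.configure do |config|
  config.before(:each) do
    # empty resque_spec's in-memory queues before each example
    ResqueSpec.reset!
  end
end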

And a couple of specs for the workers:

spec/workers/solr_update_spec.rb
before do
  @employer = FactoryGirl.create(:employer)
  @attrs    = FactoryGirl.attributes_for(:job)
end

it "performing a job changes the number of indexed docs by 1" do
  expect do
    j = @employer.jobs.create(@attrs)
    SolrUpdate.perform("Job", j.id)
    Sunspot.commit
  end.to change(Job, :indexed).by(1)
end
spec/workers/solr_remove_spec.rb
before do
  @employer = FactoryGirl.create(:employer)
  @attrs    = FactoryGirl.attributes_for(:job)
end

it "performing a job changes the number of indexed docs by -1" do
  job = @employer.jobs.create(@attrs)
  SolrUpdate.perform("Job", job.id)
  Sunspot.commit
  id = job.id
  expect do
    job.destroy
    SolrRemove.perform("Job", id)
    Sunspot.commit
  end.to change(Job, :indexed).by(-1)
end
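
These worker specs rely on a Job.indexed helper that is not shown in this post (presumably defined alongside the models from the previous one). If you need to define it yourself, a minimal sketch that counts the Job documents currently in the Solr index via Sunspot's search total could look like this (the method name and pagination values are just illustrative assumptions):

app/models/job.rb
def self.indexed
  # ask Solr for Job documents; we only need the total, so fetch a single row
  search { paginate page: 1, per_page: 1 }.total
end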

Using autoCommit for Solr updates

If you try to run the code above, you will probably notice that changes are never committed to the Solr index (which is why I had to call Sunspot.commit explicitly in my specs). This is because the workers use the solr_index and remove_by_id methods instead of solr_index! and remove_by_id!, which commit changes to Solr immediately. We did this for a reason: using solr_index! in production is highly NOT recommended, since it can cause a high number of commits to the index (especially when running a large number of updates on the data), which may ultimately result in 503 errors and your index becoming unavailable for search operations. Instead, to get your changes into the index, use the autoCommit option in solrconfig.xml, which commits pending changes every so often, or after a certain number of documents have been indexed:

app/solr/conf/solrconfig.xml
<autoCommit>
  <maxDocs>10000</maxDocs> <!-- commit after this many pending documents -->
  <maxTime>6000</maxTime>  <!-- or after this many milliseconds -->
</autoCommit>

Now, changes to the index will be committed every 6 seconds (maxTime is in milliseconds) or after every 10000 pending documents, whichever comes first.

To conclude, combining asynchronous indexing with Resque and Solr's autoCommit can improve both your app's performance and your index's availability.
