Machine Learning made simple with Ruby

How can you get automatic classification working properly without resorting to external prediction services? Starting from Bayesian classification, you can use the Ruby gem classifier-reborn to create a Latent Semantic Indexer. Hands on!

Estimated reading time: 8 minutes

At the moment, I’m working on a personal project that lets users insert announcements by adding links to a platform.

A periodic task analyses the links and downloads metadata such as titles, descriptions and photos from the pages’ Open Graph tags (with a fallback for pages that don’t specify them).
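
As a rough illustration of the fallback idea, here is a minimal sketch that extracts only the title. The MetadataExtractor module and its regexes are hypothetical and not part of the project; a real implementation would use a proper HTML parser, and this version assumes the property attribute comes before content.

```ruby
# Hypothetical helper (not from the project): extracts a page title from raw
# HTML, preferring the Open Graph tag and falling back to <title>.
# Regexes keep the sketch dependency-free; use an HTML parser in real code.
module MetadataExtractor
  # Assumes `property` appears before `content` inside the <meta> tag.
  OG_TITLE = /<meta[^>]+property=["']og:title["'][^>]+content=["']([^"']*)["']/i
  HTML_TITLE = %r{<title[^>]*>(.*?)</title>}im

  def self.title(html)
    match = html.match(OG_TITLE) || html.match(HTML_TITLE)
    match && match[1].strip
  end
end

with_og = '<head><meta property="og:title" content="Bike for sale"><title>Shop</title></head>'
without_og = '<head><title> Used sofa </title></head>'

MetadataExtractor.title(with_og)    # => "Bike for sale" (Open Graph wins)
MetadataExtractor.title(without_og) # => "Used sofa" (fallback to <title>)
```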

At this point, the administrator has to be able to assign a category to the announcement and publish it.

Classifiers

Typically, admins of platforms like this have a lot to do and, to be honest, manually assigning a category to each and every announcement isn’t exactly fun.

So, I asked myself if it would be possible to develop a system that, with adequate training, would be able to automatically suggest an appropriate category for newly inserted content.

Since every announcement has to be categorized by hand anyway, introducing an automatic classifier won’t make the administrators’ job any harder, even if its suggestions turn out to be wrong.

I originally decided to use a Bayesian classifier to achieve my goal, and settled on the Ruby gem classifier-reborn. The gem offers several classification algorithms and, reading through its documentation, I realized that another of them would suit my case better than a Bayesian classifier: a Latent Semantic Indexer, to be precise.
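
For readers unfamiliar with the Bayesian approach, here is a minimal sketch of a generic multinomial naive Bayes classifier with add-one smoothing. TinyBayes is a made-up name, and this is not how classifier-reborn implements its Bayes classifier; it only illustrates the idea of scoring each category by how likely its training words make the new text.

```ruby
# Illustrative naive Bayes classifier (hypothetical, not the gem's code).
class TinyBayes
  def initialize
    @word_counts = Hash.new { |h, k| h[k] = Hash.new(0) } # category => word => count
    @doc_counts = Hash.new(0)                             # category => trained texts
  end

  def train(category, text)
    @doc_counts[category] += 1
    tokens(text).each { |word| @word_counts[category][word] += 1 }
  end

  # Returns the category with the highest log-probability for the text.
  def classify(text)
    vocabulary = @word_counts.values.flat_map(&:keys).uniq.size
    total_docs = @doc_counts.values.sum.to_f
    @doc_counts.keys.max_by do |category|
      counts = @word_counts[category]
      total = counts.values.sum
      prior = Math.log(@doc_counts[category] / total_docs)
      # Add-one (Laplace) smoothing keeps unseen words from zeroing the score.
      prior + tokens(text).sum { |w| Math.log((counts[w] + 1.0) / (total + vocabulary)) }
    end
  end

  private

  def tokens(text)
    text.downcase.scan(/[a-z]+/)
  end
end

bayes = TinyBayes.new
bayes.train(:dog, 'dogs bark and fetch')
bayes.train(:cat, 'cats purr and nap')
bayes.classify('my dogs fetch') # => :dog
```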

Latent Semantic Indexer

In practice, LSI is based on the principle that words used in a particular context tend to have a similar meaning.
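
To make the principle a little more concrete, the sketch below compares texts as bag-of-words vectors using cosine similarity. It is deliberately simpler than LSI: a real Latent Semantic Indexer also applies a singular value decomposition to the term-document matrix so that related words end up close together, a step omitted here. The helper names term_vector and cosine_similarity are my own.

```ruby
# Bag-of-words vector: maps each lowercase word to its number of occurrences.
def term_vector(text)
  text.downcase.scan(/[a-z]+/).tally
end

# Cosine similarity between two sparse term vectors (0.0 = no shared terms).
def cosine_similarity(a, b)
  dot = a.sum { |term, count| count * b.fetch(term, 0) }
  norm = ->(v) { Math.sqrt(v.values.sum { |c| c * c }) }
  denominator = norm.(a) * norm.(b)
  denominator.zero? ? 0.0 : dot / denominator
end

dogs       = term_vector('This text deals with dogs. Dogs.')
other_dogs = term_vector('This text involves dogs too. Dogs!')
cats       = term_vector('This text revolves around cats. Cats.')

# The two dog texts share more weighted terms than the dog and cat texts.
cosine_similarity(dogs, other_dogs) > cosine_similarity(dogs, cats) # => true
```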

In addition to offering a classification method, the gem also lets you work in reverse and search for similar texts. This kind of functionality could prove useful in the future.

Here’s an example of this functionality, taken from the README:

require 'classifier-reborn'
lsi = ClassifierReborn::LSI.new
strings = [ ["This text deals with dogs. Dogs.", :dog],
            ["This text involves dogs too. Dogs! ", :dog],
            ["This text revolves around cats. Cats.", :cat],
            ["This text also involves cats. Cats!", :cat],
            ["This text involves birds. Birds.",:bird ]]
strings.each {|x| lsi.add_item x.first, x.last}

lsi.search("dog", 3)
# returns => ["This text deals with dogs. Dogs.", "This text involves dogs too. Dogs! ",
#             "This text also involves cats. Cats!"]

lsi.find_related(strings[2], 2)
# returns => ["This text revolves around cats. Cats.", "This text also involves cats. Cats!"]

lsi.classify "This text is also about dogs!"
# returns => :dog

OK, choice made! Let’s start writing some code!

Training

LSI classifiers have to be trained: that is, they need examples that associate a given text with a particular category. In a system that already contains manually classified publications, this training is simple to carry out.
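
A minimal sketch of how such a training set could be assembled from already-categorized announcements follows. PublishedAd is a stand-in Struct used so the example runs without Rails, and training_set is a hypothetical helper; in the application the data would come from the real model.

```ruby
# Stand-in for the ActiveRecord model (assumption: it exposes title and
# category_id), so the sketch runs without Rails.
PublishedAd = Struct.new(:title, :category_id, keyword_init: true)

# Builds the training set: one { title:, category_id: } hash per
# announcement that already has a category.
def training_set(ads)
  ads.reject { |ad| ad.category_id.nil? }
     .map { |ad| { title: ad.title, category_id: ad.category_id } }
end

ads = [
  PublishedAd.new(title: 'Mountain bike, barely used', category_id: 1),
  PublishedAd.new(title: 'Two-seater sofa', category_id: 2),
  PublishedAd.new(title: 'Not yet categorized', category_id: nil)
]

training = training_set(ads)
training.size # => 2 (the uncategorized announcement is skipped)
```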

The training code runs inside an asynchronous task. Because training is time-consuming, the task is executed only periodically.

We can define the behaviour of our Command Object using the specs:

require 'rails_helper'

describe ClassifyAdvertisements do
  let(:command) { described_class.new(advertisements, training) }
  let(:advertisements) { [build(:advertisement, title: 'foo')] }
  let(:training) do
    [
      { title: 'foo baz', category_id: 1 },
    ]
  end
  let(:classifier) { instance_double('ClassifierReborn::LSI') }

  before do
    allow(ClassifierReborn::LSI).
      to receive(:new).
          and_return(classifier)

    allow(classifier).
      to receive(:add_item)
  end

  describe '#initialize' do
    it 'takes the advertisements to classify and the training as constructor parameters' do
      command
    end

    it 'trains the classifier with the training advertisements title and category id' do
      command
      expect(classifier).to have_received(:add_item).with('foo baz', 1)
    end
  end

  describe '#classify!' do
    let(:storer) { instance_double('StoreClassifications') }

    before do
      allow(StoreClassifications).
        to receive(:new).
            with(advertisements[0], classifier).
            and_return(storer)

      allow(storer).
        to receive(:execute!)
    end

    before { command.classify! }

    it 'classifies the advertisements' do
      expect(storer).to have_received(:execute!)
    end
  end
end

The idea is to write a class whose constructor accepts the announcements to be classified and a list of hashes with title and category_id keys. The values of these hashes are passed to the classifier as part of the training process.

The classification proper is delegated by the #classify! method to a second object, instantiated once for each announcement to be classified.

At this point, we can only imagine what the interface of this object will look like: it will probably receive the announcements to be classified and the pre-trained classifier.

Now for the implementation that makes these specs pass:

class ClassifyAdvertisements
  def initialize(advertisements, training)
    @advertisements = advertisements
    @training = training

    train_classifier!
  end

  def classify!
    advertisements.each do |advertisement|
      StoreClassifications.new(advertisement, classifier).execute!
    end
  end

  private

  attr_reader :advertisements, :training

  def train_classifier!
    training.each do |t|
      classifier.add_item(t[:title], t[:category_id])
    end
  end

  def classifier
    @classifier ||= ClassifierReborn::LSI.new
  end
end

Check that everything has gone well:

$ bundle exec rspec spec/commands/classify_advertisements_spec.rb
...

Finished in 0.11143 seconds (files took 3.88 seconds to load)
3 examples, 0 failures

Perfect! Now we have to implement the object that carries out the classification proper, saving the results in the database and updating the state of the announcement.

Classification

We already have an idea of the interface; now let’s define the behaviour through the specs:

require 'rails_helper'

describe StoreClassifications do
  let(:command) { described_class.new(advertisement, classifier) }
  let(:advertisement) { create(:advertisement, title: 'foo', category: nil) }
  let(:classifier) { instance_double('ClassifierReborn::LSI') }
  let!(:category) { create(:category) }

  it 'takes an advertisement and a classifier as constructor parameters' do
    command
  end

  describe '#execute!' do
    context 'on success' do
      before do
        allow(classifier).
          to receive(:classify).
              with('foo').
              and_return(category.id)
      end

      before { command.execute! }

      it 'updates the advertisement category' do
        expect(advertisement.reload.category_id).to eq(category.id)
      end

      it 'updates the status of the advertisement' do
        expect(advertisement.reload.classified?).to be(true)
      end
    end

    context 'if the classifier was not trained' do
      before do
        allow(classifier).
          to receive(:classify).
              with('foo').
              and_raise(Vector::ZeroVectorError)
      end

      before { command.execute! }

      it 'does not update the advertisement category' do
        expect(advertisement.reload.category_id).to be(nil)
      end

      it 'does not update the status of the advertisement' do
        expect(advertisement.reload.classified?).to be(false)
      end
    end
  end
end

And now the code to satisfy the tests:

class StoreClassifications
  def initialize(advertisement, classifier)
    @advertisement, @classifier = advertisement, classifier
  end

  def execute!
    ActiveRecord::Base.transaction do
      advertisement.update_attributes(category_id: classified_category_id)
      advertisement.classified!
    end
  rescue Vector::ZeroVectorError => e
    Rails.logger.error "error classifying advertisement #{advertisement.id}: #{e.message}"
  end

  private

  attr_reader :advertisement, :classifier

  def classified_category_id
    @classified_category_id ||= classifier.classify(advertisement.title)
  end
end

Let’s check that everything’s OK:

$ bundle exec rspec spec/commands/store_classifications_spec.rb
.....

Finished in 0.16245 seconds (files took 3.78 seconds to load)
5 examples, 0 failures

Conclusions

Let’s see if it works:

pry(main)> foo_category = Category.create(name: 'foo')
=> #<Category:0x007fd9737ada10>

pry(main)> bar_category = Category.create(name: 'bar')
=> #<Category:0x007fd97770cb48>

pry(main)> training = [Advertisement.create(title: 'this is something about foo', category: foo_category), Advertisement.create(title: 'this is something about bar', category: bar_category)]
=> [#<Advertisement:0x007fd9778e61d0>, #<Advertisement:0x007fd9793aedc8>]

pry(main)> ad = Advertisement.create(title: "an article to be categorized talking about foo", category: nil)
=> #<Advertisement:0x007fd97a128490>

pry(main)> ClassifyAdvertisements.new([ad], training.map { |t| t.slice(:title, :category_id) }).classify!

pry(main)> ad.category.name
=> "foo"

I’m satisfied with the results achieved by this implementation, although a few issues remain. For example, the classifier is currently retrained on all of the published announcements every time the task is executed. A better solution could be to save the trained classifier’s data once training is finished and to update it, synchronously or asynchronously, whenever published announcements change.
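
One possible shape for that improvement is sketched below with Ruby’s standard Marshal module. ModelCache and TrainedModel are hypothetical names, and the sketch uses a plain Struct in place of the classifier: whether a trained ClassifierReborn::LSI instance can be serialized this way is an assumption that would need checking against the gem.

```ruby
require 'tmpdir'

# Stand-in for a trained classifier; any Marshal-serializable object works.
TrainedModel = Struct.new(:items)

# Hypothetical cache: loads a previously saved model from disk, or builds
# it with the given block and saves it for the next run.
class ModelCache
  def initialize(path)
    @path = path
  end

  def fetch
    return Marshal.load(File.binread(@path)) if File.exist?(@path)

    yield.tap { |model| File.binwrite(@path, Marshal.dump(model)) }
  end

  # Call when published announcements change, to force retraining.
  def invalidate!
    File.delete(@path) if File.exist?(@path)
  end
end

Dir.mktmpdir do |dir|
  cache = ModelCache.new(File.join(dir, 'classifier.dump'))
  builds = 0
  2.times { cache.fetch { builds += 1; TrainedModel.new({ 'foo' => 1 }) } }
  builds # => 1 (the expensive training block ran only once)
end
```

Only load dumps your own application has written, since Marshal.load is not safe on untrusted data.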
