Previous topic

Simulated time series

Next topic

Direct information from a multiple sequence alignment

Anagram solver

Binder

In this example, we train a model to predict a missing letter in a 10-letter word. We then use the natural definition of energy in the model to solve anagrams by descending the energy function. We use a data set of words that were scraped from news sites, blogs, and Twitter.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import fem, os, time, pickle
data_dir = '../../../data/words'
print 'number of threads: %i' % (fem.fortran_module.fortran_module.num_threads(),)
cache = True
number of threads: 8

We load the word data into to the words then select a sample of the words in the variabel x to train the model.

In [2]:
# load word data
n = 10
words = np.loadtxt(os.path.join(data_dir, '%02i.txt' % (n,)), dtype='U%i' % (n,))

# select a sample of words and separate into array of letters
l = int(1e6)
words_sample = np.random.choice(words, size=l, replace=False)
x = np.array([np.array(list(word)) for word in words_sample]).T
print 'word length: %i, number of words: %i' % x.shape
word length: 10, number of words: 1000000

Next, we create a model model and fit it to the word sample using model.fit(x, ...).

We plot the model parameter \(W\) heat map and the running discrepancies calculuated during the fit below.

In [3]:
model_file = os.path.join(data_dir, '%02i.pkl' % (n,))
if cache and os.path.exists(model_file):
    with open(model_file, 'r') as f:
        model = pickle.load(f)
    print 'loaded cached model'
else:
    model = fem.discrete.model()
    start = time.time()
    model.fit(x, overfit=False, svd='exact')
    end = time.time()
    print 'fit time: %.02f minutes' % ((end-start)/60.,)
    with open(model_file, 'w') as f:
        pickle.dump(model, f)

fig, ax = plt.subplots(1, 2, figsize=(12,6))
for d in model.disc:
    ax[0].plot(1+np.arange(len(d)), d, 'k-')
ax[0].set_xlabel('iteration')
ax[0].set_ylabel('discrepancy')

scale = np.abs(model.w[1]).max()
ax[1].matshow(model.w[1], aspect='equal', cmap='seismic', vmin=-scale, vmax=scale)
plt.show()
loaded cached model
../../_images/notebooks_discrete_02_anagram_solver_5_1.png

The natural definition of the energy of the word \(w\) given the model with parameters \(W\) is \(E(\sigma)=-\frac12\sigma^TW\sigma\) where \(\sigma\) is the one-hot encoding of \(w\). We may use this energy to design a Markov chain Monte Carlo approach to solve anagrams, that is to rearrage the letters in a given permuted word with permuted letters to recover the original word. The function solve_anagram below uses the simulated annealing approach to attempt to solve anagrams under the assumption that the original word minimizes the energy function constrained to the set of permuations of the input anagram.

Given an anagram \(w\), we propose a modified anagram \(\tilde{w}\) by swapping a random pair of adjacent letters in \(w\).

We accept the proposal with probability \(\exp\left({E(w)-E(\tilde{w})\over T}\right)\) where \(T\) is a temperature that decays exponentially from \(T_{max}>1\) at step 0 to \(T_{min}<1\) at step max_steps. We record the anagram with the lowest energy encountered and return that as the final guess.

We plot the running energies for a number of example anagrams below. Additionally, we tabulate the solution, initial anagram, and the best guess as well as their energies below.

In [4]:
def solve_anagram(anagram, max_steps=int(2e5)):

    def propose(anagram, pairs):
        """Propose a new anagram by swapping adjacent letters"""
        pair = pairs[np.random.randint(n-1)]
        proposal = np.array(list(anagram))
        proposal[pair] = proposal[pair][::-1]
        proposal = ''.join(proposal)
        return proposal

    def accept(anagram_energy, proposal_energy, t=1.0):
        """Accept proposed anagram with simulated annealing probability"""
        p = np.exp((anagram_energy - proposal_energy) / t)
        return np.random.uniform() < p

    def temperature(step, max_steps, t_min=1e-2, t_max=1e2):
        """Return temperature given progress to max_steps"""
        return t_max * np.exp(step / float(max_steps) * np.log(t_min / t_max))

    # pairs of neighboring letter indices
    pairs = np.stack([range(n)[:-1], range(n)[1:]]).T

    # initial energy
    anagram_energy = model.energy(anagram)
    energy_history = [[0, anagram_energy]]

    # best guess
    answer = [anagram, anagram_energy]

    for step in range(max_steps):

        # propose new anagram and compute energy
        proposal = propose(anagram, pairs)
        proposal_energy = model.energy(proposal)

        if accept(anagram_energy, proposal_energy, temperature(step, max_steps)):

            # update current guess
            anagram, anagram_energy = proposal, proposal_energy
            energy_history.append([step, anagram_energy])

            # record buest guess
            if anagram_energy < answer[1]:
                answer = [anagram, anagram_energy]

    energy_history = np.array(energy_history).T
    return answer, energy_history

tab, e = [], []
# for a random selection of words
for word in np.random.choice(words, size=5, replace=False):
    # make an anagram by permuting the letters in the word
    anagram = ''.join(np.random.permutation(list(word)))
    # solve the anagram with a simulated annealing approach
    answer, energy_history = solve_anagram(anagram)
    tab.append([word, model.energy(word), anagram, model.energy(anagram), answer[0], answer[1]])
    e.append(energy_history)

fig = plt.figure(figsize=(12,6))
ax = plt.gca()
for i, ei in enumerate(e):
    ax.plot(*ei, lw=0.5, label=tab[i][0])
ax.set_xlabel('proposal number')
ax.set_ylabel('proposal energy')
plt.legend()
plt.show()

tab = pd.DataFrame(data=tab, columns=['word', 'energy', 'anagram', 'energy', 'answer', 'energy'])
tab
../../_images/notebooks_discrete_02_anagram_solver_7_0.png
Out[4]:
word energy anagram energy answer energy
0 antiquated -12.952297 tqeiutdaan 0.094032 antiquated -12.952297
1 submission -12.171518 iosumisbsn 0.337162 submission -12.171518
2 consulting -11.747188 cnnitgulso -0.964849 consulting -11.747188
3 allocation -9.194562 ooaanltlci 1.586637 coallation -11.280581
4 shopkeeper -5.826653 oerhspeepk 1.823975 pekeershop -8.900936

There are adversarial examples to show the anagram solver. For example, we demonstrate that the search repeatedly guesses “impression” given an anagram of “permission” because the model is in a lower energy state at “impression.”

In [5]:
# adversarial examples:
# permission <-> impression
# nonduality <-> undylation
# treatments <-> restatement

word = 'permission'

print 'word: %s, energy: %f' % (word, model.energy(word))

tab = []
for attempt in range(5):
    anagram = ''.join(np.random.permutation(list(word)))
    answer = solve_anagram(anagram)[0]
    tab.append([anagram, model.energy(anagram), answer[0], answer[1]])

tab = pd.DataFrame(data=tab, columns=['anagram', 'energy', 'answer', 'energy'])
tab
word: permission, energy: -10.574511
Out[5]:
anagram energy answer energy
0 emrpoinssi -1.859870 impression -13.795883
1 sirspmeoni -2.762323 impression -13.795883
2 meriposins -1.256509 impression -13.795883
3 oneiirsmsp -0.925506 impression -13.795883
4 nismespiro -0.899383 impression -13.795883