Sunday, April 28, 2013

Sunday project. My desperate dictionary "attack"

I was stuck on one word in this game:
you look at 4 pictures and guess the word that connects them. I couldn't get any further, so I decided to use brute force.

Objective: generate every permutation of length 5 from the 12 given letters and match each one against a dictionary.
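The objective can be sketched in plain Python before involving a database. This is only an illustration under assumptions: `find_words` and the three-word `sample_dictionary` are hypothetical names made up for the example, and a real run would load the full word list instead.

```python
from itertools import permutations

def find_words(letters, size, dictionary):
    """Return the dictionary words spelled by some `size`-letter permutation of `letters`."""
    # A set drops the duplicate candidates caused by repeated letters (e.g. three n's).
    candidates = {"".join(p).lower() for p in permutations(letters, size)}
    return sorted(candidates & set(dictionary))

# Hypothetical three-word dictionary, only for illustration.
sample_dictionary = {"curso", "perro", "norte"}
print(find_words("srtkjncnmonu", 5, sample_dictionary))  # prints ['curso']
```

With 12 letters and size 5 there are 12 * 11 * 10 * 9 * 8 = 95,040 ordered arrangements, which is small enough to enumerate directly.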

darwin@evolution:~/spwords> wget
darwin@evolution:~/spwords> unrar Spanish.rar

darwin@evolution:~/spwords> file Spanish.dic
Spanish.dic: ISO-8859 text, with CRLF line terminators

The database uses Unicode (UTF-8) encoding, while the file is ISO-8859, so let's convert it.

darwin@evolution:~/spwords> iconv -f ISO-8859-1 -t UTF-8 Spanish.dic > Spanish.dic.unicode
darwin@evolution:~/spwords> dos2unix Spanish.dic.unicode
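For reference, the same two steps (re-encoding and line-ending conversion) can be sketched in Python. This is not what I ran; `latin1_to_utf8` is a hypothetical helper, and the sample bytes are invented for the example.

```python
def latin1_to_utf8(raw):
    """Decode ISO-8859-1 bytes, turn CRLF into LF, and return UTF-8 bytes."""
    text = raw.decode("iso-8859-1")
    return text.replace("\r\n", "\n").encode("utf-8")

# Two invented dictionary lines with CRLF terminators, as ISO-8859-1 bytes.
sample = b"ni\xf1o\r\ncami\xf3n\r\n"
print(latin1_to_utf8(sample).decode("utf-8"))
```

To convert a real file, one would read it in binary mode, pass the bytes through the helper, and write the result back in binary mode.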

I'm using my favorite RDBMS (PostgreSQL) for the word matching.

postgres=# CREATE TABLE words(id serial primary key, word varchar, word_unnacented varchar);
NOTICE:  CREATE TABLE will create implicit sequence "words_id_seq" for serial column ""
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "words_pkey" for table "words"
postgres=# \copy words(word) FROM 'Spanish.dic.unicode'
postgres=# select count(*) from words;

postgres=# CREATE EXTENSION unaccent;
postgres=# UPDATE words set word_unnacented = unaccent(word);
UPDATE 413527
postgres=# CREATE INDEX words_word_idx ON words(word_unnacented varchar_pattern_ops);
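What the unaccent step does to each word can be approximated outside the database with Unicode decomposition. This is a rough sketch, not PostgreSQL's implementation: the extension works from its own rules file, so results may differ for some characters, and `strip_accents` is a hypothetical name.

```python
import unicodedata

def strip_accents(word):
    """Remove combining marks after canonical decomposition (an approximation of unaccent)."""
    decomposed = unicodedata.normalize("NFD", word)
    # Keep only the base characters; accents and tildes are combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents(u"cami\u00f3n"))  # prints camion
```

Storing the unaccented form in its own column lets the game letters, which carry no accents, be compared with a plain equality test.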

postgres=# create extension plpython2u;

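CREATE OR REPLACE FUNCTION match_word(letters varchar, len int) RETURNS TABLE (match varchar)
AS $$
import itertools
# Prepare the lookup once; the $1 parameter avoids building SQL by string formatting.
plan = plpy.prepare("SELECT word_unnacented FROM words WHERE word_unnacented = $1", ["varchar"])
result = []
# Try every ordered arrangement of `len` letters and keep those found in the dictionary.
for i in itertools.permutations(letters, len):
    rs = plpy.execute(plan, [''.join(i).lower()])
    result += [r['word_unnacented'] for r in rs]
return result
$$ LANGUAGE plpython2u;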

postgres=# select distinct match_word('srtkjncnmonu',5) order by 1;


It took 10 seconds to return the results (Intel Pentium Dual Core).
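The function runs one lookup per permutation, 95,040 in total. A different technique, which I did not use here, is to skip the permutations and check each dictionary word's letter counts against the available letters. A minimal sketch, assuming the word list fits in memory; `can_spell`, `match_by_counts` and the sample words are hypothetical.

```python
from collections import Counter

def can_spell(word, letters):
    """True if `word` uses each letter no more often than it appears in `letters`."""
    available = Counter(letters.lower())
    needed = Counter(word.lower())
    return all(available[ch] >= n for ch, n in needed.items())

def match_by_counts(letters, size, dictionary):
    """Return the words of length `size` that can be built from `letters`."""
    return sorted(w for w in set(dictionary) if len(w) == size and can_spell(w, letters))

# Invented sample words, only for illustration.
print(match_by_counts("srtkjncnmonu", 5, {"curso", "norte", "cursos"}))  # prints ['curso']
```

This does one pass over the dictionary regardless of how many letters are given, so it scales better than enumerating permutations as the word size grows.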

It turns out that the correct word, which I found through this "attack", was "curso", which in my opinion had nothing to do with the pictures. Now I can continue playing =-)