Tuesday, February 14, 2012

PostgreSQL. Thesaurus dictionary - Normalizing political state names

I want to normalize the name of mexico states to some abbreviation in an automatic fashion as I'll be parsing a lot of documents containing statistical information about different variables in the whole country and they refer to states in heterogeneous ways.
For instance, I want the strings 'edo. de mexico', 'estado de mexico','mexico','edo de mexico' to yield 'mex' instead, which is the abbreviation I've chosen for that particular state (which in time will help me assign it's proper numeric identifier for its respective polygon in maps).

1.Creating the configuration file.
vi /usr/pgsql-9.1/share/tsearch_data/mexican_states.ths

? ? mexico : mex
edo ? mexico : mex
mexico : mex

:wq

2.Creating the text search dictionary
CREATE TEXT SEARCH DICTIONARY mexican_states_ths 
(
TEMPLATE = thesaurus,
DictFile = mexican_states,
Dictionary = pg_catalog.spanish_stem
);

3. Altering the mapping of the text search configuration for the respective language (spanish).
ALTER TEXT SEARCH CONFIGURATION spanish ALTER MAPPING FOR asciiword, asciihword, hword_asciipart WITH mexican_states_ths, spanish_stem;

4. And testing the results.

postgres=# SELECT plainto_tsquery('spanish','edo de mexico');
 plainto_tsquery 
-----------------
 'mex'
(1 row)

postgres=# SELECT plainto_tsquery('spanish','mexico');
 plainto_tsquery 
-----------------
 'mex'
(1 row)

postgres=# SELECT plainto_tsquery('spanish','estado de mexico');
 plainto_tsquery 
-----------------
 'mex'
(1 row)

postgres=# SELECT plainto_tsquery('spanish','edo. de mexico');
 plainto_tsquery 
-----------------
 'mex'
(1 row)


Certainly some previous treatment is necessary, such as lowercasing and unquoting, but that's easier than editing files by hand, which I find kind of silly.

Reference:
http://www.postgresql.org/docs/9.1/static/textsearch-dictionaries.html
While learning how to achieve this, I blew away my spanish text search configuration. Here's how to get it back
http://forums.enterprisedb.com/posts/list/1575.page
i.e.
Go to your $SHAREDIR i.e. (pg_config --sharedir)
Check the contents of file: snowball_create.sql. 

No comments:

Post a Comment