Friday, August 30, 2013

PostgreSQL. Handling timestamps with timezone

Regarding timestamps and their proper handling, the world of databases is packed with bittersweet stories. With PostgreSQL this issue is rather straightforward if you are aware of the following principles:

- PostgreSQL doesn't store timezone metadata, it converts every timestamp with timezone you throw at it to UTC, and stores it that way.
- If you use timestamp without timezone fields, they're stored using the local timezone valid for the client session, in this way, we know nothing about relative to what that timestamp was stored in the database.
- To present timestamps with timezone to users it converts "on the fly" timestamps to the client's configured timezone.

show timezone;

select now();
 2013-08-30 00:39:55.82538-05

CREATE TABLE test_timestamptz(
id serial primary key,
mytimestamptz timestamp with time zone,
mytimestamp timestamp without time zone

INSERT INTO test_timestamptz SELECT 1,now(),now();

select * from test_timestamptz;
 id |        mytimestamptz         |        mytimestamp        
  1 | 2013-08-30 00:48:19.55422-05 | 2013-08-30 00:48:19.55422

set timezone = 'UTC';

select * from test_timestamptz;
 id |        mytimestamptz         |        mytimestamp        
  1 | 2013-08-30 05:48:19.55422+00 | 2013-08-30 00:48:19.55422

There are some provisions for us not to shoot ourselves in the foot when dealing with timezones, but I'll write about that later.

Friday, August 16, 2013

PostgreSQL. pg_memcache

In high performance applications caching alleviates a great deal of database workload, let's propose a little example using pg_memcache.


yum install memcached libmemcached libmemcached-devel

Download and install pg_memcache

tar -xzf pgmemcache_2.0.6.tar.bz2
cd pgmemcache
sudo make install 

Tuesday, August 6, 2013

Pentaho Data Integration (Kettle) on Fedora 19 x86_64 Core dumped. Problematic frame: C []

You issue the script on Fedora 19 x86_64, it breaks with the following error:

# A fatal error has been detected by the Java Runtime Environment:
#  SIGSEGV (0xb) at pc=0x00000031e926d9c1, pid=10487, tid=140492000786176
# JRE version: 7.0_25-b15
# Java VM: Java HotSpot(TM) 64-Bit Server VM (23.25-b01 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  []  soup_session_feature_detach+0x11
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
# An error report file with more information is saved as:
# /home/darwin/data-integration/hs_err_pid10487.log
# If you would like to submit a bug report, please visit:
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
# line 166: 10487 Aborted                 (core dumped) "$_PENTAHO_JAVA"

It turns out that it is an eclipse issue as depicted here:

After updating to GNOME 3.8 with libsoup 2.42.0, Eclipse reliably crashes after a few seconds of using:

the workaround is modifying your file as follows


Note that the thing you need to add to the JAVA options is:

Should problems still arise, try adding this option too:
or the equivalent path in your system to xulrunner

and that's it, I hope it helps.

Saturday, June 22, 2013

python. plotting the normal distribution in 3,2,1, ... go!

Just for reference in case I needed later or should someone find it useful, this is the shortest path I've found to graphing the normal distribution using python

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
items = []
with open('data','r') as f:
    for line in f:
plt.plot(items, norm.pdf(items,np.mean(items),np.std(items)))

Tuesday, June 18, 2013

python. geocoding whith geopy

What if you have a bunch of mail addresses and need to get approximate coordinate positions? That's what geocoding is all about, and with geopy is as easy as it gets.

from geopy import geocoders
g = geocoders.GoogleV3()
place, (lat,lng) = g.geocode("AV. DIVISION DEL NORTE NO. 911-A COLONIA DEL VALLE BENITO JUAREZ")
print "%s: %.5f, %.5f" % (place, lat, lng)

And then, as if it were some sort of magic:

Avenida División del Norte 911, General Anaya, Benito Juarez, Mexico City, Federal District, Mexico: 19.35731, -99.15289


geopy includes geocoder classes for MediaWiki (with the GIS extension), Semantic MediaWiki, the Google v2 geocoder, the Yahoo! geocoder,geocoder.usBing Maps API, and GeoNames. The geocoder classes are located in geopy.geocoders.

Wednesday, May 29, 2013

splitting words into syllables with python

There is one very easy way to achieve it:

pip install pyhyphen

from hyphen import Hyphenator
from hyphen.dictools import *

for lang in ['es_MX', 'es_SP']:
        if not is_installed(lang): install(lang)
h_mx = Hyphenator('es_MX')
print h_mx.syllables(u"modernismo")

For more info:

Just in case you wondered, I need this as I'm working on mastering a mnemonic system based on the phonetic sound of words.

Monday, May 27, 2013

A virtual IP? What for? #linux

Layman's definition.
 It's an additional IP you assign to an interface by creating a virtual interface upon a real interface.

What for?
 For high availability scenarios, in which a service connects to a virtual IP pointing to a service (i.e. a database) .
1. The service breaks and a standby takes over bringing up the same virtual IP address automatically.
2. The application server reconnects to the standby as if nothing happened.

What kind of IP?
 Any available IP on the same network as the primary interface.


- An application server is associated to a database on
(address pointing to the database's virtual interface)
- The database server breaks down
- Bring down its virtual interface (if possible)
- On the standby server bring up the virtual interface with a virtual IP address.
ifconfig p2p1:0 inet netmask

- Refresh the arp cache on a neighbouring server (that in which the application making use of the service is hosted).
arping -U -w 1  -I p2p1

(no response is expected)

- From the same neighbour:

64 bytes from icmp_seq=1 ttl=64 time=0.538 ms
64 bytes from icmp_seq=2 ttl=64 time=0.318 ms
64 bytes from icmp_seq=3 ttl=64 time=0.320 ms
64 bytes from icmp_seq=4 ttl=64 time=0.623 ms
64 bytes from icmp_seq=5 ttl=64 time=0.278 ms
64 bytes from icmp_seq=6 ttl=64 time=0.287 ms
64 bytes from icmp_seq=7 ttl=64 time=0.197 ms
64 bytes from icmp_seq=8 ttl=64 time=0.280 ms

- Provided a standby database takes over as a part of a failover procedure, it must take over the same virtual IP.

- Care must be taken to make sure that the old database server's virtual interface gets brought down so it doesn't cause havoc (lookup the term STONITH)
- If using Dynamic Host Configuration Protocol make sure the virtual IP doesn't get assigned to any host automatically by restricting automatic leases to a lower range.


Tuesday, May 7, 2013

python-fabric. Easing multiple servers administration

I'm deploying a test scenario involving 16 pristine virtual machines and as a lazy administrator I'd hate to do the same configuration 16 times, that's what python-fabric is useful for.

From the official website:

What is Fabric?

As the README says:
Fabric is a Python (2.5 or higher) library and command-line tool for streamlining the use of SSH for application deployment or systems administration tasks.
More specifically, Fabric is:
  • A tool that lets you execute arbitrary Python functions via the command line;
  • A library of subroutines (built on top of a lower-level library) to make executing shell commands over SSH easy and Pythonic.
Naturally, most users combine these two things, using Fabric to write and execute Python functions, or tasks, to automate interactions with remote servers. Let’s take a look.

This is my first script which copies my public key into the test machines:

from fabric.api import local,env

hosts = []
for i in range(1,17):
   hosts.append( "box%d" % i)

env.hosts= hosts
def test_ssh():
         local("""ssh -o PasswordAuthentication=no root@%s "ifconfig | grep  -E  'inet.*192.168.0' | sed 's/\s\s*/ /g' | cut -d' ' -f3" """
      except SystemExit as e:
         local("ssh-copy-id root@%s"

If you have to administer multiple machines on a daily basis, I advice you give python-fabric a try.

Sunday, April 28, 2013

Sunday project. My desperate dictionary "attack"

I was stuck with one word of this game:
you look at 4 pictures and guess the word that correlates them. I couldn't go further so, I decided to use brute force.

Objective: Get all the possible permutations of the given letters (12 ) of the given size (5) and match them against a dictionary.

darwin@evolution:~/spwords> wget
darwin@evolution:~/spwords> unrar Spanish.rar

darwin@evolution:~/spwords> file Spanish.dic
Spanish.dic: ISO-8859 text, with CRLF line terminators

The database has unicode encoding, so lets do some encoding conversion.

idarwin@evolution:~/spwords> iconv -f ISO-8859-1 -t UTF-8 Spanish.dic > Spanish.dic.unicode
darwin@evolution:~/spwords> dos2unix Spanish.dic.unicode

I'm using my favorite RDBMS (PostgreSQL) for the word matching.

postgres=# CREATE TABLE words(id serial primary key, word varchar, word_unnacented varchar);
NOTICE:  CREATE TABLE creará una secuencia implícita «words_id_seq» para la columna serial «»
NOTICE:  CREATE TABLE / PRIMARY KEY creará el índice implícito «words_pkey» para la tabla «words»
postgres=# \copy words(word) FROM 'Spanish.dic.unicode'
postgres=# select count(*) from words;

postgres=# CREATE EXTENSION unaccent;
postgres=# UPDATE words set word_unnacented = unaccent(word);
UPDATE 413527
postgres=# CREATE INDEX words_word_idx ON words(word_unnacented varchar_pattern_ops);

postgres=# create extension plpython2u;

 CREATE OR REPLACE FUNCTION match_word(letters varchar,len int) RETURNS TABLE (match varchar)
AS $$
import itertools
result = []
for i in itertools.permutations(letters,len):
        rs = plpy.execute("SELECT word_unnacented FROM words WHERE word_unnacented = '%s' " % ''.join(i).lower())
        result += [(r['word_unnacented']) for r in rs]
return result
$$ LANGUAGE plpython2u;

postgres=# select distinct match_word('srtkjncnmonu',5) order by 1;


It took 10 seconds to yield the results (Intel Pentium Dual Core) .

It turns out that the correct word, which I found through this "attack", was "curso", that had in my opinion nothing to do with the pictures. Now I can continue playing =-) .