Tuesday, February 28, 2012

Data Visualization . Colorbrewer. Vanity coloring your maps

I love this app that helps you choose just the right color combinations for your layers in maps.

http://colorbrewer2.org/

This is how I used it on my first map:


Cool, isn't it? Yay!
P.S. By the way,darker means more violent...



Monday, February 27, 2012

Pentaho BI. Jpivot - java.util.EmptyStackException


I found my first bug with Pentaho BI while following this tutorial, but the solution was very straightforward, to replace the jpivot jar file....(because the old one was using an incorrect method).

Java.util.EmptyStackException
at java.util.Stack.peek(Stack.java:85)
at mondrian.server.Locus.peek(Locus.java:63)
at mondrian.rolap.SqlMemberSource.getMemberChildren2(SqlMemberSource.java:902)
at mondrian.rolap.SqlMemberSource.getMemberChildren(SqlMemberSource.java:830)
at mondrian.rolap.SqlMemberSource.getMemberChildren(SqlMemberSource.java:804)
at mondrian.rolap.SmartMemberReader.readMemberChildren(SmartMemberReader.java:247)
at mondrian.rolap.SmartMemberReader.getMemberChildren(SmartMemberReader.java:210)
at mondrian.rolap.RolapCubeHierarchy$CacheRolapCubeHierarchyMemberReader.readMemberChildren(RolapCubeHierarchy.java:586)
at mondrian.rolap.RolapCubeHierarchy$CacheRolapCubeHierarchyMemberReader.getMemberChildren(RolapCubeHierarchy.java:682)
at mondrian.rolap.SmartMemberReader.getMemberChildren(SmartMemberReader.java:176)
at mondrian.rolap.RolapSchemaReader.internalGetMemberChildren(RolapSchemaReader.java:170)
at mondrian.rolap.RolapSchemaReader.getMemberChildren(RolapSchemaReader.java:152)
at mondrian.rolap.RolapSchemaReader.getMemberChildren(RolapSchemaReader.java:146)
at mondrian.olap.DelegatingSchemaReader.getMemberChildren(DelegatingSchemaReader.java:78)
at com.tonbeller.jpivot.mondrian.MondrianMemberTree.internalGetChildren(MondrianMemberTree.java:209)
at com.tonbeller.jpivot.mondrian.MondrianMemberTree.getChildren(MondrianMemberTree.java:193)
at com.tonbeller.jpivot.navigator.member.TreeModelAdapter.getChildren(TreeModelAdapter.java:95)

etc...

The solution was given by dulci :

dulci
    Hey,
    I have the solution! Do this!:

    1.- stop-pentaho
    2.- Rename the library biserver-ce/tomcat/webapp/pentaho/WEB-INF/lib/jpivot-1.8.0-100420.jar to jpivot-1.8.0-100420.jar.OLD
    3.- Paste the library that we downloaded from http://www.stratebi.es/todobi/Ene12/jpivot-1.8.0-100420.jar
    4 - start pentaho and that's it! 

And bingo!!! This is the beauty of open source.

Reference:
http://forums.pentaho.com/showthread.php?89323-JPivot-error-want-to-check-the-solution

Friday, February 24, 2012

PostgreSQL - Example Database for Training (pagila improved . 540 MB once installed)

I just migrated from MySQL to PostgreSQL the example database based on pagila (mysql's sakila) but improved, as depicted in this book (which I highly recommend) :

http://www.wiley.com/WileyCDA/WileyTitle/productCd-0470484322.html

And I'm releasing it just like the original authors did, under the LGPL license.

You have to create a user 'wcm' with the necessary privileges in order to load this db.

Compressed text dump 71 mb.
https://rapidshare.com/files/3990485799/wcm.sql.tar.gz

Power Architect entity-relationship diagram.

https://rapidshare.com/files/2635761979/er_diagram_architect.architect

wcm=# select version();                                     version                                                                                        
-------------------------------------------------------------------------------------                                           
 PostgreSQL 9.1.1 on i586-suse-linux-gnu, compiled by gcc (SUSE Linux) 4.6.2, 32-bit                                                  
(1 row) 


This is what the tables look like:
wcm=# set search_path = wcm;
SET
wcm=# \dt+
                            Listado de relaciones
 Esquema |        Nombre        | Tipo  |  Dueño   |   Tamaño   | Descripción 
---------+----------------------+-------+----------+------------+-------------
 wcm     | actor                | tabla | rene     | 6304 kB    | 
 wcm     | country              | tabla | rene     | 16 kB      | 
 wcm     | customer             | tabla | rene     | 32 MB      | 
 wcm     | customer_order       | tabla | rene     | 65 MB      | 
 wcm     | customer_order_line  | tabla | rene     | 107 MB     | 
 wcm     | director             | tabla | rene     | 840 kB     | 
 wcm     | distributor          | tabla | rene     | 624 kB     | 
 wcm     | dvd                  | tabla | rene     | 0 bytes    | 
 wcm     | dvd_release          | tabla | rene     | 23 MB      | 
 wcm     | dvd_release_actor    | tabla | rene     | 22 MB      | 
 wcm     | dvd_release_director | tabla | rene     | 0 bytes    | 
 wcm     | employee             | tabla | rene     | 0 bytes    | 
 wcm     | employee_job         | tabla | rene     | 8192 bytes | 
 wcm     | inventory            | tabla | rene     | 0 bytes    | 
 wcm     | job_description      | tabla | rene     | 8192 bytes | 
 wcm     | language             | tabla | rene     | 0 bytes    | 
 wcm     | lookup_type          | tabla | rene     | 8192 bytes | 
 wcm     | lookup_value         | tabla | rene     | 8192 bytes | 
 wcm     | promoted_dvd_release | tabla | rene     | 0 bytes    | 
 wcm     | promotion            | tabla | rene     | 8192 bytes | 
 wcm     | purchase_order       | tabla | rene     | 0 bytes    | 
 wcm     | purchase_order_line  | tabla | rene     | 0 bytes    | 
 wcm     | region               | tabla | rene     | 8192 bytes | 
 wcm     | warehouse            | tabla | rene     | 8192 bytes | 
 wcm     | website              | tabla | rene     | 8192 bytes | 
(25 filas)


Empty tables : purchase_order,promoted_dvd_release and purchase_order_line can be filled to your wish, the others
are not important and are easy to fill as well. The structure is kind of self explanatory but post a comment should you have any doubts.

PostgreSQL - Compressing text dumps on the fly

This one was missing in my recipes ...


pg_dump -U myUser myDB | gzip > myDB.sql.gz

Then, to restore:

gzip -cd myDB.sql.gz | pg_restore ...

or

or zcat myDB.sql.gz |pg_restore


The "tar" utility is for bundling up a bunch of files and directories into a single file (the name is a contraction of "tape archive"). In that respect, a "tar" file is kind-of like a "zip" file, except that "zip" always implies compression while "tar" does not.

Note finally that "gzip" is not "zip." The "gzip" utility just compresses; it doesn't make archives.


Reference:

http://stackoverflow.com/questions/2379482/shell-scripting-use-a-pipe-as-an-input-for-tar

Thursday, February 23, 2012

MySQL . mysqldump. exporting data with sql inserts to PostgreSQL

Generating a working mysql data dump with inserts to load into PostgreSQL.

Key options for generating the dump.

mysqldump -u user  --extended-insert=FALSE --no-create-info --compact --compatible=postgresql database > file.sql

Converting incompatible characters
sed "s/\\\'/\'\'/g" file.sql > file1.sql

And importing into postgresql
psql database < file1.sql
That was quick, wasn't it? References: http://dev.mysql.com/doc/refman/5.1/en/mysqldump.html http://chriswa.wordpress.com/2008/02/20/mysqldump-data-only/ http://en.wikibooks.org/w/index.php?title=Programming:Converting_MySQL_to_PostgreSQL

MySQL . Error 1396 (HY000) <- nasty bug

The error goes like this:

mysql> select version();
+------------+
| version()  |
+------------+
| 5.5.16-log |
+------------+
1 row in set (0.00 sec)


CREATE USER youruser IDENTIFIED BY 'youruser';
ERROR 1396 (HY000) at line 4: Operation CREATE USER failed for 'youruser'@'%'

The bug is depicted here:

http://bugs.mysql.com/bug.php?id=28331

and the workaround that worked for me:

mysql> delete from mysql.db where user='youruser';
Query OK, 0 rows affected (0.00 sec)

mysql> delete from mysql.user where user='youruser';
Query OK, 0 rows affected (0.00 sec)

mysql> flush privileges;
Query OK, 0 rows affected (0.00 sec)

Another nuance that drives me off from using mysql (without needing to).

Wednesday, February 22, 2012

Pentaho . Chartbeans intro

I'm studying the way to adapt my favorite charting library to pentaho bi. I made a simple map.


http://wiki.pentaho.com/display/ServerDoc2x/Introduction+to+Pentaho+ChartBeans#IntroductiontoPentahoChartBeans-ExamplefromaJavaDeveloperPerspective

The provided example is a bit outdated, I needed to reconfigure some things, but I'm on my way. Once I accomplish it I'll publish a walkthrough tutorial.



Sunday, February 19, 2012

Pentaho. Pentaho Charts

I'm evaluating Pentaho BI, and I'm getting to like it very much. I'm interested in leveraging my own charts through chartbeans (the pentaho project to provide a wrapper to foreign charting libraries) as the ones from jfreechart don't appeal to me, and flash is not supported by some devices ,but they look very impressive like this one:




Chartbeans project:
http://wiki.pentaho.com/display/COM/The+Pentaho+ChartBeans+Project


Saturday, February 18, 2012

PostgreSQL - fuzzystrmatch . distance between strings

I think this feature could be useful for data cleansing, or in general, for tasks related with string comparisons.

http://www.postgresql.org/docs/8.3/static/fuzzystrmatch.html

F.9.2. Levenshtein
This function calculates the Levenshtein distance between two strings:
   levenshtein(text source, text target) returns int
Both source and target can be any non-null string, with a maximum of 255 characters.

test=# Create extension fuzzystrmatch;
CREATE EXTENSION

test=# select levenshtein('john smith','john schmit');
 levenshtein 
-------------
           3
(1 row)


Calculating their degree of similarity between two words sounds easy... but I'll try that after a long nap. One thing that would be awesome, is to somehow implement an efficient auto-complete feature using postgresql...

Visualization - Google charts API

Yet another charting API.
Charts are rendered using HTML5/SVG technology to provide cross-browser compatibility and cross platform portability to iPhones, iPads and Android. No plugins are needed.
http://code.google.com/intl/es-MX/apis/chart/interactive/docs/index.html

















Drawback (sort of):
Can I use charts offline?
No; your computer must have live access to http://www.google.com/jsapi in order to use charts. This is because the visualization libraries that your page requires are loaded dynamically before you use them. The code for loading the appropriate library is part of the included jsapi script, and is called when you invoke the google.load() method. Our terms of service do not allow you to download the google.load or google.visualization code to use offline.


And...the motion chart is fantastic!!!

Tuesday, February 14, 2012

PostgreSQL. Thesaurus dictionary - Normalizing political state names

I want to normalize the name of mexico states to some abbreviation in an automatic fashion as I'll be parsing a lot of documents containing statistical information about different variables in the whole country and they refer to states in heterogeneous ways.
For instance, I want the strings 'edo. de mexico', 'estado de mexico','mexico','edo de mexico' to yield 'mex' instead, which is the abbreviation I've chosen for that particular state (which in time will help me assign it's proper numeric identifier for its respective polygon in maps).

1.Creating the configuration file.
vi /usr/pgsql-9.1/share/tsearch_data/mexican_states.ths

? ? mexico : mex
edo ? mexico : mex
mexico : mex

:wq

2.Creating the text search dictionary
CREATE TEXT SEARCH DICTIONARY mexican_states_ths 
(
TEMPLATE = thesaurus,
DictFile = mexican_states,
Dictionary = pg_catalog.spanish_stem
);

3. Altering the mapping of the text search configuration for the respective language (spanish).
ALTER TEXT SEARCH CONFIGURATION spanish ALTER MAPPING FOR asciiword, asciihword, hword_asciipart WITH mexican_states_ths, spanish_stem;

4. And testing the results.

postgres=# SELECT plainto_tsquery('spanish','edo de mexico');
 plainto_tsquery 
-----------------
 'mex'
(1 row)

postgres=# SELECT plainto_tsquery('spanish','mexico');
 plainto_tsquery 
-----------------
 'mex'
(1 row)

postgres=# SELECT plainto_tsquery('spanish','estado de mexico');
 plainto_tsquery 
-----------------
 'mex'
(1 row)

postgres=# SELECT plainto_tsquery('spanish','edo. de mexico');
 plainto_tsquery 
-----------------
 'mex'
(1 row)


Certainly some previous treatment is necessary, such as lowercasing and unquoting, but that's easier than editing files by hand, which I find kind of silly.

Reference:
http://www.postgresql.org/docs/9.1/static/textsearch-dictionaries.html
While learning how to achieve this, I blew away my spanish text search configuration. Here's how to get it back
http://forums.enterprisedb.com/posts/list/1575.page
i.e.
Go to your $SHAREDIR i.e. (pg_config --sharedir)
Check the contents of file: snowball_create.sql. 

Sunday, February 12, 2012

Virtualbox - could not find a valid medium format for the target disk - vmdk.gz

I just ran into the following issue while importing a virtual appliance (.ova) to virtualbox.
Failed to import appliance /home/rene/Descargas/MusicBrainz NGS 2012-01-12.ova.
Could not find a valid medium format for the target disk '/home/rene/VirtualBox VMs/vm/musicbrainz-server-2012-01-12-disk1.vmdk.gz'.


The thing is that this appliance was generated by vmware with gzip compression enabled for the vmdk disk. I don't know whether there's another workaround, but it turns out that the .ova file is in fact a tar file so :

tar -xvf MusicBrainz\ NGS\ 2012-01-12.ova
gzip -d musicbrainz-server-2012-01-12-disk1.vmdk.gz

And you're all set.

Source:
http://itproctology.blogspot.com/2009/05/ovf-and-ova-dilemma.html
http://communities.vmware.com/thread/283650

Saturday, February 11, 2012

Flotr2 - Found another free charting library

To me, there's no excuse not to present our data in an attractive fashion.
http://www.humblesoftware.com/flotr2/documentation

Introduction Flotr2 is a library for drawing HTML5 charts and graphs. It is a branch of flotr which removes the Prototype dependency and includes many improvements. Features: mobile support framework independent extensible plugin framework custom chart types FF, Chrome, IE6+, Android, iOS lines bars candles pies bubbles
Usage

To use Flotr2, include the flotr2.min.js script in your page and create a visible div
with positive width and height. A graph is drawn with the Flotr.draw(container, data, options) method.

Friday, February 10, 2012

PostgreSQL. Mail merging-Part 2.The plpython function

Let's get this thing done.

Creating an auxiliary view,to get the name of the columns tables and views.
CREATE VIEW columns AS SELECT
  c.relname tab,
  a.attname as col
  FROM pg_catalog.pg_class c
     LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
     LEFT JOIN pg_catalog.pg_attribute a ON a.attrelid = c.relname::regclass   
  WHERE
       n.nspname <> 'pg_catalog'
      AND n.nspname <> 'information_schema'
      AND n.nspname !~ '^pg_toast'
  AND pg_catalog.pg_table_is_visible(c.oid)
  AND c.relkind IN ('v','r')
  AND a.attnum > 0


Creating and populating the table where the "recipients" will be listed.
test=# CREATE TABLE deceased (recipient_name text,name_of_deceased text, his_her text, positive_attribute text,my_signature text);
CREATE TABLE
test=# INSERT INTO deceased values ('Jim Morrison','his',' a great artist'),('Kurt Kobain','his',' a tormented soul'),('John Lenon','his',' a charismatic and charming person'),('Janis Joplin','her',' authentic, a punch of energy ');
INSERT 0 3
UPDATE deceased SET my_signature = 'Rene', recipient_name = 'Mr.X'; 

And after some research and thought, I came up with this:
CREATE OR REPLACE FUNCTION mail_merge(source text,document text,path text) RETURNS void AS 
$$
#source -> the table or view that contains the list
#document -> the template with placeholders matching the field names of source prefixed with colons (:)
#path -> the directory where the generated documents will be created
import re
placeholders = []
doc_text = ""
columns = ""
rv = plpy.execute("SELECT col from columns WHERE tab = %s order by 1" % (plpy.quote_literal(source)))
for x in rv:
  placeholders = placeholders + [x['col']] 
rv = plpy.execute("SELECT document from documents WHERE doc_name =  %s " % (plpy.quote_literal(document)))
for x in rv:
  doc_text = doc_text + x['document']
for i in range(0,len(placeholders)):
  columns = columns + placeholders[i]+" "
  if i < len(placeholders) -1:
    columns = columns +","
rv = plpy.execute("SELECT "+columns+" FROM "+source )
for x in rv: 
  tmp = doc_text
  for i in range(0,len(placeholders)):
    tmp = re.sub(":"+str(placeholders[i]),str(x[placeholders[i]]),tmp)
  #here you create the file containing tmp, in path 
$$ LANGUAGE plpython2u;

With this function you can, based on a table or a view create n documents (where n equals the number of rows in the table or view), based on a template, setup as described in my previous entry. The field names of the table or view will determine the placeholders names.

The more I work with python, the more I like it =-).

Thursday, February 9, 2012

PostgreSQL . Mail merging ....with PostgreSQL ? Part 1

Yep , it's a weird concept, but the idea is to help a friend to automate a rather boring task. With mail merging, I'm talking about having a document template and you need to send it to a list of people. I know what you're thinking...gosh that's what office suites are for, but hell, I want to do it with PostgreSQL! hahahaha

This is my test document in Libre Office Writer:

Dear :recipient_name

I know :name_of_deceased has passed away. I was dismayed to hear of :his_her passing and am writing to extend my deepest sympathies to you and your family.

:name_of_deceased was :positive_attribute
If there is anything I can do to help you through this difficult time, please let me know and I'd be more than willing to offer my assistance.

With warmest regards,
:my_signature


Note that I'm using the following placeholders (variables if you wish):
:name_of_deceased
:his_her
:positive_attribute
:my_signature

I've saved it with a html format so I can manipulate flat text (it would be more elegant to use something like Latex I reckon).

Preprocessing and loading the document template:
  1. Getting rid of the new line character:
  2. -bash-4.2$ tr '\n' ' ' < letter.html | tee > letter_new.html
    
  3. Let's create the table that will hold my documents:
  4. CREATE TABLE documents (id serial primary key,doc_name text, document text);
    
  5. Load the template into the table
  6. letters=#\copy documents (document) FROM 'letter_new.html' WITH  DELIMITER AS '^';
    letters=# UPDATE documents SET doc_name = 'condolences' WHERE id = 1;
    
    I used '^' as delimiter because it doesn't appear in the document, hence, it will be loaded completely into that field.

Now my template looks like this in the database:
letters=# select * from documents;
id       | 1
doc_name | condolences
document | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> <HTML> <HEAD>    <META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=windows-1252">      <TITLE></TITLE>         <META NAME="GENERATOR" CONTENT="LibreOffice 3.4  (Win32)">      <META NAME="CREATED" CONTENT="20120208;22492965">       <META NAME="CHANGED" CONTENT="0;0">     <STYLE TYPE="text/css">         <!--            @page { margin: 2cm }           P { margin-bottom: 0.21cm }     -->     </STYLE> </HEAD> <BODY LANG="es-MX" DIR="LTR"> <P STYLE="margin-bottom: 0cm"><BR> </P> <P STYLE="margin-bottom: 0cm"><BR> </P> <P STYLE="margin-bottom: 0cm">Dear :recipient_name</P> <P STYLE="margin-bottom: 0cm"><BR> </P> <P STYLE="margin-bottom: 0cm">I know :name_of_deceased  has passed away. I was dismayed to hear of  :his_her passing and am writing to extend my deepest sympathies to you and your family.</P> <P STYLE="margin-bottom: 0cm"><BR> </P> <P STYLE="margin-bottom: 0cm">:name_of_deceased was :positive_attribute</P> <P STYLE="margin-bottom: 0cm">If there is anything I can do to help you through this difficult time, please let me know and I'd be more than willing to offer my assistance.</P> <P STYLE="margin-bottom: 0cm"><BR> </P> <P STYLE="margin-bottom: 0cm"><BR> </P> <P STYLE="margin-bottom: 0cm">With warmest regards,</P> <P STYLE="margin-bottom: 0cm">:my_signature</P> </BODY> </HTML>         

In the second part I'll construct a python function that outputs documents based on this template and on a table of deceased people =-). But that'll be after I try to get some sleep ...




Wednesday, February 8, 2012

GeoServer - My first map served

I just served my first map with GeoServer
http://geoserver.org/display/GEOS/Welcome
and it was fairly easy. Their documentation is good enough for not being necessary to explain how did I do it, I just wanted to spread the word about this nice tool.

  This is the contour of my hometown: México City.

I just found some references for tutorials though.
http://gis.stackexchange.com/questions/7987/tutorials-for-geoserver


Python. Why should you learn python?

Today I was confronted with the question: why should anybody learn python?
And...gee, I couldn't give a strong argument besides: "because I like it, it's clear, fun to work with, and I you can do everything with it"

So in this entry I'll "curate" some facts and opinions of people from the web advocating for the Python programming language.
  1. Because it's a de-facto standard in the industry. [1] [2][3]
  2. Excellent  relation between speed/memory consumption/language features [1]
  3. Very easy to learn [1][2]
  4. Elegant and clear syntax [1]:
  5. class BankAccount(object):
        def __init__(self, initial_balance=0):
            self.balance = initial_balance
        def deposit(self, amount):
            self.balance += amount
        def withdraw(self, amount):
            self.balance -= amount
        def overdrawn(self):
            return self.balance < 0
    my_account = BankAccount(15)
    my_account.withdraw(5)
    print my_account.balance
    
  6. A huge variety of modules to accomplish anything you can think of. [1][2]
  7. Straight forward web programming [1][2]
---------------------------------------------------------------------------------------------------------------
Other unsorted reasons from stack overflow:
As for what tasks can be done:
  • Python is extremely fun to develop in.
  • Everything can be done with Python.
  • If something can't be done, you can create an extension for it.
  • Everything can not only be done, but it can be done fast. For example a program that takes you weeks in C++ might take you a day in Python.
  • Great for prototyping, and even for usage in a commercial setting.
As for language features:
  • Scripting language
  • Multi-paradigms (object oriented, imperative, functional)
  • Dynamic type system
  • Automatic memory management
  • Links to other dynamic languages
  • Embedding into other languages
  • Nice py2exe program that converts python scripts to executables for windows. Other platforms include it by default.
  • Strongly typed without static typing
  • Large standard library
  • Includes lists, dictionaries, sets, and much more as basic data types
  • Some implementations of python compile directly to machine code
  • It is actively being worked on
  • It has an antigravity module as noted by Jeremy Cantrell
  • ...
As for resources available:
  • Python is free to use
  • There are a ton of online resources and free tutorials
  • There are a ton of books on Python
  • There is a large community of developers available for help
--
Why should you learn Python?
  • Because it is a modern, elegant, highest level OO language
  • Because it is highly expressive, i.e., you will earn higher productivity
  • Because it comes with "batteries included" i.e. libraries for whatever you want
  • Because it is well documented and has a well-established community
  • Because it is simple to learn (and easy to read!) if you already know another OO language
  • Because it is relatively portable (that is, portable until you use system-specific features)
  • Because it is free and already installed on any modern Linux distribution (and also on Mac OS X and *BSD, I guess)
As for what can be done in Python: everything except extremely low-level things (i.e. kernels, drivers, programming a microcontroller). Also, you should not use Python directly for number crunching (it is far too slow for that), but you can wrap C or C++ number crunching libraries with Python modules (and it has already been done, the most notable examples being Numpy and Scipy).
------------------------------------------------------------------------------------------------------------------
I'll update this post as I find more convincing rationales advocating for python.

Monday, February 6, 2012

PostgreSQL - MVCC revisited

I had previously read about this topic but I needed to grasp it a little bit more, so I made this mental map:



And also deviced a very simple exercise based on this document and this one :


To me, the most important piece of information is this:

Basic idea: tuple is visible if xmin is valid and xmax is not."Valid" means "either committed or the current transaction".

postgres=# CREATE TABLE test(id int, time timestamp);
CREATE TABLE
postgres=# BEGIN;
BEGIN
postgres=# select txid_current();
 txid_current 
--------------
        13756
(1 row)

postgres=# INSERT INTO test values(generate_series(1,100),current_timestamp);
INSERT 0 100
postgres=# SELECT *,xmin,xmax FROM test LIMIT 5;
 id |           time            | xmin  | xmax 
----+---------------------------+-------+------
  1 | 2012-02-06 21:45:00.11556 | 13756 |    0
  2 | 2012-02-06 21:45:00.11556 | 13756 |    0
  3 | 2012-02-06 21:45:00.11556 | 13756 |    0
  4 | 2012-02-06 21:45:00.11556 | 13756 |    0
  5 | 2012-02-06 21:45:00.11556 | 13756 |    0
(5 rows)

postgres=# COMMIT;
COMMIT


Now let's generate some dead tuples:

postgres=# BEGIN;
BEGIN
postgres=# SELECT txid_current();
 txid_current 
--------------
        13758
(1 row)

postgres=# UPDATE test SET time = current_timestamp WHERE id < 6;
UPDATE 100
postgres=# SELECT *,xmin,xmax FROM test LIMIT 5;
 id |            time            | xmin  | xmax 
----+----------------------------+-------+------
  1 | 2012-02-06 21:48:03.560034 | 13758 |    0
  2 | 2012-02-06 21:48:03.560034 | 13758 |    0
  3 | 2012-02-06 21:48:03.560034 | 13758 |    0
  4 | 2012-02-06 21:48:03.560034 | 13758 |    0
  5 | 2012-02-06 21:48:03.560034 | 13758 |    0
(5 rows)

postgres=# DELETE FROM test WHERE id < 6;
DELETE 5
postgres=# SELECT *,xmin,xmax FROM test LIMIT 5;
 id |            time            | xmin  | xmax 
----+----------------------------+-------+------
  6 | 2012-02-06 21:48:03.560034 | 13758 |    0
  7 | 2012-02-06 21:48:03.560034 | 13758 |    0
  8 | 2012-02-06 21:48:03.560034 | 13758 |    0
  9 | 2012-02-06 21:48:03.560034 | 13758 |    0
 10 | 2012-02-06 21:48:03.560034 | 13758 |    0
(5 rows)

postgres=# ROLLBACK;
ROLLBACK

postgres=# SELECT *,xmin,xmax FROM test LIMIT 5;
 id |           time            | xmin  | xmax  
----+---------------------------+-------+-------
  1 | 2012-02-06 21:45:00.11556 | 13756 | 13758
  2 | 2012-02-06 21:45:00.11556 | 13756 | 13758
  3 | 2012-02-06 21:45:00.11556 | 13756 | 13758
  4 | 2012-02-06 21:45:00.11556 | 13756 | 13758
  5 | 2012-02-06 21:45:00.11556 | 13756 | 13758
(5 rows)
How do we get a glance at how many dead tuples in a table?
postgres=# select * from pg_stat_all_tables where relname = 'test';
-[ RECORD 1 ]-----+------------------------------
relid             | 16969
schemaname        | public
relname           | test
seq_scan          | 7
seq_tup_read      | 320
idx_scan          | 
idx_tup_fetch     | 
n_tup_ins         | 100
n_tup_upd         | 105
n_tup_del         | 5
n_tup_hot_upd     | 105
n_live_tup        | 100
n_dead_tup        | 5
last_vacuum       | 
last_autovacuum   | 2012-02-06 21:51:33.926522-06
last_analyze      | 
last_autoanalyze  | 2012-02-06 21:46:33.783941-06
vacuum_count      | 0
autovacuum_count  | 1
analyze_count     | 0
autoanalyze_count | 1

n_dead_tup = number of dead tuples And now, lets VACUUM the table.
postgres=# VACUUM VERBOSE test;
INFO:  vacuuming "public.test"
INFO:  "test": found 5 removable, 100 nonremovable row versions in 1 out of 1 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 100 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.01 sec.
VACUUM


I still feel that I need to dive more into the subject. One thing that would be great is to peep into "deleted" row contents =-).

Sunday, February 5, 2012

Python - SciPy. Concetrate on ideas not on implementations

I just discovered Scipy:
SciPy (pronounced "Sigh Pie") is open-source software for mathematics, science, and engineering. It is also the name of a very popular conference on scientific programming with Python. The SciPy library depends on NumPy, which provides convenient and fast N-dimensional array manipulation. The SciPy library is built to work with NumPy arrays, and provides many user-friendly and efficient numerical routines such as routines for numerical integration and optimization. Together, they run on all popular operating systems, are quick to install, and are free of charge. NumPy and SciPy are easy to use, but powerful enough to be depended upon by some of the world's leading scientists and engineers. If you need to manipulate numbers on a computer and display or publish the results, give SciPy a try!


I wanted to find the inverse of a matrix using plpython in the old way (row operations) ,so I could practice the python syntax, but now I'm just going to use SciPy.

This is the guide to the linear algebra module:
http://docs.scipy.org/doc/scipy/reference/tutorial/linalg.html

postgres=# CREATE OR REPLACE FUNCTION inverse_matrix (m text)
RETURNS text
AS $$
from scipy import linalg
from numpy import matrix
return linalg.inv(matrix(m));
$$ LANGUAGE plpython2u;

postgres=# select inverse_matrix($$'[1 3 5; 2 5 1; 2 3 8]'$$);
-[ RECORD 1 ]--+----------------------
inverse_matrix | [[-1.48  0.36  0.88]
               |  [ 0.56  0.08 -0.36]
               |  [ 0.16 -0.12  0.04]]


As simple as that! =-)

Friday, February 3, 2012

Data Visualization - Word clouds

I'm using this javascript library to create simple word clouds:
http://code.google.com/p/nebulos/wiki/UserGuide
I just love creative people and javascript!


This is another example using:
https://github.com/DukeLeNoir/jQCloud

Wednesday, February 1, 2012

Postgresql - Domains

I'm studying good database design principles and discovered a beautiful concept : domains.

You have a SQL database and you want a column to be restrained to one of a few values.

If the list will never change, and ordering is preset or doesn't matter: use an ENUM
If the list might change but changes require programming, OR the list will not change but needs to be alpha sorted: use a DOMAIN
If the list will change, and changes do not require other programming: use a lookup table.


http://nigel.mcnie.name/blog/the-five-minute-guide-to-postgresql-domains-enums-on-steroids


For instance, I defined the following domains to enforce data validation to some extent.

CREATE DOMAIN general_text TEXT
 CONSTRAINT max_length CHECK (LENGTH( VALUE ) < 255 )
 ;

CREATE DOMAIN email_string TEXT
 CONSTRAINT valid_email_form CHECK ( VALUE ~ '_*@_*._*')
 CONSTRAINT valid_email_length CHECK ( LENGTH(VALUE) > 5)
 ;
 
CREATE DOMAIN number_string TEXT
 CONSTRAINT valid_number CHECK (VALUE ~ '[0-9]+')
 ;
 
CREATE DOMAIN telephone_string TEXT
 CONSTRAINT valid_telephone_form CHECK (VALUE ~ '[0-9\-]+')
 CONSTRAINT valid_telephone_length CHECK (LENGTH(VALUE) >= 8 )
 ;

CREATE DOMAIN counter int
 CONSTRAINT positive_number CHECK ( VALUE >= 0)
 ;

Defining a table that uses them:

CREATE TABLE editoriales(
  id serial primary key,
  nombre general_text not null,
  direccion general_text,
  estado_id int,
  pais_id int,
  codigo_postal number_string , 
  telefono telephone_string
);


The thing that got me into their use, is that they can be treated as if they were tables ,you can modify them / add new constraints and ... they are reusable!!!