Thursday, May 10, 2012

Python - Closures. Functions that build other functions

I'm studying Dive Into Python 3 and I'm pretty much in love with it.
http://getpython3.com/diveintopython3/

From Chapter 6. Closures and Generators.
Guess what? Functions are just another kind of object in Python!

The technique of using the values of outside parameters within a dynamic function is called closures:
import re

def build_match_and_apply_functions(pattern,search,replace):
    '''
    dynamically builds a tuple of two functions (matches_rule, apply_rule) for each (pattern, search, replace) tuple received;
    the free variables pattern, search and replace get bound to the actual parameters passed to the function
    '''
    def matches_rule(word):
        return re.search(pattern,word)
    def apply_rule(word):
        return re.sub(search,replace,word)
    return (matches_rule,apply_rule)
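# (my own example, not from the book) each returned pair remembers the
# pattern/search/replace it was built with, e.g.:
#   matches_sxz, applies_sxz = build_match_and_apply_functions('[sxz]$','$','es')
#   applies_sxz('fix')  ->  'fixes'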

# for each of these patterns, two functions will be built: a matches_rule and an apply_rule function
patterns = (
    ('[sxz]$','$','es'),
    ('[^aeioudgkprt]h$','$','es'),
    ('(qu|[^aeiou])y$','y$','ies'),
    ('$','$','s')
)


# this is called a list comprehension in Python terms;
# in this case the result is a list containing tuples of matches_rule and apply_rule functions
rules = [build_match_and_apply_functions(pattern,search,replace)
         for (pattern, search,replace) in patterns]

print(rules)

def plural(noun):
    for matches_rule, apply_rule in rules:
        if matches_rule(noun):
            return apply_rule(noun)
        
nouns = ["suffix","trash","lady","cat","math","boy","day",'pita','vacancy']
for noun in nouns:
    print(plural(noun))

I'm still in the process of assimilation, and this is pretty cool stuff.

Wednesday, April 25, 2012

Python - An insignificant script to grade tests

I'm just backing up this script to track my progress with Python. It ain't a great thing; it just grades a bunch of tests from some students.

f = open('/home/rene/Dropbox/respuestas.txt','r')
o = open('/home/rene/Dropbox/hoja_resultados.txt','w')
r = open('/home/rene/Dropbox/cals_uaem.txt','r')
respuestas = f.readline().split()
secciones = f.readline().split()    
temas = f.readline().split(',')
f.close()
reactivos = [""]*29
counter = 0
for i in r:
    reactivos[counter] = i.split()
    counter = counter + 1
r.close()
temas_estudiar = []
preguntas_por_seccion = {'BI':9,'ETL':6,'SQL':4,'BI-Documentos':3}
aciertos = {'BI':0,'ETL':0,'SQL':0,'BI-Documentos':0}
errores = {'BI':0,'ETL':0,'SQL':0,'BI-Documentos':0}
ptc_aciertos = {'BI':0.0,'ETL':0.0,'SQL':0.0,'BI-Documentos':0.0}
for i in range(0,29):
    for j in range(0,22):
        if (j!=5):
            if (respuestas[j] == reactivos[i][j+1]):
                aciertos[secciones[j]] = aciertos[secciones[j]] + 1
            else:
                errores[secciones[j]] = errores[secciones[j]] + 1
                temas_estudiar.append(temas[j])
        else:
            aciertos[secciones[j]] = aciertos[secciones[j]] + 1       
    num_aciertos = aciertos['BI']+aciertos['SQL']+aciertos['ETL']+aciertos['BI-Documentos']
    
    o.write("\nNum.examen: ")
    o.write(reactivos[i][0]+'\n') 
    o.write("score: "+str(num_aciertos)+" de 22\n")  
    o.write("Aciertos\n")
    o.write("BI ETL SQL BI-Documentos\n") 
    o.write(str(aciertos['BI'])+" "+str(aciertos['ETL'])+" "+str(aciertos['SQL'])+" "+str(aciertos['BI-Documentos'])+'\n')
    o.write("Errores\n")
    o.write("BI ETL SQL BI-Documentos\n") 
    o.write(str(errores['BI'])+" "+str(errores['ETL'])+" "+str(errores['SQL'])+" "+str(errores['BI-Documentos'])+'\n')
    o.write("Promedios\n")
    o.write("BI ETL SQL BI-Documentos\n") 
    o.write("%.1f"%(float(aciertos['BI'])/float(preguntas_por_seccion['BI']))+" "+"%.1f"%(float(aciertos['ETL'])/float(preguntas_por_seccion['ETL']))+" "+"%.2f"%(float(aciertos['SQL'])/float(preguntas_por_seccion['SQL']))+' '+"%.2f"%(float(aciertos['BI-Documentos'])/float(preguntas_por_seccion['BI-Documentos']))+'\n')
    o.write("Promedio final: "+"%.2f"%(float(num_aciertos)/22.0)+"\n")
    aciertos['BI'] = 0
    aciertos['ETL'] = 0
    aciertos['SQL'] = 0
    aciertos['BI-Documentos'] = 0
    errores['BI'] = 0
    errores['ETL'] = 0
    errores['SQL'] = 0
    errores['BI-Documentos'] = 0
    o.write("Temas a estudiar: \n")
    for tema in temas_estudiar:  # renamed from 'i' so it doesn't shadow the outer loop variable
        o.write(tema+",")
    o.write('\n')
    del temas_estudiar[:]
o.close()

Things I learned:
The write function doesn't put a new line character at the end.
http://docs.python.org/release/1.5.1p1/tut/range.html
The range function is not inclusive of its second parameter.
To get a better understanding of formatting operations.
http://docs.python.org/release/2.4.4/lib/typesseq-strings.html
To use the "del" statement to clear a list, i.e. del myList[:]
To use the split function.
To make lists of lists such as: mlist3 = [[0]*3]*3
(careful: that repeats the same inner list three times; see the sketch below)
http://www.daniweb.com/software-development/python/threads/58916/making-a-list-of-lists
To use dictionaries
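
Here's a quick sketch of those points with toy values (nothing from the script above):

nums = list(range(0, 5))            # range excludes its stop value: [0, 1, 2, 3, 4]
parts = "a b c".split()             # split on whitespace: ['a', 'b', 'c']
del nums[:]                         # clears the list in place: []
grid = [[0] * 3] * 3                # careful: all three rows are the SAME list object
grid[0][0] = 1                      # ...so this sets column 0 of every row to 1
safe = [[0] * 3 for _ in range(3)]  # rows built independently, no aliasing
counts = {'BI': 0}                  # a dictionary, like aciertos/errores above
counts['BI'] = counts['BI'] + 1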

Saturday, April 21, 2012

WaveMaker - Rapid Web Application Development

I'm giving WaveMaker a try. I love coding, but what I love is coding algorithms... the logic behind applications. I don't fancy myself a GUI developer, nor do I think I would be good at it... anyway:

http://www.wavemaker.com/product/

Here's a screenshot of my hello wavey world web application, which only took 2 minutes to build:


Did I mention that it has an Apache license?


Saturday, April 14, 2012

Pentaho - Kettle. Java options. xulrunner dependency

I stumbled upon an old error I had forgotten. Many applications rely on the xulrunner library in order to work. As its location varies from system to system, in some cases it's necessary to set it "by hand". This is done through Java options; for instance, within your Java options you could include something like this:


-Dorg.eclipse.swt.browser.XULRunnerPath=/usr/lib/xulrunner-2/xulrunner

i.e. you need to edit your spoon.sh file to include that option, for example:

OPT="$OPT $PENTAHO_DI_JAVA_OPTIONS -Djava.library.path=$LIBPATH -DKETTLE_HOME=$KETTLE_HOME -DKETTLE_REPOSITORY=$KETTLE_REPOSITORY -DKETTLE_USER=$KETTLE_USER -DKETTLE_PASSWORD=$KETTLE_PASSWORD -DKETTLE_PLUGIN_PACKAGES=$KETTLE_PLUGIN_PACKAGES -DKETTLE_LOG_SIZE_LIMIT=$KETTLE_LOG_SIZE_LIMIT    -Dorg.eclipse.swt.browser.XULRunnerPath=/usr/lib/xulrunner-2/xulrunner"


I was having this issue with Pentaho PDI 4.2 (kettle):


/home/avr/Downloads/data-integration
INFO  14-04 00:36:53,711 - Spoon - Logging goes to file:///tmp/spoon_d782dc5d-85f3-11e1-b128-1b67dca12e64.log
WARN  14-04 00:36:55,703 - could not parse [vertical] as Align value
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x01ced76c, pid=5084, tid=3077724992
#
# JRE version: 7.0-b147
# Java VM: Java HotSpot(TM) Client VM (21.0-b17 mixed mode, sharing linux-x86 )
# Problematic frame:
# C  [libxul.so+0xd5076c]  NS_InvokeByIndex_P+0x5d90
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/avr/Downloads/data-integration/hs_err_pid5084.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.sun.com/bugreport/crash.jsp
#
./spoon.sh: line 144:  5084 Aborted                 (core dumped) "$_PENTAHO_JAVA" $OPT $STARTUP -lib $LIBPATH "${1+$@}"

Which vanished once the aforementioned option was set.

I dealt with this issue a year ago, but gosh, I couldn't remember it until after 3 hours of trial and error. Fuck, lack of sleep shuts down my long-term memory!

Thursday, March 29, 2012

Business Intelligence Course - Materials

Hi folks, here I'll be posting the material needed for our course, so download it and bring it to the corresponding session. Keep a bookmark to this post for quick reference. The password in all cases is the first name of the coordinator of the Intelligent Systems program followed by the password of your virtual machines (all lowercase).

#########################################################################################################
Session of March 31, 2012.

Slides

https://rapidshare.com/files/1964967587/diap_0331.zip

SQL script for the dw_imdb database:
https://rapidshare.com/files/493712796/dw_imdb.sql.tar.gz

Entity-relationship diagram of the dw_imdb database
https://rapidshare.com/files/3472566836/entidad_relacion_dw_imdb.pdf

Don't forget to bring your entity-relationship diagram of the WCM database.

#########################################################################################################

Session of April 14, 2012

Sorry for the delay, we've been very busy.

Transformations
https://rapidshare.com/files/1028060750/transformaciones.tar.gz

#########################################################################################################

Friday, March 23, 2012

PostgreSQL - Getting the size of an index

I just stumbled upon this gentle piece of info:
http://postgresql.cc/postgres-index-size


How to Get a Postgres Table’s Indexes’ Size

Run the following SQL command, substituting the name of your table, to get the size of that Postgres table’s index data:

postgres=# select pg_size_pretty(pg_total_relation_size('test_table') - pg_relation_size('test_table'));
 pg_size_pretty
----------------
 24 kB
(1 row)

The function pg_total_relation_size() gets the size of both the row data and indexes for the named table. We subtract the size of the tables row data (pg_relation_size()) to get the size of the table’s indexes. The pg_size_pretty() function then puts the size in readable units (bytes, kilobytes, megabytes or gigabytes).
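
If you want the same arithmetic applied to every table at once, a little psycopg2 sketch along these lines should do (the DSN and the 'public' schema are assumptions on my part):

import psycopg2

conn = psycopg2.connect("dbname=test")  # assumption: a local database named 'test'
cur = conn.cursor()
cur.execute("""
    SELECT tablename,
           pg_size_pretty(pg_total_relation_size(tablename::regclass)
                          - pg_relation_size(tablename::regclass))
    FROM pg_tables
    WHERE schemaname = 'public'
    ORDER BY 1
""")
for table, index_size in cur.fetchall():
    print("%s: %s" % (table, index_size))
cur.close()
conn.close()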


As I intend to explain the importance of surrogate keys tomorrow, this tip will be very useful. =-)

Wednesday, March 14, 2012

Protovis playground - Disregard this please

I'm just playing around with the data visualization library called "protovis". There's a new kid in town called d3.js, but I aim to develop my own visualizations with a product that uses protovis. It would be more proper to set up a web page independently, but I'm too lazy for that, and I would like to show my intermediate results whilst playing with it, so please disregard this post for the moment... =-).







PostgreSQL - Generating fake data with Common Table Expressions

I'm filling an orders table with sample data for a SQL course. I have two tables, employee (who placed the order) and distributor, and the task is to generate 1,000,000 orders between 2005 and 2012. The query below generates the rows (wrap it in an INSERT INTO ... SELECT to actually load them):

WITH RECURSIVE source (counter) AS ( SELECT 1 UNION ALL SELECT counter +1 FROM source where counter < 1000000)
SELECT 
(random() * (select max(employee_id) - min(employee_id) from wcm.employee))::int + (select min(employee_id) from wcm.employee) employee_id,
(random() * (select max(distributor_id) - min(distributor_id) from wcm.distributor))::int + (select min(distributor_id) from wcm.distributor) distributor_id,
( '2005-01-01'::date + ((now()::date - '2005-01-01'::date)::int * random())::int ) order_date
FROM source;


I love CTEs!!!! Don't you?

PostgreSQL - Foreign key checking. Permission denied for schema even as superuser

I was getting the following error:

wcm=> select version();
                                                      version                                                      
-------------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.0.4 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-48), 64-bit
(1 Zeile)

wcm=# delete from employee;
ERROR:  permission denied to the schema wcm
ZEILE 1: SELECT 1 FROM ONLY "wcm"."inventory" x WHERE $1 OPERATOR(pg_...
                            ^
ANFRAGE:  SELECT 1 FROM ONLY "wcm"."inventory" x WHERE $1 OPERATOR(pg_catalog.=) "employee_id" FOR SHARE OF x

The explanation (from the thread "permission denied for schema even as superuser") goes as follows:

That's a foreign key checking query. FK checks are done as the owner of the target table, not as the user who did the original query. So your problem is that the owner of the table lacks permissions to access the other table (or more specifically, the schema it's in).

In other words, the fix is to grant that owner access to the schema, e.g. GRANT USAGE ON SCHEMA wcm TO the owning role.

Reference:
http://postgresql.1045698.n5.nabble.com/permission-denied-for-schema-even-as-superuser-td4373036.html

Thanks Tom Lane wherever you are, may the force be with you!!

Monday, March 5, 2012

Latex - Beamer + minted + Pygments

The minted package allows you to highlight source code according to its language type by using pygments.
http://tug.ctan.org/tex-archive/macros/latex/contrib/minted/

rene@linux-2561:~> sudo zypper in python-Pygments
rene@linux-2561:~/bin> unzip minted.zip 
Archive:  minted.zip
   creating: minted/
  inflating: minted/minted.dtx       
  inflating: minted/minted.ins       
  inflating: minted/minted.pdf       
  inflating: minted/README           
  inflating: minted/Makefile 
rene@linux-2561:~/bin/minted> python --version
Python 2.7.2
rene@linux-2561:~/bin/minted> make
tex minted.ins
This is TeX, Version 3.1415926 (TeX Live 2011)
(./minted.ins (/usr/share/texmf/tex/latex/base/docstrip.tex
Utility: `docstrip' 2.5d <2005/07/29>
English documentation    <1999/03/31>

**********************************************************
* This program converts documented macro-files into fast *
* loadable files by stripping off (nearly) all comments! *
**********************************************************

********************************************************
* No Configuration file found, using default settings. *
********************************************************

)

Generating file(s) minted.sty 

Processing file minted.dtx (package) -> minted.sty
Lines  processed: 1296
Comments removed: 994
Comments  passed: 0
Codelines passed: 296

 )
No pages of output.
Transcript written on minted.log

rene@linux-2561:~/bin/minted> sudo cp minted.sty /usr/share/texmf/tex/latex/minted/

rene@linux-2561:~> sudo texhash
texhash: Updating /etc/texmf/ls-R... 
texhash: Updating /var/lib/texmf/main/ls-R... 
texhash: Updating /usr/share/texmf/../../../var/lib/texmf/dist/ls-R... 
texhash: Updating /var/cache/texmf/fonts/ls-R... 
texhash: Updating /var/lib/texmf/ls-R... 
texhash: Done.

This is what I've got so far:



Source
\documentclass[slidestop,compress,mathserif]{beamer}
\usepackage{minted}
\usetheme{Antibes}
\usecolortheme{lily}
\title{PostgreSQL}
\author{@darwinian}
\begin{document}
\begin{frame}
\titlepage
\end{frame}
\begin{frame}[fragile]
\begin{itemize}
\item Generating factorials
\end{itemize}
\begin{minted}{sql}
WITH RECURSIVE source (counter,product) AS
 ( SELECT 1,1 UNION ALL
     SELECT counter +1 , product * (counter +1)
         FROM source WHERE counter < 10)
SELECT counter, product FROM source;
\end{minted}
\end{frame}
\end{document}
Other references: http://en.wikibooks.org/wiki/LaTeX/Packages/Installing_Extra_Packages

Sunday, March 4, 2012

Latex.- The powerdot class for presentations

I started working with the powerdot document class. I find it extremely easy to use and powerful.
http://ctan.org/tex-archive/macros/latex/contrib/powerdot

This is my test presentation.


Which I achieved with the following code:

\documentclass{powerdot}
\begin{document}
\title{PostgreSQL replication concepts}
\author{Rene Romero Benavides}
\date{March 4,2012}
\begin{slide}{PostgreSQL}
\begin{itemize}
\item Here is the first line on my slide
\pause
\item And now here's the next
\end{itemize}
\end{slide}
\end{document}

References.
www.bakoma-tex.com/doc/latex/powerdot/powerdot.pdf
http://csgsa.cs.clemson.edu/seminar/tools06/docbook/html/ch02s04.html#id2396476
http://www.miwie.org/presentations/html/powerdot.html
doctorh.umwblogs.org/files/2010/12/courseC.pdf
http://www.dmi.me.uk/blog/2010/11/08/creating-a-presentation-with-latex-and-powerdot/


Mmmm... second guessing, I'll try beamer too.

Friday, March 2, 2012

PostgreSQL - Forcing index scans

I'm following a presentation by Bruce Momjian titled "Explaining the Postgres Query Optimizer", and I found an exercise where you can force the query optimizer to use index scans (RAF told me about this not long ago too). I don't quite understand the value of doing this, since if a value is present in a table many times it could be more costly to carry out an index scan versus a sequential scan, but... you never know.

SET enable_seqscan = false;
SET enable_bitmapscan  = false;

WITH letter (letter, count) AS ( SELECT letter, count(*) from sample GROUP by 1) SELECT letter AS "1", count,(SELECT * FROM lookup_letter(letter) AS "12" limit 1) AS lookup_letter FROM letter ORDER BY 2 DESC

So now it uses an index scan every time, regardless of cost.

1 | count |                              lookup_letter                               
---+-------+--------------------------------------------------------------------------
 p |   199 | Index Scan using i_sample on sample  (cost=0.00..39.05 rows=199 width=2)
 s |     9 | Index Scan using i_sample on sample  (cost=0.00..21.91 rows=9 width=2)
 c |     8 | Index Scan using i_sample on sample  (cost=0.00..19.64 rows=8 width=2)
 r |     7 | Index Scan using i_sample on sample  (cost=0.00..19.63 rows=7 width=2)
 t |     5 | Index Scan using i_sample on sample  (cost=0.00..15.09 rows=5 width=2)
 d |     4 | Index Scan using i_sample on sample  (cost=0.00..15.07 rows=4 width=2)
 v |     4 | Index Scan using i_sample on sample  (cost=0.00..15.07 rows=4 width=2)
 f |     4 | Index Scan using i_sample on sample  (cost=0.00..15.07 rows=4 width=2)
 _ |     3 | Index Scan using i_sample on sample  (cost=0.00..12.80 rows=3 width=2)
 a |     3 | Index Scan using i_sample on sample  (cost=0.00..12.80 rows=3 width=2)
 u |     3 | Index Scan using i_sample on sample  (cost=0.00..12.80 rows=3 width=2)
 e |     2 | Index Scan using i_sample on sample  (cost=0.00..8.27 rows=1 width=2)
 i |     1 | Index Scan using i_sample on sample  (cost=0.00..8.27 rows=1 width=2)
 k |     1 | Index Scan using i_sample on sample  (cost=0.00..8.27 rows=1 width=2)

Note that when index scanning a value that repeats to a high degree, its related cost is higher too. Compare with the default plans:

RESET ALL;
WITH letter (letter, count) AS ( SELECT letter, count(*) from sample GROUP by 1) SELECT letter AS "1", count,(SELECT * FROM lookup_letter(letter) AS "12" limit 1) AS lookup_letter FROM letter ORDER BY 2 DESC;

 1 | count |                             lookup_letter                             
---+-------+-----------------------------------------------------------------------
 p |   199 | Seq Scan on sample  (cost=0.00..13.16 rows=199 width=2)
 s |     9 | Seq Scan on sample  (cost=0.00..13.16 rows=9 width=2)
 c |     8 | Seq Scan on sample  (cost=0.00..13.16 rows=8 width=2)
 r |     7 | Seq Scan on sample  (cost=0.00..13.16 rows=7 width=2)
 t |     5 | Bitmap Heap Scan on sample  (cost=4.29..12.76 rows=5 width=2)
 v |     4 | Bitmap Heap Scan on sample  (cost=4.28..12.74 rows=4 width=2)
 f |     4 | Bitmap Heap Scan on sample  (cost=4.28..12.74 rows=4 width=2)
 d |     4 | Bitmap Heap Scan on sample  (cost=4.28..12.74 rows=4 width=2)
 a |     3 | Bitmap Heap Scan on sample  (cost=4.27..11.38 rows=3 width=2)
 u |     3 | Bitmap Heap Scan on sample  (cost=4.27..11.38 rows=3 width=2)
 _ |     3 | Bitmap Heap Scan on sample  (cost=4.27..11.38 rows=3 width=2)
 e |     2 | Index Scan using i_sample on sample  (cost=0.00..8.27 rows=1 width=2)
 i |     1 | Index Scan using i_sample on sample  (cost=0.00..8.27 rows=1 width=2)
 k |     1 | Index Scan using i_sample on sample  (cost=0.00..8.27 rows=1 width=2)
(14 filas)

Thursday, March 1, 2012

PostgreSQL - Common Table Expressions

I just stumbled upon this presentation by Bruce Momjian:
http://momjian.us/main/writings/pgsql/cte.pdf
I had heard of these beauties before but never really used them ... they're astonishing!

I'm just going to replicate the given examples:

test=# with source as ( select 1) SELECT * FROM source;
 ?column? 
----------
        1
(1 row)

test=# WITH source AS ( SELECT 1 as col1) SELECT * FROM source;
 col1 
------
    1
(1 row)

test=# WITH source (col1) AS ( SELECT 1 ) SELECT * FROM source;
 col1 
------
    1
(1 row)

test=# WITH source (col2) AS ( SELECT 1 AS col1) SELECT col2 AS col3 FROM source;
 col3 
------
    1
(1 row)

test=# WITH source AS ( SELECT 1,2 ) SELECT * FROM source;
 ?column? | ?column? 
----------+----------
        1 |        2
(1 row)

-- here it starts to get interesting

test=# WITH source AS ( SELECT 1,2), source2 AS ( SELECT 3,4) select * from source UNION ALL SELECT * FROM source2;
 ?column? | ?column? 
----------+----------
        1 |        2
        3 |        4
(2 rows)

test=# WITH source AS ( select lanname, rolname FROM pg_language JOIN pg_roles ON lanowner = pg_roles.oid ORDER BY lanname) SELECT * FROM source UNION ALL SELECT MIN(lanname), NULL FROM source;
  lanname   | rolname  
------------+----------
 c          | postgres
 internal   | postgres
 plpgsql    | postgres
 plpython2u | postgres
 plpythonu  | postgres
 sql        | postgres
 c          | 
(7 rows)

test=# WITH class (oid, relname) AS ( SELECT oid, relname FROM pg_class WHERE relkind = 'r' ) SELECT c.relname, pa.attname FROM pg_attribute pa JOIN  class c ON c.oid = pa.attrelid ORDER BY 1,2 LIMIT 5;
  relname   | attname 
------------+---------
 child_test | cmax
 child_test | cmin
 child_test | col1
 child_test | col2
 child_test | ctid
(5 rows)


--and this is where I give up as I'm extreeeeeemly tired


By the way: cmin, cmax, and ctid are system columns:

cmin

The command identifier (starting at zero) within the inserting transaction.
cmax

The command identifier within the deleting transaction, or zero.
ctid

The physical location of the row version within its table. Note that although the ctid can be used to locate the row version very quickly, a row's ctid will change if it is updated or moved by VACUUM FULL. Therefore ctid is useless as a long-term row identifier. The OID, or even better a user-defined serial number, should be used to identify logical rows.

http://www.postgresql.org/docs/current/static/ddl-system-columns.html

Another set of statements that I find fascinating:
test=# set statement_timeout = '1s';
SET
test=# WITH RECURSIVE source AS ( select 1 UNION ALL SELECT 1 FROM source ) SELECT * FROM source;
ERROR:  canceling statement due to statement timeout

test=# RESET statement_timeout;
RESET


And this one is absolutely epic:
test=# WITH RECURSIVE source (counter) AS ( SELECT 1 UNION ALL SELECT counter + 1 FROM source WHERE counter < 10) SELECT * FROM source;
 counter 
---------
       1
       2
       3
       4
       5
       6
       7
       8
       9
      10
(10 rows)

test=# WITH RECURSIVE source (counter,product) AS ( SELECT 1,1 UNION ALL SELECT counter +1 , product * (counter +1) FROM source WHERE counter < 10) SELECT counter, product FROM source;
 counter | product 
---------+---------
       1 |       1
       2 |       2
       3 |       6
       4 |      24
       5 |     120
       6 |     720
       7 |    5040
       8 |   40320
       9 |  362880
      10 | 3628800
(10 rows)

test=# WITH RECURSIVE source (counter,product) AS ( SELECT 1,1 UNION ALL SELECT counter +1 , product * (counter +1) FROM source WHERE counter < 12) SELECT counter, product FROM source WHERE counter = 12;
 counter |  product  
---------+-----------
      12 | 479001600
(1 row)


Tuesday, February 28, 2012

Data Visualization . Colorbrewer. Vanity coloring your maps

I love this app that helps you choose just the right color combinations for your layers in maps.

http://colorbrewer2.org/

This is how I used it on my first map:


Cool, isn't it? Yay!
P.S. By the way, darker means more violent...



Monday, February 27, 2012

Pentaho BI. Jpivot - java.util.EmptyStackException


I found my first bug with Pentaho BI while following this tutorial, but the solution was very straightforward: replace the jpivot jar file (because the old one was using an incorrect method).

java.util.EmptyStackException
at java.util.Stack.peek(Stack.java:85)
at mondrian.server.Locus.peek(Locus.java:63)
at mondrian.rolap.SqlMemberSource.getMemberChildren2(SqlMemberSource.java:902)
at mondrian.rolap.SqlMemberSource.getMemberChildren(SqlMemberSource.java:830)
at mondrian.rolap.SqlMemberSource.getMemberChildren(SqlMemberSource.java:804)
at mondrian.rolap.SmartMemberReader.readMemberChildren(SmartMemberReader.java:247)
at mondrian.rolap.SmartMemberReader.getMemberChildren(SmartMemberReader.java:210)
at mondrian.rolap.RolapCubeHierarchy$CacheRolapCubeHierarchyMemberReader.readMemberChildren(RolapCubeHierarchy.java:586)
at mondrian.rolap.RolapCubeHierarchy$CacheRolapCubeHierarchyMemberReader.getMemberChildren(RolapCubeHierarchy.java:682)
at mondrian.rolap.SmartMemberReader.getMemberChildren(SmartMemberReader.java:176)
at mondrian.rolap.RolapSchemaReader.internalGetMemberChildren(RolapSchemaReader.java:170)
at mondrian.rolap.RolapSchemaReader.getMemberChildren(RolapSchemaReader.java:152)
at mondrian.rolap.RolapSchemaReader.getMemberChildren(RolapSchemaReader.java:146)
at mondrian.olap.DelegatingSchemaReader.getMemberChildren(DelegatingSchemaReader.java:78)
at com.tonbeller.jpivot.mondrian.MondrianMemberTree.internalGetChildren(MondrianMemberTree.java:209)
at com.tonbeller.jpivot.mondrian.MondrianMemberTree.getChildren(MondrianMemberTree.java:193)
at com.tonbeller.jpivot.navigator.member.TreeModelAdapter.getChildren(TreeModelAdapter.java:95)

etc...

The solution was given by dulci :

dulci
    Hey,
    I have the solution! Do this!:

    1.- stop-pentaho
    2.- Rename the library biserver-ce/tomcat/webapp/pentaho/WEB-INF/lib/jpivot-1.8.0-100420.jar to jpivot-1.8.0-100420.jar.OLD
    3.- Paste the library that we downloaded from http://www.stratebi.es/todobi/Ene12/jpivot-1.8.0-100420.jar
    4 - start pentaho and that's it! 

And bingo!!! This is the beauty of open source.

Reference:
http://forums.pentaho.com/showthread.php?89323-JPivot-error-want-to-check-the-solution

Friday, February 24, 2012

PostgreSQL - Example Database for Training (pagila improved . 540 MB once installed)

I just migrated the example database from MySQL to PostgreSQL. It's based on pagila (the PostgreSQL port of MySQL's sakila) but improved, as depicted in this book (which I highly recommend):

http://www.wiley.com/WileyCDA/WileyTitle/productCd-0470484322.html

And I'm releasing it just like the original authors did, under the LGPL license.

You have to create a user 'wcm' with the necessary privileges in order to load this db.

Compressed text dump 71 mb.
https://rapidshare.com/files/3990485799/wcm.sql.tar.gz

Power Architect entity-relationship diagram.

https://rapidshare.com/files/2635761979/er_diagram_architect.architect

wcm=# select version();
                                       version
-------------------------------------------------------------------------------------
 PostgreSQL 9.1.1 on i586-suse-linux-gnu, compiled by gcc (SUSE Linux) 4.6.2, 32-bit
(1 row)


This is what the tables look like:
wcm=# set search_path = wcm;
SET
wcm=# \dt+
                            Listado de relaciones
 Esquema |        Nombre        | Tipo  |  Dueño   |   Tamaño   | Descripción 
---------+----------------------+-------+----------+------------+-------------
 wcm     | actor                | tabla | rene     | 6304 kB    | 
 wcm     | country              | tabla | rene     | 16 kB      | 
 wcm     | customer             | tabla | rene     | 32 MB      | 
 wcm     | customer_order       | tabla | rene     | 65 MB      | 
 wcm     | customer_order_line  | tabla | rene     | 107 MB     | 
 wcm     | director             | tabla | rene     | 840 kB     | 
 wcm     | distributor          | tabla | rene     | 624 kB     | 
 wcm     | dvd                  | tabla | rene     | 0 bytes    | 
 wcm     | dvd_release          | tabla | rene     | 23 MB      | 
 wcm     | dvd_release_actor    | tabla | rene     | 22 MB      | 
 wcm     | dvd_release_director | tabla | rene     | 0 bytes    | 
 wcm     | employee             | tabla | rene     | 0 bytes    | 
 wcm     | employee_job         | tabla | rene     | 8192 bytes | 
 wcm     | inventory            | tabla | rene     | 0 bytes    | 
 wcm     | job_description      | tabla | rene     | 8192 bytes | 
 wcm     | language             | tabla | rene     | 0 bytes    | 
 wcm     | lookup_type          | tabla | rene     | 8192 bytes | 
 wcm     | lookup_value         | tabla | rene     | 8192 bytes | 
 wcm     | promoted_dvd_release | tabla | rene     | 0 bytes    | 
 wcm     | promotion            | tabla | rene     | 8192 bytes | 
 wcm     | purchase_order       | tabla | rene     | 0 bytes    | 
 wcm     | purchase_order_line  | tabla | rene     | 0 bytes    | 
 wcm     | region               | tabla | rene     | 8192 bytes | 
 wcm     | warehouse            | tabla | rene     | 8192 bytes | 
 wcm     | website              | tabla | rene     | 8192 bytes | 
(25 filas)


The empty tables (purchase_order, promoted_dvd_release and purchase_order_line) can be filled as you wish; the others are not important and are easy to fill as well. The structure is more or less self-explanatory, but post a comment should you have any doubts.

PostgreSQL - Compressing text dumps on the fly

This one was missing in my recipes ...


pg_dump -U myUser myDB | gzip > myDB.sql.gz

Then, to restore (this is a plain-text dump, so it goes through psql rather than pg_restore):

gzip -cd myDB.sql.gz | psql myDB

or

zcat myDB.sql.gz | psql myDB


The "tar" utility is for bundling up a bunch of files and directories into a single file (the name is a contraction of "tape archive"). In that respect, a "tar" file is kind-of like a "zip" file, except that "zip" always implies compression while "tar" does not.

Note finally that "gzip" is not "zip." The "gzip" utility just compresses; it doesn't make archives.


Reference:

http://stackoverflow.com/questions/2379482/shell-scripting-use-a-pipe-as-an-input-for-tar

Thursday, February 23, 2012

MySQL . mysqldump. exporting data with sql inserts to PostgreSQL

Generating a working mysql data dump with inserts to load into PostgreSQL.

Key options for generating the dump.

mysqldump -u user  --extended-insert=FALSE --no-create-info --compact --compatible=postgresql database > file.sql

Converting the incompatible characters:
sed "s/\\\'/\'\'/g" file.sql > file1.sql

And importing into postgresql
psql database < file1.sql
That was quick, wasn't it?

References:
http://dev.mysql.com/doc/refman/5.1/en/mysqldump.html
http://chriswa.wordpress.com/2008/02/20/mysqldump-data-only/
http://en.wikibooks.org/w/index.php?title=Programming:Converting_MySQL_to_PostgreSQL

MySQL . Error 1396 (HY000) <- nasty bug

The error goes like this:

mysql> select version();
+------------+
| version()  |
+------------+
| 5.5.16-log |
+------------+
1 row in set (0.00 sec)


CREATE USER youruser IDENTIFIED BY 'youruser';
ERROR 1396 (HY000) at line 4: Operation CREATE USER failed for 'youruser'@'%'

The bug is depicted here:

http://bugs.mysql.com/bug.php?id=28331

and the workaround that worked for me:

mysql> delete from mysql.db where user='youruser';
Query OK, 0 rows affected (0.00 sec)

mysql> delete from mysql.user where user='youruser';
Query OK, 0 rows affected (0.00 sec)

mysql> flush privileges;
Query OK, 0 rows affected (0.00 sec)

Another nuance that drives me away from using MySQL (when I don't have to).

Wednesday, February 22, 2012

Pentaho . Chartbeans intro

I'm studying how to adapt my favorite charting library to Pentaho BI. I made a simple map.


http://wiki.pentaho.com/display/ServerDoc2x/Introduction+to+Pentaho+ChartBeans#IntroductiontoPentahoChartBeans-ExamplefromaJavaDeveloperPerspective

The provided example is a bit outdated, I needed to reconfigure some things, but I'm on my way. Once I accomplish it I'll publish a walkthrough tutorial.



Sunday, February 19, 2012

Pentaho. Pentaho Charts

I'm evaluating Pentaho BI, and I'm getting to like it very much. I'm interested in leveraging my own charts through ChartBeans (the Pentaho project that provides a wrapper for foreign charting libraries), since the ones from JFreeChart don't appeal to me and Flash is not supported by some devices, though they do look very impressive, like this one:




Chartbeans project:
http://wiki.pentaho.com/display/COM/The+Pentaho+ChartBeans+Project


Saturday, February 18, 2012

PostgreSQL - fuzzystrmatch . distance between strings

I think this feature could be useful for data cleansing or, in general, for tasks related to string comparisons.

http://www.postgresql.org/docs/8.3/static/fuzzystrmatch.html

F.9.2. Levenshtein
This function calculates the Levenshtein distance between two strings:
   levenshtein(text source, text target) returns int
Both source and target can be any non-null string, with a maximum of 255 characters.

test=# Create extension fuzzystrmatch;
CREATE EXTENSION

test=# select levenshtein('john smith','john schmit');
 levenshtein 
-------------
           3
(1 row)


Calculating the degree of similarity between two words sounds easy... but I'll try that after a long nap. One thing that would be awesome is to somehow implement an efficient auto-complete feature using PostgreSQL...
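
In the meantime, here's a sketch of what I mean in plain Python (my own formulation, not part of fuzzystrmatch): normalize the Levenshtein distance by the length of the longer string.

def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    # 1.0 means identical strings, 0.0 means nothing in common
    return 1.0 - float(levenshtein(a, b)) / max(len(a), len(b))

print(similarity('john smith', 'john schmit'))  # distance 3 over length 11 -> ~0.73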

Visualization - Google charts API

Yet another charting API.
Charts are rendered using HTML5/SVG technology to provide cross-browser compatibility and cross platform portability to iPhones, iPads and Android. No plugins are needed.
http://code.google.com/intl/es-MX/apis/chart/interactive/docs/index.html


Drawback (sort of):
Can I use charts offline?
No; your computer must have live access to http://www.google.com/jsapi in order to use charts. This is because the visualization libraries that your page requires are loaded dynamically before you use them. The code for loading the appropriate library is part of the included jsapi script, and is called when you invoke the google.load() method. Our terms of service do not allow you to download the google.load or google.visualization code to use offline.


And...the motion chart is fantastic!!!

Tuesday, February 14, 2012

PostgreSQL. Thesaurus dictionary - Normalizing political state names

I want to normalize the names of Mexico's states to abbreviations in an automatic fashion, as I'll be parsing a lot of documents containing statistical information about different variables across the whole country, and they refer to states in heterogeneous ways.
For instance, I want the strings 'edo. de mexico', 'estado de mexico', 'mexico' and 'edo de mexico' to all yield 'mex' instead, which is the abbreviation I've chosen for that particular state (and which in time will help me assign its proper numeric identifier for its respective polygon in maps).

1. Creating the configuration file.
vi /usr/pgsql-9.1/share/tsearch_data/mexican_states.ths

? ? mexico : mex
edo ? mexico : mex
mexico : mex

:wq

2. Creating the text search dictionary
CREATE TEXT SEARCH DICTIONARY mexican_states_ths 
(
TEMPLATE = thesaurus,
DictFile = mexican_states,
Dictionary = pg_catalog.spanish_stem
);

3. Altering the mapping of the text search configuration for the respective language (spanish).
ALTER TEXT SEARCH CONFIGURATION spanish ALTER MAPPING FOR asciiword, asciihword, hword_asciipart WITH mexican_states_ths, spanish_stem;

4. And testing the results.

postgres=# SELECT plainto_tsquery('spanish','edo de mexico');
 plainto_tsquery 
-----------------
 'mex'
(1 row)

postgres=# SELECT plainto_tsquery('spanish','mexico');
 plainto_tsquery 
-----------------
 'mex'
(1 row)

postgres=# SELECT plainto_tsquery('spanish','estado de mexico');
 plainto_tsquery 
-----------------
 'mex'
(1 row)

postgres=# SELECT plainto_tsquery('spanish','edo. de mexico');
 plainto_tsquery 
-----------------
 'mex'
(1 row)


Certainly some prior treatment is necessary, such as lowercasing and unquoting, but that's easier than editing files by hand, which I find kind of silly.
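
For that prior treatment, something as small as this sketch would probably do (the exact set of characters to strip is an assumption on my part):

import re

def preclean(s):
    # lowercase everything and turn punctuation (e.g. the dot in 'edo.') into spaces
    return re.sub(r'[^\w\s]', ' ', s.lower())

print(preclean('Edo. de Mexico'))   # -> 'edo  de mexico'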

Reference:
http://www.postgresql.org/docs/9.1/static/textsearch-dictionaries.html
While learning how to achieve this, I blew away my spanish text search configuration. Here's how to get it back
http://forums.enterprisedb.com/posts/list/1575.page
i.e.
Go to your $SHAREDIR (i.e. pg_config --sharedir)
Check the contents of the file snowball_create.sql.

Sunday, February 12, 2012

Virtualbox - could not find a valid medium format for the target disk - vmdk.gz

I just ran into the following issue while importing a virtual appliance (.ova) to virtualbox.
Failed to import appliance /home/rene/Descargas/MusicBrainz NGS 2012-01-12.ova.
Could not find a valid medium format for the target disk '/home/rene/VirtualBox VMs/vm/musicbrainz-server-2012-01-12-disk1.vmdk.gz'.


The thing is that this appliance was generated by VMware with gzip compression enabled for the vmdk disk. I don't know whether there's another workaround, but it turns out that the .ova file is in fact a tar file, so:

tar -xvf MusicBrainz\ NGS\ 2012-01-12.ova
gzip -d musicbrainz-server-2012-01-12-disk1.vmdk.gz

And you're all set.

Source:
http://itproctology.blogspot.com/2009/05/ovf-and-ova-dilemma.html
http://communities.vmware.com/thread/283650

Saturday, February 11, 2012

Flotr2 - Found another free charting library

To me, there's no excuse not to present our data in an attractive fashion.
http://www.humblesoftware.com/flotr2/documentation

Introduction

Flotr2 is a library for drawing HTML5 charts and graphs. It is a branch of flotr which removes the Prototype dependency and includes many improvements.

Features:
  • mobile support
  • framework independent
  • extensible plugin framework
  • custom chart types
  • FF, Chrome, IE6+, Android, iOS
  • lines, bars, candles, pies, bubbles
Usage

To use Flotr2, include the flotr2.min.js script in your page and create a visible div
with positive width and height. A graph is drawn with the Flotr.draw(container, data, options) method.

Friday, February 10, 2012

PostgreSQL. Mail merging-Part 2.The plpython function

Let's get this thing done.

Creating an auxiliary view to get the column names of tables and views.
CREATE VIEW columns AS SELECT
  c.relname tab,
  a.attname as col
  FROM pg_catalog.pg_class c
     LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
     LEFT JOIN pg_catalog.pg_attribute a ON a.attrelid = c.oid
  WHERE
       n.nspname <> 'pg_catalog'
      AND n.nspname <> 'information_schema'
      AND n.nspname !~ '^pg_toast'
  AND pg_catalog.pg_table_is_visible(c.oid)
  AND c.relkind IN ('v','r')
  AND a.attnum > 0;


Creating and populating the table where the "recipients" will be listed.
test=# CREATE TABLE deceased (recipient_name text,name_of_deceased text, his_her text, positive_attribute text,my_signature text);
CREATE TABLE
test=# INSERT INTO deceased (name_of_deceased, his_her, positive_attribute) VALUES ('Jim Morrison','his',' a great artist'),('Kurt Kobain','his',' a tormented soul'),('John Lenon','his',' a charismatic and charming person'),('Janis Joplin','her',' authentic, a punch of energy ');
INSERT 0 4
UPDATE deceased SET my_signature = 'Rene', recipient_name = 'Mr.X'; 

And after some research and thought, I came up with this:
CREATE OR REPLACE FUNCTION mail_merge(source text,document text,path text) RETURNS void AS 
$$
#source -> the table or view that contains the list
#document -> the template with placeholders matching the field names of source prefixed with colons (:)
#path -> the directory where the generated documents will be created
import re
placeholders = []
doc_text = ""
columns = ""
rv = plpy.execute("SELECT col from columns WHERE tab = %s order by 1" % (plpy.quote_literal(source)))
for x in rv:
  placeholders = placeholders + [x['col']] 
rv = plpy.execute("SELECT document from documents WHERE doc_name =  %s " % (plpy.quote_literal(document)))
for x in rv:
  doc_text = doc_text + x['document']
for i in range(0,len(placeholders)):
  columns = columns + placeholders[i]+" "
  if i < len(placeholders) -1:
    columns = columns +","
rv = plpy.execute("SELECT "+columns+" FROM "+source )
for n, x in enumerate(rv):
  tmp = doc_text
  for i in range(0,len(placeholders)):
    tmp = re.sub(":"+str(placeholders[i]),str(x[placeholders[i]]),tmp)
  # write the merged document for this row to its own file under path
  out = open("%s/%s_%d.html" % (path, document, n), "w")
  out.write(tmp)
  out.close()
$$ LANGUAGE plpython2u;

With this function you can, based on a table or a view, create n documents (where n equals the number of rows in the table or view) from a template set up as described in my previous entry. The field names of the table or view determine the placeholder names. For example, SELECT mail_merge('deceased', 'condolences', '/tmp/letters'); would drop one condolence letter per row of deceased into /tmp/letters (an arbitrary example path).

The more I work with python, the more I like it =-).

Thursday, February 9, 2012

PostgreSQL - Mail merging... with PostgreSQL? Part 1

Yep, it's a weird concept, but the idea is to help a friend automate a rather boring task. By mail merging I mean having a document template that you need to send to a list of people. I know what you're thinking... gosh, that's what office suites are for, but hell, I want to do it with PostgreSQL! hahahaha

This is my test document in Libre Office Writer:

Dear :recipient_name

I know :name_of_deceased has passed away. I was dismayed to hear of :his_her passing and am writing to extend my deepest sympathies to you and your family.

:name_of_deceased was :positive_attribute
If there is anything I can do to help you through this difficult time, please let me know and I'd be more than willing to offer my assistance.

With warmest regards,
:my_signature


Note that I'm using the following placeholders (variables if you wish):
:recipient_name
:name_of_deceased
:his_her
:positive_attribute
:my_signature

I've saved it in HTML format so I can manipulate flat text (it would be more elegant to use something like LaTeX, I reckon).

Preprocessing and loading the document template:

  1. Getting rid of the newline characters:

     -bash-4.2$ tr '\n' ' ' < letter.html | tee > letter_new.html

  2. Creating the table that will hold my documents:

     CREATE TABLE documents (id serial primary key, doc_name text, document text);

  3. Loading the template into the table:

     letters=# \copy documents (document) FROM 'letter_new.html' WITH DELIMITER AS '^';
     letters=# UPDATE documents SET doc_name = 'condolences' WHERE id = 1;

     I used '^' as the delimiter because it doesn't appear in the document; hence, the whole line is loaded into that field.

Now my template looks like this in the database:
letters=# select * from documents;
id       | 1
doc_name | condolences
document | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> <HTML> <HEAD>    <META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=windows-1252">      <TITLE></TITLE>         <META NAME="GENERATOR" CONTENT="LibreOffice 3.4  (Win32)">      <META NAME="CREATED" CONTENT="20120208;22492965">       <META NAME="CHANGED" CONTENT="0;0">     <STYLE TYPE="text/css">         <!--            @page { margin: 2cm }           P { margin-bottom: 0.21cm }     -->     </STYLE> </HEAD> <BODY LANG="es-MX" DIR="LTR"> <P STYLE="margin-bottom: 0cm"><BR> </P> <P STYLE="margin-bottom: 0cm"><BR> </P> <P STYLE="margin-bottom: 0cm">Dear :recipient_name</P> <P STYLE="margin-bottom: 0cm"><BR> </P> <P STYLE="margin-bottom: 0cm">I know :name_of_deceased  has passed away. I was dismayed to hear of  :his_her passing and am writing to extend my deepest sympathies to you and your family.</P> <P STYLE="margin-bottom: 0cm"><BR> </P> <P STYLE="margin-bottom: 0cm">:name_of_deceased was :positive_attribute</P> <P STYLE="margin-bottom: 0cm">If there is anything I can do to help you through this difficult time, please let me know and I'd be more than willing to offer my assistance.</P> <P STYLE="margin-bottom: 0cm"><BR> </P> <P STYLE="margin-bottom: 0cm"><BR> </P> <P STYLE="margin-bottom: 0cm">With warmest regards,</P> <P STYLE="margin-bottom: 0cm">:my_signature</P> </BODY> </HTML>         

In the second part I'll construct a python function that outputs documents based on this template and on a table of deceased people =-). But that'll be after I try to get some sleep ...




Wednesday, February 8, 2012

GeoServer - My first map served

I just served my first map with GeoServer
http://geoserver.org/display/GEOS/Welcome
and it was fairly easy. Their documentation is good enough that it isn't necessary to explain how I did it; I just wanted to spread the word about this nice tool.

  This is the contour of my hometown: Mexico City.

I just found some references for tutorials though.
http://gis.stackexchange.com/questions/7987/tutorials-for-geoserver


Python. Why should you learn python?

Today I was confronted with the question: why should anybody learn python?
And... gee, I couldn't give a strong argument besides: "because I like it, it's clear, fun to work with, and you can do everything with it".

So in this entry I'll "curate" some facts and opinions from people around the web advocating for the Python programming language.
  1. Because it's a de-facto standard in the industry. [1][2][3]
  2. Excellent relation between speed / memory consumption / language features [1]
  3. Very easy to learn [1][2]
  4. Elegant and clear syntax [1]:

     class BankAccount(object):
         def __init__(self, initial_balance=0):
             self.balance = initial_balance
         def deposit(self, amount):
             self.balance += amount
         def withdraw(self, amount):
             self.balance -= amount
         def overdrawn(self):
             return self.balance < 0
     my_account = BankAccount(15)
     my_account.withdraw(5)
     print my_account.balance

  5. A huge variety of modules to accomplish anything you can think of. [1][2]
  6. Straightforward web programming [1][2]
---------------------------------------------------------------------------------------------------------------
Other unsorted reasons from stack overflow:
As for what tasks can be done:
  • Python is extremely fun to develop in.
  • Everything can be done with Python.
  • If something can't be done, you can create an extension for it.
  • Everything can not only be done, but it can be done fast. For example a program that takes you weeks in C++ might take you a day in Python.
  • Great for prototyping, and even for usage in a commercial setting.
As for language features:
  • Scripting language
  • Multi-paradigms (object oriented, imperative, functional)
  • Dynamic type system
  • Automatic memory management
  • Links to other dynamic languages
  • Embedding into other languages
  • Nice py2exe program that converts python scripts to executables for windows. Other platforms include it by default.
  • Strongly typed without static typing
  • Large standard library
  • Includes lists, dictionaries, sets, and much more as basic data types
  • Some implementations of python compile directly to machine code
  • It is actively being worked on
  • It has an antigravity module as noted by Jeremy Cantrell
  • ...
As for resources available:
  • Python is free to use
  • There are a ton of online resources and free tutorials
  • There are a ton of books on Python
  • There is a large community of developers available for help
--
Why should you learn Python?
  • Because it is a modern, elegant, highest level OO language
  • Because it is highly expressive, i.e., you will earn higher productivity
  • Because it comes with "batteries included" i.e. libraries for whatever you want
  • Because it is well documented and has a well-established community
  • Because it is simple to learn (and easy to read!) if you already know another OO language
  • Because it is relatively portable (that is, portable until you use system-specific features)
  • Because it is free and already installed on any modern Linux distribution (and also on Mac OS X and *BSD, I guess)
As for what can be done in Python: everything except extremely low-level things (i.e. kernels, drivers, programming a microcontroller). Also, you should not use Python directly for number crunching (it is far too slow for that), but you can wrap C or C++ number crunching libraries with Python modules (and it has already been done, the most notable examples being Numpy and Scipy).
------------------------------------------------------------------------------------------------------------------
I'll update this post as I find more convincing rationales advocating for python.

Monday, February 6, 2012

PostgreSQL - MVCC revisited

I had previously read about this topic but I needed to grasp it a little bit more, so I made this mental map:



And also devised a very simple exercise based on this document and this one:


To me, the most important piece of information is this:

Basic idea: a tuple is visible if xmin is valid and xmax is not. "Valid" means "either committed or the current transaction".

postgres=# CREATE TABLE test(id int, time timestamp);
CREATE TABLE
postgres=# BEGIN;
BEGIN
postgres=# select txid_current();
 txid_current 
--------------
        13756
(1 row)

postgres=# INSERT INTO test values(generate_series(1,100),current_timestamp);
INSERT 0 100
postgres=# SELECT *,xmin,xmax FROM test LIMIT 5;
 id |           time            | xmin  | xmax 
----+---------------------------+-------+------
  1 | 2012-02-06 21:45:00.11556 | 13756 |    0
  2 | 2012-02-06 21:45:00.11556 | 13756 |    0
  3 | 2012-02-06 21:45:00.11556 | 13756 |    0
  4 | 2012-02-06 21:45:00.11556 | 13756 |    0
  5 | 2012-02-06 21:45:00.11556 | 13756 |    0
(5 rows)

postgres=# COMMIT;
COMMIT


Now let's generate some dead tuples:

postgres=# BEGIN;
BEGIN
postgres=# SELECT txid_current();
 txid_current 
--------------
        13758
(1 row)

postgres=# UPDATE test SET time = current_timestamp WHERE id < 6;
UPDATE 5
postgres=# SELECT *,xmin,xmax FROM test LIMIT 5;
 id |            time            | xmin  | xmax 
----+----------------------------+-------+------
  1 | 2012-02-06 21:48:03.560034 | 13758 |    0
  2 | 2012-02-06 21:48:03.560034 | 13758 |    0
  3 | 2012-02-06 21:48:03.560034 | 13758 |    0
  4 | 2012-02-06 21:48:03.560034 | 13758 |    0
  5 | 2012-02-06 21:48:03.560034 | 13758 |    0
(5 rows)

postgres=# DELETE FROM test WHERE id < 6;
DELETE 5
postgres=# SELECT *,xmin,xmax FROM test LIMIT 5;
 id |            time            | xmin  | xmax 
----+----------------------------+-------+------
  6 | 2012-02-06 21:48:03.560034 | 13758 |    0
  7 | 2012-02-06 21:48:03.560034 | 13758 |    0
  8 | 2012-02-06 21:48:03.560034 | 13758 |    0
  9 | 2012-02-06 21:48:03.560034 | 13758 |    0
 10 | 2012-02-06 21:48:03.560034 | 13758 |    0
(5 rows)

postgres=# ROLLBACK;
ROLLBACK

postgres=# SELECT *,xmin,xmax FROM test LIMIT 5;
 id |           time            | xmin  | xmax  
----+---------------------------+-------+-------
  1 | 2012-02-06 21:45:00.11556 | 13756 | 13758
  2 | 2012-02-06 21:45:00.11556 | 13756 | 13758
  3 | 2012-02-06 21:45:00.11556 | 13756 | 13758
  4 | 2012-02-06 21:45:00.11556 | 13756 | 13758
  5 | 2012-02-06 21:45:00.11556 | 13756 | 13758
(5 rows)
How do we get a glance at how many dead tuples are in a table?
postgres=# select * from pg_stat_all_tables where relname = 'test';
-[ RECORD 1 ]-----+------------------------------
relid             | 16969
schemaname        | public
relname           | test
seq_scan          | 7
seq_tup_read      | 320
idx_scan          | 
idx_tup_fetch     | 
n_tup_ins         | 100
n_tup_upd         | 105
n_tup_del         | 5
n_tup_hot_upd     | 105
n_live_tup        | 100
n_dead_tup        | 5
last_vacuum       | 
last_autovacuum   | 2012-02-06 21:51:33.926522-06
last_analyze      | 
last_autoanalyze  | 2012-02-06 21:46:33.783941-06
vacuum_count      | 0
autovacuum_count  | 1
analyze_count     | 0
autoanalyze_count | 1

n_dead_tup = number of dead tuples.

And now, let's VACUUM the table.
postgres=# VACUUM VERBOSE test;
INFO:  vacuuming "public.test"
INFO:  "test": found 5 removable, 100 nonremovable row versions in 1 out of 1 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 100 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.01 sec.
VACUUM


I still feel that I need to dive more into the subject. One thing that would be great is to peep into "deleted" row contents =-).

Sunday, February 5, 2012

Python - SciPy. Concentrate on ideas, not on implementations

I just discovered Scipy:
SciPy (pronounced "Sigh Pie") is open-source software for mathematics, science, and engineering. It is also the name of a very popular conference on scientific programming with Python. The SciPy library depends on NumPy, which provides convenient and fast N-dimensional array manipulation. The SciPy library is built to work with NumPy arrays, and provides many user-friendly and efficient numerical routines such as routines for numerical integration and optimization. Together, they run on all popular operating systems, are quick to install, and are free of charge. NumPy and SciPy are easy to use, but powerful enough to be depended upon by some of the world's leading scientists and engineers. If you need to manipulate numbers on a computer and display or publish the results, give SciPy a try!


I wanted to find the inverse of a matrix using plpython the old way (row operations), so I could practice the Python syntax, but now I'm just going to use SciPy.

This is the guide to the linear algebra module:
http://docs.scipy.org/doc/scipy/reference/tutorial/linalg.html

postgres=# CREATE OR REPLACE FUNCTION inverse_matrix (m text)
RETURNS text
AS $$
from scipy import linalg
from numpy import matrix
return linalg.inv(matrix(m));
$$ LANGUAGE plpython2u;

postgres=# select inverse_matrix($$'[1 3 5; 2 5 1; 2 3 8]'$$);
-[ RECORD 1 ]--+----------------------
inverse_matrix | [[-1.48  0.36  0.88]
               |  [ 0.56  0.08 -0.36]
               |  [ 0.16 -0.12  0.04]]


As simple as that! =-)

Friday, February 3, 2012

Data Visualization - Word clouds

I'm using this javascript library to create simple word clouds:
http://code.google.com/p/nebulos/wiki/UserGuide
I just love creative people and javascript!


This is another example using:
https://github.com/DukeLeNoir/jQCloud

Wednesday, February 1, 2012

Postgresql - Domains

I'm studying good database design principles and discovered a beautiful concept: domains.

You have a SQL database and you want a column to be restrained to one of a few values.

If the list will never change, and ordering is preset or doesn't matter: use an ENUM
If the list might change but changes require programming, OR the list will not change but needs to be alpha sorted: use a DOMAIN
If the list will change, and changes do not require other programming: use a lookup table.


http://nigel.mcnie.name/blog/the-five-minute-guide-to-postgresql-domains-enums-on-steroids


For instance, I defined the following domains to enforce data validation to some extent.

CREATE DOMAIN general_text TEXT
 CONSTRAINT max_length CHECK (LENGTH( VALUE ) < 255 )
 ;

CREATE DOMAIN email_string TEXT
 CONSTRAINT valid_email_form CHECK ( VALUE ~ '_*@_*._*')
 CONSTRAINT valid_email_length CHECK ( LENGTH(VALUE) > 5)
 ;
 
CREATE DOMAIN number_string TEXT
 CONSTRAINT valid_number CHECK (VALUE ~ '[0-9]+')
 ;
 
CREATE DOMAIN telephone_string TEXT
 CONSTRAINT valid_telephone_form CHECK (VALUE ~ '[0-9\-]+')
 CONSTRAINT valid_telephone_length CHECK (LENGTH(VALUE) >= 8 )
 ;

CREATE DOMAIN counter int
 CONSTRAINT positive_number CHECK ( VALUE >= 0)
 ;

Defining a table that uses them:

CREATE TABLE editoriales(
  id serial primary key,
  nombre general_text not null,
  direccion general_text,
  estado_id int,
  pais_id int,
  codigo_postal number_string , 
  telefono telephone_string
);


The thing that got me into using them is that they can be managed much as if they were tables: you can modify them and add new constraints later (e.g. ALTER DOMAIN general_text DROP CONSTRAINT max_length;) and... they are reusable!!!

Monday, January 30, 2012

Python - My first plpython function

I needed a function that removes the white space inside a string, in order to rename tables and columns to a more standard form. There are probably faster / easier ways to do this, but for now, this will suffice.

postgres=# CREATE OR REPLACE FUNCTION remove_blanks (t text)
RETURNS text
AS $$
import string
result = ""
for char in t:
  if not(char in string.whitespace):
    result = result + char
return result
$$ LANGUAGE plpython2u;
CREATE FUNCTION

postgres=# \df
                                List of functions
 Schema |         Name          | Result data type | Argument data types |  Type  
--------+-----------------------+------------------+---------------------+--------
public | remove_blanks         | text             | t text              | normal


postgres=# select remove_blanks('This string contains spaces');
      remove_blanks       
--------------------------
 Thisstringcontainsspaces
(1 row)
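One of those faster / easier ways, for the record: str.split() with no arguments breaks a string on any run of whitespace, so joining the pieces back together drops it all. A sketch in plain Python:

def remove_blanks(t):
    # split() with no arguments splits on any whitespace,
    # so joining the pieces removes it entirely
    return ''.join(t.split())

print(remove_blanks('This string contains spaces'))  # Thisstringcontainsspaces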

Sunday, January 29, 2012

Python - Generating primes

I always write a prime-generating program when I'm learning a new language, as if it were my "hello world". I'm using Python 3.2.

import math

def cribe(size):
    # 0 = possibly prime, 1 = known composite
    nums = [0]*size
    # crossing out multiples of numbers up to sqrt(size) is enough;
    # the + 1 makes the range include int(sqrt(size)) itself
    for i in range(2, int(math.sqrt(size)) + 1):
        if nums[i] == 1:
            continue
        for j in range(2, size):
            if i * j > size - 1:
                break
            nums[i*j] = 1
    for i in range(2, size):
        if nums[i] == 0:
            print(i)

limit = input('Up to what number? ')
cribe(int(limit))
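A more compact variant of the same sieve, using slice assignment to cross out multiples in bulk (my sketch, not part of the original program):

def sieve(limit):
    is_prime = [True] * limit
    is_prime[:2] = [False, False]              # 0 and 1 are not prime
    for i in range(2, int(limit ** 0.5) + 1):
        if is_prime[i]:
            # cross out every multiple of i, starting at i*i
            is_prime[i*i::i] = [False] * len(is_prime[i*i::i])
    return [n for n, p in enumerate(is_prime) if p]

print(sieve(50))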

My IDE for the moment is Aptana Studio; I haven't found anything better.

http://aptana.com/

Flot - Attractive Javascript plotting for jQuery - MIT License

I just found out about this open source chart plotting library. The example I embedded here was quite static, but a very wide array of functionality is available, and many kinds of interaction with the charts are possible. I intend to use it in my personal data visualization projects.

http://code.google.com/p/flot/

I tailored and customized this one, but here are some other examples:
http://people.iola.dk/olau/flot/examples/

Friday, January 27, 2012

PostgreSQL - Forbid access to a database temporarily.

I found out about this feature thanks to someone who asked how to do it on the postgresql-es-ayuda list.

Disable connections to a PostgreSQL database temporarily

Today I had to disable access to some PostgreSQL databases temporarily. This can be easily done using pgAdmin. Under database properties, set "Connection Limit" to 0 from the default of -1. The corresponding SQL statement is
 ALTER DATABASE "dbname" WITH CONNECTION LIMIT = 0;

When you are ready to allow connections, change this back to -1.

Reference:
http://vimalkumar.in/2011/01/27/disable-connections-to-a-postgresql-database-temporarily/

PostgreSQL - Renaming tables and columns with mixed case names

I just imported a mysql database into postgresql; the thing is that tables and columns are written using both upper and lower case characters, and we don't like that naming scheme.

1. I have this query that lists almost all objects in a particular schema (I stored it as a view named "tables").

view "tables"
SELECT n.nspname as Schema,
  c.relname as Name,
  CASE c.relkind WHEN 'r' THEN 'table' WHEN 'v' THEN 'view' WHEN 'i' THEN 'index' WHEN 'S' THEN 'sequence' WHEN 's' THEN 'special' WHEN 'f' THEN 'foreign table' END as Type,
  pg_catalog.obj_description(c.oid, 'pg_class') as Description
FROM pg_catalog.pg_class c
     LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN ('r','')
      AND n.nspname <> 'pg_catalog'
      AND n.nspname <> 'information_schema'
      AND n.nspname !~ '^pg_toast'
  AND pg_catalog.pg_table_is_visible(c.oid)
ORDER BY 1,2;

Now, I generate the needed SQL to rename the tables:

SELECT 'ALTER TABLE ' || $$"$$ || name ||$$"$$||' RENAME TO '|| lower(name)||' ; ' FROM (SELECT name FROM tables WHERE type = 'table') AS t1;

So far so good. But then, how do you rename table columns?

Well, this is my base query (attributes) for that:

SELECT a.attname,
  pg_catalog.format_type(a.atttypid, a.atttypmod),
  (SELECT substring(pg_catalog.pg_get_expr(d.adbin, d.adrelid) for 128)
   FROM pg_catalog.pg_attrdef d
   WHERE d.adrelid = a.attrelid AND d.adnum = a.attnum AND a.atthasdef),
  a.attnotnull, a.attnum,
  (SELECT c.collname FROM pg_catalog.pg_collation c, pg_catalog.pg_type t
   WHERE c.oid = a.attcollation AND t.oid = a.atttypid AND a.attcollation <> t.typcollation) AS attcollation
FROM pg_catalog.pg_attribute a
WHERE a.attrelid = '33954' AND a.attnum > 0 AND NOT a.attisdropped
ORDER BY a.attnum;

The part that needs to be parameterized is: WHERE a.attrelid = ?

I might also use this one:

SELECT
    a.attname as "Column",
    pg_catalog.format_type(a.atttypid, a.atttypmod) as "Datatype"
    
FROM
    pg_catalog.pg_attribute a
WHERE
    a.attnum > 0
    AND NOT a.attisdropped
    AND a.attrelid = (
        SELECT c.oid
        FROM pg_catalog.pg_class c
            LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
        WHERE c.relname ~ '^(actividad)$'
            AND pg_catalog.pg_table_is_visible(c.oid)
    )
;

http://bytes.com/topic/postgresql/answers/692471-how-get-column-names-table


How do you obtain the table's oid?
http://www.postgresql.org/docs/8.0/static/datatype-oid.html

SELECT * FROM pg_attribute WHERE attrelid = 'mytable'::regclass;

For instance, obtaining the oid of the table "actividad" in my database:

SELECT attrelid FROM pg_attribute WHERE attrelid = 'actividad'::regclass limit 1;
-[ RECORD 1 ]---
attrelid | 33954

That is the number my "attributes" query embeds. I'll finish this post when I get home.
-----------------------------------------------------------------------------------------------------------------------
Putting together all that info, I came up with this query:
CREATE VIEW columns AS SELECT n.nspname as Schema,
  c.relname as Name,
  a.attname as Col 
FROM pg_catalog.pg_class c
     LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
     LEFT JOIN pg_catalog.pg_attribute a ON a.attrelid = c.oid
WHERE 
       n.nspname <> 'pg_catalog'
      AND n.nspname <> 'information_schema'
      AND n.nspname !~ '^pg_toast'
  AND pg_catalog.pg_table_is_visible(c.oid)
  AND c.relkind = 'r'
  AND a.attnum > 0
ORDER BY 1,2
;

All that remains now is switching those table columns to lower case:

select 'ALTER TABLE ' ||schema||'.'||name||' RENAME COLUMN '||$$"$$||col||$$"$$||' TO '||$$"$$||lower(col)||$$"$$||' ;' FROM columns;

I double-quoted the column names, since another issue is names with blank spaces between words, for which I'll probably need a function, gosh!
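Instead of copy-pasting the generated statements back into psql, the whole rename could be driven from Python. A sketch (psycopg2 and the connection parameters are my assumptions; the "columns" view is the one defined above):

import psycopg2

conn = psycopg2.connect(dbname='postgres', user='postgres')
cur = conn.cursor()

# the "columns" view defined above: schema, table name, column name
cur.execute('SELECT schema, name, col FROM columns')
for schema, table, col in cur.fetchall():
    if col != col.lower():
        # double quotes handle mixed case and embedded blanks
        cur.execute('ALTER TABLE {}."{}" RENAME COLUMN "{}" TO "{}"'.format(
            schema, table, col, col.lower()))

conn.commit()
conn.close()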

Django and python - First steps

I've always wanted to learn Python, and I've always wanted to learn to use a web framework. I'm giving Django a try.

https://www.djangoproject.com/

I'm developing my first app (tutorial 1), but they suggest that you get comfortable with the API before going through the second one.


I'd never have thought that developing a database-driven web application could be so easy. I'm starting to like it very much (despite those "expected an indented block" messages =-) ).

Thursday, January 26, 2012

MySQL - mini-migration to PostgreSQL

Today I had to perform a "nano-migration" from mysql (from a mysql dump) to postgresql, but gee, I needed a refresher.

These links were very useful.

To install mysql (on opensuse).

http://en.opensuse.org/SDB:MySQL_installation

http://www.debuntu.org/how-to-create-a-mysql-database-and-set-privileges-to-a-user

To import the sql dump to mysql.
http://www.cyberciti.biz/faq/import-mysql-dumpfile-sql-datafile-into-my-database/

Enterprisedb offers a migration wizard for this kind of boring and burdensome chore:

http://www.enterprisedb.com/docs/en/8.4/migrate/Table%20of%20Contents.htm#TopOfPage



Except that it didn't automatically convert the MySQL datetime type, I'm really satisfied with the results.

There are also lots of scripts and other tools available.
http://wiki.postgresql.org/wiki/Converting_from_other_Databases_to_PostgreSQL#MySQL

Other interesting references:
http://www.raditha.com/postgres/

Tuesday, January 24, 2012

Linux - OpenSUSE startup script with systemd

I've been struggling to get my opensuse box to execute some scripts as root when the system starts, and I just found the answer to why it wasn't working.

I have found out that in openSUSE 12.1, some startup scripts do not operate due to the usage of systemd. systemd really speeds up system startup by running the many startup processes in parallel and avoids the usage of shell scripts altogether. That is good for speed, but not good if you are in need of running something in openSUSE as root when the system is restarted.(1)


1. We need to create the text file after-local.service (shown below) in the folder /lib/systemd/system as root (Alt-F2):

Code:

kdesu kwrite /lib/systemd/system/after-local.service

This is what you put into this file and save it:

Code:

#  This file is part of systemd.
#
#  systemd is free software; you can redistribute it and/or modify it
#  under the terms of the GNU General Public License as published by
#  the Free Software Foundation; either version 2 of the License, or
#  (at your option) any later version.

[Unit]
Description=/etc/init.d/after.local Compatibility
ConditionFileIsExecutable=/etc/init.d/after.local

[Service]
Type=oneshot
ExecStart=/etc/init.d/after.local
TimeoutSec=0
StandardOutput=tty
RemainAfterExit=yes
SysVStartPriority=99

[Install]
WantedBy=multi-user.target

2. Next, we need to add the after-local.service to systemd. Open up a terminal session and run the following command:

Code:

sudo systemctl enable /lib/systemd/system/after-local.service

3. Finally, you need to edit the file /etc/init.d/after.local as root and add the stuff you want to run as root when you start up openSUSE.




Monday, January 23, 2012

Postgresql - Studying the MVCC model

I intend to understand PostgreSQL thoroughly. I found a great presentation here:
PostgreSQL Through Pictures.
http://momjian.us/main/writings/pgsql/internalpics.pdf
It has a section dealing with this particular issue.

I also like this resource, which walks through a practical example.
http://devcenter.heroku.com/articles/postgresql-concurrency

I find it interesting to follow the illustrated examples along with the explanations; I never learn more than when doing, but for now I'm a bit tired and it's quite late.
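To see MVCC with my own hands later, a small experiment with two concurrent sessions should do. A sketch in Python (psycopg2 and the connection parameters are my assumptions):

import psycopg2

# two independent sessions against the same database
a = psycopg2.connect(dbname='postgres', user='postgres')
b = psycopg2.connect(dbname='postgres', user='postgres')
ca, cb = a.cursor(), b.cursor()

ca.execute('CREATE TABLE mvcc_demo (v int)')
a.commit()

ca.execute('INSERT INTO mvcc_demo VALUES (1)')   # not committed yet

cb.execute('SELECT count(*) FROM mvcc_demo')
print(cb.fetchone()[0])   # 0 -- session b cannot see the uncommitted row

a.commit()

cb.execute('SELECT count(*) FROM mvcc_demo')
print(cb.fetchone()[0])   # 1 -- once committed, the row becomes visible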

Friday, January 20, 2012

Postgresql - Installing Postgresql 9.1 on Amazon Linux

First of all, there is such a thing as a Linux from Amazon:

http://aws.amazon.com/amazon-linux-ami/

Amazon Linux AMI

The Amazon Linux AMI is a supported and maintained Linux image provided by Amazon Web Services for use on Amazon Elastic Compute Cloud (Amazon EC2). It is designed to provide a stable, secure, and high performance execution environment for applications running on Amazon EC2. It also includes packages that enable easy integration with AWS, including launch configuration tools and many popular AWS libraries and tools. Amazon Web Services provides ongoing security and maintenance updates to all instances running the Amazon Linux AMI. The Amazon Linux AMI is provided at no additional charge to Amazon EC2 users.


It comes bundled with postgresql 8.4, which is good enough, but gee, I like bleeding edge stuff and postgresql has improved a lot since then =-)


I followed this tutorial, made by some altruistic guy who also shares his discoveries (what would the internet be without people like that?):

http://imperialwicket.com/aws-install-postgresql-90-on-amazon-linux

And everything went smoothly, so smoothly that I'm smiling =-).

I'm reproducing the important parts here as a backup.

Update the yum repositories

I want to install the latest stable postgresql from pgrpms.org. We could just download the rpm and manually install from the file, but that inevitably results in some dependency issues. I prefer to configure an alternate yum repository for a particular keyword. So we need to update the configuration for the Amazon repositories (be sure to update both "main" and "updates", and do not forget the asterisk).

sudo vim /etc/yum.repos.d/amzn-main.repo
[At the bottom of the "[amzn-main]" section, after "enabled=1", add "exclude=postgresql*"]
sudo vim /etc/yum.repos.d/amzn-updates.repo
[Add the same exclude to the bottom of the "[amzn-updates]" section]

Download the repository/key installation rpm from pgrpms.org

wget http://yum.pgrpms.org/reporpms/9.0/pgdg-redhat-9.0-2.noarch.rpm
sudo rpm -ivh pgdg-redhat-9.0-2.noarch.rpm

Since this rpm is generated for RHEL6, we need to make a minor change to the resulting /etc/yum.repos.d/pgdg-90-redhat.repo file. Update the URLs, replacing the $releasever value with '6'. The Amazon Linux $releasever value is a date, instead of the primary version number that this repository configuration is expecting.

sudo vim /etc/yum.repos.d/pgdg-90-redhat.repo

The updated base url values should look like this (update two of them):

## ORIGINAL
# baseurl=http://yum.pgrpms.org/9.0/redhat/rhel-$releasever-$basearch
## UPDATED
baseurl=http://yum.pgrpms.org/9.0/redhat/rhel-6-$basearch

All we have done is tell yum that it should use the amzn repositories for everything except packages that match the "postgresql*" pattern. After that, the pgdg rpm installed a new repository configuration for the pgrpms.org repository.

Web programming - http://api.themoviedb.org. Getting info about movies

I have a friend who, among other things, sells movies. I decided to give him a hand fetching the relevant info about them so he can publish his own movie catalog.
With the API from http://api.themoviedb.org, it couldn't be easier. You need to request an access key first (it's free) for personal use or to develop an open source application.

I'm using this PHP implementation
http://sourceforge.net/projects/simpletmdb/
and an example by Dan Bettles.

Fetching the info is as simple as this: 

<?php
/*
 * based on an example by Dan Bettles
 * modified a little bit by René Romero
 */

$title = basename(__FILE__, '.php') . ' Demo';

$oResults = false;

if (isset($_GET['sendRequest'])) {
    require_once dirname(dirname(__DIR__)) . '/include/boot.php';
    $oResults = simpletmdb\Tmdb::create('here you put your access key')
        ->Movie
        ->getInfo($_GET['id']);
}
?>
<html>
  <head>
    <title><?php echo $title ?></title>
  </head>
  <body>
    <?php if ($oResults) { ?>
      <?php echo $oResults->id ?> |
      <?php echo $oResults->name ?> |
      <?php echo $oResults->tagline ?> |
      <?php echo $oResults->imdb_id ?> |
      <?php echo $oResults->certification ?> |
      <?php echo $oResults->language ?> |
      <?php echo $oResults->released ?> |
      <?php echo $oResults->homepage ?> |
      <?php echo $oResults->url ?> |
      <?php echo $oResults->runtime ?> |
      <?php echo $oResults->trailer ?> |
      <?php if ($oResults->images->xpath("image[(@type = 'poster') and (@size = 'cover')]")) { /* ... */ } ?> |
      <?php echo $oResults->overview ?> |
      <?php if ($oResults->categories && ($oGenres = $oResults->categories->xpath("category[@type = 'genre']"))) { /* ... */ } ?>
      <?php if ($oResults->studios && ($oStudios = $oResults->studios->xpath("studio"))) { /* ... */ } ?> |
      <?php if ($oResults->countries && ($oCountries = $oResults->countries->xpath("country"))) { /* ... */ } ?> |
      <?php if ($oResults->cast && ($oCast = $oResults->cast->xpath("person[@job = 'Director']"))) { /* ... */ } ?> |
      <?php if ($oResults->cast && ($oCast = $oResults->cast->xpath("person[@job = 'Producer']"))) { /* ... */ } ?> |
      <?php if ($oResults->cast && ($oCast = $oResults->cast->xpath("person[@job = 'Original Music Composer']"))) { /* ... */ } ?> |
      <?php if ($oResults->cast && ($oCast = $oResults->cast->xpath("person[@job = 'Author']"))) { /* ... */ } ?> |
      <?php if ($oResults->cast && ($oCast = $oResults->cast->xpath("person[@job = 'Actor']"))) { /* ... */ } ?>
    <?php } else { ?>
      Movie not found
    <?php } ?>
  </body>
</html>

TODO: Batch processing, and remove trailing commas.
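For comparison, the same lookup could be done from Python against TMDb's JSON API. A sketch (the requests library, the v3 endpoint, and the placeholder API key are my assumptions; the post itself uses the older XML API):

import requests

API_KEY = 'here you put your access key'

def movie_info(movie_id):
    # TMDb's JSON API: GET /3/movie/{id}?api_key=...
    url = 'https://api.themoviedb.org/3/movie/{}'.format(movie_id)
    resp = requests.get(url, params={'api_key': API_KEY})
    resp.raise_for_status()
    return resp.json()

info = movie_info(550)   # 550 is TMDb's id for Fight Club
print(info['title'], '|', info['tagline'], '|', info['runtime'])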