Thursday, September 29, 2011

Postgresql- the pg_stat_activity view | incrementing the number of checkpoint_segments | the pg_log directory | Disabling the use of an index

postgresql- the pg_stat_activity view

  • Today I stumbled upon a system view I don't know how I ever lived without.

This goodie tells you who's connected, from which IP, under which process id, which query is being executed, and whether that query is waiting for a lock to be released. Extremely useful info!
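A minimal query sketch against it (column names as in PostgreSQL 9.0; later versions renamed procpid to pid and current_query to query):

```sql
-- Who is connected, from where, running what, and waiting on a lock?
SELECT datname, procpid, usename, client_addr,
       waiting, current_query
FROM pg_stat_activity
ORDER BY procpid;
```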

Incrementing the number of checkpoint_segments for heavy write operations
When the transaction log runs out of free segments, a checkpoint is triggered: the system flushes all dirty data to disk so the log segments can be recycled. A checkpoint is also triggered on a timer, typically every 5 minutes. Under massive write load, the transaction log fills up too quickly, and the resulting frequent checkpoints cause a substantial slowdown from all that flushing to disk.

Checking how frequently checkpoints occur:

Checkpoints should happen every few minutes. If they happen too often (for instance, every minute), system performance drops noticeably. To find out how often your system is currently checkpointing, analyze the timestamps in the log. First, however, make sure that checkpoint logging is enabled. Check the following options in the file postgresql.conf:
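A sketch of the relevant postgresql.conf settings (the parameter names exist in PostgreSQL 9.0; the values shown are illustrative, not recommendations):

```
checkpoint_segments = 16     # default is 3; raise for write-heavy loads
checkpoint_timeout = 5min    # time-based checkpoint trigger
log_checkpoints = on         # write a timestamped log line per checkpoint
logging_collector = on       # capture the server log into files
log_directory = 'pg_log'     # relative to the data directory
```

With log_checkpoints on, the spacing between checkpoint entries in the log tells you the actual frequency.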

The pg_log directory

Another thing worth mentioning is having a look once in a while at the pg_log directory (within the postgresql cluster directory).

Just for the sake of it I decided to log everything in our production database, but now I think it's likely to become bloated and cause storage havoc. I need to set that option back to how it was.

Disabling the use of an index

Set the indisvalid column to false in the pg_index catalog. The index will still be updated but won't be used for queries. Why would you want such behavior? To profile query performance, maybe... In the case of bulk insertions, it might be better to drop the indexes altogether.
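A sketch of that catalog tweak (the index name is hypothetical, and poking at system catalogs on a production database is risky):

```sql
-- Make the planner ignore the index; it keeps being maintained on writes.
UPDATE pg_index
SET indisvalid = false
WHERE indexrelid = 'my_index'::regclass;

-- Re-enable it afterwards.
UPDATE pg_index
SET indisvalid = true
WHERE indexrelid = 'my_index'::regclass;
```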

Tuesday, September 27, 2011

I said such! No, you didn't, I'm sure!

Today I learned (yet again) the importance of good communication and of documenting everything. Previously I had worked mostly on my own, but now I need to learn how to work and cooperate with others. I thought it wouldn't be that difficult, but it has its quirks. From now on, I'll take a couple of minutes to report advances and agreements on a daily basis. I find it encouraging to share goals with someone else, and I'll give my best to accomplish them.

Monday, September 26, 2011

postgresql- selective distinct (on a specific subset of fields) | renaming a view

postgresql- selective distinct (on a specific subset of fields)

I don't know whether there's another situation where this feature would be useful, but right now, due to an ETL specification, I need to make sure that a particular combination of fields doesn't repeat in the source flow of a fact table. So I turned the source query into a view and applied the DISTINCT ON clause.


If DISTINCT is specified, all duplicate rows are removed from the result set (one row is kept from each group of duplicates). ALL specifies the opposite: all rows are kept; that is the default.

DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions evaluate to equal. The DISTINCT ON expressions are interpreted using the same rules as for ORDER BY (see above). Note that the "first row" of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first. For example:
SELECT DISTINCT ON (location) location, time, report
FROM weather_reports
ORDER BY location, time DESC;

retrieves the most recent weather report for each location. But if we had not used ORDER BY to force descending order of time values for each location, we'd have gotten a report from an unpredictable time for each location.

The DISTINCT ON expression(s) must match the leftmost ORDER BY expression(s). The ORDER BY clause will normally contain additional expression(s) that determine the desired precedence of rows within each DISTINCT ON group.

postgresql renaming a view

Renaming a view is as easy as it gets, for instance:
alter view actualizar_inmueble rename to v_actualizar_inmueble;


Postgresql - Getting the sql behind a view

If you need to take a glance at the SQL query that created a view, you can obtain that information from the pg_views catalog. The catalog's structure is pretty much self-explanatory.

As you can see, the query is contained in the definition field, which is of type text.
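A minimal sketch (reusing the view renamed above):

```sql
-- Grab the SQL behind a view from the pg_views catalog.
SELECT definition
FROM pg_views
WHERE viewname = 'v_actualizar_inmueble';
```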

Wednesday, September 7, 2011

NVL (oracle) & coalesce (postgresql) functions

I've used the NVL function (which stands for "null value") before; it provides a pretty useful service: if a field in your query returns a null, it replaces that null with another value you specify. This comes in particularly handy when filling a table that doesn't allow null values.

The equivalent in postgresql is the coalesce function:
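A hypothetical side-by-side (the table and column names are made up; COALESCE returns the first non-null argument and, unlike NVL, accepts any number of arguments):

```sql
-- Oracle
SELECT NVL(phone, 'n/a') FROM customers;

-- PostgreSQL
SELECT COALESCE(phone, 'n/a') FROM customers;
```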

As with everything else, overuse may hurt if you apply it without first analyzing where those null values come from (IMHO).


Postgresql-Optimizing huge inserts/copy's

It boils down to disabling fsync, which forces committed transactions to be flushed to disk.

I found out about this interesting tip on facebook... what are the odds of that? Thanks Jorge Lopez!

To disable it, just set the environment variable PGFSYNC to no.

Note: I read that the environment variable relates to previous versions of postgresql. I'm using 9.0, so I'm disabling it (fsync = off) in the postgresql.conf file instead, as the variable seems to be gone.
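The postgresql.conf change, for the record (remember to flip it back and reload once the bulk load finishes):

```
fsync = off   # DANGEROUS: a crash during the load can lose or corrupt data
```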

Side effects:

Should a system failure arise, data loss is bound to happen, so you're advised to re-enable it once you've loaded your data.





Friday, September 2, 2011

Talend - xulrunner (BI - "the resurrection")

I ran into an ugly error while starting up Talend (a.k.a Talend Open Studio) on linux (OpenSuse 11.4)

[sourcecode language="bash"]
# A fatal error has been detected by the Java Runtime Environment:
#  SIGSEGV (0xb) at pc=0x6fa62bde, pid=14623, tid=3076241088
# JRE version: 6.0_22-b22
# Java VM: OpenJDK Client VM (20.0-b11 mixed mode linux-x86 )
# Derivative: IcedTea6 1.10.2
# Distribution: Dummy Product (i586), package suse-4.3.1-i386
# Problematic frame:
# C  []  NS_UTF16ToCString_P+0x1abd6
# An error report file with more information is saved as:
# /home/rene/TOS-All-r63143-V4.2.2/hs_err_pid14623.log
# If you would like to submit a bug report, please include
# instructions how to reproduce the bug and visit:
./ line 7: 14623 Aborted                ./TalendOpenStudio-linux-gtk-x86
[/sourcecode]


The thing is that you need to specify the path to the xulrunner library (of course you ought to install it first if not present on your system) in the TalendOpenStudio-linux-gtk-x86_64.ini or TalendOpenStudio-linux-gtk-x86.ini depending on your architecture. Mine is as follows:

[sourcecode language="bash"]
-Dorg.eclipse.swt.browser.XULRunnerPath=/usr/lib/xulrunner-2.0.1/xulrunner
[/sourcecode]

I'm also giving it more memory than the default parameters (still experimenting with them), as I hate seeing it run that slow.
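For context, a sketch of how the relevant part of the .ini might end up looking (the XULRunnerPath is the one above; the memory values are illustrative assumptions, not the ones I settled on):

```
-vmargs
-Xms256m
-Xmx1024m
-Dorg.eclipse.swt.browser.XULRunnerPath=/usr/lib/xulrunner-2.0.1/xulrunner
```

In Eclipse-style .ini files, anything after -vmargs is passed straight to the JVM, so the xulrunner property has to go in that section.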