Monday, April 4, 2011

Ass-kicking Business Intelligence Course and Merge Sort


  • Our Information Management course finally started today, and I'm very pleased with it. It's being given by a chemist (what are the odds of that!) who is also an experienced DBA and open source software expert. Some topics we reviewed:



  1. Business intelligence (BI) focuses on interrelations and facts that serve as a guide for decision making and, ultimately, for achieving goals.

  2. What really matters? Return on Investment (ROI): what's the relation between what I'm putting in and what I'm getting out? (Roughly, ROI = (gain - cost) / cost.) Typical information management systems don't provide that information. In general terms we're interested in KPIs (key performance indicators), such as ROI.

  3. BI is not appropriate for all enterprises. We're talking about large amounts of data (e.g. data generated by multinational enterprises, the government, etc.) and, in general, enterprises with several information systems that have been operating for several years.

  4. BI is a set of abilities and techniques aimed at understanding the commercial and functional environment.

  5. Data Mart. The set of data pertaining to a specific area of the business (e.g. accounting, warehouse, etc.).

  6. Data warehousing. Depends on the acknowledgment, analysis and discovery of information. Provides historical and predictive views, plus alerts (warnings that come up when some deviation from the goals occurs).

  7. The Business Intelligence Life Cycle. Data Sources -> Extraction, Transformation and Load (ETL) -> Data Marts / Data Warehouse -> Business Intelligence Tools -> Outputs (such as web reports). The most important phases in this cycle are ETL and data warehousing. For instance, in ETL we have a centralized system that gathers all the data from heterogeneous sources in an automated way.

  8. The dimensional model. In contrast with the relational model, it doesn't enforce normalization. There are two kinds of tables: fact tables, and dimension tables that provide services to the fact tables. What do we understand by dimensions? The variables we're interested in analyzing (e.g. I want a report by vendor, or by year, or by model). It implies high granularity, i.e. a very high level of detail; redundancy is not avoided but encouraged because it speeds up processing (weird!). Surrogate keys are used (usually integers defined by the system). What's a fact? Numeric data. Long and descriptive fields are encouraged.

  9. Star Schema. A central fact table surrounded by dimension tables, drawn in an entity-relation fashion. It's the basic form of a data warehouse (I sketched the idea in code right after this list).

  10. We were given some tips regarding new BI projects: have a data quality program (to reduce garbage); sell convenience and decision-making support, not less work for the IT crew; don't over-dimension the project, start with one department at a time.


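To make points 8 and 9 concrete for myself (this little sketch is mine, not from the course), here's a minimal star schema written as plain C++ structs: a central fact table whose rows hold surrogate integer keys pointing into the dimension tables, plus the numeric measures themselves; a "report by vendor" is then just an aggregation over those keys. All the names (DimProduct, DimVendor, DimDate, FactSale) and the data are made up for illustration.

[sourcecode language="cpp"]
#include <iostream>
#include <string>
#include <vector>
using namespace std;

// Dimension tables: long, descriptive attributes; each row gets a surrogate integer key.
struct DimProduct { int productKey; string name; string category; };
struct DimVendor  { int vendorKey;  string name; string city; };
struct DimDate    { int dateKey;    int year; int month; int day; };

// Fact table: surrogate keys pointing into the dimensions, plus the numeric measures (the "facts").
struct FactSale   { int productKey; int vendorKey; int dateKey; double quantity; double amount; };

int main(){
    // Hypothetical data, just to show the shape of the schema.
    vector<DimVendor> vendors;
    DimVendor acme = {1, "Acme", "Bogota"};
    vendors.push_back(acme);

    vector<FactSale> sales;
    FactSale s1 = {1, 1, 1, 2.0, 1800.0};
    FactSale s2 = {2, 1, 1, 1.0,  300.0};
    sales.push_back(s1);
    sales.push_back(s2);

    // "A report by vendor": aggregate a numeric fact grouped by one dimension.
    for(size_t v = 0; v < vendors.size(); v++){
        double total = 0.0;
        for(size_t i = 0; i < sales.size(); i++){
            if(sales[i].vendorKey == vendors[v].vendorKey) total += sales[i].amount;
        }
        cout << vendors[v].name << ": " << total << endl;
    }
    return 0;
}
[/sourcecode]
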
And a few more things. After all, the course was worth the wait.

  • Today I also studied the merge sort algorithm again. There's not much to its implementation, and a phrase often attributed to Leonardo da Vinci comes to mind: "Simplicity is the ultimate sophistication"; in that regard, merge sort is highly sophisticated. Its only shortcoming is the extra space it needs, since it does not do the sorting "in place", so for high volumes of data it might not be the best choice. I also read about an approach that combines the best of quicksort (in-place sorting) and heapsort (n log n performance); I'll study it tomorrow. Here is the code, adapted from Analysis of Algorithms: An Active Learning Approach by Jeffrey J. McConnell:


[sourcecode language="cpp"]

#include <cstdlib>
#include <iostream>
#include <ctime>

using namespace std;

const int MAX_SIZE = 1000;
int datA[MAX_SIZE];

// Fill the global array with pseudo-random values in [lBound, hBound].
void createData(){
    int lBound = 0, hBound = 1000;
    int range = hBound - lBound + 1;
    for(int i = 0; i < MAX_SIZE; i++){
        datA[i] = lBound + int(rand() / (RAND_MAX / range + 1.0));
    }
}

// Merge the two sorted sublists e[start1..end1] and e[start2..end2]
// into a temporary array, then copy the merged result back into e.
void mergeLists(int e[], int start1, int end1, int start2, int end2){
    int result[MAX_SIZE];
    int finalStart = start1;
    int finalEnd = end2;
    int indexC = 0;   // index into result (0-based here; the book's pseudocode counts from 1)

    // Take the smaller front element of the two sublists until one runs out.
    while(start1 <= end1 && start2 <= end2){
        if(e[start1] < e[start2]){
            result[indexC] = e[start1];
            start1++;
        }
        else{
            result[indexC] = e[start2];
            start2++;
        }
        indexC++;
    }

    // Move the part of the list that is left over.
    if(start1 <= end1){
        for(int i = start1; i <= end1; i++){
            result[indexC] = e[i];
            indexC++;
        }
    }
    else{
        for(int i = start2; i <= end2; i++){
            result[indexC] = e[i];
            indexC++;
        }
    }

    // Now put the result back into the list.
    indexC = 0;
    for(int i = finalStart; i <= finalEnd; i++){
        e[i] = result[indexC];
        indexC++;
    }
}

// Recursively split e[first..last] in half, sort each half, then merge the halves.
void mergeSort(int e[], int first, int last){
    if(first < last){
        int middle = (first + last) / 2;
        mergeSort(e, first, middle);
        mergeSort(e, middle + 1, last);
        mergeLists(e, first, middle, middle + 1, last);
    }
}

int main(int argc, char** argv){
    srand((unsigned) time(0));
    createData();
    mergeSort(datA, 0, MAX_SIZE - 1);
    for(int i = 0; i < MAX_SIZE; i++) cout << datA[i] << " ";
    cout << endl;
    return 0;
}

[/sourcecode]

What does it do? It recursively divides the input into smaller and smaller sublists; once single-element sublists are reached, it merges them back together in sorted order until just one ordered list remains, which ends up written back into the original array.
