The Squirrels and the Chestnuts

These days I am trying out some of the problems Codechef has put up for its competitions. I will be posting the algorithms I used, both for anyone who's interested and for comments and suggestions for improvement.
Problem (Taken from the June 2010 competition at Codechef)
There are n squirrel(s) waiting below the feet of m chestnut tree(s). The first chestnut of the i-th tree will fall right after Ti second(s), and one more every Pi second(s) after that. The "big mama" of the squirrels wants them to bring to their nest no less than k chestnuts, to escape the coming storm, as fast as possible! So they are discussing below which trees to wait in order to gather enough chestnuts in the shortest time. The time to move to the positions is zero, and the squirrels move nowhere after that.
Request
Calculate the shortest time (how many seconds more) in which the squirrels can gather enough chestnuts.
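To make that concrete: with one squirrel and one tree where T1 = 2 and P1 = 3, chestnuts fall at seconds 2, 5, 8 and so on, so gathering k = 3 chestnuts takes 8 seconds.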

Algorithm

FindLeastTime (integer numberOfSquirrels, integer numberOfChestnuts, integer numberOfTrees, array startTimes, array intervals)
{
    /*
    startTimes - contains the time at which the first chestnut falls from each tree
    intervals - contains the time interval between consecutive chestnuts from each tree
    */
    time = 0;
    count = 0;        // chestnuts the squirrels could have collected so far
    array multiplier; // for each tree, the number of intervals elapsed so far
    array tally;      // chestnuts fallen from each tree at the current time
    while (count < numberOfChestnuts)
    {
        for (each tree i)
        {
            if (startTimes[i] + multiplier[i] * intervals[i] <= time)
            {
                tally[i] = tally[i] + 1;
                multiplier[i] = multiplier[i] + 1;
            }
        }
        tempArray = Sort(tally); // ascending order
        count = 0;
        for (j = 1; j <= numberOfSquirrels; j = j + 1)
        {
            count = count + tempArray[j-th index from the end]; // sum the j largest tallies
        }
        if (count < numberOfChestnuts)
        {
            time = time + 1;
        }
    }
    return time;
}
Try implementing the algorithm in a language of your preference. I tried out C# and Java. C# was somewhat faster in execution.
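For anyone who wants a starting point, below is a minimal C# sketch of the pseudocode above. The method and parameter names are my own, the tree count is taken from the array lengths, and it keeps the same brute-force second-by-second search, so treat it as a sketch rather than an optimized solution.

using System;
using System.Linq;

class Chestnuts
{
    // Earliest time at which the squirrels can have collected at least
    // chestnutsNeeded chestnuts, following the pseudocode above.
    static int FindLeastTime(int squirrels, int chestnutsNeeded,
                             int[] startTimes, int[] intervals)
    {
        int trees = startTimes.Length;
        int[] multiplier = new int[trees]; // intervals elapsed per tree
        int[] tally = new int[trees];      // chestnuts fallen per tree
        int time = 0;

        while (true)
        {
            for (int i = 0; i < trees; i++)
            {
                if (startTimes[i] + multiplier[i] * intervals[i] <= time)
                {
                    tally[i]++;
                    multiplier[i]++;
                }
            }
            // Each squirrel waits under one tree, so the best they can do
            // by now is the sum of the 'squirrels' largest tallies.
            int count = tally.OrderByDescending(c => c).Take(squirrels).Sum();
            if (count >= chestnutsNeeded)
                return time;
            time++;
        }
    }

    static void Main()
    {
        // One squirrel, one tree: first chestnut at t = 2, then every 3 s,
        // so chestnuts fall at 2, 5, 8 and k = 3 takes 8 seconds.
        Console.WriteLine(FindLeastTime(1, 3, new[] { 2 }, new[] { 3 }));
    }
}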

A small problem encountered in Creating MySQL stored procedures

The other day I started working on a small personal project which involved a MySQL database. I'm pretty familiar with MS SQL, and MySQL is almost the same as far as the language goes, so I got right down to creating the database and the required stored procedures. This is where I hit a little snag. Maybe not so little, considering the time it took me to figure it out 🙂

The code I wrote for the stored procedure is given below.

CREATE PROCEDURE test()
BEGIN
SELECT * FROM entries;
END

The query is as simple as can be and worked perfectly when I ran it in the phpMyAdmin SQL query window. But whenever I tried to create the stored procedure I kept getting error code 1064, which basically indicates a syntax error. After a lot of reading on creating stored procedures in MySQL I was able to confirm that my syntax was in fact correct. So what was the problem?

This is what I finally found out from a discussion over at Stack Overflow. MySQL considers ; to be the statement delimiter by default. The entire CREATE PROCEDURE block is also considered a statement to which the delimiter applies. So when the parser reaches the semicolon in my code, it treats it as the delimiter ending the entire CREATE PROCEDURE statement. But the END corresponding to the BEGIN lies beyond that semicolon, which throws the error I was getting.

All I had to do was change the delimiter to // using the delimiter option text box phpMyAdmin provides right under the query editing area. This way a // is automatically added to the end of the entire block, and the semicolon works as a delimiter for the simple statements inside it.

Those of you who are using a MySQL client that is command line based will need to wrap your code as shown below (note the // after END, which is what actually terminates and executes the CREATE PROCEDURE statement):

delimiter //

CREATE PROCEDURE test()
BEGIN
SELECT * FROM entries;
END//

delimiter ;

This will change the delimiter back to the default once your code has been executed.

Using an XML Configuration file and Expressions in an SSIS package

Introduction

In my previous posts I've given an introduction to developing simple ETL packages using Microsoft Integration Services. In this post I'll show how you can make your ETL packages more dynamic using expressions evaluated at run time, as well as how to set properties of a package at run time using an XML configuration file.

The use of expressions allows a developer to create an ETL package where settings such as database connection strings, output file names and file locations are acquired at run time, while the use of an XML configuration file makes it easy to change common settings across multiple ETL packages.

I'll be demonstrating these features using the simple ETL package developed in my previous posts, to which I will add an XML configuration file that sets the input file location. (This is usually not the kind of value set by a configuration file, since the source file location changes from ETL to ETL; a better candidate would be the connection string of a data source shared by multiple ETLs.) Then I will provide the file location as an expression to the file connections used in the ETL, so that the actual file location is acquired by the connections at run time.

Step 1:

Create a String variable with global scope to hold the path of the input data file.


The variable created to hold the file path

Step 2:

Next, right click on the control flow design area and select the Package Configurations option. Then, from the screen that appears, tick the check box enabling package configurations and click the Add button to add a configuration.

The following screen will then pop up.

Adding an XML Configuration file

From the configuration type select XML configuration file (note: there are a number of possible configuration types, including parent package variables, which will be introduced in a future post).

Then select the "Specify configuration settings directly" option, and for the configuration file name give a suitable name at a location of your choice using "Browse".

Setting the configuration file name

The next step in the wizard allows you to select the properties of the package whose values will be set by the configuration file. In this case I've given only the value of the InputFilePath variable as the property to be set by the XML configuration file, as seen in the screenshot below.

adding properties to be set using the XML file

Now you can continue, and in the next step of the wizard set the name for the configuration as seen by the project. The next time you access package configurations, this configuration will appear in the list under the given name.

Step 3:

Let's take a look at the configuration file we've created. Browse to the file's location on the file system and open it in a text editor.

The XML configuration file

Each property set by the configuration file has its information within a <Configuration></Configuration> tag pair. Notice the configuration tags for the InputFilePath variable, and note that the property being set is identified by the Path attribute within the Configuration tag.

The value for the property should be set within the <ConfiguredValue></ConfiguredValue> tags inside the relevant Configuration tags, as seen in the screenshot below.

Setting the value within the XML Configuration file
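If you prefer to see it as text, a configuration file that sets the InputFilePath variable would look roughly like the sketch below. The Path follows the pattern the wizard generates for a variable's Value property, and the actual file path is just an illustrative value.

<DTSConfiguration>
  <Configuration ConfiguredType="Property" Path="\Package.Variables[User::InputFilePath].Properties[Value]" ValueType="String">
    <ConfiguredValue>C:\Data\Input.xlsx</ConfiguredValue>
  </Configuration>
</DTSConfiguration>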

The XML configuration file has now been modified to provide the location of the input file to the package at run time. Let's move back to the SSIS project.

Step 4:

Now that we have set up the XML configuration file to assign the value of InputFilePath at run time, we can use this variable to dynamically provide the input file path to the Excel File Connection Manager and the RowSet Schema Connection Manager.

Select the Excel File Connection Manager that we've created from the connection managers area. Then, from the properties pane that appears on the right, select the "Expressions" option.

The Expressions option in the connection manager

Note that this Expressions option is available on most SSIS components, allowing their properties to be set dynamically.

The screen that appears when you select "Expressions" allows you to select different properties of the Excel Connection Manager and construct the corresponding expression. In this case I've selected the Excel File Path property and set the expression to the InputFilePath variable as shown.
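In case you want to type it rather than drag and drop, the expression here is nothing more than the variable reference, which the SSIS expression language writes as @[User::InputFilePath].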

Selecting the Excel File Path property

Constructing the expression using Expression Builder

You can simply drag and drop variables from the variable list into the expression area, as well as type in constants. The expression builder also lets you use a number of string, math, logic and other operations, which can likewise be dragged and dropped.

Once the expression is set the Expression screen for the Excel File Connection Manager should look like this.

Excel File Path with the Expression set

Now select the Excel RowSet Schema Connection Manager that we use to iterate through the sheets of the input Excel file, and set its "Server Name" property to the same expression as the one used for the Excel File Path.

Conclusion

Now the SSIS package is configured so that at run time the XML configuration file sets the value of the InputFilePath variable, and that path is assigned to the Excel file source through expressions. Normally the use of expressions and a configuration file is unnecessary for a package as simple as this one, with just one variable used in expressions for two connections. However, when a value is used in multiple components in the same package, or across multiple packages, expressions and a configuration file not only standardize the packages but also significantly reduce the work needed to change a value used by many components or packages (it only needs to be modified once, in the configuration file).

Run the completed package and observe how the added configuration works.

Enjoy! :)

Extracting data from multiple sheets in an Excel file in SSIS

Introduction

In my previous post I showed how we could develop a simple ETL package using Microsoft SQL Server Integration Services to transfer the data in a single sheet of an Excel file to a table in a database.

But what if you had multiple sheets of data in your Excel file? This post is meant to show you how to tackle that problem.

What I will touch on in this post

  • Use of package variables in SSIS
  • Use of the foreach control flow component
  • Use of a script task
  • Possible errors

Step 1:

The ETL package we completed in the previous post can be extended to demonstrate how multiple sheets from an Excel file can be processed.

As a first step I've extended the data in the Excel file as shown. Note that the data in the first sheet has already been inserted into the database by the ETL from the previous post. I will use this fact to demonstrate the function of the Slowly Changing Dimension component used in the ETL.

Sheet1

Sheet 2

In order to tackle multiple sheets in an Excel file we need to use SSIS package variables, both to maintain state and to test certain conditions. In this example I've created two package variables with global scope and the data types shown below. The need for these variables will become clear as we go on.

package variables

Step 2:

The key component we need to use to achieve our goal is the  “foreach” loop container. We can use this to iterate through the sheets in our target Excel file.

Drag and drop a "foreach" loop container from the toolbox, and then drag and drop the "Load to users" data flow task into the loop container. Next, right click on the container and select the Edit option to configure it.

In the configuration interface, go to the Collection tab and configure the settings as shown below.

Loop Container properties

Notice that for the Connection property I have given the location of the Excel file. When you create a new connection you will be directed to a form where you have to set the connection properties. Set these properties as shown below.

Connection Properties 1

Connection Properties 2

The value “Excel 12.0” for the Extended properties is specific to Excel 2007  (.xlsx) files. If you are dealing with an Excel 97-2003 file you need to use “Excel 8.0”.
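(For reference, the connection these settings produce is an OLE DB connection string along the lines of Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\Data\Input.xlsx;Extended Properties="Excel 12.0"; for Excel 97-2003 files the provider is Microsoft.Jet.OLEDB.4.0 with Extended Properties "Excel 8.0". The exact path here is illustrative.)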

In the case of Excel 97-2003 files I still haven't got the looping through sheets to work. For some reason I get an error at the Excel source component in the data flow task that basically says the file is being held by another component. (I've surmised this is the loop container, since it's the only other component that connects to the file, though why this happens only for Excel 97-2003 files I have no idea yet.)

Once you OK the connection properties, a connection manager will be created for the loop container to use to connect to the Excel file, and it will be visible in the connection managers area.

Back at the configuration screen of the loop container, move to the Variable Mappings tab and set the properties as shown.

Getting the sheet names

What we do here is extract the sheet name as the loop container iterates through each sheet of the Excel file. The sheet name is stored in the ActiveWorkSheet user variable we created at the very beginning.

Now you have completed the configuration of the loop container, so you can click OK and close the configuration screen.

Step 3:

The problem with this approach is that, for some reason, in some iterations the variable gets set to bogus values that will cause errors if used as sheet names. Therefore we need to validate that the value of the variable is an actual sheet name before we use it. For this purpose I have used a Script Task on the control flow. The Script Task allows you to write code in C#.

Drag and drop a Script Task into the loop container, above the data flow task. Then right click it and select Edit to configure the task. In the configuration screen, set the properties as follows.

setting read write variables for the script task

In the Script Task I will read the value of ActiveWorkSheet and, based on a validity test (a regular expression matching default sheet names), set the value of IsValidSheet to true or false as appropriate (which is why IsValidSheet is taken as a read/write variable).

Now enter the code area by selecting Edit Script, and add the following code within the Main method of the script.

Code to validate the sheet name
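Since the screenshot may be hard to read, here is a rough sketch of the kind of code the Main method contains. The variable names are the ones created earlier; the regular expression is an assumption that only default sheet names such as Sheet1$ should be accepted.

public void Main()
{
    // Read the sheet name the loop container assigned to the variable.
    string sheetName = Dts.Variables["User::ActiveWorkSheet"].Value.ToString();

    // Accept only values matching the default sheet name pattern, e.g. "Sheet1$".
    bool isValid = System.Text.RegularExpressions.Regex.IsMatch(
        sheetName, @"^Sheet\d+\$$");

    // IsValidSheet is a read/write variable, so we can assign the result to it.
    Dts.Variables["User::IsValidSheet"].Value = isValid;

    Dts.TaskResult = (int)ScriptResults.Success;
}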

Notice the way in which the package variables are accessed and assigned to in the script. Build  the script and exit to the control flow area.

Step 4:

Now connect the validation script to the data flow task by dragging the outgoing arrow of the Script Task onto the data flow task. Then right click on the connection that is made and select Edit.

In the configuration screen that appears, set up the properties as follows.

Applying a condition to the control flow

Choosing "Expression and Constraint" for the evaluation operation ensures that the control flow proceeds beyond the Script Task only if the task completes successfully (as specified by Value = Success) and the IsValidSheet variable is true (as specified by the expression).
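(Assuming IsValidSheet was created as a Boolean variable, the expression is simply @[User::IsValidSheet], or equivalently @[User::IsValidSheet] == true.)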

Press OK and you should have a control flow similar to the one below.

Completed control flow

The fx symbol indicates that a condition has been applied to the flow from the script task to the data flow task.

Step 5:

Now double click the data flow task and move into the data flow design area. There, change the Excel source properties as follows.

Modifying the Excel Source to iterate through the sheets

Notice that the data access mode is now set to a variable type, and the variable given is the one to which the loop container assigns the sheet name. In order to preview, you may want to initialize the ActiveWorkSheet variable to a known sheet name (e.g. Sheet1$) in the variables pane. This will automatically get overwritten at run time.

Step 6:

The package is now complete, and you can execute it and see that the new data is entered into the database. Notice that in the first iteration of the loop the data flow doesn't proceed beyond the Slowly Changing Dimension, because exactly the same rows are already present in the database. In contrast, when iterating through the second sheet, which contains the new rows, you will see the data flow proceeding to the insert component, as below.

Completed Data Flow task

Note: if the Excel file you are using has empty sheets that don't even have the column headers, the package will give a synchronization error at run time. I am currently looking for a solution to this issue and will post when I find one (in the meantime you can remove empty sheets, or at least have the column headers in them).

Building a simple ETL package using MS SQL Integration Services

Introduction

During the last couple of months I've been in industrial training at a company that is developing a business intelligence application for a major bank in the country using Microsoft SQL Server and .NET. A significant part of the development involves the creation of ETL (Extract, Transform, Load) packages to transfer the data in the Excel files generated every month by the bank to the application database. This is carried out by developing ETL packages using the Integration Services component of the MS SQL Server suite, which provides a powerful and intuitive interface for development.

So this post is meant to be a very basic introduction to ETL package development in MS SQL Server Integration Services.

What you need for this to go smoothly

  • Enough familiarity with MS SQL Server to create your own databases and tables and query them
  • Familiarity with a Microsoft Visual Studio version
  • Access to both MS SQL Server and Microsoft Visual Studio including the business intelligence suite

Step 1:

Create the Database for our project with one table having data fields as shown by the image.

The Destination Table in our database

In my case I've created a database BlogDatabase and added the above table (note: the id is the auto-increment primary key of the table).

Step 2:

Start Visual Studio and create a new Integration Services project.

Once you have created the project, add an SSIS package by right clicking the SSIS Packages folder in the solution explorer and selecting the New Package option from the menu.

Once a new package is added you will notice that the design interface for the package consists of five tabs, as follows:

  • Control flow
  • Data Flow
  • Event Handlers
  • Package Explorer
  • Execution Results

Out of these, we only need to be concerned with the Control Flow and Data Flow tabs in this post.

The control flow design area gives us the ability to design the logical flow of the ETL operation, which may include testing conditions, looping and so on.

The data flow design area is accessed through a data flow task that we insert into the control flow design. A data flow task at the control flow level simply encompasses and represents a series of data manipulation operations that are added and managed in the data flow design area.

Step 3:

One of the first things we should do is establish our BlogDatabase as a shared data source for the project. While connections to the database can be made from each new package by directly referring to the connection string, it is good practice to have the database as a shared data source for the solution when it is used by multiple packages.

To create a shared data source out of the BlogDatabase, right click the Data Sources folder in the solution explorer and select the New Data Source option.

From the wizard that appears, select the 'New' option and then specify the server and database names.

Note: make sure to test the connection using the "Test Connection" button available on the form where you specify the server and the database.

Creating a shared data source

You can name the shared data source as you see fit in the final stage of the wizard, and once completed the data source will appear under the Data Sources folder in the solution explorer.

Once this is done you need to add a connection to the data source to your package. Right click on the connection managers area at the bottom of the UI and select the option "New Connection from Data Source". In the form that appears you can select the newly created shared data source, and it will be shown as added to the package in the connection managers area.

Step 4:

Now that you’ve set up the data source you are ready for the real work.

Take a look at the toolbox at the right end of the UI (it may need to be expanded), and from the Control Flow Items category select "Data Flow Task" and drag and drop it onto the control flow design area of your project.

Data Flow task Added to the Control Flow

You can rename the data flow task as you like by simply selecting it and typing. It's usually good practice to give tasks names that have some relevance to the task performed.

Step 5:

Let us now prepare a sample Excel file that will be the source of the data we want to transfer to the table we've created.

Sample excel file structure

Next, double click on the data flow task component you've dragged onto the design area. This will transfer you to the data flow design area, within the scope of the selected data flow task.

Designing the actual data transfer process within the data flow task is very intuitive. We just need to specify where our data is (the source), where we want it to end up (the destination) and what we want to do with it on the way (the transformations).

The toolbox provides us with a number of ready-made components to do exactly this.

In our case the data source is an Excel 2007 file. So, from the Data Flow Sources group, drag and drop the Excel Source component onto the design surface, then double click it to configure it.

Configuring the Excel Source

In the form that appears you have to set up a connection manager for the Excel source, which you can initiate by clicking the New button on the form.

This will lead to an Excel Connection Manager creation form, as follows.

Creating an excel connection manager

The image above is pretty self explanatory, so I'll not go into detail about configuring the connection manager. However, you should note the ticked option "First row has column names" in the form. While in the Excel file we have created the headers are in the first row, this may not always be the case: the headers may come after some other data, such as a date of creation (as in the BI project I am working on). This creates a complication in extracting the data from Excel files, the solution to which will be explained in a future post.

Now, if you've done everything correctly, once you've clicked OK, gone back to the Excel Source Editor, set the data access mode to Table or View and selected the relevant sheet name from the drop down, you should be able to preview the data in the Excel file using the Preview button, as below.

Preview of the Excel file from the Excel Source

Now you must select the columns in the Excel file from which you want to extract data. This can be done by selecting the Columns option in the Excel Source Editor and ticking the check boxes of the relevant columns displayed. This form also provides you with the option of renaming the columns as preferred for use in the data flow task.

Step 6:

Congratulations. Now you've configured your Excel source (unless you forgot to click OK on the Excel Source Editor :)). You should note that the Excel connection manager you created from within the Excel Source Editor is now visible in the connection managers area, from which you can reconfigure it if you need to.

Our next task is a small data transformation. When we extract data from an Excel file, the Excel source examines the first rows of data and decides the data type of each column being extracted. Usually text is taken as Unicode string data, or [DT_WSTR], and numbers as double-precision float, or [DT_R8]. However, in my table I have defined the user and telephone columns as VARCHAR, which is a non-Unicode type. So, in order to insert the data taken from the Excel file, we need to convert it into the non-Unicode string type [DT_STR], which is where the Data Conversion component comes in.

Drag and drop a Data Conversion component from the Data Flow Transformations group in the toolbox. Connect the Excel source to it by selecting and dragging the green arrow sticking out of the Excel source component onto the Data Conversion component. What this does is establish the logical sequence of the process (extract data from the file using the Excel source, then convert).

Now you can configure the Data Conversion component by double clicking it. The form that appears allows you to choose which of the data columns sent to the component should be converted, the conversion data type, and the names of the new data fields created by the conversion. It should be noted that the unconverted original column also remains available for use in subsequent components.

Configuring the data conversion component

Now your data flow design area should look something like this

The data flow design area after adding the data conversion component

Step 7:

Now your data is ready to be transferred to the database.

For this we need to use an appropriate destination component from the Data Flow Destinations group in the toolbox. However, in certain cases there is the option of using the "Slowly Changing Dimension" transformation, which also incorporates a data flow destination. An example of where the Slowly Changing Dimension comes in handy is as follows.

If we consider the data we are using for this post, it's obvious that the users are not going to be changing their names regularly :). But their telephone numbers may change over time, and what we need to do when we reload the data after some time is replace the phone number where it has changed for existing users, rather than adding each user, telephone combination as a separate row, so that we don't end up with duplicate phone numbers (one most likely outdated) for the same user.

However, note that using the name as a reference to update the phone number in the database is inappropriate, because the name may not be unique. It is used here only for simplicity.

OK. Now we can drag and drop a "Slowly Changing Dimension" component from the Data Flow Transformations group in the toolbox and connect the Data Conversion component to it.

Now follow the wizard that appears.

Configuring the Slowly Changing Dimension component

Here, select the connection manager to be the one we created right at the beginning using our shared data source, and set the table to the one into which we intend to enter the data. Next, match the data fields holding the data we want to transfer to the appropriate columns in the database table. (In this case the converted fields are used so as to match the data types; if you use the original columns you got from the Excel file you will get an error at run time.)

The next important thing to do is set the business key(s): a unique field or set of fields that can be used to identify a row (basically a candidate key in database terms). In this case I've chosen the name, though as explained before it may not be very suitable.

Proceed with the wizard. At the next stage you set the options for updating the non-key columns. Set the non-key column telephone to be a changing attribute, which basically means that if the telephone field for a particular business key were to change, it would be updated in place and no new row would be added.


Setting the non key field attribute type

In the next step of the wizard, tick the check box under the Changing Attributes heading as follows, and in the next stage leave the check box for inferring attributes unchecked.

Once you complete the wizard, the Slowly Changing Dimension will automatically create two components, as shown in the image below, to handle the two types of database operations to be carried out (updating existing rows and inserting new ones).

Components automatically generated by Slowly changing dimension

Step 8:

Now your package is complete and ready for testing. In the solution explorer, select the package file and right click. From the menu select "Execute Package" and the package will begin executing. As each component or task in the control flow or the data flow begins executing, its fill color will turn yellow, and once completed the component will turn green. If an error occurs in a component it will turn red, and based on the default error-handling properties the entire package may fail. You can find the error by moving the mouse over the offending component, or find detailed error information in the Execution Results/Progress tab.

In this case, of course, you will not encounter an error; the package will execute smoothly, and once execution is complete the data flow area of the DF task will look like this.

The appearance of the DF task on completion of execution

Now let's go take a look at the "Users" data table in the BlogDatabase.

Data table with inserted data

Conclusion

In this post I described how to create a very simple ETL package to transfer data from an Excel file to a database using Microsoft SQL Server Integration Services, at a level of detail I hope is sufficient for a first timer. In future posts on the area I intend to extend the same example and most likely address specific issues (and therefore be a lot shorter and sweeter :)).