Monday, 8 April 2013

Hive Connection with Talend Open Studio Example


Talend Open Studio for Big Data provides the following components to integrate with Hive. You can find these components in two sections of the Palette:

1.    Big Data
2.    Databases



tHiveConnection: tHiveConnection is used to establish the connection between Talend and Hive.



To configure the Hive connection we need the following parameters:

1.    Distribution: the distribution of Hive you are using, whether Hortonworks, Apache, Cloudera, or MapR.
2.    Hive version: the Hive version you are using.
3.    Host: the host address of the Hive server you are using.
4.    Hadoop properties: the JobTracker URL and NameNode URL of the running Hadoop server (example values below).
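For example, on a typical single-node Apache Hadoop 1.x setup, the NameNode URI is often hdfs://localhost:9000 and the JobTracker address localhost:9001; your host names and ports may differ, so check your core-site.xml and mapred-site.xml.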

NOTE: Before attempting the connection, make sure that your Hadoop server and Hive server are running properly. To check whether the Hadoop server is running, use the jps command.








If you see processes like the following running, your Hadoop server is up.
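On a typical single-node Hadoop 1.x setup, the jps output looks roughly like this (the process IDs will differ on your machine):

    4562 NameNode
    4688 DataNode
    4820 SecondaryNameNode
    4910 JobTracker
    5040 TaskTracker
    5123 Jps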

tHiveInput: This component is used to extract data from Hive tables on the basis of a given Hive query.




The query option is in the Basic settings of tHiveInput, as given below.
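For example, the Query field might contain a simple HiveQL statement such as the following (the table and column names here are hypothetical):

    select id, name, salary from employee;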



tHiveRow: tHiveRow is used to execute a HiveQL query on the specified database on each iteration of the Talend flow. Its connection settings are similar to those of tHiveInput.



Example of Hive integration with Talend:


The scenario above reads data from a Hive table.

Step 1. Set the necessary properties for tHiveRow as follows.


Step 2. In tHiveRow we also need to configure the Advanced settings, because here we are using a recordset object to extract the data from Hive.


Step 3. Set the tParseRecordSet properties to extract the desired data from the recordset.



Step 4. Run the scenario. If you watch the Hive Thrift server console while the scenario runs, you can observe how Hive executes the query as MapReduce jobs, as shown below.


 Output on Talend console:



That's how Talend Open Studio is integrated with Hive. For further queries drop me a mail. Enjoy playing with Hive :)




Wednesday, 27 March 2013

Talend Open Studio Integration with Teradata Database


Talend Open Studio is an ETL tool that works with various databases. In this post we will see how Talend is integrated with the Teradata database. The integration basically involves three parts:

  1.    Establishing the connection with Teradata
  2.    Data insertion in Teradata
  3.    Data extraction from Teradata


   Talend Open Studio provides the following components to integrate with the Teradata database. These components are available in the Palette window under the Databases section.


     We can use these components according to our needs in the ETL process.
Now we will see how the integration with Teradata is done.

 1.        Establishing Connection with Teradata
       Before performing any operations through Talend, we first establish the
       connection to the database and check whether it is working.


       In tTeradataConnection we need to give the host, database, username, and password for connecting the database to Talend.
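For context, tTeradataConnection essentially opens a JDBC connection on your behalf. A minimal hand-written sketch of the same thing, with placeholder host, database, and credentials, looks roughly like this (the Teradata JDBC jars must be on the classpath):

import java.sql.Connection;
import java.sql.DriverManager;

public class TeradataConnectionSketch {
    public static void main(String[] args) throws Exception {
        // Load the Teradata JDBC driver
        Class.forName("com.teradata.jdbc.TeraDriver");

        // DATABASE=... selects the default database for the session
        String url = "jdbc:teradata://myhost/DATABASE=mydb";
        Connection conn = DriverManager.getConnection(url, "myuser", "mypassword");
        System.out.println("Connected: " + !conn.isClosed());
        conn.close();
    }
}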


2.        Data Insertion in Teradata using Talend
Once the connection is established, we can insert data into Teradata using tTeradataOutput.
This component inserts the input data flow into the Teradata database.





tTeradataOutput provides the following options: host name, database name, username, password, and the actions you want to take on the table and on the data.

  1.     Action on table: defines the action to be taken on the specified table before writing (for example, creating it or dropping and recreating it).


  2.      Action on data: specifies the action to take on the data while writing to the table (for example, insert or update).



   3. Data Extraction from Teradata
For extracting data from Teradata we use the tTeradataInput component. On the basis of the given query, it returns the result from the specified table.
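For example, the Query field of tTeradataInput might contain a statement like the following (table and column names are hypothetical):

    SELECT id, name FROM employee;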


Example scenario for Teradata and PostgreSQL ETL

In this example we run an ETL process between Teradata and PostgreSQL: data is extracted from Teradata and loaded into PostgreSQL.

1.    Setting the properties of tTeradataInput.


Using the browse-table option we can check whether the connection to the database is working: if the connection is established, it will show a list of all the tables in the database.

2.    Setting the properties for tPostgresqlOutput.


3.    Run the scenario by pressing F6.


As you can see, rows are extracted from Teradata and inserted into PostgreSQL.



4. Output console

tLogRow will print all the input rows coming from Teradata to the Output console.


That's all from my side. Contact me for any other queries related to Talend integration.








Wednesday, 13 March 2013

How to Increase the virtual box disk size of an existing linux virtual machine

Many people find it difficult to increase the disk size of an existing virtual machine, but trust me, after reading this post you will find it very easy. Follow the steps given below.

First, make sure that the virtual machine is not running while you make these changes.

1. Right-click the virtual machine whose disk you want to enlarge and open the Settings menu.

Settings Menu
2. Go to the Storage section, where you will find the option to add a new hard disk to the virtual machine.

Storage Option
3. Click Add Hard Disk to create the new storage disk.

Create disk option
Here you can also choose a pre-existing disk; I have created a new disk to extend the storage space of my virtual machine.

4. Choose the drive file type from the given options; I have chosen the *.vdi format for the hard drive.


Drive File Type
  5. Choose the storage type on your physical hard drive; in most cases dynamically allocated is fine.

Storage Type
6. Now choose the size of the hard drive; you can also give it a name.

Storage Size
7. Click Create; the hard disk will be created and you will see an extra hard drive attached to your virtual machine.


Although the hard disk has now been created, the process is not complete: you cannot use that space directly in your machine. You first have to partition, format, and mount the drive in your desired directory.

Use the following set of commands in the terminal of your Linux machine to partition, format, and mount the new drive.

  • Using the fdisk command, you can list all the existing disk partitions in your system.
As you can see, there is one new disk for which no valid partition table exists, so first we need to create a partition on the new hard drive (see the sketch below).
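A minimal sketch of the partitioning step, assuming the new disk shows up as /dev/sdb (check the fdisk -l output for the actual device name):

    sudo fdisk -l            # list all disks and their partitions
    sudo fdisk /dev/sdb      # then: n (new partition), p (primary), 1, accept defaults, w (write)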



Now that the partition has been created successfully, you need to make a file system on it using the mkfs command.
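For example, to create an ext4 file system on the new partition (again assuming it was created as /dev/sdb1):

    sudo mkfs.ext4 /dev/sdb1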


After creating the file system, you are ready to mount the disk on your desired directory.
I have mounted my drive on the data directory in the current folder; you can choose a location according to your need.
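Assuming the same hypothetical device name, the mount step looks like this:

    mkdir data
    sudo mount /dev/sdb1 data

To make the mount persist across reboots, you can also add a corresponding entry to /etc/fstab.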


That is all; you can now see the mounted disk in your desired directory.

Thanks
Contact for any other queries. 

Thursday, 21 February 2013

Bulk Insert in Cassandra


Cassandra's bulk loading interfaces are most useful in two use cases: initial migration or restore from another datastore/cluster and regular ETL. Bulk loading assumes that it's important to you to avoid loading via the thrift interface, for example because you haven't integrated a client library yet or because throughput is critical.

In this post we will see two efficient ways of bulk loading in Cassandra:
1. Using the COPY command to import CSV files into the database.
2. Running the COPY command from Java code.

COPY command for Cassandra bulk loading:


  •   Syntax for COPY FROM command

COPY <column family name> [ ( column [, ...] ) ] FROM ( '<filename>' | STDIN )
[ WITH <option>='value' [AND ...] ];


  •   Description of COPY FROM command 


Cassandra COPY FROM command options: using these options in a WITH clause, you can change the expected format of the CSV input. The following table describes these options:


COPY Option     Default Value           Function
DELIMITER       Comma (,)               Sets the character that separates fields in the file.
QUOTE           Quotation mark (")      Sets the character that encloses field values.
ESCAPE          Backslash (\)           Sets the character that escapes literal uses of the QUOTE character.
HEADER          False                   Set to true to indicate that the first row of the file is a header.

  1.  By default, when you use the COPY FROM command, Cassandra expects every row in the CSV input to contain the same number of columns. The number of columns in the CSV input is the same as the number of columns in the Cassandra table metadata. Cassandra assigns fields in the respective order. To apply your input data to a particular set of columns, specify the column names in parentheses after the table name.
  2.  The COPY FROM command is only supported in the Cassandra cqlsh environment; it will not work in the Cassandra CLI.

 IMPORTANT NOTE: COPY FROM is intended for importing small datasets (a few million rows or less) into Cassandra. For importing larger datasets, use the Cassandra SSTable bulk loader (sstableloader), described at the end of this post.


 Example:
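A concrete command, reusing the pipe-delimited files and paths from the Java example later in this post (the table name is hypothetical), would look something like this:

    copy users from '/home/prashant/data/users' WITH delimiter='|';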


Copy command in Cassandra using Java Code :


  •  Now we will see how to do the bulk load through Java code.


  1.  Basically, we generate the COPY commands through Java according to our needs.
  2.  After generating the commands, we write them all into a query file.
  3.  Since Cassandra can execute a query file from the command prompt (or a Linux terminal), we then run that file through cqlsh.

      Syntax: $CASSANDRA_HOME/bin/cqlsh -k <keyspace name> -f <filename>


   Now, using Java's ability to run console commands, we will execute the above command to
   perform the bulk COPY insert in Cassandra.

   Java Code for performing bulk copy in Cassandra.


import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.InputStreamReader;

public class Copycassandra {
    public static void main(String[] args) {
        try {
            // Directory containing the CSV data files (one file per table)
            File pathName = new File("/home/prashant/data/");
            String[] fileNames = pathName.list();

            // sampletables.txt contains the names of the tables we want
            // to bulk-copy into, one per line, matching the data file names
            BufferedReader br = new BufferedReader(new FileReader("sampletables.txt"));

            // Write one COPY command per table into a query file
            FileWriter foutstream = new FileWriter("/home/prashant/queries.txt");
            BufferedWriter out = new BufferedWriter(foutstream);

            String strLine;
            while ((strLine = br.readLine()) != null) {
                for (int i = 0; i < fileNames.length; i++) {
                    if (strLine.equals(fileNames[i])) {
                        String sql = "copy " + strLine + " from '/home/prashant/data/"
                                + fileNames[i] + "' WITH delimiter='|';";
                        out.write(sql + "\n");
                    }
                }
            }
            br.close();
            out.close();

            // Run the generated query file through cqlsh
            String sysEnvStr = System.getenv("CASSANDRA_HOME");
            Process p = Runtime.getRuntime().exec(sysEnvStr
                    + "/bin/cqlsh -k idbench -f /home/prashant/queries.txt");

            BufferedReader stdInput = new BufferedReader(new InputStreamReader(p.getInputStream()));
            BufferedReader stdError = new BufferedReader(new InputStreamReader(p.getErrorStream()));

            // Print the output (and any errors) from the command
            String s;
            while ((s = stdInput.readLine()) != null) {
                System.out.println(s);
            }
            while ((s = stdError.readLine()) != null) {
                System.out.println("Std ERROR : " + s);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Bulk-Loading Data to Cassandra with SSTables


  • The 'sstableloader', introduced in Apache Cassandra 0.8.1, provides a powerful way to load huge volumes of data into a Cassandra cluster. If you are moving from a cloud cluster to a dedicated cluster or vice versa, or from a different database to Cassandra, you will be interested in this tool. In any such case, if you can generate the 'sstables' from the data to be loaded into Cassandra, you can load them in bulk to the cluster using 'sstableloader'.
       
  •  Once you have SSTable files generated from the CSV files, you can use the sstableloader utility provided by Cassandra.


  • You can find the sstableloader tool inside Cassandra's bin directory and run it from the command line, pointing it at the sstables generated above. Good guidance on this can be found in the DataStax documentation.
  • You can also use the class 'org.apache.cassandra.tools.BulkLoader' directly in Java code to load the sstables into a Cassandra cluster.
  • If you are testing all this on localhost, you can run sstableloader from the command line as follows:

                        - ./sstableloader -d localhost <path to generated sstables>