Microsoft announced the general availability of Windows Azure HDInsight last week, and released an updated version of the Microsoft Hive ODBC Driver. This post describes how to use this driver to pull data from your HDInsight cluster with an SSIS Data Flow Task.
Installing and Configuring the Hive ODBC Driver
The download page for the Hive ODBC driver provides a link to an article on how to configure the Hive ODBC driver to connect to HDInsight using Excel. The screenshots in the article are a bit out of date, but the procedure is essentially the same when using the driver from SSIS. Be sure to install both the 32bit and 64bit versions of the driver – you’ll need to install them separately. If you only install the 64bit driver, you’ll get errors in Visual Studio (which is a 32bit process) when trying to configure your connection manager. If you only install the 32bit driver, you might get an error when you try to run the package in the designer (as the designer runs the package in 64bit mode by default).
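If you want to double-check which drivers a given process can actually see, one option is to list the ODBC drivers registered for that bitness. The sketch below is purely a diagnostic aside, not part of the SSIS setup; it assumes Python with the pyodbc package. Run it once under 32bit Python and once under 64bit Python to confirm both versions of the driver are present.

```python
# Diagnostic sketch: list the ODBC drivers visible to this process.
# A 32bit process only sees 32bit drivers and a 64bit process only sees
# 64bit drivers - the same split that trips up the SSIS designer/runtime.
import pyodbc

drivers = pyodbc.drivers()
print("Microsoft Hive ODBC Driver installed:",
      "Microsoft Hive ODBC Driver" in drivers)
for name in sorted(drivers):
    print(" -", name)
```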
Once the driver is installed, there are two ways to configure it – using a Data Source Name (DSN), or using a connection string. I recommend taking the connection string approach, but you might find using the DSN approach more convenient for your environment.
Data Source Name (DSN)
Using a DSN requires you to preregister the data source (at the file/user/system level) using the Windows ODBC Data Source Administrator. You’ll then be able to reference this DSN entry by name from an SSIS Connection Manager. This essentially creates an alias for your data source – you can change where the DSN points, and your packages continue to work. The downside to this approach is that you’ll need to make sure the DSN exists on every machine that will run the package.
There are two versions of the ODBC Data Source Administrator UI – one for 32bit (%windir%\SysWOW64\odbcad32.exe) and one for 64bit (%windir%\System32\odbcad32.exe). You’ll likely want to create both 32bit and 64bit DSNs – just make sure that the same name is used for both versions. At minimum, you’ll need to register a 32bit DSN to use when creating your SSIS package in the designer.
When creating the DSN, select the Microsoft Hive ODBC Driver.
On the configuration page, you’ll need to enter:
- Host: This is the full domain name of your HDInsight cluster (*.azurehdinsight.net)
- Port: 443 is the default
- Database: default
- Hive Server Type: Hive Server 2
- Authentication Mechanism: Select Windows Azure HDInsight Service – authentication will use the username and password you specified when configuring your cluster
- User Name & Password: the user you configured for your cluster
Another setting of interest on the Advanced Options page is the Default string column length value. By default, this will be set to 32767, which is larger than the SSIS maximum length for a DT_STR/DT_WSTR column (4000). This means that SSIS will treat all Hive STRING columns as DT_TEXT/DT_NTEXT, which has negative performance implications. If you know that your data values will be less than 4000 characters in length, I recommend lowering this value to 4000 (or less).
Once the values are configured, click the Test button, and you’re good to go.
The DSN will then appear in your ODBC Administrator UI. As noted above, you’ll want to create both 32bit and 64bit DSNs.
You will now be able to see your newly created DSN when you create an ODBC Connection Manager in SSIS.
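Before wiring the DSN into a package, it can also be handy to smoke-test it from outside SSIS. Here is a minimal sketch using pyodbc (the DSN name and placeholder credentials are assumptions, and Python is used only for illustration; SSIS uses its own ODBC layer):

```python
# Minimal DSN smoke test. Assumes a DSN named "HiveOnHDInsight" registered
# for this process's bitness; replace the placeholder credentials.
import pyodbc

conn = pyodbc.connect("DSN=HiveOnHDInsight;UID=<your user>;PWD=<your password>",
                      autocommit=True)          # Hive does not support transactions
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM hivesampletable")   # runs a Hive job, so expect a wait
print("Rows in hivesampletable:", cursor.fetchone()[0])
conn.close()
```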
Using a Connection String
The advantage to using a connection string is that you don’t have to pre-create the DSN on the systems that will run your SSIS package. Knowing which fields to enter can be a little tricky (see the tip below), but I prefer this approach as it removes the external (DSN) dependency. Also note that the same connection string works for both 32bit and 64bit execution modes – you just need to ensure that both versions of the ODBC driver are installed.
The table below lists properties you’ll need to set.
| Field | Description |
| --- | --- |
| Driver | Name of the driver – {Microsoft Hive ODBC Driver} |
| Host | DNS hostname of your cluster |
| Port | Connection port – the HDInsight default is 443 |
| Schema | Default database schema |
| RowsFetchedPerBlock | Number of rows fetched per block. The recommendation is to keep it at 10,000. |
| HiveServerType | The HDInsight default is 2 |
| AuthMech | Authentication mechanism – use a value of 6, which maps to the username and password you specified when the cluster was created |
| DefaultStringColumnLength | The default length for STRING columns. See the note above. |
Example:
DRIVER={Microsoft Hive ODBC Driver};Host=<cluster name>.azurehdinsight.net;Port=443;Schema=default;RowsFetchedPerBlock=10000;HiveServerType=2;AuthMech=6;DefaultStringColumnLength=4000
When configuring your Connection Manager in SSIS, select the “Use connection string” option and paste in your connection string. Note that the Username and Password values can be specified as part of the connection string (the uid and pwd fields) or set separately in the connection manager’s UI.
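To verify the string before pasting it in, you can hand the exact same text to any ODBC client. A minimal sketch (again using pyodbc purely for illustration; the placeholder values are assumptions for you to fill in):

```python
# Verify the DSN-less connection string outside of SSIS. The placeholders
# (<cluster name>, <your user>, <your password>) must be replaced.
import pyodbc

conn_str = (
    "DRIVER={Microsoft Hive ODBC Driver};"
    "Host=<cluster name>.azurehdinsight.net;Port=443;Schema=default;"
    "RowsFetchedPerBlock=10000;HiveServerType=2;AuthMech=6;"
    "DefaultStringColumnLength=4000;"
    "UID=<your user>;PWD=<your password>"
)
conn = pyodbc.connect(conn_str, autocommit=True)
for row in conn.cursor().execute("SELECT * FROM hivesampletable LIMIT 5"):
    print(row)
conn.close()
```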
Tip – To determine all of the fields that can be set for the provider (or at least the ones exposed by the UI), you can go through the DSN creation step, and then save it as a File DSN. This will create a text file in .ini format. Open it with any text editor, and you can see each of the name/value pairs that are being set. You just need to put them together into a single, semi-colon delimited string.
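For example, here is a rough sketch of that last step: gluing a saved File DSN back into a connection string. The file name is made up, and it assumes the standard [ODBC] section layout of a File DSN.

```python
# Turn a saved File DSN (.dsn, .ini format) into a single semicolon-delimited
# connection string. Assumes the standard [ODBC] section; the file name is an example.
import configparser

parser = configparser.ConfigParser(interpolation=None)
parser.optionxform = str            # keep the original key casing
parser.read("HiveOnHDInsight.dsn")

pairs = dict(parser["ODBC"])
if "DRIVER" in pairs:               # connection strings expect the driver name in braces
    pairs["DRIVER"] = "{" + pairs["DRIVER"] + "}"

conn_str = ";".join(f"{key}={value}" for key, value in pairs.items())
print(conn_str)
```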
Configuring Your Data Flow
In an SSIS Data Flow, you can use the Hive ODBC Driver from:
- ODBC Source – introduced as a new component in SSIS 2012, although there are third party options available for previous versions
- ADO.NET Source – using the .NET ODBC Data Provider
Note that in the past the Hive ODBC driver had some compatibility issues with the SSIS ODBC Source, leading to a recommendation to use the ADO.NET Source. I didn’t notice any problems when using the latest Microsoft driver – you’ll see that the example screenshots below use the ODBC Source.
After configuring your connection manager, you can select a table or enter your query (similar to interacting with any other source in SSIS). I found that using the table drop down can take a while (15-30 seconds) – querying metadata from Hive isn’t an immediate operation.
Clicking the Preview button takes a while (as the Hive job is executed on your cluster), but it is a good way to make sure that your connection manager is correctly configured.
Clicking the Columns tab may have a delay (again – fetching metadata isn’t instantaneous), but will come back with the columns as you’d expect.
Once you click OK, you may notice another delay as the component is validated. Since Hive metadata operations can take a while to complete, I suggest setting the ValidateExternalMetadata value of your source component to False. This will speed up validation quite a bit – the metadata will only be refreshed when you open the Source UI. When using the ODBC Source, you should also consider setting the BatchSize property (default 1000) to match the RowsFetchedPerBlock setting of your connection manager (default is 10,000).
After you complete the rest of your Data Flow, you should be able to run your package. In my experience, the Validate and PreExecute phases of the data flow finish (relatively) quickly, but once the Execute phase starts, your source will sit with its icon spinning and no rows coming out while it waits for the Hive job to complete on your cluster. Once the job completes, you’ll see the data streaming back nicely.
I found that running a package which reads from the hivesampletable table that comes with HDInsight consistently takes about 2 minutes. After applying the tips mentioned throughout this post (and summarized below), the average execution time went down to 1 minute, 20 seconds. I also found that the ADO.NET Source was consistently (slightly) faster than the ODBC Source (by 5-10 seconds), but I expect this will vary between environments (and data sets).
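If you want a rough feel for where that time goes outside of SSIS, you can time the query and the fetch separately. A sketch along those lines (pyodbc again, purely illustrative; the connection string placeholders are assumptions to replace):

```python
# Rough timing sketch: separate the wait for the Hive job from the time
# spent streaming rows back over ODBC.
import time
import pyodbc

conn_str = ("DRIVER={Microsoft Hive ODBC Driver};"
            "Host=<cluster name>.azurehdinsight.net;Port=443;Schema=default;"
            "HiveServerType=2;AuthMech=6;UID=<your user>;PWD=<your password>")
conn = pyodbc.connect(conn_str, autocommit=True)
cursor = conn.cursor()

start = time.time()
cursor.execute("SELECT * FROM hivesampletable")        # Hive job runs here
print("Query: %.1f seconds" % (time.time() - start))

start = time.time()
rows = cursor.fetchall()                               # rows stream back here
print("Fetch: %.1f seconds, %d rows" % (time.time() - start, len(rows)))
conn.close()
```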
Tips
- Install both the 32bit and 64bit versions of the Hive ODBC Driver
- Consider using a connection string instead of a DSN in your connection manager
- Change the Default String Column Length to 4000 (or less) if you are sure your String columns are less than 4000 characters in length
- Set ValidateExternalMetadata to False on your Source component
- When using the ODBC Source, increase the BatchSize (to 10000)
Troubleshooting
If Preview works for you in the ODBC Source (or ADO.NET Source), but your package fails when you run in the designer, make sure you have both the 32bit and 64bit versions of the Hive provider installed, and that you’ve created DSNs for both platforms. The error looks something like this:
[ODBC Source [2]] Error: The AcquireConnection method call to the connection manager <XXX> failed with error code 0xC0014009. There may be error messages posted before this with more information on why the AcquireConnection method call failed.
[SSIS.Pipeline] Error: ODBC Source failed validation and returned error code 0x80004005.
[Connection manager “<XXX>”] Error: An ODBC error -1 has occurred.
[Connection manager “<XXX>”] Error: There was an error trying to establish an Open Database Connectivity (ODBC) connection with the database server.
—-
If for some reason you are unable to install the 64bit provider, you can force your SSIS package to run in 32bit mode in Visual Studio by right clicking on the project, selecting Properties, going to the Configuration Properties | Debugging page, and setting the Run64BitRuntime value to False.
Comments
Hi, thanks for your article.
Can you tell me when the SSIS Tasks for HDInsight will be available on CodePlex ?
I watched the "SQL Server Integration Services Roadmap" session (http://www.youtube.com/watch?v=2Aj9_w3y9Xo) from the PASS Summit and I would like to test them 😉 Thanks
They are coming soon (any day now)… I will post details once they are available.
Thank you for sharing such a thorough article on how to connect to Hive, it is especially thoughtful to include the tips.
Hi, thanks. It's a very nice and detailed article, and it helped me with the connection.
I was able to connect and import a table. Now, when I try again, I am getting the following error:
Test connection failed because of an error in initializing provider. ERROR [HY000] [Microsoft][Hardy] (34) Error from Hive: No more data to read..
ERROR [HY000] [Microsoft][Hardy] (34) Error from Hive: No more data to read..
Any idea on why and how to resolve it?