Salesforce2hadoop allows you to import data from Salesforce and put it in HDFS (or your local filesystem), serialised as Avro. Despite its boring name, it's a powerful tool that helps you get all relevant data of your business in one place. It only needs access to the Salesforce API using a username/password combination, and the Enterprise WSDL of your Salesforce Organisation.
Features:
- Choose the type(s) of records you want to import
- Data types are preserved by looking at the Enterprise WSDL of your Salesforce Organisation
- Data is stored in Avro format, providing great compatibility with a number of tools
- Do a complete import of your data, or incrementally import only the records that have been changed since your last import
- Salesforce2hadoop keeps track of the last time each record type was imported for you
- Stores data in any filesystem that Hadoop/KiteSDK supports, currently HDFS or your local filesystem
- Built with the help of KiteSDK, Salesforce WSC and our own wsdl2avro library.
- Built for the JVM, so works on any system that has Java 7 or greater installed
More background information: http://dandydev.net/blog/getting-salesforce-data-into-hadoop
You can find compiled binaries here on Github. Just download, unpack and you're good to go.
If you want to build salesforce2hadoop yourself, you need to have Scala and SBT installed. You can then build the "fat" jar as follows:
$ git clone https://github.com/datadudes/salesforce2hadoop.git
$ sbt assembly
The resulting jar file can be found in the target/scala-2.11 directory.
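The exact filename of the assembly jar depends on the build configuration, so the following is only a sketch; check what sbt assembly actually produced and adjust the name accordingly:
$ ls target/scala-2.11/
$ java -jar target/scala-2.11/<name-of-assembly-jar>.jar --help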
sf2hadoop is a command line application that is really simple to use. If you have Java 7 or higher installed, you can just use java -jar sf2hadoop.jar to run the application. To see what options are available, run:
$ java -jar sf2hadoop.jar --help
In order for salesforce2hadoop to understand the structure of your Salesforce data, it has to read the Enterprise WSDL of your Salesforce organisation. You can find out here how to generate and download it for your organisation.
sf2hadoop has 2 commands for importing data from Salesforce: init and update.
Use init to do an initial data import from Salesforce. For each record type a dataset will be created in HDFS (or your local filesystem) and a full import of the desired record types will be done. This can take some time. You can do an initial import like this:
$ java -jar sf2hadoop.jar init -u <salesforce-username> -p <salesforce-password> -b /base/path -w /path/to/enterprise.wsdl -s /path/to/state-file recordtype1 recordtype2 ...
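As a concrete sketch, an initial import of Account, Opportunity and a custom Payment_Account__c record type could look like this (the username, password, paths and record types below are hypothetical placeholders):
$ java -jar sf2hadoop.jar init -u jane.doe@example.com -p s3cr3t -b hdfs://localhost/imports/salesforce -w /home/jane/enterprise.wsdl -s file:///home/jane/sf2hadoop-state Account Opportunity Payment_Account__c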
Salesforce credentials
As you can see, it needs your Salesforce username & password to log in to the Salesforce API and fetch your data.
Data import directory
It also needs a basePath where it will store all the imported data. This must be in the form of a URI that Hadoop can understand. Currently salesforce2hadoop supports storing data in either HDFS or your local filesystem. URIs for these options have the following format:
- HDFS: hdfs://hostname-of-namenode:port/path/to/dir
  The port can be left out. Example URI when running sf2hadoop on the namenode of your Hadoop cluster: hdfs://localhost/imports/salesforce. This will store all data on HDFS in the /imports/salesforce directory.
- Local filesystem: file:///path/to/dir
  This will store all data on your local filesystem in the /path/to/dir directory.
Imported data will be stored under the provided base path. Data for each record type will be stored in its own subdirectory, named after the record type in lowercase.
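As a hypothetical illustration, importing Account and Opportunity with a base path of hdfs://localhost/imports/salesforce would result in a layout roughly like this (the .metadata directory is where KiteSDK keeps the dataset's Avro schema, as used in the Hive recipe further below):
/imports/salesforce/account/.metadata/schema.avsc
/imports/salesforce/account/<Avro data files>
/imports/salesforce/opportunity/.metadata/schema.avsc
/imports/salesforce/opportunity/<Avro data files>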
Record types
Provide sf2hadoop with the types of records you want to import from Salesforce. Specify the types by their Salesforce API names, for example Account, Opportunity, etc. Custom record types usually end with __c, for example Payment_Account__c.
State
Salesforce2hadoop will keep track of what record types have been imported at what point in time. This allows you to switch to incremental imports after the initial import. To make this possible, it will save the name of each record type that is imported, together with an import date. When doing an incremental update of the data (see below), it will read back the import states for each record type, and only request data from Salesforce that has been created/updated since that moment. After the incremental import, it will update the import date for each record type that was imported.
The statefile can be stored either in HDFS or on your local filesystem. As with the basePath, you have to specify the path to the statefile as a URI, following the guidelines above. The statefile will be created automatically, along with any missing parent directories.
It is advised to store the statefile on HDFS, so you can run sf2hadoop from any machine without having to worry whether the proper statefile is present.
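For example, the -s flag could point to either location (both paths here are hypothetical):
-s file:///home/jane/sf2hadoop-state (local statefile, only visible on that machine)
-s hdfs://localhost/imports/sf2hadoop-state (statefile on HDFS, shared by every machine that runs sf2hadoop)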
Use the update command to do an incremental update of record types that have previously been imported:
$ java -jar sf2hadoop.jar update -u <salesforce-username> -p <salesforce-password> -b /base/path -w /path/to/enterprise.wsdl -s /path/to/state-file recordtype1 recordtype2 ...
The parameters are the same as with the init command.
Record types that have not been previously initialized (no initial import has been done) will be skipped, and the output will tell you so.
Due to limitations of the Salesforce API, sf2hadoop can only go back 30 days when doing an incremental update. Trying an incremental import over a longer timespan might result in errors or incomplete data, so it is advised to do an update at least monthly.
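One way to stay within that 30-day window is to schedule the update command, for example with cron. The entry below is only a sketch; the schedule, paths and record types are hypothetical, and in practice you'd want to avoid putting a plain-text password in a crontab:
0 3 * * 0 java -jar /opt/sf2hadoop/sf2hadoop.jar update -u <salesforce-username> -p <salesforce-password> -b hdfs://localhost/imports/salesforce -w /opt/sf2hadoop/enterprise.wsdl -s hdfs://localhost/imports/sf2hadoop-state Account Opportunity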
Here are some recipes to make the most out of salesforce2hadoop.
Once you've imported Salesforce data into HDFS using sf2hadoop, you can create a Hive table backed by the imported data (in Avro format) by running the following command in the Hive shell:
CREATE EXTERNAL TABLE <tablename>
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/base/path/recordtype_in_lowercase'
TBLPROPERTIES ('avro.schema.url'='hdfs:///base/path/recordtype_in_lowercase/.metadata/schema.avsc');
This table will also be available in Impala (you might have to run INVALIDATE METADATA for the table to show up). The reason to create it in Hive instead of directly in Impala is that Hive can infer the table's schema from the Avro schema for you.
Each time you do an incremental import of data for which you have created a Hive/Impala table, you have to tell Impala that new data is available by running the following command in the Impala shell:
REFRESH <tablename>
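For example, assuming the table was created as account (a hypothetical name), both Impala statements can be run non-interactively with impala-shell:
$ impala-shell -q 'INVALIDATE METADATA account'
$ impala-shell -q 'REFRESH account'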
Some random TODOs:
- Allow creating Hive tables directly by using the facilities provided by KiteSDK
- Support other Hadoop-compliant filesystems, like S3
You're more than welcome to create issues for any bugs you find and ideas you have. Contributions in the form of pull requests are also very much appreciated!
Salesforce2hadoop was created with passion by: