Monday, November 24, 2014

HPCC: Getting Started


HPCC stands for High Performance Computing Cluster. That doesn't mean, however, that you need a fleet of servers to get started with it.

I present here what one needs to start using HPCC and have fun extracting, transforming and loading data, big and small, simple and complex.

It boils down to two tasks: setting up the cluster and setting up an IDE.

Setting up a cluster


There are many different flavors to choose from here. I recommend the AWS approach as it's the fastest way to get some real sense of what HPCC is and its performance, even with very few nodes.
Regardless of the solution you choose, once you're comfortable, try to scale up and push its limits a bit to fully appreciate its power.


HPCC Instance Cloud for AWS


If you already have an AWS account, this is going to be easy. All you need is an Access Key Id and a Secret Key to log in.
I recommend NOT using your root access keys and instead creating a new user. HPCC only needs EC2 permissions to create clusters, so you can set those permissions for this new user, and use its Access Key Id and Secret Key to log in.

Once logged in, you can launch new clusters, view existing clusters, list IPs of nodes in the cluster, etc. You can also terminate clusters right from this interface.

Note that the interface seems to cache the configuration of clusters and doesn't refresh it automatically. This means that if you modify your cluster, by adding a new node for example, the change won't show up here.

It's still in Beta, but for launching cloud clusters quickly there isn't anything better right now.
To get started with HPCC/AWS, go here: https://aws.hpccsystems.com/aws/login/


Manual Installation


You can set up HPCC yourself on CentOS, Debian, Red Hat or Ubuntu, as HPCC Systems provides packages for all of them. This is useful when you're in Administrator mode and trying to figure out exactly what's needed, what services need to run, how to configure clusters, etc.

You can find the binaries here:


VM Image


If you have VirtualBox or VMware Player already installed, or if you want to try it out in a single-node setup, this is your best option.
Note that the image is a little over 1GB, so it may take some time to download.
The performance is not great in that setup, but it's fast enough to play with and figure things out using either inline datasets or small data samples.

You can find VM images here:
http://hpccsystems.com/download/hpcc-vm-image


Setting up an IDE


Truth be told, you could do everything with command line tools and use Notepad-like editors to write ECL code. If you want to be more productive though, I recommend choosing a good IDE.

As of now, there are two options available but I recommend the Native IDE to get the full experience and avoid trying to figure things out at an early stage.

Eclipse IDE


This is basically Eclipse with ECL support (including syntax highlighting) and an HPCC-specific perspective.
It provides fewer features (as of now) than the native IDE but works really well to get started.

You can download it here:
http://hpccsystems.com/products-and-services/products/plugins/eclipse-ide


Native (Windows) IDE


Also called ECL IDE, this interface provides access to most features of the HPCC platform.
You can write ECL code, compile code, submit code, list and check Workunits, etc.
A few things are not yet supported, such as advanced options for web services (e.g. packagemaps for Roxie).
But to write good ECL code, organize your scripts, check jobs and so on, this is the best option right now.

You can download it here:
http://hpccsystems.com/download/free-community-edition/ecl-ide


Checkpoint


Once you've made your choice, got HPCC up and running and configured your IDE, submit the following ECL code to test that everything is working properly:

OUTPUT('Hello World!');
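
If you want a check that goes one step beyond a constant string, here is a minimal sketch using an inline dataset, so no file needs to be sprayed onto the cluster first. The record layout, the field names and the sample rows are all made up for illustration.

```ecl
// Hypothetical record layout, for illustration only
PersonRec := RECORD
  STRING20  name;
  UNSIGNED1 age;
END;

// Inline dataset: the rows live in the code itself, not in a file
people := DATASET([{'Alice', 34},
                   {'Bob',   27},
                   {'Carol', 45}], PersonRec);

// Keep the records matching the filter, then order them by name
over30 := SORT(people(age >= 30), name);

OUTPUT(over30, NAMED('Over30'));
```

If the Workunit completes and shows a result named Over30, the compile, submit and result-viewing steps are all working.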


What now?


I strongly recommend going through the tutorials provided on the HPCC Systems web site:


The Programmers Guide and the Six Degrees of Kevin Bacon are really good ones to get your feet wet. The former, in particular, will teach you how to create sample data (in ECL of course), how to do some cross-tab reports, create indexes and work with SuperFiles. You can stop at the first section of that tutorial and you'll already have enough under your belt to do a lot in ECL.
The latter is more like a real project: a goal and its implementation, step by step. This one deals a lot with working with third-party data, processing some awkward formats, joining and dedup-ing records.
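
To give a taste of what those two tutorials cover, here is a small sketch that dedups records, joins two datasets and builds a cross-tab report. It is not taken from either tutorial: the layouts, the names (ActorRec, RoleRec, MakeCredit and so on) and the inline data are my own assumptions, chosen only to keep the example self-contained.

```ecl
// Hypothetical layouts and inline data, for illustration only
ActorRec := RECORD
  UNSIGNED4 actorId;
  STRING20  name;
END;

RoleRec := RECORD
  UNSIGNED4 actorId;
  STRING20  movie;
END;

actors := DATASET([{1, 'Kevin Bacon'},
                   {2, 'Tom Hanks'}], ActorRec);

roles := DATASET([{1, 'Apollo 13'},
                  {2, 'Apollo 13'},
                  {2, 'Apollo 13'},   // duplicate row, on purpose
                  {2, 'Big'}], RoleRec);

// DEDUP only compares adjacent records, so sort on the same fields first
uniqueRoles := DEDUP(SORT(roles, actorId, movie), actorId, movie);

CreditRec := RECORD
  STRING20 name;
  STRING20 movie;
END;

// TRANSFORM describing how to build one output record from a matching pair
CreditRec MakeCredit(ActorRec a, RoleRec r) := TRANSFORM
  SELF.name  := a.name;
  SELF.movie := r.movie;
END;

// Inner join on actorId
credits := JOIN(actors, uniqueRoles,
                LEFT.actorId = RIGHT.actorId,
                MakeCredit(LEFT, RIGHT));

// Cross-tab report: one row per actor with the number of movies
CountRec := RECORD
  credits.name;
  UNSIGNED4 movieCount := COUNT(GROUP);
END;

perActor := TABLE(credits, CountRec, name);

OUTPUT(credits,  NAMED('Credits'));
OUTPUT(perActor, NAMED('MoviesPerActor'));
```

With this sample data, the duplicate role should be dropped before the join, leaving Kevin Bacon with one movie and Tom Hanks with two in the MoviesPerActor result.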

ECL Best Practices is a great read too: it covers how to define a problem to tackle with HPCC/ECL, naming conventions, what to avoid, datasets and TRANSFORMs.
You don't have to follow everything there (e.g. you can definitely come up with your own naming conventions) but it's good to know how it's used and get some tips as well before re-inventing the wheel yourself.

Using Roxie and Data Handling are a little more advanced and cover cases beyond "What can I do with ECL?". I recommend going through those last if ECL is your priority.

HPCC Systems Administrator's Guide, Installing & Running the HPCC Platform and Using Configuration Manager are useful when you get to the administration side of things. The configuration manager, editing the configuration and distributing the configuration file to your cluster all take some getting used to. But once again HPCC Systems provides great documentation and great scripts to help with all that.




