In the end, it will boil down to what fits, in terms of integration with already-in-place technology, in terms of existing know-how and skills, and ultimately what exactly needs to be accomplished.
I had the privilege to be asked to process healthcare information (mostly claims) to help build and grow a new business. One great resource of healthcare information is the U.S. Department of Health & Human Services (HHS). HHS doesn't fall short on data...at all, and one can buy pretty much anything from it. HHS even provides publicly available "synthetic" versions of its data to spur innovation by letting entrepreneurs find opportunities in it.
Solutions
SAS is a company that provides (among other things) software and solutions to process data and even help with business intelligence and analytics. SAS seems to be the de facto technology in the healthcare industry, as HHS often packages SAS code along with any data it provides.
Now, SAS is proprietary software and we preferred to look at open source solutions first. We also did not want to invest time and resources in a solution that might lock us in and/or drive costs upward in the future. So we looked at alternatives.
The first alternative I opted for and started using was Hadoop/Pig. I had been working with Hadoop/Hive for years already, and decided to give Pig a try, as it is a widely used platform for processing data.
Now, Pig is not really new, but it's not that feature-full either. First, it's relational-oriented, like SQL. That's a big difference from software like SAS, which is more record-oriented. For example, rolling up records and using the result with the next record is nightmarish at best in Pig.
Second, Hadoop is not that fast a platform: for developers, running scripts over and over to test outputs gets really time consuming.
After over a month of coding data processing in Pig, I came to realize this approach would not work in the long term. So I kept looking for more alternatives.
A friend and colleague of mine stumbled upon a new data processing technology called HPCC. HPCC, which stands for High Performance Computing Cluster, is provided by LexisNexis Risk Solutions (LNRS). LexisNexis is famous for its legal services, but it provides many other services, including risk analysis and quotes for insurance companies.
All of LNRS's data processing, and even some of the delivery of the results, is supported by HPCC, and has been for years. LNRS decided to open source the HPCC platform a little over 2 years ago, making it probably the number one open-source competitor to SAS software.
And so I went on re-coding some of the data processing, switching from Pig to ECL, the programming language of the HPCC platform. It became obvious really quickly that ECL was the right tool for the job: implementing complex data processing specifications became easier, queries ran faster, lots of tools and services came built in, and development productivity increased tremendously.
The only drawback might be the small size of the community of developers and the limited amount of online resources. Developers at HPCC Systems respond pretty quickly on the public forums, though, so it's not as bad as it may sound.
HPCC
HPCC, just like Hadoop, runs on commodity hardware. HPCC runs on Linux and, I believe, supports Windows with some restrictions (or tradeoffs).
The similarity between HPCC and Hadoop stops here, though. If I had to compare the two, I'd say Hadoop is Assembly (I dare say binary code) while HPCC is a 3rd (maybe 4th) generation programming language (like Java, C#, etc.). That doesn't mean Hadoop is written in Assembly (it's actually Java) or that HPCC is written in Java or C# (it's C/C++).
It just means that Hadoop is very low-level, and most technologies relying on Hadoop try their best to come up with higher-level languages (like Pig Latin or HiveQL) to abstract away the inner workings of Hadoop. The only problem is that these efforts are all very new and still have a long way to go.
HPCC, on the other hand, is already there: not only does HPCC provide a programming language to abstract (and simplify) data processing, but the platform goes beyond processing, as it provides security, monitoring, web services and more.
Data Processing
The Enterprise Control Language (ECL) was designed to help developers process data as efficiently as possible while leaving the platform the opportunity to optimize how to do it.
ECL is a high-level, declarative, non-procedural, dataflow-oriented language. That sounds complicated, but in short it means you code WHAT you need and HPCC figures out HOW to do it.
In Pig, for example, it's the other way around (with some exceptions): a developer codes in Pig Latin HOW to do things: 1 line of Pig code is pretty much 1 MapReduce job.
Here's an example of ECL code:
    A := ROLLUP(my_data,
                LEFT.id = RIGHT.id,
                TRANSFORM(my_data_layout,
                          SELF.total_price := LEFT.total_price + RIGHT.total_price;
                          SELF := RIGHT;));
It might look complicated but it is actually really easy.
ROLLUP is a built-in function: when two consecutive records match a certain condition, the TRANSFORM is called with those matching records. The condition here is LEFT.id = RIGHT.id.
ROLLUP iterates through the records and processes them in pairs, so there is a LEFT record and a RIGHT record. The condition here just means: do the transformation when the two records have the same id value.
TRANSFORM is another built-in function. This one specifies the logic of the transformation that needs to happen. SELF represents the result, and LEFT and RIGHT are the two records passed as input to the function. The TRANSFORM here simply sums up the values of total_price from the LEFT and RIGHT records. When done, the result is used as the LEFT record for the next pair, if the condition is met.
This ECL code is equivalent to the following SQL code:
SELECT id, SUM(total_price) FROM my_data GROUP BY id;
Now, in the TRANSFORM we could implement more complicated logic, like setting values in the current record based on the results of previous transformations, or keeping only the LEFT or RIGHT record depending on the values of either record or some other logic. There is no direct equivalent in SQL or Pig, as they do not provide processing at the record level.
It is still feasible in SQL or Pig, but the code will be a lot less intuitive and way less concise than this ECL version.
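To make the pairwise semantics concrete, here is a minimal Python sketch of how ROLLUP behaves. This is purely illustrative (the helper and the sample data are made up, and it is not how HPCC actually executes anything); note that, like ECL's ROLLUP, it assumes matching records are adjacent, i.e., the input is already sorted on the match field:

```python
def rollup(records, match, transform):
    """Mimic ECL ROLLUP: fold consecutive matching records pairwise.

    The result of each transform becomes the LEFT record for the
    next comparison, just as described for ECL above.
    """
    result = []
    for record in records:
        if result and match(result[-1], record):
            # Condition met: transform the pair into a single record.
            result[-1] = transform(result[-1], record)
        else:
            # No match: the record starts a new group.
            result.append(dict(record))
    return result

# Sample data, sorted by id so matching records are adjacent.
data = [
    {"id": 1, "total_price": 10.0},
    {"id": 1, "total_price": 5.0},
    {"id": 2, "total_price": 7.0},
]

rolled = rollup(
    data,
    match=lambda left, right: left["id"] == right["id"],
    # SELF := RIGHT, except total_price is the running sum.
    transform=lambda left, right: {
        **right,
        "total_price": left["total_price"] + right["total_price"],
    },
)
# rolled == [{"id": 1, "total_price": 15.0}, {"id": 2, "total_price": 7.0}]
```

The two lambdas play the roles of the match condition and the TRANSFORM from the ECL snippet above.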
All the processing of data happens in a specific cluster inside the HPCC platform. This cluster is called Thor, named after the Norse god of thunder, whose large hammer is symbolic of crushing large amounts of raw data into useful information.
Data Delivery
After the data has been processed, HPCC provides a cluster specialized in delivery called Roxie.
Just like for data processing, ECL can be used to query data on the Roxie cluster.
Where HPCC shines here is that those queries can be "published", transforming them into actual web services that can be called by outside applications.
One only needs to take the WSDL and XSD provided by Roxie for the published query and create client code with it, be it in Ruby, Java, C# or anything else that can handle SOAP/XML or REST.
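As a rough illustration, here is what a hand-rolled SOAP request to a published query could look like in Python. The query name, namespace, and field are hypothetical stand-ins; the real ones come from the WSDL/XSD that Roxie generates for the published query:

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
# Hypothetical namespace and element names for illustration only;
# the actual values are described by the query's WSDL/XSD.
QUERY_NS = "urn:example:roxie:my_published_query"


def build_request(record_id):
    """Build a SOAP envelope for a hypothetical published Roxie query."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    request = ET.SubElement(body, f"{{{QUERY_NS}}}MyPublishedQueryRequest")
    ET.SubElement(request, f"{{{QUERY_NS}}}id").text = str(record_id)
    return ET.tostring(envelope, encoding="unicode")


request_xml = build_request(42)
```

In practice one would point a WSDL-aware SOAP toolkit at the service, or simply POST an envelope like this over HTTPS; the point is just that any language able to speak XML over HTTP can consume a published query.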
Security, Monitoring, Redundancy, and more
As I mentioned earlier, HPCC isn't just about storage and large-scale processing of data. It comes with many features, including but not limited to:
- Authentication. With the Community Edition, one can just use the .htpasswd option; LDAP is the preferred option for production.
- SSL. It's possible to set up an SSL certificate and enforce the use of HTTPS for all services provided by, and used throughout, the platform.
- Storage Redundancy. Thor and Roxie are fully redundant. Data is replicated across the cluster.
- Online Configuration Manager. This little tool helps create configurations for the entire HPCC cluster(s). It can be used either to tweak an existing configuration or to create a new one. Once done, one needs to distribute it to the cluster and restart services for the changes to be taken into account. If one needs to scale (up or down), this is the tool to use.
The Enterprise Edition facilitates all the features mentioned above, but they are all available (albeit more manual) in the Community Edition.
Resources
There's already a very decent amount of documentation, articles, tutorials and even training videos to help developers get started with HPCC and ECL.
The Community Forums are also a great place to ask questions and find existing answers. HPCC Systems developers and staff seem to monitor the forums and answer questions in a timely fashion.
HPCC Systems brings a whole new world of technology to data processing and querying, more than I could cover in one post at least.
Image: Bird with elephant
HPCC doesn't have a "cute" animal to represent itself yet. Hadoop has one: an elephant.
A friend of mine thought of the Roc, an elephant-eating bird from mythology.
Now, there is no hostility between Hadoop and HPCC; I just found it somewhat relevant and funny.
There are not that many pictures of that bird. I found this one on this site, although I'm not sure who the author of this image is...