Speeding up a Pig+HBase MapReduce job by a factor of 15

hbase

The other day I ran a Pig script. Nothing fancy: I loaded some data into HBase and then ran a second Pig job to do some aggregations. I knew the data loading would take some time, as it was multiple GB of data, but I expected the second aggregation job to run much faster. After more than 15 hours it still had not finished. That was too long for my taste, so I terminated it. I was using Amazon Elastic MapReduce as my Hadoop environment, so … [Read more...]

Lessons learned from real world BigData implementations

Over the last few weeks I attended several cloud and Big Data conferences. Big Data Innovation in Boston in particular gave me a lot of insight. Some people consider only the technology side of Big Data, such as Hadoop or Cassandra. The real driver, however, is a different one. Business analysts are discovering Big Data technologies as a means to leverage tons of existing data and ask questions about customer behavior and all sorts … [Read more...]