EMR Case Study

HomeCase StudiesEMR Case Study

Data Processing Backend for BI

BI is an analytical product that provides clients with interactive dashboards that offer insights into consumer segments. Users either use the predefined segments or upload custom segments to perform advanced analytics, including:

  • Ranking segments by affinity index for large interests taxonomy (sport, games, lifestyle, brands, and so on..)
  • Media consumption analysis
  • Cross-segment analysis to select an optimal partnership for co-branding activities
  • Ranking places and events to select an optimal strategy for BTL marketing activity
  • And so on…

The Challenge

All this analysis based on consumer segments (1-10M), information about consumer interests from publicly available sources (billions of rows), and a lot of taxonomy dictionaries (like geo population stats). All these raw data sources aggregated into multi-dimensional OLAP cubes that are used to build interactive dashboards. Building these OLAP cubes is the most resource-intensive operations in all data workflow because a huge amount of data being aggregated, a large number of segments and complex aggregation logic (most cube cells are calculated as unique users count but some of the dimensions has specific rules for merging values across the dimension).

The customer needs a solution to perform complex aggregation of raw data into OLAP cubes. These cubes used in interactive dashboards to visualize customer segments marketing properties, like other brands affinity and top relevant events for marketing BTL. For each cube, the user provides segments (in terms of ids) and the solution need to prepare cubes in a short time (tens of minutes). The solution should incur costs only during data processing. Dashboards could be used for a long time after cube build.


Amazon EMR is the industry-leading cloud-native big data platform for processing vast amounts of data quickly and cost-effectively at scale. Using open-source tools such as Apache SparkApache HiveApache HBaseApache FlinkApache Hudi (Incubating), and Presto, coupled with the dynamic scalability of Amazon EC2 and scalable storage of Amazon S3, EMR gives analytical teams the engines and elasticity to run Petabyte-scale analysis for a fraction of the cost of traditional on-premises clusters.

The solution

Based on the posted requirements and the extensive AWS experience, Pi5 proposed to split the solution into the data processing layer and BI integration layer. The data processing layer needs high scalability and elasticity. The BI integration layer needs high availability and should be cost-effective for serving requests in 24/7 mode. To fit the layers with each other we used S3 as a data storage.

We build data processing layer with the following stack:

  • Apache Spark for aggregation implementation
  • S3 as a persistent data lake storage.
  • EMR as a data processing platform. EMR provides us with a managed Hadoop/Spark environment.  It also orchestrates all processing states and does monitoring.
  • EC2 Spot Instances allow reducing costs up to 80%.
  • AWS Glue as a managed metadata catalog.
  • We build application to manage EMR clusters and steps.

For BI integration layer we used:

  • Athena provides JDBC-compatible SQL engine
  • AWS Glue allows simple integration with data processing layer outputs

The general system architecture is shown below.


Lessons Learned

  • EMR allows to build scalable applications and provide high performance with hundreds-instances clusters.
  • Spot instances could dramatically reduce bills, EMR handles instances termination and for short-term computations (1-2 hours) spot instances is a very good fit.
  • EMR provides a bunch of quite useful applications (Ganglia, Zeppelin, Spark History Server) to troubleshoot all Spark issues in place

About pi5.cloud

pi5.cloud is a global technology consulting company at the forefront of cloud computing. Through collaboration with Amazon Web Services, we help customers embrace a broad spectrum of innovative solutions. From a migration strategy to operational excellence, cloud-native development, and immersive transformation, pi5.cloud is a full spectrum integrator.

Let us worry about your I.T. while you can focus on your business

Let someone else worry about your technology

We want to hear about your project. Get a free consultation and estimate.