Your Questions. Answered.
Apache Hadoop is a Java-based open-source framework for storing and processing large amounts of data. The framework is based on Google's MapReduce algorithm and makes it possible to run compute-intensive workloads over large amounts of data on computer clusters. Various Hadoop-related technologies have been developed with the aim of changing the way companies store, process and extract meaningful information from their data.
At ZettaScale we have deep expertise in designing and implementing scalable systems on the Hadoop stack, and we are committed to building reliable, fault-tolerant applications.
Our development expertise ranges from Industry 4.0 to autonomous-car platforms, and the Hadoop ecosystem has always played a central role in it, especially Apache Spark and Apache HBase. With extensive experience in both on-premises and cloud deployments, we are flexible when it comes to choosing the right infrastructure, building the solution and maintaining it.
Apache Hadoop is at its core a Java-based MapReduce framework. Hadoop scales from a single computer to thousands of computers in a cluster, each providing its local computing power and storage capacity. This allows Hadoop to efficiently store and process large datasets ranging from a few gigabytes to petabytes.
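To give a flavour of the programming model, the sketch below shows a minimal MapReduce word-count job written against Hadoop's standard Java API. The class names and the input/output paths passed on the command line are only illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts collected for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combiner reuses the reducer locally
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```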
We can help you in the following ways:
Apache Spark is a unified computing engine and a set of libraries for parallel data processing on computer clusters. Spark is designed to support a wide range of data analytics tasks, ranging from simple data loading and SQL queries to machine learning and streaming computation, over the same computing engine and with a consistent set of APIs.
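As a small illustration of that consistency, the sketch below mixes Spark's DataFrame API and plain SQL over the same data from a single Java program. The input path and the column names (device_id, temperature) are hypothetical and would depend on your data.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SensorReport {
  public static void main(String[] args) {
    // One SparkSession is the entry point for SQL, DataFrames and streaming alike.
    SparkSession spark = SparkSession.builder()
        .appName("sensor-report")
        .getOrCreate();

    // Hypothetical input: CSV files with columns device_id, temperature, ts.
    Dataset<Row> readings = spark.read()
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("hdfs:///data/sensor-readings/*.csv");

    // The same data can be queried through the DataFrame API ...
    readings.groupBy("device_id")
        .avg("temperature")
        .show();

    // ... or through plain SQL on the same engine.
    readings.createOrReplaceTempView("readings");
    spark.sql("SELECT device_id, MAX(temperature) AS max_temp "
            + "FROM readings GROUP BY device_id ORDER BY max_temp DESC")
        .show();

    spark.stop();
  }
}
```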
We can help you in the following ways:
Apache Storm is a framework for building distributed, real-time data-processing platforms. It provides a set of primitives for developing applications, organized as Storm topologies, that process very large volumes of data in real time and scale horizontally.
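The sketch below shows what such a topology can look like with Storm's Java API (assuming the Storm 2.x signatures): a spout emits click events and a bolt counts them per page, with a fields grouping so every event for the same page lands on the same bolt task. The spout, bolt and field names are made up for illustration; a real spout would typically read from a queue such as Kafka.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class ClickCounterTopology {

  // Hypothetical spout; a real one would pull events from a message queue.
  public static class ClickSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    @Override
    public void open(Map<String, Object> conf, TopologyContext context,
                     SpoutOutputCollector collector) {
      this.collector = collector;
    }

    @Override
    public void nextTuple() {
      Utils.sleep(100);                       // throttle the demo source
      collector.emit(new Values("/home"));    // emit one click event
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("page"));
    }
  }

  // Bolt that counts clicks per page as tuples stream through.
  public static class CountBolt extends BaseBasicBolt {
    private final Map<String, Long> counts = new HashMap<>();

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
      counts.merge(input.getStringByField("page"), 1L, Long::sum);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      // terminal bolt, emits nothing downstream
    }
  }

  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("clicks", new ClickSpout(), 1);
    builder.setBolt("counter", new CountBolt(), 2)
        .fieldsGrouping("clicks", new Fields("page")); // same page -> same task

    try (LocalCluster cluster = new LocalCluster()) {  // in-process cluster for local testing
      cluster.submitTopology("click-counter", new Config(), builder.createTopology());
      Thread.sleep(10_000);
    }
  }
}
```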
We can help you in the following ways:
Apache HBase is the Hadoop database: an open-source, non-relational, column-oriented, distributed and scalable big data store that provides schema flexibility.
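As a brief illustration of the client API, the Java sketch below writes and reads back a single cell. The table name vehicle_events, the column family telemetry and the row-key scheme are assumptions for the example; the table and family would have to exist on the cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    // Reads hbase-site.xml from the classpath to locate ZooKeeper and the cluster.
    Configuration conf = HBaseConfiguration.create();

    try (Connection connection = ConnectionFactory.createConnection(conf);
         // Hypothetical table 'vehicle_events' with a column family 'telemetry'.
         Table table = connection.getTable(TableName.valueOf("vehicle_events"))) {

      byte[] family = Bytes.toBytes("telemetry");
      byte[] rowKey = Bytes.toBytes("car-42#2024-01-01T12:00:00Z");

      // Write one cell: the row key encodes the vehicle id and a timestamp.
      Put put = new Put(rowKey);
      put.addColumn(family, Bytes.toBytes("speed_kmh"), Bytes.toBytes("87"));
      table.put(put);

      // Read it back: only the columns that actually exist for this row are stored.
      Result result = table.get(new Get(rowKey));
      String speed = Bytes.toString(result.getValue(family, Bytes.toBytes("speed_kmh")));
      System.out.println("speed = " + speed);
    }
  }
}
```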
We can help you in the following ways:
Apache Hive is a data warehouse that facilitates reading, writing and managing large datasets using SQL. It provides standard SQL functionality for data analytics, including OLAP functions, subqueries and more.
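To illustrate, the Java sketch below talks to HiveServer2 over JDBC, defines a hypothetical external table over files in HDFS and runs an aggregation query on it. The host name, credentials, table and columns are placeholders, and the Hive JDBC driver must be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // Register the HiveServer2 JDBC driver (needed with older driver versions).
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // HiveServer2 JDBC endpoint; host, port, database and credentials are placeholders.
    String url = "jdbc:hive2://hive-server:10000/default";

    try (Connection conn = DriverManager.getConnection(url, "analyst", "");
         Statement stmt = conn.createStatement()) {

      // A hypothetical external table over delimited files already sitting in HDFS.
      stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS orders ("
          + "order_id BIGINT, customer_id BIGINT, amount DOUBLE, order_date STRING) "
          + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
          + "LOCATION '/data/orders'");

      // Standard SQL analytics executed by Hive on the cluster.
      ResultSet rs = stmt.executeQuery(
          "SELECT customer_id, SUM(amount) AS total "
        + "FROM orders GROUP BY customer_id "
        + "HAVING SUM(amount) > 1000 ORDER BY total DESC LIMIT 10");

      while (rs.next()) {
        System.out.println(rs.getLong(1) + " -> " + rs.getDouble(2));
      }
    }
  }
}
```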
We can help you in the following ways: