In a recent blog post, Google announced a new Cloud Storage connector for Hadoop. This new capability allows organizations to replace their traditional Hadoop Distributed File System (HDFS) with Google Cloud Storage. Columnar file formats such as Parquet and ORC may see increased throughput, and customers will also benefit from Cloud Storage directory isolation, lower latency, increased parallelization, and intelligent defaults.
While HDFS is a popular storage solution for Hadoop customers, it comes with some complexity, including the need to operate long-running HDFS clusters. Google Cloud Storage is a unified object store that exposes data through a single API. It is a managed solution that supports both high-performance and archival use cases.
The Cloud Storage connector is an open-source Java client library that implements the Hadoop Compatible FileSystem (HCFS) interface and runs inside Hadoop JVMs, allowing big-data workloads, such as Hadoop and Spark jobs, to access their underlying data directly from Google Cloud Storage.
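Because the connector registers as a Hadoop-compatible file system, wiring it in is largely a matter of configuration. A minimal sketch of the relevant `core-site.xml` entries is shown below; the property names follow the connector's documented conventions, and the project ID is a placeholder:

```xml
<!-- core-site.xml: register the Cloud Storage connector for gs:// URIs -->
<configuration>
  <property>
    <name>fs.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
  </property>
  <property>
    <name>fs.AbstractFileSystem.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
  </property>
  <property>
    <name>fs.gs.project.id</name>
    <!-- placeholder: your Google Cloud project ID -->
    <value>my-gcp-project</value>
  </property>
</configuration>
```

With this in place, jobs can address data as `gs://bucket/path` much as they would an `hdfs://` path.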
Google sees several advantages to using Google Cloud Storage over HDFS, including:
- Significant cost reduction as compared to a long-running HDFS cluster with three replicas on persistent disks;
- Separation of storage from compute, allowing you to grow each layer independently;
- Persisting the storage even after Hadoop clusters are terminated;
- Sharing Cloud Storage buckets between ephemeral Hadoop clusters;
- No storage administration overhead, like managing upgrades and high availability for HDFS.
Even though the connector is open-source, it is supported by Google Cloud Platform and comes pre-configured in Cloud Dataproc, Google’s fully managed service for running Apache Hadoop and Apache Spark workloads. In addition, it can be installed in, and is fully supported on, other Hadoop distributions, including MapR, Cloudera and Hortonworks. This interoperability allows customers to transition their on-premises big data solutions to the cloud.
Using Cloud Storage in Hadoop implementations offers customers performance improvements. One customer who has been able to take advantage of improved performance is Twitter. In Twitter’s implementation, they:
Started testing big data SQL queries against columnar files in Cloud Storage at massive scale, against a 20+ PB dataset. Since the Cloud Storage Connector is open source, Twitter prototyped the use of range requests to read only the columns required by the query engine, which increased read efficiency. We (Google) incorporated that work into a more generalized fadvise feature.
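The fadvise behavior mentioned above is surfaced as a connector setting that controls how input streams fetch data. A hedged sketch follows; the property name and values reflect the connector's documentation, but should be verified against the connector version in use:

```xml
<property>
  <name>fs.gs.inputstream.fadvise</name>
  <!-- AUTO starts with sequential reads and switches to random-access
       (range) requests when seek patterns suggest columnar access;
       RANDOM and SEQUENTIAL force one mode or the other -->
  <value>AUTO</value>
</property>
```

For columnar formats like Parquet and ORC, random-access mode lets the engine fetch only the byte ranges backing the columns a query touches, which is the read-efficiency gain Twitter prototyped.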
Another capability introduced as part of the Cloud Storage connector is cooperative locking, which isolates storage modification operations executed by the Hadoop file system shell. Igor Dvorzhak, a software engineer, explains the importance of this feature:
Although Cloud Storage is strongly consistent at the object level, it does not natively support directory semantics. For example, what should happen if two users issue conflicting commands (delete vs. rename) to the same directory? In HDFS, such directory operations are atomic and consistent.
To address this, Google worked with Twitter to implement cooperative locking in the Cloud Storage connector, which prevents data inconsistencies during competing directory operations.
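Cooperative locking is opt-in and applies to directory operations issued through the Hadoop file system shell. A sketch of usage follows; the flag name is taken from the connector's release notes, and the bucket and paths are placeholders:

```sh
# Enable cooperative locking so the directory rename is isolated from
# concurrent delete/rename operations on the same directory
hadoop fs -Dfs.gs.cooperative.locking.enable=true \
  -mv gs://example-bucket/logs gs://example-bucket/logs-archive
```

Note that the isolation is cooperative: it protects only against other clients that also run shell operations with the flag enabled, and it applies to operations within a single bucket.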
Existing Cloud Storage Connector customers can upgrade to the new version of the Cloud Storage Connector using the connectors initialization action on existing Cloud Dataproc versions. Starting with Cloud Dataproc version 2.0, it will become the standard connector.
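As an illustration, the initialization action is passed at cluster creation time along with the desired connector version. The action URI, metadata key, and version shown here are assumptions based on Google's public initialization-actions repository and should be checked against the current Dataproc documentation:

```sh
# Create a Dataproc cluster that pins a specific Cloud Storage connector
# version via the connectors initialization action (details are
# illustrative; verify the action path and metadata key for your region)
gcloud dataproc clusters create my-cluster \
  --region=us-central1 \
  --initialization-actions=gs://goog-dataproc-initialization-actions-us-central1/connectors/connectors.sh \
  --metadata=gcs-connector-version=2.0.0
```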