If you have been looking into cloud computing, then the chances are you have come across the term big data. This is used to describe the large volumes of data that modern-day businesses are processing. We are not so worried about the mass of data, but more about how we can use this data. Almost everything carries a digital footprint these days, and there are challenges in ingesting this data and extracting meaningful information. Google offers services that work together to allow us to gather data, process it, and analyze it. The faster we can analyze data, the faster business decisions can be made. Big data is becoming very important to many organizations.
In this article, we will cover the following topics:
- End-to-end big data solution
- Cloud IoT
- Cloud Data Fusion
In this post, we want to introduce the main Google big data services and concepts to a level expected by the exam. Of course, deeper dives are available.
Before we begin looking at the services, let's look at a very simple diagram. This represents the end-to-end big data solution. GCP provides us with all of the tools to architect an elastic and scalable service where we can import huge amounts of data, process events, and then execute business rules.
It is important to understand this flow and which services map to each stage. As we go through this blog, we will map each big data service to a stage in our end-to-end solution.
Of course, this is a very simplistic view of it and there are complexities when architecting our solutions, but we should expect, at least for the exam, to understand this at a high level. Let's look at the services that map to each stage of the process in more detail.
Pub/Sub is a messaging and event ingestion service that acts as the glue between loosely coupled systems. It allows us to send and receive messages between independent applications while decoupling the publishers of events and subscribers to those events. This means that the publishers do not need to know anything about their subscribers. Pub/Sub is fully managed, so it offers scale at ease, making it perfect for a modern stream analytics pipeline.
There are some core concepts that you should understand:
- A publisher is an application that will create and send messages to a topic.
- A topic is a resource that messages are sent to by publishers.
- A subscription represents the stream of messages from a single topic to be delivered to the subscribing application. Subscribers will either receive the message through pull or push, meaning Pub/Sub pushes the messages to the endpoint using a webhook, or the message is pulled by the application using HTTPS requests to the Google API.
- A message is data that a publisher will send to a topic. Simply put, it is data in transit through the system.
As we mentioned earlier, subscribers receive the messages either through a pull or push delivery method. Similar to publishers, pull subscribers can be any application that can make HTTPS requests to googleapis.com. In this case, the subscriber application will initiate requests to Pub/Sub to retrieve messages. On the other hand, push subscribers must be webhook endpoints that can accept POST requests over HTTPS.
In this case, Cloud Pub/Sub initiates requests to the subscriber application to deliver messages. Consideration should be given as to the best method for our requirements. As an example, if you are expecting a large volume of messages (for example, more than one message per second), then it is advisable to use the pull delivery method. Alternatively, if you have multiple topics that must be processed by the same webhook, then it is advisable to use the push delivery method:
To summarize this, let's look at how this works in practice in the preceding diagram. When all of the deliveries are complete, the message will be removed from the queue.
As Cloud Pub/Sub integrates well with other GCP products, it opens up a lot of use cases.
- Distributed event notifications: Let's take a service that needs to send a notification whenever there is a new registration. Pub/Sub makes it possible for a downstream service to subscribe and receive notifications of this new registration.
- Balancing workloads: Tasks can be distributed among multiple Compute Engine instances.
- Logging: Pub/Sub can write logs to multiple systems. For example, if we wish to work with real-time information, we could write to a monitoring system, and if we wish to analyze the data at a later time, we could write to a database.
Now, let's look at creating a Pub/Sub topic and subscription. Then, we will publish a message and pull it via Cloud Shell. Please refer to This Post, Google Cloud Management Options, for more information on Cloud Shell. At this point, we are assuming you can open this:
- From the Navigation menu, browse to BIG DATA | Pub/Sub, as shown in the following screenshot:
- Click on CREATE TOPIC:
- Provide a Topic ID. In this example, we will name it newTopic. Click CREATE TOPIC, as shown in the following screenshot:
- Select Create subscription, as shown in the following screenshot:
- Provide a Subscription ID. In this example, we will call it newSub. We will leave the rest of the settings as their default settings, but we should note what each of these is:
- Message retention duration: Pub/Sub will try to deliver the message within a maximum of 7 days. After that, the message will be deleted is no longer accessible.
- Subscription expiration: If there are no active connections or pull/push successes, then the subscription expires.
- Acknowledgment Deadline: The subscriber should acknowledge the message within a specific timeframe; otherwise, Pub/Sub will attempt to deliver this message.
- Browse back to Topics and click on the topics we recently created. Scroll to the bottom, select the MESSAGES tab, and select PUBLISH MESSAGE:
- Enter the message's content and select PUBLISH, as shown in the following screenshot:
- From Cloud Shell, we can now pull down the message:
This concludes how to publish and retrieve messages to and from Pub/Sub.
While Pub/Sub should be our default solution for most application integration use cases, GCP does offer Pub/Sub Lite for applications when a lower cost justifies some operational overhead. It offers fewer features, lower availability, and there is a requirement to manually reserve and manage resource capacity. However, using Pub/Sub Lite for a system with a single subscription could be up to 85% cheaper.
Some of the other main differences between the products are as follows:
- Pub/Sub routes messages globally, whereas Lite routes messages zonally.
- Pub/Sub topics and subscriptions are global resources, whereas with Lite, they are zonal resources.
- Pub/Sub scales automatically, whereas Lite requires us to manually provision capacity.
- Pubsub Publisher: This has permission to publish a topic.
- Pubsub Subscriber: This has permission to consume a subscription or attach a subscription to a topic.
- Pubsub Viewer: This has the right to list topics and subscriptions.
- Pubsub Editor: This has the right to create, delete, and update topics, as well as create, delete, and update subscriptions.
- Pubsub Admin: This has the right to retrieve and set IAM policies for topics and subscriptions.
- There is a limit of 10,000 topics and 10,000 subscriptions per project.
- There is a limit of 10,000 attached subscriptions per topic.
- There is a limit of 1,000 messages and a 10 MB total size per publishing request.
- There is a limit of 10 MB per message.
In this section, we looked at Pub/Sub and Pub/Sub Lite. In the next section, we'll look at Cloud Dataflow.
Cloud Dataflow is a service based on Apache Beam, which is an open source software for creating data processing pipelines. A pipeline is essentially a piece of code that determines how we wish to process our data. Once these pipelines have been constructed and input into the service, they become a Dataflow job. This is where we can process the data that's been ingested by Pub/Sub. It will perform steps to change our data from one format to another and can transform both real-time streams or historical batch data. Dataflow is completely serverless and fully managed. It will spin up the necessary resources to execute our Dataflow job and then delete these resources when the job is complete. As an example, a pipeline job might be made up of several steps. If a specific step needs to be executed on 15 machines in parallel, then Dataflow will automatically scale to these 15 machines and remove them when the job is complete. These resources are based on Compute Engine and are referred to as workers, and Cloud Storage is used as a temporary staging area and I/O. These resources are based on our location settings.
Dataflow will transform and enrich data in stream and batch modes. Its serverless architecture can be used to shard and process very large batch datasets or a high volume of live streams in parallel.
Here, we have a Pub/Sub subscription and we wish to write the data to a data warehouse service for analysis. In this example, we will use BigQuery to analyze our data. We will discuss this service in the next section. We need Dataflow to convert JSON-formatted messages from Pub/Sub into BigQuery elements. For this example, we will use a Google public GCP project, which has Pub/Sub Topics set up. We will also assume that we have a BigQuery table ready for the data to be written to:
- From the Navigation menu, browse to BIG DATA | Dataflow:
- Click on CREATE JOB FROM TEMPLATE:
- Now, we can populate the required parameters for our Dataflow job:
- Select a location to deploy our Dataflow workers to and store our metadata in. We will leave this as the default setting.
- Insert the Pub/Sub input topic. This is a public project that's made available by Google as part of the template.
- Insert the BigQuery output table where our data will be written.
- Insert a Cloud Storage bucket to store temporary files:
Once you have selected the dataflow template, we need to populate additional required parameters. We need to enter a valid Pub/Sub topic. In the following example, we are using a public GCP-managed topic. We also need to enter a BigQuery table to output the data to and then a location for temporary files to be written to. In our example, we are using a Cloud Storage bucket called cloudarchitectbookupdate:
- If we go to our BigQuery table and run a SQL query, we will see that we are pulling in data from the Pub/Sub topic (note that your table will be named differently):
Of course, this query would be refined rather than us selecting all.
One additional important thing to note is that although the preceding example has built a pipeline job from a template, this is only one of the options available. This option allows us to quickly start with no coding. However, GCP offers us the option to build pipelines using the following languages:
- Java and Maven
- Dataflow Admin: This has the right to create and manage Dataflow jobs.
- Dataflow Developer: This has the right to execute and manipulate Dataflow jobs.
- Dataflow Viewer: This has read-only rights to all Dataflow-related resources.
- Dataflow Worker: This has rights for a GCE service account to execute work units for a Dataflow pipeline.
Google Cloud Dataflow comes with predefined quotas. These default quotas can be changed via the Navigation menu and IAM & Admin | Quotas. From this menu, we can review the current quotas and request an increase to these limits. We recommend that you are aware of the limits for each service as this can have an impact on your scalability. For Cloud Dataflow, we should be aware of the following limits:
- Dataflow uses Compute Engine to execute pipeline code, and then we are subject to Compute Engine quotas.
- There is a limit of 1,000 workers per pipeline.
- There is a limit of 10 MB for a job creation request.
- There is a limit of 20,000 side input shards.
BigQuery is a fully managed, serverless analytics service. It can scale to petabytes of data and is ideal for data warehouse workloads. It is the analysis stage of our solution, and once Dataflow processes our data, BigQuery will provide value to our business by querying large volumes of data in a very short period. Queries are executed in the SQL language, so it will be easy to use for many. We should emphasize that BigQuery is enterprise-scale and can perform large SQL queries extremely fast – all without the need for us to provision any underlying infrastructure.
BigQuery is ideal for data warehouse workloads as it has the capacity for PB of storage.
Datasets are used to organize and control access to tables and views. We require at least one dataset before we can load data into BigQuery. Dataset names are unique to each project, and we are required to specify a location. Locations can be either regional or multi-regional, and we cannot change our location once the dataset has been created. Some consideration should be taken when selecting a location, depending on the export or ingestion requirements. For example, if we plan to query data from an external data source such as Cloud Storage, the data we are querying must be in the same location as our BigQuery dataset.
An additional consideration when creating a dataset is whether or not to apply a default table expiration. BigQuery charges for both data storage and per query. We should think about setting an expiration date if our tables will be for temporary use. The following example shows the same screenshot as the preceding one after Data location has been selected:
Once we have created a dataset, we can create a table and load data into it. Tables will contain individual records that are organized into rows. Each table is defined by a schema that will describe the column names and data types; for example, a string or integer. The schema is decided when we create our table. We can also set an expiration date for our tables when we create them. If we set an expiration value when we create our dataset, then this value will be overwritten by the expiration value that's set when the table is created. If no value is set on either, then the table will never expire.
Partitioned tables make it easier to manage and query data. These are tables that are split into smaller partitions, meaning it will improve the query's performance and reduce costs. This is because the number of bytes that are read by a query is also reduced.
BigQuery offers three types of table partitioning:
- Ingestion time: BigQuery will automatically load data into daily, data-based partitions.
- Time-Unit Column: These allow us to partition a table on the DATE, TIMESTAMP, or DATETIME column.
- Integer Range: These allow us to partition a table based on the ranges of values in an integer column.
It is possible to set expiration data on partitioned tables. Use this if you have, for example, a requirement to delete sensitive data after a while.
To send a query statement to an external database and receive the result as a temporary table, we can use Federated queries. These will use the BigQuery Connection API and establish a connection with an external database. This allows us to use the EXTERNAL_QUERY function in our standard SQL query to send a statement to the external database, using that database's SQL dialect. The results will be transformed into BigQuery standard SQL data types.
Just as a reminder, Google Sheets is an online spreadsheet application by Google. Connected Sheets lets us analyze data from a Google Sheet without needing to know SQL. We can access, analyze, and share billions of rows of BigQuery data from our Google Sheet spreadsheet. This can be useful if we want to collaborate with our partners or analysts within a recognizable spreadsheet interface or provide a single source of truth without further spreadsheet exports. Connected Sheets will execute the BigQuery queries on our behalf, via a request or a schedule, and save the results in our spreadsheet for analysis.
Storage and compute separation
BigQuery separates storage and compute. By decoupling these components, we can select the storage and processing solutions that are best for our organization as each of them can scale independently, therefore offering a real elastic data warehouse solution.
To increase performance and efficiency, BigQuery offers materialized views, which are precomputed views that will routinely cache the results of our queries. BigQuery uses these precomputed results and, whenever possible, reads only the delta changes from the base table to compute up-to-date results. Using these views can significantly enhance the performance of workloads that have common characteristics.
BigQuery BI Engine
BigQuery BI Engine is a Business Intelligence (BI) service built into BigQuery that allows us to analyze data that has been stored in BigQuery. It offers sub-second query response time with high concurrency. Additionally, we can build dashboards and reports that are backed by BigQuery, without impacting performance or security. It can integrate into Google tools such as Data Studio (which we will look at later in this post) and Connected Sheets, or third-party BI tools such as Microsoft's Power BI or Tableau.
BigQuery ML Engine
BigQuery ML is a Machine Learning (ML) service that allows us to create and execute ML models in BigQuery using SQL queries. Machine learning will be discussed in more detail in This Article, Putting Machine Learning to Work.
BigQuery GIS allows us to augment our analytics workflows with location intelligence. By using geography data types and standard SQL geography functions, we can analyze and visualize geospatial data in BigQuery. This can be a real benefit for organizations where decisions are based on location data.
At the time of writing, this feature is in preview. However, it is worth us highlighting the service for your information. BigQuery Omni is a multi-cloud analytics solution that allows us to analyze data across Azure, AWS, and GCP all within the BigQuery UI. BigQuery Omni is powered by Anthos, which means we do not have to manage any underlying infrastructure. We will discuss Anthos in This Post, Managing Cloud-Native Workloads with Anthos.
Let's look at the power of BigQuery. Google offers several example datasets for public use so that you can get a feel for BigQuery. In this example, we will use a dataset called Chicago Taxi Trips and execute a query to see which drop-off areas give the highest tip:
- Browse to BIG DATA | BigQuery in the GCP console, as shown in the following screenshot:
- Expand ADD DATA and select Explore public datasets, as shown in the following screenshot:
- In the marketplace, search for taxi and select the Chicago Taxi Trips dataset:
- View the dataset:
- You will now see the public datasets that are available for our use. Browse to the Chicago Taxi Dataset:
- Now, we can compose a new query on this dataset. This query will provide some valuable information. We will retrieve both the average and the highest tip given to drivers, grouped by drop-off communities. We can see that this query uses standard SQL language and that ANSI 2011 is supported:
Notice from the results that we have processed the query in only 1.2 seconds. This may not be a large amount of data that's been queried, but it shows that we can quickly load a dataset and execute our query. There were no prerequisite steps to deploying the infrastructure.
One thing regarding BigQuery that we should mention is that it has a command-line tool called bq. We will look at bq in more detail in This Post, Google Cloud Management Options.
Importing and exporting data
The preceding example was based on a public dataset. BigQuery allows us to upload data from the console or Cloud Shell. Let's look at using the console to upload a CSV file we have locally. In this example, we will use an example dataset named london_bikes. We will also use a CSV called london_bikes.csv:
- From our BigQuery dataset, we can click on the plus (+) icon to create a new table, as shown in the following screenshot:
- We are then given several methods to create our table. Select Upload and browse to the location of our CSV file:
- Then, we can populate the dataset we want to add the table to and build our schema. In this example, we are adding entries to our schema with different types. Here, we can see that start_station_name and num are both STRING:
Let's export our table to a Cloud Storage bucket:
- From our table view, we can click EXPORT:
- Then, we can populate our bucket name and export it:
GCP also offers something called BigQuery Data Transfer Service, which will automate the process of moving data into BigQuery on a scheduled basis. After the initial data transfer configuration, the service will automatically load data into BigQuery. Note that it cannot be used for transferring data out of BigQuery.
Although we are discussing BigQuery as big data, let's remind ourselves that it is also a storage service. There is no need to export older data to another storage platform. With BigQuery, we can enjoy the same inexpensive long-term storage as we would expect from services such as Cloud Storage. So long as we are not editing tables for 90 consecutive days, then our cost will drop by 50%, which would match the cost of the Cloud Storage Nearline class.
We now have some valuable information from our data. By putting our full services together, at a high level, our solution would now look like this:
Take a moment to review the preceding diagram and understand the services that are used at each stage. This is important for the exam.
Storage Read API
BigQuery also offers a Storage Read API, which provides fast access to BigQuery-managed table data. This is the third option for accessing table data in BigQuery, alongside the tabledata.list or jobs.getQueryResults REST API, or bulk data export using BigQuery extract jobs, which will export table data to Cloud Storage. When using the Storage Read API, structured data is sent over the wire in a binary serialization format, allowing for additional parallelism, thus improving on the two original options.
- BigQuery User: This has the rights to run jobs within the project. It can also create new datasets. Most individuals in an organization should be a user.
- BigQuery Job User: This has the rights to run jobs within the project.
- BigQuery Read Sessions User: This has the rights to create and read sessions within the project via the BigQuery storage API.
- BigQuery Data Viewer: This has the right to read the dataset metadata and list tables in the dataset. It can also read data and metadata from the dataset tables.
- BigQuery Metadata Viewer: This has the rights to list all datasets and read metadata for all datasets in the project. It can also list all the tables and views and read the metadata for all the tables and views in the project.
- BigQuery Data Editor: This has the right to read the dataset metadata and list tables in the dataset. It can create, update, get, and delete the dataset tables.
- BigQuery Data Owner: This has the rights to read, update, and delete the dataset. It can also create, update, get, and delete the dataset tables.
- BigQuery Admin: This has the right to manage all the resources within the project.
BigQuery comes with predefined quotas. These default quotas can be changed via the Navigation menu and IAM & Admin | Quotas. From this menu, we can review the current quotas and request an increase to these limits. We recommend that you are aware of the limits for each service, as this can have an impact on your scalability. For BigQuery, we should be aware of the following limits:
- There is a limit of 100 concurrent queries.
- There is a query execution time limit of 6 hours.
- 1,500 maximum number of table operations are allowed per day.
- There is a maximum of 100,000 streamed rows of data per second, per project. This is 500,000 in the US and EU multi-regions.
- A maximum of 300 concurrent API requests per user is allowed.
Storage costs for BigQuery are based on the amount of data we store. We can be charged for both active and long-term storage. Of course, long-term storage will be a lower monthly change. Additionally, we are charged for running queries. We are offered two pricing models for queries:
- On-demand: This charges based on the amount of data that's processed by each query.
- Flat-rate: This allows us to purchase dedicated resources for processing, so we are not charged for individual queries.
Pricing is also determined by the location of our BigQuery dataset. Now, let's look at Dataproc.
Dataproc is GCP's big-data-managed service for running Hadoop and Spark clusters. Hadoop and Spark are open source frameworks that handle data processing for big data applications in a distributed manner. Essentially, they provide massive storage for data, while also providing enormous processing power to handle concurrent processing tasks.
If we refer to the End-to-end big data solution section of this article, Dataproc is also part of the processing stage. It can be compared to Dataflow; however, Dataproc requires us to provision servers, whereas Dataflow is serverless.
Dataproc should be chosen over Dataflow if we have an existing Hadoop or Spark Cluster. Also, the skill sets of existing resources are needed. If we need to create new pipeline jobs or process streaming data, then we should select Dataflow.
As an alternative to hosting these services on-premises, Google offers Dataproc, which has many advantages – mainly cost-saving, as you are only charged for what you use, with no large initial outlay for the required processing power and storage. Traditional on-premises Hadoop clusters are generally expensive to run because you are paying for processing power that is not being utilized regularly, and we cannot remove the cluster because our data would also be removed. In comparison, Dataproc moves away from persistent clusters to ephemeral clusters. Dataproc integrates well with Cloud Storage. Therefore, if we have a requirement to run a job, we can spin up our cluster very quickly, process our data, and store it on Cloud Storage in the same region. Then, we can simply delete our cluster. Dataproc clusters are not made to run 24/7 as they are job-specific, and this is where we can gain cost savings. This approach allows us to use different cluster configurations for individual jobs, scale clusters to suit individual jobs or groups of jobs, and reduce any maintenance of the clusters as we are simply spinning up a freshly configured cluster each time we need to run a job.
The whole point of an ephemeral cluster is to use it only for the job's lifetime.
The underlying Dataproc infrastructure is built on Compute Engine, which means we can build on several machine types, depending on our budget, and take advantage of predefined and custom machine types.
Cost savings are also increased by using preemptible instances.
When Dataproc clusters are created, we have the option to set a maximum combination of 96 CPU cores and 624 GB RAM. We can also select between SSD and HDD storage. Please refer to This Post, Working with Google Compute Engine, for more details on machine types.
In a Dataproc cluster, there are different classes of machines:
- Master nodes: This machine will assign and synchronize tasks on worker nodes and process the results.
- Worker nodes: These machines will process data. These can be expensive due to high their CPU and memory specifications.
- Preemptible worker nodes: These are secondary worker nodes and are optional. They do the same job but lower the per-hour compute costs for non-critical data processing.
When we create a new cluster, we can select different cluster modes:
- Standard: This includes one master node and N worker nodes. In the event of a Compute Engine failure, in-flight jobs will fail and the filesystem will be inaccessible until the master nodes reboot.
- High availability: This includes three master nodes and N worker nodes. This is designed to allow uninterrupted operations, despite a Compute Engine failure or reboots.
- Single node: This combines both master and worker nodes. This is not suitable for large data processing and should be used for PoC or small-scale non-critical data processing.
- Apache Spark
- Apache Hadoop
- Apache Pig
- Apache Hive
- Hadoop Distributed File System (HDFS)
The process of migrating from on-premises to Google Cloud Storage is explained further at https://cloud.google.com/solutions/migration/hadoop/hadoop-gcp-migration-data.
We can also specify initialization actions in executables or scripts that Dataproc will run on all the nodes in our cluster immediately after the cluster has been set up. These are often used to set up job dependencies to ensure jobs don't require any dependencies to be installed.
- Dataproc Editor: This has full control over Dataproc.
- Dataproc Viewer: This has rights to get and list Dataproc machine types, regions, zones, and projects.
- Dataproc Worker: This is for service accounts only and provides the minimum permissions necessary to operate with Dataproc.
- Dataproc Admin: This role has the same permissions as Editor but can Get and Set Dataproc IAM permissions.
We recommend that you review the Further reading section for more information on Compute quotas.
The Internet of Things, or IoT, is a collective term for physical objects that are connected to the internet. You are no doubt using many IoT devices today, such as smartwatches, Google Home, or wirelessly controlled light bulbs.
Cloud IoT Core is a fully managed service that allows us to securely connect, manage, and ingest data from devices spread around the globe. It is also completely serverless, meaning no upfront software installation. Cloud IoT Core integrates with other GCP services to offer a complete solution for collecting, processing, and analyzing data in real time. Let's look at the following diagram. It shows us where Cloud IoT Core sits in the overall end-to-end solution we have discussed in this article and the protocols that IoT devices can use to communicate with it – MQTT and HTTP:
- MQTT and HTTP protocol endpoints
- Automatic load balancing
- Global data access with Pub/Sub
- Register individual devices
- Configure individual devices
- Update and control devices
- Provide role-level access control
- Provide a console and APIs for device deployment and monitoring
- Cloudiot Viewer: This has read-only access to all Cloud IoT resources.
- Cloudiot Device Controller: This has access to update the configuration of devices. It does not have the right to create or delete devices.
- Cloudiot Provisioner: This has the right to create and delete devices from registries but not to modify the registries.
- Cloudiot Editor: This has read-write access to all Cloud IoT resources.
- Cloudiot Admin: This has full control over all Cloud IoT resources and permissions.
- Project, device, and telemetry limits: These limits refer to the number of devices per project, device metadata, telemetry event payloads, and MQTT connections per device.
- Rate: These limits refer to device-to-cloud and cloud-to-device throughput limits, MQTT incoming messages per second and connection, and device manager API limits.
- Time: These limits refer to MQTT connection time and timeout limits.
We recommend that you refer to the Further reading section for more information.
Cloud IoT Core is charged according to the data volume that's used per calendar month. Google recommends using the pricing calculator to estimate the price according to the volume of data that's exchanged.
Cloud Data Fusion is a fully managed and cloud-native data integration service for quickly building and managing data pipelines. Data Fusion uses Dataproc as the execution environment for these pipelines. The GUI caters to a variety of users, which means that business users, developers, and data scientists can easily build integration solutions that will cleanse, prepare, and transform data. Data fusion also offers a library of preconfigured plugins to extend its capabilities. It is also important to note that it is powered by an open source project called Cask Data Application Platform (CDAP).
To begin with, we must create a Cloud Data Fusion instance. Instances run as the Compute Engine service account and Data Fusion executes pipelines using a Dataproc cluster. Instances are a unique deployment of Data Fusion and are created from the GCP Console. At the time of creating, we can decide what type of instance we want to deploy, and the choice will be determined by requirements and cost. There are three options:
- Developer instances provide a full-featured edition with zonal availability and limitations on execution environments.
- Basic instances provide comprehensive integration capabilities but have limitations on simultaneous pipeline runs and are recommended for non-critical environments.
- Enterprise instances also offer comprehensive integration capabilities but there are no limitations on simultaneous pipeline runs and are recommended for critical environments.
Each instance contains a unique and independent Data Fusion deployment that will contain a set of services that handle pipeline life cycle management, orchestration, coordination, and metadata management. Note that before we can create an instance, we will be prompted to enable the API and create credentials:
- From the Data Fusion UI, click HUB. This will open a list of plugins that we can utilize to extend the capabilities of Data Fusion:
- Select the Cloud Data Fusion Quickstart plugin:
- Click Create, update the pipeline name if required, and click Finish:
- Click Customize Pipeline when prompted. This will take us to our pipeline studio. This is a graphical interface for developing our pipeline. It allows us to drag and drop various plugins onto our canvas. If we click Deploy from the top right, it will submit the pipeline to Data Fusion. In this example pipeline, we are reading a JSON file containing New York Times bestseller data from Cloud Storage. It will then run transformations on the file to parse and clean the data. Finally, it will load the top-rated books that have been added in the last week that cost less than $25 into BigQuery:
- We can then Run our pipeline and see the results at the end of the process:
With that, we have seen how easy it is to get started with Data Fusion.
Cloud Data Fusion will create ephemeral execution environments to run pipelines when we want to manually run our pipelines or if we want our pipelines to run through a schedule or a pipeline state trigger. We have already mentioned that Cloud Data Fusion supports Dataproc as an execution environment. We can choose to run pipelines as MapReduce, Spark, or Spark Streaming programs. Cloud Data Fusion will provision an ephemeral Dataproc cluster in our customer project at the start of a pipeline run. Then, it will execute the pipeline using MapReduce or Spark in the cluster before deleting the cluster once the pipeline's execution is complete.
If we are using technologies such as Terraform to manage existing Dataproc clusters, we can configure Cloud Data Fusion to use these and not provision new clusters.
It is recommended that we use autoscaling policies to increase the cluster's size, not decrease it. Decreasing its size with autoscaling will also remove nodes that hold intermediate data and may cause failure.
Pipelines offer us a way to visually design data and control flows to extract, transform, aggregate, and load data from various data sources – whether that is on-premises or cloud-based. Pipelines allow us to create data processing workflows that can help us solve data ingestion, integration, and migration problems. We can use Cloud Data Fusion to build real-time and batch pipelines.
Pipelines allow us to express our data processing workflows using the logical flow of data and Cloud Data Fusion will handle all the functionality to physically run in an execution environment. The Cloud Data Fusion planner transforms the logical flow into parallel computations by using Apache Spark and Apache Hadoop MapReduce on Dataproc.
- Secure Data Lakes on GCP. Data Fusion helps us build scalable, distributed data lakes on GCP.
- Agile data warehouses with BigQuery. Data Fusion can break down data silos and enable the development of data warehouse solutions in BigQuery.
- Unified Analytics. Data Fusion can establish an analytics environment across a variety of on-premises data marts.
- Datafusion Viewer: This has full access to the Data Fusion UI, along with permissions to view, create, manage and run pipelines.
- Datafusion Admin: This has all viewer permissions, plus permissions to create, update, and delete all Data Fusion instances
- Datafusion Runner: This is granted to the Dataproc service account so that Dataproc is authorized to communicate the pipeline runner information.
- There is a limit of 600 requests per minute from a single user in a single region.
- Data Fusion quotas apply to all pipelines that are executed.
- Data Fusion also stores logs, so Cloud Operations logging quotas apply.
We recommend that you refer to the Further reading section for more information.
Usage for Data Fusion is measured as the length of time, in minutes, between the time a Cloud Data Fusion instance is created to the time it is deleted. Cloud Data Fusion is billed by the minute, although pricing is defined by the hour. Usage will be measured in hours – for example, 30 minutes is 0.5 hours – to apply hourly pricing to minute-by-minute use.
Now, let's look at the Datastream API.
Datastream is a serverless change data capture (CDC) and replication service. It helps us bring in change streams from Oracle or MySQL into Google's data services to support analytics, DB replication, and event-driven architectures. It allows us to synchronize our data across heterogeneous databases and applications with minimal latency and downtime.
Please note that at the time of writing, this product is in the pre-GA phase. However, we felt that it is worth highlighting the service as something to be aware of.
There are other services offered by Google that we wish to highlight:
- Dataprep: This is a web application that allows us to define preparation rules for our data by interfacing with a sample of the data. Like many of the other services we have discussed, Dataprep is serverless, meaning no upfront deployments are required. By default, Dataprep jobs are executed on a Dataflow pipeline. Refer to https://cloud.google.com/dataprep/ for more information.
- Datalab: This is built on Jupyter (formerly IPython), which is an open source web application. Datalab is an interactive data analysis and machine learning environment. We can use this product to visualize and explore data using Python and SQL interactively. This would be treated as part of the data usage stage of our end-to-end solution and would use data that's passed from BigQuery. Datalab is free of charge but runs on Compute Engine instances, so charges will be applicable. For more information, refer to https://cloud.google.com/datalab/.
- Data Studio: This is a Business Intelligence (BI) solution that turns your data into informative and easy-to-read dashboards and reports. It is a fully managed visual analytics service. Refer to https://datastudio.google.com/overview/ for more information.
- Composer: Another workflow orchestration service is Cloud Composer. Built on the Apache Airflow open source project, it allows us to create workflows that span across both public cloud and on-premises data centers. Workflows are a series of tasks that are used to ingest, transform, analyze, and utilize data. In Airflow, these workflows are created using what is known as Directed Acyclic Graphs (DAGs). DAGs represent a group of tasks that we want to schedule and run. Cloud Composer operates using Python, so it is free from cloud vendor lock-in and easy to use. DAGs are created using Python scripts that define tasks and their dependencies in code.
Similar to Data Fusion, there is a requirement for an environment to execute these workflows. By using Cloud Composer, we can benefit from everything that Airflow offers without the need for any installation or management overhead, so Composer will provision the GCP resources required to run our workflows. This group of components is referred to as a Cloud Composer environment and is based on GKE. You can define the number of nodes in the GKE cluster when creating a Composer environment. For more information, refer to https://cloud.google.com/composer/docs/concepts/overview.
- Looker: Looker is an enterprise platform that helps us explore and share company data to make more informed business decisions. Using a modeling language called LookML, we can query data to create reports, dashboards, and other patterns of data.
- Data Catalog: Without the right tools, dealing with the growing number of data assets within many of today's organizations will be a real challenge. Stakeholders wish to search for insightful data, understand the data, and make the data as useful as possible. Data Catalog is a scalable metadata management service that can catalog the metadata on data assets from BigQuery, Pub/Sub, the Dataproc metastore, and Cloud Storage. For more information, refer to https://cloud.google.com/data-catalog/docs/concepts.
While the preceding services are not primary exam topics, we should still be aware of what each service offers.
In this post, we covered the main aspects of big data relating to the exam. We covered each service and showed that these can be used at different stages of our end-to-end solution. We took the time to see how we can configure Pub/Sub, Dataflow, and BigQuery from the GCP console and discussed Dataproc and Cloud IoT Core.
The key takeaway from this article is to understand which services map to the ingest, process, and analysis stages of data.
Then, we looked at the processing stage of our solution. Cloud Dataflow will deploy Google Compute Engine instances to deploy and execute our Apache Beam pipeline, which will process data from Pub/Sub and pass it onto further stages for analysis or storage. We have shown how we can easily create a pipeline in the GCP console, which pulls information from Pub/Sub for analysis in BigQuery.
After, we covered BigQuery and understood that it is a data warehouse. It is designed to make data analysts more productive, crunching petabytes of data in small amounts of time. It is completely serverless, so we do not have to worry about provisioning any infrastructure before we can use it, which saves a lot of upfront cost and time. We looked at how easy and quick it is to set up a dataset and start to query it.
We also covered Dataproc. We covered the architecture of the service and established that Dataproc is an alternative to hosting Hadoop clusters on-premises. We should now understand that Dataproc can be used to process data that has been injected from Cloud Pub/Sub.
We also covered Cloud IoT Core, which is used to connect IoT devices. We showed you how we can ingest the real-time data that's generated by these devices. Additionally, we discussed data pipelines using Data Fusion and serverless CDC and replication services in Datastream. We also introduced Dataprep, Datalab, Datastudio, Composer, DataCatalog, and Looker.
In the next Post, we will take a look at machine learning.
Read the following articles for more information on the topics that were covered in this post:
- For Cloud Pub/Sub, refer to the following:
- For Cloud Dataflow, refer to the following:
- For BigQuery, refer to the following:
- For Dataproc, refer to the following:
- For Cloud IoT Core, refer to the following:
- For Data Fusion, refer to the following:
- For Datastream, refer to https://cloud.google.com/datastream.