PCA-2 Quick, GCP Professional Architect – Cloud SQL, BigQuery, BigTable, scalable fast R/W



It’s very important, both for understanding Google Cloud and for passing the certification exam, that you go through the question and attempt to answer it yourself first. So pause the video, work through the question, and we’ll catch up in a little while and I’ll show you how I do it.

This project scenario is based on the TerramEarth case study. Today, TerramEarth maintenance workers receive interactive performance graphs for the last 24 hours (or 86,400 events) by plugging their maintenance tablets into the vehicle. The support group wants support technicians to view this data remotely to help troubleshoot problems. You want to minimize the latency of graph loads. How should you provide this functionality?
So in this project scenario, we have got these vehicles coming in to some service
center. The support technician plugs in a tablet that they have into the vehicle. And
data that has been collecting in the vehicle over the last 24 hours, which amounts to roughly one event per second, is all pulled out and uploaded to a central server.
The support technicians would really like to use this data now to compare it
with probably what happened in the previous days, to see if some
maintenance needs to be done on this vehicle, or are they expecting something
to happen that they want to preempt. The support group wants to give this ability
to the support technicians to view this data quickly. What is the problem right
now? The data is probably centralized somewhere, and because of that the graph loads are taking an inordinately long time, which they would now like to minimize. Let’s look at the options we have to solve this.

The first option suggests that we execute queries against the data stored in Cloud SQL. This is possible: Cloud SQL supports MySQL, PostgreSQL and SQL Server, so we can run queries against it and get the data. Now, looking at the TerramEarth case study, there are a few things that we should recall. TerramEarth collects a total of about 9 TB per day from these connected vehicles, so this is clearly big-data scale. Rounding that up to roughly 10 TB a day, that is 365 × 10, or about 3,650 TB per year, which is on the order of three petabytes a year from all these connected vehicles.
So there is a large amount of data that they are trying to process. The next thing is that these service centers are in around 100 countries, so clearly they are not located in a single region, say Southeast Asia or the US West Coast; they are spread all across the world. The question for Cloud SQL is whether it will scale to a data requirement that runs to many petabytes, and whether it will serve a global audience.
Taking MySQL as an example, Cloud SQL does not scale to that level; it can go up to tens of terabytes, but it is unable to support petabytes of data. Therefore, Cloud SQL is not looking like a good option for the scale of data that we are going to need.
Cloud SQL is also regional. You typically choose a region, and you may keep backups or replicas in another region, but your single source of truth for writes is going to be in one location, in one region. Therefore, the requirement that we support over 100 countries is not going to work for us using Cloud SQL. So on both scaling and global reach, Cloud SQL is not going to be effective, and we can eliminate option A.

How about option B? Option B suggests that we execute queries on data
in Bigtable. Now, queries in Bigtable are not going to be SQL, but they are queries nevertheless, so that part is okay: we can query the Bigtable database. The next suggestion is that the data should be keyed by vehicle_id.timestamp. Let’s see how that works. In Bigtable, we define a row key that lets us access the data easily, and typically this key is a combination of values, for example vehicle_id and timestamp. When a technician wants to look at the data for a vehicle, it becomes very easy to use a single vehicle ID to fetch just the data for that vehicle. The fact that we also appended the timestamp allows us to look at this data for a particular period. So we can ask for the data for, say, vehicle_id 123 from the last 24 hours up to the current moment, or from the last 7 days up to the current moment, expressed as a timestamp range. When we have data in this key format, it allows us to quickly narrow down this large amount of data. So far, this option is looking good.
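To make that concrete, here is a minimal sketch using the Cloud Bigtable Python client, showing how a vehicle_id plus timestamp row key turns “the last 24 hours for one vehicle” into a single range scan. The project, instance, table, column family and column qualifier names are hypothetical, chosen only for this illustration:

```python
import time

from google.cloud import bigtable
from google.cloud.bigtable.row_set import RowSet

# Hypothetical project / instance / table names, for illustration only.
client = bigtable.Client(project="terramearth-demo")
table = client.instance("telemetry-instance").table("vehicle_events")

# With row keys shaped like "<vehicle_id>.<zero-padded epoch seconds>", all events
# for one vehicle sort together, so "the last 24 hours" is one contiguous range scan.
vehicle_id = "vehicle-123"
now = int(time.time())
day_ago = now - 86_400  # 86,400 seconds ~ one event per second for 24 hours

row_set = RowSet()
row_set.add_row_range_from_keys(
    start_key=f"{vehicle_id}.{day_ago:010d}".encode(),
    end_key=f"{vehicle_id}.{now:010d}".encode(),
)

for row in table.read_rows(row_set=row_set):
    # The "metrics" column family and "engine_temp" qualifier are assumed here.
    cells = row.cells.get("metrics", {}).get(b"engine_temp", [])
    if cells:
        print(row.row_key.decode(), cells[0].value)
```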
The next thing we need to consider is that when the vehicle is brought in, somebody plugs a tablet into it, pulls out all the data collected over the last 24 hours, and uploads it immediately to a backend server. So there is going to be a large volume of writes. Does Bigtable support a large number of writes efficiently?
The answer is that it does. Bigtable is able to scale very well for both reads and writes, up to many petabytes of data. Not only can we write this data very quickly, we also have a very fast way of reading it back.
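For the write side, here is a small sketch of how the uploaded events could be written in batches with the same Python client; the names are again hypothetical, and the “metrics” column family is an assumption of mine:

```python
from google.cloud import bigtable

# Hypothetical project / instance / table names, for illustration only.
client = bigtable.Client(project="terramearth-demo", admin=False)
table = client.instance("telemetry-instance").table("vehicle_events")

def write_events(vehicle_id, events, batch_size=500):
    """Write (epoch_seconds, field_name, value_bytes) tuples for one vehicle."""
    rows = []
    for ts, field, value in events:
        row = table.direct_row(f"{vehicle_id}.{ts:010d}".encode())
        row.set_cell("metrics", field.encode(), value)
        rows.append(row)
        if len(rows) >= batch_size:   # flush in modest batches
            table.mutate_rows(rows)
            rows = []
    if rows:
        table.mutate_rows(rows)

# e.g. write_events("vehicle-123", [(1700000000, "engine_temp", b"92.5")])
```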
Therefore, this is starting to look very good for us: we are able to get a large amount of data in quickly, and we are able to minimize the latency of the graph loads. Bigtable also allows us to scale by increasing the number of nodes, and to serve a global audience by adding clusters in other regions and locations, which means that data accessed locally is going to have very low latency.
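Here is a sketch of what that replicated setup could look like with the Bigtable admin client, creating one instance with clusters in several regions so technicians read from a nearby copy. The instance and cluster IDs, node counts and locations are placeholders I chose for the example:

```python
from google.cloud import bigtable
from google.cloud.bigtable import enums

# Hypothetical project, instance and cluster IDs, for illustration only.
client = bigtable.Client(project="terramearth-demo", admin=True)
instance = client.instance("telemetry-instance",
                           instance_type=enums.Instance.Type.PRODUCTION)

# One instance, replicated across regions, so service centers read locally.
clusters = [
    instance.cluster("telemetry-us", location_id="us-central1-b",
                     serve_nodes=3, default_storage_type=enums.StorageType.SSD),
    instance.cluster("telemetry-eu", location_id="europe-west1-c",
                     serve_nodes=3, default_storage_type=enums.StorageType.SSD),
    instance.cluster("telemetry-asia", location_id="asia-southeast1-b",
                     serve_nodes=3, default_storage_type=enums.StorageType.SSD),
]

operation = instance.create(clusters=clusters)
operation.result(timeout=600)  # blocks until the replicated instance is ready
```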
Considering all of these factors, we can be fairly sure that this is looking like a very good answer. But of course, we can’t decide immediately; we have to see whether the other options have any better recommended practices that would be preferable. For now, let us keep this as a possibility and move on to the next option.

Option C suggests that we query BigQuery tables. Now, querying BigQuery tables is
going to be very fast. BigQuery is able to work with analytics data at very large scales and is optimized for reads, so querying BigQuery to minimize the latency of the graph loads sounds reasonable. Let us now look at whether BigQuery also suits the remaining requirements.
In the case of BigQuery, the suggestion is that we create daily partitioned tables. This is fairly good, because we now only have to look at the data that came in over a certain few days. If we wanted just today’s data, it is very easy to get, because that data sits in today’s partition, and we can read the last three partitions for the last three days, or the last seven for the last week, as required. So this is also looking like a fairly reasonable option, but the partitions only give us a date-based way of narrowing down the data.
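A small sketch of what such a query could look like with the BigQuery Python client, assuming an ingestion-time, day-partitioned table; the dataset, table and column names are hypothetical:

```python
from google.cloud import bigquery

# Hypothetical project, dataset and table; assumes a table partitioned by
# ingestion day, so _PARTITIONDATE prunes the scan to the most recent partitions.
client = bigquery.Client(project="terramearth-demo")

query = """
    SELECT event_time, engine_temp
    FROM `terramearth-demo.telemetry.vehicle_events`
    WHERE _PARTITIONDATE >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
      AND vehicle_id = @vehicle_id
    ORDER BY event_time
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("vehicle_id", "STRING", "vehicle-123"),
    ]
)

for row in client.query(query, job_config=job_config).result():
    print(row.event_time, row.engine_temp)
```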
One thing we do not have, though, is partitioning by vehicle. Data is going to come in from multiple vehicles, and it will all sit in the same BigQuery table, so we will have to run some kind of scan over the data to pick out just one particular vehicle. That is going to add some time, and it works against minimizing the latency. The next
consideration will be writing all this data very quickly. So when the vehicle
comes in, we need to load this data via the tablet, which means that
we want a large amount of writes to happen quickly and that should be
efficient. However, BigQuery is not optimized for
writes. Therefore, writes into BigQuery are not going to be very efficient, and this is starting to weaken the option. We also have the problem that BigQuery is
primarily based in a region, or you can also choose it to be a multi-region, but
it is not truly global in the sense that some of the reads from certain countries
are probably going to have to travel to a different continent to be served. So
given that it does not scale extremely well to a global audience, we also have
the problem of serving this data with low latency all across the world. Given these weak points, we would rather eliminate option C, because we already had a better option in option B.

How about the last option, D? We start with
the same idea of running queries against BigQuery, which, as we said, is possible: BigQuery is going to be fast for reads, and it is a great analytical, OLAP database. So that part is okay. But the rest of it, as we saw, is that it does not scale very well for our writes and it is not truly global, which already makes it a weak option. Even before considering the extra hop through Cloud Storage and BigQuery federation, which is not going to help in any way and will only slow things down, we can say that option D will not be suitable for us either.

Given all of that, option B
to use Cloud Bigtable and to query data in it via the vehicle_id.timestamp
key is going to be the best of the options we have in this question. Now it’s time for you to subscribe to all the great content we’ve got lined up for you, to learn Google Cloud and to help you with the certifications!
