by Pushkar Reddipalli
BigQuery is Google's fully managed, petabyte scale, low cost analytics data warehouse. It allows you to execute terabyte-scale queries, get results in seconds, and gives you the benefits of being a fully managed solution. BigQuery and other Google Cloud Platform (GCP) services offer "pay per use" pricing. If you know how to write SQL Queries, you already know how to query it. In fact, there are plenty of interesting public data sets shared in BigQuery, ready to be queried by you.
In this blog post, I will guide you through best practices to implement Google BigQuery to control costs and optimize performance regardless of the level of development experience you have. To start working with BigQuery, you must first upload your data, which means that you will also use it as storage, then verify this data in some way to create your models or create reports. In the next section, we will analyze best practices for these main actions of loading, querying and storing.
Depending on the nature of your data and the amount of historical data you need to keep; you can optimize your storage to save costs. Some of the best practices are:
If you only need new data or recently updated data, no matter the historical data, you can set an expiration time for tables or partitions. For example, you can set expiration of a partition in 90 days, this means that any partition older than 90 days will be automatically deleted. Note that this cannot be applied to existing tables / partitions, only to tables / partitions created after setting the policy, to apply it older tables/partitions you might have to recreated them with the policy.
BigQuery is natively optimized for cost reduction in data storage that is no longer actively consulted. By default, any table / partition in BQ that is not touched for at least 90 days will move to what is called long-term storage, which costs $ 0.01 per GB per month, which is 50% cheaper than normal rates. For even older data that you still need to keep, in case you need to consult them, or just for general governance purposes, you have the option to export that BigQuery data to Google Cloud Storage for an even lesser cost.
Although Star or Snowflake schemes are easy to implement in a "lift and shift" scenario, it is worth the time and effort to adjust the data model to make use of the BigQuery column storage format, called Capacitor.
BigQuery supports nested records within tables. For example, in the relational database Invoice_Headers and Invoice Lines are stored in separate tables. In BigQuery, you can create a column in Invoice Headers that contains an array of Invoice Lines records. BigQuery will collocate the inner table records with its corresponding outer table records while loading the data, while retrieving the data BigQuery will only perform a Join operation between the collocated records. If you retrieve only outer table columns your query will cost and perform as if the outer table is just a separate table and when you try to access any fields in the Inner table then only the BigQuery will read and join the Inner table records.
BigQuery allows you to partition by a date column or partition by load time. This improves performance and reduces the cost of a query, since it only reads the partitions requested by the query.
BigQuery can arrange data based on Clustered columns. For example, you can arrange Employee data by Department and if there is query on a department BigQuery reads until it reaches the filtered department and skips reading rest of the data, this eliminates unnecessary data scans. Also, when we group by clustered columns in a query Bigquery reduces the overhead of sorting the data by this column.
Note: As of now, BigQuery only allows you to Cluster columns of a Partitioned table, incase you don't have any Date column to apply partition on you can partition by a dummy date column to make use of clustering.
Well, that's it folks. Please watch this section for more exciting posts on BigQuery - Analytics Data Warehouse.
Pushkar Reddipalli is a Software Engineer in KPI Partners and has around 10 years of experience in Data warehousing and Business Intelligence. He has implemented many BI solutions on Oracle BI Applications and Google Cloud Platform. He has extensively worked on ETL tools like Informatica and ODI, Reporting tools like OBIEE and Looker. He has proficiency in Python and Spark.