by Pushkar Reddipalli
BigQuery Best Practices to Optimize Cost and Performance
BigQuery is Google's fully managed, petabyte-scale, low-cost analytics data warehouse. It allows you to execute terabyte-scale queries and get results in seconds, with all the benefits of a fully managed solution. BigQuery, like other Google Cloud Platform (GCP) services, offers "pay per use" pricing. If you know how to write SQL queries, you already know how to query it. In fact, there are plenty of interesting public data sets shared in BigQuery, ready to be queried by you.
In this blog post, I will guide you through best practices for Google BigQuery that help control costs and optimize performance, regardless of your level of development experience. To start working with BigQuery, you must first load your data, which means you will also use it as storage, and then query that data to build your models or create reports. In the following sections, we will look at best practices for these main actions of loading, querying and storing.
Manage storage:
Depending on the nature of your data and the amount of historical data you need to keep, you can optimize your storage to save costs. Some of the best practices are:
If you only need new or recently updated data and do not care about historical data, you can set an expiration time for tables or partitions. For example, you can set a partition expiration of 90 days, which means that any partition older than 90 days will be automatically deleted. Note that this does not apply to existing tables / partitions, only to tables / partitions created after setting the policy; to apply it to older tables / partitions, you may have to recreate them with the policy.
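As a minimal sketch, here is how a 90-day partition expiration could be set in BigQuery DDL; the table names (my_dataset.events, my_dataset.staging_events) are only illustrative:

    -- Expire partitions automatically 90 days after their partition date
    -- (table name is illustrative)
    ALTER TABLE `my_dataset.events`
    SET OPTIONS (partition_expiration_days = 90);

    -- Or expire a whole table at a fixed point in time
    ALTER TABLE `my_dataset.staging_events`
    SET OPTIONS (expiration_timestamp = TIMESTAMP '2025-01-01 00:00:00 UTC');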
BigQuery is natively optimized to reduce the cost of storing data that is no longer actively queried. By default, any table / partition in BigQuery that is not modified for at least 90 days moves to what is called long-term storage, which costs $0.01 per GB per month, 50% cheaper than the normal rate. For even older data that you still need to keep, in case you need to query it later or just for general governance purposes, you have the option to export that BigQuery data to Google Cloud Storage for an even lower cost.
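One way to archive such data from standard SQL is the EXPORT DATA statement; a minimal sketch follows, where the table name, bucket path and date filter are purely illustrative:

    -- Export old rows to Cloud Storage as Parquet files (names are illustrative)
    EXPORT DATA OPTIONS (
      uri = 'gs://my-archive-bucket/events/2018/*.parquet',
      format = 'PARQUET',
      overwrite = true
    ) AS
    SELECT *
    FROM `my_dataset.events`
    WHERE event_date < '2019-01-01';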
BigQuery works best on denormalized data:
Although Star or Snowflake schemas are easy to implement in a "lift and shift" scenario, it is worth the time and effort to adjust the data model to take advantage of BigQuery's columnar storage format, called Capacitor.
Use Nested and Repeated fields:
BigQuery supports nested records within tables. For example, in a relational database, Invoice_Headers and Invoice_Lines are stored as separate tables. In BigQuery, you can instead create a column in Invoice_Headers that contains an array of Invoice_Lines records. BigQuery collocates the inner (nested) records with their corresponding outer records while loading the data, so when retrieving it, BigQuery only performs a join between collocated records. If you select only outer table columns, your query costs and performs as if the outer table were a standalone table; only when you access fields of the inner records does BigQuery read and join them.
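A rough sketch of this pattern, assuming illustrative table and column names based on the invoice example above:

    -- Invoice lines stored as a nested, repeated field on the header
    -- (all names are illustrative)
    CREATE TABLE `my_dataset.invoice_headers` (
      invoice_id STRING,
      invoice_date DATE,
      customer_id STRING,
      invoice_lines ARRAY<STRUCT<
        line_number INT64,
        product_id STRING,
        quantity INT64,
        amount NUMERIC
      >>
    );

    -- Header-only queries never touch the nested lines;
    -- the lines are read only when you UNNEST them:
    SELECT h.invoice_id, l.product_id, l.amount
    FROM `my_dataset.invoice_headers` AS h,
         UNNEST(h.invoice_lines) AS l
    WHERE h.invoice_date >= '2019-01-01';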
Use Partitioned tables:
BigQuery allows you to partition a table by a date column or by load (ingestion) time. This improves performance and reduces the cost of a query, since BigQuery only reads the partitions the query asks for.
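For instance, a minimal sketch with an illustrative sales table partitioned on a date column:

    -- Partition the table on the sale_date column (names are illustrative)
    CREATE TABLE `my_dataset.sales`
    PARTITION BY sale_date AS
    SELECT * FROM `my_dataset.sales_staging`;

    -- This query scans only the partitions for March 2019
    SELECT store_id, SUM(amount) AS total
    FROM `my_dataset.sales`
    WHERE sale_date BETWEEN '2019-03-01' AND '2019-03-31'
    GROUP BY store_id;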
Clustered tables:
BigQuery can arrange the data in a table based on the clustered columns. For example, if you cluster Employee data by Department, a query that filters on a department lets BigQuery read only the blocks containing that department and skip the rest, eliminating unnecessary data scans. Also, when you group by the clustered columns in a query, BigQuery reduces the overhead of sorting the data by those columns.
Note: As of now, BigQuery only allows you to cluster the columns of a partitioned table; in case you don't have a date column to partition on, you can partition by a dummy date column to make use of clustering.
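A minimal sketch of a partitioned and clustered table, again with illustrative names:

    -- Partition by hire_date and cluster by department (names are illustrative)
    CREATE TABLE `my_dataset.employees`
    PARTITION BY hire_date
    CLUSTER BY department AS
    SELECT * FROM `my_dataset.employees_staging`;

    -- The filter on the clustered column lets BigQuery skip blocks
    -- that cannot contain 'Finance'
    SELECT employee_id, name
    FROM `my_dataset.employees`
    WHERE department = 'Finance';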
Write efficient queries:
Here we will discuss some of the best practices to follow while writing a BigQuery query.
- BigQuery query costs depend on the amount of data scanned, not on the amount of data retrieved.
- Control projection: the projection is the set of columns that you read when you execute your query. Always try to select only the columns you need and avoid "SELECT *" (select all) at all costs. Even adding a LIMIT clause does not reduce the cost of the query.
- If you need to explore the data before writing your query, use the table preview option, which is free.
- Always prefer partitioned tables (time or date partitions) over sharded tables (separate tables with a date suffix).
- Reduce the amount of data to be processed by filtering it before joining with other queries / tables.
- When querying wildcard tables, use the most granular prefix possible. That is, if you have table names like "TableName_YYYYMMDD" and you only need 2019 data, use "TableName_2019*" rather than "TableName_*" followed by a filter on the year (see the sketch after this list).
- When querying partitioned tables, first filter on the required partitions in the innermost select query (also shown in the sketch after this list).
- Avoid joining the same subquery several times in the same query. Instead, try to save that subquery as an intermediate table and use it. The cost of storage is much lower than the cost of repetitive queries.
- The ORDER BY clause is very expensive, so unless it is necessary, limit your use of ORDER BY as much as possible, or use it only in the outermost query to sort the final results, not in subqueries.
- Avoid JavaScript user-defined functions. If a UDF is necessary, use a native (SQL) UDF instead.
- When joining two tables, start with the larger table for better performance.
- You can query data from external sources like Google Cloud Storage, Google Sheets, etc.; these are called federated (external) tables. Even though querying these tables reduces performance, they are useful for querying fast-changing data.
- Before executing your query, check the query validator, which tells you approximately how much data will be scanned, so you can get an idea of how much the query will cost with the GCP price calculator.
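As referenced above, here is a minimal sketch of the wildcard-prefix and partition-filter tips; all table and column names are illustrative:

    -- Granular wildcard prefix: only the 2019 shards are scanned
    SELECT order_id, amount
    FROM `my_dataset.TableName_2019*`
    WHERE _TABLE_SUFFIX BETWEEN '0101' AND '0331';

    -- Partition filter pushed into the innermost subquery,
    -- so the join only sees the needed partitions
    SELECT o.order_id, c.customer_name
    FROM (
      SELECT order_id, customer_id, amount
      FROM `my_dataset.orders`
      WHERE order_date >= '2019-01-01'   -- filter on the partition column
    ) AS o
    JOIN `my_dataset.customers` AS c
      ON c.customer_id = o.customer_id;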
Well, that's it folks. Please watch this section for more exciting posts on BigQuery - Analytics Data Warehouse.
Pushkar Reddipalli is a Software Engineer at KPI Partners with around 10 years of experience in data warehousing and business intelligence. He has implemented many BI solutions on Oracle BI Applications and Google Cloud Platform. He has worked extensively with ETL tools like Informatica and ODI and reporting tools like OBIEE and Looker, and is proficient in Python and Spark.