BigQuery is a fully managed enterprise data warehouse that helps you manage and analyze your data with built-in features like machine learning, geospatial analysis, and business intelligence. Its serverless architecture means zero infrastructure management; data is stored in a columnar format, and queries can scan terabytes in seconds and petabytes in minutes.
Traditional relational databases, like Db2, Oracle, and MySQL, store data in record-oriented (row-based) storage. This makes them great for transactional updates and OLTP (Online Transaction Processing) use cases, because a read or write only needs to touch a single row. BigQuery uses columnar storage, where each column is stored in a separate file block, which makes it an ideal solution for OLAP (Online Analytical Processing) use cases.
Things to know before starting
BigQuery Concepts
BigQuery Objects
NOTE: There is a page dedicated to Limits and Quotas. Here are a few I need to remember.
`bq cp` is the command to copy tables within the same project or to a different project at no cost.
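A minimal sketch of the copy syntax; the dataset and table names here are hypothetical placeholders. A table in another project is addressed with a `project:` prefix.

```sh
# Copy a table within the same project (dataset.table syntax):
bq cp mydataset.mytable mydataset.mytable_backup

# Copy a table to a different project:
bq cp mydataset.mytable otherproject:otherdataset.mytable
```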
BigQuery is orchestrated via Borg, Google’s precursor to Kubernetes.
Everything is a job in BigQuery: when you query, create, load, export, or copy, a batch job is created and submitted in the background to produce your result. The application may present this as interactive, but it actually isn't. The only operations not managed by a job are listing resources and getting metadata.
BigQuery determines the location to run the job based on the datasets referenced in the request, since the job itself must be run in the same region where the data is stored.
A slot is a unit of computational capacity. It's basically a worker, made up of CPU, RAM, and network. Since BigQuery likes to divide and conquer work, running parts of each task in parallel, more slots usually means that the query will run faster. The number of slots available to you depends on your pricing model: on-demand or flat-rate.
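One way to see slot consumption in practice is to query the jobs metadata view; this is a sketch, with the `region-us` qualifier as an assumption about where your jobs run.

```sql
-- Approximate average slots used per query job over the last day:
-- total slot-milliseconds divided by the job's wall-clock milliseconds.
SELECT
  job_id,
  total_slot_ms / TIMESTAMP_DIFF(end_time, start_time, MILLISECOND) AS avg_slots
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = 'QUERY';
```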
Partitioning - The table is divided into multiple segments called partitions. Queries that contain filters on the partitioning column can dramatically reduce the overall data scanned. BigQuery supports 3 types of partitioning: ingestion time, time-unit column, and integer range.
Partitioning works best on a column with low cardinality, and it is recommended that each partition be at least 1 GB.
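As a sketch, a time-unit column partitioned table looks like this; the project, dataset, and column names are illustrative.

```sql
-- Hypothetical table partitioned by a DATE column.
-- A filter on trade_date lets BigQuery prune whole partitions.
CREATE TABLE `my_project.my_dataset.trades`
(
  trade_date DATE,
  symbol STRING,
  price NUMERIC
)
PARTITION BY trade_date;
```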
Clustering - Data is automatically sorted based on the contents of one or more columns that you specify (up to 4). High-cardinality, non-temporal columns are usually preferred for clustering. BigQuery performs free, automatic re-clustering in the background.
BigQuery sorts the data in a clustered table based on the values in the clustering columns and organizes them into blocks.
When you run a query against a clustered table, and the query includes a filter on the clustered columns, BigQuery uses the filter expression and the block metadata to prune the blocks scanned by the query. This allows BigQuery to scan only the relevant blocks.
When a block is pruned, it is not scanned. Only the scanned blocks are used to calculate the bytes of data processed by the query. The number of bytes processed by a query against a clustered table equals the sum of the bytes read in each column referenced by the query in the scanned blocks.
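Partitioning and clustering are often combined; this is a sketch with illustrative names.

```sql
-- Hypothetical table partitioned by day and clustered on two
-- high-cardinality columns; filters on customer_id or event_type
-- let BigQuery prune blocks within each partition.
CREATE TABLE `my_project.my_dataset.events`
(
  event_ts TIMESTAMP,
  customer_id STRING,
  event_type STRING
)
PARTITION BY DATE(event_ts)
CLUSTER BY customer_id, event_type;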
This matters: you are charged for the bytes read in every column your query references, not just the rows it returns, and for storage as well.
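Because storage is columnar, selecting only the columns you need directly reduces the bytes billed; the table name below is an illustrative placeholder.

```sql
-- Scans (and bills for) only the symbol column:
SELECT symbol FROM `my_project.my_dataset.trades`;

-- Scans every column in the table, regardless of how many rows return:
SELECT * FROM `my_project.my_dataset.trades` LIMIT 10;
```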
Data type | Size |
---|---|
INT64/INTEGER | 8 bytes (-9,223,372,036,854,775,808 to 9,223,372,036,854,775,807) |
FLOAT64/FLOAT | 8 bytes |
NUMERIC | 16 bytes (Precision: 38, Scale: 9) |
BIGNUMERIC | 32 bytes |
BOOL/BOOLEAN | 1 byte |
STRING | 2 bytes + the UTF-8 encoded string size |
BYTES | 2 bytes + the number of bytes in the value |
DATE | 8 bytes (0001-01-01) |
DATETIME | 8 bytes (0001-01-01 00:00:00) |
TIME | 8 bytes (00:00:00 to 23:59:59.999999) |
TIMESTAMP | 8 bytes (9999-12-31 23:59:59.999999 UTC, eg:- 2014-09-27 12:30:00.45-8:00) |
INTERVAL | 16 bytes |
STRUCT/RECORD | 0 bytes + the size of the contained fields |
GEOGRAPHY | 16 bytes + 24 bytes * the number of vertices in the geography type. To verify the number of vertices, use the ST_NumPoints function. |
Null values for any data type are calculated as 0 bytes.
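The size formulas above can be sketched in a few lines; this is an illustrative calculation using the documented per-type constants, with helper names of my own.

```python
# Sketch of BigQuery's documented logical size formulas for a few types.
# Helper names are mine; the constants come from the table above.

def string_size(s: str) -> int:
    """STRING: 2 bytes plus the UTF-8 encoded string size."""
    return 2 + len(s.encode("utf-8"))

def bytes_size(b: bytes) -> int:
    """BYTES: 2 bytes plus the number of bytes in the value."""
    return 2 + len(b)

# A row of (INT64, STRING 'café', BOOL):
# INT64 is 8 bytes, 'café' is 2 + 5 UTF-8 bytes, BOOL is 1 byte.
row_size = 8 + string_size("café") + 1
print(row_size)  # 8 + 7 + 1 = 16
```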
Primarily, there are two types of tables.
Temporary tables
They can only be created and referred to within a script. These temporary tables exist at the session level and are created in a special dataset. They are automatically deleted after 24 hours. Within the session, you don't need the 3-part qualifier (project and dataset) to refer to a TEMP table, but once the session completes you cannot refer to it by the same name, because a random name is assigned to it.
```sql
BEGIN

CREATE TEMP TABLE test_temp AS
SELECT *
FROM `btd_in3.calender`
ORDER BY ts DESC
LIMIT 5;

END;
```
Some important things to know about temporary tables:
Query results are not cached.
If you have executed a query in the BQ console, you can find which temporary table it used with this technique:
At the bottom of the screen, under QUERY HISTORY, find the query you executed and click it.
Click the Destination table link.
It will show you the table; you can look at the SCHEMA, DETAILS, and PREVIEW tabs, and you can query it if you want.
If you need a temporary table that can be referred to later by name, or that must persist for more than 24 hours, create a permanent table with an expiration.
```sql
CREATE TABLE `test_public_data.expiry_table`
OPTIONS(
  expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 2 DAY)
) AS
SELECT corpus, COUNT(*) c
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY corpus
```
External tables -- Data can be queried from external data sources like Cloud Storage, Bigtable, Google Drive, and Cloud SQL.
This example uses the schema auto-detection feature; the option to specify the schema explicitly is also available.
```sql
CREATE EXTERNAL TABLE `<<project-name>>.test_public_data.test_external_gcs`
OPTIONS (
  format = 'CSV',
  uris = ['gs://<<bucket-name>>/bse/bhavcopy_csv/*.CSV']
);
```
You can query it just like any other BQ table:
```sql
SELECT *
FROM `<<project-name>>.test_public_data.test_external_gcs`
LIMIT 10;
```
Note: the data source and the BQ dataset have to be in the same location.