HiveQL :
Even though based on SQL, HiveQL does not strictly follow the full SQL-92 standard.
HiveQL offers extensions not in SQL, including multitable inserts and create table as select.
HiveQL lacked support for transactions and materialized views and only limited subquery support.
Support for insert, update, and delete with full ACID functionality was made available with release 0.14.
Internally, a compiler translates HiveQL statements into a directed acyclic graph of MapReduce Tez, or Spark jobs, which are submitted to Hadoop for execution.
Example :
DROP TABLE IF EXISTS docs; CREATE TABLE docs (line STRING); |
Checks if table docs exist and drop it if it does. Creates a new table called docs with a single column of type STRING called line.
LOAD DATA INPATH 'input_file' OVERWRITE INTO TABLE docs; |
Loads the specified file or directory (In this case “input_file”) into the table.
OVERWRITE specifies that the target table to which the data is being loaded is to be re-written; Otherwise, the data would be appended.
CREATE TABLE word_counts AS SELECT word, count(1) AS count FROM (SELECT explode(split(line, '\s')) AS word FROM docs) temp GROUP BY word ORDER BY word; |
The query CREATE TABLE word_counts AS SELECT word, count(1) AS count creates a table called word_counts with two columns: word and count.
This query draws its input from the inner query (SELECT explode(split(line, '\s')) AS word FROM docs) temp".
This query serves to split the input words into different rows of a temporary table aliased as temp.
The GROUP BY WORD groups the results based on their keys.
This results in the count column holding the number of occurrences for each word of the word column.
The ORDER BY WORDS sorts the words alphabetically.
Tables :
Here are the types of tables in Apache Hive:
Managed Tables :
In a managed table, both the table data and the table schema are managed by Hive.
The data will be located in a folder named after the table within the Hive data warehouse, which is essentially just a file location in HDFS.
By managed or controlled we mean that if you drop (delete) a managed table, then Hive will delete both the Schema (the description of the table) and the data files associated with the table.
Default location is /user/hive/warehouse.
The syntax for Managed Tables :
CREATE TABLE IF NOT EXISTS stocks (exchange STRING, symbol STRING, price_open FLOAT, price_high FLOAT, price_low FLOAT, price_adj_close FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ; |
External Tables :
An external table is one where only the table schema is controlled by Hive.
In most cases, the user will set up the folder location within HDFS and copy the data file(s) there.
This location is included as part of the table definition statement.
When an external table is deleted, Hive will only delete the schema associated with the table.
The data files are not affected.
Syntax for External Tables :
CREATE EXTERNAL TABLE IF NOT EXISTS stocks (exchange STRING, symbol STRING, price_open FLOAT, price_high FLOAT, price_low FLOAT, price_adj_close FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/data/stocks'; |
Querying Data :
A query is a request for data or information from a database table or a combination of tables.
This data may be generated as results returned by Structured Query Language (SQL) or as pictorials, graphs or complex results, e.g., trend analyses from data-mining tools.
One of several different query languages may be used to perform a range of simple to complex database queries.
SQL, the most well-known and widely-used query language, is familiar to most database administrators (DBAs)
User-Defined Functions :
In Hive, the users can define their own functions to meet certain client requirements.
These are known as UDFs in Hive.
User-Defined Functions written in Java for specific modules.
Some of UDFs are specifically designed for the reusability of code in application frameworks.
The developer will develop these functions in Java and integrate those UDFs with the Hive.
During the Query execution, the developer can directly use the code, and UDFs will return outputs according to the user-defined tasks.
It will provide high performance in terms of coding and execution.
The general type of UDF will accept a single input value and produce a single output value.
We can use two different interfaces for writing Apache Hive User-Defined Functions :
1. Simple API
2. Complex API
Sorting And Aggregating :
Sorting data in Hive can be achieved by use of a standard ORDER BY clause, but there is a catch.
ORDER BY produces a result that is totally sorted, as expected, but to do so it sets the number of reducers to one, making it very inefficient for large datasets.
When a globally sorted result is not required and in many cases it isn’t, then you can use Hive’s nonstandard extension, SORT BY instead.
SORT BY produces a sorted file per reducer.
If you want to control which reducer a particular row goes to, typically so you can perform some subsequent aggregation.
This is what Hive’s DISTRIBUTE BY clause does.
Example :
· To sort the weather dataset by year and temperature, in such a way to ensure that all the rows for a given year end up in the same reducer partition :
Hive> FROM records2 > SELECT year, temperature > DISTRIBUTE BY year > SORT BY year ASC, temperature DESC; |
· Output :
1949 111
1949 78
1950 22
1950 0
1950 -11
0 Comments