Installation and Configuration

1. Download 2. Extract 3. Rename 4. Add environment variables

vi /etc/profile
export HIVE_HOME=/export/servers/hive-2.3.8
export PATH=$PATH:$HIVE_HOME/bin
source /etc/profile

Edit the configuration file
cp hive-env.sh.template hive-env.sh

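Hive Metastore Configuration

Replacing the bundled Derby database with MySQL allows multiple users to connect at the same time.

Reference article: https://my.oschina.net/u/4292373/blog/3497563

Create hive-site.xml

Pitfall
The header of hive-default.xml.template states: WARNING!!! Any changes you make to this file will be ignored by Hive.
hive-site.xml is the user-defined configuration file. Hive reads it at startup and applies its values over the built-in defaults; hive-default.xml.template only documents those defaults.
If you create hive-site.xml by copying the template, keep only the properties you set yourself and delete everything else.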

Original link: https://blog.csdn.net/qq_43506520/article/details/83346463

cp hive-default.xml.template hive-site.xml
vi hive-site.xml
# Keep only the following configuration in hive-site.xml

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <!-- JDBC connection URL 01 -->
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://hadoop01:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false</value>
    </property>
    <!-- JDBC connection driver 02 -->
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <!-- JDBC connection username 03 -->
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>
    <!-- JDBC connection password 04 -->
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>root123456</value>
    </property>
    <!-- Metastore access URI -->
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://hadoop01:9083</value>
    </property>
    <!-- Hive's default warehouse directory on HDFS -->
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
    </property>
    <property>
        <name>spark.sql.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
    </property>
    <property>
        <name>hive.exec.scratchdir</name>
        <value>/tmp</value>
    </property>
    <property>
        <name>metastore.catalog.default</name>
        <value>hive</value>
    </property>

    <!-- Remote access to Hive through Beeline -->
    <property>
        <name>hive.server2.thrift.bind.host</name>
        <value>0.0.0.0</value>
    </property>
    <property>
        <name>hive.server2.thrift.port</name>
        <value>10000</value>
    </property>
    <property>
        <name>hive.server2.thrift.client.user</name>
        <value>root</value>
    </property>
    <property>
        <name>hive.server2.thrift.client.password</name>
        <value>root123</value>
    </property>
    <!-- Verify that the schema version stored in the metastore matches the version in the Hive jars -->
    <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
    </property>
    <!-- High-availability cluster settings -->
    <property>
        <name>hive.server2.support.dynamic.service.discovery</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.server2.zookeeper.namespace</name>
        <value>hiveserver2_zk</value>
    </property>
    <property>
        <name>hive.zookeeper.quorum</name>
        <value>hadoop01:2181,hadoop02:2181,hadoop03:2181</value>
    </property>
    <property>
        <name>hive.zookeeper.client.port</name>
        <value>2181</value>
    </property>
    <!-- Metastore notification API authorization -->
    <property>
        <name>hive.metastore.event.db.notification.api.auth</name>
        <value>false</value>
    </property>
    <property>
        <name>mapreduce.jobtracker.address</name>
        <value>ignorethis</value>
    </property>
    <!-- Debug logging -->
    <property>
        <name>hive.exec.show.job.failure.debug.info</name>
        <value>false</value>
    </property>
    <!-- Allow single-row inserts -->
    <property>
        <name>hive.support.concurrency</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.enforce.bucketing</name>
        <value>true</value>
    </property>
    <!-- Row-level updates -->
    <property>
        <name>hive.txn.manager</name>
        <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
    </property>
    <property>
        <name>hive.compactor.initiator.on</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.compactor.worker.threads</name>
        <value>1</value>
    </property>
    <!-- Dynamic partitioning -->
    <property>
        <name>hive.exec.dynamic.partition</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.exec.dynamic.partition.mode</name>
        <value>nonstrict</value>
    </property>
    <property>
        <name>hive.exec.max.created.files</name>
        <value>100000</value>
    </property>
</configuration>

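JDBC Driver

Copy the MySQL JDBC driver jar into Hive's lib directory.

Download: https://dev.mysql.com/downloads/connector/j

Driver jar name: mysql-connector-java-5.1.48-bin.jar

Initialize the database

schematool -dbType mysql -initSchema

If initialization fails, see the following for common error types and fixes: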

https://blog.csdn.net/lsr40/article/details/78026125

https://blog.csdn.net/brotherdong90/article/details/49661731/

Start the metastore

nohup hive --service metastore &  # start the metastore service (default port 9083)

Start hiveserver2

nohup hiveserver2 &

<!-- User name and password for remote access to Hive -->
<property>
    <name>hive.server2.thrift.client.user</name>
    <value>root</value>
</property>
<property>
    <name>hive.server2.thrift.client.password</name>
    <value>root123</value>
</property>

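Startup error: temporary directory location undefined

For a solution, see the following article:
https://www.cnblogs.com/qxyy/articles/5247933.html

Viewing the Hive runtime log

The log location is configured in ./conf/hive-log4j2.properties; the default is /tmp/<user>/hive.log, where <user> is the user running Hive.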

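Data Definition

Databases

-- Create
create database if not exists emp;
-- List
show databases;
-- Describe
describe database emp;
-- Use
use emp;
-- Alter (the property name and value here are placeholders)
alter database emp set dbproperties ('key'='value');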

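Tables

Create a managed (internal) table

create table employee(eid int,ename string,egender tinyint,esalary float);

Create an external table

create external table emp(eid int,ename string,egender tinyint,esalary float);

Converting between managed and external tables

-- 'TRUE' makes the table external; 'FALSE' makes it managed again
alter table table_name set tblproperties('EXTERNAL'='TRUE');

Delimiters

Field delimiter, line delimiter, array (collection item) delimiter, map key delimiter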

create table sales_info_new(
sku_id string comment 'product id',
sku_name string comment 'product name',
state_map map<string,string> comment 'product status information',
id_array array<string> comment 'list of ids related to the product'
)
partitioned by(
dt string comment 'year-month-day'
)
row format delimited
  fields terminated by '|'
  collection items terminated by ','
  map keys terminated by ':';

Sample data for import

123|华为Mate10|id:1111,token:2222,user_name:zhangsan1|1235,345
456|华为Mate30|id:1113,token:2224,user_name:zhangsan3|89,635
789|小米5|id:1114,token:2225,user_name:zhangsan4|452,63
1235|小米6|id:1115,token:2226,user_name:zhangsan5|785,36
4562|OPPO Findx|id:1116,token:2227,user_name:zhangsan6|7875,3563

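Note on the text file: do not wrap map or array values in double quotes or curly braces. Hive adds them itself when it displays the values; if you put them in the data, they are read as part of the values and the results come out wrong.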
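To make the role of the three delimiters concrete, here is a minimal Python sketch that splits one line in the same way the table definition above describes. It only mimics the splitting and is not Hive's actual SerDe; the function name `parse_row` and the shortened product name are made up for this illustration.

```python
def parse_row(line):
    """Split one text line using the delimiters declared in the table above.

    '|' separates fields, ',' separates collection items,
    ':' separates a map key from its value.
    """
    sku_id, sku_name, state_map_raw, id_array_raw = line.split("|")
    # Each map entry is "key:value"; entries are separated by ','.
    state_map = dict(item.split(":", 1) for item in state_map_raw.split(","))
    # Array items are separated by ',' as well.
    id_array = id_array_raw.split(",")
    return sku_id, sku_name, state_map, id_array


row = parse_row("123|Mate10|id:1111,token:2222,user_name:zhangsan1|1235,345")
print(row)
```

The raw line carries no quotes or braces; the map and array structure comes entirely from the delimiters.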

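Partitioning and Bucketing

Partitioning

Why partition:

Partitioning lets Hive avoid full table scans and so improves query efficiency.

How to partition:

The table's data is stored in multiple subdirectories, so a query can restrict itself to the relevant ones through its conditions. Each subdirectory is named after the value of the partition column, e.g. year=2018.

Things to note about partitions:

PARTITIONED BY (colName dataType)
Hive partitions on a column that is not among the table's data columns, whereas MySQL partitions on a column inside the table.

1. Partition names (the key=value directory names) are case-sensitive in their values and should not contain Chinese characters.

2. A Hive partition is really a directory created under the table directory; the partition column is a pseudo-column that is not actually stored in the data files.

3. A table can have one or more partitions, and a partition can in turn contain one or more sub-partitions.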
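The directory layout described above can be sketched in a few lines of Python. This is a simplified model, assuming the warehouse directory configured earlier and ignoring the escaping Hive applies to special characters; `partition_path` is a hypothetical helper, not a Hive API.

```python
def partition_path(table_dir, partition_spec):
    """Build a partition's directory: one key=value level per partition column."""
    levels = [f"{key}={value}" for key, value in partition_spec]
    return "/".join([table_dir.rstrip("/")] + levels)


# A two-level partition nests one directory inside the other.
print(partition_path("/user/hive/warehouse/part2", [("year", "2018"), ("month", "09")]))
# -> /user/hive/warehouse/part2/year=2018/month=09
```

A query that filters on year and month only needs to read the files under the matching directory, which is why partitioning avoids a full table scan.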

Loading data into a partition

-- the target partition must be specified
load data local inpath '/usr/local/xxx' into table part1 partition(country='China');

Changing a partition's storage location (the HDFS path must be a full path):

alter table part1 partition(country='Vietnam') set location 'hdfs://hadoop01:9000/user/hive/warehouse/brz.db/part1/country=Vietnam';

Two-level partitions

create table if not exists part2(
uid int,
uname string,
uage int
)
PARTITIONED BY (year string,month string)
row format delimited 
fields terminated by ',';
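-- Load into a two-level partition
load data local inpath '/usr/local/xxx' into table part2 partition(year='2018',month='09');

Adding partitions

alter table part1 add partition(country='india') partition(country='korea') partition(country='America');

Dynamic partitions

-- Dynamic partition settings:
-- enable dynamic partitioning (true/false)
set hive.exec.dynamic.partition=true;
-- strict requires at least one static partition value; nonstrict lets every partition column be dynamic, as in the insert below
set hive.exec.dynamic.partition.mode=nonstrict;
-- maximum number of dynamic partitions in total
set hive.exec.max.dynamic.partitions=1000;
-- maximum number of dynamic partitions per node
set hive.exec.max.dynamic.partitions.pernode=100;
-- Create a table to be loaded with dynamic partitions
create table if not exists dt_part1(
uid int,
uname string,
uage int
)
PARTITIONED BY (year string,month string)
row format delimited 
fields terminated by ','
;
-- Load data (dynamic partitions are loaded with insert into ... select)
insert into dt_part1 partition(year,month) select * from part_tmp;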

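Bucketing

Bucketing subdivides the data inside a partition; unlike partitioning, it uses a column that is part of the table's data.

Syntax
CREATE TABLE test
(<col_name> <data_type> [, <col_name> <data_type> ...])
[PARTITIONED BY ...]
CLUSTERED BY (<col_name>)
[SORTED BY (<col_name> [ASC|DESC] [, <col_name> [ASC|DESC]...])]
INTO <num_buckets> BUCKETS

CLUSTERED BY (<col_name>): the column to bucket on

SORTED BY (<col_name> [ASC|DESC]): sorts the data inside each bucket

INTO <num_buckets> BUCKETS: the number of buckets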
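As a rough illustration of how a row ends up in a bucket, the sketch below hashes the bucketing column and takes the remainder modulo the number of buckets. It is a simplified model, not Hive's code: `bucket_for` is a hypothetical name, and the assumption that an INT column hashes to its own value reflects Hive's classic behaviour as far as I know and may differ between versions.

```python
def bucket_for(hash_code, num_buckets):
    """Return the bucket index for a row, given the hash of its bucketing column."""
    # Mask to a non-negative 31-bit value, then take the remainder.
    return (hash_code & 0x7FFFFFFF) % num_buckets


# Assumption: for an INT column the hash is the integer value itself.
for uid in [1, 2, 3, 4, 5, 6]:
    print(uid, "-> bucket", bucket_for(uid, 4))
```

Rows with the same value in the bucketing column always land in the same bucket, which is what makes bucketed sampling and bucket joins possible.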

##Changing column information
Rename a column

alter table emp change eid id string;

Add a column

Data Manipulation

-- Load data
LOAD DATA [LOCAL] INPATH 'path' [OVERWRITE] INTO TABLE emp;
-- Insert data
INSERT INTO TABLE emp PARTITION (year=2021, month=10) SELECT id, name FROM ept;
-- Export data
-- To HDFS
EXPORT TABLE ept TO '/hom/emp';
-- Export with INSERT
INSERT OVERWRITE LOCAL DIRECTORY 'path' SELECT * FROM emp;
# To the local file system
hdfs dfs -get <hdfs_path> <localpath>
# Export from the Hive shell
hive -e 'select * from emp;' > localpath
-- Import data
IMPORT TABLE emp FROM 'path';
#HQL Queries
CASE WHEN
SELECT name, salary,
  CASE
    WHEN salary < 5000 THEN 'low'
    WHEN salary >= 5000 AND salary < 7000 THEN 'middle'
    WHEN salary >= 7000 AND salary < 10000 THEN 'high'
    ELSE 'very high'
  END AS bracket
FROM emp;
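The intended logic of the CASE WHEN query above is that the conditions are checked in order and the first match wins. The Python sketch below mirrors that for a single salary value; `bracket` is a made-up function name, and the thresholds are the ones from the query.

```python
def bracket(salary):
    """Return the salary bracket, checking conditions in order like CASE WHEN."""
    if salary is None:
        # In SQL a comparison with NULL is never true, so the row reaches ELSE.
        return "very high"
    if salary < 5000:
        return "low"
    if salary < 7000:      # salary >= 5000 is already known here
        return "middle"
    if salary < 10000:     # salary >= 7000 is already known here
        return "high"
    return "very high"


for s in [4999, 5000, 7000, 10000]:
    print(s, bracket(s))
```

Note that a NULL salary falls through to the ELSE branch, so a query that should treat NULL separately needs an explicit `WHEN salary IS NULL` branch.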

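##LIKE and RLIKE
The LIKE operator performs fuzzy matching: the wildcard "%" stands for zero or more characters and "_" stands for exactly one character.

The RLIKE clause is a Hive extension of this feature: it lets you specify the match condition with Java regular expressions, a more powerful pattern language.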
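The difference between the two operators can be sketched in Python. This is an approximation under stated assumptions: Python's `re` module stands in for Java regular expressions (the two dialects agree on the simple patterns used here), escape characters in LIKE patterns are ignored, and the helper names `like` and `rlike` are made up.

```python
import re


def like(value, pattern):
    """SQL LIKE: '%' matches any run of characters, '_' exactly one.

    The whole value must match the pattern.
    """
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


def rlike(value, regex):
    """RLIKE: true if any substring of the value matches the regular expression."""
    return re.search(regex, value) is not None


print(like("zhangsan1", "zhang%"))      # prefix match with a wildcard
print(like("zhangsan1", "zhang"))       # no wildcard, so the whole value must equal the pattern
print(rlike("zhangsan1", "san[0-9]"))   # a substring match is enough for RLIKE
```

The practical consequence is that LIKE needs explicit "%" wildcards to match part of a value, while RLIKE needs explicit anchors ("^", "$") to match the whole value.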

GROUP BY

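The GROUP BY statement is usually used together with aggregate functions: it groups the results by one or more columns and then applies the aggregate to each group.

HAVING

The HAVING clause was added to SQL because the WHERE keyword cannot be used with aggregate functions.

How HAVING differs from WHERE

(1) WHERE filters the rows of the table; HAVING filters the results of the grouped aggregation.

(2) Aggregate functions cannot appear after WHERE, but they can appear after HAVING.

(3) HAVING is only used in GROUP BY aggregation statements.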

SELECT Customer,SUM(OrderPrice) FROM Orders
GROUP BY Customer
HAVING SUM(OrderPrice)<2000;
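To show where WHERE and HAVING act in the pipeline, here is a small Python model of the query above. The order rows are invented sample data and `sum_by_customer` is a hypothetical helper; the point is only the order of the two filters.

```python
from collections import defaultdict

# Invented sample rows: (Customer, OrderPrice)
orders = [
    ("Bush", 1000), ("Carter", 1600), ("Bush", 700),
    ("Bush", 300), ("Adams", 2000), ("Carter", 100),
]


def sum_by_customer(rows, where=lambda row: True, having=lambda total: True):
    """Model SELECT Customer, SUM(OrderPrice) ... GROUP BY Customer."""
    totals = defaultdict(int)
    for customer, price in rows:
        if where((customer, price)):   # WHERE filters individual rows before grouping
            totals[customer] += price
    # HAVING filters the per-group totals after aggregation
    return {c: t for c, t in totals.items() if having(t)}


print(sum_by_customer(orders, having=lambda total: total < 2000))
```

With this data Bush and Adams both total 2000 and are dropped by the HAVING condition, although several of their individual orders are below 2000; a WHERE condition on the price would have filtered rows instead of totals.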

Join

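Inner join

In an inner join (INNER JOIN), a record is returned only when both joined tables contain data matching the join condition.

SELECT a.empno,a.ename,b.dname FROM emp a JOIN dept b ON a.deptno=b.deptno;

Left join

In a left outer join (LEFT OUTER JOIN), every record from the table on the left of the JOIN operator that satisfies the WHERE clause appears in the result. When the right-hand table has no record matching the join condition after ON, the columns selected from the right-hand table are NULL.

SELECT a.empno,a.ename,b.dname FROM emp a LEFT OUTER JOIN dept b ON a.deptno=b.deptno;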
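The two join types can be compared with a small Python model. The rows are invented, `inner_join` and `left_join` are made-up names, and the dictionary lookup assumes each deptno appears at most once in dept.

```python
# Invented sample rows: (empno, ename, deptno)
emp = [(7369, "SMITH", 20), (7499, "ALLEN", 30), (7934, "MILLER", 50)]
# deptno -> dname; department 50 is deliberately missing
dept = {20: "RESEARCH", 30: "SALES", 40: "OPERATIONS"}


def inner_join(emp_rows, dept_by_no):
    """Keep only employees whose deptno has a matching department."""
    return [(no, name, dept_by_no[d]) for no, name, d in emp_rows if d in dept_by_no]


def left_join(emp_rows, dept_by_no):
    """Keep every employee; a missing department becomes None (NULL)."""
    return [(no, name, dept_by_no.get(d)) for no, name, d in emp_rows]


print(inner_join(emp, dept))
print(left_join(emp, dept))
```

MILLER disappears from the inner join because department 50 does not exist, but stays in the left join with a NULL department name.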

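Full join

Joining multiple tables

Joining n tables requires at least n-1 join conditions. For example, joining three tables requires at least two join conditions.

SELECT a.ename,b.dname,c.zip FROM emp a JOIN dept b ON a.deptno=b.deptno JOIN location c ON b.loc=c.loc;

Note: why aren't tables b and c joined first? Because Hive always executes joins from left to right.

Sorting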

###ORDER BY
ORDER BY sorts the query result globally, which means all of the data passes through a single reducer.

SORT BY

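Hive adds an alternative, SORT BY, which sorts the data only within each reducer, that is, it performs a local sort. This guarantees that each reducer's output is ordered (but not that the overall output is globally ordered).
The difference between ORDER BY and SORT BY shows when there is more than one reducer: the two then produce different output, because SORT BY only sorts locally within each reducer.
###DISTRIBUTE BY and SORT BY
If we want to sort the employees within the same department, we can use DISTRIBUTE BY to make sure employees with the same department number go to the same reducer, and then use SORT BY to order the data as required.

SELECT * FROM emp DISTRIBUTE BY deptno SORT BY empno DESC;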

###CLUSTER BY
When the DISTRIBUTE BY and SORT BY columns are the same, CLUSTER BY can be used instead. CLUSTER BY combines the functions of DISTRIBUTE BY and SORT BY. However, the sort is always ascending; you cannot specify ASC or DESC.

select * from emp cluster by deptno;
select * from emp distribute by deptno sort by deptno;
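A small Python model may help to picture what DISTRIBUTE BY and SORT BY do together. It is only a sketch: the rows are invented, `distribute_and_sort` is a made-up name, and the plain modulo is a stand-in for Hive's hash partitioning.

```python
def distribute_and_sort(rows, num_reducers):
    """Model: DISTRIBUTE BY deptno SORT BY empno DESC over (empno, deptno) rows."""
    reducers = [[] for _ in range(num_reducers)]
    for empno, deptno in rows:
        # DISTRIBUTE BY: rows with the same deptno always go to the same reducer.
        reducers[deptno % num_reducers].append((empno, deptno))
    # SORT BY: each reducer sorts only its own rows.
    return [sorted(part, key=lambda r: r[0], reverse=True) for part in reducers]


rows = [(7369, 20), (7499, 30), (7566, 20), (7654, 30), (7782, 10)]
for i, part in enumerate(distribute_and_sort(rows, 3)):
    print("reducer", i, part)
```

Each reducer's output is ordered by empno, but concatenating the outputs does not give a globally ordered result; that is the guarantee only ORDER BY provides.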

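Type Conversion

Hive implicitly converts between numeric types where appropriate; when an explicit conversion is needed, use the cast keyword.

The syntax of the explicit type conversion function is: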

cast(value AS TYPE)
ALTER TABLE employees CHANGE COLUMN salary salary STRING; 
SELECT name,salary FROM employees WHERE cast(salary AS FLOAT) < 100000.0;

Assigning values to NULL fields

NVL assigns a value to data that is NULL. Its format is NVL(string1, replace_with): if string1 is NULL, NVL returns replace_with; otherwise it returns string1. If both arguments are NULL, it returns NULL.

hive> select nvl(comm,-1) from emp;
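The behaviour of NVL described above fits in one line of Python, with None standing in for NULL. This is an illustrative equivalent, not Hive code.

```python
def nvl(value, replace_with):
    """Return replace_with when value is None (NULL), otherwise value."""
    return replace_with if value is None else value


print(nvl(None, -1))   # NULL commission is replaced
print(nvl(300, -1))    # a real value passes through unchanged
```

Only NULL triggers the replacement; a value such as 0 or an empty string is returned as it is.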

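Merging Small Files in Hive

When a Hive table's data consists of many small files, they should be merged into larger files.

Steps

Create a temporary table

create table test like table1;

Set the following in the current session

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=100000;
-- about 128 MB
set hive.merge.smallfiles.avgsize=128000000;
set hive.merge.size.per.task=128000000;

Merge the source table's data into the temporary table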