How to read all data of a column family using Scala

I am a beginner with Scala and HBase. My goal is to read all the data in a specific column family from HBase, in order to do some data preparation for later machine learning use.

To get there, since as I said I am a beginner, my first step was simply to read data from HBase. I managed to read the value of a single given row, column family and column. Here is my code, in case it helps anyone.

My second step now is to read all the data of a given column family. I have tried a few examples, but none of them works.

// Code to retrieve data from a given row, CF and column; this code works
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.client._
import org.apache.hadoop.hbase.{HBaseConfiguration,TableName}
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
val conf = HBaseConfiguration.create();
conf.set("hbase.zookeeper.quorum","hadoop-master")
//conf.set("hbase.zookeeper.property.clientPort","16010")
val connection = ConnectionFactory.createConnection(conf)
val table =connection.getTable(TableName.valueOf("mimic3"))
val row = Bytes.toBytes("10150") // "10150" is the row key
val cf = Bytes.toBytes("sepsiscategories")
val c = Bytes.toBytes("intime")
val query = new Get(row)
val res = table.get(query)
res.getValue(cf, c)
Bytes.toString(res.getValue(cf, c))
// this query works well; now I will try to retrieve all records from HBase and put them in an RDD
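
(Before the RDD attempt, here is also what I understand a plain client-side version would look like: a Scan restricted with addFamily instead of a Get, so that every cell of the sepsiscategories family comes back. This is only an untested sketch on my side, reusing the conf, connection, table and Bytes import defined above, so please correct me if it is wrong.)

import org.apache.hadoop.hbase.CellUtil
import org.apache.hadoop.hbase.client.Scan
import scala.collection.JavaConverters._

val scan = new Scan()
scan.addFamily(Bytes.toBytes("sepsiscategories")) // whole family, no qualifier
val scanner = table.getScanner(scan)
try {
  for (result <- scanner.asScala) {
    val rowKey = Bytes.toString(result.getRow)
    // rawCells() returns every Cell of the requested family for this row
    for (cell <- result.rawCells()) {
      val qualifier = Bytes.toString(CellUtil.cloneQualifier(cell))
      val value = Bytes.toString(CellUtil.cloneValue(cell))
      println(s"$rowKey $qualifier = $value")
    }
  }
} finally {
  scanner.close()
}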

import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.SparkSession
import org.apache.hadoop.fs.{FileSystem,FSDataInputStream,Path}
import java.net.URI
import java.io.File
import java.util.Properties
import java.sql.DriverManager
import org.apache.spark.sql.{Row,SaveMode}
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.conf.Configuration._
import spark.implicits._
import spark.sql

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

val mimic_table_conf = HBaseConfiguration.create();
mimic_table_conf.set(TableInputFormat.INPUT_TABLE,"mimic3")
mimic_table_conf.set("hbase.zookeeper.quorum","hadoop-master")
mimic_table_conf.set("hbase.zookeeper.property.clientPort","16010")
val mimic_PatternsFromHbase = spark.sparkContext.newAPIHadoopRDD(mimic_table_conf,classOf[TableInputFormat],classOf[ImmutableBytesWritable],classOf[Result])

val sepsiscategories = mimic_PatternsFromHbase.mapPartitions(f => f.map(row1 => (Bytes.toString(row1._2.getRow), Bytes.toString(row1._2.getValue(Bytes.toBytes("sepsiscategories"), Bytes.toBytes("admissiontype")))))).toDF("id", "admissiontype")
sepsiscategories.createOrReplaceTempView("sep_categories")
spark.sql("select * from sep_categories").show
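
In case the mapping itself is also part of the problem, this is the variant I actually want to end up with: an untested sketch (my assumption) that restricts TableInputFormat to the whole sepsiscategories family via the standard SCAN_COLUMN_FAMILY key and flattens every (qualifier, value) pair instead of hard-coding "admissiontype". Apart from family_conf, familyRDD and allCells, all names are the ones already imported above.

// Untested sketch: the SCAN_COLUMN_FAMILY setting must be in place before the RDD is created
val family_conf = HBaseConfiguration.create()
family_conf.set(TableInputFormat.INPUT_TABLE, "mimic3")
family_conf.set("hbase.zookeeper.quorum", "hadoop-master")
family_conf.set(TableInputFormat.SCAN_COLUMN_FAMILY, "sepsiscategories")

val familyRDD = spark.sparkContext.newAPIHadoopRDD(
  family_conf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])

import scala.collection.JavaConverters._
val allCells = familyRDD.flatMap { case (_, result) =>
  val id = Bytes.toString(result.getRow)
  // getFamilyMap returns every (qualifier -> value) pair of the family for this row
  result.getFamilyMap(Bytes.toBytes("sepsiscategories")).asScala.map {
    case (qualifier, value) => (id, Bytes.toString(qualifier), Bytes.toString(value))
  }
}.toDF("id", "qualifier", "value")

allCells.createOrReplaceTempView("sep_categories_all")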

Whichever variant I run, I just want to see the results, but all I get is this error:

19/11/09 12:18:29 WARN zookeeper.ClientCnxn: Session 0x0 for server hadoop-master/172.18.0.2:16010, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Packet len1213486160 is out of range!

Can anyone suggest another approach? I have not managed to solve this problem, and I cannot find the zkCli.sh file to update the -Djute.maxbuffer parameter as described here: https://stackoverflow.com/a/19990613/5674606
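
On the -Djute.maxbuffer point: since I could not find zkCli.sh in the container, I wondered whether the property can be set on the client side instead. This is only an assumption on my part (the ZooKeeper client appears to read jute.maxbuffer as a JVM system property), not something I have verified to help with this error:

// Assumption, not verified: jute.maxbuffer is read as a JVM system property by the
// ZooKeeper client, so it would have to be set before the first connection is opened
// (i.e. before ConnectionFactory.createConnection or newAPIHadoopRDD is called)
System.setProperty("jute.maxbuffer", "4194304") // 4 MB, example value only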

PS: the cluster I am using is based on a Docker image; here is the link: https://kiwenlau.com/2016/06/26/hadoop-cluster-docker-update-english/
