Quickstart - Load data with HugeGraph-Loader - HugeGraph Open-Source Graph Database System v0.9 User Manual

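The following data sources are currently supported:

Local disk files or directories, including compressed files
HDFS files or directories, including compressed files
Some relational databases, such as MySQL

The specific requirements for each data source are described later.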

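There are two ways to get HugeGraph-Loader:

Download the precompiled package
Clone the source code and build it

2.1 Download the precompiled package

Download the latest HugeGraph-Loader release package:

2.2 Clone the source code and build it

Clone the latest HugeGraph-Loader source code:

Build the tar package: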

cd hugegraph-loader
mvn package -DskipTests

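Using HugeGraph-Loader involves the following steps:

Write the graph model (schema)
Prepare the data files
Write the input source mapping (source)
Run the import

3.1 Write the graph model (schema)

This is the modeling step. You need a clear picture of the data you already have and of the graph model you want to create, and then write a schema that defines that model.

For example, suppose you want a graph with two kinds of vertices and two kinds of edges. The vertices are "person" and "software"; the edges are "person knows person" and "person created software". These vertices and edges carry properties: a "person" vertex has properties such as "name" and "age", a "software" vertex has properties such as "name" and "price", and a "knows" edge has properties such as "date".

Example graph model

Once the graph model is designed, write the schema definition in Groovy and save it to a file, named schema.groovy here.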

// Create the property keys
schema.propertyKey("name").asText().ifNotExist().create();
schema.propertyKey("age").asInt().ifNotExist().create();
schema.propertyKey("city").asText().ifNotExist().create();
schema.propertyKey("date").asText().ifNotExist().create();
schema.propertyKey("price").asDouble().ifNotExist().create();
// Create the person vertex label with three properties (name, age, city); the primary key is name
schema.vertexLabel("person").properties("name", "age", "city").primaryKeys("name").ifNotExist().create();
// Create the software vertex label with two properties (name, price); the primary key is name
schema.vertexLabel("software").properties("name", "price").primaryKeys("name").ifNotExist().create();
// Create the knows edge label, which points from person to person and carries the date property
schema.edgeLabel("knows").sourceLabel("person").targetLabel("person").properties("date").ifNotExist().create();
// Create the created edge label, which points from person to software
schema.edgeLabel("created").sourceLabel("person").targetLabel("software").ifNotExist().create();

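3.2 Prepare the data files

HugeGraph-Loader currently supports the following data sources:

Local disk files or directories
HDFS files or directories
Some relational databases

Local disk files or directories

You can specify a local disk file as the data source. If the data is spread across several files, a single directory can also be used as the data source. Using multiple directories as one data source is not supported yet, nor is filtering the files in a directory.

The supported file formats are:

TEXT
CSV
JSON

TEXT is a text file with a custom delimiter. The first line is usually a header that records the name of each column; a file without a header line is also allowed (the header is then specified in the mapping file). Each remaining line is one record and is converted into a vertex or an edge. Each column of a line is a field and is converted into the id, the label, or a property of the vertex or edge. For example: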

id|name|lang|price|ISBN
1|lop|java|328|ISBN978-7-107-18618-5
2|ripple|java|199|ISBN978-7-100-13678-5
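To make the TEXT layout concrete, the following Python sketch splits the sample above on its delimiter and turns each line into a record keyed by the header names. It only mirrors the format described here; it is not the loader's own parser, and the variable names are chosen for the illustration.

```python
import csv
import io

# Sample TEXT content from above: "|" is the delimiter, the first line is the header.
text_data = """id|name|lang|price|ISBN
1|lop|java|328|ISBN978-7-107-18618-5
2|ripple|java|199|ISBN978-7-100-13678-5
"""

# DictReader takes the column names from the first line and maps every
# following line to one record (one future vertex or edge).
reader = csv.DictReader(io.StringIO(text_data), delimiter="|")
records = [dict(row) for row in reader]

for record in records:
    print(record)
```

Every value is still a string at this point; converting "328" to a number is left to whatever consumes the record.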

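CSV is a TEXT file whose delimiter is a comma (,). When a column value itself contains a comma, the value must be enclosed in double quotes, for example: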

marko,29,Beijing
"li,nary",26,"Wu,han"
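The quoting rule can be checked with Python's standard csv module, which follows the same convention. This is a small sketch on the two sample lines; the header list is supplied by hand because the sample has no header line.

```python
import csv
import io

csv_data = 'marko,29,Beijing\n"li,nary",26,"Wu,han"\n'

# The file has no header line, so the column names are given separately.
header = ["name", "age", "city"]

# csv.reader keeps a quoted value together even when it contains a comma.
rows = [dict(zip(header, fields)) for fields in csv.reader(io.StringIO(csv_data))]

print(rows[1])  # {'name': 'li,nary', 'age': '26', 'city': 'Wu,han'}
```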

In a JSON file, every line must be a single JSON string, and all lines must follow the same format.
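A file can be checked against this rule before it is handed to the loader. The sketch below reads one JSON object per line and, as an assumption about what "the same format" means, requires every line to have the same set of field names. The function name is hypothetical.

```python
import json

json_lines = """{"source_name": "Tom", "target_name": "Photoshop"}
{"source_name": "Jerry", "target_name": "Office"}
"""

def parse_json_lines(text):
    """Parse one JSON object per line and check that all lines share the same fields."""
    records = []
    expected_keys = None
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        if expected_keys is None:
            expected_keys = set(record)
        elif set(record) != expected_keys:
            raise ValueError(f"line {line_no} has different fields: {sorted(record)}")
        records.append(record)
    return records

edges = parse_json_lines(json_lines)
print(len(edges), "records with fields", sorted(edges[0]))
```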

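HDFS files or directories

You can also specify an HDFS file or directory as the data source. All of the requirements above for local disk files or directories apply here as well. In addition, because files stored on HDFS are usually compressed, the loader supports compressed files; local disk files and directories support them too.

The supported compression types are currently: GZIP, BZ2, XZ, LZMA, PACK200, SNAPPY_RAW, SNAPPY_FRAMED, Z, DEFLATE, LZ4_BLOCK, and LZ4_FRAMED.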
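For readers who want to inspect a compressed data file before importing it, here is a minimal Python sketch that picks a decompressor by type name and returns the text lines. It covers only the three types that Python's standard library handles directly and says nothing about how the loader itself decompresses files; the helper name and the sample file are made up for the example.

```python
import bz2
import gzip
import lzma
import os
import tempfile

# Only types covered by Python's standard library are listed in this sketch.
OPENERS = {"GZIP": gzip.open, "BZ2": bz2.open, "XZ": lzma.open}

def read_lines(path, compression=None):
    """Return the text lines of a plain or compressed file."""
    # A type missing from OPENERS raises KeyError instead of being read as plain text.
    opener = open if compression is None else OPENERS[compression]
    with opener(path, "rt", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "vertex_person.csv.gz")
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        handle.write("marko,29,Beijing\nvadas,27,Hongkong\n")
    lines = read_lines(path, "GZIP")

print(lines)
```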

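Some relational databases

The loader also supports some relational databases as data sources. Only MySQL has been tested; others such as Oracle and PostgreSQL should work in theory.

The requirements on the table structure are currently strict, however: a table structure that needs a join query during import is not allowed. A join query here means that, after reading a row, the value of some column cannot be used directly (a foreign key, for example) and another query is needed to determine the real value of that column.

For example, suppose there are three tables: person, software, and created.

// person table structure
id | name | age | city

// created table structure
id | p_id | s_id | date

If the schema sets the id strategy of person or software to PRIMARY_KEY and chooses name as the primary keys (note that this is the vertex label concept in HugeGraph), then importing the edge data requires building the ids of the source and target vertices, which means looking up the name in the person/software table by p_id/s_id. The loader does not yet support table structures that need this kind of extra query.

If the schema sets the id strategy of person and software to CUSTOMIZE, then p_id and s_id can be used directly as the ids of the source and target vertices when importing the created edge, so this case is supported.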
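The difference between the two cases can be seen in a small Python sketch. It uses an in-memory SQLite database as a stand-in for MySQL; the columns of the software table and all sample rows are assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person   (id INTEGER PRIMARY KEY, name TEXT, age INTEGER, city TEXT);
CREATE TABLE software (id INTEGER PRIMARY KEY, name TEXT, price REAL);  -- columns assumed
CREATE TABLE created  (id INTEGER PRIMARY KEY, p_id INTEGER, s_id INTEGER, date TEXT);
INSERT INTO person   VALUES (1, 'Tom', 48, 'Beijing');
INSERT INTO software VALUES (10, 'Photoshop', 999);
INSERT INTO created  VALUES (100, 1, 10, '2008-12-12');
""")

# CUSTOMIZE ids: every created row already holds both endpoint ids.
customize_edges = conn.execute("SELECT p_id, s_id, date FROM created").fetchall()

# PRIMARY_KEY ids built from name: the endpoints need a join back to the
# vertex tables, which is the extra query described above as unsupported.
primary_key_edges = conn.execute("""
    SELECT p.name, s.name, c.date
    FROM created c
    JOIN person p ON p.id = c.p_id
    JOIN software s ON s.id = c.s_id
""").fetchall()

print(customize_edges)    # [(1, 10, '2008-12-12')]
print(primary_key_edges)  # [('Tom', 'Photoshop', '2008-12-12')]
conn.close()
```

One workaround that follows from this is to run such a join yourself and export the result to a file, so that each edge row already carries the values needed for the vertex ids.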

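3.2.1 Prepare the vertex data

A vertex data file consists of lines of data. Usually each line is one vertex and each column is a vertex property. The examples below use the CSV format.

person vertex data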

Tom,48,Beijing
Jerry,36,Shanghai

software vertex data

name,price
Photoshop,999
Office,388

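3.2.2 Prepare the edge data

An edge data file consists of lines of data. Usually each line is one edge; some columns serve as the ids of the source and target vertices, and the remaining columns are edge properties. The examples below use the JSON format.

knows edge data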

{"source_name": "Tom", "target_name": "Jerry", "date": "2008-12-12"}

created edge data

{"source_name": "Tom", "target_name": "Photoshop"}
{"source_name": "Jerry", "target_name": "Office"}
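If the data lives in a program rather than in files, the sample files of sections 3.2.1 and 3.2.2 can be produced with a few lines of Python. This is only a sketch: the helper functions are hypothetical, and the files are written to a temporary directory that is removed when the block ends.

```python
import csv
import json
import os
import tempfile

person_rows = [["Tom", 48, "Beijing"], ["Jerry", 36, "Shanghai"]]
software_rows = [["name", "price"], ["Photoshop", 999], ["Office", 388]]  # first row is the header
knows_edges = [{"source_name": "Tom", "target_name": "Jerry", "date": "2008-12-12"}]
created_edges = [
    {"source_name": "Tom", "target_name": "Photoshop"},
    {"source_name": "Jerry", "target_name": "Office"},
]

def write_csv(path, rows):
    """Write rows as CSV; values containing commas are quoted automatically."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle, lineterminator="\n").writerows(rows)

def write_json_lines(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")

with tempfile.TemporaryDirectory() as data_dir:
    write_csv(os.path.join(data_dir, "vertex_person.csv"), person_rows)
    write_csv(os.path.join(data_dir, "vertex_software.csv"), software_rows)
    write_json_lines(os.path.join(data_dir, "edge_knows.json"), knows_edges)
    write_json_lines(os.path.join(data_dir, "edge_created.json"), created_edges)
    written = sorted(os.listdir(data_dir))
    with open(os.path.join(data_dir, "edge_created.json"), encoding="utf-8") as handle:
        created_lines = handle.read().splitlines()

print(written)
print(created_lines)
```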

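3.3 Write the input source mapping file

The input source mapping file is in JSON format and consists of multiple VertexSource and EdgeSource blocks. Each VertexSource or EdgeSource block describes the input source mapping for one kind of vertex or edge and contains an InputSource block. The InputSource block corresponds to one of the data sources described above (local disk file or directory, HDFS file or directory, or relational database) and describes the basic information of the data source.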

{
  "vertices": [
    {
      "label": "person",
      "input": {
        "type": "file",
        "path": "vertex_person.csv",
        "format": "CSV",
        "header": ["name", "age", "city"],
        "charset": "UTF-8"
      }
    },
    {
      "label": "software",
      "input": {
        "type": "file",
        "path": "vertex_software.csv",
        "format": "CSV"
      }
    }
  ],
  "edges": [
    {
      "label": "knows",
      "source": ["source_name"],
      "target": ["target_name"],
      "input": {
        "type": "file",
        "path": "edge_knows.json",
        "format": "JSON"
      },
      "mapping": {
        "source_name": "name",
        "target_name": "name"
      }
    },
    {
      "label": "created",
      "source": ["source_name"],
      "target": ["target_name"],
      "input": {
        "type": "file",
        "path": "edge_created.json",
        "format": "JSON"
      },
      "mapping": {
        "source_name": "name",
        "target_name": "name"
      }
    }
  ]
}
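Because a single unbalanced brace makes the whole mapping file unusable, it can help to check the file before starting an import. The Python sketch below parses a shortened mapping and reports missing nodes. The list of required nodes for edges follows the EdgeSource table in 3.3.1; treating label and input as required for vertices is an assumption, and check_mapping is a hypothetical helper, not part of the loader.

```python
import json

mapping_text = """
{
  "vertices": [
    {"label": "person",
     "input": {"type": "file", "path": "vertex_person.csv", "format": "CSV",
               "header": ["name", "age", "city"], "charset": "UTF-8"}}
  ],
  "edges": [
    {"label": "knows", "source": ["source_name"], "target": ["target_name"],
     "input": {"type": "file", "path": "edge_knows.json", "format": "JSON"},
     "mapping": {"source_name": "name", "target_name": "name"}}
  ]
}
"""

REQUIRED = {
    "vertices": ("label", "input"),                  # assumed for vertices
    "edges": ("label", "input", "source", "target"),
}

def check_mapping(mapping):
    """Return a list of problems found in a parsed mapping file."""
    problems = []
    for kind, required in REQUIRED.items():
        for index, block in enumerate(mapping.get(kind, [])):
            for key in required:
                if key not in block:
                    problems.append(f"{kind}[{index}] is missing '{key}'")
    return problems

# json.loads raises an error on malformed JSON, for example an unbalanced brace.
mapping = json.loads(mapping_text)
problems = check_mapping(mapping)
print(problems or "mapping looks structurally complete")
```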

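3.3.1 VertexSource and EdgeSource

The nodes of VertexSource include:

The nodes of EdgeSource include:

Mapping file node	Description	Required
label	The `label` that the edge data to be imported belongs to	Yes
input	Information about the edge data source	Yes
source	The columns used as the id columns of the source vertex	Yes. When the id strategy of the source vertex is `CUSTOMIZE`, one column must be specified as the vertex id column; when it is `PRIMARY_KEY`, one or more columns must be specified, and they are concatenated to generate the vertex id. This node is therefore required under either id strategy
target	The columns used as the id columns of the target vertex	Same as source
mapping	Maps column names to vertex property names	No
ignored	Columns to ignore, so that they are not inserted	No
null_values	Strings that represent a null value, such as "NULL". If the property of that column is also nullable, the property is not filled in when the edge is built	No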
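One way to read the table is as a sequence of steps applied to each parsed row. The Python sketch below follows that reading: it pulls out the source and target columns, drops ignored columns, skips null values, and renames the rest through mapping. It is an interpretation of the table, not the loader's implementation; the function name and the "comment" column are hypothetical, and the nullable check is left out for brevity.

```python
def split_edge_row(row, source, target, mapping=None, ignored=(), null_values=()):
    """Split one row into source key fields, target key fields, and edge properties."""
    mapping = mapping or {}
    # Columns named in source/target identify the two vertices.
    source_key = {mapping.get(col, col): row[col] for col in source}
    target_key = {mapping.get(col, col): row[col] for col in target}
    properties = {}
    for column, value in row.items():
        if column in source or column in target or column in ignored:
            continue  # not an edge property
        if value in null_values:
            continue  # treated as empty, so the property is not filled in
        properties[mapping.get(column, column)] = value
    return source_key, target_key, properties

row = {"source_name": "marko", "target_name": "vadas",
       "date": "NULL", "weight": 0.5, "comment": "n/a"}

result = split_edge_row(
    row,
    source=["source_name"],
    target=["target_name"],
    mapping={"source_name": "name", "target_name": "name"},
    ignored=["comment"],
    null_values=["NULL"],
)
print(result)  # ({'name': 'marko'}, {'name': 'vadas'}, {'weight': 0.5})
```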

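3.3.2 InputSource

The nodes of FileSource include:

The nodes of HDFSSource include:

Most FileSource nodes and their meanings also apply to HDFSSource. Only the nodes that differ or are specific to HDFSSource are listed below.

Mapping file node	Description	Required
type	Input source type; must be hdfs or HDFS	Yes
path	Path of the HDFS file or directory; must be an absolute HDFS path	Yes
fs_default_fs	The fs.defaultFS value of the HDFS cluster; the default value of fs.defaultFS is used if omitted	No

The nodes of JDBCSource include: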

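3.4 Run the import

Once the graph model, the data files, and the input source mapping file are ready, the data files can be imported into the graph database.

The import is controlled by the command you submit; different parameters control how it runs.

3.4.1 Parameters

Parameter	Default	Required	Description
-f | --file		Y	Path of the configuration script
-g | --graph		Y	Graph database space
-s | --schema		Y	Path of the schema file
-h | --host	localhost		Address of HugeGraphServer
-p | --port	8080		Port of HugeGraphServer
--token	null		Token of the current graph, used when HugeGraphServer has authentication enabled
--num-threads	cpus * 2 - 1		Size of the thread pool used during import
--batch-size	500		Number of records in each batch during import
--max-parse-errors	1		Maximum number of lines allowed to fail parsing; the program exits when this value is reached
--max-insert-errors	500		Maximum number of lines allowed to fail insertion; the program exits when this value is reached
--timeout	100		Timeout for the insert result to return (seconds)
--shutdown-timeout	10		Time to wait for the worker threads to stop (seconds)
--retry-times	10		Number of retries when certain exceptions occur
--retry-interval	10		Interval before each retry (seconds)
--check-vertex	false		Whether to check that the vertices connected by an edge exist when inserting the edge
--help	false		Print help information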
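Two of the defaults are easier to judge with numbers in hand. The Python sketch below computes the --num-threads default from the formula in the table and shows how a record set is cut into batches of --batch-size. It illustrates the arithmetic only and is not taken from the loader's code; the function names are made up.

```python
import os

def default_num_threads():
    """Default of --num-threads according to the table: cpus * 2 - 1."""
    cpus = os.cpu_count() or 1  # os.cpu_count() may return None
    return cpus * 2 - 1

def batches(records, batch_size=500):
    """Yield consecutive slices of at most batch_size records (--batch-size)."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

records = list(range(1200))  # stand-in for 1200 parsed vertices or edges
sizes = [len(batch) for batch in batches(records)]

print("threads:", default_num_threads())
print("batch sizes:", sizes)  # batch sizes: [500, 500, 200]
```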

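3.4.2 Files in the logs directory

While the program runs, logs and erroneous data are written to files under the logs directory.

hugegraph-loader.log: log and error messages produced while the program runs (appended)
parse_error.data: data that failed to parse (overwritten on every start)
insert_error.data: data that failed to insert (overwritten on every start)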

3.4.3 Run the command

Run bin/hugeloader with the parameters
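When the import is started from a script, the argument list can be assembled from the parameters in 3.4.1. The Python sketch below only builds and prints such a list; it does not run anything. The graph name and the two file paths are placeholders, and -f is assumed to take the input source mapping file.

```python
import shlex

def build_loader_command(graph, mapping_file, schema_file,
                         host="localhost", port=8080, **options):
    """Build the argument list for bin/hugeloader from the documented parameters."""
    command = ["bin/hugeloader",
               "-g", graph,
               "-f", mapping_file,   # assumed to be the mapping file
               "-s", schema_file,
               "-h", host,
               "-p", str(port)]
    # Keyword options become long flags, e.g. batch_size=500 -> --batch-size 500.
    for name, value in options.items():
        command += ["--" + name.replace("_", "-"), str(value)]
    return command

# All three values below are placeholders for this illustration.
command = build_loader_command("hugegraph", "example/struct.json",
                               "example/schema.groovy", batch_size=500)
print(shlex.join(command))
```

The list form can be passed to subprocess.run as is, which avoids shell quoting problems with paths that contain spaces.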

The following example is taken from the example directory of the hugegraph-loader package.

4.1 Prepare the data

Vertex file: vertex_person.csv

marko,29,Beijing
vadas,27,Hongkong
josh,32,Beijing
peter,35,Shanghai
"li,nary",26,"Wu,han"

Vertex file: vertex_software.text

name|lang|price
lop|java|328
ripple|java|199

Edge file: edge_knows.json

{"source_name": "marko", "target_name": "vadas", "date": "20160110", "weight": 0.5}
{"source_name": "marko", "target_name": "josh", "date": "20130220", "weight": 1.0}

Edge file: edge_created.json

{"aname": "marko", "bname": "lop", "date": "20171210", "weight": 0.4}
{"aname": "josh", "bname": "lop", "date": "20091111", "weight": 0.4}
{"aname": "josh", "bname": "ripple", "date": "20171210", "weight": 1.0}
{"aname": "peter", "bname": "lop", "date": "20170324", "weight": 0.2}
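Since --check-vertex is off by default, it can be worth confirming beforehand that every edge endpoint exists in the vertex files. The Python sketch below does this for the created edges, using the sample data above embedded as strings. It is an offline check written for this example, not the loader's own mechanism.

```python
import csv
import io
import json

person_csv = ('marko,29,Beijing\nvadas,27,Hongkong\njosh,32,Beijing\n'
              'peter,35,Shanghai\n"li,nary",26,"Wu,han"\n')
software_text = "name|lang|price\nlop|java|328\nripple|java|199\n"
created_json = """{"aname": "marko", "bname": "lop", "date": "20171210", "weight": 0.4}
{"aname": "josh", "bname": "lop", "date": "20091111", "weight": 0.4}
{"aname": "josh", "bname": "ripple", "date": "20171210", "weight": 1.0}
{"aname": "peter", "bname": "lop", "date": "20170324", "weight": 0.2}
"""

# person has no header line, so the name is the first column of every row.
people = {fields[0] for fields in csv.reader(io.StringIO(person_csv))}
# software has a header line and uses "|" as the delimiter.
software = {row["name"]
            for row in csv.DictReader(io.StringIO(software_text), delimiter="|")}

# Collect edges whose source or target name is not among the vertices.
dangling = []
for line in created_json.splitlines():
    edge = json.loads(line)
    if edge["aname"] not in people or edge["bname"] not in software:
        dangling.append(edge)

print(dangling)  # []
```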

4.2 Write the schema

// Property keys
schema.propertyKey("name").asText().ifNotExist().create();
schema.propertyKey("age").asInt().ifNotExist().create();
schema.propertyKey("city").asText().ifNotExist().create();
schema.propertyKey("weight").asDouble().ifNotExist().create();
schema.propertyKey("lang").asText().ifNotExist().create();
// date is used by both edge labels, so it must be declared as well
schema.propertyKey("date").asText().ifNotExist().create();
schema.propertyKey("price").asDouble().ifNotExist().create();
// Vertex labels, both with name as the primary key
schema.vertexLabel("person").properties("name", "age", "city").primaryKeys("name").ifNotExist().create();
schema.vertexLabel("software").properties("name", "lang", "price").primaryKeys("name").ifNotExist().create();
// Indexes on vertices
schema.indexLabel("personByName").onV("person").by("name").secondary().ifNotExist().create();
schema.indexLabel("personByAge").onV("person").by("age").range().ifNotExist().create();
schema.indexLabel("personByCity").onV("person").by("city").secondary().ifNotExist().create();
schema.indexLabel("personByAgeAndCity").onV("person").by("age", "city").secondary().ifNotExist().create();
schema.indexLabel("softwareByPrice").onV("software").by("price").range().ifNotExist().create();
// Edge labels, each with the date and weight properties
schema.edgeLabel("knows").sourceLabel("person").targetLabel("person").properties("date", "weight").ifNotExist().create();
schema.edgeLabel("created").sourceLabel("person").targetLabel("software").properties("date", "weight").ifNotExist().create();
// Indexes on edges
schema.indexLabel("createdByDate").onE("created").by("date").secondary().ifNotExist().create();
schema.indexLabel("createdByWeight").onE("created").by("weight").range().ifNotExist().create();
schema.indexLabel("knowsByWeight").onE("knows").by("weight").range().ifNotExist().create();

4.3 Write the input source mapping file

{
  "vertices": [
    {
      "label": "person",
      "input": {
        "type": "file",
        "path": "example/vertex_person.csv",
        "format": "CSV",
        "header": ["name", "age", "city"],
        "charset": "UTF-8"
      },
      "mapping": {
        "name": "name",
        "age": "age",
        "city": "city"
      }
    },
    {
      "label": "software",
      "input": {
        "type": "file",
        "path": "example/vertex_software.text",
        "format": "TEXT",
        "delimiter": "|",
        "charset": "GBK"
      }
    }
  ],
  "edges": [
    {
      "label": "knows",
      "source": ["source_name"],
      "target": ["target_name"],
      "input": {
        "type": "file",
        "path": "example/edge_knows.json",
        "format": "JSON"
      },
      "mapping": {
        "source_name": "name",
        "target_name": "name"
      }
    },
    {
      "label": "created",
      "source": ["aname"],
      "target": ["bname"],
      "input": {
        "type": "file",
        "path": "example/edge_created.json",
        "format": "JSON"
      },
      "mapping": {
        "aname": "name",
        "bname": "name"
      }
    }
  ]
}
