hdfs
Input parameters
- — The relative URI to the file in HDFS. Path to file support following globs in readonly mode:
*
,?
,{abc,def}
and{N..M}
whereN
,M
— numbers, `'abc', 'def'
— strings. format
— The format of the file.structure
— Structure of the table. Format .
Returned value
A table with the specified structure for reading or writing data in the specified file.
Example
SELECT *
FROM hdfs('hdfs://hdfs1:9000/test', 'TSV', 'column1 UInt32, column2 UInt32, column3 UInt32')
LIMIT 2
Globs in path
Multiple path components can have globs. For being processed file should exists and matches to the whole path pattern (not only suffix or prefix).
*
— Substitutes any number of any characters except/
including empty string.{some_string,another_string,yet_another_one}
— Substitutes any of strings'some_string', 'another_string', 'yet_another_one'
.- — Substitutes any number in range from N to M including both borders.
Constructions with {}
are similar to the ).
Example
- Suppose that we have several files with following URIs on HDFS:
- ‘hdfs://hdfs1:9000/some_dir/some_file_1’
- ‘hdfs://hdfs1:9000/some_dir/some_file_2’
- ‘hdfs://hdfs1:9000/some_dir/some_file_3’
- ‘hdfs://hdfs1:9000/another_dir/some_file_1’
- ‘hdfs://hdfs1:9000/another_dir/some_file_2’
- Query the amount of rows in these files:
SELECT count(*)
FROM hdfs('hdfs://hdfs1:9000/{some,another}_dir/some_file_{1..3}', 'TSV', 'name String, value UInt32')
- Query the amount of rows in all files of these two directories:
If your listing of files contains number ranges with leading zeros, use the construction with braces for each digit separately or use ?
.
Example
Query the data from files named file000
, file001
, … , file999
:
SELECT count(*)
FROM hdfs('hdfs://hdfs1:9000/big_dir/file{0..9}{0..9}{0..9}', 'CSV', 'name String, value UInt32')
- — Path to the file.
_file
— Name of the file.
See Also