API reference
Client
WebHDFS API clients.
class hdfs.client.Client(url, root=None, proxy=None, timeout=None, session=None)
Bases: object
Base HDFS web client.
Parameters:
- url – Hostname or IP address of HDFS namenode, prefixed with protocol, followed by WebHDFS port on namenode. You may also specify multiple URLs separated by semicolons for High Availability support.
- proxy – User to proxy as.
- root – Root path, this will be prefixed to all HDFS paths passed to the client. If the root is relative, the path will be assumed relative to the user’s home directory.
- timeout – Connection timeouts, forwarded to the request handler. How long to wait for the server to send data before giving up, as a float, or a (connect_timeout, read_timeout) tuple. If the timeout is reached, an appropriate exception will be raised. See the requests documentation for details.
- session – requests.Session instance, used to emit all requests.
In general, this client should only be used directly when its subclasses (e.g. InsecureClient, TokenClient, and others provided by extensions) do not provide enough flexibility.
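For example, a minimal sketch of instantiating the base client directly (the URL and root are placeholders):
from hdfs.client import Client

# All HDFS paths passed to this client will be resolved relative to the root.
client = Client('http://namenode:port', root='/user/alice')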
acl_status(hdfs_path, strict=True)
Get AclStatus for a file or folder on HDFS.
Parameters:
- hdfs_path – Remote path.
- strict – If False, return None rather than raise an exception if the path doesn’t exist.
allow_snapshot(hdfs_path)
Allow snapshots for a remote folder.
Parameters: hdfs_path – Remote path to a directory. If hdfs_path doesn’t exist or points to a normal file, an HdfsError will be raised. No-op if snapshotting is already allowed.
checksum(hdfs_path)
Get a remote file’s checksum.
Parameters: hdfs_path – Remote path. Must point to a file.
content(hdfs_path, strict=True)
Get ContentSummary for a file or folder on HDFS.
Parameters:
- hdfs_path – Remote path.
- strict – If False, return None rather than raise an exception if the path doesn’t exist.
create_snapshot(hdfs_path, snapshotname=None)
Create a snapshot of a remote folder where snapshotting has been allowed.
Parameters:
- hdfs_path – Remote path to a directory. If hdfs_path doesn’t exist, doesn’t allow snapshots, or points to a normal file, an HdfsError will be raised.
- snapshotname – Snapshot name; if absent, a name is generated by the server.
Returns the path to the created snapshot.
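A short sketch of the snapshot workflow, reusing the client instance from the example above (the directory and snapshot names are placeholders):
client.allow_snapshot('data/backups')
path = client.create_snapshot('data/backups', snapshotname='daily')
client.delete_snapshot('data/backups', 'daily')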
delete(hdfs_path, recursive=False, skip_trash=True)
Remove a file or directory from HDFS.
Parameters:
- hdfs_path – HDFS path.
- recursive – Recursively delete files and directories. By default, this method will raise an HdfsError if trying to delete a non-empty directory.
- skip_trash – When false, the deleted path will be moved to an appropriate trash folder rather than deleted. This requires Hadoop 2.9+ and trash to be enabled on the cluster.
This function returns True if the deletion was successful and False if no file or directory previously existed at hdfs_path.
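For example (the path is a placeholder):
# Raises HdfsError for a non-empty directory unless recursive=True; returns
# False when nothing existed at the path.
if not client.delete('data/tmp', recursive=True):
  print('Nothing to delete.')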
delete_snapshot(hdfs_path, snapshotname)
Remove a snapshot of a remote folder where snapshotting has been allowed.
Parameters:
- hdfs_path – Remote path to a directory. If hdfs_path doesn’t exist or points to a normal file, an HdfsError will be raised.
- snapshotname – Snapshot name; if it does not exist, an HdfsError will be raised.
disallow_snapshot(hdfs_path)
Disallow snapshots for a remote folder.
Parameters: hdfs_path – Remote path to a directory. If hdfs_path doesn’t exist, points to a normal file, or still has snapshots, an HdfsError will be raised. No-op if snapshotting is already disallowed or was never allowed.
download(hdfs_path, local_path, overwrite=False, n_threads=1, temp_dir=None, **kwargs)
Download a file or folder from HDFS and save it locally.
Parameters:
- hdfs_path – Path on HDFS of the file or folder to download. If a folder, all the files under it will be downloaded.
- local_path – Local path. If it already exists and is a directory, the files will be downloaded inside of it.
- overwrite – Overwrite any existing file or directory.
- n_threads – Number of threads to use for parallelization. A value of 0 (or negative) uses as many threads as there are files.
- temp_dir – Directory under which the files will first be downloaded when overwrite=True and the final destination path already exists. Once the download successfully completes, it will be swapped in.
- **kwargs – Keyword arguments forwarded to read(). If no chunk_size argument is passed, a default value of 64 kB will be used. If a progress argument is passed and threading is used, care must be taken to ensure correct behavior.
On success, this method returns the local download path.
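For example (paths and thread count are illustrative):
# Download a remote folder with four threads; returns the local path.
local_path = client.download('data/logs', '/tmp/logs', overwrite=True, n_threads=4)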
classmethod from_options(options, class_name='Client')
Load client from options.
Parameters:
- options – Options dictionary.
- class_name – Client class name. Defaults to the base Client class.
This method provides a single entry point to instantiate any registered Client subclass. To register a subclass, simply load its containing module. If using the CLI, you can use the autoload.modules and autoload.paths options.
list(hdfs_path, status=False)
Return names of files contained in a remote folder.
Parameters:
- hdfs_path – Remote path to a directory. If hdfs_path doesn’t exist or points to a normal file, an HdfsError will be raised.
- status – Also return each file’s corresponding FileStatus.
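For example (the directory name is a placeholder):
names = client.list('data')
entries = client.list('data', status=True) # List of (name, FileStatus) tuples.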
makedirs(hdfs_path, permission=None)
Create a remote directory, recursively if necessary.
Parameters:
- hdfs_path – Remote path. Intermediate directories will be created appropriately.
- permission – Octal permission to set on the newly created directory. These permissions will only be set on directories that do not already exist.
This function currently has no return value as WebHDFS doesn’t return a meaningful flag.
parts(hdfs_path, parts=None, status=False)
Returns a dictionary of part-files corresponding to a path.
Parameters:
- hdfs_path – Remote path. This directory should contain at most one part-file per partition (otherwise one will be picked arbitrarily).
- parts – List of part-file numbers or total number of part-files to select. If a number, that many partitions will be chosen at random. By default all part-files are returned. If parts is a list and one of the parts is not found or too many samples are demanded, an HdfsError is raised.
- status – Also return each file’s corresponding FileStatus.
read(**kwds)
Read a file from HDFS.
Parameters:
- hdfs_path – HDFS path.
- offset – Starting byte position.
- length – Number of bytes to be processed. None will read the entire file.
- buffer_size – Size of the buffer in bytes used for transferring the data. Defaults to the value set in the HDFS configuration.
- encoding – Encoding used to decode the request. By default the raw data is returned. This is mostly helpful in python 3, for example to deserialize JSON data (as the decoder expects unicode).
- chunk_size – If set to a positive number, the context manager will return a generator yielding every chunk_size bytes instead of a file-like object (unless delimiter is also set, see below).
- delimiter – If set, the context manager will return a generator yielding each time the delimiter is encountered. This parameter requires the encoding to be specified.
- progress – Callback function to track progress, called every chunk_size bytes (not available if the chunk size isn’t specified). It will be passed two arguments, the path to the file being read and the number of bytes transferred so far. On completion, it will be called once with -1 as second argument.
This method must be called using a with block:
with client.read('foo') as reader:
  content = reader.read()
This ensures that connections are always properly closed.
Note: The raw file-like object returned by this method (when called without an encoding, chunk size, or delimiter) can have a very different performance profile than local files. In particular, line-oriented methods are often slower. The recommended workaround is to specify an encoding when possible or read the entire file before splitting it.
remove_acl(hdfs_path)
Remove the ACL for a file or folder on HDFS.
Parameters: hdfs_path – Path to an existing remote file or directory. An HdfsError will be raised if the path doesn’t exist.
remove_acl_entries(hdfs_path, acl_spec)
Remove ACL entries for a file or folder on HDFS.
Parameters:
- hdfs_path – Path to an existing remote file or directory. An HdfsError will be raised if the path doesn’t exist.
- acl_spec – String representation of an ACL spec. Must be a valid string with entries for user, group and other. For example: "user::rwx,user:foo:rw-,group::r--,other::---".
remove_default_acl(hdfs_path)
Remove the default ACL for a file or folder on HDFS.
Parameters: hdfs_path – Path to an existing remote file or directory. An HdfsError will be raised if the path doesn’t exist.
rename(hdfs_src_path, hdfs_dst_path)
Move a file or folder.
Parameters:
- hdfs_src_path – Source path.
- hdfs_dst_path – Destination path. If the path already exists and is a directory, the source will be moved into it. If the path exists and is a file, or if a parent destination directory is missing, this method will raise an HdfsError.
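For example (paths are placeholders; 'archive' is assumed to be an existing directory):
# The source file is moved into the existing 'archive' directory.
client.rename('data/2023.csv', 'archive')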
rename_snapshot(hdfs_path, oldsnapshotname, snapshotname)
Rename a snapshot of a remote folder.
Parameters:
- hdfs_path – Remote path to a directory. If hdfs_path doesn’t exist or points to a normal file, an HdfsError will be raised.
- oldsnapshotname – Existing snapshot name; if it does not exist, an HdfsError will be raised.
- snapshotname – New snapshot name; if it already exists, an HdfsError will be raised.
resolve(hdfs_path)
Return absolute, normalized path, with special markers expanded.
Parameters: hdfs_path – Remote path. Currently supported markers: '#LATEST': this marker gets expanded to the most recently updated file or folder. Markers can be combined using the '{N}' suffix. For example, 'foo/#LATEST{2}' is equivalent to 'foo/#LATEST/#LATEST'.
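For example, assuming a 'logs' directory exists:
# Expands to the most recently updated entry under 'logs'.
latest = client.resolve('logs/#LATEST')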
set_acl(hdfs_path, acl_spec, clear=True)
Set or modify the ACL for a file or folder on HDFS.
Parameters:
- hdfs_path – Path to an existing remote file or directory. An HdfsError will be raised if the path doesn’t exist.
- acl_spec – String representation of an ACL spec. Must be a valid string with entries for user, group and other. For example: "user::rwx,user:foo:rw-,group::r--,other::---".
- clear – Clear existing ACL entries. If set to false, all existing ACL entries that are not specified in this call are retained without changes, behaving like a modify operation. For example: "user:foo:rwx".
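A short sketch of both modes (the path and ACL entries are placeholders):
# Replace the entire ACL, then adjust a single entry without clearing the rest.
client.set_acl('data/report.csv', 'user::rwx,user:foo:rw-,group::r--,other::---')
client.set_acl('data/report.csv', 'user:foo:rwx', clear=False)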
set_owner(hdfs_path, owner=None, group=None)
Change the owner of a file.
Parameters:
- hdfs_path – HDFS path.
- owner – Optional, new owner for file.
- group – Optional, new group for file.
At least one of owner and group must be specified.
set_permission(hdfs_path, permission)
Change the permissions of a file.
Parameters:
- hdfs_path – HDFS path.
- permission – New octal permissions string for the file.
set_replication(hdfs_path, replication)
Set file replication.
Parameters:
- hdfs_path – Path to an existing remote file. An HdfsError will be raised if the path doesn’t exist or points to a directory.
- replication – Replication factor.
set_times(hdfs_path, access_time=None, modification_time=None)
Change remote timestamps.
Parameters:
- hdfs_path – HDFS path.
- access_time – Timestamp of last file access.
- modification_time – Timestamp of last file modification.
status(hdfs_path, strict=True)
Get FileStatus for a file or folder on HDFS.
Parameters:
- hdfs_path – Remote path.
- strict – If False, return None rather than raise an exception if the path doesn’t exist.
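For example (the path is a placeholder; the keys shown are assumed to follow the WebHDFS FileStatus schema):
info = client.status('data/report.csv', strict=False)
if info is not None:
  print(info['type'], info['length'])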
upload(hdfs_path, local_path, n_threads=1, temp_dir=None, chunk_size=65536, progress=None, cleanup=True, **kwargs)
Upload a file or directory to HDFS.
Parameters:
- hdfs_path – Target HDFS path. If it already exists and is a directory, files will be uploaded inside.
- local_path – Local path to file or folder. If a folder, all the files inside of it will be uploaded (note that this implies that folders empty of files will not be created remotely).
- n_threads – Number of threads to use for parallelization. A value of 0 (or negative) uses as many threads as there are files.
- temp_dir – Directory under which the files will first be uploaded when overwrite=True and the final remote path already exists. Once the upload successfully completes, it will be swapped in.
- chunk_size – Interval in bytes by which the files will be uploaded.
- progress – Callback function to track progress, called every chunk_size bytes. It will be passed two arguments, the path to the file being uploaded and the number of bytes transferred so far. On completion, it will be called once with -1 as second argument.
- cleanup – Delete any uploaded files if an error occurs during the upload.
- **kwargs – Keyword arguments forwarded to write(). In particular, set overwrite to overwrite any existing file or directory.
On success, this method returns the remote upload path.
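For example (paths are placeholders; overwrite is forwarded to write()):
remote_path = client.upload('data/reports', '/tmp/reports', n_threads=2, overwrite=True)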
walk(hdfs_path, depth=0, status=False, ignore_missing=False, allow_dir_changes=False)
Depth-first walk of remote filesystem.
Parameters:
- hdfs_path – Starting path. If the path doesn’t exist, an HdfsError will be raised. If it points to a file, the returned generator will be empty.
- depth – Maximum depth to explore. 0 for no limit.
- status – Also return each file or folder’s corresponding FileStatus.
- ignore_missing – Ignore missing nested folders rather than raise an exception. This can be useful when the tree is modified during a walk.
- allow_dir_changes – Allow changes to the directories’ list to affect the walk. For example clearing it by setting dirs[:] = [] would prevent the walk from entering any nested directories. This option can only be set when status is false.
This method returns a generator yielding tuples (path, dirs, files) where path is the absolute path to the current directory, dirs is the list of directory names it contains, and files is the list of file names it contains.
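For example, printing every file path under a directory (the starting path is a placeholder):
for path, dirs, files in client.walk('data', depth=2):
  for name in files:
    print('%s/%s' % (path, name))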
write(hdfs_path, data=None, overwrite=False, permission=None, blocksize=None, replication=None, buffersize=None, append=False, encoding=None)
Create a file on HDFS.
Parameters:
- hdfs_path – Path where to create file. The necessary directories will be created appropriately.
- data – Contents of file to write. Can be a string, a generator or a file object. The last two options will allow streaming upload (i.e. without having to load the entire contents into memory). If None, this method will return a file-like object and should be called using a with block (see below for examples).
- overwrite – Overwrite any existing file or directory.
- permission – Octal permission to set on the newly created file. Leading zeros may be omitted.
- blocksize – Block size of the file.
- replication – Number of replications of the file.
- buffersize – Size of upload buffer.
- append – Append to a file rather than create a new one.
- encoding – Encoding used to serialize data written.
Sample usages:
from json import dump, dumps

records = [
  {'name': 'foo', 'weight': 1},
  {'name': 'bar', 'weight': 2},
]

# As a context manager:
with client.write('data/records.jsonl', encoding='utf-8') as writer:
  dump(records, writer)

# Or, passing in a generator directly:
client.write('data/records.jsonl', data=dumps(records), encoding='utf-8')
class hdfs.client.InsecureClient(url, user=None, **kwargs)
Bases: hdfs.client.Client
HDFS web client to use when security is off.
Parameters:
- url – Hostname or IP address of HDFS namenode, prefixed with protocol, followed by WebHDFS port on namenode.
- user – User default. Defaults to the current user’s (as determined by whoami).
- **kwargs – Keyword arguments passed to the base class’ constructor.
Note that if a session argument is passed in, it will be modified in-place to support authentication.
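For example (the URL and user are placeholders):
from hdfs.client import InsecureClient

client = InsecureClient('http://namenode:port', user='alice')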
class hdfs.client.TokenClient(url, token, **kwargs)
Bases: hdfs.client.Client
HDFS web client using Hadoop token delegation security.
Parameters:
- url – Hostname or IP address of HDFS namenode, prefixed with protocol, followed by WebHDFS port on namenode.
- token – Hadoop delegation token.
- **kwargs – Keyword arguments passed to the base class’ constructor.
Note that if a session argument is passed in, it will be modified in-place to support authentication.
Extensions
The following extensions are currently available:
Kerberos
Support for clusters using Kerberos authentication.
This extension adds a new hdfs.client.Client subclass, KerberosClient, which handles authentication appropriately with Kerberized clusters:
from hdfs.ext.kerberos import KerberosClient
client = KerberosClient('http://host:port')
To expose this class to the command line interface (so that it can be used by aliases), we add the following line inside the global section of ~/.hdfscli.cfg (or wherever our configuration file is located):
autoload.modules = hdfs.ext.kerberos
Here is what our earlier configuration would look like if we updated it to support a Kerberized production grid:
[global]
default.alias = dev
autoload.modules = hdfs.ext.kerberos
[dev.alias]
url = http://dev.namenode:port
[prod.alias]
url = http://prod.namenode:port
client = KerberosClient
class hdfs.ext.kerberos.KerberosClient(url, mutual_auth='OPTIONAL', max_concurrency=1, root=None, proxy=None, timeout=None, session=None, **kwargs)
Bases: hdfs.client.Client
HDFS web client using Kerberos authentication.
Parameters:
- url – Hostname or IP address of HDFS namenode, prefixed with protocol, followed by WebHDFS port on namenode.
- mutual_auth – Whether to enforce mutual authentication or not (possible values: 'REQUIRED', 'OPTIONAL', 'DISABLED').
- max_concurrency – Maximum number of allowed concurrent authentication requests. This is required since requests exceeding the threshold allowed by the server will be unable to authenticate.
- proxy – User to proxy as.
- root – Root path, this will be prefixed to all HDFS paths passed to the client. If the root is relative, the path will be assumed relative to the user’s home directory.
- timeout – Connection timeouts, forwarded to the request handler. How long to wait for the server to send data before giving up, as a float, or a (connect_timeout, read_timeout) tuple. If the timeout is reached, an appropriate exception will be raised. See the requests documentation for details.
- session – requests.Session instance, used to emit all requests.
- **kwargs – Additional arguments passed to the underlying HTTPKerberosAuth class.
To avoid replay errors, a timeout of 1 ms is enforced between requests. If a session argument is passed in, it will be modified in-place to support authentication.
Avro
Read and write Avro files directly from HDFS.
This extension enables streaming decoding and encoding of files from and to HDFS. It requires the fastavro library.
- AvroWriter writes Avro files on HDFS from python objects.
- AvroReader reads Avro files from HDFS into an iterable of records.
Sample usage:
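A minimal sketch, assuming an existing client instance and placeholder file names:
from hdfs.ext.avro import AvroReader, AvroWriter

# Read all records from a remote Avro file.
with AvroReader(client, 'weather.avro') as reader:
  records = list(reader)

# Write them back to a new file; overwrite is forwarded to the client's write().
with AvroWriter(client, 'weather-copy.avro', overwrite=True) as writer:
  for record in records:
    writer.write(record)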
It also features an entry point (named hdfscli-avro by default) which provides access to the above functionality from the shell. For usage examples and more information:
$ hdfscli-avro --help
class hdfs.ext.avro.AvroReader(client, hdfs_path, parts=None, reader_schema=None)
Bases: object
HDFS Avro file reader.
Parameters:
- client – hdfs.client.Client instance.
- hdfs_path – Remote path.
- parts – Part-files to read, when reading a distributed file. The default is to read all part-files in order. See hdfs.client.Client.parts() for details.
- reader_schema – Schema to read the data as. If specified, it must be compatible with the writer’s schema (the default).
The contents of the file will be decoded in a streaming manner, as the data is transferred. This makes it possible to use on files of arbitrary size. As a convenience, the content summary object of the remote file is available on the reader’s content attribute.
Usage:
with AvroReader(client, 'foo.avro') as reader:
  schema = reader.writer_schema # The remote file's Avro schema.
  content = reader.content # Content metadata (e.g. size).
  for record in reader:
    pass # and its records

schema
Get the underlying file’s schema.
The schema will only be available after entering the reader’s corresponding with block.

writer_schema
Get the underlying file’s schema.
The schema will only be available after entering the reader’s corresponding with block.
class hdfs.ext.avro.AvroWriter(client, hdfs_path, schema=None, codec=None, sync_interval=None, sync_marker=None, metadata=None, **kwargs)
Bases: object
Write an Avro file on HDFS from python dictionaries.
Parameters:
- client – hdfs.client.Client instance.
- hdfs_path – Remote path.
- schema – Avro schema. If not specified, the writer will try to infer it from the first record sent. There are however limitations regarding what can be inferred.
- codec – Compression codec. The default is 'null' (no compression).
- sync_interval – Number of bytes after which a block will be written.
- sync_marker – 16 byte tag used for synchronization. If not specified, one will be generated at random.
- metadata – Additional metadata to include in the container file’s header. Keys starting with 'avro.' are reserved.
- **kwargs – Keyword arguments forwarded to hdfs.client.Client.write().
Usage:
with AvroWriter(client, 'data.avro') as writer:
  for record in records:
    writer.write(record)

schema
Avro schema.

write(record)
Store a record.
Parameters: record – Record object to store. Only available inside the with block.
Dataframe
Read and write Pandas dataframes directly from HDFS.
This extension requires both the avro extension and pandas to be installed. Currently only Avro serialization is supported.
hdfs.ext.dataframe.read_dataframe(client, hdfs_path)
Read dataframe from HDFS Avro file.
Parameters:
- client – hdfs.client.Client instance.
- hdfs_path – Remote path to an Avro file (potentially distributed).
hdfs.ext.dataframe.write_dataframe(client, hdfs_path, df, **kwargs)
Save dataframe to HDFS as Avro.
Parameters:
- client – hdfs.client.Client instance.
- hdfs_path – Remote path where the dataframe will be stored.
- df – Dataframe to store.
- **kwargs – Keyword arguments passed through to hdfs.ext.avro.AvroWriter.
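A minimal sketch (the dataframe and remote path are placeholders):
import pandas as pd

from hdfs.ext.dataframe import read_dataframe, write_dataframe

# Round-trip a dataframe through an Avro file on HDFS.
df = pd.DataFrame({'name': ['foo', 'bar'], 'weight': [1, 2]})
write_dataframe(client, 'data/weights.avro', df, overwrite=True)
df2 = read_dataframe(client, 'data/weights.avro')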
Configuration
Command line interface configuration module.
This module provides programmatic access to HdfsCLI’s configuration settings. In particular it exposes the ability to instantiate clients from aliases (see Config.get_client()).
class hdfs.config.Config(path=None, stream_log_level=None)
Bases: ConfigParser.RawConfigParser
Configuration class.
Parameters:
- path – Path to configuration file. If no file exists at that location, the configuration parser will be empty. If not specified, the value of the HDFSCLI_CONFIG environment variable is used if it exists, otherwise it defaults to ~/.hdfscli.cfg.
- stream_log_level – Stream handler log level, attached to the root logger. A false-ish value will disable this handler. This is particularly useful with the catch() function which reports exceptions as log messages.
On instantiation, the configuration object will attempt to load modules defined in the autoload global options (see Custom client support for more information).
get_client(alias=None)
Load HDFS client.
Parameters: alias – The client to look up. If not specified, the default alias will be used (the default.alias option in the global section) if available, and an error will be raised otherwise.
Further calls to this method for the same alias will return the same client instance (in particular, any option changes to this alias will not be taken into account).
get_log_handler(command)
Configure and return log handler.
Parameters: command – The command to load the configuration for. All options will be looked up in the [COMMAND.command] section. This is currently only used for configuring the file handler for logging. If logging is disabled for the command, a NullHandler will be returned, else a TimedRotatingFileHandler.
class hdfs.config.NullHandler(level=0)
Bases: logging.Handler
Pass-through logging handler.
This is required for python <2.7.
Initializes the instance - basically setting the formatter to None and the filter list to empty.
emit(record)
Do nothing.
hdfs.config.catch(*error_classes)
Returns a decorator that catches errors and prints messages to stderr. The decorated function also exits with status 1 if any errors are caught.
Parameters: *error_classes – Error classes.
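A minimal sketch of using the decorator in a small script (the main function body is a placeholder):
from hdfs.config import catch
from hdfs.util import HdfsError

# Any HdfsError raised inside main() is printed to stderr and the script
# exits with status 1 instead of showing a traceback.
@catch(HdfsError)
def main():
  pass # CLI logic goes here.

if __name__ == '__main__':
  main()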
Utilities
Common utilities.
class hdfs.util.AsyncWriter(consumer)
Bases: object
Asynchronous publisher-consumer.
Parameters: consumer – Function which takes a single generator as argument.
This class can be used to transform functions which expect a generator into file-like writer objects. This can make it possible to combine different APIs together more easily. For example, to send streaming requests:
import requests as rq

with AsyncWriter(lambda data: rq.post(URL, data=data)) as writer:
  writer.write('Hello, world!')
flush()
Pass-through implementation.

seekable()
Implement file-like method expected by certain libraries. fastavro relies on it in python 3.

tell()
No-op implementation.

write(chunk)
Stream data to the underlying consumer.
Parameters: chunk – Bytes to write. These will be buffered in memory until the consumer reads them.
exception hdfs.util.HdfsError(message, *args, **kwargs)
Bases: exceptions.Exception
Base error class.
Parameters:
- message – Error message.
- args – Optional message formatting arguments.
hdfs.util.temppath(*args, **kwds)
Create a temporary path.
Parameters: dpath – Explicit directory name where to create the temporary path. A system dependent default will be used otherwise (cf. tempfile.mkstemp).
Usage:
with temppath() as path:
  pass # do stuff
Any file or directory corresponding to the path will be automatically deleted afterwards.