Advanced usage

Path expansion
All Client methods provide path expansion functionality via the resolve() method. It enables the use of special markers to identify paths. For example, it currently supports the #LATEST marker, which expands to the last modified file inside a given folder.
# Load the most recent data in the `tracking` folder.
with client.read('tracking/#LATEST') as reader:
  data = reader.read()
See the method’s documentation for more information.
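The expansion can also be invoked directly when the fully resolved path itself is needed; a minimal sketch, reusing the folder from the example above:

# Returns the expanded path of the most recently modified file.
path = client.resolve('tracking/#LATEST')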
Custom client support
In order for the CLI to be able to instantiate arbitrary client classes, it must first be able to discover them. This is done by specifying where they are defined in the global section of HdfsCLI's configuration file. For example, here is how we can make the KerberosClient class available:
[global]
autoload.modules = hdfs.ext.kerberos
More precisely, there are two options for telling the CLI where to load the clients from:

- autoload.modules, a comma-separated list of modules (which must be on Python's path).
- autoload.paths, a comma-separated list of paths to Python files.
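Both options can be combined, and each accepts several comma-separated entries. For instance (the second module and the path below are illustrative):

[global]
autoload.modules = hdfs.ext.kerberos, hdfs.ext.avro
autoload.paths = /etc/hdfscli.py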
Implementing custom clients can be particularly useful for passing default options (e.g. a custom session argument to each client). We describe below a working example implementing a secure client with optional custom certificate support.
We first implement our new client and save it somewhere, for example /etc/hdfscli.py.
from hdfs import Client
from requests import Session

class SecureClient(Client):

  """A new client subclass for handling HTTPS connections.

  :param url: URL to namenode.
  :param cert: Local certificate. See `requests` documentation for details
    on how to use this.
  :param verify: Whether to check the host's certificate.
  :param \*\*kwargs: Keyword arguments passed to the default `Client`
    constructor.

  """

  def __init__(self, url, cert=None, verify=True, **kwargs):
    session = Session()
    if cert is not None:
      if ',' in cert:
        # A comma-separated value is interpreted as a (certificate, key) pair.
        session.cert = [path.strip() for path in cert.split(',')]
      else:
        session.cert = cert
    if isinstance(verify, str):
      # Options coming from the CLI are always passed in as strings.
      verify = verify.lower() in ('true', 'yes', 'ok')
    session.verify = verify
    super(SecureClient, self).__init__(url, session=session, **kwargs)
We then edit our configuration to tell the CLI how to load this module and define a prod alias using our new client:
[global]
autoload.paths = /etc/hdfscli.py

[prod.alias]
client = SecureClient
url = https://host:port
cert = /etc/server.crt, /etc/key
Note that options used to instantiate clients from the CLI (using hdfs.client.Client.from_options() under the hood) are always passed in as strings. This is why we had to implement some parsing logic in the SecureClient constructor above.
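For reference, instantiating the client programmatically with the same string-valued options exercises that parsing logic; a minimal sketch (the URL and certificate paths are placeholders):

# The comma-separated `cert` string is split into a (certificate, key) pair.
client = SecureClient(
  'https://host:port',
  cert='/etc/server.crt, /etc/key',
  verify='true',
)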
Tracking transfer progress
The read(), upload(), and download() client methods accept a progress callback argument which can be used to track transfers. The passed function will be called every chunk_size bytes with two arguments:

- The source path of the file currently being transferred.
- The number of bytes currently transferred for this file, or -1 to signal that this file's transfer has just finished.
Below is an implementation of a toy tracker which simply outputs to standard error the total number of transferred bytes each time a file transfer completes (we must still take care to ensure correct behavior even during multi-threaded transfers).
from sys import stderr
from threading import Lock

class Progress(object):

  """Basic progress tracker callback."""

  def __init__(self):
    self._data = {}
    self._lock = Lock()

  def __call__(self, hdfs_path, nbytes):
    # The callback may be invoked from several transfer threads at once.
    with self._lock:
      if nbytes >= 0:
        self._data[hdfs_path] = nbytes
      else:
        # A transfer just finished; output the total transferred so far.
        stderr.write('%s\n' % (sum(self._data.values()), ))
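An instance can then be passed to any of the transfer methods above, for example (the remote and local paths are placeholders):

client.download('data/2024', 'local/2024', n_threads=4, progress=Progress())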
Finally, note that the write() method doesn't expose a progress argument since this functionality can be replicated by passing a custom data generator (or within the context manager).
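For example, here is a minimal sketch of such a generator, assuming we want to report cumulative progress while uploading a local file (the chunking and reporting logic below is illustrative, not part of the library):

from sys import stderr

def tracked_chunks(local_path, chunk_size=2 ** 16):
  """Yield chunks of a local file, reporting cumulative bytes read."""
  nbytes = 0
  with open(local_path, 'rb') as reader:
    while True:
      chunk = reader.read(chunk_size)
      if not chunk:
        break
      nbytes += len(chunk)
      stderr.write('%s\n' % (nbytes, ))
      yield chunk

# The remote and local paths below are placeholders.
client.write('data/records.csv', data=tracked_chunks('records.csv'))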
Logging configuration

It is possible to configure where the CLI logs are written, or to disable logging entirely, for each entry point. To do this, we can set the following options in its corresponding section (the entry point's name suffixed with .command). For example:
[hdfscli-avro.command]
log.level = INFO
log.path = /tmp/hdfscli/avro.log
The following options are available:
- log.level, the handler log level (defaults to DEBUG).
- log.path, the path to the log file. The log is rotated every day (keeping a single copy). The default is a file named COMMAND.log in your current temporary directory. It is possible to view the currently active log file at any time by using the --log option at the command line.
- log.disable, disable logging to a file entirely (defaults to False).
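For instance, disabling file logging for the main entry point (assuming the default hdfscli name) would look like:

[hdfscli.command]
log.disable = true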
Renaming entry points

By default the command line entry point will be named hdfscli. You can choose another name by specifying the HDFSCLI_ENTRY_POINT environment variable at installation time:
$ HDFSCLI_ENTRY_POINT=hdfs pip install hdfs
Extension prefixes will be adjusted similarly (e.g. in the previous example, hdfscli-avro would become hdfs-avro).