API reference

Client

WebHDFS API clients.

class hdfs.client.Client(url, root=None, proxy=None, timeout=None, session=None)

Bases: object

Base HDFS web client.

Parameters:
  • url – Hostname or IP address of HDFS namenode, prefixed with protocol, followed by WebHDFS port on namenode.
  • proxy – User to proxy as.
  • root – Root path, this will be prefixed to all HDFS paths passed to the client. If the root is relative, the path will be assumed relative to the user’s home directory.
  • timeout – Connection timeouts, forwarded to the request handler. How long to wait for the server to send data before giving up, as a float, or a (connect_timeout, read_timeout) tuple. If the timeout is reached, an appropriate exception will be raised. See the requests documentation for details.
  • session – requests.Session instance, used to emit all requests.

In general, this client should only be used directly when its subclasses (e.g. InsecureClient, TokenClient, and others provided by extensions) do not provide enough flexibility.

acl_status(hdfs_path, strict=True)

Get AclStatus for a file or folder on HDFS.

Parameters:
  • hdfs_path – Remote path.
  • strict – If False, return None rather than raise an exception if the path doesn’t exist.
checksum(hdfs_path)

Get a remote file’s checksum.

Parameters: hdfs_path – Remote path. Must point to a file.
content(hdfs_path, strict=True)

Get ContentSummary for a file or folder on HDFS.

Parameters:
  • hdfs_path – Remote path.
  • strict – If False, return None rather than raise an exception if the path doesn’t exist.
delete(hdfs_path, recursive=False)

Remove a file or directory from HDFS.

Parameters:
  • hdfs_path – HDFS path.
  • recursive – Recursively delete files and directories. By default, this method will raise an HdfsError if trying to delete a non-empty directory.

This function returns True if the deletion was successful and False if no file or directory previously existed at hdfs_path.

download(hdfs_path, local_path, overwrite=False, n_threads=1, temp_dir=None, **kwargs)

Download a file or folder from HDFS and save it locally.

Parameters:
  • hdfs_path – Path on HDFS of the file or folder to download. If a folder, all the files under it will be downloaded.
  • local_path – Local path. If it already exists and is a directory, the files will be downloaded inside of it.
  • overwrite – Overwrite any existing file or directory.
  • n_threads – Number of threads to use for parallelization. A value of 0 (or negative) uses as many threads as there are files.
  • temp_dir – Directory under which the files will first be downloaded when overwrite=True and the final destination path already exists. Once the download successfully completes, it will be swapped in.
  • **kwargs – Keyword arguments forwarded to read(). If no chunk_size argument is passed, a default value of 64 kB will be used. If a progress argument is passed and threading is used, care must be taken to ensure correct behavior.

On success, this method returns the local download path.
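The progress-callback contract (invoked every chunk_size bytes, then once with -1 on completion) can be sketched without a cluster; simulated_transfer below is a hypothetical stand-in for the client's internal transfer loop, not part of the library:

```python
def make_progress_logger(events):
  """Build a progress callback that records each (path, nbytes) call."""
  def progress(path, nbytes):
    events.append((path, nbytes))
  return progress

def simulated_transfer(path, total_bytes, chunk_size, progress):
  """Hypothetical stand-in for the client's chunked transfer loop."""
  transferred = 0
  while transferred < total_bytes:
    step = min(chunk_size, total_bytes - transferred)
    transferred += step
    progress(path, transferred)  # called every chunk_size bytes
  progress(path, -1)  # completion marker, per the documented contract

events = []
simulated_transfer('data/foo.csv', 150, 64, make_progress_logger(events))
# events: [('data/foo.csv', 64), ('data/foo.csv', 128),
#          ('data/foo.csv', 150), ('data/foo.csv', -1)]
```

A real callback passed via progress= would receive exactly this sequence of calls for each file being transferred.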

classmethod from_options(options, class_name='Client')

Load client from options.

Parameters:
  • options – Options dictionary.
  • class_name – Client class name. Defaults to the base Client class.

This method provides a single entry point to instantiate any registered Client subclass. To register a subclass, simply load its containing module. If using the CLI, you can use the autoload.modules and autoload.paths options.

list(hdfs_path, status=False)

Return names of files contained in a remote folder.

Parameters:
  • hdfs_path – Remote path to a directory. If hdfs_path doesn’t exist or points to a normal file, an HdfsError will be raised.
  • status – Also return each file’s corresponding FileStatus.
makedirs(hdfs_path, permission=None)

Create a remote directory, recursively if necessary.

Parameters:
  • hdfs_path – Remote path. Intermediate directories will be created appropriately.
  • permission – Octal permission to set on the newly created directory. These permissions will only be set on directories that do not already exist.

This function currently has no return value as WebHDFS doesn’t return a meaningful flag.

parts(hdfs_path, parts=None, status=False)

Returns a dictionary of part-files corresponding to a path.

Parameters:
  • hdfs_path – Remote path. This directory should contain at most one part file per partition (otherwise one will be picked arbitrarily).
  • parts – List of part-file numbers, or total number of part-files to select. If a number, that many part-files will be chosen at random. By default all part-files are returned. If parts is a list and one of the parts is not found, or if more samples are demanded than part-files exist, an HdfsError is raised.
  • status – Also return each file’s corresponding FileStatus.
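The selection logic can be sketched against a plain directory listing. select_parts below is a hypothetical helper assuming conventional part-file names; the real method works off the remote listing, returns FileStatus values, and raises HdfsError rather than ValueError:

```python
import random
import re

def select_parts(filenames, parts=None):
  """Pick part-files from a directory listing (hypothetical sketch).

  parts may be None (all parts, in order), an integer (that many chosen
  at random), or a list of part numbers.
  """
  # Map part number -> filename, ignoring non-part entries (e.g. _SUCCESS).
  pattern = re.compile(r'^part-(?:[mr]-)?(\d+)')
  by_number = {}
  for name in filenames:
    match = pattern.match(name)
    if match:
      by_number[int(match.group(1))] = name
  if parts is None:
    numbers = sorted(by_number)
  elif isinstance(parts, int):
    if parts > len(by_number):
      raise ValueError('too many samples demanded')
    numbers = random.sample(sorted(by_number), parts)
  else:
    missing = [n for n in parts if n not in by_number]
    if missing:
      raise ValueError('part-files not found: %s' % missing)
    numbers = parts
  return {by_number[n]: None for n in numbers}  # value would be a FileStatus

listing = ['_SUCCESS', 'part-r-00000', 'part-r-00001', 'part-r-00002']
assert sorted(select_parts(listing, parts=[0, 2])) == ['part-r-00000', 'part-r-00002']
```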
read(*args, **kwds)

Read a file from HDFS.

Parameters:
  • hdfs_path – HDFS path.
  • offset – Starting byte position.
  • length – Number of bytes to be processed. None will read the entire file.
  • buffer_size – Size of the buffer in bytes used for transferring the data. Defaults to the value set in the HDFS configuration.
  • encoding – Encoding used to decode the request. By default the raw data is returned. This is mostly helpful in Python 3, for example to deserialize JSON data (as the decoder expects unicode).
  • chunk_size – If set to a positive number, the context manager will return a generator yielding every chunk_size bytes instead of a file-like object (unless delimiter is also set, see below).
  • delimiter – If set, the context manager will return a generator yielding each time the delimiter is encountered. This parameter requires the encoding to be specified.
  • progress – Callback function to track progress, called every chunk_size bytes (not available if the chunk size isn’t specified). It will be passed two arguments, the path to the file being read and the number of bytes transferred so far. On completion, it will be called once with -1 as second argument.

This method must be called using a with block:

with client.read('foo') as reader:
  content = reader.read()

This ensures that connections are always properly closed.

Note

The raw file-like object returned by this method (when called without an encoding, chunk size, or delimiter) can have a very different performance profile than local files. In particular, line-oriented methods are often slower. The recommended workaround is to specify an encoding when possible or read the entire file before splitting it.

rename(hdfs_src_path, hdfs_dst_path)

Move a file or folder.

Parameters:
  • hdfs_src_path – Source path.
  • hdfs_dst_path – Destination path. If the path already exists and is a directory, the source will be moved into it. If the path exists and is a file, or if a parent destination directory is missing, this method will raise an HdfsError.
resolve(hdfs_path)

Return absolute, normalized path, with special markers expanded.

Parameters: hdfs_path – Remote path.

Currently supported markers:

  • '#LATEST': this marker gets expanded to the most recently updated file or folder. Markers can be chained using the '{N}' suffix. For example, 'foo/#LATEST{2}' is equivalent to 'foo/#LATEST/#LATEST'.
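The '{N}' suffix is purely syntactic. A sketch of that textual rewrite (expand_latest is a hypothetical helper; resolving each expanded marker to the most recently updated entry still requires listing the remote directory):

```python
import re

def expand_latest(hdfs_path):
  """Rewrite '#LATEST{N}' into N chained '#LATEST' markers (sketch only)."""
  def repl(match):
    return '/'.join(['#LATEST'] * int(match.group(1)))
  return re.sub(r'#LATEST\{(\d+)\}', repl, hdfs_path)

assert expand_latest('foo/#LATEST{2}') == 'foo/#LATEST/#LATEST'
```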
set_acl(hdfs_path, acl_spec, clear=True)

SetAcl or ModifyAcl for a file or folder on HDFS.

Parameters:
  • hdfs_path – Path to an existing remote file or directory. An HdfsError will be raised if the path doesn’t exist.
  • acl_spec – String representation of an ACL spec. Must be a valid string with entries for user, group and other. For example: "user::rwx,user:foo:rw-,group::r--,other::---".
  • clear – Clear existing ACL entries. If set to False, all existing ACL entries that are not specified in this call are retained without changes, behaving like ModifyAcl. In that case a partial spec such as "user:foo:rwx" is also valid.
set_owner(hdfs_path, owner=None, group=None)

Change the owner of a file.

Parameters:
  • hdfs_path – HDFS path.
  • owner – Optional, new owner for file.
  • group – Optional, new group for file.

At least one of owner and group must be specified.

set_permission(hdfs_path, permission)

Change the permissions of a file.

Parameters:
  • hdfs_path – HDFS path.
  • permission – New octal permission string for the file.
set_replication(hdfs_path, replication)

Set file replication.

Parameters:
  • hdfs_path – Path to an existing remote file. An HdfsError will be raised if the path doesn’t exist or points to a directory.
  • replication – Replication factor.
set_times(hdfs_path, access_time=None, modification_time=None)

Change remote timestamps.

Parameters:
  • hdfs_path – HDFS path.
  • access_time – Timestamp of last file access.
  • modification_time – Timestamp of last file modification.
status(hdfs_path, strict=True)

Get FileStatus for a file or folder on HDFS.

Parameters:
  • hdfs_path – Remote path.
  • strict – If False, return None rather than raise an exception if the path doesn’t exist.
upload(hdfs_path, local_path, overwrite=False, n_threads=1, temp_dir=None, chunk_size=65536, progress=None, cleanup=True, **kwargs)

Upload a file or directory to HDFS.

Parameters:
  • hdfs_path – Target HDFS path. If it already exists and is a directory, files will be uploaded inside.
  • local_path – Local path to file or folder. If a folder, all the files inside of it will be uploaded (note that this implies that folders empty of files will not be created remotely).
  • overwrite – Overwrite any existing file or directory.
  • n_threads – Number of threads to use for parallelization. A value of 0 (or negative) uses as many threads as there are files.
  • temp_dir – Directory under which the files will first be uploaded when overwrite=True and the final remote path already exists. Once the upload successfully completes, it will be swapped in.
  • chunk_size – Size in bytes of the chunks by which the files will be uploaded.
  • progress – Callback function to track progress, called every chunk_size bytes. It will be passed two arguments, the path to the file being uploaded and the number of bytes transferred so far. On completion, it will be called once with -1 as second argument.
  • cleanup – Delete any uploaded files if an error occurs during the upload.
  • **kwargs – Keyword arguments forwarded to write().

On success, this method returns the remote upload path.

walk(hdfs_path, depth=0, status=False)

Depth-first walk of remote filesystem.

Parameters:
  • hdfs_path – Starting path. If the path doesn’t exist, an HdfsError will be raised. If it points to a file, the returned generator will be empty.
  • depth – Maximum depth to explore. 0 for no limit.
  • status – Also return each file or folder’s corresponding FileStatus.

This method returns a generator yielding tuples (path, dirs, files) where path is the absolute path to the current directory, dirs is the list of directory names it contains, and files is the list of file names it contains.
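The traversal contract mirrors os.walk. A sketch over an in-memory tree, with nested dicts standing in for remote directory listings (walk_tree is illustrative only, not the library's implementation):

```python
def walk_tree(tree, path='', depth=0):
  """Depth-first walk of a nested-dict 'filesystem'.

  Directories are dicts, files are any other value; depth=0 means no
  limit, matching the method documented above.
  """
  dirs = sorted(k for k, v in tree.items() if isinstance(v, dict))
  files = sorted(k for k, v in tree.items() if not isinstance(v, dict))
  yield path or '/', dirs, files
  if depth == 1:
    return  # maximum depth reached, don't descend further
  for name in dirs:
    child = '%s/%s' % (path, name)
    for item in walk_tree(tree[name], child, depth - 1 if depth else 0):
      yield item

tree = {'a': {'b.txt': 1}, 'c.txt': 2}
results = list(walk_tree(tree))
# results: [('/', ['a'], ['c.txt']), ('/a', [], ['b.txt'])]
```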

write(hdfs_path, data=None, overwrite=False, permission=None, blocksize=None, replication=None, buffersize=None, append=False, encoding=None)

Create a file on HDFS.

Parameters:
  • hdfs_path – Path where to create file. The necessary directories will be created appropriately.
  • data – Contents of file to write. Can be a string, a generator or a file object. The last two options will allow streaming upload (i.e. without having to load the entire contents into memory). If None, this method will return a file-like object and should be called using a with block (see below for examples).
  • overwrite – Overwrite any existing file or directory.
  • permission – Octal permission to set on the newly created file. Leading zeros may be omitted.
  • blocksize – Block size of the file.
  • replication – Number of replications of the file.
  • buffersize – Size of upload buffer.
  • append – Append to a file rather than create a new one.
  • encoding – Encoding used to serialize data written.

Sample usages:

from json import dump, dumps

records = [
  {'name': 'foo', 'weight': 1},
  {'name': 'bar', 'weight': 2},
]

# As a context manager:
with client.write('data/records.jsonl', encoding='utf-8') as writer:
  dump(records, writer)

# Or, passing in the serialized data directly:
client.write('data/records.jsonl', data=dumps(records), encoding='utf-8')
class hdfs.client.InsecureClient(url, user=None, **kwargs)

Bases: hdfs.client.Client

HDFS web client to use when security is off.

Parameters:
  • url – Hostname or IP address of HDFS namenode, prefixed with protocol, followed by WebHDFS port on namenode.
  • user – Default user name. Defaults to the current user (as determined by whoami).
  • **kwargs – Keyword arguments passed to the base class’ constructor.

Note that if a session argument is passed in, it will be modified in-place to support authentication.

class hdfs.client.TokenClient(url, token, **kwargs)

Bases: hdfs.client.Client

HDFS web client using Hadoop token delegation security.

Parameters:
  • url – Hostname or IP address of HDFS namenode, prefixed with protocol, followed by WebHDFS port on namenode.
  • token – Hadoop delegation token.
  • **kwargs – Keyword arguments passed to the base class’ constructor.

Note that if a session argument is passed in, it will be modified in-place to support authentication.

Extensions

The following extensions are currently available:

Kerberos

Support for clusters using Kerberos authentication.

This extension adds a new hdfs.client.Client subclass, KerberosClient, which handles authentication appropriately with Kerberized clusters:

from hdfs.ext.kerberos import KerberosClient
client = KerberosClient('http://host:port')

To expose this class to the command line interface (so that it can be used by aliases), we add the following line inside the global section of ~/.hdfscli.cfg (or wherever our configuration file is located):

autoload.modules = hdfs.ext.kerberos

Here is what our earlier configuration would look like if we updated it to support a Kerberized production grid:

[global]
default.alias = dev
autoload.modules = hdfs.ext.kerberos

[dev.alias]
url = http://dev.namenode:port

[prod.alias]
url = http://prod.namenode:port
client = KerberosClient
class hdfs.ext.kerberos.KerberosClient(url, mutual_auth='OPTIONAL', max_concurrency=1, root=None, proxy=None, timeout=None, session=None, **kwargs)

Bases: hdfs.client.Client

HDFS web client using Kerberos authentication.

Parameters:
  • url – Hostname or IP address of HDFS namenode, prefixed with protocol, followed by WebHDFS port on namenode.
  • mutual_auth – Whether to enforce mutual authentication or not (possible values: 'REQUIRED', 'OPTIONAL', 'DISABLED').
  • max_concurrency – Maximum number of allowed concurrent authentication requests. This is required since requests exceeding the threshold allowed by the server will be unable to authenticate.
  • proxy – User to proxy as.
  • root – Root path, this will be prefixed to all HDFS paths passed to the client. If the root is relative, the path will be assumed relative to the user’s home directory.
  • timeout – Connection timeouts, forwarded to the request handler. How long to wait for the server to send data before giving up, as a float, or a (connect_timeout, read_timeout) tuple. If the timeout is reached, an appropriate exception will be raised. See the requests documentation for details.
  • session – requests.Session instance, used to emit all requests.
  • **kwargs – Additional arguments passed to the underlying HTTPKerberosAuth class.

To avoid replay errors, a timeout of 1 ms is enforced between requests. If a session argument is passed in, it will be modified in-place to support authentication.

Avro

Read and write Avro files directly from HDFS.

This extension enables streaming decoding and encoding of files from and to HDFS. It requires the fastavro library.

  • AvroWriter writes Avro files on HDFS from python objects.
  • AvroReader reads Avro files from HDFS into an iterable of records.

Sample usage:

#!/usr/bin/env python
# encoding: utf-8

"""Avro extension example."""

from hdfs import Config
from hdfs.ext.avro import AvroReader, AvroWriter


# Get the default alias' client.
client = Config().get_client()

# Some sample data.
records = [
  {'name': 'Ann', 'age': 23},
  {'name': 'Bob', 'age': 22},
]

# Write an Avro File to HDFS (since our records' schema is very simple, we let
# the writer infer it automatically, otherwise we would pass it as argument).
with AvroWriter(client, 'names.avro', overwrite=True) as writer:
  for record in records:
    writer.write(record)

# Read it back.
with AvroReader(client, 'names.avro') as reader:
  schema = reader.schema # The inferred schema.
  content = reader.content # The remote file's HDFS content object.
  assert list(reader) == records # The records match!

It also features an entry point (named hdfscli-avro by default) which provides access to the above functionality from the shell. For usage examples and more information:

$ hdfscli-avro --help
class hdfs.ext.avro.AvroReader(client, hdfs_path, parts=None)

Bases: object

HDFS Avro file reader.

Parameters:
  • client – hdfs.client.Client instance.
  • hdfs_path – Remote path.
  • parts – Part-files to read, when reading a distributed file. The default is to read all part-files in order. See hdfs.client.Client.parts() for details.

The contents of the file will be decoded in a streaming manner, as the data is transferred. This makes it possible to use on files of arbitrary size. As a convenience, the content summary object of the remote file is available on the reader’s content attribute.

Usage:

with AvroReader(client, 'foo.avro') as reader:
  schema = reader.schema # The remote file's Avro schema.
  content = reader.content # Content metadata (e.g. size).
  for record in reader:
    pass # and its records
schema

Get the underlying file’s schema.

The schema will only be available after entering the reader’s corresponding with block.

class hdfs.ext.avro.AvroWriter(client, hdfs_path, schema=None, codec=None, sync_interval=None, sync_marker=None, metadata=None, **kwargs)

Bases: object

Write an Avro file on HDFS from python dictionaries.

Parameters:
  • client – hdfs.client.Client instance.
  • hdfs_path – Remote path.
  • schema – Avro schema. If not specified, the writer will try to infer it from the first record sent. There are however limitations regarding what can be inferred.
  • codec – Compression codec. The default is 'null' (no compression).
  • sync_interval – Number of bytes after which a block will be written.
  • sync_marker – 16 byte tag used for synchronization. If not specified, one will be generated at random.
  • metadata – Additional metadata to include in the container file’s header. Keys starting with 'avro.' are reserved.
  • **kwargs – Keyword arguments forwarded to hdfs.client.Client.write().

Usage:

with AvroWriter(client, 'data.avro') as writer:
  for record in records:
    writer.write(record)
schema

Avro schema.

write(record)

Store a record.

Parameters: record – Record object to store.

Only available inside the with block.

Dataframe

Read and write Pandas dataframes directly from HDFS.

#!/usr/bin/env python
# encoding: utf-8

"""Dataframe extension example."""

from hdfs import Config
from hdfs.ext.dataframe import read_dataframe, write_dataframe
import pandas as pd


# Get the default alias' client.
client = Config().get_client()

# A sample dataframe.
df = pd.DataFrame.from_records([
  {'A': 1, 'B': 2},
  {'A': 11, 'B': 23}
])

# Write dataframe to HDFS using Avro serialization.
write_dataframe(client, 'data.avro', df, overwrite=True)

# Read the Avro file back from HDFS.
_df = read_dataframe(client, 'data.avro')

# The frames match!
pd.util.testing.assert_frame_equal(df, _df)

This extension requires both the avro extension and pandas to be installed. Currently only Avro serialization is supported.

hdfs.ext.dataframe.read_dataframe(client, hdfs_path)

Read dataframe from HDFS Avro file.

Parameters:
  • client – hdfs.client.Client instance.
  • hdfs_path – Remote path to an Avro file (potentially distributed).
hdfs.ext.dataframe.write_dataframe(client, hdfs_path, df, **kwargs)

Save dataframe to HDFS as Avro.

Parameters:
  • client – hdfs.client.Client instance.
  • hdfs_path – Remote path where the dataframe will be stored.
  • df – Dataframe to store.
  • **kwargs – Keyword arguments forwarded to the Avro writer.
Configuration

Command line interface configuration module.

This module provides programmatic access to HdfsCLI’s configuration settings. In particular it exposes the ability to instantiate clients from aliases (see Config.get_client()).

class hdfs.config.Config(path=None, stream_log_level=None)

Bases: ConfigParser.RawConfigParser

Configuration class.

Parameters:
  • path – Path to configuration file. If no file exists at that location, the configuration parser will be empty. If not specified, the value of the HDFSCLI_CONFIG environment variable is used if it exists, otherwise it defaults to ~/.hdfscli.cfg.
  • stream_log_level – Stream handler log level, attached to the root logger. A false-ish value will disable this handler. This is particularly useful with the catch() function which reports exceptions as log messages.

On instantiation, the configuration object will attempt to load modules defined in the autoload global options (see Custom client support for more information).

get_client(alias=None)

Load HDFS client.

Parameters: alias – The client to look up. If not specified, the default alias (the default.alias option in the global section) will be used if available; otherwise an error will be raised.

Further calls to this method for the same alias will return the same client instance (in particular, any option changes to this alias will not be taken into account).
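The alias lookup itself is plain configuration parsing; a sketch with the standard library's configparser (lookup_alias is a hypothetical helper, and the resulting options dictionary is what would configure the client):

```python
from configparser import RawConfigParser

# A configuration mirroring the examples elsewhere in this reference.
config = RawConfigParser()
config.read_string(
  '[global]\n'
  'default.alias = dev\n'
  '\n'
  '[dev.alias]\n'
  'url = http://dev.namenode:port\n'
)

def lookup_alias(config, alias=None):
  """Resolve an alias to its '<alias>.alias' section's options (sketch)."""
  if alias is None:
    if not config.has_option('global', 'default.alias'):
      raise ValueError('no alias specified and no default.alias set')
    alias = config.get('global', 'default.alias')
  section = '%s.alias' % alias
  return dict(config.items(section))

options = lookup_alias(config)
# options == {'url': 'http://dev.namenode:port'}
```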

get_log_handler(command)

Configure and return log handler.

Parameters: command – The command to load the configuration for. All options will be looked up in the [COMMAND.command] section. This is currently only used for configuring the file handler for logging. If logging is disabled for the command, a NullHandler will be returned; otherwise, a TimedRotatingFileHandler.
class hdfs.config.NullHandler(level=0)

Bases: logging.Handler

Pass-through logging handler.

This is required for python <2.7.


emit(record)

Do nothing.

hdfs.config.catch(*error_classes)

Returns a decorator that catches errors and prints messages to stderr.

Parameters: *error_classes – Error classes.

Also exits with status 1 if any errors are caught.
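A minimal sketch of a decorator with the behavior described above (illustrative only, not the library's implementation):

```python
import sys
from functools import wraps

def catch(*error_classes):
  """Return a decorator reporting the given errors and exiting with 1."""
  def decorator(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
      try:
        return func(*args, **kwargs)
      except error_classes as err:
        sys.stderr.write('%s\n' % err)  # message goes to stderr
        sys.exit(1)  # exit with status 1 on any caught error
    return wrapper
  return decorator

@catch(ValueError)
def parse(value):
  return int(value)

assert parse('12') == 12
# parse('not a number') would print the error and exit with status 1.
```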

Utilities

Common utilities.

class hdfs.util.AsyncWriter(consumer)

Bases: object

Asynchronous publisher-consumer.

Parameters: consumer – Function which takes a single generator as argument.

This class can be used to transform functions which expect a generator into file-like writer objects. This can make it possible to combine different APIs together more easily. For example, to send streaming requests:

import requests as rq

with AsyncWriter(lambda data: rq.post(URL, data=data)) as writer:
  writer.write('Hello, world!')
flush()

Pass-through implementation.

write(chunk)

Stream data to the underlying consumer.

Parameters:chunk – Bytes to write. These will be buffered in memory until the consumer reads them.
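One way such a writer can be built is with a thread and a queue; MiniAsyncWriter below is an illustrative sketch of the idea, not the library's implementation:

```python
import queue
import threading

class MiniAsyncWriter:
  """Expose a file-like write() while feeding a generator-consuming
  function on a background thread (sketch of the publisher-consumer)."""

  _EOF = object()  # sentinel marking the end of the stream

  def __init__(self, consumer):
    self._queue = queue.Queue()
    def generator():
      while True:
        chunk = self._queue.get()
        if chunk is self._EOF:
          return
        yield chunk
    self._thread = threading.Thread(target=consumer, args=(generator(),))

  def __enter__(self):
    self._thread.start()
    return self

  def __exit__(self, *exc_info):
    self._queue.put(self._EOF)  # signal end of stream
    self._thread.join()

  def write(self, chunk):
    self._queue.put(chunk)  # buffered until the consumer reads it

received = []
with MiniAsyncWriter(lambda data: received.extend(data)) as writer:
  writer.write('Hello, ')
  writer.write('world!')
assert ''.join(received) == 'Hello, world!'
```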
exception hdfs.util.HdfsError(message, *args)

Bases: exceptions.Exception

Base error class.

Parameters:
  • message – Error message.
  • args – Optional message formatting arguments.
hdfs.util.temppath(*args, **kwds)

Create a temporary path.

Parameters: dpath – Explicit directory under which to create the temporary path. A system-dependent default will be used otherwise (cf. tempfile.mkstemp).

Usage:

with temppath() as path:
  pass # do stuff

Any file or directory corresponding to the path will be automatically deleted afterwards.
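A sketch of how such a context manager can be built with the standard library (illustrative; the library's implementation may differ):

```python
import contextlib
import os
import shutil
import tempfile

@contextlib.contextmanager
def temppath(dpath=None):
  """Yield a path that doesn't exist yet and delete whatever ends up
  there afterwards (sketch of the behavior documented above)."""
  (desc, path) = tempfile.mkstemp(dir=dpath)
  os.close(desc)
  os.remove(path)  # hand back a fresh path, not an open file
  try:
    yield path
  finally:
    if os.path.isdir(path):
      shutil.rmtree(path)  # a directory may have been created there
    elif os.path.exists(path):
      os.remove(path)

with temppath() as path:
  with open(path, 'w') as writer:
    writer.write('temporary')
assert not os.path.exists(path)
```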