Quickstart

This page first goes through the steps required to configure HdfsCLI’s command line interface, then gives an overview of the Python API. If you are only interested in using HdfsCLI as a library, feel free to jump ahead to the Python bindings section.

Configuration

HdfsCLI uses aliases to figure out how to connect to different HDFS clusters. These are defined in HdfsCLI’s configuration file, located by default at ~/.hdfscli.cfg (or at the path given by the HDFSCLI_CONFIG environment variable). See below for a sample configuration defining two aliases, dev and prod:

[global]
default.alias = dev

[dev.alias]
url = http://dev.namenode:port
user = ann

[prod.alias]
url = http://prod.namenode:port
root = /jobs/

Each alias is defined as its own ALIAS.alias section, which must contain at least a url option with the URL to the namenode (including protocol and port). All other options can be omitted. If specified, the client option determines which hdfs.client.Client class to use, and the remaining options are passed as keyword arguments to the appropriate constructor. The currently available client classes are InsecureClient (the default) and TokenClient.

See the Kerberos extension to enable the KerberosClient and Custom client support to learn how to use other client classes.

Finally, note the default.alias entry in the global configuration section, which names the alias to use when none is specified.
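
To make the layout of the file concrete, here is a minimal sketch that parses the sample configuration above with Python's standard configparser module and resolves an alias, falling back to default.alias. The resolve_alias helper and the SAMPLE constant are hypothetical names used for illustration only; this is not how HdfsCLI itself loads its configuration.

```python
from configparser import ConfigParser

# The sample configuration shown above, inlined so the sketch is self-contained.
SAMPLE = """
[global]
default.alias = dev

[dev.alias]
url = http://dev.namenode:port
user = ann

[prod.alias]
url = http://prod.namenode:port
root = /jobs/
"""

def resolve_alias(config_text, alias=None):
  """Return (alias name, options dict) for the given alias, or the default one."""
  parser = ConfigParser()
  parser.read_string(config_text)
  # Fall back to the default.alias entry of the global section.
  name = alias or parser.get('global', 'default.alias')
  section = '%s.alias' % name
  if not parser.has_section(section):
    raise KeyError('no such alias: %s' % name)
  options = dict(parser.items(section))
  if 'url' not in options:
    raise ValueError('alias %s is missing the required url option' % name)
  return name, options

print(resolve_alias(SAMPLE))
print(resolve_alias(SAMPLE, 'prod'))
```

Every option other than url (here user and root) ends up in the options dictionary, mirroring how the remaining options are handed to the client constructor as keyword arguments.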

Command line interface

HdfsCLI comes by default with a single entry point, hdfscli, which provides a convenient interface to perform common actions. All its commands accept an --alias argument (described above), which selects the cluster to operate against.

Downloading and uploading files

HdfsCLI supports transparently downloading and uploading files and folders from and to HDFS (the degree of parallelism can be set with the --threads option).

$ # Write a single file to HDFS.
$ hdfscli upload --alias=dev weights.json models/
$ # Read all files inside a folder from HDFS and store them locally.
$ hdfscli download export/results/ "results-$(date +%F)"

If reading (resp. writing) a single file, its contents can also be streamed to standard out (resp. from standard in) by using - as path argument:

$ # Read a file from HDFS and append its contents to a local log file.
$ hdfscli download logs/1987-03-23.txt - >>logs

By default, HdfsCLI will raise an error when asked to write to an existing path (either locally or on HDFS). We can force the path to be overwritten with the --force option.

Interactive shell

The interactive command (also used when no command is specified) will create an HDFS client and expose it inside a Python shell (using IPython if available). This makes it convenient to perform file system operations on HDFS and interact with its data. See Python bindings below for an overview of the methods available.

$ hdfscli --alias=dev

Welcome to the interactive HDFS python shell.
The HDFS client is available as `CLIENT`.

In [1]: CLIENT.list('data/')
Out[1]: ['1.json', '2.json']

In [2]: CLIENT.status('data/2.json')
Out[2]: {
  'accessTime': 1439743128690,
  'blockSize': 134217728,
  'childrenNum': 0,
  'fileId': 16389,
  'group': 'supergroup',
  'length': 2,
  'modificationTime': 1439743129392,
  'owner': 'drwho',
  'pathSuffix': '',
  'permission': '755',
  'replication': 1,
  'storagePolicy': 0,
  'type': 'FILE'
}

In [3]: CLIENT.delete('data/2.json')
Out[3]: True

Using the full power of Python lets us easily perform more complex operations such as renaming folders which match some pattern, deleting files which haven’t been accessed for some duration, finding all paths owned by a certain user, etc.
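
As an illustration of the second case, the sketch below collects the files directly under a folder whose last access predates a cutoff. It relies on list() accepting status=True to return (name, status) pairs, and on the accessTime (milliseconds since the epoch) and type keys shown in the status output above. The stale_files and days_ago_ms helpers are hypothetical names, not part of HdfsCLI.

```python
import time

def days_ago_ms(days):
  """Timestamp `days` days in the past, in milliseconds since the epoch."""
  return int((time.time() - days * 24 * 3600) * 1000)

def stale_files(client, folder, cutoff_ms):
  """Names of files directly under `folder` last accessed before `cutoff_ms`."""
  stale = []
  for name, status in client.list(folder, status=True):
    # Skip directories; only files are candidates for deletion.
    if status['type'] == 'FILE' and status['accessTime'] < cutoff_ms:
      stale.append(name)
  return stale

# Inside the interactive shell, where the client is available as CLIENT:
#   for name in stale_files(CLIENT, 'data/', days_ago_ms(30)):
#     CLIENT.delete('data/' + name)
```

Returning the names rather than deleting them inside the helper leaves room to inspect the list before removing anything.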

More

See hdfscli --help for the full list of commands and options.

Python bindings

Instantiating a client

The simplest way of getting an hdfs.client.Client instance is by using the Interactive shell described above, where the client will be automatically available. To instantiate a client programmatically, there are two options:

The first is to import the client class and call its constructor directly. This is the most straightforward and flexible, but doesn’t let us reuse our configured aliases:

from hdfs import InsecureClient
client = InsecureClient('http://host:port', user='ann')

The second leverages the hdfs.config.Config class to load an existing configuration file (defaulting to the same one as the CLI) and create clients from existing aliases:

from hdfs import Config
client = Config().get_client('dev')

Reading and writing files

The read() method provides a file-like interface for reading files from HDFS. It must be used in a with block (making sure that connections are always properly closed):

# Loading a file in memory.
with client.read('features') as reader:
  features = reader.read()

# Directly deserializing a JSON object.
from json import load
with client.read('model.json', encoding='utf-8') as reader:
  model = load(reader)

If a chunk_size argument is passed, the method will return a generator instead, making it sometimes simpler to stream the file’s contents.

# Stream a file.
with client.read('features', chunk_size=8096) as reader:
  for chunk in reader:
    pass

Similarly, if a delimiter argument is passed, the method will return a generator of the delimited chunks.

with client.read('samples.csv', encoding='utf-8', delimiter='\n') as reader:
  for line in reader:
    pass
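
Since both forms above yield plain generators, the reader can be handed to any function that consumes an iterable. As a minimal sketch, the hypothetical helper sha256_of_chunks below computes a checksum without holding the whole file in memory; it assumes no encoding argument is passed to read(), so that each chunk is a bytes object.

```python
import hashlib

def sha256_of_chunks(chunks):
  """Hex SHA-256 digest of an iterable of bytes chunks."""
  digest = hashlib.sha256()
  for chunk in chunks:
    digest.update(chunk)
  return digest.hexdigest()

# Intended usage with a client (needs a reachable cluster):
#   with client.read('features', chunk_size=8096) as reader:
#     checksum = sha256_of_chunks(reader)
```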

Writing files to HDFS is done using the write() method which returns a file-like writable object:

# Writing part of a file.
with open('samples') as reader, client.write('samples') as writer:
  for line in reader:
    if line.startswith('-'):
      writer.write(line)

# Writing a serialized JSON object.
from json import dump
with client.write('model.json', encoding='utf-8') as writer:
  dump(model, writer)

For convenience, it is also possible to pass an iterable data argument directly to the method.

# This is equivalent to the JSON example above.
from json import dumps
client.write('model.json', dumps(model))
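
The data argument is not limited to strings. The sketch below builds a generator that serializes records one line at a time, so that a large collection never has to be held in memory as a single string. The jsonl_lines helper and the records.jsonl path are hypothetical, and the commented call assumes that write() encodes each yielded string when an encoding is given.

```python
import json

def jsonl_lines(records):
  """Yield one newline-terminated JSON document per record."""
  for record in records:
    yield json.dumps(record, sort_keys=True) + '\n'

# Intended usage with a client (needs a reachable cluster):
#   client.write('records.jsonl', data=jsonl_lines(records), encoding='utf-8')
```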

Exploring the file system

All Client subclasses expose a variety of methods to interact with HDFS. Most are modeled directly after the WebHDFS operations; a few of these are shown in the snippet below:

# Retrieving a file or folder content summary.
content = client.content('dat')

# Listing all files inside a directory.
fnames = client.list('dat')

# Retrieving a file or folder status.
status = client.status('dat/features')

# Renaming ("moving") a file.
client.rename('dat/features', 'features')

# Deleting a file or folder.
client.delete('dat', recursive=True)

Other methods build on these to provide more advanced features:

# Download a file or folder locally.
client.download('dat', 'dat', n_threads=5)

# Get all files under a given folder (arbitrary depth).
import posixpath as psp
fpaths = [
  psp.join(dpath, fname)
  for dpath, _, fnames in client.walk('predictions')
  for fname in fnames
]

See the API reference for the comprehensive list of methods available.

Checking path existence

Most of the methods described above will raise an HdfsError if called on a missing path. The recommended way of checking whether a path exists is using the content() or status() methods with a strict=False argument (in which case they will return None on a missing path).
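
The check described above can be wrapped in a small helper. This is a minimal sketch; path_exists is a hypothetical name and not a method provided by the client.

```python
def path_exists(client, hdfs_path):
  """True if `hdfs_path` exists, using status() with strict=False."""
  # With strict=False, status() returns None instead of raising on a missing path.
  return client.status(hdfs_path, strict=False) is not None

# Intended usage with a client (needs a reachable cluster):
#   if not path_exists(client, 'dat/features'):
#     client.upload('dat/features', 'features')
```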

More

See the Advanced usage section to learn more.