PyTables User's Guide: Hierarchical datasets in Python - Release 1.3.2
Chapter 1. Introduction
The hierarchical model of the underlying HDF5 library allows PyTables to manage tables and arrays in a tree-like structure. In order to achieve this, an object tree entity is dynamically created imitating the HDF5 structure on disk. The HDF5 objects are read by walking through this object tree. You can get a good picture of what kind of data is kept in the object by examining the metadata nodes.
The different nodes in the object tree are instances of PyTables classes. There are several types of classes, but the most important ones are the Node, Group and Leaf classes. All nodes in a PyTables tree are instances of the Node class. Group and Leaf classes are descendants of Node. Group instances (referred to as groups from now on) are a grouping structure containing zero or more groups or leaves, together with supplementary metadata. Leaf instances (referred to as leaves) are containers for actual data and cannot contain further groups or leaves. The Table, Array, CArray, EArray, VLArray and UnImplemented classes are descendants of Leaf, and inherit all its properties.
Working with groups and leaves is similar in many ways to working with directories and files on a Unix filesystem, i.e. a node (file or directory) is always a child of one and only one group (directory), its parent group[1]. Inside that group, the node is accessed by its name. As is the case with Unix directories and files, objects in the object tree are often referenced by giving their full (absolute) path names. In PyTables this full path can be specified either as a string (such as '/subgroup2/table3', using / as a parent/child separator) or as a complete object path written in a format known as the natural name schema (such as file.root.subgroup2.table3).
Support for natural naming is a key aspect of PyTables. It means that the names of the instance variables of a node object are the same as the names of that node's children[2]. This is very Pythonic and intuitive in many cases. See section 3.1.6 of the tutorial for usage examples.
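To make the two addressing styles concrete, here is a minimal sketch of a toy tree in plain Python. It is not the PyTables implementation: the Node, Group and Leaf classes below only borrow the names used in this chapter, and the add() method, the pathname property and the get_node() helper are hypothetical names chosen for illustration. It assumes that attribute lookup falling back to a dictionary of children is a fair approximation of how natural naming behaves.

```python
# Toy model only: these classes imitate the ideas in the text and are
# NOT the PyTables classes of the same name.

class Node(object):
    """Anything in the tree: it has a name and exactly one parent."""
    def __init__(self, name):
        self.name = name
        self.parent = None  # set when the node is added to a group

    @property
    def pathname(self):
        # Build the absolute path by walking up to the root.
        if self.parent is None:
            return "/"
        prefix = self.parent.pathname
        if prefix == "/":
            return prefix + self.name
        return prefix + "/" + self.name


class Leaf(Node):
    """Container for actual data; it cannot hold other nodes."""
    def __init__(self, name, data):
        super(Leaf, self).__init__(name)
        self.data = data


class Group(Node):
    """Grouping structure holding zero or more groups or leaves."""
    def __init__(self, name):
        super(Group, self).__init__(name)
        self.children = {}

    def add(self, child):
        # Hypothetical helper: register the child under its name.
        child.parent = self
        self.children[child.name] = child
        return child

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so the
        # children's names behave like instance variables.
        children = self.__dict__.get("children", {})
        if name in children:
            return children[name]
        raise AttributeError(name)


def get_node(root, path):
    """Resolve a string path such as '/subgroup2/table3' from the root."""
    node = root
    for part in path.strip("/").split("/"):
        if part:
            node = node.children[part]
    return node


root = Group("/")
subgroup2 = root.add(Group("subgroup2"))
table3 = subgroup2.add(Leaf("table3", data=[1, 2, 3]))

# Both styles reach the very same object.
by_path = get_node(root, "/subgroup2/table3")
by_name = root.subgroup2.table3
```

In this sketch by_path and by_name are the same object, which is the point of natural naming: the attribute chain is just another spelling of the absolute path.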
You should also be aware that not all the data present in a file is loaded into the object tree. Only the metadata (i.e. special data that describes the structure of the actual data) is loaded. The actual data is not read until you request it (by calling a method on a particular node). Using the object tree (the metadata) you can retrieve information about the objects on disk such as table names, titles, column names, data types in columns, numbers of rows, or, in the case of arrays, their shapes, typecodes, etc. You can also search through the tree for specific kinds of data, and then read and process it. In a certain sense, you can think of PyTables as a tool that applies the same introspection capabilities of Python objects to large amounts of data in persistent storage.
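The split between metadata and data can be pictured with a small stand-alone sketch. The LazyLeaf class, its reads counter and the loader callables are all hypothetical and stand in for a real read from disk; nothing here is PyTables code. The sketch only assumes what the paragraph above states: metadata is available immediately, while the data is fetched when a method is called on the node.

```python
class LazyLeaf(object):
    """Toy leaf: metadata lives in memory, data is fetched on read()."""
    def __init__(self, name, shape, loader):
        self.name = name          # metadata
        self.shape = shape        # metadata
        self._loader = loader     # hypothetical stand-in for a disk read
        self.reads = 0            # counts how often the data was fetched

    def read(self):
        # The actual data is only touched here, on request.
        self.reads += 1
        return self._loader()


leaves = [
    LazyLeaf("array1", (2,), lambda: ["this is", "a string array"]),
    LazyLeaf("array2", (4,), lambda: [1, 2, 3, 4]),
]

# Search using metadata only; no data has been read at this point.
wanted = [leaf for leaf in leaves if leaf.shape == (4,)]

# Read just the leaf that matched.
values = wanted[0].read()
```

Only the matching leaf is ever read, so the cost of browsing depends on the amount of metadata and not on the size of the stored data.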
It is worth noting that, from version 1.2 on, PyTables includes a node cache system that loads nodes on demand and unloads nodes that have not been used for some time (i.e. following a Least Recently Used scheme). This feature allows opening HDF5 files with large hierarchies very quickly and with low memory consumption, while retaining all the powerful browsing capabilities of the previous implementation of the object tree. See [] for more details about the advantages introduced by this new node cache system.
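The general Least Recently Used idea can be sketched in a few lines of plain Python. This is an illustration of the technique only, not the cache that PyTables ships: the NodeCache class, its max_nodes parameter and the load_node callable are hypothetical, and the sketch assumes max_nodes is at least 1.

```python
from collections import OrderedDict


class NodeCache(object):
    """Toy LRU cache: loads nodes on demand, evicts the stalest one."""
    def __init__(self, max_nodes, load_node):
        self.max_nodes = max_nodes      # assumed to be >= 1
        self._load_node = load_node     # hypothetical loader, called on a miss
        self._nodes = OrderedDict()     # oldest entry first, newest last

    def get(self, path):
        if path in self._nodes:
            # Hit: take the node out so it can be re-inserted as newest.
            node = self._nodes.pop(path)
        else:
            # Miss: load on demand, making room first if the cache is full.
            node = self._load_node(path)
            if len(self._nodes) >= self.max_nodes:
                self._nodes.popitem(last=False)  # drop least recently used
        self._nodes[path] = node
        return node

    def cached_paths(self):
        return list(self._nodes)


loads = []

def load_node(path):
    # Stand-in for rebuilding a node from the metadata on disk.
    loads.append(path)
    return {"path": path}


cache = NodeCache(2, load_node)
cache.get("/group1")
cache.get("/group2")
cache.get("/group1")   # hit: /group1 becomes the most recently used
cache.get("/array1")   # miss on a full cache: /group2 is evicted
```

After these calls the cache holds /group1 and /array1, and the loader has run three times; asking for /group2 again would simply reload it, which is why a large hierarchy can be browsed without keeping every node in memory.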
To better understand the dynamic nature of this object tree entity, let's start with a sample PyTables script (you can find it in examples/objecttree.py) to create an HDF5 file:
from tables import *
class Particle(IsDescription):
    identity = StringCol(length=22, dflt=" ", pos = 0)  # character String
    idnumber = Int16Col(1, pos = 1)  # short integer
    speed    = Float32Col(1, pos = 2)  # single-precision
# Open a file in "w"rite mode
fileh = openFile("objecttree.h5", mode = "w")
# Get the HDF5 root group
root = fileh.root
# Create the groups:
group1 = fileh.createGroup(root, "group1")
group2 = fileh.createGroup(root, "group2")
# Now, create an array in the root group
array1 = fileh.createArray(root, "array1",
                           ["this is", "a string array"], "String array")
# Create 2 new tables in group1 and group2
table1 = fileh.createTable(group1, "table1", Particle)
table2 = fileh.createTable("/group2", "table2", Particle)
# Create one more Array in group1
array2 = fileh.createArray("/group1", "array2", [1,2,3,4])
# Now, fill the tables:
for table in (table1, table2):
    # Get the record object associated with the table:
    row = table.row
    # Fill the table with 10 records
    for i in xrange(10):
        # First, assign the values to the Particle record
        row['identity']  = 'This is particle: %2d' % (i)
        row['idnumber'] = i
        row['speed']  = i * 2.
        # This injects the Record values
        row.append()
    # Flush the table buffers
    table.flush()
# Finally, close the file (this also will flush all the remaining buffers!)
fileh.close()
This small program creates a simple HDF5 file called objecttree.h5 with the structure that appears in figure 1.1[3]. When the file is created, the metadata in the object tree is updated in memory while the actual data is saved to disk. When you close the file the object tree is no longer available. However, when you reopen this file the object tree will be reconstructed in memory from the metadata on disk, allowing you to work with it in exactly the same way as when you originally created it.
In figure 1.2 you can see an example of the object tree created when the above objecttree.h5 file is read (in fact, such an object tree is always created when reading any supported generic HDF5 file). It is worth taking the time to understand it[4], as this will help you avoid programming mistakes.
| [1] | PyTables does not support hard links – for the moment. | 
| [2] | I got this simple but powerful idea from the excellent Objectify module by David Mertz (see []) | 
| [3] | We have used ViTables (see []) in order to create this snapshot. | 
| [4] | Bear in mind, however, that this diagram is not a standard UML class diagram; it is rather meant to show the connections between the PyTables objects and some of its most important attributes and methods. |