POI Filesystem format

Introduction

The POI file format is essentially an archive wrapper around files. It is intended to mimic a filesystem. For the remainder of this document it is referred to as a filesystem in order to avoid confusion with the "files" it contains.

POI filesystems are compatible with the document formats used by a well-known software company's popular office productivity suite and by programs that output compatible data. Because the POI filesystem does not provide compression, encryption or any other worthwhile feature, it's not a good choice unless you require interoperability with these programs.

The POI filesystem does not encode the documents themselves. For example, if you had a word processor file with the extension ".doc", you would actually have a POI filesystem with a document file archived inside of the filesystem.

Document Conventions

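This document utilizes the numeric types as described by the Java Language Specification, which can be found at java.sun.com. In short:

  byte: 8-bit signed integer
  short: 16-bit signed integer
  int: 32-bit signed integer
  long: 64-bit signed integer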

The Java Language Specification spells out a number of other types that are not referred to by this document.

Where this document makes references to "endian conversion" it is referring to the byte order of stored numbers. Numbers in "little-endian order" are stored with the LEAST significant byte first. To read a short, for example, you'd read two bytes, shift the second byte 8 bits to the left, and then OR it with the first byte, masking the first byte with 0xff to strip its "sign". The following code illustrates this method:

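public int getShort(byte[] rec) {
    // rec[1] is the high byte and keeps its sign; rec[0] is the low byte,
    // masked with 0xff so its sign extension does not leak into the result.
    return (rec[1] << 8) | (rec[0] & 0xff);
}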

Filesystem Introduction

POI filesystems are essentially normal files stored on a Java-compatible platform's native filesystem. They are identified by names ending in a four character identifier noting what type of data they contain. For example, a file ending in ".xls" would likely contain spreadsheet data, and a file ending in ".doc" would probably contain a word processing document. POI filesystems are called "filesystems" because they contain multiple embedded files in a manner similar to traditional filesystems. Along functional lines, it would be more accurate to call these POI archives.

POI filesystems do not provide encryption, compression, or any other feature of a modern archive and are therefore a poor choice for implementing new file formats. It is suggested that POI filesystems are most useful for interoperability with legacy applications that use a compatible file format.

Filesystem Walkthrough

This is a walkthrough of a POI filesystem and how it is put together. It is not intended to give a concise description but to give a "big picture" of the general structure and how it's interpreted.

A POI filesystem begins with a header. This header identifies locations in the file by function and provides a sanity check identifying a native filesystem file as indeed a POI filesystem.

The first 64 bits of the header compose a magic number identifier. This identifier tells the client software that this is indeed a POI filesystem and that it should be treated as such. This is a "sanity check" to make sure this is a POI filesystem and not some other format. The header also contains an array of block numbers. These block numbers refer to blocks in the file. When these blocks are read together they form the Block Allocation Table. The header also contains a pointer to the first element in the property table also known as the root element, and a pointer to the small Block Allocation Table (SBAT).

The Block Allocation Table (BAT), along with the property table, specifies which blocks in the filesystem belong to which files. It is somewhat hard to conceptualize the Block Allocation Table at first. It is essentially an array of integers that point at each other. These elements form chains.

To read the block allocation table you must first read the start block of the file from the property table. This is both your index for the next element in the BAT array as well as the index of the first block in your file. For instance: if the start block from your file's property is 0 then you read block 0 (the first block after the header) from your filesystem as the first block of your file. You also read element 0 from the BAT array. Supposing this element has a value equal to 2, you'd read block 2 from your filesystem as the next block of your file and element 2 from your BAT array. This will be covered further later in this document.

The Property Table is essentially the directory structure for the filesystem. It consists of the name of the file or directory, its start block in both the filesystem and BAT, and its actual size. The first property in the property table is the root element. Its real purpose is to hold the start block for the small blocks.

Filesystem Structure

All values in the POI filesystem are stored in "little-endian" order, meaning you must reverse the order of the bytes before assigning them to variables. Assume the values you see below are originally stored backwards.
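As a rough illustration of the byte reversal described above, the following sketch lets java.nio.ByteBuffer do the little-endian conversion instead of shifting bytes by hand. The class name LittleEndianSketch and its method names are made up for this example and are not part of any POI API.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper: reads little-endian values out of a raw byte array.
final class LittleEndianSketch {
    // Reads a 16-bit short stored least significant byte first.
    static short getShort(byte[] data, int offset) {
        return ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).getShort(offset);
    }

    // Reads a 32-bit int stored least significant byte first.
    static int getInt(byte[] data, int offset) {
        return ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).getInt(offset);
    }

    // Reads a 64-bit long stored least significant byte first.
    static long getLong(byte[] data, int offset) {
        return ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).getLong(offset);
    }
}
```

For instance, the four stored bytes 00 10 00 00 read back as the int 0x00001000 (4096), and FE FF FF FF reads back as -2.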

The POI filesystem is divided into 512 byte blocks. Each block has an implicit block-type. The order and description of these are given below.

Header Block

The POI filesystem begins with a header block. The first 64 bits of the header form a long file type id or magic number identifier of 0xE11AB1A1E011CFD0L. This is basically a sanity check. If this isn't the first thing in the header (and consequently the filesystem) then this is not a POI filesystem and should be read with some other library.
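A minimal sketch of this sanity check, assuming the first bytes of the file have already been read into a byte array. The class and method names are hypothetical. Because the value is stored little-endian, the bytes appear on disk as D0 CF 11 E0 A1 B1 1A E1.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper: checks the 64-bit magic number at the start of the header.
final class MagicCheckSketch {
    static final long FILETYPE = 0xE11AB1A1E011CFD0L;

    // Returns true if the first 8 bytes form the POI filesystem magic number.
    static boolean looksLikePoiFilesystem(byte[] header) {
        if (header == null || header.length < 8) {
            return false;
        }
        long id = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN).getLong(0);
        return id == FILETYPE;
    }
}
```

A caller would typically read the first 512 bytes of the file, pass them to looksLikePoiFilesystem, and hand the file to some other library if the check fails.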

The most important parts of the header are discussed in the rest of this section.

BATs

At offset 0x2C is an int specifying the number of elements in the BAT array. The array itself starts at offset 0x4C and is an array of ints. This array contains the indices of every block in the Block Allocation Table.
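The two header fields above could be read along these lines. This is only a sketch: the names are invented for the example, and it deliberately reads only the entries that fit in the header (the array runs from 0x4C to the end of the 512-byte block, which leaves room for 109 ints).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper: extracts the BAT block indices stored in the header block.
final class HeaderBatSketch {
    static final int BAT_COUNT_OFFSET = 0x2C;
    static final int BAT_ARRAY_OFFSET = 0x4C;
    static final int MAX_HEADER_BAT_ENTRIES = (0x200 - BAT_ARRAY_OFFSET) / 4; // 109

    static int[] readBatBlockIndices(byte[] header) {
        if (header.length < 0x200) {
            throw new IllegalArgumentException("header block must be 512 bytes");
        }
        ByteBuffer buf = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN);
        int count = buf.getInt(BAT_COUNT_OFFSET);
        // Entries beyond the header's capacity would have to come from elsewhere.
        int inHeader = Math.max(0, Math.min(count, MAX_HEADER_BAT_ENTRIES));
        int[] indices = new int[inHeader];
        for (int i = 0; i < inHeader; i++) {
            indices[i] = buf.getInt(BAT_ARRAY_OFFSET + 4 * i);
        }
        return indices;
    }
}
```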

XBATs

Very large POI archives may have more blocks than can be addressed by the BAT blocks enumerated in the header block. How large? Well, the BAT array in the header can contain up to 109 BAT block indices; each BAT block references up to 128 blocks, and each block is 512 bytes, so we're talking about 109 * 128 * 512 = 6.8MB. That's a pretty respectable document! But, you could have much more data than that, and in today's world of cheap gigabyte drives, why not? So, the BAT may be extended in that event. The integer value at offset 0x44 of the header is the index of the first extended BAT (XBAT) block. At offset 0x48 of the header, there is an int value that specifies how many XBAT blocks there are. The XBAT blocks begin at the specified index into the array of blocks making up the POI filesystem, and continue in sequence for the specified count of XBAT blocks.

Each XBAT block contains the indices of up to 128 BAT blocks, so the document size can be expanded by another 8MB for each XBAT block. The BAT blocks indexed by an XBAT block are appended to the end of the list of BAT blocks enumerated in the header block. Thus the BAT blocks enumerated in the header block are BAT blocks 0 through 108, the BAT blocks enumerated in the first XBAT block are BAT blocks 109 through 236, the BAT blocks enumerated in the second XBAT block are BAT blocks 237 through 364, and so on.

Through the use of XBAT blocks, the limit on the overall document size is that imposed by the 4-byte block indices; if the indices are unsigned ints, the maximum file size is 2 terabytes, 1 terabyte if the indices are treated as signed ints. Either way, I have yet to see a disk drive large enough to accommodate such a file on the shelves at the local office supply stores.
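The arithmetic in this section can be checked with a few lines of code. The sketch below simply restates the figures given above (109 header slots, 128 indices per block, 512 bytes per block); the class and method names are illustrative only.

```java
// Hypothetical helper: reproduces the size limits discussed in the XBATs section.
final class CapacitySketch {
    static final long BLOCK_SIZE = 512;
    static final long ENTRIES_PER_BLOCK = BLOCK_SIZE / 4; // 128 ints per block
    static final long HEADER_BAT_SLOTS = 109;

    // Bytes addressable with the header's BAT array plus the given number of XBAT blocks.
    static long maxBytes(long xbatBlocks) {
        long batBlocks = HEADER_BAT_SLOTS + xbatBlocks * ENTRIES_PER_BLOCK;
        return batBlocks * ENTRIES_PER_BLOCK * BLOCK_SIZE;
    }

    // Ordinal of the first BAT block listed in the n-th XBAT block (n starts at 0).
    static long firstBatOrdinalInXbat(long n) {
        return HEADER_BAT_SLOTS + n * ENTRIES_PER_BLOCK;
    }

    // Upper bound imposed by 4-byte block indices, unsigned or signed.
    static long indexLimitBytes(boolean unsigned) {
        return (unsigned ? 1L << 32 : 1L << 31) * BLOCK_SIZE;
    }
}
```

Here maxBytes(0) is 7,143,424 bytes (about 6.8MB), each additional XBAT block adds 8,388,608 bytes (8MB), and indexLimitBytes(true) is 2^41 bytes, the 2 terabytes mentioned above.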

SBATs

If a file contained in a POI archive is smaller than 4096 bytes, it is stored in small blocks. Small blocks are 64 bytes in length and are contained within big blocks, up to 8 to a big block. As the main BAT is used to navigate the array of big blocks, so the small block allocation table is used to navigate the array of small blocks. The SBAT's start block index is found at offset 0x3C of the header block, and remaining blocks constituting the SBAT are found by walking the main BAT as if it were an ordinary file in the POI filesystem (this process is described below).

Property Table Start Index

An integer at address 0x30 specifies the start index of the property table. This integer is specified as a "block index". The Property Table is stored, as is almost everything in a POI file system, in big blocks and walked via the BAT. The Property Table is described below.
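To make "block index" concrete: since block 0 is the first block after the 512-byte header, a block index can be turned into a byte offset within the native file as sketched below. The names are hypothetical, and the sketch assumes the 512-byte big block size used throughout this document.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper: converts block indices into byte offsets in the native file.
final class BlockOffsetSketch {
    static final int BLOCK_SIZE = 512;
    static final int PROPERTIES_START_OFFSET = 0x30;

    // Block 0 starts right after the header, so skip one block for the header.
    static long offsetOfBlock(int blockIndex) {
        if (blockIndex < 0) {
            throw new IllegalArgumentException("not a real block index: " + blockIndex);
        }
        return (blockIndex + 1L) * BLOCK_SIZE;
    }

    // Byte offset of the first block of the property table, per the header field at 0x30.
    static long propertyTableOffset(byte[] header) {
        int start = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN)
                .getInt(PROPERTIES_START_OFFSET);
        return offsetOfBlock(start);
    }
}
```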

Property Table

The property table is essentially nothing more than the directory system. Properties are 128 byte records contained within the 512 byte blocks. The first property is always the Root Entry. The following applies to individual properties within a property table:

At offset 0x00 in the property is the "name". This is stored as an uncompressed 16-bit Unicode string. In short, every other byte corresponds to an "ASCII" character. The size of this string is stored at offset 0x40 (string size) as a short.

At offset 0x42 is the property type (byte). The type is 1 for directory, 2 for file or 5 for the Root Entry.

At offset 0x43 is the node color (byte). The color is either 1 (black) or 0 (red). Properties are apparently meant to be arranged in a red-black binary tree, subject to the following rules:

  1. The root of the tree is always black
  2. Two consecutive nodes cannot both be red
  3. A property is less than another property if its name length is less than the other property's name length
  4. If two properties have the same name length, the sort order is determined by the sort order of the properties' names.
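Rules 3 and 4 amount to a two-level comparison, which might be sketched as follows. The text above does not say how names of equal length are compared, so the use of String.compareTo here is an assumption made purely for illustration, as are the class and method names.

```java
import java.util.Comparator;

// Hypothetical helper: orders property names by length first, then by name.
final class PropertyOrderSketch {
    // Negative if a sorts before b, positive if after, zero if equal.
    static int compareNames(String a, String b) {
        int byLength = Integer.compare(a.length(), b.length());
        if (byLength != 0) {
            return byLength; // rule 3: the shorter name is "less"
        }
        // rule 4: same length, fall back to the names themselves
        // (plain String.compareTo is an assumption, not taken from the format)
        return a.compareTo(b);
    }

    static final Comparator<String> ORDER = PropertyOrderSketch::compareNames;
}
```

Under this ordering a three-character name sorts before any five-character name, regardless of the letters involved.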

At offset 0x44 is the index (int) of the previous property.

At offset 0x48 is the index (int) of the next property.

At offset 0x4C is the index (int) of the first directory entry.

At offset 0x74 is an integer giving the start block for the file described by this property. This index corresponds to an index in the array of indices that is the Block Allocation Table (or the Small Block Allocation Table) as well as the index of the first block in the file.

At offset 0x78 is an integer giving the total actual size of the file pointed at by this property. If the file size is less than 4096, the file is stored in small blocks and the SBAT is used to walk the small blocks making up the file. If the file size is 4096 or larger, the file is stored in big blocks and the main BAT is used to walk the big blocks making up the file. The exception to this rule is the Root Entry, which, regardless of its size, is ALWAYS stored in big blocks and the main BAT is used to walk the big blocks making up this special file.
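Pulling the offsets above together, a single 128-byte property could be decoded roughly as shown below. This is a sketch with invented names. It reads the name up to its null terminator rather than relying on the size field at 0x40, and it assumes the whole property table has already been assembled into one byte array.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical holder for the fields of one 128-byte property record.
final class PropertySketch {
    static final int PROPERTY_SIZE = 128;

    final String name;
    final int type, previous, next, child, startBlock, size;

    // table: the bytes of the property table; index: which property to decode.
    PropertySketch(byte[] table, int index) {
        int base = index * PROPERTY_SIZE;
        ByteBuffer buf = ByteBuffer.wrap(table).order(ByteOrder.LITTLE_ENDIAN);
        // The name field is 64 bytes of 16-bit characters, padded with 0x0000.
        String raw = new String(table, base, 64, StandardCharsets.UTF_16LE);
        int end = raw.indexOf('\0');
        name = end >= 0 ? raw.substring(0, end) : raw;
        type = table[base + 0x42];          // 1 directory, 2 file, 5 Root Entry
        previous = buf.getInt(base + 0x44);
        next = buf.getInt(base + 0x48);
        child = buf.getInt(base + 0x4C);
        startBlock = buf.getInt(base + 0x74);
        size = buf.getInt(base + 0x78);
    }

    // Files under 4096 bytes live in small blocks; the Root Entry never does.
    boolean usesSmallBlocks() {
        return type == 2 && size < 4096;
    }
}
```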

Root Entry

The Root Entry in the Property Table contains the information necessary to read and write small files, which are files less than 4096 bytes long. The start block field of the Root Entry is the start index of the Small Block Array, which is read like any other file in the POI filesystem. Since the SBAT cannot be used without the Small Block Array, the Root Entry MUST be read or written using the Block Allocation Table. The blocks making up the Small Block Array are divided into 64-byte small blocks, up to the size indicated in the Root Entry (which should always be a multiple of 64).
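Because eight 64-byte small blocks fit in each 512-byte big block, locating a small block inside the Small Block Array is simple integer arithmetic. The sketch below uses hypothetical names and assumes the big blocks of the Small Block Array have already been found by walking the main BAT from the Root Entry's start block.

```java
// Hypothetical helper: locates a small block within the Small Block Array.
final class SmallBlockSketch {
    static final int BIG_BLOCK_SIZE = 512;
    static final int SMALL_BLOCK_SIZE = 64;
    static final int SMALL_PER_BIG = BIG_BLOCK_SIZE / SMALL_BLOCK_SIZE; // 8

    // Position, within the Small Block Array's chain, of the big block holding this small block.
    static int bigBlockOrdinal(int smallBlockIndex) {
        return smallBlockIndex / SMALL_PER_BIG;
    }

    // Byte offset of the small block inside that big block.
    static int offsetInBigBlock(int smallBlockIndex) {
        return (smallBlockIndex % SMALL_PER_BIG) * SMALL_BLOCK_SIZE;
    }

    // Number of small blocks available, given the size recorded in the Root Entry.
    static int smallBlockCount(int rootEntrySize) {
        return rootEntrySize / SMALL_BLOCK_SIZE;
    }
}
```

Small block 10, for example, sits in the second big block of the Small Block Array (ordinal 1) at byte offset 128.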

Walking the Nodes of the Property Table

The individual properties form a directory tree, with the Root Entry as the directory tree's root, as shown in the accompanying drawing. Note the numbers in parentheses in each node; they represent the node's index in the array of properties. The NEXT_PROP, PREVIOUS_PROP, and CHILD_PROP fields hold these indices, and are used to navigate the tree.

Each directory entry (i.e., a property whose type is directory or root entry) uses its CHILD_PROP field to point to one of its subordinate (child) properties. It doesn't seem to matter which of its children it points to. Thus in the previous drawing, the Root Entry's CHILD_PROP field may contain 1, 4, or the index of one of its other children. Similarly, the directory node (index 1) may have, in its CHILD_PROP field, 2, 3, or the index of one of its other children.

The children of a given directory property point to each other in a similar fashion by using their NEXT_PROP and PREVIOUS_PROP fields. The ordering of the children is governed by the red-black tree rules listed with the node color field in the Property Table section.

Unused NEXT_PROP, PREVIOUS_PROP, and CHILD_PROP fields contain the marker value of -1. All file properties, for example, have a value of -1 for their CHILD_PROP fields.
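One way to enumerate the children of a directory is to start at its CHILD_PROP and follow PREVIOUS_PROP and NEXT_PROP recursively until -1 is reached. The sketch below assumes, for illustration only, that the three fields have been loaded into parallel int arrays indexed by property index. It does not guard against cycles in a corrupt file.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists the property indices of a directory's children.
final class PropertyTreeSketch {
    static final int UNUSED = -1;

    // previous, next, child: the PREVIOUS_PROP, NEXT_PROP and CHILD_PROP of each property.
    static List<Integer> childrenOf(int directory, int[] previous, int[] next, int[] child) {
        List<Integer> out = new ArrayList<>();
        collect(child[directory], previous, next, out);
        return out;
    }

    // In-order walk: everything reachable via PREVIOUS_PROP, the node, then NEXT_PROP.
    private static void collect(int node, int[] previous, int[] next, List<Integer> out) {
        if (node == UNUSED) {
            return;
        }
        collect(previous[node], previous, next, out);
        out.add(node);
        collect(next[node], previous, next, out);
    }
}
```

Since the walk visits PREVIOUS_PROP before the node and NEXT_PROP after it, the children come out in the tree's sort order no matter which child CHILD_PROP happens to point at.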

Block Allocation Table

The BAT blocks are pointed at by the BAT array contained in the header and supplemented, if necessary, by the XBAT blocks. These blocks form a large table of integers. These integers are block numbers. The Block Allocation Table holds chains of integers. These chains are terminated with -2. The elements in these chains refer to blocks in the files. The starting block of a file is NOT specified in the BAT. It is specified by the property for a given file. The elements in this BAT are both the block number (within the file minus the header) AND the number of the next BAT element in the chain. This can be thought of as a linked list of blocks. The BAT array contains the links from one block to the next, including the end of chain marker.

Here's an example: Let's assume that the BAT begins as follows:

BAT[ 0 ] = 2
BAT[ 1 ] = 5
BAT[ 2 ] = 3
BAT[ 3 ] = 4
BAT[ 4 ] = 6
BAT[ 5 ] = -2
BAT[ 6 ] = 7
BAT[ 7 ] = -2
...

Now, if we have a file whose Property Table entry says it begins with index 0, we walk the BAT array and see that the file consists of blocks 0 (because the start block is 0), 2 (because BAT[ 0 ] is 2), 3 (BAT[ 2 ] is 3), 4 (BAT[ 3 ] is 4), 6 (BAT[ 4 ] is 6), and 7 (BAT[ 6 ] is 7). It ends at block 7 because BAT[ 7 ] is -2, which is the end of chain marker.

Similarly, a file beginning at index 1 consists of blocks 1 and 5.
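The walk just described can be sketched in a few lines, assuming the BAT has been loaded into an int array. The class and method names are hypothetical, and the bounds and length checks are an addition to keep a damaged chain from looping forever.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: follows a chain through the BAT and lists the blocks of one file.
final class BatChainSketch {
    static final int END_OF_CHAIN = -2;

    // startBlock comes from the file's property, not from the BAT itself.
    static List<Integer> blocksOf(int[] bat, int startBlock) {
        List<Integer> blocks = new ArrayList<>();
        int current = startBlock;
        while (current != END_OF_CHAIN) {
            // A chain can never be longer than the BAT, so anything more is a cycle.
            if (current < 0 || current >= bat.length || blocks.size() >= bat.length) {
                throw new IllegalStateException("broken chain at " + current);
            }
            blocks.add(current);   // the index is the block number...
            current = bat[current]; // ...and the element is the next index
        }
        return blocks;
    }
}
```

With the example BAT above, blocksOf(bat, 0) yields [0, 2, 3, 4, 6, 7] and blocksOf(bat, 1) yields [1, 5].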

Other special numbers in a BAT array are:

  -1: unused element
  -3: special block (e.g., a BAT block)

Filesystem Structures

The following outlines the basic filesystem structures.

Header (block 1) -- 512 (0x200) bytes

| Field | Description | Offset | Length | Default value or const |
|---|---|---|---|---|
| FILETYPE | Magic number identifying this as a POI filesystem. | 0x0000 | Long | 0xE11AB1A1E011CFD0 |
| UK1 | Unknown Constant | 0x0008 | Integer | 0 |
| UK2 | Unknown Constant | 0x000C | Integer | 0 |
| UK3 | Unknown Constant | 0x0014 | Integer | 0 |
| UK4 | Unknown Constant (revision?) | 0x0018 | Short | 0x003B |
| UK5 | Unknown Constant (version?) | 0x001A | Short | 0x0003 |
| UK6 | Unknown Constant | 0x001C | Short | -2 |
| LOG_2_BIG_BLOCK_SIZE | Log, base 2, of the big block size | 0x001E | Short | 9 (2 ^ 9 = 512 bytes) |
| LOG_2_SMALL_BLOCK_SIZE | Log, base 2, of the small block size | 0x0020 | Integer | 6 (2 ^ 6 = 64 bytes) |
| UK7 | Unknown Constant | 0x0024 | Integer | 0 |
| UK8 | Unknown Constant | 0x0028 | Integer | 0 |
| BAT_COUNT | Number of elements in the BAT array | 0x002C | Integer | required |
| PROPERTIES_START | Block index of the first block of the property table | 0x0030 | Integer | required |
| UK9 | Unknown Constant | 0x0034 | Integer | 0 |
| UK10 | Unknown Constant | 0x0038 | Integer | 0x00001000 |
| SBAT_START | Block index of first big block containing the small block allocation table (SBAT) | 0x003C | Integer | -2 |
| UK11 | Unknown Constant | 0x0040 | Integer | 1 |
| XBAT_START | Block index of the first block in the Extended Block Allocation Table (XBAT) | 0x0044 | Integer | -2 |
| XBAT_COUNT | Number of elements in the Extended Block Allocation Table (to be added to the BAT) | 0x0048 | Integer | 0 |
| BAT_ARRAY | Array of block indices constituting the Block Allocation Table (BAT) | 0x004C, 0x0050, 0x0054 ... 0x01FC | Integer[ ] | -1 for unused elements, at least first element must be filled. |
| N/A | Header block data not otherwise described in this table | N/A | N/A | -1 |

Block Allocation Table Block -- 512 (0x200) bytes

| Field | Description | Offset | Length | Default value or const |
|---|---|---|---|---|
| BAT_ELEMENT | Any given element in the BAT block | 0x0000, 0x0004, 0x0008, ... 0x01FC | Integer | -1 = unused; -2 = end of chain; -3 = special (e.g., BAT block). All other values point to the next element in the chain and the next index of a block composing the file. |

Property Block -- 512 (0x200) byte block

| Field | Description | Offset | Length | Default value or const |
|---|---|---|---|---|
| Properties[ ] | This block contains the properties. | 0x0000, 0x0080, 0x0100, 0x0180 | 128 bytes | All unused space is set to -1. |

Property -- 128 (0x80) byte block

| Field | Description | Offset | Length | Default value or const |
|---|---|---|---|---|
| NAME | A unicode null-terminated uncompressed 16bit string (lose the high bytes) containing the name of the property. | 0x00, 0x02, 0x04, ... 0x3E | Short[ ] | 0x0000 for unused elements, field required, 32 (0x20) element max |
| NAME_SIZE | Number of characters in the NAME field | 0x40 | Short | Required |
| PROPERTY_TYPE | Property type (directory, file, or root) | 0x42 | Byte | 1 (directory), 2 (file), or 5 (root entry) |
| NODE_COLOR | Node color | 0x43 | Byte | 0 (red) or 1 (black) |
| PREVIOUS_PROP | Previous property index | 0x44 | Integer | -1 |
| NEXT_PROP | Next property index | 0x48 | Integer | -1 |
| CHILD_PROP | First child property index | 0x4C | Integer | -1 |
| SECONDS_1 | Seconds component of the created timestamp? | 0x64 | Integer | 0 |
| DAYS_1 | Days since epoch component of the created timestamp? | 0x68 | Integer | 0 |
| SECONDS_2 | Seconds component of the modified timestamp? | 0x6C | Integer | 0 |
| DAYS_2 | Days since epoch component of the modified timestamp? | 0x70 | Integer | 0 |
| START_BLOCK | Starting block of the file, used as the first block in the file and the pointer to the next block from the BAT | 0x74 | Integer | Required |
| SIZE | Actual size of the file this property points to (used to truncate the blocks to the real size). | 0x78 | Integer | 0 |