updated poifs docs
git-svn-id: https://svn.apache.org/repos/asf/jakarta/poi/trunk@352106 13f79535-47bb-0310-9956-ffa450edef68
BIN src/documentation/images/BlockClassDiagram.gif (new file, 7.7 KiB)
BIN src/documentation/images/POIFSAddDocument.gif (new file, 6.8 KiB)
BIN src/documentation/images/POIFSClassDiagram.gif (new file, 13 KiB)
BIN src/documentation/images/POIFSInitialization.gif (new file, 2.4 KiB)
BIN src/documentation/images/POIFSLifeCycle.gif (new file, 1.8 KiB)
BIN src/documentation/images/POIFSPropertyTablePreWrite.gif (new file, 4.7 KiB)
BIN src/documentation/images/POIFSRootPropertyPreWrite.gif (new file, 1.8 KiB)
BIN src/documentation/images/POIFSWriteArchive.gif (new file, 9.4 KiB)
BIN src/documentation/images/POIFSWriteFilesystem.gif (new file, 9.4 KiB)
BIN src/documentation/images/PropertySet.jpg (new file, 17 KiB)
BIN src/documentation/images/PropertyTableClassDiagram.gif (new file, 11 KiB)
BIN src/documentation/images/made-with-cocoon.png (new file, 2.0 KiB)
BIN src/documentation/images/utilClasses.gif (new file, 20 KiB)
@ -8,6 +8,7 @@
|
||||
<menu label="Navigation">
|
||||
<menu-item label="Main" href="../index.html"/>
|
||||
<menu-item label="How To" href="how-to.html"/>
|
||||
<menu-item label="File System Documentation" href="fileformat.html"/>
|
||||
<menu-item label="Use Cases" href="usecases.html"/>
|
||||
</menu>
|
||||
|
||||
|
666
src/documentation/xdocs/poifs/fileformat.xml
Normal file
@ -0,0 +1,666 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V1.0//EN" "../dtd/document-v10.dtd">
|
||||
<document>
|
||||
<header>
|
||||
<authors>
|
||||
<person email="mjohnson@apache.org" name="Marc Johnson" id="MJ"/>
|
||||
</authors>
|
||||
</header>
|
||||
<body>
|
||||
<s1 title="POIFS File System Internals">
|
||||
<s2 title="Introduction">
|
||||
<p>POIFS file systems are essentially normal files stored on a
|
||||
Java-compatible platform's native file system. They are
|
||||
typically identified by names ending in a four character
|
||||
extension noting what type of data they contain. For
|
||||
example, a file ending in ".xls" would likely
|
||||
contain spreadsheet data, and a file ending in
|
||||
".doc" would probably contain a word processing
|
||||
document. POIFS file systems are called "file
|
||||
system", because they contain multiple embedded files
|
||||
in a manner similar to traditional file systems. Along
|
||||
functional lines, it would be more accurate to call these
|
||||
POIFS archives. For the remainder of this document it is
|
||||
referred to as a file system in order to avoid confusion
|
||||
with the "files" it contains.</p>
|
||||
<p>POIFS file systems are compatible with those document
|
||||
formats used by a well-known software company's popular
|
||||
office productivity suite and programs outputting
|
||||
compatible data. Because the POIFS file system does not
|
||||
provide compression, encryption or any other worthwhile
|
||||
feature, it's not a good choice unless you require
|
||||
interoperability with these programs.</p>
|
||||
<p>The POIFS file system does not encode the documents
|
||||
themselves. For example, if you had a word processor file
|
||||
with the extension ".doc", you would actually
|
||||
have a POIFS file system with a document file archived
|
||||
inside of that file system.</p>
|
||||
</s2>
|
||||
<s2 title="Document Conventions">
|
||||
<p>This document utilizes the numeric types as described by
|
||||
the Java Language Specification, which can be found at
|
||||
<link href="http://java.sun.com">http://java.sun.com</link>. In
|
||||
short:</p>
|
||||
<ul>
|
||||
<li>A <b>byte</b> is an 8 bit signed integer ranging from
|
||||
-128 to 127.</li>
|
||||
<li>A <b>short</b> is a 16 bit signed integer ranging from
|
||||
-32768 to 32767.</li>
|
||||
<li>An <b>int</b> is a 32 bit signed integer ranging from
|
||||
-2147483648 to 2147483647.</li>
|
||||
<li>A <b>long</b> is a 64 bit signed integer ranging from
|
||||
-9223372036854775808 to 9223372036854775807.</li>
|
||||
</ul>
|
||||
<p>The Java Language Specification spells out a number of
|
||||
other types that are not referred to by this document.</p>
|
||||
<p>Where this document makes references to "endian
|
||||
conversion" it is referring to the byte order of
|
||||
stored numbers. Numbers in "little-endian order"
|
||||
are stored with the <b>least</b> significant byte first. In
|
||||
order to properly read a short, for example, you'd read two
|
||||
bytes and then shift the second byte 8 bits to the left
|
||||
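before combining it with the first byte using an <code>or</code>
operation; the first byte is masked with <code>0x00ff</code> so that its
sign extension does not overwrite the high byte. The following code illustrates this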
|
||||
method:</p>
|
||||
<source>
|
||||
public int getShort (byte[] rec)
|
||||
{
|
||||
return ((rec[1] << 8) | (rec[0] & 0x00ff));
|
||||
}</source>
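<p>The same approach extends to the wider types used in this
document. The following is a minimal sketch, not code from POI;
the class name <code>LittleEndianSketch</code> and its method
names are assumptions made for illustration.</p>

```java
/** Little-endian readers in the style of getShort above (illustrative names only). */
final class LittleEndianSketch {

    /** Reads a 32-bit int stored least significant byte first, starting at offset. */
    static int getInt(byte[] rec, int offset) {
        return (rec[offset] & 0xff)
             | ((rec[offset + 1] & 0xff) << 8)
             | ((rec[offset + 2] & 0xff) << 16)
             | (rec[offset + 3] << 24);   // top byte needs no mask; its sign bits shift out
    }

    /** Reads a 64-bit long stored least significant byte first, starting at offset. */
    static long getLong(byte[] rec, int offset) {
        long result = 0;
        // Start with the most significant byte (the last one stored) and work down.
        for (int i = 7; i >= 0; i--) {
            result = (result << 8) | (rec[offset + i] & 0xffL);
        }
        return result;
    }
}
```

<p>Every byte except the most significant one is masked before
it is combined, for the same reason the first byte is masked in
<code>getShort</code>.</p>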
|
||||
</s2>
|
||||
<s2 title="File System Walkthrough">
|
||||
<p>This is a walkthrough of a POIFS file system and how it is
|
||||
put together. It is not intended to give a concise
|
||||
description but to give a "big picture" of the
|
||||
general structure and how it's interpreted.</p>
|
||||
<p>A POIFS file system begins with a header. This header
|
||||
identifies locations in the file by function and provides a
|
||||
sanity check identifying a file as a POIFS file system.</p>
|
||||
<p>The first 64 bits of the header compose a <b>magic number
|
||||
identifier.</b> This identifier tells client software that the file
is a POIFS file system, and not some other format, so that it can be
treated as such. The header also contains an <b>array of block
|
||||
numbers</b>. These block numbers refer to blocks in the
|
||||
file. When these blocks are read together they form the
|
||||
<b>Block Allocation Table</b>. The header also contains a
|
||||
pointer to the first element in the <b>property table</b>,
|
||||
also known as the <b>root element</b>, and a pointer to the
|
||||
<b>small Block Allocation Table (SBAT)</b>.</p>
|
||||
<p>The <b>block allocation table</b> or <b>BAT</b>, along with
|
||||
the <b>property table</b>, specify which blocks in the file
|
||||
system belong to which files. After the header block, the
|
||||
file system is divided into identically sized blocks of
|
||||
data, numbered from 0 to however many blocks there are in
|
||||
the file system. For each file in the file system, its
|
||||
entry in the property table includes the index of the first
|
||||
block in the array of blocks. Each block's index into the
|
||||
array of blocks is also its index into the BAT, and the
|
||||
integer value stored at that index in the BAT gives the
|
||||
index of the next block in the array (and thus the index of
|
||||
the next BAT value). A special value is stored in the BAT
|
||||
to indicate "end of file".</p>
|
||||
<p>The <b>property table</b> is essentially the directory
|
||||
storage for the file system. It consists of the name of the
|
||||
file or directory, its <b>start block</b> in both the file
|
||||
system and <b>BAT</b>, and its actual size. The first
|
||||
property in the property table is the <b>root
|
||||
element</b>. It has two purposes: to be a directory entry
|
||||
(the root of the directory tree, to be specific), and to
|
||||
hold the start block for the <b>small block data</b>.</p>
|
||||
<p>Small block data is a special file that contains the data
|
||||
for small files (less than 4K bytes). It subdivides its
|
||||
blocks into smaller blocks and there is a special small
|
||||
block allocation table that, like the main BAT for larger
|
||||
files, is used to map a small file to its small blocks.</p>
|
||||
</s2>
|
||||
<s3 title="Header Block">
|
||||
<p>The POIFS file system begins with a <b>header
|
||||
block</b>. The first 64 bits of the header form a long
|
||||
<b>file type id</b> or <b>magic number identifier</b> of
|
||||
<code>0xE11AB1A1E011CFD0L</code>. This is basically a
|
||||
sanity check. If this isn't the first thing in the header
|
||||
(and consequently the file system) then this is not a
|
||||
POIFS file system and should be read with some other
|
||||
library.</p>
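<p>As a rough sketch of this sanity check, the first eight bytes
of the file can be read as a little-endian long and compared
against the magic number. The class and method names here are
hypothetical and are not part of POI.</p>

```java
/** Illustrative check of the 64-bit file type id at the start of the header. */
final class MagicCheckSketch {
    static final long POIFS_MAGIC = 0xE11AB1A1E011CFD0L;

    /** Returns true if the buffer starts with the POIFS magic number. */
    static boolean looksLikePoifs(byte[] header) {
        if (header == null || header.length < 8) {
            return false;
        }
        long value = 0;
        // The long is stored least significant byte first.
        for (int i = 7; i >= 0; i--) {
            value = (value << 8) | (header[i] & 0xffL);
        }
        return value == POIFS_MAGIC;
    }
}
```

<p>On disk the magic number therefore appears as the byte
sequence D0 CF 11 E0 A1 B1 1A E1.</p>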
|
||||
<p>It's important to know the most important parts of the
|
||||
header. These are discussed in the rest of this
|
||||
section.</p>
|
||||
<s4 title="BATs">
|
||||
<p>At offset <b>0x2C</b> is an int specifying the number
|
||||
of elements in the <b>BAT array</b>. The array at
|
||||
<b>0x4C</b> is an array of ints. This array contains the
|
||||
indices of every block in the Block Allocation
|
||||
Table.</p>
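<p>A minimal sketch of reading these two fields from a 512-byte
header block follows. It uses <code>java.nio.ByteBuffer</code>
for the little-endian conversion; the class name, method name,
and the decision to clamp the count to what fits in the header
are assumptions for illustration only.</p>

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Illustrative reader for the BAT array stored in the header block. */
final class BatArraySketch {
    static final int BAT_COUNT_OFFSET = 0x2C;
    static final int BAT_ARRAY_OFFSET = 0x4C;
    static final int MAX_HEADER_BAT_ENTRIES = 109; // (0x200 - 0x4C) / 4

    /** Returns the BAT block indices listed in the header itself. */
    static int[] readBatBlockIndices(byte[] header) {
        ByteBuffer buf = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN);
        int count = buf.getInt(BAT_COUNT_OFFSET);
        // Only the first 109 entries can live in the header; clamp defensively.
        int inHeader = Math.max(0, Math.min(count, MAX_HEADER_BAT_ENTRIES));
        int[] indices = new int[inHeader];
        for (int i = 0; i < inHeader; i++) {
            indices[i] = buf.getInt(BAT_ARRAY_OFFSET + 4 * i);
        }
        return indices;
    }
}
```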
|
||||
</s4>
|
||||
<s4 title="XBATs">
|
||||
<p>Very large POIFS archives may have more blocks than can
|
||||
be addressed by the BAT blocks enumerated in the header
|
||||
block. How large? Well, the BAT array in the header can
|
||||
contain up to 109 BAT block indices; each BAT block
|
||||
references up to 128 blocks, and each block is 512
|
||||
bytes, so we're talking about 109 * 128 * 512 =
|
||||
6.8MB. That's a pretty respectable document! But, you
|
||||
could have much more data than that, and in today's
|
||||
world of cheap gigabyte drives, why not? So, the BAT
|
||||
may be extended in that event. The integer value at
|
||||
offset <b>0x44</b> of the header is the index of the
|
||||
first <b>extended BAT (XBAT) block</b>. At offset
|
||||
<b>0x48</b> of the header, there is an int value that
|
||||
specifies how many XBAT blocks there are. The XBAT
|
||||
blocks begin at the specified index into the array of
|
||||
blocks making up the POIFS file system, and continue in
|
||||
sequence for the specified count of XBAT blocks.</p>
|
||||
<p>Each XBAT block contains the indices of up to 128 BAT
|
||||
blocks, so the document size can be expanded by another
|
||||
8MB for each XBAT block. The BAT blocks indexed by an
|
||||
XBAT block are appended to the end of the list of BAT
|
||||
blocks enumerated in the header block. Thus the BAT
|
||||
blocks enumerated in the header block are BAT blocks 0
|
||||
through 108, the BAT blocks enumerated in the first
|
||||
XBAT block are BAT blocks 109 through 236, the BAT
|
||||
blocks enumerated in the second XBAT block are BAT
|
||||
blocks 237 through 364, and so on.</p>
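<p>The numbering above follows a simple pattern, sketched here
under the figures given in this section (109 entries in the
header, 128 entries per XBAT block). The class and method names
are hypothetical.</p>

```java
/** Illustrative arithmetic for which BAT blocks a given XBAT block enumerates. */
final class XbatRangeSketch {
    static final int HEADER_BAT_ENTRIES = 109;
    static final int ENTRIES_PER_XBAT_BLOCK = 128; // figure taken from the text above

    /** Number of the first BAT block listed in the XBAT block with the given 0-based ordinal. */
    static int firstBatBlock(int xbatOrdinal) {
        return HEADER_BAT_ENTRIES + xbatOrdinal * ENTRIES_PER_XBAT_BLOCK;
    }

    /** Number of the last BAT block listed in the XBAT block with the given 0-based ordinal. */
    static int lastBatBlock(int xbatOrdinal) {
        return firstBatBlock(xbatOrdinal) + ENTRIES_PER_XBAT_BLOCK - 1;
    }
}
```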
|
||||
<p>Through the use of XBAT blocks, the limit on the
|
||||
overall document size is that imposed by the 4-byte
|
||||
block indices; if the indices are unsigned ints, the
|
||||
maximum file size is 2 terabytes, 1 terabyte if the
|
||||
indices are treated as signed ints. Either way, I have
|
||||
yet to see a disk drive large enough to accommodate
|
||||
such a file on the shelves at the local office supply
|
||||
stores.</p>
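<p>The size figures quoted in this section can be reproduced
with a few lines of arithmetic. This is only a sketch of the
calculation, using the block sizes and entry counts stated
above; the class and method names are illustrative.</p>

```java
/** Illustrative capacity arithmetic for the BAT and XBAT limits. */
final class CapacitySketch {
    static final long BLOCK_SIZE = 512;
    static final long ENTRIES_PER_BAT_BLOCK = BLOCK_SIZE / 4; // 128 four-byte indices
    static final long HEADER_BAT_ENTRIES = 109;
    static final long BAT_BLOCKS_PER_XBAT_BLOCK = 128;        // figure taken from the text

    /** Bytes addressable using only the BAT blocks listed in the header. */
    static long bytesAddressableWithoutXbat() {
        return HEADER_BAT_ENTRIES * ENTRIES_PER_BAT_BLOCK * BLOCK_SIZE;
    }

    /** Additional bytes addressable for each XBAT block. */
    static long bytesAddedPerXbatBlock() {
        return BAT_BLOCKS_PER_XBAT_BLOCK * ENTRIES_PER_BAT_BLOCK * BLOCK_SIZE;
    }

    /** Upper bound imposed by 4-byte block indices. */
    static long maxBytes(boolean unsignedIndices) {
        long blocks = unsignedIndices ? (1L << 32) : (1L << 31);
        return blocks * BLOCK_SIZE;
    }
}
```

<p>The first value is 7,143,424 bytes (about 6.8MB), the second
is 8,388,608 bytes (8MB), and the limits are 2 to the 41st power
and 2 to the 40th power bytes respectively.</p>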
|
||||
</s4>
|
||||
<s4 title="SBATs">
|
||||
<p>If a file contained in a POIFS archive is smaller than
|
||||
4096 bytes, it is stored in small blocks. Small blocks
|
||||
are 64 bytes in length and are contained within big
|
||||
blocks, up to 8 to a big block. As the main BAT is used
|
||||
to navigate the array of big blocks, so the <b>small
|
||||
block allocation table</b> is used to navigate the
|
||||
array of small blocks. The SBAT's start block index is
|
||||
found at offset <b>0x3C</b> of the header block, and
|
||||
remaining blocks constituting the SBAT are found by
|
||||
walking the main BAT as if it were an ordinary file in
|
||||
the POIFS file system (this process is described
|
||||
below).</p>
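<p>To make the packing of small blocks concrete, here is a
sketch that locates a small block inside the big blocks that
hold the small block data. It assumes small block <i>n</i>
occupies the <i>n</i>-th 64-byte slot when those big blocks are
read in chain order; the class and method names are
hypothetical.</p>

```java
/** Illustrative arithmetic for finding a small block inside the big blocks that hold it. */
final class SmallBlockLocatorSketch {
    static final int BIG_BLOCK_SIZE = 512;
    static final int SMALL_BLOCK_SIZE = 64;
    static final int SMALL_PER_BIG = BIG_BLOCK_SIZE / SMALL_BLOCK_SIZE; // 8

    /** 0-based position, within the chain of big blocks, of the big block holding the small block. */
    static int bigBlockOrdinal(int smallBlockIndex) {
        return smallBlockIndex / SMALL_PER_BIG;
    }

    /** Byte offset of the small block within that big block. */
    static int offsetInBigBlock(int smallBlockIndex) {
        return (smallBlockIndex % SMALL_PER_BIG) * SMALL_BLOCK_SIZE;
    }
}
```

<p>For instance, small block 10 would sit in the second big
block of the chain (ordinal 1) at byte offset 128.</p>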
|
||||
</s4>
|
||||
<s4 title="Property Table Start Index">
|
||||
<p>An integer at address <b>0x30</b> specifies the start
|
||||
index of the property table. This integer is specified
|
||||
as a <b>"block index"</b>. The Property Table
|
||||
is stored, as is almost everything in a POIFS file
|
||||
system, in big blocks and walked via the BAT. The
|
||||
Property Table is described below.</p>
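<p>Because big blocks are numbered from 0 starting immediately
after the 512-byte header block, a block index can be turned
into a byte offset in the underlying file as sketched here. The
class and method names are assumptions for illustration.</p>

```java
/** Illustrative conversion from a big block index to an offset in the file. */
final class BlockOffsetSketch {
    static final int HEADER_SIZE = 512;
    static final int BLOCK_SIZE = 512;

    /** Returns the file offset of the first byte of the given big block. */
    static long fileOffset(int blockIndex) {
        if (blockIndex < 0) {
            // Negative values are markers (unused, end of chain, special), not block indices.
            throw new IllegalArgumentException("not a block index: " + blockIndex);
        }
        return HEADER_SIZE + (long) blockIndex * BLOCK_SIZE;
    }
}
```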
|
||||
</s4>
|
||||
</s3>
|
||||
<s3 title="Property Table">
|
||||
<p>The property table is essentially nothing more than the
|
||||
directory system. Properties are 128 byte records
|
||||
contained within the 512 byte blocks. The first property
|
||||
is always the Root Entry. The following applies to
|
||||
individual properties within a property table:</p>
|
||||
<ul>
|
||||
<li>At offset <b>0x00</b> in the property is the
|
||||
"<b>name</b>". This is stored as an
|
||||
uncompressed 16 bit unicode string. In short every
|
||||
other byte corresponds to an "ASCII"
|
||||
character. The size of this string is stored at offset
|
||||
<b>0x40</b> (<b>string size</b>) as a short.</li>
|
||||
<li>At offset <b>0x42</b> is the <b>property type</b>
|
||||
(byte). The type is 1 for directory, 2 for file or 5
|
||||
for the Root Entry.</li>
|
||||
<li>At offset <b>0x43</b> is the <b>node color</b>
|
||||
(byte). The color is either 1 (black) or 0 (red).
Properties are apparently meant to be arranged
|
||||
in a red-black binary tree, subject to the following
|
||||
rules:
|
||||
<ol>
|
||||
<li>The root of the tree is always black</li>
|
||||
<li>Two consecutive nodes cannot both be red</li>
|
||||
<li>A property is less than another property if its
|
||||
name length is less than the other property's name
|
||||
length</li>
|
||||
<li>If two properties have the same name length, the
|
||||
sort order is determined by the sort order of the
|
||||
properties' names.</li>
|
||||
</ol></li>
|
||||
<li>At offset <b>0x44</b> is the index (int) of the
|
||||
<b>previous property</b>.</li>
|
||||
<li>At offset <b>0x48</b> is the index (int) of the
|
||||
<b>next property</b>.</li>
|
||||
<li>At offset <b>0x4C</b> is the index (int) of the
|
||||
<b>first directory entry</b>. This is used by
|
||||
directory entries.</li>
|
||||
<li>At offset <b>0x74</b> is an integer giving the
|
||||
<b>start block</b> for the file described by this
|
||||
property. This index corresponds to an index in the
|
||||
array of indices that is the Block Allocation Table
|
||||
(or the Small Block Allocation Table) as well as the
|
||||
index of the first block in the file. This is used by
|
||||
files and the root entry.</li>
|
||||
<li>At offset <b>0x78</b> is an integer giving the total
|
||||
<b>actual size</b> of the file pointed at by this
|
||||
property. If the file size is less than 4096, the file
|
||||
is stored in small blocks and the SBAT is used to walk
|
||||
the small blocks making up the file. If the file size
|
||||
is 4096 or larger, the file is stored in big blocks
|
||||
and the main BAT is used to walk the big blocks making
|
||||
up the file. The exception to this rule is the <b>Root
|
||||
Entry</b>, which, regardless of its size, is
|
||||
<b>always</b> stored in big blocks and the main BAT is
|
||||
used to walk the big blocks making up this special
|
||||
file.</li>
|
||||
</ul>
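<p>The field layout listed above can be decoded along the
following lines. This is a sketch rather than POI's own code:
the class name and field names are hypothetical, the name is
read up to its first null character instead of relying on the
string size field, and plain <code>String.compareTo</code> is
assumed for the name ordering, which this document does not
specify.</p>

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Illustrative decoder for one 128-byte property record. */
final class PropertySketch {
    final String name;
    final int type;        // 1 = directory, 2 = file, 5 = root entry
    final int color;       // 0 = red, 1 = black
    final int previous, next, child;
    final int startBlock, size;

    PropertySketch(byte[] record) {
        ByteBuffer buf = ByteBuffer.wrap(record).order(ByteOrder.LITTLE_ENDIAN);
        StringBuilder sb = new StringBuilder();
        // The name is a run of 16-bit characters starting at 0x00, at most 32 of them.
        for (int off = 0x00; off < 0x40; off += 2) {
            char c = buf.getChar(off);
            if (c == 0) break;
            sb.append(c);
        }
        name = sb.toString();
        type = buf.get(0x42);
        color = buf.get(0x43);
        previous = buf.getInt(0x44);
        next = buf.getInt(0x48);
        child = buf.getInt(0x4C);
        startBlock = buf.getInt(0x74);
        size = buf.getInt(0x78);
    }

    /** Files under 4096 bytes are walked via the SBAT; the root entry never is. */
    boolean usesSmallBlocks() {
        return type == 2 && size < 4096;
    }

    /** Shorter names sort first; equal lengths fall back to an assumed String ordering. */
    static int compareNames(String a, String b) {
        if (a.length() != b.length()) {
            return Integer.compare(a.length(), b.length());
        }
        return a.compareTo(b);
    }
}
```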
|
||||
</s3>
|
||||
<s3 title="Root Entry">
|
||||
<p>The <b>Root Entry</b> in the <b>Property Table</b>
|
||||
contains the information necessary to read and write
|
||||
small files, which are files less than 4096 bytes
|
||||
long. The start block field of the Root Entry is the
|
||||
start index of the <b>Small Block Array</b>, which is
|
||||
read like any other file in the POIFS file system. Since
|
||||
the SBAT cannot be used without the Small Block Array,
|
||||
the Root Entry MUST be read or written using the <b>Block
|
||||
Allocation Table</b>. The blocks making up the Small
|
||||
Block Array are divided into 64-byte small blocks, up to
|
||||
the size indicated in the Root Entry (which should always
|
||||
be a multiple of 64).</p>
|
||||
</s3>
|
||||
<s3 title="Walking the Nodes of the Property Table">
|
||||
<p>The individual properties form a directory tree, with the
|
||||
<b>Root Entry</b> as the directory tree's root, as shown
|
||||
in the accompanying drawing. Note the numbers in
|
||||
parentheses in each node; they represent the node's index
|
||||
in the array of properties. The <b>NEXT_PROP</b>,
|
||||
<b>PREVIOUS_PROP</b>, and <b>CHILD_PROP</b> fields hold
|
||||
these indices, and are used to navigate the tree.</p>
|
||||
<img src="images/PropertySet.jpg" />
|
||||
<p>Each directory entry (i.e., a property whose type is
|
||||
<b>directory</b> or <b>root entry</b>) uses its
|
||||
<b>CHILD_PROP</b> field to point to one of its
|
||||
subordinate (child) properties. It doesn't seem to matter
|
||||
which of its children it points to. Thus in the previous
|
||||
drawing, the Root Entry's CHILD_PROP field may contain 1,
|
||||
4, or the index of one of its other children. Similarly,
|
||||
the directory node (index 1) may have, in its CHILD_PROP
|
||||
field, 2, 3, or the index of one of its other
|
||||
children.</p>
|
||||
<p>The children of a given directory property point to each
|
||||
other in a similar fashion by using their
|
||||
<b>NEXT_PROP</b> and <b>PREVIOUS_PROP</b> fields.</p>
|
||||
<p>Unused <b>NEXT_PROP</b>, <b>PREVIOUS_PROP</b>, and
|
||||
<b>CHILD_PROP</b> fields contain the marker value of
|
||||
-1. For example, all file properties have a value of -1 in
their CHILD_PROP fields.</p>
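<p>One way to enumerate the children of a directory is sketched
here. It assumes that every child is reachable from the node
named in CHILD_PROP by following PREVIOUS_PROP and NEXT_PROP
links, and it treats those links as the left and right branches
of a binary tree. The three arrays stand in for the
corresponding fields of each property, indexed by property
number; all names are hypothetical.</p>

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative traversal of the property tree using index arrays. */
final class PropertyTreeSketch {
    static final int UNUSED = -1;

    /** Returns the indices of the properties directly contained in the given directory. */
    static List<Integer> childrenOf(int directory, int[] previous, int[] next, int[] child) {
        List<Integer> result = new ArrayList<>();
        collect(child[directory], previous, next, result);
        return result;
    }

    // Visits PREVIOUS_PROP, then the node, then NEXT_PROP; -1 ends a branch.
    // A corrupt file with a cycle in these links would recurse without end.
    private static void collect(int node, int[] previous, int[] next, List<Integer> out) {
        if (node == UNUSED) {
            return;
        }
        collect(previous[node], previous, next, out);
        out.add(node);
        collect(next[node], previous, next, out);
    }
}
```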
|
||||
</s3>
|
||||
<s3 title="Block Allocation Table">
|
||||
<p>The <b>BAT blocks</b> are pointed at by the BAT array
|
||||
contained in the header and supplemented, if necessary,
|
||||
by the <b>XBAT blocks</b>. These blocks form a large
|
||||
table of integers. These integers are block numbers. The
|
||||
<b>Block Allocation Table</b> holds chains of integers.
|
||||
These chains are terminated with -2. The elements in
|
||||
these chains refer to blocks in the files. The starting
|
||||
block of a file is NOT specified in the BAT. It is
|
||||
specified by the <b>property</b> for a given file. The
|
||||
elements in this BAT are both the block number (within
|
||||
the file minus the header) <b>and</b> the number of the
|
||||
next BAT element in the chain. This can be thought of as
|
||||
a linked list of blocks. The BAT array contains the links
|
||||
from one block to the next, including the end of chain
|
||||
marker.</p>
|
||||
<p>Here's an example: Let's assume that the BAT begins as
|
||||
follows:</p>
|
||||
<p><code>BAT[ 0 ] = 2</code></p>
|
||||
<p><code>BAT[ 1 ] = 5</code></p>
|
||||
<p><code>BAT[ 2 ] = 3</code></p>
|
||||
<p><code>BAT[ 3 ] = 4</code></p>
|
||||
<p><code>BAT[ 4 ] = 6</code></p>
|
||||
<p><code>BAT[ 5 ] = -2</code></p>
|
||||
<p><code>BAT[ 6 ] = 7</code></p>
|
||||
<p><code>BAT[ 7 ] = -2</code></p>
|
||||
<p><code>...</code></p>
|
||||
<p>Now, if we have a file whose Property Table entry says it
|
||||
begins with index 0, we walk the BAT array and see that
|
||||
the file consists of blocks 0 (because the start block is
|
||||
0), 2 (because BAT[ 0 ] is 2), 3 (BAT[ 2 ] is 3), 4 (BAT[
|
||||
3 ] is 4), 6 (BAT[ 4 ] is 6), and 7 (BAT[ 6 ] is 7). It
|
||||
ends at block 7 because BAT[ 7 ] is -2, which is the end
|
||||
of chain marker.</p>
|
||||
<p>Similarly, a file beginning at index 1 consists of
|
||||
blocks 1 and 5.</p>
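<p>The walk just described can be sketched as a short loop. The
class and method names are illustrative, and the bounds and
cycle checks are an added precaution rather than something this
document requires.</p>

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative walk of a block chain through the BAT. */
final class BatChainSketch {
    static final int END_OF_CHAIN = -2;

    /** Returns the block indices making up the file that starts at startBlock. */
    static List<Integer> blocksOf(int startBlock, int[] bat) {
        List<Integer> blocks = new ArrayList<>();
        int current = startBlock;
        while (current != END_OF_CHAIN) {
            // Reject indices outside the table and chains longer than the table itself.
            if (current < 0 || current >= bat.length || blocks.size() >= bat.length) {
                throw new IllegalStateException("corrupt chain at " + current);
            }
            blocks.add(current);
            current = bat[current]; // the BAT entry names the next block
        }
        return blocks;
    }
}
```

<p>With the example table above, a start block of 0 yields
blocks 0, 2, 3, 4, 6, 7 and a start block of 1 yields blocks 1
and 5.</p>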
|
||||
<p>Other special numbers in a BAT array are:</p>
|
||||
<ul>
|
||||
<li>-1, which indicates an unused block</li>
|
||||
<li>-3, which indicates a "special" block, such
|
||||
as a block used to make up the Small Block Array, the
|
||||
Property Table, the main BAT, or the SBAT</li>
|
||||
</ul>
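<p>Putting the marker values together, a BAT entry can be
classified as in this small sketch. The class name, method name
and returned strings are purely illustrative.</p>

```java
/** Illustrative classification of a single BAT entry. */
final class BatEntrySketch {
    static String describe(int value) {
        switch (value) {
            case -1: return "unused";
            case -2: return "end of chain";
            case -3: return "special";
            default:
                // Any non-negative value is the index of the next block in the chain.
                return value >= 0 ? "next block " + value : "unrecognized";
        }
    }
}
```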
|
||||
</s3>
|
||||
<s2 title="File System Structures">
|
||||
<p>The following outlines the basic file system structures.</p>
|
||||
<s3 title="Header (block 1) -- 512 (0x200) bytes">
|
||||
<table>
|
||||
<tr>
|
||||
<td><b>Field</b></td>
|
||||
<td><b>Description</b></td>
|
||||
<td><b>Offset</b></td>
|
||||
<td><b>Length</b></td>
|
||||
<td><b>Default value or const</b></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>FILETYPE</td>
|
||||
<td>Magic number identifying this as a POIFS file
|
||||
system.</td>
|
||||
<td>0x0000</td>
|
||||
<td>Long</td>
|
||||
<td>0xE11AB1A1E011CFD0</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>UK1</td>
|
||||
<td>Unknown Constant</td>
|
||||
<td>0x0008</td>
|
||||
<td>Integer</td>
|
||||
<td>0</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>UK2</td>
|
||||
<td>Unknown Constant</td>
|
||||
<td>0x000C</td>
|
||||
<td>Integer</td>
|
||||
<td>0</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>UK3</td>
|
||||
<td>Unknown Constant</td>
|
||||
<td>0x0014</td>
|
||||
<td>Integer</td>
|
||||
<td>0</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>UK4</td>
|
||||
<td>Unknown Constant (revision?)</td>
|
||||
<td>0x0018</td>
|
||||
<td>Short</td>
|
||||
<td>0x003B</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>UK5</td>
|
||||
<td>Unknown Constant (version?)</td>
|
||||
<td>0x001A</td>
|
||||
<td>Short</td>
|
||||
<td>0x0003</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>UK6</td>
|
||||
<td>Unknown Constant</td>
|
||||
<td>0x001C</td>
|
||||
<td>Short</td>
|
||||
<td>-2</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>LOG_2_BIG_BLOCK_SIZE</td>
|
||||
<td>Log, base 2, of the big block size</td>
|
||||
<td>0x001E</td>
|
||||
<td>Short</td>
|
||||
<td>9 (2 ^ 9 = 512 bytes)</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>LOG_2_SMALL_BLOCK_SIZE</td>
|
||||
<td>Log, base 2, of the small block size</td>
|
||||
<td>0x0020</td>
|
||||
<td>Integer</td>
|
||||
<td>6 (2 ^ 6 = 64 bytes)</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>UK7</td>
|
||||
<td>Unknown Constant</td>
|
||||
<td>0x0024</td>
|
||||
<td>Integer</td>
|
||||
<td>0</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>UK8</td>
|
||||
<td>Unknown Constant</td>
|
||||
<td>0x0028</td>
|
||||
<td>Integer</td>
|
||||
<td>0</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>BAT_COUNT</td>
|
||||
<td>Number of elements in the BAT array</td>
|
||||
<td>0x002C</td>
|
||||
<td>Integer</td>
|
||||
<td>required</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>PROPERTIES_START</td>
|
||||
<td>Block index of the first block of the property
|
||||
table</td>
|
||||
<td>0x0030</td>
|
||||
<td>Integer</td>
|
||||
<td>required</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>UK9</td>
|
||||
<td>Unknown Constant</td>
|
||||
<td>0x0034</td>
|
||||
<td>Integer</td>
|
||||
<td>0</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>UK10</td>
|
||||
<td>Unknown Constant</td>
|
||||
<td>0x0038</td>
|
||||
<td>Integer</td>
|
||||
<td>0x00001000</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>SBAT_START</td>
|
||||
<td>Block index of first big block containing the small
|
||||
block allocation table (SBAT)</td>
|
||||
<td>0x003C</td>
|
||||
<td>Integer</td>
|
||||
<td>-2</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>UK11</td>
|
||||
<td>Unknown Constant</td>
|
||||
<td>0x0040</td>
|
||||
<td>Integer</td>
|
||||
<td>1</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>XBAT_START</td>
|
||||
<td>Block index of the first block in the Extended Block
|
||||
Allocation Table (XBAT)</td>
|
||||
<td>0x0044</td>
|
||||
<td>Integer</td>
|
||||
<td>-2</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>XBAT_COUNT</td>
|
||||
<td>Number of elements in the Extended Block Allocation
|
||||
Table (to be added to the BAT)</td>
|
||||
<td>0x0048</td>
|
||||
<td>Integer</td>
|
||||
<td>0</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>BAT_ARRAY</td>
|
||||
<td>Array of block indices constituting the Block
|
||||
Allocation Table (BAT)</td>
|
||||
<td>0x004C, 0x0050, 0x0054 ... 0x01FC</td>
|
||||
<td>Integer[]</td>
|
||||
<td>-1 for unused elements, at least first element must
|
||||
be filled.</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>N/A</td>
|
||||
<td>Header block data not otherwise described in this
|
||||
table</td>
|
||||
<td>N/A</td>
|
||||
<td>N/A</td>
|
||||
<td>-1</td>
|
||||
</tr>
|
||||
</table>
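<p>As a sketch of how the named fields in this table might be
pulled out of a header block, the following reads each one at
its listed offset using a little-endian
<code>java.nio.ByteBuffer</code>. The class and field names are
hypothetical and the unknown constants are ignored.</p>

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Illustrative holder for the named header fields; not a POI class. */
final class HeaderSketch {
    final long fileType;        // FILETYPE, offset 0x0000
    final int bigBlockSize;     // 2 ^ LOG_2_BIG_BLOCK_SIZE (short at 0x001E)
    final int smallBlockSize;   // 2 ^ LOG_2_SMALL_BLOCK_SIZE (int at 0x0020)
    final int batCount;         // BAT_COUNT, offset 0x002C
    final int propertiesStart;  // PROPERTIES_START, offset 0x0030
    final int sbatStart;        // SBAT_START, offset 0x003C
    final int xbatStart;        // XBAT_START, offset 0x0044
    final int xbatCount;        // XBAT_COUNT, offset 0x0048

    HeaderSketch(byte[] header) {
        if (header.length < 512) {
            throw new IllegalArgumentException("header block must be 512 bytes");
        }
        ByteBuffer buf = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN);
        fileType = buf.getLong(0x00);
        bigBlockSize = 1 << buf.getShort(0x1E);
        smallBlockSize = 1 << buf.getInt(0x20);
        batCount = buf.getInt(0x2C);
        propertiesStart = buf.getInt(0x30);
        sbatStart = buf.getInt(0x3C);
        xbatStart = buf.getInt(0x44);
        xbatCount = buf.getInt(0x48);
    }
}
```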
|
||||
</s3>
|
||||
<s3 title="Block Allocation Table Block -- 512 (0x200) bytes">
|
||||
<table>
|
||||
<tr>
|
||||
<td><b>Field</b></td>
<td><b>Description</b></td>
<td><b>Offset</b></td>
<td><b>Length</b></td>
<td><b>Default value or const</b></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>BAT_ELEMENT</td>
|
||||
<td>Any given element in the BAT block</td>
|
||||
<td>0x0000, 0x0004, 0x0008, ... 0x01FC</td>
|
||||
<td>Integer</td>
|
||||
<td>
|
||||
<ul>
|
||||
<li>-1 = unused</li>
|
||||
<li>-2 = end of chain</li>
|
||||
<li>-3 = special (e.g., BAT block)</li>
|
||||
</ul>
|
||||
<p>All other values point to the next element in the
|
||||
chain and the next index of a block composing the
|
||||
file.</p>
|
||||
</td>
|
||||
</tr>
|
||||
</table>
|
||||
</s3>
|
||||
<s3 title="Property Block -- 512 (0x200) byte block">
|
||||
<table>
|
||||
<tr>
|
||||
<td><b>Field</b></td>
<td><b>Description</b></td>
<td><b>Offset</b></td>
<td><b>Length</b></td>
<td><b>Default value or const</b></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>Properties[]</td>
|
||||
<td>This block contains the properties.</td>
|
||||
<td>0x0000, 0x0080, 0x0100, 0x0180</td>
|
||||
<td>128 bytes</td>
|
||||
<td>All unused space is set to -1.</td>
|
||||
</tr>
|
||||
</table>
|
||||
</s3>
|
||||
<s3 title="Property -- 128 (0x80) byte block">
|
||||
<table>
|
||||
<tr>
|
||||
<td><b>Field</b></td>
<td><b>Description</b></td>
<td><b>Offset</b></td>
<td><b>Length</b></td>
<td><b>Default value or const</b></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>NAME</td>
|
||||
<td>A unicode null-terminated uncompressed 16bit string
|
||||
(lose the high bytes) containing the name of the
|
||||
property.</td>
|
||||
<td>0x00, 0x02, 0x04, ... 0x3E</td>
|
||||
<td>Short[]</td>
|
||||
<td>0x0000 for unused elements, field required, 32
(0x20) element max, occupying 64 (0x40) bytes</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>NAME_SIZE</td>
|
||||
<td>Number of characters in the NAME field</td>
|
||||
<td>0x40</td>
|
||||
<td>Short</td>
|
||||
<td>Required</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>PROPERTY_TYPE</td>
|
||||
<td>Property type (directory, file, or root)</td>
|
||||
<td>0x42</td>
|
||||
<td>Byte</td>
|
||||
<td>1 (directory), 2 (file), or 5 (root entry)</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>NODE_COLOR</td>
|
||||
<td>Node color</td>
|
||||
<td>0x43</td>
|
||||
<td>Byte</td>
|
||||
<td>0 (red) or 1 (black)</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>PREVIOUS_PROP</td>
|
||||
<td>Previous property index</td>
|
||||
<td>0x44</td>
|
||||
<td>Integer</td>
|
||||
<td>-1</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>NEXT_PROP</td>
|
||||
<td>Next property index</td>
|
||||
<td>0x48</td>
|
||||
<td>Integer</td>
|
||||
<td>-1</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>CHILD_PROP</td>
|
||||
<td>First child property index</td>
|
||||
<td>0x4c</td>
|
||||
<td>Integer</td>
|
||||
<td>-1</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>SECONDS_1</td>
|
||||
<td>Seconds component of the created timestamp?</td>
|
||||
<td>0x64</td>
|
||||
<td>Integer</td>
|
||||
<td>0</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>DAYS_1</td>
|
||||
<td>Days component of the created timestamp?</td>
|
||||
<td>0x68</td>
|
||||
<td>Integer</td>
|
||||
<td>0</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>SECONDS_2</td>
|
||||
<td>Seconds component of the modified timestamp?</td>
|
||||
<td>0x6C</td>
|
||||
<td>Integer</td>
|
||||
<td>0</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>DAYS_2</td>
|
||||
<td>Days component of the modified timestamp?</td>
|
||||
<td>0x70</td>
|
||||
<td>Integer</td>
|
||||
<td>0</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>START_BLOCK</td>
|
||||
<td>Starting block of the file, used as the first block
|
||||
in the file and the pointer to the next block from
|
||||
the BAT</td>
|
||||
<td>0x74</td>
|
||||
<td>Integer</td>
|
||||
<td>Required</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>SIZE</td>
|
||||
<td>Actual size of the file this property points
|
||||
to. (used to truncate the blocks to the real
|
||||
size).</td>
|
||||
<td>0x78</td>
|
||||
<td>Integer</td>
|
||||
<td>0</td>
|
||||
</tr>
|
||||
</table>
|
||||
</s3>
|
||||
</s2>
|
||||
</s1>
|
||||
</body>
|
||||
</document>
|
@ -1,837 +0,0 @@
|
||||
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
|
||||
<HTML>
|
||||
<HEAD>
|
||||
<META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=iso-8859-1">
|
||||
<TITLE></TITLE>
|
||||
<META NAME="GENERATOR" CONTENT="StarOffice/5.2 (Linux)">
|
||||
<META NAME="AUTHOR" CONTENT=" ">
|
||||
<META NAME="CREATED" CONTENT="20010728;10223600">
|
||||
<META NAME="CHANGEDBY" CONTENT="Marc Johnson">
|
||||
<META NAME="CHANGED" CONTENT="20010810;13415800">
|
||||
<STYLE>
|
||||
<!--
|
||||
@page { margin-left: 1.25in; margin-right: 1.25in; margin-top: 1in; margin-bottom: 1in }
|
||||
H1 { margin-bottom: 0.08in; font-size: 16pt }
|
||||
TD P { margin-bottom: 0.08in }
|
||||
H2 { margin-bottom: 0.08in; font-size: 14pt; font-style: italic }
|
||||
H3 { margin-bottom: 0.08in }
|
||||
H4 { margin-bottom: 0.08in; font-size: 11pt; font-style: italic }
|
||||
P { margin-bottom: 0.08in }
|
||||
-->
|
||||
</STYLE>
|
||||
</HEAD>
|
||||
<BODY>
|
||||
<H1>POI Filesystem format</H1>
|
||||
<H2>Introduction</H2>
|
||||
<P STYLE="margin-bottom: 0in; font-weight: medium">
|
||||
The POI file format is essentially an archive wrapper
|
||||
around files. It is intended to mimic a filesystem. For
|
||||
the remainder of this document it is referred to as a
|
||||
filesystem in order to avoid confusion with the
|
||||
"files" it contains.
|
||||
</P>
|
||||
<P STYLE="margin-bottom: 0in; font-weight: medium; text-decoration: none">
|
||||
POI filesystems are compatible with those document formats
|
||||
used by a well-known software company's popular office
|
||||
productivity suite and programs outputting compatible
|
||||
data. Because the POI filesystem does not provide
|
||||
compression, encryption or any other worthwhile feature,
|
||||
its not a good choice unless you require interoperability
|
||||
with these programs.
|
||||
</P>
|
||||
<P STYLE="margin-bottom: 0in; font-weight: medium">
|
||||
The POI filesystem does not encode the documents
|
||||
themselves. For example, if you had a word processor file
|
||||
with the extension ".doc", you would actually
|
||||
have a POI filesystem with a document file archived inside
|
||||
of the filesystem.
|
||||
</P>
|
||||
<H2>Document Conventions</H2>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
This document utilizes the numeric types as described by
|
||||
the Java Language Specification, which can be found at
|
||||
java.sun.com. In short:
|
||||
</P>
|
||||
<UL>
|
||||
<LI>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
a byte is an 8 bit signed integer ranging from
|
||||
(-128) to 127.
|
||||
</P>
|
||||
</LI>
|
||||
<LI>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
a short is a 16 bit signed integer ranging from
|
||||
(-32768) to 32767
|
||||
</P>
|
||||
</LI>
|
||||
<LI>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
an int is a 32 bit signed integer ranging from
|
||||
(-2.14e+9) to 2.14e+9
|
||||
</P>
|
||||
</LI>
|
||||
<LI>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
a long is a 64 bit signed integer ranging from
|
||||
(-9.22e+18) to 9.22e+18
|
||||
</P>
|
||||
</LI>
|
||||
</UL>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
The Java Language Specification spells out a number of
|
||||
other types that are not referred to by this document.
|
||||
</P>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
Where this document makes references to "endian
|
||||
conversion" it is referring to the byte order of
|
||||
stored numbers. Numbers in "little-endian order"
|
||||
are stored with the LEAST significant byte first. In order
|
||||
to properly read a short, for example, you'd read two
|
||||
bytes and then shift the second byte 8 bits to the left
|
||||
before performing an <CODE>or</CODE> operation to it
|
||||
against the first byte while stripping the
|
||||
"sign" from the first byte. The following code
|
||||
illustrates this method:
|
||||
</P>
|
||||
<P STYLE="text-decoration: none">
|
||||
<FONT FACE="Courier, monospace"><FONT
|
||||
SIZE=2><B>public int getShort (byte[ ] rec)
|
||||
{</B></FONT></FONT>
|
||||
</P>
|
||||
<P>
|
||||
<FONT FACE="Courier, monospace"><FONT SIZE=2><B>return (
|
||||
(rec[1] << 8) | (rec[0] & 0xff)
|
||||
);</B></FONT></FONT>
|
||||
</P>
|
||||
<P>
|
||||
<FONT FACE="Courier, monospace"><FONT
|
||||
SIZE=2><B>}</B></FONT></FONT>
|
||||
</P>
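<P STYLE="margin-bottom: 0in">
The same idea extends to the wider types. The following is a minimal sketch rather than POI's own utility code; the class name <CODE>LittleEndianSketch</CODE> and the <CODE>offset</CODE> parameter are assumptions made for illustration.
</P>

```java
final class LittleEndianSketch {
    private LittleEndianSketch() {}

    /** Reads a 16-bit little-endian value starting at offset. */
    static short getShort(byte[] data, int offset) {
        return (short) (((data[offset + 1] & 0xff) << 8) | (data[offset] & 0xff));
    }

    /** Reads a 32-bit little-endian value starting at offset. */
    static int getInt(byte[] data, int offset) {
        int value = 0;
        // Start with the most significant byte, which is stored last.
        for (int i = 3; i >= 0; i--) {
            value = (value << 8) | (data[offset + i] & 0xff);
        }
        return value;
    }

    /** Reads a 64-bit little-endian value starting at offset. */
    static long getLong(byte[] data, int offset) {
        long value = 0;
        for (int i = 7; i >= 0; i--) {
            value = (value << 8) | (data[offset + i] & 0xffL);
        }
        return value;
    }
}
```

<P STYLE="margin-bottom: 0in">
Masking each byte with <CODE>0xff</CODE> before combining it keeps a negative byte from smearing sign bits over the higher-order bytes.
</P>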
|
||||
<H2>Filesystem Introduction</H2>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
POI filesystems are essentially normal files stored on a
|
||||
Java-compatible platform's native filesystem. They are
|
||||
identified by names ending in a four character identifier
|
||||
noting what type of data they contain. For example, a file
|
||||
ending in ".xls" would likely contain
|
||||
spreadsheet data, and a file ending in ".doc"
|
||||
would probably contain a word processing document. POI
|
||||
filesystems are called "filesystems" because
|
||||
they contain multiple embedded files in a manner similar
|
||||
to traditional filesystems. Along functional lines, it
|
||||
would be more accurate to call these POI archives.
|
||||
</P>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
POI filesystems do not provide encryption, compression, or
|
||||
any other feature of a modern archive and are therefore a
|
||||
poor choice for implementing new file formats. It is
|
||||
suggested that POI filesystems are most useful for
|
||||
interoperability with legacy applications that use a
|
||||
compatible file format.
|
||||
</P>
|
||||
<H2>Filesystem Walkthrough</H2>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
This is a walkthrough of a POI filesystem and how it is
|
||||
put together. It is not intended to give a concise
|
||||
description but to give a "big picture" of the
|
||||
general structure and how it's interpreted.
|
||||
</P>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
A POI filesystem begins with a <A
|
||||
HREF="#HeaderBlock"><B><I>header</I></B></A>. This header
|
||||
identifies locations in the file by function and provides
|
||||
a sanity check identifying a native filesystem file as
|
||||
indeed a POI filesystem.
|
||||
</P>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
The first 64 bits of the header compose a <B><I>magic
|
||||
number identifier.</I></B> This identifier is a "sanity
check": it tells the client software that this is indeed a
POI filesystem, and not some other format, and that it
should be treated as such. The header also contains an <B><I>array
|
||||
of block numbers</I></B>. These block numbers refer to
|
||||
blocks in the file. When these blocks are read together
|
||||
they form the <A HREF="#BAT"><B><I>Block Allocation
|
||||
Table</I></B></A>. The header also contains a pointer to
|
||||
the first element in the <A
|
||||
HREF="#PropertyTable"><B><I>property table</I></B></A>
|
||||
also known as the <A HREF="#RootEntry"><B><I>root
|
||||
element</I></B></A>, and a pointer to the <B>small Block
|
||||
Allocation Table (SBAT)</B>.
|
||||
</P>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
The <A HREF="#BAT"><B><I>block allocation
|
||||
table</I></B></A> or <B><I>BAT</I></B>, along with the <A
|
||||
HREF="#PropertyTable"><B><I>property table</I></B></A>
|
||||
specify which blocks in the filesystem belong to which
|
||||
files. It is somewhat hard to conceptualize the Block
|
||||
Allocation Table at first. The block allocation table is
|
||||
essentially an array of integers that point at each
|
||||
other. These elements form chains.
|
||||
</P>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
To read the <A HREF="#BAT"><B><I>block allocation
|
||||
table</I></B></A> you must first read the <B><I>start
|
||||
block </I></B>of the file from the <A
|
||||
HREF="#PropertyTable"><B><I>property
|
||||
table</I></B></A>. This is both your index for the next
|
||||
element in the <B><I>BAT </I></B>array as well as the
|
||||
index of the first block in your file. For instance: if
|
||||
the <B><I>start block</I></B> from your file's property is
|
||||
0 then you read block 0 (the first block after the header)
|
||||
from your filesystem as the first block of your file. You
|
||||
also read element 0 from the <B><I>BAT array</I></B>.
|
||||
Supposing this element has a value equal to 2, you'd read
|
||||
block 2 from your filesystem as the next block of your
|
||||
file and element 2 from your <B><I>BAT array</I></B>.
|
||||
This will be covered further later in this document.
|
||||
</P>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
The <A HREF="#PropertyTable"><B><I>Property
|
||||
Table</I></B></A> is essentially the directory structure
|
||||
for the filesystem. It consists of the name of the file or
|
||||
directory, its <B><I>start block</I></B> in both the
|
||||
filesystem and <B><I>BAT</I></B>, and its actual size.
|
||||
The first property in the <A
|
||||
HREF="#PropertyTable">property table</A> is the <A
|
||||
HREF="#RootEntry"><B><I>root element</I></B></A>. Its real
|
||||
purpose is to hold the start block for the <B><I>small
|
||||
blocks.</I></B>
|
||||
</P>
|
||||
<H3>Filesystem Structure</H3>
|
||||
<P STYLE="margin-bottom: 0in; font-weight: medium">
|
||||
All values in the POI filesystem are stored in
|
||||
"little-endian" order, meaning you must reverse
|
||||
the order of the bytes before assigning them to
|
||||
variables. Assume the values you see below are originally
|
||||
stored backwards.
|
||||
</P>
|
||||
<P STYLE="margin-bottom: 0in; font-weight: medium">
|
||||
The POI filesystem is divided into 512 byte blocks. Each
|
||||
block has an implicit block-type. The order and a
description of these are given below.
|
||||
</P>
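<P STYLE="margin-bottom: 0in">
Because every big block is 512 bytes, a block index converts directly to a byte offset in the underlying file. The sketch below assumes, as the walkthrough above states, that block 0 is the first block after the header; the class and method names are hypothetical.
</P>

```java
final class BlockOffsets {
    static final int BIG_BLOCK_SIZE = 512;

    private BlockOffsets() {}

    /** Byte offset, in the native file, of the big block with the given index. */
    static long offsetOfBlock(int blockIndex) {
        if (blockIndex < 0) {
            throw new IllegalArgumentException("not a real block index: " + blockIndex);
        }
        // Block 0 follows the 512-byte header, so skip one extra block.
        return ((long) blockIndex + 1) * BIG_BLOCK_SIZE;
    }
}
```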
|
||||
<A NAME="HeaderBlock"><H3>Header Block</H3></A>
|
||||
<P STYLE="margin-bottom: 0in; font-weight: medium">
|
||||
The POI filesystem begins with a <B><I>header
|
||||
block</I></B>. The first 64 bits of the header form a long
|
||||
<B><I>file type id</I></B> or <B><I>magic number
|
||||
identifier</I></B> of
|
||||
<CODE>0xE11AB1A1E011CFD0L</CODE>. This is basically a
|
||||
sanity check. If this isn't the first thing in the header
|
||||
(and consequently the filesystem) then this is not a POI
|
||||
filesystem and should be read with some other library.
|
||||
</P>
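<P STYLE="margin-bottom: 0in">
The sanity check can be sketched as follows. This is an illustration only, not POI's own reader; the class and method names are assumptions. It reads the first 8 bytes in little-endian order and compares them with the magic number given above.
</P>

```java
final class MagicNumberCheck {
    static final long MAGIC = 0xE11AB1A1E011CFD0L;

    private MagicNumberCheck() {}

    /** Returns true if the buffer starts with the POI filesystem magic number. */
    static boolean looksLikePoiFilesystem(byte[] header) {
        if (header == null || header.length < 8) {
            return false;
        }
        long value = 0;
        // Little-endian: the last of the 8 bytes is the most significant.
        for (int i = 7; i >= 0; i--) {
            value = (value << 8) | (header[i] & 0xffL);
        }
        return value == MAGIC;
    }
}
```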
|
||||
<P STYLE="margin-bottom: 0in; font-weight: medium">
|
||||
The most important parts of the header are discussed in the
rest of this section.
|
||||
</P>
|
||||
<H4>BATs</H4>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
At offset <B>0x2c</B> is an int specifying the number of
|
||||
elements in the <B><I>BAT array</I></B>. The array at
|
||||
<B>0x4c</B> is an array of ints. This array contains the
|
||||
indices of every block in the <A HREF="#BAT">Block
|
||||
Allocation Table</A>.
|
||||
</P>
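<P STYLE="margin-bottom: 0in">
As a rough sketch of reading these two fields, the following uses <CODE>java.nio.ByteBuffer</CODE> in little-endian mode. The class and method names are hypothetical, and the cap of 109 entries is the header capacity mentioned in the XBAT discussion; any further BAT blocks would have to come from XBAT blocks, which this sketch does not read.
</P>

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class HeaderBatArray {
    static final int HEADER_SIZE = 512;
    static final int BAT_COUNT_OFFSET = 0x2c;
    static final int BAT_ARRAY_OFFSET = 0x4c;
    static final int MAX_HEADER_BAT_ENTRIES = 109;

    private HeaderBatArray() {}

    /** Returns the BAT block indices that are listed directly in the header. */
    static int[] readBatBlockIndices(byte[] header) {
        if (header.length < HEADER_SIZE) {
            throw new IllegalArgumentException("header block must be 512 bytes");
        }
        ByteBuffer buf = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN);
        int count = buf.getInt(BAT_COUNT_OFFSET);
        if (count < 0) {
            throw new IllegalArgumentException("negative BAT count: " + count);
        }
        int inHeader = Math.min(count, MAX_HEADER_BAT_ENTRIES);
        int[] indices = new int[inHeader];
        for (int i = 0; i < inHeader; i++) {
            indices[i] = buf.getInt(BAT_ARRAY_OFFSET + 4 * i);
        }
        return indices;
    }
}
```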
|
||||
<H4><I><B>XBATs</B></I></H4>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
Very large POI archives may have more blocks than can be
|
||||
addressed by the BAT blocks enumerated in the header
|
||||
block. How large? Well, the BAT array in the header can
|
||||
contain up to 109 BAT block indices; each BAT block
|
||||
references up to 128 blocks, and each block is 512 bytes,
|
||||
so we're talking about 109 * 128 * 512 = 6.8MB. That's a
|
||||
pretty respectable document! But, you could have much more
|
||||
data than that, and in today's world of cheap gigabyte
|
||||
drives, why not? So, the BAT may be extended in that
|
||||
event. The integer value at offset <B>0x44</B> of the
|
||||
header is the index of the first <B><I>extended BAT (XBAT)
|
||||
block</I></B>. At offset <B>0x48</B> of the header, there
|
||||
is an int value that specifies how many XBAT blocks there
|
||||
are. The XBAT blocks begin at the specified index into the
|
||||
array of blocks making up the POI filesystem, and continue
|
||||
in sequence for the specified count of XBAT blocks.
|
||||
</p>
|
||||
<p>
|
||||
Each XBAT block contains the indices of up to 128 BAT
|
||||
blocks, so the document size can be expanded by another
|
||||
8MB for each XBAT block. The BAT blocks indexed by an XBAT
|
||||
block are appended to the end of the list of BAT blocks
|
||||
enumerated in the header block. Thus the BAT blocks
|
||||
enumerated in the header block are BAT blocks 0 through
|
||||
108, the BAT blocks enumerated in the first XBAT block are
|
||||
BAT blocks 109 through 236, the BAT blocks enumerated in
|
||||
the second XBAT block are BAT blocks 237 through 364, and
|
||||
so on.
|
||||
</P>
|
||||
<p>
|
||||
Through the use of XBAT blocks, the limit on the overall
|
||||
document size is that imposed by the 4-byte block indices;
|
||||
if the indices are unsigned ints, the maximum file size is
|
||||
2 terabytes, 1 terabyte if the indices are treated as
|
||||
signed ints. Either way, I have yet to see a disk drive
|
||||
large enough to accommodate such a file on the shelves at
|
||||
the local office supply stores.
|
||||
</p>
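<P STYLE="margin-bottom: 0in">
The arithmetic in this section can be checked with a few lines of code. The sketch below takes the figures above at face value (109 header entries, 128 entries per block, 512-byte blocks); the class and method names are illustrative assumptions.
</P>

```java
final class XbatArithmetic {
    static final int HEADER_BAT_ENTRIES = 109;
    static final int ENTRIES_PER_BLOCK = 128;
    static final int BIG_BLOCK_SIZE = 512;

    private XbatArithmetic() {}

    /** Bytes addressable using only the BAT blocks listed in the header. */
    static long bytesAddressableFromHeader() {
        return (long) HEADER_BAT_ENTRIES * ENTRIES_PER_BLOCK * BIG_BLOCK_SIZE;
    }

    /** Additional bytes addressable for each XBAT block. */
    static long bytesAddedPerXbatBlock() {
        return (long) ENTRIES_PER_BLOCK * ENTRIES_PER_BLOCK * BIG_BLOCK_SIZE;
    }

    /** XBAT block (0-based) that lists the given BAT block, or -1 if the header lists it. */
    static int xbatBlockFor(int batBlockNumber) {
        if (batBlockNumber < HEADER_BAT_ENTRIES) {
            return -1;
        }
        return (batBlockNumber - HEADER_BAT_ENTRIES) / ENTRIES_PER_BLOCK;
    }

    /** Position of the given BAT block within its XBAT block. */
    static int slotWithinXbatBlock(int batBlockNumber) {
        if (batBlockNumber < HEADER_BAT_ENTRIES) {
            throw new IllegalArgumentException("BAT block is listed in the header");
        }
        return (batBlockNumber - HEADER_BAT_ENTRIES) % ENTRIES_PER_BLOCK;
    }
}
```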
|
||||
<H4>SBATs</H4>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
If a file contained in a POI archive is smaller than 4096
|
||||
bytes, it is stored in small blocks. Small blocks are 64
|
||||
bytes in length and are contained within big blocks, up to
|
||||
8 to a big block. As the main BAT is used to navigate the
|
||||
array of big blocks, so the <B><I>small block allocation
|
||||
table</I></B> is used to navigate the array of small
|
||||
blocks. The SBAT's start block index is found at offset
|
||||
<B>0x3C</B> of the header block, and remaining blocks
|
||||
constituting the SBAT are found by walking the main BAT as
|
||||
if it were an ordinary file in the POI filesystem (this
|
||||
process is described below).
|
||||
</P>
|
||||
<H4>Property Table Start Index</H4>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
An integer at address <B>0x30</B> specifies the start
|
||||
index of the <A HREF="#PropertyTable">property
|
||||
table</A>. This integer is specified as a
|
||||
<B><I>"block index". </I></B>The <A
|
||||
HREF="#PropertyTable">Property Table</A> is stored, as is
|
||||
almost everything in a POI file system, in big blocks and
|
||||
walked via the BAT. The <A HREF="#PropertyTable">Property
|
||||
Table</A> is described below.
|
||||
</P>
|
||||
<A NAME="PropertyTable"><H3>Property Table</H3></A>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
The property table is essentially nothing more than the
|
||||
directory system. Properties are 128 byte records
|
||||
contained within the 512 byte blocks. The first property
|
||||
is always the <A HREF="#RootEntry">Root Entry</A>. The
|
||||
following applies to individual properties within a
|
||||
property table:
|
||||
</P>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
At offset <B>0x00</B> in the property is the
|
||||
"<B><I>name</I></B>". This is stored as an
|
||||
uncompressed 16 bit unicode string. In short every other
|
||||
byte corresponds to an "ASCII" character. The
|
||||
size of this string is stored at offset <B>0x40</B>
|
||||
(<B><I>string size</I></B>) as a short.
|
||||
</P>
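<P STYLE="margin-bottom: 0in">
A minimal sketch of decoding the name follows. It does not rely on the stored string size; instead it assumes that unused characters are 0x0000, as the property field table later in this document states, and stops at the first one. The class and method names are hypothetical.
</P>

```java
final class PropertyName {
    static final int NAME_OFFSET = 0x00;
    static final int NAME_FIELD_BYTES = 0x40;

    private PropertyName() {}

    /** Decodes the 16-bit little-endian name at the start of a property record. */
    static String readName(byte[] property) {
        StringBuilder name = new StringBuilder();
        for (int i = NAME_OFFSET; i + 1 < NAME_OFFSET + NAME_FIELD_BYTES; i += 2) {
            // Low byte first, then high byte.
            char c = (char) (((property[i + 1] & 0xff) << 8) | (property[i] & 0xff));
            if (c == 0) {
                break; // terminator or unused element
            }
            name.append(c);
        }
        return name.toString();
    }
}
```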
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
At offset <B>0x42</B> is the <B><I>property type</I></B>
|
||||
(byte). The type is 1 for directory, 2 for file or 5 for
|
||||
the Root Entry.
|
||||
</P>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
At offset <B>0x43</B> is the <B><I>node color</I></B>
|
||||
(byte). The color is either 1 (black) or 0
(red). Properties are apparently meant to be arranged in a
|
||||
red-black binary tree, subject to the following rules:
|
||||
<A name="node_rules"></A>
|
||||
<OL>
|
||||
<LI>The root of the tree is always black
|
||||
<LI>Two consecutive nodes cannot both be red
|
||||
<LI>A property is less than another property if its
|
||||
name length is less than the other property's name
|
||||
length
|
||||
<LI>If two properties have the same name length, the
|
||||
sort order is determined by the sort order of the
|
||||
properties' names.
|
||||
</OL>
|
||||
</P>
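<P STYLE="margin-bottom: 0in">
The two ordering rules above can be sketched as a comparison function. The rules do not say how names of equal length are ordered beyond "the sort order of the properties' names", so the use of <CODE>String.compareTo</CODE> here is an assumption, as are the class and method names.
</P>

```java
final class PropertyOrdering {
    private PropertyOrdering() {}

    /**
     * Negative if name a sorts before name b, positive if after, zero if equal.
     * Shorter names always come first; ties are broken by String.compareTo,
     * which is an assumption of this sketch.
     */
    static int compare(String a, String b) {
        if (a.length() != b.length()) {
            return Integer.compare(a.length(), b.length());
        }
        return a.compareTo(b);
    }
}
```

<P STYLE="margin-bottom: 0in">
Under these rules a one-character name such as "Z" sorts before a two-character name such as "AA", which is not what a plain alphabetical sort would give.
</P>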
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
At offset <B>0x44</B> is the index (int) of the
|
||||
<B><I>previous property</I></B>.
|
||||
</P>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
At offset <B>0x48</B> is the index (int) of the <B><I>next
|
||||
property</I></B>.
|
||||
</P>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
At offset <B>0x4C</B> is the index (int) of the
|
||||
<B><I>first directory entry</I></B>.
|
||||
</P>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
At offset <B>0x74</B> is an integer giving the <B><I>start
|
||||
block</I></B> for the file described by this
|
||||
property. This index corresponds to an index in the array
|
||||
of indices that is the Block Allocation Table (or the
|
||||
Small Block Allocation Table) as well as the index of the
|
||||
first block in the file.
|
||||
</P>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
At offset <B>0x78</B> is an integer giving the total
|
||||
<B><I>actual size</I></B> of the file pointed at by this
|
||||
property. If the file size is less than 4096, the file is
|
||||
stored in small blocks and the SBAT is used to walk the
|
||||
small blocks making up the file. If the file size is 4096
|
||||
or larger, the file is stored in big blocks and the main
|
||||
BAT is used to walk the big blocks making up the file. The
|
||||
exception to this rule is the <B><I>Root Entry</I></B>,
|
||||
which, regardless of its size, is ALWAYS stored in big
|
||||
blocks and the main BAT is used to walk the big blocks
|
||||
making up this special file.
|
||||
</P>
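<P STYLE="margin-bottom: 0in">
The choice between the two tables can be written down in a few lines. The following is a sketch under the rules just stated, with hypothetical class and method names; the property type value 5 for the Root Entry is taken from the property type description above.
</P>

```java
final class BlockTableChoice {
    static final int SMALL_FILE_LIMIT = 4096;
    static final int SMALL_BLOCK_SIZE = 64;
    static final int BIG_BLOCK_SIZE = 512;
    static final byte ROOT_ENTRY_TYPE = 5;

    private BlockTableChoice() {}

    /** True if the file is stored in small blocks and walked via the SBAT. */
    static boolean usesSmallBlocks(byte propertyType, int size) {
        // The Root Entry always lives in big blocks, whatever its size.
        return propertyType != ROOT_ENTRY_TYPE && size < SMALL_FILE_LIMIT;
    }

    /** Number of blocks (small or big, as appropriate) needed to hold the file. */
    static int blocksNeeded(byte propertyType, int size) {
        int blockSize = usesSmallBlocks(propertyType, size) ? SMALL_BLOCK_SIZE : BIG_BLOCK_SIZE;
        return (size + blockSize - 1) / blockSize; // round up
    }
}
```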
|
||||
<A NAME="RootEntry"><H3>Root Entry</H3></A>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
The <B><I>Root Entry</I></B> in the <A
|
||||
HREF="#PropertyTable"><B><I>Property Table</I></B></A>
|
||||
contains the information necessary to read and write small
|
||||
files, which are files less than 4096 bytes long. The
|
||||
start block field of the Root Entry is the start index of
|
||||
the <B><I>Small Block Array</I></B>, which is read like
|
||||
any other file in the POI filesystem. Since the SBAT
|
||||
cannot be used without the Small Block Array, the Root
|
||||
Entry MUST be read or written using the <A
|
||||
HREF="#BAT"><B><I>Block Allocation Table</I></B></A>. The
|
||||
blocks making up the Small Block Array are divided into
|
||||
64-byte small blocks, up to the size indicated in the Root
|
||||
Entry (which should always be a multiple of 64).
|
||||
</P>
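<P STYLE="margin-bottom: 0in">
Locating a given small block inside the Small Block Array is simple arithmetic, sketched below with hypothetical names. The "big block position" is the position within the chain of big blocks that starts at the Root Entry's start block, not a block index in the file.
</P>

```java
final class SmallBlockLocator {
    static final int SMALL_BLOCK_SIZE = 64;
    static final int SMALL_BLOCKS_PER_BIG_BLOCK = 8;

    private SmallBlockLocator() {}

    /** Position, within the Root Entry's chain of big blocks, holding the small block. */
    static int bigBlockPosition(int smallBlockIndex) {
        return smallBlockIndex / SMALL_BLOCKS_PER_BIG_BLOCK;
    }

    /** Byte offset of the small block within that big block. */
    static int offsetWithinBigBlock(int smallBlockIndex) {
        return (smallBlockIndex % SMALL_BLOCKS_PER_BIG_BLOCK) * SMALL_BLOCK_SIZE;
    }

    /** Number of small blocks in the Small Block Array, given the Root Entry's size. */
    static int smallBlockCount(int rootEntrySize) {
        if (rootEntrySize % SMALL_BLOCK_SIZE != 0) {
            throw new IllegalArgumentException("size should be a multiple of 64");
        }
        return rootEntrySize / SMALL_BLOCK_SIZE;
    }
}
```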
|
||||
<H3>Walking the Nodes of the <A HREF="#PropertyTable">Property
|
||||
Table</A></H3>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
The individual properties form a directory tree, with the
|
||||
<B><I>Root Entry</I></B> as the directory tree's root, as
|
||||
shown in the accompanying drawing. Note the numbers in
|
||||
parentheses in each node; they represent the node's index
|
||||
in the array of properties. The <B>NEXT_PROP</B>,
|
||||
<B>PREVIOUS_PROP</B>, and <B>CHILD_PROP</B> fields hold
|
||||
these indices, and are used to navigate the tree.
|
||||
</P>
|
||||
<P>
|
||||
<IMG SRC="PropertySet.jpg">
|
||||
</P>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
Each <A NAME="directoryEntry">directory entry</A> (i.e., a
|
||||
property whose type is <B><I>directory</I></B> or
|
||||
<B><I>root entry</I></B>) uses its <B>CHILD_PROP</B> field
|
||||
to point to one of its subordinate (child) properties. It
|
||||
doesn't seem to matter which of its children it points
|
||||
to. Thus in the previous drawing, the Root Entry's
|
||||
CHILD_PROP field may contain 1, 4, or the index of one of
|
||||
its other children. Similarly, the directory node (index
|
||||
1) may have, in its CHILD_PROP field, 2, 3, or the index
|
||||
of one of its other children.
|
||||
</P>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
The children of a given <A
|
||||
HREF="#directoryEntry">directory property</A> point to
|
||||
each other in a similar fashion by using their
|
||||
<B>NEXT_PROP</B> and <B>PREVIOUS_PROP</B> fields. The
|
||||
ordering of the children is governed by rules described <a
|
||||
href="#node_rules">here</a>.
|
||||
</P>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
Unused <B>NEXT_PROP</B>, <B>PREVIOUS_PROP</B>, and
|
||||
<B>CHILD_PROP</B> fields contain the marker value of
|
||||
-1. For example, all file properties have a value of -1 for
their CHILD_PROP fields.
|
||||
</P>
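<P STYLE="margin-bottom: 0in">
Listing the children of a directory can be sketched as a recursive walk over these three fields. The sketch treats PREVIOUS_PROP as the left subtree and NEXT_PROP as the right subtree of a binary tree, which is an assumption consistent with the red-black tree description above; the class and method names, and the use of three parallel arrays indexed by property index, are illustrative only.
</P>

```java
import java.util.ArrayList;
import java.util.List;

final class PropertyTreeWalk {
    static final int NO_INDEX = -1;

    private PropertyTreeWalk() {}

    /** Returns the indices of all children of the given directory property. */
    static List<Integer> childrenOf(int directory, int[] previous, int[] next, int[] child) {
        List<Integer> result = new ArrayList<>();
        collect(child[directory], previous, next, result);
        return result;
    }

    // In-order walk: PREVIOUS_PROP side, the node itself, then NEXT_PROP side.
    // A corrupt file containing a cycle would recurse forever; a real reader
    // should guard against that.
    private static void collect(int index, int[] previous, int[] next, List<Integer> out) {
        if (index == NO_INDEX) {
            return;
        }
        collect(previous[index], previous, next, out);
        out.add(index);
        collect(next[index], previous, next, out);
    }
}
```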
|
||||
<A NAME="BAT"><H3>Block Allocation Table</H3></A>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
The <B><I>BAT blocks</I></B> are pointed at by the BAT
|
||||
array contained in the <A HREF="#HeaderBlock">header</A>
|
||||
and supplemented, if necessary, by the <B><I>XBAT
|
||||
blocks</I></B>. These blocks form a large table of
|
||||
integers. These integers are block numbers. The
|
||||
<B><I>Block Allocation Table</I></B> holds chains of
|
||||
integers. These chains are terminated with -2. The
|
||||
elements in these chains refer to blocks in the files. The
|
||||
starting block of a file is NOT specified in the BAT. It
|
||||
is specified by the <B><I>property</I></B> for a given
|
||||
file. The elements in this BAT are both the block number
|
||||
(within the file minus the header) AND the number of the
|
||||
next BAT element in the chain. This can be thought of as a
|
||||
linked list of blocks. The BAT array contains the links
|
||||
from one block to the next, including the end of chain
|
||||
marker.
|
||||
</P>
|
||||
<P>
|
||||
Here's an example: Let's assume that the BAT begins as
|
||||
follows:
|
||||
</P>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
<FONT FACE="Courier, monospace"><B>BAT[ 0 ] = 2</B></FONT>
|
||||
</P>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
<FONT FACE="Courier, monospace"><B>BAT[ 1 ] = 5</B></FONT>
|
||||
</P>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
<FONT FACE="Courier, monospace"><B>BAT[ 2 ] = 3</B></FONT>
|
||||
</P>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
<FONT FACE="Courier, monospace"><B>BAT[ 3 ] = 4</B></FONT>
|
||||
</P>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
<FONT FACE="Courier, monospace"><B>BAT[ 4 ] = 6</B></FONT>
|
||||
</P>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
<FONT FACE="Courier, monospace"><B>BAT[ 5 ] =
|
||||
-2</B></FONT>
|
||||
</P>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
<FONT FACE="Courier, monospace"><B>BAT[ 6 ] = 7</B></FONT>
|
||||
</P>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
<FONT FACE="Courier, monospace"><B>BAT[ 7 ] =
|
||||
-2</B></FONT>
|
||||
</P>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
<B>...</B>
|
||||
</P>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
Now, if we have a file whose <A
|
||||
HREF="#PropertyTable">Property Table</A> entry says it
|
||||
begins with index 0, we walk the BAT array and see that
|
||||
the file consists of blocks 0 (because the start block is
|
||||
0), 2 (because BAT[ 0 ] is 2), 3 (BAT[ 2 ] is 3), 4 (BAT[
|
||||
3 ] is 4), 6 (BAT[ 4 ] is 6), and 7 (BAT[ 6 ] is 7). It
|
||||
ends at block 7 because BAT[ 7 ] is -2, which is the end
|
||||
of chain marker.
|
||||
</P>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
Similarly, a file beginning at index 1 consists of
|
||||
blocks 1 and 5.
|
||||
</P>
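<P STYLE="margin-bottom: 0in">
The walk described above can be sketched as a short loop. This is an illustration with hypothetical names, not POI's implementation; it assumes the whole BAT has already been read into a single <CODE>int</CODE> array, and it adds a guard against broken or circular chains that the text does not discuss.
</P>

```java
import java.util.ArrayList;
import java.util.List;

final class BatChainWalk {
    static final int END_OF_CHAIN = -2;

    private BatChainWalk() {}

    /** Returns, in order, the block indices making up the file that starts at startBlock. */
    static List<Integer> blocksOf(int[] bat, int startBlock) {
        List<Integer> blocks = new ArrayList<>();
        int current = startBlock;
        while (current != END_OF_CHAIN) {
            // Reject unused/special markers, out-of-range indices, and cycles.
            if (current < 0 || current >= bat.length || blocks.size() >= bat.length) {
                throw new IllegalStateException("broken chain at " + current);
            }
            blocks.add(current);
            current = bat[current]; // each element names the next block
        }
        return blocks;
    }
}
```

<P STYLE="margin-bottom: 0in">
With the example BAT above, a start block of 0 yields blocks 0, 2, 3, 4, 6, and 7, and a start block of 1 yields blocks 1 and 5.
</P>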
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
Other special numbers in a BAT array are:
|
||||
</P>
|
||||
<UL>
|
||||
<LI>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
-1, which indicates an unused block
|
||||
</P>
|
||||
</LI>
|
||||
<LI>
|
||||
<P STYLE="margin-bottom: 0in">
|
||||
-3, which indicates a "special" block,
|
||||
such as a block used to make up the Small Block
|
||||
Array, the <A HREF="#PropertyTable">Property
|
||||
Table</A>, the main BAT, or the SBAT
|
||||
</P>
|
||||
</LI>
|
||||
</UL>
|
||||
<H2>Filesystem Structures</H2>
|
||||
<P>
|
||||
The following outlines the basic filesystem structures.
|
||||
</P>
|
||||
<H3>Header (block 1) -- 512 (0x200) bytes</H3>
|
||||
<TABLE BORDER=0 CELLPADDING=4 CELLSPACING=0>
|
||||
<TR VALIGN=TOP>
|
||||
<TD><B>Field</B></TD>
|
||||
<TD><B>Description</B></TD>
|
||||
<TD><B>Offset</B></TD>
|
||||
<TD><B>Length</B></TD>
|
||||
<TD><B>Default value or const</B></TD>
|
||||
</TR>
|
||||
<TR VALIGN=TOP>
|
||||
<TD>FILETYPE</TD>
|
||||
<TD>Magic number identifying this as a POI
|
||||
filesystem.</TD>
|
||||
<TD>0x0000</TD>
|
||||
<TD>Long</TD>
|
||||
<TD>0xE11AB1A1E011CFD0</TD>
|
||||
</TR>
|
||||
<TR VALIGN=TOP>
|
||||
<TD>UK1</TD>
|
||||
<TD>Unknown constant</TD>
|
||||
<TD>0x0008</TD>
|
||||
<TD>Integer</TD>
|
||||
<TD>0</TD>
|
||||
</TR>
|
||||
<TR VALIGN=TOP>
|
||||
<TD>UK2</TD>
|
||||
<TD>Unknown Constant</TD>
|
||||
<TD>0x000C</TD>
|
||||
<TD>Integer</TD>
|
||||
<TD>0</TD>
|
||||
</TR>
|
||||
<TR VALIGN=TOP>
|
||||
<TD>UK3</TD>
|
||||
<TD>Unknown Constant</TD>
|
||||
<TD>0x0014</TD>
|
||||
<TD>Integer</TD>
|
||||
<TD>0</TD>
|
||||
</TR>
|
||||
<TR VALIGN=TOP>
|
||||
<TD>UK4</TD>
|
||||
<TD>Unknown Constant (revision?)</TD>
|
||||
<TD>0x0018</TD>
|
||||
<TD>Short</TD>
|
||||
<TD>0x003B</TD>
|
||||
</TR>
|
||||
<TR VALIGN=TOP>
|
||||
<TD>UK5</TD>
|
||||
<TD>Unknown Constant (version?)</TD>
|
||||
<TD>0x001A</TD>
|
||||
<TD>Short</TD>
|
||||
<TD>0x0003</TD>
|
||||
</TR>
|
||||
<TR VALIGN=TOP>
|
||||
<TD>UK6</TD>
|
||||
<TD>Unknown Constant</TD>
|
||||
<TD>0x001C</TD>
|
||||
<TD>Short</TD>
|
||||
<TD>-2</TD>
|
||||
</TR>
|
||||
<TR VALIGN=TOP>
|
||||
<TD>LOG_2_BIG_BLOCK_SIZE</TD>
|
||||
<TD>Log, base 2, of the big block size</TD>
|
||||
<TD>0x001E</TD>
|
||||
<TD>Short</TD>
|
||||
<TD>9 (2 ^ 9 = 512 bytes)</TD>
|
||||
</TR>
|
||||
<TR VALIGN=TOP>
|
||||
<TD>LOG_2_SMALL_BLOCK_SIZE</TD>
|
||||
<TD>Log, base 2, of the small block size</TD>
|
||||
<TD>0x0020</TD>
|
||||
<TD>Integer</TD>
|
||||
<TD>6 (2 ^ 6 = 64 bytes)</TD>
|
||||
</TR>
|
||||
<TR VALIGN=TOP>
|
||||
<TD>UK7</TD>
|
||||
<TD>Unknown Constant</TD>
|
||||
<TD>0x0024</TD>
|
||||
<TD>Integer</TD>
|
||||
<TD>0</TD>
|
||||
</TR>
|
||||
<TR VALIGN=TOP>
|
||||
<TD>UK8</TD>
|
||||
<TD>Unknown Constant</TD>
|
||||
<TD>0x0028</TD>
|
||||
<TD>Integer</TD>
|
||||
<TD>0</TD>
|
||||
</TR>
|
||||
<TR VALIGN=TOP>
|
||||
<TD>BAT_COUNT</TD>
|
||||
<TD>Number of elements in the BAT array</TD>
|
||||
<TD>0x002C</TD>
|
||||
<TD>Integer</TD>
|
||||
<TD>required</TD>
|
||||
</TR>
|
||||
<TR VALIGN=TOP>
|
||||
<TD>PROPERTIES_START</TD>
|
||||
<TD>Block index of the first block of the <A
|
||||
HREF="#PropertyTable">property table</A></TD>
|
||||
<TD>0x0030</TD>
|
||||
<TD>Integer</TD>
|
||||
<TD>required</TD>
|
||||
</TR>
|
||||
<TR VALIGN=TOP>
|
||||
<TD>UK9</TD>
|
||||
<TD>Unknown Constant</TD>
|
||||
<TD>0x0034</TD>
|
||||
<TD>Integer</TD>
|
||||
<TD>0</TD>
|
||||
</TR>
|
||||
<TR VALIGN=TOP>
|
||||
<TD>UK10</TD>
|
||||
<TD>Unknown Constant</TD>
|
||||
<TD>0x0038</TD>
|
||||
<TD>Integer</TD>
|
||||
<TD>0x00001000</TD>
|
||||
</TR>
|
||||
<TR VALIGN=TOP>
|
||||
<TD>SBAT_START</TD>
|
||||
<TD>Block index of first big block containing the
|
||||
small block allocation table (SBAT)</TD>
|
||||
<TD>0x003C</TD>
|
||||
<TD>Integer</TD>
|
||||
<TD>-2</TD>
|
||||
</TR>
|
||||
<TR VALIGN=TOP>
|
||||
<TD>UK11</TD>
|
||||
<TD>Unknown Constant</TD>
|
||||
<TD>0x0040</TD>
|
||||
<TD>Integer</TD>
|
||||
<TD>1</TD>
|
||||
</TR>
|
||||
<TR VALIGN=TOP>
|
||||
<TD>XBAT_START</TD>
|
||||
<TD>Block index of the first block in the Extended
|
||||
Block Allocation Table (XBAT)</TD>
|
||||
<TD>0x0044</TD>
|
||||
<TD>Integer</TD>
|
||||
<TD>-2</TD>
|
||||
</TR>
|
||||
<TR VALIGN=TOP>
|
||||
<TD>XBAT_COUNT</TD>
|
||||
<TD>Number of elements in the Extended Block
|
||||
Allocation Table (to be added to the BAT)</TD>
|
||||
<TD>0x0048</TD>
|
||||
<TD>Integer</TD>
|
||||
<TD>0</TD>
|
||||
</TR>
|
||||
<TR VALIGN=TOP>
|
||||
<TD>BAT_ARRAY</TD>
|
||||
<TD>Array of block indices constituting the <A
|
||||
HREF="#BAT">Block Allocation Table (BAT)</A></TD>
|
||||
<TD>0x004C, 0x0050, 0x0054 ... 0x01FC</TD>
|
||||
<TD>Integer[ ]</TD>
|
||||
<TD>-1 for unused elements, at least first element
|
||||
must be filled.</TD>
|
||||
</TR>
|
||||
<TR VALIGN=TOP>
|
||||
<TD>N/A</TD>
|
||||
<TD>Header block data not otherwise described in this
|
||||
table</TD>
|
||||
<TD>N/A</TD>
|
||||
<TD>N/A</TD>
|
||||
<TD>-1</TD>
|
||||
</TR>
|
||||
</TABLE>
|
||||
<A HREF="#BAT"><H3><B>Block Allocation Table Block -- 512
|
||||
(0x200) bytes</B></H3></A>
|
||||
<TABLE BORDER=0 CELLPADDING=4 CELLSPACING=0>
|
||||
<TR VALIGN=TOP>
|
||||
<TD><B>Field</B></TD>
|
||||
<TD><B>Description</B></TD>
|
||||
<TD><B>Offset</B></TD>
|
||||
<TD><B>Length</B></TD>
|
||||
<TD><B>Default value or const</B></TD>
|
||||
</TR>
|
||||
<TR VALIGN=TOP>
|
||||
<TD>BAT_ELEMENT</TD>
|
||||
<TD>Any given element in the BAT block</TD>
|
||||
<TD>0x0000, 0x0004, 0x0008, ... 0x01FC</TD>
|
||||
<TD>Integer</TD>
|
||||
<TD>-1 = unused<BR>
|
||||
-2 = end of chain<BR>
|
||||
-3 = special (e.g., BAT block)<BR>
|
||||
All other values point to the next element in the
|
||||
chain and the next index of a block composing the
|
||||
file.</TD>
|
||||
</TR>
|
||||
</TABLE>
|
||||
<H3>Property Block -- 512 (0x200) byte block</H3>
|
||||
<TABLE BORDER=0 CELLPADDING=4 CELLSPACING=0>
|
||||
<TR VALIGN=TOP>
|
||||
<TD><B>Field</B></TD>
|
||||
<TD><B>Description</B></TD>
|
||||
<TD><B>Offset</B></TD>
|
||||
<TD><B>Length</B></TD>
|
||||
<TD><B>Default value or const</B></TD>
|
||||
</TR>
|
||||
<TR VALIGN=TOP>
|
||||
<TD>Properties[ ]</TD>
|
||||
<TD>This block contains the properties.</TD>
|
||||
<TD>0x0000, 0x0080, 0x0100, 0x0180</TD>
|
||||
<TD>128 bytes</TD>
|
||||
<TD>All unused space is set to -1.</TD>
|
||||
</TR>
|
||||
</TABLE>
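<P STYLE="margin-bottom: 0in">
Since each property block holds four 128-byte properties, a property index maps to a position in the property table's block chain and an offset within that block. A minimal sketch, with hypothetical names:
</P>

```java
final class PropertyLocator {
    static final int PROPERTY_SIZE = 128;
    static final int PROPERTIES_PER_BLOCK = 4;

    private PropertyLocator() {}

    /** Position, within the property table's chain of big blocks, of the block holding the property. */
    static int blockPosition(int propertyIndex) {
        return propertyIndex / PROPERTIES_PER_BLOCK;
    }

    /** Byte offset of the property within that block (0x0000, 0x0080, 0x0100, or 0x0180). */
    static int offsetWithinBlock(int propertyIndex) {
        return (propertyIndex % PROPERTIES_PER_BLOCK) * PROPERTY_SIZE;
    }
}
```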
|
||||
<H3>Property -- 128 (0x80) byte block</H3>
|
||||
<TABLE BORDER=0 CELLPADDING=4 CELLSPACING=0>
|
||||
<TR VALIGN=TOP>
|
||||
<TD><B>Field</B></TD>
|
||||
<TD><B>Description</B></TD>
|
||||
<TD><B>Offset</B></TD>
|
||||
<TD><B>Length</B></TD>
|
||||
<TD><B>Default value or const</B></TD>
|
||||
</TR>
|
||||
<TR VALIGN=TOP>
|
||||
<TD>NAME</TD>
|
||||
<TD>A unicode null-terminated uncompressed 16bit
|
||||
string (lose the high bytes) containing the name
|
||||
of the property.</TD>
|
||||
<TD>0x00, 0x02, 0x04, ... 0x3E</TD>
|
||||
<TD>Short[ ]</TD>
|
||||
<TD>0x0000 for unused elements, field required, 32
|
||||
(0x20) element max, occupying 0x40 bytes</TD>
|
||||
</TR>
|
||||
<TR VALIGN=TOP>
|
||||
<TD>NAME_SIZE</TD>
|
||||
<TD>Number of characters in the NAME field</TD>
|
||||
<TD>0x40</TD>
|
||||
<TD>Short</TD>
|
||||
<TD>Required</TD>
|
||||
</TR>
|
||||
<TR VALIGN=TOP>
|
||||
<TD>PROPERTY_TYPE</TD>
|
||||
<TD>Property type (directory, file, or root)</TD>
|
||||
<TD>0x42</TD>
|
||||
<TD>Byte</TD>
|
||||
<TD>1 (directory), 2 (file), or 5 (root entry)</TD>
|
||||
</TR>
|
||||
<TR VALIGN=TOP>
|
||||
<TD>NODE_COLOR</TD>
|
||||
<TD>Node color</TD>
|
||||
<TD>0x43</TD>
|
||||
<TD>Byte</TD>
|
||||
<TD>0 (red) or 1 (black)</TD>
|
||||
</TR>
|
||||
<TR VALIGN=TOP>
|
||||
<TD>PREVIOUS_PROP</TD>
|
||||
<TD>Previous property index</TD>
|
||||
<TD>0x44</TD>
|
||||
<TD>Integer</TD>
|
||||
<TD>-1</TD>
|
||||
</TR>
|
||||
<TR VALIGN=TOP>
|
||||
<TD>NEXT_PROP</TD>
|
||||
<TD>Next property index</TD>
|
||||
<TD>0x48</TD>
|
||||
<TD>Integer</TD>
|
||||
<TD>-1</TD>
|
||||
</TR>
|
||||
<TR VALIGN=TOP>
|
||||
<TD>CHILD_PROP</TD>
|
||||
<TD>First child property index</TD>
|
||||
<TD>0x4c</TD>
|
||||
<TD>Integer</TD>
|
||||
<TD>-1</TD>
|
||||
</TR>
|
||||
<TR VALIGN=TOP>
|
||||
<TD>SECONDS_1</TD>
|
||||
<TD>Seconds component of the created timestamp?</TD>
|
||||
<TD>0x64</TD>
|
||||
<TD>Integer</TD>
|
||||
<TD>0</TD>
|
||||
</TR>
|
||||
<TR VALIGN=TOP>
|
||||
<TD>DAYS_1</TD>
|
||||
<TD>Days since epoch component of the created
|
||||
timestamp?</TD>
|
||||
<TD>0x68</TD>
|
||||
<TD>Integer</TD>
|
||||
<TD>0</TD>
|
||||
</TR>
|
||||
<TR VALIGN=TOP>
|
||||
<TD>SECONDS_2</TD>
|
||||
<TD>Seconds component of the modified timestamp?</TD>
|
||||
<TD>0x6C</TD>
|
||||
<TD>Integer</TD>
|
||||
<TD>0</TD>
|
||||
</TR>
|
||||
<TR VALIGN=TOP>
|
||||
<TD>DAYS_2</TD>
|
||||
<TD>Days since epoch component of the modified
|
||||
timestamp?</TD>
|
||||
<TD>0x70</TD>
|
||||
<TD>Integer</TD>
|
||||
<TD>0</TD>
|
||||
</TR>
|
||||
<TR VALIGN=TOP>
|
||||
<TD>START_BLOCK</TD>
|
||||
<TD>Starting block of the file, used as the first
|
||||
block in the file and the pointer to the next
|
||||
block from the BAT</TD>
|
||||
<TD>0x74</TD>
|
||||
<TD>Integer</TD>
|
||||
<TD>Required</TD>
|
||||
</TR>
|
||||
<TR VALIGN=TOP>
|
||||
<TD>SIZE</TD>
|
||||
<TD>Actual size of the file this property points
|
||||
to. (used to truncate the blocks to the real
|
||||
size).</TD>
|
||||
<TD>0x78</TD>
|
||||
<TD>Integer</TD>
|
||||
<TD>0</TD>
|
||||
</TR>
|
||||
</TABLE>
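<P STYLE="margin-bottom: 0in">
The fixed-offset fields in the table above can be read with a little-endian <CODE>java.nio.ByteBuffer</CODE>. The following is a sketch only: the class name and field names are assumptions, and the name and timestamp fields are left out for brevity.
</P>

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class PropertyRecordSketch {
    static final int RECORD_SIZE = 0x80;

    final byte type;       // 1 = directory, 2 = file, 5 = root entry
    final byte color;      // 0 = red, 1 = black
    final int previous;    // -1 if unused
    final int next;        // -1 if unused
    final int child;       // -1 if unused
    final int startBlock;
    final int size;

    /** Parses one 128-byte property record. */
    PropertyRecordSketch(byte[] record) {
        if (record.length != RECORD_SIZE) {
            throw new IllegalArgumentException("a property is 128 bytes");
        }
        ByteBuffer buf = ByteBuffer.wrap(record).order(ByteOrder.LITTLE_ENDIAN);
        type = buf.get(0x42);
        color = buf.get(0x43);
        previous = buf.getInt(0x44);
        next = buf.getInt(0x48);
        child = buf.getInt(0x4c);
        startBlock = buf.getInt(0x74);
        size = buf.getInt(0x78);
    }
}
```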
|
||||
</BODY>
|
||||
</HTML>
|