POIFS HOW TO

How to use POIFS directly

Andrew C. Oliver - December 14, 2001

10.31.2001 - initial revision for build POI 0.12.3
12.15.2001 - minor revisions - thread safety, entry modification, name restrictions, and so on.
12.30.2001 - revised for POI 1.0-final - minor revisions

Capabilities

This release of POIFS contains the full functionality to read, write, and modify (by recreation) files in the format most commonly referred to as the OLE 2 Compound Document Format (probably tm - Microsoft).

Target Audience

This release is intended for general use and is considered to be production-ready. It has been unit tested quite a bit, but it has not yet been extensively tested under heavy load (especially in a multi-threaded server situation). It is considered to be "golden" because HSSF and other users have used it without problems for some time, and it has not changed recently.

General Use

User API

High level description and overview

Files written with the POIFS library are referred to as POIFS file systems (or sometimes archives). The OLE 2 Compound Document format is designed to mimic many of the characteristics of a pre-modern file system (most similar to FAT). We make the distinction between POIFS-written files and "native" OLE 2 Compound Document Format files because, while we believe POIFS to be a full, correct and complete implementation, most of this was accomplished through researching other open source implementations and flat-out guesses.

This overview is in no way intended to be complete (for a more in-depth discussion, please see POIFSFormat.html in this same directory), but it should give you a good idea of the principles of a POIFS file system. Please note that specific file formats such as XLS (HSSF) or DOC use POIFS file systems to contain their data; POIFS itself does not know how to interpret the archived data.

Every POIFS file system contains a hierarchy of directories starting with the root (there is always one, and only one, root). Each directory, including the root, may contain zero or more directories and/or documents. Every directory and document has a name. The root directory has a name too, but unlike other directories, its name is fixed and cannot be changed.

The POIFS API was not designed to be, and is not, thread-safe. Only one thread of control should ever manipulate a specific POIFS file system over that file system's lifetime. You can, of course, have multiple threads, each manipulating a distinct POIFS file system instance.
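One way to honor the one-thread rule in a multi-threaded program is to route every operation on a file system through a single worker thread. The following is only a minimal sketch under that assumption: ConfinedResource is a hypothetical helper, not part of POIFS, and the wrapped object stands in for a Filesystem instance.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/** Hypothetical helper, not part of POIFS: confines one object to one worker thread. */
class ConfinedResource<T> {
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final T resource;   // e.g. a Filesystem instance; never handed out directly

    ConfinedResource(T resource) {
        this.resource = resource;
    }

    /** Runs the action on the single worker thread and waits for its result. */
    <R> R call(Function<T, R> action) throws Exception {
        Callable<R> task = () -> action.apply(resource);
        return worker.submit(task).get();
    }

    /** Stops the worker thread once all submitted actions have finished. */
    void shutdown() {
        worker.shutdown();
    }
}
```

Callers on any thread go through call(...), so the wrapped object is only ever touched by the worker thread. Each distinct file system would get its own ConfinedResource.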

Writing a new one

To create a new (from scratch) POIFS file system for writing to, you simply create an instance of net.sourceforge.poi.poifs.filesystem.Filesystem using the default constructor (no arguments). Initially this POIFS file system will be empty except for containing the essential root directory.

From there you can create a directory entry by calling Filesystem.createDirectory(name), passing in the name of the directory. This will return an instance of net.sourceforge.poi.poifs.filesystem.DirectoryEntry. You can also create a document within the root directory by calling Filesystem.createDocument(inputstream, name), passing an instance of java.io.InputStream from which the document's data can be obtained, followed by the name of the document. Note that the most commonly used Microsoft file formats, such as DOC and XLS, are all POIFS-compatible file systems with documents stored in the root directory.

To store the document in a directory other than the root, take the instance of DirectoryEntry that you created and call createDocument(name, inputstream) on it instead. You can also create a child directory by calling createDirectory(name). Alternatively, you can call Filesystem.getRoot() and use the result just like any other directory entry.

When you've finished creating entries in the filesystem, simply call Filesystem.writeFilesystem(stream), passing in an instance of java.io.OutputStream. Be sure to close the stream when you're done.

Names

The POIFS file system imposes two limitations on document and directory names:

  1. The names of documents and directories must be unique within their containing directory. Pretty obvious.

  2. Names are restricted to 31 characters. If you create a directory or document with a name longer than that, it will be silently truncated. When truncated, it may conflict with the name of another directory or document, and the create operation will fail.
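The two naming rules above can be modeled in a few lines. This is a hypothetical sketch of the rules as described, not the POIFS implementation: EntryNames, stored, and register are illustrative names, and the sketch assumes names are compared as exact strings.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical model of the two naming rules for one directory; not POIFS code. */
class EntryNames {
    static final int MAX_NAME_LENGTH = 31;

    // Names already used in this directory, in their stored (possibly truncated) form.
    private final Set<String> used = new HashSet<>();

    /** Returns the name as it would be stored: cut to 31 characters if longer. */
    static String stored(String requested) {
        return requested.length() > MAX_NAME_LENGTH
                ? requested.substring(0, MAX_NAME_LENGTH)
                : requested;
    }

    /** Registers a name in this directory; fails if the stored form is already taken. */
    String register(String requested) {
        String name = stored(requested);
        if (!used.add(name)) {
            throw new IllegalArgumentException("duplicate name after truncation: " + name);
        }
        return name;
    }
}
```

Two long names that share their first 31 characters collide here even though the requested names differ, which is the failure case described in rule 2. Checking stored(name) before creating an entry is one way to catch this early.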

Why not Readers and Writers?

The POIFS file system uses Streams because HSSF, and virtually all other applications that would use POIFS, deal with binary files, which Streams handle correctly. Readers and Writers are meant for text and know how to handle 16-bit characters, which is not what binary data needs. If there is demand for Reader and Writer support, let us know.
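To see why character-oriented classes are a poor fit for binary data, the sketch below copies the same bytes once through plain streams and once through a Reader/Writer pair. It uses only java.io and no POIFS classes. BinaryCopyDemo and the choice of UTF-8 are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

/** Illustration only: compares a byte-for-byte copy with a copy routed through text decoding. */
class BinaryCopyDemo {
    /** Copies raw bytes; every byte value survives unchanged. */
    static byte[] copyWithStreams(byte[] data) throws IOException {
        InputStream in = new ByteArrayInputStream(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            out.write(b);
        }
        return out.toByteArray();
    }

    /** Decodes the bytes as UTF-8 text and re-encodes them; invalid sequences get replaced. */
    static byte[] copyWithReaderAndWriter(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8);
             Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            int c;
            while ((c = reader.read()) != -1) {
                writer.write(c);
            }
        }
        return out.toByteArray();
    }
}
```

Fed the eight signature bytes that open an OLE 2 file (D0 CF 11 E0 A1 B1 1A E1), the stream copy returns them unchanged, while the Reader/Writer copy does not, because several of those bytes are not valid UTF-8 and are replaced during decoding.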

Here is some example code for writing a new file system (excerpted and adapted from the net.sourceforge.poi.hssf.usermodel.Workbook class):

        // imports needed at the top of the class file
        import java.io.ByteArrayInputStream;
        import java.io.FileOutputStream;
        import net.sourceforge.poi.poifs.filesystem.Filesystem;

        byte[]     bytes        = getBytes();                                             // get the bytes for the document (elsewhere in the class)
        FileOutputStream stream = new FileOutputStream("/home/reportsys/test/text.xls");  // create a new FileOutputStream
        Filesystem fs           = new Filesystem();                                       // create a new POIFS Filesystem object
        fs.createDocument(new ByteArrayInputStream(bytes), "Workbook");                   // create a new document in the root directory of the POIFS filesystem
                                                                                          // close on ByteArrayInputStream is a no-op so we don't bother; no real file handle is used
        fs.writeFilesystem(stream);                                                       // write the filesystem to the output stream
        stream.close();                                                                   // close our stream (don't leak file handles; it's bad news)
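The example above closes the stream only if nothing throws along the way. One possible way to close it unconditionally is sketched below, assuming a Java version that supports try-with-resources. SafeWrite and StreamWritable are hypothetical names; StreamWritable stands in for a call such as fs.writeFilesystem(stream).

```java
import java.io.IOException;
import java.io.OutputStream;

/** Hypothetical helper, not part of POIFS: writes to a stream and always closes it. */
class SafeWrite {
    /** Stand-in for anything that can write itself to a stream. */
    interface StreamWritable {
        void writeTo(OutputStream out) throws IOException;
    }

    /** Writes the source to the stream and closes the stream even if writing fails. */
    static void writeAndClose(StreamWritable source, OutputStream out) throws IOException {
        try (OutputStream stream = out) {
            source.writeTo(stream);
        }
    }
}
```

With the objects from the example above, the call might look like SafeWrite.writeAndClose(out -> fs.writeFilesystem(out), stream), which would replace the separate writeFilesystem and close calls.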

Reading or modifying an existing file

Reading in an existing POIFS file system is equally simple. Create a new instance of net.sourceforge.poi.poifs.filesystem.Filesystem by calling the Filesystem(java.io.InputStream) constructor and passing in your file system's data (this would probably be a FileInputStream, but any InputStream will do). From there you can get documents from the root directory by calling Filesystem.createDocumentInputStream(name) and passing a string representing that document's name.

If you wish to walk the filesystem, the easiest approach is to call DirectoryEntry.getEntries(). This gives you a java.util.Iterator of the Entry instances contained by the DirectoryEntry (DirectoryEntry and DocumentEntry are extensions of Entry). For instance, you could call Filesystem.getRoot() to retrieve a DirectoryEntry instance, then call getEntries() on it to retrieve an Iterator of its entries. Iterating through these entries, you'd call getName() to check the name of each entry and isDocumentEntry() or isDirectoryEntry() to determine its type. Going the other way, given an Entry, you can walk back up the directory chain by calling getParent(), which returns the Entry's containing DirectoryEntry (calling getParent() on the root directory returns a null reference).
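The walk described above is a plain depth-first traversal. Because the POIFS classes themselves are not reproduced here, the sketch below uses hypothetical stand-ins (WalkSketch.Entry and WalkSketch.Dir) that mirror only the method names mentioned in the text. The add methods and the path format are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-ins that mirror only the methods named in the text; not the POIFS classes. */
class WalkSketch {
    static class Entry {
        private final String name;
        private final Dir parent;          // null for the root
        Entry(String name, Dir parent) { this.name = name; this.parent = parent; }
        String getName() { return name; }
        Dir getParent() { return parent; }
        boolean isDirectoryEntry() { return this instanceof Dir; }
        boolean isDocumentEntry() { return !isDirectoryEntry(); }
    }

    static class Dir extends Entry {
        private final List<Entry> children = new ArrayList<>();
        Dir(String name, Dir parent) { super(name, parent); }
        Dir addDirectory(String name) { Dir d = new Dir(name, this); children.add(d); return d; }
        Entry addDocument(String name) { Entry e = new Entry(name, this); children.add(e); return e; }
        Iterator<Entry> getEntries() { return children.iterator(); }
    }

    /** Builds a "/a/b" style path by following getParent() up to the root. */
    static String pathOf(Entry entry) {
        if (entry.getParent() == null) {
            return "";                     // the root contributes no path segment
        }
        return pathOf(entry.getParent()) + "/" + entry.getName();
    }

    /** Depth-first walk that collects the path of every document below the given directory. */
    static void collectDocuments(Dir dir, List<String> out) {
        for (Iterator<Entry> it = dir.getEntries(); it.hasNext(); ) {
            Entry entry = it.next();
            if (entry.isDirectoryEntry()) {
                collectDocuments((Dir) entry, out);
            } else {
                out.add(pathOf(entry));
            }
        }
    }
}
```

With the real library, the same shape should apply: start from Filesystem.getRoot(), recurse into entries for which isDirectoryEntry() is true, and handle the rest as documents.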

With a DocumentEntry, you can create an instance of net.sourceforge.poi.poifs.filesystem.DocumentInputStream by passing the DocumentEntry as the only argument to the constructor of DocumentInputStream. The DocumentInputStream class is a simple extension of java.io.InputStream that fully supports the InputStream API, including the mark, reset, and skip methods, providing a form of random access I/O.
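The mark, reset, and skip methods can be combined to read a few bytes at an offset and then return to the previous position. The sketch below is written against java.io.InputStream only, so it should apply to any stream that supports mark/reset. RandomAccessSketch and peek are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Illustration only: uses mark/skip/reset to look ahead in a stream without losing the position. */
class RandomAccessSketch {
    /**
     * Reads up to length bytes starting offset bytes past the current position,
     * then rewinds the stream to where it was before the call.
     */
    static byte[] peek(InputStream in, int offset, int length) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream does not support mark/reset");
        }
        in.mark(offset + length);          // remember the current position
        try {
            long skipped = 0;
            while (skipped < offset) {
                long n = in.skip(offset - skipped);
                if (n <= 0) {
                    break;                 // nothing more could be skipped
                }
                skipped += n;
            }
            byte[] buffer = new byte[length];
            int read = 0;
            while (read < length) {
                int n = in.read(buffer, read, length - read);
                if (n < 0) {
                    break;                 // end of stream
                }
                read += n;
            }
            return Arrays.copyOf(buffer, read);
        } finally {
            in.reset();                    // go back to the marked position
        }
    }
}
```

The mark call passes offset + length as the read limit, so the mark stays valid for everything read before the reset.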

To modify the file, you would simply walk through the entries and follow the same instructions as for writing a POIFS file system from scratch. There are also methods to delete an Entry (note: you cannot delete the root directory, nor can you delete a DirectoryEntry unless it's empty) and to rename an Entry (but see the naming restrictions above).

POIFS Logging facility

POIFS does not yet use log4j-style logging.

POIFS Developer's Tools

POIFS does not yet have developer's tools.

What's Next?

  1. Refactoring of the API to more cleanly separate write from read.

  2. Add logging/tracing code

  3. Add tree viewer (probably Andy)

  4. Read/write support for creation and modification time stamps