This HOW-TO is organized in three sections. You should read them sequentially because the later sections build upon the earlier ones.
The first section explains how to read the most important standard properties of a Microsoft Office document. Standard properties include the title, the author, and the creation date. It is quite likely that you will find what you need here and won't have to read the other sections.
The second section goes a small step further and focusses on reading additional standard properties. It also talks about exceptions that may be thrown when dealing with HPSF and shows how you can read properties of embedded objects.
The third section tells how to read non-standard properties. Non-standard properties are application-specific name/value/type triples.
The first thing you should understand is that properties are stored in separate documents inside the POI filesystem. (If you don't know what a POI filesystem is, read its documentation.) A document in a POI filesystem is also called a stream.
The following example shows how to read the "title" property of a POI filesystem. Reading other properties works similarly; consult the API documentation of org.apache.poi.hpsf.SummaryInformation for the available methods.
The standard properties this section focusses on can be found in a document called \005SummaryInformation in the root of the POI filesystem. The notation \005 in the document's name means the character with the decimal value of 5. In order to read the title, an application has to perform the following steps:
Open the document \005SummaryInformation located in the root of the POI filesystem.
Create an instance of the class SummaryInformation from that document.
Call the SummaryInformation instance's getTitle() method.
Sounds easy, doesn't it? Here are the steps in detail.
An application that wants to open a document in a POI filesystem (POIFS) proceeds as described below. (The full source code of the sample application is available in the examples section of the POI source tree as ReadTitle.java.)
The first interesting statement creates an org.apache.poi.poifs.eventfilesystem.POIFSReader instance, which we need to read the POI filesystem. Before the application actually opens the POI filesystem, we have to tell the POIFSReader which documents we are interested in. In this case the application should do something with the document \005SummaryInformation.
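In Java source code the stream name can be written with an octal escape for its leading character. The following is a minimal sketch that uses only the standard library; the class name StreamNameDemo and the constant name are made up for this illustration.

```java
public class StreamNameDemo {

    /**
     * Name of the summary information stream. "\005" is an octal escape
     * for the character with the decimal value 5.
     */
    public static final String SUMMARY_INFORMATION = "\005SummaryInformation";

    public static void main(String[] args) {
        char first = SUMMARY_INFORMATION.charAt(0);

        // The first character is the control character with value 5.
        System.out.println((int) first); // prints 5

        // The same name written with a Unicode escape instead of an octal one.
        System.out.println(SUMMARY_INFORMATION.equals("\u0005SummaryInformation")); // prints true
    }
}
```

Because the leading character is not printable, it is easy to lose when the name is copied from a viewer or a log; writing it as an escape keeps it visible in the source.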
The registerListener method call registers an org.apache.poi.poifs.eventfilesystem.POIFSReaderListener with the POIFSReader. The POIFSReaderListener interface specifies the method processPOIFSReaderEvent, which processes a document. The class MyPOIFSReaderListener implements the POIFSReaderListener interface and thus the processPOIFSReaderEvent method. The eventing POI filesystem calls this method when it finds the \005SummaryInformation document. (In the sample application MyPOIFSReaderListener is a static class in the ReadTitle.java source file.)
Now everything is prepared, and reading the POI filesystem can start with a call to the reader's read method, which takes an input stream.
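The set-up steps described so far can be sketched as follows. This is an illustration rather than the ReadTitle.java source: the class name ReaderSetup is made up, the file name is assumed to arrive as the first command-line argument, and the listener is a placeholder that only reports the stream it was called for. It needs the POI library on the classpath.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.poifs.eventfilesystem.POIFSReader;
import org.apache.poi.poifs.eventfilesystem.POIFSReaderEvent;
import org.apache.poi.poifs.eventfilesystem.POIFSReaderListener;

public class ReaderSetup {

    /** Placeholder listener (hypothetical body): only reports the stream it was called for. */
    static class MyPOIFSReaderListener implements POIFSReaderListener {
        @Override
        public void processPOIFSReaderEvent(POIFSReaderEvent e) {
            System.out.println("Found stream, path: " + e.getPath() + ", name: " + e.getName());
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the file to read is given as the first argument.
        String filename = args[0];

        // Step 1: create the reader.
        POIFSReader r = new POIFSReader();

        // Step 2: say which document in the root directory we are interested in.
        r.registerListener(new MyPOIFSReaderListener(), "\005SummaryInformation");

        // Step 3: read the POI filesystem; the reader calls the listener
        // when it encounters the registered document.
        try (InputStream in = new FileInputStream(filename)) {
            r.read(in);
        }
    }
}
```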
The MyPOIFSReaderListener class is where the title is actually retrieved.
The listener first declares a SummaryInformation variable and initializes it with null. We need an instance of this class to access the title. The instance is created in a try block.
The expression e.getStream() returns the input stream containing the bytes of the property set stream named \005SummaryInformation. This stream is passed into the create method of the factory class org.apache.poi.hpsf.PropertySetFactory, which returns an org.apache.poi.hpsf.PropertySet instance. It is more or less safe to cast this result to SummaryInformation, a convenience class with methods like getTitle(), getAuthor(), etc.
The PropertySetFactory.create method may throw all sorts of exceptions. We'll deal with them in the next sections. For now we just catch all exceptions and throw a RuntimeException containing the message text of the original exception.
If all goes well, the sample application retrieves the title and prints it to the standard output. You must be prepared for the case that the POI filesystem does not have a title: getTitle() then returns null.
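Putting the pieces of this walk-through together, the listener might look like the following sketch. It follows the description above (a null-initialized variable, a try block around PropertySetFactory.create, a cast to SummaryInformation, and a RuntimeException on failure). It is shown here as a top-level class, whereas the sample application nests it in ReadTitle.java, and the wording of the printed messages is an assumption.

```java
import org.apache.poi.hpsf.PropertySetFactory;
import org.apache.poi.hpsf.SummaryInformation;
import org.apache.poi.poifs.eventfilesystem.POIFSReaderEvent;
import org.apache.poi.poifs.eventfilesystem.POIFSReaderListener;

public class MyPOIFSReaderListener implements POIFSReaderListener {

    @Override
    public void processPOIFSReaderEvent(POIFSReaderEvent e) {
        SummaryInformation si = null;
        try {
            // e.getStream() delivers the bytes of the property set stream.
            si = (SummaryInformation) PropertySetFactory.create(e.getStream());
        } catch (Exception ex) {
            // For now: no distinction between the possible failures.
            throw new RuntimeException(
                "Property set stream \"" + e.getPath() + e.getName() + "\": " + ex, ex);
        }

        // The title is optional, so check for null before using it.
        String title = si.getTitle();
        if (title != null) {
            System.out.println("Title: \"" + title + "\"");
        } else {
            System.out.println("Document has no title.");
        }
    }
}
```

To use it, register an instance with a POIFSReader under the name \005SummaryInformation and then call the reader's read method on an input stream of the Office document.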
Please note that a Microsoft Office document does not necessarily contain the \005SummaryInformation stream. The documents created by the Microsoft Office suite have one, as far as I know. However, an Excel spreadsheet exported from StarOffice 5.2 won't have a \005SummaryInformation stream. In this case the application won't throw an exception; the reader simply never calls the processPOIFSReaderEvent method. You have been warned!
A couple of additional standard properties are not contained in the \005SummaryInformation stream explained above, for example a document's category or the number of multimedia clips in a PowerPoint presentation. Microsoft has invented an additional stream named \005DocumentSummaryInformation to hold these properties. With two minor exceptions you can proceed exactly as described above to read the properties stored in \005DocumentSummaryInformation:
Instead of \005SummaryInformation use \005DocumentSummaryInformation as the stream's name.
Replace all occurrences of the class SummaryInformation by DocumentSummaryInformation.
And of course you cannot call getTitle() because DocumentSummaryInformation has different query methods. See the API documentation for the details!
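As an illustration of the two changes, here is a minimal sketch that reads the document's category. The class names ReadCategory and CategoryListener are made up, the file name is assumed to be the first command-line argument, and all exceptions are still lumped together as in the first section.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.hpsf.DocumentSummaryInformation;
import org.apache.poi.hpsf.PropertySetFactory;
import org.apache.poi.poifs.eventfilesystem.POIFSReader;
import org.apache.poi.poifs.eventfilesystem.POIFSReaderEvent;
import org.apache.poi.poifs.eventfilesystem.POIFSReaderListener;

public class ReadCategory {

    /** Hypothetical listener that prints the category property, if present. */
    static class CategoryListener implements POIFSReaderListener {
        @Override
        public void processPOIFSReaderEvent(POIFSReaderEvent e) {
            DocumentSummaryInformation dsi;
            try {
                // Change 2: DocumentSummaryInformation instead of SummaryInformation.
                dsi = (DocumentSummaryInformation) PropertySetFactory.create(e.getStream());
            } catch (Exception ex) {
                throw new RuntimeException(
                    "Property set stream \"" + e.getPath() + e.getName() + "\": " + ex, ex);
            }

            String category = dsi.getCategory();
            if (category != null) {
                System.out.println("Category: \"" + category + "\"");
            } else {
                System.out.println("Document has no category.");
            }
        }
    }

    public static void main(String[] args) throws IOException {
        POIFSReader r = new POIFSReader();
        // Change 1: a different stream name.
        r.registerListener(new CategoryListener(), "\005DocumentSummaryInformation");
        try (InputStream in = new FileInputStream(args[0])) {
            r.read(in);
        }
    }
}
```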
In the previous section the application simply caught all exceptions and was in no way interested in any details. However, a real application will likely want to know what went wrong and act appropriately. Besides any I/O exceptions, there are three exceptions specific to HPSF and POI that you should know about:
NoPropertySetStreamException
This exception is thrown if the application tries to create a PropertySet or one of its subclasses SummaryInformation and DocumentSummaryInformation from a stream that is not a property set stream. A faulty property set stream counts as not being a property set stream at all. An application should be prepared to deal with this case even if it only opens streams named \005SummaryInformation or \005DocumentSummaryInformation. These are just names. A stream's name by itself does not ensure that the stream contains the expected contents or that these contents are correct.
UnexpectedPropertySetTypeException
This exception is thrown if a certain type of property set is expected somewhere (e.g. a SummaryInformation or DocumentSummaryInformation) but the provided property set is not of that type.
MarkUnsupportedException
This exception is thrown if an input stream that is to be parsed into a property set does not support the InputStream.mark(int) operation. The POI filesystem uses the DocumentInputStream class, which does support this operation, so you are safe here. However, if you read a property set stream from another kind of input stream, things may be different.
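If you do read a property set from some other kind of input stream, you can check for mark support up front. The following is a minimal sketch using only java.io: the class MarkSupport and the helper ensureMarkSupported are hypothetical names, and wrapping the stream in a BufferedInputStream is one common way to obtain mark/reset, not something HPSF prescribes.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class MarkSupport {

    /** Returns the stream itself if it supports mark/reset, else a buffering wrapper that does. */
    public static InputStream ensureMarkSupported(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    public static void main(String[] args) throws IOException {
        // A stream that does not override markSupported(), so it reports false.
        InputStream plain = new InputStream() {
            private final InputStream data = new ByteArrayInputStream(new byte[] {1, 2, 3});

            @Override
            public int read() throws IOException {
                return data.read();
            }
        };
        System.out.println(plain.markSupported()); // prints false

        InputStream safe = ensureMarkSupported(plain);
        System.out.println(safe.markSupported());  // prints true

        // mark/reset now works: the same byte can be read twice.
        safe.mark(16);
        int first = safe.read();
        safe.reset();
        System.out.println(first == safe.read());  // prints true
    }
}
```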
Many Microsoft Office documents contain embedded objects, for example an Excel sheet on a page in a Word document. Embedded objects may have property sets of their own. An application can open these property set streams as described above. The only difference is that they are not located in the POI filesystem's root but in a nested directory instead. Just register a POIFSReaderListener for the property set streams you are interested in. For example, the POIBrowser application in the contrib section tries to open each and every document in a POI filesystem as a property set stream. If this succeeds, it displays the properties.
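The "try every document" approach can be sketched as follows. This is not the POIBrowser source: the class names ListPropertySets and AllStreamsListener are made up, and the file name is assumed to be the first command-line argument. The sketch also shows one way to treat NoPropertySetStreamException separately from other failures. Since the set of checked exceptions declared by PropertySetFactory.create may differ between POI releases, everything else is caught as a plain Exception here.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.hpsf.NoPropertySetStreamException;
import org.apache.poi.hpsf.PropertySet;
import org.apache.poi.hpsf.PropertySetFactory;
import org.apache.poi.poifs.eventfilesystem.POIFSReader;
import org.apache.poi.poifs.eventfilesystem.POIFSReaderEvent;
import org.apache.poi.poifs.eventfilesystem.POIFSReaderListener;

public class ListPropertySets {

    /** Hypothetical listener that tries to parse every document as a property set. */
    static class AllStreamsListener implements POIFSReaderListener {
        @Override
        public void processPOIFSReaderEvent(POIFSReaderEvent e) {
            // The path tells us whether the stream lives in the root or in a nested directory.
            String location = "path: " + e.getPath() + ", name: " + e.getName();
            try {
                PropertySet ps = PropertySetFactory.create(e.getStream());
                System.out.println(location + " -> " + ps.getClass().getSimpleName());
            } catch (NoPropertySetStreamException ex) {
                // Expected for most streams: they hold ordinary data, not properties.
            } catch (Exception ex) {
                // Anything else (I/O problems, other HPSF errors) is reported.
                System.err.println(location + " -> cannot read property set: " + ex);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        POIFSReader r = new POIFSReader();
        // No name given: the listener is registered for all documents.
        r.registerListener(new AllStreamsListener());
        try (InputStream in = new FileInputStream(args[0])) {
            r.read(in);
        }
    }
}
```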