- Added first sections to HPSF HOW-TO.
git-svn-id: https://svn.apache.org/repos/asf/jakarta/poi/trunk@352153 13f79535-47bb-0310-9956-ffa450edef68
parent
5d260ea7d1
commit
b7bfdf6fe9
@ -143,7 +143,7 @@
|
|||||||
<div align="right">
|
<div align="right">
|
||||||
<table cellspacing="0" cellpadding="2" border="0" width="100%">
|
<table cellspacing="0" cellpadding="2" border="0" width="100%">
|
||||||
<tr>
|
<tr>
|
||||||
<td bgcolor="#525D76"><font color="#ffffff" size="+1"><font face="Arial,sans-serif"><b> 1.1-dev (March 3 2002)</b></font></font></td>
|
<td bgcolor="#525D76"><font color="#ffffff" size="+1"><font face="Arial,sans-serif"><b> 1.1-dev (March 6 2002)</b></font></font></td>
|
||||||
</tr>
|
</tr>
|
||||||
<tr>
|
<tr>
|
||||||
<td>
|
<td>
|
||||||
|
@ -73,11 +73,503 @@
|
|||||||
<tr>
|
<tr>
|
||||||
<td>
|
<td>
|
||||||
<br>
|
<br>
|
||||||
|
|
||||||
<p align="justify">TODO: This documentation is still to be written. For the
|
|
||||||
time being, please see the API documentation (javadocs) of the
|
<p align="justify">This HOW-TO is organized in three sections. You should read them
|
||||||
<code>org.apache.poi.hpsf</code> package.</p>
|
sequentially because the later sections build upon the earlier ones.</p>
|
||||||
|
|
||||||
|
|
||||||
|
<ol>
|
||||||
|
|
||||||
|
<li>
|
||||||
|
|
||||||
|
<p align="justify">The <a href="#sec1">first section</a> explains how to read
|
||||||
|
the most important standard properties of a Microsoft Office
|
||||||
|
document. Standard properties are things like title, author, creation
|
||||||
|
date etc. It is quite likely that you will find here what you need and
|
||||||
|
don't have to read the other sections.</p>
|
||||||
|
|
||||||
|
</li>
|
||||||
|
|
||||||
|
|
||||||
|
<li>
|
||||||
|
|
||||||
|
<p align="justify">The <a href="#sec2">second section</a> goes a small step
|
||||||
|
further and focusses on reading additional standard properties. It also
|
||||||
|
talks about exceptions that may be thrown when dealing with HPSF and
|
||||||
|
shows how you can read properties of embedded objects.</p>
|
||||||
|
|
||||||
|
</li>
|
||||||
|
|
||||||
|
|
||||||
|
<li>
|
||||||
|
|
||||||
|
<p align="justify">The <a href="#sec3">third section</a> tells how to read
|
||||||
|
non-standard properties. Non-standard properties are application-specific
|
||||||
|
name/value/type triples.</p>
|
||||||
|
|
||||||
|
</li>
|
||||||
|
|
||||||
|
</ol>
|
||||||
|
|
||||||
|
|
||||||
|
<anchor id="sec1"></anchor>
|
||||||
|
|
||||||
|
<div align="right">
|
||||||
|
<table cellspacing="0" cellpadding="2" border="0" width="99%">
|
||||||
|
<tr>
|
||||||
|
<td bgcolor="#525D76"><font color="#ffffff" size="+0"><font face="Arial,sans-serif"><b>Reading Standard Properties</b></font></font></td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<br>
|
||||||
|
|
||||||
|
|
||||||
|
<note>This section explains how to read
|
||||||
|
the most important standard properties of a Microsoft Office
|
||||||
|
document. Standard properties are things like title, author, creation
|
||||||
|
date etc. Chances are that you will find here what you need and
|
||||||
|
don't have to read the other sections.</note>
|
||||||
|
|
||||||
|
|
||||||
|
<p align="justify">The first thing you should understand is that properties are stored in
|
||||||
|
separate documents inside the POI filesystem. (If you don't know what a
|
||||||
|
POI filesystem is, read its <a href="../poifs/index.html">documentation</a>.) A document in a POI
|
||||||
|
filesystem is also called a <em>stream</em>.</p>
|
||||||
|
|
||||||
|
|
||||||
|
<p align="justify">The following example shows how to read a POI filesystem's
|
||||||
|
"title" property. Reading other properties is similar. Consult the API
|
||||||
|
documentation of <code>org.apache.poi.hpsf.SummaryInformation</code>.</p>
|
||||||
|
|
||||||
|
|
||||||
|
<p align="justify">The standard properties this section focusses on can be
|
||||||
|
found in a document called <em>\005SummaryInformation</em> in the root of
|
||||||
|
the POI filesystem. The notation <em>\005</em> in the document's name
|
||||||
|
means the character with the decimal value of 5. In order to read the
|
||||||
|
title, an application has to perform the following steps:</p>
|
||||||
|
|
||||||
|
|
||||||
|
<ol>
|
||||||
|
|
||||||
|
<li>
|
||||||
|
|
||||||
|
<p align="justify">Open the document <em>\005SummaryInformation</em> located in the root
|
||||||
|
of the POI filesystem.</p>
|
||||||
|
|
||||||
|
</li>
|
||||||
|
|
||||||
|
<li>
|
||||||
|
|
||||||
|
<p align="justify">Create an instance of the class
|
||||||
|
<code>SummaryInformation</code> from that
|
||||||
|
document.</p>
|
||||||
|
|
||||||
|
</li>
|
||||||
|
|
||||||
|
<li>
|
||||||
|
|
||||||
|
<p align="justify">Call the <code>SummaryInformation</code> instance's
|
||||||
|
<code>getTitle()</code> method.</p>
|
||||||
|
|
||||||
|
</li>
|
||||||
|
|
||||||
|
</ol>
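<p align="justify">The <em>\005</em> notation happens to match Java's octal escape syntax, so the stream name can be written directly as a string literal. The following is a minimal sketch; the class and method names (<code>StreamNames</code>, <code>firstCharValue</code>) are made up for illustration and are not part of POI.</p>

```java
public class StreamNames {
    /**
     * Name of the summary information stream. "\005" is a Java octal
     * escape for the single character with the decimal value 5.
     */
    static final String SUMMARY_INFORMATION = "\005SummaryInformation";

    /** Returns the numeric value of the first character of a stream name. */
    static int firstCharValue(String streamName) {
        return streamName.charAt(0);
    }

    public static void main(String[] args) {
        // The leading control character counts as one character of the name.
        System.out.println("First character value: "
            + firstCharValue(SUMMARY_INFORMATION));
        System.out.println("Rest of the name: "
            + SUMMARY_INFORMATION.substring(1));
    }
}
```

<p align="justify">The point of the sketch is that the name starts with one unprintable character, not with a backslash followed by three digits.</p>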
|
||||||
|
|
||||||
|
|
||||||
|
<p align="justify">Sounds easy, doesn't it? Here are the steps in detail.</p>
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
<div align="right">
|
||||||
|
<table cellspacing="0" cellpadding="2" border="0" width="98%">
|
||||||
|
<tr>
|
||||||
|
<td bgcolor="#525D76"><font color="#ffffff" size="-1"><font face="Arial,sans-serif"><b>Open the document \005SummaryInformation in the root of the POI filesystem</b></font></font></td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<br>
|
||||||
|
|
||||||
|
|
||||||
|
<p align="justify">An application that wants to open a document in a POI filesystem
|
||||||
|
(POIFS) proceeds as shown by the following code fragment. (The full
|
||||||
|
source code of the sample application is available in the
|
||||||
|
<em>examples</em> section of the POI source tree as
|
||||||
|
<em>ReadTitle.java</em>.)</p>
|
||||||
|
|
||||||
|
|
||||||
|
<div align="center">
|
||||||
|
<table cellspacing="2" cellpadding="2" border="1">
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<pre>
|
||||||
|
import java.io.*;
|
||||||
|
import org.apache.poi.hpsf.*;
|
||||||
|
import org.apache.poi.poifs.eventfilesystem.*;
|
||||||
|
|
||||||
|
// ...
|
||||||
|
|
||||||
|
public static void main(String[] args)
|
||||||
|
throws IOException
|
||||||
|
{
|
||||||
|
final String filename = args[0];
|
||||||
|
POIFSReader r = new POIFSReader();
|
||||||
|
r.registerListener(new MyPOIFSReaderListener(),
|
||||||
|
"\005SummaryInformation");
|
||||||
|
r.read(new FileInputStream(filename));
|
||||||
|
}</pre>
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
</table>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
|
||||||
|
<p align="justify">The first interesting statement is</p>
|
||||||
|
|
||||||
|
|
||||||
|
<div align="center">
|
||||||
|
<table cellspacing="2" cellpadding="2" border="1">
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<pre>POIFSReader r = new POIFSReader();</pre>
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
</table>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
|
||||||
|
<p align="justify">It creates a
|
||||||
|
<code>org.apache.poi.poifs.eventfilesystem.POIFSReader</code> instance
|
||||||
|
which we shall need to read the POI filesystem. Before the application
|
||||||
|
actually opens the POI filesystem we have to tell the
|
||||||
|
<code>POIFSReader</code> which documents we are interested in. In this
|
||||||
|
case the application should do something with the document
|
||||||
|
<em>\005SummaryInformation</em>.</p>
|
||||||
|
|
||||||
|
|
||||||
|
<div align="center">
|
||||||
|
<table cellspacing="2" cellpadding="2" border="1">
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<pre>
|
||||||
|
r.registerListener(new MyPOIFSReaderListener(),
|
||||||
|
"\005SummaryInformation");</pre>
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
</table>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
|
||||||
|
<p align="justify">This method call registers a
|
||||||
|
<code>org.apache.poi.poifs.eventfilesystem.POIFSReaderListener</code>
|
||||||
|
with the <code>POIFSReader</code>. The <code>POIFSReaderListener</code>
|
||||||
|
interface specifies the method <code>processPOIFSReaderEvent</code>
|
||||||
|
which processes a document. The class
|
||||||
|
<code>MyPOIFSReaderListener</code> implements the
|
||||||
|
<code>POIFSReaderListener</code> and thus the
|
||||||
|
<code>processPOIFSReaderEvent</code> method. The eventing POI filesystem
|
||||||
|
calls this method when it finds the <em>\005SummaryInformation</em>
|
||||||
|
document. In the sample application <code>MyPOIFSReaderListener</code> is
|
||||||
|
a static class in the <em>ReadTitle.java</em> source file.</p>
|
||||||
|
|
||||||
|
|
||||||
|
<p align="justify">Now everything is prepared and reading the POI filesystem can
|
||||||
|
start:</p>
|
||||||
|
|
||||||
|
|
||||||
|
<div align="center">
|
||||||
|
<table cellspacing="2" cellpadding="2" border="1">
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<pre>r.read(new FileInputStream(filename));</pre>
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
</table>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
|
||||||
|
<p align="justify">The following source code fragment shows the
|
||||||
|
<code>MyPOIFSReaderListener</code> class and how it retrieves the
|
||||||
|
title.</p>
|
||||||
|
|
||||||
|
|
||||||
|
<div align="center">
|
||||||
|
<table cellspacing="2" cellpadding="2" border="1">
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<pre>
|
||||||
|
static class MyPOIFSReaderListener implements POIFSReaderListener
|
||||||
|
{
|
||||||
|
public void processPOIFSReaderEvent(POIFSReaderEvent e)
|
||||||
|
{
|
||||||
|
SummaryInformation si = null;
|
||||||
|
try
|
||||||
|
{
|
||||||
|
si = (SummaryInformation)
|
||||||
|
PropertySetFactory.create(e.getStream());
|
||||||
|
}
|
||||||
|
catch (Exception ex)
|
||||||
|
{
|
||||||
|
throw new RuntimeException
|
||||||
|
("Property set stream \"" +
|
||||||
|
e.getPath() + e.getName() + "\": " + ex);
|
||||||
|
}
|
||||||
|
final String title = si.getTitle();
|
||||||
|
if (title != null)
|
||||||
|
System.out.println("Title: \"" + title + "\"");
|
||||||
|
else
|
||||||
|
System.out.println("Document has no title.");
|
||||||
|
}
|
||||||
|
}
|
||||||
|
</pre>
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
</table>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
|
||||||
|
<p align="justify">The line</p>
|
||||||
|
|
||||||
|
|
||||||
|
<div align="center">
|
||||||
|
<table cellspacing="2" cellpadding="2" border="1">
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<pre>SummaryInformation si = null;</pre>
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
</table>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
|
||||||
|
<p align="justify">declares a <code>SummaryInformation</code> variable and initializes it
|
||||||
|
with <code>null</code>. We need an instance of this class to access the
|
||||||
|
title. The instance is created in a <code>try</code> block:</p>
|
||||||
|
|
||||||
|
|
||||||
|
<div align="center">
|
||||||
|
<table cellspacing="2" cellpadding="2" border="1">
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<pre>si = (SummaryInformation)
|
||||||
|
PropertySetFactory.create(e.getStream());</pre>
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
</table>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
|
||||||
|
<p align="justify">The expression <code>e.getStream()</code> returns the input stream
|
||||||
|
containing the bytes of the property set stream named
|
||||||
|
<em>\005SummaryInformation</em>. This stream is passed into the
|
||||||
|
<code>create</code> method of the factory class
|
||||||
|
<code>org.apache.poi.hpsf.PropertySetFactory</code> which returns
|
||||||
|
a <code>org.apache.poi.hpsf.PropertySet</code> instance. It is more or
|
||||||
|
less safe to cast this result to <code>SummaryInformation</code>, a
|
||||||
|
convenience class with methods like <code>getTitle()</code>,
|
||||||
|
<code>getAuthor()</code> etc.</p>
|
||||||
|
|
||||||
|
|
||||||
|
<p align="justify">The <code>PropertySetFactory.create</code> method may throw all sorts
|
||||||
|
of exceptions. We'll deal with them in the next sections. For now we just
|
||||||
|
catch all exceptions and throw a <code>RuntimeException</code>
|
||||||
|
containing the message text of the original exception.</p>
|
||||||
|
|
||||||
|
|
||||||
|
<p align="justify">If all goes well, the sample application retrieves the title and prints
|
||||||
|
it to the standard output. As you can see you must be prepared for the
|
||||||
|
case that the POI filesystem does not have a title.</p>
|
||||||
|
|
||||||
|
|
||||||
|
<div align="center">
|
||||||
|
<table cellspacing="2" cellpadding="2" border="1">
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<pre>final String title = si.getTitle();
|
||||||
|
if (title != null)
|
||||||
|
System.out.println("Title: \"" + title + "\"");
|
||||||
|
else
|
||||||
|
System.out.println("Document has no title.");</pre>
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
</table>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
|
||||||
|
<p align="justify">Please note that a Microsoft Office document does not necessarily
|
||||||
|
contain the <em>\005SummaryInformation</em> stream. The documents created
|
||||||
|
by the Microsoft Office suite have one, as far as I know. However, an
|
||||||
|
Excel spreadsheet exported from StarOffice 5.2 won't have a
|
||||||
|
<em>\005SummaryInformation</em> stream. In this case the application
|
||||||
|
won't throw an exception but simply does not call the
|
||||||
|
<code>processPOIFSReaderEvent</code> method. You have been warned!</p>
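<p align="justify">Because a missing stream produces no callback and no exception, an application that needs to know whether the stream was present has to record that itself. The following sketch illustrates the idea with small stand-in types; <code>ToyReader</code> and <code>ToyListener</code> are hypothetical and only mimic the register-then-read pattern described above, they are not the POI classes.</p>

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MissingStreamDemo {
    /** Hypothetical stand-in for a reader listener; not the POI interface. */
    interface ToyListener {
        void process(String streamName);
    }

    /** Hypothetical stand-in for an event-driven reader; not POIFSReader. */
    static class ToyReader {
        private final Map<String, ToyListener> listeners = new HashMap<>();

        void registerListener(ToyListener listener, String streamName) {
            listeners.put(streamName, listener);
        }

        /** Calls a listener only for stream names that are actually present. */
        void read(List<String> streamNamesInFile) {
            for (String name : streamNamesInFile) {
                ToyListener listener = listeners.get(name);
                if (listener != null) {
                    listener.process(name);
                }
            }
        }
    }

    /** Returns true if the listener was called for the summary stream. */
    static boolean hasSummaryInformation(List<String> streamNamesInFile) {
        final boolean[] found = { false };
        ToyReader r = new ToyReader();
        r.registerListener(name -> found[0] = true, "\005SummaryInformation");
        r.read(streamNamesInFile);
        // If the stream is absent, the listener never ran and found[0]
        // is still false; no exception signals the absence.
        return found[0];
    }

    public static void main(String[] args) {
        System.out.println(hasSummaryInformation(
            Arrays.asList("Workbook")));
        System.out.println(hasSummaryInformation(
            Arrays.asList("Workbook", "\005SummaryInformation")));
    }
}
```

<p align="justify">A real application could apply the same flag technique inside its own <code>POIFSReaderListener</code> and check the flag after <code>r.read(...)</code> returns.</p>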
|
||||||
|
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
</table>
|
||||||
|
</div>
|
||||||
|
<br>
|
||||||
|
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
</table>
|
||||||
|
</div>
|
||||||
|
<br>
|
||||||
|
|
||||||
|
|
||||||
|
<anchor id="sec2"></anchor>
|
||||||
|
|
||||||
|
<div align="right">
|
||||||
|
<table cellspacing="0" cellpadding="2" border="0" width="99%">
|
||||||
|
<tr>
|
||||||
|
<td bgcolor="#525D76"><font color="#ffffff" size="+0"><font face="Arial,sans-serif"><b>Additional Standard Properties, Exceptions And Embedded Objects</b></font></font></td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<br>
|
||||||
|
|
||||||
|
|
||||||
|
<note>This section focusses on reading additional standard properties. It
|
||||||
|
also talks about exceptions that may be thrown when dealing with HPSF and
|
||||||
|
shows how you can read properties of embedded objects.</note>
|
||||||
|
|
||||||
|
|
||||||
|
<p align="justify">A couple of <em>additional standard properties</em> are not
|
||||||
|
contained in the <em>\005SummaryInformation</em> stream explained above,
|
||||||
|
for example a document's category or the number of multimedia clips in a
|
||||||
|
PowerPoint presentation. Microsoft has invented an additional stream named
|
||||||
|
<em>\005DocumentSummaryInformation</em> to hold these properties. With two
|
||||||
|
minor exceptions you can proceed exactly as described above to read the
|
||||||
|
properties stored in <em>\005DocumentSummaryInformation</em>:</p>
|
||||||
|
|
||||||
|
|
||||||
|
<ul>
|
||||||
|
|
||||||
|
<li>
|
||||||
|
<p align="justify">Instead of <em>\005SummaryInformation</em> use
|
||||||
|
<em>\005DocumentSummaryInformation</em> as the stream's name.</p>
|
||||||
|
</li>
|
||||||
|
|
||||||
|
<li>
|
||||||
|
<p align="justify">Replace all occurrences of the class
|
||||||
|
<code>SummaryInformation</code> by
|
||||||
|
<code>DocumentSummaryInformation</code>.</p>
|
||||||
|
</li>
|
||||||
|
|
||||||
|
</ul>
|
||||||
|
|
||||||
|
|
||||||
|
<p align="justify">And of course you cannot call <code>getTitle()</code> because
|
||||||
|
<code>DocumentSummaryInformation</code> has different query methods. See
|
||||||
|
the API documentation for the details!</p>
|
||||||
|
|
||||||
|
|
||||||
|
<p align="justify">In the previous section the application simply caught all
|
||||||
|
<em>exceptions</em> and was in no way interested in any
|
||||||
|
details. However, a real application will likely want to know what went
|
||||||
|
wrong and act appropriately. Besides any IO exceptions there are three
|
||||||
|
HPSF- or POI-specific exceptions you should know about:</p>
|
||||||
|
|
||||||
|
|
||||||
|
<dl>
|
||||||
|
|
||||||
|
<dt>
|
||||||
|
<code>NoPropertySetStreamException</code>
|
||||||
|
|
||||||
|
<dd>
|
||||||
|
<p align="justify">This exception is thrown if the application tries to create a
|
||||||
|
<code>PropertySet</code> or one of its subclasses
|
||||||
|
<code>SummaryInformation</code> and
|
||||||
|
<code>DocumentSummaryInformation</code> from a stream that is not a
|
||||||
|
property set stream. A faulty property set stream counts as not being a
|
||||||
|
property set stream at all. An application should be prepared to deal
|
||||||
|
with this case even if it opens streams named
|
||||||
|
<em>\005SummaryInformation</em> or
|
||||||
|
<em>\005DocumentSummaryInformation</em> only. These are just names. A
|
||||||
|
stream's name by itself does not ensure that the stream contains the
|
||||||
|
expected contents and that these contents are correct.</p>
|
||||||
|
</dd>
|
||||||
|
|
||||||
|
|
||||||
|
<dt>
|
||||||
|
<code>UnexpectedPropertySetTypeException</code>
|
||||||
|
</dt>
|
||||||
|
|
||||||
|
<dd>
|
||||||
|
<p align="justify">This exception is thrown if a certain type of property set is
|
||||||
|
expected somewhere (e.g. a <code>SummaryInformation</code> or
|
||||||
|
<code>DocumentSummaryInformation</code>) but the provided property
|
||||||
|
set is not of that type.</p>
|
||||||
|
</dd>
|
||||||
|
|
||||||
|
|
||||||
|
<dt>
|
||||||
|
<code>MarkUnsupportedException</code>
|
||||||
|
</dt>
|
||||||
|
|
||||||
|
<dd>
|
||||||
|
<p align="justify">This exception is thrown if an input stream that is to be parsed
|
||||||
|
into a property set does not support the
|
||||||
|
<code>InputStream.mark(int)</code> operation. The POI filesystem uses
|
||||||
|
the <code>DocumentInputStream</code> class which does support this
|
||||||
|
operation, so you are safe here. However, if you read a property set
|
||||||
|
stream from another kind of input stream things may be
|
||||||
|
different.</p>
|
||||||
|
</dd>
|
||||||
|
|
||||||
|
</dl>
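<p align="justify">For input streams that do not support <code>mark(int)</code>, the standard <code>java.io.BufferedInputStream</code> wrapper adds that capability. The sketch below shows the check and the wrapping using only <code>java.io</code>; the class and method names are invented for illustration, and whether wrapping is sufficient for a particular property set source is an assumption that this HOW-TO does not confirm.</p>

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.InputStream;

public class MarkSupport {
    /**
     * Returns the stream itself if it already supports mark/reset,
     * otherwise a buffered wrapper that does.
     */
    static InputStream ensureMarkSupported(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    /** Builds a test stream that reports no mark/reset support. */
    static InputStream unmarkable(byte[] data) {
        return new FilterInputStream(new ByteArrayInputStream(data)) {
            @Override
            public boolean markSupported() {
                return false;
            }
        };
    }

    public static void main(String[] args) {
        InputStream raw = unmarkable(new byte[] { 1, 2, 3 });
        System.out.println("raw supports mark: " + raw.markSupported());
        InputStream safe = ensureMarkSupported(raw);
        System.out.println("wrapped supports mark: " + safe.markSupported());
    }
}
```

<p align="justify">Streams that already support the operation are passed through unchanged, so the helper adds no extra buffering layer where none is needed.</p>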
|
||||||
|
|
||||||
|
|
||||||
|
<p align="justify">Many Microsoft Office documents contain <em>embedded
|
||||||
|
objects</em>, for example an Excel sheet on a page in a Word
|
||||||
|
document. Embedded objects may have property sets of their own. An
|
||||||
|
application can open these property set streams as described above. The
|
||||||
|
only difference is that they are not located in the POI filesystem's root
|
||||||
|
but in a nested directory instead. Just register a
|
||||||
|
<code>POIFSReaderListener</code> for the property set streams you are
|
||||||
|
interested in. For example, the <em>POIBrowser</em> application in the
|
||||||
|
contrib section tries to open each and every document in a POI filesystem
|
||||||
|
as a property set stream. If this operation was successful it displays the
|
||||||
|
properties.</p>
|
||||||
|
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
</table>
|
||||||
|
</div>
|
||||||
|
<br>
|
||||||
|
|
||||||
|
|
||||||
|
<anchor id="sec3"></anchor>
|
||||||
|
|
||||||
|
<div align="right">
|
||||||
|
<table cellspacing="0" cellpadding="2" border="0" width="99%">
|
||||||
|
<tr>
|
||||||
|
<td bgcolor="#525D76"><font color="#ffffff" size="+0"><font face="Arial,sans-serif"><b>Reading Non-Standard Properties</b></font></font></td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<br>
|
||||||
|
|
||||||
|
|
||||||
|
<note>This section tells how to read
|
||||||
|
non-standard properties. Non-standard properties are application-specific
|
||||||
|
name/value/type triples.</note>
|
||||||
|
|
||||||
|
|
||||||
|
<div align="center">
|
||||||
|
<table cellspacing="2" cellpadding="2" border="1">
|
||||||
|
<tr>
|
||||||
|
<td bgcolor="#c0c0c0"><font size="-1" color="#023264">Write this section!</font></td>
|
||||||
|
</tr>
|
||||||
|
</table>
|
||||||
|
</div>
|
||||||
|
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
</table>
|
||||||
|
</div>
|
||||||
|
<br>
|
||||||
|
|
||||||
</td>
|
</td>
|
||||||
</tr>
|
</tr>
|
||||||
</table>
|
</table>
|
||||||
|
@ -212,7 +212,8 @@
|
|||||||
|
|
||||||
<li>Glen Stampoultzis (glens at apache.org)</li>
|
<li>Glen Stampoultzis (glens at apache.org)</li>
|
||||||
|
|
||||||
<li>Rainer Klute (klute at rainer-klute dot de)</li>
|
<li>
|
||||||
|
<a href="http://www.rainer-klute.de/">Rainer Klute</a> (klute at apache dot org)</li>
|
||||||
|
|
||||||
</ul>
|
</ul>
|
||||||
|
|
||||||
|
@ -1,17 +1,305 @@
|
|||||||
<?xml version="1.0" encoding="UTF-8"?>
|
<?xml version="1.0" encoding="UTF-8"?>
|
||||||
<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V1.0//EN" "../dtd/document-v10.dtd">
|
<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V1.0//EN" "../dtd/document-v10.dtd">
|
||||||
<document>
|
<document>
|
||||||
<header>
|
<header>
|
||||||
<title>HPSF HOW-TO</title>
|
<title>HPSF HOW-TO</title>
|
||||||
<authors>
|
<authors>
|
||||||
<person name="Rainer Klute" email="klute@rainer-klute.de"/>
|
<person name="Rainer Klute" email="klute@rainer-klute.de"/>
|
||||||
</authors>
|
</authors>
|
||||||
</header>
|
</header>
|
||||||
<body>
|
<body>
|
||||||
<s1 title="How To Use the HPSF APIs">
|
<s1 title="How To Use the HPSF APIs">
|
||||||
<p class="todo">TODO: This documentation is still to be written. For the
|
|
||||||
time being, please see the API documentation (javadocs) of the
|
<p>This HOW-TO is organized in three sections. You should read them
|
||||||
<code>org.apache.poi.hpsf</code> package.</p>
|
sequentially because the later sections build upon the earlier ones.</p>
|
||||||
</s1>
|
|
||||||
</body>
|
<ol>
|
||||||
|
<li>
|
||||||
|
<p>The <link href="#sec1">first section</link> explains how to read
|
||||||
|
the most important standard properties of a Microsoft Office
|
||||||
|
document. Standard properties are things like title, author, creation
|
||||||
|
date etc. It is quite likely that you will find here what you need and
|
||||||
|
don't have to read the other sections.</p>
|
||||||
|
</li>
|
||||||
|
|
||||||
|
<li>
|
||||||
|
<p>The <link href="#sec2">second section</link> goes a small step
|
||||||
|
further and focusses on reading additional standard properties. It also
|
||||||
|
talks about exceptions that may be thrown when dealing with HPSF and
|
||||||
|
shows how you can read properties of embedded objects.</p>
|
||||||
|
</li>
|
||||||
|
|
||||||
|
<li>
|
||||||
|
<p>The <link href="#sec3">third section</link> tells how to read
|
||||||
|
non-standard properties. Non-standard properties are application-specific
|
||||||
|
name/value/type triples.</p>
|
||||||
|
</li>
|
||||||
|
</ol>
|
||||||
|
|
||||||
|
<anchor id="sec1" />
|
||||||
|
<s2 title="Reading Standard Properties">
|
||||||
|
|
||||||
|
<note>This section explains how to read
|
||||||
|
the most important standard properties of a Microsoft Office
|
||||||
|
document. Standard properties are things like title, author, creation
|
||||||
|
date etc. Chances are that you will find here what you need and
|
||||||
|
don't have to read the other sections.</note>
|
||||||
|
|
||||||
|
<p>The first thing you should understand is that properties are stored in
|
||||||
|
separate documents inside the POI filesystem. (If you don't know what a
|
||||||
|
POI filesystem is, read its <link
|
||||||
|
href="../poifs/index.html">documentation</link>.) A document in a POI
|
||||||
|
filesystem is also called a <strong>stream</strong>.</p>
|
||||||
|
|
||||||
|
<p>The following example shows how to read a POI filesystem's
|
||||||
|
"title" property. Reading other properties is similar. Consult the API
|
||||||
|
documentation of <code>org.apache.poi.hpsf.SummaryInformation</code>.</p>
|
||||||
|
|
||||||
|
<p>The standard properties this section focusses on can be
|
||||||
|
found in a document called <em>\005SummaryInformation</em> in the root of
|
||||||
|
the POI filesystem. The notation <em>\005</em> in the document's name
|
||||||
|
means the character with the decimal value of 5. In order to read the
|
||||||
|
title, an application has to perform the following steps:</p>
|
||||||
|
|
||||||
|
<ol>
|
||||||
|
<li>
|
||||||
|
<p>Open the document <em>\005SummaryInformation</em> located in the root
|
||||||
|
of the POI filesystem.</p>
|
||||||
|
</li>
|
||||||
|
<li>
|
||||||
|
<p>Create an instance of the class
|
||||||
|
<code>SummaryInformation</code> from that
|
||||||
|
document.</p>
|
||||||
|
</li>
|
||||||
|
<li>
|
||||||
|
<p>Call the <code>SummaryInformation</code> instance's
|
||||||
|
<code>getTitle()</code> method.</p>
|
||||||
|
</li>
|
||||||
|
</ol>
|
||||||
|
|
||||||
|
<p>Sounds easy, doesn't it? Here are the steps in detail.</p>
|
||||||
|
|
||||||
|
|
||||||
|
<s3 title="Open the document \005SummaryInformation in the root of the
|
||||||
|
POI filesystem">
|
||||||
|
|
||||||
|
<p>An application that wants to open a document in a POI filesystem
|
||||||
|
(POIFS) proceeds as shown by the following code fragment. (The full
|
||||||
|
source code of the sample application is available in the
|
||||||
|
<em>examples</em> section of the POI source tree as
|
||||||
|
<em>ReadTitle.java</em>.)</p>
|
||||||
|
|
||||||
|
<source>
|
||||||
|
import java.io.*;
|
||||||
|
import org.apache.poi.hpsf.*;
|
||||||
|
import org.apache.poi.poifs.eventfilesystem.*;
|
||||||
|
|
||||||
|
// ...
|
||||||
|
|
||||||
|
public static void main(String[] args)
|
||||||
|
throws IOException
|
||||||
|
{
|
||||||
|
final String filename = args[0];
|
||||||
|
POIFSReader r = new POIFSReader();
|
||||||
|
r.registerListener(new MyPOIFSReaderListener(),
|
||||||
|
"\005SummaryInformation");
|
||||||
|
r.read(new FileInputStream(filename));
|
||||||
|
}</source>
|
||||||
|
|
||||||
|
<p>The first interesting statement is</p>
|
||||||
|
|
||||||
|
<source>POIFSReader r = new POIFSReader();</source>
|
||||||
|
|
||||||
|
<p>It creates a
|
||||||
|
<code>org.apache.poi.poifs.eventfilesystem.POIFSReader</code> instance
|
||||||
|
which we shall need to read the POI filesystem. Before the application
|
||||||
|
actually opens the POI filesystem we have to tell the
|
||||||
|
<code>POIFSReader</code> which documents we are interested in. In this
|
||||||
|
case the application should do something with the document
|
||||||
|
<em>\005SummaryInformation</em>.</p>
|
||||||
|
|
||||||
|
<source>
|
||||||
|
r.registerListener(new MyPOIFSReaderListener(),
|
||||||
|
"\005SummaryInformation");</source>
|
||||||
|
|
||||||
|
<p>This method call registers a
|
||||||
|
<code>org.apache.poi.poifs.eventfilesystem.POIFSReaderListener</code>
|
||||||
|
with the <code>POIFSReader</code>. The <code>POIFSReaderListener</code>
|
||||||
|
interface specifies the method <code>processPOIFSReaderEvent</code>
|
||||||
|
which processes a document. The class
|
||||||
|
<code>MyPOIFSReaderListener</code> implements the
|
||||||
|
<code>POIFSReaderListener</code> and thus the
|
||||||
|
<code>processPOIFSReaderEvent</code> method. The eventing POI filesystem
|
||||||
|
calls this method when it finds the <em>\005SummaryInformation</em>
|
||||||
|
document. In the sample application <code>MyPOIFSReaderListener</code> is
|
||||||
|
a static class in the <em>ReadTitle.java</em> source file.</p>
|
||||||
|
|
||||||
|
<p>Now everything is prepared and reading the POI filesystem can
|
||||||
|
start:</p>
|
||||||
|
|
||||||
|
<source>r.read(new FileInputStream(filename));</source>
|
||||||
|
|
||||||
|
<p>The following source code fragment shows the
|
||||||
|
<code>MyPOIFSReaderListener</code> class and how it retrieves the
|
||||||
|
title.</p>
|
||||||
|
|
||||||
|
<source>
|
||||||
|
static class MyPOIFSReaderListener implements POIFSReaderListener
|
||||||
|
{
|
||||||
|
public void processPOIFSReaderEvent(POIFSReaderEvent e)
|
||||||
|
{
|
||||||
|
SummaryInformation si = null;
|
||||||
|
try
|
||||||
|
{
|
||||||
|
si = (SummaryInformation)
|
||||||
|
PropertySetFactory.create(e.getStream());
|
||||||
|
}
|
||||||
|
catch (Exception ex)
|
||||||
|
{
|
||||||
|
throw new RuntimeException
|
||||||
|
("Property set stream \"" +
|
||||||
|
e.getPath() + e.getName() + "\": " + ex);
|
||||||
|
}
|
||||||
|
final String title = si.getTitle();
|
||||||
|
if (title != null)
|
||||||
|
System.out.println("Title: \"" + title + "\"");
|
||||||
|
else
|
||||||
|
System.out.println("Document has no title.");
|
||||||
|
}
|
||||||
|
}
|
||||||
|
</source>
|
||||||
|
|
||||||
|
<p>The line</p>
|
||||||
|
|
||||||
|
<source>SummaryInformation si = null;</source>
|
||||||
|
|
||||||
|
<p>declares a <code>SummaryInformation</code> variable and initializes it
|
||||||
|
with <code>null</code>. We need an instance of this class to access the
|
||||||
|
title. The instance is created in a <code>try</code> block:</p>
|
||||||
|
|
||||||
|
<source>si = (SummaryInformation)
|
||||||
|
PropertySetFactory.create(e.getStream());</source>
|
||||||
|
|
||||||
|
<p>The expression <code>e.getStream()</code> returns the input stream
|
||||||
|
containing the bytes of the property set stream named
|
||||||
|
<em>\005SummaryInformation</em>. This stream is passed into the
|
||||||
|
<code>create</code> method of the factory class
|
||||||
|
<code>org.apache.poi.hpsf.PropertySetFactory</code> which returns
|
||||||
|
a <code>org.apache.poi.hpsf.PropertySet</code> instance. It is more or
|
||||||
|
less safe to cast this result to <code>SummaryInformation</code>, a
|
||||||
|
convenience class with methods like <code>getTitle()</code>,
|
||||||
|
<code>getAuthor()</code> etc.</p>
|
||||||
|
|
||||||
|
<p>The <code>PropertySetFactory.create</code> method may throw all sorts
|
||||||
|
of exceptions. We'll deal with them in the next sections. For now we just
|
||||||
|
catch all exceptions and throw a <code>RuntimeException</code>
|
||||||
|
containing the message text of the original exception.</p>
|
||||||
|
|
||||||
|
<p>If all goes well, the sample application retrieves the title and prints
|
||||||
|
it to the standard output. As you can see you must be prepared for the
|
||||||
|
case that the POI filesystem does not have a title.</p>
|
||||||
|
|
||||||
|
<source>final String title = si.getTitle();
|
||||||
|
if (title != null)
|
||||||
|
System.out.println("Title: \"" + title + "\"");
|
||||||
|
else
|
||||||
|
System.out.println("Document has no title.");</source>
|
||||||
|
|
||||||
|
<p>Please note that a Microsoft Office document does not necessarily
|
||||||
|
contain the <em>\005SummaryInformation</em> stream. The documents created
|
||||||
|
by the Microsoft Office suite have one, as far as I know. However, an
|
||||||
|
Excel spreadsheet exported from StarOffice 5.2 won't have a
|
||||||
|
<em>\005SummaryInformation</em> stream. In this case the application
|
||||||
|
won't throw an exception but simply does not call the
|
||||||
|
<code>processPOIFSReaderEvent</code> method. You have been warned!</p>
|
||||||
|
</s3>
|
||||||
|
</s2>
|
||||||
|
|
||||||
|
<anchor id="sec2"/>
|
||||||
|
<s2 title="Additional Standard Properties, Exceptions And Embedded Objects">
|
||||||
|
|
||||||
|
<note>This section focusses on reading additional standard properties. It
|
||||||
|
also talks about exceptions that may be thrown when dealing with HPSF and
|
||||||
|
shows how you can read properties of embedded objects.</note>
|
||||||
|
|
||||||
|
<p>A couple of <strong>additional standard properties</strong> are not
contained in the <em>\005SummaryInformation</em> stream explained above,
for example a document's category or the number of multimedia clips in a
PowerPoint presentation. Microsoft has invented an additional stream named
<em>\005DocumentSummaryInformation</em> to hold these properties. With two
minor exceptions you can proceed exactly as described above to read the
properties stored in <em>\005DocumentSummaryInformation</em>:</p>
<ul>
<li><p>Instead of <em>\005SummaryInformation</em> use
<em>\005DocumentSummaryInformation</em> as the stream's name.</p></li>
<li><p>Replace all occurrences of the class
<code>SummaryInformation</code> by
<code>DocumentSummaryInformation</code>.</p></li>
</ul>
<p>And of course you cannot call <code>getTitle()</code> because
<code>DocumentSummaryInformation</code> has different query methods. See
the API documentation for the details!</p>
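
<p>The leading <em>\005</em> in both stream names is easy to get wrong. The
following sketch uses only the Java standard library. It shows that in a
Java string literal <code>\005</code> is an octal escape for the single
character with code 5, not the four characters backslash, 0, 0, 5. The class
<code>StreamNames</code>, its constants and its helper method are made up for
this illustration and are not part of HPSF.</p>

```java
class StreamNames {
    // Hypothetical constants for this illustration. The escape at the
    // start of each literal yields one control character with code 5.
    public static final String SUMMARY_INFORMATION =
        "\005SummaryInformation";
    public static final String DOCUMENT_SUMMARY_INFORMATION =
        "\005DocumentSummaryInformation";

    /** Returns true if the name starts with the control character 5. */
    public static boolean hasPropertySetPrefix(final String name) {
        return !name.isEmpty() && name.charAt(0) == 5;
    }

    public static void main(final String[] args) {
        // Code of the first character and total length of the name.
        System.out.println((int) SUMMARY_INFORMATION.charAt(0));
        System.out.println(SUMMARY_INFORMATION.length());
        // The second name carries the same prefix; a name typed
        // without the escape does not.
        System.out.println(hasPropertySetPrefix(DOCUMENT_SUMMARY_INFORMATION));
        System.out.println(hasPropertySetPrefix("SummaryInformation"));
    }
}
```

<p>A listener registered under a name without the leading control character
would be registered for a different stream name.</p>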
<p>In the previous section the application simply caught all
<strong>exceptions</strong> and was in no way interested in any
details. However, a real application will likely want to know what went
wrong and act appropriately. Besides any I/O exceptions there are three
HPSF- or POI-specific exceptions you should know about:</p>
<dl>
<dt><code>NoPropertySetStreamException</code></dt>
<dd><p>This exception is thrown if the application tries to create a
<code>PropertySet</code> or one of its subclasses,
<code>SummaryInformation</code> and
<code>DocumentSummaryInformation</code>, from a stream that is not a
property set stream. A faulty property set stream counts as not being a
property set stream at all. An application should be prepared to deal
with this case even if it opens only streams named
<em>\005SummaryInformation</em> or
<em>\005DocumentSummaryInformation</em>. These are just names. A
stream's name by itself does not ensure that the stream contains the
expected contents or that these contents are correct.</p></dd>
<dt><code>UnexpectedPropertySetTypeException</code></dt>
<dd><p>This exception is thrown if a certain type of property set is
expected somewhere (e.g. a <code>SummaryInformation</code> or
<code>DocumentSummaryInformation</code>) but the provided property
set is not of that type.</p></dd>
<dt><code>MarkUnsupportedException</code></dt>
<dd><p>This exception is thrown if an input stream that is to be parsed
into a property set does not support the
<code>InputStream.mark(int)</code> operation. The POI filesystem uses
the <code>DocumentInputStream</code> class, which does support this
operation, so you are safe here. However, if you read a property set
stream from another kind of input stream, things may be
different.</p></dd>
</dl>
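
<p>Whether an input stream supports <code>mark(int)</code> can be checked up
front with the standard <code>InputStream.markSupported()</code> method. The
following sketch uses only the Java standard library. It wraps a stream that
lacks mark support in a <code>BufferedInputStream</code>, which does support
it. The class <code>MarkSupport</code> and its methods are hypothetical
helpers for this illustration and are not part of HPSF. Handing the resulting
stream to HPSF is not shown here.</p>

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.InputStream;

class MarkSupport {
    /**
     * Returns the stream itself if it supports mark/reset,
     * otherwise a buffered wrapper that does.
     */
    public static InputStream ensureMarkSupported(final InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    /** Builds a stream without mark support, for demonstration only. */
    public static InputStream withoutMark(final InputStream in) {
        return new FilterInputStream(in) {
            @Override
            public boolean markSupported() {
                return false;
            }
        };
    }

    public static void main(final String[] args) throws Exception {
        final InputStream plain =
            withoutMark(new ByteArrayInputStream(new byte[] {1, 2, 3}));
        System.out.println(plain.markSupported());

        final InputStream safe = ensureMarkSupported(plain);
        System.out.println(safe.markSupported());

        // Mark, read one byte, rewind and read the same byte again.
        safe.mark(16);
        final int first = safe.read();
        safe.reset();
        System.out.println(first == safe.read());
    }
}
```

<p>A stream that already supports marking, such as a
<code>ByteArrayInputStream</code>, is returned unchanged, so the helper adds
no extra buffering in that case.</p>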
<p>Many Microsoft Office documents contain <strong>embedded
objects</strong>, for example an Excel sheet on a page in a Word
document. Embedded objects may have property sets of their own. An
application can open these property set streams as described above. The
only difference is that they are not located in the POI filesystem's root
but in a nested directory instead. Just register a
<code>POIFSReaderListener</code> for the property set streams you are
interested in. For example, the <em>POIBrowser</em> application in the
contrib section tries to open each and every document in a POI filesystem
as a property set stream. If this operation succeeds, it displays the
properties.</p>
</s2>

<anchor id="sec3"/>
<s2 title="Reading Non-Standard Properties">

<note>This section explains how to read
non-standard properties. Non-standard properties are application-specific
name/value/type triples.</note>
<fixme author="Rainer Klute">Write this section!</fixme>
</s2>
</s1>
</body>
</document>