From c6814aa16c1d1c4e0cb27501297b8dee4aa24aa3 Mon Sep 17 00:00:00 2001
From: Rainer Klute
Date: Wed, 5 Feb 2003 19:33:27 +0000
Subject: [PATCH] Completed the third main section of the HPSF HOW-TO.

git-svn-id: https://svn.apache.org/repos/asf/jakarta/poi/trunk@353000 13f79535-47bb-0310-9956-ffa450edef68
---
 src/documentation/xdocs/hpsf/how-to.xml       | 447 +++++++++++++-----
 src/documentation/xdocs/hpsf/todo.xml         |  17 +-
 src/java/org/apache/poi/hpsf/TypeReader.java  |   5 +
 .../poi/hpsf/wellknown/PropertyIDMap.java     |   6 +-
 4 files changed, 359 insertions(+), 116 deletions(-)

diff --git a/src/documentation/xdocs/hpsf/how-to.xml b/src/documentation/xdocs/hpsf/how-to.xml
index 8098acb15..c640f1944 100644
--- a/src/documentation/xdocs/hpsf/how-to.xml
+++ b/src/documentation/xdocs/hpsf/how-to.xml
@@ -33,10 +33,9 @@
  • -

    The third section tells how to read +

    The third section tells how to read non-standard properties. Non-standard properties are application-specific - name/value/type triples. This section is still to be written. Look up - the API documentation for the time being!

    + triples consisting of an ID, a type, and a value.

  • @@ -303,54 +302,60 @@ else
    This section tells how to read non-standard properties. Non-standard - properties are application-specific name/type/value triples. + properties are application-specific ID/type/value triples. -

    Now comes the really hardcode stuff. As mentioned above, - SummaryInformation and - DocumentSummaryInformation are just special cases of the - general concept of a property set. The general concept says that a - property set consists of properties. Each property is an - entity that has a name, a type, and a - value.

    +
    +

    Now comes the real hardcore stuff. As mentioned above, + SummaryInformation and + DocumentSummaryInformation are just special cases of the + general concept of a property set. This concept says that a + property set consists of properties and that each + property is an entity with an ID, a + type, and a value.

    -

    Okay, that was still rather easy. However, to make things more - complicated, Microsoft in its infinite wisdom decided that a property set - shalt be broken into sections. Each section holds a bunch - of properties. But since that's still not complicated enough: A section - can optionally have a dictionary that maps property IDs to property - names - we'll explain later what that means.

    +

    Okay, that was still rather easy. However, to make things more + complicated, Microsoft in its infinite wisdom decided that a property set + shalt be broken into one or more sections. Each section + holds a bunch of properties. But since that's still not complicated + enough, a section may have an optional dictionary that + maps property IDs to property names - we'll explain + later what that means.

    -

    So the procedure to get to the properties is as follows:

    +

    The procedure to get to the properties is the following:

    -
      -
    1. Use the PropertySetFactory to create a - PropertySet from an input stream. You can try this with any - input stream: You'll either PropertySet instance or an - exception is thrown.
    2. +
        +
      1. Use the PropertySetFactory class to + create a PropertySet object from a property set stream. If + you don't know whether an input stream is a property set stream, just + try to call PropertySetFactory.create(java.io.InputStream): + Either a PropertySet instance is returned or an + exception is thrown.
      2. -
      3. Call the PropertySet's method getSections() - to get a list of sections contained in the property set. Each section is - an instance of the Section class.
      4. +
      5. Call the PropertySet's method getSections() + to get the sections contained in the property set. Each section is + an instance of the Section class.
      6. -
      7. Each section has a format ID. The format ID of the first section in a - property set determines the property set's type. For example, the first - (and only) section of the SummaryInformation property set has a format ID - of F29F85E0-4FF9-1068-AB-91-08-00-2B-27-B3-D9. You can - get the format ID with Section.getFormatID().
      8. +
      9. Each section has a format ID. The format ID of the first section in a + property set determines the property set's type. For example, the first + (and only) section of the SummaryInformation property set has a format + ID of F29F85E0-4FF9-1068-AB-91-08-00-2B-27-B3-D9. You can + get the format ID with Section.getFormatID().
      10. -
      11. The properties contained in a Section can be retrieved - with Section.getProperties(). The result is an array of - Property instances.
      12. +
      13. The properties contained in a Section can be retrieved + with Section.getProperties(). The result is an array of + Property instances.
      14. -
      15. A property has a name, a type, and a value. The Property - class has methods to retrieve them.
      16. -
      +
    3. A property has an ID, a type, and a value. The Property + class has methods to retrieve them.
    4. +
    +
    -

    Let's have a look at a sample Java application that dumps all property - set streams contained in a POI file system. The full source code of this - program can be found as ReadCustomPropertySets.java in the - examples area of the POI source code tree. Here are the key - sections:

    +
    +

    Let's have a look at a sample Java application that dumps all property + set streams contained in a POI file system. The full source code of this + program can be found as ReadCustomPropertySets.java in the + examples area of the POI source code tree. Here are the key + sections:

    import java.io.*; import java.util.*; @@ -381,8 +386,10 @@ import org.apache.poi.util.HexDump;

    The POIFSReader is set up in a way that the listener MyPOIFSReaderListener is called on every file in the POI file system.

    +
    -

    The listener class tries to create a PropertySet from each +

    +

    The listener class tries to create a PropertySet from each stream using the PropertySetFactory.create() method:

    static class MyPOIFSReaderListener implements POIFSReaderListener @@ -420,8 +427,10 @@ import org.apache.poi.util.HexDump; other types of exceptions cause the program to terminate by throwing a runtime exception. If all went well, we can print the name of the property set stream.

    +
    -

    The next step is to print the number of sections followed by the +

    +

    The next step is to print the number of sections followed by the sections themselves:

    /* Print the number of sections: */ @@ -439,18 +448,18 @@ for (Iterator i = sections.iterator(); i.hasNext();) // See below for the complete loop body. } -

    The PropertySet's method getSectionCount() - returns the number of sections.

    +

    The PropertySet's method getSectionCount() + returns the number of sections.

    -

    To retrieve the sections, use the getSections() - method. This method returns a java.util.List containing - instances of the Section class in their proper order.

    +

    To retrieve the sections, use the getSections() + method. This method returns a java.util.List containing + instances of the Section class in their proper order.

    -

    The sample code shows a loop that retrieves the Section - objects one by one and prints some information about each one. Here is the - complete body of the loop:

    +

    The sample code shows a loop that retrieves the Section + objects one by one and prints some information about each one. Here is + the complete body of the loop:

    - /* Print a single section: */ + /* Print a single section: */ Section sec = (Section) i.next(); out(" Section " + nr++ + ":"); String s = hex(sec.getFormatID().getBytes()); @@ -473,49 +482,53 @@ for (int i2 = 0; i2 < properties.length; i2++) out(" Property ID: " + id + ", type: " + type + ", value: " + value); } +
    -

    The first method called on the Section instance is - getFormatID(). As explained above, the format ID of the first - section in a property set determines the type of the property set. Its - type is ClassID which is essentially a sequence of 16 - bytes. A real application using its own type of a custom property set - should have defined a unique format ID and, when reading a property set - stream, should check the format ID is equal to that unique format ID. The - sample program just prints the format ID it finds in a section:

    +
    +

    The first method called on the Section instance is + getFormatID(). As explained above, the format ID of the + first section in a property set determines the type of the property + set. The format ID's type is ClassID, which is essentially a sequence of + 16 bytes. A real application using its own type of a custom property set + should have defined a unique format ID and, when reading a property set + stream, should check that the format ID equals that unique format ID. The + sample program just prints the format ID it finds in a section:

    - String s = hex(sec.getFormatID().getBytes()); + String s = hex(sec.getFormatID().getBytes()); s = s.substring(0, s.length() - 1); out(" Format ID: " + s); -

    As you can see, the getFormatID() method returns a - ClassID object. An array containing the bytes can be - retrieved with ClassID.getBytes(). In order to get a nicely - formatted printout, the sample program uses the hex() helper - method which in turn uses the POI utility class HexDump in - the org.apache.poi.util package. Another helper method is - out() which just saves typing - System.out.println().

    +

    As you can see, the getFormatID() method returns a + ClassID object. An array containing the bytes can be + retrieved with ClassID.getBytes(). In order to get a nicely + formatted printout, the sample program uses the hex() helper + method which in turn uses the POI utility class HexDump in + the org.apache.poi.util package. Another helper method is + out() which just saves typing + System.out.println().
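
The format ID check recommended above can be sketched without HPSF at all. The following is a minimal illustration, assuming the 16 bytes have already been obtained (for example from ClassID.getBytes()); the class and method names are hypothetical and are not part of POI. The byte values are the ones shown in the sample output for the summary information section.

```java
import java.util.Arrays;

final class FormatIdSketch {
    /** Formats bytes as upper-case hex pairs separated by single spaces. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            // Mask with 0xFF so that negative bytes print as two hex digits.
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    /** Returns true if the section's format ID matches the expected one. */
    static boolean sameFormatId(byte[] expected, byte[] actual) {
        return Arrays.equals(expected, actual);
    }

    public static void main(String[] args) {
        // Bytes as printed by the sample program for "/SummaryInformation".
        byte[] summaryInformationId = {
            (byte) 0xF2, (byte) 0x9F, (byte) 0x85, (byte) 0xE0,
            0x4F, (byte) 0xF9, 0x10, 0x68,
            (byte) 0xAB, (byte) 0x91, 0x08, 0x00,
            0x2B, 0x27, (byte) 0xB3, (byte) 0xD9
        };
        System.out.println("Format ID: " + hex(summaryInformationId));
    }
}
```

An application with its own property set type would keep its expected format ID as a constant and call something like sameFormatId() before interpreting any properties.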

    +
    -

    Before getting the properties, it is possible to find out how many - properties are available in the section via the - Section.getPropertyCount(). The sample application uses this - method to print the number of properties to the standard output:

    +
    +

    Before getting the properties, it is possible to find out how many + properties are available in the section via the + Section.getPropertyCount() method. The sample application uses this + method to print the number of properties to the standard output:

    - int propertyCount = sec.getPropertyCount(); + int propertyCount = sec.getPropertyCount(); out(" No. of properties: " + propertyCount); -

    Now its time to get to the properties themselves. You can retrieve a - section's properties with the method - Section.getProperties():

    +

    Now it's time to get to the properties themselves. You can retrieve a + section's properties with the method + Section.getProperties():

    - Property[] properties = sec.getProperties(); + Property[] properties = sec.getProperties(); -

    As you can see the result is an array of Property - objects. This class has three methods to retrieve a property's ID, its - type, and its value. The following code snippet shows how to call - them:

    +

    As you can see, the result is an array of Property + objects. This class has three methods to retrieve a property's ID, its + type, and its value. The following code snippet shows how to call + them:

    - for (int i2 = 0; i2 < properties.length; i2++) + for (int i2 = 0; i2 < properties.length; i2++) { /* Print a single property: */ Property p = properties[i2]; @@ -525,15 +538,17 @@ out(" No. of properties: " + propertyCount); out(" Property ID: " + id + ", type: " + type + ", value: " + value); } +
    -

    The output of the sample program might look like the following. It shows - the summary information and the document summary information property sets - of a Microsoft Word document. However, unlike the first and second section - of this HOW-TO the application does not have any code which is specific to - the SummaryInformation and - DocumentSummaryInformation classes.

    +
    +

    The output of the sample program might look like the following. It + shows the summary information and the document summary information + property sets of a Microsoft Word document. However, unlike the first and + second sections of this HOW-TO, the application does not have any code + which is specific to the SummaryInformation and + DocumentSummaryInformation classes.

    - Property set stream "/SummaryInformation": + Property set stream "/SummaryInformation": No. of sections: 1 Section 0: Format ID: 00000000 F2 9F 85 E0 4F F9 10 68 AB 91 08 00 2B 27 B3 D9 ....O..h....+'.. @@ -588,29 +603,247 @@ No property set stream: "/WordDocument" No property set stream: "/CompObj" No property set stream: "/1Table" -

    There are some interestion items to note:

    +

    There are some interesting items to note:

    -
      -
    • The first property set (summary information) consists of a single +
        +
      • The first property set (summary information) consists of a single section, the second property set (document summary information) consists of two sections.
      • -
      • Each section type (identified by its format ID) has its own domain of - property ID. For example, in the second property set the properties with - ID 2 have different meanings in the two section. By the way, the format - IDs of these sections are not equal, but you have to - look hard to find the difference.
      • +
      • Each section type (identified by its format ID) has its own domain of + property IDs. For example, in the second property set the properties with + ID 2 have different meanings in the two sections. By the way, the format + IDs of these sections are not equal, but you have to + look hard to find the difference.
      • -
      • The properties are not in any particular order in the section, - although they slightly tend to be sorted by their IDs.
      • -
      +
    • The properties are not in any particular order in the section, + although they slightly tend to be sorted by their IDs.
    • +
    +
    - [To be continued.] +
    +

    Properties in the same section are distinguished by their IDs. This is + similar to variables in a programming language like Java, which are + distinguished by their names. But unlike variable names, property IDs are + simple integral numbers. There is another similarity, however. Just like + a Java variable has a certain scope (e.g. a member variable in a class), + a property ID also has its scope of validity: the section.

    - A last note: There are still some aspects of HSPF left which are not - documented in this HOW-TO. You should dig into the Javadoc API - documentation to learn further details. Since you struggled through this - document up to this point, you are well prepared. +

    Two property IDs in sections with different section format IDs + don't have the same meaning even though their IDs might be equal. For + example, ID 4 in the first (and only) section of a summary + information property set denotes the document's author, while ID 4 in the + first section of the document summary information property set means the + document's byte count. The sample output above does not show a property + with an ID of 4 in the first section of the document summary information + property set. That means that the document does not have a byte + count. However, there is a property with an ID of 4 in the + second section: This is a user-defined property ID - we'll get + to that topic in a minute.

    + +

    So, how can you find out what the meaning of a certain property ID in + the summary information and the document summary information property set + is? The standard property sets as such don't have any hints about the + meanings of their property IDs. For example, the summary + information property set does not tell you that the property ID 4 stands + for the document's author. This is external knowledge. Microsoft defined + standard meanings for some of the property IDs in the summary information + and the document summary information property sets. As a help to the Java + and POI programmer, the class PropertyIDMap in the + org.apache.poi.hpsf.wellknown package defines constants + for the "well-known" property IDs. For example, there is the + definition

    + + public final static int PID_AUTHOR = 4; + +

    These definitions allow you to use symbolic names instead of + numbers.

    + +

    To support the opposite direction as well, i.e. mapping + property IDs to property names, the class PropertyIDMap + defines two static methods: + getSummaryInformationProperties() and + getDocumentSummaryInformationProperties(). Both return + java.util.Map objects which map property IDs to + strings. Such a string gives a hint about the property's meaning. For + example, + PropertyIDMap.getSummaryInformationProperties().get(4) + returns the string "PID_AUTHOR". An application could use this string as + a key to a localized string which is displayed to the user, e.g. "Author" + in English or "Verfasser" in German. HPSF might provide such + language-dependent ("localized") mappings in a later release.
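
The idea of using a string such as "PID_AUTHOR" as a key to a localized label could look roughly like the following sketch. It uses only the JDK; the class, the two maps, and the key scheme (name plus language code) are hypothetical and stand in for whatever PropertyIDMap and a future resource bundle would provide.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class LocalizedLabelSketch {
    // Stand-in for the ID-to-name map (only the entry named in the text).
    static final Map<Integer, String> ID_TO_NAME = new HashMap<>();
    // Hypothetical label table keyed by "<name>/<language code>".
    static final Map<String, String> LABELS = new HashMap<>();

    static {
        ID_TO_NAME.put(4, "PID_AUTHOR");
        LABELS.put("PID_AUTHOR/en", "Author");
        LABELS.put("PID_AUTHOR/de", "Verfasser");
    }

    /** Returns a label for display, falling back to the symbolic name. */
    static String label(int propertyId, Locale locale) {
        String name = ID_TO_NAME.get(propertyId);
        if (name == null) {
            return "Property " + propertyId;
        }
        String label = LABELS.get(name + "/" + locale.getLanguage());
        return label != null ? label : name;
    }

    public static void main(String[] args) {
        System.out.println(label(4, Locale.ENGLISH));
        System.out.println(label(4, Locale.GERMAN));
    }
}
```

Falling back to the symbolic name keeps the output readable even when no translation exists for the requested language.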

    + +

    Usually you won't have to deal with those two maps. Instead you should + call the Section.getPIDString(int) method. It returns the + string associated with the specified property ID in the context of the + Section object.

    + +

    Above you learned that property IDs have a meaning in the scope of a + section only. However, there are two exceptions to the rule: The property + IDs 0 and 1 have a fixed meaning in all sections:

    + + + + + + + + + + + + + + + + +
    Property ID | Meaning
    0 | The property's value is a dictionary, i.e. a + mapping from property IDs to strings.
    1 | The property's value is the number of a codepage, + i.e. a mapping from character codes to characters. All strings in the + section containing this property must be interpreted using this + codepage. Typical property values are 1252 (8-bit "western" characters) + or 1200 (16-bit Unicode characters).
    +
    + +
    +

    A property is nothing without its value. It is stored in a property set + stream as a sequence of bytes. You must know the property's + type in order to properly interpret those bytes and + reasonably handle the value. A property's type is one of the so-called + Microsoft-defined "variant types". When you call + Property.getType() you'll get a long value + that denotes the property's variant type. The class + Variant in the org.apache.poi.hpsf package + holds most of those long values as named constants. For + example, the constant VT_I4 = 3 means a signed integer value + of four bytes. Examples of other types are VT_LPSTR = 30 + meaning a null-terminated string of 8-bit characters, VT_LPWSTR = + 31 which means a null-terminated Unicode string, or VT_BOOL + = 11 denoting a boolean value.

    + +

    In most cases you won't need a property's type because HPSF does all + the work for you.

    +
    + +
    +

    When an application wants to retrieve a property's value and calls + Property.getValue(), HPSF has to interpret the bytes making + up the value according to the property's type. The type determines how + many bytes the value consists of and what + to do with them. For example, if the type is VT_I4, HPSF + knows that the value is four bytes long and that these bytes + comprise a signed integer value in the little-endian format. This is + quite different from e.g. a type of VT_LPWSTR. In this case + HPSF has to scan the value bytes for a Unicode null character and collect + everything from the beginning to that null character as a Unicode + string.
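
To make the two cases concrete, here is a small JDK-only sketch of the byte interpretation described above. It is not HPSF's implementation: the class and method names are made up, the string is assumed to be UTF-16 little-endian, and any length fields that the real stream format may carry are ignored.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

final class VariantDecodeSketch {
    /** Reads four bytes at offset as a signed little-endian integer (VT_I4 style). */
    static int readI4(byte[] data, int offset) {
        return ByteBuffer.wrap(data, offset, 4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getInt();
    }

    /**
     * Collects 16-bit characters from offset up to (not including) the
     * first null character, or up to the end of the array if none is found.
     */
    static String readNullTerminatedUtf16(byte[] data, int offset) {
        int end = offset;
        while (end + 1 < data.length && (data[end] != 0 || data[end + 1] != 0)) {
            end += 2;
        }
        return new String(data, offset, end - offset, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        byte[] intBytes = { 0x2A, 0x00, 0x00, 0x00 };
        System.out.println(readI4(intBytes, 0));

        byte[] textBytes = { 'H', 0, 'i', 0, 0, 0 };
        System.out.println(readNullTerminatedUtf16(textBytes, 0));
    }
}
```

The point of the sketch is only that the type decides both the length of the value and the decoding rule; an application using HPSF gets this done by Property.getValue().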

    + +

    The good news is that HPSF does another job for you, too: It maps the + variant type to an adequate Java type.

    + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    Variant type | Java type
    VT_I2 | java.lang.Integer
    VT_I4 | java.lang.Long
    VT_FILETIME | java.util.Date
    VT_LPSTR | String
    VT_LPWSTR | String
    VT_CF | byte[]
    VT_BOOL | java.lang.Boolean
    + +

    The bad news is that there are still a couple of variant types HPSF + does not yet support. If it encounters one of these types it + returns the property's value as a byte array and leaves it to be + interpreted by the application.

    + +

    An application retrieves a property's value by calling the + Property.getValue() method. This method's return type is the + general Object class. The getValue() method + looks up the property's variant type, reads the property's value bytes, + creates an instance of an adequate Java type, assigns it the property's + value and returns it. Primitive types like int or + long will be returned as the corresponding class, + e.g. Integer or Long.
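
Since the value arrives as a plain Object, the caller usually has to check its runtime class. The following sketch shows one way to do that for the Java types listed in the mapping table; the class and method are hypothetical helpers, and the value is assumed to have come from a call such as Property.getValue().

```java
import java.util.Date;

final class ValueDescribeSketch {
    /** Produces a short description of a property value based on its Java type. */
    static String describe(Object value) {
        if (value == null) {
            return "no value";
        }
        if (value instanceof Integer || value instanceof Long) {
            return "integer " + value;
        }
        if (value instanceof Boolean) {
            return "boolean " + value;
        }
        if (value instanceof String) {
            return "string \"" + value + "\"";
        }
        if (value instanceof Date) {
            return "timestamp " + ((Date) value).getTime();
        }
        if (value instanceof byte[]) {
            // Unsupported variant types are handed over as raw bytes.
            return "raw bytes, length " + ((byte[]) value).length;
        }
        return "other " + value.getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println(describe(Long.valueOf(7)));
        System.out.println(describe("Rainer Klute"));
        System.out.println(describe(new byte[3]));
    }
}
```

The byte[] branch is where an application would plug in its own interpretation of variant types that HPSF does not yet decode.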

    +
    + + +
    +

    The property with ID 0 has a very special meaning: It is a + dictionary mapping property IDs to property names. We + have seen already that the meanings of standard properties in the + summary information and the document summary information property sets + have been defined by Microsoft. The advantage is that the labels of + properties like "Author" or "Title" don't have to be stored in the + property set. However, a user can define custom fields in, say, Microsoft + Word. For each field the user has to specify a name, a type, and a + value.

    + +

    The names of the custom-defined fields (i.e. the property names) are + stored in the document summary information second section's + dictionary. The dictionary is a map which associates + property IDs with property names.

    + +

    The method Section.getPIDString(int) not only returns + the well-known property names of the summary information and document + summary information property sets, but the names of self-defined properties, + too. It should also work with self-defined properties in self-defined + sections.
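
Conceptually, such a lookup combines two sources of names: the section's own dictionary and the table of well-known IDs. The sketch below illustrates that idea with plain maps. The lookup order (dictionary first), the fallback string, and all names are assumptions made for illustration, not a description of how Section.getPIDString(int) is actually implemented.

```java
import java.util.HashMap;
import java.util.Map;

final class PidStringSketch {
    /**
     * Looks up a property name, preferring the section's dictionary
     * (which may be null) over the table of well-known names.
     */
    static String pidString(Map<Long, String> sectionDictionary,
                            Map<Long, String> wellKnown,
                            long id) {
        if (sectionDictionary != null) {
            String custom = sectionDictionary.get(id);
            if (custom != null) {
                return custom;
            }
        }
        String standard = wellKnown.get(id);
        return standard != null ? standard : "[undefined]";
    }

    public static void main(String[] args) {
        Map<Long, String> wellKnown = new HashMap<>();
        wellKnown.put(4L, "PID_AUTHOR");

        // Hypothetical user-defined field stored in a section dictionary.
        Map<Long, String> dictionary = new HashMap<>();
        dictionary.put(4L, "Project code");

        System.out.println(pidString(null, wellKnown, 4));
        System.out.println(pidString(dictionary, wellKnown, 4));
    }
}
```

The example also shows why the scope matters: the same ID 4 resolves to different names depending on which section's dictionary is consulted.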

    +
    + +
    + Improve codepage support! + +

    The property with ID 1 holds the number of the codepage which was used + to encode the strings in this section. The present HPSF codepage support + is still very limited: When reading property value strings, HPSF + distinguishes between 16-bit characters and 8-bit characters. 16-bit + characters should be Unicode characters and thus be okay. 8-bit + characters are interpreted according to the platform's default character + set. This is fine as long as the document being read has been written on + a platform with the same default character set. However, if you receive a + document from another region of the world and want to process it with + HPSF you are in trouble - unless the creator used Unicode, of course.
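
What fuller codepage support might involve can be sketched with the JDK's charset API. This is an illustration only: it covers just the two codepage numbers mentioned in this HOW-TO, the class and method names are hypothetical, and falling back to the platform default for unknown codepages simply mirrors the current behaviour described above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CodepageSketch {
    /** Maps a codepage number (the value of property ID 1) to a Java charset. */
    static Charset charsetFor(int codepage) {
        switch (codepage) {
            case 1200:
                // 16-bit Unicode, little-endian.
                return StandardCharsets.UTF_16LE;
            case 1252:
                // 8-bit "western" characters; ISO-8859-1 is a close
                // substitute if the JVM lacks windows-1252.
                return Charset.isSupported("windows-1252")
                        ? Charset.forName("windows-1252")
                        : StandardCharsets.ISO_8859_1;
            default:
                // Unknown codepage: platform default, as HPSF does today.
                return Charset.defaultCharset();
        }
    }

    /** Decodes string bytes using the charset selected by the codepage. */
    static String decode(byte[] bytes, int codepage) {
        return new String(bytes, charsetFor(codepage));
    }

    public static void main(String[] args) {
        System.out.println(decode(new byte[] { 'A', 0 }, 1200));
        System.out.println(decode(new byte[] { (byte) 0xE9 }, 1252));
    }
}
```

Choosing the charset from the document rather than from the platform is what would make documents from other regions readable.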

    +
    + +
    +

    There are still some aspects of HPSF left which are not covered by this + HOW-TO. You should dig into the Javadoc API documentation to learn + further details. Since you've struggled through this document up to this + point, you are well prepared.

    +
diff --git a/src/documentation/xdocs/hpsf/todo.xml b/src/documentation/xdocs/hpsf/todo.xml
index a77ce8126..f62d9d373 100644
--- a/src/documentation/xdocs/hpsf/todo.xml
+++ b/src/documentation/xdocs/hpsf/todo.xml
@@ -16,22 +16,25 @@
    1. -

      Add writing capability for property sets.

      +

      Add writing capability for property sets. Presently property sets can + be read only.

    2. -

      Add codepage support.

      -
    3. -
    4. -

      Add Unicode support.

      +

      Add codepage support: Presently the bytes making up the string in a + property's value are interpreted using the platform's default character + set.

    5. Add resource bundles to org.apache.poi.hpsf.wellknown to ease - localizations.

      + localizations. This would be useful for mapping standard property IDs to + localized strings. Example: The property ID 4 could be mapped to "Author" + in English or "Verfasser" in German.

    6. Implement reading functionality for those property types that are not - yet supported (other than byte arrays).

      + yet supported. HPSF should return proper Java types instead of just byte + arrays.

    7. Add WMF to java.awt.Image example code in