From 4cd3f446780ca99ba5acfa89dda9329a35fd95e8 Mon Sep 17 00:00:00 2001 From: Said Ryan Ackley Date: Sun, 14 Apr 2002 06:45:58 +0000 Subject: [PATCH] Work in progress git-svn-id: https://svn.apache.org/repos/asf/jakarta/poi/trunk@352408 13f79535-47bb-0310-9956-ffa450edef68 --- src/documentation/xdocs/hdf/docoverview.xml | 94 +++++++++++++++++++++ 1 file changed, 94 insertions(+) create mode 100644 src/documentation/xdocs/hdf/docoverview.xml diff --git a/src/documentation/xdocs/hdf/docoverview.xml b/src/documentation/xdocs/hdf/docoverview.xml new file mode 100644 index 000000000..d23c71b89 --- /dev/null +++ b/src/documentation/xdocs/hdf/docoverview.xml @@ -0,0 +1,94 @@ + + + + +
+ HDF + Word file format + + + +
+ + + + +

The purpose of this document is to give a brief high level overview of the + Word 97 document format. This document does not go into in-depth technical + detail and is only meant as a supplement to the Microsoft Word 97 Binary + File Format written by Microsoft.

+

The OLE file format is not discussed in this document. It is assumed that + the reader has a working knowledge of the POIFS API.

+ + +

A Word file is made up of the document text and data structures + containing formatting information about the text. Of course, this is a + very simplified illustration. There are fields and macros and other + things that have not been considered. At this stage, HDF is mainly + concerned with formatted text.

+
+ +

The entry point for HDF's reading of a Word file is the File Information + Block (FIB). This structure is the entry point for the locations and size + of a document's text and data structures. The FIB is located at the + beginning of the main stream.

+ +

The document's text is also located in the main stream. Its starting + location is given as FIB.fcMin and its length is given in bytes by + FIB.ccpText. These two values are not very useful in getting the text + because of unicode. There may be unicode text intermingled with ASCII + text. That brings us to the piece table.

+

The piece table is used to divide the text into non-unicode and unicode + pieces. The size and offset are given in FIB.fcClx and FIB.lcbClx + respectively. The piece table may contain Property Modifiers (prm). + These are for complex(fast-saved) files and are skipped. Each text piece + contains offsets in the main stream that contain text for that piece. + If the piece uses unicode, the file offset is masked with a certain bit. + Then you have to unmask the bit and divide by 2 to get the real file + offset.

+
+ + +

All text formatting is based on styles contained in the StyleSheet. + The StyleSheet is a data structure containing among other things, style + descriptions. Each style description can contain a paragraph style and + a character style or simply a character style. Each style description + is stored in a compressed version on file. Basically these are deltas + from another style.

+

Eventually, you have to chain back to the nil style which is an + imaginary style with certain implied values.

+
+ +

Paragraph and Character formatting properties for a document's text are + stored on file as deltas from some base style in the Stylesheet. The + deltas are used to create a complete uncompressed style in memory.

+

Uncompressed paragraph styles are represented by the Pargraph + Properties(PAP) data structure. Uncompressed character styles are + represented by the Character Properties(CHP) data structure. The styles + for the document text are stored in compressed format in the + corresponding Formatted Disk Pages (FKP). A compressed PAP is referred + to as a PAPX and a compressed CHP is a CHPX. The FKP locations are + stored in the bin table. There are seperate bin tables for CHPXs and + PAPXs. The bin tables' locations and sizes are stored in the FIB.

+

A FKP is a 512 byte OLE page. It contains the offsets of the beginning + and end of each paragraph/character run in the main stream and the + compressed properties for that interval. The compessed PAPX is based on + its base style in the StyleSheet. The compressed CHPX is based on the + enclosing paragraph's base style in the Stylesheet.

+
+ +

All compressed properties(CHPX, PAPX, SEPX) contain a grpprl. A grpprl + is an array of sprms. A sprm defines a delta from some base property. + There is a table of possible sprms in the Word 97 spec. Each sprm is a + two byte operand followed by a parameter. The parameter size depends on + the sprm. Each sprm describes an operation that should be performed on + the base style. After every sprm in the grpprl is performed on the base + style you will have the style for the paragraph, character run, + section, etc.

+
+
+
+
+ +
+