HDF -> HWPF : HDF directory is obsolete
git-svn-id: https://svn.apache.org/repos/asf/jakarta/poi/trunk@353299 13f79535-47bb-0310-9956-ffa450edef68
parent
b0604e9210
commit
8ff11e457e
@ -1,12 +0,0 @@
<?xml version="1.0"?>
<!DOCTYPE book PUBLIC "-//APACHE//DTD Cocoon Documentation Book V1.0//EN" "../dtd/book-cocoon-v10.dtd">
<book software="POI Project" title="HDF" copyright="@year@ POI Project">
<menu label="Jakarta POI">
<menu-item label="Top" href="../index.html"/>
</menu>
<menu label="HWPF">
<menu-item label="Overview" href="index.html"/>
<menu-item label="HWPF Format" href="docoverview.html"/>
<menu-item label="HWPF Project plan" href="projectplan.html"/>
</menu>
</book>
@ -1,94 +0,0 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V1.1//EN" "../dtd/document-v11.dtd">

<document>
<header>
<title>HDF</title>
<subtitle>Word file format</subtitle>
<authors>
<person name="S. Ryan Ackley" email="sackley@cfl.rr.com"/>
</authors>
</header>

<body>
<section><title>The Word 97 File Format in semi-plain English</title>

<p>The purpose of this document is to give a brief, high-level overview of the
HDF document format. It does not go into in-depth technical
detail and is meant only as a supplement to the Microsoft Word 97 Binary
File Format specification, freely available at <link href="http://wotsit.org">Wotsit.org</link>.</p>
<p>The OLE file format is not discussed in this document. It is assumed that
the reader has a working knowledge of the POIFS API.</p>

<section><title>Word file structure</title>
<p>A Word file is made up of the document text and data structures
containing formatting information about the text. This is a
very simplified picture: fields, macros, and other
features are not considered here. At this stage, HDF is mainly
concerned with formatted text.</p>
</section>
<section><title>Reading Word files</title>
<p>The entry point for HDF's reading of a Word file is the File Information
Block (FIB). This structure records the locations and sizes
of a document's text and data structures. The FIB is located at the
beginning of the main stream.</p>
<section><title>Text</title>
<p>The document's text is also located in the main stream. Its starting
location is given as FIB.fcMin and its length is given in characters by
FIB.ccpText. These two values alone are not enough to extract the text,
because unicode text may be intermingled with ASCII
text. That brings us to the piece table.</p>
<p>The piece table is used to divide the text into non-unicode and unicode
pieces. Its offset and size are given in FIB.fcClx and FIB.lcbClx
respectively. The piece table may contain Property Modifiers (prm).
These are for complex (fast-saved) files and are skipped. Each text piece
contains offsets in the main stream that locate the text for that piece.
If the piece does not use unicode, a flag bit is set in its file offset.
To get the real file offset, clear that bit and divide the result by 2.</p>
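<p>The offset rule for text pieces can be sketched in Java as follows. This is a
minimal illustration, not HDF's actual code: the class and method names are
hypothetical, and it assumes that the flag is bit 30 (0x40000000) of the stored
offset and that a set flag marks an 8-bit (non-unicode) piece, as described in
the Word 97 specification.</p>
<source><![CDATA[
```java
/** Hypothetical helper for decoding a piece's stored file offset. */
final class PieceOffset {
    // Assumption: bit 30 of the stored offset is the "8-bit text" flag.
    static final int EIGHT_BIT_FLAG = 0x40000000;

    /** True when the piece holds two-byte unicode characters (flag clear). */
    static boolean isUnicode(int storedOffset) {
        return (storedOffset & EIGHT_BIT_FLAG) == 0;
    }

    /** Returns the real offset of the piece's text in the main stream. */
    static int realOffset(int storedOffset) {
        if (isUnicode(storedOffset)) {
            return storedOffset;                      // already a real offset
        }
        return (storedOffset & ~EIGHT_BIT_FLAG) / 2;  // clear the flag, then halve
    }

    public static void main(String[] args) {
        System.out.println(realOffset(0x40000800));   // 8-bit piece
        System.out.println(realOffset(0x00000800));   // unicode piece
    }
}
```
]]></source>
<p>Keeping the flag test and the offset arithmetic in one place means the code
that reads the characters only needs to know whether to read one or two bytes
per character.</p>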
</section>
<section><title>Text Formatting</title>
<section><title>Stylesheet</title>
<p>All text formatting is based on styles contained in the StyleSheet.
The StyleSheet is a data structure containing, among other things, style
descriptions. Each style description contains either a paragraph style and
a character style, or a character style alone. Each style description
is stored on file in compressed form, that is, as deltas
from another style.</p>
<p>Eventually, you have to chain back to the nil style, which is an
imaginary style with certain implied values.</p>
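<p>The chaining of style deltas back to the nil style can be sketched as follows.
This is a simplified illustration under stated assumptions, not HDF's
implementation: real style descriptions store their deltas in binary form,
whereas this sketch uses a plain map of property names. The class names, the
property names, and the nil style's values are hypothetical; the nil index
0x0FFF is taken from the Word 97 specification.</p>
<source><![CDATA[
```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: resolve a style by applying deltas along its base chain. */
final class StyleChain {
    // Assumption: index value that stands for the nil style.
    static final int ISTD_NIL = 0x0FFF;

    /** One style description: the index of its base style plus its own deltas. */
    static final class StyleDescription {
        final int baseIndex;
        final Map<String, Object> deltas;

        StyleDescription(int baseIndex, Map<String, Object> deltas) {
            this.baseIndex = baseIndex;
            this.deltas = deltas;
        }
    }

    /** The imaginary nil style; these implied values are illustrative only. */
    static Map<String, Object> nilStyle() {
        Map<String, Object> props = new HashMap<>();
        props.put("bold", Boolean.FALSE);
        props.put("fontSizeHalfPoints", 20);
        return props;
    }

    /** Resolves the base style first, then lays this style's deltas on top. */
    static Map<String, Object> resolve(StyleDescription[] sheet, int index) {
        if (index == ISTD_NIL) {
            return nilStyle();
        }
        StyleDescription style = sheet[index];
        Map<String, Object> props = resolve(sheet, style.baseIndex);
        props.putAll(style.deltas);
        return props;
    }
}
```
]]></source>
<p>The sketch assumes the base chain contains no cycles; a defensive reader would
track visited indexes before recursing.</p>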
</section>
<section><title>Paragraph and Character styles</title>
<p>Paragraph and character formatting properties for a document's text are
stored on file as deltas from some base style in the StyleSheet. The
deltas are used to create a complete uncompressed style in memory.</p>
<p>Uncompressed paragraph styles are represented by the Paragraph
Properties (PAP) data structure. Uncompressed character styles are
represented by the Character Properties (CHP) data structure. The styles
for the document text are stored in compressed format in the
corresponding Formatted Disk Pages (FKP). A compressed PAP is referred
to as a PAPX and a compressed CHP as a CHPX. The FKP locations are
stored in the bin table. There are separate bin tables for CHPXs and
PAPXs. The bin tables' locations and sizes are stored in the FIB.</p>
<p>An FKP is a 512-byte OLE page. It contains the offsets of the beginning
and end of each paragraph/character run in the main stream and the
compressed properties for that interval. The compressed PAPX is based on
its base style in the StyleSheet. The compressed CHPX is based on the
enclosing paragraph's base style in the StyleSheet.</p>
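<p>Reading the run boundaries out of one FKP can be sketched as follows. This is
an illustration under assumptions drawn from the Word 97 specification rather
than from the text above: the last byte of the 512-byte page holds the number of
runs, and the page starts with one more little-endian 4-byte offset than there
are runs. The class and method names are hypothetical, and locating the
compressed properties themselves is left out.</p>
<source><![CDATA[
```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical sketch: extract the run boundaries stored in one FKP. */
final class FkpRuns {
    static final int PAGE_SIZE = 512;

    /** Returns runCount + 1 offsets; run i spans boundaries[i] to boundaries[i + 1]. */
    static int[] runBoundaries(byte[] page) {
        if (page.length != PAGE_SIZE) {
            throw new IllegalArgumentException("an FKP must be exactly 512 bytes");
        }
        // Assumption: the run count is stored in the last byte of the page.
        int runCount = page[PAGE_SIZE - 1] & 0xFF;
        if ((runCount + 1) * 4 > PAGE_SIZE - 1) {
            throw new IllegalArgumentException("run count too large for one page");
        }
        ByteBuffer buffer = ByteBuffer.wrap(page).order(ByteOrder.LITTLE_ENDIAN);
        int[] boundaries = new int[runCount + 1];
        for (int i = 0; i <= runCount; i++) {
            boundaries[i] = buffer.getInt(i * 4);   // offsets are packed from byte 0
        }
        return boundaries;
    }
}
```
]]></source>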
</section>
<section><title>Uncompressing styles and other data structures</title>
<p>All compressed properties (CHPX, PAPX, SEPX) contain a grpprl. A grpprl
is an array of sprms. A sprm defines a delta from some base property.
There is a table of possible sprms in the Word 97 spec. Each sprm is a
two-byte opcode followed by a parameter. The parameter size depends on
the sprm. Each sprm describes an operation that should be performed on
the base style. After every sprm in the grpprl has been applied to the base
style, you will have the style for the paragraph, character run,
section, etc.</p>
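<p>Applying a grpprl to a base style can be sketched as follows. This is a
minimal illustration, not HDF's implementation. It assumes, following the Word 97
specification, that the top three bits of each two-byte code select the parameter
size and that 0x0835 and 0x4A43 are the codes for bold and font size in
half-points. The class names are hypothetical, only two properties are
modelled, and the special-case sprms whose length is encoded differently are not
handled.</p>
<source><![CDATA[
```java
/** Hypothetical sketch: walk a grpprl and apply each sprm to a base style. */
final class GrpprlSketch {
    /** Tiny stand-in for a CHP with just two properties. */
    static final class CharProps {
        boolean bold;
        int halfPoints;

        CharProps(boolean bold, int halfPoints) {
            this.bold = bold;
            this.halfPoints = halfPoints;
        }
    }

    // Assumed codes from the Word 97 sprm table.
    static final int SPRM_BOLD = 0x0835;       // one-byte parameter
    static final int SPRM_FONT_SIZE = 0x4A43;  // two-byte parameter, half-points

    /** Parameter size in bytes, selected by the top three bits of the code. */
    static int parameterSize(int code, byte[] grpprl, int start) {
        switch ((code >> 13) & 0x7) {
            case 0:
            case 1:
                return 1;
            case 2:
            case 4:
            case 5:
                return 2;
            case 3:
                return 4;
            case 7:
                return 3;
            default:
                // Variable length: one length byte, then that many bytes.
                return 1 + (grpprl[start] & 0xFF);
        }
    }

    /** Returns a copy of base with every recognised sprm applied in order. */
    static CharProps apply(CharProps base, byte[] grpprl) {
        CharProps result = new CharProps(base.bold, base.halfPoints);
        int pos = 0;
        while (pos + 3 <= grpprl.length) {     // code plus at least one parameter byte
            int code = (grpprl[pos] & 0xFF) | ((grpprl[pos + 1] & 0xFF) << 8);
            int start = pos + 2;
            int size = parameterSize(code, grpprl, start);
            if (start + size > grpprl.length) {
                break;                         // truncated sprm: stop quietly
            }
            if (code == SPRM_BOLD) {
                // Only the plain on/off values are handled in this sketch.
                result.bold = grpprl[start] == 1;
            } else if (code == SPRM_FONT_SIZE) {
                result.halfPoints = (grpprl[start] & 0xFF) | ((grpprl[start + 1] & 0xFF) << 8);
            }
            pos = start + size;                // unknown sprms are skipped by size
        }
        return result;
    }
}
```
]]></source>
<p>Because the parameter size is derived from the code itself, unrecognised sprms
can be skipped without a full table of every sprm.</p>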
</section>
</section>
</section>
</section>
</body>
</document>
@ -1,34 +0,0 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V1.1//EN" "../dtd/document-v11.dtd">

<document>
<header>
<title>Jakarta POI - HDF - Java APIs with XML manipulate MS-Word</title>
<subtitle>Overview</subtitle>
<authors>
<person name="Nicola Ken Barozzi" email="barozzi@nicolaken.com"/>
<person name="Andrew C. Oliver" email="acoliver@apache.org"/>
<person name="Ryan Ackley" email="sackley@apache.org"/>
</authors>
</header>

<body>
<section><title>Overview</title>

<p>HDF is the name of our port of the Microsoft Word 97(-2002) file format to
pure Java.</p>
<p>HDF is still in early development. It is in the
<link href="http://cvs.apache.org/viewcvs/jakarta-poi/src/scratchpad/">scratchpad section of
CVS</link>. Source code in the <em>org.apache.poi.hdf.extractor</em> tree is
legacy code. Source in the <em>org.apache.poi.hdf.model</em>
tree is the old legacy code refactored into an object model. Check the How-To
page for detailed examples on using HDF.
</p>
<p>
We are looking for developers! If you are interested in helping with HDF,
familiarize yourself with the source code and just start coding. Make sure
you read the guidelines for <link href="http://jakarta.apache.org/poi/getinvolved/index.html">
getting involved</link>.</p>
</section>
</body>
</document>
@ -1,367 +0,0 @@
<?xml version="1.0"?>
<!-- edited with XMLSPY v5 rel. 4 U (http://www.xmlspy.com) by Ryan Ackley (Myself) -->
<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V1.1//EN" "../dtd/document-v11.dtd">
<document>
<body>
<p>HWPF Milestones</p>
<table>
<tr>
<th>Milestones</th>
<th>Target Date</th>
<th>Owner</th>
</tr>
<tr>
<td>Read in a Word document with minimum formatting (no lists, tables, footnotes, endnotes, headers, footers) and write it back out with the result viewable in Word 97/2000</td>
<td>07/11/2003</td>
<td>Ryan</td>
</tr>
<tr>
<td>Add support for Lists and Tables</td>
<td>8/15/2003</td>
<td> </td>
</tr>
<tr>
<td>HWPF 1.0-alpha release with documentation and examples</td>
<td>8/18/2003</td>
<td>Praveen/Ryan</td>
</tr>
<tr>
<td>Add support for Headers, Footers, endnotes, and footnotes</td>
<td>8/31/2003</td>
<td>?</td>
</tr>
<tr>
<td>Add support for forms and mail merge</td>
<td>September/October 2003</td>
<td>?</td>
</tr>
</table>
<p>HWPF Task Lists</p>
<p>Read in a Word document with minimum formatting (no lists, tables, footnotes,
endnotes, headers, footers) and write it back out with the result viewable in Word 97/2000</p>
<table>
<tr>
<th>Task</th>
<th>Target Date</th>
<th>Owner</th>
</tr>
<tr>
<td>Create classes to read and write low level data structures with test cases</td>
<td>7/10/2003</td>
<td>Ryan</td>
</tr>
<tr>
<td>Create classes to read and write FontTable and Font names with test case</td>
<td>7/10/2003</td>
<td>Praveen</td>
</tr>
<tr>
<td>Final test</td>
<td>7/11/2003</td>
<td>Ryan</td>
</tr>
</table>
<p>Develop a user-friendly API so that it is fun and easy to read and write Word documents
with Java.</p>
<table>
<tr>
<th>Task</th>
<th>Target Date</th>
<th>Owner</th>
</tr>
<tr>
<td>Develop a way for SPRMs to be compressed and uncompressed</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Override CHPAbstractType with a concrete class that exposes attributes with human readable names</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Override PAPAbstractType with a concrete class that exposes attributes with human readable names</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Override SEPAbstractType with a concrete class that exposes attributes with human readable names</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Override DOPAbstractType with a concrete class that exposes attributes with human readable names</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Override TAPAbstractType with a concrete class that exposes attributes with human readable names</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Override TCAbstractType with a concrete class that exposes attributes with human readable names</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Develop a VerifyIntegrity class for testing so it is easy to determine if a Word Document is well-formed.</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Develop general intuitive API to tie everything together</td>
<td></td>
<td></td>
</tr>
</table>
<p>Add support for lists and tables</p>
<table>
<tr>
<th>Task</th>
<th>Target Date</th>
<th>Owner</th>
</tr>
<tr>
<td>Add data structures for reading and writing list data with test cases.</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Add data structures for reading and writing tables with test cases.</td>
<td></td>
<td></td>
</tr>
</table>
<p>HWPF 1.0-alpha release with documentation and examples</p>
<table>
<tr>
<th>Task</th>
<th>Target Date</th>
<th>Owner</th>
</tr>
<tr>
<td>Document the user model API</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Document the low level classes</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Come up with detailed How-To’s</td>
<td></td>
<td></td>
</tr>
</table>
</body>
</document>