Update HWPF documentation to include the newly added word 6/95 text extraction support, as well as mention XWPF + Microsoft spec docs

git-svn-id: https://svn.apache.org/repos/asf/poi/trunk@959384 13f79535-47bb-0310-9956-ffa450edef68
Nick Burch 2010-06-30 17:40:33 +00:00
parent 6ee6d9095f
commit 9973978524
4 changed files with 58 additions and 32 deletions

View File

@ -35,8 +35,12 @@
<section><title>Overview</title> <section><title>Overview</title>
<p>HWPF is the name of our port of the Microsoft Word 97(-2007) file format <p>HWPF is the name of our port of the Microsoft Word 97(-2007) file format
to pure Java. It <em>does not</em> support the new Word 2007 .docx to pure Java. It also provides limited read-only support for the older
file format, which is not OLE2 based.</p> Word 6 and Word 95 file formats.</p>
<p>The partner to HWPF for the new Word 2007 .docx format is <em>XWPF</em>.
Whilst HWPF and XWPF provide similar features, there is not a common
interface across the two of them at this time.</p>
<p>HWPF is still in early development. It is in the <link <p>HWPF is still in early development. It is in the <link
href="http://svn.apache.org/viewcvs.cgi/poi/trunk/src/scratchpad/"> href="http://svn.apache.org/viewcvs.cgi/poi/trunk/src/scratchpad/">
@ -53,6 +57,20 @@
code. code.
</p> </p>
<section>
<title>XWPF Patches Required!</title>
<p>At the moment, XWPF covers many common use cases for reading and writing
.docx files. Whilst this is a great thing, it does mean that XWPF does
everything that the current POI committers need it to do, and so none of
the committers are actively adding new features.</p>
<p>If you come across a feature in XWPF that you need, and isn't currently
there, please do send in a patch to add the extra functionality! More details
on contributing patches are available on the <link
href="../getinvolved/index.html">"Contribution to POI" page</link>.</p>
</section>
<section> <section>
<title>HWPF Pointman Needed!</title> <title>HWPF Pointman Needed!</title>
@ -65,12 +83,12 @@
<p>If <strong>you</strong> are interested in becoming the new HWPF <p>If <strong>you</strong> are interested in becoming the new HWPF
pointman, you should look into the Microsoft Word internals. A good pointman, you should look into the Microsoft Word internals. A good
starting point seems to be Ryan Ackley's <link starting point seems to be Ryan Ackley's <link
href="docoverview.html">overview</link>. This document contains a link to href="docoverview.html">overview</link>. Full details of the Word format
a detailed Word format description you can find somewhere at are available from
<link href="http://www.wotsit.org/">http://www.wotsit.org/</link>. Please <link href="http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx">Microsoft</link>,
do not contact Ryan Ackley directly, because he is working for a company but the documentation can be a little hard to get into at first... Try reading the
now that signed an NDA with Microsoft and thus he will be no longer able to <link href="docoverview.html">overview</link> first, and looking at the existing
answer questions.</p> code, then finally look up the documentation for specific missing features.</p>
<p>As a first step you should familiarize yourself with the source code, <p>As a first step you should familiarize yourself with the source code,
examples, test cases, and the HWPF patches available at <link examples, test cases, and the HWPF patches available at <link
@ -88,13 +106,14 @@
</ul> </ul>
<p>When you start coding, you will not yet have write access to the <p>When you start coding, you will not yet have write access to the
CVS repository. Please submit your patches to <link SVN repository. Please submit your patches to <link
href="http://issues.apache.org/">Bugzilla</link> and nag <link href="http://issues.apache.org/">Bugzilla</link> and nag <link
href="mailto:klute@apache.org">Rainer Klute</link> until he commits href="mailto:dev@poi.apache.org">the dev list</link> until someone commits
them. Besides the actual checking in of HWPF patches Rainer will also do them. Besides the actual checking in of HWPF patches, current POI
some minor reviews now and then of your source code patches, test cases committers will also do some minor reviews now and then of your source code
and documentation to help ensure software quality. But most of the time patches, test cases and documentation to help ensure software quality. But
you will be on your own.</p> most of the time you will be on your own. However, anyone offering useful
contributions over a period of time will be offered committership!</p>
<p>Please do not forget to write <link <p>Please do not forget to write <link
href="http://www.junit.org/">JUnit</link> test cases and documentation! href="http://www.junit.org/">JUnit</link> test cases and documentation!
@ -102,15 +121,9 @@
consider that other contributors should be able to understand your source consider that other contributors should be able to understand your source
code easily. If you need any help getting started with JUnit test cases code easily. If you need any help getting started with JUnit test cases
for HWPF, please ask on the developers' mailing list! If you show that you for HWPF, please ask on the developers' mailing list! If you show that you
are prepared to stick at it you will most likely be given CVS commit are prepared to stick at it you will most likely be given SVN commit
access.</p> access. See the <link href="../getinvolved/index.html">"Contribution to POI" page</link>
for more details and help getting started.</p>
<p><strong>Important:</strong> It is legally vital for POI that you have
never seen any documentation or specification from Microsoft that required
you or your employer to sign an NDA to get it. Please do read the <link
href="../getinvolved/index.html">"Contribution to POI" page</link> for
details! This page also contains further information for you to start POI
development.</p>
<p>Of course we will help you as best as we can. However, presently there <p>Of course we will help you as best as we can. However, presently there
is no committer who is really familiar with the Word format, so you'll be is no committer who is really familiar with the Word format, so you'll be

View File

@ -86,7 +86,8 @@
provide this functionality. Examples include: <link href="http://xml.apache.org/cocoon">Cocoon</link> for provide this functionality. Examples include: <link href="http://xml.apache.org/cocoon">Cocoon</link> for
which there are serializers for HSSF; which there are serializers for HSSF;
<link href="http://www.openoffice.org">Open Office.org</link> with whom we collaborate in documenting the <link href="http://www.openoffice.org">Open Office.org</link> with whom we collaborate in documenting the
XLS format; and <link href="http://lucene.apache.org/">Lucene</link> XLS format; and <link href="http://tika.apache.org/">Tika</link> /
<link href="http://lucene.apache.org/">Lucene</link>,
for which we provide format interpreters. When practical, we donate for which we provide format interpreters. When practical, we donate
components directly to those projects for POI-enabling them. components directly to those projects for POI-enabling them.
</p> </p>

View File

@ -50,14 +50,16 @@
<section><title>HWPF and XWPF for Word Documents</title> <section><title>HWPF and XWPF for Word Documents</title>
<p> <p>
HWPF is our port of the Microsoft Word 97 (-2003) file format to pure HWPF is our port of the Microsoft Word 97 (-2003) file format to pure
Java. It supports read, and limited write capabilities. Please see <link Java. It supports read, and limited write capabilities. It also provides
href="./hwpf/index.html">the HWPF project page for more simple text extraction support for the older Word 6 and Word 95 formats.
Please see <link href="./hwpf/index.html">the HWPF project page for more
information</link>. This component remains in early stages of information</link>. This component remains in early stages of
development. It can already read and write simple files. development. It can already read and write simple files.
</p> </p>
<p> <p>
We are also working on the XWPF for the WordprocessingML (2007+) format from the We are also working on the XWPF for the WordprocessingML (2007+) format from the
OOXML specification. OOXML specification. This provides read and write support for simpler
files, along with text extraction capabilities.
</p> </p>
</section> </section>
<section><title>HSLF and XSLF for PowerPoint Documents</title> <section><title>HSLF and XSLF for PowerPoint Documents</title>
@ -108,8 +110,8 @@
<section><title>HSMF for Outlook Messages</title> <section><title>HSMF for Outlook Messages</title>
<p> <p>
HSMF is our port of the Microsoft Outlook message file format to pure HSMF is our port of the Microsoft Outlook message file format to pure
Java. It currently supports only some of the textual content of MSG files. Java. It currently supports only some of the textual content of MSG files, and
Further support and documentation is expected over the coming weeks and months. some attachments. Further support and documentation is coming in slowly.
For now, users are advised to consult the unit tests for example use. For now, users are advised to consult the unit tests for example use.
Please see <link href="./hsmf/index.html">the HSMF project page for more Please see <link href="./hsmf/index.html">the HSMF project page for more
information</link>. information</link>.

View File

@ -81,11 +81,15 @@
</section> </section>
<section><title>Word</title> <section><title>Word</title>
<p>For .doc files, in scratchpad there is <p>For .doc files from Word 97 - Word 2003, in scratchpad there is
<em>org.apache.poi.hwpf.extractor.WordExtractor</em>, which will <em>org.apache.poi.hwpf.extractor.WordExtractor</em>, which will
return text for your document. Those using POI 3.5 can also use return text for your document.</p>
<p>Those using POI 3.7 can also extract simple textual content from
older Word 6 and Word 95 files, using the scratchpad class
<em>org.apache.poi.hwpf.extractor.Word6Extractor</em>.</p>
<p>Since POI 3.5, it is possible to use
<em>org.apache.poi.xwpf.extractor.XWPFWordExtractor</em>, to perform <em>org.apache.poi.xwpf.extractor.XWPFWordExtractor</em>, to perform
a similar task for .docx files.</p> text extraction for .docx files.</p>
</section> </section>
<section><title>PowerPoint</title> <section><title>PowerPoint</title>
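
Reviewer note on the Word hunk above: it tells readers to pick one of three extractor classes depending on the file's vintage. The following is a minimal sketch of that choice, not code from this commit. It uses only the JDK and returns class names as strings so that it compiles without POI on the classpath. The class `WordExtractorChooser`, its `Container` enum, and the `preWord97` flag are hypothetical names introduced for illustration. The sketch assumes the file is first sniffed by its leading bytes: the OLE2 signature for binary .doc files, the ZIP local-file signature for .docx. It also assumes the caller already knows whether an OLE2 file predates Word 97, since the container header alone does not reveal that.

```java
class WordExtractorChooser {

    /** Container type inferred from the first bytes of the file. */
    enum Container { OLE2, ZIP, UNKNOWN }

    // OLE2 compound document signature (used by binary .doc files).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header signature "PK\3\4" (used by .docx packages).
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    /** Classifies a file by its leading bytes. */
    static Container sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return Container.OLE2;
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return Container.ZIP;
        }
        return Container.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /**
     * Returns the fully qualified name of the extractor class for the file.
     * preWord97 is a hypothetical flag: the caller must determine by other
     * means whether an OLE2 file is Word 6 / Word 95.
     */
    static String extractorClassFor(Container container, boolean preWord97) {
        switch (container) {
            case OLE2:
                return preWord97
                    ? "org.apache.poi.hwpf.extractor.Word6Extractor"
                    : "org.apache.poi.hwpf.extractor.WordExtractor";
            case ZIP:
                return "org.apache.poi.xwpf.extractor.XWPFWordExtractor";
            default:
                throw new IllegalArgumentException("Not an OLE2 or ZIP based file");
        }
    }

    public static void main(String[] args) {
        byte[] docHeader = {
            (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
            (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
        };
        System.out.println(extractorClassFor(sniff(docHeader), false));
    }
}
```

In real code the returned name would be replaced by constructing the extractor itself and calling its `getText()` method. A ZIP signature only shows that the file is a ZIP package, not that it is a Word document, so a production version would need a further check.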
@ -97,6 +101,12 @@
perform a similar task for .pptx files.</p> perform a similar task for .pptx files.</p>
</section> </section>
<section><title>Publisher</title>
<p>For .pub files, in scratchpad there is
<em>org.apache.poi.hpbf.extractor.PublisherExtractor</em>, which
will return text for your file.</p>
</section>
<section><title>Visio</title> <section><title>Visio</title>
<p>For .vsd files, in scratchpad there is <p>For .vsd files, in scratchpad there is
<em>org.apache.poi.hdgf.extractor.VisioTextExtractor</em>, which <em>org.apache.poi.hdgf.extractor.VisioTextExtractor</em>, which