Update HWPF documentation to include the newly added word 6/95 text extraction support, as well as mention XWPF + Microsoft spec docs
git-svn-id: https://svn.apache.org/repos/asf/poi/trunk@959384 13f79535-47bb-0310-9956-ffa450edef68
parent 6ee6d9095f
commit 9973978524
@@ -35,8 +35,12 @@
 <section><title>Overview</title>
 
 <p>HWPF is the name of our port of the Microsoft Word 97(-2007) file format
-to pure Java. It <em>does not</em> support the new Word 2007 .docx
-file format, which is not OLE2 based.</p>
+to pure Java. It also provides limited read only support for the older
+Word 6 and Word 95 file formats.</p>
 
+<p>The partner to HWPF for the new Word 2007 .docx format is <em>XWPF</em>.
+Whilst HWPF and XWPF provide similar features, there is not a common
+interface across the two of them at this time.</p>
+
 <p>HWPF is still in early development. It is in the <link
 href="http://svn.apache.org/viewcvs.cgi/poi/trunk/src/scratchpad/">
@@ -53,6 +57,20 @@
 code.
 </p>
 
+<section>
+<title>XWPF Patches Required!</title>
+
+<p>At the moment, XWPF covers many common use cases for reading and writing
+.docx files. Whilst this is a great thing, it does mean that XWPF does
+everything that the current POI committers need it to do, and so none of
+the committers are actively adding new features.</p>
+
+<p>If you come across a feature in XWPF that you need, and isn't currently
+there, please do send in a patch to add the extra functionality! More details
+on contributing patches are available on the <link
+href="../getinvolved/index.html">"Contribution to POI" page</link>.</p>
+</section>
+
 <section>
 <title>HWPF Pointman Needed!</title>
 
@@ -65,12 +83,12 @@
 <p>If <strong>you</strong> are interested in becoming the new HWPF
 pointman, you should look into the Microsoft Word internals. A good
 starting point seems to be Ryan Ackley's <link
-href="docoverview.html">overview</link>. This document contains a link to
-a detailled Word format description you can find somewhere at
-<link href="http://www.wotsit.org/">http://www.wotsit.org/</link>. Please
-do not contact Ryan Ackley directly, because he is working for a company
-now that signed a NDA with Microsoft and thus he will be no longer able to
-answer questions.</p>
+href="docoverview.html">overview</link>. Full details on the word format
+is available from
+<link href="http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx">Microsoft</link>,
+but the documentation can be a little hard to get into at first... Try reading the
+<link href="docoverview.html">overview</link> first, and looking at the existing
+code, then finally look up the documentation for specific missing features.</p>
 
 <p>As a first step you should familiarize yourself with the source code,
 examples, test cases, and the HWPF patches available at <link
@@ -88,13 +106,14 @@
 </ul>
 
 <p>When you start coding, you will not yet have write access to the
-CVS repository. Please submit your patches to <link
+SVN repository. Please submit your patches to <link
 href="http://issues.apache.org/">Bugzilla</link> and nag <link
-href="mailto:klute@apache.org">Rainer Klute</link> until he commits
-them. Besides the actual checking in of HWPF patches Rainer will also do
-some minor reviews now and then of your source code patches, test cases
-and documentation to help ensure software quality. But most of the time
-you will be on your own.</p>
+href="mailto:dev@poi.apache.org">the dev list</link> until someone commits
+them. Besides the actual checking in of HWPF patches, current POI
+committers will also do some minor reviews now and then of your source code
+patches, test cases and documentation to help ensure software quality. But
+most of the time you will be on your own. However, anyone offering useful
+contributions over a period of time will be offered committership!</p>
 
 <p>Please do not forget to write <link
 href="http://www.junit.org/">JUnit</link> test cases and documentation!
@@ -102,15 +121,9 @@
 consider that other contributors should be able to understand your source
 code easily. If you need any help getting started with JUnit test cases
 for HWPF, please ask on the developers' mailing list! If you show that you
-are prepared to stick at it you will most likely be given CVS commit
-access.</p>
-
-<p><strong>Important:</strong> It is legally vital for POI that you have
-never seen any documentation or specification from Microsoft that required
-you or your employer to sign an NDA to get it. Please do read the <link
-href="../getinvolved/index.html">"Contribution to POI" page</link> for
-details! This page also contains further information for you to start POI
-development.</p>
+are prepared to stick at it you will most likely be given SVN commit
+access. See <link href="../getinvolved/index.html">"Contribution to POI" page</link>
+for more details and help getting started.</p>
 
 <p>Of course we will help you as best as we can. However, presently there
 is no committer who is really familiar with the Word format, so you'll be
@@ -86,7 +86,8 @@
 provide this functionality. Examples include: <link href="http://xml.apache.org/cocoon">Cocoon</link> for
 which there are serializers for HSSF;
 <link href="http://www.openoffice.org">Open Office.org</link> with whom we collaborate in documenting the
-XLS format; and <link href="http://lucene.apache.org/">Lucene</link>
+XLS format; and <link href="http://tika.apache.org/">Tika</link> /
+<link href="http://lucene.apache.org/">Lucene</link>,
 for which we provide format interpretors. When practical, we donate
 components directly to those projects for POI-enabling them.
 </p>
@@ -50,14 +50,16 @@
 <section><title>HWPF and XWPF for Word Documents</title>
 <p>
 HWPF is our port of the Microsoft Word 97 (-2003) file format to pure
-Java. It supports read, and limited write capabilities. Please see <link
-href="./hwpf/index.html">the HWPF project page for more
+Java. It supports read, and limited write capabilities. It also provides
+simple text extraction support for the older Word 6 and Word 95 formats.
+Please see <link href="./hwpf/index.html">the HWPF project page for more
 information</link>. This component remains in early stages of
 development. It can already read and write simple files.
 </p>
 <p>
 We are also working on the XWPF for the WordprocessingML (2007+) format from the
-OOXML specification.
+OOXML specification. This provides read and write support for simpler
+files, along with text extraction capabilities.
 </p>
 </section>
 <section><title>HSLF and XSLF for PowerPoint Documents</title>
@@ -108,8 +110,8 @@
 <section><title>HSMF for Outlook Messages</title>
 <p>
 HSMF is our port of the Microsoft Outlook message file format to pure
-Java. It currently only some of the textual content of MSG files.
-Further support and documentation is expected over the comming weeks and months.
+Java. It currently only some of the textual content of MSG files, and
+some attachments. Further support and documentation is coming in slowly.
 For now, users are advised to consult the unit tests for example use.
 Please see <link href="./hsmf/index.html">the HPBF project page for more
 information</link>.
@@ -81,11 +81,15 @@
 </section>
 
 <section><title>Word</title>
-<p>For .doc files, in scratchpad there is
+<p>For .doc files from Word 97 - Word 2003, in scratchpad there is
 <em>org.apache.poi.hwpf.extractor.WordExtractor</em>, which will
-return text for your document. Those using POI 3.5 can also use
+return text for your document.</p>
+<p>Those using POI 3.7 can also extract simple textual content from
+older Word 6 and Word 95 files, using the scratchpad class
+<em>org.apache.poi.hwpf.extractor.Word6Extractor</em>.</p>
+<p>Since POI 3.5, it is possible to use
 <em>org.apache.poi.xwpf.extractor.XPFFWordExtractor</em>, to perform
-a similar task for .docx files.</p>
+text extraction for .docx files.</p>
 </section>
 
 <section><title>PowerPoint</title>
@@ -97,6 +101,12 @@
 perform a similar task for .pptx files.</p>
 </section>
 
+<section><title>Publisher</title>
+<p>For .pub files, in scratchpad there is
+<em>org.apache.poi.hpbf.extractor.PublisherExtractor</em>, which
+will return text for your file.</p>
+</section>
+
 <section><title>Visio</title>
 <p>For .vsd files, in scratchpad there is
 <em>org.apache.poi.hdgf.extractor.VisioTextExtractor</em>, which
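Reviewer's illustration (not part of this commit): since the updated docs note that HWPF and XWPF share no common interface, a caller has to decide which component to hand a file to. The sketch below shows one possible way to do that using only the JDK, by checking the leading bytes of the file. The class name WordFormatSniffer and the method suggestComponent are hypothetical and do not exist in POI. It assumes that .doc files (Word 97-2003, and as far as I know Word 6/95 too) start with the OLE2 signature, and that .docx files start with a ZIP local file header. It cannot tell Word 6/95 from Word 97+, and a ZIP header alone does not prove a file is a .docx.

```java
import java.util.Arrays;

// Hypothetical helper, not a POI class: guesses which POI Word component
// a file belongs to from its first few bytes.
class WordFormatSniffer {
    // Signature at the start of an OLE2 compound file (binary .doc).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // "PK\003\004": ZIP local file header; .docx is a ZIP package.
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    // True if data begins with every byte of prefix.
    static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    // Returns "HWPF" for OLE2 input, "XWPF" for ZIP input, else "unknown".
    static String suggestComponent(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return "HWPF"; // WordExtractor, or Word6Extractor for Word 6/95
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return "XWPF"; // OOXML package, e.g. .docx
        }
        return "unknown";
    }

    public static void main(String[] args) {
        byte[] ole2 = Arrays.copyOf(OLE2_MAGIC, 16); // signature plus padding
        byte[] zip = Arrays.copyOf(ZIP_MAGIC, 16);
        byte[] text = "plain text".getBytes();
        System.out.println(suggestComponent(ole2));
        System.out.println(suggestComponent(zip));
        System.out.println(suggestComponent(text));
    }
}
```

In real use the header bytes would come from the first eight bytes of the input file, and the returned name would only select which extractor class from the docs above to try first.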