2008-04-13 09:16:36 -04:00
|
|
|
<?xml version="1.0" encoding="UTF-8"?>
|
|
|
|
<!--
|
|
|
|
====================================================================
|
|
|
|
Licensed to the Apache Software Foundation (ASF) under one or more
|
|
|
|
contributor license agreements. See the NOTICE file distributed with
|
|
|
|
this work for additional information regarding copyright ownership.
|
|
|
|
The ASF licenses this file to You under the Apache License, Version 2.0
|
|
|
|
(the "License"); you may not use this file except in compliance with
|
|
|
|
the License. You may obtain a copy of the License at
|
|
|
|
|
|
|
|
http://www.apache.org/licenses/LICENSE-2.0
|
|
|
|
|
|
|
|
Unless required by applicable law or agreed to in writing, software
|
|
|
|
distributed under the License is distributed on an "AS IS" BASIS,
|
|
|
|
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
|
|
See the License for the specific language governing permissions and
|
|
|
|
limitations under the License.
|
|
|
|
====================================================================
|
|
|
|
-->
|
|
|
|
<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V1.3//EN" "./dtd/document-v13.dtd">
|
|
|
|
|
|
|
|
<document>
|
|
|
|
<header>
|
|
|
|
<title>POI Text Extraction</title>
|
|
|
|
<authors>
|
|
|
|
<person id="NB" name="Nick Burch" email="nick@apache.org"/>
|
|
|
|
</authors>
|
|
|
|
</header>
|
|
|
|
|
|
|
|
<body>
|
|
|
|
<section><title>Overview</title>
|
|
|
|
<p>POI provides text extraction for all the supported file
|
|
|
|
formats. In addition, it provides access to the metadata
|
|
|
|
associated with a given file, such as title and author.</p>
|
|
|
|
<p>In addition to providing direct text extraction classes,
|
|
|
|
POI works closely with the
|
|
|
|
<link href="http://incubator.apache.org/tika/">Apache Tika</link>
|
|
|
|
text extraction library. Users may wish to simply utilise
|
|
|
|
the functionality provided by Tika.</p>
|
|
|
|
</section>
|
|
|
|
|
|
|
|
<section><title>Common functionality</title>
|
|
|
|
<p>All of the POI text extractors extend from
|
|
|
|
<em>org.apache.poi.POITextExtractor</em>. This provides a common
|
|
|
|
method across all extractors, getText(). For many cases, the text
|
|
|
|
returned will be all you need. However, many extractors do provide
|
|
|
|
more targetted text extraction methods, so you may wish to use
|
|
|
|
these in some cases.</p>
|
|
|
|
<p>All POIFS / OLE 2 based text extractors also extend from
|
|
|
|
<em>org.apache.poi.POIOLE2TextExtractor</em>. This additionally
|
|
|
|
provides common methods to get at the <link href="hpfs/">HPFS
|
|
|
|
document metadata</link>.</p>
|
|
|
|
<p>All OOXML based text extractors (available in POI 3.5 and later)
|
|
|
|
also extend from
|
|
|
|
<em>org.apache.poi.POIOOXMLTextExtractor</em>. This additionally
|
|
|
|
provides common methods to get at the OOXML metadata.</p>
|
|
|
|
</section>
|
|
|
|
|
|
|
|
<section><title>Text Extractor Factory - POI 3.5 or later</title>
|
|
|
|
<p>A new class in POI 3.5,
|
|
|
|
<em>org.apache.poi.extractor.ExtractorFactory</em> provides a
|
|
|
|
similar function to WorkbookFactory. You simply pass it an
|
|
|
|
InputStream, a file, a POIFSFileSystem or a OOXML Package. It
|
|
|
|
figures out the correct text extractor for you, and returns it.</p>
|
|
|
|
</section>
|
|
|
|
|
|
|
|
<section><title>Excel</title>
|
|
|
|
<p>For .xls files, there is
|
|
|
|
<em>org.apache.poi.hssf.extractor.ExcelExtractor</em>, which will
|
|
|
|
return text, optionally with formulas instead of their contents.
|
|
|
|
Those using POI 3.5 can also use
|
|
|
|
<em>org.apache.poi.xssf.extractor.XSSFExcelExtractor</em>, to perform
|
|
|
|
a similar task for .xlsx files.</p>
|
2008-04-13 11:13:17 -04:00
|
|
|
<p>In addition, there is a second text extractor for .xls files,
|
|
|
|
<em>org.apache.poi.hssf.extractor.EventBasedExcelExtractor</em>. This
|
|
|
|
is based on the streaming EventUserModel code, and will generally
|
|
|
|
deliver a lower memory footprint for extraction. However, it will
|
|
|
|
have problems correctly outputting more complex formulas, as it
|
|
|
|
works with records as they pass, and so doesn't have access to all
|
|
|
|
parts of complex and shared formulas.</p>
|
2008-04-13 09:16:36 -04:00
|
|
|
</section>
|
|
|
|
|
|
|
|
<section><title>Word</title>
|
|
|
|
<p>For .doc files, in scratchpad there is
|
|
|
|
<em>org.apache.poi.hwpf.extractor.WordExtractor</em>, which will
|
|
|
|
return text for your document. Those using POI 3.5 can also use
|
|
|
|
<em>org.apache.poi.xwpf.extractor.XPFFWordExtractor</em>, to perform
|
|
|
|
a similar task for .docx files.</p>
|
|
|
|
</section>
|
|
|
|
|
|
|
|
<section><title>PowerPoint</title>
|
|
|
|
<p>For .ppt files, in scratchpad there is
|
|
|
|
<em>org.apache.poi.hslf.extractor.PowerPointExtractor</em>, which
|
|
|
|
will return text for your slideshow, optionally restricted to just
|
|
|
|
slides text or notes text. Those using POI 3.5 can also use
|
|
|
|
<em>org.apache.poi.xslf.extractor.XSLFPowerPointExtractor</em>, to
|
|
|
|
perform a similar task for .pptx files.</p>
|
|
|
|
</section>
|
|
|
|
|
|
|
|
<section><title>Visio</title>
|
|
|
|
<p>For .vsd files, in scratchpad there is
|
|
|
|
<em>org.apache.poi.hdgf.extractor.VisioTextExtractor</em>, which
|
|
|
|
will return text for your file.</p>
|
|
|
|
</section>
|
2009-07-18 12:49:56 -04:00
|
|
|
|
|
|
|
<section><title>Embedded Objects</title>
|
|
|
|
<p>Extractors already exist for Excel, Word, PowerPoint and Visio;
|
|
|
|
if one of these objects is embedded into a worksheet, the ExtractorFactory class can be used to recover an extractor for it.
|
|
|
|
</p>
|
|
|
|
<source>
|
|
|
|
FileInputStream fis = new FileInputStream(inputFile);
|
|
|
|
POIFSFileSystem fileSystem = new POIFSFileSystem(fis);
|
|
|
|
// Firstly, get an extractor for the Workbook
|
|
|
|
POIOLE2TextExtractor oleTextExtractor = ExtractorFactory.createExtractor(fileSystem);
|
|
|
|
// Then a List of extractors for any embedded Excel, Word, PowerPoint
|
|
|
|
// or Visio objects embedded into it.
|
|
|
|
POITextExtractor[] embeddedExtractors = ExtractorFactory.getEmbededDocsTextExtractors(oleTextExtractor);
|
|
|
|
for (POITextExtractor textExtractor : embeddedExtractors) {
|
|
|
|
// If the embedded object was an Excel spreadsheet.
|
|
|
|
if (textExtractor instanceof ExcelExtractor) {
|
|
|
|
ExcelExtractor excelExtractor = (ExcelExtractor) textExtractor;
|
|
|
|
System.out.println(excelExtractor.getText());
|
|
|
|
}
|
|
|
|
// A Word Document
|
|
|
|
else if (textExtractor instanceof WordExtractor) {
|
|
|
|
WordExtractor wordExtractor = (WordExtractor) textExtractor;
|
|
|
|
String[] paragraphText = wordExtractor.getParagraphText();
|
|
|
|
for (String paragraph : paragraphText) {
|
|
|
|
System.out.println(paragraph);
|
|
|
|
}
|
|
|
|
// Display the document's header and footer text
|
|
|
|
System.out.println("Footer text: " + wordExtractor.getFooterText());
|
|
|
|
System.out.println("Header text: " + wordExtractor.getHeaderText());
|
|
|
|
}
|
|
|
|
// PowerPoint Presentation.
|
|
|
|
else if (textExtractor instanceof PowerPointExtractor) {
|
|
|
|
PowerPointExtractor powerPointExtractor = (PowerPointExtractor) textExtractor;
|
|
|
|
System.out.println("Text: " + powerPointExtractor.getText());
|
|
|
|
System.out.println("Notes: " + powerPointExtractor.getNotes());
|
|
|
|
}
|
|
|
|
// Visio Drawing
|
|
|
|
else if (textExtractor instanceof VisioTextExtractor) {
|
|
|
|
VisioTextExtractor visioTextExtractor = (VisioTextExtractor) textExtractor;
|
|
|
|
System.out.println("Text: " + visioTextExtractor.getText());
|
|
|
|
}
|
|
|
|
}
|
|
|
|
</source>
|
|
|
|
</section>
|
2008-04-13 09:16:36 -04:00
|
|
|
</body>
|
|
|
|
|
|
|
|
<footer>
|
|
|
|
<legal>
|
|
|
|
Copyright 2005 The Apache Software Foundation or its licensors, as applicable.
|
|
|
|
$Revision: 639487 $ $Date: 2008-03-20 22:31:15 +0000 (Thu, 20 Mar 2008) $
|
|
|
|
</legal>
|
|
|
|
</footer>
|
|
|
|
</document>
|