A few small updates to the HSLF usage docs, and adding some initial documentation on the PowerPoint file format
git-svn-id: https://svn.apache.org/repos/asf/jakarta/poi/trunk@353707 13f79535-47bb-0310-9956-ffa450edef68
parent
2b8f3a505f
commit
ed83ff62b1
@ -13,6 +13,7 @@
|
||||
<menu label="HSLF">
|
||||
<menu-item label="Overview" href="index.html"/>
|
||||
<menu-item label="Quick Guide" href="quick-guide.html"/>
|
||||
<menu-item label="PPT File Format" href="ppt-file-format.html"/>
|
||||
</menu>
|
||||
|
||||
</book>
|
||||
|
181
src/documentation/content/xdocs/hslf/ppt-file-format.xml
Normal file
@ -0,0 +1,181 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!-- Copyright (C) 2004 The Apache Software Foundation. All rights reserved. -->
|
||||
<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V1.1//EN" "../dtd/document-v11.dtd">
|
||||
|
||||
<document>
|
||||
<header>
|
||||
<title>POI-HSLF - A Guide to the PowerPoint File Format</title>
|
||||
<subtitle>Overview</subtitle>
|
||||
<authors>
|
||||
<person name="Nick Burch" email="nick at torchbox dot com"/>
|
||||
</authors>
|
||||
</header>
|
||||
|
||||
<body>
|
||||
<section><title>Records, Containers and Atoms</title>
|
||||
<p>
|
||||
PowerPoint documents are made up of a tree of records. A record may
|
||||
contain either other records (in which case it is a Container),
|
||||
or data (in which case it's an Atom). A record can't hold both.
|
||||
</p>
|
||||
<p>
|
||||
PowerPoint documents don't have one overall container record. Instead,
|
||||
there are a number of different container records to be found at
|
||||
the top level.
|
||||
</p>
|
||||
<p>
|
||||
Any numbers or strings stored in the records are always stored in
|
||||
Little Endian format (least significant byte first). This is the case
|
||||
no matter what platform the file was written on - be that a
|
||||
Little Endian or a Big Endian system.
|
||||
</p>
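<p>
As a small illustration of the byte order described above, the following
Python sketch decodes two values using the standard <code>struct</code> module.
The sample bytes are only illustrative; they happen to match the record type
and length fields used in the worked example later in this document.
</p>

```python
import struct

# Bytes exactly as they would sit on disk (least significant byte first).
type_bytes = bytes.fromhex("7217")        # a 16 bit record type
length_bytes = bytes.fromhex("20000000")  # a 32 bit record length

# "<H" = little endian unsigned 16 bit, "<I" = little endian unsigned 32 bit.
(record_type,) = struct.unpack("<H", type_bytes)
(length,) = struct.unpack("<I", length_bytes)

print(record_type)  # 6002
print(length)       # 32
```

<p>
The explicit <code>&lt;</code> in the format string matters: without it,
<code>struct</code> would use the byte order of the machine running the code.
</p>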
|
||||
<p>
|
||||
PowerPoint may have Escher (DDF) records embedded in it. These
|
||||
are always held as the children of a PPDrawing record (record
|
||||
type 1036). Escher records use the same header layout as PowerPoint
|
||||
records.
|
||||
</p>
|
||||
</section>
|
||||
|
||||
<section><title>Record Headers</title>
|
||||
<p>
|
||||
All records, be they containers or atoms, have the same standard
|
||||
8 byte header. It is:
|
||||
</p>
|
||||
<ul><li>4 bit (half byte) container flag</li>
|
||||
<li>12 bit (1.5 byte) option field</li>
|
||||
<li>2 byte record type</li>
|
||||
<li>4 byte record length</li></ul>
|
||||
<p>
|
||||
If the first byte of the header, BINARY_AND with 0x0f, is 0x0f,
|
||||
then the record is a container. Otherwise, it's an atom. The rest
|
||||
of the first two bytes are used to store the "options" for the
|
||||
record. Most commonly, this is used to indicate the version of
|
||||
the record, but the exact usage is record specific.
|
||||
</p>
|
||||
<p>
|
||||
The record type is a little endian number, which tells you what
|
||||
kind of record you're dealing with. Each different kind of record
|
||||
has its own value that gets stored here. PowerPoint records have
|
||||
a type that's normally less than 6000 (decimal). Escher records
|
||||
normally have a type between 0xF000 and 0xF1FF.
|
||||
</p>
|
||||
<p>
|
||||
The record length is another little endian number. For an atom,
|
||||
it's the size of the data part of the record, i.e. the length
|
||||
of the record <em>less</em> its 8 byte record header. For a
|
||||
container, it's the size of all the records that are children of
|
||||
this record. That means that the size of a container record is the
|
||||
length, plus 8 bytes for its record header.
|
||||
</p>
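<p>
The header rules above can be sketched in Python roughly as follows. This is
an illustration only, not POI code: the names <code>RecordHeader</code>,
<code>parse_record_header</code> and <code>iter_children</code> are made up for
this example, the first two bytes are kept as one raw 16 bit value rather than
being split further, and the sample records (types 1000 and 4000) are synthetic.
</p>

```python
import struct
from typing import NamedTuple

HEADER_SIZE = 8  # every record starts with the same 8 byte header


class RecordHeader(NamedTuple):
    is_container: bool
    raw_options: int   # the first two bytes, container flag included
    record_type: int
    length: int        # size of the data (atom) or of all children (container)


def parse_record_header(data, offset=0):
    """Decode the 8 byte header found at `offset` (hypothetical helper)."""
    if len(data) - offset < HEADER_SIZE:
        raise ValueError("not enough bytes for a record header")
    raw_options, record_type, length = struct.unpack_from("<HHI", data, offset)
    # First byte AND 0x0f == 0x0f marks a container; anything else is an atom.
    is_container = (data[offset] & 0x0F) == 0x0F
    return RecordHeader(is_container, raw_options, record_type, length)


def iter_children(data, offset=0):
    """Yield (offset, header) for each direct child of the container at `offset`."""
    parent = parse_record_header(data, offset)
    if not parent.is_container:
        return
    pos = offset + HEADER_SIZE
    end = pos + parent.length
    while pos < end:
        child = parse_record_header(data, pos)
        yield pos, child
        # Atom or container, the full record is always header + length.
        pos += HEADER_SIZE + child.length


# Synthetic data: one container (type 1000) holding two 4 byte atoms (type 4000).
atom = struct.pack("<HHI", 0, 4000, 4) + b"\x01\x02\x03\x04"
container = struct.pack("<HHI", 0x000F, 1000, 2 * len(atom)) + atom + atom

children = list(iter_children(container))
print([(pos, hdr.record_type, hdr.length) for pos, hdr in children])
# [(8, 4000, 4), (20, 4000, 4)]
```

<p>
Because the length of a container covers all of its children, the same
"header plus length" step skips over a whole subtree just as easily as over
a single atom.
</p>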
|
||||
</section>
|
||||
|
||||
<section><title>CurrentUserAtom, UserEditAtom and PersistPtrIncrementalBlock</title>
|
||||
<p><strong>aka Records that care about the byte level position of other records</strong></p>
|
||||
<p>
|
||||
A small number of records contain byte level position offsets to other
|
||||
records. If you change the position of any records in the file, then
|
||||
there's a good chance that you will need to update some of these
|
||||
special records.
|
||||
</p>
|
||||
<p>
|
||||
First up, CurrentUserAtom. This is actually stored in a different
|
||||
OLE2 (POIFS) stream to the main PowerPoint document. It contains
|
||||
a few bits of information on who last edited the file. Most
|
||||
importantly, at byte 8 of its contents, it stores (as a 32 bit
|
||||
little endian number) the offset in the main stream to the most
|
||||
recent UserEditAtom.
|
||||
</p>
|
||||
<p>
|
||||
The UserEditAtom contains two byte level offsets (again as 32 bit
|
||||
little endian numbers). At byte 12 is the offset to the
|
||||
PersistPtrIncrementalBlock associated with this UserEditAtom
|
||||
(each UserEditAtom has one and only one PersistPtrIncrementalBlock).
|
||||
At byte 8, there's the offset to the previous UserEditAtom. If this
|
||||
is 0, then you're at the first one.
|
||||
</p>
|
||||
<p>
|
||||
Every time you do a non-full save in PowerPoint, it tacks on another
|
||||
UserEditAtom and another PersistPtrIncrementalBlock. The
|
||||
CurrentUserAtom is updated to point to this new UserEditAtom, and the
|
||||
new UserEditAtom points back to the previous UserEditAtom. You then
|
||||
end up with a chain, starting from the CurrentUserAtom, linking
|
||||
back through all the UserEditAtoms, until you reach the first one
|
||||
from a full save.
|
||||
</p>
|
||||
<source>
|
||||
/-------------------------------\
|
||||
| CurrentUserAtom (own stream) |
|
||||
| OffsetToCurrentEdit = 10562 |==\
|
||||
\-------------------------------/ |
|
||||
|
|
||||
/==================================/
|
||||
| /-----------------------------------\
|
||||
| | PersistPtrIncrementalBlock @ 6144 |
|
||||
| \-----------------------------------/
|
||||
| /---------------------------------\ |
|
||||
| | UserEditAtom @ 6176 | |
|
||||
| | LastUserEditAtomOffset = 0 | |
|
||||
| | PersistPointersOffset = 6144 |==================/
|
||||
| \---------------------------------/
|
||||
| | /-----------------------------------\
|
||||
| \====================\ | PersistPtrIncrementalBlock @ 8646 |
|
||||
| | \-----------------------------------/
|
||||
| /---------------------------------\ | |
|
||||
| | UserEditAtom @ 8674 | | |
|
||||
| | LastUserEditAtomOffset = 6176 |=/ |
|
||||
| | PersistPointersOffset = 8646 |==================/
|
||||
| \---------------------------------/
|
||||
| | /------------------------------------\
|
||||
| \====================\ | PersistPtrIncrementalBlock @ 10538 |
|
||||
| | \------------------------------------/
|
||||
| /---------------------------------\ | |
|
||||
\==| UserEditAtom @ 10562 | | |
|
||||
| LastUserEditAtomOffset = 8674 |=/ |
|
||||
| PersistPointersOffset = 10538 |==================/
|
||||
\---------------------------------/
|
||||
</source>
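<p>
Following the chain in the diagram above might look something like this
Python sketch. It makes two assumptions that the text does not spell out: that
the "byte 8" and "byte 12" positions are counted from the start of a record's
contents (i.e. after its 8 byte header), and that the CurrentUserAtom stream
begins with that atom's own header. The function names are hypothetical, and
the streams are empty buffers with only the offsets from the diagram filled in.
</p>

```python
import struct

HEADER_SIZE = 8  # standard record header


def read_current_edit_offset(current_user_stream):
    # Assumed layout: 8 byte header, then the 32 bit offset at byte 8 of the contents.
    (offset,) = struct.unpack_from("<I", current_user_stream, HEADER_SIZE + 8)
    return offset


def walk_user_edit_chain(main_stream, current_edit_offset):
    """Return [(user_edit_offset, persist_ptr_offset), ...], newest first."""
    chain = []
    seen = set()
    offset = current_edit_offset
    while offset not in seen:  # stop rather than loop forever on a damaged file
        seen.add(offset)
        # Contents byte 8: previous UserEditAtom; contents byte 12: persist pointers.
        last_edit, persist_ptrs = struct.unpack_from(
            "<II", main_stream, offset + HEADER_SIZE + 8)
        chain.append((offset, persist_ptrs))
        if last_edit == 0:  # 0 means this is the first UserEditAtom
            break
        offset = last_edit
    return chain


# Rebuild the three edits from the diagram in otherwise empty streams.
main_stream = bytearray(10562 + HEADER_SIZE + 16)
for at, last_edit, persist_ptrs in [(6176, 0, 6144),
                                    (8674, 6176, 8646),
                                    (10562, 8674, 10538)]:
    struct.pack_into("<II", main_stream, at + HEADER_SIZE + 8, last_edit, persist_ptrs)

current_user = bytearray(HEADER_SIZE + 12)
struct.pack_into("<I", current_user, HEADER_SIZE + 8, 10562)

chain = walk_user_edit_chain(main_stream, read_current_edit_offset(current_user))
print(chain)  # [(10562, 10538), (8674, 8646), (6176, 6144)]
```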
|
||||
<p>
|
||||
The PersistPtrIncrementalBlock contains byte offsets to all the
|
||||
Slides, Notes, Documents and MasterSlides in the file. The first
|
||||
PersistPtrIncrementalBlock will point to all the ones that
|
||||
were present the first time the file was saved. Subsequent
|
||||
PersistPtrIncrementalBlocks will contain pointers to all the ones
|
||||
that were changed in that edit. To find the offset to a given
|
||||
sheet in the latest version, start with the most recent
|
||||
PersistPtrIncrementalBlock. If this knows about the sheet, use the
|
||||
offset it has. If it doesn't, then work back through older
|
||||
PersistPtrIncrementalBlocks until you find one which does, and
|
||||
use that.
|
||||
</p>
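<p>
The lookup described above amounts to a "newest block wins" search. A minimal
Python sketch, assuming each PersistPtrIncrementalBlock has already been
decoded into a dictionary of sheet number to byte offset, and that the
dictionaries are ordered from most recent to oldest. The function name and the
sample numbers are invented for the example.
</p>

```python
def find_sheet_offset(blocks_newest_first, sheet_number):
    """Return the offset from the most recent block that knows the sheet."""
    for block in blocks_newest_first:
        if sheet_number in block:
            return block[sheet_number]
    return None  # no block mentions this sheet


# Hypothetical decoded blocks, most recent first.
blocks = [
    {3: 9000},                    # latest edit: only sheet 3 changed
    {2: 7000, 3: 7400},           # earlier edit: sheets 2 and 3 changed
    {1: 0, 2: 3472, 3: 996},      # first full save: every sheet present
]

print(find_sheet_offset(blocks, 3))  # 9000
print(find_sheet_offset(blocks, 2))  # 7000
print(find_sheet_offset(blocks, 1))  # 0
```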
|
||||
<p>
|
||||
Each PersistPtrIncrementalBlock can contain a number of entry
|
||||
blocks. Each block holds information on a sequence of sheets.
|
||||
Each block starts with a 32 bit little endian integer. Once read
|
||||
into memory, the lower 20 bits contain the starting number for the
|
||||
sequence of sheets to be described. The higher 12 bits contain
|
||||
the count of the number of sheets described. Following that is
|
||||
one 32 bit little endian integer for each sheet in the sequence,
|
||||
the value being the offset to that sheet. If there is any data
|
||||
left after parsing a block, then it corresponds to the next block.
|
||||
</p>
|
||||
<source>
|
||||
hex on disk decimal description
|
||||
----------- ------- -----------
|
||||
0000 0 No options
|
||||
7217 6002 Record type is 6002
|
||||
2000 0000 32 Length of data is 32 bytes
|
||||
0100 5000 5242881 Count is 5 (12 highest bits)
|
||||
Starting number is 1 (20 lowest bits)
|
||||
0000 0000 0 Sheet (1+0)=1 starts at offset 0
|
||||
900D 0000 3472 Sheet (1+1)=2 starts at offset 3472
|
||||
E403 0000 996 Sheet (1+2)=3 starts at offset 996
|
||||
9213 0000 5010 Sheet (1+3)=4 starts at offset 5010
|
||||
BE15 0000 5566 Sheet (1+4)=5 starts at offset 5566
|
||||
0900 1000 1048585 Count is 1 (12 highest bits)
|
||||
Starting number is 9 (20 lowest bits)
|
||||
4418 0000 6212 Sheet (9+0)=9 starts at offset 6212
|
||||
</source>
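<p>
The table above can be reproduced with a short Python sketch. The function
name <code>decode_persist_ptr_data</code> is made up for illustration; it takes
only the data part of the record (the 32 bytes after the 8 byte header) and
returns a dictionary of sheet number to offset.
</p>

```python
import struct


def decode_persist_ptr_data(data):
    """Decode the entry blocks of a PersistPtrIncrementalBlock's data."""
    offsets = {}
    pos = 0
    while pos < len(data):  # any data left over belongs to the next block
        (info,) = struct.unpack_from("<I", data, pos)
        pos += 4
        start = info & 0xFFFFF  # lower 20 bits: first sheet number
        count = info >> 20      # higher 12 bits: number of sheets described
        for i in range(count):
            (sheet_offset,) = struct.unpack_from("<I", data, pos)
            offsets[start + i] = sheet_offset
            pos += 4
    return offsets


# The 32 data bytes from the table above, as they appear on disk.
data = bytes.fromhex(
    "01005000"                                  # count 5, starting number 1
    "00000000" "900D0000" "E4030000" "92130000" "BE150000"
    "09001000"                                  # count 1, starting number 9
    "44180000"
)

print(decode_persist_ptr_data(data))
# {1: 0, 2: 3472, 3: 996, 4: 5010, 5: 5566, 9: 6212}
```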
|
||||
</section>
|
||||
</body>
|
||||
</document>
|
@ -15,8 +15,9 @@
|
||||
<section><title>Basic Text Extraction</title>
|
||||
<p>For basic text extraction, make use of
|
||||
<code>org.apache.poi.extractor.PowerPointExtractor</code>. It accepts a file or an input
|
||||
- stream. The <code>getText()</code> method can be used to get the text from the slides,
- from the notes, or from both.
+ stream. The <code>getText()</code> method can be used to get the text from the slides, and the <code>getNotes()</code> method can be used to get the text
+ from the notes. Finally, <code>getText(true,true)</code> will get the text
+ from both.
|
||||
</p>
|
||||
</section>
|
||||
|
||||
@ -31,19 +32,45 @@ what type of text it is (eg Body, Title)
|
||||
</p>
|
||||
</section>
|
||||
|
||||
<section><title>Poor Quality Text Extraction</title>
|
||||
<p>If speed is the most important thing for you, you don't care
|
||||
about getting duplicate blocks of text, you don't care about
|
||||
getting text from master sheets, and you don't care about getting
|
||||
old text, then
|
||||
<code>org.apache.poi.extractor.QuickButCruddyTextExtractor</code>
|
||||
might be of use.</p>
|
||||
<p>QuickButCruddyTextExtractor doesn't use the normal record
|
||||
parsing code; instead, it uses a tree-structure-blind search
|
||||
method to get all text holding records. You will get all the text,
|
||||
including lots of text you normally wouldn't ever want. However,
|
||||
you will get it back very very fast!</p>
|
||||
<p>There are two ways of getting the text back.
|
||||
<code>getTextAsString()</code> will return a single string with all
|
||||
the text in it. <code>getTextAsVector()</code> will return a
|
||||
vector of strings, one for each text record found in the file.
|
||||
</p>
|
||||
</section>
|
||||
|
||||
<section><title>Changing Text</title>
|
||||
- <p>It is possible to change the text via <code>TextRun.setText(String)</code>. However, if
- the length of the text is changed, things will break because PowerPoint has
- internal file references in byte offsets, which are not yet all updated when
- the size changes.
+ <p>It is possible to change the text via
+ <code>TextRun.setText(String)</code>. However, if the length of
+ the text is changed, things will break because PowerPoint has
+ internal file references in byte offsets. We currently update all
+ of these byte references that we know about when writing out, but
+ there are a few more still to be found.
|
||||
</p>
|
||||
</section>
|
||||
|
||||
<section><title>Guide to key classes</title>
|
||||
<ul>
|
||||
<li><code>org.apache.poi.hslf.HSLFSlideShow</code>
|
||||
- Handles reading in and writing out files. Generates a tree of the records
- in the file
+ Handles reading in and writing out files. Calls
+ <code>org.apache.poi.hslf.record.record</code> to build a tree
+ of all the records in the file, which it allows access to.
|
||||
</li>
|
||||
<li><code>org.apache.poi.hslf.record.record</code>
|
||||
Base class of all records. Also provides the main record generation
|
||||
code, which will build up a tree of records for a file.
|
||||
</li>
|
||||
<li><code>org.apache.poi.hslf.usermode.SlideShow</code>
|
||||
Builds up model entries from the records, and presents a user facing
|
||||
@ -55,4 +82,4 @@ the size changes.
|
||||
</ul>
|
||||
</section>
|
||||
</body>
|
||||
</document>
|
||||
</document>
|
||||