<?xml version="1.0" encoding="UTF-8"?>
<!-- Copyright (C) 2004 The Apache Software Foundation. All rights reserved. -->
<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V1.1//EN" "../dtd/document-v11.dtd">
<document>
<header>
<title>POI-HSLF - A Guide to the PowerPoint File Format</title>
<subtitle>Overview</subtitle>
<authors>
<person name="Nick Burch" email="nick at torchbox dot com"/>
</authors>
</header>
<body>
<section><title>Records, Containers and Atoms</title>
<p>
PowerPoint documents are made up of a tree of records. A record may
contain either other records (in which case it is a Container),
or data (in which case it's an Atom). A record can't hold both.
</p>
<p>
PowerPoint documents don't have one overall container record. Instead,
there are a number of different container records to be found at
the top level.
</p>
<p>
Any numbers or strings stored in the records are always stored in
Little Endian format (least significant byte first). This is the case
no matter what platform the file was written on - be that a
Little Endian or a Big Endian system.
</p>
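<p>
As a rough illustration of the byte order (a sketch only, not HSLF's
own code; the class and method names here are made up for the example),
the following Java helper reads 16 and 32 bit little endian values
using java.nio.ByteBuffer.
</p>
<source><![CDATA[
```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper class, named for this example only.
final class LittleEndianDemo {
    /** Reads an unsigned 16 bit little endian value starting at offset. */
    static int readUShort(byte[] data, int offset) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        // Mask so that values above 0x7FFF do not come back negative.
        return buf.getShort(offset) & 0xFFFF;
    }

    /** Reads an unsigned 32 bit little endian value starting at offset. */
    static long readUInt(byte[] data, int offset) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        // Widen to long so that values above 0x7FFFFFFF stay positive.
        return buf.getInt(offset) & 0xFFFFFFFFL;
    }
}
```
]]></source>
<p>
With the two bytes 72 17 on disk, readUShort returns 6002. With the
four bytes 01 00 50 00, readUInt returns 5242881. Both values turn up
in the PersistPtrIncrementalBlock example later in this document.
</p>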
<p>
PowerPoint may have Escher (DDF) records embedded in it. These
are always held as the children of a PPDrawing record (record
type 1036). Escher records have the same format as PowerPoint
records.
</p>
</section>
<section><title>Record Headers</title>
<p>
All records, be they containers or atoms, have the same standard
8 byte header. It is:
</p>
<ul><li>1/2 byte (4 bit) container flag</li>
<li>1.5 byte (12 bit) option field</li>
<li>2 byte record type</li>
<li>4 byte record length</li></ul>
<p>
If the first byte of the header, BINARY_AND with 0x0f, is 0x0f,
then the record is a container. Otherwise, it's an atom. The rest
of the first two bytes are used to store the "options" for the
record. Most commonly, this is used to indicate the version of
the record, but the exact usage is record specific.
</p>
<p>
The record type is a little endian number, which tells you what
kind of record you're dealing with. Each different kind of record
has its own value that gets stored here. PowerPoint records have
a type that's normally less than 6000 (decimal). Escher records
normally have a type between 0xF000 and 0xF1FF.
</p>
<p>
The record length is another little endian number. For an atom,
it's the size of the data part of the record, i.e. the length
of the record <em>less</em> its 8 byte record header. For a
container, it's the size of all the records that are children of
this record. That means that the size of a container record is the
length, plus 8 bytes for its record header.
</p>
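<p>
The header layout described above could be read along the following
lines. This is a minimal sketch rather than HSLF's implementation:
the RecordHeader class, its field names and the decision to expose the
remaining 12 bits as a single "options" number are all assumptions
made for the example.
</p>
<source><![CDATA[
```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical value class for the standard 8 byte record header.
final class RecordHeader {
    static final int HEADER_SIZE = 8;

    final boolean container; // true if the low 4 bits of the first byte are 0x0f
    final int options;       // the remaining 12 bits of the first two bytes
    final int type;          // 2 byte record type
    final long length;       // 4 byte record length, excluding this header

    private RecordHeader(boolean container, int options, int type, long length) {
        this.container = container;
        this.options = options;
        this.type = type;
        this.length = length;
    }

    /** Parses the 8 byte header that starts at the given offset. */
    static RecordHeader parse(byte[] data, int offset) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        int first = buf.getShort(offset) & 0xFFFF;
        // The first byte on disk is the low byte of this little endian value.
        boolean container = (first & 0x0F) == 0x0F;
        int options = first >>> 4;
        int type = buf.getShort(offset + 2) & 0xFFFF;
        long length = buf.getInt(offset + 4) & 0xFFFFFFFFL;
        return new RecordHeader(container, options, type, length);
    }

    /** Size of the whole record: the length plus the 8 byte header. */
    long totalSize() {
        return HEADER_SIZE + length;
    }
}
```
]]></source>
<p>
For the header bytes 00 00 72 17 20 00 00 00, this gives an atom of
type 6002 with 32 bytes of data, so the next record at the same level
starts 40 bytes further on.
</p>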
</section>
<section><title>CurrentUserAtom, UserEditAtom and PersistPtrIncrementalBlock</title>
<p><strong>aka Records that care about the byte level position of other records</strong></p>
<p>
A small number of records contain byte level position offsets to other
records. If you change the position of any records in the file, then
there's a good chance that you will need to update some of these
special records.
</p>
<p>
First up, CurrentUserAtom. This is actually stored in a different
OLE2 (POIFS) stream to the main PowerPoint document. It contains
a few bits of information on who last edited the file. Most
importantly, at byte 8 of its contents, it stores (as a 32 bit
little endian number) the offset in the main stream to the most
recent UserEditAtom.
</p>
<p>
The UserEditAtom contains two byte level offsets (again as 32 bit
little endian numbers). At byte 12 of its contents is the offset to the
PersistPtrIncrementalBlock associated with this UserEditAtom
(each UserEditAtom has one and only one PersistPtrIncrementalBlock).
At byte 8 of its contents, there's the offset to the previous
UserEditAtom. If this is 0, then you're at the first one.
</p>
<p>
Every time you do a non full save in PowerPoint, it tacks on another
UserEditAtom and another PersistPtrIncrementalBlock. The
CurrentUserAtom is updated to point to this new UserEditAtom, and the
new UserEditAtom points back to the previous UserEditAtom. You then
end up with a chain, starting from the CurrentUserAtom, linking
back through all the UserEditAtoms, until you reach the first one
from a full save.
</p>
<source>
/-------------------------------\
| CurrentUserAtom (own stream)  |
|   OffsetToCurrentEdit = 10562 |==\
\-------------------------------/  |
                                   |
/==================================/
|                                        /-----------------------------------\
|                                        | PersistPtrIncrementalBlock @ 6144 |
|                                        \-----------------------------------/
|  /---------------------------------\                  |
|  | UserEditAtom @ 6176             |                  |
|  |   LastUserEditAtomOffset = 0    |                  |
|  |   PersistPointersOffset = 6144  |==================/
|  \---------------------------------/
|                 |                      /-----------------------------------\
|                 \====================\ | PersistPtrIncrementalBlock @ 8646 |
|                                      | \-----------------------------------/
|  /---------------------------------\ |                |
|  | UserEditAtom @ 8674             | |                |
|  |   LastUserEditAtomOffset = 6176 |=/                |
|  |   PersistPointersOffset = 8646  |==================/
|  \---------------------------------/
|                 |                      /------------------------------------\
|                 \====================\ | PersistPtrIncrementalBlock @ 10538 |
|                                      | \------------------------------------/
|  /---------------------------------\ |                |
\==| UserEditAtom @ 10562            | |                |
   |   LastUserEditAtomOffset = 8674 |=/                |
   |   PersistPointersOffset = 10538 |==================/
   \---------------------------------/
</source>
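<p>
Walking this chain might look something like the sketch below. It is
illustrative only: the UserEditChain class and its method names are
invented for the example, the byte positions are assumed to be counted
from the start of the record's data (just after the 8 byte header),
and the check that each step moves to a lower offset is a defensive
assumption to avoid looping forever on a damaged file.
</p>
<source><![CDATA[
```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, named for this example only.
final class UserEditChain {
    private static final int HEADER_SIZE = 8;

    /**
     * Follows the chain from the offset held in the CurrentUserAtom and
     * returns the offsets of all UserEditAtoms, most recent first.
     */
    static List<Integer> userEditOffsets(byte[] mainStream, int offsetToCurrentEdit) {
        ByteBuffer buf = ByteBuffer.wrap(mainStream).order(ByteOrder.LITTLE_ENDIAN);
        List<Integer> offsets = new ArrayList<>();
        int current = offsetToCurrentEdit;
        while (true) {
            offsets.add(current);
            // Byte 8 of the contents: offset of the previous UserEditAtom.
            int previous = buf.getInt(current + HEADER_SIZE + 8);
            if (previous == 0) {
                break; // reached the UserEditAtom from the first save
            }
            if (previous >= current) {
                // Assumption: older edits sit earlier in the stream.
                throw new IllegalStateException("UserEditAtom chain does not move backwards");
            }
            current = previous;
        }
        return offsets;
    }

    /** Byte 12 of the contents: offset of the matching PersistPtrIncrementalBlock. */
    static int persistPointersOffset(byte[] mainStream, int userEditOffset) {
        ByteBuffer buf = ByteBuffer.wrap(mainStream).order(ByteOrder.LITTLE_ENDIAN);
        return buf.getInt(userEditOffset + HEADER_SIZE + 12);
    }
}
```
]]></source>
<p>
For the file in the diagram, starting from 10562 would give the
offsets 10562, 8674 and 6176, in that order.
</p>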
<p>
The PersistPtrIncrementalBlock contains byte offsets to all the
Slides, Notes, Documents and MasterSlides in the file. The first
PersistPtrIncrementalBlock will point to all the ones that
were present the first time the file was saved. Subsequent
PersistPtrIncrementalBlocks will contain pointers to all the ones
that were changed in that edit. To find the offset to a given
sheet in the latest version, start with the most recent
PersistPtrIncrementalBlock. If this knows about the sheet, use the
offset it has. If it doesn't, then work back through older
PersistPtrIncrementalBlocks until you find one which does, and
use that.
</p>
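<p>
The "newest block wins" lookup could be sketched as follows. The
example assumes each PersistPtrIncrementalBlock has already been
decoded into a map from sheet number to offset, and that the maps are
supplied most recent first. The SheetOffsetLookup class is
hypothetical and exists only for this illustration.
</p>
<source><![CDATA[
```java
import java.util.List;
import java.util.Map;
import java.util.OptionalInt;

// Hypothetical helper, named for this example only.
final class SheetOffsetLookup {
    /**
     * Returns the offset of the given sheet in the latest version of the
     * file, or an empty result if no block knows about the sheet.
     */
    static OptionalInt latestOffset(List<Map<Integer, Integer>> blocksNewestFirst, int sheetNumber) {
        for (Map<Integer, Integer> block : blocksNewestFirst) {
            Integer offset = block.get(sheetNumber);
            if (offset != null) {
                // The first (most recent) block that mentions the sheet wins.
                return OptionalInt.of(offset);
            }
        }
        return OptionalInt.empty();
    }
}
```
]]></source>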
<p>
Each PersistPtrIncrementalBlock can contain a number of entry
blocks. Each block holds information on a sequence of sheets.
Each block starts with a 32 bit little endian integer. Once read
into memory, the lower 20 bits contain the starting number for the
sequence of sheets to be described. The higher 12 bits contain
the count of the number of sheets described. Following that is
one 32 bit little endian integer for each sheet in the sequence,
the value being the offset to that sheet. If there is any data
left after parsing a block, then it corresponds to the next block.
</p>
<source>
hex on disk    decimal    description
-----------    -------    -----------
0000           0          No options
7217           6002       Record type is 6002
2000 0000      32         Length of data is 32 bytes
0100 5000      5242881    Count is 5 (12 highest bits)
                          Starting number is 1 (20 lowest bits)
0000 0000      0          Sheet (1+0)=1 starts at offset 0
900D 0000      3472       Sheet (1+1)=2 starts at offset 3472
E403 0000      996        Sheet (1+2)=3 starts at offset 996
9213 0000      5010       Sheet (1+3)=4 starts at offset 5010
BE15 0000      5566       Sheet (1+4)=5 starts at offset 5566
0900 1000      1048585    Count is 1 (12 highest bits)
                          Starting number is 9 (20 lowest bits)
4418 0000      6212       Sheet (9+0)=9 starts at offset 6212
</source>
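<p>
The entry block layout described above can be decoded along these
lines. Treat it as a sketch under the assumptions stated in the text
(20 low bits for the starting number, 12 high bits for the count).
The PersistPtrEntries class is made up for the example, and it expects
only the data part of the record, without the 8 byte header.
</p>
<source><![CDATA[
```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, named for this example only.
final class PersistPtrEntries {
    /**
     * Decodes the data part of a PersistPtrIncrementalBlock into a map
     * from sheet number to byte offset, in the order found on disk.
     */
    static Map<Integer, Integer> decode(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        Map<Integer, Integer> offsets = new LinkedHashMap<>();
        while (buf.remaining() >= 4) {
            int info = buf.getInt();
            int start = info & 0xFFFFF; // lower 20 bits: starting sheet number
            int count = info >>> 20;    // higher 12 bits: number of sheets
            for (int i = 0; i < count; i++) {
                // One 32 bit offset per sheet; a truncated block makes
                // getInt() throw BufferUnderflowException.
                offsets.put(start + i, buf.getInt());
            }
        }
        return offsets;
    }
}
```
]]></source>
<p>
Fed the 32 data bytes from the example above, this returns offsets
for sheets 1 to 5 and for sheet 9.
</p>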
</section>
<section><title>Paragraph and Text Styling</title>
<p>
There are quite a number of records that affect the styling
of text, and a smaller number that are responsible for the
styling of paragraphs.
</p>
<p>
By default, a given set of text will inherit paragraph and text
stylings from the appropriate master sheet. If anything differs
from the master sheet, then appropriate styling records will
follow the text record.
</p>
<p>
<em>(We don't currently know enough about master sheet styling
to write about it)</em>
</p>
<p>
Normally, PowerPoint will have one text record (TextBytesAtom
or TextCharsAtom) for every paragraph, with a preceding
TextHeaderAtom to describe what sort of paragraph it is.
If any of the stylings differ from the master's, then a
StyleTextPropAtom will follow the text record. This contains
the paragraph style information, and the styling information
for each section of the text which has a different style.
(More on StyleTextPropAtom later)
</p>
<p>
For every font used, a FontEntityAtom must exist for that font.
The FontEntityAtoms live inside a FontCollection record, and
there's one of those inside the Environment record inside the
Document record. <em>(More on Fonts to be discovered)</em>
</p>
</section>
<section><title>StyleTextPropAtom</title>
<p>
If the text or paragraph stylings for a given text record
differ from those of the appropriate master, then there will
be one of these records.
</p>
<p>
Firstly, this contains the number of characters it applies to,
stored in a 2 byte little endian number.
Normally, this will be the same as the number of characters
in the text record. Then there are two values which encode
paragraph properties (alignment, text spacing etc), both 4
byte little endian numbers.
</p>
<p>
Following this is one block of information for each subsequent
bit of text with a different styling. (If your text was
10 characters in blue, then 10 in red, you would have two blocks).
First comes the number of characters it applies to, or 0 if it
applies to all remaining text. (This is a 2 byte little endian
number). Then there is a number (4 byte little endian) that
encodes if the text is bold/italic/underlined. If that number
was non zero, it is followed by another 4 byte number, that
encodes further text styling information. If it was zero,
then it's followed by a 2 byte number.
</p>
<p>
In the character styling block, the first number after the
character count indicates the bold/italic/underlined status
of the text. If you binary AND it with 0x00010000 (65536) and
get that value back, it is in bold. If you binary AND it with
0x00020000 (131072) and get that value back, it is in italic.
If you binary AND it with 0x00040000 (262144) and get that
value back, it is underlined.
</p>
<source>
hex on disk    decimal    description
-----------    -------    -----------
0000           0          No options
A10F           4001       Record type is 4001
2E00 0000      46         Length of data is 46 bytes
5300           83         The paragraph stylings apply to 83 characters
0000 0000      0          Paragraph stylings 1 - as per the master
0000 0000      0          Paragraph stylings 2 - as per the master
1E00           30         These character properties apply to 30 characters
0000 0100      65536      Bold
0000 0100      65536      ??
1C00           28         These character properties apply to 28 characters
0000 0200      131072     Italic
0400 0200      131076     ??
0000           0          These character properties apply to the remaining characters
0005 1900      1639680    Bold
0000 0000      0          ??
0400           4          ??
FF33           13311      ??
00FE           65024      ??
</source>
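<p>
The bold/italic/underlined tests described above come down to three
bit masks. The sketch below covers only those three flags, since the
meaning of the other bits isn't documented here. The CharStyleFlags
class and its method names are invented for the example.
</p>
<source><![CDATA[
```java
// Hypothetical helper, named for this example only.
final class CharStyleFlags {
    static final int BOLD = 0x00010000;      // 65536
    static final int ITALIC = 0x00020000;    // 131072
    static final int UNDERLINE = 0x00040000; // 262144

    /** True if ANDing with the bold mask gives the mask back. */
    static boolean isBold(int flags) {
        return (flags & BOLD) == BOLD;
    }

    /** True if ANDing with the italic mask gives the mask back. */
    static boolean isItalic(int flags) {
        return (flags & ITALIC) == ITALIC;
    }

    /** True if ANDing with the underline mask gives the mask back. */
    static boolean isUnderlined(int flags) {
        return (flags & UNDERLINE) == UNDERLINE;
    }
}
```
]]></source>
<p>
Applied to the example, 65536 and 1639680 both test as bold only,
and 131072 tests as italic only.
</p>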
</section>
</body>
</document>