poi/src/documentation/content/xdocs/hslf/ppt-file-format.xml

<?xml version="1.0" encoding="UTF-8"?>
<!-- Copyright (C) 2004 The Apache Software Foundation. All rights reserved. -->
<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V1.1//EN" "../dtd/document-v11.dtd">

<document>
    <header>
        <title>POI-HSLF - A Guide to the PowerPoint File Format</title>
        <subtitle>Overview</subtitle>
        <authors>
            <person name="Nick Burch" email="nick at torchbox dot com"/>
        </authors>
    </header>

    <body>
        <section><title>Records, Containers and Atoms</title>
		<p>
		PowerPoint documents are made up of a tree of records. A record may
		contain either other records (in which case it is a Container),
		or data (in which case it's an Atom). A record can't hold both.
		</p>
		<p>
		PowerPoint documents don't have one overall container record. Instead,
		there are a number of different container records to be found at
		the top level.
		</p>
		<p>
		Any numbers or strings stored in the records are always stored in
		Little Endian format (least significant byte first). This is the case
		no matter what platform the file was written on, be that a
		Little Endian or a Big Endian system.
		</p>
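		<p>
		As a rough illustration of the byte order described above, the
		following Java sketch decodes little endian values both by hand and
		via java.nio. The class and method names (LittleEndianSketch,
		readUShort, readUInt) are made up for this example and are not
		part of POI.
		</p>
<source><![CDATA[
```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class LittleEndianSketch {
    /** Reads an unsigned 16 bit little endian value starting at offset. */
    static int readUShort(byte[] data, int offset) {
        return (data[offset] & 0xFF) | ((data[offset + 1] & 0xFF) << 8);
    }

    /** Reads a 32 bit little endian value, widened to long so it stays unsigned. */
    static long readUInt(byte[] data, int offset) {
        return ByteBuffer.wrap(data, offset, 4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getInt() & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        // The bytes 72 17 on disk are the number 0x1772, the bytes 20 00 00 00 are 0x20
        byte[] onDisk = { 0x72, 0x17, 0x20, 0x00, 0x00, 0x00 };
        System.out.println(readUShort(onDisk, 0)); // prints 6002
        System.out.println(readUInt(onDisk, 2));   // prints 32
    }
}
```
]]></source>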
		<p>
		PowerPoint files may have Escher (DDF) records embedded in them. These
		are always held as the children of a PPDrawing record (record
		type 1036). Escher records have the same format as PowerPoint
		records.
		</p>
		</section>
		
		<section><title>Record Headers</title>
		<p>
		All records, be they containers or atoms, have the same standard
		8 byte header. It consists of:
		</p>
		<ul><li>4 bit (1/2 byte) container flag</li>
		<li>12 bit (1.5 byte) option field</li>
		<li>2 byte record type</li>
		<li>4 byte record length</li></ul>
		<p>
		If the first byte of the header, binary ANDed with 0x0f, is 0x0f,
		then the record is a container. Otherwise, it's an atom. The rest
		of the first two bytes are used to store the "options" for the
		record. Most commonly, this is used to indicate the version of
		the record, but the exact usage is record specific.
		</p>
		<p>
		The record type is a little endian number, which tells you what
		kind of record you're dealing with. Each different kind of record
		has its own value that gets stored here. PowerPoint records have
		a type that's normally less than 6000 (decimal). Escher records
		normally have a type between 0xF000 and 0xF1FF.
		</p>
		<p>
		The record length is another little endian number. For an atom,
		it's the size of the data part of the record, i.e. the length
		of the record <em>less</em> its 8 byte record header. For a
		container, it's the combined size of all the records that are
		children of this record, headers included. That means that the size
		of a container record is the length, plus 8 bytes for its record header.
		</p>
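		<p>
		The header layout above could be decoded along the following lines.
		This is only a sketch: the class RecordHeaderSketch and its field
		names are invented for illustration, and it treats the "options" as
		the 12 bits left over once the 4 bit container flag is removed.
		</p>
<source><![CDATA[
```java
final class RecordHeaderSketch {
    final boolean container; // true when the low 4 bits of the first byte are 0x0f
    final int options;       // the remaining 12 bits of the first two bytes
    final int type;          // 2 byte record type
    final long length;       // 4 byte record length, kept unsigned

    RecordHeaderSketch(byte[] data, int offset) {
        int firstTwo = readUShort(data, offset);
        container = (firstTwo & 0x0F) == 0x0F;
        options = firstTwo >>> 4;
        type = readUShort(data, offset + 2);
        length = readUShort(data, offset + 4)
                | ((long) readUShort(data, offset + 6) << 16);
    }

    /** Size of the whole record on disk: the stored length plus the 8 byte header. */
    long totalSize() {
        return length + 8;
    }

    private static int readUShort(byte[] d, int off) {
        return (d[off] & 0xFF) | ((d[off + 1] & 0xFF) << 8);
    }

    public static void main(String[] args) {
        // Header bytes of an atom with type 6002 and 32 bytes of data
        byte[] atom = { 0x00, 0x00, 0x72, 0x17, 0x20, 0x00, 0x00, 0x00 };
        RecordHeaderSketch h = new RecordHeaderSketch(atom, 0);
        System.out.println(h.container + " " + h.type + " " + h.length + " " + h.totalSize());
        // prints "false 6002 32 40"
    }
}
```
]]></source>
		<p>
		Because totalSize() is the same calculation for atoms and
		containers, a reader can skip over any record it doesn't
		understand by advancing that many bytes from the start of its header.
		</p>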
		</section>

		<section><title>CurrentUserAtom, UserEditAtom and PersistPtrIncrementalBlock</title>
		<p><strong>aka Records that care about the byte level position of other records</strong></p>
		<p>
		A small number of records contain byte level position offsets to other
		records. If you change the position of any records in the file, then
		there's a good chance that you will need to update some of these
		special records.
		</p>
		<p>
		First up, CurrentUserAtom. This is actually stored in a different
		OLE2 (POIFS) stream to the main PowerPoint document. It contains
		a few bits of information on who last edited the file. Most
		importantly, at byte 8 of its contents, it stores (as a 32 bit
		little endian number) the offset in the main stream to the most
		recent UserEditAtom.
		</p>
		<p>
		The UserEditAtom contains two byte level offsets (again as 32 bit
		little endian numbers). At byte 12 is the offset to the 
		PersistPtrIncrementalBlock associated with this UserEditAtom
		(each UserEditAtom has one and only one PersistPtrIncrementalBlock).
		At byte 8, there's the offset to the previous UserEditAtom. If this
		is 0, then you're at the first one.
		</p>
		<p>
		Every time you do a non full save in PowerPoint, it tacks on another
		UserEditAtom and another PersistPtrIncrementalBlock. The 
		CurrentUserAtom is updated to point to this new UserEditAtom, and the
		new UserEditAtom points back to the previous UserEditAtom. You then
		end up with a chain, starting from the CurrentUserAtom, linking
		back through all the UserEditAtoms, until you reach the first one
		from a full save.
		</p>
<source>
/-------------------------------\
| CurrentUserAtom (own stream)  |
|   OffsetToCurrentEdit = 10562 |==\
\-------------------------------/  |
                                   |
/==================================/
|                                         /-----------------------------------\
|                                         | PersistPtrIncrementalBlock @ 6144 |
|                                         \-----------------------------------/
|  /---------------------------------\                  |
|  | UserEditAtom @ 6176             |                  |
|  |   LastUserEditAtomOffset = 0    |                  |
|  |   PersistPointersOffset =  6144 |==================/
|  \---------------------------------/
|                 |                       /-----------------------------------\
|                 \====================\  | PersistPtrIncrementalBlock @ 8646 |
|                                      |  \-----------------------------------/
|  /---------------------------------\ |                |
|  | UserEditAtom @ 8674             | |                |
|  |   LastUserEditAtomOffset = 6176 |=/                |
|  |   PersistPointersOffset =  8646 |==================/
|  \---------------------------------/
|                 |                       /------------------------------------\
|                 \====================\  | PersistPtrIncrementalBlock @ 10538 |
|                                      |  \------------------------------------/
|  /---------------------------------\ |                |
\==| UserEditAtom @ 10562            | |                |
   |   LastUserEditAtomOffset = 8674 |=/                |
   |   PersistPointersOffset = 10538 |==================/
   \---------------------------------/
</source>
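		<p>
		Walking the chain shown above might look roughly like the sketch
		below. It assumes that "byte 8" and "byte 12" are counted from the
		start of the UserEditAtom's data, i.e. after its 8 byte record
		header, and that the whole main stream is already in memory. The
		class EditChainSketch and its methods are hypothetical, and the
		check that the chain always moves backwards is a defensive
		addition rather than something the format is known to require.
		</p>
<source><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;

final class EditChainSketch {
    static final int HEADER_SIZE = 8;

    /** Reads a 32 bit little endian integer. */
    static int readInt(byte[] d, int off) {
        return (d[off] & 0xFF) | ((d[off + 1] & 0xFF) << 8)
             | ((d[off + 2] & 0xFF) << 16) | ((d[off + 3] & 0xFF) << 24);
    }

    /**
     * Follows the UserEditAtom chain from the most recent edit back to the
     * first one, collecting each PersistPtrIncrementalBlock offset on the way.
     * The returned list is ordered newest edit first.
     */
    static List<Integer> persistBlockOffsets(byte[] mainStream, int currentEditOffset) {
        List<Integer> newestFirst = new ArrayList<>();
        int editOffset = currentEditOffset;
        while (true) {
            // Assumption: the documented byte positions count from the data start
            int data = editOffset + HEADER_SIZE;
            newestFirst.add(readInt(mainStream, data + 12)); // PersistPointersOffset
            int previous = readInt(mainStream, data + 8);    // LastUserEditAtomOffset
            if (previous == 0) {
                return newestFirst; // reached the first UserEditAtom
            }
            if (previous >= editOffset) {
                throw new IllegalStateException("UserEditAtom chain does not move backwards");
            }
            editOffset = previous;
        }
    }
}
```
]]></source>
		<p>
		Under those assumptions, starting from offset 10562 in the layout
		drawn above would be expected to yield 10538, 8646 and 6144, in
		that order.
		</p>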
		<p>
		The PersistPtrIncrementalBlock contains byte offsets to all the
		Slides, Notes, Documents and MasterSlides in the file. The first
		PersistPtrIncrementalBlock will point to all the ones that
		were present the first time the file was saved. Subsequent 
		PersistPtrIncrementalBlocks will contain pointers to all the ones
		that were changed in that edit. To find the offset to a given
		sheet in the latest version, start with the most recent
		PersistPtrIncrementalBlock. If this knows about the sheet, use the
		offset it has. If it doesn't, then work back through older
		PersistPtrIncrementalBlocks until you find one which does, and
		use that.
		</p>
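		<p>
		The lookup described above can be sketched as a simple search
		through the blocks, newest first. Here each
		PersistPtrIncrementalBlock is assumed to have already been decoded
		into a map from sheet number to offset; the class
		SheetLookupSketch, the method offsetOf and the -1 "not found"
		value are all choices made for this example.
		</p>
<source><![CDATA[
```java
import java.util.List;
import java.util.Map;

final class SheetLookupSketch {
    /**
     * Returns the offset of the given sheet in the latest version of the file.
     *
     * @param blocksNewestFirst one map per PersistPtrIncrementalBlock,
     *                          keyed by sheet number, most recent block first
     * @return the offset from the newest block that knows about the sheet,
     *         or -1 if no block mentions it
     */
    static long offsetOf(int sheet, List<Map<Integer, Long>> blocksNewestFirst) {
        for (Map<Integer, Long> block : blocksNewestFirst) {
            Long offset = block.get(sheet);
            if (offset != null) {
                return offset; // the most recent edit wins
            }
        }
        return -1;
    }
}
```
]]></source>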
		<p>
		Each PersistPtrIncrementalBlock can contain a number of entry
		blocks. Each block holds information on a sequence of sheets.
		Each block starts with a 32 bit little endian integer. Once read
		into memory, the lower 20 bits contain the starting number for the
		sequence of sheets to be described. The higher 12 bits contain
		the count of the number of sheets described. Following that is
		one 32 bit little endian integer for each sheet in the sequence, 
		the value being the offset to that sheet. If there is any data
		left after parsing a block, then it corresponds to the next block.
		</p>
<source>
hex on disk      decimal        description
-----------      -------        -----------
0000             0              No options
7217             6002           Record type is 6002
2000 0000        32             Length of data is 32 bytes
0100 5000        5242881        Count is 5 (12 highest bits)
                                Starting number is 1 (20 lowest bits)
0000 0000        0              Sheet (1+0)=1 starts at offset 0
900D 0000        3472           Sheet (1+1)=2 starts at offset 3472
E403 0000        996            Sheet (1+2)=3 starts at offset 996
9213 0000        5010           Sheet (1+3)=4 starts at offset 5010
BE15 0000        5566           Sheet (1+4)=5 starts at offset 5566
0900 1000        1048585        Count is 1 (12 highest bits)
                                Starting number is 9 (20 lowest bits)
4418 0000        6212           Sheet (9+0)=9 starts at offset 6212
</source>
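		<p>
		A decoder for the entry blocks shown above might look something like
		this. It is a sketch only: it expects just the data part of the
		record (the 8 byte header already removed), and the class
		PersistPtrSketch and its parse method are invented names rather
		than POI API.
		</p>
<source><![CDATA[
```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PersistPtrSketch {
    /** Reads a 32 bit little endian value, widened to long so it stays unsigned. */
    static long readUInt(byte[] d, int off) {
        return (d[off] & 0xFFL) | ((d[off + 1] & 0xFFL) << 8)
             | ((d[off + 2] & 0xFFL) << 16) | ((d[off + 3] & 0xFFL) << 24);
    }

    /** Maps each sheet number found in the record data to its byte offset. */
    static Map<Integer, Long> parse(byte[] data) {
        Map<Integer, Long> offsets = new LinkedHashMap<>();
        int pos = 0;
        while (pos + 4 <= data.length) {
            long info = readUInt(data, pos);
            pos += 4;
            int start = (int) (info & 0xFFFFF); // lower 20 bits: first sheet number
            int count = (int) (info >>> 20);    // upper 12 bits: sheets in this block
            if (pos + 4L * count > data.length) {
                throw new IllegalArgumentException("entry block runs past the end of the data");
            }
            for (int i = 0; i < count; i++) {
                offsets.put(start + i, readUInt(data, pos));
                pos += 4;
            }
        }
        return offsets;
    }
}
```
]]></source>
		<p>
		Fed the 32 data bytes from the dump above, this should produce
		entries for sheets 1 to 5 and for sheet 9.
		</p>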
		</section>

		<section><title>Paragraph and Text Styling</title>
		<p>
			There are quite a number of records that affect the styling
			of text, and a smaller number that are responsible for the
			styling of paragraphs.
		</p>
		<p>
			By default, a given set of text will inherit paragraph and text
			stylings from the appropriate master sheet. If anything differs
			from the master sheet, then appropriate styling records will
			follow the text record.
		</p>
		<p>
			<em>(We don't currently know enough about master sheet styling
			to write about it)</em>
		</p>
		<p>
			Normally, PowerPoint will have one text record (TextBytesAtom
			or TextCharsAtom) for every paragraph, with a preceding
			TextHeaderAtom to describe what sort of paragraph it is.
			If any of the stylings differ from the master's, then a 
			StyleTextPropAtom will follow the text record. This contains
			the paragraph style information, and the styling information
			for each section of the text which has a different style.
			(More on StyleTextPropAtom later)
		</p>
		<p>
			For every font used, a FontEntityAtom must exist for that font.
			The FontEntityAtoms live inside a FontCollection record, and 
			there's one of those inside the Environment record inside the
			Document record. <em>(More on Fonts to be discovered)</em>
		</p>
		</section>

		<section><title>StyleTextPropAtom</title>
		<p>
			If the text or paragraph stylings for a given text record
			differ from those of the appropriate master, then there will
			be one of these records.
		</p>
		<p>
			Firstly, this contains the number of characters it applies to,
			stored in a 2 byte little endian number.
			Normally, this will be the same as the number of characters
			in the text record. Then there are two values which encode
			paragraph properties (alignment, text spacing etc), both 4
			byte little endian numbers.
		</p>
		<p>
			Following this is one block of information for each subsequent
			bit of text with a different styling. (If your text was
			10 characters in blue, then 10 in red, you would have two blocks).
			First comes the number of characters it applies to, or 0 if it
			applies to all remaining text. (This is a 2 byte little endian
			number). Then there is a number (4 byte little endian) that
			encodes whether the text is bold/italic/underlined. If that number
			is non-zero, it is followed by another 4 byte number that
			encodes further text styling information. If it is zero,
			then it's followed by a 2 byte number.
		</p>
		<p>
			In the character styling block, the first number after the
			character count indicates the bold/italic/underlined status
			of the text. If you binary AND it with 0x00010000 (65536) and
			get that value back, it is in bold. If you binary AND it with
			0x00020000 (131072) and get that value back, it is in italic.
			If you binary AND it with 0x00040000 (262144) and get that
			value back, it is underlined.
		</p>
<source>
hex on disk      decimal        description
-----------      -------        -----------
0000             0              No options
A10F             4001           Record type is 4001
2E00 0000        46             Length of data is 46 bytes
5300             83             The paragraph stylings apply to 83 characters
0000 0000        0              Paragraph stylings 1 - as per the master
0000 0000        0              Paragraph stylings 2 - as per the master

1E00             30             These character properties apply to 30 characters
0000 0100        65536          Bold
0000 0100        65536          ??
1C00             28             These character properties apply to 28 characters
0000 0200        131072         Italic
0400 0200        131076         ??
0000             0              These character properties apply to the remaining characters
0005 1900        1639680        Bold
0000 0000        0              ??

0400             4              ??
FF33             13311          ??
00FE             65024          ??
</source>
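		<p>
			The bold/italic/underlined tests described above amount to
			simple bit mask checks. The sketch below uses the three mask
			values given in the text; the class CharFlagsSketch and its
			method names are made up for the example, and it says nothing
			about the bits whose meaning isn't yet known.
		</p>
<source><![CDATA[
```java
final class CharFlagsSketch {
    static final int BOLD = 0x00010000;
    static final int ITALIC = 0x00020000;
    static final int UNDERLINED = 0x00040000;

    /** True if ANDing with the bold mask gives the mask back. */
    static boolean isBold(int flags) {
        return (flags & BOLD) == BOLD;
    }

    static boolean isItalic(int flags) {
        return (flags & ITALIC) == ITALIC;
    }

    static boolean isUnderlined(int flags) {
        return (flags & UNDERLINED) == UNDERLINED;
    }

    /** Lists the known styles set in flags, e.g. "bold italic". */
    static String describe(int flags) {
        StringBuilder sb = new StringBuilder();
        if (isBold(flags)) sb.append("bold ");
        if (isItalic(flags)) sb.append("italic ");
        if (isUnderlined(flags)) sb.append("underlined ");
        return sb.toString().trim();
    }
}
```
]]></source>
		<p>
			Applied to the three values in the dump above (65536, 131072
			and 1639680), these checks report bold, italic and bold
			respectively.
		</p>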
		</section>
	</body>
</document>