diff --git a/src/documentation/content/xdocs/hslf/book.xml b/src/documentation/content/xdocs/hslf/book.xml
index cc92cdb1c..a0a827b0a 100644
--- a/src/documentation/content/xdocs/hslf/book.xml
+++ b/src/documentation/content/xdocs/hslf/book.xml
@@ -13,6 +13,7 @@
+
diff --git a/src/documentation/content/xdocs/hslf/ppt-file-format.xml b/src/documentation/content/xdocs/hslf/ppt-file-format.xml
new file mode 100644
index 000000000..ede2eee96
--- /dev/null
+++ b/src/documentation/content/xdocs/hslf/ppt-file-format.xml
@@ -0,0 +1,181 @@
+
+
+
+
+
+
+ POI-HSLF - A Guide to the PowerPoint File Format + Overview + + + +
+ + +
Records, Containers and Atoms +

+ PowerPoint documents are made up of a tree of records. A record may + contain either other records (in which case it is a Container), + or data (in which case it's an Atom). A record can't hold both. +

+

+ PowerPoint documents don't have one overall container record. Instead, + there are a number of different container records to be found at + the top level. +

+

+ Any numbers or strings stored in the records are always stored in + Little Endian format (least significant byte first). This is the case + no matter what platform the file was written on - be that a + Little Endian or a Big Endian system. +

+

+ PowerPoint may have Escher (DDF) records embedded in it. These + are always held as the children of a PPDrawing record (record + type 1036). Escher records have the same format as PowerPoint + records. +

+
+ +
Record Headers +

+ All records, be they containers or atoms, have the same standard + 8 byte header. It is: +

+ +

+ If the first byte of the header, bitwise ANDed with 0x0f, is 0x0f, + then the record is a container. Otherwise, it's an atom. The rest + of the first two bytes are used to store the "options" for the + record. Most commonly, this is used to indicate the version of + the record, but the exact usage is record-specific. +

+

+ The record type is a little endian number, which tells you what + kind of record you're dealing with. Each different kind of record + has its own value that gets stored here. PowerPoint records have + a type that's normally less than 6000 (decimal). Escher records + normally have a type between 0xF000 and 0xF1FF. +

+

+ The record length is another little endian number. For an atom, + it's the size of the data part of the record, i.e. the length + of the record less its 8 byte record header. For a + container, it's the size of all the records that are children of + this record. That means that the size of a container record is the + length, plus 8 bytes for its record header. +
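The header rules above can be sketched in a few lines of Java. This is only an illustration: RecordHeader is a hypothetical helper class, not part of POI, and it assumes the 8 bytes are laid out as two bytes of options, two bytes of record type, then four bytes of length.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper (not a POI class): decodes the standard 8 byte header.
// Assumed layout: 2 bytes options, 2 bytes record type, 4 bytes length.
final class RecordHeader {
    final int options;  // first two bytes, read as a little endian number
    final int type;     // record type, e.g. 1036 for PPDrawing
    final long length;  // length field, treated as an unsigned 32 bit value

    RecordHeader(int options, int type, long length) {
        this.options = options;
        this.type = type;
        this.length = length;
    }

    static RecordHeader parse(byte[] data, int offset) {
        ByteBuffer buf = ByteBuffer.wrap(data, offset, 8).order(ByteOrder.LITTLE_ENDIAN);
        int options = buf.getShort() & 0xFFFF;
        int type = buf.getShort() & 0xFFFF;
        long length = buf.getInt() & 0xFFFFFFFFL;
        return new RecordHeader(options, type, length);
    }

    // The low nibble of the first byte is 0x0f for containers.
    boolean isContainer() {
        return (options & 0x0F) == 0x0F;
    }

    // On-disk size of the whole record: the length field plus the header.
    long totalSize() {
        return length + 8;
    }
}
```

Because the options are read as a little endian number, the first byte of the header ends up in the low 8 bits, so masking with 0x0F tests exactly the nibble described above.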

+
+ +
CurrentUserAtom, UserEditAtom and PersistPtrIncrementalBlock +

aka Records that care about the byte level position of other records

+

+ A small number of records contain byte level position offsets to other + records. If you change the position of any records in the file, then + there's a good chance that you will need to update some of these + special records. +

+

+ First up, CurrentUserAtom. This is actually stored in a different + OLE2 (POIFS) stream to the main PowerPoint document. It contains + a few bits of information on who last edited the file. Most + importantly, at byte 8 of its contents, it stores (as a 32 bit + little endian number) the offset in the main stream to the most + recent UserEditAtom. +

+

+ The UserEditAtom contains two byte level offsets (again as 32 bit + little endian numbers). At byte 12 is the offset to the + PersistPtrIncrementalBlock associated with this UserEditAtom + (each UserEditAtom has one and only one PersistPtrIncrementalBlock). + At byte 8, there's the offset to the previous UserEditAtom. If this + is 0, then you're at the first one. +

+

+ Every time you do a non full save in PowerPoint, it tacks on another + UserEditAtom and another PersistPtrIncrementalBlock. The + CurrentUserAtom is updated to point to this new UserEditAtom, and the + new UserEditAtom points back to the previous UserEditAtom. You then + end up with a chain, starting from the CurrentUserAtom, linking + back through all the UserEditAtoms, until you reach the first one + from a full save. +

+
+/-------------------------------\
+| CurrentUserAtom (own stream)  |
+|   OffsetToCurrentEdit = 10562 |==\
+\-------------------------------/  |
+                                   |
+/==================================/
+|                                        /-----------------------------------\
+|                                        | PersistPtrIncrementalBlock @ 6144 |
+|                                        \-----------------------------------/
+|  /---------------------------------\                  |
+|  | UserEditAtom @ 6176             |                  |
+|  |   LastUserEditAtomOffset = 0    |                  |
+|  |   PersistPointersOffset = 6144  |==================/
+|  \---------------------------------/
+|                 |                      /-----------------------------------\
+|                 \====================\ | PersistPtrIncrementalBlock @ 8646 |
+|                                      | \-----------------------------------/
+|  /---------------------------------\ |                |
+|  | UserEditAtom @ 8674             | |                |
+|  |   LastUserEditAtomOffset = 6176 |=/                |
+|  |   PersistPointersOffset = 8646  |==================/
+|  \---------------------------------/
+|                 |                      /------------------------------------\
+|                 \====================\ | PersistPtrIncrementalBlock @ 10538 |
+|                                      | \------------------------------------/
+|  /---------------------------------\ |                |
+\==| UserEditAtom @ 10562            | |                |
+   |   LastUserEditAtomOffset = 8674 |=/                |
+   |   PersistPointersOffset = 10538 |==================/
+   \---------------------------------/
+
+
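Walking this chain could look roughly like the sketch below. EditChain is a hypothetical class, not POI code. It assumes the byte positions 8 and 12 are counted from the start of the UserEditAtom's contents (after its 8 byte header), as is stated for CurrentUserAtom, and that all offsets fit in a Java int.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not a POI class): follows the UserEditAtom chain.
final class EditChain {
    private static final int HEADER_SIZE = 8;

    // Returns the offsets of the PersistPtrIncrementalBlocks, newest first.
    // currentEditOffset is the value read from the CurrentUserAtom.
    static List<Integer> persistBlockOffsets(byte[] mainStream, int currentEditOffset) {
        ByteBuffer buf = ByteBuffer.wrap(mainStream).order(ByteOrder.LITTLE_ENDIAN);
        List<Integer> result = new ArrayList<>();
        Set<Integer> seen = new HashSet<>();
        int editOffset = currentEditOffset;
        // An offset of 0 means there is no earlier UserEditAtom.
        while (editOffset != 0) {
            if (!seen.add(editOffset)) {
                throw new IllegalStateException("UserEditAtom chain loops at " + editOffset);
            }
            // Assumption: both positions are relative to the atom's contents.
            int previousEdit = buf.getInt(editOffset + HEADER_SIZE + 8);
            int persistBlock = buf.getInt(editOffset + HEADER_SIZE + 12);
            result.add(persistBlock);
            editOffset = previousEdit;
        }
        return result;
    }
}
```

The visited set is only a guard against corrupt files whose offsets point in a circle; a well formed file never triggers it.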

+ The PersistPtrIncrementalBlock contains byte offsets to all the + Slides, Notes, Documents and MasterSlides in the file. The first + PersistPtrIncrementalBlock will point to all the ones that + were present the first time the file was saved. Subsequent + PersistPtrIncrementalBlocks will contain pointers to all the ones + that were changed in that edit. To find the offset to a given + sheet in the latest version, start with the most recent + PersistPtrIncrementalBlock. If this knows about the sheet, use the + offset it has. If it doesn't, then work back through older + PersistPtrIncrementalBlocks until you find one which does, and + use that. +
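As a rough illustration of that lookup, the sketch below assumes each PersistPtrIncrementalBlock has already been decoded into a map from sheet number to offset, and that the maps are supplied newest first. SheetOffsetLookup and its method are hypothetical names, not POI code.

```java
import java.util.List;
import java.util.Map;
import java.util.OptionalInt;

// Hypothetical helper (not a POI class): newest-first lookup of a sheet offset.
final class SheetOffsetLookup {
    // blocksNewestFirst holds one map per PersistPtrIncrementalBlock,
    // most recent edit first, each mapping sheet number to byte offset.
    static OptionalInt latestOffset(List<Map<Integer, Integer>> blocksNewestFirst, int sheetNumber) {
        for (Map<Integer, Integer> block : blocksNewestFirst) {
            Integer offset = block.get(sheetNumber);
            if (offset != null) {
                // The first block that knows about the sheet wins.
                return OptionalInt.of(offset);
            }
        }
        // No block mentions this sheet at all.
        return OptionalInt.empty();
    }
}
```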

+

+ Each PersistPtrIncrementalBlock can contain a number of entry + blocks. Each block holds information on a sequence of sheets. + Each block starts with a 32 bit little endian integer. Once read + into memory, the lower 20 bits contain the starting number for the + sequence of sheets to be described. The higher 12 bits contain + the count of the number of sheets described. Following that is + one 32 bit little endian integer for each sheet in the sequence, + the value being the offset to that sheet. If there is any data + left after parsing a block, then it corresponds to the next block. +

+
+hex on disk    decimal    description
+-----------    -------    -----------
+0000           0          No options
+7217           6002       Record type is 6002
+2000 0000      32         Length of data is 32 bytes
+0100 5000      5242881    Count is 5 (12 highest bits)
+                          Starting number is 1 (20 lowest bits)
+0000 0000      0          Sheet (1+0)=1 starts at offset 0
+900D 0000      3472       Sheet (1+1)=2 starts at offset 3472
+E403 0000      996        Sheet (1+2)=3 starts at offset 996
+9213 0000      5010       Sheet (1+3)=4 starts at offset 5010
+BE15 0000      5566       Sheet (1+4)=5 starts at offset 5566
+0900 1000      1048585    Count is 1 (12 highest bits)
+                          Starting number is 9 (20 lowest bits)
+4418 0000      6212       Sheet (9+0)=9 starts at offset 6212
+
+
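A minimal sketch of decoding such a record follows. PersistPtrEntries is a hypothetical class, not POI code. It takes only the data part of the record (the bytes after the 8 byte header) and assumes the offsets fit in a Java int.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not a POI class): decodes the entry blocks of a
// PersistPtrIncrementalBlock into a map of sheet number to byte offset.
final class PersistPtrEntries {
    static Map<Integer, Integer> parse(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        Map<Integer, Integer> offsets = new LinkedHashMap<>();
        // Any data left after one block belongs to the next block.
        while (buf.remaining() >= 4) {
            int info = buf.getInt();
            int start = info & 0xFFFFF; // lower 20 bits: starting sheet number
            int count = info >>> 20;    // higher 12 bits: number of sheets
            for (int i = 0; i < count; i++) {
                // Throws BufferUnderflowException if the record is truncated.
                offsets.put(start + i, buf.getInt());
            }
        }
        return offsets;
    }
}
```

Fed the 32 data bytes from the example above, this yields offsets for sheets 1 to 5 and for sheet 9.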
+ +
diff --git a/src/documentation/content/xdocs/hslf/quick-guide.xml b/src/documentation/content/xdocs/hslf/quick-guide.xml
index 5f6525232..7b7b98ded 100644
--- a/src/documentation/content/xdocs/hslf/quick-guide.xml
+++ b/src/documentation/content/xdocs/hslf/quick-guide.xml
@@ -15,8 +15,9 @@
Basic Text Extraction

For basic text extraction, make use of org.apache.poi.extractor.PowerPointExtractor. It accepts a file or an input -stream. The getText() method can be used to get the text from the slides, -from the notes, or from both. +stream. The getText() method can be used to get the text from the slides, and the getNotes() method can be used to get the text +from the notes. Finally, getText(true,true) will get the text +from both.

@@ -31,19 +32,45 @@ what type of text it is (eg Body, Title)

+
Poor Quality Text Extraction +

If speed is the most important thing for you, you don't care + about getting duplicate blocks of text, you don't care about + getting text from master sheets, and you don't care about getting + old text, then + org.apache.poi.extractor.QuickButCruddyTextExtractor + might be of use.

+

QuickButCruddyTextExtractor doesn't use the normal record + parsing code. Instead, it uses a search method that is blind to the tree structure + to find all text-holding records. You will get all the text, + including lots of text you normally wouldn't ever want. However, + you will get it back very fast!

+

There are two ways of getting the text back. + getTextAsString() will return a single string with all + the text in it. getTextAsVector() will return a + vector of strings, one for each text record found in the file. +

+
+
Changing Text -

It is possible to change the text via TextRun.setText(String). However, if -the length of the text is changed, things will break because PowerPoint has -internal file references in byte offsets, which are not yet all updated when -the size changes. +

It is possible to change the text via + TextRun.setText(String). However, if the length of + the text is changed, things will break because PowerPoint has + internal file references in byte offsets. We currently update all + of these byte references that we know about when writing out, but + there are a few more still to be found.

Guide to key classes
- \ No newline at end of file +