From ed83ff62b1681221f536567a8c5913d65edcb472 Mon Sep 17 00:00:00 2001
From: Nick Burch
+ PowerPoint documents are made up of a tree of records. A record may
+ contain either other records (in which case it is a Container),
+ or data (in which case it's an Atom). A record can't hold both.
+
+ PowerPoint documents don't have one overall container record. Instead,
+ there are a number of different container records to be found at
+ the top level.
+
+ Any numbers or strings stored in the records are always stored in
+ Little Endian format (least significant byte first). This is the case
+ no matter what platform the file was written on - be that a
+ Little Endian or a Big Endian system.
+
+ PowerPoint may have Escher (DDF) records embedded in it. These
+ are always held as the children of a PPDrawing record (record
+ type 1036). Escher records have the same format as PowerPoint
+ records.
+
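+ All records, be they containers or atoms, have the same standard
+ 8 byte header. It is made up of two bytes of version and options,
+ two bytes of record type, and four bytes of record length.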
+
+ If the first byte of the header, bitwise ANDed with 0x0f, is 0x0f,
+ then the record is a container. Otherwise, it's an atom. The rest
+ of the first two bytes are used to store the "options" for the
+ record. Most commonly, this is used to indicate the version of
+ the record, but the exact usage is record specific.
+
+ The record type is a little endian number, which tells you what
+ kind of record you're dealing with. Each different kind of record
+ has its own value that gets stored here. PowerPoint records have
+ a type that's normally less than 6000 (decimal). Escher records
+ normally have a type between 0xF000 and 0xF1FF.
+
+ The record length is another little endian number. For an atom,
+ it's the size of the data part of the record, i.e. the length
+ of the record less its 8 byte record header. For a
+ container, it's the size of all the records that are children of
+ this record. That means that the size of a container record is the
+ length, plus 8 bytes for its record header.
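The header rules above can be sketched in a few lines of Java. This is only an illustration, not POI's own parsing code: the class name RecordHeaderSketch is hypothetical, and it assumes the 8 bytes split into a 2 byte options field, a 2 byte type and a 4 byte length.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper (not part of POI) that decodes the 8 byte record header. */
final class RecordHeaderSketch {
    static final int HEADER_SIZE = 8;

    final boolean container; // true if (first byte & 0x0f) == 0x0f
    final int options;       // the rest of the first two bytes
    final int type;          // record type, e.g. 1036 for PPDrawing
    final long length;       // length of the data or children, excluding the header

    private RecordHeaderSketch(boolean container, int options, int type, long length) {
        this.container = container;
        this.options = options;
        this.type = type;
        this.length = length;
    }

    /** Reads a header starting at the given offset. Assumes a 2 + 2 + 4 byte layout. */
    static RecordHeaderSketch parse(byte[] data, int offset) {
        ByteBuffer buf = ByteBuffer.wrap(data, offset, HEADER_SIZE)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        int firstTwo = buf.getShort() & 0xFFFF;   // low byte is the first byte of the header
        int type = buf.getShort() & 0xFFFF;
        long length = buf.getInt() & 0xFFFFFFFFL; // treat the length as unsigned
        boolean container = (firstTwo & 0x0F) == 0x0F;
        return new RecordHeaderSketch(container, firstTwo >>> 4, type, length);
    }

    /** Size of the whole record on disk: the length plus the 8 byte header. */
    long totalSize() {
        return length + HEADER_SIZE;
    }
}
```

To step over a record without decoding it, advance by totalSize() bytes from the start of its header. This works the same way for atoms and containers.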
+ aka Records that care about the byte level position of other records
+ A small number of records contain byte level position offsets to other
+ records. If you change the position of any records in the file, then
+ there's a good chance that you will need to update some of these
+ special records.
+
+ First up, CurrentUserAtom. This is actually stored in a different
+ OLE2 (POIFS) stream to the main PowerPoint document. It contains
+ a few bits of information on who last edited the file. Most
+ importantly, at byte 8 of its contents, it stores (as a 32 bit
+ little endian number) the offset in the main stream to the most
+ recent UserEditAtom.
+
+ The UserEditAtom contains two byte level offsets (again as 32 bit
+ little endian numbers). At byte 12 is the offset to the
+ PersistPtrIncrementalBlock associated with this UserEditAtom
+ (each UserEditAtom has one and only one PersistPtrIncrementalBlock).
+ At byte 8, there's the offset to the previous UserEditAtom. If this
+ is 0, then you're at the first one.
+
+ Every time you do a non-full save in PowerPoint, it tacks on another
+ UserEditAtom and another PersistPtrIncrementalBlock. The
+ CurrentUserAtom is updated to point to this new UserEditAtom, and the
+ new UserEditAtom points back to the previous UserEditAtom. You then
+ end up with a chain, starting from the CurrentUserAtom, linking
+ back through all the UserEditAtoms, until you reach the first one
+ from a full save.
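Walking that chain might look something like the sketch below. It is not POI code, and the class and method names are hypothetical. It assumes that "byte 8" and "byte 12" are counted from the start of the atom's contents, i.e. after its 8 byte record header, and that each incremental save is appended, so older UserEditAtoms sit at lower offsets. The second assumption is only used as a guard against looping forever on a damaged file.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of POI) that follows the UserEditAtom chain. */
final class UserEditChainSketch {
    private static final int HEADER_SIZE = 8;

    /** Reads an unsigned 32 bit little endian number at the given position. */
    static long readUInt32(byte[] data, int pos) {
        return ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).getInt(pos) & 0xFFFFFFFFL;
    }

    /** Offset of the PersistPtrIncrementalBlock belonging to the UserEditAtom at this offset. */
    static long persistPtrOffset(byte[] mainStream, long userEditOffset) {
        // Assumption: byte 12 is relative to the contents, so skip the header first
        return readUInt32(mainStream, (int) userEditOffset + HEADER_SIZE + 12);
    }

    /**
     * Returns the offsets of all UserEditAtoms, newest first. The starting
     * offset is the one stored in the CurrentUserAtom.
     */
    static List<Long> walk(byte[] mainStream, long newestUserEditOffset) {
        List<Long> offsets = new ArrayList<>();
        long current = newestUserEditOffset;
        while (true) {
            offsets.add(current);
            // Assumption: byte 8 is relative to the contents, so skip the header first
            long previous = readUInt32(mainStream, (int) current + HEADER_SIZE + 8);
            if (previous == 0) {
                break; // reached the UserEditAtom from the full save
            }
            if (previous >= current) {
                throw new IllegalStateException("UserEditAtom chain does not move backwards");
            }
            current = previous;
        }
        return offsets;
    }
}
```

Calling persistPtrOffset for each entry of the returned list gives the PersistPtrIncrementalBlocks in the same newest-first order.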
+
+ The PersistPtrIncrementalBlock contains byte offsets to all the
+ Slides, Notes, Documents and MasterSlides in the file. The first
+ PersistPtrIncrementalBlock will point to all the ones that
+ were present the first time the file was saved. Subsequent
+ PersistPtrIncrementalBlocks will contain pointers to all the ones
+ that were changed in that edit. To find the offset to a given
+ sheet in the latest version, start with the most recent
+ PersistPtrIncrementalBlock. If this knows about the sheet, use the
+ offset it has. If it doesn't, then work back through older
+ PersistPtrIncrementalBlocks until you find one which does, and
+ use that.
+
+ Each PersistPtrIncrementalBlock can contain a number of entry
+ blocks. Each block holds information on a sequence of sheets.
+ Each block starts with a 32 bit little endian integer. Once read
+ into memory, the lower 20 bits contain the starting number for the
+ sequence of sheets to be described. The higher 12 bits contain
+ the count of the number of sheets described. Following that is
+ one 32 bit little endian integer for each sheet in the sequence,
+ the value being the offset to that sheet. If there is any data
+ left after parsing a block, then it corresponds to the next block.
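As a rough illustration of the entry block layout, and of the newest-to-oldest lookup described earlier, here is a small Java sketch. It is not how POI implements this. The class and method names are hypothetical, the input is assumed to be the contents of one PersistPtrIncrementalBlock without its 8 byte record header, and the sheets in a sequence are assumed to be numbered consecutively from the starting number.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of POI) for PersistPtrIncrementalBlock contents. */
final class PersistPtrSketch {

    /** Decodes all entry blocks into a map of sheet number to byte offset. */
    static Map<Integer, Long> parse(byte[] contents) {
        ByteBuffer buf = ByteBuffer.wrap(contents).order(ByteOrder.LITTLE_ENDIAN);
        Map<Integer, Long> offsets = new LinkedHashMap<>();
        while (buf.remaining() >= 4) {
            int info = buf.getInt();
            int start = info & 0xFFFFF; // lower 20 bits: first sheet number
            int count = info >>> 20;    // higher 12 bits: number of sheets
            for (int i = 0; i < count; i++) {
                // Assumption: sheet numbers run consecutively from start.
                // A truncated block makes getInt() throw BufferUnderflowException.
                offsets.put(start + i, buf.getInt() & 0xFFFFFFFFL);
            }
        }
        return offsets;
    }

    /**
     * Looks a sheet up in a list of decoded blocks ordered newest first.
     * Returns null if no block knows about the sheet.
     */
    static Long resolve(int sheetNumber, List<Map<Integer, Long>> newestFirst) {
        for (Map<Integer, Long> block : newestFirst) {
            Long offset = block.get(sheetNumber);
            if (offset != null) {
                return offset;
            }
        }
        return null;
    }
}
```

Keeping each block as its own map, rather than merging them, preserves the edit history: the same sheet number can appear in several blocks, and the list order decides which offset wins.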
+ For basic text extraction, make use of
+ org.apache.poi.extractor.PowerPointExtractor
. It accepts a file or an input
-stream. The getText()
method can be used to get the text from the slides,
-from the notes, or from both.
+stream. The getText()
method can be used to get the text from the slides, and the getNotes()
method can be used to get the text
+from the notes. Finally, getText(true,true)
will get the text
+from both.
If speed is the most important thing for you, you don't care
+ about getting duplicate blocks of text, you don't care about
+ getting text from master sheets, and you don't care about getting
+ old text, then
+ org.apache.poi.extractor.QuickButCruddyTextExtractor
+ might be of use.
QuickButCruddyTextExtractor doesn't use the normal record
+ parsing code, instead it uses a tree structure blind search
+ method to get all text holding records. You will get all the text,
+ including lots of text you normally wouldn't ever want. However,
+ you will get it back very very fast!
+There are two ways of getting the text back.
+ getTextAsString()
will return a single string with all
+ the text in it. getTextAsVector()
will return a
+ vector of strings, one for each text record found in the file.
+
It is possible to change the text via TextRun.setText(String)
. However, if
-the length of the text is changed, things will break because PowerPoint has
-internal file references in byte offsets, which are not yet all updated when
-the size changes.
+
It is possible to change the text via
+ TextRun.setText(String)
. However, if the length of
+ the text is changed, things will break because PowerPoint has
+ internal file references in byte offsets. We currently update all
+ of these byte references that we know about when writing out, but
+ there are a few more still to be found.
org.apache.poi.hslf.HSLFSlideShow
- Handles reading in and writing out files. Generates a tree of the records
- in the file
+ Handles reading in and writing out files. Calls
+ org.apache.poi.hslf.record.record
to build a tree
+ of all the records in the file, which it allows access to.
+ org.apache.poi.hslf.record.record
+ Base class of all records. Also provides the main record generation
+ code, which will build up a tree of records for a file.
org.apache.poi.hslf.usermode.SlideShow
Builds up model entries from the records, and presents a user facing
@@ -55,4 +82,4 @@ the size changes.