HPSF: codepage support added

git-svn-id: https://svn.apache.org/repos/asf/jakarta/poi/trunk@353460 13f79535-47bb-0310-9956-ffa450edef68
Rainer Klute 2003-12-02 17:46:01 +00:00
parent 6385296f3f
commit 131bb9d0bd
15 changed files with 276 additions and 103 deletions

@ -12,7 +12,11 @@
<person id="MJ" name="Marc Johnson" email="mjohnson@apache.org"/>
<person id="NKB" name="Nicola Ken Barozzi" email="barozzi@nicolaken.com"/>
<person id="POI-DEVELOPERS" name="POI Developers" email="poi-dev@jakarta.apache.org"/>
<person id="RK" name="Rainer Klute" email="klute@apache.org"/>
</devs>
<release version="2.0-pre3" date="unreleased">
<action dev="RK" type="add">HPSF: Much better codepage support</action>
</release>
<release version="2.0-pre1" date="unreleased">
<action dev="POI-DEVELOPERS" type="add">Patch applied for deep cloning of worksheets was provided</action>
<action dev="POI-DEVELOPERS" type="add">Patch applied to allow sheet reordering</action>

@ -708,8 +708,9 @@ No property set stream: "/1Table"</source>
<td>The property's value is the number of a <strong>codepage</strong>,
i.e. a mapping from character codes to characters. All strings in the
section containing this property must be interpreted using this
codepage. Typical property values are 1252 (8-bit "western" characters)
or 1200 (16-bit Unicode characters).</td>
codepage. Typical property values are 1252 (8-bit "western" characters,
Windows-1252, a close relative of ISO-8859-1), 1200 (16-bit Unicode
characters, UTF-16), or 65001 (8-bit Unicode characters, UTF-8).</td>
</tr>
</table>
</section>
@ -833,18 +834,34 @@ No property set stream: "/1Table"</source>
</section>
<section><title>Codepage support</title>
<fixme author="Rainer Klute">Improve codepage support!</fixme>
<p>The property with ID 1 holds the number of the codepage which was used
to encode the strings in this section. The present HPSF codepage support
is still very limited: When reading property value strings, HPSF
distinguishes between 16-bit characters and 8-bit characters. 16-bit
characters should be Unicode characters and thus be okay. 8-bit
characters are interpreted according to the platform's default character
set. This is fine as long as the document being read has been written on
a platform with the same default character set. However, if you receive a
document from another region of the world and want to process it with
HPSF you are in trouble - unless the creator used Unicode, of course.</p>
to encode the strings in this section. If this property is not available
in a section, the platform's default character encoding will be
used. This works fine as long as the document being read has been written
on a platform with the same default character encoding. However, if you
receive a document from another region of the world and the codepage is
undefined, you are in trouble.</p>
<p>HPSF's codepage support is as good as the character encoding support of
the Java Virtual Machine (JVM) the application runs on. If HPSF
encounters a codepage number it assumes that the JVM has a character
encoding with a corresponding name. For example, if the codepage is 1252,
HPSF uses the character encoding "cp1252" to read or write strings. If
the JVM does not have that character encoding installed or if the
codepage number is illegal, an UnsupportedEncodingException will be
thrown.</p>
<p>There are two exceptions to the rule that a character encoding's name
is derived from the codepage number by prepending the string "cp" to
it:</p>
<dl>
<dt>Codepage 1200</dt>
<dd>is mapped to the character encoding "UTF-16".</dd>
<dt>Codepage 65001</dt>
<dd>is mapped to the character encoding "UTF-8".</dd>
</dl>
</section>
</section>
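The mapping rule described in the "Codepage support" section can be tried out in isolation. The following is a minimal sketch, not HPSF's own code: the class name `CodepageNameSketch` and the `decode` helper are hypothetical, and `-1` is assumed to mean "codepage undefined", as in the `Section.getCodepage()` method added by this commit.

```java
import java.io.UnsupportedEncodingException;

final class CodepageNameSketch {

    /** Maps a codepage number to a Java encoding name using the rule from the text. */
    static String toEncodingName(final int codepage)
            throws UnsupportedEncodingException {
        if (codepage <= 0) {
            throw new UnsupportedEncodingException(
                    "Codepage number may not be " + codepage);
        }
        switch (codepage) {
            case 1200:
                return "UTF-16";      // first exception to the "cp" + number rule
            case 65001:
                return "UTF-8";       // second exception
            default:
                return "cp" + codepage;
        }
    }

    /**
     * Decodes bytes with the codepage's encoding. A codepage of -1 is
     * treated as "undefined" (an assumption of this sketch), in which case
     * the platform's default encoding is used.
     */
    static String decode(final byte[] bytes, final int codepage)
            throws UnsupportedEncodingException {
        return codepage == -1
                ? new String(bytes)
                : new String(bytes, toEncodingName(codepage));
    }
}
```

Whether a name such as "cp1252" resolves depends on the JVM: if the encoding is not installed, the `String` constructor throws an `UnsupportedEncodingException`, which matches the behaviour the documentation describes.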

@ -944,6 +944,60 @@
<section>
<title>The Dictionary</title>
<p>What a dictionary is good for is explained in the <link
href="how-to.html">HPSF HOW-TO</link>. This chapter explains how it is
organized internally.</p>
<p>The dictionary has a simple header consisting of a single UInt value. It
tells how many entries the dictionary comprises:</p>
<table>
<tr>
<th>Name</th>
<th>Data type</th>
<th>Description</th>
</tr>
<tr>
<td>nrEntries</td>
<td>UInt</td>
<td>Number of dictionary entries</td>
</tr>
</table>
<p>The dictionary entries follow the header. Each one looks like this:</p>
<table>
<tr>
<th>Name</th>
<th>Data type</th>
<th>Description</th>
</tr>
<tr>
<td>key</td>
<td>UInt</td>
<td>The unique number of this property, i.e. the PID</td>
</tr>
<tr>
<td>length</td>
<td>UInt</td>
<td>The length of the property name associated with the key</td>
</tr>
<tr>
<td>value</td>
<td>String</td>
<td>The property's name, terminated with a 0x00 character</td>
</tr>
</table>
<p>The entries are not aligned, i.e. each one follows its predecessor
without any gap or fill characters.</p>
</section>
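The dictionary layout described above can be walked with a few lines of code. This is only a sketch under stated assumptions, not HPSF's reader: a UInt is taken to be a 4-byte little-endian unsigned integer, `length` is taken to count bytes including the terminating 0x00, and ISO-8859-1 stands in for the section's real codepage. The class name `DictionarySketch` is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class DictionarySketch {

    /** Reads a dictionary starting at the given offset into a PID-to-name map. */
    static Map<Long, String> read(final byte[] src, final int offset) {
        final ByteBuffer buf = ByteBuffer.wrap(src).order(ByteOrder.LITTLE_ENDIAN);
        buf.position(offset);

        // Header: a single UInt holding the number of entries.
        final long nrEntries = buf.getInt() & 0xFFFFFFFFL;

        final Map<Long, String> dictionary = new LinkedHashMap<>();
        for (long i = 0; i < nrEntries; i++) {
            final long key = buf.getInt() & 0xFFFFFFFFL;  // the PID
            final int length = buf.getInt();              // assumed: bytes incl. 0x00
            final byte[] raw = new byte[length];
            buf.get(raw);

            // Strip the terminating 0x00 byte(s) before decoding.
            int end = length;
            while (end > 0 && raw[end - 1] == 0) {
                end--;
            }
            // Entries are not aligned, so the next entry starts right here.
            dictionary.put(key,
                    new String(raw, 0, end, StandardCharsets.ISO_8859_1));
        }
        return dictionary;
    }
}
```

Because the entries are packed without padding, the reader never has to skip fill bytes between them; the buffer position after one entry is the start of the next.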
<section><title>References</title>
<p>In order to assemble the HPSF description I used information publicly

@ -20,11 +20,6 @@
easily writing summary information streams and document summary
information streams.
</li>
<li>
Add codepage support: Presently the bytes making out the string in a
property's value are interpreted using the platform's default character
set.
</li>
<li>
Add resource bundles to
<code>org.apache.poi.hpsf.wellknown</code> to ease
@ -38,8 +33,8 @@
arrays.
</li>
<li>
Add WMF to <code>java.awt.Image</code> example code in <link
href="thumbnails.html">Thumbnail HOW TO</link>.
Add WMF to <code>java.awt.Image</code> example code in the <link
href="thumbnails.html">Thumbnail HOW-TO</link>.
</li>
</ol>
</section>

@ -558,7 +558,10 @@ public class CopyCompare
* exists. However, since we have full control about directory
* creation we can ensure that this will never happen. */
ex.printStackTrace(System.err);
throw new RuntimeException(ex);
throw new RuntimeException(ex.toString());
/* FIXME (2): Replace the previous line by the following once we
* no longer need JDK 1.3 compatibility. */
// throw new RuntimeException(ex);
}
}
}
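The FIXME in this hunk concerns the `RuntimeException(Throwable)` constructor, which only exists since JDK 1.4. As a small illustration of what is lost by passing `ex.toString()` instead, here is a sketch with hypothetical names (`CauseChainingSketch`, `wrapAsText`, `wrapWithCause`); it is not part of the commit.

```java
final class CauseChainingSketch {

    /** JDK 1.3-compatible variant: keeps only the textual description. */
    static RuntimeException wrapAsText(final Exception ex) {
        return new RuntimeException(ex.toString());
    }

    /** JDK 1.4+ variant: keeps the original exception as the cause. */
    static RuntimeException wrapWithCause(final Exception ex) {
        return new RuntimeException(ex);
    }
}
```

With the first variant `getCause()` returns `null`, so the original stack trace is only available because the surrounding code prints it to `System.err` beforehand.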

@ -444,7 +444,10 @@ public class WriteAuthorAndTitle
* exists. However, since we have full control about directory
* creation we can ensure that this will never happen. */
ex.printStackTrace(System.err);
throw new RuntimeException(ex);
throw new RuntimeException(ex.toString());
/* FIXME (2): Replace the previous line by the following once we
* no longer need JDK 1.3 compatibility. */
// throw new RuntimeException(ex);
}
}
}

@ -80,19 +80,20 @@ public class MutableProperty extends Property
* <p>Writes the property to an output stream.</p>
*
* @param out The output stream to write to.
* @param codepage The codepage to use for writing non-wide strings
* @return the number of bytes written to the stream
*
* @exception IOException if an I/O error occurs
* @exception WritingNotSupportedException if a variant type is to be
* written that is not yet supported
*/
public int write(final OutputStream out)
public int write(final OutputStream out, final int codepage)
throws IOException, WritingNotSupportedException
{
int length = 0;
long variantType = getType();
length += TypeWriter.writeUIntToStream(out, variantType);
length += VariantSupport.write(out, variantType, getValue());
length += VariantSupport.write(out, variantType, getValue(), codepage);
return length;
}

@ -420,16 +420,16 @@ public class MutableSection extends Section
/* If the property ID is not equal to 0 we write the property and all
* is fine. However, if it equals 0 we have to write the section's
* dictionary which does not have a type but just a value. */
* dictionary which has an implicit type only and an explicit
* value. */
if (id != 0)
/* Write the property and update the position to the next
* property. */
position += p.write(propertyStream);
position += p.write(propertyStream, getCodepage());
else
{
final Integer codepage =
(Integer) getProperty(PropertyIDMap.PID_CODEPAGE);
if (codepage == null)
final int codepage = getCodepage();
if (codepage == -1)
throw new IllegalPropertySetDataException
("Codepage (property 1) is undefined.");
position += writeDictionary(propertyStream, dictionary);

@ -62,9 +62,11 @@
*/
package org.apache.poi.hpsf;
import java.io.UnsupportedEncodingException;
import java.util.HashMap;
import java.util.Map;
import org.apache.poi.util.HexDump;
import org.apache.poi.util.LittleEndian;
/**
@ -161,9 +163,13 @@ public class Property
* @param length The property's type/value pair's length in bytes.
* @param codepage The section's and thus the property's
* codepage. It is needed only when reading string values.
*
* @exception UnsupportedEncodingException if the specified codepage is not
* supported
*/
public Property(final long id, final byte[] src, final long offset,
final int length, final int codepage)
throws UnsupportedEncodingException
{
this.id = id;
@ -183,7 +189,7 @@ public class Property
try
{
value = VariantSupport.read(src, o, length, (int) type);
value = VariantSupport.read(src, o, length, (int) type, codepage);
}
catch (UnsupportedVariantTypeException ex)
{
@ -382,8 +388,27 @@ public class Property
b.append(getID());
b.append(", type: ");
b.append(getType());
final Object value = getValue();
b.append(", value: ");
b.append(getValue());
b.append(value.toString());
if (value instanceof String)
{
final String s = (String) value;
final int l = s.length();
final byte[] bytes = new byte[l * 2];
for (int i = 0; i < l; i++)
{
final char c = s.charAt(i);
final byte high = (byte) ((c & 0x00ff00) >> 8);
final byte low = (byte) ((c & 0x0000ff) >> 0);
bytes[i * 2] = high;
bytes[i * 2 + 1] = low;
}
final String hex = HexDump.dump(bytes, 0L, 0);
b.append(" [");
b.append(hex);
b.append("]");
}
b.append(']');
return b.toString();
}

@ -56,6 +56,7 @@ package org.apache.poi.hpsf;
import java.io.IOException;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;
import java.util.ArrayList;
import java.util.List;
@ -300,9 +301,11 @@ public class PropertySet
* @param length The length of the stream data.
* @throws NoPropertySetStreamException if the byte array is not a
* property set stream.
*
* @exception UnsupportedEncodingException if the codepage is not supported
*/
public PropertySet(final byte[] stream, final int offset, final int length)
throws NoPropertySetStreamException
throws NoPropertySetStreamException, UnsupportedEncodingException
{
if (isPropertySetStream(stream, offset, length))
init(stream, offset, length);
@ -321,8 +324,11 @@ public class PropertySet
* complete byte array contents is the stream data.
* @throws NoPropertySetStreamException if the byte array is not a
* property set stream.
*
* @exception UnsupportedEncodingException if the codepage is not supported
*/
public PropertySet(final byte[] stream) throws NoPropertySetStreamException
public PropertySet(final byte[] stream)
throws NoPropertySetStreamException, UnsupportedEncodingException
{
this(stream, 0, stream.length);
}
@ -435,6 +441,7 @@ public class PropertySet
* @param length Length of the property set stream.
*/
private void init(final byte[] src, final int offset, final int length)
throws UnsupportedEncodingException
{
/* FIXME (3): Ensure that at most "length" bytes are read. */
@ -651,7 +658,7 @@ public class PropertySet
final PropertySet ps = (PropertySet) o;
int byteOrder1 = ps.getByteOrder();
int byteOrder2 = getByteOrder();
ClassID classId1 = ps.getClassID();
ClassID classID1 = ps.getClassID();
ClassID classID2 = getClassID();
int format1 = ps.getFormat();
int format2 = getFormat();
@ -660,7 +667,7 @@ public class PropertySet
int sectionCount1 = ps.getSectionCount();
int sectionCount2 = getSectionCount();
if (byteOrder1 != byteOrder2 ||
!classId1.equals(classID2) ||
!classID1.equals(classID2) ||
format1 != format2 ||
osVersion1 != osVersion2 ||
sectionCount1 != sectionCount2)

@ -54,6 +54,7 @@
*/
package org.apache.poi.hpsf;
import java.io.UnsupportedEncodingException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
@ -193,8 +194,12 @@ public class Section
* @param src Contains the complete property set stream.
* @param offset The position in the stream that points to the
* section's format ID.
*
* @exception UnsupportedEncodingException if the section's codepage is not
* supported.
*/
public Section(final byte[] src, final int offset)
throws UnsupportedEncodingException
{
int o1 = offset;
@ -638,4 +643,18 @@ public class Section
return dictionary;
}
/**
* <p>Gets the section's codepage, if any.</p>
*
* @return The section's codepage if one is defined, else -1.
*/
public int getCodepage()
{
final Integer codepage =
(Integer) getProperty(PropertyIDMap.PID_CODEPAGE);
return codepage != null ? codepage.intValue() : -1;
}
}

@ -185,7 +185,8 @@ public class TypeWriter
* @exception IOException if an I/O error occurs
*/
public static void writeToStream(final OutputStream out,
final Property[] properties)
final Property[] properties,
final int codepage)
throws IOException, UnsupportedVariantTypeException
{
/* If there are no properties don't write anything. */
@ -207,7 +208,7 @@ public class TypeWriter
final Property p = (Property) properties[i];
long type = p.getType();
writeUIntToStream(out, type);
VariantSupport.write(out, (int) type, p.getValue());
VariantSupport.write(out, (int) type, p.getValue(), codepage);
}
}

@ -64,6 +64,7 @@ package org.apache.poi.hpsf;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.util.Date;
import java.util.LinkedList;
import java.util.List;
@ -163,17 +164,21 @@ public class VariantSupport extends Variant
* @param length The length of the variant including the variant
* type field
* @param type The variant type to read
* @param codepage The codepage to use to read non-wide strings
* @return A Java object that corresponds best to the variant
* field. For example, a VT_I4 is returned as a {@link Long}, a
* VT_LPSTR as a {@link String}.
* @exception ReadingNotSupportedException if a property is to be read
* whose variant type HPSF does not yet support
* @exception UnsupportedEncodingException if the specified codepage is not
* supported
*
* @see Variant
*/
public static Object read(final byte[] src, final int offset,
final int length, final long type)
throws ReadingNotSupportedException
final int length, final long type,
final int codepage)
throws ReadingNotSupportedException, UnsupportedEncodingException
{
Object value;
int o1 = offset;
@ -221,18 +226,18 @@ public class VariantSupport extends Variant
* Read a byte string. In Java it is represented as a
* String object. The 0x00 bytes at the end must be
* stripped.
*
* FIXME (2): Reading an 8-bit string should pay attention
* to the codepage. Currently the byte making out the
* property's value are interpreted according to the
* platform's default character set.
*/
final int first = o1 + LittleEndian.INT_SIZE;
long last = first + LittleEndian.getUInt(src, o1) - 1;
o1 += LittleEndian.INT_SIZE;
final int rawLength = (int) (last - first + 1);
while (first <= last && src[(int) last] == 0)
last--;
value = new String(src, (int) first, (int) (last - first + 1));
final int l = (int) (last - first + 1);
value = codepage != -1 ?
new String(src, (int) first, l,
codepageToEncoding(codepage)) :
new String(src, (int) first, l);
break;
}
case Variant.VT_LPWSTR:
@ -298,6 +303,38 @@ public class VariantSupport extends Variant
/**
* <p>Turns a codepage number into the equivalent character encoding's
* name.</p>
*
* @param codepage The codepage number
*
* @return The character encoding's name. If the codepage number is 1200,
* the encoding name is "UTF-16"; if it is 65001, the encoding name is
* "UTF-8". All other positive numbers are mapped to "cp" followed by the
* number, e.g. if the codepage number is 1252 the returned character
* encoding name will be "cp1252".
*
* @exception UnsupportedEncodingException if the specified codepage is
* less than or equal to zero.
*/
public static String codepageToEncoding(final int codepage)
throws UnsupportedEncodingException
{
if (codepage <= 0)
throw new UnsupportedEncodingException
("Codepage number may not be " + codepage);
switch (codepage)
{
case 1200:
return "UTF-16";
case 65001:
return "UTF-8";
default:
return "cp" + codepage;
}
}
/**
* <p>Writes a variant value to an output stream. This method ensures that
* always a multiple of 4 bytes is written.</p>
@ -305,6 +342,7 @@ public class VariantSupport extends Variant
* @param out The stream to write the value to.
* @param type The variant's type.
* @param value The variant's value.
* @param codepage The codepage to use to write non-wide strings
* @return The number of entities that have been written. In many cases an
* "entity" is a byte but this is not always the case.
* @exception IOException if an I/O exception occurs
@ -312,7 +350,7 @@ public class VariantSupport extends Variant
* whose variant type HPSF does not yet support
*/
public static int write(final OutputStream out, final long type,
final Object value)
final Object value, final int codepage)
throws IOException, WritingNotSupportedException
{
int length = 0;
@ -330,16 +368,13 @@ public class VariantSupport extends Variant
}
case Variant.VT_LPSTR:
{
length = TypeWriter.writeUIntToStream
(out, ((String) value).length() + 1);
char[] s = Util.pad4((String) value);
/* FIXME (2): The following line forces characters to bytes.
* This is generally wrong and should only be done according to
* a codepage. Alternatively Unicode could be written (see
* Variant.VT_LPWSTR). */
byte[] b = new byte[s.length + 1];
for (int i = 0; i < s.length; i++)
b[i] = (byte) s[i];
final byte[] bytes =
(codepage == -1 ?
((String) value).getBytes() :
((String) value).getBytes(codepageToEncoding(codepage)));
length = TypeWriter.writeUIntToStream(out, bytes.length + 1);
final byte[] b = new byte[bytes.length + 1];
System.arraycopy(bytes, 0, b, 0, bytes.length);
b[b.length - 1] = 0x00;
out.write(b);
length += b.length;
@ -419,12 +454,13 @@ public class VariantSupport extends Variant
}
}
/* Add 0x00 character to write a multiple of four bytes: */
while (length % 4 != 0)
{
out.write(0);
length++;
}
/* Add 0x00 characters to write a multiple of four bytes: */
// FIXME (1) Try this!
// while (length % 4 != 0)
// {
// out.write(0);
// length++;
// }
return length;
}
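The VT_LPSTR change in this file matters because, once a codepage is applied, the number of bytes written can differ from the number of characters in the string. The round trip can be sketched as follows; this is an illustration under assumptions, not the HPSF code: the class `LpstrSketch` is hypothetical, the length field is written as a 4-byte little-endian value, and the trailing padding to a multiple of four bytes is left out.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

final class LpstrSketch {

    /** Writes length (byte count + 1), the encoded bytes, and a 0x00 terminator. */
    static byte[] write(final String value, final String encoding)
            throws IOException {
        final byte[] bytes = value.getBytes(encoding);
        final int n = bytes.length + 1;  // bytes, not characters, plus terminator
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(n & 0xFF);
        out.write((n >>> 8) & 0xFF);
        out.write((n >>> 16) & 0xFF);
        out.write((n >>> 24) & 0xFF);
        out.write(bytes);
        out.write(0);
        return out.toByteArray();
    }

    /** Reads the string back, stripping trailing 0x00 bytes before decoding. */
    static String read(final byte[] src, final String encoding)
            throws IOException {
        final int n = (src[0] & 0xFF) | ((src[1] & 0xFF) << 8)
                | ((src[2] & 0xFF) << 16) | ((src[3] & 0xFF) << 24);
        final int first = 4;
        int last = first + n - 1;
        // Check the bounds before touching the array.
        while (first <= last && src[last] == 0) {
            last--;
        }
        return new String(src, first, last - first + 1, encoding);
    }
}
```

With "UTF-8" as the encoding, a three-character string of umlauts occupies six bytes, so the length field holds 7 rather than 4. This is why the new code derives the length from `bytes.length` instead of `String.length()`.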

@ -357,7 +357,10 @@ public class TestWrite extends TestCase
catch (Exception ex)
{
ex.printStackTrace();
throw new RuntimeException(ex);
throw new RuntimeException(ex.toString());
/* FIXME (2): Replace the previous line by the following
* one once we no longer need JDK 1.3 compatibility. */
// throw new RuntimeException(ex);
}
}
},
@ -398,37 +401,40 @@ public class TestWrite extends TestCase
public void testVariantTypes()
{
Throwable t = null;
final int codepage = -1;
/* FIXME (2): Add tests for various codepages! */
try
{
check(Variant.VT_EMPTY, null);
check(Variant.VT_BOOL, new Boolean(true));
check(Variant.VT_BOOL, new Boolean(false));
check(Variant.VT_CF, new byte[]{0});
check(Variant.VT_CF, new byte[]{0, 1});
check(Variant.VT_CF, new byte[]{0, 1, 2});
check(Variant.VT_CF, new byte[]{0, 1, 2, 3});
check(Variant.VT_CF, new byte[]{0, 1, 2, 3, 4});
check(Variant.VT_CF, new byte[]{0, 1, 2, 3, 4, 5});
check(Variant.VT_CF, new byte[]{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10});
check(Variant.VT_I2, new Integer(27));
check(Variant.VT_I4, new Long(28));
check(Variant.VT_FILETIME, new Date());
check(Variant.VT_LPSTR, "");
check(Variant.VT_LPSTR, "ä");
check(Variant.VT_LPSTR, "äö");
check(Variant.VT_LPSTR, "äöü");
check(Variant.VT_LPSTR, "äöüÄ");
check(Variant.VT_LPSTR, "äöüÄÖ");
check(Variant.VT_LPSTR, "äöüÄÖÜ");
check(Variant.VT_LPSTR, "äöüÄÖÜß");
check(Variant.VT_LPWSTR, "");
check(Variant.VT_LPWSTR, "ä");
check(Variant.VT_LPWSTR, "äö");
check(Variant.VT_LPWSTR, "äöü");
check(Variant.VT_LPWSTR, "äöüÄ");
check(Variant.VT_LPWSTR, "äöüÄÖ");
check(Variant.VT_LPWSTR, "äöüÄÖÜ");
check(Variant.VT_LPWSTR, "äöüÄÖÜß");
check(Variant.VT_EMPTY, null, codepage);
check(Variant.VT_BOOL, new Boolean(true), codepage);
check(Variant.VT_BOOL, new Boolean(false), codepage);
check(Variant.VT_CF, new byte[]{0}, codepage);
check(Variant.VT_CF, new byte[]{0, 1}, codepage);
check(Variant.VT_CF, new byte[]{0, 1, 2}, codepage);
check(Variant.VT_CF, new byte[]{0, 1, 2, 3}, codepage);
check(Variant.VT_CF, new byte[]{0, 1, 2, 3, 4}, codepage);
check(Variant.VT_CF, new byte[]{0, 1, 2, 3, 4, 5}, codepage);
check(Variant.VT_CF, new byte[]{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10},
codepage);
check(Variant.VT_I2, new Integer(27), codepage);
check(Variant.VT_I4, new Long(28), codepage);
check(Variant.VT_FILETIME, new Date(), codepage);
check(Variant.VT_LPSTR, "", codepage);
check(Variant.VT_LPSTR, "ä", codepage);
check(Variant.VT_LPSTR, "äö", codepage);
check(Variant.VT_LPSTR, "äöü", codepage);
check(Variant.VT_LPSTR, "äöüÄ", codepage);
check(Variant.VT_LPSTR, "äöüÄÖ", codepage);
check(Variant.VT_LPSTR, "äöüÄÖÜ", codepage);
check(Variant.VT_LPSTR, "äöüÄÖÜß", codepage);
check(Variant.VT_LPWSTR, "", codepage);
check(Variant.VT_LPWSTR, "ä", codepage);
check(Variant.VT_LPWSTR, "äö", codepage);
check(Variant.VT_LPWSTR, "äöü", codepage);
check(Variant.VT_LPWSTR, "äöüÄ", codepage);
check(Variant.VT_LPWSTR, "äöüÄÖ", codepage);
check(Variant.VT_LPWSTR, "äöüÄÖÜ", codepage);
check(Variant.VT_LPWSTR, "äöüÄÖÜß", codepage);
}
catch (Exception ex)
{
@ -466,20 +472,22 @@ public class TestWrite extends TestCase
* @throws UnsupportedVariantTypeException if the variant is not supported.
* @throws IOException if an I/O exception occurs.
*/
private void check(final long variantType, final Object value)
private void check(final long variantType, final Object value,
final int codepage)
throws UnsupportedVariantTypeException, IOException
{
final ByteArrayOutputStream out = new ByteArrayOutputStream();
VariantSupport.write(out, variantType, value);
VariantSupport.write(out, variantType, value, codepage);
out.close();
final byte[] b = out.toByteArray();
final Object objRead =
VariantSupport.read(b, 0, b.length + LittleEndian.INT_SIZE,
variantType);
variantType, -1);
if (objRead instanceof byte[])
{
final int diff = diff(org.apache.poi.hpsf.Util.pad4
((byte[]) value), (byte[]) objRead);
// final int diff = diff(org.apache.poi.hpsf.Util.pad4
// ((byte[]) value), (byte[]) objRead);
final int diff = diff((byte[]) value, (byte[]) objRead);
if (diff >= 0)
fail("Byte arrays are different. First different byte is at " +
"index " + diff + ".");