Shipwright/StormLib/doc/The MoPaQ File Format 0.9.txt

THE MOPAQ ARCHIVE FORMAT
v0.9 (Thursday, June 30, 2005)
by Justin Olbrantz(Quantam)

Distribution and reproduction of this specification are allowed without limitation, as long as it is not altered. Quoting
in other works is freely allowed, as long as the source and author of the quote is stated.

TABLE OF CONTENTS
1. Introduction to the MoPaQ Format
2. The MoPaQ Format
	2.1 General Archive Layout
	2.2 Archive Header
	2.3 Block Table
	2.4 Hash Table
	2.5 File Data
	2.6 Listfile
	2.7 Extended Attributes
	2.8 Weak (Old) Digital Signature
	2.9 Strong (New) Digital Signature
3. Algorithm Source Code
	3.1 Encryption/Decryption
	3.2 Hashing
	3.3 Conversion of FILETIME and time_t

1. INTRODUCTION TO THE MOPAQ FORMAT
The MoPaQ (or MPQ) format is an archive file format designed by Mike O'Brien (hence the name Mike O'brien PaCK) at Blizzard
Entertainment. The format has been used in all Blizzard games since (and including) Diablo. It is heavily optimized to be
a read-only game archive format, and excels at this role.

The Blizzard MoPaQ-reading functions are contained in the Storm module, which my be either statically or dynamically linked.
The Blizzard MoPaQ-writing functions are contained in the MPQAPI module, which is always statically linked.

2. THE MOPAQ FORMAT
All numbers in the MoPaQ format are in little endian. Data types are listed either as int (integer, the number of bits specified),
byte (8 bits), and char (bytes which contain ASCII characters). All sizes and offsets are in bytes, unless specified otherwise.
Structure members are listed in the following general form:
offset from the beginning of the structure: data type(array size) member name : member description

2.1 GENERAL ARCHIVE LAYOUT
- Archive Header
- File Data
- File Data - Special Files
- Hash Table
- Block Table
- Strong Digital signature

This is the usual archive format, and is not absolutely essential. Some archives have been observed placing the hash table
and file table after the archive header, and before the file data.

2.2 ARCHIVE HEADER
00h: char(4) Magic : Indicates that the file is a MoPaQ archive. Must be ASCII "MPQ" 1Ah.
04h: int32 HeaderSize : Size of the archive header. Should be 32.
08h: int32 ArchiveSize : Size of the whole archive, including the header. Does not include the strong digital signature, if present.
This size is used, among other things, for determining the region to hash in computing the digital signature.
0Ch: int16 Unknown : Unknown
0Eh: int8 SectorSizeShift : Power of two exponent specifying the number of 512-byte disk sectors in each logical sector
in the archive. The size of each logical sector the archive is 512 * 2^SectorSizeShift. Bugs in the Storm library dictate
that this should always be 3 (4096 byte sectors).
10h: int32 HashTableOffset : Offset to the beginning of the hash table, relative to the beginning of the archive.
14h: int32 BlockTableOffset : Offset to the beginning of the block table, relative to the beginning of the archive.
18h: int32 HashTableEntries : Number of entries in the hash table. Must be a power of two, and must be less than 2^16.
1Ch: int32 BlockTableEntries : Number of entries in the block table.

The archive header is the first structure in the archive, at archive offset 0, but the archive does not need to be at offset
0 of the containing file. The offset of the archive in the file is referred to here as ArchiveOffset. If the archive is not
at the beginning of the file, it must begin at a disk sector boundary (512 bytes). Early versions of Storm require that the
archive be at the end of the containing file (ArchiveOffset + ArchiveSize = file size), but this is not required in newer
versions (due to the strong digital signature not being considered a part of the archive).

2.3 BLOCK TABLE
The block table contains entries for each region in the archive. Regions may be either files or empty space, which may be
overwritten by new files (typically this space is from deleted file data). The block table is encrypted, using the hash
of "(block table)" as the key. Each entry is structured as follows:

00h: int32 BlockOffset : Offset of the beginning of the block, relative to the beginning of the archive. Meaningless if the block size is 0.
04h: int32 BlockSize : Size of the block in the archive.
08h: int32 FileSize : Size of the file data stored in the block. Only valid if the block is a file, otherwise meaningless, and should be 0. If the file is compressed, this is the size of the uncompressed file data.
0Ch: int32 Flags : Bit mask of the flags for the block. The following values are conclusively identified:
	80000000h: Block is a file, and follows the file data format; otherwise, block is free space, and may be overwritten. If the block is not a file, all other flags should be cleared.
	01000000h: File is stored as a single unit, rather than split into sectors.
	00020000h: The file's encryption key is adjusted by the block offset and file size (explained in detail in the File Data section). File must be encrypted.
	00010000h: File is encrypted.
	00000200h: File is compressed. Mutually exclusive to file imploded.
	00000100h: File is imploded. Mutually exclusive to file compressed.

2.4 HASH TABLE
Instead of storing file names, for quick access MoPaQs use a fixed, power of two-size hash table of files in the archive. A file is uniquely identified by its file path, its language, and its platform. The home entry for a file in the hash table is computed as a hash of the file path. In the event of a collision (the home entry is occupied by another file), progressive overflow is used, and the file is placed in the next available hash table entry. Searches for a desired file in the hash table proceed from the home entry for the file until either the file is found, the entire hash table is searched, or an empty hash table entry (FileBlockIndex of FFFFFFFFh) is encountered. The hash table is encrypted using the hash of "(hash table)" as the key. Each entry is structured as follows:

00h: int32 FilePathHashA : The hash of the file path, using method A.
04h: int32 FilePathHashB : The hash of the file path, using method B.
08h: int16 Language : The language of the file. This is a Windows LANGID data type, and uses the same values. 0 indicates the default language (American English), or that the file is language-neutral.
0Ah: int8 Platform : The platform the file is used for. 0 indicates the default platform. No other values have been observed.
0Ch: int32 FileBlockIndex : If the hash table entry is valid, this is the index into the block table of the file. Otherwise, one of the following two values:
	FFFFFFFFh: Hash table entry is empty, and has always been empty. Terminates searches for a given file.
	FFFFFFFEh: Hash table entry is empty, but was valid at some point (in other words, the file was deleted). Does not terminate searches for a given file.

2.5 FILE DATA
00h: int32(SectorsInFile + 1) SectorOffsetTable : Offsets to the start of each sector's data, relative to the beginning of the file data. Not present if this information is calculatable (see details below).
immediately following SectorOffsetTable: SectorData : Data of each sector in the file, packed end to end (see details below).

Normally, file data is split up into sectors, for simple streaming. All sectors, save for the last, will contain as many bytes of file data as specified in the archive header's SectorSizeShift; the last sector may be smaller than this, depending on the size of the file data. This sector size is the size of the raw file data; if the file is compressed, the compressed sector will be smaller or the same size as the uncompressed sector size. Individual sectors in a compressed file may be stored uncompressed; this occurs if and only if the sector could not be compressed by the algorithm used (if the compressed sector size was greater than or equal to the size of the raw data), and is indicated by the sector's compressed size in SectorOffsetTable being equal to the uncompressed size of the sector (which may be calculated from the FileSize).

If the sector is compressed (but not imploded), a bit mask byte of the compression algorithm(s) used to compress the sector is appended to the beginning of the compressed sector data. This additional byte counts towards the total size of the sector; if the size of the sector (including this byte) exceeds or matches the uncompressed size of the sector data, the sector will be stored uncompressed, and this byte omitted. Multiple compression algorithms may be used on the same sector; in this case, successive compression occurs in the order the algorithms are listed below, and decompression occurs in the opposite order. For implimentations of all of these algorithms, see StormLib.
	40h: IMA ADPCM mono
	80h: IMA ADPCM stereo
	01h: Huffman encoded
	02h: Deflated (see ZLib)
	08h: Imploded (see PKWare Data Compression Library)
	10h: BZip2 compressed (see BZip2)

If the file is stored as a single unit (indicated in the file's Flags), there is effectively only a single sector, which
contains the entire file.

If the file is encrypted, each sector (after compression and appendage of the compression type byte, if applicable)
is encrypted with the file's key. The base key for a file is determined by a hash of the file name stripped of the
directory (i.e. the key for a file named "directory\file" would be computed as the hash of "file"). If this key is
adjusted, as indicated in the file's Flags, the final key is calculated as ((base key + BlockOffset - ArchiveOffset)
XOR FileSize) (StormLib - an open-source implementation of the MoPaQ reading and writing functions,
by Ladislav Zezula - incorrectly uses an AND in place of the XOR). Each sector is encrypted using the key + the
0-based index of the sector in the file. The SectorOffsetTable, if present, is encrypted using the key - 1.

The SectorOffsetTable is omitted when the sizes and offsets of all sectors in the file are calculatable from the FileSize.
This can happen in several circumstances. If the file is not compressed/imploded, then the size and offset of all sectors
is known, based on the archive's SectorSizeShift. If the file is stored as a single unit compressed/imploded, then the
SectorOffsetTable is omitted, as the single file "sector" corresponds to BlockSize and FileSize, as mentioned previously.
Note that the SectorOffsetTable will always be present if the file is compressed/imploded and the file is not stored as
a single unit, even if there is only a single sector in the file (the size of the file is less than or equal to the
archive's sector size).

2.6 LISTFILE
The listfile is a very simple extension to the MoPaQ format that contains the file paths of (most) files in the archive.
The languages and platforms of the files are not stored in the listfile. The listfile is contained in the file "(listfile)",
and is simply a non-Unix-style text file with one file path on each line, lines terminated with the bytes 0Dh 0Ah. The file
"(listfile)" may not be listed in the listfile.

2.7 EXTENDED ATTRIBUTES
The extended attributes are optional file attributes for files in the block table. These attributes were added at times after
the MoPaQ format was already finalized, and it is not necessary for every archive to have all (or any) of the extended attributes.
If an archive contains a given attribute, there will be an instance of that attribute for every block in the block table, although
the attribute will be meaningless if the block is not a file. The order of the attributes for blocks correspond to the order of the
blocks in the block table, and are of the same number. The attributes are stored in parallel arrays in the "(attributes)" file,
in the archive. The attributes corresponding to this file need not be valid (and logically cannot be). Unlike all the other
structures in the MoPaQ format, entries in the extended attributes are NOT guaranteed to be aligned. Also note that in some
archives, malicious zeroing of the attributes has been observed, perhaps with the intent of breaking archive viewers. This
file is structured as follows:

00h: int32 Version : Specifies the extended attributes format version. For now, must be 100.
04h: int32 AttributesPresent : Bit mask of the extended attributes present in the archive:
	00000001h: File CRC32s.
	00000002h: File timestamps.
	00000004h: File MD5s.
08h: int32(BlockTableEntries) CRC32s : CRC32s of the (uncompressed) file data for each block in the archive. Omitted if the
archive does not have CRC32s. immediately after CRC32s: FILETIME(BlockTableEntries) Timestamps : Timestamps for each block
in the archive. The format is that of the Windows FILETIME structure. Omitted if the archive does not have timestamps.
immediately after Timestamps: MD5(BlockTableEntries) MD5s : MD5s of the (uncompressed) file data for each block in the archive.
Omitted if the archive does not have MD5s.

2.8 WEAK DIGITAL SIGNATURE
The weak digital signature is a digital signature using Microsoft CryptoAPI. It is an implimentation of the RSASSA-PKCS1-v1_5
digital signature protocol, using the MD5 hashing algorithm and a 512-bit (weak) RSA key (for more information about this
protocol, see the RSA Labs PKCS1 specification). The public key and exponent are stored in a resource in Storm. The signature
is stored uncompressed, unencrypted in the file "(signature)" in the archive. The archive is hashed from the beginning of the
archive (ArchiveOffset in the containing file) to the end of the archive (the length indicated by ArchiveSize); the signature
file is added to the archive before signing, and the space occupied by the file is considered to be all binary 0s during
signing/verification. This file is structured as follows:

00h: int32 Unknown : Must be 0.
04h: int32 Unknown : must be 0.
08h: int512 Signature : The digital signature. Like all other numbers in the MoPaQ format, this is stored in little-endian order.

2.9 STRONG DIGITAL SIGNATURE
The strong digital signature uses a simple proprietary implementation of RSA signing, using the SHA-1 hashing algorithm and
a 2048-bit (strong) RSA key. The default public key and exponent are stored in Storm, but other keys may be used as well.
The strong digital signature is stored immediately after the archive, in the containing file; the entire archive (ArchiveSize
bytes, starting at ArchiveOffset in the containing file) is hashed as a single block. The signature has the following format:

00h: char(4) Magic : Indicates the presence of a digital signature. Must be "NGIS" ("SIGN" backwards).
04h: int2048 Signature : The digital signature, stored in little-endian format.

When the Signature field is decrypted with the public key and exponent, and the result stored in little-endian order, it is structured as follows:

00h: byte Padding : Must be 0Bh.
01h: byte(235) Padding : Must be BBh.
ECh: byte(20) SHA-1 : SHA-1 hash of the archive, in standard SHA-1 format.

3. ALGORITHM SOURCE CODE
3.1 ENCRYPTION/DECRYPTION
I believe this was derived at some point from code in StormLib. Assumes the long type to be 32 bits, and the machine to be little endian order.

unsigned long dwCryptTable[0x500];

void InitializeCryptTable()
{
    unsigned long seed   = 0x00100001;
    unsigned long index1 = 0;
    unsigned long index2 = 0;
    int   i;

    for (index1 = 0; index1 < 0x100; index1++)
    {
        for (index2 = index1, i = 0; i < 5; i++, index2 += 0x100)
        {
            unsigned long temp1, temp2;

            seed  = (seed * 125 + 3) % 0x2AAAAB;
            temp1 = (seed & 0xFFFF) << 0x10;

            seed  = (seed * 125 + 3) % 0x2AAAAB;
            temp2 = (seed & 0xFFFF);

            dwCryptTable[index2] = (temp1 | temp2);
        }
    }
}

void EncryptData(void *lpbyBuffer, unsigned long dwLength, unsigned long dwKey)
{
    unsigned long *lpdwBuffer = (unsigned long *)lpbyBuffer;
    unsigned long seed = 0xEEEEEEEE;
    unsigned long ch;

	assert(lpbyBuffer);

    dwLength /= sizeof(unsigned long);

    while(dwLength-- > 0)
    {
        seed += dwCryptTable[0x400 + (dwKey & 0xFF)];
        ch = *lpdwBuffer ^ (dwKey + seed);

        dwKey = ((~dwKey << 0x15) + 0x11111111) | (dwKey >> 0x0B);
        seed = *lpdwBuffer + seed + (seed << 5) + 3;

		*lpdwBuffer++ = ch;
    }
}

void DecryptData(void *lpbyBuffer, unsigned long dwLength, unsigned long dwKey)
{
	unsigned long *lpdwBuffer = (unsigned long *)lpbyBuffer;
    unsigned long seed = 0xEEEEEEEE;
    unsigned long ch;

	assert(lpbyBuffer);

    dwLength /= sizeof(unsigned long);

    while(dwLength-- > 0)
    {
        seed += dwCryptTable[0x400 + (dwKey & 0xFF)];
        ch = *lpdwBuffer ^ (dwKey + seed);

        dwKey = ((~dwKey << 0x15) + 0x11111111) | (dwKey >> 0x0B);
        seed = ch + seed + (seed << 5) + 3;

		*lpdwBuffer++ = ch;
    }
}

3.2 HASHING
Based on code from StormLib.

// Different types of hashes to make with HashString
#define MPQ_HASH_TABLE_OFFSET	0
#define MPQ_HASH_NAME_A	1
#define MPQ_HASH_NAME_B	2
#define MPQ_HASH_FILE_KEY	3

unsigned long HashString(const char *lpszString, unsigned long dwHashType)
{
    unsigned long  seed1 = 0x7FED7FED;
    unsigned long  seed2 = 0xEEEEEEEE;
    int    ch;

    while (*lpszString != 0)
    {
        ch = toupper(*lpszString++);

        seed1 = dwCryptTable[(dwHashType * 0xFF) + ch] ^ (seed1 + seed2);
        seed2 = ch + seed1 + seed2 + (seed2 << 5) + 3;
    }
    return seed1;
}

3.3 CONVERSION OF FILETIME AND time_t

#define EPOCH_OFFSET 116444736000000000ULL	// Number of 100 ns units between 01/01/1601 and 01/01/1970

bool GetTimeFromFileTime(FILETIME &fileTime, time_t &time)
{
	// The FILETIME represents a 64-bit integer: the number of 100 ns units since January 1, 1601
	unsigned long long nTime = ((unsigned long long)fileTime.dwHighDateTime << 32) + fileTime.dwLowDateTime;

	if (nTime < EPOCH_OFFSET)
		return false;

	nTime -= EPOCH_OFFSET;	// Convert the time base from 01/01/1601 to 01/01/1970
	nTime /= 10000000ULL;	// Convert 100 ns to sec

	time = (time_t)nTime;

	// Test for overflow (FILETIME is 64 bits, time_t is 32 bits)
	if ((nTime - (unsigned long long)time) > 0)
		return false;

	return true;
}

void GetFileTimeFromTime(time_t &time, FILETIME &fileTime)
{
	unsigned long long nTime = (unsigned long long)time;

	nTime *= 10000000ULL;
	nTime += EPOCH_OFFSET;

	fileTime.dwLowDateTime = (DWORD)nTime;
	fileTime.dwHighDateTime = (DWORD)(nTime >> 32);
}