i-effect® *ZIP basics

Algorithms

The"deflation Algorithm"used by zip and gzip is a variation of LZ77 (Lempel-Ziv 1977, see reference below). It finds duplicated strings in the input data. The second occurrence of a string is replaced by a pointer to the previous string, in the form of a pair (distance, length). Distances are limited to 32K bytes, and lengths are limited to 258 bytes. When a string does not occur anywhere in the previous 32K bytes, it is emitted as a sequence of literal bytes. (In this description, `string' must be taken as an arbitrary sequence of bytes, and is not restricted to printable characters.)

Literals or match lengths are compressed with one Huffman tree, and match
distances are compressed with another tree. The trees are stored in a compact form at the start of each block. The blocks can have any size (except that the compressed data for one block must fit in available memory). A block is terminated when zip determines that it would be useful to start another block with fresh trees.

Duplicated strings are found using a hash table. All input strings of length 3 are inserted in the hash table. A hash index is computed for the next 3 bytes. If the hash chain for this index is not empty, all strings in the chain are compared with the current input string, and the longest match is selected.

The hash chains are searched starting with the most recent strings, to favor small
distances and thus take advantage of the Huffman encoding. The hash chains are singly linked. There are no deletions from the hash chains, the algorithm simply discards matches that are too old.

To avoid a worst-case situation, very long hash chains are arbitrarily truncated at a certain length, determined by a runtime option (1 to 9). So zip does not always find the longest possible match but generally finds a match which is long enough. Additionally, zip defers the selection of matches with a lazy evaluation mechanism.

After a match of length N has been found, zip searches for a longer match at the next input byte. If a longer match is found, the previous match is truncated to a length of one (thus producing a single literal byte) and the process of lazy evaluation begins again. Otherwise, the original match is kept, and the next match search is attempted only N steps later.
The lazy match evaluation is also subject to a runtime parameter. If the current match is long enough, zip reduces the search for a longer match, thus speeding up the whole process. If compression ratio is more important than speed, zip attempts a complete second search even if the first match is already long enough.

The lazy match evaluation is not performed for the fastest compression modes (level parameter 1 to 3). For these fast modes, new strings are inserted in the hash table only when no match was found, or when the match is not too long. This degrades the compression ratio but saves time since there are both fewer insertions and fewer searches.
according to
Jean-loup Gailly
gzip@prep.ai.mit.edu
jloup@gzip.org

Gzip File Format

The pkzip format imposes a lot of overheads in different headers, which are practical for an archive program, but not for compression of a single file. Gzip uses an easier structure: Figures are saved in the little endian file format and bit 0 is the least significant bit. A gzip file is a sequence of compressed members. Every member has the following structure:

2 Bytes	magic header 0x1f, 0x8b (\037 \213)
1 Byte	compression method (0..7 reserved, 8 = deflate)
1 Byte	flags
	bit 0 set: file is probably ascii text file
	bit 2 set: extra field available
	bit 3 set: original file name available
	bit 4 set: file comment available
	bit 5 set: file is encoded
	bit 6,7: reserved
4 Bytes	date of file change in Unix file format
1 Byte	extra flags (depending on the method of compression)
1 Byte	System software on which the compression was conducted
2 Bytes	optional parts number (second part = 1)
2 Bytes	optional extra field lenght
? Bytes	optional extra field
? Bytes	optional original file name, X'00' in the end
? Bytes	optional file comment, X'00' in the end
12 Bytes	optional encription header
? Bytes	compressed data
4 Bytes	crc32
4 Bytes	unsummarized file lenght modulo 2^32

The file format was developped in order to

(a)	conduct compression in "single-pass" without a reverse reading being necessary and
(b)	without it being necessary or required to know the lenght of the uncompressed file or the amout of memory capacity on the output medium. If the input does not come from a regular disk file, the modification date will be set to the date of the beginning of compression.

Generally, the timestamp is practical if the gzip file is to be transmitted via a network. In this case, it is not sufficient to preserve only the file attributes. In the present case, the attributes of gzip are preserved during compression/decompression, a timestamp 0 will be ignored.
Bit 0 in the flags is an optional index that can be set through a short reading of the input file. When in doubt, the bit will be deleted in order to display binary data. On systems supporting different file formats for ASCII text and binary data, the decompressor can use this flag to choose the right format.

The extra field, if available, must consist of at least one subfield, each of them with the following format:

Unterfeld ID	2 Bytes
Unterfeld Größe:	2 Bytes (little-endian Format)

Subfield Data

The subfield ID may consist of two characters of symbolic meaning. IDs with a
second 0 byte are reserved for later use. Currently, the follwing IDs are defined:

Ap (0x41, 0x70): Apollo file type information

The subfield size is the size of subfield data and does not contain the ID or the size itself.
The field "extra field size" is the total lenght of the extra field, including the subfield ID and subfield size.
It needs to be possible to determine the end of the compressed data without depending on compression method or actual size of the compressed data. If the compressed data does not fit into a file (espsecially when using disks or tapes), every part will start with the head described above, but only the last part carries the information "crc32" and "uncompessed length". For such files compressed in various parts, a decompressor may send user requirements for additionally needed parts.
It is preferable, but not necessary, to decompress individual parts of a file independently from one another, in case a part is damaged. This will only be possible if the compression status is not taken over from one part to another.
If the file is compressed on a system using capital and lower case letters in its file system, the original file name must be changed into lower case letters. There will be no original file name if data is compressed before standard input.
Compression is always performed, even if the compressed file is slightly larger than the original. The worst case expansion is a few bytes for the gzip file header, plus 5 bytes every 32K block, or an expansion ratio of 0.015% for large files. Note that the actual number of used disk blocks almost never increases.

The encryption is a ZIP 1.9 encryption. The last byte of the decoded encryption header will be examined for the encryption checkup.
It must be 0. The timestamp of an encoded file can be set to 0 in order to avoid the derivation of the composition of the header.

according to
Jean-loup Gailly
gzip@prep.ai.mit.edu
jloup@gzip.org

Indication of References

[LZ77] Ziv J., Lempel A., „A Universal Algorithm for Sequential Data
Compression", IEEE Transactions on Information Theory", Vol. 23, No. 3, pp. 337-343.
APPNOTE.TXT documentation file in PKZIP 1.93a. It is available by
ftp in ftp.cso.uiuc.edu:/pc/exec-pc/pkz193a.exe [128.174.5.59]
Use „unzip pkz193a.exe APPNOTE.TXT" to extract (note: unzip, not gunzip).

ZIP File Format

In addition to the previously described gzip file format, i-effect®*ZIP supports the ZIP archive file format, which can be found on many platforms. On Windows platforms for example, it became famous through products like WINZIP™.
A ZIP archive is a file in which compressed data of several files is stored. Appro-
priate software functions allow addition, deletion, extraction, or display of compressed contents.
i-effect®with its range of functions in the optional module *ZIP is fully compatible with the common ZIP file format and, therefore, can exchange archived data with other platforms (e.g. WINZIP™).
Within an archive file (.ZIP file), different object types can be compressed. Currently, the following file types are supported: Physical files (PF-DTA), source files (SRC-PF), backup files (SAVF), files from IFS files systems and spooled files.
When restoring these object types with i-effect®*ZIP on an IBM System i, the original object types will be received. Spooled files will be placed into output queues as actual spooled entries, compressed backup files become actual backup files and so on.

EDIFACT Basics

The EDIFACT set of regulations provides a comprehensive basis of message definitions for many different commercial documents. The acronym EDIFACT stands for the claim to exchange routine business documents for adminstration, commerce, and transport in a standard format:

Electronic
Data
Interchange
For
Administration
Commerce and
Transport

The United Nations are a significant responsible for this standardization. Documents originating from the United Nations organizational structure are called UN/EDIFACT. In addition to these definitions, document formats from other institutions (e.g. Odette, representatives of the automotive industry), edited according to equal codification rules, exist. However, in their allover representation they do not correspond to the UN documents.
Because of its flexible database table logic, i-effect®is able to convert every document that is edited according to the EDIFACT syntax, inasmuch as an appropriate implementation code is provided. The documents have to be in accordance with the ISO 9735 syntax rules. A German version is available at the German Institute for Standardization (DIN) under publication number 16557.
Further information concerning EDIFACT in general and its practical application is provided by EDI consulting services and agencies or organizations involved in the standardization of EDIFACT documents, such as:

DIN	EDIFACT in general
DEDIG	EDIFACT related to practical application
CCG / GS1	Consumer goods management
BSL	Transport and Logistics
VDA	Automotive industry

and a lot more organizations and associations from various areas (e.g. textile, furniture, bank etc.)

EDIFACT Rules for Conversion

In the following, formal and technical EDIFACT rules, which are relevant for conver-sion, are described.
The smallest information unit in EDIFACT is a data item or a data sub item, inasmuch as a composition of several data items is concerned. Hence, data items are compo-nents from which information is composed, either as a simple data item or as a composed data item, a so-called Composit. Every data item has certain features such as:

Lenght	e.g. 4
Type of lenght	fixed, variable
Type of data	numeric, alphanumeric, alphabetical
Semantics	content description

Concering semantics, i.e. the description of the content of the field, EDIFACT diverges from usual encryption modes. The description on database fields or record type fields is based on the assumption that an unambiguous value is recorded in this field. One field is reserved for a telephone number, another field is reserved for a fax number. In EDIFACT, the unambiguous descriptions is realized by an interpretation of a further data field, a so-called Qualifier. In EDIFACT, the telephone and fax number example would be represented as follows:

Contact number: 0226123990	Qualifier: TE
Contact number: 0226191845	Qualifier: FX

The contact number is a data field for any type of communication connection to the described contact person; the used type of service (Telefone, fax, voice mail, telex, BTX screen text, Citytruf, GSM) is specified by the Qualifier. This type of representation was used to make sure that, in case of newly emerging means of communication, only the code list (list of all Qualifiers) needs to be updated, while the original form of the message is kept.
The composition of several data items and data item groups form a segment. Every segment contains a distinct, three-character description (segment tag). It describes a specific part of a document information, e.g.:

NAD	name and addresses
DTM	date/time/period
MOA	monetary amount
TDT	details of transport

Different segments can be combined into segment groups, inasmuch as they belong together with regard to contents. In some cases, a line item description in an order requires additional information, which is then combined into a segment group:

segment group LIN-IMD-PIA-API...

LIN	line item
IMD	item description
PIA	refund and reference
API	additional product ID

The composition of all segments and segment groups results in the actual message definition:

INVOIC	invoice
ORDERS	order
IFTMIN	instruction message

At the moment, more than 200 documents are defined in EDIFACT, which forms the basis for an intense and paperless exchange between companies, organizations, and other involved partners.
EDIFACT messages underlie a specified structure, whereas a distinction between service segment and generic segment is necessary.
EDIFACT assumes that the generated messages are transmitted from partner A to partner B via remote data processing. Therefore, special segments for the trans-
mission of a physical file were defined. A physical file will be embraced by the segments

CODE

UNB
...
UNZ

and contains information for an addressee who is indicated in the UNB segment. Within this data interchange, different document types may occur, e.g. delivery notes and invoices. The document types are seperated from each other with the help of the segments

CODE

UNG
....
UNE

The actual message stands between the segments

CODE

UNH
....
UNT

The total structure of the data interchange may be as follows:

UNB			beginning of data interchange
	UNG		delivery notes
		UNH	delivery note A
		...
		UNT	end delivery note A
		UNH	delivery note B
		...
		UNT	end delivery note B
	UNE		end delivery notes
	UNG		invoices
		UNH
		...
		UNT	end invoice A
		UNH	invoice B
		...
		UNT	end invoice B
	UNE		end invoices
UNZ			end of data interchange

Every message the converter has to edit requires a clearly directed code conversion which contructs a relation (mapping) between the inhouse page and the EDIFACT page. This assignment has to be effected for every single data item within the segments and for every single type of segment in a segment group, inasmuch as the segment is assigned differently in different positions. The specific table options are described in the following chapter.

[data-colorid=xae5mr6nzu]{color:#383838} html[data-color-mode=dark] [data-colorid=xae5mr6nzu]{color:#c7c7c7}Algorithms

Gzip File Format

Subfield Data

Indication of References

ZIP File Format

EDIFACT Basics

EDIFACT Rules for Conversion

Algorithms