Distributed Files


A distributed file holds no data. Instead it is simply a reference to a set of separate dynamic files that may be treated by an application as though they were a single file.

 

Distributed files allow an application to break a large data set into smaller pieces and reconstruct it in various ways to optimise performance. For example, a sales processing system might store orders for each month as a separate file (ORDERS-JAN09, ORDERS-FEB09, ORDERS-MAR09, etc.). A report on orders in the current month then needs to query only the file that holds that month's orders. A report based on all orders ever received would require all the separate orders files to be processed. Rather than running multiple separate queries and merging the results in some way, this is achieved by creating a distributed file that references all of the monthly order files.

 

A distributed file gives the application the ability to access data in all of its component part files without any special logic in the application itself. When a record is to be accessed, QM will work out which part file would contain the record by applying the partitioning algorithm that defines the distributed file structure. This algorithm is supplied by the user and is effectively a higher level of hashing that translates a record id to a file part number.
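
The routing described above can be pictured with a small Python model. This is only an illustrative sketch, not how QM is implemented: the class name DistributedFileModel, the function month_part and the "YYMM-sequence" id layout are all assumptions made for the example, and plain dictionaries stand in for the part files.

```python
class DistributedFileModel:
    """Toy model: routes each record id to a part via a partitioning function."""

    def __init__(self, partition, parts):
        self.partition = partition   # callable: record id -> integer part number
        self.parts = parts           # part number -> dict standing in for a part file

    def _part_for(self, record_id):
        # Apply the partitioning function, then pick the matching part.
        return self.parts[self.partition(record_id)]

    def write(self, record_id, record):
        self._part_for(record_id)[record_id] = record

    def read(self, record_id):
        # Returns None when the record is absent from its part.
        return self._part_for(record_id).get(record_id)


def month_part(record_id):
    # Hypothetical id layout "YYMM-sequence", e.g. "0901-00042" -> part 901.
    return int(record_id[:4])


orders_jan09 = {}
orders_feb09 = {}
all_orders = DistributedFileModel(month_part, {901: orders_jan09, 902: orders_feb09})

all_orders.write("0901-00042", {"customer": "ACME"})
all_orders.write("0902-00007", {"customer": "Globex"})
```

After the two writes, each record sits in the dictionary for its own month, yet both can be read back through all_orders without the caller naming a part.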

 

There is no reason why multiple distributed files cannot reference the same part files in different combinations. As well as having a distributed file that combines all of the monthly orders files in the example above, we could also have distributed files to bring together all orders for a quarter or for a year.

 


The underlying requirements for a distributed file are:

1. The record ids must be unique across the entire set of part files. In the example above, the ids would probably be order numbers.
2. Given a record id, it must be possible for the partitioning algorithm to determine which part file the record would be in. This transformation may be as simple or as complex as the application design requires. In our order processing example, having the date as part of the record id would make the process very simple. On the other hand, the partitioning algorithm could use a list of order number ranges for each month.
3. The part files would normally hold records with the same structure and a single dictionary would be used for all parts as well as for the distributed file itself.
4. The part files may not be data collection files.
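
The second requirement mentions a partitioning scheme driven by a list of order number ranges. The Python sketch below shows one way such a lookup could work, using a binary search over the first order number of each month. The range boundaries, part numbers and the function name part_for_order are invented for illustration. In QM the equivalent logic would have to be expressed as an I-type expression.

```python
from bisect import bisect_right

# Hypothetical table: the first order number issued in each month and the
# part number chosen for that month's file. Values are illustrative only.
RANGE_STARTS = [10000, 12500, 15200]   # must be sorted ascending
RANGE_PARTS = [901, 902, 903]          # part number for each range


def part_for_order(order_no):
    """Return the part number whose range contains order_no."""
    pos = bisect_right(RANGE_STARTS, int(order_no)) - 1
    if pos < 0:
        raise ValueError(f"order {order_no} precedes the first range")
    return RANGE_PARTS[pos]
```

With this table, order 12499 maps to part 901 and order 12500 to part 902. The last range is open ended, so any order number from 15200 upwards maps to part 903.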

 

 

The Partitioning Algorithm

 

The partitioning algorithm is written as an I-type expression in the dictionary that defines the part files. This calculation must be based only on examination of the record id, not the data in the record, because the data is not available when identifying the part that will contain the record for a read operation. It is possible for the expression to use TRANS() functions to reference other files but the performance impact of this could be very severe as the expression will be evaluated for every read, write or delete.

 

The partitioning algorithm must return an integer part number. This must be in the range 0 to 2147483647 and does not need to be a simple sequential number. Our order file example might use the year and month as a four digit number (0811, 0812, 0901, 0902, etc).
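
As a rough illustration of the year and month scheme, the Python sketch below derives the part number from a record id. The id layout "YYYYMMDD*sequence" and the function name part_from_dated_id are assumptions made for the example. Note that because the result is an integer, a value written as 0811 is simply part number 811.

```python
MAX_PART_NO = 2147483647  # upper limit for a part number, as stated above


def part_from_dated_id(record_id):
    """Map a hypothetical 'YYYYMMDD*sequence' id to a YYMM part number."""
    date_part = record_id.split("*", 1)[0]
    if len(date_part) != 8 or not date_part.isdigit():
        raise ValueError(f"unexpected id layout: {record_id!r}")
    part_no = int(date_part[2:6])   # "0811" becomes the integer 811
    assert 0 <= part_no <= MAX_PART_NO
    return part_no
```

The function looks only at the record id, never at record data, which matches the restriction placed on the partitioning algorithm.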

 

Where a file uses case insensitive record ids, the result of the partitioning algorithm must not be affected by the casing of the supplied id. This is because case insensitivity is a property of the part file and is therefore unknown when the expression is evaluated.
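
One way to meet this requirement is to fold the id to a single case before examining it. The Python sketch below illustrates the idea with an invented scheme that splits records by the first letter of the id. The function name part_for_code, the part numbers and the catch-all part 0 are all hypothetical.

```python
def part_for_code(record_id):
    """Pick a part from the first letter of the id, ignoring its case."""
    first = record_id[:1].upper()   # fold case before any comparison
    if not ("A" <= first <= "Z"):
        return 0                    # hypothetical catch-all part
    return 1 if first <= "M" else 2
```

Because the comparison is made on the upper case form, "acme" and "ACME" always resolve to the same part.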

 

Because an application can update the individual part files independently of the distributed file, it is essential that records written in this way are placed into the correct part file. Failure to ensure this may lead to inconsistent results such as a record appearing in a select list but not being found by a read operation.

 

Attempting to access a distributed file where the partitioning algorithm returns a value that does not correspond to any of the file parts will cause the operation to take the ON ERROR clause. If this clause is omitted, the program will abort. The last part number returned by the partitioning algorithm, including errors, can be determined using key FL$LAST.PART in the FILEINFO() function.
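
The behaviour described above can be modelled in Python as follows. This is a sketch of the concept only: the exception UnknownPartError stands in for the ON ERROR path, the attribute last_part loosely mirrors what FL$LAST.PART reports, and the class name RoutedFile is invented for the example.

```python
class UnknownPartError(Exception):
    """Stands in for the error path taken when no part matches."""


class RoutedFile:
    def __init__(self, partition, parts):
        self.partition = partition   # callable: record id -> part number
        self.parts = parts           # part number -> dict standing in for a part file
        self.last_part = None        # most recent part number computed

    def read(self, record_id):
        part_no = self.partition(record_id)
        self.last_part = part_no     # recorded even when the lookup fails
        if part_no not in self.parts:
            raise UnknownPartError(f"no part numbered {part_no}")
        return self.parts[part_no].get(record_id)


f = RoutedFile(lambda rid: int(rid[:4]), {901: {"0901-1": "order"}})
try:
    f.read("0912-5")                 # part 912 has not been added
except UnknownPartError:
    print("routing failed, last part number was", f.last_part)
```

Keeping the computed part number available after a failure makes it possible to report which part the algorithm was trying to reach.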

 

 

Distributed File Related Commands

 

A distributed file is created using the ADD.DF command:

ADD.DF dist.file part.file part.no algorithm {RELATIVE}

where

dist.file    is the name of the distributed file.
part.file    is the name of the part file to be added.
part.no      is the part number associated with this part file.
algorithm    is the name of the partitioning algorithm in the dictionary of part.file.

 

First use of the command will create the distributed file, compiling the partitioning algorithm and adding the named part file. The newly created distributed file will share the dictionary of the first part file. Subsequent use of the ADD.DF command will add further part files. The algorithm need not be specified for the second and subsequent parts and will be ignored if present.

 

Note that the partitioning algorithm is copied into the distributed file when it is created. Changing the expression in the dictionary afterwards has no effect on an existing distributed file. To apply a changed algorithm, it is necessary to reconstruct the distributed file.

 

A new part may be added to a distributed file at any time, though QM processes that already have the file open will not see the new part until the file is closed and reopened.

 

 

A part file can be removed from a distributed file using the REMOVE.DF command:

REMOVE.DF dist.file part.file

where

dist.file    is the name of the distributed file.
part.file    is the name of the part file to be removed. The part number can be used instead of the name. Use of the keyword ALL in place of part.file deletes the entire distributed file.

 

Removing the final part also deletes the distributed file. A part may be removed at any time, though QM processes that already have the file open will continue to use the part until the file is closed and reopened.

 

 

The components of a distributed file can be listed using the LIST.DF command:

LIST.DF dist.file

where

dist.file    is the name of the distributed file.

 

This command shows a list of the part file pathnames and their associated part numbers.

 

 

 

Alternate Key Indices in Distributed Files

 

Alternate key indices can be used with distributed files but are defined on the part files in the usual way, not on the distributed file. When a distributed file is opened, QM determines which index names are defined in all of the part files and only these indices are available when referencing the distributed file. Note that this process is based only on the index name. If two part files have indices of the same name that are defined differently, the effects are undefined.
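
The rule that only indices defined in every part file are available amounts to a set intersection on index names. The Python sketch below illustrates this. The index names, part numbers and the function name common_index_names are invented for the example, and the comparison is by name alone, as described above.

```python
def common_index_names(part_indices):
    """Return index names present in every part, sorted for stable output."""
    name_sets = [set(names) for names in part_indices.values()]
    if not name_sets:
        return []
    return sorted(set.intersection(*name_sets))


# Hypothetical index names defined on three part files.
part_indices = {
    901: ["CUSTOMER", "ORDER.DATE"],
    902: ["CUSTOMER", "ORDER.DATE", "REGION"],
    903: ["CUSTOMER"],
}
available = common_index_names(part_indices)
```

In this example only CUSTOMER would be usable through the distributed file, because ORDER.DATE is missing from part 903 and REGION exists only on part 902.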

 

 

 

QMBasic Programming with Distributed Files

 

A QMBasic program that opens a distributed file can access the data in this file in exactly the same way as for any other file. The partitioning algorithm will be applied internally by QM for all record level operations to determine which part file should be accessed.

 

Locks are maintained at the part file level. Locking a record via an operation that references the distributed file will apply the lock to the part file in which that record would reside based on the partitioning algorithm.

 

Obtaining a file lock on a distributed file will acquire the file lock on all of the component part files. Because a process is never blocked by its own locks, a program can obtain the file lock on an individual part file that has been opened separately and then go on to obtain the file lock on the distributed file. Releasing the file lock on the distributed file would effectively also release the file lock on the individual part file. Conversely, releasing the file lock on the individual part file would mean that the file lock on the distributed file was no longer complete. Given the probable inappropriateness of file locks in distributed files, it is unlikely that this situation is of concern to application developers.

 

The FILEINFO() function will return a file type code of FL$TYPE.DIST (7). The FL$PARTS key value of FILEINFO() will return a field mark delimited list of the part numbers of each part file. Many of the other key values of this function are inappropriate to distributed files but can be applied to an individual part by using the DFPART() function to reference the part file.

 

The INDICES() function will return data for indices that are common to all part files.

 

The SELECTINDEX statement will return composite data constructed from the indices of all part files. The index scan position pointer normally set by this operation is irrelevant to distributed files.

 

The index scanning functions provided by SETLEFT, SETRIGHT, SELECTLEFT and SELECTRIGHT are not available with distributed files.

 

 

 

Other Uses of Distributed Files

 

Distributed files can also be used to:

1. Spread a large file over multiple disks for load balancing.
2. Overcome operating system file size limits.
3. Store a file that is larger than a single available disk drive.
4. Physically split a file over multiple servers using QMNet.

 

 

System Administration Issues

 

Use of distributed files with a large number of part files requires the system administrator to think carefully about the setting of the NUMFILES configuration parameter as each part file will be opened internally as a separate file.