Scan_rtn.c

Background

Increasingly, the SCAN_FILE and SCAN_RECORD routines are at the heart of most of the complex programs that revolve around the Fictionmags data format, specifically:

This is because the routine reads through a specified set of files and sets up a scandata structure containing all the data in those files in a structured manner, and often from multiple viewpoints, as discussed below. This structure can then be examined by the various programs as required to extract the data they need.

Note that the primary drivers in developing the code have been XVALIDATE and GENATTRIB - adjustments may need to be made in future to accommodate the other programs. However, the emphasis has been to try to streamline the code as far as possible to address the first two in the hopes that future extensions will be well-documented.

Bylines

One of the early complications was in the handling of bylines. When trying to match up two records containing, say:

E A0~House1 ,(hp:Author1)~Example3~ss|ACT|1929|Aug|(v8:12)|~  (original)
E A0~Author1~Example3 [as by House1]~ss1929ACTAug~            (reprint)

we have two choices. The first (and prior) approach was to regard the original byline as being "House1 ,(hp:Author1)" which worked fine in simple cases, but fell apart in some of the more complex cases using ", (by:" such as initials or "The Author of xxx" as there was no easy way of predicting what the original byline would be. Similar problems (in theory at least) would arise if you had an original appearance written under two different house names with two different authors (as happened a lot in Wild West Weekly for example) that was then reprinted under the authors' own names, as there would be no way of knowing which author had used which house name.

The second (and current) approach is the opposite: regard the byline as exactly what is used in the book/magazine - i.e. in the above case, "House1" - which should be easily deducible in all cases without any of the problems discussed above. It does, however, introduce problems when the same title is used by different people under the same house name (or anonymously) or when, by mistake, the real author behind a house name is specified incorrectly - these are discussed in more detail under Real Authors below.

Note that, when we say "exactly what is used in the book/magazine", this is not quite true as we need to normalise all such names to their base formats (e.g. translating via ~04~ records) if we are to get a match.

Note also that if there is a secondary name qualified with ", (as told to:" or ", (as told by:" then these also need to be incorporated into the byline so that we can match up:

E 32A0~Branscombe, Arthur #2 ,(as told to:Rousseau, Victor)~Soul That Lost Its Way~ss|GHS|1927|Aug|(v3:2)|~The ~~Martinus| Doctor~
E 80A0~Anon.~Monkey-Face and Mrs. Thorpe ["The Soul That Lost Its Way", as by Arthur Branscombe #2 as told to Victor Rousseau]~ss1927GHSAug~~~Martinus|Doctor~

Note also that if a primary name is qualified with ", (err:" then we use the name behind the error as the byline instead.
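The byline rules above can be sketched as a small helper. This is only an illustration, not the code in Scan_rtn.c: the function name printed_byline is hypothetical, it assumes the qualifier appears as " ,(tag:Name)" as in the example records, it takes "the name behind the error" to mean the name inside the (err:) qualifier, and it omits the normalisation via ~04~ records entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Scan_rtn.c): copy the printed byline for
 * one author field into out.  Assumes a qualifier, if present, has the form
 * " ,(tag:Name)" as in the example records.  Returns 1 on success, or 0 if
 * the result does not fit in out.
 */
static int printed_byline(const char *auth, char *out, size_t outlen)
{
    const char *start = auth;
    const char *qual;
    size_t len;

    assert(auth != NULL && out != NULL && outlen > 0);
    len = strlen(auth);
    qual = strstr(auth, ",(");

    if (qual != NULL && strncmp(qual + 2, "err:", 4) == 0) {
        /* (err:Name): assume Name is the byline we want */
        const char *end;
        start = qual + 6;
        end = strchr(start, ')');
        len = (end != NULL) ? (size_t)(end - start) : strlen(start);
    } else if (qual != NULL && strncmp(qual + 2, "as told ", 8) != 0) {
        /* Other qualifiers such as (hp:...): keep only the primary name */
        len = (size_t)(qual - auth);
        while (len > 0 && auth[len - 1] == ' ')
            len--;
    }
    /* "as told to:" / "as told by:" qualifiers stay part of the byline */

    if (len >= outlen)
        return 0;
    memcpy(out, start, len);
    out[len] = '\0';
    return 1;
}
```

With this sketch, "House1 ,(hp:Author1)" yields "House1", while a field qualified with "as told to:" is kept whole so that it can be matched against the "as by" text on the reprint.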

Real Authors

The current approach to bylines raises a number of problems some (all?) of which weren't visible in the prior approach.

Firstly, XVALIDATE tries to police instances where the same author uses the same (or very similar) titles with different publication details and this was done by comparing adjacent scanitem elements (after sorting). However, if two different authors write items with the same title anonymously this is flagged up (unnecessarily) as a conflict because we have two records with an author (i.e. the original byline) of "Anon." with the same title but different publication details.

Similarly, if we consider a variant to the above example of:

E A0~House1 ,(hp:Author1)~Example3~ss|ACT|1929|Aug|(v8:12)|~  (original)
E A0~Author2~Example3 [as by House1]~ss1929ACTAug~            (reprint)
E A0~House1~Example3~ss1929ACTAug~ (reprint)

we would like to flag up the two reprints as being in error as neither matches the original (the first reprint because the author is wrong; the second because the author behind the house name has been omitted). However, with the current approach to bylines the only records we'll be comparing will all just say that the "original byline" was "House1" and hence will think everything is OK. (Note that, if ACT is a fully-validated magazine the first reprint would trigger an "original appearance not found" message, but even that wouldn't address the problem with the second reprint.)

The current thinking is to record the "real authors" associated with an item and then report an error if the author/title/publication details match but the real author(s) don't (the second case above) and not report an error if the author/title match but publication details don't if the real author(s) also don't (the first case above).
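The decision rule just described might be sketched as follows. The structure item_view, its field names and the verdict codes are all hypothetical (the real scanitem layout is defined in Scan_rtn.h), and the real names are assumed to have been reduced to a single comparable string already.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical view of the fields compared for two adjacent items */
struct item_view {
    const char *byline;      /* byline as printed, e.g. "House1"    */
    const char *title;       /* item title                          */
    const char *pubdetails;  /* publication details                 */
    const char *realnames;   /* real author(s), already canonical   */
};

enum verdict {
    VERDICT_UNRELATED,            /* treated as different items          */
    VERDICT_OK,                   /* everything agrees                   */
    VERDICT_REAL_AUTHOR_MISMATCH, /* same appearance, different authors  */
    VERDICT_PUBDETAILS_CONFLICT   /* same author/title, different source */
};

static enum verdict compare_items(const struct item_view *a,
                                  const struct item_view *b)
{
    int same_pub, same_real;

    assert(a != NULL && b != NULL);
    if (strcmp(a->byline, b->byline) != 0 || strcmp(a->title, b->title) != 0)
        return VERDICT_UNRELATED;

    same_pub = strcmp(a->pubdetails, b->pubdetails) == 0;
    same_real = strcmp(a->realnames, b->realnames) == 0;

    if (same_pub)
        return same_real ? VERDICT_OK : VERDICT_REAL_AUTHOR_MISMATCH;
    /* Different publication details only matter for the same real author */
    return same_real ? VERDICT_PUBDETAILS_CONFLICT : VERDICT_UNRELATED;
}
```

Two anonymous items with the same title but different real authors come out as unrelated rather than as a conflict, which is the behaviour wanted for the "Anon." case above.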

The challenge lies in working out who the "real authors" are. In our three examples just above we can do so by parsing the author on the item as that would produce "Author1", "Author2" and "House1" respectively, and hence give us a mismatch. However, if we consider a different valid combination:

E A0~Author1~Example3~ss|ACT|1929|Aug|(v8:12)|~               (original)
E A0~House1~Example3 [as by Author1]~ss1929ACTAug~            (reprint)

this would produce "Author1" for the first case and "House1" for the second case and hence generate an error, even though the reprint is actually valid.

The current thinking is to try to look at both the author on the item and any byline specified and combine the two. This is non-trivial as we need to ensure that:

There are still cases where this won't work, but hopefully they will be a small number of "false positives" that can be handled by the exceptions file.

Some points to be addressed:

Multiple Viewpoints

As mentioned in the introduction, a single input record may generate data looking at the same record from multiple viewpoints. Specifically:

Note that the scanitem elements generated are virtually identical in all cases generated from a single record. The only exceptions are:

Co-Authors

One of the many things that XVALIDATE tries to check is that all occurrences of an item from the viewpoint of a given author have the same co-authors, i.e. that:

E A0~Author1/Author2~Example3~ss|ACT|1929|Aug|(v8:12)|~                 (original)
E A0~House1~Example3 [as by Author1 & Author2]~ss1929ACTAug~            (reprint)

produces a match, but something like:

E A0~House1~Example3 [as by Author1]~ss1929ACTAug~            (reprint)

produces a mismatch. Given the possibilities of personal pseudonyms being used (validly) just about anywhere, experimentation has shown that the only viable way of handling co-authors is to specify the "real names" in all cases (where known). This might need adjusting when it comes to generating the website, but we'll think about that later.

Data Structures

The routines in Scan_rtn.c revolve around the use of two structures. The first is a scandata structure (defined in Scan_rtn.h and visible externally) which contains:

The second is an itemdetails structure that is only used internally and contains:

The master copy of this is set up in thisitem when parsing an A record, a DC record, or a set of E records and is passed to FLUSH_SORT. FLUSH_SORT then works out the different types of record that are needed (e.g. an EA record may need records for the current author(s), the original author(s), the artist(s) and/or the subject(s)), sets up a local version of the structure called curritem to reflect the records needed in each case, and calls FLUSH_SORT_PART to create the relevant records.


SCAN_FILE: Routine to scan a file and add the contents to a scandata structure

/************************************************************************/
/* */
/* SCAN_FILE - Scan file and populate a scandata structure. */
/* */
/* Calling Format: */
/* */
/* SCAN_FILE (inpfil_ptr, scandata_ptr, scan_type, prtfil_ptr); */
/* */
/* Where: */
/* */
/* inpfil_ptr = File pointer for file to scan */
/* scandata_ptr = Pointer to SCANDATA structure */
/* scan_type = Type of scan to perform */
/* prtfil_ptr = Pointer to diagnostics file (may be NULL) */
/* */
/* This routine reads each record in the file and passes it to */
/* SCAN_RECORD to populate the scandata structure. */
/* */
/************************************************************************/

SCAN_FILE is a simple interface to scan a complete file and create scanitem records for the content. As some programs (such as IdxGen) need more control over the process it is now simply a shell that reads the records and passes them through to SCAN_RECORD to do all the hard work.
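A shell of this kind might look like the sketch below. It is an assumption-laden illustration rather than the real SCAN_FILE: the names scan_all, record_fn and count_calls are hypothetical, the records come from an array instead of a file, and the callback stands in for SCAN_RECORD using only its documented call_type convention (1 for the first record, 0 for a normal record, -1 for the dummy call at end of file).

```c
#include <assert.h>
#include <stddef.h>

#define CALL_NORMAL 0     /* normal record            */
#define CALL_FIRST  1     /* first record in file     */
#define CALL_EOF    (-1)  /* dummy call at end of file */

/* Hypothetical stand-in for SCAN_RECORD */
typedef void (*record_fn)(const char *rec, int call_type, void *ctx);

/* Pass every record to fn, then make the dummy end-of-file call */
static void scan_all(const char *const *recs, size_t nrecs,
                     record_fn fn, void *ctx)
{
    size_t i;

    assert(fn != NULL);
    for (i = 0; i < nrecs; i++)
        fn(recs[i], i == 0 ? CALL_FIRST : CALL_NORMAL, ctx);
    fn(NULL, CALL_EOF, ctx);  /* lets the callee flush the last item */
}

/* Example callback: count how often each call type was seen */
struct counts { int first, normal, eof; };

static void count_calls(const char *rec, int call_type, void *ctx)
{
    struct counts *c = ctx;

    (void)rec;
    if (call_type == CALL_FIRST)
        c->first++;
    else if (call_type == CALL_EOF)
        c->eof++;
    else
        c->normal++;
}
```

The final dummy call matters because the item being built is only flushed when the next item starts, so without it the last item in the file would be lost.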

The only complicating factor is that XVALIDATE doesn't want to include any books that are flagged with a DQN or DQX record. As such SCAN_FILE needs to read all the D records for a book before processing its A record and then decide whether or not to process the A and D records depending on whether or not a DQN/X record was found. (This does not apply to IdxGen as that does not use SCAN_FILE.)
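The hold-back logic can be sketched as below. This is a simplified illustration under several assumptions: the names filter_books and flush_book are hypothetical, records are taken from an array, each record is assumed to begin with its type code (e.g. "A~" or "DQN~"), and every record up to the next A record is treated as belonging to the current book.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PENDING 64  /* arbitrary limit for this sketch */

static int is_type(const char *rec, const char *code)
{
    return strncmp(rec, code, strlen(code)) == 0;
}

/* Copy the held-back records to out unless the book was flagged */
static size_t flush_book(const char **pending, size_t npending, int skip,
                         const char **out, size_t nout)
{
    size_t i;

    if (!skip)
        for (i = 0; i < npending; i++)
            out[nout++] = pending[i];
    return nout;
}

/* Returns the number of records kept; out must have room for nrecs */
static size_t filter_books(const char **recs, size_t nrecs, const char **out)
{
    const char *pending[MAX_PENDING];
    size_t npending = 0, nout = 0, i;
    int skip = 0;

    for (i = 0; i < nrecs; i++) {
        if (is_type(recs[i], "A")) {
            /* New book: decide the fate of the previous one first */
            nout = flush_book(pending, npending, skip, out, nout);
            npending = 0;
            skip = 0;
        }
        if (is_type(recs[i], "DQN") || is_type(recs[i], "DQX"))
            skip = 1;
        assert(npending < MAX_PENDING);
        pending[npending++] = recs[i];
    }
    return flush_book(pending, npending, skip, out, nout);
}
```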


SCAN_RECORD: Routine to scan a record and add the contents to a scandata structure

/************************************************************************/
/* */
/* SCAN_RECORD - Populate a scandata structure from a single record. */
/* */
/* Calling Format: */
/* */
/* SCAN_RECORD (recbuf_ptr, fld_ptr, fldcnt, scandata_ptr, */
/* dpdate_ptr, scan_type, call_type, prtfil_ptr); */
/* */
/* Where: */
/* */
/* recbuf_ptr = Record buffer to scan */
/* fld_ptr = Array of fields in record buffer */
/* fldcnt = Count of fields in record buffer */
/* scandata_ptr = Pointer to SCANDATA structure */
/* dpdate_ptr = Pointer to date from DP record (if any) */
/* scan_type = Type of scan to perform (XVALIDATE/GENATTRIB...) */
/* call_type = Type of call to routine */
/* = 0 for normal record */
/* = 1 for first record in file */
/* = -1 for dummy call at end of file */
/* prtfil_ptr = Pointer to diagnostics file (may be NULL) */
/* */
/************************************************************************/

As the data is built up from multiple records, we call FLUSH_SORT whenever we're about to start compiling a new set of data (i.e. when we encounter an "EA" record or an "A" or "D" record and have some preceding data, identified by a non-empty authbuf), to flush the previous records. The routine revolves around the use of an itemdetails structure called thisitem. It builds up the details of the current item in this structure and then calls FLUSH_SORT to flush it when starting a new item (or when called with call_type=-1) as long as there is something in it. For 'A' records the fields are set up as follows:

Note that the remaining fields are used exclusively within FLUSH_SORT.

For DC records, we have:

All other fields are inherited from the previous 'A' record.

For EA records for which the item type is not "pu", we have:

All other fields are inherited from the previous 'A' record.

If the scan type is SCANTYPE_XVALIDATE, the routine also builds a list of all magazine and (real) book IDs (in magbuf_ptr) for which full or major cross-validation is required: this allows XVALIDATE to check that we have a magazine/book entry for all items in the database that use that magazine/book ID. This means adding an entry for the MagId specified on each "Features" record encountered when the validation level is 0 or 1 and adding an entry for each (normal) book ID. As a special case, if the routine encounters a DQE~VALFULL~ record, the routine resets the validation level (for that file) to 0 and adds an entry to the list of magazine IDs for it (if it hasn't already done so).


FLUSH_SORT: Internal routine to flush sort buffer to scandata structure

/************************************************************************/
/* */
/* FLUSH_SORT - Flush sort buffer to scandata structure(s) */
/* */
/* Calling Format: */
/* */
/* status = FLUSH_SORT (scandata_ptr, details_ptr, prtfil_ptr); */
/* */
/* Where: */
/* */
/* status = Result of operation: */
/* = PSP_TRUE if OK; else PSP_FALSE */
/* scandata_ptr = Pointer to scandata structure */
/* details_ptr = Pointer to itemdetails structure */
/* prtfil_ptr = Pointer to diagnostics file (may be NULL) */
/* */
/************************************************************************/

This routine basically calls FLUSH_SORT_PART a number of times to create the relevant sort records for the item in details_ptr from a number of perspectives. These are:

  1. records for the main authors specified for the item using only the item title if we have a '|' or '||' divider
  2. if we're not doing an XVALIDATE and we have a '|' or '||' divider then we write another record with the full title
  3. if we're doing an XVALIDATE and there is a series divider ('|') in the title, we call it for (just) the series name so that we can check they are consistent
  4. if we had a previous/original title and byline (see below) we call it for that byline and/or title
  5. if we're not doing an XVALIDATE and there are one or more artists on the record, then we call it for artist(s)
  6. if we're not doing an XVALIDATE and there are one or more subjects on the record, then we call for the subject(s)

Note that, at one point when trying to sort out the aggregation problems, a record was also created for the original appearance if the item was a reprint. This caused more problems than it solved so the code was now moved to

To facilitate these the routine first sets up an Auth structure called mainauth_strptr containing a sorted (by real name) set of authors by calling SPLIT_AUTH (on authbuf) followed by SORT_AUTH. This ensures that identical items with the authors specified in different orders match up OK.
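The effect of sorting the authors can be illustrated with a minimal sketch. It does not use the real SPLIT_AUTH/SORT_AUTH routines or the Auth structure: the function canonical_authors is hypothetical, it assumes co-authors are separated by "/" as in the examples above, and it sorts on the literal name rather than on the real name.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_AUTHORS 16  /* arbitrary limit for this sketch */

static int cmp_names(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/*
 * Hypothetical helper: rewrite a "/"-separated author list in place with
 * the names in sorted order.  Returns the number of names found.
 */
static int canonical_authors(char *list)
{
    char copy[256];
    char *names[MAX_AUTHORS];
    char *tok;
    int n = 0, i;

    assert(list != NULL && strlen(list) < sizeof copy);
    strcpy(copy, list);
    for (tok = strtok(copy, "/"); tok != NULL && n < MAX_AUTHORS;
         tok = strtok(NULL, "/"))
        names[n++] = tok;

    qsort(names, (size_t)n, sizeof names[0], cmp_names);

    /* The rebuilt list is never longer than the original */
    list[0] = '\0';
    for (i = 0; i < n; i++) {
        if (i > 0)
            strcat(list, "/");
        strcat(list, names[i]);
    }
    return n;
}
```

After this step "Author2/Author1" and "Author1/Author2" both become "Author1/Author2", so a plain string comparison is enough to match them.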

It then checks to see if titlbuf contains "␢␢[" and, if so (as long as the item type is not "mg"), calls PARSE_TITLE to split it into its constituent parts, in the process setting up the following fields in the itemdetails structure:

It also initialises three of the other fields:

If there is a prior byline we convert it into a sorted, internal, format by calling TRANSLATE_AUTH (to convert it to internal format); SPLIT_AUTH (to create an Auth structure called byline_strptr from it), SORT_AUTH to sort them into order and then BUILD_AUTH to rebuild origbyln, stipulating that we want the "real" names. It also calls FIXUP_BYLINE to try to clean up the result. As a special case we also translate a byline of "anonymously" to "Anon." and check to see if the original byline was "xxx ,as told to" and, if so, strip off the trailing qualifier.

It then sets up the "actual" (original) byline in byline by taking it from origbyln (i.e. as specified on "as by xxx") or by calling BUILD_AUTH on the primary authors in mainauth_strptr (i.e. the current byline). It also tries to create a list of the "real authors" in realnames by calling GET_REAL_AUTHORS. It then sets up cvtpbdt with a version of the publication details in Bill's format (by translating pbdtbuf if it contains a new format date; or just copying it over otherwise).

If we're doing an XVALIDATE it then checks to see if the publication details indicate the original item was multi-part (e.g. "na1867FOUJan 5+3") and, if so, strips off the last bit and pretends we have the first part (e.g. [Part 1 of 4] in the example shown).
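A minimal sketch of the multi-part handling follows. The function name strip_multipart is hypothetical, and the reading of "+3" as "three further parts" (so four in total) is inferred from the example above rather than taken from the code.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: if pbdt ends in "+N", remove that suffix in place
 * and return the total number of parts (N + 1).  Returns 1, leaving pbdt
 * untouched, if the item is not multi-part.
 */
static int strip_multipart(char *pbdt)
{
    char *plus;
    char *end;
    long extra;

    assert(pbdt != NULL);
    plus = strrchr(pbdt, '+');
    if (plus == NULL || !isdigit((unsigned char)plus[1]))
        return 1;

    extra = strtol(plus + 1, &end, 10);
    if (*end != '\0' || extra <= 0)
        return 1;  /* trailing text after the count: not a part suffix */

    *plus = '\0';
    return (int)extra + 1;
}
```

For "na1867FOUJan 5+3" this leaves "na1867FOUJan 5" in the buffer and returns 4, from which a "[Part 1 of 4]" annotation could be built.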

If the title (in titlbuf) contains "||" then we want to extract only the relevant part for the listing (this possibly needs expanding).

It then sets up the edition as follows:

It then tries to sort out the title to be used. First, if we have an "el" or "en" record it prefixes the title with "0_" to match the format used by magazine editors set up in SCAN_RECORD above. If the title then does not start with "0_" it <does a lot of massaging (TBS)>.

It then defaults authtype to "normal" (SECAUT_NORMAL) and calls FLUSH_SORT_PART a number of times as discussed above. Note that:


FLUSH_SORT_PART: Internal routine to flush the specified part to scandata structure

/************************************************************************/
/* */
/* FLUSH_SORT_PART - Flush the specified part to scandata structure */
/* */
/* Calling Format: */
/* */
/* status = FLUSH_SORT_PART (scandata_ptr, details_ptr, */
/* auth_strptr, titl_ptr, titlad_ptr, */
/* do_secondary, prtfil_ptr) */
/* */
/* Where: */
/* */
/* status = Result of operation: */
/* = PSP_TRUE if OK; else PSP_FALSE */
/* scandata_ptr = Pointer to scandata structure */
/* details_ptr = Pointer to itemdetails structure */
/* auth_strptr = Pointer to AUTH structure for this part */
/* titl_ptr = Pointer to title to use */
/* titlad_ptr = Pointer to title additional to use */
/* do_secondary = PSP_TRUE if we want secondary authors */
/* = PSP_FALSE otherwise */
/* prtfil_ptr = Pointer to diagnostics file (may be NULL) */
/* */
/************************************************************************/

The routine tries to create a new scanitem element as seen from the perspective of each of the authors (primary or secondary). Much of the complexity of this is discussed under the background section above.

The hard parts of this are done via BUILD_PART_AUTH (which works out the perspective from each author in turn) and SETUP_SCANITEM (which actually sets up the scanitem element), so all this routine actually does is:


SETUP_SCANITEM: Internal routine to set up a scanitem and add to scandata structure

/************************************************************************/
/* */
/* SETUP_SCANITEM - Set up a scanitem and add to scandata structure */
/* */
/* Calling Format: */
/* */
/* status = SETUP_SCANITEM (scandata_ptr, details_ptr, */
/* titl_ptr, titlad_ptr, itemad_ptr, */
/* auth_ptr, coauth_ptr, authnum, */
/* prtfil_ptr) */
/* */
/* Where: */
/* */
/* status = Result of operation: */
/* = PSP_TRUE if OK; else PSP_FALSE */
/* scandata_ptr = Pointer to scandata structure */
/* details_ptr = Pointer to itemdetails structure */
/* titl_ptr = Pointer to title to use */
/* titlad_ptr = Pointer to title additional to use */
/* itemad_ptr = Pointer to item additional to use */
/* auth_ptr = Pointer to author name for this part */
/* coauth_ptr = Pointer to coauthor names for this part */
/* authnum = Author number (if multiple authors) */
/* prtfil_ptr = Pointer to diagnostics file (may be NULL) */
/* */
/************************************************************************/

For each instance it sets up the fields as follows:

If we're not doing a file sort, these fields are added to a new scanitem structure, a pointer to which is stored in itmbuf_ptr; if we are doing a file sort then the fields are output to file prefixed as discussed above and separated by 0x1f characters (to ensure the sort orders we want). The order in which the fields are output is optimised for the sort routines and is as follows, depending on the value of scantype in the scandata structure:


WRITE_SCANITEM: Write a scanitem element to specified file

/************************************************************************/
/* */
/* WRITE_SCANITEM - Write scanitem element to specified file */
/* */
/* Calling Format: */
/* */
/* WRITE_SCANITEM (outfil_ptr, scanitem_ptr, sort_order, */
/* prtfil_ptr) */
/* */
/* Where: */
/* */
/* outfil_ptr = Output file to write record to */
/* scanitem_ptr = Pointer to scanitem element to output */
/* sort_order = Sort Order to determine order of fields */
/* = SCANTYPE_XVALIDATE: for XVALIDATE */
/* = SCANTYPE_GENATTRIB: for GENATTRIB */
/* = SCANTYPE_IDXGEN: for Story Title Index */
/* = SCANTYPE_IDXAUT: for Story Author Index */
/* = SCANTYPE_IDXCHN: for Chronological Index */
/* = SCANTYPE_IDXSR1: for Type 1 Series Order */
/* = SCANTYPE_IDXSR2: for Type 2 Series Order */
/* = SCANTYPE_IDXSR3: for Type 3 Series Order */
/* = SCANTYPE_IDXSR4: for Type 4 Series Order */
/* prtfil_ptr = Pointer to diagnostics file (may be NULL) */
/* */
/************************************************************************/

Although the external sort software (CMSORT) supports the specification of specific fields to sort by, early tests showed that this increased the sort time dramatically. To avoid this, the various programs that use these routines create the files concerned with the data already arranged in the order in which it is to be sorted. In the case of IDXGEN there are several different sort orders depending on the routines concerned. Currently the sort orders supported are:

In each case, all the other fields can be specified in any order. The routine constructs a record where each field is prefixed by a field code and a colon and suffixed with a divider (0x1f) - the field codes are indicated in the definition of the scanitem element.
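The record layout described here might be built as in the sketch below. The structure field, the function build_record and the field codes used in the usage note are hypothetical; the real codes are those given in the definition of the scanitem element.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define FIELD_DIVIDER '\x1f'  /* divider appended after every field */

/* Hypothetical field descriptor: a field code and its value */
struct field {
    const char *code;
    const char *value;
};

/*
 * Build "code:value<0x1f>" for each field, in the order given.
 * Returns 1 on success, or 0 if the record does not fit in out.
 */
static int build_record(const struct field *fields, size_t nfields,
                        char *out, size_t outlen)
{
    size_t used = 0, i;

    assert(fields != NULL && out != NULL && outlen > 0);
    out[0] = '\0';
    for (i = 0; i < nfields; i++) {
        int n = snprintf(out + used, outlen - used, "%s:%s%c",
                         fields[i].code, fields[i].value, FIELD_DIVIDER);
        if (n < 0 || (size_t)n >= outlen - used)
            return 0;
        used += (size_t)n;
    }
    return 1;
}
```

Because the caller chooses the order of the fields array, the same helper can emit the leading sort keys first for any of the sort orders, with the remaining fields following in any order.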


ALLOC_SCANDATA: Routine to allocate a new scandata structure

/************************************************************************/
/* */
/* ALLOC_SCANDATA - Allocate a new scandata structure. */
/* */
/* Calling Format: */
/* */
/* scandata_ptr = ALLOC_SCANDATA (); */
/* */
/* Where: */
/* */
/* scandata_ptr = Pointer to new scandata structure (NULL if none)*/
/* */
/* This routine allocates a new scandata structure and initializes */
/* the contents of it. */
/* */
/************************************************************************/

COPY_ITEM_DETAILS: Copy an itemdetails structure

/************************************************************************/
/* */
/* COPY_ITEM_DETAILS - Copy Item Details Structure. */
/* */
/* Calling Format: */
/* */
/* COPY_ITEM_DETAILS (newdetails_ptr, olddetails_ptr); */
/* */
/* Where: */
/* */
/* newdetails_ptr = Pointer to new itemdetails structure */
/* olddetails_ptr = Pointer to old itemdetails structure */
/* */
/* This routine copies all the fields from the old ITEMDETAILS */
/* structure to the new ITEMDETAILS structure. */
/* */
/************************************************************************/

DUMP_SCANDATA: Internal routine to dump a scandata structure

/************************************************************************/
/* */
/* DUMP_SCANDATA - Dump a scandata structure. */
/* */
/* Calling Format: */
/* */
/* DUMP_SCANDATA (scandata_ptr, outfil_ptr); */
/* */
/* Where: */
/* */
/* scandata_ptr = Pointer to scandata structure */
/* outfil_ptr = Pointer to file to dump structure to */
/* */
/* This routine dumps the contents of a scandata structure to an */
/* external file. */
/* */
/************************************************************************/


FREE_SCANDATA: Routine to free a scandata structure

/************************************************************************/
/* */
/* FREE_SCANDATA - Free a scandata structure. */
/* */
/* Calling Format: */
/* */
/* FREE_SCANDATA (scandata_ptr); */
/* */
/* Where: */
/* */
/* scandata_ptr = Pointer to scandata structure */
/* */
/* This routine frees a scandata structure and any buffers */
/* allocated to it. */
/* */
/************************************************************************/

INIT_SCANITEM: Internal routine to initialise a scanitem element

/************************************************************************/
/* */
/* INIT_SCANITEM - Initialise a scanitem structure. */
/* */
/* Calling Format: */
/* */
/* INIT_SCANITEM (scanitem_ptr); */
/* */
/* Where: */
/* */
/* scanitem_ptr = Pointer to scanitem element */
/* */
/* This routine initialises the contents of a scanitem structure. */
/* */
/************************************************************************/

RESET_ITEM_DETAILS: Internal routine to reset an itemdetails structure

/************************************************************************/
/* */
/* RESET_ITEM_DETAILS - Reset Item Details Structure. */
/* */
/* Calling Format: */
/* */
/* RESET_ITEM_DETAILS (details_ptr); */
/* */
/* Where: */
/* */
/* details_ptr = Pointer to itemdetails structure */
/* */
/* This routine resets all the fields in the specified ITEMDETAILS */
/* structure. */
/* */
/************************************************************************/

RESET_SCANDATA: Internal routine to reset a scandata structure

/************************************************************************/
/* */
/* RESET_SCANDATA - Reset a scandata structure. */
/* */
/* Calling Format: */
/* */
/* RESET_SCANDATA (scandata_ptr); */
/* */
/* Where: */
/* */
/* scandata_ptr = Pointer to scandata structure */
/* */
/* This is an internal routine that resets a scandata structure so */
/* that it can be reused or deleted. */
/* */
/************************************************************************/

SCANITEM_COMP_RTN: Compare a pair of scanitem elements

/************************************************************************/
/* */
/* SCANITEM_COMP_RTN - Sort a scanitem array. */
/* */
/* Calling Format: */
/* */
/* rtnsts = SCANITEM_COMP_RTN (rec1_ptr, rec2_ptr); */
/* */
/* Where: */
/* */
/* rtnsts = Routine return status (as for strcmp) */
/* rec1_ptr = Pointer to first scanitem element */
/* rec2_ptr = Pointer to second scanitem element */
/* */
/* This routine is called via qsort to sort a scanitem array into */
/* the "correct" order - see the documentation or code for details. */
/* */
/************************************************************************/

This routine is called via qsort to sort the scanitem array in a scandata structure into the order required by XVALIDATE. The sort priority is as follows:


SPLIT_SCANITEM: Internal routine to split a scanitem record into a scanitem element

/************************************************************************/
/* */
/* SPLIT_SCANITEM - Split scanitem record into scanitem structure */
/* */
/* Calling Format: */
/* */
/* status = SPLIT_SCANITEM (fldbuf_ptr, scanitem_ptr, prtfil_ptr) */
/* */
/* Where: */
/* */
/* status = Routine return status: */
/* = PSP_TRUE if everything OK */
/* = PSP_FALSE if we hit an error */
/* fldbuf_ptr = Pointer to buffer holding record */
/* scanitem_ptr = Pointer to scanitem element to set up */
/* prtfil_ptr = Pointer to diagnostics file (may be NULL) */
/* */
/************************************************************************/