Scan_rtn.c

Background
Data Structures
Routines:
- ALLOC_SCANDATA: Routine to allocate a new scandata structure
- COPY_ITEM_DETAILS: Routine to copy an itemdetails structure
- DUMP_SCANDATA: Routine to dump a scandata structure
- FLUSH_SORT: Internal routine to flush the sort buffer to the scandata structure
- FLUSH_SORT_PART: Internal routine to flush the specified part to the scandata structure
- FREE_SCANDATA: Routine to free a scandata structure
- INIT_SCANITEM: Routine to initialise a scanitem element
- RESET_ITEM_DETAILS: Routine to reset an itemdetails structure
- RESET_SCANDATA: Internal routine to reset a scandata structure
- SCAN_FILE: Routine to scan a file and add the contents to a scandata structure
- SCAN_RECORD: Routine to scan a record and add the contents to a scandata structure
- SCANITEM_COMP_RTN: Routine to compare a pair of scandata structures for sort purposes
- SETUP_SCANITEM: Routine to set up a scanitem element and add it to a scandata structure
- SPLIT_SCANITEM: Routine to split a scanitem record into a scanitem element
- WRITE_SCANITEM: Routine to write a scanitem element to a specified file in a particular order

Background

Increasingly, the SCAN_FILE and SCAN_RECORD routines are at the heart of most of the complex programs that revolve around the Fictionmags data format, specifically:

XVALIDATE
GENATTRIB
IDXGEN
GENERATE (at some point in the future)

This is because these routines read through a specified set of files and set up a scandata structure containing all the data in those files in a structured manner, and often from multiple viewpoints, as discussed below. This structure can then be examined by the various programs as required to extract the data they need.

Note that the primary drivers in developing the code have been XVALIDATE and GENATTRIB - adjustments may need to be made in future to accommodate the other programs. However, the emphasis has been to try to streamline the code as far as possible to address the first two in the hopes that future extensions will be well-documented.

Bylines

One of the early complications was in the handling of bylines. When trying to match up two records containing, say:

E A0~House1 ,(hp:Author1)~Example3~ss|ACT|1929|Aug|(v8:12)|~  (original)
E A0~Author1~Example3 [as by House1]~ss1929ACTAug~            (reprint)

we have two choices. The first was to regard the original byline as being "House1 ,(hp:Author1)", which worked fine in simple cases but fell apart in some of the more complex cases using ", (by:" (such as initials or "The Author of xxx") as there was no easy way of predicting what the original byline would be. Similar problems (in theory at least) would arise if an original appearance was written under two different house names by two different authors (as happened a lot in Wild West Weekly, for example) and was then reprinted under the authors' own names, as there would be no way of knowing which author had used which house name.

The second (and current) approach is the opposite: regard the byline as exactly what is used in the book/magazine (i.e. in the above case, "House1"). This should be easily deducible in all cases without any of the problems discussed above. It does, however, introduce problems when the same title is used by different people under the same house name (or anonymously) or when, by mistake, the real author behind a house name is specified incorrectly; these are discussed in more detail under Real Authors below.

Note that, when we say "exactly what is used in the book/magazine", this is not quite true as we need to normalise all such names to their base formats (e.g. translating via ~04~ records) if we are to get a match.

Note also that if there is a secondary name qualified with ", (as told to:" or ", (as told by:" then these also need to be incorporated into the byline so that we can match up:

E 32A0~Branscombe, Arthur #2 ,(as told to:Rousseau, Victor)~Soul That Lost Its Way~ss|GHS|1927|Aug|(v3:2)|~The ~~Martinus| Doctor~
E 80A0~Anon.~Monkey-Face and Mrs. Thorpe ["The Soul That Lost Its Way", as by Arthur Branscombe #2 as told to Victor Rousseau]~ss1927GHSAug~~~Martinus|Doctor~

Note also that if a primary name is qualified with ", (err:" then we use the name behind the error as the byline instead.
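
The byline rules above can be sketched as follows. This is only an illustration: derive_byline is a hypothetical helper (not a routine in Scan_rtn.c), it assumes at most one qualifier written as " ,(tag:Name)" as in the example records, and it makes no attempt at the normalisation via ~04~ records mentioned above. Qualifiers other than hp, by, err and "as told to/by" simply fall through to the last branch, which may well be wrong for them.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: derive the byline "as printed" from an author field.
 * Assumes a single qualifier of the form " ,(tag:Name)". */
static void derive_byline(const char *author, char *out, size_t outlen)
{
    const char *qual = strstr(author, " ,(");
    size_t n;

    assert(outlen > 0);
    if (qual == NULL) {                     /* plain name: use it as-is */
        snprintf(out, outlen, "%s", author);
        return;
    }
    if (strncmp(qual + 3, "as told to:", 11) == 0 ||
        strncmp(qual + 3, "as told by:", 11) == 0) {
        /* the secondary name is part of the byline: keep everything */
        snprintf(out, outlen, "%s", author);
        return;
    }
    if (strncmp(qual + 3, "err:", 4) == 0) {
        /* use the name behind the error, up to the closing ')' */
        const char *name = qual + 3 + 4;
        const char *end = strchr(name, ')');
        n = end ? (size_t)(end - name) : strlen(name);
        snprintf(out, outlen, "%.*s", (int)n, name);
        return;
    }
    /* hp:, by: (and, in this sketch, anything else): primary name only */
    n = (size_t)(qual - author);
    snprintf(out, outlen, "%.*s", (int)n, author);
}
```

With this sketch, "House1 ,(hp:Author1)" yields "House1", which is what the reprint record's "as by House1" would also give.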

Real Authors

The current approach to bylines raises a number of problems some (all?) of which weren't visible in the prior approach.

Firstly, XVALIDATE tries to police instances where the same author uses the same (or very similar) titles with different publication details and this was done by comparing adjacent scanitem elements (after sorting). However, if two different authors write items with the same title anonymously this is flagged up (unnecessarily) as a conflict because we have two records with an author (i.e. the original byline) of "Anon." with the same title but different publication details.

Similarly, if we consider a variant to the above example of:

E A0~House1 ,(hp:Author1)~Example3~ss|ACT|1929|Aug|(v8:12)|~  (original)
E A0~Author2~Example3 [as by House1]~ss1929ACTAug~            (reprint)
E A0~House1~Example3~ss1929ACTAug~            	              (reprint)

we would like to flag up the two reprints as being in error as neither matches the original (the first reprint because the author is wrong; the second because the author behind the house name has been omitted). However, with the current approach to bylines the only records we'll be comparing will all just say that the "original byline" was "House1" and hence will think everything is OK. (Note that, if ACT is a fully-validated magazine the first reprint would trigger an "original appearance not found" message, but even that wouldn't address the problem with the second reprint.)

The current thinking is to record the "real authors" associated with an item and then report an error if the author/title/publication details match but the real author(s) don't (the second case above) and not report an error if the author/title match but publication details don't if the real author(s) also don't (the first case above).
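
As a rough sketch of that rule (not the XVALIDATE code itself), the comparison of two adjacent items might look like the following. The struct, the enum and the function name are all hypothetical, and the fields are assumed to be already normalised so that a plain strcmp is enough.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical outcome codes for comparing two sorted, adjacent items */
enum scan_verdict {
    SCAN_UNRELATED,             /* byline or title differ: nothing to check */
    SCAN_OK,                    /* consistent, or different real authors    */
    SCAN_REAL_AUTHOR_MISMATCH,  /* same item, but real authors disagree     */
    SCAN_PUBDET_CONFLICT        /* same real authors, different pub details */
};

/* Hypothetical cut-down view of a scanitem for comparison purposes */
struct cmp_item {
    const char *byline;
    const char *title;
    const char *pubdet;
    const char *realauth;
};

static enum scan_verdict compare_items(const struct cmp_item *a,
                                       const struct cmp_item *b)
{
    int same_pub, same_real;

    assert(a != NULL && b != NULL);
    if (strcmp(a->byline, b->byline) != 0 || strcmp(a->title, b->title) != 0)
        return SCAN_UNRELATED;
    same_pub  = strcmp(a->pubdet, b->pubdet) == 0;
    same_real = strcmp(a->realauth, b->realauth) == 0;
    if (same_pub)   /* second case above: details match, real authors must too */
        return same_real ? SCAN_OK : SCAN_REAL_AUTHOR_MISMATCH;
    /* first case above: different details are fine if real authors differ */
    return same_real ? SCAN_PUBDET_CONFLICT : SCAN_OK;
}
```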

The challenge lies in working out who the "real authors" are. In our three examples just above we can do so by parsing the author on the item as that would produce "Author1", "Author2" and "House1" respectively, and hence give us a mismatch. However, if we consider a different valid combination:

E A0~Author1~Example3~ss|ACT|1929|Aug|(v8:12)|~               (original)
E A0~House1~Example3 [as by Author1]~ss1929ACTAug~            (reprint)

this would produce "Author1" for the first case and "House1" for the second case and hence generate an error, even though the reprint is actually valid.

The current thinking is to try to look at both the author on the item and any byline specified and combine the two. This is non-trivial as we need to ensure that:

if any of the names is a pseudonym, we translate it to the real name(s) underneath it.
if the name is not a real name (e.g. a house name or "Anon.") then it is not included (in the above example the first would otherwise generate "Author1" and the second "House1/Author1" which still wouldn't match).
the names are sorted into a unique order to make matching feasible.
any duplication of the same name (e.g. when an item by a real author is reprinted under one of their personal pseudonyms or vice versa) is removed for the same reason.

There are still cases where this won't work, but hopefully they will be a small number of "false positives" that can be handled by the exceptions file.
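
The four requirements listed above can be sketched as below. Everything here is an assumption made for illustration: the name_entry structure, the build_real_authors function, the '/' separator and the limit of 16 names are not taken from Scan_rtn.c, and a pseudonym is assumed to map to a single real name (the real code needs to cope with several).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_NAMES 16    /* arbitrary limit for this sketch */

/* Hypothetical input: one name from the author field or the byline */
struct name_entry {
    const char *name;     /* name as specified                          */
    const char *real;     /* real name behind a pseudonym, else NULL    */
    int         is_real;  /* 0 for house names, "Anon." and the like    */
};

static int cmp_names(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Build a sorted, de-duplicated, '/'-separated list of real authors */
static void build_real_authors(const struct name_entry *in, size_t n,
                               char *out, size_t outlen)
{
    const char *names[MAX_NAMES];
    size_t count = 0, used = 0, i;

    assert(outlen > 0);
    out[0] = '\0';
    for (i = 0; i < n && count < MAX_NAMES; i++) {
        if (in[i].real != NULL)
            names[count++] = in[i].real;   /* translate the pseudonym   */
        else if (in[i].is_real)
            names[count++] = in[i].name;   /* keep real names           */
        /* anything else (house name, "Anon.") is dropped */
    }
    qsort(names, count, sizeof names[0], cmp_names);
    for (i = 0; i < count; i++) {
        size_t len;
        if (i > 0 && strcmp(names[i], names[i - 1]) == 0)
            continue;                      /* skip duplicates           */
        len = strlen(names[i]);
        if (used + len + 2 > outlen)
            break;                         /* no room: stop quietly     */
        if (used > 0)
            out[used++] = '/';
        memcpy(out + used, names[i], len);
        used += len;
        out[used] = '\0';
    }
}
```

For the Author1/House1 example above, both the original and the reprint then produce "Author1", so the two records match.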

Some points to be addressed:

with: if record by a ,(with:b) is reprinted as a/b then we list the original as "as by b" but the record generated for the original appearance from the latter is missing the co-author. Can we fix this (e.g. when creating record for byline in SCAN_FILE?)
08 records: document
gho: should over-ride primary name (see "Hands of Death" in mags.sym)
err: should over-ride primary name (see "Seven Shapes of Solomon Bean" in sdmau.mag)
23 records: (see "The Peep Show That Shook Spain" in mags.adv)
as by "Haywire Mac" (rrmmt.mag)
problems with hp:xxx or by:xxx where 'xxx' is not the final name but is an '08' link away (or possibly a '12' link away?) (see Cassiday, Robert J. in sfi/xvalidate.xxx)
"The Author of xxx": see Young Mrs. Jardine in fmi/xvalidate.xxx
E_798A1~Grant, Elizabeth ,(tr:Alexander, Rev. Dr. W. Lindsay)~Roy's Wife~pm~ in mags.hrp generates auth=Grant, Elizabeth but no "real auth": why? Likewise first appearance of "Her Bright Smile Haunts Me Still" in ptrsn.mag; Likewise "To Lesbia" in mags.mcl
should "Scriptunes" in mags.rwg be recorded as "Publication details differ" (one is by Elizabeth Sale & the other by Elizabeth Sale & somebody else)

Multiple Viewpoints

As mentioned in the introduction, a single input record may generate data looking at the same record from multiple viewpoints. Specifically:

We generate a scanitem element for each primary name in turn;
If any of the primary names is a pseudonym for a specific author (or group of authors) we also generate an element for each such author behind the name;
If there are any author-specific secondary names, we generate an element for each such (note that secondary authors should never be pseudonyms so we don't need the second step above)
If there is a byline specified on the record, we repeat the above for the byline
(Future): If there are any global secondary names, we generate an element for each such
(Future): If there are any artists specified on the record, we generate an element for each such
(Future): If there are any subjects specified on the record, we generate an element for each such

Note that the scanitem elements generated are virtually identical in all cases generated from a single record. The only exceptions are:

auth_ptr points to the specific name from whose viewpoint we are considering the data
coauth_ptr points to the co-authors as seen from the viewpoint of the person in auth_ptr as discussed below

Co-Authors

One of the many things that XVALIDATE tries to check is that all occurrences of an item from the viewpoint of a given author have the same co-authors, i.e. that:

E A0~Author1/Author2~Example3~ss|ACT|1929|Aug|(v8:12)|~                 (original)
E A0~House1~Example3 [as by Author1 & Author2]~ss1929ACTAug~            (reprint)

produces a match, but something like:

E A0~House1~Example3 [as by Author1]~ss1929ACTAug~            (reprint)

produces a mismatch. Given the possibilities of personal pseudonyms being used (validly) just about anywhere, experimentation has shown that the only viable way of handling co-authors is to specify the "real names" in all cases (where known). This might need adjusting when it comes to generating the website, but we'll think about that later.
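
A minimal sketch of "co-authors as seen from the viewpoint of a given author", assuming we already hold the real names for the item as an array of strings. The function name and the '/' separator are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* List every real name except the viewpoint author, '/'-separated */
static void coauthors_for(const char *const *real, size_t n,
                          const char *viewpoint, char *out, size_t outlen)
{
    size_t used = 0, i;

    assert(outlen > 0);
    out[0] = '\0';
    for (i = 0; i < n; i++) {
        size_t len;
        if (strcmp(real[i], viewpoint) == 0)
            continue;                  /* an author is not their own co-author */
        len = strlen(real[i]);
        if (used + len + 2 > outlen)
            break;
        if (used > 0)
            out[used++] = '/';
        memcpy(out + used, real[i], len);
        used += len;
        out[used] = '\0';
    }
}
```

From the viewpoint of Author1, the original and the "as by Author1 & Author2" reprint both give co-authors of "Author2", whereas the "as by Author1" reprint gives an empty list and hence a mismatch.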

Data Structures

The routines in Scan_rtn.c revolve around the use of two structures. The first is a scandata structure (defined in Scan_rtn.h and visible externally) which contains:

structid: this is simply a sanity-check field set by ALLOC_SCANDATA to SCANDATA_STRUCTID ("SCANDATA")
scantype: this indicates the type of scan we are doing (SCANTYPE_GENATTRIB, SCANTYPE_IDXGEN or SCANTYPE_XVALIDATE)
recnum: this is set to zero by ALLOC_SCANDATA (or RESET_SCANDATA) and is then incremented for each record processed by SCAN_FILE so that we have a unique number for each record
cur_val_level: the current validation level; this is only used by XVALIDATE and may take one of three values:
- 0: full cross-validation
- 1: major cross-validation
- 2: minor cross-validation (not currently used)
filnam: this contains the current filename and is set by the calling program(s) to be stored in each record
nummag: the number of magazines in magbuf_ptr: these are only used by XVALIDATE and are described there
magbuf_ptr: a list of magazines for which full cross-validation is enabled
numitm: the number of items in itmbuf_ptr
itmbuf_ptr: an array of scanitem structures built by SCAN_FILE; each structure contains the following fields, which may also be output to the temporary file, in which case each field is prefixed by xx: (the two-letter code shown in brackets below), partly for diagnostic purposes, and partly so we can create the file differently at different times for sorting purposes:
- auth_ptr: Pointer to Author name (au)
- authtype: Author type (at)
- authnum: Author Number (if multiple authors for item) (an)
- nrmaut_ptr: Pointer to normalised Author name (using NORMALISE_AUTHOR) (na)
- cmpttl_ptr: Pointer to Compacted Item Title (using COMPACT_TITLE) (ct)
- titl_ptr: Pointer to Normal Item Title (includes trailing '|' for column/series prefixes) (tt)
- ser_part_ptr: Pointer to Serial Part (or " " if none) (sp)
- ser_max_ptr: Pointer to Serial Maximum (or " " if none) (sm)
- pubdet_ptr: Pointer to Publication details (in "old" format) (pd)
- ttad_ptr: Pointer to Title additional field (ta)
- itad_ptr: Pointer to Item additional field (it)
- series_ptr: Pointer to Series name (sr)
- edition: char*2 Edition: (ed)
  - for items: 1=same as book/mag ID; 2=reprint
  - for books: standard edition code from Book Record
  - 0 used by XVALIDATE only (unclear when)
  - X=dummy record for PSEUD.CVT entries
- done_original: char*2 Flag to say we've created a record for the original publication (0=No/Not Needed; 1=Yes & this is it; 2=Yes) (do)
- edittype: char*2 Flags to distinguish between editorial positions (1); Magazine Issue Editors (2) & Anything Else (3) (et)
- book_type: char*2 Book Type (1 = Book; 2 = Item; 3 = Item Reprinted as Book) (bk)
- dtpubl_ptr: Pointer to numeric publication date (CCYYMMDD) (dt)
- magid_ptr: Pointer to Book/Magazine ID (mg)
- origttl_ptr: Pointer to Original Title (unless doing record for original title) (ot)
- origtad_ptr: Pointer to Original Title Additional (unless doing record for original title) (oa)
- byline_ptr: Pointer to Byline used (unless doing record for byline, except when that is also a pseudonym) (by)
- rlauth_ptr: Pointer to the Real Authors associated with this item (ra)
- coauth_ptr: Pointer to Co-authors (co)
- secnam_ptr: Pointer to Secondary Names (sc)
- subject_ptr: Pointer to Subjects (sb)
- notes_ptr: Pointer to Notes related to "from <xxx>" or similar (on); also used in the Series Index to indicate the "other" series
- recnum: char*10 8-digit number representing the position in the input file(s) of the record, derived from recnum. It is used to prevent the program falsely comparing items on the same record. (rn)
- filnam_ptr: Pointer to Current File Name, derived from filnam (fn)
- filabb_ptr: Pointer to Current File Abbreviation (fa)
- val_level: char*2 Validation Level, derived from cur_val_level (vl); also used in the Chronological and Series Indexes to indicate multiple instances of a title
- ednote_ptr: Pointer to ED notes (nt); also used in the Series Index to indicate the "About" link URL
- bylntype: Byline type (bt)
- styttl_ptr: Pointer to anchor in Story Title Index (st)
- sttaut_ptr: Pointer to Story Title anchor in Story Author Index (sa)
- seriestyp: char*2 Series Type (sy, only used by the Series Index)
- repttl_ptr: Pointer to title on reprint (if different to titl_ptr) (rt)
- repaut_ptr: Pointer to author on reprint (if different to auth_ptr) (rp)
- prvpub_ptr: Pointer to publisher on DP record (if there is one) (pp)
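
To illustrate the xx: prefixes and the 8-digit record number, here is a small sketch. Both helpers are hypothetical; the zero padding and the tab used as a field separator are assumptions, as the actual layout of the temporary file is not described here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format a record number as 8 digits (zero padding is an assumption) */
static void format_recnum(unsigned long recnum, char *out, size_t outlen)
{
    snprintf(out, outlen, "%08lu", recnum);
}

/* Append "xx:value" plus a separator to a NUL-terminated line buffer.
 * Returns 1 if the field fitted, 0 if it was truncated. */
static int append_tagged(char *line, size_t linelen,
                         const char *tag, const char *value)
{
    size_t used = strlen(line);
    int n;

    assert(used < linelen);
    n = snprintf(line + used, linelen - used, "%s:%s\t",
                 tag, value ? value : "");
    return n >= 0 && (size_t)n < linelen - used;
}
```

Because each field carries its own prefix, the fields can be written in whatever order suits the sort being done and still be identified when the file is read back.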

The second is an itemdetails structure that is only used internally and contains:

inpbuf: Input Buffer
bookid: Book ID: set from the most recent A record encountered
dtpubl: Publication date for book/magazine
autnam: Author Buffer: as specified on the A, DC or EA record
titlbuf: Title Buffer: as specified on the A, DC, EA or EC record: note that, if there is a numeric suffix (i.e. ^-|) then it and any following space is stripped off first
pbdtbuf: Publication Details Buffer: as specified on EA record: note that, if the item type is ex, it is reset to zx so that it sorts after everything else
cvtpbdt: publication details from pbdtbuf converted to Bill's format if necessary
ttadbuf: Title Additional Buffer: as specified on A, DC, EA or EC record
itadbuf: Item Additional Buffer: as specified on EA or EC record
artistbuf: Artist Buffer: as specified on EA record
seriesbuf: Series Buffer: as specified on EA record
subjbuf: Subject Buffer: as specified on EA record
notebuf: ED Note Buffer: contains the contents of any ED records, separated by ^//
firstbuf: First Printing Buffer
priorbuf: Prior Printing Buffer
serialpart: serial part number (set to five spaces if not part of a serial)
serialmax: serial maximum (set to five spaces if not part of a serial)
origtitl: Original Title if specified on EA record
origttad: Original Title Additional if specified on EA record
origbyln: Original Byline specified in "as by xxx"
orignote: any other odd notes specified in []
byline: Byline used on the Original Appearance (taken from origbyln, if specified, or authbuf otherwise)
thisrecnum: This record number: a formatted version of recnum
secnames: Secondary names
realnames: Real names
edition: the edition for this item (0=unknown; 1=original; 2=reprint; *=book)
book_type: Book Type: 1 = (Original) Book; 2 = Item; 3 = Item Reprinted as Book
authtype: the author type (SCANAUT_xxx)
edittype: the editor type (as in scandata)
bylntype: Byline Type ('0'=current; '1'=from original)
filabb: Current Filename Abbreviation
reptitl: Reprint Title (incl Titl Additional)
repauth: Reprint Author
prvpub: Previous Publisher on DP record (if any)

The master copy of this is set up in thisitem when parsing an A record, a DC record, or a set of E records and is passed to FLUSH_SORT. FLUSH_SORT then works out the different types of record that are needed (e.g. an EA record may need records for the current author(s), the original author(s), the artist(s) and/or the subject(s)) and sets up a local version of the structure called curritem to reflect the records needed in each case and calls FLUSH_SORT_PART to create the relevant records.

SCAN_FILE: Routine to scan a file and add the contents to a scandata structure

/************************************************************************/
/*									*/
/*    SCAN_FILE - Scan file and populate a scandata structure.		*/
/*									*/
/*    Calling Format:							*/
/*									*/
/*	SCAN_FILE (inpfil_ptr, scandata_ptr, scan_type, prtfil_ptr);	*/
/*									*/
/*    Where:								*/
/*									*/
/*	inpfil_ptr   = File pointer for file to scan			*/
/*	scandata_ptr = Pointer to SCANDATA structure			*/
/*	scan_type    = Type of scan to perform				*/
/*      prtfil_ptr   = Pointer to diagnostics file (may be NULL)	*/
/*									*/
/*    This routine scans a file and adds the contents to a		*/
/*    scandata structure.						*/
/*									*/
/************************************************************************/

SCAN_FILE is a simple interface to scan a complete file and create scanitem records for the content. As some programs (such as IdxGen) need more control over the process it is now simply a shell that reads the records and passes them through to SCAN_RECORD to do all the hard work.

The only complicating factor is that Xvalidate doesn't want to include any books that are flagged with a DQN or DQX record. As such it needs to read all D records before processing an A record and then decide whether or not to process the A and D records depending on whether or not a DQN/X record was found. (This does not apply to IdxGen as that does not use SCAN_FILE.)
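
A minimal sketch of that decision, assuming the caller has already buffered the A record together with the D records that follow it, and that each record starts with its record type (as in the DQE~VALFULL~ example elsewhere in this document). The function name is hypothetical.

```c
#include <assert.h>
#include <string.h>

/* recs[0] is assumed to be the A record; recs[1..n-1] its D records.
 * Returns 1 if the book should be skipped by Xvalidate, else 0. */
static int book_is_excluded(const char *const *recs, size_t n)
{
    size_t i;

    assert(recs != NULL || n == 0);
    for (i = 1; i < n; i++) {
        if (strncmp(recs[i], "DQN", 3) == 0 ||
            strncmp(recs[i], "DQX", 3) == 0)
            return 1;
    }
    return 0;
}
```

Only once this returns 0 would the buffered A and D records be passed on to SCAN_RECORD.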

SCAN_RECORD: Routine to scan a record and add the contents to a scandata structure

/************************************************************************/
/*									*/
/*    SCAN_RECORD - Populate a scandata structure from a single record.	*/
/*									*/
/*    Calling Format:							*/
/*									*/
/*	SCAN_RECORD (recbuf_ptr, fld_ptr, fldcnt, scandata_ptr,		*/
/*		     dpdate_ptr, scan_type, call_type, prtfil_ptr);	*/
/*									*/
/*    Where:								*/
/*									*/
/*	recbuf_ptr   = Record buffer to scan				*/
/*	fld_ptr	     = Array of fields in record buffer			*/
/*	fldcnt	     = Count of fields in record buffer			*/
/*	scandata_ptr = Pointer to SCANDATA structure			*/
/*	dpdate_ptr   = Pointer to date from DP record (if any)		*/
/*	scan_type    = Type of scan to perform (XVALIDATE/GENATTRIB...	*/
/*	call_type    = Type of call to routine				*/
/*		     = 0 for normal record				*/
/*		     = 1 for first record in file			*/
/*		     = -1 for dummy call at end of file			*/
/*      prtfil_ptr   = Pointer to diagnostics file (may be NULL)	*/
/*									*/
/************************************************************************/

As the data is built up from multiple records, we call FLUSH_SORT whenever we're about to start compiling a new set of data (i.e. when we encounter an "EA" record or an "A" or "D" record and have some preceding data, identified by a non-empty authbuf), to flush the previous records. The routine revolves around the use of an itemdetails structure called thisitem. It builds up the details of the current item in this structure and then calls FLUSH_SORT to flush it when starting a new item (or when called with call_type=-1) as long as there is something in it. For 'A' records the fields are set up as follows:

inpbuf: <undefined>
bookid: contents of [] in FLD_CLASS, or contents of {} for internally generated book IDs
dtpubl: from FLD_DTPUBL
autnam: from FLD_TITLE if a magazine or from FLD_AUTHOR otherwise; note that, for magazines, if there is no editor then this field will be empty and no records will be generated; for books, if there is an editor specified in FLD_CLASS this is appended to the autnam field, prefixed with ", (ed:"
titlbuf: from FLD_TITLE if not a magazine; for magazines this is complicated when we are doing IDXGEN as we want the entries to sort after any books rather than mixing magazine issue names with book titles. As such, we set the title to the fake "0_Editor|_Z"; if we're not doing IDXGEN then it is just taken from FLD_AUTHOR.
pbdtbuf: the item type is taken from FLD_BKTYPE; if we were passed a date in dpdate_ptr this is added; if not we append the contents of bookid if FLD_CLASS contains [ or < or FLD_DTPUBL otherwise.
ttadbuf: from FLD_TITLAD except for magazine editors (see titlbuf above) where it is set to ""
itadbuf: ""
artistbuf: set from FLD_COVER
seriesbuf: set from FLD_SERIES; for books, if the publisher name matches an imprint series then it is appended to the series buffer with a special prefix of '#'
subjbuf: set from subject field in FLD_CLASS (if any)
notebuf: ""
firstbuf: ""
priorbuf: ""
thisrecnum: the current record number padded to 8 digits
edition: stored from FLD_EDITION or 2 if FLD_CLASS contains < (? should this be 6) or 1 for magazine editors (see titlbuf above)
book_type: 1 or 3 if FLD_CLASS contains < or 2 for magazine editors (see titlbuf above)
filabb: copied from SCANDATA, set to MagId on Feature record or "" otherwise

Note that the remaining fields are used exclusively within FLUSH_SORT.

For DC records, we have:

inpbuf: <undefined>
autnam: from FLD_DCARTIST
titlbuf: from FLD_DCTITL if nonempty; else "[front cover]"
pbdtbuf: cv followed by bookid (if specified) or dtpubl (if not) from previous 'A' record.
ttadbuf: from FLD_DCTITLAD if FLD_DCTITL nonempty; else ""
itadbuf: ""
seriesbuf: ""
subjbuf: ""
notebuf: ""
firstbuf: ""
priorbuf: ""
thisrecnum: the current record number padded to 8 digits
edition: 1
book_type: 2 (i.e. an item)

All other fields are inherited from the previous 'A' record.

For EA records for which the item type is not "pu", we have:

inpbuf: set to "<record> in <file>" where <record> is the record buffer passed to the routine and <file> is the file name from SCANDATA
autnam: set from FLD_EAUTH
titlbuf: set from FLD_EATITL, stripping off any numeric prefix (i.e. something ending with ^-|); if there is an EC record with a title then this field is replaced by FLD_ECTITL, coping with all the usual things like | prefixes and part numbers
pbdtbuf: set from FLD_EAPUBL except that an item type of "ex" is replaced with "zx" to sort it after whatever it is an extract from and the remainder of the field is replaced by the bookid if it is set to nnnn(*)
ttadbuf: set from FLD_EATITLAD or FLD_ECTITLAD if there is an EC record
itadbuf: set from FLD_EAITEMAD or FLD_ECITEMAD if there is an EC record with a series prefix
artistbuf: set from FLD_EAILLUS
seriesbuf: set from FLD_EASERIES
subjbuf: set from FLD_EASUBJ
notebuf: set by concatenating any following ED records (apart from ED* records)
firstbuf: ""
priorbuf: set from any ED* or EX record (the latter if not doing an XVALIDATE) and reformatted into the proper format.
thisrecnum: the current record number padded to 8 digits
edition: 1
book_type: 2

All other fields are inherited from the previous 'A' record.
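
As a small illustration of the "ex" to "zx" substitution in pbdtbuf: the sketch below assumes the item type occupies the first two characters of the publication details (as in "ss1929ACTAug"). The function name is hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Rewrite an item type of "ex" as "zx" so that an extract sorts after
 * the item it was extracted from. */
static void fix_extract_type(char *pbdtbuf)
{
    assert(pbdtbuf != NULL);
    if (strncmp(pbdtbuf, "ex", 2) == 0)
        pbdtbuf[0] = 'z';
}
```

This works because 'z' compares greater than any other lower-case letter, so "zx" sorts after every other two-letter lower-case item type.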

If the scan type is SCANTYPE_XVALIDATE, the routine also builds a list of all magazine and (real) book IDs (in magbuf_ptr) for which full or major cross-validation is required: this allows XVALIDATE to check that we have a magazine/book entry for all items in the database that use that magazine/book ID. This means adding an entry for the MagId specified on each "Features" record encountered when the validation level is 0 or 1 and adding an entry for each (normal) book ID. As a special case, if the routine encounters a DQE~VALFULL~ record, the routine resets the validation level (for that file) to 0 and adds an entry to the list of magazine IDs for it (if it hasn't already done so).

FLUSH_SORT: Internal routine to flush sort buffer to scandata structure

/************************************************************************/
/*									*/
/*  FLUSH_SORT - Flush sort buffer to scandata structure(s)		*/
/*									*/
/*  Calling Format:							*/
/*									*/
/*	status = FLUSH_SORT (scandata_ptr, details_ptr, prtfil_ptr);	*/
/*									*/
/*  Where:								*/
/*									*/
/*	status	     = Result of operation:				*/
/*		     = PSP_TRUE if OK; else PSP_FALSE			*/
/*	scandata_ptr = Pointer to scandata structure			*/
/*	details_ptr  = Pointer to itemdetails structure			*/
/*    	prtfil_ptr   = Pointer to diagnostics file (may be NULL)	*/
/*									*/
/************************************************************************/

This routine basically calls FLUSH_SORT_PART a number of times to create the relevant sort records for the item in details_ptr from a number of perspectives. These are:

records for the main authors specified for the item using only the item title if we have a '|' or '||' divider
if we're not doing an XVALIDATE and we have a '|' or '||' divider then we write another record with the full title
if we're doing an XVALIDATE and there is a series divider ('|') in the title, we call it for (just) the series name so that we can check they are consistent
if we had a previous/original title and byline (see below) we call it for that byline and/or title
if we're not doing an XVALIDATE and there are one or more artists on the record, then we call it for artist(s)
if we're not doing an XVALIDATE and there are one or more subjects on the record, then we call for the subject(s)

Note that, at one point when trying to sort out the aggregation problems, a record was also created for the original appearance if the item was a reprint. This caused more problems than it solved so the code was now moved to

To facilitate these the routine first sets up an Auth structure called mainauth_strptr containing a sorted (by real name) set of authors by calling SPLIT_AUTH (on authbuf) followed by SORT_AUTH. This ensures that identical items with the authors specified in different orders match up OK.

It then checks to see if titlbuf contains "␢␢[" and, if so (as long as the item type is not "mg"), calls PARSE_TITLE to split it into its constituent parts, in the process setting up the following fields in the itemdetails structure:

serialpart: any serial part number (defaults to "␢␢␢␢␢" for sorting purposes)
serialmax: any serial maximum (defaults to "␢␢␢␢␢" for sorting purposes)
origtitl: any original title (defaults to "")
origttad: any original title additional (defaults to "")
origbyln: any specified original byline (defaults to "")
orignote: anything else specified in brackets (such as "from <xxx>"; defaults to "")

It also initialises three of the other fields:

bylntype: 0
reptitl: ""
repauth: ""

If there is a prior byline we convert it into a sorted, internal, format by calling TRANSLATE_AUTH (to convert it to internal format); SPLIT_AUTH (to create an Auth structure called byline_strptr from it), SORT_AUTH to sort them into order and then BUILD_AUTH to rebuild origbyln, stipulating that we want the "real" names. It also calls FIXUP_BYLINE to try to clean up the result. As a special case we also translate a byline of "anonymously" to "Anon." and check to see if the original byline was "xxx ,as told to" and, if so, strip off the trailing qualifier.
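
The two special cases at the end of that paragraph might be sketched as follows. The function name is hypothetical, matching is assumed to be case-sensitive and exact, and the qualifier is assumed to be written " ,as told to" as in the "Hall, Gladys ,as told to" example later in this document.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Apply the special cases to a byline held in a writable buffer */
static void apply_byline_special_cases(char *byline, size_t buflen)
{
    static const char suffix[] = " ,as told to";
    size_t slen = sizeof suffix - 1;
    size_t len;

    assert(byline != NULL);
    len = strlen(byline);
    if (strcmp(byline, "anonymously") == 0) {
        snprintf(byline, buflen, "%s", "Anon.");
        return;
    }
    /* strip the trailing qualifier, leaving just the name */
    if (len > slen && strcmp(byline + len - slen, suffix) == 0)
        byline[len - slen] = '\0';
}
```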

It then sets up the "actual" (original) byline in byline by taking it from origbyln (i.e. as specified on "as by xxx") or by calling BUILD_AUTH on the primary authors in mainauth_strptr (i.e. the current byline). It also tries to create a list of the "real authors" in realnames by calling GET_REAL_AUTHORS. It then sets up cvtpbdt with a version of the publication details in Bill's format (by translating pbdtbuf if it contains a new format date; or just copying it over otherwise).

If we're doing an XVALIDATE it then checks to see if the publication details indicate the original item was multi-part (e.g. "na1867FOUJan 5+3") and, if so, strips off the last bit and pretends we have the first part (e.g. [Part 1 of 4] in the example shown).
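
A sketch of the multi-part handling, assuming the marker is a trailing "+N" meaning "N further parts" (so "+3" gives four parts in total, as in the example). The function name is hypothetical and the real code may recognise other layouts.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* If pubdet ends in "+N", truncate it at the '+' and return the total
 * number of parts (N + 1); otherwise leave it alone and return 0. */
static int strip_multipart(char *pubdet)
{
    char *plus, *end;
    long extra;

    assert(pubdet != NULL);
    plus = strrchr(pubdet, '+');
    if (plus == NULL || !isdigit((unsigned char)plus[1]))
        return 0;
    extra = strtol(plus + 1, &end, 10);
    if (*end != '\0' || extra <= 0 || extra > 9998)
        return 0;                   /* not a plain trailing "+N" */
    *plus = '\0';
    return (int)extra + 1;
}
```

The caller would then treat the item as part 1 of the returned total.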

If the title (in titlbuf) contains "||" then we want to extract only the relevant part for the listing (this possibly needs expanding).

It then sets up the edition as follows:

left unchanged if we have a book
else, for XVALIDATE only, '0' if we have either an original title or byline or if the publication details field is too small to be valid
'1' if we have a bookid that matches the publication details
'2' otherwise
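
The edition rules above reduce to a short chain of tests. In this sketch the individual conditions are passed in as flags, because how they are worked out (in particular what counts as "too small to be valid") is not described here; the function name and parameters are hypothetical.

```c
#include <assert.h>

/* Choose the edition code for an item, following the four rules above */
static char choose_edition(char current, int is_book, int is_xvalidate,
                           int has_orig_title_or_byline,
                           int pubdet_too_small,
                           int bookid_matches_pubdet)
{
    assert(current != '\0');    /* caller must have set an edition */
    if (is_book)
        return current;         /* left unchanged for books */
    if (is_xvalidate && (has_orig_title_or_byline || pubdet_too_small))
        return '0';
    if (bookid_matches_pubdet)
        return '1';
    return '2';
}
```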

It then tries to sort out the title to be used. First, if we have an "el" or "en" record it prefixes the title with "0_" to match the format used by magazine editors set up in SCAN_RECORD above. If the title then does not start with "0_" it <does a lot of massaging (TBS)>.

It then defaults authtype to "normal" (SECAUT_NORMAL) and calls FLUSH_SORT_PART a number of times as discussed above. Note that:

In the first case (main authors) it checks to see if we have an item title and, if so, uses that instead of the main title. Note that, in this case, if the item type is "cl" it is reset to "ar". This is also the only time when the do_secondary flag is set to PSP_TRUE
In the second case (series names) it copies the itemdetails structure so that it can reset serialpart and serialmax to their defaults and set edition to '0' (these are critical for the comparisons XVALIDATE does)
In the third case (original title/byline) it copies the itemdetails structure so that it can reset titlbuf and ttadbuf from origtitl and origttad (if we have an original title) before setting the latter to null. It then calls FLUSH_SORT_PART specifying either the main author(s) or the byline author(s) as appropriate and with the main title or item title as appropriate. (Note that, if we have an original title, there can't be an item title).
In the fourth case (artists) it creates an empty itemdetails structure, copying bookid from the original structure, setting cvtpbdt to "il" and byline by calling SPLIT_AUTH, SORT_AUTH & BUILD_AUTH (as above) from the artists specified in artistbuf. It then calls FLUSH_SORT_PART with a title of "[illustration]" and no title additional.
In the fifth case (subjects) it copies the itemdetails structure so that it can reset authtype to SECAUT_SUBJECT and recreate mainauth_strptr from the name(s) specified in subjbuf

FLUSH_SORT_PART: Internal routine to flush the specified part to scandata structure

/************************************************************************/
/*									*/
/*    FLUSH_SORT_PART - Flush the specified part to scandata structure	*/
/*									*/
/*    Calling Format:							*/
/*									*/
/*	status = FLUSH_SORT_PART (scandata_ptr, details_ptr,		*/
/*				  auth_strptr, titl_ptr, titlad_ptr,	*/
/*				  do_secondary, prtfil_ptr)		*/
/*									*/
/*    Where:								*/
/*									*/
/*	status	     = Result of operation:				*/
/*		     = PSP_TRUE if OK; else PSP_FALSE			*/
/*	scandata_ptr = Pointer to scandata structure			*/
/*	details_ptr  = Pointer to itemdetails structure			*/
/*	auth_strptr  = Pointer to AUTH structure for this part		*/
/*	titl_ptr     = Pointer to title to use				*/
/*	titlad_ptr   = Pointer to title additional to use		*/
/*	do_secondary = PSP_TRUE if we want secondary authors		*/
/*		     = PSP_FALSE otherwise				*/
/*      prtfil_ptr   = Pointer to diagnostics file (may be NULL)	*/
/*									*/
/************************************************************************/

The routine tries to create a new scanitem element as seen from the perspective of each of the authors (primary or secondary). Much of the complexity of this is discussed in the Background section above.

The hard parts of this are done via BUILD_PART_AUTH (which works out the perspective from each author in turn) and SETUP_SCANITEM (which actually sets up the scanitem element), so all this routine actually does is:

Loop round calling BUILD_PART_AUTH for each primary author in turn, calling SETUP_SCANITEM for each one. Note that we always call SETUP_SCANITEM at least once even if there are no primary authors (i.e. the initial call to BUILD_PART_AUTH returns PSP_FALSE). This picks up cases where the author field is specified as something like "Hall, Gladys ,as told to" or "Merryweather, E. ,trans." - there are probably better ways to do this, but it works for now. For it to work, though, BUILD_PART_AUTH must initialise the output buffers even when returning an error.
If we're doing main authors/titles (sort_part = 1) we then do the same for the secondary authors.
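The "always call SETUP_SCANITEM at least once" behaviour can be sketched as a do/while loop. In this sketch the array of names stands in for the successive results of BUILD_PART_AUTH and the callback stands in for SETUP_SCANITEM; all the names used (flush_part_sketch, emit_fn, emit_stats, count_emit) are hypothetical and the real routine works from an AUTH structure instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for SETUP_SCANITEM: receives one author name
   (an empty string when the part has no usable author). */
typedef int (*emit_fn)(const char *author, int authnum, void *ctx);

/* Call emit once per author, but at least once even if nauth is 0.
   Returns the number of emit calls made, or -1 if emit reported failure. */
int flush_part_sketch(const char *const *authors, size_t nauth,
                      emit_fn emit, void *ctx)
{
    size_t i = 0;
    int calls = 0;

    assert(emit != NULL);
    do {
        /* With no authors we still emit once, using an empty name. */
        const char *name = (i < nauth) ? authors[i] : "";

        if (!emit(name, (int)i + 1, ctx))
            return -1;
        calls++;
        i++;
    } while (i < nauth);
    return calls;
}

/* Example callback: counts calls and notes whether an empty name was seen. */
struct emit_stats { int calls; int saw_empty; };

int count_emit(const char *author, int authnum, void *ctx)
{
    struct emit_stats *s = ctx;

    (void)authnum;
    s->calls++;
    if (author[0] == '\0')
        s->saw_empty = 1;
    return 1;
}
```

The empty name in the no-author case mirrors the requirement that BUILD_PART_AUTH initialise its output buffers even when it returns an error.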

SETUP_SCANITEM: Internal routine to set up a scanitem and add to scandata structure

/************************************************************************/
/*									*/
/*    SETUP_SCANITEM - Set up a scanitem and add to scandata structure	*/
/*									*/
/*    Calling Format:							*/
/*									*/
/*	status = SETUP_SCANITEM (scandata_ptr, details_ptr,		*/
/*				 titl_ptr, titlad_ptr, itemad_ptr,	*/
/*				 auth_ptr, coauth_ptr, authnum,		*/
/*				 prtfil_ptr)				*/
/*									*/
/*    Where:								*/
/*									*/
/*	status	     = Result of operation:				*/
/*		     = PSP_TRUE if OK; else PSP_FALSE			*/
/*	scandata_ptr = Pointer to scandata structure			*/
/*	details_ptr  = Pointer to itemdetails structure			*/
/*	titl_ptr     = Pointer to title to use				*/
/*	titlad_ptr   = Pointer to title additional to use		*/
/*	itemad_ptr   = Pointer to item additional to use		*/
/*	auth_ptr     = Pointer to author name for this part		*/
/*	coauth_ptr   = Pointer to coauthor names for this part		*/
/*	authnum      = Author number (if multiple authors)		*/
/*      prtfil_ptr   = Pointer to diagnostics file (may be NULL)	*/
/*									*/
/************************************************************************/

For each instance it sets up the fields as follows:

auth_ptr: Pointer to the contents of the author name as returned by BUILD_PART_AUTH. The only massaging done here is to change any name enclosed in double quotes such that both double quotes are at the end (e.g. "Saki" --> Saki"")
nrmaut_ptr: Pointer to the normalised version of auth_ptr as returned by NORMALISE_AUTHOR
authtype: As specified in authtype in the item details structure plus SCANAUT_MAYBE if the author in auth_ptr was flagged as being dubious
cmpttl_ptr: Pointer to a compacted version of the title passed as a parameter to the routine (using COMPACT_TITLE)
titl_ptr: Pointer to the normal version of the title passed as a parameter to the routine (note that this includes a trailing '|' for column/series prefixes)
ser_part_ptr: Pointer to the contents of serialpart if sort_part is not 2 or a pointer to "␢␢␢␢␢" if sort_part is 2 (i.e. we're checking series titles only). Note that in the latter case we allocate a new buffer each time rather than pointing to a static string because FREE_SCANDATA attempts to free the contents of this field if it is specified.
ser_max_ptr: Pointer to the contents of serialmax if sort_part is not 2 or a pointer to "␢␢␢␢␢" if sort_part is 2 (i.e. we're checking series titles only). Note that in the latter case we allocate a new buffer each time rather than pointing to a static string because FREE_SCANDATA attempts to free the contents of this field if it is specified.
pubdet_ptr: Pointer to the contents of cvt_pubdet
ttad_ptr: Pointer to the contents of titladbuf
series_ptr: Pointer to the contents of seriesbuf
edition: char*2 Edition:
- 0 if unknown or sort_part=1 and we have an original title or byline or sort_part=2
- 1 if sort_part=1 and bookid = pubdetbuf
- 2 otherwise (i.e. reprint)
dtpubl_ptr: Pointer to the contents of dtpubl
magid_ptr: Pointer to the contents of bookid
origttl_ptr: Pointer to the contents of origtitl if sort_part is not 3 (i.e. not doing record for original title/byline) and null otherwise
origtad_ptr: Pointer to the contents of origtitlad if sort_part is not 3 (i.e. not doing record for original title/byline) and null otherwise
byline_ptr: Pointer to the contents of the byline as returned by BUILD_PART_AUTH (unless doing record for byline, except when that is also a pseudonym as discussed above)
coauth_ptr: Pointer to the contents of the co-authors as returned by BUILD_PART_AUTH.
secnam_ptr: Pointer to the contents of the secondary names as returned by BUILD_PART_AUTH.
notes_ptr: Pointer to the contents of orignotes
recnum: char*10 8-digit number representing the position in the input file(s) of the record, derived from recnum. It is used to prevent the program falsely comparing items on the same record.
filnam_ptr: Pointer to Current File Name, derived from filnam
val_level: char*2 Validation Level, derived from cur_val_level
ednote_ptr: Pointer to contents of notebuf
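The quote massaging described for auth_ptr can be sketched as below. This is a minimal illustration that assumes the whole name is enclosed in double quotes; the function name quotes_to_end is hypothetical and the real code may handle other cases.

```c
#include <assert.h>
#include <string.h>

/* If name is wholly enclosed in double quotes, rewrite it in place so that
   both quotes trail the text ("Saki" becomes Saki""). The length is
   unchanged. Returns 1 if the name was rewritten, 0 otherwise. */
int quotes_to_end(char *name)
{
    size_t len;

    assert(name != NULL);
    len = strlen(name);
    if (len < 2 || name[0] != '"' || name[len - 1] != '"')
        return 0;
    /* Shift the inner text left by one, then put a quote in the freed slot. */
    memmove(name, name + 1, len - 2);
    name[len - 2] = '"';
    return 1;
}
```

Names that are not quoted are left untouched, so the routine can be applied to every author name.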

If we're not doing a file sort, these fields are added to a new scanitem structure, a pointer to which is stored in itmbuf_ptr; if we are doing a file sort then the fields are output to file, prefixed as discussed above and separated by %x1f characters (to ensure the sort orders we want). The order in which the fields are output is optimised for the sort routines and is as follows, depending on the value of scantype in the scandata structure:

For SCANTYPE_XVALIDATE & SCANTYPE_GENATTRIB:
- Normalised Author/Author Type/Compacted Title/Full Title/Serial Part/Serial Maximum/Series Name/Publication Details/Edition/Record Number
For SCANTYPE_IDXGEN:
- Normalised Author/Author Type/Compacted Title/Full Title/Coauthors/Secondary Names/Edition/Publication Date/Record Number

WRITE_SCANITEM: Write a scanitem element to specified file

/************************************************************************/
/*									*/
/*    WRITE_SCANITEM - Write scanitem element to specified file		*/
/*									*/
/*    Calling Format:							*/
/*									*/
/*	WRITE_SCANITEM (outfil_ptr, scanitem_ptr, sort_order,		*/
/*			prtfil_ptr)					*/
/*									*/
/*    Where:								*/
/*									*/
/*	outfil_ptr   = Output file to write record to			*/
/*	scanitem_ptr = Pointer to scanitem element to output		*/
/*	sort_order   = Sort Order to determine order of fields		*/ 
/*		     = SCANTYPE_XVALIDATE: for XVALIDATE		*/
/*		     = SCANTYPE_GENATTRIB: for GENATTRIB		*/
/*		     = SCANTYPE_IDXGEN: for Story Title Index		*/
/*		     = SCANTYPE_IDXAUT: for Story Author Index		*/
/*		     = SCANTYPE_IDXCHN: for Chronological Index		*/
/*		     = SCANTYPE_IDXSR1: for Type 1 Series Order		*/
/*		     = SCANTYPE_IDXSR2: for Type 2 Series Order		*/
/*		     = SCANTYPE_IDXSR3: for Type 3 Series Order		*/
/*		     = SCANTYPE_IDXSR4: for Type 4 Series Order		*/
/*    	prtfil_ptr   = Pointer to diagnostics file (may be NULL)	*/
/*									*/
/************************************************************************/

Although the external sort software (CMSORT) supports the specification of specific fields to sort by, early tests showed that this increased the sort time dramatically. To avoid this, the various programs that use these routines create the files concerned with the data arranged in the order in which it is to be sorted. In the case of IDXGEN there are several different sort orders depending on the routines concerned. Currently the sort orders supported are:

SCANTYPE_XVALIDATE (XVALIDATE):
- Normalised Author/Author Type/Compacted Title/Full Title/Serial Part/Serial Maximum/Series Name/Publication Details/Edition/Record Number
SCANTYPE_GENATTRIB (GENATTRIB):
- Same as SCANTYPE_XVALIDATE
SCANTYPE_IDXGEN (IDXGEN for Story Title Index):
- Compacted Title/Full Title/Normalised Author/Publication Details/Edition/Record Number
- Note that the "Publication Details" was added when the code was changed to add "first appearance" records for any items that didn't have them. This was because there might be multiple uncredited items with the same title and SETUP_STYTTL_IDX is called before the new records are added and BUILD_STYTTL_IDX is called afterwards so it is essential that the items appear in the same order or the anchors don't work. This is probably redundant now that we have the done_original flag but it doesn't do any harm.
- For the same reason "Record Number" was added at the end as sometimes the new records added changed the sort order if the rest of the key was the same. This is probably also now redundant.
SCANTYPE_IDXAUT (IDXGEN for Story Author Index):
- Normalised Author/Author Type/Edit Type/Compacted Title/Full Title/Title Additional/Item Type/Coauthors/Secondary Names/Subject/Edition/Original Title/Publication Date/Record Number
  - NB: An earlier version omitted the Title Additional but this caused problems where multiple items had the same title but different title additionals (e.g. "(An) Interview with Piers Anthony")
  - An earlier version omitted the "Original Title" but this caused problems when the aggregation rules were changed to split out items with different original titles (see Notes on Past Problem Areas)
  - The Item Type is not a formal field (hence uses the dummy prefix xx:) and is used to allow items with the same title but different item types to be sorted together (note this only works if we don't have an extract from either). The item type was used instead of the full publication details as the dates in the latter would have disrupted the sort order (I think).
SCANTYPE_IDXCHN (IDXGEN for Chronological Index):
- Normalised Author/Publication Date/Book Type/Compacted Title/Edition/Full Title/Record Number
SCANTYPE_IDXSR1 (IDXGEN for Type 1 Series Order):
- Series Name/Series Type/Normalised Author/Real Author(s)/Publication Date/Book Type/Compacted Title/Full Title/Edition/Record Number
SCANTYPE_IDXSR2 (IDXGEN for Type 2 Series Order):
- Series Name/Series Type/Publication Date/Book Type/Compacted Title/Full Title/Edition/Record Number
SCANTYPE_IDXSR3 (IDXGEN for Type 3 Series Order):
- Series Name/Series Type/Compacted Title/Full Title/Publication Date/Normalised Author/Edition/Record Number
SCANTYPE_IDXSR4 (IDXGEN for Type 4 Series Order):
- Same as IDXSR1 (apart from the Series Type)

In each case, all the other fields can be specified in any order. The routine constructs a record where each field is prefixed by a field code and a colon and suffixed with a divider (^x1f) - the field codes are indicated in the definition of the scanitem element.
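The record layout (field code, colon, value, divider) can be sketched as follows. The helper name append_field and the field codes used in the usage note are hypothetical; the real codes are those given in the definition of the scanitem element.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define FIELD_DIVIDER '\x1f'   /* the 0x1f divider that ends each field */

/* Append "code:value<0x1f>" to the NUL-terminated record held in buf.
   Returns 1 on success, 0 if the field would not fit (buf is then left
   as it was on entry). */
int append_field(char *buf, size_t buflen, const char *code, const char *value)
{
    size_t used;
    int n;

    assert(buf != NULL && code != NULL && value != NULL);
    used = strlen(buf);
    if (used >= buflen)
        return 0;
    n = snprintf(buf + used, buflen - used, "%s:%s%c",
                 code, value, FIELD_DIVIDER);
    if (n < 0 || (size_t)n >= buflen - used) {
        buf[used] = '\0';      /* undo any partial write */
        return 0;
    }
    return 1;
}
```

Calling append_field for the key fields first (for example a made-up "na" code for the normalised author, then "ct" for the compacted title) gives a record that a plain whole-line sort will order correctly, which is the point of writing the fields in sort order.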

ALLOC_SCANDATA: Routine to allocate a new scandata structure

/************************************************************************/
/*									*/
/*    ALLOC_SCANDATA - Allocate a new scandata structure.		*/
/*									*/
/*    Calling Format:							*/
/*									*/
/*	scandata_ptr = ALLOC_SCANDATA ();				*/
/*									*/
/*    Where:								*/
/*									*/
/*	scandata_ptr  = Pointer to new scandata structure (NULL if none)*/
/*									*/
/*    This routine allocates a new scandata structure and initializes	*/
/*    the contents of it.						*/
/*									*/
/************************************************************************/

COPY_ITEM_DETAILS: Copy an itemdetails structure

/************************************************************************/
/*									*/
/*    COPY_ITEM_DETAILS - Copy Item Details Structure.			*/
/*									*/
/*    Calling Format:							*/
/*									*/
/*	COPY_ITEM_DETAILS (newdetails_ptr, olddetails_ptr);		*/
/*									*/
/*    Where:								*/
/*									*/
/*	newdetails_ptr	 = Pointer to new itemdetails structure		*/
/*	olddetails_ptr	 = Pointer to old itemdetails structure		*/
/*									*/
/*    This routine copies all the fields from the old ITEMDETAILS	*/
/*    structure to the new ITEMDETAILS structure.			*/
/*									*/
/************************************************************************/

DUMP_SCANDATA: Internal routine to dump a scandata structure

/************************************************************************/
/*									*/
/*    DUMP_SCANDATA - Dump a scandata structure.			*/
/*									*/
/*    Calling Format:							*/
/*									*/
/*	DUMP_SCANDATA (scandata_ptr, outfil_ptr);			*/
/*									*/
/*    Where:								*/
/*									*/
/*	scandata_ptr = Pointer to scandata structure			*/
/*	outfil_ptr   = Pointer to file to dump structure to		*/
/*									*/
/*    This routine dumps the contents of a scandata structure to an	*/
/*    external file.							*/
/*									*/
/************************************************************************/

FREE_SCANDATA: Routine to free a scandata structure

/************************************************************************/
/*									*/
/*    FREE_SCANDATA - Free a scandata structure.			*/
/*									*/
/*    Calling Format:							*/
/*									*/
/*	FREE_SCANDATA (scandata_ptr);					*/
/*									*/
/*    Where:								*/
/*									*/
/*	scandata_ptr  = Pointer to scandata structure			*/
/*									*/
/*    This routine frees a scandata structure and any buffers		*/
/*    allocated to it.							*/
/*									*/
/************************************************************************/
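Because this routine frees the buffers attached to each element, fields such as ser_part_ptr and ser_max_ptr must always point at heap storage, even when they only hold the fixed blank placeholder. A minimal sketch of that convention is given below; the structure and function names are hypothetical, and it assumes the "␢␢␢␢␢" placeholder is five spaces.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define BLANK_SERIAL "     "   /* assumed: five spaces */

/* Hypothetical stand-in for the two serial fields of a scanitem element. */
struct serial_fields {
    char *ser_part_ptr;
    char *ser_max_ptr;
};

/* Return a heap copy of src (NULL if out of memory). */
char *dup_string(const char *src)
{
    size_t n;
    char *copy;

    assert(src != NULL);
    n = strlen(src) + 1;
    copy = malloc(n);
    if (copy != NULL)
        memcpy(copy, src, n);
    return copy;
}

/* Free both fields; safe to call when either is already NULL. */
void free_serial_fields(struct serial_fields *f)
{
    assert(f != NULL);
    free(f->ser_part_ptr);
    free(f->ser_max_ptr);
    f->ser_part_ptr = NULL;
    f->ser_max_ptr = NULL;
}

/* Give each field its own heap copy of the blank placeholder so that the
   free routine never sees a pointer to a static string. */
int set_blank_serial(struct serial_fields *f)
{
    assert(f != NULL);
    f->ser_part_ptr = dup_string(BLANK_SERIAL);
    f->ser_max_ptr = dup_string(BLANK_SERIAL);
    if (f->ser_part_ptr == NULL || f->ser_max_ptr == NULL) {
        free_serial_fields(f);
        return 0;
    }
    return 1;
}
```

Pointing both fields at one shared static string would save an allocation, but the free routine would then need to know which pointers it owns.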

INIT_SCANITEM: Internal routine to initialise a scanitem element

/************************************************************************/
/*									*/
/*    INIT_SCANITEM - Initialise a scanitem structure.			*/
/*									*/
/*    Calling Format:							*/
/*									*/
/*	INIT_SCANITEM (scanitem_ptr);					*/
/*									*/
/*    Where:								*/
/*									*/
/*	scanitem_ptr  = Pointer to scanitem element			*/
/*									*/
/*    This routine initialises the contents of a scanitem structure.	*/
/*									*/
/************************************************************************/

RESET_ITEM_DETAILS: Internal routine to reset an itemdetails structure

/************************************************************************/
/*									*/
/*    RESET_ITEM_DETAILS - Reset Item Details Structure.		*/
/*									*/
/*    Calling Format:							*/
/*									*/
/*	RESET_ITEM_DETAILS (details_ptr);				*/
/*									*/
/*    Where:								*/
/*									*/
/*	details_ptr  = Pointer to itemdetails structure			*/
/*									*/
/*    This routine resets all the fields in the specified ITEMDETAILS	*/
/*    structure.							*/
/*									*/
/************************************************************************/

RESET_SCANDATA: Internal routine to reset a scandata structure

/************************************************************************/
/*									*/
/*    RESET_SCANDATA - Reset a scandata structure.			*/
/*									*/
/*    Calling Format:							*/
/*									*/
/*	RESET_SCANDATA (scandata_ptr);					*/
/*									*/
/*    Where:								*/
/*									*/
/*	scandata_ptr = Pointer to scandata structure			*/
/*									*/
/*    This is an internal routine that resets a scandata structure so	*/
/*    that it can be reused or deleted.					*/
/*									*/
/************************************************************************/

SCANITEM_COMP_RTN: Compare a pair of scanitem elements for sort purposes

/************************************************************************/
/*									*/
/*    SCANITEM_COMP_RTN - Sort a scanitem array.			*/
/*									*/
/*    Calling Format:							*/
/*									*/
/*	rtnsts = SCANITEM_COMP_RTN (rec1_ptr, rec2_ptr);		*/
/*									*/
/*    Where:								*/
/*									*/
/*	rtnsts	 = Routine return status (as for strcmp)		*/
/*	rec1_ptr = Pointer to first scanitem element			*/
/*	rec2_ptr = Pointer to second scanitem element			*/
/*									*/
/*    This routine is called via qsort to sort a scanitem array into	*/
/*    the "correct" order - see the documentation or code for details.	*/
/*									*/
/************************************************************************/

This routine is called via qsort to sort the scanitem array in a scandata structure into the order required by XVALIDATE. The sort priority is as follows:

author name and compacted title: if these are not the same then we don't regard two records as matching at all
full title: a mismatch in the full title is of primary importance
serial part and maximum: these need to be sorted next so that we can match up reprints, whether the reprint has the same serial parts as the original or not (as in the latter case we fake a reprint record to match the first part of the serial)
publication details: this is also critical for serial parts to ensure they match up. Note that, during the sorting, item types of "ex" are changed to "zx" so that they sort after whatever type they are supposedly an extract of
edition: so that reprints sort after the original
record number: to ensure there are stable results
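The sort priority above can be sketched as a qsort comparator that falls through from one key to the next. This is only an illustration: the structure is a reduced, hypothetical stand-in for a scanitem element, it assumes the array holds the elements themselves rather than pointers to them, and it omits the "ex" to "zx" adjustment.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Reduced, hypothetical stand-in for a scanitem element. */
struct item_sketch {
    const char *author;    /* normalised author */
    const char *cmpttl;    /* compacted title */
    const char *title;     /* full title */
    const char *ser_part;  /* serial part */
    const char *ser_max;   /* serial maximum */
    const char *pubdet;    /* publication details */
    char edition;          /* '0', '1' or '2' */
    long recnum;           /* record number, used as the final tie-break */
};

/* Compare two elements key by key; returns <0, 0 or >0 as for strcmp. */
int item_comp(const void *a, const void *b)
{
    const struct item_sketch *x = a;
    const struct item_sketch *y = b;
    int r;

    assert(a != NULL && b != NULL);
    if ((r = strcmp(x->author, y->author)) != 0) return r;
    if ((r = strcmp(x->cmpttl, y->cmpttl)) != 0) return r;
    if ((r = strcmp(x->title, y->title)) != 0) return r;
    if ((r = strcmp(x->ser_part, y->ser_part)) != 0) return r;
    if ((r = strcmp(x->ser_max, y->ser_max)) != 0) return r;
    if ((r = strcmp(x->pubdet, y->pubdet)) != 0) return r;
    if (x->edition != y->edition)
        return (x->edition < y->edition) ? -1 : 1;
    /* The record number makes the order fully deterministic. */
    return (x->recnum > y->recnum) - (x->recnum < y->recnum);
}
```

Since qsort is not guaranteed to be stable, ending on the record number is what gives repeatable results when every other key matches.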

SPLIT_SCANITEM: Internal routine to split a scanitem record into a scanitem element

/************************************************************************/
/*									*/
/*    SPLIT_SCANITEM - Split scanitem record into scanitem structure	*/
/*									*/
/*    Calling Format:							*/
/*									*/
/*	status = SPLIT_SCANITEM (fldbuf_ptr, scanitem_ptr, prtfil_ptr)	*/
/*									*/
/*    Where:								*/
/*									*/
/*	status	     = Routine return status:				*/
/*		     = PSP_TRUE if everything OK			*/
/*		     = PSP_FALSE if we hit an error			*/
/*	fldbuf_ptr   = Pointer to buffer holding record			*/
/*	scanitem_ptr = Pointer to scanitem element to set up		*/
/*      prtfil_ptr   = Pointer to diagnostics file (may be NULL)	*/
/*									*/
/************************************************************************/
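Splitting a sorted record back into its fields is the reverse of the way the records are written (field code, colon, value, 0x1f divider). A minimal sketch is given below, assuming that layout; the names split_record, field_fn, want and grab_field are hypothetical, and the real routine stores the values in a scanitem element rather than calling a callback.

```c
#include <assert.h>
#include <string.h>

#define FIELD_DIVIDER '\x1f'

/* Hypothetical callback: receives each field code and value in turn. */
typedef void (*field_fn)(const char *code, const char *value, void *ctx);

/* Split a record of "code:value<0x1f>" fields in place, calling on_field
   for each one. The buffer is modified (dividers and the first colon of
   each field become NULs). Returns the number of fields found, or -1 if
   a field has no colon. */
int split_record(char *rec, field_fn on_field, void *ctx)
{
    int count = 0;
    char *p = rec;

    assert(rec != NULL && on_field != NULL);
    while (*p != '\0') {
        char *end = strchr(p, FIELD_DIVIDER);
        char *colon;
        char *next;

        if (end != NULL) {
            *end = '\0';
            next = end + 1;
        } else {
            next = p + strlen(p);   /* last field has no divider */
        }
        colon = strchr(p, ':');     /* only the first colon is special */
        if (colon == NULL)
            return -1;
        *colon = '\0';
        on_field(p, colon + 1, ctx);
        count++;
        p = next;
    }
    return count;
}

/* Example callback: remember the value of one wanted field code. */
struct want { const char *code; const char *value; };

void grab_field(const char *code, const char *value, void *ctx)
{
    struct want *w = ctx;

    if (strcmp(code, w->code) == 0)
        w->value = value;
}
```

Because fields are identified by their code rather than their position, the same splitter works whatever order the fields were written in.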