
Current Processes for Maintaining Index(es)

This uses the "Quarterly Update" as an example, as it is a superset of the processes involved in adding other ToCs to the index(es). Note that programs and files flagged ** are ones that were only used by Phil and hence are less well documented (and probably less portable) than the others.

Create Text File containing new ToCs

The source list for the Big List (ARCHIVE.TXT**) contains an entry for each magazine. The magazines included in the quarterly update have a #NOTE record saying "Current: Checked" which includes the date it was last checked, the latest cover scan that has been retrieved, the latest entry in FT_LINKS.CVT, and possibly other notes. (Note that there are also some flagged "Current: Checked: TODO" which are under consideration but have not yet been added to the quarterly update.)
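
For illustration only, such a record might look something like the following (this layout is purely hypothetical; the exact wording and punctuation of the #NOTE records is not documented here):

    #NOTE: Current: Checked 2019/03/31; covers to v12 #4; FT_LINKS to v12 #3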

The quarterly update then basically goes through each such entry, checks the website (and/or Amazon) to see if there are any new issues and, if so, updates the entry accordingly and adds a skeleton ToC for each such issue to a text file. Currently this is in a format similar to that described in the user documentation with some exceptions and additions:

Clearly, exactly the same process is used to create the text file from ToCs supplied by other users and/or taken from magazines or magazine (ToC) scans. Note that the holding file I use for FMI ToCs (fmtocs.txt**) contains a series of notes at the beginning that help convert ToCs supplied by users into a standard format.

Note that, when doing the quarterly update, a cover scan for each new issue (where relevant) should also be downloaded and stored with all the other new/updated cover scans for the current period (see below).

Convert Text File to Data File

The text file is converted to the internal data file format via a program called MagParse** – a C++ program which prompts for the input file name, defaulting to the previous file used. In general I just reuse the same file (temp.txt) over and over again.

MagParse primarily tries to convert the text file format into the internal format used by all the main programs. It is fairly fussy and will report errors if it doesn’t understand something. "Some" of these are reported via dialog boxes but the best place to check is the error file the program creates. This is named xxx.err where the input text file was named xxx.txt (and is in the same folder as the text file).

One particular error that MagParse traps is the accidental inclusion of any 8-bit characters (such as á or –). The programs all work on 7-bit characters (partly because the 8-bit mappings in Bill’s computer were not standard Windows mappings though I was never quite sure why, and partly because 8-bit characters sometimes get corrupted in e-mail exchanges).
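
The check itself is easy to replicate. The following is a minimal sketch (not the actual MagParse code) of scanning a text file for characters outside the 7-bit range:

    // Sketch only (not the actual MagParse code): scan a text file and
    // report any characters outside the 7-bit ASCII range.
    #include <cstdio>
    #include <fstream>
    #include <string>

    int main(int argc, char* argv[])
    {
        if (argc < 2) { std::fprintf(stderr, "usage: chk7bit <file>\n"); return 1; }
        std::ifstream in(argv[1], std::ios::binary);
        std::string line;
        int lineNo = 0;
        while (std::getline(in, line)) {
            ++lineNo;
            for (std::size_t col = 0; col < line.size(); ++col) {
                unsigned char ch = static_cast<unsigned char>(line[col]);
                if (ch > 0x7F)   // an 8-bit character such as an accented letter
                    std::printf("line %d, col %zu: 8-bit character 0x%02X\n",
                                lineNo, col + 1, static_cast<unsigned>(ch));
            }
        }
        return 0;
    }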

Note that the data file is created as xxx.mag in the same folder. To avoid accidental over-writing of the data file (as has happened before) the program checks for its existence when it starts and will refuse to overwrite it. As such, if you want to fix errors that have been reported and then rerun MagParse, you must first manually delete xxx.mag.
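
In outline, and assuming C++17 for brevity (the real MagParse may well predate it), the file-name handling and overwrite check look something like this:

    // Sketch only: derive the .err and .mag names from the input file name
    // and refuse to run if the data file already exists.
    #include <cstdio>
    #include <filesystem>

    namespace fs = std::filesystem;

    int main(int argc, char* argv[])
    {
        if (argc < 2) { std::fprintf(stderr, "usage: magparse <file.txt>\n"); return 1; }
        fs::path input(argv[1]);                                  // e.g. temp.txt
        fs::path errFile = fs::path(input).replace_extension(".err");
        fs::path magFile = fs::path(input).replace_extension(".mag");
        if (fs::exists(magFile)) {
            std::fprintf(stderr, "%s already exists - delete it and rerun\n",
                         magFile.string().c_str());
            return 1;
        }
        // ... parse input, writing diagnostics to errFile and records to magFile ...
        return 0;
    }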

Anything MagParse doesn’t understand at all it puts in an EB record, so this is a useful way to include post-processing notes (e.g. for things MagParse doesn’t handle). The first thing to do after running MagParse is therefore to check all instances of B1, in case it failed to convert something that it should have converted, and/or to fix up any post-processing notes.

Validate the Data File

Validate attempts to validate a data file against the various formatting and capitalisation rules. The degree of validation performed can be controlled by means of the Validation Control Flags but it’s desirable to do a full validation on all new ToCs added to the database so that the "validation level" of the database as a whole gradually improves.

The errors reported by Validate generally fall into three categories:

  1. Adjustments needed to the date ranges in PSEUD.CVT for specified names;
  2. Reports of names that are not defined in PSEUD.CVT;
  3. Others

I tend to handle these in three passes. On the first pass I delete all the diagnostics in categories 1 & 2 and fix all the remaining ones (or add them to the exceptions file VALIDATE.XXX) – note that this pass includes resolving any ambiguous names in the file.

Once the first pass is complete, I rerun Validate, delete the diagnostics in the first category and handle all the undefined names – this is often the most time-consuming part of the process. My approach is to add these records to the latest version of ATTRIB.TXT (see below) with a distinguishing flag (^^^) and then check and resolve each flagged item.

Typically the names will fall into one of four categories:

I generally leave resolution of the first category of diagnostic (adjustments) until after the next two steps as some of the items might be reprints which might, of course, affect the date range.

Revalidate the Control Files

As part of the previous step, it is almost inevitable that changes will have been made to PSEUD.CVT and possibly ABBREV.CVT and/or SERIES.CVT and errors might have crept in. ValNames is a simple program that validates the contents of the three control files, looking for obvious errors.

Check for Prior Appearances

It is likely that some of the items added will either be reprints of something already in the database or, when indexing older magazines, will be earlier appearances of items already in the database. The programs require that all instances of a given item specify the same "first appearance data" so any such conflicts need to be resolved (this was originally a requirement of Bill’s programs that has been carried over to the new programs but might be worth reconsidering at some point).

There is no "official" way of doing this, but I use a file called STORY.IDX** which contains the earliest known appearance of all "significant" items in the database, and a program called Mrg_Sty** which tries to merge the new data with the existing data, flagging up any discrepancies. It also flags all new additions to STORY.IDX (with a ^^^ suffix) so you can manually check them for subtle variations that Mrg_Sty can’t spot (e.g. UK/US spellings).

Mrg_Sty is an old program, and I’m not entirely sure I know how it works any more, but it seems to do the job and there’s always something more important to work on.

Complete processing of the data file

By now we should have updated all known prior appearances so we can run Validate again to see what date adjustments are needed in PSEUD.CVT. Once these have been made, the contents of the (new) data file need to be moved to the associated data files. Note that this might involve creating new data files (if new magazines have been indexed) and it is important to ensure that the new files are also added to the appropriate Index Definition File.

Handle knock-on impact of changes

The changes made as part of adding the new ToCs will probably have a knock-on impact on the existing data, in one of three ways.

Firstly, any new disambiguations will require changes to existing references to the associated names, so the first step is to run Validate on the entire database. With judicious use of the Validation Control Flags and VALIDATE.XXX it should be possible to maintain the database in a state where Validate produces no errors at all. As such, any errors thrown up at this point will purely be a result of the new data and can easily be resolved.

Secondly, the new data may have identified some earlier appearances of existing items. Some of these will have become apparent in step 5 above, but there may well be additional instances. My approach is to use Mrg_Sty to merge each group of files in turn and then use ALL.TXT** (see below) to identify which files are affected.

The third area is somewhat more complex and relates to cross-validation within and between the files. One aspect of this is to check that a repeated item (e.g. a column or series) has the same characteristics in all instances. Another looks for instances of an item under one name being reprinted under a different name. To address these I have a program called Xvalidate**.

Xvalidate is simultaneously very simple and very complex. It is very simple because the guts of it are shared with IdxGen (and GenAttrib); it is very complex because it tries to do some very complicated validation and, to be honest, there are times I don’t quite understand what’s going on! However, as with Validate, if the database is held in a state where Xvalidate produces no errors then rerunning it after every change throws up any problems introduced by the change and they can then be easily resolved.

Note that it can be startling just how many errors Xvalidate throws up, so I tend to run it on each group of files in turn before running it on the whole (magazine) database.

One special type of error that Xvalidate throws up is when one file references an item that was published in another of the magazines in the database, but the file for that magazine does not include the item in question, possibly because the relevant issue hasn’t been indexed. In these cases we generate a skeleton entry so that we can catch any discrepancies if the issue is subsequently indexed. There is a small program, CvtSkel**, which converts the original item line into a skeleton entry of the required format.

Regenerate ATTRIB.TXT and ALL.TXT

While not (yet) formally documented as control/support files, ATTRIB.TXT and ALL.TXT were files originally generated by Bill (I think as offshoots of his index generation programs) which proved so useful that, when Bill could no longer supply them, I wrote my own program GenAttrib** to generate them. FWIW, GenAttrib was written by generalising the core code used by Xvalidate, and that core code then formed the basis for IdxGen.

ATTRIB.TXT contains an entry for every name in the database (or for whichever part of the database you run the program on) indicating the date range of their (original) appearances and summarising the types of entry (e.g. fiction, poems, etc.). If the name has an entry in PSEUD.CVT then it also includes the main data from PSEUD.CVT. This is used, as mentioned above, when adding new entries to PSEUD.CVT as it shows all names in use, rather than just those in PSEUD.CVT, and gives an idea of when they were active.
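
As a purely hypothetical illustration (the real layout differs in detail), an entry might look something like:

    Smith, John (1934-1956) fiction, poems [plus any PSEUD.CVT data]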

ALL.TXT is basically a flat file version of the whole database (or whichever part of the database you run the program on). It is useful when trying to disambiguate an author as it lists all the different appearances (apart from artwork, which is more of a challenge). It’s also useful for finding all appearances of a given item when trying to update prior appearance data.

Update COVERS.CVT

Although not directly part of the indexing process, another job I have is the maintenance of COVERS.CVT. Currently all images are held on my website although the file allows for multiple sources of images (adding new locations would currently require program changes to IdxGen).

The first step is obviously gathering new/updated images. For the Quarterly Update this is simply part of the process, but in parallel with this images come from a wide variety of sources – scans of new listings on eBay, direct contributions from others, full magazine scans posted on pulpscans or similar. As these are acquired, the filenames need to be normalised, the images adjusted to remove any skewing and shrunk to a standard size (400px wide), and (if they are replacements) checked to make sure they are an improvement on the existing images.

Periodically there is then a need to update COVERS.CVT (and elsewhere, see below) with any new cover scans. To assist in this I use a program called CvtCovers** which reads a list of file names, compares it against the existing COVERS.CVT and attempts to add new entries for any new images by parsing the file name and attempting to deduce the corresponding issue abbreviation.

This is usually at least 90% successful but there are some file names the program cannot (yet) parse successfully so the diagnostic file needs to be checked for any errors. There are also unavoidable ambiguities (e.g. when two magazines with the same name exist at the same time) and, at times, confusion about whether a date or issue number should be used, so all new entries are flagged (with a trailing ^^^) and these need to be checked at some point.
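
The core of that deduction might be sketched as follows; note that the file-naming scheme shown (title_yyyymm.jpg) and the lookup table are assumptions for illustration, and the real rules in CvtCovers are considerably more involved:

    // Sketch only; the naming scheme (title_yyyymm.jpg) and the lookup
    // table are assumptions, and the real CvtCovers rules are far fussier.
    #include <iostream>
    #include <map>
    #include <regex>
    #include <string>

    int main()
    {
        // Hypothetical title -> abbreviation table.
        std::map<std::string, std::string> abbrev = {
            {"galaxy", "GXY"}, {"astounding", "AST"},
        };

        std::string name = "galaxy_195010.jpg";       // hypothetical file name
        std::regex pattern(R"(([a-z_]+)_(\d{4})(\d{2})\.jpg)");

        std::smatch m;
        if (std::regex_match(name, m, pattern)) {
            auto it = abbrev.find(m[1].str());
            if (it != abbrev.end())
                // Flag the deduced entry with a trailing ^^^ for later checking.
                std::cout << it->second << ' ' << m[2] << '/' << m[3] << " ^^^\n";
            else
                std::cerr << "cannot deduce abbreviation for " << name << '\n';
        } else {
            std::cerr << "cannot parse " << name << '\n';
        }
        return 0;
    }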

CvtCovers asks if you want to keep any changes made, so at this point I tend to say "No" and focus on addressing any "fixable" errors that are reported. The next step is to add any new images to the relevant Illustrated Checklist and/or to ARCHIVE.TXT – I have a macro in my text editor that assists with the former but this is still a very time-consuming process! It is also likely that, during this process, some mistakes will be found in the file names, so these need to be corrected.

Once you’re happy that all the files are named correctly, you can run CvtCovers again and say you do want to keep the changes, and then check COVERS.CVT to see if any of the conversions are wrong and to fix up any entries that the program was unable to convert. In a small number of cases, an image may have been obtained that is not needed in COVERS.CVT; such images can be added to UNMATCHED_COVERS.TXT**.

It is also necessary to create two thumbnails for each new or improved image (one 100px tall; the other 150px tall) – I use a piece of shareware called ThumbsUp for this – and then to upload everything to the appropriate website.

As a final "belts-and-braces" exercise, I also run ChkCovers** which basically compares all the magazine images in the folder structure against the FM database via COVERS.CVT and UNMATCHED_COVERS.TXT – among other useful things this does, it is also useful for identifying cases where the translation in COVERS.CVT is incorrect (e.g. using a month instead of an issue number or vice versa).

Regenerate the Big List

Terminology gets a bit confusing in this area as I tend to refer to "the GCP website" which, these days, includes a lot of different things including:

The Big List is generated from ARCHIVE.TXT (and ABBREVIATIONS.TXT**) by means of a pair of programs. MagPop** reads, parses and validates the input file(s) and creates an Access 97 database called MAGS.MDB. MagGen** then reads this database and generates all the relevant HTML files as well as a new copy of ARCHIVE.TXT which should be used to replace the working copy (as it is somewhat enhanced/tidied up).

(One of the many "projects for a rainy day" is to rewrite these two to remove the need for the intermediate database. It shouldn’t be too hard – the only tricky bit is that the database implicitly does additional validation by rejecting duplicate records with the same key and this would need to be replicated by manual checks.)

Note that The Big List and The Fictionmags Index Family are relatively independent (necessary because I maintained one and Bill maintained the other) but are linked. The Big List links to the Indexes as follows:

http://www.philsp.com/homeville/FMI/link.asp?magid=xxx

where "xxx" is the abbreviation in ABBREV.CVT (defined via the ABBREV keyword in ARCHIVE.TXT).

If the abbreviation can’t be matched (e.g. because of a typo in ARCHIVE.TXT) the link just goes to the front page of the index.

The reverse link is a bit trickier as the FM database doesn’t contain any mention of which magazines are or are not defined in ARCHIVE.TXT. Instead, MagGen generates a special file called ZZMAGIDS.TXT which contains a series of entries of the form:

xxx~yyyy

where xxx is the abbreviation in ABBREV.CVT as above and yyyy is the name specified on the appropriate MAGID header in ARCHIVE.TXT.

IdxGen then reads this file and, for each (group) header, checks whether the abbreviation is listed in ZZMAGIDS.TXT and, if so, generates a link of the form:

http://www.philsp.com/links2.asp?magid=yyyy

which works in much the same way as above.
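
A minimal sketch of the lookup side (assuming one xxx~yyyy pair per line, as described above; the abbreviation used is hypothetical):

    // Sketch of the lookup IdxGen performs, assuming one "xxx~yyyy" pair
    // per line of ZZMAGIDS.TXT as described above.
    #include <fstream>
    #include <iostream>
    #include <map>
    #include <string>

    int main()
    {
        std::map<std::string, std::string> magids;   // abbreviation -> MAGID name
        std::ifstream in("ZZMAGIDS.TXT");
        std::string line;
        while (std::getline(in, line)) {
            std::size_t sep = line.find('~');
            if (sep != std::string::npos)
                magids[line.substr(0, sep)] = line.substr(sep + 1);
        }

        // For each (group) header, emit a back-link only if the abbreviation
        // is listed; otherwise no link is generated.
        std::string abbrev = "GXY";                  // hypothetical abbreviation
        auto it = magids.find(abbrev);
        if (it != magids.end())
            std::cout << "http://www.philsp.com/links2.asp?magid="
                      << it->second << '\n';
        return 0;
    }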

This allows the two sets of files to be compiled independently of each other, but pragmatically it is best not to release a new version of The Big List without an associated version of the Index Family or there might be explicit index links (e.g. for newly indexed magazines) that don’t work. If the two are being generated "at the same time" then The Big List should be generated first so that IdxGen has an up-to-date version of ZZMAGIDS.TXT to work from.

In theory the current schedule is:

though, in reality, the monthly schedule for the indexes is more of an aspiration than a reality.

Regenerate & Upload the Indexes

Regenerating the Indexes is simply a question of running IdxGen on each of the Index Configuration Files in turn and checking the logs. Note that IdxGen is deliberately intolerant of serious errors (e.g. it will exit if it encounters a series name that is not defined in SERIES.CVT) so it is essential to do a full Validate on the magazine database before running it.

It may also report some less serious errors (e.g. mismatched quote characters) and it’s up to you whether to rerun the program after you fix these or leave it until next time.

When uploading a new version of The Big List I tend just to upload the new files on top of the old files as the upload is fairly quick, the changes relatively minor from release to release and the traffic fairly light.

For the Indexes, though, I use a process that Bill pioneered:

One key point to remember after regenerating the indexes is to update the LASTUPDATE field in the Index Configuration Files.