MagParse - Parse an external file into Fictionmags Format

MagParse creates a draft file in Fictionmags format from a variety of input formats:


Plain Text File

MagParse translates a number of entries representing magazine issues and/or books into the internal Fictionmags format. It is primarily used for converting files submitted to the Fictionmags Index by assorted users, all of whom tend to use a subtly different format. Note that the format used by MagParse is neither the input format documented in the user documentation nor the output format used in the indexes themselves, though it is similar to both.

A typical entry starts with a leading ">>", which marks a magazine or book header. For magazines the basic format is:

>>Magazine Name [issue details] ed. editor(s) (publisher, price, pagecount, format, cover by artist)

where:

For books the basic format is similar but is identified by the book title being in angle brackets:

>><book title> ed. editor(s) (publisher, ISBN, date, price, pagecount, format, booktype, cover by artist)
>><book title> by author(s) (publisher, ISBN, date, price, pagecount, format, booktype, cover by artist)

where:

The other fields are the same as for magazines.
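As a rough illustration of the two header shapes above, the following Python sketch separates a ">>" line into its kind, its leading text and its parenthesised publication details. The function name and the returned dictionary are hypothetical, and it assumes the publication details are the final parenthesised group on the line; it does not attempt to split the individual fields.

```python
import re

# Hypothetical helper: the name and the dictionary layout are illustrative
# only and are not taken from MagParse itself.
def classify_header(line):
    """Split a '>>' header into kind, heading text, credit and raw details."""
    if not line.startswith(">>"):
        return None
    body = line[2:].strip()
    details = ""
    # Assumption: the publication details are the last (...) group on the line.
    m = re.search(r"\(([^()]*)\)\s*$", body)
    if m:
        details = m.group(1)
        body = body[:m.start()].rstrip()
    # A book is recognised by its title being in angle brackets.
    if body.startswith("<") and ">" in body:
        end = body.index(">")
        return {"kind": "book", "heading": body[1:end],
                "credit": body[end + 1:].strip(), "details": details}
    # For a magazine the heading still holds name, issue details and editor.
    return {"kind": "magazine", "heading": body, "credit": "", "details": details}
```

For example, a line such as ">><Example Book> by A. Writer (Example Press, 1999, hc)" would be reported as a book with the heading "Example Book" and the credit "by A. Writer".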

Individual items (i.e. EA records) are then identical for books or magazines and are specified in the format:

[page number] * author(s)[/subject(s)] * title * item type [[series name]][; illus. artist(s)] [* original appearance data]

where:

In addition to the above, any existing A/D/E records may be included in the input file (each must include the terminating '~' character to identify it as such). If an existing 'A' record is used then any following items (to the next ">>" or 'A' record) are assumed to be part of that book/magazine and will inherit any specified publication details.

Any other line is treated as an item appearance note (EB) record unless it starts with "translated from " in which case it is translated to an item note (ED) record - note that such records should not contain asterisks or the code will try to treat them as an EA record item.
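The line-by-line decisions described above could be sketched in Python roughly as follows. This is not the MagParse code: the function name and the returned labels are hypothetical, it assumes an existing A/D/E record can be recognised purely by its terminating '~', and the treatment of blank lines is a guess.

```python
# Hypothetical classifier; labels are illustrative, not MagParse identifiers.
def classify_line(line):
    """Return a label for the kind of record an input line represents."""
    text = line.strip()
    if not text:
        return "blank"          # assumption: blank lines are simply skipped
    if text.startswith(">>"):
        return "header"         # magazine or book header
    if text.endswith("~"):
        return "existing"       # assumption: '~' alone marks an A/D/E record
    if "*" in text:
        return "EA"             # any asterisk makes the line an item record
    if text.startswith("translated from "):
        return "ED"             # item note
    return "EB"                 # item appearance note
```

Note that the asterisk test comes before the "translated from " test, which reproduces the caveat above: a note containing an asterisk is taken to be an item.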

Errors detected by the program are written to a file called xxx.err in the same folder as the input file, where "xxx" matches the file name of the input file; some errors are also output to the screen (for historic reasons) and the program will report at the end of conversion if any errors have been detected. Whether there are any errors or not, the program will create a converted output file called xxx.mag in the same folder. Note that for safety reasons the program will not run if an existing file called xxx.mag is detected in the relevant folder, to avoid accidentally overwriting an existing file (as has happened in the past).
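The file naming and the overwrite check might look something like the Python sketch below. The function name is hypothetical, and it assumes "xxx" is the input file name with its extension removed.

```python
from pathlib import Path

# Hypothetical helper, assuming "xxx" is the input name minus its extension.
def plan_outputs(input_path):
    """Return the (.err, .mag) paths, refusing to run if the .mag exists."""
    src = Path(input_path)
    err_path = src.with_suffix(".err")
    mag_path = src.with_suffix(".mag")
    if mag_path.exists():
        # Mirror the safety rule: never overwrite an existing output file.
        raise FileExistsError(f"{mag_path} already exists; refusing to overwrite it")
    return err_path, mag_path
```

Doing the check before any parsing starts means an existing xxx.mag is never touched, even if the conversion would later fail.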

Note that MagParse was never intended for use by anybody other than the developer so it is not particularly flexible. In particular, the order of the fields in the records above is currently fixed so that if, for example, the illustrator is specified after the first appearance data it will not be translated correctly.


ISFDb Source Files

The ISFDb database is made up of a number of different page types:

Currently the program only handles the pl.htm page type.

One generic problem is that, even though the pages are (presumably) generated by a software program from the underlying database, the precise format of the page content is fairly fluid with elements sometimes starting a new line and sometimes just running on from the previous element. The program currently tries to handle this by progressively stripping off the recognised content and either leaving what's left of the current line or, if there's nothing there, reading a new line (e.g. in the routine CheckStart). This is not wholly successful and a possible alternative would be to read the entire file into a single (massive) CString variable at the start (as is done partially with the pl item header).
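The "strip what is recognised, then fall back to the next line" approach can be illustrated with a small Python sketch. The class and method names are hypothetical and are not a transcription of CheckStart; the sketch only shows the general idea of keeping the remainder of the current line and refilling it when it runs out.

```python
# Hypothetical illustration of progressive stripping; not the CheckStart code.
class LineFeed:
    """Hands out what is left of the current line, or the next non-empty line."""

    def __init__(self, lines):
        self._lines = iter(lines)
        self.current = ""

    def peek(self):
        # Refill from the input only when nothing is left of the current line.
        while not self.current.strip():
            nxt = next(self._lines, None)
            if nxt is None:
                return None          # end of input
            self.current = nxt
        self.current = self.current.strip()
        return self.current

    def step_over(self, prefix):
        """Strip `prefix` if the pending text starts with it; report success."""
        text = self.peek()
        if text is None or not text.startswith(prefix):
            return False
        self.current = text[len(prefix):]
        return True
```

With this shape it does not matter whether two elements arrive on one line or on two: step_over('<div id="content">') followed by step_over('<div class="ContentBox">') succeeds either way.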


Book/Magazine Records (pl*.cgi)

Each record basically contains the following sections:

Page Header

This contains all the standard ISFDb page format such as search box, left-side hyperlinks, etc. It is (currently) identified by the presence of "<div id="content">" which announces the start of the next section. This section is currently ignored.

Item Header

This contains overall details of the item (e.g. editor, date, publisher, etc.). It currently seems to start with "<div class="ContentBox">" (although the code suggests that some records do, or did, start with "<div class="MetadataBox">"). The former may be just at the start of the line while the latter apparently is/was on a line by itself. The code checks to see if it has one or the other: if so, it steps over it; if not then it panics.

If there is a cover scan associated with the item then the image and the item header data are held in a table. This table is not present if there is no cover scan, so we need to check for the existence of "<table>" and remember whether we found it (so that we can handle the terminating "</table>" later). If there is a table then it is followed by code (typically on a single line) along the lines of:

<table>
<tr class="scan">
<td>
<a href="http://www.collectorshowcase.fr/images2/weird_4911.jpg">
<img src="http://www.collectorshowcase.fr/images2/weird_4911.jpg" alt="picture" class="scan"></a>
</td>
<td class="pubheader">

so we try to check for these and step over them. In all cases this should be followed by "<ul>" and multiple items that start with "<li>" (but don't have a terminating "</li>"). As the information associated with a particular "<li>" may or may not be on the same data line, the code first concatenates any following lines which do not start with either "<li>" or "</ul>". Within this section it also strips any spaces before a < or after a > as these tend to be variable and cause confusion. It then checks each of the "<li>" elements as follows:

If we find any other header records then we log an "Unexpected header line" error so that we can investigate and work out how to handle them.
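The consolidation of "<li>" items described above could be sketched in Python as follows. The function name is hypothetical, and joining continuation lines with a single space is an assumption (the space disappears again wherever it sits next to a tag).

```python
import re

# Hypothetical helper illustrating the consolidation step.
def consolidate(lines):
    """Merge continuation lines into the preceding '<li>' record and
    remove whitespace before '<' and after '>'."""
    records = []
    for raw in lines:
        line = raw.strip()
        if line.startswith("<li>") or line.startswith("</ul>") or not records:
            records.append(line)          # starts a new record
        else:
            records[-1] += " " + line     # assumption: join with one space
    return [re.sub(r">\s+", ">", re.sub(r"\s+<", "<", rec)) for rec in records]
```

Normalising the spacing once, immediately after joining, means later checks can compare against fixed strings without worrying about how the page happened to be laid out.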

If we determined that we had a magazine then we try to split off the issue information from the title. These appear in a bewildering variety of ways, such as:

The first three can be detected by looking for the prefix, but if those fail we look for the first comma and assume the rest is a date. In the latter case it tries to reverse the day and month and insert a comma before the year to keep the parse code happy. It also does a bit of house-keeping on the title, removing any trailing "," or ":" and converting "[UK]" to "(UK)". Note that if it can't find any issue information it logs an error and converts the book type to "an".
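A minimal Python sketch of the comma fallback and the title house-keeping is given below. The function names are hypothetical, and the date rewrite is an assumption about what "reverse the day and month" means: it only fires on a day-first date such as "15 March 2008" and leaves anything else untouched. The prefix-based detection and the error logging are not shown.

```python
import re

# Hypothetical helpers; the date handling is an assumption, not MagParse's rule.
def tidy_title(title):
    """Remove a trailing ',' or ':' and convert '[UK]' to '(UK)'."""
    title = title.strip().rstrip(",:").rstrip()
    return title.replace("[UK]", "(UK)")

def split_on_first_comma(full_title):
    """Return (title, date); date is None when no comma is present."""
    head, sep, tail = full_title.partition(",")
    if not sep:
        return tidy_title(full_title), None
    date = tail.strip()
    # Assumption: "15 March 2008" becomes "March 15, 2008".
    m = re.fullmatch(r"(\d{1,2}) ([A-Za-z]+) (\d{4})", date)
    if m:
        date = f"{m.group(2)} {m.group(1)}, {m.group(3)}"
    return tidy_title(head), date
```

A None date corresponds to the case above where no issue information can be found, which is where the error would be logged and the book type changed.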

In all cases it then calls CONVERT_TITLE to strip off any "title additional".

If it is a magazine then it tries to convert the title and issue into the relevant Magazine ID (not sure what the date2 stuff is about).

There might then be one of two records indicating missing or incomplete contents - "Stub record" or "Placeholder, contents incomplete" which we translate (in due course) into the FM format equivalents.

The code then generates the header records for the issue/book from the information parsed to date, including any notes. It then checks we have the terminating "</ul>" and, if we had a cover scan, the terminating "</td>", "</table>" and "Cover art..." records. In the latter case there may also be a record starting "on <a href=" (no idea why) which we just ignore.

Contents Header

There are then a small number of records between the item header and the contents themselves. The contents are terminated by "<div id="VerificationBox">" or "<div class="VerificationBox">" so we check for those first and, if found, just return as it means there aren't any contents. If not then the next record should be either "<div id="ContentBox">" or "<div class="ContentBox">" and we log an error and return if neither is found.
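The order of these checks can be sketched in Python as below. The function name, the returned labels and the error text are hypothetical, and the sketch assumes the markers appear at the start of the record.

```python
# Marker strings are taken from the description above; everything else is
# a hypothetical illustration.
VERIFICATION = ('<div id="VerificationBox">', '<div class="VerificationBox">')
CONTENT = ('<div id="ContentBox">', '<div class="ContentBox">')

def check_contents_start(record, errors):
    """Decide what follows the item header."""
    if record.startswith(VERIFICATION):
        return "no-contents"      # verification box first: nothing to parse
    if record.startswith(CONTENT):
        return "contents"         # carry on to the contents header
    errors.append("Unexpected record before contents: " + record)
    return "error"
```

Checking for the verification box first matters, because a record with no contents goes straight from the item header to the verification section.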

There then might be some optional sections starting with "<span class="containertitle">". It's unclear what the purpose of these is so we just want to skip over them. We're really looking for the presence of a record starting "<h2>Contents " but rather than just throwing away anything that might be before that we explicitly look for records we have checked and found to be harmless (as well as the "VerificationBox" records as above).

Assuming all has gone according to plan we then get a record starting "<h2>Contents " followed by a record containing "<ul>" followed by multiple groups of records starting "<li>" which are parsed as per the next section.

Contents

Each item in the contents section may be spread across multiple lines so we first consolidate them all into a single record as in the Item Header section and remove any spaces before a "<", after a ">" or either side of a divider (which has been translated into the trigraph "^.@"). There are then a number of basic formats for the record:

<li>page number^.@title^.@type byauthor
<li>page number^.@title^.@interview ofinterviewee^.@interview byinterviewer
<li>page number^.@Review:titlebyauthor^.@review byreviewer
<li>page number^.@Review of the xxx "title" by author^.@essay byreviewer

where "title", "author", "interviewee", "interviewer", "reviewer" and the fixed text "Review" are usually hyperlinked to the relevant item or name record. Note that:

An added complication is that various parts of the text may be enclosed in a hint or tooltip structure of the form:

<span class="hint" title="xxx"><a href="xxx" dir="ltr">xxx</a><img src="http://www.isfdb.org/question_mark_icon.gif" alt="Question mark" class="help"></span>
<div class="tooltip"><a href="xxx" dir="ltr">xxx</a><sup class="mouseover">?</sup><span class="tooltiptext tooltipnarrow">xxx</span></div>

in which case we try to isolate the standard "<a href="xxx" dir="ltr">xxx</a>" for the rest of the program to parse (maybe the other title is better??) by calling StripTtip as soon as we have consolidated the record.

Once we have isolated the title there are also cases where a repeated column (or untitled letter or similar) is distinguished by suffixing the issue data. Thus, for example, in "Weird Tales, March-April 2008" we have "The Eyrie (Weird Tales, March-April 2008)". This is easily handled when they are identical but more problematic otherwise. For instance, in "Interzone, #144 June 1999" we have "Ansible Link (Interzone #144)" - at the moment the code doesn't even attempt to handle this.
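The easy case, where the suffix is identical to the issue heading, can be sketched in Python as follows. The function name is hypothetical; the two examples are the ones given above.

```python
# Hypothetical helper covering only the identical-suffix case.
def strip_issue_suffix(title, issue_heading):
    """Remove a trailing ' (<issue heading>)' when it matches exactly."""
    suffix = f" ({issue_heading})"
    if title.endswith(suffix):
        return title[:-len(suffix)]
    return title   # a non-identical suffix is left alone
```

With "Weird Tales, March-April 2008" as the issue heading, "The Eyrie (Weird Tales, March-April 2008)" is reduced to "The Eyrie", while "Ansible Link (Interzone #144)" under "Interzone, #144 June 1999" comes back unchanged.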

Note that, for all the above formats, we should now be at the point where we have "^.@type byauthor" (or "^.@interview ofinterviewee^.@interview byinterviewer") but there are occasionally other bits in the way:

The "type" is then translated as follows:

Anything else is reported as an error so that we can handle it next time round.

On a good day, all we have left is the author, but that may be followed by any (combination) of three suffix clauses:

There's just the (main) author left to parse, via ParseAuthor, with "uncredited" and "Anonymous" normalised to "Anon." (and an item type of "ar" being reset to "ms" in such cases) and "various" normalised to "[Various]".
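The author normalisation just described could be written in Python along these lines. The function name is hypothetical, and exact, case-sensitive matching of the three special names is an assumption.

```python
# Hypothetical helper; exact-match comparison is an assumption.
def normalise_author(author, item_type):
    """Apply the 'Anon.' and '[Various]' rules; returns (author, item_type)."""
    if author in ("uncredited", "Anonymous"):
        # An anonymous "ar" item is reset to "ms".
        return "Anon.", ("ms" if item_type == "ar" else item_type)
    if author == "various":
        return "[Various]", item_type
    return author, item_type
```

Returning the item type alongside the author keeps the "ar" to "ms" reset in one place, since it only applies when the author has been normalised to "Anon.".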

Having parsed the record, there's some final tidying up to be done:

CheckVariant

 

ParseAuthor

In most cases an author name is just embedded in a hyperlink, but if there are multiple authors then each is embedded in its own hyperlink and separated with "<b>and</b>". There's also something messy with embedded "[as by " clauses but I'm not sure how that works.

The ISFDb distinguishes ambiguous authors by adding a suffix of the form " (x)" to the names so we strip those off so that we can handle our own disambiguation.
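A rough Python sketch of the two steps covered so far (splitting on "<b>and</b>" and dropping the disambiguation suffix) follows. The function name is hypothetical, the sketch assumes the " (x)" suffix is a single parenthesised token with no spaces at the end of the name, and it makes no attempt to handle "[as by " clauses.

```python
import re

# Hypothetical helper; the shape of the " (x)" suffix is an assumption.
def split_authors(field):
    """Return the plain author names found in an author field."""
    names = []
    for part in field.split("<b>and</b>"):
        name = re.sub(r"<[^>]+>", "", part).strip()   # drop the hyperlink tags
        name = re.sub(r" \([^()\s]+\)$", "", name)    # drop a trailing " (x)"
        names.append(name)
    return names
```

Because the suffix pattern is anchored to the end of the name and excludes spaces, a name that merely contains a longer parenthesised phrase is left alone.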

StripHref

This routine checks to see if the specified string contains "<a href=" and, if so, strips off the hyperlink. It also searches the string for any other instances of "<a href=" and, if found, strips them off as well (e.g. when multiple authors are specified as each has its own hyperlink).
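In Python the same effect could be approximated with a regular expression, as in the sketch below. This is an illustration rather than the routine itself, and it assumes that "stripping off the hyperlink" means keeping the link text and discarding the surrounding "<a href=...>" and "</a>".

```python
import re

# Matches one hyperlink and captures its visible text.
_ANCHOR = re.compile(r"<a href=[^>]*>(.*?)</a>", re.DOTALL)

# Hypothetical Python counterpart of the routine described above.
def strip_href(text):
    """Replace every '<a href=...>text</a>' with just 'text'."""
    if "<a href=" not in text:
        return text
    return _ANCHOR.sub(r"\1", text)
```

A single substitution call handles every hyperlink in the string, so the multiple-author case needs no separate loop.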

StripTtip

As mentioned above, sometimes fields such as authors or titles are embedded inside hints or tooltips with constructs along the lines of:

<span class="hint" title="xxx"><a href="xxx" dir="ltr">xxx</a><img src="http://www.isfdb.org/question_mark_icon.gif" alt="Question mark" class="help"></span>
<div class="tooltip"><a href="xxx" dir="ltr">xxx</a><sup class="mouseover">?</sup><span class="tooltiptext tooltipnarrow">xxx</span></div>

where the hyperlink may be omitted. This routine isolates the (first) "xxx" string (and surrounding hyperlink if there is one).
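The isolation step could be sketched in Python with two regular expressions, one per construct. The function name mirrors the routine but the code is a hypothetical illustration: it assumes the attribute values contain no ">" character and that the tooltip text contains no nested "<span>".

```python
import re

# One pattern per construct shown above; group 1 is the part to keep.
_HINT = re.compile(r'<span class="hint"[^>]*>(.*?)<img[^>]*></span>', re.DOTALL)
_TOOLTIP = re.compile(
    r'<div class="tooltip">(.*?)<sup class="mouseover">.*?</sup>'
    r'<span class="tooltiptext[^"]*">.*?</span></div>', re.DOTALL)

# Hypothetical Python counterpart of StripTtip.
def strip_ttip(text):
    """Keep only the visible text (and its hyperlink, if any) of each
    hint or tooltip wrapper."""
    text = _HINT.sub(r"\1", text)
    return _TOOLTIP.sub(r"\1", text)
```

Text outside the wrapper is preserved, so the routine can be applied to a whole consolidated record rather than to one field at a time.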