Thursday, June 25, 2015

Output Structures

While John works on generating a set of working data structures, let's go back to the description of the output. We're thinking about KWIC (keyword-in-context) and Concordance structures, but I take it that the actual purpose is to be able to answer questions like those previously proposed in "LeninConc: Searching the concordance":
1. Define a sub-corpus by start-end dates, or by the title of Lenin's work, or by the volume.
2. Search by a "dictionary word" (e.g., дом) or by a specific word form (домов).
3. Search by a set of grammatical features (Singular Noun in the Genitive Case).
4. Search by two words separated by at most N words, both possibly constrained by grammatical features.

To be continued.
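
To make question 4 concrete, here is a minimal Python sketch of a proximity search. Everything in it is an assumption on my part: the Token and Constraint records, their field names, and the feature labels are hypothetical stand-ins, not John's actual working data structures. I also assume the two words must appear in the given order, and that "separated by at most N words" counts the words strictly between them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    """Hypothetical record for one word of the corpus."""
    form: str                       # word form as it appears, e.g. "домов"
    lemma: str                      # dictionary word, e.g. "дом"
    feats: frozenset = frozenset()  # hypothetical feature labels


@dataclass(frozen=True)
class Constraint:
    """Hypothetical query term; fields left unset are not checked."""
    lemma: Optional[str] = None
    form: Optional[str] = None
    feats: frozenset = frozenset()

    def matches(self, tok):
        if self.lemma is not None and tok.lemma != self.lemma:
            return False
        if self.form is not None and tok.form != self.form:
            return False
        # Every requested feature must be present on the token.
        return self.feats <= tok.feats


def find_pairs(tokens, first, second, max_gap):
    """Return (i, j) index pairs where tokens[i] matches `first`,
    tokens[j] matches `second`, i < j, and at most `max_gap` words
    lie strictly between them."""
    hits = []
    for i, tok in enumerate(tokens):
        if not first.matches(tok):
            continue
        # j - i - 1 is the number of words between the two hits.
        last = min(len(tokens), i + max_gap + 2)
        for j in range(i + 1, last):
            if second.matches(tokens[j]):
                hits.append((i, j))
    return hits


tokens = [
    Token("в", "в"),
    Token("городе", "город", frozenset({"Noun", "Singular", "Prepositional"})),
    Token("много", "много"),
    Token("старых", "старый", frozenset({"Adjective", "Plural", "Genitive"})),
    Token("домов", "дом", frozenset({"Noun", "Plural", "Genitive"})),
]
genitive_dom = Constraint(lemma="дом", feats=frozenset({"Genitive"}))
pairs = find_pairs(tokens, Constraint(lemma="много"), genitive_dom, 1)
```

Questions 2 and 3 fall out of the same Constraint check applied to a single token; question 1 would just restrict which tokens are in the list to begin with.
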
I haven't seen the Lenin files; I'd like to assume that they can be mechanically broken into
    Volume--Work--Chapter--Section--Paragraph--Line--Word
or some similar set of nested structures. (Note that I'm choosing "Paragraph" rather than "Page" and yet "Line" rather than "Sentence". This is debatable, but these units do have the advantage of corresponding naturally to div structures in the HTML.) I would propose that we do this breakdown and then, for every level except Word at the bottom, generate a copy of each volume as a single HTML file with a div for each subdivision, with id values representing location in the structure: volume 7's work 2, chapter 1, section 8, paragraph 3, line 6 might be:
 <div id="L7_2_1_8_3_6">for all good men to come to the aid of </div>
and from this you know that the enclosing paragraph has id L7_2_1_8_3 and that the entire chapter can be found within the div whose id is L7_2_1. Yes, the initial "L7_" is redundant in that every id in the volume file will begin the same way, but it lets a working set of ids span more than one volume, which I think is likely to be necessary. At any rate,
     http://servername.com/leninconc/volname#L7_2_1_8_3_6
will simply scroll to the right place, with no further work to be done. A query result can now be an id, sometimes a pair of ids indicating the beginning and end of a range, or occasionally an id with a numeric offset to indicate that the range begins or ends at a particular character count.
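
The id arithmetic above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names, the LEVELS tuple, and the Hit record are my own inventions, and I'm assuming every id is the letter "L" followed by one to six underscore-separated numbers, as in the example.

```python
from dataclasses import dataclass
from typing import Optional

# Levels that get a div; Word is deliberately left out.
LEVELS = ("volume", "work", "chapter", "section", "paragraph", "line")


def make_id(*numbers):
    """Build an id such as 'L7_2_1_8_3_6' from one to six numbers."""
    if not 1 <= len(numbers) <= len(LEVELS):
        raise ValueError("expected between 1 and 6 numbers")
    return "L" + "_".join(str(n) for n in numbers)


def parse_id(div_id):
    """Split 'L7_2_1' into the tuple (7, 2, 1)."""
    if not div_id.startswith("L"):
        raise ValueError("id must start with 'L': " + div_id)
    return tuple(int(part) for part in div_id[1:].split("_"))


def enclosing_id(div_id, level):
    """Return the id of the enclosing div at the named level."""
    depth = LEVELS.index(level) + 1
    numbers = parse_id(div_id)
    if depth > len(numbers):
        raise ValueError(div_id + " has no " + level + " component")
    return make_id(*numbers[:depth])


def url_for(base, volname, div_id):
    """Build the URL whose fragment scrolls the browser to div_id."""
    return f"{base}/{volname}#{div_id}"


@dataclass(frozen=True)
class Hit:
    """Hypothetical query result: one id, or a range of ids, with
    optional character offsets into the first and last div."""
    start_id: str
    end_id: Optional[str] = None
    start_offset: Optional[int] = None
    end_offset: Optional[int] = None


line_id = make_id(7, 2, 1, 8, 3, 6)
paragraph_id = enclosing_id(line_id, "paragraph")
chapter_id = enclosing_id(line_id, "chapter")
link = url_for("http://servername.com/leninconc", "volname", line_id)
```

Because the structure is encoded in the id itself, finding an enclosing unit is plain string work and never requires opening the volume file.
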

The actual interface, then, might be a web page with John's JSON files loaded to represent the sub-corpus active at any given moment, and with query-result display either
  • in an iframe holding one volume's HTML file, with scroll and selection set to display the right stuff, or
  • in a constructed div holding exactly the text(s) requested, copied from one or more volumes in iframes which are, for the moment, invisible.
Does this make any sense? I am not a linguist, nor do I play one on the internet....
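
As a sanity check on the second display option, here is a rough Python sketch of pulling the text of one div out of a volume file by id. In the browser itself this would presumably be a document.getElementById call on the iframe's document; the Python version only shows that the nested-div layout makes the extraction mechanical. The DivText class, the text_of function, and the sample HTML are all made up for illustration.

```python
from html.parser import HTMLParser


class DivText(HTMLParser):
    """Collect the text inside the div whose id equals target_id,
    including the text of any divs nested within it."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0    # div nesting depth inside the target; 0 = outside
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1       # a div nested inside the target
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1        # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def text_of(volume_html, div_id):
    """Return the text of the div with the given id, or '' if absent."""
    parser = DivText(div_id)
    parser.feed(volume_html)
    parser.close()
    return "".join(parser.parts)


# Made-up fragment of a volume file: one paragraph holding two lines.
sample = (
    '<div id="L7_2_1_8_3">'
    '<div id="L7_2_1_8_3_5">Now is the time </div>'
    '<div id="L7_2_1_8_3_6">for all good men to come to the aid of </div>'
    '</div>'
)
line_text = text_of(sample, "L7_2_1_8_3_6")
paragraph_text = text_of(sample, "L7_2_1_8_3")
```

Asking for the paragraph id returns both lines joined together, which is the behavior a constructed result div would want for a range that covers a whole paragraph.
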
