Thursday, May 14, 2015

Searching the concordance

The concordance can be considered a restricted kind of corpus, so its query capabilities can be modeled after those of the Russian National Corpus (RNC) / Национальный корпус русского языка (http://www.ruscorpora.ru/en/index.html). Some of those capabilities are:

1. Define a sub-corpus by start-end dates, or by the title of Lenin's work, or by the volume.
2. Search by a "dictionary word" (e.g., дом) or by a specific word form (домов).
3. Search by a set of grammatical features (e.g., a singular noun in the genitive case).
4. Search by two words separated by at most N words, both possibly constrained by grammatical features.
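
To make the four capabilities concrete, here is a minimal sketch in Python. It is not an existing implementation: the `Token` fields, the tag names ("NOUN", "sing", "gent", ...), the placeholder titles "Work A" and "Work B", and all function names are assumptions made up for illustration. It also assumes each token has already been annotated with its dictionary form and grammatical features by some morphological analyzer.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Token:
    form: str        # word form as it appears in the text, e.g. "домов"
    lemma: str       # dictionary word, e.g. "дом"
    tags: frozenset  # grammatical features (hypothetical tag names)
    work: str        # title of the work (placeholder values below)
    volume: int
    written: date
    position: int    # running word index within the work


@dataclass(frozen=True)
class WordQuery:
    lemma: Optional[str] = None   # capability 2: dictionary word
    form: Optional[str] = None    # capability 2: specific word form
    tags: frozenset = frozenset() # capability 3: required features

    def matches(self, tok: Token) -> bool:
        if self.lemma is not None and tok.lemma != self.lemma:
            return False
        if self.form is not None and tok.form.lower() != self.form.lower():
            return False
        # Every requested feature must be present on the token.
        return self.tags <= tok.tags


def subcorpus(tokens, start=None, end=None, work=None, volume=None):
    """Capability 1: restrict by date range, title, or volume."""
    return [t for t in tokens
            if (start is None or t.written >= start)
            and (end is None or t.written <= end)
            and (work is None or t.work == work)
            and (volume is None or t.volume == volume)]


def search(tokens, query):
    """Capabilities 2 and 3: all tokens matching one word query."""
    return [t for t in tokens if query.matches(t)]


def search_pair(tokens, first, second, max_gap):
    """Capability 4: `second` follows `first` in the same work,
    with at most `max_gap` words between them."""
    hits = []
    for i, a in enumerate(tokens):
        if not first.matches(a):
            continue
        for b in tokens[i + 1 : i + max_gap + 2]:
            in_range = 0 < b.position - a.position <= max_gap + 1
            if b.work == a.work and in_range and second.matches(b):
                hits.append((a, b))
    return hits


# Tiny made-up sample; titles and dates are placeholders.
d1, d2 = date(1917, 5, 1), date(1920, 1, 1)
SAMPLE = [
    Token("строительство", "строительство",
          frozenset({"NOUN", "sing", "nomn"}), "Work A", 1, d1, 0),
    Token("новых", "новый",
          frozenset({"ADJF", "plur", "gent"}), "Work A", 1, d1, 1),
    Token("домов", "дом",
          frozenset({"NOUN", "plur", "gent"}), "Work A", 1, d1, 2),
    Token("дома", "дом",
          frozenset({"NOUN", "sing", "gent"}), "Work B", 2, d2, 0),
]
```

For example, `search(subcorpus(SAMPLE, volume=1), WordQuery(lemma="дом"))` would look up the dictionary word дом within volume 1 only. In this sketch the pair search only requires the two words to be in the same work; a narrower scope (sentence, paragraph) would need extra boundary information on each token.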

To be continued.

1 comment:

  1. I'm assuming that both (2) and (3) are supported by the morphological analysis package that is being invoked, which I haven't seen. The role of the project would then be limited to (A) in preprocessing, recording the output list from this package for each word (each non-trivial word?), and (B) in interaction, doing a string search among those output lists. For (2), (3), and (4), I'm assuming that the result of a "search" is a location in the overall text, i.e., we have a reasonably sized HTML file with anchors that make it easy to autoscroll approximately to that point... Or perhaps we have a paragraph-sized blob of output, or ...?

    And on (4), we are no longer searching for an individual item. Is it assumed that the two words will be in the same sentence? Paragraph? Chapter, if we have chapter structure? Work?
