I may have missed some earlier discussions and suggestions, but let me ask a question about data structures. Assuming that we start with an XML format (fb2), we have the following stages of the process:
1. For each volume, extract the "Lenin-only" content of the volume.
2. Merge the contents of all volumes.
3. Produce a conc from the merged file.
My question is: what data structures do we use for the Lenin-only content? JSON records that include volume, title of work, and page number? Perhaps paragraph breaks? Perhaps number all paragraphs consecutively?
At present, I am using JSON records that include work title, year, page number, and word numbers, contained mostly in the occurrence table outlined in http://leninconc.blogspot.com/2015/05/data-structures.html
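To make the three stages concrete, here is a minimal sketch of one way the records could flow from per-volume extraction to a merged file to a concordance. Everything in it is an assumption made for illustration: the field names (`volume`, `title`, `year`, `page`, `text`, `para_id`), the helper names, and the whitespace tokenizer are hypothetical and are not taken from the occurrence table in the linked post.

```python
import json
from itertools import count


def make_record(volume, title, year, page, text):
    """Build one paragraph-level record (hypothetical field names)."""
    return {"volume": volume, "title": title, "year": year,
            "page": page, "text": text}


def merge_volumes(volumes):
    """Merge per-volume record lists, numbering paragraphs consecutively."""
    counter = count(1)
    merged = []
    for records in volumes:
        for rec in records:
            # Copy the record and attach a corpus-wide paragraph number.
            merged.append({**rec, "para_id": next(counter)})
    return merged


def build_concordance(records):
    """Map each lowercased word to a list of (para_id, word_number) pairs."""
    index = {}
    for rec in records:
        # Naive whitespace tokenization; a real run would need to handle
        # punctuation and Russian morphology.
        for word_no, word in enumerate(rec["text"].lower().split(), start=1):
            index.setdefault(word, []).append((rec["para_id"], word_no))
    return index


# Placeholder data, not real text from the volumes.
volume_1 = [make_record(1, "Work A", 1894, 5, "the people"),
            make_record(1, "Work A", 1894, 6, "the state")]
volume_2 = [make_record(2, "Work B", 1895, 1, "state and people")]

merged = merge_volumes([volume_1, volume_2])
conc = build_concordance(merged)
print(json.dumps(merged[0], ensure_ascii=False))
print(conc["state"])
```

Numbering paragraphs consecutively at merge time keeps the per-volume extraction independent, and the volume, title, and page fields remain on each record, so a concordance hit can still be traced back to its printed location.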