Making an eBook — Part 2: Text Markup
By Roger Sperberg
September 27, 2001
From eBookWeb.org. Reprinted with permission.
I spend so much time reading from my computer monitor already—e-mail, reports, word-processing files—that I often translate material into a .LIT file so I can look at it on my PocketPC, away from my desk.
Even when it's work material, being able to read lounging on the couch makes me grateful for my HP Jornada.
About half the material I transform this way, I grab from the web. I'll research a topic, finding five or ten or twenty essays or articles, and save them as HTML in a temporary folder, so I can read them off-line. Later I either print them or make them into an e-book, and dump the folder.
Because these files are marked in HTML, I use Overdrive's ReaderWorks for the job, but I suppose some people would prefer to use the Reader plug-in for Microsoft Word 2000. (I use ReaderWorks even with Word documents, which I usually mark up as HTML with a few search-and-replace macros.) An OEB publication is quite content to work with OEB documents that are of type text/html, and Microsoft Reader is one of the chief OEB-based Readers — the computer kind.
A good example of a web page that makes sense to carry around to study is the S.A.T. Dictionary page from the Wordsmyth collaboratory — it's a list of the 2000 most frequently appearing words used in the Scholastic Aptitude Test, complete with definitions, parts of speech and sample sentences. So much material can't all be absorbed at once, and the opportunities to do a little vocabulary study don't necessarily coincide with those moments of being in front of a computer connected to the internet. (The copyrighted file is used here — and contained in the downloadable .LIT file — by permission of Wordsmyth.)
The definitions in the S.A.T. dictionary, by the way, come from the Wordsmyth Electronic Dictionary-Thesaurus, "an innovative and evolving language reference source that meshes the functions of a dictionary and a thesaurus with powerful and flexible search capabilities." Its design reflects "the philosophy that word meanings are not simply equations that one can get right or get wrong, but rather grow out of and depend on specific uses and contexts."
Wordsmyth goes on to explain that "the dictionary-thesaurus itself has been designed to emphasize full, supportive usage information in entries, including example sentences and phrases, and indicators of context and grammar. It offers extensive crossreferencing, too; for each meaning of a word, lists of synonyms and similar words are provided." In other words, it's an ideal tool for just this purpose, studying vocabulary.
For making e-books, we have to acknowledge the reality that web pages aren't like e-book pages any more than they are like print pages. Chiefly, we discover that, to make things line up on a web page, many designers use what I call Mondrian design: a table is used to position different elements on the page, and sometimes individual cells themselves contain whole tables.
As you can see from this fragment at the top of the S.A.T. page, a table was used here too. Your task is to un-table it: to locate the cell with the meat of the page and remove the table around it. (Usually, the other cells position graphics, navigation, links and so on.)
-
<BODY bgColor=#ffffff>
<TABLE border=0 cellPadding=7 cellSpacing=1 width=604>
<TBODY>
<TR>
<TD bgColor=#ccccff vAlign=top width=102> ...
You might think it simplest to just leave the table in, because after all the desktop Reader can handle tables. But the e-book page width, being so much narrower than the web page, usually defeats this labor-saving attempt. And the PocketPC Reader, more limited in its functionality as well as screen size, cannot parse tables.
So you separate the wheat from the chaff, or the reading stuff from the positioning stuff, by locating the <TD> tag that most immediately precedes the content and selecting from there all the way down to the closing </TD> tag, which in this case is located on the last line of the file. Me, I copy the selection, close the document and start a new one in HTML-Kit, dropping the material into the fresh page between the opening and closing <body> tags. This has the advantage of giving me the necessary tags for a proper HTML file. Of course, the first thing I do is remove the <TD> and </TD> tags if they came along.
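For anyone who would rather script the un-tabling than select the cell by hand, here is a minimal Python sketch of the same idea. It is not the HTML-Kit workflow described above. The function names `extract_cell` and `wrap_in_page` and the sample `page` string are hypothetical, and the sketch assumes every `<td>` has an explicit closing `</td>`, which hand-written pages of this era do not always provide.

```python
import re

# Matches opening and closing TD tags in any letter case.
TD_TAG = re.compile(r"<(/?)td\b[^>]*>", re.IGNORECASE)


def extract_cell(html: str, snippet: str) -> str:
    """Return the inner HTML of the table cell that encloses `snippet`."""
    pos = html.index(snippet)

    # Walk the TD tags before the snippet, tracking which cells are still open.
    open_cells = []
    for m in TD_TAG.finditer(html, 0, pos):
        if m.group(1):            # a closing </td>
            if open_cells:
                open_cells.pop()
        else:                     # an opening <td ...>
            open_cells.append(m.end())
    if not open_cells:
        raise ValueError("snippet is not inside a table cell")
    inner_start = open_cells[-1]

    # Walk forward to the </td> that closes that cell, skipping nested cells.
    depth = 1
    for m in TD_TAG.finditer(html, pos):
        depth += -1 if m.group(1) else 1
        if depth == 0:
            return html[inner_start:m.start()]
    raise ValueError("no matching </td> found")


def wrap_in_page(fragment: str, title: str = "Untitled") -> str:
    """Drop the fragment between the body tags of a bare HTML page."""
    return (
        "<html>\n<head><title>" + title + "</title></head>\n"
        "<body>\n" + fragment + "\n</body>\n</html>\n"
    )


# Hypothetical miniature page: one navigation cell, one content cell
# that itself contains a nested table.
page = (
    "<BODY><TABLE><TR>"
    "<TD width=102>nav links</TD>"
    "<TD><TABLE><TR><TD>inner</TD></TR></TABLE><P>The word list</P></TD>"
    "</TR></TABLE></BODY>"
)

content = extract_cell(page, "The word list")
print(wrap_in_page(content, title="S.A.T. Dictionary"))
```

Tracking which cells are still open matters because the nearest `<TD>` before the content may belong to a nested table that has already closed, and the sketch should pick the cell that actually surrounds the text.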
In most instances, the text captured in this fashion has simple markup — <p> heading each paragraph and <h1> (or <h2> or <h3>) tags around the section heads, with <b> and <i> tags scattered throughout, marking specific words or phrases as bold or italic.
In the case of the original S.A.T. page, the usual <dl> (definition list) tags, dividing the entry into <dt> (term) and <dd> (definition), seem to have gotten discombobulated: in the introductory paragraphs, everything seems to be marked <DT>, and every part of the entries is marked <DD>. Here are two contiguous entries, abstain and abstemious:
-
<DD><FONT color=#af0000 size=+1><B>abstain</B></FONT><FONT color=#0000af
size=+1>, v.</FONT>
<DD><FONT color=#0000af size=+1>(opp.: <B>indulge</B>)</FONT>
<DD><FONT color=#0000af size=+1>to choose to refrain from something: <U>Abstain from drinking</U>; <U>Don't abstain from voting.</U></FONT>
<DD><FONT color=#0000af size=+1><B>abstention</B>, n.; <B>abstinence</B>, n.; <B>abstinent</B>, adj.</FONT>
<DD><FONT color=#af0000 size=+1><B>&nbsp;</B></FONT>
<DD><FONT color=#af0000 size=+1><B>abstemious</B></FONT><FONT color=#0000af size=+1>, adj.</FONT>
<DD><FONT color=#0000af size=+1>(opp.: <B>gluttonous</B>)</FONT>
<DD><FONT color=#0000af size=+1>eating or drinking in controlled or moderate amounts; temperate: <U>The model had to be abstemious to keep her figure.</U></FONT>
<DD><FONT color=#0000af size=+1>&nbsp;</FONT>
You can differentiate several things here. The <DD> tags are used to begin each segment of the headword's definition, usage or related words on a new line. The <FONT> tags are used to distinguish the headword by making it a different color and size from the definition. Sample usage is underlined. The &nbsp; entity (for "nonbreaking space") is used both to connect a meaning with an example and to create a "blank" line between one entry and the next.
Here's a screen capture from the web page of these two entries:

In our more compact e-book pages, especially in our compact PocketPC pages, we will probably want to forgo putting everything on separate lines. And we will probably prefer italics to underlining.
Here's what the same entries look like in my final version:
-
<p><B><font color="maroon">abstain</font></B>, v. (opp.: <B>indulge</B>)
to choose to refrain from something: <em>Abstain from drinking</em>; <em>Don't
abstain from voting.</em> <B>abstention</B>, n.; <B>abstinence</B>, n.;
<B>abstinent</B>, adj.</p>
<p><B><font color="maroon">abstemious</font></B>, adj. (opp.: <B>gluttonous</B>) eating or drinking in controlled or moderate amounts; temperate: <em>The model had to be abstemious to keep her figure.</em></p>
And here's the page the entries appear on in desktop Reader:

I used search-and-replace to locate the end of each entry and put in a closing and starting tag in its place (the first start tag and the last end tag I did manually), replaced the underlines with emphasis tags, and basically removed all the <DD> tags.
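The search-and-replace passes described above could be scripted along these lines. This is a rough Python sketch, not the macros actually used for the article. The function name `entries_to_paragraphs` is hypothetical, and the sketch assumes that each entry ends with a <DD> line holding nothing but a nonbreaking space (or plain whitespace) and that the first bold run in an entry is its headword.

```python
import re

# A <DD> line whose only content is &nbsp; or whitespace marks the end of an entry.
SEPARATOR = re.compile(
    r"<DD><FONT[^>]*>(?:<B>)?(?:&nbsp;|\s)*(?:</B>)?</FONT>", re.IGNORECASE
)


def entries_to_paragraphs(raw: str) -> list:
    """Turn <DD>/<FONT>-marked dictionary entries into one <p> per entry."""
    paragraphs = []
    for chunk in SEPARATOR.split(raw):
        text = re.sub(r"<DD>", "", chunk, flags=re.IGNORECASE)
        text = re.sub(r"</?FONT\b[^>]*>", "", text, flags=re.IGNORECASE)
        # Underlines become emphasis tags.
        text = re.sub(r"<(/?)U>", r"<\1em>", text, flags=re.IGNORECASE)
        text = text.replace("&nbsp;", " ")
        # Collapse line breaks and runs of spaces so the entry flows as one paragraph.
        text = " ".join(text.split())
        if not text:
            continue
        # Assumption: the first bold run is the headword; color it maroon.
        text = re.sub(
            r"<B>(.*?)</B>",
            r'<B><font color="maroon">\1</font></B>',
            text,
            count=1,
            flags=re.IGNORECASE,
        )
        paragraphs.append("<p>" + text + "</p>")
    return paragraphs


# Abbreviated sample based on the two entries quoted earlier.
raw = """\
<DD><FONT color=#af0000 size=+1><B>abstain</B></FONT><FONT color=#0000af
size=+1>, v.</FONT>
<DD><FONT color=#0000af size=+1>(opp.: <B>indulge</B>)</FONT>
<DD><FONT color=#0000af size=+1>to choose to refrain from something: <U>Abstain from drinking</U>; <U>Don't abstain from voting.</U></FONT>
<DD><FONT color=#af0000 size=+1><B>&nbsp;</B></FONT>
<DD><FONT color=#af0000 size=+1><B>abstemious</B></FONT><FONT color=#0000af size=+1>, adj.</FONT>
<DD><FONT color=#0000af size=+1>(opp.: <B>gluttonous</B>)</FONT>
<DD><FONT color=#0000af size=+1>&nbsp;</FONT>
"""

result = entries_to_paragraphs(raw)
for paragraph in result:
    print(paragraph)
```

Splitting on the blank separator lines first means the opening and closing paragraph tags never have to be added by hand, even for the first and last entries.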
As you would expect in a dictionary, there are 26 sections, headed by the 26 letters of the alphabet. They too were distinguished typographically:
-
<DD><FONT color=#af0000 face="Apple Chancery" size=+3><B>A</B></FONT>
It was a short step to make these into <h2> tags and to wrap each section in a <div> to help Reader with its pagination:
-
<div>
<h2>A</h2>
. . .
</div>
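The same conversion can be sketched in Python for all 26 letter headings at once. This is only an illustration under stated assumptions: the function name `wrap_sections` is hypothetical, the placeholder paragraphs stand in for real entries, and the pattern assumes that only the letter headings use `size=+3`, as in the fragment quoted above.

```python
import re

# A letter heading: a <DD> line whose FONT tag uses size=+3 around one bold capital.
LETTER_HEAD = re.compile(
    r"<DD><FONT[^>]*size=\+3[^>]*><B>([A-Z])</B></FONT>"
)


def wrap_sections(body: str) -> str:
    """Replace each letter heading with <h2> and wrap its section in <div>."""
    # With a capturing group, split() yields:
    # [text before first heading, letter, section, letter, section, ...]
    parts = LETTER_HEAD.split(body)
    out = []
    if parts[0].strip():
        out.append(parts[0].strip())   # keep any introductory material as-is
    for letter, section in zip(parts[1::2], parts[2::2]):
        out.append(
            "<div>\n<h2>" + letter + "</h2>\n" + section.strip() + "\n</div>"
        )
    return "\n".join(out)


# Placeholder input: two headings, each followed by a stand-in paragraph.
body = (
    '<DD><FONT color=#af0000 face="Apple Chancery" size=+3><B>A</B></FONT>\n'
    "<p>entries for A</p>\n"
    '<DD><FONT color=#af0000 face="Apple Chancery" size=+3><B>B</B></FONT>\n'
    "<p>entries for B</p>\n"
)

wrapped = wrap_sections(body)
print(wrapped)
```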
A little markup remains — as well as discussion of how to use stylesheets for formatting. And, of course, the explicit steps for generating the .LIT file.
That will appear in parts 3 and 4.
See also Making an eBook — Part 1