[bksvol-discuss] Automated Strippers and Page Numbers

From: "Donna Smith" <donnafsmith@xxxxxxxxxxxxx>
To: <bksvol-discuss@xxxxxxxxxxxxx>
Date: Sat, 28 Aug 2004 05:42:41 -0400

Hi gang.

I hear the frustration expressed on this list by dedicated volunteers
concerned that hard work is being negated by automated tools, and I share
your frustration.  While I am more than happy to volunteer quite a lot of my
available personal time to scanning/validating books for BookShare, I want
to know that I'm getting the best results for my time and effort.

So this morning I decided to do a little comparison of books I have either
scanned or validated with the final product in the collection.  I learned
that page numbers are preserved whether I leave headers in or take them out,
and if I leave headers in, the headers are mostly stripped.  Consequently, I have
determined that I will do the following:  If the headers in a book mostly
scan well and not a lot is required to normalize them, I'll leave them in so
the automated stripper will have something to strip.  If the headers in the
book inconsistently scan so that it is a lot of trouble to normalize them,
then I'll strip them out myself because that is actually less work in some
cases than normalizing.  Either way, the page numbers seem to be preserved.

BTW, when browsing through the new books page, I came across "Escape," one of
the Choose Your Own Adventure books, so I downloaded it for a check as well.
The page numbers are there.

It is my understanding that the automated stripper takes out only
consistently repeating phrases such as the author's name when it appears at
the top of every other page, and the name of the book when it appears at the
top of alternate pages.  Since the page numbers change with every page, even
if they are on the same line as the header, they are left in.  It is also my
understanding that the automated stripper doesn't strip out whatever happens
to be at the top of each page.  So if each page starts with a new line of
text (no header), then it's not stripped unless every page, or a
significant number of pages, starts out with the same line of text.

I also have to add that most of the books I have personally downloaded and
read over the last couple of years have had page numbers.  Headers are a
little more inconsistent, but it looks to me like junk headers remain in
those books where the headers are typically scrambled and not necessarily
scrambled in a consistent manner.

The bottom line for me is that page numbering is retained and lines of text
aren't stripped.  While I prefer that the headers not be there, and I will
continue to submit/validate in a manner designed to help the automated
stripper get rid of them, I've never chosen to not read a book because
headers were still present.  On the other hand, I have chosen to not read a
book because the text quality was too poor.

Hope this helps.  I didn't want to be discouraged about my favorite
volunteer job and I didn't want others to be discouraged unnecessarily
either.  I urge you to check your finished work from the collection even if
you have to go back a few months to find something.  I'm afraid I'm guilty
of reading from the RTF version I scan and submit rather than waiting for it
to clear the process and reading it from the collection.  <smile>  Hence,
the need to make this special effort this morning to actually look and see
what my scans look like once they're in the collection.

Peace and Hope,

Donna
