The Problem with H1

Updated 28-Jun-2024

H1 and Content Boundaries on the Web and EBook Publications


NOTE: H1 isn't really a problem in itself. The issue is that it defines a chapter, not a work (though a chapter could be considered a work, in the sense that songs are works that are part of a collection). H1 becomes a problem when using a Markdown linter and concatenating all chapters into a single document (which makes more sense than one might think when working in tools such as VSCode).


There is a problem embedded in epub: an epub is normally composed of several html documents, one per chapter. For tools that create epubs (such as Pandoc) to do so properly, they split the source based on H1, so that each H1 signifies the beginning (and title) of a new html document (which is a chapter).
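To make the splitting concrete, here is a minimal sketch in Python of cutting a single Markdown source into chapters at each H1. It is only an illustration of the idea, not how Pandoc itself is implemented. The function name `split_at_h1` is made up for this example, and it assumes ATX-style headings (`# Title`) and a deliberately naive treatment of fenced code blocks.

```python
import re

# Matches an ATX level-1 heading such as "# Chapter One" (but not "## Section").
H1_PATTERN = re.compile(r"^# +(.*\S)\s*$")
# Naive fence detection: any line starting with ``` or ~~~ toggles "in code".
FENCE_PATTERN = re.compile(r"^(```|~~~)")


def split_at_h1(markdown_text):
    """Split a Markdown string into (title, body) chapters at each H1.

    Text before the first H1 is returned as a chapter whose title is None.
    Lines inside fenced code blocks are never treated as headings.
    """
    chapters = []
    title, body = None, []
    in_fence = False

    def flush():
        # Skip an empty, untitled leading chunk.
        if title is not None or any(line.strip() for line in body):
            chapters.append((title, "\n".join(body).strip()))

    for line in markdown_text.splitlines():
        if FENCE_PATTERN.match(line):
            in_fence = not in_fence
        match = None if in_fence else H1_PATTERN.match(line)
        if match:
            flush()
            title, body = match.group(1), []
        else:
            body.append(line)
    flush()
    return chapters


if __name__ == "__main__":
    book = "# One\n\ntext a\n\n## Sub\n\n# Two\n\ntext b\n"
    for chapter_title, chapter_body in split_at_h1(book):
        print(repr(chapter_title), repr(chapter_body))
```

Each returned pair corresponds to what would become one html document inside the epub, with the H1 text as that document's title.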

However, as we know, an HTML document (especially on the web) should have only one H1. Therefore, if the native (single) document being edited is itself a book, that single document will have multiple H1s embedded within it.
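A quick way to see whether a given HTML file falls on the "one H1" side or the "book in a single file" side is simply to count its H1 elements. The following is a small sketch using Python's standard `html.parser` module. The names `H1Counter` and `count_h1` are invented for this example.

```python
from html.parser import HTMLParser


class H1Counter(HTMLParser):
    """Counts opening <h1> tags in whatever HTML is fed to it."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <H1> is counted as well.
        if tag == "h1":
            self.count += 1


def count_h1(html_text):
    """Return the number of <h1> elements found in an HTML string."""
    parser = H1Counter()
    parser.feed(html_text)
    parser.close()
    return parser.count


if __name__ == "__main__":
    page = "<h1>Title</h1><p>Body</p>"
    single_file_book = "<h1>Chapter 1</h1><p>...</p><h1>Chapter 2</h1><p>...</p>"
    print(count_h1(page), count_h1(single_file_book))
```

A typical web page would report one H1, while a whole book exported as one HTML file would report one per chapter.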

This means there is a basic disconnect between a book being an ODT file, an HTML file, or even a PDF, and a book being an ebook.

Baker & Taylor require epubs to have only one H1, which is itself the title of the work, with everything else at H2 or below (e.g., chapter headers). However, the spec and common use have an H1 for each chapter.
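One way to reconcile the two conventions is to keep H1-per-chapter in the source and generate the "single H1" variant mechanically at export time. The sketch below demotes every heading by one level and puts the book title on top as the only H1. It is an assumption about how such a conversion could be done, not a description of any distributor's or tool's actual process. The function name `demote_under_title` is hypothetical, and only ATX-style Markdown headings are handled.

```python
import re

# An ATX heading: one to six "#" characters, then spaces, then some text.
HEADING_PATTERN = re.compile(r"^(#{1,6})( +\S.*)$")


def demote_under_title(markdown_text, book_title):
    """Return Markdown with book_title as the only H1.

    Every existing heading is pushed down one level (H1 -> H2, H2 -> H3, ...).
    Level-6 headings stay at level 6 because there is nothing deeper.
    Lines inside fenced code blocks are left alone (naive fence detection).
    """
    out = [f"# {book_title}", ""]
    in_fence = False
    for line in markdown_text.splitlines():
        if line.startswith(("```", "~~~")):
            in_fence = not in_fence
        match = None if in_fence else HEADING_PATTERN.match(line)
        if match:
            hashes, rest = match.groups()
            level = min(len(hashes) + 1, 6)
            line = "#" * level + rest
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    source = "# One\n\ntext\n\n## Sub\n"
    print(demote_under_title(source, "My Book"))
```

Because the conversion is one-way and automatic, the chapter files themselves can keep whatever heading levels suit the editor and the linter.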

See:

-


Pushing the problem from H1 to Meta Title gives the same problem: An epub ebook has multiple html documents (one per chapter). On the web, there should be one and only one H1 (for Google's purposes, and possibly in the HTML spec). The Meta Title is not (necessarily) displayed to the user (though browsers traditionally put it into the browser title bar as well), whereas the H1 does get displayed to the user, so these definitely have two different uses in terms of user/display.

The problem I see comes when people want to explicitly tag H1, H2, etc., and your application decides if and when it will override them. This is (partially) what I mean by having semantics (markup via markdown/copymarkup) be primary. Without this, and with HTML export or a native file format as the basis instead, you put all the control into your application and take it away from the user and the documents.

Further, I believe that documents themselves should not be the top level, but collections of documents (libraries).

What this allows for is people to have a single editor instance and navigate across multiple documents, books and book elements. This is very fast when wanting to bounce around between various documents. Granted it can slow down on load and save if the entire structure is written out.

This helps further define things such that a book is not a single document, but a collection of documents (the idea of "book" being a container). Not only does this work with epub thinking but also with website thinking. A website is a collection of documents, but not a document itself (it is an address). This idea also fits the fact that each web page has its own address, as well as its own meta title and H1. In essence, this means that a web page is a chapter in a website (book).

This means that epubs and websites are on the same page, as it were, whereas PDF and ODT files are (or rather, can be) at odds, insofar as a single instance can be a collection of chapters (and accompanying images, usually including a cover image).


Note that in a collection, if the title of the collection is an H1, there is still the problem of each work having either a single H1 or multiple H1s. How to parse this as a tree is interesting in the production of both epubs and PDFs. For a multiple-document (book) collection with a generated ToC, there are many possibilities, such as:

  • Title Page (hidden, unlisted, unnumbered)
  • Copyright (hidden, unlisted, numbered)
  • ToC (hidden, unlisted, numbered)
  • Preface / Introduction
  • FIRST BOOK
    • Title Page (hidden, unlisted, unnumbered)
    • ...
  • SECOND BOOK