Citations

The more I want to incorporate some way to create citations in a document, the more I’m starting to hate citations and bibliography management with a passion. The document in question is supposed to export well to both HTML and PDF, the latter via TeX. While bibliography management in TeX and friends is more or less “solved” (through a monumental effort and probably some tears, I imagine; BibTeX was released 36 years ago), the story looks rather different for bibliography management in HTML.

Let’s look at an easy example using the bib language.

@article{chaudhuri2018fastdistcorr,
  title =        {A fast algorithm for computing distance correlation},
  author =       {Arin Chaudhuri and Wenhao Hu},
  journal =      {Computational Statistics and Data Analysis},
  year =         {2018},
  volume =       {135},
  pages =        {15--24}
}

This should be easy enough to format according to some publication guideline, or at least consistently. You could cite it in the text like “Chaudhuri and Hu (2018)” and then in the reference section you’d write out the information in some way or another. So far so good.
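To make the “easy” case concrete, here is a minimal Python sketch that formats the entry above. The dictionary layout and the function names `cite_in_text` and `reference_line` are my own assumptions for illustration, not part of any existing tool, and the output style is made up rather than taken from a real publication guideline.

```python
# The example entry as a plain dictionary (hypothetical layout).
entry = {
    "author": "Arin Chaudhuri and Wenhao Hu",
    "title": "A fast algorithm for computing distance correlation",
    "journal": "Computational Statistics and Data Analysis",
    "year": "2018",
    "volume": "135",
    "pages": "15--24",
}


def cite_in_text(entry):
    """Return an author-year citation such as 'Chaudhuri and Hu (2018)'."""
    # Assumes names are written 'First Last' and separated by ' and '.
    surnames = [name.split()[-1] for name in entry["author"].split(" and ")]
    if len(surnames) == 1:
        authors = surnames[0]
    elif len(surnames) == 2:
        authors = " and ".join(surnames)
    else:
        authors = surnames[0] + " et al."
    return f"{authors} ({entry['year']})"


def reference_line(entry):
    """Return one line for the reference section in an ad-hoc style."""
    authors = entry["author"].replace(" and ", ", ")
    pages = entry["pages"].replace("--", "\u2013")  # bib '--' becomes an en dash
    return (
        f"{authors} ({entry['year']}). {entry['title']}. "
        f"{entry['journal']}, {entry['volume']}, {pages}."
    )


print(cite_in_text(entry))
print(reference_line(entry))
```

This only works because every field is a plain string, which is exactly the assumption that breaks down below.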

If you do not take issue with the following, then you should be able to find a suitable program that takes care of your citing needs in HTML. But there are issues and limitations with this model.

First, the title has to be entered in title case. This is because the guidelines for styling a reference could demand that the title be set in lower case, upper case, or title case. While the first two casings can be generated programmatically, the last one cannot, hence you have to enter it as the default. An issue crops up as soon as you have a proper noun in the title. While it is hopefully written correctly in the original text, the capitalization will be lost, as BibTeX is not aware of the special casing. This means that the title has to be entered as title = {Heavy-tailed kernels reveal a finer cluster structure in {t-SNE} visualisations}. This way the casing is protected, but it makes the bibliographic notation more complex. While this concession is arguably sensible, the problems do not stop there.
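The effect of brace protection can be sketched in a few lines of Python. This is a simplified illustration of the idea, not a reimplementation of what BibTeX styles actually do: the function name `sentence_case` is hypothetical, and nested braces are deliberately not handled.

```python
import re


def sentence_case(title):
    """Lower-case a title, keeping the first letter and {protected} groups."""
    # re.split with a capture group keeps the {...} groups in the result list.
    parts = re.split(r"(\{[^{}]*\})", title)
    out = []
    for i, part in enumerate(parts):
        if part.startswith("{") and part.endswith("}"):
            out.append(part[1:-1])  # protected: copy verbatim, drop the braces
        else:
            lowered = part.lower()
            if i == 0:
                # Capitalise only when the title starts with unprotected text.
                lowered = lowered[:1].upper() + lowered[1:]
            out.append(lowered)
    return "".join(out)


protected = "Heavy-Tailed Kernels Reveal a Finer Cluster Structure in {t-SNE} Visualisations"
unprotected = "Heavy-Tailed Kernels Reveal a Finer Cluster Structure in t-SNE Visualisations"

print(sentence_case(protected))
print(sentence_case(unprotected))
```

With the braces the output keeps “t-SNE” intact; without them it degrades to “t-sne”, which is the loss of casing described above.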

In my thesis I had special styling for abbreviations: they were typeset in small caps. To achieve that, I used the LaTeX package glossaries-extra, which allowed me to style the abbreviations accordingly and consistently, while also providing a legend. For the bibliographic entry at hand it means that we have to adapt the title to use the glossaries macro \gls, changing t-SNE in the title to \gls{tsne}, where the acronym “tsne” has been defined accordingly. (Because the “t” in t-SNE should not be capitalized, the definition for t-SNE is not straightforward if you want it to be typeset in small caps. It has to be defined like \newacronym[sort=tsne]{tsne}{\textup{t}-sne}{t-Distributed Stochastic Neighbor Embedding}.)

Another place where this crops up is when the title includes mathematical formulas, for example in the Barnes–Hut paper, which is called “A hierarchical $O(N \log N)$ force-calculation algorithm.” The solution here is to use the math mode inherent to TeX to typeset the title, which is how the title is written in the bib file.

Now we have leveraged the programming capability of TeX to style the bibliography consistently. On the flip side, we have now lost the ability to interface with other programs that could use the format to generate a bibliography for, say, HTML. It is a bit unfortunate because it means that it’s not possible to leverage the format for bib files to use it in another language. I personally like the syntax, as it resembles the physical piece of literature somewhat well, but it seems that this format is not easy to use as an interface.
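To illustrate what a non-TeX consumer is up against, here is a small Python sketch that handles just one TeX construct, inline math delimited by dollar signs, and wraps it for an HTML math renderer. The function name `title_to_html` and the `math` class are assumptions of mine, and whether the `\(...\)` delimiters get rendered depends on how the page's math library is configured. Anything beyond inline math, such as \gls{tsne}, passes through untouched and would need its own special case.

```python
import html
import re


def title_to_html(title):
    """Escape a title for HTML and wrap $...$ spans for a math renderer."""
    # Splitting on $...$ with a capture group puts math at the odd indices.
    pieces = re.split(r"\$([^$]*)\$", title)
    out = []
    for i, piece in enumerate(pieces):
        escaped = html.escape(piece)
        if i % 2 == 1:
            out.append(f'<span class="math">\\({escaped}\\)</span>')
        else:
            out.append(escaped)
    return "".join(out)


print(title_to_html(r"A hierarchical $O(N \log N)$ force-calculation algorithm"))
print(title_to_html(r"Heavy-tailed kernels reveal a finer cluster structure in \gls{tsne} visualisations"))
```

The second call shows the limitation: the macro ends up verbatim in the HTML, because only TeX knows what it expands to.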

A solution to the problem would be to extend the bib language and parse it within the programming language of choice. Then the bibliography for BibTeX could be generated programmatically, and for HTML it could be parsed into a dictionary-like data structure. My only gripe is that a format like that does not exist.
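As a rough idea of what “parsed into a dictionary-like data structure” could look like, here is a minimal Python sketch for a single entry. It is nowhere near a full parser for the bib format: it assumes brace-delimited values with at most one level of nested braces, and it ignores @string definitions, quoted values, concatenation, and comments. The names `parse_entry`, `ENTRY`, and `FIELD` are hypothetical.

```python
import re

# '@type{key, body}' for a single entry; the body is taken greedily up to
# the last closing brace.
ENTRY = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,(.*)\}\s*$", re.DOTALL)
# 'name = {value}', where the value may contain one level of nested braces.
FIELD = re.compile(r"(\w+)\s*=\s*\{((?:[^{}]|\{[^{}]*\})*)\}")


def parse_entry(text):
    """Parse one bib entry into a dictionary (simplified, see assumptions)."""
    match = ENTRY.match(text.strip())
    if match is None:
        raise ValueError("expected a single @type{key, ...} entry")
    entry_type, key, body = match.groups()
    fields = {name.lower(): value for name, value in FIELD.findall(body)}
    return {"type": entry_type.lower(), "key": key, "fields": fields}


source = r"""
@article{chaudhuri2018fastdistcorr,
  title =        {A fast algorithm for computing distance correlation},
  author =       {Arin Chaudhuri and Wenhao Hu},
  journal =      {Computational Statistics and Data Analysis},
  year =         {2018},
  volume =       {135},
  pages =        {15--24}
}
"""

parsed = parse_entry(source)
print(parsed["key"])
print(parsed["fields"]["title"])
```

Note that the values come back as raw strings: a title containing {t-SNE}, \gls{tsne}, or $O(N \log N)$ is returned as is, so the question of how to interpret such markup outside of TeX remains open.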

At the end of the day, such a project is ambitious in scope. Parsing files in the bib format is no easy task: the data model is complex, because the referencing system has grown organically over a long period and as such has some peculiarities. (You can take a look at the documentation of the biblatex package, especially Chapter 2, where the various bibliographic types are explained and detailed.)

Of course the problem has been tackled in the past and there exist some approaches that have tried to create a program that can incorporate citations irrespective of the output format of the document.

One of the most notable approaches is perhaps the Citation Style Language (CSL). It has a good data model and interfaces with a lot of other formats. The issue is that it does not support styling like that shown above: it is not possible to add math formatting to a bibliographic entry.

The Distill Research Journal is another good contender as it parses bib entries and displays those references. It is lacking in the same way as CSL in that it is unable to recognize additional markup. Furthermore, the data model is a bit limited, although despite its simplicity it works quite well for the most part.

The shortcoming of the two approaches listed is that they assume that the content of any descriptor, for example the title, will be a plain string with, at best, minimal markup associated. This holds true for the vast majority of bibliographic entries, but unfortunately not for all of them. The BibTeX approach, by contrast, allows for arbitrary macros in its data model, which means it supports a full programming language. I am not aware of how to rectify that situation, and it seems that this is not a topic that many people work on or are interested in. It bothers me because the markup is possible in TeX, but then does not translate well to HTML.

Ultimately, it comes down to the fact that the bibliography is code, and as such it is probably easiest to specify it in code itself, portability be damned. (I would argue that this should only be considered for one-man projects, as otherwise the concerns of interoperability quickly supersede concerns about stylistic perfection.) To me, it appears as if this is the only option to inject the relevant additional knowledge into the bibliography.

What do you think? Is it demanding too much from a bibliographic tool, since the vast majority of readers will hardly notice or care? Classic libraries seem to also rely on plain-text information, as this is the only reliable way to query a large body of literature.

Or should something like this exist? This would amount to a DSL that is independent of TeX, but it seems that it would also have to be Turing complete if it is to be as powerful as the BibTeX language.