hugo-site/latex-to-markdown.md at 3ee0d4f499d1b0ad970a240f1a8f79f5b4906734


mirror of https://github.com/lleene/hugo-site.git synced 2025-07-12 04:18:32 +02:00


Lieuwe Leene 8a3ad22d99 WIP: formatting publications on personal website.

2022-09-05 18:48:11 +02:00


title: Latex to Markdown
date: 2022-04-28T13:42:40+02:00
draft: false
tags: markdown, latex, code, python, hugo

Recently I started porting some of my Latex articles to markdown, as they would make a fine contribution to this website in a simpler format. Writing a simple parser in Python isn't that hard. I could have used Pandoc, but I wanted to keep the formatting as simple as possible when rendering a Hugo markdown page. So I prepared several regex-based functions in Python to dereference and construct a Hugo-compatible markdown file.

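from os import path
from pathlib import Path


class LatexFile:
    def __init__(self, src_file: Path):
        # Split the source location into its directory and file name.
        sys_path = path.abspath(src_file)
        src_dir = path.dirname(sys_path)
        src_file = path.basename(sys_path)
        # Expand a single \input of the root file into one flat string.
        self.tex_src = self.flatten_input("\\input{" + src_file + "}", src_dir)
        # The .bbl file is looked up next to the .tex file.
        self.filter_tex(sys_path.replace(".tex", ".bbl"))

    def filter_tex(self, bbl_file: Path) -> None:
        """Default TEX filtering procedure."""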
        self.strip_tex()
        self.preprocess()
        self.replace_references(bbl_file)
        self.replace_figures()
        self.replace_tables()
        self.replace_equations()
        self.replace_sections()
        self.postprocess()

The general process for converting a Latex document is outlined above. The principle is to start from a flat text source and format it step by step, so that each Latex component is translated and replaced by plain text with markdown syntax.
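The `flatten_input` step used in the constructor is not shown in this post. As a rough idea of what producing a flat source could look like, here is a minimal sketch. The function body, the `INPUT_PATTERN` name, and the handling of the `.tex` extension are all assumptions for illustration, not the actual implementation. It ignores commented-out `\input` lines and does not guard against files that include each other.

```python
import re
from pathlib import Path

# Hypothetical pattern: matches \input{name} and captures the file name.
INPUT_PATTERN = re.compile(r"\\input\{([^}]+)\}")


def flatten_input(tex: str, src_dir: str) -> str:
    """Recursively replace every \\input{file} with that file's contents."""

    def expand(match: re.Match) -> str:
        name = match.group(1)
        if not name.endswith(".tex"):
            name += ".tex"  # Latex allows omitting the extension
        content = (Path(src_dir) / name).read_text(encoding="utf-8")
        # Included files may themselves contain \input commands.
        return flatten_input(content, src_dir)

    # A function replacement is inserted literally, so backslashes in the
    # included Latex are not interpreted as regex escapes.
    return INPUT_PATTERN.sub(expand, tex)
```

Starting from `"\\input{main.tex}"`, as the constructor does, this would return the whole document as one string, which is what the later regex passes operate on.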

Latex Components

To structure the Python code I created several named tuples for self-contained Latex contexts such as figures, tables, and equations. By adding a markdown property to each, we get a collection of objects with which we can simply replace the corresponding Latex code in a predictable manner.

from typing import List, NamedTuple, Tuple


class Figure(NamedTuple):
    """Structured Figure Item."""

    span: Tuple[int, int]
    index: int
    files: List[str]
    caption: str
    label: str

    @property
    def markdown(self) -> str:
        """Markdown string for this figure."""
        fig_str = ""
        for file in self.files[:-1]:
            fig_str += "{{" + f'< figure src="{file}" width="500" >' + "}}\n"
        fig_str += (
            "{{"
            + f'< figure src="{self.files[-1] if self.files else ""}" '
            + f'title="Figure {self.index}: {self.caption}" width="500" >'
            + "}}\n"
        )
        return fig_str

Notice that we use a Hugo shortcode to represent the figure in markdown. This lets us set the width and other properties in a simpler and more systematic way.
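The same pattern carries over to the other contexts. As an illustration only, an equation item could look like the sketch below. The `Equation` class, its `body` field, and the use of `\tag` for numbering are assumptions rather than the code used for this site, and `\tag` only renders if the page's math renderer supports it.

```python
from typing import NamedTuple, Tuple


class Equation(NamedTuple):
    """Hypothetical structured equation item, mirroring Figure."""

    span: Tuple[int, int]  # start and end offsets in the flat source
    index: int  # equation number as it would appear in print
    body: str  # raw Latex math between the begin and end commands
    label: str

    @property
    def markdown(self) -> str:
        """Markdown string for this equation: display math plus its number."""
        return f"$$ {self.body.strip()} \\tag{{{self.index}}} $$\n"


eq = Equation(span=(0, 42), index=3, body=" E = mc^2 ", label="eq:energy")
print(eq.markdown)  # prints: $$ E = mc^2 \tag{3} $$
```

Because every item exposes the same `span` and `markdown` members, one replacement routine can serve figures, tables, and equations alike.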

Replacement Procedure

As mentioned before, the replacement simply looks for sections in the source and directly replaces them with the appropriate markdown text. To do this it is important to process the source in reverse order, so that the text locations of the remaining sections stay correct as each replacement is made.
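To see why the order matters, here is a small standalone sketch. The `replace_spans` helper and the sample string are made up for this illustration. Each replacement changes the length of the text, which shifts every offset after it, so working from the last span to the first leaves the offsets that are still to be used untouched.

```python
from typing import List, Tuple


def replace_spans(text: str, edits: List[Tuple[Tuple[int, int], str]]) -> str:
    """Replace each (start, end) span in text with its new string.

    Spans must not overlap. Edits are applied from the last span to the
    first so that the offsets of the spans still to be processed stay valid.
    """
    for (start, end), new in sorted(edits, key=lambda e: e[0][0], reverse=True):
        text = text[:start] + new + text[end:]
    return text


source = "aa FIG1 bb FIG2 cc"
edits = [((3, 7), "<first figure>"), ((11, 15), "<second figure>")]
print(replace_spans(source, edits))  # prints: aa <first figure> bb <second figure> cc
```

Applying the same edits front to back would insert the second replacement at offset 11 of an already longer string, landing in the middle of the first replacement.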

def replace_figures(self) -> None:
    """Dereference and replace all figures with markdown formatting."""
    fig_list = self.figures
    # Work from the last figure to the first so that earlier spans stay valid.
    fig_list.reverse()
    for figure in fig_list:
        self.tex_src = (
            self.tex_src[: figure.span[0]]
            + figure.markdown
            + self.tex_src[figure.span[1] :]
        )
    for figure in fig_list:
        # Raw strings avoid invalid-escape warnings; re.escape keeps the label literal.
        self.tex_src = re.sub(
            r"\\ref\{" + re.escape(figure.label) + r"\}",
            str(figure.index),
            self.tex_src,
        )

Secondly, we also replace the Latex references with plain-text references. Instead of using labels that are translated into numbers during compilation, we reference the figure number directly.
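The dereferencing step can be tried in isolation. The sketch below is a standalone variant with a hypothetical `dereference` helper and made-up labels. It escapes each label before building the pattern, so a label containing a regex metacharacter such as `.` is matched literally.

```python
import re
from typing import Dict


def dereference(text: str, numbers: Dict[str, int]) -> str:
    """Replace every \\ref{label} by the number assigned to that label."""
    for label, number in numbers.items():
        # re.escape guards against labels with regex metacharacters.
        pattern = r"\\ref\{" + re.escape(label) + r"\}"
        text = re.sub(pattern, str(number), text)
    return text


sample = r"See Figure \ref{fig:setup} and Figure \ref{fig:result.v2}."
print(dereference(sample, {"fig:setup": 1, "fig:result.v2": 2}))
# prints: See Figure 1 and Figure 2.
```

References to labels that are not in the mapping are left as they are, which makes missing figures easy to spot in the output.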

@property
def figures(self) -> List[Figure]:
    """Parse TEX contents for figure contexts."""
    return [
        Figure(
            span=(begin.start(), stop.end()),
            index=index + 1,
            files=[
                elem[1]
                for elem in re.findall(
                    r"\\includegraphics(.*)\{(.*)\}",
                    self.tex_src[begin.start() : stop.end()],
                )
            ],
            caption=self.first(
                re.findall(
                    r"\\caption\{(.*)\}",
                    self.tex_src[begin.start() : stop.end()],
                )
            ),
            label=self.first(
                re.findall(
                    r"\\label\{(.*)\}",
                    self.tex_src[begin.start() : stop.end()],
                )
            ),
        )
        # Pair the n-th \begin{figure} with the n-th \end{figure}.
        for index, (begin, stop) in enumerate(
            zip(
                re.finditer(r"\\begin\{figure\*?\}", self.tex_src),
                re.finditer(r"\\end\{figure\*?\}", self.tex_src),
            )
        )
    ]

The Python code above shows how we capture all figures found in the Latex source and aggregate them in a list of named tuples. Naturally this depends on the style used when writing Latex, but I generally try to keep my Latex code as simple and systematic as possible.
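To make the style dependence concrete, the sketch below runs the same patterns on a made-up figure (the `SAMPLE` text and file name are invented for this example). With one command per line everything is extracted cleanly. If the caption and label share a line, the greedy `(.*)` runs to the last closing brace on that line and captures too much.

```python
import re

SAMPLE = r"""
\begin{figure}[t]
  \centering
  \includegraphics[width=\linewidth]{images/setup.png}
  \caption{Measurement setup.}
  \label{fig:setup}
\end{figure}
"""

# Locate the figure environment, then search only inside it.
begin = re.search(r"\\begin\{figure\*?\}", SAMPLE)
stop = re.search(r"\\end\{figure\*?\}", SAMPLE)
block = SAMPLE[begin.start() : stop.end()]

files = [m[1] for m in re.findall(r"\\includegraphics(.*)\{(.*)\}", block)]
caption = re.findall(r"\\caption\{(.*)\}", block)
label = re.findall(r"\\label\{(.*)\}", block)
print(files, caption, label)
# prints: ['images/setup.png'] ['Measurement setup.'] ['fig:setup']

# Same patterns, different style: caption and label on one line.
one_line = r"\caption{Setup.}\label{fig:setup}"
greedy = re.findall(r"\\caption\{(.*)\}", one_line)
print(greedy)  # prints: ['Setup.}\\label{fig:setup']
```

Keeping each command on its own line is therefore what makes these simple patterns sufficient.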
