---
title: "Latex to Markdown"
date: 2022-04-28T13:42:40+02:00
draft: false
tags:
- markdown
- latex
- code
- python
- hugo
---

Recently I started porting some of my LaTeX articles to Markdown, as they would
make a fine contribution to this website in a simpler format. Writing a simple
parser in Python isn't that bad. I could have used [Pandoc](https://pandoc.org/index.html),
but I wanted to keep the formatting as simple as possible when rendering a Hugo
Markdown page. So I prepared several regex-based functions in Python to
dereference and construct a Hugo-compatible Markdown file.

``` python3
from os import path
from pathlib import Path


class LatexFile:
    def __init__(self, src_file: Path):
        sys_path = path.abspath(src_file)
        src_dir = path.dirname(sys_path)
        src_file = path.basename(sys_path)
        # Wrap the main file in an \input command and flatten it into one string.
        self.tex_src = self.flatten_input("\\input{" + src_file + "}", src_dir)
        # The bibliography (.bbl) is looked up next to the .tex source.
        self.filter_tex(sys_path.replace(".tex", ".bbl"))

    def filter_tex(self, bbl_file: Path) -> None:
        """Default TEX filtering procedure."""
        self.strip_tex()
        self.preprocess()
        self.replace_references(bbl_file)
        self.replace_figures()
        self.replace_tables()
        self.replace_equations()
        self.replace_sections()
        self.postprocess()
```

The general process for converting a LaTeX document is outlined above. The
principle is to start from a flat text source and format it incrementally, so
that each LaTeX component is translated in turn and replaced by plain text with
Markdown syntax.

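The flattening step itself is not listed here. As a rough sketch of the idea,
and not the exact method used in `LatexFile`, a recursive replacement of
`\input{...}` commands could look like the following. The function name
`flatten`, the pattern and the rule of appending `.tex` when it is missing are
assumptions made for illustration.

```python
import re
from pathlib import Path

# Assumed pattern: \input{name} with the file name as the only argument.
INPUT_PATTERN = re.compile(r"\\input\{([^}]+)\}")


def flatten(tex: str, src_dir: str) -> str:
    """Recursively replace every \\input{...} with the content of that file."""

    def include(match: re.Match) -> str:
        name = match.group(1)
        if not name.endswith(".tex"):
            name += ".tex"  # assumption: the extension may be omitted
        content = (Path(src_dir) / name).read_text(encoding="utf-8")
        # Included files may themselves contain \input commands.
        return flatten(content, src_dir)

    return INPUT_PATTERN.sub(include, tex)
```

Calling `flatten(r"\input{main}", src_dir)` then returns the whole document as a
single string, which is the form the later replacement steps work on.
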
## LaTeX Components

In order to structure the Python code I created several named tuples for
self-contained LaTeX contexts such as figures, tables, equations, etc. By
adding a `markdown` property to each of them we get a collection of objects
with which we can simply replace the corresponding LaTeX code in a predictable
manner.

``` python3
from typing import List, NamedTuple, Tuple


class Figure(NamedTuple):
    """Structured Figure Item."""

    span: Tuple[int, int]  # (start, end) offsets in the LaTeX source
    index: int  # figure number, starting at 1
    files: List[str]  # paths given to \includegraphics
    caption: str
    label: str

    @property
    def markdown(self) -> str:
        """Markdown string for this figure."""
        fig_str = ""
        # Every image except the last is emitted without a title.
        for file in self.files[:-1]:
            fig_str += "{{" + f'< figure src="{file}" width="500" >' + "}}\n"
        # The last image carries the numbered caption.
        fig_str += (
            "{{"
            + f'< figure src="{self.files[-1] if self.files else ""}" '
            + f'title="Figure {self.index}: {self.caption}" width="500" >'
            + "}}\n"
        )
        return fig_str
```

Notice that here we use a Hugo shortcode to represent the figure in Markdown.
This lets us set the width and other properties in a simpler and more
systematic way.

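The other contexts can follow the same pattern. As a sketch only, an equation
item might look like the following. The `Equation` class, its field names and
the `$$ ... $$` output with `\tag` are assumptions for illustration, and they
presume that the site renders display math with MathJax or KaTeX.

```python
from typing import NamedTuple, Tuple


class Equation(NamedTuple):
    """Hypothetical equation item, mirroring the Figure tuple."""

    span: Tuple[int, int]  # (start, end) offsets in the LaTeX source
    index: int  # running equation number
    body: str  # math found inside the equation environment
    label: str  # argument of the \label command

    @property
    def markdown(self) -> str:
        """Display-math block carrying the equation number as a tag."""
        return "$$\n" + f"{self.body.strip()} \\tag{{{self.index}}}" + "\n$$\n"


eq = Equation(span=(0, 10), index=1, body=" E = mc^2 \n", label="eq:energy")
print(eq.markdown)
```

Since every item exposes the same `markdown` property, the replacement code
does not need to know which kind of context it is handling.
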
## Replacement Procedure

As mentioned before, the replacement simply looks for sections in the source
and directly replaces them with the appropriate Markdown text. To do this it is
important to process the matches in reverse order, so that the text locations
(spans) of the items still to be replaced remain correct as the replacement
occurs.

``` python3
# Method of LatexFile.
def replace_figures(self) -> None:
    """Dereference and replace all figures with markdown formatting."""
    fig_list = self.figures
    # Replace from the last figure to the first so earlier spans stay valid.
    fig_list.reverse()
    for figure in fig_list:
        self.tex_src = (
            self.tex_src[: figure.span[0]]
            + figure.markdown
            + self.tex_src[figure.span[1] :]
        )
    # Turn every \ref{label} into the plain figure number.
    for figure in fig_list:
        self.tex_src = re.sub(
            r"\\ref\{" + re.escape(figure.label) + r"\}",
            str(figure.index),
            self.tex_src,
        )
```

Secondly, we also replace the LaTeX references with plain-text references.
Instead of using labels that are translated into numbers during compilation, we
write the figure number directly.

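To see both steps on a small input, here is a self-contained sketch. The `Item`
tuple, the `[FIGURE A]` placeholders and the `replace_items` helper are made up
for this example and only stand in for the real figure handling.

```python
import re
from typing import List, NamedTuple, Tuple


class Item(NamedTuple):
    """Hypothetical stand-in for Figure: span, label, number and new text."""

    span: Tuple[int, int]
    label: str
    index: int
    markdown: str


def replace_items(src: str, items: List[Item]) -> str:
    """Replace each span, last one first, then resolve \\ref{label} commands."""
    # Going backwards only changes text behind the spans still to be replaced,
    # so their (start, end) offsets stay valid.
    for item in sorted(items, key=lambda it: it.span[0], reverse=True):
        src = src[: item.span[0]] + item.markdown + src[item.span[1] :]
    for item in items:
        # re.escape guards against regex metacharacters in the label.
        src = re.sub(r"\\ref\{" + re.escape(item.label) + r"\}", str(item.index), src)
    return src


src = r"Figure \ref{fig:a}: [FIGURE A] Figure \ref{fig:b}: [FIGURE B]"
items = [
    Item(m.span(), f"fig:{m.group(1).lower()}", i, f"<{m.group(1)}>")
    for i, m in enumerate(re.finditer(r"\[FIGURE (\w)\]", src), start=1)
]
print(replace_items(src, items))  # Figure 1: <A> Figure 2: <B>
```

The replacement text is shorter than the placeholder it replaces, so replacing
front to back would shift the second span and cut the string at the wrong place.
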
``` python3
# Property of LatexFile.
@property
def figures(self) -> List[Figure]:
    """Parse TEX contents for figure contexts."""
    return [
        Figure(
            span=(begin.start(), stop.end()),
            index=index + 1,
            files=[
                elem[1]
                for elem in re.findall(
                    r"\\includegraphics(.*)\{(.*)\}",
                    self.tex_src[begin.start() : stop.end()],
                )
            ],
            caption=self.first(
                re.findall(
                    r"\\caption\{(.*)\}",
                    self.tex_src[begin.start() : stop.end()],
                )
            ),
            label=self.first(
                re.findall(
                    r"\\label\{(.*)\}",
                    self.tex_src[begin.start() : stop.end()],
                )
            ),
        )
        # Pair the n-th \begin{figure} with the n-th \end{figure}.
        for index, (begin, stop) in enumerate(
            zip(
                re.finditer(r"\\begin\{figure\*?\}", self.tex_src),
                re.finditer(r"\\end\{figure\*?\}", self.tex_src),
            )
        )
    ]
```

The piece of Python code above exemplifies how we capture all figures found in
the LaTeX source and aggregate them in a list of named tuples. Naturally this
depends on the style used when writing LaTeX, but I generally try to keep my
LaTeX code as simple and systematic as possible.
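
To make the style dependence concrete, here is a small check of the same
patterns on a made-up figure. The file name, caption and label are invented for
illustration. The patterns work as intended when each command sits on its own
line, because `.` does not match a newline.

```python
import re

tex = r"""
\begin{figure}
\includegraphics[width=\textwidth]{img/setup.png}
\caption{Measurement setup.}
\label{fig:setup}
\end{figure}
"""

begin = re.search(r"\\begin\{figure\*?\}", tex)
stop = re.search(r"\\end\{figure\*?\}", tex)
block = tex[begin.start() : stop.end()]

# Each findall result for \includegraphics is (options, file name).
files = [name for _, name in re.findall(r"\\includegraphics(.*)\{(.*)\}", block)]
caption = re.findall(r"\\caption\{(.*)\}", block)
label = re.findall(r"\\label\{(.*)\}", block)
print(files, caption, label)
# ['img/setup.png'] ['Measurement setup.'] ['fig:setup']

# With two commands on one line the greedy (.*) runs up to the last brace.
same_line = r"\caption{Setup.}\label{fig:setup}"
print(re.findall(r"\\caption\{(.*)\}", same_line))
# ['Setup.}\\label{fig:setup']
```

The second result shows why a consistent one-command-per-line style matters for
this approach: the caption pattern would otherwise swallow the label as well.
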