From b5d1242dc5377878bce384021e3d70786062195b Mon Sep 17 00:00:00 2001
From: Lieuwe Leene
Date: Wed, 27 Oct 2021 00:27:09 +0200
Subject: [PATCH] content update oct 2021

---
 content/about.md               |   4 ++
 content/posts/domain-setup.md  |  52 ++++++++++++++++-
 content/posts/python-urllib.md | 100 +++++++++++++++++++++++++++++++++
 3 files changed, 154 insertions(+), 2 deletions(-)
 create mode 100644 content/posts/python-urllib.md

diff --git a/content/about.md b/content/about.md
index e689827..5679864 100644
--- a/content/about.md
+++ b/content/about.md
@@ -13,6 +13,10 @@ This site shares a bit of informal documentation and more blog-based record
 keeping. Providing commentary on design decisions should be just as useful as
 some of the technical documentation however included in my repositories.
 
+### Contact
+
+You can reach me at `lieuwe at leene dot dev`.
+
 ## My Setup
 
 I mainly use RHEL flavours of linux having both CentOS and Fedora machines. Most
diff --git a/content/posts/domain-setup.md b/content/posts/domain-setup.md
index 1ed7cf2..c88209a 100644
--- a/content/posts/domain-setup.md
+++ b/content/posts/domain-setup.md
@@ -1,6 +1,54 @@
 ---
-title: "Domain Setup"
+title: "Domain Setup ☄💻"
 date: 2021-09-19T17:14:03+02:00
-draft: true
+draft: false
 ---
 
+
+
+## DNS Records
+
+The main part of setting up a domain is configuring your
+[DNS Records](https://en.wikipedia.org/wiki/List_of_DNS_record_types). These
+dictate how your physical machine address is mapped to your human-readable
+service names. I mainly use this domain for web services together with
+self-hosted email. As such I have outlined below the relevant records that
+these services require.
+
+| Name | Description
+| ----------------------------------------------- | -----------------------
+| **A** Address record | Physical IPv4 address associated with this domain
+| **CNAME** Canonical name record | Alias for an A record name. This is generally used for subdomains (e.g. other.domain.xyz as an alias for domain.xyz, both served by the same machine)
+| **CAA** Certification Authority Authorization | Constrains the acceptable CAs for a host/domain.
+| **DS** Delegation signer | The record used to identify the DNSSEC signing key of a delegated zone
+| **MX** Mail exchange record | Maps a domain name to a list of message transfer agents for that domain
+| **TXT** Text record | Carries machine-readable data, such as specified by RFC 1464, opportunistic encryption, Sender Policy Framework, DKIM, DMARC, DNS-SD, etc.
+
+The essential records for web services are the A and CNAME records which enable
+correct name lookup from outside your private network. Nowadays SSL should be
+part of any setup, so the certification authority you use should be set in
+the CAA record. Most likely this will be `letsencrypt.org` which provides SSL
+certificate signing free of charge, securing your traffic to some extent.
+Alongside this there should be a DS record, which publishes a digest of your
+zone's DNSSEC key-signing key (this is separate from your SSL certificate)
+and allows you to set up DNSSEC on your domain.
+
+The other records are required for secure email transfer. First you need the
+equivalent of a name record, the MX record, which should point to another A
+record and may or may not be the same machine / physical address as the domain
+hosting your web services. Signing your email is similar to SSL encryption and
+should be an essential part of your setup. An SMTP setup with postfix
+can do so by using [OpenDKIM](http://www.opendkim.org/). This will require
+you to similarly provide your public signing key as a TXT record.
+
+```bash
+"v=DKIM1;k=rsa;p=${key_part1}"
+"${key_part2}"
+```
+
+The TXT record will look something like the above statement. There are some
+inconveniences unfortunately when using RSA with a large key size,
+which yields a long public key. You need to break this key up into multiple
+strings, which the `opendkim` tool may or may not do by default, as each
+string in a TXT entry is limited to 255 characters. As long as no semicolons
+are inserted this should just work as expected.
diff --git a/content/posts/python-urllib.md b/content/posts/python-urllib.md
new file mode 100644
index 0000000..4da7cfa
--- /dev/null
+++ b/content/posts/python-urllib.md
@@ -0,0 +1,100 @@
+---
+title: "Python Urllib ⬇📜"
+date: 2021-10-26T20:02:07+02:00
+draft: false
+toc: true
+tags:
+  - python
+  - scraping
+  - code
+---
+
+I had to pull some metadata from a media database, and since this tends to
+be my go-to setup when I use urllib with Python, I thought I would make a quick
+note regarding cookies and making POST/GET requests accordingly.
+
+## Setting up an HTTP session
+
+The urllib Python library allows you to set global session parameters directly
+by calling the `build_opener` and `install_opener` functions. Usually, if you
+make HTTP requests with empty headers or little to no session data, a script
+will tend to be blocked wherever robots are not welcome. Setting these
+parameters mitigates that issue, but it is still advised to be a responsible
+end-user.
+
+```python
+import http.cookiejar
+import urllib.request
+
+# load the session cookies exported from the browser (Netscape cookies.txt format)
+mycookies = http.cookiejar.MozillaCookieJar()
+mycookies.load("cookies.txt")
+opener = urllib.request.build_opener(
+    urllib.request.HTTPCookieProcessor(mycookies)
+)
+opener.addheaders = [
+    (
+        "User-agent",
+        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
+        + "(KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36",
+    ),
+    (
+        "Accept",
+        "text/html,application/xhtml+xml,application/xml;q=0.9,"
+        + "image/avif,image/webp,image/apng,*/*;q=0.8,"
+        + "application/signed-exchange;v=b3;q=0.9",
+    ),
+]
+# make this opener the default for every later urlopen call
+urllib.request.install_opener(opener)
+```
+
+The above code snippet sets a user agent and what kind of data the session
+is willing to accept. This is generic and simply taken from one of my own
+browser sessions. Additionally I load in `cookies.txt`, which holds the session
+cookies that I exported to a file for a given domain from my browser.
+
+## HTTP POST request
+
+Web-based APIs offer various methods for interaction, most commonly POST
+requests with JSON input/output and occasionally XML. Given Python's native
+support for JSON, this is generally the way to do things.
+
+```python
+import json
+import urllib.request
+
+def post_json(host_name, post_data):
+    url = f"{host_name}/api.php"
+    data = json.dumps(post_data).encode()  # the request body must be bytes
+    req = urllib.request.Request(url, data=data)  # a body makes this a POST
+    with urllib.request.urlopen(req) as meta:
+        return json.loads(meta.read())
+```
+
+The above code snippet prepares a `req` object for a particular `host_name` and
+`post_data`, which is a dictionary that is encoded to a JSON string. Calling
+`urlopen` on this request performs a POST request, which, if all works as
+expected, returns a JSON string that is mapped to a Python
+collection.
+
+In the scenario where the data is returned as an XML string / document, there
+is the `xmltodict` Python library that will return a Python collection. The
+downside here is that XML tends to have quite a deep hierarchy, which is
+difficult to appreciate unless we get into large XML data structures that can
+be queried. For reference the XML parsing will look something like this:
+
+```python
+import xmltodict  # third-party package
+xmltodict.parse(meta.read())
+```
+
+## HTTP GET request with BeautifulSoup
+
+Performing GET requests is usually much simpler since you just need
+to determine the appropriate URL. Here I included an example where the
+`BeautifulSoup` Python library is used to contain the HTTP response and
+search through any links within the response that match a regular expression.
+
+```python
+import re
+import urllib.request
+from bs4 import BeautifulSoup
+
+def search_links(host_name, tag_name):
+    query_url = f"{host_name}/?f_search={tag_name}"
+    resp_data = urllib.request.urlopen(query_url)
+    resp_soup = BeautifulSoup(resp_data, "html.parser")  # name the parser explicitly
+    link_re = re.compile(re.escape(host_name) + r"/g/([0-9a-z]+)/([0-9a-z]+)")
+    return [
+        link["href"]
+        for link in resp_soup.find_all("a", href=True)
+        if link_re.match(link["href"])
+    ]
+```
+
+This is probably the most common use case for the `BeautifulSoup` library and
+it is far more effective than sifting through the raw HTML data by hand.
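
The POST and GET patterns in `python-urllib.md` can be tried without touching the network. The following is only a minimal sketch under a few assumptions: `build_post_request`, `LinkCollector`, `matching_links` and the `https://example.org` host are hypothetical names that do not appear in the post, and the standard library's `html.parser` is swapped in for `BeautifulSoup` so that nothing needs to be installed.

```python
import json
import re
import urllib.request
from html.parser import HTMLParser


def build_post_request(host_name, post_data):
    """Build, but do not send, a JSON POST request to a hypothetical api.php."""
    body = json.dumps(post_data).encode("utf-8")
    # Passing a non-None body is what makes urllib use POST instead of GET.
    return urllib.request.Request(f"{host_name}/api.php", data=body)


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def matching_links(html, host_name):
    """Return the links that look like {host_name}/g/<id>/<token>."""
    # re.escape keeps the dots in host_name from acting as wildcards.
    pattern = re.compile(re.escape(host_name) + r"/g/[0-9a-z]+/[0-9a-z]+")
    collector = LinkCollector()
    collector.feed(html)
    return [link for link in collector.links if pattern.match(link)]


if __name__ == "__main__":
    req = build_post_request("https://example.org", {"id": 1})
    print(req.get_method(), req.full_url)
    page = '<a href="https://example.org/g/12ab/cd34">hit</a>'
    print(matching_links(page, "https://example.org"))
```

Building the `Request` separately from calling `urlopen` makes it possible to inspect the method, URL and body before anything is sent, which helps when an API rejects a request and it is unclear why.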