How to export all your blog posts as a CSV with Python and SQLite

A friend asked me if I could help him export all of his blog posts in CSV format, as he was considering migrating his content to another platform.

The example in this article shows how to export content created on Bloggi.co, but the same basic principles apply to any blogging platform accessible via a REST API: you could, for instance, use the same approach with the WordPress API or the Ghost API.

You can use this tutorial not only for migration purposes but also to learn how to retrieve data from an API in order to expose it in another web client.

1° Figure out how to connect to the REST API

Start by looking for the API documentation. For Bloggi, you can find it here.

The documentation tells us how to authenticate to the API. Bear in mind that the documentation provides curl examples (which you can run from the command line in your terminal).

Here is, for instance, the authentication instruction (replace bloggi.co with your domain and YOUR_API_KEY with your actual API key):

curl -H "X-API-Key: YOUR_API_KEY" https://bloggi.co/api/v1/posts

We’ll be using Python in this tutorial, so we need to understand how to adapt this curl command to Python.

-H indicates that “X-API-Key: YOUR_API_KEY” is the header part of the request which we’ll have to make.

In Python, to interact with an API, we use the REQUESTS library, which we simply import into our script by writing import requests.
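
As a quick preview (we’ll rebuild this step by step in the next section), here is roughly what that curl command looks like once translated to requests, with YOUR_API_KEY as a placeholder:

import requests

# Equivalent of: curl -H "X-API-Key: YOUR_API_KEY" https://bloggi.co/api/v1/posts
response = requests.get("https://bloggi.co/api/v1/posts",
                        headers={"X-API-Key": "YOUR_API_KEY"})
print(response.status_code)  # 200 means the request (and the API key) worked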

2° Call the API to get the list of posts

Before making our first call, let’s import all the libraries we’ll use in our script. We’ve already talked about requests; we’ll also import sqlite3 (to create a SQLite database where we’ll store the content before generating the CSV), csv (to generate the CSV) and json (to manipulate JSON data).

import requests
import sqlite3
import csv
import json

The purpose of our first API call will be to get a list of all the posts on our blog.

We will first declare two variables:

  • url = the API endpoint
  • header = the authentication header

I’ll be using bloggi.co throughout this tutorial. Don’t forget to replace it with your domain.

url = "https://bloggi.co/api/v1/posts"
header = {"X-API-Key": "your api key"}

Then we’ll call the endpoint using requests, passing our header variable via the headers parameter. Bear in mind that headers is a standard keyword argument of requests.get. Using .json() we’ll then get the JSON response coming back from the API.

api_result = requests.get(url, headers=header)
jsonResponse = api_result.json()

We can print out jsonResponse to examine the structure of the response.

I’m using the json.dumps method to prettify the output, in order to see the structure of the JSON response as outlined in the API documentation. If you print out jsonResponse without json.dumps, you’ll get one long unformatted line which is much harder to examine.

pretty_json = json.dumps(jsonResponse, indent=4)
print(pretty_json)

You should get something which looks like the example provided in the documentation.

{
  "data": [
    {
      "id": "855c13",
      "slug": "example",
      "url": "http://test.bloggi.co/example",
      "published_at": "2019-11-15T22:04:08Z",
      "created_at": "2019-11-15T19:28:21Z",
      "updated_at": "2020-10-22T23:24:24Z",
      "title": "Example",
      "html": "<p>The content...</p>",
      "excerpt": "The excerpt...",
      "image_url": null,
      "code_head": "",
      "tags": [
        {
          "name": "A tag",
          "slug": "a-tag"
        }
      ]
    }
  ],
  "pagination": {
    "page": 1,
    "pages": 1,
    "total": 1,
    "limit": 50
  }
}

This gives us clear indications on how we should access the elements of the response.

We first want to access the ID of each post (“id”: …) so that we can iterate through all the posts using a FOR LOOP.

We can see that the LIST of posts is called "data", so the list can be accessed via jsonResponse["data"]. In Python we use square brackets to access the nested levels inside the JSON response. If we wanted to access just the first post, we could do it this way: jsonResponse["data"][0] (index 0 = the first element in the list).
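
For instance, here is a quick illustration (not needed in the final script) of how to grab the ID of the first post and count the posts returned:

first_post = jsonResponse["data"][0]    # index 0 = first element of the list
print(first_post["id"])                 # e.g. "855c13"
print(len(jsonResponse["data"]))        # number of posts returned in this response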

3° Call the POSTS endpoint to get the content of each post

We will iterate through the list using a FOR LOOP, accessing each post via its ID, then call the API endpoint to retrieve the data for each post. Before calling the API for each individual post, you can print out the post_id to verify that everything’s working fine, then comment out (#) the print instruction.

👉 Do not forget to replace the url inside the requests.get parentheses if you copy-paste it from the first API call (url should become url_post).

for item in jsonResponse["data"]: 

    post_id = item["id"]

    #print(post_id)
    
    url_post = f"https://bloggi.co/api/v1/posts/{post_id}"

    api_post_result = requests.get(url_post, headers=header)

    jsonResponse_post = api_post_result.json()

If you print out jsonResponse_post for each post in the loop, you’ll get something like this:

{
  "data": {
    "id": "855c13",
    "slug": "example",
    "url": "http://test.bloggi.co/example",
    "published_at": "2019-11-15T22:04:08Z",
    "created_at": "2019-11-15T19:28:21Z",
    "updated_at": "2020-10-22T23:24:24Z",
    "title": "Example",
    "html": "<p>The content...</p>",
    "excerpt": "The excerpt...",
    "image_url": null,
    "code_head": "",
    "tags": [
      {
        "name": "A tag",
        "slug": "a-tag"
      }
    ]
  }
}

We need the title, the html (the body), the image_url and the tags.

Here’s how we’ll get them:

title = jsonResponse_post["data"]["title"]
html = jsonResponse_post["data"]["html"]
image_url = jsonResponse_post["data"]["image_url"]

#Transform tags JSON into a list, select only the NAMES, then convert the list into a string of comma separated values

tags_json = jsonResponse_post["data"]["tags"]
    
list_of_tags = []
for tag in tags_json:
    list_of_tags.append(tag["name"])
tags = ", ".join(list_of_tags)

#print(tags)

I had to spend more time on the tags, since they’re first delivered in JSON format: I extracted them into tags_json, iterated through that list to get just the name of each tag (I didn’t need the slug), appended the names to a dedicated list (list_of_tags), and finally joined them into a single string of comma-separated tag names, ready for insertion into my database (tags).

When printed out, tags would look like tag1, tag2, tag3, etc. Perfect!

I also added a condition to replace None results with “no image” when a post has no image.

if image_url == None:
    image_url = "no image"

This is how the blog post access code looks with the indentation under the for loop.

for item in jsonResponse["data"]: 

    post_id = item["id"]

    #print(post_id)
    
    url_post = f"https://yogakiddy.com/api/v1/posts/{post_id}"

    api_post_result = requests.get(url_post, headers=header)

    jsonResponse_post = api_post_result.json()

    title = jsonResponse_post["data"]["title"]
    html = jsonResponse_post["data"]["html"]
    image_url = jsonResponse_post["data"]["image_url"]
    
    #Transform tags JSON into a list, select only the NAMES, then convert the list into a string of comma separated values
    tags_json = jsonResponse_post["data"]["tags"]
    
    list_of_tags = []
    for tag in tags_json:
        list_of_tags.append(tag["name"])
    tags = ", ".join(list_of_tags)
    #print(tags)
    
    if image_url == None:
        image_url = "no image"

4° Create the SQLite database

When I started this exercise, my intention was to export the data to a Google Sheet, as I did in a previous article. That’s perfectly fine but I wanted to use this opportunity to experiment with SQLite.
The advantage of the SQLite method vs Google Sheets is that you’re not constrained by the rate limiting of the Google Sheets API, which forces you to insert time.sleep instructions between the loop iterations (otherwise you quickly get an error). If you want to process a lot of data, SQLite will be much faster.

First we need to connect to our DB (this step will create the db if it doesn’t exist yet) and initialize what’s called a cursor, which we’ll use to execute all our SQL statements.

con = sqlite3.connect("blogposts.db", timeout=0, check_same_thread=False)
cur = con.cursor()

Then we’ll create a TABLE named POSTS in our database, with 4 columns named title, html, image, tags.

cur.execute("CREATE TABLE POSTS(title,html,image,tags)")

👉 You just have to execute this statement once. Comment it out or remove it from your script after the creation of the table.

Note: in the final code (available on Kaggle) this part comes BEFORE the FOR LOOP, since we need the DB and TABLE to be ready before inserting data at each iteration of our loop.

In the for loop, we’ll insert into the table the data retrieved via the API (of course, I’m assuming that you run this code after the API calls that retrieve the list of posts and the data for each post).

query = f"INSERT INTO POSTS (title, html, image, tags) VALUES(?, ?, ?, ?)"
cur.execute(query,(title, html, image_url, tags))
con.commit()

Don’t forget the con.commit() part, otherwise the inserted rows won’t actually be saved to the database.

👉 I used the VALUES(?, ?, ?, ?) placeholder approach so that the values can be passed separately to cur.execute on the next line. If you try to put the variables straight inside the VALUES parentheses, you’ll get an error.

Bear in mind that I closed the db connection after the for loop, via the con.close() command, which is a best practice when you’re done with a specific interaction with the database. So I’ll have to re-open it for the final part of this tutorial.
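
To recap how the database pieces fit together, here is a condensed sketch of the insertion part (it assumes the imports, the header variable and the jsonResponse from the earlier steps; the full script is on Kaggle). I’ve used CREATE TABLE IF NOT EXISTS here, a small variation that makes the statement safe to re-run:

con = sqlite3.connect("blogposts.db", timeout=0, check_same_thread=False)
cur = con.cursor()

# create the table once, before the loop (IF NOT EXISTS makes re-runs harmless)
cur.execute("CREATE TABLE IF NOT EXISTS POSTS(title,html,image,tags)")

for item in jsonResponse["data"]:
    post_id = item["id"]
    url_post = f"https://bloggi.co/api/v1/posts/{post_id}"
    jsonResponse_post = requests.get(url_post, headers=header).json()

    title = jsonResponse_post["data"]["title"]
    html = jsonResponse_post["data"]["html"]
    image_url = jsonResponse_post["data"]["image_url"]
    if image_url == None:
        image_url = "no image"
    tags = ", ".join(tag["name"] for tag in jsonResponse_post["data"]["tags"])

    # parameterized insert, committed at each iteration
    query = "INSERT INTO POSTS (title, html, image, tags) VALUES(?, ?, ?, ?)"
    cur.execute(query, (title, html, image_url, tags))
    con.commit()

con.close()  # close the connection once all the posts have been inserted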

5° Export the content of the POSTS table in CSV format

We will select all the content in POSTS, create/open a CSV in WRITE mode (w) and write both the headers of the table and the content as new rows inside this CSV.

con = sqlite3.connect("blogposts.db", timeout=0, check_same_thread=False)
cur = con.cursor()

query = "SELECT * FROM POSTS"
cur.execute(query)

with open("blogposts.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([i[0] for i in cur.description])  # write table headers
    writer.writerows(cur.fetchall())

We can now download the file blogposts.csv which contains all the data retrieved via the API.
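
If you want to sanity-check the export before downloading it, you can read the file back (an optional check, not part of the original script):

with open("blogposts.csv", newline="") as f:
    rows = list(csv.reader(f))

print(rows[0])          # header row: title, html, image, tags
print(len(rows) - 1)    # number of exported posts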

🌐 You can access & clone the full script on Kaggle. Click HERE.

🚀 Subscribe to my weekly newsletter packed with tips & tricks around AI, SEO, coding and smart automations

☕️ If you found this piece helpful, you can buy me a coffee