教程 > 使用 SQLite 和 Python 进行数据分析

使用 SQLite 和 Python 进行数据分析

在 PyCon 2023 上展示的关于 SQLite、Python、sqlite-utils 和 Datasette 的 2 小时 45 分钟视频教程。

您需要什么
- python3 和 pip
- 可选：GitHub Codespaces
sqlite-utils
- 使用命令行工具清理数据
使用 Datasette 探索数据
将 sqlite-utils 作为 Python 库使用，导入所有 PEP
启用全文搜索
将数据库发布到 Vercel
- 其他发布选项
Datasette Lite
进阶 SQL
JSON
- 窗口函数
烘焙数据
- 小众博物馆和 TILs
- 使用 Observable notebook 生成新闻通讯
更多演示和进一步阅读
- 有趣的演示
- SpatiaLite

您需要什么

python3 和 pip

本教程的第一部分，您需要一个可用的 Python 3 解释器，其中包含 sqlite3 标准库模块。

您可以在自己的计算机上运行它，或者使用基于浏览器的环境。

https://pyodide.org/en/stable/console.html 可用于纯粹基于浏览器的 (WebAssembly) 环境。

对于第二部分，您还需要能够使用 pip install 安装 Python 包。

在您自己的笔记本电脑上使用 Python 3（也许在一个新的虚拟环境中）是一个不错的选择。您也可以使用 GitHub Codespaces 或 Google Colab 或 Jupyter Hub 或您选择的在线 notebook 解决方案。

可选：GitHub Codespaces

我将使用 GitHub Codespaces 来完成本教程，具体使用 https://github.com/github/codespaces-jupyter

Animated demo of Codespaces Jupyter

sqlite-utils

sqlite-utils 提供“用于操作 SQLite 数据库的 CLI 工具和 Python 实用函数”。

您可以像安装 Datasette 一样安装它

pip install sqlite-utils

或者使用 pipx

pipx install sqlite-utils

或者使用 Homebrew

brew install sqlite-utils

它既可以作为 CLI 工具使用，也可以作为 Python 库使用。

使用命令行工具清理数据

接下来我们将按照此教程进行操作：使用 sqlite-utils 和 Datasette 清理数据

使用 Datasette 探索数据

Datasette 是一个“用于探索和发布数据的开源多功能工具”。

在本地安装 Datasette

pip install datasette

或者如果您喜欢使用 pipx

pipx install datasette

或者使用 Homebrew (在 macOS 上)

brew install datasette

更多安装选项.

在 Codespaces 中，您还应该安装 datasette-codespaces 插件

datasette install datasette-codespaces

尝试一个数据库：legislators.db

wget https://congress-legislators.datasettes.com/legislators.db

这是一个包含美国议员、总统和副总统的数据库。

您可以在线探索它：https://congress-legislators.datasettes.com/legislators

像这样在 Datasette 中打开它

datasette legislators.db

我们将按照此教程探索 Datasette 的功能：使用 Datasette 探索数据库

安装一些插件

Datasette 有一百多个插件：https://datasette.com.cn/plugins

您可以使用 pip install 安装它们，但最好使用 datasette install，因为这样可以确保它们安装到正确的虚拟环境中，特别是当您使用 pipx 或 Homebrew 安装 Datasette 本身时，这非常有用。

datasette install datasette-cluster-map

现在重启 Datasette 并访问“offices”表查看结果。

您可以使用此命令查看已安装的插件：

datasette plugins

或者通过访问 Datasette 中的 /-/plugins 页面。

可以使用此命令卸载插件：

datasette uninstall datasette-cluster-map

使用 Datasette 学习 SQL

“✎ 查看和编辑 SQL”链接是开始学习基本 SQL 查询的快捷方式。

接下来我们将按照此教程进行操作：使用 Datasette 学习 SQL

将 sqlite-utils 作为 Python 库使用，导入所有 PEP

让我们重新实现之前提到的 PEPs 示例，但使用 sqlite-utils 让它做得更好。

我将在一个 notebook 中完成此操作。

!git clone https://github.com/python/peps /tmp/peps

现在我们所有的 PEPs 都已存储在 /tmp/peps 中。

import pathlib

files = list(pathlib.Path("/tmp/peps").glob("pep-*.txt"))

并使用我们之前的函数解析它们

def parse_pep(s):
    intro, body = s.split("\n\n", 1)
    pep = {}
    current_key = None
    current_value = None
    for line in intro.split("\n"):
        # If the line starts with whitespace, it's a continuation of the previous value
        if line.startswith(" ") or line.startswith("\t"):
            if current_key is not None:
                current_value += " " + line.strip()
                pep[current_key] = current_value.strip()
        else:
            # Split the line into key and value
            parts = line.split(": ", 1)
            if len(parts) == 2:
                key, value = parts
                # Update the current key and value
                current_key = key
                current_value = value
                # Add the key-value pair to the pep dictionary
                pep[current_key] = current_value.strip()
    pep["Body"] = body.strip()
    return pep

peps = []
for file in files:
    peps.append(parse_pep(file.read_text()))

我们现在有一个字典列表。让我们将它们加载到 SQLite 中

%pip install sqlite-utils

import sqlite_utils
db = sqlite_utils.Database("/tmp/peps.db")
db["peps"].insert_all(peps, pk="PEP", replace=True)

我遇到了这个错误

OperationalError: table peps has no column named PEP-Delegate

要解决此问题，请使用 alter=True 自动添加任何缺失的列

db["peps"].insert_all(peps, pk="PEP", alter=True, replace=True)
print(db["peps"].count)
# Outputs 429

启用全文搜索

SQLite 内置了令人惊讶的出色的全文搜索功能。

sqlite-utils 可以帮助您启用它

db["peps"].enable_fts(["Title", "Body"])

Datasette 将检测到此功能并在表格页面顶部添加一个搜索框。

要按相关性顺序运行搜索，您需要执行一个自定义的 SQL 查询

select
  PEP,
  peps.Title,
  Version,
  Author,
  Status,
  Type,
  Created,
  peps.Body,
  peps_fts.rank
from
  peps
join
  peps_fts on peps.rowid = peps_fts.rowid
where
  peps_fts match :search
order by
  peps_fts.rank
limit
  20

将数据库发布到 Vercel

首先，安装 Vercel 和 datasette-publish-vercel 插件。

https://vercel.com/docs/cli 提供了安装 Vercel CLI 的文档。

在 macOS 上

brew install vercel-cli

或者使用其中之一

npm i -g vercel

或者

pnpm i -g vercel

现在运行此命令登录

vercel login

安装插件

datasette install datasette-publish-vercel

并部署数据库

datasette publish vercel /tmp/peps.db --project python-peps

其他发布选项

Datasette 可以发布到以下提供商

Heroku (datasette publish heroku)
Google Cloud Run (datasette publish cloudrun)
Vercel (使用 datasette-publish-vercel)
Fly (使用 datasette-publish-fly)

更多部署选项在文档中有描述。

Datasette Lite

这是 Datasette... 完全在您的网络浏览器中使用 WebAssembly 和 Pyodide 运行！

https://lite.datasette.io/

加载 SQLite、CSV 和 JSON 数据

安装插件

添加 ?install=name-of-plugin 即可将该插件 pip install 到您的浏览器环境中！

这仅适用于部分插件。

https://lite.datasette.io/?install=datasette-copyable&json=https://gist.github.com/simonw/73d15c0dd1025d1196829740bacf4464

进一步阅读

进阶 SQL

聚合

最简单的聚合形式是 Datasette 用于实现其 Faceting 功能的形式。

select
  party,
  count(*)
from
  executive_terms
where
  type = 'prez'
group by
  party

在此处尝试该查询.

group by 创建行组，然后聚合函数计算整个组的值。

最常用的聚合函数是

count(*) - 计算每组中的行数
max(column) - 列的最大值
min(column) - 列的最小值
sum(column) - 列中值的总和

这里是来自RedPajama-Data-1T LLM 训练集包含什么内容的 sum() 和 count() 示例

select
  top_folders,
  sum(size_gb) as total_gb,
  count(*) as num_files
from raw
group by top_folders
order by sum(size_gb) desc

在 Datasette Lite 中运行.

将 total_gb 行更改为这样，以便将其四舍五入到最接近的整数

  cast (sum(size_gb) as integer) as total_gb,

子查询

SQLite 对子查询有出色的支持。您可以在 where X in 子句中使用它们

select html_url from releases where repo in (
  select id from repos where full_name in (
    select repo from plugin_repos
  )
)
order by created_at desc

在 datasette.io 上运行。有时我觉得它们比 joins 更易读！

您也可以直接在 select 子句中使用它们

select
  full_name,
  (
    select
      html_url
    from
      releases
    where
      releases.repo = repos.id
    order by
      created_at desc
    limit
      1
  ) as latest_release
from
  repos

在此处运行.

CTEs (公用表表达式)

CTE 是一个极其强大功能的糟糕命名。它代表 Common Table Expressions（公用表表达式）。可以将其视为在查询执行期间为临时表创建别名的方式。

with presidents as (
  select
    executives.name
  from
    executive_terms
    join executives
      on executive_terms.executive_id = executives.id
  where
    executive_terms.type = 'prez'
),
vice_presidents as (
  select
    executives.name
  from
    executive_terms
    join executives
      on executive_terms.executive_id = executives.id
  where
    executive_terms.type = 'viceprez'
)
select
  distinct name
from
  presidents
where name in vice_presidents

在此处尝试此 CTE 查询.

JSON

SQLite 内置了出色的 JSON 功能。将 JSON 存储在 text 列中，您可以使用 json_extract() 查询它 - 您也可以在 select 查询中构建 JSON 值。

使用 JSON 在单个 SQL 查询中返回相关行展示了一些您可以用此实现的进阶技巧。

select
  legislators.id,
  legislators.name,
  json_group_array(json_object(
    'type', legislator_terms.type,
    'state', legislator_terms.state,
    'start', legislator_terms.start,
    'end', legislator_terms.end,
    'party', legislator_terms.party
   )) as terms,
   count(*) as num_terms
from
  legislators join legislator_terms on legislator_terms.legislator_id = legislators.id
  group by legislators.id
order by
  id
limit
  10

运行该查询.

Paul Ford 关于 SQLite 的 JSON 支持曾说过

JSON 接口就像，“我们保存文本，当您检索时，我们以每秒数百兆字节的速度解析 JSON，并让您对其进行路径查询，请不要过度思考，这就像文件柜一样。”

窗口函数

我想运行一个查询，该查询将返回以下内容

仓库名称
该仓库最新发布的日期（releases 表与 repos 表是多对一关系）
总发布数量
最近三次发布（作为 JSON 对象数组）

with cte as (
  select
    repos.full_name,
    releases.created_at,
    releases.id as rel_id,
    releases.name as rel_name,
    releases.created_at as rel_created_at,
    rank() over (partition by repos.id order by releases.created_at desc) as rel_rank
  from repos
    left join releases on releases.repo = repos.id
)
select
  full_name,
  max(created_at) as max_created_at,
  count(rel_id) as releases_count,
  json_group_array(
    json_object(
      'id', rel_id,
      'name', rel_name,
      'created_at', rel_created_at
    )
  ) filter (where rel_id is not null and rel_rank <= 3) as recent_releases
from cte
group by full_name
order by releases_count desc

在此处运行该查询.

运行这个较小的查询有助于展示 rel_rank 列的作用

with cte as (
  select
    repos.full_name,
    releases.created_at,
    releases.id as rel_id,
    releases.name as rel_name,
    releases.created_at as rel_created_at,
    rank() over (partition by repos.id order by releases.created_at desc) as rel_rank
  from repos
    left join releases on releases.repo = repos.id
)
select * from cte

烘焙数据

烘焙数据架构模式描述了这种方法，这是充分利用 SQLite 和 Datasette 的关键。

我喜欢在 GitHub Actions 中构建我的数据库。

小众博物馆和 TILs

https://www.niche-museums.com/ 是从 https://github.com/simonw/museums 仓库发布的 - 内容使用一个大型 YAML 文件。
https://til.simonwillison.net/ 是从 https://github.com/simonw/til 发布的 - 每个条目使用单独的 Markdown 文件。

这两个网站都有 Atom feed，它们是使用 Datasette 预设查询，结合 datasette-atom 插件定义的。

我在使用 Observable notebook 半自动化 Substack 新闻通讯一文中写到了这一点

https://datasette.simonwillison.net/simonwillisonblog 是我 Django 博客的 Datasette/SQLite 副本，使用我的 https://github.com/simonw/simonwillisonblog-backup GitHub 仓库中的 db-to-sqlite 创建。
https://observablehq.com/@simonw/blog-to-newsletter 是我的 Observable notebook，它从该数据组装新闻通讯。
https://simonw.substack.com/ 是我将该内容复制到的 Substack 新闻通讯。

更多演示和进一步阅读

有趣的演示

我在我的博客上发布了很多 Datasette 项目。我最近的一些喜爱项目包括：

探索 MusicCaps，随 Google MusicLM 文本转音乐模型发布的评估数据展示了 Datasette 如何用于探索作为机器学习模型训练一部分使用的数据。
使用 Datasette 分析 ScotRail 音频公告——从原型到生产解释了 scotrail.datasette.io 项目。
我构建了一个 ChatGPT 插件来回答关于托管在 Datasette 中的数据的问题是我早期在 LLM AI 和 Datasette 交集上的探索之一。

SpatiaLite

使用 SpatiaLite 构建位置到时区 API 展示了如何使用 SpatiaLite 和 Datasette 创建一个 API，用于根据纬度/经度点查找时区。
在地图上绘制形状查询 SpatiaLite 数据库介绍了 datasette-leaflet-freedraw 插件并介绍了 calands.datasettes.com 演示。
使用内存 SQLite 数据库连接 CSV 和 JSON 数据

使用 SQLite 和 Python 进行数据分析

您需要什么

python3 和 pip

可选：GitHub Codespaces

sqlite-utils

使用命令行工具清理数据

使用 Datasette 探索数据

在本地安装 Datasette

尝试一个数据库：legislators.db

安装一些插件

使用 Datasette 学习 SQL

将 sqlite-utils 作为 Python 库使用，导入所有 PEP

启用全文搜索

将数据库发布到 Vercel

其他发布选项

Datasette Lite

加载 SQLite、CSV 和 JSON 数据

安装插件

进一步阅读

进阶 SQL

聚合

子查询

CTEs (公用表表达式)

JSON

窗口函数

烘焙数据

小众博物馆和 TILs

使用 Observable notebook 生成新闻通讯

更多演示和进一步阅读

有趣的演示

SpatiaLite

更多教程