Things I Learned - 18 May 2025

This week, I learned:

Birds navigate using quantum entanglement! Guardian ChatGPT
DeerFlow is an open source Deep Research MCP. Lets you run deep research outside of the standard chatbots.
⭐ Today, if I had to store a bunch of data files (e.g. parquet) under 1GB, I would use GitHub Releases. Here are options:
- GitHub Releases. 2 GiB per file, unlimited total & bandwidth. 🟢 Immortal URL, versioning, easy CI publish. 🔴 Each file must stay < 2 GiB; no built-in SQL.
- Zenodo (CERN). 50 GB per record; one-off bumps to 200 GB. 🟢 DOI assignment, archival mandate. 🔴 Occasional throttled bandwidth; no API for partial file reads.
- Hugging Face Hub. 300 GB per repo; 50 GB per file. 🟢 Git-based, dataset tooling, lively ML community. 🔴 Large files need git-LFS; pushes via LFS can be slow.
- Cloudflare R2. 10 GB storage & 1 M ops / month. 🟢 S3 API, zero-egress to Cloudflare Workers, fast. 🔴 Free tier is capped at 10 GB.
- Kaggle Datasets. 20 GB per dataset, public only. 🟢 Built-in notebooks & GPU. 🔴 No programmatic SQL API; quotas sometimes change.
- data.world (free). 1 GB total, 100 MB per dataset. 🟢 Nice social features. 🔴 Too small for most datasets.
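Part of the appeal of GitHub Releases is that every asset gets a stable, predictable download URL. Here is a minimal sketch of building one. The helper name release_asset_url and the example-org / example-data / trips.parquet values are made up for illustration; the URL layout is the standard releases/download/TAG/FILE pattern.

```python
from urllib.parse import quote


def release_asset_url(owner: str, repo: str, tag: str, filename: str) -> str:
    """Build the download URL for a file attached to a GitHub release.

    Hypothetical helper: it only formats the URL and does not check
    that the release or the asset exists.
    """
    return (
        f"https://github.com/{owner}/{repo}/releases/download/"
        f"{quote(tag, safe='')}/{quote(filename)}"
    )


# Placeholder names, not a real repository.
url = release_asset_url("example-org", "example-data", "v1.0", "trips.parquet")
print(url)
```

Any HTTP client can fetch that URL, so the files can be pulled from CI jobs or notebooks without extra credentials when the repository is public.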
If I had to query data stored in external Parquet or SQLite files, here are SQL engines-as-a-service:
- MotherDuck. 10 GB storage + 10 CU-hrs/mo compute. Native DuckDB; no credit card; GA June 2024; monthly feature drops.
- Datasette Cloud. Two-month trial (or 1-yr for non-profits). SQLite backend. Great UX; but not free forever for general use.
- AWS Athena. Pay-per-TB scanned; no free tier. Costs creep quickly, and free-tier S3 ends after 12 months.
Bootstrap has a .stretched-link that makes a link cover the containing block. A clever trick that I discovered when Claude 3.5 Sonnet wrote my code.
Discovered spray and peel paints at ArtFriend. I had no idea that was a thing.
Gemini Live API is the real-time equivalent from Gemini. It supports tools, search, and code execution.
mcp-mem0 is an MCP for memory
llm-min.txt compresses docs for LLMs to read optimally. Like a compressed llms.txt or context7. Usage: GEMINI_API_KEY=... uvx llm-min -i $DIR #ai-coding
There’s a lot of action on encrypted LLM operations.
- Responses API allows reasoning tokens to be encrypted if organizations don’t want their reasoning data to persist. Ref
- Tinfoil (YC X25) offers an OpenAI-compatible inference API where data is encrypted from the client to the NVIDIA Hopper/Blackwell GPUs in confidential computing mode. Prompts, model weights, outputs are encrypted in transit and memory, with verifiable privacy on code running in GPU.
- Modelyo (Israel) offers VMs/K8s clusters with encrypted GPUs across multiple cloud providers with continuous attestation, managed on Modelyo’s portal.
⭐ LLMs can work independently for longer and longer stretches. That’s a useful metric to track. METR: Measuring AI Ability to Complete Long Tasks.
If you’re looking for datasets / APIs related to research publications (especially funding), then explore:
- Crossref API and snapshots
- OpenAlex API and snapshots, built by OurResearch. OpenAlex is like Crossref but includes some disambiguation
- OpenAIRE Graph 2024 / 2025
- Europe PMC dataset
To stop Ubuntu 24 from suspending when the laptop lid closes, use one of these and restart:
- /etc/systemd/logind.conf: Set HandleLidSwitch=ignore
- /etc/UPower/UPower.conf: Set IgnoreLid=true
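If you would rather script the change than edit by hand, here is a rough sketch of the text edit involved. It works on the file contents as a string and does not touch /etc itself. The function name set_option is my own, and the sample text assumes the stock file ships the option commented out, which may differ on your system.

```python
import re


def set_option(conf_text: str, key: str, value: str) -> str:
    """Return conf_text with key=value set.

    Replaces an existing (possibly commented-out) line for the key,
    or appends the setting at the end when the key is absent.
    """
    pattern = re.compile(rf"^[ \t]*#?[ \t]*{re.escape(key)}[ \t]*=.*$", re.MULTILINE)
    line = f"{key}={value}"
    if pattern.search(conf_text):
        return pattern.sub(lambda _match: line, conf_text, count=1)
    return conf_text.rstrip("\n") + "\n" + line + "\n"


# Assumed sample of a logind.conf with the default lines commented out.
sample = "[Login]\n#HandleLidSwitch=suspend\n#HandleLidSwitchDocked=ignore\n"
updated = set_option(sample, "HandleLidSwitch", "ignore")
print(updated)
```

Writing the result back to /etc/systemd/logind.conf needs root, and the restart mentioned above is still required.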
UV_TORCH_BACKEND=auto uv pip install torch torchvision torchaudio installs the most appropriate PyTorch version. Ref
Cog is a Python-based templating tool. You embed chunks of Python inside comments in any file, and Cog writes the output of that code back into the file right after each chunk.
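To make the idea concrete, here is a toy re-implementation of the mechanism. This is not Cog itself (the real tool is the cogapp package). The [[[cog, ]]] and [[[end]]] markers and the cog.outl call follow Cog's conventions; ToyCog and expand are names I made up, and the toy skips details such as indentation handling.

```python
import re

# Toy stand-in for illustration only; the real tool is the `cogapp` package.
BLOCK = re.compile(
    r"(?P<head>\[\[\[cog[^\n]*\n(?P<code>.*?)\]\]\][^\n]*\n)"  # generator chunk
    r"(?P<old>.*?)"                                             # previous output
    r"(?P<tail>[^\n]*\[\[\[end\]\]\])",                         # end marker line
    re.DOTALL,
)


class ToyCog:
    """Collects output lines, mimicking the idea of cog.outl."""

    def __init__(self) -> None:
        self.lines: list[str] = []

    def outl(self, text: str = "") -> None:
        self.lines.append(text + "\n")


def expand(text: str) -> str:
    """Run each embedded chunk and replace the stale output that follows it."""

    def run(match: re.Match) -> str:
        cog = ToyCog()
        exec(match["code"], {"cog": cog})  # only run code you trust
        return match["head"] + "".join(cog.lines) + match["tail"]

    return BLOCK.sub(run, text)


page = (
    "<ul>\n"
    "<!-- [[[cog\n"
    "for n in (1, 2):\n"
    "    cog.outl(f'<li>Item {n}</li>')\n"
    "]]] -->\n"
    "<li>stale output</li>\n"
    "<!-- [[[end]]] -->\n"
    "</ul>\n"
)
print(expand(page))
```

Because the generator code stays in the file, re-running the tool regenerates the output in place, which is what makes this style handy for READMEs and docs.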
Cloudflare Zero Trust seems the easiest way to enable auth on static websites, especially if your DNS is already on Cloudflare. No cost.
We could “fine-tune” system prompts automatically with evals, creating a “system prompt learning” paradigm – like my promptevals. Andrej Karpathy
I was asked how to improve speed when building an enterprise ChatGPT clone using an API. Here’s what I’d suggest, in order:
- Streaming. High impact, low effort.
- Caching RAG retrieval as well as generation. High impact, low effort.
- UI tweaks. Loading / streaming icons and progress hints (“Retrieving context”, “Generating answer”, etc.)
- Parallelize, if possible
- Use model options where available, e.g. speculative decoding, models with higher speed, models with closer CDN, etc.
- Shorten prompts
- Persistent HTTP/2 Keep-Alive. Low impact, low effort (tweak server settings).
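As an illustration of the caching suggestion, here is a minimal sketch of a time-limited cache in front of a retrieval call. RetrievalCache, its parameters, and the fake retriever are all hypothetical; a real deployment would more likely use a shared store and would need to think about per-user access rights before reusing results.

```python
import time
from typing import Callable, Dict, List, Tuple


class RetrievalCache:
    """Caches retrieval results per normalized query for a limited time."""

    def __init__(
        self,
        retrieve: Callable[[str], List[str]],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._retrieve = retrieve
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, List[str]]] = {}

    def get(self, query: str) -> List[str]:
        key = " ".join(query.lower().split())  # collapse case and whitespace
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: skip the slow retrieval call
        chunks = self._retrieve(query)
        self._store[key] = (now + self._ttl, chunks)
        return chunks


calls: List[str] = []


def fake_retrieve(query: str) -> List[str]:
    """Stand-in for a slow vector search."""
    calls.append(query)
    return [f"chunk for {query}"]


cache = RetrievalCache(fake_retrieve, ttl_seconds=60)
cache.get("Leave policy")
cache.get("  leave   POLICY ")  # served from the cache
print(len(calls))
```

The same wrapper shape works for caching generated answers, though those usually need a shorter lifetime and a key that includes the retrieved context.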
Cloudflare Vectorize, at 768 dimensions / embedding, is free for ~6.5K chunks storage at ~1,000 queries / day. For a light load like 1M 768d chunks queried 1K times a day, the cost is: ChatGPT
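The ~6.5K figure is consistent with a free allowance counted in stored dimensions rather than in vectors. A quick back-of-the-envelope check, assuming the allowance is 5 million stored dimensions (my assumption, inferred from the number above; check the current pricing page):

```python
DIMENSIONS = 768
FREE_STORED_DIMENSIONS = 5_000_000  # assumption, not taken from the post


def chunks_that_fit(allowance: int, dimensions: int) -> int:
    """How many embeddings of a given width fit in a dimension allowance."""
    return allowance // dimensions


def stored_dimensions(chunks: int, dimensions: int) -> int:
    """Total stored dimensions for a number of embeddings."""
    return chunks * dimensions


free_chunks = chunks_that_fit(FREE_STORED_DIMENSIONS, DIMENSIONS)
light_load = stored_dimensions(1_000_000, DIMENSIONS)
print(free_chunks, light_load)
```

So 1M chunks at 768 dimensions means storing 768 million dimensions, roughly 150 times the assumed free allowance.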
NVIDIA parakeet is a lightweight speech to text model that leads benchmarks. Installing such packages continues to be a nightmare due to PyTorch (despite uv).
I explored the real-time avatar space. Heygen seems to be the easiest to use, but even that is complex and expensive ($99/mo). We may need to wait a few months for avatars to explode.
⭐ Model reliability is a huge enabler for performance. As models become more reliable, they can work autonomously for longer and that is another kind of scaling. Vending Bench
ChatGPT, Gemini, etc. have become lead generation engines. Chat Bot Optimization (CBO), is it? WhatsApp + ChatGPT
⭐ Never live delete data. Mark it for deletion and schedule a deletion task. That way you have time to react to mistakes. Simon Willison
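Here is a minimal sketch of that pattern with SQLite: deleting only sets a timestamp, and a separate scheduled task removes rows whose grace period has passed. The table, the function names, and the 7-day grace period are all made up for illustration.

```python
import sqlite3

GRACE_SECONDS = 7 * 24 * 3600  # hypothetical grace period: 7 days


def connect() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE files (id INTEGER PRIMARY KEY, name TEXT, deleted_at INTEGER)"
    )
    return db


def mark_deleted(db: sqlite3.Connection, file_id: int, now: int) -> None:
    """Soft delete: record when deletion was requested."""
    db.execute("UPDATE files SET deleted_at = ? WHERE id = ?", (now, file_id))


def restore(db: sqlite3.Connection, file_id: int) -> None:
    """Undo a mistaken delete while the grace period is still running."""
    db.execute("UPDATE files SET deleted_at = NULL WHERE id = ?", (file_id,))


def purge(db: sqlite3.Connection, now: int) -> int:
    """Scheduled task: hard-delete rows marked at least GRACE_SECONDS ago."""
    cursor = db.execute(
        "DELETE FROM files WHERE deleted_at IS NOT NULL AND deleted_at <= ?",
        (now - GRACE_SECONDS,),
    )
    return cursor.rowcount


db = connect()
db.execute("INSERT INTO files (id, name) VALUES (1, 'a.parquet'), (2, 'b.parquet')")
start = 1_000_000  # arbitrary epoch seconds for the demo
mark_deleted(db, 1, start)
mark_deleted(db, 2, start)
restore(db, 2)  # caught the mistake in time
removed = purge(db, start + GRACE_SECONDS)
print(removed)
```

Normal queries would also need a WHERE deleted_at IS NULL filter so that marked rows disappear from the application straight away.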
Pandoc has several options useful when converting Markdown to HTML (cat file.md | pandoc -f markdown -t html). My favorites:
- --no-highlight skips code-highlighting. --highlight-style=pygments adds Pygments styling
- --wrap=none turns off line wrapping in the output, so each paragraph stays on a single line
- --number-sections adds section numbering (<h2>1. Introduction</h2>)
- --shift-heading-level-by=NUM – shift all headings by NUM levels (e.g., start at <h2> instead of <h1>)
- pandoc -f markdown-auto_identifiers drops the auto-identifiers extension that generates id=... for each heading
- pandoc -f gfm uses GitHub flavored Markdown. Run pandoc --list-extensions=gfm to identify the extensions it uses.
- Pandoc’s Markdown extension examples are quite extensive.
- Auto-enabled GFM extensions:
  - alerts: GitHub-style callouts (info, tip, warning) via > [!TYPE] blocks.
  - autolink_bare_uris: Turns bare URLs into links, without needing <...>.
  - emoji: Parses :smile:-style codes into Unicode emoji characters.
  - footnotes: Enables footnote syntax with [^id] and definitions at the bottom.
  - gfm_auto_identifiers: Uses GitHub’s heading-ID algorithm: spaces → dashes, lowercase, removes punctuation.
  - pipe_tables: Enables tables written with | column separators.
  - raw_html: Raw HTML is unchanged.
  - strikeout: Enables strikethrough with ~~text~~.
  - task_lists: Parses - [ ] and - [x] items as checkboxes.
  - yaml_metadata_block: YAML front matter for document metadata, e.g. <title>
- GFM extensions worth enabling:
  - ascii_identifiers: Strips accents/non-Latin letters in automatically generated IDs.
  - bracketed_spans: [Warning]{.alert} becomes <span class="alert">
  - definition_lists: Term\n: Definition text becomes a definition list
  - fenced_divs: ::: {.note} block creates a <div class="note">...</div>
  - implicit_figures: Standalone images become <figure> with <figcaption>.
  - implicit_header_references: [Section] is treated as [Section](#section)
  - raw_attribute: `<b>bold</b>`{=html} is inserted as HTML
  - smart: Converts straight quotes to curly, -- to en-dash, --- to em-dash, ... to ellipsis.
  - subscript & superscript: E.g. H~2~O and E = mc^2^
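To tie these options together, here is a small sketch that assembles a Pandoc command from Python. The helper names pandoc_command and render and the particular choice of extensions are mine; render needs pandoc on your PATH, so the example only prints the command.

```python
import subprocess
from typing import List, Sequence

# A few of the opt-in extensions listed above; adjust to taste.
EXTRA_EXTENSIONS = ("smart", "fenced_divs", "bracketed_spans", "definition_lists")


def pandoc_command(extensions: Sequence[str] = EXTRA_EXTENSIONS, shift: int = 1) -> List[str]:
    """Build a pandoc argument list for GitHub-flavored Markdown to HTML."""
    source_format = "gfm" + "".join(f"+{name}" for name in extensions)
    return [
        "pandoc",
        "-f", source_format,
        "-t", "html",
        "--wrap=none",
        f"--shift-heading-level-by={shift}",
    ]


def render(markdown: str) -> str:
    """Convert Markdown to HTML. Requires pandoc to be installed."""
    result = subprocess.run(
        pandoc_command(), input=markdown, capture_output=True, text=True, check=True
    )
    return result.stdout


command = pandoc_command()
print(" ".join(command))
```

Extensions are switched on with +name and off with -name after the format, which is the same mechanism as the markdown-auto_identifiers example above.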
