This week, I learned:
- Lessons from discussions at IIT Madras:
- Even in recorded video tutorials, asking students a question and pausing to give them time to think can be effective.
- When you put students in front of real clients, engagement increases dramatically.
- Most teaching assistants would like to help diligent students among the bottom half (more than the top decile of students).
- However, there is a fraction of poor performers who do not care, and are best ignored. Their engagement and effort are a good measure of their interest.
- Defining a minimal set of principles that we want to teach helps us measure if we’ve helped the bottom half at least meet those objectives.
- Teaching is hard. Even after explanations, students, even ENGAGED students, tend to make basic mistakes.
- ChatGPT does a good job of spotting errors in architectural and structural diagrams. In fact, spotting errors in large diagrams is a theme with potential use cases. Source: Dan Becker.
- R1 seems good at text-to-CAD. Even better than Sonnet. Source: Dan Becker.
- OpenAI advises a few different prompting techniques for reasoning models. Source: OpenAI:
- Avoid examples unless zero-shot prompting fails.
- Avoid chain-of-thought. These models do that internally anyway.
- Short, direct prompts are better than detailed prompts.
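The advice above can be sketched as a "zero-shot first, examples only on failure" helper. This is a minimal illustration, not OpenAI's API: `build_prompt`, `ask`, `fake_model`, and the validity check are all hypothetical names, and `fake_model` stands in for a real model call.

```python
from typing import Callable, Optional


def build_prompt(task: str, examples: Optional[list[tuple[str, str]]] = None) -> str:
    """Return a short, direct prompt; examples are appended only when given."""
    parts = [task.strip()]
    for example_input, example_output in examples or []:
        parts.append(f"Example input: {example_input}\nExample output: {example_output}")
    return "\n\n".join(parts)


def ask(
    task: str,
    call_model: Callable[[str], str],
    is_valid: Callable[[str], bool],
    examples: list[tuple[str, str]],
) -> str:
    # First attempt: zero-shot, with no "think step by step" instruction.
    reply = call_model(build_prompt(task))
    if is_valid(reply):
        return reply
    # Fallback: add examples only because the zero-shot attempt failed.
    return call_model(build_prompt(task, examples))


def fake_model(prompt: str) -> str:
    # Hypothetical stand-in for a real API call: it only answers in the
    # expected format when the prompt contains an example.
    return "positive" if "Example input" in prompt else "I think it is good?"


TASK = "Classify the sentiment of 'Great talk!' as positive or negative."
answer = ask(
    TASK,
    fake_model,
    is_valid=lambda reply: reply in {"positive", "negative"},
    examples=[("Awful demo", "negative")],
)
print(answer)
```

In real use, `call_model` would wrap whichever client library you use, and `is_valid` would encode whatever "zero-shot failed" means for your task.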
- GitHub Models is free for anyone to try. The model catalog is extensive and even includes o3-mini, which was launched this week (though in limited preview).
- The data catalog space is led by proprietary solutions:
- Alation Data Catalog: Market leader; growing steadily in enterprise use
- Collibra Data Catalog: Widely adopted with steady growth
- AWS Glue Data Catalog: Growing rapidly as AWS expands its data services
- Informatica Enterprise Data Catalog: Long established and stable, though facing newer alternatives
- Microsoft Purview Unified Catalog: Experiencing fast growth driven by cloud momentum
- Atlan Data Catalog: Relatively new but gaining fast traction among tech-forward organizations
- OpusClip automatically creates short clips from long videos. I ran it on Programming Minecraft with WebSockets in Python to get this short 30-second clip. 30 minutes. 100% automated.
- Alternatives to Postman:
- Hoppscotch – A web‑based/desktop API client supporting REST, GraphQL, and WebSockets. It’s lightweight, open-source, and self‑hostable.
- HTTPie – A web-based API client along with a friendly command-line tool for API interaction.
- Insomnia (or its fork Insomnium) – A popular cross‑platform API client with a minimal interface and plugin ecosystem.
- Bruno – A desktop open-source API client that stores collections as files (ideal for Git versioning).
- Milkman – A desktop open‑source workbench for managing API requests.
- Here is the summary of DuckCon #6 on 31 Jan 2025 in Amsterdam. I copied the transcript from YouTubeTranscript and passed it through Gemini 2.0 Flash Exp with the system prompt: “Summarize this transcript from the DuckDB conference without missing any points. Cover every point mentioned. A lot of spelling errors that sound like DuckDB are likely to be DuckDB”.
- Introduction & Welcome:
- DuckCon #6: This is the 6th DuckDB conference, held in their hometown. The first DuckCon was online due to the pandemic.
- Live Streaming: This is the first time DuckCon is being live-streamed, chosen to accommodate global time zones (especially China and the US).
- Global Reach: The live stream is intended to reach users in areas where in-person DuckCons are unlikely.
- Q&A: Slido (qa.duckdb.org) will be used for Q&A, with upvoting to prioritize questions.
- Sponsors: Thanks to gold sponsor monday.com and silver sponsors Rill and Crunchy Data.
- DuckCon Purpose: DuckCon is a place for users to connect, share experiences, and provide feedback to the DuckDB team.
- Inspiration: The team is inspired by the community’s use of DuckDB and how far the project has come.
- Mission Statement: DuckDB aims to make large datasets less intimidating and more accessible, moving away from fear of data to confidence in handling it.
- Motivation: The project was born from seeing people struggle with data that didn’t fit in Excel and the lack of user-friendly tools.
- Industry Trends: Single-node processing capabilities have grown faster than the size of useful datasets.
- Data Singularity: A prediction that most data analysis queries can run on a single node is now a reality.
- Real-World Data Sizes: Analysis of Snowflake and Redshift data shows that 99.9% of datasets are under 300GB.
- Raspberry Pi Benchmark: The industry-standard TPC-H benchmark (scale factor 300, ~300GB) can run on a Raspberry Pi using DuckDB.
- Single Node Growth: Single-node processing power is rapidly increasing, allowing for larger datasets to be handled.
- Adoption Numbers:
- 32 Million Extension Installs: 32 million DuckDB extension installs in the last month.
- 1.8 Million Unique Website Visitors: 1.8 million unique visitors per month to the DuckDB website.
- Bluesky Community: Growing community on Bluesky, with the hashtag #dataBS.
- Technical Updates (Mark):
- Extension Ecosystem: Focus on enabling the community to build and share extensions.
- Community Extensions: Making it easier to create and use community-built extensions.
- DuckDB v1.2 (Harlequin Duck): Releasing next week, named after the Harlequin duck.
- CSV Reader Improvements: Significant improvements to the CSV reader.
- Friendlier SQL: Improvements to the SQL experience.
- CLI Autocomplete: Reworked and improved CLI autocomplete.
- Performance Optimizations: Many queries are now faster due to performance work.
- C API for Extensions: Introducing a C API to make building extensions easier.
- Logging Features: Improved logging for production use.
- Lakehouse Focus: The main focus for the year is on lakehouse formats and related features.
- Q&A (Mark & Hannes):
- Doubling Team: If the team doubled, they would focus on client integrations and other projects, not a major architectural change.
- Partitioning: Near-term plans to add support for partitioning, related to lakehouse formats.
- DuckDB WASM: The WASM ecosystem is evolving, with exciting possibilities for in-browser use.
- Financial/Pharmaceutical Industries: DuckDB could replace some SAS workflows due to its cost-effectiveness and capabilities.
- Lakehouse & MotherDuck: Lakehouse work is separate from MotherDuck, though MotherDuck will likely support lakehouse features.
- Contributing to Extensions: Plans to make it easier to contribute to extensions, including support for Rust and Go.
- Airport Extension (Rusty):
- Analogy: The airport extension allows DuckDB to “fly” to remote servers using Apache Arrow Flight.
- Functionality: Supports select, insert, update, and delete operations on remote data sources.
- Motivation: To reduce the burden of writing extensions and enable faster development using existing code.
- Arrow Flight: Uses Arrow Flight for communication, enabling connections to various data sources.
- Demo 1: Delta Lake:
- Attaches to a flight server for Delta Lake access.
- Allows creating schemas, tables, and performing standard SQL operations.
- Uses Python and delta-rs (Rust implementation of Delta Lake).
- Supports predicate pushdown and C integration with the DuckDB catalog.
- Demo 2: AutoGluon:
- Integrates the AutoGluon AutoML package.
- Predicts Hacker News post votes using a trained model.
- Demonstrates table-returning functions for model fitting and prediction.
- No C++ code required, just Python.
- Demo 3: Geocoding:
- Uses a geocoder service to convert addresses to coordinates and vice versa.
- Demonstrates scalar UDFs for vectorized requests.
- Uses a Python example for a simple uppercase function.
- Features:
- List flights, take flights.
- Catalog integration.
- Select, update, delete.
- Scalar UDFs.
- Table in/out functions.
- Authentication for row/column filtering.
- Availability: Requires DuckDB 1.2, MIT licensed, available on GitHub.
- Q&A (Rusty):
- Most Proud Extension: Airport is the most fun, but the AWS API wrapper also brings joy.
- Extension Resources: The GitHub DuckDB extension template and reading others’ source code are helpful.
- Airport & Other Extensions: Airport is separate and can be used alongside other extensions like spatial or httpfs.
- Graph Support: Graph database support is planned, with examples like Kuzu, Neptune, and Neo4j.
- Licensing: Airport is MIT licensed, compatible with Apache license.
- Scaling Out: Airport can be used to query multiple DuckDB instances on different machines.
- Ibis & Geospatial (Nati):
- Nati Clementi: Senior software engineer at Nvidia, working on open-source projects like Ibis.
- Ibis: Open-source Python library for data wrangling, with a DataFrame API and interfaces to 15+ engines, including DuckDB.
- DuckDB for Geospatial: DuckDB is fast, has a geospatial extension, and supports various geospatial formats.
- GeoParquet: Becoming a standard for geospatial data, enabling cloud data warehouse interoperability and compression.
- GeoArrow: A way of representing geospatial vector data in memory for faster processing.
- Ibis Benefits: Allows writing Python instead of SQL, with deferred execution determined by the engine.
- Demo:
- Uses Overture Maps data in GeoParquet format.
- Filters data using bounding boxes.
- Demonstrates geospatial operations like ST_Distance and ST_Transform.
- Plots data using Lonboard.
- Shows how to find points of interest near a location (e.g., the Van Gogh Museum).
- Ibis & DuckDB: Ibis uses DuckDB for the parquet reader and lets DuckDB do the heavy lifting.
- Ibis Optimizations: Ibis does type checking but doesn’t do query optimization, leaving that to the engine.
- Ibis in Browser: Ibis works in the browser through DuckDB WASM.
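The "bounding box, then distance" pattern from the demo can be sketched in plain Python. This is not the Ibis or DuckDB spatial API: it swaps in a haversine great-circle distance for `ST_Distance`, and the place names and coordinates below are made up for illustration (the center is only roughly where the Van Gogh Museum is).

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def in_bbox(lat: float, lon: float, bbox: tuple[float, float, float, float]) -> bool:
    """bbox is (min_lat, min_lon, max_lat, max_lon)."""
    min_lat, min_lon, max_lat, max_lon = bbox
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def nearby(places, center, radius_km, bbox):
    """Cheap bounding-box filter first, exact distance check second."""
    lat0, lon0 = center
    hits = []
    for name, lat, lon in places:
        if not in_bbox(lat, lon, bbox):
            continue
        distance = haversine_km(lat0, lon0, lat, lon)
        if distance <= radius_km:
            hits.append((name, distance))
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical data: names and coordinates are illustrative only.
CENTER = (52.358, 4.881)
BBOX = (52.34, 4.86, 52.38, 4.91)
PLACES = [
    ("cafe_a", 52.359, 4.882),   # inside the box and close to the center
    ("park_b", 52.370, 4.900),   # inside the box but too far away
    ("far_c", 52.500, 5.200),    # outside the box
]

for name, distance in nearby(PLACES, CENTER, radius_km=0.5, bbox=BBOX):
    print(f"{name}: {distance:.2f} km")
```

The bounding box is the cheap filter that a columnar engine can push down to the file reader; the exact distance is computed only for the rows that survive it.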
- Q&A (Nati):
- Linear Interpolation: Ibis ML module can help with regression-related tasks.
- Missing Features: No major features are missing in the DuckDB/Ibis geospatial setup, with minimal overhead.
- Parquet Reader: Ibis uses DuckDB’s parquet reader.
- Query Optimization: Ibis does not optimize SQL queries, leaving that to DuckDB.
- Ibis in Browser: Ibis works in the browser through DuckDB WASM.
- Rill & Metrics Layer (Mike):
- Rill: A BI tool optimized for DuckDB, with instant slicing and dicing, BI as code, and a metrics-first philosophy.
- Metrics-First: Design metrics models, and Rill autogenerates dashboards and user experiences.
- Live Demo:
- Downloaded Rill using a curl command.
- Created a new project called “DuckCon 6”.
- Imported a parquet file of GitHub commit data.
- Used AI to generate a metrics model and dashboard.
- Showed the dashboard with trends and filtering.
- Metrics as Building Blocks: Metrics are flexible, fast, and intuitive.
- SQL for Metrics: Metrics should be defined in SQL, not other languages.
- Visual Metrics Editor: Rill has a visual editor for defining metrics using DuckDB SQL.
- Metric Stack:
- Legacy: Data warehouses, traditional BI tools, inconsistent metrics, full table scans.
- DuckDB Powered: Consistent metrics, fast OLAP queries, SQL everywhere.
- Challenges: Data modeling is hard, metric changes can be expensive, single-node scale has limits.
- AI & Metrics: AI can assist in metrics modeling, optimization, and conversational data exploration.
- Q&A (Mike):
- Complex Metrics: Rill works well with complex metrics involving multiple sources and transformations by joining tables in DuckDB.
- 60 FPS Dashboards: Users can feel the difference with faster dashboards.
- Defining Metrics: Metrics are defined in the Rill UI using SQL expressions.
- Replacing ChatGPT: Considering locally run self-hosted models for privacy.
- Stock Data Analysis (Ryan):
- Two Takeaways: Simple finance data flows with trade data and a tool called qStudio.
- Ryan Hamilton: 14 years building large data platforms in banks.
- Bank Data: Data from exchanges, market data providers, and internal systems.
- Use Cases: Backtesting, data analysis, and report generation.
- qStudio: A Java desktop application that connects to 30 databases, including DuckDB.
- Demo:
- Loaded a 6GB CSV file of trade data into DuckDB.
- Showed basic queries, pivoting, and Candlestick charts.
- Demonstrated time-based aggregation and moving averages.
- Showed a basic trading strategy using window functions.
- DuckDB Benefits: Fast, easy to use, great for time-based analysis.
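Time-based aggregation and a moving average can be sketched with standard SQL window functions. To keep the example runnable with only the Python standard library, it uses `sqlite3` (which needs SQLite 3.25 or newer for window functions) as a stand-in for DuckDB; the window syntax is standard SQL. The table name, columns, and trade rows are invented for illustration and are not from the talk.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trades (ts TEXT, symbol TEXT, price REAL, qty INTEGER)")
con.executemany(
    "INSERT INTO trades VALUES (?, ?, ?, ?)",
    [
        # Made-up sample trades.
        ("2025-01-31 09:00:10", "ABC", 10.0, 100),
        ("2025-01-31 09:00:40", "ABC", 12.0, 100),
        ("2025-01-31 09:01:05", "ABC", 14.0, 200),
        ("2025-01-31 09:02:30", "ABC", 16.0, 100),
    ],
)

# Time-based aggregation: one bar per minute with a volume-weighted price.
MINUTE_BARS = """
SELECT substr(ts, 1, 16) AS minute,
       SUM(price * qty) / SUM(qty) AS vwap,
       SUM(qty) AS volume
FROM trades
GROUP BY minute
"""

# Moving average over the current and previous bar, via a window function.
MOVING_AVERAGE = f"""
WITH bars AS ({MINUTE_BARS})
SELECT minute,
       vwap,
       AVG(vwap) OVER (ORDER BY minute ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS ma2
FROM bars
ORDER BY minute
"""

rows = con.execute(MOVING_AVERAGE).fetchall()
for minute, vwap, ma2 in rows:
    print(minute, vwap, ma2)
```

A simple trading signal could then compare `vwap` against `ma2` on each row, which is one more expression over the same window.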
- Q&A (Ryan):
- KDB+ vs. DuckDB: KDB+ is for large data, DuckDB is more approachable with strong Python integration.
- XML Files: Offloading processing to DuckDB, not planning XML integration.
- Lightning Talks:
- Zuk (Jared):
- Search engine research using DuckDB.
- Python-based experiments with SQL.
- Removing document lengths for faster search engines.
- DuckPGQ (Daniel):
- Graph analytics in DuckDB using SQL property graph queries (SQL/PGQ).
- Visual graph syntax for pattern matching and path finding.
- Outperforms Neo4j on analytical queries.
- Yat (Kristoff):
- Smallest DuckDB SQL orchestrator.
- Runs SQL queries in a folder in the correct order.
- Generates a Mermaid diagram for lineage.
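The idea of "run SQL files in the correct order and draw the lineage" can be sketched as follows. This is not the tool's actual implementation: it assumes each query is named after the table it creates, detects dependencies with a naive `FROM`/`JOIN` regex, and uses an in-memory dict in place of a folder of `.sql` files. All names below are hypothetical.

```python
import re
from graphlib import TopologicalSorter

# Naive dependency detection: any identifier that follows FROM or JOIN.
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)", re.IGNORECASE)


def find_dependencies(models: dict[str, str]) -> dict[str, set[str]]:
    """Map each model to the other models it reads from."""
    deps = {}
    for name, sql in models.items():
        referenced = set(TABLE_REF.findall(sql))
        # Tables that are not models (e.g. raw sources) are ignored.
        deps[name] = {ref for ref in referenced if ref in models and ref != name}
    return deps


def run_order(deps: dict[str, set[str]]) -> list[str]:
    """Return an order in which every model comes after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


def to_mermaid(deps: dict[str, set[str]]) -> str:
    """Render the lineage as a Mermaid flowchart, one edge per dependency."""
    lines = ["graph TD"]
    for name in sorted(deps):
        for dep in sorted(deps[name]):
            lines.append(f"    {dep} --> {name}")
    return "\n".join(lines)


# Hypothetical models; in a real folder each entry would be one .sql file.
MODELS = {
    "report": "SELECT * FROM daily JOIN customers USING (customer_id)",
    "daily": "SELECT customer_id, SUM(amount) AS total FROM raw_orders GROUP BY customer_id",
    "customers": "SELECT DISTINCT customer_id FROM raw_orders",
}

DEPS = find_dependencies(MODELS)
print(run_order(DEPS))
print(to_mermaid(DEPS))
```

A regex will miss CTEs, subqueries with aliases, and quoted identifiers; a real orchestrator would more likely use a SQL parser for this step.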
- Grafana & DuckDB (Sam):
- Lessons learned from using DuckDB in Grafana.
- Security incident due to shell commands and file access.
- Importance of reading the documentation.
- Cloud Slur (Adam):
- Syncing query engine for bank transaction data.
- Uses LLM to convert human language to SQL.
- Uses DuckDB in the browser, Node.js, and Python.
- Healthcare Data (Tony):
- Data engineering use cases in healthcare.
- Dynamic data masking system using DuckDB and Snowflake.
- Data integration pipeline using DuckDB and Arrow streams.
- Closing Remarks:
- Michael Simons: Author of the DuckDB in Action book, will be signing books.
- Poster Session: A poster session will follow the talks.
- Sponsors: Thanks again to the sponsors.
- Social Event: The conference will now move to the social event.
- Ibis is a Python library that works with multiple dataframe backends like DuckDB, Polars, and Pandas.
- With just 3 annotators and 50-100 samples, you can figure out if an LLM can replace human annotators systematically. Sources: Arxiv, ChatGPT explanation.
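To get a feel for this kind of check, here is a deliberately simplified leave-one-out comparison. It is not the procedure from the linked paper, which may use proper statistical testing. It only asks, for each human annotator, whether the LLM agrees with the remaining humans at least as well as that annotator does. The labels below are toy data, and the function names are my own.

```python
def agreement(labels_a: list[str], labels_b: list[str]) -> float:
    """Fraction of samples on which two label lists match."""
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def leave_one_out_win_rate(human_labels: dict[str, list[str]], llm_labels: list[str]) -> float:
    """Share of annotators the LLM could stand in for.

    For each held-out annotator, compare the LLM's average agreement with the
    other annotators against the held-out annotator's own average agreement.
    """
    wins = 0
    for held_out, held_out_labels in human_labels.items():
        others = [labels for name, labels in human_labels.items() if name != held_out]
        llm_score = sum(agreement(llm_labels, other) for other in others) / len(others)
        human_score = sum(agreement(held_out_labels, other) for other in others) / len(others)
        if llm_score >= human_score:
            wins += 1
    return wins / len(human_labels)


# Toy data: 3 annotators, 6 samples (a real check would use 50-100 samples).
HUMANS = {
    "annotator_1": ["y", "y", "n", "n", "y", "n"],
    "annotator_2": ["y", "y", "n", "y", "y", "n"],
    "annotator_3": ["y", "n", "n", "n", "y", "n"],
}
LLM = ["y", "y", "n", "n", "y", "n"]

print(leave_one_out_win_rate(HUMANS, LLM))
```

A win rate near 1 suggests the LLM is about as consistent with the group as any single human is; with so few samples, treat the number as a rough signal rather than evidence.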
- Curiosity and agency may be the differentiator in a world of LLMs (not experience, knowledge, or ability), since LLMs will democratize expertise. Source: Jack Clark.
- “AI/human combined work can be copyrighted as long as a human is adding, changing or selecting elements. Prompts alone do not usually produce copyrighted work.” - Copyright and Artificial Intelligence, Jan 2025, US Copyright Office via Ethan Mollick
- Human Authorship is Essential: Works created solely by AI are not copyrightable.
- AI can be used as a Tool: Using AI as a tool does not negate copyright protection, as long as the final work reflects sufficient human creativity.
- Prompts Alone are Insufficient: Simply providing prompts to an AI system, even detailed ones, is generally not enough to establish authorship. Prompts are considered instructions or ideas, which are not copyrightable.
- Expressive Inputs: When a human author provides their own expressive content (like a drawing, photo, or text) as input to an AI system, and that content is perceptible in the output, the human author can claim copyright in that portion of the output.
- Modifying and Arranging AI-Generated Content: Humans can claim copyright in the creative selection, coordination, and arrangement of AI-generated material, as well as in creative modifications to AI-generated outputs.
- No Need for New Legislation: The report concludes that existing copyright law is adequate to address the copyrightability of AI-generated works, and no new legislation is needed at this time.
- Case-by-Case Analysis: Copyrightability will be determined on a case-by-case basis, considering the specific facts of each work and the extent of human contribution.