{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreicb7ilgznzzsjr3gpvt7v7kyz4zzvlvttgiwe4is3faie3ahmtxri",
    "uri": "at://did:plc:pi6woz4d47bkuws673w2il2r/app.bsky.feed.post/3mhn4prqrb4n2"
  },
  "path": "/t/ann-dataframe-1-0-0-0/13834#post_1",
  "publishedAt": "2026-03-22T06:39:18.000Z",
  "site": "https://discourse.haskell.org",
  "tags": [
    "Find that here",
    "one billion row challenge",
    "here",
    "@daikonradish",
    "@Housing"
  ],
  "textContent": "It’s been roughly two years of work on this and I think things are in a good enough state that it’s worth calling this v1.\n\n## Features\n\n### Typed dataframes\n\nWe got there eventually and I think we got there in a way that still looks nice. There is now a `DataFrame.Typed` API that tracks the entire schema of the dataframe - column names, misapplied operations etc are now compile time failures and you can easily move between exploratory and pipeline work. This is in large part thanks to maxigit and mcoady (Github user names) for their feedback.\n\n\n    $(DT.deriveSchemaFromCsvFile \"Housing\" \"./data/housing.csv\")\n\n    main :: IO ()\n    main = do\n        df <- D.readCsv \"./data/housing.csv\"\n        let df' = either (error . show) id (DT.freezeWithError @Housing df)\n        let df'' =\n                df'\n                    & DT.derive @\"rooms_per_household\" (DT.col @\"total_rooms\" / DT.col @\"households\")\n                    & DT.impute @\"total_bedrooms\" 0\n                    & DT.derive @\"bedrooms_per_household\"\n                        (DT.col @\"total_bedrooms\" / DT.col @\"households\")\n                    & DT.derive @\"population_per_household\"\n                        (DT.col @\"population\" / DT.col @\"households\")\n\n        print df''\n\n\n### Calling dataframe from Python\n\nThere’s an implementation of Apache Arrow’s C Data interface along with an example of how to pass dataframes between polars and haskell.\n\nFind that here\n\n### Getting data from hugging face\n\nYou can explore huggingface datasets. Example:\n\n\n    df <- D.readParquet \"hf://datasets/Rafmiggonpaz/spain_and_japan_economic_data/data/train-00000-of-00001.parquet\"\n\n\n### Larger than memory files\n\nThe Lazy/query-engine-like implementation is now pretty fast. 
It can compute the one billion row challenge in about 10 minutes on a Mac and about 30 minutes on a 12-year-old Dell (without running out of memory).\n\nYou’ll have to generate the data yourself, but the code is here.\n\n### Better ergonomics with numeric promotion and null awareness\n\nThere are now more lenient operators that make happy-path computation much easier. E.g.:\n\n\n    D.derive \"bmi\" (F.lift2 (\\m h -> (/) <$> m <*> fmap ((^2) . (/100). realToFrac) h) mass height) df\n\n\ncan now be written as:\n\n\n    D.derive \"bmi\" (mass ./ (height ./ 100) .^ 2) df\n\n\n## What’s next?\n\nConnectors! BigQuery, Snowflake, S3 buckets, etc. Formats! Parquet, Iceberg, DuckDB, a custom dataframe format with full data provenance. Moving from small in-memory demos to querying large data lakes is the goal.\n\nAlso, since a new era is upon us, some integration with AI agents to do type-guided data exploration.\n\nA big thank you to everyone who has taken time to try the library and/or give advice, especially @daikonradish, who was the main voice in the direction of the architecture. A lot of the most important design decisions were made on community threads. Thank you all.",
  "title": "[ANN] dataframe 1.0.0.0"
}