• Eager Eagle@lemmy.world · edited 19 hours ago

    There are valid reasons for not wanting the whole database, e.g. storage constraints, compatibility with existing ETL pipelines, and incorporating article updates.

    What bothers me is that they, apparently, crawl instead of just… using the API, like:

    https://en.wikipedia.org/w/api.php?action=parse&format=json&page=Lemmy_(social_network)&formatversion=2

    I’m guessing they just crawl the whole web and don’t bother to add a special case to turn Wikipedia URLs into their API versions.
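
    A minimal sketch of what that special case could look like, in Python (the helper name and rewrite rule are just my illustration, not anything the crawlers are known to run):

    ```python
    from urllib.parse import urlsplit, urlencode, unquote

    def to_api_url(article_url: str) -> str:
        """Rewrite a Wikipedia article URL into the equivalent
        action=parse API call."""
        parts = urlsplit(article_url)
        # Article URLs look like https://en.wikipedia.org/wiki/Page_title
        title = unquote(parts.path.removeprefix("/wiki/"))
        query = urlencode({
            "action": "parse",
            "format": "json",
            "page": title,
            "formatversion": 2,
        })
        return f"{parts.scheme}://{parts.netloc}/w/api.php?{query}"

    print(to_api_url("https://en.wikipedia.org/wiki/Lemmy_(social_network)"))
    # https://en.wikipedia.org/w/api.php?action=parse&format=json&page=Lemmy_%28social_network%29&formatversion=2
    ```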

    • clb92@feddit.dk · 4 hours ago

      valid reasons for not wanting the whole database, e.g. storage constraints

      If you’re training AI models, surely you have a couple TB to spare. The full English Wikipedia text dump (current revisions, no media) is on the order of 20 GB compressed; it’s not like it takes up petabytes or anything.
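
      For scale, grabbing the current-revisions dump is a single download from dumps.wikimedia.org. A quick Python sketch (the URL follows the published enwiki “latest” naming pattern; worth verifying against the dump index before relying on it):

      ```python
      import requests

      # Current article revisions of English Wikipedia, no media, bzip2-compressed.
      # Assumed filename pattern from dumps.wikimedia.org; check the index for current files.
      DUMP_URL = ("https://dumps.wikimedia.org/enwiki/latest/"
                  "enwiki-latest-pages-articles-multistream.xml.bz2")

      with requests.get(DUMP_URL, stream=True, timeout=60) as resp:
          resp.raise_for_status()
          with open("enwiki-latest-pages-articles-multistream.xml.bz2", "wb") as out:
              for chunk in resp.iter_content(chunk_size=1 << 20):  # stream in 1 MiB pieces
                  out.write(chunk)
      ```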