• clb92@feddit.dk · 18 hours ago

    Why would anyone crawl Wikipedia when you can freely download the complete database dumps in one go, likely served on a CDN…

    But sure, crawlers, go ahead and spend a week doing the same thing in a much more expensive, disruptive and error-prone way…
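
    (For the curious, grabbing a dump really is a single download. A minimal sketch, assuming the usual enwiki-latest-pages-articles.xml.bz2 filename; check the index at https://dumps.wikimedia.org/enwiki/latest/ for the current file list:)

    ```python
    # Minimal sketch: stream the latest English Wikipedia articles dump to disk.
    # The exact filename is an assumption; verify it against the dump index.
    import requests

    DUMP_URL = (
        "https://dumps.wikimedia.org/enwiki/latest/"
        "enwiki-latest-pages-articles.xml.bz2"
    )

    def download_dump(url: str = DUMP_URL, dest: str = "enwiki-dump.xml.bz2") -> None:
        # Stream in 1 MiB chunks so the multi-GB file never has to fit in memory.
        with requests.get(url, stream=True, timeout=60) as resp:
            resp.raise_for_status()
            with open(dest, "wb") as fh:
                for chunk in resp.iter_content(chunk_size=1 << 20):
                    fh.write(chunk)

    if __name__ == "__main__":
        download_dump()
    ```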

    • Eager Eagle@lemmy.world · 17 hours ago

      There are valid reasons for not wanting the whole database, e.g. storage constraints, compatibility with ETL pipelines, and incorporating article updates.

      What bothers me is that they – apparently – crawl instead of just… using the API, like:

      https://en.wikipedia.org/w/api.php?action=parse&format=json&page=Lemmy_(social_network)&formatversion=2

      I’m guessing they just crawl the whole web and don’t bother to add a special case to turn Wikipedia URLs into their API versions.
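
      A minimal sketch of that special case, with a helper name of my own choosing (a real crawler would also want a descriptive User-Agent and rate limiting):

      ```python
      # Minimal sketch: rewrite a Wikipedia article URL into the equivalent
      # action=parse API request instead of fetching and scraping the HTML page.
      from urllib.parse import unquote, urlsplit

      import requests

      def wiki_url_to_api(article_url: str) -> tuple[str, dict]:
          """Map https://<lang>.wikipedia.org/wiki/<Title> to its api.php parse call."""
          parts = urlsplit(article_url)
          title = unquote(parts.path.removeprefix("/wiki/"))
          api = f"{parts.scheme}://{parts.netloc}/w/api.php"
          params = {
              "action": "parse",
              "format": "json",
              "formatversion": 2,
              "page": title,
          }
          return api, params

      if __name__ == "__main__":
          api, params = wiki_url_to_api("https://en.wikipedia.org/wiki/Lemmy_(social_network)")
          data = requests.get(api, params=params, timeout=30).json()
          print(data["parse"]["title"])      # article title
          print(len(data["parse"]["text"]))  # length of the parsed HTML
      ```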

      • clb92@feddit.dk · 2 hours ago

        “valid reasons for not wanting the whole database, e.g. storage constraints”

        If you’re training AI models, surely you have a couple TB to spare. It’s not like Wikipedia takes up petabytes or anything.

    • Pechente@feddit.org · 18 hours ago

      My comment was based on a podcast I listened to (Tech Won’t Save Us, I think?). My guess is they also want to crawl all the edits, discussion, etc., which are usually not included in the complete dumps.
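
      If edit history is what they’re after, the API exposes that too; a rough sketch, with the parameter choices being my own assumptions:

      ```python
      # Rough sketch: pull recent revision metadata for one article via the
      # MediaWiki query API (prop=revisions), no dump or HTML crawl needed.
      import requests

      API = "https://en.wikipedia.org/w/api.php"

      def recent_revisions(title: str, limit: int = 50) -> list[dict]:
          params = {
              "action": "query",
              "format": "json",
              "formatversion": 2,
              "prop": "revisions",
              "titles": title,
              "rvprop": "timestamp|user|comment",
              "rvlimit": limit,
          }
          data = requests.get(API, params=params, timeout=30).json()
          page = data["query"]["pages"][0]
          return page.get("revisions", [])

      if __name__ == "__main__":
          for rev in recent_revisions("Lemmy_(social_network)"):
              print(rev["timestamp"], rev.get("user"), rev.get("comment", ""))
      ```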