Why would anyone crawl Wikipedia when you can freely download the complete databases in one go, likely served on a CDN…
But sure, crawlers, go ahead and spend a week doing the same thing in a much more expensive, disruptive and error-prone way…
There are valid reasons for not wanting the whole database, e.g. storage constraints, compatibility with ETL pipelines, and incorporating article updates.
What bothers me is that they – apparently – crawl instead of just… using the API, like:
https://en.wikipedia.org/w/api.php?action=parse&format=json&page=Lemmy_(social_network)&formatversion=2
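For illustration, a minimal sketch of hitting that endpoint with nothing but the Python standard library (the User-Agent string below is a made-up placeholder, not something any real crawler is known to send):

```python
import json
from urllib.request import Request, urlopen

# Fetch the parsed HTML of one article via the MediaWiki parse API
# instead of scraping the rendered page.
api_url = (
    "https://en.wikipedia.org/w/api.php"
    "?action=parse&format=json&formatversion=2"
    "&page=Lemmy_(social_network)"
)
req = Request(api_url, headers={"User-Agent": "example-fetcher/0.1 (ops@example.org)"})
with urlopen(req) as resp:
    data = json.load(resp)

print(data["parse"]["title"])
print(len(data["parse"]["text"]), "characters of parsed HTML")
```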
I’m guessing they just crawl the whole web and don’t bother to add a special case to turn Wikipedia URLs into their API versions.
If you’re training AI models, surely you have a couple TB to spare. It’s not like Wikipedia takes up petabytes or anything.
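That special case would be small, too. A hedged sketch (the URL pattern and function name here are my own assumptions, not anyone's actual crawler code):

```python
import re
from urllib.parse import quote


def to_api_url(url: str) -> str | None:
    """Rewrite a Wikipedia article URL into its parse-API equivalent."""
    match = re.match(r"https://([a-z\-]+\.wikipedia\.org)/wiki/([^?#]+)", url)
    if not match:
        return None  # not a Wikipedia article URL; crawl it the normal way
    host, title = match.groups()
    return (
        f"https://{host}/w/api.php?action=parse&format=json"
        f"&formatversion=2&page={quote(title, safe='')}"
    )


print(to_api_url("https://en.wikipedia.org/wiki/Lemmy_(social_network)"))
```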
Vibe coding
My comment was based on a podcast I listened to (Tech Won’t Save Us, I think?). My guess is they also wanna crawl all the edits, discussions, etc., which are usually not included in the complete dumps.
Dumps with complete page edit history can be downloaded too, as far as I can see, so no need to crawl that.
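For what it’s worth, a small Python sketch that lists those full-history files, assuming the usual index layout at dumps.wikimedia.org (the exact filenames change with every dump run):

```python
import re
from urllib.request import urlopen

# List the full-edit-history dump parts for English Wikipedia.
index_url = "https://dumps.wikimedia.org/enwiki/latest/"
with urlopen(index_url) as resp:
    index_html = resp.read().decode("utf-8", errors="replace")

# The complete history is split across many numbered pages-meta-history parts.
history_files = re.findall(
    r'href="(enwiki-latest-pages-meta-history[^"]+)"', index_html
)
for name in sorted(set(history_files)):
    print(index_url + name)
```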
Good podcast