Why would anyone crawl Wikipedia when you can freely download the complete databases in one go, likely served on a CDN…
But sure, crawlers, go ahead and spend a week doing the same thing in a much more expensive, disruptive and error-prone way…
There are valid reasons for not wanting the whole database, e.g. storage constraints, compatibility with ETL pipelines, and incorporating article updates.
What bothers me is that they – apparently – crawl instead of just… using the API, like:
https://en.wikipedia.org/w/api.php?action=parse&format=json&page=Lemmy_(social_network)&formatversion=2
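For illustration, a minimal sketch of hitting that endpoint with nothing but the Python standard library (the User-Agent string below is a made-up placeholder, not something any real crawler is known to send):

```python
import json
from urllib.request import Request, urlopen

# Fetch the parsed HTML of one article via the MediaWiki parse API
# instead of scraping the rendered page.
api_url = (
    "https://en.wikipedia.org/w/api.php"
    "?action=parse&format=json&formatversion=2"
    "&page=Lemmy_(social_network)"
)
req = Request(api_url, headers={"User-Agent": "example-fetcher/0.1 (ops@example.org)"})
with urlopen(req) as resp:
    data = json.load(resp)

print(data["parse"]["title"])
print(len(data["parse"]["text"]), "characters of parsed HTML")
```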
I’m guessing they just crawl the whole web and don’t bother to add a special case to turn Wikipedia URLs into their API versions.
If you’re training AI models, surely you have a couple TB to spare. It’s not like Wikipedia takes up petabytes or anything.
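That special case would be small, too. A hedged sketch (the URL pattern and function name here are my own assumptions, not anyone's actual crawler code):

```python
import re
from urllib.parse import quote


def to_api_url(url: str) -> str | None:
    """Rewrite a Wikipedia article URL into its parse-API equivalent."""
    match = re.match(r"https://([a-z\-]+\.wikipedia\.org)/wiki/([^?#]+)", url)
    if not match:
        return None  # not a Wikipedia article URL; crawl it the normal way
    host, title = match.groups()
    return (
        f"https://{host}/w/api.php?action=parse&format=json"
        f"&formatversion=2&page={quote(title, safe='')}"
    )


print(to_api_url("https://en.wikipedia.org/wiki/Lemmy_(social_network)"))
```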
Vibe coding
My comment was based on a podcast I listened to (Tech Won’t Save Us, I think?). My guess is they also wanna crawl all the edits, discussions, etc., which are usually not included in the complete dumps.
Dumps with complete page edit history can be downloaded too, as far as I can see, so no need to crawl that.
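For what it’s worth, a small Python sketch that lists those full-history files, assuming the usual index layout at dumps.wikimedia.org (the exact filenames change with every dump run):

```python
import re
from urllib.request import urlopen

# List the full-edit-history dump parts for English Wikipedia.
index_url = "https://dumps.wikimedia.org/enwiki/latest/"
with urlopen(index_url) as resp:
    index_html = resp.read().decode("utf-8", errors="replace")

# The complete history is split across many numbered pages-meta-history parts.
history_files = re.findall(
    r'href="(enwiki-latest-pages-meta-history[^"]+)"', index_html
)
for name in sorted(set(history_files)):
    print(index_url + name)
```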
Good podcast