Arun Shah™@lemmy.world to Technology@lemmy.ml · 21 hours ago
Top 10 websites in the world (July 2025) (lemmy.ml)
Eager Eagle@lemmy.world · edited 19 hours ago

There are valid reasons for not wanting the whole database, e.g. storage constraints, compatibility with ETL pipelines, and incorporating article updates.

What bothers me is that they – apparently – crawl instead of just… using the API, like:

https://en.wikipedia.org/w/api.php?action=parse&format=json&page=Lemmy_(social_network)&formatversion=2

I’m guessing they just crawl the whole web and don’t bother to add a special case to turn Wikipedia URLs into their API versions.
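For illustration, a minimal Python sketch of what that special case could look like (it assumes the `requests` library; the helper name `to_api_request` and the User-Agent string are hypothetical): rewrite a /wiki/ article URL into the action=parse API call shown above and fetch the parsed page.

```python
from urllib.parse import urlsplit, unquote
import requests

def to_api_request(article_url: str):
    """Map a Wikipedia article URL to the equivalent MediaWiki parse-API call."""
    parts = urlsplit(article_url)
    # "/wiki/Lemmy_(social_network)" -> "Lemmy_(social_network)"
    title = unquote(parts.path.removeprefix("/wiki/"))
    api_endpoint = f"{parts.scheme}://{parts.netloc}/w/api.php"
    params = {
        "action": "parse",        # return the parsed page content
        "format": "json",
        "formatversion": 2,       # flat JSON: parse.text is a plain string
        "page": title,
    }
    return api_endpoint, params

api_endpoint, params = to_api_request(
    "https://en.wikipedia.org/wiki/Lemmy_(social_network)"
)
resp = requests.get(
    api_endpoint,
    params=params,
    headers={"User-Agent": "example-bot/0.1"},  # hypothetical UA string
    timeout=10,
)
data = resp.json()
print(data["parse"]["title"])  # parsed HTML is available under data["parse"]["text"]
```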
clb92@feddit.dk · 4 hours ago

> valid reasons for not wanting the whole database e.g. storage constraints

If you’re training AI models, surely you have a couple TB to spare. It’s not like Wikipedia takes up petabytes or anything.