Forum:Downloading RW/Backup

There are issues with our current off site backup procedure, and we may need to make some changes. This is a good opportunity then to explore an idea I have been thinking about. I am contemplating setting up RW content dumps similar to what WP does. Likely the way it would work is one file contains all the content of RW but without the extensive histories (much smaller file), and another file would contain the content+histories. This would allow any user to download a "backup" of RW if they wish, or fork content or do whatever. These dumps would not be an exact backup of the site as things like user names, passwords, and certain metadata would be missing. However, it would allow for someone to have a good start towards "restarting" RW if they want to or need to.

However, I don't want to host these files on the RW servers as it would just eat up our bandwidth. I would therefore like to set up an account on a cloud computing platform that would let me host the files there. Add another 20 GB or so to the account and we have a solution to our off site backup issues as well.

I estimate the cost at about $150 a year. For that money we would have off site backup, and the ability to distribute content dumps to RW users who want it.

Thoughts? tmtoulouse 21:30, 6 April 2010 (UTC)
 * Sounds good. I'm not so much worried about history, more the stuff we write. I trust the way things are run, but even still it'd be a shame to risk losing all the stuff we have here if something unexpected happened. It'll be the end of the month though before I can chip some cash. Any idea how big the dump would be without the metadata? -- 21:47, 6 April 2010 (UTC)
 * With history and all that stuff it's above 10 gigs, approaching 15 now I think. Without history it's substantially smaller, 150 megs. -- Nx  / talk 21:52, 6 April 2010 (UTC)
 * Cheers. 150MB is doable, even on my connection. -- 22:02, 6 April 2010 (UTC)
 * Could the 150MB one be uploaded to Megaupload or some other free filesharing site? Anyone interested could d/l a copy easily.
 * Thinking about it, send the 15gigs to the FBI directly, When rob's knobs summon it, it will already be there Alain (talk) 22:22, 6 April 2010 (UTC)
 * As I said, I will be providing the dumps on a file hosting site. However, to support both dumps a paid service will be needed, and with a bit more money that paid service could double as an off site backup for the whole wiki as well as the dump, which would solve our current problem with the off site backup. That's the argument: the cost isn't substantial ($150 for a year) but it is more than I can afford. Ultimately if this seems like a good idea I will be putting out a call for donations. tmtoulouse 22:25, 6 April 2010 (UTC)
 * What is the current issue with off site backup? -- Nx  / talk 22:28, 6 April 2010 (UTC)
 * I will e-mail you. tmtoulouse 22:32, 6 April 2010 (UTC)
 * Sorry for misreading/skimming your call-out. Suggestions from an I-only-do-intrawebs-'cause-I'm-broke kinda guy (Damn, took me 10 years before I paid that $5 for a MetaFilter registration):
 * Either split the 15 gigs into a throng of .RARs and play it (legal) hacker style on Megaupload (with the supersikkrit .rar with logins and pwds)
 * Or/and some advertising on RW? Set up a Google Ads account in the name of the LLC :) Probably discussed already somewhere
 * Ideas from a complete newbie on webhosting/website managing/servers/younameit...
 * Damn i'm probably a more clueless sysop than most on CP! Worst wikia sysop ever! Alain (talk) 22:50, 6 April 2010 (UTC)
 * Dump the backups on S3 and make them public access. 14:08, 9 April 2010 (UTC)

Special:Export
Technically, you can already get a full backup of RationalWiki. Of course there are some problems with it:
 * 1) You need to manually type in every article's title (or you can get all articles in a category)
 * 2) It will probably time out after 30 seconds
 * 3) Trent will kill you when he sees the bandwidth bill.
It's still something worth mentioning for partial backups of a few articles. -- Nx  / talk 22:41, 6 April 2010 (UTC)
 * That'd be an interesting way of saving and publishing our "best" articles. Ones we consider complete (or at least sufficiently viable for publishing, some of the better essays too, for instance) can be bunged into a word processor, tarted up and published as PDF files as a single collection. 18:37, 12 April 2010 (UTC)

Some thoughts
One thing I keep seeing is all these different numbers regarding the bandwidth storing the site would require. Seems to me there are, broadly, three numbers that matter. One, the biggest, is the entire site - all diffs, all media, plus all the "surface" versions of articles (with or without a functioning MW install). Then there is the database without any rendered file versions - all the diffs + all the media, plus of course the back end changes - our extensions, but not MW itself. This is the minimum required for a "full" restore. The third, and to me the most important, is the daily bandwidth required to update either of those. After initializing a backup site (the big dump), the daily changes are all that is required, and that number would be far smaller than the bandwidth the server uses.

Now, if I am correct here - and tell me if I'm not - MW operates at two levels. One, the backbone of the data, is that it maintains a huge database of edits and uploads (and custom backend stuff). A page is "described" completely by a series of diffs, and that is all that must be saved in order to bring the site back to life. The other thing MW does, which a backup does not need, is render the html (or something one layer deeper) for the current version of any given file/article in order to serve it up without having to build it from diffs from scratch. I suspect these renderings are often a bit "behind", as in a few diffs still need to be applied to them to show the current version.

So, my question is, how big is the raw data - everything it would take to mirror the site, but not able to serve it up, and how big are the average daily additions to this data? 23:36, 10 April 2010 (UTC)


 * You're not entirely correct. MediaWiki stores each and every revision of a page completely - diffs are generated on-the-fly. This is why our full dump is approaching 15gigs (this is the minimum needed to restart RW on another server, although things like users and permissions would be lost). And images are relatively small compared to the revisions table, but I don't have an accurate number.
 * The daily backup transfers (or transferred) all 10+ gigs every day. (stupidity removed by senior admin) However, our backups were not done using MediaWiki's backup script - they were MySQL backups, i.e. a raw dump of the entire database (the MediaWiki dump is an XML file, and only contains revisions and pages), including private information. This can be done incrementally, and we should probably set that up to save bandwidth. -- Nx  / talk 23:58, 10 April 2010 (UTC)


 * We didn't transfer all 10+ gigs everyday, I rsynced the backup so it only transferred the changes. Which amounted to a few hundred megs. tmtoulouse 00:00, 11 April 2010 (UTC)


 * Bah, I didn't take that into account. I only looked at the huge files that were produced from the full db dump. -- Nx  / talk 00:03, 11 April 2010 (UTC)
 * No worries, I am looking into new back up scripts as well as soon as we figure out how we want to handle it. tmtoulouse 00:22, 11 April 2010 (UTC)

Just for amusement, how much would it be without the saloon bar and talk:wigo CP? 01:00, 11 April 2010 (UTC)

And for serious, am I correct in thinking that the db simply "grows", nothing ever goes away? And it's just a bunch of files that could be stored on any HD? 01:14, 11 April 2010 (UTC)
 * Might be worth seeing if we can expunge the saloon bar every month or so, or maybe dump archive pages somewhere else. 08:09, 12 April 2010 (UTC)
 * We would lose the edit history then. -- Nx  / talk 10:50, 12 April 2010 (UTC)
 * For the saloon bar do we really care? If you make people aware that anything they post will be expunged after a month then they can be responsible for archiving any data they want to keep. 12:14, 12 April 2010 (UTC)
 * A butterfly flaps its wings in Africa.........there are all kinds of odds and ends of the wiki that expunging edit histories would harm. Edit counts, both of users and site wide, DPL, contribution lists, and I am sure many issues I am not even aware of. So far, there is no reason to do it either. We are not hurting for disk space. tmtoulouse 12:26, 12 April 2010 (UTC)
 * Surely not. I have a blank 40GB HD just lying around.  How come no one is answering my "serious" question?  12:44, 12 April 2010 (UTC)
 * Yes the database just grows, nothing goes away and yes they are just files. tmtoulouse 12:47, 12 April 2010 (UTC)
 * (ec) Sorry about that, I thought I answered it already. Yes, the db grows, each time you save a page, you add that page's text to the db, so e.g. when you save a 400kb page and change just a single letter, you are adding 400kb to the database. Liquidthreads would alleviate this problem (each comment is a separate page with its own history, so there's no such overhead), and I'm hoping we can start using it soon.
 * As for deleting history: I'm sure it can be done without breaking stuff, but it would be that: deleting history. i.e.: permadiffs won't work, users' contribs would be lost, editcounts lost etc. -- Nx  / talk 12:50, 12 April 2010 (UTC)
 * Which is why it's always worth an extra giggle every time Ken makes 300 edits to one of his monstrous pages, just to change a few blank lines. tmtoulouse 12:58, 12 April 2010 (UTC)
 * I thought the pages were compressed by just saving the changes and stringing them together, but I imagine that'd be more intensive work than saving individual pages and just getting the software to generate the diffs. I'm going to feel really guilty for every edit now...[[image:Francis.gif]] 18:34, 12 April 2010 (UTC)
 * Holy shit! You mean each version of the page is stored as a whole rather than using a reverse delta? That's fucking terrible. How long have source control systems been around? They surely could have swiped some code from CVS / SVN? 20:58, 12 April 2010 (UTC)
 * I'm sure there was a reason to implement it this way other than "we're too retarded to implement reverse deltas". -- Nx  / talk 21:05, 12 April 2010 (UTC)
 * I'd be interested to know the reason. By the way, can we not just zap the saloon bar with this every now and then? Would save a massive amount of dump space. 21:07, 12 April 2010 (UTC)
 * Or this might be better. 21:08, 12 April 2010 (UTC)
 * Ask the developers of mediawiki then. As for deleting old revisions: like I said, it's not a technical problem, it can be done, but that would mean losing history, contribs, edit counts, permadiffs etc. -- Nx  / talk 21:09, 12 April 2010 (UTC)
 * Actually, I just found this, which means they are now looking at ways of implementing reverse deltas in the same way as subversion, so I guess they were retarded enough to not think about it. Christ, imagine how big Wikipedia is now! Also, as I said before, do we really care about history, edit counts, diffs etc for the saloon bar? I think the user who would lose the most from it would be me and I don't give a fuck. Who posts diffs to the bar anyway (apart from quick "haha, I loved this comment from you" on people's talk pages, which get forgotten after a few days anyway)? 21:13, 12 April 2010 (UTC)
 * (ec) Not really, take a look at the history of that page. Anyway, liquidthreads is going to solve the saloon bar part of the problem. As for articles, we cannot delete old revisions because of the attribution requirement of our license (except vandalism and maybe edits that have been reverted). -- Nx  / talk 21:26, 12 April 2010 (UTC)
 * Well, recall that Trent just recently discovered the diff that "started" RationalWiki. You never know what is actually in there until you delete it and realise you can't find it. And a few people do track edit counts for reasons of productivity and statistics as well as basic ego. Anyway, I imagine the reason for not just storing diffs is that working back through (potentially) hundreds or thousands of diffs to construct a previous version of a page would be a lot of work for the server and software compared to just pulling up an old version. The pages aren't extremely big, and extra storage space is easier to get hold of and expand compared to processing speed and RAM for the server. 21:24, 12 April 2010 (UTC)
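To put rough numbers on the storage trade-off discussed above, here is a minimal Python sketch. It is only an illustration and not how MediaWiki actually stores anything: the page contents, the function names `full_storage_bytes` and `delta_storage_bytes`, and the use of unified diffs as the "delta" format are all assumptions made for the example. It compares keeping every revision whole against keeping the first revision plus one diff per later edit, for a hypothetical 400 KB page that has a single letter changed three times.

```python
import difflib


def full_storage_bytes(revisions):
    """Bytes used when every revision is stored as complete page text."""
    return sum(len(text.encode("utf-8")) for text in revisions)


def delta_storage_bytes(revisions):
    """Bytes used when only the first revision is stored whole and each
    later revision is kept as a unified diff against its predecessor."""
    total = len(revisions[0].encode("utf-8"))
    for old, new in zip(revisions, revisions[1:]):
        diff = difflib.unified_diff(
            old.splitlines(keepends=True),
            new.splitlines(keepends=True),
            n=0,  # no context lines, keep each delta as small as possible
        )
        total += sum(len(line.encode("utf-8")) for line in diff)
    return total


# Hypothetical page: 4000 lines of exactly 100 bytes each (400,000 bytes).
lines = ["line {:05d} ".format(i) + "x" * 88 + "\n" for i in range(4000)]
revisions = ["".join(lines)]

# Three edits, each changing one letter on one line.
for n in range(3):
    lines[n * 1000] = lines[n * 1000].replace("x", "y", 1)
    revisions.append("".join(lines))

print("whole revisions:", full_storage_bytes(revisions), "bytes")
print("first + deltas: ", delta_storage_bytes(revisions), "bytes")
```

The flip side, as noted in the thread, is that rebuilding an old revision from deltas means replaying every diff between it and the stored base, which costs CPU time instead of disk space.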

(UI) I'm not suggesting deleting all revisions from all articles, just the bar. This is starting to turn into a discussion with a creationist. "Why not remove the history of the bar?" "because that would mess up the history" "but who cares if it's for the bar?" "Well we'll lose all the history, and articles would be screwed up" (repeat ad nauseam). Do whatever, I don't care about the space really, I was trying to help out. I'll go back to my beer. 21:37, 12 April 2010 (UTC)


 * The problem is not in the absolute size of the wiki itself, as "storage space" is really cheap, both on the server and on offsite backup options. The "problem" comes into things like bandwidth, file sizes (not because they take up space, but because managing a single large file can be cumbersome). Some of these problems fall away quickly enough, after the initial transfer of the database the nightly bandwidth is small. I am working on a variety of options for handling file sizes, etc. I think the technological energy is better spent figuring out how to efficiently preserve our history rather than cull it. tmtoulouse 21:50, 12 April 2010 (UTC)
 * PS, don't view it as attacking your suggestions, I appreciate people taking an active interest in this level of the meta-management of the project and would never want you to think that your suggestions aren't appreciated. tmtoulouse 21:52, 12 April 2010 (UTC)
 * I might be talking way out of my level of competence but couldn't the whole thing be copied to a terabyte external disc drive? (about £50 here) 21:58, 12 April 2010 (UTC)
 * Local backups are created, but an offsite backup should usually be created for something of this importance. If my house burns down, or everything in my room is stolen, or civil war were to break out in Ontario, a local backup doesn't save the data. Also if I were to disappear off the earth, whether by being run over by a bus (I harp on this example because this is a non-trivial likelihood, there is this two stage crossing I have to do every day coming to/from work and I get confused sometimes...) or by rapture, other people can have access to the offsite backup and recreate the wiki in a new location. tmtoulouse 22:03, 12 April 2010 (UTC)
 * OK get two drives, one on premises & the other elsewhere (moved physically to a net linked machine somewhere) - let's face it, if (gOD forbid) something did happen to you we'd be rather effed up anyhow. 22:08, 12 April 2010 (UTC)
 * Backing it up locally and either mailing it or personally taking it to someone may be a viable and fairly sensible option, actually. What's the phrase? "Never underestimate the bandwidth of an 18-wheeler truck speeding down the road"? Although that could only be for a one-off, perhaps future backups could be done to that drive via the internet in smaller chunks if possible. 22:16, 12 April 2010 (UTC)
 * In a way that was what we were using, a volunteer allowing us to dump our data to their computer. But they are no longer able to support that, and here is the great limitation of this method: it is 100 percent reliant on the ability and willingness of one person, and the level of safety/support of the backed up data can't be guaranteed. My argument, as well as others', is that for about $10 a month we can have a professional service handle our offsite storage. The ability to read and write to this storage, as well as its safety and integrity, is then reliant on the solvency of Amazon. Also, it is very easy to share access to this information with a few trusted people so that it can be retrieved in various catastrophic scenarios. tmtoulouse 22:23, 12 April 2010 (UTC)

archive.org
Would archive.org be interested in hosting the dumps? - David Gerard (talk) 08:01, 12 April 2010 (UTC)


 * It also occurs to me that putting up a BitTorrent tracker instead of a download would be highly efficient. Can list it on LegalTorrents and all - David Gerard (talk) 10:08, 12 April 2010 (UTC)

Amazon
I'll go back to advocating Amazon's AWS services, because we use them here a lot for our mirror servers, backup storage, and file synchronisation and they are awesome for the cash. I've found the best way to get a cheap dedicated server is to create an EC2 spot instance with a really high bid amount (e.g. $1), and then you have a perpetual server which costs you about 3c/h ($263/year), and bandwidth is only $0.15/GB (with the first GB each month free). Not sure how that fits into the budget, but for storage S3 would be worth it for the backups. 08:08, 12 April 2010 (UTC)
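The arithmetic behind those figures can be checked with a few lines of Python. This is a back-of-the-envelope sketch only: the rates are the ones quoted in the post above (2010 prices, not current ones), the function names are made up for the example, and reading "first Gb/m free" as one free gigabyte of transfer per month is an assumption.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def yearly_instance_cost(rate_per_hour):
    """Cost in dollars of running one instance around the clock for a year."""
    return rate_per_hour * HOURS_PER_YEAR


def monthly_transfer_cost(gb_transferred, price_per_gb=0.15, free_gb=1):
    """Cost in dollars of one month's transfer, after the free allowance
    (assumed here to be 1 GB per month)."""
    billable = max(gb_transferred - free_gb, 0)
    return billable * price_per_gb


# 3 cents an hour, as quoted in the post.
print("instance per year: $%.2f" % yearly_instance_cost(0.03))

# Hypothetical month in which the full ~15 GB dump is transferred once.
print("15 GB transfer:    $%.2f" % monthly_transfer_cost(15))
```

At 3c/h the yearly figure comes to $262.80, which matches the roughly $263 quoted.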


 * It's a solution I am leaning towards but there are a lot of limitations to the S3 service. For example, the maximum size of a file is 5 gigabytes, but our database files reach 14 gigabytes. Also file transfers are "all or nothing" with S3, meaning that rsync doesn't work without a third party service and additional costs. But I am exploring it as an option. tmtoulouse 11:21, 12 April 2010 (UTC)


 * Maximum file size surely shouldn't be too hard to get around; just use split or tar or something to break the file into manageable chunks. As for the lack of rsync, something like duplicity might be worth looking at: use it to create local incremental diffs and upload those. Or, rsync appears to have an "--only-write-batch" option, so maybe one could rsync between two local dirs and use that to generate suitable incremental diffs. (I've not tried those myself, just thinking aloud..) alt (talk) 11:38, 12 April 2010 (UTC)


 * I didn't even know there was a maximum. We use 7zip on all our database backup files which gets them well within the size limit. There are some free syncing tools out there, but they all have problems. The one we're using now is "Super Flexible File Synchronizer", stupid name, awesome product. Unfortunately it is $60 but if you use it properly it is worth it. 12:12, 12 April 2010 (UTC)


 * The thing you have to realize about third party sync tools is that the S3 all or nothing paradigm means you cannot compare parts of files. No matter what you do, if a part of a file changes you have to upload the whole file. That is not a problem if the files we are talking about are a few megs. But when the vast bulk of the multi-gigabyte backup is a single file it's a problem. And trying to split a 14 gigabyte file into 100 smaller files to get around this problem would just eat up CPU time in a vicious cycle. There is a solution but it involves having to use the EC2 AWS service. Again, I am not saying it's not an option, it's actually my top choice so far, but the issues are fairly complex to work through. tmtoulouse 12:23, 12 April 2010 (UTC)


 * Well the sort of thing I had in mind is that you have a local copy of greatbigdump2010.sql and then a series of increments generated via rsync --only-write-batch or whatever, and then periodically upload increment_12042010.diff etc to the offsite storage place. Obviously that means you've got to have a local mirror of greatbigdump2010.sql and N*(increment_foo.diff); it'd probably be a good idea to periodically start over again with greatbigdump2011.sql. I know it's far from perfect though. Now if only Google provided a bit more space and actually allowed gmailfs to work properly.. alt (talk) 12:53, 12 April 2010 (UTC)
 * I think Cloudberry Backup does block-level (chunked) backups of large files now, but I think it's windows only. Can't tar / gzip split consistently enough to keep unchanged portions the same? 14:47, 12 April 2010 (UTC)
 * Anyway, if you do choose to go to EC2 then let me know if you need a hand, as I've been using it for almost a year now. One tip I would give would be to start your instance using an EBS AMI, because then you can stop & start it rather than just terminating it, and if the worst happens and you haven't taken a snapshot recently then you can boot it again. 20:56, 12 April 2010 (UTC)
 * I would likely use something that is already established such as s3rsync, which is really pretty cheap. Like I said, and others, there are solutions to various problems, but it is important to weigh options. Also I need to get a better feel for how tar/split or rdiff affect partial checksums. tmtoulouse 21:05, 12 April 2010 (UTC)
 * Is that the Ruby based one? If so then don't bother. We used it for about 6 months and as soon as you get over a few hundred thousand files it's practically useless. 21:15, 12 April 2010 (UTC)
 * No, it is actually a group that has an EC2 system up that you can rsync to. So they have access to your S3, download it, you rsync to them, and then they upload it back to your S3. http://www.s3rsync.com/ You just pay for computational time at a fairly cheap rate. tmtoulouse 21:40, 12 April 2010 (UTC)
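On the open question above about how splitting a dump interacts with per-piece checksums, the following Python sketch may help build intuition. It is a toy model under stated assumptions: the helper names `chunk_digests` and `changed_chunks` are invented here, the "dump" is a 10 KB byte string rather than a 14 GB file, and the split is a plain fixed-size one, similar in spirit to what the `split` utility does. It hashes each fixed-size piece and reports which pieces would need uploading again after an edit.

```python
import hashlib


def chunk_digests(data, chunk_size):
    """SHA-256 digest of each fixed-size piece of data, in order."""
    return [
        hashlib.sha256(data[start:start + chunk_size]).hexdigest()
        for start in range(0, len(data), chunk_size)
    ]


def changed_chunks(old, new, chunk_size):
    """Indexes of pieces that differ between old and new, including
    pieces that exist in only one of the two."""
    old_digests = chunk_digests(old, chunk_size)
    new_digests = chunk_digests(new, chunk_size)
    longest = max(len(old_digests), len(new_digests))
    return [
        i for i in range(longest)
        if i >= len(old_digests)
        or i >= len(new_digests)
        or old_digests[i] != new_digests[i]
    ]


# Toy stand-in for a database dump: 10,240 bytes, split into 1 KB pieces.
dump = bytes(range(256)) * 40
CHUNK = 1024

# Case 1: one byte overwritten in place; the file length is unchanged.
overwritten = dump[:5000] + b"\xff" + dump[5001:]
print("overwrite:", changed_chunks(dump, overwritten, CHUNK))

# Case 2: one byte inserted; everything after it shifts by one position.
inserted = dump[:5000] + b"\xff" + dump[5000:]
print("insert:   ", changed_chunks(dump, inserted, CHUNK))
```

With an in-place overwrite only the piece containing the change differs. With an insertion, every piece from that point to the end of the file differs, because all later bytes have moved. Whether a real SQL dump behaves more like the first case or the second depends on how the dump tool lays out rows, which this sketch does not try to model.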

(undent) Or rather than tar/splitting a huge file, how about switching on binary logging on the database (it gives a very small performance hit, but that's maybe acceptable). That should generate a set of manageable-sized files; set a cron job to flush the logs each night and upload those. (which possibly means you just do a straight copy rather than rsync, though s3rsync does look pretty good) alt (talk) 23:23, 12 April 2010 (UTC)
 * I'm a bit out of touch with MySQL. Can you ship transaction logs? i.e. do a complete dump and store that somewhere and then ship the transaction logs to the same place every night. Every month or so do a complete backup and truncate the logs. 10:15, 13 April 2010 (UTC)
 * That's pretty much what I was suggesting, yeah - I don't know if MySQL's "binary log" is technically exactly the same as a full transaction log, but they contain enough info to do incremental backups/restores. alt (talk) 14:10, 13 April 2010 (UTC)
 * FYI Trent (if you do go with EC2), as far as taking an AMI snapshot goes, I've just discovered that you can't do it from the EC2 control panel. You have to do it on the command line and then upload and register manually. Info here, but they don't tell you that you have to remove the dashes from your account number. HTH. 00:34, 18 April 2010 (UTC)
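Going back to the binary-log suggestion earlier in this section, the nightly job would need to work out which log files are safe to copy off site. The Python sketch below covers only that selection step, under several assumptions: the log basename `mysql-bin`, the function name `logs_to_ship`, and the idea of tracking already-uploaded names in a set are all hypothetical choices for the example. It relies on MySQL numbering its binary logs as `basename.000001`, `basename.000002`, and so on, and on the newest file being the one the server is still writing to after a log flush.

```python
from pathlib import Path


def logs_to_ship(log_dir, already_shipped, basename="mysql-bin"):
    """Return closed binary log files that have not been uploaded yet,
    oldest first.

    log_dir         -- directory holding the binary logs
    already_shipped -- set of file names uploaded on earlier runs
    basename        -- log file prefix (assumed; set by the server config)
    """
    logs = sorted(
        path for path in Path(log_dir).glob(basename + ".[0-9]*")
        if path.suffix[1:].isdigit()  # skips e.g. mysql-bin.index
    )
    # The highest-numbered file is the active log, so leave it alone.
    closed = logs[:-1]
    return [path for path in closed if path.name not in already_shipped]


# Hypothetical usage in a nightly job, after the logs have been flushed:
#
#   for path in logs_to_ship("/var/lib/mysql", shipped_names):
#       upload(path)              # placeholder for whatever uploader is used
#       shipped_names.add(path.name)
```

The actual upload, the log flush itself, and the periodic full dump that the logs are replayed against are deliberately left out.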