Author Topic: Can we start a board mirroring project?  (Read 7167 times)

Chaos Ex Machina

  • Lieutenant
  • ***
  • Posts: 79
Can we start a board mirroring project?
« on: September 22, 2012, 03:59:10 AM »
With the Zwillinger news I expect the official boards may go rather quickly.  Please act so the board is not GONE TO THE AMERICANS.

I am ridiculously busy and involved in another project I intend to announce soon, but board parsing should not be a particularly tech-intensive task if any of you are willing.

uninventive

  • Guest
Re: Can we start a board mirroring project?
« Reply #1 on: September 22, 2012, 04:02:01 AM »
Yeah, I'm thinking along the same lines, too: the fewer red names present, the earlier they'll just axe the boards to avoid liability (player makes racist/hate speech comments, they linger, lawyers pirouette, peasants get lulz, and NCSoft closes them anyway).

SithRose

  • Plan Z: Lore Lead
  • Elite Boss
  • *****
  • Posts: 1,981
  • The Phoenix is coming.
    • Missing Worlds Media - Plan Z: The Phoenix Project
Re: Can we start a board mirroring project?
« Reply #2 on: September 22, 2012, 05:21:55 AM »
I believe several people are already working on this - the thread may have dropped to the third page or so, but it's here.
Lore Lead for Plan Z: The Phoenix Project
Secretary of Missing Worlds Media, Inc.

Chaos Ex Machina

  • Lieutenant
  • ***
  • Posts: 79
Re: Can we start a board mirroring project?
« Reply #3 on: September 22, 2012, 10:13:59 PM »
I saw a lot of discussion but no commitments.

WanderingAries

  • Elite Boss
  • *****
  • Posts: 321
  • @WanderingAries /\ Mostly on Torch
Re: Can we start a board mirroring project?
« Reply #4 on: September 23, 2012, 05:37:53 AM »
I saw links, discussion, and concepts, but no solid confirmations.
Find me on Homecoming:
https://www.homecomingservers.com

Sekoia

  • Titan Network Admin
  • Elite Boss
  • *****
  • Posts: 1,848
Re: Can we start a board mirroring project?
« Reply #5 on: September 23, 2012, 06:02:07 PM »
I'm doing some mirroring/archiving, but I would certainly encourage anyone else who wants to do so as well. I do not have what I downloaded available online anywhere yet though.

I have three archives:

The first archive is everything except the forums. So that would include pretty much anything at *.cityofheroes.com that can be found through recursively linking from the main website, as well as content on a few other related domains. So this includes domains such as ftp.coh.com, goingrogue.na.cityofheroes.com, ftp.ncsoft.com. I last took a snapshot around September 10 but I may take a new snapshot again later this week. This archive is more than 20 GB in size, due to media files.

The second archive is just the forums. My initial attempts to archive this met with some difficulties. Because it's a dynamic site with many ways of viewing the same information, you end up with many, many copies of the same information. And because it's all "flat" it was making my computer cry to have so many files in a single directory. However, this thread got me investigating again and I discovered that there's an archive-friendly version of the forums that strips out links, images, formatting, etc. Now that is great for archiving! So last night I kicked off an archive job and it downloaded successfully, though it stopped at, I think, 1,000,000 links (because I didn't realize that limit was a default). I now have it set to download with a much higher threshold. I don't know how long it will take to finish, but I'll try to make sure I get at least one full snapshot. Not sure how big it will end up being, but the partial download was 1 GB.

The third archive is all the videos from their Twitch.TV, Ustream, and YouTube accounts. This is nearly 60 GB of data.

I'm also keeping the archives I make backed up on SpiderOak so I'm reasonably confident I won't lose them due to drive failure or anything. However, as I said, I wouldn't want to discourage anyone from making their own archives; more copies is safer. I'm using HTTrack to make my website archives.
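
The "flat" directory problem mentioned for the second archive can be avoided in a custom downloader by spreading files across nested subdirectories. The following is only a minimal sketch of that idea, not an HTTrack feature: the function name `sharded_path`, the `archive` root, and the choice of SHA-1 prefixes are all assumptions made for illustration.

```python
import hashlib
from pathlib import PurePosixPath


def sharded_path(url: str, root: str = "archive", levels: int = 2) -> PurePosixPath:
    """Map a URL to a nested path so no single directory holds every file.

    Hypothetical helper: the layout is an illustration, not what HTTrack does.
    """
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    # Use the first `levels` pairs of hex characters as directory names,
    # which gives at most 256 subdirectories per level.
    shards = [digest[i * 2:i * 2 + 2] for i in range(levels)]
    return PurePosixPath(root, *shards, digest + ".html")


# The same URL always maps to the same location.
print(sharded_path("http://boards.cityofheroes.com/forumdisplay.php?f=726"))
```

With two levels, a million pages would average roughly 15 files per leaf directory instead of a million in one.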

Chaos Ex Machina

  • Lieutenant
  • ***
  • Posts: 79
Re: Can we start a board mirroring project?
« Reply #6 on: September 23, 2012, 06:23:30 PM »
There may be a way to archive vBulletin boards specifically, in a form that could be converted to a database; however, I found none.

You could filter URLs to ignore certain types of categories.  For example only capture the indexes, postings, and profiles.
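
A rough sketch of that kind of URL filter, written in Python as a standalone check rather than as HTTrack configuration. Only `forumdisplay.php` is a script name that actually appears in this thread; `showthread.php` and `member.php` are assumed to be the usual vBulletin scripts for postings and profiles, and `should_capture` is a hypothetical helper name.

```python
from urllib.parse import urlparse

# Script names to keep. Only forumdisplay.php is confirmed by URLs in this
# thread; the other two are assumptions about a typical vBulletin install.
ALLOWED_SCRIPTS = {
    "forumdisplay.php",  # forum indexes
    "showthread.php",    # postings (assumed)
    "member.php",        # profiles (assumed)
}


def should_capture(url: str, host: str = "boards.cityofheroes.com") -> bool:
    """Return True only for index, posting, and profile pages on the board."""
    parsed = urlparse(url)
    if parsed.hostname != host:
        return False
    # The script name is the last path segment, e.g. "/forumdisplay.php".
    script = parsed.path.rsplit("/", 1)[-1]
    return script in ALLOWED_SCRIPTS


print(should_capture("http://boards.cityofheroes.com/forumdisplay.php?f=726"))
```

An allow-list like this is stricter than a block-list: any page type not named is skipped, which keeps calendar, search, and similar views out of the mirror by default.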

Sekoia

  • Titan Network Admin
  • Elite Boss
  • *****
  • Posts: 1,848
Re: Can we start a board mirroring project?
« Reply #7 on: September 23, 2012, 08:04:29 PM »
You could filter URLs to ignore certain types of categories.  For example only capture the indexes, postings, and profiles.

That was actually what I was starting to do yesterday, until I noticed the archive version of the site. The archive version gives us all the post contents and forum indexes, so I figure that's the highest priority to capture, at least initially. Once I have a solid archive of that, I may try to expand to the "normal" version of the forums to capture images and formatting, though I have a suspicion that'll blow the size of the archive up quite significantly.

Windy

  • Underling
  • *
  • Posts: 9
Re: Can we start a board mirroring project?
« Reply #8 on: September 24, 2012, 12:52:19 AM »
I'm doing some mirroring/archiving, but I would certainly encourage anyone else who wants to do so as well. I do not have what I downloaded available online anywhere yet though.

Sekoia, will your archives of the forums be available for those of us who may want to search them for info? (For future articles, quotes, contact info, etc.)

Archiving a forum without server access is not something we know how to do in my household.  If someone wants to give me a few instructions, I'm happy to be another archiver.
@Windy
Freedom Server

Sekoia

  • Titan Network Admin
  • Elite Boss
  • *****
  • Posts: 1,848
Re: Can we start a board mirroring project?
« Reply #9 on: September 24, 2012, 01:16:45 AM »
Sekoia, will your archives of the forums be available for those of us who may want to search them for info? (For future articles, quotes, contact info, etc.)

I don't have any specific plans for how to make them available just yet, but when the official sites go offline I'll work with Titan to see if there's some way we can make them publicly available. I'd definitely like to see that happen.

Archiving a forum without server access is not something we know how to do in my household.  If someone wants to give me a few instructions, I'm happy to be another archiver.

Here's the software I used for it: http://www.httrack.com/ I'm currently running an archival job, so I can't check which settings I used right now, but I'll try to share them later. The software isn't too terribly hard to learn, especially if you read through their documentation.


UPDATE: I managed to finish downloading the archive version of the forums. It totals about 4.5 GB. Note that this also only includes the publicly viewable parts of the forums, not the stuff that requires log-in.
« Last Edit: September 24, 2012, 01:20:24 PM by Sekoia »

Quinch

  • Guest
Re: Can we start a board mirroring project?
« Reply #10 on: October 01, 2012, 07:22:12 AM »
The problem I've found with website scrapers and bulletin boards is that they treat every link as something to be downloaded - even when the link leads to the post in the same thread, leading to a veritable and perpetual explosion.

I'm thinking of putting something together that will simply go page by page and download thread by thread. I'm still getting the hang of C#, so while I know I can do it, I'm not sure I can do it quickly enough.
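
One way to keep such a crawler from exploding is to reduce every thread link to a (thread, page) key and queue each key only once. The sketch below is in Python rather than C#, and it assumes vBulletin-style links of the form `showthread.php?t=<thread id>&page=<n>`, with single-post links using a different parameter; those URL conventions and the helper names `thread_key` and `unique_thread_pages` are assumptions, not something confirmed in this thread.

```python
from urllib.parse import urlparse, parse_qs


def thread_key(url: str):
    """Return (thread_id, page) for a thread link, or None for other links."""
    parsed = urlparse(url)
    if not parsed.path.endswith("showthread.php"):
        return None
    query = parse_qs(parsed.query)
    if "t" not in query:
        # Links that identify a single post are skipped, since that post is
        # already covered by the thread page it belongs to.
        return None
    page = int(query.get("page", ["1"])[0])
    return (query["t"][0], page)


def unique_thread_pages(links):
    """Yield only the first link seen for each (thread, page) pair."""
    seen = set()
    for link in links:
        key = thread_key(link)
        if key is not None and key not in seen:
            seen.add(key)
            yield link


links = [
    "http://boards.cityofheroes.com/showthread.php?t=100",
    "http://boards.cityofheroes.com/showthread.php?t=100#post5",
    "http://boards.cityofheroes.com/showthread.php?t=100&page=2",
]
for link in unique_thread_pages(links):
    print(link)
```

The `#post5` anchor is ignored because `urlparse` separates the fragment from the query, so the second link collapses onto the first and only two pages are queued.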

Sekoia

  • Titan Network Admin
  • Elite Boss
  • *****
  • Posts: 1,848
Re: Can we start a board mirroring project?
« Reply #11 on: October 01, 2012, 01:39:29 PM »
The problem I've found with website scrapers and bulletin boards is that they treat every link as something to be downloaded - even when the link leads to the post in the same thread, leading to a veritable and perpetual explosion.

Yeah, I gave up my first attempts at mirroring the forums because of exactly that. Perhaps with filters it could be made more reasonable, though.

The archive version of the forums only represents each post exactly once, so the issue you cite fortunately wasn't a problem there. Unfortunately, the archive version is far from perfect. It strips off formatting and drops images.

Which means that the Beta testing boards are not gathered (ie the ones you see if you log in on a VIP account)

This is correct.

voodoogirl

  • Guest
Re: Can we start a board mirroring project?
« Reply #12 on: October 01, 2012, 01:53:59 PM »
What about Closed Beta forums? I still have "subscriptions" to the one or two I was in.

Sekoia

  • Titan Network Admin
  • Elite Boss
  • *****
  • Posts: 1,848
Re: Can we start a board mirroring project?
« Reply #13 on: October 01, 2012, 01:58:59 PM »
My current archive only contains the publicly-viewable parts of the forums, so that excludes VIP and Beta stuff.

I'm now making an attempt to archive the VIP section (I found out how to have WinHTTrack log in), but unfortunately I don't have access to any Beta forums that might exist so I wouldn't be able to archive them. I'll post again when the mirroring job is complete to note whether it successfully captured the VIP stuff.

voodoogirl

  • Guest
Re: Can we start a board mirroring project?
« Reply #14 on: October 01, 2012, 02:02:20 PM »
I can tell you that http://boards.cityofheroes.com/forumdisplay.php?f=726  leads to the main I-20 Pre-Beta forums, if you can download it

Sekoia

  • Titan Network Admin
  • Elite Boss
  • *****
  • Posts: 1,848
Re: Can we start a board mirroring project?
« Reply #15 on: October 01, 2012, 02:08:00 PM »
Nope, tells me I don't have permission to access it (even though I'm logged in). :(

voodoogirl

  • Guest
Re: Can we start a board mirroring project?
« Reply #16 on: October 01, 2012, 02:10:21 PM »
Maybe you can ask a former Dev who still has forum log in privileges...?

Sekoia

  • Titan Network Admin
  • Elite Boss
  • *****
  • Posts: 1,848
Re: Can we start a board mirroring project?
« Reply #17 on: October 01, 2012, 02:14:22 PM »
Paying closer attention, and to clarify: the VIP forums I have access to are actually for the Issue 24 beta (for some reason the fact that they were both VIP and Beta wasn't clicking for me...):
Issue 24: Resurgence [VIP Beta] Forums
 - Issue 24: Resurgence [VIP Beta] Announcements Forum
 - Issue 24: Resurgence [VIP Beta] General Discussion Forum
 - Issue 24: Resurgence [VIP Beta] Feedback Forum
 - Issue 24: Resurgence [VIP Beta] Bug Reports Forum

Are the beta forums for previous issues (such as I-20) still actually there? I was under the impression they nuked them after a while.

I honestly don't follow the official forums very much, so sorry if that sounds clueless. :)

voodoogirl

  • Guest
Re: Can we start a board mirroring project?
« Reply #18 on: October 01, 2012, 02:19:22 PM »
Nope, they are there. I can still see all the contents of http://boards.cityofheroes.com/forumdisplay.php?f=734

Maybe parent directories are hidden but not children?

A few months ago I discovered the ascending order of new forums and found an empty forum and posted in it, wreaking some havoc for one night. The thread was excised the next morning and the phantom forum locked.

AFAIK they don't erase the forums - they make them invisible - since they sometimes refer back to the threads.

Sekoia

  • Titan Network Admin
  • Elite Boss
  • *****
  • Posts: 1,848
Re: Can we start a board mirroring project?
« Reply #19 on: October 01, 2012, 02:50:50 PM »
Interesting! Once I confirm that HTTrack is actually working properly for the log-in sections, I'll send someone a PM and inquire about the beta forums then. Doesn't hurt to ask. :)