So far TheFleet has logged 52 out of 100 attempted networks.1 I began logging another batch of 45 networks on Fleet2. Fleet3 is currently inactive and waiting for assignment.
Before giving Fleet3 its next assignment, I need to pay down some technical debt. The programs I use to query the IRC networks for their metadata2 are poorly written and undocumented.
Channel-snagger, the program used for grabbing the list of all channels, needs to be rewritten. It currently keeps a cache of the list of channels for individual networks and never updates this cache. This made sense when I first used the program, but now the networks' channel lists are stale. I could simply clear the cache, but the better solution is to make the program more efficient and reliable so caching is not required at all.
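To give an idea of what "no caching" amounts to at the protocol level, here is a minimal sketch that pulls a network's channel list fresh on every run with the LIST command and collects the RPL_LIST (322) replies until RPL_LISTEND (323). The host, port, and nick are placeholders and this is not channel-snagger's actual code; some networks may also want to see the end of the MOTD before honoring LIST.

```python
# Minimal sketch (not channel-snagger's actual code): fetch a network's
# channel list live via LIST instead of relying on a cache.
import socket

def fetch_channel_list(host, port=6667, nick="snagger"):
    """Connect, register, send LIST, and collect RPL_LIST (322) replies
    until RPL_LISTEND (323). Returns a list of (channel, user_count, topic)."""
    channels = []
    with socket.create_connection((host, port), timeout=300) as sock:
        f = sock.makefile("rb")
        def send(line):
            sock.sendall((line + "\r\n").encode("utf-8"))
        send(f"NICK {nick}")
        send(f"USER {nick} 0 * :{nick}")
        while True:
            raw = f.readline()
            if not raw:
                break  # server closed the connection
            line = raw.decode("utf-8", errors="replace").rstrip("\r\n")
            parts = line.split(" ")
            # Answer server PINGs so we are not dropped mid-LIST.
            if parts[0] == "PING":
                send("PONG " + " ".join(parts[1:]))
                continue
            cmd = parts[1] if len(parts) > 1 else ""
            if cmd == "001":                      # welcome: registration done
                send("LIST")
            elif cmd == "322":                    # RPL_LIST: one channel entry
                chan, users = parts[3], parts[4]
                topic = line.split(" :", 1)[1] if " :" in line else ""
                channels.append((chan, int(users), topic))
            elif cmd == "323":                    # RPL_LISTEND: we are done
                send("QUIT :done")
                break
    return channels
```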
Another quirk of channel-snagger is the makeshift way it handles the ping/pong dance with the servers. Networks require a pong in response to their pings every N seconds. As a half-hearted attempt to keep the connection alive, I hard-coded channel-snagger to send one ping after 30 seconds; beyond that, channel-snagger has no ping/pong logic. So if receiving the channel list from a network takes too long, the connection may be dropped for failing to keep up the ping/pong routine. This can be fixed by refactoring channel-snagger so that it extends ircbot, which implements a proper ping/pong thread.
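I don't reproduce ircbot's interface here, but the essence of the fix is a thread that answers server pings for the whole lifetime of the connection rather than a single hard-coded ping. A generic sketch, with names and structure of my own (not ircbot's):

```python
# Sketch of what a proper ping/pong thread provides: reply to every server
# PING for as long as the connection lives.
import threading

def start_pingpong_thread(sock, line_queue):
    """Read lines off `sock` forever; reply to PINGs immediately and push
    everything else onto `line_queue` (e.g. a queue.Queue) for the rest
    of the program to process."""
    def loop():
        f = sock.makefile("rb")
        for raw in f:
            line = raw.decode("utf-8", errors="replace").rstrip("\r\n")
            if line.startswith("PING"):
                # Echo the server's token back so the link is kept alive.
                sock.sendall(line.replace("PING", "PONG", 1).encode() + b"\r\n")
            else:
                line_queue.put(line)
    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return t
```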
The code that queries the network for its maximum number of channels allowed per nick also needs a rewrite. It currently exists as a disorganized clump of scripts. I plan to consolidate them into the channel-snagger.
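For reference, and as an assumption about how the consolidated version might work rather than a description of the existing scripts: the per-nick channel limit is advertised by the server itself in its RPL_ISUPPORT (005) burst right after registration, as CHANLIMIT=#:N (or MAXCHANNELS=N on older ircds), so channel-snagger could read it off the same connection it already opens.

```python
# Sketch (assumed approach, not the author's scripts): extract the per-nick
# channel limit from one RPL_ISUPPORT (005) line.
def parse_channel_limit(isupport_line):
    """Return the channel limit from a 005 line, or None if not present.

    Example input:
    ':irc.example.net 005 nick CHANLIMIT=#:120 NICKLEN=30 :are supported'
    """
    for token in isupport_line.split(" "):
        if token.startswith("CHANLIMIT="):
            # Value looks like '#:120' or '#&:60,!:10'; take the first limit.
            first = token.split("=", 1)[1].split(",")[0]
            limit = first.split(":", 1)[1] if ":" in first else ""
            return int(limit) if limit.isdigit() else None  # empty = no limit
        elif token.startswith("MAXCHANNELS="):
            return int(token.split("=", 1)[1])
    return None
```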
The above components are ultimately needed for keeping an up-to-date pool of the next channels to assign for logging. The only other missing element I can think of is a process for removing networks/channels that have already been logged (or have been deemed unloggable) from this pool. Once I have a system I am content with for assigning the next channels to my VMs running TheFleet, I will plan how to analyze the collected data.
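A minimal sketch of that pool-maintenance step, with hypothetical names (the real thing would presumably live next to the logging database rather than in memory):

```python
# Sketch of pool maintenance: the candidate pool is whatever the snagger
# found, minus anything already logged or marked unloggable.
def next_channels_to_assign(snagged, logged, unloggable, batch_size=100):
    """All arguments are sets of (network, channel) pairs."""
    pool = snagged - logged - unloggable
    # Hand out a deterministic batch so repeated runs agree.
    return sorted(pool)[:batch_size]
```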
> I will plan how to analyze the collected data.
Looking forward to that, really, I'm curious whether you'll find anything worth digging into. I had a quick glance at the data you posted and the amount of noise looks staggeringly high, mostly warez and idle chatter. Not that it's any surprise, mind you, most IRC networks were already starting to look like pigsties back in 2004; not like 1998 IRC was a place of great culture or anything, lol, but at least you didn't have to swim through oceans of shit to find something interesting...
The transparent meaning of the above being of course that you have no idea how to analyse the data, you feel quite overwhelmed by the very idea of analysing the data to start with, and therefore you'll spend instead whatever amount of time on polishing and futzing with everything and anything else rather than... analyse the damned data.
Sure, "a system I am content with" is such a great way of putting it, nobody ever could accuse you of not wanting to... being content, right?
Just stop all the futzing about and fleets and whatnots. You've got some data already - unless you can look at that data and report on it properly, there is no point (let me make this clear: THERE IS NO POINT) in getting even more such "data", all right? There's no magic that will do that part for you so you can just do the other and feel good about it. I already told you what basics are needed - if it's not clear, ask in chan.
@Spyked
Thanks for taking a look at the data. I hope the instructions were clear and that you were able to load the SQL file into your db quickly.
@Diana Coman
Okay. There is certainly a lingering "god, how the hell do I analyze these 'oceans of shit'?", and I have been justifying the delay in answering that question by convincing myself that the right way forward was to reach some undefined level of "content" with the tools I use to log, so that I could keep logging in parallel while I answered it.