Below is an overview of how TheFleet works on a per network basis. The article also touches on a bug I found in cl-irc, enumerates the hurdles I have to overcome related to orchestrating the fleet of fleetbots, and lists my next steps forward.
Table of Contents:
I. Fleetbot class template
II. Connecting basics
III. How fleetbot handles various irc messages
IV. Reconnecting
V. Notes on logging
VI. A bug in cl-irc
VII. Problems with orchestration
VIII. Next steps
I. Fleetbot class template
Below is the fleetbot class template. Some of the fields are inherited from its superclass, ircbot.
Field Name | Datatype | Description |
---|---|---|
db | list | a list of four strings (db_name, db_user, db_user_password, db_ip_addr) used for connecting to the postgres db. |
sunk-p | boolean | set to true if the bot has disconnected, false otherwise |
resurface-attempts | integer | the number of consecutive failed attempts to connect to a network. gets reset to 0 upon a successful connection. a connection is deemed successful when we receive our join message from the server [1] for any channel in our list of channels. |
connection | cl-irc class | an object representing the connection to the network. nearly all cl-irc commands take this object as the first parameter. [2] |
channels | list | a list of strings that are the names of channels the bot will attempt to connect to. |
active-channels | list | a list of strings that are the names of channels we are currently connected to. upon being kicked from a channel, we remove the channel from active-channels. upon being disconnected, we remove all channels by setting active-channels to nil. we add the corresponding channel to active-channels when we receive back our join message from the network. |
networkname | string | the name of the irc network the bot connects to |
server | string | the hostname of the network |
port | int | the port number of the network (usually 6667) |
nick | string | the nick we use for the network |
current-nick | string | the nick that we are using for our current connection. this is the same as nick, unless the server told us we could not use nick (most likely because nick is already being used.) in this case we append "-???" to nick where each ? is a random digit 0-9. |
password | string | the password for the nick. since we do not use registered nicks for fleetbot, this field is always nil. |
connection-security | :ssl or :none | the connection security for the network. currently we do not connect to networks via ssl, so this is always set to :none |
run-thread | sb-thread | the thread that is used for communicating with the server |
ping-thread | sb-thread | the thread that is used for sending pings to the server |
lag | int | the difference between the last time we sent the server a ping and the time we received the server's corresponding pong. |
lag-track | list | every time we send the server a ping we store a reference to the ping in the lag-track. when we receive the corresponding pong from the server we delete the ping from the lag-track. if a ping has gone unanswered for more than *max-lag* (default = 60) seconds, we decide we have been disconnected, and the ping-thread attempts to reconnect our bot to the network. |
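The table above maps fairly directly onto a pair of CLOS classes. The following is only a sketch: the split of slots between ircbot and fleetbot, the accessor names (other than ircbot-connection, which is mentioned later in this article), and the helper make-fleetbot are assumptions made for illustration, not the actual source.

```lisp
;; Sketch only. Which slots live on IRCBOT and which on FLEETBOT is an
;; assumption, as are most accessor names.
(defclass ircbot ()
  ((connection :initform nil :accessor ircbot-connection)
   (channels :initarg :channels :initform nil :accessor ircbot-channels)
   (server :initarg :server :accessor ircbot-server)
   (port :initarg :port :initform 6667 :accessor ircbot-port)
   (nick :initarg :nick :accessor ircbot-nick)
   (password :initarg :password :initform nil :accessor ircbot-password)
   (connection-security :initarg :connection-security :initform :none
                        :accessor ircbot-connection-security)
   (run-thread :initform nil :accessor ircbot-run-thread)
   (ping-thread :initform nil :accessor ircbot-ping-thread)
   (lag :initform nil :accessor ircbot-lag)
   (lag-track :initform nil :accessor ircbot-lag-track)))

(defclass fleetbot (ircbot)
  ((db :initarg :db :accessor fleetbot-db)
   (sunk-p :initform nil :accessor fleetbot-sunk-p)
   (resurface-attempts :initform 0 :accessor fleetbot-resurface-attempts)
   (active-channels :initform nil :accessor fleetbot-active-channels)
   (networkname :initarg :networkname :accessor fleetbot-networkname)
   (current-nick :initform nil :accessor fleetbot-current-nick)))

;; Hypothetical constructor taking the parameters listed in section II.
(defun make-fleetbot (networkname server port nick channels db)
  (let ((bot (make-instance 'fleetbot
                            :networkname networkname :server server :port port
                            :nick nick :channels channels :db db)))
    ;; current-nick starts out equal to nick and only diverges on a collision.
    (setf (fleetbot-current-nick bot) nick)
    bot))
```

A bot built this way starts with no active channels, zero resurface attempts, and sunk-p set to false.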
II. Connecting basics
The fleetbot constructor takes as parameters: the server information (networkname, server, and port), a nick, a list of channels, and the db used for logging.
We then connect fleetbot via the overridden method ircbot-connect-thread. ircbot-connect-thread creates a new thread, assigns that thread to the bot's run-thread field, and within the new thread calls ircbot-connect.
The body of ircbot-connect is wrapped in a handler-case. [3]
First, ircbot-connect calls cl-irc's function connect, which takes a nickname, server, port, and connection-security and returns a connection object. The connection object provides an interface for sending messages to and receiving messages from the irc server via cl-irc's api. Next, we use cl-irc's api to add a list of hooks to the connection object. A hook is a mapping of an irc message type to a function. When we receive a message via the socket stored in the connection object, cl-irc searches for hooks that are assigned to the message's message type and calls each hook's function with the message passed as the sole parameter.
If at any point an error is thrown, then we mark the bot as "sunk" and attempt to reconnect the bot. If we cannot reconnect after +MAX-RESURFACE-ATTEMPTS+ (default = 5) attempts then we determine the bot has sunk for good and stop trying to reconnect.
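The retry policy can be sketched in isolation. This is a simplification under stated assumptions: connect-with-resurface is a hypothetical name, the connect step is passed in as a function so the sketch runs without cl-irc, and an attempt counts as successful as soon as that function returns, whereas the real bot only resets its counter once it sees its own join message.

```lisp
(defconstant +max-resurface-attempts+ 5)

;; Hypothetical helper: call CONNECT-FN until it returns normally or until
;; MAX-ATTEMPTS consecutive calls have signalled an error.
;; Returns (values result attempts-used); result is NIL if we gave up.
(defun connect-with-resurface (connect-fn
                               &key (max-attempts +max-resurface-attempts+)
                                    (delay 0))
  (loop for attempt from 1 to max-attempts
        do (handler-case
               (return (values (funcall connect-fn) attempt))
             (error (e)
               ;; Any error marks this attempt as failed; wait, then retry.
               (format t "attempt ~d failed: ~a~%" attempt e)
               (sleep delay)))
        finally (return (values nil max-attempts))))
```

Passing a larger delay would give the network some breathing room between attempts, which is one of the next steps listed at the end of this article.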
III. How fleetbot handles various irc messages
We have the following hooks set up:
IRC Message Type | Function Description |
---|---|
irc-err_nicknameinuse-message | We change our nick by appending a random suffix (e.g. jenny -> jenny-482). |
irc-kick-message | The intended behavior is to log that we were kicked and then remove the channel from our active-channels. Currently we have a bug where, upon being kicked from a channel, we immediately try to rejoin all channels in fleetbot's channels field. [4] |
irc-notice-message | A notice message is an arbitrary message from the server. Different networks send different notice messages. For our purposes, we parse the message to see if the server is telling us that the nickname we're trying to use is registered. If so, then we pick a new randomized nick, just like we do when we receive irc-err_nicknameinuse-message. Afaik this is only useful for freenode. |
irc-pong-message | When we receive a pong message we calculate our lag with the server and mark that the server has received our ping. |
irc-rpl_welcome-message | Upon receiving the welcome message from the server, we start the ping-thread and connect our bot to its channels via cl-irc's join method. |
irc-privmsg-message | A privmsg is any normal user message sent to a target. [5] We log privmsgs to the irclog table in our postgres db. |
irc-part-message | A part message gets sent when a user leaves a channel. We log these to our irclog along with privmsgs. |
irc-join-message | A join message gets sent when a user joins a channel. We log joins to our irclog along with parts and privmsgs. If the join message says that we ourselves joined the channel, then we add the channel to our list of active channels, we log to fleetlog that we joined a channel, and we reset resurface-attempts to 0. |
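The nick fallback used by the nicknameinuse and notice hooks is small enough to sketch. The function name randomize-nick is hypothetical. The suffix format follows the description of current-nick in section I: a hyphen plus three random digits, always appended to the base nick so that suffixes do not pile up across retries.

```lisp
;; Hypothetical helper: return NICK with "-ddd" appended, each d a random
;; decimal digit. Zero-padded, so 7 becomes "007".
(defun randomize-nick (nick &optional (random-state *random-state*))
  (format nil "~a-~3,'0d" nick (random 1000 random-state)))
```

For example, (randomize-nick "jenny") returns a nine-character string such as "jenny-482".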
After we set all the above hooks, we enter an infinite loop reading messages from the server. If we have a hook for the message type of a message we received, we dispatch to the corresponding function. If we receive a message type that we don't have a hook for, the default-hook from cl-irc gets called, which usually just prints the message to STDOUT. If at any point during our infinite loop the server sends us an EOF or we hit any error, we attempt to reconnect.
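To make the dispatch concrete, here is a self-contained sketch of a hook table and read loop. It stands in for cl-irc rather than reproducing it: message types are plain keywords, and sketch-add-hook, sketch-dispatch, sketch-default-hook and sketch-read-loop are hypothetical names invented for this example.

```lisp
;; HOOKS is a hash table mapping a message type to a list of functions.
(defun sketch-add-hook (hooks message-type function)
  (push function (gethash message-type hooks))
  hooks)

(defun sketch-default-hook (message)
  ;; Fallback for message types nobody registered a hook for.
  (format t "UNHANDLED: ~s~%" message))

(defun sketch-dispatch (hooks message-type message)
  (let ((functions (gethash message-type hooks)))
    (if functions
        ;; PUSH stores newest first; reverse to call in registration order.
        (dolist (f (reverse functions))
          (funcall f message))
        (sketch-default-hook message))))

;; READ-MESSAGE-FN returns (values type message), or NIL at EOF.
;; ON-DISCONNECT is called after EOF or after any error in the loop.
(defun sketch-read-loop (hooks read-message-fn on-disconnect)
  (handler-case
      (loop
        (multiple-value-bind (type message) (funcall read-message-fn)
          (when (null type)
            (return))                   ; EOF from the server
          (sketch-dispatch hooks type message)))
    (error (e)
      (format t "error in read loop: ~a~%" e)))
  (funcall on-disconnect))
```

In fleetbot the role of on-disconnect is played by the reconnect logic described in the next section.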
IV. Reconnecting
Reconnecting is a bit tricky because there are two different threads that can call reconnect: the bot's run-thread and the bot's ping-thread. The run-thread would be reconnecting because the run-thread received an EOF from the server or hit an error during its operation. The ping-thread would be reconnecting because it hasn't received a pong to one of our pings from the server in *max-lag* seconds.
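The ping-thread's decision depends on the lag-track bookkeeping from section I. A minimal sketch of that bookkeeping follows, assuming each ping is remembered as an (id . sent-at) pair with times in seconds. How the real lag-track stores its reference to a ping is not shown in this article, so the representation and the function names here are guesses.

```lisp
(defparameter *max-lag* 60)

(defstruct lag-tracker
  (pending '())   ; list of (ping-id . sent-at) still awaiting a pong
  (lag nil))      ; most recently measured round trip, in seconds

(defun note-ping (tracker id now)
  (push (cons id now) (lag-tracker-pending tracker)))

(defun note-pong (tracker id now)
  ;; Record the lag for the matching ping and forget that ping.
  (let ((entry (assoc id (lag-tracker-pending tracker) :test #'equal)))
    (when entry
      (setf (lag-tracker-lag tracker) (- now (cdr entry))
            (lag-tracker-pending tracker)
            (remove entry (lag-tracker-pending tracker))))
    (lag-tracker-lag tracker)))

(defun timed-out-p (tracker now &optional (max-lag *max-lag*))
  ;; True when some ping has waited longer than MAX-LAG seconds.
  (some (lambda (entry) (> (- now (cdr entry)) max-lag))
        (lag-tracker-pending tracker)))
```

The ping-thread would call something like timed-out-p with (get-universal-time) on each iteration and start the reconnect flow when it returns true.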
The reconnect flow is as follows:
First, we call ircbot-disconnect. ircbot-disconnect sets sunk-p to true [6] and then logs an internal DISCONNECTED message to fleetlog for each of the active-channels the bot was connected to. Then we set active-channels to nil. If we have an active connection to the network, we close that connection. Then we set the ircbot's lag-track and ircbot-connection to nil. Then, if we are calling disconnect from the run-thread, we kill the ping-thread. If we are calling disconnect from the ping-thread, we kill the run-thread. [7] We then set the ping-thread to nil.
Once we've disconnected cleanly, we attempt to connect. If we are the run-thread we call ircbot-connect. If we are the ping-thread, we make a new thread with ircbot-connect-thread, and then call sb-thread:abort-thread to end the current thread. [8] Then we set sunk-p to false.
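The ordering of the steps above, and the rule that a thread only ever kills its sibling, can be sketched without real threads. Everything here is illustrative: sketch-bot, sketch-disconnect and sketch-reconnect are hypothetical names, threads are represented by arbitrary objects, and killing a thread or closing a connection is delegated to functions passed in by the caller. The ping-thread's "spawn a new thread and abort" step is not modelled.

```lisp
(defstruct sketch-bot
  (sunk-p nil)
  (active-channels '())
  (connection nil)
  (lag-track '())
  (run-thread nil)
  (ping-thread nil)
  (events '()))          ; stands in for rows written to fleetlog

(defun sketch-disconnect (bot current-thread
                          &key (kill-thread #'identity)
                               (close-connection #'identity))
  (setf (sketch-bot-sunk-p bot) t)
  (dolist (channel (sketch-bot-active-channels bot))
    (push (list "DISCONNECTED" channel) (sketch-bot-events bot)))
  (setf (sketch-bot-active-channels bot) nil)
  (when (sketch-bot-connection bot)
    (funcall close-connection (sketch-bot-connection bot)))
  (setf (sketch-bot-lag-track bot) nil
        (sketch-bot-connection bot) nil)
  ;; Kill the other thread, never the caller.
  (let ((other (if (eq current-thread (sketch-bot-run-thread bot))
                   (sketch-bot-ping-thread bot)
                   (sketch-bot-run-thread bot))))
    (when other
      (funcall kill-thread other)))
  (setf (sketch-bot-ping-thread bot) nil)
  bot)

(defun sketch-reconnect (bot current-thread connect-fn
                         &key (kill-thread #'identity)
                              (close-connection #'identity))
  (sketch-disconnect bot current-thread
                     :kill-thread kill-thread
                     :close-connection close-connection)
  (funcall connect-fn bot)
  ;; Only after disconnect and connect have both run do we clear sunk-p.
  (setf (sketch-bot-sunk-p bot) nil)
  bot)
```

Clearing sunk-p last mirrors the fix described in the footnote on sunk-p: clearing it before the disconnect step would leave every reconnected bot marked as sunk.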
V. Notes on logging
Our schema has two tables - irclog and fleetlog. We insert into irclog the types of messages one would normally see in their irc client: privmsg, join, part, and kick messages. Fleetlog, on the other hand, keeps a record of our bot's events that we choose to store. For each event we log: a custom message describing the event, the channel (if applicable), the nick (if applicable), the networkname, and the time the event occurred. The custom messages currently logged are:
Message | Description |
---|---|
JOINED | logged for a channel+nick when we receive a join message from the server where we were the ones who joined |
KICKED | logged for a channel+nick when we get kicked from a channel |
DISCONNECTED | logged for every active-channel when we call ircbot-disconnect |
COULD-NOT-RECONNECT | logged for every channel we failed to connect to when we have failed a consecutive series of attempts to reconnect |
ARMADA-ALL-DEAD | logged for a network when all bots (within a process) connected to a network have crashed |
PROCESS-TERMINATED | logged for a network when we end the process running fleetbot (either because all ships have sunk or because the process received a kill signal) |
The schema for the postgres db is pasted below. Noticeably missing is the indexing for fleetlog.
set search_path = public;
create table irclog (
    id serial primary key,
    target text not null,
    message text not null,
    host text,
    source text not null,
    "user" text,
    networkname text not null,
    irc_message_type text not null,
    received_at timestamp without time zone not null default (now() at time zone 'utc')
);
create index irclog_received_at on irclog (received_at);
create index irclog_target on irclog (target);
create index irclog_source on irclog (source);
create index irclog_networkname on irclog (networkname);
create table fleetlog (
    id serial primary key,
    channel text,
    message text not null,
    nick text,
    networkname text not null,
    received_at timestamp without time zone not null default (now() at time zone 'utc')
);
Also, when I run fleetbot, I redirect standard out to a log file. The log file contains prints of unhandled irc messages and debugging print statements I put inside fleetbot.
VI. A bug in cl-irc
I gave the cl-irc source a pass and noticed a bug that affects fleetbot. cl-irc has a global variable *unknown-reply-hook* that can be assigned a function. That function is supposed to be called any time an irc network sends a message of unknown type. However, the code that signals the no-such-reply error in cl-irc is malformed.
This [9]
(error "Ignore unknown reply." 'no-such-reply :reply-number reply-number)
should be:
(error 'no-such-reply :reply-number reply-number)
The bug in cl-irc made the no-such-reply error bypass cl-irc's own handler-case [10], throwing the error upstream to fleetbot. Fleetbot handles all errors by reconnecting. So upon receiving an unknown reply, fleetbot would reconnect and then usually receive the same unknown reply, causing fleetbot to reconnect again ad infinitum. I will patch cl-irc with the above fix and then set *unknown-reply-hook* to a function that logs unknown replies to fleetlog.
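The reason the handler is bypassed follows from how error treats its first argument: a string is taken as a format control, so the call signals a simple-error and the remaining arguments become unused format arguments. The standalone sketch below reproduces this with a locally defined no-such-reply condition. It does not load cl-irc, and signal-with and classify are hypothetical helpers.

```lisp
;; Local stand-in for cl-irc's condition type.
(define-condition no-such-reply (error)
  ((reply-number :initarg :reply-number :reader reply-number)))

;; Forwards its arguments to ERROR unchanged, so we can try both
;; argument lists from the article.
(defun signal-with (datum &rest arguments)
  (apply #'error datum arguments))

;; Report which handler-case clause ends up handling the error.
(defun classify (thunk)
  (handler-case (funcall thunk)
    (no-such-reply () :caught-no-such-reply)
    (error () :caught-other-error)))

;; Buggy argument list: the string makes ERROR signal a SIMPLE-ERROR.
(classify (lambda ()
            (signal-with "Ignore unknown reply."
                         'no-such-reply :reply-number 999)))
;; => :CAUGHT-OTHER-ERROR

;; Fixed argument list: the condition type comes first.
(classify (lambda ()
            (signal-with 'no-such-reply :reply-number 999)))
;; => :CAUGHT-NO-SUCH-REPLY
```

With the buggy form, any handler that matches only no-such-reply never fires, which is consistent with the error escaping to fleetbot's catch-all handler.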
VII. Problems with orchestration
I need to rewrite how I orchestrate the fleet of fleetbots. Here are the constraints I am dealing with:
1. An irc network typically only allows 3 connections per IP.
2. A VM with 1GB of RAM costs me $5 / month.
3. A unix process running sbcl and asdf consumes 30 MB of RAM at minimum. Currently a running fleetbot consumes closer to 100 MB.
4. A unix sbcl process has a maximum number of threads that it can have running concurrently. [11]
5. A unix sbcl process has a maximum number of sockets it can have open.
6. A unix process has a maximum number of file descriptors it can use (adjustable with ulimit).
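Constraints 1 and 4 can be combined into a rough upper bound on bots per VM. The sketch below rests on assumptions that are not established facts about the deployment: one IP per VM, one connection per bot, two threads per bot (the run-thread and the ping-thread), all bots in a single sbcl process, and the thread limit of 2048 recalled from memory in the footnote. Memory is left out because the per-bot RAM cost inside a shared process is not known. The name max-bots-per-vm is hypothetical.

```lisp
;; Rough upper bound only; every default is an assumption (see above).
(defun max-bots-per-vm (network-count
                        &key (connections-per-ip 3)
                             (threads-per-bot 2)
                             (thread-limit 2048))
  (min (* network-count connections-per-ip)      ; per-IP connection cap
       (floor thread-limit threads-per-bot)))    ; per-process thread cap

(max-bots-per-vm 10)    ; => 30, bounded by the per-IP connection limit
(max-bots-per-vm 1000)  ; => 1024, bounded by the thread limit
```

Under these assumptions the per-IP limit is the binding constraint until a single VM covers several hundred networks.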
VIII. Next steps [12]
1. Publish vpatch addressing ircbot's reconnect bug
2. Fix reconnect on kick bug
3. Increase delay before reconnecting to a network
4. Patch cl-irc + create way to distribute fix to VMs (after I patch I can no longer load cl-irc via quicklisp)
5. Set *unknown-reply-hook* to a function that logs unknown replies to fleetlog
6. Fix fleetbot's db schema
7. Write article planning how to address orchestration of bots
1. To join a channel, we send the server a join message. If the join is successful, the server sends the same join message back to us.
2. Under the hood, connection uses usocket:socket-connect to create a socket connected to the irc network. Then the socket is passed to usocket:socket-stream to get the network-stream. cl-irc creates an output-stream (used for sending messages to the network) by passing the network-stream to flexi-streams:make-flexi-stream. I haven't explored the usocket or flexi-streams libraries yet.
3. A handler-case is Common Lisp's version of a try/catch block.
4. I discovered this only now, and it is the cause of problems I've run into. The weird part of the bug is that we rejoin all channels when we are kicked from only one channel. This may explain why I saw "ERR_TOOMANYCHANNELS" messages despite limiting the length of channel-list to the max number of channels allowed per nick on the network.
5. A target is either a nick or a channel.
6. I realized that ircbot-reconnect was setting sunk-p to false before we call ircbot-disconnect. So all of our bots that reconnected once were being incorrectly marked as sunk. This has been fixed.
7. This is handled incorrectly in the current version of ircbot. In ircbot, the ping-thread kills the ping-thread (itself) when trying to reconnect, thus crashing the bot. Since there are a few pieces of republican infrastructure sitting on top of ircbot, it is a top priority to create a vpatch that fixes this.
8. But I realize there is no reason for this "create new thread and then self destruct". At this point the ping-thread can stop doing its pinging job and just become the run-thread.
9. I do not know why the author of cl-irc put the string "Ignore unknown reply." as the first parameter. The first parameter should instead be the condition type.
10. try/catch block
11. 2048 iirc. There is likely a way to increase this number.
12. Updated from my last plan