Threads and Signals in Python
Published: March 26, 2008
Tags: python signals threading programming
For the last few days I've been working on a web spider (also known as a web crawler - see Wikipedia), in Python. This is something I've been thinking about doing for a while, simply because it always seemed like it would be a good, fun, educational programming challenge. I've been motivated to actually go ahead and write one just now mainly on account of my burgeoning interest in natural language processing, web searching, the semantic web and the overlap between these three. I have a few ideas for projects that I'd like to try out some day and they all require having a local copy of a small subset of the web to tinker with. A spider seems the natural way to achieve this.
I'm very happy with how the spider is progressing and I'll write about it in some detail closer to when I actually release it, which shouldn't be far off. (And, no, I haven't forgotten about feedformatter; there's a new version of that in the works too.) The point of this entry is to discuss the interplay of threads and signals in Python, which is something I had to contend with today.
My spider is multi-threaded. The main thread creates an instance of a UrlQueue, which is just a simple subclass of the standard library's Queue object, and then spawns a number of worker threads. Each worker pulls URLs off this queue, downloads the pages at those URLs and parses the HTML looking for links, placing any new URLs it finds onto the queue to be handled later by the same or a different thread. The whole thing is run from the command line, so I'd really like it if, when the user hits Ctrl+C, each of the threads could finish dealing with its current URL and then stop, so that the whole crawl finishes within a few extra seconds.
Those readers with a bit of Unix background will know that what Ctrl+C actually does is send a "signal" (specifically, SIGINT) to the process. You can read up on signals at Wikipedia. Any modern Unix has a signal system call which lets you register a "signal handler", a function which is called upon receipt of a signal. Python gives you access to this system call via the signal module, so you can register a signal handler for SIGINT and make Ctrl+C do whatever you like. The default SIGINT handler, by the way, simply raises a KeyboardInterrupt exception, so if you don't want to use signals you can put your entire program in a try/except structure and get more or less the same effect.
The first problem is that Python's signal module documentation explicitly states that when multiple threads are running, only the main thread (i.e. the thread that was created when your process started) will receive signals. So the signal handler will execute only in one thread and not in all of them. In order to get all threads to stop in response to a signal, you need the main thread's signal handler to communicate the stop message to the other threads. You can do this in plenty of different ways, perhaps the simplest being to have the main thread flip the value of a boolean variable that all threads hold a reference to. This is not a huge problem, and I've done things like this before.
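The shared-flag idea might look roughly like the sketch below. It is written for Python 3 and swaps the plain boolean for a threading.Event, which plays the same role; worker, run_demo and the timings are invented for the example, and the call to stop_event.set() stands in for what a SIGINT handler in the main thread would do.

```python
import threading
import time


def worker(stop_event, handled):
    """Pretend to process items until the shared flag is set."""
    while not stop_event.is_set():
        handled.append(1)   # stand-in for handling one URL
        time.sleep(0.01)


def run_demo(num_threads=3):
    stop_event = threading.Event()   # the flag every thread can see
    handled = []
    threads = [
        threading.Thread(target=worker, args=(stop_event, handled))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    time.sleep(0.05)
    # In a real program the main thread's signal handler would do this.
    stop_event.set()
    for t in threads:
        t.join(timeout=2)
    return [t.is_alive() for t in threads]
```

Each worker only checks the flag between items, so it finishes whatever it is currently doing before it exits, which is the behaviour I want for the spider.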
To my surprise today, this approach just wasn't working. I put a print statement in my signal handler and discovered that even the main thread wasn't receiving the SIGINT signal, even though it was definitely supposed to.
This leads to the second problem involved in mixing threads and signals. When you send a signal to a multi-threaded Python program, that signal is put into a queue. The main thread processes signals from that queue and invokes the relevant handlers, but - and here's the catch - it doesn't do this until it has something else to do as well. That is, if your main thread fires off a group of worker threads and then sits there doing nothing while they work, then as far as Python's thread scheduler is concerned there is no need to give that main thread any CPU time while the worker threads are actually doing something. So your SIGINTs - and, indeed, any other signals - just pile up in the queue and are never handled. Note that "doing nothing" while the worker threads work includes sitting in a blocked state after a call to the join method of a worker thread.
This means that if you want your main thread to be able to catch a Ctrl+C and shut down all the worker threads, you need to make sure your main thread is doing something while the others work. This doesn't have to be anything useful, of course; you can just call sleep in a loop every second or so. The code I am now using looks a bit like this, and seems to work as intended:
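from time import sleep
# WorkerThread and num_threads are defined elsewhere in the spider
# Start threads
threads = []
for i in range(0, num_threads):
    thread = WorkerThread()
    threads.append(thread)
    thread.start()
# Wait for threads to finish
while True:
    # Note: isAlive() is spelled is_alive() from Python 2.6 onwards
    if not any([thread.isAlive() for thread in threads]):
        # All threads have stopped
        break
    else:
        # Some threads are still going
        sleep(1)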
With this code, if I hit Ctrl+C while the worker threads are working, the SIGINT gets put in the main thread's signal queue. After no more than one second, the sleep call in the infinite loop returns and the main thread has something to do (check if all the threads have stopped yet). It thus gets a slice of CPU time from the thread scheduler and so gets a chance to handle any signals which have built up. If you need a bit more responsiveness, you can sleep for less than a second, but the less you sleep the more CPU time your main thread will chew up evaluating the expression in the if statement. While on that subject, the any in that if statement is a new built-in function that appeared in Python 2.5. An equivalent statement that should work in earlier versions is:
if not True in [thread.isAlive() for thread in threads]:
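For anyone who wants something they can run as-is, here is a self-contained sketch of the same sleep-in-a-loop idea, updated for Python 3 (is_alive instead of isAlive). It is not the spider's actual code: this WorkerThread, wait_for_threads, crawl, the work_items counter and the timings are all stand-ins, and it relies on the default SIGINT behaviour of raising KeyboardInterrupt in the main thread rather than a custom handler.

```python
import threading
import time


class WorkerThread(threading.Thread):
    """Hypothetical worker: handles a fixed number of items unless told to stop."""

    def __init__(self, stop_event, work_items):
        super().__init__()
        self.stop_event = stop_event
        self.work_items = work_items   # stand-in for the spider's URL queue

    def run(self):
        for _ in range(self.work_items):
            if self.stop_event.is_set():
                break
            time.sleep(0.01)   # pretend to download and parse one page


def wait_for_threads(threads, poll_interval=0.1):
    """Sleep in short steps so the main thread wakes up regularly
    and gets a chance to react to any pending signal."""
    while any(t.is_alive() for t in threads):
        time.sleep(poll_interval)


def crawl(num_threads=4, work_items=5):
    stop_event = threading.Event()
    threads = [WorkerThread(stop_event, work_items) for _ in range(num_threads)]
    for t in threads:
        t.start()
    try:
        wait_for_threads(threads)
    except KeyboardInterrupt:
        # Ctrl+C arrived: ask the workers to stop, then wait for them.
        stop_event.set()
        wait_for_threads(threads)
    return threads
```

The poll_interval parameter is the same trade-off as the sleep(1) above: a smaller value makes Ctrl+C feel more responsive, at the cost of the main thread waking up more often.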
I hope that this is helpful to somebody at some stage. Also, to give credit where credit is due, this email by James Henstridge on the PyGtk mailing list is where I got the insight to realise how to fix my spider. Thanks, James! It's probably also worth noting that there is a recipe in the ASPN Python Cookbook by Allen Downey which proposes a solution to this problem involving the fork system call - the main thread calls fork to create a child process. The worker threads are spawned in the child process, leaving just one thread in the parent process which can thus also receive a signal. The parent process can catch SIGINT and then kill its child to get the desired effect. I feel that this approach is a bit uglier than sleeping in a loop, but it may have advantages that make it the better choice under certain circumstances.