
unresponsive (from error?) #237

Closed · wants to merge 3 commits into from
Conversation

@ihanson (Contributor) commented Mar 21, 2012

aleph.sagemath.org is now unresponsive, and here is the last bit of the stdout log:

InvalidDocument: BSON document too large (21829064 bytes) - the connected server supports BSON document sizes up to 16777216 bytes.
ERROR:root:Exception in I/O handler for fd <zmq.core.socket.Socket object at 0xaa9a70>
Traceback (most recent call last):
  File "/padic/scratch/jason/sage-4.7.1-sage.math.washington.edu-x86_64-Linux/local/lib/python2.6/site-packages/zmq/eventloop/ioloop.py", line 330, in start
    self._handlers[fd](fd, events)
  File "/padic/scratch/jason/sage-4.7.1-sage.math.washington.edu-x86_64-Linux/local/lib/python2.6/site-packages/zmq/eventloop/zmqstream.py", line 391, in _handle_events
    self._handle_recv()
  File "/padic/scratch/jason/sage-4.7.1-sage.math.washington.edu-x86_64-Linux/local/lib/python2.6/site-packages/zmq/eventloop/zmqstream.py", line 424, in _handle_recv
    self._run_callback(callback, msg)
  File "/padic/scratch/jason/sage-4.7.1-sage.math.washington.edu-x86_64-Linux/local/lib/python2.6/site-packages/zmq/eventloop/zmqstream.py", line 365, in _run_callback
    callback(*args, **kwargs)
  File "trusted_db.py", line 95, in <lambda>
    stream.on_recv(lambda msgs:callback(db,key,pipe,fs_auth_dict if isFS else db_auth_dict,rep,msgs,isFS), copy=False)
  File "trusted_db.py", line 159, in callback
    db.add_messages([c[0] for c in content])
  File "/padic/scratch/jason/simple-python-db-compute/db_mongo.py", line 112, in add_messages
    self.database.messages.insert(messages)
  File "/padic/scratch/jason/sage-4.7.1-sage.math.washington.edu-x86_64-Linux/local/lib/python2.6/site-packages/pymongo/collection.py", line 310, in insert
    continue_on_error, self.__uuid_subtype), safe)
  File "/padic/scratch/jason/sage-4.7.1-sage.math.washington.edu-x86_64-Linux/local/lib/python2.6/site-packages/pymongo/connection.py", line 807, in _send_message
    (request_id, data) = self.__check_bson_size(message)
  File "/padic/scratch/jason/sage-4.7.1-sage.math.washington.edu-x86_64-Linux/local/lib/python2.6/site-packages/pymongo/connection.py", line 784, in __check_bson_size
    (max_doc_size, self.__max_bson_size))
InvalidDocument: BSON document too large (21829064 bytes) - the connected server supports BSON document sizes up to 16777216 bytes.
Traceback (most recent call last):
  File "device_process.py", line 793, in <module>
    keys=keys, resource_limits=resource_limits)
  File "device_process.py", line 343, in device
    for X in db.get_input_messages(device=device_id, limit=-1):
  File "/padic/scratch/jason/simple-python-db-compute/db_zmq.py", line 35, in f
    output=self.socket.recv_pyobj()
  File "socket.pyx", line 801, in zmq.core.socket.Socket.recv_pyobj (zmq/core/socket.c:7113)
cPickle.UnpicklingError: invalid load key, '{'.

@kramer314 (Contributor)

I'm not sure about the pickling error, but for the first error, what version of MongoDB is aleph.sagemath running? A brief Google search on the traceback message suggests that pre-1.8 versions have this issue.

@jasongrout (Member, Author)

It's running mongodb-linux-x86_64-1.8.1.

@kramer314 (Contributor)

Hmm, then it seems like someone's input just generated a massive output message, and the DB complained about it being larger than 16MB (hence the error). Implementing gh-139 could be a good long-term solution, but the quickest solution seems to be to just catch the error when things are added to the database and drop the message. Or, do we want to find a way to let the user know that their output was too large / was rejected?
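The "catch and drop" quick fix suggested above could look roughly like the sketch below. This is a self-contained illustration, not code from the repository: `InvalidDocument` and `FakeCollection` here stand in for `pymongo.errors.InvalidDocument` and the real `self.database.messages` collection, and the shape of the error-notice message is invented for the example.

```python
MAX_BSON_SIZE = 16 * 1024 * 1024  # server default in MongoDB 1.8


class InvalidDocument(Exception):
    """Stand-in for pymongo.errors.InvalidDocument."""


class FakeCollection(object):
    """Stand-in for the real messages collection, so the sketch
    runs without a MongoDB server."""

    def __init__(self):
        self.docs = []

    def insert(self, doc):
        # Crude size check standing in for pymongo's BSON size check.
        if len(doc.get("content", "")) > MAX_BSON_SIZE:
            raise InvalidDocument("BSON document too large")
        self.docs.append(doc)


def add_messages(collection, messages):
    """Insert messages one at a time; replace any oversized message
    with a short error notice so the user learns it was rejected."""
    for m in messages:
        try:
            collection.insert(m)
        except InvalidDocument:
            collection.insert({"msg_type": "error",
                               "content": "Output too large; message dropped."})


coll = FakeCollection()
add_messages(coll, [{"content": "ok"},
                    {"content": "x" * (MAX_BSON_SIZE + 1)}])
print([d.get("msg_type", "ok") for d in coll.docs])
```

Inserting one message at a time is slower than a batch insert, but it guarantees that a single oversized document can no longer abort the whole batch and wedge the workers.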

@jasongrout (Member, Author)

This just halted all of the workers again. As a quick fix, let's just drop the computation (or if there's an easy way, insert a message back to the user saying there was an error).

@jasongrout mentioned this pull request Mar 15, 2012
@jasongrout (Member, Author)

Nice, and I agree with Alex's comment. I'm also really curious: how are you attaching code to an existing issue? I know you can do that from the API; are you using the API?

@ihanson (Contributor) commented Mar 22, 2012

Yes, I've been using the GitHub API.

@jasongrout (Member, Author)

Do you happen to have a shell script or python script you could share with us? I know it's easy enough to write my own, but I thought maybe you might have something already written.

@ihanson (Contributor) commented Mar 22, 2012

Here: https://gist.github.com/2156799
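For context on what such a script does: the GitHub v3 API of that era let you attach a pull request to an existing issue by POSTing to /repos/:owner/:repo/pulls with an issue number in place of a title. The sketch below only builds the request payload (no network call); the function name and branch values are hypothetical, and ihanson's actual script is the gist linked above.

```python
import json


def attach_pr_payload(issue_number, head, base="master"):
    """Build the JSON body for POST /repos/:owner/:repo/pulls that
    converts an existing issue into a pull request (GitHub API v3).
    head is the 'user:branch' holding the commits to attach."""
    return json.dumps({"issue": issue_number,
                       "head": head,
                       "base": base})


# Hypothetical example: attach a branch to issue #237.
print(attach_pr_payload(237, "ihanson:fix-branch"))
```

The payload would then be sent with authenticated POST (e.g. via curl or urllib) to the repository's pulls endpoint.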

"sequence": m["sequence"]})
log("INSERTED: %s"%('\n'.join(str(m) for m in success),))
if len(success) < len(messages):
log("FAILED TO INSERT %d message(s)" % (len(messages) - len(success)))
(inline review comment by a Member on the lines above)
Oops, there is a tab on this line. That needs to be changed to spaces.

jasongrout pushed a commit to jasongrout/sagecell that referenced this pull request Mar 29, 2012
… referee fixes by jasongrout)

The problem was that messages were exceeding the MongoDB limit on record size. Now we just insert an error into the message stream when that happens.
@jasongrout (Member, Author)

I guess that's what I get for adding a few small referee commits and merging; since the merge didn't contain your tab commit, it appears that I didn't merge your branch. But I made the same tab fix, plus a few other fixes related to printing enormous messages in the logs.
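One plausible shape for a log fix like that is clipping huge messages before they reach the logs. This is only a sketch of the idea, not the actual commit: `log_repr` and the `MAX_LOG` threshold are assumed names and values.

```python
MAX_LOG = 1000  # assumed threshold, not a value from the repository


def log_repr(msg, limit=MAX_LOG):
    """Return a log-safe representation of msg, truncating anything
    longer than limit so enormous outputs don't flood the logs."""
    s = str(msg)
    if len(s) > limit:
        return s[:limit] + "... (%d bytes truncated)" % (len(s) - limit)
    return s


print(log_repr("x" * 5000))  # prints 1000 x's plus a truncation note
```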

@jasongrout jasongrout closed this Mar 29, 2012
3 participants