-
Notifications
You must be signed in to change notification settings - Fork 1k
USB device cannot be re-opened after "stuck USB write" #145
Comments
Well, this bug normally only happens like 1/25 times, but yesterday it disconnected me in a match. Normally my robot literally doesn't care about ESD at all, but if it's a particularly strong zap then the USB comms will be interrupted for usually around 300ms. However yesterday Android decided that a real match would be nice time bite :( I immediately knew it was this bug when it happened on the field, due to the tell-tale sign of the "problem with ExH portal" message appearing 5 seconds after the incident, and a postmortem analysis confirmed I was correct. (I.e. logs indicated "stuck USB write", and after carrying the bot off the field and replugging the OTG adapter, it picked it up right away) I may have to take a look at the Android USB code and see if I can find anything suspicious. |
IIRC a while back we determined that it was the hardware itself that got in a stuck state, and the only recourse was a power cycle. Is your venue particularly susceptible to ESD? If so, you might want to talk to the hosts about giving the ESD mitigation whitepaper a read. https://www.firstinspires.org/sites/default/files/uploads/resource_library/ftc/analysis-esd-mitigation-echin.pdf |
@cmacfarl well that's not exactly the behavior I'm seeing. If I replug the actual OTG adapter it comes back. Replugging the ExH from the OTG adapter does nothing, though. They do not take any mitigation measures as far as I'm aware, so I may mention that to them :) |
Two more more notes:
|
Update: I've been doing some digging. Just for the heck of it, I unbound the Linux kernel driver for the battery management controller (which also happens to control turning on and off the boost converter for 5v out when using USB OTG). This had the result of the phone not supplying any power at all to the Expansion Hub (i.e. no LED instead of flashing orange), but comms were established if the Expansion Hub had 12v power. However, I was still able to reproduce the issue with the driver for the battery management controller unloaded. (Not that I expected it to change anything, but you never know) |
Hi @FROGbots-4634 I have no real idea what you mean when you say: " I unbound the Linux kernel driver for the battery management controller ", but your comment about the phone not powering the Expansion hub had my interest piqued. At the Tech Team's last conf call, I was asking if it was an expected behavior that when I plug my Moto G phones into an unpowered expansion hub for the first time, that it does not connect and light the orange LED? The ZTE's always did this, and I thought the Moto G phones did too More recently with the Moto G phones, I've noticed what you describe, if I plug into a cold expansion hub, nothing happens until I power it's 12V on. Then the app auto launches and the lights go green. THEN if I power down, I get the expected flashing orange. How/Does your discussion about the boost converter impact what I'm seeing. What determines when the App sees the expansion hub and auto launches? Could it be something as trivial as the the type of OTG cable i'm using. I'm just following a thread down a rabbit hole, and wondering if changed OS behavior is making our disconnects more brutal. |
@gearsincorg The Linux kernel allows drivers to be manually bound and unbound from a device without having to remove the whole driver with The battery management controller used in my Nexus 5 RC phone is the Texas Instruments BQ24192 and is controlled via an internal I2C bus. For reference, here is the driver for the BQ24192 in my Nexus 5. I simply unbound the I2C driver for the battery management controller by doing:
Now, regarding your comments of:
So, as I previously mentioned, the driver for the battery management controller decides whether or not to enable the boost converter. With that driver unbound, the boost converter will never be enabled. Thus, with the driver unbound, if I remove the 12v power after comms were established, the ExH goes dark again. (Just for clarity's sake: if I leave the driver bound, I get the expected flashing orange when connecting to a cold ExH) It is possible that the behavior you are seeing is caused by the driver for battery management controller in your OS. However, it is also possible that it could be the type of OTG cable your are using, as apparently either connecting the ID pin of the USB plug directly to GND, or to GND through 124 kΩ, 68 kΩ, or 36.5 kΩ resistors causes different behavior. |
I've been hardcore pursuing this issue. Just thought I'd drop a note to say I captured the kernel log when this happens:
|
In this case is the Expansion Hub the "Host" ? |
Negative. The "host" is the "DesignWare USB3 DRD Controller" in the phone. |
I compiled a custom version of the Linux kernel for my phone with a workaround (though maybe "hack" would be a better term) for this issue. I edited if (!debugfs_create_file("frog_reset", S_IRUGO | S_IWUSR,
dwc3_debugfs_root, mdwc, &frog_reset_write_fops))
goto error; I added this file-ops listener to it: const struct file_operations frog_reset_write_fops = {
.write = frog_reset_write,
}; So, when the debugfs file is written to, this function is invoked: static ssize_t frog_reset_write(struct file *file, const char __user *ubuf,
size_t count, loff_t *ppos)
{
char buf[8];
memset(buf, 0x00, sizeof(buf));
if (copy_from_user(&buf, ubuf, min_t(size_t, sizeof(buf) - 1, count)))
return -EFAULT;
if (!strncmp(buf, "zombie", 6))
{
frog_toggle_host_mode();
printk("FTC: FROG RESET WRITE\n");
}
return count;
} As you can see, if write "zombie" to the debugfs file, then That is a function I added to a different file, void frog_toggle_host_mode(void)
{
printk("FTC: dwc3_otc.c: FROG_TOGGLE_HOST_MODE\n");
dwc3_otg_start_host(context, 0);
dwc3_otg_start_host(context, 1);
} Essentially, it simply forces the controller into peripheral mode, then back to host mode. (Note, I had to add I can trigger the reset simply by doing:
Here's a video showing this in action. First, I demonstrate that the RC is in the bad state where it doesn't pick up the ExH even after multiple tries unplugging and replugging the USB cable from the OTG adapter. Then, I leave the ExH plugged in and run the aforementioned shell command from a remote shell on my computer, and the phone picks it up right away (notice the LED change from flashing blue to green). So, now I just need to setup my OpMode with a special button combo on my gamepad which will cause the RC to locally invoke that shell command. Then I can essentially have a "zombie apocalypse" trigger to recover comms in a real match, should I ever need to use it. (Of course, now that I've done all the work to implement this, watch how I'll never need to use it sigh). |
Freaking cool.. So, the $60K question... |
Sadly, this is not easily portable to other phones. Recompiling a modified kernel for each phone would be required. I'm not sure whether the CH is affected by this issue or not - that would depend on the USB controller is uses as well as how the ports are physically wired to said controller. Additionally, this is very much a "hack" (for one thing, it crashes the kernel if you run it without an OTG adapter connected). However, this is a good starting point for hunting down the bug. According to the kernel log I posted earlier, it appears that what is actually happening when the bad state occurs is that the USB controller is turning off host mode while the OTG adapter is still connected. That's why replugging the OTG adapter fixes the issue but replugging the ExH from the adapter does not: when the adapter is reconnected, host mode is activated again. Or so goes my theory. I can see about trying to upload a diff of the changed files so that you could recompile the kernel for your phones, if you're interested. |
Thanks, but I don't think I'm ready to go down that rabbit hole :) What I find interesting is that the phone is reporting:
So, I wonder how it determines this. Is it the 2 data line levels going to a default state. Or could it be the 5th OTG pin level dropping. |
Well, if we look at //detect pin would not work if we don't configure and request USB_ID pin (gpio 77)
ret = request_threaded_irq(anx7808->cab_irq, NULL, anx7808_cbl_det_isr,
IRQF_TRIGGER_RISING
| IRQF_TRIGGER_FALLING,
"anx7808_cabel_det", anx7808); Here's the ISR itself: static irqreturn_t anx7808_cbl_det_isr(int irq, void *data)
{
struct anx7808_data *anx7808 = (struct anx7808_data *)data;
int status;
SP_DEV_DBG("%s : gpio_get_value(anx7808->pdata->gpio_cbl_det) = %d\n", __func__,gpio_get_value(anx7808->pdata->gpio_cbl_det));
if (gpio_get_value(anx7808->pdata->gpio_cbl_det)) {
wake_lock(&anx7808->slimport_lock);
SP_DEV_DBG("%s : detect cable insertion\n", __func__);
queue_delayed_work(anx7808->workqueue, &anx7808->work, 0);
msm_otg_id_pin_irq_enabled(false);
} else {
SP_DEV_DBG("%s : detect cable removal\n", __func__);
status = cancel_delayed_work_sync(&anx7808->work);
if(status == 0)
flush_workqueue(anx7808 ->workqueue);
wake_unlock(&anx7808->slimport_lock);
msm_otg_id_pin_irq_enabled(true);
}
return IRQ_HANDLED;
} NB: Slimport is a technology similar to MHL, i.e. for HDMI out over USB. (The Nexus 5 has an ANX7808 Slimport chip). |
I was under the impression that 5.4 had some sort of fix for this: "Fixes system responsiveness issue when an Expansion Hub is disconnected" Indeed, when we get the "stuck USB write" error, we are now getting a different orange error message saying that the expansion hub is detached, and the Driver Station remains responsive. But we still aren't able to restart the robot without pulling the OTG cable. Is anyone aware of what the behavior is supposed to be? Or what we can do to reset the robot without pulling the OTG? |
@rischconsulting Negative.
This IS NOT a bug in the FTC SDK. It is either hardware bug in the USB transceiver, or a bug in the kernel USB transceiver driver. There is nothing you can do to recover from this without unplugging and replugging the OTG cable, unless you feel like recompiling the Linux kernel for your phone to add a hook in the debugfs that invokes (very hacky) reset of the USB transceiver like I did. |
We just might - we're pulling our hair out. I guess we need to figure out the cause of the error. It is interesting, though, that the error reported has now changed, which means that somebody did something on this. |
The cause of the error is a completely random chance when the USB device disconnects. It can be reproduced without ESD, by simply continually unplugging and replugging the OTG adapter. Eventually, on one of the unplugs, you will hit a stuck USB write.
The error changes because the SDK now detects unresponsive hubs as opposed to just outright missing hubs. Since everything disappears permanently off the USB bus, the SDK cannot communicate with the Hub and marks it unresponsive. The changes that were made in v5.4 have absolutely nothing to do with the stuck USB write condition specifically. You can review the diff for yourself if you don't believe me :) |
I totally believe you - I already read through the diffs and found nothing, which is why I asked! The randomness is killer - sometimes we can run for hours with no error, and sometimes we get one every minute. If it's android based, I wonder whether there's a version of Android that doesn't have the problem. Our e4's are on 8.0 |
I don't think it's Android version specific. I think it's caused by either certain USB transceivers or the drivers for said transceivers. What I did was I bound a button on my gamepad to invoke the hook in my kernel to reset the USB transceiver: if(rstToggle.shouldToggle(gamepad2.right_stick_button))
{
new RunShellCommand().runAsRoot("echo zombie > /sys/kernel/debug/msm_dwc3/frog_reset");
} Then if I ever hit the issue in a match, I can be back up and running again in <2s by simply pressing a button on the gamepad. |
At the end of the day, this is likely going to need a fix at the driver
level. But... I suspect that the best chance for this is going to be for
something like what you have done.
========
Kirk Wood
…On Thu, Feb 6, 2020 at 10:17 AM FROGbots-4634 ***@***.***> wrote:
I don't think it's Android version specific. I think it's caused by either
certain USB transceivers or the drivers for said transceivers. What I did
was I bound a button on my gamepad to invoke the hook in my kernel to reset
the USB transceiver:
if(rstToggle.shouldToggle(gamepad2.right_stick_button))
{
new RunShellCommand().runAsRoot("echo zombie > /sys/kernel/debug/msm_dwc3/frog_reset");
}
Then if I ever hit the issue in a match, I can be back up and running
again in <2s by simply pressing a button on the gamepad.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#145?email_source=notifications&email_token=ACLP2R736FDC4GKSYHK5WJ3RBQZY3A5CNFSM4JO35K22YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEK7ZZNA#issuecomment-582982836>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACLP2R2YSMEUZFEKG43UUADRBQZY3ANCNFSM4JO35K2Q>
.
|
If it's a hardware bug, then a fix at the driver level would probably not be possible. I don't really know whether it's a driver bug or a hardware bug.... For my patch though, I could probably make it much less hacky (i.e. not able to crash the kernel if run without USB devices connected) by not just outright ignoring the massive state machine in the driver, lol. On the other hand, I need it to serve one purpose, and it does that nicely. Of course, this is a non-issue with Control Hubs (well unless you connect the second Hub over USB instead of 485) because in the CH the embedded Lynx PCB is connected directly to the Android SBC over UART. Nit for posterity: When connecting to a Hub over USB, you're actually connecting to an FTDI USB to UART bridge... the Control Hub just removes the middleman FTDI. |
During a scrimmage yesterday, static on the field was very very bad (I was literally hearing arcs when driving under the skybridge), and during one match I got a stuck USB write and gave my "zombie apocalypse" button its first test when it really mattered - and it brought it back :) Of course I'd prefer for this never to happen in the first place, but it's comforting to know that static isn't going to knock me out for the remainder of a match. |
I'd like to add my own observations over the last several events I've FTA'd (most notably from this weekend) with regards to this issue. I agree that this issue is not necessarily caused by ESD, by anecdote that at some events where this happens no static is noticed during normal field operation (touching robots/field without any shocks all weekend). I started to pull logs any time this issue arises and actually started to comb through them to understand what is happening. Prior to this weekend, it was always a
This weekend I noticed a different pattern with a different configuration. The typical configuration is Phone -> OTG adapter -> Exp hub. In the following case, it's Phone -> OTG adapter -> USB hub -> Exp hub and camera. What the logs show in this case is that the phone's USB doesn't get stuck, but rather doesn't get acks back. In the typical configuration, a physical unplug/replug of the OTG adapter from the phone resets the connection, but in this case the connection between the USB hub and the Exp hub has to be "replugged".
and so on, until something like the app crashes, you physically restart the app, or replug the Exp hub, or such. Now, for some speculation. The first lead is the observation in differing behavior between the hardware configurations (having a USB hub middleman). The phone gets stuck on USB when tied in directly to the Exp hub, versus having messages dropped with the USB hub middleman. The commonality in connection being reset is the Exp hub, since in the normal configuration it's the phone/Exp hub link, and in the USB hub configuration its the Exp hub/USB hub link. The roads are leading to problems with the Exp hub, in one or more ways. The first speculation is that the Exp hub's FTDI firmware/hardware is freezing up (going into a bad state) in a certain circumstance. The other speculation is that the Exp hub is freezing up its host's FTDI driver/firmware/hardware. This is a wild, but somewhat educated, speculation. I don't know much about how the USB protocol, hardware, or lowest level driver work. |
Noting that (to add color to FrowntownPitt's speculation):
|
Wait. So you're saying that with the middleman configuration, replugging the OTG adapter from the phone does not fix the issue, but replugging the ExH from the USB hub does fix it? Interesting. I ran with the middleman configuration last year, and I'm pretty sure I got stuck USB writes that could be fixed by replugging the OTG adapter, but I wasn't keeping very close tabs on it then so I can't say that with certainty.
I don't think it's the first one, because Lynx firmware v1.8.2 it will pull the reset line on the FTDI if a KA-timeout occurs. Besides, if it was that, a replug of the connection to the OTG adapter would fix it. Of your two I would think the second is more likely. But here's the thing to remember - I can reproduce this without ESD at all simply by unplugging the OTG adapter while the connection is live. Nearly always that simply produces an "ExH portal missing" error like you'd expect, but occasionally you'll get the stuck USB write. I'm left scratching my head trying to figure out how the FTDI could freeze the host if it's been unplugged.... I think @gearsincorg may be on to something when he said:
I posted the ISR that produced that message earlier. It looks like it was simply listening for the rising and falling edges of the USB_ID pin. So this seems to implicate that electrically something very funny is going on, with the state of the USB_ID pin bouncing perhaps and confusing the host driver? I think that, at the end of the day, the only way to get to the bottom of this is to snoop the USB bus with both a USB packet sniffer and an o-scope while reproducing the problem in a controlled environment without ESD. |
Regarding point 3, when that happened, how could the fault be cleared? Replugging OTG adapter? Or replugging ExH from USB hub? |
OTG to phone. We couldn't unplug ExH from USB hub if we wanted to (not at the ExH, anyway) |
It's just so odd that although ESD can cause a temporary USB disconnection which could induce the issue, ESD itself is not the cause. If it really was the USB_ID pin state bouncing, wouldn't ESD cause it to bounce? On a related note, got some nice arcs while testing auto last night. The effect of the specific arc in the photo below was a USB disconnection, causing my auto to abort itself due to getting all 0 data from sensors, and USB reconnection in quick succession. But no stuck USB write. |
Yes. It required a replug on the expansion hub link (from the USB hub, since the teams were using the retention mount on the expansion hub. In these particular configurations from this weekend, I will note they used the REV USB hubs. I do not know if they were powered separately or not. I'm going off of only a few scenarios from the limited exposure I have to robots that have problems. As an aside, there was a single instance where I noticed something interesting, from an event a couple of weeks ago. One of the teams (9872, (In)Formal Logic) dropped in semifinals, and in my typical postmortem I decided to do something a little differently. The wire configuration was OTG adapter (micro to A female), then A-to-mini. I replugged the A adapter side (not the micro from the phone port), and this did not reset the connection. It was only after I replugged the micro port from the phone that it worked. I was never able to reproduce this situation, across 3 more events, as the replug from the A side did end up resetting the connection properly.
I had this speculation with the scenario from 9872, that it has something to do with the OTG signal that is affected by either ESD or rough conditions of robot play. What I speculated was this OTG signal was bounced, and it took a full replug to properly reset the connection. After not being able to reproduce it, I figured that path was no longer the way to go - but it may be a lead after all. I've never had a problem with my little demo robot (4 motors, all encoders, using IMU but no other sensor), and I've never had the chance to attempt reproducing the failures on a more complicated configuration. In the cases that happen at events, they tend to be recurring teams, but it is not consistent across matches. That picture is heckin' cool. I may play around with intentionally zapping individual connections to see what happens.
ESD, or rough play. I never counted ESD out, but it isn't necessarily the cause. After all, "this only happens at competitions"... Rough conditions may cause individual leads (or all the contacts of a plug) to become momentarily disconnected, which would affect how the logic signals are transferred and interpreted. Again, speculation. |
Does this mean that you've seen stuck USB writes where the fault can be cleared without replugging OTG the adapter on the micro-B side? If so that completely goes against all of my findings. |
In the cases where the fault was cleared by unplugging the A side, I did not look at the logs. It may have not been the "stuck USB write" issue, or there may be different circumstances where this fault occurs. I'll keep a look out for it next time. |
I've been experiencing this same issue during our last few competitions as well as occasionally in practice. My findings are similar to what's been discussed here in that there's always the stuck USB write error thrown and the only solution is physically unplugging and replugging the OTG to phone connector. We've begun taking more precautions in terms of preventing ESD or cable wobble, but still find that there will be the occasional unlucky short loss in USB comms that happens to be during a write and so gets the controller stuck in this bad state. Has any more progress been made on this issue? I use Moto G4's and so the solution of editing and recompiling a modified kernel seems to be more difficult not to mention I don't have much experience with linux source to really know what I'm doing. Is this really the only "hacky" solution that can be done or is there possibly something that's less extreme than having to directly modify phone firmware in order to force the controller back in Host mode? (maybe something like figuring out how to turn power off/on to the USB port programmatically?) It's just frustrating that a robot can go dead in a match with a problem that should have a software solution but is so low level that a general solution is hard to roll out and teams just have to accept being dead in a match despite their connections still being plugged in fine. |
Believe me, I tried many things to be able to reset the USB host controller without a kernel driver modification. I was able to reset the client portion of the USB stack without doing so (e.g. resulting in the phone disappearing from the computer it was connected to and then re-appearing), but it didn't work to recover from the stuck USB write.
This is the rationale for the Control Hub. For starters, the connection to the internal Lynx PCB does not go over USB at all, but rather through a direct UART link. This means not only better reliability, but also reduced latency when issuing hardware commands. Furthermore, since it's a single device for which REV / the Tech Team has complete control software-wise, they could easily release a fix for a low-level issue and have it be available for all teams. That being said, there's still things I don't like about the Control Hub. No power source besides main 12v battery, no screen which makes debugging hard, you're forced to use a webcam, etc. |
One thing I will say is that just because I couldn't find a way to reset the USB host controller on my specific phone without a kernel modification does not mean that holds true for all phones. Unlock the bootloader on that bad boy, flash TWRP recovery, install LineageOS and the root ZIP, and start poking around :) |
When an Expansion Hub is disconnected, there are, of course, several possible states that the comms link could be in at that momment:
One of these conditions (probably the second) can cause this error to occur:
If that error occurs, the USB device cannot be re-opened by any normal means. None of the following will cause it to be re-detected:
The ONLY way to recover from this state is to physically unplug and replug the actual OTG adapter. This led me to conclude it might possibly be an Android bug. To further add evidence to that case, I tried using a 3rd party USB device enumerator app while the phone was in the bad state. It too failed to pick up the FTDI. Upon unplugging and replugging the actual OTG adapter, both apps were able to successfully find the device.
I dug into the SDK to find what code produces this error, and I found this comment on line 164 of MonitoredUsbDeviceConnection.java:
Thus it seems then that someone on the Tech Team has seen this before, or at least something similar (notice the "we've rarely/never been able to successfully re-open after that").
I have a feeling this is an Android bug, (and thus not much can really be done about it, except maybe for the Control Hub, because that OS could be patched easily) but I think it should still be looked into.
The text was updated successfully, but these errors were encountered: