This file lives in eoe/lib/libtserialio/doc/tserialio.html.
I am not at SGI any more. By the time you read this, there may be a new tserialio expert. Check the rlog for the files below. If not, best bet for reaching me is cpirazzi@cs.princeton.edu. But please read everything here before you try and get a hold of me!
Read this document before attempting to modify tserialio. It will teach you what you need to know to avoid introducing bugs.
This document was written on 12/19/97. At this point there is a compiling tserialio driver in the kudzu tree; however, it contains many serious problems and will not work in any useful sense.
Below, I will explain the proper steps to port tserialio from bonsai (6.3) to kudzu (6.5).
Unlike the standard STREAMS or other interface, the user does not open tserialio's device node (/dev/tty*) directly. Instead, the user links with the user-mode library libtserialio.c, and uses only its documented API entry points.
Tserialio does not have or need flow control of any kind (neither hardware nor software, no XON/XOFF).
Example lower layers include:
tserialio is unlike most drivers in irix/kern; it is more like the audio and MIDI drivers. Typically in a driver, we do all we can to optimize bandwidth (throughput), sometimes at the cost of latency. tserialio needs exactly the opposite. tserialio is used for applications like deck control, which involve absolutely minuscule amounts of data (38400 baud, and the serial line is often not at all saturated), but involve extremely tight accuracy constraints. tserialio is 100% completely useless unless it can deliver the stated accuracy guarantees, 100% of the time. A failure means that video deck control overwrites customer data, which in the video industry means you send it back (no, it's not like Microsoft). tserialio uses a guaranteed low latency event in the kernel to run every millisecond so that it can deliver the very high accuracy timestamping and scheduling described in the tserialio man page.
If it were possible to run every millisecond, guaranteed, 100% of the time from a user-mode thread, we wouldn't need the tserialio driver, and the tserialio user-mode API would be more of a convenience than any kind of new functionality for the developer.
However, as of 12/19/97, the company has decided not to allocate the engineering resources necessary to guarantee that a user-mode thread can run every millisecond on O2 and Octane for IRIX 6.5. So O2 and Octane need the tserialio driver.
The event which the tserialio driver uses is called the millisecond profiler tick. There is a nasty, kludgy hook from irix/kern/ml called tsio_timercallback_ptr which tserialio hooks into. This tick comes at splprof, which is higher than splhi. This interrupt is not threaded, nor could it be threaded and deliver the required latencies on O2/Octane. splprof is so high that almost none of the standard kernel facilities which you're used to (semaphores, memory allocation, address space manipulation (memory mapping and unmapping), sleeping) work. About the only kernel facility which the interrupt can do is to schedule a lower-priority timeout routine (known as a timepoke), from which tserialio can use other kernel facilities such as pollwakeup().
In tserialio, as described in the comment you should have read above, the interrupt routine reads (for serial TX) or writes (for serial RX) an area of main memory containing a ringbuffer. That ringbuffer is mapped into the address space of the user process using the tserialio API. So once a TSport is set up, there is actually very little tserialio needs to do in terms of using kernel facilities.
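The shared-ringbuffer idea can be sketched in portable C. This is only an illustration of a single-producer, single-consumer ring that needs no lock; the structure layout, the names `rb_put`/`rb_get`, and the use of C11 atomics are all assumptions made for the sketch and are not taken from the tserialio source.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define RB_SIZE 8u  /* hypothetical capacity; must be a power of two */
static_assert((RB_SIZE & (RB_SIZE - 1)) == 0, "RB_SIZE must be a power of two");

/* Hypothetical layout, not the real tserialio ringbuffer. */
struct ringbuf {
    _Atomic unsigned head;          /* advanced only by the consumer */
    _Atomic unsigned tail;          /* advanced only by the producer */
    unsigned char data[RB_SIZE];
};

/* Producer side: returns 1 if the byte was stored, 0 if the ring is full. */
static int rb_put(struct ringbuf *rb, unsigned char byte)
{
    unsigned tail = atomic_load_explicit(&rb->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&rb->head, memory_order_acquire);

    if (tail - head == RB_SIZE)
        return 0;
    rb->data[tail & (RB_SIZE - 1)] = byte;
    /* Publish the byte only after it has been written. */
    atomic_store_explicit(&rb->tail, tail + 1, memory_order_release);
    return 1;
}

/* Consumer side: returns 1 and fills *out, or returns 0 if the ring is empty. */
static int rb_get(struct ringbuf *rb, unsigned char *out)
{
    unsigned head = atomic_load_explicit(&rb->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&rb->tail, memory_order_acquire);

    if (head == tail)
        return 0;
    *out = rb->data[head & (RB_SIZE - 1)];
    /* Release the slot only after the byte has been read. */
    atomic_store_explicit(&rb->head, head + 1, memory_order_release);
    return 1;
}
```

For serial RX the interrupt routine would play the producer and the user process the consumer; for TX the roles reverse. Each index is written by exactly one side, which is why neither side ever has to hold off the other.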
On Onyx2/Origin, the company has promised millisecond schedulability of a user-mode process. So on those platforms, don't bother porting the tserialio driver. Instead, you should write a very simple (should be at most a few hundred lines of code) "tserialio veneer" DSO in user-mode. This veneer will have the exact same API as the driver-based tserialio.so, but it will spawn off a sproc/pthread thread which runs every millisecond (which you can do by making it pri 255 and SCHED_RR or SCHED_FIFO (see sched_setscheduler(2))) and polls the serial device using any serial interface (presumably userualio or cserialio). The veneer will use some selectable object (semaphores, pipes, ...) to implement the tserialio API's selectable file descriptor. Semaphores would be the obvious choice, but due to the bizarre semantics of us selectable semaphores, you may find that you need to find another mechanism. You may find pipes to be of use. For communication between the veneer thread and the API thread, the veneer can use the same mapped ringbuffer trick as the current API and driver do. The only difference is that the whole shebang will reside in the address space of one share group (sproc) or process (pthread).
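A rough sketch of the veneer thread's main loop follows. It assumes a POSIX system that provides `clock_nanosleep()` with `CLOCK_MONOTONIC` (an assumption; check what the target IRIX release offers), and it leaves out the `sched_setscheduler(2)` priority setup and the actual serial polling. The names `run_ticks`, `next_deadline`, and `count_tick` are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <time.h>

#define TICK_NS    1000000L      /* one millisecond */
#define NS_PER_SEC 1000000000L

/* Advance an absolute deadline by one tick, keeping tv_nsec normalized. */
static void next_deadline(struct timespec *t)
{
    t->tv_nsec += TICK_NS;
    if (t->tv_nsec >= NS_PER_SEC) {
        t->tv_nsec -= NS_PER_SEC;
        t->tv_sec += 1;
    }
}

/* Example callback: counts ticks.  A real veneer would service the
 * serial device and the ringbuffers here. */
static void count_tick(void *arg)
{
    ++*(int *)arg;
}

/*
 * Call poll_fn once per millisecond for nticks ticks.
 * Returns the number of ticks that actually ran.
 */
static int run_ticks(int nticks, void (*poll_fn)(void *), void *arg)
{
    struct timespec deadline;
    int done = 0;

    assert(poll_fn != NULL);
    if (clock_gettime(CLOCK_MONOTONIC, &deadline) != 0)
        return 0;
    while (done < nticks) {
        int rc;

        next_deadline(&deadline);
        /* Sleep to an absolute time so a late wakeup does not shift later ticks. */
        do {
            rc = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &deadline, NULL);
        } while (rc == EINTR);
        if (rc != 0)
            break;
        poll_fn(arg);
        done++;
    }
    return done;
}
```

Sleeping to an absolute deadline rather than for a relative 1 ms keeps the tick frequency from drifting when one wakeup is late. Whether the wakeups really arrive within the required bound still depends entirely on the platform's scheduling guarantee.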
As other platforms take on the millisecond schedulability guarantees, you can discard their tserialio driver and go with the veneer.
Update: 1998 October 27
Chris Pirazzi and I had an email exchange about latency tolerance in the tserialio driver. Below is some additional information to help clarify the scheduling needs of the tserialio driver.
The tserialio driver requires a one millisecond scheduling period with one millisecond scheduling latency. So, at the extreme, the tserialio driver can tolerate no more than 2 ms between its executions. This scheduling requirement also applies to the MIDI and audio drivers.
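As a small worked example of that bound, the following sketch scans a log of run times (in microseconds) and reports the first run that arrived more than period plus latency after the previous one. The function name and the idea of logging run times are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>

#define PERIOD_US      1000L   /* scheduling period from the text */
#define MAX_HOLDOFF_US 1000L   /* scheduling latency from the text */

/*
 * run_us[] holds the times at which the tick actually ran.
 * Returns the index of the first run that came more than
 * PERIOD_US + MAX_HOLDOFF_US after its predecessor, or -1 if none did.
 */
static long first_late_run(const long *run_us, size_t n)
{
    assert(run_us != NULL || n == 0);
    for (size_t i = 1; i < n; i++)
        if (run_us[i] - run_us[i - 1] > PERIOD_US + MAX_HOLDOFF_US)
            return (long)i;
    return -1;
}
```

The worst legal case is a run with no holdoff followed by a run with the full 1000 us holdoff, which puts the two executions exactly 2000 us apart.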
Chris agreed to a more strict definition:
The tserialio timeout must be scheduled to run every 1000 us with no skew in scheduling frequency, and the maximum holdoff, i.e., the time from when a timer expires until the affected thread runs, that the tserialio timeout can tolerate is 1000 us.
Based on testing with real video decks, the above definition is overly strict, but the remainder of the statement is correct.
As of IRIX 6.5.2, the OCTANE platform has guaranteed one millisecond bounded response time, in contrast to Chris's comment on 1997 December 19. The tserialio driver runs as a periodic xthread on the IP27 (Onyx2, Origin200, Origin2000) and IP30 (OCTANE). Other platforms may be added over time by defining TSERIALIO_TIMEOUT_IS_THREADED in irix/kern/kcommondefs for each additional platform. Note: One millisecond bounded response for a platform is a prerequisite to proper execution of a threaded tserialio driver, so ensure proper bounded response before defining TSERIALIO_TIMEOUT_IS_THREADED.
--Brian Forney (bforney@sgi.com)
MIDI == 31250 sym/sec / ((1 start + 8 data + 1 stop) = 10 sym/byte)
     == 3125 bytes/sec
deck == 38400 sym/sec / ((1 start + 8 data + 1 parity + 1 stop) = 11 sym/byte)
     == 3490.9... bytes/sec
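The same arithmetic as a tiny C helper, in case you want to check other line settings. The function name is hypothetical; the framing parameters are the ones used in the figures above.

```c
#include <assert.h>

/*
 * Bytes per second on an asynchronous serial line: every byte costs
 * start + data + parity + stop symbols on the wire.
 */
static double bytes_per_second(double baud, int start, int data,
                               int parity, int stop)
{
    int symbols_per_byte = start + data + parity + stop;

    assert(symbols_per_byte > 0);
    return baud / (double)symbols_per_byte;
}
```

Calling `bytes_per_second(31250, 1, 8, 0, 1)` gives 3125 for MIDI, and `bytes_per_second(38400, 1, 8, 1, 1)` gives roughly 3490.9 for deck control. Either way the data rate is tiny, which is the point.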
As described above, latency and accuracy are what is important to the tserialio driver.
tserialio is used for video deck control. Video deck control needs highly accurate scheduling and timestamping of serial bytes.
Another extremely popular application is video deck emulation. Its requirements are stricter. Video deck emulation requires low latency user-thread scheduling in addition to the high accuracy of video deck control.
Please read the following blurb about these two applications. It was written with someone in mind who is stronger with OS concepts than with video concepts: see millideck.
There is some other information about video deck control, including sample code using tserialio, in the Lurker's Guide to Video at http://collabmed.engr.sgi.com/projects/lurker/.
Anyway, back when it seemed likely that Mountaingate would sell lots of systems, we did a very carefully crafted and limited hack to the tserialio IRIX 6.3 driver which would allow them to have a small part of their driver run from tserialio's millisecond callback in the kernel. Mountaingate was extremely limited as to what they could do from this callback.
To see the contract and the precise hack, see the file mgate1.9 (or whatever the highest-numbered mgate* is right now) in eoe/lib/libtserialio/mountaingate.
As you can see, we explicitly said we were not on the hook for IRIX 6.5 support. So you can choose to omit the mountaingate hacks if you want when porting to IRIX 6.5. As you can see by diffing the "with" and "without" tserialio.c in eoe/lib/libtserialio/mountaingate/tserialio_driver_hack, they are not that complex.
There is one very lightweight support issue with Mountaingate. Because we provide them with a custom tserialio.o, we need to make sure that any patches they install do not conflict with the tserialio.o we provided. Part of our contract with Mountaingate is that we will qualify or disqualify any patches they want to install. Examples of patches which Mountaingate should not be allowed to install:
Qualifying a patch usually consists of looking at its idb file (/hosts/dist/test/patches/patchXXXX/dist/*.idb) to see what it includes, and if it includes something suspicious in the kernel, going to its source files (find /hosts/jake/proj/patches/patchXXXX -type f -print) and seeing what changed.
Mountaingate asks for a set of patches to be qualified once every few months.
After surveying the changes for a week I am absolutely, 100% sure that it is worth starting over from scratch with the bonsai source. This is definitely the fastest way to get kudzu tserialio working.
Some of the items below are such that it takes me longer to explain them than it would for me to do them before I leave. But this would be a short-sighted strategy. If tserialio is to survive in the long term, or even if it is to ever get ported to NT, you need to understand the intricacies of the driver, and so I will spend my time explaining it and not putzing with it.
So here's the process you need to follow. Unless otherwise specified, all changes are to the driver. Do these things in this order:
#if IP22 || IP32
extern int fastclock; /* from ml/timer.c */
#endif
and then add this in start_tsio_ticks():
#if defined(MP)
    startaudioclock();
#else
    if (!fastclock) {
        enable_fastclock();
    }
#endif
The bug is that the fastclock must be started, and the bonsai and kudzu code weren't starting it. This bug often was masked because any use of audio (including keyboard beep on many platforms) would also start the clock.
The Mountaingate source code in eoe/lib/libtserialio/mountaingate/tserialio_driver_hack also fixes 460909 but in a bonsai-only way. Do not copy the 460909 fix from the Mountaingate code.
The key part is this:
The generation value must be taken either while a state lock is held
that will hold off any call to pollwakeup(D3) on the pollhead, or it
must be taken before the check is made for any pending events. The
Because of the lock-free way tserialio is designed, it is not possible or
desirable to hold off a call to pollwakeup() (which in the case of
tserialio comes from the timepoke). Instead, we need to take the
generation number before we check the ringbuffers:
#include <sys/poll.h> /* ADD THIS */
int
tsiopoll(dev_t dev,
         register short events,
         int anyyet,
         short *reventsp,
         struct pollhead **phpp,
         unsigned int *genp) /* loadable module entry point */
{
    ...
    if (fillunits != TSIO_FILLUNITS_BYTES &&
        fillunits != TSIO_FILLUNITS_UST)
    {
        cmn_err(CE_WARN, "tsio poll: port 0x%x fillunits corrupted\n",
                upper);
        UNLOCK_TSIO();
        return EINVAL;
    }
    /* ADD THIS */
    /*
     * Snapshot pollhead generation number before checking event state
     * since we don't hold a lock while doing the check.
     */
    *genp = POLLGEN(&urb->pq);
    /* UP TO HERE */
    if (urb->direction == TSIO_TOMIPS) /* RX, input */
    {
        int n_filled;
        int tail = urb->TXheadRXtail; /* tail for RX */
        ...
    ...
}
dmedia/devices/midi/kern/nsdriver/nsmidi.c is an example which gets it right.
There is no need to add any code later in the function as casey did.
p_rdiff -g -r 1.4 -r 1.5 tserialio.c
to see the change (but don't gdiff kudzu against your working copy; there will be too many differences). The old code used the unexported dotimeout() because for some reason itimeout() couldn't be called from splprof. This is now possible in kudzu.
p_rdiff -g -r 1.6 -r 1.7 tserialio.c
but don't gdiff kudzu against your working copy; there will be too many differences. The issue here is that there are now more drivers using this 1ms hook. The midi hook used to be unused in bonsai. Now tsio needs its own hook.
This performance reality of our SGI compilers is relevant for the tserialio driver as well. Keep it in mind when optimizing the driver.
port->sio_callup = &tsio_callup;
...
port->sio_lock->spinlock = SIO_LOCK_SPL7; /* we need spl7 spinlock, sigh */
Also, the lock macros are now called SIO_LOCK_*() in kudzu.
/hw/node/xtalk/15/pci/4/tty/1
/hw/node/xtalk/15/pci/4/tty/1/4d
/hw/node/xtalk/15/pci/4/tty/1/4f
/hw/node/xtalk/15/pci/4/tty/1/c
/hw/node/xtalk/15/pci/4/tty/1/d
/hw/node/xtalk/15/pci/4/tty/1/f
/hw/node/xtalk/15/pci/4/tty/1/m
/hw/node/xtalk/15/pci/4/tty/1/midi
In some cases (4d, 4f, d, f, m) one upper layer (serialio) wants to have several different /hw nodes per port.
sio_make_hwgraph_nodes()'s job is to populate /hw/ttys, a system-global directory of ttys, with one node for each physical serial port and upper layer combination. It does this by simply concatenating the port number, retrieved with---you guessed it---device_controller_num_get() with the letters identifying the upper layer. So the nodes from the above example would generate:
/hw/ttys/tty4d1
/hw/ttys/tty4f1
/hw/ttys/ttyc1
/hw/ttys/ttyd1
/hw/ttys/ttyf1
/hw/ttys/ttym1
/hw/ttys/ttymidi1
So, by the time the system is booted, /hw/ttys contains every tty on the system.
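The naming rule is simple enough to show in a few lines of user-space C. This is only a sketch of the string construction; `make_tty_name` is a hypothetical helper, and the real work in sio_make_hwgraph_nodes() (creating hwgraph edges) is not shown.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "/hw/ttys/tty<suffix><portnum>", where suffix identifies the
 * upper layer ("4d", "c", "midi", ...) and portnum is the controller
 * number of the physical port.
 * Returns 0 on success, -1 if the name did not fit in buf.
 */
static int make_tty_name(char *buf, size_t len, const char *suffix, int portnum)
{
    int n;

    assert(buf != NULL && suffix != NULL);
    n = snprintf(buf, len, "/hw/ttys/tty%s%d", suffix, portnum);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With port number 1, the suffix "4d" yields /hw/ttys/tty4d1 and the suffix "midi" yields /hw/ttys/ttymidi1, matching the listing above.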
Note that the kudzu rev 1.7 tserialio does not use anything approaching the right method; you should instead mimic what is done by the other upper layers (cserialio, serialio, nsmidi: pointers are above).
The correct procedure is:
# Standard on-cpu tty's
# Create symlinks to any tty files that exist in the hwgraph, as well as any
# input directories. Also provide direct symlinks /dev/keybd /dev/mouse
# /dev/dials and /dev/tablet for backward compatibility.
AXE # For O2, create timestamped serial io driver, /dev/ttyts*
AXE # (an alternate serialio upper layer).
ttys: anything
AXE @if hinv -c processor | grep -s IP32 > /dev/null; then \
AXE rm -rf ttyts* ; \
AXE ln -s /hw/tsio/ttyts1 ttyts1 ; \
AXE ln -s /hw/tsio/ttyts2 ttyts2 ; \
AXE fi
@for tty_file in /hw/ttys/* ; do \
bname="`basename $$tty_file`" ; \
....
In bonsai, there was no hardware graph, and so the driver maintained its own array of "per-physical-port" data structures (porttab) and "per-driver-open" data structures (urbtab). The fact that these were statically allocated arrays was lame. It meant you always had to allocate the worst-case number of urbs (which is eight times the number of physical ports on the machine). For O2 this is not a problem, but for higher-end machines it may be a problem.
In kudzu, the hardware graph automatically gives us a convenient place to store a "per-physical-port" pointer and a "per-driver-open" pointer. We identified these places just above when we discussed how to create the hardware graph nodes. There are no static array limitations, because all of these data structures are dynamically initialized.
It would be a very good idea if you could do away with urbtab[] and porttab[]. They are not necessary. Note that the rev 1.7 kudzu tserialio still has porttab[] and urbtab[], while also using the hardware graph pointers. This makes the code very confusing to read and more bug-prone.
Brief side note: even if you choose to keep porttab[] and urbtab[], you should make them kmem_zalloc'ed to a fixed size by the driver init instead of statically allocated. This is to work around the extreme lameness of the R10000 O2 kernel, where K0/K1 space under 8MB is at an extreme premium due to the R10000 speculative store workaround. Ask your friendly neighborhood OS person for more on that.
If you got rid of porttab[] and urbtab[], you would need:
The obvious choice in this case is a singly linked list structure.
There would be one global pointer that pointed to a linked list of open ports. Each element of that linked list would be a "per-physical-port" structure, allocated during hardware graph setup time. The "per-physical-port" structure would be the equivalent of the tsio_upper structure in bonsai tserialio. When tsioopen() is called, it retrieves the "per-physical-port" structure, and if that port is not already open, it adds the port to the linked list. You could embed the "next" pointer for the linked list in the "per-physical-port" structure itself. tsioclose() would remove the port from the global linked list of open ports when the last urb on that port is closed.
Each "per-physical-port" structure would contain a pointer to a linked list of open urbs on that port. Following the same model, each element of that urb list would be a "per-driver-open" structure, allocated at hwgraph_vertex_clone() time. That is, the "per-driver-open" structure would be the equivalent of the "urb" structure in bonsai tserialio. tsioopen() does the clone, so it enqueues the "per-driver-open" structure on the linked list of the appropriate "per-physical-port" structure. tsioclose() would then remove the "per-driver-open" structure from the list.
While all this is going on, the interrupt and timepoke routines will be walking the lists to do their job.
This scheme has the advantages that the interrupt and timepoke routines waste no time on unopen ports or urbs, that there are no static limits on ports or urbs, and that the number of data structures is kept to the same minimal level as bonsai: 2.
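Here is a bare-bones sketch of the two-level list, to make the shape concrete. The structure and function names are hypothetical stand-ins for tsio_upper and urb, and the sketch deliberately ignores the hard part: it has no coordination at all between toplevel and the interrupt/timepoke, which a real driver must add.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical "per-driver-open" structure (the bonsai urb). */
struct tsio_urb {
    struct tsio_urb *next;      /* next open urb on the same port */
    int id;
};

/* Hypothetical "per-physical-port" structure (the bonsai tsio_upper). */
struct tsio_port {
    struct tsio_port *next;     /* next open port, embedded as suggested */
    struct tsio_urb *urbs;      /* open urbs on this port */
    int portnum;
};

static struct tsio_port *open_ports;    /* the one global list head */

/* tsioopen()-like step: enqueue the urb, linking the port in on first open. */
static void port_add_urb(struct tsio_port *port, struct tsio_urb *urb)
{
    assert(port != NULL && urb != NULL);
    if (port->urbs == NULL) {
        port->next = open_ports;
        open_ports = port;
    }
    urb->next = port->urbs;
    port->urbs = urb;
}

/* tsioclose()-like step: dequeue the urb, unlinking the port on last close. */
static void port_remove_urb(struct tsio_port *port, struct tsio_urb *urb)
{
    struct tsio_urb **up;
    struct tsio_port **pp;

    assert(port != NULL && urb != NULL);
    for (up = &port->urbs; *up != NULL; up = &(*up)->next) {
        if (*up == urb) {
            *up = urb->next;
            urb->next = NULL;
            break;
        }
    }
    if (port->urbs != NULL)
        return;                 /* other urbs still open on this port */
    for (pp = &open_ports; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == port) {
            *pp = port->next;
            port->next = NULL;
            break;
        }
    }
}

/* What the tick would do: visit every open urb on every open port. */
static int count_open_urbs(void)
{
    int n = 0;

    for (struct tsio_port *p = open_ports; p != NULL; p = p->next)
        for (struct tsio_urb *u = p->urbs; u != NULL; u = u->next)
            n++;
    return n;
}
```

Note that the walk in `count_open_urbs()` touches only open ports and open urbs, which is the first advantage claimed above.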
The tricky part, of course, is the coordination of these linked list data structures between the interrupt/timepoke and toplevel. In order for tserialio to work with 1ms guarantees, you cannot use any locking scheme which could hold off the profiler tick for anywhere near that long. You must also use a locking scheme where the interrupt and timepoke are guaranteed to walk over each open port/urb on each call. In theory, you could use spl7 spinlocks, but for reasons described in the long comment at the beginning of tserialio.c, this is overkill. Ideally, you would choose a spinlock-free coordination method. For the bonsai driver with its porttab[] and urbtab[] arrays, Mike Thompson helped me cook up the intrflags scheme. There is almost definitely a similar scheme that will work for linked lists. It could be extremely fun to find such a scheme. Contact Mike Thompson for invaluable help: he eats this stuff for breakfast. There may be a pre-cooked answer to this problem with code ready to steal.
If it will take you too long to cook up a spinlock-free coordination scheme, alternatives are:
 * if (compare_and_swap_int(&urbtab[i].intrflags,
 *         (urbtab[i].intrflags & URB_INUSE_BY_TIMEPOKE)
 *           | URB_ACTIVE,
 *         (urbtab[i].intrflags & URB_INUSE_BY_TIMEPOKE)
 *           | URB_ACTIVE | URB_INUSE_BY_INTR))
Here's how it's actually used in the (interrupt) code:
int other = (urb->intrflags & URB_INUSE_BY_TIMEPOKE);
if (compare_and_swap_int(&urb->intrflags,
other | URB_ACTIVE,
other | URB_ACTIVE | URB_INUSE_BY_INTR))
Note the difference: the race is that the former code (which is still
in the comments) could grab intrflags twice (based on compiler whim).
The unlikely but possible failure scenario is:
Once you understand this you should update the comment so it reflects the code.
Incidentally, I have some optimizations for the tick processing that you might want to incorporate. I was having performance problems with 7 audio streams and 7 tsio streams (sometimes the tick processing would take an inordinately long time, and there's a bug in the profiling tick -- on 6.4, at least -- that if the tick processing takes just less than 1ms, it misses the next interrupt and goes away until the counter wraps -- 57 minutes). We got the audio down to avg approx. 100us, max approx. 450us, but although the tsio avg was 113us, at least once a second it was more like 600us. With the optimizations, the avg is the same, but the max is more like 220us with an occasional 400us.
There are two optimizations. One is to tighten up the rxbytes -> urb's loop a little by removing the byte-by-byte copy. The timestamp is only stored for the first byte copied. The user lib handles replicating that stamp to the rest of the bytes. The other addresses the write loop, where I suspect most of the time is lost: send the write bytes to the lower layer in one downcall (or two if the data wraps around the urb) rather than byte-by-byte.
I have made absolutely no attempt to optimize tserialio, beyond assuring that the worst-case execution time in bonsai (where there are at most 2 ports and 16 urbs) did not blow the millisecond guarantees. I focused on simplicity and correctness instead. So you should really be able to optimize the heck out of tsio_tick()!
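The "one downcall, or two if the data wraps" idea from the optimization notes above boils down to splitting the pending bytes into contiguous spans. A minimal sketch, with hypothetical names and no knowledge of the real urb layout:

```c
#include <assert.h>
#include <stddef.h>

struct span {
    size_t off;     /* offset of the span within the ring */
    size_t len;     /* number of contiguous bytes */
};

/*
 * Split `count` pending bytes starting at index `head` in a ring of
 * `size` bytes into at most two contiguous spans.
 * Returns the number of spans written to out[] (0, 1 or 2).
 */
static int ring_spans(size_t head, size_t count, size_t size,
                      struct span out[2])
{
    size_t first;

    assert(size > 0 && head < size && count <= size);
    if (count == 0)
        return 0;
    first = size - head;            /* bytes available before the wrap */
    if (first > count)
        first = count;
    out[0].off = head;
    out[0].len = first;
    if (first == count)
        return 1;
    out[1].off = 0;                 /* remainder starts at the ring base */
    out[1].len = count - first;
    return 2;
}
```

Each span would then be handed to the lower layer in a single call instead of one call per byte. For example, 5 bytes starting at index 6 of an 8-byte ring become a 2-byte span at offset 6 followed by a 3-byte span at offset 0.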