(C) Copyright 2018 The Fuchsia Authors. All rights reserved.
Use of this source code is governed by a BSD-style license that can be
found in the LICENSE file.
# Ethernet Devices
This document is part of the Driver Development Kit tutorial documentation.
## Overview
This chapter looks into the details of ethernet drivers, using the Intel driver code
for specific examples.
In order to handle ethernet devices, two distinct parts are involved.
A "top half" driver handles the generic ethernet protocol, and is located in
`//src/connectivity/ethernet/drivers/ethernet/ethernet.c` (yes, three "ethernets" in a row),
and one or more "bottom half" drivers handle the actual devices, located one
directory higher in `//src/connectivity/ethernet/drivers/`**_devicename_**`/`.
Multiple Zircon IPC protocols are used for communication between modules.
> We'll just use the term "protocol" to refer to these.
> Granted, we *are* discussing an Ethernet driver, but since we won't be
> discussing any of the on-wire communications protocols supported by the driver,
> this usage shouldn't result in any confusion.
The top half provides a protocol interface that conforms to `ZX_PROTOCOL_ETHERNET_IMPL`.
The bottom half provides a protocol interface that conforms to whatever the
hardware is connected to (for example, this might be `ZX_PROTOCOL_PCI`, for
PCI-based ethernet cards, or `ZX_PROTOCOL_USB` for USB-based ethernet devices,
and so on).
We'll focus on the PCI version here.
The bottom half drivers all expose a `ZX_PROTOCOL_ETHERNET_IMPL` binding, which is how
the top half finds the bottom halves.
Effectively, the bottom half ethernet driver is responsible for managing the hardware
associated with the ethernet device, and presenting a consistent abstraction of that
hardware for use by the top half.
The top half manages the ethernet interface to the system.
![Figure: Relationship amongst layers in ethernet driver stack](ethernet-000-cropped.png)
# Intel PCI-based ethernet
The Intel ethernet driver can be found in `//src/connectivity/ethernet/drivers/intel-ethernet`,
and consists of the following files:

* `ethernet.c`, the device driver part of the code; handles the interface to protocols.
* `ie.c`, the Intel-specific part of the code; knows about the hardware registers on the card.
* A file containing the manifest constants for all of the control registers.
* A file containing common definitions (such as the device context block).

This driver not only handles the `ethmac` protocol, but also:
* finds its device on the PCI bus,
* attaches to legacy or Message Signaled Interrupts (**MSI**),
* maps I/O memory, and
* creates a background IRQ handling thread.
## Binding
The file `ethernet.c` contains the binding information, implemented by the standard
binding macros introduced in the Simple Drivers chapter:

    ZIRCON_DRIVER_BEGIN(intel_ethernet, intel_ethernet_driver_ops, "zircon", "0.1", 11)
        BI_ABORT_IF(NE, BIND_PROTOCOL, ZX_PROTOCOL_PCI),
        BI_ABORT_IF(NE, BIND_PCI_VID, 0x8086),
        BI_MATCH_IF(EQ, BIND_PCI_DID, 0x100E), // Qemu
        BI_MATCH_IF(EQ, BIND_PCI_DID, 0x15A3), // Broadwell
        BI_MATCH_IF(EQ, BIND_PCI_DID, 0x1570), // Skylake
        BI_MATCH_IF(EQ, BIND_PCI_DID, 0x1533), // I210 standalone
        BI_MATCH_IF(EQ, BIND_PCI_DID, 0x1539), // I211-AT
        BI_MATCH_IF(EQ, BIND_PCI_DID, 0x156f), // I219-LM (Dawson Canyon NUC)
        BI_MATCH_IF(EQ, BIND_PCI_DID, 0x15b7), // Skull Canyon NUC
        BI_MATCH_IF(EQ, BIND_PCI_DID, 0x15b8), // I219-V
        BI_MATCH_IF(EQ, BIND_PCI_DID, 0x15d8), // Kaby Lake NUC
    ZIRCON_DRIVER_END(intel_ethernet)

This ends up binding to ethernet cards that are identified by vendor ID `0x8086` (Intel),
and have any of the listed device IDs (the `BIND_PCI_DID` lines indicate the allowed
hexadecimal device IDs).
It also requires the `ZX_PROTOCOL_PCI` protocol.
Note the sense of the logic here &mdash; the vendor ID is tested with a
"`BI_ABORT_IF(NE`" construct (meaning, "**ABORT IF** the values are **N**ot **E**qual"),
whereas the device IDs are tested with "`BI_MATCH_IF(EQ`" constructs (meaning "**MATCH
IF** the values are **EQ**ual").
Intuitively, you might think that the vendor ID could be tested with a "`BI_MATCH_IF(EQ`"
as well (looking for vendor `0x8086`), but this would cause two major problems.
First, evaluation stops as soon as a condition is true, so **any** device
with the Intel vendor ID would be considered a "match," regardless of its device ID.
Second, a device from a different vendor would not be rejected; evaluation would simply
continue to the device ID tests, so another vendor's device that happened to use one of
the listed device IDs would also match.
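
The evaluation order described above can be modelled in a few lines of C. This is only a sketch under stated assumptions: the `bind_rule_t` type, the `bind_eval()` function, and the rule tables are hypothetical stand-ins, not the real Zircon bind-program encoding, and the sketch assumes that running off the end of the list means "no match."

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical, simplified model of a bind program (not the Zircon encoding).
typedef enum { ACT_ABORT, ACT_MATCH } bind_action_t;
typedef enum { COND_EQ, COND_NE } bind_cond_t;
typedef enum { KEY_PROTOCOL, KEY_PCI_VID, KEY_PCI_DID, KEY_COUNT } bind_key_t;

typedef struct {
    bind_action_t action;
    bind_cond_t cond;
    bind_key_t key;
    uint32_t value;
} bind_rule_t;

// Evaluates the rules in order; the first rule whose condition is true
// decides the outcome. Assumption: no true rule at all means "no match".
static bool bind_eval(const bind_rule_t* rules, size_t count,
                      const uint32_t props[KEY_COUNT]) {
    assert(rules != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        uint32_t actual = props[rules[i].key];
        bool is_true = (rules[i].cond == COND_EQ) ? (actual == rules[i].value)
                                                  : (actual != rules[i].value);
        if (is_true) {
            return rules[i].action == ACT_MATCH;
        }
    }
    return false;
}

// A shortened version of the Intel rule list (two device IDs only).
static const bind_rule_t intel_rules[] = {
    {ACT_ABORT, COND_NE, KEY_PROTOCOL, 0x70504349},
    {ACT_ABORT, COND_NE, KEY_PCI_VID, 0x8086},
    {ACT_MATCH, COND_EQ, KEY_PCI_DID, 0x100E},
    {ACT_MATCH, COND_EQ, KEY_PCI_DID, 0x1533},
};

// The flawed variant: the vendor ID is tested with MATCH/EQ instead.
static const bind_rule_t flawed_rules[] = {
    {ACT_ABORT, COND_NE, KEY_PROTOCOL, 0x70504349},
    {ACT_MATCH, COND_EQ, KEY_PCI_VID, 0x8086},
    {ACT_MATCH, COND_EQ, KEY_PCI_DID, 0x100E},
};
```

With `flawed_rules`, an Intel device with an unlisted device ID matches, and so does a non-Intel device whose device ID is `0x100E`; with `intel_rules`, both are rejected.
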
> The individual tests are evaluated in sequence.
> The first one that's true terminates evaluation, and performs
> the given action (i.e., `ABORT` or `MATCH`).
## More about binding
From the command line, `dm drivers` will display this information.
Here's the relevant portion for the Intel ethernet driver:

    $ dm drivers
    Name : intel_ethernet
    Driver : /boot/driver/
    Flags : 0x00000000
    Binding : 11 instructions (88 bytes)
    [1/11]: if (Protocol != 0x70504349) return no-match;
    [2/11]: if (PCI.VID != 0x00008086) return no-match;
    [3/11]: if (PCI.DID == 0x0000100e) return match;
    [4/11]: if (PCI.DID == 0x000015a3) return match;
    [5/11]: if (PCI.DID == 0x00001570) return match;
    [6/11]: if (PCI.DID == 0x00001533) return match;
    [7/11]: if (PCI.DID == 0x00001539) return match;
    [8/11]: if (PCI.DID == 0x0000156f) return match;
    [9/11]: if (PCI.DID == 0x000015b7) return match;
    [10/11]: if (PCI.DID == 0x000015b8) return match;
    [11/11]: if (PCI.DID == 0x000015d8) return match;

The `Name` field indicates the name of the driver, given as the first argument to the
`ZIRCON_DRIVER_BEGIN` macro.
The `Driver` field indicates the location of the shared object that contains the driver code.
> The `Flags` field is not used @@@ correct?
The last section, the binding instructions, corresponds with the `BI_ABORT_IF` and `BI_MATCH_IF`
macro directives.
Note that the first binding instruction compares the field `Protocol` against the hexadecimal
number `0x70504349`. That "number" is simply the ASCII encoding of the string "`pPCI`",
indicating the PCI protocol.
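
To see the encoding for yourself, the 32-bit value can be unpacked one byte at a time, most significant byte first. The helper name `protocol_id_to_string()` below is made up for this illustration and is not part of the DDK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Unpacks a 32-bit protocol ID into its four ASCII characters, most
// significant byte first, and NUL-terminates the result.
// (Hypothetical helper, for illustration only.)
static void protocol_id_to_string(uint32_t id, char out[5]) {
    assert(out != NULL);
    out[0] = (char)((id >> 24) & 0xFF);
    out[1] = (char)((id >> 16) & 0xFF);
    out[2] = (char)((id >> 8) & 0xFF);
    out[3] = (char)(id & 0xFF);
    out[4] = '\0';
}
```

Calling `protocol_id_to_string(0x70504349, buf)` leaves the string `pPCI` in `buf`, since `0x70`, `0x50`, `0x43`, and `0x49` are the ASCII codes for `p`, `P`, `C`, and `I`.
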
From the `ZIRCON_DRIVER_BEGIN` macro, the `intel_ethernet_driver_ops`
structure contains the driver operations, in this case just the binding function
(**eth_bind()**).
Let's turn our attention to the binding function itself.
## PCI interface
The first part of the binding function deals with the PCI interface.
The Intel ethernet driver is a PCI bus peripheral.
As such, it needs to first query the PCI configuration registers in order to discover
where the BIOS (or other startup program) has located the device in memory
address space, and what interrupt it was assigned.
Second, it needs to initialize the device for use (such as mapping the configuration
registers and attaching to the device's interrupt).
We broadly discussed this in the Hardware Interfacing chapter.
As usual, the binding function allocates and initializes a context block:

    static zx_status_t eth_bind(void* ctx, zx_device_t* dev) {
        ethernet_device_t* edev;
        if ((edev = calloc(1, sizeof(ethernet_device_t))) == NULL) {
            return ZX_ERR_NO_MEMORY;
        }
        mtx_init(&edev->lock, mtx_plain);
        mtx_init(&edev->eth.send_lock, mtx_plain);

This allocates a zeroed ethernet context block (`ethernet_device_t`).
Then we initialize two mutexes (one for locking the device itself (`edev->lock`), and one
for locking the ethernet send buffers (`edev->eth.send_lock`)).
We'll examine the context block in more detail below.
### PCI protocol operations
The next step fetches the PCI protocol operations pointer (or fails if it can't):

        if (device_get_protocol(dev, ZX_PROTOCOL_PCI, &edev->pci)) {
            printf("no pci protocol\n");
            goto fail;
        }

This populates `edev->pci` (of type `pci_protocol_t`) with pointers to functions that
provide PCI protocol services.
Of the many functions available, we use the following subset (listed in order of
use in the binding function):

Function            | Description
--------------------|------------
`get_bti`           | Used to get the Bus Transaction Initiator (**BTI**) for the device
`query_irq_mode`    | Returns the number of available IRQs of the specified type (MSI or legacy)
`set_irq_mode`      | Requests the specified IRQ mode to be used for the device
`map_interrupt`     | Creates an IRQ handle associated with the device's interrupt
`map_bar`           | Returns a pointer to the Base Address Register (**BAR**) of the PCI device
`enable_bus_master` | Enables / disables bus mastering for the device

> Note that the function names given in the table above are the member names within
> the `pci_protocol_t` structure; throughout the code we'll use the **pci_...()** accessor
> functions to call the protocol ops.
### Fetch the BTI
The first PCI function we call is **pci_get_bti()**:

        zx_status_t status = pci_get_bti(&edev->pci, 0, &edev->btih);
        if (status != ZX_OK) {
            goto fail;
        }

A BTI is used to represent the bus mastering / DMA capability of a device.
It can be used for granting memory access to a device.
The BTI handle is stored in `edev->btih` and is used later to initialize transfer buffers.
The Hardware Interfacing chapter talks more about this, in the DMA section.
### Discover and map interrupts
The interrupt is discovered and mapped next:

        // Query whether we have MSI or Legacy interrupts.
        uint32_t irq_cnt = 0;
        if ((pci_query_irq_mode(&edev->pci, ZX_PCIE_IRQ_MODE_MSI, &irq_cnt) == ZX_OK) &&
            (pci_set_irq_mode(&edev->pci, ZX_PCIE_IRQ_MODE_MSI, 1) == ZX_OK)) {
            printf("eth: using MSI mode\n");
        } else if ((pci_query_irq_mode(&edev->pci, ZX_PCIE_IRQ_MODE_LEGACY, &irq_cnt) == ZX_OK) &&
                   (pci_set_irq_mode(&edev->pci, ZX_PCIE_IRQ_MODE_LEGACY, 1) == ZX_OK)) {
            printf("eth: using legacy irq mode\n");
        } else {
            printf("eth: failed to configure irqs\n");
            goto fail;
        }

        zx_status_t r = pci_map_interrupt(&edev->pci, 0, &edev->irqh);
        if (r != ZX_OK) {
            printf("eth: failed to map irq\n");
            goto fail;
        }

The **pci_query_irq_mode()**
function determines if the device supports any `MSI` or `LEGACY`
style interrupts, and returns the count (in `irq_cnt`).
We're expecting one interrupt, so we ignore the count and examine just the return status.
If the return status indicates one or more interrupts of that type exist, we set the device to
use that mode.
The **pci_map_interrupt()**
function is then used to bind the hardware interrupt to a handle, stored in `edev->irqh`.
We'll see this handle later, when we look at the interrupt service thread.
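
The "prefer MSI, fall back to legacy" decision can be isolated from the PCI calls and tested on its own. The following is a minimal sketch, assuming a made-up `irq_caps_t` structure that stands in for the results of the query and set calls; none of these names exist in the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { IRQ_MODE_NONE, IRQ_MODE_LEGACY, IRQ_MODE_MSI } irq_mode_t;

// Hypothetical capability description standing in for the PCI protocol calls.
typedef struct {
    bool msi_available;
    bool legacy_available;
} irq_caps_t;

// Stand-in for a "query, then set" pair: succeeds only if the mode is available.
static bool try_mode(const irq_caps_t* caps, irq_mode_t mode) {
    switch (mode) {
    case IRQ_MODE_MSI:
        return caps->msi_available;
    case IRQ_MODE_LEGACY:
        return caps->legacy_available;
    default:
        return false;
    }
}

// Walks the preference list in order and returns the first mode that works,
// or IRQ_MODE_NONE if the device offers neither.
static irq_mode_t choose_irq_mode(const irq_caps_t* caps) {
    assert(caps != NULL);
    static const irq_mode_t preference[] = {IRQ_MODE_MSI, IRQ_MODE_LEGACY};
    for (size_t i = 0; i < sizeof(preference) / sizeof(preference[0]); i++) {
        if (try_mode(caps, preference[i])) {
            return preference[i];
        }
    }
    return IRQ_MODE_NONE;
}
```

A result of `IRQ_MODE_NONE` corresponds to the "failed to configure irqs" branch in the driver, which jumps to the failure path.
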
### Map PCI BAR
Next up, we map the PCI BAR:

        // map iomem
        uint64_t sz;
        zx_handle_t h;
        void* io;
        r = pci_map_bar(&edev->pci, 0u, ZX_CACHE_POLICY_UNCACHED_DEVICE, &io, &sz, &h);
        if (r != ZX_OK) {
            // report the status; `h` is not valid when the call fails
            printf("eth: cannot map io %d\n", r);
            goto fail;
        }
        edev->eth.iobase = (uintptr_t)io;
        edev->ioh = h;

        if ((r = pci_enable_bus_master(&edev->pci, true)) < 0) {
            printf("eth: cannot enable bus master %d\n", r);
            goto fail;
        }

The call to **pci_map_bar()** creates a handle to the first BAR
(the `0u` as the second argument
specifies the BAR ID number), which we store into the context block's `ioh` member.
(We also capture the virtual address into `edev->eth.iobase`.)
### Ethernet setup and configuration
At this point, we have access to enough of the device that we can go and set it up:

        if (eth_enable_phy(&edev->eth) != ZX_OK) {
            goto fail;
        }
        if (eth_reset_hw(&edev->eth)) {
            goto fail;
        }

The implementation of **eth_enable_phy()** and **eth_reset_hw()**
is in the `ie.c` file.
### DMA buffer setup and hardware configuration
With the device configured, we can now set up the DMA buffers.
Here we see the BTI handle, `edev->btih`, that we set up above, as the 2nd argument to
**io_buffer_init()**:

        r = io_buffer_init(&edev->buffer, edev->btih, ETH_ALLOC, IO_BUFFER_RW | IO_BUFFER_CONTIG);
        if (r < 0) {
            printf("eth: cannot alloc io-buffer %d\n", r);
            goto fail;
        }
        eth_setup_buffers(&edev->eth, io_buffer_virt(&edev->buffer), io_buffer_phys(&edev->buffer));
        eth_init_hw(&edev->eth);

The **io_buffer_init()**
function zeroes the buffer, and creates a VMO handle to the BTI.
The **eth_setup_buffers()** and **eth_init_hw()** functions are defined in the `ie.c` module.
### Final driver binding
The next part binds the device name ("`intel-ethernet`"), context block (`edev`,
allocated above), device operations (`device_ops`, which supports suspend, resume, and release),
and the additional optional protocol ops for ethernet (identified as `ZX_PROTOCOL_ETHERNET_IMPL`
and contained in `ethernet_impl_ops`):

        device_add_args_t args = {
            .version = DEVICE_ADD_ARGS_VERSION,
            .name = "intel-ethernet",
            .ctx = edev,
            .ops = &device_ops,
            .proto_id = ZX_PROTOCOL_ETHERNET_IMPL,
            .proto_ops = &ethernet_impl_ops,
        };
        if (device_add(dev, &args, &edev->zxdev)) {
            goto fail;
        }

### Interrupt thread creation
Finally, the background Interrupt Handling Thread (**IHT**), **irq_thread()**, is created:

        thrd_create_with_name(&edev->thread, irq_thread, edev, "eth-irq-thread");
        printf("eth: intel-ethernet online\n");
        return ZX_OK;

As discussed in the Hardware Interfacing chapter,
the IHT handles asynchronous hardware events.
We'll look at the thread itself below.
### Failure handling
In case of failure, the `fail` label is the target of various `goto`s within the code, and is
responsible for cleanup of allocated resources as well as returning a failure code to the caller:

        if (edev->btih) {
            zx_handle_close(edev->btih);
        }
        if (edev->ioh) {
            pci_enable_bus_master(&edev->pci, false);
        }

That concludes the discussion of the binding function.
## The context structure
At this point, we can circle back and take a look at the context structure:

    typedef struct ethernet_device {
        ethdev_t eth;
        mtx_t lock;
        eth_state state;
        zx_device_t* zxdev;
        pci_protocol_t pci;
        zx_handle_t ioh;
        zx_handle_t irqh;
        thrd_t thread;
        zx_handle_t btih;
        io_buffer_t buffer;
        bool online;

        // callback interface to attached ethernet layer
        ethernet_ifc_t* ifc;
        void* cookie;
    } ethernet_device_t;

It holds all of the context for the ethernet devices.
## Ethernet protocol operations
Recall from the discussion around the binding function
that we bound an `ethernet_impl_protocol_ops_t` structure called
`ethernet_impl_ops` to the driver.
This structure provides the following "bottom-half" ethernet driver protocol operations
for the Intel driver:

    static ethernet_impl_protocol_ops_t ethernet_impl_ops = {
        .query = eth_query,
        .stop = eth_stop,
        .start = eth_start,
        .queue_tx = eth_queue_tx,
        .set_param = eth_set_param,
        // .get_bti not supported
    };

We examine each in turn below.
### Ethernet protocol: **query()**
The **query()** function takes three parameters:
a context block, an options specifier, and a pointer to
an `ethernet_info_t` where the information should be stored.
> Note that at the present time, there are no options defined; therefore, the driver
> should return `ZX_ERR_INVALID_ARGS` in case of a non-zero value.
The `ethernet_info_t` structure is defined as follows (reserved fields omitted for clarity):

    typedef struct ethernet_info {
        uint32_t features;
        uint32_t mtu;
        uint8_t mac[ETH_MAC_SIZE];
    } ethernet_info_t;

The `mtu` field contains the Maximum Transmission Unit (**MTU**) size that the driver
can support.
A common value is `1500`.
The `mac` field contains `ETH_MAC_SIZE` (6 bytes) worth of Media Access Control (**MAC**)
address in big-endian order (that is, for a MAC of `01:23:45:67:89:ab`, the value of
`mac[0]` is `0x01`).
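
As a quick illustration of that byte order, the sketch below formats a MAC array in the usual colon-separated notation, starting from `mac[0]`. The names `MAC_SIZE`, `MAC_STR_LEN`, and `format_mac()` are invented for this example; `MAC_SIZE` simply mirrors the 6-byte `ETH_MAC_SIZE`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAC_SIZE 6      // stand-in for ETH_MAC_SIZE
#define MAC_STR_LEN 18  // 6 * 2 hex digits + 5 colons + NUL

// Formats a MAC address as "01:23:45:67:89:ab", emitting mac[0] first.
// (Hypothetical helper, for illustration only.)
static void format_mac(const uint8_t mac[MAC_SIZE], char out[MAC_STR_LEN]) {
    static const char hex[] = "0123456789abcdef";
    assert(mac != NULL && out != NULL);
    for (int i = 0; i < MAC_SIZE; i++) {
        out[i * 3] = hex[mac[i] >> 4];
        out[i * 3 + 1] = hex[mac[i] & 0x0F];
        // A colon separates bytes; the last byte is followed by the NUL.
        out[i * 3 + 2] = (i == MAC_SIZE - 1) ? '\0' : ':';
    }
}
```

For the array `{0x01, 0x23, 0x45, 0x67, 0x89, 0xab}` this produces `01:23:45:67:89:ab`, matching the order in which the bytes are stored.
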
Finally, the `features` field contains a bitmap of available features:

Feature                  | Meaning
-------------------------|--------
`ETHERNET_FEATURE_WLAN`  | Device is a wireless network device
`ETHERNET_FEATURE_SYNTH` | Device is a synthetic network device
`ETHERNET_FEATURE_DMA`   | Driver will be doing DMA to/from the VMO

The Intel driver's **eth_query()** is representative:

    static zx_status_t eth_query(void* ctx, uint32_t options, ethernet_info_t* info) {
        ethernet_device_t* edev = ctx;
        if (options) {
            return ZX_ERR_INVALID_ARGS;
        }
        memset(info, 0, sizeof(*info));
        info->mtu = ETH_MTU;
        memcpy(info->mac, edev->eth.mac, sizeof(edev->eth.mac));
        return ZX_OK;
    }

It returns `ZX_ERR_INVALID_ARGS` if the `options` parameter is non-zero,
and otherwise fills in the `mtu` and `mac` members.
### Ethernet protocol: **queue_tx()**
The **queue_tx()** function is responsible for taking the `ethernet_netbuf_t` network
buffer and transmitting it.

    static zx_status_t eth_queue_tx(void* ctx, uint32_t options, ethernet_netbuf_t* netbuf) {
        ethernet_device_t* edev = ctx;
        if (edev->state != ETH_RUNNING) {
            return ZX_ERR_BAD_STATE;
        }
        return eth_tx(&edev->eth, netbuf->data, netbuf->len);
    }

The real work for the Intel ethernet driver is done in `ie.c`:

    status_t eth_tx(ethdev_t* eth, const void* data, size_t len) {
        if (len > ETH_TXBUF_DSIZE) {
            printf("intel-eth: unsupported packet length %zu\n", len);
            return ZX_ERR_INVALID_ARGS;
        }
        zx_status_t status = ZX_OK;
        mtx_lock(&eth->send_lock);
        reap_tx_buffers(eth);
        // obtain buffer, copy into it, setup descriptor
        framebuf_t *frame = list_remove_head_type(&eth->free_frames, framebuf_t, node);
        if (frame == NULL) {
            status = ZX_ERR_NO_RESOURCES;
            goto out;
        }
        uint32_t n = eth->tx_wr_ptr;
        memcpy(frame->data, data, len);
        // Pad out short packets.
        if (len < 60) {
            memset(frame->data + len, 0, 60 - len);
            len = 60;
        }
        eth->txd[n].addr = frame->phys;
        eth->txd[n].info = IE_TXD_LEN(len) | IE_TXD_EOP | IE_TXD_IFCS | IE_TXD_RS;
        list_add_tail(&eth->busy_frames, &frame->node);
        // inform hw of buffer availability
        n = (n + 1) & (ETH_TXBUF_COUNT - 1);
        eth->tx_wr_ptr = n;
        writel(n, IE_TDT);
    out:
        mtx_unlock(&eth->send_lock);
        return status;
    }

This function performs buffer management and talks to the hardware.
It first locks the mutex, and then finds an available buffer.
This is done by calling **reap_tx_buffers()** to find available buffers,
and then calling the macro **list_remove_head_type()** to try and fetch
a buffer from the head of the list.
If no buffer is available, an error status (`ZX_ERR_NO_RESOURCES`) is set
and the function returns.
Otherwise, the frame data is copied (short frames, less than 60 bytes, are padded
with zeros).
The hardware is kicked via the macro **writel()**, which writes to the
`IE_TDT` register telling it which buffer is available to be written to the ethernet.
At this point, the frame is queued at the chip level, and will be sent shortly.
(The timing depends on whether other frames are queued ahead of this one.)
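
Two details of the transmit path are easy to try in isolation: zero-padding short frames to 60 bytes, and advancing the ring index with a bit mask. The sketch below is not the driver's code; `RING_COUNT`, `copy_and_pad()`, and `ring_next()` are names invented here, and the ring size of 32 is an arbitrary assumption. The masking trick only works when the ring size is a power of two, which the listing implicitly relies on for `ETH_TXBUF_COUNT`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MIN_FRAME_LEN 60u  // minimum length used in the listing (frame without FCS)
#define RING_COUNT 32u     // hypothetical ring size; must be a power of two

// Copies a payload into a frame buffer and zero-pads short frames.
// Returns the length that would be programmed into the descriptor.
static size_t copy_and_pad(uint8_t* frame, size_t frame_cap,
                           const void* data, size_t len) {
    assert(frame_cap >= MIN_FRAME_LEN && len <= frame_cap);
    memcpy(frame, data, len);
    if (len < MIN_FRAME_LEN) {
        memset(frame + len, 0, MIN_FRAME_LEN - len);
        len = MIN_FRAME_LEN;
    }
    return len;
}

// Advances a ring index, wrapping back to 0 after the last slot.
static uint32_t ring_next(uint32_t n) {
    _Static_assert((RING_COUNT & (RING_COUNT - 1)) == 0,
                   "ring size must be a power of two");
    return (n + 1) & (RING_COUNT - 1);
}
```

A 3-byte payload therefore goes out as a 60-byte frame, and the index following slot 31 is slot 0.
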
### Ethernet protocol: **set_param()**
The **set_param()** function sets a parameter based on the `param` and `value` arguments.
The Intel driver supports enabling or disabling promiscuous mode, and nothing else:

    static zx_status_t eth_set_param(void *ctx, uint32_t param, int32_t value, void* data) {
        ethernet_device_t* edev = ctx;
        zx_status_t status = ZX_OK;
        switch (param) {
        case ETHERNET_SETPARAM_PROMISC:
            if ((bool)value) {
                // enable promiscuous mode on the hardware
            } else {
                // disable promiscuous mode on the hardware
            }
            status = ZX_OK;
            break;
        default:
            status = ZX_ERR_NOT_SUPPORTED;
        }
        return status;
    }

The following parameters are available:

Parameter                             | Meaning (additional data)
--------------------------------------|--------------------------
`ETHERNET_SETPARAM_PROMISC`           | Controls promiscuous mode (bool)
`ETHERNET_SETPARAM_MULTICAST_PROMISC` | Controls multicast promiscuous mode (bool)
`ETHERNET_SETPARAM_MULTICAST_FILTER`  | Sets multicast filtering addresses (count + array)
`ETHERNET_SETPARAM_DUMP_REGS`         | Used for debug, dumps the registers (no additional data)

For multicast filtering, the `value` argument indicates the count of MAC addresses sequentially
presented via the `data` argument. For example, if `value` was `2`, then `data`
would point to two back-to-back MAC addresses (2 x 6 = 12 bytes total).
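
A driver that did support multicast filtering would need to walk that packed array. The following is a rough sketch of such a walk, assuming nothing beyond the layout described above; `unpack_multicast_list()` and `MAC_SIZE` are hypothetical names, and the Intel driver contains no such function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAC_SIZE 6  // stand-in for ETH_MAC_SIZE

// Copies `count` back-to-back MAC addresses from `data` into `out`.
// Returns the number of addresses copied, or -1 if the arguments are
// inconsistent (negative count, too many entries, or missing data).
static int32_t unpack_multicast_list(int32_t count, const void* data,
                                     uint8_t out[][MAC_SIZE], int32_t out_cap) {
    assert(out != NULL);
    if (count < 0 || count > out_cap || (count > 0 && data == NULL)) {
        return -1;
    }
    const uint8_t* p = data;
    for (int32_t i = 0; i < count; i++) {
        // Entry i starts at byte offset i * 6.
        memcpy(out[i], p + (size_t)i * MAC_SIZE, MAC_SIZE);
    }
    return count;
}
```

With `value` equal to `2`, the helper reads exactly 12 bytes: the first address from offsets 0 to 5, and the second from offsets 6 to 11.
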
Note that if a parameter is not supported, the value `ZX_ERR_NOT_SUPPORTED` is returned.
### Ethernet protocol: **start()** and **stop()**
The two functions, **eth_start()** and **eth_stop()**, are used to start and stop
the ethernet device:

    static void eth_stop(void* ctx) {
        ethernet_device_t* edev = ctx;
        edev->ifc = NULL;
    }

    static zx_status_t eth_start(void* ctx, ethernet_ifc_t* ifc, void* cookie) {
        ethernet_device_t* edev = ctx;
        zx_status_t status = ZX_OK;
        if (edev->ifc) {
            status = ZX_ERR_BAD_STATE;
        } else {
            edev->ifc = ifc;
            edev->cookie = cookie;
            edev->ifc->status(edev->cookie, edev->online ? ETHERNET_STATUS_ONLINE : 0);
        }
        return status;
    }

The Intel ethernet driver code shown above is typical; the `ifc` member of the context
block is used as both an indication of status (`NULL` if stopped) and, when running,
it points to a valid interface block.
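
The "pointer doubles as state" idea can be reduced to a few lines. This sketch uses invented types (`demo_ifc_t`, `demo_dev_t`) and invented status codes in place of the DDK ones, and it leaves out locking entirely.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical stand-ins for ethernet_ifc_t and the device context.
typedef struct { int unused; } demo_ifc_t;
typedef struct {
    const demo_ifc_t* ifc;  // NULL while stopped, valid while running
} demo_dev_t;

enum { DEMO_OK = 0, DEMO_ERR_BAD_STATE = -1 };  // made-up status codes

// Starting succeeds only if no interface is currently attached.
static int demo_start(demo_dev_t* dev, const demo_ifc_t* ifc) {
    assert(dev != NULL && ifc != NULL);
    if (dev->ifc != NULL) {
        return DEMO_ERR_BAD_STATE;
    }
    dev->ifc = ifc;
    return DEMO_OK;
}

// Stopping simply detaches the interface, which also marks the device stopped.
static void demo_stop(demo_dev_t* dev) {
    assert(dev != NULL);
    dev->ifc = NULL;
}

static int demo_is_running(const demo_dev_t* dev) {
    return dev->ifc != NULL;
}
```

A second `demo_start()` without an intervening `demo_stop()` is rejected, mirroring the `ZX_ERR_BAD_STATE` branch in **eth_start()**.
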
### Ethernet protocol: **get_bti()**
The Intel ethernet driver doesn't support the optional **get_bti()** callout.
This callout is used to return a handle to the BTI.
A driver that doesn't support it can either leave it out of the `ethernet_impl_protocol_ops_t`
structure (like the Intel ethernet driver does), or return `ZX_HANDLE_INVALID`.
If supported, the handle is returned from the function.
Note that the ownership of the handle is *not* transferred; the ethernet driver still
owns the handle.
In particular, the caller must not close the handle.
## Receiving data
The IHT created by the binding function waits for data from the ethernet hardware.
When data arrives, it calls **eth_handle_irq()** to process the data.
The portion of the thread in `ethernet.c` is as follows:

    static int irq_thread(void* arg) {
        ethernet_device_t* edev = arg;
        for (;;) {
            zx_status_t r;
            r = zx_interrupt_wait(edev->irqh, NULL);
            if (r != ZX_OK) {
                printf("eth: irq wait failed? %d\n", r);
                break;
            }
            unsigned irq = eth_handle_irq(&edev->eth);
            if (irq & ETH_IRQ_RX) {
                void* data;
                size_t len;
                while (eth_rx(&edev->eth, &data, &len) == ZX_OK) {
                    if (edev->ifc && (edev->state == ETH_RUNNING)) {
                        edev->ifc->recv(edev->cookie, data, len, 0);
                    }
                    eth_rx_ack(&edev->eth);
                }
            }
            if (irq & ETH_IRQ_LSC) {
                bool was_online = edev->online;
                bool online = eth_status_online(&edev->eth);
                zxlogf(DEBUG, "intel-eth: ETH_IRQ_LSC fired: %d->%d", was_online, online);
                if (online != was_online) {
                    edev->online = online;
                    if (edev->ifc) {
                        edev->ifc->status(edev->cookie, online ? ETHERNET_STATUS_ONLINE : 0);
                    }
                }
            }
        }
        return 0;
    }

The thread waits on an interrupt, and, when one occurs, calls **eth_handle_irq()**
to read the interrupt reason register (which also clears the interrupt
indication on the card).
Based on the value read from **eth_handle_irq()**,
there are two major flows in the thread:
1. the bit `ETH_IRQ_RX` is present &mdash; this indicates data has been
received by the card,
2. the bit `ETH_IRQ_LSC` is present &mdash; this indicates a Line Status
Change (LSC) event has been detected by the card.
If data has been received, the following functions are called:
* **eth_rx()** &mdash; obtains a pointer to the receive buffer containing the data
* **eth_rx_ack()** &mdash; acknowledges receipt of the packet by writing to registers on the card
Note that further processing is done by the ethernet device protocol (available via `edev->ifc`):
* **edev->ifc->recv()** &mdash; processes the received data
* **edev->ifc->status()** &mdash; processes the status change
In the case of a line status change, **eth_status_online()** is called to handle the event.
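
The two flows are independent bit tests, so a single interrupt can trigger both. The sketch below models just that dispatch logic; the bit values, `link_state_t`, and `dispatch_irq()` are all made up for illustration and do not reflect the real register layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define IRQ_RX (1u << 0)   // hypothetical bit: data received
#define IRQ_LSC (1u << 1)  // hypothetical bit: line status change

typedef struct {
    bool online;               // last known link state
    int rx_passes;             // how many times the receive path ran
    int status_notifications;  // how many status callbacks would fire
} link_state_t;

// Handles one interrupt reason word. `link_up_now` plays the role of the
// value the driver would read back from the hardware on a status change.
static void dispatch_irq(link_state_t* st, unsigned reason, bool link_up_now) {
    assert(st != NULL);
    if (reason & IRQ_RX) {
        st->rx_passes++;
    }
    if (reason & IRQ_LSC) {
        // Notify only when the state really changed.
        if (link_up_now != st->online) {
            st->online = link_up_now;
            st->status_notifications++;
        }
    }
}
```

Note that a line status interrupt that reports the same state as before produces no notification, just as the driver compares `online` against `was_online` before calling the status callback.
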
The receive function, **eth_rx()**, is as follows:

    status_t eth_rx(ethdev_t* eth, void** data, size_t* len) {
        uint32_t n = eth->rx_rd_ptr;
        uint64_t info = eth->rxd[n].info;
        if (!(info & IE_RXD_DONE)) {
            return ZX_ERR_SHOULD_WAIT;
        }
        // copy out packet
        zx_status_t r = IE_RXD_LEN(info);
        *data = eth->rxb + ETH_RXBUF_SIZE * n;
        *len = r;
        return ZX_OK;
    }