Нет-Work: Darwin Networking
Networking has forever redefined computing. With the advent of the Internet, a system without network connectivity is of little use, as more and more applications rely on remote servers to perform some or all of their functions. This is especially true with the move to the Cloud, which in some respects comes full circle, back to the "dumb terminals" of the mainframe age.
Like other operating systems, Darwin places considerable emphasis on its networking stack. Originally inherited from the BSD layer, it has been continuously refined and extended by Apple, with support for new protocols and features improving functionality, efficiency, and speed. This chapter discusses Darwin's networking features from a user mode perspective.
The chapter begins with a review of socket families, specifically the ones idiosyncratic to Darwin. These are the PF_NDRV
sockets, which enable (to a certain extent) raw packet manipulation, and the PF_SYSTEM
sockets for user/kernel mode communication. The latter is especially important, since it contains quite a few proprietary, undocumented but powerful interfaces.
We next briefly explain MultiPath TCP, an emerging Internet standard which Apple was quick to adopt. This required the addition of new system calls in Darwin 13, as well as some sysctl
MIBs. MIBs are also the focus of the next two sections, which deal with network configuration from user mode, and the gathering of statistics. Statistics, however, are where Darwin excels through undocumented APIs. The PF_SYSTEM
control sockets introduced earlier are especially useful to provide live network statistics which other OSes can only struggle with.
Following that, we turn our attention to firewalling and packet filtering. Darwin provides not one, but two built-in network-layer firewalls - BSD's ipfw(8)
(which has been deprecated as of around Darwin 15) and pf(8)
MacOS further provides an application layer firewall as well, through a dedicated kernel extension and the socketfilterfw(8) daemon, discussed later in this chapter.
Last, but not least, we turn a spotlight towards two entirely undocumented but powerful APIs: The first is that of Network Extension Control Policies, which enable QoS, flow control and more through policy objects and a proprietary file descriptor. The second is the mysterious Skywalk, with its nexus and channel objects. This is an entire subsystem which is not only undocumented, but intentionally left out of XNU's public sources. The pages of this book provide the only public documentation of this important mechanism to date.
Darwin Extensions of the BSD Socket APIs
Since its inception in the 1980's, the BSD socket model has proven time and again its superb design and extensibility to new protocols. The set of system calls used in manipulating sockets is (for the most part) entirely implementation agnostic. The few times protocol specific parameters are required, they may be set through the sockaddr_* variant used when bind(2)ing the socket, or through [get/set]sockopt(2), if family options are supported.
The <sys/socket.h> header defines numerous protocol families (as [AF/PF]_* constants), but in practice only a subset of them are supported in Darwin. These are shown in Table 16-1:
# | Protocol Family | Purpose |
---|---|---|
1 | PF_[LOCAL/UNIX] | UNIX domain sockets |
2 | PF_INET | IPv4 |
14 | PF_ROUTE | Internal Routing Protocol |
27 | PF_NDRV | Network Drivers: Raw device access |
29 | PF_KEY | IPSec Key Management (RFC2367) |
30 | PF_INET6 | IPv6 (And IPv4 mapped) |
32 | PF_SYSTEM | System/kernel local communication (Proprietary) |
Most of these are standard, and should be well known to the reader from other UN*X variants. PF_NDRV
and PF_SYSTEM
, however, are Darwin proprietary, and deserve special discussion.
PF_NDRV
The PF_NDRV
protocol family is a somewhat misnamed one - the documentation describes it as used by "Network Drivers", though drivers are generally kernel mode beasts. A better name would have been "PF_RAW", as the family allows raw access to network interfaces, or perhaps (in keeping with Linux) "PF_PACKET". Raw interface access is quite similar to AF_INET[6]
's SOCK_RAW
, or using the IP_HDRINCL setsockopt(2) option. Unlike either, however, PF_NDRV allows control over all layers - down to the link layer header.
The PF_NDRV
sockets are created as usual, but require a different socket address family - struct sockaddr_ndrv
. As a sockaddr_*
compatible structure, its first fields are the byte-size snd_len
and snd_family
(set to sizeof(sockaddr_ndrv) and PF_NDRV, respectively). The only other field in the structure is the snd_name character array (of IFNAMSIZ bytes), which holds the name of the underlying interface the socket is to be bound to.
Though seldom used, PF_NDRV
allows a user-mode client to register its own EtherType, so that the kernel will dispatch packets to it. In that sense, it allows for "user mode drivers" to register their custom protocol implementations, using a setsockopt(2)
with the SOL_NDRVPROTO
level and NDRV_SETDMXSPEC
option name. This, however, will only work if there is no a priori registered protocol (otherwise returning EADDRINUSE
), and is therefore not useful for general packet sniffing.
A much more useful aspect of PF_NDRV
is to create custom packets. A socket is created the same way, but specifying SOCK_RAW
for the socket type. Following the binding to an interface, packets can be fabricated by directly writing them to a buffer, and sent on the bound interface using sendto(2).
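A minimal sketch of this flow follows (run as root; the interface name en0 and the EtherType 0x1234 are arbitrary choices for illustration):

```c
#include <sys/socket.h>
#include <net/if.h>        // IFNAMSIZ
#include <net/ndrv.h>      // struct sockaddr_ndrv
#include <net/ethernet.h>  // struct ether_header, ETHER_HDR_LEN
#include <arpa/inet.h>     // htons
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int s = socket(PF_NDRV, SOCK_RAW, 0);
    if (s < 0) { perror("socket"); return 1; }

    // Bind the socket to the underlying interface (en0 here, hypothetically)
    struct sockaddr_ndrv snd = { 0 };
    snd.snd_len    = sizeof(snd);
    snd.snd_family = PF_NDRV;
    strlcpy((char *)snd.snd_name, "en0", sizeof(snd.snd_name));
    if (bind(s, (struct sockaddr *)&snd, sizeof(snd)) < 0) { perror("bind"); return 2; }

    // Fabricate a frame: link layer header first, then an arbitrary payload
    unsigned char frame[ETHER_HDR_LEN + 16] = { 0 };
    struct ether_header *eh = (struct ether_header *)frame;
    memset(eh->ether_dhost, 0xff, ETHER_ADDR_LEN);        // broadcast destination
    eh->ether_type = htons(0x1234);                       // made-up EtherType
    memcpy(frame + ETHER_HDR_LEN, "Hello, Darwin!", 15);  // payload

    if (sendto(s, frame, sizeof(frame), 0,
               (struct sockaddr *)&snd, sizeof(snd)) < 0) { perror("sendto"); return 3; }
    close(s);
    return 0;
}
```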
Using PF_NDRV to implement a custom network protocol
The following program can be used to demonstrate the capabilities of PF_NDRV for both custom protocol packet reception and sending. Because a lot of the code is common, the receiving functionality is #ifdef'ed under LISTENER; otherwise the program sends packets. You can try this program with any EtherType (specified as a decimal argument), so long as it doesn't collide with an already existing one (e.g. IP's 0x0800 or IPv6's 0x86dd).
A PF_NDRV client/listener program
PF_SYSTEM
The PF_SYSTEM protocol family is a proprietary Darwin mechanism which provides communication between kernel mode providers and user mode requesters. Two protocols are implemented: SYSPROTO_EVENT (1), used with SOCK_RAW sockets, and SYSPROTO_CONTROL (2), used with SOCK_DGRAM (or SOCK_STREAM) sockets.
SYSPROTO_EVENT
The SYSPROTO_EVENT
protocol is used by the kernel to multicast events to interested parties. In that sense, it is very similar to Linux's AF_NETLINK
. No binding is required for the socket - but an event filter must be set using a SIOCSKEVFILT
ioctl(2)
request. The ioctl(2)
takes a struct kev_request
, defined in <sys/kern_event.h> (along with the ioctl(2) codes) to consist of three uint32_t fields - the vendor_code, kev_class and kev_subclass. Zero values (the KEV_ANY_* constants) may be specified as wildcards for any of the three. The only vendor supported out of the box is KEV_VENDOR_APPLE, and Table 16-3 shows the classes which exist in Darwin 18:
# | KEV_..._CLASS | # | KEV_..._SUBCLASS | Event types |
---|---|---|---|---|
1 | NETWORK | 1 | INET | IPv4 (codes in …) |
 | | 2 | DL | Data Link subclass (codes in …) |
 | | 3 | NETPOLICY | Network policy subclass |
 | | 4 | SOCKET | Sockets |
 | | 5 | ATALK | AppleTalk (no longer used) |
 | | 6 | INET6 | IPv6 (codes in …) |
 | | 7 | ND6 | IPv6 Neighbor Discovery Protocol |
 | | 8 | NECP | NECP subclass |
 | | 9 | NETAGENT | Net-Agent subclass |
 | | 10 | LOG | Log subclass |
 | | 11 | NETEVENT | Generic Net events subclass |
 | | 12 | MPTCP | Global MPTCP events subclass |
2 | IOKIT | ? | ? | IOKit drivers |
3 | SYSTEM | 2 | CTL | Control notifications |
 | | 3 | MEMORYSTATUS | Jetsam/memorystatus subclass |
4 | APPLESHARE | | | AppleShare events (no longer used) |
5 | FIREWALL | 1 | IPFW | ipfw - IPv4 firewalling |
 | | 2 | IP6FW | ipfw - IPv6 firewalling |
6 | IEEE80211 | 1 | ? | Wireless Ethernet (IO80211Family drivers) |
Following the setting of the ioctl(2)
, events can be read from the socket as a stream of kern_event_msg
structures. Each such message specifies its total_size, a vendor_code, kev_class and kev_subclass (which are guaranteed to match the filter), as well as a monotonically increasing id, an event_code, and any number of event_data words (up to the total_size specified). Additional ioctl(2) codes are SIOCGKEVID (to get the current event ID), SIOCGKEVFILT (to get the filter set on the socket) and SIOCGKEVVENDOR (to look up the vendor code for a string vendor name). Apple's built-in mechanisms naturally use the KEV_VENDOR_APPLE vendor code, although vendor code 1000 has also been seen on MacOS (for socketfilterfw, discussed later).
A SYSPROTO_EVENT listener
The programming model of SYSPROTO_EVENT is so simple that an event listener can be coded in but a few lines:
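A minimal sketch of such a listener, assuming the kev_request and kern_event_msg definitions from <sys/kern_event.h> are visible to user mode (as they are in recent SDKs):

```c
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <sys/kern_event.h>   // kev_request, kern_event_msg, SIOCSKEVFILT
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(PF_SYSTEM, SOCK_RAW, SYSPROTO_EVENT);
    if (fd < 0) { perror("socket"); return 1; }

    // Filter: any Apple class/subclass (zero values act as wildcards)
    struct kev_request req = { .vendor_code  = KEV_VENDOR_APPLE,
                               .kev_class    = KEV_ANY_CLASS,
                               .kev_subclass = KEV_ANY_SUBCLASS };
    if (ioctl(fd, SIOCSKEVFILT, &req) < 0) { perror("SIOCSKEVFILT"); return 2; }

    char buf[1024];
    for (;;) {
        ssize_t n = recv(fd, buf, sizeof(buf), 0);   // blocks until an event arrives
        if (n <= 0) break;
        struct kern_event_msg *msg = (struct kern_event_msg *)buf;
        printf("id %u: %u/%u/%u code %u (%u bytes)\n",
               msg->id, msg->vendor_code, msg->kev_class, msg->kev_subclass,
               msg->event_code, msg->total_size);
    }
    close(fd);
    return 0;
}
```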
Compiling the above listener and running it will block, occasionally spitting out event notifications. The most common on MacOS are those of IEEE80211 (1/6/1), which emits messages on WiFi scans and state changes. Toggling the WiFi interface will also generate NETWORK/DL messages (1/1/2) as the interface reconfigures, and NETWORK/INET6 (1/1/6) messages as it gets a dynamic IP address.
Being a Darwin proprietary mechanism, the event notifications are used by Apple's own daemons. Using procexp(j) you can see which sockets are used by which daemons - including the above program ('kev'), when it runs:
SYSPROTO_EVENT socket usage with procexp(j)
SYSPROTO_CONTROL
The second protocol of the PF_SYSTEM
family is SYSPROTO_CONTROL
. This merely provides a control channel from user space onto a given provider, which may be a kernel subsystem or some kernel extension, calling on the ctl_register
KPI. Such SYSPROTO_CONTROL
sockets are associated with control names, which Apple maintains in reverse DNS notation. Apple keeps adding more providers between Darwin versions and in new kernel extensions - needless to say, all undocumented. Using netstat(1)
, you can see both which providers are registered (under "Registered kernel control modules"), and which are actively in use (through "Active kernel control sockets"), although to see which processes are actually holding control sockets one needs to use lsof(1)
or procexp(j)
. Table 16-6 shows the providers found in Darwin 18.
com.apple. Control Name | Provides |
---|---|
network.statistics | Live socket statistics and notifications |
content-filter | XNU-2782: User space packet data filtering. Used by cfilutil |
fileutil.kext.state[less/ful].ctl | MacOS 14: AppleFileUtil.kext |
flow-divert | XNU-2422: MPTCP flow diversions |
mcx.kernctl.alr | MacOS mcxalr.kext: Managed Client eXtensions control |
net.ipsec_control | XNU-2422: User-mode IPSEC controls |
net.necp_control | XNU-2782: Network Extension Control Policies |
net.netagent | XNU-3248: Network Agents (discussed later) |
net.rvi_control | RemoteVirtualInterface.kext: control socket |
net.utun_control | User mode tunneling (VPNs) |
netsrc | Network/route policies and statistics |
network.advisory | XNU-3248: Report SYMPTOMS_ADVISORY_[CELL/WIFI]_[BAD/OK] to kernel |
network.tcp_ccdebug | XNU-2782: Collect flow control algorithm debug data |
nke.sockwall | MacOS: The Application Layer Firewall (ALF.kext), discussed later |
nke.webcontentfilter | webcontentfilter.kext: "HolyInquisition" socket filtering via user-mode proxy |
packet-mangler | XNU-2782: Tracks flows and handles TCP options (Used by pktmnglr ) |
uart.[sk].* | AppleOnBoardSerial.kext and other UART devices (MacOS: BLTH , MALS , SOC , iOS: oscar , gas-gauge , wlan-debug and iap ) |
userspace_ethernet | IOUserEthernet.kext : User mode tunneling (Layer II Ethernet) |
Once a socket is created, a CTLIOCGINFO
ioctl(2)
must be issued with a struct ctl_info
argument, whose ctl_name
field is initialized with the requested control name. If the ioctl(2)
is successful, the socket may then be connect(2)
ed through a struct sockaddr_ctl
, initialized with the ctl_id
returned from the previous ioctl(2)
.
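A hedged sketch of this flow is shown below (SOCK_DGRAM is the socket type commonly used with SYSPROTO_CONTROL; the control name and unit are supplied by the caller):

```c
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <sys/sys_domain.h>    // AF_SYS_CONTROL, SYSPROTO_CONTROL
#include <sys/kern_control.h>  // CTLIOCGINFO, struct ctl_info, struct sockaddr_ctl
#include <stdint.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>

// Resolve a control name to its ctl_id and connect to it.
// Returns a connected descriptor, or -1 on failure.
int ctl_connect(const char *name, uint32_t unit)
{
    int fd = socket(PF_SYSTEM, SOCK_DGRAM, SYSPROTO_CONTROL);
    if (fd < 0) { perror("socket"); return -1; }

    struct ctl_info info = { 0 };
    strlcpy(info.ctl_name, name, sizeof(info.ctl_name));
    if (ioctl(fd, CTLIOCGINFO, &info) < 0) {      // name -> ctl_id
        perror("CTLIOCGINFO"); close(fd); return -1;
    }

    struct sockaddr_ctl sc = { 0 };
    sc.sc_len     = sizeof(sc);
    sc.sc_family  = AF_SYSTEM;
    sc.ss_sysaddr = AF_SYS_CONTROL;
    sc.sc_id      = info.ctl_id;                  // from the ioctl above
    sc.sc_unit    = unit;                         // provider specific; 0 = "any"
    if (connect(fd, (struct sockaddr *)&sc, sizeof(sc)) < 0) {
        perror("connect"); close(fd); return -1;
    }
    return fd;
}
```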
A connected socket, however, is as far as the common SYSPROTO_CONTROL functionality goes. From that point on, every socket behaves differently, depending on the underlying provider. The general flow usually entails send(2)ing and recv(2)ing, and in some cases using [get/set]sockopt(2). The system calls result in kernel mode callbacks (registered by the implementing party, usually a kernel extension) being invoked. The kernel-mode implementation is discussed in Volume II.
Control sockets are commonly used for administering other networking facilities, so a few examples of their usage will be discussed in this chapter.
SYSPROTO_CONTROL
User mode tunneling, a feature commonly tapped by VPN applications, is a great example of a SYSPROTO_CONTROL socket. Such applications, rather than installing some kernel filtering mechanism, draw on a kernel facility to request the creation of a new interface, which appears to other processes as another link layer, complete with its own IPv4 or IPv6 address. When processes bind to the interface, the IP-layer packets are redirected to the utun controller, which can then do with them as it pleases - commonly encapsulating them in an additional IP layer (with or without encryption), and sending them elsewhere. The process works both ways, in that the utun controller can also inject packets onto the tunneling interface, which the kernel will then route to their bound sockets, as it would have with any other interface.
The compact code in Listing 16-7 sets up com.apple.net.utun_control
:
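A minimal sketch of this setup follows (the UTUN_CONTROL_NAME and UTUN_OPT_IFNAME (value 2) definitions come from XNU's net/if_utun.h; an sc_unit of 2 requests utun1, and root privileges are required):

```c
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <sys/sys_domain.h>
#include <sys/kern_control.h>
#include <net/if.h>            // IFNAMSIZ
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define UTUN_CONTROL_NAME "com.apple.net.utun_control"
#define UTUN_OPT_IFNAME   2     // from XNU's net/if_utun.h

int main(void)
{
    int fd = socket(PF_SYSTEM, SOCK_DGRAM, SYSPROTO_CONTROL);
    if (fd < 0) { perror("socket"); return 1; }

    struct ctl_info info = { 0 };
    strlcpy(info.ctl_name, UTUN_CONTROL_NAME, sizeof(info.ctl_name));
    if (ioctl(fd, CTLIOCGINFO, &info) < 0) { perror("CTLIOCGINFO"); return 2; }

    struct sockaddr_ctl sc = { 0 };
    sc.sc_len     = sizeof(sc);
    sc.sc_family  = AF_SYSTEM;
    sc.ss_sysaddr = AF_SYS_CONTROL;
    sc.sc_id      = info.ctl_id;
    sc.sc_unit    = 2;            // sc_unit - 1 == interface number, i.e. utun1
    if (connect(fd, (struct sockaddr *)&sc, sizeof(sc)) < 0) { perror("connect"); return 3; }

    char ifname[IFNAMSIZ] = { 0 };
    socklen_t len = sizeof(ifname);
    if (getsockopt(fd, SYSPROTO_CONTROL, UTUN_OPT_IFNAME, ifname, &len) == 0)
        printf("created %s - tunneled packets now arrive on this descriptor\n", ifname);

    pause();   // keep the descriptor open; closing it tears the interface down
    return 0;
}
```

While the program runs, the new interface shows up in ifconfig(1) output; it disappears when the descriptor is closed.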
Once the descriptor is set up, a trivial read(2) loop implementation is left as an exercise for the avid reader. When completed, an interface will appear (in the example we use utun1, since utun0 is occasionally used by identityserviced). After configuring an IP address on the interface, generating any traffic (e.g. with ping(8)) will send those IP packets to the process:
In the first terminal, set up the tunnel; in another terminal, once the tunnel is up, configure an address and generate traffic.
Proprietary socket system calls
In addition to the _nocancel
extensions found in XNU for other I/O related calls, Apple has extended the BSD socket API with several proprietary system calls - which are, as usual, undocumented.
pid_shutdown_sockets (#436)
The pid_shutdown_sockets
system call enables the caller to forcefully shut down all sockets presently open by a given process (specified by PID). This is only used on *OS, wherein the only caller of this system call appears to be
socket_delegate (#450)
The socket_delegate
system call works like socket(2) - only it receives an additional (fourth) argument, specifying the PID of the process on whose behalf the socket is created (i.e. the socket is "delegated" to that process for accounting and policy purposes). There appears to be little use for this system call.
[dis]connectx and [send/recv]msg_x (#447-8, #480-1)
Apple introduced several non-standard system calls which both extend BSD sockets and provide support for MultiPath TCP (and other protocols) as far back as Darwin 13. The first pair, [dis]connectx
, provide for quick bind(2)/connect(2)
, deferred connection setup (CONNECT_RESUME_ON_READ_WRITE
) as well as for supporting multiple address associations. These only became an official API as of Darwin 15, and got a fairly detailed manual page. The second pair, [send/recv]msg_x (supporting array based [send/recv]msg for protocols handling multiple simultaneous datagrams), is still not officially provided to this day. The prototypes remain #ifdef PRIVATE, so as not to appear in the user mode headers.
Apple expects developers to use the (very) high level abstraction of NSURLSession
objects, setting the MultipathServiceType
property of NSURLSessionConfiguration
as documented in a developer article[1]. For pure C programmers, the system calls remain the most effective mechanism to tap into MPTCP's powerful functionality, and other address family specific yet non-standard features.
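As an illustration of the public half, the following sketch connects a plain TCP socket with connectx(2); the destination address is a placeholder (TEST-NET), and an actual MPTCP connection would additionally require an AF_MULTIPATH socket, which remains private:

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdio.h>

int main(void)
{
    int s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
    if (s < 0) { perror("socket"); return 1; }

    struct sockaddr_in dst = { .sin_len = sizeof(dst), .sin_family = AF_INET,
                               .sin_port = htons(80) };
    inet_pton(AF_INET, "203.0.113.1", &dst.sin_addr);   // placeholder (TEST-NET) address

    sa_endpoints_t eps = { 0 };
    eps.sae_dstaddr    = (struct sockaddr *)&dst;
    eps.sae_dstaddrlen = sizeof(dst);

    // One call performs the equivalent of bind(2) + connect(2); flags such as
    // CONNECT_RESUME_ON_READ_WRITE or CONNECT_DATA_IDEMPOTENT (TFO) may be added.
    sae_connid_t cid = SAE_CONNID_ANY;
    if (connectx(s, &eps, SAE_ASSOCID_ANY, 0, NULL, 0, NULL, &cid) < 0) {
        perror("connectx"); return 2;
    }
    printf("connected (connection id %u)\n", cid);
    return 0;
}
```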
peeloff (#449)
peeloff
was a short-lived system call meant to extract an association from a socket. It was added in XNU-2422, but apparently replaced with a null implementation (returning 0) in XNU-4570.
Interfaces
As with other UN*X systems, Darwin provides user mode with network access through the notion of interfaces. These are devices which, unlike the standard character or block devices, have no filesystem presence and can only be accessed through sockets, and controlled through ioctl(2)
on the bound sockets. The command line ifconfig(1)
utility comes in very handy to view devices, with -l
for a short list or -a
for full information. Trying this on any Darwin system details the interfaces, which follow the naming conventions shown in Table 16-10:
Link | Interface | Provided by | Used for |
---|---|---|---|
Loopback | lo0 | XNU | The loopback ("localhost") interface |
 | gif# | | Generic IPv[4/6]-in-IPv[4/6] (RFC2893) tunneling |
 | stf# | | 6-to-4 (RFC3056) tunneling |
 | utun# | | User mode tunneling |
 | ipsec# | | IPSec tunneling |
Ethernet | en# | IONetworkingFamily | Ethernet (wired, wireless, and over other media) |
 | awdl0 | IOgPTPPlugin.kext | Apple Wireless Device Link |
 | p2p# | AppleBCMWLANCore | Wi-Fi peer to peer |
 | ppp# | PPP.kext | Point-to-Point Protocol (…) |
 | bridge# | XNU | MacOS: Interface bridging |
 | fw# | IOFireWireIP | MacOS: IP over FireWire |
 | rvi# | RemoteVirtualInterface | MacOS: captures packets from attached *OS devices |
 | ap# | | Access Point (personal hotspot) |
Cell | pdp_ip# | AppleBasebandPCI[ICE/MAV]PDP | iOS/WatchOS: Cellular connection (if applicable) |
USB | XHC# | AppleUSBHostPacketFilter | MacOS: USB Packet capture |
Capture | [pk/ip]tap# | XNU | Packet or IP Layer capture from multiple interfaces |
The en# interfaces are the ones most commonly used. Not only do the local wired Ethernet (on Mac Minis and iMacs, through ......) and wireless interfaces (used by the Airport Broadcom or other NIC kext) appear as en#, but so do Bluetooth, the Mac ↔ BridgeOS interface (usually en5, started by the com.apple.driver.usb.cdc.ncm driver), and the tethering interface of the iPhone when Mobile Hotspot is activated over USB. The system maintains a property list of all Ethernet interfaces and their mappings to the GUI visible strings ("UserDefinedName"), which can also be viewed with networksetup -listallhardwareports.
Interface Configuration
As with other UN*X systems, the ifconfig(8)
utility can be used to obtain information on interfaces and perform various operations, such as plumb, add/remove IPv(4/6) addresses or aliases, and bond. Darwin also extends the interface object internally with numerous proprietary ioctl(2)
codes, all marked PRIVATE
or BSD_KERNEL_PRIVATE
so as to not be visible in user mode. Whereas the standard SIOC[S/G]IF* codes are common to other UN*X systems, XNU's proprietary ioctl(2) codes are shown in Table 16-11 (next page). The codes are shown in their macro form, which also defines their third (void *) argument. Note that some of the structures are also undocumented, and are found elsewhere in the XNU sources. The book's companion XXR can come in handy to help you locate the structures and copy them to a user mode header.
ioctl(2) code | Value (_IOWR() ) | Description |
---|---|---|
SIOCGIFCONF[32/64] | ('i', 36, struct ifconf[32/64]) | get ifnet list |
SIOCGIFMEDIA[32/64] | ('i', 56, struct ifmediareq[32/64]) | get net media |
SIOCGIFGETRTREFCNT | ('i', 137, struct ifreq) | get interface route refcnt |
SIOCGIFLINKQUALITYMETRIC | ('i', 138, struct ifreq) | get LQM |
SIOCGIFEFLAGS | ('i', 142, struct ifreq) | get extended ifnet flags |
SIOC[S/G]IFDESC | ('i', 143/144, struct if_descreq) | Set/Get interface description |
SIOC[S/G]IFLINKPARAMS | ('i', 145/146, struct if_linkparamsreq) | Set/Get output TBR rate/percent |
SIOCGIFQUEUESTATS | ('i', 147, struct if_qstatsreq) | Get interface queue statistics |
SIOC[S/G]IFTHROTTLE | ('i', 148/149, struct if_throttlereq) | Set/Get throttling for interface |
SIOCGASSOCIDS[/32/64] | ('s', 150, struct so_aidreq[/32/64]) | get associds |
SIOCGCONNIDS[/32/64] | ('s', 151, struct so_cidreq[/32/64]) | get connids |
SIOCGCONNINFO[/32/64] | ('s', 152, struct so_cinforeq[/32/64]) | get conninfo |
SIOC[S/G]CONNORDER | ('s', 153/154, struct so_cordreq) | set conn order |
SIOC[S/G]IFLOG | ('i', 155/156, struct ifreq) | Set/Get interface log level |
SIOCGIFDELEGATE | ('i', 157, struct ifreq) | Get delegated interface index |
SIOCGIFLLADDR | ('i', 158, struct ifreq) | get link level addr |
SIOCGIFTYPE | ('i', 159, struct ifreq) | get interface type |
SIOC[G/S]IFEXPENSIVE | ('i', 160/161, struct ifreq) | get/mark interface expensive flag |
SIOC[S/G]IF2KCL | ('i', 162/163, struct ifreq) | interface prefers 2 KB clusters |
SIOCGSTARTDELAY | ('i', 164, struct ifreq) | Add artificial delay |
SIOCAIFAGENTID | ('i', 165, struct if_agentidreq) | Add netagent id |
SIOCDIFAGENTID | ('i', 166, struct if_agentidreq) | Delete netagent id |
SIOCGIFAGENTIDS[/32/64] | ('i', 167, struct if_agentidsreq[/32/64]) | Get netagent ids |
SIOCGIFAGENTDATA[/32/64] | ('i', 168, struct netagent_req[/32/64]) | Get netagent data |
SIOC[S/G]IFINTERFACESTATE | ('i', 169/170, struct ifreq) | set/get interface state |
SIOC[S/G]IFPROBECONNECTIVITY | ('i', 171/172, struct ifreq) | Start/Stop or check connectivity probes |
SIOCGIFFUNCTIONALTYPE | ('i', 173, struct ifreq) | get interface functional type |
SIOC[S/G]IFNETSIGNATURE | ('i', 174/175, struct if_nsreq) | Set/Get network signature |
SIOC[G/S]ECNMODE | ('i', 176/177, struct ifreq) | Explicit Congestion Notification mode (IFRTYPE_ECN_[[EN/DIS]ABLE/DEFAULT] ) |
SIOCSIFORDER | ('i', 178, struct if_order) | Set interface ordering |
SIOC[S/G]QOSMARKINGMODE[/ENABLED] | ('i', 180-183, struct ifreq) | Get/set QoS marking mode |
SIOCSIFTIMESTAMP[EN/DIS]ABLE | ('i', 184-185, struct ifreq) | Enable/Disable interface timestamp |
SIOCGIFTIMESTAMPENABLED | ('i', 186, struct ifreq) | Get interface timestamp enabled status |
SIOCSIFDISABLEOUTPUT | ('i', 187, struct ifreq) | Disable output (DEVELOPMENT/DEBUG ) |
SIOCGIFAGENTLIST[/32/64] | ('i', 190, struct netagentlist_req[/32/64]) | Get netagent dump |
SIOC[S/G]IFLOWINTERNET | ('i', 191/192, struct ifreq) | Set/Get low internet download/upload |
SIOC[G/S]IFNAT64PREFIX | ('i', 193/194, struct if_nat64req) | Get/set Interface NAT64 prefixes |
SIOCGIFNEXUS | ('i', 195, struct if_nexusreq) | get nexus details |
SIOCGIFPROTOLIST[/32/64] | ('i', 196, struct if_protolistreq[/32/64]) | get list of attached protocols |
SIOC[G/S]IFLOWPOWER | ('i', 199/200, struct ifreq) | Low Power Mode |
SIOCGIFCLAT46ADDR | ('i', 201, struct if_clat46req) | Get CLAT (IPv4-in-IPv6) addresses |
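As an example of driving one of these codes from user mode, the following sketch re-derives SIOCGIFFUNCTIONALTYPE per the table above (the macro, like the ifru_functional_type union member, is hidden behind PRIVATE), and reads the resulting IFRTYPE_FUNCTIONAL_* value out of the first 32 bits of the ifr_ifru union:

```c
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <unistd.h>

// Private code, re-derived per Table 16-11: ('i', 173, struct ifreq)
#define SIOCGIFFUNCTIONALTYPE _IOWR('i', 173, struct ifreq)

int main(int argc, char **argv)
{
    const char *ifname = (argc > 1) ? argv[1] : "en0";
    int s = socket(AF_INET, SOCK_DGRAM, 0);   // any socket will do for interface ioctls
    if (s < 0) { perror("socket"); return 1; }

    struct ifreq ifr = { 0 };
    strlcpy(ifr.ifr_name, ifname, sizeof(ifr.ifr_name));
    if (ioctl(s, SIOCGIFFUNCTIONALTYPE, &ifr) < 0) { perror("ioctl"); return 2; }

    uint32_t type;
    memcpy(&type, &ifr.ifr_ifru, sizeof(type));   // IFRTYPE_FUNCTIONAL_* value
    printf("%s functional type: %u\n", ifname, type);
    close(s);
    return 0;
}
```

The returned values distinguish (among others) loopback, wired, Wi-Fi, AWDL and cellular interfaces.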
Case Study: rvi
The Remote Virtual Interface was introduced in iOS 5.0 and documented by Apple in QA1176[2]. The feature allows the iOS interfaces to appear on a Mac host, so that packet tracing tools (notably, tcpdump(1)
) can be used through the host.
RVI requires the cooperation of several components, both on the host and the device, all working together as shown in Figure 16-12 (next page). The binaries of the remote virtual interface package all belong to the same (unnamed and closed source) project, and are among the few in MacOS which still target 10.7 (i.e. use LC_UNIXTHREAD
), and are apparently unmaintained since then.
The rvictl utility can be used to list connected devices (-l/L), then start (-s/S) or stop (-x/X) the remote virtual interface on the devices specified by their UUID. It does so by linking with RemotePacketCapture.framework. The latter exports APIs which hide the IPC connection to the rpmuxd daemon.
The rpmuxd daemon is started by launchd upon rvictl's demand for the com.apple.rpmuxd service. The daemon handles the local end of the packet capture operations, as well as providing a notification mechanism for interested clients over MIG subsystem 117731, with five messages:
# | Routine Name |
---|---|
117731 | rpmuxd_start_packet_capture |
117732 | rpmuxd_stop_packet_capture |
117733 | rpmuxd_get_current_devices |
117734 | rpmuxd_register_notification_port |
117735 | rpmuxd_deregister_notification_port |
The daemon controls the RemoteVirtualInterface.kext (identifiable in kextstat(1) by its CFBundleIdentifier of com.apple.nke.rvi). The kext isn't normally loaded, so the daemon loads it on demand (by posix_spawn(2)ing kextload). The kext sets up a PF_SYSTEM control socket with the name com.apple.net.rvi_control, which the daemon connect(2)s to, and through which it requests the kext to create the rvi# interface. At the same time, it handles the connection to the iDevice, by calling AMDeviceSecureStartService() to request the launch of com.apple.pcapd on it.
On the iDevice, when lockdownd receives the request to start com.apple.pcapd, it consults the __TEXT.__services plist embedded in its Mach-O, and resolves the name to the pcapd daemon binary.
pcapd
is a small daemon which uses libpcap.A.dylib
, calling pcap_setup_pktap_interface()
and then using the Berkeley Packet Filter (explained later in this chapter) to capture all packets. These packets are relayed over the lockdown connection to the host's rpmuxd
, which then injects them (through the control socket) to the kext, which in turn pushes them through the rvi#
interface it has created.
The end result of this is that the rvi#
interface is entirely indistinguishable from other ethernet interfaces for packet capture tools, and so tcpdump(1)
can be run on the host (with the -i rvi#
switch) to obtain the packets which were actually captured on the iDevice. The connection, however, is read-only, so packets cannot be sent through the rvi#
interface back to the device.
Networking Configuration
The network stack exposes a plethora of configuration settings (and statistics, described later) via sysctl
MIBs. These are all conveniently in the net
namespace. As with other MIBs, they are mostly undocumented save for a description in the kernel's __DATA.__sysctl_set
(through the SYSCTL_OID
's description
field), which can be displayed with joker -S
.
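Programmatically, any of these MIBs can be read (or, with sufficient privileges, set) through sysctlbyname(3), for example:

```c
#include <sys/sysctl.h>
#include <stdio.h>

int main(void)
{
    int fwd = 0;
    size_t len = sizeof(fwd);
    // Passing a non-NULL newp/newlen instead would set the value (root only)
    if (sysctlbyname("net.inet.ip.forwarding", &fwd, &len, NULL, 0) < 0) {
        perror("sysctlbyname");
        return 1;
    }
    printf("net.inet.ip.forwarding = %d\n", fwd);
    return 0;
}
```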
IPv4 configuration
The MIBs for controlling IPv4 and IPv6 are broken into two separate namespaces net.inet.ip
and net.inet6.ip6
. Owing to the similarities of both protocols, however, some MIBs are found in both namespaces, as shown in Table 16-14:
net.inet[6].ip[6] MIB | Default | Purpose |
---|---|---|
mcast.loop | 0x00000001 | Loopback multicast datagrams by default |
rtexpire | 0x0000013b | Default expiration time on dynamically learned routes |
rtminexpire | 0x0000000a | Minimum time to hold onto dynamically learned routes |
rtmaxcache | 0x00000080 | Upper limit on dynamically learned routes |
forwarding | 0000000000 | Enable IP forwarding between interfaces |
redirect | 0x00000001 | Enable sending IP redirects |
maxfragpackets | 1536 | Maximum number of fragment reassembly queue entries |
maxfragsperpacket | 128 | Maximum number of fragments allowed per packet |
adj_clear_hwcksum | 0000000000 | Invalidate hwcksum info when adjusting length |
adj_partial_sum | 0x00000001 | Perform partial sum adjustment of trailing bytes at IP layer |
mcast.maxgrpsrc | 0x000200 | Max source filters per group |
mcast.maxsocksrc | 0x000080 | Max source filters per socket |
mcast.loop | 0x0000080 | Multicast on loopback interface |
Additional, IPv4 specific parameters, found in net.inet.ip
, are shown in Table 16-15.
net.inet.ip MIB | Default | Purpose |
---|---|---|
portrange.low[first/last] | 600-1023 | Low reserved port range |
portrange.[hi][first/last] | 49152-65536 | High reserved port range |
ttl | 0x00000040 | Default TTL value on outgoing packets |
[accept_]sourceroute | 0000000000 | Enable [accepting/forwarding] source routed IP packets |
gifttl | 0x0000001e | Time-to-Live (max hop count) on GIF (IP-in-IP) interfaces |
subnets_are_local | 0000000000 | Subnets of local interfaces also considered local |
random_id_statistics | 0000000000 | Enable IP ID statistics |
sendsourcequench | 0000000000 | Enable the transmission of source quench packets |
check_interface | 0000000000 | Verify packet arrives on correct interface |
rx_chaining | 0x00000001 | Do receive side ip address based chaining |
rx_chainsz | 0x00000006 | IP receive side max chaining |
linklocal.in.allowbadttl | 0x00000001 | Allow incoming link local packets with TTL < 255 |
random_id | 0x00000001 | Randomize IP packets IDs |
maxchainsent | 0x00000016 | use dlil_output_list |
select_srcif_debug | 0000000000 | Debug (dmesg) source address selection |
output_perf | 0000000000 | Do time measurement |
rfc6864 | 0x00000001 | Updated Specification of the IPv4 ID Field |
IPv6 configuration
IPv6 has plenty of other specific parameters which affect its behavior, particularly for its sub-protocols, like Neighbor Discovery (ND).
net.inet6 sysctl MIB | Default | Purpose |
---|---|---|
ip6.hlim | 64 | IPv6 hop limit |
ip6.accept_rtadv | 1 | Accept ICMPv6 Router Advertisements |
ip6.keepfaith | 0 | Unused. Apparently IPv6 has grown disillusioned. |
ip6.log_interval | 5 | Throttle kernel log output to once in log_interval |
ip6.hdrnestlimit | 15 | IP header nesting limit |
ip6.dad_count | 1 | Duplicate Address Detection count (read only) |
ip6.auto_flowlabel | 1 | Assign IPv6 flow labels (in header) |
ip6.defmcasthlim | 1 | Default multicast hop limit |
ip6.gifhlim | 0 | Hop limit on GIF (IP-in-IP) interfaces |
ip6.use_deprecated | 1 | Continue use of deprecated temporary address |
ip6.rr_prune | 5 | Router renumbering prefix |
ip6.v6only | 0 | If 0, enable IPv6 mapped addresses. Else, native IPv6 only |
ip6.use_tempaddr | 1 | RFC3041 temporary interface addresses |
ip6.temppltime | 86400 | Temporary address preferred lifetime (sec) |
ip6.tempvltime | 604800 | Temporary address maximum lifetime (sec) |
ip6.auto_linklocal | 1 | Automatically use link local (fe80::) addresses |
ip6.prefer_tempaddr | 1 | Prefer the temporary address over the assigned one |
ip6.use_defaultzone | 0 | Embed default scope ID |
ip6.maxfrags | 3072 | Maximum number of IPv6 fragments allowed |
ip6.mcast_pmtu | 0 | Enable Multicast Path MTU discovery |
ip6.neighborgcthresh | 1024 | Neighbor cache garbage collection threshold |
ip6.maxifprefixes | 16 | Maximum interface prefixes adopted from router advertisements |
ip6.maxifdefrouters | 16 | Maximum default routers adopted from router advertisements |
ip6.maxdynroutes | 1024 | Maximum number of dynamic (via redirect) routes allowed |
ip6.input_perf_bins | 0 | bins for chaining performance data histogram |
ip6.select_srcif_debug | 0 | Debug (log) selection process of source interface |
ip6.select_srcaddr_debug | 0 | Debug (log) selection process of source address |
ip6.select_src_expensive_secondary_if | 0 | Allow source address selection to use interfaces w/high metric |
ip6.select_src_strong_end | 1 | limit source address selection to outgoing interface |
ip6.only_allow_rfc4193_prefixes | 0 | Use RFC4193 as baseline for network prefixes |
ip6.maxchainsent | 1 | use dlil_output_list |
ip6.dad_enhanced | 1 | Adds a random nonce to NS messages for DAD. |
IPSec (6) Configuration
IPSec is deeply integrated into IPv6, and therefore more likely to be used with it than in the IPv4 case. Many of the ipsec6
MIB values also apply to IPv4 (i.e. exist in net.inet.ipsec
as well), and values which apply to both are under net.ipsec
(not shown below).
net.inet6.ipsec6 MIB | Default | Purpose |
---|---|---|
def_policy | 1 | Default Policy |
esp_trans_deflev | 1 | Encapsulating Security Payload in Transport Mode |
esp_net_deflev | 1 | Encapsulating Security Payload in Network Mode |
ah_trans_deflev | 1 | Authentication Header in Transport Mode |
ah_net_deflev | 1 | Authentication Header in Network mode |
ecn | 0 | Toggle Explicit Congestion Notifications |
debug | 0 | Toggle logging and Debugging |
esp_randpad | -1 | Pad Encapsulating Security Payload with random bytes |
ICMPv6 Configuration
ICMPv6 (RFC4443) is also tightly knit into IPv6, and includes the sub protocols of Neighbor Discovery (ND) and SEcure Neighbor Discovery (SEND). Another sub protocol, Multicast Listener Discovery (MLD) has subtleties between version 1 (RFC2710) and version 2 (RFC3810), both of which are supported by the Darwin network stack.
net.inet6 MIB | Default | Purpose |
---|---|---|
icmp6.rediraccept | 1 | Accept and process redirects |
icmp6.redirtimeout | 600 | Expire ICMP redirected route entries after n seconds |
icmp6.rappslimit | 10 | Router Advertisement Packets per second limit |
icmp6.errppslimit | 500 | packet-per-second error limit |
icmp6.nodeinfo | 3 | enable/disable NI response |
icmp6.nd6_prune | 1 | Walk list every n seconds |
icmp6.nd6_prune_lazy | 5 | Lazily walk list every n seconds |
icmp6.nd6_delay | 5 | Delay first probe in seconds |
icmp6.nd6_[u/m]maxtries | 3 | Maximum [unicast/multicast] ND query attempts |
icmp6.nd6_useloopback | 1 | Allow ND6 to operate on loopback interface |
icmp6.nd6_debug | 0 | Output ND debug messages to kernel log |
icmp6.nd6_accept_6to4 | 1 | Accept neighbors from 6-to-4 links |
icmp6.nd6_optimistic_dad | 63 | Assume Duplicate Address Detection won't ever collide |
icmp6.nd6_onlink_ns_rfc4861 | 0 | Accept 'on-link' nd6 NS in compliance with RFC 4861 |
icmp6.nd6_llreach_base | 30 | default ND6 link-layer reachability max lifetime (in seconds) |
icmp6.nd6_maxsolstgt | 8 | maximum number of outstanding solicited targets per prefix |
icmp6.nd6_maxproxiedsol | 4 | maximum number of outstanding solicitations per target |
send.opmode | 1 | Configured SEND operating mode |
Multicast Listener Discovery | ||
mld.gsrdelay | 10 | Rate limit for IGMPv3 Group-and-Source queries in seconds |
mld.v1enable | 1 | Support MLDv1 (RFC2710) |
mld.v2enable | 1 | Support MLDv2 (RFC3810) |
mld.use_allow | 1 | Use ALLOW/BLOCK for RFC 4604 SSM joins/leaves |
mld.debug | 0 | Output MLD debug messages to kernel log |
TCP configuration
Darwin's TCP implementation has a huge number of settings, which toggle support for various RFCs and best practices. They are all under net.inet.tcp
, and apply the same way for both IPv4 and IPv6, save for [v6]mssdflt
. Table 16-19 lists them all:
net.inet.tcp MIB | Default | Purpose |
---|---|---|
[v6]mssdflt | [0x400]/0x200 | Default TCP Maximum Segment Size |
keepidle | 0x006ddd00 | Keepalive timeout for idle connections |
keepintvl | 0x000124f8 | Keepalive interval |
sendspace | 0x00020000 | Maximum outgoing TCP datagram size |
recvspace | 0x00020000 | Maximum incoming TCP datagram size |
randomize_ports | 0000000000 | Randomize TCP source ports |
log_in_vain | 0000000000 | Log all incoming TCP packets |
blackhole | 0000000000 | Do not send RST when dropping refused connections |
keepinit | 0x000124f8 | TCP connect idle keep alive time |
disable_tcp_heuristics | 0000000000 | Set to 1, to disable all TCP heuristics (TFO, ECN, MPTCP) |
delayed_ack | 0x00000003 | Delay ACK to try and piggyback it onto a data packet |
tcp_lq_overflow | 0x00000001 | Listen Queue Overflow |
recvbg | 0000000000 | Receive background |
drop_synfin | 0x00000001 | Drop TCP packets with SYN+FIN set |
slowlink_wsize | 0x00002000 | Maximum advertised window size for slowlink |
rfc1644 | 0000000000 | T/TCP support |
rfc3390 | 0x00000001 | Increased Initial Window |
rfc3465 | 0x00000001 | Congestion Control with Appropriate Byte Counting (ABC) |
rfc3465_lim2 | 0x00000001 | Appropriate bytes counting w/ L=2*SMSS |
doautorcvbuf | 0x00000001 | Enable automatic socket buffer tuning |
autorcvbufmax | 0x00100000 | Maximum receive socket buffer size |
disable_access_to_stats | 0x00000001 | Disable access to tcpstat |
rcvsspktcnt | 0x00000200 | packets to be seen before receiver stretches acks |
rexmt_thresh | 0x00000003 | Duplicate ACK Threshold for Fast Retransmit |
slowstart_flightsize | 0x00000001 | Slow start flight size |
local_slowstart_flightsize | 0x00000008 | Slow start flight size (local networks) |
tso | 0x00000001 | TCP Segmentation offload |
ecn_initiate_out | 0x00000002 | Initiate ECN for outbound |
ecn_negotiate_in | 0x00000002 | Initiate ECN for inbound |
ecn_setup_percentage | 0x00000064 | Max ECN setup percentage |
ecn_timeout | 0x0000003c | Initial minutes to wait before re-trying ECN |
packetchain | 0x00000032 | Enable TCP output packet chaining |
socket_unlocked_on_output | 0x00000001 | Unlock TCP when sending packets down to IP |
recv_allowed_iaj | 0x00000005 | Allowed inter-packet arrival jitter |
min_iaj_win | 0x00000010 | Minimum recv win based on inter-packet arrival jitter |
acc_iaj_react_limit | 0x000000c8 | Accumulated IAJ when receiver starts to react |
doautosndbuf | 0x00000001 | Enable send socket buffer auto-tuning |
autosndbufinc | 0x00002000 | Increment in send buffer size |
autosndbufmax | 0x00100000 | Maximum send buffer size |
ack_prioritize | 0x00000001 | Prioritize pure ACKs |
rtt_recvbg | 0x00000001 | Use RTT for bg recv algorithm |
recv_throttle_minwin | 0x00004000 | Minimum recv win for throttling |
enable_tlp | 0x00000001 | Enable Tail loss probe |
sack | 0x00000001 | TCP Selective ACK |
sack_maxholes | 0x00000080 | Maximum # of TCP SACK holes allowed per connection |
sack_globalmaxholes | 0x00010000 | Global maximum TCP SACK holes (across all connections) |
fastopen | 0x00000003 | Enable TCP Fast Open (rfc7413) |
fastopen_backlog | 0x0000000a | Backlog queue for half-open TCP Fast Open connections |
fastopen_key | TCP Fast Open key | |
backoff_maximum | 0x00010000 | Maximum time for which we won't try TCP Fast Open |
clear_tfocache | 0000000000 | Toggle to clear the TFO destination based heuristic cache |
now_init | 0x2d88b850 | Initial tcp now value |
microuptime_init | 0x000daa2d | Initial tcp uptime value in micro seconds |
minmss | 0x000000d8 | Minimum TCP Maximum Segment Size |
do_tcpdrain | 0000000000 | Enable tcp_drain routine for extra help when low on mbufs |
icmp_may_rst | 0x00000001 | ICMP unreachable may abort connections in SYN_SENT |
rtt_min | 0x00000064 | Minimum Round Trip Time value allowed |
rexmt_slop | 0x000000c8 | Slop added to retransmit timeout |
win_scale_factor | 0x00000003 | Sliding window scaling factor |
tcbhashsize | 0x00001000 | Size of TCP control-block hashtable |
keepcnt | 0x00000008 | number of times to repeat keepalive |
msl | 0x00003a98 | Maximum segment lifetime |
max_persist_timeout | 0000000000 | Maximum persistence timeout for ZWP |
always_keepalive | 0000000000 | Assume SO_KEEPALIVE on all TCP connections |
timer_fastmode_idlemax | 0x0000000a | Maximum idle generations in fast mode |
broken_peer_syn_rexmit_thres | 0x0000000a | # rexmitted SYNs to disable RFC1323 on local connections |
path_mtu_discovery | 0x00000001 | Enable Path MTU Discovery |
pmtud_blackhole_detection | 0x00000001 | Path MTU Discovery Black Hole Detection |
pmtud_blackhole_mss | 0x000004b0 | Path MTU Discovery Black Hole Detection lowered MSS |
cc_debug | 0000000000 | Enable debug data collection |
use_newreno | 0000000000 | Use TCP NewReno algorithm by default |
cubic_tcp_friendliness | 0000000000 | Enable TCP friendliness |
cubic_fast_convergence | 0000000000 | Enable fast convergence |
cubic_use_minrtt | 0000000000 | use a min of 5 sec rtt |
lro | 0000000000 | Used to coalesce TCP packets |
lro_startcnt | 0x00000004 | Segments for starting LRO computed as power of 2 |
lrodbg | 0000000000 | Used to debug SW LRO |
lro_sz | 0x00000008 | Maximum coalescing size |
lro_time | 0x0000000a | Maximum coalescing time |
bg_target_qdelay | 0x00000064 | Target queueing delay |
bg_allowed_increase | 0x00000008 | Modifier for calculation of max allowed congestion window |
bg_tether_shift | 0x00000001 | Tether shift for max allowed congestion window |
bg_ss_fltsz | 0x00000002 | Initial congestion window for background transport |
MPTCP configuration
net.inet.mptcp MIB | Default | Purpose |
---|---|---|
enable | 1 | Global on/off switch |
mptcp_cap_retr | 2 | Number of MP Capable SYN retries |
dss_csum | 0 | Enable DSS checksum |
fail | 1 | Failover threshold |
keepalive | 840 | Keepalive (sec) |
rtthist_thresh | 600 | Rtt threshold |
userto | 1 | Disable RTO for subflow selection |
probeto | 1000 | Disable probing by setting to 0 |
dbg_area | 31 | MPTCP debug area |
dbg_level | 1 | MPTCP debug level |
allow_aggregate | 0 | Allow the Multipath aggregation mode |
alternate_port | 0 | Darwin 18: Set alternate port for MPTCP connections |
rto | 3 | MPTCP retransmission timeout |
rto_thresh | 1500 | RTO threshold |
tw | 60 | MPTCP timewait period |
UDP configuration
UDP is a simple and stateless protocol, and therefore offers very few configuration options.
net.inet.udp MIB | Default | Purpose |
---|---|---|
checksum | 1 | Enable UDP checksumming |
maxdgram | 9216 | Maximum outgoing UDP datagram size |
recvspace | 196724 | Maximum incoming UDP datagram size |
log_in_vain | 0 | Log all incoming packets |
blackhole | 0 | Do not send port unreachables for refused connects |
randomize_ports | 1 | Randomize port numbers |
ICMP configuration
ICMP behavior is similarly governed by various sysctl
MIBs, which are mostly set by default to ignore known problematic protocol vulnerabilities, such as spoofed ICMP redirection and broadcast echo requests ("ping storms").
net.inet.icmp MIB | Default | Purpose |
---|---|---|
maskrepl | 0000000000 | Reply to Address Mask requests |
icmplim | 0x000000fa | ICMP limit |
timestamp | 0000000000 | Respond to ICMP timestamp requests |
drop_redirect | 0x00000001 | Ignore ICMP redirect messages |
log_redirect | 0000000000 | Log ICMP redirect messages (to dmesg(1) ) |
bmcastecho | 0x00000001 | Broadcast/Multicast ICMP echo requests |
Networking Statistics
It's important for any system to keep detailed statistics on network usage, and to provide them to the administrator in the clearest way possible. Darwin systems contain quite a few statistics mechanisms of varying detail level and purpose. The chief tool is, of course, the aptly named netstat(1)
, which presents statistics about interfaces (-i/-I
), routes (-r
), multicast groups (-g
), memory consumption (-m
), general protocol statistics (-s
), and - naturally - active sockets (with no other arguments or with -a
). As useful as it is, though, netstat(1)
is little more than a parser of the raw statistics, which it obtains from sysctl
MIBs.
sysctl MIBs
Along with configuration settings, the network stack exports a plethora of statistics through sysctl
MIBs. Per-family statistics are exported through net.family.socktype.stats
, with the family being local
, inet
, link
and systm
, and the socktype
being (respectively) stream/dgram
, tcp/udp/igmp/icmp/ipset
, generic/ether/bridge
, and kevt/kctl
. These end up in the much more readable form of the netstat -s
output.
The live connection statistics are maintained in net.family.socktype.pcblist*
, with the family
and socktype
being almost the same as with the stats
: There are no link
pcbs (as there are no connections at the link layer level), and the inet
socktype
s are only tcp/udp/raw/mptcp
. The three pcblist*
variants are pcblist
, pcblist64
and pcblist_n
, offering different concatenated structures for the statistics, all defined in various locations throughout the kernel headers. Listing 16-23 shows a breakdown of a TCP PCB structure:
A good example of parsing the PCBs can be found in the open sources of netstat(1)
(in the network_cmds
project).
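A minimal sketch of the first step - sizing and fetching the raw PCB blob, then reading the xinpgen header which prefixes it - is shown below; walking the per-connection records which follow requires the xtcpcb_n/xinpcb_n definitions from the kernel headers (copy them to a user mode header if your SDK hides them):

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/socketvar.h>
#include <sys/sysctl.h>
#include <netinet/in.h>
#include <netinet/in_pcb.h>    // struct xinpgen (may be hidden behind PRIVATE in some SDKs)
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *mib = "net.inet.tcp.pcblist_n";
    size_t len = 0;

    // First call sizes the buffer; note the snapshot may grow between the two calls.
    if (sysctlbyname(mib, NULL, &len, NULL, 0) < 0) { perror(mib); return 1; }
    void *buf = malloc(len);
    if (sysctlbyname(mib, buf, &len, NULL, 0) < 0) { perror(mib); return 2; }

    struct xinpgen *xig = (struct xinpgen *)buf;    // header preceding the PCB records
    printf("%s: %zu bytes, %u PCBs (generation %llu)\n",
           mib, len, xig->xig_count, (unsigned long long)xig->xig_gen);
    free(buf);
    return 0;
}
```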
com.apple.network.statistics
Using the pcblist*
MIBs has major drawbacks. Not only is collecting the statistics a lengthy operation, but the statistics themselves are just a snapshot - and with the dynamic nature of network connections, likely to be stale within minutes, if not far less. Another problem is that it is very difficult to associate the connections with their respective owners.
The PF_SYSTEM/SYSPROTO_CONTROL
socket of com.apple.network.statistics
provides a far better mechanism - one which not only provides a constant stream of network statistics through the control socket, but also provides a way to match connections to the originating process.
Once a control socket is set up, command messages (in the 1xxx range) may be sent to the kernel, which will be replied to with messages from the kernel (in the 1xxxx range). One command may generate quite a few replies - as is common when adding sources. Commands are defined (along with the rest of the interface) in XNU's bsd/net/ntstat.h, which is #ifdef PRIVATE, so it does not make it into the user mode headers.
Each source addition creates an associated descriptor, which may be queried using the ...GET_SRC_DESC NSTAT_MSG_TYPE. Descriptors are nstat_[tcp/udp/route]_descriptor structures. Apple continuously modifies these data structures, breaking the direct API between XNU versions and making it really difficult to work directly through the control socket.
The nettop(1) utility (part of the closed source NetworkStatistics package) provides an example of com.apple.network.statistics capabilities. The utility is a "live" (but crude) netstat(1), though it doesn't use the low level sockets directly, instead opting for the higher level NStatManager* wrapper APIs of the private, CF* object aware interface, which serves as an adapter layer and thus decouples clients from the low level socket structures.
An NStatManager is instantiated with a call to NStatManagerCreate, which takes a CFAllocator, options and a callback block. Sources can be added with any of the NStatManagerAddAll[TCP/UDP][/With[Filter/Options]] calls, or (for route sources) NStatManagerAddAllRoutes[WithFilter]. Adding sources triggers the callback block, which gets the NStatSource as an argument. The source objects can be manipulated through blocks with NStatSourceSet[Counts/Events/Description/Removed]Block, which are called with their respective objects as arguments.
The description object is a particularly detailed CFDictionary
, providing the properties (resolving enums to human readable form where necessary) from the nstat_[tcp/udp/route]_descriptor
, combining them with nstat_counts
in a convenient CFDictionary
form, as shown in Table 16-25:
Property | Descriptor field | |
---|---|---|
epid | epid | |
processID | pid | |
uniqueProcessID | eupid | |
processName | pname | |
euuid | euuid | |
startAbsoluteTime | start_timestamp | |
durationAbsoluteTime | timestamp - start_timestamp | |
interface | ifindex | |
[local/remote]Address (CFDATA) | [local/remote].[v4/v6] | |
provider | N/A (descriptor type) | |
[rx/tx]Bytes | nstat_[rx/tx]bytes | |
[rx/tx][/Cellular/WiFi/Wired]Bytes | nstat_[cell/wifi/wired]_[rx/tx]bytes | |
trafficClass | traffic_class | |
uuid | uuid | |
receiveBuffer[Size/Used] | rcvbuf | |
TCP sources | ||
rx[Duplicate/OutOfOrder]Bytes | nstat_rx[duplicate/outoforder]bytes | |
congestionAlgorithm | cc_algo | |
rtt[Average/Minimum/Variation] | nstat_[min/avg/var]_rtt | |
connect[Attempts/Successes] | nstat_connect[attempt/successes] | |
TCPState | state | |
txRetransmittedBytes | nstat_txretransmit | |
txUnacked | txunacked | |
TCP[Congestion]Window | tx[c]window | |
trafficManagementFlags | traffic_mgt_flags | |
The lsock(j) companion tool matches and exceeds the functionality of nettop(1) - and is available in open source. Note that, because it uses the APIs directly, it might very well be outdated by the time you try it: the example had to be updated multiple times in the past to catch up with the changing structures, and there is no guarantee Darwin 18 won't break its function.
Another hurdle is an entitlement - com.apple.private.network.statistics - which may be required for using the com.apple.network.statistics control socket. "May", because at the moment this requirement can be toggled (by the root user) using the net.statistics_privcheck sysctl MIB. This value is already set to '1' on *OS variants, but still '0' (for the moment) on MacOS. On *OS this isn't much of an issue, since running arbitrary code implies a jailbreak, root access and arbitrary entitlements. Should the MacOS sysctl be set to '1' and possibly locked, however, administrators will need to disable SIP or only use Apple's "approved" (but painful) nettop(1).
The following experiment shows a quick and very dirty program to mimic nettop(1)'s usage of the NStatManager APIs.
The main client for these statistics is nettop(1), which displays a live, netstat(1)-like output. Unfortunately, the tool is closed source, crude and hard to work with over its curses interface, and unavailable for *OS variants. Fortunately, it's fairly straightforward to disassemble, and to build a functional (albeit more limited) clone, shown in the following Listing.
A nettop(1) clone
As barebones as this listing is, it will nonetheless compile cleanly for both MacOS and the *OS variants. Note that in the *OS case Apple has removed the private framework ".tbd" files, which are required for linkage. Those are easy enough to recreate using jtool2's --tbd option. You can find the listing online on the book's companion website[3].
/var/networkd/netusage.sqlite
All Darwin flavors offer aggregate statistics at the process level, summing up bandwidth usage for every process on the system by its binary name. The database, though associated with networkd, is actually filled by another component.
As the database name implies, it is a SQLite3 file, which makes it very easy to inspect - assuming root or _networkd (uid 24) credentials. Binaries are given unique identifiers in the ZPROCESS table, which persist across the multiple times they may be executed. The unique id (Z_PK) is then used to track the binary across other tables, the most useful of which is ZLIVEUSAGE, which keeps the aggregate statistics. Using sqlite3 on the database would show something similar to Output 16-27:
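The same aggregates can also be pulled programmatically with the sqlite3 C API. In the following hedged sketch, the Core Data style table and column names (ZPROCESS.ZPROCNAME, ZLIVEUSAGE.ZHASPROCESS, ZWIFIIN/ZWIFIOUT) are assumptions based on inspecting the database on recent MacOS versions, and may well differ between releases:

```c
// Compile with: cc -o netusage netusage.c -lsqlite3   (requires root or _networkd credentials)
#include <sqlite3.h>
#include <stdio.h>

int main(void)
{
    sqlite3 *db = NULL;
    if (sqlite3_open_v2("/var/networkd/netusage.sqlite", &db,
                        SQLITE_OPEN_READONLY, NULL) != SQLITE_OK) {
        fprintf(stderr, "open: %s\n", sqlite3_errmsg(db));
        return 1;
    }

    // Column names are assumptions (see above) - adjust after a ".schema" dump
    const char *sql =
        "SELECT p.ZPROCNAME, l.ZWIFIIN, l.ZWIFIOUT "
        "FROM ZLIVEUSAGE l JOIN ZPROCESS p ON l.ZHASPROCESS = p.Z_PK;";

    sqlite3_stmt *stmt = NULL;
    if (sqlite3_prepare_v2(db, sql, -1, &stmt, NULL) != SQLITE_OK) {
        fprintf(stderr, "prepare: %s\n", sqlite3_errmsg(db));
        return 2;
    }
    while (sqlite3_step(stmt) == SQLITE_ROW)
        printf("%-40s wifi in: %lld out: %lld\n",
               (const char *)sqlite3_column_text(stmt, 0),
               sqlite3_column_int64(stmt, 1), sqlite3_column_int64(stmt, 2));

    sqlite3_finalize(stmt);
    sqlite3_close(db);
    return 0;
}
```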
The netusage.sqlite database
Firewalling
Network connectivity extends the system's reach to the four corners of the Internet, but also vice versa. A firewall has thus become an integral part of any system's defense, and MacOS has not one, but several. This section covers those mechanisms which are accessible through user mode - the Application Layer Firewall, ipfw (briefly, as it is deprecated), and pf. Kernel-accessible mechanisms (socket, IP and interface filters) are left for Volume II.
MacOS: The Application Layer Firewall
The Application Layer Firewall, commonly referred to by the fuzzy nickname "ALF", is comprised of a kernel extension (with a CFBundleIdentifier of com.apple.nke.applicationfirewall), and several binaries, all in /usr/libexec/ApplicationFirewall - chief among them socketfilterfw(8), which manages the kext. This daemon loads its defaults from the com.apple.alf property list, and provides the com.apple.alf Mach service. When it needs UI interaction, it calls on CFUserNotificationCreate (q.v. Chapter 5) to create a pop-up dialog with the resources from Firewall(8), obtaining user authorizations through com.apple.alf.useragent, with a protocol consisting of a single MIG message (#9999)*.
When the firewall settings are modified, a notification is posted through CFNotificationCenter (q.v. Chapter 5), with a name of "com.apple.alf". The notification's object field designates it as "firewalloptions", "app[added/removed]", "[app/service]statechanged", etc. The userinfo field contains the firewall set request, as an XML property list. Listing 16-29 shows two messages (in SimPLISTic format):
* appfwloggerd was previously used to listen on an event socket for messages from ipfw.
socketfilterfw(8) processes the userinfo data, translating the XML property list into the socket filtering rules it needs to apply. The main rules are in the payloads of appadded and/or appstatechanged, under the alias key, which is (again) a base64 encoded plist (CFData), whose contents are a binary record with details about the application for which a rule is added:
The alias entry, twice base-64 decoded
Rules are enforced by the kext using the sflt_* KPIs (as explained in Volume II). The socketfilterfw daemon communicates with the kernel extension over a com.apple.nke.sockwall PF_SYSTEM/SYSPROTO_CONTROL socket. The protocol is a simple TLV (type-length-value), with the types shown in Table 16-31.
# | Command | Purpose |
---|---|---|
0 | result | Inserts a new rule for a process |
1 | proc_rules | Inserts a new rule for a process |
3 | ask | Kext requests a user prompt |
5 | dumpinfo | Useful to dump the kext process list into dmesg |
6 | verify | Kext requests process rule verification |
7 | setpath | Add a process path |
8 | updaterules | Called when rulebase changes |
9 | releasepcachedpath | Kext requests invalidation of a PID (by path) from cache |
10 | unloadkext | Unload the kernel extension, if possible |
11 | addapptolist | Add an application |
12 | changelogmode | Change kext logging mode |
13 | changetrustmode | Change trust mode |
14 | askmsgrelease | Dismiss pending ask |
15 | changelogopt | Change logging options |
16 | changeapptrustmode | Change app trust mode |
Some of the message types are no longer implemented in Darwin 18. socketfilterfw also contains a few references to ipfw sysctls, which (as explained next) are no longer implemented either. The daemon may be configured to log excessively by changing its LaunchDaemon property list's Program string to a ProgramArguments array and adding -d and/or -l. The daemon normally relays PF_SYSTEM/SYSPROTO_EVENT messages it receives from the kernel extension for the APPLE:NETWORK:LOG provider, and another unnamed provider at 1000:5:11. The ALF kext also registers the net.alf MIB namespace, with a loglevel bitmask, a perm(ission check), a defaultaction and a (read-only) mqcount.
ipfw (Deprecated)
Darwin has used BSD's ipfw mechanism for many years - until it was removed in Darwin 16. The code implementing the mechanism was guarded by IPFW2 and other #defines, which are no longer enabled, and the user mode controller, ipfw(8), has been removed. Some discussion of this facility can be found in the respective BSD manual pages, as well as in the first edition of this work (at which time it was still deemed relevant in Darwin).
pf
The pf facility, another relic of BSD but still in wide use, provides an alternative network layer firewalling mechanism. The facility appears in user mode as two character devices. The /dev/pf character device can be used to create and apply firewalling rulesets, using ioctl(2) codes. This functionality is not unlike Linux's netfilter (a.k.a. iptables).
The pf facility makes use of a configuration file, whose format is documented in the pf.conf(5) manual page. An additional file, com.apple, supplies anchors for AirDrop and ALF (loaded by a load anchor statement in the main configuration file).
System administrators wishing to configure pf
often use the pfctl(8)
command line. The tool is well documented in its man page, which is left for the interested reader to peruse. Comprehensive documentation for the set of ioctl(2)
codes can be found in pf(4)
, but this manual page has somehow been removed from Darwin releases. The Open BSD man page[5] thus serves in its place, although there are some differences in the set of codes. Table 16-32 (next page) shows a summary of the ioctl(2)
codes defined in Darwin, though some are not actively supported.
PacketFilter.framework
Rather than using the ioctl(2) codes directly, clients may link against the private PacketFilter.framework, which provides PF* functions and PFUser/PFManager high level calls. First, a call to PFUserCreate
starts a session. Then, PFUserBeginRules
declares a rule set, in which rules can be manipulated using PFUser[Add/Insert/Delete]Rule
. The set can be committed using a call to PFUserCommitRules
. Similar APIs are PFManager[Get/Copy/Delete]Rules
.
Rule transactions are submitted to the PFManager
object, which uses PFXPC
abstractions to communicate with pfd(8)
through the com.apple.pfd
service. The daemon (running as root) translates the XPC messages into the corresponding ioctl(2)
codes, and returns any replies in XPC formatted dictionaries. The protocol can be reversed easily by using XPoCe
on a running instance of pfd(8)
.
DIOC ioctl(2) code | Argument | Purpose |
---|---|---|
DIOC[START/STOP] | _IO ('D', 1/2) | Start/stop the packet filter facility |
DIOCADDRULE | _IOWR('D', 4, struct pfioc_rule) | Add a pfioc_rule to (inactive) ruleset |
DIOCGETSTARTERS | _IOWR('D', 5, struct pfioc_tokens) | Get starter tokens |
DIOCGETRULE[S] | _IOWR('D', 6/7, struct pfioc_rule) | Obtain a ticket + num rules, or specific rule |
DIOCSTARTREF | _IOR ('D', 8, u_int64_t) | Increment ref count, get token |
DIOCSTOPREF | _IOWR('D', 9, struct pfioc_remove_token) | Decrement ref count with provided token |
DIOCCLRSTATES | _IOWR('D', 18, struct pfioc_state_kill) | Clear packet filter state table |
DIOCGETSTATE | _IOWR('D', 19, struct pfioc_state) | Retrieve specific state entry |
DIOCSETSTATUSIF | _IOWR('D', 20, struct pfioc_if) | Toggle statistics on interface |
DIOCGETSTATUS | _IOWR('D', 21, struct pf_status) | Get pf_status counters and data |
DIOCCLRSTATUS | _IO ('D', 22) | Clear all pf_status counters |
DIOCNATLOOK | _IOWR('D', 23, struct pfioc_natlook) | Look up a NAT state table entry |
DIOCSETDEBUG | _IOWR('D', 24, u_int32_t) | Toggle debug |
DIOCGETSTATES | _IOWR('D', 25, struct pfioc_states) | Retrieve all state entries |
DIOC[CHANGE/INSERT/DELETE]RULE | _IOWR('D', 26/27/28, struct pfioc_rule) | Various rule manipulation actions |
DIOC[SET/GET]TIMEOUT | _IOWR('D', 29/30, struct pfioc_tm) | Set/get state timeouts |
DIOCADDSTATE | _IOWR('D', 37, struct pfioc_state) | Add a state entry |
DIOCCLRRULECTRS | _IO ('D', 38) | Clear rule counters |
DIOC[GET/SET]LIMIT | _IOWR('D', 39/40, struct pfioc_limit) | Set the hard limits on the memory pools |
DIOCKILLSTATES | _IOWR('D', 41, struct pfioc_state_kill) | Remove matching entries from the state table |
DIOC[START/STOP]ALTQ | _IO ('D', 42/43) | Requires ALTQ support, which Darwin does not provide |
DIOC[ADD/GET]ALTQ[/S] | _IOWR('D', 45/47, struct pfioc_altq) | ALTQ queue manipulation (unsupported without ALTQ) |
DIOC[GET/CHANGE]ALTQ | _IOWR('D', 48/49, struct pfioc_altq) | ALTQ queue manipulation (unsupported without ALTQ) |
DIOCGETQSTATS | _IOWR('D', 50, struct pfioc_qstats) | Get queue statistics |
DIOC[BEGIN/GET]ADDRS | _IOWR('D', 51/53, struct pfioc_pooladdr) | Begin/get address pool (for nat/rdr rules) |
DIOC[ADD/GET/CHANGE]ADDR | _IOWR('D', 52/54/55, struct pfioc_pooladdr) | Add/get/change a pool address |
DIOCGETRULESETS | _IOWR('D', 58, struct pfioc_ruleset) | Get number of rulesets (anchors) |
DIOCGETRULESET | _IOWR('D', 59, struct pfioc_ruleset) | Get anchor by number |
DIOCR[CLR/ADD/DEL]TABLES | _IOWR('D', 60/61/62, struct pfioc_table) | Clear/add/delete tables |
DIOCRGETTABLES | _IOWR('D', 63, struct pfioc_table) | Get table list |
DIOCR[GET/CLR]TSTATS, DIOCRTSTADDRS | _IOWR('D', 64/65/73, struct pfioc_table) | Get/clear table statistics; test if the given addresses match a table |
DIOCR[CLR/ADD/DEL]ADDRS | _IOWR('D', 66-68, struct pfioc_table) | Clear/Add/Delete addresses in table |
DIOCR[SET/GET]ADDRS | _IOWR('D', 69/70, struct pfioc_table) | Get/set addresses in table |
DIOCR[GET/CLR]ASTATS | _IOWR('D', 71/72, struct pfioc_table) | Get/Clear address statistics |
DIOCRSETTFLAGS | _IOWR('D', 74, struct pfioc_table) | Change const/persist flags of table |
DIOCRINADEFINE | _IOWR('D', 77, struct pfioc_table) | Defines a table in the inactive set |
DIOCOSFPFLUSH | _IO('D', 78) | Flush the passive OS fingerprint table. |
DIOCOSFP[ADD/GET] | _IOWR('D', 79/80, struct pf_osfp_ioctl) | Add/retrieve passive OS fingerprint entry |
DIOCX[BEGIN/COMMIT/ROLLBACK] | _IOWR('D', 81/82/83, struct pfioc_trans) | Clear/commit/undo inactive rulesets |
DIOCGETSRCNODES | _IOWR('D', 84, struct pfioc_src_nodes) | Get source nodes |
DIOCCLRSRCNODES | _IO('D', 85) | Clear list of source nodes |
DIOCSETHOSTID | _IOWR('D', 86, u_int32_t) | Set host ID (for pfsync(4) ) |
DIOCIGETIFACES | _IOWR('D', 87, struct pfioc_iface) | Get list of interfaces |
DIOC[SET/CLR]IFFLAG | _IOWR('D', 89/90, struct pfioc_iface) | Set/clear user flags |
DIOCKILLSRCNODES | _IOWR('D', 91, struct pfioc_src_node_kill) | Explicitly remove source tracking nodes |
DIOCGIFSPEED | _IOWR('D', 92, struct pf_ifspeed) | Get interface speed |
Packet Capture
There are times when user mode needs packet capture capabilities on a given interface. The most common example of that is when using a sniffer (or "network analyzer") such as tcpdump(1)
and its ilk. Being user mode tools, they must make use of some kernel facility to enable such features as promiscuous mode (in which the interface accepts all frames, not just broadcast/multicast and its own unicast), and getting packets normally destined for other applications. Apple's proprietary PF_NDRV
is inadequate for general packet capture (as it can only intercept unregistered ethertype protocols), and BSD's PF firewalls packets, but doesn't actually relay the filtered packets to user mode - so another mechanism is required.
BPF
Darwin follows the BSD model in implementing the Berkeley Packet Filter, commonly referred to as BPF. BPF is the brainchild of McCanne and Jacobson (of PPP compression and traceroute(1)
fame), who presented the mechanism in a USENIX 1993 paper[6]. BPF was quite revolutionary, as it provided a full language, with which dynamic filter programs could be created in user space, and loaded directly into the kernel subsystem. It has since become a standard adopted by quite a few operating systems and the ubiquitous tcpdump(1), Ethereal, and many other tools (most commonly through libpcap).
BPF appears in user mode as a number of character devices, open(2)
ed, attached to an underlying interface using a BIOCSETIF
ioctl(2)
, configured with a few other ioctl(2)
codes, and then loaded with a BPF "program" through a BIOCSETF
ioctl(2)
. Once the filter program is installed, the device's file descriptor lends itself to read(2)
operations, which will provide any packets matching the filter loaded onto it. This also marks the corresponding device node as in use, which means that there is a hard limit in the system of up to however many devices are configured. The general flow of a BPF client is shown in Listing 16-33. The full list of ioctl(2) codes can be found in <net/bpf.h>, which also defines the DLT_
constants for Data Link types, though the only ones of actual use are DLT_EN10MB
(used for all modern Ethernet, not just 10MB), and DLT_USB_DARWIN
, which is an Apple extension for the XHC*
interfaces provided by IOUSBHostFamily.kext
's AppleUSBHostPacketFilter.kext
PlugIn.
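A minimal sketch of this flow follows (an illustrative example: it assumes en0 as the underlying interface, and omits the filter program, which is discussed next):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <net/bpf.h>

int main(void)
{
    char dev[16];
    int  fd = -1;

    /* The /dev/bpf## nodes are exclusive-use: probe until an unused one opens */
    for (int i = 0; i < 256 && fd < 0; i++) {
        snprintf(dev, sizeof(dev), "/dev/bpf%d", i);
        fd = open(dev, O_RDWR);
    }
    if (fd < 0) { perror("open(/dev/bpf##)"); return 1; }

    u_int buflen = 0;
    ioctl(fd, BIOCGBLEN, &buflen);              /* read(2) buffer size to use */

    struct ifreq ifr = { 0 };
    strlcpy(ifr.ifr_name, "en0", sizeof(ifr.ifr_name));
    if (ioctl(fd, BIOCSETIF, &ifr) < 0) {       /* attach to the interface */
        perror("BIOCSETIF"); return 1;
    }

    u_int immediate = 1;
    ioctl(fd, BIOCIMMEDIATE, &immediate);       /* deliver packets as they arrive */
    ioctl(fd, BIOCPROMISC, NULL);               /* optionally, promiscuous mode */

    /* A BPF program would be installed here with BIOCSETF (see below) */

    char *buf = malloc(buflen);
    ssize_t nr = read(fd, buf, buflen);         /* one read may return several frames */
    for (char *p = buf; nr > 0 && p < buf + nr; ) {
        struct bpf_hdr *bh = (struct bpf_hdr *)p;
        printf("captured %u of %u bytes\n", bh->bh_caplen, bh->bh_datalen);
        p += BPF_WORDALIGN(bh->bh_hdrlen + bh->bh_caplen);
    }
    free(buf);
    close(fd);
    return 0;
}
```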
The idea behind BPF is as simple as it is elegant: Consider an automaton with a single register, which may be directed to load a value from any offset in an input frame (i.e. including the layer II header), and perform a logical test on its value. The automaton branches on the result of that test, and the process continues until a decision can be made as to whether to accept or reject the packet in question. Accepted packets appear on the capture device, whereas rejected packets are merely ignored by the filter - that is, they do not get captured, but they are not blocked either (as they would be by the PF facility, described earlier).
BPF Programs
Listing 16-33 can be used for just about any generic sniffer/packet analyzer, but notice it's missing the actual BPF filter, which needs to be installed for the BPF mechanism to actually sift out frames. The BPF filter program needs to be specified as an array of BPF automaton struct bpf_insn
instructions. The instruction structure consists (not in this order) of a 16-bit code
, a uint32
constant k
(used as an argument to the code
), and two unsigned 8-bit offsets, jt
and jf
, which represent a jump offset to branch to in case the code
is a logical BPF_J*
test. Most BPF filters usually consist of a mix of BPF_LD
statements (to read data from various offsets in an incoming frame) and BPF_JMP
, to perform logical tests and branch accordingly. Note, however, that there are quite a few other opcodes - including destructive ones (e.g. BPF_ST[X]
), which will alter scratch memory, allowing the filter to maintain state.
Rather than initializing the structure for every single instruction, two macros are commonly used. BPF_STMT
takes the code
and k
values, for instructions which aren't logical tests. BPF_JUMP
is used for tests, whose codes are of the BPF_JMP
class, with whatever BPF_J*
variant. This makes the BPF "assembly" (barely) manageable for human readers.
As an example, consider Listing 16-34, which presents a sample BPF filter program in installFilter
. The listing demonstrates how to traverse an IPv4 packet: In the beginning of the program, the automaton's read stream is at the first byte of the frame - i.e. the Ethernet header. Since the IPv4 EtherType is always at offset 12 (past 6 bytes of destination MAC address and 6 more of source), the value (16 bits) is loaded as a halfword with BPF_LD + BPF_H. It is then compared to ETHERTYPE_IP (0x0800). If there is no match, processing jumps 10 instructions forward, to the rejection (= return 0). If it is an IPv4 packet, processing continues (jumping 0 instructions forward, i.e. to the next instruction, since the program counter always points to the next instruction). As tests continue, the rejection offset grows closer and closer - two instructions later it is 8, two more make it 6, etc. If the flow makes it far enough to assert that the frame is an unfragmented IPv4 TCP packet with either the source or the destination port matching the one requested, the filter returns a non-zero value (the requested snapshot length), and the frame makes it back to Listing 16-33's file descriptor, where it can be read and processed in user space.
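Such a filter - adapted here from the canonical example in the bpf(4) manual page, with the port hardcoded to 79 purely for illustration - would be expressed as:

```c
#include <net/bpf.h>
#include <netinet/in.h>

/* Accept unfragmented IPv4 TCP packets to or from port 79.
 * Note the shrinking reject offsets (10, 8, 6...) described above. */
static struct bpf_insn insns[] = {
    BPF_STMT(BPF_LD + BPF_H + BPF_ABS, 12),                  /* load EtherType      */
    BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, 0x0800, 0, 10),      /* IPv4? else reject   */
    BPF_STMT(BPF_LD + BPF_B + BPF_ABS, 23),                  /* load IP protocol    */
    BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, IPPROTO_TCP, 0, 8),  /* TCP? else reject    */
    BPF_STMT(BPF_LD + BPF_H + BPF_ABS, 20),                  /* load fragment field */
    BPF_JUMP(BPF_JMP + BPF_JSET + BPF_K, 0x1fff, 6, 0),      /* fragment? reject    */
    BPF_STMT(BPF_LDX + BPF_B + BPF_MSH, 14),                 /* X = IP header len   */
    BPF_STMT(BPF_LD + BPF_H + BPF_IND, 14),                  /* load TCP src port   */
    BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, 79, 2, 0),
    BPF_STMT(BPF_LD + BPF_H + BPF_IND, 16),                  /* load TCP dst port   */
    BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, 79, 0, 1),
    BPF_STMT(BPF_RET + BPF_K, (u_int)-1),                    /* accept: whole frame */
    BPF_STMT(BPF_RET + BPF_K, 0),                            /* reject              */
};

static struct bpf_program prog = {
    .bf_len   = sizeof(insns) / sizeof(insns[0]),
    .bf_insns = insns,
};
/* ... installed onto the descriptor with: ioctl(fd, BIOCSETF, &prog); */
```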
Since BPF programs run in kernel mode, the kernel validates them before installation - were it not for these checks on, e.g., BPF_ST[X] instructions, which allow storing (= writing to scratch memory), a malicious filter could be conducive to full kernel compromise. Additionally, the code around the filters (i.e. the ioctl(2) implementations) has been buggy in the past - as recently as Darwin 16, for BIOCSBLEN (CVE-2017-2482).
Pseudo-Interfaces
There are times when frames or packets need to be captured simultaneously from multiple interfaces. One way of doing so is to run multiple BPF filters at the same time (over several devices), but Darwin offers a more convenient alternative - the pseudo-interfaces iptap and pktap.
The *tap
interfaces are pseudo-interfaces, and normally do not appear when interfaces are listed with ifconfig
. They are created on-demand when a packet capture program (notably, tcpdump(1)
) is used with pktap
or iptap
as the name of the interface, followed by a comma-delimited list of actual interfaces. The difference between the two *tap
interfaces is the encapsulation exposed - pktap
provides the full packet, whereas iptap
provides the network layer (IPv6 or IPv4) and upwards. Both interfaces can be used with BPF, and appear with a DLT_PKTAP
(also DLT_USER2
, with a value of 149).
The tap interfaces are created programmatically using an SIOCIFCREATE
ioctl(2)
, and marked to be removed when the creating process exits (as demonstrated in Apple's libpcap project). The pktap's interface filter - the list of member interfaces to capture on - may then be set using the SIOCSDRVSPEC ioctl(2) code. The pktapctl utility (not provided in Darwin releases) shows an example of getting and setting these filters.
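A minimal sketch of the creation step (the subsequent SIOCSDRVSPEC filtering, which requires the private structures from XNU's bsd/net/pktap.h, is omitted). Root privileges are assumed:

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <sys/sockio.h>
#include <net/if.h>

int main(void)
{
    int s = socket(AF_INET, SOCK_DGRAM, 0);    /* any socket will do for SIOC* */
    if (s < 0) { perror("socket"); return 1; }

    struct ifreq ifr = { 0 };
    strlcpy(ifr.ifr_name, "pktap", sizeof(ifr.ifr_name));

    if (ioctl(s, SIOCIFCREATE, &ifr) < 0) {    /* clone a new pktap interface */
        perror("SIOCIFCREATE"); return 1;
    }
    printf("created %s\n", ifr.ifr_name);      /* kernel fills in the unit, e.g. pktap0 */

    close(s);                                  /* interface is cleaned up when we exit */
    return 0;
}
```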
Using DLT_PKTAP
also provides a significant benefit, in allowing more metadata to be included with every captured packet. XNU's struct pktap_header records, among other fields, the interface of the packet, as well as the owning process pid and command name. Darwin's tcpdump
implementation contains a non-standard -k
switch, which will parse some of that metadata (specifically, pth_ifname
, pth_[e]comm
, pth_[e]pid
and pth_svc
) to show the details of the process (or processes, if both ends are on the same host) to whose session each packet belongs.
Quality of Service
We've already discussed process and thread level Quality of Service, and with such formidable capabilities it's easy to forget that the Quality of Service concept was originally "born" at the network layer. Due to Net Neutrality and other considerations, QoS isn't deployed on the global Internet, but it is nonetheless applicable on internal networks, up to the egress router and sometimes beyond.
QoS recognizes two modes - Integrated Services, and Differentiated Services. The former is handled by RSVP (the ReSerVation Protocol), and is not supported by XNU - support for which is not required, since the implementation can reside in user mode. The latter mode (DiffServ) requires packet-level labeling, and is fully supported. The IPv4 "type of service" byte (the second byte of the header, right after the version/header-length "45") has been repurposed by RFC2474 and RFC3168 to provide a six-bit "Differentiated Services Code Point" (DSCP) and two bits of Explicit Congestion Notification (ECN). XNU supports later revisions of DiffServ, including RFC2597 (Assured Forwarding Per-Hop-Behavior) and RFC5865 (Capacity-Admitted Traffic).
Darwin 17 adds a new (and, as usual, undocumented) system call - net_qos_guideline
(#525). The system call takes a net_qos_param
structure specifying a bandwidth requirement (upload or download) and the structure's (fixed) length. It returns a hint to user mode specifying whether this requirement would be subject to the default QoS policy, or should be marked as a background (BK) service type, which will prefer delay based flow algorithms.
Network Link Conditioning
Xcode's "Additional Tools" disk image contains, in between its many fabulous "Hardware Tools", the "Network Link Conditioner" Preference Pane. This plug-in to System Preferences allows simulating various link conditions - limiting bandwidth, adding latency, and introducing packet loss.
The preference pane is merely a front-end: The actual work is performed by nlcd(8)
, which communicates with the GUI by means of MIG subsystem 40268. But it turns out that nlcd
, too, doesn't want to get its hands dirty, and instead sends XPC messages to pfd(8)
with the help of the private PacketFilter.framework
. Although we've discussed pfd
in the context of the PF facility earlier, this time the daemon interfaces with another kernel facility, called dummynet(4)
, which is responsible for the dirty work.
The dummynet mechanism, a facility to provide traffic shaping, bandwidth management and delay emulation, was devised by Luigi Rizzo in 1997, and extended in 2010[7]. It was brought into BSD and its ipfw
mechanism, and migrated to Darwin. Although ipfw
is defunct in modern systems, dummynet is still fully operational. Its implementation is mostly contained in XNU's #ifdef DUMMYNET
blocks, meaning that XNU can be built without it, though that is seldom the case.
Dummynet works by defining flows, and funneling them into one or more "pipes", which emulate links with given bandwidth/delay/loss parameters. Pipes are managed with the help of "queues", which implement Worst-case Fair Weighted Fair Queueing (WF2Q+) and Random Early Detection (RED). The pipes are entirely virtual, and packets are passed through them before or after they flow through the physical interface, which is how the connection parameters can be enforced.
Pipes can be configured by creating a raw socket, and then issuing setsockopt(2)
calls. Four options are defined: IP_DUMMYNET_CONFIGURE
(60) creates or modifies a dummynet pipe. The pipe may be removed with IP_DUMMYNET_DEL
(61). The list of pipes can be retrieved with IP_DUMMYNET_GET
(64), and pipes can be flushed with IP_DUMMYNET_FLUSH
(62). The dnctl(8) command line tool offers a far easier way to configure pipes, providing an extensive command syntax and a well documented manual page, complete with examples. This manual page also documents the relevant sysctl(8)
MIBs.
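For illustration, a minimal sketch of talking to dummynet directly over a raw socket follows - here merely retrieving the (opaque) pipe list with IP_DUMMYNET_GET. Parsing the buffer would require the kernel's dummynet structures, and root privileges are assumed throughout:

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>

#ifndef IP_DUMMYNET_GET
#define IP_DUMMYNET_GET 64        /* value as given in the text above */
#endif

int main(void)
{
    int s = socket(AF_INET, SOCK_RAW, IPPROTO_RAW);    /* raw socket: root only */
    if (s < 0) { perror("socket"); return 1; }

    socklen_t len = 64 * 1024;                         /* assumption: large enough */
    void *buf = calloc(1, len);

    if (getsockopt(s, IPPROTO_IP, IP_DUMMYNET_GET, buf, &len) < 0) {
        perror("IP_DUMMYNET_GET");                     /* e.g., dummynet not active */
        return 1;
    }
    printf("dummynet returned %u bytes of pipe/queue data\n", (unsigned)len);

    free(buf);
    close(s);
    return 0;
}
```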
Network Extension Control Policies (Darwin 14+)
A major addition to Darwin's network stack are Network Extension Control Policies (NECPs), added in Darwin 14. NECPs are described at length in a comment block in XNU's bsd/net/necp.c.
The original interface provided for NECP is through a PF_SYSTEM/SYSPROTO_CONTROL
socket. Using com.apple.net.necp_control
as the control name, a socket can be created, and then written to and read from using a specialized packet protocol (of NECP_PACKET_TYPE_* messages).
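A minimal sketch of opening this control socket follows (root is required, and the NECP packet protocol spoken over the descriptor is not shown):

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <sys/kern_control.h>
#include <sys/sys_domain.h>

int main(void)
{
    int fd = socket(PF_SYSTEM, SOCK_DGRAM, SYSPROTO_CONTROL);
    if (fd < 0) { perror("socket"); return 1; }

    /* Resolve the dynamic kernel control ID for the NECP control name */
    struct ctl_info info = { 0 };
    strlcpy(info.ctl_name, "com.apple.net.necp_control", sizeof(info.ctl_name));
    if (ioctl(fd, CTLIOCGINFO, &info) < 0) { perror("CTLIOCGINFO"); return 1; }

    struct sockaddr_ctl sc = { 0 };
    sc.sc_len     = sizeof(sc);
    sc.sc_family  = AF_SYSTEM;
    sc.ss_sysaddr = AF_SYS_CONTROL;
    sc.sc_id      = info.ctl_id;
    sc.sc_unit    = 0;                        /* let the kernel assign a unit */

    if (connect(fd, (struct sockaddr *)&sc, sizeof(sc)) < 0) {
        perror("connect"); return 1;
    }

    /* NECP_PACKET_TYPE_* requests would be written to (and read from) fd here */
    close(fd);
    return 0;
}
```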
In addition to the root
privileges needed to open the control socket, some actions are deemed privileged, and require the PRIV_NET_PRIVILEGED_NECP_[POLICIES/MATCH]
privileges. These are tied to com.apple.private.necp.[policies/match] entitlements, and are presently granted only to a select few daemons, as can be seen in the book's entitlement database.
The actual policies which may be defined are ridiculously rich and complex. Using a set of NECP_POLICY_CONDITION_*
constants allows matching a policy to a particular DNS domain, local or remote address, specific IP protocol, PID, UID, entitlement-holder, interface, and more. Policies can also be ordered, so as to prioritize their application. Once applied, a policy result can be as simple as NECP_POLICY_RESULT_[PASS/DROP]
, but can also be any of several other NECP_POLICY_RESULT_*
constants, to divert, filter or tunnel the flow, change a route rule, or trigger or use a particular netagent (discussed later).
NECP descriptors and clients
Starting with Darwin 16, just about every network-enabled process in the system uses NECPs, oftentimes without the developer even knowing what they are. This is because the networking library calls necp_open() (#501) as part of its initialization (specifically, from nw_endpoint_handler_start
). This creates a necp client, a file descriptor of type NPOLICY
, which is readily visible in the output of lsof(1)
or procexp ..fds
. The descriptor does not offer the traditional operations (read(2)/write(2)/ioctl(2)
), and only supports select(2)
, or use in a kqueue
. The necp_client_action
system call (#502) can be used to specify client actions, as shown in Listing 16-37:
necp_open()
is just one of several undocumented system calls, which Apple has added over time as the facility evolves. The system calls are also unexported to user mode, but Listing 16-38 reconstructs the missing header file:
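A partial sketch of these prototypes - gleaned from XNU's bsd/kern/syscalls.master, and therefore an approximation of the real (missing) header - is:

```c
/* NECP system calls, as declared in XNU's syscalls.master.
 * This is a reconstruction for reference - not an SDK-provided header. */
#include <stdint.h>
#include <stddef.h>
#include <uuid/uuid.h>

int necp_open(int flags);                                      /* #501 */
int necp_client_action(int necp_fd, uint32_t action,           /* #502 */
                       uuid_t client_id, size_t client_id_len,
                       uint8_t *buffer, size_t buffer_size);
int necp_session_open(int flags);                              /* #522 */
int necp_session_action(int session_fd, uint32_t action,       /* #523 */
                        uint8_t *in_buffer, size_t in_buffer_length,
                        uint8_t *out_buffer, size_t out_buffer_length);
```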
Darwin 17 extends the idea of NECP client descriptors, and adds the NECP session (also an NPOLICY
file descriptor*). These descriptors are created with necp_session_open
(#522), and support just the close(2)
operation (which deletes the associated session). NECP session descriptors are meant to be handled with the proprietary necp_session_action()
system call (#523). Using NECP_SESSION_ACTION_*
constants passed through the action
parameter, which map to the NECP_PACKET_TYPE_POLICY*
codes of the control socket, the various actions can be performed, subject to the privilege check.
The public frameworks wrap these APIs with the NEPolicySession Objective-C object.
* - NECP session descriptors share the NPOLICY type (DTYPE_NETPOLICY) with client descriptors. The potential type confusion was exploited by CVE-2018-4425, before being fixed by Apple in MacOS 14.1.
Network Agents (Darwin 15+)
Darwin 15 introduces a novel networking concept - network agents. These are user-mode clients to which network flow or other event handling is relayed via triggers. The agents can then handle the triggers and act upon them, for example by making network policy decisions.
Network agents create a PF_SYSTEM/SYSPROTO_CONTROL
socket with the com.apple.net.netagent
control name. The control socket is created in a manner identical to Listing 16-7 - setting sc_unit
to 0 and changing the control name, of course. Once the control socket is connect(2)
ed, agents may send and receive messages formatted with a netagent_message_header
, defined in XNU's bsd/net/network_agent.h.
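The header is small - a type, flags, an identifier, an error code and the payload length (shown here with standard integer types):

```c
#include <stdint.h>

struct netagent_message_header {
    uint8_t   message_type;             /* NETAGENT_MESSAGE_TYPE_* (e.g. TRIGGER = 5) */
    uint8_t   message_flags;
    uint32_t  message_id;
    uint32_t  message_error;
    uint32_t  message_payload_length;   /* bytes of payload following this header */
};
```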
XNU-3248 also added the netagent_trigger
system call (#490), which enables selective wake up of a registered netagent by the caller. The system call takes the agent_uuid
, which should match the one the target agent registered with, and the agent_uuidlen
(which is fixed at sizeof(uuid_t)
, i.e. 16). If the target agent allows triggers (registered with NETAGENT_FLAG_USER_ACTIVATED
) and is not already active, a NETAGENT_MESSAGE_TYPE_TRIGGER
(#5) will be sent to it.
A process may create and register more than one agent (with different UUIDs), and agents may be assigned to different domains (e.g. "WirelessRadioManager", "NetworkExtension") or types (e.g. VPN, Persistent, DNSAgent..). Darwin's configd
does so (with several DNSAgent
s), as do CommCenter
, networkserviceproxy
, and iOS's nesessionmanager
. Other daemons are fine with one agent, e.g. identityservicesd
, wifid
and apsd
. Using procexp all fds
and filtering for Control Sockets (in a manner similar to Output 16-5) will show all the agents. The sysctl
MIBs of net.netagent.[active/registered]_count
track the number of agents, and net.netagent.debug
may be adjusted to produce verbose logging.
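The counters can also be read programmatically, e.g. with sysctlbyname(3) (assuming, as their names suggest, integer values):

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/sysctl.h>

int main(void)
{
    int active = 0, registered = 0;
    size_t len = sizeof(int);

    sysctlbyname("net.netagent.active_count", &active, &len, NULL, 0);
    len = sizeof(int);
    sysctlbyname("net.netagent.registered_count", &registered, &len, NULL, 0);

    printf("netagents: %d registered, %d active\n", registered, active);
    return 0;
}
```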
An open source example of creating an agent and handling notifications may be found in the configd project's IPMonitor plugin, which implements the DNSAgent and the ProxyAgent
. The following experiment demonstrates displaying agent details using specialized ioctl(2)
codes.
Experiment: Displaying network agent details using ioctl(2) codes
The netagent facility provides ioctl(2)
codes which can be used to enumerate existing agents (i.e. processes with com.apple.net.netagent
control sockets). The ioctl(2)
codes are SIOCGIFAGENT[LIST/DATA]64
, which operate similarly: On first pass, their respective data size arguments must be 0, and in turn they will be filled with the required data size. The caller is expected to allocate a sufficiently large buffer, and then call again. The call pattern is shown in Listing 16-40:
Neither the SIOCGIFAGENT[LIST/DATA]64 ioctl(2) codes nor their structures (nor the flags, in netagentFlagsToText, above) are provided in user space headers, but it is a simple matter to copy them from XNU's sources. The ioctl(2) codes require NECP entitlements (for system privilege 10004, a.k.a PRIV_NET_PRIVILEGED_NECP_POLICIES
). This means they're easier to use on jailbroken *OS (where code signing is faked and any entitlement can be bestowed) than on MacOS (even with SIP disabled, since self-signed code cannot hold such entitlements). Output 16-41 shows the output of the completed program on iOS (with UUIDs truncated, since they're random anyway):
SkyWalk
The SkyWalk subsystem is an entirely undocumented networking subsystem in XNU. It provides the interconnection between other networking subsystems, such as Bluetooth and user-mode tunnels. Although built in to XNU, its source remains closed, with only error and debug strings indicating it is implemented in #ifdef SKYWALK blocks. A third implementation (bridge) exists, but its source code is likewise redacted. Skywalk's memory subsystem is also largely self-managed: There are about three dozen skywalk related kernel zones, and the subsystem has its own arena based allocator (similar in concept to the Nanov2 allocator) with caching, which is used for in-kernel, non-blocking packet allocation and other uses.
Nexuses & Channels
Skywalk makes use of two special object types. A nexus is an endpoint, identified by a UUID, through which data packets can flow, prior to actually getting to an underlying network interface. Nexuses may be created in kernel or user mode, and when used in the latter appear as file descriptors (of DTYPE_NEXUS
).
Nexuses are created through the use of Nexus Providers. There are currently four known provider types:
- User pipes: are pipes whose provider is in userspace, created directly through the nexus_create system call (#506) or libnetwork.dylib's nw_nexus_create, which internally calls os_nexus_controller_create and os_nexus_controller_register_provider. Examples are identityservicesd's IDSChannelClientNexus[OS] and bluetoothd's com.apple.bluetooth.scalablePipe.
- Kernel pipes: are pipes whose provider is in the kernel, usually some kernel extension. An example of that is IOSkywalkBSDClient, calling kern_nexus_controller_create.
- Network interfaces: provided by interfaces, such as com.apple.netif.utun* or com.apple.netif.ipsec*.
- Flow Switches: to direct network flows. Can be of subtype bridge (layer II) or multi-stack (layer III). Here, too, examples are com.apple.multistack.utun* or com.apple.multistack.ipsec*.
Registering a Nexus is a privileged operation. A set of sandbox-enforced entitlements - com.apple.private.skywalk.register-[flow-switch/net-if/user-pipe] - protects registration for each of the corresponding types. Nexuses have one or more channels to provide data flows. Each channel commonly has two rings, one for transmission (tx) and one for reception (rx), each with 128 slots.
Nexuses can interoperate with network agents. The NETAGENT_MESSAGE_TYPE_[REQUEST/ASSIGN/CLOSE]_NEXUS
messages (from Listing 16-39) allow the interoperation, by letting a network agent control nexus creation on demand. You can see both nexuses and network agents in action when using VPN applications: Setting up a VPN connection commonly creates both a net-if
(usually, com.apple.netif.utun2
) and a multistack flow-switch (com.apple.multistack.utun1
) provider.
The ifconfig(8)
utility (as of network-cmds
520+, provided in the *OS binpack) can display network agent and nexus details. Output 16-42 demonstrates the nexus enabled (user-mode tunneling) interfaces when a VPN connection is active:
Output 16-42: Using ifconfig(8) to view interface netagent and nexus details
System calls and APIs
As with the other skywalk components, the system calls used to handle nexuses and channels are purposely left out of XNU's public sources - even though the #if blocks referring to them are still present. Fortunately, the names of the system calls can be gleaned from user mode headers.
Handling nexuses
Rather than using the system calls directly, libsystem_kernel.dylib
provides higher level _os_nexus
and _os_channel
objects. This API provides metadata about the underlying file descriptors (for example, the guard value needed to guarded_close_np
a channel, through _os_channel_destroy
). An even higher level API can be found in nw_nexus
and nw_channel
objects (with OS_
prefixed Objective-C counterparts).
Three objects - os_nexus
, os_nexus_attr
and os_nexus_controller
manage nexuses, and four more - os_channel
, os_channel_slot
, .._attr
and .._packet
are used for channels. This way, a nexus can be created directly, through a call to ___nexus_open
, but the preferred way is to use os_nexus_controller_create
, which also ensures the descriptor is guarded. Once created, a Nexus can be registered directly with a system call (__nexus_register
) by using its file descriptor, or through the higher level os_nexus_controller_register_provider
. Other calls offered by os_nexus_*
APIs are os_nexus_[dis]connect
, os_nexus_if[attach/detach]
, and os_nexus_ns[un]bind
, all of which wrap the __nexus_set_opt
system call. Unsurprisingly, the os_nexus_*
APIs are nowhere near as well documented, but Listing 16-44 reconstructs the missing header file:
Listing 16-44: The os_nexus_* APIs
skywalkctl(8)
A vital piece of the SkyWalk puzzle is the skywalkctl(8)
utility, apparently a debugging tool left in Darwin releases, whose information comes from SkyWalk's statistics MIBs (via the sysctl(8)
interface, described next).
sysctl MIBs
The SkyWalk subsystem outputs its statistics through several MIBs, of which kern.skywalk.nexus_provider_list
and kern.skywalk.nexus_channel_list
are the most interesting, as they provide detailed information about Nexus providers and channels (as nexus_provider_info_t
and nexus_channel_entry_t
structures). Accessing these MIBs requires the com.apple.private.skywalk.observe-all
entitlement, enforced by a mac_priv_check_hook. A related entitlement, com.apple.private.skywalk.observe-stats, similarly guards the statistics privilege (the undocumented 12011). There are additional privileges (12000-12003), all nexus related but undocumented (and, for lack of source, nameless), all of which depend on the skywalk entitlements. The *OS binutils' sysctl(8)
is properly entitled, as is the aforementioned skywalkctl(8)
, which can decipher the opaque MIBs into a human readable form.
Obtaining information about a particular channel file descriptor can be achieved through proc_pidfdinfo
with the undocumented PROC_PIDFDCHANNELINFO
(10). This returns a channel_fdinfo
containing the channel type, UUID, port and flags.
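A minimal sketch, treating the returned structure as opaque bytes (since channel_fdinfo is not present in the public headers, and the flavor value is defined here from the text):

```c
#include <stdio.h>
#include <stdlib.h>
#include <libproc.h>

#ifndef PROC_PIDFDCHANNELINFO
#define PROC_PIDFDCHANNELINFO 10      /* undocumented flavor, per the text */
#endif

int main(int argc, char **argv)
{
    if (argc != 3) { fprintf(stderr, "usage: %s <pid> <fd>\n", argv[0]); return 1; }

    int pid = atoi(argv[1]), fd = atoi(argv[2]);
    unsigned char buf[1024];          /* assumption: large enough for channel_fdinfo */

    int nb = proc_pidfdinfo(pid, fd, PROC_PIDFDCHANNELINFO, buf, sizeof(buf));
    if (nb <= 0) { fprintf(stderr, "proc_pidfdinfo failed (not a channel fd?)\n"); return 1; }

    printf("got %d bytes of channel information\n", nb);
    return 0;
}
```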
Review Questions
- What are the differences between PF_NDRV's packet capture capabilities and those of BPF?
- What advantages do SkyWalk nexuses and channels offer (over, say, the utun## facility) for VPN implementations?
References
1. "Improving Network Reliability Using Multipath TCP" - https://developer.apple.com/documentation/foundation/nsurlsessionconfiguration/improving_network_reliability_using_multipath_tcp
2. Apple Developer - QA1176 - https://developer.apple.com/library/archive/qa/qa1176/_index.html
3. NewOSXBook.com - "NetBottom.c" - http://newosxbook.com/src.jl?tree=listings&file=netbottom.c
4. Apple - HT201642 (The Application Level Firewall) - https://support.apple.com/en-us/HT201642
5. OpenBSD Manual Pages - pf(4) - https://man.openbsd.org/pf.4
6. McCanne & Van Jacobson - "The BSD Packet Filter" - https://www.usenix.org/legacy/publications/library/proceedings/sd93/mccanne.pdf
7. Luigi Rizzo - "Dummynet, Revisited" - https://www.researchgate.net/publication/220194992_Dummynet_Revisited