16

Нет-Work: Darwin Networking

Networking has forever redefined computing. With the advent of the Internet, a system without network connectivity finds little use, as more and more applications rely on remote servers to perform some or all of their functions. This is especially true with the move to the Cloud, which in some respects comes full circle back to the "dumb terminals" of the mainframe age.

Like other operating systems, Darwin places considerable emphasis on its networking stack. Originally inherited from the BSD layer, the stack has been continuously refined and extended by Apple with support for new protocols and features improving functionality, efficiency, and speed. This chapter discusses Darwin's networking features from a user mode perspective.

The chapter begins with a review of socket families, specifically the ones idiosyncratic to Darwin. These are the PF_NDRV sockets, which enable (to a certain extent) raw packet manipulation, and the PF_SYSTEM sockets for user/kernel mode communication. The latter is especially important, since it contains quite a few proprietary, undocumented but powerful interfaces.

We next briefly explain MultiPath TCP, an emerging Internet standard which Apple was quick to adopt, and which required the addition of new system calls in Darwin 13, as well as some sysctl MIBs. MIBs are also the focus of the next two sections, which deal with network configuration from user mode and the gathering of statistics. Statistics are an area where Darwin excels, albeit through undocumented APIs: the PF_SYSTEM control sockets introduced earlier provide live network statistics which other OSes struggle to match.

Following that, we turn our attention to firewalling and packet filtering. Darwin provides not one, but two built-in network-layer firewalls - BSD's ipfw(8) (deprecated as of around Darwin 15) and pf(8). MacOS also provides an application layer firewall, through ALF.kext. For packet filtering, another BSD legacy - the Berkeley Packet Filter (BPF) - is used; though it is best described elsewhere, it is also briefly explained here.

Last, but not least, we turn a spotlight towards two entirely undocumented but powerful APIs: The first is that of Network Extension Control Policies, which enable QoS, flow control and more through policy objects and a proprietary file descriptor. The second is the mysterious Skywalk, with its nexus and channel objects. This is an entire subsystem which is not only undocumented, but intentionally left out of XNU's public sources. The pages of this book provide the only public documentation of this important mechanism to date.

This is the complete 16th chapter from *OS Internals, Volume I (in its v1.2 update). It's free, but please respect the copyright and the immense amount of research devoted to creating it. If any of this is useful, please cite using the original link. You might also want to consider getting the book, or checking out Tg's training.
 

Darwin Extensions of the BSD Socket APIs

Since its inception in the 1980s, the BSD socket model has proven time and again its superb design and extensibility to new protocols. The set of system calls used in manipulating sockets is (for the most part) entirely implementation agnostic. In the few cases where protocol specific parameters are required, they may be set through the sockaddr_.. variant used when bind(2)ing the socket, or through [get/set]sockopt(2), if family options are supported.

The <sys/socket.h> header lists over three dozen address families (as [AF/PF]_* constants), but in practice only a subset of them is supported in Darwin. These are shown in Table 16-1:

Table 16-1: The protocol families supported on Darwin
#    Protocol Family      Transport
1    PF_[LOCAL/UNIX]      UNIX domain sockets
2    PF_INET              IPv4
14   PF_ROUTE             Internal Routing Protocol
27   PF_NDRV              Network Drivers: Raw device access
29   PF_KEY               IPSec Key Management (RFC2367)
30   PF_INET6             IPv6 (And IPv4 mapped)
32   PF_SYSTEM            System/kernel local communication (Proprietary)

Most of these are standard, and should be well known to the reader from other UN*X variants. PF_NDRV and PF_SYSTEM, however, are Darwin proprietary, and deserve special discussion.

PF_NDRV

The PF_NDRV protocol family is a somewhat misnamed one - the documentation describes it as used by "Network Drivers", though drivers are generally kernel mode beasts. A better name would have been "PF_RAW", as the family allows raw access to network interfaces, or perhaps (in keeping with Linux) "PF_PACKET". Raw interface access is quite similar to AF_INET[6]'s SOCK_RAW, or using the IP_HDRINCL setsockopt(2) syscall. Unlike either, however, PF_NDRV allows control over all layers - down to the link layer header.

PF_NDRV sockets are created as usual, but require a different socket address structure - struct sockaddr_ndrv. As a sockaddr_* compatible structure, its first fields are the byte-sized snd_len and snd_family (set to sizeof(struct sockaddr_ndrv) and PF_NDRV, respectively). The only other field in the structure is the snd_name character array (of IFNAMSIZ bytes), which holds the name of the underlying interface the socket is to be bound to.

Though seldom used, PF_NDRV allows a user-mode client to register its own EtherType, so that the kernel will dispatch matching packets to it. In that sense, it allows for "user mode drivers" to register their custom protocol implementations, using a setsockopt(2) with the SOL_NDRVPROTO level and NDRV_SETDMXSPEC option name. This, however, only works if no protocol is already registered for that EtherType (otherwise returning EADDRINUSE), and is therefore not useful for general packet sniffing.

A much more useful aspect of PF_NDRV is to create custom packets. A socket is created the same way, but specifying SOCK_RAW for the socket type. Following the binding to an interface, packets can be fabricated by directly writing to a buffer, and sending it on the bound interface using sendto(2).

 
Experiment: Using PF_NDRV to implement a custom network protocol

The following program can be used to demonstrate the capabilities of PF_NDRV for both custom protocol packet reception and sending. Because much of the code is common, the receiving functionality is compiled in with #ifdef LISTENER; otherwise, the program sends packets. You can try this program with any EtherType (specified as a decimal argument), so long as it doesn't collide with an already existing one (e.g. IP's 0x0800 or IPv6's 0x86dd).

Listing 16-2: A sample PF_NDRV client/listener program
#include <sys/socket.h>
#include <net/if.h>
#include <net/ndrv.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <net/ethernet.h>

int main (int argc, char **argv) {

   if (argc < 2) { fprintf(stderr,"Usage: %s ethertype\n", argv[0]); exit(1); }
   if (geteuid()) { fprintf(stderr,"No root, no service\n"); exit(1); }
   int s = socket(PF_NDRV,SOCK_RAW,0);
   if (s < 0) { perror ("socket"); exit(2); }

   int rc;
   uint16_t   etherType = htons(atoi(argv[1])); // to network byte order
   struct sockaddr_ndrv    sa_ndrv;

   memset(&sa_ndrv, '\0', sizeof(sa_ndrv));
   strlcpy((char *)sa_ndrv.snd_name, "en0", sizeof (sa_ndrv.snd_name));
   sa_ndrv.snd_family = PF_NDRV;
   sa_ndrv.snd_len = sizeof (sa_ndrv);
   
   rc = bind(s, (struct sockaddr *) &sa_ndrv, sizeof(sa_ndrv));
   
   if (rc < 0) { perror ("bind"); exit (3);}

   char packetBuffer[2048];

#ifdef LISTENER
   struct ndrv_protocol_desc desc;
   struct ndrv_demux_desc demux_desc[1];
   memset(&desc, '\0', sizeof(desc));
   memset(&demux_desc, '\0', sizeof(demux_desc));

   /* Request kernel for demuxing of one chosen ethertype */
   desc.version = NDRV_PROTOCOL_DESC_VERS;
   desc.protocol_family = atoi(argv[1]);
   desc.demux_count = 1;
   desc.demux_list = (struct ndrv_demux_desc*)&demux_desc;
   demux_desc[0].type = NDRV_DEMUXTYPE_ETHERTYPE;
   demux_desc[0].length = sizeof(unsigned short);
   demux_desc[0].data.ether_type = htons(atoi(argv[1])); // network byte order

   if (setsockopt(s, 
        SOL_NDRVPROTO, 
        NDRV_SETDMXSPEC, 
	 (caddr_t)&desc, sizeof(desc))) {
      perror("setsockopt"); exit(4);
   }
   /* Socket will now receive chosen ethertype packets */
   while ((rc = recv (s, packetBuffer, 2048, 0) ) > 0 ) {
   	printf("Got packet\n"); // remember, this is a PoC..
   }
#else
   memset(packetBuffer, '\xff', 12);          /* broadcast dst + src MACs */
   memcpy(packetBuffer + 12, &etherType, 2);  /* EtherType at offset 12 */
   strcpy(packetBuffer + 14, "NDRV is fun!"); /* payload after the 14-byte header */
   rc = sendto (s, packetBuffer, 14 + 13, 0, 
		(struct sockaddr *)&sa_ndrv, sizeof(sa_ndrv));
   if (rc < 0) { perror("sendto"); }
#endif
} 
 

PF_SYSTEM

The PF_SYSTEM protocol family is a proprietary Darwin mechanism which provides communication between kernel mode providers and user mode requesters. PF_SYSTEM sockets are always SOCK_RAW, with two protocols implemented: SYSPROTO_EVENT (1) and SYSPROTO_CONTROL (2).

SYSPROTO_EVENT

The SYSPROTO_EVENT protocol is used by the kernel to multicast events to interested parties. In that sense, it is very similar to Linux's AF_NETLINK. No binding is required for the socket - but an event filter must be set using a SIOCSKEVFILT ioctl(2) request. The ioctl(2) takes a struct kev_request, defined in <sys/kern_event.h> (along with the ioctl(2) codes) to consist of three uint32_t fields - vendor_code, kev_class and kev_subclass. Zero values may be specified as wildcards (...._ANY) for any of the three values. The only vendor supported out of the box is KEV_VENDOR_APPLE, and Table 16-3 shows the classes which exist in Darwin 18:

Table 16-3: SYSPROTO_EVENT KEV_... classes in Darwin 18
KEV_..._CLASS    KEV_..._SUBCLASS    Event types
1 NETWORK        1  INET             IPv4 (codes in <net/net_kev.h>)
                 2  DL               Data Link subclass (codes in <net/net_kev.h>)
                 3  NETPOLICY        Network policy subclass
                 4  SOCKET           Sockets
                 5  ATALK            AppleTalk (no longer used)
                 6  INET6            IPv6 (codes in <net/net_kev.h>)
                 7  ND6              IPv6 Neighbor Discovery Protocol
                 8  NECP             NECP subclass
                 9  NETAGENT         Net-Agent subclass
                 10 LOG              Log subclass
                 11 NETEVENT         Generic Net events subclass
                 12 MPTCP            Global MPTCP events subclass
2 IOKIT          ??                  IOKit drivers
3 SYSTEM         2  CTL              Control notifications
                 3  MEMORYSTATUS     Jetsam/memorystatus subclass
4 APPLESHARE                         AppleShare events (no longer used)
5 FIREWALL       1  IPFW             ipfw - IPv4 firewalling
                 2  IP6FW            ipfw - IPv6 firewalling
6 IEEE80211      1?                  Wireless Ethernet (IO80211Family drivers)

Following the setting of the filter, events can be read from the socket as a stream of kern_event_msg structures. As further explained in <sys/kern_event.h>, each event structure is of a variable total_size, specifying the vendor_code, kev_class and kev_subclass (which are guaranteed to match the filter), as well as a monotonically increasing id, an event_code, and any number of event_data words (up to the total_size specified). Additional ioctl(2) codes are SIOCGKEVID (get the current event ID), SIOCGKEVFILT (get the filter set on the socket) and SIOCGKEVVENDOR (look up the vendor code for a string vendor name). Apple's built-in mechanisms naturally use the APPLE vendor code, although vendor code 1000 has also been seen on MacOS (for the socketfilterfw, discussed later).

 
Experiment: Creating a SYSPROTO_EVENT listener

The programming model of SYSPROTO_EVENT is so simple an event listener can be coded in but a few lines:

Listing 16-4: Sample code for a SYSPROTO_EVENT listener
#include <sys/socket.h>
#include <sys/kern_event.h>
#include <sys/ioctl.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>

int main (int argc, char **argv)
{
  struct kev_request req;
  int s = socket(PF_SYSTEM, SOCK_RAW, SYSPROTO_EVENT);
  if (s < 0) { perror("socket"); exit(1); }

  req.vendor_code = KEV_VENDOR_APPLE;
  req.kev_class = KEV_ANY_CLASS;
  req.kev_subclass = KEV_ANY_SUBCLASS;

  if (ioctl(s, SIOCSKEVFILT, &req)){ perror("Unable to set filter"); exit(1);}
  char buf[1024];

  while (1) {
     int rc;
     struct kern_event_msg *kev;

     // can use if (ioctl(s, SIOCGKEVID, &id)) to get next ID
     // or simply read and block until an event occurs..
     rc = read (s, buf, 1024);
     if (rc < 0) { perror("read"); break; }
     kev = (struct kern_event_msg *)buf;
     printf ("%d: (%d bytes). Vendor/Class/Subclass: %d/%d/%d Code: %d\n",
              kev->id, kev->total_size, kev->vendor_code,
              kev->kev_class, kev->kev_subclass,
              kev->event_code);
  } // end while
  return 0;
} // end main

Compiling and running the above program will block, occasionally spitting out event notifications. Most common on MacOS are those of IEEE80211 (1/6/1), emitted on WiFi scans and state changes. Toggling the WiFi interface will also generate NETWORK/DL messages (1/1/2) as the interface reconfigures, and NETWORK/INET6 (1/1/6) as it obtains a dynamic IP address.

Being a Darwin proprietary mechanism, the event notifications are used by Apple's own daemons. Using procexp(j) you can see which sockets are used by which daemons - including the above program ('kev'), when it runs:

Output 16-5: Viewing SYSPROTO_EVENT socket usage with procexp(j)
root@Chimera (~)# sudo procexp all fds | grep Event:
kev	         12172 FD  3u  socket system Event:   APPLE:ANY:ANY
socketfilterfw    1723 FD  4u  socket system Event:   1000:5:11
socketfilterfw    1723 FD  7u  socket system Event:   APPLE:NETWORK:LOG
sharingd           289 FD  5u  socket system Event:   APPLE:IEEE80211:1
UserEventAgent     244 FD  4u  socket system Event:   APPLE:IEEE80211:1
airportd           143 FD  7u  socket system Event:   APPLE:NETWORK:DL
airportd           143 FD 22u  socket system Event:   APPLE:IEEE80211:1
CommCenter         248 FD  6u  socket system Event:   APPLE:NETWORK:NETEVENT
symptomsd          177 FD 15u  socket system Event:   APPLE:NETWORK:INET
symptomsd          177 FD 16u  socket system Event:   APPLE:NETWORK:ND6
AirPlayXPCHelpe     98 FD  3u  socket system Event:   APPLE:NETWORK:DL
AirPlayXPCHelpe     98 FD  6u  socket system Event:   APPLE:IEEE80211:1
UserEventAgent      43 FD  5u  socket system Event:   APPLE:SYSTEM:MEMORYSTATUS
bluetoothd          95 FD  4u  socket system Event:   APPLE:IEEE80211:1
locationd           84 FD 11u  socket system Event:   APPLE:IEEE80211:1
configd             54 FD  4u  socket system Event:   APPLE:NETWORK:ANY
configd             54 FD 19u  socket system Event:   APPLE:IEEE80211:1
configd             54 FD 21u  socket system Event:   APPLE:IEEE80211:1
 

SYSPROTO_CONTROL

The second protocol of the PF_SYSTEM family is SYSPROTO_CONTROL. This merely provides a control channel from user space onto a given provider, which may be a kernel subsystem or some kernel extension calling on the ctl_register KPI. Such SYSPROTO_CONTROL sockets are associated with control names, which Apple maintains in reverse DNS notation. Apple keeps adding more providers between Darwin versions and in new kernel extensions - all, needless to say, undocumented. Using netstat(1), you can see both which providers are registered (under "Registered kernel control modules"), and which are actively in use (under "Active kernel control sockets"), although to see which processes are actually holding control sockets one needs to use lsof(1) or procexp(j). Table 16-6 shows the providers found in Darwin 18.

Table 16-6: The known SYSPROTO_CONTROL IDs in Darwin 18
com.apple. Control Name               Provides
network.statistics                    Live socket statistics and notifications
content-filter                        XNU-2782: User space packet data filtering. Used by network-cmds cfilutil
fileutil.kext.state[less/ful].ctl     MacOS 14: AppleFileUtil.kext
flow-divert                           XNU-2422: MPTCP flow diversions
mcx.kernctl.alr                       MacOS mcxalr.kext: Managed Client eXtensions control
net.ipsec_control                     XNU-2422: User-mode IPSEC controls
net.necp_control                      XNU-2782: Network Extension Control Policies
net.netagent                          XNU-3248: Network Agents (discussed later)
net.rvi_control                       RemoteVirtualInterface.kext: control socket
net.utun_control                      User mode tunneling (VPNs)
netsrc                                Network/route policies and statistics
network.advisory                      XNU-3248: Report SYMPTOMS_ADVISORY_[CELL/WIFI]_[BAD/OK] to kernel
network.tcp_ccdebug                   XNU-2782: Collect flow control algorithm debug data
nke.sockwall                          MacOS: The Application Layer Firewall (ALF.kext), discussed later
nke.webcontentfilter                  webcontentfilter.kext: "HolyInquisition" socket filtering via user-mode proxy
packet-mangler                        XNU-2782: Tracks flows and handles TCP options (Used by network-cmds pktmnglr)
uart.[sk].*                           AppleOnBoardSerial.kext and other UART devices
                                      (MacOS: BLTH, MALS, SOC; iOS: oscar, gas-gauge, wlan-debug and iap)
userspace_ethernet                    IOUserEthernet.kext: User mode tunneling (Layer II Ethernet)

Once a socket is created, a CTLIOCGINFO ioctl(2) must be issued with a struct ctl_info argument, whose ctl_name field is initialized with the requested control name. If the ioctl(2) is successful, the socket may then be connect(2)ed through a struct sockaddr_ctl, initialized with the ctl_id returned from the previous ioctl(2).

A connected socket, however, is as far as the SYSPROTO_CONTROL commonality goes. From that point on, every socket behaves differently, depending on the underlying provider. The general flow usually entails send(2)ing and recv(2)ing, and in some cases using [get/set]sockopt(2). The system calls result in kernel mode callbacks (registered by the implementing party, usually a kernel extension) being invoked. The kernel-mode implementation is discussed in Volume II.

Control sockets are commonly used for administering other networking facilities, so a few examples of their usage will be discussed in this chapter.

 
Experiment: User mode tunneling with SYSPROTO_CONTROL

User mode tunneling, a feature commonly tapped by VPN applications, is a great example of a SYSPROTO_CONTROL socket. Such applications, rather than installing some kernel filtering mechanism, draw on a kernel facility to request the creation of a new interface, which appears to other processes as another link layer, complete with its own IPv4 or IPv6 address. When such processes bind to the interface, the IP-layer packets are redirected to the utun controller, which can then do with them as it pleases - commonly encapsulating them in an additional IP layer (with or without encryption) and sending them elsewhere. The process works both ways, in that the utun controller can also inject packets onto the tunneling interface, which the kernel will then route to their bound sockets, as it would have with any other interface.

The compact code in Listing 16-7 sets up com.apple.net.utun_control:

Listing 16-7: The code to set up a user mode tunneling interface
#include <sys/socket.h>
#include <sys/ioctl.h>
#include <sys/kern_control.h>
#include <net/if_utun.h>       // for UTUN_CONTROL_NAME
#include <string.h>
#include <stdio.h>
#include <unistd.h>

int tun(void)
{
  struct sockaddr_ctl sc;
  struct ctl_info ctlInfo;

  memset(&ctlInfo, 0, sizeof(ctlInfo));
  strlcpy(ctlInfo.ctl_name, UTUN_CONTROL_NAME, sizeof(ctlInfo.ctl_name));

  int fd = socket(PF_SYSTEM, SOCK_DGRAM, SYSPROTO_CONTROL);
  if (fd == -1) { /* perror .. */ return -1;}
  if (ioctl(fd, CTLIOCGINFO, &ctlInfo) < 0) { /* perror.. */ return -1; }

  memset(&sc, 0, sizeof(sc));
  sc.sc_id = ctlInfo.ctl_id; sc.sc_len = sizeof(sc);
  sc.sc_family = AF_SYSTEM;  sc.ss_sysaddr = AF_SYS_CONTROL;
  sc.sc_unit = 2; /* To create utun1, just in case utun0 is in use */

  /* utun%d device will be created, where "%d" is unit number -1 */
  if (connect(fd, (struct sockaddr *)&sc, sizeof(sc)) == -1) {
           perror ("connect(AF_SYSCONTROL)"); close(fd); return -1; }

  return fd;
}

Once the descriptor is set up, a trivial read(2) loop implementation is left as an exercise for the avid reader. When completed, an interface will appear (in the example we use utun1, since utun0 is occasionally used by IDS.framework's identityserviced). After configuring an IP address on the interface, generating any traffic (e.g. with ping(8)) will send those IP packets to the process:

Output 16-8 (a/b): The output of a sample program built with the previous listing

In the first terminal, set up tunnel:

morpheus@chimera(~)# ifconfig utun1
ifconfig: interface utun1 does not exist
morpheus@chimera(~)# /tmp/tun
#
# IP Packets come out here (in this example, 
#   20 bytes (IP Header)
# +  8 bytes (ICMP Header)
# + 56 byte payload = 76 bytes, below
#
45 00 00 54 97 90 00 00 # 45 - IPv4, 20 bytes
40 01 db 0c 01 02 03 04 # 01 - ICMP, src: 1.2.3.4 
01 02 03 05 08 00 a4 0e # dst: 1.2.3.5, 08 - Echo
87 30 00 00 5b f4 ba 97 
00 07 cb 2a 08 09 0a 0b 
0c 0d 0e 0f 10 11 12 13 
14 15 16 17 18 19 1a 1b 
1c 1d 1e 1f 20 21 22 23 
24 25 26 27 28 29 2a 2b
2c 2d 2e 2f 30 31 32 33 
34 35 36 37 

In another terminal, once tunnel is up:

#
# Interface exists - initially no address
#
$ ifconfig utun1
utun1: flags=8051<UP,POINTOPOINT,.. > mtu 1500
#
# Configure with some address
#
$ sudo ifconfig utun1 1.2.3.4 1.2.3.5
#
# Make sure configuration worked:
#
$ ifconfig utun1
utun1: flags=8051 ..  mtu 1500
 inet 1.2.3.4 --> 1.2.3.5 netmask 0xff000000 
#
# Generate traffic
#
$ ping 1.2.3.5
PING 1.2.3.5 (1.2.3.5): 56 data bytes
 

Proprietary socket system calls

In addition to the _nocancel extensions found in XNU for other I/O related calls, Apple has extended the BSD socket API with several proprietary system calls - which are, as usual, undocumented.

pid_shutdown_sockets (#436)

The pid_shutdown_sockets system call enables the caller to forcibly shut down all sockets presently open by a given process. This is only used on *OS, where the only caller of this system call appears to be /usr/libexec/assertiond.

socket_delegate (#450)

The socket_delegate system call works like socket(2) - only it receives an additional (fourth) argument, specifying the PID of the process to which the socket is to be delegated. There appears to be little use for this system call.

[dis]connectx and [send/recv]msg_x (#447-8, #480-1)

Apple introduced several non-standard system calls which both extend BSD sockets and provide support for MultiPath TCP (and other protocols), as far back as Darwin 13. The first pair, [dis]connectx, provides for quick bind(2)/connect(2), deferred connection setup (CONNECT_RESUME_ON_READ_WRITE), as well as support for multiple address associations. These only became an official API as of Darwin 15, when they got a fairly detailed manual page. The second pair, [send/recv]msg_x (supporting array-based [send/recv]msg for protocols handling multiple simultaneous datagrams), is still not officially provided to this day: the prototypes and documentation in bsd/sys/socket.h are PRIVATE, so as not to appear in <sys/socket.h>.

Listing 16-9: The extended socket system calls (from <sys/socket.h>)
__API_AVAILABLE(macosx(10.11), ios(9.0), tvos(9.0), watchos(2.0))       // #447
int connectx(int, const sa_endpoints_t *, sae_associd_t, unsigned int,
             const struct iovec *, unsigned int, size_t *, sae_connid_t *);

__API_AVAILABLE(macosx(10.11), ios(9.0), tvos(9.0), watchos(2.0))       // #448
int disconnectx(int, sae_associd_t, sae_connid_t);

__API_STILL_NOT_AVAILABLE_FOR_SOME_REASON(macosx(10.14), ios(12.0), etc)
ssize_t recvmsg_x(int s, struct msghdr_x *msgp, u_int cnt, int flags);  // #480
ssize_t sendmsg_x(int s, struct msghdr_x *msgp, u_int cnt, int flags);  // #481

Apple expects developers to use the (very) high level abstraction of NSURLSession objects, setting the MultipathServiceType property of NSURLSessionConfiguration as documented in a developer article[1]. For pure C programmers, the system calls remain the most effective mechanism to tap into MPTCP's powerful functionality, and other address family specific yet non-standard features.

peeloff (#449)

peeloff was a short lived system call meant to extract an association from a socket. It was added in XNU-2422 but apparently replaced with a null implementation (returning 0) in XNU-4570.

 

Interfaces

As with other UN*X systems, Darwin provides user mode with network access through the notion of interfaces. These are devices which, unlike standard character or block devices, have no filesystem presence, and can only be accessed through sockets and controlled through ioctl(2) on the bound sockets. The command line ifconfig(8) utility comes in very handy to view interfaces, with -l for a short list or -a for full information. Trying this on any Darwin system details the interfaces, which follow the naming conventions shown in Table 16-10:

Table 16-10: The interfaces found on Darwin systems
Link       Interface     Provided by                     Used for
Loopback   lo0           XNU                             The loopback ("localhost") interface
           gif#                                          Generic IPv[4/6]-in-IPv[4/6] (RFC2893) tunneling
           stf#                                          6-to-4 (RFC3056) tunneling
           utun#                                         User mode tunneling
           ipsec#                                        IPSec tunneling
Ethernet   en#           IONetworkingFamily              Ethernet (wired, wireless, and over other media)
           awdl0         IOgPTPPlugin.kext               Apple Wireless Device Link
           p2p#          AppleBCMWLANCore                Wi-Fi peer to peer
           ppp#          PPP.kext                        Point-to-Point Protocol (/usr/sbin/pppd)
           bridge#       XNU                             MacOS: Interface bridging
           fw#           IOFireWireIP                    MacOS: IP over FireWire
           rvi#          RemoteVirtualInterface          MacOS: captures packets from attached *OS devices
           ap#                                           Access Point (personal hotspot)
Cell       pdp_ip#       AppleBasebandPCI[ICE/MAV]PDP    iOS/WatchOS: Cellular connection (if applicable)
USB        XHC#          AppleUSBHostPacketFilter        MacOS: USB Packet capture
Capture    [pk/ip]tap#   XNU                             Packet or IP Layer capture from multiple interfaces

The en# interfaces are the ones most commonly used. Not only do the local wired Ethernet (on Mac Minis and iMacs, through ......) and wireless interfaces (used by the Airport Broadcom or other NIC kexts) appear as en#, but so do Bluetooth, the Mac ↔ BridgeOS interface (usually en5, started by the com.apple.driver.usb.cdc.ncm driver), and the tethering interface of the iPhone when Mobile Hotspot is activated over USB. The system maintains a property list of all Ethernet interfaces and their mappings to the GUI-visible strings ("UserDefinedName") in /L*/Pref*ces/SystemConfiguration/NetworkInterfaces.plist (which can be displayed with networksetup -listallhardwareports).

Interface Configuration

As with other UN*X systems, the ifconfig(8) utility can be used to obtain information on interfaces and perform various operations, such as plumbing, adding/removing IPv[4/6] addresses or aliases, and bonding. Darwin also extends the interface object internally with numerous proprietary ioctl(2) codes, all marked PRIVATE or BSD_KERNEL_PRIVATE so as not to be visible in user mode. Whereas <sys/sockio.h> defines about 19 SIOCSIF* codes, XNU's bsd/sys/sockio.h nearly doubles that number with additional codes handling numerous proprietary functions. The undocumented ioctl(2) codes are shown in Table 16-11 (next page), in their macro form, which also defines their third (void *) argument. Note that some of the structures are also undocumented, and are found elsewhere in the XNU sources. The book's companion XXR can come in handy to help you locate the structures and copy them to a user mode header.

 
Table 16-11: Unexported SIOC* codes (from XNU-4903's bsd/sys/sockio.h)
ioctl(2) code                      Value (_IOWR())                               Description
SIOCGIFCONF[32/64]                 ('i', 36, struct ifconf[32/64])               Get ifnet list
SIOCGIFMEDIA[32/64]                ('i', 56, struct ifmediareq[32/64])           Get net media
SIOCGIFGETRTREFCNT                 ('i', 137, struct ifreq)                      Get interface route refcnt
SIOCGIFLINKQUALITYMETRIC           ('i', 138, struct ifreq)                      Get LQM
SIOCGIFEFLAGS                      ('i', 142, struct ifreq)                      Get extended ifnet flags
SIOC[S/G]IFDESC                    ('i', 143/144, struct if_descreq)             Set/Get interface description
SIOC[S/G]IFLINKPARAMS              ('i', 145/146, struct if_linkparamsreq)       Set/Get output TBR rate/percent
SIOCGIFQUEUESTATS                  ('i', 147, struct if_qstatsreq)               Get interface queue statistics
SIOC[S/G]IFTHROTTLE                ('i', 148/149, struct if_throttlereq)         Set/Get throttling for interface
SIOCGASSOCIDS[/32/64]              ('s', 150, struct so_aidreq[/32/64])          Get associds
SIOCGCONNIDS[/32/64]               ('s', 151, struct so_cidreq[/32/64])          Get connids
SIOCGCONNINFO[/32/64]              ('s', 152, struct so_cinforeq[/32/64])        Get conninfo
SIOC[S/G]CONNORDER                 ('s', 153/154, struct so_cordreq)             Set/Get conn order
SIO[C/G]SIFLOG                     ('i', 155/156, struct ifreq)                  Get/Set interface log level
SIOCGIFDELEGATE                    ('i', 157, struct ifreq)                      Get delegated interface index
SIOCGIFLLADDR                      ('i', 158, struct ifreq)                      Get link level addr
SIOCGIFTYPE                        ('i', 159, struct ifreq)                      Get interface type
SIOC[G/S]IFEXPENSIVE               ('i', 160/161, struct ifreq)                  Get/mark interface expensive flag
SIO[C/S]GIF2KCL                    ('i', 162/163, struct ifreq)                  Interface prefers 2 KB clusters
SIOCGSTARTDELAY                    ('i', 164, struct ifreq)                      Add artificial delay
SIOCAIFAGENTID                     ('i', 165, struct if_agentidreq)              Add netagent id
SIOCDIFAGENTID                     ('i', 166, struct if_agentidreq)              Delete netagent id
SIOCGIFAGENTIDS[/32/64]            ('i', 167, struct if_agentidsreq[/32/64])     Get netagent ids
SIOCGIFAGENTDATA[/32/64]           ('i', 168, struct netagent_req[/32/64])       Get netagent data
SIOC[S/G]IFINTERFACESTATE          ('i', 169/170, struct ifreq)                  Set/Get interface state
SIOC[S/G]IFPROBECONNECTIVITY       ('i', 171/172, struct ifreq)                  Start/stop or check connectivity probes
SIOCGIFFUNCTIONALTYPE              ('i', 173, struct ifreq)                      Get interface functional type
SIOC[S/G]IFNETSIGNATURE            ('i', 174/175, struct if_nsreq)               Set/Get network signature
SIOC[G/S]ECNMODE                   ('i', 176/177, struct ifreq)                  Explicit Congestion Notification mode (IFRTYPE_ECN_[[EN/DIS]ABLE/DEFAULT])
SIOCSIFORDER                       ('i', 178, struct if_order)                   Set interface ordering
SIO[C/G]SQOSMARKINGMODE[/ENABLED]  ('i', 180-183, struct ifreq)                  Get/set QoS marking mode
SIOCSIFTIMESTAMP[EN/DIS]ABLE       ('i', 184-185, struct ifreq)                  Enable/Disable interface timestamp
SIOCGIFTIMESTAMPENABLED            ('i', 186, struct ifreq)                      Get interface timestamp enabled status
SIOCSIFDISABLEOUTPUT               ('i', 187, struct ifreq)                      Disable output (DEVELOPMENT/DEBUG)
SIOCGIFAGENTLIST[/32/64]           ('i', 190, struct netagentlist_req[/32/64])   Get netagent dump
SIOC[S/G]IFLOWINTERNET             ('i', 191/192, struct ifreq)                  Set/Get low internet download/upload
SIOC[G/S]IFNAT64PREFIX             ('i', 193/194, struct if_nat64req)            Get/set interface NAT64 prefixes
SIOCGIFNEXUS                       ('i', 195, struct if_nexusreq)                Get nexus details
SIOCGIFPROTOLIST[/32/64]           ('i', 196, struct if_protolistreq[/32/64])    Get list of attached protocols
SIOC[G/S]IFLOWPOWER                ('i', 199/200, struct ifreq)                  Low Power Mode
SIOCGIFCLAT46ADDR                  ('i', 201, struct if_clat46req)               Get CLAT (IPv4-in-IPv6) addresses
 

Case Study: rvi

The Remote Virtual Interface was introduced in iOS 5.0 and documented by Apple in QA1176[2]. The feature allows the iOS interfaces to appear on a Mac host, so that packet tracing tools (notably, tcpdump(1)) can be used through the host.

RVI requires the cooperation of several components, both on the host and the device, all working together as shown in Figure 16-13 (next page). The binaries of the remote virtual interface package all belong to the same (unnamed and closed source) project, and are among the few in MacOS which still target 10.7 (i.e. use LC_UNIXTHREAD), apparently left unmaintained since then.

/usr/bin/rvictl is a simple command line utility which can list active device UUIDs (-l/-L), and start (-s/-S) or stop (-x/-X) the remote virtual interface on devices specified by UUID. It does so by linking with MobileDevice.framework (for the list functionality), and with the private RemotePacketCapture.framework. The latter exports APIs which hide the IPC connection to /usr/libexec/rpmuxd.

/usr/libexec/rpmuxd is started on demand by launchd, when rvictl requests com.apple.rpmuxd. The daemon handles the local end of the packet capture operations, and provides a notification mechanism for interested clients over MIG subsystem 117731, with five messages:

Table 16-12: /usr/libexec/rpmuxd's MIG subsystem:
#        Routine Name
117731   rpmuxd_start_packet_capture
117732   rpmuxd_stop_packet_capture
117733   rpmuxd_get_current_devices
117734   rpmuxd_register_notification_port
117735   rpmuxd_deregister_notification_port

The daemon controls RemoteVirtualInterface.kext (visible in kextstat(1) by its CFBundleIdentifier of com.apple.nke.rvi). The kext isn't normally loaded, so rpmuxd loads it if necessary (by posix_spawn(2)ing kextload). The kext sets up a PF_SYSTEM control socket with the name com.apple.net.rvi_control, which the daemon connect(2)s to, and through which it requests the kext to create the rvi# interface. At the same time, the daemon handles the connection to the iDevice, calling AMDeviceSecureStartService() to request the launch of com.apple.pcapd on it.

On the iDevice, when lockdownd receives the request to start com.apple.pcapd, it consults the __TEXT.__services plist embedded in its Mach-O, and resolves the name to /usr/libexec/pcapd. pcapd is a small daemon which uses libpcap.A.dylib, calling pcap_setup_pktap_interface() and then using the Berkeley Packet Filter (explained later in this chapter) to capture all packets. These packets are relayed over the lockdown connection to the host's rpmuxd, which injects them (through the control socket) into the kext, which in turn pushes them through the rvi# interface it has created.

The end result of this is that the rvi# interface is entirely indistinguishable from other ethernet interfaces for packet capture tools, and so tcpdump(1) can be run on the host (with the -i rvi# switch) to obtain the packets which were actually captured on the iDevice. The connection, however, is read-only, so packets cannot be sent through the rvi# interface back to the device.

 
Figure 16-13: The architecture of the Remote Virtual Interface facility (addresses from MacOS 14 binaries)
 

Networking Configuration

The network stack exposes a plethora of configuration settings (and statistics, described later) via sysctl MIBs. These are all conveniently in the net namespace. As with other MIBs, they are mostly undocumented save for a description in the kernel's __DATA.__sysctl_set (through the SYSCTL_OID's description field), which can be displayed with joker -S.

IPv4 configuration

The MIBs for controlling IPv4 and IPv6 are broken into two separate namespaces: net.inet.ip and net.inet6.ip6. Owing to the similarities of the two protocols, however, some MIBs are found in both namespaces, as shown in Table 16-14:

Table 16-14: Configuration MIBs common to IPv4 and IPv6
net.inet[6].ip[6] MIB | Default    | Purpose
mcast.loop            | 0x00000001 | Loopback multicast datagrams by default
rtexpire              | 0x0000013b | Default expiration time on dynamically learned routes
rtminexpire           | 0x0000000a | Minimum time to hold onto dynamically learned routes
rtmaxcache            | 0x00000080 | Upper limit on dynamically learned routes
forwarding            | 0x00000000 | Enable IP forwarding between interfaces
redirect              | 0x00000001 | Enable sending IP redirects
maxfragpackets        | 1536       | Maximum number of fragment reassembly queue entries
maxfragsperpacket     | 128        | Maximum number of fragments allowed per packet
adj_clear_hwcksum     | 0x00000000 | Invalidate hwcksum info when adjusting length
adj_partial_sum       | 0x00000001 | Perform partial sum adjustment of trailing bytes at IP layer
mcast.maxgrpsrc       | 0x000200   | Max source filters per group
mcast.maxsocksrc      | 0x000080   | Max source filters per socket
mcast.loop            | 0x00000080 | Multicast on loopback interface

Additional IPv4-specific parameters, found in net.inet.ip, are shown in Table 16-15.

Table 16-15: Configuration MIBs specific to IPv4
net.inet.ip MIB            | Default     | Purpose
portrange.low[first/last]  | 600-1023    | Low reserved port range
portrange.hi[first/last]   | 49152-65535 | High reserved port range
ttl                        | 0x00000040  | Default TTL value on outgoing packets
[accept_]sourceroute       | 0x00000000  | Enable [accepting/forwarding] source routed IP packets
gifttl                     | 0x0000001e  | Time-to-Live (max hop count) on GIF (IP-in-IP) interfaces
subnets_are_local          | 0x00000000  | Subnets of local interfaces also considered local
random_id_statistics       | 0x00000000  | Enable IP ID statistics
sendsourcequench           | 0x00000000  | Enable the transmission of source quench packets
check_interface            | 0x00000000  | Verify packet arrives on correct interface
rx_chaining                | 0x00000001  | Do receive side IP address based chaining
rx_chainsz                 | 0x00000006  | IP receive side max chaining
linklocal.in.allowbadttl   | 0x00000001  | Allow incoming link local packets with TTL < 255
random_id                  | 0x00000001  | Randomize IP packet IDs
maxchainsent               | 0x00000016  | use dlil_output_list
select_srcif_debug         | 0x00000000  | Debug (dmesg) source address selection
output_perf                | 0x00000000  | Do time measurement
rfc6864                    | 0x00000001  | Updated Specification of the IPv4 ID Field
 

IPv6 configuration

IPv6 has plenty of other specific parameters which affect its behavior, particularly for its sub-protocols, like Neighbor Discovery (ND).

Table 16-16: The IPv6 related sysctl control MIBs
net.inet6 sysctl MIB                  | Default | Purpose
ip6.hlim                              | 64      | IPv6 hop limit
ip6.accept_rtadv                      | 1       | Accept ICMPv6 Router Advertisements
ip6.keepfaith                         | 0       | Unused. Apparently IPv6 has grown disillusioned.
ip6.log_interval                      | 5       | Throttle kernel log output to once in log_interval
ip6.hdrnestlimit                      | 15      | IP header nesting limit
ip6.dad_count                         | 1       | Duplicate Address Detection count (read only)
ip6.auto_flowlabel                    | 1       | Assign IPv6 flow labels (in header)
ip6.defmcasthlim                      | 1       | Default multicast hop limit
ip6.gifhlim                           | 0       | Hop limit on GIF (IP-in-IP) interfaces
ip6.use_deprecated                    | 1       | Continue use of deprecated temporary address
ip6.rr_prune                          | 5       | Router renumbering prefix
ip6.v6only                            | 0       | If 0, enable IPv6 mapped addresses. Else, native IPv6 only
ip6.use_tempaddr                      | 1       | RFC3041 temporary interface addresses
ip6.temppltime                        | 86400   | Temporary address preferred lifetime (sec)
ip6.tempvltime                        | 604800  | Temporary address maximum lifetime (sec)
ip6.auto_linklocal                    | 1       | Automatically use link local (fe80::) addresses
ip6.prefer_tempaddr                   | 1       | Prefer the temporary address over the assigned one
ip6.use_defaultzone                   | 0       | Embed default scope ID
ip6.maxfrags                          | 3072    | Maximum number of IPv6 fragments allowed
ip6.mcast_pmtu                        | 0       | Enable Multicast Path MTU discovery
ip6.neighborgcthresh                  | 1024    | Neighbor cache garbage collection threshold
ip6.maxifprefixes                     | 16      | Maximum interface prefixes adopted from router advertisements
ip6.maxifdefrouters                   | 16      | Maximum default routers adopted from router advertisements
ip6.maxdynroutes                      | 1024    | Maximum number of dynamic (via redirect) routes allowed
ip6.input_perf_bins                   | 0       | Bins for chaining performance data histogram
ip6.select_srcif_debug                | 0       | Debug (log) selection process of source interface
ip6.select_srcaddr_debug              | 0       | Debug (log) selection process of source address
ip6.select_src_expensive_secondary_if | 0       | Allow source address selection to use interfaces w/high metric
ip6.select_src_strong_end             | 1       | Limit source address selection to outgoing interface
ip6.only_allow_rfc4193_prefixes       | 0       | Use RFC4193 as baseline for network prefixes
ip6.maxchainsent                      | 1       | use dlil_output_list
ip6.dad_enhanced                      | 1       | Adds a random nonce to NS messages for DAD
 

IPSec (6) Configuration

IPSec is deeply integrated into IPv6, and therefore more likely to be used with it than in the IPv4 case. Many of the ipsec6 MIB values also apply to IPv4 (i.e. exist in net.inet.ipsec as well), and values which apply to both are under net.ipsec (not shown below).

Table 16-17: The net.inet6.ipsec6 related sysctl control MIBs
net.inet6.ipsec6 MIB | Default | Purpose
def_policy           | 1       | Default Policy
esp_trans_deflev     | 1       | Encapsulating Security Payload in Transport Mode
esp_net_deflev       | 1       | Encapsulating Security Payload in Network Mode
ah_trans_deflev      | 1       | Authentication Header in Transport Mode
ah_net_deflev        | 1       | Authentication Header in Network Mode
ecn                  | 0       | Toggle Explicit Congestion Notifications
debug                | 0       | Toggle logging and debugging
esp_randpad          | -1      | Pad Encapsulating Security Payload with random bytes

ICMPv6 Configuration

ICMPv6 (RFC4443) is also tightly knit into IPv6, and includes the sub-protocols of Neighbor Discovery (ND) and SEcure Neighbor Discovery (SEND). Another sub-protocol, Multicast Listener Discovery (MLD), differs subtly between version 1 (RFC2710) and version 2 (RFC3810), both of which are supported by the Darwin network stack.

Table 16-18: The net.inet6 ICMPv6, ND and SEND related sysctl control MIBs
net.inet6 MIB               | Default | Purpose
icmp6.rediraccept           | 1       | Accept and process redirects
icmp6.redirtimeout          | 600     | Expire ICMP redirected route entries after n seconds
icmp6.rappslimit            | 10      | Router Advertisement packets per second limit
icmp6.errppslimit           | 500     | Packet-per-second error limit
icmp6.nodeinfo              | 3       | Enable/disable NI response
icmp6.nd6_prune             | 1       | Walk list every n seconds
icmp6.nd6_prune_lazy        | 5       | Lazily walk list every n seconds
icmp6.nd6_delay             | 5       | Delay first probe in seconds
icmp6.nd6_[u/m]maxtries     | 3       | Maximum [unicast/multicast] ND query attempts
icmp6.nd6_useloopback       | 1       | Allow ND6 to operate on loopback interface
icmp6.nd6_debug             | 0       | Output ND debug messages to kernel log
icmp6.nd6_accept_6to4       | 1       | Accept neighbors from 6-to-4 links
icmp6.nd6_optimistic_dad    | 63      | Assume Duplicate Address Detection won't ever collide
icmp6.nd6_onlink_ns_rfc4861 | 0       | Accept 'on-link' nd6 NS in compliance with RFC 4861
icmp6.nd6_llreach_base      | 30      | Default ND6 link-layer reachability max lifetime (in seconds)
icmp6.nd6_maxsolstgt        | 8       | Maximum number of outstanding solicited targets per prefix
icmp6.nd6_maxproxiedsol     | 4       | Maximum number of outstanding solicitations per target
send.opmode                 | 1       | Configured SEND operating mode
Multicast Listener Discovery
mld.gsrdelay                | 10      | Rate limit for IGMPv3 Group-and-Source queries in seconds
mld.v1enable                | 1       | Support MLDv1 (RFC2710)
mld.v2enable                | 1       | Support MLDv2 (RFC3810)
mld.use_allow               | 1       | Use ALLOW/BLOCK for RFC 4604 SSM joins/leaves
mld.debug                   | 0       | Output MLD debug messages to kernel log
 

TCP configuration

Darwin's TCP implementation has a huge number of settings, which toggle support for various RFCs and best practices. They are all under net.inet.tcp, and apply the same way for both IPv4 and IPv6, save for [v6]mssdflt. Table 16-19 lists them all:

Table 16-19: The TCP related sysctl control MIBs
net.inet.tcp MIB           | Default       | Purpose
[v6]mssdflt                | [0x400]/0x200 | Default TCP Maximum Segment Size
keepidle                   | 0x006ddd00    | Keepalive timeout for idle connections
keepintvl                  | 0x000124f8    | Keepalive interval
sendspace                  | 0x00020000    | Maximum outgoing TCP datagram size
recvspace                  | 0x00020000    | Maximum incoming TCP datagram size
randomize_ports            | 0x00000000    | Randomize TCP source ports
log_in_vain                | 0x00000000    | Log all incoming TCP packets
blackhole                  | 0x00000000    | Do not send RST when dropping refused connections
keepinit                   | 0x000124f8    | TCP connect idle keep alive time
disable_tcp_heuristics     | 0x00000000    | Set to 1 to disable all TCP heuristics (TFO, ECN, MPTCP)
delayed_ack                | 0x00000003    | Delay ACK to try and piggyback it onto a data packet
tcp_lq_overflow            | 0x00000001    | Listen Queue Overflow
recvbg                     | 0x00000000    | Receive background
drop_synfin                | 0x00000001    | Drop TCP packets with SYN+FIN set
slowlink_wsize             | 0x00002000    | Maximum advertised window size for slowlink
rfc1644                    | 0x00000000    | T/TCP support
rfc3390                    | 0x00000001    | Increased Initial Window
rfc3465                    | 0x00000001    | Congestion Control with Appropriate Byte Counting (ABC)
rfc3465_lim2               | 0x00000001    | Appropriate bytes counting w/ L=2*SMSS
doautorcvbuf               | 0x00000001    | Enable automatic socket buffer tuning
autorcvbufmax              | 0x00100000    | Maximum receive socket buffer size
disable_access_to_stats    | 0x00000001    | Disable access to tcpstat
rcvsspktcnt                | 0x00000200    | Packets to be seen before receiver stretches ACKs
rexmt_thresh               | 0x00000003    | Duplicate ACK threshold for Fast Retransmit
slowstart_flightsize       | 0x00000001    | Slow start flight size
local_slowstart_flightsize | 0x00000008    | Slow start flight size (local networks)
tso                        | 0x00000001    | TCP Segmentation Offload
ecn_initiate_out           | 0x00000002    | Initiate ECN for outbound
ecn_negotiate_in           | 0x00000002    | Initiate ECN for inbound
ecn_setup_percentage       | 0x00000064    | Max ECN setup percentage
ecn_timeout                | 0x0000003c    | Initial minutes to wait before re-trying ECN
packetchain                | 0x00000032    | Enable TCP output packet chaining
socket_unlocked_on_output  | 0x00000001    | Unlock TCP when sending packets down to IP
recv_allowed_iaj           | 0x00000005    | Allowed inter-packet arrival jitter
min_iaj_win                | 0x00000010    | Minimum recv win based on inter-packet arrival jitter
acc_iaj_react_limit        | 0x000000c8    | Accumulated IAJ when receiver starts to react
doautosndbuf               | 0x00000001    | Enable send socket buffer auto-tuning
autosndbufinc              | 0x00002000    | Increment in send buffer size
autosndbufmax              | 0x00100000    | Maximum send buffer size
ack_prioritize             | 0x00000001    | Prioritize pure ACKs
rtt_recvbg                 | 0x00000001    | Use RTT for bg recv algorithm
 
Table 16-19 (cont.): The TCP related sysctl control MIBs
net.inet.tcp MIB             | Default    | Purpose
recv_throttle_minwin         | 0x00004000 | Minimum recv win for throttling
enable_tlp                   | 0x00000001 | Enable Tail Loss Probe
sack                         | 0x00000001 | TCP Selective ACK
sack_maxholes                | 0x00000080 | Maximum # of TCP SACK holes allowed per connection
sack_globalmaxholes          | 0x00010000 | Global maximum TCP SACK holes (across all connections)
fastopen                     | 0x00000003 | Enable TCP Fast Open (RFC7413)
fastopen_backlog             | 0x0000000a | Backlog queue for half-open TCP Fast Open connections
fastopen_key                 |            | TCP Fast Open key
backoff_maximum              | 0x00010000 | Maximum time for which we won't try TCP Fast Open
clear_tfocache               | 0x00000000 | Toggle to clear the TFO destination based heuristic cache
now_init                     | 0x2d88b850 | Initial tcp now value
microuptime_init             | 0x000daa2d | Initial tcp uptime value in microseconds
minmss                       | 0x000000d8 | Minimum TCP Maximum Segment Size
do_tcpdrain                  | 0x00000000 | Enable tcp_drain routine for extra help when low on mbufs
icmp_may_rst                 | 0x00000001 | ICMP unreachable may abort connections in SYN_SENT
rtt_min                      | 0x00000064 | Minimum Round Trip Time value allowed
rexmt_slop                   | 0x000000c8 | Slop added to retransmit timeout
win_scale_factor             | 0x00000003 | Sliding window scaling factor
tcbhashsize                  | 0x00001000 | Size of TCP control-block hashtable
keepcnt                      | 0x00000008 | Number of times to repeat keepalive
msl                          | 0x00003a98 | Maximum segment lifetime
max_persist_timeout          | 0x00000000 | Maximum persistence timeout for ZWP
always_keepalive             | 0x00000000 | Assume SO_KEEPALIVE on all TCP connections
timer_fastmode_idlemax       | 0x0000000a | Maximum idle generations in fast mode
broken_peer_syn_rexmit_thres | 0x0000000a | # rexmitted SYNs to disable RFC1323 on local connections
path_mtu_discovery           | 0x00000001 | Enable Path MTU Discovery
pmtud_blackhole_detection    | 0x00000001 | Path MTU Discovery Black Hole Detection
pmtud_blackhole_mss          | 0x000004b0 | Path MTU Discovery Black Hole Detection lowered MSS
cc_debug                     | 0x00000000 | Enable debug data collection
use_newreno                  | 0x00000000 | Use TCP NewReno algorithm by default
cubic_tcp_friendliness       | 0x00000000 | Enable TCP friendliness
cubic_fast_convergence       | 0x00000000 | Enable fast convergence
cubic_use_minrtt             | 0x00000000 | Use a min of 5 sec rtt
lro                          | 0x00000000 | Used to coalesce TCP packets
lro_startcnt                 | 0x00000004 | Segments for starting LRO computed as power of 2
lrodbg                       | 0x00000000 | Used to debug SW LRO
lro_sz                       | 0x00000008 | Maximum coalescing size
lro_time                     | 0x0000000a | Maximum coalescing time
bg_target_qdelay             | 0x00000064 | Target queueing delay
bg_allowed_increase          | 0x00000008 | Modifier for calculation of max allowed congestion window
bg_tether_shift              | 0x00000001 | Tether shift for max allowed congestion window
bg_ss_fltsz                  | 0x00000002 | Initial congestion window for background transport
 

MPTCP configuration

Table 16-20: The sysctl(8) MIBs of the net.inet.mptcp namespace
net.inet.mptcp MIB | Default | Purpose
enable             | 1       | Global on/off switch
mptcp_cap_retr     | 2       | Number of MP Capable SYN retries
dss_csum           | 0       | Enable DSS checksum
fail               | 1       | Failover threshold
keepalive          | 840     | Keepalive (sec)
rtthist_thresh     | 600     | RTT threshold
userto             | 1       | Disable RTO for subflow selection
probeto            | 1000    | Disable probing by setting to 0
dbg_area           | 31      | MPTCP debug area
dbg_level          | 1       | MPTCP debug level
allow_aggregate    | 0       | Allow the Multipath aggregation mode
alternate_port     | 0       | Darwin 18: Set alternate port for MPTCP connections
rto                | 3       | MPTCP retransmission timeout
rto_thresh         | 1500    | RTO threshold
tw                 | 60      | MPTCP timewait period

UDP configuration

UDP is a simple and stateless protocol, and therefore offers very few configuration options.

Table 16-21: UDP configuration parameters
net.inet.udp MIB | Default | Purpose
checksum         | 1       | Enable UDP checksumming
maxdgram         | 9216    | Maximum outgoing UDP datagram size
recvspace        | 196724  | Maximum incoming UDP datagram size
log_in_vain      | 0       | Log all incoming packets
blackhole        | 0       | Do not send port unreachables for refused connects
randomize_ports  | 1       | Randomize port numbers

ICMP configuration

ICMP behavior is similarly governed by various sysctl MIBs, which are mostly set by default to ignore known problematic protocol vulnerabilities, such as spoofed ICMP redirection and broadcast echo requests ("ping storms").

Table 16-22: ICMP(v4) configuration parameters
net.inet.icmp MIB | Default    | Purpose
maskrepl          | 0x00000000 | Reply to Address Mask requests
icmplim           | 0x000000fa | ICMP limit
timestamp         | 0x00000000 | Respond to ICMP timestamp requests
drop_redirect     | 0x00000001 | Ignore ICMP redirect messages
log_redirect      | 0x00000000 | Log ICMP redirect messages (to dmesg(1))
bmcastecho        | 0x00000001 | Broadcast/Multicast ICMP echo requests
 

Networking Statistics

It's important for any system to keep detailed usage statistics, and to provide them to the administrator as clearly as possible. Darwin systems contain quite a few statistics mechanisms of varying detail level and purpose. The chief tool is, of course, the aptly named netstat(1), which presents statistics about interfaces (-i/-I), routes (-r), multicast groups (-g), memory consumption (-m), general protocol statistics (-s), and - naturally - active sockets (with no other arguments or with -a). As useful as it is, though, netstat(1) is little more than a parser of the raw statistics, which it obtains from sysctl MIBs.

sysctl MIBs

Along with configuration settings, the network stack exports a plethora of statistics through sysctl MIBs. Per-family statistics are exported through net.family.socktype.stats, with the family being local, inet, link and systm, and the socktype being (respectively) stream/dgram, tcp/udp/igmp/icmp/ipset, generic/ether/bridge, and kevt/kctl. These end up in the much more readable form of the netstat -s output.

The live connection statistics are maintained in net.family.socktype.pcblist*, with the family and socktype being almost the same as with the stats: there are no link pcbs (as there are no connections at the link layer), and the inet socktypes are only tcp/udp/raw/mptcp. The three pcblist* variants are pcblist, pcblist64 and pcblist_n, offering differently concatenated structures for the statistics, all defined in various locations throughout the kernel headers. Listing 16-23 shows a breakdown of a TCP PCB structure:

Listing 16-23: Parsing a kernel provided TCP PCB
// bsd/netinet/in_pcb.h
struct  xinpgen { ... }    // xig_ fields

// bsd/sys/socketvar.h
struct xsocket_n { ... }   // xso_ and so_ fields

// bsd/netinet/in_pcb.h
#define   XSO_INPCB       0x010  // structure is an xinpcb_n 
struct  xinpcb_n { ... } // xi_ and inp_ fields 

#
# Raw data from sysctl -A -X: 
#
net.inet.tcp.pcblist_n: Format:S,xtcpcb_n Length:22544 Dump:

  xig_len  | xig_count | xig_gen         |  xig_sogen       |  xgn_len | xgen_kind
0x18000000   26000000  |a17d1200 00000000| da01ce00 00000000| 68000000 | 10000000  

 		         443     59305
       xi_inpp      |inp_fport|inp_lport| inp_ppcb (permuted)|    inp_gencnt        
  d7feb971 bfe44d8e |   01bb  |  e7a9   |  8f01ba71 bfe44d8e | 9b7d1200 00000000  

                 IN_IPV4   inp_ip_p(rotocol)                   104.244.42.2
inp_flags inp_flow   ↓ ttl ↓                                  inp46_foreign
40088000  00000000  01|40|00|00   00000000  00000000  00000000  68f42a02 

			192.168.0.108
	                  inp46_local
00000000 00000000 00000000 c0a8006c.... 

#
# netstat  output
#
tcp4       0      0  192.168.0.108.59305    104.244.42.2.443      ESTABLISHED

A good example of parsing the PCBs can be found in the open sources of netstat(1) (in the network_cmds project).

 

com.apple.network.statistics

Using the pcblist* MIBs has major drawbacks. Not only is collecting the statistics a lengthy operation, but the statistics themselves are just a snapshot - and with the dynamic nature of network connections, likely to be stale within minutes, if not far less. Another problem is that it is very difficult to associate the connections with their respective owners.

The PF_SYSTEM/SYSPROTO_CONTROL socket of com.apple.network.statistics provides a far better mechanism - one which not only provides a constant stream of network statistics through the control socket, but also provides a way to match connections to the originating process.

Once a control socket is set up, command messages (in the 1xxx range) may be sent to the kernel, which replies with messages of its own (in the 10xxx range). One command may generate quite a few replies - as is common when adding sources. Commands are defined (along with the rest of the interface) in bsd/net/ntstat.h. The header is marked PRIVATE, so it does not make it into user mode.

Listing 16-24: The NSTAT_MSG_TYPEs, from bsd/net/ntstat.h
#pragma mark -- Network Statistics User Client --
#define NET_STAT_CONTROL_NAME   "com.apple.network.statistics"
enum {
        // generic response messages
        NSTAT_MSG_TYPE_SUCCESS = 0
        ,NSTAT_MSG_TYPE_ERROR = 1

        ,NSTAT_MSG_TYPE_ADD_SRC = 1001
        ,NSTAT_MSG_TYPE_ADD_ALL_SRCS = 1002
        ,NSTAT_MSG_TYPE_REM_SRC = 1003
        ,NSTAT_MSG_TYPE_QUERY_SRC = 1004
        ,NSTAT_MSG_TYPE_GET_SRC_DESC = 1005
        ,NSTAT_MSG_TYPE_SET_FILTER = 1006          // 2422
        ,NSTAT_MSG_TYPE_GET_UPDATE = 1007          // 3248
        ,NSTAT_MSG_TYPE_SUBSCRIBE_SYSINFO = 1008   // 3248

        // Responses/Notifications
        ,NSTAT_MSG_TYPE_SRC_ADDED = 10001
        ,NSTAT_MSG_TYPE_SRC_REMOVED = 10002
        ,NSTAT_MSG_TYPE_SRC_DESC = 10003
        ,NSTAT_MSG_TYPE_SRC_COUNTS = 10004
        ,NSTAT_MSG_TYPE_SYSINFO_COUNTS = 10005
        ,NSTAT_MSG_TYPE_SRC_UPDATE = 10006       };

Each source addition creates an associated descriptor, which may be queried by using the ...GET_SRC_DESC message. Descriptors are nstat_[tcp/udp/route]_descriptor structures. Apple continuously modifies these data structures, breaking the direct API between XNU versions and making it really difficult to work directly through the control socket.

The nettop(1) utility (part of the closed source NetworkStatistics package) provides an example of com.apple.network.statistics' capabilities. The utility is a "live" (but crude) netstat(1), though it doesn't use the low level sockets directly, instead opting for the higher level NStatManager* wrapper APIs of the private NetworkStatistics framework. This is not without benefits, since the API is a block driven, CF* object aware interface, which serves as an adapter layer and thus decouples clients from the low level socket structures.

An NStatManager is instantiated with a call to NStatManagerCreate, with a kCFAllocator, options and a callback block. Sources can be added with any of the NStatManagerAddAll[TCP/UDP][/With[Filter/Options]] calls, or (for route sources) NStatManagerAddAllRoutes[WithFilter]. Adding sources triggers the callback block, which gets the NStatSource as an argument. The source objects can be manipulated through blocks with NStatSourceSet[Counts/Events/Description/Removed]Block, which are called with their respective objects as arguments.

 

The description object is a particularly detailed CFDictionary, providing the properties (resolving enums to human readable form where necessary) from the nstat_[tcp/udp/route]_descriptor, combining them with nstat_counts in a convenient CFDictionary form, as shown in Table 16-25:

Table 16-25: The keys of the NStat descriptor object
Property                          | Descriptor field
epid                              | epid
processID                         | pid
uniqueProcessID                   | eupid
processName                       | pname
euuid                             | euuid
startAbsoluteTime                 | start_timestamp
durationAbsoluteTime              | timestamp - start_timestamp
interface                         | ifindex
[local/remote]Address (CFData)    | [local/remote].[v4/v6]
provider                          | N/A (descriptor type)
[rx/tx]Bytes                      | nstat_[rx/tx]bytes
[rx/tx][/Cellular/WiFi/Wired]Bytes| nstat_[cell/wifi/wired]_[rx/tx]bytes
trafficClass                      | traffic_class
uuid                              | uuid
receiveBuffer[Size/Used]          | rcvbuf[size/used]
TCP sources
rx[Duplicate/OutOfOrder]Bytes     | nstat_rx[duplicate/outoforder]bytes
congestionAlgorithm               | cc_algo
rtt[Average/Minimum/Variation]    | nstat_[min/avg/var]_rtt
connect[Attempts/Successes]       | nstat_connect[attempt/successes]
TCPState                          | state
txRetransmittedBytes              | nstat_txretransmit
txUnacked                         | txunacked
TCP[Congestion]Window             | tx[c]window
trafficManagementFlags            | traffic_mgt_flags

The lsock(j) companion tool matches and exceeds the functionality of nettop(1) - and is available in open source. Note that, because it uses the APIs directly, it might very well be outdated by the time you try it: the example had to be updated multiple times in the past to catch up with the changing structures, and there is no guarantee Darwin 18 won't break it again.

Another hurdle is an entitlement - com.apple.private.network.statistics - which may be required for using the com.apple.network.statistics control socket. "May", because at the moment this requirement can be toggled (by the root user) using the net.statistics_privcheck sysctl MIB. This value is already set to '1' on *OS variants, but still '0' (for the moment) on MacOS. In *OS this isn't much of an issue, since running arbitrary code implies a jailbreak, root access and arbitrary entitlements. Should the MacOS sysctl be set to '1' and possibly locked, however, administrators will need to disable SIP or only use Apple's "approved" (but painful) nettop(1).

The following experiment shows a quick and very dirty program to mimic nettop(1)'s usage of the NetworkStatistics.framework.

 
Experiment: Exploring the private NetworkStatistics.framework APIs

The main client for NetworkStatistics.framework is Darwin's nettop(1), which displays a live netstat(1) like output. Unfortunately, the tool is closed source, crude, hard to use due to its curses interface, and unavailable for *OS variants. Fortunately, it's fairly straightforward to disassemble, and to build a functional (albeit more limited) clone, shown in the following listing.

Listing 16-26: A simple nettop(1) clone
#include <dispatch/dispatch.h>
#include <fcntl.h>     // for open(2) flags
#include <CoreFoundation/CoreFoundation.h>

// gcc-arm64 netbottom.c -o /tmp/netbottom 
//           -framework CoreFoundation -framework NetworkStatistics

// The missing NetworkStatistics.h...
typedef void    *NStatManagerRef;
typedef void    *NStatSourceRef;

extern CFStringRef kNStatSrcKeyProvider;

NStatManagerRef NStatManagerCreate (const struct __CFAllocator *,
                             dispatch_queue_t,
                             void (^)(NStatManagerRef));

int NStatManagerSetInterfaceTraceFD(NStatManagerRef, int fd);
int NStatManagerSetFlags(NStatManagerRef, int Flags);
int NStatManagerAddAllTCPWithFilter(NStatManagerRef, int, int);
int NStatManagerAddAllUDPWithFilter(NStatManagerRef, int, int);
void *NStatSourceQueryDescription(NStatSourceRef);
CFStringRef NStatSourceCopyProperty (NStatSourceRef, CFStringRef);
void NStatSourceSetDescriptionBlock (NStatSourceRef,  void (^)(void *));

void (^description_callback_block) (void *) = ^(CFDictionaryRef Desc) {
   // Simple example - just dump the Description dictionary to stderr
  CFShow(Desc);
};

void (^callback_block) (void *)  = ^(NStatSourceRef arg){

  // Arg is NWS[TCP/UDP]Source. We can tell which by property:
  const CFStringRef prop  = NStatSourceCopyProperty (arg, kNStatSrcKeyProvider);

  NStatSourceSetDescriptionBlock (arg, description_callback_block);
  void *desc = NStatSourceQueryDescription(arg); // Continued in callback
};

int main (int argc, char **argv) {

   NStatManagerRef      nm = NStatManagerCreate (kCFAllocatorDefault,
                                  &_dispatch_main_q,
                                  callback_block);

   int rc = NStatManagerSetFlags(nm, 0);
  
   // A trace file will show the raw nstat messages
   int fd = open ("/tmp/netbottom.trace", O_RDWR | O_CREAT | O_TRUNC, 0644);
   rc = NStatManagerSetInterfaceTraceFD(nm, fd);
   
   rc = NStatManagerAddAllTCPWithFilter (nm, 0 , 0);
   rc = NStatManagerAddAllUDPWithFilter (nm, 0 , 0);

   dispatch_main();
}

As barebones as this listing is, it will nonetheless compile cleanly for both MacOS and the *OS variants. Note that in the *OS case Apple has removed the private framework ".tbd" files, which are required for linkage. Those are easy enough to recreate using jtool2's --tbd option. You can find the listing online on the book's companion website[3].

 

/var/networkd/netusage.sqlite

All Darwin flavors offer aggregate statistics at the process level, summing up bandwidth usage for every process on the system by its binary name. The database used is /var/networkd/netusage.sqlite, though the role of networkd is actually filled by /usr/libexec/symptomsd.

As the database name implies, it is a SQLite3 file, which makes it very easy to inspect - assuming root or _networkd (uid 24) credentials. Binaries are given unique identifiers in the ZPROCESS table, which persist across multiple executions. The unique id (Z_PK) is then used to track the binary across other tables, the most useful of which is ZLIVEUSAGE, which keeps the aggregate statistics. Using sqlite3 on the database would show something similar to Output 16-27:

Output 16-27: The netusage.sqlite database
root@Chimera(~)# sqlite3 /var/networkd/netusage.sqlite
sqlite> .headers yes
sqlite> select * from ZPROCESS;
Z_PK|Z_ENT|Z_OPT|ZFIRSTTIMESTAMP |ZTIMESTAMP      |ZBUNDLENAME             |ZPROCNAME
1   |    9| 1399|502587590.334238|565020796.563935|com.apple.configd       |com.apple.configd
2   |    9|  218|502587591.149515|563217197.06509 |com.apple.captiveagent  |com.apple.captiveagent
3   |    9|    7|502587591.625211|560016270.022554|com.apple.SetupAssistant|com.apple.SetupAssistant
..
#
# Retrieve the unique identifier of a particular process name
#
sqlite> select Z_PK from ZPROCESS where ZPROCNAME='com.apple.Safari';
Z_PK
39
#
# Retrieve the statistics for said process by joining tables
#
sqlite> select ZLIVEUSAGE.*  from ZLIVEUSAGE JOIN ZPROCESS 
           ON ZPROCESS.Z_PK = ZLIVEUSAGE.Z_PK WHERE ZPROCESS.ZPROCNAME ='com.apple.Safari';

Z_PK|Z_ENT|Z_OPT|ZKIND|ZMETADATA|ZTAG|ZHASPROCESS|Z8_HASPROCESS|ZALLFLOWS|ZBILLCYCLEEND|ZJUMBOFLOWS|
39  |6    | 1722 |   0|        0|   0|         39|            9|1737372.0|             |      240.0|

ZTIMESTAMP      |ZWIFIIN       |ZWIFIOUT    |ZWIREDIN|ZWIREDOUT|ZWWANIN    |ZWWANOUT  |ZXIN|ZXOUT
502587891.997509|261417369768.0|6955273130.0|     0.0|      0.0|402017694.0|17913084.0| 0.0|  0.0
...
 

Firewalling

Network connectivity extends the system's reach to the four corners of the Internet - but also vice versa. A firewall has thus become an integral part of any system's defense, and MacOS has not one, but several. This section covers the mechanisms accessible from user mode - the Application Layer Firewall, ipfw (briefly, as it is deprecated), and pf. Kernel-accessible mechanisms (socket, IP and interface filters) are left for Volume II.

Figure 16-28: The MacOS Firewall settings pane

MacOS: The Application Layer Firewall

The Application Layer Firewall, commonly referred to by the fuzzy nickname ALF, is a MacOS proprietary mechanism introduced as far back as MacOS 10.5.1 to provide application-aware firewall capabilities. Apple provides a brief article[4] to explain its usage, which is quite straightforward. A nifty feature not found elsewhere is integration with Darwin's built-in code signing mechanism, which enables trusted (built-in and/or signed downloaded) software to be identified and allowed to receive incoming connections. Another option is to activate "stealth mode", which blocks replies to ICMP messages, making the system unresponsive to scanning.

The Application Layer Firewall comprises a kernel extension (ALF.kext, identified by its CFBundleIdentifier of com.apple.nke.applicationfirewall), and several binaries, all in /usr/libexec/ApplicationFirewall. The extension is loaded by default across versions of MacOS, even if the Firewall is turned off. Of the user mode binaries, the main one is socketfilterfw(8), which manages the kext. This daemon loads its defaults from the com.apple.alf.plist in the same directory, and claims the com.apple.alf Mach service. When it needs UI interaction, it calls on CFUserNotificationCreate (q.v. Chapter 5) to create a pop-up dialog with the resources from ApplicationFirewall.bundle (in CoreServices). The daemon may also start Firewall(8) for user authorizations through com.apple.alf.useragent, with a protocol consisting of a single MIG message (#9999)*.

When the firewall settings are modified through System Preferences.app, the preference pane posts a CFNotificationCenter notification (q.v. Chapter 5) with a name of "com.apple.alf". The notification's object field designates it as "firewalloptions", "app[added/removed]", "[app/service]statechanged", etc. The userinfo field contains the firewall set request, as an XML property list. Listing 16-29 shows two messages (in SimPLISTic format):

Listing 16-29: Tokens generated when modifying ALF behavior from the GUI, or using socketfilterfw(8)
# Generated when the GUI turns off the firewall
# Payload is plist with 'globalstate' integer, values 0, 1 or 2
 object: firewalloptions
 userinfo: [DATA, 234] .....<key>globalstate</key>\n\t<integer>0</integer>...
 name: com.apple.alf
 token: 1000001
 method: post_token
 version:1
# Enabling stealth mode
# Stealth mode: Payload is plist with 'allowdownloadsignedenabled', 
# 'allowsignedenabled', and 'stealthenabled' integer (boolean) keys
 object: stealthmodechanged
 userinfo: [DATA, 351] ..<key>allowdownloadsignedenabled</key>....
   ...<key>stealthenabled</key>\n\t<integer>1</integer>\n</dict>\n</plist>"
 name: com.apple.alf
 token: 1000001
 method: post_token
 version: 1
* - A lesser daemon, appfwloggerd, was previously used to listen on an event socket for messages from ipfw.
 

The /usr/libexec/ApplicationFirewall/socketfilterfw accepts the notifications, and proceeds to act on the contents of the userinfo data, translating the XML property list into the socket filtering rules it needs to apply. The main rules are in the payloads of appadded and/or appstatechanged, under the alias key, which is (again) a base64 encoded plist (cfdata), whose contents are a binary alias record with details about the application for which a rule is added:

Listing 16-30: The contents of an alias entry, twice base-64 decoded
00000000  00 00 00 00 01 2e 00 02  00 01 0c 4d 61 63 69 6e  |...........Macin|
00000010  74 6f 73 68 20 48 44 00  00 00 00 00 00 00 00 00  |tosh HD.........|
00000020  00 00 00 00 00 00 00 00  00 00 42 44 00 01 ff ff  |..........BD....|
00000030  ff ff 0d 41 70 70 20 53  74 6f 72 65 2e 61 70 70  |...App Store.app|
00000040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000070  00 00 ff ff ff ff 00 00  00 00 00 00 00 00 00 00  |................|
00000080  00 00 ff ff ff ff 00 00  0a 20 63 75 00 00 00 00  |......... cu....|
00000090  00 00 00 00 00 00 00 00  00 0c 41 70 70 6c 69 63  |..........Applic|
000000a0  61 74 69 6f 6e 73 00 02  00 1d 2f 3a 41 70 70 6c  |ations..../:Appl|
000000b0  69 63 61 74 69 6f 6e 73  3a 41 70 70 20 53 74 6f  |ications:App Sto|
000000c0  72 65 2e 61 70 70 2f 00  00 0e 00 1c 00 0d 00 41  |re.app/........A|
000000d0  00 70 00 70 00 20 00 53  00 74 00 6f 00 72 00 65  |.p.p. .S.t.o.r.e|
000000e0  00 2e 00 61 00 70 00 70  00 0f 00 1a 00 0c 00 4d  |...a.p.p.......M|
000000f0  00 61 00 63 00 69 00 6e  00 74 00 6f 00 73 00 68  |.a.c.i.n.t.o.s.h|
00000100  00 20 00 48 00 44 00 12  00 1a 41 70 70 6c 69 63  |. .H.D....Applic|
00000110  61 74 69 6f 6e 73 2f 41  70 70 20 53 74 6f 72 65  |ations/App Store|
00000120  2e 61 70 70 00 13 00 01  2f 00 ff ff 00 00        |.app..../.....  |

Rules are enforced by ALF.kext through applying kernel socket filters (sflt_* KPIs, as explained in Volume II). The socketfilterfw daemon communicates with the kernel extension over a com.apple.nke.sockwall PF_SYSTEM/SYSPROTO_CONTROL socket. The protocol is a simple TLV (type-length-value), with the types shown in Table 16-31.

Table 16-31: The com.apple.nke.sockwall protocol command types
#  | Command             | Purpose
0  | result              | Inserts a new rule for a process
1  | proc_rules          | Inserts a new rule for a process
3  | ask                 | Kext requests a user prompt
5  | dumpinfo            | Useful to dump the kext process list into dmesg
6  | verify              | Kext requests process rule verification
7  | setpath             | Add a process path
8  | updaterules         | Called when rulebase changes
9  | releasepcachedpath  | Kext requests invalidation of a PID (by path) from cache
10 | unloadkext          | Unload the kernel extension, if possible
11 | addapptolist        | Add an application
12 | changelogmode       | Change kext logging mode
13 | changetrustmode     | Change trust mode
14 | askmsgrelease       | Dismiss pending ask
15 | changelogopt        | Change logging options
16 | changeapptrustmode  | Change app trust mode

Some of the message types are no longer implemented in Darwin 18. socketfilterfw also contains a few references to ipfw sysctls, which are likewise no longer implemented (as explained next). The daemon may be configured to log verbosely by changing its LaunchDaemon property list's Program string to a ProgramArguments array and adding -d and/or -l. The daemon normally relays PF_SYSTEM/SYSPROTO_EVENT messages it receives from the kernel extension for the APPLE:NETWORK:LOG provider, and another unnamed provider at 1000:5:11. The ALF kext also registers the net.alf MIB namespace, with a loglevel bitmask, permission check, defaultaction and (read-only) mqcount.

 

ipfw (Deprecated)

Darwin has used BSD's ipfw mechanism for many years - until it was removed in Darwin 16. The code implementing the mechanism - in bsd/netinet/ip_fw2[_compat].[ch] and bsd/netinet6/ip6_fw.[ch] - is very much intact, but is contingent on IPFW2 and other #defines which are no longer enabled. The user mode controller, ipfw(8), has been removed. Some discussion of this facility can be found in the respective BSD manual pages, as well as the first edition of this work (at which time it was still deemed relevant in Darwin).

pf

The pf facility, another relic of BSD but still in wide use, provides an alternative network layer firewalling mechanism. The facility appears in user mode as two character devices - /dev/pf and /dev/pfm. The /dev/pf character device can be used to create and apply firewalling rulesets, using ioctl(2) codes. This functionality is not unlike Linux's netfilter (a.k.a iptables). The /dev/pfm device, used only in Darwin, serves a similar function.

The pf facility makes use of a configuration file, /etc/pf.conf, which is well documented in the pf.conf(5) manual page. An additional file, /etc/pf.os, is used as an operating system fingerprint database. There is also an /etc/pf.anchors directory, which is used to load the com.apple anchors for AirDrop and ALF (from a load anchor statement in /etc/pf.conf).
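The stock macOS /etc/pf.conf is along these lines (paraphrased; the exact contents vary slightly by release), showing how the com.apple anchors are hooked in:

```
# /etc/pf.conf (macOS) - the com.apple anchor points
scrub-anchor "com.apple/*"
nat-anchor "com.apple/*"
rdr-anchor "com.apple/*"
dummynet-anchor "com.apple/*"
anchor "com.apple/*"
load anchor "com.apple" from "/etc/pf.anchors/com.apple"
```

Rules placed by Apple's daemons (and ALF) thus live under the "com.apple" anchor hierarchy, rather than in the main ruleset.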

System administrators wishing to configure pf often use the pfctl(8) command line tool. The tool is well documented in its man page, which is left for the interested reader to peruse. Comprehensive documentation for the set of ioctl(2) codes can be found in pf(4), but this manual page has somehow been removed from Darwin releases. The OpenBSD man page[5] thus serves in its place, although there are some differences in the set of codes. Table 16-32 (next page) shows a summary of the ioctl(2) codes defined in Darwin, though some are not actively supported.

PacketFilter.framework

Using the ioctl(2) codes directly on /dev/pf is not only cumbersome, but requires root privileges. An alternative is provided by the private PacketFilter.framework, which offers a richer API of exported PF* functions, and PFUser/PFManager high level calls. First, a call to PFUserCreate starts a session. Then, PFUserBeginRules declares a rule set, in which rules can be manipulated using PFUser[Add/Insert/Delete]Rule. The set can be committed using a call to PFUserCommitRules. Similar APIs are PFManager[Get/Copy/Delete]Rules.

Rule transactions are submitted to the PFManager object, which uses PFXPC abstractions to communicate with pfd(8) through the com.apple.pfd service. The daemon (running as root) translates the XPC messages into the corresponding ioctl(2) codes, and returns any replies in XPC formatted dictionaries. The protocol can be reversed easily by using XPoCe on a running instance of pfd(8).

 
Table 16-32: The set of /dev/pf ioctl(2) codes
DIOC ioctl(2) code | Argument | Purpose
DIOC[START/STOP] | _IO ('D', 1/2) | Start/stop the packet filter facility
DIOCADDRULE | _IOWR('D', 4, struct pfioc_rule) | Add a pfioc_rule to (inactive) ruleset
DIOCGETSTARTERS | _IOWR('D', 5, struct pfioc_tokens) | Get starter tokens
DIOCGETRULE[S] | _IOWR('D', 6/7, struct pfioc_rule) | Obtain a ticket + num rules, or specific rule
DIOCSTARTREF | _IOR ('D', 8, u_int64_t) | Increment ref count, get token
DIOCSTOPREF | _IOWR('D', 9, struct pfioc_remove_token) | Decrement ref count with provided token
DIOCCLRSTATES | _IOWR('D', 18, struct pfioc_state_kill) | Clear packet filter state table
DIOCGETSTATE | _IOWR('D', 19, struct pfioc_state) | Retrieve specific state entry
DIOCSETSTATUSIF | _IOWR('D', 20, struct pfioc_if) | Toggle statistics on interface
DIOCGETSTATUS | _IOWR('D', 21, struct pf_status) | Get pf_status counters and data
DIOCCLRSTATUS | _IO ('D', 22) | Clear all pf_status counters
DIOCNATLOOK | _IOWR('D', 23, struct pfioc_natlook) | Look up a NAT state table entry
DIOCSETDEBUG | _IOWR('D', 24, u_int32_t) | Toggle debug
DIOCGETSTATES | _IOWR('D', 25, struct pfioc_states) | Retrieve all state entries
DIOC[CHANGE/INSERT/DELETE]RULE | _IOWR('D', 26/27/28, struct pfioc_rule) | Various rule manipulation actions
DIOC[SET/GET]TIMEOUT | _IOWR('D', 29/30, struct pfioc_tm) | Set/get state timeouts
DIOCADDSTATE | _IOWR('D', 37, struct pfioc_state) | Add a state entry
DIOCCLRRULECTRS | _IO ('D', 38) | Clear rule counters
DIOC[GET/SET]LIMIT | _IOWR('D', 39/40, struct pfioc_limit) | Set the hard limits on the memory pools
DIOCKILLSTATES | _IOWR('D', 41, struct pfioc_state_kill) | Remove matching entries from the state table
DIOC[START/STOP]ALTQ | _IO ('D', 42/43) | Requires ALTQ support, which Darwin does not provide
DIOC[ADD/GET]ALTQ[/S] | _IOWR('D', 45/47, struct pfioc_altq) | Requires ALTQ support, which Darwin does not provide
DIOC[GET/CHANGE]ALTQ | _IOWR('D', 48/49, struct pfioc_altq) | Requires ALTQ support, which Darwin does not provide
DIOCGETQSTATS | _IOWR('D', 50, struct pfioc_qstats) | Get queue statistics
DIOC[BEGIN/GET]ADDRS | _IOWR('D', 51/53, struct pfioc_pooladdr) |
DIOC[ADD/GET/CHANGE]ADDR | _IOWR('D', 52/54/55, struct pfioc_pooladdr) |
DIOCGETRULESETS | _IOWR('D', 58, struct pfioc_ruleset) | Get number of rulesets (anchors)
DIOCGETRULESET | _IOWR('D', 59, struct pfioc_ruleset) | Get anchor by number
DIOCR[CLR/ADD/DEL]TABLES | _IOWR('D', 60/61/62, struct pfioc_table) | Clear/add/delete tables
DIOCRGETTABLES | _IOWR('D', 63, struct pfioc_table) | Get table list
DIOCR[GET/CLR/RST]TSTATS | _IOWR('D', 64/65/73, struct pfioc_table) | Get/clear/reset table statistics
DIOCR[CLR/ADD/DEL]ADDRS | _IOWR('D', 66-68, struct pfioc_table) | Clear/add/delete addresses in table
DIOCR[SET/GET]ADDRS | _IOWR('D', 69/70, struct pfioc_table) | Set/get addresses in table
DIOCR[GET/CLR]ASTATS | _IOWR('D', 71/72, struct pfioc_table) | Get/clear address statistics
DIOCRSETTFLAGS | _IOWR('D', 74, struct pfioc_table) | Change const/persist flags of table
DIOCRINADEFINE | _IOWR('D', 77, struct pfioc_table) | Define a table in the inactive set
DIOCOSFPFLUSH | _IO('D', 78) | Flush the passive OS fingerprint table
DIOCOSFP[ADD/GET] | _IOWR('D', 79/80, struct pf_osfp_ioctl) | Add/retrieve passive OS fingerprint entry
DIOCX[BEGIN/COMMIT/ROLLBACK] | _IOWR('D', 81/82/83, struct pfioc_trans) | Clear/commit/undo inactive rulesets
DIOCGETSRCNODES | _IOWR('D', 84, struct pfioc_src_nodes) | Get source nodes
DIOCCLRSRCNODES | _IO('D', 85) | Clear list of source nodes
DIOCSETHOSTID | _IOWR('D', 86, u_int32_t) | Set host ID (for pfsync(4))
DIOCIGETIFACES | _IOWR('D', 87, struct pfioc_iface) | Get list of interfaces
DIOC[SET/CLR]IFFLAG | _IOWR('D', 89/90, struct pfioc_iface) | Set/clear user flags
DIOCKILLSRCNODES | _IOWR('D', 91, struct pfioc_src_node_kill) | Explicitly remove source tracking nodes
DIOCGIFSPEED | _IOWR('D', 92, struct pf_ifspeed) | Get interface speed
 

Packet Capture

There are times when user mode needs packet capture capabilities on a given interface. The most common example is a sniffer (or "network analyzer") such as tcpdump(1) and its ilk. Being user mode tools, they must make use of some kernel facility to enable features such as promiscuous mode (in which the interface accepts all frames, not just broadcast/multicast and its own unicast), and to obtain packets normally destined for other applications. Apple's proprietary PF_NDRV is inadequate for general packet capture (as it can only intercept unregistered ethertype protocols), and BSD's pf firewalls packets but does not actually relay filtered packets to user mode, so another mechanism is required.

BPF

Darwin follows the BSD model in implementing the Berkeley Packet Filter, commonly referred to as BPF. BPF is the brainchild of McCanne and Jacobson (of PPP compression and traceroute(1) fame), who presented the mechanism in a USENIX 1993 paper[6]. BPF was quite revolutionary, as it provided a full language, with which dynamic filter programs could be created in user space, and loaded directly into the kernel subsystem. It has since become a standard adopted by quite a few operating systems and the ubiquitous libpcap which powers tcpdump(1), Ethereal and many other tools (/usr/libexec/airportd is a recurring client). The BPF mechanism has also been ported to non-BSD based systems (notably, Linux and Android), and the code presented in this section is actually fully portable to those operating systems. The language of BPF has even been extended well past packets - it supports Linux's SECCOMP-BPF model for system call filtering, which is an instrumental part of Android's security.

BPF appears in user mode as a number of character devices, /dev/bpf##, with numbers usually in the 0 to 5 range. Any of these devices (unless already in use) may be open(2)ed, attached to an underlying interface using a BIOCSETIF ioctl(2), configured with a few other ioctl(2) codes, and then loaded with a BPF "program" through a BIOCSETF ioctl(2). Once the filter program is installed, the device's file descriptor lends itself to read(2) operations, which will provide any packets matching the filter loaded onto it. This also marks the corresponding device node as in use, which means the number of simultaneous captures in the system is hard-limited by the number of configured devices. The general flow of a BPF client is shown in Listing 16-33, next page. The full list of ioctl(2)s can be found in <net/bpf.h>, along with a staggering list of DLT_ constants for Data Link types, though the only ones of actual use are DLT_EN10MB (used for all modern Ethernet, not just 10MB), and DLT_USB_DARWIN, which is an Apple extension for the XHC* interfaces provided by IOUSBHostFamily.kext's AppleUSBHostPacketFilter.kext PlugIn.

The idea behind BPF is as simple as it is elegant: Consider an automaton with a single register, which may be directed to load a value from any offset in an input frame (i.e. including the layer II header), and perform a logical test on its value. The automaton would branch on the results of that test, and the process would continue until a decision could be made as to whether to accept or reject the packet in question. The accepted packets appear on the input device, and the rejected packets are merely rejected by the filter - that is, they do not get captured, but they are not firewalled (as they would be by the PF facility, which was described earlier).

BPF Programs

Listing 16-33 can be used for just about any generic sniffer/packet analyzer, but notice it's missing the actual BPF filter, which needs to be installed for the BPF mechanism to actually sift out frames. The BPF filter program needs to be specified as an array of BPF automaton struct bpf_insn instructions. The instruction structure consists (not in this order) of a 16-bit code, a uint32 constant k (used as an argument to the code), and two unsigned 8-bit offsets, jt and jf, which represent a jump offset to branch to in case the code is a logical BPF_J* test. Most BPF filters usually consist of a mix of BPF_LD statements (to read data from various offsets in an incoming frame) and BPF_JMP, to perform logical tests and branch accordingly. Note, however, that there are quite a few other opcodes - including destructive ones (e.g. BPF_ST[X], which alter scratch memory, allowing the filter to maintain state).

 
Listing 16-33: The general framework of a BPF filter client
int main(int argc, char *argv[])
{
    int fd = 0;
    char *iface = NULL;
    int port, rc = 0, enable = 1;
    int dlt = 0, blen = 0;

    if (argc < 2 || argc > 3) { return 1; }
    iface = strdup(argc < 3 ? "en0" : argv[2]);
    port  = atoi(argv[1]);

    fd = open("/dev/bpf1", O_RDWR);
    if (fd < 0) { /* device in use - could try another */ }

    struct ifreq ifr;

    /* Associate the bpf device with an interface */
    (void)strlcpy(ifr.ifr_name, iface, sizeof(ifr.ifr_name));

    if (ioctl(fd, BIOCSETIF, &ifr) < 0) return 3;

    /* Monitor outgoing packets from interface as well */
    if (ioctl(fd, BIOCSSEESENT, &enable) < 0) return 4;

    /* Return immediately when a packet received */
    if (ioctl(fd, BIOCIMMEDIATE, &enable) < 0) return 5;

    /* Ensure we are dumping the datalink we expect */
    if (ioctl(fd, BIOCGDLT, &dlt) < 0) return 6;
    if (dlt != DLT_EN10MB) return 7;

    /* Prepare program -- see next listing */
    installFilter(fd, IPPROTO_TCP, port);

    /* Get receive buffer length */
    if (ioctl(fd, BIOCGBLEN, &blen) < 0) return 8;
    char *buf = alloca(blen);

    while ((rc = read(fd, buf, blen)) > 0) {

        fprintf(stderr, "Got frame (%d bytes)!\n", rc);
        /* Do something with frame.. e.g. overlay ip_hdr, tcp_hdr.. */
        ...
    }
}

Rather than initializing the structure for every single instruction, two macros are commonly used. BPF_STMT takes the code and k values, for instructions which aren't logical tests. BPF_JUMP is used for tests, whose codes are of the BPF_JMP class, with whatever BPF_J* variant. This makes the BPF "assembly" (barely) manageable for human readers.

As an example, consider Listing 16-34, which presents a sample BPF filter program in installFilter. The listing demonstrates how to traverse an IPv4 packet: In the beginning of the program, the automaton's read stream is at the first byte of the frame - i.e. the Ethernet header. Since the EtherType is always at offset 12 (past 6 bytes of destination MAC address and 6 more of source), the 16-bit value is loaded as a halfword with BPF_LD + BPF_H. It is then compared to ETHERTYPE_IP (0x0800). If there is no match, the processing jumps 10 instructions forward, to the rejection (= return 0). If it is an IPv4 packet, processing continues (jumping 0 instructions forward, which means the next instruction, since the program counter always points to the next instruction). As tests continue, the rejection offset grows closer - two instructions later, it is 8, two more make it 6, etc. If the flow makes it to assert that the frame is an IPv4, unfragmented, TCP packet with either the source or the destination port matching the one requested, the filter returns -1 (accepting the packet in full), and the frame makes it back to Listing 16-33's file descriptor, where it can be read and processed in user space.

 
Listing 16-34: A sample BPF program
int installFilter(int   fd, 
         unsigned char  Protocol, 
             unsigned short Port)
{
    struct bpf_program bpfProgram = {0};

    /* Dump IPv4 packets matching Protocol and (for IPv4) Port only */

    /* @param: fd - Open /dev/bpfX handle.               */
    
    const int IPHeaderOffset = 6 + 6 + 2; /* 14 */
    
    /* Assuming Ethernet (DLT_EN10MB) frames, we have: 
     *  
     * Ethernet header = 14 = 6 (dest) + 6 (src) + 2 (ethertype)
     * Ethertype is 16-bits (BPF_H) at offset 12
     * IP header len is at offset 14 of frame (lower 4 bits). 
     * We use BPF_MSH to isolate the field and multiply by 4
     * IP fragment data is 16-bits (BPF_H) at offset 6 of IP header, 20 from frame
     * IP protocol field is 8-bits (BPF_B) at offset 9 of IP header, 23 from frame 
     * TCP source port is right after IP header (HLEN*4 bytes from IP header)
     * TCP destination port is two bytes later
     *
     * Note Port offset assumes that Protocol == IPPROTO_TCP!
     * If it isn't, adapting this to UDP ports is left as an exercise to the reader,
     * as is extending this to support IPv6, as well..
     */

 struct bpf_insn insns[] = {

 /* Uncomment this line to accept all packets (skip all checks) */
 // BPF_STMT(BPF_RET + BPF_K, (u_int)-1),                   // Return -1 (packet accepted)

 BPF_STMT(BPF_LD  + BPF_H   + BPF_ABS, 6+6),             // Load ethertype 16-bits from 12 (6+6)
 BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, ETHERTYPE_IP, 0, 10), // Test Ethertype or jump(10) to reject
 BPF_STMT(BPF_LD  + BPF_B   + BPF_ABS, 23),              // Load protocol (= IP Header + 9 bytes) 
 BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K  , Protocol, 0, 8),  // Test Protocol or jump(8) to reject 
 BPF_STMT(BPF_LD  + BPF_H   + BPF_ABS, IPHeaderOffset+6),// Load fragment offset field 
 BPF_JUMP(BPF_JMP + BPF_JSET+ BPF_K  , 0x1fff, 6, 0),    // Reject (jump 6) if more fragments
 BPF_STMT(BPF_LDX + BPF_B   + BPF_MSH, IPHeaderOffset),  // Load IP Header Len (x4) into X (for BPF_IND)
 BPF_STMT(BPF_LD  + BPF_H   + BPF_IND, IPHeaderOffset),  // Skip hdrlen bytes, load TCP src
 BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K  , Port, 2, 0),      // Test src port, jump to "port" if true

 /* If we're still here, we know it's an IPv4, unfragmented, TCP packet, but source port
  * doesn't match - maybe destination port does? 
  */

 BPF_STMT(BPF_LD  + BPF_H   + BPF_IND, IPHeaderOffset+2), // Skip two more bytes, to load TCP dest
/* port */
 BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K  , Port, 0, 1),       // If port matches, ok. Else reject
/* ok: */
 BPF_STMT(BPF_RET + BPF_K, (u_int)-1),                    // Return -1 (packet accepted)
/* reject: */
 BPF_STMT(BPF_RET + BPF_K, 0)                             // Return 0  (packet rejected)
    };

    bpfProgram.bf_len = sizeof(insns) / sizeof(insns[0]);
    bpfProgram.bf_insns = insns;

    /* Attach the filter program to the BPF device */
    return ioctl(fd, BIOCSETF, &bpfProgram);
}

BPF's sheer power can, at times, be too much of a good thing: it has been the source of quite a few vulnerabilities, thanks to the automaton implementation living in kernel space. Potential integer overflows may allow reads outside the packet scope - that is, of arbitrary kernel memory. Coupled with BPF_ST[X] instructions, which allow storing (= writing memory), this could be conducive to full kernel compromise. Additionally, the code around the filters (i.e. the ioctl(2) implementations) has been buggy in the past - as recently as Darwin 16 for BIOCSBLEN (CVE-2017-2482).
 

Pseudo-Interfaces

There are times when frames or packets need to be captured simultaneously from multiple interfaces. One way of doing so is to run multiple BPF filters at the same time (over several /dev/bpf# devices). Doing so, however, not only risks depleting the available BPF devices, but also makes it difficult to correctly sync the capture streams. Another is to use one of the pseudo-interfaces supported by XNU, iptap or pktap.

The *tap interfaces are pseudo-interfaces, and normally do not appear when interfaces are listed with ifconfig(8). They are created on demand when a packet capture program (notably, tcpdump(1)) is used with pktap or iptap as the name of the interface, followed by a comma-delimited list of actual interfaces. The difference between the two *tap interfaces is the encapsulation exposed - pktap provides the full packet, whereas iptap provides the network layer (IPv6 or IPv4) and upwards. Both interfaces can be used with BPF, and appear with a DLT_PKTAP (also DLT_USER2, with a value of 149).

The tap interfaces are created programmatically using an SIOCIFCREATE ioctl(2), and marked to be removed when the creating process exits (Apple's libpcap project's libpcap-darwin.c provides a clear example of doing so). Taps also allow their own in-kernel filtering rules (by interface name or type), which are independent of BPF. These can be set with the SIOCSDRVSPEC ioctl(2) code. The network-cmds project's pktapctl utility (not provided in Darwin releases) shows an example of getting and setting filters.

Using DLT_PKTAP also provides a significant benefit in allowing more metadata to be included for every packet captured. XNU's bsd/net/pktap.h defines the header which is artificially prepended to every packet returned by the interface. As shown in Listing 16-35, this provides plentiful (and useful) information, including the origin interface and actual DLT_* of the packet, as well as the owning process pid and command name:

Listing 16-35: The DLT_PKTAP header, from XNU-4570's bsd/net/pktap.h
/*
 * Header for DLT_PKTAP
 *
 * In theory, there could be several types of blocks in a chain 
 *  before the actual packet
 */
struct pktap_header {
    uint32_t    pth_length;                    /* length of this header */
    uint32_t    pth_type_next;                 /* type of data following */
    uint32_t    pth_dlt;                       /* DLT of packet */
    char        pth_ifname[PKTAP_IFXNAMESIZE]; /* interface name */
    uint32_t    pth_flags;                     /* flags */
    uint32_t    pth_protocol_family;
    uint32_t    pth_frame_pre_length;
    uint32_t    pth_frame_post_length;
    pid_t       pth_pid;                       /* process ID */
    char        pth_comm[MAXCOMLEN+1];         /* process name */
    uint32_t    pth_svc;                       /* service class */
    uint16_t    pth_iftype;
    uint16_t    pth_ifunit;
    pid_t       pth_epid;                      /* effective process ID */
    char        pth_ecomm[MAXCOMLEN+1];        /* effective command name */
    uint32_t    pth_flowid;
    uint32_t    pth_ipproto;
    struct timeval32    pth_tstamp;
    uuid_t      pth_uuid;
    uuid_t      pth_euuid;  
};

Darwin's tcpdump implementation contains a non-standard -k switch, which will parse some of that metadata (specifically, pth_ifname, pth_[e]comm, pth_[e]pid and pth_svc) to show the details of the process (or processes, if both endpoints are on the same host) to whose session each packet belongs.

 

Quality of Service

We've already discussed process and thread level Quality of Service, and with such formidable capabilities it's easy to forget that the Quality of Service concept was originally "born" at the network layer. Due to Net Neutrality and other considerations, QoS isn't deployed on the global Internet, but it is nonetheless applicable on internal networks, up to the egress router and sometimes beyond.

QoS recognizes two modes - Integrated Services and Differentiated Services. The former is handled by RSVP (the reservation protocol), and is not supported by XNU - nor need it be, since the implementation can reside in user mode. The latter mode (DiffServ) requires packet-level labeling, and is fully supported. The IPv4 "type of service" byte (the second byte of the header, right after the version/header-length "45") has been repurposed by RFC2474 and RFC3168 to provide a six-bit "Differentiated Services Code Point" (DSCP) and two bits of Explicit Congestion Notification (ECN). XNU supports later revisions of DiffServ, including RFC2597 (Assured Forwarding Per-Hop-Behavior) and RFC5865 (Capacity-Admitted Traffic).

Darwin 17 adds a new (and, as usual, undocumented) system call - net_qos_guideline (#525). The system call is provided with a net_qos_param structure specifying a bandwidth requirement (upload or download) and the structure's (fixed) length. It returns a hint to user mode specifying whether this requirement would be subject to the default QoS policy, or should be marked as a background (BK) service type, which will prefer delay based flow algorithms.

Network Link Conditioning

XCode's "Additional Tools" disk image contains, among its many fabulous "Hardware Tools", the "Network Link Conditioner" preference pane. This plug-in to System Preferences.app provides a simple but effective GUI for network link conditioning - the art of imposing artificial delay and packet loss based on configurable parameters. This is commonly used to simulate low to miserable bandwidth conditions, and test their effects on applications.

The preference pane is merely a front-end: The actual work is performed by nlcd(8), which communicates with the GUI by means of MIG subsystem 40268. But it turns out that nlcd, too, doesn't want to get its hands dirty, and instead sends XPC messages to pfd(8) with the help of the private PacketFilter.framework. Although we've discussed pfd in the context of the PF facility earlier, this time the daemon interfaces with another kernel facility, called dummynet(4), which is responsible for the dirty work.

The dummynet mechanism, a facility providing traffic shaping, bandwidth management and delay emulation, was devised by Luigi Rizzo in 1997, and extended in 2010[7]. It was brought into BSD and its ipfw mechanism, and migrated to Darwin. Although ipfw is defunct in modern systems, dummynet is still fully operational. Its implementation is mostly contained in XNU's bsd/netinet/ip_dummynet.[ch], with several other modifications throughout the stack - mostly in the IPv4/6 input and output paths. All these are in #ifdef DUMMYNET blocks, meaning that XNU can be built without it, though that is seldom the case.

Dummynet works by defining flows, and funneling them into one or more "pipes", which emulate links with given bandwidth/delay/loss parameters. Pipes are managed with the help of "queues", which implement Worst-case Fair Weighted Fair Queueing (WF2Q+) and Random Early Detection (RED). The pipes are entirely virtual, and packets are passed through them before or after they flow through the physical interface, which is how the connection parameters can be enforced.

Pipes can be configured by creating a raw socket, and then issuing setsockopt(2) calls. Four options are defined: IP_DUMMYNET_CONFIGURE (60) creates or modifies a dummynet pipe. The pipe may be removed with IP_DUMMYNET_DEL (61). The list of pipes can be retrieved with IP_DUMMYNET_GET (64), and pipes can be flushed with IP_DUMMYNET_FLUSH (62). The dnctl(8) command line tool offers a far easier way to configure pipes, providing an extensive command line with a well documented manual page, complete with examples. This manual page also documents the relevant sysctl(8) MIBs.

 

Network Extension Control Policies (Darwin 14+)

A major addition to Darwin's network stack is Network Extension Control Policies (NECPs), added in Darwin 14. NECPs are described in bsd/net/necp.c as "..high-level policy sessions, which are ingested into low-level kernel policies that control and tag traffic at the application, socket, and IP layers". In other words, NECPs enable user mode programs to control the kernel's network routing and scheduling decisions. Naturally, these also allow QoS.

The original interface provided for NECP is a PF_SYSTEM/SYSPROTO_CONTROL socket. Using com.apple.net.necp_control as the control name, a socket can be created, and then written to and read from through a specialized packet protocol:

Listing 16-36: The NECP control socket interface (as of Darwin 14)
struct necp_packet_header {
    u_int8_t            packet_type;
    u_int8_t            flags;
    u_int32_t           message_id;
};

/*
 * Control message commands
 */

#define NECP_PACKET_TYPE_POLICY_ADD                    1
#define NECP_PACKET_TYPE_POLICY_GET                    2
#define NECP_PACKET_TYPE_POLICY_DELETE                 3
#define NECP_PACKET_TYPE_POLICY_APPLY_ALL              4
#define NECP_PACKET_TYPE_POLICY_LIST_ALL               5
#define NECP_PACKET_TYPE_POLICY_DELETE_ALL             6
#define NECP_PACKET_TYPE_SET_SESSION_PRIORITY          7
// Lock session so that only the originator can perform actions.
#define NECP_PACKET_TYPE_LOCK_SESSION_TO_PROC          8
#define NECP_PACKET_TYPE_REGISTER_SERVICE              9
#define NECP_PACKET_TYPE_UNREGISTER_SERVICE            10
#define NECP_PACKET_TYPE_POLICY_DUMP_ALL               11

In addition to the root privileges needed to open the control socket, some actions are deemed privileged, and require the PRIV_NET_PRIVILEGED_NECP_[POLICIES/MATCH] privileges. These are tied to the com.apple.private.necp.[policies/match] entitlements, and are presently granted only to a select few daemons, as can be seen in the book's entitlement database.

The actual policies which may be defined are ridiculously rich and complex. A set of NECP_POLICY_CONDITION_* constants allows matching a policy to a particular DNS domain, local or remote address, specific IP protocol, PID, UID, entitlement-holder, interface, and more. Policies can also be ordered, so as to prioritize their application. Once applied, a policy result can be as simple as NECP_POLICY_RESULT_[PASS/DROP], but can also be any of several other NECP_POLICY_RESULT_* constants - to divert, filter or tunnel the flow, change a route rule, or trigger or use a particular netagent (discussed later).

NECP descriptors and clients

Starting with Darwin 16, just about every network-enabled process in the system uses NECPs, oftentimes without the developer even knowing what they are. This is because libnetwork.dylib calls necp_open() (#501) as part of its initialization (specifically, from nw_endpoint_handler_start). This creates a necp client - a file descriptor of type NPOLICY, readily visible in the output of lsof(1) or procexp ..fds. The descriptor does not offer the traditional operations (read(2)/write(2)/ioctl(2)), and only supports select(2), or use in a kqueue. The necp_client_action system call (#502) can be used to specify client actions, as shown in Listing 16-37:

 
Listing 16-37: The NECP client interface (from XNU-4570's bsd/net/necp.h)
// Following are all #define NECP_CLIENT_ACTION_... (omitted for brevity)
.._ADD                1 // Register a new client. Input: parameters in buffer; Output: client_id
.._REMOVE             2 // Unregister a client. Input: client_id, optional struct ifnet_stats_per_flow
.._COPY_PARAMETERS    3 // Copy client parameters. Input: client_id; Output: parameters in buffer
.._COPY_RESULT        4 // Copy client result. Input: client_id; Output: result in buffer
.._COPY_LIST          5 // Copy all client IDs. Output: struct necp_client_list in buffer
.._REQUEST_NEXUS_INSTANCE    6 // Request a nexus instance from a nexus provider, optional struct necp_stats_bufreq
.._AGENT              7 // Interact with agent. Input: client_id, agent parameters
.._COPY_AGENT         8 // Copy agent content. Input: agent UUID; Output: struct netagent
.._COPY_INTERFACE     9 // Copy interface details. Input: ifindex cast to UUID; Output: struct necp_interface_details
.._SET_STATISTICS     10 // Deprecated
.._COPY_ROUTE_STATISTICS   11 // Get route statistics. Input: client_id; Output: struct necp_stat_counts
.._AGENT_USE          12 // Return the use count and increment the use count. Input/Output: struct necp_agent_use_parameters
.._MAP_SYSCTLS        13 // Get the read-only sysctls memory location. Output: mach_vm_address_t
.._UPDATE_CACHE       14 // Update heuristics and cache
.._CLIENT_UPDATE      15 // Fetch an updated client for push-mode observer. Output: client_id, struct necp_client_observer_update in buffer
.._COPY_UPDATED_RESULT 16 // Copy client result only if changed. Input: client_id; Output: result in buffer

necp_open() is just one of several undocumented system calls, which Apple has added over time as the facility evolves. The system calls are not exported to user mode, but Listing 16-38 reconstructs the missing header file:

Listing 16-38: The NECP related system calls (as of Darwin 17)
int necp_match_policy(uint8_t *parameters, size_t parameters_size, 
                    struct necp_aggregate_result *returned_result); // #460

// Darwin 16
int necp_open(int flags);  // 501

int necp_client_action(int necp_fd, 
		uint32_t action, 
		uuid_t client_id, 
		size_t client_id_len, 
		uint8_t *buffer, 
		size_t buffer_size); // 502

// Darwin 17
/**
  * requires PRIV_NET_PRIVILEGED_NECP_POLICIES 
  */
int necp_session_open(__unused int flags);  // 522

int necp_session_action(int necp_fd, 
		uint32_t action, 
		uint8_t *in_buffer, 
		size_t in_buffer_length, 
		uint8_t *out_buffer, 
		size_t out_buffer_length);  // 523

Darwin 17 extends the idea of NECP client descriptors, and adds the NECP session (also an NPOLICY file descriptor*). These descriptors are created with necp_session_open (#522), and support just the close(2) operation (which deletes the associated session). NECP session descriptors are meant to be handled with the proprietary necp_session_action() system call (#523): NECP_SESSION_ACTION_* constants passed through its action parameter map to the NECP_PACKET_TYPE_POLICY* codes of the control socket, allowing the various actions to be performed, subject to the privilege check.

The public NetworkExtension.framework is a user of NECP sessions, which it abstracts using the undocumented NEPolicySession Objective-C object.


* - It's worth mentioning that both NECP file descriptor types are marked in the kernel as the same type (DTYPE_NETPOLICY). The potential type confusion was exploited by CVE-2018-4425, before being fixed by Apple in MacOS 10.14.1.
 

Network Agents (Darwin 15+)

Darwin 15 introduces a novel networking concept: network agents. These are user-mode clients to which network flow or other event handling is relayed via triggers. The agents can then handle the triggers and act upon them, for example by making network policy decisions.

Network agents create a PF_SYSTEM/SYSPROTO_CONTROL socket with the com.apple.net.netagent control name. The control is created in a manner identical to Listing 16-7 (setting sc_unit to 0 and changing the control name, of course). Once the control socket is connect(2)ed, agents may send and receive messages formatted with a netagent_message_header, defined in bsd/net/network_agent.h along with one of several codes, as shown in Listing 16-39. Note this header is not exported to user mode, as Apple keeps the API private:

Listing 16-39: The netagent_message_header and types (from XNU 4570's bsd/net/network_agent.h)
struct netagent_message_header {
        u_int8_t                message_type;
        u_int8_t                message_flags;
        u_int32_t               message_id;
        u_int32_t               message_error;
        u_int32_t               message_payload_length;
};

#define NETAGENT_MESSAGE_TYPE_REGISTER         1   // Pass netagent to set, no return value
#define NETAGENT_MESSAGE_TYPE_UNREGISTER       2   // No value, no return value
#define NETAGENT_MESSAGE_TYPE_UPDATE           3   // Pass netagent to update, no return value
#define NETAGENT_MESSAGE_TYPE_GET              4   // No value, return netagent
#define NETAGENT_MESSAGE_TYPE_TRIGGER          5   // Kernel init, no reply expected
#define NETAGENT_MESSAGE_TYPE_ASSERT           6   // Deprecated
#define NETAGENT_MESSAGE_TYPE_UNASSERT         7   // Deprecated
#define NETAGENT_MESSAGE_TYPE_TRIGGER_ASSERT   8   // Kernel init, no reply expected
#define NETAGENT_MESSAGE_TYPE_TRIGGER_UNASSERT 9   // Kernel init, no reply expected
// Added in XNU-3789 to support Nexus
#define NETAGENT_MESSAGE_TYPE_REQUEST_NEXUS    10  // Kernel init, struct netagent_client_message
#define NETAGENT_MESSAGE_TYPE_ASSIGN_NEXUS     11  // Pass struct netagent_assign_nexus_message
#define NETAGENT_MESSAGE_TYPE_CLOSE_NEXUS      12  // Kernel init, struct netagent_client_message
#define NETAGENT_MESSAGE_TYPE_CLIENT_TRIGGER   13  // Kernel init, struct netagent_client_message
#define NETAGENT_MESSAGE_TYPE_CLIENT_ASSERT    14  // Kernel init, struct netagent_client_message
#define NETAGENT_MESSAGE_TYPE_CLIENT_UNASSERT  15  // Kernel init, struct netagent_client_message

XNU-3248 adds a new system call, netagent_trigger (#490), which enables selective wake up of a registered netagent by the caller. The system call takes the agent_uuid, which should match the one the target agent registered with, and the agent_uuidlen (which is fixed at sizeof(uuid_t), i.e. 16). If the target agent allows triggers (i.e. registered with NETAGENT_FLAG_USER_ACTIVATED) and is not already active, a NETAGENT_MESSAGE_TYPE_TRIGGER (#5) will be sent to it.

A process may create and register more than one agent (with different UUIDs), and agents may be assigned to different domains (e.g. "WirelessRadioManager", "NetworkExtension") or types (e.g. VPN, Persistent, DNSAgent..). Darwin's configd does so (with several DNSAgents), as do CommCenter, networkserviceproxy, and iOS's nesessionmanager. Other daemons are fine with one agent, e.g. identityserviced, wifid and apsd. Using procexp all fds and filtering for Control Sockets (in a manner similar to Output 16-5) will show all the agents. The sysctl MIBs net.netagent.[active/registered]_count track the number of agents, and net.netagent.debug may be adjusted to produce verbose logging.

An open source example of creating an agent and handling notifications may be found in configd's open sources - specifically, the files in Plugins/IPMonitor show the creation of both the DNSAgent and the ProxyAgent. The following experiment demonstrates displaying agent details using specialized ioctl(2) codes.

 
Experiment: Displaying netagents using specialized ioctl(2) codes

The netagent facility provides ioctl(2) codes which can be used to enumerate existing agents (i.e. processes with com.apple.net.netagent control sockets). The ioctl(2) codes are SIOCGIFAGENT[LIST/DATA]64, which operate similarly: on the first pass, their respective data size arguments must be 0, and in turn they will be filled with the required data size. The caller is expected to allocate a sufficiently large buffer, and then call again. The call pattern is shown in Listing 16-40:

Listing 16-40: Displaying netagents with the SIOCGIFAGENT[LIST/DATA]64 ioctl(2)s
   int s = socket(AF_INET, SOCK_STREAM, 0);
   struct netagentlist_req64 nalr64;
   nalr64.data_size = 0; // first pass

   int rc = ioctl(s, SIOCGIFAGENTLIST64, &nalr64);

   if (rc < 0) { /* could fail because of entitlements.. */ }

   // nalr64.data_size will have been set by the previous call
   nalr64.data = malloc(nalr64.data_size);

   rc = ioctl(s, SIOCGIFAGENTLIST64, &nalr64);
   if (rc < 0) { perror("ioctl"); return (rc); }

   int i = 0; char uuid[64];

   for (i = 0; i < nalr64.data_size; i += 16) {
     uuid_unparse(nalr64.data + i, uuid);

     // Get data for this UUID (pass 1)
     struct netagent_req64 nadrq;
     memcpy(nadrq.netagent_uuid, nalr64.data + i, 16);
     nadrq.netagent_data_size = 0;
     rc = ioctl(s, SIOCGIFAGENTDATA64, &nadrq);
     if (rc < 0) { perror("SIOCGIFAGENTDATA64"); /* ... */ }

     // Get data for this UUID (pass 2)
     nadrq.netagent_data = malloc(nadrq.netagent_data_size);
     rc = ioctl(s, SIOCGIFAGENTDATA64, &nadrq);
     if (rc < 0) { perror("SIOCGIFAGENTDATA64"); /* ... */ }
     printf("%s: %s (%s/%s) %s\n", uuid,
       nadrq.netagent_domain, nadrq.netagent_type, nadrq.netagent_desc,
       netagentFlagsToText(nadrq.netagent_flags));
     // print agent-specific data, if nadrq.netagent_data_size > 0 ..
   }

Neither the codes nor the structures (nor the flags, used by netagentFlagsToText, above) are provided in user space headers, but it is a simple matter to copy them (from bsd/sys/sockio.h and bsd/net/network_agent.h). Note that the ioctl(2) codes require NECP entitlements (for system privilege 10004, a.k.a PRIV_NET_PRIVILEGED_NECP_POLICIES). This means they're easier to use on jailbroken *OS (where code signing is faked and any entitlement can be bestowed) than on MacOS (even with SIP disabled, since self-signed code is disallowed). Output 16-41 shows the output of the completed program on iOS (with UUIDs truncated since they're random anyway):

Output 16-41: Output from previous listing, on iOS
..EE9: ids501, (clientchannel/IDSNexusAgent ids501 : clientchannel) reg,active,networkprov,nexusprov
..120: Skywalk (FlowSwitch/MultiStack) reg,active,nexusprov
..BD7: SystemConfig (DNSAgent/DNSAgent(m)-b.e.f.ip6.arpa) reg,active,user activated,
..064: SystemConfig (DNSAgent/DNSAgent(m)-a.e.f.ip6.arpa) reg,active,user activated,
..33F: SystemConfig (DNSAgent/DNSAgent(m)-9.e.f.ip6.arpa) reg,active,user activated,
..FAF: SystemConfig (DNSAgent/DNSAgent(m)-8.e.f.ip6.arpa) reg,active,user activated,
..32E: SystemConfig (DNSAgent/DNSAgent(m)-254.169.in-addr.arpa) reg,active,user activated,
..D11: SystemConfig (DNSAgent/DNSAgent(m)-local) reg,active,user activated,
..F53: NetworkExtension (PathController/PathController: (null)) reg,active,
..9F2: NetworkExtension (PathController/PathController: (null)) reg,active,
..071: NetworkExtension (PathController/PathController: (null)) reg,active,
..A7D: Cellular (Internet/CommCenter: Internet) reg,voluntary,
..2DB: WiFiManager (CallInProgress/WiFi) reg,kernel activated,user activated,voluntary,specific use
 

SkyWalk

The SkyWalk subsystem is an entirely undocumented networking subsystem in XNU. It provides the interconnection between other networking subsystems, such as Bluetooth and user-mode tunnels. Although built into XNU, its source remains closed, with only error and debug strings indicating it is implemented in bsd/skywalk, and a couple of in-kernel client side implementations (namely, UTun and IPSec), which were not wiped clean by the preprocessor because of a different #ifdef block. A third implementation exists (bridge), but its source code is redacted. SkyWalk's memory subsystem is also largely self-managed: there are about three dozen SkyWalk related kernel zones, and the subsystem has its own arena based allocator (similar in concept to the Nanov2 allocator) with caching, which is used for in-kernel, non-blocking packet allocation and other uses.

Please note that SkyWalk is intentionally redacted out of XNU's sources by Apple, and is still rarely used. Reversing the object structures and APIs paints an incomplete and quite possibly inaccurate picture of its possible use, whether internal to Apple or in some future release of Darwin. The author's understanding and explanation of SkyWalk may therefore differ from Apple's design - but even a partial view of this subsystem is better than none.

Nexuses & Channels

Skywalk makes use of two special object types. A nexus is an endpoint, identified by a UUID, through which data packets can flow, prior to actually getting to an underlying network interface. Nexuses may be created in kernel or user mode, and when used in the latter appear as file descriptors (of DTYPE_NEXUS).

Nexuses are created through the use of Nexus Providers, of which there are currently four known types.

Registering a Nexus is a privileged operation. A set of sandbox-enforced entitlements - com.apple.private.skywalk.register-[flow-switch/net-if/user-pipe] - protects registration for each of the corresponding types. Nexuses have one or more channels to provide data flows. Each channel commonly has two rings, one for transmission (tx) and one for reception (rx), each with 128 slots.

Nexuses can interoperate with network agents. The NETAGENT_MESSAGE_TYPE_ [REQUEST/ASSIGN/CLOSE]_NEXUS messages (from Listing 16-39) allow the interoperation, by letting a network agent control nexus creation on demand. You can see both nexuses and network agents in action when using VPN applications: Setting up a VPN connection commonly creates both a net-if (usually, com.apple.netif.utun2) and a multistack flow-switch (com.apple.multistack.utun1) provider.

 

The ifconfig(8) utility (as of network-cmds 520+, provided in the *OS binpack) can display network agent and nexus details. Output 16-42 demonstrates the nexus enabled (user-mode tunneling) interfaces when a VPN connection is active:

Output 16-42: Using ifconfig(8) to view interface netagent and nexus details
utun0: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> mtu 2000 rtref 0 index 7
	eflags=5002080<TXSTART,NOAUTOIPV6LL,ECN_ENABLE,CHANNEL_DRV>
	options=6403<RXCSUM,TXCSUM,CHANNEL_IO,PARTIAL_CSUM,ZEROINVERT_CSUM>
	inet6 fe80::f686:bb0:335f:2ffc%utun0 prefixlen 64 scopeid 0x7 
        netif: BB5FC293-96DF-4130-80B6-0D45560B199B
	multistack: 52B79C5E-C085-4DE4-8E68-F11609C2B6D1
	nd6 options=201<PERFORMNUD,DAD>
	agent domain:ids501 type:clientchannel flags:0xc3 desc:"IDSNexusAgent ids501 : clientchannel"
	state availability: 0 (true)
	scheduler: FQ_CODEL 
	qosmarking enabled: no mode: none
utun1: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> mtu 1400 rtref 8 index 8
	eflags=5002080<TXSTART,NOAUTOIPV6LL,ECN_ENABLE,CHANNEL_DRV>
	options=6403<RXCSUM,TXCSUM,CHANNEL_IO,PARTIAL_CSUM,ZEROINVERT_CSUM>
	inet 10.47.19.145 --> 10.47.19.145 netmask 0xffffff00 
	netif: B5207F98-A1D5-485B-A197-162EC1F1AFC9
	multistack: 57F3E47D-8A75-4B9A-889A-20D102587FD3
	agent domain:NetworkExtension type:VPN flags:0x3 desc:"VPN: Free VPN"
	agent domain:Persistent type:Persistent flags:0x3 desc:"Persistent interface guidance"
	state availability: 0 (true)
	scheduler: FQ_CODEL 
	effective interface: en0
	qosmarking enabled: no mode: none

System calls and APIs

As with the other SkyWalk components, the system calls used to handle nexuses and channels are purposely left out of XNU's public sources - including even syscalls.master (thus, not even a prototype), which is interesting since other #if blocks in it are still present. Fortunately, the names of the system calls can be gleaned from the user mode header <sys/syscall.h>:

Listing 16-43: The undocumented headers for nexus and channel calls
/**
  *  Nexus calls
  */
int __nexus_open(void);		// 503
int __nexus_register(int NexusFD, ..., int Flags);     // 504
int __nexus_deregister(int NexusFD, ..., int Flags);   // 505
int __nexus_create(....,int Flags);    // 506
int __nexus_destroy (int nexusFD, void *, int Flags);	// 507
int __nexus_get_opt(int NexusFD, int Type, void *OptBuf, size_t *Size); // 508
int __nexus_set_opt(int NexusFD, int Type, void *OptBuf, size_t Size);  // 509

/*
 * Channel calls:
 */
int __channel_open  (int NexusFD, int Flags);   // 510
int __channel_get_info (int ChannelFD, void *info, int Flags);  // 511
int __channel_sync (....); // 512
int __channel_get_opt(int ChannelFD, int Type, void *OptBuf, size_t *Size); // 513
int __channel_set_opt(int ChannelFD, int Type, void *OptBuf, size_t Size);  // 514

Handling nexuses

Rather than using the system calls directly, libsystem_kernel.dylib provides higher level _os_nexus and _os_channel objects. This API provides metadata about the underlying file descriptors (for example, the guard value needed to guarded_close_np a channel, through _os_channel_destroy). An even higher level API can be found in libsystem_network, with its nw_nexus and nw_channel objects (with OS_-prefixed Objective-C objects).

 

Three objects - os_nexus, os_nexus_attr and os_nexus_controller - manage nexuses, and four more - os_channel, os_channel_slot, .._attr and .._packet - are used for channels. This way, a nexus can be created directly, through a call to __nexus_open, but the preferred way is to use os_nexus_controller_create, which also ensures the descriptor is guarded. Once created, a Nexus can be registered directly with a system call (__nexus_register) by using its file descriptor, or through the higher level os_nexus_controller_register_provider. Other calls offered by the os_nexus_* APIs are os_nexus_[dis]connect, os_nexus_if[attach/detach], and os_nexus_ns[un]bind, all of which wrap the __nexus_set_opt system call. Unsurprisingly, the os_nexus_* APIs aren't documented anywhere either, but Listing 16-44 reconstructs the missing header file:

Listing 16-44: The header file for os_nexus_* APIs
typedef struct os_nexus_controller	*os_nexus_t;
typedef enum { tx_rings  = 0, rx_rings, tx_slots, rx_slots, slot_buf_size, 
  slot_meta_size, anonymous, mhints,  pipes, extensions = 9 } nexus_attr_t;
      
os_nexus_t os_nexus_controller_create(void *attrs);
int os_nexus_controller_get_fd(os_nexus_t);
int os_nexus_controller_register_provider
    (os_nexus_t, char *name, int type, void *, out uuid_t provUUID);

int os_nexus_controller_alloc_provider_instance
     (os_nexus_t, in uuid_t provUUID, out uuid_t provInstance);

int os_nexus_controller_free_provider_instance(os_nexus_t, in uuid_t provInstance);

int os_nexus_attr_set(os_nexus_t, nexus_attr_t, int);

skywalkctl(8)

A vital piece of the SkyWalk puzzle is the skywalkctl(8) utility, apparently a debugging tool left in /usr/sbin, which nonetheless has a (partial) manual page. The utility is actively maintained by Apple, as can be seen by the increasing number of subcommands it offers in Darwin 18. Of particular interest is the "tree" command, which provides a JSON output of all providers (by reading information from the SkyWalk sysctl(8) interface, described next).

sysctl MIBs

The SkyWalk subsystem outputs its statistics through several MIBs, of which kern.skywalk.nexus_provider_list and kern.skywalk.nexus_channel_list are the most interesting, as they provide detailed information about Nexus providers and channels (as nexus_provider_info_t and nexus_channel_entry_t structures). Accessing these MIBs requires the com.apple.private.skywalk.observe-all entitlement, enforced by a mac_priv_check hook (from Sandbox.kext) for the undocumented 12010 privilege. Even the other, more basic Nexus statistics have an entitlement associated with them, com.apple.private.skywalk.observe-stats (the undocumented 12011). There are additional privileges (12000-12003), all nexus related but undocumented (and, for lack of source, nameless), all of which depend on the skywalk entitlements. The *OS binpack's sysctl(8) is properly entitled, as is the aforementioned skywalkctl(8), which can decipher the opaque MIBs into a human readable form.

Obtaining information about a particular channel file descriptor can be achieved through proc_pidfdinfo with the undocumented PROC_PIDFDCHANNELINFO flavor (10). This returns a channel_fdinfo containing the channel type, UUID, port and flags.

 

Review Questions

  • What is the difference in operation between PF_NDRV packet capture capabilities and those of BPF?
  • What is the advantage of using User Mode Tunneling (the utun## facility) for VPN?
  • What is a specific advantage of using MPTCP, in particular for bandwidth intensive applications like FaceTime?
  • By peeking into Apple's NetworkExtension.framework, how do NECP and Nexuses provide underlying support for the public exposed classes?
  • What is the benefit of using a DNS agent?
  • What is a good reason to protect skywalk's considerable amount of code with entitlements, even for non-sensitive operations such as statistics?
  • How do network agents and nexuses interoperate?
  • What do utun and ipsec both have in common, which merits them using a nexus?

References

    1. "Improving Network Reliability Using Multipath TCP" - https://developer.apple.com/documentation/foundation/nsurlsessionconfiguration/ improving_network_reliability_using_multipath_tcp
    2. Apple Developer - QA1176 - https://developer.apple.com/library/archive/qa/qa1176/_index.html
    3. NewOSXBook.com - "NetBottom.c" - http://newosxbook.com/src.jl?tree=listings&file=netbottom.c
    4. Apple - HT201642 (The Application Level Firewall) -
      https://support.apple.com/en-us/HT201642
    5. Open BSD Manual Pages - pf(4) - https://man.openbsd.org/pf.4
    6. McCanne & Van Jacobson - "The BSD Packet Filter" - https://www.usenix.org/legacy/publications/library/proceedings/sd93/mccanne.pdf
    7. Luigi Rizzo - "Dummynet, Revisited" - https://www.researchgate.net/publication/220194992_Dummynet_Revisited
    This was the complete 16th chapter from *OS Internals, Volume I (in its v1.2 update). It's free, but please respect the copyright and the immense amount of research devoted to creating it. If any of this is useful, please cite using the original link. You might also want to consider getting the book, or checking out Tg's training.