P4in5G: Flexible Data Plane Pipelines for 5G
ELTE Eötvös Loránd University, Budapest, Hungary
Network Function Virtualization (NFV) is one of the enabling technologies of 5G. It increases the flexibility of communication networks by deploying network functions as software on commodity servers. However, developing high-performance network function software still requires deep target-specific knowledge, which increases development cost and time. P4, a domain-specific language, has recently emerged to describe packet processing pipelines in a protocol-independent way. P4 compilers generate the target-specific executable program for different targets, including both hardware and software. Through the high-level abstraction of P4, code complexity, implementation time and cost can all be reduced.
P4in5G combines the advantages of 5G-NFV and P4 by offering P4-programmable VNFs based on T4P4S, a P4 compiler and software data plane solution (using a DPDK backend). The proposed P4-enhanced VNF has been validated through use cases described in the P4 language, and performance measurements have been carried out with various settings in 5TONIC.
P4VNF is based on T4P4S, our P4 compiler and software data plane. In this experiment we first created a VNF (OSM R4) for the 5TONIC testbed. The VM image of the VNF is delivered with preinstalled tools: 1) the T4P4S compiler and execution framework, 2) an InfluxDB-based In-band Network Telemetry database instance, 3) control plane programs for the use cases, 4) P4 implementations of the use cases, 5) helper scripts for easy usage of the VNF, and 6) network performance measurement tools such as iperf and pktgen. The VNF defines three internal network interfaces with different purposes: vnf-mgmt for management traffic, and vnf-ul and vnf-dl for uplink and downlink (use case-specific test) traffic. Figure 1 illustrates the high-level structure of the VNF with its connections.
The traffic received by PPDR-ONE was encapsulated in the MQTT, CoAP and HTTP protocols respectively, creating a mix of heterogeneous streams that was then used for the execution of the ICARUS experiment. The ICARUS experiment considered these multi-protocol IoT flows, which passed through the INFOLYSiS vDPI, instantiated as a VNF in a virtual server of the PPDR-ONE testbed. Based on the classification of the flows by protocol, appropriate SDN rules were applied so that each classified IoT traffic type (e.g. the CoAP, MQTT and HTTP traffic) was forwarded to the suitable mapping function (also instantiated as a VNF on the server with virtualization capabilities). Each mapping function mapped its IoT-protocol-specific data flow to a generic, interoperable data protocol flow (such as UDP). In this way, the traffic of the three different protocols was mapped to a generic data protocol, making all IoT data interoperable over UDP.
Finally, once ICARUS had successfully forwarded and mapped the IoT traffic, the resulting IoT sensor data flows were interoperable over the generic UDP protocol and ready for further use by third-party applications through the API of the INFOLYSiS interoperable vGW.
Measurements in 5TONIC
To analyze the performance of P4VNF we investigated two experimental scenarios: 1) P4VNF testing, focusing on the packet forwarding performance and on the end-to-end and packet processing delays. In this scenario, we considered P4 programs covering the following use cases: port forwarding, L2 forwarding, L3 routing, NAT, load balancing and a simplified 5G UPF. 2) P4VNF chaining, where two instances of a simple port forwarding network function were used, constituting a service chain. This scenario focused on the achievable throughput and end-to-end delays.
Scenario 1: Evaluation of a P4VNF node
Figure 2 depicts the scenario consisting of 3 nodes. Nodes #1 and #3 are the source and sink of the test traffic, while node #2 runs the P4-based data plane. The test traffic was generated by the iperf tool. Note that we also tried to use DPDK’s pktgen traffic generator, but it suddenly stopped sending packets after approximately 10k packets. The end-to-end delay (actually the RTT) was measured by the ping tool, which was executed in parallel with iperf. The packet processing statistics were provided by the software data plane through its INT feature.
In addition to complex use cases, a port forwarding (PortFWD) program was also examined as a baseline. It simply forwards every packet from port 0 to 1, and vice versa.
In the L2 forwarding use case we had 2 exact match tables for source and destination MAC addresses. We considered two cases shown as L2 and L2big in the figures and tables: 1) L2: both tables are filled with 2 entries, 2) L2big: 1000 entries are inserted into both tables.
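The two-table structure described above can be illustrated with a minimal Python sketch. It is not the P4 program used in the experiments: the MAC addresses, port numbers, the flooding marker and the behaviour on a table miss are all assumptions made for illustration, and a Python dict or set only stands in for an exact match table.

```python
# Hypothetical sketch of an L2 forwarding pipeline with two exact match tables.
# Table contents, port numbers and the miss behaviour are assumptions.

FLOOD = -1  # assumed marker meaning "no known port, flood the packet"


def l2_forward(smac_table, dmac_table, src_mac, dst_mac):
    """Return (needs_learning, out_port) for one packet.

    smac_table: set of known source MACs (exact match on source address)
    dmac_table: dict mapping destination MAC -> output port (exact match)
    """
    # A miss in the source MAC table means the address is not yet known.
    needs_learning = src_mac not in smac_table
    # A hit in the destination MAC table selects the output port.
    out_port = dmac_table.get(dst_mac, FLOOD)
    return needs_learning, out_port


# The "L2" case: both tables hold 2 entries.
smac_table = {"aa:aa:aa:aa:aa:01", "aa:aa:aa:aa:aa:02"}
dmac_table = {"aa:aa:aa:aa:aa:01": 0, "aa:aa:aa:aa:aa:02": 1}

print(l2_forward(smac_table, dmac_table,
                 "aa:aa:aa:aa:aa:01", "aa:aa:aa:aa:aa:02"))
```

An L2big-like case would fill both containers with 1000 entries. Hash-based exact match lookups take roughly constant time on average regardless of the number of entries, although this sketch says nothing about how T4P4S implements its tables.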
L3 routing is a more complex use case. It consists of an exact match source MAC table, an LPM table for IP/SUBNET and a next hop exact match table. Similarly to the L2 use case, in the L3 scenario all the tables contain 2 entries, while in L3big 1000 entries are inserted into the LPM table.
First, experiments with these basic use cases were carried out. They primarily focused on the achievable throughput and on the observed end-to-end and packet processing delays. The first two metrics are also affected by the virtualized network between the VNF instances, while the last one reflects the performance limits of DPDK applications in a virtualized environment (without DPDK-related optimization and fine tuning).
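The lookup sequence of the L3 use case (source MAC check, longest prefix match, next hop resolution) can be sketched as follows. This is an illustrative Python model under stated assumptions, not the P4 code of the experiment: the function names, the table contents, the next hop identifiers and the decision to drop on any miss are hypothetical, and the linear scan is chosen for readability rather than speed.

```python
import ipaddress

# Hypothetical sketch of the three-table L3 pipeline; all names, entries and
# the drop-on-miss behaviour are assumptions made for illustration.


def lpm_lookup(lpm_table, dst_ip):
    """Return the next hop id of the longest matching prefix, or None."""
    addr = ipaddress.ip_address(dst_ip)
    best_len, best_nhop = -1, None
    for prefix, nhop in lpm_table.items():
        net = ipaddress.ip_network(prefix)
        # Keep the matching entry with the longest prefix length.
        if addr in net and net.prefixlen > best_len:
            best_len, best_nhop = net.prefixlen, nhop
    return best_nhop


def l3_route(smac_table, lpm_table, nexthop_table, src_mac, dst_ip):
    """Return (new_dst_mac, out_port), or None if the packet is dropped."""
    if src_mac not in smac_table:       # exact match on source MAC
        return None
    nhop = lpm_lookup(lpm_table, dst_ip)  # LPM on destination IP
    if nhop is None:
        return None
    return nexthop_table.get(nhop)      # exact match on next hop id


smac_table = {"aa:aa:aa:aa:aa:01", "aa:aa:aa:aa:aa:02"}
lpm_table = {"10.0.0.0/8": "nh1", "10.1.0.0/16": "nh2"}
nexthop_table = {"nh1": ("bb:bb:bb:bb:bb:01", 0),
                 "nh2": ("bb:bb:bb:bb:bb:02", 1)}

print(l3_route(smac_table, lpm_table, nexthop_table,
               "aa:aa:aa:aa:aa:01", "10.1.2.3"))
```

In the example, 10.1.2.3 matches both prefixes, and the /16 entry wins because it is the longer one.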
The generated T4P4S data planes used the VIRTIO interfaces of the virtual machine of the P4VNF. Unfortunately, VIRTIO limits the number of RX queues per port, and thus multicore packet processing was not possible. All the figures and tables below demonstrate the single (virtual) CPU core performance of P4VNF.
As can be seen in Figure 3, the maximum achievable bandwidth is not affected by the P4 program itself. With 64-byte packets, the maximum throughput we observed was 35 Mbps; when we increased the packet size to 1448 bytes, the throughput increased linearly to approx. 800 Mbps, showing that the system is limited by the packet rate (PPS, packets per second). Both bitrates correspond to a packet rate of 70 KPPS (thousand packets per second), which unfortunately seems to be caused by a limitation of the underlying infrastructure. This was the maximum rate we observed in the 5TONIC virtualized environment (P4Chain NS/experiment). Note that we also tested the throughput with other DPDK applications and got the same results. We also tested the pure virtio performance in our local testbed (two QEMU/KVM VMs running on two hosts with simple virtio interfaces) and observed similar KPPS values. In addition, we checked the literature and found that this is a limitation of the QEMU/KVM environment with simple virtio settings. For DPDK applications, this bottleneck can be resolved by using vhost on the host machine, which reduces unnecessary memory operations at the boundary of the virtualization layers.
The RTT values measured by the ping tool between nodes #1 and #3 also varied over a wide range. Note that the RTT values were measured in parallel with the iperf measurements to obtain realistic results close to real-world performance. Figure 3 shows the average RTT of 3 independent executions. The average RTT was similar in the different use cases and showed a slightly increasing trend with the packet size used in the test traffic. Though the packet processing delay reported by the T4P4S data plane was stable, covering the range of 98us-187us, the end-to-end RTT measurements showed a high deviation around the mean in some cases. Note that for VMs that had not been used for a while, RTT values above 1 second were also captured, caused by the overhead of waking up the VM. This phenomenon is not considered in the measurements presented in this section. The difference in the average RTTs of the different use cases usually stayed between 0.3 and 6 ms. One can also observe that the simple port forwarding (PortFWD) application used as the baseline shows the same performance, indicating that the bottleneck is located outside of the VNF (e.g., virtio NICs).
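The relation between packet rate and bitrate used in this paragraph can be checked with a few lines of Python. The sketch assumes that the bitrate counts only the stated packet size (64 or 1448 bytes) and ignores any additional header or framing overhead; the function name is hypothetical.

```python
def throughput_mbps(pps, packet_bytes):
    """Bitrate in Mbps for a given packet rate and packet size.

    Assumption: only the stated packet size is counted, without any
    extra header or framing overhead.
    """
    return pps * packet_bytes * 8 / 1e6


PACKET_RATE = 70_000  # 70 KPPS, the rate reported in the text

for size in (64, 1448):
    print(f"{size} bytes: {throughput_mbps(PACKET_RATE, size):.1f} Mbps")
```

At a fixed 70 KPPS the formula gives about 35.8 Mbps for 64-byte packets and about 810.9 Mbps for 1448-byte packets, in line with the reported 35 Mbps and approx. 800 Mbps.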
Figure 3: Packet processing performance of basic use cases with various packet sizes. Left: average throughput. Right: average RTT.
We also analyzed how the different table sizes affect the packet processing performance. In the L2 scenario each table holds 2 entries, while in L2big each table holds 1000 entries. Figure 4 shows that the difference in both bandwidth and delay between the two cases is insignificant up to a packet size of 512 bytes. However, for maximum-sized packets (1448 bytes), a slightly smaller bandwidth and an almost 2 ms lower delay can be observed when the L2 tables are crowded. In all cases, the throughput values reflect a forwarding rate of 69-70 KPPS. The delay was measured by the ping tool.
Figure 4: The performance of L2 forwarding use case with different table sizes
The L3 and L3big use cases are very similar to the previous L2 and L2big cases, but instead of two exact match tables the P4 program has more exact match tables and a single LPM table. The measurement results, however, do not differ much from the previous scenarios. The throughput corresponds to a rate of approx. 70 KPPS. The deviation in the RTT measurements for 1448-byte packets again shows some anomalies. We did not see any increase in the packet processing delay of T4P4S, indicating that the delay was not caused by the table lookups.
Figure 5: The performance of L3 routing use case with various table sizes
Table 1: More complex use cases, measurements with packet size of 1448 Bytes, single vCPU
One can see in Table 1 that the observed packet processing delay is similar in all the use cases with one exception: the 5G UPF involves more computational tasks, including encapsulation/decapsulation, and this is reflected in the measured delay statistics. Note that we also tested other DPDK applications and observed performance similar to the baseline and the other use cases, clearly indicating that the performance bottleneck is caused by the underlying virtualization layers. Though the achievable performance seems predictable, it is far from the single-core performance of a bare-metal setup (where the PortFWD baseline throughput with a single CPU core is 14 MPPS). The throughput limitations are mostly caused by the VIRTIO interfaces, in accordance with the literature.
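To put the gap between the virtualized and the bare-metal results into perspective, the two packet rates quoted in the text can be compared directly. The short Python sketch below only restates those two figures (70 KPPS and 14 MPPS); the variable and function names are hypothetical, and the per-packet interval is simply the inverse of the packet rate, not a measured value.

```python
VIRTUALIZED_PPS = 70_000      # approx. rate observed in the virtualized setup
BARE_METAL_PPS = 14_000_000   # PortFWD baseline on a single bare-metal core


def inter_packet_interval_us(pps):
    """Average time between consecutive packets, in microseconds."""
    return 1e6 / pps


ratio = BARE_METAL_PPS / VIRTUALIZED_PPS
print(f"bare metal / virtualized packet rate: {ratio:.0f}x")
print(f"virtualized: {inter_packet_interval_us(VIRTUALIZED_PPS):.2f} us per packet")
print(f"bare metal:  {inter_packet_interval_us(BARE_METAL_PPS):.4f} us per packet")
```

Under these figures the bare-metal baseline forwards packets at 200 times the rate seen in the virtualized environment.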
Scenario 2: Chaining two P4VNFs
In this scenario, we executed two port forwarding (PortFWD) examples in node #2 and node #3 while the traffic was generated by node #1 where two Docker containers were running. DL and UL interfaces have been assigned to the two containers. The generated traffic followed the path denoted by the red dashed line in Figure 6.
This scenario was only evaluated with the port forward example to see the achievable baseline performance of chaining multiple P4VNFs. Similarly to the previous cases we monitored the end-to-end throughput and RTT, and the packet processing delay of the two data plane programs.
Table 2: Chaining two PortFWD instances, measurements with packet size of 1448 Bytes, single vCPU
One can observe in Table 2 that the achieved throughput is the same as in the previous experimental scenario; only the observed end-to-end RTT increased because of the longer path.
Though our experiment shows that P4 can be used as a language to describe NFs (especially in the networking/telecommunication domain), a good configuration of the NFV infrastructure plays a crucial role in the performance. DPDK can only achieve good performance if specific settings are applied to handle the issues of Virtio vNICs (VHost on the host machine), SR-IOV support, dedicated huge pages, caching and CPU scheduling. The virtualized environment and the VIRTIO interfaces only allowed us to measure the single vCPU performance of the generated data plane programs. The measurements showed that approx. 800 Mbps throughput can be achieved (corresponding to approx. 70 KPPS with a packet size of 1448 bytes) with moderate end-to-end delays and small packet processing delays. 70 KPPS seems to be the limit of pure VIRTIO NICs, which has also been reported in other studies. This finding suggests that the key performance bottleneck is located outside the VNF and that the measured numbers may be improved by fine tuning the infrastructure (e.g. based on DPDK’s performance tuning guidelines).