Uncovering Configuration and Behavior Drift

When debugging network issues, it is important to understand how the network is different today compared to yesterday or to the desired golden state. A text diff of device configs is one way to do this, but it tends to be too noisy. It will show differences that you may not care about (e.g., changes in whitespace or timestamps), and it is hard to control what is reported. More importantly, text diffs also do not tell you about the impact of change on network behavior, such as if new traffic will be permitted or if some BGP edges will go down.

Batfish parses and builds a vendor-neutral model of device configs and behavior. This model enables you to learn how two snapshots of the network differ exactly along the aspects you care about. The behavior modeling of Batfish also lets you understand the full impact of these changes. This notebook illustrates this capability.

We focus on the following differences across three categories.

  1. Configuration settings

    1. Node-level properties

    2. Interface-level properties

    3. Properties of BGP peers

  2. Structures and references

    1. Structures defined in device configs

    2. Undefined references

  3. Network behavior

    1. BGP adjacencies

    2. ACL lines with treat flows differently

These are examples of different types of changes that you can analyze using Batfish. You may be interested in a different aspects of your network, and you should be able to adapt the code below to suit your needs.

Text diff will help with the configuration settings category at best. The other two categories require understanding the structure of the config and the network behavior it induces. To illustrate this point, the text diff of example configs that we use in this notebook is below.

[1]:
# Use recursive diff, followed by some pretty printing hacks
!diff -ur networks/drift/reference networks/drift/snapshot | sed -e 's;diff.*snapshot/\(configs.*cfg\);^-----------\1---------;g' | tr '^' '\n' | grep -v networks/drift

-----------configs/as1border1.cfg---------
@@ -21,7 +21,7 @@
 !
 !
 no ip domain lookup
-ip domain name lab.local
+ip domain name lab.localp
 no ipv6 cef
 !
 !

-----------configs/as1border2.cfg---------
@@ -11,7 +11,7 @@
 !
 !
 ntp server 18.18.18.18
-ntp server 23.23.23.23
+ntp server 18.18.18.19
 !
 !
 no aaa new-model

-----------configs/as2border2.cfg---------
@@ -59,13 +59,14 @@
  duplex auto
 !
 interface GigabitEthernet0/0
- ip address 10.23.21.2 255.255.255.0
- ip access-group OUTSIDE_TO_INSIDE in
- ip access-group INSIDE_TO_AS3 out
- media-type gbic
- speed 1000
- duplex full
- negotiation auto
+ shutdown
+! ip address 10.23.21.2 255.255.255.0
+! ip access-group OUTSIDE_TO_INSIDE in
+! ip access-group INSIDE_TO_AS3 out
+! media-type gbic
+! speed 1000
+! duplex full
+! negotiation auto
 !
 interface GigabitEthernet1/0
  ip address 2.12.22.1 255.255.255.0

-----------configs/as2core1.cfg---------
@@ -60,6 +60,7 @@
  duplex auto
 !
 interface GigabitEthernet0/0
+ description "To as2border1 GigabitEthernet1/0"
  ip address 2.12.11.2 255.255.255.0
  media-type gbic
  speed 1000
@@ -67,6 +68,7 @@
  negotiation auto
 !
 interface GigabitEthernet1/0
+ description "To as2border2 GigabitEthernet2/0"
  ip address 2.12.21.2 255.255.255.0
  negotiation auto
 !

-----------configs/as2dept1.cfg---------
@@ -84,6 +84,7 @@
  neighbor as2 remote-as 2
  neighbor 2.34.101.3 peer-group as2
  neighbor 2.34.201.3 peer-group as2
+ neighbor 2.34.209.3 peer-group as2
  !
  address-family ipv4
   bgp dampening
@@ -96,7 +97,6 @@
   neighbor as2 route-map dept_to_as2 out
   neighbor 2.34.101.3 activate
   neighbor 2.34.201.3 activate
-  maximum-paths eibgp 5
  exit-address-family
 !
 ip forward-protocol nd

-----------configs/as2dist1.cfg---------
@@ -82,13 +82,13 @@
  bgp log-neighbor-changes
  neighbor as2 peer-group
  neighbor as2 remote-as 2
- neighbor dept peer-group
- neighbor dept remote-as 65001
+ neighbor dept2 peer-group
+ neighbor dept2 remote-as 65001
  neighbor 2.1.2.1 peer-group as2
  neighbor 2.1.2.1 update-source Loopback0
  neighbor 2.1.2.2 peer-group as2
  neighbor 2.1.2.2 update-source Loopback0
- neighbor 2.34.101.4 peer-group dept
+ neighbor 2.34.101.4 peer-group dept2
  !
  address-family ipv4
   bgp dampening
@@ -113,6 +113,7 @@
 no ip http server
 no ip http secure-server
 !
+access-list 102 permit tcp host 2.128.0.0 host 255.255.0.0
 access-list 102 permit ip host 2.128.0.0 host 255.255.0.0
 access-list 105 permit ip host 1.0.1.0 host 255.255.255.0
 access-list 105 permit ip host 1.0.2.0 host 255.255.255.0
@@ -128,6 +129,9 @@
  match community dept_community
  set local-preference 350
 !
+route-map dept_to_as2dist permit 200
+ match community dept_community_new
+ set local-preference 350
 !
 !
 control-plane

-----------configs/as2dist2.cfg---------
@@ -118,6 +118,7 @@
 access-list 105 permit ip host 1.0.2.0 host 255.255.255.0
 access-list 105 permit ip host 3.0.1.0 host 255.255.255.0
 access-list 105 permit ip host 3.0.2.0 host 255.255.255.0
+access-list 105 permit ip host 3.0.3.0 host 255.255.255.0
 !
 route-map as2dist_to_dept permit 100
  match ip address 105

-----------configs/as3border1.cfg---------
@@ -120,6 +120,10 @@
 !
 ip prefix-list default_list seq 5 permit 0.0.0.0/0
 !
+ip prefix-list bogons seq 5 permit 10.0.0.0/8
+ip prefix-list bogons seq 10 permit 172.16.0.0/16
+ip prefix-list bogons seq 15 permit 192.168.0.0/16
+!
 ip prefix-list inbound_route_filter seq 5 deny 3.0.0.0/8 le 32
 ip prefix-list inbound_route_filter seq 10 permit 0.0.0.0/0 le 32
 access-list 101 permit ip host 1.0.1.0 host 255.255.255.0

-----------configs/as3core1.cfg---------
@@ -51,9 +51,6 @@
 !
 !
 !
-interface Loopback0
- ip address 3.10.1.1 255.255.255.255
-!
 interface Ethernet0/0
  no ip address
  shutdown
@@ -77,6 +74,9 @@
 interface GigabitEthernet3/0
  ip address 90.90.90.2 255.255.255.0
  negotiation auto
+!
+interface Loopback0
+ ip address 3.10.1.1 255.255.255.255
 !
 router ospf 1
  network 3.0.0.0 0.255.255.255 area 1

As we can see, it is difficult to grasp the nature and impact of the change from this output, not to mention that it is impossible to build automation on top of it (e.g., to alert on certain types of differences). We show next how Batfish offers a meaningful view of these differences and their impact on network behavior.

[2]:
# Import packages, helpers, and load questions
%run startup.py
from drift_helper import diff_frames, diff_properties
bf = Session(host="localhost")

# Initialize both the snapshot and the reference that we want to use
NETWORK_NAME = "my_network"
SNAPSHOT_PATH = "networks/drift/snapshot"
REFERENCE_PATH = "networks/drift/reference"

bf.set_network(NETWORK_NAME)
bf.init_snapshot(SNAPSHOT_PATH, name="snapshot", overwrite=True)
bf.init_snapshot(REFERENCE_PATH, name="reference", overwrite=True)
[2]:
'reference'

1. Configuration settings

Let first uncover differences in configuration settings, starting with node-level properties.

1A. Node-level properties

We focus on three example properties: 1) NTP servers, 2) Domain name, and 3) VRFs that exist on the device. The complete list of node properties extracted by Batfish is here.

We will compute the property differences between across snapshots using Batfish questions. Batfish makes its models available via a set of questions. When questions are run in differential mode, it outputs how the answer differ across two snapshots.

[3]:
# Properties of interest
NODE_PROPERTIES = ["NTP_Servers" , "Domain_Name", "VRFs"]

# Compute the difference across two snapshots and return a Pandas DataFrame
node_diff = bf.q.nodeProperties(
                properties=",".join(NODE_PROPERTIES)
            ).answer(
                snapshot="snapshot",
                reference_snapshot="reference"
            ).frame()

# Print the DataFrame
show(node_diff.head())
  Node KeyPresence Snapshot_Domain_Name Reference_Domain_Name Snapshot_NTP_Servers Reference_NTP_Servers Snapshot_VRFs Reference_VRFs
0 as1border1 In both lab.localp lab.local default default
1 as1border2 In both lab.local lab.local 18.18.18.19

18.18.18.18
23.23.23.23

18.18.18.18
default default

The output above shows all property differences for all nodes. There is a row per node. We see that on as1border1 the domain name has changed, and on as1border2 the set of NTP servers has changes. There is no other difference for any other node for the chosen properties.

This structured output can be transformed and fed into any type of automation, e.g., to alert you when an important property has changed. We can also generate readable drift reports using the helper function we defined above.

[4]:
# Print readable messages on the differences
diff_properties(node_diff, "Node", ["Node"], NODE_PROPERTIES)

Differences for Node=as1border1
    Domain_Name: lab.local -> lab.localp

Differences for Node=as1border2
    NTP_Servers: ['23.23.23.23', '18.18.18.18'] -> ['18.18.18.19', '18.18.18.18']

1B. Interface-level properties

We next check if any interface-level properties have changed. We again focus on three example settings: 1) whether the interface is active, 2) description, and 3) primary IP address. The complete list of interface settings extracted by Batfish are here.

[5]:
# Properties of interest
INTERFACE_PROPERTIES = ['Active', 'Description', 'Primary_Address']

# Compute the difference across two snapshots and return a Pandas DataFrame
interface_diff = bf.q.interfaceProperties(
                    properties=",".join(INTERFACE_PROPERTIES)
                ).answer(
                    snapshot="snapshot",
                    reference_snapshot="reference"
                ).frame()

# Print a readable version of the differences
diff_properties(interface_diff, "Interface", ["Interface"], INTERFACE_PROPERTIES)

Differences for Interface=as2border2[GigabitEthernet0/0]
    Active: True -> False
    Primary_Address: 10.23.21.2/24 -> None

Differences for Interface=as2core1[GigabitEthernet0/0]
    Description: None -> "To as2border1 GigabitEthernet1/0"

Differences for Interface=as2core1[GigabitEthernet1/0]
    Description: None -> "To as2border2 GigabitEthernet2/0"

We see that the interface GigabitEthernet0/0 on as2border2 has been shutdown and its address assignment has been eliminated. We also see that the description has been added for two interfaces on as2core1.

1C. BGP peer properties

We next check properties of BGP peers, focusing on four example properties: 1) description, 2) peer group, 3) Import policies applied to the peer, and 4) Export policies applied to the peer. The complete list of BGP peers properties is here.

[6]:
# Properties of interest
BGP_PEER_PROPERTIES = ['Remote_AS', 'Description', 'Peer_Group', 'Import_Policy', 'Export_Policy']

# Compute the difference across two snapshots and return a Pandas DataFrame
bgp_peer_diff = bf.q.bgpPeerConfiguration(
                    properties=",".join(BGP_PEER_PROPERTIES)
                ).answer(
                    snapshot="snapshot",
                    reference_snapshot="reference"
                ).frame()

#Print readable messages on the differences
diff_properties(bgp_peer_diff, "BgpPeer", ["Node", "VRF", "Local_Interface", "Remote_IP"], BGP_PEER_PROPERTIES)

BgpPeers only in snapshot
    Node=as2dept1, VRF=default, Local_Interface=None, Remote_IP=2.34.209.3

Differences for Node=as2dist1, VRF=default, Local_Interface=None, Remote_IP=2.34.101.4
    Peer_Group: dept -> dept2
    Import_Policy: ['dept_to_as2dist'] -> []
    Export_Policy: ['as2dist_to_dept'] -> []

The output shows that a new peer has been defined on as2dept1 with remote IP address 2.34.209.3; and the peer group has changed for an an existing peer on as2dist1, which then also led to its import and export policies changing. This correlated change in import/export policies are invisible in the text diff.

2. Structures and references

Batfish models include all structures defined in device configs (e.g., ACLs, prefix-lists) and how they are referenced in other parts of the config. You can use these models to learn if structures have been defined or deleted, which represents a major change in the configuration.

2A. Structures defined in configs

The definedStructures question is the basis for learning about structures defined in the config.

[7]:
# Extract defined structures from both snapshots as a Pandas DataFrame
snapshot_structures = bf.q.definedStructures().answer(snapshot="snapshot").frame()
reference_structures = bf.q.definedStructures().answer(snapshot="reference").frame()

# Show me what the information looks like by printing the first few rows
show(snapshot_structures.head())
  Structure_Type Structure_Name Source_Lines
0 extended ipv4 access-list line OUTSIDE_TO_INSIDE: permit ip any any FileLines(filename='configs/as2border1.cfg', lines=[137])
1 bgp peer-group as2 FileLines(filename='configs/as1border1.cfg', lines=[81])
2 extended ipv4 access-list line blocktelnet: deny tcp any any eq telnet FileLines(filename='configs/as2core1.cfg', lines=[124])
3 interface GigabitEthernet1/0 FileLines(filename='configs/as1core1.cfg', lines=[69, 70, 71])
4 extended ipv4 access-list OUTSIDE_TO_INSIDE FileLines(filename='configs/as2border2.cfg', lines=[132, 133, 134])

The output snippet shows how Batfish captures the exact lines in each file where each structure is defined. We can process this information from the two snapshots to produce a report on all differences.

[8]:
# Remove the line numbers but keep the filename. We don't care about where in the file structure are defined.
snapshot_structures_without_lines = snapshot_structures[['Structure_Type', 'Structure_Name']].assign(
    File_Name=snapshot_structures["Source_Lines"].map(lambda x: x.filename))
reference_structures_without_lines = reference_structures[['Structure_Type', 'Structure_Name']].assign(
    File_Name=reference_structures["Source_Lines"].map(lambda x: x.filename))

# Print a readable message on the differences
diff_frames(snapshot_structures_without_lines,
            reference_structures_without_lines,
            "DefinedStructure")

DefinedStructures only in snapshot
     File_Name=configs/as3border1.cfg, Structure_Name=bogons, Structure_Type=ipv4 prefix-list
     File_Name=configs/as2dist1.cfg, Structure_Name=dept2, Structure_Type=bgp peer-group
     File_Name=configs/as2dist1.cfg, Structure_Name=dept_to_as2dist 200, Structure_Type=route-map-clause
     File_Name=configs/as2dist2.cfg, Structure_Name=105: permit ip host 3.0.3.0 host 255.255.255.0, Structure_Type=extended ipv4 access-list line
     File_Name=configs/as2dist1.cfg, Structure_Name=102: permit tcp host 2.128.0.0 host 255.255.0.0, Structure_Type=extended ipv4 access-list line

DefinedStructures only in reference
     File_Name=configs/as2dist1.cfg, Structure_Name=dept, Structure_Type=bgp peer-group

We can easily see in this output that a BGP peer group named dept2 was newly defined on as2dist1 and a prefix-list named bogons was defined on as2border1. We also see that the peer group named dept was removed from as2dist1. The peer group change is related to what we saw earlier with a peer property changing. This view shows that the entire structure has been removed and defined.

2B. Undefined structure references

References to undefined structures are symptoms of configuration errors. Using the undefinedReferences question, Batfish can help you understand if new undefined references have been introduced or old ones have been cleared.

[9]:
# Extract undefined references from both snapshots as a Pandas DataFrame
snapshot_undefined_references=bf.q.undefinedReferences().answer(snapshot="snapshot").frame()
reference_undefined_references= bf.q.undefinedReferences().answer(snapshot="reference").frame()

# Show me all undefined references in the snapshot
show(snapshot_undefined_references)
  File_Name Struct_Type Ref_Name Context Lines
0 configs/as2core2.cfg route-map filter-bogons bgp inbound route-map FileLines(filename='configs/as2core2.cfg', lines=[110])
1 configs/as2dist1.cfg community-list dept_community_new route-map match community-list FileLines(filename='configs/as2dist1.cfg', lines=[133])
2 configs/as2dist1.cfg undeclared bgp peer-group dept bgp peer-group referenced before defined FileLines(filename='configs/as2dist1.cfg', lines=[99, 100, 101])

The output shows that there are three undefined references in the snapshot. Let us find out which ones were newly introduced relative to the reference.

[10]:
# Remove Lines since we don't care about where it was referenced
snapshot_undefined_references_without_lines = snapshot_undefined_references.drop(columns=['Lines'])
reference_undefined_references_without_lines = reference_undefined_references.drop(columns=['Lines'])

# Print a readable message on the differences
diff_frames(snapshot_undefined_references_without_lines,
            reference_undefined_references_without_lines,
            "UndefinedRefeference")

UndefinedRefeferences only in snapshot
     Ref_Name=dept_community_new, File_Name=configs/as2dist1.cfg, Struct_Type=community-list, Context=route-map match community-list
     Ref_Name=dept, File_Name=configs/as2dist1.cfg, Struct_Type=undeclared bgp peer-group, Context=bgp peer-group referenced before defined

We thus see that, of the three undefined references that we saw earlier, two were newly introduced and one exists in both snapshots.

3. Network behavior

We now turn our attention to behavioral differences between network snapshots, starting with changes in BGP adjacencies.

3A. BGP adjacencies

The bgpEdges question of Batfish enables you to learn about all BGP adjacencines in the network, as follows.

[11]:
# Get the edges from both snapshots as Pandas DataFrames
snapshot_bgp_edges = bf.q.bgpEdges().answer(snapshot="snapshot").frame()
reference_bgp_edges = bf.q.bgpEdges().answer(snapshot="reference").frame()

# Show me the schema by printing the first few rows
show(snapshot_bgp_edges.head())
  Node IP Interface AS_Number Remote_Node Remote_IP Remote_Interface Remote_AS_Number
0 as1border2 1.2.2.2 None 1 as1core1 1.10.1.1 None 1
1 as1core1 1.10.1.1 None 1 as1border1 1.1.1.1 None 1
2 as2dist2 2.1.3.2 None 2 as2core2 2.1.2.2 None 2
3 as3border2 3.2.2.2 None 3 as3core1 3.10.1.1 None 3
4 as2dist2 2.34.201.3 None 2 as2dept1 2.34.201.4 None 65001

We see that Batfish knows which BGP edges in the snapshot come up and shows key information about them. We can use the answer to this question to learn which edges exist only in the snapshot or only in the refrence.

[12]:
# Retain only columns we care about for this analysis
snapshot_bgp_edges_nodes = snapshot_bgp_edges[['Node', 'Remote_Node']]
reference_bgp_edges_nodes = reference_bgp_edges[['Node', 'Remote_Node']]

# DataFrames contain one edge per direction; keep only one direction
snapshot_bgp_bidir_edges_nodes = snapshot_bgp_edges_nodes[
                                    snapshot_bgp_edges_nodes['Node'] < snapshot_bgp_edges_nodes['Remote_Node']
                                  ]
reference_bgp_bidir_edges_nodes = reference_bgp_edges_nodes[
                                    reference_bgp_edges_nodes['Node'] < reference_bgp_edges_nodes['Remote_Node']
                                  ]

# Print a readable message on the differences
diff_frames(snapshot_bgp_bidir_edges_nodes,
            reference_bgp_bidir_edges_nodes,
            "BgpEdge")

BgpEdges only in reference
     Node=as2border2, Remote_Node=as3border1

One BGP edge exists only in the reference, that is, it disappeared in the snapshot. We can find more about this edge, like so:

[13]:
# Find the matching edge in the reference edges answer from before
missing_snapshot_edge = reference_bgp_edges[
                           (reference_bgp_edges['Node']=="as2border2")
                           & (reference_bgp_edges['Remote_Node']=="as3border1")
                         ]

# Print the edge information
show(missing_snapshot_edge)
  Node IP Interface AS_Number Remote_Node Remote_IP Remote_Interface Remote_AS_Number
20 as2border2 10.23.21.2 None 2 as3border1 10.23.21.3 None 3

Do you recall the interface on as2border2 that was shut earlier? This BGP edge was removed because of that interface shutdown (which you confirm using IP of the interface—10.23.21.2/24).

3B. ACL behavior

To compute the behavior differences between ACLs, we use the compare filters question. It returns pairs of lines, one from the filter definition in each snapshot, that match the same flow(s) but treat them differently (i.e. one permits and the other denies the flow).

[14]:
# compute behavior differences between ACLs
compare_filters = bf.q.compareFilters().answer(
                                            snapshot='snapshot',
                                            reference_snapshot='reference'
                                        ).frame()

# print the result
show(compare_filters)
  Node Filter_Name Line_Index Line_Content Line_Action Reference_Line_Index Reference_Line_Content
0 as2dist2 105 4 permit ip host 3.0.3.0 host 255.255.255.0 PERMIT End of ACL

We see that the only difference in the ACL behaviors of the two snapshots is for ACL 105 on as2dist. Line permit ip host 3.0.3.0 host 255.255.255.0 in the snapshot permits some flows that were being denied in the reference snapshhot because of the implicit deny at the end of the ACL. Thus, we have permitted flows that were not being permitted before.

If you were paying attention to the text diff above, the result above may surprise you. The text diff (relevant snippet repeated below) showed that ACL 102 on as2dist1 changed as well.

[15]:
!diff -ur networks/drift/reference/configs/as2dist1.cfg networks/drift/snapshot/configs/as2dist1.cfg | grep -A 7 '@@ -113,6 +113,7 @@'
@@ -113,6 +113,7 @@
 no ip http server
 no ip http secure-server
 !
+access-list 102 permit tcp host 2.128.0.0 host 255.255.0.0
 access-list 102 permit ip host 2.128.0.0 host 255.255.0.0
 access-list 105 permit ip host 1.0.1.0 host 255.255.255.0
 access-list 105 permit ip host 1.0.2.0 host 255.255.255.0

You may have expected a behahvior diff corresponding to this change, but Batfish analysis reveals that that didn’t happen. The added line is permitting TCP traffic between two hosts for which IP traffic was already permitted, so no new traffic was permitted. So, either this change was unnecessary or someone mistyped the host addresses.

Summary

Batfish enables you to easily understand how your device configs differ from a historial reference or golden versions. It provides structured information about not only changes to settings in configs but also about changes in network behavior. This information provides important context beyond simple text diffs and can be inserted into an automated pipeline that alerts on important changes.


Get involved with the Batfish community

Join our community on Slack and GitHub.