bababadalgharaghtakamminarronnkonnbro: June 2012

Monday, June 25, 2012

InterviewStreet: Going Office (AKA The Most Vital Arc Problem)

Overview

Going Office is definitely one of the tougher problems on InterviewStreet, but the main idea* behind the problem is simple: given a weighted undirected graph \( G = ( V, E ) \) and two nodes \( s, t \in V \), we wish to answer \( Q \) queries of the following form:

What is the shortest path from \( s \) to \( t \) given that edge \( e \in E \) is removed from the graph?

(This type of problem is also known as the "most vital arc" problem -- see the first footnote below.)

Let \( N = | V | \) and \( M = | E | \) as usual. In this case, we are guaranteed that \( N, M, Q \leq 200000 \).

Prerequisites

You should definitely know Dijkstra's algorithm before attempting to solve this problem. Also, knowing about segment trees with lazy propagation would be good, but is not required to understand the solution.

Building Bridges

First of all, it should be clear that naively running Dijkstra's isn't going to run in time, because it requires \( O( Q N \log N ) \) operations. In fact, the size of the input suggests that we should look for at worst an \( O( N \log N ) \) solution. What can we do then? We may first try to solve an easier problem: can we determine whether an edge \( e \), when removed, will increase the shortest path length between \( s \) and \( t \)?

Definitions:

An edge is optimal if it is on a shortest path from \( s \) to \( t \).

An edge is a bridge* if it is on all shortest paths from \( s \) to \( t \). (Here we refer only to shortest paths in \( G \), without having removed any edges.)

Let \( d_s( u ) \) (resp. \( d_t( u ) \)) be the shortest-path distance from \( s \) (resp. \( t \)) to \( u \), and let \( OPT \) be the shortest-path distance from \( s \) to \( t \). Then we can show the following properties:

Lemmata:

\( e = ( u, v ) \) is optimal if and only \( d_s( u ) + length( e ) + d_t( v ) = OPT \).

\( e = ( u, v ) \) is a bridge if and only if it is optimal and for all other optimal \( e' = ( u', v' ) \), we have either \( d_s( v' ) \leq d_s( u ) \) or \( d_s( v ) \leq d_s( u' ) \). In other words, when we travel along an optimal path from \( s \) to \( t \), then between lengths \( d_s( u ) \) and \( d_s( v ) \), we must be traveling along \( e \). By convention we denote a bridge with the letter \( b \).

As a corollary of the above statement, \( e \) is a bridge if and only if, when removed from \( G \), the shortest path distance from \( s \) to \( t \) increases.

(Proofs are left to the reader.) These properties should give you some idea of how to find all the bridges in \( G \). If you still need some hints, an algorithm for computing bridges is included at the end of this post.

Islands

Now we have a way of identifying which edges, when removed, affect the shortest path length. This still leaves one major question unanswered: given a bridge \( b \), how exactly do we determine the shortest path length from \( s \) to \( t \) once \( b \) is removed? Running Dijkstra's repeatedly is too slow, since we can easily construct examples in which every edge is a bridge. (Consider a graph consisting of only a path from \( s \) to \( t \).)

Let \( K \) denote the number of bridges. Note that Lemma 2 above implies that we can uniquely order the bridges \( b_0, b_1, \ldots, b_{K-1} \) based on their relative distances from \( s \). Could we use this to help us?

Definition: Let the \( i \)-th island be defined as the set of all vertices \( v \), such that there exists a shortest path from \( s \) to \( v \) using no more than \( i \) bridges. (If \( v \) is not connected to either \( s \) or \( t \), then we can ignore it -- it lies on some strange continent somewhere far away.)

The intuition behind islands and bridges is this: bridges are the fastest way to "travel" between the islands when going from \( s \) to \( t \). If we remove an edge that is not a bridge, our shortest path is not affected. However, when we take out a bridge, we are forced to travel along another (possibly much longer) path.

Proposition: Consider the bridge \( b_i \), connecting the \( i \)-th and \( i + 1 \)-st islands. Let \( E_i \) be the set of all edges \( e \) connecting an island with index \( \leq i \) to an island with index \( > i \). Then the shortest path from \( s \) to \( t \) that bypasses \( b_i \) must have length \[ \min_{ e = ( u, v ) \in E_i } \{ d_s( u ) + length( e ) + d_t( v ) \} \]

Proof: Any path \( P \) from \( s \) to \( t \) bypassing \( b_i \) must include some edge in \( E_i \). Now consider some edge \( e = ( u, v ) \in E_i \) -- the shortest path from \( s \) to \( t \) that goes through \( e \) must have length \( d_s( u ) + length( e ) + d_t( v ) \). Lastly, by the way we have constructed the islands, we know that \( b_i \) is not on the shortest paths from \( s \) to \( u \) or from \( v \) to \( t \).

A Data Structure for Bypass Length

Let \( bypass( i ) \) denote the length of the shortest path that bypasses \( b_i \). The above observations suggest that we can run the following algorithm to compute \( bypass( i ) \):

For each non-bridge edge \( e = ( u, v ) \) connecting islands \( i \) and \( j \), where \( i < j \):
For each \( k \in \{ i, i + 1, \ldots, j - 1 \} \):
\( bypass( k ) \leftarrow \min \{ bypass( k ), d_s( u ) + length( e ) + d_t( v ) \} \)

However, there is a slight problem with this algorithm: it's too slow. Through some tricky analysis, you can show that in the worst case, we can construct a graph that will cause the above subroutine to take \( O( M^{3/2} ) \) operations.

What we need is a data structure that supports the following two operations efficiently:

\( update( i, j, val ) \), which runs \( bypass( k ) \leftarrow \min \{ bypass( k ), val \} \) for all \( k \in \{ i, i+1, \ldots, j-1 \} \)
\( query( i ) \), which returns \( bypass( i ) \)

One possible data structure that can support both operations in \( O( \log K ) \) time is called a segment tree with lazy propagation or a segment tree with range update. There are several tutorials online about this type of tree, and it's a good data structure to be familiar with. (The main problem right now is that the structure is rather complicated and the online tutorials are not very good. I may do an article on it some time, if there is sufficient demand.)

However, if you are lazy, there is another data structure called the Kempe tree, which is a bit simpler and can support the same sort of queries, also using \( O( \log K ) \) operations. Check the link for more information.

Putting It All Together

The previous observations suggest the following algorithm for solving the problem:

Compute \( d_s \) and \( d_t \) using Dijkstra's algorithm.
Find the bridges:
1. First find all the optimal edges by iterating over all edges \( e = ( u, v ) \), and storing the edges such that \( d_s( u ) + length( e ) + d_t( v ) = OPT \).
2. Sort all the optimal edges by \( d_s( u ) \) and \( d_s( v ) \). Use the criterion from above to find the bridges.
Find the islands.
1. One way (the hard way) is to run a modified version of Dijkstra.
2. A smarter and shorter alternative is to just run multiple depth-first searches from \( s \) and from the far end of each bridge. It's up to the reader to figure out how to do so.
Find the bypassing paths.
1. Initialise your favourite range-update data structure.
2. Iterate over all the non-bridge edges. If \( e = ( u, v ) \) is an edge crossing from island \( i \) to island \( j > i \), then range-update \( bypass(i), bypass(i+1), \ldots, bypass(j-1) \) such that they are no more than \( d_s( u ) + length( e ) + d_t( v ) \).
Process all the queries. If the edge in question is not a bridge, then we simply return the optimal length. Otherwise, we query the range-update data structure to find the best bypass length.

All of the above steps run in require \( O( R \log R ) \) operations where \( R = max\{ N, M, Q \} \), so the algorithm has complexity \( O( R \log R ) \).

The solution took me around 150 lines to implement with a decent amount of spacing and a relatively clean coding style. I'm somewhat suspicious, though, that I've overcomplicated the solution, and that a nicer solution exists. Message me or comment here if you have any other alternatives.

Footnotes

See this paper for an example of previous work on the subject.
See also this page.

The Kempe Tree: A Simple Data Structure for Supporting Range Updates or Queries

Introduction

This post describes two variations on tree structure that I've used in some programming problems for solving variations of the range query/update problem. I haven't seen anyone else use it before, so for the moment I'm going to call it the Kempe tree, named after my great-grandfather who was also named Richard Kempe.

Suppose we are given a list of data \( a_0, a_1, \ldots, a_{n-1} \). In general terms, range update data structure supports the following type of operation: \[ update( i, j, x ): a_k \leftarrow x \text{ for all } k \in \{ i, i+1, \ldots, j \} \] Similarly, a range query data structure supports the following type of operation: \[ query( i, j ) = f( a_i, a_{i+1}, \ldots, a_{j} ) \] (Here \( f \) is taken to be some function we can aggregate over a large number of values, like \( \min \), \( + \), etc....) Both of these operations are expected to be reasonably fast, usually taking at most \( O( n ) \) steps.

As an example, segment trees and Fenwick trees (or binary-indexed trees) are range query data structures. Segment trees with lazy propagation support both range queries and range updates.

The Kempe tree is a data structure that supports either fast range updates and single-element queries, or fast range queries and single-element updates, but not both simultaneously. (Okay, fine, it can support both, provided one is \( O( \log^2 n ) \). But more on that later.) Why, then, should I use a Kempe tree, you ask? The main reason is that it uses numerical tricks to make it easier to code, the same reason why the Fenwick tree is sometimes used instead of the segment tree. (This also means that, for all intents and purposes, Kempe trees will never be used outside of competition programming.)

Range Traversals

The Kempe tree, like the Fenwick tree, is an array-based tree. Suppose we have \( n \) data entries, a[0], a[1], ..., a[n-1]; then we require an array of size \( N + n \), where \( N \) is the smallest power of \( 2 \) that is \( \geq n \). Let tree[i] denote this array. The entries of tree[i] are allocated as follows:

Elements a[0], a[1], ..., a[n-1] are stored in entries tree[N], tree[N+1], ..., tree[N+n-1], respectively.
The parent node for any entry tree[i] is tree[i/2].
Likewise, a non-leaf node tree[i] has children tree[2*i] and tree[2*i+1].
The root of the tree is tree[1].

See the following image for a description of the mapping. Numbers in the tree nodes denote position in the array tree[i].

We say a tree node governs the array entries that are its descendants. For some range a[i], a[i+1], ..., a[j], the following algorithm finds the minimal set of tree nodes tree[x[0]], tree[x[1]], ..., tree[x[m-1]] that govern a[i], a[i+1], ..., a[j]:

vector find_governing_nodes( int i, int j )
{
  vector ans;
  i += N, j += N;
  while( i <= j )
  {
    if( i % 2 == 1 ) ans.push_back( i );
    if( j % 2 == 0 ) ans.push_back( j );
    i = ( i + 1 ) / 2;
    j = ( j - 1 ) / 2;
  }
  return ans;
}

A sketch of the proof for the correctness of the algorithm is the following. We start with i and j at the ends of the interval we are interested in. We process the interval by moving i and j up the tree, and by 'pulling' them together. If moving either i or j shrinks the interval during the next step, then we add them to the list of governing nodes. The following image illustrates the process in action:

We wish to query the range \( [1, 5] \). i and j are initialised to 9 and 13, respectively.
Since 9 would jump out of its subtree during the next move, we add 9 to the list of governing nodes.
We move i to 5 and j to 6. Both i and j will jump out of their respective ranges during the next move, so we add both 5 and 6 to the list of governing nodes.
i gets moved to 3, and j to 2. This is an invalid range, so we break and return the list [9, 5, 6].

Range Queries and Singleton Updates

We now present an example of a tree structure that can handle updates to single elements and range queries. The function we will try to query is the sum function.

// define MAX to be some really big number

int tree[MAX], N;

// initialises a tree for n data entries
int init( int n )
{
  N = 1;
  while( N < n ) N *= 2;
  for( int i = 1; i < N + n; i++ ) tree[i] = 0;
}

// compute the product a[i] + a[i+1] + ... + a[j]
int range_sum( int i, int j )
{
  int ans = 1;
  for( i += N, j += N; i <= j; i = ( i + 1 ) / 2, j = ( j - 1 ) / 2 )
  {
    if( i % 2 == 1 ) ans += tree[i];
    if( j % 2 == 0 ) ans += tree[j];
  }
  return ans;
}

// set a[i] = val
void update( int i, int val )
{
  int diff = val - tree[i+N];
  for( int j = i + N; j; j /= 2 ) tree[j] += diff;
}

Range Updates and Singleton Queries

The following example will allow us to increment a range, while allowing us to query the result on a singleton value.

// define MAX to be a really big number

int tree[MAX], N;

// initialises a tree for n data entries
int init( int n )
{
  N = 1;
  while( N < n ) N *= 2;
  for( int i = 1; i < N + n; i++ ) tree[i] = 0;
}

// increments a[k] by val for all k between i and j, inclusive
void range_increment( int i, int j, int val )
{
  for( i += N, j += N; i <= j; i = ( i + 1 ) / 2, j = ( j - 1 ) / 2 )
  {
    if( i % 2 == 1 ) tree[i] += val;
    if( j % 2 == 0 ) tree[j] += val;
  }
}

// compute a[i]
int query( int i )
{
  int ans = 0;
  for( int j = i + N; j; j /= 2 ) ans += tree[j];
  return ans;
}

Final Remarks

We can easily extend the tree to support many types of functions, such as the product function, min, max, set union, set intersection, and so forth.

Unfortunately, under the current scheme, the tree cannot simultaneously support range updates and queries. There is a way to support both operations, provided that one operation takes \( O( \log n ) \) and the other takes \( O( \log^2 n ) \). (The proof of this is left to the reader.) Thus, Kempe trees do not provide as much functionality as segment trees, but are generally somewhat easier to code.