a bit of cs: Statistics

Showing posts with label Statistics. Show all posts

Tuesday, December 16, 2014

Codeforces Round #282 Div. 1 Problem C - Helping People

Problem Statement:
494C - Helping People

Solution:
Another challenging problem, with a very interesting DP solution. The official editorial to this round is very detailed and well-written :)

The idea is to build a tree-like directed acyclic graph (DAG) and keep track on some cumulative probabilities.

To find:
the expectation of maximum element in [1..N]. This is equal to sum of P[X=x] * x, for all x possible.

Structure:
Each segments can be thought of as nodes. We first add a root to the DAG, which will be our entry point on traversing this graph. This root is a segment [1 .. N] with probability of being chosen 0. Call this node as e. Afterwards we sort the segments in increasing left index, and breaking ties with decreasing right index. Then we incrementally build the DAG as follows: For each segment u, we consider each segment v in increasing order. Add a directed edge from u to v if v is fully contained by u, but cannot be contained by any segment w that has already been pointed by u.

Follow Up: Codeforces 275 Div. 2 Problem E / Div. 1 Problem C

Problem Statement:
482C - Game of String

Solution:
The most excruciating problem I've done so far in my life.. haha
Anyway, I was not able to come up with a purely \(O(L 2^L) \) solution, but there is such solutions, and I am still trying to understand the thought process. The one that I managed to get it accepted is actually \(O(NL 2^L)\) but the N term is very small thanks to __builtin_popcount() function.

TopCoder SRM 637

Division 1
Problem 250:
SRM 637 Div1 (250) - GreaterGame

Solution:
To find the expectation, first sort the cards on each hand. Then for each of the known Sothe's card, we can determine whether we want to top the card with the lowest possible card on our hand, or just lose with the lowest possible card. Each of the card that we topped will contribute to the expectation by 1, and those we lose contribute 0. Then we will end up with the cards that Sothe did not reveal, and the cards that we do not used yet. For each card on our hand, we find the probability of our card winning given the knowledge of Sothe's remaining cards. Each of the probability contributes to our expectation. Hence we just sum up all the probabilities with the number of the cards that we won earlier on.

a bit of stats: Relationship between Poisson and Binomial Distribution

I have came across a problem that seems very counter intuitive in a book "All of Statistics", roughly goes like this:
Let random variable N follows Poisson Distribution, and we do N flips of a coin, with probability of getting a head is p. Let X and Y be the random variables representing the total number of heads and tails respectively. Prove that X and Y are independent.

Wait, what? Does not X affects Y? But apparently it does not and here is the proof:

If we know \(N = n\) in advance, then X given N follows Binomial distribution where \(P(X=x|N=n) = {n \choose x} p^x (1-p)^{n-x} \). However, in general, the random variable \(X\) takes in any values from 0 to infinity, hence it certainly does not follow binomial distribution. To find its probability mass function, we use the theory of total probability across \(N\):
\(
\begin{eqnarray*}
P(X=x) &=& \sum_{n=0}^{\infty} P(X=x|N=n) P(N=n) \\
&=& \sum_{n=0}^{\infty} {n \choose x} p^x (1-p)^{n-x} e^{-\lambda} \frac{\lambda^n}{n!} \\
&=& \sum_{n=0}^{\infty} \left( \frac{(p\lambda)^xe^{-p\lambda}}{x!} \right) \left( \frac{((1-p)\lambda)^{n-x}e^{-(1-p)\lambda}}{(n-x)!} \right) \\
&=& \left( \frac{(p\lambda)^xe^{-p\lambda}}{x!} \right) \sum_{n=0}^{\infty} \left( \frac{((1-p)\lambda)^{n-x}e^{-(1-p)\lambda}}{(n-x)!} \right)
\end{eqnarray*}
\)
The summation is equal to 1 since it is the sum over the probability mass density of \(\text{Poisson}((1-p)\lambda)\), hence we arrive at the conclusion that \(X \sim \text{Poisson}(p\lambda)\). Similarly, \(Y \sim \text{Poisson}((1-p)\lambda) \).
Lastly, we need to find a way to conclude that \(P(X=x \text{ and } Y=y) = P(X=x)P(Y=y) \). Using theory of total probability we have
\(P(X=x \text{ and } Y=y) = \sum_{n=0}^{\infty} P(X=x \text{ and } Y=y | N=n) P(N=n) \)
Here, the terms with \(x+y \neq n\) will equal to zero, hence the probability is the summation over those where \(x+y=n\). This last step is direct cancellation of factorial terms to derive the desired conclusion.

Wednesday, October 15, 2014

a bit of topcoder: SRM 636 Div. 1

Problem Statement:
TopCoder SRM 636 Div. 1 500

Summary:
Given an array called board of dimension MxN ( \(M, N\leq 20\) ). Each cell board[i][j] contains either '#' or '.', indicating 'full' or 'empty' respectively. We are to choose r cells out of the empty cells. Between 2 cells, we define distance of board[i][j] and board[m][n] is \(\sqrt{(i-m)^2+(j-n)^2}\). For each board[i][j], an edge is drawn between that cell and another cell with the smallest distance. If there is a tie, draw the edge to the one with lower row index, and if there is still a tie, choose the one with lower column index. The resulting undirected graphs have a number of connected components. Find the expected number of connected components.

One of the most important observation that makes solving this problem possible: If two cells have double edges (each of them have edge pointing to one another), then they contribute to one connected component.

We need some background on probability though, especially on Expectation and Indicator Function:
1. X is the random variable representing the number of connected components in a graph. Equivalently, from our observation, that X is equal to the number of pairs of double edges.
2. \(E(X) = \sum {xP(X=x)} \)
3. \(I\) is indicator function, \(I(A) = 1\) if event A occurs, 0 if it does not. If we define a random variable \(X_A = I(A)\), then \(E(X_A) = P(A)\).
4. Let \(A_{(c,d)}\) be an event that there is a double edge between two cells c and d. Furthermore, we define \(X_{(c,d)} = I(A_{(c,d)}) \). We have \(X = \sum_{c,d \in \text{board cells}, c\neq d} X_{(c,d)} \).
5. Hence E(X) = E( \( \sum_{c,d \in \text{board cells}, c\neq d} X_{(c,d)} \) ), which is equal to the sum of all probabilities of each \(A_{(c,d)}\).

Here is the implementation:

/*
X = number of double edges 
E[X] = sum xP(X=x)
P(i-j has double edge)
Given P(X=1), can I get P(X=k) k>1? Seems terribly inefficient.
Let's use Indicator function instead
Let I(edge i-j) = 1 if i-j has double edge
if we sort all edges into one sequence
X(i) = I(i-th edge)

E[X] = E[X(i)]
     = sum {i=1 to |E|} E(X(i))
And E(X(i)) = P(ith-edge) since expectation of an indicator is its prability
Now the problem reduces to finding probability of each edges from happening.

Complexity: O(N^2) to perform check for each edge. total O(N^3)
*/


class ClosestRabbit{
public:
    vector<string> board;
    double ncr[403][403];
    int M, N, R;
    int emp;

    int dist(int ia, int ja, int ib, int jb){
        return (ia-ib)*(ia-ib) + (ja-jb)*(ja-jb);
    }
    
    int countAvailable(int ia, int ja, int ib, int jb){
        int ret = 0;
        int mindist = dist(ia, ja, ib, jb);
        for(int i=0;i<M;++i){
            for(int j=0;j<N;++j){
                if(i == ia && j == ja) continue;
                if(i == ib && j == jb) continue;
                if(board[i][j] == '#' || mindist > dist(i,j,ia,ja) || mindist > dist(i,j,ib,jb)){
                    continue;
                }
                if(mindist == dist(i,j,ia,ja)){
                    if(i < ib) continue;
                    if(i == ib && j < jb) continue;
                }
                if(mindist == dist(i,j,ib,jb)){
                    if(i < ia) continue;
                    if(i == ia && j < ja) continue;
                }
                ++ret;
            }
        }
        return ret;
    }
    
    double prob(int ia, int ja, int ib, int jb){
        if(board[ia][ja] == '#' || board[ib][jb] == '#') return 0;
        int avail = countAvailable(ia,ja,ib,jb);
        return ncr[avail][R-2];
    }
    
    double getExpected(vector<string> board, int r){
        this->board = board;
        M = board.size();
        N = board[0].size();
        R = r;
        emp = 0;
        for(int i=0;i<M;++i){
            for(int j=0;j<N;++j){
                if(board[i][j] == '.') ++emp;
            }
        }
        for(int i=0;i<=emp;++i){
            ncr[i][0] = 1;
            if(i>0) ncr[i-1][i] = 0;
        }
        for(int i=1;i<=emp;++i){
            for(int j=i;j>=1;--j){
                ncr[i][j] = ncr[i-1][j] + ncr[i-1][j-1];
            }
        }
        double ans = 0;
        for(int ia=0;ia<M;++ia){
            for(int ja=0;ja<N;++ja){
                for(int ib=0;ib<M;++ib){
                    for(int jb=0;jb<N;++jb){
                        if(ib == ia && jb == ja) continue;
                        ans += prob(ia, ja, ib, jb);
                    }
                }
            }
        }
        return ans/ncr[emp][r]/2.0;
    }
};

Thursday, July 10, 2014

a bit of data structure: Fenwick Tree (Binary Indexed Tree)

Problem Statement:
Given a set of values \(i \in \{ 1, \ldots, n \} \) and set of frequencies \(f_i\):
1. find the cumulative frequency from i to j
2. support updating frequencies

Data Structure: Fenwick Tree (also known as Binary Indexed Tree)

Construction:
Every number can be represented the sum of powers of 2, which is the basis of the binary representation of numbers. Define LSB[i] as the least significant bit of i (the last bit 1 in the binary representation of i). For example:
1. LSB[3] = LSB[11] = 0
2. LSB[14] = LSB[1110] = 1
3. LSB[52] = LSB[110100] = 2

We define FTree[i] as the sum of frequencies from \( i - 2^{\text{LSB[i]}} + 1, \ldots, i\). This allows an iterative representation of Cumulative Frequency from 1 to i CF[i] in terms of FTree, for example:

1. CF[10001101] = FTree[10001100] + FTree[10001000] + FTree[10000000]
2. CF[1101] = FTree[1100] + FTree[1000]

As such, CF[i] can be found in \(O(\log{N})\) time, where \(N\) is the length of the binary representation of i.

In case we update the entry in FTree[i] by a value val, we iteratively add val to FTree[i + \(2^\text{LSB[i]}\)], hence updating all responsibility chain in FTree in \(O(\log{N})\) time as well.

The pseudocode for construction of Fenwick Tree:

LSB(i):
1. Return (i & ~i)

Build-FTree(array of frequency F):
1. let FTree[] be an array, FTree[0] = 0.
2. For each i from 1 to n = F.length
2.1 Set FTree[i] = sum of F[k] for k from (i - LSB(i) + 1) to i
3. Return FTree

Find-CF(FTree, i):
1. Set cumfreq = 0
2. While i != 0:
2.1 Set cumfreq = cumfreq + FTree[i]
2.2 Set i = i - LSB(i)
3. Retrun cumfreq

Update-FTree(FTree, i, val):
1. While i not greater than n
1.1 Set FTree[i] = FTree[i] + val
1.2 Set i = i + LSB(i)

The implementation is very sweet and simple :D

vector<int> ftree(11,0);
int maxval;

int lsb(int i) 
{
 return i & (-i);
}

void update(int i, int val)
{
 while(i <= maxval)
 {
  ftree[i] += val;
  i = i + lsb(i);
 }
}

int cumfreq(int i)
{
 int tot = 0;
 while( i > 0)
 {
  tot += ftree[i];
  i = i - lsb(i);
 }
 return tot;
}

a bit of cs