-
Notifications
You must be signed in to change notification settings - Fork 0
/
appendix.tex
220 lines (189 loc) · 13 KB
/
appendix.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
\begin{comment}
fixtex --fpaths appendix.tex --outline --asmarkdown --numlines=999 --shortcite -w && ./checklang.py outline_appendix.md
\end{comment}
\appendix % This command is used only once!
\addcontentsline{toc}{chapter}{APPENDICES} %toc entry or:
%\addtocontents{toc}{\parindent0pt\vskip12pt APPENDICES} %toc entry, no page #
%\chapter{Appendix}
\crefalias{section}{appsec}
\crefalias{chapter}{appsec}
\chapter{OCCURRENCES}\label{app:occurgroup}
In the identification workflow we separate groups of images into \glossterm{occurrences}.
The Darwin Core defines an occurrence as a collection of evidence that shows an organism exists within
specific location and span of time~\cite{wieczorek_darwin_2012}.
For our purposes this amounts to a cluster of images localized in space and time.
We outline an occurrence grouping algorithm performs agglomerative clustering on the GPS coordinates and time
specified in the image metadata.
We first describe the space-time measure of distance between images and then describe the clustering
algorithm.
\section{SPACE-TIME IMAGE DISTANCE}
To compute occurrences we define a space-time feature $\g_i$ for each image $i$, and a pairwise distance,
$\Delta(\g_i, \g_j)$, between these features.
This feature will a two-dimensional feature tuple, %
$\g_i = \paren{\time_i, \gps_i}$, where the first component is the POSIX timestamp $\time_i$, and the second
component is a GPS coordinate %
$\gps_i = \brak{\lat_i, \lon_i}^{T}$, where the angles of latitude and longitude are measured in radians.
To compute this distance between two images $\g_i$ and $\g_j$ we first compute the distance in each component
of the feature tuple.
The difference in time is the absolute value of the timedelta, %
$\Delta_t(\g_i, \g_j) = \abs{\time_i - \time_j}$, which is in seconds.
% DISTANCE BETWEEN TWO IMAGES (space and final)
Next, the distance in space is computed by approximating the Earth as a sphere.
In general, the distance between two points on a sphere with radius $r$ is a function of inverse haversines,
and is expressed as:
\begin{equation}\label{eqn:geodistance}
d(\gps_i, \gps_j, r) =
2 r \asin{\sqrt{
\haversine{\lat_i - \lat_j} +
\haversine{\lon_i - \lon_j} +
\cos\paren{\lat_i} \cos\paren{\lat_j}}}
\end{equation}
In the previous equation, $\haversine{\theta} = \haversineFULL{\theta}$ is the half vertical sine function.
Thus, we arrive at the spatial distance between two images by estimating the radius of the earth to be
$r=6367$ kilometers.
\begin{equation}
\Delta_s(\g_i, \g_j) = d(\gps_i, \gps_j, 6367).
\end{equation}
This results in distance in seconds and a distance in kilometers, which are in incompatible units.
To combine these distances we convert kilometers to seconds by heuristically estimating the walking speed,
$S$, of an animal (for zebras we use $S=2\sciE{-3}$ kilometers per second).
This allows us to cancel kilometers from the expression and express GPS distance as a unit of time:
$\frac{\Delta_s(\g_i, \g_j)}{S}$.
This distance can be interpreted as the total amount of time it would take an animal to move between two
points.
The total distance between two images is the sum of these components.
\begin{equation}\label{eqn:imgdist}
\Delta(\g_i, \g_j) = \Delta_t(\g_i, \g_j) + \frac{\Delta_s(\gps_i, \gps_j)}{S}
\end{equation}
Notice that if there is no difference in GPS location, then this measure
becomes to a distance in time.
\section{CLUSTERING PROCEDURE}
Having defined pairwise a distance between two images, we use the agglomerative clustering procedures
implemented in SciPy~\cite{eric_jones_scipy_2001} to group the images.
There are two inputs to the agglomerative clustering algorithm:
(1) The matrix of pairwise distance between images, and
(2) the minimum distance threshold between two images.
The matrix of distances is computed using~\cref{eqn:imgdist}, and we set the distance threshold to $30$
minutes.
Any pair of images that is within this threshold connected via a linkage matrix.
Connected components in this matrix form the final clusters that we use as occurrences{}.
To improve the efficiency of the algorithm, we separate it into disjoint parts by sorting the images by
timestamp and splitting the data wherever the difference in consecutive timestamps exceeds the threshold.
Images that are missing either timestamp or GPS location are grouped by what data they do have and clustered
separately.
%\section{DISCUSSION OF OCCURRENCES}
%These computed occurrences are valuable measurements for multiple components of the IBEIS software.
%At its core an occurrence describes \wquest{when} a group of animals was seen and \wquest{where} that group
% was seen.
%However, to answer the questions like \wquest{how many} animals there were, \wquest{who} an animal is,
% \wquest{who else} is an animal with, and \wquest{where else} have these animals been seen, the \annots{} in
% the occurrence must be grouped into individual \encounters{} and then matched against the \masterdatabase{}.
\begin{comment}
\chapter{Converting existing datasets to decision graphs}\label{sec:rename}
In this section we briefly discuss the problem of applying graph identification from~\cref{chap:graphid} to
existing databases.
Most datasets used in practice ignore detailed connectivity information and simply associate a name label with a
database of cropped (or sometimes un-cropped) images.
Because graph identification relies on this detailed connectivity, we must reconstruct it before new images can
be added.
To apply graph identification to a previously existing dataset where annotations have been assigned name labels
and connectivity between the annotations is unknown use follow the following process.
First we compute the pairwise probabilities between each pair of annotations labeled with the same name.
Then, we automatically classify any edge above a threshold as positive, negative, or incomparable.
For any set of nodes originally labeled with the same name, we compute an edge augmentation to connect the PCC as
detailed in~\cref{subsec:augredun} and insert these edges into the graph, labeling them as positive but assigning
them the confidence of guessing.
Note that any edge labeled as negative in the classification step will result in an inconsistency because it will
be a negative inside a PCC.
It will be common for such datasets to contain errors, we resolve any inconsistent PCCs using the algorithm
from~\cref{sec:incon}, but then we search for additional split cases using the pairwise classifier.
The main idea is to re-review all edges where the pairwise classifier prediction disagrees with its assigned
match-state.
Edges are sorted by the magnitude of the disagreement, but any edge with a confidence of absolutely-sure is
ignored.
This will present edges labeled as guessing for the user to re-review.
At this point the dataset is in a legal state, where the name labels correspond to PCCs.
The final step is to execute normal graph review in order to find any merge cases and explicitly label hard
negative edges.
In the case that a pairwise classifier does not exist, then one can be trained using the temporary edges defined
by the maximum spanning trees of the PCCs.
It is sometimes desirable new PCCs to keep the old name labels from the original database (\eg{} sometimes
ecologists encoded information in these names).
This is a simple matter when the original database contained no mistakes, but when the original database contains
errors care must be taken.
We address this problem by seeking to minimize the number of annotations that have their name label changed from
the original dataset.
This can be computing by finding a maximum linear sum assignment using the Munkres algorithm implemented in
SciPy~\cite{eric_jones_scipy_2001}.
We create a matrix where each row represents a group of annotations in the same PCC and each column represents an
original name.
If there are more PCCs than original names, then the columns are padded with extra values.
The matrix is first initialized to be negative infinity representing impossible assignments.
Then for each column representing a padded name, we set we its value to $1$ indicating that each new name could
be assigned to a padded name for some small profit.
Finally, we encode both the profit of assigning a new name with an original name and the extra one ensures that
these original names are always preferred over padded names.
Let $f_{rc}$ be the number of annotations in row $r$ with an original name of $c$, and set matrix value %
$(r, c)$ to $f_{rc} + 1$ if $f_{rc} > 0$.
The maximum linear sum assignment of this matrix results in the optimal consistent assignment of PCCs to original
name labels.
\end{comment}
\begin{comment}
\chapter{Sight-resight analysis with incomparability}\label{app:markrecapincomp}
Given a completed decision graph, each PCC corresponds to an individual, and each pair of PCCs is either known to
be different or known to be incomparable.
If all PCCs are known to be different, then sight-resight analysis is simple.
Each PCC can be grouped into a set of encounters.
The chronologically first encounter is a sighting, and the subsequent encounters are re-sightings.
However, if it cannot be determined that some pairs of PCCs are different, we must use only the first or second
of these PCCs in our analysis, and the other must be discarded.
Finding the largest set of PCCs that can be used in sight-resight statistics, we must find the largest set of
PCCs that are all comparable to each other.
This problem can be solved in the following steps:
\begin{enumln}
\item Find all pairs of PCCs that are incomparable.
\item Consider the meta-graph where each of these incomparable PCCs is a node and there is an edge between each
pair.
\item Find the largest independent set in this meta-graph (note the is NP-hard).
\item Remove all nodes from the decision graph corresponding to the PCCs in the meta-graph that were not in the
independent set.
\end{enumln}
Now, in the original graph there is no PCC that is incomparable with any other, otherwise there would have been
two nodes in the independent set that had an edge between them, which is a contradiction.
Even though finding the largest independent set would use the most data, any independent set will do.
Thus, sight-resight statistics can now be performed on this graph.
\end{comment}
%\chapter{Properties of the graph algorithm}\label{app:graphprop}
%\begin{enumln}
%\item Finding the largest set of comparable PCCs is a generalization to the independent set problem.
%Using the meta-graph where PCCs are nodes, and an edge is drawn between any PCCs known to have no comparable
% pairs of annotations, any independent set of PCCs will all be comparable to one another.
%The largest independent set will be the largest set of comparable PCCs.
%\item The graph identification procedure is agnostic to the underlying computer vision algorithms.
% Any ranking and verification algorithm can be substituted in.
%\item The graph identification procedure can be used without computer vision algorithms to facilitate a more
% efficient brute-force search, which can be useful to label small datasets.
%This is done by using all edges as candidate edges, and each edge is given a priority of $0$.
%The refresh criterion is not used, and the algorithm stops when the queue is empty.
%The positive and negative redundancy criteria will remove edges from the queue, so less than $|V|^2$ manual
% reviews will have to be done.
%\item When $k=2$ and edges are only added to the graph exactly to the specifications of the graph algorithm, it
% is impossible for a PCC to contain more than one negative edge at any time.
%\item When $k=1$, reviewing edges in order of positive match probability minimizes the expected number of total
% reviews.
%\item At the end of the graph algorithm every PCC will be $k$-positive-redundant.
%\item In recovery mode, consider that the hypothesis algorithm has generated a set of edges.
%The user has reviewed some of these edges and agreed with each hypothesis so far.
%In this case the remaining edges are necessarily a minimum cut.
%This is because the edges reviewed so far have had their label changed, meaning that they no longer connect
% terminal nodes.
%In this case, if hypothesis generation is recomputed, then the new hypothesis will be the same as the set of
% remaining edges if the weights in the graph are unique.
%If the weights in the graph are not unique, then the implementation of min-cut can be chosen to enforce that the
% new set is the same as the remaining set.
%\item A review can have one of the following mutually exclusive effects on the PCCs of the graph:
% (1) merge exactly two PCCs into one PCC.
% (2) split a single PCC into exactly two PCCs.
% (3) do nothing to the PCCs, but potentially influence redundancy.
%\end{enumln}
%\end{appendices}