A similarity search in Hamming space finds binary vectors whose Hamming distances are no more than a threshold from a query vector. It is a fundamental problem in many applications, including image retrieval, near-duplicate Web page detection, and machine learning. State-of-the-art approaches to answering such queries are mainly based on the pigeonhole principle to generate a set of candidates and then verify them.
We observe that the constraint based on the pigeonhole principle is not always tight and hence may bring about unnecessary candidates. We also observe that the distribution in real data is often skew, but most existing solutions adopt a simple equi-width partitioning and allocate the same threshold to all the partitions, and hence fail to exploit the data skewness to optimize the query processing. In this paper, we propose a new form of the pigeonhole principle which allows variable partition size and threshold. Based on the new principle, we first develop a tight constraint of candidates, and then devise cost-aware methods for dimension partitioning and threshold allocation to optimize query processing. Our evaluation on datasets with various data distributions shows the robustness of our solution and its superior query processing performance to the state-of-the-art methods.