Abstract

Given a data set \(\mathcal\{D\}\) containing millions of data points and a data consumer who is willing to pay for \$\(X\) to train a machine learning (ML) model over \(\mathcal\{D\}\), how should we distribute this \$\(X\) to each data point to reflect its "value"? In this paper, we define the "relative value of data" via the Shapley value, as it uniquely possesses properties with appealing real-world interpretations, such as fairness, rationality and decentralizability. For general, bounded utility functions, the Shapley value is known to be challenging to compute: to get Shapley values for all \(N\) data points, it requires \(O(2^N)\) model evaluations for exact computation and \(O(Nlog N)\) for \((\epsilon, \delta)\)-approximation. In this paper, we focus on one popular family of ML models relying on \(K\)-nearest neighbors (\(K\)NN). The most surprising result is that for unweighted \(K\)NN classifiers and regressors, the Shapley value of all \(N\) data points can be computed, ex

Authors

(none)

Tags

  • Uncategorized

Stats

  • citations125
  • S2 citationsβ€”
  • github stars0
  • HF likes0
  • heat score15.75
  • arxiv keyjia2019efficient

Related papers