    c8b16ca0
    [SPARK-2850] [SPARK-2626] [mllib] MLlib stats examples + small fixes · c8b16ca0
    Joseph K. Bradley authored
    Added examples for statistical summarization:
    * Scala: StatisticalSummary.scala
    ** Tests: correlation, MultivariateOnlineSummarizer
    * python: statistical_summary.py
    ** Tests: correlation (since MultivariateOnlineSummarizer has no Python API)
    
    Added examples for random and sampled RDDs:
    * Scala: RandomAndSampledRDDs.scala
    * python: random_and_sampled_rdds.py
    * Both test:
    ** RandomRDDGenerators.normalRDD, normalVectorRDD
    ** RDD.sample, takeSample, sampleByKey
    
    Added sc.stop() to all examples.
    
    CorrelationSuite.scala
    * Added 1 test for RDDs with only 1 value
    
    RowMatrix.scala
    * numCols(): Added check for numRows = 0, with error message.
    * computeCovariance(): Added check for numRows <= 1, with error message.
    
    Python SparseVector (pyspark/mllib/linalg.py)
    * Added toDense() function
    
    python/run-tests script
    * Added stat.py (doc test)
    
    CC: mengxr dorx  Main changes were examples to show usage across APIs.
    
    Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
    
    Closes #1878 from jkbradley/mllib-stats-api-check and squashes the following commits:
    
    ea5c047 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
    dafebe2 [Joseph K. Bradley] Bug fixes for examples SampledRDDs.scala and sampled_rdds.py: Check for division by 0 and for missing key in maps.
    8d1e555 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
    60c72d9 [Joseph K. Bradley] Fixed stat.py doc test to work for Python versions printing nan or NaN.
    b20d90a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
    4e5d15e [Joseph K. Bradley] Changed pyspark/mllib/stat.py doc tests to use NaN instead of nan.
    32173b7 [Joseph K. Bradley] Stats examples update.
    c8c20dc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
    cf70b07 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
    0b7cec3 [Joseph K. Bradley] Small updates based on code review.  Renamed statistical_summary.py to correlations.py
    ab48f6e [Joseph K. Bradley] RowMatrix.scala * numCols(): Added check for numRows = 0, with error message. * computeCovariance(): Added check for numRows <= 1, with error message.
    65e4ebc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
    8195c78 [Joseph K. Bradley] Added examples for random and sampled RDDs: * Scala: RandomAndSampledRDDs.scala * python: random_and_sampled_rdds.py * Both test: ** RandomRDDGenerators.normalRDD, normalVectorRDD ** RDD.sample, takeSample, sampleByKey
    064985b [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
    ee918e9 [Joseph K. Bradley] Added examples for statistical summarization: * Scala: StatisticalSummary.scala ** Tests: correlation, MultivariateOnlineSummarizer * python: statistical_summary.py ** Tests: correlation (since MultivariateOnlineSummarizer has no Python API)
sampled_rdds.py 3.10 KiB
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

"""
Randomly sampled RDDs.
"""

import sys

from pyspark import SparkContext
from pyspark.mllib.util import MLUtils


if __name__ == "__main__":
    if len(sys.argv) not in [1, 2]:
        print >> sys.stderr, "Usage: sampled_rdds [libsvm data file]"
        sys.exit(-1)
    if len(sys.argv) == 2:
        datapath = sys.argv[1]
    else:
        datapath = 'data/mllib/sample_binary_classification_data.txt'

    sc = SparkContext(appName="PythonSampledRDDs")

    fraction = 0.1  # fraction of data to sample

    examples = MLUtils.loadLibSVMFile(sc, datapath)
    numExamples = examples.count()
    if numExamples == 0:
        print >> sys.stderr, "Error: Data file had no samples to load."
        sys.exit(1)
    print 'Loaded data with %d examples from file: %s' % (numExamples, datapath)

    # Example: RDD.sample() and RDD.takeSample()
    expectedSampleSize = int(numExamples * fraction)
    print 'Sampling RDD using fraction %g.  Expected sample size = %d.' \
        % (fraction, expectedSampleSize)
    sampledRDD = examples.sample(withReplacement=True, fraction=fraction)
    print '  RDD.sample(): sample has %d examples' % sampledRDD.count()
    sampledArray = examples.takeSample(withReplacement=True, num=expectedSampleSize)
    print '  RDD.takeSample(): sample has %d examples' % len(sampledArray)

    print

    # Example: RDD.sampleByKey()
    keyedRDD = examples.map(lambda lp: (int(lp.label), lp.features))
    print '  Keyed data using label (Int) as key ==> Orig'
    #  Count examples per label in original data.
    keyCountsA = keyedRDD.countByKey()

    #  Subsample, and count examples per label in sampled data.
    fractions = {}
    for k in keyCountsA.keys():
        fractions[k] = fraction
    sampledByKeyRDD = keyedRDD.sampleByKey(withReplacement=True, fractions=fractions)
    keyCountsB = sampledByKeyRDD.countByKey()
    sizeB = sum(keyCountsB.values())
    print '  Sampled %d examples using approximate stratified sampling (by label). ==> Sample' \
        % sizeB

    #  Compare samples
    print '   \tFractions of examples with key'
    print 'Key\tOrig\tSample'
    for k in sorted(keyCountsA.keys()):
        fracA = keyCountsA[k] / float(numExamples)
        if sizeB != 0:
            fracB = keyCountsB.get(k, 0) / float(sizeB)
        else:
            fracB = 0
        print '%d\t%g\t%g' % (k, fracA, fracB)

    sc.stop()
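
The per-key comparison at the end of the script can be tried without a Spark cluster. The following is a minimal pure-Python sketch of approximate stratified sampling, not Spark's implementation: the helper names `sample_by_key` and `key_fractions` are hypothetical, and the sampling here is Bernoulli (without replacement), whereas the script above calls `sampleByKey` with `withReplacement=True`. It keeps the same two guards as the script: a key missing from the sample counts as 0, and an empty sample does not cause a division by zero.

```python
import random
from collections import Counter


def sample_by_key(pairs, fractions, seed=None):
    """Keep each (key, value) pair independently with probability fractions[key].

    Keys absent from `fractions` are never sampled (hypothetical convention).
    """
    rng = random.Random(seed)
    return [(k, v) for (k, v) in pairs if rng.random() < fractions.get(k, 0.0)]


def key_fractions(pairs):
    """Return {key: share of pairs with that key}; empty input gives {}."""
    counts = Counter(k for k, _ in pairs)
    total = sum(counts.values())
    if total == 0:
        # Guard against division by zero when nothing was sampled.
        return {}
    return {k: c / float(total) for k, c in counts.items()}


if __name__ == "__main__":
    # Toy keyed data: two labels (0 and 1), 500 examples each.
    data = [(i % 2, i) for i in range(1000)]
    sample = sample_by_key(data, {0: 0.1, 1: 0.1}, seed=42)

    orig = key_fractions(data)
    samp = key_fractions(sample)
    print('Key\tOrig\tSample')
    for k in sorted(orig):
        # A key missing from the sample is reported as fraction 0.
        print('%d\t%g\t%g' % (k, orig[k], samp.get(k, 0.0)))
```

Because every key uses the same sampling fraction, the per-key shares in the sample should land close to those in the original data, which is the property the script's final table is meant to show.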