  1. Jan 13, 2015
      [SPARK-5223] [MLlib] [PySpark] fix MapConverter and ListConverter in MLlib · 8ead999f
      Davies Liu authored
      MapConverter and ListConverter cause problems when a dict, list, or tuple contains objects that py4j does not support, such as Vector.
      Pickle may also perform better for larger objects because it needs fewer RPC calls.
      
      In cases where the objects in a dict or list cannot be pickled (such as JavaObject), we should still use MapConverter/ListConverter.
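
The pickle-first strategy with a converter fallback described above can be sketched in plain Python. This is a minimal illustration, not the Spark implementation: `to_transport` and `convert_elementwise` are hypothetical names, and the converter argument stands in for whatever element-wise conversion (such as py4j's MapConverter/ListConverter) the caller would supply.

```python
import pickle


def to_transport(obj, convert_elementwise):
    """Prefer one pickled payload; fall back to element-wise conversion.

    `convert_elementwise` is a hypothetical stand-in for a converter that
    handles objects which cannot be pickled.
    """
    try:
        # One pickled blob means a single transfer instead of one per element.
        return ("pickled", pickle.dumps(obj, protocol=2))
    except Exception:
        # Some objects cannot be pickled; convert them element by element.
        return ("converted", convert_elementwise(obj))
```

The broad `except` is a simplification for the sketch; real code would likely catch only the pickling errors it expects.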
      
      This PR should be ported into branch-1.2
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4023 from davies/listconvert and squashes the following commits:
      
      55d4ab2 [Davies Liu] fix MapConverter and ListConverter in MLlib
  2. Nov 21, 2014
      [SPARK-4531] [MLlib] cache serialized java object · ce95bd8e
      Davies Liu authored
      Pyrolite is pretty slow (compared to the ad hoc serializer in 1.1). It causes a significant performance regression in 1.2, because we cache the serialized Python objects in the JVM and deserialize them into Java objects in each step.
      
      This PR changes to caching the deserialized JavaRDD instead of the PythonRDD, which avoids the repeated Pyrolite deserialization. It should have similar memory usage as before, but be much faster.
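
The cost difference between the two caching choices can be illustrated with a small, Spark-free sketch. Everything here is an assumption made for illustration: `CountingLoader` and both `run_steps_*` functions are hypothetical, and `pickle` stands in for Pyrolite. The point is only that caching serialized records pays the deserialization cost on every step, while caching deserialized records pays it once.

```python
import pickle


class CountingLoader(object):
    """Deserializer that counts how often it is called."""

    def __init__(self):
        self.calls = 0

    def loads(self, blob):
        self.calls += 1
        return pickle.loads(blob)


def run_steps_on_serialized(blobs, loader, steps):
    # Cache holds serialized records: every step deserializes every record.
    total = 0.0
    for _ in range(steps):
        for blob in blobs:
            total += sum(loader.loads(blob))
    return total


def run_steps_on_deserialized(blobs, loader, steps):
    # Cache holds deserialized records: deserialize once, reuse in each step.
    cached = [loader.loads(blob) for blob in blobs]
    total = 0.0
    for _ in range(steps):
        for record in cached:
            total += sum(record)
    return total
```

With 2 records and 5 steps, the first variant deserializes 10 times and the second only 2 times, while both compute the same result.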
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3397 from davies/cache and squashes the following commits:
      
      7f6e6ce [Davies Liu] Update -> Updater
      4b52edd [Davies Liu] using named argument
      63b984e [Davies Liu] fix
      7da0332 [Davies Liu] add unpersist()
      dff33e1 [Davies Liu] address comments
      c2bdfc2 [Davies Liu] refactor
      d572f00 [Davies Liu] Merge branch 'master' into cache
      f1063e1 [Davies Liu] cache serialized java object
  3. Nov 04, 2014
      [SPARK-3964] [MLlib] [PySpark] add Hypothesis test Python API · c8abddc5
      Davies Liu authored
      ```
      pyspark.mllib.stat.Statistics.chiSqTest(observed, expected=None)
          :: Experimental ::
      
          If `observed` is Vector, conduct Pearson's chi-squared goodness
          of fit test of the observed data against the expected distribution,
          or against the uniform distribution (by default), with each category
          having an expected frequency of `1 / len(observed)`.
          (Note: `observed` cannot contain negative values)
      
          If `observed` is matrix, conduct Pearson's independence test on the
          input contingency matrix, which cannot contain negative entries or
          columns or rows that sum up to 0.
      
          If `observed` is an RDD of LabeledPoint, conduct Pearson's independence
          test for every feature against the label across the input RDD.
          For each feature, the (feature, label) pairs are converted into a
          contingency matrix for which the chi-squared statistic is computed.
          All label and feature values must be categorical.
      
          :param observed: it could be a vector containing the observed categorical
                           counts/relative frequencies, or the contingency matrix
                           (containing either counts or relative frequencies),
                           or an RDD of LabeledPoint containing the labeled dataset
                           with categorical features. Real-valued features will be
                           treated as categorical for each distinct value.
          :param expected: Vector containing the expected categorical counts/relative
                           frequencies. `expected` is rescaled if the `expected` sum
                           differs from the `observed` sum.
          :return: ChiSquaredTest object containing the test statistic, degrees
                   of freedom, p-value, the method used, and the null hypothesis.
      ```
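
The goodness-of-fit case in the docstring above can be worked through in plain Python. This is a minimal sketch of the Pearson statistic only, assuming the rescaling rule stated for `expected`; it is not the MLlib implementation, `chi_squared_gof` is a hypothetical name, and it omits the p-value, which needs the chi-squared distribution function.

```python
def chi_squared_gof(observed, expected=None):
    """Return (statistic, degrees_of_freedom) for a goodness-of-fit test."""
    n = len(observed)
    if n == 0:
        raise ValueError("observed must not be empty")
    if any(x < 0 for x in observed):
        raise ValueError("observed cannot contain negative values")
    total = float(sum(observed))
    if expected is None:
        # Uniform distribution: every category expects total / n.
        expected = [total / n] * n
    else:
        if len(expected) != n:
            raise ValueError("observed and expected must have the same length")
        if any(e <= 0 for e in expected):
            raise ValueError("expected values must be positive")
        # Rescale expected so that it sums to the same total as observed.
        scale = total / sum(expected)
        expected = [e * scale for e in expected]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, n - 1
```

For `observed = [10, 20, 30]` against the uniform default, each category expects 20, so the statistic is 100/20 + 0 + 100/20 = 10 with 2 degrees of freedom. Passing `expected = [1, 2, 3]` rescales to `[10, 20, 30]` and gives a statistic of 0.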
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3091 from davies/his and squashes the following commits:
      
      145d16c [Davies Liu] address comments
      0ab0764 [Davies Liu] fix float
      5097d54 [Davies Liu] add Hypothesis test Python API
      [SPARK-3886] [PySpark] simplify serializer, use AutoBatchedSerializer by default. · e4f42631
      Davies Liu authored
      This PR simplifies the serializer and always uses a batched serializer (AutoBatchedSerializer by default), even when the batch size is 1.
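
To make the idea of an automatically batched serializer concrete, here is a rough, Spark-free sketch. It is written under assumptions: the function names and the `best_size` parameter are hypothetical, and the size-based doubling and halving heuristic is an illustration of how a batch size could adapt, not a copy of PySpark's code.

```python
import itertools
import pickle


def dump_auto_batched(items, best_size=1 << 16):
    """Yield pickled batches, adapting the batch size to the payload size."""
    iterator = iter(items)
    batch = 1  # start with a batch size of 1, still written as a batch
    while True:
        chunk = list(itertools.islice(iterator, batch))
        if not chunk:
            break
        blob = pickle.dumps(chunk, protocol=2)
        yield blob
        if len(blob) < best_size:
            batch *= 2   # payload is small: put more items in the next batch
        elif len(blob) > best_size * 10 and batch > 1:
            batch //= 2  # payload is much too large: shrink the batch


def load_batched(blobs):
    """Invert dump_auto_batched: every blob is a list, even with one item."""
    for blob in blobs:
        for item in pickle.loads(blob):
            yield item
```

Because every payload is a list, the reader needs no separate code path for unbatched data, which is the simplification the commit message describes.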
      
      Author: Davies Liu <davies@databricks.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Josh Rosen <joshrosen@databricks.com>
      
      Closes #2920 from davies/fix_autobatch and squashes the following commits:
      
      e544ef9 [Davies Liu] revert unrelated change
      6880b14 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      1d557fc [Davies Liu] fix tests
      8180907 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      76abdce [Davies Liu] clean up
      53fa60b [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      d7ac751 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      2cc2497 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      b4292ce [Davies Liu] fix bug in master
      d79744c [Davies Liu] recover hive tests
      be37ece [Davies Liu] refactor
      eb3938d [Davies Liu] refactor serializer in scala
      8d77ef2 [Davies Liu] simplify serializer, use AutoBatchedSerializer by default.
  4. Oct 31, 2014
      [SPARK-4124] [MLlib] [PySpark] simplify serialization in MLlib Python API · 872fc669
      Davies Liu authored
      Create several helper functions that call the MLlib Java API, converting the arguments to Java types and the return value to a Python object automatically. This greatly simplifies serialization in the MLlib Python API.
      
      After this, the MLlib Python API no longer needs to deal with serialization details, so it is easier to add new APIs.
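
The helper pattern described above can be sketched generically. The names below (`call_with_conversion`, `to_backend`, `from_backend`) are hypothetical and do not come from the commit; the sketch only shows how one wrapper can centralize argument and return-value conversion so that individual API methods never touch serialization.

```python
def call_with_conversion(func, to_backend, from_backend, *args):
    """Call `func` with converted arguments and convert its result back.

    `to_backend` and `from_backend` are hypothetical converters standing in
    for the Python-to-Java and Java-to-Python conversions.
    """
    converted = [to_backend(arg) for arg in args]
    return from_backend(func(*converted))


def concat_api(left, right):
    # Example API method: no conversion logic appears here at all.
    # The lambda is a stand-in for a backend call that works on tuples.
    return call_with_conversion(lambda a, b: a + b, tuple, list, left, right)
```

A new API method then reduces to a single call through the wrapper, as `concat_api` shows.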
      
      cc mengxr
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #2995 from davies/cleanup and squashes the following commits:
      
      8fa6ec6 [Davies Liu] address comments
      16b85a0 [Davies Liu] Merge branch 'master' of github.com:apache/spark into cleanup
      43743e5 [Davies Liu] bugfix
      731331f [Davies Liu] simplify serialization in MLlib Python API