  1. May 29, 2015
    • [SPARK-7922] [MLLIB] use DataFrames for user/item factors in ALSModel · db951378
      Xiangrui Meng authored
      Expose user/item factors in DataFrames. This is to be more consistent with the pipeline API. It also helps maintain consistent APIs across languages. This PR also removed fitting params from `ALSModel`.
      
      coderxiang
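
      A minimal sketch of the new DataFrame-backed factors, assuming a live `sqlContext` and made-up column names; `userFactors` and `itemFactors` are the DataFrames this change exposes:

      ```
      from pyspark.ml.recommendation import ALS

      ratings = sqlContext.createDataFrame(
          [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 5.0)],
          ["user", "item", "rating"])
      model = ALS(rank=5, userCol="user", itemCol="item", ratingCol="rating").fit(ratings)

      # Each factor DataFrame has an `id` column and a `features` column.
      model.userFactors.show()
      model.itemFactors.show()
      ```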
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6468 from mengxr/SPARK-7922 and squashes the following commits:
      
      7bfb1d5 [Xiangrui Meng] update ALSModel in PySpark
      1ba5607 [Xiangrui Meng] use DataFrames for user/item factors in ALS
  2. Apr 16, 2015
    • [SPARK-4897] [PySpark] Python 3 support · 04e44b37
      Davies Liu authored
      This PR updates PySpark to support Python 3 (tested with 3.4).

      Known issue: unpickling arrays from Pyrolite is broken in Python 3; those tests are skipped.

      TODO: ec2/spark-ec2.py is not fully tested with Python 3.
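
      The squashed commits below touch many small 2/3 differences (xrange, imap, integer division, string hashing); an illustrative shim of the kind involved, using names that are not Spark's own helpers:

      ```
      import sys

      if sys.version_info[0] >= 3:
          xrange = range               # Python 3 removed xrange
          imap = map                   # map is already lazy in Python 3
      else:
          from itertools import imap   # keep the lazy variant on Python 2

      def mean(values):
          values = list(values)
          return sum(values) / float(len(values))  # float() keeps 2/3 division behavior identical
      ```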
      
      Author: Davies Liu <davies@databricks.com>
      Author: twneale <twneale@gmail.com>
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #5173 from davies/python3 and squashes the following commits:
      
      d7d6323 [Davies Liu] fix tests
      6c52a98 [Davies Liu] fix mllib test
      99e334f [Davies Liu] update timeout
      b716610 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      cafd5ec [Davies Liu] address comments from @mengxr
      bf225d7 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      179fc8d [Davies Liu] tuning flaky tests
      8c8b957 [Davies Liu] fix ResourceWarning in Python 3
      5c57c95 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      4006829 [Davies Liu] fix test
      2fc0066 [Davies Liu] add python3 path
      71535e9 [Davies Liu] fix xrange and divide
      5a55ab4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      125f12c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ed498c8 [Davies Liu] fix compatibility with python 3
      820e649 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      e8ce8c9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ad7c374 [Davies Liu] fix mllib test and warning
      ef1fc2f [Davies Liu] fix tests
      4eee14a [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      20112ff [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      59bb492 [Davies Liu] fix tests
      1da268c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ca0fdd3 [Davies Liu] fix code style
      9563a15 [Davies Liu] add imap back for python 2
      0b1ec04 [Davies Liu] make python examples work with Python 3
      d2fd566 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      a716d34 [Davies Liu] test with python 3.4
      f1700e8 [Davies Liu] fix test in python3
      671b1db [Davies Liu] fix test in python3
      692ff47 [Davies Liu] fix flaky test
      7b9699f [Davies Liu] invalidate import cache for Python 3.3+
      9c58497 [Davies Liu] fix kill worker
      309bfbf [Davies Liu] keep compatibility
      5707476 [Davies Liu] cleanup, fix hash of string in 3.3+
      8662d5b [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      f53e1f0 [Davies Liu] fix tests
      70b6b73 [Davies Liu] compile ec2/spark_ec2.py in python 3
      a39167e [Davies Liu] support customize class in __main__
      814c77b [Davies Liu] run unittests with python 3
      7f4476e [Davies Liu] mllib tests passed
      d737924 [Davies Liu] pass ml tests
      375ea17 [Davies Liu] SQL tests pass
      6cc42a9 [Davies Liu] rename
      431a8de [Davies Liu] streaming tests pass
      78901a7 [Davies Liu] fix hash of serializer in Python 3
      24b2f2e [Davies Liu] pass all RDD tests
      35f48fe [Davies Liu] run future again
      1eebac2 [Davies Liu] fix conflict in ec2/spark_ec2.py
      6e3c21d [Davies Liu] make cloudpickle work with Python3
      2fb2db3 [Josh Rosen] Guard more changes behind sys.version; still doesn't run
      1aa5e8f [twneale] Turned out `pickle.DictionaryType is dict` == True, so swapped it out
      7354371 [twneale] buffer --> memoryview  I'm not super sure if this is a valid change, but the 2.7 docs recommend using memoryview over buffer where possible, so hoping it'll work.
      b69ccdf [twneale] Uses the pure python pickle._Pickler instead of c-extension _pickle.Pickler. It appears pyspark 2.7 uses the pure python pickler as well, so this shouldn't degrade pickling performance (?).
      f40d925 [twneale] xrange --> range
      e104215 [twneale] Replaces 2.7 types.InstanceType with 3.4 `object`... could be horribly wrong depending on how types.InstanceType is used elsewhere in the package; see http://bugs.python.org/issue8206
      79de9d0 [twneale] Replaces python2.7 `file` with 3.4 _io.TextIOWrapper
      2adb42d [Josh Rosen] Fix up some import differences between Python 2 and 3
      854be27 [Josh Rosen] Run `futurize` on Python code:
      7c5b4ce [Josh Rosen] Remove Python 3 check in shell.py.
  3. Mar 17, 2015
    • [SPARK-6226][MLLIB] add save/load in PySpark's KMeansModel · c94d0626
      Xiangrui Meng authored
      Use `_py2java` and `_java2py` to convert the Python model to/from the Java model. yinxusen
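
      A minimal round-trip sketch of the new methods; the path and toy data are made up, and `sc` is an existing SparkContext:

      ```
      from pyspark.mllib.clustering import KMeans, KMeansModel

      data = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])
      model = KMeans.train(data, k=2)
      model.save(sc, "/tmp/kmeans_model")                    # Python model -> Java model (via _py2java)
      sameModel = KMeansModel.load(sc, "/tmp/kmeans_model")  # Java model -> Python model (via _java2py)
      ```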
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #5049 from mengxr/SPARK-6226-mengxr and squashes the following commits:
      
      570ba81 [Xiangrui Meng] fix python style
      b10b911 [Xiangrui Meng] add save/load in PySpark's KMeansModel
  4. Feb 20, 2015
    • [SPARK-5867] [SPARK-5892] [doc] [ml] [mllib] Doc cleanups for 1.3 release · 4a17eedb
      Joseph K. Bradley authored
      For SPARK-5867:
      * The spark.ml programming guide needs to be updated to use the new SQL DataFrame API instead of the old SchemaRDD API.
      * It should also include Python examples now.
      
      For SPARK-5892:
      * Fix Python docs
      * Various other cleanups
      
      BTW, I accidentally merged this with master.  If you want to compile it on your own, use this branch which is based on spark/branch-1.3 and cherry-picks the commits from this PR: [https://github.com/jkbradley/spark/tree/doc-review-1.3-check]
      
      CC: mengxr  (ML),  davies  (Python docs)
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #4675 from jkbradley/doc-review-1.3 and squashes the following commits:
      
      f191bb0 [Joseph K. Bradley] small cleanups
      e786efa [Joseph K. Bradley] small doc corrections
      6b1ab4a [Joseph K. Bradley] fixed python lint test
      946affa [Joseph K. Bradley] Added sample data for ml.MovieLensALS example.  Changed spark.ml Java examples to use DataFrames API instead of sql()
      da81558 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into doc-review-1.3
      629dbf5 [Joseph K. Bradley] Updated based on code review: * made new page for old migration guides * small fixes * moved inherit_doc in python
      b9df7c4 [Joseph K. Bradley] Small cleanups: toDF to toDF(), adding s for string interpolation
      34b067f [Joseph K. Bradley] small doc correction
      da16aef [Joseph K. Bradley] Fixed python mllib docs
      8cce91c [Joseph K. Bradley] GMM: removed old imports, added some doc
      695f3f6 [Joseph K. Bradley] partly done trying to fix inherit_doc for class hierarchies in python docs
      a72c018 [Joseph K. Bradley] made ChiSqTestResult appear in python docs
      b05a80d [Joseph K. Bradley] organize imports. doc cleanups
      e572827 [Joseph K. Bradley] updated programming guide for ml and mllib
  5. Jan 13, 2015
    • [SPARK-5223] [MLlib] [PySpark] fix MapConverter and ListConverter in MLlib · 8ead999f
      Davies Liu authored
      It will introduce problems if an object in a dict/list/tuple cannot be supported by py4j, such as a Vector.
      Also, pickle may have better performance for larger objects (fewer RPC calls).

      In cases where the object in a dict/list cannot be pickled (such as a JavaObject), we should still use MapConverter/ListConverter.

      This PR should be ported into branch-1.2.
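
      A hedged sketch of the two paths described above; `MapConverter`/`ListConverter` are py4j's public classes, while the logic MLlib uses to choose between them is only summarized in the comments:

      ```
      from py4j.java_collections import ListConverter, MapConverter

      # Plain values that py4j understands can be converted element by element:
      params = {"k": 3, "maxIterations": 20}
      jmap = MapConverter().convert(params, sc._gateway._gateway_client)
      jlist = ListConverter().convert([1, 2, 3], sc._gateway._gateway_client)

      # Objects py4j cannot handle (e.g. a Vector or another JavaObject) are
      # pickled as a whole instead, which also means fewer RPC calls.
      ```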
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4023 from davies/listconvert and squashes the following commits:
      
      55d4ab2 [Davies Liu] fix MapConverter and ListConverter in MLlib
  6. Nov 21, 2014
    • [SPARK-4531] [MLlib] cache serialized java object · ce95bd8e
      Davies Liu authored
      Pyrolite is pretty slow (compared to the ad-hoc serializer in 1.1), which caused a significant performance regression in 1.2, because we cache the serialized Python objects in the JVM and deserialize them into Java objects in each step.

      This PR changes the caching to use the deserialized JavaRDD instead of the PythonRDD, avoiding the Pyrolite deserialization. It should have similar memory usage as before, but be much faster.
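
      A rough sketch of the caching pattern, using the internal `_to_java_object_rdd` helper purely as an illustration (it is an implementation detail, not a public API):

      ```
      from pyspark.mllib.common import _to_java_object_rdd

      points = sc.parallelize([[1.0, 2.0], [3.0, 4.0], [9.0, 8.0]])
      # Deserialize the pickled Python rows into Java objects once, and cache
      # that JavaRDD so each training iteration skips Pyrolite deserialization.
      jrdd = _to_java_object_rdd(points)
      jrdd.cache()
      # ... run the iterative Java-side training against `jrdd` ...
      jrdd.unpersist()
      ```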
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3397 from davies/cache and squashes the following commits:
      
      7f6e6ce [Davies Liu] Update -> Updater
      4b52edd [Davies Liu] using named argument
      63b984e [Davies Liu] fix
      7da0332 [Davies Liu] add unpersist()
      dff33e1 [Davies Liu] address comments
      c2bdfc2 [Davies Liu] refactor
      d572f00 [Davies Liu] Merge branch 'master' into cache
      f1063e1 [Davies Liu] cache serialized java object
  7. Nov 04, 2014
    • [SPARK-3964] [MLlib] [PySpark] add Hypothesis test Python API · c8abddc5
      Davies Liu authored
      ```
      pyspark.mllib.stat.Statistics.chiSqTest(observed, expected=None)
          :: Experimental ::
      
          If `observed` is Vector, conduct Pearson's chi-squared goodness
          of fit test of the observed data against the expected distribution,
          or against the uniform distribution (by default), with each category
          having an expected frequency of `1 / len(observed)`.
          (Note: `observed` cannot contain negative values)
      
          If `observed` is matrix, conduct Pearson's independence test on the
          input contingency matrix, which cannot contain negative entries or
          columns or rows that sum up to 0.
      
          If `observed` is an RDD of LabeledPoint, conduct Pearson's independence
          test for every feature against the label across the input RDD.
          For each feature, the (feature, label) pairs are converted into a
          contingency matrix for which the chi-squared statistic is computed.
          All label and feature values must be categorical.
      
          :param observed: it could be a vector containing the observed categorical
                           counts/relative frequencies, or the contingency matrix
                           (containing either counts or relative frequencies),
                           or an RDD of LabeledPoint containing the labeled dataset
                           with categorical features. Real-valued features will be
                           treated as categorical for each distinct value.
          :param expected: Vector containing the expected categorical counts/relative
                           frequencies. `expected` is rescaled if the `expected` sum
                           differs from the `observed` sum.
          :return: ChiSquaredTest object containing the test statistic, degrees
                   of freedom, p-value, the method used, and the null hypothesis.
      ```
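
      A brief usage sketch of the API documented above; the counts are illustrative:

      ```
      from pyspark.mllib.stat import Statistics
      from pyspark.mllib.linalg import Vectors, Matrices

      # Goodness-of-fit test against the (default) uniform expected distribution
      observed = Vectors.dense([4.0, 6.0, 5.0])
      print(Statistics.chiSqTest(observed).pValue)

      # Independence test on a 3x2 contingency matrix (values are column-major)
      mat = Matrices.dense(3, 2, [1.0, 3.0, 5.0, 2.0, 4.0, 6.0])
      print(Statistics.chiSqTest(mat).statistic)
      ```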
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3091 from davies/his and squashes the following commits:
      
      145d16c [Davies Liu] address comments
      0ab0764 [Davies Liu] fix float
      5097d54 [Davies Liu] add Hypothesis test Python API
    • [SPARK-3886] [PySpark] simplify serializer, use AutoBatchedSerializer by default. · e4f42631
      Davies Liu authored
      This PR simplifies the serializer: it always uses a batched serializer (AutoBatchedSerializer by default), even when the batch size is 1.
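
      A small sketch of the resulting setup using `pyspark.serializers` directly (the wiring inside SparkContext is more involved):

      ```
      import io
      from pyspark.serializers import AutoBatchedSerializer, PickleSerializer

      # Everything goes through a batched serializer; AutoBatchedSerializer grows
      # the batch size based on serialized size, so a "batch of 1" still works.
      ser = AutoBatchedSerializer(PickleSerializer())
      buf = io.BytesIO()
      ser.dump_stream(iter(range(10)), buf)   # objects pickled in size-adaptive batches
      buf.seek(0)
      print(list(ser.load_stream(buf)))
      ```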
      
      Author: Davies Liu <davies@databricks.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Josh Rosen <joshrosen@databricks.com>
      
      Closes #2920 from davies/fix_autobatch and squashes the following commits:
      
      e544ef9 [Davies Liu] revert unrelated change
      6880b14 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      1d557fc [Davies Liu] fix tests
      8180907 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      76abdce [Davies Liu] clean up
      53fa60b [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      d7ac751 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      2cc2497 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      b4292ce [Davies Liu] fix bug in master
      d79744c [Davies Liu] recover hive tests
      be37ece [Davies Liu] refactor
      eb3938d [Davies Liu] refactor serializer in scala
      8d77ef2 [Davies Liu] simplify serializer, use AutoBatchedSerializer by default.
  8. Oct 31, 2014
    • [SPARK-4124] [MLlib] [PySpark] simplify serialization in MLlib Python API · 872fc669
      Davies Liu authored
      Create several helper functions to call the MLlib Java API, converting the arguments to Java types and converting return values to Python objects automatically; this greatly simplifies serialization in the MLlib Python API.

      After this, the MLlib Python API does not need to deal with serialization details anymore, and it is easier to add new APIs.
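
      A hedged sketch of the helper pattern; `callMLlibFunc` is one of the new helpers in `pyspark.mllib.common`, but the method name and arguments below are placeholders rather than a real PythonMLLibAPI signature:

      ```
      from pyspark.mllib.common import callMLlibFunc

      data_rdd = sc.parallelize([[1.0, 2.0], [3.0, 4.0]])
      # Arguments (RDDs, Vectors, plain Python values) are converted to Java
      # types, the named PythonMLLibAPI method is invoked on the JVM, and the
      # result is converted back to a Python object automatically.
      java_result = callMLlibFunc("trainSomeModel", data_rdd, 10, 0.01)  # placeholder method name
      ```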
      
      cc mengxr
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #2995 from davies/cleanup and squashes the following commits:
      
      8fa6ec6 [Davies Liu] address comments
      16b85a0 [Davies Liu] Merge branch 'master' of github.com:apache/spark into cleanup
      43743e5 [Davies Liu] bugfix
      731331f [Davies Liu] simplify serialization in MLlib Python API