  1. Oct 20, 2015
    • [MINOR][ML] fix doc warnings · 135ade90
      Xiangrui Meng authored
      Without an empty line, Sphinx will treat the doctest as part of the docstring. cc holdenk
      
      ~~~
      /Users/meng/src/spark/python/pyspark/ml/feature.py:docstring of pyspark.ml.feature.CountVectorizer:3: ERROR: Undefined substitution referenced: "label|raw |vectors | +-----+---------------+-------------------------+ |0 |[a, b, c] |(3,[0,1,2],[1.0,1.0,1.0])".
      /Users/meng/src/spark/python/pyspark/ml/feature.py:docstring of pyspark.ml.feature.CountVectorizer:3: ERROR: Undefined substitution referenced: "1 |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])".
      ~~~
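
      For reference, a minimal sketch of the pattern the fix enforces (the function and doctest below are illustrative, not the actual CountVectorizer docstring): a blank line must separate the surrounding docstring text from the >>> block, otherwise Sphinx runs them together and misreads table cells such as |vectors | as substitution references.

      ~~~
      def count_tokens(tokens):
          """Count occurrences of each token in a list.

          >>> count_tokens(["a", "b", "a"])["a"]
          2
          """
          counts = {}
          for t in tokens:
              counts[t] = counts.get(t, 0) + 1
          return counts
      ~~~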
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #9188 from mengxr/py-count-vec-doc-fix.
    • [SPARK-10767][PYSPARK] Make pyspark shared params codegen more consistent · aea7142c
      Holden Karau authored
      Namely, "." shows up in some places in the template when using the param docstring but not in others.
      
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #9017 from holdenk/SPARK-10767-Make-pyspark-shared-params-codegen-more-consistent.
    • [SPARK-10269][PYSPARK][MLLIB] Add @since annotation to pyspark.mllib.classification · 04521ea0
      noelsmith authored
      Duplicated the since decorator from pyspark.sql into pyspark (also tweaked to handle functions without docstrings).
      
      Added @since to methods and a "versionadded::" directive to classes, with versions derived from the file history.
      
      Note - some methods are inherited from the regression module (e.g. LinearModel.intercept), so these won't have version numbers in the API docs until that module is updated.
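
      For reference, a minimal sketch of such a decorator, assuming only that it appends a Sphinx note and must tolerate a missing docstring (the real pyspark version also fixes up indentation):

      ~~~
      def since(version):
          """Sketch: annotate a function with the version in which it was added."""
          def deco(f):
              note = "\n\n.. versionadded:: %s" % version
              f.__doc__ = (f.__doc__ or "") + note  # tolerate functions without docstrings
              return f
          return deco
      ~~~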
      
      Author: noelsmith <mail@noelsmith.com>
      
      Closes #8626 from noel-smith/SPARK-10269-since-mlib-classification.
    • [SPARK-10272][PYSPARK][MLLIB] Added @since tags to pyspark.mllib.evaluation · 82e9d9c8
      noelsmith authored
      Duplicated the since decorator from pyspark.sql into pyspark (also tweaked to handle functions without docstrings).
      
      Added @since to public methods and a "versionadded::" directive to classes (versions derived from the git file history in pyspark).
      
      Note - I also added the tags to MultilabelMetrics even though it isn't declared as public in the __all__ statement; if that's incorrect, I'll remove them.
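
      A hypothetical usage sketch (the class, method, and version below are illustrative only):

      ~~~
      class MultilabelMetricsSketch(object):
          """Evaluator sketch for multilabel classification.

          .. versionadded:: 1.4.0
          """

          @since("1.4.0")  # assumes a since decorator like the one sketched above
          def precision(self):
              """Returns document-based precision."""
              return 0.0
      ~~~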
      
      Author: noelsmith <mail@noelsmith.com>
      
      Closes #8628 from noel-smith/SPARK-10272-since-mllib-evalutation.
    • [SPARK-10447][SPARK-3842][PYSPARK] upgrade pyspark to py4j0.9 · e18b571c
      Holden Karau authored
      Upgrade to Py4J 0.9.
      
      Author: Holden Karau <holden@pigscanfly.ca>
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #8615 from holdenk/SPARK-10447-upgrade-pyspark-to-py4j0.9.
  2. Oct 19, 2015
    • [SPARK-11114][PYSPARK] add getOrCreate for SparkContext/SQLContext in Python · 232d7f8d
      Davies Liu authored
      Also added SQLContext.newSession()
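
      A short usage sketch of the new entry points (assumes a plain local run):

      ~~~
      from pyspark import SparkContext
      from pyspark.sql import SQLContext

      sc = SparkContext.getOrCreate()          # returns the active context, or creates one
      sqlContext = SQLContext.getOrCreate(sc)  # likewise for the SQLContext
      session2 = sqlContext.newSession()       # shares the SparkContext, but has its own
                                               # configuration and temporary tables
      ~~~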
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #9122 from davies/py_create.
    • [SPARK-7018][BUILD] Refactor dev/run-tests-jenkins into Python · d3180c25
      Brennon York authored
      This commit refactors the `run-tests-jenkins` script into Python. This refactoring was done by brennonyork in #7401; this PR contains a few minor edits from joshrosen in order to bring it up to date with other recent changes.
      
      From the original PR description (by brennonyork):
      
      Currently a few things are left out that could, and I think should, become smaller JIRAs after this.
      
      1. There are still a few areas where we use environment variables where we don't need to (like `CURRENT_BLOCK`). I might get around to fixing this one along with everything else, but wanted to point that out; see the sketch after this list.
      2. The PR tests are still written in bash. I opted to not change those and just rewrite the runner into Python. This is a great follow-on JIRA IMO.
      3. All of the linting scripts are still in bash as well; those would likewise be good follow-on JIRAs.
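
      A minimal sketch of the environment-variable point above (names are assumptions, not the actual script): prefer explicit parameters over reading process-global state.

      ~~~
      import os

      def run_block_from_env():
          # current style: an implicit dependency on the caller's environment
          return os.environ["CURRENT_BLOCK"]

      def run_block(block):
          # preferred style: the value is an explicit argument, so call sites
          # are obvious and the function is trivially testable
          return block
      ~~~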
      
      Closes #7401.
      
      Author: Brennon York <brennon.york@capitalone.com>
      
      Closes #9161 from JoshRosen/run-tests-jenkins-refactoring.
  3. Oct 18, 2015
    • [SPARK-11158][SQL] Modified _verify_type() to be more informative on Errors by presenting the Object · a337c235
      Mahmoud Lababidi authored
      
      The _verify_type() function raised errors when there were type conversion issues but left the object in question out of the message. The object is now included in the error to spare the user from having to debug through the code and figure out which object failed the type conversion.
      
      The use case for me was a Pandas DataFrame that contained 'nan' as values for columns of Strings.
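
      A minimal sketch of the improved error (simplified; the real pyspark.sql.types._verify_type checks far more cases):

      ~~~
      def verify_type_sketch(obj, acceptable_types, data_type_name):
          """Raise a TypeError that names the offending object, not just its type."""
          if not isinstance(obj, acceptable_types):
              raise TypeError("%s can not accept object %r in type %s"
                              % (data_type_name, obj, type(obj)))

      verify_type_sketch(float("nan"), str, "StringType")
      # TypeError: StringType can not accept object nan in type <class 'float'>
      ~~~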
      
      Author: Mahmoud Lababidi <mahmoud@thehumangeo.com>
      Author: Mahmoud Lababidi <lababidi@gmail.com>
      
      Closes #9149 from lababidi/master.
  4. Oct 17, 2015
    • [SPARK-10185] [SQL] Feat sql comma separated paths · 57f83e36
      Koert Kuipers authored
      Make sure comma-separated paths get processed correctly in ResolvedDataSource for a HadoopFsRelationProvider.
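
      A hypothetical PySpark usage sketch (paths are illustrative; assumes a live sqlContext):

      ~~~
      # one load call across several files, separated by commas
      df = sqlContext.read.json("/data/2015/10/01.json,/data/2015/10/02.json")
      ~~~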
      
      Author: Koert Kuipers <koert@tresata.com>
      
      Closes #8416 from koertkuipers/feat-sql-comma-separated-paths.
  5. Oct 16, 2015
    • [SPARK-11084] [ML] [PYTHON] Check if index can contain non-zero value before binary search · 8ac71d62
      zero323 authored
      At the moment `SparseVector.__getitem__` executes `np.searchsorted` first and checks whether the result is in the expected range afterwards. It is possible to check whether the index can contain a non-zero value before executing `np.searchsorted`.
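
      A self-contained sketch of the short-circuit in plain NumPy (not the actual pyspark implementation):

      ~~~
      import numpy as np

      def sparse_getitem(indices, values, size, index):
          """Return element `index` of a sparse vector with sorted `indices`/`values`."""
          if index < 0:
              index += size
          if not 0 <= index < size:
              raise IndexError("index %d out of range" % index)
          # Short-circuit: past the last stored index the value must be zero,
          # so np.searchsorted never runs.
          if indices.size == 0 or index > indices[-1]:
              return 0.0
          i = np.searchsorted(indices, index)
          return float(values[i]) if indices[i] == index else 0.0
      ~~~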
      
      Author: zero323 <matthew.szymkiewicz@gmail.com>
      
      Closes #9098 from zero323/sparse_vector_getitem_improved.
    • [SPARK-11050] [MLLIB] PySpark SparseVector can return wrong index in error message · 1ec0a0dc
      Bhargav Mangipudi authored
      
      For negative indices in the SparseVector, we update the index value. If the index is invalid at that point, the error message reports the incorrect *updated* index instead of the original one. This change fixes that.
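
      A tiny sketch of the fix (names are assumptions): keep the user-supplied index around for the message.

      ~~~
      def check_index(index, size):
          orig = index                 # remember the index as the caller wrote it
          if index < 0:
              index += size            # normalize negative indices
          if not 0 <= index < size:
              raise IndexError("index %d out of range" % orig)  # report the original
          return index
      ~~~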
      
      Author: Bhargav Mangipudi <bhargav.mangipudi@gmail.com>
      
      Closes #9069 from bhargav/spark-10759.
  6. Oct 09, 2015
    • [SPARK-10535] Sync up API for matrix factorization model between Scala and PySpark · c1b4ce43
      Vladimir Vladimirov authored
      Support for recommendUsersForProducts and recommendProductsForUsers in matrix factorization model for PySpark
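
      A short usage sketch with toy data (assumes a live SparkContext sc):

      ~~~
      from pyspark.mllib.recommendation import ALS, Rating

      ratings = sc.parallelize([Rating(1, 1, 5.0), Rating(1, 2, 1.0), Rating(2, 2, 4.0)])
      model = ALS.train(ratings, rank=5, iterations=5)

      model.recommendProductsForUsers(2).collect()  # top 2 products for every user
      model.recommendUsersForProducts(2).collect()  # top 2 users for every product
      ~~~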
      
      Author: Vladimir Vladimirov <vladimir.vladimirov@magnetic.com>
      
      Closes #8700 from smartkiwi/SPARK-10535_.
    • [SPARK-10959] [PYSPARK] StreamingLogisticRegressionWithSGD does not train with given regParam and convergenceTol parameters · 5410747a
      Bryan Cutler authored
      
      These params were being passed into the StreamingLogisticRegressionWithSGD constructor, but not transferred to the call for model training. Same with StreamingLinearRegressionWithSGD. I added the params as named arguments to the call and also fixed the intercept parameter, which was being passed as the regularization value.
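
      A minimal sketch of the shape of the fix (attribute names are assumptions, not the actual pyspark code): forward the stored constructor params to the training call as named arguments.

      ~~~
      from pyspark.mllib.classification import LogisticRegressionWithSGD

      def _update_model(self, rdd):
          # Before the fix: regParam and convergenceTol were accepted by __init__
          # but never forwarded, and intercept was passed in the regParam slot.
          self._model = LogisticRegressionWithSGD.train(
              rdd,
              iterations=self.numIterations,
              step=self.stepSize,
              miniBatchFraction=self.miniBatchFraction,
              regParam=self.regParam,              # now forwarded
              intercept=self.intercept,            # now a named argument
              convergenceTol=self.convergenceTol)  # now forwarded
      ~~~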
      
      Author: Bryan Cutler <bjcutler@us.ibm.com>
      
      Closes #9002 from BryanCutler/StreamingSGD-convergenceTol-bug-10959.
  7. Oct 08, 2015
    • [SPARK-10973] [ML] [PYTHON] __getitem__ method throws IndexError exception when we try to access an index after the last non-zero entry · 8e67882b
      zero323 authored
      
          from pyspark.mllib.linalg import Vectors
          sv = Vectors.sparse(5, {1: 3})
          sv[0]
          ## 0.0
          sv[1]
          ## 3.0
          sv[2]
          ## Traceback (most recent call last):
          ##   File "<stdin>", line 1, in <module>
          ##   File "/python/pyspark/mllib/linalg/__init__.py", line 734, in __getitem__
          ##     row_ind = inds[insert_index]
          ## IndexError: index out of bounds
      
      Author: zero323 <matthew.szymkiewicz@gmail.com>
      
      Closes #9009 from zero323/sparse_vector_index_error.
  8. Sep 25, 2015
    • [SPARK-9681] [ML] Support R feature interactions in RFormula · 92233881
      Eric Liang authored
      This integrates the Interaction feature transformer with SparkR R formula support (i.e. support `:`).
      
      To generate reasonable ML attribute names for feature interactions, it was necessary to add the ability to read the original attribute names back from `StructField`, and also to specify custom group prefixes in `VectorAssembler`. This also has the side benefit of cleaning up the double underscores in the attributes generated for non-interaction terms.
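
      A hypothetical PySpark usage sketch of an interaction term (column names are illustrative; assumes a DataFrame df with columns y, a, and b):

      ~~~
      from pyspark.ml.feature import RFormula

      rf = RFormula(formula="y ~ a + b + a:b")  # a:b is the interaction term
      model = rf.fit(df)
      model.transform(df).select("features", "label").show()
      ~~~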
      
      mengxr
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #8830 from ericl/interaction-2.
  9. Sep 23, 2015
    • [SPARK-10731] [SQL] Delegate to Scala's DataFrame.take implementation in Python DataFrame. · 99522177
      Reynold Xin authored
      Python DataFrame.head/take currently requires scanning all the partitions. This pull request changes them to delegate the actual implementation to the Scala DataFrame (by calling DataFrame.take).
      
      This is more of a hack for fixing this issue in 1.5.1. A more proper fix is to change executeCollect and executeTake to return InternalRow rather than Row, and thus eliminate the extra round-trip conversion.
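
      A rough sketch of the delegation on the Python side (the helper and the JVM-side method shape are assumptions):

      ~~~
      def take(self, num):
          # Delegate to the Scala DataFrame's take(), which scans partitions
          # incrementally, instead of collecting num rows from every partition.
          return _to_python_rows(self._jdf.take(num))  # hypothetical conversion helper
      ~~~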
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #8876 from rxin/SPARK-10731.
  10. Sep 19, 2015
    • [SPARK-10710] Remove ability to disable spilling in core and SQL · 2117eea7
      Josh Rosen authored
      It does not make much sense to set `spark.shuffle.spill` or `spark.sql.planner.externalSort` to false: I believe that these configurations were initially added as "escape hatches" to guard against bugs in the external operators, but these operators are now mature and well-tested. In addition, these configurations are not handled in a consistent way anymore: SQL's Tungsten codepath ignores these configurations and will continue to use spilling operators. Similarly, Spark Core's `tungsten-sort` shuffle manager does not respect `spark.shuffle.spill=false`.
      
      This pull request removes these configurations, adds warnings at the appropriate places, and deletes a large amount of code which was only used in code paths that did not support spilling.
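
      After this change, setting either flag is a no-op; a sketch of what a user now sees (the warning text is illustrative):

      ~~~
      from pyspark import SparkConf

      conf = SparkConf().set("spark.shuffle.spill", "false")
      # Spark ignores this value and logs a warning that spilling is always enabled.
      ~~~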
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #8831 from JoshRosen/remove-ability-to-disable-spilling.