  1. Jun 29, 2015
    • BenFradet's avatar
      [SPARK-8575] [SQL] Deprecate callUDF in favor of udf · 0b10662f
      BenFradet authored
      Follow up of [SPARK-8356](https://issues.apache.org/jira/browse/SPARK-8356) and #6902.
      Removes the unit test for the now deprecated ```callUdf```
      Unit test in SQLQuerySuite now uses ```udf``` instead of ```callUDF```
      Replaced ```callUDF``` by ```udf``` where possible in mllib
      
      Author: BenFradet <benjamin.fradet@gmail.com>
      
      Closes #6993 from BenFradet/SPARK-8575 and squashes the following commits:
      
      26f5a7a [BenFradet] 2 spaces instead of 1
      1ddb452 [BenFradet] renamed initUDF in order to be consistent in OneVsRest
      48ca15e [BenFradet] used vector type tag for udf call in VectorIndexer
      0ebd0da [BenFradet] replace the now deprecated callUDF by udf in VectorIndexer
      8013409 [BenFradet] replaced the now deprecated callUDF by udf in Predictor
      94345b5 [BenFradet] uniformized udf calls in ProbabilisticClassifier
      1305492 [BenFradet] uniformized udf calls in Classifier
      a672228 [BenFradet] uniformized udf calls in OneVsRest
      49e4904 [BenFradet] Revert "removal of the unit test for the now deprecated callUdf"
      bbdeaf3 [BenFradet] fixed syntax for init udf in OneVsRest
      fe2a10b [BenFradet] callUDF => udf in ProbabilisticClassifier
      0ea30b3 [BenFradet] callUDF => udf in Classifier where possible
      197ec82 [BenFradet] callUDF => udf in OneVsRest
      84d6780 [BenFradet] modified unit test in SQLQuerySuite to use udf instead of callUDF
      477709f [BenFradet] removal of the unit test for the now deprecated callUdf
      0b10662f
    • Yanbo Liang's avatar
      [SPARK-5962] [MLLIB] Python support for Power Iteration Clustering · dfde31da
      Yanbo Liang authored
      Python support for Power Iteration Clustering
      https://issues.apache.org/jira/browse/SPARK-5962
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #6992 from yanboliang/pyspark-pic and squashes the following commits:
      
      6b03d82 [Yanbo Liang] address comments
      4be4423 [Yanbo Liang] Python support for Power Iteration Clustering
      dfde31da
    • Feynman Liang's avatar
      [SPARK-7212] [MLLIB] Add sequence learning flag · 25f574eb
      Feynman Liang authored
      Support mining of ordered frequent item sequences.
      
      Author: Feynman Liang <fliang@databricks.com>
      
      Closes #6997 from feynmanliang/fp-sequence and squashes the following commits:
      
      7c14e15 [Feynman Liang] Improve scalatests with R code and Seq
      0d3e4b6 [Feynman Liang] Fix python test
      ce987cb [Feynman Liang] Backwards compatibility aux constructor
      34ef8f2 [Feynman Liang] Fix failing test due to reverse ordering
      f04bd50 [Feynman Liang] Naming, add ordered to FreqItemsets, test ordering using Seq
      648d4d4 [Feynman Liang] Test case for frequent item sequences
      252a36a [Feynman Liang] Add sequence learning flag
      25f574eb
  2. Jun 28, 2015
    • Cheng Lian's avatar
      [SPARK-7845] [BUILD] Bumping default Hadoop version used in profile hadoop-1 to 1.2.1 · 00a9d22b
      Cheng Lian authored
      PR #5694 reverted PR #6384 while refactoring `dev/run-tests` to `dev/run-tests.py`. Also, PR #6384 didn't bump Hadoop 1 version defined in POM.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #7062 from liancheng/spark-7845 and squashes the following commits:
      
      c088b72 [Cheng Lian] Bumping default Hadoop version used in profile hadoop-1 to 1.2.1
      00a9d22b
    • Liang-Chi Hsieh's avatar
      [SPARK-8677] [SQL] Fix non-terminating decimal expansion for decimal divide operation · 24fda738
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-8677
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #7056 from viirya/fix_decimal3 and squashes the following commits:
      
      34d7419 [Liang-Chi Hsieh] Fix Non-terminating decimal expansion for decimal divide operation.
      24fda738
    • Vincent D. Warmerdam's avatar
      [SPARK-8596] [EC2] Added port for Rstudio · 9ce78b43
      Vincent D. Warmerdam authored
      This would otherwise need to be set manually by R users in AWS.
      
      https://issues.apache.org/jira/browse/SPARK-8596
      
      Author: Vincent D. Warmerdam <vincentwarmerdam@gmail.com>
      Author: vincent <vincentwarmerdam@gmail.com>
      
      Closes #7068 from koaning/rstudio-port-number and squashes the following commits:
      
      ac8100d [vincent] Update spark_ec2.py
      ce6ad88 [Vincent D. Warmerdam] added port number for rstudio
      9ce78b43
    • Kousuke Saruta's avatar
      [SPARK-8686] [SQL] DataFrame should support `where` with expression represented by String · ec784381
      Kousuke Saruta authored
      DataFrame supports `filter` function with two types of argument, `Column` and `String`. But `where` doesn't.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #7063 from sarutak/SPARK-8686 and squashes the following commits:
      
      180f9a4 [Kousuke Saruta] Added test
      d61aec4 [Kousuke Saruta] Add "where" method with String argument to DataFrame
      ec784381
    • Davies Liu's avatar
      [SPARK-8610] [SQL] Separate Row and InternalRow (part 2) · 77da5be6
      Davies Liu authored
      Currently, we use GenericRow for both Row and InternalRow, which is confusing because it can contain Scala types as well as Catalyst types.
      
      This PR switches to GenericInternalRow for InternalRow (which contains Catalyst types) and GenericRow for Row (which contains Scala types).
      
      Also fixes some incorrect use of InternalRow or Row.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7003 from davies/internalrow and squashes the following commits:
      
      d05866c [Davies Liu] fix test: rollback changes for pyspark
      72878dd [Davies Liu] Merge branch 'master' of github.com:apache/spark into internalrow
      efd0b25 [Davies Liu] fix copy of MutableRow
      87b13cf [Davies Liu] fix test
      d2ebd72 [Davies Liu] fix style
      eb4b473 [Davies Liu] mark expensive API as final
      bd4e99c [Davies Liu] Merge branch 'master' of github.com:apache/spark into internalrow
      bdfb78f [Davies Liu] remove BaseMutableRow
      6f99a97 [Davies Liu] fix catalyst test
      defe931 [Davies Liu] remove BaseRow
      288b31f [Davies Liu] Merge branch 'master' of github.com:apache/spark into internalrow
      9d24350 [Davies Liu] separate Row and InternalRow (part 2)
      77da5be6
    • Thomas Szymanski's avatar
      [SPARK-8649] [BUILD] Mapr repository is not defined properly · 52d12818
      Thomas Szymanski authored
      The previous committer on this part was pwendell.
      
      The previous URL returns a 404; the new one seems to be OK.
      
      This patch is added under the Apache License 2.0.
      
      The JIRA link: https://issues.apache.org/jira/browse/SPARK-8649
      
      Author: Thomas Szymanski <develop@tszymanski.com>
      
      Closes #7054 from tszym/SPARK-8649 and squashes the following commits:
      
      bfda9c4 [Thomas Szymanski] [SPARK-8649] [BUILD] Mapr repository is not defined properly
      52d12818
    • Josh Rosen's avatar
      [SPARK-8683] [BUILD] Depend on mockito-core instead of mockito-all · f5100451
      Josh Rosen authored
      Spark's tests currently depend on `mockito-all`, which bundles Hamcrest and Objenesis classes. Instead, it should depend on `mockito-core`, which declares those libraries as Maven dependencies. This is necessary in order to fix a dependency conflict that leads to a NoSuchMethodError when using certain Hamcrest matchers.
      
      See https://github.com/mockito/mockito/wiki/Declaring-mockito-dependency for more details.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7061 from JoshRosen/mockito-core-instead-of-all and squashes the following commits:
      
      70eccbe [Josh Rosen] Depend on mockito-core instead of mockito-all.
      f5100451
    • Josh Rosen's avatar
      42db3a1c
  3. Jun 27, 2015
    • Josh Rosen's avatar
      [SPARK-8583] [SPARK-5482] [BUILD] Refactor python/run-tests to integrate with... · 40648c56
      Josh Rosen authored
      [SPARK-8583] [SPARK-5482] [BUILD] Refactor python/run-tests to integrate with dev/run-tests module system
      
      This patch refactors the `python/run-tests` script:
      
      - It's now written in Python instead of Bash.
      - The descriptions of the tests to run are now stored in `dev/run-tests`'s modules.  This allows the pull request builder to skip Python test suites that were not affected by the pull request's changes.  For example, we can now skip the PySpark Streaming test cases when only SQL files are changed.
      - `python/run-tests` now supports command-line flags to make it easier to run individual test suites (this addresses SPARK-5482):
      
        ```
      Usage: run-tests [options]
      
      Options:
        -h, --help            show this help message and exit
        --python-executables=PYTHON_EXECUTABLES
                              A comma-separated list of Python executables to test
                              against (default: python2.6,python3.4,pypy)
        --modules=MODULES     A comma-separated list of Python modules to test
                              (default: pyspark-core,pyspark-ml,pyspark-mllib
                              ,pyspark-sql,pyspark-streaming)
         ```
      - `dev/run-tests` has been split into multiple files: the module definitions and test utility functions are now stored inside of a `dev/sparktestsupport` Python module, allowing them to be re-used from the Python test runner script.
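
The two flags in the usage text above could be parsed along these lines. This is a minimal sketch, not the actual `python/run-tests` code: it uses `argparse`, and the function name `parse_run_tests_args` is hypothetical. Only the flag names and default values are taken from the usage text.

```python
import argparse

# Defaults copied from the usage text in the commit message.
DEFAULT_EXECUTABLES = "python2.6,python3.4,pypy"
DEFAULT_MODULES = "pyspark-core,pyspark-ml,pyspark-mllib,pyspark-sql,pyspark-streaming"


def parse_run_tests_args(argv):
    """Hypothetical helper: turn the comma-separated flags into lists."""
    parser = argparse.ArgumentParser(prog="run-tests")
    parser.add_argument(
        "--python-executables",
        default=DEFAULT_EXECUTABLES,
        help="A comma-separated list of Python executables to test against",
    )
    parser.add_argument(
        "--modules",
        default=DEFAULT_MODULES,
        help="A comma-separated list of Python modules to test",
    )
    opts = parser.parse_args(argv)
    # Split on commas and drop empty entries such as a trailing comma.
    executables = [e.strip() for e in opts.python_executables.split(",") if e.strip()]
    modules = [m.strip() for m in opts.modules.split(",") if m.strip()]
    return executables, modules
```

With this sketch, `parse_run_tests_args(["--modules=pyspark-sql"])` selects a single suite while keeping the default executables.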
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #6967 from JoshRosen/run-tests-python-modules and squashes the following commits:
      
      f578d6d [Josh Rosen] Fix print for Python 2.x
      8233d61 [Josh Rosen] Add python/run-tests.py to Python lint checks
      34c98d2 [Josh Rosen] Fix universal_newlines for Python 3
      8f65ed0 [Josh Rosen] Fix handling of  module in python/run-tests
      37aff00 [Josh Rosen] Python 3 fix
      27a389f [Josh Rosen] Skip MLLib tests for PyPy
      c364ccf [Josh Rosen] Use which() to convert PYSPARK_PYTHON to an absolute path before shelling out to run tests
      568a3fd [Josh Rosen] Fix hashbang
      3b852ae [Josh Rosen] Fall back to PYSPARK_PYTHON when sys.executable is None (fixes a test)
      f53db55 [Josh Rosen] Remove python2 flag, since the test runner script also works fine under Python 3
      9c80469 [Josh Rosen] Fix passing of PYSPARK_PYTHON
      d33e525 [Josh Rosen] Merge remote-tracking branch 'origin/master' into run-tests-python-modules
      4f8902c [Josh Rosen] Python lint fixes.
      8f3244c [Josh Rosen] Use universal_newlines to fix dev/run-tests doctest failures on Python 3.
      f542ac5 [Josh Rosen] Fix lint check for Python 3
      fff4d09 [Josh Rosen] Add dev/sparktestsupport to pep8 checks
      2efd594 [Josh Rosen] Update dev/run-tests to use new Python test runner flags
      b2ab027 [Josh Rosen] Add command-line options for running individual suites in python/run-tests
      caeb040 [Josh Rosen] Fixes to PySpark test module definitions
      d6a77d3 [Josh Rosen] Fix the tests of dev/run-tests
      def2d8a [Josh Rosen] Two minor fixes
      aec0b8f [Josh Rosen] Actually get the Kafka stuff to run properly
      04015b9 [Josh Rosen] First attempt at getting PySpark Kafka test to work in new runner script
      4c97136 [Josh Rosen] PYTHONPATH fixes
      dcc9c09 [Josh Rosen] Fix time division
      32660fc [Josh Rosen] Initial cut at Python test runner refactoring
      311c6a9 [Josh Rosen] Move shell utility functions to own module.
      1bdeb87 [Josh Rosen] Move module definitions to separate file.
      40648c56
    • Josh Rosen's avatar
      [SPARK-8606] Prevent exceptions in RDD.getPreferredLocations() from crashing DAGScheduler · 0b5abbf5
      Josh Rosen authored
      If `RDD.getPreferredLocations()` throws an exception it may crash the DAGScheduler and SparkContext. This patch addresses this by adding a try-catch block.
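
The idea of guarding user-supplied code so that one bad call fails a single job instead of the whole scheduler can be sketched as follows. This is an illustration in Python, not the Scala patch itself; `MiniScheduler` and its fields are hypothetical names.

```python
class MiniScheduler:
    """Hypothetical stand-in for a scheduler that calls user-supplied code."""

    def __init__(self):
        # Maps job id -> exception raised by that job's user code.
        self.failed_jobs = {}

    def submit(self, job_id, get_preferred_locations, partitions):
        """Return per-partition locations, or None if the user code raised."""
        try:
            # User code runs inside the guard, so any exception it raises
            # is recorded against this job rather than propagating.
            return [list(get_preferred_locations(p)) for p in partitions]
        except Exception as exc:
            self.failed_jobs[job_id] = exc
            return None
```

After a failing submission the scheduler object is still usable, which is the property the patch is after.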
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7023 from JoshRosen/SPARK-8606 and squashes the following commits:
      
      770b169 [Josh Rosen] Fix getPreferredLocations() DAGScheduler crash with try block.
      44a9b55 [Josh Rosen] Add test of a buggy getPartitions() method
      19aa9f7 [Josh Rosen] Add (failing) regression test for getPreferredLocations() DAGScheduler crash
      0b5abbf5
    • Sandy Ryza's avatar
      [SPARK-8623] Hadoop RDDs fail to properly serialize configuration · 4153776f
      Sandy Ryza authored
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #7050 from sryza/sandy-spark-8623 and squashes the following commits:
      
      58a8079 [Sandy Ryza] SPARK-8623. Hadoop RDDs fail to properly serialize configuration
      4153776f
    • Neelesh Srinivas Salian's avatar
      [SPARK-3629] [YARN] [DOCS]: Improvement of the "Running Spark on YARN" document · d48e7893
      Neelesh Srinivas Salian authored
      As per the description in the JIRA, I moved the contents of the page and added some additional content.
      
      Author: Neelesh Srinivas Salian <nsalian@cloudera.com>
      
      Closes #6924 from nssalian/SPARK-3629 and squashes the following commits:
      
      944b7a0 [Neelesh Srinivas Salian] Changed the lines about deploy-mode and added backticks to all parameters
      40dbc0b [Neelesh Srinivas Salian] Changed dfs to HDFS, deploy-mode in backticks and updated the master yarn line
      9cbc072 [Neelesh Srinivas Salian] Updated a few lines in the Launching Spark on YARN Section
      8e8db7f [Neelesh Srinivas Salian] Removed the changes in this commit to help clearly distinguish movement from update
      151c298 [Neelesh Srinivas Salian] SPARK-3629: Improvement of the Spark on YARN document
      d48e7893
    • Rosstin's avatar
      [SPARK-8639] [DOCS] Fixed Minor Typos in Documentation · b5a6663d
      Rosstin authored
      Ticket: [SPARK-8639](https://issues.apache.org/jira/browse/SPARK-8639)
      
      fixed minor typos in docs/README.md and docs/api.md
      
      Author: Rosstin <asterazul@gmail.com>
      
      Closes #7046 from Rosstin/SPARK-8639 and squashes the following commits:
      
      6c18058 [Rosstin] fixed minor typos in docs/README.md and docs/api.md
      b5a6663d
  4. Jun 26, 2015
    • cafreeman's avatar
      [SPARK-8607] SparkR -- jars not being added to application classpath correctly · 9d118177
      cafreeman authored
      Add `getStaticClass` method in SparkR's `RBackendHandler`
      
      This is a fix for the problem referenced in [SPARK-5185](https://issues.apache.org/jira/browse/SPARK-5185).
      
      cc shivaram
      
      Author: cafreeman <cfreeman@alteryx.com>
      
      Closes #7001 from cafreeman/branch-1.4 and squashes the following commits:
      
      8f81194 [cafreeman] Add missing license
      31aedcf [cafreeman] Refactor test to call an external R script
      2c22073 [cafreeman] Merge branch 'branch-1.4' of github.com:apache/spark into branch-1.4
      0bea809 [cafreeman] Fixed relative path issue and added smaller JAR
      ee25e60 [cafreeman] Merge branch 'branch-1.4' of github.com:apache/spark into branch-1.4
      9a5c362 [cafreeman] test for including JAR when launching sparkContext
      9101223 [cafreeman] Merge branch 'branch-1.4' of github.com:apache/spark into branch-1.4
      5a80844 [cafreeman] Fix style nits
      7c6bd0c [cafreeman] [SPARK-8607] SparkR
      
      (cherry picked from commit 2579948b)
      Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      9d118177
    • cafreeman's avatar
      [SPARK-8662] SparkR Update SparkSQL Test · a56516fc
      cafreeman authored
      Test `infer_type` using a more fine-grained approach rather than comparing environments. Since `all.equal`'s behavior has changed in R 3.2, the test became unpassable.
      
      JIRA here:
      https://issues.apache.org/jira/browse/SPARK-8662
      
      
      
      Author: cafreeman <cfreeman@alteryx.com>
      
      Closes #7045 from cafreeman/R32_Test and squashes the following commits:
      
      b97cc52 [cafreeman] Add `checkStructField` utility
      3381e5c [cafreeman] Update SparkSQL Test
      
      (cherry picked from commit 78b31a2a)
      Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      a56516fc
    • Josh Rosen's avatar
      [SPARK-8652] [PYSPARK] Check return value for all uses of doctest.testmod() · 41afa165
      Josh Rosen authored
      This patch addresses a critical issue in the PySpark tests:
      
      Several of our Python modules' `__main__` methods call `doctest.testmod()` in order to run doctests but forget to check and handle its return value. As a result, some PySpark test failures can go unnoticed because they will not fail the build.
      
      Fortunately, there was only one test failure which was masked by this bug: a `pyspark.profiler` doctest was failing due to changes in RDD pipelining.
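
The pattern described here, checking what `doctest.testmod()` returns and turning failures into a non-zero exit code, can be sketched as below. The helper name `doctest_exit_code` and the two throwaway demo modules are hypothetical; only `doctest.testmod()` and its `(failed, attempted)` result come from the standard library.

```python
import doctest
import types


def doctest_exit_code(module):
    """Run a module's doctests and map the result to a process exit code."""
    # testmod() returns TestResults(failed, attempted); dropping this value
    # is what lets failures go unnoticed.
    failure_count, test_count = doctest.testmod(module)
    return 1 if failure_count else 0


# Hypothetical demo modules whose docstrings hold one doctest each.
passing_demo = types.ModuleType("passing_demo", ">>> 1 + 1\n2\n")
failing_demo = types.ModuleType("failing_demo", ">>> 1 + 1\n3\n")
```

In a real module's `__main__` block, the result would typically be passed to `sys.exit()` so that a failing doctest fails the build.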
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7032 from JoshRosen/testmod-fix and squashes the following commits:
      
      60dbdc0 [Josh Rosen] Account for int vs. long formatting change in Python 3
      8b8d80a [Josh Rosen] Fix failing test.
      e6423f9 [Josh Rosen] Check return code for all uses of doctest.testmod().
      41afa165
    • Marcelo Vanzin's avatar
      [SPARK-8302] Support heterogeneous cluster install paths on YARN. · 37bf76a2
      Marcelo Vanzin authored
      Some users have Hadoop installations on different paths across
      their cluster. Currently, that makes it hard to set up some
      configuration in Spark since that requires hardcoding paths to
      jar files or native libraries, which wouldn't work on such a cluster.
      
      This change introduces a couple of YARN-specific configurations
      that instruct the backend to replace certain paths when launching
      remote processes. That way, if the configuration says the Spark
      jar is in "/spark/spark.jar", and also says that "/spark" should be
      replaced with "{{SPARK_INSTALL_DIR}}", YARN will start containers
      in the NMs with "{{SPARK_INSTALL_DIR}}/spark.jar" as the location
      of the jar.
      
      Coupled with YARN's environment whitelist (which allows certain
      env variables to be exposed to containers), this allows users to
      support such heterogeneous environments, as long as a single
      replacement is enough. (Otherwise, this feature would need to be
      extended to support multiple path replacements.)
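
The single-prefix replacement described above can be illustrated with a small sketch. It assumes the replacement applies only when the configured path is a leading directory of the target path; the function name `replace_install_prefix` is hypothetical and the real configuration keys are not shown.

```python
def replace_install_prefix(path, local_prefix, replacement):
    """Swap a leading directory prefix for a placeholder, if present."""
    prefix = local_prefix.rstrip("/")
    # Assumption: match whole path components only, so "/spark" does not
    # match "/sparkling/...". Paths that do not need it are left alone.
    if path == prefix or path.startswith(prefix + "/"):
        return replacement + path[len(prefix):]
    return path
```

Using the example from the description, `replace_install_prefix("/spark/spark.jar", "/spark", "{{SPARK_INSTALL_DIR}}")` yields `{{SPARK_INSTALL_DIR}}/spark.jar`.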
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #6752 from vanzin/SPARK-8302 and squashes the following commits:
      
      4bff8d4 [Marcelo Vanzin] Add docs, rename configs.
      0aa2a02 [Marcelo Vanzin] Only do replacement for paths that need it.
      2e9cc9d [Marcelo Vanzin] Style.
      a5e1f68 [Marcelo Vanzin] [SPARK-8302] Support heterogeneous cluster install paths on YARN.
      37bf76a2
    • Holden Karau's avatar
      [SPARK-8613] [ML] [TRIVIAL] add param to disable linear feature scaling · c9e05a31
      Holden Karau authored
      Add a param to disable linear feature scaling (to be implemented later in linear & logistic regression). Done as a separate PR so we can use the same param & not conflict while working on the sub-tasks.
      
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #7024 from holdenk/SPARK-8522-Disable-Linear_featureScaling-Spark-8613-Add-param and squashes the following commits:
      
      ce8931a [Holden Karau] Regenerate the sharedParams code
      fa6427e [Holden Karau] update text for standardization param.
      7b24a2b [Holden Karau] generate the new standardization param
      3c190af [Holden Karau] Add the standardization param to sharedparamscodegen
      c9e05a31
    • Josh Rosen's avatar
      [SPARK-8344] Add message processing time metric to DAGScheduler · 9fed6abf
      Josh Rosen authored
      This commit adds a new metric, `messageProcessingTime`, to the DAGScheduler metrics source. This metric tracks the time taken to process messages in the scheduler's event processing loop, which is a helpful debugging aid for diagnosing performance issues in the scheduler (such as SPARK-4961).
      
      In order to do this, I moved the creation of the DAGSchedulerSource metrics source into DAGScheduler itself, similar to how MasterSource is created and registered in Master.
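
Timing each message as it passes through an event loop can be sketched like this. It is a plain-Python illustration of the idea, not the DAGScheduler metrics code; `MessageTimer` and its `durations` list are hypothetical.

```python
import time


class MessageTimer:
    """Hypothetical recorder for per-message processing time, in seconds."""

    def __init__(self):
        self.durations = []

    def process(self, handler, message):
        """Run handler(message) and record how long it took."""
        start = time.perf_counter()
        try:
            return handler(message)
        finally:
            # Recorded even when the handler raises, so slow failures
            # still show up in the measurements.
            self.durations.append(time.perf_counter() - start)
```

A long tail in `durations` would point at individual messages that block the loop.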
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7002 from JoshRosen/SPARK-8344 and squashes the following commits:
      
      57f914b [Josh Rosen] Fix import ordering
      7d6bb83 [Josh Rosen] Add message processing time metrics to DAGScheduler
      9fed6abf
    • Wenchen Fan's avatar
      [SPARK-8635] [SQL] improve performance of CatalystTypeConverters · 1a79f0eb
      Wenchen Fan authored
      In `CatalystTypeConverters.createToCatalystConverter`, we add special handling for primitive types. We can apply this strategy to more places to improve performance.
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7018 from cloud-fan/converter and squashes the following commits:
      
      8b16630 [Wenchen Fan] another fix
      326c82c [Wenchen Fan] optimize type converter
      1a79f0eb
    • Wenchen Fan's avatar
      [SPARK-8620] [SQL] cleanup CodeGenContext · 40360112
      Wenchen Fan authored
      Fix docs, remove nativeTypes, and use the Java type to get the boxed type, default value, etc., to avoid handling `DateType` and `TimestampType` as int and long again and again.
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7010 from cloud-fan/cg and squashes the following commits:
      
      aa01cf9 [Wenchen Fan] cleanup CodeGenContext
      40360112
    • Liang-Chi Hsieh's avatar
      [SPARK-8237] [SQL] Add misc function sha2 · 47c874ba
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-8237
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #6934 from viirya/expr_sha2 and squashes the following commits:
      
      35e0bb3 [Liang-Chi Hsieh] For comments.
      68b5284 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_sha2
      8573aff [Liang-Chi Hsieh] Remove unnecessary Product.
      ee61e06 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_sha2
      59e41aa [Liang-Chi Hsieh] Add misc function: sha2.
      47c874ba
  5. Jun 25, 2015
    • Shivaram Venkataraman's avatar
      [SPARK-8637] [SPARKR] [HOTFIX] Fix packages argument, sparkSubmitBinName · c392a9ef
      Shivaram Venkataraman authored
      cc cafreeman
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #7022 from shivaram/sparkr-init-hotfix and squashes the following commits:
      
      9178d15 [Shivaram Venkataraman] Fix packages argument, sparkSubmitBinName
      c392a9ef
    • Yanbo Liang's avatar
      [MINOR] [MLLIB] rename some functions of PythonMLLibAPI · 2519dcc3
      Yanbo Liang authored
      Keep the same naming conventions for PythonMLLibAPI.
      Only the following three functions are named differently from the others
      ```scala
      trainNaiveBayes
      trainGaussianMixture
      trainWord2Vec
      ```
      So change them to
      ```scala
      trainNaiveBayesModel
      trainGaussianMixtureModel
      trainWord2VecModel
      ```
      This does not affect users or public APIs; it only makes the code easier to understand for developers.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #7011 from yanboliang/py-mllib-api-rename and squashes the following commits:
      
      771ffec [Yanbo Liang] rename some functions of PythonMLLibAPI
      2519dcc3
    • Yin Huai's avatar
      [SPARK-8567] [SQL] Add logs to record the progress of HiveSparkSubmitSuite. · f9b397f5
      Yin Huai authored
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #7009 from yhuai/SPARK-8567 and squashes the following commits:
      
      62fb1f9 [Yin Huai] Add sc.stop().
      b22cf7d [Yin Huai] Add logs.
      f9b397f5
    • Tom Graves's avatar
      [SPARK-8574] org/apache/spark/unsafe doesn't honor the java source/ta… · e988adb5
      Tom Graves authored
      …rget versions.
      
      I basically copied the compatibility rules from the top level pom.xml into here.  Someone more familiar with all the options in the top level pom may want to make sure nothing else should be copied on down.
      
      With this change I can build with JDK 8 and run with lower versions.  The source shows as compiled for JDK 6, as it's supposed to.
      
      Author: Tom Graves <tgraves@yahoo-inc.com>
      Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com>
      
      Closes #6989 from tgravescs/SPARK-8574 and squashes the following commits:
      
      e1ea2d4 [Thomas Graves] Change to use combine.children="append"
      150d645 [Tom Graves] [SPARK-8574] org/apache/spark/unsafe doesn't honor the java source/target versions
      e988adb5
    • Joshi's avatar
      [SPARK-5768] [WEB UI] Fix for incorrect memory in Spark UI · 085a7216
      Joshi authored
      Fix for incorrect memory in Spark UI as per SPARK-5768
      
      Author: Joshi <rekhajoshm@gmail.com>
      Author: Rekha Joshi <rekhajoshm@gmail.com>
      
      Closes #6972 from rekhajoshm/SPARK-5768 and squashes the following commits:
      
      b678a91 [Joshi] Fix for incorrect memory in Spark UI
      2fe53d9 [Joshi] Fix for incorrect memory in Spark UI
      eb823b8 [Joshi] SPARK-5768: Fix for incorrect memory in Spark UI
      0be142d [Rekha Joshi] Merge pull request #3 from apache/master
      106fd8e [Rekha Joshi] Merge pull request #2 from apache/master
      e3677c9 [Rekha Joshi] Merge pull request #1 from apache/master
      085a7216
    • Cheng Lian's avatar
      [SPARK-8604] [SQL] HadoopFsRelation subclasses should set their output format class · c337844e
      Cheng Lian authored
      `HadoopFsRelation` subclasses, especially `ParquetRelation2`, should set their own output format class, so that the default output committer can be set up correctly when appending (where we ignore user-defined output committers).
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #6998 from liancheng/spark-8604 and squashes the following commits:
      
      9be51d1 [Cheng Lian] Adds more comments
      6db1368 [Cheng Lian] HadoopFsRelation subclasses should set their output format class
      c337844e
    • Matt Massie's avatar
      [SPARK-7884] Move block deserialization from BlockStoreShuffleFetcher to ShuffleReader · 7bac2fe7
      Matt Massie authored
      This commit updates the shuffle read path to give ShuffleReader implementations more control over the deserialization process.
      
      The BlockStoreShuffleFetcher.fetch() method has been renamed to BlockStoreShuffleFetcher.fetchBlockStreams(). Previously, this method returned a record iterator; now, it returns an iterator of (BlockId, InputStream). Deserialization of records is now handled in the ShuffleReader.read() method.
      
      This change creates a cleaner separation of concerns and allows implementations of ShuffleReader more flexibility in how records are retrieved.
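
The split between fetching raw block streams and deserializing records can be sketched as follows. This is a Python analogy rather than the Scala API: `fetch_block_streams`, `read_records`, and `serialize` are hypothetical names, and `pickle` stands in for whatever serializer is configured.

```python
import io
import pickle


def serialize(records):
    """Hypothetical helper: pack records back to back into one byte string."""
    return b"".join(pickle.dumps(r) for r in records)


def fetch_block_streams(blocks):
    """Fetcher side: yield (block_id, stream) pairs, deserializing nothing."""
    for block_id, payload in blocks.items():
        yield block_id, io.BytesIO(payload)


def read_records(block_streams):
    """Reader side: deserialize records and close each stream when done."""
    for block_id, stream in block_streams:
        try:
            while True:
                try:
                    yield pickle.load(stream)
                except EOFError:
                    # End of this block's stream.
                    break
        finally:
            stream.close()
```

Because the reader owns deserialization, a different reader could decode the same streams another way without touching the fetcher.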
      
      Author: Matt Massie <massie@cs.berkeley.edu>
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #6423 from massie/shuffle-api-cleanup and squashes the following commits:
      
      8b0632c [Matt Massie] Minor Scala style fixes
      d0a1b39 [Matt Massie] Merge pull request #1 from kayousterhout/massie_shuffle-api-cleanup
      290f1eb [Kay Ousterhout] Added test for HashShuffleReader.read()
      5186da0 [Kay Ousterhout] Revert "Add test to ensure HashShuffleReader is freeing resources"
      f98a1b9 [Matt Massie] Add test to ensure HashShuffleReader is freeing resources
      a011bfa [Matt Massie] Use PrivateMethodTester to check that delegate stream is closed
      4ea1712 [Matt Massie] Small code cleanup for readability
      7429a98 [Matt Massie] Update tests to check that BufferReleasingStream is closing delegate InputStream
      f458489 [Matt Massie] Remove unnecessary map() on return Iterator
      4abb855 [Matt Massie] Consolidate metric code. Make it clear why InterruptibleIterator is needed.
      5c30405 [Matt Massie] Return visibility of BlockStoreShuffleFetcher to private[hash]
      7eedd1d [Matt Massie] Small Scala import cleanup
      28f8085 [Matt Massie] Small import nit
      f93841e [Matt Massie] Update shuffle read metrics in ShuffleReader instead of BlockStoreShuffleFetcher.
      7e8e0fe [Matt Massie] Minor Scala style fixes
      01e8721 [Matt Massie] Explicitly cast iterator in branches for type clarity
      7c8f73e [Matt Massie] Close Block InputStream immediately after all records are read
      208b7a5 [Matt Massie] Small code style changes
      b70c945 [Matt Massie] Make BlockStoreShuffleFetcher visible to shuffle package
      19135f2 [Matt Massie] [SPARK-7884] Allow Spark shuffle APIs to be more customizable
      7bac2fe7
  6. Jun 24, 2015
    • Reynold Xin's avatar
      Two minor SQL cleanup (compiler warning & indent). · 82f80c1c
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7000 from rxin/minor-cleanup and squashes the following commits:
      
      046044c [Reynold Xin] Two minor SQL cleanup (compiler warning & indent).
      82f80c1c
    • Wenchen Fan's avatar
      [SPARK-8075] [SQL] apply type check interface to more expressions · b71d3254
      Wenchen Fan authored
      a follow up of https://github.com/apache/spark/pull/6405.
      Note: this is not a big change; much of the diff comes from moving code in `aggregates.scala` so that each aggregate function sits right below its corresponding aggregate expression.
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #6723 from cloud-fan/type-check and squashes the following commits:
      
      2124301 [Wenchen Fan] fix tests
      5a658bb [Wenchen Fan] add tests
      287d3bb [Wenchen Fan] apply type check interface to more expressions
      b71d3254
    • Yin Huai's avatar
      [SPARK-8567] [SQL] Increase the timeout of HiveSparkSubmitSuite · 7daa7029
      Yin Huai authored
      https://issues.apache.org/jira/browse/SPARK-8567
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #6957 from yhuai/SPARK-8567 and squashes the following commits:
      
      62dff5b [Yin Huai] Increase the timeout.
      7daa7029
    • fe2s's avatar
      [SPARK-8558] [BUILD] Script /dev/run-tests fails when _JAVA_OPTIONS env var set · dca21a83
      fe2s authored
      Author: fe2s <aka.fe2s@gmail.com>
      Author: Oleksiy Dyagilev <oleksiy_dyagilev@epam.com>
      
      Closes #6956 from fe2s/fix-run-tests and squashes the following commits:
      
      31b6edc [fe2s] str is a built-in function, so using it as a variable name will lead to spurious warnings in some Python linters
      7d781a0 [fe2s] fixing for openjdk/IBM, seems like they have slightly different wording, but all have 'version' word. Surrounding with spaces for the case if version word appears in _JAVA_OPTIONS
      cd455ef [fe2s] address comment, looking for java version string rather than expecting to have on a certain line number
      ad577d7 [Oleksiy Dyagilev] [SPARK-8558][BUILD] Script /dev/run-tests fails when _JAVA_OPTIONS env var set
      dca21a83
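
The approach described in the squashed commits above (search for the line containing the word "version", surrounded by spaces, instead of relying on a fixed line number) can be sketched roughly as follows. This is an illustrative reconstruction, not the code from `dev/run-tests`: the function name `find_java_version`, the regular expression, and the sample output are all assumptions made for the example.

```python
import re


def find_java_version(java_version_output):
    """Return (major, minor) parsed from `java -version` output, or None.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    for line in java_version_output.splitlines():
        # Require spaces around "version" so that a "Picked up _JAVA_OPTIONS"
        # line mentioning something like -Dapp.version=1.0 is skipped.
        if " version " in line:
            match = re.search(r'version "(\d+)\.(\d+)', line)
            if match:
                return int(match.group(1)), int(match.group(2))
    return None


# Assumed sample output: the first line is what a JVM prints when
# _JAVA_OPTIONS is set, which shifts the version string off the first line.
sample = (
    "Picked up _JAVA_OPTIONS: -Dapp.version=1.0 -Xmx2g\n"
    'openjdk version "1.8.0_45"\n'
    "OpenJDK Runtime Environment (build 1.8.0_45-b14)\n"
)
print(find_java_version(sample))
```

Scanning every line keeps the check independent of how many extra lines the JVM prints before the version string.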
    • Cheng Lian's avatar
      [SPARK-6777] [SQL] Implements backwards compatibility rules in CatalystSchemaConverter · 8ab50765
      Cheng Lian authored
      This PR introduces `CatalystSchemaConverter` for converting Parquet schemas to Spark SQL schemas and vice versa. The original conversion code in `ParquetTypesConverter` is removed. The benefits of the new version are:
      
      1. When converting Spark SQL schemas, it generates standard Parquet schemas conforming to [the most recent Parquet format spec][1]. Converting to old-style Parquet schemas is also supported via the feature flag `spark.sql.parquet.followParquetFormatSpec` (which is set to `false` for now, and should be set to `true` after both the read and write paths are fixed).

         Note that although this version of the Parquet format spec hasn't been officially released yet, Parquet MR 1.7.0 already follows it, so it should be safe to follow.
      
      1. It implements the backwards-compatibility rules described in the most recent Parquet format spec. It can therefore recognize more schema patterns generated by other or legacy systems and tools.
      1. The code organization follows the convention used in [parquet-mr][2], which makes it easier to follow. (The structure of `CatalystSchemaConverter` is similar to that of `AvroSchemaConverter`.)
      
      To fully implement the backwards-compatibility rules in both the read and write paths, we also need to update `CatalystRowConverter` (which is responsible for converting Parquet records to `Row`s), `RowReadSupport`, and `RowWriteSupport`. These will be done in follow-up PRs.
      
      TODO
      
      - [x] More schema conversion test cases for legacy schema patterns.
      
      [1]: https://github.com/apache/parquet-format/blob/ea095226597fdbecd60c2419d96b54b2fdb4ae6c/LogicalTypes.md
      [2]: https://github.com/apache/parquet-mr/
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #6617 from liancheng/spark-6777 and squashes the following commits:
      
      2a2062d [Cheng Lian] Don't convert decimals without precision information
      b60979b [Cheng Lian] Adds a constructor which accepts a Configuration, and fixes default value of assumeBinaryIsString
      743730f [Cheng Lian] Decimal scale shouldn't be larger than precision
      a104a9e [Cheng Lian] Fixes Scala style issue
      1f71d8d [Cheng Lian] Adds feature flag to allow falling back to old style Parquet schema conversion
      ba84f4b [Cheng Lian] Fixes MapType schema conversion bug
      13cb8d5 [Cheng Lian] Fixes MiMa failure
      81de5b0 [Cheng Lian] Fixes UDT, workaround read path, and add tests
      28ef95b [Cheng Lian] More AnalysisExceptions
      b10c322 [Cheng Lian] Replaces require() with analysisRequire() which throws AnalysisException
      cceaf3f [Cheng Lian] Implements backwards compatibility rules in CatalystSchemaConverter
      8ab50765
    • MechCoder's avatar
      [SPARK-7633] [MLLIB] [PYSPARK] Python bindings for StreamingLogisticRegressionwithSGD · fb32c388
      MechCoder authored
      Add Python bindings to `StreamingLogisticRegressionWithSGD`.
      
      No Java wrappers are needed as models are updated directly using train.
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #6849 from MechCoder/spark-3258 and squashes the following commits:
      
      b4376a5 [MechCoder] minor
      d7e5fc1 [MechCoder] Refactor into StreamingLinearAlgorithm Better docs
      9c09d4e [MechCoder] [SPARK-7633] Python bindings for StreamingLogisticRegressionwithSGD
      fb32c388
    • Wenchen Fan's avatar
      [SPARK-7289] handle project -> limit -> sort efficiently · f04b5672
      Wenchen Fan authored
      Make the `TakeOrdered` strategy and operator more general, so that they can optionally handle a projection when necessary.
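
The idea of combining a sort, a limit, and an optional projection into one step can be illustrated with a small plain-Python sketch. It is not Spark code and does not reflect the actual operator implementation; the function name `take_ordered_and_project` and its parameters are hypothetical. It assumes only that a bounded top-k selection avoids a full sort and that the projection needs to run on just the rows that survive the limit.

```python
import heapq


def take_ordered_and_project(rows, limit, sort_key, project=None):
    """Return the `limit` smallest rows by `sort_key`, optionally projected.

    Hypothetical helper for illustration; not part of any Spark API.
    """
    # Bounded top-k selection: no need to sort the whole input.
    top = heapq.nsmallest(limit, rows, key=sort_key)
    if project is None:
        return top
    # The projection is applied only to the rows kept by the limit.
    return [project(row) for row in top]


rows = [
    {"name": "a", "score": 3},
    {"name": "b", "score": 1},
    {"name": "c", "score": 2},
]
result = take_ordered_and_project(
    rows, 2, sort_key=lambda r: r["score"], project=lambda r: r["name"]
)
print(result)
```

Applying the projection after the top-k selection means it is evaluated for at most `limit` rows rather than for the whole input.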
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #6780 from cloud-fan/limit and squashes the following commits:
      
      34aa07b [Wenchen Fan] revert
      07d5456 [Wenchen Fan] clean closure
      20821ec [Wenchen Fan] fix
      3676a82 [Wenchen Fan] address comments
      b558549 [Wenchen Fan] address comments
      214842b [Wenchen Fan] fix style
      2d8be83 [Wenchen Fan] add LimitPushDown
      948f740 [Wenchen Fan] fix existing
      f04b5672
    • Santiago M. Mola's avatar
      [SPARK-7088] [SQL] Fix analysis for 3rd party logical plan. · b84d4b4d
      Santiago M. Mola authored
      The `ResolveReferences` analysis rule no longer throws when it cannot resolve references in a self-join.
      
      Author: Santiago M. Mola <smola@stratio.com>
      
      Closes #6853 from smola/SPARK-7088 and squashes the following commits:
      
      af71ac7 [Santiago M. Mola] [SPARK-7088] Fix analysis for 3rd party logical plan.
      b84d4b4d