  1. Apr 17, 2015
    • [SPARK-6975][Yarn] Fix argument validation error · d850b4bd
      jerryshao authored
      `numExecutors` validation fails when dynamic allocation is enabled with the default configuration. Details can be seen in [SPARK-6975](https://issues.apache.org/jira/browse/SPARK-6975). sryza, please help me review this; I'm not sure this is the correct way. I think you changed this part previously :)
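      A minimal sketch of the kind of guard involved, assuming illustrative names (`validateNumExecutors`, `conf`, and `numExecutors` are stand-ins, not the actual ClientArguments code):
      
      ```
      import org.apache.spark.SparkConf
      
      // Illustrative only: skip the --num-executors check when dynamic allocation is
      // enabled, since the executor count is then managed by Spark rather than fixed.
      def validateNumExecutors(conf: SparkConf, numExecutors: Int): Unit = {
        val dynamicAllocationEnabled = conf.getBoolean("spark.dynamicAllocation.enabled", false)
        if (!dynamicAllocationEnabled) {
          require(numExecutors > 0, s"Number of executors was $numExecutors, but must be at least 1.")
        }
      }
      ```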
      
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #5551 from jerryshao/SPARK-6975 and squashes the following commits:
      
      4335da1 [jerryshao] Change according to the comments
      77bdcbd [jerryshao] Fix argument validation error
      d850b4bd
    • [SPARK-5933] [core] Move config deprecation warnings to SparkConf. · 19913373
      Marcelo Vanzin authored
      I didn't find many deprecated configs after a grep-based search,
      but the ones I could find were moved to the centralized location
      in SparkConf.
      
      While there, I deprecated a couple more HS configs that mentioned
      time units.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #5562 from vanzin/SPARK-5933 and squashes the following commits:
      
      dcb617e7 [Marcelo Vanzin] [SPARK-5933] [core] Move config deprecation warnings to SparkConf.
      19913373
    • [SPARK-6350][Mesos] Make mesosExecutorCores configurable in mesos "fine-grained" mode · 6fbeb82e
      Jongyoul Lee authored
      - Defined executorCores from "spark.mesos.executor.cores"
      - Changed the amount of the Mesos executor's cores to executorCores.
      - Added a new configuration option to running-on-mesos.md (see the usage sketch below).
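      A hypothetical usage sketch of the new option (the master URL and core count are placeholder values):
      
      ```
      import org.apache.spark.SparkConf
      
      // Hypothetical values: give each fine-grained Mesos executor 0.5 cores.
      val conf = new SparkConf()
        .setMaster("mesos://zk://host:2181/mesos")
        .setAppName("fine-grained-demo")
        .set("spark.mesos.executor.cores", "0.5")
      ```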
      
      Author: Jongyoul Lee <jongyoul@gmail.com>
      
      Closes #5063 from jongyoul/SPARK-6350 and squashes the following commits:
      
      9238d6e [Jongyoul Lee] [SPARK-6350][Mesos] Make mesosExecutorCores configurable in mesos "fine-grained" mode - Fixed docs - Changed configuration name - Made mesosExecutorCores private
      2d41241 [Jongyoul Lee] [SPARK-6350][Mesos] Make mesosExecutorCores configurable in mesos "fine-grained" mode - Fixed docs
      89edb4f [Jongyoul Lee] [SPARK-6350][Mesos] Make mesosExecutorCores configurable in mesos "fine-grained" mode - Fixed docs
      8ba7694 [Jongyoul Lee] [SPARK-6350][Mesos] Make mesosExecutorCores configurable in mesos "fine-grained" mode - Fixed docs
      7549314 [Jongyoul Lee] [SPARK-6453][Mesos] Some Mesos*Suite have a different package with their classes - Fixed docs
      4ae7b0c [Jongyoul Lee] [SPARK-6453][Mesos] Some Mesos*Suite have a different package with their classes - Removed TODO
      c27efce [Jongyoul Lee] [SPARK-6453][Mesos] Some Mesos*Suite have a different package with their classes - Fixed Mesos*Suite for supporting integer WorkerOffers - Fixed Documentation
      1fe4c03 [Jongyoul Lee] [SPARK-6453][Mesos] Some Mesos*Suite have a different package with their classes - Change available resources of cpus to integer value because WorkerOffer supports the amount of cpus as an integer value
      5f3767e [Jongyoul Lee] Revert "[SPARK-6350][Mesos] Make mesosExecutorCores configurable in mesos "fine-grained" mode"
      4b7c69e [Jongyoul Lee] [SPARK-6350][Mesos] Make mesosExecutorCores configurable in mesos "fine-grained" mode - Changed configuration name and description from "spark.mesos.executor.cores" to "spark.executor.frameworkCores"
      0556792 [Jongyoul Lee] [SPARK-6350][Mesos] Make mesosExecutorCores configurable in mesos "fine-grained" mode - Defined executorCores from "spark.mesos.executor.cores" - Changed the amount of mesosExecutor's cores to executorCores. - Added new configuration option on running-on-mesos.md
      6fbeb82e
    • [SPARK-6703][Core] Provide a way to discover existing SparkContext's · c5ed5101
      Ilya Ganelin authored
      I've added a getOrCreate method to the SparkContext companion object that allows one either to retrieve a previously created SparkContext or to instantiate a new one with the provided config. The method accepts an optional SparkConf to make usage intuitive.
      
      Still working on a test for this; basically I want to create a new context from scratch and then ensure that subsequent calls don't overwrite it.
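      A minimal usage sketch (the app name and master are placeholders):
      
      ```
      import org.apache.spark.{SparkConf, SparkContext}
      
      // The first call constructs a context from the supplied conf; later calls
      // return the already-active instance instead of overwriting it.
      val sc = SparkContext.getOrCreate(new SparkConf().setAppName("demo").setMaster("local[*]"))
      val same = SparkContext.getOrCreate()
      assert(sc eq same)
      ```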
      
      Author: Ilya Ganelin <ilya.ganelin@capitalone.com>
      
      Closes #5501 from ilganeli/SPARK-6703 and squashes the following commits:
      
      db9a963 [Ilya Ganelin] Closing second spark context
      1dc0444 [Ilya Ganelin] Added ref equality check
      8c884fa [Ilya Ganelin] Made getOrCreate synchronized
      cb0c6b7 [Ilya Ganelin] Doc updates and code cleanup
      270cfe3 [Ilya Ganelin] [SPARK-6703] Documentation fixes
      15e8dea [Ilya Ganelin] Updated comments and added MiMa Exclude
      0e1567c [Ilya Ganelin] Got rid of unnecessary option for AtomicReference
      dfec4da [Ilya Ganelin] Changed activeContext to AtomicReference
      733ec9f [Ilya Ganelin] Fixed some bugs in test code
      8be2f83 [Ilya Ganelin] Replaced match with if
      e92caf7 [Ilya Ganelin] [SPARK-6703] Added test to ensure that getOrCreate both allows creation, retrieval, and a second context if desired
      a99032f [Ilya Ganelin] Spacing fix
      d7a06b8 [Ilya Ganelin] Updated SparkConf class to add getOrCreate method. Started test suite implementation
      c5ed5101
    • Minor fix to SPARK-6958: Improve Python docstring for DataFrame.sort. · a452c592
      Reynold Xin authored
      As a follow up PR to #5544.
      
      cc davies
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5558 from rxin/sort-doc-improvement and squashes the following commits:
      
      f4c276f [Reynold Xin] Review feedback.
      d2dcf24 [Reynold Xin] Minor fix to SPARK-6958: Improve Python docstring for DataFrame.sort.
      a452c592
    • SPARK-6988 : Fix documentation regarding DataFrames using the Java API · d305e686
      Olivier Girardot authored
      
      This patch includes :
       * adding an example of how to use map after an SQL query using javaRDD
       * fixing the first few Java examples that were written in Scala
      
      Thank you for your time,
      
      Olivier.
      
      Author: Olivier Girardot <o.girardot@lateral-thoughts.com>
      
      Closes #5564 from ogirardot/branch-1.3 and squashes the following commits:
      
      9f8d60e [Olivier Girardot] SPARK-6988 : Fix documentation regarding DataFrames using the Java API
      
      (cherry picked from commit 6b528dc139da594ef2e651d84bd91fe0f738a39d)
      Signed-off-by: Reynold Xin <rxin@databricks.com>
      d305e686
    • [SPARK-6807] [SparkR] Merge recent SparkR-pkg changes · 59e206de
      cafreeman authored
      This PR pulls in recent changes in SparkR-pkg, including
      
      cartesian, intersection, sampleByKey, subtract, subtractByKey, except, and some APIs for StructType and StructField.
      
      Author: cafreeman <cfreeman@alteryx.com>
      Author: Davies Liu <davies@databricks.com>
      Author: Zongheng Yang <zongheng.y@gmail.com>
      Author: Shivaram Venkataraman <shivaram.venkataraman@gmail.com>
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #5436 from davies/R3 and squashes the following commits:
      
      c2b09be [Davies Liu] SQLTypes -> schema
      a5a02f2 [Davies Liu] Merge branch 'master' of github.com:apache/spark into R3
      168b7fe [Davies Liu] sort generics
      b1fe460 [Davies Liu] fix conflict in README.md
      e74c04e [Davies Liu] fix schema.R
      4f5ac09 [Davies Liu] Merge branch 'master' of github.com:apache/spark into R5
      41f8184 [Davies Liu] rm man
      ae78312 [Davies Liu] Merge pull request #237 from sun-rui/SPARKR-154_3
      1bdcb63 [Zongheng Yang] Updates to README.md.
      5a553e7 [cafreeman] Use object attribute instead of argument
      71372d9 [cafreeman] Update docs and examples
      8526d2e71 [cafreeman] Remove `tojson` functions
      6ef5f2d [cafreeman] Fix spacing
      7741d66 [cafreeman] Rename the SQL DataType function
      141efd8 [Shivaram Venkataraman] Merge pull request #245 from hqzizania/upstream
      9387402 [Davies Liu] fix style
      40199eb [Shivaram Venkataraman] Move except into sorted position
      07d0dbc [Sun Rui] [SPARKR-244] Fix test failure after integration of subtract() and subtractByKey() for RDD.
      7e8caa3 [Shivaram Venkataraman] Merge pull request #246 from hlin09/fixCombineByKey
      ed66c81 [cafreeman] Update `subtract` to work with `generics.R`
      f3ba785 [cafreeman] Fixed duplicate export
      275deb4 [cafreeman] Update `NAMESPACE` and tests
      1a3b63d [cafreeman] new version of `CreateDF`
      836c4bf [cafreeman] Update `createDataFrame` and `toDF`
      be5d5c1 [cafreeman] refactor schema functions
      40338a4 [Zongheng Yang] Merge pull request #244 from sun-rui/SPARKR-154_5
      20b97a6 [Zongheng Yang] Merge pull request #234 from hqzizania/assist
      ba54e34 [Shivaram Venkataraman] Merge pull request #238 from sun-rui/SPARKR-154_4
      c9497a3 [Shivaram Venkataraman] Merge pull request #208 from lythesia/master
      b317aa7 [Zongheng Yang] Merge pull request #243 from hqzizania/master
      136a07e [Zongheng Yang] Merge pull request #242 from hqzizania/stats
      cd66603 [cafreeman] new line at EOF
      8b76e81 [Shivaram Venkataraman] Merge pull request #233 from redbaron/fail-early-on-missing-dep
      7dd81b7 [cafreeman] Documentation
      0e2a94f [cafreeman] Define functions for schema and fields
      59e206de
    • [SPARK-6113] [ml] Stabilize DecisionTree API · a83571ac
      Joseph K. Bradley authored
      This is a PR for cleaning up and finalizing the DecisionTree API.  PRs for ensembles will follow once this is merged.
      
      ### Goal
      
      Here is the description copied from the JIRA (for both trees and ensembles):
      
      > **Issue**: The APIs for DecisionTree and ensembles (RandomForests and GradientBoostedTrees) have been experimental for a long time. The API has become very convoluted because trees and ensembles have many, many variants, some of which we have added incrementally without a long-term design.
      > **Proposal**: This JIRA is for discussing changes required to finalize the APIs. After we discuss, I will make a PR to update the APIs and make them non-Experimental. This will require making many breaking changes; see the design doc for details.
      > **[Design doc](https://docs.google.com/document/d/1rJ_DZinyDG3PkYkAKSsQlY0QgCeefn4hUv7GsPkzBP4)** : This outlines current issues and the proposed API.
      
      Overall code layout:
      * The old API in mllib.tree.* will remain the same.
      * The new API will reside in ml.classification.* and ml.regression.*
      
      ### Summary of changes
      
      Old API
      * Exactly the same, except I made 1 method in Loss private (but that is not a breaking change since that method was introduced after the Spark 1.3 release).
      
      New APIs
      * Under Pipeline API
      * The new API preserves functionality, except:
        * New API does NOT store prob (probability of label in classification).  I want to have it store the full vector of probabilities but feel that should be in a later PR.
      * Use abstractions for parameters, estimators, and models to avoid code duplication
      * Limit parameters to relevant algorithms
      * For enum-like types, only expose Strings
        * We can make these pluggable later on by adding new parameters.  That is a far-future item.
      
      Test suites
      * I organized DecisionTreeSuite, but I made absolutely no changes to the tests themselves.
      * The test suites for the new API only test (a) similarity with the results of the old API and (b) elements of the new API.
        * After code is moved to this new API, we should move the tests from the old suites which test the internals.
      
      ### Details
      
      #### Changed names
      
      Parameters
      * useNodeIdCache -> cacheNodeIds
      
      #### Other changes
      
      * Split: Changed categories to set instead of list
      
      #### Non-decision tree changes
      * AttributeGroup
        * Added parentheses to toMetadata, toStructField methods (These were removed in a previous PR, but I ran into 1 issue with the Scala compiler not being able to disambiguate between a toMetadata method with no parentheses and a toMetadata method which takes 1 argument.)
      * Attributes
        * Renamed: toMetadata -> toMetadataImpl
        * Added toMetadata methods which return ML metadata (keyed with “ML_ATTR”)
        * NominalAttribute: Added getNumValues method which examines both numValues and values.
      * Params.inheritValues: Checks whether the parent param really belongs to the child (to allow Estimator-Model pairs with different sets of parameters)
      
      ### Questions for reviewers
      
      * Is "DecisionTreeClassificationModel" too long a name?
      * Is this OK in the docs?
      ```
      class DecisionTreeRegressor extends TreeRegressor[DecisionTreeRegressionModel] with DecisionTreeParams[DecisionTreeRegressor] with TreeRegressorParams[DecisionTreeRegressor]
      ```
      
      ### Future
      
      We should open up the abstractions at some point.  E.g., it would be useful to be able to set tree-related parameters in 1 place and then pass those to multiple tree-based algorithms.
      
      Follow-up JIRAs will be (in this order):
      * Tree ensembles
      * Deprecate old tree code
      * Move DecisionTree implementation code to new API.
      * Move tests from the old suites which test the internals.
      * Update programming guide
      * Python API
      * Change RandomForest* to always use bootstrapping, even when numTrees = 1
      * Provide the probability of the predicted label for classification.  After we move code to the new API and update it to maintain probabilities for all labels, then we can add the probabilities to the new API.
      
      CC: mengxr  manishamde  codedeft  chouqin  MechCoder
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #5530 from jkbradley/dt-api-dt and squashes the following commits:
      
      6aae255 [Joseph K. Bradley] Changed tree abstractions not to take type parameters, and for setters to return this.type instead
      ec17947 [Joseph K. Bradley] Updates based on code review.  Main changes were: moving public types from ml.impl.tree to ml.tree, modifying CategoricalSplit to take an Array of categories but store a Set internally, making more types sealed or final
      5626c81 [Joseph K. Bradley] style fixes
      f8fbd24 [Joseph K. Bradley] imported reorg of DecisionTreeSuite from old PR.  small cleanups
      7ef63ed [Joseph K. Bradley] Added DecisionTreeRegressor, test suites, and example (for real this time)
      e11673f [Joseph K. Bradley] Added DecisionTreeRegressor, test suites, and example
      119f407 [Joseph K. Bradley] added DecisionTreeClassifier example
      0bdc486 [Joseph K. Bradley] fixed issues after param PR was merged
      f9fbb60 [Joseph K. Bradley] Done with DecisionTreeClassifier, but no save/load yet.  Need to add example as well
      2532c9a [Joseph K. Bradley] partial move to spark.ml API, not done yet
      c72c1a0 [Joseph K. Bradley] Copied changes for common items, plus DecisionTreeClassifier from original PR
      a83571ac
    • [SPARK-2669] [yarn] Distribute client configuration to AM. · 50ab8a65
      Marcelo Vanzin authored
      Currently, when Spark launches the Yarn AM, the process will use
      the local Hadoop configuration on the node where the AM launches,
      if one is present. A more correct approach is to use the same
      configuration used to launch the Spark job, since the user may
      have made modifications (such as adding app-specific configs).
      
      The approach taken here is to use the distributed cache to make
      all files in the Hadoop configuration directory available to the
      AM. This is a little overkill since only the AM needs them (the
      executors use the broadcast Hadoop configuration from the driver),
      but is the easier approach.
      
      Even though only a few files in that directory may end up being
      used, all of them are uploaded. This allows supporting use cases
      such as when auxiliary configuration files are used for SSL
      configuration, or when uploading a Hive configuration directory.
      Not all of these may be reflected in a o.a.h.conf.Configuration object,
      but may be needed when a driver in cluster mode instantiates, for
      example, a HiveConf object instead.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #4142 from vanzin/SPARK-2669 and squashes the following commits:
      
      f5434b9 [Marcelo Vanzin] Merge branch 'master' into SPARK-2669
      013f0fb [Marcelo Vanzin] Review feedback.
      f693152 [Marcelo Vanzin] Le sigh.
      ed45b7d [Marcelo Vanzin] Zip all config files and upload them as an archive.
      5927b6b [Marcelo Vanzin] Merge branch 'master' into SPARK-2669
      cbb9fb3 [Marcelo Vanzin] Remove stale test.
      e3e58d0 [Marcelo Vanzin] Merge branch 'master' into SPARK-2669
      e3d0613 [Marcelo Vanzin] Review feedback.
      34bdbd8 [Marcelo Vanzin] Fix test.
      022a688 [Marcelo Vanzin] Merge branch 'master' into SPARK-2669
      a77ddd5 [Marcelo Vanzin] Merge branch 'master' into SPARK-2669
      79221c7 [Marcelo Vanzin] [SPARK-2669] [yarn] Distribute client configuration to AM.
      50ab8a65
    • [SPARK-6957] [SPARK-6958] [SQL] improve API compatibility to pandas · c84d9169
      Davies Liu authored
      ```
      select(['cola', 'colb'])
      
      groupby(['colA', 'colB'])
      groupby([df.colA, df.colB])
      
      df.sort('A', ascending=True)
      df.sort(['A', 'B'], ascending=True)
      df.sort(['A', 'B'], ascending=[1, 0])
      ```
      
      cc rxin
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #5544 from davies/compatibility and squashes the following commits:
      
      4944058 [Davies Liu] add docstrings
      adb2816 [Davies Liu] Merge branch 'master' of github.com:apache/spark into compatibility
      bcbbcab [Davies Liu] support ascending as list
      8dabdf0 [Davies Liu] improve API compatibility to pandas
      c84d9169
    • [SPARK-6604][PySpark] Specify IP of Python server socket · dc48ba9f
      linweizhong authored
      The driver now starts a server socket with a wildcard IP; using 127.0.0.0 is more reasonable, as it is only used by the local Python process.
      /cc davies
      
      Author: linweizhong <linweizhong@huawei.com>
      
      Closes #5256 from Sephiroth-Lin/SPARK-6604 and squashes the following commits:
      
      7b3c633 [linweizhong] rephrase
      dc48ba9f
    • [SPARK-6952] Handle long args when detecting PID reuse · f6a9a57a
      Punya Biswal authored
      sbin/spark-daemon.sh used
      
          ps -p "$TARGET_PID" -o args=
      
      to figure out whether the process running with the expected PID is actually a Spark
      daemon. When running with a large classpath, the output of ps gets
      truncated and the check fails spuriously.
      
      This weakens the check to see if it's a java command (which is something
      we do in other parts of the script) rather than looking for the specific
      main class name. This means that SPARK-4832 might happen under a
      slightly broader range of circumstances (a java program happened to
      reuse the same PID), but it seems worthwhile compared to failing
      consistently with a large classpath.
      
      Author: Punya Biswal <pbiswal@palantir.com>
      
      Closes #5535 from punya/feature/SPARK-6952 and squashes the following commits:
      
      7ea12d1 [Punya Biswal] Handle long args when detecting PID reuse
      f6a9a57a
    • [SPARK-6046] [core] Reorganize deprecated config support in SparkConf. · 4527761b
      Marcelo Vanzin authored
      This change tries to follow the chosen way for handling deprecated
      configs in SparkConf: all values (old and new) are kept in the conf
      object, and newer names take precedence over older ones when
      retrieving the value.
      
      Warnings are logged when config options are set, which generally happens
      on the driver node (where the logs are most visible).
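      A simplified sketch of that precedence rule, using hypothetical helper names rather than SparkConf's actual internals:
      
      ```
      // Newer keys win; a deprecated key is only consulted as a fallback.
      def getWithDeprecation(settings: Map[String, String],
                             newKey: String,
                             deprecatedKey: String): Option[String] =
        settings.get(newKey).orElse(settings.get(deprecatedKey))
      ```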
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #5514 from vanzin/SPARK-6046 and squashes the following commits:
      
      9371529 [Marcelo Vanzin] Avoid math.
      6cf3f11 [Marcelo Vanzin] Review feedback.
      2445d48 [Marcelo Vanzin] Fix (and cleanup) update interval initialization.
      b6824be [Marcelo Vanzin] Clean up the other deprecated config use also.
      ab20351 [Marcelo Vanzin] Update FsHistoryProvider to only retrieve new config key.
      2c93209 [Marcelo Vanzin] [SPARK-6046] [core] Reorganize deprecated config support in SparkConf.
      4527761b
    • SPARK-6846 [WEBUI] Stage kill URL easy to accidentally trigger and possibility for security issue · f7a25644
      Sean Owen authored
      kill endpoints now only accept a POST (kill stage, master kill app, master kill driver); kill link now POSTs
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #5528 from srowen/SPARK-6846 and squashes the following commits:
      
      137ac9f [Sean Owen] Oops, fix scalastyle line length problem
      7c5f961 [Sean Owen] Add Imran's test of kill link
      59f447d [Sean Owen] kill endpoints now only accept a POST (kill stage, master kill app, master kill driver); kill link now POSTs
      f7a25644
  2. Apr 16, 2015
    • [SPARK-6972][SQL] Add Coalesce to DataFrame · 8220d526
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #5545 from marmbrus/addCoalesce and squashes the following commits:
      
      9fdf3f6 [Michael Armbrust] [SPARK-6972][SQL] Add Coalesce to DataFrame
      8220d526
    • [SPARK-6966][SQL] Use correct ClassLoader for JDBC Driver · e5949c28
      Michael Armbrust authored
      Otherwise we cannot add jars with drivers after the fact.
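      A hedged sketch of the idea (the driver class name is a placeholder, and Spark's actual classloader helper differs):
      
      ```
      // Resolve the JDBC driver through the thread context classloader, so that
      // jars added after startup are visible to the lookup.
      val driverClass = Class.forName(
        "org.example.SomeDriver",  // placeholder driver class name
        true,
        Thread.currentThread().getContextClassLoader)
      ```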
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #5543 from marmbrus/jdbcClassloader and squashes the following commits:
      
      d9930f3 [Michael Armbrust] fix imports
      73d0614 [Michael Armbrust] [SPARK-6966][SQL] Use correct ClassLoader for JDBC Driver
      e5949c28
    • [SPARK-6899][SQL] Fix type mismatch when using codegen with Average on DecimalType · 1e43851d
      Liang-Chi Hsieh authored
      JIRA https://issues.apache.org/jira/browse/SPARK-6899
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #5517 from viirya/fix_codegen_average and squashes the following commits:
      
      8ae5f65 [Liang-Chi Hsieh] Add the case of DecimalType.Unlimited to Average.
      1e43851d
    • [SQL][Minor] Fix foreachUp of treenode · d9660867
      scwf authored
      `foreachUp` should run the given function recursively on [[children]] and then on this node (just like transformUp). The current implementation does not follow this.
      
      As a result, checkAnalysis does not check from the bottom of the logical tree up.
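      A minimal sketch of the intended traversal, assuming a simplified stand-in for Catalyst's TreeNode:
      
      ```
      abstract class SimpleNode {
        def children: Seq[SimpleNode]
      
        // Recurse into children first, then apply f to this node, so the
        // function runs bottom-up, mirroring transformUp.
        def foreachUp(f: SimpleNode => Unit): Unit = {
          children.foreach(_.foreachUp(f))
          f(this)
        }
      }
      ```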
      
      Author: scwf <wangfei1@huawei.com>
      Author: Fei Wang <wangfei1@huawei.com>
      
      Closes #5518 from scwf/patch-1 and squashes the following commits:
      
      18e28b2 [scwf] added a test case
      1ccbfa8 [Fei Wang] fix foreachUp
      d9660867
    • [SPARK-6911] [SQL] improve accessor for nested types · 6183b5e2
      Davies Liu authored
      Support access columns by index in Python:
      ```
      >>> df[df[0] > 3].collect()
      [Row(age=5, name=u'Bob')]
      ```
      
      Access items in ArrayType or MapType
      ```
      >>> df.select(df.l.getItem(0), df.d.getItem("key")).show()
      >>> df.select(df.l[0], df.d["key"]).show()
      ```
      
      Access field in StructType
      ```
      >>> df.select(df.r.getField("b")).show()
      >>> df.select(df.r.a).show()
      ```
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #5513 from davies/access and squashes the following commits:
      
      e04d5a0 [Davies Liu] Update run-tests-jenkins
      7ada9eb [Davies Liu] update timeout
      d125ac4 [Davies Liu] check column name, improve scala tests
      6b62540 [Davies Liu] fix test
      db15b42 [Davies Liu] Merge branch 'master' of github.com:apache/spark into access
      6c32e79 [Davies Liu] add scala tests
      11f1df3 [Davies Liu] improve accessor for nested types
      6183b5e2
    • SPARK-6927 [SQL] Sorting Error when codegen on · 5fe43433
      云峤 authored
      Fix this error by adding a BinaryType comparator in GenerateOrdering.
      JIRA https://issues.apache.org/jira/browse/SPARK-6927
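      A sketch of the kind of lexicographic comparison BinaryType needs (illustrative, not the actual generated code):
      
      ```
      // Compare byte arrays as unsigned bytes; on a common prefix, shorter sorts first.
      def compareBinary(a: Array[Byte], b: Array[Byte]): Int = {
        val n = math.min(a.length, b.length)
        var i = 0
        while (i < n) {
          val cmp = (a(i) & 0xff) - (b(i) & 0xff)
          if (cmp != 0) return cmp
          i += 1
        }
        a.length - b.length
      }
      ```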
      
      Author: 云峤 <chensong.cs@alibaba-inc.com>
      
      Closes #5524 from kaka1992/fix-codegen-sort and squashes the following commits:
      
      d7e2afe [云峤] fix codegen sorting error
      5fe43433
    • [SPARK-4897] [PySpark] Python 3 support · 04e44b37
      Davies Liu authored
      This PR updates PySpark to support Python 3 (tested with 3.4).
      
      Known issue: unpickling arrays from Pyrolite is broken in Python 3, so those tests are skipped.
      
      TODO: ec2/spark-ec2.py is not fully tested with python3.
      
      Author: Davies Liu <davies@databricks.com>
      Author: twneale <twneale@gmail.com>
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #5173 from davies/python3 and squashes the following commits:
      
      d7d6323 [Davies Liu] fix tests
      6c52a98 [Davies Liu] fix mllib test
      99e334f [Davies Liu] update timeout
      b716610 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      cafd5ec [Davies Liu] address comments from @mengxr
      bf225d7 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      179fc8d [Davies Liu] tuning flaky tests
      8c8b957 [Davies Liu] fix ResourceWarning in Python 3
      5c57c95 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      4006829 [Davies Liu] fix test
      2fc0066 [Davies Liu] add python3 path
      71535e9 [Davies Liu] fix xrange and divide
      5a55ab4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      125f12c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ed498c8 [Davies Liu] fix compatibility with python 3
      820e649 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      e8ce8c9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ad7c374 [Davies Liu] fix mllib test and warning
      ef1fc2f [Davies Liu] fix tests
      4eee14a [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      20112ff [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      59bb492 [Davies Liu] fix tests
      1da268c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ca0fdd3 [Davies Liu] fix code style
      9563a15 [Davies Liu] add imap back for python 2
      0b1ec04 [Davies Liu] make python examples work with Python 3
      d2fd566 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      a716d34 [Davies Liu] test with python 3.4
      f1700e8 [Davies Liu] fix test in python3
      671b1db [Davies Liu] fix test in python3
      692ff47 [Davies Liu] fix flaky test
      7b9699f [Davies Liu] invalidate import cache for Python 3.3+
      9c58497 [Davies Liu] fix kill worker
      309bfbf [Davies Liu] keep compatibility
      5707476 [Davies Liu] cleanup, fix hash of string in 3.3+
      8662d5b [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      f53e1f0 [Davies Liu] fix tests
      70b6b73 [Davies Liu] compile ec2/spark_ec2.py in python 3
      a39167e [Davies Liu] support customize class in __main__
      814c77b [Davies Liu] run unittests with python 3
      7f4476e [Davies Liu] mllib tests passed
      d737924 [Davies Liu] pass ml tests
      375ea17 [Davies Liu] SQL tests pass
      6cc42a9 [Davies Liu] rename
      431a8de [Davies Liu] streaming tests pass
      78901a7 [Davies Liu] fix hash of serializer in Python 3
      24b2f2e [Davies Liu] pass all RDD tests
      35f48fe [Davies Liu] run future again
      1eebac2 [Davies Liu] fix conflict in ec2/spark_ec2.py
      6e3c21d [Davies Liu] make cloudpickle work with Python3
      2fb2db3 [Josh Rosen] Guard more changes behind sys.version; still doesn't run
      1aa5e8f [twneale] Turned out `pickle.DictionaryType is dict` == True, so swapped it out
      7354371 [twneale] buffer --> memoryview  I'm not super sure if this a valid change, but the 2.7 docs recommend using memoryview over buffer where possible, so hoping it'll work.
      b69ccdf [twneale] Uses the pure python pickle._Pickler instead of c-extension _pickle.Pickler. It appears pyspark 2.7 uses the pure python pickler as well, so this shouldn't degrade pickling performance (?).
      f40d925 [twneale] xrange --> range
      e104215 [twneale] Replaces 2.7 types.InstanceType with 3.4 `object`....could be horribly wrong depending on how types.InstanceType is used elsewhere in the package--see http://bugs.python.org/issue8206
      79de9d0 [twneale] Replaces python2.7 `file` with 3.4 _io.TextIOWrapper
      2adb42d [Josh Rosen] Fix up some import differences between Python 2 and 3
      854be27 [Josh Rosen] Run `futurize` on Python code:
      7c5b4ce [Josh Rosen] Remove Python 3 check in shell.py.
      04e44b37
    • [SPARK-6855] [SPARKR] Set R includes to get the right collate order. · 55f553a9
      Shivaram Venkataraman authored
      This prevents tools like devtools::document from creating invalid collate orders.
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #5462 from shivaram/collate-order and squashes the following commits:
      
      f3db562 [Shivaram Venkataraman] Set R includes to get the right collate order. This prevents tools like devtools::document creating invalid collate orders
      55f553a9
    • [SPARK-6934][Core] Use 'spark.akka.askTimeout' for the ask timeout · ef3fb801
      zsxwing authored
      Fixed my mistake in #4588
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #5529 from zsxwing/SPARK-6934 and squashes the following commits:
      
      9890b2d [zsxwing] Use 'spark.akka.askTimeout' for the ask timeout
      ef3fb801
    • [SPARK-6694][SQL]SparkSQL CLI must be able to specify an option --database on the command line. · 3ae37b93
      Jin Adachi authored
      The Spark SQL CLI has a --database option, as shown below; however, the option is ignored.
      
      ```
      $ spark-sql --help
      :
      CLI options:
          :
          --database <databasename>     Specify the database to use
      ```
      
      Author: Jin Adachi <adachij2002@yahoo.co.jp>
      Author: adachij <adachij@nttdata.co.jp>
      
      Closes #5345 from adachij2002/SPARK-6694 and squashes the following commits:
      
      8659084 [Jin Adachi] Merge branch 'master' of https://github.com/apache/spark into SPARK-6694
      0301eb9 [Jin Adachi] Merge branch 'master' of https://github.com/apache/spark into SPARK-6694
      df81086 [Jin Adachi] Modify code style.
      846f83e [Jin Adachi] Merge branch 'master' of https://github.com/apache/spark into SPARK-6694
      dbe8c63 [Jin Adachi] Change file permission to 644.
      7b58f42 [Jin Adachi] Merge branch 'master' of https://github.com/apache/spark into SPARK-6694
      c581d06 [Jin Adachi] Add an option --database test
      db56122 [Jin Adachi] Merge branch 'SPARK-6694' of https://github.com/adachij2002/spark into SPARK-6694
      ee09fa5 [adachij] Merge branch 'master' into SPARK-6694
      c804c03 [adachij] SparkSQL CLI must be able to specify an option --database on the command line.
      3ae37b93
    • [SPARK-4194] [core] Make SparkContext initialization exception-safe. · de4fa6b6
      Marcelo Vanzin authored
      SparkContext has a very long constructor, where multiple things are
      initialized, multiple threads are spawned, and multiple opportunities
      for exceptions to be thrown exist. If one of these happens at an
      inopportune time, lots of garbage tends to stick around.
      
      This patch re-organizes SparkContext so that its internal state is
      initialized in a big "try" block. The fields keeping state are now
      completely private to SparkContext, and are "vars", because Scala
      doesn't allow you to initialize a val later. The existing API interface
      is kept by turning vals into defs (which works because Scala guarantees
      the same binary interface for those).
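      A schematic sketch of that pattern, with placeholder types (Env and Context are illustrative, not the real SparkContext):
      
      ```
      class Env
      
      class Context {
        private var _env: Env = _   // private mutable state, assigned inside `try`
      
        try {
          _env = new Env            // stand-in for the many initialization steps
        } catch {
          case e: Throwable =>
            stop()                  // clean up whatever was partially initialized
            throw e
        }
      
        def env: Env = _env         // callers see a stable `def`, preserving the binary interface
      
        def stop(): Unit = { /* release resources; must tolerate partially-built state */ }
      }
      ```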
      
      On top of that, a few things in other areas were changed to avoid more
      things leaking:
      
      - Executor was changed to explicitly wait for the heartbeat thread to
        stop. LocalBackend was changed to wait for the "StopExecutor"
        message to be received, since otherwise there could be a race
        between that message arriving and the actor system being shut down.
      - ConnectionManager could possibly hang during shutdown, because an
        interrupt at the wrong moment could cause the selector thread to
        still call select and then wait forever. So also wake up the
        selector so that this situation is avoided.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #5335 from vanzin/SPARK-4194 and squashes the following commits:
      
      746b661 [Marcelo Vanzin] Fix borked merge.
      80fc00e [Marcelo Vanzin] Merge branch 'master' into SPARK-4194
      408dada [Marcelo Vanzin] Merge branch 'master' into SPARK-4194
      2621609 [Marcelo Vanzin] Merge branch 'master' into SPARK-4194
      6b73fcb [Marcelo Vanzin] Scalastyle.
      c671c46 [Marcelo Vanzin] Fix merge.
      3979aad [Marcelo Vanzin] Merge branch 'master' into SPARK-4194
      8caa8b3 [Marcelo Vanzin] [SPARK-4194] [core] Make SparkContext initialization exception-safe.
      071f16e [Marcelo Vanzin] Nits.
      27456b9 [Marcelo Vanzin] More exception safety.
      a0b0881 [Marcelo Vanzin] Stop alloc manager before scheduler.
      5545d83 [Marcelo Vanzin] [SPARK-6650] [core] Stop ExecutorAllocationManager when context stops.
      de4fa6b6
    • SPARK-4783 [CORE] System.exit() calls in SparkContext disrupt applications embedding Spark · 6179a948
      Sean Owen authored
      Avoid `System.exit(1)` in `TaskSchedulerImpl` and convert to `SparkException`; ensure scheduler calls `sc.stop()` even when this exception is thrown.
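      An illustrative sketch of the change (simplified; the method name is a stand-in for the scheduler's error path):
      
      ```
      import org.apache.spark.SparkException
      
      def fatalSchedulerError(message: String): Nothing = {
        // Before: System.exit(1), which also kills any application embedding Spark.
        // After: throw an exception the embedding application can observe and handle.
        throw new SparkException(s"Exiting due to error from cluster scheduler: $message")
      }
      ```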
      
      CC mateiz aarondav as those who may have last touched this code.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #5492 from srowen/SPARK-4783 and squashes the following commits:
      
      60dc682 [Sean Owen] Avoid System.exit(1) in TaskSchedulerImpl and convert to SparkException; ensure scheduler calls sc.stop() even when this exception is thrown
      6179a948
    • [Streaming][minor] Remove additional quote and unneeded imports · 83705505
      jerryshao authored
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #5540 from jerryshao/minor-fix and squashes the following commits:
      
      ebaa646 [jerryshao] Minor fix
      83705505
    • [SPARK-6893][ML] default pipeline parameter handling in python · 57cd1e86
      Xiangrui Meng authored
      Same as #5431 but for Python. jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #5534 from mengxr/SPARK-6893 and squashes the following commits:
      
      d3b519b [Xiangrui Meng] address comments
      ebaccc6 [Xiangrui Meng] style update
      fce244e [Xiangrui Meng] update explainParams with test
      4d6b07a [Xiangrui Meng] add tests
      5294500 [Xiangrui Meng] update default param handling in python
      57cd1e86
  3. Apr 15, 2015
    • SPARK-6938: All require statements now have an informative error message. · 52c3439a
      Juliet Hougland authored
      This PR adds informative error messages to all require statements in the Vectors class that did not previously have them. This references [SPARK-6938](https://issues.apache.org/jira/browse/SPARK-6938).
      
      Author: Juliet Hougland <juliet@cloudera.com>
      
      Closes #5532 from jhlch/SPARK-6938 and squashes the following commits:
      
      ab321bb [Juliet Hougland] Remove braces from string interpolation when not required.
      1221f94 [Juliet Hougland] All require statements now have an informative error message.
      52c3439a
    • [SPARK-5277][SQL] - SparkSqlSerializer doesn't always register user specified KryoRegistrators · 8a53de16
      Max Seiden authored
      
      There were a few places where new SparkSqlSerializer instances were created with new, empty SparkConfs, resulting in user-specified registrators sometimes not being initialized.
      
      The fix is to try to pull a conf from the SparkEnv, and to construct a new conf (one that loads defaults) if none can be found (see the sketch below).
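      A minimal sketch of that fallback, assuming the standard SparkEnv accessors:
      
      ```
      import org.apache.spark.{SparkConf, SparkEnv}
      
      // Prefer the conf of the running SparkEnv; otherwise build a fresh conf that
      // loads defaults, so user-specified registrators are still picked up.
      def resolveConf(): SparkConf =
        Option(SparkEnv.get).map(_.conf).getOrElse(new SparkConf())
      ```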
      
      The changes touched:
          1) SparkSqlSerializer's resource pool (this appears to fix the issue in the comment)
          2) execution.Exchange (for all of the partitioners)
          3) execution.Limit (for the HashPartitioner)
      
      A few tests were added to ColumnTypeSuite, ensuring that a custom registrator and serde is initialized and used when in-memory columns are written.
      
      Author: Max Seiden <max@platfora.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Michael Armbrust <michael@databricks.com>
      
      Closes #5237 from mhseiden/sql_udt_kryo and squashes the following commits:
      
      3175c2f [Max Seiden] [SPARK-5277][SQL] - address code review comments
      e5011fb [Max Seiden] [SPARK-5277][SQL] - SparkSqlSerializer does not register user specified KryoRegistrators
      8a53de16
    • [SPARK-2312] Logging Unhandled messages · d5f1b965
      Isaias Barroso authored
      The previous solution was changed based on the discussions in https://github.com/apache/spark/pull/2048.
      
      Author: Isaias Barroso <isaias.barroso@gmail.com>
      
      Closes #2055 from isaias/SPARK-2312 and squashes the following commits:
      
      f61d9e6 [Isaias Barroso] Change Log level for unhandled message to debug
      f341777 [Isaias Barroso] [SPARK-2312] Logging Unhandled messages
      d5f1b965
    • [SPARK-2213] [SQL] sort merge join for spark sql · 585638e8
      Daoyuan Wang authored
      Thanks for the initial work from Ishiihara in #3173
      
      This PR introduces a new join method, sort merge join, which first ensures that keys of the same value are in the same partition and that, within each partition, the Rows are sorted by key. Then we can run down both sides together and find matched rows using [sort merge join](http://en.wikipedia.org/wiki/Sort-merge_join). This way, we don't have to store the whole hash table of one side as hash join does, so we use less memory. Also, this PR would benefit from #3438, making the sorting phase much more efficient.
      
      We introduced a new configuration, "spark.sql.planner.sortMergeJoin", to switch between this (`true`) and ShuffledHashJoin (`false`); we probably want its default value to be `false` at first.
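      A toy sketch of the merge phase over two key-sorted sequences (illustrative; the real implementation streams over partitioned, sorted iterators of Rows):
      
      ```
      def sortMergeJoin[K, A, B](left: IndexedSeq[(K, A)], right: IndexedSeq[(K, B)])
                                (implicit ord: Ordering[K]): Seq[(K, (A, B))] = {
        val out = scala.collection.mutable.ArrayBuffer.empty[(K, (A, B))]
        var i = 0
        var j = 0
        while (i < left.length && j < right.length) {
          val c = ord.compare(left(i)._1, right(j)._1)
          if (c < 0) i += 1       // left key is smaller: advance left
          else if (c > 0) j += 1  // right key is smaller: advance right
          else {
            // Equal keys: collect the run on each side and emit the cross product.
            val k = left(i)._1
            val i0 = i; while (i < left.length && ord.equiv(left(i)._1, k)) i += 1
            val j0 = j; while (j < right.length && ord.equiv(right(j)._1, k)) j += 1
            for (x <- i0 until i; y <- j0 until j) out += ((k, (left(x)._2, right(y)._2)))
          }
        }
        out.toSeq
      }
      ```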
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      Author: Michael Armbrust <michael@databricks.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Michael Armbrust <michael@databricks.com>
      
      Closes #5208 from adrian-wang/smj and squashes the following commits:
      
      2493b9f [Daoyuan Wang] fix style
      5049d88 [Daoyuan Wang] propagate rowOrdering for RangePartitioning
      f91a2ae [Daoyuan Wang] yin's comment: use external sort if option is enabled, add comments
      f515cd2 [Daoyuan Wang] yin's comment: outputOrdering, join suite refine
      ec8061b [Daoyuan Wang] minor change
      413fd24 [Daoyuan Wang] Merge pull request #3 from marmbrus/pr/5208
      952168a [Michael Armbrust] add type
      5492884 [Michael Armbrust] copy when ordering
      7ddd656 [Michael Armbrust] Cleanup addition of ordering requirements
      b198278 [Daoyuan Wang] inherit ordering in project
      c8e82a3 [Daoyuan Wang] fix style
      6e897dd [Daoyuan Wang] hide boundReference from manually construct RowOrdering for key compare in smj
      8681d73 [Daoyuan Wang] refactor Exchange and fix copy for sorting
      2875ef2 [Daoyuan Wang] fix changed configuration
      61d7f49 [Daoyuan Wang] add omitted comment
      00a4430 [Daoyuan Wang] fix bug
      078d69b [Daoyuan Wang] address comments: add comments, do sort in shuffle, and others
      3af6ba5 [Daoyuan Wang] use buffer for only one side
      171001f [Daoyuan Wang] change default outputordering
      47455c9 [Daoyuan Wang] add apache license ...
      a28277f [Daoyuan Wang] fix style
      645c70b [Daoyuan Wang] address comments using sort
      068c35d [Daoyuan Wang] fix new style and add some tests
      925203b [Daoyuan Wang] address comments
      07ce92f [Daoyuan Wang] fix ArrayIndexOutOfBound
      42fca0e [Daoyuan Wang] code clean
      e3ec096 [Daoyuan Wang] fix comment style..
      2edd235 [Daoyuan Wang] fix outputpartitioning
      57baa40 [Daoyuan Wang] fix sort eval bug
      303b6da [Daoyuan Wang] fix several errors
      95db7ad [Daoyuan Wang] fix brackets for if-statement
      4464f16 [Daoyuan Wang] fix error
      880d8e9 [Daoyuan Wang] sort merge join for spark sql
      585638e8
    • [SPARK-6898][SQL] completely support special chars in column names · 4754e16f
      Wenchen Fan authored
      Even if we wrap column names in backticks like `` `a#$b.c` ``, we still handle the "." inside the column name specially. I think it's fragile to use a special char to split name parts; why not put the name parts in `UnresolvedAttribute` directly?
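      A small sketch of the idea, using a simplified stand-in for Catalyst's UnresolvedAttribute:
      
      ```
      // Keep the name parts as a Seq so no special split character is needed;
      // quoting is applied only when rendering the name back to a string.
      case class AttributeName(nameParts: Seq[String]) {
        def render: String =
          nameParts.map(p => if (p.exists(c => c == '.' || c == '`')) s"`$p`" else p).mkString(".")
      }
      
      // e.g. AttributeName(Seq("a#$b.c")).render gives "`a#$b.c`"
      ```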
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Michael Armbrust <michael@databricks.com>
      
      Closes #5511 from cloud-fan/6898 and squashes the following commits:
      
      48e3e57 [Wenchen Fan] more style fix
      820dc45 [Wenchen Fan] do not ignore newName in UnresolvedAttribute
      d81ad43 [Wenchen Fan] fix style
      11699d6 [Wenchen Fan] completely support special chars in column names
      4754e16f
    • [SPARK-6937][MLLIB] Fixed bug in PICExample in which the radius were not being accepted on c... · 557a797a
      sboeschhuawei authored
      Tiny bug in PowerIterationClusteringExample in which the radius was not accepted from the command line.
      
      Author: sboeschhuawei <stephen.boesch@huawei.com>
      
      Closes #5531 from javadba/picsub and squashes the following commits:
      
      2aab8cf [sboeschhuawei] Fixed bug in PICExample in which the radius were not being accepted on command line
      557a797a
    • [SPARK-6844][SQL] Clean up accumulators used in InMemoryRelation when it is uncached · cf38fe04
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-6844
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #5475 from viirya/cache_memory_leak and squashes the following commits:
      
      0b41235 [Liang-Chi Hsieh] fix style.
      dc1d5d5 [Liang-Chi Hsieh] For comments.
      78af229 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into cache_memory_leak
      26c9bb6 [Liang-Chi Hsieh] Add configuration to enable in-memory table scan accumulators.
      1c3b06e [Liang-Chi Hsieh] Clean up accumulators used in InMemoryRelation when it is uncached.
      cf38fe04
    • [SPARK-6638] [SQL] Improve performance of StringType in SQL · 85842760
      Davies Liu authored
      This PR changes the internal representation of StringType from java.lang.String to UTF8String, which is implemented using Array[Byte].
      
      This PR should not break any public API; Row.getString() will still return java.lang.String.
      
      This is the first step in improving the performance of String in SQL.
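      A toy sketch of the representation change (illustrative; Spark's UTF8String carries many more operations):
      
      ```
      import java.nio.charset.StandardCharsets.UTF_8
      
      // Hold the UTF-8 bytes directly; materialize a java.lang.String only at
      // the public API boundary (e.g. Row.getString()).
      final class ToyUTF8String(val bytes: Array[Byte]) {
        override def toString: String = new String(bytes, UTF_8)
      }
      
      object ToyUTF8String {
        def fromString(s: String): ToyUTF8String = new ToyUTF8String(s.getBytes(UTF_8))
      }
      ```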
      
      cc rxin
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #5350 from davies/string and squashes the following commits:
      
      3b7bfa8 [Davies Liu] fix schema of AddJar
      2772f0d [Davies Liu] fix new test failure
      6d776a9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into string
      59025c8 [Davies Liu] address comments from @marmbrus
      341ec2c [Davies Liu] turn off scala style check in UTF8StringSuite
      744788f [Davies Liu] Merge branch 'master' of github.com:apache/spark into string
      b04a19c [Davies Liu] add comment for getString/setString
      08d897b [Davies Liu] Merge branch 'master' of github.com:apache/spark into string
      5116b43 [Davies Liu] rollback unrelated changes
      1314a37 [Davies Liu] address comments from Yin
      867bf50 [Davies Liu] fix String filter push down
      13d9d42 [Davies Liu] Merge branch 'master' of github.com:apache/spark into string
      2089d24 [Davies Liu] add hashcode check back
      ac18ae6 [Davies Liu] address comment
      fd11364 [Davies Liu] optimize UTF8String
      8d17f21 [Davies Liu] fix hive compatibility tests
      e5fa5b8 [Davies Liu] remove clone in UTF8String
      28f3d81 [Davies Liu] Merge branch 'master' of github.com:apache/spark into string
      28d6f32 [Davies Liu] refactor
      537631c [Davies Liu] some comment about Date
      9f4c194 [Davies Liu] convert data type for data source
      956b0a4 [Davies Liu] fix hive tests
      73e4363 [Davies Liu] Merge branch 'master' of github.com:apache/spark into string
      9dc32d1 [Davies Liu] fix some hive tests
      23a766c [Davies Liu] refactor
      8b45864 [Davies Liu] fix codegen with UTF8String
      bb52e44 [Davies Liu] fix scala style
      c7dd4d2 [Davies Liu] fix some catalyst tests
      38c303e [Davies Liu] fix python sql tests
      5f9e120 [Davies Liu] fix sql tests
      6b499ac [Davies Liu] fix style
      a85fb27 [Davies Liu] refactor
      d32abd1 [Davies Liu] fix utf8 for python api
      4699c3a [Davies Liu] use Array[Byte] in UTF8String
      21f67c6 [Davies Liu] cleanup
      685fd07 [Davies Liu] use UTF8String instead of String for StringType
      85842760
    • [SPARK-6887][SQL] ColumnBuilder misses FloatType · 785f9558
      Yin Huai authored
      https://issues.apache.org/jira/browse/SPARK-6887
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #5499 from yhuai/inMemFloat and squashes the following commits:
      
      84cba38 [Yin Huai] Add test.
      4b75ba6 [Yin Huai] Add FloatType back.
      785f9558
    • [SPARK-6800][SQL] Update doc for JDBCRelation's columnPartition · e3e4e9a3
      Liang-Chi Hsieh authored
      JIRA https://issues.apache.org/jira/browse/SPARK-6800
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #5488 from viirya/fix_jdbc_where and squashes the following commits:
      
      51386c8 [Liang-Chi Hsieh] Update code comment.
      1dcc929 [Liang-Chi Hsieh] Update document.
      3eb74d6 [Liang-Chi Hsieh] Revert and modify doc.
      df11783 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into fix_jdbc_where
      3e7db15 [Liang-Chi Hsieh] Fix wrong logic to generate WHERE clause for JDBC.
      e3e4e9a3
    • [SPARK-6730][SQL] Allow using keyword as identifier in OPTIONS · b75b3070
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-6730
      
      It is very possible that a keyword will be used as an identifier in `OPTIONS`; this PR makes that work.
      
      However, another approach is to require that `OPTIONS` not include keywords, using an alternative identifier (e.g. table -> cassandraTable) if needed.
      
      If so, please let me know and I'll close this PR. Thanks.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #5520 from viirya/relax_options and squashes the following commits:
      
      339fd68 [Liang-Chi Hsieh] Use regex parser.
      92be11c [Liang-Chi Hsieh] Allow using keyword as identifier in OPTIONS.
      b75b3070
    • [SPARK-6886] [PySpark] fix big closure with shuffle · f11288d5
      Davies Liu authored
      Currently, the created broadcast object has the same life cycle as the RDD in Python. For multi-stage jobs, a PythonRDD will be created in the JVM while the RDD in Python may be GCed, so the broadcast could be destroyed in the JVM before the PythonRDD.
      
      This PR changes to using the PythonRDD to track the lifecycle of the broadcast object. It also refactors getNumPartitions() to avoid unnecessary creation of PythonRDD, which could be heavy.
      
      cc JoshRosen
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #5496 from davies/big_closure and squashes the following commits:
      
      9a0ea4c [Davies Liu] fix big closure with shuffle
      f11288d5