  1. Oct 03, 2014
    • [SPARK-3775] Not suitable error message in spark-shell.cmd · 358d7ffd
      Masayoshi TSUZUKI authored
Modified some sentences of the error messages in bin\*.cmd.
      
      Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>
      
      Closes #2640 from tsudukim/feature/SPARK-3775 and squashes the following commits:
      
      3458afb [Masayoshi TSUZUKI] [SPARK-3775] Not suitable error message in spark-shell.cmd
      358d7ffd
    • [SPARK-3535][Mesos] Fix resource handling. · a8c52d53
      Brenden Matthews authored
      Author: Brenden Matthews <brenden@diddyinc.com>
      
      Closes #2401 from brndnmtthws/master and squashes the following commits:
      
      4abaa5d [Brenden Matthews] [SPARK-3535][Mesos] Fix resource handling.
      a8c52d53
    • [SPARK-3212][SQL] Use logical plan matching instead of temporary tables for table caching · 6a1d48f4
      Michael Armbrust authored
      _Also addresses: SPARK-1671, SPARK-1379 and SPARK-3641_
      
This PR introduces a new trait, `CacheManager`, which replaces the previous temporary-table-based caching system.  Instead of creating a temporary table that shadows an existing table with an equivalent cached representation, the cache manager maintains a separate list of logical plans and their cached data.  After optimization, this list is searched for any matching plan fragments.  When a matching plan fragment is found, it is replaced with the cached data.
      
      There are several advantages to this approach:
 - Calling .cache() on a SchemaRDD now works as you would expect, and uses the more efficient columnar representation (see the sketch after this list).
 - It's now possible to provide a list of temporary tables, without having to decide if a given table is actually just a cached persistent table. (To be done in a follow-up PR)
 - In some cases it is possible that cached data will be used, even if a cached table was not explicitly requested.  This is because we now look at the logical structure instead of the table name.
 - We now correctly invalidate when data is inserted into a Hive table.
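A minimal sketch of the first point, assuming Spark 1.1-era APIs and illustrative file/table/column names (not code from this PR):

```
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("cache-sketch").setMaster("local"))
val sqlContext = new SQLContext(sc)

val people = sqlContext.jsonFile("people.json")   // assumed input file
people.registerTempTable("people")

// .cache() now stores the SchemaRDD in the efficient in-memory columnar format.
people.cache()
people.count()   // materializes the cache

// Because caching is matched on logical plan fragments, this query can reuse the
// cached data even though it never refers to an explicitly "cached table".
sqlContext.sql("SELECT name FROM people WHERE age >= 18").collect()

sc.stop()
```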
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2501 from marmbrus/caching and squashes the following commits:
      
      63fbc2c [Michael Armbrust] Merge remote-tracking branch 'origin/master' into caching.
      0ea889e [Michael Armbrust] Address comments.
      1e23287 [Michael Armbrust] Add support for cache invalidation for hive inserts.
      65ed04a [Michael Armbrust] fix tests.
      bdf9a3f [Michael Armbrust] Merge remote-tracking branch 'origin/master' into caching
      b4b77f2 [Michael Armbrust] Address comments
      6923c9d [Michael Armbrust] More comments / tests
      80f26ac [Michael Armbrust] First draft of improved semantics for Spark SQL caching.
      6a1d48f4
    • [SPARK-3007][SQL] Adds dynamic partitioning support · bec0d0ea
      Cheng Lian authored
PR #2226 was reverted because it broke Jenkins builds for an unknown reason. This debugging PR aims to fix the Jenkins build.
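For context, a hedged sketch of the kind of dynamic-partition insert this feature enables; the table and column names are illustrative and the SET commands follow standard Hive conventions, this is not code from the PR:

```
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("dynamic-partition-sketch").setMaster("local"))
val hiveContext = new HiveContext(sc)

// Dynamic partitioning has to be enabled on the Hive side first.
hiveContext.sql("SET hive.exec.dynamic.partition = true")
hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

hiveContext.sql(
  "CREATE TABLE IF NOT EXISTS dynamic_part_table (key INT, value STRING) " +
  "PARTITIONED BY (partcol1 STRING, partcol2 STRING)")

// Partition values are taken from the trailing SELECT columns at runtime
// instead of being spelled out in the PARTITION clause.
hiveContext.sql(
  "INSERT OVERWRITE TABLE dynamic_part_table PARTITION (partcol1, partcol2) " +
  "SELECT key, value, part1, part2 FROM src_table")

sc.stop()
```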
      
      This PR also fixes two bugs:
      
      1. Compression configurations in `InsertIntoHiveTable` are disabled by mistake
      
         The `FileSinkDesc` object passed to the writer container doesn't have compression related configurations. These configurations are not taken care of until `saveAsHiveFile` is called. This PR moves compression code forward, right after instantiation of the `FileSinkDesc` object.
      
2. `PreInsertionCasts` doesn't take table partitions into account
      
   In `castChildOutput`, `table.attributes` only contains non-partition columns, thus for partitioned tables `childOutputDataTypes` never equals `tableOutputDataTypes`. This results in a funny analyzed plan like this:
      
         ```
         == Analyzed Logical Plan ==
         InsertIntoTable Map(partcol1 -> None, partcol2 -> None), false
          MetastoreRelation default, dynamic_part_table, None
          Project [c_0#1164,c_1#1165,c_2#1166]
           Project [c_0#1164,c_1#1165,c_2#1166]
            Project [c_0#1164,c_1#1165,c_2#1166]
             ... (repeats 99 times) ...
              Project [c_0#1164,c_1#1165,c_2#1166]
               Project [c_0#1164,c_1#1165,c_2#1166]
                Project [1 AS c_0#1164,1 AS c_1#1165,1 AS c_2#1166]
                 Filter (key#1170 = 150)
                  MetastoreRelation default, src, None
         ```
      
   Awful though this logical plan looks, it's harmless because all the Projects will be eliminated by the optimizer. That's probably why this issue hasn't been caught before.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      Author: baishuo(白硕) <vc_java@hotmail.com>
      Author: baishuo <vc_java@hotmail.com>
      
      Closes #2616 from liancheng/dp-fix and squashes the following commits:
      
      21935b6 [Cheng Lian] Adds back deleted trailing space
      f471c4b [Cheng Lian] PreInsertionCasts should take table partitions into account
      a132c80 [Cheng Lian] Fixes output compression
      9c6eb2d [Cheng Lian] Adds tests to verify dynamic partitioning folder layout
      0eed349 [Cheng Lian] Addresses @yhuai's comments
      26632c3 [Cheng Lian] Adds more tests
      9227181 [Cheng Lian] Minor refactoring
      c47470e [Cheng Lian] Refactors InsertIntoHiveTable to a Command
      6fb16d7 [Cheng Lian] Fixes typo in test name, regenerated golden answer files
      d53daa5 [Cheng Lian] Refactors dynamic partitioning support
      b821611 [baishuo] pass check style
      997c990 [baishuo] use HiveConf.DEFAULTPARTITIONNAME to replace hive.exec.default.partition.name
      761ecf2 [baishuo] modify according micheal's advice
      207c6ac [baishuo] modify for some bad indentation
      caea6fb [baishuo] modify code to pass scala style checks
      b660e74 [baishuo] delete a empty else branch
      cd822f0 [baishuo] do a little modify
      8e7268c [baishuo] update file after test
      3f91665 [baishuo(白硕)] Update Cast.scala
      8ad173c [baishuo(白硕)] Update InsertIntoHiveTable.scala
      051ba91 [baishuo(白硕)] Update Cast.scala
      d452eb3 [baishuo(白硕)] Update HiveQuerySuite.scala
      37c603b [baishuo(白硕)] Update InsertIntoHiveTable.scala
      98cfb1f [baishuo(白硕)] Update HiveCompatibilitySuite.scala
      6af73f4 [baishuo(白硕)] Update InsertIntoHiveTable.scala
      adf02f1 [baishuo(白硕)] Update InsertIntoHiveTable.scala
      1867e23 [baishuo(白硕)] Update SparkHadoopWriter.scala
      6bb5880 [baishuo(白硕)] Update HiveQl.scala
      bec0d0ea
    • [SPARK-2778] [yarn] Add workaround for race in MiniYARNCluster. · fbe8e985
      Marcelo Vanzin authored
Sometimes the cluster's start() method returns before the configuration
has been updated, which is done by ClientRMService in, I assume, a
separate thread (otherwise there would be no race). That can cause tests
to fail if the old configuration data is read, since it will contain
the wrong RM address.
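A rough sketch of the kind of workaround this implies: poll the cluster's configuration until the RM address is no longer the default placeholder. The helper name, default-address check, and timeout below are assumptions, not the PR's code.

```
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.yarn.conf.YarnConfiguration
import org.apache.hadoop.yarn.server.MiniYARNCluster

// Assumed helper: wait until MiniYARNCluster has published the real RM address.
def waitForRmAddress(cluster: MiniYARNCluster, timeoutMs: Long = 10000L): Configuration = {
  val deadline = System.currentTimeMillis() + timeoutMs
  var config = cluster.getConfig
  // The default host "0.0.0.0" means the configuration has not been updated yet.
  while (Option(config.get(YarnConfiguration.RM_ADDRESS)).forall(_.startsWith("0.0.0.0")) &&
         System.currentTimeMillis() < deadline) {
    Thread.sleep(100)
    config = cluster.getConfig
  }
  config
}
```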
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #2605 from vanzin/SPARK-2778 and squashes the following commits:
      
      8d02ce0 [Marcelo Vanzin] Minor cleanup.
      5bebee7 [Marcelo Vanzin] [SPARK-2778] [yarn] Add workaround for race in MiniYARNCluster.
      fbe8e985
    • [SPARK-2693][SQL] Supported for UDAF Hive Aggregates like PERCENTILE · 22f8e1ee
      ravipesala authored
Implemented UDAF Hive aggregates by adding a wrapper to Spark Hive.
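For illustration, the kind of query this enables (a sketch only; the `src` table here is the standard Hive example table with an INT `key` column, used purely as an assumption):

```
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("hive-udaf-sketch").setMaster("local"))
val hiveContext = new HiveContext(sc)

// Hive UDAFs such as percentile can now be called directly from Spark SQL queries.
val median = hiveContext.sql("SELECT percentile(key, 0.5) FROM src")
median.collect().foreach(println)

sc.stop()
```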
      
      Author: ravipesala <ravindra.pesala@huawei.com>
      
      Closes #2620 from ravipesala/SPARK-2693 and squashes the following commits:
      
      a8df326 [ravipesala] Removed resolver from constructor arguments
      caf25c6 [ravipesala] Fixed style issues
      5786200 [ravipesala] Supported for UDAF Hive Aggregates like PERCENTILE
      22f8e1ee
    • [SPARK-3696]Do not override the user-difined conf_dir · 9d320e22
      WangTaoTheTonic authored
      https://issues.apache.org/jira/browse/SPARK-3696
      
We check whether SPARK_CONF_DIR is already defined before assigning it.
      
      Author: WangTaoTheTonic <barneystinson@aliyun.com>
      
      Closes #2541 from WangTaoTheTonic/confdir and squashes the following commits:
      
      c3f31e0 [WangTaoTheTonic] Do not override the user-difined conf_dir
      9d320e22
    • SPARK-2058: Overriding SPARK_HOME/conf with SPARK_CONF_DIR · f0811f92
      EugenCepoi authored
      Update of PR #997.
      
      With this PR, setting SPARK_CONF_DIR overrides SPARK_HOME/conf (not only spark-defaults.conf and spark-env).
      
      Author: EugenCepoi <cepoi.eugen@gmail.com>
      
      Closes #2481 from EugenCepoi/SPARK-2058 and squashes the following commits:
      
      0bb32c2 [EugenCepoi] use orElse orNull and fixing trailing percent in compute-classpath.cmd
      77f35d7 [EugenCepoi] SPARK-2058: Overriding SPARK_HOME/conf with SPARK_CONF_DIR
      f0811f92
    • [SPARK-3366][MLLIB]Compute best splits distributively in decision tree · 2e4eae3a
      qiping.lqp authored
Currently, all best splits are computed on the driver, which makes the driver a bottleneck for both communication and computation. This PR fixes that by computing best splits on the executors.
Instead of sending all aggregate stats to the driver node, we send the aggregate stats for each node to a particular executor using a `reduceByKey` operation, and then compute the best split for that node there.
      
      Implementation details:
      
Each node now has a nodeStatsAggregator, which saves aggregate stats for all features and bins.
First, mapPartitions is used to compute node aggregate stats for all nodes in each partition.
Then the node aggregate stats are transformed into (nodeIndex, nodeStatsAggregator) pairs, and a `reduceByKey` operation combines the nodeStatsAggregators for the same node.
After all stats have been combined, the best split can be computed for each node based on its aggregate stats. The best-split results are collected to the driver to construct the decision tree.
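A simplified sketch of this aggregation pattern, with the real MLlib aggregator replaced by a plain array of doubles and the best-split computation stubbed out (all names here are illustrative, not MLlib's):

```
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.x

object BestSplitSketch {
  // Toy "aggregate stats": one value per bin; the real code uses a stats aggregator per node.
  type NodeStats = Array[Double]

  def merge(a: NodeStats, b: NodeStats): NodeStats =
    a.zip(b).map { case (x, y) => x + y }

  // Stub: pretend the "best split" is just the bin with the largest aggregated value.
  def bestSplit(stats: NodeStats): Int = stats.indexOf(stats.max)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("best-split-sketch").setMaster("local"))

    // Pretend each partition produced (nodeIndex, partial stats) pairs via mapPartitions.
    val perPartitionStats = sc.parallelize(Seq(
      (0, Array(1.0, 2.0, 3.0)), (1, Array(0.5, 0.5, 0.5)),
      (0, Array(2.0, 1.0, 0.0)), (1, Array(1.0, 3.0, 1.0))), 2)

    // reduceByKey combines the stats for each node on the executors, and the best
    // split is computed there too; only the small per-node results reach the driver.
    val bestSplits = perPartitionStats
      .reduceByKey(merge)
      .mapValues(bestSplit)
      .collectAsMap()

    println(bestSplits)
    sc.stop()
  }
}
```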
      
      CC: mengxr manishamde jkbradley, please help me review this, thanks.
      
      Author: qiping.lqp <qiping.lqp@alibaba-inc.com>
      Author: chouqin <liqiping1991@gmail.com>
      
      Closes #2595 from chouqin/dt-dist-agg and squashes the following commits:
      
      db0d24a [chouqin] fix a minor bug and adjust code
      a0d9de3 [chouqin] adjust code based on comments
      9f201a6 [chouqin] fix bug: statsSize -> allStatsSize
      a8a7ed0 [chouqin] Merge branch 'master' of https://github.com/apache/spark into dt-dist-agg
      f13b346 [chouqin] adjust randomforest comments
      c32636e [chouqin] adjust code based on comments
      ac6a505 [chouqin] adjust code based on comments
      7bbb787 [chouqin] add comments
      bdd2a63 [qiping.lqp] fix test suite
      a75df27 [qiping.lqp] fix test suite
      b5b0bc2 [qiping.lqp] fix style
      e76414f [qiping.lqp] fix testsuite
      748bd45 [qiping.lqp] fix type-mismatch bug
      24eacd8 [qiping.lqp] fix type-mismatch bug
      5f63d6c [qiping.lqp] add multiclassification using One-Vs-All strategy
      4f56496 [qiping.lqp] fix bug
      f00fc22 [qiping.lqp] fix bug
      532993a [qiping.lqp] Compute best splits distributively in decision tree
      2e4eae3a
  2. Oct 02, 2014
    • [SPARK-3654][SQL] Implement all extended HiveQL statements/commands with a... · 1c90347a
      ravipesala authored
      [SPARK-3654][SQL] Implement all extended HiveQL statements/commands with a separate parser combinator
      
Created a separate parser for HQL. It pre-parses commands like cache, uncache, add jar, etc., and then parses the rest with HiveQl.
      
      Author: ravipesala <ravindra.pesala@huawei.com>
      
      Closes #2590 from ravipesala/SPARK-3654 and squashes the following commits:
      
      bbca7dd [ravipesala] Fixed code as per admin comments.
      ae9290a [ravipesala] Fixed style issues as per Admin comments
      898ed81 [ravipesala] Removed spaces
      fb24edf [ravipesala] Updated the code as per admin comments
      8947d37 [ravipesala] Removed duplicate code
      ba26cd1 [ravipesala] Created seperate parser for hql.It pre parses the commands like cache,uncache,add jar etc.. and then parses with HiveQl
      1c90347a
    • [SQL] Initilize session state before creating CommandProcessor · 7de4e50a
      Michael Armbrust authored
      With the old ordering it was possible for commands in the HiveDriver to NPE due to the lack of configuration in the threadlocal session state.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2635 from marmbrus/initOrder and squashes the following commits:
      
      9749850 [Michael Armbrust] Initilize session state before creating CommandProcessor
      7de4e50a
    • [DEPLOY] SPARK-3759: Return the exit code of the driver process · 42d5077f
      Eric Eijkelenboom authored
      SparkSubmitDriverBootstrapper.scala now returns the exit code of the driver process, instead of always returning 0.
      
      Author: Eric Eijkelenboom <ee@userreport.com>
      
      Closes #2628 from ericeijkelenboom/master and squashes the following commits:
      
      cc4a571 [Eric Eijkelenboom] Return the exit code of the driver process
      42d5077f
    • [SPARK-3755][Core] avoid trying privileged port when request a non-privileged port · 8081ce8b
      scwf authored
pwendell, ```tryPort``` is not compatible with the old code from the last PR; this fixes it.
After discussing with srowen, the title was renamed to "avoid trying privileged port when request a non-privileged port". Please refer to the discussion for details.
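A hedged sketch of the intended retry behavior (not the actual `tryPort` code): when a non-privileged start port is requested, retries should wrap around within the non-privileged range instead of ever landing in 1-1024.

```
// Sketch only: compute the candidate port for the given retry attempt.
// Port 0 means "let the OS pick a port" and is passed through unchanged.
def candidatePort(startPort: Int, attempt: Int): Int = {
  if (startPort == 0) {
    0
  } else {
    // Wrap around past 65535, skipping the privileged range 1-1024.
    ((startPort + attempt - 1024) % (65536 - 1024)) + 1024
  }
}
```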
      
      Author: scwf <wangfei1@huawei.com>
      
      Closes #2623 from scwf/1-1024 and squashes the following commits:
      
      10a4437 [scwf] add comment
      de3fd17 [scwf] do not try privileged port when request a non-privileged port
      42cb0fa [scwf] make tryPort compatible with old code
      cb8cc76 [scwf] do not use port 1 - 1024
      8081ce8b
    • [SPARK-3632] ConnectionManager can run out of receive threads with authentication on · 127e97be
      Thomas Graves authored
If you turn authentication on and you are using a lot of executors, there is a chance that all of the threads in the handleMessageExecutor could be waiting to send a message because they are blocked waiting on authentication to happen. This can cause a temporary deadlock until the connection times out.
      
      To fix it, I got rid of the wait/notify and use a single outbox but only send security messages from it until authentication has completed.
      
      Author: Thomas Graves <tgraves@apache.org>
      
      Closes #2484 from tgravescs/cm_threads_auth and squashes the following commits:
      
      a0a961d [Thomas Graves] give it a type
      b6bc80b [Thomas Graves] Rework comments
      d6d4175 [Thomas Graves] update from comments
      081b765 [Thomas Graves] cleanup
      4d7f8f5 [Thomas Graves] Change to not use wait/notify while waiting for authentication
      127e97be
    • [SPARK-3495] Block replication fails continuously when the replication target... · 5db78e6b
      Tathagata Das authored
      [SPARK-3495] Block replication fails continuously when the replication target node is dead AND [SPARK-3496] Block replication by mistake chooses driver as target
      
If a block manager (say, A) wants to replicate a block and the node chosen for replication (say, B) is dead, then the attempt to send the block to B fails. However, it continues to fail indefinitely: even after the driver learns about the demise of B, A keeps trying to replicate to B and failing miserably.
      
      The reason behind this bug is that A initially fetches a list of peers from the driver (when B was active), but never updates it after B is dead. This affects Spark Streaming as its receiver uses block replication.
      
      The solution in this patch adds the following.
      - Changed BlockManagerMaster to return all the peers of a block manager, rather than the requested number. It also filters out driver BlockManager.
      - Refactored BlockManager's replication code to handle peer caching correctly.
          + The peer for replication is randomly selected. This is different from past behavior where for a node A, a node B was deterministically chosen for the lifetime of the application.
          + If replication fails to one node, the peers are refetched.
    + The peer cache has a TTL of 1 second, to enable discovery of new peers and their use for replication.
- Refactored the use of <driver> in BlockManager into a new method `BlockManagerId.isDriver`
      - Added replication unit tests (replication was not tested till now, duh!)
      
      This should not make a difference in performance of Spark workloads where replication is not used.
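A simplified sketch of the peer-caching idea described above (a TTL-based cache with forced refetch on failure; the type and method names here are stand-ins, not BlockManager's actual code):

```
import scala.util.Random

// Illustrative stand-in for BlockManagerId.
case class PeerId(host: String, port: Int, isDriver: Boolean)

class PeerCacheSketch(fetchPeersFromMaster: () => Seq[PeerId], ttlMs: Long = 1000L) {
  private var cachedPeers: Seq[PeerId] = Nil
  private var lastFetchTime: Long = 0L

  /** Return cached peers, refetching from the master when the TTL expires or when forced. */
  def getPeers(forceFetch: Boolean = false): Seq[PeerId] = synchronized {
    val now = System.currentTimeMillis()
    if (forceFetch || cachedPeers.isEmpty || now - lastFetchTime > ttlMs) {
      // The driver's block manager is filtered out; it is never a replication target.
      cachedPeers = fetchPeersFromMaster().filterNot(_.isDriver)
      lastFetchTime = now
    }
    cachedPeers
  }

  /** Pick a random target; after a failed replication the caller retries with forceFetch = true. */
  def choosePeer(): Option[PeerId] = {
    val peers = getPeers()
    if (peers.isEmpty) None else Some(peers(Random.nextInt(peers.size)))
  }
}
```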
      
      @andrewor14 @JoshRosen
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #2366 from tdas/replication-fix and squashes the following commits:
      
      9690f57 [Tathagata Das] Moved replication tests to a new BlockManagerReplicationSuite.
      0661773 [Tathagata Das] Minor changes based on PR comments.
      a55a65c [Tathagata Das] Added a unit test to test replication behavior.
      012afa3 [Tathagata Das] Bug fix
      89f91a0 [Tathagata Das] Minor change.
      68e2c72 [Tathagata Das] Made replication peer selection logic more efficient.
      08afaa9 [Tathagata Das] Made peer selection for replication deterministic to block id
      3821ab9 [Tathagata Das] Fixes based on PR comments.
      08e5646 [Tathagata Das] More minor changes.
      d402506 [Tathagata Das] Fixed imports.
      4a20531 [Tathagata Das] Filtered driver block manager from peer list, and also consolidated the use of <driver> in BlockManager.
      7598f91 [Tathagata Das] Minor changes.
      03de02d [Tathagata Das] Change replication logic to correctly refetch peers from master on failure and on new worker addition.
      d081bf6 [Tathagata Das] Fixed bug in get peers and unit tests to test get-peers and replication under executor churn.
      9f0ac9f [Tathagata Das] Modified replication tests to fail on replication bug.
      af0c1da [Tathagata Das] Added replication unit tests to BlockManagerSuite
      5db78e6b
    • [SPARK-3766][Doc]Snappy is also the default compress codec for broadcast variables · c6469a02
      scwf authored
      Author: scwf <wangfei1@huawei.com>
      
      Closes #2632 from scwf/compress-doc and squashes the following commits:
      
      7983a1a [scwf] snappy is the default compression codec for broadcast
      c6469a02
    • Modify default YARN memory_overhead-- from an additive constant to a multiplier · b4fb7b80
      Nishkam Ravi authored
      Redone against the recent master branch (https://github.com/apache/spark/pull/1391)
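A sketch of the multiplier-based overhead computation this change moves toward; the factor and floor below are assumptions for illustration, not necessarily the merged defaults:

```
// Overhead as a fraction of the requested container memory, with a minimum floor,
// instead of a fixed additive constant.
val memoryOverheadFactor = 0.07   // assumed multiplier
val minMemoryOverheadMB = 384     // assumed floor, in MB

def memoryOverheadMB(requestedMemoryMB: Int): Int =
  math.max((memoryOverheadFactor * requestedMemoryMB).toInt, minMemoryOverheadMB)

// e.g. a 10 GB executor: max((0.07 * 10240).toInt, 384) = 716 MB of overhead
```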
      
      Author: Nishkam Ravi <nravi@cloudera.com>
      Author: nravi <nravi@c1704.halxg.cloudera.com>
      Author: nishkamravi2 <nishkamravi@gmail.com>
      
      Closes #2485 from nishkamravi2/master_nravi and squashes the following commits:
      
      636a9ff [nishkamravi2] Update YarnAllocator.scala
      8f76c8b [Nishkam Ravi] Doc change for yarn memory overhead
      35daa64 [Nishkam Ravi] Slight change in the doc for yarn memory overhead
      5ac2ec1 [Nishkam Ravi] Remove out
      dac1047 [Nishkam Ravi] Additional documentation for yarn memory overhead issue
      42c2c3d [Nishkam Ravi] Additional changes for yarn memory overhead issue
      362da5e [Nishkam Ravi] Additional changes for yarn memory overhead
      c726bd9 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      f00fa31 [Nishkam Ravi] Improving logging for AM memoryOverhead
      1cf2d1e [nishkamravi2] Update YarnAllocator.scala
      ebcde10 [Nishkam Ravi] Modify default YARN memory_overhead-- from an additive constant to a multiplier (redone to resolve merge conflicts)
      2e69f11 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      efd688a [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark
      2b630f9 [nravi] Accept memory input as "30g", "512M" instead of an int value, to be consistent with rest of Spark
      3bf8fad [nravi] Merge branch 'master' of https://github.com/apache/spark
      5423a03 [nravi] Merge branch 'master' of https://github.com/apache/spark
      eb663ca [nravi] Merge branch 'master' of https://github.com/apache/spark
      df2aeb1 [nravi] Improved fix for ConcurrentModificationIssue (Spark-1097, Hadoop-10456)
      6b840f0 [nravi] Undo the fix for SPARK-1758 (the problem is fixed)
      5108700 [nravi] Fix in Spark for the Concurrent thread modification issue (SPARK-1097, HADOOP-10456)
      681b36f [nravi] Fix for SPARK-1758: failing test org.apache.spark.JavaAPISuite.wholeTextFiles
      b4fb7b80
    • [SQL][Docs] Update the output of printSchema and fix a typo in SQL programming guide. · 82a6a083
      Yin Huai authored
      We have changed the output format of `printSchema`. This PR will update our SQL programming guide to show the updated format. Also, it fixes a typo (the value type of `StructType` in Java API).
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #2630 from yhuai/sqlDoc and squashes the following commits:
      
      267d63e [Yin Huai] Update the output of printSchema and fix a typo.
      82a6a083
    • [SPARK-3706][PySpark] Cannot run IPython REPL with IPYTHON set to "1" and PYSPARK_PYTHON unset · 5b4a5b1a
      cocoatomo authored
      ### Problem
      
      The section "Using the shell" in Spark Programming Guide (https://spark.apache.org/docs/latest/programming-guide.html#using-the-shell) says that we can run pyspark REPL through IPython.
      But a folloing command does not run IPython but a default Python executable.
      
      ```
      $ IPYTHON=1 ./bin/pyspark
      Python 2.7.8 (default, Jul  2 2014, 10:14:46)
      ...
      ```
      
The spark/bin/pyspark script at commit b235e013 decides which executable and options to use in the following way.
      
      1. if PYSPARK_PYTHON unset
         * → defaulting to "python"
      2. if IPYTHON_OPTS set
         * → set IPYTHON "1"
3. if a Python script is passed to ./bin/pyspark → run it with ./bin/spark-submit
   * out of this issue's scope
      4. if IPYTHON set as "1"
         * → execute $PYSPARK_PYTHON (default: ipython) with arguments $IPYTHON_OPTS
         * otherwise execute $PYSPARK_PYTHON
      
Therefore, when PYSPARK_PYTHON is unset, python is executed even though IPYTHON is "1".
In other words, when PYSPARK_PYTHON is unset, IPYTHON_OPTS and IPYTHON have no effect on deciding which command to use.
      
      PYSPARK_PYTHON | IPYTHON_OPTS | IPYTHON | resulting command | expected command
      ---- | ---- | ----- | ----- | -----
      (unset → defaults to python) | (unset) | (unset) | python | (same)
      (unset → defaults to python) | (unset) | 1 | python | ipython
      (unset → defaults to python) | an_option | (unset → set to 1) | python an_option | ipython an_option
      (unset → defaults to python) | an_option | 1 | python an_option | ipython an_option
      ipython | (unset) | (unset) | ipython | (same)
      ipython | (unset) | 1 | ipython | (same)
      ipython | an_option | (unset → set to 1) | ipython an_option | (same)
      ipython | an_option | 1 | ipython an_option | (same)
      
      ### Suggestion
      
The pyspark script should first determine whether the user wants to run IPython or another executable.
      
      1. if IPYTHON_OPTS set
         * set IPYTHON "1"
      2.  if IPYTHON has a value "1"
         * PYSPARK_PYTHON defaults to "ipython" if not set
      3. PYSPARK_PYTHON defaults to "python" if not set
      
      See the pull request for more detailed modification.
      
      Author: cocoatomo <cocoatomo77@gmail.com>
      
      Closes #2554 from cocoatomo/issues/cannot-run-ipython-without-options and squashes the following commits:
      
      d2a9b06 [cocoatomo] [SPARK-3706][PySpark] Use PYTHONUNBUFFERED environment variable instead of -u option
      264114c [cocoatomo] [SPARK-3706][PySpark] Remove the sentence about deprecated environment variables
      42e02d5 [cocoatomo] [SPARK-3706][PySpark] Replace environment variables used to customize execution of PySpark REPL
      10d56fb [cocoatomo] [SPARK-3706][PySpark] Cannot run IPython REPL with IPYTHON set to "1" and PYSPARK_PYTHON unset
      5b4a5b1a
    • SPARK-1767: Prefer HDFS-cached replicas when scheduling data-local tasks · 6e27cb63
      Colin Patrick Mccabe authored
      This change reorders the replicas returned by
      HadoopRDD#getPreferredLocations so that replicas cached by HDFS are at
      the start of the list.  This requires Hadoop 2.5 or higher; previous
      versions of Hadoop do not expose the information needed to determine
      whether a replica is cached.
      
      Author: Colin Patrick Mccabe <cmccabe@cloudera.com>
      
      Closes #1486 from cmccabe/SPARK-1767 and squashes the following commits:
      
      338d4f8 [Colin Patrick Mccabe] SPARK-1767: Prefer HDFS-cached replicas when scheduling data-local tasks
      6e27cb63
    • [SPARK-3371][SQL] Renaming a function expression with group by gives error · bbdf1de8
      ravipesala authored
      The following code gives error.
      ```
      sqlContext.registerFunction("len", (s: String) => s.length)
      sqlContext.sql("select len(foo) as a, count(1) from t1 group by len(foo)").collect()
      ```
This happens because the SQL parser creates aliases for the functions in grouping expressions using generated alias names. So if the user gives alias names to the functions inside the projection, they do not match the generated alias names of the grouping expressions.
These kinds of queries work in Hive.
The fix: if the user provides an alias for the function in the projection, don't generate an alias in the grouping expression; use the same alias.
      
      Author: ravipesala <ravindra.pesala@huawei.com>
      
      Closes #2511 from ravipesala/SPARK-3371 and squashes the following commits:
      
      9fb973f [ravipesala] Removed aliases to grouping expressions.
      f8ace79 [ravipesala] Fixed the testcase issue
      bad2fd0 [ravipesala] SPARK-3371 : Fixed Renaming a function expression with group by gives error
      bbdf1de8
    • MAINTENANCE: Automated closing of pull requests. · f341e1c8
      Patrick Wendell authored
      This commit exists to close the following pull requests on Github:
      
      Closes #1375 (close requested by 'pwendell')
      Closes #476 (close requested by 'mengxr')
      Closes #2502 (close requested by 'pwendell')
      Closes #2391 (close requested by 'andrewor14')
      f341e1c8
  3. Oct 01, 2014
    • [SPARK-3446] Expose underlying job ids in FutureAction. · 29c35132
      Marcelo Vanzin authored
      FutureAction is the only type exposed through the async APIs, so
      for job IDs to be useful they need to be exposed there. The complication
      is that some async jobs run more than one job (e.g. takeAsync),
so the exposed ID actually has to be a list of IDs that can change
over time. So the interface doesn't look very nice, but...
      
      Change is actually small, I just added a basic test to make sure
      it works.
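A hedged usage sketch, assuming the exposed member on FutureAction is a collection named `jobIds` (illustrative only; see the PR for the actual interface):

```
import scala.concurrent.Await
import scala.concurrent.duration._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // async actions via implicit conversions in Spark 1.x

val sc = new SparkContext(new SparkConf().setAppName("future-action-sketch").setMaster("local"))
val rdd = sc.parallelize(1 to 1000, 4)

// takeAsync may run more than one Spark job under the hood, which is why the
// exposed identifier is a collection of job IDs rather than a single value.
val future = rdd.takeAsync(10)
Await.result(future, 30.seconds)
println(s"Job IDs run for this action: ${future.jobIds.mkString(", ")}")

sc.stop()
```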
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #2337 from vanzin/SPARK-3446 and squashes the following commits:
      
      e166a68 [Marcelo Vanzin] Fix comment.
      1fed2bc [Marcelo Vanzin] [SPARK-3446] Expose underlying job ids in FutureAction.
      29c35132
    • SPARK-3638 | Forced a compatible version of http client in kinesis-asl profile · 93861a5e
      aniketbhatnagar authored
This patch forces the use of Commons HTTP client 4.2 in the kinesis-asl profile so that the AWS SDK does not run into dependency conflicts.
      
      Author: aniketbhatnagar <aniket.bhatnagar@gmail.com>
      
      Closes #2535 from aniketbhatnagar/Kinesis-HttpClient-Dep-Fix and squashes the following commits:
      
      aa2079f [aniketbhatnagar] Merge branch 'Kinesis-HttpClient-Dep-Fix' of https://github.com/aniketbhatnagar/spark into Kinesis-HttpClient-Dep-Fix
      73f55f6 [aniketbhatnagar] SPARK-3638 | Forced a compatible version of http client in kinesis-asl profile
      70cc75b [aniketbhatnagar] deleted merge files
      725dbc9 [aniketbhatnagar] Merge remote-tracking branch 'origin/Kinesis-HttpClient-Dep-Fix' into Kinesis-HttpClient-Dep-Fix
      4ed61d8 [aniketbhatnagar] SPARK-3638 | Forced a compatible version of http client in kinesis-asl profile
      9cd6103 [aniketbhatnagar] SPARK-3638 | Forced a compatible version of http client in kinesis-asl profile
      93861a5e
    • [SPARK-3704][SQL] Fix ColumnValue type for Short values in thrift server · 1b9f0d67
      scwf authored
For ```ShortType``` we should add a short value to the Hive row; an int value may lead to problems.
      
      Author: scwf <wangfei1@huawei.com>
      
      Closes #2551 from scwf/fix-addColumnValue and squashes the following commits:
      
      08bcc59 [scwf] ColumnValue.shortValue for short type
      1b9f0d67
    • [SPARK-3729][SQL] Do all hive session state initialization in lazy val · 45e058ca
      Michael Armbrust authored
This change avoids an NPE during context initialization when settings are present.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2583 from marmbrus/configNPE and squashes the following commits:
      
      da2ec57 [Michael Armbrust] Do all hive session state initilialization in lazy val
      45e058ca
    • Patrick Wendell · 4e79970d
    • [SQL] Made Command.sideEffectResult protected · a31f4ff2
      Cheng Lian authored
      Considering `Command.executeCollect()` simply delegates to `Command.sideEffectResult`, we no longer need to leave the latter `protected[sql]`.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #2431 from liancheng/narrow-scope and squashes the following commits:
      
      1bfc16a [Cheng Lian] Made Command.sideEffectResult protected
      a31f4ff2
    • [SPARK-3593][SQL] Add support for sorting BinaryType · f84b228c
      Venkata Ramana Gollamudi authored
BinaryType is now derived from NativeType, with Ordering support added.
      
      Author: Venkata Ramana G <ramana.gollamudihuawei.com>
      
      Author: Venkata Ramana Gollamudi <ramana.gollamudi@huawei.com>
      
      Closes #2617 from gvramana/binarytype_sort and squashes the following commits:
      
      1cf26f3 [Venkata Ramana Gollamudi] Supported Sorting of BinaryType
      f84b228c
    • [SPARK-3705][SQL] Add case for VoidObjectInspector to cover NullType · f315fb7e
      scwf authored
Add a case for VoidObjectInspector in ```inspectorToDataType```.
      
      Author: scwf <wangfei1@huawei.com>
      
      Closes #2552 from scwf/inspectorToDataType and squashes the following commits:
      
      453d892 [scwf] add case for VoidObjectInspector
      f315fb7e
    • [SPARK-3708][SQL] Backticks aren't handled correctly is aliases · 3508ce8a
      ravipesala authored
The query below gives an error:
sql("SELECT k FROM (SELECT \`key\` AS \`k\` FROM src) a")
It fails because the aliases are not cleaned, so they cannot be resolved in further processing.
      
      Author: ravipesala <ravindra.pesala@huawei.com>
      
      Closes #2594 from ravipesala/SPARK-3708 and squashes the following commits:
      
      d55db54 [ravipesala] Fixed SPARK-3708 (Backticks aren't handled correctly is aliases)
      3508ce8a
    • [SPARK-3658][SQL] Start thrift server as a daemon · d61f2c15
      WangTaoTheTonic authored
      https://issues.apache.org/jira/browse/SPARK-3658
      
      And keep the `CLASS_NOT_FOUND_EXIT_STATUS` and exit message in `SparkSubmit.scala`.
      
      Author: WangTaoTheTonic <barneystinson@aliyun.com>
      Author: WangTao <barneystinson@aliyun.com>
      
      Closes #2509 from WangTaoTheTonic/thriftserver and squashes the following commits:
      
      5dcaab2 [WangTaoTheTonic] issue about coupling
      8ad9f95 [WangTaoTheTonic] generalization
      598e21e [WangTao] take thrift server as a daemon
      d61f2c15
    • [SPARK-3746][SQL] Lock hive client when creating tables · fcad3fae
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2598 from marmbrus/hiveClientLock and squashes the following commits:
      
      ca89fe8 [Michael Armbrust] Lock hive client when creating tables
      fcad3fae
    • Python SQL Example Code · 17333c7a
      jyotiska authored
      SQL example code for Python, as shown on [SQL Programming Guide](https://spark.apache.org/docs/1.0.2/sql-programming-guide.html)
      
      Author: jyotiska <jyotiska123@gmail.com>
      
      Closes #2521 from jyotiska/sql_example and squashes the following commits:
      
      1471dcb [jyotiska] added imports for sql
      b25e436 [jyotiska] pep 8 compliance
      43fd10a [jyotiska] lines broken to maintain 80 char limit
      b4fdf4e [jyotiska] removed blank lines
      83d5ab7 [jyotiska] added inferschema and applyschema to the demo
      306667e [jyotiska] replaced blank line with end line
      c90502a [jyotiska] fixed new line
      4939a70 [jyotiska] added new line at end for python style
      0b46148 [jyotiska] fixed appname for python sql example
      8f67b5b [jyotiska] added python sql example
      17333c7a
    • Typo error in KafkaWordCount example · b81ee0b4
      Gaspar Munoz authored
      topicpMap to topicMap
      
      Author: Gaspar Munoz <munozs.88@gmail.com>
      
      Closes #2614 from gasparms/patch-1 and squashes the following commits:
      
      00aab2c [Gaspar Munoz] Typo error in KafkaWordCount example
      b81ee0b4
    • [SQL] Kill dangerous trailing space in query string · 8cc70e7e
      Cheng Lian authored
MD5 hashes of query strings in `createQueryTest` calls are used to generate golden files, so leaving trailing spaces there can be really dangerous. Got bitten by this while working on #2616: my "smart" IDE automatically removed a trailing space and made Jenkins fail.
      
      (Really should add "no trailing space" to our coding style guidelines!)
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #2619 from liancheng/kill-trailing-space and squashes the following commits:
      
      034f119 [Cheng Lian] Kill dangerous trailing space in query string
      8cc70e7e
    • [SPARK-3756] [Core]check exception is caused by an address-port collision properly · 2fedb5dd
      scwf authored
The Jetty server uses MultiException to collect exceptions raised while starting the server;
refer to https://github.com/eclipse/jetty.project/blob/jetty-8.1.14.v20131031/jetty-server/src/main/java/org/eclipse/jetty/server/Server.java

So in ```isBindCollision``` add logic to cover MultiException.
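A simplified sketch of that check (recursing into causes and into Jetty's MultiException; not the exact Utils code):

```
import java.net.BindException
import scala.collection.JavaConverters._
import org.eclipse.jetty.util.MultiException

// Sketch: does this exception, or anything it wraps, indicate a port collision?
def isBindCollisionSketch(exception: Throwable): Boolean = exception match {
  case null => false
  case e: BindException
      if e.getMessage != null && e.getMessage.contains("Address already in use") => true
  case e: MultiException =>
    // Jetty wraps startup failures in a MultiException; check every wrapped throwable.
    e.getThrowables.asScala.exists(isBindCollisionSketch)
  case e => isBindCollisionSketch(e.getCause)
}
```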
      
      Author: scwf <wangfei1@huawei.com>
      
      Closes #2611 from scwf/fix-isBindCollision and squashes the following commits:
      
      984cb12 [scwf] optimize the fix
      3a6c849 [scwf] fix bug in isBindCollision
      2fedb5dd
    • [SPARK-3755][Core] Do not bind port 1 - 1024 to server in spark · 6390aae4
      scwf authored
A non-root user using ports 1-1024 to start the Jetty server will get the exception "java.net.SocketException: Permission denied", so do not use these ports.
      
      Author: scwf <wangfei1@huawei.com>
      
      Closes #2610 from scwf/1-1024 and squashes the following commits:
      
      cb8cc76 [scwf] do not use port 1 - 1024
      6390aae4
    • SPARK-2626 [DOCS] Stop SparkContext in all examples · dcb2f73f
      Sean Owen authored
      Call SparkContext.stop() in all examples (and touch up minor nearby code style issues while at it)
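A minimal illustration of the pattern (not one of the actual examples touched by the PR):

```
import org.apache.spark.{SparkConf, SparkContext}

object StopExample {
  def main(args: Array[String]): Unit = {
    // Master set inline only so the sketch runs standalone; real examples rely on spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("StopExample").setMaster("local"))
    println(sc.parallelize(1 to 100).count())
    // Stop the SparkContext explicitly so the application shuts down cleanly.
    sc.stop()
  }
}
```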
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #2575 from srowen/SPARK-2626 and squashes the following commits:
      
      5b2baae [Sean Owen] Call SparkContext.stop() in all examples (and touch up minor nearby code style issues while at it)
      dcb2f73f
    • [SPARK-3749] [PySpark] fix bugs in broadcast large closure of RDD · abf588f4
      Davies Liu authored
1. broadcast is triggered unexpectedly
2. an fd is leaked in the JVM (also leaked in parallelize())
3. the broadcast is not unpersisted in the JVM after the RDD is no longer used
      
      cc JoshRosen , sorry for these stupid bugs.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2603 from davies/fix_broadcast and squashes the following commits:
      
      080a743 [Davies Liu] fix bugs in broadcast large closure of RDD
      abf588f4