  1. Dec 19, 2014
• change signature of example to match released code · c25c669d
      Eran Medan authored
      the signature of registerKryoClasses is actually of Array[Class[_]]  not Seq
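For reference, a minimal usage sketch matching the released signature (the case classes are placeholders):

```
import org.apache.spark.SparkConf

case class Point(x: Double, y: Double)
case class Segment(a: Point, b: Point)

val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // takes an Array[Class[_]], not a Seq
  .registerKryoClasses(Array(classOf[Point], classOf[Segment]))
```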
      
      Author: Eran Medan <ehrann.mehdan@gmail.com>
      
      Closes #3747 from eranation/patch-1 and squashes the following commits:
      
      ee9885d [Eran Medan] change signature of example to match released code
      c25c669d
• [SPARK-2261] Make event logger use a single file. · 45645191
      Marcelo Vanzin authored
      Currently the event logger uses a directory and several files to
      describe an app's event log, all but one of which are empty. This
      is not very HDFS-friendly, since creating lots of nodes in HDFS
      (especially when they don't contain any data) is frowned upon due
      to the node metadata being kept in the NameNode's memory.
      
      Instead, add a header section to the event log file that contains metadata
      needed to read the events. This metadata includes things like the Spark
      version (for future code that may need it for backwards compatibility) and
      the compression codec used for the event data.
      
With the new approach, aside from reducing the load on the NN, far fewer
remote calls are needed when reading the log directory.
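As a rough illustration of the layout (a sketch only — the key names and end marker are made up, not the actual implementation):

```
import java.io.{FileOutputStream, PrintWriter}

val out = new FileOutputStream("app-20141219-0001.inprogress")
val header = new PrintWriter(out)
header.println("SPARK_VERSION=1.3.0")    // for backwards compatibility
header.println("COMPRESSION_CODEC=lz4")  // codec used for the event data
header.println("HEADER_END_MARKER")      // separates metadata from events
header.flush()
// Events are then appended below the header, through the codec's
// compressing output stream when compression is enabled.
```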
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #1222 from vanzin/hist-server-single-log and squashes the following commits:
      
      cc8f5de [Marcelo Vanzin] Store header in plain text.
      c7e6123 [Marcelo Vanzin] Update comment.
      59c561c [Marcelo Vanzin] Review feedback.
      216c5a3 [Marcelo Vanzin] Review comments.
      dce28e9 [Marcelo Vanzin] Fix log overwrite test.
      f91c13e [Marcelo Vanzin] Handle "spark.eventLog.overwrite", and add unit test.
      346f0b4 [Marcelo Vanzin] Review feedback.
      ed0023e [Marcelo Vanzin] Merge branch 'master' into hist-server-single-log
      3f4500f [Marcelo Vanzin] Unit test for SPARK-3697.
      45c7a1f [Marcelo Vanzin] Version of SPARK-3697 for this branch.
      b3ee30b [Marcelo Vanzin] Merge branch 'master' into hist-server-single-log
      a6d5c50 [Marcelo Vanzin] Merge branch 'master' into hist-server-single-log
      16fd491 [Marcelo Vanzin] Use unique log directory for each codec.
      0ef3f70 [Marcelo Vanzin] Merge branch 'master' into hist-server-single-log
      d93c44a [Marcelo Vanzin] Add a newline to make the header more readable.
      9e928ba [Marcelo Vanzin] Add types.
      bd6ba8c [Marcelo Vanzin] Review feedback.
      a624a89 [Marcelo Vanzin] Merge branch 'master' into hist-server-single-log
      04364dc [Marcelo Vanzin] Merge branch 'master' into hist-server-single-log
      bb7c2d3 [Marcelo Vanzin] Fix scalastyle warning.
      16661a3 [Marcelo Vanzin] Simplify some internal code.
      cc6bce4 [Marcelo Vanzin] Some review feedback.
      a722184 [Marcelo Vanzin] Do not encode metadata in log file name.
      3700586 [Marcelo Vanzin] Restore log flushing.
      f677930 [Marcelo Vanzin] Fix botched rebase.
      ae571fa [Marcelo Vanzin] Fix end-to-end event logger test.
      9db0efd [Marcelo Vanzin] Show prettier name in UI.
      8f42274 [Marcelo Vanzin] Make history server parse old-style log directories.
      6251dd7 [Marcelo Vanzin] Make event logger use a single file.
      45645191
• [SPARK-4890] Upgrade Boto to 2.34.0; automatically download Boto from PyPi instead of packaging it · c28083f4
      Josh Rosen authored
      This patch upgrades `spark-ec2`'s Boto version to 2.34.0, since this is blocking several features.  Newer versions of Boto don't work properly when they're loaded from a zipfile since they try to read a JSON file from a path relative to the Boto library sources.
      
Therefore, this patch also changes spark-ec2 to automatically download Boto from PyPI if it's not present in `SPARK_EC2_DIR/lib`, similar to what we do in the `sbt/sbt` script. This shouldn't be an issue for users, since they already need an internet connection to launch an EC2 cluster. By performing the download in spark_ec2.py instead of the Bash script, this should also work for Windows users.
      
      I've tested this with Python 2.6, too.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #3737 from JoshRosen/update-boto and squashes the following commits:
      
      0aa43cc [Josh Rosen] Remove unused setup_standalone_cluster() method.
      f02935d [Josh Rosen] Enable Python deprecation warnings and fix one Boto warning:
      587ae89 [Josh Rosen] [SPARK-4890] Upgrade Boto to 2.34.0; automatically download Boto from PyPi instead of packaging it
      c28083f4
• [SPARK-4896] don’t redundantly overwrite executor JAR deps · 7981f969
      Ryan Williams authored
      Author: Ryan Williams <ryan.blake.williams@gmail.com>
      
      Closes #2848 from ryan-williams/fetch-file and squashes the following commits:
      
      c14daff [Ryan Williams] Fix copy that was changed to a move inadvertently
      8e39c16 [Ryan Williams] code review feedback
      788ed41 [Ryan Williams] don’t redundantly overwrite executor JAR deps
      7981f969
• [SPARK-4889] update history server example cmds · cdb2c645
      Ryan Williams authored
      Author: Ryan Williams <ryan.blake.williams@gmail.com>
      
      Closes #3736 from ryan-williams/hist and squashes the following commits:
      
      421d8ff [Ryan Williams] add another random typo fix
      76d6a4c [Ryan Williams] remove hdfs example
      a2d0f82 [Ryan Williams] code review feedback
      9ca7629 [Ryan Williams] [SPARK-4889] update history server example cmds
      cdb2c645
• Small refactoring to pass SparkEnv into Executor rather than creating SparkEnv in Executor. · 336cd341
      Reynold Xin authored
      This consolidates some code path and makes constructor arguments simpler for a few classes.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #3738 from rxin/sparkEnvDepRefactor and squashes the following commits:
      
82e02cc [Reynold Xin] Fixed a couple of bugs.
      217062a [Reynold Xin] Code review feedback.
      bd00af7 [Reynold Xin] Small refactoring to pass SparkEnv into Executor rather than creating SparkEnv in Executor.
      336cd341
• [Build] Remove spark-staging-1038 · 8e253ebb
      scwf authored
      Author: scwf <wangfei1@huawei.com>
      
      Closes #3743 from scwf/abc and squashes the following commits:
      
      7d98bc8 [scwf] removing spark-staging-1038
      8e253ebb
• [SPARK-4901] [SQL] Hot fix for ByteWritables.copyBytes · 5479450c
      Cheng Hao authored
HiveInspectors.scala failed to compile with Hadoop 1, because BytesWritable.copyBytes is not available in Hadoop 1.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #3742 from chenghao-intel/settable_oi_hotfix and squashes the following commits:
      
      bb04d1f [Cheng Hao] hot fix for ByteWritables.copyBytes
      5479450c
• SPARK-3428. TaskMetrics for running tasks is missing GC time metrics · 283263ff
      Sandy Ryza authored
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #3684 from sryza/sandy-spark-3428 and squashes the following commits:
      
      cb827fe [Sandy Ryza] SPARK-3428. TaskMetrics for running tasks is missing GC time metrics
      283263ff
  2. Dec 18, 2014
• [SPARK-4674] Refactor getCallSite · d7fc69a8
      Liang-Chi Hsieh authored
The current version of `getCallSite` visits the collection of `StackTraceElement`s twice, but this is unnecessary: the work can be done in a single pass. We also do not need to keep the filtered `StackTraceElement`s.
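An illustrative single-pass version (not the actual Spark code) that avoids materializing an intermediate filtered collection:

```
// Scan once: return the first frame that is not Spark-internal.
def firstUserFrame(trace: Array[StackTraceElement]): Option[StackTraceElement] =
  trace.find(el => !el.getClassName.startsWith("org.apache.spark."))
```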
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #3532 from viirya/refactor_getCallSite and squashes the following commits:
      
      62aa124 [Liang-Chi Hsieh] Fix style.
      e741017 [Liang-Chi Hsieh] Refactor getCallSite.
      d7fc69a8
• [SPARK-4728][MLLib] Add exponential, gamma, and log normal sampling to MLlib data generators · ee1fb97a
RJ Nowling authored
      
      This patch adds:
      
      * Exponential, gamma, and log normal generators that wrap Apache Commons math3 to the private API
* Functions for generating exponential, gamma, and log normal RDDs and vector RDDs (see the sketch after this list)
      * Tests for the above
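A usage sketch of the new entry points (signatures paraphrased from the MLlib RandomRDDs API):

```
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.random.RandomRDDs

val sc = new SparkContext(new SparkConf().setAppName("random-rdds").setMaster("local[2]"))
val expo = RandomRDDs.exponentialRDD(sc, 2.0, 100000L)     // mean
val gam  = RandomRDDs.gammaRDD(sc, 9.0, 0.5, 100000L)      // shape, scale
val logn = RandomRDDs.logNormalRDD(sc, 0.0, 1.0, 100000L)  // mean, std of the underlying normal
```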
      
      Author: RJ Nowling <rnowling@gmail.com>
      
      Closes #3680 from rnowling/spark4728 and squashes the following commits:
      
      455f50a [RJ Nowling] Add tests for exponential, gamma, and log normal samplers to JavaRandomRDDsSuite
3e1134a [RJ Nowling] Fix val/var, unnecessary creation of Distribution objects when setting seeds, and import line longer than line wrap limits
      58f5b97 [RJ Nowling] Fix bounds in tests so they scale with variance, not stdev
      84fd98d [RJ Nowling] Add more values for testing distributions.
      9f96232 [RJ Nowling] [SPARK-4728] Add exponential, gamma, and log normal sampling to MLlib data generators
      ee1fb97a
• [SPARK-4861][SQL] Refactor command in Spark SQL · c3d91da5
      wangfei authored
Remove `Command` and use `RunnableCommand` instead.
      
      Author: wangfei <wangfei1@huawei.com>
      Author: scwf <wangfei1@huawei.com>
      
      Closes #3712 from scwf/cmd and squashes the following commits:
      
      51a82f2 [wangfei] fix test failure
      0e03be8 [wangfei] address comments
      4033bed [scwf] remove CreateTableAsSelect in hivestrategy
      5d20010 [wangfei] address comments
      125f542 [scwf] factory command in spark sql
      c3d91da5
• [SPARK-4573] [SQL] Add SettableStructObjectInspector support in "wrap" function · ae9f1286
      Cheng Hao authored
A Hive UDAF may create a customized object constructed by a SettableStructObjectInspector; this is critical when integrating Hive UDAFs with the refactored UDAF interface.

There is a performance issue in `wrap`/`unwrap` since more match cases were added; that will be addressed in another PR.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #3429 from chenghao-intel/settable_oi and squashes the following commits:
      
      9f0aff3 [Cheng Hao] update code style issues as feedbacks
      2b0561d [Cheng Hao] Add more scala doc
      f5a40e8 [Cheng Hao] add scala doc
      2977e9b [Cheng Hao] remove the timezone setting for test suite
      3ed284c [Cheng Hao] fix the date type comparison
      f1b6749 [Cheng Hao] Update the comment
      932940d [Cheng Hao] Add more unit test
      72e4332 [Cheng Hao] Add settable StructObjectInspector support
      ae9f1286
• [SPARK-2554][SQL] Supporting SumDistinct partial aggregation · 7687415c
      ravipesala authored
Adds support for partial aggregation of SumDistinct.
      
      Author: ravipesala <ravindra.pesala@huawei.com>
      
      Closes #3348 from ravipesala/SPARK-2554 and squashes the following commits:
      
      fd28e4d [ravipesala] Fixed review comments
      e60e67f [ravipesala] Fixed test cases and made it as nullable
32fe234 [ravipesala] Supporting SumDistinct partial aggregation
    Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala
      7687415c
• [SPARK-4693] [SQL] PruningPredicates may be wrong if predicates contains an empty AttributeSet() references · e7de7e5f
YanTangZhai authored
      
      The sql "select * from spark_test::for_test where abs(20141202) is not null" has predicates=List(IS NOT NULL HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202)) and
      partitionKeyIds=AttributeSet(). PruningPredicates is List(IS NOT NULL HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202)). Then the exception "java.lang.IllegalArgumentException: requirement failed: Partition pruning predicates only supported for partitioned tables." is thrown.
      The sql "select * from spark_test::for_test_partitioned_table where abs(20141202) is not null and type_id=11 and platform = 3" with partitioned key insert_date has predicates=List(IS NOT NULL HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202), (type_id#12 = 11), (platform#8 = 3)) and partitionKeyIds=AttributeSet(insert_date#24). PruningPredicates is List(IS NOT NULL HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202)).
      
      Author: YanTangZhai <hakeemzhai@tencent.com>
      Author: yantangzhai <tyz0303@163.com>
      
      Closes #3556 from YanTangZhai/SPARK-4693 and squashes the following commits:
      
      620ebe3 [yantangzhai] [SPARK-4693] [SQL] PruningPredicates may be wrong if predicates contains an empty AttributeSet() references
      37cfdf5 [yantangzhai] [SPARK-4693] [SQL] PruningPredicates may be wrong if predicates contains an empty AttributeSet() references
      70a3544 [yantangzhai] [SPARK-4693] [SQL] PruningPredicates may be wrong if predicates contains an empty AttributeSet() references
      efa9b03 [YanTangZhai] Update HiveQuerySuite.scala
      72accf1 [YanTangZhai] Update HiveQuerySuite.scala
      e572b9a [YanTangZhai] Update HiveStrategies.scala
      6e643f8 [YanTangZhai] Merge pull request #11 from apache/master
      e249846 [YanTangZhai] Merge pull request #10 from apache/master
      d26d982 [YanTangZhai] Merge pull request #9 from apache/master
      76d4027 [YanTangZhai] Merge pull request #8 from apache/master
      03b62b0 [YanTangZhai] Merge pull request #7 from apache/master
      8a00106 [YanTangZhai] Merge pull request #6 from apache/master
      cbcba66 [YanTangZhai] Merge pull request #3 from apache/master
      cdef539 [YanTangZhai] Merge pull request #1 from apache/master
      e7de7e5f
• [SPARK-4756][SQL] FIX: sessionToActivePool grow infinitely, even as sessions expire · 22ddb6e0
      guowei2 authored
**sessionToActivePool** in **SparkSQLOperationManager** grows without bound, even as sessions expire.
We should remove the pool entry when a session closes, even though not every session has an entry in **sessionToActivePool**.
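A self-contained sketch of the fix's shape (names hypothetical, not the SparkSQLOperationManager code):

```
import scala.collection.concurrent.TrieMap

object SessionPools {
  private val sessionToActivePool = TrieMap.empty[String, String]
  def open(sessionId: String, pool: String): Unit = sessionToActivePool.put(sessionId, pool)
  // Evicting on close is what keeps the map from growing without bound.
  def close(sessionId: String): Unit = sessionToActivePool.remove(sessionId)
}
```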
      
      Author: guowei2 <guowei2@asiainfo.com>
      
      Closes #3617 from guowei2/SPARK-4756 and squashes the following commits:
      
      e9b97b8 [guowei2] fix compile bug with Shim12
      cf0f521 [guowei2] Merge remote-tracking branch 'apache/master' into SPARK-4756
      e070998 [guowei2] fix: remove active pool of the session when it expired
      22ddb6e0
• [SPARK-3928][SQL] Support wildcard matches on Parquet files. · b68bc6d2
      Thu Kyaw authored
parquetFile now accepts Hadoop glob patterns in its path argument.
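For example (paths hypothetical):

```
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // sc: an existing SparkContext
// The path may now contain Hadoop glob patterns:
val events = sqlContext.parquetFile("hdfs:///logs/year=2014/month=*/part-*.parquet")
```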
      
      Author: Thu Kyaw <trk007@gmail.com>
      
      Closes #3407 from tkyaw/master and squashes the following commits:
      
      19115ad [Thu Kyaw] Merge https://github.com/apache/spark
      ceded32 [Thu Kyaw] [SPARK-3928][SQL] Support wildcard matches on Parquet files.
      d322c28 [Thu Kyaw] [SPARK-3928][SQL] Support wildcard matches on Parquet files.
      ce677c6 [Thu Kyaw] [SPARK-3928][SQL] Support wildcard matches on Parquet files.
      b68bc6d2
• [SPARK-2663] [SQL] Support the Grouping Set · f728e0fe
      Cheng Hao authored
Add support for `GROUPING SETS`, `ROLLUP`, `CUBE` and the virtual column `GROUPING__ID`.

More details on how to use `GROUPING SETS` can be found at: https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup
      https://issues.apache.org/jira/secure/attachment/12676811/grouping_set.pdf
      
The general idea of the implementation is:
1. Replace `ROLLUP` and `CUBE` with `GROUPING SETS`.
2. Explode each input row and feed the results to `Aggregate`:
  * Each grouping set is represented as a bit mask over the `GroupBy Expression List`: for each bit, `1` means the expression is selected and `0` means it is not (the leftmost expression maps to the lowest bit, the rightmost to the highest).
  * Several projections are constructed according to the grouping sets; within each projection (a Seq[Expression]), expressions not selected in the grouping set are replaced with `Literal(null)` based on the bit mask.
  * The output schema of `Explode` is `child.output :+ grouping__id`.
  * The GroupBy expressions of `Aggregate` are `GroupBy Expression List :+ grouping__id`.
  * The aggregation expressions of the `Aggregate` stay the same.
      
The expression substitutions happen during logical plan analysis, so we benefit from logical plan optimizations (e.g. expression constant folding and map-side aggregation). Only an `Explosive` operator is added to the physical plan; it explodes the rows according to the pre-set projections.
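A self-contained sketch of the bit-mask encoding described above (illustrative, not the Spark internals):

```
// For GROUP BY a, b, c the grouping set {a, c} selects bits 0 and 2
// (leftmost expression = lowest bit), i.e. mask = 0b101 = 5.
val groupByExprs = Seq("a", "b", "c")
val groupingSet  = Set("a", "c")
val mask = groupByExprs.zipWithIndex
  .collect { case (e, i) if groupingSet(e) => 1 << i }
  .sum
// Expressions whose bit is 0 would be replaced with Literal(null):
val projection = groupByExprs.zipWithIndex.map {
  case (e, i) if (mask & (1 << i)) != 0 => e
  case _                                => "NULL"
}
```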
      
A known issue to be addressed in a follow-up PR:
* The `ColumnPruning` optimization is not yet supported for the `Explosive` node.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #1567 from chenghao-intel/grouping_sets and squashes the following commits:
      
      fe65fcc [Cheng Hao] Remove the extra space
      3547056 [Cheng Hao] Add more doc and Simplify the Expand
      a7c869d [Cheng Hao] update code as feedbacks
      d23c672 [Cheng Hao] Add GroupingExpression to replace the Seq[Expression]
      414b165 [Cheng Hao] revert the unnecessary changes
      ec276c6 [Cheng Hao] Support Rollup/Cube/GroupingSets
      f728e0fe
• [SPARK-4754] Refactor SparkContext into ExecutorAllocationClient · 9804a759
      Andrew Or authored
This is such that the `ExecutorAllocationManager` does not take the `SparkContext`, with all of its dependencies, as an argument. This prevents future developers of this class from tying it down further to the `SparkContext`, which has really become quite a monstrous object.
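A hypothetical sketch of the narrowed interface (method names illustrative, not necessarily the released API):

```
trait ExecutorAllocationClient {
  def requestExecutors(numAdditionalExecutors: Int): Boolean
  def killExecutors(executorIds: Seq[String]): Boolean
}
```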
      
      cc'ing pwendell who originally suggested this, and JoshRosen who may have thoughts about the trait mix-in style of `SparkContext`.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #3614 from andrewor14/dynamic-allocation-sc and squashes the following commits:
      
      187070d [Andrew Or] Merge branch 'master' of github.com:apache/spark into dynamic-allocation-sc
      59baf6c [Andrew Or] Merge branch 'master' of github.com:apache/spark into dynamic-allocation-sc
      347a348 [Andrew Or] Refactor SparkContext into ExecutorAllocationClient
      9804a759
• [SPARK-4837] NettyBlockTransferService should use spark.blockManager.port config · 105293a7
      Aaron Davidson authored
      This is used in NioBlockTransferService here:
      https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/network/nio/NioBlockTransferService.scala#L66
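With the fix, both transfer services honor the same setting, e.g.:

```
import org.apache.spark.SparkConf

val conf = new SparkConf().set("spark.blockManager.port", "31000")
```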
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #3688 from aarondav/SPARK-4837 and squashes the following commits:
      
      ebd2007 [Aaron Davidson] [SPARK-4837] NettyBlockTransferService should use spark.blockManager.port config
      105293a7
• SPARK-4743 - Use SparkEnv.serializer instead of closureSerializer in aggregateByKey and foldByKey · f9f58b9a
      Ivan Vergiliev authored
      Author: Ivan Vergiliev <ivan@leanplum.com>
      
      Closes #3605 from IvanVergiliev/change-serializer and squashes the following commits:
      
      a49b7cf [Ivan Vergiliev] Use serializer instead of closureSerializer in aggregate/foldByKey.
      f9f58b9a
• [SPARK-4884]: Improve Partition docs · d5a596d4
      Madhu Siddalingaiah authored
      Rewording was based on this discussion: http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-data-flow-td9804.html
      This is the associated JIRA ticket: https://issues.apache.org/jira/browse/SPARK-4884
      
      Author: Madhu Siddalingaiah <madhu@madhu.com>
      
      Closes #3722 from msiddalingaiah/master and squashes the following commits:
      
      79e679f [Madhu Siddalingaiah] [DOC]: improve documentation
      51d14b9 [Madhu Siddalingaiah] Merge remote-tracking branch 'upstream/master'
      38faca4 [Madhu Siddalingaiah] Merge remote-tracking branch 'upstream/master'
      cbccbfe [Madhu Siddalingaiah] Documentation: replace <b> with <code> (again)
      332f7a2 [Madhu Siddalingaiah] Documentation: replace <b> with <code>
      cd2b05a [Madhu Siddalingaiah] Merge remote-tracking branch 'upstream/master'
      0fc12d7 [Madhu Siddalingaiah] Documentation: add description for repartitionAndSortWithinPartitions
      d5a596d4
• [SPARK-4880] remove spark.locality.wait in Analytics · a7ed6f3c
      Ernest authored
spark.locality.wait was set to 100000 in examples/graphx/Analytics.scala;
this setting should be left to the user.
      
      Author: Ernest <earneyzxl@gmail.com>
      
      Closes #3730 from Earne/SPARK-4880 and squashes the following commits:
      
      d79ed04 [Ernest] remove spark.locality.wait in Analytics
      a7ed6f3c
• [SPARK-4887][MLlib] Fix a bad unittest in LogisticRegressionSuite · 59a49db5
      DB Tsai authored
The original test doesn't make sense: by the time you step in, the lossSum is already NaN
and the coefficients are diverging, because the step size is too large for SGD.

The correct behavior is that you should get smaller coefficients than without
regularization. Comparing the values with a 20000.0 relative error doesn't
make sense either.
      
      Author: DB Tsai <dbtsai@alpinenow.com>
      
      Closes #3735 from dbtsai/mlortestfix and squashes the following commits:
      
      b1a3c42 [DB Tsai] first commit
      59a49db5
• [SPARK-3607] ConnectionManager threads.max configs on the thread pools don't work · 3720057b
      Ilya Ganelin authored
      Hi all - cleaned up the code to get rid of the unused parameter and added some discussion of the ThreadPoolExecutor parameters to explain why we can use a single threadCount instead of providing a min/max.
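For illustration (this shows the general ThreadPoolExecutor behavior, not the patch itself): when corePoolSize equals maximumPoolSize, the pool is fixed-size, so a single threadCount is enough.

```
import java.util.concurrent.{LinkedBlockingQueue, ThreadPoolExecutor, TimeUnit}

val threadCount = 8
val pool = new ThreadPoolExecutor(
  threadCount, threadCount,            // core == max => fixed-size pool
  60L, TimeUnit.SECONDS,
  new LinkedBlockingQueue[Runnable]())
pool.allowCoreThreadTimeOut(true)      // let idle threads exit entirely
```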
      
      Author: Ilya Ganelin <ilya.ganelin@capitalone.com>
      
      Closes #3664 from ilganeli/SPARK-3607C and squashes the following commits:
      
      3c05690 [Ilya Ganelin] Updated documentation and refactored code to extract shared variables
      3720057b
• Add mesos specific configurations into doc · d9956f86
      Timothy Chen authored
      Author: Timothy Chen <tnachen@gmail.com>
      
      Closes #3349 from tnachen/mesos_doc and squashes the following commits:
      
      737ef49 [Timothy Chen] Add TOC
      5ca546a [Timothy Chen] Update description around cores requested.
      26283a5 [Timothy Chen] Add mesos specific configurations into doc
      d9956f86
• SPARK-3779. yarn spark.yarn.applicationMaster.waitTries config should be changed to a time period · 253b72b5
Sandy Ryza authored
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #3471 from sryza/sandy-spark-3779 and squashes the following commits:
      
      20b9887 [Sandy Ryza] Deprecate old property
      42b5df7 [Sandy Ryza] Review feedback
      9a959a1 [Sandy Ryza] SPARK-3779. yarn spark.yarn.applicationMaster.waitTries config should be changed to a time period
      253b72b5
• [SPARK-4461][YARN] pass extra java options to yarn application master · 3b764699
      Zhan Zhang authored
Currently there is no way to pass YARN AM-specific Java options, which causes potential issues when reading the classpath from a Hadoop configuration file. Hadoop replaces variables in its properties with the system properties passed in via Java options, and how to specify the value depends on the Hadoop distribution.

The new options are SPARK_YARN_JAVA_OPTS and spark.yarn.extraJavaOptions. They are Spark-global because typically users don't want to specify them on the command line for every job submission once they are set up in spark-defaults.conf.

In addition, enabling extra options to be passed to the AM provides more flexibility.

For example, in the following valid mapred-site.xml file, the classpath specifies values using a system property. Hadoop handles this correctly because the Java options are passed in; currently Spark breaks on it because hadoop.version is not passed in:
  <property>
    <name>mapreduce.application.classpath</name>
    <value>/etc/hadoop/${hadoop.version}/mapreduce/*</value>
  </property>

Meanwhile, we cannot rely on mapreduce.admin.map.child.java.opts in mapred-site.xml, because it specifies its own extra Java options, which do not apply to Spark.
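A hedged usage example (the option name is taken from the commit log below; the version value is made up):

```
import org.apache.spark.SparkConf

// Pass a system property to the AM so Hadoop can resolve
// ${hadoop.version} in mapreduce.application.classpath.
val conf = new SparkConf()
  .set("spark.yarn.am.extraJavaOptions", "-Dhadoop.version=2.4.0")
```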
      
      Author: Zhan Zhang <zhazhan@gmail.com>
      
      Closes #3409 from zhzhan/Spark-4461 and squashes the following commits:
      
      daec3d0 [Zhan Zhang] solve review comments
      08f44a7 [Zhan Zhang] add warning in driver mode if spark.yarn.am.extraJavaOptions is configured
      5a505d3 [Zhan Zhang] solve review comments
      4ed43ad [Zhan Zhang] solve review comments
      ad777ed [Zhan Zhang] Merge branch 'master' into Spark-4461
      3e9e574 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      e3f9abe [Zhan Zhang] solve review comments
      8963552 [Zhan Zhang] rebase
      f8f6700 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      dea1692 [Zhan Zhang] change the option key name to client mode specific
      90d5dff [Zhan Zhang] rebase
      8ac9254 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      092a25f [Zhan Zhang] solve review comments
      bc5a9ae [Zhan Zhang] solve review comments
      782b014 [Zhan Zhang] add new configuration to docs/running-on-yarn.md and remove it from spark-defaults.conf.template
      6faaa97 [Zhan Zhang] solve review comments
      369863f [Zhan Zhang] clean up unnecessary var
      733de9c [Zhan Zhang] Merge branch 'master' into Spark-4461
      a68e7f0 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      864505a [Zhan Zhang] Add extra java options to be passed to Yarn application master
      15830fc [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      685d911 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      03ebad3 [Zhan Zhang] Merge branch 'master' of https://github.com/zhzhan/spark
      46d9e3d [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      ebb213a [Zhan Zhang] revert
      b983ef3 [Zhan Zhang] test
      c4efb9b [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      779d67b [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      4daae6d [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      12e1be5 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      ce0ca7b [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      93f3081 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      3764505 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      a9d372b [Zhan Zhang] Merge branch 'master' of https://github.com/zhzhan/spark
      a00f60f [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      497b0f4 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      4a2e36d [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      a72c0d4 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      301eb4a [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      cedcc6f [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      adf4924 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      d10bf00 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      7e0cc36 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      68deb11 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      3ee3b2b [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      2b0d513 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      1ccd7cc [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      af9feb9 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      e4c1982 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      921e914 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      789ea21 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      cb53a2c [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      f6a8a40 [Zhan Zhang] revert
      ba14f28 [Zhan Zhang] test
      3b764699
  3. Dec 17, 2014
• [SPARK-4822] Use sphinx tags for Python doc annotations · 3cd51619
      lewuathe authored
Modify the Python annotations for Sphinx. There is no change to the build process described in
https://github.com/apache/spark/blob/master/docs/README.md
      
      Author: lewuathe <lewuathe@me.com>
      
      Closes #3685 from Lewuathe/sphinx-tag-for-pydoc and squashes the following commits:
      
      88a0fd9 [lewuathe] [SPARK-4822] Fix DevelopApi and WARN tags
      3d7a398 [lewuathe] [SPARK-4822] Use sphinx tags for Python doc annotations
      3cd51619
• MAINTENANCE: Automated closing of pull requests. · ca126089
      Patrick Wendell authored
      This commit exists to close the following pull requests on Github:
      
      Closes #3137 (close requested by 'marmbrus')
      Closes #3362 (close requested by 'marmbrus')
      Closes #2979 (close requested by 'JoshRosen')
      Closes #2223 (close requested by 'JoshRosen')
      Closes #2998 (close requested by 'marmbrus')
      Closes #3202 (close requested by 'marmbrus')
      Closes #3079 (close requested by 'marmbrus')
      Closes #3210 (close requested by 'marmbrus')
      Closes #2764 (close requested by 'marmbrus')
      Closes #3618 (close requested by 'marmbrus')
      Closes #3501 (close requested by 'marmbrus')
      Closes #2768 (close requested by 'marmbrus')
      Closes #3381 (close requested by 'marmbrus')
      Closes #3510 (close requested by 'marmbrus')
      Closes #3703 (close requested by 'marmbrus')
      Closes #2543 (close requested by 'marmbrus')
      Closes #2876 (close requested by 'marmbrus')
      Closes #1281 (close requested by 'JoshRosen')
      ca126089
• [SPARK-3891][SQL] Add array support to percentile, percentile_approx and constant inspectors support · f33d5504
Venkata Ramana Gollamudi authored
      
* Supported passing an array to the percentile and percentile_approx UDAFs
* To support percentile_approx, constant inspectors are supported for GenericUDAF
* Constant folding support added to the CreateArray expression
* Avoided constant UDF expression re-evaluation
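A usage sketch (the table and column names are made up):

```
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)  // sc: an existing SparkContext
// With array support, one scan returns several percentiles at once:
val quantiles = hiveContext.sql(
  "SELECT percentile_approx(latency_ms, array(0.5, 0.95, 0.99)) FROM requests")
```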
      
Author: Venkata Ramana G <ramana.gollamudi@huawei.com>
      
      Author: Venkata Ramana Gollamudi <ramana.gollamudi@huawei.com>
      
      Closes #2802 from gvramana/percentile_array_support and squashes the following commits:
      
      a0182e5 [Venkata Ramana Gollamudi] fixed review comment
      a18f917 [Venkata Ramana Gollamudi] avoid constant udf expression re-evaluation - fixes failure due to return iterator and value type mismatch
      c46db0f [Venkata Ramana Gollamudi] Removed TestHive reset
      4d39105 [Venkata Ramana Gollamudi] Unified inspector creation, style check fixes
      f37fd69 [Venkata Ramana Gollamudi] Fixed review comments
      47f6365 [Venkata Ramana Gollamudi] fixed test
      cb7c61e [Venkata Ramana Gollamudi] Supported ConstantInspector for UDAF Fixed HiveUdaf wrap object issue.
      7f94aff [Venkata Ramana Gollamudi] Added foldable support to CreateArray
      f33d5504
• [SPARK-4856] [SQL] NullType instead of StringType when sampling against empty string or null value · 8d0d2a65
      Cheng Hao authored
      ```
      TestSQLContext.sparkContext.parallelize(
        """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
        """{"ip":"27.31.100.29","headers":{}}""" ::
        """{"ip":"27.31.100.29","headers":""}""" :: Nil)
      ```
Because the empty string value of "headers" is initially sampled as StringType (lines 2 and 3), the real nested data type (the struct-typed "headers" in line 1) is ignored, and the "headers" in line 1 is also treated as StringType, which is not what we expect.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #3708 from chenghao-intel/json and squashes the following commits:
      
      e7a72e9 [Cheng Hao] add more concise unit test
      853de51 [Cheng Hao] NullType instead of StringType when sampling against empty string or null value
      8d0d2a65
• [HOTFIX][SQL] Fix parquet filter suite · 19c0faad
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #3727 from marmbrus/parquetNotEq and squashes the following commits:
      
      2157bfc [Michael Armbrust] Fix parquet filter suite
      19c0faad
• [SPARK-4821] [mllib] [python] [docs] Fix for pyspark.mllib.rand doc · affc3f46
      Joseph K. Bradley authored
      + small doc edit
      + include edit to make IntelliJ happy
      
      CC: davies  mengxr
      
      Note to davies  -- this does not fix the "WARNING: Literal block expected; none found." warnings since that seems to involve spacing which IntelliJ does not like.  (Those warnings occur when generating the Python docs.)
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #3669 from jkbradley/python-warnings and squashes the following commits:
      
      4587868 [Joseph K. Bradley] fixed warning
      8cb073c [Joseph K. Bradley] Updated based on davies recommendation
      c51eca4 [Joseph K. Bradley] Updated rst file for pyspark.mllib.rand doc.  Small doc edit.  Small include edit to make IntelliJ happy.
      affc3f46
• [SPARK-3739] [SQL] Update the split num base on block size for table scanning · 636d9fc4
      Cheng Hao authored
In local mode, Hadoop/Hive ignores "mapred.map.tasks", so a small table file always yields a single input split. However, Spark SQL doesn't honor that in table scanning, so we get different results in the Hive compatibility tests. This PR fixes that.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #2589 from chenghao-intel/source_split and squashes the following commits:
      
      dff38e7 [Cheng Hao] Remove the extra blank line
      160a2b6 [Cheng Hao] fix the compiling bug
      04d67f7 [Cheng Hao] Keep 1 split for small file in table scanning
      636d9fc4
• [SPARK-4755] [SQL] sqrt(negative value) should return null · 902e4d54
      Daoyuan Wang authored
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #3616 from adrian-wang/sqrt and squashes the following commits:
      
      d877439 [Daoyuan Wang] fix NULLTYPE
      3effa2c [Daoyuan Wang] sqrt(negative value) should return null
      902e4d54
• [SPARK-4493][SQL] Don't pushdown Eq, NotEq, Lt, LtEq, Gt and GtEq predicates with nulls for Parquet · 62771353
      Cheng Lian authored
Predicates like `a = NULL` and `a < NULL` can't be pushed down since Parquet's `Lt`, `LtEq`, `Gt`, and `GtEq` don't accept null values. Note that `Eq` and `NotEq` can only be used with `null` to represent predicates like `a IS NULL` and `a IS NOT NULL`.

However, normally this issue doesn't cause an NPE, because any value compared to `NULL` evaluates to `NULL`, and Spark SQL automatically optimizes out `NULL` predicates in the `SimplifyFilters` rule. Only testing code that intentionally disables the optimizer may trigger this issue. (That's why this issue is not marked as a blocker, and I do **NOT** think we need to backport this to branch-1.1.)

This PR restricts `Lt`, `LtEq`, `Gt` and `GtEq` to non-null values only, and only uses `Eq` with a null value to push down `IsNull` and `IsNotNull`. It also adds support for the Parquet `NotEq` filter, for completeness and a (tiny) performance gain; it's also used to push down `IsNotNull`.
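A self-contained sketch of the rule (the types are illustrative, not Spark's internals):

```
sealed trait Pred
case class IsNull(col: String)     extends Pred
case class IsNotNull(col: String)  extends Pred
case class Lt(col: String, v: Any) extends Pred

def pushable(p: Pred): Boolean = p match {
  case IsNull(_) | IsNotNull(_) => true   // mapped to Parquet Eq/NotEq with null
  case Lt(_, null)              => false  // Parquet Lt rejects null values
  case Lt(_, _)                 => true
}
```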
      
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #3367 from liancheng/filters-with-null and squashes the following commits:
      
      cc41281 [Cheng Lian] Fixes several styling issues
      de7de28 [Cheng Lian] Adds stricter rules for Parquet filters with null
      62771353
• [SPARK-3698][SQL] Fix case insensitive resolution of GetField. · 7ad579ee
      Michael Armbrust authored
      Based on #2543.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #3724 from marmbrus/resolveGetField and squashes the following commits:
      
      0a47aae [Michael Armbrust] Fix case insensitive resolution of GetField.
      7ad579ee
• [SPARK-4694] Fix HiveThriftServer2 can't stop in YARN HA mode. · 4782def0
      carlmartin authored
HiveThriftServer2 cannot exit automatically when the standby resource manager takes over in YARN HA mode.
The scheduler backend is aware that the AM has exited, so it calls sc.stop to exit the driver process, but a user thread (HiveThriftServer2) is still alive, which causes this problem.
To fix it, add a daemon thread that detects whether the SparkContext has been stopped; if so, call ThriftServer.stop to stop the user thread.
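A hedged sketch of the listener-based approach from commit 2890b4a (stop the server when the application ends, instead of polling from a separate thread):

```
import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd}

class StopServerListener(stopServer: () => Unit) extends SparkListener {
  override def onApplicationEnd(end: SparkListenerApplicationEnd): Unit =
    stopServer()
}
```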
      
      Author: carlmartin <carlmartinmax@gmail.com>
      
      Closes #3576 from SaintBacchus/ThriftServer2ExitBug and squashes the following commits:
      
      2890b4a [carlmartin] Use SparkListener instead of the demo thread to stop the hive server.
c15da0e [carlmartin] HiveThriftServer2 cannot exit automatically when changing the standby resource manager in Yarn HA mode
      4782def0
• [SPARK-4625] [SQL] Add sort by for DSL & SimpleSqlParser · 5fdcbdc0
      Cheng Hao authored
      Add `sort by` support for both DSL & SqlParser.
      
This PR overlaps with #3386; whichever is merged first will require the other to be rebased.
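A usage sketch (assumes an existing SQLContext and a registered table src): unlike ORDER BY, SORT BY only orders rows within each partition.

```
val sorted = sqlContext.sql("SELECT key, value FROM src SORT BY key")
```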
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #3481 from chenghao-intel/sortby and squashes the following commits:
      
      041004f [Cheng Hao] Add sort by for DSL & SimpleSqlParser
      5fdcbdc0