  1. Dec 22, 2014
    • Takeshi Yamamuro's avatar
      [SPARK-4733] Add missing parameter comments in ShuffleDependency · fb8e85e8
      Takeshi Yamamuro authored
      Add missing Javadoc comments in ShuffleDependency.
      
      Author: Takeshi Yamamuro <linguin.m.s@gmail.com>
      
      Closes #3594 from maropu/DependencyJavadocFix and squashes the following commits:
      
      32129b4 [Takeshi Yamamuro] Fix comments in @aggregator and @mapSideCombine
      303c75d [Takeshi Yamamuro] [SPARK-4733] Add missing parameter comments in ShuffleDependency
      fb8e85e8
    • carlmartin's avatar
      [Minor] Improve some code in BroadcastTest for short · 1d9788e4
      carlmartin authored
      Using
          val arr1 = (0 until num).toArray
      instead of
          val arr1 = new Array[Int](num)
          for (i <- 0 until arr1.length) {
            arr1(i) = i
          }
      for short.
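
      As a quick check that the two forms build the same array, here is a minimal sketch. The object name, helper names, and the value of `num` are illustrative assumptions, not taken from BroadcastTest.

      ```scala
      object RangeToArrayExample {
        // Illustrative size only.
        val num = 5

        // Imperative form: allocate, then fill index by index.
        def filledByLoop(n: Int): Array[Int] = {
          val arr = new Array[Int](n)
          for (i <- 0 until arr.length) {
            arr(i) = i
          }
          arr
        }

        // Concise form: materialize the range directly.
        def filledByRange(n: Int): Array[Int] = (0 until n).toArray

        def main(args: Array[String]): Unit = {
          // sameElements compares contents; == on arrays compares references.
          println(filledByLoop(num).sameElements(filledByRange(num)))
        }
      }
      ```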
      
      Author: carlmartin <carlmartinmax@gmail.com>
      
      Closes #3750 from SaintBacchus/BroadcastTest and squashes the following commits:
      
      43adb70 [carlmartin] Improve some code in BroadcastTest for short
      1d9788e4
    • zsxwing's avatar
      [SPARK-4883][Shuffle] Add a name to the directoryCleaner thread · 8773705f
      zsxwing authored
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #3734 from zsxwing/SPARK-4883 and squashes the following commits:
      
      e6f2b61 [zsxwing] Fix the name
      cc74727 [zsxwing] Add a name to the directoryCleaner thread
      8773705f
    • Zhang, Liye's avatar
      [SPARK-4870] Add spark version to driver log · 39272c8c
      Zhang, Liye authored
      Author: Zhang, Liye <liye.zhang@intel.com>
      
      Closes #3717 from liyezhang556520/version2Log and squashes the following commits:
      
      ccd30d7 [Zhang, Liye] delete log in sparkConf
      330f70c [Zhang, Liye] move the log from SparkConf to SparkContext
      96dc115 [Zhang, Liye] remove curly brace
      e833330 [Zhang, Liye] add spark version to driver log
      39272c8c
    • Tsuyoshi Ozawa's avatar
      [SPARK-4915][YARN] Fix classname to be specified for external shuffle service. · 96606f69
      Tsuyoshi Ozawa authored
      Author: Tsuyoshi Ozawa <ozawa.tsuyoshi@lab.ntt.co.jp>
      
      Closes #3757 from oza/SPARK-4915 and squashes the following commits:
      
      3b0d6d6 [Tsuyoshi Ozawa] Fix classname to be specified for external shuffle service.
      96606f69
    • zsxwing's avatar
      [SPARK-4918][Core] Reuse Text in saveAsTextFile · 93b2f3a8
      zsxwing authored
      Reuse Text in saveAsTextFile to reduce GC.
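
      The idea of reusing one mutable holder per partition, instead of allocating one per record, can be sketched as follows. `MutableText` is a hypothetical stand-in for Hadoop's `Text`, so the example runs without Hadoop; it is not the code in this patch.

      ```scala
      object ReuseHolderExample {
        // Hypothetical stand-in for Hadoop's mutable Text; not the real class.
        final class MutableText {
          private var value: String = ""
          def set(s: String): Unit = { value = s }
          override def toString: String = value
        }

        // Allocates one holder per record.
        def freshPerRecord(records: Iterator[Any]): Iterator[MutableText] =
          records.map { r =>
            val t = new MutableText
            t.set(r.toString)
            t
          }

        // Allocates a single holder and overwrites it for every record.
        // Safe only when the consumer reads each element before requesting the next.
        def reused(records: Iterator[Any]): Iterator[MutableText] = {
          val t = new MutableText
          records.map { r =>
            t.set(r.toString)
            t
          }
        }

        def main(args: Array[String]): Unit = {
          // Each element is consumed (converted to String) before the next is produced.
          reused(Iterator(1, 2, 3)).foreach(t => println(t.toString))
        }
      }
      ```

      The trade-off is that every element returned by `reused` is the same object, so buffering the iterator before reading it would leave only the last value visible.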
      
      /cc rxin
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #3762 from zsxwing/SPARK-4918 and squashes the following commits:
      
      59f03eb [zsxwing] Reuse Text in saveAsTextFile
      93b2f3a8
    • zsxwing's avatar
      [SPARK-2075][Core] Make the compiler generate the same bytecode for Hadoop 1.+ and Hadoop 2.+ · 6ee6aa70
      zsxwing authored
      `NullWritable` is a `Comparable` rather than `Comparable[NullWritable]` in Hadoop 1.+, so the compiler cannot find an implicit Ordering for it. It will generate different anonymous classes for `saveAsTextFile` in Hadoop 1.+ and Hadoop 2.+. Therefore, here we provide an Ordering for NullWritable so that the compiler will generate the same code.
      
      I used the following commands to confirm that the generated bytecode is the same.
      ```
      mvn -Dhadoop.version=1.2.1 -DskipTests clean package -pl core -am
      javap -private -c -classpath core/target/scala-2.10/classes org.apache.spark.rdd.RDD > ~/hadoop1.txt
      
      mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package -pl core -am
      javap -private -c -classpath core/target/scala-2.10/classes org.apache.spark.rdd.RDD > ~/hadoop2.txt
      
      diff ~/hadoop1.txt ~/hadoop2.txt
      ```
      
      However, the compiler will generate different code for the classes that call methods of `JobContext/TaskAttemptContext`. `JobContext/TaskAttemptContext` is a class in Hadoop 1.+, and calling its method will use `invokevirtual`, while it's an interface in Hadoop 2.+, and will use `invokeinterface`.
      
      To fix it, we can use reflection to call `JobContext/TaskAttemptContext.getConfiguration`.
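
      A minimal sketch of the reflective call, assuming nothing beyond the JDK: `ClassStyleContext`, `InterfaceStyleContext`, and `InterfaceImpl` are hypothetical stand-ins for the two shapes of `JobContext`, and `getConfigurationReflectively` is an illustrative helper name rather than the one used in the patch.

      ```scala
      object ReflectiveCallExample {
        // Hypothetical stand-ins: the same method name declared on a class and on an
        // interface, mirroring how JobContext differs between Hadoop 1.+ and Hadoop 2.+.
        class ClassStyleContext {
          def getConfiguration: String = "conf-from-class"
        }
        trait InterfaceStyleContext {
          def getConfiguration: String
        }
        class InterfaceImpl extends InterfaceStyleContext {
          def getConfiguration: String = "conf-from-interface"
        }

        // Looks the method up by name at runtime, so the call site compiles to the same
        // bytecode whether the receiver's declared type is a class or an interface.
        def getConfigurationReflectively(context: AnyRef): AnyRef = {
          val method = context.getClass.getMethod("getConfiguration")
          method.invoke(context)
        }

        def main(args: Array[String]): Unit = {
          println(getConfigurationReflectively(new ClassStyleContext))
          println(getConfigurationReflectively(new InterfaceImpl))
        }
      }
      ```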
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #3740 from zsxwing/SPARK-2075 and squashes the following commits:
      
      39d9df2 [zsxwing] Fix the code style
      e4ad8b5 [zsxwing] Use null for the implicit Ordering
      734bac9 [zsxwing] Explicitly set the implicit parameters
      ca03559 [zsxwing] Use reflection to access JobContext/TaskAttemptContext.getConfiguration
      fa40db0 [zsxwing] Add an Ordering for NullWritable to make the compiler generate same byte codes for RDD
      6ee6aa70
  2. Dec 21, 2014
    • Sean Owen's avatar
      SPARK-4910 [CORE] build failed (use of FileStatus.isFile in Hadoop 1.x) · c6a3c0d5
      Sean Owen authored
      Fix small Hadoop 1 compile error from SPARK-2261. In Hadoop 1.x, all we have is FileStatus.isDir, so these "is file" assertions are changed to "is not a dir". This is how similar checks are done so far in the code base.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #3754 from srowen/SPARK-4910 and squashes the following commits:
      
      52c5e4e [Sean Owen] Fix small Hadoop 1 compile error from SPARK-2261
      c6a3c0d5
  3. Dec 20, 2014
  4. Dec 19, 2014
    • Andrew Or's avatar
      [SPARK-4140] Document dynamic allocation · 15c03e1e
      Andrew Or authored
      Once the external shuffle service is also documented, the dynamic allocation section will link to it. Let me know if the whole dynamic allocation should be moved to its separate page; I personally think the organization might be cleaner that way.
      
      This patch builds on top of oza's work in #3689.
      
      aarondav pwendell
      
      Author: Andrew Or <andrew@databricks.com>
      Author: Tsuyoshi Ozawa <ozawa.tsuyoshi@gmail.com>
      
      Closes #3731 from andrewor14/document-dynamic-allocation and squashes the following commits:
      
      1281447 [Andrew Or] Address a few comments
      b9843f2 [Andrew Or] Document the configs as well
      246fb44 [Andrew Or] Merge branch 'SPARK-4839' of github.com:oza/spark into document-dynamic-allocation
      8c64004 [Andrew Or] Add documentation for dynamic allocation (without configs)
      6827b56 [Tsuyoshi Ozawa] Fixing a documentation of spark.dynamicAllocation.enabled.
      53cff58 [Tsuyoshi Ozawa] Adding a documentation about dynamic resource allocation.
      15c03e1e
    • Daniel Darabos's avatar
      [SPARK-4831] Do not include SPARK_CLASSPATH if empty · 7cb3f547
      Daniel Darabos authored
      My guess for fixing https://issues.apache.org/jira/browse/SPARK-4831.
      
      Author: Daniel Darabos <darabos.daniel@gmail.com>
      
      Closes #3678 from darabos/patch-1 and squashes the following commits:
      
      36e1243 [Daniel Darabos] Do not include SPARK_CLASSPATH if empty.
      7cb3f547
    • Kanwaljit Singh's avatar
      SPARK-2641: Passing num executors to spark arguments from properties file · 1d648123
      Kanwaljit Singh authored
      Since we can set Spark executor memory and executor cores using a properties file, we should also be able to set the number of executor instances.
      
      Author: Kanwaljit Singh <kanwaljit.singh@guavus.com>
      
      Closes #1657 from kjsingh/branch-1.0 and squashes the following commits:
      
      d8a5a12 [Kanwaljit Singh] SPARK-2641: Fixing how spark arguments are loaded from properties file for num executors
      
      Conflicts:
      	core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala
      1d648123
    • Masayoshi TSUZUKI's avatar
      [SPARK-3060] spark-shell.cmd doesn't accept application options in Windows OS · 8d932475
      Masayoshi TSUZUKI authored
      Added a module equivalent to utils.sh and modified spark-shell2.cmd to use it to parse options.
      
      Now we can use application options.
        ex) `bin\spark-shell.cmd --master spark://master:7077 -i path\to\script.txt`
      
      Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>
      
      Closes #3350 from tsudukim/feature/SPARK-3060 and squashes the following commits:
      
      4551e56 [Masayoshi TSUZUKI] Modified too long line which defines the submission options to pass findstr command.
      3a11361 [Masayoshi TSUZUKI] [SPARK-3060] spark-shell.cmd doesn't accept application options in Windows OS
      8d932475
    • Eran Medan's avatar
      change signature of example to match released code · c25c669d
      Eran Medan authored
      The signature of registerKryoClasses actually takes an Array[Class[_]], not a Seq.
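
      To illustrate the parameter type without depending on Spark, here is a small sketch. The local `registerKryoClasses` below is a hypothetical stand-in that only mirrors the `Array[Class[_]]` parameter; `MyClass1` and `MyClass2` are placeholder user classes.

      ```scala
      object KryoClassArrayExample {
        // Placeholder user classes to be registered.
        class MyClass1
        class MyClass2

        // Hypothetical stand-in with the same parameter type as the released method;
        // it only returns the class names so the example runs without Spark.
        def registerKryoClasses(classes: Array[Class[_]]): Seq[String] =
          classes.map(_.getName).toSeq

        def main(args: Array[String]): Unit = {
          // An Array[Class[_]] matches the parameter type; passing a Seq would not compile.
          val registered =
            registerKryoClasses(Array[Class[_]](classOf[MyClass1], classOf[MyClass2]))
          registered.foreach(println)
        }
      }
      ```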
      
      Author: Eran Medan <ehrann.mehdan@gmail.com>
      
      Closes #3747 from eranation/patch-1 and squashes the following commits:
      
      ee9885d [Eran Medan] change signature of example to match released code
      c25c669d
    • Marcelo Vanzin's avatar
      [SPARK-2261] Make event logger use a single file. · 45645191
      Marcelo Vanzin authored
      Currently the event logger uses a directory and several files to
      describe an app's event log, all but one of which are empty. This
      is not very HDFS-friendly, since creating lots of nodes in HDFS
      (especially when they don't contain any data) is frowned upon due
      to the node metadata being kept in the NameNode's memory.
      
      Instead, add a header section to the event log file that contains metadata
      needed to read the events. This metadata includes things like the Spark
      version (for future code that may need it for backwards compatibility) and
      the compression codec used for the event data.
      
      With the new approach, aside from reducing the load on the NN, there are
      also far fewer remote calls needed when reading the log directory.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #1222 from vanzin/hist-server-single-log and squashes the following commits:
      
      cc8f5de [Marcelo Vanzin] Store header in plain text.
      c7e6123 [Marcelo Vanzin] Update comment.
      59c561c [Marcelo Vanzin] Review feedback.
      216c5a3 [Marcelo Vanzin] Review comments.
      dce28e9 [Marcelo Vanzin] Fix log overwrite test.
      f91c13e [Marcelo Vanzin] Handle "spark.eventLog.overwrite", and add unit test.
      346f0b4 [Marcelo Vanzin] Review feedback.
      ed0023e [Marcelo Vanzin] Merge branch 'master' into hist-server-single-log
      3f4500f [Marcelo Vanzin] Unit test for SPARK-3697.
      45c7a1f [Marcelo Vanzin] Version of SPARK-3697 for this branch.
      b3ee30b [Marcelo Vanzin] Merge branch 'master' into hist-server-single-log
      a6d5c50 [Marcelo Vanzin] Merge branch 'master' into hist-server-single-log
      16fd491 [Marcelo Vanzin] Use unique log directory for each codec.
      0ef3f70 [Marcelo Vanzin] Merge branch 'master' into hist-server-single-log
      d93c44a [Marcelo Vanzin] Add a newline to make the header more readable.
      9e928ba [Marcelo Vanzin] Add types.
      bd6ba8c [Marcelo Vanzin] Review feedback.
      a624a89 [Marcelo Vanzin] Merge branch 'master' into hist-server-single-log
      04364dc [Marcelo Vanzin] Merge branch 'master' into hist-server-single-log
      bb7c2d3 [Marcelo Vanzin] Fix scalastyle warning.
      16661a3 [Marcelo Vanzin] Simplify some internal code.
      cc6bce4 [Marcelo Vanzin] Some review feedback.
      a722184 [Marcelo Vanzin] Do not encode metadata in log file name.
      3700586 [Marcelo Vanzin] Restore log flushing.
      f677930 [Marcelo Vanzin] Fix botched rebase.
      ae571fa [Marcelo Vanzin] Fix end-to-end event logger test.
      9db0efd [Marcelo Vanzin] Show prettier name in UI.
      8f42274 [Marcelo Vanzin] Make history server parse old-style log directories.
      6251dd7 [Marcelo Vanzin] Make event logger use a single file.
      45645191
    • Josh Rosen's avatar
      [SPARK-4890] Upgrade Boto to 2.34.0; automatically download Boto from PyPi instead of packaging it · c28083f4
      Josh Rosen authored
      This patch upgrades `spark-ec2`'s Boto version to 2.34.0, since this is blocking several features.  Newer versions of Boto don't work properly when they're loaded from a zipfile since they try to read a JSON file from a path relative to the Boto library sources.
      
      Therefore, this patch also changes spark-ec2 to automatically download Boto from PyPi if it's not present in `SPARK_EC2_DIR/lib`, similar to what we do in the `sbt/sbt` script. This shouldn't be an issue for users since they already need to have an internet connection to launch an EC2 cluster.  By performing the downloading in spark_ec2.py instead of the Bash script, this should also work for Windows users.
      
      I've tested this with Python 2.6, too.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #3737 from JoshRosen/update-boto and squashes the following commits:
      
      0aa43cc [Josh Rosen] Remove unused setup_standalone_cluster() method.
      f02935d [Josh Rosen] Enable Python deprecation warnings and fix one Boto warning:
      587ae89 [Josh Rosen] [SPARK-4890] Upgrade Boto to 2.34.0; automatically download Boto from PyPi instead of packaging it
      c28083f4
    • Ryan Williams's avatar
      [SPARK-4896] don’t redundantly overwrite executor JAR deps · 7981f969
      Ryan Williams authored
      Author: Ryan Williams <ryan.blake.williams@gmail.com>
      
      Closes #2848 from ryan-williams/fetch-file and squashes the following commits:
      
      c14daff [Ryan Williams] Fix copy that was changed to a move inadvertently
      8e39c16 [Ryan Williams] code review feedback
      788ed41 [Ryan Williams] don’t redundantly overwrite executor JAR deps
      7981f969
    • Ryan Williams's avatar
      [SPARK-4889] update history server example cmds · cdb2c645
      Ryan Williams authored
      Author: Ryan Williams <ryan.blake.williams@gmail.com>
      
      Closes #3736 from ryan-williams/hist and squashes the following commits:
      
      421d8ff [Ryan Williams] add another random typo fix
      76d6a4c [Ryan Williams] remove hdfs example
      a2d0f82 [Ryan Williams] code review feedback
      9ca7629 [Ryan Williams] [SPARK-4889] update history server example cmds
      cdb2c645
    • Reynold Xin's avatar
      Small refactoring to pass SparkEnv into Executor rather than creating SparkEnv in Executor. · 336cd341
      Reynold Xin authored
      This consolidates some code paths and makes constructor arguments simpler for a few classes.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #3738 from rxin/sparkEnvDepRefactor and squashes the following commits:
      
      82e02cc [Reynold Xin] Fixed couple bugs.
      217062a [Reynold Xin] Code review feedback.
      bd00af7 [Reynold Xin] Small refactoring to pass SparkEnv into Executor rather than creating SparkEnv in Executor.
      336cd341
    • scwf's avatar
      [Build] Remove spark-staging-1038 · 8e253ebb
      scwf authored
      Author: scwf <wangfei1@huawei.com>
      
      Closes #3743 from scwf/abc and squashes the following commits:
      
      7d98bc8 [scwf] removing spark-staging-1038
      8e253ebb
    • Cheng Hao's avatar
      [SPARK-4901] [SQL] Hot fix for ByteWritables.copyBytes · 5479450c
      Cheng Hao authored
      HiveInspectors.scala failed to compile with Hadoop 1 because BytesWritable.copyBytes is not available in Hadoop 1.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #3742 from chenghao-intel/settable_oi_hotfix and squashes the following commits:
      
      bb04d1f [Cheng Hao] hot fix for ByteWritables.copyBytes
      5479450c
    • Sandy Ryza's avatar
      SPARK-3428. TaskMetrics for running tasks is missing GC time metrics · 283263ff
      Sandy Ryza authored
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #3684 from sryza/sandy-spark-3428 and squashes the following commits:
      
      cb827fe [Sandy Ryza] SPARK-3428. TaskMetrics for running tasks is missing GC time metrics
      283263ff
  5. Dec 18, 2014
    • Liang-Chi Hsieh's avatar
      [SPARK-4674] Refactor getCallSite · d7fc69a8
      Liang-Chi Hsieh authored
      The current version of `getCallSite` visits the collection of `StackTraceElement`s twice. This is unnecessary, since we can do the work in a single pass. We also do not need to keep the filtered `StackTraceElement`s.
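
      A single-pass scan of this kind might look like the sketch below. It is not the refactored `getCallSite` itself: the `isInternal` rule, the `framework.` prefix, and the returned tuple are assumptions made for illustration.

      ```scala
      object SinglePassCallSiteExample {
        // Hypothetical rule for what counts as a framework-internal frame.
        def isInternal(className: String): Boolean = className.startsWith("framework.")

        // Walks the frames once, innermost first. While frames are internal, remember the
        // most recent method name; stop at the first user frame and report both.
        // Returns (last internal method, "file:line" of the first user frame).
        def callSite(frames: Seq[StackTraceElement]): (String, String) = {
          var lastInternalMethod = "<unknown>"
          var firstUserLocation = "<unknown>"
          var insideFramework = true
          val it = frames.iterator
          while (insideFramework && it.hasNext) {
            val el = it.next()
            if (isInternal(el.getClassName)) {
              lastInternalMethod = el.getMethodName
            } else {
              firstUserLocation = s"${el.getFileName}:${el.getLineNumber}"
              insideFramework = false
            }
          }
          (lastInternalMethod, firstUserLocation)
        }

        def main(args: Array[String]): Unit = {
          // Hand-built trace, innermost frame first, in place of Thread.getStackTrace.
          val trace = Seq(
            new StackTraceElement("framework.Util", "helper", "Util.scala", 10),
            new StackTraceElement("framework.RDD", "map", "RDD.scala", 42),
            new StackTraceElement("user.App", "main", "App.scala", 7),
            new StackTraceElement("user.Launcher", "run", "Launcher.scala", 3)
          )
          println(callSite(trace))
        }
      }
      ```

      No intermediate filtered collection is built; the loop keeps only the two values it needs.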
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #3532 from viirya/refactor_getCallSite and squashes the following commits:
      
      62aa124 [Liang-Chi Hsieh] Fix style.
      e741017 [Liang-Chi Hsieh] Refactor getCallSite.
      d7fc69a8
    • RJ Nowling's avatar
      [SPARK-4728][MLLib] Add exponential, gamma, and log normal sampling to MLlib da... · ee1fb97a
      RJ Nowling authored
      ...ta generators
      
      This patch adds:
      
      * Exponential, gamma, and log normal generators that wrap Apache Commons math3 to the private API
      * Functions for generating exponential, gamma, and log normal RDDs and vector RDDs
      * Tests for the above
      
      Author: RJ Nowling <rnowling@gmail.com>
      
      Closes #3680 from rnowling/spark4728 and squashes the following commits:
      
      455f50a [RJ Nowling] Add tests for exponential, gamma, and log normal samplers to JavaRandomRDDsSuite
      3e1134a [RJ Nowling] Fix val/var, unnecessary creation of Distribution objects when setting seeds, and import line longer than line wrap limits
      58f5b97 [RJ Nowling] Fix bounds in tests so they scale with variance, not stdev
      84fd98d [RJ Nowling] Add more values for testing distributions.
      9f96232 [RJ Nowling] [SPARK-4728] Add exponential, gamma, and log normal sampling to MLlib data generators
      ee1fb97a
    • wangfei's avatar
      [SPARK-4861][SQL] Refactory command in spark sql · c3d91da5
      wangfei authored
      Remove `Command` and use `RunnableCommand` instead.
      
      Author: wangfei <wangfei1@huawei.com>
      Author: scwf <wangfei1@huawei.com>
      
      Closes #3712 from scwf/cmd and squashes the following commits:
      
      51a82f2 [wangfei] fix test failure
      0e03be8 [wangfei] address comments
      4033bed [scwf] remove CreateTableAsSelect in hivestrategy
      5d20010 [wangfei] address comments
      125f542 [scwf] factory command in spark sql
      c3d91da5
    • Cheng Hao's avatar
      [SPARK-4573] [SQL] Add SettableStructObjectInspector support in "wrap" function · ae9f1286
      Cheng Hao authored
      A Hive UDAF may create a customized object constructed by SettableStructObjectInspector; supporting this is critical when integrating Hive UDAFs with the refactored UDAF interface.
      
      The additional match cases introduce a performance issue in `wrap/unwrap`; that will be addressed in another PR.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #3429 from chenghao-intel/settable_oi and squashes the following commits:
      
      9f0aff3 [Cheng Hao] update code style issues as feedbacks
      2b0561d [Cheng Hao] Add more scala doc
      f5a40e8 [Cheng Hao] add scala doc
      2977e9b [Cheng Hao] remove the timezone setting for test suite
      3ed284c [Cheng Hao] fix the date type comparison
      f1b6749 [Cheng Hao] Update the comment
      932940d [Cheng Hao] Add more unit test
      72e4332 [Cheng Hao] Add settable StructObjectInspector support
      ae9f1286
    • ravipesala's avatar
      [SPARK-2554][SQL] Supporting SumDistinct partial aggregation · 7687415c
      ravipesala authored
      Adding support to the partial aggregation of SumDistinct
      
      Author: ravipesala <ravindra.pesala@huawei.com>
      
      Closes #3348 from ravipesala/SPARK-2554 and squashes the following commits:
      
      fd28e4d [ravipesala] Fixed review comments
      e60e67f [ravipesala] Fixed test cases and made it as nullable
      32fe234 [ravipesala] Supporting SumDistinct partial aggregation Conflicts: 	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala
      7687415c
    • YanTangZhai's avatar
      [SPARK-4693] [SQL] PruningPredicates may be wrong if predicates contains an... · e7de7e5f
      YanTangZhai authored
      [SPARK-4693] [SQL] PruningPredicates may be wrong if predicates contains an empty AttributeSet() references
      
      The SQL query "select * from spark_test::for_test where abs(20141202) is not null" has predicates=List(IS NOT NULL HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202)) and partitionKeyIds=AttributeSet(). PruningPredicates is List(IS NOT NULL HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202)). Then the exception "java.lang.IllegalArgumentException: requirement failed: Partition pruning predicates only supported for partitioned tables." is thrown.
      The SQL query "select * from spark_test::for_test_partitioned_table where abs(20141202) is not null and type_id=11 and platform = 3" with partition key insert_date has predicates=List(IS NOT NULL HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202), (type_id#12 = 11), (platform#8 = 3)) and partitionKeyIds=AttributeSet(insert_date#24). PruningPredicates is List(IS NOT NULL HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202)).
      
      Author: YanTangZhai <hakeemzhai@tencent.com>
      Author: yantangzhai <tyz0303@163.com>
      
      Closes #3556 from YanTangZhai/SPARK-4693 and squashes the following commits:
      
      620ebe3 [yantangzhai] [SPARK-4693] [SQL] PruningPredicates may be wrong if predicates contains an empty AttributeSet() references
      37cfdf5 [yantangzhai] [SPARK-4693] [SQL] PruningPredicates may be wrong if predicates contains an empty AttributeSet() references
      70a3544 [yantangzhai] [SPARK-4693] [SQL] PruningPredicates may be wrong if predicates contains an empty AttributeSet() references
      efa9b03 [YanTangZhai] Update HiveQuerySuite.scala
      72accf1 [YanTangZhai] Update HiveQuerySuite.scala
      e572b9a [YanTangZhai] Update HiveStrategies.scala
      6e643f8 [YanTangZhai] Merge pull request #11 from apache/master
      e249846 [YanTangZhai] Merge pull request #10 from apache/master
      d26d982 [YanTangZhai] Merge pull request #9 from apache/master
      76d4027 [YanTangZhai] Merge pull request #8 from apache/master
      03b62b0 [YanTangZhai] Merge pull request #7 from apache/master
      8a00106 [YanTangZhai] Merge pull request #6 from apache/master
      cbcba66 [YanTangZhai] Merge pull request #3 from apache/master
      cdef539 [YanTangZhai] Merge pull request #1 from apache/master
      e7de7e5f
    • guowei2's avatar
      [SPARK-4756][SQL] FIX: sessionToActivePool grow infinitely, even as sessions expire · 22ddb6e0
      guowei2 authored
      **sessionToActivePool** in **SparkSQLOperationManager** grows without bound, even as sessions expire.
      We should remove the pool value when the session is closed, even though not every session has an entry in **sessionToActivePool**.
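
      The cleanup can be sketched with a plain mutable map. Only the name `sessionToActivePool` comes from the description above; the key and value types and the `setPool` and `closeSession` helpers are illustrative assumptions.

      ```scala
      import scala.collection.mutable

      object SessionPoolCleanupExample {
        // Hypothetical stand-in for the per-session state; keys are session ids,
        // values are scheduler pool names.
        val sessionToActivePool = mutable.HashMap.empty[String, String]

        def setPool(sessionId: String, pool: String): Unit =
          sessionToActivePool(sessionId) = pool

        // Called when a session closes. remove is a no-op for sessions that never
        // set a pool, so it is safe to call unconditionally.
        def closeSession(sessionId: String): Unit = {
          sessionToActivePool.remove(sessionId)
          ()
        }

        def main(args: Array[String]): Unit = {
          setPool("session-1", "poolA")
          closeSession("session-1") // entry removed
          closeSession("session-2") // never had an entry; nothing happens
          println(sessionToActivePool.size)
        }
      }
      ```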
      
      Author: guowei2 <guowei2@asiainfo.com>
      
      Closes #3617 from guowei2/SPARK-4756 and squashes the following commits:
      
      e9b97b8 [guowei2] fix compile bug with Shim12
      cf0f521 [guowei2] Merge remote-tracking branch 'apache/master' into SPARK-4756
      e070998 [guowei2] fix: remove active pool of the session when it expired
      22ddb6e0
    • Thu Kyaw's avatar
      [SPARK-3928][SQL] Support wildcard matches on Parquet files. · b68bc6d2
      Thu Kyaw authored
      ...arquetFile accept hadoop glob pattern in path.
      
      Author: Thu Kyaw <trk007@gmail.com>
      
      Closes #3407 from tkyaw/master and squashes the following commits:
      
      19115ad [Thu Kyaw] Merge https://github.com/apache/spark
      ceded32 [Thu Kyaw] [SPARK-3928][SQL] Support wildcard matches on Parquet files.
      d322c28 [Thu Kyaw] [SPARK-3928][SQL] Support wildcard matches on Parquet files.
      ce677c6 [Thu Kyaw] [SPARK-3928][SQL] Support wildcard matches on Parquet files.
      b68bc6d2
    • Cheng Hao's avatar
      [SPARK-2663] [SQL] Support the Grouping Set · f728e0fe
      Cheng Hao authored
      Add support for `GROUPING SETS`, `ROLLUP`, `CUBE` and the virtual column `GROUPING__ID`.
      
      More details on how to use `GROUPING SETS` can be found at: https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup
      https://issues.apache.org/jira/secure/attachment/12676811/grouping_set.pdf
      
      The general idea of the implementation is:
      1 Replace the `ROLLUP`, `CUBE` with `GROUPING SETS`
      2 Explode each input row, and then feed the results to `Aggregate`
        * Each grouping set is represented as a bit mask over the `GroupBy Expression List`; for each bit, `1` means the expression is selected, otherwise `0` (left is the lower bit, and right is the higher bit in the `GroupBy Expression List`)
        * Several projections are constructed according to the grouping sets, and within each projection (`Seq[Expression]`), we replace an expression with `Literal(null)` if it is not selected in the grouping set (based on the bit mask)
        * Output Schema of `Explode` is `child.output :+ grouping__id`
        * GroupBy Expressions of `Aggregate` is `GroupBy Expression List :+ grouping__id`
        * Keep the `Aggregation expressions` the same for the `Aggregate`
      
      The expression substitutions happen during Logical Plan analysis, so we benefit from the Logical Plan optimizations (e.g. expression constant folding, map-side aggregation, etc.). Only an `Explosive` operator is added to the Physical Plan, which explodes the rows according to the pre-set projections.
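
      The bit-mask scheme described above can be sketched in plain Scala. This is an illustration under stated assumptions, not the operator added by the patch: the function names are hypothetical, rows are modelled as `Seq[Any]` holding only the group-by values, and the mask itself stands in for the grouping id column.

      ```scala
      object GroupingSetsExample {
        // Bit i of a mask is 1 when the i-th GROUP BY expression is selected
        // (index 0, the leftmost expression, is the lowest bit).

        // ROLLUP(e0, ..., e(n-1)) selects every prefix of the expression list.
        def rollupMasks(n: Int): Seq[Int] = (n to 0 by -1).map(k => (1 << k) - 1)

        // CUBE(e0, ..., e(n-1)) selects every subset of the expression list.
        def cubeMasks(n: Int): Seq[Int] = (0 until (1 << n)).toSeq

        // Emits one output row per grouping set: unselected group-by values become null,
        // and the mask is appended as the grouping id column.
        def expand(groupByValues: Seq[Any], masks: Seq[Int]): Seq[Seq[Any]] =
          masks.map { mask =>
            val projected = groupByValues.zipWithIndex.map { case (value, i) =>
              if ((mask & (1 << i)) != 0) value else null
            }
            projected :+ mask
          }

        def main(args: Array[String]): Unit = {
          // ROLLUP over two expressions yields the grouping sets (a, b), (a), ().
          expand(Seq("x", 1), rollupMasks(2)).foreach(println)
        }
      }
      ```

      Each exploded row can then be grouped by the projected values together with the appended id, which keeps rows from different grouping sets apart even when their null-padded values coincide.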
      
      A known issue, to be addressed in a follow-up PR:
      * Optimization `ColumnPruning` is not supported yet for `Explosive` node.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #1567 from chenghao-intel/grouping_sets and squashes the following commits:
      
      fe65fcc [Cheng Hao] Remove the extra space
      3547056 [Cheng Hao] Add more doc and Simplify the Expand
      a7c869d [Cheng Hao] update code as feedbacks
      d23c672 [Cheng Hao] Add GroupingExpression to replace the Seq[Expression]
      414b165 [Cheng Hao] revert the unnecessary changes
      ec276c6 [Cheng Hao] Support Rollup/Cube/GroupingSets
      f728e0fe
    • Andrew Or's avatar
      [SPARK-4754] Refactor SparkContext into ExecutorAllocationClient · 9804a759
      Andrew Or authored
      This is so that the `ExecutorAllocationManager` does not take in the `SparkContext`, with all of its dependencies, as an argument. This prevents future developers of this class from tying it further to the `SparkContext`, which has really become quite a monstrous object.
      
      cc'ing pwendell who originally suggested this, and JoshRosen who may have thoughts about the trait mix-in style of `SparkContext`.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #3614 from andrewor14/dynamic-allocation-sc and squashes the following commits:
      
      187070d [Andrew Or] Merge branch 'master' of github.com:apache/spark into dynamic-allocation-sc
      59baf6c [Andrew Or] Merge branch 'master' of github.com:apache/spark into dynamic-allocation-sc
      347a348 [Andrew Or] Refactor SparkContext into ExecutorAllocationClient
      9804a759
    • Aaron Davidson's avatar
      [SPARK-4837] NettyBlockTransferService should use spark.blockManager.port config · 105293a7
      Aaron Davidson authored
      This is used in NioBlockTransferService here:
      https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/network/nio/NioBlockTransferService.scala#L66
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #3688 from aarondav/SPARK-4837 and squashes the following commits:
      
      ebd2007 [Aaron Davidson] [SPARK-4837] NettyBlockTransferService should use spark.blockManager.port config
      105293a7
    • Ivan Vergiliev's avatar
      SPARK-4743 - Use SparkEnv.serializer instead of closureSerializer in aggregateByKey and foldByKey · f9f58b9a
      Ivan Vergiliev authored
      Author: Ivan Vergiliev <ivan@leanplum.com>
      
      Closes #3605 from IvanVergiliev/change-serializer and squashes the following commits:
      
      a49b7cf [Ivan Vergiliev] Use serializer instead of closureSerializer in aggregate/foldByKey.
      f9f58b9a
    • Madhu Siddalingaiah's avatar
      [SPARK-4884]: Improve Partition docs · d5a596d4
      Madhu Siddalingaiah authored
      Rewording was based on this discussion: http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-data-flow-td9804.html
      This is the associated JIRA ticket: https://issues.apache.org/jira/browse/SPARK-4884
      
      Author: Madhu Siddalingaiah <madhu@madhu.com>
      
      Closes #3722 from msiddalingaiah/master and squashes the following commits:
      
      79e679f [Madhu Siddalingaiah] [DOC]: improve documentation
      51d14b9 [Madhu Siddalingaiah] Merge remote-tracking branch 'upstream/master'
      38faca4 [Madhu Siddalingaiah] Merge remote-tracking branch 'upstream/master'
      cbccbfe [Madhu Siddalingaiah] Documentation: replace <b> with <code> (again)
      332f7a2 [Madhu Siddalingaiah] Documentation: replace <b> with <code>
      cd2b05a [Madhu Siddalingaiah] Merge remote-tracking branch 'upstream/master'
      0fc12d7 [Madhu Siddalingaiah] Documentation: add description for repartitionAndSortWithinPartitions
      d5a596d4
    • Ernest's avatar
      [SPARK-4880] remove spark.locality.wait in Analytics · a7ed6f3c
      Ernest authored
      spark.locality.wait is set to 100000 in examples/graphx/Analytics.scala.
      This setting should be left to the user.
      
      Author: Ernest <earneyzxl@gmail.com>
      
      Closes #3730 from Earne/SPARK-4880 and squashes the following commits:
      
      d79ed04 [Ernest] remove spark.locality.wait in Analytics
      a7ed6f3c
    • DB Tsai's avatar
      [SPARK-4887][MLlib] Fix a bad unittest in LogisticRegressionSuite · 59a49db5
      DB Tsai authored
      The original test doesn't make sense since if you step in, the lossSum is already NaN,
      and the coefficients are diverging. That's because the step size is too large for SGD,
      so it doesn't work.
      
      The correct behavior is that you should get smaller coefficients than the ones
      without regularization. Comparing the values using a relative error of 20000.0
      doesn't make sense either.
      
      Author: DB Tsai <dbtsai@alpinenow.com>
      
      Closes #3735 from dbtsai/mlortestfix and squashes the following commits:
      
      b1a3c42 [DB Tsai] first commit
      59a49db5
    • Ilya Ganelin's avatar
      [SPARK-3607] ConnectionManager threads.max configs on the thread pools don't work · 3720057b
      Ilya Ganelin authored
      Hi all - cleaned up the code to get rid of the unused parameter and added some discussion of the ThreadPoolExecutor parameters to explain why we can use a single threadCount instead of providing a min/max.
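
      One documented property of `java.util.concurrent.ThreadPoolExecutor` that is relevant here: with an unbounded work queue, the pool never grows past its core size, so the maximum size has no effect. The sketch below demonstrates that behaviour; the object and method names are illustrative, and whether this is the exact argument made in the patch's comments is an assumption.

      ```scala
      import java.util.concurrent.{CountDownLatch, LinkedBlockingQueue, ThreadPoolExecutor, TimeUnit}

      object FixedPoolSizeExample {
        // Submits `tasks` blocking tasks to a pool with the given core and max sizes and an
        // unbounded queue, then reports how many threads the pool actually created.
        def observedPoolSize(core: Int, max: Int, tasks: Int): Int = {
          val release = new CountDownLatch(1)
          val pool = new ThreadPoolExecutor(
            core, max, 60L, TimeUnit.SECONDS, new LinkedBlockingQueue[Runnable]())
          try {
            for (_ <- 0 until tasks) {
              pool.execute(new Runnable {
                override def run(): Unit = release.await()
              })
            }
            // Threads are started inside execute() until the core size is reached;
            // after that, new tasks are queued rather than given new threads.
            pool.getPoolSize
          } finally {
            // Unblock the workers and let the pool wind down.
            release.countDown()
            pool.shutdown()
          }
        }

        def main(args: Array[String]): Unit = {
          println(observedPoolSize(core = 2, max = 8, tasks = 20))
        }
      }
      ```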
      
      Author: Ilya Ganelin <ilya.ganelin@capitalone.com>
      
      Closes #3664 from ilganeli/SPARK-3607C and squashes the following commits:
      
      3c05690 [Ilya Ganelin] Updated documentation and refactored code to extract shared variables
      3720057b
    • Timothy Chen's avatar
      Add mesos specific configurations into doc · d9956f86
      Timothy Chen authored
      Author: Timothy Chen <tnachen@gmail.com>
      
      Closes #3349 from tnachen/mesos_doc and squashes the following commits:
      
      737ef49 [Timothy Chen] Add TOC
      5ca546a [Timothy Chen] Update description around cores requested.
      26283a5 [Timothy Chen] Add mesos specific configurations into doc
      d9956f86
    • Sandy Ryza's avatar
      SPARK-3779. yarn spark.yarn.applicationMaster.waitTries config should be... · 253b72b5
      Sandy Ryza authored
      ... changed to a time period
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #3471 from sryza/sandy-spark-3779 and squashes the following commits:
      
      20b9887 [Sandy Ryza] Deprecate old property
      42b5df7 [Sandy Ryza] Review feedback
      9a959a1 [Sandy Ryza] SPARK-3779. yarn spark.yarn.applicationMaster.waitTries config should be changed to a time period
      253b72b5