Skip to content
Snippets Groups Projects
  1. Oct 31, 2014
    • Kousuke Saruta's avatar
      [SPARK-3870] EOL character enforcement · 55ab7770
      Kousuke Saruta authored
      We have shell scripts and Windows batch files, so we should enforce proper EOL character.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #2726 from sarutak/eol-enforcement and squashes the following commits:
      
      9748c3f [Kousuke Saruta] Fixed make.bat
      252de89 [Kousuke Saruta] Removed extra characters from make.bat
      5b81c00 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into eol-enforcement
      8633ed2 [Kousuke Saruta] merge branch 'master' of git://git.apache.org/spark into eol-enforcement
      5d630d8 [Kousuke Saruta] Merged
      ba10797 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into eol-enforcement
      7407515 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into eol-enforcement
      772fd4e [Kousuke Saruta] Normized EOL character in make.bat and compute-classpath.cmd
      ac7f873 [Kousuke Saruta] Added an entry for .gitattributes to .rat-excludes
      1570e77 [Kousuke Saruta] Added .gitattributes
      55ab7770
    • Xiangrui Meng's avatar
      [SPARK-4150][PySpark] return self in rdd.setName · f1e7361f
      Xiangrui Meng authored
      Then we can do `rdd.setName('abc').cache().count()`.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3011 from mengxr/rdd-setname and squashes the following commits:
      
      10d0d60 [Xiangrui Meng] update test
      4ac3bbd [Xiangrui Meng] return self in rdd.setName
      f1e7361f
    • Mark Mims's avatar
      [SPARK-4141] Hide Accumulators column on stage page when no accumulators exist · a68ecf32
      Mark Mims authored
      WebUI
      
      Author: Mark Mims <mark.mims@canonical.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Josh Rosen <joshrosen@databricks.com>
      
      Closes #3031 from mmm/remove-accumulators-col and squashes the following commits:
      
      6141cb3 [Mark Mims] reformat to satisfy scalastyle linelength.  build failed from jenkins https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22604/
      390893b [Mark Mims] cleanup
      c28c449 [Mark Mims] looking much better now... minimal explicit formatting.  Now, see if any sort keys make sense
      fb72156 [Mark Mims] mimic hasInput.  The basics work here, but wanna clean this up with maybeAccumulators for column content
      a68ecf32
    • Cheng Lian's avatar
      [SPARK-2220][SQL] Fixes remaining Hive commands · 23468e7e
      Cheng Lian authored
      This PR adds support for the `ADD FILE` Hive command, and removes `ShellCommand` and `SourceCommand`. The reason is described in [this SPARK-2220 comment](https://issues.apache.org/jira/browse/SPARK-2220?focusedCommentId=14191841&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14191841).
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #3038 from liancheng/hive-commands and squashes the following commits:
      
      6db61e0 [Cheng Lian] Fixes remaining Hive commands
      23468e7e
    • ravipesala's avatar
      [SPARK-4154][SQL] Query does not work if it has "not between " in Spark SQL and HQL · ea465af1
      ravipesala authored
      if the query contains "not between" does not work like.
      SELECT * FROM src where key not between 10 and 20'
      
      Author: ravipesala <ravindra.pesala@huawei.com>
      
      Closes #3017 from ravipesala/SPARK-4154 and squashes the following commits:
      
      65fc89e [ravipesala] Handled admin comments
      32e6d42 [ravipesala] 'not between' is not working
      ea465af1
    • Venkata Ramana Gollamudi's avatar
      [SPARK-4077][SQL] Spark SQL return wrong values for valid string timestamp values · fa712b30
      Venkata Ramana Gollamudi authored
      In org.apache.hadoop.hive.serde2.io.TimestampWritable.set , if the next entry is null then current time stamp object is being reset.
      However because of this hiveinspectors:unwrap cannot use the same timestamp object without creating a copy.
      
      Author: Venkata Ramana G <ramana.gollamudihuawei.com>
      
      Author: Venkata Ramana Gollamudi <ramana.gollamudi@huawei.com>
      
      Closes #3019 from gvramana/spark_4077 and squashes the following commits:
      
      32d818f [Venkata Ramana Gollamudi] fixed check style
      fa01e71 [Venkata Ramana Gollamudi] cloned timestamp object as org.apache.hadoop.hive.serde2.io.TimestampWritable.set will reset current time object
      fa712b30
    • wangfei's avatar
      [SPARK-3826][SQL]enable hive-thriftserver to support hive-0.13.1 · 7c41d135
      wangfei authored
       In #2241 hive-thriftserver is not enabled. This patch enable hive-thriftserver to support hive-0.13.1 by using a shim layer refer to #2241.
      
       1 A light shim layer(code in sql/hive-thriftserver/hive-version) for each different hive version to handle api compatibility
      
       2 New pom profiles "hive-default" and "hive-versions"(copy from #2241) to activate different hive version
      
       3 SBT cmd for different version as follows:
         hive-0.12.0 --- sbt/sbt -Phive,hadoop-2.3 -Phive-0.12.0 assembly
         hive-0.13.1 --- sbt/sbt -Phive,hadoop-2.3 -Phive-0.13.1 assembly
      
       4 Since hive-thriftserver depend on hive subproject, this patch should be merged with #2241 to enable hive-0.13.1 for hive-thriftserver
      
      Author: wangfei <wangfei1@huawei.com>
      Author: scwf <wangfei1@huawei.com>
      
      Closes #2685 from scwf/shim-thriftserver1 and squashes the following commits:
      
      f26f3be [wangfei] remove clean to save time
      f5cac74 [wangfei] remove local hivecontext test
      578234d [wangfei] use new shaded hive
      18fb1ff [wangfei] exclude kryo in hive pom
      fa21d09 [wangfei] clean package assembly/assembly
      8a4daf2 [wangfei] minor fix
      0d7f6cf [wangfei] address comments
      f7c93ae [wangfei] adding build with hive 0.13 before running tests
      bcf943f [wangfei] Merge branch 'master' of https://github.com/apache/spark into shim-thriftserver1
      c359822 [wangfei] reuse getCommandProcessor in hiveshim
      52674a4 [scwf] sql/hive included since examples depend on it
      3529e98 [scwf] move hive module to hive profile
      f51ff4e [wangfei] update and fix conflicts
      f48d3a5 [scwf] Merge branch 'master' of https://github.com/apache/spark into shim-thriftserver1
      41f727b [scwf] revert pom changes
      13afde0 [scwf] fix small bug
      4b681f4 [scwf] enable thriftserver in profile hive-0.13.1
      0bc53aa [scwf] fixed when result filed is null
      dfd1c63 [scwf] update run-tests to run hive-0.12.0 default now
      c6da3ce [scwf] Merge branch 'master' of https://github.com/apache/spark into shim-thriftserver
      7c66b8e [scwf] update pom according spark-2706
      ae47489 [scwf] update and fix conflicts
      7c41d135
    • Kay Ousterhout's avatar
      [SPARK-4016] Allow user to show/hide UI metrics. · adb6415c
      Kay Ousterhout authored
      This commit adds a set of checkboxes to the stage detail
      page that the user can use to show additional task metrics,
      including the GC time, result serialization time, result fetch
      time, and scheduler delay.  All of these metrics are now
      hidden by default.  This allows advanced users to look at more
      detailed metrics, without distracting the average user.
      
      This change also cleans up the stage detail page so that metrics
      are shown in the same order in the summary table as in the task table,
      and updates the metrics in both tables such that they contain the same
      set of metrics.
      
      The ability to remember a user's preferences for which metrics
      should be shown has been filed as SPARK-4024.
      
      Here's what the stage detail page looks like by default:
      ![image](https://cloud.githubusercontent.com/assets/1108612/4744322/3ebe319e-5a2f-11e4-891f-c792be79caa2.png)
      
      and once a user clicks "Show additional metrics" (note that all the metrics get checked by default):
      ![image](https://cloud.githubusercontent.com/assets/1108612/4744332/51e5abda-5a2f-11e4-8994-d0d3705ee05d.png)
      
      cc shivaram andrewor14
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #2867 from kayousterhout/SPARK-checkboxes and squashes the following commits:
      
      6015913 [Kay Ousterhout] Added comment
      08dee73 [Kay Ousterhout] Josh's usability comments
      0940d61 [Kay Ousterhout] Style updates based on Andrew's review
      ef05ccd [Kay Ousterhout] Added tooltips
      d7cfaaf [Kay Ousterhout] Made list of add'l metrics collapsible.
      70c1fb5 [Kay Ousterhout] [SPARK-4016] Allow user to show/hide UI metrics.
      adb6415c
    • Sandy Ryza's avatar
      SPARK-3837. Warn when YARN kills containers for exceeding memory limits · acd4ac7c
      Sandy Ryza authored
      I triggered the issue and verified the message gets printed on a pseudo-distributed cluster.
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #2744 from sryza/sandy-spark-3837 and squashes the following commits:
      
      858a268 [Sandy Ryza] Review feedback
      c937f00 [Sandy Ryza] SPARK-3837. Warn when YARN kills containers for exceeding memory limits
      acd4ac7c
    • Cheng Hao's avatar
      [SPARK-4143] [SQL] Move inner class DeferredObjectAdapter to top level · 58a6077e
      Cheng Hao authored
      The class DeferredObjectAdapter is the inner class of HiveGenericUdf, which may cause some overhead in closure ser/de-ser. Move it to top level.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #3007 from chenghao-intel/move_deferred and squashes the following commits:
      
      3a139b1 [Cheng Hao] Move inner class DeferredObjectAdapter to top level
      58a6077e
    • Anant's avatar
      [SPARK-4108][SQL] Fixed usage of deprecated in sql/catalyst/types/datatypes · d31517a3
      Anant authored
      Fixed usage of deprecated in sql/catalyst/types/datatypes to have versio...n parameter
      
      Author: Anant <anant.asty@gmail.com>
      
      Closes #2970 from anantasty/SPARK-4108 and squashes the following commits:
      
      e92cb01 [Anant] Fixed usage of deprecated in sql/catalyst/types/datatypes to have version parameter
      d31517a3
    • Erik Erlandson's avatar
      [SPARK-3250] Implement Gap Sampling optimization for random sampling · ad3bd0df
      Erik Erlandson authored
      More efficient sampling, based on Gap Sampling optimization:
      http://erikerlandson.github.io/blog/2014/09/11/faster-random-samples-with-gap-sampling/
      
      Author: Erik Erlandson <eerlands@redhat.com>
      
      Closes #2455 from erikerlandson/spark-3250-pr and squashes the following commits:
      
      72496bc [Erik Erlandson] [SPARK-3250] Implement Gap Sampling optimization for random sampling
      ad3bd0df
    • Davies Liu's avatar
      [SPARK-4124] [MLlib] [PySpark] simplify serialization in MLlib Python API · 872fc669
      Davies Liu authored
      Create several helper functions to call MLlib Java API, convert the arguments to Java type and convert return value to Python object automatically, this simplify serialization in MLlib Python API very much.
      
      After this, the MLlib Python API does not need to deal with serialization details anymore, it's easier to add new API.
      
      cc mengxr
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #2995 from davies/cleanup and squashes the following commits:
      
      8fa6ec6 [Davies Liu] address comments
      16b85a0 [Davies Liu] Merge branch 'master' of github.com:apache/spark into cleanup
      43743e5 [Davies Liu] bugfix
      731331f [Davies Liu] simplify serialization in MLlib Python API
      872fc669
  2. Oct 30, 2014
    • Patrick Wendell's avatar
      HOTFIX: Clean up build in network module. · 0734d093
      Patrick Wendell authored
      This is currently breaking the package build for some people (including me).
      
      This patch does some general clean-up which also fixes the current issue.
      - Uses consistent artifact naming
      - Adds sbt support for this module
      - Changes tests to use scalatest (fixes the original issue[1])
      
      One thing to note, it turns out that scalatest when invoked in the
      Maven build doesn't succesfully detect JUnit Java tests. This is
      a long standing issue, I noticed it applies to all of our current
      test suites as well. I've created SPARK-4159 to fix this.
      
      [1] The original issue is that we need to allocate extra memory
      for the tests, happens by default in our scalatest configuration.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #3025 from pwendell/hotfix and squashes the following commits:
      
      faa9053 [Patrick Wendell] HOTFIX: Clean up build in network module.
      0734d093
    • Andrew Or's avatar
      Revert "SPARK-1209 [CORE] SparkHadoop{MapRed,MapReduce}Util should not use... · 26d31d15
      Andrew Or authored
      Revert "SPARK-1209 [CORE] SparkHadoop{MapRed,MapReduce}Util should not use package org.apache.hadoop"
      
      This reverts commit 68cb69da.
      26d31d15
    • Yash Datta's avatar
      [SPARK-3968][SQL] Use parquet-mr filter2 api · 2e35e242
      Yash Datta authored
      The parquet-mr project has introduced a new filter api  (https://github.com/apache/incubator-parquet-mr/pull/4), along with several fixes . It can also eliminate entire RowGroups depending on certain statistics like min/max
      We can leverage that to further improve performance of queries with filters.
      Also filter2 api introduces ability to create custom filters. We can create a custom filter for the optimized In clause (InSet) , so that elimination happens in the ParquetRecordReader itself
      
      Author: Yash Datta <Yash.Datta@guavus.com>
      
      Closes #2841 from saucam/master and squashes the following commits:
      
      8282ba0 [Yash Datta] SPARK-3968: fix scala code style and add some more tests for filtering on optional columns
      515df1c [Yash Datta] SPARK-3968: Add a test case for filter pushdown on optional column
      5f4530e [Yash Datta] SPARK-3968: Fix scala code style
      f304667 [Yash Datta] SPARK-3968: Using task metadata strategy for row group filtering
      ec53e92 [Yash Datta] SPARK-3968: No push down should result in case we are unable to create a record filter
      48163c3 [Yash Datta] SPARK-3968: Code cleanup
      cc7b596 [Yash Datta] SPARK-3968: 1. Fix RowGroupFiltering not working             2. Use the serialization/deserialization from Parquet library for filter pushdown
      caed851 [Yash Datta] Revert "SPARK-3968: Not pushing the filters in case of OPTIONAL columns" since filtering on optional columns is now supported in filter2 api
      49703c9 [Yash Datta] SPARK-3968: Not pushing the filters in case of OPTIONAL columns
      9d09741 [Yash Datta] SPARK-3968: Change parquet filter pushdown to use filter2 api of parquet-mr
      2e35e242
    • ravipesala's avatar
      [SPARK-4120][SQL] Join of multiple tables with syntax like SELECT .. FROM... · 9b6ebe33
      ravipesala authored
      [SPARK-4120][SQL] Join of multiple tables with syntax like SELECT .. FROM T1,T2,T3.. does not work in SparkSQL
      
      Right now it works for only 2 tables like below query.
      sql("SELECT * FROM records1 as a,records2 as b where a.key=b.key ")
      
      But it does not work for more than 2 tables like below query
      sql("SELECT * FROM records1 as a,records2 as b,records3 as c where a.key=b.key and a.key=c.key").
      
      Author: ravipesala <ravindra.pesala@huawei.com>
      
      Closes #2987 from ravipesala/multijoin and squashes the following commits:
      
      429b005 [ravipesala] Support multiple joins
      9b6ebe33
    • Sean Owen's avatar
      SPARK-1209 [CORE] SparkHadoop{MapRed,MapReduce}Util should not use package org.apache.hadoop · 68cb69da
      Sean Owen authored
      (This is just a look at what completely moving the classes would look like. I know Patrick flagged that as maybe not OK, although, it's private?)
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #2814 from srowen/SPARK-1209 and squashes the following commits:
      
      ead1115 [Sean Owen] Disable MIMA warnings resulting from moving the class -- this was also part of the PairRDDFunctions type hierarchy though?
      2d42c1d [Sean Owen] Move SparkHadoopMapRedUtil / SparkHadoopMapReduceUtil from org.apache.hadoop to org.apache.spark
      68cb69da
    • Andrew Or's avatar
      [SPARK-3661] Respect spark.*.memory in cluster mode · 2f545438
      Andrew Or authored
      This also includes minor re-organization of the code. Tested locally in both client and deploy modes.
      
      Author: Andrew Or <andrew@databricks.com>
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #2697 from andrewor14/memory-cluster-mode and squashes the following commits:
      
      01d78bc [Andrew Or] Merge branch 'master' of github.com:apache/spark into memory-cluster-mode
      ccd468b [Andrew Or] Add some comments per Patrick
      c956577 [Andrew Or] Tweak wording
      2b4afa0 [Andrew Or] Unused import
      47a5a88 [Andrew Or] Correct Spark properties precedence order
      bf64717 [Andrew Or] Merge branch 'master' of github.com:apache/spark into memory-cluster-mode
      dd452d0 [Andrew Or] Respect spark.*.memory in cluster mode
      2f545438
    • zsxwing's avatar
      [SPARK-4153][WebUI] Update the sort keys for HistoryPage · d3450578
      zsxwing authored
      Sort "Started", "Completed", "Duration" and "Last Updated" by time.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #3014 from zsxwing/SPARK-4153 and squashes the following commits:
      
      ec8b9ad [zsxwing] Sort "Started", "Completed", "Duration" and "Last Updated" by time
      d3450578
    • Andrew Or's avatar
      Minor style hot fix after #2711 · 849b43ec
      Andrew Or authored
      I had planned to fix this when I merged it but I forgot to. witgo
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #3018 from andrewor14/command-utils-style and squashes the following commits:
      
      c2959fb [Andrew Or] Style hot fix
      849b43ec
    • Andrew Or's avatar
      [SPARK-4155] Consolidate usages of <driver> · 9334d699
      Andrew Or authored
      We use "\<driver\>" everywhere. Let's not do that.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #3020 from andrewor14/consolidate-driver and squashes the following commits:
      
      c1c2204 [Andrew Or] Just use "<driver>" for local executor ID
      3d751e9 [Andrew Or] Consolidate usages of <driver>
      9334d699
    • Andrew Or's avatar
      [Minor] A few typos in comments and log messages · 5231a3f2
      Andrew Or authored
      Author: Andrew Or <andrewor14@gmail.com>
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #3021 from andrewor14/typos and squashes the following commits:
      
      daaf417 [Andrew Or] Merge branch 'master' of github.com:apache/spark into typos
      4838ae4 [Andrew Or] Merge branch 'master' of github.com:apache/spark into typos
      026d426 [Andrew Or] Merge branch 'master' of github.com:andrewor14/spark into typos
      a81ae8f [Andrew Or] Some typos
      5231a3f2
    • Andrew Or's avatar
      [SPARK-4138][SPARK-4139] Improve dynamic allocation settings · 26f092d4
      Andrew Or authored
      This should be merged after #2746 (SPARK-3795).
      
      **SPARK-4138**. If the user sets both the number of executors and `spark.dynamicAllocation.enabled`, we should throw an exception.
      
      **SPARK-4139**. If the user sets `spark.dynamicAllocation.enabled`, we should use the max number of executors as the starting number of executors because the first job is likely to run immediately after application startup. If the latter is not set, throw an exception.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #3002 from andrewor14/yarn-set-executors and squashes the following commits:
      
      c528fce [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-set-executors
      55d4699 [Andrew Or] Bug fix: `isDynamicAllocationEnabled` was always false
      2b0ccec [Andrew Or] Start the number of executors at the max
      022bfde [Andrew Or] Guard against incompatible settings of number of executors
      26f092d4
    • Andrew Or's avatar
      [SPARK-3319] [SPARK-3338] Resolve Spark submit config paths · 24c51292
      Andrew Or authored
      The bulk of this PR is comprised of tests. All changes in functionality are made in `SparkSubmit.scala` (~20 lines).
      
      **SPARK-3319.** There is currently a divergence in behavior when the user passes in additional jars through `--jars` and through setting `spark.jars` in the default properties file. The former will happily resolve the paths (e.g. convert `my.jar` to `file:/absolute/path/to/my.jar`), while the latter does not. We should resolve paths consistently in both cases. This also applies to the following pairs of command line arguments and Spark configs:
      
      - `--jars` ~ `spark.jars`
      - `--files` ~ `spark.files` / `spark.yarn.dist.files`
      - `--archives` ~ `spark.yarn.dist.archives`
      - `--py-files` ~ `spark.submit.pyFiles`
      
      **SPARK-3338.** This PR also fixes the following bug: if the user sets `spark.submit.pyFiles` in his/her properties file, it does not actually get picked up even if `--py-files` is not set. This is simply because the config is overridden by an empty string.
      
      Author: Andrew Or <andrewor14@gmail.com>
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #2232 from andrewor14/resolve-config-paths and squashes the following commits:
      
      fff2869 [Andrew Or] Add spark.yarn.jar
      da3a1c1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into resolve-config-paths
      f0fae64 [Andrew Or] Merge branch 'master' of github.com:apache/spark into resolve-config-paths
      05e03d6 [Andrew Or] Add tests for resolving both command line and config paths
      460117e [Andrew Or] Resolve config paths properly
      fe039d3 [Andrew Or] Beef up tests to test fixed-pointed-ness of Utils.resolveURI(s)
      24c51292
    • Grace's avatar
      [SPARK-4078] New FsPermission instance w/o FsPermission.createImmutable in eventlog · 9142c9b8
      Grace authored
      By default, Spark builds its package against Hadoop 1.0.4 version. In that version, it has some FsPermission bug (see [HADOOP-7629] (https://issues.apache.org/jira/browse/HADOOP-7629) by Todd Lipcon). This bug got fixed since 1.1 version. By using that FsPermission.createImmutable() API, end-user may see some RPC exception like below (if turn on eventlog over HDFS).  Here proposes a quick fix to avoid certain exception for all hadoop versions.
      ```
      Exception in thread "main" java.io.IOException: Call to sr484/10.1.2.84:54310 failed on local exception: java.io.EOFException
              at org.apache.hadoop.ipc.Client.wrapException(Client.java:1150)
              at org.apache.hadoop.ipc.Client.call(Client.java:1118)
              at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
              at $Proxy6.setPermission(Unknown Source)
              at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
              at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
              at java.lang.reflect.Method.invoke(Method.java:597)
              at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
              at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)
              at $Proxy6.setPermission(Unknown Source)
              at org.apache.hadoop.hdfs.DFSClient.setPermission(DFSClient.java:1285)
              at org.apache.hadoop.hdfs.DistributedFileSystem.setPermission(DistributedFileSystem.java:572)
              at org.apache.spark.util.FileLogger.createLogDir(FileLogger.scala:138)
              at org.apache.spark.util.FileLogger.start(FileLogger.scala:115)
              at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:74)
              at org.apache.spark.SparkContext.<init>(SparkContext.scala:324)
      ```
      
      Author: Grace <jie.huang@intel.com>
      
      Closes #2892 from GraceH/eventlog-rpc and squashes the following commits:
      
      58ea038 [Grace] new FsPermission Instance w/o FsPermission.createImmutable
      9142c9b8
    • Tathagata Das's avatar
      [SPARK-4027][Streaming] WriteAheadLogBackedBlockRDD to read received either... · fb1fbca2
      Tathagata Das authored
      [SPARK-4027][Streaming] WriteAheadLogBackedBlockRDD to read received either from BlockManager or WAL in HDFS
      
      As part of the initiative of preventing data loss on streaming driver failure, this sub-task implements a BlockRDD that is backed by HDFS. This BlockRDD can either read data from the Spark's BlockManager, or read the data from file-segments in write ahead log in HDFS.
      
      Most of this code has been written by @harishreedharan
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      Author: Hari Shreedharan <hshreedharan@apache.org>
      
      Closes #2931 from tdas/driver-ha-rdd and squashes the following commits:
      
      209e49c [Tathagata Das] Better fix to style issue.
      4a5866f [Tathagata Das] Addressed one more comment.
      ed5fbf0 [Tathagata Das] Minor updates.
      b0a18b1 [Tathagata Das] Fixed import order.
      20aa7c6 [Tathagata Das] Fixed more line length issues.
      29aa099 [Tathagata Das] Fixed line length issues.
      9e47b5b [Tathagata Das] Renamed class, simplified+added unit tests.
      6e1bfb8 [Tathagata Das] Tweaks testuite to create spark contxt lazily to prevent contxt leaks.
      9c86a61 [Tathagata Das] Merge pull request #22 from harishreedharan/driver-ha-rdd
      2878c38 [Hari Shreedharan] Shutdown spark context after tests. Formatting/minor fixes
      c709f2f [Tathagata Das] Merge pull request #21 from harishreedharan/driver-ha-rdd
      5cce16f [Hari Shreedharan] Make sure getBlockLocations uses offset and length to find the blocks on HDFS
      eadde56 [Tathagata Das] Transferred HDFSBackedBlockRDD for the driver-ha-working branch
      fb1fbca2
    • Tathagata Das's avatar
      [SPARK-4028][Streaming] ReceivedBlockHandler interface to abstract the... · 234de923
      Tathagata Das authored
      [SPARK-4028][Streaming] ReceivedBlockHandler interface to abstract the functionality of storage of received data
      
      As part of the initiative to prevent data loss on streaming driver failure, this JIRA tracks the subtask of implementing a ReceivedBlockHandler, that abstracts the functionality of storage of received data blocks. The default implementation will maintain the current behavior of storing the data into BlockManager. The optional implementation will store the data to both BlockManager as well as a write ahead log.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #2940 from tdas/driver-ha-rbh and squashes the following commits:
      
      78a4aaa [Tathagata Das] Fixed bug causing test failures.
      f192f47 [Tathagata Das] Fixed import order.
      df5f320 [Tathagata Das] Updated code to use ReceivedBlockStoreResult as the return type for handler's storeBlock
      33c30c9 [Tathagata Das] Added license, and organized imports.
      2f025b3 [Tathagata Das] Updates based on PR comments.
      18aec1e [Tathagata Das] Moved ReceivedBlockInfo back into spark.streaming.scheduler package
      95a4987 [Tathagata Das] Added ReceivedBlockHandler and its associated tests
      234de923
    • Yanbo Liang's avatar
      SPARK-4111 [MLlib] add regression metrics · d9327192
      Yanbo Liang authored
      Add RegressionMetrics.scala as regression metrics used for evaluation and corresponding test case RegressionMetricsSuite.scala.
      
      Author: Yanbo Liang <yanbohappy@gmail.com>
      Author: liangyanbo <liangyanbo@meituan.com>
      
      Closes #2978 from yanbohappy/regression_metrics and squashes the following commits:
      
      730d0a9 [Yanbo Liang] more clearly annotation
      3d0bec1 [Yanbo Liang] rename and keep code style
      a8ad3e3 [Yanbo Liang] simplify code for keeping style
      d454909 [Yanbo Liang] rename parameter and function names, delete unused columns, add reference
      2e56282 [liangyanbo] rename r2_score() and remove unused column
      43bb12b [liangyanbo] add regression metrics
      d9327192
    • Joseph E. Gonzalez's avatar
      [SPARK-4130][MLlib] Fixing libSVM parser bug with extra whitespace · c7ad0852
      Joseph E. Gonzalez authored
      This simple patch filters out extra whitespace entries.
      
      Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
      Author: Joey <joseph.e.gonzalez@gmail.com>
      
      Closes #2996 from jegonzal/loadLibSVM and squashes the following commits:
      
      e0227ab [Joey] improving readability
      e028e84 [Joseph E. Gonzalez] fixing whitespace bug in loadLibSVMFile when parsing libSVM files
      c7ad0852
    • Kay Ousterhout's avatar
      [SPARK-4102] Remove unused ShuffleReader.stop() method. · 6db31574
      Kay Ousterhout authored
      This method is not implemented by the only subclass
      (HashShuffleReader), nor is it ever called. While the
      use of Scala's fancy "???" was pretty exciting, the method's
      existence can only lead to confusion and it therefore should
      be deleted.
      
      mateiz was there a reason for adding this that I'm
      missing?
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #2966 from kayousterhout/SPARK-4102 and squashes the following commits:
      
      532c564 [Kay Ousterhout] Added back commented-out method, as per Matei's request
      904655e [Kay Ousterhout] [SPARK-4102] Remove unused ShuffleReader.stop() method.
      6db31574
    • GuoQiang Li's avatar
      [SPARK-1720][SPARK-1719] use LD_LIBRARY_PATH instead of -Djava.library.path · cd739bd7
      GuoQiang Li authored
      - [X] Standalone
      - [X] YARN
      - [X] Mesos
      - [X]  Mac OS X
      - [X] Linux
      - [ ]  Windows
      
      This is another implementation about #1031
      
      Author: GuoQiang Li <witgo@qq.com>
      
      Closes #2711 from witgo/SPARK-1719 and squashes the following commits:
      
      c7b26f6 [GuoQiang Li] review commits
      4488e41 [GuoQiang Li] Refactoring CommandUtils
      a444094 [GuoQiang Li] review commits
      40c0b4a [GuoQiang Li] Add buildLocalCommand method
      c1a0ddd [GuoQiang Li] fix comments
      156ce88 [GuoQiang Li] review commit
      38aa377 [GuoQiang Li] Refactor CommandUtils.scala
      4269e00 [GuoQiang Li] Refactor SparkSubmitDriverBootstrapper.scala
      7a1d634 [GuoQiang Li] use LD_LIBRARY_PATH instead of -Djava.library.path
      cd739bd7
  3. Oct 29, 2014
    • Tathagata Das's avatar
      [SPARK-4053][Streaming] Made the ReceiverSuite test more reliable, by fixing... · 12342580
      Tathagata Das authored
      [SPARK-4053][Streaming] Made the ReceiverSuite test more reliable, by fixing block generator throttling
      
      In the unit test that checked whether blocks generated by throttled block generator had expected number of records, the thresholds are too tight, which sometimes led to the test failing.
      This PR fixes it by relaxing the thresholds and the time intervals for testing.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #2900 from tdas/receiver-suite-flakiness and squashes the following commits:
      
      28508a2 [Tathagata Das] Made the ReceiverSuite test more reliable
      12342580
    • Andrew Or's avatar
      [SPARK-3795] Heuristics for dynamically scaling executors · 8d59b37b
      Andrew Or authored
      This is part of a bigger effort to provide elastic scaling of executors within a Spark application ([SPARK-3174](https://issues.apache.org/jira/browse/SPARK-3174)). This PR does not provide any functionality by itself; it is a skeleton that is missing a mechanism to be added later in [SPARK-3822](https://issues.apache.org/jira/browse/SPARK-3822).
      
      Comments and feedback are most welcome. For those of you reviewing this in detail, I highly recommend doing it through your favorite IDE instead of through the diff here.
      
      Author: Andrew Or <andrewor14@gmail.com>
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #2746 from andrewor14/scaling-heuristics and squashes the following commits:
      
      8a4fdaa [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics
      e045df8 [Andrew Or] Add warning message (minor)
      dfa31ec [Andrew Or] Fix tests
      c0becc4 [Andrew Or] Merging with SPARK-3822
      4784f93 [Andrew Or] Reword an awkward log message
      181f27f [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics
      c79e907 [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics
      4672b90 [Andrew Or] It's nano time.
      a6a30f2 [Andrew Or] Do not allow min/max executors of 0
      c60ec33 [Andrew Or] Rewrite test logic with clocks
      b00b680 [Andrew Or] Fix style
      c3caa65 [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics
      7f9da14 [Andrew Or] Factor out logic to verify bounds on # executors (minor)
      f279019 [Andrew Or] Add time mocking tests for polling loop
      685e347 [Andrew Or] Factor out clock in polling loop to facilitate testing
      3cea7f7 [Andrew Or] Use PrivateMethodTester to keep original class private
      3156d81 [Andrew Or] Update comments and exception messages
      92f36f9 [Andrew Or] Address minor review comments
      abdea61 [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics
      2aefd09 [Andrew Or] Correct listener behavior
      9fe6e44 [Andrew Or] Rename variables and configs + update comments and log messages
      149cc32 [Andrew Or] Fix style
      254c958 [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics
      5ff829b [Andrew Or] Add tests for ExecutorAllocationManager
      19c6c4b [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics
      5896515 [Andrew Or] Move ExecutorAllocationManager out of scheduler package
      9ca8945 [Andrew Or] Rewrite callbacks through the listener interface
      5e336b9 [Andrew Or] Remove code from backend to avoid conflict with SPARK-3822
      092d1fd [Andrew Or] Remove timeout logic for pending requests
      1309fab [Andrew Or] Request executors by specifying the number pending
      8bc0e9d [Andrew Or] Add logic to expire pending requests after timeouts
      b750ee1 [Andrew Or] Express timers in terms of expiration times + remove retry logic
      7f8dd47 [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics
      9d516cc [Andrew Or] Bug fix: Actually trigger the add timer / add retry timer
      44f1832 [Andrew Or] Rename configs to include time units
      eaae7ef [Andrew Or] Address various review comments
      6f8be6c [Andrew Or] Beef up comments on what each of the timers mean
      baaa403 [Andrew Or] Simplify variable names (minor)
      42beec8 [Andrew Or] Reset whether the add threshold is crossed on cancellation
      9bcc0bc [Andrew Or] ExecutorScalingManager -> ExecutorAllocationManager
      2784398 [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics
      5a97d9e [Andrew Or] Log retry attempts in INFO + clean up logging
      2f55c9f [Andrew Or] Do not keep requesting executors even after max attempts
      0acd1cb [Andrew Or] Rewrite timer logic with polling
      b3c7d44 [Andrew Or] Start the retry timer for adding executors at the right time
      9b5f2ea [Andrew Or] Wording changes in comments and log messages
      c2203a5 [Andrew Or] Simplify code to access the scheduler backend
      e519d08 [Andrew Or] Simplify initialization code
      2cc87a7 [Andrew Or] Add retry logic for removing executors
      d0b34a6 [Andrew Or] Add retry logic for adding executors
      9cc4649 [Andrew Or] Simplifying synchronization logic
      67c03c7 [Andrew Or] Correct semantics of adding executors + update comments
      6c48ab0 [Andrew Or] Update synchronization comment
      8901900 [Andrew Or] Simplify remove policy + change the semantics of add policy
      1cc8444 [Andrew Or] Minor wording change
      ae5b64a [Andrew Or] Add synchronization
      20ec6b9 [Andrew Or] First cut implementation of removing executors dynamically
      4077ae2 [Andrew Or] Minor code re-organization
      6f1fa66 [Andrew Or] First cut implementation of adding executors dynamically
      b2e6dcc [Andrew Or] Add skeleton interface for requesting / killing executors
      8d59b37b
    • zsxwing's avatar
      [SPARK-4097] Fix the race condition of 'thread' · e7fd8041
      zsxwing authored
      There is a chance that `thread` is null when calling `thread.interrupt()`.
      
      ```Scala
        override def cancel(): Unit = this.synchronized {
          _cancelled = true
          if (thread != null) {
            thread.interrupt()
          }
        }
      ```
      Should put `thread = null` into a `synchronized` block to fix the race condition.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #2957 from zsxwing/SPARK-4097 and squashes the following commits:
      
      edf0aee [zsxwing] Add comments to explain the lock
      c5cfeca [zsxwing] Fix the race condition of 'thread'
      e7fd8041
    • Andrew Or's avatar
      [SPARK-3822] Executor scaling mechanism for Yarn · 1df05a40
      Andrew Or authored
      This is part of a broader effort to enable dynamic scaling of executors ([SPARK-3174](https://issues.apache.org/jira/browse/SPARK-3174)). This is intended to work alongside SPARK-3795 (#2746), SPARK-3796 and SPARK-3797, but is functionally independently of these other issues.
      
      The logic is built on top of PraveenSeluka's changes at #2798. This is different from the changes there in a few major ways: (1) the mechanism is implemented within the existing scheduler backend framework rather than in new `Actor` classes. This also introduces a parent abstract class `YarnSchedulerBackend` to encapsulate common logic to communicate with the Yarn `ApplicationMaster`. (2) The interface of requesting executors exposed to the `SparkContext` is the same, but the communication between the scheduler backend and the AM uses total number executors desired instead of an incremental number. This is discussed in #2746 and explained in the comments in the code.
      
      I have tested this significantly on a stable Yarn cluster.
      
      ------------
      A remaining task for this issue is to tone down the error messages emitted when an executor is removed.
      Currently, `SparkContext` and its components react as if the executor has failed, resulting in many scary error messages and eventual timeouts. While it's not strictly necessary to fix this as of the first-cut implementation of this mechanism, it would be good to add logic to distinguish this case. I prefer to address this in a separate PR. I have filed a separate JIRA for this task at SPARK-4134.
      
      Author: Andrew Or <andrew@databricks.com>
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #2840 from andrewor14/yarn-scaling-mechanism and squashes the following commits:
      
      485863e [Andrew Or] Minor log message changes
      4920be8 [Andrew Or] Clarify that public API is only for Yarn mode for now
      1c57804 [Andrew Or] Reword a few comments + other review comments
      6321140 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-scaling-mechanism
      02836c0 [Andrew Or] Limit scope of synchronization
      4e2ed7f [Andrew Or] Fix bug: keep track of removed executors properly
      73ade46 [Andrew Or] Wording changes (minor)
      2a7a6da [Andrew Or] Add `sc.killExecutor` as a shorthand (minor)
      665f229 [Andrew Or] Mima excludes
      79aa2df [Andrew Or] Simplify the request interface by asking for a total
      04f625b [Andrew Or] Fix race condition that causes over-allocation of executors
      f4783f8 [Andrew Or] Change the semantics of requesting executors
      005a124 [Andrew Or] Fix tests
      4628b16 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-scaling-mechanism
      db4a679 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-scaling-mechanism
      572f5c5 [Andrew Or] Unused import (minor)
      f30261c [Andrew Or] Kill multiple executors rather than one at a time
      de260d9 [Andrew Or] Simplify by skipping useless null check
      9c52542 [Andrew Or] Simplify by skipping the TaskSchedulerImpl
      97dd1a8 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-scaling-mechanism
      d987b3e [Andrew Or] Move addWebUIFilters to Yarn scheduler backend
      7b76d0a [Andrew Or] Expose mechanism in SparkContext as developer API
      47466cd [Andrew Or] Refactor common Yarn scheduler backend logic
      c4dfaac [Andrew Or] Avoid thrashing when removing executors
      53e8145 [Andrew Or] Start yarn actor early to listen for AM registration message
      bbee669 [Andrew Or] Add mechanism in yarn client mode
      1df05a40
    • Daoyuan Wang's avatar
      [SPARK-4003] [SQL] add 3 types for java SQL context · 35354676
      Daoyuan Wang authored
      In JavaSqlContext, we need to let java program use big decimal, timestamp, date types.
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #2850 from adrian-wang/javacontext and squashes the following commits:
      
      4c4292c [Daoyuan Wang] change underlying type of JavaSchemaRDD as scala
      bb0508f [Daoyuan Wang] add test cases
      3c58b0d [Daoyuan Wang] add 3 types for java SQL context
      35354676
    • Reynold Xin's avatar
      [SPARK-3453] Netty-based BlockTransferService, extracted from Spark core · dff01553
      Reynold Xin authored
      This PR encapsulates #2330, which is itself a continuation of #2240. The first goal of this PR is to provide an alternate, simpler implementation of the ConnectionManager which is based on Netty.
      
      In addition to this goal, however, we want to resolve [SPARK-3796](https://issues.apache.org/jira/browse/SPARK-3796), which calls for a standalone shuffle service which can be integrated into the YARN NodeManager, Standalone Worker, or on its own. This PR makes the first step in this direction by ensuring that the actual Netty service is as small as possible and extracted from Spark core. Given this, we should be able to construct this standalone jar which can be included in other JVMs without incurring significant dependency or runtime issues. The actual work to ensure that such a standalone shuffle service would work in Spark will be left for a future PR, however.
      
      In order to minimize dependencies and allow for the service to be long-running (possibly much longer-running than Spark, and possibly having to support multiple version of Spark simultaneously), the entire service has been ported to Java, where we have full control over the binary compatibility of the components and do not depend on the Scala runtime or version.
      
      These issues: have been addressed by folding in #2330:
      
      SPARK-3453: Refactor Netty module to use BlockTransferService interface
      SPARK-3018: Release all buffers upon task completion/failure
      SPARK-3002: Create a connection pool and reuse clients across different threads
      SPARK-3017: Integration tests and unit tests for connection failures
      SPARK-3049: Make sure client doesn't block when server/connection has error(s)
      SPARK-3502: SO_RCVBUF and SO_SNDBUF should be bootstrap childOption, not option
      SPARK-3503: Disable thread local cache in PooledByteBufAllocator
      
      TODO before mergeable:
      - [x] Implement uploadBlock()
      - [x] Unit tests for RPC side of code
      - [x] Performance testing (see comments [here](https://github.com/apache/spark/pull/2753#issuecomment-59475022))
      - [x] Turn OFF by default (currently on for unit testing)
      
      Author: Reynold Xin <rxin@apache.org>
      Author: Aaron Davidson <aaron@databricks.com>
      Author: cocoatomo <cocoatomo77@gmail.com>
      Author: Patrick Wendell <pwendell@gmail.com>
      Author: Prashant Sharma <prashant.s@imaginea.com>
      Author: Davies Liu <davies.liu@gmail.com>
      Author: Anand Avati <avati@redhat.com>
      
      Closes #2753 from aarondav/netty and squashes the following commits:
      
      cadfd28 [Aaron Davidson] Turn netty off by default
      d7be11b [Aaron Davidson] Turn netty on by default
      4a204b8 [Aaron Davidson] Fail block fetches if client connection fails
      2b0d1c0 [Aaron Davidson] 100ch
      0c5bca2 [Aaron Davidson] Merge branch 'master' of https://github.com/apache/spark into netty
      14e37f7 [Aaron Davidson] Address Reynold's comments
      8dfcceb [Aaron Davidson] Merge branch 'master' of https://github.com/apache/spark into netty
      322dfc1 [Aaron Davidson] Address Reynold's comments, including major rename
      e5675a4 [Aaron Davidson] Fail outstanding RPCs as well
      ccd4959 [Aaron Davidson] Don't throw exception if client immediately fails
      9da0bc1 [Aaron Davidson] Add RPC unit tests
      d236dfd [Aaron Davidson] Remove no-op serializer :)
      7b7a26c [Aaron Davidson] Fix Nio compile issue
      dd420fd [Aaron Davidson] Merge branch 'master' of https://github.com/apache/spark into netty-test
      939f276 [Aaron Davidson] Attempt to make comm. bidirectional
      aa58f67 [cocoatomo] [SPARK-3909][PySpark][Doc] A corrupted format in Sphinx documents and building warnings
      8dc1ded [cocoatomo] [SPARK-3867][PySpark] ./python/run-tests failed when it run with Python 2.6 and unittest2 is not installed
      5b5dbe6 [Prashant Sharma] [SPARK-2924] Required by scala 2.11, only one fun/ctor amongst overriden alternatives, can have default argument(s).
      2c5d9dc [Patrick Wendell] HOTFIX: Fix build issue with Akka 2.3.4 upgrade.
      020691e [Davies Liu] [SPARK-3886] [PySpark] use AutoBatchedSerializer by default
      ae4083a [Anand Avati] [SPARK-2805] Upgrade Akka to 2.3.4
      29c6dcf [Aaron Davidson] [SPARK-3453] Netty-based BlockTransferService, extracted from Spark core
      f7e7568 [Reynold Xin] Fixed spark.shuffle.io.receiveBuffer setting.
      5d98ce3 [Reynold Xin] Flip buffer.
      f6c220d [Reynold Xin] Merge with latest master.
      407e59a [Reynold Xin] Fix style violation.
      a0518c7 [Reynold Xin] Implemented block uploads.
      4b18db2 [Reynold Xin] Copy the buffer in fetchBlockSync.
      bec4ea2 [Reynold Xin] Removed OIO and added num threads settings.
      1bdd7ee [Reynold Xin] Fixed tests.
      d68f328 [Reynold Xin] Logging close() in case close() fails.
      f63fb4c [Reynold Xin] Add more debug message.
      6afc435 [Reynold Xin] Added logging.
      c066309 [Reynold Xin] Implement java.io.Closeable interface.
      519d64d [Reynold Xin] Mark private package visibility and MimaExcludes.
      f0a16e9 [Reynold Xin] Fixed test hanging.
      14323a5 [Reynold Xin] Removed BlockManager.getLocalShuffleFromDisk.
      b2f3281 [Reynold Xin] Added connection pooling.
      d23ed7b [Reynold Xin] Incorporated feedback from Norman: - use same pool for boss and worker - remove ioratio - disable caching of byte buf allocator - childoption sendbuf/receivebuf - fire exception through pipeline
      9e0cb87 [Reynold Xin] Fixed BlockClientHandlerSuite
      5cd33d7 [Reynold Xin] Fixed style violation.
      cb589ec [Reynold Xin] Added more test cases covering cleanup when fault happens in ShuffleBlockFetcherIteratorSuite
      1be4e8e [Reynold Xin] Shorten NioManagedBuffer and NettyManagedBuffer class names.
      108c9ed [Reynold Xin] Forgot to add TestSerializer to the commit list.
      b5c8d1f [Reynold Xin] Fixed ShuffleBlockFetcherIteratorSuite.
      064747b [Reynold Xin] Reference count buffers and clean them up properly.
      2b44cf1 [Reynold Xin] Added more documentation.
      1760d32 [Reynold Xin] Use Epoll.isAvailable in BlockServer as well.
      165eab1 [Reynold Xin] [SPARK-3453] Refactor Netty module to use BlockTransferService.
      dff01553
    • DB Tsai's avatar
      [SPARK-4129][MLlib] Performance tuning in MultivariateOnlineSummarizer · 51ce9973
      DB Tsai authored
      In MultivariateOnlineSummarizer, breeze's activeIterator is used
      to loop through the nonZero elements in the vector. However,
      activeIterator doesn't perform well due to lots of overhead.
      In this PR, native while loop is used for both DenseVector and SparseVector.
      
      The benchmark result with 20 executors using mnist8m dataset:
      Before:
      DenseVector: 48.2 seconds
      SparseVector: 16.3 seconds
      
      After:
      DenseVector: 17.8 seconds
      SparseVector: 11.2 seconds
      
      Since MultivariateOnlineSummarizer is used in several places,
      the overall performance gain in mllib library will be significant with this PR.
      
      Author: DB Tsai <dbtsai@alpinenow.com>
      
      Closes #2992 from dbtsai/SPARK-4129 and squashes the following commits:
      
      b99db6c [DB Tsai] fixed java.lang.ArrayIndexOutOfBoundsException
      2b5e882 [DB Tsai] small refactoring
      ebe3e74 [DB Tsai] First commit
      51ce9973
    • Xiangrui Meng's avatar
      [FIX] disable benchmark code · 1559495d
      Xiangrui Meng authored
      I forgot to disable the benchmark code in #2937, which increased the Jenkins build time by couple minutes.
      
      aarondav
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #2990 from mengxr/disable-benchmark and squashes the following commits:
      
      c58f070 [Xiangrui Meng] disable benchmark code
      1559495d
Loading