  1. May 15, 2015
    • [SPARK-7503] [YARN] Resources in .sparkStaging directory can't be cleaned up on error · c64ff803
      Kousuke Saruta authored
      When we run applications on YARN in cluster mode, resources uploaded to the .sparkStaging directory are not cleaned up when uploading the local resources fails.
      
      You can see this issue by running the following command.
      ```
      bin/spark-submit --master yarn --deploy-mode cluster --class <someClassName> <non-existing-jar>
      ```
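      The fix wraps the submission path so the staging directory is deleted when preparing or uploading local resources fails. A hedged sketch of that pattern using a generic helper; `submitWithCleanup` and `stagingDirPath` are illustrative names, not the actual Client code:

      ```scala
      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.fs.{FileSystem, Path}

      // Hedged sketch: run the upload/submit block and, on any failure,
      // best-effort delete the .sparkStaging directory before rethrowing.
      def submitWithCleanup[T](stagingDirPath: Path, hadoopConf: Configuration)(submit: => T): T = {
        try {
          submit // upload local resources and submit the YARN application
        } catch {
          case e: Throwable =>
            val fs = FileSystem.get(hadoopConf)
            if (fs.exists(stagingDirPath)) fs.delete(stagingDirPath, true)
            throw e
        }
      }
      ```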
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #6026 from sarutak/delete-uploaded-resources-on-error and squashes the following commits:
      
      caef9f4 [Kousuke Saruta] Fixed style
      882f921 [Kousuke Saruta] Wrapped Client#submitApplication with try/catch blocks in order to delete resources on error
      1786ca4 [Kousuke Saruta] Merge branch 'master' of https://github.com/apache/spark into delete-uploaded-resources-on-error
      f61071b [Kousuke Saruta] Fixed cleanup problem
      c64ff803
    • [SPARK-7591] [SQL] Partitioning support API tweaks · fdf5bba3
      Cheng Lian authored
      Please see [SPARK-7591] [1] for the details.
      
      /cc rxin marmbrus yhuai
      
      [1]: https://issues.apache.org/jira/browse/SPARK-7591
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #6150 from liancheng/spark-7591 and squashes the following commits:
      
      af422e7 [Cheng Lian] Addresses @rxin's comments
      37d1738 [Cheng Lian] Fixes HadoopFsRelation partition columns initialization
      2fc680a [Cheng Lian] Fixes Scala style issue
      189ad23 [Cheng Lian] Removes HadoopFsRelation constructor arguments
      522c24e [Cheng Lian] Adds OutputWriterFactory
      047d40d [Cheng Lian] Renames FSBased* to HadoopFs*, also renamed FSBasedParquetRelation back to ParquetRelation2
      fdf5bba3
    • [SPARK-6258] [MLLIB] GaussianMixture Python API parity check · 94761485
      Yanbo Liang authored
      Implement the Python API for the major disparities in the GaussianMixture clustering algorithm between Scala & Python:
      ```scala
      GaussianMixture
          setInitialModel
      GaussianMixtureModel
          k
      ```
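      For reference, a hedged sketch of the Scala-side API that the Python wrapper now mirrors (data and parameter values are made up; assumes an existing SparkContext `sc`):

      ```scala
      import org.apache.spark.mllib.clustering.{GaussianMixture, GaussianMixtureModel}
      import org.apache.spark.mllib.linalg.Vectors

      val data = sc.parallelize(Seq(
        Vectors.dense(-0.1, -0.05), Vectors.dense(-0.01, -0.1),
        Vectors.dense(0.9, 0.8), Vectors.dense(0.75, 0.935)))

      // Train once, then seed a second run via setInitialModel; model.k exposes
      // the number of Gaussians -- the two pieces the Python API was missing.
      val initial: GaussianMixtureModel = new GaussianMixture().setK(2).run(data)
      val model = new GaussianMixture().setK(2).setInitialModel(initial).run(data)
      println(model.k)
      ```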
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #6087 from yanboliang/spark-6258 and squashes the following commits:
      
      b3af21c [Yanbo Liang] fix typo
      2b645c1 [Yanbo Liang] fix doc
      638b4b7 [Yanbo Liang] address comments
      b5bcade [Yanbo Liang] GaussianMixture Python API parity check
      94761485
    • [SPARK-7650] [STREAMING] [WEBUI] Move streaming css and js files to the streaming project · cf842d42
      zsxwing authored
      cc tdas
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #6160 from zsxwing/SPARK-7650 and squashes the following commits:
      
      fe6ae15 [zsxwing] Fix the import order
      a4ffd99 [zsxwing] Merge branch 'master' into SPARK-7650
      dc402b6 [zsxwing] Move streaming css and js files to the streaming project
      cf842d42
    • [CORE] Remove unreachable Heartbeat message from Worker · daf4ae72
      Kan Zhang authored
      It doesn't look to me like Heartbeat is sent to Worker by anyone.
      
      Author: Kan Zhang <kzhang@apache.org>
      
      Closes #6163 from kanzhang/deadwood and squashes the following commits:
      
      56be118 [Kan Zhang] [core] Remove unreachable Heartbeat message from Worker
      daf4ae72
  2. May 14, 2015
    • [SQL] When creating partitioned table scan, explicitly create UnionRDD. · e8f0e016
      Yin Huai authored
      Otherwise, it will cause a stack overflow when there are many partitions.
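      The underlying pitfall: unioning RDDs pairwise builds a deeply nested lineage, one level per partition, which blows the stack; a single UnionRDD (or SparkContext.union) keeps it flat. A hedged illustration, not the actual scan code; assumes an existing SparkContext `sc`:

      ```scala
      import org.apache.spark.rdd.{RDD, UnionRDD}

      // `partitionRDDs` stands in for the per-partition RDDs produced by the scan.
      val partitionRDDs: Seq[RDD[Int]] = (1 to 5000).map(i => sc.parallelize(Seq(i)))

      // Pairwise reduce nests ~5000 union nodes and can overflow the stack:
      // val unioned = partitionRDDs.reduce(_ union _)

      // Creating one UnionRDD (or calling sc.union) keeps the lineage flat:
      val unioned: RDD[Int] = new UnionRDD(sc, partitionRDDs)
      ```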
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #6162 from yhuai/partitionUnionedRDD and squashes the following commits:
      
      fa016d8 [Yin Huai] Explicitly create UnionRDD.
      e8f0e016
    • [SPARK-7098][SQL] Make the WHERE clause with timestamp show consistent result · f9705d46
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-7098
      
      The WHERE clause with timestamp shows inconsistent results. This PR fixes it.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #5682 from viirya/consistent_timestamp and squashes the following commits:
      
      171445a [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into consistent_timestamp
      4e98520 [Liang-Chi Hsieh] Make the WHERE clause with timestamp show consistent result.
      f9705d46
    • [SPARK-7548] [SQL] Add explode function for DataFrames · 6d0633e3
      Michael Armbrust authored
      Add an `explode` function for DataFrames and modify the analyzer so that single table generating functions can be present in a select clause along with other expressions. There are currently the following restrictions:
       - only top level TGFs are allowed (i.e. no `select(explode('list) + 1)`)
       - only one may be present in a single select to avoid potentially confusing implicit Cartesian products.
      
      TODO:
       - [ ] Python
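       A hedged example of the usage this enables; the DataFrame `df` and its `id`/`words` columns are made-up names:

       ```scala
       import org.apache.spark.sql.functions.explode

       // `words` is an array<string> column; explode produces one row per element.
       val exploded = df.select(df("id"), explode(df("words")).as("word"))

       // Per the restrictions above, explode(df("words")) + 1 or two generators
       // in the same select are not allowed.
       ```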
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #6107 from marmbrus/explodeFunction and squashes the following commits:
      
      7ee2c87 [Michael Armbrust] whitespace
      6f80ba3 [Michael Armbrust] Update dataframe.py
      c176c89 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into explodeFunction
      81b5da3 [Michael Armbrust] style
      d3faa05 [Michael Armbrust] fix self join case
      f9e1e3e [Michael Armbrust] fix python, add since
      4f0d0a9 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into explodeFunction
      e710fe4 [Michael Armbrust] add java and python
      52ca0dc [Michael Armbrust] [SPARK-7548][SQL] Add explode function for dataframes.
      6d0633e3
    • [SPARK-7619] [PYTHON] fix docstring signature · 48fc38f5
      Xiangrui Meng authored
      Just realized that we need `\` at the end of the docstring. brkyvz
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6161 from mengxr/SPARK-7619 and squashes the following commits:
      
      e44495f [Xiangrui Meng] fix docstring signature
      48fc38f5
    • [SPARK-7648] [MLLIB] Add weights and intercept to GLM wrappers in spark.ml · 723853ed
      Xiangrui Meng authored
      Otherwise, users can only use `transform` on the models. brkyvz
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6156 from mengxr/SPARK-7647 and squashes the following commits:
      
      1ae3d2d [Xiangrui Meng] add weights and intercept to LogisticRegression in Python
      f49eb46 [Xiangrui Meng] add weights and intercept to LinearRegressionModel
      723853ed
    • [SPARK-7645] [STREAMING] [WEBUI] Show milliseconds in the UI if the batch interval < 1 second · b208f998
      zsxwing authored
      I also updated the summary of the Streaming page.
      
      ![screen shot 2015-05-14 at 11 52 59 am](https://cloud.githubusercontent.com/assets/1000778/7640103/13cdf68e-fa36-11e4-84ec-e2a3954f4319.png)
      ![screen shot 2015-05-14 at 12 39 33 pm](https://cloud.githubusercontent.com/assets/1000778/7640151/4cc066ac-fa36-11e4-8494-2821d6a6f17c.png)
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #6154 from zsxwing/SPARK-7645 and squashes the following commits:
      
      5db6ca1 [zsxwing] Add UIUtils.formatBatchTime
      e4802df [zsxwing] Show milliseconds in the UI if the batch interval < 1 second
      b208f998
    • [SPARK-7649] [STREAMING] [WEBUI] Use window.localStorage to store the status rather than the url · 0a317c12
      zsxwing authored
      Use window.localStorage to store the status rather than the url so that the url won't be changed.
      
      cc tdas
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #6158 from zsxwing/SPARK-7649 and squashes the following commits:
      
      3c56fef [zsxwing] Use window.localStorage to store the status rather than the url
      0a317c12
    • [SPARK-7643] [UI] use the correct size in RDDPage for storage info and partitions · 57ed16cf
      Xiangrui Meng authored
      `dataDistribution` and `partitions` are `Option[Seq[_]]`. andrewor14 squito
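      The pitfall being fixed, as a hedged illustration: calling `.size` on the `Option` itself (always 0 or 1) instead of on the wrapped sequence.

      ```scala
      val partitions: Option[Seq[String]] = Some(Seq("p0", "p1", "p2"))

      val wrong = partitions.size                      // 1 -- size of the Option
      val right = partitions.map(_.size).getOrElse(0)  // 3 -- size of the Seq
      ```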
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6157 from mengxr/SPARK-7643 and squashes the following commits:
      
      99fe8a4 [Xiangrui Meng] use the correct size in RDDPage for storage info and partitions
      57ed16cf
    • [SPARK-7598] [DEPLOY] Add aliveWorkers metrics in Master · 93dbb3ad
      Rex Xiong authored
      In a Spark Standalone setup, when some workers are DEAD, they stay in the master's worker list for a while.
      The master.workers metric only shows the total number of workers; we need to monitor how many workers are actually ALIVE to ensure the cluster is healthy.
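      A hedged sketch of how such a gauge is typically registered with the Codahale metrics registry; `WorkerInfo` and the liveness check are illustrative, not the exact MasterSource code:

      ```scala
      import com.codahale.metrics.{Gauge, MetricRegistry}

      case class WorkerInfo(id: String, isAlive: Boolean)
      val workers = Seq(WorkerInfo("w1", isAlive = true), WorkerInfo("w2", isAlive = false))

      val metricRegistry = new MetricRegistry()
      metricRegistry.register(MetricRegistry.name("master", "aliveWorkers"), new Gauge[Int] {
        // Count only workers that are actually alive, not the whole list.
        override def getValue: Int = workers.count(_.isAlive)
      })
      ```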
      
      Author: Rex Xiong <pengx@microsoft.com>
      
      Closes #6117 from twilightgod/add-aliveWorker-metrics and squashes the following commits:
      
      6be69a5 [Rex Xiong] Fix comment for aliveWorkers metrics
      a882f39 [Rex Xiong] Fix style for aliveWorkers metrics
      38ce955 [Rex Xiong] Add aliveWorkers metrics in Master
      93dbb3ad
    • Make SPARK prefix a variable · 11a1a135
      tedyu authored
      Author: tedyu <yuzhihong@gmail.com>
      
      Closes #6153 from ted-yu/master and squashes the following commits:
      
      4e0bac5 [tedyu] Use JIRA_PROJECT_NAME as variable name
      ab982aa [tedyu] Make SPARK prefix a variable
      11a1a135
    • [SPARK-7278] [PySpark] DateType should find datetime.datetime acceptable · 5d7d4f88
      ksonj authored
      DateType should not be restricted to `datetime.date` but accept `datetime.datetime` objects as well. Could someone with a little more insight verify this?
      
      Author: ksonj <kson@siberie.de>
      
      Closes #6057 from ksonj/dates and squashes the following commits:
      
      68a158e [ksonj] DateType should find datetime.datetime acceptable too
      5d7d4f88
    • [SQL][minor] rename apply for QueryPlanner · f2cd00be
      Wenchen Fan authored
      A follow-up of https://github.com/apache/spark/pull/5624
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #6142 from cloud-fan/tmp and squashes the following commits:
      
      971a92b [Wenchen Fan] use plan instead of execute
      24c5ffe [Wenchen Fan] rename apply
      f2cd00be
    • [SPARK-7249] Updated Hadoop dependencies due to inconsistency in the versions · 7fb715de
      FavioVazquez authored
      Updated Hadoop dependencies due to inconsistency in the versions. Now the global properties are the ones used by the hadoop-2.2 profile, and the profile was set to empty but kept for backwards compatibility reasons.
      
      Changes proposed by vanzin resulting from the previous pull request https://github.com/apache/spark/pull/5783, which did not fix the problem correctly.
      
      Please let me know if this is the correct way of doing this; vanzin's comments are in the pull request mentioned above.
      
      Author: FavioVazquez <favio.vazquezp@gmail.com>
      
      Closes #5786 from FavioVazquez/update-hadoop-dependencies and squashes the following commits:
      
      11670e5 [FavioVazquez] - Added missing instance of -Phadoop-2.2 in create-release.sh
      379f50d [FavioVazquez] - Added instances of -Phadoop-2.2 in create-release.sh, run-tests, scalastyle and building-spark.md - Reconstructed docs to not ask users to rely on default behavior
      3f9249d [FavioVazquez] Merge branch 'master' of https://github.com/apache/spark into update-hadoop-dependencies
      31bdafa [FavioVazquez] - Added missing instances in -Phadoop-1 in create-release.sh, run-tests and in the building-spark documentation
      cbb93e8 [FavioVazquez] - Added comment related to SPARK-3710 about  hadoop-yarn-server-tests in Hadoop 2.2 that fails to pull some needed dependencies
      83dc332 [FavioVazquez] - Cleaned up the main POM concerning the yarn profile - Erased hadoop-2.2 profile from yarn/pom.xml and its content was integrated into yarn/pom.xml
      93f7624 [FavioVazquez] - Deleted unnecessary comments and <activation> tag on the YARN profile in the main POM
      668d126 [FavioVazquez] - Moved <dependencies> <activation> and <properties> sections of the hadoop-2.2 profile in the YARN POM to the YARN profile in the root POM - Erased unnecessary hadoop-2.2 profile from the YARN POM
      fda6a51 [FavioVazquez] - Updated hadoop1 releases in create-release.sh  due to changes in the default hadoop version set - Erased unnecessary instance of -Dyarn.version=2.2.0 in create-release.sh - Prettify comment in yarn/pom.xml
      0470587 [FavioVazquez] - Erased unnecessary instance of -Phadoop-2.2 -Dhadoop.version=2.2.0 in create-release.sh - Updated how the releases are made in the create-release.sh no that the default hadoop version is the 2.2.0 - Erased unnecessary instance of -Phadoop-2.2 -Dhadoop.version=2.2.0 in scalastyle - Erased unnecessary instance of -Phadoop-2.2 -Dhadoop.version=2.2.0 in run-tests - Better example given in the hadoop-third-party-distributions.md now that the default hadoop version is 2.2.0
      a650779 [FavioVazquez] - Default value of avro.mapred.classifier has been set to hadoop2 in pom.xml - Cleaned up hadoop-2.3 and 2.4 profiles due to change in the default set in avro.mapred.classifier in pom.xml
      199f40b [FavioVazquez] - Erased unnecessary CDH5-specific note in docs/building-spark.md - Remove example of instance -Phadoop-2.2 -Dhadoop.version=2.2.0 in docs/building-spark.md - Enabled hadoop-2.2 profile when the Hadoop version is 2.2.0, which is now the default .Added comment in the yarn/pom.xml to specify that.
      88a8b88 [FavioVazquez] - Simplified Hadoop profiles due to new setting of global properties in the pom.xml file - Added comment to specify that the hadoop-2.2 profile is now the default hadoop profile in the pom.xml file - Erased hadoop-2.2 from related hadoop profiles now that is a no-op in the make-distribution.sh file
      70b8344 [FavioVazquez] - Fixed typo in the make-distribution.sh file and added hadoop-1 in the Related profiles
      287fa2f [FavioVazquez] - Updated documentation about specifying the hadoop version in building-spark. Now is clear that Spark will build against Hadoop 2.2.0 by default. - Added Cloudera CDH 5.3.3 without MapReduce example in the building-spark doc.
      1354292 [FavioVazquez] - Fixed hadoop-1 version to match jenkins build profile in hadoop1.0 tests and documentation
      6b4bfaf [FavioVazquez] - Cleanup in hadoop-2.x profiles since they contained mostly redundant stuff.
      7e9955d [FavioVazquez] - Updated Hadoop dependencies due to inconsistency in the versions. Now the global properties are the ones used by the hadoop-2.2 profile, and the profile was set to empty but kept for backwards compatibility reasons
      660decc [FavioVazquez] - Updated Hadoop dependencies due to inconsistency in the versions. Now the global properties are the ones used by the hadoop-2.2 profile, and the profile was set to empty but kept for backwards compatibility reasons
      ec91ce3 [FavioVazquez] - Updated protobuf-java version of com.google.protobuf dependancy to fix blocking error when connecting to HDFS via the Hadoop Cloudera HDFS CDH5 (fix for 2.5.0-cdh5.3.3 version)
      7fb715de
    • [SPARK-7568] [ML] ml.LogisticRegression doesn't output the right prediction · c1080b6f
      DB Tsai authored
      The difference is because we previously didn't fit the intercept in Spark 1.3. Here, we change the input `String` so that the probability of instance 6 can be classified as `1.0` without any ambiguity.
      
      with lambda = 0.001 in current LOR implementation, the prediction is
      ```
      (4, spark i j k) --> prob=[0.1596407738787411,0.8403592261212589], prediction=1.0
      (5, l m n) --> prob=[0.8378325685476612,0.16216743145233883], prediction=0.0
      (6, spark hadoop spark) --> prob=[0.0692663313297627,0.9307336686702373], prediction=1.0
      (7, apache hadoop) --> prob=[0.9821575333444208,0.01784246665557917], prediction=0.0
      ```
      and the training accuracy is
      ```
      (0, a b c d e spark) --> prob=[0.0021342419881406746,0.9978657580118594], prediction=1.0
      (1, b d) --> prob=[0.9959176174854043,0.004082382514595685], prediction=0.0
      (2, spark f g h) --> prob=[0.0014541569986711233,0.9985458430013289], prediction=1.0
      (3, hadoop mapreduce) --> prob=[0.9982978367343561,0.0017021632656438518], prediction=0.0
      ```
      
      Author: DB Tsai <dbt@netflix.com>
      
      Closes #6109 from dbtsai/lor-example and squashes the following commits:
      
      ac63ce4 [DB Tsai] first commit
      c1080b6f
    • [SPARK-7407] [MLLIB] use uid + name to identify parameters · 1b8625f4
      Xiangrui Meng authored
      A param instance is strongly attached to a parent in the current implementation. So if we make a copy of an estimator or a transformer in pipelines and other meta-algorithms, it becomes error-prone to copy the params to the copied instances. In this PR, a param is identified by its parent's UID and the param name, so it becomes loosely attached to its parent and all its derivatives. The UID is preserved during copying or fitting. All components now have a default constructor and a constructor that takes a UID as input. I keep the constructors for Param in this PR to reduce the amount of diff and made `parent` a mutable field.
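      A hedged, simplified sketch of the identity scheme described above (not the real `org.apache.spark.ml.param.Param`):

      ```scala
      // A param is identified by its parent's UID plus its own name, not by an
      // object reference, so copies of an estimator keep matching params.
      case class Param[T](parent: String, name: String, doc: String) {
        override def toString: String = s"${parent}__$name"
      }

      val lrUid = "logreg_3f8a"  // illustrative UID
      val regParam = Param[Double](lrUid, "regParam", "regularization parameter")
      val onCopy   = Param[Double](lrUid, "regParam", "regularization parameter")
      assert(regParam == onCopy)  // same parent UID + name => same param
      ```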
      
      This PR still needs some clean-ups, and there are several spark.ml PRs pending. I'll try to get them merged first and then update this PR.
      
      jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6019 from mengxr/SPARK-7407 and squashes the following commits:
      
      c4c8120 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7407
      520f0a2 [Xiangrui Meng] address comments
      2569168 [Xiangrui Meng] fix tests
      873caca [Xiangrui Meng] fix tests in OneVsRest; fix a racing condition in shouldOwn
      409ea08 [Xiangrui Meng] minor updates
      83a163c [Xiangrui Meng] update JavaDeveloperApiExample
      5db5325 [Xiangrui Meng] update OneVsRest
      7bde7ae [Xiangrui Meng] merge master
      697fdf9 [Xiangrui Meng] update Bucketizer
      7b4f6c2 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7407
      629d402 [Xiangrui Meng] fix LRSuite
      154516f [Xiangrui Meng] merge master
      aa4a611 [Xiangrui Meng] fix examples/compile
      a4794dd [Xiangrui Meng] change Param to use  to reduce the size of diff
      fdbc415 [Xiangrui Meng] all tests passed
      c255f17 [Xiangrui Meng] fix tests in ParamsSuite
      818e1db [Xiangrui Meng] merge master
      e1160cf [Xiangrui Meng] fix tests
      fbc39f0 [Xiangrui Meng] pass test:compile
      108937e [Xiangrui Meng] pass compile
      8726d39 [Xiangrui Meng] use parent uid in Param
      eaeed35 [Xiangrui Meng] update Identifiable
      1b8625f4
    • [SPARK-7595] [SQL] Window will cause resolve failed with self join · 13e652b6
      linweizhong authored
      for example:
      table: src(key string, value string)
      sql: with v1 as(select key, count(value) over (partition by key) cnt_val from src), v2 as(select v1.key, v1_lag.cnt_val from v1, v1 v1_lag where v1.key = v1_lag.key) select * from v2 limit 5;
      then analysis will fail when resolving conflicting references in the Join:
      'Limit 5
       'Project [*]
        'Subquery v2
         'Project ['v1.key,'v1_lag.cnt_val]
          'Filter ('v1.key = 'v1_lag.key)
           'Join Inner, None
            Subquery v1
             Project [key#95,cnt_val#94L]
              Window [key#95,value#96], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCount(value#96) WindowSpecDefinition [key#95], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS cnt_val#94L], WindowSpecDefinition [key#95], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
               Project [key#95,value#96]
                MetastoreRelation default, src, None
            Subquery v1_lag
             Subquery v1
              Project [key#97,cnt_val#94L]
               Window [key#97,value#98], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCount(value#98) WindowSpecDefinition [key#97], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS cnt_val#94L], WindowSpecDefinition [key#97], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
                Project [key#97,value#98]
                 MetastoreRelation default, src, None
      
      Conflicting attributes: cnt_val#94L
      
      Author: linweizhong <linweizhong@huawei.com>
      
      Closes #6114 from Sephiroth-Lin/spark-7595 and squashes the following commits:
      
      f8f2637 [linweizhong] Add unit test
      dfe9169 [linweizhong] Handle windowExpression with self join
      13e652b6
    • [SPARK-7620] [ML] [MLLIB] Removed calling size, length in while condition to avoid extra JVM call · d3db2fd6
      DB Tsai authored
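      The pattern applied throughout, as a hedged before/after sketch:

      ```scala
      val values: Seq[Double] = Seq.fill(1000)(math.random)

      // Before: values.size is re-evaluated on every check of the loop condition.
      var i = 0
      while (i < values.size) { i += 1 }

      // After: evaluate the size once, outside the loop.
      val n = values.size
      var j = 0
      while (j < n) { j += 1 }
      ```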
      Author: DB Tsai <dbt@netflix.com>
      
      Closes #6137 from dbtsai/clean and squashes the following commits:
      
      185816d [DB Tsai] fix compilication issue
      f418d08 [DB Tsai] first commit
      d3db2fd6
  3. May 13, 2015
    • [SPARK-7612] [MLLIB] update NB training to use mllib's BLAS · d5f18de1
      Xiangrui Meng authored
      This is similar to the changes to k-means, which gives us better control on the performance. dbtsai
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6128 from mengxr/SPARK-7612 and squashes the following commits:
      
      b5c24c5 [Xiangrui Meng] merge master
      a90e3ec [Xiangrui Meng] update NB training to use mllib's BLAS
      d5f18de1
    • [HOT FIX #6125] Do not wait for all stages to start rendering · 3113da9c
      Andrew Or authored
      zsxwing
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #6138 from andrewor14/dag-viz-clean-properly and squashes the following commits:
      
      19d4e98 [Andrew Or] Add synchronize
      02542d6 [Andrew Or] Rename overloaded variable
      d11bee1 [Andrew Or] Don't wait until all stages have started before rendering
      3113da9c
    • [HOTFIX] Use 'new Job' in fsBasedParquet.scala · 728af88c
      zsxwing authored
      Same issue as #6095
      
      cc liancheng
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #6136 from zsxwing/hotfix and squashes the following commits:
      
      4beea54 [zsxwing] Use 'new Job' in fsBasedParquet.scala
      728af88c
    • [HOTFIX] Bug in merge script · 32e27df4
      Patrick Wendell authored
      32e27df4
    • [SPARK-6752] [STREAMING] [REVISED] Allow StreamingContext to be recreated from... · bce00dac
      Tathagata Das authored
      [SPARK-6752] [STREAMING] [REVISED] Allow StreamingContext to be recreated from checkpoint and existing SparkContext
      
      This is a revision of the earlier version (see #5773) that passed the active SparkContext explicitly through a new set of Java and Scala APIs. The drawbacks are:
      
      * Hard to implement in python.
      * New API introduced. This is even more confusing since we are introducing getActiveOrCreate in SPARK-7553
      
      Furthermore, there is now a direct way to get an existing active SparkContext or create a new one - SparkContext.getOrCreate(conf). It's better to use this to get the SparkContext rather than add a new API to explicitly pass the context.
      
      So in this PR I have
      * Removed the new versions of StreamingContext.getOrCreate() which took SparkContext
      * Added the ability to pick up existing SparkContext when the StreamingContext tries to create a SparkContext.
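      A hedged sketch of the resulting usage; the checkpoint path, app name, and batch interval are made up:

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      // When the StreamingContext has to build a SparkContext (e.g. when
      // recreating from a checkpoint), an already-active SparkContext is
      // picked up rather than a second one being created.
      val sc = SparkContext.getOrCreate(new SparkConf().setAppName("demo"))

      def createStreamingContext(): StreamingContext = {
        val ssc = new StreamingContext(sc, Seconds(1))
        ssc.checkpoint("/tmp/checkpoint")  // illustrative path
        ssc
      }

      val ssc = StreamingContext.getOrCreate("/tmp/checkpoint", createStreamingContext _)
      ```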
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #6096 from tdas/SPARK-6752 and squashes the following commits:
      
      53f4b2d [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-6752
      f024b77 [Tathagata Das] Removed extra API and used SparkContext.getOrCreate
      bce00dac
    • [SPARK-7601] [SQL] Support Insert into JDBC Datasource · 59aaa1da
      Venkata Ramana Gollamudi authored
      Supported InsertableRelation for JDBC Datasource JDBCRelation.
      Example usage:
      sqlContext.sql(
            s"""
              |CREATE TEMPORARY TABLE testram1
              |USING org.apache.spark.sql.jdbc
              |OPTIONS (url '$url', dbtable 'testram1', user 'xx', password 'xx', driver 'com.h2.Driver')
            """.stripMargin.replaceAll("\n", " "))
      
      sqlContext.sql("insert into table testram1 select * from testsrc")
      sqlContext.sql("insert overwrite table testram1 select * from testsrc")
      
      Author: Venkata Ramana Gollamudi <ramana.gollamudi@huawei.com>
      
      Closes #6121 from gvramana/JDBCDatasource_insert and squashes the following commits:
      
      f3fb5f1 [Venkata Ramana Gollamudi] Support for JDBC Datasource InsertableRelation
      59aaa1da
    • [SPARK-7081] Faster sort-based shuffle path using binary processing cache-aware sort · 73bed408
      Josh Rosen authored
      This patch introduces a new shuffle manager that enhances the existing sort-based shuffle with a new cache-friendly sort algorithm that operates directly on binary data. The goals of this patch are to lower memory usage and Java object overheads during shuffle and to speed up sorting. It also lays groundwork for follow-up patches that will enable end-to-end processing of serialized records.
      
      The new shuffle manager, `UnsafeShuffleManager`, can be enabled by setting `spark.shuffle.manager=tungsten-sort` in SparkConf.
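      A hedged snippet of opting in (the app name is illustrative):

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}

      // Enable the new shuffle manager; when its preconditions (listed below)
      // are not met it delegates back to the sort-based shuffle.
      val conf = new SparkConf()
        .setAppName("tungsten-sort-demo")
        .set("spark.shuffle.manager", "tungsten-sort")
      val sc = new SparkContext(conf)
      ```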
      
      The new shuffle manager uses directly-managed memory to implement several performance optimizations for certain types of shuffles. In cases where the new performance optimizations cannot be applied, the new shuffle manager delegates to SortShuffleManager to handle those shuffles.
      
      UnsafeShuffleManager's optimizations will apply when _all_ of the following conditions hold:
      
       - The shuffle dependency specifies no aggregation or output ordering.
       - The shuffle serializer supports relocation of serialized values (this is currently supported
         by KryoSerializer and Spark SQL's custom serializers).
       - The shuffle produces fewer than 16777216 output partitions.
       - No individual record is larger than 128 MB when serialized.
      
      In addition, extra spill-merging optimizations are automatically applied when the shuffle compression codec supports concatenation of serialized streams. This is currently supported by Spark's LZF serializer.
      
      At a high-level, UnsafeShuffleManager's design is similar to Spark's existing SortShuffleManager. In sort-based shuffle, incoming records are sorted according to their target partition ids, then written to a single map output file. Reducers fetch contiguous regions of this file in order to read their portion of the map output. In cases where the map output data is too large to fit in memory, sorted subsets of the output are spilled to disk and those on-disk files are merged to produce the final output file.
      
      UnsafeShuffleManager optimizes this process in several ways:
      
       - Its sort operates on serialized binary data rather than Java objects, which reduces memory consumption and GC overheads. This optimization requires the record serializer to have certain properties to allow serialized records to be re-ordered without requiring deserialization.  See SPARK-4550, where this optimization was first proposed and implemented, for more details.
      
       - It uses a specialized cache-efficient sorter (UnsafeShuffleExternalSorter) that sorts arrays of compressed record pointers and partition ids. By using only 8 bytes of space per record in the sorting array, this fits more of the array into cache.
      
       - The spill merging procedure operates on blocks of serialized records that belong to the same partition and does not need to deserialize records during the merge.
      
       - When the spill compression codec supports concatenation of compressed data, the spill merge simply concatenates the serialized and compressed spill partitions to produce the final output partition.  This allows efficient data copying methods, like NIO's `transferTo`, to be used and avoids the need to allocate decompression or copying buffers during the merge.
      
      The shuffle read path is unchanged.
      
      This patch is similar to [SPARK-4550](http://issues.apache.org/jira/browse/SPARK-4550) / #4450 but uses a slightly different implementation. The `unsafe`-based implementation featured in this patch lays the groundwork for followup patches that will enable sorting to operate on serialized data pages that will be prepared by Spark SQL's new `unsafe` operators (such as the new aggregation operator introduced in #5725).
      
      ### Future work
      
      There are several tasks that build upon this patch, which will be left to future work:
      
      - [SPARK-7271](https://issues.apache.org/jira/browse/SPARK-7271) Redesign / extend the shuffle interfaces to accept binary data as input. The goal here is to let us bypass serialization steps in cases where the sort input is produced by an operator that operates directly on binary data.
      - Extension / redesign of the `Serializer` API. We can add new methods which allow serializers to determine the size requirements for serializing objects and for serializing objects directly to a specified memory address (similar to how `UnsafeRowConverter` works in Spark SQL).
      
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #5868 from JoshRosen/unsafe-sort and squashes the following commits:
      
      ef0a86e [Josh Rosen] Fix scalastyle errors
      7610f2f [Josh Rosen] Add tests for proper cleanup of shuffle data.
      d494ffe [Josh Rosen] Fix deserialization of JavaSerializer instances.
      52a9981 [Josh Rosen] Fix some bugs in the address packing code.
      51812a7 [Josh Rosen] Change shuffle manager sort name to tungsten-sort
      4023fa4 [Josh Rosen] Add @Private annotation to some Java classes.
      de40b9d [Josh Rosen] More comments to try to explain metrics code
      df07699 [Josh Rosen] Attempt to clarify confusing metrics update code
      5e189c6 [Josh Rosen] Track time spend closing / flushing files; split TimeTrackingOutputStream into separate file.
      d5779c6 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-sort
      c2ce78e [Josh Rosen] Fix a missed usage of MAX_PARTITION_ID
      e3b8855 [Josh Rosen] Cleanup in UnsafeShuffleWriter
      4a2c785 [Josh Rosen] rename 'sort buffer' to 'pointer array'
      6276168 [Josh Rosen] Remove ability to disable spilling in UnsafeShuffleExternalSorter.
      57312c9 [Josh Rosen] Clarify fileBufferSize units
      2d4e4f4 [Josh Rosen] Address some minor comments in UnsafeShuffleExternalSorter.
      fdcac08 [Josh Rosen] Guard against overflow when expanding sort buffer.
      85da63f [Josh Rosen] Cleanup in UnsafeShuffleSorterIterator.
      0ad34da [Josh Rosen] Fix off-by-one in nextInt() call
      56781a1 [Josh Rosen] Rename UnsafeShuffleSorter to UnsafeShuffleInMemorySorter
      e995d1a [Josh Rosen] Introduce MAX_SHUFFLE_OUTPUT_PARTITIONS.
      e58a6b4 [Josh Rosen] Add more tests for PackedRecordPointer encoding.
      4f0b770 [Josh Rosen] Attempt to implement proper shuffle write metrics.
      d4e6d89 [Josh Rosen] Update to bit shifting constants
      69d5899 [Josh Rosen] Remove some unnecessary override vals
      8531286 [Josh Rosen] Add tests that automatically trigger spills.
      7c953f9 [Josh Rosen] Add test that covers UnsafeShuffleSortDataFormat.swap().
      e1855e5 [Josh Rosen] Fix a handful of misc. IntelliJ inspections
      39434f9 [Josh Rosen] Avoid integer multiplication overflow in getMemoryUsage (thanks FindBugs!)
      1e3ad52 [Josh Rosen] Delete unused ByteBufferOutputStream class.
      ea4f85f [Josh Rosen] Roll back an unnecessary change in Spillable.
      ae538dc [Josh Rosen] Document UnsafeShuffleManager.
      ec6d626 [Josh Rosen] Add notes on maximum # of supported shuffle partitions.
      0d4d199 [Josh Rosen] Bump up shuffle.memoryFraction to make tests pass.
      b3b1924 [Josh Rosen] Properly implement close() and flush() in DummySerializerInstance.
      1ef56c7 [Josh Rosen] Revise compression codec support in merger; test cross product of configurations.
      b57c17f [Josh Rosen] Disable some overly-verbose logs that rendered DEBUG useless.
      f780fb1 [Josh Rosen] Add test demonstrating which compression codecs support concatenation.
      4a01c45 [Josh Rosen] Remove unnecessary log message
      27b18b0 [Josh Rosen] That for inserting records AT the max record size.
      fcd9a3c [Josh Rosen] Add notes + tests for maximum record / page sizes.
      9d1ee7c [Josh Rosen] Fix MiMa excludes for ShuffleWriter change
      fd4bb9e [Josh Rosen] Use own ByteBufferOutputStream rather than Kryo's
      67d25ba [Josh Rosen] Update Exchange operator's copying logic to account for new shuffle manager
      8f5061a [Josh Rosen] Strengthen assertion to check partitioning
      01afc74 [Josh Rosen] Actually read data in UnsafeShuffleWriterSuite
      1929a74 [Josh Rosen] Update to reflect upstream ShuffleBlockManager -> ShuffleBlockResolver rename.
      e8718dd [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-sort
      9b7ebed [Josh Rosen] More defensive programming RE: cleaning up spill files and memory after errors
      7cd013b [Josh Rosen] Begin refactoring to enable proper tests for spilling.
      722849b [Josh Rosen] Add workaround for transferTo() bug in merging code; refactor tests.
      9883e30 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-sort
      b95e642 [Josh Rosen] Refactor and document logic that decides when to spill.
      1ce1300 [Josh Rosen] More minor cleanup
      5e8cf75 [Josh Rosen] More minor cleanup
      e67f1ea [Josh Rosen] Remove upper type bound in ShuffleWriter interface.
      cfe0ec4 [Josh Rosen] Address a number of minor review comments:
      8a6fe52 [Josh Rosen] Rename UnsafeShuffleSpillWriter to UnsafeShuffleExternalSorter
      11feeb6 [Josh Rosen] Update TODOs related to shuffle write metrics.
      b674412 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-sort
      aaea17b [Josh Rosen] Add comments to UnsafeShuffleSpillWriter.
      4f70141 [Josh Rosen] Fix merging; now passes UnsafeShuffleSuite tests.
      133c8c9 [Josh Rosen] WIP towards testing UnsafeShuffleWriter.
      f480fb2 [Josh Rosen] WIP in mega-refactoring towards shuffle-specific sort.
      57f1ec0 [Josh Rosen] WIP towards packed record pointers for use in optimized shuffle sort.
      69232fd [Josh Rosen] Enable compressible address encoding for off-heap mode.
      7ee918e [Josh Rosen] Re-order imports in tests
      3aeaff7 [Josh Rosen] More refactoring and cleanup; begin cleaning iterator interfaces
      3490512 [Josh Rosen] Misc. cleanup
      f156a8f [Josh Rosen] Hacky metrics integration; refactor some interfaces.
      2776aca [Josh Rosen] First passing test for ExternalSorter.
      5e100b2 [Josh Rosen] Super-messy WIP on external sort
      595923a [Josh Rosen] Remove some unused variables.
      8958584 [Josh Rosen] Fix bug in calculating free space in current page.
      f17fa8f [Josh Rosen] Add missing newline
      c2fca17 [Josh Rosen] Small refactoring of SerializerPropertiesSuite to enable test re-use:
      b8a09fe [Josh Rosen] Back out accidental log4j.properties change
      bfc12d3 [Josh Rosen] Add tests for serializer relocation property.
      240864c [Josh Rosen] Remove PrefixComputer and require prefix to be specified as part of insert()
      1433b42 [Josh Rosen] Store record length as int instead of long.
      026b497 [Josh Rosen] Re-use a buffer in UnsafeShuffleWriter
      0748458 [Josh Rosen] Port UnsafeShuffleWriter to Java.
      87e721b [Josh Rosen] Renaming and comments
      d3cc310 [Josh Rosen] Flag that SparkSqlSerializer2 supports relocation
      e2d96ca [Josh Rosen] Expand serializer API and use new function to help control when new UnsafeShuffle path is used.
      e267cee [Josh Rosen] Fix compilation of UnsafeSorterSuite
      9c6cf58 [Josh Rosen] Refactor to use DiskBlockObjectWriter.
      253f13e [Josh Rosen] More cleanup
      8e3ec20 [Josh Rosen] Begin code cleanup.
      4d2f5e1 [Josh Rosen] WIP
      3db12de [Josh Rosen] Minor simplification and sanity checks in UnsafeSorter
      767d3ca [Josh Rosen] Fix invalid range in UnsafeSorter.
      e900152 [Josh Rosen] Add test for empty iterator in UnsafeSorter
      57a4ea0 [Josh Rosen] Make initialSize configurable in UnsafeSorter
      abf7bfe [Josh Rosen] Add basic test case.
      81d52c5 [Josh Rosen] WIP on UnsafeSorter
      73bed408
    • [SPARK-7356] [STREAMING] Fix flakey tests in FlumePollingStreamSuite using... · 61d1e87c
      Hari Shreedharan authored
      [SPARK-7356] [STREAMING] Fix flakey tests in FlumePollingStreamSuite using SparkSink's batch CountDownLatch.
      
      This is meant to make the FlumePollingStreamSuite deterministic. Now we basically count the number of batches that have been completed - and then verify the results rather than sleeping for random periods of time.
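      A hedged sketch of the latch-based pattern (counts and names are illustrative, not the actual suite code):

      ```scala
      import java.util.concurrent.{CountDownLatch, TimeUnit}

      val expectedBatches = 5
      val batchLatch = new CountDownLatch(expectedBatches)

      // Called once per completed batch (e.g. from the sink callback).
      def onBatchCompleted(): Unit = batchLatch.countDown()

      // In the test body: wait for the batches deterministically, then verify.
      assert(batchLatch.await(60, TimeUnit.SECONDS), "timed out waiting for batches")
      ```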
      
      Author: Hari Shreedharan <hshreedharan@apache.org>
      
      Closes #5918 from harishreedharan/flume-test-fix and squashes the following commits:
      
      93f24f3 [Hari Shreedharan] Add an eventually block to ensure that all received data is processed. Refactor the dstream creation and remove redundant code.
      1108804 [Hari Shreedharan] [SPARK-7356][STREAMING] Fix flakey tests in FlumePollingStreamSuite using SparkSink's batch CountDownLatch.
      61d1e87c
    • [STREAMING] [MINOR] Keep streaming.UIUtils private · bb6dec3b
      Andrew Or authored
      zsxwing
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #6134 from andrewor14/private-streaming-uiutils and squashes the following commits:
      
      225df94 [Andrew Or] Privatize class
      bb6dec3b
    • [SPARK-7502] DAG visualization: gracefully handle removed stages · aa183787
      Andrew Or authored
      Old stages are removed without much feedback to the user. This happens very often in streaming. See screenshots below for more detail. zsxwing
      
      **Before**
      
      <img src="https://cloud.githubusercontent.com/assets/2133137/7621031/643cc1e0-f978-11e4-8f42-09decaac44a7.png" width="500px"/>
      
      -------------------------
      **After**
      <img src="https://cloud.githubusercontent.com/assets/2133137/7621037/6e37348c-f978-11e4-84a5-e44e154f9b13.png" width="400px"/>
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #6132 from andrewor14/dag-viz-remove-gracefully and squashes the following commits:
      
      43175cd [Andrew Or] Handle removed jobs and stages gracefully
      aa183787
    • [SPARK-7464] DAG visualization: highlight the same RDDs on hover · 44403414
      Andrew Or authored
      This is pretty useful for MLlib.
      
      <img src="https://cloud.githubusercontent.com/assets/2133137/7599650/c7d03dd8-f8b8-11e4-8c0a-0a89e786c90f.png" width="400px"/>
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #6100 from andrewor14/dag-viz-hover and squashes the following commits:
      
      fefe2af [Andrew Or] Link tooltips for nodes that belong to the same RDD
      90c6a7e [Andrew Or] Assign classes to clusters and nodes, not IDs
      44403414
    • [SPARK-7399] Spark compilation error for scala 2.11 · f88ac701
      Andrew Or authored
      Subsequent fix following #5966. I tried this out locally.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #6129 from andrewor14/211-compilation and squashes the following commits:
      
      713868f [Andrew Or] Fix compilation issue for scala 2.11
      f88ac701
    • [SPARK-7608] Clean up old state in RDDOperationGraphListener · f6e18388
      Andrew Or authored
      This is necessary for streaming and long-running Spark applications. zsxwing tdas
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #6125 from andrewor14/viz-listener-leak and squashes the following commits:
      
      8660949 [Andrew Or] Fix thing + add tests
      33c0843 [Andrew Or] Clean up old job state
      f6e18388
    • [SQL] Move some classes into packages that are more appropriate. · e683182c
      Reynold Xin authored
      JavaTypeInference into catalyst
      types.DateUtils into catalyst
      CacheManager into execution
      DefaultParserDialect into catalyst
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6108 from rxin/sql-rename and squashes the following commits:
      
      3fc9613 [Reynold Xin] Fixed import ordering.
      83d9ff4 [Reynold Xin] Fixed codegen tests.
      e271e86 [Reynold Xin] mima
      f4e24a6 [Reynold Xin] [SQL] Move some classes into packages that are more appropriate.
      e683182c
    • [SPARK-7303] [SQL] push down project if possible when the child is sort · 59250fe5
      scwf authored
      Optimize the case of `project(_, sort)`; an example is:
      
      `select key from (select * from testData order by key) t`
      
      before this PR:
      ```
      == Parsed Logical Plan ==
      'Project ['key]
       'Subquery t
        'Sort ['key ASC], true
         'Project [*]
          'UnresolvedRelation [testData], None
      
      == Analyzed Logical Plan ==
      Project [key#0]
       Subquery t
        Sort [key#0 ASC], true
         Project [key#0,value#1]
          Subquery testData
           LogicalRDD [key#0,value#1], MapPartitionsRDD[1]
      
      == Optimized Logical Plan ==
      Project [key#0]
       Sort [key#0 ASC], true
        LogicalRDD [key#0,value#1], MapPartitionsRDD[1]
      
      == Physical Plan ==
      Project [key#0]
       Sort [key#0 ASC], true
        Exchange (RangePartitioning [key#0 ASC], 5), []
         PhysicalRDD [key#0,value#1], MapPartitionsRDD[1]
      ```
      
      after this PR
      ```
      == Parsed Logical Plan ==
      'Project ['key]
       'Subquery t
        'Sort ['key ASC], true
         'Project [*]
          'UnresolvedRelation [testData], None
      
      == Analyzed Logical Plan ==
      Project [key#0]
       Subquery t
        Sort [key#0 ASC], true
         Project [key#0,value#1]
          Subquery testData
           LogicalRDD [key#0,value#1], MapPartitionsRDD[1]
      
      == Optimized Logical Plan ==
      Sort [key#0 ASC], true
       Project [key#0]
        LogicalRDD [key#0,value#1], MapPartitionsRDD[1]
      
      == Physical Plan ==
      Sort [key#0 ASC], true
       Exchange (RangePartitioning [key#0 ASC], 5), []
        Project [key#0]
         PhysicalRDD [key#0,value#1], MapPartitionsRDD[1]
      ```
      
      With this rule we will first do column pruning on the table and then do the sort.
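      A hedged sketch of what such an optimizer rule might look like in Catalyst; simplified, and not necessarily the rule as merged:

      ```scala
      import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeSet}
      import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, Sort}
      import org.apache.spark.sql.catalyst.rules.Rule

      object PushProjectThroughSort extends Rule[LogicalPlan] {
        def apply(plan: LogicalPlan): LogicalPlan = plan transform {
          // Only safe when the projection keeps plain attributes and the sort
          // keys are still among them after the push-down.
          case Project(projectList, Sort(order, global, child))
              if projectList.forall(_.isInstanceOf[Attribute]) &&
                 order.forall(_.references.subsetOf(AttributeSet(projectList))) =>
            Sort(order, global, Project(projectList, child))
        }
      }
      ```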
      
      Author: scwf <wangfei1@huawei.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Michael Armbrust <michael@databricks.com>
      
      Closes #5838 from scwf/pruning and squashes the following commits:
      
      b00d833 [scwf] address michael's comment
      e230155 [scwf] fix tests failure
      b09b895 [scwf] improve column pruning
      59250fe5
    • [SPARK-7382] [MLLIB] Feature Parity in PySpark for ml.classification · df2fb130
      Burak Yavuz authored
      The missing pieces in ml.classification for Python!
      
      cc mengxr
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #6106 from brkyvz/ml-class and squashes the following commits:
      
      dd78237 [Burak Yavuz] fix style
      1048e29 [Burak Yavuz] ready for PR
      df2fb130
    • [SPARK-7545] [MLLIB] Added check in Bernoulli Naive Bayes to make sure that... · 61e05fc5
      leahmcguire authored
      [SPARK-7545] [MLLIB] Added check in Bernoulli Naive Bayes to make sure that both training and predict features have values of 0 or 1
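      A hedged sketch of the kind of validation involved (the actual check lives inside NaiveBayes' training and prediction paths):

      ```scala
      import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector, Vectors}

      // Bernoulli NB models presence/absence, so every feature value must be
      // exactly 0 or 1, both at training and at prediction time.
      def requireZeroOneValues(v: Vector): Unit = {
        val values = v match {
          case sv: SparseVector => sv.values
          case dv: DenseVector  => dv.values
        }
        require(values.forall(x => x == 0.0 || x == 1.0),
          s"Bernoulli Naive Bayes requires 0 or 1 feature values but found $v.")
      }

      requireZeroOneValues(Vectors.dense(1.0, 0.0, 1.0))  // ok
      // requireZeroOneValues(Vectors.dense(0.5, 1.0))    // would throw
      ```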
      
      Author: leahmcguire <lmcguire@salesforce.com>
      
      Closes #6073 from leahmcguire/binaryCheckNB and squashes the following commits:
      
      b8442c2 [leahmcguire] changed to if else for value checks
      911bf83 [leahmcguire] undid reformat
      4eedf1e [leahmcguire] moved bernoulli check
      9ee9e84 [leahmcguire] fixed style error
      3f3b32c [leahmcguire] fixed zero one check so only called in combiner
      831fd27 [leahmcguire] got test working
      f44bb3c [leahmcguire] removed changes from CV branch
      67253f0 [leahmcguire] added check to bernoulli to ensure feature values are zero or one
      f191c71 [leahmcguire] fixed name
      58d060b [leahmcguire] changed param name and test according to comments
      04f0d3c [leahmcguire] Added stats from cross validation as a val in the cross validation model to save them for user access
      61e05fc5