  1. May 19, 2015
  2. May 18, 2015
    • Josh Rosen's avatar
      [SPARK-7687] [SQL] DataFrame.describe() should cast all aggregates to String · c9fa870a
      Josh Rosen authored
      In `DataFrame.describe()`, the `count` aggregate produces an integer, the `avg` and `stdev` aggregates produce doubles, and `min` and `max` aggregates can produce varying types depending on what type of column they're applied to.  As a result, we should cast all aggregate results to String so that `describe()`'s output types match its declared output schema.
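      
      A minimal sketch of the idea in plain Scala, without Spark: the statistics have different natural types, so each one is rendered as a String before being placed in a single output column. `DescribeSketch` and its `describe` helper are hypothetical names, and using the sample standard deviation is an arbitrary choice for this sketch.
      
      ```scala
      object DescribeSketch {
        // Hypothetical stand-in for describe(): every statistic is rendered as a
        // String so that all rows share one column type.
        def describe(values: Seq[Int]): Seq[(String, String)] = {
          require(values.nonEmpty, "need at least one value")
          val n = values.size
          val mean = values.sum.toDouble / n
          // Sample standard deviation; with a single value this evaluates to NaN.
          val stddev = math.sqrt(values.map(v => math.pow(v - mean, 2)).sum / (n - 1))
          Seq(
            "count" -> n.toString,          // naturally an integer
            "mean" -> mean.toString,        // naturally a double
            "stddev" -> stddev.toString,    // naturally a double
            "min" -> values.min.toString,   // type of the column (Int here)
            "max" -> values.max.toString    // type of the column (Int here)
          )
        }
      }
      ```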
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #6218 from JoshRosen/SPARK-7687 and squashes the following commits:
      
      146b615 [Josh Rosen] Fix R test.
      2974bd5 [Josh Rosen] Cast to string type instead
      f206580 [Josh Rosen] Cast to double to fix SPARK-7687
      307ecbf [Josh Rosen] Add failing regression test for SPARK-7687
      c9fa870a
    • Daoyuan Wang's avatar
      [SPARK-7150] SparkContext.range() and SQLContext.range() · c2437de1
      Daoyuan Wang authored
      This PR is based on #6081, thanks adrian-wang.
      
      Closes #6081
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6230 from davies/range and squashes the following commits:
      
      d3ce5fe [Davies Liu] add tests
      789eda5 [Davies Liu] add range() in Python
      4590208 [Davies Liu] Merge commit 'refs/pull/6081/head' of github.com:apache/spark into range
      cbf5200 [Daoyuan Wang] let's add python support in a separate PR
      f45e3b2 [Daoyuan Wang] remove redundant toLong
      617da76 [Daoyuan Wang] fix safe marge for corner cases
      867c417 [Daoyuan Wang] fix
      13dbe84 [Daoyuan Wang] update
      bd998ba [Daoyuan Wang] update comments
      d3a0c1b [Daoyuan Wang] add range api()
      c2437de1
    • Liang-Chi Hsieh's avatar
      [SPARK-7681] [MLLIB] Add SparseVector support for gemv · d03638cc
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-7681
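      
      For context, gemv computes `y := alpha * A * x + beta * y`. The sketch below shows the operation for a sparse `x` in plain Scala. It is not the MLlib implementation: `GemvSketch` is a hypothetical name, the matrix is stored row by row as nested arrays, and the sparse vector is passed as parallel index and value arrays.
      
      ```scala
      object GemvSketch {
        /** Computes y := alpha * A * x + beta * y in place.
          * A is dense and stored row by row; x is sparse, given as (indices, values). */
        def gemv(alpha: Double, a: Array[Array[Double]],
                 xIndices: Array[Int], xValues: Array[Double],
                 beta: Double, y: Array[Double]): Unit = {
          require(a.length == y.length, "A must have one row per entry of y")
          require(xIndices.length == xValues.length, "sparse x needs matching indices and values")
          // Scale y first. When beta is 0.0 this clears out y, so it cannot be skipped.
          var i = 0
          while (i < y.length) {
            y(i) = if (beta == 0.0) 0.0 else beta * y(i)
            i += 1
          }
          // Only the stored (non-zero) entries of x contribute to A * x.
          i = 0
          while (i < y.length) {
            var sum = 0.0
            var k = 0
            while (k < xIndices.length) {
              sum += a(i)(xIndices(k)) * xValues(k)
              k += 1
            }
            y(i) += alpha * sum
            i += 1
          }
        }
      }
      ```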
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #6209 from viirya/sparsevector_gemv and squashes the following commits:
      
      ce0bb8b [Liang-Chi Hsieh] Still need to scal y when beta is 0.0 because it clears out y.
      b890e63 [Liang-Chi Hsieh] Do not delete multiply for DenseVector.
      57a8c1e [Liang-Chi Hsieh] Add MimaExcludes for v1.4.
      458d1ae [Liang-Chi Hsieh] List DenseMatrix.multiply and SparseMatrix.multiply to MimaExcludes too.
      054f05d [Liang-Chi Hsieh] Fix scala style.
      410381a [Liang-Chi Hsieh] Address comments. Make Matrix.multiply more generalized.
      4616696 [Liang-Chi Hsieh] Add support for SparseVector with SparseMatrix.
      5d6d07a [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into sparsevector_gemv
      c069507 [Liang-Chi Hsieh] Add SparseVector support for gemv with DenseMatrix.
      d03638cc
    • Tathagata Das's avatar
      [SPARK-7692] Updated Kinesis examples · 3a600386
      Tathagata Das authored
      - Updated Kinesis examples to use stable API
      - Cleaned up comments, etc.
      - Renamed KinesisWordCountProducerASL to KinesisWordProducerASL
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #6249 from tdas/kinesis-examples and squashes the following commits:
      
      7cc307b [Tathagata Das] More tweaks
      f080872 [Tathagata Das] More cleanup
      841987f [Tathagata Das] Small update
      011cbe2 [Tathagata Das] More fixes
      b0d74f9 [Tathagata Das] Updated examples.
      3a600386
    • jerluc's avatar
      [SPARK-7621] [STREAMING] Report Kafka errors to StreamingListeners · 0a7a94ea
      jerluc authored
      PR per [SPARK-7621](https://issues.apache.org/jira/browse/SPARK-7621), which makes both `KafkaReceiver` and `ReliableKafkaReceiver` report their errors to the `ReceiverTracker`, which in turn adds the events to the bus to fire off any registered `StreamingListener`s.
      
      Author: jerluc <jeremyalucas@gmail.com>
      
      Closes #6204 from jerluc/master and squashes the following commits:
      
      82439a5 [jerluc] [SPARK-7621] [STREAMING] Report Kafka errors to StreamingListeners
      0a7a94ea
    • Davies Liu's avatar
      [SPARK-7624] Revert #4147 · 4fb52f95
      Davies Liu authored
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6172 from davies/revert_4147 and squashes the following commits:
      
      3bfbbde [Davies Liu] Revert #4147
      4fb52f95
    • Michael Armbrust's avatar
      [SQL] Fix serializability of ORC table scan · eb4632f2
      Michael Armbrust authored
      A follow-up to #6244.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #6247 from marmbrus/fixOrcTests and squashes the following commits:
      
      e39ee1b [Michael Armbrust] [SQL] Fix serializability of ORC table scan
      eb4632f2
    • Jihong MA's avatar
      [SPARK-7063] when lz4 compression is used, it causes core dump · 6525fc0a
      Jihong MA authored
      This fix addresses an issue found in lz4 1.2.0 that caused a core dump in Spark Core with the IBM JDK. The issue is fixed in lz4 1.3.0.
      
      Author: Jihong MA <linlin200605@gmail.com>
      
      Closes #6226 from JihongMA/SPARK-7063-1 and squashes the following commits:
      
      0cca781 [Jihong MA] SPARK-7063
      4559ed5 [Jihong MA] SPARK-7063
      daa520f [Jihong MA] SPARK-7063 upgrade lz4 jars
      71738ee [Jihong MA] Merge remote-tracking branch 'upstream/master'
      dfaa971 [Jihong MA] SPARK-7265 minor fix of the content
      ace454d [Jihong MA] SPARK-7265 take out PySpark on YARN limitation
      9ea0832 [Jihong MA] Merge remote-tracking branch 'upstream/master'
      d5bf3f5 [Jihong MA] Merge remote-tracking branch 'upstream/master'
      7b842e6 [Jihong MA] Merge remote-tracking branch 'upstream/master'
      9c84695 [Jihong MA] SPARK-7265 address review comment
      a399aa6 [Jihong MA] SPARK-7265 Improving documentation for Spark SQL Hive support
      6525fc0a
    • Andrew Or's avatar
      [SPARK-7501] [STREAMING] DAG visualization: show DStream operations · b93c97d7
      Andrew Or authored
      This is similar to #5999, but for streaming. Roughly 200 lines are tests.
      
      One thing to note here is that we already do some kind of scoping thing for call sites, so this patch adds the new RDD operation scoping logic in the same place. Also, this patch adds a `try finally` block to set the relevant variables in a safer way.
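      
      The `try finally` pattern mentioned above can be sketched in plain Scala as follows. This is only an illustration under assumptions: `ScopeSketch`, `withScope` and the thread-local variable are hypothetical stand-ins, not the code in this patch.
      
      ```scala
      object ScopeSketch {
        // Hypothetical stand-in for a per-thread "current scope" property.
        private val currentScope = new ThreadLocal[Option[String]] {
          override def initialValue(): Option[String] = None
        }
      
        def scope: Option[String] = currentScope.get()
      
        /** Runs `body` with the scope set to `name`, restoring the old value afterwards. */
        def withScope[T](name: String)(body: => T): T = {
          val previous = currentScope.get()
          currentScope.set(Some(name))
          try {
            body
          } finally {
            // Runs even if `body` throws, so the scope never leaks into later operations.
            currentScope.set(previous)
          }
        }
      }
      ```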
      
      tdas zsxwing
      
      ------------------------
      **Before**
      <img src="https://cloud.githubusercontent.com/assets/2133137/7625996/d88211b8-f9b4-11e4-90b9-e11baa52d6d7.png" width="450px"/>
      
      --------------------------
      **After**
      <img src="https://cloud.githubusercontent.com/assets/2133137/7625997/e0878f8c-f9b4-11e4-8df3-7dd611b13c87.png" width="650px"/>
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #6034 from andrewor14/dag-viz-streaming and squashes the following commits:
      
      932a64a [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
      e685df9 [Andrew Or] Rename createRDDWith
      84d0656 [Andrew Or] Review feedback
      697c086 [Andrew Or] Fix tests
      53b9936 [Andrew Or] Set scopes for foreachRDD properly
      1881802 [Andrew Or] Refactor DStream scope names again
      af4ba8d [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
      fd07d22 [Andrew Or] Make MQTT lower case
      f6de871 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
      0ca1801 [Andrew Or] Remove a few unnecessary withScopes on aliases
      fa4e5fb [Andrew Or] Pass in input stream name rather than defining it from within
      1af0b0e [Andrew Or] Fix style
      074c00b [Andrew Or] Review comments
      d25a324 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
      e4a93ac [Andrew Or] Fix tests?
      25416dc [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
      9113183 [Andrew Or] Add tests for DStream scopes
      b3806ab [Andrew Or] Fix test
      bb80bbb [Andrew Or] Fix MIMA?
      5c30360 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
      5703939 [Andrew Or] Rename operations that create InputDStreams
      7c4513d [Andrew Or] Group RDDs by DStream operations and batches
      bf0ab6e [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
      05c2676 [Andrew Or] Wrap many more methods in withScope
      c121047 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
      65ef3e9 [Andrew Or] Fix NPE
      a0d3263 [Andrew Or] Scope streaming operations instead of RDD operations
      b93c97d7
    • Michael Armbrust's avatar
      [HOTFIX] Fix ORC build break · fcf90b75
      Michael Armbrust authored
      Fix break caused by merging #6225 and #6194.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #6244 from marmbrus/fixOrcBuildBreak and squashes the following commits:
      
      b10e47b [Michael Armbrust] [HOTFIX] Fix ORC Build break
      fcf90b75
    • zsxwing's avatar
      [SPARK-7658] [STREAMING] [WEBUI] Update the mouse behaviors for the timeline graphs · 0b6f503d
      zsxwing authored
      1. If the user clicks a point of a batch, scroll down to the corresponding batch row and highlight it, then restore the batch row after 3 seconds if necessary.
      
      2. Add "#batches" in the histogram graphs.
      
      ![screen shot 2015-05-14 at 7 36 19 pm](https://cloud.githubusercontent.com/assets/1000778/7646108/84f4a014-fa73-11e4-8c13-1903d267e60f.png)
      
      ![screen shot 2015-05-14 at 7 36 53 pm](https://cloud.githubusercontent.com/assets/1000778/7646109/8b11154a-fa73-11e4-820b-8ece9fa6ee3e.png)
      
      ![screen shot 2015-05-14 at 7 36 34 pm](https://cloud.githubusercontent.com/assets/1000778/7646111/93828272-fa73-11e4-89f8-580670144d3c.png)
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #6168 from zsxwing/SPARK-7658 and squashes the following commits:
      
      c242b00 [zsxwing] Change 5 seconds to 3 seconds
      31fd0aa [zsxwing] Remove the mouseover highlight feature
      06c6f6f [zsxwing] Merge branch 'master' into SPARK-7658
      2eaff06 [zsxwing] Merge branch 'master' into SPARK-7658
      108d56c [zsxwing] Update the mouse behaviors for the timeline graphs
      0b6f503d
    • Davies Liu's avatar
      [SPARK-6216] [PYSPARK] check python version of worker with driver · 32fbd297
      Davies Liu authored
      This PR reverts #5404 and instead passes the Python version of the driver into the JVM, then checks it in the worker before deserializing the closure, so that it works with different major versions of Python.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6203 from davies/py_version and squashes the following commits:
      
      b8fb76e [Davies Liu] fix test
      6ce5096 [Davies Liu] use string for version
      47c6278 [Davies Liu] check python version of worker with driver
      32fbd297
    • Cheng Lian's avatar
      [SPARK-7673] [SQL] WIP: HadoopFsRelation and ParquetRelation2 performance optimizations · 9dadf019
      Cheng Lian authored
      This PR introduces several performance optimizations to `HadoopFsRelation` and `ParquetRelation2`:
      
      1.  Moving `FileStatus` listing from `DataSourceStrategy` into a cache within `HadoopFsRelation`.
      
          This new cache generalizes and replaces the one used in `ParquetRelation2`.
      
          This also introduces an interface change: to reuse cached `FileStatus` objects, `HadoopFsRelation.buildScan` methods now receive `Array[FileStatus]` instead of `Array[String]`.
      
      1.  When Parquet task side metadata reading is enabled, skip reading row group information when reading Parquet footers.
      
          This is basically what PR #5334 does. Also, we now use `ParquetFileReader.readAllFootersInParallel` to read footers in parallel.
      
      Another optimization in question is, instead of asking `HadoopFsRelation.buildScan` to return an `RDD[Row]` for a single selected partition and then union them all, we ask it to return an `RDD[Row]` for all selected partitions. This optimization is based on the fact that Hadoop configuration broadcasting used in `NewHadoopRDD` takes 34% of the time in the following microbenchmark. However, this complicates data source user code because user code must merge partition values manually.
      
      To check the cost of broadcasting in `NewHadoopRDD`, I also ran the microbenchmark after removing the `broadcast` call in `NewHadoopRDD`. All results are shown below.
      
      ### Microbenchmark
      
      #### Preparation code
      
      Generating a partitioned table with 5k partitions (500 batches of 10 partitions each), 1k rows per partition:
      
      ```scala
      import sqlContext._
      import sqlContext.implicits._
      
      for (n <- 0 until 500) {
        val data = for {
          p <- (n * 10) until ((n + 1) * 10)
          i <- 0 until 1000
        } yield (i, f"val_$i%04d", f"$p%04d")
      
        data.
          toDF("a", "b", "p").
          write.
          partitionBy("p").
          mode("append").
          parquet(path)
      }
      ```
      
      #### Benchmarking code
      
      ```scala
      import sqlContext._
      import sqlContext.implicits._
      
      import org.apache.spark.sql.types._
      import com.google.common.base.Stopwatch
      
      val path = "hdfs://localhost:9000/user/lian/5k"
      
      def benchmark(n: Int)(f: => Unit) {
        val stopwatch = new Stopwatch()
      
        def run() = {
          stopwatch.reset()
          stopwatch.start()
          f
          stopwatch.stop()
          stopwatch.elapsedMillis()
        }
      
        val records = (0 until n).map(_ => run())
      
        (0 until n).foreach(i => println(s"Round $i: ${records(i)} ms"))
        println(s"Average: ${records.sum / n.toDouble} ms")
      }
      
      benchmark(3) { read.parquet(path).explain(extended = true) }
      ```
      
      #### Results
      
      Before:
      
      ```
      Round 0: 72528 ms
      Round 1: 68938 ms
      Round 2: 65372 ms
      Average: 68946.0 ms
      ```
      
      After:
      
      ```
      Round 0: 59499 ms
      Round 1: 53645 ms
      Round 2: 53844 ms
      Round 3: 49093 ms
      Round 4: 50555 ms
      Average: 53327.2 ms
      ```
      
      Also removing Hadoop configuration broadcasting:
      
      (Note that I was testing on a local laptop, thus network cost is pretty low.)
      
      ```
      Round 0: 15806 ms
      Round 1: 14394 ms
      Round 2: 14699 ms
      Round 3: 15334 ms
      Round 4: 14123 ms
      Average: 14871.2 ms
      ```
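      
      As a cross-check of the numbers above, the following plain Scala sketch recomputes the averages from the listed rounds and derives the relative reduction in elapsed time. `BenchmarkSummary` and its helpers are hypothetical names introduced only for this calculation.
      
      ```scala
      object BenchmarkSummary {
        // Round timings in milliseconds, copied from the results above.
        val before = Seq(72528L, 68938L, 65372L)
        val after = Seq(59499L, 53645L, 53844L, 49093L, 50555L)
        val noBroadcast = Seq(15806L, 14394L, 14699L, 15334L, 14123L)
      
        def average(rounds: Seq[Long]): Double = rounds.sum / rounds.size.toDouble
      
        /** Fraction of the baseline time saved, e.g. 0.25 means 25% less time. */
        def reduction(baseline: Seq[Long], improved: Seq[Long]): Double =
          1.0 - average(improved) / average(baseline)
      }
      ```
      
      With these inputs the averages match the printed ones, and the reductions relative to the "Before" run come to roughly 23% for the patch and roughly 78% when broadcasting is also removed. The "Before" run has 3 rounds while the other two have 5, so the comparison is indicative only.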
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #6225 from liancheng/spark-7673 and squashes the following commits:
      
      2d58a2b [Cheng Lian] Skips reading row group information when using task side metadata reading
      7aa3748 [Cheng Lian] Optimizes FileStatusCache by introducing a map from parent directories to child files
      ba41250 [Cheng Lian] Reuses HadoopFsRelation FileStatusCache in ParquetRelation2
      3d278f7 [Cheng Lian] Fixes a bug when reading a single Parquet data file
      b84612a [Cheng Lian] Fixes Scala style issue
      6a08b02 [Cheng Lian] WIP: Moves file status cache into HadoopFSRelation
      9dadf019
    • Yin Huai's avatar
      [SPARK-7567] [SQL] [follow-up] Use a new flag to set output committer based on mapreduce apis · 530397ba
      Yin Huai authored
      cc liancheng marmbrus
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #6130 from yhuai/directOutput and squashes the following commits:
      
      312b07d [Yin Huai] A data source can use spark.sql.sources.outputCommitterClass to override the output committer.
      530397ba
    • Wenchen Fan's avatar
      [SPARK-7269] [SQL] Incorrect analysis for aggregation(use semanticEquals) · 103c863c
      Wenchen Fan authored
      A modified version of https://github.com/apache/spark/pull/6110, use `semanticEquals` to make it more efficient.
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #6173 from cloud-fan/7269 and squashes the following commits:
      
      e4a3cc7 [Wenchen Fan] address comments
      cc02045 [Wenchen Fan] consider elements length equal
      d7ff8f4 [Wenchen Fan] fix 7269
      103c863c
    • scwf's avatar
      [SPARK-7631] [SQL] treenode argString should not print children · fc2480ed
      scwf authored
      spark-sql>
      > explain extended
      > select * from (
      > select key from src union all
      > select key from src) t;
      
      Currently the Spark plan prints children in argString:
      ```
      == Physical Plan ==
      Union[ HiveTableScan key#1, (MetastoreRelation default, src, None), None,
      HiveTableScan key#3, (MetastoreRelation default, src, None), None]
      HiveTableScan key#1, (MetastoreRelation default, src, None), None
      HiveTableScan key#3, (MetastoreRelation default, src, None), None
      ```
      
      after this patch:
      ```
      == Physical Plan ==
      Union
       HiveTableScan [key#1], (MetastoreRelation default, src, None), None
       HiveTableScan [key#3], (MetastoreRelation default, src, None), None
      ```
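      
      The intended behavior can be modeled with a small tree type in plain Scala. This is a simplified, hypothetical model (`TreeSketch`, `Plan`, `Scan` and this `Union` are not Catalyst classes): `argString` describes only the node's own arguments, and `treeString` is the one place where children get printed.
      
      ```scala
      object TreeSketch {
        sealed trait Plan {
          def children: Seq[Plan]
          def nodeName: String
          /** Arguments of this node only; children are printed by treeString. */
          def argString: String
          def treeString(depth: Int = 0): String = {
            val header = (" " * depth) + (nodeName + " " + argString).trim
            (header +: children.map(_.treeString(depth + 1))).mkString("\n")
          }
        }
      
        case class Scan(column: String) extends Plan {
          val children: Seq[Plan] = Nil
          val nodeName = "Scan"
          def argString: String = s"[$column]"
        }
      
        case class Union(children: Seq[Plan]) extends Plan {
          val nodeName = "Union"
          // The children are deliberately left out of the argument string.
          def argString: String = ""
        }
      }
      ```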
      
      I have tested this locally
      
      Author: scwf <wangfei1@huawei.com>
      
      Closes #6144 from scwf/fix-argString and squashes the following commits:
      
      1a642e0 [scwf] fix treenode argString
      fc2480ed
    • Zhan Zhang's avatar
      [SPARK-2883] [SQL] ORC data source for Spark SQL · aa31e431
      Zhan Zhang authored
      This PR updates PR #6135 authored by zhzhan from Hortonworks.
      
      ----
      
      This PR implements a Spark SQL data source for accessing ORC files.
      
      > **NOTE**
      >
      > Although ORC is now an Apache TLP, the codebase is still tightly coupled with Hive.  That's why the new ORC data source is under the `org.apache.spark.sql.hive` package, and must be used with `HiveContext`.  However, it doesn't require an existing Hive installation to access ORC files.
      
      1.  Saving/loading ORC files without contacting Hive metastore
      
      1.  Support for complex data types (i.e. array, map, and struct)
      
      1.  Aware of common optimizations provided by Spark SQL:
      
          - Column pruning
          - Partition pruning
          - Filter push-down
      
      1.  Schema evolution support
      1.  Hive metastore table conversion
      
      This PR also includes initial work done by scwf from Huawei (PR #3753).
      
      Author: Zhan Zhang <zhazhan@gmail.com>
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #6194 from liancheng/polishing-orc and squashes the following commits:
      
      55ecd96 [Cheng Lian] Reorganizes ORC test suites
      d4afeed [Cheng Lian] Addresses comments
      21ada22 [Cheng Lian] Adds @since and @Experimental annotations
      128bd3b [Cheng Lian] ORC filter bug fix
      d734496 [Cheng Lian] Polishes the ORC data source
      2650a42 [Zhan Zhang] resolve review comments
      3c9038e [Zhan Zhang] resolve review comments
      7b3c7c5 [Zhan Zhang] save mode fix
      f95abfd [Zhan Zhang] reuse test suite
      7cc2c64 [Zhan Zhang] predicate fix
      4e61c16 [Zhan Zhang] minor change
      305418c [Zhan Zhang] orc data source support
      aa31e431
    • Xiangrui Meng's avatar
      [SPARK-7380] [MLLIB] pipeline stages should be copyable in Python · 9c7e802a
      Xiangrui Meng authored
      This PR makes pipeline stages in Python copyable and hence simplifies some implementations. It also includes the following changes:
      
      1. Rename `paramMap` and `defaultParamMap` to `_paramMap` and `_defaultParamMap`, respectively.
      2. Accept a list of param maps in `fit`.
      3. Use parent uid and name to identify param.
      
      jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #6088 from mengxr/SPARK-7380 and squashes the following commits:
      
      413c463 [Xiangrui Meng] remove unnecessary doc
      4159f35 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7380
      611c719 [Xiangrui Meng] fix python style
      68862b8 [Xiangrui Meng] update _java_obj initialization
      927ad19 [Xiangrui Meng] fix ml/tests.py
      0138fc3 [Xiangrui Meng] update feature transformers and fix a bug in RegexTokenizer
      9ca44fb [Xiangrui Meng] simplify Java wrappers and add tests
      c7d84ef [Xiangrui Meng] update ml/tests.py to test copy params
      7e0d27f [Xiangrui Meng] merge master
      46840fb [Xiangrui Meng] update wrappers
      b6db1ed [Xiangrui Meng] update all self.paramMap to self._paramMap
      46cb6ed [Xiangrui Meng] merge master
      a163413 [Xiangrui Meng] fix style
      1042e80 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7380
      9630eae [Xiangrui Meng] fix Identifiable._randomUID
      13bd70a [Xiangrui Meng] update ml/tests.py
      64a536c [Xiangrui Meng] use _fit/_transform/_evaluate to simplify the impl
      02abf13 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into copyable-python
      66ce18c [Joseph K. Bradley] some cleanups before sending to Xiangrui
      7431272 [Joseph K. Bradley] Rebased with master
      9c7e802a
    • Wenchen Fan's avatar
      [SQL] [MINOR] [THIS] use private for internal field in ScalaUdf · 56ede884
      Wenchen Fan authored
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #6235 from cloud-fan/tmp and squashes the following commits:
      
      8f16367 [Wenchen Fan] use private[this]
      56ede884
    • Cheng Lian's avatar
      [SPARK-7570] [SQL] Ignores _temporary during partition discovery · 010a1c27
      Cheng Lian authored
      <!-- Reviewable:start -->
      [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/6091)
      <!-- Reviewable:end -->
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #6091 from liancheng/spark-7570 and squashes the following commits:
      
      8ff07e8 [Cheng Lian] Ignores _temporary during partition discovery
      010a1c27
    • Rene Treffer's avatar
      [SPARK-6888] [SQL] Make the jdbc driver handling user-definable · e1ac2a95
      Rene Treffer authored
      Replace the DriverQuirks with JdbcDialect(s) (and MySQLDialect/PostgresDialect)
      and allow developers to change the dialects on the fly (for new JDBCRDDs only).
      
      Some types (like an unsigned 64-bit number) cannot be trivially mapped to Java.
      The status quo is that the RDD will fail to load.
      This patch makes it possible to override the type mapping to read e.g.
      64-bit numbers as strings and handle them afterwards in software.
      
      JDBCSuite has an example that maps all types to String, which should always
      work (at the cost of extra code afterwards).
      
      As a side effect it should now be possible to develop simple dialects
      out-of-tree and even with spark-shell.
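      
      The dialect idea can be illustrated with a simplified model in plain Scala. Everything here is hypothetical (`DialectSketch`, `Dialect`, `register`, `resolveType` and the type names as strings); it is not the interface added by this patch, only a sketch of a registry where a user-supplied dialect overrides the default type mapping for matching URLs.
      
      ```scala
      object DialectSketch {
        // Hypothetical, simplified model of a dialect; not Spark's actual interface.
        trait Dialect {
          def canHandle(url: String): Boolean
          /** Returns an override for the given SQL type name, or None to keep the default. */
          def mapType(sqlTypeName: String): Option[String]
        }
      
        private var dialects: List[Dialect] = Nil
      
        /** Registers a dialect; the most recently registered one is consulted first. */
        def register(d: Dialect): Unit = synchronized {
          dialects = d :: dialects.filterNot(_ == d)
        }
      
        /** Uses the first registered dialect that handles the URL, else the default mapping. */
        def resolveType(url: String, sqlTypeName: String, default: String): String = synchronized {
          dialects.find(_.canHandle(url)).flatMap(_.mapType(sqlTypeName)).getOrElse(default)
        }
      
        // Example dialect: read unsigned 64-bit columns as strings for MySQL URLs.
        object UnsignedAsString extends Dialect {
          def canHandle(url: String): Boolean = url.startsWith("jdbc:mysql")
          def mapType(sqlTypeName: String): Option[String] =
            if (sqlTypeName == "BIGINT UNSIGNED") Some("StringType") else None
        }
      }
      ```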
      
      Author: Rene Treffer <treffer@measite.de>
      
      Closes #5555 from rtreffer/jdbc-dialects and squashes the following commits:
      
      3cbafd7 [Rene Treffer] [SPARK-6888] ignore classes belonging to changed API in MIMA report
      fe7e2e8 [Rene Treffer] [SPARK-6888] Make the jdbc driver handling user-definable
      e1ac2a95
    • Andrew Or's avatar
      [SPARK-7627] [SPARK-7472] DAG visualization: style skipped stages · 563bfcc1
      Andrew Or authored
      This patch fixes two things:
      
      **SPARK-7627.** Cached RDDs no longer light up on the job page. This is a simple fix.
      **SPARK-7472.** Display skipped stages differently from normal stages.
      
      The latter is a major UX issue. Because we link the job viz to the stage viz even for skipped stages, the user may inadvertently click into the stage page of a skipped stage, which is empty.
      
      -------------------
      <img src="https://cloud.githubusercontent.com/assets/2133137/7675241/de1a3da6-fcea-11e4-8101-88055cef78c5.png" width="300px" />
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #6171 from andrewor14/dag-viz-skipped and squashes the following commits:
      
      f261797 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-skipped
      0eda358 [Andrew Or] Tweak skipped stage border color
      c604150 [Andrew Or] Tweak grayscale colors
      7010676 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-skipped
      762b541 [Andrew Or] Use special prefix for stage clusters to avoid collisions
      51c95b9 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-skipped
      b928cd4 [Andrew Or] Fix potential leak + write tests for it
      7c4c364 [Andrew Or] Show skipped stages differently
      7cc34ce [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-skipped
      c121fa2 [Andrew Or] Fix cache color
      563bfcc1
    • Vincenzo Selvaggio's avatar
      [SPARK-7272] [MLLIB] User guide for PMML model export · 814b3dab
      Vincenzo Selvaggio authored
      https://issues.apache.org/jira/browse/SPARK-7272
      
      Author: Vincenzo Selvaggio <vselvaggio@hotmail.it>
      
      Closes #6219 from selvinsource/mllib_pmml_model_export_SPARK-7272 and squashes the following commits:
      
      c866fb8 [Vincenzo Selvaggio] Update mllib-pmml-model-export.md
      1beda98 [Vincenzo Selvaggio] [SPARK-7272] Initial user guide for pmml export
      d670662 [Vincenzo Selvaggio] Update mllib-pmml-model-export.md
      2731375 [Vincenzo Selvaggio] Update mllib-pmml-model-export.md
      680dc33 [Vincenzo Selvaggio] Update mllib-pmml-model-export.md
      2e298b5 [Vincenzo Selvaggio] Update mllib-pmml-model-export.md
      a932f51 [Vincenzo Selvaggio] Create mllib-pmml-model-export.md
      814b3dab
    • Xiangrui Meng's avatar
      [SPARK-6657] [PYSPARK] Fix doc warnings · 1ecfac6e
      Xiangrui Meng authored
      Fixed the following warnings in `make clean html` under `python/docs`:
      
      ~~~
      /Users/meng/src/spark/python/pyspark/mllib/evaluation.py:docstring of pyspark.mllib.evaluation.RankingMetrics.ndcgAt:3: ERROR: Unexpected indentation.
      /Users/meng/src/spark/python/pyspark/mllib/evaluation.py:docstring of pyspark.mllib.evaluation.RankingMetrics.ndcgAt:4: WARNING: Block quote ends without a blank line; unexpected unindent.
      /Users/meng/src/spark/python/pyspark/mllib/fpm.py:docstring of pyspark.mllib.fpm.FPGrowth.train:3: ERROR: Unexpected indentation.
      /Users/meng/src/spark/python/pyspark/mllib/fpm.py:docstring of pyspark.mllib.fpm.FPGrowth.train:4: WARNING: Block quote ends without a blank line; unexpected unindent.
      /Users/meng/src/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.replace:16: WARNING: Field list ends without a blank line; unexpected unindent.
      /Users/meng/src/spark/python/pyspark/streaming/kafka.py:docstring of pyspark.streaming.kafka.KafkaUtils.createRDD:8: ERROR: Unexpected indentation.
      /Users/meng/src/spark/python/pyspark/streaming/kafka.py:docstring of pyspark.streaming.kafka.KafkaUtils.createRDD:9: WARNING: Block quote ends without a blank line; unexpected unindent.
      ~~~
      
      davies
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6221 from mengxr/SPARK-6657 and squashes the following commits:
      
      e3f83fe [Xiangrui Meng] fix sql and streaming doc warnings
      2b4371e [Xiangrui Meng] fix mllib python doc warnings
      1ecfac6e
    • Liang-Chi Hsieh's avatar
      [SPARK-7299][SQL] Set precision and scale for Decimal according to JDBC... · e32c0f69
      Liang-Chi Hsieh authored
      [SPARK-7299][SQL] Set precision and scale for Decimal according to JDBC metadata instead of returned BigDecimal
      
      JIRA: https://issues.apache.org/jira/browse/SPARK-7299
      
      When connecting with oracle db through jdbc, the precision and scale of `BigDecimal` object returned by `ResultSet.getBigDecimal` is not correctly matched to the table schema reported by `ResultSetMetaData.getPrecision` and `ResultSetMetaData.getScale`.
      
      So if you insert a value like `19999` into a column of type `NUMBER(12, 2)`, you get back a `BigDecimal` object with a scale of 0, while the dataframe schema has the correct type `DecimalType(12, 2)`. Thus, after you save the dataframe to a Parquet file and then read it back, you get the wrong result `199.99`.
      
      Because the problem is reported on JDBC connections to Oracle DB, it might be difficult to add a test case for it. But according to the user's test on JIRA, this patch solves the problem.
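      
      The scale mismatch can be reproduced with `java.math.BigDecimal` alone. The sketch below assumes a simplified storage model in which only the unscaled digits are written and the schema's scale is applied on read; `DecimalScaleSketch` and its two helpers are hypothetical names, not code from this patch.
      
      ```scala
      object DecimalScaleSketch {
        import java.math.BigDecimal
      
        /** Simulates a store that keeps only the unscaled digits and trusts the schema's scale. */
        def roundTrip(value: BigDecimal, schemaScale: Int): BigDecimal =
          new BigDecimal(value.unscaledValue, schemaScale)
      
        /** Aligns the value with the scale reported by the metadata before storing.
          * setScale without a rounding mode throws if digits would have to be dropped. */
        def withMetadataScale(value: BigDecimal, schemaScale: Int): BigDecimal =
          value.setScale(schemaScale)
      }
      ```
      
      With a value of `19999` at scale 0 and a schema scale of 2, `roundTrip` yields `199.99`, while applying `withMetadataScale` first preserves `19999.00`.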
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #5833 from viirya/jdbc_decimal_precision and squashes the following commits:
      
      69bc2b5 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into jdbc_decimal_precision
      928f864 [Liang-Chi Hsieh] Add comments.
      5f9da94 [Liang-Chi Hsieh] Set up Decimal's precision and scale according to table schema instead of returned BigDecimal.
      e32c0f69
  3. May 17, 2015
    • Shuo Xiang's avatar
      [SPARK-7694] [MLLIB] Use getOrElse for getting the threshold of LR model · 775e6f99
      Shuo Xiang authored
      The `toString` method of `LogisticRegressionModel` calls the `get` method on an Option (threshold) without a safeguard. In spark-shell, the following code `val model = algorithm.run(data).clearThreshold()` in the lbfgs code will fail, as the `toString` method is called right after `clearThreshold()` to show the result in the REPL.
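      
      The difference between `get` and `getOrElse` on an empty Option can be shown with a small stand-in class. `ThresholdSketch.Model` is hypothetical and only mirrors the Option handling, not the real model class.
      
      ```scala
      object ThresholdSketch {
        // Hypothetical model class; only the Option handling matters here.
        case class Model(threshold: Option[Double]) {
          def clearThreshold(): Model = copy(threshold = None)
          // Throws NoSuchElementException once the threshold has been cleared.
          def unsafeDescription: String = s"threshold = ${threshold.get}"
          // Falls back to a placeholder instead of throwing.
          def safeDescription: String = s"threshold = ${threshold.getOrElse("None")}"
        }
      }
      ```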
      
      Author: Shuo Xiang <shuoxiangpub@gmail.com>
      
      Closes #6224 from coderxiang/getorelse and squashes the following commits:
      
      d5f53c9 [Shuo Xiang] use getOrElse for getting the threshold of LR model
      5f109b4 [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
      c5c5bfe [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
      98804c9 [Shuo Xiang] fix bug in topBykey and update test
      775e6f99
    • zsxwing's avatar
      [SPARK-7693][Core] Remove "import scala.concurrent.ExecutionContext.Implicits.global" · ff71d34e
      zsxwing authored
      Learnt a lesson from SPARK-7655: Spark should avoid using `scala.concurrent.ExecutionContext.Implicits.global` because the user may submit blocking actions to `scala.concurrent.ExecutionContext.Implicits.global` and exhaust all threads in it. This could crash Spark. So Spark should always use its own thread pools for safety.
      
      This PR removes all usages of `scala.concurrent.ExecutionContext.Implicits.global` and uses proper thread pools to replace them.
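      
      A minimal sketch of the replacement pattern, using only the Scala and Java standard libraries: futures run on an explicitly created pool instead of the global context. `DedicatedPoolSketch`, the pool size and the sleep are assumptions for illustration, not the pools this PR introduces.
      
      ```scala
      object DedicatedPoolSketch {
        import java.util.concurrent.Executors
        import scala.concurrent.{Await, ExecutionContext, Future}
        import scala.concurrent.duration._
      
        /** Runs `tasks` blocking computations on a private fixed-size pool and returns their sum. */
        def sumOnPrivatePool(tasks: Int, poolSize: Int): Int = {
          val pool = Executors.newFixedThreadPool(poolSize)
          // Explicit context: nothing here touches ExecutionContext.Implicits.global.
          implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
          try {
            val futures = (1 to tasks).map { i =>
              Future {
                Thread.sleep(10) // stands in for a blocking action
                i
              }
            }
            Await.result(Future.sequence(futures), 30.seconds).sum
          } finally {
            // Release the worker threads so the pool does not outlive its use.
            pool.shutdown()
          }
        }
      }
      ```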
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #6223 from zsxwing/SPARK-7693 and squashes the following commits:
      
      a33ff06 [zsxwing] Decrease the max thread number from 1024 to 128
      cf4b3fc [zsxwing] Remove "import scala.concurrent.ExecutionContext.Implicits.global"
      ff71d34e
    • Wenchen Fan's avatar
      [SQL] [MINOR] use catalyst type converter in ScalaUdf · 2f22424e
      Wenchen Fan authored
      It's a follow-up of https://github.com/apache/spark/pull/5154: we can speed up Scala UDF evaluation by creating the type converter in advance.
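      
      The general technique, building the converter once rather than on every evaluation, can be sketched in plain Scala. All names here (`ConverterSketch`, `createConverter`, `Udf`) are hypothetical, and the counter exists only to make the effect observable.
      
      ```scala
      object ConverterSketch {
        // Counts how many converters were built, to make the effect visible.
        var convertersBuilt = 0
      
        // Hypothetical stand-in for building a type converter, assumed to be costly.
        def createConverter(typeName: String): Any => Any = {
          convertersBuilt += 1
          typeName match {
            case "string" => (v: Any) => v.toString
            case _ => (v: Any) => v
          }
        }
      
        class Udf(f: Any => Any, inputType: String) {
          // Built once per UDF instance rather than once per evaluated row.
          private[this] val converter = createConverter(inputType)
          def eval(input: Any): Any = f(converter(input))
        }
      }
      ```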
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #6182 from cloud-fan/tmp and squashes the following commits:
      
      241cfe9 [Wenchen Fan] use converter in ScalaUdf
      2f22424e
    • Tathagata Das's avatar
      [SPARK-6514] [SPARK-5960] [SPARK-6656] [SPARK-7679] [STREAMING] [KINESIS]... · ca4257ae
      Tathagata Das authored
      [SPARK-6514] [SPARK-5960] [SPARK-6656] [SPARK-7679] [STREAMING] [KINESIS] Updates to the Kinesis API
      
      SPARK-6514 - Use correct region
      SPARK-5960 - Allow AWS Credentials to be directly passed
      SPARK-6656 - Specify kinesis application name explicitly
      SPARK-7679 - Upgrade to latest KCL and AWS SDK.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #6147 from tdas/kinesis-api-update and squashes the following commits:
      
      f23ea77 [Tathagata Das] Updated versions and updated APIs
      373b201 [Tathagata Das] Updated Kinesis API
      ca4257ae
    • Michael Armbrust's avatar
      [SPARK-7491] [SQL] Allow configuration of classloader isolation for hive · 2ca60ace
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #6167 from marmbrus/configureIsolation and squashes the following commits:
      
      6147cbe [Michael Armbrust] filter other conf
      22cc3bc7 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into configureIsolation
      07476ee [Michael Armbrust] filter empty prefixes
      dfdf19c [Michael Armbrust] [SPARK-6906][SQL] Allow configuration of classloader isolation for hive
      2ca60ace
    • Josh Rosen's avatar
      [SPARK-7686] [SQL] DescribeCommand is assigned wrong output attributes in SparkStrategies · 56456287
      Josh Rosen authored
      In `SparkStrategies`, `RunnableDescribeCommand` is called with the output attributes of the table being described rather than the attributes for the `describe` command's output.  I discovered this issue because it caused type conversion errors in some UnsafeRow conversion code that I'm writing.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #6217 from JoshRosen/SPARK-7686 and squashes the following commits:
      
      953a344 [Josh Rosen] Fix SPARK-7686 with a simple change in SparkStrategies.
      a4eec9f [Josh Rosen] Add failing regression test for SPARK-7686
      56456287
    • Josh Rosen's avatar
      [SPARK-7660] Wrap SnappyOutputStream to work around snappy-java bug · f2cc6b5b
      Josh Rosen authored
      This patch wraps `SnappyOutputStream` to ensure that `close()` is idempotent and to guard against write-after-`close()` bugs. This is a workaround for https://github.com/xerial/snappy-java/issues/107, a bug where a non-idempotent `close()` method can lead to stream corruption. We can remove this workaround if we upgrade to a snappy-java version that contains my fix for this bug, but in the meantime this patch offers a backportable Spark fix.
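
A minimal sketch of such a wrapper, assuming only the standard `java.io.OutputStream` contract, might look like the following. The class name `CloseOnceOutputStream` is hypothetical and the code is not taken from the patch.

```scala
import java.io.{IOException, OutputStream}

// Hypothetical wrapper: forwards writes, closes the underlying stream at most once.
final class CloseOnceOutputStream(underlying: OutputStream) extends OutputStream {
  private var closed = false

  override def write(b: Int): Unit = {
    if (closed) throw new IOException("Stream is closed") // guard against write-after-close
    underlying.write(b)
  }

  override def write(b: Array[Byte], off: Int, len: Int): Unit = {
    if (closed) throw new IOException("Stream is closed")
    underlying.write(b, off, len)
  }

  override def flush(): Unit = if (!closed) underlying.flush()

  // Idempotent close: the second and later calls are no-ops.
  override def close(): Unit = if (!closed) {
    closed = true
    underlying.close()
  }
}
```

This sketch is not thread-safe; it only shows how a `closed` flag makes repeated `close()` calls harmless and turns late writes into an explicit error.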
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #6176 from JoshRosen/SPARK-7660-wrap-snappy and squashes the following commits:
      
      8b77aae [Josh Rosen] Wrap SnappyOutputStream to fix SPARK-7660
      f2cc6b5b
    • Steve Loughran's avatar
      [SPARK-7669] Builds against Hadoop 2.6+ get inconsistent curator depend… · 50217667
      Steve Loughran authored
      This adds a new profile, `hadoop-2.6`, copying over the hadoop-2.4 properties, updating ZK to 3.4.6 and making the curator version a configurable option. That keeps the curator-recipes JAR in sync with that used in hadoop.
      
      There's one more option to consider: making the full curator-client version explicit with its own dependency version. This would pin down the version pulled in by the Hadoop and Hive imports.
      
      Author: Steve Loughran <stevel@hortonworks.com>
      
      Closes #6191 from steveloughran/stevel/SPARK-7669-hadoop-2.6 and squashes the following commits:
      
      e3e281a [Steve Loughran] SPARK-7669 declare the version of curator-client and curator-framework JARs
      2901ea9 [Steve Loughran] SPARK-7669 Builds against Hadoop 2.6+ get inconsistent curator dependencies
      50217667
    • Liang-Chi Hsieh's avatar
      [SPARK-7447] [SQL] Don't re-merge Parquet schema when the relation is deserialized · 33990557
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-7447
      
      `MetadataCache` in `ParquetRelation2` is annotated as `transient`. When `ParquetRelation2` is deserialized, we ask `MetadataCache` to refresh and perform schema merging again. This is time-consuming, especially when there are many Parquet files.
      
      With the new `FSBasedParquetRelation`, although `MetadataCache` is not `transient` now, `MetadataCache.refresh()` still performs schema merging again when the relation is deserialized.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #6012 from viirya/without_remerge_schema and squashes the following commits:
      
      2663957 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into without_remerge_schema
      6ac7d93 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into without_remerge_schema
      b0fc09b [Liang-Chi Hsieh] Don't generate and merge parquetSchema multiple times.
      33990557
    • scwf's avatar
      [SQL] [MINOR] Skip unresolved expression for InConversion · edf09ea1
      scwf authored
      Author: scwf <wangfei1@huawei.com>
      
      Closes #6145 from scwf/InConversion and squashes the following commits:
      
      5c8ac6b [scwf] minir fix for InConversion
      edf09ea1
    • Shivaram Venkataraman's avatar
      [MINOR] Add 1.3, 1.3.1 to master branch EC2 scripts · 1a7b9ce8
      Shivaram Venkataraman authored
      cc pwendell
      
      P.S: I can't believe this was outdated all along ?
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #6215 from shivaram/update-ec2-map and squashes the following commits:
      
      ae3937a [Shivaram Venkataraman] Add 1.3, 1.3.1 to master branch EC2 scripts
      1a7b9ce8
    • Cheng Lian's avatar
      [MINOR] [SQL] Removes an unreachable case clause · ba4f8ca0
      Cheng Lian authored
      This case clause is already covered by the one above, and generates a compilation warning.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #6214 from liancheng/remove-unreachable-code and squashes the following commits:
      
      c38ca7c [Cheng Lian] Removes an unreachable case clause
      ba4f8ca0
    • Reynold Xin's avatar
      [SPARK-7654][SQL] Move JDBC into DataFrame's reader/writer interface. · 517eb37a
      Reynold Xin authored
      Also moved all the deprecated functions into one place for SQLContext and DataFrame, and updated tests to use the new API.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6210 from rxin/df-writer-reader-jdbc and squashes the following commits:
      
      7465c2c [Reynold Xin] Fixed unit test.
      118e609 [Reynold Xin] Updated tests.
      3441b57 [Reynold Xin] Updated javadoc.
      13cdd1c [Reynold Xin] [SPARK-7654][SQL] Move JDBC into DataFrame's reader/writer interface.
      517eb37a