  1. May 19, 2015
    • Iulian Dragos's avatar
      [SPARK-7726] Fix Scaladoc false errors · 3c4c1f96
      Iulian Dragos authored
      Visibility rules for static members are different in Scala and Java, and this case requires an explicit static import. Even though these are Java files, they are run through scaladoc, which enforces Scala rules.
      
      Also reverted the commit that reverts the upgrade to 2.11.6
      
      Author: Iulian Dragos <jaguarul@gmail.com>
      
      Closes #6260 from dragos/issue/scaladoc-false-error and squashes the following commits:
      
      f2e998e [Iulian Dragos] Revert "[HOTFIX] Revert "[SPARK-7092] Update spark scala version to 2.11.6""
      0bad052 [Iulian Dragos] Fix scaladoc faux-error.
      3c4c1f96
    • Joseph K. Bradley's avatar
      [SPARK-7678] [ML] Fix default random seed in HasSeed · 7b16e9f2
      Joseph K. Bradley authored
      Changed shared param HasSeed to have default based on hashCode of class name, instead of random number.
      Also, removed fixed random seeds from Word2Vec and ALS.
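
      A minimal sketch of the idea, assuming a hypothetical `SeedDefault` trait rather than Spark's actual `HasSeed` code: deriving the default from the class name's `hashCode` gives every instance of a class the same seed on every run, unlike a freshly drawn random number.

      ```scala
      object SeedDemo {
        // Hypothetical stand-in for the shared param trait; not Spark's HasSeed.
        trait SeedDefault {
          // String.hashCode is fully specified, so the same class name
          // produces the same seed on every run and every JVM.
          def defaultSeed: Long = this.getClass.getName.hashCode.toLong
        }

        // Illustrative classes only; the names are assumptions.
        final class Word2VecLike extends SeedDefault
        final class ALSLike extends SeedDefault

        def main(args: Array[String]): Unit = {
          val first = new Word2VecLike().defaultSeed
          val second = new Word2VecLike().defaultSeed
          // Two instances of the same class share one default seed.
          println(first == second)
        }
      }
      ```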
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #6251 from jkbradley/scala-fixed-seed and squashes the following commits:
      
      0e37184 [Joseph K. Bradley] Fixed Word2VecSuite, ALSSuite in spark.ml to use original fixed random seeds
      678ec3a [Joseph K. Bradley] Removed fixed random seeds from Word2Vec and ALS. Changed shared param HasSeed to have default based on hashCode of class name, instead of random number.
      7b16e9f2
    • Joseph K. Bradley's avatar
      [SPARK-7047] [ML] ml.Model optional parent support · fb902732
      Joseph K. Bradley authored
      Made Model.parent transient.  Added Model.hasParent to test for null parent
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #5914 from jkbradley/parent-optional and squashes the following commits:
      
      d501774 [Joseph K. Bradley] Made Model.parent transient.  Added Model.hasParent to test for null parent
      fb902732
    • Dice's avatar
      [SPARK-7704] Updating Programming Guides per SPARK-4397 · 32fa611b
      Dice authored
      The change in SPARK-4397 lets the compiler find the implicit objects in SparkContext automatically, so we no longer need to import o.a.s.SparkContext._ explicitly and can remove some statements about "implicit conversions" from the latest Programming Guides (1.3.0 and higher).
      
      Author: Dice <poleon.kd@gmail.com>
      
      Closes #6234 from daisukebe/patch-1 and squashes the following commits:
      
      b77ecd9 [Dice] fix a typo
      45dfcd3 [Dice] rewording per Sean's advice
      a094bcf [Dice] Adding a note for users on any previous releases
      a29be5f [Dice] Updating Programming Guides per SPARK-4397
      32fa611b
    • Xiangrui Meng's avatar
      [SPARK-7681] [MLLIB] remove mima excludes for 1.3 · 6845cb2f
      Xiangrui Meng authored
      These excludes are unnecessary for 1.3 because the changes were made in 1.4.x.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6254 from mengxr/SPARK-7681-mima and squashes the following commits:
      
      7f0cea0 [Xiangrui Meng] remove mima excludes for 1.3
      6845cb2f
    • Saleem Ansari's avatar
      [SPARK-7723] Fix string interpolation in pipeline examples · df34793a
      Saleem Ansari authored
      https://issues.apache.org/jira/browse/SPARK-7723
      
      Author: Saleem Ansari <tuxdna@gmail.com>
      
      Closes #6258 from tuxdna/master and squashes the following commits:
      
      2bb5a42 [Saleem Ansari] Merge branch 'master' into mllib-pipeline
      e39db9c [Saleem Ansari] Fix string interpolation in pipeline examples
      df34793a
    • Patrick Wendell's avatar
      27fa88b9
    • Mike Dusenberry's avatar
      Fixing a few basic typos in the Programming Guide. · 61f164d3
      Mike Dusenberry authored
      Just a few minor fixes in the guide, so a new JIRA issue was not created per the guidelines.
      
      Author: Mike Dusenberry <dusenberrymw@gmail.com>
      
      Closes #6240 from dusenberrymw/Fix_Programming_Guide_Typos and squashes the following commits:
      
      ffa76eb [Mike Dusenberry] Fixing a few basic typos in the Programming Guide.
      61f164d3
    • Xusen Yin's avatar
      [SPARK-7581] [ML] [DOC] User guide for spark.ml PolynomialExpansion · 6008ec14
      Xusen Yin authored
      JIRA [here](https://issues.apache.org/jira/browse/SPARK-7581).
      
      CC jkbradley
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #6113 from yinxusen/SPARK-7581 and squashes the following commits:
      
      1a7d80d [Xusen Yin] merge with master
      892a8e9 [Xusen Yin] fix python 3 compatibility
      ec935bf [Xusen Yin] small fix
      3e9fa1d [Xusen Yin] delete note
      69fcf85 [Xusen Yin] simplify and add python example
      81d21dc [Xusen Yin] add programming guide for Polynomial Expansion
      40babfb [Xusen Yin] add java test suite for PolynomialExpansion
      6008ec14
    • Patrick Wendell's avatar
      23cf8971
    • Patrick Wendell's avatar
      [HOTFIX]: Java 6 Build Breaks · 9ebb44f8
      Patrick Wendell authored
      These were blocking RC1 so I fixed them manually.
      9ebb44f8
  2. May 18, 2015
    • Josh Rosen's avatar
      [SPARK-7687] [SQL] DataFrame.describe() should cast all aggregates to String · c9fa870a
      Josh Rosen authored
      In `DataFrame.describe()`, the `count` aggregate produces an integer, the `avg` and `stdev` aggregates produce doubles, and `min` and `max` aggregates can produce varying types depending on what type of column they're applied to.  As a result, we should cast all aggregate results to String so that `describe()`'s output types match its declared output schema.
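
      The type mismatch can be illustrated without Spark. The sketch below is an assumption-laden miniature, not the `describe()` implementation: `DescribeSketch` is a hypothetical name, and population standard deviation is used only for simplicity. Each statistic has a different natural type, and all of them are rendered as strings so that every row fits one declared schema.

      ```scala
      object DescribeSketch {
        // Returns (statistic name, value) pairs with every value rendered as a String.
        def describe(values: Seq[Int]): Seq[(String, String)] = {
          require(values.nonEmpty, "need at least one value")
          val count: Long = values.size.toLong           // integral result
          val mean: Double = values.sum.toDouble / count // floating-point result
          // Population variance; an assumption made for this sketch.
          val variance = values.map(v => math.pow(v - mean, 2)).sum / count
          val stddev: Double = math.sqrt(variance)
          Seq(
            "count" -> count.toString,
            "mean" -> mean.toString,
            "stddev" -> stddev.toString,
            "min" -> values.min.toString,                // follows the column type
            "max" -> values.max.toString
          )
        }
      }
      ```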
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #6218 from JoshRosen/SPARK-7687 and squashes the following commits:
      
      146b615 [Josh Rosen] Fix R test.
      2974bd5 [Josh Rosen] Cast to string type instead
      f206580 [Josh Rosen] Cast to double to fix SPARK-7687
      307ecbf [Josh Rosen] Add failing regression test for SPARK-7687
      c9fa870a
    • Daoyuan Wang's avatar
      [SPARK-7150] SparkContext.range() and SQLContext.range() · c2437de1
      Daoyuan Wang authored
      This PR is based on #6081, thanks adrian-wang.
      
      Closes #6081
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6230 from davies/range and squashes the following commits:
      
      d3ce5fe [Davies Liu] add tests
      789eda5 [Davies Liu] add range() in Python
      4590208 [Davies Liu] Merge commit 'refs/pull/6081/head' of github.com:apache/spark into range
      cbf5200 [Daoyuan Wang] let's add python support in a separate PR
      f45e3b2 [Daoyuan Wang] remove redundant toLong
      617da76 [Daoyuan Wang] fix safe marge for corner cases
      867c417 [Daoyuan Wang] fix
      13dbe84 [Daoyuan Wang] update
      bd998ba [Daoyuan Wang] update comments
      d3a0c1b [Daoyuan Wang] add range api()
      c2437de1
    • Liang-Chi Hsieh's avatar
      [SPARK-7681] [MLLIB] Add SparseVector support for gemv · d03638cc
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-7681
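
      For context, gemv computes `y := alpha * A * x + beta * y`. The following is a rough sketch of how a sparse `x` can be exploited with a dense column-major matrix; `SparseVec` and this `gemv` signature are hypothetical and are not MLlib's BLAS code.

      ```scala
      object SparseGemvSketch {
        // Hypothetical minimal sparse vector: parallel arrays of indices and values.
        final case class SparseVec(size: Int, indices: Array[Int], values: Array[Double])

        // y := alpha * A * x + beta * y, with A stored column-major (numRows x numCols).
        def gemv(alpha: Double, a: Array[Double], numRows: Int, numCols: Int,
                 x: SparseVec, beta: Double, y: Array[Double]): Unit = {
          require(a.length == numRows * numCols, "matrix size mismatch")
          require(x.size == numCols && y.length == numRows, "vector size mismatch")
          // y must always be scaled first; with beta == 0.0 this clears y.
          var i = 0
          while (i < numRows) {
            y(i) = if (beta == 0.0) 0.0 else y(i) * beta
            i += 1
          }
          // Only the columns matching stored entries of x contribute.
          var k = 0
          while (k < x.indices.length) {
            val col = x.indices(k)
            val scaled = alpha * x.values(k)
            var r = 0
            while (r < numRows) {
              y(r) += a(col * numRows + r) * scaled
              r += 1
            }
            k += 1
          }
        }
      }
      ```

      The inner loop touches one matrix column per stored entry of `x`, so the cost scales with the number of non-zeros rather than with `numCols`.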
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #6209 from viirya/sparsevector_gemv and squashes the following commits:
      
      ce0bb8b [Liang-Chi Hsieh] Still need to scal y when beta is 0.0 because it clears out y.
      b890e63 [Liang-Chi Hsieh] Do not delete multiply for DenseVector.
      57a8c1e [Liang-Chi Hsieh] Add MimaExcludes for v1.4.
      458d1ae [Liang-Chi Hsieh] List DenseMatrix.multiply and SparseMatrix.multiply to MimaExcludes too.
      054f05d [Liang-Chi Hsieh] Fix scala style.
      410381a [Liang-Chi Hsieh] Address comments. Make Matrix.multiply more generalized.
      4616696 [Liang-Chi Hsieh] Add support for SparseVector with SparseMatrix.
      5d6d07a [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into sparsevector_gemv
      c069507 [Liang-Chi Hsieh] Add SparseVector support for gemv with DenseMatrix.
      d03638cc
    • Tathagata Das's avatar
      [SPARK-7692] Updated Kinesis examples · 3a600386
      Tathagata Das authored
      - Updated Kinesis examples to use stable API
      - Cleaned up comments, etc.
      - Renamed KinesisWordCountProducerASL to KinesisWordProducerASL
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #6249 from tdas/kinesis-examples and squashes the following commits:
      
      7cc307b [Tathagata Das] More tweaks
      f080872 [Tathagata Das] More cleanup
      841987f [Tathagata Das] Small update
      011cbe2 [Tathagata Das] More fixes
      b0d74f9 [Tathagata Das] Updated examples.
      3a600386
    • jerluc's avatar
      [SPARK-7621] [STREAMING] Report Kafka errors to StreamingListeners · 0a7a94ea
      jerluc authored
      PR per [SPARK-7621](https://issues.apache.org/jira/browse/SPARK-7621), which makes both `KafkaReceiver` and `ReliableKafkaReceiver` report their errors to the `ReceiverTracker`, which in turn adds the events to the bus to fire off any registered `StreamingListener`s.
      
      Author: jerluc <jeremyalucas@gmail.com>
      
      Closes #6204 from jerluc/master and squashes the following commits:
      
      82439a5 [jerluc] [SPARK-7621] [STREAMING] Report Kafka errors to StreamingListeners
      0a7a94ea
    • Davies Liu's avatar
      [SPARK-7624] Revert #4147 · 4fb52f95
      Davies Liu authored
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6172 from davies/revert_4147 and squashes the following commits:
      
      3bfbbde [Davies Liu] Revert #4147
      4fb52f95
    • Michael Armbrust's avatar
      [SQL] Fix serializability of ORC table scan · eb4632f2
      Michael Armbrust authored
      A follow-up to #6244.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #6247 from marmbrus/fixOrcTests and squashes the following commits:
      
      e39ee1b [Michael Armbrust] [SQL] Fix serializability of ORC table scan
      eb4632f2
    • Jihong MA's avatar
      [SPARK-7063] when lz4 compression is used, it causes core dump · 6525fc0a
      Jihong MA authored
      This fix addresses an issue in lz4 1.2.0 that caused a core dump in Spark Core with the IBM JDK. The issue is fixed in lz4 1.3.0.
      
      Author: Jihong MA <linlin200605@gmail.com>
      
      Closes #6226 from JihongMA/SPARK-7063-1 and squashes the following commits:
      
      0cca781 [Jihong MA] SPARK-7063
      4559ed5 [Jihong MA] SPARK-7063
      daa520f [Jihong MA] SPARK-7063 upgrade lz4 jars
      71738ee [Jihong MA] Merge remote-tracking branch 'upstream/master'
      dfaa971 [Jihong MA] SPARK-7265 minor fix of the content
      ace454d [Jihong MA] SPARK-7265 take out PySpark on YARN limitation
      9ea0832 [Jihong MA] Merge remote-tracking branch 'upstream/master'
      d5bf3f5 [Jihong MA] Merge remote-tracking branch 'upstream/master'
      7b842e6 [Jihong MA] Merge remote-tracking branch 'upstream/master'
      9c84695 [Jihong MA] SPARK-7265 address review comment
      a399aa6 [Jihong MA] SPARK-7265 Improving documentation for Spark SQL Hive support
      6525fc0a
    • Andrew Or's avatar
      [SPARK-7501] [STREAMING] DAG visualization: show DStream operations · b93c97d7
      Andrew Or authored
      This is similar to #5999, but for streaming. Roughly 200 lines are tests.
      
      One thing to note here is that we already do some kind of scoping thing for call sites, so this patch adds the new RDD operation scoping logic in the same place. Also, this patch adds a `try finally` block to set the relevant variables in a safer way.
      
      tdas zsxwing
      
      ------------------------
      **Before**
      <img src="https://cloud.githubusercontent.com/assets/2133137/7625996/d88211b8-f9b4-11e4-90b9-e11baa52d6d7.png" width="450px"/>
      
      --------------------------
      **After**
      <img src="https://cloud.githubusercontent.com/assets/2133137/7625997/e0878f8c-f9b4-11e4-8df3-7dd611b13c87.png" width="650px"/>
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #6034 from andrewor14/dag-viz-streaming and squashes the following commits:
      
      932a64a [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
      e685df9 [Andrew Or] Rename createRDDWith
      84d0656 [Andrew Or] Review feedback
      697c086 [Andrew Or] Fix tests
      53b9936 [Andrew Or] Set scopes for foreachRDD properly
      1881802 [Andrew Or] Refactor DStream scope names again
      af4ba8d [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
      fd07d22 [Andrew Or] Make MQTT lower case
      f6de871 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
      0ca1801 [Andrew Or] Remove a few unnecessary withScopes on aliases
      fa4e5fb [Andrew Or] Pass in input stream name rather than defining it from within
      1af0b0e [Andrew Or] Fix style
      074c00b [Andrew Or] Review comments
      d25a324 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
      e4a93ac [Andrew Or] Fix tests?
      25416dc [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
      9113183 [Andrew Or] Add tests for DStream scopes
      b3806ab [Andrew Or] Fix test
      bb80bbb [Andrew Or] Fix MIMA?
      5c30360 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
      5703939 [Andrew Or] Rename operations that create InputDStreams
      7c4513d [Andrew Or] Group RDDs by DStream operations and batches
      bf0ab6e [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
      05c2676 [Andrew Or] Wrap many more methods in withScope
      c121047 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
      65ef3e9 [Andrew Or] Fix NPE
      a0d3263 [Andrew Or] Scope streaming operations instead of RDD operations
      b93c97d7
    • Michael Armbrust's avatar
      [HOTFIX] Fix ORC build break · fcf90b75
      Michael Armbrust authored
      Fix break caused by merging #6225 and #6194.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #6244 from marmbrus/fixOrcBuildBreak and squashes the following commits:
      
      b10e47b [Michael Armbrust] [HOTFIX] Fix ORC Build break
      fcf90b75
    • zsxwing's avatar
      [SPARK-7658] [STREAMING] [WEBUI] Update the mouse behaviors for the timeline graphs · 0b6f503d
      zsxwing authored
      1. If the user clicks a point for a batch, scroll down to the corresponding batch row and highlight it. Restore the batch row after 3 seconds if necessary.
      
      2. Add "#batches" in the histogram graphs.
      
      ![screen shot 2015-05-14 at 7 36 19 pm](https://cloud.githubusercontent.com/assets/1000778/7646108/84f4a014-fa73-11e4-8c13-1903d267e60f.png)
      
      ![screen shot 2015-05-14 at 7 36 53 pm](https://cloud.githubusercontent.com/assets/1000778/7646109/8b11154a-fa73-11e4-820b-8ece9fa6ee3e.png)
      
      ![screen shot 2015-05-14 at 7 36 34 pm](https://cloud.githubusercontent.com/assets/1000778/7646111/93828272-fa73-11e4-89f8-580670144d3c.png)
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #6168 from zsxwing/SPARK-7658 and squashes the following commits:
      
      c242b00 [zsxwing] Change 5 seconds to 3 seconds
      31fd0aa [zsxwing] Remove the mouseover highlight feature
      06c6f6f [zsxwing] Merge branch 'master' into SPARK-7658
      2eaff06 [zsxwing] Merge branch 'master' into SPARK-7658
      108d56c [zsxwing] Update the mouse behaviors for the timeline graphs
      0b6f503d
    • Davies Liu's avatar
      [SPARK-6216] [PYSPARK] check python version of worker with driver · 32fbd297
      Davies Liu authored
      This PR reverts #5404 and instead passes the driver's Python version into the JVM and checks it in the worker before deserializing the closure, so that it can work with different major versions of Python.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6203 from davies/py_version and squashes the following commits:
      
      b8fb76e [Davies Liu] fix test
      6ce5096 [Davies Liu] use string for version
      47c6278 [Davies Liu] check python version of worker with driver
      32fbd297
    • Cheng Lian's avatar
      [SPARK-7673] [SQL] WIP: HadoopFsRelation and ParquetRelation2 performance optimizations · 9dadf019
      Cheng Lian authored
      This PR introduces several performance optimizations to `HadoopFsRelation` and `ParquetRelation2`:
      
      1.  Moving `FileStatus` listing from `DataSourceStrategy` into a cache within `HadoopFsRelation`.
      
          This new cache generalizes and replaces the one used in `ParquetRelation2`.
      
          This also introduces an interface change: to reuse cached `FileStatus` objects, `HadoopFsRelation.buildScan` methods now receive `Array[FileStatus]` instead of `Array[String]`.
      
      1.  When Parquet task side metadata reading is enabled, skip reading row group information when reading Parquet footers.
      
          This is basically what PR #5334 does. Also, we now use `ParquetFileReader.readAllFootersInParallel` to read footers in parallel.
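
      The caching idea in the first item can be sketched roughly as follows. This is an illustration under stated assumptions, not the `HadoopFsRelation` code: `FileInfo` stands in for Hadoop's `FileStatus`, paths are plain strings, and the listing function is injected so the sketch runs without a file system. Files are listed once, grouped by parent directory, and later lookups are served from the map.

      ```scala
      object FileStatusCacheSketch {
        // Hypothetical stand-in for Hadoop's FileStatus; only the fields needed here.
        final case class FileInfo(path: String, length: Long)

        // Parent directory of a slash-separated path.
        private def parentOf(path: String): String = {
          val i = path.lastIndexOf('/')
          if (i <= 0) "/" else path.substring(0, i)
        }

        final class FileStatusCache(list: () => Seq[FileInfo]) {
          private var byParent: Map[String, Seq[FileInfo]] = Map.empty
          private var listings = 0

          // Performs the (expensive) listing once and indexes it by parent directory.
          def refresh(): Unit = {
            listings += 1
            byParent = list().groupBy(f => parentOf(f.path))
          }

          // Number of times the underlying listing was actually performed.
          def listingCount: Int = listings

          // Served from the cache; no further listing calls.
          def filesUnder(dir: String): Seq[FileInfo] = byParent.getOrElse(dir, Seq.empty)
        }
      }
      ```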
      
      Another optimization under consideration is, instead of asking `HadoopFsRelation.buildScan` to return an `RDD[Row]` for a single selected partition and then unioning them all, to ask it to return an `RDD[Row]` for all selected partitions. This optimization is based on the fact that the Hadoop configuration broadcasting used in `NewHadoopRDD` takes 34% of the time in the following microbenchmark.  However, this complicates data source user code because user code must merge partition values manually.

      To check the cost of broadcasting in `NewHadoopRDD`, I also ran the microbenchmark after removing the `broadcast` call in `NewHadoopRDD`.  All results are shown below.
      
      ### Microbenchmark
      
      #### Preparation code
      
      Generating a partitioned table with 5k partitions (500 batches of 10), 1k rows per partition:
      
      ```scala
      import sqlContext._
      import sqlContext.implicits._
      
      for (n <- 0 until 500) {
        val data = for {
          p <- (n * 10) until ((n + 1) * 10)
          i <- 0 until 1000
        } yield (i, f"val_$i%04d", f"$p%04d")
      
        data.
          toDF("a", "b", "p").
          write.
          partitionBy("p").
          mode("append").
          parquet(path)
      }
      ```
      
      #### Benchmarking code
      
      ```scala
      import sqlContext._
      import sqlContext.implicits._
      
      import org.apache.spark.sql.types._
      import com.google.common.base.Stopwatch
      
      val path = "hdfs://localhost:9000/user/lian/5k"
      
      def benchmark(n: Int)(f: => Unit) {
        val stopwatch = new Stopwatch()
      
        def run() = {
          stopwatch.reset()
          stopwatch.start()
          f
          stopwatch.stop()
          stopwatch.elapsedMillis()
        }
      
        val records = (0 until n).map(_ => run())
      
        (0 until n).foreach(i => println(s"Round $i: ${records(i)} ms"))
        println(s"Average: ${records.sum / n.toDouble} ms")
      }
      
      benchmark(3) { read.parquet(path).explain(extended = true) }
      ```
      
      #### Results
      
      Before:
      
      ```
      Round 0: 72528 ms
      Round 1: 68938 ms
      Round 2: 65372 ms
      Average: 68946.0 ms
      ```
      
      After:
      
      ```
      Round 0: 59499 ms
      Round 1: 53645 ms
      Round 2: 53844 ms
      Round 3: 49093 ms
      Round 4: 50555 ms
      Average: 53327.2 ms
      ```
      
      Also removing Hadoop configuration broadcasting:
      
      (Note that I was testing on a local laptop, thus network cost is pretty low.)
      
      ```
      Round 0: 15806 ms
      Round 1: 14394 ms
      Round 2: 14699 ms
      Round 3: 15334 ms
      Round 4: 14123 ms
      Average: 14871.2 ms
      ```
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #6225 from liancheng/spark-7673 and squashes the following commits:
      
      2d58a2b [Cheng Lian] Skips reading row group information when using task side metadata reading
      7aa3748 [Cheng Lian] Optimizes FileStatusCache by introducing a map from parent directories to child files
      ba41250 [Cheng Lian] Reuses HadoopFsRelation FileStatusCache in ParquetRelation2
      3d278f7 [Cheng Lian] Fixes a bug when reading a single Parquet data file
      b84612a [Cheng Lian] Fixes Scala style issue
      6a08b02 [Cheng Lian] WIP: Moves file status cache into HadoopFSRelation
      9dadf019
    • Yin Huai's avatar
      [SPARK-7567] [SQL] [follow-up] Use a new flag to set output committer based on mapreduce apis · 530397ba
      Yin Huai authored
      cc liancheng marmbrus
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #6130 from yhuai/directOutput and squashes the following commits:
      
      312b07d [Yin Huai] A data source can use spark.sql.sources.outputCommitterClass to override the output committer.
      530397ba
    • Wenchen Fan's avatar
      [SPARK-7269] [SQL] Incorrect analysis for aggregation(use semanticEquals) · 103c863c
      Wenchen Fan authored
      A modified version of https://github.com/apache/spark/pull/6110 that uses `semanticEquals` to make it more efficient.
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #6173 from cloud-fan/7269 and squashes the following commits:
      
      e4a3cc7 [Wenchen Fan] address comments
      cc02045 [Wenchen Fan] consider elements length equal
      d7ff8f4 [Wenchen Fan] fix 7269
      103c863c
    • scwf's avatar
      [SPARK-7631] [SQL] treenode argString should not print children · fc2480ed
      scwf authored
      spark-sql>
      > explain extended
      > select * from (
      > select key from src union all
      > select key from src) t;
      
      Currently, the Spark plan prints children in argString:
      ```
      == Physical Plan ==
      Union[ HiveTableScan key#1, (MetastoreRelation default, src, None), None,
      HiveTableScan key#3, (MetastoreRelation default, src, None), None]
      HiveTableScan key#1, (MetastoreRelation default, src, None), None
      HiveTableScan key#3, (MetastoreRelation default, src, None), None
      ```
      
      after this patch:
      ```
      == Physical Plan ==
      Union
       HiveTableScan [key#1], (MetastoreRelation default, src, None), None
       HiveTableScan [key#3], (MetastoreRelation default, src, None), None
      ```
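
      The intended behaviour can be mimicked with a toy tree. This is a sketch only: `Node` is a hypothetical miniature, not Catalyst's `TreeNode`. `argString` covers the node's own arguments, and children appear only as indented lines of the tree string.

      ```scala
      object TreeStringSketch {
        // Hypothetical miniature of a plan node; not Catalyst's TreeNode.
        final case class Node(name: String, args: Seq[String], children: Seq[Node] = Nil) {
          // Only non-child arguments belong in the one-line description.
          def argString: String = args.mkString(", ")

          def simpleString: String = if (args.isEmpty) name else s"$name $argString"

          // One line per node, children indented one space deeper than their parent.
          def treeString: String = lines(0).mkString("\n")

          private def lines(depth: Int): Seq[String] =
            ((" " * depth) + simpleString) +: children.flatMap(_.lines(depth + 1))
        }
      }
      ```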
      
      I have tested this locally
      
      Author: scwf <wangfei1@huawei.com>
      
      Closes #6144 from scwf/fix-argString and squashes the following commits:
      
      1a642e0 [scwf] fix treenode argString
      fc2480ed
    • Zhan Zhang's avatar
      [SPARK-2883] [SQL] ORC data source for Spark SQL · aa31e431
      Zhan Zhang authored
      This PR updates PR #6135 authored by zhzhan from Hortonworks.
      
      ----
      
      This PR implements a Spark SQL data source for accessing ORC files.
      
      > **NOTE**
      >
      > Although ORC is now an Apache TLP, the codebase is still tightly coupled with Hive.  That's why the new ORC data source is under the `org.apache.spark.sql.hive` package, and must be used with `HiveContext`.  However, it doesn't require an existing Hive installation to access ORC files.
      
      1.  Saving/loading ORC files without contacting Hive metastore
      
      1.  Support for complex data types (i.e. array, map, and struct)
      
      1.  Aware of common optimizations provided by Spark SQL:
      
          - Column pruning
          - Partition pruning
          - Filter push-down
      
      1.  Schema evolution support
      1.  Hive metastore table conversion
      
      This PR also includes initial work done by scwf from Huawei (PR #3753).
      
      Author: Zhan Zhang <zhazhan@gmail.com>
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #6194 from liancheng/polishing-orc and squashes the following commits:
      
      55ecd96 [Cheng Lian] Reorganizes ORC test suites
      d4afeed [Cheng Lian] Addresses comments
      21ada22 [Cheng Lian] Adds @since and @Experimental annotations
      128bd3b [Cheng Lian] ORC filter bug fix
      d734496 [Cheng Lian] Polishes the ORC data source
      2650a42 [Zhan Zhang] resolve review comments
      3c9038e [Zhan Zhang] resolve review comments
      7b3c7c5 [Zhan Zhang] save mode fix
      f95abfd [Zhan Zhang] reuse test suite
      7cc2c64 [Zhan Zhang] predicate fix
      4e61c16 [Zhan Zhang] minor change
      305418c [Zhan Zhang] orc data source support
      aa31e431
    • Xiangrui Meng's avatar
      [SPARK-7380] [MLLIB] pipeline stages should be copyable in Python · 9c7e802a
      Xiangrui Meng authored
      This PR makes pipeline stages in Python copyable and hence simplifies some implementations. It also includes the following changes:
      
      1. Rename `paramMap` and `defaultParamMap` to `_paramMap` and `_defaultParamMap`, respectively.
      2. Accept a list of param maps in `fit`.
      3. Use parent uid and name to identify param.
      
      jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #6088 from mengxr/SPARK-7380 and squashes the following commits:
      
      413c463 [Xiangrui Meng] remove unnecessary doc
      4159f35 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7380
      611c719 [Xiangrui Meng] fix python style
      68862b8 [Xiangrui Meng] update _java_obj initialization
      927ad19 [Xiangrui Meng] fix ml/tests.py
      0138fc3 [Xiangrui Meng] update feature transformers and fix a bug in RegexTokenizer
      9ca44fb [Xiangrui Meng] simplify Java wrappers and add tests
      c7d84ef [Xiangrui Meng] update ml/tests.py to test copy params
      7e0d27f [Xiangrui Meng] merge master
      46840fb [Xiangrui Meng] update wrappers
      b6db1ed [Xiangrui Meng] update all self.paramMap to self._paramMap
      46cb6ed [Xiangrui Meng] merge master
      a163413 [Xiangrui Meng] fix style
      1042e80 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7380
      9630eae [Xiangrui Meng] fix Identifiable._randomUID
      13bd70a [Xiangrui Meng] update ml/tests.py
      64a536c [Xiangrui Meng] use _fit/_transform/_evaluate to simplify the impl
      02abf13 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into copyable-python
      66ce18c [Joseph K. Bradley] some cleanups before sending to Xiangrui
      7431272 [Joseph K. Bradley] Rebased with master
      9c7e802a
    • Wenchen Fan's avatar
      [SQL] [MINOR] [THIS] use private for internal field in ScalaUdf · 56ede884
      Wenchen Fan authored
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #6235 from cloud-fan/tmp and squashes the following commits:
      
      8f16367 [Wenchen Fan] use private[this]
      56ede884
    • Cheng Lian's avatar
      [SPARK-7570] [SQL] Ignores _temporary during partition discovery · 010a1c27
      Cheng Lian authored
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #6091 from liancheng/spark-7570 and squashes the following commits:
      
      8ff07e8 [Cheng Lian] Ignores _temporary during partition discovery
      010a1c27
    • Rene Treffer's avatar
      [SPARK-6888] [SQL] Make the jdbc driver handling user-definable · e1ac2a95
      Rene Treffer authored
      Replace the DriverQuirks with JdbcDialect(s) (and MySQLDialect/PostgresDialect)
      and allow developers to change the dialects on the fly (for new JDBCRDDs only).
      
      Some types (like an unsigned 64-bit number) cannot be trivially mapped to Java.
      The status quo is that the RDD will fail to load.
      This patch makes it possible to override the type mapping to read e.g.
      64-bit numbers as strings and handle them afterwards in software.
      
      JDBCSuite has an example that maps all types to String, which should always
      work (at the cost of extra code afterwards).
      
      As a side effect it should now be possible to develop simple dialects
      out-of-tree and even with spark-shell.
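
      The shape of such a dialect can be sketched as below. Everything here is a simplified, hypothetical stand-in (`Dialect`, `ColumnType`, `register`, and `resolve` are illustrative names, and the real `JdbcDialect` API differs in detail); it also assumes the driver reports a type name such as `BIGINT UNSIGNED`. The point is only that a registered dialect can override the default mapping for URLs it claims to handle.

      ```scala
      import java.sql.Types

      object DialectSketch {
        // Hypothetical, simplified column types for the sketch.
        sealed trait ColumnType
        case object LongCol extends ColumnType
        case object StringCol extends ColumnType

        // Hypothetical dialect interface; not Spark's JdbcDialect.
        trait Dialect {
          def canHandle(url: String): Boolean
          // None means "fall back to the default mapping".
          def columnType(sqlType: Int, typeName: String): Option[ColumnType]
        }

        // Reads unsigned 64-bit columns as strings instead of failing.
        object UnsignedBigIntAsString extends Dialect {
          def canHandle(url: String): Boolean = url.startsWith("jdbc:mysql")
          def columnType(sqlType: Int, typeName: String): Option[ColumnType] =
            if (sqlType == Types.BIGINT && typeName.toUpperCase.contains("UNSIGNED")) Some(StringCol)
            else None
        }

        private var dialects: List[Dialect] = Nil

        // Most recently registered dialects take precedence.
        def register(d: Dialect): Unit = { dialects = d :: dialects.filterNot(_ == d) }

        private def default(sqlType: Int): ColumnType =
          if (sqlType == Types.BIGINT) LongCol else StringCol

        def resolve(url: String, sqlType: Int, typeName: String): ColumnType =
          dialects.find(_.canHandle(url))
            .flatMap(_.columnType(sqlType, typeName))
            .getOrElse(default(sqlType))
      }
      ```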
      
      Author: Rene Treffer <treffer@measite.de>
      
      Closes #5555 from rtreffer/jdbc-dialects and squashes the following commits:
      
      3cbafd7 [Rene Treffer] [SPARK-6888] ignore classes belonging to changed API in MIMA report
      fe7e2e8 [Rene Treffer] [SPARK-6888] Make the jdbc driver handling user-definable
      e1ac2a95
    • Andrew Or's avatar
      [SPARK-7627] [SPARK-7472] DAG visualization: style skipped stages · 563bfcc1
      Andrew Or authored
      This patch fixes two things:
      
      **SPARK-7627.** Cached RDDs no longer light up on the job page. This is a simple fix.
      **SPARK-7472.** Display skipped stages differently from normal stages.
      
      The latter is a major UX issue. Because we link the job viz to the stage viz even for skipped stages, the user may inadvertently click into the stage page of a skipped stage, which is empty.
      
      -------------------
      <img src="https://cloud.githubusercontent.com/assets/2133137/7675241/de1a3da6-fcea-11e4-8101-88055cef78c5.png" width="300px" />
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #6171 from andrewor14/dag-viz-skipped and squashes the following commits:
      
      f261797 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-skipped
      0eda358 [Andrew Or] Tweak skipped stage border color
      c604150 [Andrew Or] Tweak grayscale colors
      7010676 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-skipped
      762b541 [Andrew Or] Use special prefix for stage clusters to avoid collisions
      51c95b9 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-skipped
      b928cd4 [Andrew Or] Fix potential leak + write tests for it
      7c4c364 [Andrew Or] Show skipped stages differently
      7cc34ce [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-skipped
      c121fa2 [Andrew Or] Fix cache color
      563bfcc1
    • Vincenzo Selvaggio's avatar
      [SPARK-7272] [MLLIB] User guide for PMML model export · 814b3dab
      Vincenzo Selvaggio authored
      https://issues.apache.org/jira/browse/SPARK-7272
      
      Author: Vincenzo Selvaggio <vselvaggio@hotmail.it>
      
      Closes #6219 from selvinsource/mllib_pmml_model_export_SPARK-7272 and squashes the following commits:
      
      c866fb8 [Vincenzo Selvaggio] Update mllib-pmml-model-export.md
      1beda98 [Vincenzo Selvaggio] [SPARK-7272] Initial user guide for pmml export
      d670662 [Vincenzo Selvaggio] Update mllib-pmml-model-export.md
      2731375 [Vincenzo Selvaggio] Update mllib-pmml-model-export.md
      680dc33 [Vincenzo Selvaggio] Update mllib-pmml-model-export.md
      2e298b5 [Vincenzo Selvaggio] Update mllib-pmml-model-export.md
      a932f51 [Vincenzo Selvaggio] Create mllib-pmml-model-export.md
      814b3dab
    • Xiangrui Meng's avatar
      [SPARK-6657] [PYSPARK] Fix doc warnings · 1ecfac6e
      Xiangrui Meng authored
      Fixed the following warnings in `make clean html` under `python/docs`:
      
      ~~~
      /Users/meng/src/spark/python/pyspark/mllib/evaluation.py:docstring of pyspark.mllib.evaluation.RankingMetrics.ndcgAt:3: ERROR: Unexpected indentation.
      /Users/meng/src/spark/python/pyspark/mllib/evaluation.py:docstring of pyspark.mllib.evaluation.RankingMetrics.ndcgAt:4: WARNING: Block quote ends without a blank line; unexpected unindent.
      /Users/meng/src/spark/python/pyspark/mllib/fpm.py:docstring of pyspark.mllib.fpm.FPGrowth.train:3: ERROR: Unexpected indentation.
      /Users/meng/src/spark/python/pyspark/mllib/fpm.py:docstring of pyspark.mllib.fpm.FPGrowth.train:4: WARNING: Block quote ends without a blank line; unexpected unindent.
      /Users/meng/src/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.replace:16: WARNING: Field list ends without a blank line; unexpected unindent.
      /Users/meng/src/spark/python/pyspark/streaming/kafka.py:docstring of pyspark.streaming.kafka.KafkaUtils.createRDD:8: ERROR: Unexpected indentation.
      /Users/meng/src/spark/python/pyspark/streaming/kafka.py:docstring of pyspark.streaming.kafka.KafkaUtils.createRDD:9: WARNING: Block quote ends without a blank line; unexpected unindent.
      ~~~
      
      davies
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6221 from mengxr/SPARK-6657 and squashes the following commits:
      
      e3f83fe [Xiangrui Meng] fix sql and streaming doc warnings
      2b4371e [Xiangrui Meng] fix mllib python doc warnings
      1ecfac6e
    • Liang-Chi Hsieh's avatar
      [SPARK-7299][SQL] Set precision and scale for Decimal according to JDBC... · e32c0f69
      Liang-Chi Hsieh authored
      [SPARK-7299][SQL] Set precision and scale for Decimal according to JDBC metadata instead of returned BigDecimal
      
      JIRA: https://issues.apache.org/jira/browse/SPARK-7299
      
      When connecting to an Oracle database through JDBC, the precision and scale of the `BigDecimal` object returned by `ResultSet.getBigDecimal` do not always match the table schema reported by `ResultSetMetaData.getPrecision` and `ResultSetMetaData.getScale`.
      
      For example, if you insert a value like `19999` into a column of type `NUMBER(12, 2)`, you get back a `BigDecimal` object with a scale of 0, while the dataframe schema has the correct type `DecimalType(12, 2)`. As a result, after you save the dataframe to a parquet file and read it back, you get the wrong result `199.99`.
      
      Because the problem is reported only on JDBC connections to Oracle databases, it might be difficult to add a test case for it. According to the user's test on JIRA, however, this change solves the problem.
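
The arithmetic behind the wrong `199.99` can be illustrated with plain `java.math.BigDecimal`. This is a minimal sketch of one plausible reading of the report, not Spark's actual code: the object name `DecimalScaleSketch` and the helpers `misread` and `rescale` are hypothetical, and it assumes the error arises when the unscaled digits of a scale-0 value are interpreted with the schema's scale of 2.

```scala
import java.math.{BigDecimal => JBigDecimal}

// Hypothetical illustration only; none of these names exist in Spark.
object DecimalScaleSketch {
  // The value as the commit message describes it coming back: 19999 with scale 0.
  val returned: JBigDecimal = new JBigDecimal("19999")

  // Assumed failure mode: keep the unscaled digits (19999) but apply the
  // schema's scale, which shifts the decimal point two places to the left.
  def misread(value: JBigDecimal, schemaScale: Int): JBigDecimal =
    new JBigDecimal(value.unscaledValue(), schemaScale)

  // Rescaling first keeps the numeric value and only changes its representation.
  def rescale(value: JBigDecimal, schemaScale: Int): JBigDecimal =
    value.setScale(schemaScale)
}
```

Under that assumption, `misread(returned, 2)` yields `199.99`, while `rescale(returned, 2)` yields `19999.00`, which is why aligning the value with the scale reported by the metadata avoids the corruption.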
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #5833 from viirya/jdbc_decimal_precision and squashes the following commits:
      
      69bc2b5 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into jdbc_decimal_precision
      928f864 [Liang-Chi Hsieh] Add comments.
      5f9da94 [Liang-Chi Hsieh] Set up Decimal's precision and scale according to table schema instead of returned BigDecimal.
      e32c0f69
  3. May 17, 2015
    • Shuo Xiang's avatar
      [SPARK-7694] [MLLIB] Use getOrElse for getting the threshold of LR model · 775e6f99
      Shuo Xiang authored
      The `toString` method of `LogisticRegressionModel` calls `get` on an Option (the threshold) without a safeguard. In spark-shell, code such as `val model = algorithm.run(data).clearThreshold()` in the LBFGS code fails, because the REPL calls `toString` right after `clearThreshold()` to display the result.
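
The difference between `get` and `getOrElse` on an empty `Option` can be sketched as follows. The `Model` case class here is a hypothetical stand-in, not `LogisticRegressionModel`, and the method names `unsafeDescribe` and `safeDescribe` are invented for illustration.

```scala
// Hypothetical stand-in for a model with an optional threshold.
object ThresholdSketch {
  final case class Model(threshold: Option[Double]) {
    // Mirrors the idea of clearing the threshold: the Option becomes None.
    def clearThreshold(): Model = copy(threshold = None)

    // Unsafe: Option.get throws NoSuchElementException when the Option is None.
    def unsafeDescribe: String = s"threshold = ${threshold.get}"

    // Safe: getOrElse supplies a fallback instead of throwing.
    def safeDescribe: String =
      s"threshold = ${threshold.map(_.toString).getOrElse("None")}"
  }
}
```

With this sketch, calling `unsafeDescribe` on a model whose threshold was cleared throws, whereas `safeDescribe` returns `threshold = None`.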
      
      Author: Shuo Xiang <shuoxiangpub@gmail.com>
      
      Closes #6224 from coderxiang/getorelse and squashes the following commits:
      
      d5f53c9 [Shuo Xiang] use getOrElse for getting the threshold of LR model
      5f109b4 [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
      c5c5bfe [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
      98804c9 [Shuo Xiang] fix bug in topBykey and update test
      775e6f99
    • zsxwing's avatar
      [SPARK-7693][Core] Remove "import scala.concurrent.ExecutionContext.Implicits.global" · ff71d34e
      zsxwing authored
      A lesson learned from SPARK-7655: Spark should avoid using `scala.concurrent.ExecutionContext.Implicits.global`, because users may submit blocking actions to it and exhaust all of its threads, which could crash Spark. Spark should therefore always use its own thread pools for safety.
      
      This PR removes all usages of `scala.concurrent.ExecutionContext.Implicits.global` and uses proper thread pools to replace them.
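
As a rough illustration of the pattern (not the code from this PR), futures can run on a dedicated, bounded pool instead of the global execution context. The object name `DedicatedPoolSketch`, the method `sumOnDedicatedPool`, and the pool size are assumptions made for this sketch.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Hypothetical sketch: run futures on a private pool rather than on
// scala.concurrent.ExecutionContext.Implicits.global.
object DedicatedPoolSketch {
  def sumOnDedicatedPool(values: Seq[Int], threads: Int): Int = {
    // A private, bounded pool: blocking work submitted elsewhere cannot starve it.
    val pool = Executors.newFixedThreadPool(threads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      // Each element is doubled in its own future on the private pool.
      val futures = values.map(v => Future(v * 2))
      Await.result(Future.sequence(futures), 10.seconds).sum
    } finally {
      // Release the threads once the work is done.
      pool.shutdown()
    }
  }
}
```

The design point is isolation: work scheduled on this pool does not compete with whatever user code submits to the shared global context.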
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #6223 from zsxwing/SPARK-7693 and squashes the following commits:
      
      a33ff06 [zsxwing] Decrease the max thread number from 1024 to 128
      cf4b3fc [zsxwing] Remove "import scala.concurrent.ExecutionContext.Implicits.global"
      ff71d34e
    • Wenchen Fan's avatar
      [SQL] [MINOR] use catalyst type converter in ScalaUdf · 2f22424e
      Wenchen Fan authored
      This is a follow-up to https://github.com/apache/spark/pull/5154. We can speed up Scala UDF evaluation by creating the type converter in advance.
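
The general idea of creating a converter once, rather than on every evaluation, can be sketched as below. Everything here is hypothetical: `createConverter` is an invented helper that stands in for whatever converter construction the real code performs, and it is not the Catalyst API.

```scala
// Hypothetical sketch of hoisting converter creation out of the per-row path.
object ConverterSketch {
  // Invented helper: picks a conversion function for a type name.
  def createConverter(typeName: String): Any => Any = typeName match {
    case "string" => (v: Any) => v.toString
    case _        => (v: Any) => v
  }

  // Slower shape: the converter is rebuilt for every row.
  def evalPerRow(rows: Seq[Int], typeName: String)(f: Int => Any): Seq[Any] =
    rows.map(r => createConverter(typeName)(f(r)))

  // Faster shape: the converter is built once and reused for every row.
  def evalHoisted(rows: Seq[Int], typeName: String)(f: Int => Any): Seq[Any] = {
    val convert = createConverter(typeName)
    rows.map(r => convert(f(r)))
  }
}
```

Both shapes return the same results; the second simply avoids repeating the converter construction for each input row.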
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #6182 from cloud-fan/tmp and squashes the following commits:
      
      241cfe9 [Wenchen Fan] use converter in ScalaUdf
      2f22424e
    • Tathagata Das's avatar
      [SPARK-6514] [SPARK-5960] [SPARK-6656] [SPARK-7679] [STREAMING] [KINESIS]... · ca4257ae
      Tathagata Das authored
      [SPARK-6514] [SPARK-5960] [SPARK-6656] [SPARK-7679] [STREAMING] [KINESIS] Updates to the Kinesis API
      
      SPARK-6514 - Use correct region
      SPARK-5960 - Allow AWS Credentials to be directly passed
      SPARK-6656 - Specify kinesis application name explicitly
      SPARK-7679 - Upgrade to latest KCL and AWS SDK.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #6147 from tdas/kinesis-api-update and squashes the following commits:
      
      f23ea77 [Tathagata Das] Updated versions and updated APIs
      373b201 [Tathagata Das] Updated Kinesis API
      ca4257ae