  1. Feb 10, 2015
    • [SPARK-5704] [SQL] [PySpark] createDataFrame from RDD with columns · ea602840
      Davies Liu authored
      Deprecate inferSchema() and applySchema(); use createDataFrame() instead, which takes an optional `schema` to create a DataFrame from an RDD. The `schema` can be a StructType or a list of column names.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4498 from davies/create and squashes the following commits:
      
      08469c1 [Davies Liu] remove Scala/Java API for now
      c80a7a9 [Davies Liu] fix hive test
      d1bd8f2 [Davies Liu] cleanup applySchema
      9526e97 [Davies Liu] createDataFrame from RDD with columns
      ea602840
    • [SPARK-5683] [SQL] Avoid multiple json generator created · a60aea86
      Cheng Hao authored
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #4468 from chenghao-intel/json and squashes the following commits:
      
      aeb7801 [Cheng Hao] avoid multiple json generator created
      a60aea86
    • [SQL] Add an exception for analysis errors. · 6195e247
      Michael Armbrust authored
      Also start from the bottom so we show the first error instead of the top error.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #4439 from marmbrus/analysisException and squashes the following commits:
      
      45862a0 [Michael Armbrust] fix hive test
      a773bba [Michael Armbrust] Merge remote-tracking branch 'origin/master' into analysisException
      f88079f [Michael Armbrust] update more cases
      fede90a [Michael Armbrust] newline
      fbf4bc3 [Michael Armbrust] move to sql
      6235db4 [Michael Armbrust] [SQL] Add an exception for analysis errors.
      6195e247
    • [SPARK-5658][SQL] Finalize DDL and write support APIs · aaf50d05
      Yin Huai authored
      https://issues.apache.org/jira/browse/SPARK-5658
      
      Author: Yin Huai <yhuai@databricks.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Michael Armbrust <michael@databricks.com>
      
      Closes #4446 from yhuai/writeSupportFollowup and squashes the following commits:
      
      f3a96f7 [Yin Huai] davies's comments.
      225ff71 [Yin Huai] Use Scala TestHiveContext to initialize the Python HiveContext in Python tests.
      2306f93 [Yin Huai] Style.
      2091fcd [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupportFollowup
      537e28f [Yin Huai] Correctly clean up temp data.
      ae4649e [Yin Huai] Fix Python test.
      609129c [Yin Huai] Doc format.
      92b6659 [Yin Huai] Python doc and other minor updates.
      cbc717f [Yin Huai] Rename dataSourceName to source.
      d1c12d3 [Yin Huai] No need to delete the duplicate rule since it has been removed in master.
      22cfa70 [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupportFollowup
      d91ecb8 [Yin Huai] Fix test.
      4c76d78 [Yin Huai] Simplify APIs.
      3abc215 [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupportFollowup
      0832ce4 [Yin Huai] Fix test.
      98e7cdb [Yin Huai] Python style.
      2bf44ef [Yin Huai] Python APIs.
      c204967 [Yin Huai] Format
      a10223d [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupportFollowup
      9ff97d8 [Yin Huai] Add SaveMode to saveAsTable.
      9b6e570 [Yin Huai] Update doc.
      c2be775 [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupportFollowup
      99950a2 [Yin Huai] Use Java enum for SaveMode.
      4679665 [Yin Huai] Remove duplicate rule.
      77d89dc [Yin Huai] Update doc.
      e04d908 [Yin Huai] Move import and add (Scala-specific) to scala APIs.
      cf5703d [Yin Huai] Add checkAnswer to Java tests.
      7db95ff [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupportFollowup
      6dfd386 [Yin Huai] Add java test.
      f2f33ef [Yin Huai] Fix test.
      e702386 [Yin Huai] Apache header.
      b1e9b1b [Yin Huai] Format.
      ed4e1b4 [Yin Huai] Merge remote-tracking branch 'upstream/master' into writeSupportFollowup
      af9e9b3 [Yin Huai] DDL and write support API followup.
      2a6213a [Yin Huai] Update API names.
      e6a0b77 [Yin Huai] Update test.
      43bae01 [Yin Huai] Remove createTable from HiveContext.
      5ffc372 [Yin Huai] Add more load APIs to SQLContext.
      5390743 [Yin Huai] Add more save APIs to DataFrame.
      aaf50d05
    • [SPARK-5493] [core] Add option to impersonate user. · ed167e70
      Marcelo Vanzin authored
      Hadoop has a feature that allows users to impersonate other users
      when submitting applications or talking to HDFS, for example. These
      impersonated users are generally referred to as "proxy users".
      
      Services such as Oozie or Hive use this feature to run applications
      as the requesting user.
      
      This change makes SparkSubmit accept a new command line option to
      run the application as a proxy user. It also fixes the plumbing
      of the user name through the UI (and a couple of other places) to
      refer to the correct user running the application, which can be
      different than `sys.props("user.name")` even without proxies (e.g.
      when using kerberos).
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #4405 from vanzin/SPARK-5493 and squashes the following commits:
      
      df82427 [Marcelo Vanzin] Clarify the reason for the special exception handling.
      05bfc08 [Marcelo Vanzin] Remove unneeded annotation.
      4840de9 [Marcelo Vanzin] Review feedback.
      8af06ff [Marcelo Vanzin] Fix usage string.
      2e4fa8f [Marcelo Vanzin] Merge branch 'master' into SPARK-5493
      b6c947d [Marcelo Vanzin] Merge branch 'master' into SPARK-5493
      0540d38 [Marcelo Vanzin] [SPARK-5493] [core] Add option to impersonate user.
      ed167e70
    • [SQL] Make Options in the data source API CREATE TABLE statements optional. · e28b6bdb
      Yin Huai authored
      Users will not need to put `Options()` in a CREATE TABLE statement when no options are provided.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #4515 from yhuai/makeOptionsOptional and squashes the following commits:
      
      1a898d3 [Yin Huai] Make options optional.
      e28b6bdb
    • [SPARK-5725] [SQL] Fixes ParquetRelation2.equals · 2d50a010
      Cheng Lian authored
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #4513 from liancheng/spark-5725 and squashes the following commits:
      
      bf6a087 [Cheng Lian] Fixes ParquetRelation2.equals
      2d50a010
    • [SQL][Minor] correct some comments · 91e35125
      Sheng, Li authored
      Author: Sheng, Li <OopsOutOfMemory@users.noreply.github.com>
      Author: OopsOutOfMemory <victorshengli@126.com>
      
      Closes #4508 from OopsOutOfMemory/cmt and squashes the following commits:
      
      d8a68c6 [Sheng, Li] Update ddl.scala
      f24aeaf [OopsOutOfMemory] correct style
      91e35125
    • [SPARK-5644] [Core]Delete tmp dir when sc is stop · 52983d7f
      Sephiroth-Lin authored
      When the driver runs as a long-lived service and each job only calls sc.stop(), the temp directories created by HttpFileServer and SparkEnv are not deleted; they are only removed when the service process exits. We therefore need to delete these temp directories directly when sc is stopped.
      
      Author: Sephiroth-Lin <linwzhong@gmail.com>
      
      Closes #4412 from Sephiroth-Lin/bug-fix-master-01 and squashes the following commits:
      
      fbbc785 [Sephiroth-Lin] using an interpolated string
      b968e14 [Sephiroth-Lin] using an interpolated string
      4edf394 [Sephiroth-Lin] rename the variable and update comment
      1339c96 [Sephiroth-Lin] add a member to store the reference of tmp dir
      b2018a5 [Sephiroth-Lin] check sparkFilesDir before delete
      f48a3c6 [Sephiroth-Lin] don't check sparkFilesDir, check executorId
      dd9686e [Sephiroth-Lin] format code
      b38e0f0 [Sephiroth-Lin] add dir check before delete
      d7ccc64 [Sephiroth-Lin] Change log level
      1d70926 [Sephiroth-Lin] update comment
      e2a2b1b [Sephiroth-Lin] update comment
      aeac518 [Sephiroth-Lin] Delete tmp dir when sc is stop
      c0d5b28 [Sephiroth-Lin] Delete tmp dir when sc is stop
      52983d7f
    • [SPARK-5343][GraphX]: ShortestPaths traverses backwards · 58209612
      Brennon York authored
      Corrected the logic with ShortestPaths so that the calculation will run forward rather than backwards. Output before looked like:
      
      ```scala
      import org.apache.spark.graphx._
      val g = Graph(sc.makeRDD(Array((1L,""), (2L,""), (3L,""))), sc.makeRDD(Array(Edge(1L,2L,""), Edge(2L,3L,""))))
      lib.ShortestPaths.run(g,Array(3)).vertices.collect
      // res0: Array[(org.apache.spark.graphx.VertexId, org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map()), (3,Map(3 -> 0)), (2,Map()))
      lib.ShortestPaths.run(g,Array(1)).vertices.collect
      // res1: Array[(org.apache.spark.graphx.VertexId, org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map(1 -> 0)), (3,Map(1 -> 2)), (2,Map(1 -> 1)))
      ```
      
      And new output after the changes looks like:
      
      ```scala
      import org.apache.spark.graphx._
      val g = Graph(sc.makeRDD(Array((1L,""), (2L,""), (3L,""))), sc.makeRDD(Array(Edge(1L,2L,""), Edge(2L,3L,""))))
      lib.ShortestPaths.run(g,Array(3)).vertices.collect
      // res0: Array[(org.apache.spark.graphx.VertexId, org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map(3 -> 2)), (2,Map(3 -> 1)), (3,Map(3 -> 0)))
      lib.ShortestPaths.run(g,Array(1)).vertices.collect
      // res1: Array[(org.apache.spark.graphx.VertexId, org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map(1 -> 0)), (2,Map()), (3,Map()))
      ```
      
      Author: Brennon York <brennon.york@capitalone.com>
      
      Closes #4478 from brennonyork/SPARK-5343 and squashes the following commits:
      
      aa57f83 [Brennon York] updated to set ShortestPaths to run 'forward' rather than 'backward'
      58209612
    • [SPARK-5021] [MLlib] Gaussian Mixture now supports Sparse Input · fd2c032f
      MechCoder authored
      Following discussion in the Jira.
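      
      A minimal, hedged sketch of what this enables, assuming an active SparkContext `sc` (the data and `k` below are illustrative):
      
      ```scala
      import org.apache.spark.mllib.clustering.GaussianMixture
      import org.apache.spark.mllib.linalg.Vectors
      
      // Sparse feature vectors can now be fed to GaussianMixture directly.
      val data = sc.parallelize(Seq(
        Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)),
        Vectors.sparse(3, Array(1), Array(2.0)),
        Vectors.sparse(3, Array(0, 1, 2), Array(4.0, 5.0, 6.0))))
      
      val model = new GaussianMixture().setK(2).run(data)
      model.gaussians.foreach(g => println(g.mu))
      ```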
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #4459 from MechCoder/sparse_gmm and squashes the following commits:
      
      1b18dab [MechCoder] Rewrite syr for sparse matrices
      e579041 [MechCoder] Add test for covariance matrix
      5cb370b [MechCoder] Separate tests for sparse data
      5e096bd [MechCoder] Alphabetize and correct error message
      e180f4c [MechCoder] [SPARK-5021] Gaussian Mixture now supports Sparse Input
      fd2c032f
    • [SPARK-5686][SQL] Add show current roles command in HiveQl · f98707c0
      OopsOutOfMemory authored
      show current roles
      
      Author: OopsOutOfMemory <victorshengli@126.com>
      
      Closes #4471 from OopsOutOfMemory/show_current_role and squashes the following commits:
      
      1c6b210 [OopsOutOfMemory] add show current roles
      f98707c0
    • [SQL] Add toString to DataFrame/Column · de80b1ba
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #4436 from marmbrus/dfToString and squashes the following commits:
      
      8a3c35f [Michael Armbrust] Merge remote-tracking branch 'origin/master' into dfToString
      b72a81b [Michael Armbrust] add toString
      de80b1ba
    • [SPARK-5668] Display region in spark_ec2.py get_existing_cluster() · c49a4049
      Miguel Peralvo authored
      Show the region for the different messages displayed by get_existing_cluster(): The search, found and error messages.
      
      Author: Miguel Peralvo <miguel.peralvo@gmail.com>
      
      Closes #4457 from MiguelPeralvo/patch-2 and squashes the following commits:
      
      a5514c8 [Miguel Peralvo] Update spark_ec2.py
      0a837b0 [Miguel Peralvo] Update spark_ec2.py
      3923f36 [Miguel Peralvo] Update spark_ec2.py
      4ecd9f9 [Miguel Peralvo] [SPARK-5668] Display region in spark_ec2.py get_existing_cluster()
      c49a4049
    • [SPARK-5592][SQL] java.net.URISyntaxException when insert data to a partitioned table · 59272dad
      wangfei authored
      The following SQL gets a URISyntaxException:
      ```
      create table sc as select *
      from (select '2011-01-11', '2011-01-11+14:18:26' from src tablesample (1 rows)
      union all
      select '2011-01-11', '2011-01-11+15:18:26' from src tablesample (1 rows)
      union all
      select '2011-01-11', '2011-01-11+16:18:26' from src tablesample (1 rows) ) s;
      create table sc_part (key string) partitioned by (ts string) stored as rcfile;
      set hive.exec.dynamic.partition=true;
      set hive.exec.dynamic.partition.mode=nonstrict;
      insert overwrite table sc_part partition(ts) select * from sc;
      ```
      
      ```
      java.net.URISyntaxException: Relative path in absolute URI: ts=2011-01-11+15:18:26
      at org.apache.hadoop.fs.Path.initialize(Path.java:206)
      at org.apache.hadoop.fs.Path.<init>(Path.java:172)
      at org.apache.hadoop.fs.Path.<init>(Path.java:94)
      at org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.org$apache$spark$sql$hive$SparkHiveDynamicPartitionWriterContainer$$newWriter$1(hiveWriterContainers.scala:230)
      at org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer$$anonfun$getLocalFileWriter$1.apply(hiveWriterContainers.scala:243)
      at org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer$$anonfun$getLocalFileWriter$1.apply(hiveWriterContainers.scala:243)
      at scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:189)
      at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:91)
      at org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.getLocalFileWriter(hiveWriterContainers.scala:243)
      at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:113)
      at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:105)
      at scala.collection.Iterator$class.foreach(Iterator.scala:727)
      at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
      at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:105)
      at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:87)
      at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:87)
      at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
      at org.apache.spark.scheduler.Task.run(Task.scala:64)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
      at java.lang.Thread.run(Thread.java:722)
      Caused by: java.net.URISyntaxException: Relative path in absolute URI: ts=2011-01-11+15:18:26
      at java.net.URI.checkPath(URI.java:1804)
      at java.net.URI.<init>(URI.java:752)
      at org.apache.hadoop.fs.Path.initialize(Path.java:203)
      ```
      
      Author: wangfei <wangfei1@huawei.com>
      Author: Fei Wang <wangfei1@huawei.com>
      
      Closes #4368 from scwf/SPARK-5592 and squashes the following commits:
      
      aa55ef4 [Fei Wang] comments addressed
      f8f8bb1 [wangfei] added test case
      f24624f [wangfei] Merge branch 'master' of https://github.com/apache/spark into SPARK-5592
      9998177 [wangfei] added test case
      ea81daf [wangfei] fix URISyntaxException
      59272dad
    • [HOTFIX][SPARK-4136] Fix compilation and tests · b640c841
      Andrew Or authored
      b640c841
    • SPARK-4136. Under dynamic allocation, cancel outstanding executor requests when no longer needed · 69bc3bb6
      Sandy Ryza authored
      This takes advantage of the changes made in SPARK-4337 to cancel pending requests to YARN when they are no longer needed.
      
      Each time the timer in `ExecutorAllocationManager` strikes, we compute `maxNumNeededExecutors`, the maximum number of executors we could fill with the current load. This is calculated as the total number of running and pending tasks divided by the number of cores per executor. If `maxNumNeededExecutors` is below the total number of running and pending executors, we call `requestTotalExecutors(maxNumNeededExecutors)` to let the cluster manager know that it should cancel any pending requests above this amount. If not, `maxNumNeededExecutors` is just used as a bound alongside the configured `maxExecutors` to limit the number of new requests.
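      
      As a rough sketch of that sizing rule (the names below are illustrative, not the actual `ExecutorAllocationManager` internals):
      
      ```scala
      // Ceiling-divide the outstanding task count by cores per executor to get the most
      // executors the current load could keep busy, then cap it by the configured maximum.
      def maxNumNeededExecutors(runningAndPendingTasks: Int, coresPerExecutor: Int): Int =
        (runningAndPendingTasks + coresPerExecutor - 1) / coresPerExecutor
      
      def desiredTotalExecutors(runningAndPendingTasks: Int, coresPerExecutor: Int,
                                maxExecutors: Int): Int =
        math.min(maxNumNeededExecutors(runningAndPendingTasks, coresPerExecutor), maxExecutors)
      ```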
      
      The patch modifies the API exposed by `ExecutorAllocationClient` for requesting additional executors by moving from `requestExecutors` to `requestTotalExecutors`.  This makes the communication between the `ExecutorAllocationManager` and the `YarnAllocator` easier to reason about and removes some state that needed to be kept in the `CoarseGrainedSchedulerBackend`.  I think an argument can be made that this makes for a less attractive user-facing API in `SparkContext`, but I'm having trouble envisioning situations where a user would want to use either of these APIs.
      
      This will likely break some tests, but I wanted to get feedback on the approach before adding tests and polishing.
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #4168 from sryza/sandy-spark-4136 and squashes the following commits:
      
      37ce77d [Sandy Ryza] Warn on negative number
      cd3b2ff [Sandy Ryza] SPARK-4136
      69bc3bb6
    • [SPARK-5716] [SQL] Support TOK_CHARSETLITERAL in HiveQl · c7ad80ae
      Daoyuan Wang authored
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #4502 from adrian-wang/utf8 and squashes the following commits:
      
      4d7b0ee [Daoyuan Wang] remove useless import
      606f981 [Daoyuan Wang] support TOK_CHARSETLITERAL in HiveQl
      c7ad80ae
    • [Spark-5717] [MLlib] add stop and reorganize import · 6cc96cf0
      JqueryFan authored
      Trivial: add sc.stop() and reorganize imports.
      https://issues.apache.org/jira/browse/SPARK-5717
      
      Author: JqueryFan <firing@126.com>
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #4503 from hhbyyh/scstop and squashes the following commits:
      
      7837a2c [JqueryFan] revert import change
      2e85cc1 [Yuhao Yang] add stop and reorganize import
      6cc96cf0
    • [SPARK-1805] [EC2] Validate instance types · 50820f15
      Nicholas Chammas authored
      Addresses [SPARK-1805](https://issues.apache.org/jira/browse/SPARK-1805), though doesn't resolve it completely.
      
      Error out quickly if the user asks for the master and slaves to have different AMI virtualization types, since we don't currently support that.
      
      In addition to that, we print warnings if the inputted instance types are not recognized, though I would prefer if we errored out. Elsewhere in the script it seems [we allow unrecognized instance types](https://github.com/apache/spark/blob/5de14cc2763a8211f77eeb55940dec025822eb78/ec2/spark_ec2.py#L331), though I think we should remove that.
      
      It's messy, but it should serve us until we enhance spark-ec2 to support clusters with mixed virtualization types.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #4455 from nchammas/ec2-master-slave-different-virtualization and squashes the following commits:
      
      ce28609 [Nicholas Chammas] fix style
      b0adba0 [Nicholas Chammas] validate input instance types
      50820f15
    • [SPARK-5700] [SQL] [Build] Bumps jets3t to 0.9.3 for hadoop-2.3 and hadoop-2.4 profiles · ba667935
      Cheng Lian authored
      This is a follow-up PR for #4454 and #4484. JetS3t 0.9.2 contains a log4j.properties file inside the artifact and breaks our tests (see SPARK-5696). This is fixed in 0.9.3.
      
      This PR also reverts hotfix changes introduced in #4484. The reason is that asking users to configure HiveThriftServer2 logging configurations in hive-log4j.properties can be unintuitive.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #4499 from liancheng/spark-5700 and squashes the following commits:
      
      4f020c7 [Cheng Lian] Bumps jets3t to 0.9.3 for hadoop-2.3 and hadoop-2.4 profiles
      ba667935
    • SPARK-5239 [CORE] JdbcRDD throws "java.lang.AbstractMethodError:... · 2d1e9167
      Sean Owen authored
      SPARK-5239 [CORE] JdbcRDD throws "java.lang.AbstractMethodError: oracle.jdbc.driver.xxxxxx.isClosed()Z"
      
      This is a completion of https://github.com/apache/spark/pull/4033 which was withdrawn for some reason.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #4470 from srowen/SPARK-5239.2 and squashes the following commits:
      
      2398bde [Sean Owen] Avoid use of JDBC4-only isClosed()
      2d1e9167
    • [SPARK-4964][Streaming][Kafka] More updates to Exactly-once Kafka stream · c1513463
      Tathagata Das authored
      Changes
      - Added example
      - Added a critical unit test that verifies that offset ranges can be recovered through checkpoints
      
      Might add more changes.
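      
      For reference, a hedged sketch of the direct-stream API that the new example and tests exercise (broker addresses, topic names, and batch interval are placeholders; assumes an existing SparkContext `sc`):
      
      ```scala
      import kafka.serializer.StringDecoder
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import org.apache.spark.streaming.kafka.KafkaUtils
      
      val ssc = new StreamingContext(sc, Seconds(10))
      val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
      val topics = Set("pageviews")
      
      // Each record is a (key, value) pair; offsets are tracked by the stream itself.
      val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, topics)
      
      stream.map(_._2).count().print()
      ssc.start()
      ssc.awaitTermination()
      ```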
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #4384 from tdas/new-kafka-fixes and squashes the following commits:
      
      7c931c3 [Tathagata Das] Small update
      3ed9284 [Tathagata Das] updated scala doc
      83d0402 [Tathagata Das] Added JavaDirectKafkaWordCount example.
      26df23c [Tathagata Das] Updates based on PR comments from Cody
      e4abf69 [Tathagata Das] Scala doc improvements and stuff.
      bb65232 [Tathagata Das] Fixed test bug and refactored KafkaStreamSuite
      50f2b56 [Tathagata Das] Added Java API and added more Scala and Java unit tests. Also updated docs.
      e73589c [Tathagata Das] Minor changes.
      4986784 [Tathagata Das] Added unit test to kafka offset recovery
      6a91cab [Tathagata Das] Added example
      c1513463
    • [SPARK-5597][MLLIB] save/load for decision trees and emsembles · ef2f55b9
      Joseph K. Bradley authored
      This is based on #4444 from jkbradley with the following changes:
      
      1. Node schema updated to
         ~~~
      treeId: int
      nodeId: Int
      predict/
             |- predict: Double
             |- prob: Double
      impurity: Double
      isLeaf: Boolean
      split/
           |- feature: Int
           |- threshold: Double
           |- featureType: Int
           |- categories: Array[Double]
      leftNodeId: Integer
      rightNodeId: Integer
      infoGain: Double
      ~~~
      
      2. Some refactor of the implementation.
      
      Closes #4444.
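      
      A small, hedged sketch of the resulting save/load round trip (paths and training data are illustrative; assumes an active SparkContext `sc`):
      
      ```scala
      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.mllib.regression.LabeledPoint
      import org.apache.spark.mllib.tree.DecisionTree
      import org.apache.spark.mllib.tree.model.DecisionTreeModel
      
      val points = sc.parallelize(Seq(
        LabeledPoint(0.0, Vectors.dense(0.0, 1.0)),
        LabeledPoint(1.0, Vectors.dense(1.0, 0.0))))
      
      val model = DecisionTree.trainClassifier(points, numClasses = 2,
        categoricalFeaturesInfo = Map[Int, Int](), impurity = "gini", maxDepth = 3, maxBins = 16)
      
      model.save(sc, "/tmp/dtree-model")   // writes model metadata plus the node table described above
      val reloaded = DecisionTreeModel.load(sc, "/tmp/dtree-model")
      ```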
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4493 from mengxr/SPARK-5597 and squashes the following commits:
      
      75e3bb6 [Xiangrui Meng] fix style
      2b0033d [Xiangrui Meng] update tree export schema and refactor the implementation
      45873a2 [Joseph K. Bradley] org imports
      1d4c264 [Joseph K. Bradley] Added save/load for tree ensembles
      dcdbf85 [Joseph K. Bradley] added save/load for decision tree but need to generalize it to ensembles
      ef2f55b9
  2. Feb 09, 2015
    • [SQL] Remove the duplicated code · bd0b5ea7
      Cheng Hao authored
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #4494 from chenghao-intel/tiny_code_change and squashes the following commits:
      
      450dfe7 [Cheng Hao] remove the duplicated code
      bd0b5ea7
    • [SPARK-5701] Only set ShuffleReadMetrics when task has shuffle deps · a2d33d0b
      Kay Ousterhout authored
      The updateShuffleReadMetrics method in TaskMetrics (called by the executor heartbeater) will currently always add a ShuffleReadMetrics to TaskMetrics (with values set to 0), even when the task didn't read any shuffle data. ShuffleReadMetrics should only be added if the task reads shuffle data.
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #4488 from kayousterhout/SPARK-5701 and squashes the following commits:
      
      673ed58 [Kay Ousterhout] SPARK-5701: Only set ShuffleReadMetrics when task has shuffle deps
      a2d33d0b
    • [SPARK-5703] AllJobsPage throws empty.max exception · a95ed521
      Andrew Or authored
      If you have a `SparkListenerJobEnd` event without the corresponding `SparkListenerJobStart` event, then `JobProgressListener` will create an empty `JobUIData` with an empty `stageIds` list. However, later in `AllJobsPage` we call `stageIds.max`. If this is empty, it will throw an exception.
      
      This crashed my history server.
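      
      The fix amounts to guarding the aggregate call; a minimal sketch (the names and the sentinel value are illustrative):
      
      ```scala
      // Only take the max when the job actually has recorded stage IDs.
      val lastStageId = if (jobUIData.stageIds.nonEmpty) jobUIData.stageIds.max else -1
      ```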
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #4490 from andrewor14/jobs-page-max and squashes the following commits:
      
      21797d3 [Andrew Or] Check nonEmpty before calling max
      a95ed521
    • [SPARK-2996] Implement userClassPathFirst for driver, yarn. · 20a60131
      Marcelo Vanzin authored
      Yarn's config option `spark.yarn.user.classpath.first` does not work the same way as
      `spark.files.userClassPathFirst`; Yarn's version is a lot more dangerous, in that it
      modifies the system classpath, instead of restricting the changes to the user's class
      loader. So this change implements the behavior of the latter for Yarn, and deprecates
      the more dangerous choice.
      
      To be able to achieve feature-parity, I also implemented the option for drivers (the existing
      option only applies to executors). So now there are two options, each controlling whether
      to apply userClassPathFirst to the driver or executors. The old option was deprecated, and
      aliased to the new one (`spark.executor.userClassPathFirst`).
      
      The existing "child-first" class loader also had to be fixed. It didn't handle resources, and it
      was also doing some things that ended up causing JVM errors depending on how things
      were being called.
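      
      A minimal sketch of how the two resulting options are set (the app name and values are illustrative):
      
      ```scala
      import org.apache.spark.SparkConf
      
      // Prefer classes from the user's jars over Spark's own, on both the driver and the executors.
      val conf = new SparkConf()
        .setAppName("classpath-isolation-demo")
        .set("spark.driver.userClassPathFirst", "true")
        .set("spark.executor.userClassPathFirst", "true")
      ```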
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #3233 from vanzin/SPARK-2996 and squashes the following commits:
      
      9cf9cf1 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      a1499e2 [Marcelo Vanzin] Remove SPARK_HOME propagation.
      fa7df88 [Marcelo Vanzin] Remove 'test.resource' file, create it dynamically.
      a8c69f1 [Marcelo Vanzin] Review feedback.
      cabf962 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      a1b8d7e [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      3f768e3 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      2ce3c7a [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      0e6d6be [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      70d4044 [Marcelo Vanzin] Fix pyspark/yarn-cluster test.
      0fe7777 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      0e6ef19 [Marcelo Vanzin] Move class loaders around and make names more meaninful.
      fe970a7 [Marcelo Vanzin] Review feedback.
      25d4fed [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      3cb6498 [Marcelo Vanzin] Call the right loadClass() method on the parent.
      fbb8ab5 [Marcelo Vanzin] Add locking in loadClass() to avoid deadlocks.
      2e6c4b7 [Marcelo Vanzin] Mention new setting in documentation.
      b6497f9 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      a10f379 [Marcelo Vanzin] Some feedback.
      3730151 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      f513871 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      44010b6 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      7b57cba [Marcelo Vanzin] Remove now outdated message.
      5304d64 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      35949c8 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      54e1a98 [Marcelo Vanzin] Merge branch 'master' into SPARK-2996
      d1273b2 [Marcelo Vanzin] Add test file to rat exclude.
      fa1aafa [Marcelo Vanzin] Remove write check on user jars.
      89d8072 [Marcelo Vanzin] Cleanups.
      a963ea3 [Marcelo Vanzin] Implement spark.driver.userClassPathFirst for standalone cluster mode.
      50afa5f [Marcelo Vanzin] Fix Yarn executor command line.
      7d14397 [Marcelo Vanzin] Register user jars in executor up front.
      7f8603c [Marcelo Vanzin] Fix yarn-cluster mode without userClassPathFirst.
      20373f5 [Marcelo Vanzin] Fix ClientBaseSuite.
      55c88fa [Marcelo Vanzin] Run all Yarn integration tests via spark-submit.
      0b64d92 [Marcelo Vanzin] Add deprecation warning to yarn option.
      4a84d87 [Marcelo Vanzin] Fix the child-first class loader.
      d0394b8 [Marcelo Vanzin] Add "deprecated configs" to SparkConf.
      46d8cf2 [Marcelo Vanzin] Update doc with new option, change name to "userClassPathFirst".
      a314f2d [Marcelo Vanzin] Enable driver class path isolation in SparkSubmit.
      91f7e54 [Marcelo Vanzin] [yarn] Enable executor class path isolation.
      a853e74 [Marcelo Vanzin] Re-work CoarseGrainedExecutorBackend command line arguments.
      89522ef [Marcelo Vanzin] Add class path isolation support for Yarn cluster mode.
      20a60131
    • SPARK-4900 [MLLIB] MLlib SingularValueDecomposition ARPACK IllegalStateException · 36c4e1d7
      Sean Owen authored
      Fix ARPACK error code mapping, at least. It's not yet clear whether the error is what we expect from ARPACK. If it isn't, not sure if that's to be treated as an MLlib or Breeze issue.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #4485 from srowen/SPARK-4900 and squashes the following commits:
      
      7355aa1 [Sean Owen] Fix ARPACK error code mapping
      36c4e1d7
    • Add a config option to print DAG. · 31d435ec
      KaiXinXiaoLei authored
      Add a config option "spark.rddDebug.enable" to check whether to print DAG info. When "spark.rddDebug.enable" is true, it will print information about DAG in the log.
      
      Author: KaiXinXiaoLei <huleilei1@huawei.com>
      
      Closes #4257 from KaiXinXiaoLei/DAGprint and squashes the following commits:
      
      d9fe42e [KaiXinXiaoLei] change  log info
      c27ee76 [KaiXinXiaoLei] change log info
      83c2b32 [KaiXinXiaoLei] change config option
      adcb14f [KaiXinXiaoLei] change the file.
      f4e7b9e [KaiXinXiaoLei] add a option to print DAG
      31d435ec
    • [SPARK-5469] restructure pyspark.sql into multiple files · 08488c17
      Davies Liu authored
      All the DataTypes moved into pyspark.sql.types
      
      The changes can be tracked by `--find-copies-harder -M25`
      ```
      davieslocalhost:~/work/spark/python$ git diff --find-copies-harder -M25 --numstat master..
      2       5       python/docs/pyspark.ml.rst
      0       3       python/docs/pyspark.mllib.rst
      10      2       python/docs/pyspark.sql.rst
      1       1       python/pyspark/mllib/linalg.py
      21      14      python/pyspark/{mllib => sql}/__init__.py
      14      2108    python/pyspark/{sql.py => sql/context.py}
      10      1772    python/pyspark/{sql.py => sql/dataframe.py}
      7       6       python/pyspark/{sql_tests.py => sql/tests.py}
      8       1465    python/pyspark/{sql.py => sql/types.py}
      4       2       python/run-tests
      1       1       sql/core/src/main/scala/org/apache/spark/sql/test/ExamplePointUDT.scala
      ```
      
      Also `git blame -C -C python/pyspark/sql/context.py` to track the history.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4479 from davies/sql and squashes the following commits:
      
      1b5f0a5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into sql
      2b2b983 [Davies Liu] restructure pyspark.sql
      08488c17
    • [SPARK-5698] Do not let user request negative # of executors · d302c480
      Andrew Or authored
      Otherwise we might crash the ApplicationMaster. Why? Please see https://issues.apache.org/jira/browse/SPARK-5698.
      
      sryza I believe this is also relevant in your patch #4168.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #4483 from andrewor14/da-negative and squashes the following commits:
      
      53ed955 [Andrew Or] Throw IllegalArgumentException instead
      0e89fd5 [Andrew Or] Check against negative requests
      d302c480
    • [SPARK-5699] [SQL] [Tests] Runs hive-thriftserver tests whenever SQL code is modified · 3ec3ad29
      Cheng Lian authored
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #4486 from liancheng/spark-5699 and squashes the following commits:
      
      538001d [Cheng Lian] Runs hive-thriftserver tests whenever SQL code is modified
      3ec3ad29
    • [SPARK-5648][SQL] support "alter ... unset tblproperties("key")" · d08e7c2b
      DoingDone9 authored
      Make HiveContext support `alter ... unset tblproperties("key")`, for example:
      alter view viewName unset tblproperties("k")
      alter table tableName unset tblproperties("k")
      
      Author: DoingDone9 <799203320@qq.com>
      
      Closes #4424 from DoingDone9/unset and squashes the following commits:
      
      6dd8bee [DoingDone9] support "alter ... unset tblproperties("key")"
      d08e7c2b
    • [SPARK-2096][SQL] support dot notation on array of struct · 0ee53ebc
      Wenchen Fan authored
      ~~The rule is simple: If you want `a.b` work, then `a` must be some level of nested array of struct(level 0 means just a StructType). And the result of `a.b` is same level of nested array of b-type.
      An optimization is: the resolve chain looks like `Attribute -> GetItem -> GetField -> GetField ...`, so we could transmit the nested array information between `GetItem` and `GetField` to avoid repeated computation of `innerDataType` and `containsNullList` of that nested array.~~
      marmbrus Could you take a look?
      
      To evaluate `a.b`: if `a` is an array of structs, then `a.b` means getting field `b` from each element of `a` and returning the results as an array.
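      
      A hedged sketch of the behaviour, using a hypothetical `people` table whose `contacts` column is an array of structs (assumes a `sqlContext` in the spark-shell):
      
      ```scala
      case class Contact(name: String, phone: String)
      case class Person(id: Int, contacts: Seq[Contact])
      
      val df = sqlContext.createDataFrame(Seq(
        Person(1, Seq(Contact("alice", "555-0100"), Contact("bob", "555-0101")))))
      df.registerTempTable("people")
      
      // Dotting into the array projects `name` from every element, yielding an array per row.
      sqlContext.sql("SELECT contacts.name FROM people").collect()
      // => Array([WrappedArray(alice, bob)])
      ```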
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #2405 from cloud-fan/nested-array-dot and squashes the following commits:
      
      08a228a [Wenchen Fan] support dot notation on array of struct
      0ee53ebc
    • [SPARK-5614][SQL] Predicate pushdown through Generate. · 2a362925
      Lu Yan authored
      Currently, Catalyst's rules cannot push predicates through "Generate" nodes. Furthermore, partition pruning in HiveTableScan cannot be applied to queries involving "Generate", which makes such queries very inefficient. In practice, the new rule finds patterns like
      
      ```scala
      Filter(predicate, Generate(generator, _, _, _, grandChild))
      ```
      
      and splits the predicate into two parts, depending on whether each conjunct references a column generated by the Generate node. A new Filter is created for the conjuncts that can be pushed beneath the Generate node; if nothing is left for the original Filter, it is removed.
      For example, the physical plan for the query
      ```sql
      select len, bk
      from s_server lateral view explode(len_arr) len_table as len
      where len > 5 and day = '20150102';
      ```
      where 'day' is a metastore partition column, looks like this in the current version of Spark SQL:
      
      > Project [len, bk]
      >
      > Filter ((len > "5") && "(day = "20150102")")
      >
      > Generate explode(len_arr), true, false
      >
      > HiveTableScan [bk, len_arr, day], (MetastoreRelation default, s_server, None), None
      
      But theoretically the plan should be like this
      
      > Project [len, bk]
      >
      > Filter (len > "5")
      >
      > Generate explode(len_arr), true, false
      >
      > HiveTableScan [bk, len_arr, day], (MetastoreRelation default, s_server, None), Some(day = "20150102")
      
      Where partition pruning predicates can be pushed to HiveTableScan nodes.
      
      Author: Lu Yan <luyan02@baidu.com>
      
      Closes #4394 from ianluyan/ppd and squashes the following commits:
      
      a67dce9 [Lu Yan] Fix English grammar.
      7cea911 [Lu Yan] Revised based on @marmbrus's opinions
      ffc59fc [Lu Yan] [SPARK-5614][SQL] Predicate pushdown through Generate.
      2a362925
    • [SPARK-5696] [SQL] [HOTFIX] Asks HiveThriftServer2 to re-initialize log4j using Hive configurations · b8080aa8
      Cheng Lian authored
      In this way, log4j configurations overridden by jets3t-0.9.2.jar can again be overridden by Hive's default log4j configurations.
      
      This might not be the best solution for this issue since it requires users to use `hive-log4j.properties` rather than `log4j.properties` to initialize `HiveThriftServer2` logging configurations, which can be confusing. The main purpose of this PR is to fix Jenkins PR build.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #4484 from liancheng/spark-5696 and squashes the following commits:
      
      df83956 [Cheng Lian] Hot fix: asks HiveThriftServer2 to re-initialize log4j using Hive configurations
      b8080aa8
    • [SQL] Code cleanup. · 5f0b30e5
      Yin Huai authored
      I added an unnecessary line of code in https://github.com/apache/spark/commit/13531dd97c08563e53dacdaeaf1102bdd13ef825.
      
      My bad. Let's delete it.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #4482 from yhuai/unnecessaryCode and squashes the following commits:
      
      3645af0 [Yin Huai] Code cleanup.
      5f0b30e5
    • [SQL] Add some missing DataFrame functions. · 68b25cf6
      Michael Armbrust authored
      - as with a `Symbol`
      - distinct
      - sqlContext.emptyDataFrame
      - move add/remove col out of RDDApi section
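      
      A quick, hedged sketch of the added calls (the DataFrame `df` and the `sqlContext` are assumed to exist):
      
      ```scala
      // Alias a DataFrame with a Symbol, drop duplicate rows, and get an empty frame.
      val aliased = df.as('people)
      val deduped = df.distinct
      val empty   = sqlContext.emptyDataFrame
      ```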
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #4437 from marmbrus/dfMissingFuncs and squashes the following commits:
      
      2004023 [Michael Armbrust] Add missing functions
      68b25cf6
    • [SPARK-5611] [EC2] Allow spark-ec2 repo and branch to be set on CLI of spark_ec2.py · b884daa5
      Florian Verhein authored
      and by extension, the ami-list
      
      Useful for using alternate spark-ec2 repos or branches.
      
      Author: Florian Verhein <florian.verhein@gmail.com>
      
      Closes #4385 from florianverhein/master and squashes the following commits:
      
      7e2b4be [Florian Verhein] [SPARK-5611] [EC2] typo
      8b653dc [Florian Verhein] [SPARK-5611] [EC2] Enforce only supporting spark-ec2 forks from github, log improvement
      bc4b0ed [Florian Verhein] [SPARK-5611] allow spark-ec2 repos with different names
      8b5c551 [Florian Verhein] improve option naming, fix logging, fix lint failing, add guard to enforce spark-ec2
      7724308 [Florian Verhein] [SPARK-5611] [EC2] fixes
      b42b68c [Florian Verhein] [SPARK-5611] [EC2] Allow spark-ec2 repo and branch to be set on CLI of spark_ec2.py
      b884daa5