  1. Jun 18, 2014
    • Vadim Chekan's avatar
      [STREAMING] SPARK-2009 Key not found exception when slow receiver starts · 889f7b76
      Vadim Chekan authored
      I got "java.util.NoSuchElementException: key not found: 1401756085000 ms" exception when using kafka stream and 1 sec batchPeriod.
      
      Investigation showed that the reason is that ReceiverLauncher.startReceivers is asynchronous (started in a thread).
      https://github.com/vchekan/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala#L206
      
      
      
      A slow-starting receiver, such as Kafka, can easily take more than 2 seconds to start. As a result, "compute" is never called on ReceiverInputDStream before the first batch job is executed, and receivedBlockInfo remains empty (obviously). The batch job then calls ReceiverInputDStream.getReceivedBlockInfo and triggers the "key not found" exception.
      
      The patch makes getReceivedBlockInfo more robust by tolerating missing values.
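
      A minimal sketch of the tolerant lookup (the map key and value types here are simplified stand-ins for the ones in ReceiverTracker):

      ```scala
      import scala.collection.mutable

      // The block-info map is keyed by batch time; a slow receiver means the first
      // batch may look up a time for which compute() has never inserted an entry.
      val receivedBlockInfo = new mutable.HashMap[Long, Array[String]]()

      // Tolerant lookup: fall back to "no blocks" instead of throwing
      // NoSuchElementException ("key not found").
      def getReceivedBlockInfo(time: Long): Array[String] =
        receivedBlockInfo.getOrElse(time, Array.empty[String])
      ```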
      
      Author: Vadim Chekan <kot.begemot@gmail.com>
      
      Closes #961 from vchekan/branch-1.0 and squashes the following commits:
      
      e86f82b [Vadim Chekan] Fixed indentation
      4609563 [Vadim Chekan] Key not found exception: if receiver is slow to start, it is possible that getReceivedBlockInfo will be called before compute has been called
      (cherry picked from commit 26f6b989)
      
      Signed-off-by: Patrick Wendell <pwendell@gmail.com>
      889f7b76
  2. Jun 17, 2014
    • Patrick Wendell's avatar
      Revert "SPARK-2038: rename "conf" parameters in the saveAsHadoop functions" · 9e4b4bd0
      Patrick Wendell authored
      This reverts commit 443f5e1b.
      
      This commit would unfortunately break source compatibility for users who pass the hadoopConf parameter by name.
      9e4b4bd0
    • Yin Huai's avatar
      [SPARK-2060][SQL] Querying JSON Datasets with SQL and DSL in Spark SQL · d2f4f30b
      Yin Huai authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-2060
      
      Programming guide: http://yhuai.github.io/site/sql-programming-guide.html
      
      Scala doc of SQLContext: http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.sql.SQLContext
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #999 from yhuai/newJson and squashes the following commits:
      
      227e89e [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      ce8eedd [Yin Huai] rxin's comments.
      bc9ac51 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      94ffdaa [Yin Huai] Remove "get" from method names.
      ce31c81 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      e2773a6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      79ea9ba [Yin Huai] Fix typos.
      5428451 [Yin Huai] Newline
      1f908ce [Yin Huai] Remove extra line.
      d7a005c [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      7ea750e [Yin Huai] marmbrus's comments.
      6a5f5ef [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      83013fb [Yin Huai] Update Java Example.
      e7a6c19 [Yin Huai] SchemaRDD.javaToPython should convert a field with the StructType to a Map.
      6d20b85 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      4fbddf0 [Yin Huai] Programming guide.
      9df8c5a [Yin Huai] Python API.
      7027634 [Yin Huai] Java API.
      cff84cc [Yin Huai] Use a SchemaRDD for a JSON dataset.
      d0bd412 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      ab810b0 [Yin Huai] Make JsonRDD private.
      6df0891 [Yin Huai] Apache header.
      8347f2e [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      66f9e76 [Yin Huai] Update docs and use the entire dataset to infer the schema.
      8ffed79 [Yin Huai] Update the example.
      a5a4b52 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      4325475 [Yin Huai] If a sampled dataset is used for schema inferring, update the schema of the JsonTable after first execution.
      65b87f0 [Yin Huai] Fix sampling...
      8846af5 [Yin Huai] API doc.
      52a2275 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      0387523 [Yin Huai] Address PR comments.
      666b957 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      a2313a6 [Yin Huai] Address PR comments.
      f3ce176 [Yin Huai] After type conflict resolution, if a NullType is found, StringType is used.
      0576406 [Yin Huai] Add Apache license header.
      af91b23 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      f45583b [Yin Huai] Infer the schema of a JSON dataset (a text file with one JSON object per line or a RDD[String] with one JSON object per string) and returns a SchemaRDD.
      f31065f [Yin Huai] A query plan or a SchemaRDD can print out its schema.
      d2f4f30b
    • Patrick Wendell's avatar
      HOTFIX: bug caused by #941 · b2ebf429
      Patrick Wendell authored
      This patch should have qualified the use of PIPE. This needs to be backported into 0.9 and 1.0.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #1108 from pwendell/hotfix and squashes the following commits:
      
      711c58d [Patrick Wendell] HOTFIX: bug caused by #941
      b2ebf429
    • Andrew Or's avatar
      [SPARK-2147 / 2161] Show removed executors on the UI · a14807e8
      Andrew Or authored
      This PR includes two changes:
      - **[SPARK-2147]** When an application finishes cleanly (i.e. `sc.stop()` is called), all of its executors used to disappear from the Master UI. This no longer happens.
      - **[SPARK-2161]** This adds a "Removed Executors" table to the Master UI, so the user can, for instance, find out from the logs why their executors died. The equivalent table already existed in the Worker UI, but was hidden because of a bug (the comment `//scalastyle:off` disconnected the `Seq[Node]` that represents the HTML for the table).
      
      This should go into 1.0.1 if possible.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1102 from andrewor14/remember-removed-executors and squashes the following commits:
      
      2e2298f [Andrew Or] Add hash code method to ExecutorInfo (minor)
      abd72e0 [Andrew Or] Merge branch 'master' of github.com:apache/spark into remember-removed-executors
      792f992 [Andrew Or] Add missing equals method in ExecutorInfo
      3390b49 [Andrew Or] Add executor state column to WorkerPage
      161f8a2 [Andrew Or] Display finished executors table (fix bug)
      fbb65b8 [Andrew Or] Removed unused method
      c89bb6e [Andrew Or] Add table for removed executors in MasterWebUI
      fe47402 [Andrew Or] Show exited executors on the Master UI
      a14807e8
    • CodingCat's avatar
      SPARK-2038: rename "conf" parameters in the saveAsHadoop functions · 443f5e1b
      CodingCat authored
      to distinguish them from the SparkConf object
      
      https://issues.apache.org/jira/browse/SPARK-2038
      
      Author: CodingCat <zhunansjtu@gmail.com>
      
      Closes #1087 from CodingCat/SPARK-2038 and squashes the following commits:
      
      763975f [CodingCat] style fix
      d91288d [CodingCat] rename "conf" parameters in the saveAsHadoop functions
      443f5e1b
    • Sandy Ryza's avatar
      SPARK-2146. Fix takeOrdered doc · 2794990e
      Sandy Ryza authored
      Removes Python syntax from the Scaladoc, corrects the result in the Scaladoc, and removes an irrelevant cache() call from the Python doc.
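
      For reference, the documented behavior in the Scala shell (assuming an active SparkContext `sc`):

      ```scala
      // takeOrdered returns the smallest num elements, in ascending order,
      // as a local array rather than an RDD.
      val rdd = sc.parallelize(Seq(10, 1, 2, 9, 3, 4))
      rdd.takeOrdered(3)                          // Array(1, 2, 3)
      rdd.takeOrdered(3)(Ordering[Int].reverse)   // Array(10, 9, 4), i.e. the largest first
      ```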
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #1086 from sryza/sandy-spark-2146 and squashes the following commits:
      
      185ff18 [Sandy Ryza] Use Seq instead of Array
      c996120 [Sandy Ryza] SPARK-2146.  Fix takeOrdered doc
      2794990e
    • Andrew Ash's avatar
      SPARK-1063 Add .sortBy(f) method on RDD · b92d16b1
      Andrew Ash authored
      This never got merged from the apache/incubator-spark repo (which is now deleted) but there had been several rounds of code review on this PR there.
      
      I think this is ready for merging.
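
      A quick illustration of the new API (assuming an active SparkContext `sc`; `ascending` and `numPartitions` are the optional parameters mentioned in the commits below):

      ```scala
      val pairs = sc.parallelize(Seq(("b", 3), ("a", 1), ("c", 2)))

      // Sort by an arbitrary key function derived from each element.
      pairs.sortBy(_._2).collect()                                        // Array((a,1), (c,2), (b,3))
      pairs.sortBy(_._1, ascending = false, numPartitions = 1).collect()  // Array((c,2), (b,3), (a,1))
      ```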
      
      Author: Andrew Ash <andrew@andrewash.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Reynold Xin <rxin@apache.org>
      
      Closes #369 from ash211/sortby and squashes the following commits:
      
      d09147a [Andrew Ash] Fix Ordering import
      43d0a53 [Andrew Ash] Fix missing .collect()
      29a54ed [Andrew Ash] Re-enable test by converting to a closure
      5a95348 [Andrew Ash] Add license for RDDSuiteUtils
      64ed6e3 [Andrew Ash] Remove leaked diff
      d4de69a [Andrew Ash] Remove scar tissue
      63638b5 [Andrew Ash] Add Python version of .sortBy()
      45e0fde [Andrew Ash] Add Java version of .sortBy()
      adf84c5 [Andrew Ash] Re-indent to keep line lengths under 100 chars
      9d9b9d8 [Andrew Ash] Use parentheses on .collect() calls
      0457b69 [Andrew Ash] Ignore failing test
      99f0baf [Andrew Ash] Merge branch 'master' into sortby
      222ae97 [Andrew Ash] Try moving Ordering objects out to a different class
      3fd0dd3 [Andrew Ash] Add (failing) test for sortByKey with explicit Ordering
      b8b5bbc [Andrew Ash] Align remove extra spaces that were used to align ='s in test code
      8c53298 [Andrew Ash] Actually use ascending and numPartitions parameters
      381eef2 [Andrew Ash] Correct silly typo
      7db3e84 [Andrew Ash] Support ascending and numPartitions params in sortBy()
      0f685fd [Andrew Ash] Merge remote-tracking branch 'origin/master' into sortby
      ca4490d [Andrew Ash] Add .sortBy(f) method on RDD
      b92d16b1
    • Zongheng Yang's avatar
      [SPARK-2053][SQL] Add Catalyst expressions for CASE WHEN. · e243c5ff
      Zongheng Yang authored
      JIRA ticket: https://issues.apache.org/jira/browse/SPARK-2053
      
      This PR adds support for two types of CASE statements present in Hive. The first type is of the form `CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END`, with semantics like a chain of if statements. The second type is of the form `CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END`, with semantics like a switch statement on key `a`. Both forms are implemented in `CaseWhen`.
      
      [This link](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-ConditionalFunctions) contains more detailed descriptions on their semantics.
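
      Roughly, the two forms look like this when submitted through `hql` (assuming a Hive-enabled shell and the `src` test table used elsewhere in this log):

      ```scala
      // Form 1: chained-if semantics -- each WHEN condition is evaluated in order.
      hql("SELECT CASE WHEN key > 100 THEN 'big' WHEN key > 10 THEN 'medium' ELSE 'small' END FROM src")

      // Form 2: switch semantics -- the key expression is compared against each WHEN value.
      hql("SELECT CASE key WHEN 0 THEN 'zero' WHEN 1 THEN 'one' ELSE 'other' END FROM src")
      ```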
      
      Notes / Open issues:
      
      * Please check if any implicit contracts / invariants are broken in the implementations (especially for the operators). I am not very familiar with them and I currently find them tricky to spot.
      * We should decide whether or not a non-boolean condition is allowed in a branch of `CaseWhen`. Hive throws a `SemanticException` for this situation and I think it'd be good to mimic it -- the question is where in the whole Spark SQL pipeline should we signal an exception for such a query.
      
      Author: Zongheng Yang <zongheng.y@gmail.com>
      
      Closes #1055 from concretevitamin/caseWhen and squashes the following commits:
      
      4226eb9 [Zongheng Yang] Comment.
      79d26fc [Zongheng Yang] Merge branch 'master' into caseWhen
      caf9383 [Zongheng Yang] Update a FIXME.
      9d26ab8 [Zongheng Yang] Add @transient marker.
      788a0d9 [Zongheng Yang] Implement CastNulls, which fixes udf_case and udf_when.
      7ef284f [Zongheng Yang] Refactors: remove redundant passes, improve toString, mark transient.
      f47ae7b [Zongheng Yang] Modify queries in tests to have shorter golden files.
      1c1fbfc [Zongheng Yang] Cleanups per review comments.
      7d2b7e2 [Zongheng Yang] Translate CaseKeyWhen to CaseWhen at parsing time.
      47d406a [Zongheng Yang] Do toArray once and lazily outside of eval().
      bb3d109 [Zongheng Yang] Update scaladoc of a method.
      aea3195 [Zongheng Yang] Fix bug that branchesArr is not used; remove unused import.
      96870a8 [Zongheng Yang] Turn off scalastyle for some comments.
      7392f3a [Zongheng Yang] Minor cleanup.
      2cf08bb [Zongheng Yang] Merge branch 'master' into caseWhen
      9f84b40 [Zongheng Yang] Add golden outputs from Hive.
      db51a85 [Zongheng Yang] Add allCondBooleans check; uncomment tests.
      3f9ef0a [Zongheng Yang] Cleanups and bug fixes (mainly in eval() and resolved).
      be54bc8 [Zongheng Yang] Rewrite eval() to a low-level implementation. Separate two CASE stmts.
      f2bcb9d [Zongheng Yang] WIP
      5906f75 [Zongheng Yang] WIP
      efd019b [Zongheng Yang] eval() and toString() bug fixes.
      7d81e95 [Zongheng Yang] Clean up resolved.
      a31d782 [Zongheng Yang] Finish up Case.
      e243c5ff
    • Xi Liu's avatar
      [SPARK-2164][SQL] Allow Hive UDF on columns of type struct · f5a4049e
      Xi Liu authored
      Author: Xi Liu <xil@conviva.com>
      
      Closes #796 from xiliu82/sqlbug and squashes the following commits:
      
      328dfc4 [Xi Liu] [Spark SQL] remove a temporary function after test
      354386a [Xi Liu] [Spark SQL] add test suite for UDF on struct
      8fc6f51 [Xi Liu] [SparkSQL] allow UDF on struct
      f5a4049e
    • Andrew Or's avatar
      [SPARK-2144] ExecutorsPage reports incorrect # of RDD blocks · 09deb3ee
      Andrew Or authored
      This is reproducible whenever we drop a block because of memory pressure.
      
      This is because StorageStatusListener actually never removes anything from the block maps of its StorageStatuses. Instead, when a block is dropped, it sets the block's storage level to `StorageLevel.NONE`, when it should just remove it from the map.
      
      This PR includes this simple fix.
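
      A minimal sketch of the difference (the map and value types are simplified stand-ins for the block map inside StorageStatus):

      ```scala
      import scala.collection.mutable

      case class BlockStatus(storageLevel: String, memSize: Long)

      val blocks = new mutable.HashMap[String, BlockStatus]()
      blocks("rdd_0_0") = BlockStatus("MEMORY_ONLY", 1024L)

      // Old behavior: a dropped block lingers in the map with level NONE,
      // so the ExecutorsPage keeps counting it as an RDD block.
      blocks("rdd_0_0") = BlockStatus("NONE", 0L)

      // Fixed behavior: remove the entry outright when the block is dropped.
      blocks.remove("rdd_0_0")
      ```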
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1080 from andrewor14/ui-blocks and squashes the following commits:
      
      fcf9f1a [Andrew Or] Remove BlockStatus if it is no longer cached
      09deb3ee
    • Daniel Darabos's avatar
      SPARK-2035: Store call stack for stages, display it on the UI. · 23a12ce2
      Daniel Darabos authored
      I'm not sure about the test -- I get a lot of unrelated failures for some reason. I'll try to sort it out. But hopefully the automation will test this for me if I send a pull request :).
      
      I'll attach a demo HTML in [Jira](https://issues.apache.org/jira/browse/SPARK-2035).
      
      Author: Daniel Darabos <darabos.daniel@gmail.com>
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #981 from darabos/darabos-call-stack and squashes the following commits:
      
      f7c6bfa [Daniel Darabos] Fix bad merge. I undid 83c226d4 by Doris.
      3d0a48d [Daniel Darabos] Merge remote-tracking branch 'upstream/master' into darabos-call-stack
      b857849 [Daniel Darabos] Style: Break long line.
      ecb5690 [Daniel Darabos] Include the last Spark method in the full stack trace. Otherwise it is not visible if the stage name is overridden.
      d00a85b [Patrick Wendell] Make call sites for stages non-optional and well defined
      b9eba24 [Daniel Darabos] Make StageInfo.details non-optional. Add JSON serialization code for the new field. Verify JSON backward compatibility.
      4312828 [Daniel Darabos] Remove Mima excludes for CallSite. They should be unnecessary now, with SPARK-2070 fixed.
      0920750 [Daniel Darabos] Merge remote-tracking branch 'upstream/master' into darabos-call-stack
      a4b1faf [Daniel Darabos] Add Mima exclusions for the CallSite changes it has picked up. They are private methods/classes, so we ought to be safe.
      932f810 [Daniel Darabos] Use empty CallSite instead of null in DAGSchedulerSuite. Outside of testing, this parameter always originates in SparkContext.scala, and will never be null.
      ccd89d1 [Daniel Darabos] Fix long lines.
      ac173e4 [Daniel Darabos] Hide "show details" if there are no details to show.
      6182da6 [Daniel Darabos] Set a configurable limit on maximum call stack depth. It can be useful in memory-constrained situations with large numbers of stages.
      8fe2e34 [Daniel Darabos] Store call stack for stages, display it on the UI.
      23a12ce2
    • Anant's avatar
      SPARK-1990: added compatibility for python 2.6 for ssh_read command · 8cd04c3e
      Anant authored
      https://issues.apache.org/jira/browse/SPARK-1990
      
      There were some posts on the mailing lists reporting that spark-ec2 does not work with Python 2.6. In addition, we should check the Python version at the top of the script and exit if it's too old.
      
      Author: Anant <anant.asty@gmail.com>
      
      Closes #941 from anantasty/SPARK-1990 and squashes the following commits:
      
      4ca441d [Anant] Implmented check_optput withinthe module to work with python 2.6
      c6ed85c [Anant] added compatibility for python 2.6 for ssh_read command
      8cd04c3e
    • Kan Zhang's avatar
      [SPARK-2130] End-user friendly String repr for StorageLevel in Python · d81c08ba
      Kan Zhang authored
      JIRA issue https://issues.apache.org/jira/browse/SPARK-2130
      
      This PR adds an end-user friendly String representation for StorageLevel
      in Python, similar to ```StorageLevel.description``` in Scala.
      ```
      >>> rdd = sc.parallelize([1,2])
      >>> storage_level = rdd.getStorageLevel()
      >>> storage_level
      StorageLevel(False, False, False, False, 1)
      >>> print(storage_level)
      Serialized 1x Replicated
      ```
      
      Author: Kan Zhang <kzhang@apache.org>
      
      Closes #1096 from kanzhang/SPARK-2130 and squashes the following commits:
      
      7c8b98b [Kan Zhang] [SPARK-2130] Prettier epydoc output
      cc5bf45 [Kan Zhang] [SPARK-2130] End-user friendly String representation for StorageLevel in Python
      d81c08ba
    • Anatoli Fomenko's avatar
      MLlib documentation fix · 7afa912e
      Anatoli Fomenko authored
      Synchronized mllib-optimization.md with Spark Scaladoc: removed reference to GradientDescent.runMiniBatchSGD method
      
      This is a temporary fix to remove a link from http://spark.apache.org/docs/latest/mllib-optimization.html to GradientDescent.runMiniBatchSGD, which is not in the current online GradientDescent Scaladoc.
      FIXME: revert this commit after GradientDescent Scaladoc is updated.
      See images for details.
      
      ![mllib-docs-fix-1](https://cloud.githubusercontent.com/assets/1375501/3294410/ccf19bb8-f5a8-11e3-93f1-f593016209eb.png)
      ![mllib-docs-fix-2](https://cloud.githubusercontent.com/assets/1375501/3294411/d0b59a7e-f5a8-11e3-8fc8-329c177ef8c8.png)
      
      Author: Anatoli Fomenko <fa@apache.org>
      
      Closes #1098 from afomenko/master and squashes the following commits:
      
      5cb0758 [Anatoli Fomenko] MLlib documentation fix
      7afa912e
  3. Jun 16, 2014
    • Cheng Lian's avatar
      Minor fix: made "EXPLAIN" output to play well with JDBC output format · 237b96bc
      Cheng Lian authored
      Fixed the broken JDBC output. Test from Shark `beeline`:
      
      ```
      beeline> !connect jdbc:hive2://localhost:10000/
      scan complete in 2ms
      Connecting to jdbc:hive2://localhost:10000/
      Enter username for jdbc:hive2://localhost:10000/: lian
      Enter password for jdbc:hive2://localhost:10000/:
      Connected to: Hive (version 0.12.0)
      Driver: Hive (version 0.12.0)
      Transaction isolation: TRANSACTION_REPEATABLE_READ
      0: jdbc:hive2://localhost:10000/>
      0: jdbc:hive2://localhost:10000/> explain select * from src;
      +-------------------------------------------------------------------------------+
      |                                     plan                                      |
      +-------------------------------------------------------------------------------+
      | ExplainCommand [plan#2:0]                                                     |
      |  HiveTableScan [key#0,value#1], (MetastoreRelation default, src, None), None  |
      +-------------------------------------------------------------------------------+
      2 rows selected (1.386 seconds)
      ```
      
      Before this change, the output looked something like this:
      
      ```
      +-------------------------------------------------------------------------------+
      |                                     plan                                      |
      +-------------------------------------------------------------------------------+
      | ExplainCommand [plan#2:0]
       HiveTableScan [key#0,value#1], (MetastoreRelation default, src, None), None  |
      +-------------------------------------------------------------------------------+
      ```
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #1097 from liancheng/multiLineExplain and squashes the following commits:
      
      eb37967 [Cheng Lian] Made output of "EXPLAIN" play well with JDBC output format
      237b96bc
    • Cheng Lian's avatar
      [SQL][SPARK-2094] Follow up of PR #1071 for Java API · 273afcb2
      Cheng Lian authored
      Updated `JavaSQLContext` and `JavaHiveContext` similar to what we've done to `SQLContext` and `HiveContext` in PR #1071. Added corresponding test case for Spark SQL Java API.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #1085 from liancheng/spark-2094-java and squashes the following commits:
      
      29b8a51 [Cheng Lian] Avoided instantiating JavaSparkContext & JavaHiveContext to workaround test failure
      92bb4fb [Cheng Lian] Marked test cases in JavaHiveQLSuite with "ignore"
      22aec97 [Cheng Lian] Follow up of PR #1071 for Java API
      273afcb2
    • witgo's avatar
      [SPARK-1930] The Container is running beyond physical memory limits, so as to be killed · cdf2b045
      witgo authored
      Author: witgo <witgo@qq.com>
      
      Closes #894 from witgo/SPARK-1930 and squashes the following commits:
      
      564307e [witgo] Update the running-on-yarn.md
      3747515 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1930
      172647b [witgo] add memoryOverhead docs
      a0ff545 [witgo] leaving only two configs
      a17bda2 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1930
      478ca15 [witgo] Merge branch 'master' into SPARK-1930
      d1244a1 [witgo] Merge branch 'master' into SPARK-1930
      8b967ae [witgo] Merge branch 'master' into SPARK-1930
      655a820 [witgo] review commit
      71859a7 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1930
      e3c531d [witgo] review commit
      e16f190 [witgo] different memoryOverhead
      ffa7569 [witgo] review commit
      5c9581f [witgo] Merge branch 'master' into SPARK-1930
      9a6bcf2 [witgo] review commit
      8fae45a [witgo] fix NullPointerException
      e0dcc16 [witgo] Adding  configuration items
      b6a989c [witgo] Fix container memory beyond limit, were killed
      cdf2b045
    • Kan Zhang's avatar
      [SPARK-2010] Support for nested data in PySpark SQL · 4fdb4917
      Kan Zhang authored
      JIRA issue https://issues.apache.org/jira/browse/SPARK-2010
      
      This PR adds support for nested collection types in PySpark SQL, including
      array, dict, list, set, and tuple. Example,
      
      ```
      >>> from array import array
      >>> from pyspark.sql import SQLContext
      >>> sqlCtx = SQLContext(sc)
      >>> rdd = sc.parallelize([
      ...         {"f1" : array('i', [1, 2]), "f2" : {"row1" : 1.0}},
      ...         {"f1" : array('i', [2, 3]), "f2" : {"row2" : 2.0}}])
      >>> srdd = sqlCtx.inferSchema(rdd)
      >>> srdd.collect() == [{"f1" : array('i', [1, 2]), "f2" : {"row1" : 1.0}},
      ...                    {"f1" : array('i', [2, 3]), "f2" : {"row2" : 2.0}}]
      True
      >>> rdd = sc.parallelize([
      ...         {"f1" : [[1, 2], [2, 3]], "f2" : set([1, 2]), "f3" : (1, 2)},
      ...         {"f1" : [[2, 3], [3, 4]], "f2" : set([2, 3]), "f3" : (2, 3)}])
      >>> srdd = sqlCtx.inferSchema(rdd)
      >>> srdd.collect() == \
      ... [{"f1" : [[1, 2], [2, 3]], "f2" : set([1, 2]), "f3" : (1, 2)},
      ...  {"f1" : [[2, 3], [3, 4]], "f2" : set([2, 3]), "f3" : (2, 3)}]
      True
      ```
      
      Author: Kan Zhang <kzhang@apache.org>
      
      Closes #1041 from kanzhang/SPARK-2010 and squashes the following commits:
      
      1b2891d [Kan Zhang] [SPARK-2010] minor doc change and adding a TODO
      504f27e [Kan Zhang] [SPARK-2010] Support for nested data in PySpark SQL
      4fdb4917
    • CodingCat's avatar
      SPARK-2039: apply output dir existence checking for all output formats · 716c88aa
      CodingCat authored
      https://issues.apache.org/jira/browse/SPARK-2039
      
      apply output dir existence checking for all output formats
      
      Author: CodingCat <zhunansjtu@gmail.com>
      
      Closes #1088 from CodingCat/SPARK-2039 and squashes the following commits:
      
      c52747a [CodingCat] apply output dir existence checking for all output formats
      716c88aa
    • Ali Ghodsi's avatar
      Updating docs to include missing information about reducers and clarify ... · 119b06a0
      Ali Ghodsi authored
      ...how the OFFHEAP storage level works (there has been confusion around this).
      
      Author: Ali Ghodsi <alig@cs.berkeley.edu>
      
      Closes #1089 from alig/master and squashes the following commits:
      
      ca8114d [Ali Ghodsi] Updating docs to include missing information about reducers and clarify how the OFFHEAP storage level works (there has been confusion around this).
      119b06a0
    • Andrew Ash's avatar
      SPARK-2148 Add link to requirements for custom equals() and hashcode() methods · 9672ee07
      Andrew Ash authored
      https://issues.apache.org/jira/browse/SPARK-2148
      
      Author: Andrew Ash <andrew@andrewash.com>
      
      Closes #1092 from ash211/SPARK-2148 and squashes the following commits:
      
      93513df [Andrew Ash] SPARK-2148 Add link to requirements for custom equals() and hashcode() methods
      9672ee07
    • CrazyJvm's avatar
      SPARK-1999: StorageLevel in storage tab and RDD Storage Info never changes · a63aa1ad
      CrazyJvm authored
      The StorageLevel in the 'storage tab' and 'RDD Storage Info' never changes, even if you call rdd.unpersist() and then give the RDD a different storage level.
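
      The reported sequence, in Scala shell terms (assuming an active SparkContext `sc`):

      ```scala
      import org.apache.spark.storage.StorageLevel

      val rdd = sc.parallelize(1 to 100).persist(StorageLevel.MEMORY_ONLY)
      rdd.count()

      // Re-persist with a different level; before this fix the storage tab
      // kept reporting MEMORY_ONLY instead of the new level.
      rdd.unpersist()
      rdd.persist(StorageLevel.MEMORY_AND_DISK)
      rdd.count()
      ```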
      
      Author: CrazyJvm <crazyjvm@gmail.com>
      
      Closes #968 from CrazyJvm/ui-storagelevel and squashes the following commits:
      
      62555fa [CrazyJvm] change RDDInfo constructor param 'storageLevel' to var, so there's need to add another variable _storageLevel。
      9f1571e [CrazyJvm] JIRA https://issues.apache.org/jira/browse/SPARK-1999 UI : StorageLevel in storage tab and RDD Storage Info never changes
      a63aa1ad
  4. Jun 15, 2014
    • Kan Zhang's avatar
      [SPARK-937] adding EXITED executor state and not relaunching cleanly exited executors · ca5d9d43
      Kan Zhang authored
      There seem to be 2 issues.
      
      1. When the job is done, the driver asks the executor to shut down. However, this clean exit was assigned the FAILED executor state by the Worker. I introduced an EXITED executor state for executors that exit voluntarily (covering both normal and abnormal exits, depending on the exit code).
      
      2. When the Master is notified that an executor has exited, it launches another one to replace it, regardless of why the executor exited. When the reason was that the job had finished, the unnecessary replacement was subsequently killed when the App disassociated. This launching and killing of unnecessary executors shows up in the log and is confusing to users. I added a check for the executor exit status and avoid launching (and subsequently killing) unnecessary replacements when executors exit cleanly.
      
      One could ask the scheduler to tell the Master the job is done so that the Master wouldn't launch the replacement executor. However, there is a race condition between the App telling the Master the job is done and the Worker telling the Master an executor has exited, and there is no guarantee the former will happen before the latter. Instead, I chose to check the exit code when the executor exits. If the exit code is 0, I assume the executor was asked to shut down by the driver, and the Master will not launch a replacement.
      
      Due to the race condition, it could also happen (although it didn't happen on my local cluster) that the Master detects the App disassociation event before the executor exits on its own. In such cases, the executor will be rightfully killed and labeled as KILLED, while the App state will show FINISHED.
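
      Roughly, the decision reduces to the executor's exit code. A minimal sketch (state names follow the description above; the real Master logic has more cases):

      ```scala
      object ExecutorState extends Enumeration {
        // EXITED is the new state for executors that terminate on their own.
        val RUNNING, EXITED, KILLED, FAILED = Value
      }

      // An executor that exits by itself is marked EXITED rather than FAILED,
      // and a clean exit (code 0) does not trigger a replacement launch.
      def onExecutorExit(exitCode: Int): (ExecutorState.Value, Boolean) = {
        val launchReplacement = exitCode != 0
        (ExecutorState.EXITED, launchReplacement)
      }
      ```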
      
      Author: Kan Zhang <kzhang@apache.org>
      
      Closes #306 from kanzhang/SPARK-1118 and squashes the following commits:
      
      cb0cc86 [Kan Zhang] [SPARK-937] adding EXITED executor state and not relaunching cleanly exited executors
      ca5d9d43
    • Michael Armbrust's avatar
      [SQL] Support transforming TreeNodes with Option children. · 269fc62b
      Michael Armbrust authored
      Thanks go to @marmbrus for his implementation.
      
      Author: Michael Armbrust <michael@databricks.com>
      Author: Zongheng Yang <zongheng.y@gmail.com>
      
      Closes #1074 from concretevitamin/option-treenode and squashes the following commits:
      
      ef27b85 [Zongheng Yang] Merge pull request #1 from marmbrus/pr/1074
      73133c2 [Michael Armbrust] TreeNodes can't be inner classes.
      ab78420 [Zongheng Yang] Add a test.
      2ccb721 [Michael Armbrust] Add support for transformation of optional children.
      269fc62b
  5. Jun 14, 2014
  6. Jun 13, 2014
    • akkomar's avatar
      Small correction in Streaming Programming Guide doc · edb1f0e3
      akkomar authored
      Corrected description of `repartition` function under 'Level of Parallelism in Data Receiving'.
      
      Author: akkomar <ak.komar@gmail.com>
      
      Closes #1079 from akkomar/streaming-guide-doc and squashes the following commits:
      
      32dfc62 [akkomar] Corrected description of `repartition` function under 'Level of Parallelism in Data Receiving'.
      edb1f0e3
    • Cheng Lian's avatar
      [SPARK-2094][SQL] "Exactly once" semantics for DDL and command statements · ac96d965
      Cheng Lian authored
      ## Related JIRA issues
      
      - Main issue:
      
        - [SPARK-2094](https://issues.apache.org/jira/browse/SPARK-2094): Ensure exactly once semantics for DDL/Commands
      
      - Issues resolved as dependencies:
      
        - [SPARK-2081](https://issues.apache.org/jira/browse/SPARK-2081): Undefine output() from the abstract class Command and implement it in concrete subclasses
        - [SPARK-2128](https://issues.apache.org/jira/browse/SPARK-2128): No plan for DESCRIBE
        - [SPARK-1852](https://issues.apache.org/jira/browse/SPARK-1852): SparkSQL Queries with Sorts run before the user asks them to
      
      - Other related issue:
      
        - [SPARK-2129](https://issues.apache.org/jira/browse/SPARK-2129): NPE thrown while lookup a view
      
          Two test cases, `join_view` and `mergejoin_mixed`, within the `HiveCompatibilitySuite` are removed from the whitelist to workaround this issue.
      
      ## PR Overview
      
      This PR defines physical plans for DDL statements and commands and wraps their side effects in a lazy field `PhysicalCommand.sideEffectResult`, so that they are executed eagerly and exactly once.  Also, as a positive side effect, DDL statements and commands can now be turned into proper `SchemaRDD`s, letting users query the execution results.
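
      The core mechanism is a lazy, memoized result field, sketched below (simplified; not the actual Spark SQL class hierarchy, and `Seq[String]` stands in for `Seq[Row]`):

      ```scala
      abstract class PhysicalCommandSketch {
        // The command's side effect lives in run(); sideEffectResult memoizes it,
        // so it executes at most once no matter how often the RDD is referenced.
        protected def run(): Seq[String]
        lazy val sideEffectResult: Seq[String] = run()
      }

      class CreateTableSketch(name: String) extends PhysicalCommandSketch {
        protected def run(): Seq[String] = {
          println(s"creating table $name")   // the side effect
          Seq.empty
        }
      }
      ```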
      
      This PR defines schemas for the following DDL/commands:
      
      - EXPLAIN command
      
        - `plan`: String, the plan explanation
      
      - SET command
      
        - `key`: String, the key(s) of the propert(y/ies) being set or queried
        - `value`: String, the value(s) of the propert(y/ies) being queried
      
      - Other Hive native command
      
        - `result`: String, execution result returned by Hive
      
        **NOTE**: We should refine schemas for different native commands by defining physical plans for them in the future.
      
      ## Examples
      
      ### EXPLAIN command
      
      Take the "EXPLAIN" command as an example, we first execute the command and obtain a `SchemaRDD` at the same time, then query the `plan` field with the schema DSL:
      
      ```
      scala> loadTestTable("src")
      ...
      
      scala> val q0 = hql("EXPLAIN SELECT key, COUNT(*) FROM src GROUP BY key")
      ...
      q0: org.apache.spark.sql.SchemaRDD =
      SchemaRDD[0] at RDD at SchemaRDD.scala:98
      == Query Plan ==
      ExplainCommandPhysical [plan#11:0]
       Aggregate false, [key#4], [key#4,SUM(PartialCount#6L) AS c_1#2L]
        Exchange (HashPartitioning [key#4:0], 200)
         Exchange (HashPartitioning [key#4:0], 200)
          Aggregate true, [key#4], [key#4,COUNT(1) AS PartialCount#6L]
           HiveTableScan [key#4], (MetastoreRelation default, src, None), None
      
      scala> q0.select('plan).collect()
      ...
      [ExplainCommandPhysical [plan#24:0]
       Aggregate false, [key#17], [key#17,SUM(PartialCount#19L) AS c_1#2L]
        Exchange (HashPartitioning [key#17:0], 200)
         Exchange (HashPartitioning [key#17:0], 200)
          Aggregate true, [key#17], [key#17,COUNT(1) AS PartialCount#19L]
           HiveTableScan [key#17], (MetastoreRelation default, src, None), None]
      
      scala>
      ```
      
      ### SET command
      
      In this example we query all the properties set in `SQLConf`, register the result as a table, and then query the table with HiveQL:
      
      ```
      scala> val q1 = hql("SET")
      ...
      q1: org.apache.spark.sql.SchemaRDD =
      SchemaRDD[7] at RDD at SchemaRDD.scala:98
      == Query Plan ==
      <SET command: executed by Hive, and noted by SQLContext>
      
      scala> q1.registerAsTable("properties")
      
      scala> hql("SELECT key, value FROM properties ORDER BY key LIMIT 10").foreach(println)
      ...
      == Query Plan ==
      TakeOrdered 10, [key#51:0 ASC]
       Project [key#51:0,value#52:1]
        SetCommandPhysical None, None, [key#55:0,value#56:1]), which has no missing parents
      14/06/12 12:19:27 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 5 (SchemaRDD[21] at RDD at SchemaRDD.scala:98
      == Query Plan ==
      TakeOrdered 10, [key#51:0 ASC]
       Project [key#51:0,value#52:1]
        SetCommandPhysical None, None, [key#55:0,value#56:1])
      ...
      [datanucleus.autoCreateSchema,true]
      [datanucleus.autoStartMechanismMode,checked]
      [datanucleus.cache.level2,false]
      [datanucleus.cache.level2.type,none]
      [datanucleus.connectionPoolingType,BONECP]
      [datanucleus.fixedDatastore,false]
      [datanucleus.identifierFactory,datanucleus1]
      [datanucleus.plugin.pluginRegistryBundleCheck,LOG]
      [datanucleus.rdbms.useLegacyNativeValueStrategy,true]
      [datanucleus.storeManagerType,rdbms]
      
      scala>
      ```
      
      ### "Exactly once" semantics
      
      At last, an example of the "exactly once" semantics:
      
      ```
      scala> val q2 = hql("CREATE TABLE t1(key INT, value STRING)")
      ...
      q2: org.apache.spark.sql.SchemaRDD =
      SchemaRDD[28] at RDD at SchemaRDD.scala:98
      == Query Plan ==
      <Native command: executed by Hive>
      
      scala> table("t1")
      ...
      res9: org.apache.spark.sql.SchemaRDD =
      SchemaRDD[32] at RDD at SchemaRDD.scala:98
      == Query Plan ==
      HiveTableScan [key#58,value#59], (MetastoreRelation default, t1, None), None
      
      scala> q2.collect()
      ...
      res10: Array[org.apache.spark.sql.Row] = Array([])
      
      scala>
      ```
      
      As we can see, the "CREATE TABLE" command is executed eagerly right after the `SchemaRDD` is created, and referencing the `SchemaRDD` again won't trigger a duplicate execution.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #1071 from liancheng/exactlyOnceCommand and squashes the following commits:
      
      d005b03 [Cheng Lian] Made "SET key=value" returns the newly set key value pair
      f6c7715 [Cheng Lian] Added test cases for DDL/command statement RDDs
      1d00937 [Cheng Lian] Makes SchemaRDD DSLs work for DDL/command statement RDDs
      5c7e680 [Cheng Lian] Bug fix: wrong type used in pattern matching
      48aa2e5 [Cheng Lian] Refined SQLContext.emptyResult as an empty RDD[Row]
      cc64f32 [Cheng Lian] Renamed physical plan classes for DDL/commands
      74789c1 [Cheng Lian] Fixed failing test cases
      0ad343a [Cheng Lian] Added physical plan for DDL and commands to ensure the "exactly once" semantics
      ac96d965
    • Michael Armbrust's avatar
      [SPARK-1964][SQL] Add timestamp to HiveMetastoreTypes.toMetastoreType · 1c2fd015
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1061 from marmbrus/timestamp and squashes the following commits:
      
      79c3903 [Michael Armbrust] Add timestamp to HiveMetastoreTypes.toMetastoreType()
      1c2fd015
    • nravi's avatar
      Workaround in Spark for ConcurrentModification issue (JIRA Hadoop-10456, Spark-1097) · 70c8116c
      nravi authored
      This fix has gone into Hadoop 2.4.1. For developers using Hadoop < 2.4.1, it would be good to have a workaround in Spark as well.
      
      Fix has been tested for performance as well, no regressions found.
      
      Author: nravi <nravi@c1704.halxg.cloudera.com>
      
      Closes #1000 from nishkamravi2/master and squashes the following commits:
      
      eb663ca [nravi] Merge branch 'master' of https://github.com/apache/spark
      df2aeb1 [nravi] Improved fix for ConcurrentModificationIssue (Spark-1097, Hadoop-10456)
      6b840f0 [nravi] Undo the fix for SPARK-1758 (the problem is fixed)
      5108700 [nravi] Fix in Spark for the Concurrent thread modification issue (SPARK-1097, HADOOP-10456)
      681b36f [nravi] Fix for SPARK-1758: failing test org.apache.spark.JavaAPISuite.wholeTextFiles
      70c8116c
    • Xiangrui Meng's avatar
      [HOTFIX] add math3 version to pom · b3736e3d
      Xiangrui Meng authored
      Passed `mvn package`.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1075 from mengxr/takeSample-fix and squashes the following commits:
      
      45b4590 [Xiangrui Meng] add math3 version to pom
      b3736e3d
    • Michael Armbrust's avatar
      [SPARK-2135][SQL] Use planner for in-memory scans · 13f8cfdc
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1072 from marmbrus/cachedStars and squashes the following commits:
      
      8757c8e [Michael Armbrust] Use planner for in-memory scans.
      13f8cfdc
  7. Jun 12, 2014
    • John Zhao's avatar
      [SPARK-1516]Throw exception in yarn client instead of run system.exit directly. · f95ac686
      John Zhao authored
      All the changes are in the package "org.apache.spark.deploy.yarn":
          1) Throw an exception in ClientArguments and ClientBase instead of exiting directly.
          2) In Client's main method, if an exception is caught, exit with code 1; otherwise exit with code 0.
      
      After the fix, if users integrate the Spark YARN client into their applications, the application won't be terminated when an argument is wrong or the run finishes.
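
      A minimal sketch of the pattern (the exception class and messages here are illustrative; the patch's ClientException also carries an exit code):

      ```scala
      // Throw from argument parsing instead of calling System.exit deep inside it,
      // so embedding applications can catch the failure; only the standalone
      // main() decides the process exit code.
      class ClientException(msg: String) extends Exception(msg)

      def parseArgs(args: Array[String]): Unit = {
        if (args.isEmpty) throw new ClientException("missing application arguments")
      }

      def main(args: Array[String]): Unit = {
        try {
          parseArgs(args)
          // ... submit the application to YARN ...
          System.exit(0)
        } catch {
          case e: ClientException =>
            System.err.println(e.getMessage)
            System.exit(1)
        }
      }
      ```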
      
      Author: John Zhao <jzhao@alpinenow.com>
      
      Closes #490 from codeboyyong/jira_1516_systemexit_inyarnclient and squashes the following commits:
      
      138cb48 [John Zhao] [SPARK-1516]Throw exception in yarn clinet instead of run system.exit directly. All the changes is in  the package of "org.apache.spark.deploy.yarn": 1) Add a ClientException with an exitCode 2) Throws exception in ClinetArguments and ClientBase instead of exit directly 3) in Client's main method, catch exception and exit with the exitCode.
      f95ac686
    • Andrew Or's avatar
      [Minor] Fix style, formatting and naming in BlockManager etc. · 44daec5a
      Andrew Or authored
      This is a precursor to a bigger change. I wanted to separate out the relatively insignificant changes so the ultimate PR is not inflated.
      
      (Warning: this PR is full of unimportant nitpicks)
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1058 from andrewor14/bm-minor and squashes the following commits:
      
      8e12eaf [Andrew Or] SparkException -> BlockException
      c36fd53 [Andrew Or] Make parts of BlockManager more readable
      0a5f378 [Andrew Or] Entry -> MemoryEntry
      e9762a5 [Andrew Or] Tone down string interpolation (minor reverts)
      c4de9ac [Andrew Or] Merge branch 'master' of github.com:apache/spark into bm-minor
      b3470f1 [Andrew Or] More string interpolation (minor)
      7f9dcab [Andrew Or] Use string interpolation (minor)
      94a425b [Andrew Or] Refactor against duplicate code + minor changes
      8a6a7dc [Andrew Or] Exception -> SparkException
      97c410f [Andrew Or] Deal with MIMA excludes
      2480f1d [Andrew Or] Fixes in StorgeLevel.scala
      abb0163 [Andrew Or] Style, formatting and naming fixes
      44daec5a
    • Doris Xin's avatar
      SPARK-1939 Refactor takeSample method in RDD to use ScaSRS · 1de1d703
      Doris Xin authored
      Modified the takeSample method in RDD to use the ScaSRS sampling technique to improve performance. Added a private method that computes a sampling rate greater than sample_size/total to ensure a sufficient sample size with success rate >= 0.9999, and a unit test for the private method to validate the choice of sampling rate.
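
      A rough sketch of the over-sampling bound (illustrative only; the constant follows the delta = 1e-4 success rate mentioned above, and the actual helper in the patch also handles sampling with replacement):

      ```scala
      // Sketch: pick a sampling fraction slightly above num/total so that a single
      // pass collects at least `num` items with probability >= 0.9999 (delta = 1e-4).
      def computeFraction(num: Int, total: Long): Double = {
        val fraction = num.toDouble / total
        val delta = 1e-4
        val gamma = -math.log(delta) / total
        math.min(1.0, fraction + gamma + math.sqrt(gamma * gamma + 2 * gamma * fraction))
      }
      ```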
      
      Author: Doris Xin <doris.s.xin@gmail.com>
      Author: dorx <doris.s.xin@gmail.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #916 from dorx/takeSample and squashes the following commits:
      
      5b061ae [Doris Xin] merge master
      444e750 [Doris Xin] edge cases
      3de882b [dorx] Merge pull request #2 from mengxr/SPARK-1939
      82dde31 [Xiangrui Meng] update pyspark's takeSample
      48d954d [Doris Xin] remove unused imports from RDDSuite
      fb1452f [Doris Xin] allowing num to be greater than count in all cases
      1481b01 [Doris Xin] washing test tubes and making coffee
      dc699f3 [Doris Xin] give back imports removed by accident in rdd.py
      64e445b [Doris Xin] logwarnning as soon as it enters the while loop
      55518ed [Doris Xin] added TODO for logging in rdd.py
      eff89e2 [Doris Xin] addressed reviewer comments.
      ecab508 [Doris Xin] "fixed checkstyle violation
      0a9b3e3 [Doris Xin] "reviewer comment addressed"
      f80f270 [Doris Xin] Merge branch 'master' into takeSample
      ae3ad04 [Doris Xin] fixed edge cases to prevent overflow
      065ebcd [Doris Xin] Merge branch 'master' into takeSample
      9bdd36e [Doris Xin] Check sample size and move computeFraction
      e3fd6a6 [Doris Xin] Merge branch 'master' into takeSample
      7cab53a [Doris Xin] fixed import bug in rdd.py
      ffea61a [Doris Xin] SPARK-1939: Refactor takeSample method in RDD
      1441977 [Doris Xin] SPARK-1939 Refactor takeSample method in RDD to use ScaSRS
      1de1d703
    • Ariel Rabkin's avatar
      document laziness of parallelize · 0154587a
      Ariel Rabkin authored
      Took me several hours to figure out this behavior. It would be good to highlight it in the documentation.
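
      The behavior being documented, in a nutshell (assuming an active SparkContext `sc`):

      ```scala
      // parallelize is lazy: the collection is not captured until a job actually
      // runs, so mutating it before the first action changes what the RDD sees.
      val data = scala.collection.mutable.ArrayBuffer(1, 2, 3)
      val rdd = sc.parallelize(data)
      data += 4
      rdd.collect()   // Array(1, 2, 3, 4) -- reflects the element added after parallelize
      ```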
      
      Author: Ariel Rabkin <asrabkin@cs.princeton.edu>
      
      Closes #1070 from asrabkin/master and squashes the following commits:
      
      29a076e [Ariel Rabkin] doc fix
      0154587a
    • Shuo Xiang's avatar
      SPARK-2085: [MLlib] Apply user-specific regularization instead of uniform regularization in ALS · a6e0afdc
      Shuo Xiang authored
      The current implementation of ALS takes a single regularization parameter and applies it to both the user factors and the product factors. This kind of regularization can be less effective when the number of users is significantly larger than the number of products (and vice versa). For example, if we have 10M users and 1K products, regularization on the user factors will dominate. Following the discussion in [this thread](http://apache-spark-user-list.1001560.n3.nabble.com/possible-bug-in-Spark-s-ALS-implementation-tt2567.html#a2704), the implementation in this PR regularizes each factor vector by #ratings * lambda.
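
      In terms of the per-user least-squares solve, the change amounts to scaling lambda by that user's rating count before adding it to the diagonal (a sketch under those assumptions; the matrix layout is illustrative):

      ```scala
      // Solve (A^T A + n_u * lambda * I) x_u = A^T b instead of using a flat lambda,
      // where n_u is the number of ratings contributed by user u
      // (and symmetrically for each product).
      def addWeightedRegularization(ata: Array[Array[Double]], numRatings: Int, lambda: Double): Unit = {
        val scaledLambda = numRatings * lambda
        for (i <- ata.indices) ata(i)(i) += scaledLambda
      }
      ```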
      
      Author: Shuo Xiang <sxiang@twitter.com>
      
      Closes #1026 from coderxiang/als-reg and squashes the following commits:
      
      93dfdb4 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into als-reg
      b98f19c [Shuo Xiang] merge latest master
      52c7b58 [Shuo Xiang] Apply user-specific regularization instead of uniform regularization in Alternating Least Squares (ALS)
      a6e0afdc