  1. Jun 20, 2014
    • [SQL] Improve Speed of InsertIntoHiveTable · d3b7671c
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1130 from marmbrus/noFunctional and squashes the following commits:
      
      ccdb68c [Michael Armbrust] Remove functional programming and Array allocations from fast path in InsertIntoHiveTable.
      d3b7671c
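      The squashed commit above names the optimization: replace per-row functional combinators with preallocated arrays and while loops on the hot path. A toy comparison of the two styles (my illustration, not the actual InsertIntoHiveTable code):

      ```
      // zip/map allocates a tuple per element plus intermediate arrays...
      def convertSlow(row: Array[Any], wrappers: Array[Any => Any]): Array[Any] =
        row.zip(wrappers).map { case (v, f) => f(v) }

      // ...while the imperative version fills a single preallocated array.
      def convertFast(row: Array[Any], wrappers: Array[Any => Any]): Array[Any] = {
        val out = new Array[Any](row.length)
        var i = 0
        while (i < row.length) {
          out(i) = wrappers(i)(row(i))
          i += 1
        }
        out
      }
      ```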
    • More minor scaladoc cleanup for Spark SQL. · 278ec8a2
      Reynold Xin authored
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1142 from rxin/sqlclean and squashes the following commits:
      
      67a789e [Reynold Xin] More minor scaladoc cleanup for Spark SQL.
      278ec8a2
  2. Jun 19, 2014
    • HOTFIX: SPARK-2208 local metrics tests can fail on fast machines · e5514790
      Patrick Wendell authored
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #1141 from pwendell/hotfix and squashes the following commits:
      
      83e4c79 [Patrick Wendell] HOTFIX: SPARK-2208 local metrics tests can fail on fast machines
      e5514790
    • A few minor Spark SQL Scaladoc fixes. · 5464e791
      Reynold Xin authored
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1139 from rxin/sparksqldoc and squashes the following commits:
      
      c3049d8 [Reynold Xin] Fixed line length.
      66dc72c [Reynold Xin] A few minor Spark SQL Scaladoc fixes.
      5464e791
    • [SPARK-2151] Recognize memory format for spark-submit · f14b00a9
      nravi authored
      Previously an int value was expected for the input memory parameter when spark-submit is invoked in standalone cluster mode. Accept strings such as "30g" or "512M" instead, to be consistent with the rest of Spark (a parsing sketch follows this entry).
      
      Author: nravi <nravi@c1704.halxg.cloudera.com>
      
      Closes #1095 from nishkamravi2/master and squashes the following commits:
      
      2b630f9 [nravi] Accept memory input as "30g", "512M" instead of an int value, to be consistent with rest of Spark
      3bf8fad [nravi] Merge branch 'master' of https://github.com/apache/spark
      5423a03 [nravi] Merge branch 'master' of https://github.com/apache/spark
      eb663ca [nravi] Merge branch 'master' of https://github.com/apache/spark
      df2aeb1 [nravi] Improved fix for ConcurrentModificationIssue (Spark-1097, Hadoop-10456)
      6b840f0 [nravi] Undo the fix for SPARK-1758 (the problem is fixed)
      5108700 [nravi] Fix in Spark for the Concurrent thread modification issue (SPARK-1097, HADOOP-10456)
      681b36f [nravi] Fix for SPARK-1758: failing test org.apache.spark.JavaAPISuite.wholeTextFiles
      f14b00a9
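      A minimal sketch of the memory-string handling this change adopts (my illustration; Spark's own helper at the time was `Utils.memoryStringToMb`, whose internals may differ):

      ```
      // Accept "512M", "30g", etc., or a bare number of MB, and normalize to MB.
      def memoryStringToMb(str: String): Int = {
        val lower = str.toLowerCase.trim
        if (lower.endsWith("k")) (lower.dropRight(1).toLong / 1024).toInt
        else if (lower.endsWith("m")) lower.dropRight(1).toInt
        else if (lower.endsWith("g")) lower.dropRight(1).toInt * 1024
        else if (lower.endsWith("t")) lower.dropRight(1).toInt * 1024 * 1024
        else lower.toInt  // assume a bare number is already in MB
      }

      memoryStringToMb("30g")   // 30720
      memoryStringToMb("512M")  // 512
      ```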
    • [SPARK-2191][SQL] Make sure InsertIntoHiveTable doesn't execute more than once. · 777c5958
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1129 from marmbrus/doubleCreateAs and squashes the following commits:
      
      9c6d9e4 [Michael Armbrust] Fix typo.
      5128fe2 [Michael Armbrust] Make sure InsertIntoHiveTable doesn't execute each time you ask for its result.
      777c5958
    • [SPARK-2051] In yarn.ClientBase spark.yarn.dist.* do not work · bce0897b
      witgo authored
      Author: witgo <witgo@qq.com>
      
      Closes #969 from witgo/yarn_ClientBase and squashes the following commits:
      
      8117765 [witgo] review commit
      3bdbc52 [witgo] Merge branch 'master' of https://github.com/apache/spark into yarn_ClientBase
      5261b6c [witgo] fix sys.props.get("SPARK_YARN_DIST_FILES")
      e3c1107 [witgo] update docs
      b6a9aa1 [witgo] merge master
      c8b4554 [witgo] review commit
      2f48789 [witgo] Merge branch 'master' of https://github.com/apache/spark into yarn_ClientBase
      8d7b82f [witgo] Merge branch 'master' of https://github.com/apache/spark into yarn_ClientBase
      1048549 [witgo] remove Utils.resolveURIs
      871f1db [witgo] add spark.yarn.dist.* documentation
      41bce59 [witgo] review commit
      35d6fa0 [witgo] move to ClientArguments
      55d72fc [witgo] Merge branch 'master' of https://github.com/apache/spark into yarn_ClientBase
      9cdff16 [witgo] review commit
      8bc2f4b [witgo] review commit
      20e667c [witgo] Merge branch 'master' into yarn_ClientBase
      0961151 [witgo] merge master
      ce609fc [witgo] Merge branch 'master' into yarn_ClientBase
      8362489 [witgo] yarn.ClientBase spark.yarn.dist.* do not work
      bce0897b
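      For reference, a hedged example of setting the two properties this change makes effective (paths are illustrative):

      ```
      // Ship extra files and archives to YARN containers.
      val conf = new org.apache.spark.SparkConf()
        .set("spark.yarn.dist.files", "hdfs:///user/me/lookup.txt")
        .set("spark.yarn.dist.archives", "hdfs:///user/me/deps.zip")
      ```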
    • Minor fix · 67fca189
      WangTao authored
      The value "env" is never used in SparkContext.scala.
      Add detailed comment for method setDelaySeconds in MetadataCleaner.scala instead of the unsure one.
      
      Author: WangTao <barneystinson@aliyun.com>
      
      Closes #1105 from WangTaoTheTonic/master and squashes the following commits:
      
      688358e [WangTao] Minor fix
      67fca189
    • [SPARK-2187] Explain should not run the optimizer twice. · 640c2943
      Reynold Xin authored
      @yhuai @marmbrus @concretevitamin
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1123 from rxin/explain and squashes the following commits:
      
      def83b0 [Reynold Xin] Update unit tests for explain.
      a9d3ba8 [Reynold Xin] [SPARK-2187] Explain should not run the optimizer twice.
      640c2943
    • Squishing a typo bug before it causes real harm · 566f70f2
      Doris Xin authored
      in the updateNumRows method in RowMatrix
      
      Author: Doris Xin <doris.s.xin@gmail.com>
      
      Closes #1125 from dorx/updateNumRows and squashes the following commits:
      
      8564aef [Doris Xin] Squishing a typo bug before it causes real harm
      566f70f2
  3. Jun 18, 2014
    • [SPARK-2184][SQL] AddExchange isn't idempotent · 5ff75c74
      Michael Armbrust authored
      Don't bind partitioning expressions as that breaks comparison with requiredPartitioning.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1122 from marmbrus/fixAddExchange and squashes the following commits:
      
      3417537 [Michael Armbrust] Don't bind partitioning expressions as that breaks comparison with requiredPartitioning.
      5ff75c74
    • Remove unicode operator from RDD.scala · 45a95f82
      Doris Xin authored
      Some IDEs don’t support unicode characters in source code. Check if this breaks binary compatibility.
      
      Author: Doris Xin <doris.s.xin@gmail.com>
      
      Closes #1119 from dorx/unicode and squashes the following commits:
      
      05618c3 [Doris Xin] Remove unicode operator from RDD.scala
      45a95f82
    • SPARK-2158 Clean up core/stdout file from FileAppenderSuite · 4cbeea83
      Mark Hamstra authored
      @tdas
      
      Author: Mark Hamstra <markhamstra@gmail.com>
      
      Closes #1100 from markhamstra/SPARK-2158 and squashes the following commits:
      
      ae8e069 [Mark Hamstra] Response to TD's review
      2f1e201 [Mark Hamstra] Cleanup 'stdout' file within FileAppenderSuite
      4cbeea83
    • [SPARK-1466] Raise exception if pyspark Gateway process doesn't start. · 38702487
      Kay Ousterhout authored
      If the gateway process fails to start correctly (e.g., because JAVA_HOME isn't set correctly, there's no Spark jar, etc.), right now pyspark fails because of a very difficult-to-understand error, where we try to parse stdout to get the port where Spark started and there's nothing there. This commit properly catches the error and throws an exception that includes the stderr output for much easier debugging.
      
      Thanks to @shivaram and @stogers for helping to fix this issue!
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #383 from kayousterhout/pyspark and squashes the following commits:
      
      36dd54b [Kay Ousterhout] [SPARK-1466] Raise exception if Gateway process doesn't start.
      38702487
    • Updated the comment for SPARK-2162. · dd96fcda
      Reynold Xin authored
      A follow up on #1103
      
      @andrewor14
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #1117 from rxin/SPARK-2162 and squashes the following commits:
      
      a4231de [Reynold Xin] Updated the comment for SPARK-2162.
      dd96fcda
    • [SPARK-2162] Double check in doGetLocal to avoid read on removed block. · 5ad5e348
      Raymond Liu authored
      Otherwise, it will either read in vain in the memory-level case, or throw an exception in the disk-level case, because it believes the block is there although it has actually been removed (see the sketch after this entry).
      
      Author: Raymond Liu <raymond.liu@intel.com>
      
      Closes #1103 from colorant/bm and squashes the following commits:
      
      daac114 [Raymond Liu] Address comments
      d1ea287 [Raymond Liu] Double check in doGetLocal to avoid read on removed block.
      5ad5e348
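      The double-check pattern described above, as a simplified sketch (types and names are illustrative, not the actual BlockManager code):

      ```
      import scala.collection.concurrent.TrieMap

      val blockInfo = TrieMap[String, AnyRef]()

      def readBlock(blockId: String): Option[Array[Byte]] = None  // stub

      def doGetLocal(blockId: String): Option[Array[Byte]] =
        blockInfo.get(blockId).flatMap { info =>
          info.synchronized {
            // Re-check under the lock: the block may have been removed between
            // the first lookup and here, in which case reading would fail.
            if (blockInfo.contains(blockId)) readBlock(blockId) else None
          }
        }
      ```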
    • [SPARK-2176][SQL] Extra unnecessary exchange operator in the result of an explain command · 587d3201
      Yin Huai authored
      ```
      hql("explain select * from src group by key").collect().foreach(println)
      
      [ExplainCommand [plan#27:0]]
      [ Aggregate false, [key#25], [key#25,value#26]]
      [  Exchange (HashPartitioning [key#25:0], 200)]
      [   Exchange (HashPartitioning [key#25:0], 200)]
      [    Aggregate true, [key#25], [key#25]]
      [     HiveTableScan [key#25,value#26], (MetastoreRelation default, src, None), None]
      ```
      
      There are two exchange operators.
      
      However, if we do not use explain...
      ```
      hql("select * from src group by key")
      
      res4: org.apache.spark.sql.SchemaRDD =
      SchemaRDD[8] at RDD at SchemaRDD.scala:100
      == Query Plan ==
      Aggregate false, [key#8], [key#8,value#9]
       Exchange (HashPartitioning [key#8:0], 200)
        Aggregate true, [key#8], [key#8]
         HiveTableScan [key#8,value#9], (MetastoreRelation default, src, None), None
      ```
      The plan is fine.
      
      The cause of this bug is explained below.
      
      When we create an `execution.ExplainCommand`, we use the `executedPlan` as the child of this `ExplainCommand`. But this `executedPlan` is prepared for execution again when we generate the `executedPlan` for the `ExplainCommand`; basically, `prepareForExecution` is called twice on a physical plan. Because after `prepareForExecution` we have already bound those references (in `BoundReference`s), `AddExchange` cannot figure out that we are using the same partitioning (we use `AttributeReference`s to create an `ExchangeOperator`, and those references are changed to `BoundReference`s after `prepareForExecution` is called). So an extra `ExchangeOperator` is inserted.
      
      I think in `CommandStrategy` we should just use the `sparkPlan` (the input of `prepareForExecution`) to initialize the `ExplainCommand`, instead of using `executedPlan` (a toy illustration follows this entry).
      
      The link to JIRA: https://issues.apache.org/jira/browse/SPARK-2176
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #1116 from yhuai/SPARK-2176 and squashes the following commits:
      
      197c19c [Yin Huai] Use sparkPlan to initialize a Physical Explain Command instead of using executedPlan.
      587d3201
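      A self-contained toy model (my illustration, not Catalyst code) of why preparing a plan twice inserts a duplicate Exchange: once named references are rewritten to positional bound ones, the rule can no longer see that the required partitioning is already satisfied:

      ```
      sealed trait Expr
      case class Named(name: String) extends Expr   // like AttributeReference
      case class Bound(ordinal: Int) extends Expr   // like BoundReference

      sealed trait Plan
      case class Scan(output: List[String]) extends Plan
      case class Exchange(keys: List[Expr], child: Plan) extends Plan
      case class Aggregate(keys: List[Expr], child: Plan) extends Plan

      // "AddExchange": wrap the child unless it already partitions by the same keys.
      def addExchange(p: Plan): Plan = p match {
        case Aggregate(keys, child) => addExchange(child) match {
          case e @ Exchange(k, _) if k == keys => Aggregate(keys, e)
          case other => Aggregate(keys, Exchange(keys, other))
        }
        case other => other
      }

      // "prepareForExecution" side effect: rewrite the Aggregate's Named keys to
      // Bound ordinals (the Exchange keeps the Named form it was built with).
      def bindAggregates(p: Plan, output: List[String]): Plan = p match {
        case Aggregate(ks, c) =>
          Aggregate(ks.map {
            case Named(n) => Bound(output.indexOf(n))
            case b => b
          }, bindAggregates(c, output))
        case Exchange(ks, c) => Exchange(ks, bindAggregates(c, output))
        case leaf => leaf
      }

      val plan = Aggregate(List(Named("key")), Scan(List("key", "value")))
      val once = bindAggregates(addExchange(plan), List("key", "value"))
      // Second preparation pass: Bound(0) != Named("key"), so another Exchange
      // is stacked on top -- the extra operator seen in the EXPLAIN output.
      val twice = addExchange(once)
      ```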
    • [STREAMING] SPARK-2009 Key not found exception when slow receiver starts · 889f7b76
      Vadim Chekan authored
      I got "java.util.NoSuchElementException: key not found: 1401756085000 ms" exception when using kafka stream and 1 sec batchPeriod.
      
      Investigation showed that the reason is that ReceiverLauncher.startReceivers is asynchronous (started in a thread).
      https://github.com/vchekan/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala#L206
      
      A slow-starting receiver, such as Kafka's, easily takes more than 2 s to start. As a result, "compute" is never called on the ReceiverInputDStream before the first batch job is executed, so receivedBlockInfo remains empty. The batch job then calls ReceiverInputDStream.getReceivedBlockInfo and hits the "key not found" exception.
      
      The patch makes getReceivedBlockInfo more robust by tolerating missing values (sketched after this entry).
      
      Author: Vadim Chekan <kot.begemot@gmail.com>
      
      Closes #961 from vchekan/branch-1.0 and squashes the following commits:
      
      e86f82b [Vadim Chekan] Fixed indentation
      4609563 [Vadim Chekan] Key not found exception: if receiver is slow to start, it is possible that getReceivedBlockInfo will be called before compute has been called
      (cherry picked from commit 26f6b989)
      
      Signed-off-by: Patrick Wendell <pwendell@gmail.com>
      889f7b76
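      The shape of the defensive lookup the patch introduces, sketched (field and method names are illustrative, not the actual ReceiverInputDStream source):

      ```
      import scala.collection.mutable

      // Batch time (ms) -> block info reported by the receiver.
      val receivedBlockInfo = mutable.HashMap[Long, Array[String]]()

      def getReceivedBlockInfo(batchTime: Long): Array[String] =
        // Tolerate a missing key: a slow-starting receiver (e.g. Kafka) may not
        // have registered anything before the first batch job runs.
        receivedBlockInfo.getOrElse(batchTime, Array.empty[String])
      ```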
  4. Jun 17, 2014
    • Revert "SPARK-2038: rename "conf" parameters in the saveAsHadoop functions" · 9e4b4bd0
      Patrick Wendell authored
      This reverts commit 443f5e1b.
      
      This commit unfortunately would break source compatibility for any caller that passed the conf parameter by name (illustrated after this entry).
      9e4b4bd0
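      Why a parameter rename is source-breaking in Scala: callers can pass arguments by name, so the parameter name is part of the API. An illustrative call site:

      ```
      // Compiles against the old signature saveAsHadoopDataset(conf: JobConf):
      rdd.saveAsHadoopDataset(conf = jobConf)
      // After renaming the parameter, callers must write hadoopConf = jobConf,
      // so existing sources that used the old name stop compiling.
      ```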
    • [SPARK-2060][SQL] Querying JSON Datasets with SQL and DSL in Spark SQL · d2f4f30b
      Yin Huai authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-2060
      
      Programming guide: http://yhuai.github.io/site/sql-programming-guide.html
      
      Scala doc of SQLContext: http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.sql.SQLContext
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #999 from yhuai/newJson and squashes the following commits:
      
      227e89e [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      ce8eedd [Yin Huai] rxin's comments.
      bc9ac51 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      94ffdaa [Yin Huai] Remove "get" from method names.
      ce31c81 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      e2773a6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      79ea9ba [Yin Huai] Fix typos.
      5428451 [Yin Huai] Newline
      1f908ce [Yin Huai] Remove extra line.
      d7a005c [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      7ea750e [Yin Huai] marmbrus's comments.
      6a5f5ef [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      83013fb [Yin Huai] Update Java Example.
      e7a6c19 [Yin Huai] SchemaRDD.javaToPython should convert a field with the StructType to a Map.
      6d20b85 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      4fbddf0 [Yin Huai] Programming guide.
      9df8c5a [Yin Huai] Python API.
      7027634 [Yin Huai] Java API.
      cff84cc [Yin Huai] Use a SchemaRDD for a JSON dataset.
      d0bd412 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      ab810b0 [Yin Huai] Make JsonRDD private.
      6df0891 [Yin Huai] Apache header.
      8347f2e [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      66f9e76 [Yin Huai] Update docs and use the entire dataset to infer the schema.
      8ffed79 [Yin Huai] Update the example.
      a5a4b52 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      4325475 [Yin Huai] If a sampled dataset is used for schema inferring, update the schema of the JsonTable after first execution.
      65b87f0 [Yin Huai] Fix sampling...
      8846af5 [Yin Huai] API doc.
      52a2275 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      0387523 [Yin Huai] Address PR comments.
      666b957 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      a2313a6 [Yin Huai] Address PR comments.
      f3ce176 [Yin Huai] After type conflict resolution, if a NullType is found, StringType is used.
      0576406 [Yin Huai] Add Apache license header.
      af91b23 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      f45583b [Yin Huai] Infer the schema of a JSON dataset (a text file with one JSON object per line or a RDD[String] with one JSON object per string) and returns a SchemaRDD.
      f31065f [Yin Huai] A query plan or a SchemaRDD can print out its schema.
      d2f4f30b
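      A hedged usage sketch based on the linked programming guide; `jsonFile` and `jsonRDD` are the SQLContext entry points this PR adds (the guide's own examples may differ):

      ```
      val sqlContext = new org.apache.spark.sql.SQLContext(sc)

      // Infer the schema of a text file with one JSON object per line...
      val people = sqlContext.jsonFile("people.json")
      people.printSchema()

      // ...or of an RDD[String] with one JSON object per string.
      val strings = sc.parallelize("""{"name":"Yin","age":30}""" :: Nil)
      val another = sqlContext.jsonRDD(strings)
      ```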
    • HOTFIX: bug caused by #941 · b2ebf429
      Patrick Wendell authored
      This patch should have qualified the use of PIPE. It needs to be back-ported to 0.9 and 1.0.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #1108 from pwendell/hotfix and squashes the following commits:
      
      711c58d [Patrick Wendell] HOTFIX: bug caused by #941
      b2ebf429
    • [SPARK-2147 / 2161] Show removed executors on the UI · a14807e8
      Andrew Or authored
      This PR includes two changes
      - **[SPARK-2147]** When an application finishes cleanly (i.e. `sc.stop()` is called), all of its executors used to disappear from the Master UI. This no longer happens.
      - **[SPARK-2161]** This adds a "Removed Executors" table to the Master UI, so the user can find out from the logs why their executors died, for instance. The equivalent table already existed in the Worker UI but was hidden because of a bug (the comment `//scalastyle:off` disconnected the `Seq[Node]` that represents the HTML for the table).
      
      This should go into 1.0.1 if possible.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1102 from andrewor14/remember-removed-executors and squashes the following commits:
      
      2e2298f [Andrew Or] Add hash code method to ExecutorInfo (minor)
      abd72e0 [Andrew Or] Merge branch 'master' of github.com:apache/spark into remember-removed-executors
      792f992 [Andrew Or] Add missing equals method in ExecutorInfo
      3390b49 [Andrew Or] Add executor state column to WorkerPage
      161f8a2 [Andrew Or] Display finished executors table (fix bug)
      fbb65b8 [Andrew Or] Removed unused method
      c89bb6e [Andrew Or] Add table for removed executors in MasterWebUI
      fe47402 [Andrew Or] Show exited executors on the Master UI
      a14807e8
    • SPARK-2038: rename "conf" parameters in the saveAsHadoop functions · 443f5e1b
      CodingCat authored
      to distinguish with SparkConf object
      
      https://issues.apache.org/jira/browse/SPARK-2038
      
      Author: CodingCat <zhunansjtu@gmail.com>
      
      Closes #1087 from CodingCat/SPARK-2038 and squashes the following commits:
      
      763975f [CodingCat] style fix
      d91288d [CodingCat] rename "conf" parameters in the saveAsHadoop functions
      443f5e1b
    • SPARK-2146. Fix takeOrdered doc · 2794990e
      Sandy Ryza authored
      Removes Python syntax in Scaladoc, corrects result in Scaladoc, and removes irrelevant cache() call in Python doc.
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #1086 from sryza/sandy-spark-2146 and squashes the following commits:
      
      185ff18 [Sandy Ryza] Use Seq instead of Array
      c996120 [Sandy Ryza] SPARK-2146.  Fix takeOrdered doc
      2794990e
    • SPARK-1063 Add .sortBy(f) method on RDD · b92d16b1
      Andrew Ash authored
      This never got merged from the apache/incubator-spark repo (which is now deleted) but there had been several rounds of code review on this PR there.
      
      I think this is ready for merging.
      
      Author: Andrew Ash <andrew@andrewash.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Reynold Xin <rxin@apache.org>
      
      Closes #369 from ash211/sortby and squashes the following commits:
      
      d09147a [Andrew Ash] Fix Ordering import
      43d0a53 [Andrew Ash] Fix missing .collect()
      29a54ed [Andrew Ash] Re-enable test by converting to a closure
      5a95348 [Andrew Ash] Add license for RDDSuiteUtils
      64ed6e3 [Andrew Ash] Remove leaked diff
      d4de69a [Andrew Ash] Remove scar tissue
      63638b5 [Andrew Ash] Add Python version of .sortBy()
      45e0fde [Andrew Ash] Add Java version of .sortBy()
      adf84c5 [Andrew Ash] Re-indent to keep line lengths under 100 chars
      9d9b9d8 [Andrew Ash] Use parentheses on .collect() calls
      0457b69 [Andrew Ash] Ignore failing test
      99f0baf [Andrew Ash] Merge branch 'master' into sortby
      222ae97 [Andrew Ash] Try moving Ordering objects out to a different class
      3fd0dd3 [Andrew Ash] Add (failing) test for sortByKey with explicit Ordering
      b8b5bbc [Andrew Ash] Align remove extra spaces that were used to align ='s in test code
      8c53298 [Andrew Ash] Actually use ascending and numPartitions parameters
      381eef2 [Andrew Ash] Correct silly typo
      7db3e84 [Andrew Ash] Support ascending and numPartitions params in sortBy()
      0f685fd [Andrew Ash] Merge remote-tracking branch 'origin/master' into sortby
      ca4490d [Andrew Ash] Add .sortBy(f) method on RDD
      b92d16b1
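      Usage sketch of the new method (per the squashed commits, optional `ascending` and `numPartitions` parameters are supported):

      ```
      val rdd = sc.parallelize(Seq(("b", 2), ("c", 3), ("a", 1)))

      rdd.sortBy(_._2).collect()                     // Array((a,1), (b,2), (c,3))
      rdd.sortBy(_._1, ascending = false).collect()  // Array((c,3), (b,2), (a,1))
      ```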
    • [SPARK-2053][SQL] Add Catalyst expressions for CASE WHEN. · e243c5ff
      Zongheng Yang authored
      JIRA ticket: https://issues.apache.org/jira/browse/SPARK-2053
      
      This PR adds support for two types of CASE statements present in Hive. The first type is of the form `CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END`, with the semantics like a chain of if statements. The second type is of the form `CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END`, with the semantics like a switch statement on key `a`. Both forms are implemented in `CaseWhen`.
      
      [This link](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-ConditionalFunctions) contains more detailed descriptions on their semantics.
      
      Notes / Open issues:
      
      * Please check if any implicit contracts / invariants are broken in the implementations (especially for the operators). I am not very familiar with them and I currently find them tricky to spot.
      * We should decide whether or not a non-boolean condition is allowed in a branch of `CaseWhen`. Hive throws a `SemanticException` for this situation and I think it'd be good to mimic it -- the question is where in the whole Spark SQL pipeline should we signal an exception for such a query.
      
      Author: Zongheng Yang <zongheng.y@gmail.com>
      
      Closes #1055 from concretevitamin/caseWhen and squashes the following commits:
      
      4226eb9 [Zongheng Yang] Comment.
      79d26fc [Zongheng Yang] Merge branch 'master' into caseWhen
      caf9383 [Zongheng Yang] Update a FIXME.
      9d26ab8 [Zongheng Yang] Add @transient marker.
      788a0d9 [Zongheng Yang] Implement CastNulls, which fixes udf_case and udf_when.
      7ef284f [Zongheng Yang] Refactors: remove redundant passes, improve toString, mark transient.
      f47ae7b [Zongheng Yang] Modify queries in tests to have shorter golden files.
      1c1fbfc [Zongheng Yang] Cleanups per review comments.
      7d2b7e2 [Zongheng Yang] Translate CaseKeyWhen to CaseWhen at parsing time.
      47d406a [Zongheng Yang] Do toArray once and lazily outside of eval().
      bb3d109 [Zongheng Yang] Update scaladoc of a method.
      aea3195 [Zongheng Yang] Fix bug that branchesArr is not used; remove unused import.
      96870a8 [Zongheng Yang] Turn off scalastyle for some comments.
      7392f3a [Zongheng Yang] Minor cleanup.
      2cf08bb [Zongheng Yang] Merge branch 'master' into caseWhen
      9f84b40 [Zongheng Yang] Add golden outputs from Hive.
      db51a85 [Zongheng Yang] Add allCondBooleans check; uncomment tests.
      3f9ef0a [Zongheng Yang] Cleanups and bug fixes (mainly in eval() and resolved).
      be54bc8 [Zongheng Yang] Rewrite eval() to a low-level implementation. Separate two CASE stmts.
      f2bcb9d [Zongheng Yang] WIP
      5906f75 [Zongheng Yang] WIP
      efd019b [Zongheng Yang] eval() and toString() bug fixes.
      7d81e95 [Zongheng Yang] Clean up resolved.
      a31d782 [Zongheng Yang] Finish up Case.
      e243c5ff
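      The two forms, exercised through HiveQL (illustrative queries; `src` is the usual Hive test table with `key`/`value` columns):

      ```
      // Chain-of-ifs form: CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END
      hql("SELECT CASE WHEN key > 100 THEN 'big' ELSE 'small' END FROM src")

      // Switch form: CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END
      hql("SELECT CASE key WHEN 0 THEN 'zero' WHEN 1 THEN 'one' ELSE 'many' END FROM src")
      ```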
    • [SPARK-2164][SQL] Allow Hive UDF on columns of type struct · f5a4049e
      Xi Liu authored
      Author: Xi Liu <xil@conviva.com>
      
      Closes #796 from xiliu82/sqlbug and squashes the following commits:
      
      328dfc4 [Xi Liu] [Spark SQL] remove a temporary function after test
      354386a [Xi Liu] [Spark SQL] add test suite for UDF on struct
      8fc6f51 [Xi Liu] [SparkSQL] allow UDF on struct
      f5a4049e
    • [SPARK-2144] ExecutorsPage reports incorrect # of RDD blocks · 09deb3ee
      Andrew Or authored
      This is reproducible whenever we drop a block because of memory pressure.
      
      This is because StorageStatusListener never actually removes anything from the block maps of its StorageStatuses. Instead, when a block is dropped, it sets the block's storage level to `StorageLevel.NONE`, whereas it should just remove the entry from the map (see the sketch after this entry).
      
      This PR includes this simple fix.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1080 from andrewor14/ui-blocks and squashes the following commits:
      
      fcf9f1a [Andrew Or] Remove BlockStatus if it is no longer cached
      09deb3ee
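      The shape of the fix, sketched (names simplified; the actual StorageStatusListener code differs):

      ```
      import scala.collection.mutable
      import org.apache.spark.storage.StorageLevel

      def updateBlock(blocks: mutable.Map[String, StorageLevel],
                      blockId: String, level: StorageLevel): Unit = {
        // Before the fix the entry lingered with its level set to NONE;
        // an uncached block should instead be removed from the map entirely.
        if (level == StorageLevel.NONE) blocks.remove(blockId)
        else blocks(blockId) = level
      }
      ```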
    • SPARK-2035: Store call stack for stages, display it on the UI. · 23a12ce2
      Daniel Darabos authored
      I'm not sure about the test -- I get a lot of unrelated failures for some reason. I'll try to sort it out. But hopefully the automation will test this for me if I send a pull request :).
      
      I'll attach a demo HTML in [Jira](https://issues.apache.org/jira/browse/SPARK-2035).
      
      Author: Daniel Darabos <darabos.daniel@gmail.com>
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #981 from darabos/darabos-call-stack and squashes the following commits:
      
      f7c6bfa [Daniel Darabos] Fix bad merge. I undid 83c226d4 by Doris.
      3d0a48d [Daniel Darabos] Merge remote-tracking branch 'upstream/master' into darabos-call-stack
      b857849 [Daniel Darabos] Style: Break long line.
      ecb5690 [Daniel Darabos] Include the last Spark method in the full stack trace. Otherwise it is not visible if the stage name is overridden.
      d00a85b [Patrick Wendell] Make call sites for stages non-optional and well defined
      b9eba24 [Daniel Darabos] Make StageInfo.details non-optional. Add JSON serialization code for the new field. Verify JSON backward compatibility.
      4312828 [Daniel Darabos] Remove Mima excludes for CallSite. They should be unnecessary now, with SPARK-2070 fixed.
      0920750 [Daniel Darabos] Merge remote-tracking branch 'upstream/master' into darabos-call-stack
      a4b1faf [Daniel Darabos] Add Mima exclusions for the CallSite changes it has picked up. They are private methods/classes, so we ought to be safe.
      932f810 [Daniel Darabos] Use empty CallSite instead of null in DAGSchedulerSuite. Outside of testing, this parameter always originates in SparkContext.scala, and will never be null.
      ccd89d1 [Daniel Darabos] Fix long lines.
      ac173e4 [Daniel Darabos] Hide "show details" if there are no details to show.
      6182da6 [Daniel Darabos] Set a configurable limit on maximum call stack depth. It can be useful in memory-constrained situations with large numbers of stages.
      8fe2e34 [Daniel Darabos] Store call stack for stages, display it on the UI.
      23a12ce2
    • SPARK-1990: added compatibility for python 2.6 for ssh_read command · 8cd04c3e
      Anant authored
      https://issues.apache.org/jira/browse/SPARK-1990
      
      There were some posts on the lists saying that spark-ec2 does not work with Python 2.6. In addition, we should check the Python version at the top of the script and exit if it's too old.
      
      Author: Anant <anant.asty@gmail.com>
      
      Closes #941 from anantasty/SPARK-1990 and squashes the following commits:
      
      4ca441d [Anant] Implemented check_output within the module to work with python 2.6
      c6ed85c [Anant] added compatibility for python 2.6 for ssh_read command
      8cd04c3e
    • [SPARK-2130] End-user friendly String repr for StorageLevel in Python · d81c08ba
      Kan Zhang authored
      JIRA issue https://issues.apache.org/jira/browse/SPARK-2130
      
      This PR adds an end-user friendly String representation for StorageLevel
      in Python, similar to ```StorageLevel.description``` in Scala.
      ```
      >>> rdd = sc.parallelize([1,2])
      >>> storage_level = rdd.getStorageLevel()
      >>> storage_level
      StorageLevel(False, False, False, False, 1)
      >>> print(storage_level)
      Serialized 1x Replicated
      ```
      
      Author: Kan Zhang <kzhang@apache.org>
      
      Closes #1096 from kanzhang/SPARK-2130 and squashes the following commits:
      
      7c8b98b [Kan Zhang] [SPARK-2130] Prettier epydoc output
      cc5bf45 [Kan Zhang] [SPARK-2130] End-user friendly String representation for StorageLevel in Python
      d81c08ba
    • MLlib documentation fix · 7afa912e
      Anatoli Fomenko authored
      Synchronized mllib-optimization.md with the Spark Scaladoc: removed the reference to the GradientDescent.runMiniBatchSGD method.
      
      This is a temporary fix to remove a link from http://spark.apache.org/docs/latest/mllib-optimization.html to GradientDescent.runMiniBatchSGD, which is not in the current online GradientDescent Scaladoc.
      FIXME: revert this commit after the GradientDescent Scaladoc is updated.
      See images for details.
      
      ![mllib-docs-fix-1](https://cloud.githubusercontent.com/assets/1375501/3294410/ccf19bb8-f5a8-11e3-93f1-f593016209eb.png)
      ![mllib-docs-fix-2](https://cloud.githubusercontent.com/assets/1375501/3294411/d0b59a7e-f5a8-11e3-8fc8-329c177ef8c8.png)
      
      Author: Anatoli Fomenko <fa@apache.org>
      
      Closes #1098 from afomenko/master and squashes the following commits:
      
      5cb0758 [Anatoli Fomenko] MLlib documentation fix
      7afa912e
  5. Jun 16, 2014
    • Minor fix: made "EXPLAIN" output play well with JDBC output format · 237b96bc
      Cheng Lian authored
      Fixed the broken JDBC output. Test from Shark `beeline`:
      
      ```
      beeline> !connect jdbc:hive2://localhost:10000/
      scan complete in 2ms
      Connecting to jdbc:hive2://localhost:10000/
      Enter username for jdbc:hive2://localhost:10000/: lian
      Enter password for jdbc:hive2://localhost:10000/:
      Connected to: Hive (version 0.12.0)
      Driver: Hive (version 0.12.0)
      Transaction isolation: TRANSACTION_REPEATABLE_READ
      0: jdbc:hive2://localhost:10000/>
      0: jdbc:hive2://localhost:10000/> explain select * from src;
      +-------------------------------------------------------------------------------+
      |                                     plan                                      |
      +-------------------------------------------------------------------------------+
      | ExplainCommand [plan#2:0]                                                     |
      |  HiveTableScan [key#0,value#1], (MetastoreRelation default, src, None), None  |
      +-------------------------------------------------------------------------------+
      2 rows selected (1.386 seconds)
      ```
      
      Before this change, the output looked something like this:
      
      ```
      +-------------------------------------------------------------------------------+
      |                                     plan                                      |
      +-------------------------------------------------------------------------------+
      | ExplainCommand [plan#2:0]
       HiveTableScan [key#0,value#1], (MetastoreRelation default, src, None), None  |
      +-------------------------------------------------------------------------------+
      ```
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #1097 from liancheng/multiLineExplain and squashes the following commits:
      
      eb37967 [Cheng Lian] Made output of "EXPLAIN" play well with JDBC output format
      237b96bc
    • [SQL][SPARK-2094] Follow up of PR #1071 for Java API · 273afcb2
      Cheng Lian authored
      Updated `JavaSQLContext` and `JavaHiveContext` similar to what we've done to `SQLContext` and `HiveContext` in PR #1071. Added corresponding test case for Spark SQL Java API.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #1085 from liancheng/spark-2094-java and squashes the following commits:
      
      29b8a51 [Cheng Lian] Avoided instantiating JavaSparkContext & JavaHiveContext to workaround test failure
      92bb4fb [Cheng Lian] Marked test cases in JavaHiveQLSuite with "ignore"
      22aec97 [Cheng Lian] Follow up of PR #1071 for Java API
      273afcb2
    • [SPARK-1930] The container is running beyond physical memory limits and is killed · cdf2b045
      witgo authored
      Author: witgo <witgo@qq.com>
      
      Closes #894 from witgo/SPARK-1930 and squashes the following commits:
      
      564307e [witgo] Update the running-on-yarn.md
      3747515 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1930
      172647b [witgo] add memoryOverhead docs
      a0ff545 [witgo] leaving only two configs
      a17bda2 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1930
      478ca15 [witgo] Merge branch 'master' into SPARK-1930
      d1244a1 [witgo] Merge branch 'master' into SPARK-1930
      8b967ae [witgo] Merge branch 'master' into SPARK-1930
      655a820 [witgo] review commit
      71859a7 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1930
      e3c531d [witgo] review commit
      e16f190 [witgo] different memoryOverhead
      ffa7569 [witgo] review commit
      5c9581f [witgo] Merge branch 'master' into SPARK-1930
      9a6bcf2 [witgo] review commit
      8fae45a [witgo] fix NullPointerException
      e0dcc16 [witgo] Adding  configuration items
      b6a989c [witgo] Fix container memory beyond limit, were killed
      cdf2b045
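      Per the squashed commits ("leaving only two configs"), the overhead can be raised when YARN kills containers for exceeding physical memory. A hedged example (values illustrative, in MB):

      ```
      val conf = new org.apache.spark.SparkConf()
        .set("spark.yarn.executor.memoryOverhead", "512")
        .set("spark.yarn.driver.memoryOverhead", "512")
      ```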
    • [SPARK-2010] Support for nested data in PySpark SQL · 4fdb4917
      Kan Zhang authored
      JIRA issue https://issues.apache.org/jira/browse/SPARK-2010
      
      This PR adds support for nested collection types in PySpark SQL, including
      array, dict, list, set, and tuple. For example:
      
      ```
      >>> from array import array
      >>> from pyspark.sql import SQLContext
      >>> sqlCtx = SQLContext(sc)
      >>> rdd = sc.parallelize([
      ...         {"f1" : array('i', [1, 2]), "f2" : {"row1" : 1.0}},
      ...         {"f1" : array('i', [2, 3]), "f2" : {"row2" : 2.0}}])
      >>> srdd = sqlCtx.inferSchema(rdd)
      >>> srdd.collect() == [{"f1" : array('i', [1, 2]), "f2" : {"row1" : 1.0}},
      ...                    {"f1" : array('i', [2, 3]), "f2" : {"row2" : 2.0}}]
      True
      >>> rdd = sc.parallelize([
      ...         {"f1" : [[1, 2], [2, 3]], "f2" : set([1, 2]), "f3" : (1, 2)},
      ...         {"f1" : [[2, 3], [3, 4]], "f2" : set([2, 3]), "f3" : (2, 3)}])
      >>> srdd = sqlCtx.inferSchema(rdd)
      >>> srdd.collect() == \
      ... [{"f1" : [[1, 2], [2, 3]], "f2" : set([1, 2]), "f3" : (1, 2)},
      ...  {"f1" : [[2, 3], [3, 4]], "f2" : set([2, 3]), "f3" : (2, 3)}]
      True
      ```
      
      Author: Kan Zhang <kzhang@apache.org>
      
      Closes #1041 from kanzhang/SPARK-2010 and squashes the following commits:
      
      1b2891d [Kan Zhang] [SPARK-2010] minor doc change and adding a TODO
      504f27e [Kan Zhang] [SPARK-2010] Support for nested data in PySpark SQL
      4fdb4917
    • SPARK-2039: apply output dir existence checking for all output formats · 716c88aa
      CodingCat authored
      https://issues.apache.org/jira/browse/SPARK-2039
      
      apply output dir existence checking for all output formats
      
      Author: CodingCat <zhunansjtu@gmail.com>
      
      Closes #1088 from CodingCat/SPARK-2039 and squashes the following commits:
      
      c52747a [CodingCat] apply output dir existence checking for all output formats
      716c88aa
    • Updating docs to include missing information about reducers and clarify ... · 119b06a0
      Ali Ghodsi authored
      Clarifies how the OFFHEAP storage level works (there has been confusion around this).
      
      Author: Ali Ghodsi <alig@cs.berkeley.edu>
      
      Closes #1089 from alig/master and squashes the following commits:
      
      ca8114d [Ali Ghodsi] Updating docs to include missing information about reducers and clarify how the OFFHEAP storage level works (there has been confusion around this).
      119b06a0
    • SPARK-2148 Add link to requirements for custom equals() and hashcode() methods · 9672ee07
      Andrew Ash authored
      https://issues.apache.org/jira/browse/SPARK-2148
      
      Author: Andrew Ash <andrew@andrewash.com>
      
      Closes #1092 from ash211/SPARK-2148 and squashes the following commits:
      
      93513df [Andrew Ash] SPARK-2148 Add link to requirements for custom equals() and hashcode() methods
      9672ee07
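      The linked requirement in brief: a class used as a key must override equals() and hashCode() consistently, or shuffles will group keys incorrectly. A standard example (mine, not taken from the linked docs):

      ```
      class Point(val x: Int, val y: Int) extends Serializable {
        override def equals(other: Any): Boolean = other match {
          case p: Point => p.x == x && p.y == y
          case _        => false
        }
        // Equal objects must produce equal hash codes.
        override def hashCode: Int = 31 * x + y
      }
      ```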
    • SPARK-1999: StorageLevel in storage tab and RDD Storage Info never changes · a63aa1ad
      CrazyJvm authored
      The StorageLevel shown in the 'storage tab' and 'RDD Storage Info' never changes, even if you call rdd.unpersist() and then give the RDD a different storage level (see the example after this entry).
      
      Author: CrazyJvm <crazyjvm@gmail.com>
      
      Closes #968 from CrazyJvm/ui-storagelevel and squashes the following commits:
      
      62555fa [CrazyJvm] change RDDInfo constructor param 'storageLevel' to a var, so there's no need to add another variable _storageLevel.
      9f1571e [CrazyJvm] JIRA https://issues.apache.org/jira/browse/SPARK-1999 UI : StorageLevel in storage tab and RDD Storage Info never changes
      a63aa1ad
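      The scenario described above, in which the storage tab previously kept reporting the stale level (illustrative):

      ```
      import org.apache.spark.storage.StorageLevel

      val rdd = sc.parallelize(1 to 1000).persist(StorageLevel.MEMORY_ONLY)
      rdd.count()
      rdd.unpersist()
      rdd.persist(StorageLevel.DISK_ONLY)  // the storage tab should now show DISK_ONLY
      rdd.count()
      ```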