  1. Mar 29, 2015
    • Nishkam Ravi's avatar
      [SPARK-6406] Launch Spark using assembly jar instead of a separate launcher jar · e3eb3939
      Nishkam Ravi authored
      Author: Nishkam Ravi <nravi@cloudera.com>
      Author: nishkamravi2 <nishkamravi@gmail.com>
      Author: nravi <nravi@c1704.halxg.cloudera.com>
      
      Closes #5085 from nishkamravi2/master_nravi and squashes the following commits:
      
      bad4349 [nishkamravi2] Update Main.java
      36a6f87 [Nishkam Ravi] Minor changes and bug fixes
      b7f4ae7 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      4a45d6a [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      458af39 [Nishkam Ravi] Locate the jar using getLocation, obviates the need to pass assembly path as an argument
      d9658d6 [Nishkam Ravi] Changes for SPARK-6406
      ccdc334 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      3faa7a4 [Nishkam Ravi] Launcher library changes (SPARK-6406)
      345206a [Nishkam Ravi] spark-class merge Merge branch 'master_nravi' of https://github.com/nishkamravi2/spark into master_nravi
      ac58975 [Nishkam Ravi] spark-class changes
      06bfeb0 [nishkamravi2] Update spark-class
      35af990 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      32c3ab3 [nishkamravi2] Update AbstractCommandBuilder.java
      4bd4489 [nishkamravi2] Update AbstractCommandBuilder.java
      746f35b [Nishkam Ravi] "hadoop" string in the assembly name should not be mandatory (everywhere else in spark we mandate spark-assembly*hadoop*.jar)
      bfe96e0 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      ee902fa [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      d453197 [nishkamravi2] Update NewHadoopRDD.scala
      6f41a1d [nishkamravi2] Update NewHadoopRDD.scala
      0ce2c32 [nishkamravi2] Update HadoopRDD.scala
      f7e33c2 [Nishkam Ravi] Merge branch 'master_nravi' of https://github.com/nishkamravi2/spark into master_nravi
      ba1eb8b [Nishkam Ravi] Try-catch block around the two occurrences of removeShutDownHook. Deletion of semi-redundant occurrences of expensive operation inShutDown.
      71d0e17 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      494d8c0 [nishkamravi2] Update DiskBlockManager.scala
      3c5ddba [nishkamravi2] Update DiskBlockManager.scala
      f0d12de [Nishkam Ravi] Workaround for IllegalStateException caused by recent changes to BlockManager.stop
      79ea8b4 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      b446edc [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      5c9a4cb [nishkamravi2] Update TaskSetManagerSuite.scala
      535295a [nishkamravi2] Update TaskSetManager.scala
      3e1b616 [Nishkam Ravi] Modify test for maxResultSize
      9f6583e [Nishkam Ravi] Changes to maxResultSize code (improve error message and add condition to check if maxResultSize > 0)
      5f8f9ed [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      636a9ff [nishkamravi2] Update YarnAllocator.scala
      8f76c8b [Nishkam Ravi] Doc change for yarn memory overhead
      35daa64 [Nishkam Ravi] Slight change in the doc for yarn memory overhead
      5ac2ec1 [Nishkam Ravi] Remove out
      dac1047 [Nishkam Ravi] Additional documentation for yarn memory overhead issue
      42c2c3d [Nishkam Ravi] Additional changes for yarn memory overhead issue
      362da5e [Nishkam Ravi] Additional changes for yarn memory overhead
      c726bd9 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      f00fa31 [Nishkam Ravi] Improving logging for AM memoryOverhead
      1cf2d1e [nishkamravi2] Update YarnAllocator.scala
      ebcde10 [Nishkam Ravi] Modify default YARN memory_overhead-- from an additive constant to a multiplier (redone to resolve merge conflicts)
      2e69f11 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      efd688a [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark
      2b630f9 [nravi] Accept memory input as "30g", "512M" instead of an int value, to be consistent with rest of Spark
      3bf8fad [nravi] Merge branch 'master' of https://github.com/apache/spark
      5423a03 [nravi] Merge branch 'master' of https://github.com/apache/spark
      eb663ca [nravi] Merge branch 'master' of https://github.com/apache/spark
      df2aeb1 [nravi] Improved fix for ConcurrentModificationIssue (Spark-1097, Hadoop-10456)
      6b840f0 [nravi] Undo the fix for SPARK-1758 (the problem is fixed)
      5108700 [nravi] Fix in Spark for the Concurrent thread modification issue (SPARK-1097, HADOOP-10456)
      681b36f [nravi] Fix for SPARK-1758: failing test org.apache.spark.JavaAPISuite.wholeTextFiles
      e3eb3939
    • Brennon York's avatar
      [SPARK-4123][Project Infra]: Show new dependencies added in pull requests · 55153f5c
      Brennon York authored
      Starting work on this, but need to find a way to ensure that, after doing a checkout from `apache/master`, we can successfully return to the current checkout. I believe that `git rev-parse HEAD` will get me what I want, but pushing this PR up to test what the Jenkins boxes are seeing.
      
      Author: Brennon York <brennon.york@capitalone.com>
      
      Closes #5093 from brennonyork/SPARK-4123 and squashes the following commits:
      
      42e243e [Brennon York] moved starting test output to before pr tests, fixed indentation, changed mvn call to build/mvn
      dadd941 [Brennon York] reverted assembly pom, put the regular test suite back in play
      7aa1dee [Brennon York] set new dendencies into a <code> block, removed the bash debugging flag
      0074566 [Brennon York] fixed minor echo issue with quotes
      e229802 [Brennon York] updated to print the new dependency found
      27bb9b5 [Brennon York] changed the assembly pom to test whether the pr test will pick up new deps
      5375ad8 [Brennon York] git output to dev null
      9bce980 [Brennon York] ensure both gate files exist
      8f3c4b4 [Brennon York] updated to reflect the correct pushed in HEAD variable
      2bc7b27 [Brennon York] added a pom gate check
      a18db71 [Brennon York] full test of new deps script
      ea170de [Brennon York] dont let mvn execute tests
      f70d8cd [Brennon York] testing mvn with package
      62ffd65 [Brennon York] updated dependency output message and changed compile to package given the jenkins failure output
      04747e4 [Brennon York] adding simple mvn statement to see if command executes and prints compile output
      87f9bea [Brennon York] added -x flag with bash to get insight into what is executing and what isnt
      9e87208 [Brennon York] added set blocks to catch any non-zero exit codes and updated output
      6b3042b [Brennon York] removed excess git checkout print statements
      4077d46 [Brennon York] Merge remote-tracking branch 'upstream/master' into SPARK-4123
      2bb5527 [Brennon York] added echo statement so jenkins logs which pr tests are running
      d027f8f [Brennon York] proper piping of unnecessary stderr and stdout
      6e2890d [Brennon York] updated test output newlines
      d9f6f7f [Brennon York] removed echo
      bad9a3a [Brennon York] added back the new deps test
      e9e3ad1 [Brennon York] removed escapes for quotes
      97e5cfb [Brennon York] commenting out new deps script
      17379a5 [Brennon York] Merge remote-tracking branch 'upstream/master' into SPARK-4123
      56f74a8 [Brennon York] updated the unop for ensuring a test is available
      f2abc8c [Brennon York] removed the git checkout
      6912584 [Brennon York] added this_mssg echo output
      c610d42 [Brennon York] removed the error to dev/null
      b98f78c [Brennon York] added the removed deps and echo output for jenkins testing
      291a8fe [Brennon York] updated location of maven binary
      126ce61 [Brennon York] removing new deps test to isolate why jenkins isn't posting messages
      f8011d8 [Brennon York] minor updates and style changes
      63a35c9 [Brennon York] updated new dependencies test
      dae7ba8 [Brennon York] Capturing output directly from dependency builds
      94d3547 [Brennon York] adding the new dependencies script into the test mix
      2bca3c3 [Brennon York] added a git checkout 'git rev-parse HEAD' to the end of each pr test
      ae83b90 [Brennon York] removed jenkins tests to grab some values from the jenkins box
      4110993 [Brennon York] beginning work on pr test to add new dependencies
      55153f5c
    • Reynold Xin's avatar
      [DOC] Improvements to Python docs. · 5eef00d0
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5238 from rxin/pyspark-docs and squashes the following commits:
      
      c285951 [Reynold Xin] Reset deprecation warning.
      8c1031e [Reynold Xin] inferSchema
      dd91b1a [Reynold Xin] [DOC] Improvements to Python docs.
      5eef00d0
  2. Mar 28, 2015
  3. Mar 27, 2015
    • Adam Budde's avatar
      [SPARK-6538][SQL] Add missing nullable Metastore fields when merging a Parquet schema · 5909f097
      Adam Budde authored
      Opening to replace #5188.
      
      When Spark SQL infers a schema for a DataFrame, it will take the union of all field types present in the structured source data (e.g. an RDD of JSON data). When the source data for a row doesn't define a particular field on the DataFrame's schema, a null value will simply be assumed for this field. This workflow makes it very easy to construct tables and query over a set of structured data with a nonuniform schema. However, this behavior is not consistent in some cases when dealing with Parquet files and an external table managed by an external Hive metastore.
      
      In our particular use case, we use Spark Streaming to parse and transform our input data and then apply a window function to save an arbitrary-sized batch of data as a Parquet file, which itself will be added as a partition to an external Hive table via an *"ALTER TABLE... ADD PARTITION..."* statement. Since our input data is nonuniform, it is expected that not every partition batch will contain every field present in the table's schema obtained from the Hive metastore. As such, we expect that the schema of some of our Parquet files may not contain the same set of fields present in the full metastore schema.
      
      In such cases, it seems natural that Spark SQL would simply assume null values for any missing fields in the partition's Parquet file, assuming these fields are specified as nullable by the metastore schema. This is not the case in the current implementation of ParquetRelation2. The **mergeMetastoreParquetSchema()** method used to reconcile differences between a Parquet file's schema and a schema retrieved from the Hive metastore will raise an exception if the Parquet file doesn't contain the same set of fields specified by the metastore.
      
      This pull request alters the behavior of **mergeMetastoreParquetSchema()** by having it first add any nullable fields from the metastore schema to the Parquet file schema if they aren't already present there.
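      A minimal sketch of that merge step (names and types here are illustrative, not the actual ParquetRelation2 code):

      ```scala
      import org.apache.spark.sql.types.{StructField, StructType}

      // Hedged sketch: append any nullable, metastore-only fields to the Parquet file
      // schema so the subsequent reconciliation no longer fails on missing fields.
      def addMissingNullableFields(
          metastoreSchema: StructType,
          parquetSchema: StructType): StructType = {
        val parquetFieldNames = parquetSchema.fieldNames.toSet
        val missingNullable = metastoreSchema.fields.filter { f =>
          f.nullable && !parquetFieldNames.contains(f.name)
        }
        StructType(parquetSchema.fields ++ missingNullable)
      }
      ```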
      
      Author: Adam Budde <budde@amazon.com>
      
      Closes #5214 from budde/nullable-fields and squashes the following commits:
      
      a52d378 [Adam Budde] Refactor ParquetSchemaSuite.scala for cases now permitted by SPARK-6471 and SPARK-6538
      9041bfa [Adam Budde] Add missing nullable Metastore fields when merging a Parquet schema
      5909f097
    • Reynold Xin's avatar
      [SPARK-6564][SQL] SQLContext.emptyDataFrame should contain 0 row, not 1 row · 3af73343
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5226 from rxin/empty-df and squashes the following commits:
      
      1306d88 [Reynold Xin] Proper fix.
      e135bb9 [Reynold Xin] [SPARK-6564][SQL] SQLContext.emptyDataFrame should contain 0 rows, not 1 row.
      3af73343
    • Xusen Yin's avatar
      [SPARK-6526][ML] Add Normalizer transformer in ML package · d5497ab1
      Xusen Yin authored
      See [SPARK-6526](https://issues.apache.org/jira/browse/SPARK-6526).
      
      mengxr Should we add a test suite for this transformer? There are currently no test suites for the feature transformers in the ML package.
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #5181 from yinxusen/SPARK-6526 and squashes the following commits:
      
      6faa7bf [Xusen Yin] fix style
      8a462da [Xusen Yin] remove duplications
      ab35ab0 [Xusen Yin] add test suite
      bc8cd0f [Xusen Yin] fix comment
      79774c9 [Xusen Yin] add Normalizer transformer in ML package
      d5497ab1
    • Davies Liu's avatar
      [SPARK-6574] [PySpark] fix sql example · 887e1b72
      Davies Liu authored
      Fix the import in sql example.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #5230 from davies/fix_sql_example and squashes the following commits:
      
      7ecc5f4 [Davies Liu] fix sql example
      887e1b72
    • Michael Armbrust's avatar
      [SPARK-6550][SQL] Use analyzed plan in DataFrame · 5d9c37c2
      Michael Armbrust authored
      This is based on a bug and test case proposed by viirya. See #5203 for an excellent description of the problem.
      
      TLDR; The problem occurs because the function `groupBy(String)` calls `resolve`, which returns an `AttributeReference`.  However, this `AttributeReference` is based on an analyzed plan which is thrown away.  At execution time, we once again analyze the plan.  However, in the case of self-joins, each call to analyze will produce a new tree for the left side of the join, rendering the previously returned `AttributeReference` invalid.
      
      As a fix, I propose we keep the analyzed plan instead of the unresolved plan inside of a `DataFrame`.
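      A hypothetical illustration of the failure mode (table and column names are made up):

      ```scala
      // The reference behind df("key") comes from one analysis pass; the self-join
      // re-analyzes the left side into a fresh tree, so that reference may no longer resolve.
      val df = sqlContext.table("t")                  // assumed table with a "key" column
      val selfJoined = df.as("a").join(df.as("b"))    // left side is analyzed again here
      selfJoined.groupBy(df("key")).count()           // could fail before this change
      ```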
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #5217 from marmbrus/preanalyzer and squashes the following commits:
      
      1f98e2d [Michael Armbrust] revert change
      dd4dec1 [Michael Armbrust] Use the analyzed plan in DataFrame
      089c52e [Michael Armbrust] WIP
      5d9c37c2
    • Dean Chen's avatar
      [SPARK-6544][build] Increment Avro version from 1.7.6 to 1.7.7 · aa2b9917
      Dean Chen authored
      Fixes a bug causing Kryo serialization to fail with Avro files between stages.
      
      https://issues.apache.org/jira/browse/AVRO-1476?focusedCommentId=13999249&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13999249
      
      Author: Dean Chen <deanchen5@gmail.com>
      
      Closes #5193 from deanchen/SPARK-6544 and squashes the following commits:
      
      813d4c5 [Dean Chen] [SPARK-6544][build] Increment Avro version from 1.7.6 to 1.7.7
      aa2b9917
    • zsxwing's avatar
      [SPARK-6556][Core] Fix wrong parsing logic of executorTimeoutMs and... · da546b7b
      zsxwing authored
      [SPARK-6556][Core] Fix wrong parsing logic of executorTimeoutMs and checkTimeoutIntervalMs in HeartbeatReceiver
      
      The current reading logic of `executorTimeoutMs` is:
      ```Scala
      private val executorTimeoutMs = sc.conf.getLong("spark.network.timeout",
          sc.conf.getLong("spark.storage.blockManagerSlaveTimeoutMs", 120)) * 1000
      ```
      So if `spark.storage.blockManagerSlaveTimeoutMs` is 10000 and `spark.network.timeout` is not set, executorTimeoutMs will be 10000 * 1000. But the correct value should have been 10000.
      
      `checkTimeoutIntervalMs` has the same issue.
      
      This PR fixes them.
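      A minimal sketch of a corrected reading (assuming these config names; not necessarily the exact patch), where only the seconds-based `spark.network.timeout` is scaled by 1000:

      ```scala
      private val executorTimeoutMs =
        sc.conf.getOption("spark.network.timeout").map(_.toLong * 1000).getOrElse(
          sc.conf.getLong("spark.storage.blockManagerSlaveTimeoutMs", 120 * 1000))
      ```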
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #5209 from zsxwing/SPARK-6556 and squashes the following commits:
      
      6a0a411 [zsxwing] Fix docs
      c7d5422 [zsxwing] Add comments for executorTimeoutMs and checkTimeoutIntervalMs
      ccd5147 [zsxwing] Fix wrong parsing logic of executorTimeoutMs and checkTimeoutIntervalMs in HeartbeatReceiver
      da546b7b
    • Yu ISHIKAWA's avatar
      [SPARK-6341][mllib] Upgrade breeze from 0.11.1 to 0.11.2 · f43a6103
      Yu ISHIKAWA authored
      There are bugs in breeze's SparseVector in 0.11.1. Spark 1.3 depends on breeze 0.11.1, so I think we should upgrade it to 0.11.2.
      https://issues.apache.org/jira/browse/SPARK-6341
      
      And thank you for your great cooperation, David Hall (dlwh).
      
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #5222 from yu-iskw/upgrade-breeze and squashes the following commits:
      
      ad8a688 [Yu ISHIKAWA] Upgrade breeze from 0.11.1 to 0.11.2 because of a bug of SparseVector. Thanks you for your great cooperation, David Hall(@dlwh)
      f43a6103
    • mcheah's avatar
      [SPARK-6405] Limiting the maximum Kryo buffer size to be 2GB. · 49d2ec63
      mcheah authored
      Kryo buffers are backed by byte arrays, but primitive arrays can only be
      up to 2GB in size. It is misleading to allow users to set buffers past
      this size.
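      A hedged sketch of the kind of guard this implies (config key as in Spark 1.3, value assumed to be in MiB):

      ```scala
      // Reject buffer sizes that cannot be backed by a single primitive byte array.
      val maxBufferSizeMb = conf.getInt("spark.kryoserializer.buffer.max.mb", 64)
      if (maxBufferSizeMb >= 2048) {
        throw new IllegalArgumentException(
          s"spark.kryoserializer.buffer.max.mb must be less than 2048 MiB, got: $maxBufferSizeMb MiB.")
      }
      ```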
      
      Author: mcheah <mcheah@palantir.com>
      
      Closes #5218 from mccheah/feature/limit-kryo-buffer and squashes the following commits:
      
      1d6d1be [mcheah] Fixing numeric typo
      e2e30ce [mcheah] Removing explicit int and double type to match style
      09fd80b [mcheah] Should be >= not >. Slightly more consistent error message.
      60634f9 [mcheah] [SPARK-6405] Limiting the maximum Kryo buffer size to be 2GB.
      49d2ec63
  4. Mar 26, 2015
    • Brennon York's avatar
      [SPARK-6510][GraphX]: Add Graph#minus method to act as Set#difference · 39fb5796
      Brennon York authored
      Adds a `Graph#minus` method which will return only the unique `VertexId`s from the calling `VertexRDD`.
      
      To demonstrate a basic example with pseudocode:
      
      ```
      Set((0L,0),(1L,1)).minus(Set((1L,1),(2L,2)))
      > Set((0L,0))
      ```
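      And a usage sketch in GraphX terms (the exact method signature is assumed here):

      ```scala
      import org.apache.spark.graphx._

      // Keep only the vertices whose ids are unique to setA.
      val setA: VertexRDD[Int] = VertexRDD(sc.parallelize(Seq((0L, 0), (1L, 1))))
      val setB: VertexRDD[Int] = VertexRDD(sc.parallelize(Seq((1L, 1), (2L, 2))))
      setA.minus(setB).collect()   // expected: Array((0L, 0))
      ```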
      
      Author: Brennon York <brennon.york@capitalone.com>
      
      Closes #5175 from brennonyork/SPARK-6510 and squashes the following commits:
      
      248d5c8 [Brennon York] added minus(VertexRDD[VD]) method to avoid createUsingIndex and updated the mask operations to simplify with andNot call
      3fb7cce [Brennon York] updated graphx doc to reflect the addition of minus method
      6575d92 [Brennon York] updated mima exclude
      aaa030b [Brennon York] completed graph#minus functionality
      7227c0f [Brennon York] beginning work on minus functionality
      39fb5796
    • Michael Armbrust's avatar
      [DOCS][SQL] Fix JDBC example · aad00322
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #5192 from marmbrus/fixJDBCDocs and squashes the following commits:
      
      b48a33d [Michael Armbrust] [DOCS][SQL] Fix JDBC example
      aad00322
    • Cheng Lian's avatar
      [SPARK-6554] [SQL] Don't push down predicates which reference partition column(s) · 71a0d40e
      Cheng Lian authored
      There are two cases for the new Parquet data source:
      
      1. Partition columns exist in the Parquet data files
      
         We don't need to push-down these predicates since partition pruning already handles them.
      
      2. Partition columns don't exist in the Parquet data files
      
         We can't push down these predicates since Parquet considers them invalid columns (a rough sketch of the split follows below).
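      A rough sketch of that split (helper names and the partition-column set are illustrative only):

      ```scala
      // Keep predicates that touch partition columns out of the set handed to Parquet;
      // partition pruning already covers them, and Parquet would reject the columns.
      val partitionColumns = Set("dt", "country")      // hypothetical partition columns
      val (partitionPredicates, pushablePredicates) = predicates.partition { pred =>
        pred.references.exists(attr => partitionColumns.contains(attr.name))
      }
      // Only pushablePredicates are converted into Parquet filter predicates.
      ```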
      
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #5210 from liancheng/spark-6554 and squashes the following commits:
      
      4f7ec03 [Cheng Lian] Adds comments
      e134ced [Cheng Lian] Don't push down predicates which reference partition column(s)
      71a0d40e
    • Reynold Xin's avatar
      [SPARK-6117] [SQL] Improvements to DataFrame.describe() · 784fcd53
      Reynold Xin authored
      1. Slight modifications to the code to make it more readable.
      2. Added Python implementation.
      3. Updated the documentation to state that we don't guarantee the output schema for this function and that it should only be used for exploratory data analysis (a usage sketch follows below).
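      A small usage sketch (the exact output layout is not guaranteed, per the note above):

      ```scala
      val df = sqlContext.createDataFrame(Seq((1, 10.0), (2, 20.0), (3, 30.0))).toDF("id", "value")
      df.describe("id", "value").show()
      // Shows count, mean, stddev, min and max for the named numeric columns.
      ```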
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5201 from rxin/df-describe and squashes the following commits:
      
      25a7834 [Reynold Xin] Reset run-tests.
      6abdfee [Reynold Xin] [SPARK-6117] [SQL] Improvements to DataFrame.describe()
      784fcd53
    • Sean Owen's avatar
      SPARK-6532 [BUILD] LDAModel.scala fails scalastyle on Windows · c3a52a08
      Sean Owen authored
      Use standard UTF-8 source / report encoding for scalastyle
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #5211 from srowen/SPARK-6532 and squashes the following commits:
      
      16a33e5 [Sean Owen] Use standard UTF-8 source / report encoding for scalastyle
      c3a52a08
    • Sean Owen's avatar
      SPARK-6480 [CORE] histogram() bucket function is wrong in some simple edge cases · fe15ea97
      Sean Owen authored
      Fix fastBucketFunction for histogram() to handle edge conditions more correctly. Add a test, and fix the existing one accordingly.
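      A minimal sketch of an even-width bucket lookup that handles those edges (not the patched Spark code itself):

      ```scala
      // Values equal to the upper bound belong to the last bucket; NaN and
      // out-of-range values map to no bucket at all.
      def bucket(value: Double, min: Double, max: Double, bucketCount: Int): Option[Int] = {
        if (value.isNaN || value < min || value > max) {
          None
        } else if (value == max) {
          Some(bucketCount - 1)
        } else {
          Some(((value - min) / (max - min) * bucketCount).toInt)
        }
      }
      ```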
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #5148 from srowen/SPARK-6480 and squashes the following commits:
      
      974a0a0 [Sean Owen] Additional test of huge ranges, and a few more comments (and comment fixes)
      23ec01e [Sean Owen] Fix fastBucketFunction for histogram() to handle edge conditions more correctly. Add a test, and fix existing one accordingly
      fe15ea97
    • Yuhao Yang's avatar
      [MLlib]remove unused import · 3ddb975f
      Yuhao Yang authored
      Minor thing. Let me know if a JIRA is required.
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #5207 from hhbyyh/adjustImport and squashes the following commits:
      
      2240121 [Yuhao Yang] remove unused import
      3ddb975f
    • Yash Datta's avatar
      [SQL][SPARK-6471]: Metastore schema should only be a subset of parquet schema... · 1c05027a
      Yash Datta authored
      [SQL][SPARK-6471]: Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns
      
      Currently, the ParquetRelation2 implementation throws an error if the merged schema is not exactly the same as the metastore schema.
      But to support cases like dropping a column via the REPLACE COLUMNS command, we can relax the restriction so that the query still works even when the metastore schema is only a subset of the merged Parquet schema.
      
      Author: Yash Datta <Yash.Datta@guavus.com>
      
      Closes #5141 from saucam/replace_col and squashes the following commits:
      
      e858d5b [Yash Datta] SPARK-6471: Fix test cases, add a new test case for metastore schema to be subset of parquet schema
      5f2f467 [Yash Datta] SPARK-6471: Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns
      1c05027a
    • zsxwing's avatar
      [SPARK-6468][Block Manager] Fix the race condition of subDirs in DiskBlockManager · 0c88ce54
      zsxwing authored
      There are two race conditions of `subDirs` in `DiskBlockManager`:
      
      1. `getAllFiles` does not use correct locks to read the contents of `subDirs`. Although it's designed for testing, it's still worth adding correct locks to eliminate the race condition.
      2. The double-check in `getFile(filename: String)` has a race condition. If a thread finds `subDirs(dirId)(subDirId)` is not null outside the `synchronized` block, it may not see the correct content of the File instance pointed to by `subDirs(dirId)(subDirId)` according to the Java memory model (there is no volatile variable here).
      
      This PR fixed the above race conditions.
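      A hedged sketch of the safer access pattern (field names borrowed from the description above; the surrounding class is omitted):

      ```scala
      // Both the check and the update happen under the same per-directory lock, so no
      // thread can observe a partially constructed File through subDirs.
      def getFile(dirId: Int, subDirId: Int): java.io.File = {
        subDirs(dirId).synchronized {
          val existing = subDirs(dirId)(subDirId)
          if (existing != null) {
            existing
          } else {
            val newDir = new java.io.File(localDirs(dirId), "%02x".format(subDirId))
            newDir.mkdirs()
            subDirs(dirId)(subDirId) = newDir
            newDir
          }
        }
      }
      ```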
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #5136 from zsxwing/SPARK-6468 and squashes the following commits:
      
      cbb872b [zsxwing] Fix the race condition of subDirs in DiskBlockManager
      0c88ce54
    • Michael Armbrust's avatar
      [SPARK-6465][SQL] Fix serialization of GenericRowWithSchema using kryo · f88f51bb
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #5191 from marmbrus/kryoRowsWithSchema and squashes the following commits:
      
      bb83522 [Michael Armbrust] Fix serialization of GenericRowWithSchema using kryo
      f914f16 [Michael Armbrust] Add no arg constructor to GenericRowWithSchema
      f88f51bb
    • DoingDone9's avatar
      [SPARK-6546][Build] Using the wrong code that will make spark compile failed!! · 855cba8f
      DoingDone9 authored
      Wrong code: `val tmpDir = Files.createTempDir()`
      It should be `Utils`, not `Files`.
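      The intended replacement, per the description above (shown as a sketch):

      ```scala
      import org.apache.spark.util.Utils

      val tmpDir = Utils.createTempDir()   // instead of Guava's Files.createTempDir()
      ```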
      
      Author: DoingDone9 <799203320@qq.com>
      
      Closes #5198 from DoingDone9/FilesBug and squashes the following commits:
      
      6e0140d [DoingDone9] Update InsertIntoHiveTableSuite.scala
      e57d23f [DoingDone9] Update InsertIntoHiveTableSuite.scala
      802261c [DoingDone9] Merge pull request #7 from apache/master
      d00303b [DoingDone9] Merge pull request #6 from apache/master
      98b134f [DoingDone9] Merge pull request #5 from apache/master
      161cae3 [DoingDone9] Merge pull request #4 from apache/master
      c87e8b6 [DoingDone9] Merge pull request #3 from apache/master
      cb1852d [DoingDone9] Merge pull request #2 from apache/master
      c3f046f [DoingDone9] Merge pull request #1 from apache/master
      855cba8f
    • azagrebin's avatar
      [SPARK-6117] [SQL] add describe function to DataFrame for summary statis... · 5bbcd130
      azagrebin authored
      Please review my solution for SPARK-6117
      
      Author: azagrebin <azagrebin@gmail.com>
      
      Closes #5073 from azagrebin/SPARK-6117 and squashes the following commits:
      
      f9056ac [azagrebin] [SPARK-6117] [SQL] create one aggregation and split it locally into resulting DF, colocate test data with test case
      ddb3950 [azagrebin] [SPARK-6117] [SQL] simplify implementation, add test for DF without numeric columns
      9daf31e [azagrebin] [SPARK-6117] [SQL] add describe function to DataFrame for summary statistics
      5bbcd130
    • Davies Liu's avatar
      [SPARK-6536] [PySpark] Column.inSet() in Python · f5358029
      Davies Liu authored
      ```
      >>> df[df.name.inSet("Bob", "Mike")].collect()
      [Row(age=5, name=u'Bob')]
      >>> df[df.age.inSet([1, 2, 3])].collect()
      [Row(age=2, name=u'Alice')]
      ```
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #5190 from davies/in and squashes the following commits:
      
      6b73a47 [Davies Liu] Column.inSet() in Python
      f5358029
  5. Mar 25, 2015
    • Michael Armbrust's avatar
      [SPARK-6463][SQL] AttributeSet.equal should compare size · 276ef1c3
      Michael Armbrust authored
      Previously this could result in sets comparing as equal when in fact the right was a subset of the left.
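      A generic illustration of the trap (plain Scala sets, not the actual AttributeSet code):

      ```scala
      val left = Set("a", "b", "c")
      val right = Set("a", "b")
      val containsAll = right.forall(left.contains)             // true, but the sets differ
      val reallyEqual = containsAll && left.size == right.size  // false, as it should be
      ```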
      
      Based on #5133 by sisihj
      
      Author: sisihj <jun.hejun@huawei.com>
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #5194 from marmbrus/pr/5133 and squashes the following commits:
      
      5ed4615 [Michael Armbrust] fix imports
      d4cbbc0 [Michael Armbrust] Add test cases
      0a0834f [sisihj]  AttributeSet.equal should compare size
      276ef1c3
    • KaiXinXiaoLei's avatar
      The UT test of spark is failed. Because there is a test in SQLQuerySuite about... · e87bf371
      KaiXinXiaoLei authored
      The Spark unit tests fail because there is a test in SQLQuerySuite that creates the table “test”.
      
      If the tests in "sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala" run before CachedTableSuite.scala, the test("Drop cached table") will fail, because the table "test" is created in SQLQuerySuite.scala and never dropped. So when "Drop cached table" runs, the table "test" already exists.
      
      The error info is:
      01:18:35.738 ERROR hive.ql.exec.DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: AlreadyExistsException(message:Table test already exists)
      at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:616)
      at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:4189)
      at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:281)
      at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
      at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
      at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1503)
      at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1270)
      at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1088)
      at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911)
      at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901)
      
      The test in "sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala" that creates the table "test" is:
      
        test("SPARK-4825 save join to table") {
          val testData = sparkContext.parallelize(1 to 10).map(i => TestData(i, i.toString)).toDF()
          sql("CREATE TABLE test1 (key INT, value STRING)")
          testData.insertInto("test1")
          sql("CREATE TABLE test2 (key INT, value STRING)")
          testData.insertInto("test2")
          testData.insertInto("test2")
          sql("CREATE TABLE test AS SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key =   b.key")
          checkAnswer(
            table("test"),
            sql("SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key = b.key").collect().toSeq)
        }
      
      Author: KaiXinXiaoLei <huleilei1@huawei.com>
      
      Closes #5150 from KaiXinXiaoLei/testFailed and squashes the following commits:
      
      7534b02 [KaiXinXiaoLei] The UT test of spark is failed.
      e87bf371
    • Daoyuan Wang's avatar
      [SPARK-6202] [SQL] enable variable substitution on test framework · 5ab6e9f0
      Daoyuan Wang authored
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #4930 from adrian-wang/testvs and squashes the following commits:
      
      2ce590f [Daoyuan Wang] add explicit function types
      b1d68bf [Daoyuan Wang] only substitute for parseSql
      9c4a950 [Daoyuan Wang] add a comment explaining
      18fb481 [Daoyuan Wang] enable variable substitute on test framework
      5ab6e9f0
    • DoingDone9's avatar
      [SPARK-6271][SQL] Sort these tokens in alphabetic order to avoid further duplicate in HiveQl · 328daf65
      DoingDone9 authored
      Author: DoingDone9 <799203320@qq.com>
      
      Closes #4973 from DoingDone9/sort_token and squashes the following commits:
      
      855fa10 [DoingDone9] Update HiveQl.scala
      c7080b3 [DoingDone9] Sort these tokens in alphabetic order to avoid further duplicate in HiveQl
      c87e8b6 [DoingDone9] Merge pull request #3 from apache/master
      cb1852d [DoingDone9] Merge pull request #2 from apache/master
      c3f046f [DoingDone9] Merge pull request #1 from apache/master
      328daf65
    • Liang-Chi Hsieh's avatar
      [SPARK-6326][SQL] Improve castStruct to be faster · 73d57754
      Liang-Chi Hsieh authored
      The current `castStruct` can be very slow. This PR slightly improves it.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #5017 from viirya/faster_caststruct and squashes the following commits:
      
      385d5b0 [Liang-Chi Hsieh] Further improved.
      746fcfb [Liang-Chi Hsieh] Make castStruct faster.
      73d57754
    • jeanlyn's avatar
      [SPARK-5498][SQL]fix query exception when partition schema does not match table schema · e6d1406a
      jeanlyn authored
      In Hive, the schema of a partition may differ from the table schema. When we use Spark SQL to query the data of a partition whose schema differs from the table schema, we get the exceptions described in the [jira](https://issues.apache.org/jira/browse/SPARK-5498). For example:
      * We take a look at the schema of the partition and of the table
      
      ```sql
      DESCRIBE partition_test PARTITION (dt='1');
      id                  	int              	None
      name                	string              	None
      dt                  	string              	None
      
      # Partition Information
      # col_name            	data_type           	comment
      
      dt                  	string              	None
      ```
      ```
      DESCRIBE partition_test;
      OK
      id                  	bigint              	None
      name                	string              	None
      dt                  	string              	None
      
      # Partition Information
      # col_name            	data_type           	comment
      
      dt                  	string              	None
      ```
      *  Run the SQL:
      ```sql
      SELECT * FROM partition_test where dt='1';
      ```
      we will get the cast exception `java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt`
      
      Author: jeanlyn <jeanlyn92@gmail.com>
      
      Closes #4289 from jeanlyn/schema and squashes the following commits:
      
      9c8da74 [jeanlyn] fix style
      b41d6b9 [jeanlyn] fix compile errors
      07d84b6 [jeanlyn] Merge branch 'master' into schema
      535b0b6 [jeanlyn] reduce conflicts
      d6c93c5 [jeanlyn] fix bug
      1e8b30c [jeanlyn] fix code style
      0549759 [jeanlyn] fix code style
      c879aa1 [jeanlyn] clean the code
      2a91a87 [jeanlyn] add more test case and clean the code
      12d800d [jeanlyn] fix code style
      63d170a [jeanlyn] fix compile problem
      7470901 [jeanlyn] reduce conflicts
      afc7da5 [jeanlyn] make getConvertedOI compatible between 0.12.0 and 0.13.1
      b1527d5 [jeanlyn] fix type mismatch
      10744ca [jeanlyn] Insert a space after the start of the comment
      3b27af3 [jeanlyn] SPARK-5498:fix bug when query the data when partition schema does not match table schema
      e6d1406a
    • Cheng Lian's avatar
      [SPARK-6450] [SQL] Fixes metastore Parquet table conversion · 8c3b0052
      Cheng Lian authored
      The `ParquetConversions` analysis rule generates a hash map, which maps from the original `MetastoreRelation` instances to the newly created `ParquetRelation2` instances. However, `MetastoreRelation.equals` doesn't compare output attributes. Thus, if a single metastore Parquet table appears multiple times in a query, only a single entry ends up in the hash map, and the conversion is not correctly performed.
      
      The proper fix for this issue would be to override `equals` and `hashCode` for MetastoreRelation. Unfortunately, this breaks more tests than expected. It's possible that these tests were ill-formed from the very beginning. As the 1.3.1 release is approaching, we'd like to make the change more surgical to avoid potential regressions. The proposed fix here is to use both the metastore relations and their output attributes as keys in the hash map used in ParquetConversions.
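      A rough sketch of the proposed keying (types simplified; not the exact code):

      ```scala
      // Keying by (relation, output attributes) keeps separate entries for each
      // occurrence of the same metastore table within a single query plan.
      val toBeReplaced =
        collection.mutable.HashMap.empty[(MetastoreRelation, Seq[Attribute]), ParquetRelation2]
      // later, inside the rule:
      // toBeReplaced((relation, relation.output)) = convertToParquetRelation(relation)
      ```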
      
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #5183 from liancheng/spark-6450 and squashes the following commits:
      
      3536780 [Cheng Lian] Fixes metastore Parquet table conversion
      8c3b0052
    • Josh Rosen's avatar
      [SPARK-6079] Use index to speed up StatusTracker.getJobIdsForGroup() · d44a3362
      Josh Rosen authored
      `StatusTracker.getJobIdsForGroup()` is implemented via a linear scan over a HashMap rather than using an index, which might be an expensive operation if there are many (e.g. thousands) of retained jobs.
      
      This patch adds a new map to `JobProgressListener` in order to speed up these lookups.
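      A hedged sketch of such an index (names assumed, not the actual JobProgressListener code):

      ```scala
      import scala.collection.mutable

      // Maintain a jobGroup -> jobIds map as jobs start, so getJobIdsForGroup becomes
      // a map lookup instead of a scan over all retained jobs.
      val jobGroupToJobIds = mutable.HashMap.empty[String, mutable.HashSet[Int]]

      def recordJobStart(jobGroup: String, jobId: Int): Unit = {
        jobGroupToJobIds.getOrElseUpdate(jobGroup, mutable.HashSet.empty[Int]) += jobId
      }

      def getJobIdsForGroup(jobGroup: String): Seq[Int] =
        jobGroupToJobIds.get(jobGroup).map(_.toSeq).getOrElse(Seq.empty)
      ```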
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #4830 from JoshRosen/statustracker-job-group-indexing and squashes the following commits:
      
      e39c5c7 [Josh Rosen] Address review feedback
      6709fb2 [Josh Rosen] Merge remote-tracking branch 'origin/master' into statustracker-job-group-indexing
      2c49614 [Josh Rosen] getOrElse
      97275a7 [Josh Rosen] Add jobGroup to jobId index to JobProgressListener
      d44a3362
    • MechCoder's avatar
      [SPARK-5987] [MLlib] Save/load for GaussianMixtureModels · 4fc4d036
      MechCoder authored
      Should be self explanatory.
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #4986 from MechCoder/spark-5987 and squashes the following commits:
      
      7d2cd56 [MechCoder] Iterate over dataframe in a better way
      e7a14cb [MechCoder] Minor
      33c84f9 [MechCoder] Store as Array[Data] instead of Data[Array]
      505bd57 [MechCoder] Rebased over master and used MatrixUDT
      7422bb4 [MechCoder] Store sigmas as Array[Double] instead of Array[Array[Double]]
      b9794e4 [MechCoder] Minor
      cb77095 [MechCoder] [SPARK-5987] Save/load for GaussianMixtureModels
      4fc4d036
    • Yanbo Liang's avatar
      [SPARK-6256] [MLlib] MLlib Python API parity check for regression · 43533738
      Yanbo Liang authored
      MLlib Python API parity check for regression. The following major disparities need to be added to the Python API:
      ```scala
      LinearRegressionWithSGD
          setValidateData
      LassoWithSGD
          setIntercept
          setValidateData
      RidgeRegressionWithSGD
          setIntercept
          setValidateData
      ```
      setFeatureScaling is an MLlib-private function that does not need to be exposed in PySpark.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #4997 from yanboliang/spark-6256 and squashes the following commits:
      
      102f498 [Yanbo Liang] fix intercept issue & add doc test
      1fb7b4f [Yanbo Liang] change 'intercept' to 'addIntercept'
      de5ecbc [Yanbo Liang] MLlib Python API parity check for regression
      43533738
    • Andrew Or's avatar
      [SPARK-5771] Master UI inconsistently displays application cores · c1b74df6
      Andrew Or authored
      If the user calls `sc.stop()`, then the number of cores under "Completed Applications" will be 0. If the user does not call `sc.stop()`, then the number of cores will be however many cores were being used before the application exited. This PR makes both cases have the behavior of the latter.
      
      Note that there has been a series of PRs that attempted to fix this. For the full discussion, please refer to #4841. The unregister event is necessary because of a subtle race condition explained in that PR.
      
      Tested this locally with and without calling `sc.stop()`.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #5177 from andrewor14/master-ui-cores and squashes the following commits:
      
      62449d1 [Andrew Or] Freeze application state before finishing it
      c1b74df6
    • Kousuke Saruta's avatar
      [SPARK-6537] UIWorkloadGenerator: The main thread should not stop SparkContext... · acef51de
      Kousuke Saruta authored
      [SPARK-6537] UIWorkloadGenerator: The main thread should not stop SparkContext until all jobs finish
      
      The main thread of UIWorkloadGenerator spawns sub-threads to launch jobs, but it stops the SparkContext without waiting for those threads to finish.
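      A minimal sketch of the intended behavior (the helper name is hypothetical):

      ```scala
      // Join the job-launching threads before stopping the SparkContext.
      val threads: Seq[Thread] = launchJobThreads()   // hypothetical helper returning started threads
      threads.foreach(_.join())
      sc.stop()
      ```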
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #5187 from sarutak/SPARK-6537 and squashes the following commits:
      
      4e9307a [Kousuke Saruta] Fixed UIWorkloadGenerator so that the main thread stop SparkContext after all jobs finish
      acef51de
    • zsxwing's avatar
      [SPARK-6076][Block Manager] Fix a potential OOM issue when StorageLevel is MEMORY_AND_DISK_SER · 883b7e90
      zsxwing authored
      In https://github.com/apache/spark/blob/dcd1e42d6b6ac08d2c0736bf61a15f515a1f222b/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L538, when StorageLevel is `MEMORY_AND_DISK_SER`, it will copy the content from the file into memory, then put it into MemoryStore.
      ```scala
                    val copyForMemory = ByteBuffer.allocate(bytes.limit)
                    copyForMemory.put(bytes)
                    memoryStore.putBytes(blockId, copyForMemory, level)
                    bytes.rewind()
      ```
      However, if the file is bigger than the free memory, an OOM will happen. A better approach is to test whether there is enough memory first; if not, copyForMemory should not be created, since this in-memory copy is only an optional optimization.
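      A hedged sketch of the idea (method names assumed; not necessarily how the final change is structured):

      ```scala
      // Only materialize the in-memory copy when it would actually fit in the MemoryStore.
      if (memoryStore.freeMemory >= bytes.limit) {
        val copyForMemory = ByteBuffer.allocate(bytes.limit)
        copyForMemory.put(bytes)
        memoryStore.putBytes(blockId, copyForMemory, level)
        bytes.rewind()
      }
      ```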
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #4827 from zsxwing/SPARK-6076 and squashes the following commits:
      
      7d25545 [zsxwing] Add alias for tryToPut and dropFromMemory
      1100a54 [zsxwing] Replace call-by-name with () => T
      0cc0257 [zsxwing] Fix a potential OOM issue when StorageLevel is MEMORY_AND_DISK_SER
      883b7e90