  1. Mar 26, 2015
• [SPARK-6117] [SQL] Improvements to DataFrame.describe() · 784fcd53
      Reynold Xin authored
1. Slight modifications to the code to make it more readable.
      2. Added Python implementation.
      3. Updated the documentation to state that we don't guarantee the output schema for this function and it should only be used for exploratory data analysis.
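
For reference, a minimal usage sketch in Scala (hypothetical data and column names; assumes a Spark 1.3-era SQLContext and an existing SparkContext `sc`):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Toy DataFrame. describe() computes summary statistics (count, mean,
// stddev, min, max) for numeric columns; per the docs change above, its
// output schema is not guaranteed, so use it only for exploration.
val df = Seq((21, 1.80), (35, 1.65)).toDF("age", "height")
df.describe("age", "height").show()
```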
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5201 from rxin/df-describe and squashes the following commits:
      
      25a7834 [Reynold Xin] Reset run-tests.
      6abdfee [Reynold Xin] [SPARK-6117] [SQL] Improvements to DataFrame.describe()
      784fcd53
• SPARK-6532 [BUILD] LDAModel.scala fails scalastyle on Windows · c3a52a08
      Sean Owen authored
      Use standard UTF-8 source / report encoding for scalastyle
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #5211 from srowen/SPARK-6532 and squashes the following commits:
      
      16a33e5 [Sean Owen] Use standard UTF-8 source / report encoding for scalastyle
      c3a52a08
• SPARK-6480 [CORE] histogram() bucket function is wrong in some simple edge cases · fe15ea97
      Sean Owen authored
Fix fastBucketFunction for histogram() to handle edge conditions more correctly. Add a test, and fix the existing one accordingly.
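
For context, a hedged sketch of the kind of bucket function such a fix deals with (illustrative code, not Spark's actual fastBucketFunction):

```scala
// Map a value into one of `count` equal-width buckets over [min, max].
// Edge cases: NaN and out-of-range values get no bucket, and the inclusive
// top endpoint must land in the last bucket rather than out of range.
// (Huge ranges where max - min overflows need extra care, per the added tests.)
def bucket(value: Double, min: Double, max: Double, count: Int): Option[Int] = {
  if (value.isNaN || value < min || value > max) None
  else if (value == max) Some(count - 1)
  else Some(((value - min) / (max - min) * count).toInt)
}
```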
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #5148 from srowen/SPARK-6480 and squashes the following commits:
      
      974a0a0 [Sean Owen] Additional test of huge ranges, and a few more comments (and comment fixes)
      23ec01e [Sean Owen] Fix fastBucketFunction for histogram() to handle edge conditions more correctly. Add a test, and fix existing one accordingly
      fe15ea97
• [MLlib] remove unused import · 3ddb975f
      Yuhao Yang authored
Minor thing. Let me know if a JIRA is required.
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #5207 from hhbyyh/adjustImport and squashes the following commits:
      
      2240121 [Yuhao Yang] remove unused import
      3ddb975f
• [SQL][SPARK-6471]: Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns · 1c05027a
  Yash Datta authored
      
Currently, in the ParquetRelation2 implementation, an error is thrown when the merged schema is not exactly the same as the metastore schema.
But to support cases like dropping a column with the REPLACE COLUMNS command, we can relax the restriction: even if the metastore schema is only a subset of the merged Parquet schema, the query will work.
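
A hedged sketch of the relaxed compatibility check described above (illustrative types, not the actual ParquetRelation2 code):

```scala
case class Field(name: String, dataType: String)

// Old rule: metastore schema must equal the merged Parquet schema exactly.
// Relaxed rule: every metastore column must exist in the merged schema, so
// a column dropped via REPLACE COLUMNS no longer fails the check.
def isCompatible(metastoreSchema: Seq[Field], mergedParquetSchema: Seq[Field]): Boolean =
  metastoreSchema.forall(mergedParquetSchema.contains)
```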
      
      Author: Yash Datta <Yash.Datta@guavus.com>
      
      Closes #5141 from saucam/replace_col and squashes the following commits:
      
      e858d5b [Yash Datta] SPARK-6471: Fix test cases, add a new test case for metastore schema to be subset of parquet schema
      5f2f467 [Yash Datta] SPARK-6471: Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns
      1c05027a
• [SPARK-6468][Block Manager] Fix the race condition of subDirs in DiskBlockManager · 0c88ce54
      zsxwing authored
      There are two race conditions of `subDirs` in `DiskBlockManager`:
      
1. `getAllFiles` does not use correct locks to read the contents of `subDirs`. Although it's designed for testing, it's still worth adding correct locks to eliminate the race condition.
2. The double-check in `getFile(filename: String)` has a race condition. If a thread finds `subDirs(dirId)(subDirId)` to be non-null outside the `synchronized` block, it may still not see the correct content of the File instance pointed to by `subDirs(dirId)(subDirId)`, according to the Java memory model (there is no volatile variable here).
      
      This PR fixed the above race conditions.
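
A hedged sketch of the memory-model problem (toy code, not the DiskBlockManager implementation): without a volatile field, the safe double-check variant reads the slot under the lock, so the fully constructed File is guaranteed to be visible.

```scala
import java.io.File

class DirCache(size: Int) {
  private val subDirs = new Array[File](size)

  def getDir(id: Int)(create: => File): File = {
    // Reading the slot under the lock guarantees visibility of the fully
    // constructed File; a bare read outside `synchronized` would not.
    val old = subDirs.synchronized { subDirs(id) }
    if (old != null) old
    else subDirs.synchronized {
      if (subDirs(id) == null) subDirs(id) = create // publish under the lock
      subDirs(id)
    }
  }
}
```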
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #5136 from zsxwing/SPARK-6468 and squashes the following commits:
      
      cbb872b [zsxwing] Fix the race condition of subDirs in DiskBlockManager
      0c88ce54
• [SPARK-6465][SQL] Fix serialization of GenericRowWithSchema using kryo · f88f51bb
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #5191 from marmbrus/kryoRowsWithSchema and squashes the following commits:
      
      bb83522 [Michael Armbrust] Fix serialization of GenericRowWithSchema using kryo
      f914f16 [Michael Armbrust] Add no arg constructor to GenericRowWithSchema
      f88f51bb
• [SPARK-6546][Build] Using the wrong code will make the Spark build fail · 855cba8f
      DoingDone9 authored
Wrong code: `val tmpDir = Files.createTempDir()`. It should use `Utils`, not `Files`.
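
The fix as the message describes it, sketched below; `Utils` here is Spark's org.apache.spark.util.Utils, and the offending `Files.createTempDir()` is presumably Guava's:

```scala
import org.apache.spark.util.Utils

// Use Spark's own temp-dir utility rather than Guava's Files helper.
val tmpDir = Utils.createTempDir()
```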
      
      Author: DoingDone9 <799203320@qq.com>
      
      Closes #5198 from DoingDone9/FilesBug and squashes the following commits:
      
      6e0140d [DoingDone9] Update InsertIntoHiveTableSuite.scala
      e57d23f [DoingDone9] Update InsertIntoHiveTableSuite.scala
      802261c [DoingDone9] Merge pull request #7 from apache/master
      d00303b [DoingDone9] Merge pull request #6 from apache/master
      98b134f [DoingDone9] Merge pull request #5 from apache/master
      161cae3 [DoingDone9] Merge pull request #4 from apache/master
      c87e8b6 [DoingDone9] Merge pull request #3 from apache/master
      cb1852d [DoingDone9] Merge pull request #2 from apache/master
      c3f046f [DoingDone9] Merge pull request #1 from apache/master
      855cba8f
• [SPARK-6117] [SQL] add describe function to DataFrame for summary statistics · 5bbcd130
      azagrebin authored
      Please review my solution for SPARK-6117
      
      Author: azagrebin <azagrebin@gmail.com>
      
      Closes #5073 from azagrebin/SPARK-6117 and squashes the following commits:
      
      f9056ac [azagrebin] [SPARK-6117] [SQL] create one aggregation and split it locally into resulting DF, colocate test data with test case
      ddb3950 [azagrebin] [SPARK-6117] [SQL] simplify implementation, add test for DF without numeric columns
      9daf31e [azagrebin] [SPARK-6117] [SQL] add describe function to DataFrame for summary statistics
      5bbcd130
• [SPARK-6536] [PySpark] Column.inSet() in Python · f5358029
      Davies Liu authored
      ```
      >>> df[df.name.inSet("Bob", "Mike")].collect()
      [Row(age=5, name=u'Bob')]
      >>> df[df.age.inSet([1, 2, 3])].collect()
      [Row(age=2, name=u'Alice')]
      ```
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #5190 from davies/in and squashes the following commits:
      
      6b73a47 [Davies Liu] Column.inSet() in Python
      f5358029
  2. Mar 25, 2015
• [SPARK-6463][SQL] AttributeSet.equal should compare size · 276ef1c3
      Michael Armbrust authored
Previously this could result in sets comparing equal when in fact the right was a subset of the left.
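
A hedged illustration of the bug class (toy code, not the actual AttributeSet): a one-directional containment check with no size comparison treats a proper subset/superset pair as equal.

```scala
case class ToySet(elems: Set[Int]) {
  // Buggy: only checks that every element of `this` is in `other`.
  def buggyEquals(other: ToySet): Boolean = elems.forall(other.elems.contains)
  // Fixed: also compare sizes.
  def fixedEquals(other: ToySet): Boolean =
    elems.size == other.elems.size && buggyEquals(other)
}

val small = ToySet(Set(1))
val big = ToySet(Set(1, 2))
assert(small.buggyEquals(big))  // wrongly "equal"
assert(!small.fixedEquals(big)) // the size check catches it
```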
      
      Based on #5133 by sisihj
      
      Author: sisihj <jun.hejun@huawei.com>
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #5194 from marmbrus/pr/5133 and squashes the following commits:
      
      5ed4615 [Michael Armbrust] fix imports
      d4cbbc0 [Michael Armbrust] Add test cases
      0a0834f [sisihj]  AttributeSet.equal should compare size
      276ef1c3
• The Spark UT fails because a test in SQLQuerySuite creates table “test” · e87bf371
  KaiXinXiaoLei authored
      
If the tests in "sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala" run before CachedTableSuite.scala, the test("Drop cached table") will fail, because the table "test" is created in SQLQuerySuite.scala and never dropped. So when "Drop cached table" runs, the table "test" already exists.
      
The error info is:
      01:18:35.738 ERROR hive.ql.exec.DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: AlreadyExistsException(message:Table test already exists)
      at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:616)
      at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:4189)
      at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:281)
      at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
      at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
      at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1503)
      at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1270)
      at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1088)
      at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901)
      
And the test that creates table "test" in "sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala" is:
      
        test("SPARK-4825 save join to table") {
          val testData = sparkContext.parallelize(1 to 10).map(i => TestData(i, i.toString)).toDF()
          sql("CREATE TABLE test1 (key INT, value STRING)")
          testData.insertInto("test1")
          sql("CREATE TABLE test2 (key INT, value STRING)")
          testData.insertInto("test2")
          testData.insertInto("test2")
          sql("CREATE TABLE test AS SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key =   b.key")
          checkAnswer(
            table("test"),
            sql("SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key = b.key").collect().toSeq)
        }
      
      Author: KaiXinXiaoLei <huleilei1@huawei.com>
      
      Closes #5150 from KaiXinXiaoLei/testFailed and squashes the following commits:
      
      7534b02 [KaiXinXiaoLei] The UT test of spark is failed.
      e87bf371
• [SPARK-6202] [SQL] enable variable substitution on test framework · 5ab6e9f0
      Daoyuan Wang authored
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #4930 from adrian-wang/testvs and squashes the following commits:
      
      2ce590f [Daoyuan Wang] add explicit function types
      b1d68bf [Daoyuan Wang] only substitute for parseSql
      9c4a950 [Daoyuan Wang] add a comment explaining
      18fb481 [Daoyuan Wang] enable variable substitute on test framework
      5ab6e9f0
• [SPARK-6271][SQL] Sort these tokens in alphabetic order to avoid further duplicate in HiveQl · 328daf65
      DoingDone9 authored
      Author: DoingDone9 <799203320@qq.com>
      
      Closes #4973 from DoingDone9/sort_token and squashes the following commits:
      
      855fa10 [DoingDone9] Update HiveQl.scala
      c7080b3 [DoingDone9] Sort these tokens in alphabetic order to avoid further duplicate in HiveQl
      c87e8b6 [DoingDone9] Merge pull request #3 from apache/master
      cb1852d [DoingDone9] Merge pull request #2 from apache/master
      c3f046f [DoingDone9] Merge pull request #1 from apache/master
      328daf65
• [SPARK-6326][SQL] Improve castStruct to be faster · 73d57754
      Liang-Chi Hsieh authored
The current `castStruct` is very slow. This PR slightly improves it.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #5017 from viirya/faster_caststruct and squashes the following commits:
      
      385d5b0 [Liang-Chi Hsieh] Further improved.
      746fcfb [Liang-Chi Hsieh] Make castStruct faster.
      73d57754
• [SPARK-5498][SQL] fix query exception when partition schema does not match table schema · e6d1406a
      jeanlyn authored
In Hive, the schema of a partition may differ from the table schema. When we use spark-sql to query data in a partition whose schema differs from the table schema, we get the exceptions described in the [jira](https://issues.apache.org/jira/browse/SPARK-5498). For example:
* Take a look at the schema of the partition and the table:
      
      ```sql
      DESCRIBE partition_test PARTITION (dt='1');
      id                  	int              	None
      name                	string              	None
      dt                  	string              	None
      
      # Partition Information
      # col_name            	data_type           	comment
      
      dt                  	string              	None
      ```
      ```
      DESCRIBE partition_test;
      OK
      id                  	bigint              	None
      name                	string              	None
      dt                  	string              	None
      
      # Partition Information
      # col_name            	data_type           	comment
      
      dt                  	string              	None
      ```
* Run the SQL:
      ```sql
      SELECT * FROM partition_test where dt='1';
      ```
      we will get the cast exception `java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt`
      
      Author: jeanlyn <jeanlyn92@gmail.com>
      
      Closes #4289 from jeanlyn/schema and squashes the following commits:
      
      9c8da74 [jeanlyn] fix style
      b41d6b9 [jeanlyn] fix compile errors
      07d84b6 [jeanlyn] Merge branch 'master' into schema
      535b0b6 [jeanlyn] reduce conflicts
      d6c93c5 [jeanlyn] fix bug
      1e8b30c [jeanlyn] fix code style
      0549759 [jeanlyn] fix code style
      c879aa1 [jeanlyn] clean the code
      2a91a87 [jeanlyn] add more test case and clean the code
      12d800d [jeanlyn] fix code style
      63d170a [jeanlyn] fix compile problem
      7470901 [jeanlyn] reduce conflicts
      afc7da5 [jeanlyn] make getConvertedOI compatible between 0.12.0 and 0.13.1
      b1527d5 [jeanlyn] fix type mismatch
      10744ca [jeanlyn] Insert a space after the start of the comment
      3b27af3 [jeanlyn] SPARK-5498:fix bug when query the data when partition schema does not match table schema
      e6d1406a
• [SPARK-6450] [SQL] Fixes metastore Parquet table conversion · 8c3b0052
      Cheng Lian authored
      The `ParquetConversions` analysis rule generates a hash map, which maps from the original `MetastoreRelation` instances to the newly created `ParquetRelation2` instances. However, `MetastoreRelation.equals` doesn't compare output attributes. Thus, if a single metastore Parquet table appears multiple times in a query, only a single entry ends up in the hash map, and the conversion is not correctly performed.
      
      Proper fix for this issue should be overriding `equals` and `hashCode` for MetastoreRelation. Unfortunately, this breaks more tests than expected. It's possible that these tests are ill-formed from the very beginning. As 1.3.1 release is approaching, we'd like to make the change more surgical to avoid potential regressions. The proposed fix here is to make both the metastore relations and their output attributes as keys in the hash map used in ParquetConversions.
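
A hedged sketch of the proposed keying (toy stand-ins for MetastoreRelation and its output attributes, not the actual classes):

```scala
case class Relation(table: String)          // stands in for MetastoreRelation
case class Attr(name: String, exprId: Long) // stands in for an output attribute

// Keying the map on (relation, output attributes) keeps two occurrences of
// the same table, with different attribute ids, as distinct entries; with
// relation-only keys they would collapse into one and break the conversion.
val conversions: Map[(Relation, Seq[Attr]), String] = Map(
  (Relation("t"), Seq(Attr("a", 1L))) -> "parquetRelation#1",
  (Relation("t"), Seq(Attr("a", 2L))) -> "parquetRelation#2"
)
```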
      
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #5183 from liancheng/spark-6450 and squashes the following commits:
      
      3536780 [Cheng Lian] Fixes metastore Parquet table conversion
      8c3b0052
• [SPARK-6079] Use index to speed up StatusTracker.getJobIdsForGroup() · d44a3362
      Josh Rosen authored
      `StatusTracker.getJobIdsForGroup()` is implemented via a linear scan over a HashMap rather than using an index, which might be an expensive operation if there are many (e.g. thousands) of retained jobs.
      
      This patch adds a new map to `JobProgressListener` in order to speed up these lookups.
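
A hedged sketch of the added index's shape (illustrative names, not the actual JobProgressListener fields):

```scala
import scala.collection.mutable

// Before: finding a group's jobs required scanning every retained job.
// After: a jobGroup -> jobIds index makes the lookup direct.
val jobGroupToJobIds = mutable.HashMap[String, mutable.HashSet[Int]]()

def onJobStart(jobId: Int, jobGroup: String): Unit =
  jobGroupToJobIds.getOrElseUpdate(jobGroup, mutable.HashSet()) += jobId

def getJobIdsForGroup(jobGroup: String): Seq[Int] =
  jobGroupToJobIds.get(jobGroup).map(_.toSeq).getOrElse(Seq.empty)
```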
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #4830 from JoshRosen/statustracker-job-group-indexing and squashes the following commits:
      
      e39c5c7 [Josh Rosen] Address review feedback
      6709fb2 [Josh Rosen] Merge remote-tracking branch 'origin/master' into statustracker-job-group-indexing
      2c49614 [Josh Rosen] getOrElse
      97275a7 [Josh Rosen] Add jobGroup to jobId index to JobProgressListener
      d44a3362
• [SPARK-5987] [MLlib] Save/load for GaussianMixtureModels · 4fc4d036
      MechCoder authored
      Should be self explanatory.
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #4986 from MechCoder/spark-5987 and squashes the following commits:
      
      7d2cd56 [MechCoder] Iterate over dataframe in a better way
      e7a14cb [MechCoder] Minor
      33c84f9 [MechCoder] Store as Array[Data] instead of Data[Array]
      505bd57 [MechCoder] Rebased over master and used MatrixUDT
      7422bb4 [MechCoder] Store sigmas as Array[Double] instead of Array[Array[Double]]
      b9794e4 [MechCoder] Minor
      cb77095 [MechCoder] [SPARK-5987] Save/load for GaussianMixtureModels
      4fc4d036
• [SPARK-6256] [MLlib] MLlib Python API parity check for regression · 43533738
      Yanbo Liang authored
MLlib Python API parity check for regression. The following major disparities need to be added to the Python API:
      ```scala
      LinearRegressionWithSGD
          setValidateData
      LassoWithSGD
          setIntercept
          setValidateData
      RidgeRegressionWithSGD
          setIntercept
          setValidateData
      ```
setFeatureScaling is a private MLlib function that does not need to be exposed in PySpark.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #4997 from yanboliang/spark-6256 and squashes the following commits:
      
      102f498 [Yanbo Liang] fix intercept issue & add doc test
      1fb7b4f [Yanbo Liang] change 'intercept' to 'addIntercept'
      de5ecbc [Yanbo Liang] MLlib Python API parity check for regression
      43533738
• [SPARK-5771] Master UI inconsistently displays application cores · c1b74df6
      Andrew Or authored
      If the user calls `sc.stop()`, then the number of cores under "Completed Applications" will be 0. If the user does not call `sc.stop()`, then the number of cores will be however many cores were being used before the application exited. This PR makes both cases have the behavior of the latter.
      
Note that there has been a series of PRs that attempted to fix this. For the full discussion, please refer to #4841. The unregister event is necessary because of a subtle race condition explained in that PR.
      
      Tested this locally with and without calling `sc.stop()`.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #5177 from andrewor14/master-ui-cores and squashes the following commits:
      
      62449d1 [Andrew Or] Freeze application state before finishing it
      c1b74df6
• [SPARK-6537] UIWorkloadGenerator: The main thread should not stop SparkContext until all jobs finish · acef51de
      Kousuke Saruta authored
The main thread of UIWorkloadGenerator spawns sub-threads to launch jobs, but it stops the SparkContext without waiting for those threads to finish.
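
A hedged sketch of the fix's shape (illustrative thread bodies, not UIWorkloadGenerator's actual code): join the job threads before stopping the context.

```scala
val threads = (1 to 4).map { i =>
  new Thread(s"job-thread-$i") {
    override def run(): Unit = { /* launch a Spark job here */ }
  }
}
threads.foreach(_.start())
threads.foreach(_.join()) // wait for every job thread to finish...
// sc.stop()              // ...and only then stop the SparkContext
```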
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #5187 from sarutak/SPARK-6537 and squashes the following commits:
      
      4e9307a [Kousuke Saruta] Fixed UIWorkloadGenerator so that the main thread stop SparkContext after all jobs finish
      acef51de
• [SPARK-6076][Block Manager] Fix a potential OOM issue when StorageLevel is MEMORY_AND_DISK_SER · 883b7e90
      zsxwing authored
In https://github.com/apache/spark/blob/dcd1e42d6b6ac08d2c0736bf61a15f515a1f222b/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L538 , when StorageLevel is `MEMORY_AND_DISK_SER`, it copies the content from the file into memory, then puts it into the MemoryStore:
      ```scala
                    val copyForMemory = ByteBuffer.allocate(bytes.limit)
                    copyForMemory.put(bytes)
                    memoryStore.putBytes(blockId, copyForMemory, level)
                    bytes.rewind()
      ```
However, if the file is bigger than the free memory, an OOM will happen. A better approach is to first test whether there is enough memory; if not, copyForMemory should not be created, since this copy is an optional optimization.
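
A hedged sketch of the guarded copy the description suggests (illustrative signature, not the actual MemoryStore API):

```scala
import java.nio.ByteBuffer

// Only materialize the in-memory copy when it is known to fit; otherwise
// skip this optional optimization instead of risking an OOM.
def copyIfRoom(bytes: ByteBuffer, freeMemory: Long): Option[ByteBuffer] = {
  if (bytes.limit <= freeMemory) {
    val copyForMemory = ByteBuffer.allocate(bytes.limit)
    copyForMemory.put(bytes)
    bytes.rewind()
    copyForMemory.rewind()
    Some(copyForMemory)
  } else {
    None
  }
}
```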
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #4827 from zsxwing/SPARK-6076 and squashes the following commits:
      
      7d25545 [zsxwing] Add alias for tryToPut and dropFromMemory
      1100a54 [zsxwing] Replace call-by-name with () => T
      0cc0257 [zsxwing] Fix a potential OOM issue when StorageLevel is MEMORY_AND_DISK_SER
      883b7e90
• [SPARK-6409][SQL] It is not necessary to avoid the old Hive interface, because doing so makes some UDAFs stop working · 968408b3
  DoingDone9 authored
      
Spark avoids the old Hive interface, which breaks some UDAFs such as "org.apache.hadoop.hive.ql.udf.generic.GenericUDAFAverage".
      
      Author: DoingDone9 <799203320@qq.com>
      
      Closes #5131 from DoingDone9/udaf and squashes the following commits:
      
      9de08d0 [DoingDone9] Update HiveUdfSuite.scala
      49c62dc [DoingDone9] Update hiveUdfs.scala
      98b134f [DoingDone9] Merge pull request #5 from apache/master
      161cae3 [DoingDone9] Merge pull request #4 from apache/master
      c87e8b6 [DoingDone9] Merge pull request #3 from apache/master
      cb1852d [DoingDone9] Merge pull request #2 from apache/master
      c3f046f [DoingDone9] Merge pull request #1 from apache/master
      968408b3
• [ML][FEATURE] SPARK-5566: RegEx Tokenizer · 982952f4
      Augustin Borsu authored
Added a regex-based tokenizer for ml.
Currently the regex is fixed, but if a regex-type parameter could be added to the paramMap, the tokenizer regex could become a parameter tuned during cross-validation.
I also wonder what would be the best way to add a stop-word list.
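
A hedged plain-Scala sketch of the idea (the real version wires the pattern through ml's Param machinery so it can be tuned during cross-validation):

```scala
class SimpleRegexTokenizer(val pattern: String = "\\s+") {
  // Gap-based matching: the pattern describes separators, and case is kept.
  def tokenize(text: String): Seq[String] =
    text.split(pattern).filter(_.nonEmpty).toList
}

val tokenizer = new SimpleRegexTokenizer("\\W+")
println(tokenizer.tokenize("RegEx-based tokenizer for ml"))
// List(RegEx, based, tokenizer, for, ml)
```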
      
      Author: Augustin Borsu <augustin@sagacify.com>
      Author: Augustin Borsu <a.borsu@gmail.com>
      Author: Augustin Borsu <aborsu@gmail.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4504 from aborsu985/master and squashes the following commits:
      
      716d257 [Augustin Borsu] Merge branch 'mengxr-SPARK-5566'
      cb07021 [Augustin Borsu] Merge branch 'SPARK-5566' of git://github.com/mengxr/spark into mengxr-SPARK-5566
      5f09434 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
      a164800 [Xiangrui Meng] remove tabs
      556aa27 [Xiangrui Meng] Merge branch 'aborsu985-master' into SPARK-5566
      9651aec [Xiangrui Meng] update test
      f96526d [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5566
      2338da5 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
      e88d7b8 [Xiangrui Meng] change pattern to a StringParameter; update tests
      148126f [Augustin Borsu] Added return type to public functions
      12dddb4 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
      daf685e [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
      6a85982 [Augustin Borsu] Style corrections
      38b95a1 [Augustin Borsu] Added Java unit test for RegexTokenizer
      b66313f [Augustin Borsu] Modified the pattern Param so it is compiled when given to the Tokenizer
      e262bac [Augustin Borsu] Added unit tests in scala
      cd6642e [Augustin Borsu] Changed regex to pattern
      132b00b [Augustin Borsu] Changed matching to gaps and removed case folding
      201a107 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
      cb9c9a7 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
      d3ef6d3 [Augustin Borsu] Added doc to RegexTokenizer
      9082fc3 [Augustin Borsu] Removed stopwords parameters and updated doc
      19f9e53 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
      f6a5002 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
      7f930bb [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
      77ff9ca [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
      2e89719 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
      196cd7a [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
      11ca50f [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
      9f8685a [Augustin Borsu] RegexTokenizer
      9e07a78 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
      9547e9d [Augustin Borsu] RegEx Tokenizer
      01cd26f [Augustin Borsu] RegExTokenizer
      982952f4
• [SPARK-6496] [MLLIB] GeneralizedLinearAlgorithm.run(input, initialWeights) should initialize numFeatures · 10c78607
  Yanbo Liang authored
      
In GeneralizedLinearAlgorithm, ```numFeatures``` defaults to -1; we need to update it to the correct value when we call run() to train a model.
```LogisticRegressionWithLBFGS.run(input)``` works well, but when we call ```LogisticRegressionWithLBFGS.run(input, initialWeights)``` to train a multiclass classification model, it throws an exception because numFeatures is not updated.
In this PR, we update numFeatures at the beginning of GeneralizedLinearAlgorithm.run(input, initialWeights) and add a test case.
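
A hedged sketch of the fix's shape (toy code, not the actual GeneralizedLinearAlgorithm):

```scala
class ToyGLM {
  var numFeatures: Int = -1 // defaults to -1, as in the description

  def run(input: Seq[Array[Double]], initialWeights: Array[Double]): Unit = {
    // The fix: derive numFeatures from the input before it is used, instead
    // of leaving the -1 default to break multiclass weight validation.
    if (numFeatures < 0) numFeatures = input.head.length
    require(initialWeights.length % numFeatures == 0,
      s"weights length ${initialWeights.length} incompatible with numFeatures $numFeatures")
    // ... training would proceed here ...
  }
}
```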
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #5167 from yanboliang/spark-6496 and squashes the following commits:
      
      8131c48 [Yanbo Liang] LogisticRegressionWithLBFGS.run(input, initialWeights) should initialize numFeatures
      10c78607
• [SPARK-6483][SQL] Improve ScalaUdf call performance. · 64262ed9
      zzcclp authored
As described in [SPARK-6483](https://issues.apache.org/jira/browse/SPARK-6483), ScalaUdf has low performance because it calls *asInstanceOf* to convert every record.
With this change, the performance of ScalaUdf is the same as the other cases.
Thanks to lianhuiwang for telling me how to resolve this problem.
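
A hedged illustration of the performance pattern in question (toy code, not the ScalaUdf implementation): cast to the function type once, not once per record.

```scala
val f: AnyRef = (i: Int) => i + 1
val records = Seq(1, 2, 3)

// Slow shape: an asInstanceOf conversion executed for every record.
val slow = records.map(r => f.asInstanceOf[Int => Int](r))

// Fast shape: resolve the cast once, then reuse the typed function.
val typed = f.asInstanceOf[Int => Int]
val fast = records.map(typed)
```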
      
      Author: zzcclp <xm_zzc@sina.com>
      
      Closes #5154 from zzcclp/SPARK-6483 and squashes the following commits:
      
      5ac6e09 [zzcclp] Add a newline at the end of source file
      cc6868e [zzcclp] Fix for fail on unit test.
      0a8cdc3 [zzcclp] indention issue
      b73836a [zzcclp] Access Seq[Expression] element by :: operator, and update the code gen script.
      7763848 [zzcclp] rebase from master
      64262ed9
• [DOCUMENTATION] Fixed Missing Type Import in Documentation · c5cc4146
      Bill Chambers authored
      Needed to import the types specifically, not the more general pyspark.sql
      
      Author: Bill Chambers <wchambers@ischool.berkeley.edu>
      Author: anabranch <wac.chambers@gmail.com>
      
      Closes #5179 from anabranch/master and squashes the following commits:
      
      8fa67bf [anabranch] Corrected SqlContext Import
      603b080 [Bill Chambers] [DOCUMENTATION]Fixed Missing Type Import in Documentation
      c5cc4146
  3. Mar 24, 2015
• [SPARK-6515] update OpenHashSet impl · c14ddd97
      Xiangrui Meng authored
      Though I don't see any bug in the existing code, the update in this PR makes it read better. rxin
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #5176 from mengxr/SPARK-6515 and squashes the following commits:
      
      134494d [Xiangrui Meng] update OpenHashSet impl
      c14ddd97
• [SPARK-6428][Streaming] Added explicit types for all public methods. · 94598653
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5110 from rxin/streaming-explicit-type and squashes the following commits:
      
      2c2db32 [Reynold Xin] [SPARK-6428][Streaming] Added explicit types for all public methods.
      94598653
• [SPARK-6512] add contains to OpenHashMap · 6930e965
      Xiangrui Meng authored
      Add `contains` to test whether a key exists in an OpenHashMap. rxin
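
A hedged usage sketch (OpenHashMap is a private[spark] collection, so this only compiles inside Spark's own source tree):

```scala
import org.apache.spark.util.collection.OpenHashMap

val m = new OpenHashMap[String, Int]()
m.update("a", 1)
// contains checks key existence without inserting or computing a default.
assert(m.contains("a"))
assert(!m.contains("b"))
```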
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #5171 from mengxr/openhashmap-contains and squashes the following commits:
      
      d6e6f1f [Xiangrui Meng] add contains to primitivekeyopenhashmap
      748a69b [Xiangrui Meng] add contains to OpenHashMap
      6930e965
• [SPARK-6469] Improving documentation on YARN local directories usage · 05c2214b
      Christophe Préaud authored
      Clarify the local directories usage in YARN
      
      Author: Christophe Préaud <christophe.preaud@kelkoo.com>
      
      Closes #5165 from preaudc/yarn-doc-local-dirs and squashes the following commits:
      
      6912b90 [Christophe Préaud] Fix some formatting issues.
      4fa8ec2 [Christophe Préaud] Merge remote-tracking branch 'upstream/master' into yarn-doc-local-dirs
      eaaf519 [Christophe Préaud] Clarify the local directories usage in YARN
      436fb7d [Christophe Préaud] Revert "Clarify the local directories usage in YARN"
      876ae5e [Christophe Préaud] Clarify the local directories usage in YARN
      608dbfa [Christophe Préaud] Merge remote-tracking branch 'upstream/master'
      a49a2ce [Christophe Préaud] Merge remote-tracking branch 'upstream/master'
      9ba89ca [Christophe Préaud] Ensure that files are fetched atomically
      54419ae [Christophe Préaud] Merge remote-tracking branch 'upstream/master'
      c6a5590 [Christophe Préaud] Revert commit 8ea871f8130b2490f1bad7374a819bf56f0ccbbd
      7456a33 [Christophe Préaud] Merge remote-tracking branch 'upstream/master'
      8ea871f [Christophe Préaud] Ensure that files are fetched atomically
      05c2214b
• Revert "[SPARK-5771] Number of Cores in Completed Applications of Standalone Master Web Page always be 0 if sc.stop() is called" · dd907d1a
  Andrew Or authored
      
      This reverts commit dd077abf.
      
      Conflicts:
      	core/src/main/scala/org/apache/spark/deploy/master/ApplicationInfo.scala
      	core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala
      dd907d1a
• [SPARK-3570] Include time to open files in shuffle write time. · d8ccf655
      Kay Ousterhout authored
      Opening shuffle files can be very significant when the disk is
      contended, especially when using ext3. While writing data to
      a file can avoid hitting disk (and instead hit the buffer
      cache), opening a file always involves writing some metadata
      about the file to disk, so the open time can be a very significant
      portion of the shuffle write time. In one job I ran recently, the time to
      write shuffle data to the file was only 4ms for each task, but
      the time to open the file was about 100x as long (~400ms).
      
      When we add metrics about spilled data (#2504), we should ensure
      that the file open time is also included there.
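
A hedged sketch of the metric's shape: start the clock before opening the file so the open time is counted in shuffle write time.

```scala
import java.io.FileOutputStream

val start = System.nanoTime()
val out = new FileOutputStream("shuffle-output.tmp") // open writes metadata, can hit disk
out.write(Array[Byte](1, 2, 3))                      // write may only hit the buffer cache
out.close()
val shuffleWriteTimeNanos = System.nanoTime() - start // includes the open
```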
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #4550 from kayousterhout/SPARK-3570 and squashes the following commits:
      
      ea3a4ae [Kay Ousterhout] Added comment about excluded open time
      fdc5185 [Kay Ousterhout] Improved comment
      42b7e43 [Kay Ousterhout] Fixed parens for nanotime
      2423555 [Kay Ousterhout] [SPARK-3570] Include time to open files in shuffle write time.
      d8ccf655
• [SPARK-6088] Correct how tasks that get remote results are shown in UI. · 6948ab6f
      Kay Ousterhout authored
It would be great to fix this for 1.3, since the fix is surgical and it helps understandability for users.
      
      cc shivaram pwendell
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #4839 from kayousterhout/SPARK-6088 and squashes the following commits:
      
      3ab012c [Kay Ousterhout] Update getting result time incrementally, correctly set GET_RESULT status
      f346b49 [Kay Ousterhout] Typos
      748ea6b [Kay Ousterhout] Fixed build failure
      84d617c [Kay Ousterhout] [SPARK-6088] Correct how tasks that get remote results are shown in the UI.
      6948ab6f
• [SPARK-6428][SQL] Added explicit types for all public methods in catalyst · 73348012
      Reynold Xin authored
      I think after this PR, we can finally turn the rule on. There are still some smaller ones that need to be fixed, but those are easier.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5162 from rxin/catalyst-explicit-types and squashes the following commits:
      
      e7eac03 [Reynold Xin] [SPARK-6428][SQL] Added explicit types for all public methods in catalyst.
      73348012
• [SPARK-6209] Clean up connections in ExecutorClassLoader after failing to load classes (master branch PR) · 7215aa74
  Josh Rosen authored
      
      ExecutorClassLoader does not ensure proper cleanup of network connections that it opens. If it fails to load a class, it may leak partially-consumed InputStreams that are connected to the REPL's HTTP class server, causing that server to exhaust its thread pool, which can cause the entire job to hang.  See [SPARK-6209](https://issues.apache.org/jira/browse/SPARK-6209) for more details, including a bug reproduction.
      
      This patch fixes this issue by ensuring proper cleanup of these resources.  It also adds logging for unexpected error cases.
      
      This PR is an extended version of #4935 and adds a regression test.
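
A hedged sketch of the cleanup pattern the description implies (illustrative, not the actual ExecutorClassLoader code): always release the connection's stream, even on failure.

```scala
import java.io.InputStream
import java.net.URL

def readClassBytes(url: URL): Array[Byte] = {
  val in: InputStream = url.openStream()
  try {
    // A partially consumed, unclosed stream would pin one of the class
    // server's worker threads; closing in finally prevents the leak.
    Iterator.continually(in.read()).takeWhile(_ != -1).map(_.toByte).toArray
  } finally {
    in.close()
  }
}
```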
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #4944 from JoshRosen/executorclassloader-leak-master-branch and squashes the following commits:
      
      e0e3c25 [Josh Rosen] Wrap try block around getReponseCode; re-enable keep-alive by closing error stream
      961c284 [Josh Rosen] Roll back changes that were added to get the regression test to fail
      7ee2261 [Josh Rosen] Add a failing regression test
      e2d70a3 [Josh Rosen] Properly clean up after errors in ExecutorClassLoader
      7215aa74
• [SPARK-6458][SQL] Better error messages for invalid data sources · a8f51b82
      Michael Armbrust authored
      Avoid unclear match errors and use `AnalysisException`.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #5158 from marmbrus/dataSourceError and squashes the following commits:
      
      af9f82a [Michael Armbrust] Yins comment
      90c6ba4 [Michael Armbrust] Better error messages for invalid data sources
      a8f51b82
• [SPARK-6376][SQL] Avoid eliminating subqueries until optimization · cbeaf9eb
      Michael Armbrust authored
      Previously it was okay to throw away subqueries after analysis, as we would never try to use that tree for resolution again.  However, with eager analysis in `DataFrame`s this can cause errors for queries such as:
      
      ```scala
      val df = Seq(1,2,3).map(i => (i, i.toString)).toDF("int", "str")
      df.as('x).join(df.as('y), $"x.str" === $"y.str").groupBy("x.str").count()
      ```
      
      As a result, in this PR we defer the elimination of subqueries until the optimization phase.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #5160 from marmbrus/subqueriesInDfs and squashes the following commits:
      
      a9bb262 [Michael Armbrust] Update Optimizer.scala
      27d25bf [Michael Armbrust] fix hive tests
      9137e03 [Michael Armbrust] add type
      81cd597 [Michael Armbrust] Avoid eliminating subqueries until optimization
      cbeaf9eb