  1. Jul 09, 2015
    • jerryshao's avatar
      [SPARK-8389] [STREAMING] [PYSPARK] Expose KafkaRDDs offsetRange in Python · 3ccebf36
      jerryshao authored
      This PR proposes a simple way to expose OffsetRange in Python code. The usage of offsetRanges is similar to the Scala/Java way; in Python we can get an OffsetRange like:
      
      ```
      dstream.foreachRDD(lambda r: KafkaUtils.offsetRanges(r))
      ```
      
      The reason I didn't follow the approach suggested in SPARK-8389 is that the Python Kafka API has one more step than Scala/Java: it decodes the message, so the Python API returns a transformed RDD/DStream rather than a directly wrapped JavaKafkaRDD, which makes it hard to backtrack to the original RDD to get the offsetRange.
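
      As a slightly fuller sketch of the proposed usage above (`KafkaUtils.offsetRanges` is the form proposed in this description; the `OffsetRange` attribute names topic, partition, fromOffset, and untilOffset mirror the Scala/Java API and are assumed here, and `dstream`/`KafkaUtils` are as in the snippet above):

      ```
      def report_offsets(rdd):
          # One OffsetRange per Kafka partition of the batch (assumed shape, matching Scala/Java).
          for o in KafkaUtils.offsetRanges(rdd):
              print("topic=%s partition=%d fromOffset=%d untilOffset=%d"
                    % (o.topic, o.partition, o.fromOffset, o.untilOffset))

      dstream.foreachRDD(report_offsets)
      ```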
      
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #7185 from jerryshao/SPARK-8389 and squashes the following commits:
      
      4c6d320 [jerryshao] Another way to fix subclass deserialization issue
      e6a8011 [jerryshao] Address the comments
      fd13937 [jerryshao] Fix serialization bug
      7debf1c [jerryshao] bug fix
      cff3893 [jerryshao] refactor the code according to the comments
      2aabf9e [jerryshao] Style fix
      848c708 [jerryshao] Add HasOffsetRanges for Python
      3ccebf36
    • zsxwing's avatar
      [SPARK-8701] [STREAMING] [WEBUI] Add input metadata in the batch page · 1f6b0b12
      zsxwing authored
      This PR adds `metadata` to `InputInfo`. `InputDStream` can report its metadata for a batch, and it will be shown on the batch page.
      
      For example,
      
      ![screen shot](https://cloud.githubusercontent.com/assets/1000778/8403741/d6ffc7e2-1e79-11e5-9888-c78c1575123a.png)
      
      FileInputDStream will display the new files for a batch, and DirectKafkaInputDStream will display its offset ranges.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #7081 from zsxwing/input-metadata and squashes the following commits:
      
      f7abd9b [zsxwing] Revert the space changes in project/MimaExcludes.scala
      d906209 [zsxwing] Merge branch 'master' into input-metadata
      74762da [zsxwing] Fix MiMa tests
      7903e33 [zsxwing] Merge branch 'master' into input-metadata
      450a46c [zsxwing] Address comments
      1d94582 [zsxwing] Raname InputInfo to StreamInputInfo and change "metadata" to Map[String, Any]
      d496ae9 [zsxwing] Add input metadata in the batch page
      1f6b0b12
    • Iulian Dragos's avatar
      [SPARK-6287] [MESOS] Add dynamic allocation to the coarse-grained Mesos scheduler · c4830598
      Iulian Dragos authored
      This is largely based on extracting the dynamic allocation parts from tnachen's #3861.
      
      Author: Iulian Dragos <jaguarul@gmail.com>
      
      Closes #4984 from dragos/issue/mesos-coarse-dynamicAllocation and squashes the following commits:
      
      39df8cd [Iulian Dragos] Update tests to latest changes in core.
      9d2c9fa [Iulian Dragos] Remove adjustment of executorLimitOption in doKillExecutors.
      8b00f52 [Iulian Dragos] Latest round of reviews.
      0cd00e0 [Iulian Dragos] Add persistent shuffle directory
      15c45c1 [Iulian Dragos] Add dynamic allocation to the Spark coarse-grained scheduler.
      c4830598
    • Andrew Or's avatar
      [SPARK-2017] [UI] Stage page hangs with many tasks · ebdf5853
      Andrew Or authored
      (This reopens a patch that was closed in the past: #6248)
      
      When you view the stage page while running the following:
      ```
      sc.parallelize(1 to X, 10000).count()
      ```
      The page never loads, the job is stalled, and you end up running into an OOM:
      ```
      HTTP ERROR 500
      
      Problem accessing /stages/stage/. Reason:
          Server Error
      Caused by:
      java.lang.OutOfMemoryError: Java heap space
          at java.util.Arrays.copyOf(Arrays.java:2367)
          at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
      ```
      This patch compresses Jetty responses with gzip. The correct long-term fix is to add pagination.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #7296 from andrewor14/gzip-jetty and squashes the following commits:
      
      a051c64 [Andrew Or] Use GZIP to compress Jetty responses
      ebdf5853
    • zsxwing's avatar
      [SPARK-7419] [STREAMING] [TESTS] Fix CheckpointSuite.recovery with file input stream · 88bf4303
      zsxwing authored
      Fix this failure: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2886/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/testReport/junit/org.apache.spark.streaming/CheckpointSuite/recovery_with_file_input_stream/
      
      To reproduce this failure, you can add `Thread.sleep(2000)` before this line
      https://github.com/apache/spark/blob/a9c4e29950a14e32acaac547e9a0e8879fd37fc9/streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala#L477
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #7323 from zsxwing/SPARK-7419 and squashes the following commits:
      
      b3caf58 [zsxwing] Fix CheckpointSuite.recovery with file input stream
      88bf4303
    • xutingjun's avatar
      [SPARK-8953] SPARK_EXECUTOR_CORES is not read in SparkSubmit · 930fe953
      xutingjun authored
      The `SPARK_EXECUTOR_CORES` environment variable is not put into `SparkConf`, so it has no effect on dynamic executor allocation.
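
      For reference, a minimal sketch of setting the equivalent config key programmatically, which is the value dynamic allocation reads (illustrative only, not the patch itself; the app name and values are made up):

      ```
      from pyspark import SparkConf, SparkContext

      # Setting the config key directly instead of relying on SPARK_EXECUTOR_CORES.
      conf = (SparkConf()
              .setAppName("dyn-alloc-example")
              .set("spark.executor.cores", "4")
              .set("spark.dynamicAllocation.enabled", "true"))
      sc = SparkContext(conf=conf)
      ```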
      
      Author: xutingjun <xutingjun@huawei.com>
      
      Closes #7322 from XuTingjun/SPARK_EXECUTOR_CORES and squashes the following commits:
      
      2cafa89 [xutingjun] make SPARK_EXECUTOR_CORES has effect to dynamicAllocation
      930fe953
    • Tathagata Das's avatar
      [MINOR] [STREAMING] Fix log statements in ReceiverSupervisorImpl · 7ce3b818
      Tathagata Das authored
      Log statements incorrectly showed that the executor was being stopped when the receiver was being stopped.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #7328 from tdas/fix-log and squashes the following commits:
      
      9cc6e99 [Tathagata Das] Fix log statements.
      7ce3b818
    • Cheng Hao's avatar
      [SPARK-8247] [SPARK-8249] [SPARK-8252] [SPARK-8254] [SPARK-8257] [SPARK-8258]... · 0b0b9cea
      Cheng Hao authored
      [SPARK-8247] [SPARK-8249] [SPARK-8252] [SPARK-8254] [SPARK-8257] [SPARK-8258] [SPARK-8259] [SPARK-8261] [SPARK-8262] [SPARK-8253] [SPARK-8260] [SPARK-8267] [SQL] Add String Expressions
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #6762 from chenghao-intel/str_funcs and squashes the following commits:
      
      b09a909 [Cheng Hao] update the code as feedback
      7ebbf4c [Cheng Hao] Add more string expressions
      0b0b9cea
    • Yuhao Yang's avatar
      [SPARK-8703] [ML] Add CountVectorizer as a ml transformer to convert document to words count vector · 0cd84c86
      Yuhao Yang authored
      jira: https://issues.apache.org/jira/browse/SPARK-8703
      
      Converts a text document to a sparse vector of token counts.
      
      I can further add an estimator to extract vocabulary from corpus if that's appropriate.
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #7084 from hhbyyh/countVectorization and squashes the following commits:
      
      5f3f655 [Yuhao Yang] text change
      24728e4 [Yuhao Yang] style improvement
      576728a [Yuhao Yang] rename to model and some fix
      1deca28 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into countVectorization
      99b0c14 [Yuhao Yang] undo extension from HashingTF
      12c2dc8 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into countVectorization
      7ee1c31 [Yuhao Yang] extends HashingTF
      809fb59 [Yuhao Yang] minor fix for ut
      7c61fb3 [Yuhao Yang] add countVectorizer
      0cd84c86
    • JPark's avatar
      [SPARK-8863] [EC2] Check aws access key from aws credentials if there is no boto config · c59e268d
      JPark authored
      'spark_ec2.py' uses boto to control EC2, and boto supports '~/.aws/credentials', which is the AWS CLI's default configuration file.
      
      We can confirm this in the boto reference:
      
      "A boto config file is a text file formatted like an .ini configuration file that specifies values for options that control the behavior of the boto library. In Unix/Linux systems, on startup, the boto library looks for configuration files in the following locations and in the following order:
      /etc/boto.cfg - for site-wide settings that all users on this machine will use
      (if profile is given) ~/.aws/credentials - for credentials shared between SDKs
      (if profile is given) ~/.boto - for user-specific settings
      ~/.aws/credentials - for credentials shared between SDKs
      ~/.boto - for user-specific settings"
      
      * ref of boto: http://boto.readthedocs.org/en/latest/boto_config_tut.html
      * ref of aws cli : http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html
      
      However, 'spark_ec2.py' only checks the boto config and environment variables; even when '~/.aws/credentials' exists, 'spark_ec2.py' terminates.
      
      So I changed it to also check '~/.aws/credentials'.
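
      A minimal sketch of such a check (illustrative only; the function name, profile handling, and use of ConfigParser are assumptions for the example, not the actual change to spark_ec2.py):

      ```
      import os
      try:
          from configparser import ConfigParser   # Python 3
      except ImportError:
          from ConfigParser import ConfigParser   # Python 2

      def has_aws_credentials(profile="default"):
          """Return True if ~/.aws/credentials defines keys for the given profile."""
          path = os.path.expanduser("~/.aws/credentials")
          if not os.path.isfile(path):
              return False
          parser = ConfigParser()
          parser.read(path)
          return (parser.has_section(profile)
                  and parser.has_option(profile, "aws_access_key_id")
                  and parser.has_option(profile, "aws_secret_access_key"))
      ```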
      
      cc rxin
      
      Jira : https://issues.apache.org/jira/browse/SPARK-8863
      
      Author: JPark <JPark@JPark.me>
      
      Closes #7252 from JuhongPark/master and squashes the following commits:
      
      23c5792 [JPark] Check aws access key from aws credentials if there is no boto config
      c59e268d
    • Wenchen Fan's avatar
      [SPARK-8938][SQL] Implement toString for Interval data type · f6c0bd5c
      Wenchen Fan authored
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7315 from cloud-fan/toString and squashes the following commits:
      
      4fc8d80 [Wenchen Fan] Implement toString for Interval data type
      f6c0bd5c
    • Reynold Xin's avatar
      [SPARK-8926][SQL] Code review followup. · a870a82f
      Reynold Xin authored
      I merged https://github.com/apache/spark/pull/7303 so it unblocks another PR. This addresses my own code review comment for that PR.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7313 from rxin/adt and squashes the following commits:
      
      7ade82b [Reynold Xin] Fixed unit tests.
      f8d5533 [Reynold Xin] [SPARK-8926][SQL] Code review followup.
      a870a82f
    • Reynold Xin's avatar
      [SPARK-8948][SQL] Remove ExtractValueWithOrdinal abstract class · e204d22b
      Reynold Xin authored
      Also added more documentation for the file.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7316 from rxin/extract-value and squashes the following commits:
      
      069cb7e [Reynold Xin] Removed ExtractValueWithOrdinal.
      621b705 [Reynold Xin] Reverted a line.
      11ebd6c [Reynold Xin] [Minor][SQL] Improve documentation for complex type extractors.
      e204d22b
    • Liang-Chi Hsieh's avatar
      [SPARK-8940] [SPARKR] Don't overwrite given schema in createDataFrame · 59cc3894
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-8940
      
      Currently the given `schema` parameter is overwritten in `createDataFrame`. If it is not null, we shouldn't overwrite it.
      
      Author: Liang-Chi Hsieh <viirya@appier.com>
      
      Closes #7311 from viirya/df_not_overwrite_schema and squashes the following commits:
      
      2385139 [Liang-Chi Hsieh] Don't overwrite given schema if it is not null.
      59cc3894
    • Tarek Auel's avatar
      [SPARK-8830] [SQL] native levenshtein distance · a1964e9d
      Tarek Auel authored
      Jira: https://issues.apache.org/jira/browse/SPARK-8830
      
      rxin and HuJiayin, can you have a look at it?
      
      Author: Tarek Auel <tarek.auel@googlemail.com>
      
      Closes #7236 from tarekauel/native-levenshtein-distance and squashes the following commits:
      
      ee4c4de [Tarek Auel] [SPARK-8830] implemented improvement proposals
      c252e71 [Tarek Auel] [SPARK-8830] removed chartAt; use unsafe method for byte array comparison
      ddf2222 [Tarek Auel] Merge branch 'master' into native-levenshtein-distance
      179920a [Tarek Auel] [SPARK-8830] added description
      5e9ed54 [Tarek Auel] [SPARK-8830] removed StringUtils import
      dce4308 [Tarek Auel] [SPARK-8830] native levenshtein distance
      a1964e9d
    • Davies Liu's avatar
      [SPARK-8931] [SQL] Fallback to interpreted evaluation if failed to compile in codegen · 23448a9e
      Davies Liu authored
      Exceptions will not be caught during tests.
      
      cc marmbrus rxin
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7309 from davies/fallback and squashes the following commits:
      
      969a612 [Davies Liu] throw exception during tests
      f844f77 [Davies Liu] fallback
      a3091bc [Davies Liu] Merge branch 'master' of github.com:apache/spark into fallback
      364a0d6 [Davies Liu] fallback to interpret mode if failed to compile
      23448a9e
    • lewuathe's avatar
      [SPARK-6266] [MLLIB] PySpark SparseVector missing doc for size, indices, values · f88b1253
      lewuathe authored
      Write missing pydocs for the `SparseVector` attributes.
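
      For reference, the three attributes the new pydocs cover (a small usage sketch; the example vector is made up):

      ```
      from pyspark.mllib.linalg import SparseVector

      sv = SparseVector(4, [1, 3], [3.0, 4.0])
      sv.size      # 4 -- dimension of the vector
      sv.indices   # array([1, 3]) -- positions of the non-zero entries
      sv.values    # array([ 3.,  4.]) -- values corresponding to those indices
      ```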
      
      Author: lewuathe <lewuathe@me.com>
      
      Closes #7290 from Lewuathe/SPARK-6266 and squashes the following commits:
      
      51d9895 [lewuathe] Update docs
      0480d35 [lewuathe] Merge branch 'master' into SPARK-6266
      ba42cf3 [lewuathe] [SPARK-6266] PySpark SparseVector missing doc for size, indices, values
      f88b1253
    • Wenchen Fan's avatar
      [SPARK-8942][SQL] use double not decimal when cast double and float to timestamp · 09cb0d9c
      Wenchen Fan authored
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7312 from cloud-fan/minor and squashes the following commits:
      
      a4589fa [Wenchen Fan] use double not decimal when cast double and float to timestamp
      09cb0d9c
    • Weizhong Lin's avatar
      [SPARK-8928] [SQL] Makes CatalystSchemaConverter sticking to 1.4.x- when... · 851e247c
      Weizhong Lin authored
      [SPARK-8928] [SQL] Makes CatalystSchemaConverter sticking to 1.4.x- when handling Parquet LISTs in compatible mode
      
      This PR is based on #7209 authored by Sephiroth-Lin.
      
      Author: Weizhong Lin <linweizhong@huawei.com>
      
      Closes #7314 from liancheng/spark-8928 and squashes the following commits:
      
      75267fe [Cheng Lian] Makes CatalystSchemaConverter sticking to 1.4.x- when handling LISTs in compatible mode
      851e247c
    • Cheng Lian's avatar
      Revert "[SPARK-8928] [SQL] Makes CatalystSchemaConverter sticking to 1.4.x-... · c056484c
      Cheng Lian authored
      Revert "[SPARK-8928] [SQL] Makes CatalystSchemaConverter sticking to 1.4.x- when handling Parquet LISTs in compatible mode"
      
      This reverts commit 3dab0da4.
      c056484c
    • Cheng Lian's avatar
      [SPARK-8928] [SQL] Makes CatalystSchemaConverter sticking to 1.4.x- when... · 3dab0da4
      Cheng Lian authored
      [SPARK-8928] [SQL] Makes CatalystSchemaConverter sticking to 1.4.x- when handling Parquet LISTs in compatible mode
      
      This PR is based on #7209 authored by Sephiroth-Lin.
      
      Author: Weizhong Lin <linweizhong@huawei.com>
      
      Closes #7304 from liancheng/spark-8928 and squashes the following commits:
      
      75267fe [Cheng Lian] Makes CatalystSchemaConverter sticking to 1.4.x- when handling LISTs in compatible mode
      3dab0da4
    • Reynold Xin's avatar
      Closes #7310. · a240bf3b
      Reynold Xin authored
      a240bf3b
    • Michael Armbrust's avatar
      [SPARK-8926][SQL] Good errors for ExpectsInputType expressions · 768907eb
      Michael Armbrust authored
      For example: `cannot resolve 'testfunction(null)' due to data type mismatch: argument 1 is expected to be of type int, however, null is of type datetype.`
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #7303 from marmbrus/expectsTypeErrors and squashes the following commits:
      
      c654a0e [Michael Armbrust] fix udts and make errors pretty
      137160d [Michael Armbrust] style
      5428fda [Michael Armbrust] style
      10fac82 [Michael Armbrust] [SPARK-8926][SQL] Good errors for ExpectsInputType expressions
      768907eb
  2. Jul 08, 2015
    • Kousuke Saruta's avatar
      [SPARK-8937] [TEST] A setting `spark.unsafe.exceptionOnMemoryLeak ` is missing in ScalaTest config. · aba5784d
      Kousuke Saruta authored
      `spark.unsafe.exceptionOnMemoryLeak` is present in the surefire config:
      
      ```
              <!-- Surefire runs all Java tests -->
              <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>2.18.1</version>
                <!-- Note config is repeated in scalatest config -->
      ...
      
      <spark.unsafe.exceptionOnMemoryLeak>true</spark.unsafe.exceptionOnMemoryLeak>
                  </systemProperties>
      ...
      ```
      
      but it is absent from the ScalaTest config.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #7308 from sarutak/add-setting-for-memory-leak and squashes the following commits:
      
      95644e7 [Kousuke Saruta] Added a setting for memory leak
      aba5784d
    • Andrew Or's avatar
      [SPARK-8910] Fix MiMa flaky due to port contention issue · 47ef423f
      Andrew Or authored
      Due to the way MiMa works, we currently start a `SQLContext` pretty early on. This causes us to start a `SparkUI` that attempts to bind to port 4040. Because many tests run in parallel on the Jenkins machines, this causes port contention sometimes and fails the MiMa tests.
      
      Note that we already disabled the SparkUI for scalatests. However, the MiMa test is run before we even have a chance to load the default scalatest settings, so we need to explicitly disable the UI ourselves.
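
      A minimal sketch of how a test can avoid the port 4040 bind by disabling the UI up front (illustrative only; the actual patch disables it in the test setup rather than in user code):

      ```
      from pyspark import SparkConf, SparkContext

      # spark.ui.enabled=false prevents SparkUI from binding port 4040 at all.
      conf = (SparkConf()
              .setMaster("local[2]")
              .setAppName("no-ui-example")
              .set("spark.ui.enabled", "false"))
      sc = SparkContext(conf=conf)
      ```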
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #7300 from andrewor14/mima-flaky and squashes the following commits:
      
      b55a547 [Andrew Or] Do not enable SparkUI during tests
      47ef423f
    • Josh Rosen's avatar
      [SPARK-8932] Support copy() for UnsafeRows that do not use ObjectPools · b55499a4
      Josh Rosen authored
      We call Row.copy() in many places throughout SQL but UnsafeRow currently throws UnsupportedOperationException when copy() is called.
      
      Supporting copying when ObjectPool is used may be difficult, since we may need to handle deep-copying of objects in the pool. In addition, this copy() method needs to produce a self-contained row object which may be passed around / buffered by downstream code which does not understand the UnsafeRow format.
      
      In the long run, we'll need to figure out how to handle the ObjectPool corner cases, but this may be unnecessary if other changes are made. Therefore, in order to unblock my sort patch (#6444) I propose that we support copy() for the cases where UnsafeRow does not use an ObjectPool and continue to throw UnsupportedOperationException when an ObjectPool is used.
      
      This patch accomplishes this by modifying UnsafeRow so that it knows the size of the row's backing data in order to be able to copy it into a byte array.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7306 from JoshRosen/SPARK-8932 and squashes the following commits:
      
      338e6bf [Josh Rosen] Support copy for UnsafeRows that do not use ObjectPools.
      b55499a4
    • Yijie Shen's avatar
      [SPARK-8866][SQL] use 1us precision for timestamp type · a2908148
      Yijie Shen authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-8866
      
      Author: Yijie Shen <henry.yijieshen@gmail.com>
      
      Closes #7283 from yijieshen/micro_timestamp and squashes the following commits:
      
      dc735df [Yijie Shen] update CastSuite to avoid round error
      714eaea [Yijie Shen] add timestamp_udf into blacklist due to precision lose
      c3ca2f4 [Yijie Shen] fix unhandled case in CurrentTimestamp
      8d4aa6b [Yijie Shen] use 1us precision for timestamp type
      a2908148
    • Jonathan Alter's avatar
      [SPARK-8927] [DOCS] Format wrong for some config descriptions · 28fa01e2
      Jonathan Alter authored
      A couple descriptions were not inside `<td></td>` and were being displayed immediately under the section title instead of in their row.
      
      Author: Jonathan Alter <jonalter@users.noreply.github.com>
      
      Closes #7292 from jonalter/docs-config and squashes the following commits:
      
      5ce1570 [Jonathan Alter] [DOCS] Format wrong for some config descriptions
      28fa01e2
    • Davies Liu's avatar
      [SPARK-8450] [SQL] [PYSARK] cleanup type converter for Python DataFrame · 74d8d3d9
      Davies Liu authored
      This PR fixes the converter for Python DataFrames, especially for DecimalType.
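
      A small sketch of the kind of round trip this converter has to handle (illustrative; the column name, schema, and the existence of a `sqlContext` are assumptions for the example):

      ```
      from decimal import Decimal
      from pyspark.sql.types import DecimalType, StructField, StructType

      schema = StructType([StructField("amount", DecimalType(10, 2))])
      df = sqlContext.createDataFrame([(Decimal("3.14"),)], schema)
      df.collect()   # expected: [Row(amount=Decimal('3.14'))]
      ```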
      
      Closes #7106
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7131 from davies/decimal_python and squashes the following commits:
      
      4d3c234 [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_python
      20531d6 [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_python
      7d73168 [Davies Liu] fix conflit
      6cdd86a [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_python
      7104e97 [Davies Liu] improve type infer
      9cd5a21 [Davies Liu] run python tests with SPARK_PREPEND_CLASSES
      829a05b [Davies Liu] fix UDT in python
      c99e8c5 [Davies Liu] fix mima
      c46814a [Davies Liu] convert decimal for Python DataFrames
      74d8d3d9
    • Kousuke Saruta's avatar
      [SPARK-8914][SQL] Remove RDDApi · 2a4f88b6
      Kousuke Saruta authored
      As rxin suggested in #7298, we should consider removing `RDDApi`.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #7302 from sarutak/remove-rddapi and squashes the following commits:
      
      e495d35 [Kousuke Saruta] Fixed mima
      cb7ebb9 [Kousuke Saruta] Removed overriding RDDApi
      2a4f88b6
    • Feynman Liang's avatar
      [SPARK-5016] [MLLIB] Distribute GMM mixture components to executors · f472b8cd
      Feynman Liang authored
      Distribute expensive portions of computation for Gaussian mixture components (in particular, pre-computation of `MultivariateGaussian.rootSigmaInv`, the inverse covariance matrix and covariance determinant) across executors. Repost of PR#4654.
      
      Notes for reviewers:
       * What should be the policy for when to distribute computation. Always? When numClusters > threshold? User-specified param?
      
      TODO:
       * Performance testing and comparison for large number of clusters
      
      Author: Feynman Liang <fliang@databricks.com>
      
      Closes #7166 from feynmanliang/GMM_parallel_mixtures and squashes the following commits:
      
      4f351fa [Feynman Liang] Update heuristic and scaladoc
      5ea947e [Feynman Liang] Fix parallelization logic
      00eb7db [Feynman Liang] Add helper method for GMM's M step, remove distributeGaussians flag
      e7c8127 [Feynman Liang] Add distributeGaussians flag and tests
      1da3c7f [Feynman Liang] Distribute mixtures
      f472b8cd
    • Feynman Liang's avatar
      [SPARK-8877] [MLLIB] Public API for association rule generation · 8c32b2e8
      Feynman Liang authored
      Adds FPGrowth.generateAssociationRules to the public API for generating association rules after mining frequent itemsets.
      
      Author: Feynman Liang <fliang@databricks.com>
      
      Closes #7271 from feynmanliang/SPARK-8877 and squashes the following commits:
      
      83b8baf [Feynman Liang] Add API Doc
      867abff [Feynman Liang] Add FPGrowth.generateAssociationRules and change access modifiers for AssociationRules
      8c32b2e8
    • Yanbo Liang's avatar
      [SPARK-8068] [MLLIB] Add confusionMatrix method at class MulticlassMetrics in pyspark/mllib · 381cb161
      Yanbo Liang authored
      Add a confusionMatrix method to class MulticlassMetrics in pyspark/mllib.
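
      A small usage sketch (the prediction/label pairs are made up for the example, and a running SparkContext `sc` is assumed):

      ```
      from pyspark.mllib.evaluation import MulticlassMetrics

      # RDD of (prediction, label) pairs.
      predictionAndLabels = sc.parallelize(
          [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)])
      metrics = MulticlassMetrics(predictionAndLabels)
      print(metrics.confusionMatrix().toArray())
      ```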
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #7286 from yanboliang/spark-8068 and squashes the following commits:
      
      6109fe1 [Yanbo Liang] Add confusionMatrix method at class MulticlassMetrics in pyspark/mllib
      381cb161
    • Cheng Lian's avatar
      [SPARK-6123] [SPARK-6775] [SPARK-6776] [SQL] Refactors Parquet read path for... · 4ffc27ca
      Cheng Lian authored
      [SPARK-6123] [SPARK-6775] [SPARK-6776] [SQL] Refactors Parquet read path for interoperability and backwards-compatibility
      
      This PR is a follow-up of #6617 and is part of [SPARK-6774] [2], which aims to ensure interoperability and backwards-compatibility for Spark SQL Parquet support.  And this one fixes the read path.  Now Spark SQL is expected to be able to read legacy Parquet data files generated by most (if not all) common libraries/tools like parquet-thrift, parquet-avro, and parquet-hive. However, we still need to refactor the write path to write standard Parquet LISTs and MAPs ([SPARK-8848] [4]).
      
      ### Major changes
      
      1. `CatalystConverter` class hierarchy refactoring
      
         - Replaces `CatalystConverter` trait with a much simpler `ParentContainerUpdater`.
      
           Now instead of extending the original `CatalystConverter` trait, every converter class accepts an updater which is responsible for propagating the converted value to some parent container. For example, appending array elements to a parent array buffer, appending a key-value pairs to a parent mutable map, or setting a converted value to some specific field of a parent row. Root converter doesn't have a parent and thus uses a `NoopUpdater`.
      
           This simplifies the design since converters don't need to care about details of their parent converters anymore.
      
         - Unifies `CatalystRootConverter`, `CatalystGroupConverter` and `CatalystPrimitiveRowConverter` into `CatalystRowConverter`
      
           Specifically, now all row objects are represented by `SpecificMutableRow` during conversion.
      
         - Refactors `CatalystArrayConverter`, and removes `CatalystArrayContainsNullConverter` and `CatalystNativeArrayConverter`
      
           `CatalystNativeArrayConverter` was probably designed with the intention of avoiding boxing costs. However, the way it uses Scala generics actually doesn't achieve this goal.
      
           The new `CatalystArrayConverter` handles both nullable and non-nullable array elements in a consistent way.
      
         - Implements backwards-compatibility rules in `CatalystArrayConverter`
      
           When Parquet records are being converted, schema of Parquet files should have already been verified. So we only need to care about the structure rather than field names in the Parquet schema. Since all map objects represented in legacy systems have the same structure as the standard one (see [backwards-compatibility rules for MAP] [1]), we only need to deal with LIST (namely array) in `CatalystArrayConverter`.
      
      2. Requested columns handling
      
          When specifying requested columns in `RowReadSupport`, we used to use a Parquet `MessageType` converted from a Catalyst `StructType` which contains all requested columns. This is not preferable when taking compatibility and interoperability into consideration, because the actual Parquet file may have a different physical structure from the converted schema.
      
         In this PR, the schema for requested columns is constructed using the following method:
      
         - For a column that exists in the target Parquet file, we extract the column type by name from the full file schema, and construct a single-field `MessageType` for that column.
         - For a column that doesn't exist in the target Parquet file, we create a single-field `StructType` and convert it to a `MessageType` using `CatalystSchemaConverter`.
         - Unions all single-field `MessageType`s into a full schema containing all requested fields
      
         With this change, we also fix [SPARK-6123] [3] by validating the global schema against each individual Parquet part-files.
      
      ### Testing
      
      This PR also adds compatibility tests for parquet-avro, parquet-thrift, and parquet-hive. Please refer to `README.md` under `sql/core/src/test` for more information about these tests. To avoid build time code generation and adding extra complexity to the build system, Java code generated from testing Thrift schema and Avro IDL is also checked in.
      
      [1]: https://github.com/apache/incubator-parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1
      [2]: https://issues.apache.org/jira/browse/SPARK-6774
      [3]: https://issues.apache.org/jira/browse/SPARK-6123
      [4]: https://issues.apache.org/jira/browse/SPARK-8848
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #7231 from liancheng/spark-6776 and squashes the following commits:
      
      360fe18 [Cheng Lian] Adds ParquetHiveCompatibilitySuite
      c6fbc06 [Cheng Lian] Removes WIP file committed by mistake
      b8c1295 [Cheng Lian] Excludes the whole parquet package from MiMa
      598c3e8 [Cheng Lian] Adds extra Maven repo for hadoop-lzo, which is a transitive dependency of parquet-thrift
      926af87 [Cheng Lian] Simplifies Parquet compatibility test suites
      7946ee1 [Cheng Lian] Fixes Scala styling issues
      3d7ab36 [Cheng Lian] Fixes .rat-excludes
      a8f13bb [Cheng Lian] Using Parquet writer API to do compatibility tests
      f2208cd [Cheng Lian] Adds README.md for Thrift/Avro code generation
      1d390aa [Cheng Lian] Adds parquet-thrift compatibility test
      440f7b3 [Cheng Lian] Adds generated files to .rat-excludes
      13b9121 [Cheng Lian] Adds ParquetAvroCompatibilitySuite
      06cfe9d [Cheng Lian] Adds comments about TimestampType handling
      a099d3e [Cheng Lian] More comments
      0cc1b37 [Cheng Lian] Fixes MiMa checks
      884d3e6 [Cheng Lian] Fixes styling issue and reverts unnecessary changes
      802cbd7 [Cheng Lian] Fixes bugs related to schema merging and empty requested columns
      38fe1e7 [Cheng Lian] Adds explicit return type
      7fb21f1 [Cheng Lian] Reverts an unnecessary debugging change
      1781dff [Cheng Lian] Adds test case for SPARK-8811
      6437d4b [Cheng Lian] Assembles requested schema from Parquet file schema
      bcac49f [Cheng Lian] Removes the 16-byte restriction of decimals
      a74fb2c [Cheng Lian] More comments
      0525346 [Cheng Lian] Removes old Parquet record converters
      03c3bd9 [Cheng Lian] Refactors Parquet read path to implement backwards-compatibility rules
      4ffc27ca
    • Daniel Darabos's avatar
      [SPARK-8902] Correctly print hostname in error · 5687f765
      Daniel Darabos authored
      With "+" the strings are separate expressions, and format() is called on the last string before concatenation. (So substitution does not happen.) Without "+" the string literals are merged first by the parser, so format() is called on the complete string.
      
      Should I make a JIRA for this?
      
      Author: Daniel Darabos <darabos.daniel@gmail.com>
      
      Closes #7288 from darabos/patch-2 and squashes the following commits:
      
      be0d3b7 [Daniel Darabos] Correctly print hostname in error
      5687f765
    • DB Tsai's avatar
      [SPARK-8700][ML] Disable feature scaling in Logistic Regression · 57221934
      DB Tsai authored
      All compressed sensing applications, and some regression use cases, get better results with feature scaling turned off. However, implementing this naively by training on the dataset without any standardization gives a poor rate of convergence. Instead, we can still standardize the training dataset but penalize each component differently, so that we effectively optimize the same objective function while solving a better-conditioned numerical problem. As a result, columns with high variance are penalized less, and vice versa. Without this, all the features are standardized and therefore penalized equally.
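
      One way to write this down, as a rough sketch in my own notation (not taken from the patch): with each column j scaled by its standard deviation sigma_j, train against a column-wise reweighted L2 penalty,

      ```
      \min_{\tilde{\beta}} \; L(\tilde{\beta};\, x_j / \sigma_j) \;+\; \lambda \sum_j \Bigl( \frac{\tilde{\beta}_j}{\sigma_j} \Bigr)^{2}
      ```

      which is equivalent to the unstandardized objective L(beta; x) + lambda * sum_j beta_j^2 with beta_j = tilde(beta)_j / sigma_j, so columns with larger sigma_j contribute less to the penalty.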
      
      In R, there is an option for this.
      `standardize`
      Logical flag for x variable standardization, prior to fitting the model sequence. The coefficients are always returned on the original scale. Default is standardize=TRUE. If variables are in the same units already, you might not wish to standardize. See details below for y standardization with family="gaussian".
      
      +cc holdenk mengxr jkbradley
      
      Author: DB Tsai <dbt@netflix.com>
      
      Closes #7080 from dbtsai/lors and squashes the following commits:
      
      877e6c7 [DB Tsai] repahse the doc
      7cf45f2 [DB Tsai] address feedback
      78d75c9 [DB Tsai] small change
      c2c9e60 [DB Tsai] style
      6e1a8e0 [DB Tsai] first commit
      57221934
    • Cheolsoo Park's avatar
      [SPARK-8908] [SQL] Add () to distinct definition in dataframe · 00b265f1
      Cheolsoo Park authored
      Adding `()` to the definition of `distinct` in DataFrame allows distinct to be called with parentheses, which is consistent with `dropDuplicates`.
      
      Author: Cheolsoo Park <cheolsoop@netflix.com>
      
      Closes #7298 from piaozhexiu/SPARK-8908 and squashes the following commits:
      
      7f0d923 [Cheolsoo Park] Add () to distinct definition in dataframe
      00b265f1
    • Alok Singh's avatar
      [SPARK-8909][Documentation] Change the scala example in sql-programmi… · 8f3cd932
      Alok Singh authored
      …ng-guide#Manually Specifying Options to be in sync with the Java, Python, and R versions
      
      Author: Alok Singh <“singhal@us.ibm.com”>
      
      Closes #7299 from aloknsingh/aloknsingh_SPARK-8909 and squashes the following commits:
      
      d3c20ba [Alok Singh] fix the file to .parquet from .json
      d476140 [Alok Singh] [SPARK-8909][Documentation] Change the scala example in sql-programming-guide#Manually Specifying Options to be in sync with java,python, R version
      8f3cd932
    • Feynman Liang's avatar
      [SPARK-8457] [ML] NGram Documentation · c5532e2f
      Feynman Liang authored
      Add documentation for NGram feature transformer.
      
      Author: Feynman Liang <fliang@databricks.com>
      
      Closes #7244 from feynmanliang/SPARK-8457 and squashes the following commits:
      
      5aface9 [Feynman Liang] Pretty print Scala output and add API doc to each codetab
      60d5ac0 [Feynman Liang] Inline API doc and fix indentation
      736ccbc [Feynman Liang] NGram feature transformer documentation
      c5532e2f
    • Keuntae Park's avatar
      [SPARK-8783] [SQL] CTAS with WITH clause does not work · f0315437
      Keuntae Park authored
      Currently, CTESubstitution only handles the case where WITH is at the top of the plan.
      I think it SHOULD also handle the case where WITH is a child of CTAS.
      This patch simply changes 'match' to 'transform' for a recursive search of WITH in the plan.
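
      An illustrative query of the shape this enables (table and column names are made up, and a HiveContext named `sqlContext` is assumed):

      ```
      # CTAS whose SELECT is wrapped in a WITH clause; previously the CTE was only
      # resolved when WITH sat at the top of the plan.
      sqlContext.sql("""
          CREATE TABLE recent_events AS
          WITH filtered AS (
              SELECT id, ts FROM events WHERE ts > 100
          )
          SELECT * FROM filtered
      """)
      ```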
      
      Author: Keuntae Park <sirpkt@apache.org>
      
      Closes #7180 from sirpkt/SPARK-8783 and squashes the following commits:
      
      e4428f0 [Keuntae Park] Merge remote-tracking branch 'upstream/master' into CTASwithWITH
      1671c77 [Keuntae Park] WITH clause can be inside CTAS
      f0315437