  1. Jul 14, 2015
    • [SPARK-8962] Add Scalastyle rule to ban direct use of Class.forName; fix existing uses · 11e5c372
      Josh Rosen authored
      This pull request adds a Scalastyle regex rule that fails the style check if `Class.forName` is used directly. `Class.forName` always loads classes from the default / system classloader, but in most cases we should use Spark's own `Utils.classForName` instead, which tries to load classes from the current thread's context classloader and falls back to the classloader that loaded Spark when no context classloader is defined.
      
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7350 from JoshRosen/ban-Class.forName and squashes the following commits:
      
      e3e96f7 [Josh Rosen] Merge remote-tracking branch 'origin/master' into ban-Class.forName
      c0b7885 [Josh Rosen] Hopefully fix the last two cases
      d707ba7 [Josh Rosen] Fix uses of Class.forName that I missed in my first cleanup pass
      046470d [Josh Rosen] Merge remote-tracking branch 'origin/master' into ban-Class.forName
      62882ee [Josh Rosen] Fix uses of Class.forName or add exclusion.
      d9abade [Josh Rosen] Add stylechecker rule to ban uses of Class.forName
      11e5c372
    • [SPARK-4362] [MLLIB] Make prediction probability available in NaiveBayesModel · 740b034f
      Sean Owen authored
      Add predictProbabilities to Naive Bayes, return class probabilities.
      
      Continues https://github.com/apache/spark/pull/6761
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #7376 from srowen/SPARK-4362 and squashes the following commits:
      
      23d5a76 [Sean Owen] Fix model.labels -> model.theta
      95d91fb [Sean Owen] Check that predicted probabilities sum to 1
      b32d1c8 [Sean Owen] Add predictProbabilities to Naive Bayes, return class probabilities
      740b034f
    • [SPARK-8800] [SQL] Fix inaccurate precision/scale of Decimal division operation · 4b5cfc98
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-8800
      
      Previously, we turned to Java BigDecimal's divide with a specified ROUNDING_MODE to avoid the non-terminating decimal expansion problem. However, as JihongMA reported, the division operation yields inaccurate results for some specific values.
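
The precision/scale trade-off behind this fix can be reproduced with Python's `decimal` module (a minimal sketch of the general problem, not Spark's `Decimal` implementation):

```python
from decimal import Decimal, localcontext, ROUND_HALF_UP

def divide(a, b, precision, rounding=ROUND_HALF_UP):
    """Divide two decimals under a fixed precision, the way a JVM
    BigDecimal divide with an explicit rounding mode would."""
    with localcontext() as ctx:
        ctx.prec = precision
        ctx.rounding = rounding
        return Decimal(a) / Decimal(b)

# A non-terminating expansion must be cut off at some precision...
low  = divide("1", "3", precision=4)   # 0.3333
high = divide("1", "3", precision=20)  # 0.33333333333333333333
# ...and too small a precision makes downstream arithmetic inaccurate:
assert low * 3 != high * 3
```

Picking the quotient's precision and scale from the operands' precision and scale, rather than a fixed cutoff, is what this patch adjusts.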
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #7212 from viirya/fix_decimal4 and squashes the following commits:
      
      4205a0a [Liang-Chi Hsieh] Fix inaccuracy precision/scale of Decimal division operation.
      4b5cfc98
    • [SPARK-4072] [CORE] Display Streaming blocks in Streaming UI · fb1d06fc
      zsxwing authored
      Replace #6634
      
      This PR adds `SparkListenerBlockUpdated` to SparkListener so that it can monitor all block update infos that are sent to `BlockManagerMasterEndpoint`, and adds new tables in the Storage tab to display the stream block infos.
      
      ![screen shot 2015-07-01 at 5 19 46 pm](https://cloud.githubusercontent.com/assets/1000778/8451562/c291a6ec-2016-11e5-890d-0afc174e1f8c.png)
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #6672 from zsxwing/SPARK-4072-2 and squashes the following commits:
      
      df2c1d8 [zsxwing] Use xml query to check the xml elements
      54d54af [zsxwing] Add unit tests for StoragePage
      e29fb53 [zsxwing] Update as per TD's comments
      ccbee07 [zsxwing] Fix the code style
      6dc42b4 [zsxwing] Fix the replication level of blocks
      450fad1 [zsxwing] Merge branch 'master' into SPARK-4072-2
      1e9ef52 [zsxwing] Don't categorize by Executor ID
      ca0ab69 [zsxwing] Fix the code style
      3de2762 [zsxwing] Make object BlockUpdatedInfo private
      e95b594 [zsxwing] Add 'Aggregated Stream Block Metrics by Executor' table
      ba5d0d1 [zsxwing] Refactor the unit test to improve the readability
      4bbe341 [zsxwing] Revert JsonProtocol and don't log SparkListenerBlockUpdated
      b464dd1 [zsxwing] Add onBlockUpdated to EventLoggingListener
      5ba014c [zsxwing] Fix the code style
      0b1e47b [zsxwing] Add a developer api BlockUpdatedInfo
      04838a9 [zsxwing] Fix the code style
      2baa161 [zsxwing] Add unit tests
      80f6c6d [zsxwing] Address comments
      797ee4b [zsxwing] Display Streaming blocks in Streaming UI
      fb1d06fc
    • [SPARK-8718] [GRAPHX] Improve EdgePartition2D for non perfect square number of partitions · 0a4071ea
      Andrew Ray authored
      See https://github.com/aray/e2d/blob/master/EdgePartition2D.ipynb
      
      Author: Andrew Ray <ray.andrew@gmail.com>
      
      Closes #7104 from aray/edge-partition-2d-improvement and squashes the following commits:
      
      3729f84 [Andrew Ray] correct bounds and remove unneeded comments
      97f8464 [Andrew Ray] change less
      5141ab4 [Andrew Ray] Merge branch 'master' into edge-partition-2d-improvement
      925fd2c [Andrew Ray] use new interface for partitioning
      001bfd0 [Andrew Ray] Refactor PartitionStrategy so that we can return a partition function for a given number of parts. To keep compatibility we define default methods that translate between the two implementation options. Made EdgePartition2D use old strategy when we have a perfect square and implement new interface.
      5d42105 [Andrew Ray] % -> /
      3560084 [Andrew Ray] Merge branch 'master' into edge-partition-2d-improvement
      f006364 [Andrew Ray] remove unneeded comments
      cfa2c5e [Andrew Ray] Modifications to EdgePartition2D so that it works for non perfect squares.
      0a4071ea
    • [SPARK-9031] Merge BlockObjectWriter and DiskBlockObjectWriter to remove abstract class · d267c283
      Josh Rosen authored
      BlockObjectWriter has only one concrete non-test class, DiskBlockObjectWriter. In order to simplify the code in preparation for other refactorings, I think that we should remove this base class and have only DiskBlockObjectWriter.
      
      While at one time we may have planned to have multiple BlockObjectWriter implementations, that doesn't seem to have happened, so the extra abstraction seems unnecessary.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7391 from JoshRosen/shuffle-write-interface-refactoring and squashes the following commits:
      
      c418e33 [Josh Rosen] Fix compilation
      5047995 [Josh Rosen] Fix comments
      d5dc548 [Josh Rosen] Update references in comments
      89dc797 [Josh Rosen] Rename test suite.
      5755918 [Josh Rosen] Remove unnecessary val in case class
      1607c91 [Josh Rosen] Merge BlockObjectWriter and DiskBlockObjectWriter
      d267c283
    • [SPARK-8911] Fix local mode endless heartbeats · 8fb3a65c
      Andrew Or authored
      As of #7173 we expect executors to properly register with the driver before responding to their heartbeats. This behavior is not matched in local mode. This patch adds the missing event that needs to be posted.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #7382 from andrewor14/fix-local-heartbeat and squashes the following commits:
      
      1258bdf [Andrew Or] Post ExecutorAdded event to local executor
      8fb3a65c
    • [SPARK-8933] [BUILD] Provide a --force flag to build/mvn that always uses downloaded maven · c4e98ff0
      Brennon York authored
      Added a --force flag to manually download, if necessary, and use a built-in version of Maven best suited for Spark.
      
      Author: Brennon York <brennon.york@capitalone.com>
      
      Closes #7374 from brennonyork/SPARK-8933 and squashes the following commits:
      
      d673127 [Brennon York] added --force flag to manually download, if necessary, and use a built-in version of maven best for spark
      c4e98ff0
    • [SPARK-9027] [SQL] Generalize metastore predicate pushdown · 37f2d963
      Michael Armbrust authored
      Add support for pushing down metastore filters that are in different orders and add some unit tests.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #7386 from marmbrus/metastoreFilters and squashes the following commits:
      
      05a4524 [Michael Armbrust] [SPARK-9027][SQL] Generalize metastore predicate pushdown
      37f2d963
    • [SPARK-9029] [SQL] shortcut CaseKeyWhen if key is null · 59d820aa
      Wenchen Fan authored
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7389 from cloud-fan/case-when and squashes the following commits:
      
      ea4b6ba [Wenchen Fan] shortcut for case key when
      59d820aa
    • [SPARK-6851] [SQL] function least/greatest follow up · 257236c3
      Daoyuan Wang authored
      This is a follow-up to the remaining comments from #6851
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #7387 from adrian-wang/udflgfollow and squashes the following commits:
      
      6163e62 [Daoyuan Wang] add skipping null values
      e8c2e09 [Daoyuan Wang] use seq
      8362966 [Daoyuan Wang] pr6851 follow up
      257236c3
    • [SPARK-9010] [DOCUMENTATION] Improve the Spark Configuration document about `spark.kryoserializer.buffer` · c1feebd8
      zhaishidan authored

      The meaning of spark.kryoserializer.buffer should be "Initial size of Kryo's serialization buffer. Note that there will be one buffer per core on each worker. This buffer will grow up to spark.kryoserializer.buffer.max if needed."
      
      The spark.kryoserializer.buffer.max.mb setting is out of date as of Spark 1.4.
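
The grow-up-to-max semantics described above can be sketched as a toy model (the doubling growth policy here is an assumption for illustration, not Kryo's actual strategy):

```python
class GrowableBuffer:
    """Toy model of a per-core serialization buffer that starts at an
    initial size and grows up to a hard maximum, like the documented
    spark.kryoserializer.buffer / spark.kryoserializer.buffer.max pair."""

    def __init__(self, initial_kb, max_kb):
        self.size_kb = initial_kb
        self.max_kb = max_kb

    def ensure_capacity(self, needed_kb):
        # Grow (here: by doubling -- an assumed policy) until the request
        # fits, failing once the configured maximum is reached.
        while self.size_kb < needed_kb:
            if self.size_kb >= self.max_kb:
                raise BufferError("serialized object too large; "
                                  "raise spark.kryoserializer.buffer.max")
            self.size_kb = min(self.size_kb * 2, self.max_kb)
        return self.size_kb

buf = GrowableBuffer(initial_kb=64, max_kb=1024)
assert buf.ensure_capacity(500) == 512   # grew 64 -> 128 -> 256 -> 512
```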
      
      Author: zhaishidan <zhaishidan@haizhi.com>
      
      Closes #7393 from stanzhai/master and squashes the following commits:
      
      69729ef [zhaishidan] fix document error about spark.kryoserializer.buffer.max.mb
      c1feebd8
    • [SPARK-9001] Fixing errors in javadocs that lead to failed build/sbt doc · 20c1434a
      Joseph Gonzalez authored
      These are minor corrections in the documentation of several classes that are preventing:
      
      ```bash
      build/sbt publish-local
      ```
      
      I believe this might be an issue associated with running JDK8 as ankurdave does not appear to have this issue in JDK7.
      
      Author: Joseph Gonzalez <joseph.e.gonzalez@gmail.com>
      
      Closes #7354 from jegonzal/FixingJavadocErrors and squashes the following commits:
      
      6664b7e [Joseph Gonzalez] making requested changes
      2e16d89 [Joseph Gonzalez] Fixing errors in javadocs that prevents build/sbt publish-local from completing.
      20c1434a
  2. Jul 13, 2015
    • [SPARK-6910] [SQL] Support for pushing predicates down to metastore for partition pruning · 408b384d
      Cheolsoo Park authored
      This PR supersedes my old one #6921. Since my patch has changed quite a bit, I am opening a new PR to make it easier to review.
      
      The changes include-
      * Implement `toMetastoreFilter()` function in `HiveShim` that takes `Seq[Expression]` and converts them into a filter string for Hive metastore.
       * This function matches all the `AttributeReference` + `BinaryComparisonOp` + `Integral/StringType` patterns in `Seq[Expression]` and folds them into a string.
      * Change `hiveQlPartitions` field in `MetastoreRelation` to `getHiveQlPartitions()` function that takes a filter string parameter.
      * Call `getHiveQlPartitions()` in `HiveTableScan` with a filter string.
      
      But there are some cases in which predicate pushdown is disabled-
      
      Case | Predicate pushdown
      ------- | -----------------------------
      Hive integral and string types | Yes
      Hive varchar type | No
      Hive 0.13 and newer | Yes
      Hive 0.12 and older | No
      convertMetastoreParquet=false | Yes
      convertMetastoreParquet=true | No
      
      In the case of `convertMetastoreParquet=true`, predicates are not pushed down because this conversion happens in an `Analyzer` rule (`HiveMetastoreCatalog.ParquetConversions`). At this point, `HiveTableScan` hasn't run, so predicates are not available. But reading the source code, I think it is intentional to convert the entire Hive table with all its partitions into `ParquetRelation`, because then `ParquetRelation` can be cached and reused for any query against that table. Please correct me if I am wrong.
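
The folding of comparison expressions into a metastore filter string can be illustrated with a small sketch (a hypothetical helper, not the actual `toMetastoreFilter()` in `HiveShim`):

```python
def to_metastore_filter(predicates):
    """Fold (attribute, op, value) comparisons into a Hive metastore
    filter string, quoting string literals and skipping anything that
    is not a pushable pattern (mirroring the cases table above)."""
    parts = []
    for attr, op, value in predicates:
        if isinstance(value, int):
            parts.append(f"{attr} {op} {value}")
        elif isinstance(value, str):
            parts.append(f'{attr} {op} "{value}"')
        # other value types (e.g. varchar-backed) are not pushed down
    return " and ".join(parts)

print(to_metastore_filter([("ds", "=", "2015-07-13"), ("hr", ">", 10)]))
# ds = "2015-07-13" and hr > 10
```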
      
      cc marmbrus
      
      Author: Cheolsoo Park <cheolsoop@netflix.com>
      
      Closes #7216 from piaozhexiu/SPARK-6910-2 and squashes the following commits:
      
      aa1490f [Cheolsoo Park] Fix ordering of imports
      c212c4d [Cheolsoo Park] Incorporate review comments
      5e93f9d [Cheolsoo Park] Predicate pushdown into Hive metastore
      408b384d
    • [SPARK-8743] [STREAMING] Deregister Codahale metrics for streaming when StreamingContext is closed · b7bcbe25
      Neelesh Srinivas Salian authored
      The issue link: https://issues.apache.org/jira/browse/SPARK-8743
      Deregister Codahale metrics for streaming when StreamingContext is closed
      
      Design:
      Adding the method calls in the appropriate start() and stop() methods for the StreamingContext
      
      Actions in the PullRequest:
      1) Added the registerSource method call to the start method for the Streaming Context.
      2) Added the removeSource method to the stop method.
      3) Added comments for both 1 and 2 and comment to show initialization of the StreamingSource
      4) Added a test case to check for both registration and de-registration of metrics
      
      Previous closed PR for reference: https://github.com/apache/spark/pull/7250
      
      Author: Neelesh Srinivas Salian <nsalian@cloudera.com>
      
      Closes #7362 from nssalian/branch-SPARK-8743 and squashes the following commits:
      
      7d998a3 [Neelesh Srinivas Salian] Removed the Thread.sleep() call
      8b26397 [Neelesh Srinivas Salian] Moved the scalatest.{} import
      0e8007a [Neelesh Srinivas Salian] moved import org.apache.spark{} to correct place
      daedaa5 [Neelesh Srinivas Salian] Corrected Ordering of imports
      8873180 [Neelesh Srinivas Salian] Removed redundancy in imports
      59227a4 [Neelesh Srinivas Salian] Changed the ordering of the imports to classify  scala and spark imports
      d8cb577 [Neelesh Srinivas Salian] Added registerSource to start() and removeSource to stop(). Wrote a test to check the registration and de-registration
      b7bcbe25
    • [SPARK-8533] [STREAMING] Upgrade Flume to 1.6.0 · 0aed38e4
      Hari Shreedharan authored
      Author: Hari Shreedharan <hshreedharan@apache.org>
      
      Closes #6939 from harishreedharan/upgrade-flume-1.6.0 and squashes the following commits:
      
      94b80ae [Hari Shreedharan] [SPARK-8533][Streaming] Upgrade Flume to 1.6.0
      0aed38e4
    • [SPARK-8636] [SQL] Fix equalNullSafe comparison · 4c797f2b
      Vinod K C authored
      Author: Vinod K C <vinod.kc@huawei.com>
      
      Closes #7040 from vinodkc/fix_CaseKeyWhen_equalNullSafe and squashes the following commits:
      
      be5e641 [Vinod K C] Renamed equalNullSafe to threeValueEquals
      aac9f67 [Vinod K C] Updated test suite and genCode method
      f2d0b53 [Vinod K C]  Fix equalNullSafe comparison
      4c797f2b
    • [SPARK-8991] [ML] Update SharedParamsCodeGen's Generated Documentation · 714fc55f
      Vinod K C authored
      Removed private[ml] from Generated documentation
      
      Author: Vinod K C <vinod.kc@huawei.com>
      
      Closes #7367 from vinodkc/fix_sharedparmascodegen and squashes the following commits:
      
      4fa3c8f [Vinod K C] Adding auto generated code
      7e19025 [Vinod K C] Removed private[ml]
      714fc55f
    • [SPARK-8954] [BUILD] Remove unneeded deb repository from Dockerfile to fix build error in docker. · 5c41691f
      yongtang authored
      [SPARK-8954] [Build]
      1. Remove unneeded deb repository from Dockerfile to fix build error in docker.
      2. Remove unneeded /var/lib/apt/lists/* after install to reduce the docker image size (by ~30MB).
      
      Author: yongtang <yongtang@users.noreply.github.com>
      
      Closes #7346 from yongtang/SPARK-8954 and squashes the following commits:
      
      36024a1 [yongtang] [SPARK-8954] [Build] Remove unneeded /var/lib/apt/lists/* after install to reduce the docker image size (by ~30MB)
      7084941 [yongtang] [SPARK-8954] [Build] Remove unneeded deb repository from Dockerfile to fix build error in docker.
      5c41691f
    • Davies Liu · 79c35826
    • [SPARK-8950] [WEBUI] Correct the calculation of SchedulerDelay in StagePage · 5ca26fb6
      Carson Wang authored
      In StagePage, the SchedulerDelay is calculated as totalExecutionTime - executorRunTime - executorOverhead - gettingResultTime,
      but totalExecutionTime is calculated in a way that doesn't include gettingResultTime.
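
The corrected arithmetic can be written out explicitly (a sketch of the quantities named above, not the actual StagePage code):

```python
def scheduler_delay(total_execution_time, executor_run_time,
                    executor_overhead, getting_result_time):
    """Scheduler delay is the part of a task's wall time not accounted
    for by running, overhead, or fetching the result.  The bug was that
    total_execution_time omitted getting_result_time, so subtracting it
    here effectively double-counted it."""
    return (total_execution_time - executor_run_time
            - executor_overhead - getting_result_time)

# If the total already includes the 50 ms spent fetching the result,
# the delay comes out non-negative as expected:
assert scheduler_delay(1000, 800, 100, 50) == 50
```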
      
      Author: Carson Wang <carson.wang@intel.com>
      
      Closes #7319 from carsonwang/SchedulerDelayTime and squashes the following commits:
      
      f66fb6e [Carson Wang] Update the code style
      7d971ae [Carson Wang] Correct the calculation of SchedulerDelay
      5ca26fb6
    • [SPARK-8706] [PYSPARK] [PROJECT INFRA] Add pylint checks to PySpark · 9b62e937
      MechCoder authored
      This adds Pylint checks to PySpark.
      
      For now this lazily installs Pylint using easy_install to /dev/pylint (similar to the pep8 script).
      We still need to figure out which rules should be allowed.
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #7241 from MechCoder/pylint and squashes the following commits:
      
      8496834 [MechCoder] Silence warnings and make pylint tests fail to check if it works in jenkins
      57393a3 [MechCoder] undefined-variable
      a8e2547 [MechCoder] Minor changes
      7753810 [MechCoder] remove trailing whitespace
      75c5d2b [MechCoder] Remove blacklisted arguments and pointless statements check
      6bde250 [MechCoder] Disable all checks for now
      3464666 [MechCoder] Add pylint configuration file
      d28109f [MechCoder] [SPARK-8706] [PySpark] [Project infra] Add pylint checks to PySpark
      9b62e937
    • [SPARK-6797] [SPARKR] Add support for YARN cluster mode. · 7f487c8b
      Sun Rui authored
      This PR enables SparkR to dynamically ship the SparkR binary package to the AM node in YARN cluster mode, thus it is no longer required that the SparkR package be installed on each worker node.
      
      This PR uses the JDK jar tool to package the SparkR package, because jar is thought to be available on both Linux/Windows platforms where JDK has been installed.
      
      This PR does not address the R worker involved in RDD API. Will address it in a separate JIRA issue.
      
      This PR does not address SBT build. SparkR installation and packaging by SBT will be addressed in a separate JIRA issue.
      
      R/install-dev.bat is not tested. shivaram, could you help to test it?
      
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #6743 from sun-rui/SPARK-6797 and squashes the following commits:
      
      ca63c86 [Sun Rui] Adjust MimaExcludes after rebase.
      7313374 [Sun Rui] Fix unit test errors.
      72695fb [Sun Rui] Fix unit test failures.
      193882f [Sun Rui] Fix Mima test error.
      fe25a33 [Sun Rui] Fix Mima test error.
      35ecfa3 [Sun Rui] Fix comments.
      c38a005 [Sun Rui] Unzipped SparkR binary package is still required for standalone and Mesos modes.
      b05340c [Sun Rui] Fix scala style.
      2ca5048 [Sun Rui] Fix comments.
      1acefd1 [Sun Rui] Fix scala style.
      0aa1e97 [Sun Rui] Fix scala style.
      41d4f17 [Sun Rui] Add support for locating SparkR package for R workers required by RDD APIs.
      49ff948 [Sun Rui] Invoke jar.exe with full path in install-dev.bat.
      7b916c5 [Sun Rui] Use 'rem' consistently.
      3bed438 [Sun Rui] Add a comment.
      681afb0 [Sun Rui] Fix a bug that RRunner does not handle client deployment modes.
      cedfbe2 [Sun Rui] [SPARK-6797][SPARKR] Add support for YARN cluster mode.
      7f487c8b
    • [SPARK-8596] Add module for rstudio link to spark · a5bc803b
      Vincent D. Warmerdam authored
      shivaram, added module for rstudio install
      
      Author: Vincent D. Warmerdam <vincentwarmerdam@gmail.com>
      
      Closes #7366 from koaning/rstudio-install and squashes the following commits:
      
      e47c2da [Vincent D. Warmerdam] added rstudio module
      a5bc803b
    • [SPARK-8944][SQL] Support casting between IntervalType and StringType · 6b899438
      Wenchen Fan authored
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7355 from cloud-fan/fromString and squashes the following commits:
      
      3bbb9d6 [Wenchen Fan] fix code gen
      7dab957 [Wenchen Fan] naming fix
      0fbbe19 [Wenchen Fan] address comments
      ac1f3d1 [Wenchen Fan] Support casting between IntervalType and StringType
      6b899438
    • [SPARK-8203] [SPARK-8204] [SQL] conditional function: least/greatest · 92540d22
      Daoyuan Wang authored
      chenghao-intel zhichao-li qiansl127
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #6851 from adrian-wang/udflg and squashes the following commits:
      
      0f1bff2 [Daoyuan Wang] address comments from davis
      7a6bdbb [Daoyuan Wang] add '.' for hex()
      c1f6824 [Daoyuan Wang] add codegen, test for all types
      ec625b0 [Daoyuan Wang] conditional function: least/greatest
      92540d22
  3. Jul 12, 2015
    • [SPARK-9006] [PYSPARK] fix microsecond loss in Python 3 · 20b47433
      Davies Liu authored
      It may lose a microsecond when using the timestamp as a float; it should be an `int` instead.
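
The failure mode can be seen by comparing a float-based epoch conversion with pure integer arithmetic (a sketch of the problem, not the patched PySpark code):

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_micros_int(dt):
    # Exact: stay in integers all the way.
    delta = dt - EPOCH
    return (delta.days * 86_400_000_000
            + delta.seconds * 1_000_000
            + delta.microseconds)

def to_micros_float(dt):
    # Lossy: a double carries ~16 significant digits, and a modern epoch
    # in microseconds needs all of them, so truncating here can drop a
    # microsecond.
    return int(dt.timestamp() * 1e6)

dt = datetime(2015, 7, 12, 1, 2, 3, 4, tzinfo=timezone.utc)
micros = to_micros_int(dt)
# Integer arithmetic round-trips exactly:
assert EPOCH + timedelta(microseconds=micros) == dt
```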
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7363 from davies/fix_microsecond and squashes the following commits:
      
      36f6007 [Davies Liu] fix microsecond loss in Python 3
      20b47433
    • [SPARK-8880] Fix confusing Stage.attemptId member variable · 30090884
      Kay Ousterhout authored
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #7275 from kayousterhout/SPARK-8880 and squashes the following commits:
      
      3e9ce7c [Kay Ousterhout] Added missing return type
      e150278 [Kay Ousterhout] [SPARK-8880] Fix confusing Stage.attemptId member variable
      30090884
  4. Jul 11, 2015
  5. Jul 10, 2015
    • [SPARK-8994] [ML] tiny cleanups to Params, Pipeline · 0c5207c6
      Joseph K. Bradley authored
      Made default impl of Params.validateParams empty
      CC mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #7349 from jkbradley/pipeline-small-cleanups and squashes the following commits:
      
      4e0f013 [Joseph K. Bradley] small cleanups after SPARK-5956
      0c5207c6
    • [SPARK-6487] [MLLIB] Add sequential pattern mining algorithm PrefixSpan to Spark MLlib · 7f6be1f2
      zhangjiajin authored
      Add parallel PrefixSpan algorithm and test file.
      Support non-temporal sequences.
      
      Author: zhangjiajin <zhangjiajin@huawei.com>
      Author: zhang jiajin <zhangjiajin@huawei.com>
      
      Closes #7258 from zhangjiajin/master and squashes the following commits:
      
      ca9c4c8 [zhangjiajin] Modified the code according to the review comments.
      574e56c [zhangjiajin] Add new object LocalPrefixSpan, and do some optimization.
      ba5df34 [zhangjiajin] Fix a Scala style error.
      4c60fb3 [zhangjiajin] Fix some Scala style errors.
      1dd33ad [zhangjiajin] Modified the code according to the review comments.
      89bc368 [zhangjiajin] Fixed a Scala style error.
      a2eb14c [zhang jiajin] Delete PrefixspanSuite.scala
      951fd42 [zhang jiajin] Delete Prefixspan.scala
      575995f [zhangjiajin] Modified the code according to the review comments.
      91fd7e6 [zhangjiajin] Add new algorithm PrefixSpan and test file.
      7f6be1f2
    • [SPARK-8598] [MLLIB] Implementation of 1-sample, two-sided, Kolmogorov Smirnov Test for RDDs · 9c507577
      jose.cambronero authored
      This contribution is my original work and I license it to the project under its open source license.
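
The 1-sample, two-sided KS statistic itself is simple once the data is sorted (a local sketch of the statistic only, not the distributed RDD implementation this PR adds):

```python
def ks_statistic(data, cdf):
    """One-sample, two-sided Kolmogorov-Smirnov statistic: the largest
    gap between the empirical CDF of `data` and the theoretical `cdf`,
    checked just before and just after each sample point."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        fx = cdf(x)
        d = max(d, i / n - fx, fx - (i - 1) / n)
    return d

# Against the Uniform(0, 1) CDF F(x) = x:
assert abs(ks_statistic([0.1, 0.2, 0.3, 0.4], lambda x: x) - 0.6) < 1e-9
```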
      
      Author: jose.cambronero <jose.cambronero@cloudera.com>
      
      Closes #6994 from josepablocam/master and squashes the following commits:
      
      bbb30b1 [jose.cambronero] renamed KSTestResult to KolmogorovSmirnovTestResult, to stay consistent with method name
      0d0c201 [jose.cambronero] kstTest -> kolmogorovSmirnovTest in statistics.md
      1f56371 [jose.cambronero] changed ksTest in public API to kolmogorovSmirnovTest for clarity
      a48ae7b [jose.cambronero] refactor code to account for serializable RealDistribution. Reuse testOneSample( _, cdf)
      1bb44bd [jose.cambronero]  style and doc changes. Factored out ks test into 2 separate tests
      2ec2aa6 [jose.cambronero] initialize to stdnormal when no params passed (and log). Change unit tests to approximate equivalence rather than strict
      a4bc0c7 [jose.cambronero] changed ksTest(data, distName) to ksTest(data, distName, params*) after api discussions. Changed tests and docs accordingly
      7e66f57 [jose.cambronero] copied implementation note to public api docs, and added @see for links to wiki info
      e760ebd [jose.cambronero] line length changes to fit style check
      3288e42 [jose.cambronero] addressed style changes, correctness change to simpler approach, and fixed edge case for foldLeft in searchOneSampleCandidates when a partition is empty
      9026895 [jose.cambronero] addressed style changes, correctness change to simpler approach, and fixed edge case for foldLeft in searchOneSampleCandidates when a partition is empty
      1226b30 [jose.cambronero] reindent multi-line lambdas, prior intepretation of style guide was wrong on my part
      9c0f1af [jose.cambronero] additional style changes incorporated and added documentation to mllib statistics docs
      3f81ad2 [jose.cambronero] renamed ks1 sample test for clarity
      992293b [jose.cambronero] Style changes as per comments and added implementation note explaining the distributed approach.
      6a4784f [jose.cambronero] specified what distributions are available for the convenience method ksTest(data, name) (solely standard normal)
      4b8ba61 [jose.cambronero] fixed off by 1/N in cases when post-constant adjustment ecdf is above cdf, but prior to adj it was below
      0b5e8ec [jose.cambronero] changed KS one sample test to perform just 1 distributed pass (in addition to the sorting pass), operates on each partition separately. Implementation of Sandy Ryza's algorithm
      16b5c4c [jose.cambronero] renamed dat to data and eliminated recalc of RDD size by sharing as argument between empirical and evalOneSampleP
      c18dc66 [jose.cambronero] removed ksTestOpt from API and changed comments in HypothesisTestSuite accordingly
      f6951b6 [jose.cambronero] changed style and some comments based on feedback from pull request
      b9cff3a [jose.cambronero] made small changes to pass style check
      ce8e9a1 [jose.cambronero] added kstest testing in HypothesisTestSuite
      4da189b [jose.cambronero] added user facing ks test functions
      c659ea1 [jose.cambronero] created KS test class
      13dfe4d [jose.cambronero] created test result class for ks test
      9c507577
    • [SPARK-7735] [PYSPARK] Raise Exception on non-zero exit from pipe commands · 6e1c7e27
      Scott Taylor authored
      This will allow problems with piped commands to be detected.
      This will also allow tasks to be retried where errors are rare (such as network problems in piped commands).
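
The check being added is easy to demonstrate with the standard library: without inspecting the exit status, a failed command looks like an empty but successful result (a sketch of the idea, not PySpark's `rdd.pipe` itself):

```python
import subprocess

def pipe(records, command):
    """Pipe records through a shell command and raise if it exits
    non-zero, instead of silently returning partial output."""
    proc = subprocess.run(command, input="\n".join(records),
                          capture_output=True, text=True, shell=True)
    if proc.returncode != 0:
        raise RuntimeError(f"pipe command {command!r} exited "
                           f"with code {proc.returncode}")
    return proc.stdout.splitlines()

assert pipe(["a", "b"], "cat") == ["a", "b"]
try:
    pipe(["a"], "false")   # exits 1; without the check this would pass silently
    raise AssertionError("expected RuntimeError")
except RuntimeError:
    pass
```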
      
      Author: Scott Taylor <github@megatron.me.uk>
      
      Closes #6262 from megatron-me-uk/patch-2 and squashes the following commits:
      
      04ae1d5 [Scott Taylor] Remove spurious empty line
      98fa101 [Scott Taylor] fix blank line style error
      574b564 [Scott Taylor] Merge pull request #2 from megatron-me-uk/patch-4
      0c1e762 [Scott Taylor] Update rdd pipe method for checkCode
      ab9a2e1 [Scott Taylor] Update rdd pipe tests for checkCode
      eb4801c [Scott Taylor] fix fail_condition
      b0ac3a4 [Scott Taylor] Merge pull request #1 from megatron-me-uk/megatron-me-uk-patch-1
      a307d13 [Scott Taylor] update rdd tests to test pipe modes
      34fcdc3 [Scott Taylor] add optional argument 'mode' for rdd.pipe
      a0c0161 [Scott Taylor] fix generator issue
      8a9ef9c [Scott Taylor] make check_return_code an iterator
      0486ae3 [Scott Taylor] style fixes
      8ed89a6 [Scott Taylor] Chain generators to prevent potential deadlock
      4153b02 [Scott Taylor] fix list.sort returns None
      491d3fc [Scott Taylor] Pass a function handle to assertRaises
      3344a21 [Scott Taylor] wrap assertRaises with QuietTest
      3ab8c7a [Scott Taylor] remove whitespace for style
      cc1a73d [Scott Taylor] fix style issues in pipe test
      8db4073 [Scott Taylor] Add a test for rdd pipe functions
      1b3dc4e [Scott Taylor] fix missing space around operator style
      0974f98 [Scott Taylor] add space between words in multiline string
      45f4977 [Scott Taylor] fix line too long style error
      5745d85 [Scott Taylor] Remove space to fix style
      f552d49 [Scott Taylor] Catch non-zero exit from pipe commands
      6e1c7e27
    • [SPARK-8961] [SQL] Makes BaseWriterContainer.outputWriterForRow accepts InternalRow instead of Row · 33630883
      Cheng Lian authored
      This is a follow-up of [SPARK-8888] [1], which also aims to optimize writing dynamic partitions.
      
      Three more changes can be made here:
      
      1. Using `InternalRow` instead of `Row` in `BaseWriterContainer.outputWriterForRow`
      2. Using `Cast` expressions to convert partition columns to strings, so that we can leverage code generation.
      3. Replacing the FP-style `zip` and `map` calls with a faster imperative `while` loop.
      
      [1]: https://issues.apache.org/jira/browse/SPARK-8888
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #7331 from liancheng/spark-8961 and squashes the following commits:
      
      b5ab9ae [Cheng Lian] Casts Java iterator to Scala iterator explicitly
      719e63b [Cheng Lian] Makes BaseWriterContainer.outputWriterForRow accepts InternalRow instead of Row
      33630883
    • add inline comment for python tests · b6fc0adf
      Davies Liu authored
      b6fc0adf
    • [SPARK-8990] [SQL] SPARK-8990 DataFrameReader.parquet() should respect user specified options · 857e325f
      Cheng Lian authored
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #7347 from liancheng/spark-8990 and squashes the following commits:
      
      045698c [Cheng Lian] SPARK-8990 DataFrameReader.parquet() should respect user specified options
      857e325f
    • Josh Rosen's avatar
      [SPARK-7078] [SPARK-7079] Binary processing sort for Spark SQL · fb8807c9
      Josh Rosen authored
      This patch adds a cache-friendly external sorter which operates on serialized bytes and uses this sorter to implement a new sort operator for Spark SQL and DataFrames.
      
      ### Overview of the new sorter
      
      The new sorter design is inspired by [Alphasort](http://research.microsoft.com/pubs/68249/alphasort.doc) and implements a key-prefix optimization in order to improve the cache friendliness of the sort.  In naive sort implementations, the sorting algorithm operates on an array of record pointers.  To compare two records for ordering, the sorter must dereference these pointers, which likely involves random memory access, then compare the objects themselves.
      
      ![image](https://cloud.githubusercontent.com/assets/50748/8611390/3b1402ae-2675-11e5-8308-1a10bf347e6e.png)
      
In a key-prefix sort, the sort operates on an array which stores the record pointer alongside a prefix of the record's key. When comparing two records for ordering, the sorter first compares the stored key prefixes. If the ordering can be determined from the key prefixes (i.e. the prefixes are unequal), then the sort can avoid directly comparing the records, avoiding random memory accesses and full record comparisons. For example, if we're sorting a list of strings then we can store the first 8 bytes of the UTF-8 encoded string as the key-prefix and can perform unsigned byte-at-a-time comparisons to determine the ordering of strings based on their prefixes, only resorting to full comparisons for strings that share a common prefix.  In cases where the sort key can fit entirely in the space allotted for the key prefix (e.g. the sorting key is an integer), we completely avoid direct record comparison.
      
      In this patch's implementation of key-prefix sorting, our sorter's internal array stores a 64-bit long and 64-bit pointer for each record being sorted. The key prefixes are generated by the user when inserting records into the sorter, which uses a user-defined comparison function for comparing them.  The `PrefixComparators` object implements a set of comparators for many common types, including primitive numeric types and UTF-8 strings.
      
      The actual sorting is implemented by `UnsafeInMemorySorter`.  Most consumers will not use this directly, but instead will use `UnsafeExternalSorter`, a class which implements a sort that can spill to disk in response to memory pressure.  Internally, `UnsafeExternalSorter` creates `UnsafeInMemorySorters` to perform sorting and uses `UnsafeSortSpillReader/Writer` to spill and read back runs of sorted records and `UnsafeSortSpillMerger` to merge multiple sorted spills into a single sorted iterator.  This external sorter integrates with Spark's existing ShuffleMemoryManager for controlling spilling.
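The string-prefix trick described above can be sketched in a few lines: pack the first 8 UTF-8 bytes into a long (big-endian, zero-padded), so that a single unsigned long comparison agrees with lexicographic byte order. This is illustrative only; the method names are hypothetical and not Spark's actual `PrefixComparators` API.

```java
import java.nio.charset.StandardCharsets;

public class PrefixSortDemo {
    // Pack up to the first 8 bytes of a UTF-8 string into a long,
    // big-endian, zero-padding short strings, so unsigned comparison
    // of the longs matches lexicographic byte-order comparison.
    static long stringPrefix(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        long prefix = 0L;
        for (int i = 0; i < 8; i++) {
            prefix <<= 8;
            if (i < bytes.length) {
                prefix |= (bytes[i] & 0xFFL);
            }
        }
        return prefix;
    }

    // Compare by prefix first; fall back to a full record comparison
    // only when the prefixes are equal.
    static int compare(String a, String b) {
        int byPrefix = Long.compareUnsigned(stringPrefix(a), stringPrefix(b));
        return byPrefix != 0 ? byPrefix : a.compareTo(b);
    }

    public static void main(String[] args) {
        System.out.println(compare("apple", "banana") < 0);        // true, decided by prefix alone
        System.out.println(compare("abcdefghX", "abcdefghY") < 0); // true, prefixes equal -> full compare
    }
}
```

Note the unsigned comparison (`Long.compareUnsigned`) is essential: signed comparison would mis-order bytes ≥ 0x80.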
      
      Many parts of this sorter's design are based on / copied from the more specialized external sort implementation that I designed for the new UnsafeShuffleManager write path; see #5868 for more details on that patch.
      
      ### Sorting rows in Spark SQL
      
      For now, `UnsafeExternalSorter` is only used by Spark SQL, which uses it to implement a new sort operator, `UnsafeExternalSort`.  This sort operator uses a SQL-specific class called `UnsafeExternalRowSorter` that configures an `UnsafeExternalSorter` to use prefix generators and comparators that operate on rows encoded in the UnsafeRow format that was designed for Project Tungsten.
      
      I used some interesting unit-testing techniques to test this patch's SQL-specific components.  `UnsafeExternalSortSuite` uses the SQL random data generators introduced in #7176 to test the UnsafeSort operator with all atomic types both with and without nullability and in both ascending and descending sort orders.  `PrefixComparatorsSuite` contains a cool use of ScalaCheck + ScalaTest's `GeneratorDrivenPropertyChecks` in order to test UTF8String prefix comparison.
      
      ### Misc. additional improvements made in this patch
      
      This patch made several miscellaneous improvements to related code in Spark SQL:
      
- The logic for selecting physical sort operator implementations, which was partially duplicated in both `Exchange` and `SparkStrategies`, has now been consolidated into a `getSortOperator()` helper function in `SparkStrategies`.
      - The `SparkPlanTest` unit testing helper trait has been extended with new methods for comparing the output produced by two different physical plans. This makes it easy to write tests which assert that two physical operator implementations should produce the same output.  I also added a method for disabling the implicit sorting of outputs prior to comparing them, a change which is necessary in order to be able to write proper SparkPlan tests for sort operators.
      
      ### Tasks deferred to followup patches
      
      While most of this patch's features are reasonably well-tested and complete, there are a number of tasks that are intentionally being deferred to followup patches:
      
      - Add tests which mock the ShuffleMemoryManager to check that memory pressure properly triggers spilling (there are examples of this type of test in #5868).
      - Add tests to ensure that spill files are properly cleaned up after errors.  I'd like to do this in the context of a patch which introduces more general metrics for ensuring proper cleanup of tasks' temporary files; see https://issues.apache.org/jira/browse/SPARK-8966 for more details.
      - Metrics integration: there are some open questions regarding how to track / report spill metrics for non-shuffle operations, so I've deferred most of the IO / shuffle metrics integration for now.
      - Performance profiling.
      
      <!-- Reviewable:start -->
      [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/6444)
      <!-- Reviewable:end -->
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #6444 from JoshRosen/sql-external-sort and squashes the following commits:
      
      6beb467 [Josh Rosen] Remove a bunch of overloaded methods to avoid default args. issue
      2bbac9c [Josh Rosen] Merge remote-tracking branch 'origin/master' into sql-external-sort
      35dad9f [Josh Rosen] Make sortAnswers = false the default in SparkPlanTest
      5135200 [Josh Rosen] Fix spill reading for large rows; add test
      2f48777 [Josh Rosen] Add test and fix bug for sorting empty arrays
      d1e28bc [Josh Rosen] Merge remote-tracking branch 'origin/master' into sql-external-sort
      cd05866 [Josh Rosen] Fix scalastyle
      3947fc1 [Josh Rosen] Merge remote-tracking branch 'origin/master' into sql-external-sort
      d13ac55 [Josh Rosen] Hacky approach to copying of UnsafeRows for sort followed by limit.
      845bea3 [Josh Rosen] Remove unnecessary zeroing of row conversion buffer
      c56ec18 [Josh Rosen] Clean up final row copying code.
      d31f180 [Josh Rosen] Re-enable NullType sorting test now that SPARK-8868 is fixed
      844f4ca [Josh Rosen] Merge remote-tracking branch 'origin/master' into sql-external-sort
      293f109 [Josh Rosen] Add missing license header.
      f99a612 [Josh Rosen] Fix bugs in string prefix comparison.
      9d00afc [Josh Rosen] Clean up prefix comparators for integral types
      88aff18 [Josh Rosen] NULL_PREFIX has to be negative infinity for floating point types
      613e16f [Josh Rosen] Test with larger data.
      1d7ffaa [Josh Rosen] Somewhat hacky fix for descending sorts
      08701e7 [Josh Rosen] Fix prefix comparison of null primitives.
      b86e684 [Josh Rosen] Set global = true in UnsafeExternalSortSuite.
      1c7bad8 [Josh Rosen] Make sorting of answers explicit in SparkPlanTest.checkAnswer().
      b81a920 [Josh Rosen] Temporarily enable only the passing sort tests
      5d6109d [Josh Rosen] Fix inconsistent handling / encoding of record lengths.
      87b6ed9 [Josh Rosen] Fix critical issues in test which led to false negatives.
      8d7fbe7 [Josh Rosen] Fixes to multiple spilling-related bugs.
      82e21c1 [Josh Rosen] Force spilling in UnsafeExternalSortSuite.
      88b72db [Josh Rosen] Test ascending and descending sort orders.
      f27be09 [Josh Rosen] Fix tests by binding attributes.
      0a79d39 [Josh Rosen] Revert "Undo part of a SparkPlanTest change in #7162 that broke my test."
      7c3c864 [Josh Rosen] Undo part of a SparkPlanTest change in #7162 that broke my test.
      9969c14 [Josh Rosen] Merge remote-tracking branch 'origin/master' into sql-external-sort
      5822e6f [Josh Rosen] Fix test compilation issue
      939f824 [Josh Rosen] Remove code gen experiment.
      0dfe919 [Josh Rosen] Implement prefix sort for strings (albeit inefficiently).
      66a813e [Josh Rosen] Prefix comparators for float and double
      b310c88 [Josh Rosen] Integrate prefix comparators for Int and Long (others coming soon)
      95058d9 [Josh Rosen] Add missing SortPrefixUtils file
      4c37ba6 [Josh Rosen] Add tests for sorting on all primitive types.
      6890863 [Josh Rosen] Fix memory leak on empty inputs.
      d246e29 [Josh Rosen] Fix consideration of column types when choosing sort implementation.
      6b156fb [Josh Rosen] Some WIP work on prefix comparison.
      7f875f9 [Josh Rosen] Commit failing test demonstrating bug in handling objects in spills
      41b8881 [Josh Rosen] Get UnsafeInMemorySorterSuite to pass (WIP)
      90c2b6a [Josh Rosen] Update test name
      6d6a1e6 [Josh Rosen] Centralize logic for picking sort operator implementations
      9869ec2 [Josh Rosen] Clean up Exchange code a bit
      82bb0ec [Josh Rosen] Fix IntelliJ complaint due to negated if condition
      1db845a [Josh Rosen] Many more changes to harmonize with shuffle sorter
      ebf9eea [Josh Rosen] Harmonization with shuffle's unsafe sorter
      206bfa2 [Josh Rosen] Add some missing newlines at the ends of files
      26c8931 [Josh Rosen] Back out some Hive changes that aren't needed anymore
      62f0bb8 [Josh Rosen] Update to reflect SparkPlanTest changes
      21d7d93 [Josh Rosen] Back out of BlockObjectWriter change
      7eafecf [Josh Rosen] Port test to SparkPlanTest
      d468a88 [Josh Rosen] Update for InternalRow refactoring
      269cf86 [Josh Rosen] Back out SMJ operator change; isolate changes to selection of sort op.
      1b841ca [Josh Rosen] WIP towards copying
      b420a71 [Josh Rosen] Move most of the existing SMJ code into Java.
      dfdb93f [Josh Rosen] SparkFunSuite change
      73cc761 [Josh Rosen] Fix whitespace
      9cc98f5 [Josh Rosen] Move more code to Java; fix bugs in UnsafeRowConverter length type.
      c8792de [Josh Rosen] Remove some debug logging
      dda6752 [Josh Rosen] Commit some missing code from an old git stash.
      58f36d0 [Josh Rosen] Merge in a sketch of a unit test for the new sorter (now failing).
      2bd8c9a [Josh Rosen] Import my original tests and get them to pass.
      d5d3106 [Josh Rosen] WIP towards external sorter for Spark SQL.
      fb8807c9
    • rahulpalamuttam's avatar
      [SPARK-8923] [DOCUMENTATION, MLLIB] Add @since tags to mllib.fpm · 0772026c
      rahulpalamuttam authored
      Author: rahulpalamuttam <rahulpalamut@gmail.com>
      
      Closes #7341 from rahulpalamuttam/TaggingMLlibfpm and squashes the following commits:
      
      bef2843 [rahulpalamuttam] fix @since tags in mmlib.fpm
      cd86252 [rahulpalamuttam] Add @since tags to mllib.fpm
      0772026c
    • Davies Liu's avatar
      [HOTFIX] fix flaky test in PySpark SQL · 05ac023d
      Davies Liu authored
It may lose precision in microseconds when using float for it.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7344 from davies/fix_date_test and squashes the following commits:
      
      249ec61 [Davies Liu] fix flaky test
      05ac023d
    • Min Zhou's avatar
      [SPARK-8675] Executors created by LocalBackend won't get the same classpath as... · c185f3a4
      Min Zhou authored
      [SPARK-8675] Executors created by LocalBackend won't get the same classpath as other executor backends
      
AFAIK, some Spark applications always use LocalBackend for local execution; Spark SQL is an example. Starting a LocalEndpoint won't add the user classpath to the executor.
```scala
        override def start() {
          localEndpoint = SparkEnv.get.rpcEnv.setupEndpoint(
            "LocalBackendEndpoint", new LocalEndpoint(SparkEnv.get.rpcEnv, scheduler, this, totalCores))
        }
      ```
This causes the local executor to fail in scenarios such as loading Hadoop's built-in native libraries, loading other user-defined native libraries, loading user jars, reading S3 config from a site.xml file, etc.
      
      Author: Min Zhou <coderplay@gmail.com>
      
      Closes #7091 from coderplay/master and squashes the following commits:
      
      365838f [Min Zhou] Fixed java.net.MalformedURLException, add default scheme, support relative path
      d215b7f [Min Zhou] Follows spark standard scala style, make the auto testing happy
      84ad2cd [Min Zhou] Use system specific path separator instead of ','
      01f5d1a [Min Zhou] Merge branch 'master' of https://github.com/apache/spark
      e528be7 [Min Zhou] Merge branch 'master' of https://github.com/apache/spark
      45bf62c [Min Zhou] SPARK-8675 Executors created by LocalBackend won't get the same classpath as other executor backends
      c185f3a4