  2. Jan 15, 2015
    • [SPARK-5274][SQL] Reconcile Java and Scala UDFRegistration. · 1881431d
      Reynold Xin authored
      As part of SPARK-5193:
      
      1. Removed UDFRegistration as a mixin in SQLContext and made it a field ("udf").
      2. For Java UDFs, renamed dataType to returnType.
      3. For Scala UDFs, added type tags.
      4. Added all Java UDF registration methods to Scala's UDFRegistration.
      5. Documentation
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4056 from rxin/udf-registration and squashes the following commits:
      
      ae9c556 [Reynold Xin] Updated example.
      675a3c9 [Reynold Xin] Style fix
      47c24ff [Reynold Xin] Python fix.
      5f00c45 [Reynold Xin] Restore data type position in java udf and added typetags.
      032f006 [Reynold Xin] [SPARK-5193][SQL] Reconcile Java and Scala UDFRegistration.
      1881431d
  3. Jan 12, 2015
    • [SPARK-5138][SQL] Ensure schema can be inferred from a namedtuple · 1e42e96e
      Gabe Mulley authored
      When attempting to infer the schema of an RDD that contains namedtuples, PySpark fails to identify the records as namedtuples and raises an error.
      
      Example:
      
      ```python
      from pyspark import SparkContext
      from pyspark.sql import SQLContext
      from collections import namedtuple
      import os
      
      sc = SparkContext()
      rdd = sc.textFile(os.path.join(os.getenv('SPARK_HOME'), 'README.md'))
      TextLine = namedtuple('TextLine', 'line length')
      tuple_rdd = rdd.map(lambda l: TextLine(line=l, length=len(l)))
      tuple_rdd.take(5)  # This works
      
      sqlc = SQLContext(sc)
      
      # The following line raises an error
      schema_rdd = sqlc.inferSchema(tuple_rdd)
      ```
      
      The error raised is:
      ```
        File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 107, in main
          process()
        File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/worker.py", line 98, in process
          serializer.dump_stream(func(split_index, iterator), outfile)
        File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/serializers.py", line 227, in dump_stream
          vs = list(itertools.islice(iterator, batch))
        File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/rdd.py", line 1107, in takeUpToNumLeft
          yield next(iterator)
        File "/opt/spark-1.2.0-bin-hadoop2.4/python/pyspark/sql.py", line 816, in convert_struct
          raise ValueError("unexpected tuple: %s" % obj)
      TypeError: not all arguments converted during string formatting
      ```
      
      Author: Gabe Mulley <gabe@edx.org>
      
      Closes #3978 from mulby/inferschema-namedtuple and squashes the following commits:
      
      98c61cc [Gabe Mulley] Ensure exception message is populated correctly
      375d96b [Gabe Mulley] Ensure schema can be inferred from a namedtuple
      1e42e96e
  4. Dec 27, 2014
    • [SPARK-4501][Core] - Create build/mvn to automatically download maven/zinc/scalac · a3e51cc9
      Brennon York authored
      Creates a top-level directory script (as `build/mvn`) to automatically download zinc and the specific version of Scala used to build Spark. It will also download and install Maven if the user doesn't already have it, and all packages are kept under the `build/` directory. Tested on both Linux and OS X, and both work. All commands pass through to the Maven binary, so it acts exactly as a traditional Maven call would.
      
      Author: Brennon York <brennon.york@capitalone.com>
      
      Closes #3707 from brennonyork/SPARK-4501 and squashes the following commits:
      
      0e5a0e4 [Brennon York] minor incorrect doc verbage (with -> this)
      9b79e38 [Brennon York] fixed merge conflicts with dev/run-tests, properly quoted args in sbt/sbt, fixed bug where relative paths would fail if passed in from build/mvn
      d2d41b6 [Brennon York] added blurb about leverging zinc with build/mvn
      b979c58 [Brennon York] updated the merge conflict
      c5634de [Brennon York] updated documentation to overview build/mvn, updated all points where sbt/sbt was referenced with build/sbt
      b8437ba [Brennon York] set progress bars for curl and wget when not run on jenkins, no progress bar when run on jenkins, moved sbt script to build/sbt, wrote stub and warning under sbt/sbt which calls build/sbt, modified build/sbt to use the correct directory, fixed bug in build/sbt-launch-lib.bash to correctly pull the sbt version
      be11317 [Brennon York] added switch to silence download progress only if AMPLAB_JENKINS is set
      28d0a99 [Brennon York] updated to remove the python dependency, uses grep instead
      7e785a6 [Brennon York] added silent and quiet flags to curl and wget respectively, added single echo output to denote start of a download if download is needed
      14a5da0 [Brennon York] removed unnecessary zinc output on startup
      1af4a94 [Brennon York] fixed bug with uppercase vs lowercase variable
      3e8b9b3 [Brennon York] updated to properly only restart zinc if it was freshly installed
      a680d12 [Brennon York] Added comments to functions and tested various mvn calls
      bb8cc9d [Brennon York] removed package files
      ef017e6 [Brennon York] removed OS complexities, setup generic install_app call, removed extra file complexities, removed help, removed forced install (defaults now), removed double-dash from cli
      07bf018 [Brennon York] Updated to specifically handle pulling down the correct scala version
      f914dea [Brennon York] Beginning final portions of localized scala home
      69c4e44 [Brennon York] working linux and osx installers for purely local mvn build
      4a1609c [Brennon York] finalizing working linux install for maven to local ./build/apache-maven folder
      cbfcc68 [Brennon York] Changed the default sbt/sbt to build/sbt and added a build/mvn which will automatically download, install, and execute maven with zinc for easier build capability
      a3e51cc9
  5. Dec 23, 2014
    • [SPARK-4860][pyspark][sql] speeding up `sample()` and `takeSample()` · fd41eb95
      jbencook authored
      This PR modifies the Python `SchemaRDD` to use `sample()` and `takeSample()` from Scala instead of the slower Python implementations from `rdd.py`. This is worthwhile because the `Row`s are already serialized as Java objects.
      
      In order to use the faster `takeSample()`, a `takeSampleToPython()` method was implemented in `SchemaRDD.scala` following the pattern of `collectToPython()`.
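
      A minimal usage sketch against the PySpark API of this era (the data and sampling parameters below are made up for illustration):

      ```python
      from pyspark import SparkContext
      from pyspark.sql import SQLContext, Row

      sc = SparkContext()
      sqlContext = SQLContext(sc)
      people = sqlContext.inferSchema(
          sc.parallelize([Row(name="p%d" % i, age=i) for i in range(1000)]))

      # Both calls now delegate to the Scala/Java implementations instead of
      # re-sampling already-serialized Rows in Python.
      subset = people.sample(False, 0.1, seed=42)   # returns another SchemaRDD
      rows = people.takeSample(False, 5, seed=42)   # returns a list of Rows
      ```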
      
      Author: jbencook <jbenjamincook@gmail.com>
      Author: J. Benjamin Cook <jbenjamincook@gmail.com>
      
      Closes #3764 from jbencook/master and squashes the following commits:
      
      6fbc769 [J. Benjamin Cook] [SPARK-4860][pyspark][sql] fixing sloppy indentation for takeSampleToPython() arguments
      5170da2 [J. Benjamin Cook] [SPARK-4860][pyspark][sql] fixing typo: from RDD to SchemaRDD
      de22f70 [jbencook] [SPARK-4860][pyspark][sql] using sample() method from JavaSchemaRDD
      b916442 [jbencook] [SPARK-4860][pyspark][sql] adding sample() to JavaSchemaRDD
      020cbdf [jbencook] [SPARK-4860][pyspark][sql] using Scala implementations of `sample()` and `takeSample()`
      fd41eb95
  7. Dec 16, 2014
    • [SPARK-4866] support StructType as key in MapType · ec5c4279
      Davies Liu authored
      This PR brings support for using StructType (and other hashable types) as the key type in MapType.
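
      A rough sketch of what this enables from PySpark (the schema layout and data are hypothetical, and the exact Python-side conversion of struct keys may vary by version):

      ```python
      from pyspark import SparkContext
      from pyspark.sql import SQLContext, StructType, StructField, StringType, IntegerType, MapType

      sc = SparkContext()
      sqlContext = SQLContext(sc)

      # A map column whose keys are structs rather than primitives.
      key_type = StructType([StructField("lang", StringType(), False),
                             StructField("major", IntegerType(), False)])
      schema = StructType([StructField("counts", MapType(key_type, IntegerType()), True)])

      rows = sc.parallelize([({("python", 2): 10, ("python", 3): 7},)])
      srdd = sqlContext.applySchema(rows, schema)
      srdd.collect()
      ```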
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3714 from davies/fix_struct_in_map and squashes the following commits:
      
      68585d7 [Davies Liu] fix primitive types in MapType
      9601534 [Davies Liu] support StructType as key in MapType
      ec5c4279
  8. Nov 24, 2014
    • [SPARK-4578] fix asDict() with nested Row() · 050616b4
      Davies Liu authored
      The Row object is created on the fly once a field is accessed, so the fields should be accessed via getattr() in asDict().
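
      A small reproduction sketch (field names are made up); after the fix, asDict() reads each field through getattr(), so the lazily-built nested Row is picked up correctly:

      ```python
      from pyspark import SparkContext
      from pyspark.sql import SQLContext, Row

      sc = SparkContext()
      sqlContext = SQLContext(sc)
      srdd = sqlContext.inferSchema(
          sc.parallelize([Row(name="Alice", address=Row(city="SF", zip="94103"))]))

      row = srdd.first()
      # The nested 'address' Row is materialized on first access, so asDict() must
      # go through getattr() rather than the instance __dict__.
      print(row.asDict())
      ```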
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3434 from davies/fix_asDict and squashes the following commits:
      
      b20f1e7 [Davies Liu] fix asDict() with nested Row()
      050616b4
  9. Nov 20, 2014
    • [SPARK-4228][SQL] SchemaRDD to JSON · b8e6886f
      Dan McClary authored
      Here's a simple fix for SchemaRDD to JSON.
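
      A brief usage sketch of the new method from PySpark (output formatting is approximate):

      ```python
      from pyspark import SparkContext
      from pyspark.sql import SQLContext

      sc = SparkContext()
      sqlContext = SQLContext(sc)

      srdd = sqlContext.jsonRDD(sc.parallelize(['{"name": "Alice", "age": 1}']))
      # toJSON() returns an RDD of JSON strings, one per row.
      print(srdd.toJSON().collect())
      ```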
      
      Author: Dan McClary <dan.mcclary@gmail.com>
      
      Closes #3213 from dwmclary/SPARK-4228 and squashes the following commits:
      
      d714e1d [Dan McClary] fixed PEP 8 error
      cac2879 [Dan McClary] move pyspark comment and doctest to correct location
      f9471d3 [Dan McClary] added pyspark doc and doctest
      6598cee [Dan McClary] adding complex type queries
      1a5fd30 [Dan McClary] removing SPARK-4228 from SQLQuerySuite
      4a651f0 [Dan McClary] cleaned PEP and Scala style failures.  Moved tests to JsonSuite
      47ceff6 [Dan McClary] cleaned up scala style issues
      2ee1e70 [Dan McClary] moved rowToJSON to JsonRDD
      4387dd5 [Dan McClary] Added UserDefinedType, cleaned up case formatting
      8f7bfb6 [Dan McClary] Map type added to SchemaRDD.toJSON
      1b11980 [Dan McClary] Map and UserDefinedTypes partially done
      11d2016 [Dan McClary] formatting and unicode deserialization default fixed
      6af72d1 [Dan McClary] deleted extaneous comment
      4d11c0c [Dan McClary] JsonFactory rewrite of toJSON for SchemaRDD
      149dafd [Dan McClary] wrapped scala toJSON in sql.py
      5e5eb1b [Dan McClary] switched to Jackson for JSON processing
      6c94a54 [Dan McClary] added toJSON to pyspark SchemaRDD
      aaeba58 [Dan McClary] added toJSON to pyspark SchemaRDD
      1d171aa [Dan McClary] upated missing brace on if statement
      319e3ba [Dan McClary] updated to upstream master with merged SPARK-4228
      424f130 [Dan McClary] tests pass, ready for pull and PR
      626a5b1 [Dan McClary] added toJSON to SchemaRDD
      f7d166a [Dan McClary] added toJSON method
      5d34e37 [Dan McClary] merge resolved
      d6d19e9 [Dan McClary] pr example
      b8e6886f
  10. Nov 04, 2014
    • [SPARK-3886] [PySpark] simplify serializer, use AutoBatchedSerializer by default. · e4f42631
      Davies Liu authored
      This PR simplifies the serializer by always using a batched serializer (AutoBatchedSerializer by default), even when the batch size is 1.
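
      Not Spark's actual code, just an illustrative sketch of the auto-batching idea (the threshold and growth policy here are assumptions): start with single-object batches and grow the batch while the pickled chunks stay small.

      ```python
      # Illustrative toy auto-batching loop, not Spark's AutoBatchedSerializer.
      import cPickle as pickle  # Python 2, matching this era

      def auto_batched_chunks(items, target_bytes=1 << 16):
          batch, limit = [], 1          # start with a batch of one object
          for obj in items:
              batch.append(obj)
              if len(batch) >= limit:
                  chunk = pickle.dumps(batch, 2)
                  yield chunk
                  if len(chunk) < target_bytes:
                      limit *= 2        # grow while serialized chunks stay small
                  batch = []
          if batch:
              yield pickle.dumps(batch, 2)
      ```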
      
      Author: Davies Liu <davies@databricks.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Josh Rosen <joshrosen@databricks.com>
      
      Closes #2920 from davies/fix_autobatch and squashes the following commits:
      
      e544ef9 [Davies Liu] revert unrelated change
      6880b14 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      1d557fc [Davies Liu] fix tests
      8180907 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      76abdce [Davies Liu] clean up
      53fa60b [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      d7ac751 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      2cc2497 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      b4292ce [Davies Liu] fix bug in master
      d79744c [Davies Liu] recover hive tests
      be37ece [Davies Liu] refactor
      eb3938d [Davies Liu] refactor serializer in scala
      8d77ef2 [Davies Liu] simplify serializer, use AutoBatchedSerializer by default.
      e4f42631
  11. Nov 03, 2014
    • [SPARK-4192][SQL] Internal API for Python UDT · 04450d11
      Xiangrui Meng authored
      Following #2919, this PR adds Python UDT (for internal use only) with tests under "pyspark.tests". Before `SQLContext.applySchema`, we check whether we need to convert user-type instances into SQL recognizable data. In the current implementation, a Python UDT must be paired with a Scala UDT for serialization on the JVM side. A following PR will add VectorUDT in MLlib for both Scala and Python.
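
      A rough sketch of the internal UDT shape described here, modeled loosely on the test UDT; the class names and the Scala class path are hypothetical, and the exact method set may differ by version:

      ```python
      from pyspark.sql import UserDefinedType, ArrayType, DoubleType

      class Point(object):
          """A user type stored in SQL via its paired UDT."""
          def __init__(self, x, y):
              self.x, self.y = x, y

      class PointUDT(UserDefinedType):
          @classmethod
          def sqlType(cls):
              return ArrayType(DoubleType(), False)   # SQL-side storage representation
          @classmethod
          def module(cls):
              return "__main__"                       # module where the Python class lives
          @classmethod
          def scalaUDT(cls):
              return "org.example.PointUDT"           # hypothetical paired Scala UDT
          def serialize(self, obj):
              return [obj.x, obj.y]
          def deserialize(self, datum):
              return Point(datum[0], datum[1])

      Point.__UDT__ = PointUDT()
      ```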
      
      marmbrus jkbradley davies
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3068 from mengxr/SPARK-4192-sql and squashes the following commits:
      
      acff637 [Xiangrui Meng] merge master
      dba5ea7 [Xiangrui Meng] only use pyClass for Python UDT output sqlType as well
      2c9d7e4 [Xiangrui Meng] move import to global setup; update needsConversion
      7c4a6a9 [Xiangrui Meng] address comments
      75223db [Xiangrui Meng] minor update
      f740379 [Xiangrui Meng] remove UDT from default imports
      e98d9d0 [Xiangrui Meng] fix py style
      4e84fce [Xiangrui Meng] remove local hive tests and add more tests
      39f19e0 [Xiangrui Meng] add tests
      b7f666d [Xiangrui Meng] add Python UDT
      04450d11
    • [SPARK-3594] [PySpark] [SQL] take more rows to infer schema or sampling · 24544fbc
      Davies Liu authored
      This patch tries to infer the schema for an RDD that has empty values (None, [], {}) in the first row. It looks at the first 100 rows, merges their types into a schema, and also merges the fields of StructTypes together. If there is still a NullType in the schema, it shows a warning telling the user to try again with sampling.

      If a sampling ratio is provided, it infers the schema from all the sampled rows.

      Also, add samplingRatio to jsonFile() and jsonRDD().
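
      A small sketch of the resulting API (argument names follow the description above; exact defaults may vary):

      ```python
      from pyspark import SparkContext
      from pyspark.sql import SQLContext, Row

      sc = SparkContext()
      sqlContext = SQLContext(sc)

      # The first row has empty values; types are merged across the first rows,
      # or inferred from a random sample when samplingRatio is given.
      rdd = sc.parallelize([Row(a=None, b=[]), Row(a=1, b=[2, 3])])
      srdd = sqlContext.inferSchema(rdd, samplingRatio=0.5)

      json_srdd = sqlContext.jsonRDD(sc.parallelize(['{"a": 1}', '{"a": 2, "b": "x"}']),
                                     samplingRatio=0.5)
      ```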
      
      Author: Davies Liu <davies.liu@gmail.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #2716 from davies/infer and squashes the following commits:
      
      e678f6d [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
      34b5c63 [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
      567dc60 [Davies Liu] update docs
      9767b27 [Davies Liu] Merge branch 'master' into infer
      e48d7fb [Davies Liu] fix tests
      29e94d5 [Davies Liu] let NullType inherit from PrimitiveType
      ee5d524 [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
      540d1d5 [Davies Liu] merge fields for StructType
      f93fd84 [Davies Liu] add more tests
      3603e00 [Davies Liu] take more rows to infer schema, or infer the schema by sampling the RDD
      24544fbc
  12. Nov 01, 2014
    • [SPARK-3930] [SPARK-3933] Support fixed-precision decimal in SQL, and some optimizations · 23f966f4
      Matei Zaharia authored
      - Adds optional precision and scale to Spark SQL's decimal type, which behave similarly to those in Hive 13 (https://cwiki.apache.org/confluence/download/attachments/27362075/Hive_Decimal_Precision_Scale_Support.pdf)
      - Replaces our internal representation of decimals with a Decimal class that can store small values in a mutable Long, saving memory in this situation and letting some operations happen directly on Longs
      
      This is still marked WIP because there are a few TODOs, but I'll remove that tag when done.
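
      On the PySpark side this surfaces roughly as follows (a sketch; the constructor arguments are assumed from the description above):

      ```python
      from decimal import Decimal
      from pyspark import SparkContext
      from pyspark.sql import SQLContext, StructType, StructField, DecimalType

      sc = SparkContext()
      sqlContext = SQLContext(sc)

      # Optional precision/scale, Hive 13 style; omit both for an unlimited decimal.
      schema = StructType([StructField("price", DecimalType(10, 2), True)])
      srdd = sqlContext.applySchema(sc.parallelize([(Decimal("19.99"),)]), schema)
      srdd.collect()
      ```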
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #2983 from mateiz/decimal-1 and squashes the following commits:
      
      35e6b02 [Matei Zaharia] Fix issues after merge
      227f24a [Matei Zaharia] Review comments
      31f915e [Matei Zaharia] Implement Davies's suggestions in Python
      eb84820 [Matei Zaharia] Support reading/writing decimals as fixed-length binary in Parquet
      4dc6bae [Matei Zaharia] Fix decimal support in PySpark
      d1d9d68 [Matei Zaharia] Fix compile error and test issues after rebase
      b28933d [Matei Zaharia] Support decimal precision/scale in Hive metastore
      2118c0d [Matei Zaharia] Some test and bug fixes
      81db9cb [Matei Zaharia] Added mutable Decimal that will be more efficient for small precisions
      7af0c3b [Matei Zaharia] Add optional precision and scale to DecimalType, but use Unlimited for now
      ec0a947 [Matei Zaharia] Make the result of AVG on Decimals be Decimal, not Double
      23f966f4
    • [SPARK-3569][SQL] Add metadata field to StructField · 1d4f3552
      Xiangrui Meng authored
      Add `metadata: Metadata` to `StructField` to store extra information of columns. `Metadata` is a simple wrapper over `Map[String, Any]` with value types restricted to Boolean, Long, Double, String, Metadata, and arrays of those types. SerDe is via JSON.
      
      Metadata is preserved through simple operations like `SELECT`.
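
      A minimal PySpark sketch of the new field (the metadata key is made up):

      ```python
      from pyspark.sql import StructType, StructField, StringType, IntegerType

      schema = StructType([
          StructField("name", StringType(), True),
          # The fourth argument carries extra column information; it defaults to None.
          StructField("age", IntegerType(), True, {"comment": "age in years"}),
      ])
      schema.fields[1].metadata["comment"]   # -> 'age in years'
      ```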
      
      marmbrus liancheng
      
      Author: Xiangrui Meng <meng@databricks.com>
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2701 from mengxr/structfield-metadata and squashes the following commits:
      
      dedda56 [Xiangrui Meng] merge remote
      5ef930a [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into structfield-metadata
      c35203f [Xiangrui Meng] Merge pull request #1 from marmbrus/pr/2701
      886b85c [Michael Armbrust] Expose Metadata and MetadataBuilder through the public scala and java packages.
      589f314 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into structfield-metadata
      1e2abcf [Xiangrui Meng] change default value of metadata to None in python
      611d3c2 [Xiangrui Meng] move metadata from Expr to NamedExpr
      ddfcfad [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into structfield-metadata
      a438440 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into structfield-metadata
      4266f4d [Xiangrui Meng] add StructField.toString back for backward compatibility
      3f49aab [Xiangrui Meng] remove StructField.toString
      24a9f80 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into structfield-metadata
      473a7c5 [Xiangrui Meng] merge master
      c9d7301 [Xiangrui Meng] organize imports
      1fcbf13 [Xiangrui Meng] change metadata type in StructField for Scala/Java
      60cc131 [Xiangrui Meng] add doc and header
      60614c7 [Xiangrui Meng] add metadata
      e42c452 [Xiangrui Meng] merge master
      93518fb [Xiangrui Meng] support metadata in python
      905bb89 [Xiangrui Meng] java conversions
      618e349 [Xiangrui Meng] make tests work in scala
      61b8e0f [Xiangrui Meng] merge master
      7e5a322 [Xiangrui Meng] do not output metadata in StructField.toString
      c41a664 [Xiangrui Meng] merge master
      d8af0ed [Xiangrui Meng] move tests to SQLQuerySuite
      67fdebb [Xiangrui Meng] add test on join
      d65072e [Xiangrui Meng] remove Map.empty
      367d237 [Xiangrui Meng] add test
      c194d5e [Xiangrui Meng] add metadata field to StructField and Attribute
      1d4f3552
  13. Oct 31, 2014
    • [SPARK-3826][SQL]enable hive-thriftserver to support hive-0.13.1 · 7c41d135
      wangfei authored
      In #2241 hive-thriftserver is not enabled. This patch enables hive-thriftserver to support hive-0.13.1 by using a shim layer, following the approach in #2241.

      1. A light shim layer (code in sql/hive-thriftserver/hive-version) for each hive version to handle API compatibility

      2. New pom profiles "hive-default" and "hive-versions" (copied from #2241) to activate the different hive versions

      3. SBT commands for the different versions are as follows:
         hive-0.12.0 --- sbt/sbt -Phive,hadoop-2.3 -Phive-0.12.0 assembly
         hive-0.13.1 --- sbt/sbt -Phive,hadoop-2.3 -Phive-0.13.1 assembly

      4. Since hive-thriftserver depends on the hive subproject, this patch should be merged with #2241 to enable hive-0.13.1 for hive-thriftserver
      
      Author: wangfei <wangfei1@huawei.com>
      Author: scwf <wangfei1@huawei.com>
      
      Closes #2685 from scwf/shim-thriftserver1 and squashes the following commits:
      
      f26f3be [wangfei] remove clean to save time
      f5cac74 [wangfei] remove local hivecontext test
      578234d [wangfei] use new shaded hive
      18fb1ff [wangfei] exclude kryo in hive pom
      fa21d09 [wangfei] clean package assembly/assembly
      8a4daf2 [wangfei] minor fix
      0d7f6cf [wangfei] address comments
      f7c93ae [wangfei] adding build with hive 0.13 before running tests
      bcf943f [wangfei] Merge branch 'master' of https://github.com/apache/spark into shim-thriftserver1
      c359822 [wangfei] reuse getCommandProcessor in hiveshim
      52674a4 [scwf] sql/hive included since examples depend on it
      3529e98 [scwf] move hive module to hive profile
      f51ff4e [wangfei] update and fix conflicts
      f48d3a5 [scwf] Merge branch 'master' of https://github.com/apache/spark into shim-thriftserver1
      41f727b [scwf] revert pom changes
      13afde0 [scwf] fix small bug
      4b681f4 [scwf] enable thriftserver in profile hive-0.13.1
      0bc53aa [scwf] fixed when result filed is null
      dfd1c63 [scwf] update run-tests to run hive-0.12.0 default now
      c6da3ce [scwf] Merge branch 'master' of https://github.com/apache/spark into shim-thriftserver
      7c66b8e [scwf] update pom according spark-2706
      ae47489 [scwf] update and fix conflicts
      7c41d135
  14. Oct 28, 2014
    • [SPARK-3988][SQL] add public API for date type · 47a40f60
      Daoyuan Wang authored
      Add JSON and Python APIs for the date type.
      By using Pickle, `java.sql.Date` is serialized as a calendar and recognized in Python as `datetime.datetime`.
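
      A brief PySpark sketch (the schema and data are illustrative):

      ```python
      import datetime
      from pyspark import SparkContext
      from pyspark.sql import SQLContext, StructType, StructField, StringType, DateType

      sc = SparkContext()
      sqlContext = SQLContext(sc)

      schema = StructType([StructField("name", StringType(), True),
                           StructField("day", DateType(), True)])
      srdd = sqlContext.applySchema(
          sc.parallelize([("release", datetime.date(2014, 10, 28))]), schema)
      # As noted above, date values round-trip through Pickle and may come back as
      # datetime.datetime on the Python side.
      srdd.collect()
      ```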
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #2901 from adrian-wang/spark3988 and squashes the following commits:
      
      c51a24d [Daoyuan Wang] convert datetime to date
      5670626 [Daoyuan Wang] minor line combine
      f760d8e [Daoyuan Wang] fix indent
      444f100 [Daoyuan Wang] fix a typo
      1d74448 [Daoyuan Wang] fix scala style
      8d7dd22 [Daoyuan Wang] add json and python api for date type
      47a40f60
  15. Oct 24, 2014
    • [SPARK-4051] [SQL] [PySpark] Convert Row into dictionary · d60a9d44
      Davies Liu authored
      Added a method to Row to turn row into dict:
      
      ```
      >>> row = Row(a=1)
      >>> row.asDict()
      {'a': 1}
      ```
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #2896 from davies/dict and squashes the following commits:
      
      8d97366 [Davies Liu] convert Row into dict
      d60a9d44
  16. Oct 11, 2014
    • [SPARK-3909][PySpark][Doc] A corrupted format in Sphinx documents and building warnings · 7a3f589e
      cocoatomo authored
      The Sphinx documents contain corrupted reST formatting and produce some warnings.

      The purpose of this issue is the same as https://issues.apache.org/jira/browse/SPARK-3773.
      
      commit: 0e8203f4
      
      output
      ```
      $ cd ./python/docs
      $ make clean html
      rm -rf _build/*
      sphinx-build -b html -d _build/doctrees   . _build/html
      Making output directory...
      Running Sphinx v1.2.3
      loading pickled environment... not yet created
      building [html]: targets for 4 source files that are out of date
      updating environment: 4 added, 0 changed, 0 removed
      reading sources... [100%] pyspark.sql
      /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/feature.py:docstring of pyspark.mllib.feature.Word2VecModel.findSynonyms:4: WARNING: Field list ends without a blank line; unexpected unindent.
      /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/feature.py:docstring of pyspark.mllib.feature.Word2VecModel.transform:3: WARNING: Field list ends without a blank line; unexpected unindent.
      /Users/<user>/MyRepos/Scala/spark/python/pyspark/sql.py:docstring of pyspark.sql:4: WARNING: Bullet list ends without a blank line; unexpected unindent.
      looking for now-outdated files... none found
      pickling environment... done
      checking consistency... done
      preparing documents... done
      writing output... [100%] pyspark.sql
      writing additional files... (12 module code pages) _modules/index search
      copying static files... WARNING: html_static_path entry u'/Users/<user>/MyRepos/Scala/spark/python/docs/_static' does not exist
      done
      copying extra files... done
      dumping search index... done
      dumping object inventory... done
      build succeeded, 4 warnings.
      
      Build finished. The HTML pages are in _build/html.
      ```
      
      Author: cocoatomo <cocoatomo77@gmail.com>
      
      Closes #2766 from cocoatomo/issues/3909-sphinx-build-warnings and squashes the following commits:
      
      2c7faa8 [cocoatomo] [SPARK-3909][PySpark][Doc] A corrupted format in Sphinx documents and building warnings
      7a3f589e
  17. Oct 08, 2014
    • [SPARK-3713][SQL] Uses JSON to serialize DataType objects · a42cc08d
      Cheng Lian authored
      This PR uses JSON instead of `toString` to serialize `DataType`s. The latter is not only hard to parse but also flaky in many cases.

      Since we already write schema information to Parquet metadata in the old style, we have to keep the old `DataType` parser and ensure backward compatibility. The old parser is now renamed to `CaseClassStringParser` and moved into `object DataType`.
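
      A small sketch of the round trip from PySpark; `_parse_datatype_json_string` is an internal helper of this era, and its exact name is assumed here:

      ```python
      from pyspark.sql import StructType, StructField, IntegerType, StringType
      from pyspark.sql import _parse_datatype_json_string  # internal helper (name assumed)

      schema = StructType([StructField("a", IntegerType(), True),
                           StructField("b", StringType(), True)])
      json_str = schema.json()    # JSON wire format instead of the old toString output
      assert _parse_datatype_json_string(json_str) == schema
      ```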
      
      JoshRosen davies Please help review PySpark related changes, thanks!
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #2563 from liancheng/datatype-to-json and squashes the following commits:
      
      fc92eb3 [Cheng Lian] Reverts debugging code, simplifies primitive type JSON representation
      438c75f [Cheng Lian] Refactors PySpark DataType JSON SerDe per comments
      6b6387b [Cheng Lian] Removes debugging code
      6a3ee3a [Cheng Lian] Addresses per review comments
      dc158b5 [Cheng Lian] Addresses PEP8 issues
      99ab4ee [Cheng Lian] Adds compatibility est case for Parquet type conversion
      a983a6c [Cheng Lian] Adds PySpark support
      f608c6e [Cheng Lian] De/serializes DataType objects from/to JSON
      a42cc08d
  19. Oct 06, 2014
    • [SPARK-2461] [PySpark] Add a toString method to GeneralizedLinearModel · 20ea54cc
      Sandy Ryza authored
      Add a toString method to GeneralizedLinearModel, and change `__str__` to `__repr__` for some classes to provide a better message in repr.
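
      A quick usage sketch (a tiny made-up dataset; the printed format is approximate):

      ```python
      from pyspark import SparkContext
      from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

      sc = SparkContext()
      data = sc.parallelize([LabeledPoint(0.0, [0.0]), LabeledPoint(1.0, [1.0])])
      model = LinearRegressionWithSGD.train(data, iterations=10)
      # repr() now gives a readable summary of the model (weights and intercept).
      print(repr(model))
      ```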
      
      This PR is based on #1388, thanks to sryza!
      
      closes #1388
      
      Author: Sandy Ryza <sandy@cloudera.com>
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2625 from davies/string and squashes the following commits:
      
      3544aad [Davies Liu] fix LinearModel
      0bcd642 [Davies Liu] Merge branch 'sandy-spark-2461' of github.com:sryza/spark
      1ce5c2d [Sandy Ryza] __repr__ back to __str__ in a couple places
      aa9e962 [Sandy Ryza] Switch __str__ to __repr__
      a0c5041 [Sandy Ryza] Add labels back in
      1aa17f5 [Sandy Ryza] Match existing conventions
      fac1bc4 [Sandy Ryza] Fix PEP8 error
      f7b58ed [Sandy Ryza] SPARK-2461. Add a toString method to GeneralizedLinearModel
      20ea54cc
  20. Oct 01, 2014
    • [SPARK-3749] [PySpark] fix bugs in broadcast large closure of RDD · abf588f4
      Davies Liu authored
      1. broadcast is triggered unexpectedly
      2. fd is leaked in the JVM (also leaked in parallelize())
      3. broadcast is not unpersisted in the JVM after the RDD is no longer used.

      cc JoshRosen, sorry for these stupid bugs.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2603 from davies/fix_broadcast and squashes the following commits:
      
      080a743 [Davies Liu] fix bugs in broadcast large closure of RDD
      abf588f4
  21. Sep 30, 2014
    • [SPARK-3478] [PySpark] Profile the Python tasks · c5414b68
      Davies Liu authored
      This patch adds profiling support for PySpark. It will show the profiling results before the driver exits; here is one example:
      
      ```
      ============================================================
      Profile of RDD<id=3>
      ============================================================
               5146507 function calls (5146487 primitive calls) in 71.094 seconds
      
         Ordered by: internal time, cumulative time
      
         ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        5144576   68.331    0.000   68.331    0.000 statcounter.py:44(merge)
             20    2.735    0.137   71.071    3.554 statcounter.py:33(__init__)
             20    0.017    0.001    0.017    0.001 {cPickle.dumps}
           1024    0.003    0.000    0.003    0.000 t.py:16(<lambda>)
             20    0.001    0.000    0.001    0.000 {reduce}
             21    0.001    0.000    0.001    0.000 {cPickle.loads}
             20    0.001    0.000    0.001    0.000 copy_reg.py:95(_slotnames)
             41    0.001    0.000    0.001    0.000 serializers.py:461(read_int)
             40    0.001    0.000    0.002    0.000 serializers.py:179(_batched)
             62    0.000    0.000    0.000    0.000 {method 'read' of 'file' objects}
             20    0.000    0.000   71.072    3.554 rdd.py:863(<lambda>)
             20    0.000    0.000    0.001    0.000 serializers.py:198(load_stream)
          40/20    0.000    0.000   71.072    3.554 rdd.py:2093(pipeline_func)
             41    0.000    0.000    0.002    0.000 serializers.py:130(load_stream)
             40    0.000    0.000   71.072    1.777 rdd.py:304(func)
             20    0.000    0.000   71.094    3.555 worker.py:82(process)
      ```
      
      Also, users can show the profile results manually via `sc.show_profiles()` or dump them to disk via `sc.dump_profiles(path)`, for example:
      
      ```python
      >>> sc._conf.set("spark.python.profile", "true")
      >>> rdd = sc.parallelize(range(100)).map(str)
      >>> rdd.count()
      100
      >>> sc.show_profiles()
      ============================================================
      Profile of RDD<id=1>
      ============================================================
               284 function calls (276 primitive calls) in 0.001 seconds
      
         Ordered by: internal time, cumulative time
      
         ncalls  tottime  percall  cumtime  percall filename:lineno(function)
              4    0.000    0.000    0.000    0.000 serializers.py:198(load_stream)
              4    0.000    0.000    0.000    0.000 {reduce}
           12/4    0.000    0.000    0.001    0.000 rdd.py:2092(pipeline_func)
              4    0.000    0.000    0.000    0.000 {cPickle.loads}
              4    0.000    0.000    0.000    0.000 {cPickle.dumps}
            104    0.000    0.000    0.000    0.000 rdd.py:852(<genexpr>)
              8    0.000    0.000    0.000    0.000 serializers.py:461(read_int)
             12    0.000    0.000    0.000    0.000 rdd.py:303(func)
      ```
      Profiling is disabled by default and can be enabled with "spark.python.profile=true".

      Also, users can have the results dumped to disk automatically for later analysis by setting "spark.python.profile.dump=path_to_dump".
      
      This is bugfix of #2351 cc JoshRosen
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2556 from davies/profiler and squashes the following commits:
      
      e68df5a [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
      858e74c [Davies Liu] compatitable with python 2.6
      7ef2aa0 [Davies Liu] bugfix, add tests for show_profiles and dump_profiles()
      2b0daf2 [Davies Liu] fix docs
      7a56c24 [Davies Liu] bugfix
      cba9463 [Davies Liu] move show_profiles and dump_profiles to SparkContext
      fb9565b [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
      116d52a [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
      09d02c3 [Davies Liu] Merge branch 'master' into profiler
      c23865c [Davies Liu] Merge branch 'master' into profiler
      15d6f18 [Davies Liu] add docs for two configs
      dadee1a [Davies Liu] add docs string and clear profiles after show or dump
      4f8309d [Davies Liu] address comment, add tests
      0a5b6eb [Davies Liu] fix Python UDF
      4b20494 [Davies Liu] add profile for python
      c5414b68
  22. Sep 27, 2014
    • [SPARK-3681] [SQL] [PySpark] fix serialization of List and Map in SchemaRDD · 0d8cdf0e
      Davies Liu authored
      Currently, the schema of objects in ArrayType or MapType is attached lazily; this gives better performance but introduces issues during serialization or when accessing nested objects.

      This patch applies the schema to the objects of ArrayType or MapType immediately when they are accessed. It will be a little bit slower, but much more robust.
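
      A short illustration of the nested access this makes robust (the data is made up):

      ```python
      from pyspark import SparkContext
      from pyspark.sql import SQLContext

      sc = SparkContext()
      sqlContext = SQLContext(sc)

      srdd = sqlContext.jsonRDD(
          sc.parallelize(['{"a": [{"b": 1}, {"b": 2}], "m": {"k": 3}}']))
      row = srdd.first()
      # Nested elements now carry their schema as soon as they are accessed, so they
      # can be read (and re-serialized) safely.
      row.a[0].b    # -> 1
      row.m["k"]    # -> 3
      ```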
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2526 from davies/nested and squashes the following commits:
      
      2399ae5 [Davies Liu] fix serialization of List and Map in SchemaRDD
      0d8cdf0e
  23. Sep 26, 2014
    • Revert "[SPARK-3478] [PySpark] Profile the Python tasks" · f872e4fb
      Josh Rosen authored
      This reverts commit 1aa549ba.
      f872e4fb
    • [SPARK-3478] [PySpark] Profile the Python tasks · 1aa549ba
      Davies Liu authored
      This patch adds profiling support for PySpark. It will show the profiling results before the driver exits; here is one example:
      
      ```
      ============================================================
      Profile of RDD<id=3>
      ============================================================
               5146507 function calls (5146487 primitive calls) in 71.094 seconds
      
         Ordered by: internal time, cumulative time
      
         ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        5144576   68.331    0.000   68.331    0.000 statcounter.py:44(merge)
             20    2.735    0.137   71.071    3.554 statcounter.py:33(__init__)
             20    0.017    0.001    0.017    0.001 {cPickle.dumps}
           1024    0.003    0.000    0.003    0.000 t.py:16(<lambda>)
             20    0.001    0.000    0.001    0.000 {reduce}
             21    0.001    0.000    0.001    0.000 {cPickle.loads}
             20    0.001    0.000    0.001    0.000 copy_reg.py:95(_slotnames)
             41    0.001    0.000    0.001    0.000 serializers.py:461(read_int)
             40    0.001    0.000    0.002    0.000 serializers.py:179(_batched)
             62    0.000    0.000    0.000    0.000 {method 'read' of 'file' objects}
             20    0.000    0.000   71.072    3.554 rdd.py:863(<lambda>)
             20    0.000    0.000    0.001    0.000 serializers.py:198(load_stream)
          40/20    0.000    0.000   71.072    3.554 rdd.py:2093(pipeline_func)
             41    0.000    0.000    0.002    0.000 serializers.py:130(load_stream)
             40    0.000    0.000   71.072    1.777 rdd.py:304(func)
             20    0.000    0.000   71.094    3.555 worker.py:82(process)
      ```
      
      Also, users can show the profile results manually via `sc.show_profiles()` or dump them to disk via `sc.dump_profiles(path)`, for example:
      
      ```python
      >>> sc._conf.set("spark.python.profile", "true")
      >>> rdd = sc.parallelize(range(100)).map(str)
      >>> rdd.count()
      100
      >>> sc.show_profiles()
      ============================================================
      Profile of RDD<id=1>
      ============================================================
               284 function calls (276 primitive calls) in 0.001 seconds
      
         Ordered by: internal time, cumulative time
      
         ncalls  tottime  percall  cumtime  percall filename:lineno(function)
              4    0.000    0.000    0.000    0.000 serializers.py:198(load_stream)
              4    0.000    0.000    0.000    0.000 {reduce}
           12/4    0.000    0.000    0.001    0.000 rdd.py:2092(pipeline_func)
              4    0.000    0.000    0.000    0.000 {cPickle.loads}
              4    0.000    0.000    0.000    0.000 {cPickle.dumps}
            104    0.000    0.000    0.000    0.000 rdd.py:852(<genexpr>)
              8    0.000    0.000    0.000    0.000 serializers.py:461(read_int)
             12    0.000    0.000    0.000    0.000 rdd.py:303(func)
      ```
      Profiling is disabled by default and can be enabled with "spark.python.profile=true".

      Also, users can have the results dumped to disk automatically for later analysis by setting "spark.python.profile.dump=path_to_dump".
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2351 from davies/profiler and squashes the following commits:
      
      7ef2aa0 [Davies Liu] bugfix, add tests for show_profiles and dump_profiles()
      2b0daf2 [Davies Liu] fix docs
      7a56c24 [Davies Liu] bugfix
      cba9463 [Davies Liu] move show_profiles and dump_profiles to SparkContext
      fb9565b [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
      116d52a [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
      09d02c3 [Davies Liu] Merge branch 'master' into profiler
      c23865c [Davies Liu] Merge branch 'master' into profiler
      15d6f18 [Davies Liu] add docs for two configs
      dadee1a [Davies Liu] add docs string and clear profiles after show or dump
      4f8309d [Davies Liu] address comment, add tests
      0a5b6eb [Davies Liu] fix Python UDF
      4b20494 [Davies Liu] add profile for python
      1aa549ba
  24. Sep 19, 2014
    • [SPARK-3592] [SQL] [PySpark] support applySchema to RDD of Row · a95ad99e
      Davies Liu authored
      Fix the issue when calling applySchema() on an RDD of Row.
      
      Also add type mapping for BinaryType.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2448 from davies/row and squashes the following commits:
      
      dd220cf [Davies Liu] fix test
      3f3f188 [Davies Liu] add more test
      f559746 [Davies Liu] add tests, fix serialization
      9688fd2 [Davies Liu] support applySchema to RDD of Row
      a95ad99e
  25. Sep 18, 2014
    • [SPARK-3554] [PySpark] use broadcast automatically for large closure · e77fa81a
      Davies Liu authored
      Py4J cannot handle large strings efficiently, so we should use broadcast for large closures automatically. (Broadcast uses the local filesystem to pass the data through.)
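
      An illustrative sketch of the idea only (not Spark's actual internals; the threshold is an assumption): if the pickled command is large, ship it through a broadcast variable instead of a Py4J string.

      ```python
      import cPickle as pickle   # Python 2, matching this era

      LARGE_CLOSURE = 1 << 20    # hypothetical 1 MB cutoff

      def ship_command(sc, command):
          """Return either (broadcast, None) or (None, pickled_bytes)."""
          pickled = pickle.dumps(command, 2)
          if len(pickled) > LARGE_CLOSURE:
              # Broadcast moves the bytes through the local filesystem, which Py4J
              # handles far better than a huge in-memory string.
              return sc.broadcast(pickled), None
          return None, pickled
      ```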
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2417 from davies/command and squashes the following commits:
      
      fbf4e97 [Davies Liu] bugfix
      aefd508 [Davies Liu] use broadcast automatically for large closure
      e77fa81a
  26. Sep 16, 2014
    • [SPARK-3430] [PySpark] [Doc] generate PySpark API docs using Sphinx · ec1adecb
      Davies Liu authored
      Using Sphinx to generate API docs for PySpark.
      
      requirement: Sphinx
      
      ```
      $ cd python/docs/
      $ make html
      ```
      
      The generated API docs will be located at python/docs/_build/html/index.html
      
      It can co-exist with the docs generated by Epydoc.

      This is the first working version; after it is merged in, we can continue to improve it and eventually replace Epydoc.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2292 from davies/sphinx and squashes the following commits:
      
      425a3b1 [Davies Liu] cleanup
      1573298 [Davies Liu] move docs to python/docs/
      5fe3903 [Davies Liu] Merge branch 'master' into sphinx
      9468ab0 [Davies Liu] fix makefile
      b408f38 [Davies Liu] address all comments
      e2ccb1b [Davies Liu] update name and version
      9081ead [Davies Liu] generate PySpark API docs using Sphinx
      ec1adecb
    • [SPARK-2314][SQL] Override collect and take in python library, and count in... · 8e7ae477
      Aaron Staple authored
      [SPARK-2314][SQL] Override collect and take in python library, and count in java library, with optimized versions.
      
      SchemaRDD overrides RDD functions, including collect, count, and take, with optimized versions making use of the query optimizer. The Java and Python interface classes wrapping SchemaRDD need to ensure the optimized versions are called as well. This patch overrides the relevant calls in the Python and Java interfaces with optimized versions.

      Adds a new Row serialization pathway between Python and Java, based on JList[Array[Byte]] versus the existing RDD[Array[Byte]]. I wasn't overjoyed about doing this, but I noticed that some QueryPlans implement optimizations in executeCollect(), which outputs an Array[Row] rather than the typical RDD[Row] that can be shipped to Python using the existing serialization code. To me it made sense to ship the Array[Row] over to Python directly instead of converting it back to an RDD[Row] just for the purpose of sending the Rows to Python using the existing serialization code.
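
      A minimal usage sketch; the point is that these calls now go through SchemaRDD's optimized paths rather than a plain Python scan:

      ```python
      from pyspark import SparkContext
      from pyspark.sql import SQLContext, Row

      sc = SparkContext()
      sqlContext = SQLContext(sc)
      people = sqlContext.inferSchema(
          sc.parallelize([Row(name="Alice", age=1), Row(name="Bob", age=2)]))

      people.count()     # forwarded to the optimized JVM-side count
      people.take(1)     # served via the new JList[Array[Byte]] serialization path
      people.collect()
      ```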
      
      Author: Aaron Staple <aaron.staple@gmail.com>
      
      Closes #1592 from staple/SPARK-2314 and squashes the following commits:
      
      89ff550 [Aaron Staple] Merge with master.
      6bb7b6c [Aaron Staple] Fix typo.
      b56d0ac [Aaron Staple] [SPARK-2314][SQL] Override count in JavaSchemaRDD, forwarding to SchemaRDD's count.
      0fc9d40 [Aaron Staple] Fix comment typos.
      f03cdfa [Aaron Staple] [SPARK-2314][SQL] Override collect and take in sql.py, forwarding to SchemaRDD's collect.
      8e7ae477
    • [SPARK-3519] add distinct(n) to PySpark · 9d5fa763
      Matthew Farrellee authored
      Added missing rdd.distinct(numPartitions) and associated tests
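
      A quick sketch of the added overload:

      ```python
      from pyspark import SparkContext

      sc = SparkContext()
      rdd = sc.parallelize([1, 1, 2, 3, 3, 3])
      rdd.distinct(2).collect()   # numPartitions can now be passed, matching the Scala API
      ```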
      
      Author: Matthew Farrellee <matt@redhat.com>
      
      Closes #2383 from mattf/SPARK-3519 and squashes the following commits:
      
      30b837a [Matthew Farrellee] Combine test cases to save on JVM startups
      6bc4a2c [Matthew Farrellee] [SPARK-3519] add distinct(n) to SchemaRDD in PySpark
      7a17f2b [Matthew Farrellee] [SPARK-3519] add distinct(n) to PySpark
      9d5fa763
  27. Sep 12, 2014
    • [SPARK-3500] [SQL] use JavaSchemaRDD as SchemaRDD._jschema_rdd · 885d1621
      Davies Liu authored
      Currently, SchemaRDD._jschema_rdd is a SchemaRDD, so the Scala API (coalesce(), repartition()) cannot easily be called from Python: there is no way to specify the implicit parameter `ord`. The _jrdd is a JavaRDD, so _jschema_rdd should also be a JavaSchemaRDD.

      In this patch, _jschema_rdd is changed to a JavaSchemaRDD, and an assert is added for it. If a method is missing from JavaSchemaRDD, it is called via _jschema_rdd.baseSchemaRDD().xxx().
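
      A short sketch of what this unblocks from Python (assuming the JSON data below):

      ```python
      from pyspark import SparkContext
      from pyspark.sql import SQLContext

      sc = SparkContext()
      sqlContext = SQLContext(sc)

      srdd = sqlContext.jsonRDD(sc.parallelize(['{"a": 1}', '{"a": 2}']))
      # These now forward to the JavaSchemaRDD, so no implicit `ord` parameter is needed.
      srdd.coalesce(1).collect()
      srdd.repartition(4).collect()
      ```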
      
      BTW, Do we need JavaSQLContext?
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2369 from davies/fix_schemardd and squashes the following commits:
      
      abee159 [Davies Liu] use JavaSchemaRDD as SchemaRDD._jschema_rdd
      885d1621
  28. Sep 08, 2014
    • [SPARK-3417] Use new-style classes in PySpark · 939a322c
      Matthew Rocklin authored
      Tiny PR making SQLContext a new-style class. This allows various type logic to work more effectively.
      
      ```Python
      In [1]: import pyspark
      
      In [2]: pyspark.sql.SQLContext.mro()
      Out[2]: [pyspark.sql.SQLContext, object]
      ```
      
      Author: Matthew Rocklin <mrocklin@gmail.com>
      
      Closes #2288 from mrocklin/sqlcontext-new-style-class and squashes the following commits:
      
      4aadab6 [Matthew Rocklin] update other old-style classes
      a2dc02f [Matthew Rocklin] pyspark.sql.SQLContext is new-style class
      939a322c
  29. Sep 06, 2014
    • [SPARK-2334] fix AttributeError when call PipelineRDD.id() · 110fb8b2
      Davies Liu authored
      The underlying JavaRDD for a PipelineRDD is created lazily; its creation is delayed until _jrdd is called.

      The id of the JavaRDD is cached as `_id`, which saves an RPC call to Py4J on later calls.
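
      A tiny reproduction of the fixed call (before this patch it raised AttributeError because the JavaRDD had not been created yet):

      ```python
      from pyspark import SparkContext

      sc = SparkContext()
      rdd = sc.parallelize(range(10)).map(lambda x: x + 1)   # a pipelined RDD
      rdd.id()   # now forces _jrdd lazily and caches the result as _id
      ```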
      
      closes #1276
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2296 from davies/id and squashes the following commits:
      
      e197958 [Davies Liu] fix style
      9721716 [Davies Liu] fix id of PipelineRDD
      110fb8b2
    • Spark-3406 add a default storage level to python RDD persist API · da35330e
      Holden Karau authored
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #2280 from holdenk/SPARK-3406-Python-RDD-persist-api-does-not-have-default-storage-level and squashes the following commits:
      
      33eaade [Holden Karau] As Josh pointed out, sql also override persist. Make persist behave the same as in the underlying RDD as well
      e658227 [Holden Karau] Fix the test I added
      e95a6c5 [Holden Karau] The Python persist function did not have a default storageLevel unlike the Scala API. Noticed this issue because we got a bug report back from the book where we had documented it as if it was the same as the Scala API
      da35330e
  31. Sep 03, 2014
    • [SPARK-3335] [SQL] [PySpark] support broadcast in Python UDF · c5cbc492
      Davies Liu authored
      After this patch, broadcast variables can be used in Python UDFs.
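
      A minimal sketch of the enabled pattern (table, function, and data names are made up; API names follow the PySpark of this era):

      ```python
      from pyspark import SparkContext
      from pyspark.sql import SQLContext, Row, IntegerType

      sc = SparkContext()
      sqlContext = SQLContext(sc)

      lookup = sc.broadcast({"a": 1, "b": 2})
      # The UDF closes over the broadcast variable and reads it on the executors.
      sqlContext.registerFunction("lookup", lambda k: lookup.value.get(k, -1), IntegerType())

      people = sqlContext.inferSchema(sc.parallelize([Row(name="a"), Row(name="b")]))
      people.registerTempTable("people")
      sqlContext.sql("SELECT lookup(name) FROM people").collect()
      ```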
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2243 from davies/udf_broadcast and squashes the following commits:
      
      7b88861 [Davies Liu] support broadcast in UDF
      c5cbc492
    • [SPARK-3309] [PySpark] Put all public API in __all__ · 6481d274
      Davies Liu authored
      Put all public APIs in __all__ and also in pyspark.__init__.py, so that we can get the documentation for the public API via `pydoc pyspark`. It can also be used by other programs (such as Sphinx or Epydoc) to generate documentation only for public APIs.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2205 from davies/public and squashes the following commits:
      
      c6c5567 [Davies Liu] fix message
      f7b35be [Davies Liu] put SchemeRDD, Row in pyspark.sql module
      7e3016a [Davies Liu] add __all__ in mllib
      6281b48 [Davies Liu] fix doc for SchemaRDD
      6caab21 [Davies Liu] add public interfaces into pyspark.__init__.py
      6481d274
  32. Aug 26, 2014
    • [SPARK-2969][SQL] Make ScalaReflection be able to handle... · 98c2bb0b
      Takuya UESHIN authored
      [SPARK-2969][SQL] Make ScalaReflection be able to handle ArrayType.containsNull and MapType.valueContainsNull.
      
      Make `ScalaReflection` able to handle mappings like the following (a short PySpark note follows the list):
      
      - `Seq[Int]` as `ArrayType(IntegerType, containsNull = false)`
      - `Seq[java.lang.Integer]` as `ArrayType(IntegerType, containsNull = true)`
      - `Map[Int, Long]` as `MapType(IntegerType, LongType, valueContainsNull = false)`
      - `Map[Int, java.lang.Long]` as `MapType(IntegerType, LongType, valueContainsNull = true)`
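
      On the Python side, the matching default change (noted in the commit log below) looks roughly like this:

      ```python
      from pyspark.sql import ArrayType, IntegerType

      ArrayType(IntegerType())          # containsNull now defaults to True
      ArrayType(IntegerType(), False)   # opt out explicitly when elements are never null
      ```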
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #1889 from ueshin/issues/SPARK-2969 and squashes the following commits:
      
      24f1c5c [Takuya UESHIN] Change the default value of ArrayType.containsNull to true in Python API.
      79f5b65 [Takuya UESHIN] Change the default value of ArrayType.containsNull to true in Java API.
      7cd1a7a [Takuya UESHIN] Fix json test failures.
      2cfb862 [Takuya UESHIN] Change the default value of ArrayType.containsNull to true.
      2f38e61 [Takuya UESHIN] Revert the default value of MapTypes.valueContainsNull.
      9fa02f5 [Takuya UESHIN] Fix a test failure.
      1a9a96b [Takuya UESHIN] Modify ScalaReflection to handle ArrayType.containsNull and MapType.valueContainsNull.
      98c2bb0b